TRAJECTORY-BASED POINT OF INTEREST
RECOMMENDATION
by
Geoffrey Benjamin Zenger
B.Sc. (Hons. First Class), Simon Fraser University, 2007
a Thesis submitted in partial fulfillment
of the requirements for the degree of
Master of Science
in the School
of
Computing Science
© Geoffrey Benjamin Zenger 2009
SIMON FRASER UNIVERSITY
Fall 2009
All rights reserved. This work may not be
reproduced in whole or in part, by photocopy
or other means, without the permission of the author.
Last revision: Spring 09
Declaration of Partial Copyright Licence The author, whose copyright is declared on the title page of this work, has granted to Simon Fraser University the right to lend this thesis, project or extended essay to users of the Simon Fraser University Library, and to make partial or single copies only for such users or in response to a request from the library of any other university, or other educational institution, on its own behalf or for one of its users.
The author has further granted permission to Simon Fraser University to keep or make a digital copy for use in its circulating collection (currently available to the public at the “Institutional Repository” link of the SFU Library website <www.lib.sfu.ca> at: <http://ir.lib.sfu.ca/handle/1892/112>) and, without changing the content, to translate the thesis/project or extended essays, if technically possible, to any medium or format for the purpose of preservation of the digital work.
The author has further agreed that permission for multiple copying of this work for scholarly purposes may be granted by either the author or the Dean of Graduate Studies.
It is understood that copying or publication of this work for financial gain shall not be allowed without the author’s written permission.
Permission for public performance, or limited permission for private scholarly use, of any multimedia materials forming part of this work, may have been granted by the author. This information may be found on the separately catalogued multimedia material and in the signed Partial Copyright Licence.
While licensing SFU to permit the above uses, the author retains copyright in the thesis, project or extended essays, including the right to change the work for subsequent purposes, including editing and publishing the work in whole or in part, and licensing other parties, as the author may desire.
The original Partial Copyright Licence attesting to these terms, and signed by this author, may be found in the original bound copy of this work, retained in the Simon Fraser University Archive.
Simon Fraser University Library Burnaby, BC, Canada
Abstract
Existing point of interest (POI) recommendation systems for mobile users only consider a
user’s present spatio-temporal location, and do not utilize a user’s trajectory history. In this
thesis, we identify some essential requirements for a mobile trajectory-based recommenda-
tion system, and present a new framework for trajectory-based POI recommendation. We
construct a k-truncated generalized suffix tree to represent a historical trajectory database,
and use it to execute exact matching recommendation queries. In addition to individual
points of interest, we can recommend generalizations of POIs by using density estimation.
We also consider extensions of our framework. Two variants are developed, allowing for the
execution of fuzzy matching and order-flexible queries. Furthermore, a technique for diver-
sifying recommendations is presented. The resulting system can efficiently and accurately
predict a user’s next visited point given a query, and is demonstrated to be effective and
scalable on two real world datasets.
Keywords: trajectory mining; POI recommendation; recommendation systems; fuzzy
matching; order-flexible matching; recommending generalizations
To Brittany
“You’re beginning with an illogical premise and proceeding
perfectly logically to an illogical conclusion.”
— Donald Rumsfeld, 2001
Acknowledgments
I would like to extend my gratitude to my senior supervisor, Dr. Jian Pei, for guiding me
through the last two years of study and research. Through his creativity, energy, and exper-
tise he has given me a great appreciation for academic research and the joy of conducting
original research. I would also like to thank him for his patience even when work and a
medical emergency distracted me from my academic work.
In addition, I want to thank Dr. Qianping Gu for agreeing to serve on my committee
and to Dr. Joseph Peters for his willingness to serve as one of my supervisors. Through
the courses I took with him and numerous discussions held in his office, Dr. Peters played
an instrumental role in teaching me that research can, and in fact should, be a fun and
enjoyable endeavour.
I would like to thank Michael Tsumura, Ivailo Ivanov, Nebojsa Stefanovic, and everybody
else that I have worked with and worked for at SAP Business Objects for their flexibility
that has allowed me to attend courses at SFU and write this thesis.
I would like to thank my friends and family for supporting my decision to continue my
studies and graciously endure an additional two years in which I had little free time.
Finally, I want to express my gratitude to Brittany, who has cared for me, provided support,
and has graciously served as a sounding board for my research ideas.
Contents
Approval ii
Abstract iii
Dedication iv
Quotation v
Acknowledgments vi
Contents vii
List of Tables xi
List of Figures xii
List of Algorithms xiv
1 Introduction 1
1.1 Main Contributions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
1.2 Outline . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
2 Related Work 5
2.1 Mobile Recommendation Systems . . . . . . . . . . . . . . . . . . . . . . . . . 5
2.2 Collaborative Filtering . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
2.3 Trajectory Mining . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
3 Problem Description 10
3.1 Definitions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
3.2 Problem Definition . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
3.3 Recommendation System Requirements . . . . . . . . . . . . . . . . . . . . . 14
3.3.1 Quantifiability of Confidence . . . . . . . . . . . . . . . . . . . . . . . 15
3.3.2 On-line Recommendation Capability . . . . . . . . . . . . . . . . . . . 15
3.3.3 Scalability . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16
3.3.4 Generalization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16
3.3.5 Diversity . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16
3.3.6 Fuzziness . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
3.3.7 Order-Flexibility . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
3.3.8 Personalization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18
3.4 Satisfying the Requirements . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18
4 Exact Matching 20
4.1 Exact Matching . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21
4.1.1 Naïve Approach . . . . . . . . . . . . . . . . . . . . . . . . . . . 21
4.2 Accounting for POI Similarity . . . . . . . . . . . . . . . . . . . . . . . . . . . 22
4.2.1 Increasing the Confidence of Each Similar POI . . . . . . . . . . . . . 23
4.2.2 Density Estimation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24
4.2.3 Recommending Generalizations . . . . . . . . . . . . . . . . . . . . . . 26
4.3 Spatio-specific Generalized Recommendations . . . . . . . . . . . . . . . . . . 27
4.4 Implementing Exact Matching . . . . . . . . . . . . . . . . . . . . . . . . . . 29
4.4.1 Suffix Trees . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30
4.4.2 k-truncated Generalized Suffix Trees . . . . . . . . . . . . . . . . . . . 32
4.4.3 Computing Point Distance . . . . . . . . . . . . . . . . . . . . . . . . 34
4.4.4 Executing Exact Matching Queries . . . . . . . . . . . . . . . . . . . . 35
4.5 Diversification of Recommendations . . . . . . . . . . . . . . . . . . . . . . . 38
4.6 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40
5 Variants 42
5.1 Fuzzy matching . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42
5.1.1 Implementing Fuzzy Matching . . . . . . . . . . . . . . . . . . . . . . 43
5.2 Order-Flexible Queries . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47
5.2.1 History-Centric Approach . . . . . . . . . . . . . . . . . . . . . . . . . 48
5.2.2 Query-Centric Approach . . . . . . . . . . . . . . . . . . . . . . . . . . 49
5.3 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 52
6 Experimental Results 53
6.1 Datasets . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 53
6.1.1 Dataset Processing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 53
6.1.2 Dataset Statistics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 56
6.2 Evaluating Quality . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 60
6.3 Experimentation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 61
6.3.1 Implementation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 61
6.3.2 Design . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 61
6.3.3 Basic Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 63
6.3.4 Query Length . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 70
6.3.5 Number of Recommendations . . . . . . . . . . . . . . . . . . . . . . . 73
6.3.6 Diversification . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 75
6.3.7 Other Variations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 77
6.3.8 Effects of k-Truncated Suffix Trees . . . . . . . . . . . . . . . . . . . . 79
6.4 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 82
7 Conclusion 83
7.1 Future Directions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 84
7.1.1 Personalization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 84
7.1.2 Parallelizing Matching . . . . . . . . . . . . . . . . . . . . . . . . . . . 85
7.1.3 User Feedback . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 86
7.1.4 Temporal Constraints . . . . . . . . . . . . . . . . . . . . . . . . . . . 86
7.1.5 Continuous Matching . . . . . . . . . . . . . . . . . . . . . . . . . . . 86
7.1.6 Longer Tails . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 87
7.1.7 Other Directions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 88
A Constructing suffix trees 89
A.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 89
A.2 Ukkonen’s Algorithm for Suffix Trees . . . . . . . . . . . . . . . . . . . . . . . 89
A.3 Constructing Generalized Suffix Trees . . . . . . . . . . . . . . . . . . . . . . 92
A.4 Constructing k-truncated Generalized Suffix Trees . . . . . . . . . . . . . . . 92
Bibliography 95
List of Tables
6.1 Dataset Trajectory Information . . . . . . . . . . . . . . . . . . . . . . . . . . 58
6.2 Dataset POI Information . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 58
List of Figures
4.1 Graphical depiction of example . . . . . . . . . . . . . . . . . . . . . . . . . . 22
4.2 Example of Gaussian kernel estimation . . . . . . . . . . . . . . . . . . . . . . 26
4.3 Example illustrating the problem of over-generalization . . . . . . . . . . . 27
4.4 Solving over-generalization problem with grid cells . . . . . . . . . . . . . . . 29
4.5 Suffix tree for the word “mississippi$” . . . . . . . . . . . . . . . . . . . . . . 31
4.6 3-truncated suffix tree for the word “mississippi$”. . . . . . . . . . . . . . . . 33
4.7 Example of concept distance. conceptDistance(x, y) = 3 . . . . . . . . . . . . 34
5.1 Demonstrating the Fuzzy Search Radius around a Trajectory . . . . . . . . . 44
6.1 Datasets used . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 54
6.2 Processed Datasets . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 59
6.3 Concept Hierarchy . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 59
6.4 INFATI Datasets: Weighted Scores vs. Fuzzy matching radius . . . . . . . . . 65
6.5 Trucks Datasets: Weighted Scores vs. Fuzzy matching radius . . . . . . . . . 66
6.6 INFATI Datasets: Unsatisfiable Queries vs. Fuzzy matching radius . . . . . . 66
6.7 Trucks Datasets: Unsatisfiable Queries vs. Fuzzy matching radius . . . . . . . 67
6.8 INFATI Datasets: Binary Scores vs. Fuzzy matching radius . . . . . . . . . . 67
6.9 Trucks Datasets: Binary Scores vs. Fuzzy matching radius . . . . . . . . . . . 68
6.10 INFATI Datasets: Query Time vs. Fuzzy matching radius . . . . . . . . . . . 68
6.11 Trucks Datasets: Query Time vs. Fuzzy matching radius . . . . . . . . . . . . 69
6.12 INFATI-500: Effect of Query Length on Weighted Score . . . . . . . . . . . . 71
6.13 INFATI-500: Effect of Query Length on Binary Score . . . . . . . . . . . . . 71
6.14 INFATI-500: Effect of Query Length on Query Time . . . . . . . . . . . . . . 72
6.15 INFATI-500: Effect of Query Length on Unsatisfiable Queries . . . . . . . . . 72
6.16 INFATI-500: Effect of the Number of Recommendations on Weighted Score . 74
6.17 INFATI-500: Effect of the Number of Recommendations on Binary Score . . 74
6.18 INFATI-500: Effect of Diversification on Weighted Score . . . . . . . . . . . . 75
6.19 INFATI-500: Effect of Diversification on Binary Score . . . . . . . . . . . . . 76
6.20 INFATI-500: Effect of Spatial Factor on Weighted Score . . . . . . . . . . . . 76
6.21 INFATI-500: Effect of Spatial Factor on Query Time . . . . . . . . . . . . . . 78
6.22 INFATI-500: Effect of Kernel Width on Weighted Score . . . . . . . . . . . . 78
6.23 Effects of Query Length on Suffix Tree Construction Time . . . . . . . . . . . 80
6.24 Effects of Truncation on Suffix Tree Memory Usage . . . . . . . . . . . . . . . 81
6.25 INFATI-500: Effects of Truncation on Query Times . . . . . . . . . . . . . . . 81
A.1 3-truncated suffix tree for the word “mississippi$” . . . . . . . . . . . . . . 93
List of Algorithms
1 Searching Suffix Trees . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36
2 Processing Next Points . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38
3 Algorithm for Diversifying Recommendations . . . . . . . . . . . . . . . . . . . 40
4 Searching Suffix Trees For Fuzzy Matching . . . . . . . . . . . . . . . . . . . . 46
5 Searching Suffix Trees for Order-Flexible Matching . . . . . . . . . . . . . . . . 51
6 Pseudo-code for Dataset Processing . . . . . . . . . . . . . . . . . . . . . . . . 57
7 Ukkonen’s Algorithm (High Level) . . . . . . . . . . . . . . . . . . . . . . . . . 90
8 Modified Ukkonen’s Algorithm for k-truncated suffix trees . . . . . . . . . . 94
Chapter 1
Introduction
Portable GPS devices, cell phones, and other location-aware mobile devices have become
ubiquitous in recent years. These devices are capable of gathering vast quantities of data
regarding a user’s movements. Each user’s movements constitute a trajectory: a sequence of
points, each with a precise time-stamp and location. Although some may view the gathering
and use of this data as an invasion of personal privacy, the availability of this data opens
new avenues for improving the quality of point of interest recommendation systems.
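To make the notion concrete, a trajectory can be sketched as an ordered, time-stamped sequence of points. The representation below is a minimal illustrative sketch only; the names `TrajectoryPoint` and `duration` are ours, and the formal definitions used by the thesis are given in Chapter 3:

```python
from dataclasses import dataclass
from typing import List

@dataclass
class TrajectoryPoint:
    timestamp: float  # seconds since some epoch
    lat: float        # latitude in decimal degrees
    lon: float        # longitude in decimal degrees

# A trajectory is simply an ordered, time-stamped sequence of points.
Trajectory = List[TrajectoryPoint]

def duration(traj: Trajectory) -> float:
    """Elapsed time covered by a trajectory, in seconds."""
    if len(traj) < 2:
        return 0.0
    return traj[-1].timestamp - traj[0].timestamp
```

For example, a two-point trajectory recorded ten minutes apart has a duration of 600 seconds.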
Current mobile point of interest (POI) recommendation systems take into account the
present location of an individual, along with other attributes of the individual, such as age,
sex, and occupation. However, they are unable to incorporate the recent movements of
an individual and knowledge about historical trajectories into the recommendation process.
This thesis addresses the problem of incorporating a user’s current trajectory, as well as
a database of historical trajectories, into the point of interest recommendation process in
order to improve the quality of the returned recommendations.
Imagine yourself visiting a new city, either as a tourist or for business purposes, and
pulling out your cell phone to enable a point of interest recommendation system. The
research presented in this thesis would allow you to query the recommendation system with
your recently travelled trajectory and be presented with an interesting museum to visit, a
restaurant to eat at, and a store to shop at. For example, the system may determine that
after you visited Science World, Stanley Park, and the Planetarium, the place that you
are most likely to want to visit next is the Van Dusen Gardens. Given this knowledge, a
trajectory-based recommendation system could recommend that you visit the gardens, and
presumably pay for itself by charging the gardens a nominal fee to display an advertisement.
Closely related to recommending points of interest to visitors to a new city is recommending
places to visit in a vast museum to people short on time. For example, a trajectory-based
recommendation system could be used in an
art museum, such as the Prado, where the system could recommend that individuals who
had just spent time viewing Zurbaran’s Agnus Dei and El Greco’s Annunciation may want
to view El Greco’s The Knight with His Hand on His Breast next.
Another application of the research in this thesis is to people’s morning commutes.
We can surmise that there will often be a small set of points of interest, including places
such as coffee shops, newsstands, cafes, and convenience stores, that are frequented during
the morning commute by individuals following certain trajectories. Using a trajectory-
based POI recommendation system, we could combine information about a user’s historical
trajectories along with knowledge of other users’ historical trajectories to recommend points
of interest to an individual during their commutes. By using historical trajectory information
from other users, it would be possible for the recommendation system to recommend points
even when the user takes a novel route to work. This information about which points a user
is most likely to want to visit can be used in two ways. The first is to recommend the
point of interest that the user is most likely to want to visit. The second is to sell
advertising to competitors of the top recommended points of interest in the hope of
shifting the user’s preferences. It is possible that a
viable business could be built on the model of giving away location-aware mobile devices to
commuters for free and having them pay for themselves through advertising revenue.
Beyond the realm of recommending points of interest and mobile ad delivery, there
are other potential applications of a trajectory-based POI recommendation system. For
example, it is plausible that it could be used to predict the movements of tracked animals.
Nonetheless, the commuting and tourism applications are the principal motivation for the
research in this thesis, and the methods contained in this thesis have been developed with
these applications in mind.
The problem of trajectory-based POI recommendation is challenging for two primary
reasons. The first principal challenge is that trajectory-based POI recommendation is a
new problem and there are no previously published requirements for how a trajectory-based
POI recommendation system should behave. Previous research into mobile recommendation
systems does not take into account trajectory information. Furthermore, existing systems
tend to be capable of recommending only specific points of interest, and not generalizations
of points of interest. Both of these limitations are addressed by this thesis.
The second principal challenge is that recommendation queries need to be answered
in real time. It is easy to devise methods that do not remain efficient as the number of
previously observed trajectories grows. However, as the goal is to build a system capable
of executing recommendation queries in mere seconds on a mobile device, we need to make
sure that queries can be answered efficiently even when the results are based on a large
historical database.
1.1 Main Contributions
The main contributions of this thesis are:
• The introduction, motivation, and formalization of the trajectory-based POI recom-
mendation problem. Previous research into mobile recommendation systems does not
incorporate a user’s recent trajectory history into the recommendation process.
• The development of a set of desired properties for a useful trajectory-based POI rec-
ommendation system.
• A practical solution to the trajectory-based POI recommendation problem, built upon
the (k-truncated) generalized suffix tree data structure. This system is capable of
answering fuzzy-matching and order-flexible queries in addition to more basic exact-
matching queries. The framework developed is highly configurable, and can be con-
figured to show a wide variety of behaviours.
• An effective approach for recommending generalizations of points of interest in addition
to specific points of interest. Previous research into mobile recommendation systems
always recommends specific points of interest and is not capable of recommending
generalizations.
• An efficient method for ensuring that the recommendations returned for a given query
are diverse. Diversifying the result set is demonstrated to improve the quality of a
query’s recommendations.
• Experimental evidence that trajectory-based POI recommendation can be performed
efficiently on large datasets and generates higher quality recommendations than exist-
ing recommendation systems based only on a user’s current location.
1.2 Outline
• Chapter 1 (Introduction): Motivates the contents of this thesis, describes the main
contributions, and presents this outline.
• Chapter 2 (Related Work): Overview of past research related to the research pursued
in this thesis, and descriptions of how past research differs from work presented in this
thesis.
• Chapter 3 (Problem Description): Presents technical definitions of all terms used in
this thesis, formal definition of the trajectory-based POI recommendation problem, an
overview of requirements for a useful trajectory-based POI recommendation system,
and a description of the specific methods constructed to satisfy these requirements.
• Chapter 4 (Exact Matching): Description of the exact matching problem, naive confi-
dence measure for trajectory-based POI recommendation, two methods for accounting
for POI similarity, a method for recommending generalizations of POIs, a method for
recommending spatio-localized generalizations, a greedy algorithm for diversifying the
set of recommendations returned, overview of the (k-truncated) generalized suffix tree
data structure, and algorithms for executing recommendation queries.
• Chapter 5 (Variants): Motivation for the fuzzy matching and order-flexible matching
variants, formal definition of these variants, description of an efficient algorithm to
execute fuzzy matching queries, description of two options for defining the order-
flexible matching problem, and presentation of an efficient algorithm to execute order-
flexible matching queries.
• Chapter 6 (Experimental Results): Descriptions and visualizations of the datasets
used for experimentation, algorithms for processing datasets, and experimental results
demonstrating the effectiveness of the methods presented in this thesis.
• Chapter 7 (Conclusion): Proposes future research directions, and summarizes the rest
of the thesis.
• Appendix A (Constructing Suffix Trees): Detailed descriptions of the generalized suffix
tree and k-truncated generalized suffix tree data structures, efficient algorithms for
constructing suffix trees, and examples of suffix tree construction.
Chapter 2
Related Work
The existing research related to this thesis can be grouped into two broad categories: mobile
recommendation systems, and trajectory mining. Due to the popularity of collaborative
filtering in recommendation systems, we briefly discuss the concepts and major ideas, though
collaborative filtering is not used in this thesis.
2.1 Mobile Recommendation Systems
One of the first systems that could arguably fall under the term “mobile recommendation
system” was the “Personalized Shopping Assistant (PSA)” device proposed by Asthana et al.
in 1994 [1]. The PSA was a Walkman-sized wireless device that communicated with a
server over a radio-frequency (RF) link to transceivers placed around a store. Through a
simple user interface, it was able to locate items, engage the customer by telling jokes, and,
crucially for our purposes, direct a customer’s attention to new items or to those
“of particular interest to a particular customer.” For example, knowing that a customer
had recently purchased a VCR, it was able to recommend that the customer buy a video.
Furthermore, although this feature does not appear to have been implemented, the authors
proposed that the PSA would be location-aware, able to recommend only those items near
the customer within the expanse of a vast supermarket. Although extremely primitive, the
PSA implemented the basic functionality found in mobile recommendation systems to the
present day.
Moving forward a decade, mobile technology had developed greatly, to the point where
cellular phones and global positioning system (GPS) devices were becoming nearly ubiquitous.
By the early 2000s, a cheap cell phone could perform every function the PSA could, without
the constraint of being tied to a specific store. With this
additional power available, it became possible to add context to recommendations, and with
the popularity of mobile devices it became worthwhile to aggressively research methods for
delivering meaningful recommendations to mobile devices.
In 2004, van Setten et al. [29] developed COMPASS and proposed combining context-
awareness with recommendation systems, such as those discussed in [21]. According to van
Setten et al., context “is any information that can be used to characterize the situation of an
entity,” where an entity is simply any object, place, or individual relevant to the functioning
of the application. Thus, context could include time, day of week, age of a user, physical
location, or car model being driven. Although both context-awareness and recommendation
systems are “used to provide users with relevant information and/or services”, they are
distinguished by the former being based on a user’s context, and the latter being based on
a user’s interests. The goal of [29] is to provide a system unifying these two concepts.
Although COMPASS is a large system composed of many parts, including a user profiler
and a recommendation engine, it is fundamentally a mobile application that proposes point
of interest recommendations based on a user’s present location, the present time, and other
information about the individual, such as the acceptable price range for a dinner. [29] does
not address the particular recommendation process used, but the authors did perform a user
study of 57 individuals, suggesting that users find context-aware recommendations
useful. No trajectory information is used in their recommendation
system, nor is any historical knowledge about the user taken into account.
One aspect of some mobile recommender systems is the idea of critique-based recommendation.
For example, Nguyen and Ricci [22] discussed how allowing a user to critique the
recommendations made, and incorporating these critiques, can improve future
recommendations. Although critique-based feedback is interesting and useful, the work
presented in this thesis does not incorporate such a mechanism.
In 2006, Horozov et al. proposed a system for personalized POI recommendation known
as “Geowhiz” [12], which, like COMPASS, considers a user’s context when making
recommendations, but goes further, explicitly describing techniques to incorporate
the user’s context into the recommendation process. At the core of Geowhiz is an enhanced
collaborative filtering method that works by taking into account a user’s location. It is an
item-based collaborative filtering method that works by first identifying points of interest
near a user’s present location (within a defined radius), and performing collaborative
filtering only on that set of nearby POIs. As with COMPASS, the context considered is a static
snapshot of the user’s present state, and so the user’s location history is not considered. It
is worth noting that [12] includes a number of useful technical insights, such as how to use
“pseudo-users” to bootstrap a recommendation system, as well as how to use “serendipity”
to introduce a small amount of randomness into the recommendation system.
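The candidate-restriction step at the core of Geowhiz can be sketched as follows. This is our own illustrative reconstruction, not code from [12]; the haversine distance formula and the dictionary-based POI representation are assumptions made for the sake of the example:

```python
import math

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two (lat, lon) points, in kilometres."""
    r = 6371.0  # mean Earth radius in km
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dlmb / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))

def nearby_pois(pois, user_lat, user_lon, radius_km):
    """Restrict the candidate set to POIs within radius_km of the user's
    present location; collaborative filtering is then applied to this set only."""
    return [p for p in pois
            if haversine_km(user_lat, user_lon, p["lat"], p["lon"]) <= radius_km]
```

Restricting the candidate set before filtering is what makes the approach location-aware: distant POIs never enter the similarity computation at all.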
Finally, another modern, real-world mobile recommendation system is “CityVoyager”
[27]. Unlike the systems previously discussed, CityVoyager bases its recommendations on
its users’ location history. It does this by identifying its users’ frequent locations, and these
frequent locations are used as input to an item-based collaborative filtering system (see
section 2.2 for a description of item-based collaborative filtering). No user attributes (such
as age or gender) are considered. Although Takeuchi and Sugimoto gauged the quality of
recommendations on a tiny sample of only two users, their results [27] indicate that their
system may be useful. However, once again, the system considers only the present location
of a user, not where he or she is coming from or has been.
2.2 Collaborative Filtering
Collaborative filtering is a technique developed in the 1990s and is found at the root of many
recommendation systems. The two principal categories of collaborative filtering methods
are model-based methods, and memory-based methods [12]. Memory-based (or user-based)
methods, such as the RINGO system [25] work by dynamically computing the relationships
between users each time a query is presented to the system. Historical data for the most
similar users is then used to make a recommendation. Model-based methods, such as item-
based techniques [23], are highly scalable, and work by computing the relationships between
items. They do not require the computation of the relationships between all users on each
query.
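As a concrete illustration of the item-based idea, item-item relationships can be precomputed offline from a ratings matrix, so that no per-query comparison of all users is needed. The sketch below is a minimal, hypothetical example using cosine similarity; it is not the method of [23] verbatim:

```python
import math

def cosine_sim(a, b):
    """Cosine similarity between two rating dicts keyed by user id."""
    common = set(a) & set(b)
    if not common:
        return 0.0
    dot = sum(a[u] * b[u] for u in common)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb)

def item_item_similarities(ratings):
    """Precompute pairwise item similarities from {item: {user: rating}}.
    Doing this offline is what makes item-based methods scalable."""
    items = list(ratings)
    sims = {}
    for i in items:
        for j in items:
            if i < j:
                sims[(i, j)] = cosine_sim(ratings[i], ratings[j])
    return sims
```

At query time, a recommendation for a user reduces to looking up the precomputed similarities for the items that user has already rated, which is far cheaper than recomputing user-user relationships on every query.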
Collaborative filtering remains an active field of research, and although its methods are
not directly used by this thesis, as discussed in section 7.1.1 at the end of this thesis, it would
be an interesting problem to integrate collaborative filtering into the mobile trajectory-based
recommendation system developed in this thesis.
2.3 Trajectory Mining
The goal of this thesis is to devise a framework for a mobile trajectory-aware point of interest
recommendation system. In addition to being built upon research into recommendation
systems, the other field closely related to the content of this thesis is trajectory mining.
Trajectory mining is a very new field of research. There was a modicum of related
research performed in the 1990s, tackling problems such as vehicle classification [4] and
trajectory clustering using regression models [6]. However, these studies approach the
problem of trajectory mining from highly mathematical and statistical stances, respectively.
As a topic of data mining, trajectory mining has only been the subject of
intensive research in the past several years.
Like data mining, research into trajectory mining has tended to focus on the traditional
three pillars of clustering, classification, and pattern mining. For example, Lee, Han, and
Whang [18] introduced a trajectory clustering method known as TRACLUS that uses a
partition-and-group idea to cluster trajectories and generate representative trajectories for
these clusters. Another line of research has been to investigate incremental trajectory
clustering methods, such as those developed by Elnekave et al. [3]. Related to this research into
trajectory clustering methods is convoy discovery based on a method of trajectory simpli-
fication [15], where convoys are sets of trajectories that are density-connected during some
time interval.
Regarding trajectory classification, Lee et al. [17] presented the “TraClass” algorithm
to classify trajectories. The features for the classifier are discovered by performing a region-
based clustering of the trajectories, followed by a trajectory similarity-based clustering step.
Among the applications in mind for this direction of research is to classify whether a boat is
an oil-tanker, a tugboat, a fishing-boat, and so on, and another application is to classify an
animal given its historical trajectories. While interesting, this research is not particularly
relevant to the problem and methods in this thesis.
Giannotti et al. [7] addressed the problem of trajectory pattern mining by using a
“region-of-interest” approach to find trajectories moving between regions of interest. Their
approach to pattern detection is spatial: in a pre-processing phase, each trajectory is reduced
to a sequence of regions of interest. In their approach, temporal differences between visits
matter, but the exact time of a visit does not, and it is not possible for the order
of points in trajectories to be swapped. Similarly, spatial regions matter, but not specific
locations. The work in [7] is relevant to the research contained in this thesis, but the problem tackled is
different, and for our purposes suffers from the serious limitation of only considering regions
and not specific locations. Finally, Gidofalvi and Pedersen [9] mined long trajectories of
moving objects and showed how to identify trips using an SQL-based implementation.
Highly relevant to the problem addressed by this thesis is the research done by Zheng et
al. [31] on mining interesting locations and travel sequences from GPS trajectories. Using
ideas from the HITS (hypertext induced topic search) model developed by Jon Kleinberg
[16], Zheng et al. [31] used a HITS-based inference model to find locations and trajectories
that could be recommended. In particular, they treated users as hubs, and locations as
authorities, and this is used to compute the interest of each location. A very useful appli-
cation of this research would be to devise tour plans for cities, as the methods described
could determine popular tour routings from GPS trajectory data. However, this method
does not allow for queries to be executed of the form “given my historical trajectory Q,
where should I visit next?” which are the main focus of this thesis. In addition, it is worth
noting that their methods incorporate no collaborative filtering aspect and thus do not take
into account any knowledge about the users of the system.
Finally, the line of research perhaps most similar to this thesis is that of Frentzos et al.
[5] on nearest-neighbour searches on moving object databases. One possible approach to the
problem tackled in this thesis would be to find the k nearest neighbours to a query trajectory
and to use them to determine the optimal next points to recommend the querying user to
visit. This is conceptually similar to the fuzzy matching method proposed in chapter 5,
although Frentzos et al. discuss only how to find similar trajectories and do not address the
recommendation process. Whereas the methods in this thesis are generally tied only to the
order of points visited in historical trajectories and use time information only occasionally,
the methods in [5] are intimately tied to time, and work based on the distance between
trajectories over a definite period of time. As a result, their methods are able to compare
trajectories visiting a different number of points in a particular time interval. Lastly, the
methods contained in this thesis are sensitive to the particular point of interest / concept
visited at each trajectory point, whereas the methods of Frentzos et al. are based purely
on spatial and temporal information. Nonetheless, an interesting future research direction
would be to try and merge the ideas in this thesis with the methods used by Frentzos et al.
and to see if an effective system could be designed.
Chapter 3
Problem Description
This thesis addresses the problem of efficiently generating a set of recommended next points
to visit following a given trajectory. In addition to this query trajectory, we have a database
of historical trajectories along with information about the points of interest in the region.
In this chapter, after presenting some necessary definitions, we define the general problem
tackled by the thesis. Following this, we present and motivate a number of requirements
that we believe to be desirable for a point of interest recommendation system to possess,
and then with these requirements define the three specific problems tackled by this thesis.
3.1 Definitions
This section contains a listing of the definitions that will be used to motivate and describe
the problem of trajectory-based point of interest recommendation. Other definitions needed
only for implementing the methods presented later in this thesis will be presented when they
are needed.
Definition 3.1.1. A concept is a tuple c = (name, children, ...) consisting of a string
c.name that is a description of the concept, as well as a set of child concepts c.children that
are contained within c. A concept will generally be referred to by its name. For example,
we could have a concept “Coffee Shop” with child concepts “Starbucks” and “Second Cup”.
Lastly, there exists a function conceptDistance(c, d) that computes the distance between
any two concepts c, d. Let z denote the lowest common ancestor of c and d. If z does
not exist then conceptDistance(c, d) = ∞. Otherwise, conceptDistance(c, d) =
max(depth(c) − depth(z), depth(d) − depth(z)), where depth(x) denotes the depth of x in
the concept hierarchy.
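For illustration, this distance can be sketched directly from the definition. The following Python sketch is not part of the thesis; the class and function names are illustrative, and the hierarchy is represented with simple parent pointers:

```python
import math

class Concept:
    """A concept: a name plus a set of child concepts (Definition 3.1.1)."""
    def __init__(self, name, children=()):
        self.name = name
        self.children = list(children)
        self.parent = None
        for child in self.children:
            child.parent = self

def ancestors(c):
    # The path from c up to the root of its tree, inclusive.
    path = [c]
    while c.parent is not None:
        c = c.parent
        path.append(c)
    return path

def depth(c):
    # Number of edges between c and the root of its component.
    return len(ancestors(c)) - 1

def concept_distance(c, d):
    """The larger of the two hop counts down from the lowest common
    ancestor z of c and d, or infinity if no such z exists."""
    anc_d = set(ancestors(d))
    z = next((a for a in ancestors(c) if a in anc_d), None)
    if z is None:
        return math.inf
    return max(depth(c) - depth(z), depth(d) - depth(z))
```

For the “Coffee Shop” example, the distance between the sibling concepts “Starbucks” and “Second Cup” is 1, while the distance to any concept in a different tree of the forest is infinite.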
Definition 3.1.2. A concept hierarchy is a forest of concepts. It is possible, and normal,
for a concept hierarchy to have multiple roots, and the distance between any two concepts
not sharing a common root is defined to be infinity. Given two concepts a, b, if a = b or a is
an ancestor of b then we write that a ≥ b, and say that a is a super-concept of b. The depth
of a concept c, depth(c), is the number of edges between c and the root of the component of
the concept hierarchy containing c.
Definition 3.1.3. A point of interest or POI is a tuple poi = (lon, lat, concept) containing
at minimum, a longitude poi.lon, a latitude poi.lat, and a concept poi.concept. A point
of interest is generally any specific location that a trajectory can visit and that can be
recommended when answering a query.
Definition 3.1.4. A point is the fundamental object used in this thesis. A point p always
has an associated concept p.concept, and there exists a function pointDistance(x, y) that
computes the distance (possibly infinite) between any two points x, y, and so all points are
comparable. This distance measure combines the conceptual and spatial distance between
two points, and a description of how to construct such a measure is presented in section
4.4.3. Two points x, y are said to be similar if pointDistance(x, y) < ∞. Three types of
point are used in this thesis: trajectory points, concept points, and localized points.
Definition 3.1.5. A trajectory point is a tuple p = (poi, ts, ...) containing at minimum, a
point of interest p.poi, and a time-stamp p.ts, along with any other information deemed
relevant. Note that for convenience, we will sometimes refer to a point’s longitude p.lon,
and latitude p.lat, although this notation is merely shorthand for p.poi.lon, and p.poi.lat.
For convenience, we will often use p.concept to refer to the concept associated with p.poi,
p.poi.concept. Note that this refers to a concept, and not a concept point. Every trajectory
point corresponds to a particular point of interest, but this is not a great limitation as it
would be easy to add a notion of “non-recommendable” points of interest. We will see,
however, that for our methods to be efficient, it is desirable to have as few points of interest
as possible.
Definition 3.1.6. A generalized point is any point that can potentially represent other
points. If a generalized point gp contains another point p then we say that gp generalizes p
and can write gp ≥ p. In this thesis we use two types of generalized point: concept points,
and localized points. For convenience, we will also say that a generalized point gp contains
a point of interest poi if gp ≥ q for some trajectory point q with q.poi = poi.
Definition 3.1.7. A concept point is a tuple p = (concept, ...) containing a concept p.concept
along with any relevant information. A concept point p has no spatial location, and is said
to generalize any other point q if p.concept ≥ q.concept.
Definition 3.1.8. A localized point is a tuple p = (concept, region, ...) representing a
concept p.concept in a particular region p.region. A localized point p generalizes another
point q if q lies entirely within p.region and if p.concept ≥ q.concept. In this thesis, regions
associated with localized points are always square, although there is no limitation on the
shape of the region.
Definition 3.1.9. A trajectory is a sequence of trajectory points, t = p1 → p2 → ...→ pn,
where each pi = (poi, ts, ...) is a trajectory point and pi+1.ts ≥ pi.ts for 1 ≤ i < n, and
|t| = n is the length of t.
Definition 3.1.10. A query trajectory is any trajectory presented as input to the trajectory-
based recommendation system. Given a query trajectory q, the objective of this thesis is to
generate a set of recommended next points for the user presenting q to visit. Generally, a
query trajectory will be very short, with |q| ≤ 5 in most cases.
Definition 3.1.11. A trajectory fragment is any substring of a trajectory. That is, given
a trajectory t = p1 → ... → pn of length n, a fragment of t is any trajectory
f = p1+i → p2+i → ... → pm+i where i ≥ 0 and m + i ≤ n. Any trajectory fragment
f = q1 → ... → qm can be written in the form (b : n), where b = q1 → ... → qm−1 is the
body of f, and n = qm is the next point of f.
Definition 3.1.12. A historical trajectory database, denoted tDB is a bag of trajectories
that have been traversed by users of the system some time in the past. The recommendations
for each query will be constructed based on the information in this database.
Definition 3.1.13. Two trajectory fragments f = f1 → ... → fn, and g = g1 → ... → gm
match exactly or have an exact match if |f| = |g| = n, and if fi.poi = gi.poi for all 1 ≤ i ≤ n.
Note that the time-stamps of the trajectory points in f and g are ignored. The longitudes
and latitudes of corresponding trajectory points in f and g will always match if f and g
match exactly because we require that their associated POIs be equal.
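This check is a direct translation of the definition. The sketch below uses hypothetical minimal record types (the thesis does not fix a concrete representation):

```python
from collections import namedtuple

# Minimal stand-ins for the tuples of section 3.1 (illustrative only).
POI = namedtuple("POI", ["lon", "lat", "concept"])
TrajectoryPoint = namedtuple("TrajectoryPoint", ["poi", "ts"])

def exact_match(f, g):
    """Definition 3.1.13: equal lengths and pointwise-equal POIs;
    the time-stamps of the trajectory points are deliberately ignored."""
    return len(f) == len(g) and all(p.poi == q.poi for p, q in zip(f, g))
```

Because corresponding POIs must be equal, two exactly matching fragments automatically agree on longitude and latitude as well.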
Definition 3.1.14. Two trajectory fragments f, g match fuzzily with order k or have a fuzzy
match of order k if |f| = |g| = n, and if fuzzyError = ∑_{i=1}^{n} pointDistance(fi, gi) < k.
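Given some pointDistance function (one construction is described in section 4.4.3), the fuzzy-match check follows directly. In the sketch below the distance function is a supplied parameter, so the example uses a toy distance for illustration:

```python
def fuzzy_match(f, g, k, point_distance):
    """Definition 3.1.14: fragments of equal length whose summed
    pointwise distances (the fuzzy error) stay strictly below k."""
    if len(f) != len(g):
        return False
    fuzzy_error = sum(point_distance(p, q) for p, q in zip(f, g))
    return fuzzy_error < k
```

An exact match is the special case where every pointwise distance is zero, so an exact match is a fuzzy match of any positive order.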
Definition 3.1.15. A recommendation is a point that has been output by a recommendation
system given a query q. Each recommendation r has an associated confidence, where
0 ≤ confidence(r) ≤ 1. The confidence of a recommendation is the estimated probability of
the user visiting r after traversing the query trajectory q.
3.2 Problem Definition
The primary goals of this thesis are to provide a realistic model for framing the problem
of trajectory-based point of interest recommendation and to then describe an efficient and
scalable method for answering recommendation queries. In this section I define the problem
tackled by this thesis at a high level. Later in this chapter the three specific variations of
this problem solved by this thesis will be presented.
To begin, we assume that the following information is available:
• A historical trajectory database tDB
• A database P of points of interest (POIs)
• A concept hierarchy C
• A query trajectory q
Like a traditional search engine, the problem of trajectory-based POI recommendation
is a query-answering problem. However, unlike a traditional search engine where a query
consists of a series of words, here a query is a trajectory that a user of the system has just
traversed. The goal is then to return the top-k points that the user is most likely to desire
to visit next.
Definition 3.2.1. Trajectory-Based POI Recommendation Problem: Given a database tDB
of historical trajectories, a database P of points of interest (POIs), a concept hierarchy
C, and a query trajectory q, find the top-k points most likely to follow q. These top-k
recommendations are known as the recommendations for q.
This is a very general definition of the problem, and one interesting ambiguity in its
statement is that it does not state which trajectories are to be contained in the historical
trajectory database. By varying the contents of the historical trajectory database, we can in
fact construct multiple models of the problem. For example, we could perform recommenda-
tion based on personal history, recommendation based on user group, and recommendation
based on all historical trajectories. In this thesis we will generally be thinking of the last
of these, but all can be done simply by working with a subset of the historical trajectory
database. In section 7.1.1 we will discuss an approach for combining the POI recommenda-
tions generated by these different models.
One important insight that will be useful later in this thesis is that the query trajectory
q will generally be very short. People rarely stop at more than a few points of interest
on a given trip. The length of the query trajectory will determine how easy it is for the
query trajectory to match a trajectory fragment in the historical trajectory database and
thus, how many recommendations will be available. There is a trade-off involved in choosing
the length of query trajectory to use: longer query trajectories may increase the precision
of results, but they may also lead to a lack of diversity in the results, as well as overfitting.
Using long query trajectories rather than shorter ones could result in a system much less
likely to return useful results for rarely traversed trajectories. We will
use some experiments to illuminate this trade-off in chapter 6.
3.3 Recommendation System Requirements
Later in this chapter we will describe the three particular instances of the trajectory-based
POI recommendation problem that are tackled by this thesis. However, before doing so,
we want to first motivate and describe some properties that I believe to be desirable for a
useful trajectory-based recommendation system. The three instances of the recommendation
problem solved by this thesis each incorporate more of these requirements than the previous.
These requirements are:
1. (Quantifiability of Confidence) The confidence of each recommendation must be quan-
tifiable and should range between 0 and 1.
2. (On-line Recommendation Capability) Recommendation queries must execute in real-
time. However, there is no limitation on the amount of pre-processing time.
3. (Scalability) The trajectory-based recommendation system must be scalable to be able
to handle an arbitrarily large historical trajectory database, as well as any number of
simultaneous requests.
4. (Generalization) Highly similar possible recommendations should mutually boost each
others’ confidence.
5. (Diversity) The k recommended points should be diverse.
6. (Fuzziness) The next points of trajectory fragments similar to, but not exactly match-
ing, the query trajectory should factor into the recommendation process.
7. (Order-Flexibility) The order of trajectory points visited very close in time in a tra-
jectory should be ignored.
8. (Personalization) Trajectories in the database belonging to users very similar to the
querying user are more useful for making a recommendation than those of other users.
3.3.1 Quantifiability of Confidence
One highly desirable requirement for any trajectory-based recommendation system is for a
statistically grounded confidence to be assigned to each recommendation. Aside from the
obvious use of ranking recommendations, advertisers may only want to pay for advertising
to a user of the recommendation system if the probability of the user wanting to visit the
advertiser’s establishment is greater than a certain threshold.
3.3.2 On-line Recommendation Capability
To be useful in the real world, a trajectory-based recommendation system must be able
to execute recommendation queries in real time. As the envisioned use of the system is
for people on the move, it is important that queries be satisfied quickly enough that it is
possible for a user to act based on the returned set of recommendations. On the other
hand, like a normal search engine, we can allow for large amounts of pre-processing time
and computational resources. It is desirable to minimize the resources required for pre-
processing the historical trajectory data to be able to satisfy queries, but this is of much
less importance than ensuring that queries can be executed extremely quickly. Even if
queries were to be executed locally on a mobile device rather than on a backend server, any
amount of work could still be performed prior to loading a processed dataset onto the
mobile device.
3.3.3 Scalability
If a trajectory-based recommendation system were to be put into use, the size of the histori-
cal trajectory database could be expected to grow rapidly as the system became increasingly
popular, and it is important that any system be able to scale. There are two dimensions of
scalability to handle. The first is the number of incoming requests, but this is essentially
solved by scaling the hardware used to process requests and will not be mentioned further
in this thesis. The second is the size of the historical trajectory database. It is important
that the time required to answer a recommendation query grows sub-linearly with respect
to the size of the historical database.
If we assume a fixed maximum query length and ignore the fuzziness requirement, it
would be possible for queries to be answered in expected constant time by pre-computing
the results of all possible queries using a hash table to store and retrieve their results.
However, the memory requirements of this approach are prohibitive, and furthermore, it
cannot handle the fuzziness requirement, because incorporating that requirement means the
recommendation system must be able to execute queries with novel query trajectories that
have never been previously observed.
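The dismissed scheme can be made concrete. The sketch below (illustrative only; the names are not from the thesis) pre-computes the top-k answer for every fragment body observed in the database, which shows both why queries become constant-time lookups and why the scheme fails the fuzziness requirement: a novel query body simply misses the table.

```python
from collections import Counter, defaultdict

def precompute_all_queries(tdb, max_len, top_k):
    """Map every observed fragment body (length <= max_len) to the
    top_k next points that followed it, counted over the whole database."""
    table = defaultdict(Counter)
    for t in tdb:                      # t is a sequence of POI ids
        for l in range(1, max_len + 1):
            for i in range(len(t) - l):
                table[tuple(t[i:i + l])][t[i + l]] += 1
    return {body: [p for p, _ in counts.most_common(top_k)]
            for body, counts in table.items()}
```

The number of table entries grows with the number of distinct bodies in tDB, which is exactly the prohibitive memory cost noted above.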
3.3.4 Generalization
Consider a situation in which at a certain street corner there are three coffee shops and
a bank. Further suppose that in all of history, each coffee shop has been visited by 8
individuals after those individuals traversed some trajectory, and that 10 individuals have
visited the bank after traversing the same trajectory. Although if we only look at raw
probabilities we should recommend visiting the bank to a new user who has just traversed
this trajectory, intuitively it seems that we would be better off recommending a coffee shop.
The generalization requirement encapsulates the idea that the presence of a number of
highly similar points of interest in some neighbourhood should bolster our confidence in
recommending each of these points of interest.
3.3.5 Diversity
Building upon the same situation used to motivate the generalization requirement, suppose
that a user wants to see the top 3 recommended points to visit, given her recent trajectory
history. It is possible, working purely from a mathematical standpoint, that the top 3
recommended points could be a Starbucks on one corner, a Second Cup on another corner,
and a Blenz on a third corner of the intersection. The problem with recommending these
three points is that they are too similar, and this decreases the usefulness of the recom-
mendations to the user of the system, and may even discourage advertising as a potential
advertiser may not want to have his ad get lost in a flurry of highly similar ads. The diver-
sity requirement is that the top-k recommendations returned to answer a query should be
diverse when possible.
3.3.6 Fuzziness
Returning to the example where we had three coffee shops on the corners of
an intersection, suppose that very few people have historically visited one of them, perhaps
because it is a new coffee shop. Further suppose that when a user goes to visit the coffee
shop and uses her mobile recommendation system, nobody has ever visited the coffee shop
after following her historical trajectory. If we were to base our system’s recommendations
only on those historical trajectory fragments exactly matching our user’s last few locations
visited, we would not be able to recommend any points of interest for her to visit next.
The solution is for a POI recommendation system to base its recommendations also
on the historical trajectory fragments that are “close to” or “fuzzy matches of” our user’s
last few locations visited. For example, suppose that all users have previously travelled
the trajectory a → b → c, but the current user queries the recommendation system with
the trajectory fragment a → b′, where b and b′ are similar. The requirement of fuzziness
expresses the notion that it should be possible to recommend c given this query, albeit with
diminished confidence due to the fact that the query trajectory does not exactly match the
trajectory in the historical database. Furthermore, the requirement expresses the notion
that even if c were the next point of a historical trajectory fragment exactly matching the
query trajectory, the fact that c is the next point of other historical trajectory fragments
that fuzzy match the query trajectory should bolster our confidence in recommending c.
3.3.7 Order-Flexibility
Imagine a set of commuters who take the subway to work, half of whom visit a coffee shop
followed by a newsstand after they disembark, while the other half visit the newsstand
followed by the coffee shop. Each of these two visits may occur within a minute or two
of each other, and it is this situation we have in mind when thinking of the requirement
of order-flexibility. The order-flexibility requirement expresses the idea that the order of
events visited very close in time should not matter significantly when answering queries, as
the order of these events may carry very little information.
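One hypothetical way to realize this requirement (the thesis develops its own mechanism later; this sketch is purely illustrative, including the window size) is to canonicalize trajectories before matching, sorting any run of points whose time-stamps fall within a small window:

```python
from collections import namedtuple

TrajectoryPoint = namedtuple("TrajectoryPoint", ["poi", "ts"])

def canonicalize(points, window=120):
    """Reorder points so that visits within `window` seconds of the
    previous point are sorted by POI; trajectories differing only in
    the order of near-simultaneous visits then compare equal."""
    out, group = [], []
    for p in sorted(points, key=lambda p: p.ts):
        if group and p.ts - group[-1].ts > window:
            out.extend(sorted(group, key=lambda g: g.poi))
            group = []
        group.append(p)
    out.extend(sorted(group, key=lambda g: g.poi))
    return out
```

Under this sketch, the commuters who visit the coffee shop and the newsstand in either order within a couple of minutes produce the same canonical trajectory.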
3.3.8 Personalization
The remaining piece of information that is likely to be available in a real-world scenario
is knowledge about the users of the system. For example, we may know the gender, oc-
cupation, age, and any number of other facts about each user. This information could be
used to improve the quality of recommendations by allowing us to integrate some form of
collaborative filtering into the recommendation process.
Using the methods in this thesis it would be possible to perform recommendations based
on user groups. For example, given the knowledge that a particular user is a banker, it would
be possible to execute queries for this user based only on the historical trajectories of other
bankers. Similarly, it would be possible to base recommendations for a given user based
only on the historical trajectories of this user. The recommendations based on personal
history, user group, and the entire historical trajectory database could be combined using
a mixture model. Of all the requirements expressed in this section, the requirement of
personalization is the only one not explicitly addressed by this thesis. More ideas on how to
build personalization on top of the methods contained in this thesis can be found in Section
7.1.1.
3.4 Satisfying the Requirements
In this thesis, we address all of the above requirements except for that of personalization.
In the following two chapters we proceed in stages, building up a sequence of solutions
to the trajectory-based point of interest recommendation problem. Each satisfies more re-
quirements than the previous, and thus each solves a particular instance of the general
trajectory-based POI recommendation problem introduced above. The three primary in-
stances of the trajectory-based POI recommendation problem tackled are:
1. Exact matching (Quantifiability of Confidence, On-Line Recommendation Capability,
Scalability, Generalization, Diversity)
2. Fuzzy matching (+ Fuzziness)
3. Order-flexible matching (+ Order-Flexibility)
The exact matching problem is described in chapter 4, and its solution captures the
main technical contributions of this thesis. The fuzzy matching and order-flexible matching
problems are solved in chapter 5, and their solutions build naturally upon the foundation
laid in chapter 4 by the solution to the exact matching problem.
Chapter 4
Exact Matching
This chapter describes how we can achieve all of the requirements of the previous chapter,
except for the requirements of fuzziness and order-flexibility, using the technique of exact
trajectory matching. The exact matching methods contained in this chapter will be extended
in the next chapter to incorporate fuzzy and order-flexible matching as well. The methods
described in this chapter constitute the core contribution of this thesis.
Recall from definition 3.1.13 that an exact match between two trajectory fragments q, s
means that all corresponding points in q and s visit the same point of interest (the time-
stamps of points are ignored). Exact matching then means that given a query trajectory
q, we shall generate the top-k recommended next points of interest (POIs) for q, considering
only the trajectory fragments in the historical trajectory database that exactly match q.
Although the exact matching methods are simple to understand and easy to formulate,
they are still sufficiently complex to motivate the description and use of the principal data
structures and algorithms that will be used to later incorporate fuzziness and order-flexible
queries.
In order for a method to be useful for mobile point of interest recommendation, recall
that recommendation queries must be executed in real-time, but that we are allowed an
arbitrary amount of time to pre-process the historical trajectory database. Thus, this chap-
ter is split into two parts. The first part is a step-wise construction of how to achieve the
requirements of quantifiability, generalization, and diversity, given the set of next points for
the query trajectory q. The second part describes how to pre-process the historical database
efficiently in order to permit efficient query execution, thus meeting the on-line recommen-
dation and scalability requirements, and also describes how to query this pre-processed data
efficiently. This second part primarily relies on the k-truncated generalized suffix tree data
structure, and a brief description of the data structure is contained here, while a more
detailed description, including methods for construction, is contained in appendix A.
4.1 Exact Matching
Let tDB be the trajectory database, consisting of a bag of trajectories, and let q be a query
trajectory, where l = |q| is the length of q. Let H be the set of all trajectory fragments of
length l + 1 in tDB, so that each fragment h = (b : n) ∈ H consists of two parts: a body b
of length l, followed by a next point n. Then, let M = {h = (b : n) ∈ H | exactMatch(q, b)}
be the set of all trajectory fragments in H with a body that exactly matches q. M will be
known as the set of exact matches. With these definitions we can now precisely define the
exact matching problem.
Definition 4.1.1. Exact Matching Problem: For a given query trajectory q, find the top-k
next points (ranked by decreasing confidence) of all trajectory fragments in M . These top-k
next points are known as the recommendations for q.
In order to devise a solution to the exact matching problem satisfying all of our require-
ments, we must come up with a good measure of confidence. This will be done in a few steps.
First, we present a naïve method satisfying only the quantifiability requirement, and then
from this starting point we will show how to handle the generalization requirement. The
diversity requirement will be handled later as a post-processing step that can be executed on
the output from our other methods.
4.1.1 Naïve Approach
As a first step towards devising a good definition of confidence, it is natural to begin with
the raw observed probabilities of each possible recommendation. To begin, given a query
trajectory q, and the set of exact matches M , let N denote the set of all next points of
the trajectory fragments in M . These are the next points of our query trajectory q, and
to compute their confidences we need to define a function, support(x,N) to compute the
number of occurrences of a next point x in N (again, ignoring the point’s time-stamp):
support(x, N) = |{h = (b : n) ∈ M | n = x}| (4.1)
Now we can compute the naïve confidence of recommending each possible next point x:

confidence(x) = support(x, N) / |N| if x ∈ N, and confidence(x) = 0 otherwise. (4.2)
This confidence measure clearly satisfies the requirement of quantifiability, but it does
not satisfy the generalization requirement. To see this, consider the following example. Let
q = a→ b→ c, so that l = 3, and let the trajectory database tDB be:
Body Next Point Support
abc Starbucks-1 2
abc Starbucks-2 2
abc Starbucks-3 2
abc Second Cup 3
Figure 4.1: Graphical depiction of example (a start point followed by the four possible next
points: Starbucks 1, Starbucks 2, Starbucks 3, and Second Cup)
Suppose that all four points are equally distant from each other in space, but that the
conceptual distance between the three Starbucks locations is smaller than that between the
Starbucks locations and the Second Cup. Then the point distance between each pair of
Starbucks locations is smaller than the distance between any of the Starbucks locations and
the Second Cup. Using the confidence measure defined above, the top recommendation
would be “Second Cup”, but we can see that two-thirds of all trips traversing the trajectory
abc led to an individual visiting one of the Starbucks locations. In accordance with the gen-
eralization requirement described in the previous chapter, we want to be able to recommend
a Starbucks location (or “Starbucks” the concept) above the Second Cup location.
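Equations 4.1 and 4.2 applied to this example can be sketched as follows (a hypothetical implementation over the bag of next-point labels; the names are illustrative):

```python
from collections import Counter

def naive_confidences(next_points):
    """support(x, N) / |N| for each next point x observed in the bag N."""
    support = Counter(next_points)      # support(x, N), equation 4.1
    total = len(next_points)            # |N|
    return {x: s / total for x, s in support.items()}

# The example database: the trajectory abc was followed 9 times in total.
N = (["Starbucks-1"] * 2 + ["Starbucks-2"] * 2
     + ["Starbucks-3"] * 2 + ["Second Cup"] * 3)
conf = naive_confidences(N)
```

Here conf["Second Cup"] = 3/9 outranks each individual Starbucks at 2/9, even though two-thirds of all trips ended at some Starbucks location.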
4.2 Accounting for POI Similarity
To remedy this problem, we need to find a method to account for point of interest (POI)
similarity, and there are two distinct means of accomplishing this goal. The first approach is
to somehow increase the confidence of recommending each of the individual, highly similar
points of interest on account of there being other highly similar POIs nearby. The second
approach is more interesting, and it is to recommend a generalization of the highly similar
points that would encompass all of them. I describe both methods below, and it will be
argued that the latter approach is superior.
4.2.1 Increasing the Confidence of Each Similar POI
The first means of accounting for POI similarity is to increase the confidence of each point
of interest if there are similar points of interest nearby. Suppose that we were to define
a function of two points similarity(x, y) to compute their similarity, where the function
returns a number between 0 and 1. Using this, we could then create a new confidence
measure for a next point n. For example, we could define the confidence of a recommendation
x to be:
confidence(x) = (1 / |N|) ∑_{y ∈ N} support(y, N) × similarity(x, y)        (4.3)
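A sketch of this measure in Python follows; the similarity function and its values here are purely hypothetical placeholders, chosen only to make the Starbucks branches reinforce one another:

```python
from collections import Counter

def weighted_confidence(next_points, x, similarity):
    """Similarity-weighted confidence (Eq. 4.3): each distinct next point
    contributes its support, discounted by its similarity to x."""
    support = Counter(next_points)
    return sum(s * similarity(x, y) for y, s in support.items()) / len(next_points)

def toy_similarity(a, b):
    """Hypothetical similarity: identical POI score 1.0, distinct
    Starbucks branches 0.9, and all other pairs 0.0."""
    if a == b:
        return 1.0
    if a.startswith("Starbucks") and b.startswith("Starbucks"):
        return 0.9
    return 0.0

N = (["Starbucks-1"] * 2 + ["Starbucks-2"] * 2 +
     ["Starbucks-3"] * 2 + ["Second Cup"] * 3)
# Each Starbucks branch now absorbs support from its siblings and
# overtakes Second Cup:
print(weighted_confidence(N, "Starbucks-1", toy_similarity))  # (2 + 1.8 + 1.8)/9
print(weighted_confidence(N, "Second Cup", toy_similarity))   # 3/9
```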
This measure clearly satisfies the quantifiability requirement: since the maximum possible
sum is |N|, the measure always returns a confidence between 0 and 1. Furthermore, it
appears to satisfy the generalization requirement as well. However, there are a
number of problems with this approach that will lead us to favour the approach presented
in section 4.2.3.
The first problem is that there is no theoretical foundation for computing the confidence
of x in this manner. Yet even disregarding this, there is a conceptual problem with altering
the confidences of specific points of interest. If we alter the confidences of individual points of
interest by incorporating the supports of similar points of interest it becomes very difficult to
interpret the results returned by the recommendation system. This is because it is no longer
possible to infer from the confidence of a point of interest recommendation whether the point
is even well visited. Given a confidence score computed using the measure stated above, it
is unclear what can actually be inferred: the score gives no indication of whether a point is
being recommended only due to its proximity to popular points, rather than because anybody
has ever visited it. In an extreme case, this confidence measure could lead us to recommend
a point with only a single historical visit, even though all of its surrounding points had been
visited hundreds of times. This issue
could be somewhat alleviated by more heavily weighting the contribution of the support
of x in N , perhaps by squaring the result of the similarity function, but we shall see that
there is a better approach that will naturally avoid these problems. Rather than alter the
confidences of individual POI, the new approach will be to recommend a generalization of
points.
4.2.2 Density Estimation
The probability distribution function (PDF) is one of the fundamental concepts in statistics,
as it is both a description of the distribution of a random variable X and a means
of computing the probabilities associated with X. That is, given a probability distribution
function f for the random variable X, it is possible to compute the probability of observing
any value associated with X: in the discrete case by the simple equation Pr(X = a) = f(a),
and for continuous variables by Pr(a < X < b) = ∫_a^b f(x) dx.
For our purposes of computing the confidences of recommending points of interest, if
we knew the probability distribution function fq for the random variable representing all
possible next points following a query trajectory q, then the exact matching problem being
tackled in this chapter would be trivial. The algorithm would simply be to compute fq(x)
for all possible next points x, and to choose those x with the top-k results. Unfortunately,
the PDF is not given to us for every possible query trajectory, and so this simple idea will
not work. However, it is possible to build an estimate of the PDF from observed data; this
procedure is known as density estimation. An excellent resource on
density estimation is [26].
As described by Silverman [26], there are many means of computing density estimates,
which can be grouped into two broad categories: parametric and non-parametric. Parametric
density estimation techniques assume a particular form for the underlying probability
distribution, while non-parametric techniques make no assumptions about the distribution
of the observed data. The most common non-parametric density estimation techniques
include histogram estimation, kernel estimation, and nearest neighbour estimation.
Histogram estimation requires a random variable that represents values that can be
mapped onto the real numbers, and so does not easily apply to the problem of predicting
points of interest. Nearest neighbour estimation, on the other hand, also does not easily apply
to our situation, because many points of interest may be very similar to each other. If we
were to claim that the probability of observing a novel point is the probability of its nearest
neighbour, we could not recommend generalized points that contain many highly related
points, because the predicted probability of observing such a generalized point would be far
too small.
Kernel Estimation
For this thesis we choose kernel estimation, due to its effectiveness in handling unknown
data distributions. Kernel estimation is related to the
process of sampling, in that the predicted probability of observing a given point is based on
the distribution of sample points, where all sample points are equally weighted. However,
kernel estimation is based on the idea that observing a point increases the probability of
observing other points nearby, and consequently, distributes the contribution weight of other
points according to a kernel function, K. Furthermore, the kernel width h (also known as
the smoothing parameter) is introduced to control the effect of the kernel function in the
neighbourhood of each point.
According to Silverman [26], the accuracy of kernel estimation depends much more on
the chosen kernel width h than on the particular kernel function. Considering this, and
due to its broad applicability and common use, we have chosen the Gaussian kernel as our
kernel function.
Definition 4.2.1. A Gaussian kernel is a function Gh(x, y) = (1 / 2π) e^(−d(x, y)² / 2h²).
With this Gaussian kernel, given a set of observed objects S = (y1, y2, ..., yn), it is possible
to estimate the density of a point x using the following density estimation function:

fh(x) = (1 / n) ∑_{i=1}^{n} Gh(x, yi)        (4.4)
Notice that there is no requirement that x be a member of S, and so this density estimation
function allows us to estimate the probability of observing a previously unobserved
point. This will be exploited in section 4.2.3 in order to recommend generalized points.
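The estimator of Definition 4.2.1 and Eq. 4.4 is straightforward to sketch; the one-dimensional sample data and distance function below are hypothetical, chosen only to mirror the shape of the six-point example in Figure 4.2:

```python
import math

def gaussian_kernel(x, y, h, dist):
    """Gaussian kernel G_h of Definition 4.2.1, with kernel width h and
    a caller-supplied distance function dist."""
    return math.exp(-dist(x, y) ** 2 / (2.0 * h ** 2)) / (2.0 * math.pi)

def density_estimate(x, samples, h, dist):
    """Kernel density estimate f_h(x) of Eq. 4.4: the mean kernel value
    between x and every observed sample."""
    return sum(gaussian_kernel(x, y, h, dist) for y in samples) / len(samples)

# Six observed points on a line, loosely as in Figure 4.2:
samples = [1.0, 2.0, 2.5, 4.0, 4.2, 6.0]
d = lambda a, b: abs(a - b)

# x need not be one of the samples; the density is higher near the
# cluster around 2.0-2.5 than far away from every observation:
print(density_estimate(2.2, samples, h=0.5, dist=d))
print(density_estimate(9.0, samples, h=0.5, dist=d))
```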
Figure 4.2 demonstrates how Gaussian kernel estimation works. A Gaussian curve is
constructed around each of the six points lying on the x-axis, and the top curve is the sum of
these six curves. The Gaussian kernel estimate for this dataset is not shown, but as there are
six points in the dataset, it would be one sixth of the sum of the Gaussian curves constructed
Figure 4.2: Example of Gaussian kernel estimation
around each point. In other words, the Gaussian kernel estimate for this dataset is one sixth
the top curve in the figure.
4.2.3 Recommending Generalizations
A better means of accounting for POI similarity is to give our system the ability to
recommend generalizations of POI. In addition to computing the confidence of recommending
each next point n ∈ N as above, we will also compute the confidence of recommending all
generalizations of n. Recall that if gp is a generalized point that contains a POI p, then we
say that gp generalizes p, written gp ≥ p.

Suppose that for a given query there are next points n1, n2, ..., nm, all with a common
conceptual ancestor z, and recall from the previous section that the density estimate of a
point z is the expected probability of observing z. This allows us to compute the confidence
of recommending z as the Gaussian kernel density estimate for z over the bag of next points N.
Recalling that we want the confidences of non-generalized points to not be affected by
other points, using a Gaussian kernel density estimation function fh(x), we can write:
confidence(z) = fh(z)                       if z is a generalized point
confidence(z) = support(z, N) / (2π |N|)    otherwise        (4.5)
This family of measures (there is a different measure for each possible h) has been
selected because it is both simple and theoretically well-founded. Furthermore, if we define
the distance between two points as 0 if they are the same and ∞ otherwise, then the
confidence computed by this family for a point is exactly 1/(2π) times the value computed by
the naïve exact matching confidence measure presented in section 4.1.1. Thus the naïve exact
matching case is just a special case of our method for recommending generalizations.
As a final note about recommending generalized POI, if we were to leave our method
as described above, it would be possible to recommend both a generalization of a point
as well as the point itself. However, this is practically undesirable, as one of our goals is
to recommend a diverse set of POI. What we can do is recommend a generalization z only
if our confidence in it is greater than our confidence in recommending any of its children,
whether an explicit POI or a more specialized generalization.
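Putting Eq. 4.5 and the child-filtering rule together, a minimal sketch might look as follows; the distance values and the `is_generalized` predicate are hypothetical stand-ins for the thesis's concept hierarchy and distance measure:

```python
import math
from collections import Counter

def confidence(z, next_points, h, dist, is_generalized):
    """Confidence of Eq. 4.5: a Gaussian kernel density estimate for
    generalized points, and support(z, N) / (2*pi*|N|) otherwise."""
    n = len(next_points)
    if is_generalized(z):
        return sum(math.exp(-dist(z, y) ** 2 / (2.0 * h ** 2)) / (2.0 * math.pi)
                   for y in next_points) / n
    return Counter(next_points)[z] / (2.0 * math.pi * n)

def recommend_generalization(z, children, next_points, h, dist, is_generalized):
    """Keep generalization z only if its confidence beats every child's."""
    cz = confidence(z, next_points, h, dist, is_generalized)
    return all(cz > confidence(c, next_points, h, dist, is_generalized)
               for c in children)

# Hypothetical distances: "Starbucks" is close to its three branches
# and far from everything else.
def toy_dist(a, b):
    if a == b:
        return 0.0
    if a.startswith("Starbucks") and b.startswith("Starbucks"):
        return 0.5
    return 100.0

N = (["Starbucks-1"] * 2 + ["Starbucks-2"] * 2 +
     ["Starbucks-3"] * 2 + ["Second Cup"] * 3)
is_gen = lambda p: p == "Starbucks"
branches = ["Starbucks-1", "Starbucks-2", "Starbucks-3"]
# The generalization outranks both its children and Second Cup:
print(recommend_generalization("Starbucks", branches, N, 1.0, toy_dist, is_gen))  # True
print(confidence("Starbucks", N, 1.0, toy_dist, is_gen) >
      confidence("Second Cup", N, 1.0, toy_dist, is_gen))  # True
```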
4.3 Spatio-specific Generalized Recommendations
Using the generalized confidence measure presented in section 4.2.3, we are able to
recommend both trajectory points and concept points without difficulty. However, there is still
a problem to be addressed: the generalized recommendations made so far contain no
spatial information and are purely conceptual. That is, we still have no
mechanism for recommending localized points.
To see why this is a problem, suppose there are three franchises of a popular coffee shop
very near to each other, but that there is another franchise of the same coffee shop across
town. Now suppose that all four of these franchises are next points of our query trajectory
q, and that the common generalization of these four franchises is z. The problem we can see
is that confidence(z) will be low, due to the spatial distance between three of the franchises
and the fourth franchise.
Figure 4.3: Example illustrating the problem of over-generalization
What we want is to be able to recommend a generalized point y that subsumes only the
three franchises that are very nearby to each other, and to be able to disregard the other
POI. However, it would be undesirable for the administrator of the POI recommendation
system to have to manually create an intermediate layer in the POI hierarchy, between the
least generalization and the POI themselves, that indicated spatial proximity. What we
need is to create a dynamic conceptual hierarchy level capable of recommending POI in
close proximity. This will allow us to recommend localized points in addition to trajectory
points and concept points.
To accomplish this, we can overlay the space containing all of our trajectories with four
interleaving grids and give each cell in each of the four grids a cell code. Each of these grids
will have an edge length 2r, and the four grids are offset from each other by r in one or
both dimensions. Given a cell for one of the grids, cells for each of the other grids could
be found by adding r to the longitude of our original cell and leaving the latitude alone, by
adding r to the latitude of our original cell and leaving its longitude unchanged, and finally
by adding r to both the latitude and longitude of our original grid cell.
With these four interleaving grids we can attach a tuple (cell1, cell2, cell3, cell4) to each
point in our trajectories. Two points can then be considered to be in close proximity if they
share a cell code. By using four interleaving grids, we have a trivial method of determining
whether points are nearby to each other because if each grid cell has dimensions 2r × 2r
then any two points of distance no more than r from each other will share a cell code. For
the purposes of distance computations, the location of a localized point associated with a
cell will be taken to be the centroid of all contained points of interest rather than the cell
center.
Definition 4.3.1. The extents of the historical trajectory database is a tuple extents =
(minLon,maxLon,minLat,maxLat) that contains the minimum longitude, maximum lon-
gitude, minimum latitude, and maximum latitude observed on any trajectory point in the
historical trajectory database.
In this thesis, we use the following simple method for computing the cell codes, given a
trajectory point p, the extents extents of the historical trajectory database, and a cell edge
length r. Two helper variables, baseCellCodeLon and baseCellCodeLat, are introduced to
simplify the equations:
baseCellCodeLon = ⌊2(p.lon − extents.minLon) / r − 1⌋
baseCellCodeLat = ⌊2(p.lat − extents.minLat) / r − 1⌋

cell1 = (baseCellCodeLon, baseCellCodeLat)            (4.6)
cell2 = (baseCellCodeLon, baseCellCodeLat + 1)        (4.7)
cell3 = (baseCellCodeLon + 1, baseCellCodeLat)        (4.8)
cell4 = (baseCellCodeLon + 1, baseCellCodeLat + 1)    (4.9)
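The cell-code equations transcribe directly into code. The helper below follows Eqs. 4.6-4.9, and `share_cell` is a hypothetical proximity test that compares codes grid by grid; the coordinates and r in the example are arbitrary illustrative values:

```python
import math

def cell_codes(lon, lat, min_lon, min_lat, r):
    """Cell codes of Eqs. 4.6-4.9 for a point against the four
    interleaving grids."""
    base_lon = math.floor(2.0 * (lon - min_lon) / r - 1.0)
    base_lat = math.floor(2.0 * (lat - min_lat) / r - 1.0)
    return ((base_lon,     base_lat),      # cell 1
            (base_lon,     base_lat + 1),  # cell 2
            (base_lon + 1, base_lat),      # cell 3
            (base_lon + 1, base_lat + 1))  # cell 4

def share_cell(p, q, min_lon, min_lat, r):
    """Two points are considered to be in close proximity iff they
    share a cell code in at least one of the four grids."""
    cp = cell_codes(p[0], p[1], min_lon, min_lat, r)
    cq = cell_codes(q[0], q[1], min_lon, min_lat, r)
    return any(a == b for a, b in zip(cp, cq))

# Nearby points share a code; distant points do not:
print(share_cell((3.0, 3.0), (4.0, 4.0), 0.0, 0.0, r=10.0))    # True
print(share_cell((3.0, 3.0), (50.0, 50.0), 0.0, 0.0, r=10.0))  # False
```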
Figure 4.4 visualizes how grid cells solve the problem of recommending only concept
points, by allowing the recommendation system to recommend concepts in a specific region.
In this figure we can see that shops 1, 2, and 3 all share a grid cell, and so we could
recommend a localized point associated with this grid cell for the concept "shop". Using
only concept points, we would only be able to recommend individual
shops or all of the shops together. Note that all four interleaving grids are shown in the
figure, although all of the lines are shared by two interleaving grids. One cell from each of
the four interleaving grids is shaded in the figure.
Figure 4.4: Solving the over-generalization problem with grid cells
4.4 Implementing Exact Matching
One of the principal goals of this thesis is that all recommendation queries should execute
in real-time, and thus they need to be extremely efficient. However, we are permitted an
arbitrary amount of data pre-processing time. Hence, what is needed is to pre-process our
historical trajectory data, and to store it in some data structure that will permit us to
perform queries efficiently.
Each subsection of this section covers one aspect of efficiently implementing exact matching.
I begin by describing the traditional generalized suffix tree data structure, and then
proceed to the k-truncated generalized suffix tree, which for our purposes is
cheaper to construct and consumes less memory than a traditional generalized suffix tree.
After describing these data structures, I cover the details of computing the distance between
points of interest (and their generalized varieties), as well as how to execute exact matching
queries on a generalized suffix tree (k-truncated or not).
4.4.1 Suffix Trees
First introduced by Peter Weiner in 1973 [30], who referred to them as “position trees”,
suffix trees have become part of the standard data structure tool-box, and have found
wide application in many string algorithms. According to Dan Gusfield, author of the
comprehensive book on suffix trees, “Algorithms on Strings, Trees, and Sequences” [10], the
“classic application” for suffix trees is the substring problem. In the normal formulation,
the problem is to determine whether some string r is a substring of the string on which we
have constructed a suffix tree. This is easy to perform with a suffix tree because a suffix tree
is a tree that contains all suffixes of a given string. This means that solving the substring
problem becomes a simple matter of traversing the suffix tree, matching characters of r to
the characters on edges of the suffix tree until all characters of r have been matched or it is
possible to proceed no further.
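For illustration, the substring query can be demonstrated with a naive suffix trie (not the compact, linear-size suffix tree defined below; this sketch uses quadratic construction, which the real structure avoids):

```python
def build_suffix_trie(s):
    """Insert every suffix of s into a nested-dict trie. This naive
    construction is O(|s|^2); a proper suffix tree compacts unary
    chains into labelled edges and builds in (near-)linear time."""
    root = {}
    for i in range(len(s)):
        node = root
        for ch in s[i:]:
            node = node.setdefault(ch, {})
    return root

def is_substring(trie, r):
    """Walk down the trie matching the characters of r; r is a
    substring iff every character can be matched."""
    node = trie
    for ch in r:
        if ch not in node:
            return False
        node = node[ch]
    return True

trie = build_suffix_trie("mississippi$")
print(is_substring(trie, "issi"))  # True
print(is_substring(trie, "issa"))  # False
```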
Following Gusfield [10], we define a suffix tree:
Definition 4.4.1. A suffix tree T for a string s of length m is a rooted directed tree with
m leaves. Each internal node other than the root has at least two children and each edge
is labelled with a non-empty substring of s. No two edges leading out of a node can have
labels beginning with the same character. The defining characteristic of a suffix tree is that
for any leaf i, the concatenation of the labels of the edges on the path from the root to leaf
i is s_{i..m}, the suffix of s that starts at position i.
Definition 4.4.2. A generalized suffix tree T for a set of strings S = {s1, s2, ..., sn}, where
|sj| = mj, is a suffix tree constructed on a set of strings rather than a single string; it
contains all suffixes of all of the strings in S. Each internal node other than the root has at
least two children and each edge is labelled with a non-empty substring of one or more sj.
No two edges leading out of a node can have labels beginning with the same character. The
defining characteristic of a generalized suffix tree is that for any leaf (i, j), the concatenation
of the labels of the edges on the path from the root to leaf (i, j) is s^j_{i..mj}, the suffix of sj
that starts at position i.
An example of a suffix tree for the word "mississippi$" is presented in figure 4.5. The "$" is
not strictly necessary, but is used so that every suffix has its own leaf node. In this example
we can see the defining characteristic of a suffix tree: the concatenation of the labels of
the edges on the path from the root to a leaf i is s_{i..m}, the suffix of s that starts at position
i. For example, the concatenation of edge labels from the root to node 8 is
"ippi$", which is the suffix of "mississippi$" starting from position 8.
Figure 4.5: Suffix tree for the word "mississippi$"
Suffix trees are well suited to the problem of exact matching, and they will be useful for
the fuzzy matching techniques presented in the next chapter as well. They are both space
and time efficient to construct, and permit us to rapidly query for the next points of
a query trajectory q. Although a linear time method for suffix tree construction was first
discovered in 1973 by Peter Weiner [30], the first online method for suffix tree construction
was published in 1995 by Esko Ukkonen [28]. "Online" in this context means that characters
are added to the suffix tree in the order in which they are presented, which makes it
possible to update the suffix tree with new characters as they are discovered. The methods
used in this thesis are based on Ukkonen's algorithm.
The methods described in this section all have published time and space complexity
bounds of O(m), where m is the sum of lengths of all input strings. However, it is important
to note that this bound is only valid assuming a fixed alphabet. The linear bounds are not
alphabet-independent. As Gusfield describes it: “All linears are equal but some are more
equal than others." [10]. For the purposes of this thesis, the alphabet Σ is not fixed, and
so the real time complexity bound is O(m log |Σ|) [10]. Furthermore, the space complexity
for constructing a suffix tree is in fact O(m|Σ|). It is thus important that the alphabet size
required for a problem be kept as small as possible. If the alphabet size grows linearly with
the length of the strings, then the time required to construct a suffix tree degenerates to
O(m logm), and the space complexity to O(m2). As suffix tree construction can be done
offline this is not too bad from the perspective of running time, but the space requirements
can be prohibitive.
This complexity bound leads us to an important practical consideration, which is that
the alphabet size should be limited as much as possible. For example, given a set of real
trajectories, it is likely that almost all points will be unique if position is measured too
finely, but if points are snapped to a grid such that longitudes and latitudes are rounded to
the nearest 10 metres, then the alphabet size is reduced drastically. In the experimentation
section of this thesis, a different approach is taken: all trajectories are represented as
sequences of visited points of interest, which limits the alphabet size under consideration to
the number of points of interest.
Suffix trees can be used to solve the exact matching problem because given a query
trajectory q, we can traverse down the suffix tree until we match q, and from there we can
determine all possible next points by simply reading the children of that node.
Detailed descriptions of the methods required to construct suffix trees are omitted here
for brevity and can be found in Appendix A. One important practical note, however, is that
many suffix tree implementations use a linked list at each node, and this can slow down
insertion and lookup, because as the tree grows it takes increasingly longer to find
the child node corresponding to the next character of the suffix being inserted. This is an
important consideration given our requirement that queries execute in real-time.
4.4.2 k-truncated Generalized Suffix Trees
An ordinary (generalized) suffix tree can be thought of as having a maximal string depth
equal to the length of the longest string present in the tree. However, for the purposes
of this thesis, a query trajectory q is unlikely to be very long, and furthermore, if it were
long, it is questionable how useful its older trajectory points would be. One assumption that
can be used to reduce the time and memory required to construct and store a suffix tree is
to require that a query trajectory not exceed some fixed length l, and in this case we can
improve on using a plain generalized suffix tree.
The k-truncated generalized suffix tree (kTGST) was first introduced in 2008 by Schulz et
al. [24]. It differs from an ordinary generalized suffix tree in that the construction algorithm
builds a tree of depth at most k and permits searching only for sequences of length at most k.
It was originally introduced as a means of improving bioinformatics algorithms, where the
query strings are typically short DNA, RNA, or amino acid sequences.
Rather than containing all suffixes of all strings added to the tree, the kTGST contains
all substrings of length k (known as k-grams) of the strings added to the tree. This is
exactly what is needed to support the exact matching methods described in this chapter. As
with regular generalized suffix trees, for the sake of brevity, a description of how to construct
kTGSTs has been moved to appendix A.
The kTGST improves on traditional generalized suffix trees in a number of ways: improved
construction time, reduced memory usage, and quicker queries. The improved construction
time and reduced memory usage arise because limiting the strings added to the suffix tree
to k-grams greatly increases the chance of identical strings being added to the tree. The
quicker queries result because in a normal generalized suffix tree, information about the
string id and position of a suffix in a string is stored only in the leaf nodes of the tree, and
so after matching a suffix, we need to traverse the entire subtree below the match. With a
kTGST, if our query has length k, then there is no subtree below the match that needs
exploring, and for queries of length less than k, the subtree that needs to be traversed is
likely to be small compared to the case of regular generalized suffix trees.
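The depth-k idea can be illustrated with the same kind of naive nested-dict trie used earlier (again a sketch under stated assumptions, not the compact kTGST of Schulz et al.): inserting only the first k characters of each suffix bounds the depth at k and lets repeated k-grams collapse into shared paths:

```python
def build_ktruncated_trie(strings, k):
    """Insert, for every suffix of every string, only its first k
    characters. The resulting trie has depth at most k and contains
    exactly the substrings of length <= k (the k-grams plus shorter
    tail suffixes)."""
    root = {}
    for s in strings:
        for i in range(len(s)):
            node = root
            for ch in s[i:i + k]:
                node = node.setdefault(ch, {})
    return root

def trie_depth(node):
    """Length of the longest path below node."""
    if not node:
        return 0
    return 1 + max(trie_depth(child) for child in node.values())

trie = build_ktruncated_trie(["mississippi$"], 3)
print(trie_depth(trie))  # 3: the depth never exceeds k
```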
Figure 4.6 contains an example of a k-truncated suffix tree for the word “mississippi$”
with k = 3. Note that some leaf nodes now correspond to multiple locations in the original
word, and that the maximum depth of the suffix tree does not exceed 3.
Figure 4.6: 3-truncated suffix tree for the word "mississippi$"
4.4.3 Computing Point Distance
In sections 4.2.3 and 4.5 we depend on some function dist(x, y) to compute the distance
between two points x and y, where each is either a point of interest, a generalized point, or
a localized generalized point. Note that generalized points have no spatial location, and all
types of point have an associated concept.
Throughout this thesis, the spatial distance sDist(x, y) is taken to mean the Euclidean
distance between two points x, y, although any well defined distance measure could be used
instead. Recall from definition 3.1.1 the definition of concept distance used in this thesis. Let
z be the lowest common ancestor of x and y in the concept hierarchy. If z does not exist
(i.e. x and y have no common ancestor in the concept hierarchy), then the conceptual
distance between x and y is said to be infinite. Otherwise, the conceptual distance cDist(x, y)
between x and y is defined as max(depth(x) − depth(z), depth(y) − depth(z)),
where depth(x) denotes the depth of concept x in the concept hierarchy.
Note that other distance measures, such as the sum of these two depth differences, could be
used. An example of computing cDist is given in figure 4.7.
Figure 4.7: Example of concept distance: cDist(x, y) = 3
To compute the distance between a concept point x and any other point y, we can com-
pute the conceptual distance, cDist(x, y) and normalize it by a user defined generalization
factor gf used to control how distant concept points are from each other and from other
points.
To compute the distance between two points x, y where neither x nor y is a concept point,
we can separately compute the spatial and conceptual distances between them, normalizing
each distance by an arbitrary spatial factor, sf , and concept factor, cf respectively, and
summing the two values.
To summarize, the distance between two points x, y is computed as:
distance(x, y) = cDist(x, y) × gf                        if x or y is a concept point
distance(x, y) = cDist(x, y) × cf + sDist(x, y) × sf     otherwise        (4.10)
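A sketch of Eq. 4.10, together with the concept distance described above, follows; the hierarchy, depths, and factor values are hypothetical, set up to reproduce the example of Figure 4.7:

```python
import math

def c_dist(x, y, parent, depth):
    """Concept distance: the larger of the two depth differences to the
    lowest common ancestor z, or infinity if no common ancestor exists.
    `parent` maps each concept to its parent in the concept hierarchy."""
    ancestors_of_x = set()
    node = x
    while node is not None:
        ancestors_of_x.add(node)
        node = parent.get(node)
    node = y
    while node is not None and node not in ancestors_of_x:
        node = parent.get(node)
    if node is None:
        return math.inf
    z = node
    return max(depth[x] - depth[z], depth[y] - depth[z])

def distance(x, y, cdist, sdist, is_concept, gf, cf, sf):
    """Point distance of Eq. 4.10 with generalization, concept, and
    spatial factors gf, cf, and sf."""
    if is_concept(x) or is_concept(y):
        return cdist * gf
    return cdist * cf + sdist * sf

# Hierarchy of Figure 4.7: A -> B -> x and A -> C -> D -> y.
parent = {"B": "A", "x": "B", "C": "A", "D": "C", "y": "D"}
depth = {"A": 0, "B": 1, "x": 2, "C": 1, "D": 2, "y": 3}
print(c_dist("x", "y", parent, depth))  # 3, as in the figure
```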
The three constants used in this equation, the generalization factor gf, the concept factor
cf, and the spatial factor sf, need to be chosen carefully for the recommendation
system to perform well. Unfortunately, there is no rule that allows us to determine
good values for these constants a priori. The most critical decision is what the relative
weights of these distances should be. For example, by choosing a large spatial factor sf
and a small concept factor cf, we could tune the recommendation system to recommend
generalized points covering trajectory points with highly dissimilar concepts, provided that
the trajectory points are very close to each other. Our recommendation is to try a number
of combinations of these factors and to select the combination that works best for a given
dataset. This may seem arbitrary, but the constants must be chosen for each application,
and by choosing different values it is possible to tune the recommendation system to exhibit
different properties. For
example, by tweaking these parameters we can alter the observed ratios of concept, localized,
and trajectory points recommended. Taken to an extreme, one interesting idea would be to
set the spatial factor near zero in order to answer recommendation queries even for tourists
in cities for which we have no historical data. The values chosen for experimentation in this
thesis are discussed in Chapter 6.
4.4.4 Executing Exact Matching Queries
Given the techniques presented thus far in this chapter, it is simple to execute an exact
matching query. When presented with a query trajectory q, all that is required is to walk
the edges of a generalized suffix tree until we find a matching node. From this node, we can
determine all of the next points N of q, and with this set of next points, all that remains
is to compute the confidences of all possible recommendations according to the measures
defined in sections 4.2.3 and 4.3.
Searching Suffix Trees
Intuitively, searching a suffix tree is very simple, although some details need to be handled
correctly in an actual implementation. To search a suffix tree, one simply walks down its
edges, iterating through each point of the query trajectory, until either all points have been
matched, a mismatch is encountered, or no suitable child is found to continue walking down
the tree. The pseudo-code for our suffix tree searching algorithm is presented in algorithm 1.
Input: A suffix tree node node, a query trajectory q, and an integer len representing how much of q has already been matched to reach node
Output: A set of next points N of q

N ← {}
edgestr ← node.edgestr
edgelen ← node.edgelen
pos ← 0
/* Compare characters until mismatch or end of string reached */
while len < |q| and pos < edgelen and q[len] = edgestr[pos] do
    pos ← pos + 1
    len ← len + 1
if len = |q| then                      /* Matched trajectory */
    /* Traverse subtree to find all next points */
    N ← TraverseSubtree(node)
    return N
else if pos < edgelen then             /* Encountered mismatch */
    return {}
/* Reached end of edge. Need to recurse if node is internal */
if not(node.isALeaf) then
    child ← FindChild(node, q[len])
    if not null(child) then            /* We found a child to continue searching */
        return searchTree(child, q, len)
/* Node is a leaf, or no child exists. Can't recurse */
return {}

Algorithm 1: Searching Suffix Trees
Two important technical notes must be made about the algorithm for searching for all
next points of a query trajectory q:
• Once we have found a node n matching the query trajectory q, we cannot immediately
determine the next points of q. This is because information about the strings / trajectories
stored in the suffix tree is kept only in the leaf nodes, and so we must traverse
the entire subtree below n in order to determine the next points of q. This is performed
by the function TraverseSubtree(node).
If k-truncated generalized suffix trees are being used, the amount of tree that needs
to be traversed below n may be much smaller. This is a major reason why using
k-truncated generalized suffix trees will turn out to be more efficient than using
regular generalized suffix trees.
• In order to implement the FindChild(node, point) method efficiently, it is critical
that the children of node be indexed for fast lookup. For trajectory-based POI
recommendation, the alphabet tends to be extremely large, and it is important that
this lookup be performed using a hash table or a tree, instead of iterating through
all children and comparing each to point. Our implementation uses a binary tree to
perform this lookup.
Processing Next Points
Once we have obtained the bag N of all next points of q, it remains to compute the
confidences of all possible recommendations and to determine the set of recommendations that
should be presented to the user. The processing of next points takes place in two phases:
gathering and diversification.

The gathering phase computes the confidences of all possible recommendations. To
do this, we begin by determining the supports of all next points of q and computing the
confidences of recommending each of these points according to the methods described
earlier in this chapter. The next step is to compute the confidences of recommending all
concept generalizations of the next points of q. The final confidence computation step is
to determine all cells that contain a next point of q, and then, for each cell c, to compute
the confidence of all (now spatio-localized) generalizations of the next points of q contained
in c. This process constructs a set R of all possible recommendations for the query
trajectory q, along with their associated raw confidences. Pseudo-code for this process is
presented in algorithm 2.
The next phase performed in processing next points is the optional diversification of
Input: A bag N of all next points of a query trajectory q
Output: A set R consisting of (point, confidence) pairs, each representing a recommendation and its computed confidence.

/* Phase 1: Compute the confidences of each next point */
R ← computeConfidencesOfTrajPoints(N)
/* Phase 2: Compute the confidences of all concept points */
C ← gatherAllObservedConcepts(N)
R ← R ∪ computeConfidencesOfConceptPoints(C, N)
/* Phase 3: Iterate over all cells and compute the confidences of all localized points */
Cells ← gatherAllObservedCells(N)
for cell ∈ Cells do
    N_cell ← gatherPointsInCell(N, cell)
    R ← R ∪ computeConfidencesOfTrajPoints(N_cell, cell)
    C_cell ← gatherConceptsInCell(N, cell)
    R ← R ∪ computeConfidencesOfConceptPoints(C_cell, N_cell, cell)
return R

Algorithm 2: Processing Next Points
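The gathering phase can be sketched in Python. This is a simplified sketch, not the thesis's implementation: confidence is approximated as support/|N| throughout (the thesis uses kernel density estimates for generalized points), the per-cell phase only localizes concepts, and the concept_of / cell_of lookup tables are hypothetical.

```python
from collections import Counter

def gather_confidences(next_points, concept_of, cell_of):
    """Sketch of Algorithm 2's gathering phase.

    next_points: bag (list, with repeats) of next points of the query.
    concept_of / cell_of: hypothetical maps from a point to its concept
    generalization and to its containing grid cell.
    """
    n = len(next_points)
    recs = {}

    # Phase 1: raw next points, confidence = support / |N|.
    for point, support in Counter(next_points).items():
        recs[point] = support / n

    # Phase 2: concept generalizations (e.g. "Coffee Shop").
    concepts = Counter(concept_of[p] for p in next_points)
    for concept, support in concepts.items():
        recs[("concept", concept)] = support / n

    # Phase 3: per-cell (spatio-localized) generalizations.
    for p in next_points:
        key = ("cell", cell_of[p], concept_of[p])
        recs[key] = recs.get(key, 0) + 1 / n
    return recs

N = ["sb1", "sb1", "sc1"]
concept_of = {"sb1": "coffee", "sc1": "coffee"}
cell_of = {"sb1": 7, "sc1": 9}
R = gather_confidences(N, concept_of, cell_of)
```

With both points generalizing to "coffee", the raw point sb1 ends up with confidence 2/3 while the concept point collects the full support of the bag.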
recommendations. This process is discussed in the next section. The output of the diversi-
fication process is an updated R with new, diversified, confidences.
4.5 Diversification of Recommendations
Although not a theoretical problem, there is a practical problem with the methods described
in the previous sections. In particular, there is the chance of recommending, for example, a
generalized Starbucks POI alongside a generalized Second Cup POI, or a different Starbucks
POI in another region. This is potentially undesirable both for advertisers and for users of
the system. Even though it is likely that an individual will visit, say, one of three coffee
shops, the user may not want all three coffee shops recommended, and similarly, one coffee
shop may not want to spend money advertising just to appear alongside other coffee shops.
In the previous sections, we wanted the recommendation system to recommend a gen-
eralized point only if the points contained in the generalized point are highly related. This
is because we do not want to recommend a generalized point that is too general as that
would not be useful for a user of the recommendation system. However, some of the top-k
recommendations may not be similar enough to recommend together as a generalized point
but nonetheless be similar enough to not want to recommend both of them. For example,
it would be possible for two of the top-k recommendations for a query to be a Starbucks
and a Second Cup that are far enough apart that the confidences for recommending each
of them is greater than the confidence for recommending a generalized point for “Coffee
Shop”. Despite the two coffee shops being too distant from each other for the recommenda-
tion system to recommend a generalized point subsuming both of them, we may not want
to recommend two coffee shops in the case that we could recommend a book store or a
hardware store instead.
This is not a problem with the confidence measure defined above, but a matter of how
to practically apply the confidences as estimated. What we can do is process the list of
recommendations after computing the confidences of each possible next point for a given
query trajectory. The goal is to compute a “diversified confidence” for each next point
that can be used to generate a list of diversified top-k recommendations. The goal is not
to improve the confidences of the recommendations, since the method is guaranteed to
reduce the sum of confidences of the top-k recommendations returned, but merely to ensure
that the recommendations are diverse.
An easy method to implement diversity is to iterate through the list of all recommenda-
tions in order of decreasing confidence, and to decrease the confidence of each recommen-
dation by how similar it is to recommendations already processed. If a recommendation
is very similar to recommendations already made, then it will have its effective confidence
decreased by a large amount, and conversely, if a recommendation is very dissimilar to
recommendations already made, then its effective confidence should not be affected.
The effect of point similarity on the diversification process can be controlled by a di-
versification factor divFac, 0 ≤ divFac ≤ 1. If divFac = 1 then no two recommendations
representing comparable points will both end up with non-zero confidences.
Pseudocode for the diversification algorithm used by this thesis is presented in algorithm
3. This is a straightforward greedy algorithm, and it is easy to see by inspection that it
runs in time O(|R|2), where R is the set of potential recommendations to be returned to
the user.
One objection that can be made to using this means of diversifying the set of rec-
ommendations is that it is completely deterministic, and given a set of recommendations
Input: A set of recommendations R, a diversification factor divFac
Output: A set of recommendations S with confidences adjusted to ensure diversity

S ← {}
sort R by decreasing confidence
for i ← 1 to |R| do
    rec ← R[i]
    for j ← 1 to |S| do
        pointDistance ← distance(rec.point, S[j].point)
        if pointDistance < ∞ then
            divReduction ← pointDistance × (1 − divFac)
            if divReduction < 1 then
                rec.confidence ← rec.confidence × divReduction
    S ← S ∪ {rec}
return S

Algorithm 3: Algorithm for Diversifying Recommendations
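The greedy loop of Algorithm 3 translates directly into Python. The sketch below assumes recommendations are (point, confidence) pairs and that the distance function returns float('inf') for incomparable points, mirroring the pointDistance < ∞ test.

```python
def diversify(recs, distance, div_fac):
    """Greedy diversification in the spirit of Algorithm 3.

    recs: list of (point, confidence) pairs.
    distance: function returning the distance between two points,
              float('inf') for incomparable points.
    div_fac: diversification factor in [0, 1].
    Runs in O(|R|^2): each recommendation is compared against all
    previously accepted ones, and its confidence is scaled down when
    a similar (nearby) recommendation has already been processed.
    """
    result = []
    for point, conf in sorted(recs, key=lambda r: -r[1]):
        for prev_point, _ in result:
            d = distance(point, prev_point)
            if d == float("inf"):
                continue  # incomparable points do not interact
            reduction = d * (1 - div_fac)
            if reduction < 1:
                conf *= reduction
        result.append((point, conf))
    return result

# Hypothetical example: two coffee shops at distance 0.5, a book store
# incomparable to both.
dists = {frozenset(["sb", "sc"]): 0.5}
def distance(a, b):
    return dists.get(frozenset([a, b]), float("inf"))

diversified = diversify([("sb", 0.9), ("sc", 0.8), ("books", 0.5)],
                        distance, div_fac=1.0)
```

With div_fac = 1, the second coffee shop's confidence is driven to zero, illustrating the claim that no two comparable points both keep non-zero confidences.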
always returns the same result. This is objectionable because, using the algorithm presented
above, we may always recommend a Starbucks for some query and never a Second Cup,
even though our confidence in recommending the Second Cup may be only slightly lower
than our confidence in recommending the Starbucks. This is undesirable for users of
the recommendation system, for the establishments that are never recommended by
the recommendation process, and for the operator of the recommendation system, who
may lose out on potential advertising revenue. This problem can be remedied by slightly
randomizing the order in which the points are considered, so that the higher a point's
confidence, the more likely it is to be processed early and have its confidence left untouched.
As mentioned in Section 2.1, this property is known as serendipity. However, we do not
pursue any experimentation along this line in this thesis, as the greedy method presented
is sufficient for our purposes.
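Such a serendipitous processing order could be sketched as follows. This is purely illustrative and not evaluated in the thesis: points are drawn without replacement with probability increasing sharply in confidence, so higher-confidence points tend to be processed first and keep their confidences untouched; the temperature parameter is an invented knob.

```python
import random

def serendipitous_order(recs, temperature=0.1, rng=random):
    """Randomize the diversification order: at each step, draw one
    remaining (point, confidence) pair with weight conf**(1/T), so a
    high-confidence point is very likely, but not certain, to come
    out early."""
    pool = list(recs)
    order = []
    while pool:
        weights = [conf ** (1 / temperature) for _, conf in pool]
        pick = rng.choices(range(len(pool)), weights=weights)[0]
        order.append(pool.pop(pick))
    return order

order = serendipitous_order([("a", 0.9), ("b", 0.5), ("c", 0.1)],
                            rng=random.Random(0))
```

The output is a permutation of the input; feeding it to the greedy diversifier in place of the strict confidence sort would occasionally let a slightly weaker competitor through.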
4.6 Summary
This chapter began by motivating and defining the exact matching problem, and then
proceeded to present a naïve solution to the problem. The naïve solution was revealed to
suffer from an inability to account for POI similarity. Two approaches to accounting for POI
similarity were presented, one that worked by altering the confidences of recommending
similar points of interest, and one that functioned by using density estimation to recommend
concept points. The former was demonstrated to be flawed, and the latter was selected as
the superior approach. Following this, we introduced a method for recommending localized
points using a system of four interleaving grids to determine whether the next points for a
query are spatially nearby.
After describing the conceptual framework for performing exact matching we delved into
the details of implementing exact matching using the (k-truncated) generalized suffix tree
data structure. Furthermore, we presented a concrete equation for computing the distances
between points and provided algorithms for executing recommendation queries. Finally
we presented a simple method for ensuring that the top-k returned recommendations are
diverse.
Chapter 5
Variants
The exact matching techniques described in the previous chapter are useful and are the
core of this thesis. However, they still do not address the fuzziness and order-flexibility
objectives for a useful trajectory-based recommendation system. The goal of this chapter
is to demonstrate how these two limitations can be naturally addressed in a reasonably
efficient manner by building on the exact matching techniques of the previous chapter.
5.1 Fuzzy matching
To repeat some preliminary definitions from previous chapters, let tDB be the trajectory
database, consisting of a bag of trajectories, and let q be a query trajectory, where l = |q|
is the length of q. Furthermore, let H be the set of all trajectory fragments of length l + 1
in tDB, so that each fragment h = (b : n) ∈ H consists of two parts: a body b of length l,
followed by a next point n.
Using the exact matching techniques of the previous chapter, when presented with a
query trajectory q we are only able to recommend next points that follow the body q in H.
This has two principal limitations:
• (No Recommendations) Given a query trajectory q, if q does not appear as the body
of some trajectory fragment in H, then we are unable to recommend any points of
interest.
• (Similar Trajectories) If there is a trajectory fragment s that is very similar to q, and
there is a recommendation for s with very high confidence, then we should be able to
recommend this next point in addition to the next points following q. Depending on
the particular dataset considered, it may be quite common for many very similar, but
non-identical, trajectories to be present in the trajectory database, and the exact
matching methods of the previous chapter are unable to utilize all of this information.
Our goal is for the trajectory-based POI recommendation system to base its recommen-
dations also on the historical trajectory fragments that are “close to” or “fuzzy matches
of” the query trajectory q. The next points of these historical trajectory fragments that
fuzzily match the query trajectory should be considered as possible recommendations when
executing a query. Furthermore, if some point a is the next point of one or more trajectory
fragments similar to q (including an exact match of q), then the support of these similar
trajectory fragments leading to point a should boost our confidence in recommending a.
More concretely, in this section we want to develop a method whereby the next points of
all trajectory fragments in H are considered, and where their contributions are weighted
proportionally to their body’s similarity to the query trajectory q.
Let F denote the set F = {(b : n) ∈ H | similarity(b, q) > 0}. That is, F is the set of all
trajectory fragments in the historical trajectory database that have bodies with a positive
similarity with q. With this definition we can define the fuzzy matching problem analogously
to the exact matching problem defined in the previous chapter.
Definition 5.1.1. Fuzzy Matching Problem: For a given query trajectory q, find the top-k
next points (ranked by decreasing confidence) of all trajectory fragments in F . The effect
of the next point of a fragment f = (b : n) ∈ F on the confidence for a recommendation
must be weighted proportionally to similarity(b, q).
5.1.1 Implementing Fuzzy Matching
One of the primary goals of this thesis is for all methods to be efficient, and for all queries
to be executable in real time. We must be mindful that the method chosen to perform
fuzzy matching remains efficient even on large data sets. The fundamental change required
to perform fuzzy matching is to modify the suffix tree searching algorithm presented in
the previous chapter: add a fuzzyError variable to the input of the algorithm, and relax
the condition of the while loop so that it no longer requires exact matches but instead
allows mismatches, adding the distance between each observed point and the expected
point to the fuzzyError. When processing next points, the fuzzyError will be used to
determine how to weight each next point.
At first glance, it may appear that this is all we need to do to handle the fuzzy matching
case, but there are still two problems that need to be addressed. The first is that the
contribution of the next points of a considered trajectory s was defined in terms of the
similarity between s and the query trajectory q, but the fuzzyError as described can grow
arbitrarily large. The second problem is that, as described, we may need to consider all
trajectories in the trajectory database (only those containing a point at infinite distance
from its corresponding point in q could be excluded), and so as the data set grows, the
time required to search for all possible next points may grow linearly (not exponentially,
since our generalized suffix tree can contain no more nodes than there are characters in
the dataset) to the point that queries can no longer be executed sufficiently quickly.
The solution to both of these problems is to specify a fuzzy search radius fsr, which serves
both as a control to limit the breadth of the search and as a scale for converting the
fuzzyError into a similarity score, so that similarity(s, t) = 1 − min(fsr, fuzzyError)/fsr.
The unit of the fuzzy search radius is the abstract unit of point distance defined in section
4.4.3. By defining a fuzzy search radius, we are able to limit our search space to those
historical trajectory fragments that fuzzily match the query trajectory within error fsr. This
will typically allow us to avoid visiting most of the nodes in our generalized suffix tree.
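The conversion from accumulated error to similarity is a one-liner; a minimal sketch under the definitions above:

```python
def similarity_from_error(fuzzy_error, fsr):
    """similarity(s, t) = 1 - min(fsr, fuzzyError) / fsr.

    fsr is the fuzzy search radius: a cumulative error of 0 gives
    similarity 1 (exact match), and any error at or beyond fsr is
    clamped to similarity 0.
    """
    return 1.0 - min(fsr, fuzzy_error) / fsr

assert similarity_from_error(0.0, 2.0) == 1.0   # exact match
assert similarity_from_error(2.5, 2.0) == 0.0   # beyond the radius
assert similarity_from_error(1.0, 2.0) == 0.5   # halfway
```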
Figure 5.1 is an abstract representation of the space surrounding a query trajectory that
lies within the fuzzy search radius maxDist.
Figure 5.1: Demonstrating the Fuzzy Search Radius around a Trajectory
With this enhancement, it is now possible to efficiently execute fuzzy matching queries.
Recall that in the previous chapter, we defined the kernel estimation function fh to be:

fh(x, S) = (1/n) Σ_{i=1}^{n} Gh(x, S[i])    (5.1)
In equation 5.1 we sum over all possible next points, and in many situations many of
these points may be the same point. Suppose that from S we construct a new set R
consisting of pairs (y, support(y)), where y is a unique point (no two pairs share the
same point y). In addition, let support(y) = |{s ∈ S | s = y}| be the support of y in S.
Furthermore, let totalSupport = Σ_{i=1}^{|R|} support(R[i]) be the sum of the supports
of all unique points of R. At this point, the support for a point is independent of the
query trajectory, and so for now we set support(y, q) = support(y). Recalling that in
order to execute fuzzy matching queries our density estimates will need to depend on
the query trajectory, we can rewrite the definition of fh as:
gh(x, R, q) = (1/totalSupport) Σ_{i=1}^{|R|} support(R[i], q) Gh(x, R[i])    (5.2)
With equation 5.2, it is possible for us to incorporate the next points of trajectory
fragments that do not exactly match the query trajectory, but that have a positive similarity
score with the query trajectory. All that we need to do is to change the means of computing
support(y) for each next point y. For each trajectory fragment (b : n) ∈ F , the contribution
of this fragment towards support(n) will be equal to similarity(b, q), where q is the query
trajectory. This leads us to a revised equation for support(y, q):
support(y, q) = Σ_{(b:n)∈F} [ similarity(b, q) if y = n, else 0 ]    (5.3)
Given the revised equation for support(y, q), equation 5.3, and the reworked kernel
density estimate computed by equation 5.2 we can compute the fuzzy confidence for a given
recommendation z using what is essentially the same confidence measure as was used in the
previous chapter:
fuzzyConfidence(z, q) =
    gh(z, N, q)                   if z is a generalized point
    support(z, N, q) / (2π|N|)    otherwise                    (5.4)
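Equations 5.3 and 5.4 (for the non-generalized case) can be sketched as follows, assuming the fuzzy search has already produced one (next point, similarity) pair per matching fragment in F; the 2π|N| normalizer follows equation 5.4 as printed, with |N| taken as the number of matching fragments.

```python
import math

def fuzzy_support(y, matches):
    """Equation 5.3: sum the body similarities of all fuzzily
    matching fragments (b : n) whose next point n equals y."""
    return sum(sim for nxt, sim in matches if nxt == y)

def fuzzy_confidence(y, matches):
    """Equation 5.4, non-generalized case:
    support(y, q) / (2 * pi * |N|)."""
    return fuzzy_support(y, matches) / (2 * math.pi * len(matches))

# Hypothetical matches: two fragments lead to "a" (one exact, one at
# similarity 0.5) and one fragment leads to "b".
matches = [("a", 1.0), ("a", 0.5), ("b", 1.0)]
```

An exact match contributes a full unit of support, while a fuzzily matching fragment contributes only its body's similarity to the query.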
In order to efficiently compute support(y, q) for each trajectory point y given a query q,
we need to modify the method for searching a generalized suffix tree presented in section
4.4.4. The updated algorithm will output both the next points of all historical trajectory
fragments that fuzzy match the query trajectory q as well as the similarity between q and
these historical trajectory fragments. Pseudo-code for our revised algorithm is presented in
algorithm 4.
Input: A suffix tree node node and a query trajectory q, a fuzzy search radius maxDistance, the current cumulative distance cumDistance, and an integer len representing how much of q has already been matched to reach node
Output: A set of pairs (N, similarity) of next points and similarity scores for each historical trajectory fragment that fuzzy matches q

N ← {}
edgestr ← node.edgestr
edgelen ← node.edgelen
pos ← 0
/* Compare characters until max distance or end of string reached */
while len < |q| and pos < edgelen and cumDistance < maxDistance do
    cumDistance ← cumDistance + pointDistance(q[len], edgestr[pos])
    pos ← pos + 1
    len ← len + 1
if cumDistance ≥ maxDistance then /* Max Distance Exceeded */
    return {}
else if len = |q| then /* Matched trajectory */
    /* Traverse subtree to find all next points */
    N ← TraverseSubtree(node)
    return (N, cumDistance)
/* Reached end of edge. Need to recurse if node is internal */
Results ← {}
if not(node.isALeaf) then
    for child ∈ node.children do
        Results ← Results ∪ searchTree(child, q, maxDistance, cumDistance, len)
return Results

Algorithm 4: Searching Suffix Trees For Fuzzy Matching
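The recursion of Algorithm 4 can be sketched over a plain trie of trajectory bodies instead of a true generalized suffix tree: each trie edge here is a single point, so the per-edge while loop collapses to one comparison, and the "$next" key holding the next points of fragments ending at a node is an invented convention.

```python
def fuzzy_search(trie, query, point_distance, max_dist, cum=0.0, depth=0):
    """Fuzzy search over a trie of fragment bodies.

    Returns (next_points, cumulative_error) pairs for every fragment
    body within max_dist of the query; branches whose accumulated
    error reaches max_dist are pruned, which is what lets the search
    skip most of the structure.
    """
    if cum >= max_dist:
        return []            # pruned: error budget exhausted
    if depth == len(query):  # matched the whole query
        return [(trie.get("$next", []), cum)]
    results = []
    for point, child in trie.items():
        if point == "$next":
            continue
        d = point_distance(query[depth], point)
        results += fuzzy_search(child, query, point_distance,
                                max_dist, cum + d, depth + 1)
    return results

# Hypothetical data: bodies a->b (next point c) and x->b (next point z).
trie = {"a": {"b": {"$next": ["c"]}}, "x": {"b": {"$next": ["z"]}}}
def dist(p, r):
    return 0.0 if p == r else 1.0
```

With a tight radius only the exact body matches; widening the radius lets the search reach the similar body x→b as well, at error 1.0.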
A further optimization could be to perform an A∗-search of the generalized suffix tree,
where we begin by first exploring those paths most similar to the query trajectory q. Our
goal is always to find the top-k recommendations, and we should be able to explore until we
find the top-k recommendations and no other branches explored could possibly contain a
top-k recommendation. In the worst case, all branches of the tree may be explored, although
in practice this is unlikely. The principal reason for not implementing this approach in the
thesis is that if diversification is performed, then the top-k recommendations found before
diversification will likely not be the top-k recommendations after diversification. A possible
resolution would be to search until the top-x recommendations are discovered where x > k,
but this is not pursued any further here.
One optimization that has been suggested is to consider fuzzy matches only against
similar trajectories that lead to the next points of q, the query trajectory. However,
this suggestion is flawed for a simple reason: if q has never been previously observed,
then we would not be able to make any recommendations, even if there were many
previously observed trajectories similar to q.
5.2 Order-Flexible Queries
Despite the power offered by the fuzzy matching technique of the previous section, there is
still a remaining limitation to the method. In particular, the order of points in the trajectory
query q must exactly match the order of points in the body of a trajectory s in the trajectory
database, and this violates the desire expressed in chapter 3 that the order of points close
in time should not matter.
There is an ambiguity in this problem statement in that it does not state whether we’re
looking for points close in time in the query trajectory or in the trajectory database, and this
leads to two approaches to solving the problem of order-flexible queries. The first approach
is the history-centric approach, wherein we match the query trajectory only against the
trajectory database, but allow a degree of out-of-order matching when points in a historical
trajectory are close enough in time. The other approach is the query-centric approach,
wherein we look for points close in time in the query trajectory and use this to generate
permutations of the query trajectory to match against the trajectory database. Considering
the motivation for performing order-flexible queries presented in Chapter 3, we can posit
that in a real situation these two approaches would lead to a similar set of recommendations,
but this would need extensive testing. Due to its much greater efficiency, the query-centric
approach is the one taken by this thesis for experimental purposes.
5.2.1 History-Centric Approach
The history-centric approach is perhaps the more obvious of the two approaches, and would
be an extension of the fuzzy-matching method described earlier in the chapter. That is, given
a query trajectory q, we want to match q against the database of historical trajectories,
performing a form of fuzzy matching whereby q would match (with some error factor) a
historical trajectory h even where q 6= h, but where swapping some points in h that are
close in time to each other would transform h into q. For the purpose of this discussion, we
can further limit the problem by not allowing any point of h to be swapped with any point
other than its immediate predecessor or successor in the trajectory.
At first glance, it may appear that all that we need to do is to match q against the
historical trajectory database, performing some form of look-ahead while matching to see if
swapping two points in the historical trajectory would allow it to match q. This would work
if we did not care about the time difference between points when performing order-flexible
matching. However, as argued in Chapter 3, we do care about the time difference between
visiting points in this scenario.
One proposed solution to handle this would be to attach a list to all nodes in the
generalized suffix tree of all historical trajectories passing through the node, and to keep
track of the temporal differences between a user’s visit to that node and the next node in the
trajectory. Clearly this would greatly increase the memory requirements of storing the suffix
tree, but that is not the most severe problem with this solution. The real problem is that
even with this information we cannot simply search the tree, looking ahead for potential
node swaps of small time distance, because we would have no efficient means of determining
how many historical trajectories followed the path traversed and of determining the support
for any next points. The only solution would be to trace historical trajectories through the
generalized suffix tree, which means that in the worst case, we would need to search for
all historical trajectory fragments of length k in the tree. The time required to do this is
O(n ∗ k) where n is the sum of lengths of all historical trajectories, and k is the length of
the query trajectory.
To see why it is necessary to search for all historical trajectory fragments of length k (k-
grams), consider a historical database consisting of two trajectories, x = y = a → b → c.
Suppose that the difference between x and y is that for x the time difference between the
visits to a and b is 1 minute, and the time between the visits to b and c is 10 minutes.
Similarly, suppose that for y, the time between the visits to a and b is 10 minutes, but the
time between the visits to b and c is only 1 minute. Finally, suppose that the maximum
allowable time difference between two points for them to be swapped is 1 minute. Then,
given a query trajectory q = a→ c→ b, we can see that we match q against a permutation
of y, but we cannot match q to any permutation of x. In order to determine this we need to
separately consider both x and y. The generalized suffix tree as constructed for executing
exact matching and fuzzy matching queries does not contain enough information for us to
determine that only one of x and y can match q when taking the history-centric approach
to order-flexible matching.
In order to examine all historical k-grams in the suffix tree, when constructing the suffix
tree we would need to consider the time-stamps of trajectory points when determining
if trajectory points match. It is highly unlikely for many trajectory points to share the
same time-stamps, and so using a suffix tree offers no advantages over a plain sorted list
of historical trajectory fragments in order to solve the history-centric approach to order
flexible matching. It is possible that there exists an efficient solution to the history-centric
approach, but we have been unable to think of a solution, and we believe that it is unlikely
that an efficient solution exists.
In conclusion, although the history-centric approach to order-flexible matching is intu-
itively reasonable, implementing the approach requires time linearly proportional to the size
of the historical database. This runs contrary to the requirement of on-line recommenda-
tion. We want an efficient and scalable approach where there is no linear factor of the size
of the historical database in the complexity for the method.
5.2.2 Query-Centric Approach
A more scalable method for performing order flexible queries is the query-centric approach.
This approach addresses the problem by repeating the fuzzy matching process with all per-
mutations of the query trajectory q that are temporally close to q. Taking this approach
will allow us to satisfy the order-flexibility requirement while allowing for an efficient imple-
mentation with no linear factors of the size of the historical trajectory database in its time
complexity.
Let Qp = (Qp1, Qp2, ..., Qpl) denote the set of all permutations of the query trajectory q
such that qi is swapped with qi+1 only if qi+1.time− qi.time < maxDiff , where maxDiff
is an arbitrary threshold. Using the same definitions as in the previous section, let H denote
the set of all trajectory fragments in the historical trajectory database tDB. Furthermore,
let OF = (OF1, OF2, ..., OFl) be the set of all sets of trajectory fragments with positive
similarity to each Qpi, so that OFi = {(b : n) ∈ H|similarity(b,Qpi) > 0}.
Definition 5.2.1. Order-Flexible Problem: For a given query trajectory q, find the top-k
next points (ranked by decreasing confidence) of all trajectory fragments in OF . The contri-
butions of the next point of a fragment f = (b : n) ∈ OFi must be weighted proportionally
to similarity(b, q), as well as to orderError(Qpi, q).
The query-centric approach can be implemented as an extension of the fuzzy matching
technique presented in the previous section. It works by first identifying all points in q that
are close in time to the next point in q, building up a vector canSwap, where canSwapi
is a boolean value indicating whether qi+1.time − qi.time < maxDiff , that is, whether
qi is close in time to qi+1. With this information it is possible for us to recursively
generate all permutations of q such that no point x in a permutation has a time-stamp more
than maxDiff later than the successor of x. For simplicity, we require that the index of a
point in a permutation be no more than 1 off from its index in q. This requirement could be
removed, at the cost of a more complex and less efficient implementation.
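The permutation generation can be sketched recursively. This is a hedged sketch of generateAllowedPermutations (which in Algorithm 5 also receives a swap-error parameter, omitted here): only adjacent swaps are allowed, gated by the strict time-difference test, and each permutation is returned together with the number of swaps used to produce it.

```python
def allowed_permutations(points, times, max_diff):
    """Generate (permutation, num_swaps) pairs for the query-centric
    approach.

    Adjacent points q[i], q[i+1] may be swapped only when
    times[i+1] - times[i] < max_diff, and because a swapped pair is
    consumed together, no point moves more than one position from
    its index in q.
    """
    def recurse(i, prefix, swaps):
        if i >= len(points):
            yield prefix, swaps
            return
        if i == len(points) - 1:
            yield prefix + [points[i]], swaps
            return
        # Keep q[i] in place.
        yield from recurse(i + 1, prefix + [points[i]], swaps)
        # Swap q[i] with q[i+1] if they are close enough in time.
        if times[i + 1] - times[i] < max_diff:
            yield from recurse(i + 2,
                               prefix + [points[i + 1], points[i]],
                               swaps + 1)
    return list(recurse(0, [], 0))

# Both gaps of p are small, only the second gap of q is small
# (times are hypothetical, in minutes).
perms_p = allowed_permutations(["a", "b", "c"], [0, 1, 2], 1.5)
perms_q = allowed_permutations(["a", "b", "c"], [0, 10, 11], 1.5)
```

This reproduces the example at the end of the section: p yields three allowed permutations while q yields only two.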
As with the fuzzy matching case described in the previous section, we need to weight the
contributions of the next points resulting from matching an order swapped query trajectory
appropriately. Let swaps(p, q) denote the minimum number of point swaps needed to obtain
a trajectory p from q. The error resulting from searching for p in the generalized suffix tree
rather than q is simply orderError = swaps(p, q)× orderFactor, where orderFactor is an
arbitrary constant. As with the fuzzy matching methods of the previous section, in order
to guarantee that the algorithm runs efficiently we need to limit the search by bounding the
maximum allowable error.
Combining this with the fuzzyError factor introduced in the previous section, the final
error used to weight the contribution of a discovered next point is error = fuzzyError +
orderError.
Although the number of possible combinations of swaps is exponential in the length of
the query trajectory, q, the length of q is expected to be very short (≤ 5 in most applica-
tions), and so there is no exponential blow up to be experienced here. Assuming a fixed
query trajectory length, the query-centric approach will execute in only O(1) times the time
required to execute a fuzzy-matching query. This query-centric approach is the approach
taken by this thesis for experimentation purposes.
Pseudo-code for the basic order-flexible matching algorithm is presented in algorithm 5.
Note that in this pseudo-code the searchTree(...) function call refers to the fuzzy matching
suffix tree search algorithm presented in section 5.1.1. A consequence of this is that our
implementation of the query-centric approach to order-flexible matching is built upon
fuzzy matching. If someone wanted to perform order-flexible matching without allowing
fuzzy matching, this function call could be replaced with a call to the exact matching suffix
tree search algorithm presented in section 4.4.4.
Input: A suffix tree root root and a query trajectory q, a fuzzy search radius maxDistance, a maximum time duration maxOrderDiff for order swapping, and a swap error swapError
Output: A set of pairs (N, similarity) of next points and similarity scores corresponding to each historical trajectory fragment that order-flexibly matches q

/* Phase 1: Generate all allowed permutations of q */
/* X = ((q0, numSwaps0), (q1, numSwaps1), ..., (qm, numSwapsm)) where q0 = q and numSwapsi denotes the number of swaps required to obtain qi from q */
X ← generateAllowedPermutations(q, maxOrderDiff, swapError)
/* Phase 2: Generate results for all permutations */
Results ← {}
for (qi, numSwapsi) ∈ X do
    initialError ← numSwapsi × swapError
    Results ← Results ∪ searchTree(root, qi, maxDistance, initialError, 0)
return Results

Algorithm 5: Searching Suffix Trees for Order-Flexible Matching
To illustrate the simplicity of the query-centric approach to order-flexible matching, con-
sider a historical database consisting of two trajectories, x = a → b → c → d and
y = b → a → c → e. The times between visits in these trajectories are irrelevant. Now,
suppose we have two query trajectories, p = q = a → b → c. In trajectory p, 1 minute
elapses between the visits to a and b, and between the visits to b and c. In trajectory q,
10 minutes elapse between the visits to a and b, but only 1 minute elapses between the
visits to b and c.
Given a maximum time difference of 1 minute, the set of allowed permutations of p is
Pp = (a → b → c, b → a → c, a → c → b), whereas the set of allowed permutations of q
is Qp = (a → b → c, a → c → b). Matching will be performed using each of the allowed
permutations of p and q. We can see that a permutation of p matches the body of x and
another matches the body of y, and so we will be able to recommend both d and e given
the query p. However, no permutation of query q matches y, and so given query q our
trajectory-based recommendation system will only be able to recommend point d.
5.3 Summary
This chapter introduced two extensions of the exact matching problem, the fuzzy matching
problem and the order-flexible matching problem, and provided methods to solve both of them.
The exact matching formulation of the trajectory-based POI recommendation problem was
revealed to suffer from two limitations: no recommendations, and similar trajectories. These
limitations were overcome by defining a fuzzy search radius and incorporating the next
points of all trajectories lying within this radius into the recommendation process. The
search algorithm presented was a simple extension of the algorithm for executing exact
matching queries presented in section 4.4.4. After solving the fuzzy matching problem we
considered the order-flexible matching problem, and found that there are two approaches to
the problem. The first, the history-centric approach, was demonstrated to be intractable,
but the second, the query-centric approach, was demonstrated to be solvable by further
extending the search algorithms used for fuzzy matching and exact matching.
Chapter 6
Experimental Results
The goal of this chapter is to present experimental results demonstrating the efficiency and
effectiveness of the methods presented in chapters 4 and 5. The chapter is divided into
three parts: the first describes the datasets used and how they were processed, the second
describes how the quality of results is evaluated, and the third presents the results of the
experiments along with an analysis of those results.
6.1 Datasets
The experimental results in this thesis have been collected using a number of processed
variants of two publicly available datasets: the INFATI dataset [14], derived from tracking
the movements of cars in a town in northern Denmark, and the trucks dataset [5], which
tracks the movements of a number of trucks in Athens, Greece. The INFATI dataset is split
into two “teams”, and for simplicity we use only the trajectories gathered by “team 1”. The
entirety of the trucks dataset is used. Visualizations of the two datasets are presented in
figure 6.1.
6.1.1 Dataset Processing
Both the INFATI and the trucks datasets are pure trajectory datasets and suffer from
the limitation that although they contain a plethora of trajectory points, each point in a
trajectory does not denote a visit to a point of interest. Rather, each point merely denotes
the location of a vehicle some fixed amount of time following its predecessor. Furthermore,
CHAPTER 6. EXPERIMENTAL RESULTS 54
(a) INFATI dataset (b) Trucks dataset
Figure 6.1: Datasets used
there are no points of interest present in the dataset, and hence to make the datasets useful
for experimenting on the methods presented in chapters 4 and 5, we need to process the
datasets to add points of interest and to map trajectories to these points of interest. In
addition, we will choose a method for probabilistically perturbing the trajectories in our
datasets in order to highlight the differences between the exact matching technique and its
variants.
Dataset Processing Model
Each of the initial datasets (INFATI and trucks) can be thought of as a historical trajectory
database tDB, where each point has its own point of interest. The initial problem to be
solved by processing the datasets is that of choosing a subset of points where we will assume
that real points of interest exist. There are three obvious approaches to solving this problem.
The first approach is to simply choose a random selection of points in the initial dataset
and place a point of interest at the location of each of these points. The second would
be to spatially cluster the trajectory points, and to declare that a reference point for each
of the top-n clusters is a point of interest. Yet another approach is to place points of
interest independently of the given trajectories, either randomly or at regular intervals.
This approach is flawed because most of these points of interest will never be visited, and
real points of interest tend to be clustered into a small subset of the total space. For
the experimentation in this thesis, I have chosen the first approach due to its simplicity
and because it allows us to easily generate processed datasets of varying sizes. We use a
user-defined poiRate, so that every poiRateth observed point in the initial dataset is assumed
to be the location of a point of interest. Each such POI is randomly assigned a leaf concept.
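A minimal sketch of this selection step (Python; the function name and the (location, concept) pair representation are assumptions, not the thesis implementation):

```python
import random

def generate_pois(points, poi_rate, leaf_concepts, rng=None):
    """Place a POI at every poi_rate-th observed point and assign it a
    randomly chosen leaf concept from the concept hierarchy."""
    rng = rng or random.Random()
    return [(p, rng.choice(leaf_concepts))
            for i, p in enumerate(points) if (i + 1) % poi_rate == 0]
```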
Given a set of points of interest, the next problem that needs to be addressed is that of
mapping the initial trajectories to the points of interest. This is done in a straightforward
manner: we compute the distance of each point of an initial trajectory as it is observed to
each point of interest, ignoring all POIs that an earlier point in the initial trajectory has
already mapped to. If there is an unvisited POI poi within some fixed distance maxDist of
the trajectory point, then the trajectory point is mapped to poi. Otherwise, the trajectory
point is ignored.
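This mapping step can be sketched as follows (Python; an illustrative nearest-unvisited-POI scan under the assumption that POIs are plain (x, y) locations, not the thesis implementation):

```python
import math

def map_to_pois(trajectory, pois, max_dist):
    """Map each raw trajectory point to the nearest POI within max_dist
    that no earlier point of the trajectory has mapped to; points with
    no such POI are dropped.  `pois` is a list of (x, y) locations and
    the result is a list of POI indices in visit order."""
    mapped, visited = [], set()
    for px, py in trajectory:
        best, best_d = None, max_dist
        for i, (qx, qy) in enumerate(pois):
            if i in visited:
                continue
            d = math.hypot(px - qx, py - qy)
            if d <= best_d:
                best, best_d = i, d
        if best is not None:
            mapped.append(best)
            visited.add(best)
    return mapped
```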
By itself, the processing algorithm described in the previous two paragraphs works quite
well, but suffers from the problem that there is too little variety in the resulting datasets.
This can be attributed to the fact that the initial datasets used were generated by a relatively
small number of individuals, and it can be expected that a dataset generated from tracking
thousands of individuals would show much more variety. As a result of this deficiency, in
order to illuminate the differences between the exact matching technique of chapter 4 and
the fuzzy and order-flexible methods of chapter 5, we need to extend the processing model.
To enhance the fuzziness of the processed datasets, we define a constant splitProb that
denotes the probability of “splitting” a point of interest. Splitting a point of interest will
generate a new POI y from an initial point x, such that y.concept = x.concept, and such
that pointDistance(x, y) < maxSplitDistance. Each time we map an initial trajectory
point p to a POI x, we split x with probability splitProb. If we do not split x, we randomly
map p to x or to one of the POIs previously generated by splitting x, with each of these
POIs selected with equal probability.
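A sketch of this splitting step (Python; the names and the `splits` bookkeeping dictionary are hypothetical, and for simplicity the new POI is offset within a box of half-width maxSplitDistance rather than a strict disc):

```python
import random

def maybe_split(poi, splits, split_prob, max_split_dist, rng):
    """With probability split_prob, derive a new POI near `poi` that
    shares its concept and record it in `splits`; otherwise map to
    `poi` or one of its previously derived splits, chosen uniformly.
    A POI is a ((x, y), concept) pair."""
    (x, y), concept = poi
    if rng.random() < split_prob:
        new = ((x + rng.uniform(-max_split_dist, max_split_dist),
                y + rng.uniform(-max_split_dist, max_split_dist)),
               concept)
        splits.setdefault(poi, []).append(new)
        return new
    return rng.choice([poi] + splits.get(poi, []))
```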
Similarly, to ensure that some trajectories visit points in different orders, we define a
constant orderProb that denotes the probability of swapping any two trajectory points p, q
such that |p.ts−q.ts| < maxTimeDiff . After all initial trajectory points have been mapped
to points of interest, we will make a pass over all generated trajectories, and perform this
order-swapping process.
Pseudo-code describing this processing algorithm can be found in Algorithm 6.
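The order-swapping pass can be sketched in isolation as follows (Python; names are hypothetical, and a seeded generator is assumed for reproducibility):

```python
import random

def swap_pass(traj, timestamps, max_time_diff, order_prob, rng=None):
    """One pass over a mapped trajectory: each point is swapped with
    its successor with probability order_prob whenever the two visits
    are within max_time_diff of each other."""
    rng = rng or random.Random()
    traj = list(traj)
    for i in range(len(traj) - 1):
        if (timestamps[i + 1] - timestamps[i] < max_time_diff
                and rng.random() < order_prob):
            traj[i], traj[i + 1] = traj[i + 1], traj[i]
    return traj
```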
INFATI Processing
The INFATI dataset was processed three times using the processing algorithm on page 57
to create three variant datasets. For each run, the splitting probability was set to 0.025, and
the order swapping probability to 0.33. Furthermore, the maximum distance for mapping a
trajectory point to a POI was set to 50. Finally, the maximum split distance was set to 2, and
the maximum time difference for order swapping to 2 seconds. Clearly the maximum time
difference is impractical for a real world application, but the points in the initial INFATI
dataset are separated by only 1 second, and so the order swapping time difference must be
made accordingly small.
Three processed INFATI datasets were generated: Infati-750, Infati-500, and Infati-
250. These datasets were processed using poiRate values of 750, 500, and 250, respectively.
Detailed information and a visualization of the Infati-500 dataset can be found in section
6.1.2.
Trucks Processing
The Trucks dataset was processed twice to create two variant datasets. As with the INFATI
dataset, for each run, the splitting probability was set to 0.025, and the order swapping
probability to 0.33. However, the Trucks dataset uses a different coordinate system than the
INFATI dataset, and thus the distances need to be adjusted. For processing the Trucks dataset, the
maximum distance for mapping a trajectory point to a POI was set to 0.007. The maximum
split distance was set to 0.0005 and the maximum time difference to 1 minute.
Two processed Trucks datasets were generated: Trucks-100, and Trucks-50. These
datasets were processed using poiRate values of 100 and 50, respectively, and as for the
INFATI dataset, detailed information and a visualization of the Trucks-100 dataset can
be found in the following section.
6.1.2 Dataset Statistics
Taking together the results of processing the INFATI and Trucks datasets, we have five
datasets available for us to experiment with: Infati-750, Infati-500, Infati-250, Trucks-100,
Input: A set of initial trajectories T, a POI rate poiRate, a maximum spatial distance
maxDist, a splitting probability splitProb, a maximum split distance maxSplitDist, an
order-swapping probability orderProb, and a maximum time difference maxTimeDiff
Output: A set of POIs POIs, and a historical trajectory database tDB

/* Phase 1: Generate POIs for every poiRateth point */
initialPOIs ← generatePOIAtEveryNthPoint(T, poiRate)
POIs ← initialPOIs; tDB ← {}

/* Phase 2: Map trajectory points to POIs */
for traj ∈ T do
    visitedPOIs ← {}
    for point ∈ traj do
        nextPoint ← nearestUnvisitedPOI(point, POIs, maxDist, visitedPOIs)
        if exists(nextPoint) then
            /* There is an unvisited POI within maxDist of point */
            if random() < splitProb then
                /* Generate a new POI within maxSplitDist of nextPoint */
                trajPoint ← split(nextPoint, maxSplitDist)
            else
                splitPoints ← all POIs previously derived from nextPoint
                trajPoint ← chooseRandom({nextPoint} ∪ splitPoints)
            POIs ← POIs ∪ {trajPoint}
            visitedPOIs ← visitedPOIs ∪ {trajPoint}
    tDB ← tDB ∪ {visitedPOIs}

/* Phase 3: Swap order of points close in time */
for traj ∈ tDB do
    for point ∈ traj do
        if successor(point).ts − point.ts < maxTimeDiff then
            if random() < orderProb then
                swap(point, successor(point)) in traj

return POIs, tDB

Algorithm 6: Pseudo-code for Dataset Processing
Name        #Traj. Points   #Trajs.   Avg. Length   #Swaps   #Users
Infati-750      35830         1487        24.1        1878       11
Infati-500      54521         1546        35.3        4200       11
Infati-250     107280         1624        66.1       13187       11
Trucks-100      90654         5260        17.2       20272       60
Trucks-50      100520         5339        18.8       23157       60
Table 6.1: Dataset Trajectory Information
Name        Number of POIs   Median POI Visits   Max POI Visits
Infati-750       1812               13                133
Infati-500       2760               13                140
Infati-250       5413               13                163
Trucks-100       3352               18                257
Trucks-50        4776               12                269
Table 6.2: Dataset POI Information
and Trucks-50. This section contains a pair of tables of information about the processed
datasets, as well as visualizations of two of them. Most of the columns in
these tables are self-explanatory.
Table 6.1 contains information about the trajectories in each processed dataset. The
only non-obvious column is “#Swaps”, which lists how many points were swapped during
phase 3 of the processing algorithm.
Table 6.2 contains information about the points of interest in each processed dataset.
The columns that need explanation are “Median POI Visits”, which lists the median number
of visits to each point of interest in the dataset, and “Max POI Visits”, which similarly lists
the maximum number of visits to any POI in the dataset.
Visualizations of Infati-500 and Trucks-100 can be found in figure 6.2. Comparing the
visualizations of these processed datasets with the graphs of the unprocessed initial trajec-
tory data found in figure 6.1, we can observe that they are generally very similar, although
much of the fine detail has been lost. This is not a problem for us as we want trajectories
where each point denotes a visit to a point of interest, and we argue that the rarely visited
regions lost in the processing do not contain points of interest. The concept hierarchy used
(a) INFATI-500 dataset (b) Trucks-100 dataset
Figure 6.2: Processed Datasets
Figure 6.3: Concept Hierarchy (Restaurant: Slow Food {White Spot, Milestone's}, Fast Food {McDonald's, Subway}; Coffee Shop: {Starbucks, Blenz, Second Cup}; Tourism: {Bridge, Skyride, Aquarium})
to generate the processed datasets is presented in figure 6.3.
6.2 Evaluating Quality
In order to analyze the results of our experiments, we need to define appropriate metrics
for evaluating the quality of a recommendation or set of recommendations. In this section
we briefly describe the two scoring metrics used by this thesis to evaluate the quality of
results: binary scoring, and weighted scoring. Before stating the definitions of these two
scoring metrics, we need a preliminary definition for the set of matching POIs.
Definition 6.2.1. The matching POIs of a test trajectory fragment t = (q, n) is defined as
matchingPOIs(q, n) = {p ∈ recommendations(q) | p ≥ n}. That is, matchingPOIs is the
set of all POIs in the recommendations for q that contain the test fragment’s next point n.
Definition 6.2.2. The binary score of a test trajectory fragment t = (q, n) is 1 if any of
the recommendations for q contain the test fragment’s next point n, and 0 otherwise. This
can be concisely expressed as:
binaryScore(q, n) = { 1 if matchingPOIs(q, n) ≠ ∅; 0 otherwise }    (6.1)
The binary score is useful for informing us about how many test points we were able
to make a meaningful recommendation for, but it is a very poor measure of the quality of
recommendations. If we imagine a scenario where every concept in the concept hierarchy
had a common ancestor c, then all our recommendation system would have to do in order
to have a binary score of 1 for every test is return a concept point associated with c as one
of its recommendations. Although we want the sum of binary scores over all test trajectory
fragments to be high, this cannot be the metric that we try to optimize because it can
be optimized by always returning the most general recommendations. What we need is to
devise a measure that incorporates a trade-off between maximizing the number of queries
that can be satisfied and minimizing the uncertainty of each recommendation. For our
purposes, the uncertainty of a recommendation will be equal to the number of distinct POIs
that are contained in the recommendation.
Recall from definition 3.2.1 that P denotes the database of all points of interest. Let us
define a function pois(p) that given a point p returns the set of all points of interest that
could be represented by p.
pois(p) = { {p.poi} if p is a trajectory point; {poi ∈ P | p ≥ poi} if p is a generalized point }    (6.2)
Definition 6.2.3. The weighted score of a test trajectory fragment t = (q, n) is the sum of
the probabilities that n is represented by each of the recommendations for q. The probability
that n is represented by a recommendation p is equal to 1/|pois(p)| if p ≥ n and 0 otherwise.
This can be written concisely as:
weightedScore(q, n) = Σ_{p ∈ matchingPOIs(q, n)} 1 / |pois(p)|    (6.3)
The weighted score measure avoids the problems of the binary score measure. In addition
to maximizing the number of tests for which a valid recommendation is returned, optimizing
the weighted score will minimize the uncertainty of the recommendations returned.
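The two metrics can be stated directly in code (Python sketch; `pois_fn` stands in for the pois(·) function of equation 6.2 and is supplied by the caller):

```python
def binary_score(matching_pois):
    """Equation 6.1: 1 if any matching recommendation exists, else 0."""
    return 1 if matching_pois else 0

def weighted_score(matching_pois, pois_fn):
    """Equation 6.3: sum over matching recommendations p of
    1 / |pois(p)|, where pois_fn(p) returns the set of concrete POIs
    that p could represent."""
    return sum(1.0 / len(pois_fn(p)) for p in matching_pois)
```

A specific trajectory point contributes 1 to the weighted score, while a generalized point covering m POIs contributes only 1/m, which is exactly the uncertainty penalty described above.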
6.3 Experimentation
6.3.1 Implementation
All experiments were implemented in C++. Nearly all of the code was written specifically
for this thesis, with the notable exception of the core algorithm for k-truncated suffix tree
construction, which was adapted and extended from an open source implementation written
by the authors of [24]. In particular, the algorithm was adapted to work on objects other
than character strings, and its memory management concepts were redesigned.
All experimentation was performed on an Intel Core 2 Quad, with 4 GB of RAM.
Although the CPU has four cores, to minimize the risk of processes affecting each other,
such as through cache contention, only one core at a time was utilized for experimentation.
6.3.2 Design
The experimentation on each dataset was performed using the technique of k-fold cross
validation, with k = 10. k-fold cross validation works by dividing the dataset into k mutually
exclusive subsets of roughly equal size, known as folds. Suffix tree construction and testing
are then performed k times. On each iteration, 1 fold is selected as the test set, and the
other k−1 folds are used as the training set that we construct the (l-truncated) generalized
suffix tree on. For more background on this process, please see [11].
In this section, all averages, such as for query time, are taken as the average value over
all k folds. On the other hand, all sums, such as the total score for a dataset given certain
parameter values, are taken as the sum of the values over all k folds. To eliminate a possible
bias, the contents of the folds for each run of each experiment were randomly selected. As a
final note, for these experiments we fold on the users in a dataset, rather than on individual
trajectories.
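Folding on users rather than trajectories can be sketched as follows (Python; the (user_id, trajectory) pair representation is an assumption for illustration):

```python
import random

def user_folds(trajectories, k, rng=None):
    """Partition trajectories into k folds by user, so that all of a
    user's trajectories land in the same fold.  `trajectories` is a
    list of (user_id, trajectory) pairs."""
    rng = rng or random.Random()
    users = sorted({u for u, _ in trajectories})
    rng.shuffle(users)
    fold_of = {u: i % k for i, u in enumerate(users)}
    folds = [[] for _ in range(k)]
    for u, traj in trajectories:
        folds[fold_of[u]].append(traj)
    return folds
```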
For a given set of test trajectories test and a query length l, every trajectory fragment
f = (q : n) in test of length l + 1 is used as a test trajectory. The score (similarly, binary
score, average query time, etc.) for a dataset is defined to be the sum (or average, where
applicable) of the value for the metric in question over all test trajectories in each fold.
Except where otherwise indicated, the following parameters were used to configure the
recommendation algorithm: queryLength = 3, numRecommendations = 3, gf = 0.2,
cf = 0.01, and kernelWidth = 1. Specifically for experiments on the INFATI dataset,
values of sf = 1 and cellEdgeLength = 100 are used. For experiments on the Trucks
dataset, default values of sf = 500 and cellEdgeLength = 0.01 are used. There
is nothing intrinsically special about these values, and they were chosen merely for their
observed effectiveness in leading to accurate recommendations. For experiments on the
INFATI dataset testing the effect of order-flexibility, a maximum time difference of 2 seconds
is used. Similarly, for experiments on the Trucks dataset testing the effect of order-flexibility,
a maximum time difference of 60 seconds is used. Unless otherwise indicated, order-flexible
matching is disabled. When swapping is allowed we always set swapError = 1, except for
when the fuzzy matching radius is set to 0, in which case we have set swapError to 0 in
order to illuminate the effects of allowing order-flexibility when only exact matches between
trajectory points are allowed. Finally, l-truncated suffix trees are used by default, where l
is set to be equal to the query length parameter being used.
In many of the graphs on the following pages, data series are identified by the dataset
the series was generated from. A number of experiments were run twice, once disallowing
order-flexible matching, and once allowing order-flexible matching. For these experiments,
the data series generated when allowing order-flexible matching is distinguished by appending “(o)”
to the name of the dataset. For example, the data series name “INFATI-500” implies
that order-flexible matching was disallowed for that experimental run, whereas the data
series name “INFATI-500 (o)” implies that order-flexible matching was allowed for that
experimental run.
6.3.3 Basic Results
This section contains the results of some baseline experiments using the default configura-
tions for each dataset presented above. The majority of the experimental results described
in subsequent sections are comparisons of variations on the default configuration with the
results described in this section.
This section addresses the following questions:
• How does the fuzzy matching radius affect the weighted and binary scores of a dataset?
How does it affect the time required to execute a query?
• How does the fuzzy matching radius affect the number of queries for which we are
unable to make any recommendation, good or not?
• How does allowing order-flexible matching affect the weighted and binary scores of a
dataset? How does it affect the time required to execute a query?
We begin by looking at how varying the fuzzy matching radius affects the weighted and
binary scores of a dataset, both when allowing and when disallowing order-flexible matching.
Figures 6.4 and 6.8 contain these results for the INFATI dataset, and figures 6.5 and 6.9
contain these results for the Trucks dataset.
In figure 6.4 we see that for the INFATI dataset, the weighted score increases rapidly
with increasing fuzzy matching radius until the fuzzy matching radius reaches approximately
25, after which the weighted score slowly declines with increasing fuzzy matching radius.
Similarly, in figure 6.5 we see that a small fuzzy matching radius results in a higher score
than disallowing fuzzy matching, but that continuing to increase the fuzzy matching radius
results in a quickly diminishing weighted score. This behaviour, in which there is a hump
in the weighted score curve, can be explained as follows. As the fuzzy matching radius is
increased from zero, at first we match only historical trajectory fragments that are highly
similar to the query trajectory, and are in fact only small variations. However, when the
fuzzy matching radius reaches a certain threshold we begin matching against very dissimilar
trajectories in the historical trajectory database, and it is at this point that the slope of the
weighted score curve turns downwards. This demonstrates that the fuzzy matching radius
used in a trajectory-based recommendation system needs to be carefully chosen in order to
maximize the benefits of allowing fuzzy matching.
Another observation that can be derived from figures 6.4 and 6.5 is the effect of
order-flexible matching on scores. In both of these charts we can see that when the fuzzy
matching radius is 0, allowing order-flexible matching results in increased weighted scores,
whereas when the fuzzy matching radius is greater than 0, the weighted scores are lower
than when disallowing order-flexible matching. Just as with setting the fuzzy matching
radius too large, allowing order-flexible matching can lead to a query matching fragments
from highly dissimilar historical trajectories. This problem could be partially alleviated by
using a very large swap error parameter.
Figures 6.6 and 6.7 demonstrate that for both the INFATI and Trucks datasets, in-
creasing the fuzzy search radius monotonically decreases the number of queries for which
we are unable to make any recommendations. However, this says nothing about
the quality of the returned results. An interesting observation in these two charts is that
allowing order-flexible matching results in a significant decrease in the number of unsatisfi-
able queries, even when we are not allowing for fuzzy matching. This observation explains
why allowing order-flexible matching helps when the fuzzy matching radius is set to 0 (recall
from section 6.3.2 that when the fuzzy matching radius is set to 0 we set swapError = 0
as well).
Looking at the figures 6.8 and 6.9 we see that results similar to those for the weighted
score measure can be observed when looking at the binary scores for our datasets. For both
INFATI and Trucks datasets, we see a large increase in the binary score as the fuzzy matching
radius is increased to a certain level. However, once this threshold is exceeded, increasing
the fuzzy matching radius decreases the binary score as the quality of recommendations
becomes worse.
The final basic results concern how varying the fuzzy matching radius and allowing
order-flexibility affect the time required per query. These results are presented
in figures 6.10 and 6.11 for the INFATI and Trucks datasets, respectively. In these figures we
can observe that the growth in the time per query is nearly linear with the fuzzy search radius
for the Trucks dataset. We can see this relationship as well for the INFATI dataset, but
it is interesting to note that for small increases in the fuzzy search radius the query time
Figure 6.4: INFATI Datasets: Weighted Scores vs. Fuzzy matching radius (series: INFATI-250/500/750, each with and without order-flexible matching)
does not necessarily increase. This demonstrates that the time per query is not directly
dependent on the search radius, but only on the number of points contained within the
search radius. Regarding allowing order-flexible matching, we can clearly see in these figures
that allowing order-flexible matching significantly increases the time required to execute a
recommendation query.
Figure 6.5: Trucks Datasets: Weighted Scores vs. Fuzzy matching radius (series: Trucks-50/100, each with and without order-flexible matching)
Figure 6.6: INFATI Datasets: Unsatisfiable Queries vs. Fuzzy matching radius (series: INFATI-250/500/750, each with and without order-flexible matching)
Figure 6.7: Trucks Datasets: Unsatisfiable Queries vs. Fuzzy matching radius (series: Trucks-50/100, each with and without order-flexible matching)
Figure 6.8: INFATI Datasets: Binary Scores vs. Fuzzy matching radius (series: INFATI-250/500/750, each with and without order-flexible matching)
Figure 6.9: Trucks Datasets: Binary Scores vs. Fuzzy matching radius (series: Trucks-50/100, each with and without order-flexible matching)
Figure 6.10: INFATI Datasets: Query Time (ms) vs. Fuzzy matching radius (series: INFATI-250/500/750, each with and without order-flexible matching)
Figure 6.11: Trucks Datasets: Query Time (ms) vs. Fuzzy matching radius (series: Trucks-50/100, each with and without order-flexible matching)
6.3.4 Query Length
All of the basic results in section 6.3.3 were generated using a query length of 3, and this
section addresses the question of how the scores and query times are affected by varying the
query length. For brevity, results are presented only for the INFATI dataset. Results on how
varying the query length affects the memory usage and construction time for k-truncated
generalized suffix trees are presented later, in section 6.3.8.
The first results of this section concern the effect of varying the query length k on the
weighted and binary scores of a dataset. In figure 6.12 we see that, except in the case
where the fuzzy matching radius is 0, the weighted scores are significantly greater for query
lengths of 2 and 3 than they are for a query length of 1. The same result can be observed
for the binary score measure in figure 6.13.
Recalling from Chapter 2 that existing mobile recommendation systems do not incorporate
a user's trajectory history, these results demonstrate that the methods of this thesis
are an improvement over existing methods.
The observant reader will notice that the scores for a query length of 5 are lower than
those for a query length of 1. This is not because the recommendations for long queries
are lower quality than those for short queries, but is due to insufficient historical data. In
figure 6.15 we can observe that when our query length is 5 there are approximately 25,000
queries for which we were unable to make any recommendation. This problem could be
avoided by making recommendations based on the longest suffix of the query trajectory that
can be matched in the historical trajectory database, but this idea is not pursued further
in this thesis.
The final observation in this section is derived from figure 6.14, which compares the
time required to execute a query with the fuzzy matching radius for a variety of different
query lengths. In this figure it is clear that increasing the query length results in shorter
query execution times. This is because increasing the query length decreases the number of
matching trajectory points that we need to consider recommending.
Figure 6.12: INFATI-500: Effect of Query Length on Weighted Score (series: k = 1, 2, 3, 5; x: fuzzy search radius)
Figure 6.13: INFATI-500: Effect of Query Length on Binary Score (series: k = 1, 2, 3, 5; x: fuzzy search radius)
Figure 6.14: INFATI-500: Effect of Query Length on Query Time (ms) (series: k = 1, 2, 3, 5; x: fuzzy search radius)
Figure 6.15: INFATI-500: Effect of Query Length on Unsatisfiable Queries (series: k = 1, 2, 3, 5; x: fuzzy search radius)
6.3.5 Number of Recommendations
All of the baseline results presented in section 6.3.3 were generated after configuring the
recommendation system to return the best 3 recommendations, and it is natural to wonder
how the number of returned recommendations affects the scores of the results for all queries.
Altering the number of recommendations returned does not affect the time required per
query because in order to select the top k recommendations the system as designed needs to
compute the confidences of all possible recommendations. Furthermore, varying k does not
affect the number of unsatisfiable queries because as long as there is at least one possible
recommendation to be made for a query, we say that the query is satisfiable. Thus, all that
we need to look at is the effect of the number of recommendations on the weighted score and
binary score of a dataset.
In figure 6.16 we can see that more recommendations generally leads to a greater weighted
score, although not always. That returning two recommendations performs slightly worse
than returning one recommendation can be explained by noting that the folds were randomly
generated for each experimental run. The differences between these weighted scores are so
small that we believe that this anomaly can be ignored. Looking only at the weighted scores
gives us only weak evidence that increasing the number of recommendations improves the
quality of results of the recommendation system.
Looking at figure 6.17 we see that increasing the number of recommendations greatly
increases the binary score, although each additional recommendation yields a diminishing
benefit. Combining this result with our previous observation
that increasing the number of recommendations returned has only a small effect on the
weighted score, we can draw a number of conclusions. The first is that the best recommen-
dation returned tends to be a specific trajectory point, and the second is that subsequent
recommendations tend to be generalized points. Together with the idea that it is easier for
a user to understand and make decisions based on a small number of recommended points
than a large number, this leads us to recommend that in a real world system between 2 and
4 recommendations be returned to the user. Furthermore, these conclusions lend credibility
to our decision to return 3 recommendations for the baseline results.
Figure 6.16: INFATI-500: Effect of the Number of Recommendations on Weighted Score (series: numRecs = 1, 2, 3, 5; x: fuzzy search radius)
Figure 6.17: INFATI-500: Effect of the Number of Recommendations on Binary Score (series: numRecs = 1, 2, 3, 5; x: fuzzy search radius)
Figure 6.18: INFATI-500: Effect of Diversification on Weighted Score (series: Not Diversified, Diversified; x: fuzzy search radius)
6.3.6 Diversification
One of the desired requirements for a useful trajectory-based POI recommendation system
described in section 3.3 was the requirement of diversification. Later in the thesis, in section
4.5, a simple greedy algorithm for diversifying a result set was presented. This section
presents results demonstrating that performing diversification does not significantly diminish
the quality of returned results and can in fact increase the scores for a dataset. Note that
for all experiments, the top 3 recommendations are returned.
In both figures 6.18 and 6.19 we can see that enabling diversification boosted both the
weighted and binary scores for the INFATI-500 dataset. This result may appear surprising
at first, but it can be taken as evidence that the results returned by the recommendation
system when diversification is disabled are too similar to each other. Diversification is pro-
viding exactly the benefit that we desired it to provide, which is to increase the probability
that one of the top-k trajectories is similar to the next point visited by a user of the system.
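The flavour of the greedy diversification step can be sketched as follows. This is an illustrative sketch, not the exact algorithm of section 4.5; `candidates`, `dist`, and the trade-off weight `lam` are hypothetical names, with each pick balancing a candidate's confidence against its distance from the results already selected:

```python
def diversify(candidates, dist, k=3, lam=0.5):
    """Greedily pick k results, trading confidence against novelty.

    candidates: list of (point, confidence) pairs, any order.
    dist: distance function between two points.
    lam: weight of the diversity term (0 = pure confidence).
    """
    # Start with the single highest-confidence candidate.
    remaining = sorted(candidates, key=lambda c: c[1], reverse=True)
    selected = [remaining.pop(0)]
    while remaining and len(selected) < k:
        # Score each remaining candidate by its confidence plus its
        # distance to the nearest already-selected point.
        def score(c):
            novelty = min(dist(c[0], s[0]) for s in selected)
            return c[1] + lam * novelty
        best = max(remaining, key=score)
        remaining.remove(best)
        selected.append(best)
    return [p for p, _ in selected]
```

With a Euclidean `dist`, a near-duplicate of the top result is demoted in favour of a more distant, lower-confidence point, which is exactly the behaviour the experiments above reward.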
Figure 6.19: INFATI-500: Effect of Diversification on Binary Score
(binary score vs. fuzzy search radius; diversified and not diversified curves)
Figure 6.20: INFATI-500: Effect of Spatial Factor on Weighted Score
(weighted score vs. fuzzy search radius; curves for sf = 0.5, 1, 2, and 4)
6.3.7 Other Variations
The framework developed in this thesis has a large number of parameters, and results
demonstrating the effects of the most important parameters were presented in the previous
sections of this chapter. To present results on every parameter would require too much space
and provide little additional insight, and so in this section we present results demonstrating
the effects of what we believe to be the next two most important parameters. These are the
spatial factor sf , and the kernel width kw.
The effects of varying the spatial factor sf on the weighted score of the INFATI-500
dataset are presented in figure 6.20. The first observation drawn from this figure is that
using a very small spatial factor, such as sf = 0.5, results in the optimal fuzzy matching
radius being much smaller than when a larger spatial factor is used. The curves for larger
spatial factors are stretched out. This is unsurprising given our earlier results, but what is
more interesting is that the maximum weighted score depends on the chosen spatial factor.
Furthermore, there is no clear relationship between the maximum weighted score attained
and the chosen spatial factor.
On the other hand, we can observe in figure 6.21 that there is a clear relationship between
the chosen spatial factor and the time required to execute a query. As the spatial factor
increases, the time required to execute a query decreases. This result is expected given that
increasing the spatial factor will decrease the number of trajectories that do not exactly
match the query trajectory but that lie within the fuzzy matching radius.
The other parameter varied in this section is the kernel width kw. As mentioned in
Section 4.2.2, the accuracy of kernel estimation generally depends more on the chosen kernel
width kw than on the particular kernel function chosen. Consequently, we explore the effect
of varying kw here. In figure 6.22 we can see that the selected kernel width has a clear
effect on the weighted score for the INFATI-500 dataset. On the other hand, the effect is
not monotonic, and the optimal kernel width appears to be 2, rather than 1 or 4. In any
real world implementation of the recommendation system the kernel width would need to
be carefully tuned to return optimal results.
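The sensitivity to the kernel width can be illustrated with a generic Gaussian kernel, the kernel family used in this thesis; the exact estimator of section 4.2.2 may normalize differently, so this is only a sketch:

```python
import math

def kernel_weight(d, kw):
    """Gaussian kernel weight of a point at distance d for width kw.

    A generic sketch of Gaussian kernel weighting; a small kw
    concentrates all weight on exact matches, while a larger kw lets
    nearby points contribute, smoothing the density estimate.
    """
    return math.exp(-(d * d) / (2.0 * kw * kw))

def density(query, points, kw):
    """Unnormalized kernel density estimate at `query` from a 1-D sample."""
    return sum(kernel_weight(abs(query - p), kw) for p in points)
```

Doubling kw widens the neighbourhood that influences the estimate: a point at distance 2 receives weight exp(-2) ≈ 0.135 when kw = 1 but exp(-0.5) ≈ 0.607 when kw = 2, which is why the curves in figure 6.22 shift as kw varies.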
Figure 6.21: INFATI-500: Effect of Spatial Factor on Query Time
(time per query in ms vs. fuzzy search radius; curves for sf = 0.5, 1, 2, and 4)
Figure 6.22: INFATI-500: Effect of Kernel Width on Weighted Score
(weighted score vs. fuzzy search radius; curves for kw = 1, 2, and 4)
6.3.8 Effects of k-Truncated Suffix Trees
Until this point, all of our experimentation has been performed using k-truncated generalized
suffix trees, with k set to match the current queryLength parameter. In this section we
compare the use of k-truncated generalized suffix trees with the use of ordinary generalized
suffix trees.
In figure 6.23 we compare the effects of varying k on the time required to construct
a k-truncated generalized suffix tree. Furthermore, we compare this with the time
required to construct an ordinary generalized suffix tree, which is independent of the
length of queries allowed. In this figure we can clearly see that for small values of k,
the construction of k-truncated suffix trees is much more efficient than the construction of
ordinary suffix trees, and that as k increases, the time required to construct a k-truncated
suffix tree converges towards the time required to construct an ordinary suffix tree. Although
the tree construction times shown in this figure are small, for a large real-world system the
time benefits of using a k-truncated suffix tree could be considerable.
In figure 6.24 we show the effects of varying k on the memory required to store a k-
truncated generalized suffix tree, and compare this with the memory required to store a
plain generalized suffix tree. Just as k-truncated suffix trees required much less time to
construct than plain suffix trees for small values of k, they also use much less memory
than plain suffix trees for small values of k. Furthermore, as k
increases, the memory used by a k-truncated suffix tree converges towards the memory used
by a plain suffix tree. In the paper [24] introducing k-truncated generalized suffix trees it is
noted that by using an optimization known as “multi-int leaves” it may be possible to reduce
the memory usage of the k-truncated suffix tree. This optimization works by allowing leaves
to represent locations in multiple strings. Thus, it may be possible to improve the memory
advantage of using k-truncated suffix trees over plain suffix trees, but this is not explored
in this thesis.
Finally, in figure 6.25 we compare the time required to execute queries using truncated
and non-truncated suffix trees. Interestingly, for query lengths of 3 and 5 there is a slight
time advantage to using truncated suffix trees, but when the maximum query length is 2,
queries are executed slightly more quickly on a plain suffix tree. The precise reason for this
is unclear, but we postulate that it is because the internal leaves of a k-truncated
suffix tree are stored in a linked list, and when k = 2 these lists of internal leaves may
Figure 6.23: Effects of Query Length on Suffix Tree Construction Time
(construction time in ms vs. query length k; truncated and non-truncated curves for the
INFATI and Trucks datasets)
grow very long. Regardless, the effect of truncating the suffix tree on the time required
to execute queries is minimal in every case. In [24], the authors find that queries are executed
significantly more quickly on k-truncated suffix trees than on plain suffix trees. This contrasts
with our observations, and the key difference between their experiments and ours
is that the strings (trajectories) stored in our suffix trees are generally fairly short, whereas
the strings (DNA sequences) stored in their suffix trees tend to be thousands of characters
long.
Figure 6.24: Effects of Truncation on Suffix Tree Memory Usage
(memory usage in KB vs. query length k; truncated and non-truncated curves for the
INFATI and Trucks datasets)
Figure 6.25: INFATI-500: Effects of Truncation on Query Times
(time per query in ms vs. fuzzy search radius; truncated and non-truncated (NT) curves
for k = 2, 3, and 5)
6.4 Summary
This chapter began by describing the two real-world datasets used in this thesis, the INFATI
dataset and the Trucks dataset. We then described how we processed these two datasets to
generate five derived datasets that are suitable for use by the trajectory-based recommen-
dation system described in this thesis. Furthermore, a method for evaluating the quality
of returned results was presented. Following these preliminary sections, a large number of
experiments were run and analyzed. The most important observations are:
• (Section 6.3.4) Using a query length greater than 1 leads to increased recommendation
quality. This means that using the framework described in this thesis can produce
higher quality recommendations than existing mobile POI recommendation systems.
• (Section 6.3.3) Allowing fuzzy matching can increase the weighted score of a dataset,
but if the fuzzy matching radius used is too large then the query trajectory will match
too many other trajectories, leading to a decrease in recommendation quality.
• (Section 6.3.3) Allowing order-flexible matching has a significant benefit when fuzzy
matching is disallowed, but when fuzzy matching is allowed the effects are negligible.
• (Section 6.3.3) Queries can be answered quickly, in the order of milliseconds, even
when the fuzzy matching radius is large. However, the query execution time grows
with both the fuzzy matching radius and the size of the historical trajectory database.
• (Section 6.3.5) Returning a small number of recommendations, such as 3, is the optimal
trade-off between ensuring that at least one recommendation is relevant to the user
and not overwhelming the user with recommendations.
• (Section 6.3.6) Using the diversification algorithm presented in section 4.5 can result
in improved weighted and binary scores.
• (Section 6.3.7) Varying the parameters of the recommendation system affects the qual-
ity of the results returned, and the parameters need to be tuned for optimal performance.
However, the system’s performance is relatively stable with respect to these parameters.
• (Section 6.3.8) For small values of k, using k-truncated generalized suffix trees instead
of plain generalized suffix trees reduces the time required for suffix tree construction as
well as the memory required to store the suffix tree.
Chapter 7
Conclusion
Location-aware mobile devices are quickly becoming ubiquitous, and this provides an oppor-
tunity for mobile recommendation systems. Existing research into mobile recommendation
systems has focused on recommending points of interest to a user based on the user’s current
location, ignoring the user’s recent trajectory history. Any method for improving the
quality of the returned recommendations stands to greatly increase the usefulness of a
mobile recommendation system.
The most significant contribution of this thesis is the introduction and formalization of
the trajectory-based POI recommendation problem along with a set of desired requirements
for a useful trajectory-based POI recommendation system and efficient solutions to three
variants of the problem. Beginning with a naive approach to the exact matching problem,
we proceeded to construct a trajectory-based recommendation system framework capable
of recommending concept points and localized points in addition to individual points of in-
terest. Following this construction, the recommendation framework was extended to allow
for fuzzy matching and order-flexible queries to be executed in an efficient manner. For
each of these variants we provided the necessary details for their efficient implementation.
Finally, we demonstrated that the trajectory-based POI recommendation framework de-
veloped in this thesis is both efficient and effective on a group of datasets constructed by
processing two real world datasets. The recommendation system framework constructed in
this thesis is efficient, scalable, highly configurable and capable of generating higher quality
recommendations than a recommendation system that ignores trajectory histories.
CHAPTER 7. CONCLUSION 84
7.1 Future Directions
Many research directions were considered, but not pursued as the main research thrust
of this thesis changed directions during its development. Many of these directions may
nonetheless lead to interesting extensions of the research presented in this thesis, and these
are summarized below.
7.1.1 Personalization
One of the requirements for a useful trajectory-based POI recommendation system presented
in section 3.3 was the requirement of personalization. This requirement expresses the desire
for the recommendations returned for a given query trajectory to be personalized for the
user submitting the query. We foresee two distinct approaches for accomplishing this.
The first approach is to perform collaborative filtering on the set of potential recommen-
dations. This approach implements personalization as a post-processing step where the set
of recommendations returned to the user could be selected based on the points of interest
visited by similar users. Although this approach might work, a more interesting approach
could be developed by integrating personalization more deeply into the recommendation
process.
The second approach is to construct three sets of recommendations. The first set of
recommendations is based on personal history. These recommendations are generated by
performing the recommendation process only on the trajectories in the historical trajectory
database that were generated by the current user. The second set of recommendations
is based on user group history. These recommendations are generated by performing the
recommendation process only on those trajectories in the historical trajectory database that
were generated by users in the same user group as the current user. This user group could
be determined according to user attributes such as occupation or age. The final set of
recommendations is based on the histories of all users, and is the set of recommendations
that we have been computing in this thesis. Notice that all of these sets of recommendations
are generated by the same recommendation process; they are simply based on different
subsets of the historical trajectory database. These three sets of recommendations could be
mixed to determine the final set of recommendations to be returned to the user.
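A minimal sketch of this mixing step might look as follows, assuming each of the three recommendation sets maps a point to a confidence score; the mixing weights here are hypothetical and would need tuning:

```python
def mix_recommendations(personal, group, everyone,
                        weights=(0.5, 0.3, 0.2), k=3):
    """Blend three confidence-scored recommendation sets into one top-k.

    Each input maps point -> confidence; `weights` sets how strongly
    personal, group, and all-user history count (illustrative values).
    """
    combined = {}
    for recs, w in zip((personal, group, everyone), weights):
        for point, conf in recs.items():
            # A point recommended by several subsets accumulates score.
            combined[point] = combined.get(point, 0.0) + w * conf
    ranked = sorted(combined.items(), key=lambda kv: kv[1], reverse=True)
    return [point for point, _ in ranked[:k]]
```

A point that appears in both the personal and group sets naturally outranks one supported by global history alone, which is the intended effect of the mixing.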
7.1.2 Parallelizing Matching
One of our requirements for a useful trajectory-based recommendation system presented
in section 3.3 was that the system should be highly scalable, and one approach to achieve
this would be to parallelize the matching process. Not only is it possible to parallelize the
suffix tree search methods across multiple machines, they can easily be parallelized across
multiple threads on the same physical machine. As most computers now have multiple CPU
cores, this means that we will be able to more fully utilize the processing power available
to answer trajectory-based top-k recommendation queries.
To parallelize these methods, we can split our set of historical trajectories S into m
disjoint sets, each containing approximately |S|/m trajectories. Next,
we can build a generalized suffix tree (or k-truncated generalized suffix tree) for each of
these sets of trajectories. This will be less space efficient than building a single suffix tree,
but each of these suffix trees will be independent of the others, and so nothing prevents
us from building all of them in parallel. The total running time to build all of the suffix
trees will then be O(m log |Σ|), exactly the same as before, except that this work can be
split amongst all available processors on the current machine (or across machines for truly
massive data sets). Thus, it is possible to arbitrarily split up the pre-processing of our
historical trajectories.
Generating m suffix trees instead of a single suffix tree improves the time required to
construct the suffix trees because each one can be constructed in parallel, and could also
assist by keeping the memory requirements of each tree small. Doing this would also affect
the time required to execute queries, but the effect on query time should generally be
negligible. This is because a query trajectory q is expected to be very short, and the time
required to match a query trajectory q is proportional only to the length of q and not to
the size of the suffix tree (assuming the tree has a fixed alphabet size). Even if we need to
perform m suffix tree matching operations, the running time should not be largely affected.
Furthermore, each of these matching operations can be done in parallel to build up the
set of next points of q, as they are independent of each other. A potential future research
direction would be to implement a parallel matching algorithm and to test its effect on query
execution times.
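The partition-and-merge pattern can be sketched as below. A plain dictionary of fragments stands in for each partition's (truncated) suffix tree, and a thread pool stands in for multiple cores or machines; only the structure of the parallelization is the point of this sketch:

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor

def build_index(trajectories, query_len):
    """Map each fragment of length query_len to a Counter of next points.

    A dictionary stands in for the per-partition suffix tree here; each
    partition's index is built independently, so building them can also
    be done in parallel.
    """
    index = {}
    for t in trajectories:
        for i in range(len(t) - query_len):
            frag = tuple(t[i:i + query_len])
            index.setdefault(frag, Counter())[t[i + query_len]] += 1
    return index

def parallel_next_points(partitions, query):
    """Query every partition's index in parallel and merge the counts."""
    indexes = [build_index(p, len(query)) for p in partitions]
    with ThreadPoolExecutor() as pool:
        results = pool.map(lambda ix: ix.get(tuple(query), Counter()), indexes)
    merged = Counter()
    for r in results:
        merged += r  # counts from disjoint partitions simply add up
    return merged
```

Because the partitions are disjoint, merging the per-partition counts reproduces exactly the next-point multiset that a single index over all trajectories would give.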
7.1.3 User Feedback
In order to build a trajectory-based POI recommendation system that would be useful for
real human beings we may want to incorporate user feedback into the recommendation
process. This means that it would be possible for users to vote on whether a given rec-
ommendation is relevant or not, and for this vote to affect future recommendations. We
could accomplish this by modifying our confidence measures as described in sections 4.1.1
and 4.2.3 to incorporate relevance feedback. For example, we could multiply the confidence
of each recommendation by the proportion of users who have previously found that recom-
mendation to be useful and relevant. More details on possible mechanisms for incorporating
relevance feedback into recommendation systems can be found in [19].
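A minimal sketch of such a feedback adjustment follows; the smoothing prior is a hypothetical addition so that sparsely voted recommendations are not zeroed out, as the thesis does not prescribe a particular formula:

```python
def feedback_adjusted(confidence, up_votes, total_votes, prior=1.0):
    """Scale a recommendation's confidence by its observed usefulness.

    A smoothed proportion, (up + prior) / (total + 2 * prior), avoids
    discarding recommendations that have received few votes; `prior`
    is an illustrative smoothing constant, not from the thesis.
    """
    usefulness = (up_votes + prior) / (total_votes + 2.0 * prior)
    return confidence * usefulness
```

An unvoted recommendation keeps half of its original confidence rather than none, while a heavily up-voted one approaches full confidence.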
7.1.4 Temporal Constraints
Many points of interest are only likely to be visited at particular times of day. For example,
a breakfast cafe may be visited only in the morning, and a movie theatre may be visited only
in the evening. A breakfast cafe recommendation in the evening is not useful for a user of the
system, and undesirable from the perspective of an advertiser who may have to pay for an
irrelevant recommendation. The trajectory-based POI recommendation system developed
in this thesis is not time-aware and has no means of incorporating temporal constraints,
such as that a breakfast cafe is relevant only in the morning, into the recommendation
process. One potential means of addressing this limitation would be to add an attribute
corresponding to the time of day, such as “morning” or “evening”. This attribute would
be part of the key for a trajectory point, in addition to the point of interest corresponding
to the trajectory point. When searching for a query trajectory in the historical trajectory
database we would need to match both the time of day and the point of interest visited
in order to proceed down a branch of the generalized suffix tree. An interesting research
direction would be to investigate if this is an effective means of incorporating temporal
constraints and whether there are any other means of incorporating temporal constraints
that are more effective.
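The proposed key extension can be sketched as follows; the bucket boundaries are illustrative only:

```python
def time_bucket(hour):
    """Coarse time-of-day attribute; the cut-offs are illustrative."""
    if 5 <= hour < 12:
        return "morning"
    if 12 <= hour < 18:
        return "afternoon"
    return "evening"

def trajectory_key(poi, hour):
    """Key for a trajectory point: the POI plus a time-of-day bucket.

    Matching a query against the historical database would then
    require both components of the key to agree before descending
    a branch of the generalized suffix tree.
    """
    return (poi, time_bucket(hour))
```

Under this key, a cafe visit at 8:00 and a cafe visit at 20:00 are distinct symbols, so a morning query trajectory can only match morning history.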
7.1.5 Continuous Matching
In section 2.3 we discussed research by Frentzos et al. [5] into nearest-neighbour searches on
moving object databases. Whereas our methods are intimately connected with the number
of points lying on a trajectory, the methods described by Frentzos et al. are intimately
connected with the notion of time. Their methods are able to compute the similarity
between two trajectories over a specific period of time, regardless of the number of points in
each of the two trajectories, and thus can compute the similarity between two trajectories
in a continuous manner. An interesting future research direction would be to develop a
trajectory-based POI recommendation system based on the continuous matching methods
described by Frentzos et al. that satisfies the requirements described in section 3.3.
7.1.6 Longer Tails
Given a trajectory t of length n, and a trajectory fragment f = ti..j , let the tail of f be
defined to be tj..n. That is, the tail of f is the suffix of t starting from the last point in f .
The m-tail of f is defined to be tj..min(j+m,n), the first m points of the tail of f .
Throughout this thesis we have taken a query trajectory q and matched it against the his-
torical trajectory database in order to find all historical trajectory fragments F that match
q (perhaps allowing fuzzy matches or order-flexible matches). Then, for each historical tra-
jectory fragment t ∈ F we consider the next point of t as a potential recommendation for the
query q. In other words, recommendations are based on the 1-tail of each matching histor-
ical trajectory fragment. One future research direction would be to incorporate the m-tail,
rather than just the 1-tail, of each matching historical trajectory fragment into the
recommendation process.
To see why this would be useful, suppose that following a query trajectory q, some people
visit a museum x, but many people visit x following a visit to a cafe y. By only considering
the next points of the historical trajectory fragments matching q, the confidence of recom-
mending x may be low, but by considering the 2-tails of the historical trajectory fragments
matching q the confidence of recommending x could be significantly higher.
Incorporating longer tails into the recommendation process would be a simple extension
of the methods described in this thesis. After matching a query trajectory q in the suffix
tree representing the historical trajectory database, we can easily walk the subtree below
the match in the suffix tree to determine the m-tails of each matching historical trajectory
fragment. The contribution of points in the m-tail should be weighted according to their
position in the tail. For example, the next point of the query trajectory (the first point in
the m-tail) should be given full weight, whereas later points should be given less weight. It
would be a very interesting future research direction to explore the effects of incorporating
longer tails into the recommendation process and to observe if it can significantly improve
recommendation quality.
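A sketch of this extension follows; here the m-tail is taken as the m points following the matched fragment (the thesis' indexing also includes the fragment's last point), and `decay` is a hypothetical positional weighting:

```python
def m_tail(trajectory, j, m):
    """The m points following a fragment whose last matched point has
    0-based index j; shorter near the end of the trajectory."""
    return trajectory[j + 1 : j + 1 + m]

def tail_scores(matches, m, decay=0.5):
    """Accumulate recommendation scores from the m-tail of each match.

    `matches` is a list of (trajectory, end_index) pairs. The i-th
    tail point contributes decay**i, so the immediate next point gets
    full weight and later points count for progressively less.
    """
    scores = {}
    for trajectory, j in matches:
        for i, point in enumerate(m_tail(trajectory, j, m)):
            scores[point] = scores.get(point, 0.0) + decay ** i
    return scores
```

In the museum example above, a museum that is rarely the immediate next point but frequently appears two steps later (after the cafe) accumulates score from the second tail position of many matches.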
7.1.7 Other Directions
This section briefly describes a number of other potential research directions that are too
small to warrant their own section.
• Investigate using alternative distance measures and kernel functions. In this thesis we
use Gaussian kernel estimation for performing density estimation, and use the distance
metric defined in section 4.4.3 to compute point distances. It would be interesting to
investigate if other distance measures and kernel functions could be used to improve
the quality of recommendations.
• Investigate the possibility of allowing concept points to be present in the historical
trajectory database and query trajectory in addition to the trajectory points that we
currently permit. This could be utilized to eliminate user-specific locations such as a
user’s home from the historical trajectory database and query trajectory. Instead of
everybody starting their day at their own home, it could be possible for everybody to
start at the concept point “home”, and this could potentially increase the effectiveness
of the recommendation system.
• Investigate whether weighting the error of fuzzy matches improves the quality of query
results when the query length is large. Intuitively, the most recent point in a query
trajectory is more important than earlier points, and so a potential improvement would
be to weight the distance between a point p in the query trajectory q and a point in a
historical trajectory according to the position of p in q. For example, we could weight
the distance between p and another point by i/k, where i is the index of p in q and
k = |q| is the length of q.
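The position weighting suggested in the last bullet can be sketched as follows, with `dist` a hypothetical point-distance function and the weight of point i taken as i/k so that the most recent query point dominates the error:

```python
def weighted_fuzzy_distance(query, fragment, dist):
    """Position-weighted error between a query and a candidate fragment.

    Point i of the query (1-based) contributes with weight i / |q|,
    so a mismatch on the most recent point costs the most; `dist` is
    a placeholder for the point-distance metric of section 4.4.3.
    """
    k = len(query)
    return sum((i / k) * dist(q, f)
               for i, (q, f) in enumerate(zip(query, fragment), start=1))
```

With the same raw error, a mismatch on the first of three query points contributes only 1/3 as much as the same mismatch on the last point.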
Appendix A
Constructing suffix trees
A.1 Introduction
Definitions of suffix trees, generalized suffix trees, and k-truncated generalized suffix trees
can be found in Chapter 4 in section 4.4.1, and it is recommended that it be read prior
to reading this appendix. This appendix describes an efficient method for their construc-
tion. One useful resource for a description of suffix trees and their construction is Dan
Gusfield’s book, “Algorithms on Strings, Trees, and Sequences” [10]. However, the content
and organization of this appendix are essentially taken from Schultz et al. [24].
Although a linear time method for suffix tree construction was first discovered in 1973
by Peter Weiner [30], the first online method for suffix tree construction was published in
1995 by Esko Ukkonen [28]. “Online” in this context means that characters are added to
the suffix tree in the order in which they are presented, and this means that it is possible
to update the suffix tree with new characters as they are discovered.
The methods described in this appendix all have a published complexity bound of O(m),
where m is the sum of lengths of all input strings. However, it is important to note that this
bound is only valid assuming a fixed alphabet. For the purposes of this thesis, the alphabet
Σ is not fixed, and so the real complexity bound is O(m log |Σ|).
A.2 Ukkonen’s Algorithm for Suffix Trees
Given a string s of length m, Ukkonen’s algorithm processes s from left to right (in the
order that characters are presented) in m phases in order to construct a suffix tree T . In
APPENDIX A. CONSTRUCTING SUFFIX TREES 90
phase i, the substring s1..i and all of its suffixes s2..i, ..., si..i are inserted into T if they are
not already present in T . Furthermore, each phase is divided into extensions, so that the
action of extension j of phase i is to add the substring sj..i into the T if it is not already
present.
Following Gusfield [10], extremely high level pseudo-code for Ukkonen’s algorithm is
presented in algorithm 7.
Input: A string s
Output: A suffix tree T for s
1: T ← tree consisting of a single edge representing s1
2: for i ← 2 to |s| do
       /* Begin phase i */
3:     for j ← 1 to i do
           /* Begin extension j */
4:         Starting from the root, find the end of the path labeled sj..i in the current tree
5:         If needed, extend the path by adding character si
Algorithm 7: Ukkonen’s Algorithm (High Level)
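Algorithm 7 can be mirrored by a naive, runnable construction that inserts every suffix character by character into a trie of dictionaries. Without suffix links and the skip/count trick it is far from linear time, and single-character edges replace the compressed edge labels of a true suffix tree, so this is only a sketch of the phases/extensions structure:

```python
def build_suffix_tree(s):
    """Naive construction mirroring Algorithm 7's phases and extensions.

    In phase i, extension j descends from the root along s[j:i],
    creating missing nodes as it goes (rules 2/3); existing paths are
    simply followed. The result is an uncompressed suffix trie.
    """
    root = {}
    for i in range(1, len(s) + 1):      # phase i: process s[0:i]
        for j in range(i):              # extension j: insert s[j:i]
            node = root
            for ch in s[j:i]:
                node = node.setdefault(ch, {})
    return root

def contains(tree, sub):
    """A substring of s corresponds to a path from the root."""
    node = tree
    for ch in sub:
        if ch not in node:
            return False
        node = node[ch]
    return True
```

Every substring query then reduces to a single root-to-node walk of length |sub|, which is the property the matching methods of this thesis rely on.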
Ukkonen’s algorithm’s handling of each extension j for phase i + 1 can be divided into
two distinct parts. The first part is concerned with inserting the string sj..i+1 into the suffix
tree, and the second is concerned with finding the substring sj+1..i in the suffix tree, and
thus preparing for the next extension.
For the first part, we need to know how to insert sj..i+1 into T given that β = sj..i is
already present in the tree. This is done according to three rules:
• Rule 1: The path β from the root ends at a leaf of T . To update T , simply add si+1
to the end of the leaf’s label.
• Rule 2: The path β from the root does not end at a leaf of T , and no path continuing
from β begins with si+1. In this case, a new leaf must be added to T . Note that if β
ends in the middle of an edge, then that edge must be split.
• Rule 3: The path β from the root does not end at a leaf of T , but there is a path
continuing from β beginning with si+1. In this case, there is nothing to do.
It is useful and interesting to note that the first rule is not strictly required, because any
leaf node created by the second rule will always remain a leaf node, and so when the leaf
node is added we can simply set the label of the edge leading to the new leaf to be the
entire suffix from si+1.
As mentioned above, in addition to performing suffix extensions, the other part of Ukko-
nen’s algorithm is concerned with finding the next suffix sj+1..i+1 to be extended. At a high level
this is very straightforward to understand. However, we need to be careful to avoid con-
structing an O(n^2) algorithm, and to avoid this Ukkonen uses a number of implementational
“tricks”.
The first technique, originally proposed in 1976 by McCreight [20], is the suffix link. A
suffix link is a pointer between two internal nodes, N,M of the suffix tree T , such that if
the path to N is xα, and the path to M is α, then there will be a suffix link from N to M .
The suffix link will be denoted as N.link.
At first glance, it seems that these suffix links will be sufficient to find the location in
the tree where the next extension is to be performed. However, we may need to traverse
upwards from our extension point to find the nearest internal node N , and we can then
follow N.link to another internal node M , but from that node may again need to traverse
down the tree along some path γ to find the next extension point. The technique used to
optimize this is known as the skip and count trick.
The key to the skip and count trick is that we are guaranteed that γ is already present
in the tree. Thus, at M we need only look for the child of M whose first character is the
first character of γ. We then either move to this child of M or to the end of γ, if it ends
in the middle of an edge, and this is repeated until we reach the end of γ. The essential
point here is that we can move from node to node (or node to end of γ) using constant time
operations, and so the time to find the location of the next suffix extension is proportional
only to the number of nodes passed through when traversing γ. Using these techniques, it
can be shown that a suffix tree can be constructed in time linear to the length of the input
string, assuming a fixed alphabet.
An important practical note is that many suffix tree implementations use a linked list
at each node, and this can slow down insertion and lookup because, as the tree grows,
it takes increasingly longer to find the child node corresponding to the next character in the
suffix being inserted. This is an important consideration given that one of
our requirements is that queries be executed in real time.
An algorithm block with the pseudo-code for all of the methods found in this section
and the next section can be found at the end of the appendix.
A.3 Constructing Generalized Suffix Trees
The algorithm described in the previous section works when we want to construct a suffix
tree on a single string. However, there are two small enhancements that need to be made
to the algorithm in order to be able to construct generalized suffix trees on multiple strings.
The first change that needs to be made is to add to every leaf an identifier for its source
string. The second change is the use of Internal Leaves, which are linked lists of pairs (string
id, position in string) that serve to indicate which strings a suffix is present in, as well as
the starting position of the suffix in that string. These two minor changes are sufficient to
make the algorithm described in the previous section able to construct generalized suffix
trees. These changes do not affect the complexity of the algorithm described in the previous
section, and it is possible to construct a generalized suffix tree in time O(m), where m is
the sum of lengths of all input strings. One optimization presented by Schultz et al. [24] is
that it can be more efficient to represent all internal leaves using a single node, as this reduces
the number of pointers required to store the suffix tree and can result in faster execution by
increasing reference locality and reducing the number of cache misses.
As a final point for this section, these changes lead to an additional rule for suffix
extension:
• Rule 4: Whenever inserting a suffix leads to a node in the tree T , create a new
internal leaf recording the ID of the string currently being processed and the starting
position of the current suffix in that string, and add it to the node.
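The two changes can be illustrated with a deliberately naive sketch that inserts every suffix of every string into a plain trie (this is not Ukkonen's algorithm and runs in quadratic time; it only shows what the internal-leaf annotations look like). All names here are illustrative.

```python
# Naive generalized suffix trie: each node keeps an "internal leaf" list
# of (string_id, start_position) pairs for every suffix ending there.

class Node:
    def __init__(self):
        self.children = {}   # character -> child Node
        self.leaves = []     # internal leaf: (string_id, start_position) pairs

def build_generalized_trie(strings):
    root = Node()
    for sid, s in enumerate(strings):
        s = s + "$"                      # terminator character
        for start in range(len(s)):
            node = root
            for c in s[start:]:
                node = node.children.setdefault(c, Node())
            # Rule 4: record which string this suffix came from and where
            node.leaves.append((sid, start))
    return root
```

A suffix shared by several input strings ends at a single node whose leaf list then holds one (string id, position) pair per occurrence, which is exactly the information the generalized suffix tree needs to report.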
A.4 Constructing k-truncated Generalized Suffix Trees
The two obvious methods to construct a k-truncated generalized suffix tree (kTST) for a set
of strings are, first, to delete subtrees of a full generalized suffix tree and, second, to build
a generalized suffix tree by inserting every k-mer of each input string. However, neither of
these methods is particularly good, with the latter approach having a time complexity of
O(km), where m is the sum of the lengths of all input strings. Continuing to follow Schulz,
Bauer, and Robinson [24], it is possible to construct k-truncated generalized suffix trees in
linear time.
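The O(km) baseline mentioned above can be sketched directly: insert the window of at most k characters starting at every position of every string into a trie. This is only meant to show what a kTST contains; the linear-time construction described next avoids the factor-k blow-up. Names are illustrative.

```python
# Naive O(km) kTST construction: insert every window of size <= k of
# each input string into a trie, recording (string_id, start_position)
# internal leaves at the end of each window.

class Node:
    def __init__(self):
        self.children = {}   # character -> child Node
        self.leaves = []     # (string_id, start_position) pairs

def build_ktst_naive(strings, k):
    root = Node()
    for sid, s in enumerate(strings):
        s = s + "$"                          # terminator character
        for start in range(len(s)):
            node = root
            # only the first k characters of each suffix are inserted
            for c in s[start : start + k]:
                node = node.children.setdefault(c, Node())
            node.leaves.append((sid, start))
    return root
```

For "mississippi$" with k = 3 this yields the tree of Figure A.1: no path is deeper than three characters, and a node such as the one spelling "iss" carries internal leaves for both occurrences of that 3-mer.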
To construct the kTST, we move left to right over each input string, considering a window
of size no greater than k. The current string depth is tracked in a variable denoted depth.
Because only a window of size at most k is considered, rather than an entire suffix, we need
to modify rule 2 as follows:
• Rule 2∗: Same as rule 2, except that only the suffix truncated to length k is inserted
into the tree.
In addition to this modified suffix extension rule, a modification to the second part of
Ukkonen's algorithm is required as well. Once the algorithm reaches depth k, it inserts the
k-mer sj..j+k into T , and the next k-mer to be inserted is sj+1..j+k+1. Whereas Ukkonen's
algorithm would normally break and move to the next phase, resetting the extension, what
we do here is increment the phase and advance to the next extension, so that the algorithm
arrives at sj+1..j+k at the end of the extension. This string already exists in the tree, as it
would have been inserted in the previous phase. Note that the algorithm for constructing k-
truncated generalized suffix trees can be used to construct ordinary generalized suffix trees
by simply setting k = ∞. Pseudo-code for the modified Ukkonen's algorithm, adapted
from [24], is presented in Algorithm 8.
Figure A.1: 3-truncated suffix tree for the word "mississippi$".
Input: A string s
Output: A k-truncated suffix tree T for s

m ← length(s)
lastNode ← root(T ), node ← lastNode
j ← 1, depth ← 0
for i ← 1 to m + 1 do                                        /* Phase i */
    for j while j ≤ i and i ≤ m do                           /* Extension j */
        /* Part 1: Insert suffix sj..i+1 into T */
        if si+1 isn't contained in tree at current position then
            if sj..i doesn't end directly at node then
                node ← SplitEdge
            Add leaf to node with edge label starting with si+1    /* Rule 2/2∗ */
            lastNode.link ← node, lastNode ← node
        else
            Move down one character along edge                     /* Rule 3 */
            depth ← depth + 1
            if depth = k then
                Add new internal leaf                              /* Rule 4 */
                i ← i + 1
            break
        /* Part 2: Update current position to sj+1..i */
        if node ≠ root(T ) then
            if sj..i ends directly at node and node has a suffix link then
                node ← node.link
                depth ← depth − 1
            else
                xα ← label between current position and node.parent
                node ← node.parent
                depth ← depth − 1
                if node ≠ root(T ) then
                    node ← node.link
                    γ ← xα
                else
                    γ ← α
                Use skip & count to move back down via γ           /* This alters node */
                if current position is node and lastNode has no suffix link then
                    lastNode.link ← node, lastNode ← node

Algorithm 8: Modified Ukkonen's algorithm for k-truncated suffix trees.
Bibliography
[1] A. Asthana, M. Crauatts, and P. Krzyzanowski. An indoor wireless system for personalized shopping assistance. In WMCSA '94: Proceedings of the 1994 First Workshop on Mobile Computing Systems and Applications, pages 69–74, Washington, DC, USA, 1994. IEEE Computer Society.

[2] Zhixiang Chen, Richard Fowler, Ada W. Fu, and Chunyue Chen. Fast Construction of Generalized Suffix Trees Over a Very Large Alphabet, volume 2697 of Lecture Notes in Computer Science, pages 284–293. Springer Berlin / Heidelberg, 2003.

[3] Sigal Elnekave, Mark Last, and Oded Maimon. Incremental clustering of mobile objects. Data Engineering Workshops, 22nd International Conference on, 0:585–592, 2007.

[4] R. Fraile and S. J. Maybank. Vehicle trajectory approximation and classification. In Paul H. Lewis and Mark S. Nixon, editors, British Machine Vision Conference, 1998.

[5] Elias Frentzos, Kostas Gratsias, Nikos Pelekis, and Yannis Theodoridis. Algorithms for nearest neighbor search on moving object trajectories. Geoinformatica, 11(2):159–193, 2007.

[6] Scott Gaffney and Padhraic Smyth. Trajectory clustering with mixtures of regression models. In KDD '99: Proceedings of the Fifth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 63–72, New York, NY, USA, 1999. ACM.

[7] Fosca Giannotti, Mirco Nanni, Fabio Pinelli, and Dino Pedreschi. Trajectory pattern mining. In KDD '07: Proceedings of the 13th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 330–339, New York, NY, USA, 2007. ACM.

[8] Gyozo Gidofalvi, Xuegang Huang, and Torben B. Pedersen. Privacy-preserving data mining on moving object trajectories. In Proceedings of the 8th International Conference on Mobile Data Management, Mannheim, Germany, May 2007.

[9] Gyozo Gidofalvi and Torben Bach Pedersen. Mining long, sharable patterns in trajectories of moving objects. Geoinformatica, 13(1):27–55, 2009.
[10] Dan Gusfield. Algorithms on Strings, Trees, and Sequences: Computer Science and Computational Biology. Cambridge University Press, January 1997.

[11] Jiawei Han and Micheline Kamber. Data Mining: Concepts and Techniques. Morgan Kaufmann, 2nd edition, January 2006.

[12] Tzvetan Horozov, Nitya Narasimhan, and Venu Vasudevan. Using location for personalized POI recommendations in mobile environments. In SAINT '06: Proceedings of the International Symposium on Applications and the Internet, pages 124–129, Washington, DC, USA, 2006. IEEE Computer Society.

[13] Ming Hua, Jian Pei, Ada W. C. Fu, Xuemin Lin, and Ho-Fung Leung. Efficiently answering top-k typicality queries on large databases. In VLDB '07: Proceedings of the 33rd International Conference on Very Large Data Bases, pages 890–901. VLDB Endowment, 2007.

[14] Christian S. Jensen, H. Lahrmann, Stardas Pakalnis, and J. Runge. The INFATI data. CoRR, cs.DB/0410001, 2004.

[15] Hoyoung Jeung, Man Lung Yiu, Xiaofang Zhou, Christian S. Jensen, and Heng Tao Shen. Discovery of convoys in trajectory databases. Proc. VLDB Endow., 1(1):1068–1080, 2008.

[16] Jon M. Kleinberg. Authoritative sources in a hyperlinked environment. J. ACM, 46(5):604–632, 1999.

[17] Jae-Gil Lee, Jiawei Han, Xiaolei Li, and Hector Gonzalez. TraClass: trajectory classification using hierarchical region-based and trajectory-based clustering. Proc. VLDB Endow., 1(1):1081–1094, 2008.

[18] Jae-Gil Lee, Jiawei Han, and Kyu-Young Whang. Trajectory clustering: a partition-and-group framework. In SIGMOD '07: Proceedings of the 2007 ACM SIGMOD International Conference on Management of Data, pages 593–604, New York, NY, USA, 2007. ACM.

[19] Christopher D. Manning, Prabhakar Raghavan, and Hinrich Schütze. Introduction to Information Retrieval. Cambridge University Press, Cambridge, UK, 2008.

[20] Edward M. McCreight. A space-economical suffix tree construction algorithm. J. ACM, 23(2):262–272, 1976.

[21] Paul Resnick and Hal R. Varian. Recommender systems. Commun. ACM, 40(3):56–58, 1997.

[22] Francesco Ricci and Quang Nhat Nguyen. Acquiring and revising preferences in a critique-based mobile recommender system. IEEE Intelligent Systems, 22(3):22–29, 2007.
[23] Badrul Sarwar, George Karypis, Joseph Konstan, and John Riedl. Item-based collaborative filtering recommendation algorithms. In WWW '01: Proceedings of the 10th International Conference on World Wide Web, pages 285–295, New York, NY, USA, 2001. ACM.

[24] Marcel H. Schulz, Sebastian Bauer, and Peter N. Robinson. The generalised k-truncated suffix tree for time- and space-efficient searches in multiple DNA or protein sequences. Int. J. Bioinformatics Res. Appl., 4(1):81–95, 2008.

[25] Upendra Shardanand and Pattie Maes. Social information filtering: algorithms for automating "word of mouth". In CHI '95: Proceedings of the SIGCHI Conference on Human Factors in Computing Systems, pages 210–217, New York, NY, USA, 1995. ACM Press/Addison-Wesley Publishing Co.

[26] B. W. Silverman. Density Estimation for Statistics and Data Analysis. Chapman & Hall/CRC, April 1986.

[27] Yuichiro Takeuchi and Masanori Sugimoto. CityVoyager: An Outdoor Recommendation System Based on User Location History, volume 4159 of Lecture Notes in Computer Science, pages 625–636. Springer Berlin / Heidelberg, 2006.

[28] Esko Ukkonen. On-line construction of suffix trees. Algorithmica, 14(3):249–260, 1995.

[29] Mark van Setten, Stanislav Pokraev, and Johan Koolwaaij. Context-Aware Recommendations in the Mobile Tourist Application COMPASS, volume 3137 of Lecture Notes in Computer Science, pages 235–244. Springer Berlin / Heidelberg, 2004.

[30] Peter Weiner. Linear pattern matching algorithms. In SWAT '73: Proceedings of the 14th Annual Symposium on Switching and Automata Theory, pages 1–11, Washington, DC, USA, 1973. IEEE Computer Society.

[31] Yu Zheng, Lizhu Zhang, Xing Xie, and Wei-Ying Ma. Mining interesting locations and travel sequences from GPS trajectories. In WWW '09: Proceedings of the 18th International Conference on World Wide Web, pages 791–800, New York, NY, USA, 2009. ACM.