Retrieving k-Nearest Neighboring Trajectories by a Set of Point Locations

Lu-An Tang, Yu Zheng, Xing Xie, Jing Yuan, Xiao Yu, Jiawei Han

University of Illinois at Urbana-Champaign

Microsoft Research Asia

Motivation: trajectory query by locations

Huge volume of spatial trajectories

Need to search trajectories by a set of point locations

Geo-tagged photos Taxi trajectories Check-ins

k-Nearest Neighboring trajectory query

Query the top k trajectories with the minimum aggregated distance to the given locations

The trajectories may not exactly pass those locations

k-NNT query

Task Definition:

Given the trajectory dataset D and a set of query points Q, the k-NNT query retrieves a set K of k trajectories from D, K = {R1, R2, …, Rk}, such that for ∀ Ri ∈ K and ∀ Rj ∈ D − K, dist(Ri, Q) ≤ dist(Rj, Q).
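Read as code, the definition is a top-k selection by aggregate distance. The sketch below assumes the aggregate distances have already been computed; the function name `knnt_from_distances` is hypothetical, and the two values are the ones worked out for R1 and R2 in the aggregate-distance example of this deck.

```python
import heapq

def knnt_from_distances(agg_dist, k):
    """Return the ids of the k trajectories with the smallest dist(R, Q).

    agg_dist maps a trajectory id to its aggregate distance to Q.
    """
    return heapq.nsmallest(k, agg_dist, key=agg_dist.get)

# dist(R1, Q) = 85 m and dist(R2, Q) = 75 m, so R2 is the 1-NNT.
top1 = knnt_from_distances({"R1": 85, "R2": 75}, 1)
```

Every trajectory left out of the result has a distance at least as large as every trajectory in it, which is the condition stated in the definition.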

Challenges

Huge trajectory dataset: high I/O cost to scan all the trajectories

Aggregated distance computation

Non-uniform distribution:

the trajectories are sparse/dense in different regions

the user-given query locations may be far from all the trajectories

The aggregate distance in k-NNT query

1. Find out the closest point from a trajectory to each query point (i.e., shortest matching pairs)

2. Sum up the lengths of all matching pairs

• dist(R1, q1) = dist(p1,2, q1) = 20 m
• dist(R1, q2) = dist(p1,3, q2) = 50 m
• dist(R1, q3) = dist(p1,5, q3) = 15 m
• dist(R1, Q) = ∑ dist(R1, qi) = 85 m

• dist(R2, q1) = dist(p2,3, q1) = 30 m
• dist(R2, q2) = dist(p2,4, q2) = 5 m
• dist(R2, q3) = dist(p2,6, q3) = 40 m
• dist(R2, Q) = ∑ dist(R2, qi) = 75 m
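The two steps above can be sketched in a few lines. This is a minimal illustration that assumes 2-D points and Euclidean distance; the function names and the toy coordinates are hypothetical and do not come from the slide's figure.

```python
import math

def point_dist(p, q):
    """Euclidean distance between two 2-D points."""
    return math.hypot(p[0] - q[0], p[1] - q[1])

def traj_to_point(trajectory, q):
    """dist(R, q): length of the shortest matching pair between R and q."""
    return min(point_dist(p, q) for p in trajectory)

def aggregate_dist(trajectory, queries):
    """dist(R, Q): sum of the shortest matching-pair lengths over all q in Q."""
    return sum(traj_to_point(trajectory, q) for q in queries)

# Toy trajectory and query points (hypothetical coordinates).
R = [(0, 0), (10, 0), (20, 0)]
Q = [(0, 3), (10, 4), (25, 0)]
total = aggregate_dist(R, Q)  # shortest pairs have lengths 3, 4 and 5
```

This brute-force version touches every point of the trajectory for every query point, which is the cost the candidate-generation framework tries to avoid.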

Related Work: k-BCT query

k-Best Connected Trajectory (k-BCT) query [SIGMOD2010]

The similarity function between a trajectory R and query locations Q is Sim(R, Q) = ∑ e^(−dist(R, qi))

Problem: the value of this function, and the ranking it produces, changes with the distance unit (inconsistent)

An example: query Q has two points, q1 and q2

dist(R1, q1) = dist(R1, q2) = 2.4 km = 1.48 miles

dist(R2, q1) = 1.5 km = 0.93 miles, dist(R2, q2) = 5 km = 3.1 miles

Using the unit “mile”, Sim(R1, Q) = 0.45 > Sim(R2, Q) = 0.43

Using the unit “km”, Sim(R1, Q) = 0.18 < Sim(R2, Q) = 0.22
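The flip in ranking can be checked numerically. The sketch below assumes the k-BCT similarity is the sum of e^(−dist(R, qi)) over the query points, an assumption that is consistent with the similarity values quoted in the example; the variable names and the conversion constant are introduced here for illustration.

```python
import math

KM_PER_MILE = 1.609344

def sim(dists):
    """Assumed k-BCT similarity: sum of exp(-dist) over the query points."""
    return sum(math.exp(-d) for d in dists)

# Distances from the example, in km.
r1_km = [2.4, 2.4]
r2_km = [1.5, 5.0]
r1_mi = [d / KM_PER_MILE for d in r1_km]
r2_mi = [d / KM_PER_MILE for d in r2_km]

r2_wins_in_km = sim(r1_km) < sim(r2_km)      # R2 ranks first in km
r1_wins_in_miles = sim(r1_mi) > sim(r2_mi)   # R1 ranks first in miles

# A plain sum of distances scales linearly with the unit,
# so the order it induces is the same in km and in miles.
same_order = (sum(r1_km) < sum(r2_km)) == (sum(r1_mi) < sum(r2_mi))
```

The exponential is not scale-invariant, which is why changing the unit can reorder the results, whereas the aggregate distance used by k-NNT only rescales.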

Advantages of k-NNT over k-BCT

The ranking produced by the k-BCT similarity function changes with the distance unit (inconsistent)

The distance function of k-BCT is sensitive to a query

[Figure legend: k-BCT & k-NNT, k-NNT, k-BCT]

Query framework: candidate-generation-and-verification

Candidate generation

Best-first search based individual heaps

Coordination by a global heap

Candidate verification

Lower-bound estimation

Efficient pruning with the global heap

Outlier query location

Qualifier expectation based method

Direct Computing

dist(R1, Q) = 5 + 2 + 2 = 9 m
dist(R2, Q) = 25 + 20 + 30 = 75 m
dist(R3, Q) = 80 + 25 + 30 = 135 m
dist(R4, Q) = 90 + 5 + 3 = 98 m
dist(R5, Q) = 55 + 8 + 70 = 133 m
dist(R6, Q) = 120 + 80 + 40 = 240 m

Candidate Generation, then Candidate Verification

dist(R1, Q) = 5 + 2 + 2 = 9 m
dist(R4, Q) = 90 + 5 + 3 = 98 m
dist(R5, Q) = 55 + 8 + 70 = 133 m

Candidate Generation

Given a query Q = {q1, q2, …, qm}, generate a trajectory candidate set including all the k-NNTs (i.e., a complete set)

Step 1: searching k-NN points using best-first-based individual heap

Step 2: generating the candidate trajectories by the global heap

Individual heaps (one per query point):

h1: <p2,3, q1>, <p5,2, q1>, <p1,6, q1>, <p2,9, q1>, …
h2: <p6,2, q2>, <p5,3, q2>, <p7,4, q2>, <p4,8, q2>, …
h3: <p2,2, q3>, <p3,5, q3>, <p7,3, q3>, <p8,6, q3>, …

Global heap

A minimum heap sorting matching pairs by the distance

Retrieves new matching pair from individual heaps

Pops the matching pairs to the candidate set

Step 2: generating candidate trajectories

Individual Heaps

h1: <p2,3, q1>, <p5,2, q1>, <p1,6, q1>, <p2,9, q1>, …
h2: <p6,2, q2>, <p5,3, q2>, <p7,4, q2>, <p4,8, q2>, …
h3: <p2,2, q3>, <p3,5, q3>, <p7,3, q3>, <p8,6, q3>, …
hm: <p5,1, qm>, <p2,3, qm>, <p5,7, qm>, <p9,2, qm>, …

Global Heap (Size = m)

<p1,4, q1>, <p5,1, q3>, <p6,4, q4>, <p3,4, q2>, …

Candidate Set

R1: <p1,2, q1>, <p1,5, q2>, <p1,3, q3>, …, <p1,9, qm>
R2: (missing), <p2,2, q2>, <p2,4, q3>, …, (missing)
R4: <p4,5, q1>, (missing), <p4,3, q3>, …, <p4,7, qm>
…

Advantages

Guarantees that the candidate set includes all k-NNTs

Generates compact candidate sets


Example: Search based on the global heap

Matching pairs shown over the successive animation frames, in order of appearance: <p1,2, q1>, <p1,4, q2>, <p1,6, q3>, <p5,5, q2>, <p4,5, q3>, <p4,4, q2>

Candidate Set (final frame)

R1: <p1,2, q1>, <p1,4, q2>, <p1,6, q3>. (Full Match)

R4: <p4,5, q3>. (Partial Match)

Global Heap (final frame)

<p1,2, q1>, <p4,4, q2>, <p1,5, q3>

Individual Heaps

h1, h2, h3

Stop criterion: when there are k full-matching candidates

Property 1: The candidate set is complete if G has popped out k full-matching candidates (in this example, k = 1)
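Candidate generation with individual heaps, a global heap, and the stop criterion can be sketched as follows. This is a simplified illustration, not the authors' implementation: the best-first search over an R-tree is replaced by heapifying all points per query location, points are assumed to be `(traj_id, x, y)` tuples, and all function names are hypothetical.

```python
import heapq
import math

def individual_heap(points, q):
    """Yield (dist, traj_id) for query point q in ascending distance.

    Stand-in for best-first search on an R-tree: all points are
    heapified up front, which is only reasonable for small inputs.
    """
    heap = [(math.hypot(x - q[0], y - q[1]), tid) for tid, x, y in points]
    heapq.heapify(heap)
    while heap:
        yield heapq.heappop(heap)

def generate_candidates(points, queries, k):
    """Return {traj_id: {query_index: dist}}, stopping at k full matches."""
    m = len(queries)
    sources = [individual_heap(points, q) for q in queries]
    global_heap = []  # holds at most one pair per query point
    for i, src in enumerate(sources):
        first = next(src, None)
        if first is not None:
            heapq.heappush(global_heap, (first[0], i, first[1]))
    candidates, full = {}, 0
    while global_heap and full < k:
        d, i, tid = heapq.heappop(global_heap)
        pairs = candidates.setdefault(tid, {})
        if i not in pairs:            # keep the shortest pair per (R, q_i)
            pairs[i] = d
            if len(pairs) == m:       # R now matches every query point
                full += 1
        nxt = next(sources[i], None)  # refill from the same individual heap
        if nxt is not None:
            heapq.heappush(global_heap, (nxt[0], i, nxt[1]))
    return candidates

# Toy data: trajectory 1 and 2 run near both query points, 3 is far away.
points = [(1, 0, 0), (1, 10, 0), (2, 0, 5), (2, 10, 5), (3, 0, 50)]
queries = [(0, 1), (10, 1)]
cands_k1 = generate_candidates(points, queries, k=1)
cands_k2 = generate_candidates(points, queries, k=2)
```

In this toy run the distant trajectory 3 never enters the candidate set, which illustrates why the generated sets stay compact.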

The full-matching candidate may not be the final k-NNT

The system has to retrieve the partial-matching trajectories (R4 and R5) to compute their aggregate distance (I/O cost)

Question: can we compute a lower-bound for R4 and R5 without retrieving their details?

If LB(R4) > dist(R1, Q) or LB(R5) > dist(R1, Q), the corresponding trajectory can be pruned directly

R1: <p1,2, q1>, <p1,4, q2>, <p1,6, q3>. (Full Match)

Candidate Set

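The lower-bound of a partial-matching trajectory R is LB(R) = ∑ dist(R, qi) over the query points qi already matched by R, plus ∑ dist(pG, qj) over the unmatched query points qj, where <pG, qj> is the pair for qj currently in the global heap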

If LB(R) is larger than the distance of the full-matching candidate, R can be pruned directly

Candidate Set

R1: <p1,2, q1>, <p1,4, q2>, <p1,6, q3>; dist(R1) = 95
R4: <p4,5, q3>; LB(R4) = 114 (pruned)
R5: <p5,5, q2>; LB(R5) = 90 (passed)

Global Heap

<p1,5, q3>, <p1,2, q1>, <p4,4, q2>
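A lower-bound check of this kind could look like the sketch below. It assumes the bound adds the distances of the pairs already found for R to the distances of the pairs currently in the global heap for the query points R has not matched yet; that reading, the function names, and the per-pair distances are assumptions for illustration, since the slide gives only the totals.

```python
def lower_bound(known_pairs, global_heap_dists):
    """Lower bound on dist(R, Q) for a partial-matching trajectory R.

    known_pairs: {query_index: dist} for pairs already popped for R.
    global_heap_dists: {query_index: dist} of the pair currently held
    in the global heap for each query point.
    """
    return sum(
        known_pairs.get(i, heap_d) for i, heap_d in global_heap_dists.items()
    )

def verify(candidates, global_heap_dists, best_full_dist):
    """Split partial candidates into pruned and surviving trajectory ids."""
    pruned, kept = [], []
    for tid, pairs in candidates.items():
        lb = lower_bound(pairs, global_heap_dists)
        (pruned if lb > best_full_dist else kept).append(tid)
    return pruned, kept

# Hypothetical distances: R4 matched q3 only, R5 matched q2 only.
heap_dists = {0: 30.0, 1: 40.0, 2: 50.0}
partial = {"R4": {2: 45.0}, "R5": {1: 20.0}}
pruned, kept = verify(partial, heap_dists, best_full_dist=110.0)
```

Only the surviving trajectories need their points fetched to compute an exact aggregate distance, which is where the I/O saving comes from.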

Problem of Outlier Query Location

A query location is an outlier if it is far from all the trajectories

Too many partial-matching candidates will be generated before a full-matching candidate is found

Candidate Set

R1: <p1,1, q1>, <p1,4, q2>, (missing). (Partial Matching)
R2: <p2,1, q1>, <p2,5, q2>, (missing). (Partial Matching)
R4: (missing), <p4,4, q2>, (missing). (Partial Matching)
……

Global Heap (Iteration 4)

<p1,1, q1>, <p4,4, q2>, <p1,7, q3>

<p1,7, q3> cannot be popped out

Qualifier expectation based method

The system can make up the missing pairs of a partial-matching trajectory by retrieving all its points

Two key issues:

Guarantee the completeness of the candidate set

Property 2: If there are k made-up candidates (qualifiers) whose distances are smaller than the sum of the distances of the pairs in the global heap, the candidate set is complete

Which candidate should be selected to make up? The qualifier expectation measure


Example of Qualifier Expectation

Candidate Set

R1: <p1,1, q1>, <p1,4, q2>, (missing)
R2: <p2,1, q1>, <p2,5, q2>, (missing)
R4: (missing), <p4,4, q2>, (missing)

Global Heap, total dist sum(G) = 200 m

<p2,1, q1>, <p4,4, q2>, <p1,7, q3>

Qualifier Expectation

R1: 40 m, R2: 30 m, R4: 15 m

R1 after making up the missing pair: <p1,1, q1>, <p1,4, q2>, <p1,7, q3>

dist(R1) = 160 m < sum(G), so R1 is a qualifier
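The completeness test of Property 2 reduces to a comparison against sum(G). The sketch below is a minimal illustration with a hypothetical function name; dist(R1) = 160 m and sum(G) = 200 m come from the example, while the split of the 200 m across the three heap pairs is invented.

```python
def is_complete(made_up_dists, global_heap_dists, k):
    """Property 2 check: are there k qualifiers with dist(R, Q) < sum(G)?

    made_up_dists: {traj_id: exact dist(R, Q)} of made-up candidates.
    global_heap_dists: distances of the pairs currently in the global heap.
    """
    sum_g = sum(global_heap_dists)
    qualifiers = [tid for tid, d in made_up_dists.items() if d < sum_g]
    return len(qualifiers) >= k, qualifiers

# sum(G) = 200 m (hypothetical per-pair split), dist(R1) = 160 m, k = 1.
complete, qualifiers = is_complete({"R1": 160.0}, [20.0, 30.0, 150.0], k=1)
```

The sketch does not cover how the next candidate to make up is chosen; in the example above that choice is driven by the qualifier expectation values.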

Experiment Setup

Real dataset: collected from the Microsoft GeoLife and T-Drive projects, with over 20,000 real trajectories

Synthetic datasets with both uniform distribution and biased distribution

Randomly generated queries Q

The proposed methods are compared with Fagin’s Algorithm (FA) and Threshold Algorithm (TA) (used in k-BCT)

Evaluations on synthetic dataset (biased distribution)

GH (global heap) is faster than the baselines and incurs lower I/O cost

QE (global heap + qualifier expectation) is an order of magnitude faster than the others

[Figure: (a) Query Time vs. |Q|, time in ms; (b) I/O Cost vs. |Q|, accessed R-tree nodes. Methods: GH, QE, TA, FA]

Evaluations on real dataset

When |Q| is small, the probability of an outlier location is low and GH achieves the best performance

When |Q| is larger, the probability of an outlier location is high and QE is more efficient

[Figure: (a) Query Time vs. |Q|, time in ms; (b) I/O Cost vs. |Q|, accessed R-tree nodes. Methods: GH, QE, TA, FA]

Conclusion

k-Nearest Neighboring Trajectory (k-NNT) query: retrieve trajectories by a set of locations

Candidate-generation-and-verification framework

Generate candidate trajectories with the global heap

Efficient lower-bound computation

Outlier query location: qualifier expectation based method

Thanks very much!

Any Questions?
