Computer Science and Engineering
Efficiently Monitoring Top-k Pairs over Sliding Windows
Presented by: Zhitao Shen¹
Joint work with Muhammad Aamir Cheema¹, Xuemin Lin²,¹, Wenjie Zhang¹, Haixun Wang³
¹ The University of New South Wales, Australia
² East China Normal University
³ Microsoft Research Asia
2
Introduction
Top-k Pairs Query:
• Given a scoring function score() that computes the score of a pair of objects, return the k pairs of objects with the smallest scores.
Examples:
• k closest pairs queries
• k furthest pairs queries
Top-k Pairs against sliding windows
• Given a data stream, return the top-k pairs among the most recent N objects.
Applications
• Wireless sensor networks, stock markets, traffic monitoring and transaction monitoring
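The query definition above can be made concrete with a brute-force baseline. The sketch below is only an illustration of the problem statement, not the method presented in these slides: it rescans all pairs in the window on every arrival. The function name `monitor_top_k_pairs` and the sample stream are hypothetical.

```python
from collections import deque
from itertools import combinations
import heapq


def monitor_top_k_pairs(stream, N, k, score):
    """Yield the k smallest-score pairs among the most recent N objects.

    Brute force: every arrival re-evaluates all pairs in the window.
    """
    window = deque(maxlen=N)  # the oldest object is dropped automatically
    for obj in stream:
        window.append(obj)
        yield heapq.nsmallest(
            k, combinations(window, 2), key=lambda pair: score(*pair)
        )


def closest(a, b):
    """Example scoring function: 1-D distance (k closest pairs)."""
    return abs(a - b)


results = list(
    monitor_top_k_pairs([1.0, 4.0, 9.0, 10.0, 2.5], N=3, k=2, score=closest)
)
print(results[-1])
```

Using `-abs(a - b)` as the scoring function would turn the same loop into a k furthest pairs query.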
3
Motivation
No existing work for general pairs queries over sliding windows
Support arbitrary scoring functions.
Example: Fraud detection over transaction streams
– Query the transaction pairs that have a small time difference but whose locations are far apart.
    Select a.id, b.id
    from trans a, trans b
    where a.id <> b.id and a.account = b.account
    order by |a.time - b.time| - dist(a.loc, b.loc)
    limit k
    window [24 hours]
203-13845 10:15:20 New York $1000
203-13845 10:18:10 L.A. $1000
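As a rough illustration of the `order by` expression, the sketch below scores the two example transactions. Several details are assumptions that the slide does not specify: time is measured in seconds, `dist` is the great-circle (haversine) distance in kilometres, the city coordinates are approximate, and the names `haversine_km` and `fraud_score` are hypothetical.

```python
import math
from datetime import datetime


def haversine_km(loc_a, loc_b):
    """Great-circle distance in km between two (lat, lon) pairs in degrees."""
    lat1, lon1, lat2, lon2 = map(math.radians, (*loc_a, *loc_b))
    h = (
        math.sin((lat2 - lat1) / 2) ** 2
        + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2
    )
    return 2 * 6371.0 * math.asin(math.sqrt(h))


def fraud_score(a, b):
    """|a.time - b.time| - dist(a.loc, b.loc); smaller is more suspicious."""
    seconds = abs((a["time"] - b["time"]).total_seconds())
    return seconds - haversine_km(a["loc"], b["loc"])


# Approximate coordinates, assumed for illustration only.
t1 = {"time": datetime.strptime("10:15:20", "%H:%M:%S"), "loc": (40.7128, -74.0060)}
t2 = {"time": datetime.strptime("10:18:10", "%H:%M:%S"), "loc": (34.0522, -118.2437)}

print(fraud_score(t1, t2))
```

The pair is 170 seconds apart but thousands of kilometres apart, so its score is strongly negative and it would rank near the top of a smallest-score query.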
4
Problem Definitions (Preliminaries)
Sliding Windows
– A sliding window contains the most recent N objects of the data stream.
– The number of pairs is N(N – 1) / 2
[Figure: sliding window of size 5 over a stream of objects o0–o7, with the older and newer ends marked.]
Age of an object: 5 4 3 2 1 0
The age of a pair depends on the older object.
Lower bound runtime cost: O(N) for each new object
Lower bound storage cost: O(N)
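A small sketch of these definitions follows. It assumes, purely for illustration, that the newest object has age 0 and that age grows by one per position towards the older end; the helper names `object_ages` and `pair_age` are hypothetical.

```python
from itertools import combinations


def object_ages(window):
    """Map each object to its age; `window` lists objects oldest first.

    Assumed convention: the newest object has age 0.
    """
    n = len(window)
    return {obj: n - 1 - i for i, obj in enumerate(window)}


def pair_age(ages, a, b):
    """The age of a pair is the age of its older object."""
    return max(ages[a], ages[b])


window = ["o3", "o4", "o5", "o6", "o7"]  # sliding window of size N = 5
ages = object_ages(window)
pairs = list(combinations(window, 2))

print(len(pairs))                    # N(N - 1) / 2 pairs
print(pair_age(ages, "o3", "o7"))    # determined by the older object o3
```

Each arriving object forms N - 1 new pairs with the objects already in the window, which is where the O(N) per-object lower bound comes from.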
5
Contributions
Unified framework
• First to study top-k pairs queries over sliding windows.
• Supports arbitrarily complex scoring functions.
• Supports efficient queries for any window size n ≤ N and any k ≤ K.

                                            Lower bound   Expected cost for our algorithms
Storage requirement                         O(N)          O(N) + O(K log(N/K)) for each scoring function
Skyband maintenance cost for each object    O(N)          O(N (log(log N) + log K))
Answering top-k pairs                       O(k)          O(log(log n) + log K + k)
6
Preliminaries
Map all the pairs to an age–score space.
[Figure: pairs p1–p10 formed from objects o0–o4, plotted in the age–score space (x-axis: Age, 1 to 4; y-axis: Score), with the top-2 pairs marked. Example: p1 = (o0, o1) maps to (p1.age, p1.score) = (1, 3).]
K-skyband [Papadias et al., TODS05] keeps the minimum set of candidate results: the points dominated by fewer than K other points.
p2 dominates p5 because p2.score < p5.score and p2 expires no earlier than p5 (p2.age ≤ p5.age).
Task 1: how to efficiently maintain the K-skyband.
Task 2: how to use the K-skyband to efficiently obtain the top-k pairs against any sliding window n ≤ N.
Naive: O(N |SKB|) for checking all N-1 pairs
Expected size of skyband is O(K log(N/K))
Our: O(N log |SKB|)
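The K-skyband idea can be checked with a quadratic brute-force sketch. It assumes that a point p dominates q when p.age ≤ q.age and p.score < q.score, and that a window of size n holds the pairs with age at most n; the point coordinates and the names `dominates`, `k_skyband` and `top_k` are hypothetical. This is a reference implementation for intuition, not the maintenance algorithm from the slides.

```python
def dominates(p, q):
    """p, q are (age, score): p is at least as recent and scores strictly better."""
    return p[0] <= q[0] and p[1] < q[1]


def k_skyband(points, K):
    """Points dominated by fewer than K other points (O(n^2) reference version)."""
    return [q for q in points if sum(dominates(p, q) for p in points) < K]


def top_k(points, max_age, k):
    """The k smallest-score points among those with age <= max_age."""
    return sorted((p for p in points if p[0] <= max_age), key=lambda p: p[1])[:k]


points = [(1, 6.0), (1, 9.0), (2, 3.0), (2, 8.0),
          (3, 4.0), (3, 5.0), (4, 1.0), (5, 2.0)]
skyband = k_skyband(points, K=2)
print(skyband)
```

With distinct scores, any top-k answer with k ≤ K over any age bound can be computed from `skyband` alone, which is why only the skyband needs to be stored.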
7
Efficient Skyband Maintenance
Can we find a boundary between the skyband points and non-skyband points? The K-staircase.
[Figure: 2-skyband of points p1–p7 in the age–score space (x-axis: Age; y-axis: Score), with K-staircase points s1 and s2 marking the boundary.]
How can we efficiently compute the K-staircase and K-skyband?
• Update the K-staircase and K-skyband in O(|SKB| log K).
• Check if a pair is dominated by the K-skyband in O(log |SKB|) time for each new pair by doing a binary search.
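One plausible way to realise such a boundary is sketched below. It is an assumption about how a K-staircase could be built, not the paper's exact definition: scanning the points by increasing age, each step records the K-th smallest score seen so far, so a new pair is dominated by at least K points exactly when the last step at or before its age has a strictly smaller score. The names `build_staircase` and `dominated_by_k`, and the sample points, are hypothetical.

```python
import bisect
import heapq


def build_staircase(points, K):
    """points: iterable of (age, score). Returns parallel lists (ages, scores).

    The step at age a holds the K-th smallest score among points with age <= a.
    Runs in O(n log n) for the sort plus O(n log K) for the scan.
    """
    ages, scores = [], []
    worst = []  # negated scores: a max-heap of the K smallest scores so far
    for age, score in sorted(points):
        if len(worst) < K:
            heapq.heappush(worst, -score)
        elif score < -worst[0]:
            heapq.heapreplace(worst, -score)
        if len(worst) == K:
            kth = -worst[0]
            if ages and ages[-1] == age:
                scores[-1] = kth  # same age: keep only the final step
            else:
                ages.append(age)
                scores.append(kth)
    return ages, scores


def dominated_by_k(ages, scores, age, score):
    """Binary search: is (age, score) dominated by at least K points?"""
    i = bisect.bisect_right(ages, age) - 1
    return i >= 0 and scores[i] < score


pts = [(1, 3.0), (2, 1.0), (2, 5.0), (3, 2.0), (4, 0.5), (4, 4.0)]
ages, scores = build_staircase(pts, K=2)
print(ages, scores)
```

Because the step scores never increase with age, a single `bisect` lookup suffices, which matches the O(log |SKB|) check per new pair stated on the slide.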
8
Efficient Query Answering
Window size = N; any window size n < N
[Figure: 2-skyband of points p1–p8 in the age–score space (x-axis: Age; y-axis: Score).]
Can we do better for any sliding window size n < N?
Use a Priority Search Tree to index the skyband points
• Self-balancing tree
• Efficient 3-sided range query
[Figure: Priority Search Tree over the skyband points, with node labels 6 p1, 3 p5, 1 p7, 4 p6, 2 p8, 9 p2, 8 p3, 5 p4.]
9
Efficient Query Answering
Any window size n < N
[Figure: 2-skyband of points p1–p8 in the age–score space (x-axis: Age; y-axis: Score) and the Priority Search Tree indexing them, with node labels 6 p1, 3 p5, 1 p7, 4 p6, 2 p8, 9 p2, 8 p3, 5 p4.]
Our contribution: retrieve the top-k pairs in the 1-sided range.
An algorithm similar to post-order traversal costs O(log |SKB| + k).
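A minimal sketch of the indexing idea follows. It builds a static priority search tree (heap-ordered on score, split on age) and answers a top-k query over the 1-sided range age ≤ `max_age` with a best-first search. The best-first search is a simpler substitute for the traversal mentioned on the slide and pays an extra logarithmic factor for its heap; the tree here is also built once rather than self-balancing under updates. The names `Node`, `build`, `top_k` and the sample points are hypothetical.

```python
import heapq
from itertools import count


class Node:
    def __init__(self, point, min_age, left, right):
        self.point = point      # (age, score, label): smallest score in subtree
        self.min_age = min_age  # smallest age stored anywhere in this subtree
        self.left = left
        self.right = right


def build(points):
    """points: list of (age, score, label) sorted by age."""
    if not points:
        return None
    i = min(range(len(points)), key=lambda j: points[j][1])
    rest = points[:i] + points[i + 1:]
    mid = len(rest) // 2  # split the remaining points by age
    return Node(points[i], points[0][0], build(rest[:mid]), build(rest[mid:]))


def top_k(root, max_age, k):
    """The k smallest-score points with age <= max_age, in score order."""
    result, tie, heap = [], count(), []
    if root is not None and root.min_age <= max_age:
        heap.append((root.point[1], next(tie), root))
    while heap and len(result) < k:
        _, _, node = heapq.heappop(heap)
        if node.point[0] <= max_age:
            result.append(node.point)
        for child in (node.left, node.right):
            # Skip subtrees that cannot contain a point inside the window.
            if child is not None and child.min_age <= max_age:
                heapq.heappush(heap, (child.point[1], next(tie), child))
    return result


points = sorted([(1, 6.0, "a"), (2, 3.0, "b"), (4, 1.0, "c"), (3, 4.0, "d"),
                 (5, 2.0, "e"), (1, 9.0, "f"), (2, 8.0, "g"), (3, 5.0, "h")])
root = build(points)
print(top_k(root, max_age=3, k=2))
```

Since every node holds the smallest score of its subtree, nodes leave the heap in increasing score order, and only subtrees that overlap the age range are ever expanded.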
10
What else in the paper?
Efficient continuous queries on the skyband.
• Continuously monitoring the top-k results for any fixed k (k ≤ K) and n (n ≤ N).
• Amortized O(k/n (log |SKB| + k)) time per update.
Optimization on monotonic scoring functions.
• Handling the k-closest pairs and k-furthest pairs queries.
• Applying the Threshold Algorithm on sorted lists
• Reducing the number of considered pairs for each new object from N to (d+1) N^(d/(d+1)) K^(1/(d+1))
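To show why a monotonic scoring function lets the search stop early, here is a generic Threshold Algorithm sketch for finding the k objects closest to a new object in Manhattan distance. It is not the paper's exact procedure: the per-dimension lists are re-sorted for each query to keep the example self-contained, whereas a real system would maintain them incrementally. The name `ta_k_closest` and the sample data are hypothetical.

```python
import heapq


def ta_k_closest(query, objects, k):
    """Return [(distance, index)] of the k objects nearest to `query`."""
    d = len(query)
    # One list per dimension, ordered by distance to the query in that dimension.
    lists = [
        sorted(range(len(objects)), key=lambda i, j=j: abs(objects[i][j] - query[j]))
        for j in range(d)
    ]
    seen, best = set(), []  # best: max-heap (negated distance) of size <= k
    for depth in range(len(objects)):
        threshold = 0.0  # lower bound on the distance of any unseen object
        for j in range(d):
            i = lists[j][depth]  # sorted access
            threshold += abs(objects[i][j] - query[j])
            if i not in seen:
                seen.add(i)
                # Random access: compute the full Manhattan distance.
                dist = sum(abs(a - b) for a, b in zip(objects[i], query))
                if len(best) < k:
                    heapq.heappush(best, (-dist, i))
                elif dist < -best[0][0]:
                    heapq.heapreplace(best, (-dist, i))
        if len(best) == k and -best[0][0] <= threshold:
            break  # no unseen object can beat the current k-th best
    return sorted((-neg, i) for neg, i in best)


objects = [(1, 1), (5, 5), (2, 3), (9, 0), (4, 4)]
print(ta_k_closest((2, 2), objects, k=2))
```

On this data the scan stops after two rounds of sorted access instead of examining all five objects, which is the effect the reduced pair count on the slide describes.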
11
Experimental Settings
Real dataset.
– Sensor data collected in the Intel research lab
– 2.3 million records.
Synthetic data.
– Uniform, correlated and anti-correlated distributions.
– 2 million objects
– Closest and furthest pairs in Manhattan distance
[Equation: score(o_x, o_y), defined over the terms |o_x.humidity − o_y.humidity|, |o_x.temp − o_y.temp| and |o_x.time − o_y.time|.]
12
Experiments (Overall Cost on real data)
SCase: our algorithm using the K-staircase to maintain the skyband.
Naïve: maintains kN pairs and sorts them on their scores.
LB: shows the lower bound cost.
[Plots: varying K; varying N (in thousands).]
13
Experiments (Query Answering)
Linear: scans the skyband points to find the top-k pairs.
Snapshot: our snapshot query algorithm.
Continuous: our continuous query algorithm.
LB: an algorithm that obtains the top-k results in O(k) time.
[Plots: varying K; varying |Q| (in thousands).]
14
Conclusion:
• First to study a broad class of top-k pairs queries over sliding windows.
• We present efficient algorithms and show that the performance of our algorithms is reasonably close to the lower bound cost.
• We provide extensive experimental results on both real and synthetic data sets to show the efficiency and scalability of the proposed algorithms.
15
Question and Answer
Thank You! Any Questions?
16
Related Work
Top-k Query Processing
• Fagin’s Algorithm (FA), Threshold Algorithm (TA), No Random Access (NRA)
Top-k Pairs Queries Processing
• k-closest pairs queries
• k-furthest pairs queries
• Top-k pairs queries [Cheema et al., ICDE’11]
Data Stream Processing
• Top-k query processing over data streams [Mouratidis et al., SIGMOD’06]
• k-nearest neighbour queries [Böhm et al., ICDE’07]
17
Experiments (Skyband Maintenance algorithm)
Basic: maintenance algorithm without the K-staircase.
SCase: our algorithm using the K-staircase to maintain the skyband.
TA: optimized algorithm for monotonic scoring functions.
LB: shows the lower bound cost.
[Plots: varying # of attributes; varying K.]