Page 1: Computer Science and Engineering

Computer Science and Engineering

Efficiently Monitoring Top-k Pairs over Sliding Windows

Presented by: Zhitao Shen (1)

Joint work with Muhammad Aamir Cheema (1), Xuemin Lin (2, 1), Wenjie Zhang (1), Haixun Wang (3)

(1) The University of New South Wales, Australia
(2) East China Normal University
(3) Microsoft Research Asia

Page 2

Introduction

Top-k Pairs Query:
• Given a scoring function score() that computes the score of a pair of objects, return the k pairs of objects with the smallest scores.

Examples:
• k closest pairs queries
• k furthest pairs queries

Top-k pairs against sliding windows
• Given a data stream, return the top-k pairs among the most recent N objects.

Applications
• Wireless sensor networks, stock markets, traffic monitoring and transaction monitoring
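The query definition above can be illustrated with a brute-force sketch. It is only a baseline for the semantics, not the method of the talk: it rescans all N(N-1)/2 pairs of the window. The function names `top_k_pairs`, `closest` and `furthest` and the sample stream are hypothetical.

```python
from heapq import nsmallest
from itertools import combinations

def top_k_pairs(window, k, score):
    """Return the k pairs from `window` with the smallest scores (brute force)."""
    return nsmallest(k, combinations(window, 2), key=lambda pair: score(*pair))

def closest(a, b):
    # k closest pairs: the score is the distance itself.
    return abs(a - b)

def furthest(a, b):
    # k furthest pairs: negating the distance turns "largest distance"
    # into "smallest score", which matches the query definition.
    return -abs(a - b)

stream = [7, 1, 10, 4, 13, 5]   # hypothetical one-dimensional objects
N = 4
window = stream[-N:]            # the most recent N objects: [10, 4, 13, 5]

print(top_k_pairs(window, 2, closest))    # [(4, 5), (10, 13)]
print(top_k_pairs(window, 2, furthest))   # [(4, 13), (13, 5)]
```

Because the scoring function is passed in as a parameter, the same routine covers closest pairs, furthest pairs and any other pairwise score.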

Page 3

Motivation

No existing work on general pairs queries over sliding windows
• Support arbitrary scoring functions.

Example: Fraud detection over transaction streams
– Find the transaction pairs that are close in time but far apart in location.

    SELECT a.id, b.id
    FROM trans a, trans b
    WHERE a.id <> b.id AND a.account = b.account
    ORDER BY |a.time - b.time| - dist(a.loc, b.loc)
    LIMIT k
    WINDOW [24 hours]

203-13845 10:15:20 New York $1000

203-13845 10:18:10 L.A. $1000

Page 4

Problem Definitions (Preliminaries)

Sliding Windows
– A sliding window contains the most recent N objects of the data stream.
– The number of pairs is N(N – 1) / 2.

[Figure: a sliding window of size 5 over the stream of objects o0, o1, ..., o7, with "newer" and "older" direction labels. Age of an object: 5 4 3 2 1 0.]

Lower bound runtime cost: O(N) for each new object
Lower bound storage cost: O(N)

The age of a pair depends on the older object.
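The window mechanics above can be sketched as follows. This is a minimal illustration, assuming that the newest object has age 0 and that the age of a pair is the larger of its two object ages; the class name `SlidingWindow` and its methods are hypothetical.

```python
from collections import deque

class SlidingWindow:
    def __init__(self, size):
        # Oldest object on the left, newest on the right.
        self.objects = deque(maxlen=size)

    def insert(self, obj):
        """Add obj and return the new pairs it forms with the other objects in the window."""
        self.objects.append(obj)            # silently drops the oldest object when full
        others = list(self.objects)[:-1]
        return [(old, obj) for old in others]

    def age(self, index):
        """Age of the object at position `index` (assumption: newest object has age 0)."""
        return len(self.objects) - 1 - index

    def pair_age(self, i, j):
        # The age of a pair is the age of its older member.
        return max(self.age(i), self.age(j))

w = SlidingWindow(5)
for o in range(8):                  # objects 0..7 arrive one by one
    new_pairs = w.insert(o)

n = len(w.objects)
print(list(w.objects))              # [3, 4, 5, 6, 7]
print(len(new_pairs))               # 4 new pairs, i.e. N - 1 per arrival
print(n * (n - 1) // 2)             # 10 pairs in total, i.e. N(N - 1) / 2
print(w.pair_age(0, 4))             # 4: the pair expires together with its older object
```

Each arrival creates N - 1 new pairs, which is where the O(N) lower bound per new object comes from.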

Page 5

Contributions

Unified framework
• First to study top-k pairs queries over sliding windows.
• Supports arbitrarily complex scoring functions.
• Supports efficient queries for any window size n ≤ N and any k ≤ K.

                                            Lower bound    Expected cost for our algorithms
  Storage requirement                       O(N)           O(N) + O(K log(N/K)) for each scoring function
  Skyband maintenance cost for each object  O(N)           O(N (log(log N) + log K))
  Answering top-k pairs                     O(k)           O(log(log n) + log K + k)

Page 6

Preliminaries

Map all the pairs to an age–score space.

[Figure: pairs p1 to p10 formed from objects o0 to o4, plotted in the age–score space (x-axis: Age, ticks 1 to 4; y-axis: Score), with the top-2 pairs marked. Example: p1(o0, o1) maps to (p1.age, p1.score) = (1, 3).]

K-skyband [Papadias et al., TODS05] keeps the minimum set of candidate results.

p2 dominates p5 because p2.score < p5.score and p2 expires no earlier than p5.

Task 1: how to efficiently maintain the K-skyband.
Task 2: how to use the K-skyband to efficiently obtain the top-k pairs for any sliding window of size n ≤ N.

Skyband maintenance cost for each new object:
• Naive: O(N |SKB|) for checking all N-1 pairs
• Ours: O(N log |SKB|)

Expected size of the skyband is O(K log(N/K)).
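The K-skyband idea can be made concrete with a quadratic sketch. It assumes that a smaller age means a more recent pair, and that a dominating pair must have a strictly smaller score and stay in the window at least as long as the pair it dominates (age no larger). The function names and sample points are hypothetical, and this is a definition check rather than the maintenance algorithm of the talk.

```python
def dominates(p, q):
    """p and q are (age, score) tuples; smaller age means more recent (assumption)."""
    return p[1] < q[1] and p[0] <= q[0]

def k_skyband(points, K):
    """Keep every point that is dominated by fewer than K other points."""
    return [q for q in points
            if sum(dominates(p, q) for p in points) < K]

def top_k_in_window(points, max_age, k):
    """Top-k smallest scores among the points whose age is at most max_age."""
    return sorted((p for p in points if p[0] <= max_age), key=lambda p: p[1])[:k]

pairs = [(1, 3), (1, 5), (2, 1), (2, 4), (3, 2), (3, 6)]   # hypothetical (age, score)
skyband = k_skyband(pairs, K=2)
print(skyband)                                   # [(1, 3), (1, 5), (2, 1), (3, 2)]

# For k <= K, the skyband alone gives the same answer as the full set of pairs.
print(top_k_in_window(skyband, max_age=2, k=2))  # [(2, 1), (1, 3)]
print(top_k_in_window(pairs, max_age=2, k=2))    # [(2, 1), (1, 3)]
```

The pairs (2, 4) and (3, 6) are dropped because at least two more recent (or equally recent) pairs already beat them on score, so they can never enter a top-2 answer.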

Page 7

Efficient Skyband Maintenance

Can we find a boundary between the skyband points and non-skyband points?
• The K-staircase

How can we efficiently compute the K-staircase and K-skyband?
• Update the K-staircase and K-skyband in O(|SKB| log K) time.
• Check whether each new pair is dominated by the K-skyband in O(log |SKB|) time by binary search.

[Figure: 2-skyband in the age–score space (x-axis: Age; y-axis: Score) with pairs p1 to p7 and the K-staircase points s1 and s2.]
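One way to realise such a boundary is sketched below. This is an interpretation, not necessarily the construction used in the paper: for each distinct age it stores the K-th smallest score among the points that are at least as recent, so a dominance check becomes one binary search. It assumes (age, score) tuples where a smaller age means a more recent pair; `build_staircase` and `is_dominated` are hypothetical names.

```python
import heapq
from bisect import bisect_right

def build_staircase(points, K):
    """Return (ages, thresholds). thresholds[i] is the K-th smallest score among
    points with age <= ages[i], or infinity while fewer than K such points exist."""
    ages, thresholds = [], []
    heap = []   # negated scores: a max-heap holding the K smallest scores so far
    for age, score in sorted(points):
        if len(heap) < K:
            heapq.heappush(heap, -score)
        elif score < -heap[0]:
            heapq.heapreplace(heap, -score)
        kth = -heap[0] if len(heap) == K else float("inf")
        if ages and ages[-1] == age:
            thresholds[-1] = kth        # same age: keep only the final threshold
        else:
            ages.append(age)
            thresholds.append(kth)
    return ages, thresholds

def is_dominated(pair, staircase):
    """True if at least K points have a strictly smaller score and age <= pair's age."""
    ages, thresholds = staircase
    age, score = pair
    i = bisect_right(ages, age) - 1     # last staircase step at or before this age
    return i >= 0 and thresholds[i] < score

skyband = [(1, 3), (1, 5), (2, 1), (3, 2)]      # hypothetical 2-skyband
stairs = build_staircase(skyband, K=2)
print(stairs)                                   # ([1, 2, 3], [5, 3, 2])
print(is_dominated((2, 4), stairs))             # True: discard the new pair
print(is_dominated((1, 4), stairs))             # False: keep it as a candidate
```

Building the structure costs one heap operation of size K per skyband point, and each check is a single binary search, which lines up with the O(|SKB| log K) and O(log |SKB|) costs quoted above.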

Page 8

Efficient Query Answering

[Figure: 2-skyband in the age–score space (x-axis: Age; y-axis: Score) with pairs p1 to p8, shown for window size = N and for any window size n < N.]

Can we do better for any sliding window size n < N?

Use a Priority Search Tree to index the skyband points
• Self-balancing tree
• Efficient 3-sided range query

[Figure: Priority Search Tree with nodes labelled 6 p1, 3 p5, 1 p7, 4 p6, 2 p8, 9 p2, 8 p3, 5 p4.]

Page 9

Efficient Query Answering

[Figure: 2-skyband in the age–score space (x-axis: Age; y-axis: Score) with pairs p1 to p8, for any window size n < N.]

Our contribution: retrieve the top-k pairs in the 1-sided range.
• An algorithm similar to post-order traversal costs O(log |SKB| + k).

[Figure: Priority Search Tree with nodes labelled 6 p1, 3 p5, 1 p7, 4 p6, 2 p8, 9 p2, 8 p3, 5 p4.]
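The query on a priority search tree can be sketched as below. The tree is a min-heap on score and is split by age, so a 1-sided age range can prune whole subtrees. Two simplifications are assumed: the tree is built once from a sorted list rather than kept balanced under updates, and the search is a heap-driven best-first traversal swapped in for the post-order-like algorithm of the talk, so it does not reach the O(log |SKB| + k) bound. All names are hypothetical.

```python
import heapq
from itertools import count

class Node:
    def __init__(self, point, min_age, left, right):
        self.point = point        # (age, score) with the smallest score in this subtree
        self.min_age = min_age    # smallest age stored anywhere in this subtree
        self.left = left
        self.right = right

def build(points):
    """Build the tree from (age, score) points that are already sorted by age."""
    if not points:
        return None
    i = min(range(len(points)), key=lambda j: points[j][1])   # lowest score goes on top
    rest = points[:i] + points[i + 1:]                        # still sorted by age
    mid = len(rest) // 2
    return Node(points[i], points[0][0], build(rest[:mid]), build(rest[mid:]))

def top_k(root, max_age, k):
    """Return up to k points with age <= max_age, in increasing score order."""
    result, heap, tie = [], [], count()

    def push(node):
        # Skip subtrees that contain no point inside the age range.
        if node is not None and node.min_age <= max_age:
            heapq.heappush(heap, (node.point[1], next(tie), node))

    push(root)
    while heap and len(result) < k:
        _, _, node = heapq.heappop(heap)
        if node.point[0] <= max_age:
            result.append(node.point)
        push(node.left)
        push(node.right)
    return result

skyband = sorted([(1, 3), (1, 5), (2, 1), (3, 2)])   # hypothetical 2-skyband
tree = build(skyband)
print(top_k(tree, max_age=3, k=2))   # [(2, 1), (3, 2)]
print(top_k(tree, max_age=2, k=2))   # [(2, 1), (1, 3)]
```

Shrinking the age bound corresponds to querying a smaller window n < N over the same index, without rebuilding anything.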

Page 10

What else in the paper?

Efficient continuous queries on the skyband.
• Continuously monitoring the top-k results for any fixed k (k ≤ K) and n (n ≤ N).
• Amortized O(k/n (log |SKB| + k)) time per update.

Optimization on monotonic scoring functions.
• Handling the k-closest pairs and k-furthest pairs queries.
• Applying the Threshold Algorithm on sorted lists.
• Reducing the number of considered pairs for each new object from N to (d+1) N^(d/(d+1)) K^(1/(d+1)).

Page 11

Experimental Settings

Real dataset.
– Sensor data from the Intel research lab
– 2.3 million records

Synthetic data.
– Uniform, correlated and anti-correlated distributions
– 2 million objects
– Closest and furthest pairs in Manhattan distance

Scoring function score(o_x, o_y), defined over the terms |o_x.time - o_y.time|, |o_x.temp - o_y.temp| and |o_x.humidity - o_y.humidity|.

Page 12

Experiments (Overall Cost on real data)

SCase: our algorithm using the K-staircase to maintain the skyband.
Naïve: maintains kN pairs and sorts them by score.
LB: the lower bound cost.

[Figure: two plots, "Varying K" and "Varying N (in thousands)".]

Page 13

Experiments (Query Answering)

Linear: scans the skyband points to find the top-k pairs.
Snapshot: our snapshot query algorithm.
Continuous: our continuous query algorithm.
LB: an algorithm that obtains the top-k results in O(k) time.

[Figure: two plots, "Varying K" and "Varying |Q| (in thousands)".]

Page 14

Conclusion

• First to study a broad class of top-k pairs queries over sliding windows.

• We present efficient algorithms and show that their performance is reasonably close to the lower bound cost.

• We provide extensive experimental results on both real and synthetic data sets to show the efficiency and scalability of the proposed algorithms.

Page 15

Question and Answer

Thank You! Any Questions?

Page 16

Related Work

Top-k Query Processing
• Fagin's Algorithm (FA), Threshold Algorithm (TA), No Random Access (NRA)

Top-k Pairs Query Processing
• k-closest pairs queries
• k-furthest pairs queries
• Top-k pairs queries [Cheema et al., ICDE'11]

Data Stream Processing
• Top-k query processing over data streams [Mouratidis et al., SIGMOD'06]
• k-nearest neighbour queries [Böhm et al., ICDE'07]

Page 17

Experiments (Skyband Maintenance Algorithm)

Basic: maintenance algorithm without the K-staircase.
SCase: our algorithm using the K-staircase to maintain the skyband.
TA: optimized algorithm for monotonic scoring functions.
LB: the lower bound cost.

[Figure: two plots, "# of attributes" and "Varying K".]