Information Technology

Selecting Representative Objects Considering Coverage and Diversity

Shenlu Wang 1, Muhammad Aamir Cheema 2, Ying Zhang 3, Xuemin Lin 1

1 The University of New South Wales, Australia
2 Monash University, Australia
3 The University of Technology, Australia



Faculty of Information Technology

Outline

• Influence Sets
• Reverse k Nearest Neighbors Queries
• Reverse Top-k Queries
• Reverse Skyline Queries
• Representative Objects using Influence Sets
• Techniques
• Experiment Results
• Summary


Influence Set

In a data set consisting of facilities and users, a facility f influences a user u if u considers f as one of its most “important” facilities

The set of users influenced by f is called the influence set of f

[Figure: influence and influence set, drawn with users U1, U2 and facilities f1, f2; the highlighted example is the influence set of Coles]


Influence Set

A facility f is important for a user u if it is one of the top-k facilities for u considering her preferences, e.g., distance, rating, and price

[Figure callouts: “Important facility?” and “Who are my potential customers?”]


Influence Set

Significance
• Important for identifying potential users/customers
• Used in various applications such as marketing, cluster and outlier analysis, and decision support systems

Types
• Reverse k Nearest Neighbors
• Reverse Top-k
• Reverse Skyline



Reverse k Nearest Neighbors (RkNN)

• Definition of importance
– A facility f is important to a user u if f is one of its k closest facilities
• Reverse k Nearest Neighbors
– Find every user u for which the query facility q is important, i.e., q is one of its k closest facilities.

Example (k = 1):
Influence set of f1 is {u1, u2}
Influence set of f2 is {u3}

[Figure: facilities f1, f2 and users u1, u2, u3]
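The RkNN definition above can be sketched with a brute-force pass over the users. This is only an illustration: the coordinates are hypothetical (the slides give none) and were chosen so that the output agrees with the k = 1 example, and `rknn_influence_sets` is an illustrative name rather than an algorithm from the talk.

```python
from math import dist

def rknn_influence_sets(facilities, users, k):
    """Return {facility_id: set of user_ids} under the RkNN definition."""
    influence = {fid: set() for fid in facilities}
    for uid, u in users.items():
        # Rank the facilities by distance to this user and keep the k closest.
        closest = sorted(facilities, key=lambda fid: dist(u, facilities[fid]))[:k]
        for fid in closest:
            influence[fid].add(uid)
    return influence

# Hypothetical coordinates (not taken from the slides).
facilities = {"f1": (0.0, 0.0), "f2": (4.0, 0.0)}
users = {"u1": (1.0, 0.0), "u2": (1.5, 1.0), "u3": (3.5, 0.0)}

influence = rknn_influence_sets(facilities, users, k=1)
print(influence)
```

With these assumed positions, u1 and u2 are closest to f1 and u3 is closest to f2, which mirrors the influence sets listed in the example.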


RkNN Algorithms

Phases: pruning, verification

Pruning approaches
• Half-space: TPL (VLDB 2004), FINCH (VLDB 2008), InfZone (ICDE 2011)
• Region-based: Six-regions (SIGMOD 2000), SLICE (ICDE 2014)

Algorithms
• Six-regions (Stanoi et al., SIGMOD 2000)
• TPL (Tao et al., VLDB 2004)
• FINCH (Wu et al., VLDB 2008)
• Boost (Emrich et al., SIGMOD 2010)
• InfZone (Cheema et al., ICDE 2011)
• SLICE (Yang et al., ICDE 2014)


RkNN Algorithms

• Region-based pruning: Six-regions [Stanoi et al., SIGMOD 2000]

1. Divide the whole space centred at the query q into six equal regions
2. Find the k-th nearest neighbor in each partition
3. The k-th nearest facility of q in each region defines the area that can be pruned

The user points that cannot be pruned should be verified by a range query

[Figure: k = 2; query q, points a, b, c, d, and users u1, u2]
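The three steps and the verification note above can be sketched as follows. This is a simplified illustration under stated assumptions, not the published Six-regions implementation: it works on 2D points, uses linear scans instead of an index, replaces the range query with a direct count, and assumes no facility coincides with q. The function names and the random test data are hypothetical.

```python
import math
import random

def region_of(q, p):
    """Index (0..5) of the 60-degree region around q that contains p."""
    angle = math.atan2(p[1] - q[1], p[0] - q[0]) % (2 * math.pi)
    return min(int(angle // (math.pi / 3)), 5)

def six_regions_rknn(q, facilities, users, k):
    """RkNN of q: region-based pruning followed by verification."""
    # Pruning phase: per region, the distance from q to its k-th nearest facility.
    per_region = [[] for _ in range(6)]
    for f in facilities:
        per_region[region_of(q, f)].append(math.dist(q, f))
    bound = [sorted(d)[k - 1] if len(d) >= k else math.inf for d in per_region]

    answer = []
    for u in users:
        d_uq = math.dist(u, q)
        if d_uq > bound[region_of(q, u)]:
            continue  # pruned: k facilities in this region are closer to u than q is
        # Verification phase: count the facilities strictly closer to u than q is.
        closer = sum(1 for f in facilities if math.dist(u, f) < d_uq)
        if closer < k:
            answer.append(u)
    return answer

def brute_force_rknn(q, facilities, users, k):
    """Reference answer without pruning, used only to check the sketch."""
    return [u for u in users
            if sum(1 for f in facilities if math.dist(u, f) < math.dist(u, q)) < k]

# Hypothetical random data.
random.seed(7)
q = (0.5, 0.5)
facilities = [(random.random(), random.random()) for _ in range(50)]
users = [(random.random(), random.random()) for _ in range(200)]
result = six_regions_rknn(q, facilities, users, k=2)
```

The pruning is safe because, inside a 60-degree region with apex q, any facility that is closer to q than the user is must also be closer to the user than q is.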


RkNN Algorithms

• Half-space pruning: the space that is contained in k half-spaces can be pruned
– TPL [Tao et al., VLDB 2004]

1. Find the nearest facility f in the unpruned area
2. Draw the bisector between q and f and prune using the half-space
3. Go to step 1 unless all facilities in the unpruned area have been accessed

Checking which k half-spaces prune a point/node is expensive
TPL++ [Yang et al., PVLDB 2015]

[Figure: k = 2; query q, points a, b, c, d, and user u]
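The half-space test behind this pruning rule can be written as a linear inequality. The sketch below is illustrative only: it checks every facility directly, which is exactly the cost the slide calls expensive, and the function names and sample points are assumptions rather than part of TPL.

```python
def in_pruning_half_space(p, q, f):
    """True if p lies on f's side of the perpendicular bisector of q and f."""
    # dist(p, f) < dist(p, q)  <=>  2 (f - q) . p > |f|^2 - |q|^2
    lhs = 2 * ((f[0] - q[0]) * p[0] + (f[1] - q[1]) * p[1])
    rhs = (f[0] ** 2 + f[1] ** 2) - (q[0] ** 2 + q[1] ** 2)
    return lhs > rhs

def pruned_by_half_spaces(p, q, facilities, k):
    """p can be pruned if at least k half-spaces contain it."""
    return sum(in_pruning_half_space(p, q, f) for f in facilities) >= k

# Hypothetical points (not taken from the slides).
q = (0.0, 0.0)
a, b = (2.0, 0.0), (0.0, 2.0)
print(pruned_by_half_spaces((3.0, 3.0), q, [a, b], k=2))  # inside both half-spaces
print(pruned_by_half_spaces((1.5, 0.0), q, [a, b], k=2))  # inside only a's half-space
```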


RkNN Algorithms

• FINCH [Wu et al., VLDB 2008]
– Approximate the unpruned area by a convex polygon

[Figure: k = 2; query q and points a, b, c, d]


RkNN Algorithms

• InfZone [Cheema et al., ICDE 2011]

1. The influence zone corresponds to the unpruned area when the bisectors of all the facilities have been considered for pruning
2. A user u is an RkNN of q if and only if u lies inside the influence zone
3. No verification phase is needed

[Figure: k = 2; query q and points a, b, c, d]


RkNN Algorithms

• SLICE [Yang et al., ICDE 2014]

1. Divide the whole space centred at the query q into t equal regions
2. Draw arcs for each facility
3. The k-th arc in each partition defines the pruning region

Pruning requires checking only one distance

[Figure: k = 2; query q and facilities f1, f2]



Influence Set based on Reverse Top-k

• Definition of importance
– Each user u has a preference function
– A facility f is important to a user u if f is one of the top-k facilities for u
• Reverse Top-k Query (RTk)
– Find every user u for which the query facility q is one of her top-k facilities.

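Example (k = 1):
Influence set of f1 is {u2}
Influence set of f2 is {u1, u3}

[Figure: facilities f1 and f2 annotated with prices, users u1, u2, u3, annotated distances, and the preference functions 0.9*price + 0.1*distance, 0.5*price + 0.5*distance, and 1*distance]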
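A reverse top-k query can be sketched by scoring every facility with each user's preference function. The sketch assumes that a lower score is better (both price and distance are minimised). The positions, the prices, and the assignment of the three preference functions to users are hypothetical, picked so that the result agrees with the influence sets in the example; the figure's actual values are not recoverable from the text.

```python
from math import dist

def reverse_top_k(query_id, facilities, users, k):
    """Users for which facility `query_id` is among their k best-scoring facilities."""
    result = set()
    for uid, (location, w_price, w_dist) in users.items():
        def score(fid):
            f_location, price = facilities[fid]
            return w_price * price + w_dist * dist(location, f_location)
        q_score = score(query_id)
        # The query is in the top-k if fewer than k other facilities score strictly better.
        better = sum(1 for fid in facilities if fid != query_id and score(fid) < q_score)
        if better < k:
            result.add(uid)
    return result

# Hypothetical data: facility -> (location, price); user -> (location, w_price, w_distance).
facilities = {"f1": ((0.0, 0.0), 2.0), "f2": ((4.0, 0.0), 1.0)}
users = {
    "u1": ((1.0, 0.0), 0.9, 0.1),
    "u2": ((1.5, 1.0), 0.0, 1.0),
    "u3": ((3.5, 0.0), 0.5, 0.5),
}

rtk_f1 = reverse_top_k("f1", facilities, users, k=1)
rtk_f2 = reverse_top_k("f2", facilities, users, k=1)
print(rtk_f1, rtk_f2)
```

In this assumed setting u1 weights price heavily and therefore prefers the cheaper f2 even though f1 is nearer, which is why the influence sets differ from the purely distance-based RkNN case.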


Existing work on Reverse Top-k

• Vlachou et al., “Reverse top-k queries”, ICDE 2010
• Chester et al., “Indexing reverse top-k queries in two dimensions”, DASFAA 2013
• Cheema et al., “A Unified Framework for Efficiently Processing Ranking Related Queries”, EDBT 2014
• Vlachou et al., “Branch-and-bound algorithm for reverse top-k queries”, SIGMOD 2013
• Ge et al., “Efficient all top-k computation: A unified solution for all top-k, reverse top-k and top-m influential queries”, TKDE 2013
• Vlachou et al., “Monitoring reverse top-k queries over mobile devices”, MobiDE 2011
• Yu et al., “Processing a large number of continuous preference top-k queries”, SIGMOD 2012



Influence Set based on Reverse Skyline

• Dominance
– A facility x dominates another facility y w.r.t. a user u if, for every attribute, u prefers x over y
• Definition of importance
– A facility f is important to a user u if f is not dominated by any other facility
• Reverse Skyline
– Find every user u for which the query facility q is not dominated by any other facility.

Example:
Influence set of f1 is {u1, u2}
Influence set of f2 is {u1, u2, u3}

[Figure: facilities f1 and f2 with price labels Price=1 and Price=2, and users u1, u2, u3]
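The dominance test and the reverse skyline query can be sketched as below. The sketch assumes two attributes, price and distance to the user, with smaller values preferred, and it uses strict preference on every attribute as the definition is worded. Which facility carries which price, and all coordinates, are hypothetical choices made so that the output agrees with the example.

```python
from math import dist

def dominates(x, y, user):
    """x dominates y w.r.t. user: the user prefers x on every attribute."""
    (x_loc, x_price), (y_loc, y_price) = x, y
    return x_price < y_price and dist(user, x_loc) < dist(user, y_loc)

def reverse_skyline(query_id, facilities, users):
    """Users for which no other facility dominates the query facility."""
    q = facilities[query_id]
    return {uid for uid, u in users.items()
            if not any(dominates(f, q, u)
                       for fid, f in facilities.items() if fid != query_id)}

# Hypothetical data: facility -> (location, price); user -> location.
facilities = {"f1": ((0.0, 0.0), 2.0), "f2": ((4.0, 0.0), 1.0)}
users = {"u1": (1.0, 0.0), "u2": (1.5, 1.0), "u3": (3.5, 0.0)}

rsl_f1 = reverse_skyline("f1", facilities, users)
rsl_f2 = reverse_skyline("f2", facilities, users)
print(rsl_f1, rsl_f2)
```

Here f2 is cheaper, so nothing can dominate it for any user, while f1 is dominated only for u3, who is also closer to f2.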


Existing work on Reverse Skylines

• Dellis et al., “Efficient computation of reverse skyline queries”, VLDB 2007
• Lian et al., “Reverse skyline search in uncertain databases”, TODS 2010
• Prasad et al., “Efficient reverse skyline retrieval with arbitrary non-metric similarity measures”, EDBT 2011
• Wang et al., “Energy-efficient reverse skyline queries processing over wireless sensor networks”, TKDE 2012
• Wu et al., “Finding the influence set through skylines”, EDBT 2009



Representative Objects

Given a set of facilities and a set of users, choose t representative facilities considering coverage and diversity

Coverage
• Let I(f) denote the influence set of a facility f.
• Given a set of facilities F, its coverage is the total number of distinct users that are influenced by the facilities in F

• Koh et al., “Finding k most favorite products based on reverse top-t queries”, VLDB J. 2014

• Gkorgkas et al., “ Finding the most diverse products using preference queries”, EDBT 2015
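Coverage as defined above is the size of the union of the influence sets. A minimal sketch, where `coverage` and the `influence` dictionary are illustrative names and the sets reuse the reverse skyline example:

```python
def coverage(selected, influence):
    """Number of distinct users influenced by at least one selected facility."""
    covered = set()
    for fid in selected:
        covered |= influence[fid]
    return len(covered)

# Influence sets from the reverse skyline example.
influence = {"f1": {"u1", "u2"}, "f2": {"u1", "u2", "u3"}}
print(coverage(["f1", "f2"], influence))
```

Users u1 and u2 appear in both influence sets but are counted once, so the coverage of {f1, f2} is 3 rather than 5.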


Representative Objects

Diversity
• Let I(f) denote the influence set of a facility f.
• Dissimilarity between two facilities is defined based on the Jaccard similarity of their influence sets
• Diversity of a set of facilities F is the minimum of the pairwise dissimilarities between the facilities in the set
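The diversity measure can be sketched as follows. The slide says only that dissimilarity is based on Jaccard similarity; the sketch assumes the common form 1 − |I(a) ∩ I(b)| / |I(a) ∪ I(b)|. The handling of empty sets and of sets with fewer than two facilities is also an assumption, as are the example influence sets.

```python
from itertools import combinations

def dissimilarity(a, b):
    """Assumed form: 1 - Jaccard similarity of two influence sets."""
    union = a | b
    if not union:
        return 0.0  # assumption: two empty influence sets count as identical
    return 1.0 - len(a & b) / len(union)

def diversity(selected, influence):
    """Minimum pairwise dissimilarity among the selected facilities."""
    pairs = combinations(selected, 2)
    # assumption: a set with fewer than two facilities gets diversity 1.0
    return min((dissimilarity(influence[x], influence[y]) for x, y in pairs), default=1.0)

# Hypothetical influence sets.
influence = {"f1": {"u1", "u2"}, "f2": {"u1", "u2", "u3"}, "f3": {"u4"}}
print(diversity(["f1", "f2", "f3"], influence))
print(diversity(["f2", "f3"], influence))
```

Because the minimum is taken, a single pair of facilities with heavily overlapping influence sets (f1 and f2 here) pulls the diversity of the whole set down.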


Representative Objects

Problem Definition
• The score of a set of facilities F combines its coverage and its diversity
• Given a set of facilities and a set of users, return a set of t facilities with maximum score.



Techniques

Challenges
• Problem is NP-Hard
• Requires computing influence sets for many facilities
• Requires set intersection and union operations to compute diversity


Techniques

Phase 1: Compute influence sets
• Prune the facilities that cannot be among the representative facilities
• Compute influence sets of remaining facilities

Phase 2: Greedy Algorithm
• Iteratively select a facility f that maximizes the score of the current set
• Stop when t facilities have been selected
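The greedy phase can be sketched once the influence sets are available. The score formula does not survive in the extracted slides, so the weighted sum of normalised coverage and diversity used here, together with the weight `lam`, is purely an assumption for illustration. The pruning step of Phase 1 is omitted, and the influence sets are hypothetical.

```python
from itertools import combinations

def coverage(selected, influence):
    return len(set().union(*(influence[f] for f in selected)))

def diversity(selected, influence):
    def dissim(a, b):
        union = a | b
        return 1.0 - len(a & b) / len(union) if union else 0.0
    return min((dissim(influence[x], influence[y])
                for x, y in combinations(selected, 2)), default=1.0)

def score(selected, influence, num_users, lam=0.5):
    # ASSUMPTION: the real score function is not given in the extracted text.
    return (lam * coverage(selected, influence) / num_users
            + (1 - lam) * diversity(selected, influence))

def greedy_select(influence, num_users, t, lam=0.5):
    """Repeatedly add the facility that maximises the score of the current set."""
    selected = []
    candidates = sorted(influence)  # sorted for deterministic tie-breaking
    while len(selected) < t and candidates:
        best = max(candidates,
                   key=lambda f: score(selected + [f], influence, num_users, lam))
        selected.append(best)
        candidates.remove(best)
    return selected

# Hypothetical influence sets over four users.
influence = {"f1": {"u1", "u2"}, "f2": {"u1", "u2", "u3"}, "f3": {"u4"}}
chosen = greedy_select(influence, num_users=4, t=2)
print(chosen)
```

Under this assumed score the first pick is f2 (largest coverage), and the second is f3 rather than f1, because f1 adds no new users and overlaps heavily with f2.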


Techniques

Phase 1: Compute influence sets
• Prune the facilities that cannot be among the representative facilities
• Compute influence sets of remaining facilities

1. Apply an existing reverse top-k algorithm for each remaining facility (RTK)
2. Compute the top-k facilities for each user and populate the influence set of each facility
a) Use a branch-and-bound top-k algorithm for each user (TK)
b) Use a brute-force algorithm to compute the top-k for each user (NBF)
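The second option, computing the top-k per user and populating the influence sets, can be sketched in its brute-force form. The linear preference function over price and distance, the lower-is-better convention, and all data values are assumptions carried over from the reverse top-k example; the branch-and-bound variant and the pruning step are not shown.

```python
import heapq
from math import dist

def influence_sets_via_top_k(facilities, users, k):
    """Populate every facility's influence set from each user's top-k list."""
    influence = {fid: set() for fid in facilities}
    for uid, (location, w_price, w_dist) in users.items():
        def score(fid):
            f_location, price = facilities[fid]
            return w_price * price + w_dist * dist(location, f_location)
        # Brute force: score every facility and keep the k lowest scores.
        for fid in heapq.nsmallest(k, facilities, key=score):
            influence[fid].add(uid)
    return influence

# Hypothetical data: facility -> (location, price); user -> (location, w_price, w_distance).
facilities = {"f1": ((0.0, 0.0), 2.0), "f2": ((4.0, 0.0), 1.0)}
users = {
    "u1": ((1.0, 0.0), 0.9, 0.1),
    "u2": ((1.5, 1.0), 0.0, 1.0),
    "u3": ((3.5, 0.0), 0.5, 0.5),
}

influence = influence_sets_via_top_k(facilities, users, k=1)
print(influence)
```

One pass over the users fills the influence sets of all facilities at once, whereas the reverse top-k route answers one query per facility.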


Techniques

Phase 2: Greedy Algorithm
• Iteratively select a facility f that maximizes the score of the current set
• Stop when t facilities have been selected

Selecting f requires computing set intersection and union operations

1. Compute exact set operations (ESO)
2. Compute approximate set intersection and union (MK)
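The slides do not say how the approximate set operations are computed or what MK stands for. As one well-known possibility, and not necessarily the authors' method, MinHash signatures estimate the Jaccard similarity of two sets without intersecting them, and the signature of a union is the element-wise minimum of the two signatures. The function names, the number of hash functions, and the example sets below are assumptions, and the sets are assumed to be non-empty.

```python
import hashlib

def _hash(seed, item):
    """Deterministic 64-bit hash of (seed, item)."""
    data = f"{seed}:{item}".encode()
    return int.from_bytes(hashlib.blake2b(data, digest_size=8).digest(), "big")

def minhash_signature(items, num_hashes=128):
    """One minimum per hash function; assumes `items` is non-empty."""
    return [min(_hash(seed, item) for item in items) for seed in range(num_hashes)]

def estimate_jaccard(sig_a, sig_b):
    """Fraction of positions where the two signatures agree."""
    matches = sum(1 for a, b in zip(sig_a, sig_b) if a == b)
    return matches / len(sig_a)

def union_signature(sig_a, sig_b):
    """Signature of the union of the two underlying sets."""
    return [min(a, b) for a, b in zip(sig_a, sig_b)]

# Hypothetical influence sets.
set_a = {"u1", "u2", "u3", "u4"}
set_b = {"u3", "u4", "u5", "u6"}
sig_a, sig_b = minhash_signature(set_a), minhash_signature(set_b)
print(estimate_jaccard(sig_a, sig_b))  # an estimate of the true value 2/6
```

A larger number of hash functions gives a tighter estimate at the cost of longer signatures.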



Experimental Results




Summary

We studied the problem of computing representative objects using influence sets based on reverse top-k queries

We proposed a two-phase greedy algorithm with an approximation guarantee

Experimental results demonstrate that the greedy algorithms produce high-quality results


Thanks