CUbRIK research at SIGMOD 2012


Top-k bounded diversification

Piero Fraternali, Davide Martinenghi, Marco Tagliasacchi

Politecnico di Milano, Italy

Scottsdale, AZ, USA - May 24, 2012

+Motivation

Diversification is useful in application domains where objects can be described by a score and a 2- or 3-dimensional feature vector

Many examples from search (real estate, image search, …):

Apartments distributed over a map: score (e.g., price) + 2D feature vector (geo-localization)

Evolution in time of the price of apartments over a map: score (e.g., price) + 3D feature vector (geo-localization + time)

Properties of images (e.g., HSI color features): score (e.g., relevance to a given keyword) + 3D feature vector (e.g., average HSI components in the image)

+Diversified result set

Looking for good restaurants in Milan

[Figure: the top 15 restaurants compared with the top 15 diversified over the region]

+Diversification

We are given a set O of N objects. Each object o has a vector-space representation and a relevance score.

Diversification problem: find the best diversified set of K objects from the set of objects, i.e., the set that maximizes an objective function combining relevance to the query (as score) and diversity (as distance)

+ Greedy approach to diversification

Diversification problems are NP-hard

Approximate greedy algorithms are needed

MMR (Maximum Marginal Relevance) is a well-known greedy algorithm with good quality of result (i.e., value of the objective function)

Find K objects that are both relevant and diverse

At each step, pick the object with the largest diversity-weighted score

K steps in total







Diversity-weighted score: a balance between relevance and diversity, obtained by combining a relevance term and a diversity term

[Equations: diversity-weighted score and corresponding objective function]






Main disadvantage: All objects must be available from the beginning
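The greedy selection performed by MMR can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the trade-off weight `lam`, the Euclidean distance, and the form of the diversity-weighted score (a weighted sum of the score and the minimal distance from the already selected objects) are assumptions, and the sample data is invented.

```python
import math

def mmr(objects, k, lam=0.5):
    """Greedy MMR sketch over a fully available list of (score, point) pairs.

    Assumed (hypothetical) diversity-weighted score:
        (1 - lam) * score + lam * (minimal distance from the selected objects)
    Returns the indices of the selected objects, in selection order.
    """
    selected = []
    remaining = list(range(len(objects)))

    def weighted(i):
        score, point = objects[i]
        # With nothing selected yet, the diversity term is taken to be 0.
        diversity = min(
            (math.dist(point, objects[j][1]) for j in selected), default=0.0
        )
        return (1 - lam) * score + lam * diversity

    while remaining and len(selected) < k:
        best = max(remaining, key=weighted)  # one greedy step
        selected.append(best)
        remaining.remove(best)
    return selected

# Invented sample data: (score, (x, y)) pairs.
restaurants = [(1.0, (0.0, 0.0)), (0.9, (0.1, 0.0)), (0.5, (1.0, 1.0))]
print(mmr(restaurants, k=2, lam=0.5))
```

Every step scans all remaining objects, which is why the whole set must be available from the beginning.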



+Bounded diversification

Objects are embedded in a bounded region of space

E.g., a bounding rectangle

Accessing objects is costly

Objects are progressively accessed (not available at time 0)

The number of accessed objects (sumDepths) should be minimized

Indexes for sorted access to objects are available

Access by score (in descending order)

Access by distance from a given point (in ascending order)

Both are very common in services on the Web (e.g., apartments search)
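The two sorted access modes, together with the sumDepths cost, can be simulated as follows. This is only a sketch under assumptions: the class `SortedAccess` and its method names are hypothetical, objects are (score, point) pairs, and an in-memory sort stands in for what would be a remote service or an index.

```python
import math

class SortedAccess:
    """Simulated sorted-access indexes over (score, point) pairs (hypothetical helper)."""

    def __init__(self, objects):
        self.objects = list(objects)
        self.sum_depths = 0  # total number of accessed objects

    def by_score(self):
        """Yield objects by descending score, one access at a time."""
        for obj in sorted(self.objects, key=lambda o: o[0], reverse=True):
            self.sum_depths += 1
            yield obj

    def by_distance(self, q):
        """Yield objects by ascending distance from the point q."""
        for obj in sorted(self.objects, key=lambda o: math.dist(o[1], q)):
            self.sum_depths += 1
            yield obj
```

Because both methods are generators, an object counts toward `sum_depths` only when it is actually pulled.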

+Distance-based access

Restaurants by distance from a given point q

[Figure: size of icon proportional to score]

+Score-based access

Restaurants by score

[Figure: size of icon proportional to score]

+ Attacking bounded diversification

The Pull-Bound MMR (PBMMR) template

Goal: achieve the same quality of result as MMR, but minimizing the number of accessed objects

K iterations: within each of them, repeat the following as long as needed

Pulling strategy: choose an access method (by score or by distance)

If by distance, choose from which point (probing location)

Bounding scheme: compute an upper bound on the diversity-weighted score that can be achieved by unseen objects

If a seen object exceeds the bound, select it and move on to the next iteration

Credits to [Schnaitter&Polyzotis 2008] for their Pull-Bound Rank Join template
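The pull/bound loop can be illustrated with a deliberately simplified sketch. The assumptions are significant: only score-based access is used, the diversity-weighted score is taken to be `(1 - lam) * score + lam * minimal distance` (a hypothetical form), and the bound uses a loose `diameter` of the bounding region rather than the tight bound discussed later in the talk. The function name is hypothetical as well.

```python
import math

def pbmmr_score_access(objects, k, lam, diameter):
    """Pull-bound sketch using score-based access only.

    objects:  list of (score, point) pairs
    diameter: assumed upper bound on any distance inside the bounding region
    Returns (indices of the selected objects, number of accessed objects).
    """
    order = sorted(range(len(objects)), key=lambda i: objects[i][0], reverse=True)
    seen, selected, depth = [], [], 0

    def weighted(i):
        score, point = objects[i]
        diversity = min(
            (math.dist(point, objects[j][1]) for j in selected), default=0.0
        )
        return (1 - lam) * score + lam * diversity

    for _ in range(min(k, len(objects))):
        while True:
            candidates = [i for i in seen if i not in selected]
            best = max(candidates, key=weighted, default=None)
            if depth == len(order):
                bound = -math.inf  # nothing is left unseen
            elif depth == 0:
                bound = math.inf   # nothing has been seen yet
            else:
                last_score = objects[order[depth - 1]][0]
                max_diversity = diameter if selected else 0.0
                # Loose upper bound on the weighted score of any unseen object.
                bound = (1 - lam) * last_score + lam * max_diversity
            if best is not None and weighted(best) >= bound:
                selected.append(best)  # no unseen object can beat it
                break
            seen.append(order[depth])  # pull one more object by score
            depth += 1
    return selected, depth
```

Since every unseen object has a score no larger than the last one seen and a diversity no larger than `diameter`, a seen object that reaches the bound is also the choice plain MMR would make (up to ties). A tighter bound lets the loop stop earlier.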

+Choosing probing locations

Goal of distance-based access: exploring the region of space in which the object with the best diversity-weighted score is most likely to be found

At each of the K iterations, we fix the probing locations at the most promising points of the unexplored space

Vertices of the bounded Voronoi diagram of the points selected at the previous iterations

Of these, the most promising ones are as far as possible from all the objects of the current selection

+Example

4 objects x1, …, x4 selected during the first 4 iterations

Bounding region is a square

[Figure: Voronoi diagram of the selected objects, with the probing locations]

+Example

A new object is selected


+Example

Bounded Voronoi diagram of selected objects

Probing locations: v1, …, v4 (vertices of the bounding region)

Shading: distance from closest points (brightest in vertices)

Probing locations: v1, …, v6 (vertices of bounded Voronoi diagram)

The local maxima of the function “distance from the closest point between x1 and x2” are among v1, …, v6

Probing locations: v1, …, v8

The local maxima of the function “distance from the closest point among x1, …, x3” are among v1, …, v8

Probing locations: v1, …, v10

Probing locations: v1, …, v12 (no other intersection in region)
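The candidate probing locations in the example can be enumerated with elementary geometry. The sketch below assumes a rectangular bounding region and Euclidean distance, and its function names are hypothetical. It does not build the Voronoi diagram itself: it collects a superset of its bounded vertices (corners of the rectangle, crossings of pairwise bisectors with the boundary, circumcenters of triples) and then keeps the candidate farthest from its closest selected object.

```python
import itertools
import math

def bisector(p, q):
    """Coefficients (a, b, c) of the perpendicular bisector a*x + b*y = c."""
    a, b = 2 * (q[0] - p[0]), 2 * (q[1] - p[1])
    c = q[0] ** 2 + q[1] ** 2 - p[0] ** 2 - p[1] ** 2
    return a, b, c

def probing_candidates(selected, xmin, ymin, xmax, ymax):
    """Superset of the vertices of the bounded Voronoi diagram of `selected`."""
    cands = {(xmin, ymin), (xmin, ymax), (xmax, ymin), (xmax, ymax)}
    for p, q in itertools.combinations(selected, 2):
        a, b, c = bisector(p, q)
        if b != 0:  # crossings with the vertical sides
            cands.update((x, (c - a * x) / b) for x in (xmin, xmax))
        if a != 0:  # crossings with the horizontal sides
            cands.update(((c - b * y) / a, y) for y in (ymin, ymax))
    for p, q, r in itertools.combinations(selected, 3):
        a1, b1, c1 = bisector(p, q)
        a2, b2, c2 = bisector(p, r)
        det = a1 * b2 - a2 * b1
        if det != 0:  # collinear triples have no circumcenter
            cands.add(((c1 * b2 - c2 * b1) / det, (a1 * c2 - a2 * c1) / det))
    # Keep only the candidates that fall inside the bounding rectangle.
    return [v for v in cands if xmin <= v[0] <= xmax and ymin <= v[1] <= ymax]

def most_promising(selected, xmin, ymin, xmax, ymax):
    """Candidate with the largest distance from its closest selected object."""
    def min_dist(v):
        return min(math.dist(v, s) for s in selected)
    return max(probing_candidates(selected, xmin, ymin, xmax, ymax), key=min_dist)
```

With one selected object in a square, the candidates are the four corners; with two objects placed side by side, the bisector adds two boundary crossings. With three or more objects some candidates are not true Voronoi vertices, which is harmless here because each candidate is scored by its actual minimal distance.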

+Example

A running state

Inside red circumferences: explored region

Pink discs: objects retrieved by distance-based access

+Bounding scheme

Computing a tight upper bound

A bound is tight if it can be achieved in some hypothetical continuation of the instance being explored

A tight upper bound can be computed from the following quantities:

Highest score possible (last seen by score-based access)

Maximal minimal distance from the selected objects, where the maximum is taken over the unexplored region of space and the minimum over the set of selected objects



Theorem: the point x* that maximizes the minimal distance from all the selected objects is a vertex of the convex hull of the unexplored part of a cell of the bounded Voronoi diagram

Theorem: the bound obtained in this way is tight
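In symbols, the bound can be sketched as follows. The notation is an assumption made for illustration and is not taken from the slides: $\lambda \in [0,1]$ is the relevance/diversity trade-off, $\bar{s}$ is the last score seen by score-based access, $S$ is the set of selected objects, $U$ is the unexplored region of space, $x_o$ is the vector-space representation of object $o$, and $\delta$ is the distance. The weighted-sum form is also an assumption.

```latex
\tau \;=\; (1-\lambda)\,\bar{s} \;+\; \lambda \,\max_{x \in U}\; \min_{o \in S}\; \delta(x, x_o)
```

Under this reading, the point $x^*$ of the first theorem is the maximizer of the second term, so evaluating the bound reduces to checking a finite set of vertices.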

+Selecting the next probing location

In 2D, the point maximizing the minimal distance can only be:

A vertex of the bounded Voronoi diagram

An intersection between an edge and a circumference

An intersection between two circumferences

The corresponding vertex is selected as the next probing location







[Figure: point maximizing the minimal distance, and the vertex selected as the next probing location]

+Pulling strategy

Round robin: select, in alternation, each probing location

Some loose form of instance optimality can already be achieved with a tight bounding scheme and round robin

Potential adaptive: choose the probing location that is most likely to reduce the upper bound

Potential adaptive is never worse than round robin

Choice between access by score or by distance: looking at how they reduce the upper bound with respect to the number of accessed objects
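The round-robin strategy is simple enough to sketch directly. The code below is an illustration under assumptions: objects are plain 2D points, distance-based access is simulated by sorting in memory, and the function name and `budget` parameter are hypothetical. The potential-adaptive strategy is not shown, since it depends on how each access would change the upper bound.

```python
import itertools
import math

def round_robin_pull(points, probes, budget):
    """Pull up to `budget` objects, visiting the probing locations in alternation.

    Returns a list of (probe index, point) pairs in access order.
    """
    # One simulated distance-sorted stream per probing location.
    streams = [
        iter(sorted(points, key=lambda p, q=q: math.dist(p, q))) for q in probes
    ]
    accesses = []
    if not streams:
        return accesses
    for turn in itertools.count():
        if len(accesses) >= budget:
            break
        i = turn % len(streams)
        obj = next(streams[i], None)
        if obj is None:
            break  # streams have equal length, so all of them are exhausted
        accesses.append((i, obj))
    return accesses
```

Each probing location advances by one object per round, so no location is explored more deeply than the others by more than one access.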

+Batched access

In the model so far, objects are accessed one by one

Not practical for many scenarios

“Batched access” modes are available in many practical systems: give a point and a radius and receive all objects that fall within

Strategy with batched access:

Perform exactly one request per probing location with an optimal choice of the radius

This amounts to solving an optimization problem that

Minimizes the threshold by appropriately choosing the radii

Is subject to a budget constraint (how many objects am I willing to retrieve)
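A batched request and the budget check can be sketched as below. This is only an illustration: the function names are hypothetical, objects are plain 2D points, and the radii are given as input. The optimization that chooses the radii is not reproduced here.

```python
import math

def batched_access(points, center, radius):
    """Return all points within `radius` of `center`, nearest first."""
    hits = [p for p in points if math.dist(p, center) <= radius]
    return sorted(hits, key=lambda p: math.dist(p, center))

def requests_within_budget(points, probes, radii, budget):
    """Issue one request per probing location and check the budget constraint.

    Returns (list of batches, True if the number of retrieved objects fits the budget).
    """
    batches = [batched_access(points, q, r) for q, r in zip(probes, radii)]
    retrieved = sum(len(batch) for batch in batches)
    return batches, retrieved <= budget
```

An optimizer would search over `radii` for the assignment that lowers the threshold the most while keeping the second returned value true.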

+Experiments

Synthetic data, uniform distribution

+Experiments

Synthetic data, exponential distribution

+Experiments

Real data

+Conclusion

Diversification revisited

Sorted access modes to avoid accessing all objects

Same quality as MMR

A structured template with bounding scheme and pulling strategy

Optimality guarantees with one-by-one access to objects

Tight bound

Instance optimality (in a loose sense)

Extreme practical efficiency with batched access mode

Future work: adaptation to other diversification algorithms

+Acknowledgments: CUbRIK Project

CUbRIK is a research project financed by the European Union

Goals:

Advance the architecture of multimedia search

Exploit the human contribution in multimedia search

Use open-source components provided by the community

Start up a search business ecosystem

http://www.cubrikproject.eu/
