Using the Crowd for Top-K or Group-By Queries Susan Davidson U. of Pennsylvania Sanjeev Khanna U. of Pennsylvania Tova Milo Tel Aviv U. Sudeepa Roy U

ICDT 2013 1

Using the Crowd for Top-K or Group-By Queries

Susan Davidson U. of PennsylvaniaSanjeev Khanna U. of Pennsylvania

Tova Milo Tel Aviv U.Sudeepa Roy U. of Washington

3/20/2013

ICDT 2013 2

Example: Using Wisdom of Crowd in DB Queries

Database of football players pictures– Multiple photos at different

ages of each player

Q. Group the photos of individual players

Q. Find their most recent photos

3/20/2013

ICDT 2013 3

How a DBMS ThinksQ. Group the photos of

individual players

Group-By Queries!Use “Name” attribute


Max/Top-k Queries!Use “Date” attribute

What if name/date is missingImage processing? Photo forensics?

3/20/2013

ICDT 2013 4

Ask the “Crowd”!

3/20/2013

ICDT 2013 5

How the Crowd Thinks - 1

Q. Group the photos of individual players

3/20/2013

ICDT 2013 6

How the Crowd Thinks - 2


3/20/2013

ICDT 2013 7

Crowd Sourcing

3/20/2013

• Using human intelligence to do tasks which are harder to automate

• A recent topic of interest in database and other research communities

• Many crowdsourcing platforms

ICDT 2013 8

This talk: Using Wisdom of Crowd in Top-K and Group-By Queries

SELECT TOP 1 R.picture

FROM SoccerPlayerPhotoTable AS R

GROUP BY R.player

ORDER BY R.date DESC

Top-k/max

Group By / Clustering

A possible query

Fixed but unknown attributes: R.player, R.date

3/20/2013

ICDT 2013 9

Outline of this talk

• Our Model • Max and Top-K Queries• Group-By Queries (Clustering)

Combination is easy..

3/20/2013

ICDT 2013 10



3/20/2013

ICDT 2013 11

Elements: Type and Value• #Elements = n

– (n = 16)– Two attributes: Type and Value

• Type– e.g. Name = “Maradona” – used in Group-By– #Types = J (J clusters, J = 4)

• Value– e.g. Date when photo was taken– used in Top-K

• Unknown, but “ground-truth’’ exists for Types and Values

3/20/2013

ICDT 2013 12

Type and Value Comparisons• DB Queries vs. Comparisons (questions to the crowd)

• Type Comparisons:– Type(x) = Type(y)?

• Value Comparisons:– Value(x) > Value(y)?– Assumes same type

Same Person?Is the first photo older?

• Answer is Boolean• We cannot ask

– “What is Type(x)/Value(x)?”But, answers are not always correct

• Crowd makes mistakes• Next, error model3/20/2013

ICDT 2013 13

Constant Error ModelType comparisons: Type(x) = Type(y)? Value comparisons: Value(x) > Value(y)?

Constant error model: – Standard model– Probability of wrong answer ≤ ½ - , >0

Is the first photo older?Wrong: w.p. ≤ ½ -

Correct: w.p. ≥ ½ +

Same person? Wrong: w.p. ≤ ½ -

Correct: w.p. ≥ ½ + 3/20/2013

ICDT 2013 14

Variable Error Model (this paper)

For value comparisons only : Value(x) > Value(y)?

• Error probability ≤ 1/ f() - – = Distance of x, y in sorted order

Is the first photo older? – Easier

Is the first photo older? - Harder

x1

e.g. Error probability when f() = e

f is 1. strictly monotone

• x1 > x2 f(x1) > f(x2)

2. f(x) ≥ 2

e.g.f() = e

f() = + 1 f() = log + 2x2 x3 x4

≤ 1/e ≤ 1/e2

= 2

3/20/2013

≤ ½ -

ICDT 2013 15

Framework at a Glance

Internal Computation

(Top-K/Group-By)

Crowd

Value(a) > Value(b)

Yes/No

Value(b) > Value(c)

Yes/No

Crowd-sourced DBMS

Comparison ErrorCrowd’s answer may be wrong with some prob.

Cost ModelAsking questions costs money

• Additive• Count #comparisons

We still want the correct answer w.h.p. 3/20/2013

ICDT 2013 16

Our Goal

Minimize the total #comparisons while outputting the correct answer (exact top-k or clusters) w.p. 1 – δ (given constant δ > 0)

3/20/2013

ICDT 2013 17



3/20/2013

ICDT 2013 18

Problem Statement: Max/Top-k

• Input: n elements, k– Same type– Only value comparisons are used

• Output: All top-k elements (w.p. 1 – δ)– (k = 2)

• We focus on Max: Algorithm for top-k builds on the algorithm for max

x3 x1 x2 x4

3/20/2013

ICDT 2013 19

Max: Our ResultsExact

ComparisonsConstant Error

[Feige et. al. ‘94]Error prob < ½

Variable Error[This paper]

Error prob < 1/f()

Upper Boundn - 1 O(n)

n + o(n)for any strictly

monotone function fLower Bound n - 1 Ω(n)

≥ (1+c) n, c > 0 when

High success prob. required with high error

Asymptotically smaller than all c.n, c>0

Required confidence: 1 – δ (δ = constant), = distance in sorted order

3/20/2013

Further, • f() = exp. n + O(1) • f() = linear n + O(log log n) • f() = log. n + O(n1/1+ δ)

ICDT 2013 20

Review: Tournament Tree for Max10

9 10

5 9 10 2

n elements as leaves

• Exact comparisons: Binary tree structure is not necessary

• Noisy comparisons: Repeat comparison + majority vote

• Constant error model: θ(n) algorithm (Feige et. al. ’94)

• Our goal: Total no. of comparisons = n + o(n) cannot repeat even twice in most of the internal nodes

Comparison at internal nodes

Max

#comparisons = c . 1



3/20/2013

ICDT 2013 21

Main Steps for MaxUpper levels

• Use Feige’s algorithm• Y nodes Ө(Y) cost

n nodes

Binary tournament tree Lower Levels

Upper Levels

Y = o(n) for any fY = O(1) for f = e

Lower levelsJust 1 comparison at each internal node

No majority vote

Y nodes

• Max does not lose in the lower levels w.h.p.• Intuition: Max does not meet 2nd Max in the lower levels w.h.p. • Total no. of comparisons = n + Ө(Y) = n + o(n)

3/20/2013

Goal: n + o(n)

Key idea: Start with a random permutation of elements at the leaves

ICDT 2013 22

Extension to Top-K

CorollaryIf k = o(n),

• n + o(n) comparisons suffice to find top-k w.h.p.

• x1, …, xk do not meet in the lower levels

3/20/2013

ICDT 2013 23



– Simple Clustering (using types only)– Clustering with Correlated Values and Types

3/20/2013

ICDT 2013 24

Simple Clustering• Group n elements into J clusters

– Only type comparisons– Constant error model

• O(nJ) comparisons are sufficient– Log factor for comparison error– J can be unknown– Not surprising

• We show Ω(nJ) lower bound– Even if no comparison error– Even for randomized algorithms– Even when J is known

~

3/20/2013

ICDT 2013 25

O(n log J) value and type comparisons suffice to find the clusters w.h.p. ~

Clustering withCorrelated Types and Values

Full correlation: • Elements of same type form contiguous blocks in the sorted order on values

For no correlation, we gave a lower bound of Ω(nJ)

Type 1Type 2

Type 3

High values Low values

3-br

2-br1-br

Assume no error

Apts. In a building

Type = #bedrooms

Value = rent

3/20/2013

ICDT 2013 26

Algorithm for Full Correlation

• Linear cost in total in each inner iteration• #Inner iterations = O(log J)• Cost: C(n) ≤ C(n/2) + O(n log J)• C(n) = O(n log J)

Type 1 Type 2 Type 3

s = 13

- Find median and partition

- If ≥ 2 types in a block, Repeat

- Same type in a block, keep only one element

Repeat Until each block has one type

4J blocks at most J blocks contain ≥ 2 types, ≤ s/2 elements

- Until #elements ≤ s/2

J = 3

3/20/2013

ICDT 2013 27

Type 2

Extension to Partial Correlation

Partial correlation: ― Elements of same type form almost contiguous blocks

Type 1 Type 3Type 1

O(n log (α J) + αJ) value and type comparisons suffice w.h.p. ~

≤ α changes in the same type Hotels in a city

Type = Star-rating

Value = Avg. price

3/20/2013

ICDT 2013 28

Related WorkMax/Top-K:• [Guo et. al. ’12]: Which element has the maximum likelihood of being max,

which future queries are most effective• [Venetis et. al. ’12]: Efficient heuristics that tune latency, cost, quality to find max• [Feige et. al. ’94]: Tight bounds for Max/Top-K/Sorting in constant error model• Other error model and objectives exist in theory literature

Clustering• [Gomes et. al. ’11]: Machine learning approach to define clusters• [Parameswaran et. al. ’12, Wang et. al. ’12]: Filtering data, entity resolution

Other Work:• Crowd sourced DBs: CrowdDB, Deco, Qurk, MoDaS…• Crowd sourcing for data cleansing, data integration, entity resolution, data

analytics, …

3/20/2013

ICDT 2013 29

Conclusion• We studied Max/Top-k and Group-By queries in crowd-sourced setting

– Top-K: Proposed a variable error model, allows fewer comparisons– Group by: Obvious simple algorithm is the best possible, correlation

reduces #comparisons

• Future Work:– Other objectives: latency, #rounds, or to reduce error when a budget

on #comparisons is given– Other cost functions: e.g. more #similar queries in a round, less cost

3/20/2013

ICDT 2013 30

Thank You

Questions?

3/20/2013

Documents

Using the Crowd for Top-K or Group-By Queries Susan Davidson U. of Pennsylvania Sanjeev Khanna U. of Pennsylvania Tova Milo Tel Aviv U. Sudeepa Roy U