Upload
karla-farthing
View
217
Download
2
Tags:
Embed Size (px)
Citation preview
ICDT 2013 1
Using the Crowd for Top-K or Group-By Queries
Susan Davidson U. of PennsylvaniaSanjeev Khanna U. of Pennsylvania
Tova Milo Tel Aviv U.Sudeepa Roy U. of Washington
3/20/2013
ICDT 2013 2
Example: Using Wisdom of Crowd in DB Queries
Database of football players pictures– Multiple photos at different
ages of each player
Q. Group the photos of individual players
Q. Find their most recent photos
3/20/2013
ICDT 2013 3
How a DBMS ThinksQ. Group the photos of
individual players
Group-By Queries!Use “Name” attribute
Q. Find their most recent photos
Max/Top-k Queries!Use “Date” attribute
What if name/date is missingImage processing? Photo forensics?
3/20/2013
ICDT 2013 4
Ask the “Crowd”!
3/20/2013
ICDT 2013 5
How the Crowd Thinks - 1
Q. Group the photos of individual players
3/20/2013
ICDT 2013 6
How the Crowd Thinks - 2
Q. Find their most recent photos
3/20/2013
ICDT 2013 7
Crowd Sourcing
3/20/2013
• Using human intelligence to do tasks which are harder to automate
• A recent topic of interest in database and other research communities
• Many crowdsourcing platforms
ICDT 2013 8
This talk: Using Wisdom of Crowd in Top-K and Group-By Queries
SELECT TOP 1 R.picture
FROM SoccerPlayerPhotoTable AS R
GROUP BY R.player
ORDER BY R.date DESC
Top-k/max
Group By / Clustering
A possible query
Fixed but unknown attributes: R.player, R.date
3/20/2013
ICDT 2013 9
Outline of this talk
• Our Model • Max and Top-K Queries• Group-By Queries (Clustering)
Combination is easy..
3/20/2013
ICDT 2013 10
Outline of this talk
• Our Model • Max and Top-K Queries• Group-By Queries (Clustering)
3/20/2013
ICDT 2013 11
Elements: Type and Value• #Elements = n
– (n = 16)– Two attributes: Type and Value
• Type– e.g. Name = “Maradona” – used in Group-By– #Types = J (J clusters, J = 4)
• Value– e.g. Date when photo was taken– used in Top-K
• Unknown, but “ground-truth’’ exists for Types and Values
3/20/2013
ICDT 2013 12
Type and Value Comparisons• DB Queries vs. Comparisons (questions to the crowd)
• Type Comparisons:– Type(x) = Type(y)?
• Value Comparisons:– Value(x) > Value(y)?– Assumes same type
Same Person?Is the first photo older?
• Answer is Boolean• We cannot ask
– “What is Type(x)/Value(x)?”But, answers are not always correct
• Crowd makes mistakes• Next, error model3/20/2013
ICDT 2013 13
Constant Error ModelType comparisons: Type(x) = Type(y)? Value comparisons: Value(x) > Value(y)?
Constant error model: – Standard model– Probability of wrong answer ≤ ½ - , >0
Is the first photo older?Wrong: w.p. ≤ ½ -
Correct: w.p. ≥ ½ +
Same person? Wrong: w.p. ≤ ½ -
Correct: w.p. ≥ ½ + 3/20/2013
ICDT 2013 14
Variable Error Model (this paper)
For value comparisons only : Value(x) > Value(y)?
• Error probability ≤ 1/ f() - – = Distance of x, y in sorted order
Is the first photo older? – Easier
Is the first photo older? - Harder
x1
e.g. Error probability when f() = e
f is 1. strictly monotone
• x1 > x2 f(x1) > f(x2)
2. f(x) ≥ 2
e.g.f() = e
f() = + 1 f() = log + 2x2 x3 x4
≤ 1/e ≤ 1/e2
= 2
3/20/2013
≤ ½ -
ICDT 2013 15
Framework at a Glance
Internal Computation
(Top-K/Group-By)
Crowd
Value(a) > Value(b)
Yes/No
Value(b) > Value(c)
Yes/No
Crowd-sourced DBMS
Comparison ErrorCrowd’s answer may be wrong with some prob.
Cost ModelAsking questions costs money
• Additive• Count #comparisons
We still want the correct answer w.h.p. 3/20/2013
ICDT 2013 16
Our Goal
Minimize the total #comparisons while outputting the correct answer (exact top-k or clusters) w.p. 1 – δ (given constant δ > 0)
3/20/2013
ICDT 2013 17
Outline of this talk
• Our Model • Max and Top-K Queries• Group-By Queries (Clustering)
3/20/2013
ICDT 2013 18
Problem Statement: Max/Top-k
• Input: n elements, k– Same type– Only value comparisons are used
• Output: All top-k elements (w.p. 1 – δ)– (k = 2)
• We focus on Max: Algorithm for top-k builds on the algorithm for max
x3 x1 x2 x4
3/20/2013
ICDT 2013 19
Max: Our ResultsExact
ComparisonsConstant Error
[Feige et. al. ‘94]Error prob < ½
Variable Error[This paper]
Error prob < 1/f()
Upper Boundn - 1 O(n)
n + o(n)for any strictly
monotone function fLower Bound n - 1 Ω(n)
≥ (1+c) n, c > 0 when
High success prob. required with high error
Asymptotically smaller than all c.n, c>0
Required confidence: 1 – δ (δ = constant), = distance in sorted order
3/20/2013
Further, • f() = exp. n + O(1) • f() = linear n + O(log log n) • f() = log. n + O(n1/1+ δ)
ICDT 2013 20
Review: Tournament Tree for Max10
9 10
5 9 10 2
n elements as leaves
• Exact comparisons: Binary tree structure is not necessary
• Noisy comparisons: Repeat comparison + majority vote
• Constant error model: θ(n) algorithm (Feige et. al. ’94)
• Our goal: Total no. of comparisons = n + o(n) cannot repeat even twice in most of the internal nodes
Comparison at internal nodes
Max
#comparisons = c . 1
#comparisons = c . 3
#comparisons = c . 5
3/20/2013
ICDT 2013 21
Main Steps for MaxUpper levels
• Use Feige’s algorithm• Y nodes Ө(Y) cost
n nodes
Binary tournament tree Lower Levels
Upper Levels
Y = o(n) for any fY = O(1) for f = e
Lower levelsJust 1 comparison at each internal node
No majority vote
Y nodes
• Max does not lose in the lower levels w.h.p.• Intuition: Max does not meet 2nd Max in the lower levels w.h.p. • Total no. of comparisons = n + Ө(Y) = n + o(n)
3/20/2013
Goal: n + o(n)
Key idea: Start with a random permutation of elements at the leaves
ICDT 2013 22
Extension to Top-K
CorollaryIf k = o(n),
• n + o(n) comparisons suffice to find top-k w.h.p.
• x1, …, xk do not meet in the lower levels
3/20/2013
ICDT 2013 23
Outline of this talk
• Our Model • Max and Top-K Queries• Group-By Queries (Clustering)
– Simple Clustering (using types only)– Clustering with Correlated Values and Types
3/20/2013
ICDT 2013 24
Simple Clustering• Group n elements into J clusters
– Only type comparisons– Constant error model
• O(nJ) comparisons are sufficient– Log factor for comparison error– J can be unknown– Not surprising
• We show Ω(nJ) lower bound– Even if no comparison error– Even for randomized algorithms– Even when J is known
~
3/20/2013
ICDT 2013 25
O(n log J) value and type comparisons suffice to find the clusters w.h.p. ~
Clustering withCorrelated Types and Values
Full correlation: • Elements of same type form contiguous blocks in the sorted order on values
For no correlation, we gave a lower bound of Ω(nJ)
Type 1Type 2
Type 3
High values Low values
3-br
2-br1-br
Assume no error
Apts. In a building
Type = #bedrooms
Value = rent
3/20/2013
ICDT 2013 26
Algorithm for Full Correlation
• Linear cost in total in each inner iteration• #Inner iterations = O(log J)• Cost: C(n) ≤ C(n/2) + O(n log J)• C(n) = O(n log J)
Type 1 Type 2 Type 3
s = 13
- Find median and partition
- If ≥ 2 types in a block, Repeat
- Same type in a block, keep only one element
Repeat Until each block has one type
4J blocks at most J blocks contain ≥ 2 types, ≤ s/2 elements
- Until #elements ≤ s/2
J = 3
3/20/2013
ICDT 2013 27
Type 2
Extension to Partial Correlation
Partial correlation: ― Elements of same type form almost contiguous blocks
Type 1 Type 3Type 1
O(n log (α J) + αJ) value and type comparisons suffice w.h.p. ~
≤ α changes in the same type Hotels in a city
Type = Star-rating
Value = Avg. price
3/20/2013
ICDT 2013 28
Related WorkMax/Top-K:• [Guo et. al. ’12]: Which element has the maximum likelihood of being max,
which future queries are most effective• [Venetis et. al. ’12]: Efficient heuristics that tune latency, cost, quality to find max• [Feige et. al. ’94]: Tight bounds for Max/Top-K/Sorting in constant error model• Other error model and objectives exist in theory literature
Clustering• [Gomes et. al. ’11]: Machine learning approach to define clusters• [Parameswaran et. al. ’12, Wang et. al. ’12]: Filtering data, entity resolution
Other Work:• Crowd sourced DBs: CrowdDB, Deco, Qurk, MoDaS…• Crowd sourcing for data cleansing, data integration, entity resolution, data
analytics, …
3/20/2013
ICDT 2013 29
Conclusion• We studied Max/Top-k and Group-By queries in crowd-sourced setting
– Top-K: Proposed a variable error model, allows fewer comparisons– Group by: Obvious simple algorithm is the best possible, correlation
reduces #comparisons
• Future Work:– Other objectives: latency, #rounds, or to reduce error when a budget
on #comparisons is given– Other cost functions: e.g. more #similar queries in a round, less cost
3/20/2013
ICDT 2013 30
Thank You
Questions?
3/20/2013