Online Ordering of Overlapping Data Sources
Mariam Salloum (YP.com), Xin Luna Dong (Google), Divesh Srivastava (AT&T Research), Vassilis J. Tsotras (UC Riverside)
VLDB 2014 - Hangzhou, China
Slide 2
Motivation for Source Ordering
- For online query answering, users want results as soon as possible.
- For some domains, there are hundreds to thousands of relevant data sources (Dalvi et al., VLDB 2012).
- Not all sources can be queried in parallel, due to bandwidth limitations and similar constraints.
- Hence, we must consider the order in which sources are queried.
Example query: "I would like all listings in Pasadena, California."
Slide 3
Source Ordering
- Consider this example of 5 sources (A, B, C, D, E). Each area in the Venn diagram represents the number of answers given by the set of sources that covers it.
- 5 of the 120 possible orderings are shown.
- Orderings are compared by the area-under-the-curve measure.
[Figure: 5-source Venn diagram of sources A, B, C, D, E, with the number of answers in each region]
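To make the area-under-the-curve measure concrete, here is a minimal Python sketch. It assumes each source is a set of answer IDs and that every probe costs one unit, so the curve is the number of distinct answers after each probe. The source contents and the name `area_under_curve` are hypothetical and do not reproduce the numbers in the slide's Venn diagram.

```python
def area_under_curve(ordering, sources):
    """Sum of distinct answers seen after each probe (unit cost per probe)."""
    seen = set()
    area = 0
    for name in ordering:
        seen |= sources[name]   # answers accumulated so far
        area += len(seen)       # height of the curve after this probe
    return area

# Hypothetical sources (assumption, not the slide's data).
sources = {
    "A": {1, 2, 3, 4},
    "B": {3, 4, 5},
    "C": {6},
}

good = area_under_curve(["A", "B", "C"], sources)  # large source first
bad = area_under_curve(["C", "B", "A"], sources)   # small source first
print(good, bad)
```

Both orderings end with the same 6 answers, but the first one reaches them sooner and therefore has the larger area.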
Slide 4
Challenges
Source ordering needs to consider three factors:
- Coverage: number of answers provided by a source.
- Overlap: percentage of overlapping answers between sources.
- Cost: monetary or latency cost incurred when connecting to or retrieving answers from a source.
Challenges:
- Gathering all coverage and overlap statistics is infeasible.
  - 20 sources => about 1 million overlap statistics
  - 30 sources => about 1 billion overlap statistics
- Such statistics are typically stale.
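The figures above follow from counting subsets of sources. A small sketch, assuming that one overlap statistic exists for every subset of at least two sources (the function name is hypothetical):

```python
def overlap_statistic_count(n):
    """Number of subsets with at least two of n sources: 2^n - n - 1."""
    return 2 ** n - n - 1

for n in (20, 30):
    print(n, overlap_statistic_count(n))
```

For 20 sources this is just over one million, and for 30 sources just over one billion.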
Slide 5
OASIS: Online Query Answering System
[Figure: OASIS architecture. Boxes and labels shown: Source Querying, Statistics Enrichment, Planning, Source Ordering, Statistic Server, Overlap Estimation, Statistics Repository, Data Sources; flows labeled Q, Answers, Collected New Statistics, Overlap, Ordering, Data, Statistics Enrichment Plan]
We consider 3 problems:
- Overlap Estimation: given a partial set of overlap statistics, how to estimate the overlap statistics that are not known.
- Source Ordering: how to order sources to maximize the area-under-the-curve.
- Statistics Enrichment: how to select additional unknown statistics to improve the accuracy of Overlap Estimation and, in turn, improve Source Ordering.
Slide 6
Basic Overlap Estimation Solution
- Given coverage and a partial set of overlap statistics, formulate constraints. Each variable covers one combination of sources; a primed letter marks a source that does not provide the answer.
  Ex. P(A ∩ B) = ABCDE + ABCDE' + ABCD'E + ABCD'E' + ABC'DE + ABC'DE' + ABC'D'E + ABC'D'E' = 0.30
  Ex. P(A ∩ B ∩ C ∩ D) = ABCDE + ABCDE' = 0.03
- Find the MaxEnt (maximum entropy) solution under the given constraints.
  - Provides the highest likelihood under the given constraints with no additional assumptions.
  - Changes smoothly with the addition of, or a change in, statistics.
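A minimal sketch of solving such constraints for a maximum entropy distribution. It uses iterative proportional fitting as a stand-in solver; the slides do not say which solver OASIS uses. The names `maxent_ipf` and `overlap` and all numbers are hypothetical. The sketch enumerates all 2^n variables, so it only suits a handful of sources, and it assumes the given statistics are consistent and strictly between 0 and 1.

```python
from itertools import product

def maxent_ipf(sources, constraints, iterations=200):
    """Approximate the MaxEnt distribution over all 2^n membership patterns.

    constraints maps a tuple of source names to the probability that an
    answer is provided by all of them.
    """
    index = {s: i for i, s in enumerate(sources)}
    patterns = list(product((0, 1), repeat=len(sources)))
    prob = {pat: 1.0 / len(patterns) for pat in patterns}  # uniform start
    for _ in range(iterations):
        for subset, target in constraints.items():
            inside = {pat for pat in patterns
                      if all(pat[index[s]] for s in subset)}
            current = sum(prob[pat] for pat in inside)
            for pat in patterns:
                # Rescale so this constraint holds exactly after the step.
                if pat in inside:
                    prob[pat] *= target / current
                else:
                    prob[pat] *= (1.0 - target) / (1.0 - current)
    return prob

def overlap(prob, sources, subset):
    """Estimated probability that an answer is in every source of subset."""
    index = {s: i for i, s in enumerate(sources)}
    return sum(p for pat, p in prob.items()
               if all(pat[index[s]] for s in subset))

# Hypothetical statistics (assumption): three coverages, one known overlap.
sources = ["A", "B", "C"]
known = {("A",): 0.5, ("B",): 0.4, ("C",): 0.2, ("A", "B"): 0.3}
prob = maxent_ipf(sources, known)
est_ac = overlap(prob, sources, ("A", "C"))  # an overlap that was not given
print(round(est_ac, 4))
```

Because no given statistic links C to the other sources, the estimate for P(A ∩ C) comes out at about P(A) * P(C) = 0.10, which illustrates the "no additional assumptions" property.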
Slide 7
Overlap Estimation (Cont.)
Challenge:
- Formulating the problem exactly requires the definition of 2^n variables, where n is the number of data sources. Ex. 30 sources = about 1 billion variables.
Observation:
- The number of non-zero variables should not exceed the number of answers, which is usually much smaller than 2^n.
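The observation can be illustrated by grouping answers by the exact set of sources that provide them. Each answer falls into exactly one group, so the number of non-empty groups is bounded by the number of answers. A sketch with hypothetical records and a hypothetical function name:

```python
from collections import defaultdict

def nonzero_variables(records):
    """Map each observed membership pattern to its number of answers.

    records is an iterable of (source, answer) pairs.
    """
    providers = defaultdict(set)
    for source, answer in records:
        providers[answer].add(source)
    counts = defaultdict(int)
    for srcs in providers.values():
        counts[frozenset(srcs)] += 1   # one pattern per answer
    return dict(counts)

# Hypothetical records (assumption): 3 sources, 4 distinct answers.
records = [("A", "book1"), ("B", "book1"), ("A", "book2"),
           ("C", "book3"), ("A", "book4"), ("B", "book4")]
variables = nonzero_variables(records)
print(len(variables), "non-zero variables out of", 2 ** 3)
```

Here only 3 of the 8 possible variables are non-zero, fewer than the 4 answers.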
Slide 8
Scalable Overlap Estimation Solution
Given statistics: P(A), P(B), P(C), P(D), P(E), P(A ∩ B), P(A ∩ D), P(A ∩ B ∩ C ∩ D)
V = {AB'C'D'E', A'BC'D'E', A'B'CD'E', A'B'C'DE', A'B'C'D'E, ABC'D'E', AB'C'DE', ABCDE', ABCDE}
1) Define constraints using a subset of variables with high cardinality.
2) Solve the MaxEnt problem.
Slide 9
Scalable Overlap Estimation Solution (Cont.)
3) Include additional variables that are expected to have high cardinality, and remove variables whose value is close to zero.
4) Repeat the procedure until no new variables are added.
Slide 10
Source Ordering
- An optimal ordering of sources returns answers as fast as possible, as measured by the area-under-the-curve.
- Since finding an optimal ordering is NP-hard, we propose a greedy algorithm that orders sources by the highest ratio of residual coverage to cost.
- We propose two source ordering strategies:
  - STATIC Ordering
  - DYNAMIC Ordering
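A minimal sketch of the greedy rule. It assumes exact answer sets are known for every source, whereas OASIS would work from estimated overlaps. The sources, costs, and the name `greedy_order` are hypothetical.

```python
def greedy_order(sources, cost):
    """Order sources greedily by highest residual coverage / cost."""
    remaining = set(sources)
    seen = set()
    order = []
    while remaining:
        # Residual coverage: answers a source adds beyond those already seen.
        # sorted() makes tie-breaking deterministic.
        best = max(sorted(remaining),
                   key=lambda s: len(sources[s] - seen) / cost[s])
        order.append(best)
        seen |= sources[best]
        remaining.remove(best)
    return order

# Hypothetical sources and costs (assumption).
sources = {"A": {1, 2, 3, 4}, "B": {3, 4, 5, 6, 7}, "C": {8}}
cost = {"A": 1.0, "B": 2.0, "C": 1.0}
order = greedy_order(sources, cost)
print(order)
```

B has the largest coverage, but A is picked first because it returns more new answers per unit of cost.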
Slide 11
STATIC Ordering
1) Solve the MaxEnt problem.
2) Select the next source with the highest residual coverage over cost ratio.
3) Probe the selected source.
4) Iterate until the threshold is reached.
Slide 12
DYNAMIC Ordering
1) Solve the MaxEnt problem.
2) Select the next source to probe.
3) Probe the selected source.
4) Compute additional statistics.
5) Iterate until the threshold is reached.
Slide 13
Statistics Enrichment
- The Statistics Enrichment component chooses additional unknown statistics with the goal of improving source ordering.
- Incorporating additional statistics into STATIC and DYNAMIC ordering gives STATIC+ Ordering and DYNAMIC+ Ordering.
[Table: STATIC+, STATIC, DYNAMIC+, and DYNAMIC compared on two properties: Requests Additional Statistics? Adaptable?]
Slide 14
Experimental Evaluation
Data Set
- Snapshot of Computer Science book listings from AbeBooks.com
- 1,028 bookstores (sources)
- 1,256 unique books / 25,347 book records in total
- Cost: fixed 356 ms source-connection cost and 0.3 ms per-tuple cost (based on empirical tests)
Ordering Strategies
- STATIC / STATIC+
- DYNAMIC / DYNAMIC+
- Random: randomly choose an order of the sources
- Coverage: order the sources in decreasing order of their coverage
- Baseline: naive usage of the given coverage and overlap statistics
- FullKnowledge: greedy algorithm with an accurate and complete set of coverage and overlap statistics
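The cost figures above can be read as a simple linear model. The sketch below assumes the two costs add up and that the per-tuple cost applies to every tuple a source returns; the function name and the example tuple count are hypothetical.

```python
CONNECTION_MS = 356.0   # fixed source-connection cost, from the slide
PER_TUPLE_MS = 0.3      # per-tuple cost, from the slide

def probe_cost_ms(num_tuples):
    """Estimated cost of probing one source (additive model is an assumption)."""
    return CONNECTION_MS + PER_TUPLE_MS * num_tuples

cost = probe_cost_ms(100)   # hypothetical source returning 100 tuples
print(cost)
```

Under this model the fixed connection cost dominates unless a source returns well over a thousand tuples.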
Slide 15
Evaluation of Algorithms
- DYNAMIC yields a larger area-under-the-curve than STATIC, and probes fewer sources to reach 90% coverage.
- DYNAMIC+ and STATIC+ perform better than their DYNAMIC and STATIC counterparts.
Slide 16
Conclusions
- The proposed Overlap Estimation method generates good overlap estimates for the purpose of source ordering.
- An adaptive ordering strategy (DYNAMIC ordering) generates a better source ordering than a static ordering strategy.
- Incorporating new statistics (whether accurate, approximate, or stale) can improve source ordering (DYNAMIC+).
- As long as the statistic selection procedure is fast, incorporating new statistics on the fly can improve source ordering.