Online Ordering of Overlapping Data Sources. Mariam Salloum (YP.com), Xin Luna Dong (Google), Divesh Srivastava (AT&T Research), Vassilis J. Tsotras (UC Riverside)

  • Slide 1
  • Online Ordering of Overlapping Data Sources. Mariam Salloum (YP.com), Xin Luna Dong (Google), Divesh Srivastava (AT&T Research), Vassilis J. Tsotras (UC Riverside). VLDB 2014, Hangzhou, China.
  • Slide 2
  • Motivation for Source Ordering. For online query answering, users want results as soon as possible. For some domains there are hundreds to thousands of relevant data sources (Dalvi et al., VLDB 2012). Not all sources can be queried in parallel, because of bandwidth limitations and similar constraints, so we must consider the order in which sources are queried. Example query: all listings in Pasadena, California.
  • Slide 3
  • Source Ordering. Consider an example with 5 sources, A through E. Each area in the Venn diagram represents the number of answers given by the set of sources that covers it. Five of the 120 possible orderings (5! = 120) are shown, and orderings are compared by the area-under-the-curve measure. [Figure: Venn diagram of the five sources A, B, C, D, E.]
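The area-under-the-curve comparison can be illustrated with a small sketch. The answer sets below are hypothetical (the counts in the slide's Venn diagram are not reproduced), and every probe is assumed to take one unit of time, so the area is just the sum of the cumulative answer counts.

```python
from itertools import permutations

# Hypothetical answer sets for five sources; these are NOT the values
# from the slide's Venn diagram.
SOURCES = {
    "A": {1, 2, 3, 4, 5, 6},
    "B": {4, 5, 6, 7, 8},
    "C": {8, 9},
    "D": {1, 2, 10},
    "E": {10, 11, 12},
}


def cumulative_answers(order, sources):
    """Number of distinct answers seen after each source in `order` is probed."""
    seen = set()
    curve = []
    for name in order:
        seen |= sources[name]
        curve.append(len(seen))
    return curve


def area_under_curve(order, sources):
    """Area under the step curve, assuming each probe costs one time unit."""
    return sum(cumulative_answers(order, sources))


# Enumerate all 5! = 120 orderings and pick one with the largest area.
orderings = list(permutations(SOURCES))
best = max(orderings, key=lambda o: area_under_curve(o, SOURCES))
print(best, area_under_curve(best, SOURCES))
```

Exhaustive enumeration is only workable for a handful of sources; it is used here purely to make the measure concrete.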
  • Slide 4
  • Challenges. Source ordering needs to consider three factors:
      • Coverage: the number of answers provided by a source.
      • Overlap: the percentage of answers that sources have in common.
      • Cost: the monetary or latency cost incurred when connecting to a source or retrieving answers from it.
    Gathering all coverage and overlap statistics is infeasible: 20 sources imply about 1 million overlap statistics, and 30 sources about 1 billion. Such statistics are also typically stale.
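The growth in the number of statistics can be checked with a short calculation. This sketch assumes that one overlap statistic exists for every subset of at least two sources, which gives 2^n - n - 1; the slide's "1 million" and "1 billion" are taken to be round figures for roughly 2^20 and 2^30.

```python
def overlap_statistic_count(n):
    """Subsets of n sources with at least two members: 2^n - n - 1.

    Assumption: one overlap statistic per such subset.
    """
    return 2**n - n - 1


for n in (20, 30):
    print(n, overlap_statistic_count(n))
```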
  • Slide 5
  • OASIS: Online Query Answering System. [Figure: system architecture with Overlap Estimation, Source Ordering, Statistics Enrichment Planning, and Source Querying components, a Statistics Server with a Statistics Repository, and the data sources.] We consider three problems:
      • Overlap Estimation: given a partial set of overlap statistics, how to estimate the overlap statistics that are not known.
      • Source Ordering: how to order sources to maximize the area under the curve.
      • Statistics Enrichment: how to select additional unknown statistics to improve the accuracy of Overlap Estimation and, in turn, improve Source Ordering.
  • Slide 6
  • Basic Overlap Estimation Solution. Given the coverage and a partial set of overlap statistics, formulate constraints over variables that each denote one combination of present and absent sources (a prime marks an absent source). Ex. $P(A \cap B) = 0.30$ gives $ABCDE + ABCDE' + ABCD'E + ABCD'E' + ABC'DE + ABC'DE' + ABC'D'E + ABC'D'E' = 0.30$. Ex. $P(A \cap B \cap C \cap D) = 0.03$ gives $ABCDE + ABCDE' = 0.03$. Then find the MaxEnt solution under the given constraints. It provides the highest likelihood under the given constraints with no additional assumptions, and it changes smoothly when statistics are added or changed.
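A minimal sketch of the basic formulation for a small number of sources follows. It uses iterative proportional fitting as the solver, which is a standard way to reach the maximum-entropy distribution under consistent constraints of this kind and is not necessarily the solver used in the talk. The function names, the three-source example, and its probabilities are hypothetical, and the sketch assumes every given probability lies strictly between 0 and 1.

```python
from itertools import product


def maxent_ipf(n, constraints, sweeps=200):
    """Estimate the joint distribution over all 2^n presence/absence patterns.

    constraints: list of (sources, prob) pairs, where `sources` is a tuple of
    source indices and `prob` is the known probability that an answer is
    provided by all of those sources. Assumes 0 < prob < 1 and that the
    constraints are mutually consistent.
    """
    cells = list(product((0, 1), repeat=n))
    # Start from the uniform distribution, the unconstrained max-entropy point.
    p = {cell: 1.0 / len(cells) for cell in cells}
    for _ in range(sweeps):
        for sources, target in constraints:
            mass = sum(p[c] for c in cells if all(c[i] for i in sources))
            for c in cells:
                if all(c[i] for i in sources):
                    p[c] *= target / mass              # cells inside the overlap
                else:
                    p[c] *= (1.0 - target) / (1.0 - mass)  # all other cells
    return p


def overlap(p, sources):
    """Probability that an answer is provided by every source in `sources`."""
    return sum(v for cell, v in p.items() if all(cell[i] for i in sources))


A, B, C = 0, 1, 2
known = [((A,), 0.5), ((B,), 0.4), ((C,), 0.3), ((A, B), 0.3)]
p = maxent_ipf(3, known)
print(round(overlap(p, (A, C)), 4))  # estimate for the unknown P(A and C)
```

In this example no given statistic ties C to A or B, so the estimate for the unknown overlap approaches P(A) · P(C) = 0.15, which is what maximum entropy implies when nothing else is known. The sketch enumerates all 2^n variables and therefore only works for small n.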
  • Slide 7
  • Overlap Estimation (Cont.). Challenge: formulating the problem exactly requires defining $2^n$ variables, where $n$ is the number of data sources. Ex. 30 sources require about 1 billion variables. Observation: the number of non-zero variables cannot exceed the number of answers, which is usually much smaller than $2^n$.
  • Slide 8
  • Scalable Overlap Estimation Solution. Given statistics: $P(A)$, $P(B)$, $P(C)$, $P(D)$, $P(E)$, $P(A \cap B)$, $P(A \cap D)$, $P(A \cap B \cap C \cap D)$. Variable set: $V = \{AB'C'D'E', A'BC'D'E', A'B'CD'E', A'B'C'DE', A'B'C'D'E, ABC'D'E', AB'C'DE', ABCDE', ABCDE\}$.
      1) Define constraints using a subset of variables with high cardinality.
      2) Solve the MaxEnt problem.
  • Slide 9
  • Scalable Overlap Estimation Solution (Cont.).
      3) Include additional variables that are expected to have high cardinality, and remove variables whose value is close to zero.
      4) Repeat the procedure until no new variables are added.
  • Slide 10
  • Source Ordering. An optimal ordering of sources returns answers as fast as possible, as measured by the area under the curve. Since finding an optimal ordering is NP-hard, we propose a greedy algorithm that orders sources by the highest ratio of residual coverage to cost. We propose two source ordering strategies: STATIC ordering and DYNAMIC ordering.
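The greedy rule can be sketched as follows. For simplicity this sketch computes residual coverage from the actual answer sets, which corresponds to having complete knowledge; in the system described here the residual coverage would instead come from estimated statistics. The source names, answer sets, and costs are hypothetical.

```python
def greedy_order(sources, costs):
    """Order sources by the highest residual-coverage-to-cost ratio.

    sources: dict mapping a source name to its set of answers.
    costs:   dict mapping a source name to a positive cost.
    """
    remaining = dict(sources)
    covered = set()
    order = []
    while remaining:
        # Residual coverage = answers of the source not yet returned.
        best = max(
            remaining,
            key=lambda name: len(remaining[name] - covered) / costs[name],
        )
        order.append(best)
        covered |= remaining.pop(best)
    return order


# Hypothetical example data.
example_sources = {"A": {1, 2, 3, 4}, "B": {3, 4, 5}, "C": {5, 6}}
example_costs = {"A": 2.0, "B": 1.0, "C": 2.0}
print(greedy_order(example_sources, example_costs))
```

Here B is chosen first (ratio 3.0), then A (two new answers at cost 2.0), then C (one new answer at cost 2.0).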
  • Slide 11
  • STATIC Ordering. Solve the MaxEnt problem; select the next source with the highest ratio of residual coverage to cost; probe the selected source; iterate until the threshold is reached.
  • Slide 12
  • DYNAMIC Ordering. Solve the MaxEnt problem; select the next source to probe; probe the selected source; compute additional statistics; iterate until the threshold is reached.
  • Slide 13
  • Statistics Enrichment. The Statistics Enrichment component chooses additional unknown statistics with the goal of improving source ordering. Incorporating additional statistics into Static and Dynamic ordering yields STATIC+ ordering and DYNAMIC+ ordering. The four strategies differ along two dimensions, "Requests additional statistics?" and "Adaptable?": STATIC (no, no), STATIC+ (yes, no), DYNAMIC (no, yes), DYNAMIC+ (yes, yes).
  • Slide 14
  • Experimental Evaluation.
    Data set: a snapshot of Computer Science book listings from AbeBooks.com.
      • 1,028 bookstores (sources)
      • 1,256 unique books; 25,347 book records in total
      • Cost: a fixed 356 ms source-connection cost and a 0.3 ms per-tuple cost (based on empirical tests)
    Ordering strategies:
      • STATIC / STATIC+
      • DYNAMIC / DYNAMIC+
      • Random: randomly choose an order of the sources.
      • Coverage: order the sources in decreasing order of their coverage.
      • Baseline: naive usage of the given coverage and overlap statistics.
      • FullKnowledge: greedy algorithm with an accurate and complete set of coverage and overlap statistics.
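The cost figures above can be turned into a small per-probe cost function. This sketch assumes the two costs simply add up for each probed source, which is one reading of the slide; the function and constant names are hypothetical.

```python
CONNECTION_COST_MS = 356.0  # fixed source-connection cost from the slide
PER_TUPLE_COST_MS = 0.3     # per-tuple retrieval cost from the slide


def probe_cost_ms(num_tuples):
    """Cost of probing one source, assuming the two costs are additive."""
    return CONNECTION_COST_MS + PER_TUPLE_COST_MS * num_tuples


# Average number of records per source in the data set: 25,347 / 1,028.
average_tuples = 25347 / 1028
print(round(probe_cost_ms(average_tuples), 1))
```

With about 25 records per source on average, the fixed connection cost dominates the per-tuple cost under this reading.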
  • Slide 15
  • Evaluation of Algorithms. DYNAMIC yields a larger area under the curve than STATIC and probes fewer sources to reach 90% coverage. DYNAMIC+ and STATIC+ perform better than their DYNAMIC and STATIC counterparts.
  • Slide 16
  • Conclusions. The proposed Overlap Estimation method generates good overlap estimates for the purpose of source ordering. An adaptive ordering strategy (DYNAMIC ordering) generates a better source ordering than a static ordering strategy. Incorporating new statistics, whether accurate, approximate, or stale, can improve source ordering (DYNAMIC+). As long as the statistic selection procedure is fast, incorporating new statistics on the fly can improve source ordering.
  • Slide 17
  • Thank you. Questions?