Xin Luna Dong (AT&T Labs Google Inc.), Barna Saha, Divesh Srivastava (AT&T Labs-Research). VLDB’2013. Less is More: Selecting Sources Wisely for Integration

Less is More: Selecting Sources Wisely for Integration


*“The More, The Better” —for Men

*“The More, The Better” —for Women

*“The More, The Better” —for DBers

*But Data Come with a Cost

*Lots of money

*Lots of machines

*Lots of people

*And The Gain Could Be Small

CS books from AbeBooks.com: 894 sources and 1265 CS books in total

*1096 books from the largest source

*1213 books from the 2 largest sources

*1250 books from the 10 largest sources

*1260 books from the first 35 sources

*All 1265 books from the first 537 sources

*And The Gain Could Even Be Negative

CS books from AbeBooks.com

*All 100 books (gold standard) from the first 548 sources

*90 > 80 books with correct authors after 579 sources (Accu)

*93 > 80 books with correct authors after 583 sources (Vote)

*78 books with correct authors for Vote

*80 books with correct authors for Accu

*Less Is More: Source Selection [VLDB’13]

*Questions

*Is it best to integrate all data?

*How to spend the computing resources in a wise way?

*How to wisely select sources before real integration to balance the gain and the cost?

*Source selection is a prelude to data integration and falls outside the traditional integration tasks (schema mapping, entity resolution, data fusion)

*Maximize Quality Under Budget?

CS books from AbeBooks.com

*17 books with correct authors from 300 sources (budget)

*14 books (17.6% fewer) with correct authors from the first 200 sources (33% fewer resources)

*Minimize Cost with a Minimal Quality Requirement?

CS books from AbeBooks.com

*65 books with correct authors (quality requirement) from the first 520 sources

*81 books (25% more) with correct authors from 526 sources (1% more)

*Marginalism Principle in Economic Theory

[Figure: two plots over #(Resource Unit) with $ on the vertical axis; one shows Gain and Cost, the other Marginal Gain and Marginal Cost. Annotations: the Law of Diminishing Returns; largest profit where marginal gain = marginal cost.]
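As a minimal sketch of the marginalism principle, the Python snippet below walks along cumulative gain and cost curves and stops at the last resource unit whose marginal gain still covers its marginal cost. The curve values and the name `marginal_point` are illustrative assumptions, not numbers from the slides, and the early stop is only valid when diminishing returns hold.

```python
def marginal_point(gains, costs):
    """Return (units, profit) just before marginal gain drops below marginal cost.

    gains[i] and costs[i] are the cumulative gain and cost (same unit, e.g. $)
    after using i resource units; index 0 is the empty starting point.
    Assumes diminishing returns, so the first such point is the profit maximum.
    """
    units = 0
    for i in range(1, len(gains)):
        marginal_gain = gains[i] - gains[i - 1]
        marginal_cost = costs[i] - costs[i - 1]
        if marginal_gain < marginal_cost:
            break  # one more unit would cost more than it brings in
        units = i
    return units, gains[units] - costs[units]


# Hypothetical curves: concave gain, linear cost (both in $).
gains = [0, 4, 7, 9, 10, 10.5]
costs = [0, 1.5, 3, 4.5, 6, 7.5]

best_units, best_profit = marginal_point(gains, costs)
print(best_units, best_profit)
```

With these assumed curves the scan stops after three units, which is also where the gap between cumulative gain and cumulative cost is widest.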

*Marginalism for Source Selection

CS books from AbeBooks.com

Marginal point with the largest profit in this ordering: 548 sources

Challenge 1. The Law of Diminishing Returns does not necessarily hold, so there can be multiple marginal points

Challenge 2. Each source is different in quality, so different orderings lead to different marginal points: the best solution integrates 26 sources

Challenge 3. Estimating gain and cost w/o real integration

*Insight I. Maximizing Profit

*Input

*S: a set of available sources

*F: integration model

*Output: subset Ŝ ⊆ S that maximizes the profit G_F(Ŝ) - C_F(Ŝ)

*G_F(Ŝ): gain of integrating Ŝ using model F

*C_F(Ŝ): cost of integrating Ŝ using model F

*Gain and cost need to be in the same unit to be comparable; e.g., $
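To make the objective concrete, here is a small Python sketch that finds the profit-maximizing subset by exhaustive search. The gain model ($1 per distinct book covered), the per-source costs, and all names (`best_subset`, `coverage`, `source_cost`) are hypothetical choices for illustration; the slides do not prescribe them, and exhaustive search is only feasible for a handful of sources.

```python
from itertools import combinations


def best_subset(sources, gain, cost):
    """Exhaustively find the subset that maximizes gain(S) - cost(S).

    Exponential in the number of sources; meant only to illustrate the objective.
    """
    best = frozenset()
    best_profit = gain(best) - cost(best)
    for r in range(1, len(sources) + 1):
        for combo in combinations(sorted(sources), r):
            subset = frozenset(combo)
            profit = gain(subset) - cost(subset)
            if profit > best_profit:
                best, best_profit = subset, profit
    return best, best_profit


# Hypothetical toy input: books covered by each source and a cost in $ per source.
coverage = {"s1": {1, 2, 3, 4}, "s2": {3, 4, 5}, "s3": {5}, "s4": {1, 2}}
source_cost = {"s1": 2.0, "s2": 1.5, "s3": 1.5, "s4": 0.5}


def gain(subset):
    # Assumed gain model: $1 for every distinct book covered by the subset.
    covered = set().union(*(coverage[s] for s in subset))
    return 1.0 * len(covered)


def cost(subset):
    # Assumed cost model: costs of the selected sources simply add up.
    return sum(source_cost[s] for s in subset)


chosen, profit = best_subset(coverage.keys(), gain, cost)
print(sorted(chosen), profit)
```

In this toy input the largest source `s1` is not part of the best subset: `s2` and `s4` together cover all five books at a lower cost.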

*Insight II. Yes, It Is A HARD Problem

*Theorem I (NP-Completeness). Under the arbitrary cost model (i.e., different sources have different costs), Marginalism is NP-complete.

*Theorem II (A greedy solution can obtain arbitrarily bad results). Let d_opt be the optimal profit and d be the profit of a greedy solution. For any θ > 0, there exists an input set of sources and a gain model s.t. d/d_opt < θ.
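The flavor of Theorem II can be illustrated with a toy gain model in which two sources are valuable only together. This is a hypothetical construction for illustration, not the one used in the paper's proof; the names `greedy`, `optimum`, and `make_profit` are assumptions as well.

```python
from itertools import combinations


def greedy(sources, profit):
    """Keep adding the source with the best resulting profit until none improves it."""
    selected = frozenset()
    while True:
        candidates = [selected | {s} for s in sources if s not in selected]
        if not candidates:
            return selected
        best = max(candidates, key=profit)
        if profit(best) <= profit(selected):
            return selected
        selected = best


def optimum(sources, profit):
    """Brute-force optimum over all subsets (fine for three sources)."""
    subsets = (frozenset(c)
               for r in range(len(sources) + 1)
               for c in combinations(sources, r))
    return max(subsets, key=profit)


def make_profit(big):
    # Hypothetical gain model: "a" is worth 2 on its own, while "b" and "c"
    # are worth `big` only when both are selected. Every source costs 1.
    def profit(subset):
        gain = (2 if "a" in subset else 0) + (big if {"b", "c"} <= subset else 0)
        return gain - len(subset)
    return profit


sources = ["a", "b", "c"]
for big in (10, 100, 1000):
    p = make_profit(big)
    d, d_opt = p(greedy(sources, p)), p(optimum(sources, p))
    print(big, d, d_opt, d / d_opt)
```

Greedy picks `a`, then sees that adding `b` or `c` alone lowers the profit and stops, so d stays at 1 while d_opt grows with `big`; the ratio d/d_opt can be pushed below any chosen θ by increasing `big`.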

*Insight III. An Efficient Algorithm—GRASP Solution

Improvement I. Randomly select from Top-k solutions

Improvement II. Hill climbing to improve the initial solution

Improvement III. Repeat r times and choose the best solution
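The three improvements can be sketched in Python as follows. This is a simplified reading of the slide, not the authors' implementation: the add/remove/swap neighbourhood for hill climbing, the stopping rules, and the toy profit function (coverage at $1 per book minus per-source costs) are all assumptions made for illustration.

```python
import random


def neighbours(subset, sources):
    """Subsets reachable by adding, removing, or swapping one source (assumed neighbourhood)."""
    outside = [s for s in sources if s not in subset]
    for s in outside:
        yield subset | {s}
    for s in subset:
        yield subset - {s}
    for s_in in subset:
        for s_out in outside:
            yield (subset - {s_in}) | {s_out}


def grasp(sources, profit, k=3, r=20, seed=0):
    """Randomized greedy construction + hill climbing, repeated r times."""
    rng = random.Random(seed)
    sources = list(sources)
    best = frozenset()
    for _ in range(r):  # Improvement III: repeat r times, keep the best
        current = frozenset()
        while True:
            # Improvement I: choose randomly among the top-k improving additions.
            moves = [current | {s} for s in sources if s not in current]
            moves = sorted((m for m in moves if profit(m) > profit(current)),
                           key=profit, reverse=True)[:k]
            if not moves:
                break
            current = rng.choice(moves)
        # Improvement II: hill climbing from the constructed solution.
        improved = True
        while improved:
            improved = False
            for candidate in neighbours(current, sources):
                if profit(candidate) > profit(current):
                    current, improved = candidate, True
                    break
        if profit(current) > profit(best):
            best = current
    return best


# Hypothetical toy input: books covered by each source and a cost in $ per source.
coverage = {"s1": {1, 2, 3, 4}, "s2": {3, 4, 5}, "s3": {5}, "s4": {1, 2}}
source_cost = {"s1": 2.0, "s2": 1.5, "s3": 1.5, "s4": 0.5}


def profit(subset):
    covered = set().union(*(coverage[s] for s in subset))
    return 1.0 * len(covered) - sum(source_cost[s] for s in subset)


best = grasp(coverage, profit, k=3, r=20, seed=0)
print(sorted(best), profit(best))
```

A plain greedy run on this input would start with `s1` (the best single source) and get stuck there; randomizing the first pick and restarting lets the search reach the better combination of `s2` and `s4`.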

*Side Contributions

*Side contributions on data fusion

*The PopAccu model: monotonicity, i.e., adding a source should never decrease fusion quality

*Algorithms to estimate fusion quality: dynamic programming

*Experimental Setup

*Book data set: CS books at AbeBooks.com in 2007

*894 sources

*1265 books

*24364 records

*Flight data set: Deep-Web sources for “flight status” in 2011

*38 sources

*1200 flights

*27469 records

*Maximizing Fusion Quality

*228 sources provide books in the gold standard

*Marginalism selects 165 sources, reaching the highest quality

*PopAccu outperforms Vote and Accu, and is nearly monotonic for “good” sources

*Source Selection: The Goal

*Marginalism has higher profit than MaxGLimitC and MinCLimitG most of the time

*Greedy solution often cannot find the optimal solution

*GRASP (top-10, repeating 320 times) obtains nearly optimal results

*Source Selection: The Approach

*Future Work

*Full-fledged source selection for data integration

*Other quality measures: e.g., freshness, consistency, redundancy; correlations, copying relationships between sources

*Complex cost and gain models

*Selecting subsets of data from each source

*Other components of data integration: schema mapping, entity resolution

The More the Better? OR Less is More?