
Partha Pratim Talukdar (Microsoft Research)

Zack Ives (University of Pennsylvania)

Fernando Pereira (Google)

Automatically Incorporating New Sources in Keyword-Search based Data Integration

SIGMOD 2010, June 9, 2010

“For (m)any data integration problem, if you don’t involve human, then there is no hope.”

-- AnHai Doan (Yesterday)

Automatic Data Integration

[Figure labels: Tables (Data Sources); Info. Need; user query; One of the few tables to be joined to answer user query; Schema Matching (with errors); New Table.]

End Goal: To be able to pose integrative queries against a growing heterogeneous dataset and get meaningful answers.

The Reality Today

• Multiple steps requiring expert integrator
  – Poll users, create global schema
  – Semi-automatically generate schema mappings
    • Fix errors
  – Create query forms
    • Fix errors revealed by bad data

• But this doesn’t work well for discovery (ad hoc) queries, e.g., in science
  – Too many sources, queries to administer
  – Mistakes not revealed until queries posed
  – Too many attributes for pairwise schema matching

Q: Query-driven, Admin-Free Integration

[Architecture figure labels: Data Sources (b, c, d, G, M, P); Schema Matching (Alignment); Matching Scores; Schema Graph; Keyword Query “a b”; Ranked Query Answering; Results + feedback; Matching Correction; New Source (N); View-based Pruning of Matching.]

1. Discovering Schema Matches



Schema Matchers

• Metadata Level
  – COMA++ [Do and Rahm, 2007]
    • pairwise column comparisons necessary

• Instance Level
  – Based on Modified Adsorption (MAD) [next slide]
    • random-walk inspired, previously used in NLP problems
    • pairwise column comparisons not necessary
    • parallelizable, suitable for large datasets

Schema Matching using MAD

Example sources:

DB1 (ID, Name): (GO12, aco-2), (GO25, p3)
DB2 (GO_ID, Locus): (GO30, AT2G34), (GO12, AT1G35)
DB3 (Loci): AT2G35, AT1G36

[Figure: graph with Attribute Nodes (DB1.ID, DB1.Name, DB2.GO_ID, DB2.Locus, DB3.Loci) and Value Nodes (GO12, GO25, GO30, p3, aco-2, AT2G34, AT1G35, AT1G36). Each attribute node carries a unique Seed Label (L1 to L5); after propagation, nodes carry several labels (e.g., L1 and L3).]

All Labels Propagated in Parallel by MAD
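
The idea of seeding each attribute with a unique label and propagating all labels over the attribute/value graph can be illustrated with a minimal sketch. This is a plain synchronous label-propagation loop, not the MAD algorithm itself; the function name `propagate_labels`, the `seed_weight` mixing parameter, and the iteration count are assumptions made for illustration. The example data are the three sources shown on the slide.

```python
from collections import defaultdict


def propagate_labels(columns, iterations=10, seed_weight=0.5):
    """Simplified synchronous label propagation (illustrative, not full MAD)."""
    # Bipartite graph: each attribute node links to the value nodes it contains.
    neighbors = defaultdict(set)
    for attribute, values in columns.items():
        for value in values:
            neighbors[("attr", attribute)].add(("val", value))
            neighbors[("val", value)].add(("attr", attribute))

    # Every attribute node is seeded with its own unique label.
    seeds = {("attr", attribute): {attribute: 1.0} for attribute in columns}
    scores = {node: dict(seeds.get(node, {})) for node in neighbors}

    for _ in range(iterations):
        updated = {}
        # All nodes are updated from the previous round's scores (in parallel).
        for node, adjacent in neighbors.items():
            mixed = defaultdict(float)
            for other in adjacent:
                for label, score in scores[other].items():
                    mixed[label] += score / len(adjacent)
            if node in seeds:
                # Seeded nodes blend the neighbor average with their seed label.
                for label in mixed:
                    mixed[label] *= 1.0 - seed_weight
                for label, score in seeds[node].items():
                    mixed[label] += seed_weight * score
            updated[node] = dict(mixed)
        scores = updated

    return {node[1]: dist for node, dist in scores.items() if node[0] == "attr"}


# The three example sources from the slide, one entry per attribute.
example = {
    "DB1.ID": ["GO12", "GO25"],
    "DB1.Name": ["aco-2", "p3"],
    "DB2.GO_ID": ["GO30", "GO12"],
    "DB2.Locus": ["AT2G34", "AT1G35"],
    "DB3.Loci": ["AT2G35", "AT1G36"],
}
labels = propagate_labels(example)
```

In this sketch, DB1.ID and DB2.GO_ID share the value GO12, so each ends up with a nonzero score for the other's label, which can be read as a candidate match. No column-by-column comparison is performed at any point.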

2. Correcting Matching Errors


Keyword Matched Sources in Q’s Schema Graph

[Figure: Schema Graph over sources b, c, d, G, M, P with edge costs 0.07, 0.1, 0.1, 0.1, 0.04, 0.1.]

Edge Cost Encodes User Preference (lower is better)

Matching Error: How can we assign it a higher (worse) cost?

Correcting Error: Learning New Edge Costs

[Figure: ranked list of query trees, from Top to Bottom, each producing Tuples. [Talukdar+, VLDB 2008]]

Query: tree over b, d, G, M, P with edge costs 0.1, 0.1, 0.1, 0.1. Cost = 0.4
Query*: tree over b, c, d, G, M, P with edge costs 0.07, 0.1, 0.1, 0.04, 0.1. Cost = 0.41

The user gives feedback on answers (tuples), which is what the user cares about.

After feedback, an edge cost is updated to 0.5. The two trees then have Cost = 0.41 and Cost = 0.8.
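
The arithmetic on this slide can be checked with a small sketch, assuming that a query tree's cost is the sum of its edge costs and that the edge raised to 0.5 belongs to Query (the only reading under which 0.5 + 0.1 + 0.1 + 0.1 gives the 0.8 shown). Which of Query's edges is updated is not recoverable from the slide, so the first one is used arbitrarily; `tree_cost` and `rank` are illustrative names.

```python
def tree_cost(edge_costs):
    """Cost of a query tree, taken here as the sum of its edge costs."""
    return sum(edge_costs)


def rank(trees):
    """Return tree names ordered from lowest (best) to highest cost."""
    return sorted(trees, key=lambda name: tree_cost(trees[name]))


# Edge costs as shown on the slide before feedback.
trees_before = {
    "Query": [0.1, 0.1, 0.1, 0.1],
    "Query*": [0.07, 0.1, 0.1, 0.04, 0.1],
}
ranking_before = rank(trees_before)

# Assumption: feedback raises one edge of Query to 0.5 (the first, arbitrarily).
trees_after = dict(trees_before)
trees_after["Query"] = [0.5, 0.1, 0.1, 0.1]
ranking_after = rank(trees_after)
```

Before feedback Query (0.4) ranks above Query* (0.41); after the update Query costs 0.8 and drops below Query*.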

Decomposition of Edge Cost

Edge between TABLE 1 and TABLE 2:

Feature Name      Matching Cost   Coefficient (Values Learned)
COMA++ Matched    0.90            w_COMA++
MAD Matched       0.7             w_LP
---               ---             ---

Edge Cost = 0.9 * w_COMA++ + 0.7 * w_LP
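
As a minimal sketch of this decomposition, the edge cost can be computed as a weighted sum of per-feature matching costs. The feature values 0.9 and 0.7 come from the slide; the weight values below are hypothetical placeholders, since the learned coefficients are not given.

```python
def edge_cost(feature_costs, weights):
    """Weighted sum of per-feature matching costs for one schema-graph edge."""
    return sum(weights[name] * cost for name, cost in feature_costs.items())


# Matching costs from the slide for the edge between TABLE 1 and TABLE 2.
features = {"COMA++ Matched": 0.9, "MAD Matched": 0.7}

# Hypothetical coefficient values; the real ones are learned from feedback.
weights = {"COMA++ Matched": 0.5, "MAD Matched": 0.5}

cost = edge_cost(features, weights)  # 0.9 * 0.5 + 0.7 * 0.5
```

Because only the coefficients are learned, feedback on one edge also changes the cost of every other edge that shares the same features.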

Learning: Incorporating User Feedback

• Model feedback incorporation as a constrained optimization problem.

MIRA Algorithm (Crammer et al., 2006)

[Equation figure, annotated with: New Model Parameters; Current Model Parameters; Tree Cost; Loss; Tree whose tuples user likes; Tree whose tuples user doesn’t like.]
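
The slide's equation did not survive extraction, so the following is only a sketch of the standard single-constraint MIRA (passive-aggressive) update, written to match the annotations: find new parameters closest to the current ones such that the disliked tree costs more than the liked tree by at least the loss. The function name, the feature vectors, and the loss value are assumptions, and the formulation in the talk may include further constraints.

```python
def mira_update(weights, liked_features, disliked_features, loss):
    """One MIRA-style step: smallest weight change that makes
    cost(disliked) - cost(liked) >= loss, where cost(tree) = weights . features."""
    delta = [d - g for g, d in zip(liked_features, disliked_features)]
    margin = sum(w * x for w, x in zip(weights, delta))
    norm_sq = sum(x * x for x in delta)
    if margin >= loss or norm_sq == 0.0:
        return list(weights)  # constraint already satisfied, or no usable direction
    step = (loss - margin) / norm_sq
    return [w + step * x for w, x in zip(weights, delta)]


def cost(weights, features):
    return sum(w * x for w, x in zip(weights, features))


# Hypothetical trees described by two aggregated features each.
current = [0.5, 0.5]
liked = [1.0, 0.0]      # tree whose tuples the user likes
disliked = [0.0, 0.5]   # tree whose tuples the user does not like

new = mira_update(current, liked, disliked, loss=0.25)
```

With these numbers the disliked tree starts out cheaper (0.25 versus 0.5); after the update the weights are approximately [0.1, 0.7], so the liked tree costs about 0.1 and the disliked tree about 0.35.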

3. Where to Align New Source?

Where to Match a New Source?

[Figure: A schema graph with 5 sources and 2 keywords: term and plasma membrane. The sources are GO (acc, term_id), InterPro2GO (go_id, entry_ac), InterPro Pub (pub_id, title), InterProEntry (name, entry_ac), and InterProEntry 2 Pub (entry_ac, pub_id). The shaded oval (the cost neighborhood) includes all nodes reachable with cost ≤ 2 from at least one of the keywords.]

• The neighborhood is imposed by the cost of the kth best answer.
• Where should the new source be matched?
• Matchings outside this neighborhood are not going to affect the k-best answers (i.e., the current view).

View Based Aligner: Consider only those matchings that are likely to affect query results, as otherwise there will be no feedback from the user.
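
The pruning idea can be sketched as a bounded shortest-path search: starting from the keyword nodes, collect every schema-graph node reachable within a cost budget, and consider only those nodes as match candidates for the new source. The sketch below uses a multi-source Dijkstra search with a cutoff. The toy graph, its edge costs, and the function name `cost_neighborhood` are hypothetical, since the exact edges in the slide's figure are not recoverable.

```python
import heapq


def cost_neighborhood(edges, keyword_nodes, max_cost):
    """Return {node: cheapest cost} for nodes within max_cost of any keyword."""
    # Build an undirected adjacency list from (node_a, node_b, cost) triples.
    graph = {}
    for a, b, edge_cost in edges:
        graph.setdefault(a, []).append((b, edge_cost))
        graph.setdefault(b, []).append((a, edge_cost))

    best = {}
    heap = [(0.0, node) for node in keyword_nodes]
    heapq.heapify(heap)
    while heap:
        cost, node = heapq.heappop(heap)
        if node in best:
            continue  # already settled with a cheaper or equal cost
        best[node] = cost
        for neighbor, edge_cost in graph.get(node, []):
            total = cost + edge_cost
            if neighbor not in best and total <= max_cost:
                heapq.heappush(heap, (total, neighbor))
    return best


# Hypothetical toy graph; costs are illustrative only.
toy_edges = [
    ("keyword:term", "GO.term_id", 1.0),
    ("GO.term_id", "GO.acc", 0.0),
    ("GO.acc", "InterPro2GO.go_id", 0.5),
    ("InterPro2GO.go_id", "far_table.col", 2.0),
]
reachable = cost_neighborhood(toy_edges, ["keyword:term"], max_cost=2.0)
```

Here `far_table.col` lies at cost 3.5 and is excluded, so a new source would not be compared against it. In the talk the budget comes from the cost of the kth best answer rather than a fixed constant.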

Experiments

Two questions:

I. Can we repair alignment errors by exploiting user feedback over answers?

II. Can we reduce the number of pairwise comparisons necessary during alignment discovery for a new source?

I. Correcting Schema Matching Errors: Setup

[Figure: Schema Graph (InterPro-GO) with Gold Matchings. Tables: go_term, interpro_interpro2go, interpro_entry2pub, interpro_method2pub, interpro_method, interpro_pub, interpro_entry, interpro_journal.]

• Start with just the tables
• Use automatic schema matchers (e.g., COMA++, MAD)
• Rank matchings based on cost learned from keyword queries and feedback over answers (using Q)
• Compute precision-recall w.r.t. the gold matchings (left figure)
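
The evaluation step in the last bullet can be sketched as follows: take the matchings ranked by learned cost (lowest first), cut the list at k, and compare against the gold matchings. The column names in the example are hypothetical and serve only to make the sketch runnable; they are not the gold matchings used in the experiments.

```python
def precision_recall(ranked_matchings, gold_matchings, k):
    """Precision and recall of the top-k ranked matchings against a gold set."""
    top_k = ranked_matchings[:k]
    hits = sum(1 for matching in top_k if matching in gold_matchings)
    precision = hits / len(top_k) if top_k else 0.0
    recall = hits / len(gold_matchings) if gold_matchings else 0.0
    return precision, recall


# Hypothetical matchings, ordered from lowest to highest learned cost.
ranked = [
    ("go_term.acc", "interpro_interpro2go.go_id"),
    ("go_term.name", "interpro_journal.title"),
    ("interpro_entry.entry_ac", "interpro_entry2pub.entry_ac"),
]
gold = {
    ("go_term.acc", "interpro_interpro2go.go_id"),
    ("interpro_entry.entry_ac", "interpro_entry2pub.entry_ac"),
}

curve = [precision_recall(ranked, gold, k) for k in range(1, len(ranked) + 1)]
```

Sweeping k over the ranked list yields the points of a precision-recall curve of the kind plotted for COMA++, MAD, and Q.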

I. Correcting Schema Matching Errors

[Figure: Precision-Recall Plots for Various Methods. x-axis: Recall; y-axis: Precision; series: COMA++, MAD, Q.]

Learning with Q helps correct schema matching errors.

I. Correcting Schema Matching Errors (contd.)

[Figure: Gold vs Non-Gold Edge Costs After Increasing Feedback. x-axis: Feedback Step Number (1 to 39); y-axis: Average Edge Costs (Lower is Better); series: Avg. Gold Edge Cost, Avg. Non-Gold Edge Cost.]

Learning with Q helps identify the correct (gold) alignments.

II. Reducing Pairwise Comparisons during New Source Integration

[Figure: x-axis: Number of Tables in the Schema Graph (18, 100, 500); y-axis: # Pairwise Comparisons (0 to 20,000); series: Exhaustive, ViewBasedAligner.]

View Based Aligner Significantly Reduces the Number of Comparisons.

Related Work

• B. Alexe, L. Chiticariu, R. J. Miller, and W.-C. Tan. Muse: Mapping understanding and design by example. In ICDE, 2008.
• L. Chiticariu, P. G. Kolaitis, and L. Popa. Interactive generation of integrated schemas. In SIGMOD, 2008.
• A. Das Sarma, L. Dong, and A. Halevy. Bootstrapping pay-as-you-go data integration systems. In SIGMOD, 2008.
• Fagin+. Clio: Schema mapping creation and data exchange. In Conceptual Modeling: Foundations and Applications, 2009.
• S. R. Jeffery, M. J. Franklin, and A. Y. Halevy. Pay-as-you-go user feedback for dataspace systems. In SIGMOD, 2008.
• Talukdar+. Learning to create data-integrating queries. In VLDB, 2008.

Summary

• A new data-centric schema matching algorithm based on Modified Adsorption (MAD)
  – doesn’t require pairwise column comparison, scalable

• A system architecture that
  – combines off-the-shelf schema matchers’ alignments
  – exploits user feedback over answers to repair matching errors

• Integrates new sources
  – through incremental updates to schema matchings

Thank You!

Poster: Tomorrow (Thu), 3:30pm Cosmopolitan AB
