Upload
others
View
3
Download
0
Embed Size (px)
Citation preview
Partha Pratim Talukdar (Microsoft Research) Zack Ives (University of Pennsylvania)
Fernando Pereira (Google)
Automatically Incorporating New Sources in Keyword-Search
based Data Integration
SIGMOD 2010, June 9, 2010
“For (m)any data integration problem, if you don’t involve human, then there is no hope.”
2
“For (m)any data integration problem, if you don’t involve human, then there is no hope.”
2
-- AnHai Doan
“For (m)any data integration problem, if you don’t involve human, then there is no hope.”
2
-- AnHai Doan (Yesterday)
Automatic Data Integration
3
Tables(Data Sources)
Automatic Data Integration
3
Tables(Data Sources)
Info. Need
Automatic Data Integration
3
Tables(Data Sources)
One of the few tables to be joined to answer
user query
Info. Need
Automatic Data Integration
3
Tables(Data Sources)
One of the few tables to be joined to answer
user query
Schema Matching(with errors)
Info. Need
Automatic Data Integration
3
Tables(Data Sources)
New Table
One of the few tables to be joined to answer
user query
Schema Matching(with errors)
Info. Need
Automatic Data Integration
3
Tables(Data Sources)
New Table
One of the few tables to be joined to answer
user query
Schema Matching(with errors)
Info. Need
Automatic Data Integration
3
Tables(Data Sources)
End GoalTo be able to pose integrative queries against a
growing heterogeneous dataset and get meaningful answer.
New Table
One of the few tables to be joined to answer
user query
Schema Matching(with errors)
Info. Need
The Reality Today
4
The Reality Today
• Multiple steps requiring expert integrator– Poll users, create global schema– Semi-automatically generate schema mappings
• Fix errors
– Create query forms• Fix errors revealed by bad data
4
The Reality Today
• Multiple steps requiring expert integrator– Poll users, create global schema– Semi-automatically generate schema mappings
• Fix errors
– Create query forms• Fix errors revealed by bad data
• But this doesn’t work well for discovery (ad hoc) queries, e.g., in science– Too many sources, queries to administer– Mistakes not revealed until queries posed– Too many attributes for pairwise schema matching
4
4
d
c
bP
M
G
Data Sources
Q: Query-driven, Admin-Free Integration
4
d
c
bP
M
G
Data Sources
Q: Query-driven, Admin-Free Integration
P b
M
G
c
dMatchingScores
SchemaGraph
Schema Matching
(Alignment)
4
d
c
bP
M
G
Data Sources
Q: Query-driven, Admin-Free Integration
P b
M
G
c
dMatchingScores
SchemaGraph
Schema Matching
(Alignment)
Ranked Query
Answering
“a b”KeywordQuery
Results + feedback
4
d
c
bP
M
G
Data Sources
Q: Query-driven, Admin-Free Integration
P b
M
G
c
dMatchingScores
SchemaGraph
Schema Matching
(Alignment)
MatchingCorrection
Ranked Query
Answering
“a b”KeywordQuery
Results + feedback
4
d
c
bP
M
G
Data Sources
Q: Query-driven, Admin-Free Integration
NNewSource
P b
M
G
c
dMatchingScores
SchemaGraph
Schema Matching
(Alignment)
MatchingCorrection
Ranked Query
Answering
“a b”KeywordQuery
Results + feedback
4
d
c
bP
M
G
Data Sources
Q: Query-driven, Admin-Free Integration
NNewSource
P b
M
G
c
dMatchingScores
SchemaGraph
View-based Pruning ofMatching
Schema Matching
(Alignment)
MatchingCorrection
Ranked Query
Answering
“a b”KeywordQuery
Results + feedback
4
d
c
bP
M
G
Data Sources
1. Discovering Schema Matches
NNewSource
P b
M
G
c
dMatchingScores
SchemaGraph
View-based Pruning ofMatching
Schema Matching
(Alignment)
MatchingCorrection
Ranked Query
Answering
“a b”KeywordQuery
Results + feedback
1. Discovering Schema Matches
Schema Matchers
1. Discovering Schema Matches
• Metadata Level– COMA++ [Do and Rahm, 2007]
• pairwise column comparisons necessary
COMA++
Schema Matchers
1. Discovering Schema Matches
• Metadata Level– COMA++ [Do and Rahm, 2007]
• pairwise column comparisons necessary
• Instance Level– Based on Modified Adsorption (MAD) [next slide]
• random-walk inspired, previously used in NLP problems• pairwise column comparisons not necessary • parallelizable, suitable for large datasets
COMA++
.
.
.
MAD
Schema Matchers
Schema Matching using MAD
GO_ID Locus
GO30 AT2G34
GO12 AT1G35
DB2ID Name
GO12 aco-2
GO25 p3
DB1Loci
AT2G35
AT1G36
DB3
Schema Matching using MAD
GO_ID Locus
GO30 AT2G34
GO12 AT1G35
DB2ID Name
GO12 aco-2
GO25 p3
DB1Loci
AT2G35
AT1G36
DB3
Value Node
Attribute Node
GO12 P3GO25 AT2G34 GO30 AT1G35 aco-2
DB2.GO_ID
DB2.LocusDB1.ID DB1.Name DB3.
Loci
AT1G36
Schema Matching using MAD
GO_ID Locus
GO30 AT2G34
GO12 AT1G35
DB2ID Name
GO12 aco-2
GO25 p3
DB1Loci
AT2G35
AT1G36
DB3
GO12 P3GO25 AT2G34 GO30 AT1G35 aco-2
DB2.GO_ID
DB2.LocusDB1.ID DB1.Name DB3.
Loci
AT1G36
L1 L2 L3 L4 L5
Seed Label (unique)
Schema Matching using MAD
GO_ID Locus
GO30 AT2G34
GO12 AT1G35
DB2ID Name
GO12 aco-2
GO25 p3
DB1Loci
AT2G35
AT1G36
DB3
GO12 P3GO25 AT2G34 GO30 AT1G35 aco-2
DB2.GO_ID
DB2.LocusDB1.ID DB1.Name DB3.
Loci
AT1G36
L1 L2 L3 L4 L5
L3L1
Seed Label (unique)
Schema Matching using MAD
GO_ID Locus
GO30 AT2G34
GO12 AT1G35
DB2ID Name
GO12 aco-2
GO25 p3
DB1Loci
AT2G35
AT1G36
DB3
GO12 P3GO25 AT2G34 GO30 AT1G35 aco-2
DB2.GO_ID
DB2.LocusDB1.ID DB1.Name DB3.
Loci
AT1G36
L1 L2 L3 L4 L5
L3L1
L1L3
L4L5
L5L4L2
Seed Label (unique)
All Labels Propagated in Parallel by MAD
4
d
c
bP
M
G
Data Sources
2. Correcting Matching Errors
NNewSource
P b
M
G
c
dMatchingScores
SchemaGraph
View-based Pruning ofMatching
Schema Matching
(Alignment)
MatchingCorrection
Ranked Query
Answering
“a b”KeywordQuery
Results + feedback
2. Correcting Matching Errors
10
P b
M
G
c
d
Keyword Matched Sources in Q’s Schema Graph
2. Correcting Matching Errors
10
P b
M
G
c
d
Schema Graph
Keyword Matched Sources in Q’s Schema Graph
2. Correcting Matching Errors
10
0.07
0.1
0.1
0.1
0.04
0.1
P b
M
G
c
d
Schema Graph
Edge Cost Encodes User Preference (lower is better)
Keyword Matched Sources in Q’s Schema Graph
2. Correcting Matching Errors
10
0.07
0.1
0.1
0.1
0.04
0.1
P b
M
G
c
d
Schema Graph
Edge Cost Encodes User Preference (lower is better)
Matching Error: How can we assign it higher (worse) cost?
Correcting Error: Learning New Edge Costs
.
.
.
Top
Bottom
0.1
0.1
0.1
0.1
b
d G
M
P
0.07
0.1
0.10.04
0.1
b c
d G
M
P
11
[Talukdar+, VLDB 2008]
Cost= 0.4
Cost= 0.41
Correcting Error: Learning New Edge Costs
Query
Query*
.
.
.
Query
.
.
.
Top
Bottom
0.1
0.1
0.1
0.1
b
d G
M
P
0.07
0.1
0.10.04
0.1
b c
d G
M
P
11
[Talukdar+, VLDB 2008]
Cost= 0.4
Cost= 0.41
Correcting Error: Learning New Edge Costs
Query
Query*
.
.
.
Query
.
.
.
Tuples
.
.
.
Top
Bottom
0.1
0.1
0.1
0.1
b
d G
M
P
0.07
0.1
0.10.04
0.1
b c
d G
M
P
11
[Talukdar+, VLDB 2008]
Cost= 0.4
Cost= 0.41
Correcting Error: Learning New Edge Costs
Query
Query*
.
.
.
Query
.
.
.
Tuples
.
.
.
Top
Bottom
feedback on answers, which is what the user cares about
0.1
0.1
0.1
0.1
b
d G
M
P
0.07
0.1
0.10.04
0.1
b c
d G
M
P
11
[Talukdar+, VLDB 2008]
Cost= 0.4
Cost= 0.41
Correcting Error: Learning New Edge Costs
Query
Query*
.
.
.
Query
.
.
.
Tuples
.
.
.
Top
Bottom
updated cost
0.1
0.1
0.1
0.1
b
d G
M
P
0.07
0.1
0.10.04
0.1
b c
d G
M
P
0.5
11
Cost= 0.41
Cost= 0.8
Decomposition of Edge Cost
12
TABLE1 TABLE 2
Decomposition of Edge Cost
12
FeatureName
Matching Cost
Coefficient(Values Learned)
COMA++ Matched 0.90 wCOMA++
MAD Matched 0.7 wLP
--- --- ---
TABLE1 TABLE 2
Decomposition of Edge Cost
12
FeatureName
Matching Cost
Coefficient(Values Learned)
COMA++ Matched 0.90 wCOMA++
MAD Matched 0.7 wLP
--- --- ---
TABLE1 TABLE 2
Edge Cost = 0.9 * WCOMA++ + 0.7 * WLP
Decomposition of Edge Cost
12
FeatureName
Matching Cost
Coefficient(Values Learned)
COMA++ Matched 0.90 wCOMA++
MAD Matched 0.7 wLP
--- --- --- Learned
TABLE1 TABLE 2
Edge Cost = 0.9 * WCOMA++ + 0.7 * WLP
Learning: Incorporating User Feedback
• Model feedback incorporation as a constrained optimization problem.
13
Learning: Incorporating User Feedback
MIRA Algorithm(Crammer et al., 2006)
• Model feedback incorporation as a constrained optimization problem.
13
Learning: Incorporating User Feedback
MIRA Algorithm(Crammer et al., 2006)
• Model feedback incorporation as a constrained optimization problem.
13
New Model
Parameters
CurrentModel
Parameters
Learning: Incorporating User Feedback
MIRA Algorithm(Crammer et al., 2006)
• Model feedback incorporation as a constrained optimization problem.
13
New Model
Parameters
CurrentModel
Parameters
Tree Cost
Loss
Learning: Incorporating User Feedback
MIRA Algorithm(Crammer et al., 2006)
Tree whose tuples user likes
Tree whose tuples user doesn’t like.
• Model feedback incorporation as a constrained optimization problem.
13
New Model
Parameters
CurrentModel
Parameters
Tree Cost
Loss
4
d
c
bP
M
G
Data Sources
3. Where to Align New Source?
NNewSource
P b
M
G
c
dMatchingScores
SchemaGraph
View-based Pruning ofMatching
Schema Matching
(Alignment)
MatchingCorrection
Ranked Query
Answering
“a b”KeywordQuery
Results + feedback
15
Where to Match a New Source?
3
15
GO
acc term_id
InterPro2GO
go_id entry_ac
InterPro Pub
pub_idtitle
InterProEntry
name entry_ac
InterProEntry 2 Pub
entry_ac pub_id
Keyword CostNeighborhood
plasma membrane
term
0 0 0 0 0 0
0 00 0
1
0.5
0.25
0.25
2
2
22
A schema graph with 5 sources and 2 keywords: term and plasma membrane. The shaded oval includes all nodes reachable with cost ≤ 2 from at least one of the keywords.
Where to Match a New Source?
3
Keywords
15
GO
acc term_id
InterPro2GO
go_id entry_ac
InterPro Pub
pub_idtitle
InterProEntry
name entry_ac
InterProEntry 2 Pub
entry_ac pub_id
Keyword CostNeighborhood
plasma membrane
term
0 0 0 0 0 0
0 00 0
1
0.5
0.25
0.25
2
2
22
A schema graph with 5 sources and 2 keywords: term and plasma membrane. The shaded oval includes all nodes reachable with cost ≤ 2 from at least one of the keywords.
Where to Match a New Source?
3
Neighborhood imposed by cost of
kth best answer.
Keywords
15
GO
acc term_id
InterPro2GO
go_id entry_ac
InterPro Pub
pub_idtitle
InterProEntry
name entry_ac
InterProEntry 2 Pub
entry_ac pub_id
Keyword CostNeighborhood
plasma membrane
term
0 0 0 0 0 0
0 00 0
1
0.5
0.25
0.25
2
2
22
A schema graph with 5 sources and 2 keywords: term and plasma membrane. The shaded oval includes all nodes reachable with cost ≤ 2 from at least one of the keywords.
Where to Match a New Source?
?New
Source
3
Neighborhood imposed by cost of
kth best answer.
Keywords
15
GO
acc term_id
InterPro2GO
go_id entry_ac
InterPro Pub
pub_idtitle
InterProEntry
name entry_ac
InterProEntry 2 Pub
entry_ac pub_id
Keyword CostNeighborhood
plasma membrane
term
0 0 0 0 0 0
0 00 0
1
0.5
0.25
0.25
2
2
22
A schema graph with 5 sources and 2 keywords: term and plasma membrane. The shaded oval includes all nodes reachable with cost ≤ 2 from at least one of the keywords.
Where to Match a New Source?
?New
Source
3
Matchings outside this neighborhood is not going to affect k-best answers
(i.e., current view).
Neighborhood imposed by cost of
kth best answer.
Keywords
15
GO
acc term_id
InterPro2GO
go_id entry_ac
InterPro Pub
pub_idtitle
InterProEntry
name entry_ac
InterProEntry 2 Pub
entry_ac pub_id
Keyword CostNeighborhood
plasma membrane
term
0 0 0 0 0 0
0 00 0
1
0.5
0.25
0.25
2
2
22
A schema graph with 5 sources and 2 keywords: term and plasma membrane. The shaded oval includes all nodes reachable with cost ≤ 2 from at least one of the keywords.
Where to Match a New Source?
?New
Source
3
View Based AlignerConsider only those matchings which are likely to affect query
results, as otherwise there will be no feedback from user.
Matchings outside this neighborhood is not going to affect k-best answers
(i.e., current view).
Neighborhood imposed by cost of
kth best answer.
Keywords
Experiments
16
Experiments
Two questions:
16
Experiments
Two questions:I. Can we repair alignment errors by exploiting
user feedback over answers?
16
Experiments
Two questions:I. Can we repair alignment errors by exploiting
user feedback over answers?
II.Can we reduce the number of pairwise comparisons necessary during alignment discovery for new source?
16
1. Correcting Schema Matching Errors: Setup
17
1. Correcting Schema Matching Errors: Setup
17
go_term
interpro_interpro2go
interpro_entry2pub interpro_method2pub
interpro_methodinterpro_pubinterpro_entry
interpro_journal
Schema Graph (InterPro-GO) with Gold Matchings
1. Correcting Schema Matching Errors: Setup
17
go_term
interpro_interpro2go
interpro_entry2pub interpro_method2pub
interpro_methodinterpro_pubinterpro_entry
interpro_journal
Schema Graph (InterPro-GO) with Gold Matchings
• Start with just the tables
1. Correcting Schema Matching Errors: Setup
17
go_term
interpro_interpro2go
interpro_entry2pub interpro_method2pub
interpro_methodinterpro_pubinterpro_entry
interpro_journal
Schema Graph (InterPro-GO) with Gold Matchings
• Start with just the tables
• Use automatic schema matchers (e.g., COMA++, MAD)
1. Correcting Schema Matching Errors: Setup
17
go_term
interpro_interpro2go
interpro_entry2pub interpro_method2pub
interpro_methodinterpro_pubinterpro_entry
interpro_journal
Schema Graph (InterPro-GO) with Gold Matchings
• Start with just the tables
• Use automatic schema matchers (e.g., COMA++, MAD)
• Rank matchings based on cost learned from keyword queries and feedback over answers (using Q)
1. Correcting Schema Matching Errors: Setup
17
go_term
interpro_interpro2go
interpro_entry2pub interpro_method2pub
interpro_methodinterpro_pubinterpro_entry
interpro_journal
Schema Graph (InterPro-GO) with Gold Matchings
• Start with just the tables
• Use automatic schema matchers (e.g., COMA++, MAD)
• Rank matchings based on cost learned from keyword queries and feedback over answers (using Q)
• Compute precision-recall w.r.t. the gold matchings (left figure)
I. Correcting Schema Matching Errors
18
I. Correcting Schema Matching Errors
18
0.25
0.438
0.625
0.813
1
0.125 0.25 0.375 0.5 0.625 0.75 0.875 1
Precision-Recall Plots for Various Methods
Pre
cisi
on
Recall
COMA++ MAD Q
I. Correcting Schema Matching Errors
18
0.25
0.438
0.625
0.813
1
0.125 0.25 0.375 0.5 0.625 0.75 0.875 1
Precision-Recall Plots for Various Methods
Pre
cisi
on
Recall
COMA++ MAD Q
Learning with Q helps correct schema
matching errors.
I. Correcting Schema Matching Errors (contd.)
19
I. Correcting Schema Matching Errors (contd.)
19
0.15
1.363
2.575
3.788
5
1 3 5 7 9 11 13 15 17 19 21 23 25 27 29 31 33 35 37 39
Gold vs Non-Gold Edge Costs After Increasing Feedback
Ave
rage
Ed
ge C
ost
s (L
ow
er
is B
ett
er)
Feedback Step Number
Avg. Gold Edge CostAvg. Non-Gold Edge Cost
I. Correcting Schema Matching Errors (contd.)
19
0.15
1.363
2.575
3.788
5
1 3 5 7 9 11 13 15 17 19 21 23 25 27 29 31 33 35 37 39
Gold vs Non-Gold Edge Costs After Increasing Feedback
Ave
rage
Ed
ge C
ost
s (L
ow
er
is B
ett
er)
Feedback Step Number
Avg. Gold Edge CostAvg. Non-Gold Edge Cost
Learning with Q helps identify the correct (gold) alignments.
20
II. Reducing Pairwise Comparisons during New Source Integration
20
II. Reducing Pairwise Comparisons during New Source Integration
0
5000.0
10000.0
15000.0
20000.0
18 100 500
# P
air
wis
e C
om
pari
son
s
Number of Tables in the Schema Graph
Exhaustive ViewBasedAligner
20
II. Reducing Pairwise Comparisons during New Source Integration
0
5000.0
10000.0
15000.0
20000.0
18 100 500
# P
air
wis
e C
om
pari
son
s
Number of Tables in the Schema Graph
Exhaustive ViewBasedAligner
View Based Aligner Significantly Reduces the Number of Comparisons.
Related Work• B. Alexe, L. Chiticariu, R. J. Miller, and W.-C. Tan. Muse: Mapping
understand- ing and design by example. In ICDE 2008
• Laura Chiticariu, Phokion G. Kolaitis, Lucian Popa: "Interactive Generation of Integrated Schemas". SIGMOD Conference 2008
• Anish Das Sarma, Luna Dong, Alon Halevy. Bootstrapping Pay-As-You-Go Data Integration System. SIGMOD 2008
• Fagin+, Clio: Schema Mapping Creation and Data Exchange. Conceptual Modeling: Foundations and Applications 2009
• S.R. Jeffery, M.J. Franklin, and A.Y. Halevy. Pay-as-you-go user feedback for dataspace systems. In SIGMOD, 2008
• Talukdar+. Learning to create data-integrating queries. In VLDB, 2008.
21
Summary
22
Summary
• A new data-centric schema matching algorithm based on Modified Adsorption (MAD)– doesn’t require pairwise column comparison, scalable
22
Summary
• A new data-centric schema matching algorithm based on Modified Adsorption (MAD)– doesn’t require pairwise column comparison, scalable
• A system architecture that– combines off-the-shelf schema matchers’ alignments– exploits user feedback over answers to repair matching
errors
22
Summary
• A new data-centric schema matching algorithm based on Modified Adsorption (MAD)– doesn’t require pairwise column comparison, scalable
• A system architecture that– combines off-the-shelf schema matchers’ alignments– exploits user feedback over answers to repair matching
errors
• Integrates new sources– through incremental updates to schema matchings
22
Thank You!
Poster: Tomorrow (Thu), 3:30pm Cosmopolitan AB