Alon Halevy
University of Washington
Joint work with Anhai Doan and Pedro Domingos
Learning to Map Between Schemas and Ontologies
2
Agenda
Ontology mapping is a key problem in many applications:
– Data integration
– Semantic web
– Knowledge management
– E-commerce
LSD:
– A solution that uses multi-strategy learning.
– We started with schema matching (i.e., very simple ontologies).
– Currently extending to more expressive ontologies.
– Experiments show the approach is very promising!
3
The Structure Mapping Problem
Types of structures:
– Database schemas, XML DTDs, ontologies, …
Input:
– Two (or more) structures, S1 and S2
– Data instances for S1 and S2
– Background knowledge
Output:
– A mapping between S1 and S2
– Should enable translating between data instances.
– Semantics of mapping?
4
Semantic Mappings between Schemas
Source schemas = XML DTDs

[Diagram: two DTD trees.
 Source: house(location, contact(name, phone))
 Mediated: house(address, num-baths(full-baths, half-baths), contact-info(agent-name, agent-phone))
 1-1 mappings: location <-> address, phone <-> agent-phone
 Non 1-1 mapping: num-baths <-> full-baths + half-baths]
5
Motivation
Database schema integration:
– A problem as old as databases themselves.
– Database merging, data warehouses, data migration.
Data integration / information-gathering agents:
– On the WWW, in enterprises, large science projects.
Model management:
– Model matching: a key operator in an algebra where models and mappings are first-class objects.
– See [Bernstein et al., 2000] for more.
The Semantic Web:
– Ontology mapping.
System interoperability:
– E-services, application integration, B2B applications, …
6
Desiderata for Proposed Solutions
Accuracy, efficiency, ease of use.
Realistic expectations:
– Unlikely to be fully automated. Need the user in the loop.
Some notion of semantics for mappings.
Extensibility:
– The solution should exploit additional background knowledge.
"Memory", knowledge reuse:
– The system should exploit previous manual or automatically generated matchings.
– Key idea behind LSD.
7
LSD Overview
LSD = L(earning) S(ource) D(escriptions)
Problem: generating semantic mappings between a mediated schema and a large set of data-source schemas.
Key idea: generate the first mappings manually, and learn from them to generate the rest.
Technique: multi-strategy learning (extensible!)
Step 1:
– [SIGMOD 2001]: 1-1 mappings between XML DTDs.
Current focus:
– Complex mappings
– Ontology mapping
8
Outline
Overview of structure mapping
Data integration and source mappings
LSD architecture and details
Experimental results
Current work.
9
Data Integration

Example query: Find houses with four bathrooms priced under $500,000.

[Diagram: the query is posed against the mediated schema, then reformulated and optimized into queries over source schemas 1-3 (realestate.com, homeseekers.com, homes.com), which are accessed through wrappers.]

Applications: WWW, enterprises, science projects.
Techniques: virtual data integration, warehousing, custom code.
10
Semantic Mappings between Schemas
Source schemas = XML DTDs

[Diagram (repeated from earlier): two DTD trees.
 Source: house(location, contact(name, phone))
 Mediated: house(address, num-baths(full-baths, half-baths), contact-info(agent-name, agent-phone))
 1-1 mappings: location <-> address, phone <-> agent-phone
 Non 1-1 mapping: num-baths <-> full-baths + half-baths]
11
Semantics (preliminary)
The semantics of mappings has received no attention.
Semantics of 1-1 mappings. Given:
– R(A1,…,An) and S(B1,…,Bm)
– 1-1 mappings (Ai, Bj)
Then we postulate the existence of a relation W such that:
– π_{C1,…,Ck}(W) = π_{A1,…,Ak}(R),
– π_{C1,…,Ck}(W) = π_{B1,…,Bk}(S),
– W also includes the unmatched attributes of R and S.
In English: R and S are projections of some universal relation W, and the mappings specify the projection variables and correspondences.
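The projection semantics above can be illustrated with a few lines of Python. The relation W and its attribute names below are hypothetical, chosen only to mirror the slide's real-estate running example; this is a sketch of the definition, not part of LSD.

```python
# Universal relation W: both schemas are projections of it.
W = [
    {"addr": "12 Elm St", "price": 250000, "phone": "(305) 729 0831"},
    {"addr": "9 Oak Ave", "price": 110000, "phone": "(617) 253 1429"},
]

def project(rows, attrs):
    """Relational projection: keep only the given attributes (set semantics)."""
    return {tuple(r[a] for a in attrs) for r in rows}

# Under the 1-1 mappings location<->address and listed-price<->price, the
# matched columns of R and S are the same projection of W.
R = project(W, ["addr", "price"])   # source relation's matched columns
S = project(W, ["addr", "price"])   # mediated relation's matched columns
assert R == S
```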
12
Why Matching Is Difficult
Matching aims to identify the same real-world entity:
– using names, structures, types, data values, etc.
Schemas represent the same entity differently:
– different names => same entity: area & address => location
– same names => different entities: area => location or square-feet
Schema & data never fully capture semantics!
– not adequately documented, not sufficiently expressive
Intended semantics is typically subjective!
– IBM Almaden Lab = IBM?
Matching cannot be fully automated, and is often hard even for humans. Committees are required!
13
Current State of Affairs
Finding semantic mappings is now the bottleneck!
– largely done by hand
– labor-intensive & error-prone
– GTE: 4 hours/element for 27,000 elements [Li & Clifton, 2000]
The problem will only be exacerbated as:
– data sharing & XML become pervasive
– DTDs proliferate
– legacy data is translated
– ontologies are reconciled on the semantic web
Need semi-automatic approaches to scale up!
14
Outline
Overview of structure mapping
Data integration and source mappings
LSD architecture and details
Experimental results
Current work.
15
The LSD Approach
The user manually maps a few data sources to the mediated schema.
LSD learns from the mappings and proposes mappings for the rest of the sources.
Several types of knowledge are used in learning:
– Schema elements, e.g., attribute names
– Data elements: ranges, formats, word frequencies, value frequencies, length of texts
– Proximity of attributes
– Functional dependencies, number of attribute occurrences
One learner does not fit all. Use multiple learners and combine them with a meta-learner.
16
Example

Schema of realestate.com: location, listed-price, phone, comments
Sample data:
– location: Miami, FL; Boston, MA; …
– listed-price: $250,000; $110,000; …
– phone: (305) 729 0831; (617) 253 1429; …
– comments: Fantastic house; Great location; …

Sample data from homes.com:
– price: $550,000; $320,000; …
– contact-phone: (278) 345 7215; (617) 335 2315; …
– extra-info: Beautiful yard; Great beach; …

Mediated schema: address, price, agent-phone, description

Learned hypotheses:
– If "fantastic" & "great" occur frequently in data values => description
– If "phone" occurs in the name => agent-phone
17
Multi-Strategy Learning
Use a set of base learners:
– Name learner, Naive Bayes, Whirl, XML learner
And a set of recognizers:
– county name, zip code, phone numbers
Each base learner produces a prediction weighted by a confidence score.
Combine the base learners with a meta-learner, using stacking.
18
Base Learners
Name Learner:
– training examples: (contact, agent-phone), (contact-info, office-address), (phone, agent-phone), (listed-price, price)
– prediction: contact-phone => (agent-phone, 0.7), (office-address, 0.3)
Naive Bayes Learner [Domingos & Pazzani, 1997]:
– "Kent, WA" => (address, 0.8), (name, 0.2)
Whirl Learner [Cohen & Hirsh, 1998]
XML Learner:
– exploits the hierarchical structure of XML data
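To make the name-based prediction concrete, here is a minimal sketch in the spirit of the Name Learner: score mediated-schema elements by token overlap with the source element's name. The real learner is a trained classifier over name examples like those above; this toy matcher is only an illustration.

```python
# Hypothetical name matcher: Jaccard similarity over name tokens, then
# normalization into confidence scores. Not LSD's actual Name Learner.
def name_similarity(source_name, target_name):
    s = set(source_name.lower().replace("-", " ").split())
    t = set(target_name.lower().replace("-", " ").split())
    return len(s & t) / len(s | t)

def predict(source_name, mediated_elements):
    scores = {m: name_similarity(source_name, m) for m in mediated_elements}
    total = sum(scores.values()) or 1.0
    # keep only elements with nonzero overlap, normalized to sum to 1
    return {m: round(v / total, 2) for m, v in scores.items() if v > 0}

print(predict("contact-phone", ["agent-phone", "address", "price"]))
```

On "contact-phone", only "agent-phone" shares the token "phone", so it receives all the normalized confidence.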
19
Training the Base Learners

Mediated schema: address, price, agent-phone, description
Schema of realestate.com: location, listed-price, phone, comments

Training data from realestate.com:
<location> Miami, FL </> <listed-price> $250,000 </> <phone> (305) 729 0831 </> <comments> Fantastic house </>
<location> Boston, MA </> <listed-price> $110,000 </> <phone> (617) 253 1429 </> <comments> Great location </>

Name Learner training examples: (location, address), (listed-price, price), (phone, agent-phone), …
Naive Bayes training examples: ("Miami, FL", address), ("$250,000", price), ("(305) 729 0831", agent-phone), …
20
Entity Recognizers
Use pre-programmed knowledge to identify specific types of entities:
– date, time, city, zip code, name, etc.
– house-area (30 X 70, 500 sq. ft.)
– county-name recognizer
Recognizers often have nice characteristics:
– easy to construct
– many off-the-shelf research & commercial products
– applicable across many domains
– help with special cases that are hard to learn
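Recognizers of this kind are often just pattern matchers. The sketch below shows two of them (zip code, US phone number) with illustrative regular expressions; the patterns are assumptions, not LSD's actual implementations.

```python
import re

# Two toy entity recognizers; the patterns are illustrative only.
RECOGNIZERS = {
    "zip-code": re.compile(r"^\d{5}(-\d{4})?$"),
    "phone": re.compile(r"^\(\d{3}\)\s*\d{3}[ -]?\d{4}$"),
}

def recognize(value):
    """Return the labels of all recognizers that fire on the value."""
    return [label for label, pat in RECOGNIZERS.items() if pat.match(value.strip())]

print(recognize("(617) 253 1429"))  # the phone pattern fires
print(recognize("98195"))           # the zip-code pattern fires
```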
21
Meta-Learner: Stacking
Training the meta-learner produces a weight for every pair of (base learner, mediated-schema element), e.g.:
– weight(Name-Learner, address) = 0.1
– weight(Naive-Bayes, address) = 0.9
To combine predictions, the meta-learner computes the weighted sum of the base learners' confidence scores, e.g. for <area>Seattle, WA</>:
– Name Learner => (address, 0.6); Naive Bayes => (address, 0.8)
– Meta-Learner => (address, 0.6*0.1 + 0.8*0.9 = 0.78)
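The combination step above is a one-liner; the sketch below reproduces the slide's arithmetic with the same confidences and weights.

```python
# Stacking combination: the meta-learner's score for a mediated-schema
# element is the weighted sum of the base learners' confidence scores.
def combine(predictions, weights):
    """predictions: {learner: confidence}; weights: {learner: weight}."""
    return sum(weights[l] * c for l, c in predictions.items())

score = combine(
    {"name-learner": 0.6, "naive-bayes": 0.8},  # confidences for "address"
    {"name-learner": 0.1, "naive-bayes": 0.9},  # learned stacking weights
)
print(round(score, 2))  # 0.6*0.1 + 0.8*0.9 = 0.78, as on the slide
```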
22
Training the Meta-Learner

For each mediated-schema element (here, address), run the trained base learners on extracted XML instances and fit the weights by least-squares linear regression against the true 0/1 labels:

Extracted XML instance          Name Learner   Naive Bayes   True
<location> Miami, FL </>            0.5            0.8         1
<listed-price> $250,000 </>         0.4            0.3         0
<area> Seattle, WA </>              0.3            0.9         1
<house-addr> Kent, WA </>           0.6            0.8         1
<num-baths> 3 </>                   0.3            0.3         0
…                                   …              …           …

=> Weight(Name-Learner, address) = 0.1, Weight(Naive-Bayes, address) = 0.9
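The fitting step can be sketched with plain normal equations for the two-learner case. The data points are the slide's example rows; note the 0.1/0.9 weights on the slide are illustrative, so this toy fit yields different numbers. The solver below is an assumption about the method (least-squares linear regression, as the slide names it), not LSD's actual code.

```python
# Fit per-learner weights for one mediated-schema element ("address") by
# least-squares regression of the true 0/1 labels on the learners'
# confidence scores (no intercept), via the normal equations (X^T X) w = X^T y.
X = [(0.5, 0.8), (0.4, 0.3), (0.3, 0.9), (0.6, 0.8), (0.3, 0.3)]  # (name, bayes)
y = [1, 0, 1, 1, 0]                                               # true labels

a11 = sum(x1 * x1 for x1, _ in X)
a12 = sum(x1 * x2 for x1, x2 in X)
a22 = sum(x2 * x2 for _, x2 in X)
b1 = sum(x1 * t for (x1, _), t in zip(X, y))
b2 = sum(x2 * t for (_, x2), t in zip(X, y))
det = a11 * a22 - a12 * a12
w_name = (b1 * a22 - b2 * a12) / det
w_bayes = (a11 * b2 - a12 * b1) / det
print(round(w_name, 2), round(w_bayes, 2))
```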
23
Applying the Learners

Schema of homes.com: area, day-phone, extra-info. Mediated schema: address, price, agent-phone, description.

area (<area>Seattle, WA</>, <area>Kent, WA</>, <area>Austin, TX</>):
– base-learner predictions per instance: (address, 0.8), (description, 0.2); (address, 0.6), (description, 0.4); (address, 0.7), (description, 0.3)
– Meta-Learner: (address, 0.7), (description, 0.3)
day-phone (<day-phone>(278) 345 7215</>, <day-phone>(617) 335 2315</>, <day-phone>(512) 427 1115</>):
– Meta-Learner: (agent-phone, 0.9), (description, 0.1)
extra-info (<extra-info>Beautiful yard</>, <extra-info>Great beach</>, <extra-info>Close to Seattle</>):
– Meta-Learner: (description, 0.8), (address, 0.2)
24
The Constraint Handler
Extends learning to incorporate constraints:
– Hard constraints:
– a = address & b = address => a = b
– a = house-id => a is a key
– a = agent-info & b = agent-name => b is nested in a
– Soft constraints:
– a = agent-phone & b = agent-name => a & b are usually close to each other
– User feedback = hard or soft constraints
Details in [Doan et al., SIGMOD 2001]
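The first hard constraint above (at most one source element maps to address) can be checked mechanically over a candidate assignment. The sketch below is an illustration of that check, not the actual LSD constraint handler, and its assignments are hypothetical.

```python
# Reject candidate assignments that violate the hard constraint
# "a = address & b = address => a = b", i.e. at most one source
# element maps to the mediated element "address".
def satisfies_uniqueness(assignment, element="address"):
    mapped = [src for src, tgt in assignment.items() if tgt == element]
    return len(mapped) <= 1

ok = {"area": "address", "day-phone": "agent-phone"}
bad = {"area": "address", "extra-info": "address"}
print(satisfies_uniqueness(ok), satisfies_uniqueness(bad))  # True False
```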
25
The Current LSD System

[Architecture diagram: in the Training Phase, the mediated schema, source schemas, and data listings feed Base-Learner1 … Base-Learnerk and the Meta-Learner; in the Matching Phase, their predictions pass through the Constraint Handler, which uses domain constraints and user feedback to produce the mappings.]
26
Outline
Overview of structure mapping
Data integration and source mappings
LSD architecture and details
Experimental results
Current work.
27
Empirical Evaluation
Four domains:
– Real Estate I & II, Course Offerings, Faculty Listings
For each domain:
– create mediated DTD & domain constraints
– choose five sources
– extract & convert data listings into XML (faithful to the schema!)
– mediated DTDs: 14-66 elements; source DTDs: 13-48
Ten runs for each experiment; in each run:
– manually provide 1-1 mappings for three sources
– ask LSD to propose mappings for the remaining two sources
– accuracy = % of 1-1 mappings correctly identified
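The evaluation metric is simple enough to state as code. The gold and predicted assignments below are hypothetical, chosen to echo the deck's running example.

```python
# accuracy = % of 1-1 mappings correctly identified (the slide's metric).
def matching_accuracy(predicted, gold):
    correct = sum(1 for elem, target in gold.items() if predicted.get(elem) == target)
    return 100.0 * correct / len(gold)

gold = {"area": "address", "day-phone": "agent-phone", "extra-info": "description"}
pred = {"area": "address", "day-phone": "agent-phone", "extra-info": "address"}
print(matching_accuracy(pred, gold))  # 2 of 3 mappings correct
```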
28
Matching Accuracy

[Bar chart: average matching accuracy (%), 0-100, for Real Estate I, Real Estate II, Course Offerings, and Faculty Listings.]

LSD's accuracy: 71-92%
Best single base learner: 42-72%
+ Meta-learner: +5-22%
+ Constraint handler: +7-13%
+ XML learner: +0.8-6%
29
Sensitivity to Amount of Available Data

[Line chart: average matching accuracy (%) vs. number of data listings per source (0-500), for Real Estate I.]
30
Contribution of Schema vs. Data

[Bar chart: average matching accuracy (%), 0-100, for Real Estate I, Real Estate II, Course Offerings, and Faculty Listings, comparing LSD with only schema info, LSD with only data info, and the complete LSD.]

More experiments in the paper [Doan et al., 2001]
31
Reasons for Incorrect Matching
Unfamiliarity:
– suburb
– solution: add a suburb-name recognizer
Insufficient information:
– correctly identified the general type, but failed to pinpoint the exact type
– <agent-name>Richard Smith</> <phone> (206) 234 5412 </>
– solution: add a proximity learner
Subjectivity:
– house-style = description?
32
Outline
Overview of structure mapping
Data integration and source mappings
LSD architecture and details
Experimental results
Current work.
33
Moving Up the Expressiveness Ladder
Schemas are very simple ontologies.
More expressive power = more domain constraints.
Mappings become more complex, but constraints provide more to learn from.
Non 1-1 mappings:
– F1(A1,…,Am) = F2(B1,…,Bn)
Ontologies (of various flavors):
– class hierarchy (i.e., containment on unary relations)
– relationships between objects
– constraints on relationships
34
Current Work: Finding Non 1-1 Mappings

Given two schemas, find:
– 1-many mappings: address = concat(city, state)
– many-1 mappings: half-baths + full-baths = num-baths
– many-many mappings: concat(addr-line1, addr-line2) = concat(street, city, state)
A 1-many mapping is expressed as a query, with:
– a value correspondence expression: room-rate = rate * (1 + tax-rate)
– a relationship: the state of tax-rate = the state of the hotel that has rate
– special case: 1-many mappings between two relational tables

Example: mediated schema (address, description, num-baths) vs. source schema (city, state, comments, half-baths, full-baths)
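The example mapping between these two schemas can be written out as an executable translation; the source tuple below is hypothetical.

```python
# The slide's non 1-1 mappings, applied to one source tuple:
# 1-many:  address = concat(city, state)
# many-1:  num-baths = half-baths + full-baths
# 1-1:     description = comments
def apply_mapping(row):
    """Translate a source tuple into the mediated schema."""
    return {
        "address": f"{row['city']}, {row['state']}",
        "num-baths": row["half-baths"] + row["full-baths"],
        "description": row["comments"],
    }

src = {"city": "Seattle", "state": "WA", "comments": "Great location",
       "half-baths": 1, "full-baths": 2}
print(apply_mapping(src))
```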
35
Brute-Force Solution
Define a set of operators:
– concat, +, -, *, /, etc.
For each set of mediated-schema columns:
– enumerate all possible mappings from source-schema columns
– evaluate each candidate m1, m2, …, mk (compute similarity using all base learners)
– return the best mapping
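A minimal sketch of this enumeration for the concat operator is below. The `score` function is a toy stand-in for "compute similarity using all base learners", and the column data is hypothetical.

```python
from itertools import combinations

# Toy similarity: fraction of candidate values appearing among target values.
def score(candidate_values, target_values):
    return sum(v in target_values for v in candidate_values) / len(candidate_values)

# Brute force over the concat operator: try every combination of source
# columns, materialize the concatenated column, and keep the best scorer.
def best_concat_mapping(source, target_values):
    best = None
    cols = list(source)
    for r in range(1, len(cols) + 1):
        for combo in combinations(cols, r):
            values = [", ".join(row) for row in zip(*(source[c] for c in combo))]
            s = score(values, set(target_values))
            if best is None or s > best[1]:
                best = (combo, s)
    return best

source = {"city": ["Seattle", "Miami"], "state": ["WA", "FL"],
          "zip": ["98195", "33101"]}
target = ["Seattle, WA", "Miami, FL"]  # mediated "address" values
print(best_concat_mapping(source, target))  # (('city', 'state'), 1.0)
```

The exponential enumeration over column subsets is exactly why the next slide replaces brute force with search.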
36
Search-Based Solution
States = columns:
– goal state: a mediated-schema column
– initial states: all source-schema columns
– use 1-1 matching to reduce the set of initial states
Operators: concat, +, -, *, /, etc.
Column similarity:
– use all base learners + recognizers
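The search can be sketched as beam search over columns, with concat as the only operator. The `similarity` function below is a toy stand-in for "all base learners + recognizers", and the data is hypothetical; only the beam-search strategy follows the deck.

```python
import heapq

def similarity(values, target):
    return sum(v in target for v in values) / len(values)

# Beam search over columns: a state is a concatenated column; applying the
# concat operator to a state and another source column yields a successor.
def beam_search(source, target_values, beam_width=2, max_steps=2):
    target = set(target_values)
    beam = [((c,), source[c]) for c in source]          # initial states
    best = max(beam, key=lambda s: similarity(s[1], target))
    for _ in range(max_steps):
        successors = []
        for cols, values in beam:
            for c in source:
                if c in cols:
                    continue
                merged = [f"{a}, {b}" for a, b in zip(values, source[c])]
                successors.append((cols + (c,), merged))
        if not successors:
            break
        beam = heapq.nlargest(beam_width, successors,
                              key=lambda s: similarity(s[1], target))
        cand = max(beam, key=lambda s: similarity(s[1], target))
        if similarity(cand[1], target) > similarity(best[1], target):
            best = cand
    return best[0]

source = {"city": ["Seattle", "Miami"], "state": ["WA", "FL"]}
print(beam_search(source, ["Seattle, WA", "Miami, FL"]))  # ('city', 'state')
```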
37
Multi-Strategy Search
Use a set of expert modules: L1, L2, …, Ln
Each module:
– applies only to certain types of mediated-schema columns
– searches a small subspace
– uses a cheap similarity measure to compare columns
Examples:
– L1: text; concat; TF/IDF
– L2: numeric; +, -, *, /; [Ho et al., 2000]
– L3: address; concat; Naive Bayes
Search techniques:
– beam search as default
– specialized searches that do not have to materialize columns
38
Multi-Strategy Search (cont'd)

Apply all applicable expert modules:
– L1: m11, m12, m13, …, m1x
– L2: m21, m22, m23, …, m2y
– L3: m31, m32, m33, …, m3z
Combine the modules' predictions (m11, m12, m21, m22, m31, m32) by computing similarity using all base learners, and select the best one (m11).
39
Related Work

– Single learner + 1-1 matching: SEMINT [Li & Clifton, 1994], ILA [Perkowitz & Etzioni, 1995], DELTA [Clifton et al., 1997]
– Hybrid + 1-1 matching: TRANSCM [Milo & Zohar, 1998], ARTEMIS [Castano & Antonellis, 1999], [Palopoli et al., 1998], CUPID [Madhavan et al., 2001]
– Recognizers + schema + 1-1 matching; schema + data, 1-1 + non 1-1 matching; sophisticated data-driven user interaction: CLIO [Miller et al., 2000], [Yan et al., 2001]
– Multi-strategy learning; learners + recognizers; schema + data; 1-1 + non 1-1 matching: LSD [Doan et al., 2000, 2001]
40
Summary
LSD:
– uses multi-strategy learning to semi-automatically generate semantic mappings
– is extensible, and incorporates domain knowledge, user knowledge, and previous techniques
– experimental results show the approach is very promising
Future work and issues to ponder:
– accommodating more expressive languages: ontologies
– reuse of learned concepts from related domains
– semantics?
Data management is a fertile area for Machine Learning research!
41
Backup Slides
42
Mapping Maintenance

[Diagram: mappings m1, m2, m3 between source schema S and mediated schema M, and between the evolved source schema S' and mediated schema M'.]

Ten months later …
– are the mappings still correct?
43
Information Extraction from Text
Extract data fragments from text documents:
– date, location, & victim's name from a news article
Intensive research on free-text documents.
Many documents do have substantial structure:
– XML pages, name cards, tables, lists
Each such document = a data source:
– the structure forms a schema
– only one data value per schema element
– a "real" data source has many data values per schema element
Ongoing research in the IE community.
44
Contribution of Each Component

[Bar chart: average matching accuracy (%), 0-100, for Real Estate I, Course Offerings, Faculty Listings, and Real Estate II, comparing the complete LSD system against versions without the Name Learner, without Naive Bayes, without the Whirl Learner, and without the Constraint Handler.]
45
Exploiting Hierarchical Structure
Existing learners flatten out all structures.
We developed an XML learner:
– similar to the Naive Bayes learner: input instance = bag of tokens
– differs in one crucial aspect: it considers not only text tokens, but also structure tokens

<description> Victorian house with a view. Name your price! To see it, contact Gail Murphy at MAX Realtors. </description>

<contact> <name> Gail Murphy </name> <firm> MAX Realtors </firm> </contact>
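The text-plus-structure tokenization can be sketched as follows. This is an illustration of the idea (tag names become tokens alongside words), not LSD's actual tokenizer, and the regular expressions are assumptions.

```python
import re

# Tokenize an XML fragment into a bag of structure tokens (tag names) plus
# text tokens, so "Gail Murphy" inside <name> is distinguishable from the
# same words appearing in free text such as a <description>.
def xml_tokens(fragment):
    tags = re.findall(r"</?([a-zA-Z-]+)", fragment)
    text = re.sub(r"<[^>]+>", " ", fragment)
    words = re.findall(r"[A-Za-z]+", text)
    return [f"<{t}>" for t in tags] + [w.lower() for w in words]

print(xml_tokens("<contact><name> Gail Murphy </name></contact>"))
```

Feeding these mixed bags to a Naive Bayes-style learner lets it weight evidence from structure as well as content.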
46
Domain Constraints
Impose semantic regularities on sources:
– verified using schema or data
Examples:
– a = address & b = address => a = b
– a = house-id => a is a key
– a = agent-info & b = agent-name => b is nested in a
Can be specified up front:
– when creating the mediated schema
– independent of any actual source schema
47
The Constraint Handler
Can specify arbitrary constraints.
User feedback = a domain constraint:
– ad-id = house-id
Extended to handle domain heuristics:
– a = agent-phone & b = agent-name => a & b are usually close to each other

Example. Predictions from the meta-learner:
– area: (address, 0.7), (description, 0.3)
– contact-phone: (agent-phone, 0.9), (description, 0.1)
– extra-info: (address, 0.6), (description, 0.4)
Domain constraint: a = address & b = address => a = b
Candidate assignments and their scores:
– area: address, contact-phone: agent-phone, extra-info: address => 0.7 * 0.9 * 0.6 = 0.378, but violates the constraint (two elements map to address)
– area: address, contact-phone: agent-phone, extra-info: description => 0.7 * 0.9 * 0.4 = 0.252 (consistent; selected)
– area: description, contact-phone: description, extra-info: description => 0.3 * 0.1 * 0.4 = 0.012