Generic Schema Matching using Cupid
Jayant Madhavan, University of Washington
Philip A. Bernstein, Microsoft Research
Erhard Rahm, University of Leipzig
September 11th 2001
VLDB 2001 Roma Italy 2
Schema Matching
[Figure: two purchase-order schema trees. PO contains POLines (Item: Line, Qty, UoM) and POShipTo (Street, City). PurchaseOrder contains Items (Item: ItemNumber, Quantity, UnitofMeasure) and DeliverTo (Name, Address: Street, City). Highlighted correspondences include POShipTo/DeliverTo, Line/ItemNumber, Qty/Quantity, and UoM/UnitofMeasure.]
The Problem
• Given two schemas, obtain a mapping between them that identifies corresponding elements
• A hard problem
– Naming and structural differences in schemas
– Similar, but non-identical concepts modeled
– Multiple data models – SQL DDL, XML, ODMG…
– Minimize user involvement (semi-automatic)
– Data model independent matching (generic)
Motivation
• Important component in many applications
– Data Integration
– Data Migration
– E-Commerce
• Model Management [Bernstein, Halevy, Pottinger ’00]
– Algebra for manipulating models and mappings
– Match, Merge, Compose …
Schema Matching Approaches
[Figure: taxonomy of schema matching approaches, from the taxonomy-based survey [Rahm, Bernstein ’00]]
• Individual matchers
– Schema-based, per-element: linguistic (names, descriptions); constraint-based (types, keys)
– Schema-based, structural: constraint-based (graph matching)
– Content-based, per-element: linguistic (IR: word frequencies, key terms); constraint-based (value patterns and ranges)
• Combined matchers
– Hybrid
– Composite: manual composition, automatic composition
Related Work
• Hybrid approaches for schema integration
– DIKE [Palopoli, Sacca, Ursino, Terracina]
– MOMIS [Bergamaschi, Castano, Vincini]
• Linguistic and instance based
– SEMINT, DELTA [Clifton, Hausman, Rosenthal, Li]
• Instance-based multi-strategy learning
– LSD [Doan, Domingos, Halevy]
• Others
– Hybrid rule based: TranScm [Milo, Zohar]
– Query discovery: CLIO [Haas, Hernandez, Miller]
Contributions
• Taxonomy of schema matching approaches
• Cupid system that exploits linguistic, data-type, structure and referential integrity information
– New algorithm that exploits schema structure
• Experimental validation and comparison with other systems
Cupid architecture
[Figure: Schema 1 and Schema 2 are input to Linguistic Matching (with Thesaurus), which produces LSIM; Structure Matching produces SSIM and WSIM; Generate Mapping produces the Output Mapping.]
Linguistic Matching
• Heuristic name matching
– Tokenization of names: POOrderNum → PO, Order, Num
– Expansion of short forms and acronyms: PO → Purchase, Order; Num → Number
– Clustering of schema elements based on keywords and data-types: Street, City, POAddress → Address
– Thesaurus of synonyms, hypernyms, acronyms
– Linguistic similarity coefficient (lsim) ∈ [0,1]
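The tokenization, expansion, and thesaurus steps can be sketched roughly as follows. This is a minimal illustration, not Cupid's actual implementation: the `EXPANSIONS` and `THESAURUS` tables, the regular expression, and the averaging formula in `lsim` are all assumptions made for the example.

```python
import re

# Hypothetical short-form table; the slide only gives PO and Num as examples.
EXPANSIONS = {"po": ["purchase", "order"], "num": ["number"], "qty": ["quantity"]}

# Hypothetical thesaurus: unordered token pairs with a similarity in [0, 1].
THESAURUS = {frozenset(["invoice", "bill"]): 0.9}


def tokenize(name):
    """Split a name into runs of capitals, capitalized words, and digits."""
    return re.findall(r"[A-Z]+(?![a-z])|[A-Z]?[a-z]+|\d+", name)


def normalize(name):
    """Tokenize a name and expand known short forms into lowercase tokens."""
    tokens = []
    for tok in tokenize(name):
        tokens.extend(EXPANSIONS.get(tok.lower(), [tok.lower()]))
    return tokens


def token_sim(a, b):
    """Similarity of two tokens: 1.0 if identical, else a thesaurus lookup."""
    if a == b:
        return 1.0
    return THESAURUS.get(frozenset([a, b]), 0.0)


def lsim(name1, name2):
    """Average best-match token similarity in both directions (an assumed formula)."""
    t1, t2 = normalize(name1), normalize(name2)
    if not t1 or not t2:
        return 0.0
    best1 = sum(max(token_sim(a, b) for b in t2) for a in t1)
    best2 = sum(max(token_sim(a, b) for a in t1) for b in t2)
    return (best1 + best2) / (len(t1) + len(t2))


print(normalize("POOrderNum"))
print(lsim("InvoiceTo", "BillTo"))
```

The result always lies in [0, 1], matching the range stated for lsim on the slide; clustering by keywords and data types is left out of this sketch.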
Structure Matching
[Figure: the PO schema tree (POLines, Item, Line, Qty, UoM, POShipTo, Street, City) next to the PurchaseOrder schema tree (Items, Item, ItemNumber, Quantity, UnitofMeasure, DeliverTo, Name, Address, Street, City). Highlighted pairs include Qty/Quantity, UoM/UnitofMeasure, Line/ItemNumber, Item/Item, and POShipTo/DeliverTo.]
Structure Match: Mutually Reinforcing Similarity
[Figure: PO (POLines, Item, Line, Qty, UoM) and PurchaseOrder (Items, Item, ItemNum, Quantity, UnitofMeasure). Where Wsim > th_high, Ssim of the corresponding leaves is incremented (Ssim++). The animation steps show the pairs Qty/Quantity, UoM/UnitofMeasure, Line/ItemNum, Item/Item, POLines/Items, and PO/PurchaseOrder.]
Structure Match: Context-Dependent Disambiguation
[Figure: PO (POShipTo and POBillTo, each with Street and City) and PurchaseOrder (DeliverTo and InvoiceTo, each with Address containing Street and City). Ssim of City leaf pairs is incremented (Ssim++) or decremented (Ssim--) depending on context; the pairs shown include POShipTo/DeliverTo and POBillTo/InvoiceTo.]
Intuition
• Atomic elements are similar if
– Linguistically and data-type similar
– Their ancestors are similar
• Compound elements (non-leaf) are similar if
– Linguistically similar
– Subtrees rooted at the elements are similar
• Mutually recursive
– Leaves determine internal node similarity
– Similarity of internal nodes leads to increase in leaf similarity
Structure Match details
• Subtrees are similar if
– Immediate children are similar
– Leaf sets are similar
• Subtree Similarity (nodes s and t)
– Fraction of leaves in subtree s that can be mapped to a leaf in the other subtree t and vice-versa
– Less sensitive to variation in intermediate structure
• Pruning the number of comparisons
– Elements must have comparable number of leaves
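The leaf-based subtree similarity and the pruning test might look like the sketch below. The acceptance threshold `th_accept`, the `max_ratio` bound, and the function names are assumptions chosen for illustration; the slide only states that a fraction of mappable leaves is used and that leaf counts must be comparable.

```python
def leaf_fraction_ssim(leaves_s, leaves_t, leaf_sim, th_accept=0.5):
    """Fraction of leaves in either subtree that map to some leaf in the other.

    leaf_sim(x, y) returns a similarity in [0, 1]; a leaf counts as mappable
    when at least one partner reaches th_accept (an assumed threshold).
    """
    if not leaves_s or not leaves_t:
        return 0.0
    linked_s = sum(
        1 for x in leaves_s if any(leaf_sim(x, y) >= th_accept for y in leaves_t)
    )
    linked_t = sum(
        1 for y in leaves_t if any(leaf_sim(x, y) >= th_accept for x in leaves_s)
    )
    return (linked_s + linked_t) / (len(leaves_s) + len(leaves_t))


def comparable(leaves_s, leaves_t, max_ratio=2.0):
    """Pruning test: compare only subtrees whose leaf counts are within max_ratio."""
    small, large = sorted((len(leaves_s), len(leaves_t)))
    return small > 0 and large <= max_ratio * small


# Hypothetical leaf similarities for the Item subtrees of the running example.
SIMS = {("Qty", "Quantity"): 0.9, ("UoM", "UnitofMeasure"): 0.8, ("Line", "ItemNumber"): 0.2}


def example_leaf_sim(x, y):
    return SIMS.get((x, y), 0.0)


score = leaf_fraction_ssim(
    ["Qty", "UoM", "Line"], ["Quantity", "UnitofMeasure", "ItemNumber"], example_leaf_sim
)
print(score)
```

Because only leaves are counted, inserting or removing an intermediate node (such as Address between DeliverTo and City) does not change the score, which is the insensitivity to intermediate structure the slide refers to.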
Referential Integrity
• Join nodes are added to the schema tree for each referential integrity constraint
• Views can be similarly used
[Figure: Schema A has Purchase Order (Order ID, Product Name, Customer ID) and Customer (Customer ID, Name, Address), connected by the join node Order-Customer-fk. Schema B has Customer-Purchase-Order.]
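One way to picture the join node is as an extra element whose children are the columns of both tables taking part in the constraint, so that Schema A gains an element comparable to Customer-Purchase-Order in Schema B. The sketch below assumes a schema is a plain dictionary from table name to column list, and the choice of children for the join node is an assumption; the slide does not spell it out.

```python
def add_join_node(tables, fk_name, referencing, referenced):
    """Return a copy of `tables` with an extra join node named `fk_name`.

    The join node holds the columns of the referencing table followed by the
    columns of the referenced table, without repeating shared column names.
    """
    augmented = {name: list(cols) for name, cols in tables.items()}
    joined = list(tables[referencing])
    joined += [col for col in tables[referenced] if col not in joined]
    augmented[fk_name] = joined
    return augmented


schema_a = {
    "Purchase Order": ["Order ID", "Product Name", "Customer ID"],
    "Customer": ["Customer ID", "Name", "Address"],
}

augmented_a = add_join_node(schema_a, "Order-Customer-fk", "Purchase Order", "Customer")
print(augmented_a["Order-Customer-fk"])
```

The original tables stay in place, so the matcher can still pair them individually while also considering the joined element.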
Cupid architecture
[Figure: Schema 1 and Schema 2 → Linguistic Matching (with Thesaurus) → Lsim → Structure Matching → Ssim, Wsim → Generate Mapping → Output Mapping]
Linguistic Similarity (Lsim)
Element 1        Element 2      Lsim
InvoiceTo        BillTo         0.7
UoM              UnitMeasure    0.9
City             City           1.0
Structural (Ssim), Weighted (Wsim) similarity
Element 1        Element 2      Ssim   Wsim
InvoiceTo        BillTo         0.8    0.7
UoM              UnitMeasure    0.7    0.8
InvoiceTo/City   BillTo/City    0.8    0.9
Mapping Generation
• Individual mapping elements computed from Wsim values
– Consider only mapping pairs that have Wsim greater than threshold
– For each element of target find most similar source element
– Mappings that are not accepted but have high similarity are returned to help the user modify the map
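A small sketch of this selection step is shown below, assuming Wsim is available as a dictionary keyed by (source, target) pairs. The threshold value, the function name, and the element names in the sample data are illustrative assumptions.

```python
def generate_mapping(wsim, th_accept=0.5):
    """Pick the best source element for each target element.

    wsim maps (source, target) pairs to a weighted similarity. Returns the
    accepted mapping {target: source} and, per target, the other candidates
    above the threshold so that a user can review or override the choice.
    """
    by_target = {}
    for (source, target), score in wsim.items():
        if score > th_accept:
            by_target.setdefault(target, []).append((score, source))

    mapping, runners_up = {}, {}
    for target, candidates in by_target.items():
        candidates.sort(reverse=True)  # highest similarity first
        mapping[target] = candidates[0][1]
        runners_up[target] = [source for _, source in candidates[1:]]
    return mapping, runners_up


# Hypothetical Wsim values, loosely following the architecture slide.
example_wsim = {
    ("InvoiceTo/City", "BillTo/City"): 0.9,
    ("DeliverTo/City", "BillTo/City"): 0.6,
    ("UoM", "UnitMeasure"): 0.8,
    ("Qty", "UnitMeasure"): 0.3,
}

mapping, runners_up = generate_mapping(example_wsim)
print(mapping)
print(runners_up)
```

Returning the runners-up alongside the mapping mirrors the last bullet: candidates that lose to a better match are still surfaced for the user.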
Cupid Architecture
[Figure: Schema 1 and Schema 2 → Linguistic Matching (with Thesaurus) → Lsim → Structure Matching → Ssim, Wsim → Generate Mapping → Output Mapping, with an additional "Input hint" label.]
Experimental Validation
• DIKE
– Graph matching of ER models
– No Lsim component (LSPD entries)
• MOMIS
– Class-level matching of OO descriptions
– Word senses manually chosen from WordNet
[Figure: MOMIS, DIKE, and Cupid compared on canonical examples and real-world examples]
Evaluation Insights: Linguistic Similarity
• Cupid is less sensitive to name variations due to token-level manipulations
• MOMIS is able to infer linguistic relationships based on intra-schema properties using Description Logic techniques
• MOMIS has an interface to WordNet
– Word senses need to be chosen manually
– Choosing a single sense is not always possible
• Matching performance without thesaurus depends on similarity of terms used and on available structure (tokenization helps Cupid)
Evaluation Insights: Structural Similarity
• DIKE and Cupid exploit structural similarity beyond the immediate neighborhood of schema elements
• Leaf structure for sub-tree similarity relaxes requirements on intermediate structure match
• Class-level structural similarity in MOMIS can be restrictive while matching schemas with different nesting
• Context-dependent matching in Cupid resolves mapping ambiguity
• Linguistic similarity with complete path names (and no structural similarity) is insufficient
Summary
• Taxonomy of schema matching approaches
• Cupid system that performs linguistic and structural matching
• New algorithm for exploiting schema structure
• Comparative evaluation
Future Work
• Towards a more robust solution
– Auto-tuning parameters
– Thesaurus generation and evolution
– More scalability testing
• Schema matching component architecture
– Easily extensible by adding multiple techniques
– Data instances for matching
– Mapping, expression and query discovery
• Model Management
Model Management
• Other recent publications
– A Model Theory for Generic Schema Management, DBPL 2001
– Generic Model Management – A Database Infrastructure for Schema Manipulation, CoopIS 2001
– A Vision for Management of Complex Models, Sigmod Record, Dec 2000
– Data Warehouse Scenarios for Model Management, ER 2000
• More information
– http://data.cs.washington.edu/model/
– http://www.cs.washington.edu/homes/jayant
– MSR Technical Report
• Talk to us for a demo
Schema Matching
[Figure: source schema PO (Lines, Item, Line, Qty, Unit) mapped to target schema PurchaseOrder (Items, Item, ItemNumber, Quantity, Price, Count), with edges annotated ItemNumber=concat(“Itm”,Line) and Quantity=Pounds2Kgs(Qty)]
For each Lines create Items
  For each Item create Item
    ItemNumber = concat(“Itm”, Line)
    Price = “Unknown”
    Quantity = Pounds2Kgs(Qty)
  Count = Number of Item in Lines
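The mapping expressions on this slide can be read as a small data translation. The sketch below applies them to a PO held as nested dictionaries; the dictionary layout, the placement of Count under Items, and the function names are assumptions, while the three per-item expressions follow the slide.

```python
LB_TO_KG = 0.45359237  # kilograms per international pound


def pounds_to_kgs(qty_lb):
    """Stand-in for the slide's Pounds2Kgs conversion function."""
    return qty_lb * LB_TO_KG


def translate_po(po):
    """Build the target PurchaseOrder structure from a source PO.

    Expects po["Lines"] to be a list of items with "Line" and "Qty" keys
    (an assumed layout for the source schema).
    """
    items = [
        {
            "ItemNumber": "Itm" + str(item["Line"]),   # concat("Itm", Line)
            "Price": "Unknown",                        # no source element
            "Quantity": pounds_to_kgs(item["Qty"]),    # Pounds2Kgs(Qty)
        }
        for item in po["Lines"]
    ]
    # Count = number of Item elements in Lines
    return {"Items": {"Item": items, "Count": len(items)}}


source_po = {"Lines": [{"Line": 1, "Qty": 10.0}, {"Line": 2, "Qty": 2.0}]}
target_po = translate_po(source_po)
print(target_po)
```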
Tree Match
Tree Match (Schema tree S, Schema tree T)
  For each pair of leaves, initialize ssim to be their data-type compatibility
  For each s in S (post order)
    For each t in T (post order)
      Compute ssim(s,t) = structural-similarity(s,t)
      wsim(s,t) = g(lsim(s,t), ssim(s,t))
      If (wsim(s,t) > th_high)
        Inc-struct-similarity(leaves(s), leaves(t))
      If (wsim(s,t) < th_low)
        Dec-struct-similarity(leaves(s), leaves(t))
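The pseudocode above can be turned into a compact runnable sketch. Several details are assumptions rather than statements from the slides: g is taken to be a weighted average with weight `w_struct`, data-type compatibility is 1.0 for equal types and 0.5 otherwise, the increment and decrement are multiplicative factors `c_inc` and `c_dec`, and all threshold values are placeholders.

```python
class Node:
    """A schema tree element; leaves carry a data type."""

    def __init__(self, name, dtype=None, children=()):
        self.name, self.dtype, self.children = name, dtype, list(children)

    def postorder(self):
        for child in self.children:
            yield from child.postorder()
        yield self

    def leaves(self):
        return [n for n in self.postorder() if not n.children]


def tree_match(S, T, lsim, th_high=0.6, th_low=0.35, th_accept=0.5,
               w_struct=0.5, c_inc=1.2, c_dec=0.9):
    ssim, wsim = {}, {}

    def weighted(x, y):
        # Assumed form of g: weighted average of structural and linguistic similarity.
        return w_struct * ssim[x, y] + (1 - w_struct) * lsim(x.name, y.name)

    # Initialize leaf pairs with an assumed data-type compatibility score.
    for x in S.leaves():
        for y in T.leaves():
            ssim[x, y] = 1.0 if x.dtype == y.dtype else 0.5

    for s in S.postorder():
        for t in T.postorder():
            ls, lt = s.leaves(), t.leaves()
            if s.children or t.children:
                # Structural similarity: fraction of leaves with a strong link.
                linked = sum(any(weighted(x, y) > th_accept for y in lt) for x in ls)
                linked += sum(any(weighted(x, y) > th_accept for x in ls) for y in lt)
                ssim[s, t] = linked / (len(ls) + len(lt))
            wsim[s, t] = weighted(s, t)
            if wsim[s, t] > th_high:
                factor = c_inc      # reinforce the leaves of similar subtrees
            elif wsim[s, t] < th_low:
                factor = c_dec      # weaken the leaves of dissimilar subtrees
            else:
                continue
            for x in ls:
                for y in lt:
                    ssim[x, y] = min(1.0, ssim[x, y] * factor)
    return ssim, wsim


# Tiny example: two Item subtrees with hypothetical linguistic similarities.
qty, uom = Node("Qty", "int"), Node("UoM", "str")
quantity, unit = Node("Quantity", "int"), Node("UnitofMeasure", "str")
item_s = Node("Item", children=[qty, uom])
item_t = Node("Item", children=[quantity, unit])
NAME_SIMS = {("Qty", "Quantity"): 1.0, ("UoM", "UnitofMeasure"): 1.0}


def example_lsim(a, b):
    return 1.0 if a == b else NAME_SIMS.get((a, b), 0.0)


ssim, wsim = tree_match(item_s, item_t, example_lsim)
print(wsim[item_s, item_t], ssim[qty, quantity], ssim[qty, unit])
```

The post-order traversal guarantees that leaf similarities are available before their ancestors are compared, and the final update loop is what makes leaf and internal-node similarity mutually reinforcing. The leaf-count pruning mentioned earlier in the talk is omitted here for brevity.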
Tree Match (example)
[Figure: PO (POLines with Count and Item: Line, Qty, UoM; POShipTo and POBillTo, each with Street and City) and PurchaseOrder (Items with ItemCount and Item: ItemNumber, Quantity, UnitofMeasure; DeliverTo and InvoiceTo, each with Address: Street, City)]
Canonical Examples
Example                                                          MOMIS  DIKE  Cupid
Identical schemas                                                Y      Y     Y
Attributes with identical names, but different data-types       Y      Y     Y
Attributes with same data-types, but slightly different names   Y      N     Y
Different class names, but same attribute names                 N      Y     Y
Different nesting of schema elements                            N      Y     Y
Type substitution                                                N      Y     Y
Real world example
[Figure: two purchase-order schemas. CIDX Purchase Order: PO with POHeader (PODate, PONumber), Contact (ContactName, ContactEmail, ContactFunctionCode, ContactPhone), POBillTo and POShipTo (entityIdentifier, attn, Street1, Street2, Street3, Street4, City, StateProvince, PostalCode, Country), and POLines (startAt, count, Item, line, partno, qty, uom, unitPrice). Excel Purchase Order: PurchaseOrder with Header (orderNum, orderDate, ourAccountCode, yourAccountCode), DeliverTo and InvoiceTo (Address: street1, street2, street3, street4, city, stateProvince, postalCode, country; Contact: contactName, companyName, telephone), Items (itemCount, Item, itemNumber, partNumber, yourPartNumber, partDescription, Quantity, unitOfMeasure, unitPrice), and Footer (totalValue). Element grouping is reconstructed from the extracted labels.]