CSE 636 Data Integration Schema Matching Cupid Fall 2006


Page 1: CSE 636 Data Integration

CSE 636 Data Integration

Schema Matching

Cupid

Fall 2006

Page 2: CSE 636 Data Integration

Virtual Integration Architecture

[Figure: an End User sends a Query to the Mediator and receives a Result. The Mediator holds the Global Schema; each Data Source has its own Local Schema and sits behind a Wrapper.]

• Design-Time: Mediation Language, Schema Matching
• Run-Time: Query Reformulation, Optimization & Execution
• Figure labels: XML, Web Services

Page 3: CSE 636 Data Integration

Schema Heterogeneity

Independently created schemas… might be modeling similar information… in slightly different ways

[Figure: three schema trees.
DB1: ugrad * (ugradID, name); enrollment * (courseID, ugradID, grade)
DB2: course * (courseID, title); student * (studentID, name, type); evaluation
DB3: student * (studentID, name, type); course * (courseID, title ?); type; letter
DB2 nests student under course, while DB3 nests course under student.]

Page 4: CSE 636 Data Integration

Schema Heterogeneity

• Similar entities represented
• Dissimilar structures (inverted nesting)
• Different element names for similar data values
• Similar element names for different data values

[Figure: the DB1, DB2 and DB3 schema trees, repeated from the previous slide.]

Page 5: CSE 636 Data Integration

Schema Matching vs. Schema Mapping

• GAV and LAV are schema mapping languages
• Mappings:
  – set of queries
  – associations + semantics
• Match:
  – set of associations only
• Schema Matching:
  – Identifying associations
  – First step towards constructing mappings

Page 6: CSE 636 Data Integration

Schema Matching vs. Schema Mapping

LAV Mapping: DB1 Q(DB3)

    for $s1 in DB3/student
    where $s1/type = 'UGRAD'
    return <DB1>
             <ugrad>
               <ugradID>{$s1/studentID}</ugradID>
               <name>{$s1/name}</name>
             </ugrad>
           </DB1>

[Figure: the DB1 and DB3 schema trees, with the lines between them labeled "Associations" and the query labeled "Semantics".]

Page 7: CSE 636 Data Integration

The Problem of Schema Matching

Input
• Schemas S1 and S2
• Possibly data instances for S1 and S2
• Background knowledge
  – thesauri
  – validated matches
  – standard schemas
  – reference instances
  – ontologies
  – constraints (keys, data types, etc.)

Output
• Associations between S1 and S2

Goal
• Schema matching tools with significant automated support

Page 8: CSE 636 Data Integration

Schema Matching

How is the match result expressed?

[Figure: the DB2 and DB3 schema trees.]

• Pairs of paths
• Lists of paths
• Schema names
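As one illustration of the "pairs of paths" option, a match result can be held as a plain list of path pairs. This is a minimal sketch: the `Association` type, the helper `targets_for`, and the specific DB2/DB3 paths are assumptions made for the example, not taken from the slides.

```python
from typing import List, NamedTuple


class Association(NamedTuple):
    """One association: a path in the source schema paired with a path in the target."""
    source_path: str
    target_path: str


# Hypothetical paths, assuming DB2 nests student under course and DB3 the reverse.
example_match: List[Association] = [
    Association("DB2/course/courseID", "DB3/student/course/courseID"),
    Association("DB2/course/student/studentID", "DB3/student/studentID"),
    Association("DB2/course/student/name", "DB3/student/name"),
]


def targets_for(match: List[Association], source_path: str) -> List[str]:
    """Return every target path associated with the given source path."""
    return [a.target_path for a in match if a.source_path == source_path]
```

A match in this form carries associations only; turning it into a mapping still requires adding the query semantics.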

Page 9: CSE 636 Data Integration

Schema Matching

What do we match?

• Depends on the queries we want to ask
  1. Elements in isolation (leaves in particular)
  2. Substructures
  3. Whole schemas

Page 10: CSE 636 Data Integration

Motivation

• Important component in many applications
  – Data Integration
  – Data Migration
  – E-Commerce
• Model Management [Bernstein, Halevy, Pottinger ’00]
  – Algebra for manipulating models and mappings
  – Match, Merge, Compose …

Page 11: CSE 636 Data Integration

Problems

• Minimize user involvement (semi-automatic)
• Data model independent matching (generic)
• Schema matching is a hard problem
  – Naming and structural differences in schemas
  – Similar, but non-identical concepts modeled
  – Multiple data models: SQL DDL, XML, ODMG…

Page 12: CSE 636 Data Integration

Schema Matching Approaches

How to match?

• Individual matchers
  – Schema-based
    Per-Element
      Linguistic: Names, Descriptions
      Constraint-based: Types, Keys
    Structural
      Constraint-based: Graph matching
  – Content-based
    Per-Element
      Linguistic: IR (word frequencies, key terms)
      Constraint-based: Value pattern and ranges
• Combined matchers
  – Hybrid
  – Composite
    manual composition
    automatic composition

Taxonomy-based survey: Rahm and Bernstein, VLDB J, 2001

Page 13: CSE 636 Data Integration

Cupid

[Figure: the matcher taxonomy from the previous slide, repeated to locate Cupid within it.]

Madhavan, Bernstein and Rahm, VLDB, 2001

Page 14: CSE 636 Data Integration

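Cupid Example

[Figure: two purchase-order schema trees to be matched. PO: POLines > Item > (Line, Qty, UoM); POShipTo > (City, Street). PurchaseOrder: Items > Item > (ItemNumber, Quantity, UnitofMeasure); DeliverTo > Address > (City, Street). Two Name elements also appear.]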

Page 15: CSE 636 Data Integration

Cupid Architecture

[Figure: Schema 1 and Schema 2 → Linguistic Matching (with Thesaurus) → LSIM → Structure Matching → SSIM, WSIM → Generate Mapping → Output Mapping]

Page 16: CSE 636 Data Integration

Linguistic Matching

• Heuristic name matching
  – Tokenization of names
    POOrderNum → PO, Order, Num
  – Expansion of short-forms, acronyms
    PO → Purchase, Order; Num → Number
  – Clustering of schema elements based on keywords and data-types
    Street, City, POAddress → Address
  – Thesaurus of synonyms, hypernyms, acronyms
  – Linguistic Similarity coefficient (LSIM) in [0,1]

Page 17: CSE 636 Data Integration

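Structure Matching

[Figure: the two schema trees from the Cupid example. PO: POLines > Item > (Line, Qty, UoM); POShipTo > (City, Street). PurchaseOrder: Items > Item > (ItemNumber, Quantity, UnitofMeasure); DeliverTo > Address > (City, Street). Two Name elements also appear.]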

Page 18: CSE 636 Data Integration

Structure Matching: Mutually Reinforcing Similarity

[Figure: PO > POLines > Item > (Line, Qty, UoM) matched against PurchaseOrder > Items > Item > (ItemNum, Quantity, UnitofMeasure). Two pairs are annotated "WSIM > th_high" and three pairs are annotated "SSIM++".]

Page 19: CSE 636 Data Integration

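Structure Matching: Context Dependent Disambiguation

[Figure: PO with POShipTo > (Street, City) and POBillTo > (Street, City), matched against PurchaseOrder with DeliverTo > Address > (Street, City) and InvoiceTo > Address > (Street, City). Two pairs are annotated "SSIM++" and one pair "SSIM--".]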
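The SSIM++ and SSIM-- annotations can be read as an update rule: when two internal nodes are strongly similar, the structural similarity of the leaf pairs beneath them is raised, and when they are clearly dissimilar it is lowered. The sketch below assumes multiplicative adjustment factors; the function name `reinforce` and the values of `th_high`, `th_low`, `c_inc` and `c_dec` are illustrative assumptions.

```python
def reinforce(ssim, leaves_s, leaves_t, wsim_st,
              th_high=0.6, th_low=0.35, c_inc=1.2, c_dec=0.9):
    """Return a copy of ssim with the leaf pairs under nodes s and t adjusted.

    ssim      maps (leaf path in s, leaf path in t) to structural similarity
    wsim_st   weighted similarity of the two internal nodes s and t
    """
    updated = dict(ssim)
    for a in leaves_s:
        for b in leaves_t:
            current = updated.get((a, b), 0.0)
            if wsim_st > th_high:
                # Similar context: raise the leaf pair's similarity (SSIM++).
                updated[(a, b)] = min(1.0, current * c_inc)
            elif wsim_st < th_low:
                # Dissimilar context: lower it (SSIM--).
                updated[(a, b)] = current * c_dec
    return updated
```

Keying leaves by full path is what lets the City under POShipTo end up closer to the City under DeliverTo than to the one under InvoiceTo, even though the leaf names are identical.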

Page 20: CSE 636 Data Integration

Intuition

• Atomic elements (leaves) are similar if
  – They are linguistically and data-type similar
  – Their ancestors are similar
• Compound elements (non-leaf) are similar if
  – They are linguistically similar
  – Subtrees rooted at the elements are similar
• Mutually recursive
  – Leaves determine internal node similarity
  – Similarity of internal nodes leads to increase in leaf similarity

Page 21: CSE 636 Data Integration

Structure Match Details

• Subtrees are similar if
  – Immediate children are similar
  – Leaf sets are similar
• Subtree Similarity (nodes s and t)
  – Fraction of leaves in subtree s that can be mapped to a leaf in the other subtree t and vice-versa
  – Less sensitive to variation in intermediate structure
• Pruning the number of comparisons
  – Elements must have comparable number of leaves
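The leaf-fraction measure and the pruning test can be sketched as follows. The tree encoding as `(name, children)` tuples, the acceptance threshold `th_accept`, the pruning ratio `max_ratio`, and the assumption that leaf names are unique within a subtree are all choices made for this example.

```python
def leaves(tree):
    """Return the leaf names of a tree encoded as (name, [children])."""
    name, children = tree
    if not children:
        return [name]
    found = []
    for child in children:
        found.extend(leaves(child))
    return found


def comparable(leaves_s, leaves_t, max_ratio=2.0):
    """Pruning test: leaf counts may differ by at most a factor of max_ratio."""
    small, large = sorted((len(leaves_s), len(leaves_t)))
    return small > 0 and large / small <= max_ratio


def subtree_ssim(s, t, leaf_wsim, th_accept=0.5):
    """Fraction of leaves in s and t that have a strong link to a leaf in the other subtree."""
    ls, lt = leaves(s), leaves(t)
    if not comparable(ls, lt):
        return 0.0
    linked_s = sum(1 for a in ls
                   if any(leaf_wsim.get((a, b), 0.0) > th_accept for b in lt))
    linked_t = sum(1 for b in lt
                   if any(leaf_wsim.get((a, b), 0.0) > th_accept for a in ls))
    return (linked_s + linked_t) / (len(ls) + len(lt))
```

Because only leaves are counted, inserting an extra intermediate node (such as Address between DeliverTo and City) leaves the score unchanged, which is the sense in which the measure is less sensitive to intermediate structure.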

Page 22: CSE 636 Data Integration

Referential Integrity

[Figure: Schema A has Purchase Order (Order ID, Product Name, Customer ID) and Customer (Customer ID, Name, Address), linked by the join node Order-Customer-fk. Schema B has a single Customer-Purchase-Order element.]

• Join nodes added to the schema tree for each referential integrity constraint
• Views can be similarly used
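One way to picture a join node is as an extra schema-tree element that gathers the columns of the two tables linked by a foreign key, so that it can be compared against a denormalized element in the other schema. The sketch below assumes exactly that reading; the function `add_join_nodes` and the dictionary encoding of the schema are hypothetical.

```python
def add_join_nodes(tables, foreign_keys):
    """Return a schema tree with one join node per referential integrity constraint.

    tables        maps table name -> list of column names
    foreign_keys  maps constraint name -> (referencing table, referenced table)
    """
    tree = {name: list(columns) for name, columns in tables.items()}
    for fk_name, (child_table, parent_table) in foreign_keys.items():
        # Assumed behavior: the join node's children are the columns of both tables.
        tree[fk_name] = (
            [f"{child_table}.{column}" for column in tables[child_table]]
            + [f"{parent_table}.{column}" for column in tables[parent_table]]
        )
    return tree


# Schema A from the slide, encoded as plain dictionaries.
schema_a = {
    "Purchase Order": ["Order ID", "Product Name", "Customer ID"],
    "Customer": ["Customer ID", "Name", "Address"],
}
constraints = {"Order-Customer-fk": ("Purchase Order", "Customer")}
augmented = add_join_nodes(schema_a, constraints)
```

The added Order-Customer-fk node has six leaves, which gives the structure matcher something of comparable size to set against Customer-Purchase-Order in Schema B.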

Page 23: CSE 636 Data Integration

Cupid Architecture

[Figure: Schema 1 and Schema 2 → Linguistic Matching (with Thesaurus) → LSIM → Structure Matching → SSIM, WSIM → Generate Mapping → Output Mapping]

Linguistic Similarity (LSIM)

Element 1   Element 2     LSIM
InvoiceTo   BillTo        0.7
UoM         UnitMeasure   0.9
City        City          1.0

Structural (SSIM), Weighted (WSIM) Similarity

Element 1        Element 2     SSIM   WSIM
InvoiceTo        BillTo        0.8    0.7
UoM              UnitMeasure   0.7    0.8
InvoiceTo/City   BillTo/City   0.8    0.9
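The weighted similarity combines the other two scores. A minimal sketch, assuming the weighted-sum form given in the Cupid paper (reference 1); the default weight `w_struct = 0.5` and the sample inputs are illustrative assumptions, not the settings behind the values on the slide.

```python
def wsim(ssim, lsim, w_struct=0.5):
    """Weighted similarity: w_struct * SSIM + (1 - w_struct) * LSIM."""
    if not 0.0 <= w_struct <= 1.0:
        raise ValueError("w_struct must lie in [0, 1]")
    return w_struct * ssim + (1.0 - w_struct) * lsim


# Hypothetical scores for one element pair.
example = wsim(ssim=0.6, lsim=0.9)
```

Setting `w_struct` to 0 reduces the score to pure linguistic similarity, and setting it to 1 to pure structural similarity.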

Page 24: CSE 636 Data Integration

Mapping Generation

• Individual mapping elements computed from WSIM values:
  – Consider only mapping pairs that have WSIM greater than a threshold
  – For each element of the target, find the most similar source element
  – Mappings that are not accepted but have high similarity are also returned, to help the user modify the map
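The three rules above can be sketched as a single pass over the WSIM scores. The function name `generate_mapping`, the dictionary encoding, and the threshold value are assumptions for illustration.

```python
def generate_mapping(wsim, th_accept=0.5):
    """Pick the best source for each target and list the remaining strong candidates.

    wsim maps (source element, target element) -> weighted similarity.
    Returns (accepted, suggestions): accepted maps target -> source, and
    suggestions lists other above-threshold pairs, strongest first.
    """
    best = {}
    for (source, target), score in wsim.items():
        if target not in best or score > best[target][1]:
            best[target] = (source, score)

    # Keep a target's best source only if it clears the threshold.
    accepted = {target: source
                for target, (source, score) in best.items()
                if score > th_accept}

    chosen = {(source, target) for target, source in accepted.items()}
    suggestions = sorted(
        ((source, target, score)
         for (source, target), score in wsim.items()
         if score > th_accept and (source, target) not in chosen),
        key=lambda item: -item[2],
    )
    return accepted, suggestions
```

Returning the unchosen high-scoring pairs separately keeps the output mapping clean while still giving the user alternatives to swap in.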

Page 25: CSE 636 Data Integration

Cupid Architecture

[Figure: the same pipeline (Schema 1 and Schema 2 → Linguistic Matching with Thesaurus → LSIM → Structure Matching → SSIM, WSIM → Generate Mapping → Output Mapping), with an added "Input hint".]

Page 26: CSE 636 Data Integration

Work Needed

• A more robust solution
  – Auto-tuning parameters
  – Thesaurus Generation and Evolution
• Schema matching component architecture
  – Easily extensible by adding multiple techniques
  – Data Instances for matching
  – Look at COMA & ProtoPlasm systems

Page 27: CSE 636 Data Integration

References

1. J. Madhavan, P. A. Bernstein, E. Rahm: Generic Schema Matching with Cupid. VLDB, 2001
2. H. H. Do, E. Rahm: COMA - A System for Flexible Combination of Schema Matching Approaches. VLDB, 2002
3. P. A. Bernstein, S. Melnik, M. Petropoulos, C. Quix: Industrial-Strength Schema Matching. SIGMOD Record 33(4), 2004