On Schema Matching with Opaque Column Names and Data Values

Jaewoo Kang, NC State (Aug 2003)

Jeffrey F. Naughton

Univ. of Wisconsin-Madison

June 10, 2003, SIGMOD 2003

What is Schema Matching?

Finding semantic correspondences of schema elements across heterogeneous sources.

An old problem that is attracting new interest.


What is Schema Matching? (Cont’d)

Important for enterprise applications
  Data warehouses, data migration.

Also important for Internet data
  Virtual databases, web information systems.

Fundamental element of data integration.


No Silver Bullet!

State of the art: A collection of techniques that propose matches.

We have added a new technique to this collection that works when previous techniques don’t even apply.


Some Previous Approaches

Schema-based approaches

Site 1

Manager  Employ  Salary
J. K.    J. D.   $50K
T. J.    N. D.   $80K
P. K.    Z. I.   $75K

Site 2

MNG      EMP     WAGE
U. P.    D. S.   $85K
A. H.    M. H.   $75K
J. N.    D. F.   $60K


Some Previous Approaches II

Instance-based approaches

Site 1

Dept   Employ  Phone
HR     J. D.   267-7622
R&D    N. D.   354-8736
Sales  Z. I.   219-0457

Site 2

DPT    EMP     CONT
R&D    D. S.   387-9802
Sales  M. H.   546-3856
Adm    D. F.   326-1284


So two previous approaches

Schema-based (interpret column names)

Instance-based (interpret data values)


But what about this problem?

[Figure: two tables, Site 1 and Site 2, each with opaque column names (t1, t2, ...) and purely numeric data values; the cell layout of the original slide is not recoverable]


This is the “Un-interpreted Matching” Problem.

Focus of this talk

Outline of the remainder of this talk
  Formal definition
  Terminology
  Algorithm
  Experimental Results


Un-interpreted Matching

M1 = match(R(r1, r2, .., rn), S(s1, s2, .., sm))

M2 = match(R(r1, r2, .., rn), S’(f1(s1), f2(s2), .., fm(sm)))

where
  match = a schema matching algorithm,
  Mi = {(ri, sj)}: set of matching column pairs,
  fi = arbitrary one-to-one function.

‘match’ is an un-interpreted matching iff M1 = M2 for all fi’s.

Main idea: the specific tokens used to represent column names and values are not important.
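The invariance behind this definition can be illustrated with a small Python sketch. It assumes a made-up column and a hypothetical one-to-one encoding `f`; the helper names `frequency_profile` and `apply_renaming` are illustrative and do not come from the talk. Any statistic that depends only on how often values repeat is unchanged when the values are renamed one-to-one.

```python
from collections import Counter

def frequency_profile(values):
    """Sorted value counts: depends on how often values repeat, not on what they are."""
    return sorted(Counter(values).values(), reverse=True)

def apply_renaming(values, f):
    """Rename every value through the one-to-one mapping f (a dict)."""
    assert len(set(f.values())) == len(f), "f must be one-to-one"
    return [f[v] for v in values]

color = ["white", "silver", "red", "white"]        # made-up column
f = {"white": "a1", "silver": "a2", "red": "a3"}   # hypothetical opaque encoding
opaque = apply_renaming(color, f)

print(frequency_profile(color))   # [2, 1, 1]
print(frequency_profile(opaque))  # [2, 1, 1]
```

A matcher that looks only at such statistics produces the same result on the readable and the opaque column, which is the property M1 = M2 asks for.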


Motivating example

Two Car Part Tables

Model   Color   Tire
XLE     white   P2R6
XG2.5   silver  XR5
LE      red     GM6
:       :       :

Model   C1   C2
GL3.5   a1   b1
XGL     a2   b2
XE      a3   b3
:       :    :


Background

Before introducing our algorithm, need:
  Information Entropy
  Mutual Information
  Modeling Dependency Relations
  Graph Matching


Information Entropy

Measures the uncertainty of values in an attribute

Standard information theoretic measure

\[ H(X) = -\sum_{x \in X} p(x) \log p(x) \]

[Figure: Entropy of Coin Flip Test; x-axis: p(x=front), y-axis: H(X), ticks 0 to 1.2]
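The coin flip curve can be reproduced with a few lines of Python. This is a minimal sketch assuming base-2 logarithms (the slide does not state the base); the function names `entropy` and `coin_entropy` are chosen here for illustration.

```python
import math

def entropy(probs):
    """H(X) = -sum p(x) log2 p(x), in bits; terms with p(x) = 0 contribute 0."""
    return sum(-p * math.log2(p) for p in probs if p > 0)

def coin_entropy(p_front):
    """Entropy of a coin that lands on its front with probability p_front."""
    return entropy([p_front, 1.0 - p_front])

for p in (0.0, 0.1, 0.5, 0.9, 1.0):
    print(f"p(x=front)={p:.1f}  H(X)={coin_entropy(p):.3f}")
```

Under this assumption the entropy is zero for a coin that always lands the same way and peaks at one bit for a fair coin.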


Mutual Information

Another standard information theoretic measure

Measures the amount of information captured in one attribute about the other.

\[ MI(X;Y) = \sum_{x \in X} \sum_{y \in Y} p(x,y) \log \frac{p(x,y)}{p(x)\,p(y)} \]

Note: Self-information MI(X;X) = H(X)
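As a rough illustration, mutual information can be estimated from two columns by counting co-occurrences. The sketch below assumes base-2 logarithms and empirical frequencies as probabilities; the toy columns `x`, `y`, `z` are made up.

```python
import math
from collections import Counter

def entropy(xs):
    """Base-2 entropy of the empirical distribution of one column."""
    n = len(xs)
    return sum(-(c / n) * math.log2(c / n) for c in Counter(xs).values())

def mutual_information(xs, ys):
    """MI(X;Y) = sum_{x,y} p(x,y) log2( p(x,y) / (p(x) p(y)) ), empirical estimates."""
    n = len(xs)
    px, py, pxy = Counter(xs), Counter(ys), Counter(zip(xs, ys))
    return sum((c / n) * math.log2((c / n) / ((px[x] / n) * (py[y] / n)))
               for (x, y), c in pxy.items())

x = ["a", "a", "b", "b"]
y = ["u", "u", "v", "v"]   # fully determined by x
z = ["u", "v", "u", "v"]   # independent of x in this sample

print(mutual_information(x, y))                # 1.0
print(mutual_information(x, z))                # 0.0
print(mutual_information(x, x) == entropy(x))  # True
```

The last line checks the self-information note on this sample: the information a column carries about itself is its entropy.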


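Modeling Dependency Relation

Table R

A   B   C   D
a1  b2  c1  d1
a3  b4  c2  d2
a1  b1  c1  d2
a4  b3  c2  d3

G = Table2DepGraph(R)

[Figure: dependency graph G over nodes A, B, C, D. Node weights (entropies): A = 1.5, B = 2.0, C = 1.0, D = 1.5. Edge weights (mutual information): A-B = 1.5, A-C = 1.0, A-D = 1.0, B-C = 1.0, B-D = 1.5, C-D = 0.5.]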
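One way to sketch Table2DepGraph in Python is to store the graph as a symmetric matrix whose diagonal holds entropies and whose off-diagonal entries hold pairwise mutual information. This is an illustration under stated assumptions, not the authors' implementation: it assumes base-2 logarithms, and it uses the standard identity MI(X;Y) = H(X) + H(Y) - H(X,Y). The name `table_to_dep_graph` is chosen here.

```python
import math
from collections import Counter

def entropy(column):
    """Base-2 entropy of the empirical distribution of a column."""
    n = len(column)
    return sum(-(c / n) * math.log2(c / n) for c in Counter(column).values())

def table_to_dep_graph(columns):
    """Return a matrix g with g[i][j] = MI(col_i; col_j) and g[i][i] = H(col_i),
    computed as H(col_i) + H(col_j) - H(col_i, col_j)."""
    k = len(columns)
    h = [entropy(col) for col in columns]
    g = [[0.0] * k for _ in range(k)]
    for i in range(k):
        for j in range(k):
            joint = entropy(list(zip(columns[i], columns[j])))
            g[i][j] = h[i] + h[j] - joint
    return g

# Table R from the slide, one tuple per row.
rows = [("a1", "b2", "c1", "d1"),
        ("a3", "b4", "c2", "d2"),
        ("a1", "b1", "c1", "d2"),
        ("a4", "b3", "c2", "d3")]
columns = [list(col) for col in zip(*rows)]

for name, weights in zip("ABCD", table_to_dep_graph(columns)):
    print(name, weights)
```

On Table R this yields the weights 1.5, 2.0, 1.0, 1.5 on the diagonal and 0.5, 1.0, or 1.5 between distinct attributes, the same set of values that appears in the slide's figure.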


Graph Matching

[Figure: two dependency graphs, G1 over nodes A, B, C, D and G2 over nodes W, X, Y, Z, each weighted with entropies and pairwise mutual information]

Our algorithm will use graph matching.

{(G1(a), G2(b))} = GraphMatch(G1, G2)

Finds a mapping that minimizes the distance between the two graphs.


Distance Between the Graphs

Euclidean distance metric (Frobenius norm)

\[ D_M(A, B) = \sqrt{\sum_{i,j} \left( a_{ij} - b_{m(i)m(j)} \right)^2} \]

where a_{ij} and b_{ij} = mutual information between node i and j.

m(node in A) = matching node in B.
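The distance and the search for the minimizing mapping can be sketched together in Python. The sketch assumes equal-size graphs stored as matrices and a one-to-one mapping, and it enumerates all permutations without any filtering. The names `euclidean_distance` and `graph_match` are illustrative. `g1` holds the weights computed from Table R with base-2 logarithms, and `g2` is a hypothetical second graph built by reordering the nodes of `g1`.

```python
import math
from itertools import permutations

def euclidean_distance(a, b, m):
    """sqrt( sum_{i,j} (a[i][j] - b[m[i]][m[j]])**2 ) for a node mapping m."""
    n = len(a)
    return math.sqrt(sum((a[i][j] - b[m[i]][m[j]]) ** 2
                         for i in range(n) for j in range(n)))

def graph_match(a, b):
    """Exhaustive one-to-one search: try every permutation, keep the closest."""
    n = len(a)
    best = min(permutations(range(n)), key=lambda m: euclidean_distance(a, b, m))
    return list(best), euclidean_distance(a, b, best)

# G1: dependency graph of Table R (nodes A, B, C, D in this order).
g1 = [[1.5, 1.5, 1.0, 1.0],
      [1.5, 2.0, 1.0, 1.5],
      [1.0, 1.0, 1.0, 0.5],
      [1.0, 1.5, 0.5, 1.5]]

# G2: hypothetical graph, here G1 with its nodes listed in the order B, D, A, C.
order = [1, 3, 0, 2]
g2 = [[g1[p][q] for q in order] for p in order]

mapping, dist = graph_match(g1, g2)
print(mapping, dist)  # [2, 0, 3, 1] 0.0
```

The result maps A to position 2, B to position 0, C to position 3 and D to position 1 of G2, undoing the reordering. The search visits n! mappings, so it is only practical for small schemas.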


Measuring the quality of match results

Precision = (# of correct matches produced) / (# of all matches produced)
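A short Python sketch of this measure follows. The matcher output and the ground truth are made up for illustration, and returning 0.0 for an empty result is a convention chosen here, since the slide does not define that case.

```python
def precision(produced, correct):
    """Fraction of produced column pairs that also appear in the reference matching."""
    produced, correct = set(produced), set(correct)
    if not produced:
        return 0.0  # convention chosen here; the talk does not define this case
    return len(produced & correct) / len(produced)

produced = [("Model", "Model"), ("Color", "C2"), ("Tire", "C1")]  # made-up matcher output
truth = [("Model", "Model"), ("Color", "C1"), ("Tire", "C2")]     # assumed ground truth
print(precision(produced, truth))  # 0.3333333333333333
```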


Finally, Our Matching Algorithm

1. G1 = Table2DepGraph(S1); G2 = Table2DepGraph(S2);

2. {(G1(a), G2(b))} = GraphMatch(G1, G2);

where Si = an input table, Gi = a dependency graph, (G1(a), G2(b)) = a matching node pair.


Validating the Framework

Graph matching algorithm
  Used exhaustive search w/ simple filtering.
  Can be replaced w/ approximate algorithms in practice.

System
  Java HotSpot VM 1.4


Goals of experiments…

Main goal: see if mutual information-based un-interpreted matching works.

Secondary goal: see if mutual information is necessary, or if a simpler approach, Entropy-only Matching, works just as well.
  Entropy-only Matching only compares the entropies of attributes in isolation, without considering mutual information.
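One possible reading of Entropy-only Matching, sketched in Python: compute the entropy of each column and pair the columns of the two tables rank by rank. This is an assumption made for illustration, not necessarily the procedure used in the experiments. It assumes equal-size schemas, a one-to-one mapping and base-2 logarithms, and the tables `site1` and `site2` are made up.

```python
import math
from collections import Counter

def entropy(column):
    """Base-2 entropy of the empirical distribution of a column."""
    n = len(column)
    return sum(-(c / n) * math.log2(c / n) for c in Counter(column).values())

def entropy_only_match(table1, table2):
    """table1, table2: dicts mapping column name -> list of values, with the
    same number of columns. Pairs the columns rank-by-rank after sorting
    each table's columns by entropy."""
    by_h1 = sorted(table1, key=lambda c: entropy(table1[c]))
    by_h2 = sorted(table2, key=lambda c: entropy(table2[c]))
    return list(zip(by_h1, by_h2))

site1 = {"A": ["x", "x", "x", "y"],
         "B": ["p", "p", "q", "q"],
         "C": ["1", "2", "3", "4"]}
site2 = {"t1": ["k", "l", "m", "n"],
         "t2": ["u", "u", "v", "u"],
         "t3": ["g", "h", "g", "h"]}

print(entropy_only_match(site1, site2))  # [('A', 't2'), ('B', 't3'), ('C', 't1')]
```

Because each column is summarized by a single number, columns with nearly equal entropies cannot be told apart by this kind of matcher.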


Data Set I

Census Data (U.S. Census Bureau)

State census data files: NY and CA.

Can algorithm find mapping between attributes in NY and CA tables?

1      2     3   4   5   6    7   8     9     10
18091  1063  10  9   9   41   15  368   368   288
17511  3281  25  21  40  89   59  1211  1211  796
609    3424  29  13  15  148  26  1055  1055  861
3861   2884  18  7   4   114  11  670   670   568
18614  1478  12  10  15  40   16  630   630   459


Data Set II

Medical Data

Thrombosis lab exam data (12 years of patient records).

Range partitioned into two tables based on exam dates.

Can algorithm find mapping between attributes in resulting two tables?

1 2 3 4 5 6 7 8 9 10

970709 23 530 104 6.4 4 14 0.5 232 100

971022 26 564 108 6.8 5.3 13 0.55 250 103

971224 25 483 90 6.5 5.1 15 0.62

980120 26 578 101 7 4.6 16 0.49 224 93

980217 34 521 98 5.3 10 0.62 234 111


Results

[Figure: two panels, Thrombosis exam and Census data; x-axis: schema size (# of attributes), 2 to 20; y-axis: precision, 50% to 100%; series: Mutual Information, Entropy-only]

Match precision deteriorates as the size of match increases.

However, deterioration is small compared to the exponential increase in search space.

MI-based approach dominates entropy-only approach.


Why does mutual information-based approach dominate entropy-only approach?

[Figure: entropy of each attribute for Census NY and Census CA; x-axis: attributes, 1 to 30; y-axis: entropy, 0 to 14]


Cardinality Constraints in Schema Matching

One-to-one mapping (bijective)

[Figure: G1 with nodes A, B, C and G2 with nodes A, B, C]



One-to-one mapping (bijective)

Onto mapping (surjective)

[Figure: G1 with nodes A, B, C and G2 with nodes A, B, C, D]



One-to-one mapping

Onto mapping

Partial mapping

[Figure: G1 with nodes A, B, C, E and G2 with nodes A, B, C, D]


What about schemas that don’t match?

Examined how our matching algorithm reacts to the matching of unrelated schemas. (NY-CA vs. Lab1-CA)


Distinguishing Good and Bad Matches

Clearly detects case where there is no good matching.

[Figure: metric value (y-axis, 0 to 70) plotted over x-axis ticks 2 to 20; series: One-to-One NY-CA Euclidean, One-to-One Lab1-CA Euclidean]


Summary

Identified new class of schema matching problems that have not been addressed by existing solutions.

First to introduce an un-interpreted matching technique that addresses the new class of problems.

Evaluation suggests it may be useful as an addition to existing matching techniques.


Future Work

Find an efficient, accurate graph matching approximation algorithm.

Extend the techniques to nested structures such as XML, OO schemas.

See if the technique is applicable to the problems of schema classification / clustering.


Questions?

For more information: [email protected] http://www.cs.wisc.edu/~jaewoo
