On Schema Matching with Opaque Column Names and Data Values Jaewoo Kang NC State (Aug 2003) Jeffrey F. Naughton Univ. of Wisconsin-Madison


Page 1: On Schema Matching with Opaque Column Names and Data Values

On Schema Matching with Opaque Column Names and Data Values

Jaewoo Kang, NC State (Aug 2003)

Jeffrey F. Naughton, Univ. of Wisconsin-Madison

Page 2: On Schema Matching with Opaque Column Names and Data Values

June 10, 2003, SIGMOD 2003 (Jaewoo Kang)

What is Schema Matching?

Finding semantic correspondences of schema elements across heterogeneous sources.

An old problem, yet one attracting new interest.

Page 3: On Schema Matching with Opaque Column Names and Data Values


What is Schema Matching? (Cont’d)

Important for enterprise applications: data warehouses, data migration.

Also important for Internet data: virtual databases, web information systems.

Fundamental element of data integration.

Page 4: On Schema Matching with Opaque Column Names and Data Values


No Silver Bullet!

State of the art: A collection of techniques that propose matches.

We have added a new technique to this collection that works when previous techniques don’t even apply.

Page 5: On Schema Matching with Opaque Column Names and Data Values


Some Previous Approaches

Schema-based approaches

Site 1:

Manager   Employ   Salary
J. K.     J. D.    $50K
T. J.     N. D.    $80K
P. K.     Z. I.    $75K

Site 2:

MNG       EMP      WAGE
U. P.     D. S.    $85K
A. H.     M. H.    $75K
J. N.     D. F.    $60K

Page 6: On Schema Matching with Opaque Column Names and Data Values


Some Previous Approaches II

Instance-based approaches

Site 1:

Dept    Employ   Phone
HR      J. D.    267-7622
R&D     N. D.    354-8736
Sales   Z. I.    219-0457

Site 2:

DPT     EMP      CONT
R&D     D. S.    387-9802
Sales   M. H.    546-3856
Adm     D. F.    326-1284

Page 7: On Schema Matching with Opaque Column Names and Data Values


So two previous approaches

Schema-based (interpret column names)

Instance-based (interpret data values)

Page 8: On Schema Matching with Opaque Column Names and Data Values


But what about this problem?

[Two tables, Site 1 and Site 2. Both use the opaque column names t1 to t7 and hold only numeric values, so neither the column names nor the data values can be interpreted. The individual cell values are not recoverable from this transcript.]

Page 9: On Schema Matching with Opaque Column Names and Data Values


This is the “Un-interpreted Matching” Problem.

Focus of this talk.

Outline of the remainder of this talk:
- Formal definition
- Terminology
- Algorithm
- Experimental Results

Page 10: On Schema Matching with Opaque Column Names and Data Values


Un-interpreted Matching

M1 = match(R(r1, r2, .., rn), S(s1, s2, .., sm))
M2 = match(R(r1, r2, .., rn), S’(f1(s1), f2(s2), .., fm(sm)))

where
match = a schema matching algorithm,
Mi = {(ri-sj)}: set of matching column pairs,
fi = arbitrary one-to-one function.

‘match’ is an un-interpreted matching iff M1 = M2 for all fi’s.

Main idea: the specific tokens representing column names and values are not important.
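To make the main idea concrete, the following minimal Python sketch re-encodes a column with a one-to-one function and checks that a statistic of the column (its entropy, used later in the talk) is unchanged. The column values, the mapping `f`, and the helper name `entropy` are illustrative assumptions, not taken from the talk, and the sketch demonstrates invariance of one statistic rather than of a full matcher.

```python
import math
from collections import Counter

def entropy(column):
    """Shannon entropy (in bits) of the values in one column."""
    n = len(column)
    return -sum((c / n) * math.log2(c / n) for c in Counter(column).values())

# Hypothetical column and a hypothetical one-to-one re-encoding f of its values.
color = ["white", "silver", "red", "white"]
f = {"white": "a1", "silver": "a2", "red": "a3"}
opaque = [f[v] for v in color]

# Only the frequency pattern matters, so both columns have the same entropy.
print(entropy(color), entropy(opaque))
```

Any matcher that consumes only such frequency-based statistics cannot be affected by renaming the tokens.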

Page 11: On Schema Matching with Opaque Column Names and Data Values


Motivating example

Table 1:

Model   Color    Tire
XLE     white    P2R6
XG2.5   silver   XR5
LE      red      GM6
:       :        :

Table 2:

Model   C1   C2
GL3.5   a1   b1
XGL     a2   b2
XE      a3   b3
:       :    :

Two Car Part Tables


Page 15: On Schema Matching with Opaque Column Names and Data Values


Background

Before introducing our algorithm, we need:
- Information Entropy
- Mutual Information
- Modeling Dependency Relations
- Graph Matching

Page 16: On Schema Matching with Opaque Column Names and Data Values


Information Entropy

Measures the uncertainty of values in an attribute.

Standard information-theoretic measure:

$$H(X) = -\sum_{x \in X} p(x) \log p(x)$$

[Figure: Entropy of Coin Flip Test. x-axis: p(x=front); y-axis: H(X), ticks from 0 to 1.2.]
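The coin-flip curve can be reproduced with a few lines of Python. This is a minimal sketch that assumes log base 2 (entropy in bits); the function name `coin_entropy` is hypothetical.

```python
import math

def coin_entropy(p_front):
    """Entropy in bits of a coin that shows front with probability p_front."""
    total = 0.0
    for p in (p_front, 1.0 - p_front):
        if p > 0.0:  # treat 0 * log 0 as 0
            total -= p * math.log2(p)
    return total

# Entropy is 0 for a certain outcome and peaks for a fair coin.
for p in (0.0, 0.1, 0.5, 0.9, 1.0):
    print(f"p(front)={p:.1f}  H(X)={coin_entropy(p):.3f}")
```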

Page 17: On Schema Matching with Opaque Column Names and Data Values


Mutual Information

Another standard information-theoretic measure.

Measures the amount of information captured in one attribute about the other:

$$MI(X;Y) = \sum_{x \in X} \sum_{y \in Y} p(x,y) \log \frac{p(x,y)}{p(x)\,p(y)}$$

Note: self-information MI(X;X) = H(X).
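As a rough illustration, mutual information can be estimated from two columns by counting value pairs. The sketch below assumes log base 2 and uses hypothetical toy columns; `entropy` and `mutual_information` are illustrative helper names.

```python
import math
from collections import Counter

def entropy(values):
    """Shannon entropy in bits of one column."""
    n = len(values)
    return -sum((c / n) * math.log2(c / n) for c in Counter(values).values())

def mutual_information(xs, ys):
    """MI(X;Y) in bits, estimated from two equally long columns."""
    n = len(xs)
    px, py, pxy = Counter(xs), Counter(ys), Counter(zip(xs, ys))
    return sum(
        (c / n) * math.log2((c / n) / ((px[x] / n) * (py[y] / n)))
        for (x, y), c in pxy.items()
    )

# Hypothetical toy columns.
x = ["a1", "a3", "a1", "a4"]
y = ["c1", "c2", "c1", "c2"]
print(mutual_information(x, y))
# Self-information: MI(X;X) equals H(X).
print(mutual_information(x, x), entropy(x))
```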

Page 18: On Schema Matching with Opaque Column Names and Data Values


Modeling Dependency Relation

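Table R:

A    B    C    D
a1   b2   c1   d1
a3   b4   c2   d2
a1   b1   c1   d2
a4   b3   c2   d3

G = Table2DepGraph(R). Each node carries the entropy of its attribute and each edge carries the mutual information between its two attributes (log base 2). Written as a matrix, with entropies on the diagonal:

     A     B     C     D
A    1.5   1.5   1.0   1.0
B    1.5   2.0   1.0   1.5
C    1.0   1.0   1.0   0.5
D    1.0   1.5   0.5   1.5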
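The dependency graph for Table R can be computed directly from the definitions of entropy and mutual information. The sketch below is one possible reading of Table2DepGraph, not the authors' code: the function name `table_to_dep_graph` and the matrix representation (entry [i][j] is the mutual information between columns i and j, so the diagonal holds the entropies) are assumptions.

```python
import math
from collections import Counter

def mutual_information(xs, ys):
    """MI(X;Y) in bits, estimated from two equally long columns."""
    n = len(xs)
    px, py, pxy = Counter(xs), Counter(ys), Counter(zip(xs, ys))
    return sum(
        (c / n) * math.log2((c / n) / ((px[x] / n) * (py[y] / n)))
        for (x, y), c in pxy.items()
    )

def table_to_dep_graph(rows):
    """Return an n x n matrix: entry [i][j] is MI(column i; column j)."""
    columns = list(zip(*rows))
    return [[mutual_information(ci, cj) for cj in columns] for ci in columns]

# Table R from the slide.
table_r = [
    ("a1", "b2", "c1", "d1"),
    ("a3", "b4", "c2", "d2"),
    ("a1", "b1", "c1", "d2"),
    ("a4", "b3", "c2", "d3"),
]
graph = table_to_dep_graph(table_r)
for name, row in zip("ABCD", graph):
    print(name, [round(v, 2) for v in row])
```

The diagonal of the result holds H(A) = 1.5, H(B) = 2.0, H(C) = 1.0 and H(D) = 1.5; the off-diagonal entries are the edge weights.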

Page 19: On Schema Matching with Opaque Column Names and Data Values


Graph Matching

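[Figure: dependency graph G1 with nodes A, B, C, D next to dependency graph G2 with nodes W, X, Y, Z. Both graphs carry the same ten weights (0.5, four times 1.0, four times 1.5, and 2.0) attached to differently named nodes.]

Our algorithm will use graph matching.

{(G1(a), G2(b))} = GraphMatch(G1, G2)

Finds a mapping that minimizes the distance between the two graphs.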

Page 20: On Schema Matching with Opaque Column Names and Data Values


Distance Between the Graphs

Euclidean distance metric (Frobenius norm):

$$D_M(A, B) = \sqrt{\sum_{i,j} \left(a_{ij} - b_{m(i)\,m(j)}\right)^2}$$

where a_ij and b_ij = mutual information between node i and node j,
m(node in A) = matching node in B.
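The distance and a brute-force search over one-to-one mappings can be sketched as follows. This is a minimal illustration, assuming both graphs are given as square matrices of equal size; the names `graph_distance` and `graph_match` and the 3-node example graphs are hypothetical, and no pruning or filtering is applied.

```python
import math
from itertools import permutations

def graph_distance(a, b, m):
    """Euclidean distance between matrix a and matrix b under node mapping m."""
    n = len(a)
    return math.sqrt(sum(
        (a[i][j] - b[m[i]][m[j]]) ** 2 for i in range(n) for j in range(n)
    ))

def graph_match(a, b):
    """Exhaustive one-to-one search; assumes both graphs have equally many nodes."""
    n = len(a)
    best = min(permutations(range(n)), key=lambda m: graph_distance(a, b, m))
    return list(best), graph_distance(a, b, best)

# Hypothetical 3-node graphs: g2 is g1 with its nodes listed in the order (2, 0, 1).
g1 = [[1.5, 1.0, 0.5],
      [1.0, 2.0, 0.2],
      [0.5, 0.2, 1.0]]
order = [2, 0, 1]
g2 = [[g1[p][q] for q in order] for p in order]

# mapping[i] is the index in g2 matched to node i of g1.
mapping, dist = graph_match(g1, g2)
print(mapping, dist)
```

The search visits n! permutations, so it is only practical for small schemas.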

Page 21: On Schema Matching with Opaque Column Names and Data Values


Measuring the quality of match results

Precision = (# of correct matches produced) / (# of all matches produced)
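As a small worked example, precision can be computed from a set of produced column pairs and a set of correct pairs. The pairs below are hypothetical, and returning 0 when no matches are produced is an assumption of this sketch.

```python
def precision(produced, correct):
    """Fraction of produced matches that are also in the set of correct matches."""
    produced = set(produced)
    if not produced:
        return 0.0  # assumption: define precision as 0 when nothing is produced
    return len(produced & set(correct)) / len(produced)

# Hypothetical column pairs (left column, right column).
produced = [("Model", "Model"), ("Color", "C2"), ("Tire", "C1")]
correct = [("Model", "Model"), ("Color", "C1"), ("Tire", "C2")]
print(precision(produced, correct))  # one of the three produced pairs is correct
```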

Page 22: On Schema Matching with Opaque Column Names and Data Values


Finally, Our Matching Algorithm

1. G1 = Table2DepGraph(S1); G2 = Table2DepGraph(S2);

2. {(G1(a), G2(b))} = GraphMatch(G1, G2);

where Si = an input table, Gi = a dependency graph, and (G1(a), G2(b)) = a matching node pair.

Page 23: On Schema Matching with Opaque Column Names and Data Values


Validating the Framework

Graph matching algorithm:
- Used exhaustive search with simple filtering.
- Can be replaced with approximate algorithms in practice.

System: Java HotSpot VM 1.4

Page 24: On Schema Matching with Opaque Column Names and Data Values


Goals of experiments…

Main goal: see if mutual information-based un-interpreted matching works.

Secondary goal: see if mutual information is necessary, or if a simpler approach, Entropy-only Matching, works just as well. Entropy-only Matching only compares the entropies of attributes in isolation, without considering mutual information.
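The talk does not spell out how the entropy-only baseline pairs attributes. One plausible reading, sketched below purely as an assumption, is to sort the columns of each table by entropy and pair them by rank, which minimizes the sum of squared entropy differences for a one-to-one mapping between equally sized schemas. The function name `entropy_only_match` and the toy tables are hypothetical.

```python
import math
from collections import Counter

def entropy(values):
    """Shannon entropy in bits of one column."""
    n = len(values)
    return -sum((c / n) * math.log2(c / n) for c in Counter(values).values())

def entropy_only_match(table1, table2):
    """Pair columns by entropy rank; assumes both tables have equally many columns."""
    h1 = sorted((entropy(col), i) for i, col in enumerate(zip(*table1)))
    h2 = sorted((entropy(col), j) for j, col in enumerate(zip(*table2)))
    return sorted((i, j) for (_, i), (_, j) in zip(h1, h2))

# Hypothetical tables: t2 holds re-encoded values with its columns swapped.
t1 = [("x", "p"), ("x", "q"), ("y", "r"), ("y", "s")]
t2 = [("1", "u"), ("2", "u"), ("3", "v"), ("4", "v")]
print(entropy_only_match(t1, t2))  # pairs of (column index in t1, column index in t2)
```

Columns with nearly equal entropies are indistinguishable to this baseline, since it ignores how attributes relate to each other.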

Page 25: On Schema Matching with Opaque Column Names and Data Values


Data Set I: Census Data (U.S. Census Bureau)

State census data files: NY and CA.

Can the algorithm find the mapping between attributes in the NY and CA tables?

1 2 3 4 5 6 7 8 9 10

18091 1063 10 9 9 41 15 368 368 288

17511 3281 25 21 40 89 59 1211 1211 796

609 3424 29 13 15 148 26 1055 1055 861

3861 2884 18 7 4 114 11 670 670 568

18614 1478 12 10 15 40 16 630 630 459

Page 26: On Schema Matching with Opaque Column Names and Data Values


Data Set II: Medical Data

Thrombosis lab exam data (12 years of patient records).

Range partitioned into two tables based on exam dates.

Can the algorithm find the mapping between attributes in the resulting two tables?

1 2 3 4 5 6 7 8 9 10

970709 23 530 104 6.4 4 14 0.5 232 100

971022 26 564 108 6.8 5.3 13 0.55 250 103

971224 25 483 90 6.5 5.1 15 0.62

980120 26 578 101 7 4.6 16 0.49 224 93

980217 34 521 98 5.3 10 0.62 234 111

Page 27: On Schema Matching with Opaque Column Names and Data Values


Results

[Figure: two plots, Thrombosis exam and Census data. x-axis: schema size (# of attributes), 2 to 20; y-axis: precision, 50% to 100%; series: Mutual Information, Entropy-only.]

Match precision deteriorates as the size of the match increases.

However, the deterioration is small compared to the exponential increase in search space.

The MI-based approach dominates the entropy-only approach.

Page 28: On Schema Matching with Opaque Column Names and Data Values


Why does mutual information-based approach dominate entropy-only approach?

[Figure: entropy of each attribute. x-axis: attributes, 1 to 30; y-axis: entropy, 0 to 14; series: Census NY, Census CA.]

Page 29: On Schema Matching with Opaque Column Names and Data Values


Cardinality Constraints in Schema Matching

- One-to-one mapping (bijective)

[Figure: G1 with nodes A, B, C mapped to G2 with nodes A, B, C.]

Page 30: On Schema Matching with Opaque Column Names and Data Values


Cardinality Constraints in Schema Matching

- One-to-one mapping (bijective)
- Onto mapping (surjective)

[Figure: G1 with nodes A, B, C and G2 with nodes A, B, C, D.]

Page 31: On Schema Matching with Opaque Column Names and Data Values


Cardinality Constraints in Schema Matching

- One-to-one mapping
- Onto mapping
- Partial mapping

[Figure: G1 with nodes A, B, C, E and G2 with nodes A, B, C, D.]

Page 32: On Schema Matching with Opaque Column Names and Data Values


What about schemas that don’t match?

Examined how our matching algorithm reacts to the matching of unrelated schemas. (NY-CA vs. Lab1-CA)

Page 33: On Schema Matching with Opaque Column Names and Data Values


Distinguishing Good and Bad Matches

Clearly detects the case where there is no good matching.

[Figure: distance metric value versus schema size. x-axis: schema size (# of attributes), 2 to 20; y-axis: metric value, 0 to 70; series: One-to-One NY-CA Euclidean, One-to-One Lab1-CA Euclidean.]

Page 34: On Schema Matching with Opaque Column Names and Data Values


Summary

Identified a new class of schema matching problems that have not been addressed by existing solutions.

First to introduce an un-interpreted matching technique that addresses the new class of problems.

Evaluation suggests it may be useful as an addition to existing matching techniques.

Page 35: On Schema Matching with Opaque Column Names and Data Values


Future Work

Find an efficient, accurate graph matching approximation algorithm.

Extend the techniques to nested structures such as XML, OO schemas.

See if the technique is applicable to the problems of schema classification / clustering.

Page 36: On Schema Matching with Opaque Column Names and Data Values


Questions?

For more information: [email protected] http://www.cs.wisc.edu/~jaewoo