
Page 1

UFMG, June 2002; BYU Data Extraction Group

Automating Schema Matching for Data Integration

David W. Embley

Brigham Young University

Funded by NSF

Page 2

Information Exchange

[Figure: information exchange from Source to Target. Leverage this (Information Extraction) … to do this (Schema Matching).]

Page 3

Presentation Outline

• Information Extraction

• Schema Matching for HTML Tables

• Direct Schema Matching

• Indirect Schema Matching

• Conclusions and Future Work

Page 4

Information Extraction

Page 5

Extracting Pertinent Information from Documents

Page 6

A Conceptual Modeling Solution

[Figure: conceptual model of a car ad. Car has Year, Price, Make, Mileage, Model, and Feature; PhoneNr is for Car; PhoneNr has Extension. Each relationship set carries participation constraints such as 0..1, 0..*, and 1..*.]

Page 7

Car-Ads Ontology

Car [->object];
Car [0..1] has Year [1..*];
Car [0..1] has Make [1..*];
Car [0..1] has Model [1..*];
Car [0..1] has Mileage [1..*];
Car [0..*] has Feature [1..*];
Car [0..1] has Price [1..*];
PhoneNr [1..*] is for Car [0..*];
PhoneNr [0..1] has Extension [1..*];
Year matches [4]
    constant { extract "\d{2}";
               context "([^\$\d]|^)[4-9]\d,[^\d]";
               substitute "^" -> "19"; },
    …
End;
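To make the Year rule concrete, here is a minimal Python sketch of how such a data-frame rule might be applied. It is an interpretation, not the group's implementation: it assumes that `context` locates a candidate span, that `extract` pulls the value out of that span, and that `substitute "^" -> "19"` prefixes the century. The function name `extract_years` and the sample strings are hypothetical.

```python
import re

# Patterns copied from the Year rule in the Car-Ads ontology.
CONTEXT = re.compile(r"([^\$\d]|^)[4-9]\d,[^\d]")
EXTRACT = re.compile(r"\d{2}")


def extract_years(text):
    """Return four-digit years for two-digit years found in context."""
    years = []
    for ctx in CONTEXT.finditer(text):
        value = EXTRACT.search(ctx.group(0))
        if value:
            # Assumed meaning of: substitute "^" -> "19"
            years.append("19" + value.group(0))
    return years


print(extract_years("'89, Subaru SW, $1900"))  # a two-digit year in context
print(extract_years("$95, firm"))              # a price, excluded by the context
```

The context pattern is what keeps a price such as "$95," from being read as a year: the character before the two digits must be neither "$" nor a digit.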

Page 8

Recognition and Extraction

Car Year Make Model Mileage Price PhoneNr
0001 1989 Subaru SW $1900 (363)835-8597
0002 1998 Elandra (336)526-5444
0003 1994 HONDA ACCORD EX 100K (336)526-1081

Car   Feature
0001  Auto
0001  AC
0002  Black
0002  4 door
0002  tinted windows
0002  Auto
0002  pb
0002  ps
0002  cruise
0002  am/fm
0002  cassette stero
0002  a/c
0003  Auto
0003  jade green
0003  gold

Page 9

Schema Matching for HTML Tables

Page 10

Table-Schema Matching (Basic Idea)

• Many tables on the Web
• Ontology-Based Extraction:
  – Works well for unstructured or semistructured data
  – What about structured data – tables?
• Method:
  – Form attribute-value pairs
  – Do extraction
  – Infer mappings from extraction patterns

Page 11

Problem: Different Schemas

Target Database Schema
{Car, Year, Make, Model, Mileage, Price, PhoneNr}, {PhoneNr, Extension}, {Car, Feature}

Different Source Table Schemas
  – {Run #, Yr, Make, Model, Tran, Color, Dr}
  – {Make, Model, Year, Colour, Price, Auto, Air Cond., AM/FM, CD}
  – {Vehicle, Distance, Price, Mileage}
  – {Year, Make, Model, Trim, Invoice/Retail, Engine, Fuel Economy}

Page 12

Problem: Attribute is Value

Page 13

Problem: Attribute-Value is Value

Page 14

Problem: Value is not Value

Page 15

Problem: Implied Values

Page 16

Problem: Missing Attributes

Page 17

Problem: Compound Attributes

Page 18

Problem: Merged Values

Page 19

Problem: Values not of Interest

Page 20

Problem: Factored Values

Page 21

Problem: Split Values

Page 22

Problem: Information Behind Links

• Single-column table (formatted as a list)
• Table extending over several pages

Page 23

Solution

• Form attribute-value pairs (adjust if necessary)

• Do extraction

• Infer mappings from extraction patterns

Page 24

Solution: Remove Internal Factoring

Discover Nesting: Make, (Model, (Year, Colour, Price, Auto, Air Cond, AM/FM, CD)*)*

Unnest: μ_(Model, Year, Colour, Price, Auto, Air Cond, AM/FM, CD)* μ_(Year, Colour, Price, Auto, Air Cond, AM/FM, CD)* Table

[Figure: source car table (ACURA rows) with legend.]
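The unnesting step can be sketched in a few lines of Python. This is an illustrative sketch only: the dictionary representation, the keys `models` and `cars`, the function name `unnest`, and the sample rows are assumptions, not taken from the slides.

```python
def unnest(rows, nested_key):
    """Flatten one nesting level: copy outer attributes onto each nested row."""
    flat = []
    for row in rows:
        outer = {k: v for k, v in row.items() if k != nested_key}
        for inner in row[nested_key]:
            flat.append({**outer, **inner})
    return flat


# Hypothetical nested table shaped like Make, (Model, (Year, Colour, Price)*)*
table = [
    {"Make": "ACURA", "models": [
        {"Model": "Integra", "cars": [
            {"Year": 1995, "Colour": "White", "Price": "$6300"},
            {"Year": 1996, "Colour": "Red", "Price": "$7100"},
        ]},
    ]},
]

# With this representation the outer level has to be flattened first.
flat = unnest(unnest(table, "models"), "cars")
for row in flat:
    print(row)
```

After both steps every row carries its own Make and Model, so each row can be treated as one car.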

Page 25

Solution: Replace Boolean Values

[Figure: source car table with legend. In the Boolean columns (Auto, Air Cond., AM/FM, CD), each "Yes" entry in Table is replaced by the column's attribute name.]
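A minimal sketch of the Boolean replacement, assuming a row is a dictionary of strings. The function name, the choice to map every non-"Yes" entry to the empty string, and the "No" in the CD column are assumptions made for illustration.

```python
BOOLEAN_COLUMNS = ["Auto", "Air Cond.", "AM/FM", "CD"]


def replace_boolean_values(row, columns=BOOLEAN_COLUMNS):
    """Replace "Yes" with the column name and any other entry with ""."""
    out = dict(row)
    for col in columns:
        if col in out:
            out[col] = col if out[col].strip().lower() == "yes" else ""
    return out


# Hypothetical source row; the "No" for CD is an assumed encoding.
row = {"Make": "Honda", "Model": "Civic EX", "Year": "1995", "Colour": "White",
       "Price": "$6300", "Auto": "Yes", "Air Cond.": "Yes", "AM/FM": "Yes", "CD": "No"}

replaced = replace_boolean_values(row)
print(replaced)
```

Once "Yes" has become "Auto" or "AM/FM", the cell holds a string that an extraction ontology can recognize as a Feature value.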

Page 26

Solution: Form Attribute-Value Pairs

[Figure: source car table with legend, after Boolean values are replaced in the Auto, Air Cond., AM/FM, and CD columns.]

<Make, Honda>, <Model, Civic EX>, <Year, 1995>, <Colour, White>, <Price, $6300>, <Auto, Auto>, <Air Cond., Air Cond.>, <AM/FM, AM/FM>

Page 27

Solution: Adjust Attribute-Value Pairs

[Figure: source car table with legend, after Boolean values are replaced in the Auto, Air Cond., AM/FM, and CD columns.]

<Make, Honda>, <Model, Civic EX>, <Year, 1995>, <Colour, White>, <Price, $6300>, <Auto>, <Air Cond>, <AM/FM>
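Forming and then adjusting the pairs can be sketched as follows. The function names and the tuple representation are hypothetical; the row is the Honda example from the slides, with an empty CD cell assumed.

```python
def form_pairs(row):
    """Form <attribute, value> pairs, skipping empty cells."""
    return [(attr, value) for attr, value in row.items() if value]


def adjust_pairs(pairs):
    """Collapse a pair whose value merely repeats its attribute name."""
    return [(attr,) if attr == value else (attr, value) for attr, value in pairs]


# Row after Boolean replacement (the empty CD cell is an assumption).
row = {"Make": "Honda", "Model": "Civic EX", "Year": "1995", "Colour": "White",
       "Price": "$6300", "Auto": "Auto", "Air Cond.": "Air Cond.",
       "AM/FM": "AM/FM", "CD": ""}

pairs = form_pairs(row)
adjusted = adjust_pairs(pairs)
print(adjusted)
```

The adjustment avoids feeding a redundant string such as "Auto Auto" to the extractor.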

Page 28

Solution: Do Extraction

[Figure: source car table with legend, after Boolean values are replaced in the Auto, Air Cond., AM/FM, and CD columns.]

<Make, Honda>, <Model, Civic EX>, <Year, 1995>, <Colour, White>, <Price, $6300>, <Auto>, <Air Cond>, <AM/FM>

Page 29

Solution: Infer Mappings

[Figure: source car table with legend.]

{Car, Year, Make, Model, Mileage, Price, PhoneNr}, {PhoneNr, Extension}, {Car, Feature}

Each row is a car.
π_Model μ_(Year, Colour, Price, Auto, Air Cond, AM/FM, CD)* Table
π_Make μ_(Model, Year, Colour, Price, Auto, Air Cond, AM/FM, CD)* μ_(Year, Colour, Price, Auto, Air Cond, AM/FM, CD)* Table
π_Year Table

Note: Mappings produce sets for attributes. Joining to form records is trivial because we have OIDs for table rows (e.g. for each Car).

Page 30

Solution: Do Extraction

[Figure: source car table with legend.]

{Car, Year, Make, Model, Mileage, Price, PhoneNr}, {PhoneNr, Extension}, {Car, Feature}

π_Model μ_(Year, Colour, Price, Auto, Air Cond, AM/FM, CD)* Table

Page 31

Solution: Do Extraction

[Figure: source car table with legend.]

{Car, Year, Make, Model, Mileage, Price, PhoneNr}, {PhoneNr, Extension}, {Car, Feature}

π_Price Table

Page 32

Solution: Do Extraction

[Figure: source car table with legend.]

{Car, Year, Make, Model, Mileage, Price, PhoneNr}, {PhoneNr, Extension}, {Car, Feature}

[Mapping expression, garbled in extraction: Feature is the union over the Colour, Auto, Air Cond., AM/FM, and CD columns of Table, each renamed to Feature. In the Auto, Air Cond., AM/FM, and CD columns, "Yes" is first replaced by the attribute name.]
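The Feature mapping, a union over several source columns, can be sketched as below. It assumes rows have already had their Boolean values replaced and uses the row position as the object identifier (OID) for each Car; both the function name and the OID scheme are assumptions.

```python
FEATURE_COLUMNS = ["Colour", "Auto", "Air Cond.", "AM/FM", "CD"]


def feature_relation(rows, columns=FEATURE_COLUMNS):
    """Build {Car, Feature} tuples as a union over the feature columns."""
    relation = set()
    for oid, row in enumerate(rows, start=1):  # row position stands in for the Car OID
        for col in columns:
            value = row.get(col, "")
            if value:
                relation.add((oid, value))
    return relation


rows = [
    {"Make": "Honda", "Model": "Civic EX", "Year": "1995", "Colour": "White",
     "Price": "$6300", "Auto": "Auto", "Air Cond.": "Air Cond.",
     "AM/FM": "AM/FM", "CD": ""},
]

features = feature_relation(rows)
print(sorted(features))
```

Because every tuple carries the row's OID, the Feature set can be joined back to the per-attribute sets (Year, Make, Model, Price) to form complete records.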

Page 33

Experiment

• Tables from 60 sites
• 10 "training" tables
• 50 test tables
• 357 mappings (from all 60 sites)
  – 172 direct mappings (same attribute and meaning)
  – 185 indirect mappings (29 attribute synonyms, 5 "Yes/No" columns, 68 unions over columns for Feature, 19 factored values, and 89 columns of merged values that needed to be split)

Page 34

Results

• 10 "training" tables
  – 100% of the 57 mappings (no false mappings)
  – 94.6% of the values in linked pages (5.4% false declarations)
• 50 test tables
  – 94.7% of the 300 mappings (no false mappings)
  – On the basis of sampling 3,000 values in linked pages, we obtained 97% recall and 86% precision
• 16 missed mappings
  – 4 partial (not all unions included)
  – 6 non-U.S. car-ads (unrecognized makes and models)
  – 2 U.S. unrecognized makes and models
  – 3 prices (missing $ or found MSRP instead)
  – 1 mileage (mileages less than 1,000)
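The reported figures are mutually consistent, which a short arithmetic check makes visible. The check assumes that all 16 missed mappings fall in the 50 test tables, which the 100% result on the training tables implies; the dictionary keys are shorthand labels.

```python
missed = {
    "partial unions": 4,
    "non-U.S. makes and models": 6,
    "U.S. makes and models": 2,
    "prices": 3,
    "mileage": 1,
}

training_mappings = 57
test_mappings = 300

total_missed = sum(missed.values())
found = test_mappings - total_missed
recall_pct = round(100 * found / test_mappings, 1)

print(total_missed, found, recall_pct)
print(training_mappings + test_mappings)  # total mappings over all 60 sites
```

So 284 of the 300 test mappings were found, which rounds to the reported 94.7%, and the training and test mappings together give the 357 mappings of the experiment.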

Page 35

Direct Schema Matching

Page 36

Attribute Matching for Populated Schemas

• Central Idea: Exploit All Data & Metadata

• Matching Possibilities (Facets)
  – Attribute Names
  – Data-Value Characteristics
  – Expected Data Values
  – Data-Dictionary Information
  – Structural Properties

Page 37

Approach

• Target Schema T

• Source Schema S

• Framework
  – Individual Facet Matching
  – Combining Facets
  – Best-First Match Iteration

Page 38

Example

[Figure: Source Schema S and Target Schema T, each centered on Car. Object sets in the diagrams include Year, Make, Model, Cost, Style, Feature, Mileage, Miles, and Phone, connected to Car by "has" relationship sets with participation constraints 0:1 or 0:*. Highlighted correspondences include Car with Car, Year with Year, Make with Make, Model with Model, and Mileage with Miles.]

Page 39

Individual Facet Matching

• Attribute Names

• Data-Value Characteristics

• Expected Data Values

Page 40

Attribute Names

• Target and Source Attributes
  – T : A
  – S : B
• WordNet
• C4.5 Decision Tree: feature selection, trained on schemas in DB books
  – f0: same word
  – f1: synonym
  – f2: sum of distances to a common hypernym root
  – f3: number of different common hypernym roots
  – f4: sum of the number of senses of A and B

Page 41

WordNet Rule

[Figure: decision-tree rule over three WordNet features: the number of different common hypernym roots of A and B, the sum of distances of A and B to a common hypernym, and the sum of the number of senses of A and B.]

Page 42

Confidence Measures

Page 43

Data-Value Characteristics

• C4.5 Decision Tree

• Features
  – Numeric data (Mean, variation, standard deviation, …)
  – Alphanumeric data (String length, numeric ratio, space ratio)
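As an illustration of what such value-characteristic features might look like, the sketch below computes a few of them with the Python standard library. The exact feature definitions used with C4.5 are not given on the slide, so the choices here (population variance for "variation", ratios taken over all characters) and the function names are assumptions. Both functions expect non-empty input.

```python
import statistics


def numeric_features(values):
    """Summary statistics for a column of numbers."""
    return {
        "mean": statistics.mean(values),
        "variance": statistics.pvariance(values),
        "stdev": statistics.pstdev(values),
    }


def alphanumeric_features(values):
    """Length and character-class ratios for a column of strings."""
    total_chars = sum(len(v) for v in values)
    digits = sum(ch.isdigit() for v in values for ch in v)
    spaces = sum(ch.isspace() for v in values for ch in v)
    return {
        "mean_length": total_chars / len(values),
        "numeric_ratio": digits / total_chars if total_chars else 0.0,
        "space_ratio": spaces / total_chars if total_chars else 0.0,
    }


print(numeric_features([1900, 6300]))
print(alphanumeric_features(["Civic EX", "ACCORD EX 100K"]))
```

Feature vectors like these, computed for a source attribute and a target attribute, are the kind of input a decision tree could use to judge whether two columns hold similar data.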

Page 44

Confidence Measures

Page 45

Expected Data Values

• Target Schema T and Source Schema S
  – Regular expression recognizer for attribute A in T
  – Data instances for attribute B in S

• Hit Ratio = N’/N for (A, B) match
  – N’: number of B data instances recognized by the regular expressions of A
  – N: number of B data instances
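The hit ratio is simple to compute once the recognizers for A are available. In this sketch the year recognizer and the sample instances are hypothetical, and "recognized" is taken to mean that a regular expression matches the whole instance, which is an assumption.

```python
import re


def hit_ratio(recognizers, instances):
    """N'/N: fraction of source instances matched by any target recognizer."""
    if not instances:
        return 0.0
    hits = sum(
        1 for value in instances
        if any(r.fullmatch(value.strip()) for r in recognizers)
    )
    return hits / len(instances)


# Hypothetical recognizer for a target attribute Year.
year_recognizers = [re.compile(r"(19|20)\d{2}")]

print(hit_ratio(year_recognizers, ["1989", "1998", "1994", "n/a"]))    # 0.75
print(hit_ratio(year_recognizers, ["$1900", "100K", "Auto", "White"])) # 0.0
```

A source attribute whose instances mostly satisfy the Year recognizer gets a high ratio for the (Year, B) pair, while an unrelated attribute scores near zero.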

Page 46

Confidence Measures

Page 47

Combined Measures

Threshold: 0.5

[Figure: matrix of combined confidence measures with entries 0 or 1; cell layout lost in extraction.]
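One way to picture combining facets and then iterating best-first is the sketch below. The slides do not state how facets are combined, so the plain average used here is an assumption, as are the greedy one-to-one selection, the strict comparison against the 0.5 threshold, and all confidence values in the example.

```python
def combine(facets):
    """Average the per-facet confidences for each (source, target) pair."""
    pairs = facets[0].keys()
    return {p: sum(f[p] for f in facets) / len(facets) for p in pairs}


def best_first_matches(confidence, threshold=0.5):
    """Accept pairs in descending confidence, using each attribute once."""
    matches, used_s, used_t = [], set(), set()
    ranked = sorted(confidence.items(), key=lambda kv: kv[1], reverse=True)
    for (s, t), c in ranked:
        if c <= threshold:
            break  # everything after this is at or below the threshold
        if s not in used_s and t not in used_t:
            matches.append((s, t))
            used_s.add(s)
            used_t.add(t)
    return matches


# Hypothetical confidences, keyed by (source attribute, target attribute).
names = {("Miles", "Mileage"): 0.9, ("Miles", "Year"): 0.1,
         ("Year", "Year"): 1.0, ("Year", "Mileage"): 0.1}
values = {("Miles", "Mileage"): 0.8, ("Miles", "Year"): 0.3,
          ("Year", "Year"): 0.9, ("Year", "Mileage"): 0.2}
expected = {("Miles", "Mileage"): 1.0, ("Miles", "Year"): 0.0,
            ("Year", "Year"): 1.0, ("Year", "Mileage"): 0.0}

combined = combine([names, values, expected])
print(best_first_matches(combined))
```

Taking the strongest pair first and removing both attributes from further consideration keeps a weaker candidate from claiming an attribute that has a clearly better partner.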

Page 48

Final Confidence Measures

[Figure: matrix of final confidence measures; cell layout lost in extraction.]

Page 49

Experimental Results

• This schema, plus 6 other schemas
  – 32 matched attributes
  – 376 unmatched attributes

• Matched: 100%

• Unmatched: 99.5%
  – "Feature" --- "Color"
  – "Feature" --- "Body Type"

Per-facet results (two groups, as on the slide):
  F1 93.8%, F2 84%, F3 92%
  F1 98.9%, F2 97.9%, F3 98.4%

F1: WordNet
F2: Value Characteristics
F3: Expected Values

Page 50

Indirect Schema Matching

Page 51

Schema Matching

[Figure: Source and Target car schemas. Object sets shown include Car, Year, Cost, Style, Feature, Phone, Miles, Mileage, Model, Make, Make & Model, Color, and Body Type.]

Page 52

Mapping Generation

• Direct Matches as described earlier:
  – Attribute Names based on WordNet
  – Value Characteristics based on value lengths, averages, …
  – Expected Values based on regular-expression recognizers

• Indirect Matches:
  – Direct matches
  – Structure Evaluation
    • Union
    • Selection
    • Decomposition
    • Composition

Page 53

Union and Selection

[Figure: Source and Target car schemas. Object sets shown include Car, Year, Cost, Style, Feature, Phone, Miles, Mileage, Model, Make, Make & Model, Color, and Body Type.]

Page 54

Decomposition and Composition

[Figure: Source and Target car schemas. Object sets shown include Car, Year, Cost, Style, Feature, Phone, Miles, Mileage, Model, Make, Make & Model, Color, and Body Type.]

Page 55

Structure

[Figure: two purchase-order schema trees, labeled Target and Source. One is rooted at PO, with POShipTo, POBillTo, and POLines; its leaves include City, Street, Item, Count, Line, Qty, and UoM. The other is rooted at PurchaseOrder, with DeliverTo, InvoiceTo, and Items; its nodes include Address, City, Street, Item, ItemCount, ItemNumber, Quantity, and UnitOfMeasure.]

Example Taken From [MBR, VLDB’01]

Page 56

Structure (Nonlexical Matches)

[Figure: the PO and PurchaseOrder schema trees (Target and Source), annotated with nonlexical matches.]

Page 57

Structure (Join over FD Relationship Sets, …)

[Figure: the PO and PurchaseOrder schema trees (Target and Source) after joining over FD relationship sets; POShipTo appears alongside DeliverTo.]

Page 58

Structure (Lexical Matches)

[Figure: the PO and PurchaseOrder schema trees (Target and Source), annotated with lexical matches; pairs shown include City with City, Street with Street, and Qty with Quantity.]

Page 59

Experiments

• Methodology

• Measures
  – Precision: $\text{precision} = \dfrac{\text{correct}}{\text{correct} + \text{false pos}}$
  – Recall: $\text{recall} = \dfrac{\text{correct}}{\text{correct} + \text{false neg}}$
  – F Measure: $F = \dfrac{2}{\dfrac{1}{\text{recall}} + \dfrac{1}{\text{precision}}}$

Page 60

Results

Applications (Number of Schemes)   Precision (%)   Recall (%)   F (%)   Correct   False Positive   False Negative
Course Schedule (5)                98              93           96      119       2                9
Faculty Member (5)                 100             100          100     140       0                0
Real Estate (5)                    92              96           94      235       20               10

Data borrowed from Univ. of Washington [DDH, SIGMOD01]

Indirect Matches: 94% (precision, recall, F-measure)

Rough Comparison with U of W Results (Direct Matches only)

* Course Schedule – Accuracy: ~71%

* Faculty Members – Accuracy: ~92%

* Real Estate (2 tests) – Accuracy: ~75%
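The percentages in the results table can be recomputed from the raw counts. The sketch below uses the usual definitions, precision = correct / (correct + false positives), recall = correct / (correct + false negatives), and F as the harmonic mean of the two; the function name `measures` is hypothetical, and the function assumes `correct` is greater than zero.

```python
def measures(correct, false_pos, false_neg):
    """Return (precision, recall, F) as fractions between 0 and 1."""
    precision = correct / (correct + false_pos)
    recall = correct / (correct + false_neg)
    f = 2 / (1 / precision + 1 / recall)
    return precision, recall, f


# (correct, false positive, false negative) counts from the results table.
counts = {
    "Course Schedule": (119, 2, 9),
    "Faculty Member": (140, 0, 0),
    "Real Estate": (235, 20, 10),
}

for name, c in counts.items():
    p, r, f = measures(*c)
    print(name, round(100 * p), round(100 * r), round(100 * f))

# Totals over the three applications.
totals = tuple(sum(col) for col in zip(*counts.values()))
print(totals)
```

The rounded values reproduce the table rows, and the column totals come to 494 correct, 22 false positives, and 19 false negatives.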

Page 61

Conclusions and Future Work

Page 62

Conclusions

• Table Mappings
  – Tables: 94.7% (Recall); 100% (Precision)
  – Linked Text: ~97% (Recall); ~86% (Precision)

• Direct Attribute Matching
  – Matched 32 of 32: 100% Recall
  – 2 False Positives: 94% Precision

• Direct and Indirect Attribute Matching
  – Matched 494 of 513: 96% Recall
  – 22 False Positives: 96% Precision

www.deg.byu.edu

Page 63

Current & Future Work: Improve and Extend Indirect Matching

• Improve Object-Set Matching (e.g. Lex/non-Lex)
• Add Relationship-Set Matching
• Computations

Page 64

Current & Future Work: Tables Behind Forms

• Crawling the Hidden Web

• Filling in Forms from Global Queries

Page 65

Current & Future Work: Developing Extraction Ontologies

• Creation from Knowledge Sources and Sample Application Pages
  – K Ontology + Data Frames, Lexicons, …
  – RDF Ontologies

• User Creation by Example

Page 66

Current & Future Work: and Much More …

• Table Understanding
• Microfilm Census Records
• Generate Ontologies by Reading Tables
• …

www.deg.byu.edu