34
ER 2002 YU Data Extraction Group Automatically Extracting Ontologically Specified Data from HTML Tables with Unknown Structure David W. Embley, Cui Tao, Stephen W. Liddle Brigham Young University Funded by NSF

ER 2002BYU Data Extraction Group Automatically Extracting Ontologically Specified Data from HTML Tables with Unknown Structure David W. Embley, Cui Tao,

  • View
    220

  • Download
    0

Embed Size (px)

Citation preview

Page 1: ER 2002BYU Data Extraction Group Automatically Extracting Ontologically Specified Data from HTML Tables with Unknown Structure David W. Embley, Cui Tao,

ER 2002BYU Data Extraction Group

Automatically Extracting Ontologically Specified Data

from HTML Tableswith Unknown Structure

David W. Embley, Cui Tao, Stephen W. Liddle

Brigham Young University

Funded by NSF

Page 2: ER 2002BYU Data Extraction Group Automatically Extracting Ontologically Specified Data from HTML Tables with Unknown Structure David W. Embley, Cui Tao,

ER 2002BYU Data Extraction Group

Information ExchangeSource Target

InformationExtraction

SchemaMatching

Leveragethis …

… to dothis

Page 3: ER 2002BYU Data Extraction Group Automatically Extracting Ontologically Specified Data from HTML Tables with Unknown Structure David W. Embley, Cui Tao,

ER 2002BYU Data Extraction Group

Information Extraction

Page 4: ER 2002BYU Data Extraction Group Automatically Extracting Ontologically Specified Data from HTML Tables with Unknown Structure David W. Embley, Cui Tao,

ER 2002BYU Data Extraction Group

Extracting Pertinent Information from Documents

Page 5: ER 2002BYU Data Extraction Group Automatically Extracting Ontologically Specified Data from HTML Tables with Unknown Structure David W. Embley, Cui Tao,

ER 2002BYU Data Extraction Group

A Conceptual-Modeling SolutionYear Price

Make Mileage

Model

Feature

PhoneNr

Extension

Car

hashas

has

has is for

has

has

has

1..*

0..1

1..*

1..* 1..*

1..*

1..*

1..*

0..1 0..10..1

0..1

0..1

0..1

0..*

1..*

Page 6: ER 2002BYU Data Extraction Group Automatically Extracting Ontologically Specified Data from HTML Tables with Unknown Structure David W. Embley, Cui Tao,

ER 2002BYU Data Extraction Group

Car-Ads OntologyCar [->object];Car [0..1] has Year [1..*];Car [0..1] has Make [1..*];Car [0...1] has Model [1..*];Car [0..1] has Mileage [1..*];Car [0..*] has Feature [1..*];Car [0..1] has Price [1..*];PhoneNr [1..*] is for Car [0..*];PhoneNr [0..1] has Extension [1..*];Year matches [4]

constant {extract “\d{2}”; context "([^\$\d]|^)[4-9]\d[^\d]"; substitute "^" -> "19"; }, … …End;

Page 7: ER 2002BYU Data Extraction Group Automatically Extracting Ontologically Specified Data from HTML Tables with Unknown Structure David W. Embley, Cui Tao,

ER 2002BYU Data Extraction Group

Recognition and Extraction

Car Year Make Model Mileage Price PhoneNr0001 1989 Subaru SW $1900 (336)835-85970002 1998 Elantra (336)526-54440003 1994 HONDA ACCORD EX 100K (336)526-1081

Car Feature0001 Auto0001 AC0002 Black0002 4 door0002 tinted windows0002 Auto0002 pb0002 ps0002 cruise0002 am/fm0002 cassette stereo0002 a/c0003 Auto0003 jade green0003 gold

Page 8: ER 2002BYU Data Extraction Group Automatically Extracting Ontologically Specified Data from HTML Tables with Unknown Structure David W. Embley, Cui Tao,

ER 2002BYU Data Extraction Group

Schema Matching for HTML Tables with Unknown Structure

Page 9: ER 2002BYU Data Extraction Group Automatically Extracting Ontologically Specified Data from HTML Tables with Unknown Structure David W. Embley, Cui Tao,

ER 2002BYU Data Extraction Group

Table-Schema Matching(Basic Idea)

• Many Tables on the Web• Ontology-Based Extraction

– Works well for unstructured or semistructured data– What about structured data – tables?

• Method– Form attribute-value pairs– Do extraction– Infer mappings from extraction patterns

Page 10: ER 2002BYU Data Extraction Group Automatically Extracting Ontologically Specified Data from HTML Tables with Unknown Structure David W. Embley, Cui Tao,

ER 2002BYU Data Extraction Group

Problem: Different Schemas

Target Database Schema{Car, Year, Make, Model, Mileage, Price, PhoneNr}, {PhoneNr, Extension}, {Car, Feature}

Different Source Table Schemas– {Run #, Yr, Make, Model, Tran, Color, Dr}– {Make, Model, Year, Colour, Price, Auto, Air Cond.,

AM/FM, CD}– {Vehicle, Distance, Price, Mileage}– {Year, Make, Model, Trim, Invoice/Retail, Engine,

Fuel Economy}

Page 11: ER 2002BYU Data Extraction Group Automatically Extracting Ontologically Specified Data from HTML Tables with Unknown Structure David W. Embley, Cui Tao,

ER 2002BYU Data Extraction Group

Problem: Attribute is Value

Page 12: ER 2002BYU Data Extraction Group Automatically Extracting Ontologically Specified Data from HTML Tables with Unknown Structure David W. Embley, Cui Tao,

ER 2002BYU Data Extraction Group

Problem: Attribute-Value is Value

? ?

Page 13: ER 2002BYU Data Extraction Group Automatically Extracting Ontologically Specified Data from HTML Tables with Unknown Structure David W. Embley, Cui Tao,

ER 2002BYU Data Extraction Group

Problem: Value is not Value

Page 14: ER 2002BYU Data Extraction Group Automatically Extracting Ontologically Specified Data from HTML Tables with Unknown Structure David W. Embley, Cui Tao,

ER 2002BYU Data Extraction Group

Problem: Implied Values

``````

Page 15: ER 2002BYU Data Extraction Group Automatically Extracting Ontologically Specified Data from HTML Tables with Unknown Structure David W. Embley, Cui Tao,

ER 2002BYU Data Extraction Group

Problem: Missing Attributes

Page 16: ER 2002BYU Data Extraction Group Automatically Extracting Ontologically Specified Data from HTML Tables with Unknown Structure David W. Embley, Cui Tao,

ER 2002BYU Data Extraction Group

Problem: Compound Attributes

Page 17: ER 2002BYU Data Extraction Group Automatically Extracting Ontologically Specified Data from HTML Tables with Unknown Structure David W. Embley, Cui Tao,

ER 2002BYU Data Extraction Group

Problem: Factored Values

Page 18: ER 2002BYU Data Extraction Group Automatically Extracting Ontologically Specified Data from HTML Tables with Unknown Structure David W. Embley, Cui Tao,

ER 2002BYU Data Extraction Group

Problem: Split Values

Page 19: ER 2002BYU Data Extraction Group Automatically Extracting Ontologically Specified Data from HTML Tables with Unknown Structure David W. Embley, Cui Tao,

ER 2002BYU Data Extraction Group

Problem: Merged Values

Page 20: ER 2002BYU Data Extraction Group Automatically Extracting Ontologically Specified Data from HTML Tables with Unknown Structure David W. Embley, Cui Tao,

ER 2002BYU Data Extraction Group

Problem: Values not of Interest

Page 21: ER 2002BYU Data Extraction Group Automatically Extracting Ontologically Specified Data from HTML Tables with Unknown Structure David W. Embley, Cui Tao,

ER 2002BYU Data Extraction Group

Problem: Information Behind Links

Single-ColumnTable (formattedas list)

Tableextendingover severalpages

Page 22: ER 2002BYU Data Extraction Group Automatically Extracting Ontologically Specified Data from HTML Tables with Unknown Structure David W. Embley, Cui Tao,

ER 2002BYU Data Extraction Group

Solution

• Form attribute-value pairs (adjust if necessary)

• Do extraction

• Infer mappings from extraction patterns

Page 23: ER 2002BYU Data Extraction Group Automatically Extracting Ontologically Specified Data from HTML Tables with Unknown Structure David W. Embley, Cui Tao,

ER 2002BYU Data Extraction Group

Solution: Remove Internal Factoring

Discover Nesting: Make, (Model, (Year, Colour, Price, Auto, Air Cond, AM/FM, CD)*)*

Unnest: μ(Model, Year, Colour, Price, Auto, Air Cond, AM/FM, CD)* μ (Year, Colour, Price, Auto, Air Cond, AM/FM, CD)*Table

Legend

ACURA

ACURA

Page 24: ER 2002BYU Data Extraction Group Automatically Extracting Ontologically Specified Data from HTML Tables with Unknown Structure David W. Embley, Cui Tao,

ER 2002BYU Data Extraction Group

Solution: Replace Boolean Values

Legend

ACURA

ACURA

β CD Table

Yes,

CD

CD

Yes,Yes,βAutoβAir CondβAM/FMYes,

AM/FM

AM/FM

AM/FM

AM/FM

AM/FM

AM/FM

Air Cond.

Air Cond.

Air Cond.

Air Cond.

Auto

Auto

Auto

Auto

Page 25: ER 2002BYU Data Extraction Group Automatically Extracting Ontologically Specified Data from HTML Tables with Unknown Structure David W. Embley, Cui Tao,

ER 2002BYU Data Extraction Group

Solution: Form Attribute-Value Pairs

Legend

ACURA

ACURA

CD

CD

AM/FM

AM/FM

AM/FM

AM/FM

AM/FM

AM/FM

Air Cond.

Air Cond.

Air Cond.

Air Cond.

Auto

Auto

Auto

Auto

<Make, Honda>, <Model, Civic EX>, <Year, 1995>, <Colour, White>, <Price, $6300>, <Auto, Auto>, <Air Cond., Air Cond.>, <AM/FM, AM/FM>, <CD, >

Page 26: ER 2002BYU Data Extraction Group Automatically Extracting Ontologically Specified Data from HTML Tables with Unknown Structure David W. Embley, Cui Tao,

ER 2002BYU Data Extraction Group

Solution: Adjust Attribute-Value Pairs

Legend

ACURA

ACURA

CD

CD

AM/FM

AM/FM

AM/FM

AM/FM

AM/FM

AM/FM

Air Cond.

Air Cond.

Air Cond.

Air Cond.

Auto

Auto

Auto

Auto

<Make, Honda>, <Model, Civic EX>, <Year, 1995>, <Colour, White>, <Price, $6300>, <Auto>, <Air Cond>, <AM/FM>

Page 27: ER 2002BYU Data Extraction Group Automatically Extracting Ontologically Specified Data from HTML Tables with Unknown Structure David W. Embley, Cui Tao,

ER 2002BYU Data Extraction Group

Solution: Do Extraction

Legend

ACURA

ACURA

CD

CD

AM/FM

AM/FM

AM/FM

AM/FM

AM/FM

AM/FM

Air Cond.

Air Cond.

Air Cond.

Air Cond.

Auto

Auto

Auto

Auto

Page 28: ER 2002BYU Data Extraction Group Automatically Extracting Ontologically Specified Data from HTML Tables with Unknown Structure David W. Embley, Cui Tao,

ER 2002BYU Data Extraction Group

Solution: Infer Mappings

Legend

ACURA

ACURA

CD

CD

AM/FM

AM/FM

AM/FM

AM/FM

AM/FM

AM/FM

Air Cond.

Air Cond.

Air Cond.

Air Cond.

Auto

Auto

Auto

Auto

{Car, Year, Make, Model, Mileage, Price, PhoneNr}, {PhoneNr, Extension}, {Car, Feature}

Each row is a car. πModelμ(Year, Colour, Price, Auto, Air Cond, AM/FM, CD)*TableπMakeμ(Model, Year, Colour, Price, Auto, Air Cond, AM/FM, CD)*μ(Year, Colour, Price, Auto, Air Cond, AM/FM, CD)*TableπYearTable

Note: Mappings produce sets for attributes. Joining to form recordsis trivial because we have OIDs for table rows (e.g. for each Car).

Page 29: ER 2002BYU Data Extraction Group Automatically Extracting Ontologically Specified Data from HTML Tables with Unknown Structure David W. Embley, Cui Tao,

ER 2002BYU Data Extraction Group

Solution: Do Extraction

Legend

ACURA

ACURA

CD

CD

AM/FM

AM/FM

AM/FM

AM/FM

AM/FM

AM/FM

Air Cond.

Air Cond.

Air Cond.

Air Cond.

Auto

Auto

Auto

Auto

{Car, Year, Make, Model, Mileage, Price, PhoneNr}, {PhoneNr, Extension}, {Car, Feature}

πModelμ(Year, Colour, Price, Auto, Air Cond, AM/FM, CD)*Table

Page 30: ER 2002BYU Data Extraction Group Automatically Extracting Ontologically Specified Data from HTML Tables with Unknown Structure David W. Embley, Cui Tao,

ER 2002BYU Data Extraction Group

Solution: Do Extraction

Legend

ACURA

ACURA

CD

CD

AM/FM

AM/FM

AM/FM

AM/FM

AM/FM

AM/FM

Air Cond.

Air Cond.

Air Cond.

Air Cond.

Auto

Auto

Auto

Auto

{Car, Year, Make, Model, Mileage, Price, PhoneNr}, {PhoneNr, Extension}, {Car, Feature}

πPriceTable

Page 31: ER 2002BYU Data Extraction Group Automatically Extracting Ontologically Specified Data from HTML Tables with Unknown Structure David W. Embley, Cui Tao,

ER 2002BYU Data Extraction Group

Solution: Do Extraction

Legend

ACURA

ACURA

CD

CD

AM/FM

AM/FM

AM/FM

AM/FM

AM/FM

AM/FM

Air Cond.

Air Cond.

Air Cond.

Air Cond.

Auto

Auto

Auto

Auto

{Car, Year, Make, Model, Mileage, Price, PhoneNr}, {PhoneNr, Extension}, {Car, Feature}

Yes,ρ Colour←Feature π ColourTable U ρ Auto←Feature π Auto β AutoTable U ρ Air Cond.←Feature π Air Cond.

β Air Cond.Table U ρ AM/FM←Feature π AM/FM β AM/FMTable U ρ CD←Featureπ CDβ CDTableYes, Yes, Yes,

Page 32: ER 2002BYU Data Extraction Group Automatically Extracting Ontologically Specified Data from HTML Tables with Unknown Structure David W. Embley, Cui Tao,

ER 2002BYU Data Extraction Group

Experiment

• Tables from 60 sites• 10 “training” tables• 50 test tables• 357 mappings (from all 60 sites)

– 172 direct mappings (same attribute and meaning)– 185 indirect mappings (29 attribute synonyms, 5 “Yes/No”

columns, 68 unions over columns for Feature, 19 factored values, and 89 columns of merged values that needed to be split)

Page 33: ER 2002BYU Data Extraction Group Automatically Extracting Ontologically Specified Data from HTML Tables with Unknown Structure David W. Embley, Cui Tao,

ER 2002BYU Data Extraction Group

Results• 10 “training” tables

– 100% of the 57 mappings (no false mappings)– 94.6% of the values in linked pages (5.4% false declarations)

• 50 test tables– 94.7% of the 300 mappings (no false mappings)– On the bases of sampling 3,000 values in linked pages, we obtained 97%

recall and 86% precision

• 16 missed mappings– 4 partial (not all unions included)– 6 non-U.S. car-ads (unrecognized makes and models)– 2 U.S. unrecognized makes and models– 3 prices (missing $ or found MSRP instead)– 1 mileage (mileages less than 1,000)

Page 34: ER 2002BYU Data Extraction Group Automatically Extracting Ontologically Specified Data from HTML Tables with Unknown Structure David W. Embley, Cui Tao,

ER 2002BYU Data Extraction Group

Conclusions• Summary

– Transformed schema-matching problem to extraction– Inferred semantic mappings– Discovered source-to-target mapping rules

• Evidence of Success– Tables (mappings): 95% (Recall); 100% (Precision)– Linked Text (value extraction): ~97% (Recall); ~86% (Precision)

• Future Work– Discover and exploit structure in linked text– Broaden table understanding– Integrate with current extraction tools

www.deg.byu.edu