Scheme Matching and Data Extraction over HTML Tables Cui Tao June, 2002 supported by NSF


Page 1: Scheme Matching and Data Extraction over HTML Tables

Scheme Matching and Data Extraction over HTML Tables

Cui Tao
June, 2002

supported by NSF

Page 2: Scheme Matching and Data Extraction over HTML Tables

Introduction

• Many tables on the Web
• Ontology-based extraction:
  • Works for unstructured or semi-structured data
  • Does not work well for structured data (tables)
• Only tables that present information are considered, not tables used for layout

Page 3: Scheme Matching and Data Extraction over HTML Tables

Problems

Different schemas

Different source table schemas:
• {Run #, Yr, Make, Model, Tran, Color, Dr}
• {Make, Model, Year, Colour, Price, Auto, Air Cond., AM/FM, CD}
• {Vehicle, Distance, Price, Mileage}
• {Year, Make, Model, Trim, Invoice/Retail, Engine, Fuel Economy}

Target database schema:
• {Car, Year, Make, Model, Mileage, Price, PhoneNr}, {Car, Feature}

Page 4: Scheme Matching and Data Extraction over HTML Tables

Problems: Attribute-value pairs

?

Page 5: Scheme Matching and Data Extraction over HTML Tables

Problems: Attribute-value switch

Page 6: Scheme Matching and Data Extraction over HTML Tables

Problems: Attribute/value combinations

Year/sty Cyl. # Dr Tran Color

Page 7: Scheme Matching and Data Extraction over HTML Tables

Problems: Attribute/value split

Model

Page 8: Scheme Matching and Data Extraction over HTML Tables

Problems: Information in linked pages

• Tables
• Lists
• Unstructured data
• …

Header information

Page 9: Scheme Matching and Data Extraction over HTML Tables

Thesis Statement

Extraction Ontology

HTML table with unknown structure

Mapping Rules

Extracted Data

Page 10: Scheme Matching and Data Extraction over HTML Tables

Methods

• Understand Table
• Recognize Attributes and Values
• Form Attribute-Value Pairs
• Adjust Attribute-Value Pairs
• Add Information Hidden Behind Links
• Infer Mapping
• Extract Data

Understand Table: recognize the table and its elements

• <TABLE>, </TABLE>: start and end of a table
• <TR>: row
• <TD>: data entry
• <TH>: header
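The tag roles listed on this slide can be illustrated with a small reader built on Python's standard `html.parser` module. This is only a sketch of recognizing rows, data entries, and headers, not the implementation used in the thesis. The names `TableReader` and `read_table` are hypothetical, and nested tables are not handled.

```python
from html.parser import HTMLParser


class TableReader(HTMLParser):
    """Collects table rows as lists of (kind, text) cells, kind being "th" or "td"."""

    def __init__(self):
        super().__init__()
        self.rows = []          # finished rows
        self._row = None        # row currently being read, or None
        self._cell_kind = None  # "td" or "th" while inside a cell
        self._text = []         # text fragments of the current cell

    def _close_cell(self):
        # Store the open cell, if any, in the current row.
        if self._cell_kind is not None and self._row is not None:
            self._row.append((self._cell_kind, "".join(self._text).strip()))
        self._cell_kind = None

    def _close_row(self):
        # Store the open row, if any; also covers omitted </TD> or </TR> tags.
        self._close_cell()
        if self._row is not None:
            self.rows.append(self._row)
        self._row = None

    def handle_starttag(self, tag, attrs):
        # HTMLParser reports tag names in lower case.
        if tag == "tr":
            self._close_row()
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._close_cell()
            self._cell_kind = tag
            self._text = []

    def handle_data(self, data):
        if self._cell_kind is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th"):
            self._close_cell()
        elif tag in ("tr", "table"):
            self._close_row()


def read_table(html):
    """Return every row found in the given HTML as a list of (kind, text) cells."""
    reader = TableReader()
    reader.feed(html)
    reader.close()
    return reader.rows
```

For example, `read_table("<TABLE><TR><TH>Make</TH></TR><TR><TD>Honda</TD></TR></TABLE>")` yields one header row followed by one data row.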

Page 11: Scheme Matching and Data Extraction over HTML Tables

Methods: Form Attribute-Value Pairs

• Regular table

Nrcom = most common number of columns in the table

• Table with factors
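Nrcom, as defined on this slide, is the most common number of columns across the table's rows. A minimal sketch of that computation follows. The function name is hypothetical, the sample rows are made up for illustration, and the slide does not show how Nrcom is used afterwards.

```python
from collections import Counter


def most_common_column_count(rows):
    """Return Nrcom: the column count shared by the largest number of rows."""
    counts = Counter(len(row) for row in rows if row)  # ignore empty rows
    if not counts:
        return 0
    # most_common(1) returns [(column_count, number_of_rows)]
    return counts.most_common(1)[0][0]


# Made-up sample: a one-cell caption row followed by three-column rows.
sample_rows = [
    ["Used Cars"],
    ["Make", "Model", "Year"],
    ["Honda", "Civic EX", "1995"],
    ["Toyota", "Camry", "1998"],
]
nrcom = most_common_column_count(sample_rows)
```

Here `nrcom` is 3, because three of the four rows have three columns.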

Page 12: Scheme Matching and Data Extraction over HTML Tables

Methods: Form Attribute-Value Pairs

• Regular table
• Table with factors
• Table has Boolean values

Replace Boolean values:

Page 13: Scheme Matching and Data Extraction over HTML Tables

Methods: Form Attribute-Value Pairs

• Regular table
• Table with factors
• Table has Boolean values

Form attribute-value pairs:

<Make, Honda>, <Model, Civic EX>, <Year, 1995>, <Colour, White>, <Price, $6300>, <Auto, Auto>, <Air Cond., Air Cond.>, <AM/FM, AM/FM>
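The pairs on this slide can be reproduced by pairing each header with the cell in the same column and replacing a true Boolean value with the attribute name. The sketch below rests on two assumptions that the slide does not state: which cell contents count as Boolean (`TRUE_MARKS`, `FALSE_MARKS`), and that a false value produces no pair, which is consistent with `CD` being absent from the pair list. The function name `form_pairs` is hypothetical.

```python
# Assumed spellings of Boolean cells; the original table image is not available.
TRUE_MARKS = {"yes", "y", "x", "true"}
FALSE_MARKS = {"no", "n", "", "false"}


def form_pairs(header, row):
    """Pair each attribute with its value, replacing Boolean values."""
    pairs = []
    for attribute, value in zip(header, row):
        mark = value.strip().lower()
        if mark in TRUE_MARKS:
            # A true Boolean value is replaced by the attribute name itself.
            pairs.append((attribute, attribute))
        elif mark in FALSE_MARKS:
            # Assumption: a false value yields no pair at all.
            continue
        else:
            pairs.append((attribute, value))
    return pairs


header = ["Make", "Model", "Year", "Colour", "Price", "Auto", "Air Cond.", "AM/FM", "CD"]
row = ["Honda", "Civic EX", "1995", "White", "$6300", "Yes", "Yes", "Yes", "No"]
pairs = form_pairs(header, row)
```

With this input, `pairs` matches the eight pairs listed on the slide.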

Page 14: Scheme Matching and Data Extraction over HTML Tables

Methods: Adjust Attribute-Value Pairs

Before:

<Make, Honda>, <Model, Civic EX>, <Year, 1995>, <Colour, White>, <Price, $6300>, <Auto, Auto>, <Air Cond., Air Cond.>, <AM/FM, AM/FM>

After:

<Make, Honda>, <Model, Civic EX>, <Year, 1995>, <Colour, White>, <Price, $6300>, <Auto>, <Air Cond>, <AM/FM>
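The adjustment shown on this slide collapses a pair whose value merely repeats its attribute, so that `<Auto, Auto>` becomes `<Auto>`. A minimal sketch follows, assuming pairs are represented as Python tuples. The function name `adjust_pairs` is hypothetical, and any other adjustments the method may perform are not covered.

```python
def adjust_pairs(pairs):
    """Collapse <A, A> into <A>; leave every other pair unchanged."""
    adjusted = []
    for attribute, value in pairs:
        if attribute == value:
            adjusted.append((attribute,))         # single-element tuple: <A>
        else:
            adjusted.append((attribute, value))   # ordinary pair: <A, V>
    return adjusted


before = [("Price", "$6300"), ("Auto", "Auto"), ("AM/FM", "AM/FM")]
after = adjust_pairs(before)
```

Here `after` keeps `("Price", "$6300")` and reduces the other two pairs to `("Auto",)` and `("AM/FM",)`.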

Page 15: Scheme Matching and Data Extraction over HTML Tables

Methods: Add Information Hidden Behind Links

• Unstructured and semi-structured data: concatenate
• Table: attribute-value pairs

<Manufacturer, Honda>, <Model, Civic EX>, <Door, 4>, <Year, 1995>, <Color, White>, <Engine, 2.0L 4 Cylinders>, <Transmission, Auto>, <Mileage, 82,628>, <Price, $6300>

Page 16: Scheme Matching and Data Extraction over HTML Tables

Methods: Add Information Hidden Behind Links

• Unstructured and semi-structured data: concatenate
• Table: attribute-value pairs

Page 17: Scheme Matching and Data Extraction over HTML Tables

Methods: Add Information Hidden Behind Links

• Unstructured and semi-structured data: concatenate
• Table: attribute-value pairs
• List:

<Features, AIR CONDITIONING, CD, AM/FM, CLOTH UPHOLSTERY, CONSOLE, CRUISE CONTROL, DUAL AIR BAGS, INSIDE HOOD RELEASE, POWER DOOR LOCKS, POWER STEERING, POWER SUNROOF, POWER WINDOWS, RADIAL TIRES, REAR DEFROSTER, REAR SPOILER, RECLINING SEATS>
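The three cases for linked pages can be sketched as one dispatch on the kind of linked content. This is an illustration under stated assumptions, not the thesis implementation. The slides say only "concatenate" for unstructured text, so attaching the text to the record under a `LinkedText` attribute is a guess. The function name, the `linked_kind` labels, and `LinkedText` are all hypothetical.

```python
def add_linked_information(pairs, linked_kind, linked_content):
    """Return the record's pairs extended with information from a linked page.

    linked_kind is one of "text", "table", or "list" (hypothetical labels):
      "text"  -> linked_content is a string
      "table" -> linked_content is a list of (attribute, value) pairs
      "list"  -> linked_content is (list_name, [items])
    """
    result = list(pairs)  # do not modify the caller's list
    if linked_kind == "text":
        # Assumption: the text is whitespace-normalized and attached as one value.
        result.append(("LinkedText", " ".join(linked_content.split())))
    elif linked_kind == "table":
        # A linked table contributes its own attribute-value pairs.
        result.extend(linked_content)
    elif linked_kind == "list":
        # A linked list becomes one attribute followed by all of its items.
        name, items = linked_content
        result.append((name, *items))
    else:
        raise ValueError(f"unknown kind of linked content: {linked_kind!r}")
    return result


record = [("Make", "Honda"), ("Price", "$6300")]
with_table = add_linked_information(record, "table", [("Door", "4"), ("Mileage", "82,628")])
with_list = add_linked_information(record, "list", ("Features", ["CD", "AM/FM", "POWER WINDOWS"]))
```

The list case produces a tuple shaped like the `<Features, ...>` entry on this slide: the attribute name followed by every item.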

Page 18: Scheme Matching and Data Extraction over HTML Tables

Method: Infer Mapping

Target database schema:

{Car, Year, Make, Model, Mileage, Price, PhoneNr}, {Car, Feature}

Each row is a car.
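Once a mapping is available, each table row yields one tuple in {Car, Year, Make, Model, Mileage, Price, PhoneNr} and zero or more tuples in {Car, Feature}. The sketch below only applies a mapping. It does not show how the mapping is inferred, which the slides leave to the lost figures. The `mapping` dictionary is hand-written for illustration, and the names `apply_mapping` and `CAR_ATTRIBUTES` are hypothetical.

```python
CAR_ATTRIBUTES = ("Year", "Make", "Model", "Mileage", "Price", "PhoneNr")


def apply_mapping(car_id, pairs, mapping):
    """Turn one row's pairs into a Car tuple and a list of (Car, Feature) tuples."""
    car = {"Car": car_id}
    car.update({name: None for name in CAR_ATTRIBUTES})  # unknown until filled
    features = []
    for pair in pairs:
        attribute = pair[0]
        # A collapsed pair such as <Auto> carries its value in the attribute.
        value = pair[1] if len(pair) > 1 else pair[0]
        target = mapping.get(attribute)
        if target == "Feature":
            features.append((car_id, value))
        elif target in CAR_ATTRIBUTES:
            car[target] = value
        # Attributes without a mapping are skipped.
    return car, features


# Hand-written mapping for illustration only; not produced by the method.
mapping = {"Make": "Make", "Model": "Model", "Year": "Year", "Price": "Price",
           "Auto": "Feature", "AM/FM": "Feature"}
pairs = [("Make", "Honda"), ("Model", "Civic EX"), ("Year", "1995"),
         ("Colour", "White"), ("Price", "$6300"), ("Auto",), ("AM/FM",)]
car, features = apply_mapping(1, pairs, mapping)
```

In this run `Mileage` and `PhoneNr` stay empty because the source row has no matching attribute, and `Colour` is skipped because the illustrative mapping does not cover it.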

Page 19: Scheme Matching and Data Extraction over HTML Tables


Method

{Car, Year, Make, Model, Mileage, Price, PhoneNr}, {Car, Feature}

Page 20: Scheme Matching and Data Extraction over HTML Tables


Method

{Car, Year, Make, Model, Mileage, Price, PhoneNr}, {Car, Feature}

Page 21: Scheme Matching and Data Extraction over HTML Tables


Method

{Car, Year, Make, Model, Mileage, Price, PhoneNr}, {Car, Feature}


Page 22: Scheme Matching and Data Extraction over HTML Tables

Evaluation

• Measure the percentage of correct mappings:
  • Correct mapping
  • Partially correct mapping
  • Incorrect mapping
• Measure precision and recall:
  • Data in the table
  • Data in linked pages
• Compare the results for extracted data before mapping and after mapping
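Precision and recall have standard definitions: precision is the share of extracted items that are correct, and recall is the share of expected items that were extracted. A small sketch follows, assuming extracted and expected data are compared as exact attribute-value pairs. The slides do not specify the unit of comparison, and the sample values are made up.

```python
def precision_recall(extracted, expected):
    """Return (precision, recall) for two collections of extracted items."""
    extracted, expected = set(extracted), set(expected)
    correct = len(extracted & expected)  # items both extracted and expected
    precision = correct / len(extracted) if extracted else 0.0
    recall = correct / len(expected) if expected else 0.0
    return precision, recall


# Made-up sample: one wrong price, one missing model.
extracted = [("Make", "Honda"), ("Year", "1995"), ("Price", "$630")]
expected = [("Make", "Honda"), ("Year", "1995"), ("Price", "$6300"), ("Model", "Civic EX")]
precision, recall = precision_recall(extracted, expected)
```

Two of the three extracted pairs are correct, so precision is 2/3. Two of the four expected pairs were found, so recall is 0.5.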

Page 23: Scheme Matching and Data Extraction over HTML Tables

Contribution

• Provides an approach to extract information automatically from HTML tables
• Suggests a different way to solve the problem of schema matching