Schema Matching and Data Extraction over HTML Tables


Cui Tao

June 2002

Supported by NSF

Introduction

• Many tables on the Web
• Ontology-based extraction:
  • Works for unstructured or semi-structured data
  • Does not work well for structured data (tables)
• Scope: only tables that carry information, not tables used for layout

Problems: Different Schemas

Different source table schemas:
• {Run #, Yr, Make, Model, Tran, Color, Dr}
• {Make, Model, Year, Colour, Price, Auto, Air Cond., AM/FM, CD}
• {Vehicle, Distance, Price, Mileage}
• {Year, Make, Model, Trim, Invoice/Retail, Engine, Fuel Economy}

Target database schema:
• {Car, Year, Make, Model, Mileage, Price, PhoneNr}
• {Car, Feature}

Problems: Attribute-Value Pairs

Problems: Attribute-Value Switch

Problems: Attribute/Value Combinations

Year/sty Cyl. # Dr Tran Color

Problems: Attribute/Value Split

Model

Problems: Information in Linked Pages

• Tables
• Lists
• Unstructured data
• …

Header information

Thesis Statement

Extraction Ontology

HTML table with unknown structure

Mapping Rules

Extracted Data

Methods

• Understand Table
• Recognize Attributes and Values
• Form Attribute-Value Pairs
• Adjust Attribute-Value Pairs
• Add Information Hidden Behind Links
• Infer Mapping
• Extract Data

Methods: Understand Table

Recognize the table and its elements:
• <TABLE>, </TABLE>: start and end of a table
• <TR>: Row
• <TD>: Data Entry
• <TH>: Header
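The tag roles listed above can be illustrated with a small parser. This is only a sketch using Python's standard html.parser module, not the tool used in the thesis; the class name TableParser, the helper parse_table, and the sample markup are made up for illustration, and nested tables are not handled.

```python
from html.parser import HTMLParser


class TableParser(HTMLParser):
    """Collects every <TR> as a list of (tag, text) cells (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.rows = []     # finished rows
        self._row = None   # row being built, or None outside <TR>
        self._cell = None  # (tag, text parts) for the open cell, or None

    def _close_cell(self):
        if self._cell is not None:
            kind, parts = self._cell
            self._row.append((kind, "".join(parts).strip()))
            self._cell = None

    def _close_row(self):
        if self._row is not None:
            self._close_cell()
            self.rows.append(self._row)
            self._row = None

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag names, so <TR> arrives as "tr".
        if tag == "tr":
            self._close_row()          # tolerate a missing </TR>
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._close_cell()         # tolerate a missing </TD> or </TH>
            self._cell = (tag, [])

    def handle_data(self, data):
        if self._cell is not None:
            self._cell[1].append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th"):
            self._close_cell()
        elif tag in ("tr", "table"):
            self._close_row()


def parse_table(html):
    parser = TableParser()
    parser.feed(html)
    parser.close()
    return parser.rows


# Illustrative markup only; not taken from the slides.
SAMPLE = """
<TABLE>
  <TR><TH>Make</TH><TH>Model</TH><TH>Year</TH></TR>
  <TR><TD>Honda</TD><TD>Civic EX</TD><TD>1995</TD></TR>
</TABLE>
"""
rows = parse_table(SAMPLE)
```

Each cell keeps its tag, so a later step can tell <TH> header cells from <TD> data cells.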

Methods: Form Attribute-Value Pairs

• Regular table

Nrcom = most common number of columns in the table
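A minimal sketch of how Nrcom might be computed and used for a regular table. The slides do not say what happens to rows whose length differs from Nrcom or how the header row is found, so this sketch assumes such rows are skipped and that the first remaining row is the header. The function names and the second data row are hypothetical.

```python
from collections import Counter


def most_common_column_count(rows):
    """Return Nrcom: the column count shared by the most rows."""
    counts = Counter(len(row) for row in rows if row)
    if not counts:
        raise ValueError("table has no non-empty rows")
    return counts.most_common(1)[0][0]


def form_pairs(rows):
    """Pair each data cell with the header cell in the same column.

    Assumption (not from the slides): rows whose length differs from
    Nrcom are skipped, and the first remaining row is the header.
    """
    nrcom = most_common_column_count(rows)
    regular = [row for row in rows if len(row) == nrcom]
    header, *data = regular
    return [list(zip(header, row)) for row in data]


# Illustrative rows; the one-cell caption row and the Ford row are made up.
table = [
    ["Used cars"],
    ["Make", "Model", "Year"],
    ["Honda", "Civic EX", "1995"],
    ["Ford", "Taurus", "1998"],
]
nrcom = most_common_column_count(table)
pairs = form_pairs(table)
```

Here the caption row has one cell while the other three rows have three, so Nrcom is 3 and the caption row is left out of the pairing.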

Table with factors

Table has Boolean values

Methods: Form Attribute-Value Pairs

• Regular table
• Table with factors

Replace Boolean Values:

Form Attribute-Value pairs

Methods: Form Attribute-Value Pairs

• Regular table
• Table with factors
• Table with Boolean values

<Make, Honda>, <Model, Civic EX>, <Year, 1995>, <Colour, White>, <Price, $6300>, <Auto, Auto>, <Air Cond., Air Cond.>, <AM/FM, AM/FM>
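The Boolean replacement that yields pairs such as <Auto, Auto> can be sketched as follows. The original table is not reproduced in the slides, so the "Yes"/"No" cell values, the token sets, and the explicit boolean_columns argument are all assumptions; how Boolean columns are detected is not covered here.

```python
# Assumed spellings of Boolean cell values; the real tables may use others.
TRUE_TOKENS = {"yes", "y", "x", "true"}
FALSE_TOKENS = {"no", "n", "", "false"}


def pairs_with_booleans_replaced(header, row, boolean_columns):
    """Form attribute-value pairs, replacing true Boolean cells by the attribute name.

    A false Boolean cell yields no pair at all (assumption consistent with
    CD being absent from the pairs shown above).
    """
    pairs = []
    for attribute, value in zip(header, row):
        if attribute in boolean_columns:
            token = value.strip().lower()
            if token in TRUE_TOKENS:
                pairs.append((attribute, attribute))
            elif token not in FALSE_TOKENS:
                pairs.append((attribute, value))  # not Boolean after all; keep it
        else:
            pairs.append((attribute, value))
    return pairs


header = ["Make", "Model", "Year", "Colour", "Price", "Auto", "Air Cond.", "AM/FM", "CD"]
# The Yes/No values below are assumed for illustration.
row = ["Honda", "Civic EX", "1995", "White", "$6300", "Yes", "Yes", "Yes", "No"]
result = pairs_with_booleans_replaced(
    header, row, boolean_columns={"Auto", "Air Cond.", "AM/FM", "CD"}
)
```

With these assumed inputs the result matches the pairs listed above: the three true features become self-named pairs and CD is dropped.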

Methods: Adjust Attribute-Value Pairs

<Make, Honda>, <Model, Civic EX>, <Year, 1995>, <Colour, White>, <Price, $6300>, <Auto>, <Air Cond>, <AM/FM>

<Make, Honda>, <Model, Civic EX>, <Year, 1995>, <Colour, White>, <Price, $6300>, <Auto, Auto>, <Air Cond., Air Cond.>, <AM/FM, AM/FM>

Table: attribute-value pairs

Methods: Add Information Hidden Behind Links

• Unstructured and semi-structured: concatenate

<Manufacturer, Honda>, <Model, Civic EX>, <Door, 4>, <Year, 1995>, <Color, White>, <Engine, 2.0L 4 Cylinders> <Transmission, Auto>, <Mileage, 82,628> <Price, $6300>

Methods: Add Information Hidden Behind Links

• Unstructured and semi-structured: concatenate
• Table: attribute-value pairs

Methods: Add Information Hidden Behind Links

• Unstructured and semi-structured: concatenate
• Table: attribute-value pairs
• List:

<Features, AIR CONDITIONING, CD, AM/FM, CLOTH UPHOLSTERY, CONSOLE, CRUISE CONTROL, DUAL AIR BAGS, INSIDE HOOD RELEASE, POWER DOOR LOCKS, POWER STEERING, POWER SUNROOF, POWER WINDOWS, RADIAL TIRES, REAR DEFROSTER, REAR SPOILER, RECLINING SEATS>

Methods: Inferred Mapping Creation

{Car, Year, Make, Model, Mileage, Price, PhoneNr}, {Car, Feature}

Each row is a car.


Evaluation

• Measure percentage of correct mappings:
  • Correct mapping
  • Partially correct mapping
  • Incorrect mapping
• Measure precision and recall:
  • Data in the table
  • Data in linked pages
• Compare the results for extracted data before mapping and after mapping
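Precision and recall can be computed with the standard definitions, as in the sketch below. Treating each attribute-value pair as one unit of comparison is an assumption, and the sample data is made up; the slides do not state the granularity used in the evaluation.

```python
def precision_recall(extracted, expected):
    """Standard precision and recall over sets of extracted items.

    precision = correct / extracted, recall = correct / expected.
    """
    extracted, expected = set(extracted), set(expected)
    correct = extracted & expected
    precision = len(correct) / len(extracted) if extracted else 0.0
    recall = len(correct) / len(expected) if expected else 0.0
    return precision, recall


# Made-up example: four facts should be found; three are extracted, one wrongly.
expected = {("Make", "Honda"), ("Model", "Civic EX"), ("Year", "1995"), ("Price", "$6300")}
extracted = {("Make", "Honda"), ("Model", "Civic EX"), ("Year", "1996")}
precision, recall = precision_recall(extracted, expected)
```

In this example two of the three extracted pairs are correct, and two of the four expected pairs are found. The same function could be run separately on data from the table and data from linked pages.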

Contribution

• Provides an approach to extract information automatically from HTML tables
• Suggests a different way to solve the problem of schema matching
