Schema Matching and Data Extraction over HTML Tables
Cui Tao
June 2002
Supported by NSF
Introduction
• Many tables on the Web
• Ontology-based extraction:
  - Works for unstructured or semi-structured data
  - Does not work well for structured data (tables)
• Scope: only tables that present information, not tables used for layout
Problems: Different schemas
Different source table schemas:
• {Run #, Yr, Make, Model, Tran, Color, Dr}
• {Make, Model, Year, Colour, Price, Auto, Air Cond., AM/FM, CD}
• {Vehicle, Distance, Price, Mileage}
• {Year, Make, Model, Trim, Invoice/Retail, Engine, Fuel Economy}
Target database schema:
• {Car, Year, Make, Model, Mileage, Price, PhoneNr}
• {Car, Feature}
Problems: Attribute-value pairs
Problems: Attribute-value switch
Problems: Attribute/value combinations
Example header: Year/sty Cyl. # Dr Tran Color
Problems: Attribute/value split
Example attribute: Model
Problems: Information in linked pages
• Tables
• Lists
• Unstructured data
• …
Header information
Thesis Statement
• Extraction ontology
• HTML table with unknown structure
• Mapping rules
• Extracted data
Methods
• Understand Table
• Recognize Attributes and Values
• Form Attribute-Value Pairs
• Adjust Attribute-Value Pairs
• Add Information Hidden Behind Links
• Infer Mapping
• Extract Data
Methods: Understand Table
Recognize the table and its elements:
• <TABLE>, </TABLE>: table boundaries
• <TR>: row
• <TD>: data entry
• <TH>: header
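The tag recognition step can be illustrated with a minimal sketch based on Python's standard `html.parser` module. The class name `TableParser` and the helper `parse_table` are hypothetical names chosen for this example, not part of the presented system. The sketch assumes reasonably well-formed HTML with explicit closing tags and no nested tables.

```python
from html.parser import HTMLParser


class TableParser(HTMLParser):
    """Collect table rows as lists of (tag, text) cells (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.rows = []      # finished rows
        self._row = None    # row being built, or None outside <tr>
        self._cell = None   # (tag, text parts) being built, or None

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag names, so <TR> arrives as "tr".
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = (tag, [])

    def handle_data(self, data):
        if self._cell is not None:
            self._cell[1].append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None and self._row is not None:
            kind, parts = self._cell
            # Collapse runs of whitespace inside the cell text.
            self._row.append((kind, " ".join("".join(parts).split())))
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None


def parse_table(html):
    """Return the rows of the table(s) in `html` as lists of (tag, text)."""
    parser = TableParser()
    parser.feed(html)
    parser.close()
    return parser.rows
```

Cells tagged "th" are header candidates and cells tagged "td" are data entries. Real Web tables often omit closing tags or use <TD> for headers, so a production version would need more tolerant handling.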
Methods: Form Attribute-Value Pairs
• Regular table
• Table with factors
• Table has Boolean values
Nrcom = most common number of columns in the table
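As a small illustration of the Nrcom definition, the sketch below computes the most common number of columns over a list of parsed rows. The function names are hypothetical. Using Nrcom to keep only rows of that width is an assumption made for this example, since the slides do not say how the value is applied.

```python
from collections import Counter


def most_common_column_count(rows):
    """Return Nrcom: the column count shared by the most rows."""
    if not rows:
        raise ValueError("table has no rows")
    counts = Counter(len(row) for row in rows)
    return counts.most_common(1)[0][0]


def regular_rows(rows):
    """Assumed use of Nrcom: keep only rows with the most common width."""
    nrcom = most_common_column_count(rows)
    return [row for row in rows if len(row) == nrcom]
```

For example, a table with three 3-column rows and one 1-column banner row has Nrcom = 3, and `regular_rows` drops the banner row.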
Methods: Form Attribute-Value Pairs (table has Boolean values)
Replace Boolean values, then form attribute-value pairs:
<Make, Honda>, <Model, Civic EX>, <Year, 1995>, <Colour, White>, <Price, $6300>, <Auto, Auto>, <Air Cond., Air Cond.>, <AM/FM, AM/FM>
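The pairs above can be reproduced with a short sketch. It assumes, as illustration only, that Boolean cells are marked with strings such as "Yes" and "No", that a true cell is replaced by its attribute name, and that a false cell produces no pair. The marker sets and the function name `form_pairs` are hypothetical.

```python
# Assumed spellings of Boolean cells; real tables may use other markers.
TRUE_MARKERS = {"yes", "y", "true"}
FALSE_MARKERS = {"no", "n", "false", ""}


def form_pairs(header, row):
    """Pair each header attribute with its cell, replacing Boolean values."""
    pairs = []
    for attribute, value in zip(header, row):
        marker = value.strip().lower()
        if marker in TRUE_MARKERS:
            # A true Boolean cell becomes <attribute, attribute>.
            pairs.append((attribute, attribute))
        elif marker in FALSE_MARKERS:
            # Assumption: a false or empty cell contributes no pair.
            continue
        else:
            pairs.append((attribute, value.strip()))
    return pairs


header = ["Make", "Model", "Year", "Colour", "Price", "Auto", "Air Cond.", "AM/FM", "CD"]
row = ["Honda", "Civic EX", "1995", "White", "$6300", "Yes", "Yes", "Yes", "No"]
pairs = form_pairs(header, row)
```

With the assumed row, `pairs` holds the same eight pairs as the slide, and the CD column yields nothing because its cell is false.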
Methods: Adjust Attribute-Value Pairs
<Make, Honda>, <Model, Civic EX>, <Year, 1995>, <Colour, White>, <Price, $6300>, <Auto>, <Air Cond>, <AM/FM>
<Make, Honda>, <Model, Civic EX>, <Year, 1995>, <Colour, White>, <Price, $6300>, <Auto, Auto>, <Air Cond., Air Cond.>, <AM/FM, AM/FM>
Table: attribute-value pairs
Methods: Add Information Hidden Behind Links
• Unstructured and semi-structured: concatenate
<Manufacturer, Honda>, <Model, Civic EX>, <Door, 4>, <Year, 1995>, <Color, White>, <Engine, 2.0L 4 Cylinders>, <Transmission, Auto>, <Mileage, 82,628>, <Price, $6300>
• Table: attribute-value pairs
• List:
<Features, AIR CONDITIONING, CD, AM/FM, CLOTH UPHOLSTERY, CONSOLE, CRUISE CONTROL, DUAL AIR BAGS, INSIDE HOOD RELEASE, POWER DOOR LOCKS, POWER STEERING, POWER SUNROOF, POWER WINDOWS, RADIAL TIRES, REAR DEFROSTER, REAR SPOILER, RECLINING SEATS>
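The three cases for linked pages can be sketched as a single dispatch function. The record layout (a dict with "pairs" and "text"), the `kind` labels, and the use of the list heading as the attribute name are all assumptions made for this illustration. The slides only state that unstructured and semi-structured text is concatenated, that tables yield attribute-value pairs, and that lists yield a pair such as <Features, ...>.

```python
def add_linked_information(record, kind, content, heading=""):
    """Return a new record extended with content found behind a link.

    record:  {"pairs": [(attribute, value), ...], "text": str}
    kind:    "text", "table", or "list" (hypothetical labels)
    content: str for "text", list of pairs for "table", list of str for "list"
    """
    pairs = list(record["pairs"])
    text = record["text"]
    if kind == "text":
        # Unstructured or semi-structured data: concatenate to the row's text.
        text = (text + " " + content).strip()
    elif kind == "table":
        # Linked table: its attribute-value pairs are added as they are.
        pairs.extend(content)
    elif kind == "list":
        # Linked list: one pair whose value is the joined list items.
        pairs.append((heading, ", ".join(content)))
    else:
        raise ValueError(f"unknown content kind: {kind!r}")
    return {"pairs": pairs, "text": text}
```

Returning a new record rather than mutating the input keeps each row's original pairs available for the before-and-after comparison used in the evaluation.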
Methods: Inferred Mapping Creation
Target database schema: {Car, Year, Make, Model, Mileage, Price, PhoneNr}, {Car, Feature}
Each row is a car.
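To make the target schema concrete, the sketch below maps one row's attribute-value pairs to a Car tuple and a set of Car-Feature tuples. It is not the presented method: the slides rely on an extraction ontology to recognize values, whereas this example substitutes a hand-written synonym table (`SYNONYMS`, hypothetical) and assumes that any unmatched pair contributes its value as a feature.

```python
TARGET_ATTRIBUTES = ("Year", "Make", "Model", "Mileage", "Price", "PhoneNr")

# Hypothetical synonym table standing in for ontology-based recognition.
SYNONYMS = {
    "yr": "Year", "year": "Year",
    "make": "Make", "manufacturer": "Make",
    "model": "Model",
    "mileage": "Mileage",
    "price": "Price",
}


def map_row(car_id, pairs):
    """Map one table row (one car) to the two target relations."""
    car = {"Car": car_id}
    car.update({attribute: None for attribute in TARGET_ATTRIBUTES})
    features = []
    for attribute, value in pairs:
        target = SYNONYMS.get(attribute.strip().lower())
        if target is not None:
            car[target] = value
        else:
            # Assumption: unmatched pairs are stored as features of the car.
            features.append((car_id, value))
    return car, features


pairs = [("Make", "Honda"), ("Model", "Civic EX"), ("Year", "1995"),
         ("Colour", "White"), ("Price", "$6300"), ("Auto", "Auto")]
car, features = map_row(1, pairs)
```

Attributes with no source value, such as Mileage and PhoneNr in this row, stay `None` so that later steps can fill them from linked pages.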
Evaluation
• Measure the percentage of correct mappings: correct, partially correct, and incorrect mappings
• Measure precision and recall for data in the table and data in linked pages
• Compare the results for extracted data before mapping and after mapping
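Precision and recall follow their standard definitions: precision is the fraction of extracted items that are correct, and recall is the fraction of expected items that were extracted. The sketch below applies them to sets of attribute-value pairs. The function name and the sample data are illustrative only and are not results from the evaluation.

```python
def precision_recall(extracted, expected):
    """Return (precision, recall) for two collections of extracted facts."""
    extracted, expected = set(extracted), set(expected)
    correct = len(extracted & expected)
    # Define both measures as 0.0 when their denominator is empty.
    precision = correct / len(extracted) if extracted else 0.0
    recall = correct / len(expected) if expected else 0.0
    return precision, recall


# Illustrative data only.
extracted = {("Make", "Honda"), ("Year", "1995"), ("Price", "$6300"), ("Model", "Civic")}
expected = {("Make", "Honda"), ("Year", "1995"), ("Price", "$6300"),
            ("Model", "Civic EX"), ("Colour", "White")}
precision, recall = precision_recall(extracted, expected)
```

Running the same function on the pairs before mapping and on the tuples after mapping gives the comparison named in the last bullet.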
Contribution
• Provides an approach to extract information automatically from HTML tables
• Suggests a different way to solve the problem of schema matching