Conceptual-Model-Based Web Data Extraction by Example


Yuanqiu (Joe) Zhou

Data Extraction Group

Brigham Young University

Sponsored by NSF

Motivation

Data-rich Websites in abundance

Conceptual-Model-Based Methodology is resilient

“By Example” approach is user-friendly

“By Example” Approach

Web users specify desired information by creating a form

Users collect sample pages on the Web

An ontology generator learns the task by analyzing the form and the sample pages

Interactions may be needed to improve or complete the ontology

Architecture

Data Frame Libraries

User Created Form GUI

Sample Pages

Ontology Generator

Extraction Engine

Target Pages

Populated Database

Extraction Ontology

Sample Web Page and User Created Form

Digital Camera

Brand: Canon

Model: PowerShot G2

CCD Resolution: 4.0

Image Resolution: 2272 x 1074

Optical Zoom: 3

Digital Zoom: 2
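
As a rough illustration of how a filled-in form might be tied back to a sample page, the sketch below looks up each sample value in the page text. The dictionary layout, the `find_occurrences` helper, and the page text are all assumptions made for this example; the slides do not say how the ontology generator performs this step.

```python
import re

# Hypothetical representation of the user-created form, filled with the
# sample values shown on the slide.
form = {
    "Brand": "Canon",
    "Model": "PowerShot G2",
    "CCD Resolution": "4.0",
    "Image Resolution": "2272 x 1074",
    "Optical Zoom": "3",
    "Digital Zoom": "2",
}

# Invented sample-page text, for illustration only.
sample_page = (
    "Canon PowerShot G2 digital camera. Features a high-quality "
    "4.0 Megapixel Resolution CCD, image resolution 2272 x 1074, "
    "3x optical zoom and 2x digital zoom."
)

def find_occurrences(value, text):
    """Return start offsets where `value` occurs and is not part of a longer number."""
    pattern = r"(?<![\d.])" + re.escape(value) + r"(?![\d.])"
    return [m.start() for m in re.finditer(pattern, text)]

# Map every form field to the places where its sample value appears.
occurrences = {field: find_occurrences(value, sample_page)
               for field, value in form.items()}

for field, offsets in occurrences.items():
    print(f"{field}: {len(offsets)} occurrence(s)")
```

In this invented text the value "2" appears twice (in "G2" and in "2x"), which hints at why value patterns alone may not be enough and why keywords and context expressions are useful.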

Extraction Ontology

Relationship Set and Constraints

Extraction Patterns

Keywords

Context Expressions

Relationship Set and Constraints

Primary Object Name

Other Objects’ Names

Participation Constraints

DigitalCamera [-> object];
DigitalCamera [0:1] has Brand [1:*];
DigitalCamera [0:1] has Model [1:*];
DigitalCamera [0:1] has CCDResolution [1:*];
DigitalCamera [0:1] has ImageResolution [1:*];
DigitalCamera [0:1] has OpticalZoom [1:*];
DigitalCamera [0:1] has DigitalZoom [1:*];
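
To make the structure of these declarations concrete, here is a minimal sketch that reads them into Python objects. The `Relationship` class, the regular expressions, and `parse_declarations` are hypothetical names introduced for this example; they are not part of the tool described in the slides.

```python
import re
from dataclasses import dataclass

@dataclass
class Relationship:
    primary: str        # object set written on the left of "has"
    primary_card: str   # constraint written next to the primary object
    other: str          # object set written on the right of "has"
    other_card: str     # constraint written next to the other object

PRIMARY_RE = re.compile(r"^(\w+) \[-> object\];$")
HAS_RE = re.compile(
    r"^(\w+) \[(\d+:(?:\d+|\*))\] has (\w+) \[(\d+:(?:\d+|\*))\];$"
)

def parse_declarations(lines):
    """Return (primary object name, list of Relationship) for the given lines."""
    primary = None
    relationships = []
    for line in lines:
        line = line.strip()
        m = PRIMARY_RE.match(line)
        if m:
            primary = m.group(1)
            continue
        m = HAS_RE.match(line)
        if m:
            relationships.append(Relationship(*m.groups()))
    return primary, relationships

DECLARATIONS = [
    "DigitalCamera [-> object];",
    "DigitalCamera [0:1] has Brand [1:*];",
    "DigitalCamera [0:1] has Model [1:*];",
    "DigitalCamera [0:1] has CCDResolution [1:*];",
    "DigitalCamera [0:1] has ImageResolution [1:*];",
    "DigitalCamera [0:1] has OpticalZoom [1:*];",
    "DigitalCamera [0:1] has DigitalZoom [1:*];",
]

primary, relationships = parse_declarations(DECLARATIONS)
print(primary, [r.other for r in relationships])
```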


Extraction Patterns

Data Frame Libraries: Lexicons, Synonym Dictionary, Regular Expressions

Extraction Patterns: Lexicons for Brand and Model; Regular Expressions for numbers and Image resolution

From Data Frame Libraries

CCDResolution matches [20]
    constant { extract "\b\d(\.\d{1,2})?\b"; };
    keyword "\bMegapixel\b", "\bCCD\b", "\bResolution\b";

Features a high-quality 4.0 Megapixel Resolution CCD

The new Nikon Coolpix 995 offers a boasting 3.34 Megapixel CCD

3 effective megapixel
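
The CCDResolution pattern and keywords can be tried against the three sample phrases with Python's `re` module. This is only a sketch under assumptions: keyword matching is taken to be case-insensitive (so that "megapixel" in the third phrase counts), the `[20]` qualifier is not modeled, and `ccd_candidates` is a hypothetical helper name.

```python
import re

# Value pattern and keywords copied from the CCDResolution data frame.
EXTRACT = re.compile(r"\b\d(\.\d{1,2})?\b")
KEYWORDS = [
    re.compile(p, re.IGNORECASE)  # case-insensitivity is an assumption
    for p in (r"\bMegapixel\b", r"\bCCD\b", r"\bResolution\b")
]

def ccd_candidates(text):
    """Return value matches, but only if at least one keyword is present."""
    if not any(k.search(text) for k in KEYWORDS):
        return []
    return [m.group(0) for m in EXTRACT.finditer(text)]

samples = [
    "Features a high-quality 4.0 Megapixel Resolution CCD",
    "The new Nikon Coolpix 995 offers a boasting 3.34 Megapixel CCD",
    "3 effective megapixel",
]

for s in samples:
    print(ccd_candidates(s))
# ['4.0']
# ['3.34']
# ['3']
```

The model number "995" in the second phrase is not picked up, because the word boundaries in the pattern only allow a single digit before the optional decimal part.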


Keywords

Features a high-quality 4.0 Megapixel Resolution CCD

The new Nikon Coolpix 995 offers a boasting 3.34 Megapixel CCD

3 effective megapixel


CCDResolution matches [20]
    constant { extract "\b\d(\.\d{1,2})?\b"; };
    keyword "\bMegapixel\b", "\bCCD\b", "\bResolution\b";

Context Expressions

3.5x optical zoom (2.5x digital)

a superior 4x Optical Zoom Nikkor lens, plus 4x stepless digital zoom

optical 3X /digital 6X zoom

OpticalZoom matches [10]
    constant { extract "\b\d(\.\d)?"; context "\b\d(\.\d)?(x)\b"; };
    keyword "\boptical\b";
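
One way to read the OpticalZoom data frame is: find every span that matches the context expression, choose the span closest to the keyword, and extract the value from inside it. The sketch below follows that reading. The nearest-keyword rule, case-insensitive matching, and the `optical_zoom` helper are assumptions for illustration, not the documented behavior of the extraction engine.

```python
import re

# Patterns copied from the OpticalZoom data frame.
CONTEXT = re.compile(r"\b\d(\.\d)?(x)\b", re.IGNORECASE)  # e.g. "3.5x", "3X"
EXTRACT = re.compile(r"\b\d(\.\d)?")                      # e.g. "3.5", "3"
KEYWORD = re.compile(r"\boptical\b", re.IGNORECASE)

def optical_zoom(text):
    """Return the zoom value whose context span lies nearest the keyword."""
    kw = KEYWORD.search(text)
    if kw is None:
        return None
    contexts = list(CONTEXT.finditer(text))
    if not contexts:
        return None
    # Assumed disambiguation rule: smallest distance to the keyword wins.
    nearest = min(contexts, key=lambda m: abs(m.start() - kw.start()))
    return EXTRACT.search(nearest.group(0)).group(0)

samples = [
    "3.5x optical zoom (2.5x digital)",
    "a superior 4x Optical Zoom Nikkor lens, plus 4x stepless digital zoom",
    "optical 3X /digital 6X zoom",
]

for s in samples:
    print(optical_zoom(s))
# 3.5
# 4
# 3
```

The context expression keeps the trailing "x" out of the extracted value while still requiring it to be present, and the keyword separates the optical figure from the digital one.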

Extraction Ontology

DigitalCamera [-> object];

DigitalCamera [0:1] has Brand [1:*];
Brand matches [10]
    constant { extract "\bNikon\b"; },
             { extract "\bCanon\b"; },
             { extract "\bOlympus\b"; },
             { extract "\bMinolta\b"; },
             { extract "\bSony\b"; };
end;

DigitalCamera [0:1] has CCDResolution [1:*];
CCDResolution matches [20]
    constant { extract "\b\d(\.\d{1,2})?\b"; };
    keyword "\bMegapixel\b", "\bCCD\b", "\bResolution\b";
end;

DigitalCamera [0:1] has ImageResolution [1:*];
ImageResolution matches [20]
    constant { extract "\b\d{4}(\s)?(x)(\s)?\d{4}\b"; },
             { extract "\b\d{4}(\s)?(x)(\s)?\d{4}\b"; };
    keyword "\bResolution\b", "\bImage\b";
end;

DigitalCamera [0:1] has OpticalZoom [1:*];
OpticalZoom matches [10]
    constant { extract "\b\d"; context "\b\d(x)\b"; };
    keyword "\boptical\b";
end;


Extraction Ontology

DigitalCamera [-> object];

DigitalCamera [0:1] has Brand [1:*];
Brand matches [10]
    constant { extract "\bNikon\b"; },
             { extract "\bCanon\b"; },
             { extract "\bOlympus\b"; },
             { extract "\bMinolta\b"; },
             { extract "\bSony\b"; };
end;

DigitalCamera [0:1] has CCDResolution [1:*];
CCDResolution matches [20]
    constant { extract "\b\d(\.\d{1,2})?\b"; };
    keyword "\bMegapixel\b", "\bCCD\b", "\bResolution\b";
end;

DigitalCamera [0:1] has ImageResolution [1:*];
ImageResolution matches [20]
    constant { extract "\b\d{4}(\s)?(x)(\s)?\d{4}\b"; },
             { extract "\b\d{4}(\s)?(x)(\s)?\d{4}\b"; };
    keyword "\bResolution\b", "\bImage\b";
end;

DigitalCamera [0:1] has OpticalZoom [1:*];
OpticalZoom matches [10]
    constant { extract "\b\d(\.\d)?"; context "\b\d(\.\d)?(x)\b"; };
    keyword "\boptical\b";
end;
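
The effect of allowing a decimal part in the OpticalZoom patterns can be checked with a small experiment. The sketch compares the integer-only patterns (`\b\d` with context `\b\d(x)\b`) against the decimal-aware patterns in the form given on the Context Expressions slide (`\b\d(\.\d)?` with context `\b\d(\.\d)?(x)\b`). The `extract_with_context` helper and case-insensitive matching are assumptions made for this illustration.

```python
import re

def extract_with_context(text, extract, context):
    """Return the values matched by `extract` inside each `context` match."""
    results = []
    for ctx in re.finditer(context, text, re.IGNORECASE):
        m = re.search(extract, ctx.group(0))
        if m:
            results.append(m.group(0))
    return results

# Integer-only patterns versus patterns that allow one decimal digit.
INTEGER_ONLY = dict(extract=r"\b\d", context=r"\b\d(x)\b")
WITH_DECIMAL = dict(extract=r"\b\d(\.\d)?", context=r"\b\d(\.\d)?(x)\b")

text = "3.5x optical zoom"
print(extract_with_context(text, **INTEGER_ONLY))   # ['5']
print(extract_with_context(text, **WITH_DECIMAL))   # ['3.5']
```

With the integer-only patterns, "3.5x" is misread as "5" because the context can only anchor on the digit right before the "x". Both pattern sets agree on whole-number phrases such as "4x Optical Zoom".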


Results (Same Site)

Results (Different Site)

Summary and Future Work

The example indicates that the approach is feasible

Some open questions need to be explored
