Automatic Extraction of Information Behind Web Forms Based on Application Ontologies

by Sai Ho Yau, Brigham Young University




Introduction

An enormous amount of information is available on the Web, but it is difficult to extract automatically for several reasons:

Dynamically generated Web pages

Form interfaces

Relevant information can be obtained only after a Web form is filled out and submitted


Problems Dealing with Forms

No general Web form design

Required text fields

One form may lead to another

Resulting information embedded within forms

Returned error messages versus valid data

Elimination of possible duplicate data


The Framework


Tools

Languages and Internet browser used: JavaScript, Java, PHP3, MySQL; Microsoft Internet Explorer

Platform: Solaris Intel (Unix), with Sun Java 1.1.6.


Method: Construct the Query String

Query String:

http://www.automobilesearch.com/search.html?cat2=0&manufacturer=&searcharea=0&mincost=&maxcost=&currency=USD&minyear=&maxyear=&go=Search

Domain_Path: http://www.automobilesearch.com/
win2form_action: search.html
win2form_length: 1
win2Elem_length_0: 9

win2Elem_name_0: cat2
win2Elem_type_0: select-one
win2Elem_value_0: 0
win2Elem_option_length: 16
win2Elem_option_Text_0: All Types
win2Elem_option_0: 0
win2Elem_option_Text_1: Accessories
win2Elem_option_1: 4940
win2Elem_option_Text_2: Classic Cars
win2Elem_option_2: 4981
::

win2Elem_name_1: manufacturer
win2Elem_type_1: select-one
win2Elem_value_1:
win2Elem_option_length: 43
win2Elem_option_Text_0: Any Manufacturer
win2Elem_option_0:
win2Elem_option_Text_1: AM General
win2Elem_option_1: AM General
::

win2Elem_name_6: minyear
win2Elem_type_6: text
win2Elem_value_6:

win2Elem_name_7: maxyear
win2Elem_type_7: text
win2Elem_value_7:

win2Elem_name_8: go
win2Elem_type_8: submit
win2Elem_value_8: Search
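The step from the form description to the query string can be illustrated with a minimal Python sketch. It is not the implementation used in this work (the slides list JavaScript, Java, and PHP3 as the tools); the function name `build_query_string` and the `elements` list are assumptions made for illustration. Elements 2 through 5 are elided in the listing, so their names and default values are taken from the query string shown on this slide.

```python
from urllib.parse import urlencode, urljoin


def build_query_string(domain_path, form_action, elements):
    """Join the form's action URL with the name=value pair of each element.

    `elements` is a list of (name, default_value) tuples in form order.
    """
    base = urljoin(domain_path, form_action)
    return base + "?" + urlencode(elements)


# (name, default value) pairs in the order the form lists them.
# Elements 2-5 are not in the listing; they are read off the query string.
elements = [
    ("cat2", "0"),
    ("manufacturer", ""),
    ("searcharea", "0"),
    ("mincost", ""),
    ("maxcost", ""),
    ("currency", "USD"),
    ("minyear", ""),
    ("maxyear", ""),
    ("go", "Search"),
]

url = build_query_string(
    "http://www.automobilesearch.com/", "search.html", elements
)
print(url)
```

Submitting every element with its default value reproduces the query string shown on this slide; other queries would substitute option values from the listing.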


The Goal

Deal with as many Web forms as possible.

Retrieve all relevant information.

Automate the extraction process.


Returned Web Page


Suggested Solution


Conclusions

We can automatically:

Fill in Web forms.

Extract information behind forms.

Screen out error messages and inapplicable Web pages.

Eliminate duplicate data.
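The slides do not say how error pages are screened out or how duplicates are detected. As one simple possibility, the Python sketch below skips any returned page containing an error phrase and keeps each extracted record only once, comparing records after normalizing case and whitespace. The phrases in `ERROR_MARKERS`, the function names, and the sample pages are all hypothetical.

```python
# Hypothetical phrases that mark a page as an error or empty result.
ERROR_MARKERS = ("no results found", "no matches found", "invalid")


def is_error_page(text):
    """Return True if the page text contains any error phrase."""
    lowered = text.lower()
    return any(marker in lowered for marker in ERROR_MARKERS)


def normalize(record):
    """Lower-case each field and collapse whitespace for comparison."""
    return tuple(" ".join(field.lower().split()) for field in record)


def collect_records(pages):
    """Gather unique records from (page_text, records) pairs.

    Error pages are skipped; a record is kept only the first time
    its normalized form is seen.
    """
    seen = set()
    unique = []
    for text, records in pages:
        if is_error_page(text):
            continue
        for record in records:
            key = normalize(record)
            if key not in seen:
                seen.add(key)
                unique.append(record)
    return unique


# Made-up sample data: two result pages with one overlapping record
# and one error page.
pages = [
    ("Search results", [("1995", "Ford", "Mustang"), ("1998", "Honda", "Civic")]),
    ("Sorry, no results found.", []),
    ("Search results", [("1995", "ford", " Mustang"), ("2000", "Toyota", "Camry")]),
]
result = collect_records(pages)
print(result)
```

Exact matching after normalization only catches records that differ in case or spacing; records that differ in wording would need a looser comparison.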


Thank You