Web data extraction v2 - uni-leipzig.de · 2009-04-01 · tool support Robomaker, Dapper, Lixto, Wintask. 23 Web Data Extraction ... Internet browser Web Content Extractor Low Poor

HTML-aware tools for Web data extraction

Student: Xavier Azagra

Supervisor: Andreas Thor

Thesis presentation

Table of contents

- Introduction
- Data Extraction Process
- Data Extraction Tools
- Realized tests
- Future Work

Web Data Extraction

Introduction

We focus on extracting data from HTML:
- The predominant markup language for Web pages
- A kind of semi-structured data
- Information follows a nested structure
- Supported by the W3C (World Wide Web Consortium)

Introduction: Internet growth

- 1,400 million Internet users
- 168 million sites

Sources: May 2008 Web Server Survey - www.netcraft.com; Wikipedia, The Free Encyclopedia

Introduction: Purposes of Web data extraction

[Diagram labels: Users, Applications, Query, Integration, Extraction, Web data source]

- Get information from the Web to be used in other areas or by applications
- Information retrieval (e.g. feeds, Web search engines…)
- Let users access particular data from the Web
- Economic uses (e.g. stock market, shopping comparison…)

Data extraction process: Main problems

The Internet was designed as a source of data for human use. Problems appear when we want to extract data from HTML.

Data not presented in HTML format:
- Password-protected sites
- Cookies
- Session IDs
- JavaScript
- Dynamic content

Deep resources:
- Unlinked content
- Contextual Web
- Limited-access content

Data extraction process: Types of content

Free text
- Natural language texts
- Patterns involving syntactic relations between words or semantic classes of words

Structured text
- Textual information following a predefined strict format
- Use of the format description

Semi-structured text
- Between unstructured collections of textual documents and fully structured tuples of typed data
- Extraction patterns are often based on tokens and delimiters
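
As a rough illustration of token-and-delimiter extraction on semi-structured text, the following Python sketch returns the text found between a fixed left and right delimiter. The sample fragment, the field names and the helper `extract_between` are invented for this example and are not taken from any tool in this presentation.

```python
def extract_between(text, left, right):
    """Return every substring of text found between the delimiters left and right."""
    results = []
    pos = 0
    while True:
        start = text.find(left, pos)
        if start == -1:
            break
        start += len(left)
        end = text.find(right, start)
        if end == -1:
            break
        results.append(text[start:end].strip())
        pos = end + len(right)
    return results


# Hypothetical input: the records share delimiters, not a strict schema.
SAMPLE = (
    "<li><b>Title:</b> Example Book <i>Price:</i> 25 EUR</li>"
    "<li><b>Title:</b> Another Book <i>Price:</i> 40 EUR</li>"
)

titles = extract_between(SAMPLE, "<b>Title:</b>", "<i>")
prices = extract_between(SAMPLE, "<i>Price:</i>", "</li>")
print(titles, prices)
```

Rules of this kind are simple to write, but they depend entirely on the delimiters staying the same in the source page.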

Data extraction process: Ways to perform data extraction

Manual
- Precise
- Treats elements individually

API
- Specific Web sites
- Limited specifications

Wrapper
- Set of methods
- Independent of the source

Wrapper construction:

Manual
- Ad hoc code
- Not trivial
- Error-prone

Semiautomatic
- Support tool
- GUI support
- Less error-prone

Automatic
- Machine-learning techniques
- Supervised learning

Data extraction process: HTML structure for data extraction

Before performing the extraction, HTML-aware tools turn the document into a parse tree.

- Each node represents a tag
- Outer tags are leaves
- Expressions navigate through the whole hierarchy

Maximum precision is reached at the content of a leaf.
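
A minimal sketch of this idea, using only Python's standard `html.parser` module: it walks the tag hierarchy and records, for each piece of text, a positional path expression leading to it. The class name `LeafPathCollector`, the sample page and the 1-based XPath-like path syntax are assumptions made for this illustration; the tools discussed here use their own internal representations.

```python
from html.parser import HTMLParser


class LeafPathCollector(HTMLParser):
    """Walk the tag hierarchy and record (path, text) for each piece of text."""

    def __init__(self):
        super().__init__()
        self.stack = []        # tags currently open, from the root down
        self.counters = [{}]   # per-level child counters for positional indexes
        self.leaves = []       # (path expression, text content) pairs

    def handle_starttag(self, tag, attrs):
        seen = self.counters[-1]
        seen[tag] = seen.get(tag, 0) + 1
        self.stack.append(f"{tag}[{seen[tag]}]")
        self.counters.append({})

    def handle_endtag(self, tag):
        # Assumes well-formed input: every start tag has a matching end tag.
        self.stack.pop()
        self.counters.pop()

    def handle_data(self, data):
        text = data.strip()
        if text:
            self.leaves.append(("/" + "/".join(self.stack), text))


# Hypothetical, well-formed sample page.
PAGE = "<html><body><table><tr><td>Dapper</td><td>Free</td></tr></table></body></html>"

collector = LeafPathCollector()
collector.feed(PAGE)
leaves = collector.leaves
for path, text in leaves:
    print(path, "->", text)
```

The sketch deliberately assumes well-formed HTML; badly nested or unclosed tags would need a more tolerant tree builder.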

Data extraction process: HTML problems to extract data (I)

- Data presented without following a structure
  - Logical, simple and organized content helps to perform correct extractions
  - Unorganized content affects the HTML tree structure
- Badly constructed HTML source documents
  - Badly placed tags
  - Repeated tags
  - Unclosed tags
- Nested data elements
  - Elements that nest data may differ from one element to the next

Data extraction process: HTML problems to extract data (II)

- Problems choosing the correct example source page
  - The content structure can change depending on several factors
  - Example: result pages of Web search engines
- Problems with scripts or dynamic content
  - Hidden or changing information
  - Syntax different from HTML
  - JavaScript, PHP, AJAX or Flash

Data extraction tools: Taxonomy (I)

Data extraction tools: Taxonomy (II)

Languages for wrapper development
- Assist wrapper construction
- Alternatives to general-purpose languages

HTML-aware
- Rely on inherent structural features of HTML documents

NLP-based
- Based on syntactic and semantic constraints

Wrapper induction
- Rules derived from a given set of training examples

Modeling-based
- Try to locate in Web pages portions of data that implicitly conform to a structure

Ontology-based
- Extraction relying directly on the data

Data extraction tools: Flow of data

INPUT: URL (http://), data file

Data extraction process: wrapper

OUTPUT: XML, HTML, RSS/Atom, text, modules, CSV, email, JSON, XSL, Google Maps, Flash…
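
The input, wrapper and output stages can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the page is an inline string standing in for content fetched from a URL or read from a file, the wrapper is a single hand-written regular expression, and the function names `wrap`, `to_json` and `to_csv` are hypothetical.

```python
import csv
import io
import json
import re

# Hypothetical input; in practice it would be fetched from a URL or read from a file.
PAGE = """
<table>
<tr><td>Lixto</td><td>XML</td></tr>
<tr><td>WebHarvest</td><td>XML</td></tr>
</table>
"""

# One row of the sample table: tool name and its output format.
ROW = re.compile(r"<tr><td>(.*?)</td><td>(.*?)</td></tr>")


def wrap(page):
    """Wrapper step: turn the page into a list of records."""
    return [{"tool": tool, "output": fmt} for tool, fmt in ROW.findall(page)]


def to_json(records):
    """Output step, JSON flavour."""
    return json.dumps(records)


def to_csv(records):
    """Output step, CSV flavour."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=["tool", "output"], lineterminator="\n")
    writer.writeheader()
    writer.writerows(records)
    return buffer.getvalue()


records = wrap(PAGE)
print(to_json(records))
print(to_csv(records))
```

Keeping the wrapper separate from the serializers mirrors the diagram: the same extracted records can be delivered in several output formats.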

Data extraction tools: Structure

- 10 HTML-aware tools
- Categorization of these tools using several criteria
- Test-bench scenarios

Data extraction tools: Used HTML-aware tools

- Dapper
- Robomaker
- RoadRunner
- XWRAP
- Lixto
- WebHarvest
- GoldSeeker
- WinTask
- Automation Anywhere
- Web Content Extractor

The selection covers:
- Commercial and non-commercial tools
- Shell and GUI tools
- Screen-scraping and non-screen-scraping tools
- Linux and Windows tools…

Data extraction tools: GUI

No GUI
- Shell commands
- Configuration files and coding
- Input files
- RoadRunner

Integrated browser
- Direct interaction between the tool and the navigation browser
- Visualizes information about the Web elements
- Lixto, Robomaker, Web Content Extractor

Web browser
- Loads JavaScript and dynamic content
- Separation between the tool and the browser window
- Automation Anywhere, WinTask

Data extraction tools: Resilience

The capacity of a tool to continue working properly when changes occur in the pages it targets.

Common changes:
- To the data
- To the structure: adding, erasing or modifying elements
- To the visual design
- Introduction of new technologies (AJAX, PHP, JavaScript…)

The degree of resilience varies depending on the tool used.

Data extraction tools: Adaptiveness

The degree to which a wrapper built for pages of a specific Web source in a given application domain works properly with pages from another source in the same application domain.

Across the whole taxonomy of Web data extraction tools, only ontology-based tools are fully resilient and adaptive.

Data extraction tools: Scripting and expressions

The atomic unit of the HTML parse tree is a leaf (outer tag).

Sometimes information must be extracted more precisely than a whole leaf.

Features offered: self-scripting syntax, regular expressions, patterns, removal of special characters, others (date formatting, text replacing)

Tools listed: WinTask, Web Content Extractor, GoldSeeker, Lixto, Robomaker
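
To make these post-processing features concrete, here is a small Python sketch that applies a regular expression, special-character removal, date formatting and text replacing to the content of one leaf. The leaf text and the helper names are invented for this example; each tool exposes such features through its own syntax, which is not reproduced here.

```python
import re
from datetime import datetime


def remove_special_characters(text):
    """Keep letters, digits, whitespace and basic punctuation; drop everything else."""
    return re.sub(r"[^\w\s.,:/-]", "", text)


def reformat_date(text, source_format="%d/%m/%Y", target_format="%Y-%m-%d"):
    """Re-express a date string in another format."""
    return datetime.strptime(text, source_format).strftime(target_format)


def replace_text(text, old, new):
    """Plain text replacement."""
    return text.replace(old, new)


# Hypothetical content of a single leaf of the parse tree.
LEAF = "Price: ★ 25,00 EUR ★ (updated 01/04/2009)"

cleaned = remove_special_characters(LEAF)
price = re.search(r"(\d+,\d{2}) EUR", cleaned).group(1)
date = reformat_date(re.search(r"\d{2}/\d{2}/\d{4}", cleaned).group(0))
price_dot = replace_text(price, ",", ".")
print(price, price_dot, date)
```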

Data extraction tools: Input variables

In some cases we need input variables to run searches on the Internet:
- eBay
- Web search engines
- YouTube
- Amazon…

To extract data from the resulting pages, we need tool support: Robomaker, Dapper, Lixto, WinTask
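
One common way an input variable reaches a search site is as a query-string parameter. The sketch below builds such a URL with Python's standard library; the endpoint `https://search.example.com/results` and the parameter name `q` are placeholders, not the real interface of any site named above, and sites that submit searches through forms or scripts would need a different approach.

```python
from urllib.parse import urlencode


def build_search_url(base_url, parameter, value, extra=None):
    """Build a result-page URL from an input variable (hypothetical helper)."""
    query = {parameter: value}
    if extra:
        query.update(extra)
    return f"{base_url}?{urlencode(query)}"


# Placeholder endpoint and parameter name, for illustration only.
url = build_search_url("https://search.example.com/results", "q", "web data extraction")
print(url)
```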

Data extraction tools: Input/Output formats

Tool | Input formats | Output formats
Robomaker | HTML | RSS/Atom feed, REST Web service, Web clip
Lixto | HTML | XML
XWRAP | HTML | XML
RoadRunner | HTML | XML, HTML
Dapper | HTML | XML, RSS, HTML, modules, Atom feed, CSV, JSON, XSL, YAML, email
GoldSeeker | HTML and documents | Text
Web Content Extractor | HTML | File, Excel, DB, SQL script file, MySQL script file, HTML, XML, HTTP submit
Automation Anywhere | HTML and documents | File, Excel, DB, EXE
WinTask | HTML and documents | File, Excel, DB
WebHarvest | HTML | XML

Data extraction tools: General features (I)

Tool | Interface | Complexity | Resilience | Execution time | Free
Lixto | Program GUI, Internet browser | Medium | Good | Very good | No, requires license
XWRAP | Internet browser | Medium | Good | Good | Yes
RoadRunner | Linux shell | Medium | Poor | Good | Yes, GNU GPL license
Robomaker | Program GUI, Internet browser | Medium | Very good | Very good | Yes
Dapper | Internet browser | Low | Good | Very good | Yes

Data extraction tools: General features (II)

Tool | Interface | Complexity | Resilience | Execution time | Free
Web Content Extractor | Program GUI, Internet browser | Low | Poor | Poor | No
Automation Anywhere | Program GUI, Internet browser | Low | Poor | Good | No
WinTask | Program GUI, Internet browser | Medium | Poor | Good | No
GoldSeeker | Internet browser | Medium | Good | Poor | Yes, GNU LGPL license
WebHarvest | Program GUI | High | Good | Good | Yes

Data extraction tools: Advanced characteristics

Tool | Input variables | Scripts usage | Non-static content pages | More than one page | JavaScript or dynamic content
Dapper | Yes | No | Yes | No | Good
Robomaker | Yes | Yes | Yes | Yes | Good
RoadRunner | No | No | No | No | Poor
XWRAP | No | No | Yes | No | Poor
Lixto | Yes | Yes | Yes | Yes | Good
Web Content Extractor | No | No | No | No | Good
Automation Anywhere | No | No | No | Yes | Good
WinTask | By script | Yes | No | Yes | Good
GoldSeeker | No | Yes | Yes | No | Poor
WebHarvest | No | No | Yes | No | Poor

Realized tests: Methodology

[Diagram: a created or selected Web page, the selected tool and the selected data produce the tool result; the tool result is compared with the correct result to give the test result]
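
The comparison step of this methodology can be sketched as a field-by-field check of the tool result against the correct result. The field names, the sample values and the function `compare` are assumptions made for this illustration; the presentation does not specify how the comparison was carried out.

```python
def compare(tool_result, correct_result):
    """Return per-field pass/fail flags and an overall verdict."""
    fields = {
        field: tool_result.get(field) == expected
        for field, expected in correct_result.items()
    }
    return {"fields": fields, "passed": all(fields.values())}


# Hypothetical expected values and tool output for one test page.
correct = {"title": "Example Book", "format": "Paperback"}
tool_output = {"title": "Example Book", "format": "Paperback (used)"}

report = compare(tool_output, correct)
print(report)
```

A test counts as passed only when every selected field matches, which corresponds to a single pass or fail mark per tool and scenario.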

Realized tests: Web search engines (I)

- One of the most used resources of the Web
- Use of input variables and dynamic result pages
- Yahoo! Search uses a live search input form

Realized tests: Web search engines (II)

Tool | Google Search | Yahoo! Search | MS Live Search
Dapper | ✓ | ✓ | ✓
Robomaker | ✓ | ✗ | ✓
Lixto | ✓ | ✓ | ✓
WinTask | ✗ | ✗ | ✗
Automation Anywhere | ✗ | ✗ | ✗
Web Content Extractor | ✓ | ✓ | ✓

Realized tests: eBay

- The most important auction site on the Internet
- Use of input variables and dynamic result pages
- Fields containing variable content

Tool | eBay search
Dapper | ✓ / ✗
Robomaker | ✓
Lixto | ✓
WinTask | ✗
Automation Anywhere | ✗
Web Content Extractor | ✓

Realized tests: Dynamic content Web pages

- AJAX-based start page
- Use of dynamic content and personalized user modules

Tool | Pageflakes
Dapper | ✗
Robomaker | ✗
Lixto | ✗
WinTask | ✗
Automation Anywhere | ✗
Web Content Extractor | ✗

Realized tests: Resilience tests (I)

1. Obtain a result page from Amazon.com
2. Download the source page and related files
3. Upload them to a test server
4. Configure the tools to extract 4 fields: title, book format, new price and valuation

For each test:
- Make a modification to the source page
- Upload it to the test server
- Run the tool and check whether problems appear

Realized tests: Resilience tests (II)

- Deleting content
- Modifying CSS style tags
- Duplicating extracted data
- Changing the order of extracted data

Deleting content example: erase td[0]

Tool | Result
Dapper | ✓
Robomaker | ✓
Lixto | ✓
Web Content Extractor | ✗
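
The following Python sketch suggests why erasing a cell can break one extraction rule and leave another intact. It contrasts a rule that addresses a cell by position with one that addresses it by its `class` attribute. The sample rows, class names and helpers are hypothetical, and this is not a description of how any of the tested tools work internally.

```python
import re

# Hypothetical result row, before and after erasing the first cell (td[0]).
ORIGINAL = (
    '<tr><td class="title">Example Book</td>'
    '<td class="format">Paperback</td>'
    '<td class="price">40.00</td></tr>'
)
MODIFIED = '<tr><td class="format">Paperback</td><td class="price">40.00</td></tr>'

CELL = re.compile(r'<td class="(\w+)">(.*?)</td>')


def by_position(row, index):
    """Positional rule: take the cell at a fixed 0-based index, or None if absent."""
    cells = CELL.findall(row)
    return cells[index][1] if index < len(cells) else None


def by_class(row, name):
    """Attribute rule: take the first cell whose class matches, or None if absent."""
    for cls, text in CELL.findall(row):
        if cls == name:
            return text
    return None


print(by_position(ORIGINAL, 2), by_position(MODIFIED, 2), by_class(MODIFIED, "price"))
```

After the deletion, the positional rule for the price finds nothing at index 2, while the attribute-based rule still returns the price.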

Realized tests: Precision tests (I)

We designed a Web page listing published books.

We extract data from the "Last published edition" column with a different precision each time:
- All the information in the row
- Date of the last publication
- Year of the last publication
- Last 2 digits of the year of the last publication
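
The four precision levels can be illustrated with one regular expression applied to the text of a single cell. The cell content and the date layout below are invented for this sketch; the actual test page is not reproduced in these slides.

```python
import re

# Hypothetical content of one "Last published edition" cell.
CELL = "2nd edition, 14 March 2007"

# Groups: 1 = day, 2 = month, 3 = full year, 4 = first two digits, 5 = last two digits.
DATE = re.compile(r"\b(\d{1,2}) ([A-Z][a-z]+) ((\d{2})(\d{2}))\b")

match = DATE.search(CELL)
whole_cell = CELL              # level 1: all the information in the cell
date = match.group(0)          # level 2: date of the last publication
year = match.group(3)          # level 3: year of the last publication
short_year = match.group(5)    # level 4: last 2 digits of the year
print(whole_cell, date, year, short_year, sep=" | ")
```

Each level narrows the extracted span further, which is why tools without expression support can only deliver the coarser levels.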

Realized tests: Precision tests (II)

We made three different modifications to the source page, each with different characteristics, in order to:
- Extract data from formatted text
- Extract data using styled text (class attribute)
- Extract data from CSV-formatted text
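
For the CSV case, once the text of a cell has been extracted, splitting it into fields is straightforward with a CSV parser. The sketch below uses Python's standard `csv` module on an invented cell; the field order (title, author, year) is an assumption for illustration only.

```python
import csv
import io

# Hypothetical CSV-formatted cell: title, author, year of last publication.
CELL = 'Example Book,"Doe, Jane",2007'


def split_csv_cell(text):
    """Split one CSV-formatted line into fields, honouring quoted commas."""
    return next(csv.reader(io.StringIO(text)))


fields = split_csv_cell(CELL)
short_year = fields[2][-2:]
print(fields, short_year)
```

A naive split on commas would break the quoted author field in two, which is why a real CSV parser is used here.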

Realized tests: Precision tests (III)

Example: extracting data from the CSV source

Tool | All the information of the last published edition | Date of the last publication | Year of the last publication | Last 2 digits of the year of the last publication
Dapper | ✗ | ✗ | ✗ | ✗
Robomaker | ✓ | ✓ | ✓ | ✓
Lixto | ✓ | ✓ | ✓ | ✓
WinTask | ✗ | ✗ | ✗ | ✗
Automation Anywhere | ✗ | ✗ | ✗ | ✗
Web Content Extractor | ✓ | ✓ | ✓ | ✓

Future work

- Given a Web source, identify which features a tool must provide; this helps to find the most suitable tool
- Testing with tools that have no visual GUI
- Write a detailed document covering all the work done

Thanks for your attention!