Web Data Extraction Como2010

DIADEM: A Short Overview

Georg Gottlob

Web data extraction

[Figure: a WRAPPER sits between the Web (HTML pages, layout) and corporate EDP apps (structured data, databases, XML).]

Goal: Make web contents accessible to electronic data processing.

Wrappers: HTML → select, extract, annotate → XML
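A minimal sketch of the select, extract, annotate pipeline, using only the Python standard library. The class name `PriceWrapper`, the `class="price"` markup and the `<offers>` output element are assumptions made for illustration; they are not part of Lixto or any tool named in these slides.

```python
from html.parser import HTMLParser
import xml.etree.ElementTree as ET


class PriceWrapper(HTMLParser):
    """Select <td class="price"> cells and extract their text (hypothetical markup)."""

    def __init__(self):
        super().__init__()
        self._in_price = False
        self.prices = []

    def handle_starttag(self, tag, attrs):
        # Select: only table cells whose class attribute is "price".
        if tag == "td" and ("class", "price") in attrs:
            self._in_price = True

    def handle_endtag(self, tag):
        if tag == "td":
            self._in_price = False

    def handle_data(self, data):
        # Extract: keep the non-empty text found inside a selected cell.
        if self._in_price and data.strip():
            self.prices.append(data.strip())


def wrap(html):
    """Run the wrapper and annotate each extracted value as a <price> element."""
    parser = PriceWrapper()
    parser.feed(html)
    root = ET.Element("offers")
    for value in parser.prices:
        ET.SubElement(root, "price").text = value
    return ET.tostring(root, encoding="unicode")


page = '<table><tr><td class="name">Flat</td><td class="price">1,200</td></tr></table>'
print(wrap(page))
```

A hand-written wrapper like this is tied to one site's markup, which is why constructing and maintaining such wrappers by hand becomes expensive at scale.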

Lixto Visual Developer (VD)

[Screenshot: Lixto Visual Developer, showing the navigation steps, the embedded Mozilla web browser, and the extraction configuration.]

Need for Automatic Extraction Technology

Example: Real Estate UK

• 17,000 sites
• Many not covered by aggregators
• We do have a list of all homepages (Yellow Pages UK)
• Manual or semi-automatic wrapping is too expensive:
  - wrapper construction
  - testing
  - keeping track of changes
• No tool or method can do it fully automatically.

Other domains: hospitals, restaurants, schools, travel agents, airlines, pharmaceutical companies, and retail companies such as supermarket chains.

Need for Automatic Extraction Technology (2)

All search engine providers need it! Many work on it.

Keywords: Vertical search, object search, semantic search.

Raghu Ramakrishnan, Yahoo!, March 2009: “no one really has done this successfully at scale yet”

Alon Halevy, Google, Feb. 2009: “Current technologies are not good enough yet to provide what search engines really need. […] any successful approach would probably need a combination of knowledge and learning.”

The Blackbox we want to construct

[Figure: for an application domain with thousands of websites (e.g. real estate, restaurants), the BLACKBOX takes a URL as input and outputs application-relevant structured data (XML or RDF).]

To achieve this, we plan to combine a host of annotators with a new knowledge-based approach.

Relationship to SeCo & Webdam

Q: Find apartments in Milan whose prices are average, in quarters where restaurant quality > average.

[Figure: query results drawn from Web service A and a further web service.]

How to achieve it?

Rationale: Combine existing and new “low level” annotators with “high level” AI and reasoning.

Low level annotators:
- Bottom-up page analysis
- ML-based entity recognizers
- NLP & ontological text annotation
- Web page classification & analysis
- Basic link analysis

High level reasoning:
- Goal oriented
- Conceptual domain objects
- Conceptual interaction elements
- High-level object ontology
- Domain knowledge

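[Figure: DOM tree fragment of a property search form, with node numbers. Nodes shown: <table>113; <tr>115 and <tr>134; <td>119 “I’m interested in”; <table>124 (radio buttons) with <tr>125, <tr>126, <td>129 “Buying”, <td>130 “Renting”; <td>135 “Maximum price”; <select>136 with <option>137, <option>138; <td>139, <td>140 with “GBP”, “EUR”.]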

Bottom-up (low-level) annotation

[Figure: bottom-up annotation graph. Node labels: Monochromatic Rectangle [(02873,227)(03900,417)], Geographic search facility, Postcode input field, Active map, Price search facility, Geo-Price-Searchbox; node numbers 105 and 127. Edge labels: ISA, Occurs in.]

Top-down reasoning

[Figure: top-down concept hierarchy. Concepts: Property Search Facility, Property List, Single Property Description, Specially highlighted property. Edge label: part-of (m1).]

Bottom-up processing / Top-down reasoning

[Figure: the bottom-up annotation graph (Monochromatic Rectangle, Geographic search facility, Postcode input field, Active map, Price search facility, Geo-Price-Searchbox) shown together with the top-down concept hierarchy (Property Search Facility, Property List, Single Property Description, Specially highlighted property); the two levels are linked by the Phenomenology mapping.]

Datalog for Web-Object Reasoning

table(T) & occurs_in(T,areaselection) & occurs_in(T,priceselection) → goodtable(T).

goodtable(T) & child(Parent,T) → containsgoodtable(Parent).

goodtable(T) & ¬containsgoodtable(T) → propertysearchmask(T).

If a table contains an area selection input field and a price selection field, both of which are not simultaneously contained in a smaller table, then this table is the property search mask.
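The three rules can be sketched as set operations in Python. This is a minimal sketch that reads the third rule with a negated containsgoodtable atom, as the prose explanation suggests. The fact base (tables `t1` and `t2`) is hypothetical, and the function name `property_search_masks` is introduced only for this example.

```python
# Hypothetical facts: t1 is an outer layout table, t2 is nested inside it.
tables = {"t1", "t2"}
occurs_in = {
    ("t1", "areaselection"), ("t1", "priceselection"),
    ("t2", "areaselection"), ("t2", "priceselection"),
}
child = {("t1", "t2")}  # child(Parent, T)


def property_search_masks(tables, occurs_in, child):
    # Rule 1: a table in which both selections occur is a good table.
    good = {
        t for t in tables
        if (t, "areaselection") in occurs_in and (t, "priceselection") in occurs_in
    }
    # Rule 2: a parent of a good table contains a good table.
    contains_good = {parent for (parent, t) in child if t in good}
    # Rule 3 (with negation): a good table that contains no good table.
    return good - contains_good


print(property_search_masks(tables, occurs_in, child))
```

Because rule 3 negates a predicate that is fully computed by rules 1 and 2, the rules can be evaluated in this fixed order (stratified negation).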

Crucial steps

• WP1 data model (KRR model)
• WP2 low & intermediate level annotation
• WP3 High level ontology and Rules (top down) + mapping HL to Int. Level: Phenomenology
• WP4 Access, interaction, & navigation
• WP5 Compilation; Learning XPath expressions
• WP6 Highly parallel execution on clouds
• WP7 General methodology

Knowledge base (WP1)
• General web knowledge: HTML, CSS, script handling
• Domain-specific knowledge: rules, constraints, tasks, ontology
• Site-specific knowledge

Analysis phase (input: WWW)
• Factual knowledge extraction
• Bottom-up property extraction (WP2)
• Top-down pattern perception (WP3)
• Access, interaction, navigation (WP4)

Compilation phase (WP5)
• Extraction program build
• Optimization, parallelization

Runtime phase (WP6)
• Highly parallel extraction
  - maximize speed
  - improve consistency
• Use of elastic framework (cloud computing)

The Data Model

Datalog is good but does not suffice. On top of it:

• Need for object creation
• Need for ontological reasoning
• Need for probabilistic reasoning
• Need for default reasoning

Object creation in Datalog+

table(T1) & table(T2) & sameColor(T1,T2) & isNeighbourRight(T1,T2) → ∃X (tablebox(X) & contains(X,T1) & contains(X,T2)).

Example: two neighbouring tables of the same colour, T1 and T2, are grouped into a new tablebox object X.

T1: PRODUCT            T2: PRICE
Toshiba Protégé cx     480
Dell 25416             360
Dell 23233             470
Acer 78987             390

Deduction in Datalog+ is undecidable (TGDs).

Datalog±: require guardedness of rule bodies. Decidable, linear-time data complexity.
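One way to picture object creation is a single rule application that invents a fresh identifier for the existentially quantified X. The sketch below is only an illustration of that idea, not a Datalog± engine. The function name `create_tableboxes`, the `box1`, `box2`, … naming scheme and the sample facts are assumptions.

```python
from itertools import count


def create_tableboxes(tables, same_color, neighbour_right):
    """Apply the object-creation rule once for every matching pair (T1, T2)."""
    fresh = count(1)
    tablebox, contains = set(), set()
    for (t1, t2) in sorted(neighbour_right):
        if t1 in tables and t2 in tables and (t1, t2) in same_color:
            x = f"box{next(fresh)}"  # fresh object standing in for the existential X
            tablebox.add(x)
            contains.add((x, t1))
            contains.add((x, t2))
    return tablebox, contains


# Hypothetical facts for the PRODUCT / PRICE example.
boxes, contains = create_tableboxes(
    tables={"T1", "T2"},
    same_color={("T1", "T2")},
    neighbour_right={("T1", "T2")},
)
print(boxes, contains)
```

Applying such rules repeatedly may keep creating new objects without end, which is the intuition behind the undecidability remark and the motivation for syntactic restrictions such as guardedness.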

Datalog±

Family of languages.

Incorporates ontological reasoning (> DL-Lite)

Further research is needed to extend it into an ideal language for web objects.

Transitivity:

containedin(T1,T2), containedin(T2,T3) → containedin(T1,T3)

unguarded!
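Guardedness is a purely syntactic test: a rule body is guarded if one of its atoms contains every variable that occurs in the body. A minimal sketch of that check follows. The atom encoding as `(predicate, arguments)` pairs, the convention that variables start with an upper-case letter, and the function names are assumptions of this example.

```python
def body_variables(body):
    """Collect all variables of a rule body (variables start with an upper-case letter)."""
    return {arg for _, args in body for arg in args if arg[:1].isupper()}


def is_guarded(body):
    """True if some body atom (the guard) contains every body variable."""
    variables = body_variables(body)
    return any(variables <= set(args) for _, args in body)


# Body of the object-creation rule: sameColor(T1,T2) acts as a guard.
object_creation = [
    ("table", ("T1",)),
    ("table", ("T2",)),
    ("sameColor", ("T1", "T2")),
    ("isNeighbourRight", ("T1", "T2")),
]

# Body of the transitivity rule: no single atom mentions T1, T2 and T3.
transitivity = [
    ("containedin", ("T1", "T2")),
    ("containedin", ("T2", "T3")),
]

print(is_guarded(object_creation), is_guarded(transitivity))
```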

DL-LITE

DL-LITE Datalog[ ,;Lin]

DL-Lite axiom                Datalog± rule
Professor ⊑ ∃TeachesTo       Professor(x) → ∃y TeachesTo(x,y)
∃TeachesTo⁻ ⊑ Student        TeachesTo(x,y) → Student(y)
HasTutor⁻ ⊑ TeachesTo        HasTutor(x,y) → TeachesTo(y,x)
funct(HasTutor)              HasTutor(x,y) & HasTutor(x,y’) & Neq(y,y’) → ⊥   (always innocuous!)
Professor ⊑ ¬Student         Professor(x) & Student(x) → ⊥

Fragments: DL-Lite_core, DL-Lite_R, DL-Lite_F
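A minimal sketch of how the translated rules behave on data, assuming the right-hand sides are read as Datalog± rules and constraints: HasTutor(x,y) → TeachesTo(y,x), TeachesTo(x,y) → Student(y), the functionality of HasTutor, and the disjointness of Professor and Student. The existential rule for Professor is left out on purpose, since it would require inventing new individuals. The function names and the individuals `ann`, `bob` and `carl` are hypothetical.

```python
def saturate(has_tutor, teaches_to, student):
    """Apply the two rules without existential quantifiers until nothing new follows."""
    # HasTutor(x,y) -> TeachesTo(y,x)
    teaches_to = set(teaches_to) | {(y, x) for (x, y) in has_tutor}
    # TeachesTo(x,y) -> Student(y)
    student = set(student) | {y for (_, y) in teaches_to}
    return teaches_to, student


def violations(has_tutor, professor, student):
    """Return the violations of funct(HasTutor) and of Professor/Student disjointness."""
    funct = {
        (x, y, y2)
        for (x, y) in has_tutor
        for (x2, y2) in has_tutor
        if x == x2 and y < y2
    }
    disjoint = set(professor) & set(student)
    return funct, disjoint


has_tutor = {("ann", "bob")}
teaches_to, student = saturate(has_tutor, set(), set())
print(teaches_to, student)
print(violations(has_tutor, {"bob"}, student))
```

One pass suffices here because neither rule feeds new facts back into HasTutor, and the second rule is applied after the first.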

Low & Intermediate Level Annotation

We will use various existing tools and techniques (rather than re-invent the wheel):

• Named entity recognizers
• Machine learning
• Computational linguistics
• Page layout analysis
• PDF extraction

Extraction from PDF

Tamir Hassan

Navigation & Interaction

OXPath

• Extension of XPath
• Facilitates querying web forms and retrieving returned data
• Simulates a user filling out web forms
• Highly parallelizable (geared towards cloud computing)
• Navigation and collecting data across multiple pages

Result Extraction

..../next-field::*/{“Renting”}/.../{...}/.../{“Submit”}

Atomic results regardless of presentation (list, table, etc.)

/<XQ>

<XQ>: For each atomic result A

Let
    price = A/.../.../text()
    description = A/.../.../../text()
    type = A/.../.../../text()
    ........
Return

    <rental area="Oxford">
        <price> 1,200 </price>
        <bedrooms> 3 </bedrooms>
        <bathrooms> 1 </bathrooms>
        <type> Flat </type>
        <location> George Street, OX1 </location>
        <description> ... </description>
        <otherInfo> Furnished; Long let - more than six months </otherInfo>
        ...
    </rental>

[Figure: the fields price, description, type, otherInfo, bathrooms and location are marked on a sample result.]
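The "for each atomic result, bind fields, return a record" pattern can be sketched with Python's standard `xml.etree.ElementTree`, which supports a small XPath subset. This is not OXPath. The sample markup in `PAGE`, the `class` attribute values and the function name `extract_rentals` are assumptions made for illustration, chosen so that one result is presented as a list item and the other as a table.

```python
import xml.etree.ElementTree as ET

# Hypothetical result page: the same kind of record shown as a list item and as a table.
PAGE = """<results>
  <li class="result"><span class="price">1,200</span><span class="type">Flat</span>
      <p class="description">Furnished</p></li>
  <table class="result"><tr><td class="price">950</td><td class="type">House</td>
      <td class="description">Garden</td></tr></table>
</results>"""

FIELDS = ("price", "type", "description")


def extract_rentals(page_xml, area):
    """Build one <rental> element per atomic result, regardless of its presentation."""
    root = ET.fromstring(page_xml)
    rentals = []
    for result in root.findall(".//*[@class='result']"):
        rental = ET.Element("rental", {"area": area})
        for field in FIELDS:
            node = result.find(f".//*[@class='{field}']")
            if node is not None and node.text:
                ET.SubElement(rental, field).text = node.text.strip()
        rentals.append(rental)
    return rentals


for rental in extract_rentals(PAGE, "Oxford"):
    print(ET.tostring(rental, encoding="unicode"))
```

The field paths are relative to each atomic result, which is what lets the same extraction logic cover list-style and table-style presentations.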
