DIADEM Domain-centric, Intelligent, Automated Data Extraction Tim Furche, Georg Gottlob, Giorgio Orsi May 11th, 2011 @ Oxford University Computing Laboratories joint work with Giovanni Grasso, Omer Gunes, Xiaonan Guo, Andrey Kravchenko, Thomas Lukasiewicz, Christian Schallhart, Andrew Sellers, Gerardo Simaris, Cheng Wang

Diadem 1.0


Page 1: Diadem 1.0


Page 2: Diadem 1.0

3

1

Web Data Extraction

Page 3: Diadem 1.0

4

Section 1: Web Data Extraction

Data on the Web

there is more of it than we can use

the problem is no longer availability, but finding, integrating, analysing, …

Page 4: Diadem 1.0

5

Section 1: Web Data Extraction

Surface vs. Deep Web

deep web: estimated 500 × the size of the surface web

estimated 400 000 deep web databases

What?

Products (stores)

Directories (yellow pages)

Catalogs (libraries)

Public DBs (publications, census, data.gov,…)

Public services (weather, location, …)

Page 5: Diadem 1.0

6

And it’s not just one haystack …

Page 6: Diadem 1.0

8

Page 7: Diadem 1.0

10

Page 8: Diadem 1.0

11

7 bedrooms

5 bedrooms

Page 9: Diadem 1.0

12

Section 1: Web Data Extraction

The Web is more than HTML

Page 10: Diadem 1.0

13

Section 1: Web Data Extraction

Overview

Introducing Web Data Extraction

Scenarios

Why now?

Supervised Web Data Extraction

Unsupervised Web Data Extraction

DIADEM

OPAL

AMBER

OXPath

IVLIA

Datalog±

Page 11: Diadem 1.0

14

1.1

Web Data Extraction: Scenarios

Page 12: Diadem 1.0

15

Section 1: Web Data Extraction

The Need for Web Data Extraction

information

drives business (decision making, trend analysis, …)

available in troves on the internet

but: as HTML made for humans, not as structured data

companies need

product specifications

pricing information

market trends

regulatory information

Page 13: Diadem 1.0

17

keyword search fails

Page 14: Diadem 1.0

18

Section 1: Web Data Extraction

Scenario ➀: Electronics retailer

electronics retailer: online market intelligence

comprehensive overview of the market

daily information on price, shipping costs, trends, product mix

by product, geographical region, or competitor

thousands of products

hundreds of competitors

nowadays: specialised companies

mostly manual, interpolation

large cost

Page 15: Diadem 1.0

19

Section 1: Web Data Extraction

Scenario ➁: Supermarket chain

supermarket chain

competitors’ product prices

special offer or promotion (time sensitive)

new products, product formats & packaging

Page 16: Diadem 1.0

20

Section 1: Web Data Extraction

Scenario ➂: Hotel Agency

online travel agency

best price guarantee

prices of competing agencies

average market price

Page 17: Diadem 1.0

21

Section 1: Web Data Extraction

Scenario ➃: Hedge Fund

house price index

published in regular intervals by national statistics agency

affects share values of various industries

hedge fund

online market intelligence to predict the house price index

Page 18: Diadem 1.0

22

Section 1: Web Data Extraction

And a lot more …

monitor blogs and forums

market intelligence, e.g., complaints, common problems

customer opinions

ranking and analysing product reviews

financial analysts

monitor trends and stats for products of a certain company / category

interest rates from financial institutions

press releases and financial reports

patent search & analysis

Page 19: Diadem 1.0

24

1.2

Web Data Extraction: Why Now?

Page 20: Diadem 1.0

25

Scale

Page 21: Diadem 1.0

26

Applications

Page 22: Diadem 1.0

27

Section 1: Web Data Extraction

How to book a flight?

Page 23: Diadem 1.0

31

Structured Data

Page 24: Diadem 1.0

33

Section 1: Web Data Extraction

Why Web Data Extraction Now?

Why now? Trends

Trend ➊: scale—every business is online

automation at scale

Trend ➋: web applications rather than web documents

automated form filling (deep web navigation)

Trend ➌: structured, common-sense data available

allows more sophisticated automated analysis

also a tool for improved data extraction?

Page 25: Diadem 1.0

34

Web Data Extraction: Supervised

2

Page 26: Diadem 1.0

35

manual: (e.g., Web Harvest)

user writes the wrapper, sometimes using wrapping libraries

supervised: (e.g., Lixto)

user provides examples and refines the wrapper

semi-supervised:

user provides examples (per site), wrapper is automatically learned

unsupervised: entirely automated (e.g., DIADEM)

some systems omit examples and run analysis directly on all pages

some systems automatically guess examples
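To make the "manual" end of this spectrum concrete, the following is a minimal sketch of a hand-written wrapper in Python. The HTML snippet, the class name `price`, and the `PriceWrapper` class are hypothetical illustrations; they are not taken from Web Harvest, Lixto, or any real site.

```python
from html.parser import HTMLParser

# Hypothetical result page; a real wrapper would fetch this from a site.
PAGE = """
<ul>
  <li><span class="addr">1 High St</span> <span class="price">£250,000</span></li>
  <li><span class="addr">2 Broad St</span> <span class="price">£310,000</span></li>
</ul>
"""


class PriceWrapper(HTMLParser):
    """Hand-written wrapper: collects the text of every <span class="price">."""

    def __init__(self):
        super().__init__()
        self.in_price = False
        self.prices = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs
        if tag == "span" and ("class", "price") in attrs:
            self.in_price = True

    def handle_endtag(self, tag):
        if tag == "span":
            self.in_price = False

    def handle_data(self, data):
        if self.in_price:
            self.prices.append(data.strip())


wrapper = PriceWrapper()
wrapper.feed(PAGE)
print(wrapper.prices)
```

A wrapper like this is tied to one site's markup and stops working when that markup changes, which is the per-site cost that supervised and unsupervised approaches try to reduce.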

Page 27: Diadem 1.0

36

Section 2: Supervised Web Data Extraction

Supervised Web Data Extraction

User interaction needed to

record interaction sequences (such as form fillings)

visually select examples for data

rather than manually writing a wrapper in a programming language

Current gold standard for high-accuracy extraction

Examples:

Lixto

Automation Anywhere

Web Harvest

Page 28: Diadem 1.0

38

Page 29: Diadem 1.0

40

Section 2: Supervised Web Data Extraction

Lixto: Extraction & Analysis

Lixto: sophisticated, visual semi-automated extraction tool

visual selection, automatically derived patterns, verification

highly scalable extraction and processing with Lixto server

but also: data integration & business analytics suite

data cleaning

data flow scenarios: merge & filter from different web sites

market intelligence & analytics

Page 30: Diadem 1.0

43

Web Data Extraction: Unsupervised

3

Page 31: Diadem 1.0

44

17 000 real estate sites in the UK alone

Page 32: Diadem 1.0

45

Section 3: Unsupervised Web Data Extraction

Why Automate Data Extraction?

Too many fish in the pond

> 17 000 UK real estate sites

similar for restaurants, travel, airlines, pharmacies, retail shops, …

aggregators cover only a fraction

updated slowly

⇒ per site manual work infeasible

wrapper construction too expensive

tracking changes

excludes manual & (semi-) supervised

Page 33: Diadem 1.0

46

Section 3: Unsupervised Web Data Extraction

Why Automate Data Extraction?

All the fish are different

large, modern aggregators (>100000)

nation-wide agencies (>10000)

agencies for a single quarter (< 15)

⇒ no single unsupervised wrapper can do this today

Page 34: Diadem 1.0

47

Section 3: Unsupervised Web Data Extraction

… and we really need it!

search engine providers (Google, Microsoft, Yahoo!) all work on information and data extraction for “vertical”, “object” and “semantic” search

turn search engines into knowledge bases for decision support

Page 35: Diadem 1.0

48

“no one really has done this successfully at scale yet”

Raghu Ramakrishnan, Yahoo!, March 2009

“Current technologies are not good enough yet to provide what search engines really need. [...] Any successful approach would probably need a combination of knowledge and learning.”

Alon Halevy, Google, Feb. 2009

Page 36: Diadem 1.0

49

Section 3: Unsupervised Web Data Extraction

Unsupervised: The Story so Far

Key observation:

“database” web sites are generated using templates

wrapper generators need to automatically identify templates

Two major approaches

machine learning from a few hand-labeled examples

similar to semi-supervised, but only one set of examples for an entire domain

high precision only for simple domains (single entity type, few attributes)

fully automatically exploit the repeated structure of result pages

good precision needs a lot of data (many records per page, many pages)

doesn’t work for forms (no repetition)
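The second approach, exploiting repeated structure, can be sketched as follows. This is only an illustration under simple assumptions: the page snippet is made up, and "repeated structure" is reduced to counting how often each tag path occurs. Published template-discovery techniques are considerably more involved.

```python
from collections import Counter
from html.parser import HTMLParser

# Hypothetical result page with three records and some surrounding content.
PAGE = """
<html><body>
  <h1>Results</h1>
  <div class="result"><a>Flat A</a><span>£900</span></div>
  <div class="result"><a>Flat B</a><span>£1,100</span></div>
  <div class="result"><a>Flat C</a><span>£750</span></div>
  <p>Footer</p>
</body></html>
"""


class PathCounter(HTMLParser):
    """Counts how often each root-to-element tag path occurs."""

    def __init__(self):
        super().__init__()
        self.stack = []
        self.counts = Counter()

    def handle_starttag(self, tag, attrs):
        self.stack.append(tag)
        self.counts["/".join(self.stack)] += 1

    def handle_endtag(self, tag):
        if self.stack and self.stack[-1] == tag:
            self.stack.pop()


counter = PathCounter()
counter.feed(PAGE)

# Paths that occur more than once are candidates for template-generated content.
repeated = {path: n for path, n in counter.counts.items() if n > 1}
# Heuristic (an assumption of this sketch): the shortest repeated path is the record container.
record_path = min(repeated, key=len, default=None)
print(repeated)
print(record_path)
```

On a page without repetition, such as a search form, `repeated` is empty and `record_path` is `None`, which mirrors the limitation noted above.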

Page 37: Diadem 1.0
Page 38: Diadem 1.0

51

?

Page 39: Diadem 1.0

52

4

DIADEM

Page 40: Diadem 1.0

53

Section 4: DIADEM

Domain-Centric Data Extraction

Blackbox analyser that

turns any of the thousands of websites of a domain

into structured data

Page 41: Diadem 1.0

54

host of domain specific annotators

Page 42: Diadem 1.0

55

domain ontology & phenomenology

Page 43: Diadem 1.0

56

+ everything the others are doing

machine learning for classification

template discovery

Page 44: Diadem 1.0

57

Page 45: Diadem 1.0

58

Page 46: Diadem 1.0

59

Section 4: DIADEM

DIADEM: Overview

DIADEM combines

host of domain-specific annotators with

gives us a first “guess” to automatically generate examples

high-level ontology about domain entities and

their phenomenology on web sites of the domain

allows us to verify & refine examples

+ advances in existing techniques for

repeated structure analysis

page & block classification

bottom-up understanding & top-down reasoning
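The interplay of annotators (first guess) and ontology (verification) can be illustrated with a small sketch. Everything here is an assumption made for illustration: the regular expressions, the `is_property_record` constraint, and the sample text blocks are hypothetical and do not reproduce DIADEM's actual annotators or ontology.

```python
import re

# Hypothetical domain-specific annotators for the real-estate domain.
ANNOTATORS = {
    "price": re.compile(r"£\d[\d,]*"),
    "bedrooms": re.compile(r"(\d+)\s+bedrooms?"),
}


def annotate(text):
    """Bottom-up step: return every annotation type found in a text block."""
    return {name: rx.findall(text) for name, rx in ANNOTATORS.items() if rx.search(text)}


def is_property_record(annotations):
    """Top-down step (assumed constraint): exactly one price, at most one bedroom count."""
    return (len(annotations.get("price", [])) == 1
            and len(annotations.get("bedrooms", [])) <= 1)


blocks = [
    "7 bedrooms, Woodstock Road, £2,500,000",
    "5 bedrooms detached house £1,200,000",
    "Call us today for a free valuation",
    "Prices from £100,000 to £900,000",
]

# Annotators guess candidate examples; the constraint filters out implausible ones.
records = [b for b in blocks if is_property_record(annotate(b))]
print(records)
```

The last block contains prices but is rejected because a single record should not carry two of them; this is the kind of check that domain knowledge makes possible without hand-labelled examples.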

Page 47: Diadem 1.0

60

4.1

DEMO

Page 48: Diadem 1.0

61

Page 49: Diadem 1.0

62

DIADEM 0.1

First prototype

Page 50: Diadem 1.0

63

Page 51: Diadem 1.0

69

Page 52: Diadem 1.0

70

OPAL: Ontologies for Form Analysis

4.2

Page 53: Diadem 1.0

71

Page 54: Diadem 1.0

72

Diversity

Page 55: Diadem 1.0

74

Section 4: DIADEM » OPAL

OPAL: Overview

Three step process:

browser extraction and annotation

labelling & segmentation

classification (phenomenological mapping)

Model-based, knowledge driven

latter two steps are model transformations

thin layer of domain-dependent concepts

field types and labels

triggers for field & form creation
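The final classification step, mapping labelled fields to domain concepts, might look roughly like the sketch below. The keyword table, the field identifiers, and the `classify` function are hypothetical; the sketch assumes the earlier steps have already paired each form field with its label text and says nothing about how OPAL actually does this.

```python
# Hypothetical thin layer of domain-dependent concepts for a real-estate search form:
# each field type is triggered by keywords in the field's label.
FIELD_TYPES = {
    "location": ("location", "town", "postcode", "area"),
    "min_price": ("min price", "price from"),
    "max_price": ("max price", "price to"),
    "bedrooms": ("bedroom", "beds"),
}


def classify(label):
    """Return the first field type whose keywords occur in the label, else None."""
    text = label.lower()
    for field_type, keywords in FIELD_TYPES.items():
        if any(keyword in text for keyword in keywords):
            return field_type
    return None


# Assumed output of the labelling & segmentation step: (field id, label text).
labelled_fields = [
    ("f1", "Town or postcode"),
    ("f2", "Min price"),
    ("f3", "Max price"),
    ("f4", "No. of bedrooms"),
    ("f5", "Keywords"),
]

form_model = {field_id: classify(label) for field_id, label in labelled_fields}
print(form_model)
```

Only the `FIELD_TYPES` table is domain-specific in this sketch; moving to another domain would mean replacing that table rather than the code around it.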

Page 56: Diadem 1.0

75

Page 57: Diadem 1.0

77

Page 58: Diadem 1.0

78

Page 59: Diadem 1.0

79

ICQ Data Set: Application to Other Domains

Page 60: Diadem 1.0

80

AMBER: Ontologies for Record Extraction

4.3

Page 61: Diadem 1.0

81

7 bedrooms

5 bedrooms

Page 62: Diadem 1.0

82

just the opposite of OPAL

Page 63: Diadem 1.0

83

AMBER: Overview

Three step process like OPAL

browser extraction and annotation

classification (phenomenological mapping)

record segmentation (much harder than in OPAL)

Model-based, knowledge driven

latter two steps are model transformations

thin layer of domain-dependent concepts

record and attribute types

triggers for record & attribute creation

Section 4: DIADEM » AMBER

Page 64: Diadem 1.0

84

Page 65: Diadem 1.0

85

Page 66: Diadem 1.0

86

Repeating

Page 67: Diadem 1.0

87

Similarity

Page 68: Diadem 1.0

88

Page 69: Diadem 1.0

89

OXPath: Scalable, Memory-Efficient Web Extraction

4.4

Page 70: Diadem 1.0

90

How to book a flight?

Section 4: DIADEM » OXPath

Page 71: Diadem 1.0

92

How to find a flat?

Section 4: DIADEM » OXPath

Page 72: Diadem 1.0

94

How to find a flat with OXPath

Start at rightmove.co.uk:
doc("rightmove.co.uk")

Fill “oxford” into the first visible field:
/descendant::field()[1]/{"oxford"}

Click on the second next button:
/following::field()[2]/{click /}

On the refinement form just continue by clicking on the last field:
/descendant::field()[last()]/{click /}

Grab all the prices:
//p.price

Section 4: DIADEM » OXPath
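The steps above are shown one at a time. Assuming they compose by plain concatenation, the way XPath location steps do, the whole extraction becomes a single expression. The short Python snippet below only assembles that string; it does not evaluate OXPath, and the `steps` list is simply a transcription of the slide.

```python
# The five steps from the slide, in order.
steps = [
    'doc("rightmove.co.uk")',                  # start page
    '/descendant::field()[1]/{"oxford"}',      # fill the first visible field
    '/following::field()[2]/{click /}',        # click the second following field
    '/descendant::field()[last()]/{click /}',  # continue past the refinement form
    '//p.price',                               # select all prices
]

# Assumption: steps chain by concatenation, as in XPath.
expression = "".join(steps)
print(expression)
```

Navigation, form filling, and selection live in one declarative path here, in contrast to the imperative scripts with external flow control described on the following slide.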

Page 73: Diadem 1.0

95

State of Web Extraction

No interaction with rich, scripted interfaces

no actions other than form filling and submission

➀ Imperative extraction scripts

explicit variable assignments, flow control, etc.

either proprietary selection language or mix of XPath & external flow control

➁ Focus on automation and visual interfaces

no or very limited extraction language, only ad-hoc extractions

no multiway navigation, no optimization

Section 4: DIADEM » OXPath

Page 74: Diadem 1.0

98

Summary of Complexity

Section 4: DIADEM » OXPath

                               Time                 Space
OXPath w/o Actions & Kleene    O( n^6⋅q^2 )         O( n^5⋅q^2 )
OXPath w/o Kleene              O( (p⋅n)^6⋅q^3 )     O( n^5⋅q^3 )
OXPath w/o unbounded Kleene    O( (p⋅n)^6⋅q^3 )     O( n^5⋅q∑3 )
OXPath (full)                  O( (p⋅n)^6⋅q^3 )     O( n^5⋅(q+d)^3 )
                               O( n^4⋅q^2 )         O( n^3⋅q^2 )
Combined:                      PTime-hard           PTime-hard
Data:                          NLogSpace            LogSpace

Extraction marker = n-ary, nested queries

Contextual actions (action free prefix)

Actions = multiple pages

Buffer bounded by page depth

Page 75: Diadem 1.0

99

Constant Memory

Page 76: Diadem 1.0

105

even faster

Page 77: Diadem 1.0

106

4.5

IVLIA: Ontologies for PDF Extraction

Page 78: Diadem 1.0

107

Page 79: Diadem 1.0

108

PDF Analysis

Section 4: DIADEM » IVLIA

Page 80: Diadem 1.0

109

Semantic Analysis and Annotation

Section 4: DIADEM » IVLIA

Page 81: Diadem 1.0

110

Datalog±: Ontological Reasoning at Web Scale

4.6

Page 82: Diadem 1.0

113

Relational Schema

person(ssn, name, birthdate)
employee(ssn, empID, name, birthdate, department)
department(depName, building)
project(projID, startDate, duration)
supervision(supervisor, supervised)
assignment(employee, project)

E/R Schema Object Relational Schema

Ontological Databases

Section 4: DIADEM » Datalog±

Page 83: Diadem 1.0

114

Taxonomy Definitions

employee(X,Y,Z,W) → ∃V person(V,Y,Z)

project(X,Y,Z) → activity(X,Y,Z)

Concept Definitions

An employee who supervises another employee is a supervisor

employee(X1,Y1,Z1,W1,U1), supervision(Y1,Y2), employee(X2,Y2,Z2,W2,U2) → supervisor(X1,Y1,Z1,W1,U1)

A general manager supervises him/herself

generalManager(X1,Y1,Z1,W1,U1) → supervision(Y1,Y1)

Ontological Constraints

Section 4: DIADEM » Datalog±
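How such constraints derive new facts can be sketched with a naive forward-chaining loop. This is an illustration only, not a Datalog± engine: the sample facts are invented, the `chase` function is a hypothetical helper, and only the three rules without an existential quantifier are covered. The rule with ∃V is left out because it would need a fresh placeholder value (a labelled null) for V.

```python
def chase(facts):
    """Apply the rules repeatedly until no new fact is derived (naive fixpoint)."""
    facts = set(facts)
    while True:
        derived = set()
        employees = [f for f in facts if f[0] == "employee"]
        employee_ids = {e[2] for e in employees}  # second attribute: empID
        for f in facts:
            if f[0] == "project":
                # project(X,Y,Z) -> activity(X,Y,Z)
                derived.add(("activity",) + f[1:])
            elif f[0] == "generalManager":
                # generalManager(X1,Y1,...) -> supervision(Y1,Y1)
                derived.add(("supervision", f[2], f[2]))
            elif f[0] == "supervision" and f[2] in employee_ids:
                # employee(..Y1..), supervision(Y1,Y2), employee(..Y2..) -> supervisor(..Y1..)
                derived.update(("supervisor",) + e[1:] for e in employees if e[2] == f[1])
        if derived <= facts:
            return facts
        facts |= derived


# Invented sample database following the relational schema of the previous slide.
database = {
    ("employee", "111", "e1", "Ada", "1980-01-01", "R&D"),
    ("employee", "222", "e2", "Bob", "1990-02-02", "R&D"),
    ("supervision", "e1", "e2"),
    ("generalManager", "333", "e3", "Cy", "1970-03-03", "Board"),
    ("project", "p1", "2011-05-11", 12),
}

result = chase(database)
for fact in sorted(result - database, key=str):
    print(fact)
```

Ada becomes a supervisor because she supervises another employee. The general manager gets a self-supervision fact but is not derived as a supervisor, since none of the rules shown makes a general manager an employee.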

Page 84: Diadem 1.0

115

efficiency

KR

expressiveness

expressiveness

DB

efficiency

Big Picture

Page 85: Diadem 1.0

116

Big Picture

Page 86: Diadem 1.0

123

Page 87: Diadem 1.0

Q&A

diadem-project.info