DIADEM
Domain-centric, Intelligent, Automated Data Extraction
Tim Furche, Georg Gottlob, Giorgio Orsi
May 11th, 2011 @ Oxford University Computing Laboratories
joint work with Giovanni Grasso, Omer Gunes, Xiaonan Guo, Andrey Kravchenko, Thomas Lukasiewicz, Christian Schallhart, Andrew Sellers, Gerardo Simaris, Cheng Wang
1
Web Data Extraction
Section 1: Web Data Extraction
Data on the Web
there is more of it than we can use
no longer availability, but finding, integrating, analysing, …
Surface vs. Deep Web
estimated 500 × surface web
estimated 400 000 deep web databases
What?
Products (stores)
Directories (yellow pages)
Catalogs (libraries)
Public DBs (publications, census, data.gov,…)
Public services (weather, location, …)
And it’s not just one haystack …
7 bedrooms
5 bedrooms
The Web is more than HTML
Overview
Introducing Web Data Extraction
Scenarios
Why now?
Supervised Web Data Extraction
Unsupervised Web Data Extraction
DIADEM
OPAL
AMBER
OXPath
IVLIA
Datalog±
1.1
Web Data Extraction: Scenarios
The Need for Web Data Extraction
information
drives business (decision making, trend analysis, …)
available in troves on the internet
but: as HTML made for humans, not as structured data
companies need
product specifications
pricing information
market trends
regulatory information
keyword search fails
Scenario ➀: Electronics retailer
electronics retailer: online market intelligence
comprehensive overview of the market
daily information on price, shipping costs, trends, product mix
by product, geographical region, or competitor
thousands of products
hundreds of competitors
nowadays: specialised companies
mostly manual, interpolation
large cost
Scenario ➁: Supermarket chain
supermarket chain
competitors’ product prices
special offer or promotion (time sensitive)
new products, product formats & packaging
Scenario ➂: Hotel Agency
online travel agency
best price guarantee
prices of competing agencies
average market price
Scenario ➃: Hedge Fund
house price index
published in regular intervals by national statistics agency
affects share values of various industries
hedge fund
online market intelligence to predict the house price index
And a lot more …
monitor blogs and forums
market intelligence, e.g., complaints, common problems
customer opinions
ranking and analysing product reviews
financial analysts
monitor trends and stats for products of a certain company / category
interest rates from financial institutions
press releases and financial reports
patent search & analysis
…
1.2
Web Data Extraction: Why Now?
Scale
Applications
How to book a flight?
Structured Data
Why Web Data Extraction Now?
Why now? Trends
Trend ➊: scale—every business is online
automation at scale
Trend ➋: web applications rather than web documents
automated form filling (deep web navigation)
Trend ➌: structured, common-sense data available
allows more sophisticated automated analysis
also a tool for improved data extraction?
2
Web Data Extraction: Supervised
manual: (e.g., Web Harvest)
user writes the wrapper, sometimes using wrapping libraries
supervised: (e.g., Lixto)
user provides examples and refines the wrapper
semi-supervised:
user provides examples (per site), wrapper is automatically learned
unsupervised: entirely automated (e.g., DIADEM)
some systems omit examples and run analysis directly on all pages
some systems automatically guess examples
Section 2: Supervised Web Data Extraction
Supervised Web Data Extraction
User interaction needed to
record interaction sequences (such as form fillings)
visually select examples for data
rather than manually writing wrappers in a programming language
Current gold standard for high-accuracy extraction
Examples:
Lixto
Automation Anywhere
Web Harvest
…
Lixto: Extraction & Analysis
Lixto: sophisticated, visual semi-automated extraction tool
visual selection, automatic pattern derivation, verification
highly scalable extraction and processing with Lixto server
but also: data integration & business analytics suite
data cleaning
data flow scenarios: merge & filter from different web sites
market intelligence & analytics
3
Web Data Extraction: Unsupervised
17 000 real estate sites in the UK alone
Section 3: Unsupervised Web Data Extraction
Why Automate Data Extraction?
Too many fish in the pond
> 17 000 UK real estate sites
similar for restaurants, travel, airlines, pharmacies, retail shops, …
aggregators cover only a fraction
updated slowly
⇒ per-site manual work infeasible
wrapper construction too expensive
tracking changes
excludes manual & (semi-)supervised approaches
Why Automate Data Extraction?
All the fish are different
large, modern aggregators (>100000)
nation-wide agencies (>10000)
agencies for a single quarter (< 15)
⇒ no single unsupervised wrapper can do this today
… and we really need it!
search engine providers (Google, Microsoft, Yahoo!) all work on
information and data extraction for
“vertical”, “object” and “semantic” search
turn search engines into knowledge bases for decision support
“no one really has done this successfully at scale yet”
Raghu Ramakrishnan, Yahoo!, March 2009
“Current technologies are not good enough yet to provide what search engines really need. [...] Any successful approach would probably need a combination of knowledge and learning.”
Alon Halevy, Google, Feb. 2009
Unsupervised: The Story so Far
Key observation:
“database” web sites are generated using templates
wrapper generators need to automatically identify templates
Two major approaches
machine learning from a few hand-labeled examples
similar to semi-supervised, but only one set of examples for an entire domain
high precision only for simple domains (single entity type, few attributes)
fully automatically exploit the repeated structure of result pages
good precision needs a lot of data (many records per page, many pages)
doesn’t work for forms (no repetition)
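The second approach, exploiting the repeated structure of result pages, can be illustrated with a minimal sketch. It is not the method of any system named in these slides: it simply counts how often each root-to-element tag path occurs and treats frequent paths as record candidates. The sample page, the class names, and the `min_count` threshold are all assumptions made for illustration.

```python
from collections import Counter
from html.parser import HTMLParser


class PathCollector(HTMLParser):
    """Records the root-to-element tag path of every start tag."""

    VOID = {"br", "img", "hr", "input", "meta", "link"}  # never closed

    def __init__(self):
        super().__init__()
        self.stack = []  # steps of the currently open elements
        self.paths = []  # one path string per element seen

    def handle_starttag(self, tag, attrs):
        css_class = dict(attrs).get("class") or ""
        step = f"{tag}.{css_class}" if css_class else tag
        self.paths.append("/".join(self.stack + [step]))
        if tag not in self.VOID:
            self.stack.append(step)

    def handle_endtag(self, tag):
        # Close the innermost open element with this tag name, if any.
        for i in range(len(self.stack) - 1, -1, -1):
            if self.stack[i].split(".")[0] == tag:
                del self.stack[i:]
                break


def repeated_paths(html, min_count=3):
    """Return the tag paths that occur at least min_count times."""
    collector = PathCollector()
    collector.feed(html)
    counts = Counter(collector.paths)
    return {path: n for path, n in counts.items() if n >= min_count}


# Hypothetical result page with three records and one navigation block.
PAGE = """
<html><body>
<div class="nav"><a href="/">Home</a></div>
<ul>
  <li class="result"><span class="price">£250,000</span></li>
  <li class="result"><span class="price">£310,000</span></li>
  <li class="result"><span class="price">£199,950</span></li>
</ul>
</body></html>
"""

candidates = repeated_paths(PAGE)
print(candidates)
```

The sketch also shows why the approach needs many records per page and fails on forms: a page with a single record, or a form whose fields all differ, produces no path above the threshold.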
?
4
DIADEM
Section 4: DIADEM
Domain-Centric Data Extraction
Blackbox analyser that
turns any of the thousands of websites of a domain
into structured data
host of domain specific annotators
domain ontology & phenomenology
+ everything the others are doing
machine learning for classification
template discovery
DIADEM: Overview
DIADEM combines
host of domain-specific annotators with
gives us a first “guess” to automatically generate examples
high-level ontology about domain entities and
their phenomenology on web sites of the domain
allows us to verify & refine examples
+ advances in existing techniques for
repeated structure analysis
page & block classification
bottom-up understanding & top-down reasoning
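The interplay of annotators and ontology described above can be sketched roughly as follows. This is an illustrative toy, not DIADEM's implementation: the regular expressions, the `ONTOLOGY` dictionary, and the rule that each attribute may occur only once per record are all assumptions chosen to keep the example small.

```python
import re

# Hypothetical domain annotators for real estate: each gives a first
# "guess" at the instances of one attribute type.
ANNOTATORS = {
    "price": re.compile(r"£\s?\d{1,3}(?:,\d{3})+"),
    "bedrooms": re.compile(r"\b\d+\s+bedrooms?\b", re.IGNORECASE),
}

# Hypothetical ontology fragment: which attributes a record of the
# entity type "property" must have and which it may have.
ONTOLOGY = {"property": {"mandatory": {"price"}, "optional": {"bedrooms"}}}


def annotate(text):
    """Return {attribute type: [matched strings]} for one text block."""
    found = {}
    for name, pattern in ANNOTATORS.items():
        matches = [m.group(0) for m in pattern.finditer(text)]
        if matches:
            found[name] = matches
    return found


def is_plausible(annotations, entity="property"):
    """Keep a block as an example only if the ontology allows it."""
    spec = ONTOLOGY[entity]
    present = set(annotations)
    has_mandatory = spec["mandatory"] <= present
    only_known = present <= spec["mandatory"] | spec["optional"]
    single_valued = all(len(v) == 1 for v in annotations.values())
    return has_mandatory and only_known and single_valued


blocks = [
    "Lovely 5 bedroom house, £450,000",
    "Call us on 01865 000000",
    "From £200,000 to £900,000",
]
examples = [b for b in blocks if is_plausible(annotate(b))]
print(examples)
```

Only the first block survives: the second has no price, and the third carries two prices, which the assumed single-value rule rejects as an unlikely record.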
4.1
DEMO
DIADEM 0.1: First prototype
4.2
OPAL: Ontologies for Form Analysis
Diversity
Section 4: DIADEM » OPAL
OPAL: Overview
Three step process:
browser extraction and annotation
labelling & segmentation
classification (phenomenological mapping)
Model-based, knowledge driven
latter two steps are model transformations
thin layer of domain-dependent concepts
field types and labels
triggers for field & form creation
ICQ Data Set: Application to Other Domains
4.3
AMBER: Ontologies for Record Extraction
7 bedrooms
5 bedrooms
just the opposite of OPAL
AMBER: Overview
Three step process like OPAL
browser extraction and annotation
classification (phenomenological mapping)
record segmentation (much harder than in OPAL)
Model-based, knowledge driven
latter two steps are model transformations
thin layer of domain-dependent concepts
record and attribute types
triggers for record & attribute creation
Section 4: DIADEM » AMBER
Repeating
Similarity
4.4
OXPath: Scalable, Memory-Efficient Web Extraction
How to book a flight?
Section 4: DIADEM » OXPath
How to find a flat?
How to find a flat with OXPath
Start at rightmove.co.uk: doc("rightmove.co.uk")
Fill "oxford" into the first visible field: /descendant::field()[1]/{"oxford"}
Click on the second following button: /following::field()[2]/{click /}
On the refinement form just continue by clicking on the last field: /descendant::field()[last()]/{click /}
Grab all the prices: //p.price
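The steps above read like the segments of a single path expression. The sketch below only assembles them into one string. That the steps chain by plain concatenation, as location steps do in XPath, is an assumption of this sketch; nothing here evaluates the expression or calls an OXPath engine.

```python
# Step expressions copied from the slide, in order. Joining them by
# concatenation is an assumption, not something the slide states.
STEPS = [
    'doc("rightmove.co.uk")',
    '/descendant::field()[1]/{"oxford"}',
    '/following::field()[2]/{click /}',
    '/descendant::field()[last()]/{click /}',
    '//p.price',
]


def compose(steps):
    """Join the individual steps into one path expression string."""
    return "".join(steps)


expression = compose(STEPS)
print(expression)
```

Written this way, navigation, form filling, clicking, and extraction sit in one declarative expression, with no variables or explicit flow control between the steps.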
State of Web Extraction
No interaction with rich, scripted interfaces
no actions other than form filling and submission
➀ Imperative extraction scripts
explicit variable assignments, flow control, etc.
either proprietary selection language or mix of XPath & external flow control
➁ Focus on automation and visual interfaces
no or very limited extraction language, only ad-hoc extractions
no multiway navigation, no optimization
Summary of Complexity
                               Time              Space
OXPath w/o Actions & Kleene    O(n^6⋅q^2)        O(n^5⋅q^2)
OXPath w/o Kleene              O((p⋅n)^6⋅q^3)    O(n^5⋅q^3)
OXPath w/o unbounded Kleene    O((p⋅n)^6⋅q^3)    O(n^5⋅q∑^3)
OXPath (full)                  O((p⋅n)^6⋅q^3)    O(n^5⋅(q+d)^3)
                               O(n^4⋅q^2)        O(n^3⋅q^2)
Combined:                      PTime-hard        PTime-hard
Data:                          NLogSpace         LogSpace
Extraction marker = n-ary, nested queries
Contextual actions (action free prefix)
Actions = multiple pages
Buffer bounded by page depth
Constant Memory
even faster
4.5
IVLIA: Ontologies for PDF Extraction
PDF Analysis
Section 4: DIADEM » IVLIA
Semantic Analysis and Annotation
4.6
Datalog±: Ontological Reasoning at Web Scale
Relational Schema
person(ssn, name, birthdate)
employee(ssn, empID, name, birthdate, department)
department(depName, building)
project(projID, startDate, duration)
supervision(supervisor, supervised)
assignment(employee, project)
E/R Schema Object Relational Schema
Ontological Databases
Section 4: DIADEM » Datalog±
Ontological Constraints
Taxonomy Definitions
employee(X,Y,Z,W) → ∃V person(V,Y,Z)
project(X,Y,Z) → activity(X,Y,Z)
Concept Definitions
An employee who supervises another employee is a supervisor:
employee(X1,Y1,Z1,W1,U1), supervision(Y1,Y2), employee(X2,Y2,Z2,W2,U2) → supervisor(X1,Y1,Z1,W1,U1)
A general manager supervises him/herself:
generalManager(X1,Y1,Z1,W1,U1) → supervision(Y1,Y1)
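Rules with an existential variable in the head, such as employee(X,Y,Z,W) → ∃V person(V,Y,Z), are usually given meaning through the chase: when no matching fact exists yet, one is added with a fresh labelled null in place of the unknown value. The sketch below hard-codes the two taxonomy rules and applies them until nothing changes. The fact encoding, the sample values, and the `_:n` naming of nulls are illustrative assumptions, and this is a toy, not a Datalog± engine.

```python
from itertools import count

_null_ids = count(1)


def fresh_null():
    """Create a new labelled null standing for an unknown value."""
    return f"_:n{next(_null_ids)}"


def chase_step(facts):
    """Apply both taxonomy rules once to every matching fact."""
    derived = set(facts)
    for pred, args in facts:
        if pred == "employee" and len(args) == 4:
            # employee(X,Y,Z,W) → ∃V person(V,Y,Z)
            _, y, z, _ = args
            witnessed = any(
                p == "person" and a[1:] == (y, z) for p, a in derived
            )
            if not witnessed:  # only invent V if no person fits yet
                derived.add(("person", (fresh_null(), y, z)))
        elif pred == "project" and len(args) == 3:
            # project(X,Y,Z) → activity(X,Y,Z)
            derived.add(("activity", args))
    return derived


def chase(facts):
    """Repeat chase steps until a fixpoint is reached."""
    while True:
        extended = chase_step(facts)
        if extended == facts:
            return facts
        facts = extended


# Hypothetical facts: (predicate, argument tuple).
facts = {
    ("employee", ("e1", "Ada", "1980-01-01", "R&D")),
    ("project", ("p1", "2011-05-11", "12 months")),
}
result = chase(facts)
for fact in sorted(result):
    print(fact)
```

The check for an existing witness is what keeps this toy chase from adding a new person fact on every round; for these two rules it terminates after the second pass.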
Big Picture
[Figure: KR and DB compared by efficiency and expressiveness]