Large-Scale Deep Web Integration:
Exploring and Querying Structured Data on the Deep Web
Kevin C. Chang
Tutorial in SIGMOD’06
2
Still challenges on the Web?
Google is only the start of search (and MSN will not be the end of it).
3
Structured Data--- Prevalent but ignored!
4
Challenges on the Web come in “dual”: getting access to the structured information!
[Figure: Kevin’s 4 quadrants, labeled Access, Structure, Surface Web, Deep Web]
5
Tutorial Focus: Large-scale integration of structured data over the Deep Web
That is: search-flavored integration.
Disclaimer: what it is not:
  Small-scale, pre-configured, mediated-querying settings
    many related techniques; some we will relate today
  Text databases (or meta-search)
    several related but “text-oriented” issues in meta-search, e.g., Stanford, Columbia, UIC
    more in the IR community (distributed IR)
And, never a “complete” bibliography!!
  “Web Integration” bibliography: http://metaquerier.cs.uiuc.edu/
Finally, no intention to “finish” this tutorial.
6
Evidence in beta: Google Base.
7
When Google speaks up… “What is an ‘Attribute’,” says Google!
8
And things are indeed happening!
9
10
11
The Deep Web: Databases on the Web
12
The previous Web: Search used to be “crawl and index”
13
The current Web: Search must eventually resort to integration
14
How to enable effective access to the deep Web?
Cars.com, Amazon.com, Apartments.com, Biography.com, 401carfinder.com, 411localte.com
15
Survey the frontier: BrightPlanet.com, March 2000 [Bergman00]
Overlap analysis of search engines: $N = \frac{n_a \cdot n_b}{n_0}$
“Search sites” not clearly defined.
Estimated 43,000 – 96,000 deep Web sites.
Content size 500 times that of surface Web.
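The overlap estimate above can be sketched in a few lines. This is an illustration only: it assumes the usual capture-recapture reading of the formula, where n_a and n_b are the numbers of sites listed by two search engines and n_0 is the number listed by both. The function name and the site lists are hypothetical and not taken from the survey.

```python
def overlap_estimate(n_a: int, n_b: int, n_0: int) -> float:
    """Estimate the total population N from two listings and their overlap."""
    if n_0 <= 0:
        raise ValueError("the overlap n_0 must be positive to estimate N")
    return n_a * n_b / n_0


# Hypothetical listings, for illustration only.
sites_a = {f"site{i}" for i in range(0, 60)}    # engine A lists 60 sites
sites_b = {f"site{i}" for i in range(40, 120)}  # engine B lists 80 sites
n_0 = len(sites_a & sites_b)                    # 20 sites listed by both

estimate = overlap_estimate(len(sites_a), len(sites_b), n_0)
print(estimate)  # 240.0
```

The smaller the overlap relative to the two listings, the larger the estimated population, which is why the estimate is sensitive to how “search sites” are defined.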
16
Survey the frontier: UIUC MetaQuerier, April 2004 [ChangHL+04]
Macro: Deep Web at large
  Data: Automatically sampled 1 million IPs
Micro: per-source specific characteristics
  Data: Manually collected sources
    8 representative domains, 494 sources
    Airfare (53), Autos (102), Books (69), CarRentals (24), Hotels (38), Jobs (55), Movies (78), MusicRecords (75)
    Available at http://metaquerier.cs.uiuc.edu/repository
17
They wanted to observe…
How many deep-Web sources are out there?
  “The dot-com bust has brought down DBs on the Web.”
How many structured databases?
  “There are just (or, much more) text databases.”
How do search engines cover them?
  “Google does it all.” Or, “InvisibleWeb.com does it all.”
How hidden are they?
  “It is the hidden Web.”
How complex are they?
  “Queries on the Web are much simpler, even trivial.”
  “Coping with semantics is hopeless. Let’s just wait till the semantic Web.”
18
And their results are…
How many deep-Web sources are out there?
  307,000 sites, 450,000 DBs, 1,258,000 interfaces.
How many structured databases?
  348,000 (structured) : 102,000 (text), roughly 3 : 1
How do search engines cover them?
  Google covered 5% fresh and 21% stale objects.
  InvisibleWeb.com covered 7.8% of sources.
How hidden are they?
  CarRental (0%) > Airfares (~4%) > … > MusicRec > Books > Movies (80%+)
How complex are they?
  “Amazon effects”
19
Reported the “Amazon effect”…
Attributes converge in a domain!
Condition patterns converge even across domains!
20
Google’s Recent Survey [courtesy Jayant Madhavan]
21
Driving Force: The Large Scale
22
Circa 2000: Example System: Information Agents [MichalowskiAKMTT04, Knoblock03]
23
Circa 2000: Example System: Comparison Shopping Engines [GuptaHR97]
Virtual Database
24
System: Example
Applications
25
Vertical Search Engines: “Warehousing” approach, e.g., Libra Academic Search [NieZW+05] (courtesy MSRA)
Integrating information from multiple types of sources
Ranking papers, conferences, and authors for a given query
Handling structured queries
[Figure: source types: Web databases, PS and DOC files, journal homepages, author homepages, conference homepages]
26
On-the-fly Meta-querying Systems, e.g., WISE [HeMYW03], MetaQuerier [ChangHZ05]
[Figure: MetaQuerier@UIUC, labeled FIND sources, QUERY sources, db of dbs, unified query interface; example sources Amazon.com, Cars.com, 411localte.com, Apartments.com]
27
What needs to be done? Technical Challenges:
  Source Modeling & Selection
  Schema Matching
  Source Querying, Crawling, and Object Ranking
  Data Extraction
  System Integration
28
The Problems: Technical Challenges
29
Technical Challenges
1. Source Modeling & Selection
How to describe a source and find right sources for query answering?
30
Source Modeling: Circa 2000
Focus: Design of expressive model mechanism.
Techniques:
  View-based mechanisms: answering queries using views, LAV, GAV (see [Halevy01] for survey).
  Hierarchical or layered representations for modeling in-site navigations ([KnoblockMA+98], [DavulcuFK+99]).
31
Source Modeling & Selection: for Large Scale Integration
Focus: Discovery of sources.
  Focused crawling to collect query interfaces [BarbosaF05, ChangHZ05].
Focus: Extraction of source models.
  Hidden grammar-based parsing [ZhangHC04].
  Proximity-based extraction [HeMY+04].
  Classification to align with given taxonomy [HessK03, Kushmerick03].
Focus: Organization of sources and query routing.
  Offline clustering [HeTC04, PengMH+04].
  Online search for query routing [KabraLC05].
32
Form Extraction: the Problem
Output all the conditions; for each:
  Grouping elements (into query conditions)
  Tagging elements with their “semantic roles”: attribute, operator, value
33
Form Extraction: Parsing Approach [ZhangHC04]
Observation: Interfaces share “patterns” of presentation.
Hypothesis: A hidden syntactic model exists?
Now, the problem: Given , how to find ?
[Figure: Grammar, Interface Creation, query capabilities]
34
Best-Effort Visual Language Parsing Framework
[Figure: Input: HTML query form; Tokenizer; Layout Engine; BE-Parser with Ambiguity Resolution and Error Handling; 2P Grammar (Productions, Preferences); Output: semantic structure]
35
Form Extraction: Clustering Approach [HessK03, Kushmerick03]
Concept: A form as a Bayesian network.
Training: Estimate the Bayesian probabilities.
Classification: Max-likelihood predictions given terms.
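As a rough illustration of the train-then-classify idea, the sketch below uses a naive Bayes classifier, which is the simplest special case of a Bayesian network and is swapped in here for brevity; it is not the model of the cited papers. The field labels, terms, and function names are all hypothetical.

```python
import math
from collections import Counter, defaultdict


def train(examples):
    """Count label and term frequencies from (terms, label) training pairs."""
    label_counts = Counter()
    term_counts = defaultdict(Counter)
    vocab = set()
    for terms, label in examples:
        label_counts[label] += 1
        for term in terms:
            term_counts[label][term] += 1
            vocab.add(term)
    return label_counts, term_counts, vocab


def classify(terms, model):
    """Return the label with the highest (Laplace-smoothed) log probability."""
    label_counts, term_counts, vocab = model
    total = sum(label_counts.values())
    best_label, best_score = None, -math.inf
    for label, count in label_counts.items():
        score = math.log(count / total)  # prior; uniform in the example below
        denom = sum(term_counts[label].values()) + len(vocab)
        for term in terms:
            score += math.log((term_counts[label][term] + 1) / denom)
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Hypothetical labeled form fields from an airfare domain.
examples = [
    (["departure", "city", "from"], "origin"),
    (["leaving", "from", "airport"], "origin"),
    (["arrival", "city", "to"], "destination"),
    (["going", "to", "airport"], "destination"),
    (["depart", "date"], "date"),
    (["return", "date"], "date"),
]
model = train(examples)
print(classify(["from", "city"], model))   # origin
print(classify(["to", "airport"], model))  # destination
```

With equal priors, as in this toy data, the highest-scoring label coincides with the maximum-likelihood prediction named on the slide.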
36
Technical Challenges
2. Schema Matching
How to match the schematic structures between sources?
37
Schema Matching: Circa 2000
Focus: Generic matching without assuming Web sources.
Techniques: [RahmB01]
38
Schema Matching: for Large Scale Integration
Focus: Matching a large number of interface schemas, often in a holistic way.
  Statistical model discovery [HeC03]; correlation mining [HeCH04, HeC05].
  Query probing [WangWL+04].
  Clustering [HeMY+03, WuYD+04].
  Corpus-assisted [MadhavanBD+05]; Web-assisted [WuDY06].
Focus: Constructing unified interfaces.
  As a global generative model [HeC03].
  Cluster-merge-select [HeMY+03].
39
WISE-Integrator: Cluster-Merge-Represent
[HeMY+03]
40
WISE-Integrator: Cluster-Merge-Represent [HeMY+03]
Matching attributes:
  Synonymous labels: WordNet, string similarity
  Compatible value domains (enum values or type)
Constructing integrated interface:
  form = initially empty
  until all attributes are covered:
    take one attribute
    select a representative and merge values
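The cluster-merge-represent steps above might look roughly like the following. This is a minimal sketch under stated assumptions, not the WISE-Integrator implementation: plain string similarity (via `difflib`) stands in for the WordNet lookup, value-domain compatibility is reduced to overlapping enum values, and the data layout and function names are hypothetical.

```python
from difflib import SequenceMatcher


def similar(a, b, threshold=0.8):
    """String similarity as a stand-in for synonym detection (no WordNet here)."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio() >= threshold


def matches(attr_a, attr_b):
    """Attributes match on similar labels or on overlapping enum values."""
    if similar(attr_a["label"], attr_b["label"]):
        return True
    return bool(set(attr_a["values"]) & set(attr_b["values"]))


def integrate(interfaces):
    # Cluster: put each attribute into the first cluster it matches.
    clusters = []
    for interface in interfaces:
        for attr in interface:
            for cluster in clusters:
                if any(matches(attr, other) for other in cluster):
                    cluster.append(attr)
                    break
            else:
                clusters.append([attr])
    # Merge and represent: most frequent label wins, values are unioned.
    form = []
    for cluster in clusters:
        labels = [a["label"] for a in cluster]
        representative = max(set(labels), key=lambda l: (labels.count(l), l))
        merged = sorted({v for a in cluster for v in a["values"]})
        form.append({"label": representative, "values": merged})
    return form


# Hypothetical book-search interfaces.
interfaces = [
    [{"label": "Author", "values": []},
     {"label": "Format", "values": ["hardcover", "paperback"]}],
    [{"label": "Authors", "values": []},
     {"label": "Binding", "values": ["paperback", "audio"]}],
    [{"label": "Author", "values": []}],
]
unified = integrate(interfaces)
print(unified)
```

Here “Format” and “Binding” end up in one cluster only because their value domains overlap, which is the case the label-based test alone would miss.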
41
Statistical Schema Matching: MGS [HeC03, HeCH04, HeC05]
Observation: Schemas share “tendencies” of attribute usage.
Hypothesis: A hidden statistical model exists?
Now, the problem: Given , how to find ?
[Figure: Statistical Model, Schema Generation, observed schemas over attributes α, β, γ, η, δ, attribute matchings]
42
Statistical Hypothesis Discovery
Statistical formulation:
  Given as observations: [figure: QIs as attribute sets, e.g., αβη, αβδ, γη]
  Find underlying hypothesis: [figure: Prob over α, β, γ, η, δ]
“Global” approach: Hidden model discovery [HeC03]
  Find entire global model at once.
“Local” approach: Correlation mining [HeCH04, HeC05]
  Find local fragments of matchings one at a time.
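To make the “local” approach concrete, here is a deliberately crude sketch. It rests on one intuition that is an assumption here, not something stated on the slide: attributes that are each frequent across schemas yet never appear together in one schema are candidate synonyms. The support threshold and the co-occurrence test are illustrative stand-ins, not the correlation measures of the cited papers.

```python
from itertools import combinations


def candidate_matchings(schemas, min_support=2):
    """Return attribute pairs that are each frequent but never co-occur."""
    attrs = sorted({a for schema in schemas for a in schema})
    freq = {a: sum(a in schema for schema in schemas) for a in attrs}
    pairs = []
    for a, b in combinations(attrs, 2):
        together = sum(a in schema and b in schema for schema in schemas)
        if freq[a] >= min_support and freq[b] >= min_support and together == 0:
            pairs.append((a, b))
    return pairs


# Hypothetical book-domain schemas (one attribute set per query interface).
schemas = [
    {"author", "title", "subject"},
    {"writer", "title", "category"},
    {"author", "title", "category"},
    {"writer", "title", "subject"},
]
found = candidate_matchings(schemas)
print(found)  # [('author', 'writer'), ('category', 'subject')]
```

Each pair is found independently of the others, which mirrors the “one fragment at a time” character of the local approach.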
43
Technical Challenges
3. Source Querying, Crawling & Search
How to query a source? How to crawl all objects and to search them?
44
Source Querying: Circa 2000
Focus: Mediation of cross-source, join-able queries
  Query rewriting, planning. Extensive study: e.g., [LevyRO96, AmbiteKMP01, Halevy01].
Focus: Execution & optimization of queries
  Adaptive, speculative query optimization; e.g., [NaughtonDM+01, BarishK03, IvesHW04].
45
Source Querying: for Large Scale Integration
1. Metaquerying model:
  Focus: On-the-fly querying.
    MetaQuerier Query Assistant [ZhangHC05].
2. Vertical-search-engine model:
  Focus: Source crawling to collect objects.
    Form submission by query generation/selection, e.g., [RaghavanG01, WuWLM06].
  Focus: Object search and ranking [NieZW+05].
46
On-the-fly Querying: Type-locality based Predicate Translation [ZhangHC05]
Correspondences occur within localities.
Translation by type-handler.
[Figure: Source predicate s, Target template P, Predicate Mapper, Type Recognizer, handlers (Domain Specific Handler, Text Handler, Numeric Handler, Datetime Handler), Target Predicate t*]
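A toy sketch of translation by type-handler follows. Everything in it is hypothetical: the predicate tuples, the template layout, and the handler logic are assumptions made for illustration, and only text and numeric handlers are shown. The numeric handler picks every target range that overlaps the source condition, so the result may over-cover and would need filtering afterwards.

```python
def recognize_type(value):
    """Very small type recognizer: numbers versus everything else."""
    return "numeric" if isinstance(value, (int, float)) else "text"


def numeric_handler(source, template):
    """Map a numeric comparison onto the target's predefined ranges."""
    _, op, value = source
    if op == "<":
        lo, hi = float("-inf"), value
    elif op == ">":
        lo, hi = value, float("inf")
    else:
        raise ValueError(f"unsupported operator: {op}")
    chosen = [(a, b) for a, b in template["ranges"] if a < hi and b > lo]
    return (template["attribute"], "in", chosen)


def text_handler(source, template):
    """Map a text condition onto a keyword-style target predicate."""
    _, _, value = source
    return (template["attribute"], "contains", value)


HANDLERS = {"numeric": numeric_handler, "text": text_handler}


def translate(source, template):
    """Dispatch the source predicate to the handler for its value type."""
    return HANDLERS[recognize_type(source[2])](source, template)


# Hypothetical source predicate and target template.
source = ("price", "<", 35)
template = {
    "attribute": "price range",
    "ranges": [(0, 25), (25, 45), (45, float("inf"))],
}
target = translate(source, template)
print(target)  # ('price range', 'in', [(0, 25), (25, 45)])
```

Keeping one handler per type means a new source only needs its predicates typed, not a hand-written mapping per attribute.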
47
Source Crawling by Query Selection [WuWL+06]

Author   Title         Category
Ullman   Compiler      System
Ullman   Data Mining   Application
Ullman   Automata      Theory
Han      Data Mining   Application

[Figure: graph over the nodes Ullman, Han, Compiler, Automata, Data Mining, Application, Theory, System]
Conceptually, the DB as a graph:
  Node: Attribute values
  Edge: Occurrence relationship
Crawling is transformed into a graph traversal problem: find a set of nodes N in the graph G such that for every node i in G, there exists a node j in N with j->i, and the total cost of the nodes in N is minimal.
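The graph formulation above can be tried on the slide's own four records. The sketch below builds the value graph and then picks query nodes greedily by coverage per unit cost; the greedy heuristic and the uniform cost are assumptions for illustration (the slide only states the objective), and a greedy choice is not guaranteed to be minimal.

```python
records = [
    ("Ullman", "Compiler", "System"),
    ("Ullman", "Data Mining", "Application"),
    ("Ullman", "Automata", "Theory"),
    ("Han", "Data Mining", "Application"),
]


def build_graph(records):
    """Link two attribute values whenever they occur in the same record."""
    graph = {}
    for record in records:
        for value in record:
            graph.setdefault(value, set()).update(
                other for other in record if other != value
            )
    return graph


def greedy_query_set(graph, cost=lambda node: 1.0):
    """Pick nodes until every node is chosen or adjacent to a chosen node."""
    uncovered = set(graph)
    chosen = []
    while uncovered:
        def gain(node):
            return len(({node} | graph[node]) & uncovered) / cost(node)

        # sorted() makes tie-breaking deterministic (alphabetical order).
        best = max(sorted(graph), key=gain)
        chosen.append(best)
        uncovered -= {best} | graph[best]
    return chosen


graph = build_graph(records)
queries = greedy_query_set(graph)
print(queries)  # ['Ullman', 'Application']
```

Querying “Ullman” reaches every value except “Han”, so one more query on a neighbor of “Han” completes the crawl.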
48
Object Ranking: Object Relationship Graph [NieZW+05]
Popularity Propagation Factor for each type of relationship link
Popularity of an object is also affected by the popularity of the Web pages containing the object
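The two ingredients on this slide, a propagation factor per link type and a contribution from Web-page popularity, can be combined in a PageRank-style iteration. The sketch below is a simplified assumption of how such a computation could look, not the formula from the cited paper; the damping constant `eps`, the per-type normalization by out-degree, and all data are hypothetical.

```python
from collections import Counter


def pop_rank(objects, links, ppf, web_popularity, eps=0.15, iterations=50):
    """Iteratively propagate popularity along typed links.

    links: (source, target, link_type) triples
    ppf: popularity propagation factor per link type
    web_popularity: popularity of the Web pages containing each object
    """
    out_count = Counter((src, kind) for src, _, kind in links)
    rank = dict(web_popularity)
    for _ in range(iterations):
        new_rank = {obj: eps * web_popularity[obj] for obj in objects}
        for src, dst, kind in links:
            share = rank[src] / out_count[(src, kind)]
            new_rank[dst] += (1 - eps) * ppf[kind] * share
        rank = new_rank
    return rank


# Hypothetical object graph: two papers and one author.
objects = ["p1", "p2", "a1"]
links = [
    ("p2", "p1", "cites"),
    ("p1", "a1", "written_by"),
    ("p2", "a1", "written_by"),
]
ppf = {"cites": 0.5, "written_by": 0.3}
web_popularity = {"p1": 1.0, "p2": 1.0, "a1": 1.0}

rank = pop_rank(objects, links, ppf, web_popularity)
print(rank)
```

In this toy graph the cited paper p1 ends up more popular than p2, and the author a1 collects popularity from both papers.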
49
Object Ranking: Training Process [NieZW+05]
[Figure: training loop. Inputs: Link Graph, Initial Combination of PPFs, Expert Ranking. Components: PopRank Calculator, Ranking Distance Estimator. Steps: new combination from neighbors; “Better than the best?” (Yes: chosen as the best; No: “Accept the worse one?”, Yes/No)]
Subgraph selection to approximate rank calculation for speeding up.
50
Technical Challenges
4. Data Extraction
How to extract result pages into relations?
51
Data Extraction: Circa 2000
Need for rapid wrapper construction well recognized.
[Figure: a Mediator over three Wrappers]
Focus: Semi-automatic wrapper construction.
Techniques:
  Wrapper-mediator architecture [Wiederhold92].
  Manual construction.
  Semi-automatic: Learning-based
    HLRT [KushmerickWD97], Stalker [MusleaMK99], Softmealy [HsuD98].
52
Data Extraction: for Large Scale
[Figure: a Mediator over three Wrappers]
Focus: Even more automatic approaches.
Techniques:
  Semi-automatic: Learning-based
    [ZhaoMWRY05], [IRMKS06].
  Automatic: Syntax-based
    RoadRunner [MeccaCM01], ExAlg [ArasuG03], DEPTA [LiuGZ03, ZhaiL05].
53
HLRT Wrapper: the first “Wrapper Induction” [KushmerickWD97]
A manual wrapper:
  ExtractCCs(page P)
    skip past first occurrence of <B> in P
    while next <B> is before next <HR> in P
      for each <lk, rk> in {< <B>, </B> >, < <I>, </I> >}
        skip past next occurrence of lk in P
        extract attribute from P to next occurrence of rk
    return extracted tuples
A generalized wrapper:
  ExecuteHLRT(<h, t, l1, r1, .., lK, rK>, page P)
    skip past first occurrence of h in P
    while next l1 is before next t in P
      for each <lk, rk> in {<l1, r1>, .., <lK, rK>}
        skip past next occurrence of lk in P
        extract attribute from P to next occurrence of rk
    return extracted tuples
Wrapper rules (delimiters): h; l1, r1; l2, r2; …; lK, rK; t
[Figure: labeled data, Induction Algorithm, wrapper rules]
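The generalized wrapper can be written out as a short runnable function. This is a sketch of the execution step only, following the pseudocode on this slide; the induction of the delimiters from labeled data is not shown, and the sample page is made up for illustration.

```python
def execute_hlrt(page, h, t, delimiters):
    """Extract tuples from page using head h, tail t, and (left, right) pairs."""
    tuples = []
    pos = page.find(h)
    if pos == -1:
        return tuples
    pos += len(h)  # skip past the first occurrence of h
    l1 = delimiters[0][0]
    while True:
        next_l1 = page.find(l1, pos)
        next_t = page.find(t, pos)
        # Stop when no row start remains, or the tail comes before it.
        if next_l1 == -1 or (next_t != -1 and next_t < next_l1):
            break
        row = []
        for left, right in delimiters:
            start = page.find(left, pos)
            if start == -1:
                return tuples
            start += len(left)  # skip past the left delimiter
            end = page.find(right, start)
            if end == -1:
                return tuples
            row.append(page[start:end])  # text up to the right delimiter
            pos = end + len(right)
        tuples.append(tuple(row))
    return tuples


# Hypothetical country-code page in the style of the manual wrapper above.
page = (
    "<HTML><BODY><B>Some Country Codes</B><P>"
    "<B>Congo</B> <I>242</I><BR>"
    "<B>Egypt</B> <I>20</I><BR>"
    "<B>Spain</B> <I>34</I><BR>"
    "<HR><B>End</B></BODY></HTML>"
)
rows = execute_hlrt(page, "<B>", "<HR>", [("<B>", "</B>"), ("<I>", "</I>")])
print(rows)  # [('Congo', '242'), ('Egypt', '20'), ('Spain', '34')]
```

The tail delimiter is what keeps the trailing `<B>End</B>` from being read as another row.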
54
RoadRunner [MeccaCM01]
Basic idea:
  Page generation: filling (encoding) data into a template
  Data extraction: as the reverse, decoding the template
Algorithm:
  Compare two HTML pages at a time, one as wrapper and the other as sample
  Solving the mismatches:
    string mismatch: content slot
    tag mismatch: structure variance
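The string-mismatch half of the idea is easy to sketch: walk two pages token by token and turn every differing text token into a content slot. The code below is a heavily simplified illustration and handles only that case; tag mismatches (structure variance such as optional or repeated elements) are reported as unsupported rather than resolved. The tokenizer, the `#PCDATA` slot marker, and the sample pages are assumptions for this sketch.

```python
import re

TOKEN = re.compile(r"<[^>]+>|[^<]+")


def tokenize(html):
    """Split a page into tag tokens and non-empty text tokens."""
    return [tok.strip() for tok in TOKEN.findall(html) if tok.strip()]


def generalize(wrapper_page, sample_page):
    """Build a template by replacing mismatching strings with content slots."""
    wrapper, sample = tokenize(wrapper_page), tokenize(sample_page)
    if len(wrapper) != len(sample):
        raise ValueError("structure variance is not handled in this sketch")
    template = []
    for w, s in zip(wrapper, sample):
        if w == s:
            template.append(w)
        elif not w.startswith("<") and not s.startswith("<"):
            template.append("#PCDATA")  # string mismatch: content slot
        else:
            raise ValueError("tag mismatch is not handled in this sketch")
    return template


def extract(template, page):
    """Decode a page against the template: return the slot contents."""
    tokens = tokenize(page)
    return [tok for slot, tok in zip(template, tokens) if slot == "#PCDATA"]


# Hypothetical pages generated from the same template.
page1 = "<html><b>Title:</b> Database Systems <i>Ullman</i></html>"
page2 = "<html><b>Title:</b> Data Mining <i>Han</i></html>"

template = generalize(page1, page2)
print(template)
print(extract(template, page2))  # ['Data Mining', 'Han']
```

The shared text “Title:” stays in the template as a constant, while the two differing strings become the extracted fields.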
55
RoadRunner
56
RoadRunner
the template
57
Technical Challenges
5. System Integration
Putting things together?
58
Our “system” research often ends up with “components in isolation” [ChangHZ05]
59
System integration: Sample issues
New challenges
  How will errors in automatic form extraction impact the subsequent schema matching?
New opportunities
  Can the result of schema matching help to correct such errors? E.g., if (adults, children) together form a matching, then?
[Figure: AA.com query form and the result of extraction]
60
Current agenda: “Science” of system integration
[Figure: subsystems S_i, S_j, S_k linked by Cascade and Feedback]
new challenge: error cascading
new opportunity: result feedback
61
Finally, observations
Large scale is not only a challenge, but also an opportunity!
62
Observation #1: Large scale introduces New Problems!
Several issues arise in the context.
Evidence of new problems:
  Source modeling & selection
  Source querying, crawling, ranking:
    On-the-fly query translation
    Object crawling, ranking
  System integration
63
Observation #2: Large scale introduces New Semantics!
Relaxed metrics possible, even for the same problems.
Evidence of new metrics:
  Search-flavored integration: large scale but simplistic
    Function: Simple queries
    Source: Transparency no longer the fundamental doctrine
    User: In the loop of querying
    Techniques: Automatic but error-prone
    Results: Fuzzy, ranked
      meta-querying: ranking of matching sources
      vertical-search-engine: ranking of objects
64
Observation #3: Large scale introduces New Insights!
The multitude of sources gives a holistic context for study.
Evidence of new insights:
  Schema matching: Many holistic approaches
  Source modeling: “Lego”-based extraction
  System integration: Holistic error correction/feedback
65
The Web “Trio” (My three circles...)
Integration Mining
Search
66
DB People: Buckle Up!
Our time has finally come…
Looking Forward
Recall the first time I heard about Google Base.