Cimple: Building Community Portal Sites through Crawling & Extraction
Zachary G. Ives
University of Pennsylvania
CIS 650 – Implementing Data Management Systems
November 4, 2008
Slides based on content by AnHai Doan, used with permission
Administrivia
By next Tuesday: a rough schedule and division of duties for your project
Please read the Halevy et al. paper on Piazza
The Web Is Full of Special-Interest Portal Sites for Communities
Academia: certain bioinformatics topics; citations; etc.
Medicine: WebMD
Infotainment: Rotten Tomatoes, IMDB, fantasy football
Business: enterprise intranets, tech support groups, lawyers
CIA / homeland security: Intellipedia
Some of these gather information from the Web
Cimple Project @ Wisconsin (+ Yahoo)
[Figure: Cimple architecture. Sources: researcher homepages, conference pages, group pages, DBworld mailing list, DBLP, Web pages, text documents. ER graph example: Jim Gray, give-talk, SIGMOD-04. Services: keyword search, SQL querying, question answering, browse, mining, alert/monitor, news summary. Also labeled: maintain and add more sources; mass collaboration.]
Develops a general solution to community Web portals using extraction + integration + mass collaboration
The Basic Ideas
Architecture mainly consists of extractors and ER-graphs
The key challenge is to extract “good” information for the ER-graphs, and to allow all of the information to be extended and repaired
Prototype System: DBLife
Integrates data of the DB research community
1164 data sources
Crawled daily: 11000+ pages = 160+ MB / day
Data Integration
Raghu Ramakrishnan: co-authors = A. Doan, Divesh Srivastava, ...
Resulting ER Graph
[Figure: ER graph fragment. Nodes: the paper “Proactive Re-optimization”, Jennifer Widom, Shivnath Babu, David DeWitt, Pedro Bizarro, SIGMOD 2005. Edge labels: coauthor, advise, write, PC-Chair, PC-member.]
Provide Services
DBLife system
Mass Collaboration via Wiki
Issues Addressed by Cimple
Cimple addresses challenges in
1. Source selection
2. Extraction and integration
3. Detecting problems and providing feedback
4. Mass collaboration
1. Source Selection
Current Solutions vs. Cimple
Current solutions: topic-specific crawlers
  find all relevant data sources (e.g., using focused crawling, search engines)
  maximize coverage, which results in many “noisy” sources
Cimple allows for incremental development and deployment
  starts with a small set of high-quality “core” sources
  incrementally adds more sources
  only from “high-quality” places or as suggested by users (mass collaboration)
Start with a Small Set of “Core” Sources
Key observation: communities often follow an 80-20 rule
  20% of sources cover 80% of interesting activities
Initial portal over these 20% is often already quite useful
How do we select these 20%?
  select as many sources as possible
  then evaluate and select the most relevant ones
Evaluate the Relevance of Sources
Use PageRank + virtual links across entities + TF/IDF
[Figure: the mentions “... Gerhard Weikum” and “G. Weikum”, illustrating a virtual link across entities]
See [VLDB-07a]
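The slide only names the ingredients; the actual scoring is in [VLDB-07a]. As a rough sketch of the "PageRank + virtual links" part alone (TF/IDF is left out), the code below assumes that two sources mentioning the same entity get a virtual link in both directions, then runs a plain power-iteration PageRank. The function names, the toy sources, and the damping factor of 0.85 are illustrative assumptions, not taken from Cimple.

```python
from collections import defaultdict


def add_virtual_links(links, mentions):
    """Add a link (both directions) between sources that mention the same entity."""
    by_entity = defaultdict(set)
    for source, entities in mentions.items():
        for entity in entities:
            by_entity[entity].add(source)
    combined = {source: list(outs) for source, outs in links.items()}
    for sources in by_entity.values():
        for a in sources:
            for b in sources:
                if a != b and b not in combined.setdefault(a, []):
                    combined[a].append(b)
    return combined


def pagerank(links, damping=0.85, iterations=50):
    """Power-iteration PageRank over a dict {node: [outgoing neighbours]}."""
    nodes = set(links) | {n for outs in links.values() for n in outs}
    rank = {n: 1.0 / len(nodes) for n in nodes}
    for _ in range(iterations):
        new_rank = {n: (1.0 - damping) / len(nodes) for n in nodes}
        for node in nodes:
            outs = links.get(node, [])
            # A node without out-links spreads its rank over all nodes.
            targets = outs if outs else nodes
            share = damping * rank[node] / len(targets)
            for target in targets:
                new_rank[target] += share
        rank = new_rank
    return rank


# Hypothetical sources: "group" has no hyperlinks, only a shared entity mention.
links = {"home": ["conf"], "conf": [], "group": [], "spam": []}
mentions = {"home": {"Gerhard Weikum"}, "group": {"Gerhard Weikum"}, "spam": set()}
ranks = pagerank(add_virtual_links(links, mentions))
```

In this toy graph, "group" ends up ranked above "spam" only because of the virtual link, which is the effect the slide points at.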
Add More Sources over Time
Key observation: the most important sources will eventually be mentioned within the community
  so monitor certain “community channels” to find them
Message type: conf. ann.
Subject: Call for Participation: VLDB Workshop on Management of Uncertain Data
Call for Participation
Workshop on "Management of Uncertain Data"
in conjunction with VLDB 2007
http://mud.cs.utwente.nl ...
Also allow users to suggest new sources, e.g., the Silicon Valley Database Society
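Monitoring a community channel can be as simple as pulling URLs out of each message and keeping the ones not yet known. The sketch below assumes messages arrive as plain text; `candidate_sources` and the URL pattern are illustrative, not part of Cimple.

```python
import re

# Deliberately loose URL pattern; good enough for a sketch.
URL = re.compile(r"https?://[^\s<>\"]+")


def candidate_sources(message, known_sources):
    """Return URLs mentioned in a message that are not yet known sources."""
    found = {url.rstrip(".,;)") for url in URL.findall(message)}
    return sorted(found - set(known_sources))


message = (
    "Call for Participation\n"
    'Workshop on "Management of Uncertain Data"\n'
    "in conjunction with VLDB 2007\n"
    "http://mud.cs.utwente.nl ..."
)
new_sources = candidate_sources(message, known_sources=[])
```

Candidates found this way would still go through the relevance evaluation before being added.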
Summary: Source Selection
Incremental approach:
  start with highly relevant sources
  expand carefully
  minimize “garbage in, garbage out”
Need a notion of source relevance
Need a way to compute this
2. Extraction and Integration
Extracting Entity Mentions
Key idea: reasonable plan, then “patch”
Reasonable basic plan:
  collect person names, e.g., David Smith
  generate variations, e.g., D. Smith, Dr. Smith, etc.
  find occurrences of these variations
[Plan diagram: ExtractMbyName, Union, sources s1 … sn]
Works well, but can’t handle certain difficult spots
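A minimal sketch of the basic plan, assuming the dictionary holds names of the form "First Last". The set of variations and the function names are illustrative; the slides name the ExtractMbyName operator but do not show its implementation.

```python
import re


def name_variations(full_name):
    """Return simple surface variations of a 'First Last' name."""
    parts = full_name.split()
    first, last = parts[0], parts[-1]
    return {
        full_name,                # David Smith
        f"{first[0]}. {last}",    # D. Smith
        f"Dr. {last}",            # Dr. Smith
        f"{last}, {first[0]}.",   # Smith, D.
    }


def extract_mentions_by_name(text, dictionary):
    """Return (person, matched_text, offset) for each variation found in text."""
    mentions = []
    for person in dictionary:
        for variant in name_variations(person):
            for match in re.finditer(re.escape(variant), text):
                mentions.append((person, match.group(), match.start()))
    return sorted(mentions, key=lambda mention: mention[2])


mentions = extract_mentions_by_name("Talk by Dr. Smith", ["David Smith"])
```

Plain substring matching like this is what makes the plan "reasonable" rather than exact: it is cheap and usually right, but it has no notion of context.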
Handling Difficult Spots
Example: R. Miller, D. Smith, B. Jones
  if “David Miller” is in the dictionary, the basic plan will flag “Miller, D.” as a person name
Solution: patch such spots with stricter plans
[Plan diagram: ExtractMbyName over Union of sources s1 … sn, patched with FindPotentialNameLists and ExtractMStrict]
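One way such a patch could work, sketched under the assumption that a name list is three or more comma-separated "F. Last" items: detect those spans, then match only whole list items so a mention can never straddle a comma. Both functions are hypothetical stand-ins for FindPotentialNameLists and ExtractMStrict.

```python
import re

# One list item such as "R. Miller"; a list is three or more of them.
ITEM = r"[A-Z]\. [A-Z][a-z]+"
NAME_LIST = re.compile(rf"{ITEM}(?:, {ITEM}){{2,}}")


def find_potential_name_lists(text):
    """Return (start, end) spans that look like comma-separated name lists."""
    return [match.span() for match in NAME_LIST.finditer(text)]


def extract_mentions_strict(text, span, dictionary):
    """Match whole list items only, never text that crosses a comma."""
    start, end = span
    mentions = []
    offset = start
    for item in text[start:end].split(", "):
        initial, last = item[0], item.split()[-1]
        for person in dictionary:
            parts = person.split()
            if parts[0][0] == initial and parts[-1] == last:
                mentions.append((person, item, offset))
        offset += len(item) + 2  # skip the item and the ", " separator
    return mentions


text = "R. Miller, D. Smith, B. Jones"
dictionary = ["David Miller", "David Smith"]
spans = find_potential_name_lists(text)
strict = [m for span in spans for m in extract_mentions_strict(text, span, dictionary)]
```

On the slide's example this keeps "D. Smith" and no longer reports "Miller, D." as a mention of David Miller.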
Matching Entity Mentions
Key idea: reasonable plan, then patch
Reasonable plan:
  if mention names are the same (modulo some variation), match
  e.g., David Smith and D. Smith
[Plan diagram: MatchMbyName, Union, Extract Plan, sources s1 … sn]
Works well, but can’t handle certain difficult spots
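A sketch of name-based matching, assuming "same modulo some variation" means equal first initial and last name, and that every mention has the form "First Last" or "F. Last". The key function is an illustrative choice, not the MatchMbyName operator itself.

```python
from collections import defaultdict


def name_key(mention):
    """Reduce 'David Smith' or 'D. Smith' to ('d', 'smith')."""
    tokens = mention.replace(".", " ").split()
    return (tokens[0][0].lower(), tokens[-1].lower())


def match_mentions_by_name(mentions):
    """Group mentions that share a first initial and a last name."""
    groups = defaultdict(list)
    for mention in mentions:
        groups[name_key(mention)].append(mention)
    return list(groups.values())


groups = match_mentions_by_name(["David Smith", "D. Smith", "R. Miller"])
```

Here `groups` puts "David Smith" and "D. Smith" together and leaves "R. Miller" alone. The same key would also merge two different people who share an initial and a surname.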
Handling Difficult Spots
Estimate the semantic ambiguity of data sources
  use social networking techniques related to cohesion of graphs [see ICDE-07a]
Apply stricter matchers to more ambiguous sources
[Plan diagram: MatchMbyName over Extract Plan on {s1 … sn} \ DBLP; MatchMStrict over Extract Plan on DBLP; results combined by Union]
DBLP: Chen Li
· · ·
41. Chen Li, Bin Wang, Xiaochun Yang. VGRAM. VLDB 2007.
· · ·
38. Ping-Qi Pan, Jian-Feng Hu, Chen Li. Feasible region contraction. Applied Mathematics and Computation.
· · ·
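The slides do not say what MatchMStrict checks. One plausible reading, used in this sketch as an assumption, is that mentions from an ambiguous source match only if the names agree and they share at least one co-author. The first two records come from the DBLP entries above; the third is hypothetical.

```python
def match_by_name(a, b):
    """Default matcher: names agree."""
    return a["name"] == b["name"]


def match_strict(a, b):
    """Stricter matcher (assumed): names agree and a co-author is shared."""
    return match_by_name(a, b) and bool(a["coauthors"] & b["coauthors"])


def match(a, b, ambiguous_source):
    """Pick the matcher according to how ambiguous the source is."""
    matcher = match_strict if ambiguous_source else match_by_name
    return matcher(a, b)


entry_41 = {"name": "Chen Li", "coauthors": {"Bin Wang", "Xiaochun Yang"}}
entry_38 = {"name": "Chen Li", "coauthors": {"Ping-Qi Pan", "Jian-Feng Hu"}}
# Hypothetical third entry, added only to show a positive strict match.
entry_x = {"name": "Chen Li", "coauthors": {"Bin Wang"}}
```

With name-only matching, entries 41 and 38 collapse into one "Chen Li"; the strict matcher keeps them apart while still joining entry 41 with the hypothetical third entry.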
Summary: Extraction and Integration
Most current solutions
  try to find a single good plan, applied to all of the data
Cimple solution: reasonable plan, then patch
So the focus shifts to:
  how to find a reasonable plan?
  how to detect problematic data spots?
  how to patch those?
Need a notion of semantic ambiguity
  different from the notion of source relevance
3. Detecting Problems and Making Corrections
How to Detect Problems?
After extraction and matching, build services
  e.g., superhomepages
Many such homepages contain minor problems
  e.g., X graduated in 19998
  X chairs SIGMOD-05 and VLDB-05
  X published 5 SIGMOD-03 papers
Intuitively, something is semantically incorrect
To fix this, build a Semantic Debugger
  learns what is a normal profile for researcher, paper, etc.
  alerts the builder to potentially buggy superhomepages
  so corrections / feedback can be provided
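The slides do not describe how the Semantic Debugger learns a profile. As a deliberately naive sketch, the code below assumes a "normal profile" is just the observed range of each numeric attribute over a set of trusted records, and flags values outside that range. The attribute names and the reference values are hypothetical.

```python
def learn_profile(examples):
    """Record the observed (min, max) of every numeric attribute."""
    profile = {}
    for record in examples:
        for attr, value in record.items():
            low, high = profile.get(attr, (value, value))
            profile[attr] = (min(low, value), max(high, value))
    return profile


def find_suspicious(record, profile):
    """Return the attributes whose value lies outside the learned range."""
    return sorted(
        attr
        for attr, value in record.items()
        if attr in profile and not profile[attr][0] <= value <= profile[attr][1]
    )


# Hypothetical trusted records used to learn the profile.
trusted = [
    {"graduation_year": 1985, "sigmod_papers_per_year": 1},
    {"graduation_year": 2001, "sigmod_papers_per_year": 2},
    {"graduation_year": 1994, "sigmod_papers_per_year": 0},
]
profile = learn_profile(trusted)
alerts = find_suspicious({"graduation_year": 19998, "sigmod_papers_per_year": 5}, profile)
```

A real debugger would need something more robust than min/max ranges, but the flow is the same: learn from data believed to be clean, then alert on records that deviate.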
What Types of Feedback?
Say that a certain data item Y is wrong
Provide correct value for Y, e.g., Y = SIGMOD-06
Add domain knowledge
  e.g., no researcher has ever published 5 SIGMOD papers in a year
Add more data
  e.g., X was advised by Z
  e.g., here is the URL of another data source
Modify the underlying algorithm
  e.g., pull out all data involving X
  match using names and co-authors, not just names
How to Make Providing Feedback Very Easy?
Extremely crucial in DBLife context
If feedback can be provided easily
  can get more feedback
  can leverage the mass of users
Critical but unsolved
Provide a Wiki interface
How to Make Providing Feedback Very Easy?
Say that a certain data item Y is wrong
Provide correct value for Y, e.g., Y = SIGMOD-06
Add domain knowledge
Add more data
Modify the underlying algorithm
Provide form interfaces
Unsolved: some recent interest in how to mass customize software
Summary: Detection and Feedback
How to detect problems?
  Semantic Debugger
What types of feedback & how to easily provide them?
  critical, largely unsolved
What feedback would make most impact?
  crucial in large-scale systems
  need a notion of a Feedback Advisor
  need a precise notion of system quality
4. Mass Collaboration
Mass Collaboration: Voting
Can be applied to numerous problems
Example: Matching
Hard for machine, but easy for human
Mouse for Dell laptop 200 series ...
Dell X200; mouse at reduced price ...
Dell laptop X200 with mouse ...
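A minimal sketch of voting on a single matching question, such as "do these two product listings refer to the same item?". It assumes a simple majority with a minimum number of votes and no weighting by user quality; the threshold and the answer labels are illustrative.

```python
from collections import Counter


def majority_vote(votes, min_votes=3):
    """Return the winning answer, or None if there are too few votes or a tie."""
    if len(votes) < min_votes:
        return None
    counts = Counter(votes).most_common()
    if len(counts) > 1 and counts[0][1] == counts[1][1]:
        return None
    return counts[0][0]


decision = majority_vote(["match", "match", "no match"])
```

Returning None for ties and thin evidence leaves the question open for more votes instead of forcing a decision.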
Mass Collaboration: Wiki
Community wikipedia
  built by machine + human
  backed up by a structured database
[Figure: wiki pipeline with labels DataSources, G, T, V1, V2, V3, W1, W2, W3, u1, V3’, W3’, T3’, M; stages labeled Machine, Machine, Human]
Mass Collaboration: Wiki
<# person(id=1){name}=David J. DeWitt #>
<# person(id=1){title}=Professor #>
<strong>Interests:</strong><# person(id=1).interests(id=3)
.topic(id=4){name}=Parallel Database #>
David J. DeWitt
Professor
Interests: Parallel Database
<# person(id=1){name}=David J. DeWitt #>
<# person(id=1){title}=John P. Morgridge Professor #>
<# person(id=1) {organization}=UW #> since 1976
<strong>Interests:</strong><# person(id=1).interests(id=3)
.topic(id=4){name}=Parallel Database #>
<# person(id=1){name}=David J. DeWitt #>
<# person(id=1){title}=John P. Morgridge Professor #>
<# person(id=1){organization}=UW-Madison #> since 1976
<strong>Interests:</strong><# person(id=1).interests(id=3)
.topic(id=4){name}=Parallel Database #>
<# person(id=1).interests(id=5).topic(id=6){name}=Privacy #>
David J. DeWitt
John P. Morgridge Professor
UW-Madison since 1976
Interests: Parallel Database
Privacy
Machine
Human
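The `<# ... #>` tags tie each piece of wiki text to a path in the structured database. The sketch below infers the tag grammar from the examples on this slide only (a dotted path of `name(id=N)` steps, an attribute in braces, then a value); the regular expressions and function names are assumptions, not the actual Cimple parser.

```python
import re

# <# path {attribute} = value #>, where the path may wrap across lines.
TAG = re.compile(
    r"<#\s*(?P<path>.+?)\s*\{(?P<attr>\w+)\}\s*=\s*(?P<value>.*?)\s*#>",
    re.DOTALL,
)
STEP = re.compile(r"(\w+)\(id=(\d+)\)")


def parse_tags(wikitext):
    """Return (path, attribute, value) for every structured tag."""
    facts = []
    for match in TAG.finditer(wikitext):
        steps = STEP.findall(match.group("path"))
        path = tuple((name, int(ident)) for name, ident in steps)
        facts.append((path, match.group("attr"), match.group("value")))
    return facts


def render(wikitext):
    """Replace each tag by its value, leaving ordinary markup untouched."""
    return TAG.sub(lambda match: match.group("value"), wikitext)


wikitext = (
    "<# person(id=1){name}=David J. DeWitt #>\n"
    "<strong>Interests:</strong><# person(id=1).interests(id=3)\n"
    ".topic(id=4){name}=Parallel Database #>"
)
facts = parse_tags(wikitext)
page = render(wikitext)
```

Parsing a page before and after a human edit and comparing the two lists of facts is one way the edits could be written back to the structured database.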
Summary: Mass Collaboration
What can users contribute?
How to evaluate user quality?
How to reconcile inconsistent data?
Summary: Cimple
A very interesting attempt to rethink Web crawling and information extraction
Based on a “best-effort” notion
  one of many concurrent efforts in that vein (cf. “Dataspaces”)
Simple building blocks, progressive refinement
Open Questions and Issues
Incorporating uncertain data
Can we really scale to, and manage the complexity of, many, many extractors that might produce data of different quality?
How do we provide feedback?
  particularly as there are many learners, and data and feedback are very sparse
Others?