Cimple: Building Community Portal Sites through Crawling & Extraction

Zachary G. Ives

University of Pennsylvania

CIS 650 – Implementing Data Management Systems

November 4, 2008

Slides based on content by AnHai Doan, used with permission

Administrivia

By next Tuesday: a rough schedule and division of duties for your project

Please read the Halevy et al. paper on Piazza

The Web Is Full of Special-Interest Portal Sites for Communities

Academia: certain bioinformatics topics; citations; etc.

Medicine: WebMD

Infotainment: Rotten Tomatoes, IMDB, fantasy football

Business: enterprise intranets, tech support groups, lawyers

CIA / homeland security: Intellipedia

Some of these gather information from the Web

Cimple Project @ Wisconsin (+ Yahoo)

[Figure: Cimple architecture. Data sources (researcher homepages, conference pages, group pages, the DBworld mailing list, Web pages, text documents) are extracted into an ER graph (e.g., Jim Gray give-talk SIGMOD-04). Services over the graph: keyword search, SQL querying, question answering, browse, mining, alert/monitor, news summary. Other labels: maintain and add more sources; mass collaboration.]

Develops a general solution to community Web portals using extraction + integration + mass collaboration

The Basic Ideas

Architecture mainly consists of extractors and ER-graphs

The key challenge is to extract “good” information for the ER-graphs, and to allow all of the information to be extended and repaired

Prototype System: DBLife

Integrate data of the DB research community

1164 data sources

Crawled daily, 11000+ pages = 160+ MB / day

Data Integration

Raghu Ramakrishnan

co-authors = A. Doan, Divesh Srivastava, ...

Resulting ER Graph

[Figure: ER graph. Nodes: the paper “Proactive Re-optimization”, Jennifer Widom, Shivnath Babu, Pedro Bizarro, David DeWitt, SIGMOD 2005. Edge labels: coauthor, advise, PC-Chair, PC-member.]

Provide Services

DBLife system

Mass Collaboration via Wiki

Issues Addressed by Cimple

Cimple addresses challenges in

1. Source selection

2. Extraction and integration

3. Detecting problems and providing feedback

4. Mass collaboration

1. Source Selection


Current Solutions vs. Cimple

Current solutions: topic-specific crawlers

find all relevant data sources (e.g., using focused crawling, search engines)

maximize coverage, which results in many “noisy” sources

Cimple allows for incremental development and deployment

starts with a small set of high-quality “core” sources

incrementally adds more sources

only from “high-quality” places or as suggested by users (mass collaboration)

Start with a Small Set of “Core” Sources

Key observation: communities often follow the 80-20 rule

20% of sources cover 80% of interesting activities

An initial portal over these 20% is often already quite useful

How do we select these 20%?

select as many sources as possible

then evaluate and select the most relevant ones

Evaluate the Relevance of Sources

Use PageRank + virtual links across entities + TF/IDF

[Figure: pages mentioning “Gerhard Weikum” and “G. Weikum”]

See [VLDB-07a]
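As a rough illustration of the scoring idea on this slide, the sketch below runs PageRank over a graph that contains both ordinary hyperlinks and "virtual links". It assumes a virtual link connects two sources that mention the same entity, which is only one possible reading of the slide. The source names, the damping factor of 0.85, and the omission of the TF/IDF component are all simplifications made here; the actual method is the one described in [VLDB-07a].

```python
from collections import defaultdict

def add_virtual_links(links, mentions):
    """Return a graph with hyperlinks plus assumed 'virtual links'
    between any two sources that mention the same entity."""
    graph = defaultdict(set)
    for src, dsts in links.items():
        graph[src].update(dsts)
    by_entity = defaultdict(set)
    for src, entities in mentions.items():
        for entity in entities:
            by_entity[entity].add(src)
    for sources in by_entity.values():
        for a in sources:
            graph[a].update(sources - {a})
    return graph

def pagerank(graph, nodes, damping=0.85, iterations=50):
    """Plain power-iteration PageRank; dangling nodes spread rank evenly."""
    rank = {n: 1.0 / len(nodes) for n in nodes}
    for _ in range(iterations):
        new = {n: (1.0 - damping) / len(nodes) for n in nodes}
        for n in nodes:
            out = [m for m in graph.get(n, ()) if m in rank]
            targets = out if out else nodes
            share = damping * rank[n] / len(targets)
            for m in targets:
                new[m] += share
        rank = new
    return rank

# Hypothetical sources: one homepage links to a conference page,
# and both mention the same researcher.
links = {"weikum_home": {"vldb07"}, "vldb07": set(), "blog": set()}
mentions = {"weikum_home": {"Gerhard Weikum"},
            "vldb07": {"Gerhard Weikum"},
            "blog": set()}
nodes = sorted(links)
rank = pagerank(add_virtual_links(links, mentions), nodes)
for name in sorted(rank, key=rank.get, reverse=True):
    print(name, round(rank[name], 3))
```

The source with no hyperlinks and no shared entities ends up ranked lowest, which matches the intuition that isolated sources are less relevant to the community.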

Add More Sources over Time

Key observation: most important sources will eventually be mentioned within the community

so monitor certain “community channels” to find them

Message type: conf. ann.

Subject: Call for Participation: VLDB Workshop on Management of Uncertain Data

Call for Participation

Workshop on "Management of Uncertain Data" in conjunction with VLDB 2007

http://mud.cs.utwente.nl ...

Also allow users to suggest new sources, e.g., the Silicon Valley Database Society
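Monitoring a community channel can be sketched as scanning messages for URLs that are not yet among the known sources. The function name, the regular expression, and the `min_mentions` threshold below are assumptions made for illustration, not part of Cimple.

```python
import re
from collections import Counter

URL_PATTERN = re.compile(r"https?://[^\s<>\"']+")

def candidate_sources(messages, known_sources, min_mentions=1):
    """Count, per message, each new URL once and return URLs
    mentioned in at least min_mentions messages."""
    counts = Counter()
    for text in messages:
        urls = {u.rstrip(".,;)") for u in URL_PATTERN.findall(text)}
        for url in urls:
            if url not in known_sources:
                counts[url] += 1
    return [u for u, c in counts.most_common() if c >= min_mentions]

message = (
    "Subject: Call for Participation: VLDB Workshop on "
    "Management of Uncertain Data\n"
    "http://mud.cs.utwente.nl ..."
)
new_sources = candidate_sources([message], known_sources=set())
print(new_sources)
```

Candidates found this way would still need the relevance evaluation described earlier before being added to the portal.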

Summary: Source Selection

Incremental approach:

start with highly relevant sources

expand carefully

minimize “garbage in, garbage out”

Need a notion of source relevance

Need a way to compute this

2. Extraction and Integration


Extracting Entity Mentions

Key idea: reasonable plan, then “patch”

Reasonable basic plan:

collect person names, e.g., David Smith

generate variations, e.g., D. Smith, Dr. Smith, etc.

find occurrences of these variations

[Figure: plan applying ExtractMbyName to sources s1 … sn]

Works well, but can’t handle certain difficult spots
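The basic plan (collect names, generate variations, find occurrences) can be sketched as follows. The function names and the small set of variations are illustrative assumptions; the slide names the operator ExtractMbyName but does not show its implementation.

```python
import re

def name_variations(full_name):
    """Generate a few surface forms of a 'First Last' name (not exhaustive)."""
    parts = full_name.split()
    first, last = parts[0], parts[-1]
    return {full_name, f"{first[0]}. {last}", f"Dr. {last}"}

def extract_mentions_by_name(text, dictionary):
    """Return (offset, matched text, dictionary name) for every variation found."""
    mentions = []
    for name in dictionary:
        for variant in name_variations(name):
            # Require non-word characters around the match.
            pattern = r"(?<!\w)" + re.escape(variant) + r"(?!\w)"
            for m in re.finditer(pattern, text):
                mentions.append((m.start(), m.group(), name))
    return sorted(mentions)

text = "Talk by D. Smith; Dr. Smith joined in 2001."
mentions = extract_mentions_by_name(text, ["David Smith"])
for offset, surface, name in mentions:
    print(offset, surface, "->", name)
```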

Handling Difficult Spots

Example: R. Miller, D. Smith, B. Jones

if “David Miller” is in the dictionary, the plan will flag “Miller, D.” as a person name

Solution: patch such spots with stricter plans

[Figure: plan in which ExtractMbyName runs over sources s1 … sn, and FindPotentialNameLists feeds ExtractMStrict]
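A minimal sketch of the patching idea: detect spans that look like comma-separated name lists and, inside those spans only, accept a name when it matches a whole list item. The operator names echo the slide, but the regular expression, the variant set, and the function bodies are assumptions made for this example.

```python
import re

INITIAL_LAST = r"[A-Z]\. [A-Z][a-z]+"
NAME_LIST = re.compile(rf"{INITIAL_LAST}(?:, {INITIAL_LAST})+")

def find_potential_name_lists(text):
    """Return (start, end) spans that look like 'F. Last, F. Last, ...'."""
    return [(m.start(), m.end()) for m in NAME_LIST.finditer(text)]

def extract_loose(text, variants):
    """Loose plan: any variant occurring anywhere in the text."""
    return sorted(v for v in variants if v in text)

def extract_strict(span_text, variants):
    """Strict plan: a variant must equal a whole comma-separated item."""
    items = [item.strip() for item in span_text.split(",")]
    return [item for item in items if item in variants]

def extract_with_patch(text, variants):
    """Use the strict plan inside name lists and the loose plan elsewhere."""
    found, cursor = [], 0
    for start, end in find_potential_name_lists(text):
        found += extract_loose(text[cursor:start], variants)
        found += extract_strict(text[start:end], variants)
        cursor = end
    found += extract_loose(text[cursor:], variants)
    return found

# Assumed variants for dictionary entries "David Miller" and "David Smith".
variants = {"D. Miller", "Miller, D.", "D. Smith", "Smith, D."}
text = "R. Miller, D. Smith, B. Jones"
print(extract_loose(text, variants))       # includes the false positive
print(extract_with_patch(text, variants))  # strict plan inside the list
```

The loose plan reports “Miller, D.” because that string happens to straddle two list items; the strict plan only sees whole items, so the false positive disappears.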

Matching Entity Mentions

Key idea: reasonable plan, then patch

Reasonable plan:

if two mention names are the same (modulo some variation), they match

e.g., David Smith and D. Smith

[Figure: plan applying MatchMbyName on top of the extract plan over sources s1 … sn]

Works well, but can’t handle certain difficult spots
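One simple way to realize "same name modulo some variation" is to normalize each mention to a key and group mentions that share it. The key used below (first initial plus last name) is an assumption chosen for illustration, and it is deliberately loose.

```python
from collections import defaultdict

def name_key(mention):
    """Normalize 'David Smith', 'D. Smith', or 'Smith, D.' to ('d', 'smith')."""
    mention = mention.strip()
    if "," in mention:
        last, first = [part.strip() for part in mention.split(",", 1)]
    else:
        parts = mention.split()
        first, last = parts[0], parts[-1]
    return (first[0].lower(), last.lower())

def match_by_name(mentions):
    """Group mentions whose normalized keys are equal."""
    groups = defaultdict(list)
    for mention in mentions:
        groups[name_key(mention)].append(mention)
    return list(groups.values())

groups = match_by_name(["David Smith", "D. Smith", "Smith, D.", "B. Jones"])
for group in groups:
    print(group)
```

Because the key is so coarse, two different people who share an initial and a last name would be merged, which is the kind of difficult spot the next slide addresses.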

Handling Difficult Spots

Estimate the semantic ambiguity of data sources

use social networking techniques related to cohesion of graphs [see ICDE-

Apply stricter matchers to more ambiguous sources

[Figure: matching plan. MatchMbyName runs over the extract plan for {s1 … sn} \ DBLP; MatchMStrict runs over the extract plan for DBLP.]

DBLP: Chen Li

· · ·

41. Chen Li, Bin Wang, Xiaochun Yang. VGRAM. VLDB 2007.

· · ·

38. Ping-Qi Pan, Jian-Feng Hu, Chen Li. Feasible region contraction. Applied Mathematics and Computation.

· · ·
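The sketch below shows one possible stricter matcher for ambiguous sources: names must agree and the two mentions must share at least one co-author. This co-author rule is an assumption (a later slide mentions matching "using names and co-authors" as a possible refinement), and the record layout is hypothetical. The two records reuse the DBLP entries shown above.

```python
def match_by_name(a, b):
    """Loose matcher: identical names are treated as the same person."""
    return a["name"] == b["name"]

def match_strict(a, b):
    """Stricter matcher: same name and at least one shared co-author."""
    return a["name"] == b["name"] and bool(a["coauthors"] & b["coauthors"])

def match(a, b, ambiguous_sources=frozenset({"DBLP"})):
    """Pick the strict matcher when either mention comes from an ambiguous source."""
    strict = a["source"] in ambiguous_sources or b["source"] in ambiguous_sources
    return match_strict(a, b) if strict else match_by_name(a, b)

vgram = {"name": "Chen Li", "source": "DBLP",
         "coauthors": {"Bin Wang", "Xiaochun Yang"}}
contraction = {"name": "Chen Li", "source": "DBLP",
               "coauthors": {"Ping-Qi Pan", "Jian-Feng Hu"}}

print(match_by_name(vgram, contraction))  # loose matcher merges the two
print(match(vgram, contraction))          # strict matcher keeps them apart
```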

Summary: Extraction and Integration

Most current solutions try to find a single good plan, applied to all of the data

Cimple solution: reasonable plan, then patch

So the focus shifts to:

how to find a reasonable plan?

how to detect problematic data spots?

how to patch those?

Need a notion of semantic ambiguity

Different from the notion of source relevance

3. Detecting Problems and Making Corrections


How to Detect Problems?

After extraction and matching, build services, e.g., superhomepages

Many such homepages contain minor problems, e.g.:

X graduated in 19998

X chairs SIGMOD-05 and VLDB-05

X published 5 SIGMOD-03 papers

Intuitively, something is semantically incorrect

To fix this, build a Semantic Debugger

learns what is a normal profile for researcher, paper, etc.

alerts the builder to potentially buggy superhomepages

so corrections / feedback can be provided
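To make the "normal profile" idea concrete, here is a toy sketch that learns a profile for one numeric attribute and flags values far outside it. The choice of median and median absolute deviation, the threshold, and the sample graduation years are all hypothetical; the slides do not say how the Semantic Debugger learns its profiles.

```python
from statistics import median

def learn_profile(values):
    """Summarize 'normal' values by their median and median absolute deviation."""
    center = median(values)
    spread = median(abs(v - center) for v in values)
    return center, spread

def is_suspicious(value, profile, threshold=10.0):
    """Flag a value that lies more than threshold spreads from the center."""
    center, spread = profile
    return abs(value - center) > threshold * max(spread, 1.0)

# Hypothetical graduation years already accepted as correct.
profile = learn_profile([1985, 1992, 1998, 2001, 2004])

for year in (2006, 19998):
    print(year, "suspicious" if is_suspicious(year, profile) else "ok")
```

A check like this only raises an alert; deciding whether the value is really wrong is left to the builder or to user feedback.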

What Types of Feedback?

Say that a certain data item Y is wrong

Provide correct value for Y, e.g., Y = SIGMOD-06

Add domain knowledge

e.g., no researcher has ever published 5 SIGMOD papers in a year

Add more data

e.g., X was advised by Z

e.g., here is the URL of another data source

Modify the underlying algorithm

e.g., pull out all data involving X, and match using names and co-authors, not just names

How to Make Providing Feedback Very Easy?

Extremely crucial in the DBLife context

If feedback can be provided easily

can get more feedback

can leverage the mass of users

Critical but unsolved

Provide a Wiki interface

How to Make Providing Feedback Very Easy?

Say that a certain data item Y is wrong

Provide correct value for Y, e.g., Y = SIGMOD-06

Add domain knowledge

Add more data

Modify the underlying algorithm

Provide form interfaces

Unsolved: some recent interest in how to mass-customize software

Summary: Detection and Feedback

How to detect problems?

Semantic Debugger

What types of feedback & how to easily provide them?

critical, largely unsolved

What feedback would make the most impact?

crucial in large-scale systems

need a notion of a Feedback Advisor

need a precise notion of system quality

4. Mass Collaboration


Mass Collaboration: Voting

Can be applied to numerous problems

Example: Matching

Hard for machine, but easy for human

Mouse for Dell laptop 200 series ...

Dell X200; mouse at reduced price ...

Dell laptop X200 with mouse ...
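Voting can be sketched as a majority rule over user answers to a yes/no matching question. The minimum number of votes and the strict-majority rule are assumptions chosen for this example.

```python
from collections import Counter

def majority_vote(votes, min_votes=3):
    """Return the answer given by a strict majority, or None if undecided."""
    if len(votes) < min_votes:
        return None
    (answer, count), = Counter(votes).most_common(1)
    return answer if count > len(votes) / 2 else None

# Hypothetical votes on "do these two listings describe the same product?"
print(majority_vote([True, True, False]))
print(majority_vote([True, False]))
print(majority_vote([True, False, True, False]))
```

Returning None for ties or too few votes leaves the question open, so it can be shown to more users.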

Mass Collaboration: Wiki

Community wikipedia

built by machine + human

backed up by a structured database

[Figure: data sources and graph G, with edited versions V3’ and W3’ produced by machine and human updates]

Mass Collaboration: Wiki

<# person(id=1){name}=David J. DeWitt #>

<# person(id=1){title}=Professor #>

<strong>Interests:</strong><# person(id=1).interests(id=3).topic(id=4){name}=Parallel Database #>

David J. DeWitt

Professor

Interests: Parallel Database

<# person(id=1){title}=John P. Morgridge Professor #>

<# person(id=1){organization}=UW #> since 1976

<# person(id=1){title}=John P. Morgridge Professor #>

<# person(id=1){organization}=UW-Madison #> since 1976

<# person(id=1).interests(id=5).topic(id=6){name}=Privacy #>

David J. DeWitt

John P. Morgridge Professor

UW-Madison since 1976

Interests: Parallel Database

Privacy

Machine
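The embedded tags above tie wiki text to the structured database. The sketch below parses tags of the form shown on this slide and applies them to a small fact store, with later edits overwriting earlier values. The tag grammar is inferred only from these examples, and the function names and the overwrite rule are assumptions rather than the actual Cimple implementation.

```python
import re

# <# path{attribute}=value #>, e.g. <# person(id=1){name}=David J. DeWitt #>
TAG = re.compile(
    r"<#\s*(?P<path>[\w().=\s]+?)\s*\{(?P<attr>\w+)\}\s*=\s*(?P<value>.*?)\s*#>",
    re.DOTALL,
)

def parse_tags(page):
    """Return (path, attribute, value) triples for every tag in the page."""
    triples = []
    for m in TAG.finditer(page):
        path = re.sub(r"\s+", "", m.group("path"))
        triples.append((path, m.group("attr"), m.group("value")))
    return triples

def apply_edits(triples):
    """Build a fact store; a later tag for the same key replaces the earlier value."""
    facts = {}
    for path, attr, value in triples:
        facts[(path, attr)] = value
    return facts

page = """<# person(id=1){name}=David J. DeWitt #>
<# person(id=1){title}=Professor #>
<# person(id=1){title}=John P. Morgridge Professor #>
<# person(id=1){organization}=UW-Madison#>since 1976
<# person(id=1).interests(id=5).topic(id=6){name}=Privacy #>"""

facts = apply_edits(parse_tags(page))
for key, value in facts.items():
    print(key, "=", value)
```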

Summary: Mass Collaboration

What can users contribute?

How to evaluate user quality?

How to reconcile inconsistent data?

Summary: Cimple

A very interesting attempt to rethink Web crawling and information extraction

Based on a “best-effort” notion

One of many concurrent efforts in that vein, e.g., “Dataspaces”

Simple building blocks, progressive refinement

Open Questions and Issues

Incorporating uncertain data

Can we really scale to, and manage the complexity of, many, many extractors that might produce data of different quality?

How do we provide feedback? Particularly as there are many learners, data and feedback are very sparse

Others?
