The Web’s Many Models Michael J. Cafarella University of Michigan AKBC May 19, 2010 ?


The Web’s Many Models

Michael J. Cafarella University of Michigan

AKBC, May 19, 2010


Web Information Extraction

• Much recent research in information extractors that operate over Web pages
  • Snowball (Agichtein and Gravano, 2001)
  • TextRunner (Banko et al., 2007)
  • Yago (Suchanek et al., 2007)
  • WebTables (Cafarella et al., 2008)
  • DBPedia, ExDB, Freebase (make use of IE data)
• Web crawl + domain-independent IE should allow comprehensive Web KBs with:
  • Very high, “web-style” recall
  • “More-expressive-than-search” query processing
• But where is it?


Web Information Extraction

• Omnivore
  • “Extracting and Querying a Comprehensive Web Database.” Michael Cafarella. CIDR 2009. Asilomar, CA.
  • Suggested remedies for data ingestion, user interaction
• This talk explains why ideas in that paper might already be out of date, and gives alternative ideas
• If there are mistakes here, then you have a chance to save me years of work!


Outline

• Introduction
• Data Ingestion
  • Previously: Parallel Extraction
  • Alternative: The Data-Centric Web
• User Interaction
  • Previously: Model Generation for Output
  • Alternative: Data Integration as UI
• Conclusion


Parallel Extraction

• Previous hypothesis
  • Many data models for interesting data, e.g., relational tables and E/R graphs
  • Should build large integration infrastructure to consume many extraction streams


Database Construction (1)

Start with a single large Web crawl


Database Construction (2)

Each of k extractors emits output that:
• Has an extractor-dependent model
• Has an extractor-and-Web-page-dependent schema


Database Construction (3)

For each extractor output, unfold into common entity-relation model


Database Construction (4)

Unify results


Database Construction (5)

Emit final database
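The five construction steps above can be sketched as a small pipeline. This is only an illustration under stated assumptions, not Omnivore's implementation: the two toy extractors, the `Triple` shape, the page format, and the lowercase-name unification rule are all hypothetical.

```python
from typing import NamedTuple


class Triple(NamedTuple):
    """One fact in the common entity-relation model (hypothetical shape)."""
    entity: str
    relation: str
    value: str


def table_extractor(page: dict) -> list[dict]:
    """Hypothetical extractor 1: emits relational rows keyed by the page's header."""
    header = page.get("header", [])
    return [dict(zip(header, row)) for row in page.get("rows", [])]


def text_extractor(page: dict) -> list[tuple[str, str, str]]:
    """Hypothetical extractor 2: emits (subject, predicate, object) tuples."""
    return list(page.get("facts", []))


def unfold_table(rows: list[dict]) -> list[Triple]:
    """Step 3: treat the first column as the entity, the rest as relations."""
    triples = []
    for row in rows:
        (_, key), *rest = row.items()
        triples += [Triple(key.lower(), col, val) for col, val in rest]
    return triples


def unfold_text(facts: list[tuple[str, str, str]]) -> list[Triple]:
    """Step 3 for the text extractor: tuples map directly onto triples."""
    return [Triple(s.lower(), p, o) for s, p, o in facts]


def build_database(crawl: list[dict]) -> dict[str, dict[str, set[str]]]:
    """Steps 1-5: crawl -> k extractors -> unfold -> unify -> final database."""
    extractors = [(table_extractor, unfold_table), (text_extractor, unfold_text)]
    database: dict[str, dict[str, set[str]]] = {}
    for page in crawl:
        for extract, unfold in extractors:
            for t in unfold(extract(page)):
                # Step 4: unify by merging facts that share an entity key.
                database.setdefault(t.entity, {}).setdefault(t.relation, set()).add(t.value)
    return database


# Toy crawl: one page with a table, one page with a text fact.
crawl = [
    {"header": ["name", "affiliation"], "rows": [["Serge Abiteboul", "inria"]]},
    {"facts": [("serge abiteboul", "pc_member_of", "VLDB")]},
]
db = build_database(crawl)
```

Each extractor keeps its own output model, and only the unfold step knows how to map that model onto triples, which is why adding a new extractor means adding one more (extract, unfold) pair.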


Potential Problems

• Pressing problems:
  • Recall
  • Simple intra-source reconciliation
  • Time
• Tables, entities probably OK for now
  • Many data sources (DBPedia, Facebook, IMDB) already match one of these two pretty well
• One possible different direction: the Data-Centric Web
  • Addresses recall only


The Data-Centric Web

[Figure-only slides; images not preserved]


Data-Centric Lists

• Lists of Data-Centric Entities (DCEs) give hints:
  • About what the target entity contains
  • That all members of set are DCEs, or not
  • That members of set belong to a class or type (e.g., program committee members)


Build the Data-Centric Web

1. Download the Web
2. Train classifiers to detect Data-Centric Entities (DCEs) and Data-Centric Lists (DCLs)
3. Filter out all pages that fail both tests
4. Use lists to fix up incorrect Data-Centric Entity classifications
5. Run attr/val extractors on DCEs

• Yields E/R dataset, for insertion into DBPedia, YAGO, etc.
• In progress now… with student Ashwin Balakrishnan, entity detector >95% acc.
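Steps 3 and 4 can be sketched as follows. This is a minimal illustration, not the project's code: the `Page` type is hypothetical, the two boolean flags stand in for trained classifiers, and the majority-vote fix-up is just one plausible way to use the "all members of set are DCEs, or not" hint. The vote reads the raw classifier labels of every crawled page, so a list can rescue a member that the step 3 filter alone would drop; that ordering is an assumption of the sketch.

```python
from dataclasses import dataclass, field


@dataclass
class Page:
    url: str
    is_dce: bool   # stand-in for a trained Data-Centric Entity classifier
    is_dcl: bool   # stand-in for a trained Data-Centric List classifier
    members: list[str] = field(default_factory=list)  # URLs a list page points to


def passes_filter(page: Page) -> bool:
    """Step 3: a page survives unless it fails both tests."""
    return page.is_dce or page.is_dcl


def fix_up_entities(pages: list[Page]) -> set[str]:
    """Step 4: relabel list members by majority vote within each list."""
    by_url = {p.url: p for p in pages}
    entities = {p.url for p in pages if p.is_dce}
    for lst in (p for p in pages if p.is_dcl):
        members = [by_url[u] for u in lst.members if u in by_url]
        if not members:
            continue
        votes = sum(m.is_dce for m in members)
        if 2 * votes > len(members):
            # Most members look like entities: treat all of them as entities.
            entities.update(m.url for m in members)
        else:
            # Most members do not: treat none of them as entities.
            entities.difference_update(m.url for m in members)
    return entities


pages = [
    Page("pc-list", is_dce=False, is_dcl=True, members=["a", "b", "c"]),
    Page("a", is_dce=True, is_dcl=False),
    Page("b", is_dce=True, is_dcl=False),
    Page("c", is_dce=False, is_dcl=False),     # classifier miss, rescued by the list
    Page("spam", is_dce=False, is_dcl=False),  # fails both tests, never rescued
]
survivors = [p.url for p in pages if passes_filter(p)]
entities = fix_up_entities(pages)
```

The resulting `entities` set is what step 5 would hand to the attribute/value extractors.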


Research Question 1

• How many useful entities…
  • Lack a page in the Data-Centric Web? (That means no homepage, no Amazon page, no public Facebook page, etc.)
  • AND are otherwise well-described enough online that IE can recover an entity-centric view?
• Put differently:
  • Does every entity worth extracting already have a homepage on the Web?


Research Question 2

• Does a single real-world entity have more than one “authoritative” URL?
  • Note that Wikipedia provides pretty minimal assistance in choosing the right entity, but does a good job


Outline

• Introduction
• Data Ingestion
  • Previously: Parallel Extraction
  • Alternative: The Data-Centric Web
• User Interaction
  • Previously: Model Generation for Output
  • Alternative: Data Integration as UI
• Conclusion


Model Generation for Output

• Previous hypothesis
  • Many different user applications built against single back-end database
  • Difficult task is translating from back-end data model to the application’s data model


Query Processing (1)

Query arrives at system


Query Processing (2)

Entity-relation database processor yields entity results


Query Processing (3)

Query Renderer chooses appropriate output schema


Query Processing (4)

User corrections are logged and fed into later iterations of db construction
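The four query-processing steps can be sketched as below. Everything here is an assumption made for illustration: the substring match, the rule that picks a table when all results share an attribute, and the in-memory correction log are hypothetical stand-ins, and the toy database is invented.

```python
correction_log: list[dict] = []


def process_query(query: str, database: dict[str, dict[str, str]]) -> dict[str, dict[str, str]]:
    """Steps 1-2: return entities whose name or attribute values mention the query."""
    q = query.lower()
    return {
        name: attrs
        for name, attrs in database.items()
        if q in name.lower() or any(q in v.lower() for v in attrs.values())
    }


def render(results: dict[str, dict[str, str]]) -> dict:
    """Step 3: choose an output schema; a table when results share attributes."""
    if not results:
        return {"schema": "empty", "rows": []}
    shared = set.intersection(*(set(attrs) for attrs in results.values()))
    if shared:
        cols = ["name", *sorted(shared)]
        rows = [[name, *(attrs[c] for c in cols[1:])] for name, attrs in results.items()]
        return {"schema": "table", "columns": cols, "rows": rows}
    return {"schema": "list", "rows": sorted(results)}


def log_correction(entity: str, attribute: str, corrected_value: str) -> None:
    """Step 4: record a user correction for a later database-construction run."""
    correction_log.append(
        {"entity": entity, "attribute": attribute, "value": corrected_value}
    )


# Toy entity-relation database (hypothetical contents).
database = {
    "serge abiteboul": {"affiliation": "inria"},
    "inria": {"kind": "research institute"},
}
as_table = render(process_query("serge", database))
as_list = render(process_query("inria", database))
log_correction("serge abiteboul", "affiliation", "INRIA")
```

The renderer is the piece the previous hypothesis treated as hard: the same entity results can be shown under different schemas depending on what they have in common.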


Potential Problems

• Many plausible front-end applications, none yet totally compelling and novel
  • Ad- and search-driven ones not novel
  • Freebase, Wolfram Alpha not compelling
  • Raw input to learners: useful, not an end-user application
• Need to explore possible applications rather than build multi-app infrastructure
• One possible different direction: data integration as user primitive


Data Integration as UI

• Can we combine tables to create new data sources?
• Many existing “mashup” tools, which ignore realities of Web data
  • A lot of useful data is not in XML
  • User cannot know all sources in advance
  • Transient integrations
  • Dirty data


Interaction Challenge

Try to create a database of all “VLDB program committee members”


Octopus

• Provides “workbench” of data integration operators to build target database
  • Most operators are not correct/incorrect, but high/low quality (like search)
  • Also, prosaic traditional operators
• Originally ran on WebTable data
• [VLDB 2009, Cafarella, Khoussainova, Halevy]


Walkthrough - Operator #1

SEARCH(“VLDB program committee members”)

serge abiteboul | inria
anastassia ail… | carnegie…
gustavo alonso | etz zurich
… | …

serge abiteboul | inria
michael adiba | …grenoble
antonio albano | …pisa
… | …


Walkthrough - Operator #2

Recover relevant data

serge abiteboul | inria
michael adiba | …grenoble
antonio albano | …pisa
… | …

serge abiteboul | inria
anastassia ail… | carnegie…
gustavo alonso | etz zurich
… | …

CONTEXT()

CONTEXT()


Walkthrough - Operator #2

Recover relevant data

serge abiteboul | inria | 1996
michael adiba | …grenoble | 1996
antonio albano | …pisa | 1996
… | … | …

serge abiteboul | inria | 2005
anastassia ail… | carnegie… | 2005
gustavo alonso | etz zurich | 2005
… | … | …

CONTEXT()

CONTEXT()


Walkthrough - Union

Combine datasets

serge abiteboul | inria | 1996
michael adiba | …grenoble | 1996
antonio albano | …pisa | 1996
… | … | …

serge abiteboul | inria | 2005
anastassia ail… | carnegie… | 2005
gustavo alonso | etz zurich | 2005
… | … | …

Union()

serge abiteboul | inria | 1996
michael adiba | …grenoble | 1996
antonio albano | …pisa | 1996
serge abiteboul | inria | 2005
anastassia ail… | carnegie… | 2005
gustavo alonso | etz zurich | 2005
… | … | …


Walkthrough - Operator #3

• Add column to data
• Similar to “join” but join target is a topic

EXTEND(“publications”, col=0)

serge abiteboul | inria | 1996
michael adiba | …grenoble | 1996
antonio albano | …pisa | 1996
serge abiteboul | inria | 2005
anastassia ail… | carnegie… | 2005
gustavo alonso | etz zurich | 2005
… | … | …

serge abiteboul | inria | 1996 | “Large Scale P2P Dist…”
michael adiba | …grenoble | 1996 | “Exploiting bitemporal…”
antonio albano | …pisa | 1996 | “Another Example of a…”
serge abiteboul | inria | 2005 | “Large Scale P2P Dist…”
anastassia ail… | carnegie… | 2005 | “Efficient Use of the…”
gustavo alonso | etz zurich | 2005 | “A Dynamic and Flexible…”
… | … | …

• User has integrated data sources with little effort
• No wrappers; data was never intended for reuse
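The walkthrough's CONTEXT, Union, and EXTEND steps can be sketched as plain functions over lists of rows. This is a simplified illustration, not the Octopus implementation: the function names mirror the operators, but the bodies are assumptions. In particular, `context` takes the recovered value as an argument instead of finding it on the source page, and `extend` joins against a hypothetical in-memory `topic_index` instead of searching the Web for the topic. Truncated values from the slides are kept as they appear.

```python
Row = tuple[str, ...]


def context(table: list[Row], value: str) -> list[Row]:
    """CONTEXT: append a value recovered from the source page to every row."""
    return [row + (value,) for row in table]


def union(*tables: list[Row]) -> list[Row]:
    """Union: concatenate rows from tables that share a column layout."""
    return [row for table in tables for row in table]


def extend(table: list[Row], topic_index: dict[str, str], col: int = 0) -> list[Row]:
    """EXTEND: add a column by looking up each row's key in a topic index.

    `topic_index` is a hypothetical stand-in for topic-specific Web results;
    rows with no match get an empty string.
    """
    return [row + (topic_index.get(row[col], ""),) for row in table]


# The two tables returned by SEARCH, as shown in the walkthrough.
pc_a = [("serge abiteboul", "inria"), ("michael adiba", "…grenoble"), ("antonio albano", "…pisa")]
pc_b = [("serge abiteboul", "inria"), ("anastassia ail…", "carnegie…"), ("gustavo alonso", "etz zurich")]

# Hypothetical "publications" index covering only two of the names.
publications = {
    "serge abiteboul": "Large Scale P2P Dist…",
    "michael adiba": "Exploiting bitemporal…",
}

merged = union(context(pc_a, "1996"), context(pc_b, "2005"))
extended = extend(merged, publications, col=0)
```

Keeping each operator a pure function from tables to tables is what lets a user chain them interactively and discard a low-quality result without side effects.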


CONTEXT Algorithms

• Input: table and source page
• Output: data values to add to table

SignificantTerms sorts terms in source page by “importance” (tf-idf)
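A tf-idf ranking of this kind can be sketched in a few lines. This is a minimal version under stated assumptions, not the talk's implementation: the tokenizer, the smoothed idf formula, the background `corpus`, and the example page text are all hypothetical choices.

```python
import math
import re
from collections import Counter


def tokenize(text: str) -> list[str]:
    """Lowercase and split on non-alphanumeric characters (an assumed tokenizer)."""
    return re.findall(r"[a-z0-9]+", text.lower())


def significant_terms(source_page: str, corpus: list[str], k: int = 3) -> list[str]:
    """Rank the terms of source_page by tf-idf against a background corpus."""
    tf = Counter(tokenize(source_page))
    doc_sets = [set(tokenize(doc)) for doc in corpus]
    n_docs = len(doc_sets)

    def idf(term: str) -> float:
        df = sum(term in doc for doc in doc_sets)
        # Smoothed idf so that unseen terms do not divide by zero (an assumption).
        return math.log((1 + n_docs) / (1 + df)) + 1.0

    scores = {term: count * idf(term) for term, count in tf.items()}
    # Highest score first; ties broken alphabetically for a stable order.
    return sorted(scores, key=lambda t: (-scores[t], t))[:k]


page = "VLDB 2005 program committee members VLDB 2005"
corpus = [
    "program committee members of a conference",
    "committee members list",
    "conference program",
]
top = significant_terms(page, corpus, k=2)
```

Terms that are frequent on the source page but rare in the background corpus rise to the top, which is how a value such as a conference year can surface as a candidate for the new column.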


Related View Partners

Looks for different “views” of same data
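One plausible reading of "different views of same data" is a search for other tables whose key values overlap heavily with the target table. The sketch below illustrates that reading only; the Jaccard measure, the 0.5 threshold, and the candidate tables are assumptions, not the algorithm from the talk.

```python
Row = tuple[str, ...]


def key_overlap(table_a: list[Row], table_b: list[Row], col: int = 0) -> float:
    """Jaccard overlap between the key columns of two tables."""
    keys_a = {row[col] for row in table_a}
    keys_b = {row[col] for row in table_b}
    if not keys_a or not keys_b:
        return 0.0
    return len(keys_a & keys_b) / len(keys_a | keys_b)


def related_views(target: list[Row], candidates: dict[str, list[Row]],
                  threshold: float = 0.5) -> list[str]:
    """Return names of candidate tables that look like views of the target data."""
    scores = {name: key_overlap(target, table) for name, table in candidates.items()}
    hits = [name for name, score in scores.items() if score >= threshold]
    return sorted(hits, key=lambda name: (-scores[name], name))


target = [("serge abiteboul", "inria"), ("michael adiba", "…grenoble"), ("antonio albano", "…pisa")]
candidates = {
    "other-page-same-committee": [("serge abiteboul",), ("michael adiba",), ("antonio albano",)],
    "unrelated-table": [("someone else",)],
}
partners = related_views(target, candidates)
```

Pages hosting such partner tables could then serve as additional source pages when looking for context values.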


CONTEXT Experiments


Data Integration as UI

Compelling for db researchers, but will large numbers of people use it?


Conclusion

• Automatic Web KBs rapidly progressing
  • Recall still not good enough for many tasks, but progress is rapid
  • Not clear what those tasks should be, and progress is much slower
    • Difficult to predict what’s useful
    • Sometimes difficult to write a “new app” paper
• Omnivore’s approach not wrong, but did not directly address these problems