Information Network Analysis and Extraction: Extraction and Integration of the Semi-Structured Web

Tim Weninger, Computer Science and Engineering Department, University of Notre Dame


Page 1: Information Network Analysis and Extraction  Extraction  and Integration of the Semi-Structured  Web

Information Network Analysis and Extraction

Extraction and Integration of the Semi-Structured Web

Tim Weninger

Computer Science and Engineering Department

University of Notre Dame

Page 2: Information Network Analysis and Extraction  Extraction  and Integration of the Semi-Structured  Web

Rules of this tutorial

1. Ask questions
2. Ask lots of questions
3. If you don’t agree with something, let me know
4. If something is not clear, ask a question

Slides can be found online at: http://web.engr.illinois.edu/~weninge1/publications.html
Google/Bing/Yahoo: ‘Tim Weninger Publications’

Page 3: Information Network Analysis and Extraction  Extraction  and Integration of the Semi-Structured  Web

The Web

Social Networks
› Early Messenger Networks
› Social Media
› Gaming Networks
› Professional Networks

Hyperlink Networks
› Blog Networks
› Wiki-networks
› Web-at-large
  » Internal links
  » External links

Page 4: Information Network Analysis and Extraction  Extraction  and Integration of the Semi-Structured  Web

The Web is a Hyperlink Network

Page 5: Information Network Analysis and Extraction  Extraction  and Integration of the Semi-Structured  Web

Ranking on the Web

Query:

Page 6: Information Network Analysis and Extraction  Extraction  and Integration of the Semi-Structured  Web

Clustering on the Web

Sim(

Page 7: Information Network Analysis and Extraction  Extraction  and Integration of the Semi-Structured  Web

This Tutorial is about the structure and content of the Web

Name, Phone, Office, Age

Gender, Email

Author, Dateline

Topic, Persons, Location

Page 8: Information Network Analysis and Extraction  Extraction  and Integration of the Semi-Structured  Web

Imagine what we could do…

Search
› Show structured information in response to query
› Automatically rank and cluster entities
› Reasoning on the Web
  » Who are the people at some company?
  » What are the courses in some college department?

Analysis
› Expand the known information of an entity
  » What is a professor’s phone number, email, courses taught, research, etc.?

Page 9: Information Network Analysis and Extraction  Extraction  and Integration of the Semi-Structured  Web

Outline

Preliminaries
Information Extraction
Break (30 min)
Information Integration
Web Information Networks

Page 10: Information Network Analysis and Extraction  Extraction  and Integration of the Semi-Structured  Web

Databases and Schemas

Databases usually have a well defined schema

Page 11: Information Network Analysis and Extraction  Extraction  and Integration of the Semi-Structured  Web

Databases and Schemas

Databases usually have a well defined schema

Page 12: Information Network Analysis and Extraction  Extraction  and Integration of the Semi-Structured  Web

XML – a data description language

XML Schema

Page 13: Information Network Analysis and Extraction  Extraction  and Integration of the Semi-Structured  Web

XML – a data description language

XML Instance

Page 14: Information Network Analysis and Extraction  Extraction  and Integration of the Semi-Structured  Web

HTML and Semi-Structured data

Page 15: Information Network Analysis and Extraction  Extraction  and Integration of the Semi-Structured  Web

HTML and Semi-Structured data

What’s the schema?

Page 16: Information Network Analysis and Extraction  Extraction  and Integration of the Semi-Structured  Web

HTML and Semi-Structured data

HTML has no schema!

HTML is a markup language
› A description for a browser to render
› HTML describes how the data should be displayed

HTML was never meant to describe the data.

Page 17: Information Network Analysis and Extraction  Extraction  and Integration of the Semi-Structured  Web

HTML and Semi-Structured data

HTML was never meant to describe the data.

But there is so much data on the Web…we have to try

Page 18: Information Network Analysis and Extraction  Extraction  and Integration of the Semi-Structured  Web

Document Object Model

HTML -> DOM
› DOM is a tree model of the HTML markup
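To make the HTML -> DOM step concrete, here is a minimal sketch that turns markup into a tag tree using Python's standard-library `html.parser`. The `Node` and `TreeBuilder` classes and the sample markup are illustrative assumptions, and the sketch assumes every start tag has a matching end tag; a real browser DOM handles far more (void elements, error recovery, attributes).

```python
from html.parser import HTMLParser


class Node:
    """One element of the simplified tree (hypothetical helper class)."""

    def __init__(self, tag, parent=None):
        self.tag = tag
        self.parent = parent
        self.children = []
        self.text = ""


class TreeBuilder(HTMLParser):
    """Builds a tag tree; assumes well-nested markup with explicit end tags."""

    def __init__(self):
        super().__init__()
        self.root = Node("#document")
        self.current = self.root

    def handle_starttag(self, tag, attrs):
        node = Node(tag, self.current)
        self.current.children.append(node)
        self.current = node

    def handle_endtag(self, tag):
        # Climb back to the parent when an element closes.
        if self.current.parent is not None:
            self.current = self.current.parent

    def handle_data(self, data):
        self.current.text += data.strip()


def build_tree(html):
    builder = TreeBuilder()
    builder.feed(html)
    builder.close()
    return builder.root


def outline(node, depth=0):
    """Return the tree as indented tag names, one per line."""
    lines = ["  " * depth + node.tag]
    for child in node.children:
        lines.extend(outline(child, depth + 1))
    return lines


tree = build_tree(
    "<html><body><h1>Faculty</h1><ul><li>Ann</li><li>Bob</li></ul></body></html>"
)
print("\n".join(outline(tree)))
```

The tree records nesting only; nothing in it says which nodes hold a name or a phone number, which is the point of the following slides.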

Page 19: Information Network Analysis and Extraction  Extraction  and Integration of the Semi-Structured  Web

What the DOM is not

From the W3C:

The Document Object Model does not define what information in a document is relevant or how information in a document is structured. For XML, this is specified by the W3C XML Information Set [Infoset]. The DOM is simply an API to this information set.

Page 20: Information Network Analysis and Extraction  Extraction  and Integration of the Semi-Structured  Web

Web page rendering

HTML -> DOM -> WebPage
› Web page rendering according to Web standards

Uses the CSS box model

Page 21: Information Network Analysis and Extraction  Extraction  and Integration of the Semi-Structured  Web

Web databases

LOTS of pages on the Web are database interfaces

Page 22: Information Network Analysis and Extraction  Extraction  and Integration of the Semi-Structured  Web

Web databases

Some pages are not database interfaces….but they could be

Page 23: Information Network Analysis and Extraction  Extraction  and Integration of the Semi-Structured  Web

Relational Databases on the Web

WebPages can have relational data

Page 24: Information Network Analysis and Extraction  Extraction  and Integration of the Semi-Structured  Web

Data can be hidden in text too!

Page 25: Information Network Analysis and Extraction  Extraction  and Integration of the Semi-Structured  Web

HTML and Semi-Structured data

Our goal is to extract information from the Web

…and make sense out of it!

Page 26: Information Network Analysis and Extraction  Extraction  and Integration of the Semi-Structured  Web

Outline

Preliminaries
Information Extraction from text
Break (30 min)
Information Extraction from tables and lists
Web Information Networks

Page 27: Information Network Analysis and Extraction  Extraction  and Integration of the Semi-Structured  Web

Content Extraction

Page 28: Information Network Analysis and Extraction  Extraction  and Integration of the Semi-Structured  Web

Web Content Extraction

Extract only the content of a page

Taken from The Hutchinson News on 8/14/2008

Page 29: Information Network Analysis and Extraction  Extraction  and Integration of the Semi-Structured  Web

Web Content Extraction

Two Approaches
1. Heuristic Approaches
   Work one “document-at-a-time”
2. Template Detection Approaches
   Require multiple documents that contain the same template

Benefits of content extraction
• Reduce the noise in the document
  » Reduce document size
  » Better indexing, search processing
  » Easier to fit on small screens

Page 30: Information Network Analysis and Extraction  Extraction  and Integration of the Semi-Structured  Web

Wrapper Generation

Documents on the Web are made from templates
• Popularity of Content Management Systems
• Database queries are used to “fill out” HTML content

Templates are the framework of the Web page(s)
• The structure is very similar (near identical) among templated Web pages.

1. Cluster similarly structured documents
2. Generate Wrappers
3. Extract Information

Page 31: Information Network Analysis and Extraction  Extraction  and Integration of the Semi-Structured  Web

Wrapper Generation

Documents on the Web are made from templates
• Database query “fills in” the content
• Separate AJAX/HTTP calls “fill in” content

Page 32: Information Network Analysis and Extraction  Extraction  and Integration of the Semi-Structured  Web

Locating Web page templates

Bar-Yossef and Rajagopalan ‘02 first proposed a template recognition algorithm using DOM tree segmentation.
• Template detection via data mining and its applications

Lin and Ho ‘02 developed InfoDiscoverer, which uses the heuristic that template-generated content appears more frequently.
• Discovering informative content blocks from web documents

Debnath et al. ‘05 developed ContentExtractor, which also includes features like image or script elements.
• Automatic extraction of informative blocks from webpages

Page 33: Information Network Analysis and Extraction  Extraction  and Integration of the Semi-Structured  Web

Locating Web page templates

Yi, Liu and Li ‘03 use the Site Style Tree (SST) approach, which finds that identically formatted DOM sub-trees denote the template.
• Eliminating noisy information in web pages for data mining

Crescenzi et al. ’01 developed RoadRunner, which uses the Align, Collapse under Mismatch, and Extract (ACME) approach to generate wrappers.
• Towards Automatic Data Extraction from Large Web Sites

Buttler ‘04 proposes the path shingling approach, which makes use of the shingling technique.
• A short survey of document structure similarity algorithms

Page 34: Information Network Analysis and Extraction  Extraction  and Integration of the Semi-Structured  Web

Wrapper Generation

Generate extraction rules

//div[@class ="content"]/table[1]/tr/td[2]/text()

A home away from school

Day care has after-school duties as some clients start academic year

By Kristen Roderick – The Hutchinson News – [email protected]

The doors at Hadley Day Care opened Wednesday afternoon, and children scurried in with tales of…
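A rule like the XPath expression on this slide can be applied with a few lines of code. The sketch below uses Python's standard-library `xml.etree.ElementTree`, which supports only a subset of XPath (there is no `text()` step), so the rule is written without the trailing `text()` and the text is read from each matched cell instead. The sample page is made up for illustration and is assumed to be well-formed XML; real HTML usually needs a more forgiving parser.

```python
import xml.etree.ElementTree as ET

# Hypothetical templated page: navigation plus a content table.
page = """<html><body>
<div class="nav"><table><tr><td>Home</td><td>Sports</td></tr></table></div>
<div class="content"><table>
<tr><td>Headline</td><td>A home away from school</td></tr>
<tr><td>Byline</td><td>By Kristen Roderick</td></tr>
</table></div>
</body></html>"""

root = ET.fromstring(page)

# The wrapper: second cell of every row in the first table of the content div.
RULE = ".//div[@class='content']/table[1]/tr/td[2]"

extracted = [cell.text for cell in root.findall(RULE)]
print(extracted)
```

The rule ignores the navigation table entirely because it is anchored on `class="content"`, which is also why a small template change (a renamed class, an extra wrapping element) is enough to break it.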

Page 35: Information Network Analysis and Extraction  Extraction  and Integration of the Semi-Structured  Web

Wrapper Generation

Advantages
• Easy to implement and learn
• Can have perfect precision and recall

Disadvantages
• Web sites change their templates often
  » Any small change breaks the wrapper
• Need several examples to learn the wrapper
  » Called “domain-centric” approaches

Page 36: Information Network Analysis and Extraction  Extraction  and Integration of the Semi-Structured  Web

Single Document Content Extraction

Look at a single document at a time
• Use heuristics and data mining principles to find main content.

No template detection
No extraction rule learning

Called “Web-centric” approaches

Page 37: Information Network Analysis and Extraction  Extraction  and Integration of the Semi-Structured  Web

Early Content Extraction Approaches

Body Text Extraction (BTE)
• Interprets HTML document as word and tag tokens
• Identifies a single, continuous region which contains most words while excluding most tags.

Document Slope Curves (DSC)
• Extension of BTE that looks at several document regions.

Link Quota Filters (LQF)
• Remove DOM elements which consist mainly of text occurring in hyperlink anchors.

Page 38: Information Network Analysis and Extraction  Extraction  and Integration of the Semi-Structured  Web

Tag Ratios Content Extraction

Two algorithms
• Same time, same conference
• Same concept

Gottron, et al. ‘07 Content Code Blurring
Weninger, et al. ‘07 Content Extraction via Tag Ratios

Page 39: Information Network Analysis and Extraction  Extraction  and Integration of the Semi-Structured  Web

Text to Tag Ratio

http://www2010.org/www/2010/04/program-guide/

Text: 21 - Tags: 8 -> TTR: 2.63

Text: 22 - Tags: 8 -> TTR: 2.75

Text: 298 - Tags: 6 -> TTR: 49.67

Text: 0 - Tags: 0 -> TTR: 0

Text: 0 - Tags: 1 -> TTR: 0
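The per-line counts on this slide can be reproduced with a short function. This is a rough sketch, assuming that tags are found with a simple regular expression, that text length is the number of non-tag characters after trimming whitespace, and that a line with zero tags is divided by 1 (which matches the "Tags: 0 -> TTR: 0" case for an empty line). The sample lines are made up.

```python
import re

# Naive tag matcher; an assumption that is good enough for a sketch.
TAG = re.compile(r"<[^>]*>")


def text_to_tag_ratios(html):
    """Return one text-to-tag ratio per line of the HTML source."""
    ratios = []
    for line in html.splitlines():
        tags = len(TAG.findall(line))
        text = len(TAG.sub("", line).strip())
        ratios.append(text / max(tags, 1))  # avoid dividing by zero
    return ratios


sample = "\n".join([
    '<div><a href="/">Home</a><a href="/news">News</a></div>',
    "<p>The doors at Hadley Day Care opened Wednesday afternoon.</p>",
    "",
    "<br>",
])

ratios = text_to_tag_ratios(sample)
for line_number, ratio in enumerate(ratios, start=1):
    print(line_number, round(ratio, 2))
```

Navigation lines carry many tags and little text, so their ratio stays low, while a paragraph of article text scores high.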

Page 40: Information Network Analysis and Extraction  Extraction  and Integration of the Semi-Structured  Web

[Figure: Text to Tag Ratio Histogram; x-axis: Line Number, y-axis: Text To Tag Ratio]

Page 41: Information Network Analysis and Extraction  Extraction  and Integration of the Semi-Structured  Web

Histogram Clustering in 2-Dimensions

Looks for jumps in the moving average of TTR

[Figure: Text To Tag Ratio plotted against Line Number, followed by a second plot over Line Number]

Page 42: Information Network Analysis and Extraction  Extraction  and Integration of the Semi-Structured  Web

Histogram Clustering in 2-Dimensions

Absolute value gives insight

[Figure: two plots over Line Number]

Page 43: Information Network Analysis and Extraction  Extraction  and Integration of the Semi-Structured  Web

Histogram Clustering in 2-Dimensions

Make a scatterplot

[Figure: scatterplots with TTR (hʹ) on the x-axis and Differences (g') on the y-axis]

Page 44: Information Network Analysis and Extraction  Extraction  and Integration of the Semi-Structured  Web

Modified k-Means

[Figure: scatterplot with TTR (hʹ) on the x-axis and Differences (g') on the y-axis]
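The last few slides (smooth the ratios, take absolute differences, plot the two against each other, cluster) can be sketched end to end. Everything below is an illustrative simplification, not the published implementation: the moving-average window, the one-step difference, the deterministic initialisation, and in particular the assumption that the "modified" k-means keeps one centroid pinned at the origin so that low-ratio, low-change lines fall into a non-content cluster.

```python
def smooth(values, radius=1):
    """Moving average over 2*radius+1 lines (window truncated at the ends)."""
    out = []
    for i in range(len(values)):
        window = values[max(0, i - radius): i + radius + 1]
        out.append(sum(window) / len(window))
    return out


def abs_differences(values):
    """Absolute change to the next line; the last line gets 0."""
    diffs = [abs(b - a) for a, b in zip(values, values[1:])]
    return diffs + [0.0] if values else []


def kmeans_with_origin(points, k=3, iterations=20):
    """k-means in 2-D where centroid 0 stays at (0, 0). Returns one label per point."""
    ordered = sorted(points)
    # Deterministic start (an assumption): spread free centroids over the sorted points.
    centroids = [(0.0, 0.0)] + [
        ordered[(j * (len(ordered) - 1)) // (k - 1)] for j in range(1, k)
    ]
    labels = [0] * len(points)
    for _ in range(iterations):
        labels = [
            min(
                range(k),
                key=lambda c: (p[0] - centroids[c][0]) ** 2 + (p[1] - centroids[c][1]) ** 2,
            )
            for p in points
        ]
        for c in range(1, k):  # centroid 0 never moves
            members = [p for p, label in zip(points, labels) if label == c]
            if members:
                centroids[c] = (
                    sum(x for x, _ in members) / len(members),
                    sum(y for _, y in members) / len(members),
                )
    return labels


# Made-up TTR values: boilerplate, a block of article text, boilerplate again.
ttr = [1, 1, 2, 1, 40, 45, 50, 42, 1, 2, 1, 1]
h = smooth(ttr)
g = abs_differences(h)
labels = kmeans_with_origin(list(zip(h, g)))
content_lines = [i for i, label in enumerate(labels) if label != 0]
print(content_lines)
```

Lines assigned to the origin cluster are treated as non-content; every other cluster is kept. Lines right at the edge of the text block can end up on either side, which is one reason precision and recall depend on the parameters.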

Page 45: Information Network Analysis and Extraction  Extraction  and Integration of the Semi-Structured  Web

Single Document Content Extraction

Advantages
› Only need a single document at a time
› Unsupervised
  » No training required

Disadvantages
› Precision and Recall vary
  » Depending on the (1) algorithm, (2) parameters, (3) Web page
› What are other problems?
  » Javascript!

Page 46: Information Network Analysis and Extraction  Extraction  and Integration of the Semi-Structured  Web

Rule Extraction

Page 47: Information Network Analysis and Extraction  Extraction  and Integration of the Semi-Structured  Web

Textual Extraction

Web text holds good information, but full NLP understanding is difficult

Two flavors of text extraction
› Domain-at-a-time
› Web-at-large (domain-agnostic)

Very different techniques required for each

Page 48: Information Network Analysis and Extraction  Extraction  and Integration of the Semi-Structured  Web

Domain at a time

Documents on the Web are made from templates
› A single domain has similar language

Page 49: Information Network Analysis and Extraction  Extraction  and Integration of the Semi-Structured  Web

Domain at a time text extraction

If we know the schema/domain, we know the rules

BBC Business – “owned by”, “sales of”, “CEO of”, etc.

Page 50: Information Network Analysis and Extraction  Extraction  and Integration of the Semi-Structured  Web

Known Domains: Rule Learning

1. User provides initial data

2. Algorithm searches for terms, then induces rules.

[ORGANIZATION]’s headquarters in [LOCATION]
[LOCATION]-based [ORGANIZATION]
[ORGANIZATION], [LOCATION]

“Servers at Microsoft’s headquarters in Redmond…”
“The Armonk-based IBM has introduced…”
“Intel, Santa Clara, cut prices of its Pentium…”

ORGANIZATION   LOCATION
Microsoft      Redmond
IBM            Armonk
Intel          Santa Clara
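The "search for terms, then induce rules" step can be illustrated with the slide's own seed data. This is a toy sketch, assuming that a rule is simply the sentence fragment spanning the two seed terms with the terms replaced by slots; the function name `induce_rule` is hypothetical, and real rule learners generalise over many sentences and score the candidate rules.

```python
def induce_rule(sentence, organization, location):
    """Turn one sentence containing a seed pair into a slotted pattern."""
    if organization not in sentence or location not in sentence:
        return None  # the seed pair does not occur in this sentence
    slotted = sentence.replace(organization, "[ORGANIZATION]")
    slotted = slotted.replace(location, "[LOCATION]")
    org_at = slotted.find("[ORGANIZATION]")
    loc_at = slotted.find("[LOCATION]")
    start = min(org_at, loc_at)
    end = max(org_at + len("[ORGANIZATION]"), loc_at + len("[LOCATION]"))
    # Keep only the span that connects the two slots.
    return slotted[start:end]


seeds = [("Microsoft", "Redmond"), ("IBM", "Armonk"), ("Intel", "Santa Clara")]
sentences = [
    "Servers at Microsoft’s headquarters in Redmond…",
    "The Armonk-based IBM has introduced…",
    "Intel, Santa Clara, cut prices of its Pentium…",
]

rules = [induce_rule(s, org, loc) for s, (org, loc) in zip(sentences, seeds)]
for rule in rules:
    print(rule)
```

Each induced pattern is tied to the exact wording of one sentence, which hints at why such rules are brittle across domains.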

Page 51: Information Network Analysis and Extraction  Extraction  and Integration of the Semi-Structured  Web

Known Domains: Rule Learning

1. User provides initial data

2. Algorithm searches for terms, then induces rules.

Extraction rules are intricate and break easily
› Different extraction rules per domain
  » Can’t scale

Have to parse all of the text
› Computationally very expensive

ORGANIZATION   LOCATION
Microsoft      Redmond
IBM            Armonk
Intel          Santa Clara

Page 52: Information Network Analysis and Extraction  Extraction  and Integration of the Semi-Structured  Web

Domain independent – Source dependent

Don’t analyze raw text - use dataset-specific extraction techniques

Yet Another Great Ontology (YAGO)
Finds TYPE relationship in Wikipedia
› Looks at Wikipedia category pages
› Categories can be different
  » Conceptual (naturalized citizens of the US)
  » Relational (1879 births)
  » Thematic (Physics)
  » Administrative (unsourced articles)
  » Only Conceptual ones indicate TYPE

YAGO parses category names and tests whether the head of the name is plural; if so, the category is Conceptual

Page 53: Information Network Analysis and Extraction  Extraction  and Integration of the Semi-Structured  Web

Domain independent – Source dependent

YAGO/YAGO2

Looks at the Wikipedia structures to learn rules

Page 54: Information Network Analysis and Extraction  Extraction  and Integration of the Semi-Structured  Web

Domain independent – Source dependent

YAGO/YAGO2

Page 55: Information Network Analysis and Extraction  Extraction  and Integration of the Semi-Structured  Web

YAGO

Techniques are not general at all
› Limited to 14-100 hand-picked relations
  » Manually generate the relationships we want to look for

Great performance
› Able to extract 40 million facts in YAGO
› 80 million facts in YAGO2

Page 56: Information Network Analysis and Extraction  Extraction  and Integration of the Semi-Structured  Web

Web-At-Large Text Extraction

“Open Information Extraction”

Discovers rules/predicates on the fly
Does not require domain semantics or much human input.
› Run on the whole Web

Textrunner, Banko et al. ‘07

Page 57: Information Network Analysis and Extraction  Extraction  and Integration of the Semi-Structured  Web

Open Information Extraction - Textrunner

Self-Supervised Classifier
› Train extraction-classifier using data & features generated by (expensive) linguistic parser
› Dependency Parser -

Page 58: Information Network Analysis and Extraction  Extraction  and Integration of the Semi-Structured  Web

Open Information Extraction - Textrunner

Page 59: Information Network Analysis and Extraction  Extraction  and Integration of the Semi-Structured  Web

Open Information Extraction - Textrunner

Result Assessment
› Tuple-extraction frequency counts
› Use heuristics
  » not a too-long parse dependency between the two NPs
  » neither NP is simply a pronoun
  » path between NPs does not pass a sentence-like boundary
  » etc.
› Use Naïve Bayes Classifier to find good extractions
  » Features:
  » part-of-speech tags
  » Number of tokens in a relation
  » whether an NP is a proper noun

Page 60: Information Network Analysis and Extraction  Extraction  and Integration of the Semi-Structured  Web

Open Information Extraction - Textrunner

Compared to Domain-dependent extraction

Better coverage
› It’s not restricted in the types of relations
› It’s not restricted in the domain

Lower precision
› Increase in recall results in lower precision
› More noise introduced from the Web-at-large

Page 61: Information Network Analysis and Extraction  Extraction  and Integration of the Semi-Structured  Web

Outline

Preliminaries
Information Extraction from text
Break (30 min)
Information Extraction from tables and lists
Web Information Networks

Page 62: Information Network Analysis and Extraction  Extraction  and Integration of the Semi-Structured  Web

Outline

Preliminaries
Information Extraction from text
Break (30 min)
Information Extraction from tables and lists
Web Information Networks

Page 63: Information Network Analysis and Extraction  Extraction  and Integration of the Semi-Structured  Web

Record Extraction

Page 64: Information Network Analysis and Extraction  Extraction  and Integration of the Semi-Structured  Web

Record Extraction

Find structured data in semi-structured HTML
• Find database tables (rows & columns) in a Web page

Data Record Extraction
List Extraction
WebTable Integration

Page 65: Information Network Analysis and Extraction  Extraction  and Integration of the Semi-Structured  Web

Example of Data Records

Page 66: Information Network Analysis and Extraction  Extraction  and Integration of the Semi-Structured  Web

Data Record Extraction

Mining Data Records from the Web (MDR), Liu et al. ’03

1. Generate Tag Tree

Page 67: Information Network Analysis and Extraction  Extraction  and Integration of the Semi-Structured  Web

MDR

2. Find Generalized Nodes

Generalized nodes have subtrees of the same size, depth, are adjacent, and have a certain string similarity
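The idea of comparing adjacent subtrees by string similarity can be sketched as follows. This is not MDR itself: the sketch compares only single adjacent siblings, flattens each subtree to its sequence of tag names, and swaps in `difflib.SequenceMatcher` as the similarity measure where an edit-distance comparison would normally be used. The 0.8 threshold and the sample markup are assumptions.

```python
import xml.etree.ElementTree as ET
from difflib import SequenceMatcher


def tag_string(element):
    """Flatten a subtree into its tag names in document order."""
    return [node.tag for node in element.iter()]


def similar_adjacent_children(parent, threshold=0.8):
    """Return index pairs (i, i+1) of adjacent children with similar tag strings."""
    children = list(parent)
    pairs = []
    for i in range(len(children) - 1):
        score = SequenceMatcher(
            None, tag_string(children[i]), tag_string(children[i + 1])
        ).ratio()
        if score >= threshold:
            pairs.append((i, i + 1))
    return pairs


# Hypothetical result listing: a heading followed by three product-like records.
listing = ET.fromstring(
    "<div>"
    "<h2>Results</h2>"
    "<div><img/><a>A</a><span>1</span></div>"
    "<div><img/><a>B</a><span>2</span></div>"
    "<div><img/><a>C</a></div>"
    "</div>"
)

pairs = similar_adjacent_children(listing)
print(pairs)
```

The third record lacks a `span` but still matches its neighbour, which is the kind of tolerance a similarity threshold buys over exact matching.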

Page 68: Information Network Analysis and Extraction  Extraction  and Integration of the Semi-Structured  Web

MDR

3. Match identical data records

Page 69: Information Network Analysis and Extraction  Extraction  and Integration of the Semi-Structured  Web

DEPTA

Zhai, Liu ‘05 DEPTA
• Structured Data Extraction from the Web based on Partial Tree Alignment

3. Match similar data records

Page 70: Information Network Analysis and Extraction  Extraction  and Integration of the Semi-Structured  Web

Record Extraction using Tag Path Clustering

Inverted Index

Page 71: Information Network Analysis and Extraction  Extraction  and Integration of the Semi-Structured  Web

Record Extraction using Tag Path Clustering

Derive similarities from the visual signal vectors

Distance between centers of gravity

Interleaving measure

Similarity measure

Page 72: Information Network Analysis and Extraction  Extraction  and Integration of the Semi-Structured  Web

Record Extraction using Tag Path Clustering

Similarity Matrix of tag paths

Page 73: Information Network Analysis and Extraction  Extraction  and Integration of the Semi-Structured  Web

MiBAT – Extraction of Records containing UGC

Song et al. ‘10 – Extracts data records containing user generated content (UGC)

Page 74: Information Network Analysis and Extraction  Extraction  and Integration of the Semi-Structured  Web

MiBAT

Finding Anchor Trees
• Nodes within the record that match across all subtrees
• Use those anchors to tie the data records together
• Those anchor trees need to be predefined
• They are a date, time, or some common structured text that a Regular Expression can find.

Page 75: Information Network Analysis and Extraction  Extraction  and Integration of the Semi-Structured  Web

DOM Record Extraction

Advantages
• Unsupervised
  » Only needs one page at a time
• Tag-agnostic
  » Doesn’t matter what the type of the HTML tag is

Disadvantages
• Precision and Recall vary
  » Depends on the Web page and assumptions of the algorithm
• HTML is not a schema
  » Misses AJAX, Javascript, other HTTP calls
  » What is the purpose of HTML?

Page 76: Information Network Analysis and Extraction  Extraction  and Integration of the Semi-Structured  Web

Visual Based Record Extraction

Assumptions:
• HTML describes the structure of a document
• Repeating Patterns = Records
• HTML is a markup language

We need to render the Web page

Page 77: Information Network Analysis and Extraction  Extraction  and Integration of the Semi-Structured  Web

Visual Web Page Rendering

Page 78: Information Network Analysis and Extraction  Extraction  and Integration of the Semi-Structured  Web

VENTex – Visual Record Extraction

Gatterbauer et al. ‘07 Visual Record Extraction VENTex
• Towards Domain-Independent Information Extraction from Web Tables

Page 79: Information Network Analysis and Extraction  Extraction  and Integration of the Semi-Structured  Web

Visual Record Extraction

VENTex relies on lots of heuristics

Does not consider underlying DOM

Page 80: Information Network Analysis and Extraction  Extraction  and Integration of the Semi-Structured  Web

Hybrid List Extraction

Property 1: If box a is contained in box b, then b is an ancestor of a in the rendered box tree.

Property 2: If a and b are not related under property 1, then they do not overlap visually on the page.

Fumarola et al. ‘12 Hybrid List Extraction HyLiEn

Page 81: Information Network Analysis and Extraction  Extraction  and Integration of the Semi-Structured  Web

Candidate Generation based on Visual Features

A list candidate on a rendered Web page consists of a set of vertically and/or horizontally aligned boxes.

Two lists l and l′ are related if they have an element in common.

A set of lists S is a tiled structure if for every list l in S there exists at least one other list l′ in S such that l and l′ are related and l ≠ l′. Lists in a tiled structure are called tiled lists.
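These definitions translate into code fairly directly. The sketch below is an illustration under simplifying assumptions: each rendered box is reduced to a name and the coordinates of its top-left corner, boxes count as aligned only when the coordinate matches exactly, and the names `Box`, `aligned_lists`, `related`, and `is_tiled` are hypothetical.

```python
from collections import defaultdict, namedtuple

# Simplified rendered box: a label plus the top-left corner in pixels.
Box = namedtuple("Box", "name x y")


def aligned_lists(boxes):
    """Group boxes sharing a left edge (vertical) or a top edge (horizontal)."""
    by_left, by_top = defaultdict(list), defaultdict(list)
    for box in boxes:
        by_left[box.x].append(box)
        by_top[box.y].append(box)
    vertical = [group for group in by_left.values() if len(group) > 1]
    horizontal = [group for group in by_top.values() if len(group) > 1]
    return vertical, horizontal


def related(list_a, list_b):
    """Two lists are related if they have an element in common."""
    return any(box in list_b for box in list_a)


def is_tiled(lists):
    """Every list must be related to at least one other, different list."""
    return all(
        any(other != current and related(current, other) for other in lists)
        for current in lists
    )


# A 2 x 2 grid of boxes, as in a small table of records.
grid = [Box("a", 0, 0), Box("b", 100, 0), Box("c", 0, 50), Box("d", 100, 50)]
vertical, horizontal = aligned_lists(grid)
print(is_tiled(vertical + horizontal))
```

In the grid, each column shares a box with each row, so rows and columns together form a tiled structure, while the two columns on their own do not.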

Page 82: Information Network Analysis and Extraction  Extraction  and Integration of the Semi-Structured  Web

Output: Web page annotated

Tiled List

Vertical List

Horizontal List

Page 83: Information Network Analysis and Extraction  Extraction  and Integration of the Semi-Structured  Web

HyLiEn

Page 84: Information Network Analysis and Extraction  Extraction  and Integration of the Semi-Structured  Web

HyLiEn

RESTful service: http://dmserv1.cs.illinois.edu/listextractorservice.listextractorsvc.svc/extract/xml/?url= http://cs.illinois.edu/people/faculty

61 Faculty

Tarek A.

Sarita A.

Vikram A.

…and 58 more…

Page 85: Information Network Analysis and Extraction  Extraction  and Integration of the Semi-Structured  Web

Let’s take a look at a single record

Tarek A.

Name & Link

Title

Phone

Email

Research

Page 86: Information Network Analysis and Extraction  Extraction  and Integration of the Semi-Structured  Web

Let’s take a look at ANOTHER record

Vikram A.

Name & Link

Title

Phone

Email

Research

Page 87: Information Network Analysis and Extraction  Extraction  and Integration of the Semi-Structured  Web

Visual Record Extraction

Advantages
• More accurate than DOM-methods
• Unsupervised
  » Only needs one page at a time
• Tag-agnostic
  » Doesn’t matter what the type of the HTML tag is

Disadvantages
• Precision and Recall vary
  » Depends on the Web page and assumptions of the algorithm
  » Precision not as good as tag-agnostic methods
  » Recall not as good as wrappers

Page 88: Information Network Analysis and Extraction  Extraction  and Integration of the Semi-Structured  Web

Integrating Web data

Page 89: Information Network Analysis and Extraction  Extraction  and Integration of the Semi-Structured  Web

WebTables

Cafarella et al. ‘08 – The Relational Web WebTables
• Exploring the Relational Web

In a corpus of 14B raw tables, they estimate 154M are “good” relations
› Single-table databases; Schema = attr labels + types
› Largest corpus of databases & schemas available

The WebTables system:
› Recovers good relations from crawl and enables search
› Builds novel apps on the recovered data

Page 90: Information Network Analysis and Extraction  Extraction  and Integration of the Semi-Structured  Web

Bad table

WebTables

Good table

Slide courtesy Cafarella & Halevy

Page 91: Information Network Analysis and Extraction  Extraction  and Integration of the Semi-Structured  Web

Some Challenges

Data is semi-structured:
› No schema
› Columns do not have uniform type
› Quality varies a lot
› Finding real tables is hard, as is extraction

Data is about everything.
› You can’t build a schema over everything

Page 92: Information Network Analysis and Extraction  Extraction  and Integration of the Semi-Structured  Web

Vertical Tables

Slide courtesy Cafarella & Halevy

Page 93: Information Network Analysis and Extraction  Extraction  and Integration of the Semi-Structured  Web

Winners of the Boston Marathon

Slide adapted from Cafarella & Halevy

…but that information is nowhere in the table

Page 94: Information Network Analysis and Extraction  Extraction  and Integration of the Semi-Structured  Web

Much better, but schema extraction is needed

Slide courtesy Cafarella & Halevy

Page 95: Information Network Analysis and Extraction  Extraction  and Integration of the Semi-Structured  Web

Schema Ok, but context is subtle (year = 2006)

Slide courtesy Cafarella & Halevy

Page 96: Information Network Analysis and Extraction  Extraction  and Integration of the Semi-Structured  Web

Population Table #2

Slide courtesy Cafarella & Halevy

Page 97: Information Network Analysis and Extraction  Extraction  and Integration of the Semi-Structured  Web

Asian Population Table

Slide courtesy Cafarella & Halevy

Page 98: Information Network Analysis and Extraction  Extraction  and Integration of the Semi-Structured  Web

WebTables: Exploring the Relational Web

In a corpus of 14B raw tables, Cafarella et al. estimate 154M are “good” relations
› Single-table databases; Schema = attr labels + types
› Largest database ever!

The WebTables system:
› Recovers good relations from crawl and enables search
› Builds novel apps on the recovered data

Page 99: Information Network Analysis and Extraction  Extraction  and Integration of the Semi-Structured  Web

WebTables

Raw HTML Tables Recovered Relations Relation Search

Inverted Index

Job-title, company, date 104

Make, model, year 916

Rbi, ab, h, r, bb, avg, slg 12

Dob, player, height, weight 4

… …

Attribute Correlation Statistics Db

• 2.6M distinct schemas

• 5.4M attributes

Slide courtesy Cafarella & Halevy

Page 100: Information Network Analysis and Extraction  Extraction  and Integration of the Semi-Structured  Web

Synonym Discovery

Use schema statistics to automatically compute attribute synonyms
› More complete than thesaurus

Given input “context” attribute set C:
1. A = all attrs that appear with C
2. P = all (a,b) where a ∈ A, b ∈ A, a ≠ b
3. rm all (a,b) from P where p(a,b) > 0
4. For each remaining pair (a,b) compute:
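Steps 1 to 3 can be sketched over a toy schema corpus. The schemas below are made up, "appears with C" is assumed to mean that a schema contains every attribute in C, and p(a,b) > 0 is read as "a and b occur together in at least one schema". The scoring of step 4 is not reproduced because its formula does not survive in the slide text.

```python
from itertools import combinations


def synonym_candidates(schemas, context):
    """Return attribute pairs that share the context but never co-occur."""
    context = set(context)

    # Step 1: A = attributes that appear in schemas containing the context.
    attrs = set()
    for schema in schemas:
        if context <= set(schema):
            attrs |= set(schema) - context

    # Step 2: P = all unordered pairs of distinct attributes from A.
    pairs = combinations(sorted(attrs), 2)

    # Step 3: remove pairs that appear together in some schema,
    # since true synonyms should not be used side by side in one table.
    return [
        (a, b)
        for a, b in pairs
        if not any(a in schema and b in schema for schema in schemas)
    ]


# Hypothetical schema corpus.
schemas = [
    ("name", "email", "phone"),
    ("name", "e-mail", "telephone"),
    ("name", "email", "telephone"),
    ("make", "model", "year"),
]

candidates = synonym_candidates(schemas, {"name"})
print(candidates)
```

The surviving pairs still include non-synonyms such as e-mail and phone, which is why a final scoring step over the remaining pairs is needed.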

Slide courtesy Cafarella & Halevy

Page 101: Information Network Analysis and Extraction  Extraction  and Integration of the Semi-Structured  Web

Synonym Discovery Examples

Input context: discovered synonym pairs

name: e-mail|email, phone|telephone, e-mail_address|email_address, date|last_modified

instructor: course-title|title, day|days, course|course-#, course-name|course-title

elected: candidate|name, presiding-officer|speaker

ab: k|so, h|hits, avg|ba, name|player

sqft: bath|baths, list|list-price, bed|beds, price|rent

Slide courtesy Cafarella & Halevy

Page 102: Information Network Analysis and Extraction  Extraction  and Integration of the Semi-Structured  Web

More Work on WebTables

Annotate the data in WebTables with ontology information extracted earlier

[Figure: a table with Title and Author columns (“Uncle Albert and the Quantum Quest”, Russell Stannard; “Relativity: The Special and the General Theory”, A Einstein; “Uncle Petros and the Goldbach conjecture”, A Doxiadis) is annotated against a catalog made up of a type hierarchy (Entity, Person, Physicist, Book), entities (Albert Einstein; “The Time and Space of Uncle Albert”; “Uncle Albert and the Quantum Quest”; “Relativity: The Special…”), and lemmas. The annotations are type labels, entity labels (B41, B94, B95, P22), and relation labels such as Writes(Book,Person), bornAt(Person,Place), leader(Person,Country).]

Page 103: Information Network Analysis and Extraction  Extraction  and Integration of the Semi-Structured  Web

Further Challenges

Noisy data
› A. Einstien vs Albert Einstein vs Einstien

Ambiguity of entity names
› “Michael Jordan” is both a computer scientist and an athlete

Missing type links in Ontology
› Universities in Rome -> Universities in Italy

Page 104: Information Network Analysis and Extraction  Extraction  and Integration of the Semi-Structured  Web

Outline

Preliminaries
Information Extraction
Break (30 min)
Information Integration
Web Information Networks

Page 105: Information Network Analysis and Extraction  Extraction  and Integration of the Semi-Structured  Web

Hyperlink Networks as Homogeneous Info. Networks

Page 106: Information Network Analysis and Extraction  Extraction  and Integration of the Semi-Structured  Web

Homogeneous Networks lack class

The IMDB Movie Network

Actor, Movie, Director

Movie Studio

The Facebook Network

Heterogeneous networks have type information

Page 107: Information Network Analysis and Extraction  Extraction  and Integration of the Semi-Structured  Web

Hyperlink Networks as Heterogeneous Info. Networks

Page 108: Information Network Analysis and Extraction  Extraction  and Integration of the Semi-Structured  Web

Hyperlink Networks as Heterogeneous Info. Networks

Name, Phone, Office, Age

Gender, Email

Author, Dateline

Topic, Persons, Location

Page 109: Information Network Analysis and Extraction  Extraction  and Integration of the Semi-Structured  Web

Homogeneous -> Heterogeneous Information Networks

Task – Heterogenize the Web

Classification Task with many nuances
› What are the classes?
› Class granularity?
› How do we predict the types computationally?

?

Page 110: Information Network Analysis and Extraction  Extraction  and Integration of the Semi-Structured  Web

Heterogenization

What is this thing?

ANIMAL, PERSON, PROFESSOR, FULL PROFESSOR, MAN, DATA MINER, MALE-FULL PROFESSOR-DATA MINER?

Page 111: Information Network Analysis and Extraction  Extraction  and Integration of the Semi-Structured  Web

Heterogenization

ANIMAL, PERSON, PROFESSOR, FULL PROFESSOR, MAN, DATA MINER, MALE-FULL PROFESSOR-DATA MINER?

This is the goal!

The answer is important
We use these results to do other things

HINT - The network tells us

Page 112: Information Network Analysis and Extraction  Extraction  and Integration of the Semi-Structured  Web

Extracting Typed-Information networks from the Hierarchical Web

Page 113: Information Network Analysis and Extraction  Extraction  and Integration of the Semi-Structured  Web

Web Hierarchies

An object’s location within the network indicates:
› Its class
› Its relative class

Network Hierarchy
› Networks have a hidden Hierarchy
» Note: hidden latent

If we can organize a graph according to its hierarchy:
› Information extraction becomes easier
› Topic models become more expressive
› Information retrieval models can be enhanced
› etc.

Page 114: Information Network Analysis and Extraction  Extraction  and Integration of the Semi-Structured  Web

Some Methods create/learn Taxonomies

Hierarchical LDA (hLDA) Blei et al. ’03, ’10

TopicBlock Ho et al. ’12

Hierarchical Pachinko Allocation Model (hPAM) Mimno et al. ’07

Page 115: Information Network Analysis and Extraction  Extraction  and Integration of the Semi-Structured  Web

We are interested in Hierarchies

Hierarchical Document Topic Model (HDTM) Weninger et al. ’12

Page 116: Information Network Analysis and Extraction  Extraction  and Integration of the Semi-Structured  Web

From Taxonomies to Hierarchies - Change the Stochastic Process

Major difference is that items (documents) can live at non-leaf nodes
• How is this accomplished?

Change the Stochastic Model – CRP, nCRP, SB, DSB
• Random Walk – Brownian Motion
• Especially random walks on a graph
• PageRank – Random Surfer Model
• Jump to a random node with probability
• Random Walk with Restart (RWR/PPR)
• Jump back to the starting point (root) with probability
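As a rough illustration of Random Walk with Restart, the sketch below runs power iteration on a tiny directed graph. It assumes a single restart probability, here called `gamma`, and sends walkers stuck at nodes without out-links back to the root. The graph, the function name and the parameter values are hypothetical and not taken from the HDTM implementation.

```python
def random_walk_with_restart(graph, root, gamma=0.15, iterations=100):
    """Approximate the stationary visit probabilities of a walker that
    follows a random out-link with probability 1 - gamma and jumps back
    to `root` with probability gamma."""
    score = {node: 0.0 for node in graph}
    score[root] = 1.0
    for _ in range(iterations):
        nxt = {node: 0.0 for node in graph}
        for node, mass in score.items():
            out_links = graph[node]
            if out_links:
                # Spread the non-restart mass evenly over the out-links.
                share = (1.0 - gamma) * mass / len(out_links)
                for neighbor in out_links:
                    nxt[neighbor] += share
                nxt[root] += gamma * mass
            else:
                # Dangling node: the walker can only restart.
                nxt[root] += mass
        score = nxt
    return score


# Hypothetical toy Web site: every node must appear as a key.
graph = {
    "Ill.": ["Acad.", "CS"],
    "Acad.": ["Eng."],
    "Eng.": ["CS"],
    "CS": [],
}

if __name__ == "__main__":
    for node, value in random_walk_with_restart(graph, "Ill.").items():
        print(f"{node}: {value:.3f}")
```

Unlike the random surfer of PageRank, the restart always returns to the same node, so the scores measure proximity to the root instead of global importance.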

Page 117: Information Network Analysis and Extraction  Extraction  and Integration of the Semi-Structured  Web

The Generative Story

HDTM

Page 118: Information Network Analysis and Extraction  Extraction  and Integration of the Semi-Structured  Web

Drawing paths

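$p(\text{Ill.} \to \text{Acad.} \to \text{Eng.} \to \text{CS}) = (1-\gamma_l)(1-\gamma_m)(1-\gamma_n)$, or in log form $\log(1-\gamma_l) + \log(1-\gamma_m) + \log(1-\gamma_n)$

$p(\text{Ill.} \to \text{CS}) = (1-\gamma_n)$, or in log form $\log[(1-\gamma_n)]$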

Page 119: Information Network Analysis and Extraction  Extraction  and Integration of the Semi-Structured  Web

The Generative Story

Page 120: Information Network Analysis and Extraction  Extraction  and Integration of the Semi-Structured  Web

Sample paths

Similar to standard LDA

RWR Probability

Page 121: Information Network Analysis and Extraction  Extraction  and Integration of the Semi-Structured  Web

The Generative Story

Page 122: Information Network Analysis and Extraction  Extraction  and Integration of the Semi-Structured  Web

Sample Words for a topic/document

Similar to standard LDA

RWR Probability

Page 123: Information Network Analysis and Extraction  Extraction  and Integration of the Semi-Structured  Web

Sample words

Clip of Wikipedia Graph rooted at COMPUTER SCIENCE

𝑐2𝑐1

Page 124: Information Network Analysis and Extraction  Extraction  and Integration of the Semi-Structured  Web

Example: Hierarchy inferred from Web graph

Colleges

Departments

Engineering Departments

Page 125: Information Network Analysis and Extraction  Extraction  and Integration of the Semi-Structured  Web

What does this give us?

Given a rooted graph, we find a hierarchy
› Random Walk with Restart generates parenthood probabilities

This gives us one possible hierarchy. There are many.
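One simple way to turn parenthood scores into a single hierarchy is sketched below. It greedily keeps, for every non-root page, the in-linking page with the highest score. This is a deliberate simplification for illustration, not the sampling procedure of HDTM. The scores are made-up numbers, and the greedy rule is not guaranteed to produce a tree for arbitrary scores.

```python
def pick_parents(graph, scores, root):
    """For every non-root node, keep the in-neighbor with the highest
    score as its single parent."""
    parents = {}
    for node, out_links in graph.items():
        for child in out_links:
            if child == root:
                continue  # the root has no parent
            best = parents.get(child)
            if best is None or scores[node] > scores[best]:
                parents[child] = node
    return parents


# Hypothetical link graph and hypothetical scores for illustration.
graph = {
    "Ill.": ["Acad.", "CS"],
    "Acad.": ["Eng."],
    "Eng.": ["CS"],
    "CS": [],
}
scores = {"Ill.": 0.40, "Acad.": 0.17, "Eng.": 0.14, "CS": 0.29}

if __name__ == "__main__":
    print(pick_parents(graph, scores, "Ill."))
```

Sampling a parent in proportion to the scores, instead of always taking the maximum, would yield a different hierarchy on each run, which matches the observation that many hierarchies are possible.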

New Challenge - Can’t label

X

Y <: X

Z <: Y

W <: Z

Page 126: Information Network Analysis and Extraction  Extraction  and Integration of the Semi-Structured  Web

Set of similarly typed pages

What can we say about these pages?
› Class Label/Type?
› Name?

Page 127: Information Network Analysis and Extraction  Extraction  and Integration of the Semi-Structured  Web

Exploring Link Paths

Let’s explore link-paths in a hierarchy

Hierarchy #1: People -> Faculty -> Jiawei Han -> Personal Site

Hierarchy #2: Research -> Data Mining -> Jiawei Han -> Personal Site

Page 128: Information Network Analysis and Extraction  Extraction  and Integration of the Semi-Structured  Web

Exploring Link Paths

What do these pages have in common?

Hierarchy #1: People -> Faculty

Hierarchy #2: Research -> Data Mining

Name, Phone, Office, Age, Gender, Email

Next Step

Page 129: Information Network Analysis and Extraction  Extraction  and Integration of the Semi-Structured  Web

Table/Record Attribute Extraction

Page 130: Information Network Analysis and Extraction  Extraction  and Integration of the Semi-Structured  Web

Extract database records from the Web
RESTful service: http://dmserv1.cs.illinois.edu/listextractorservice.listextractorsvc.svc/extract/xml/?url=http://cs.illinois.edu/people/faculty

61 Faculty

Tarek A.

Sarita A.

Vikram A.

…and 58 more…

Page 131: Information Network Analysis and Extraction  Extraction  and Integration of the Semi-Structured  Web

Attribute Propagation

Propagate information through the link paths

Name, Phone, Office, Fax, Research, Email

Page 132: Information Network Analysis and Extraction  Extraction  and Integration of the Semi-Structured  Web

Attribute Propagation Results

Columns match within a Web site (a single hierarchy)
› Columns do not match outside of a hierarchy

Columns cannot be labeled easily.

CalTech, Iowa St., Norfolk St., Stanford

Page 133: Information Network Analysis and Extraction  Extraction  and Integration of the Semi-Structured  Web

Link Paths for Known Item Search

Anchor texts look like queries.
› Often resemble database records too
› Let’s match Web pages to improve Web search

HT’12

Hierarchy #1: People -> Faculty -> Jiawei Han -> Personal Site

Hierarchy #2: Research -> Data Mining -> Jiawei Han -> Personal Site

#1

Page 134: Information Network Analysis and Extraction  Extraction  and Integration of the Semi-Structured  Web

Link Paths for Known Item Retrieval

Known Item Retrieval using BM25F
› Fields – Slope determines importance
» Content
» Incoming anchor text (BLP)
» Link Paths (FLP)
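To show how the three fields could be combined, here is a minimal sketch of the commonly used simple BM25F formulation. Per-field term frequencies are length-normalized, weighted, summed, and then saturated once per term. The field weights, `k1`, `b`, the two toy documents and all function names are assumptions for illustration, not the settings used in the HT’12 experiments.

```python
import math
from collections import Counter


def corpus_stats(docs, fields):
    """Average field lengths and per-term document frequencies."""
    avg_len = {
        f: sum(len(d.get(f, "").split()) for d in docs) / len(docs)
        for f in fields
    }
    doc_freq = Counter()
    for d in docs:
        terms = set()
        for f in fields:
            terms.update(d.get(f, "").lower().split())
        doc_freq.update(terms)
    return avg_len, doc_freq


def bm25f_score(query, doc, weights, avg_len, doc_freq, n_docs,
                k1=1.2, b=0.75):
    """Score one document against a query with simple BM25F."""
    score = 0.0
    for term in query.lower().split():
        pseudo_tf = 0.0
        for field, weight in weights.items():
            tokens = doc.get(field, "").lower().split()
            if not tokens:
                continue
            # Length-normalize the field, then apply its weight.
            norm = 1.0 - b + b * len(tokens) / avg_len[field]
            pseudo_tf += weight * tokens.count(term) / norm
        if pseudo_tf == 0.0:
            continue
        df = doc_freq.get(term, 0)
        idf = math.log(1.0 + (n_docs - df + 0.5) / (df + 0.5))
        score += idf * pseudo_tf / (k1 + pseudo_tf)
    return score


# Hypothetical documents with the three fields named on the slide.
docs = [
    {"content": "research in data mining and databases",
     "anchor_text": "jiawei han",
     "link_path": "people faculty jiawei han personal site"},
    {"content": "department news and events",
     "anchor_text": "news",
     "link_path": "about news"},
]
weights = {"content": 1.0, "anchor_text": 2.0, "link_path": 3.0}  # assumed

if __name__ == "__main__":
    avg_len, doc_freq = corpus_stats(docs, list(weights))
    for d in docs:
        print(bm25f_score("jiawei han faculty", d, weights,
                          avg_len, doc_freq, len(docs)))
```

Raising the weight of `link_path` lets a query that resembles a link path, such as "faculty jiawei han", reward pages reached through matching anchors even when the page content never mentions those words.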

Page 135: Information Network Analysis and Extraction  Extraction  and Integration of the Semi-Structured  Web

So what does all this tell us?

What are the other objects?

Page 136: Information Network Analysis and Extraction  Extraction  and Integration of the Semi-Structured  Web

So what does all this tell us?

What type of object is this?
People, Faculty, Data Mining, Research

Page 137: Information Network Analysis and Extraction  Extraction  and Integration of the Semi-Structured  Web

So what does all this tell us?

What attributes describe this object?

Page 138: Information Network Analysis and Extraction  Extraction  and Integration of the Semi-Structured  Web

So what does all this tell us?

How can we best search for this object?

People -> Faculty -> Jiawei Han -> Personal Site
Research -> Data Mining -> Jiawei Han -> Personal Site
…

Page 139: Information Network Analysis and Extraction  Extraction  and Integration of the Semi-Structured  Web

Graph Search

Page 140: Information Network Analysis and Extraction  Extraction  and Integration of the Semi-Structured  Web

New types of search - Web Meta-Paths

Objects are connected together via different types of relationships!
› Results from Notre Dame Network collected from the Web

“Bowyer-Viz-Flynn”
“Flynn-CCL-Thain”
“Flynn-CCL-Emrich”

Prof-Group-Prof

“CSE40151 - Bowyer-Viz-Flynn - CSE40535”
“CSE40535 - Flynn-CCL-Thain - CSE20211”
“CSE40535 - Flynn-CCL-Emrich - CSE40532”

Course-Prof-Group-Prof-Course
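A meta-path query such as Prof-Group-Prof can be sketched as a depth-first search over a typed graph that keeps only the paths whose node types match the requested sequence. The edge list below is an assumption reconstructed from the example results on this slide. The function name and data layout are hypothetical.

```python
def find_meta_paths(edges, node_type, meta_path):
    """Return every simple path whose node types match `meta_path`."""
    neighbors = {}
    for a, b in edges:  # treat links as undirected relationships
        neighbors.setdefault(a, set()).add(b)
        neighbors.setdefault(b, set()).add(a)

    def extend(path):
        if len(path) == len(meta_path):
            yield tuple(path)
            return
        wanted = meta_path[len(path)]
        for nxt in sorted(neighbors.get(path[-1], ())):
            if node_type[nxt] == wanted and nxt not in path:
                yield from extend(path + [nxt])

    results = []
    for start in sorted(node_type):
        if node_type[start] == meta_path[0]:
            results.extend(extend([start]))
    return results


# Assumed typed graph, inferred from the Notre Dame examples.
node_type = {
    "Bowyer": "Prof", "Flynn": "Prof", "Thain": "Prof", "Emrich": "Prof",
    "Viz": "Group", "CCL": "Group",
}
edges = [
    ("Bowyer", "Viz"), ("Flynn", "Viz"),
    ("Flynn", "CCL"), ("Thain", "CCL"), ("Emrich", "CCL"),
]

if __name__ == "__main__":
    for path in find_meta_paths(edges, node_type, ["Prof", "Group", "Prof"]):
        print("-".join(path))
```

Adding Course nodes with Course-Prof edges and querying `["Course", "Prof", "Group", "Prof", "Course"]` would return the longer paths in the same way. Each path is reported in both directions, so a real system would probably deduplicate them.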

Page 141: Information Network Analysis and Extraction  Extraction  and Integration of the Semi-Structured  Web

New types of search - Web Meta-Paths

Objects are connected together via different types of relationships!
› Results from Kentucky Network collected from the Web

“Seales-Viz-Jacobs”
“Seales-Viz-Yang”
“Griffioen-EDUCE-Seales”

Prof-Group-Prof

“CS636 - Jacobs-Viz-Yang - CS738”
“CS215 - Seales-Viz-Yang - CS738”
“CS485 - Griffioen-EDUCE-Seales - CS215”

Course-Prof-Group-Prof-Course

Page 142: Information Network Analysis and Extraction  Extraction  and Integration of the Semi-Structured  Web

New types of search - Web Meta-Paths

Objects are connected together via different types of relationships!
› Results from New Mexico Network collected from the Web

“Luger-AI-Lane”
“Dorian-SSL-Patrick”
“Lance-SciViz-John”

Prof-Group-Prof

“CS 341 - Dorian-SSL-Patrick - CS 442”
“CS 481 - Dorian-SSL-Patrick - CS 481”
“CS 357 - Lance-AI-Stephanie - CS 691”

Course-Prof-Group-Prof-Course

Page 143: Information Network Analysis and Extraction  Extraction  and Integration of the Semi-Structured  Web

New types of search - Web Meta-Paths

Objects are connected together via different types of relationships!
› Results from Nebraska Network collected from the Web

“Hong-ADSL-David”
“Matthew-E2-Myra”
“Myra-E2-Anita”

Prof-Group-Prof

“CS 432/832 - Hong-ADSL-David - N/A”
“CS 496/896 - Matthew-E2-Myra - CS 990”
“CS 990 - Myra-E2-Anita - CS 361”

Course-Prof-Group-Prof-Course

Page 144: Information Network Analysis and Extraction  Extraction  and Integration of the Semi-Structured  Web

New types of search - Web Meta-Paths

Objects are connected together via different types of relationships!
› Results from Illinois Network collected from the Web

“Han-DAIS-Zhai”
“Chang-DAIS-Han”
“Roth-AI-Hockenmaier”

Prof-Group-Prof

“CS412 - Han-DAIS-Zhai - CS410”
“CS512 - Chang-DAIS-Han - CS512”
“CS446 - Roth-AI-Hockenmaier - CS440”

Course-Prof-Group-Prof-Course

Page 145: Information Network Analysis and Extraction  Extraction  and Integration of the Semi-Structured  Web

Typifying the Web

What to do with a Typed Web?
› Query Processing
» Looking for people, professors, CEOs, etc.?
› New Search Techniques
» Return structured search results for unstructured query

Typed Graphs
› NINA project
» Large-scale heterogeneous information network analysis toolkit
• Graph generation, graph statistics, classification, clustering, etc.
» On GitHub - https://github.com/tweninger/nina

Page 146: Information Network Analysis and Extraction  Extraction  and Integration of the Semi-Structured  Web

Thank you