Instructor: Alan Ritter CSE 5539: Web Information Extraction

Page 1: Instructor: Alan Ritter CSE 5539: Web Information Extraction

Instructor: Alan Ritter

CSE 5539: Web Information Extraction

Page 2: Instructor: Alan Ritter CSE 5539: Web Information Extraction

Motivation

• Data Analytics / Big Data
  – Companies have lots of data lying around
  – Computing cycles are cheap
  – Using data to get insights:
    • Business, Healthcare, Science, Government, Politics
• Challenge: Most of the world’s data is unstructured
  – Text
  – Speech
  – Images

[Figure labels: Structured Data; Bigger Unstructured Data]

Page 3: Instructor: Alan Ritter CSE 5539: Web Information Extraction

Extracting Knowledge from Text

[Diagram: The Web, News -> Text Extractors -> Structured Data]

Page 8: Instructor: Alan Ritter CSE 5539: Web Information Extraction

Example: Information Extraction from Twitter

“Yess! Yess! Its official Nintendo announced today that they Will release the Nintendo 3DS in north America march 27 for $250”

PRODUCT RELEASE

COMPANY   PRODUCT  DATE      PRICE  REGION
Nintendo  3DS      March 27  $250   North America

Page 9: Instructor: Alan Ritter CSE 5539: Web Information Extraction

Example: Information Extraction from Twitter

Samsung Galaxy S5 Coming to All Major U.S. Carriers Beginning April 11th

PRODUCT RELEASE

COMPANY   PRODUCT    DATE      PRICE  REGION
Samsung   Galaxy S5  April 11  ?      U.S.
Nintendo  3DS        March 27  $250   North America

Page 10: Instructor: Alan Ritter CSE 5539: Web Information Extraction

Example: Information Extraction from Twitter

PRODUCT RELEASE

COMPANY   PRODUCT    DATE      PRICE  REGION
Samsung   Galaxy S5  April 11  ?      U.S.
Nintendo  3DS        March 27  $250   North America
…         …          …         …      …

News

Page 11: Instructor: Alan Ritter CSE 5539: Web Information Extraction

Example Applications

• Question Answering / Structured Queries
  – Which companies are releasing new smartphones / new products in Europe this Spring?
  – Alert me anytime a new smartphone is announced in the U.S.
• Data Mining
  – Analyze trends in product releases across different industries
  – Is there a correlation between price and date of release?
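Once releases are stored as structured records, a question like the first one above reduces to a filter over fields. The sketch below is illustrative only: the `ProductRelease` class and the `companies_releasing_in` helper are hypothetical names, and the two records are the rows from the PRODUCT RELEASE table shown earlier (the unknown price "?" is stored as `None`).

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class ProductRelease:
    """One row of the PRODUCT RELEASE table (hypothetical container)."""
    company: str
    product: str
    date: str
    price: Optional[str]  # None when the price was not extracted ("?")
    region: str


RELEASES = [
    ProductRelease("Samsung", "Galaxy S5", "April 11", None, "U.S."),
    ProductRelease("Nintendo", "3DS", "March 27", "$250", "North America"),
]


def companies_releasing_in(releases: List[ProductRelease], region: str) -> List[str]:
    """Structured query: which companies have a release in the given region?"""
    return sorted({r.company for r in releases if r.region == region})


print(companies_releasing_in(RELEASES, "U.S."))
```

The same records could feed the data-mining questions (for example, grouping by date or price), but only after dates and prices are normalized, which this sketch does not attempt.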

Page 12: Instructor: Alan Ritter CSE 5539: Web Information Extraction

Knowledge Graphs

Things not strings!

[Graph: nodes “CSE 5539”, “Ohio State Univ.”, “Alan Ritter”, “Columbus OH”; edge labels “Course offered at”, “Instructor”, “Located In”]

Page 13: Instructor: Alan Ritter CSE 5539: Web Information Extraction

Data Sources

Page 14: Instructor: Alan Ritter CSE 5539: Web Information Extraction

Available Data Sources

All of these databases are sparsely populated and out of date. We need to extract this type of knowledge from text!!!!

Page 16: Instructor: Alan Ritter CSE 5539: Web Information Extraction

Traditional Information Extraction

Page 23: Instructor: Alan Ritter CSE 5539: Web Information Extraction

Example Text from MUC-4 (1992) [Cowie and Wilks]

Page 24: Instructor: Alan Ritter CSE 5539: Web Information Extraction

Example Output from MUC-4 (1992) [Cowie and Wilks]

Page 28: Instructor: Alan Ritter CSE 5539: Web Information Extraction

Approaches

• Initially: Rule Based
  – Basically just write a bunch of regular expressions
• Machine Learning (Freitag 1998), (Soderland 1999), (Mooney 1999)
  – Annotate training / dev / test documents
  – Train machine learning models
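To make the rule-based approach concrete, here is a minimal sketch of one hand-written regular expression applied to the Nintendo tweet from the earlier example. The pattern and the `extract_release` helper are illustrative assumptions, not rules from any actual system; they cover exactly one phrasing.

```python
import re

TWEET = ("Yess! Yess! Its official Nintendo announced today that they Will release "
         "the Nintendo 3DS in north America march 27 for $250")

MONTHS = ("january|february|march|april|may|june|july|august|"
          "september|october|november|december")

# One hand-written rule for "<company> announced ... release the <company> <product>
# in <region> <month day> for <price>". Case is ignored because tweets are noisy.
RELEASE_RULE = re.compile(
    r"(?P<company>[A-Z]\w+) announced .*?release the (?P=company) (?P<product>\w+)"
    r" in (?P<region>[A-Za-z ]+?)"
    r" (?P<date>(?:" + MONTHS + r") \d{1,2})"
    r" for (?P<price>\$\d+)",
    re.IGNORECASE,
)


def extract_release(text):
    """Return the matched fields as a dict, or None if the rule does not fire."""
    match = RELEASE_RULE.search(text)
    return match.groupdict() if match else None


print(extract_release(TWEET))
print(extract_release("Samsung Galaxy S5 Coming to All Major U.S. Carriers "
                      "Beginning April 11th"))
```

The rule fires on the tweet but returns `None` for the Samsung headline, which phrases the same kind of event differently. Covering each new phrasing requires another rule, which is part of the motivation for the machine learning approaches in the second bullet.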

Page 29: Instructor: Alan Ritter CSE 5539: Web Information Extraction

Extraction by Sliding Window

GRAND CHALLENGES FOR MACHINE LEARNING

Jaime Carbonell School of Computer Science Carnegie Mellon University

3:30 pm 7500 Wean Hall

Machine learning has evolved from obscurity in the 1970s into a vibrant and popular discipline in artificial intelligence during the 1980s and 1990s. As a result of its success and growth, machine learning is evolving into a collection of related disciplines: inductive concept acquisition, analytic learning in problem solving (e.g. analogy, explanation-based learning), learning theory (e.g. PAC learning), genetic algorithms, connectionist learning, hybrid systems, and so on.

CMU UseNet Seminar Announcement

E.g. looking for seminar location

[Slide from William Cohen]

Page 33: Instructor: Alan Ritter CSE 5539: Web Information Extraction

A “Naïve Bayes” Sliding Window Model [Freitag 1997]

… 00 : pm Place : Wean Hall Rm 5409 Speaker : Sebastian Thrun …

w_{t-m} … w_{t-1} | w_t … w_{t+n} | w_{t+n+1} … w_{t+n+m}
      prefix      |   contents    |        suffix

Estimate Pr(LOCATION|window) using Bayes rule

Try all “reasonable” windows (vary length, position)

Assume independence for length, prefix words, suffix words, content words

Estimate from data quantities like: Pr(“Place” in prefix|LOCATION)

If P(“Wean Hall Rm 5409” = LOCATION) is above some threshold, extract it.

[Slide from William Cohen]
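The steps on this slide can be sketched in a few lines of Python. This is a toy illustration under stated assumptions, not Freitag's exact formulation: each window is scored by a sum of log-odds against a unigram background model, counts use add-one smoothing, the context sizes and maximum window length are arbitrary, and the two training announcements are invented for the example.

```python
import math
from collections import Counter

PREFIX_LEN = SUFFIX_LEN = 2  # context sizes (illustrative choice)
MAX_LEN = 5                  # longest candidate window tried


def train(examples):
    """examples: (tokens, start, end) triples; tokens[start:end] is the labeled field."""
    names = ("prefix", "content", "suffix", "length", "background")
    model = {name: Counter() for name in names}
    for tokens, start, end in examples:
        model["prefix"].update(tokens[max(0, start - PREFIX_LEN):start])
        model["content"].update(tokens[start:end])
        model["suffix"].update(tokens[end:end + SUFFIX_LEN])
        model["length"][end - start] += 1
        model["background"].update(tokens)
    return model


def log_p(counter, item, n_outcomes):
    """Add-one smoothed log-probability of item under the given counts."""
    return math.log((counter[item] + 1) / (sum(counter.values()) + n_outcomes))


def score(model, tokens, start, end):
    """Log-odds that tokens[start:end] is the field, assuming independent features."""
    vocab = len(model["background"]) + 1  # +1 reserves mass for unseen words
    total = log_p(model["length"], end - start, MAX_LEN)
    slots = (
        ("prefix", tokens[max(0, start - PREFIX_LEN):start]),
        ("content", tokens[start:end]),
        ("suffix", tokens[end:end + SUFFIX_LEN]),
    )
    for slot, words in slots:
        for word in words:
            total += log_p(model[slot], word, vocab) - log_p(model["background"], word, vocab)
    return total


def best_window(model, tokens):
    """Try every window up to MAX_LEN tokens and return the best one with its score."""
    candidates = [(s, e) for s in range(len(tokens))
                  for e in range(s + 1, min(s + MAX_LEN, len(tokens)) + 1)]
    start, end = max(candidates, key=lambda se: score(model, tokens, *se))
    return tokens[start:end], score(model, tokens, start, end)


# Invented training announcements; the location spans tokens 6..9 in both.
TRAIN = [
    ("Time : 3:30 pm Place : Wean Hall Rm 7500 Speaker : Jane Doe".split(), 6, 10),
    ("Time : 2:00 pm Place : Smith Hall Rm 100 Speaker : John Roe".split(), 6, 10),
]
model = train(TRAIN)
test_tokens = "00 : pm Place : Wean Hall Rm 5409 Speaker : Sebastian Thrun".split()
window, window_score = best_window(model, test_tokens)
print(window, window_score)
```

On the snippet from the slide, the highest-scoring window is the four tokens `Wean Hall Rm 5409`. In practice the score would be compared against a threshold tuned on development data rather than always extracting the best window.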

Page 34: Instructor: Alan Ritter CSE 5539: Web Information Extraction

“Naïve Bayes” Sliding Window Results

GRAND CHALLENGES FOR MACHINE LEARNING

Jaime Carbonell School of Computer Science Carnegie Mellon University

3:30 pm 7500 Wean Hall

Machine learning has evolved from obscurity in the 1970s into a vibrant and popular discipline in artificial intelligence during the 1980s and 1990s. As a result of its success and growth, machine learning is evolving into a collection of related disciplines: inductive concept acquisition, analytic learning in problem solving (e.g. analogy, explanation-based learning), learning theory (e.g. PAC learning), genetic algorithms, connectionist learning, hybrid systems, and so on.

Domain: CMU UseNet Seminar Announcements

Field         F1
Person Name   30%
Location      61%
Start Time    98%

[Slide from William Cohen]

Page 35: Instructor: Alan Ritter CSE 5539: Web Information Extraction

IE with Hidden Markov Models

Given a sequence of observations:

Yesterday Pedro Domingos spoke this example sentence.

and a trained HMM:

[Diagram: HMM with states “person name”, “location name”, “background”]

Find the most likely state sequence (Viterbi):

$\arg\max_{s} P(s, o)$

Any words said to be generated by the designated “person name” state are extracted as a person name:

Person name: Pedro Domingos

[Slide from William Cohen]
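A minimal Viterbi sketch for this slide follows. The three states come from the slide; every probability below is a hand-set, hypothetical number chosen only to make the example run (a real HMM would estimate them from annotated text), and unseen words get a small constant emission probability instead of a properly normalized distribution.

```python
import math

STATES = ("background", "person name", "location name")

# Hypothetical parameters for illustration only.
START = {"background": 0.8, "person name": 0.1, "location name": 0.1}
TRANS = {
    "background": {"background": 0.8, "person name": 0.1, "location name": 0.1},
    "person name": {"background": 0.4, "person name": 0.5, "location name": 0.1},
    "location name": {"background": 0.4, "person name": 0.1, "location name": 0.5},
}
EMIT = {
    "background": {"Yesterday": 0.2, "spoke": 0.2, "this": 0.2,
                   "example": 0.2, "sentence": 0.19},
    "person name": {"Pedro": 0.45, "Domingos": 0.45},
    "location name": {"Columbus": 0.9},
}
UNSEEN = 1e-3  # emission probability for words a state has never produced


def viterbi(words):
    """Return the most likely state sequence for words (log-space Viterbi)."""
    def emit(state, word):
        return math.log(EMIT[state].get(word, UNSEEN))

    best = {s: math.log(START[s]) + emit(s, words[0]) for s in STATES}
    backpointers = []
    for word in words[1:]:
        prev, best, pointers = best, {}, {}
        for s in STATES:
            # Best predecessor of state s at this position.
            p = max(STATES, key=lambda q: prev[q] + math.log(TRANS[q][s]))
            best[s] = prev[p] + math.log(TRANS[p][s]) + emit(s, word)
            pointers[s] = p
        backpointers.append(pointers)

    # Follow backpointers from the best final state.
    state = max(STATES, key=best.get)
    path = [state]
    for pointers in reversed(backpointers):
        state = pointers[state]
        path.append(state)
    return path[::-1]


def extract(words, label="person name"):
    """Join the words whose decoded state is the designated label."""
    return " ".join(w for w, s in zip(words, viterbi(words)) if s == label)


sentence = "Yesterday Pedro Domingos spoke this example sentence".split()
print(viterbi(sentence))
print(extract(sentence))
```

With these toy parameters the decoded path labels `Pedro` and `Domingos` as “person name” and everything else as “background”, so `extract` returns `Pedro Domingos`.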

Page 36: Instructor: Alan Ritter CSE 5539: Web Information Extraction

Finite State Models

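[Diagram: relationships among models]

Naïve Bayes -> (Sequence) -> HMMs -> (General Graphs) -> Generative directed models

Logistic Regression -> (Sequence) -> Linear-chain CRFs -> (General Graphs) -> General CRFs

Each model in the first row is linked to the model below it by an arrow labeled “Conditional”.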

Page 37: Instructor: Alan Ritter CSE 5539: Web Information Extraction

Various Annotated Datasets for Event / Relation Extraction

• ACE
  – Automatic Content Extraction
  – Newswire
  – Successor to MUC

Page 38: Instructor: Alan Ritter CSE 5539: Web Information Extraction

Various Annotated Datasets for Event / Relation Extraction

• GENIA
  – Medline abstracts
  – Similar extraction task in the Biomedical domain

Page 39: Instructor: Alan Ritter CSE 5539: Web Information Extraction

Schemas -> Triples

“Yess! Yess! Its official Nintendo announced today that they Will release the Nintendo 3DS in north America march 27 for $250”

PRODUCT RELEASE

COMPANY   PRODUCT  DATE      PRICE  REGION
Nintendo  3DS      March 27  $250   North America

Relation Extraction:

Manufacturer(3DS, Nintendo)
ReleaseDate(3DS, March 27)
Price(3DS, $250)
…
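The move from a schema row to binary relations can be sketched as a small mapping. Only the three relation names that appear on the slide are used; the `RELATION_FOR_FIELD` table and the `record_to_triples` helper are hypothetical names, and fields without a relation name on the slide (such as REGION) are left out rather than guessed.

```python
# Field-to-relation mapping taken from the three triples shown on the slide.
RELATION_FOR_FIELD = {
    "COMPANY": "Manufacturer",
    "DATE": "ReleaseDate",
    "PRICE": "Price",
}


def record_to_triples(record, subject_field="PRODUCT"):
    """Turn one schema row into (relation, subject, object) triples."""
    subject = record[subject_field]
    return [
        (relation, subject, record[field])
        for field, relation in RELATION_FOR_FIELD.items()
        if record.get(field)  # skip missing or empty fields
    ]


row = {
    "COMPANY": "Nintendo",
    "PRODUCT": "3DS",
    "DATE": "March 27",
    "PRICE": "$250",
    "REGION": "North America",
}
for relation, subject, obj in record_to_triples(row):
    print(f"{relation}({subject}, {obj})")
```

Each triple has the product as its first argument, so facts about the same product from different sentences can be extracted and merged independently.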

Page 40: Instructor: Alan Ritter CSE 5539: Web Information Extraction

Open Information Extraction (Banko et al. 2007)

Page 41: Instructor: Alan Ritter CSE 5539: Web Information Extraction

Demo (TextRunner)

• http://openie.allenai.org/

Page 42: Instructor: Alan Ritter CSE 5539: Web Information Extraction

Distant (weak) Supervision for Relation Extraction, e.g. [Mintz et al. 2009]

Person           Birth Location
Barack Obama     Honolulu
Mitt Romney      Detroit
Albert Einstein  Ulm
Nikola Tesla     Smiljan
…                …

“Barack Obama was born on August 4, 1961 at … in the city of Honolulu ...”
“Birth notices for Barack Obama were published in the Honolulu Advertiser…”
“Born in Honolulu, Barack Obama went on to become…”
…

(Barack Obama, Honolulu)
(Mitt Romney, Detroit)
(Albert Einstein, Ulm)
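The labeling heuristic behind this slide can be sketched as follows: any sentence that mentions both entities of a known database pair is treated as a training example for that relation. The knowledge base rows and sentences are the ones on the slide; the `distant_label` helper, the `BirthLocation` relation name, and plain substring matching (instead of real entity recognition and linking) are simplifying assumptions.

```python
KB = [
    ("Barack Obama", "Honolulu"),
    ("Mitt Romney", "Detroit"),
    ("Albert Einstein", "Ulm"),
    ("Nikola Tesla", "Smiljan"),
]

SENTENCES = [
    "Barack Obama was born on August 4, 1961 at … in the city of Honolulu ...",
    "Birth notices for Barack Obama were published in the Honolulu Advertiser…",
    "Born in Honolulu, Barack Obama went on to become…",
]


def distant_label(kb, sentences, relation="BirthLocation"):
    """Label every sentence that mentions both entities of a KB pair."""
    examples = []
    for sentence in sentences:
        for person, place in kb:
            if person in sentence and place in sentence:
                examples.append((person, place, relation, sentence))
    return examples


for person, place, relation, sentence in distant_label(KB, SENTENCES):
    print(f"{relation}({person}, {place}) <- {sentence}")
```

All three sentences are labeled as evidence for (Barack Obama, Honolulu), including the second one, which mentions both entities without actually stating a birthplace. Noisy matches of this kind are the price of getting training data without manual annotation.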

Page 43: Instructor: Alan Ritter CSE 5539: Web Information Extraction

Demo (NELL)

• http://rtw.ml.cmu.edu/rtw/kbbrowser/

Page 44: Instructor: Alan Ritter CSE 5539: Web Information Extraction

Demo (Literome)

• http://literome.azurewebsites.net/

Page 45: Instructor: Alan Ritter CSE 5539: Web Information Extraction

Knowledge Base Population Subtasks

• Entity Recognition/Classification/Linking
• Relation Extraction
• Event Extraction
• Knowledge Base Inference

Page 46: Instructor: Alan Ritter CSE 5539: Web Information Extraction

Applications

• Google knowledge graph
• Facebook graph search
• Biomedical knowledge bases
• -> Your application domain here
  – Geoscience knowledge graph?
  – Patent knowledge graph?
  – Cybersecurity knowledge graph?

Page 47: Instructor: Alan Ritter CSE 5539: Web Information Extraction

Research Groups at Other Places

Page 48: Instructor: Alan Ritter CSE 5539: Web Information Extraction

Why learn about this stuff?

Page 49: Instructor: Alan Ritter CSE 5539: Web Information Extraction

Paper Selection Form! (please fill out before next class)

https://goo.gl/AghZ1f

Page 50: Instructor: Alan Ritter CSE 5539: Web Information Extraction

Administrative Details

• Course Webpage
  – http://aritter.github.io/courses/5539_fall15.html