Instructor: Alan Ritter
CSE 5539: Web Information Extraction
Motivation
• Data Analytics / Big Data
– Companies have lots of data lying around
– Computing cycles are cheap
– Using data to get insights:
• Business, Healthcare, Science, Government, Politics
• Challenge: Most of the world’s data is Unstructured
– Text
– Speech
– Images
Structured Data
Bigger Unstructured Data
Extracting Knowledge from Text
The Web, News -> Text Extractors -> Structured Data
Example: Information Extraction from Twitter
“Yess! Yess! Its official Nintendo announced today that they Will release the Nintendo 3DS in north America
march 27 for $250”
PRODUCT RELEASE
COMPANY   PRODUCT  DATE      PRICE  REGION
Nintendo  3DS      March 27  $250   North America
Example: Information Extraction from Twitter
Samsung Galaxy S5 Coming to All Major U.S. Carriers Beginning April 11th
PRODUCT RELEASE
COMPANY   PRODUCT    DATE      PRICE  REGION
Samsung   Galaxy S5  April 11  ?      U.S.
Nintendo  3DS        March 27  $250   North America
…         …          …         …      …
News
Example Applications
• Question Answering / Structured Queries
– Which companies are releasing new products in Europe this Spring?
– Alert me anytime a new smartphone is announced in the U.S.
• Data Mining
– Analyze trends in product releases across different industries
– Is there a correlation between price and date of release?
Knowledge Graphs
Things not strings!
[Figure: knowledge graph with nodes CSE 5539, Ohio State Univ., Alan Ritter, Columbus OH and edges "Course offered at", "Instructor", "Located In"]
Data Sources
Available Data Sources
All of these databases are sparsely populated and out of date.
We need to extract this type of knowledge from text!
Traditional Information Extraction
Example Text from MUC-4 (1992) [Cowie and Wilks]
Example Output from MUC-4 (1992)
…
[Cowie and Wilks]
Approaches
• Initially: Rule Based
– Basically just write a bunch of regular expressions
• Machine Learning (Freitag 1998), (Soderland 1999), (Mooney 1999)
– Annotate training / dev / test documents
– Train machine learning models
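The rule-based approach can be illustrated with a minimal Python sketch run on the Nintendo tweet from the earlier example. The patterns and the `RULES` and `extract` names are hypothetical, written only for this illustration; they are not the rules of any actual MUC-era system.

```python
import re

TWEET = ("Yess! Yess! Its official Nintendo announced today that they Will "
         "release the Nintendo 3DS in north America march 27 for $250")

# Hypothetical hand-written rules: one regular expression per field.
MONTHS = (r"(?:january|february|march|april|may|june|july|august|"
          r"september|october|november|december)")
RULES = {
    "DATE": re.compile(rf"\b{MONTHS}\s+\d{{1,2}}(?:st|nd|rd|th)?\b", re.IGNORECASE),
    "PRICE": re.compile(r"\$\d+(?:\.\d{2})?"),
    "REGION": re.compile(r"\b(?:north america|europe|u\.s\.)", re.IGNORECASE),
}

def extract(text):
    """Return the first match of each rule, or None when a rule does not fire."""
    record = {}
    for field, pattern in RULES.items():
        match = pattern.search(text)
        record[field] = match.group(0) if match else None
    return record

record = extract(TWEET)
print(record)
```

Rules like these are quick to write but brittle: each new phrasing, product category, or date format needs another pattern, which is part of the motivation for the machine learning approaches listed above.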
Extraction by Sliding Window
GRAND CHALLENGES FOR MACHINE LEARNING
Jaime Carbonell School of Computer Science Carnegie Mellon University
3:30 pm 7500 Wean Hall
Machine learning has evolved from obscurity in the 1970s into a vibrant and popular discipline in artificial intelligence during the 1980s and 1990s. As a result of its success and growth, machine learning is evolving into a collection of related disciplines: inductive concept acquisition, analytic learning in problem solving (e.g. analogy, explanation-based learning), learning theory (e.g. PAC learning), genetic algorithms, connectionist learning, hybrid systems, and so on.
CMU UseNet Seminar Announcement
E.g. looking for seminar location
[Slide from William Cohen]
A “Naïve Bayes” Sliding Window Model [Freitag 1997]
… 00 : pm Place : Wean Hall Rm 5409 Speaker : Sebastian Thrun …
prefix: w_{t-m} … w_{t-1}
contents: w_t … w_{t+n}
suffix: w_{t+n+1} … w_{t+n+m}
If P(“Wean Hall Rm 5409” = LOCATION) is above some threshold, extract it.
Estimate Pr(LOCATION|window) using Bayes rule
Try all “reasonable” windows (vary length, position)
Assume independence for length, prefix words, suffix words, content words
Estimate from data quantities like: Pr(“Place” in prefix|LOCATION)
[Slide from William Cohen]
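A minimal sketch of the sliding-window idea, assuming a toy training set and a simplified model: every candidate window is described by its length, prefix words, content words, and suffix words, and a two-class Naïve Bayes model (LOCATION window vs. any other window) scores it. The three training announcements, the constants, and all function names are invented for illustration, and all features share one smoothed distribution per class. This is a simplification, not Freitag's exact formulation.

```python
import math
from collections import Counter

PREFIX_LEN = SUFFIX_LEN = 2   # context size m (assumed value)
MAX_LEN = 4                   # longest candidate window considered (assumed)

def features(tokens, start, end):
    """Features of the window tokens[start:end]: length, prefix, contents, suffix."""
    feats = [("len", end - start)]
    feats += [("pre", w.lower()) for w in tokens[max(0, start - PREFIX_LEN):start]]
    feats += [("in", w.lower()) for w in tokens[start:end]]
    feats += [("suf", w.lower()) for w in tokens[end:end + SUFFIX_LEN]]
    return feats

def windows(tokens):
    """All candidate windows of up to MAX_LEN tokens (vary length and position)."""
    for start in range(len(tokens)):
        for end in range(start + 1, min(start + MAX_LEN, len(tokens)) + 1):
            yield start, end

def train(examples):
    """examples: list of (tokens, (start, end)) giving the gold LOCATION span."""
    counts = {True: Counter(), False: Counter()}
    n_windows = Counter()
    for tokens, gold in examples:
        for span in windows(tokens):
            label = span == gold
            n_windows[label] += 1
            counts[label].update(features(tokens, *span))
    vocab = set(counts[True]) | set(counts[False])
    return counts, n_windows, vocab

def posterior(model, tokens, start, end):
    """Pr(LOCATION | window) via Bayes rule, with add-one smoothing."""
    counts, n_windows, vocab = model
    log_joint = {}
    for label in (True, False):
        total = sum(counts[label].values()) + len(vocab) + 1
        score = math.log(n_windows[label] / sum(n_windows.values()))
        for feat in features(tokens, start, end):
            score += math.log((counts[label][feat] + 1) / total)
        log_joint[label] = score
    top = max(log_joint.values())
    norm = sum(math.exp(v - top) for v in log_joint.values())
    return math.exp(log_joint[True] - top) / norm

# Toy training data, invented for this sketch (loosely modelled on the slide).
TRAIN = [
    ("Time : 3:30 pm Place : Wean Hall Rm 5409 Speaker : Sebastian Thrun".split(), (6, 10)),
    ("Place : Doherty Hall 2315 Time : 4:00 pm".split(), (2, 5)),
    ("Speaker : Jaime Carbonell Place : Wean Hall 7500 Time : 3:30 pm".split(), (6, 9)),
]
model = train(TRAIN)

# Score every window of an unseen (also invented) announcement; keep the best.
new_doc = "Time : 2:00 pm Place : Wean Hall 4623 Speaker : Tom Mitchell".split()
best = max(windows(new_doc), key=lambda span: posterior(model, new_doc, *span))
print(" ".join(new_doc[best[0]:best[1]]), posterior(model, new_doc, *best))
```

In a fuller system the best window would be extracted only if its posterior clears a threshold, as the slide describes, and the length, prefix, contents, and suffix distributions would each be estimated separately.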
“Naïve Bayes” Sliding Window Results
Domain: CMU UseNet Seminar Announcements
Field        F1
Person Name  30%
Location     61%
Start Time   98%
[Slide from William Cohen]
IE with Hidden Markov Models
Given a sequence of observations:
Yesterday Pedro Domingos spoke this example sentence.
and a trained HMM with states: person name, location name, background
Find the most likely state sequence (Viterbi): $\arg\max_{s} P(s, o)$
Any words said to be generated by the designated “person name” state are extracted as a person name:
Person name: Pedro Domingos
[Slide from William Cohen]
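The decoding step can be sketched in Python as below. The start, transition, and emission probabilities are hand-set toy values chosen for this example (a real HMM would be trained from annotated data), and the `UNSEEN` back-off constant is likewise an assumption.

```python
import math

STATES = ["background", "person name", "location name"]

# Hand-set toy parameters (assumptions, not trained values).
START = {"background": 0.8, "person name": 0.1, "location name": 0.1}
TRANS = {
    "background":    {"background": 0.8, "person name": 0.1, "location name": 0.1},
    "person name":   {"background": 0.4, "person name": 0.5, "location name": 0.1},
    "location name": {"background": 0.4, "person name": 0.1, "location name": 0.5},
}
EMIT = {
    "background":    {"yesterday": 0.1, "spoke": 0.1, "this": 0.1,
                      "example": 0.1, "sentence": 0.1},
    "person name":   {"pedro": 0.3, "domingos": 0.3},
    "location name": {"columbus": 0.3},
}
UNSEEN = 1e-4  # probability given to any word missing from a state's table

def viterbi(words):
    """Return the state sequence s maximizing P(s, o) for observations `words`."""
    def emit(state, word):
        return math.log(EMIT[state].get(word.lower().strip("."), UNSEEN))

    # best[t][s] = (log-probability of the best path ending in s at t, backpointer)
    best = [{s: (math.log(START[s]) + emit(s, words[0]), None) for s in STATES}]
    for word in words[1:]:
        column = {}
        for s in STATES:
            prev = max(STATES, key=lambda p: best[-1][p][0] + math.log(TRANS[p][s]))
            score = best[-1][prev][0] + math.log(TRANS[prev][s]) + emit(s, word)
            column[s] = (score, prev)
        best.append(column)

    # Follow backpointers from the best final state.
    state = max(STATES, key=lambda s: best[-1][s][0])
    path = [state]
    for column in reversed(best[1:]):
        state = column[state][1]
        path.append(state)
    return path[::-1]

words = "Yesterday Pedro Domingos spoke this example sentence.".split()
path = viterbi(words)
person = " ".join(w for w, s in zip(words, path) if s == "person name")
print(person)
```

Working in log space avoids numerical underflow on longer sequences; the dynamic program itself is the standard Viterbi recursion.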
Finite State Models
[Figure: relationships among models. Generative directed models: Naïve Bayes, HMMs (sequence), general directed models (general graphs). Conditional counterparts: Logistic Regression, Linear-chain CRFs (sequence), General CRFs (general graphs).]
Various Annotated Datasets for Event / Relation Extraction
• ACE
– Automatic Content Extraction
– Newswire
– Successor to MUC
• GENIA
– Medline abstracts
– Similar extraction task in the Biomedical domain
Schemas -> Triples
“Yess! Yess! Its official Nintendo announced today that they Will release the Nintendo 3DS in north America march 27 for $250”
PRODUCT RELEASE
COMPANY   PRODUCT  DATE      PRICE  REGION
Nintendo  3DS      March 27  $250   North America
Relation Extraction
Manufacturer(3DS, Nintendo)
ReleaseDate(3DS, March 27)
Price(3DS, $250)
…
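One way to picture the step from a schema record to triples is the short Python sketch below. The `ProductRelease` type and `to_triples` function are hypothetical names introduced here; only the three relations named on the slide are emitted, and the relations the slide elides are left out.

```python
from typing import NamedTuple, Optional

class ProductRelease(NamedTuple):
    """One row of the PRODUCT RELEASE schema; None marks an unknown slot."""
    company: str
    product: str
    date: Optional[str]
    price: Optional[str]
    region: Optional[str]

def to_triples(release):
    """Decompose one schema record into binary relations keyed on the product."""
    candidates = [
        ("Manufacturer", release.product, release.company),
        ("ReleaseDate", release.product, release.date),
        ("Price", release.product, release.price),
    ]
    # Skip relations whose value is unknown (the "?" cells in the table).
    return [triple for triple in candidates if triple[2] is not None]

nintendo = ProductRelease("Nintendo", "3DS", "March 27", "$250", "North America")
samsung = ProductRelease("Samsung", "Galaxy S5", "April 11", None, "U.S.")

for relation, subject, value in to_triples(nintendo) + to_triples(samsung):
    print(f"{relation}({subject}, {value})")
```

Because each triple stands on its own, a partially filled record such as the Samsung row still contributes the facts that are known.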
Open Information Extraction (Banko et al. 2007)
Demo (TextRunner)
• http://openie.allenai.org/
Distant (weak) Supervision for Relation Extraction, e.g. [Mintz et al. 2009]
Person           Birth Location
Barack Obama     Honolulu
Mitt Romney      Detroit
Albert Einstein  Ulm
Nikola Tesla     Smiljan
…                …
“Barack Obama was born on August 4, 1961 at … in the city of Honolulu ...”
“Birth notices for Barack Obama were published in the Honolulu Advertiser…”
“Born in Honolulu, Barack Obama went on to become…”
…
(Barack Obama, Honolulu)
(Mitt Romney, Detroit)
(Albert Einstein, Ulm)
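The labeling heuristic behind distant supervision can be sketched as follows: any sentence that mentions both entities of a known (person, birth location) pair is treated as a training example for that relation. The `BirthLocation` label, the function name, and the last sentence (an invented non-matching example) are assumptions for illustration; this covers only the labeling step, not the classifier that Mintz et al. go on to train.

```python
# Known facts, taken from the table on the slide.
KB = {
    ("Barack Obama", "Honolulu"),
    ("Mitt Romney", "Detroit"),
    ("Albert Einstein", "Ulm"),
}

SENTENCES = [
    "Barack Obama was born on August 4, 1961 at … in the city of Honolulu ...",
    "Birth notices for Barack Obama were published in the Honolulu Advertiser…",
    "Born in Honolulu, Barack Obama went on to become…",
    "Mitt Romney gave a speech in Columbus.",  # invented: no KB pair matches
]

def distant_label(sentences, kb, relation="BirthLocation"):
    """Label every sentence that mentions both entities of some KB pair."""
    examples = []
    for sentence in sentences:
        for person, place in sorted(kb):
            if person in sentence and place in sentence:
                examples.append((sentence, person, place, relation))
    return examples

examples = distant_label(SENTENCES, KB)
for sentence, person, place, relation in examples:
    print(f"{relation}({person}, {place}) <- {sentence}")
```

The second sentence is matched even though "Honolulu Advertiser" names a newspaper rather than stating a birthplace, which shows why this kind of supervision is called weak: the automatically produced labels are noisy.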
Demo (NELL)
• http://rtw.ml.cmu.edu/rtw/kbbrowser/
Demo (Literome)
• http://literome.azurewebsites.net/
Knowledge Base Population Subtasks
• Entity Recognition/Classification/Linking
• Relation Extraction
• Event Extraction
• Knowledge Base Inference
Applications
• Google knowledge graph
• Facebook graph search
• Biomedical knowledge bases
• -> Your application domain here
– Geoscience knowledge graph?
– Patent knowledge graph?
– Cybersecurity knowledge graph?
Research Groups at Other Places
Why learn about this stuff?
Paper Selection Form! (please fill out before next class)
https://goo.gl/AghZ1f
Administrative Details
• Course Webpage
– http://aritter.github.io/courses/5539_fall15.html