Topics of today
IE: Information Extraction Techniques
Wrapper Induction, Sliding Windows, From FST to HMM
Extracting Job Openings from the Web
foodscience.com-Job2
JobTitle: Ice Cream Guru
Employer: foodscience.com
JobCategory: Travel/Hospitality
JobFunction: Food Services
JobLocation: Upper Midwest
Contact Phone: 800-488-2611
DateExtracted: January 8, 2001
Source: www.foodscience.com/jobs_midwest.html
OtherCompanyJobs: foodscience.com-Job1
Two ways to manage information
[Figure: documents flow into a base of extracted facts that supports two modes of use.]
Retrieval: query → answer over extracted facts, e.g. advisor(wc,vc), advisor(yh,tm), affil(wc,mld), affil(vc,lti), fn(wc,“William”), fn(vc,“Vitor”)
Inference: query → answer by reasoning over extracted facts, e.g. X: advisor(wc,Y) & affil(X,lti) ? {X=em; X=vc}
What is Information Extraction?
Recovering structured data from formatted text:
Identifying fields (e.g. named entity recognition)
Understanding relations between fields (e.g. record association)
Normalization and deduplication
Today: focus mostly on field identification & a little on record association
IE from Chinese Documents regarding Weather
Chinese Academy of Sciences
200k+ documents, several millennia old
Qing Dynasty Archives: memos, newspaper articles, diaries
“Wrappers”
If we think of things from the database point of view:
We want to be able to run database-style queries, but we have data in some horrid textual form/content management system that doesn’t allow such querying.
We need to “wrap” the data in a component that understands database-style querying.
Hence the term “wrappers”.
Title: Schulz and Peanuts: A Biography Author: David Michaelis List Price: $34.95
Wrappers: Simple Extraction Patterns
Specify an item to extract for a slot using a regular expression pattern, e.g. a price pattern: “\b\$\d+(\.\d{2})?\b”
May require a preceding (pre-filler) pattern and a succeeding (post-filler) pattern to delimit the filler. Amazon list price:
Pre-filler pattern: “<b>List Price:</b> <span class=listprice>”
Filler pattern: “\b\$\d+(\.\d{2})?\b”
Post-filler pattern: “</span>”
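To make this concrete, here is a minimal runnable sketch of pre-filler/filler/post-filler matching in Python, using the Amazon example above. One correction relative to the slide’s pattern: the leading \b is dropped, since a word boundary cannot occur between a space and “$” (both non-word characters), so the original filler pattern would never match after “List Price: ”.

```python
import re

# Pre-filler, filler and post-filler patterns from the example above.
# (The leading \b of the slide's price pattern is dropped: \b cannot
# match between a space and "$", which are both non-word characters.)
PRE    = r"<b>List Price:</b>\s*<span class=listprice>"
FILLER = r"\$\d+(?:\.\d{2})?\b"
POST   = r"</span>"

def extract_list_price(html):
    """Return the list price if the pre/filler/post pattern matches."""
    m = re.search(PRE + r"\s*(" + FILLER + r")\s*" + POST, html)
    return m.group(1) if m else None

print(extract_list_price(
    "<b>List Price:</b> <span class=listprice>$34.95</span>"))
# -> $34.95
```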
Wrapper tool-kits
Wrapper toolkits: specialized programming environments for writing & debugging wrappers by hand
Some resources: wrapper development tools such as LAPIS
Wrapper Induction
Problem description:
Task: learn extraction rules based on labeled examples
Hand-writing rules is tedious, error-prone, and time-consuming
Learning wrappers is “wrapper induction”
Rule induction: formal rules are extracted from a set of observations. The rules extracted may represent a full scientific model of the data, or merely represent local patterns in the data.
Input: labeled examples (training & testing data), admissible rules (the hypothesis space), a search strategy
Desired output: a rule that performs well both on training and testing data
Wrapper induction
Highly regular source documents
→ relatively simple extraction patterns
→ efficient learning algorithms
Build a training set of documents paired with human-produced filled extraction templates.
Learn extraction patterns for each slot using an appropriate machine learning algorithm.
Goal: learn from a human teacher how to extract certain database records from a particular web site.
Kushmerick’s WIEN system
Earliest wrapper-learning system (published at IJCAI ’97)
Special things about WIEN:
Treats the document as a string of characters
Learns to extract a relation directly, rather than extracting fields and then associating them together in some way
A training example is a completely labeled page
A wrapper is a set of delimiter strings l1, r1, …, lK, rK: a left and a right delimiter for each of the K fields
Example: find 4 strings <B>, </B>, <I>, </I> to serve as l1, r1, l2, r2
Labeled pages → wrapper:
<HTML><HEAD>Some Country Codes</HEAD><B>Congo</B> <I>242</I><BR><B>Egypt</B> <I>20</I><BR><B>Belize</B> <I>501</I><BR><B>Spain</B> <I>34</I><BR></BODY></HTML>
Learning LR wrappers
<HTML><TITLE>Some Country Codes</TITLE><B>Congo</B> <I>242</I><BR><B>Egypt</B> <I>20</I><BR><B>Belize</B> <I>501</I><BR><B>Spain</B> <I>34</I><BR></BODY></HTML>
Finding r1: r1 can be any prefix of the text that follows each first-field value, e.g. </B>
Finding l1, l2 and r2:
r2 can be any prefix, e.g. </I>
l2 can be any suffix, e.g. <I>
l1 can be any suffix, e.g. <B>
WIEN system
Assumes items are always in a fixed, known order:
… Name: J. Doe; Address: 1 Main; Phone: 111-1111.
<p> Name: E. Poe; Address: 10 Pico; Phone: 777-1111.
<p> …
Introduces several types of wrappers; LR is the simplest
Learning LR extraction rules
Admissible rules: prefixes & suffixes of the items of interest
Search strategy: start with the shortest prefix & suffix, and expand until correct
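A minimal sketch of how an LR wrapper extracts a relation once its delimiter strings l1, r1, …, lK, rK are known. This illustrates the idea only; Kushmerick’s actual system also handles page heads and tails (the HLRT wrapper class).

```python
def lr_extract(page, delimiters):
    """Extract tuples with an LR wrapper.

    delimiters: [(l1, r1), ..., (lK, rK)] -- left/right strings per field.
    Scans left to right, one record at a time, as in WIEN's LR class.
    """
    pos = 0
    while True:
        record = []
        for left, right in delimiters:
            start = page.find(left, pos)
            if start == -1:
                return                     # no more records
            start += len(left)
            end = page.find(right, start)
            if end == -1:
                return
            record.append(page[start:end])
            pos = end + len(right)
        yield tuple(record)

page = ("<HTML><HEAD>Some Country Codes</HEAD>"
        "<B>Congo</B> <I>242</I><BR>"
        "<B>Egypt</B> <I>20</I><BR></BODY></HTML>")
print(list(lr_extract(page, [("<B>", "</B>"), ("<I>", "</I>")])))
# -> [('Congo', '242'), ('Egypt', '20')]
```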
Summary of WIEN
Advantages: fast to learn & extract
Drawbacks: cannot handle permutations and missing items; must label the entire page; requires a large number of examples
Extraction by Sliding Window
GRAND CHALLENGES FOR MACHINE LEARNING
Jaime Carbonell School of Computer Science Carnegie Mellon University
3:30 pm 7500 Wean Hall
Machine learning has evolved from obscurity in the 1970s into a vibrant and popular discipline in artificial intelligence during the 1980s and 1990s. As a result of its success and growth, machine learning is evolving into a collection of related disciplines: inductive concept acquisition, analytic learning in problem solving (e.g. analogy, explanation-based learning), learning theory (e.g. PAC learning), genetic algorithms, connectionist learning, hybrid systems, and so on.
CMU UseNet Seminar Announcement
E.g. looking for the seminar location
A “Naïve Bayes” Sliding Window Model
[Freitag 1997]
[Figure: a window over the token stream “… 00 : pm Place : Wean Hall Rm 5409 Speaker : Sebastian Thrun …”: prefix w_{t−m} … w_{t−1}, contents w_t … w_{t+n}, suffix w_{t+n+1} … w_{t+n+m}]
If P(“Wean Hall Rm 5409” = LOCATION) is above some threshold, extract it.
Estimate Pr(LOCATION|window) using Bayes rule
Try all “reasonable” windows (vary length, position)
Assume independence for length, prefix words, suffix words, content words
Estimate from data quantities like: Pr(“Place” in prefix|LOCATION)
A “Naïve Bayes” Sliding Window Model
1. Create a dataset of examples like these:
+ (prefix00, …, prefixColon, contentWean, contentHall, …, suffixSpeaker, …)
− (prefixColon, …, prefixWean, contentHall, …, contentSpeaker, suffixColon, …)
2. Train a Naive Bayes classifier.
3. If Pr(class=+ | prefix, contents, suffix) > threshold, predict that the content window is a location.
To think about: what if the extracted entities aren’t consistent, e.g. if the location overlaps with the speaker? (See the sketch below.)
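Below is a compact sketch of this three-step pipeline, not Freitag’s exact model: each candidate window becomes a bag of positional features (the prefix=/content=/suffix= encoding is our own), a Naive Bayes classifier with add-one smoothing scores it, and extraction tries all reasonable (start, length) windows.

```python
import math
from collections import defaultdict

def window_features(tokens, start, length, ctx=2):
    """Encode a candidate window as positional bag-of-words features."""
    feats = ["prefix=" + w for w in tokens[max(0, start - ctx):start]]
    feats += ["content=" + w for w in tokens[start:start + length]]
    feats += ["suffix=" + w for w in tokens[start + length:start + length + ctx]]
    feats.append("length=%d" % length)
    return feats

class NaiveBayes:
    def fit(self, examples):                   # examples: [(features, +1/-1)]
        self.counts = {+1: defaultdict(int), -1: defaultdict(int)}
        self.totals = {+1: 0, -1: 0}
        for feats, y in examples:
            self.totals[y] += 1
            for f in feats:
                self.counts[y][f] += 1
        return self

    def log_odds(self, feats):
        """log P(+|feats) - log P(-|feats), with add-one smoothing."""
        score = math.log(self.totals[+1] + 1.0) - math.log(self.totals[-1] + 1.0)
        for f in feats:
            score += math.log((self.counts[+1][f] + 1.0) / (self.totals[+1] + 2.0))
            score -= math.log((self.counts[-1][f] + 1.0) / (self.totals[-1] + 2.0))
        return score

def extract(tokens, model, max_len=4, threshold=0.0):
    """Try all 'reasonable' windows; keep those whose score clears the bar."""
    for start in range(len(tokens)):
        for length in range(1, max_len + 1):
            if model.log_odds(window_features(tokens, start, length)) > threshold:
                yield start, tokens[start:start + length]

seminar = "00 : pm Place : Wean Hall Rm 5409 Speaker :".split()
model = NaiveBayes().fit([
    (window_features(seminar, 5, 4), +1),   # "Wean Hall Rm 5409" is a LOCATION
    (window_features(seminar, 3, 2), -1),   # "Place :" is not
])
for start, span in extract("Talk in Wean Hall Rm 5409 at noon".split(), model):
    print(start, span)                      # several overlapping candidates fire
```

Resolving overlapping extractions (the “to think about” question above) would need a post-processing step, e.g. keeping only the highest-scoring of any pair of overlapping windows.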
“Naïve Bayes” Sliding Window Results
GRAND CHALLENGES FOR MACHINE LEARNING
Jaime Carbonell School of Computer Science Carnegie Mellon University
3:30 pm 7500 Wean Hall
Machine learning has evolved from obscurity in the 1970s into a vibrant and popular discipline in artificial intelligence during the 1980s and 1990s. As a result of its success and growth, machine learning is evolving into a collection of related disciplines: inductive concept acquisition, analytic learning in problem solving (e.g. analogy, explanation-based learning), learning theory (e.g. PAC learning), genetic algorithms, connectionist learning, hybrid systems, and so on.
Domain: CMU UseNet Seminar Announcements
Field F1: Person Name 30%, Location 61%, Start Time 98%
Finite State Transducers for IE
Basic method for extracting relevant information
IE systems generally use a collection of specialized FSTs
Company name detection, person name detection, relationship detection
Finite State Transducers for IE
Frodo Baggins works for Hobbit Factory, Inc.
Text Analyzer:
Frodo – Proper Name
Baggins – Proper Name
works – Verb
for – Prep
Hobbit – UnknownCap
Factory – NounCap
Inc – CompAbbr
Finite State Transducers for IE
Frodo Baggins works for Hobbit Factory, Inc.
A regular expression for finding company names: “some capitalized words, maybe a comma, then a company abbreviation indicator”:
CompanyName = (ProperName | SomeCap)+ Comma? CompAbbr
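One way to run such a pattern is to encode the token tags as a string and reuse an ordinary regex engine; a minimal sketch (the one-letter tag codes are our own convention):

```python
import re

# Encode each token's tag as one character, then run the company-name
# pattern "(ProperName | SomeCap)+ Comma? CompAbbr" as a regex over the
# tag string. UnknownCap and NounCap are folded into the "capitalized
# word" class, matching the text-analyzer output above.
TAG_CODE = {"ProperName": "P", "SomeCap": "C", "UnknownCap": "C",
            "NounCap": "C", "Comma": ",", "CompAbbr": "A",
            "Verb": "w", "Prep": "w"}

tokens = [("Frodo", "ProperName"), ("Baggins", "ProperName"),
          ("works", "Verb"), ("for", "Prep"), ("Hobbit", "UnknownCap"),
          ("Factory", "NounCap"), (",", "Comma"), ("Inc", "CompAbbr")]

tags = "".join(TAG_CODE[t] for _, t in tokens)        # "PPwwCC,A"
for m in re.finditer(r"[PC]+,?A", tags):              # 1 char per token
    print(" ".join(w for w, _ in tokens[m.start():m.end()]))
# -> Hobbit Factory , Inc
```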
Finite State Transducers for IE
Frodo Baggins works for Hobbit Factory, Inc.
[Figure: Company Name Detection FSA with states 1–4. Roughly: state 1 loops on word; (CAP | PN) moves 1→2; state 2 loops on (CAP | PN); comma moves 2→3; CAB moves 2→4 and 3→4.]
CAP = SomeCap, CAB = CompAbbr, PN = ProperName, ε = empty string
Company Name Detection FSA
Finite State Transducers for IE
Frodo Baggins works for Hobbit Factory, Inc.
[Figure: the same machine as a Company Name Detection FST: transitions now also emit output (e.g. word:word), and the CAB transitions emit CN, marking the matched span as a company name.]
CAP = SomeCap, CAB = CompAbbr, PN = ProperName, ε = empty string, CN = CompanyName
Company Name Detection FST
Non-deterministic! A capitalized word may be ordinary text or the start of a company name, so the machine must explore both paths.
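A minimal sketch of simulating the non-deterministic machine by carrying a set of active states; the state numbers and transitions below are our reading of the diagram, not an exact reproduction:

```python
# State 1 = background, state 4 = accepting (company name found).
# (1, "CAP") -> {1, 2} encodes the non-determinism: a capitalized word
# may stay background text or start a company name.
TRANS = {
    (1, "word"): {1}, (1, "CAP"): {1, 2}, (1, "PN"): {1, 2},
    (2, "CAP"): {2},  (2, "PN"): {2},
    (2, "comma"): {3}, (2, "CAB"): {4}, (3, "CAB"): {4},
}

def accepts(tags):
    states = {1}
    for tag in tags:
        states = set().union(*(TRANS.get((s, tag), set()) for s in states))
        if not states:
            return False
    return 4 in states

print(accepts(["CAP", "CAP", "comma", "CAB"]))  # True:  "Hobbit Factory , Inc"
print(accepts(["PN", "PN", "word"]))            # False: "Frodo Baggins works"
```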
Finite State Transducers for IE
Several FSTs or a more complex FST can be used to find one type of information (e.g. company names)
FSTs are often compiled from regular expressions
Probabilistic (weighted) FSTs
Finite State Transducers for IE
FSTs mean different things to different researchers in IE: based on lexical items (words), on statistical language models, or on deep syntactic/semantic analysis.
Example: FASTUS
Finite State Automaton Text Understanding System (SRI International)
Cascading FSTs: recognize names; recognize noun groups, verb groups, etc.; construct complex noun/verb groups; identify patterns of interest; identify and merge event structures
Hidden Markov Models formalism
HMM = states s1, s2, … (special start state s1, special end state sn); token alphabet a1, a2, …; state transition probabilities P(si|sj); token emission probabilities P(ai|sj)
Widely used in many language processing tasks, e.g. speech recognition [Lee, 1989], POS tagging [Kupiec, 1992], topic detection [Yamron et al., 1998]
HMM = probabilistic FSA
Applying HMMs to IE
Document generated by a stochastic process modelled by an HMM
Token = word; state = the “reason/explanation” for a given token:
a ‘Background’ state emits tokens like ‘the’, ‘said’, …
a ‘Money’ state emits tokens like ‘million’, ‘euro’, …
an ‘Organization’ state emits tokens like ‘university’, ‘company’, …
Extraction: via the Viterbi algorithm, a dynamic programming technique for efficiently computing the most likely sequence of states that generated a document.
HMM for research papers: emissions [Seymore et al., 99]
Trained on 2 million words of BibTeX data from the Web
[Figure: states such as author, title, institution and note, with example emissions:
title: stochastic optimization… reinforcement learning… model building mobile robot…
institution: carnegie mellon university… university of california… dartmouth college…
note: ICML 1997… submission to… to appear in… supported in part… copyright…]
What is an HMM?
Graphical model representation: variables by time; circles indicate states; arrows indicate probabilistic dependencies between states
Green circles are hidden states, each dependent only on the previous state (Markov process): “The past is independent of the future given the present.”
HMM Formalism
An HMM is a tuple {S, K, Π, A, B}:
S: {s1…sN} are the values for the hidden states
K: {k1…kM} are the values for the observations
Π = {πi} are the initial state probabilities
A = {aij} are the state transition probabilities
B = {bik} are the observation (emission) probabilities
[Figure: the trellis of hidden states S emitting observations K, with A labeling state→state arrows and B labeling state→observation arrows]
Need to provide the structure of the HMM & the vocabulary
Training the model: Baum-Welch algorithm
Efficient dynamic programming algorithms exist for:
finding Pr(K) (see the forward-algorithm sketch below)
finding the highest-probability path S that maximizes Pr(K,S) (Viterbi)
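As an illustration of the first of these, here is a minimal sketch of the forward algorithm for Pr(K) under the {S, K, Π, A, B} formalism above; pi is a list of initial probabilities, A a transition matrix, and B[j] a dict mapping an observation to its emission probability:

```python
def forward(obs, pi, A, B):
    """Pr(obs) under the HMM: alpha[j] = Pr(o_1..o_t, state_t = j)."""
    n = len(pi)
    alpha = [pi[j] * B[j][obs[0]] for j in range(n)]
    for o in obs[1:]:
        alpha = [B[j][o] * sum(alpha[i] * A[i][j] for i in range(n))
                 for j in range(n)]
    return sum(alpha)   # Pr(K): sum over all possible final states

pi = [1.0, 0.0]
A  = [[0.7, 0.3], [0.4, 0.6]]
B  = [{"a": 0.9, "b": 0.1}, {"a": 0.2, "b": 0.8}]
print(forward("aab", pi, A, B))   # probability of observing "a", "a", "b"
```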
[Figure: an example citation HMM with states Title, Author, Journal, Year.
Transition probabilities on the arrows include 0.9, 0.8, 0.5, 0.5, 0.2, 0.1.
Emission probabilities per state include distributions such as {A 0.6, B 0.3, C 0.1}, {X 0.4, B 0.2, Z 0.4}, {Y 0.1, A 0.1, C 0.8}, and for Year {dddd 0.8, dd 0.2}.]
Using the HMM to segment
Find the highest-probability path through the HMM.
Viterbi: a quadratic dynamic programming algorithm.
[Figure: trellis over the address “115 Grant street Mumbai 400070”, with candidate states House, Road, City, Pin at each token position o_t; segmentation picks the best state sequence for the tokens.]
Most Likely Path for a Given Sequence
The probability that the path $\pi_0 \ldots \pi_N$ is taken and the sequence $x_1 \ldots x_L$ is generated:
$\Pr(x_1 \ldots x_L, \pi_0 \ldots \pi_N) = a_{\pi_0 \pi_1} \prod_{i=1}^{L} b_{\pi_i}(x_i)\, a_{\pi_i \pi_{i+1}}$
where the $a$'s are the transition probabilities and the $b$'s are the emission probabilities.
Example
[Figure: a small HMM over the DNA alphabet {A, C, G, T} with begin state 0, end state 5, and hidden states 1–4, each with its own emission table (e.g. A 0.4, C 0.1, G 0.1, T 0.4) and with transition probabilities 0.5, 0.5, 0.2, 0.8, 0.4, 0.6, 0.1, 0.9, 0.2, 0.8 on the arrows.]
$\Pr(\mathrm{AAC}, \pi = 0,1,1,3,5) = a_{01}\, b_1(\mathrm{A})\, a_{11}\, b_1(\mathrm{A})\, a_{13}\, b_3(\mathrm{C})\, a_{35} = 0.5 \times 0.4 \times 0.2 \times 0.4 \times 0.8 \times 0.3 \times 0.6 \approx 0.0023$
Finding the most probable path
Find the state sequence $X$ that best explains the observations $O = o_1 \ldots o_T$:
Viterbi algorithm (1967): $\hat{X} = \arg\max_X P(X \mid O)$
Viterbi Algorithm
$\delta_j(t) = \max_{x_1 \ldots x_{t-1}} P(x_1 \ldots x_{t-1}, o_1 \ldots o_{t-1}, x_t = j, o_t)$
The probability of the state sequence which maximizes the probability of seeing the observations to time t−1, landing in state j, and seeing the observation at time t.
Recursive computation:
$\delta_j(t+1) = \max_i \delta_i(t)\, a_{ij}\, b_{j o_{t+1}}$
$\psi_j(t+1) = \arg\max_i \delta_i(t)\, a_{ij}\, b_{j o_{t+1}}$
Viterbi: Dynamic Programming
[Figure: the same address trellis (“No 115 Grant street Mumbai 400070”, states House, Road, City, Pin), with each cell filled via $\delta_j(t+1) = \max_i \delta_i(t)\, a_{ij}\, b_{j o_{t+1}}$]
Backtracking: compute the most likely state sequence by working backwards:
$\hat{X}_T = \arg\max_i \delta_i(T)$
$\hat{X}_t = \psi_{\hat{X}_{t+1}}(t+1)$
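Putting the recursion and the backtracking together, a minimal sketch of Viterbi decoding (plain probabilities for readability; a production implementation would work in log space to avoid underflow):

```python
def viterbi(obs, pi, A, B):
    """Most likely state path: delta_j(t+1) = max_i delta_i(t)*a_ij*b_j(o)."""
    n = len(pi)
    delta = [pi[j] * B[j][obs[0]] for j in range(n)]
    psi = []                       # psi[t][j]: best predecessor of state j
    for o in obs[1:]:
        prev = [max(range(n), key=lambda i: delta[i] * A[i][j])
                for j in range(n)]
        delta = [delta[prev[j]] * A[prev[j]][j] * B[j][o] for j in range(n)]
        psi.append(prev)
    # Backtrack from the most probable final state.
    path = [max(range(n), key=lambda j: delta[j])]
    for prev in reversed(psi):
        path.append(prev[path[-1]])
    path.reverse()
    return path, max(delta)

pi = [0.5, 0.5]
A  = [[0.7, 0.3], [0.4, 0.6]]
B  = [{"a": 0.9, "b": 0.1}, {"a": 0.2, "b": 0.8}]
print(viterbi("abb", pi, A, B))   # -> ([0, 1, 1], 0.05184)
```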
Hidden Markov Models Summary
Popular technique to detect and classify a linear sequence of information in text
Disadvantage: the need for large amounts of training data
Related work:
System for extracting gene names and locations from scientific abstracts (Leek, 1997)
NERC (Bikel et al., 1997)
McCallum et al. (1999) extracted document segments that occur in a fixed or partially fixed order (title, author, journal)
Ray and Craven (2001): extraction of proteins, locations, genes and disorders, and their relationships
IE with Symbolic Techniques
Conceptual Dependency Theory (Schank, 1972; Schank, 1975): mainly aimed to extract semantic information about individual events from sentences at a conceptual level (i.e., the actor and an action)
Frame Theory (Minsky, 1975): a frame stores the properties or characteristics of an entity, action or event; it typically consists of a number of slots referring to the properties named by the frame
Berkeley FrameNet project (Baker, 1998; Fillmore and Baker, 2001): online lexical resource for English, based on frame semantics and supported by corpus evidence
FASTUS (Finite State Automaton Text Understanding System; Hobbs, 1996): uses a cascade of FSAs in a frame-based information extraction approach
IE with Machine Learning Techniques
Training data: documents marked up with ground truth
In contrast to text classification, local features are crucial. Features of: the contents; the text just before the item; the text just after the item; begin/end boundaries
Good Features for Information Extraction
Example word features:
identity of word; is in all caps; ends in “-ski”; is part of a noun phrase; is in a list of city names; is under node X in WordNet or Cyc; is in bold font; is in hyperlink anchor
Features of past & future: last person name was female; next two words are “and Associates”
More example features:
begins-with-number, begins-with-ordinal, begins-with-punctuation, begins-with-question-word, begins-with-subject, blank, contains-alphanum, contains-bracketed-number, contains-http, contains-non-space, contains-number, contains-pipe, contains-question-mark, contains-question-word, ends-with-question-mark, first-alpha-is-capitalized, indented, indented-1-to-4, indented-5-to-10, more-than-one-third-space, only-punctuation, prev-is-blank, prev-begins-with-ordinal, shorter-than-30
Creativity and Domain Knowledge Required!
Is capitalized; is mixed caps; is all caps; initial cap; contains digit; all lowercase; is initial; punctuation; period; comma; apostrophe; dash; preceded by HTML tag
Character n-gram classifier says string is a person name (80% accurate)
In stopword list (the, of, their, etc.)
In honorific list (Mr, Mrs, Dr, Sen, etc.)
In person suffix list (Jr, Sr, PhD, etc.)
In name particle list (de, la, van, der, etc.)
In Census lastname list, segmented by P(name)
In Census firstname list, segmented by P(name)
In locations lists (states, cities, countries)
In company name list (“J. C. Penny”)
In list of company suffixes (Inc, & Associates, Foundation)
Word features: lists of job titles, lists of prefixes, lists of suffixes, 350 informative phrases
HTML/formatting features: {begin, end, in} × {<b>, <i>, <a>, <hN>} × {lengths 1, 2, 3, 4, or longer}; {begin, end} of line
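A minimal sketch of a word-feature extractor in the spirit of these lists; the tiny lexicons here stand in for the real resources (Census name lists, honorific lists, etc.):

```python
import re

# Toy stand-ins for the lexicon features listed above.
HONORIFICS = {"Mr", "Mrs", "Dr", "Sen"}
PARTICLES  = {"de", "la", "van", "der"}

def word_features(tokens, i):
    w = tokens[i]
    feats = {"word=" + w.lower()}
    if w.isupper():                  feats.add("all-caps")
    elif w[:1].isupper():            feats.add("initial-cap")
    else:                            feats.add("all-lowercase")
    if any(c.isdigit() for c in w):  feats.add("contains-digit")
    if w.lower().endswith("ski"):    feats.add("ends-in-ski")
    if re.fullmatch(r"\W+", w):      feats.add("only-punctuation")
    if w.rstrip(".") in HONORIFICS:  feats.add("in-honorific-list")
    if w.lower() in PARTICLES:       feats.add("in-name-particle-list")
    if i > 0:                        feats.add("prev=" + tokens[i - 1].lower())
    if i + 1 < len(tokens):          feats.add("next=" + tokens[i + 1].lower())
    return feats

print(word_features(["Dr", ".", "Kowalski", "spoke"], 2))
# -> {'word=kowalski', 'initial-cap', 'ends-in-ski', 'prev=.', 'next=spoke'}
#    (set order may vary)
```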
Landscape of ML Techniques for IE:
Any of these models can be used to capture words, formatting or both.
Classify candidates: run a classifier over each candidate phrase in “Abraham Lincoln was born in Kentucky.”, asking “which class?”
Sliding window: run the classifier over a moving window of tokens, trying alternate window sizes
Boundary models: classify positions between tokens as BEGIN or END of a field, then pair them up
Finite state machines: find the most likely state sequence for the whole token sequence
Wrapper induction: learn and apply a pattern for a website, e.g. in “<b><i>Abraham Lincoln</i></b> was born in Kentucky.” the pair <b><i> … </i></b> delimits a PersonName
IE History
Pre-Web: mostly news articles
De Jong’s FRUMP [1982]: hand-built system to fill Schank-style “scripts” from news wire
Message Understanding Conference (MUC), DARPA [’87–’95], TIPSTER [’92–’96]
Most early work dominated by hand-built models, e.g. SRI’s FASTUS, hand-built FSMs
But by the 1990s, some machine learning: Lehnert, Cardie, Grishman, and then HMMs: Elkan [Leek ’97], BBN [Bikel et al. ’98]
Web: AAAI ’94 Spring Symposium on “Software Agents”: much discussion of ML applied to the Web (Maes, Mitchell, Etzioni)
Tom Mitchell’s WebKB, ’96: build KBs from the Web
Wrapper induction: initially hand-built, then ML: [Soderland ’96], [Kushmerick ’97], …
Summary
Information Extraction
Sliding Window
From FST (Finite State Transducer) to HMM
Wrapper Induction: wrapper toolkits, the LR wrapper
Readings
[1] I. Muslea, S. Minton, and C. Knoblock, “A hierarchical approach to wrapper induction,” in Proceedings of the Third Annual Conference on Autonomous Agents, Seattle, Washington, United States: ACM, 1999. (Recommended)