
Information Extraction

PengBo, Dec 2, 2010

Topics of today

IE: Information Extraction Techniques

Wrapper Induction
Sliding Windows
From FST to HMM

What is IE?

Example: The Problem

Martin Baker, a person

Genomics job

Employer's job posting form

Example: A Solution

Extracting Job Openings from the Web

foodscience.com-Job2

JobTitle: Ice Cream Guru

Employer: foodscience.com

JobCategory: Travel/Hospitality

JobFunction: Food Services

JobLocation: Upper Midwest

Contact Phone: 800-488-2611

DateExtracted: January 8, 2001

Source: www.foodscience.com/jobs_midwest.html

OtherCompanyJobs: foodscience.com-Job1

Job Openings: Category = Food Services, Keyword = Baker, Location = Continental U.S.

Data Mining the Extracted Job Information

Two ways to manage information

[Figure: two ways to answer queries over a document collection. Retrieval: a query is answered directly against the unstructured documents (query → answer). Inference: information extraction first turns the documents into facts such as

advisor(wc,vc), advisor(yh,tm)
affil(wc,mld), affil(vc,lti)
fn(wc,"William"), fn(vc,"Vitor")

and a query such as X: advisor(wc,X) & affil(X,lti) ? is then answered by inference over those facts, e.g. {X=em; X=vc}.]


What is Information Extraction?

Recovering structured data from formatted text
Identifying fields (e.g. named entity recognition)
Understanding relations between fields (e.g. record association)
Normalization and deduplication

Today, focus mostly on field identification & a little on record association

Applications

IE from Research Papers

IE from Chinese Documents regarding Weather

Chinese Academy of Sciences
200k+ documents, several millennia old
- Qing Dynasty Archives
- memos
- newspaper articles
- diaries

Wrapper Induction

“Wrappers”

If we think of things from the database point of view, we want to be able to pose database-style queries. But we have data in some horrid textual form/content management system that doesn't allow such querying.

We need to "wrap" the data in a component that understands database-style querying

Hence the term “wrappers”

Wrappers: Simple Extraction Patterns

Specify an item to extract for a slot using a regular expression pattern, e.g. a price pattern: "\b\$\d+(\.\d{2})?\b"

May require a preceding (pre-filler) pattern and a succeeding (post-filler) pattern to identify the end of the filler. Amazon list price:

Pre-filler pattern: "<b>List Price:</b> <span class=listprice>"
Filler pattern: "\b\$\d+(\.\d{2})?\b"
Post-filler pattern: "</span>"
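As a rough sketch of how these three patterns combine into a single extraction rule (using Python's re module; the HTML snippet below is a hypothetical example, not taken from the slides):

```python
import re

# Pre-filler, filler, and post-filler combined into one rule;
# only the filler is captured, the surrounding patterns just anchor it.
LIST_PRICE = re.compile(
    r"<b>List Price:</b>\s*<span class=listprice>"   # pre-filler
    r"(\$\d+(?:\.\d{2})?)"                           # filler (captured)
    r"</span>"                                       # post-filler
)

html = '<b>List Price:</b> <span class=listprice>$27.95</span>'  # hypothetical input
m = LIST_PRICE.search(html)
if m:
    print(m.group(1))   # -> $27.95
```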

Wrapper tool-kits

Specialized programming environments for writing & debugging wrappers by hand

Some resources: Wrapper Development Tools, LAPIS

Wrapper Induction

Problem description:
Task: learn extraction rules based on labeled examples
Hand-writing rules is tedious, error-prone, and time consuming
Learning wrappers is wrapper induction

Induction Learning

Rule induction: formal rules are extracted from a set of observations. The rules extracted may represent a full scientific model of the data, or merely represent local patterns in the data.

INPUT:
Labeled examples: training & testing data
Admissible rules (hypothesis space)
Search strategy

Desired output: a rule that performs well both on training and testing data

Wrapper induction

Highly regular source documents
→ Relatively simple extraction patterns
→ Efficient learning algorithm

Build a training set of documents paired with human-produced filled extraction templates.

Learn extraction patterns for each slot using an appropriate machine learning algorithm.

Goal: learn from a human teacher how to extract certain database records from a particular web site.


Learner

User gives first K positive—and thus many implicit negative examples

Kushmerick’s WIEN system

Earliest wrapper-learning system (published IJCAI ’97)

Special things about WIEN:
Treats document as a string of characters
Learns to extract a relation directly, rather than extracting fields, then associating them together in some way
Example is a completely labeled page

WIEN system: a sample wrapper

l1, r1, …, lK, rK

Example: find 4 strings <B>, </B>, <I>, </I> as l1, r1, l2, r2

Labeled page:
<HTML><HEAD>Some Country Codes</HEAD>
<B>Congo</B> <I>242</I><BR>
<B>Egypt</B> <I>20</I><BR>
<B>Belize</B> <I>501</I><BR>
<B>Spain</B> <I>34</I><BR>
</BODY></HTML>


Learning LR wrappers

LR wrapper

Left delimiters L1=“<B>”, L2=“<I>”; Right R1=“</B>”, R2=“</I>”
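As a minimal sketch of applying such an LR wrapper (plain Python string search over the country-code page above; the function is my own illustration, not WIEN's code):

```python
def lr_extract(page, delimiters):
    """Apply an LR wrapper: delimiters is [(l1, r1), (l2, r2), ...]."""
    tuples, pos = [], 0
    while True:
        record = []
        for left, right in delimiters:
            start = page.find(left, pos)
            if start < 0:
                return tuples                    # no more records
            start += len(left)
            end = page.find(right, start)
            record.append(page[start:end])
            pos = end + len(right)
        tuples.append(tuple(record))

page = ('<HTML><HEAD>Some Country Codes</HEAD>'
        '<B>Congo</B> <I>242</I><BR><B>Egypt</B> <I>20</I><BR>'
        '<B>Belize</B> <I>501</I><BR><B>Spain</B> <I>34</I><BR></BODY></HTML>')

print(lr_extract(page, [('<B>', '</B>'), ('<I>', '</I>')]))
# -> [('Congo', '242'), ('Egypt', '20'), ('Belize', '501'), ('Spain', '34')]
```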

LR: Finding r1

<HTML><TITLE>Some Country Codes</TITLE><B>Congo</B> <I>242</I><BR><B>Egypt</B> <I>20</I><BR><B>Belize</B> <I>501</I><BR><B>Spain</B> <I>34</I><BR></BODY></HTML>

r1 can be any prefix, e.g. </B>

LR: Finding l1, l2 and r2

<HTML><TITLE>Some Country Codes</TITLE><B>Congo</B> <I>242</I><BR><B>Egypt</B> <I>20</I><BR><B>Belize</B> <I>501</I><BR><B>Spain</B> <I>34</I><BR></BODY></HTML>

r2 can be any prefix, e.g. </I>

l2 can be any suffix, e.g. <I>

l1 can be any suffix, e.g. <B>

WIEN system

Assumes items are always in fixed, known order … Name: J. Doe; Address: 1 Main; Phone: 111-1111.

<p> Name: E. Poe; Address: 10 Pico; Phone: 777-1111.

<p> …

Introduces several types of wrappers

LR

Learning LR extraction rules

Admissible rules: prefixes & suffixes of items of interest
Search strategy: start with shortest prefix & suffix, and expand until correct
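A toy sketch of that search for a right delimiter (my own simplification of the idea, not Kushmerick's actual algorithm): grow candidates from shortest to longest over the text that follows each labeled value, and accept the first candidate that can never fire inside a value.

```python
def common_prefix(strings):
    """Longest prefix shared by all strings."""
    prefix = strings[0]
    for s in strings[1:]:
        while not s.startswith(prefix):
            prefix = prefix[:-1]
    return prefix

def learn_right_delimiter(values, followers):
    """Shortest prefix of the text following every labeled value
    that never occurs inside a value (so extraction stops correctly)."""
    limit = common_prefix(followers)
    for length in range(1, len(limit) + 1):       # shortest first, then expand
        cand = limit[:length]
        if not any(cand in v for v in values):
            return cand
    return None

# Labeled instances of the country field from the example page:
values    = ['Congo', 'Egypt', 'Belize', 'Spain']
followers = ['</B> <I>242</I><BR>', '</B> <I>20</I><BR>',
             '</B> <I>501</I><BR>', '</B> <I>34</I><BR>']
print(learn_right_delimiter(values, followers))   # -> '<' (already unambiguous here)
```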

Summary of WIEN

Advantages:
Fast to learn & extract

Drawbacks:
Cannot handle permutations and missing items
Must label entire page
Requires a large number of examples

Sliding Windows

Extraction by Sliding Window

GRAND CHALLENGES FOR MACHINE LEARNING

Jaime Carbonell School of Computer Science Carnegie Mellon University

3:30 pm 7500 Wean Hall

Machine learning has evolved from obscurity in the 1970s into a vibrant and popular discipline in artificial intelligence during the 1980s and 1990s. As a result of its success and growth, machine learning is evolving into a collection of related disciplines: inductive concept acquisition, analytic learning in problem solving (e.g. analogy, explanation-based learning), learning theory (e.g. PAC learning), genetic algorithms, connectionist learning, hybrid systems, and so on.

CMU UseNet Seminar Announcement

E.g. looking for the seminar location


A “Naïve Bayes” Sliding Window Model

[Freitag 1997]

[Figure: a candidate window over the token stream "… 00 : pm Place : Wean Hall Rm 5409 Speaker : Sebastian Thrun …", with prefix tokens w_{t-m} … w_{t-1}, content tokens w_t … w_{t+n}, and suffix tokens w_{t+n+1} … w_{t+n+m}.]

If P("Wean Hall Rm 5409" = LOCATION) is above some threshold, extract it.

Estimate Pr(LOCATION|window) using Bayes rule

Try all “reasonable” windows (vary length, position)

Assume independence for length, prefix words, suffix words, content words

Estimate from data quantities like: Pr(“Place” in prefix|LOCATION)

A “Naïve Bayes” Sliding Window Model

1. Create a dataset of examples like these:
+ (prefix00, …, prefixColon, contentWean, contentHall, …, suffixSpeaker, …)
- (prefixColon, …, prefixWean, contentHall, …, contentSpeaker, suffixColon, …)
…
2. Train a Naive Bayes classifier
3. If Pr(class=+ | prefix, contents, suffix) > threshold, predict the content window is a location.

To think about: what if the extracted entities aren't consistent, e.g. if the location overlaps with the speaker?

[Freitag 1997]

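A minimal sketch of those three steps (using scikit-learn's BernoulliNB; the tiny training set and feature names such as 'prefix=Place' are made up for illustration, not from Freitag's system):

```python
from sklearn.feature_extraction import DictVectorizer
from sklearn.naive_bayes import BernoulliNB

def window_features(tokens, start, length, k=2):
    """Binary features for a candidate window: prefix, content, and suffix words."""
    feats = {}
    for w in tokens[max(0, start - k):start]:
        feats['prefix=' + w] = 1
    for w in tokens[start:start + length]:
        feats['content=' + w] = 1
    for w in tokens[start + length:start + length + k]:
        feats['suffix=' + w] = 1
    return feats

tokens = "00 : pm Place : Wean Hall Rm 5409 Speaker : Sebastian Thrun".split()

# Toy labeled windows: (start, length, is_location)
examples = [(5, 4, 1), (3, 1, 0), (11, 2, 0)]
vec = DictVectorizer()
X = vec.fit_transform([window_features(tokens, s, l) for s, l, _ in examples])
y = [lab for _, _, lab in examples]

clf = BernoulliNB().fit(X, y)

# Score a candidate window; extract it if the probability clears a threshold.
cand = vec.transform([window_features(tokens, 5, 4)])
p = clf.predict_proba(cand)[0, 1]
if p > 0.5:
    print("LOCATION:", " ".join(tokens[5:5 + 4]), round(p, 2))
```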

“Naïve Bayes” Sliding Window Results


Domain: CMU UseNet Seminar Announcements

Field: F1
Person Name: 30%
Location: 61%
Start Time: 98%

Finite State Transducers

Finite State Transducers for IE

Basic method for extracting relevant information

IE systems generally use a collection of specialized FSTs

Company Name detection
Person Name detection
Relationship detection

Finite State Transducers for IE

Frodo Baggins works for Hobbit Factory, Inc.

Text Analyzer:

Frodo – Proper Name

Baggins – Proper Name

works – Verb

for – Prep

Hobbit – UnknownCap

Factory – NounCap

Inc – CompAbbr

Finite State Transducers for IE

Frodo Baggins works for Hobbit Factory, Inc.

Some regular expression for finding company names: "some capitalized words, maybe a comma, then a company abbreviation indicator"

CompanyName = (ProperName | SomeCap)+ Comma? CompAbbr
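A small sketch of that pattern run over the tagged tokens from the text analyzer above (plain Python over (word, tag) pairs; treating UnknownCap and NounCap as capitalized words is my own reading of the slide's SomeCap):

```python
def find_company_names(tagged):
    """Match (ProperName | SomeCap)+ Comma? CompAbbr over (word, tag) pairs."""
    matches, i = [], 0
    while i < len(tagged):
        j = i
        while j < len(tagged) and tagged[j][1] in ('ProperName', 'SomeCap',
                                                   'UnknownCap', 'NounCap'):
            j += 1
        if j > i:                                   # at least one capitalized word
            k = j + 1 if j < len(tagged) and tagged[j][1] == 'Comma' else j
            if k < len(tagged) and tagged[k][1] == 'CompAbbr':
                matches.append(' '.join(w for w, _ in tagged[i:k + 1]))
                i = k + 1
                continue
        i += 1
    return matches

tagged = [('Frodo', 'ProperName'), ('Baggins', 'ProperName'), ('works', 'Verb'),
          ('for', 'Prep'), ('Hobbit', 'UnknownCap'), ('Factory', 'NounCap'),
          (',', 'Comma'), ('Inc', 'CompAbbr')]
print(find_company_names(tagged))   # -> ['Hobbit Factory , Inc']
```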

Finite State Transducers for IE

Frodo Baggins works for Hobbit Factory, Inc.

[Figure: Company Name Detection FSA with states 1 to 4; arcs are labeled word, (CAP | PN), comma, and CAB. CAP = SomeCap, CAB = CompAbbr, PN = ProperName, ε = empty string.]


Finite State Transducers for IE

Frodo Baggins works for Hobbit Factory, Inc.

[Figure: Company Name Detection FST over states 1 to 4; the arcs carry input:output labels such as word:word and CAB:CN. CAP = SomeCap, CAB = CompAbbr, PN = ProperName, ε = empty string, CN = CompanyName.]

Non-deterministic!!!

Finite State Transducers for IE

Several FSTs or a more complex FST can be used to find one type of information (e.g. company names)

FSTs are often compiled from regular expressions

Probabilistic (weighted) FSTs

Finite State Transducers for IE

FSTs mean different things to different researchers in IE.

Based on lexical items (words)
Based on statistical language models
Based on deep syntactic/semantic analysis

Example: FASTUS

Finite State Automaton Text Understanding System (SRI International)

Cascading FSTs:
Recognize names
Recognize noun groups, verb groups etc.
Complex noun/verb groups are constructed
Identify patterns of interest
Identify and merge event structures

Hidden Markov Models

Hidden Markov Models formalism

HMM = states s1, s2, … (special start state s1, special end state sn), token alphabet a1, a2, …, state transition probs P(si|sj), token emission probs P(ai|sj)

Widely used in many language processing tasks, e.g., speech recognition [Lee, 1989], POS tagging [Kupiec, 1992], topic detection [Yamron et al, 1998].

HMM = probabilistic FSA

Applying HMMs to IE

Document generated by a stochastic process modelled by an HMM

Token ↔ word
State ↔ "reason/explanation" for a given token

'Background' state emits tokens like 'the', 'said', …
'Money' state emits tokens like 'million', 'euro', …
'Organization' state emits tokens like 'university', 'company', …

Extraction: via the Viterbi algorithm, a dynamic programming technique for efficiently computing the most likely sequence of states that generated a document.

HMM for research papers: transitions and emissions [Seymore et al., 99]

Trained on 2 million words of BibTeX data from the Web.

[Figure: state-transition structure over fields such as author, title, and institution, with example high-probability emissions per state, e.g. title: "stochastic optimization", "reinforcement learning", "model building mobile robot"; institution: "carnegie mellon university", "university of california", "dartmouth college"; note: "ICML 1997", "submission to", "to appear in", "supported in part", "copyright".]

What is an HMM?

Graphical Model Representation:
Variables by time
Circles indicate states
Arrows indicate probabilistic dependencies between states

What is an HMM?

Green circles are hidden states
Dependent only on the previous state: Markov process
"The past is independent of the future given the present."

What is an HMM?

Purple nodes are observed states
Dependent only on their corresponding hidden state

HMM Formalism

{S, K, Π, A, B}
S : {s1…sN} are the values for the hidden states
K : {k1…kM} are the values for the observations

[Figure: trellis of hidden states S emitting observations K at successive time steps.]

HMM Formalism

{S, K, Π, A, B}
Π = {πi} are the initial state probabilities
A = {aij} are the state transition probabilities
B = {bik} are the observation state probabilities

[Figure: the same trellis with transition probabilities A on the arcs between hidden states S and emission probabilities B on the arcs to observations K.]

Need to provide structure of HMM & vocabulary
Training the model (Baum-Welch algorithm)

Efficient dynamic programming algorithms exist for:
Finding Pr(K)
The highest probability path S that maximizes Pr(K,S) (Viterbi)
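As a minimal sketch of this formalism (the Π, A, B arrays and the tiny bibliographic state set below are made-up illustrations, not a trained model):

```python
import numpy as np

rng = np.random.default_rng(0)

states = ['Title', 'Author', 'Journal', 'Year']   # hidden state values S
vocab  = ['learning', 'smith', 'jmlr', '1999']    # observation values K (toy)

Pi = np.array([1.0, 0.0, 0.0, 0.0])               # initial state probabilities
A  = np.array([[0.1, 0.8, 0.05, 0.05],            # state transition probabilities
               [0.0, 0.2, 0.7,  0.1 ],
               [0.0, 0.0, 0.3,  0.7 ],
               [0.0, 0.0, 0.0,  1.0 ]])
B  = np.eye(4) * 0.6 + 0.1                        # emission probabilities (rows sum to 1)

def generate(length):
    """Sample (state, word) pairs: the 'document generated by a stochastic process' view."""
    s = rng.choice(4, p=Pi)
    out = []
    for _ in range(length):
        out.append((states[s], vocab[rng.choice(4, p=B[s])]))
        s = rng.choice(4, p=A[s])
    return out

print(generate(5))
```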

[Figure: example HMM for bibliographic records with states Title, Author, Journal, and Year. The arcs carry transition probabilities (0.9, 0.5, 0.5, 0.8, 0.2, 0.1 in the figure), and each state carries an emission probability table over tokens, e.g. one state emits A 0.6, B 0.3, C 0.1; another X 0.4, B 0.2, Z 0.4; another Y 0.1, A 0.1, C 0.8; and the Year state emits the digit patterns dddd with probability 0.8 and dd with 0.2.]

Using the HMM to segment

Find highest probability path through the HMM.

Viterbi: quadratic dynamic programming algorithm

[Figure: Viterbi trellis for segmenting the address "115 Grant street Mumbai 400070": at each observation o_t the candidate states are House, Road, City, and Pin, and the highest-probability path through the trellis assigns each token to one state.]

Most Likely Path for a Given Sequence

The probability that the path is taken and the sequence is generated:

$$\Pr(x_1 \ldots x_L, \pi_0 \ldots \pi_N) = a_{0\pi_1} \prod_{i=1}^{L} b_{\pi_i}(x_i)\, a_{\pi_i \pi_{i+1}}$$

where $x_1 \ldots x_L$ is the observed sequence, $\pi_0 \ldots \pi_N$ is the state path, the $a$'s are transition probabilities, and the $b$'s are emission probabilities.

Example

[Figure: example HMM with a begin state (0), an end state (5), and four emitting states (1-4) whose emission tables over {A, C, G, T} are A 0.1 / C 0.4 / G 0.4 / T 0.1; A 0.4 / C 0.1 / G 0.1 / T 0.4; A 0.4 / C 0.1 / G 0.2 / T 0.3; and A 0.2 / C 0.3 / G 0.3 / T 0.2. The arcs carry transition probabilities 0.5, 0.5, 0.2, 0.8, 0.4, 0.6, 0.1, 0.9, 0.2, 0.8.]

$$\Pr(\mathrm{AAC}, \pi) = a_{01}\, b_1(\mathrm{A})\, a_{11}\, b_1(\mathrm{A})\, a_{13}\, b_3(\mathrm{C})\, a_{35} = 0.5 \times 0.4 \times 0.2 \times 0.4 \times 0.8 \times 0.3 \times 0.6$$
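A small sketch that just evaluates the product above (only the probabilities appearing in this particular path are filled in, so this is not the full model from the figure):

```python
import numpy as np

# States 0..5 (0 = begin, 5 = end); only entries used by the path below are set.
a = np.zeros((6, 6))
a[0, 1], a[1, 1], a[1, 3], a[3, 5] = 0.5, 0.2, 0.8, 0.6    # transition probabilities
b = {1: {'A': 0.4}, 3: {'C': 0.3}}                         # emission probabilities used here

def joint_prob(seq, path):
    """Pr(x_1..x_L, path) = a[0, pi_1] * prod_i b[pi_i](x_i) * a[pi_i, pi_{i+1}]."""
    p = a[0, path[0]]
    for i, (x, s) in enumerate(zip(seq, path)):
        p *= b[s][x]                                       # emission
        if i + 1 < len(path):
            p *= a[s, path[i + 1]]                         # transition (incl. to end state)
    return p

print(joint_prob('AAC', [1, 1, 3, 5]))   # 0.5*0.4*0.2*0.4*0.8*0.3*0.6 ≈ 0.0023
```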


Finding the most probable path

Find the state sequence that best explains the observations

Viterbi algorithm (1967)

$$\hat{X} = \arg\max_X P(X \mid O)$$


Viterbi Algorithm

$$\delta_j(t) = \max_{x_1 \ldots x_{t-1}} P(x_1 \ldots x_{t-1},\ o_1 \ldots o_{t-1},\ x_t = j,\ o_t)$$

The state sequence which maximizes the probability of seeing the observations to time t-1, landing in state j, and seeing the observation at time t


Viterbi Algorithm

Recursive computation:

$$\delta_j(t+1) = \max_i \delta_i(t)\, a_{ij}\, b_{j o_{t+1}}$$

$$\psi_j(t+1) = \arg\max_i \delta_i(t)\, a_{ij}\, b_{j o_{t+1}}$$

Viterbi: Dynamic Programming

[Figure: the recursion $\delta_j(t+1) = \max_i \delta_i(t)\, a_{ij}\, b_{j o_{t+1}}$ applied to the address-segmentation trellis: for each token o_t of "No 115 Grant street Mumbai 400070", the score of each state (House, Road, City, Pin) is computed from the best-scoring predecessor state in the previous column.]

Viterbi Algorithm

Termination and path readout:

$$\hat{X}_T = \arg\max_i \delta_i(T)$$

$$\hat{X}_t = \psi_{\hat{X}_{t+1}}(t+1)$$

Compute the most likely state sequence by working backwards.
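Putting the initialization, recursion, and backtracking together, a compact generic sketch in numpy (not tied to any specific model from the slides):

```python
import numpy as np

def viterbi(pi, A, B, obs):
    """Most likely state path for an observation sequence.
    pi: (N,) initial probs, A: (N,N) transitions, B: (N,M) emissions, obs: list of ints."""
    N, T = len(pi), len(obs)
    delta = np.zeros((T, N))            # best path score ending in state j at time t
    psi = np.zeros((T, N), dtype=int)   # backpointers
    delta[0] = pi * B[:, obs[0]]
    for t in range(1, T):
        scores = delta[t - 1][:, None] * A * B[:, obs[t]]   # scores[i, j]
        psi[t] = scores.argmax(axis=0)
        delta[t] = scores.max(axis=0)
    path = [int(delta[T - 1].argmax())]
    for t in range(T - 1, 0, -1):       # work backwards through the backpointers
        path.append(int(psi[t][path[-1]]))
    return path[::-1]

# Toy 2-state example: state 0 prefers observation 0, state 1 prefers observation 1.
pi = np.array([0.6, 0.4])
A  = np.array([[0.7, 0.3], [0.4, 0.6]])
B  = np.array([[0.9, 0.1], [0.2, 0.8]])
print(viterbi(pi, A, B, [0, 0, 1, 1]))   # -> [0, 0, 1, 1]
```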

Hidden Markov Models Summary

Popular technique to detect and classify a linear sequence of information in text

Disadvantage is the need for large amounts of training data

Related works:
System for extraction of gene names and locations from scientific abstracts (Leek, 1997)
NERC (Bikel et al., 1997)
McCallum et al. (1999) extracted document segments that occur in a fixed or partially fixed order (title, author, journal)
Ray and Craven (2001): extraction of proteins, locations, genes and disorders and their relationships

IE Technique Landscape

IE with Symbolic Techniques

Conceptual Dependency Theory (Schank, 1972; Schank, 1975)
mainly aimed to extract semantic information about individual events from sentences at a conceptual level (i.e., the actor and an action)

Frame Theory (Minsky, 1975)
a frame stores the properties or characteristics of an entity, action or event
it typically consists of a number of slots to refer to the properties named by the frame

Berkeley FrameNet project (Baker, 1998; Fillmore and Baker, 2001)
online lexical resource for English, based on frame semantics and supported by corpus evidence

FASTUS (Finite State Automaton Text Understanding System) (Hobbs, 1996)
uses a cascade of FSAs in a frame-based information extraction approach

IE with Machine Learning Techniques

Training data: documents marked up with ground truth

In contrast to text classification, local features are crucial. Features of:
Contents
Text just before item
Text just after item
Begin/end boundaries

Good Features for Information Extraction

Example word features:
identity of word
is in all caps
ends in "-ski"
is part of a noun phrase
is in a list of city names
is under node X in WordNet or Cyc
is in bold font
is in hyperlink anchor

Features of past & future:
last person name was female
next two words are "and Associates"

begins-with-number
begins-with-ordinal
begins-with-punctuation
begins-with-question-word
begins-with-subject
blank
contains-alphanum
contains-bracketed-number
contains-http
contains-non-space
contains-number
contains-pipe
contains-question-mark
contains-question-word
ends-with-question-mark
first-alpha-is-capitalized
indented
indented-1-to-4
indented-5-to-10
more-than-one-third-space
only-punctuation
prev-is-blank
prev-begins-with-ordinal
shorter-than-30

Creativity and Domain Knowledge Required!

Is Capitalized
Is Mixed Caps
Is All Caps
Initial Cap
Contains Digit
All lowercase
Is Initial
Punctuation
Period
Comma
Apostrophe
Dash
Preceded by HTML tag

Character n-gram classifier says string is a person name (80% accurate)

In stopword list (the, of, their, etc)
In honorific list (Mr, Mrs, Dr, Sen, etc)
In person suffix list (Jr, Sr, PhD, etc)
In name particle list (de, la, van, der, etc)
In Census lastname list; segmented by P(name)
In Census firstname list; segmented by P(name)
In locations lists (states, cities, countries)
In company name list ("J. C. Penny")
In list of company suffixes (Inc, & Associates, Foundation)

Word features: lists of job titles, lists of prefixes, lists of suffixes, 350 informative phrases

HTML/formatting features:
{begin, end, in} × {<b>, <i>, <a>, <hN>} × {lengths 1, 2, 3, 4, or longer}
{begin, end} of line

Creativity and Domain Knowledge Required!

Good Features for Information Extraction
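A brief sketch of a few of the word-level features above as a feature function (the feature names and the list contents are illustrative choices, not the original systems'):

```python
import re

HONORIFICS = {'Mr', 'Mrs', 'Dr', 'Sen'}          # illustrative lists
NAME_PARTICLES = {'de', 'la', 'van', 'der'}
STOPWORDS = {'the', 'of', 'their'}

def word_features(token, prev_token=None):
    """A handful of the word-level features listed above, as a dict."""
    return {
        'is_capitalized': token[:1].isupper(),
        'is_all_caps': token.isupper(),
        'contains_digit': any(c.isdigit() for c in token),
        'all_lowercase': token.islower(),
        'ends_in_ski': token.lower().endswith('ski'),
        'only_punctuation': bool(re.fullmatch(r'\W+', token)),
        'in_stopword_list': token.lower() in STOPWORDS,
        'in_honorific_list': token.rstrip('.') in HONORIFICS,
        'in_name_particle_list': token.lower() in NAME_PARTICLES,
        'prev_is_honorific': prev_token is not None and prev_token.rstrip('.') in HONORIFICS,
    }

print(word_features('Thrun', prev_token='Dr.'))
```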

Landscape of ML Techniques for IE:

Any of these models can be used to capture words, formatting or both.

Classify Candidates

Abraham Lincoln was born in Kentucky.

Classifier

which class?

Sliding Window

Abraham Lincoln was born in Kentucky.

Classifier

which class?

Try alternate window sizes:

Boundary Models

Abraham Lincoln was born in Kentucky.

Classifier

which class?

BEGIN END BEGIN END

BEGIN

Finite State Machines

Abraham Lincoln was born in Kentucky.

Most likely state sequence?

Wrapper Induction

<b><i>Abraham Lincoln</i></b> was born in Kentucky.

Learn and apply pattern for a website

<b>

<i>

PersonName

IE History

Pre-Web
Mostly news articles
De Jong's FRUMP [1982]: hand-built system to fill Schank-style "scripts" from news wire
Message Understanding Conference (MUC), DARPA ['87-'95], TIPSTER ['92-'96]
Most early work dominated by hand-built models, e.g. SRI's FASTUS, hand-built FSMs
But by the 1990's, some machine learning: Lehnert, Cardie, Grishman, and then HMMs: Elkan [Leek '97], BBN [Bikel et al '98]

Web
AAAI '94 Spring Symposium on "Software Agents": much discussion of ML applied to the Web (Maes, Mitchell, Etzioni)
Tom Mitchell's WebKB, '96: build KBs from the Web
Wrapper Induction: initially hand-built, then ML: [Soderland '96], [Kushmerick '97], …

Summary

Information Extraction
Sliding Window
From FST (Finite State Transducer) to HMM
Wrapper Induction
Wrapper toolkits
LR Wrapper


Readings

[1] I. Muslea, S. Minton, and C. Knoblock, "A hierarchical approach to wrapper induction," in Proceedings of the Third Annual Conference on Autonomous Agents, Seattle, Washington, United States: ACM, 1999.

Thank You!

Q&A