Topics of today
IE: Information Extraction Techniques
Wrapper Induction, Sliding Windows, From FST to HMM
Extracting Job Openings from the Web
foodscience.com-Job2
JobTitle: Ice Cream Guru
Employer: foodscience.com
JobCategory: Travel/Hospitality
JobFunction: Food Services
JobLocation: Upper Midwest
Contact Phone: 800-488-2611
DateExtracted: January 8, 2001
Source: www.foodscience.com/jobs_midwest.html
OtherCompanyJobs: foodscience.com-Job1
Two ways to manage information
[Figure: documents flow into a base of extracted facts that supports two modes of use.]
Retrieval: query → answer over extracted facts, e.g. advisor(wc,vc), advisor(yh,tm), affil(wc,mld), affil(vc,lti), fn(wc,“William”), fn(vc,“Vitor”)
Inference: query → answer by reasoning over extracted facts, e.g. X: advisor(wc,Y) & affil(X,lti) ? {X=em; X=vc}
What is Information Extraction?
Recovering structured data from formatted text:
Identifying fields (e.g. named entity recognition)
Understanding relations between fields (e.g. record association)
Normalization and deduplication
Today: focus mostly on field identification & a little on record association
IE from Chinese Documents regarding Weather
Chinese Academy of Sciences
200k+ documents, several millennia old
Qing Dynasty Archives: memos, newspaper articles, diaries
“Wrappers”
If we think of things from the database point of view:
We want to be able to run database-style queries, but we have data in some horrid textual form/content management system that doesn’t allow such querying.
We need to “wrap” the data in a component that understands database-style querying.
Hence the term “wrappers”.
Title: Schulz and Peanuts: A Biography Author: David Michaelis List Price: $34.95
Wrappers: Simple Extraction Patterns
Specify an item to extract for a slot using a regular expression pattern, e.g. a price pattern: “\b\$\d+(\.\d{2})?\b”
May require a preceding (pre-filler) pattern and a succeeding (post-filler) pattern to delimit the filler. Amazon list price:
Pre-filler pattern: “<b>List Price:</b> <span class=listprice>”
Filler pattern: “\b\$\d+(\.\d{2})?\b”
Post-filler pattern: “</span>”
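To make this concrete, here is a minimal runnable sketch of pre-filler/filler/post-filler matching in Python, using the Amazon example above. One correction relative to the slide’s pattern: the leading \b is dropped, since a word boundary cannot occur between a space and “$” (both non-word characters), so the original filler pattern would never match after “List Price: ”.

```python
import re

# Pre-filler, filler and post-filler patterns from the example above.
# (The leading \b of the slide's price pattern is dropped: \b cannot
# match between a space and "$", which are both non-word characters.)
PRE    = r"<b>List Price:</b>\s*<span class=listprice>"
FILLER = r"\$\d+(?:\.\d{2})?\b"
POST   = r"</span>"

def extract_list_price(html):
    """Return the list price if the pre/filler/post pattern matches."""
    m = re.search(PRE + r"\s*(" + FILLER + r")\s*" + POST, html)
    return m.group(1) if m else None

print(extract_list_price(
    "<b>List Price:</b> <span class=listprice>$34.95</span>"))
# -> $34.95
```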
Wrapper tool-kits
Wrapper toolkits: specialized programming environments for writing & debugging wrappers by hand
Some resources: wrapper development tools such as LAPIS
Wrapper Induction
Problem description:
Task: learn extraction rules based on labeled examples
Hand-writing rules is tedious, error-prone, and time-consuming
Learning wrappers is “wrapper induction”
Rule induction: formal rules are extracted from a set of observations. The rules extracted may represent a full scientific model of the data, or merely represent local patterns in the data.
Input: labeled examples (training & testing data), admissible rules (the hypothesis space), a search strategy
Desired output: a rule that performs well both on training and testing data
Wrapper induction
Highly regular source documents
→ relatively simple extraction patterns
→ efficient learning algorithms
Build a training set of documents paired with human-produced filled extraction templates.
Learn extraction patterns for each slot using an appropriate machine learning algorithm.
Goal: learn from a human teacher how to extract certain database records from a particular web site.
Kushmerick’s WIEN system
Earliest wrapper-learning system (published at IJCAI ’97)
Special things about WIEN:
Treats the document as a string of characters
Learns to extract a relation directly, rather than extracting fields and then associating them together in some way
A training example is a completely labeled page
A wrapper is a set of delimiter strings l1, r1, …, lK, rK: a left and a right delimiter for each of the K fields
Example: find 4 strings <B>, </B>, <I>, </I> to serve as l1, r1, l2, r2
Labeled pages → wrapper:
<HTML><HEAD>Some Country Codes</HEAD><B>Congo</B> <I>242</I><BR><B>Egypt</B> <I>20</I><BR><B>Belize</B> <I>501</I><BR><B>Spain</B> <I>34</I><BR></BODY></HTML>
Learning LR wrappers
<HTML><TITLE>Some Country Codes</TITLE><B>Congo</B> <I>242</I><BR><B>Egypt</B> <I>20</I><BR><B>Belize</B> <I>501</I><BR><B>Spain</B> <I>34</I><BR></BODY></HTML>
Finding r1: r1 can be any prefix of the text that follows each first-field value, e.g. </B>
Finding l1, l2 and r2:
r2 can be any prefix, e.g. </I>
l2 can be any suffix, e.g. <I>
l1 can be any suffix, e.g. <B>
WIEN system
Assumes items are always in a fixed, known order:
… Name: J. Doe; Address: 1 Main; Phone: 111-1111.
<p> Name: E. Poe; Address: 10 Pico; Phone: 777-1111.
<p> …
Introduces several types of wrappers; LR is the simplest
Learning LR extraction rules
Admissible rules: prefixes & suffixes of the items of interest
Search strategy: start with the shortest prefix & suffix, and expand until correct
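A minimal sketch of how an LR wrapper extracts a relation once its delimiter strings l1, r1, …, lK, rK are known. This illustrates the idea only; Kushmerick’s actual system also handles page heads and tails (the HLRT wrapper class).

```python
def lr_extract(page, delimiters):
    """Extract tuples with an LR wrapper.

    delimiters: [(l1, r1), ..., (lK, rK)] -- left/right strings per field.
    Scans left to right, one record at a time, as in WIEN's LR class.
    """
    pos = 0
    while True:
        record = []
        for left, right in delimiters:
            start = page.find(left, pos)
            if start == -1:
                return                     # no more records
            start += len(left)
            end = page.find(right, start)
            if end == -1:
                return
            record.append(page[start:end])
            pos = end + len(right)
        yield tuple(record)

page = ("<HTML><HEAD>Some Country Codes</HEAD>"
        "<B>Congo</B> <I>242</I><BR>"
        "<B>Egypt</B> <I>20</I><BR></BODY></HTML>")
print(list(lr_extract(page, [("<B>", "</B>"), ("<I>", "</I>")])))
# -> [('Congo', '242'), ('Egypt', '20')]
```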
Summary of WIEN
Advantages: fast to learn & extract
Drawbacks: cannot handle permutations and missing items; must label the entire page; requires a large number of examples
Extraction by Sliding Window
GRAND CHALLENGES FOR MACHINE LEARNING
Jaime Carbonell School of Computer Science Carnegie Mellon University
3:30 pm 7500 Wean Hall
Machine learning has evolved from obscurity in the 1970s into a vibrant and popular discipline in artificial intelligence during the 1980s and 1990s. As a result of its success and growth, machine learning is evolving into a collection of related disciplines: inductive concept acquisition, analytic learning in problem solving (e.g. analogy, explanation-based learning), learning theory (e.g. PAC learning), genetic algorithms, connectionist learning, hybrid systems, and so on.
CMU UseNet Seminar Announcement
E.g. looking for the seminar location
A “Naïve Bayes” Sliding Window Model
[Freitag 1997]
[Figure: a window over the token stream “… 00 : pm Place : Wean Hall Rm 5409 Speaker : Sebastian Thrun …”: prefix w_{t−m} … w_{t−1}, contents w_t … w_{t+n}, suffix w_{t+n+1} … w_{t+n+m}]
If P(“Wean Hall Rm 5409” = LOCATION) is above some threshold, extract it.
Estimate Pr(LOCATION|window) using Bayes rule
Try all “reasonable” windows (vary length, position)
Assume independence for length, prefix words, suffix words, content words
Estimate from data quantities like: Pr(“Place” in prefix|LOCATION)
A “Naïve Bayes” Sliding Window Model
1. Create a dataset of examples like these:
+ (prefix00, …, prefixColon, contentWean, contentHall, …, suffixSpeaker, …)
− (prefixColon, …, prefixWean, contentHall, …, contentSpeaker, suffixColon, …)
2. Train a Naive Bayes classifier.
3. If Pr(class=+ | prefix, contents, suffix) > threshold, predict that the content window is a location.
To think about: what if the extracted entities aren’t consistent, e.g. if the location overlaps with the speaker? (See the sketch below.)
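Below is a compact sketch of this three-step pipeline, not Freitag’s exact model: each candidate window becomes a bag of positional features (the prefix=/content=/suffix= encoding is our own), a Naive Bayes classifier with add-one smoothing scores it, and extraction tries all reasonable (start, length) windows.

```python
import math
from collections import defaultdict

def window_features(tokens, start, length, ctx=2):
    """Encode a candidate window as positional bag-of-words features."""
    feats = ["prefix=" + w for w in tokens[max(0, start - ctx):start]]
    feats += ["content=" + w for w in tokens[start:start + length]]
    feats += ["suffix=" + w for w in tokens[start + length:start + length + ctx]]
    feats.append("length=%d" % length)
    return feats

class NaiveBayes:
    def fit(self, examples):                   # examples: [(features, +1/-1)]
        self.counts = {+1: defaultdict(int), -1: defaultdict(int)}
        self.totals = {+1: 0, -1: 0}
        for feats, y in examples:
            self.totals[y] += 1
            for f in feats:
                self.counts[y][f] += 1
        return self

    def log_odds(self, feats):
        """log P(+|feats) - log P(-|feats), with add-one smoothing."""
        score = math.log(self.totals[+1] + 1.0) - math.log(self.totals[-1] + 1.0)
        for f in feats:
            score += math.log((self.counts[+1][f] + 1.0) / (self.totals[+1] + 2.0))
            score -= math.log((self.counts[-1][f] + 1.0) / (self.totals[-1] + 2.0))
        return score

def extract(tokens, model, max_len=4, threshold=0.0):
    """Try all 'reasonable' windows; keep those whose score clears the bar."""
    for start in range(len(tokens)):
        for length in range(1, max_len + 1):
            if model.log_odds(window_features(tokens, start, length)) > threshold:
                yield start, tokens[start:start + length]

seminar = "00 : pm Place : Wean Hall Rm 5409 Speaker :".split()
model = NaiveBayes().fit([
    (window_features(seminar, 5, 4), +1),   # "Wean Hall Rm 5409" is a LOCATION
    (window_features(seminar, 3, 2), -1),   # "Place :" is not
])
for start, span in extract("Talk in Wean Hall Rm 5409 at noon".split(), model):
    print(start, span)                      # several overlapping candidates fire
```

Resolving overlapping extractions (the “to think about” question above) would need a post-processing step, e.g. keeping only the highest-scoring of any pair of overlapping windows.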
“Naïve Bayes” Sliding Window Results
GRAND CHALLENGES FOR MACHINE LEARNING
Jaime Carbonell School of Computer Science Carnegie Mellon University
3:30 pm 7500 Wean Hall
Machine learning has evolved from obscurity in the 1970s into a vibrant and popular discipline in artificial intelligence during the 1980s and 1990s. As a result of its success and growth, machine learning is evolving into a collection of related disciplines: inductive concept acquisition, analytic learning in problem solving (e.g. analogy, explanation-based learning), learning theory (e.g. PAC learning), genetic algorithms, connectionist learning, hybrid systems, and so on.
Domain: CMU UseNet Seminar Announcements
Field F1: Person Name 30%, Location 61%, Start Time 98%
Finite State Transducers for IE
Basic method for extracting relevant information
IE systems generally use a collection of specialized FSTs
Company name detection, person name detection, relationship detection
Finite State Transducers for IE
Frodo Baggins works for Hobbit Factory, Inc.
Text Analyzer:
Frodo – Proper Name
Baggins – Proper Name
works – Verb
for – Prep
Hobbit – UnknownCap
Factory – NounCap
Inc – CompAbbr
Finite State Transducers for IE
Frodo Baggins works for Hobbit Factory, Inc.
A regular expression for finding company names: “some capitalized words, maybe a comma, then a company abbreviation indicator”:
CompanyName = (ProperName | SomeCap)+ Comma? CompAbbr
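One way to run such a pattern is to encode the token tags as a string and reuse an ordinary regex engine; a minimal sketch (the one-letter tag codes are our own convention):

```python
import re

# Encode each token's tag as one character, then run the company-name
# pattern "(ProperName | SomeCap)+ Comma? CompAbbr" as a regex over the
# tag string. UnknownCap and NounCap are folded into the "capitalized
# word" class, matching the text-analyzer output above.
TAG_CODE = {"ProperName": "P", "SomeCap": "C", "UnknownCap": "C",
            "NounCap": "C", "Comma": ",", "CompAbbr": "A",
            "Verb": "w", "Prep": "w"}

tokens = [("Frodo", "ProperName"), ("Baggins", "ProperName"),
          ("works", "Verb"), ("for", "Prep"), ("Hobbit", "UnknownCap"),
          ("Factory", "NounCap"), (",", "Comma"), ("Inc", "CompAbbr")]

tags = "".join(TAG_CODE[t] for _, t in tokens)        # "PPwwCC,A"
for m in re.finditer(r"[PC]+,?A", tags):              # 1 char per token
    print(" ".join(w for w, _ in tokens[m.start():m.end()]))
# -> Hobbit Factory , Inc
```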
Finite State Transducers for IE
Frodo Baggins works for Hobbit Factory, Inc.
[Figure: Company Name Detection FSA with states 1–4. Roughly: state 1 loops on word; (CAP | PN) moves 1→2; state 2 loops on (CAP | PN); comma moves 2→3; CAB moves 2→4 and 3→4.]
CAP = SomeCap, CAB = CompAbbr, PN = ProperName, ε = empty string
Company Name Detection FSA
Finite State Transducers for IE
Frodo Baggins works for Hobbit Factory, Inc.
[Figure: the same machine as a Company Name Detection FST: transitions now also emit output (e.g. word:word), and the CAB transitions emit CN, marking the matched span as a company name.]
CAP = SomeCap, CAB = CompAbbr, PN = ProperName, ε = empty string, CN = CompanyName
Company Name Detection FST
Non-deterministic! A capitalized word may be ordinary text or the start of a company name, so the machine must explore both paths.
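A minimal sketch of simulating the non-deterministic machine by carrying a set of active states; the state numbers and transitions below are our reading of the diagram, not an exact reproduction:

```python
# State 1 = background, state 4 = accepting (company name found).
# (1, "CAP") -> {1, 2} encodes the non-determinism: a capitalized word
# may stay background text or start a company name.
TRANS = {
    (1, "word"): {1}, (1, "CAP"): {1, 2}, (1, "PN"): {1, 2},
    (2, "CAP"): {2},  (2, "PN"): {2},
    (2, "comma"): {3}, (2, "CAB"): {4}, (3, "CAB"): {4},
}

def accepts(tags):
    states = {1}
    for tag in tags:
        states = set().union(*(TRANS.get((s, tag), set()) for s in states))
        if not states:
            return False
    return 4 in states

print(accepts(["CAP", "CAP", "comma", "CAB"]))  # True:  "Hobbit Factory , Inc"
print(accepts(["PN", "PN", "word"]))            # False: "Frodo Baggins works"
```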
Finite State Transducers for IE
Several FSTs or a more complex FST can be used to find one type of information (e.g. company names)
FSTs are often compiled from regular expressions
Probabilistic (weighted) FSTs
Finite State Transducers for IE
FSTs mean different things to different researchers in IE: based on lexical items (words), on statistical language models, or on deep syntactic/semantic analysis.
Example: FASTUS
Finite State Automaton Text Understanding System (SRI International)
Cascading FSTs: recognize names; recognize noun groups, verb groups, etc.; construct complex noun/verb groups; identify patterns of interest; identify and merge event structures
Hidden Markov Models formalism
HMM = states s1, s2, … (special start state s1, special end state sn); token alphabet a1, a2, …; state transition probabilities P(si|sj); token emission probabilities P(ai|sj)
Widely used in many language processing tasks, e.g. speech recognition [Lee, 1989], POS tagging [Kupiec, 1992], topic detection [Yamron et al., 1998]
HMM = probabilistic FSA
Applying HMMs to IE
Document generated by a stochastic process modelled by an HMM
Token = word; state = the “reason/explanation” for a given token:
a ‘Background’ state emits tokens like ‘the’, ‘said’, …
a ‘Money’ state emits tokens like ‘million’, ‘euro’, …
an ‘Organization’ state emits tokens like ‘university’, ‘company’, …
Extraction: via the Viterbi algorithm, a dynamic programming technique for efficiently computing the most likely sequence of states that generated a document.
HMM for research papers: emissions [Seymore et al., 99]
Trained on 2 million words of BibTeX data from the Web
[Figure: states such as author, title, institution and note, with example emissions:
title: stochastic optimization… reinforcement learning… model building mobile robot…
institution: carnegie mellon university… university of california… dartmouth college…
note: ICML 1997… submission to… to appear in… supported in part… copyright…]
What is an HMM?
Graphical model representation: variables by time; circles indicate states; arrows indicate probabilistic dependencies between states
Green circles are hidden states, each dependent only on the previous state (Markov process): “The past is independent of the future given the present.”
HMM Formalism
An HMM is a tuple {S, K, Π, A, B}:
S: {s1…sN} are the values for the hidden states
K: {k1…kM} are the values for the observations
Π = {πi} are the initial state probabilities
A = {aij} are the state transition probabilities
B = {bik} are the observation (emission) probabilities
[Figure: the trellis of hidden states S emitting observations K, with A labeling state→state arrows and B labeling state→observation arrows]
Need to provide the structure of the HMM & the vocabulary
Training the model: Baum-Welch algorithm
Efficient dynamic programming algorithms exist for:
finding Pr(K) (see the forward-algorithm sketch below)
finding the highest-probability path S that maximizes Pr(K,S) (Viterbi)
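As an illustration of the first of these, here is a minimal sketch of the forward algorithm for Pr(K) under the {S, K, Π, A, B} formalism above; pi is a list of initial probabilities, A a transition matrix, and B[j] a dict mapping an observation to its emission probability:

```python
def forward(obs, pi, A, B):
    """Pr(obs) under the HMM: alpha[j] = Pr(o_1..o_t, state_t = j)."""
    n = len(pi)
    alpha = [pi[j] * B[j][obs[0]] for j in range(n)]
    for o in obs[1:]:
        alpha = [B[j][o] * sum(alpha[i] * A[i][j] for i in range(n))
                 for j in range(n)]
    return sum(alpha)   # Pr(K): sum over all possible final states

pi = [1.0, 0.0]
A  = [[0.7, 0.3], [0.4, 0.6]]
B  = [{"a": 0.9, "b": 0.1}, {"a": 0.2, "b": 0.8}]
print(forward("aab", pi, A, B))   # probability of observing "a", "a", "b"
```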
[Figure: an example citation HMM with states Title, Author, Journal, Year.
Transition probabilities on the arrows include 0.9, 0.8, 0.5, 0.5, 0.2, 0.1.
Emission probabilities per state include distributions such as {A 0.6, B 0.3, C 0.1}, {X 0.4, B 0.2, Z 0.4}, {Y 0.1, A 0.1, C 0.8}, and for Year {dddd 0.8, dd 0.2}.]
Using the HMM to segment
Find the highest-probability path through the HMM.
Viterbi: a quadratic dynamic programming algorithm.
[Figure: trellis over the address “115 Grant street Mumbai 400070”, with candidate states House, Road, City, Pin at each token position o_t; segmentation picks the best state sequence for the tokens.]
Most Likely Path for a Given Sequence
The probability that the path $\pi_0 \ldots \pi_N$ is taken and the sequence $x_1 \ldots x_L$ is generated:
$\Pr(x_1 \ldots x_L, \pi_0 \ldots \pi_N) = a_{\pi_0 \pi_1} \prod_{i=1}^{L} b_{\pi_i}(x_i)\, a_{\pi_i \pi_{i+1}}$
where the $a$'s are the transition probabilities and the $b$'s are the emission probabilities.
Example
[Figure: a small HMM over the DNA alphabet {A, C, G, T} with begin state 0, end state 5, and hidden states 1–4, each with its own emission table (e.g. A 0.4, C 0.1, G 0.1, T 0.4) and with transition probabilities 0.5, 0.5, 0.2, 0.8, 0.4, 0.6, 0.1, 0.9, 0.2, 0.8 on the arrows.]
$\Pr(\mathrm{AAC}, \pi = 0,1,1,3,5) = a_{01}\, b_1(\mathrm{A})\, a_{11}\, b_1(\mathrm{A})\, a_{13}\, b_3(\mathrm{C})\, a_{35} = 0.5 \times 0.4 \times 0.2 \times 0.4 \times 0.8 \times 0.3 \times 0.6 \approx 0.0023$
Finding the most probable path
Find the state sequence $X$ that best explains the observations $O = o_1 \ldots o_T$:
Viterbi algorithm (1967): $\hat{X} = \arg\max_X P(X \mid O)$
Viterbi Algorithm
$\delta_j(t) = \max_{x_1 \ldots x_{t-1}} P(x_1 \ldots x_{t-1}, o_1 \ldots o_{t-1}, x_t = j, o_t)$
The probability of the state sequence which maximizes the probability of seeing the observations to time t−1, landing in state j, and seeing the observation at time t.
Recursive computation:
$\delta_j(t+1) = \max_i \delta_i(t)\, a_{ij}\, b_{j o_{t+1}}$
$\psi_j(t+1) = \arg\max_i \delta_i(t)\, a_{ij}\, b_{j o_{t+1}}$
Viterbi: Dynamic Programming
[Figure: the same address trellis (“No 115 Grant street Mumbai 400070”, states House, Road, City, Pin), with each cell filled via $\delta_j(t+1) = \max_i \delta_i(t)\, a_{ij}\, b_{j o_{t+1}}$]
Backtracking: compute the most likely state sequence by working backwards:
$\hat{X}_T = \arg\max_i \delta_i(T)$
$\hat{X}_t = \psi_{\hat{X}_{t+1}}(t+1)$
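Putting the recursion and the backtracking together, a minimal sketch of Viterbi decoding (plain probabilities for readability; a production implementation would work in log space to avoid underflow):

```python
def viterbi(obs, pi, A, B):
    """Most likely state path: delta_j(t+1) = max_i delta_i(t)*a_ij*b_j(o)."""
    n = len(pi)
    delta = [pi[j] * B[j][obs[0]] for j in range(n)]
    psi = []                       # psi[t][j]: best predecessor of state j
    for o in obs[1:]:
        prev = [max(range(n), key=lambda i: delta[i] * A[i][j])
                for j in range(n)]
        delta = [delta[prev[j]] * A[prev[j]][j] * B[j][o] for j in range(n)]
        psi.append(prev)
    # Backtrack from the most probable final state.
    path = [max(range(n), key=lambda j: delta[j])]
    for prev in reversed(psi):
        path.append(prev[path[-1]])
    path.reverse()
    return path, max(delta)

pi = [0.5, 0.5]
A  = [[0.7, 0.3], [0.4, 0.6]]
B  = [{"a": 0.9, "b": 0.1}, {"a": 0.2, "b": 0.8}]
print(viterbi("abb", pi, A, B))   # -> ([0, 1, 1], 0.05184)
```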
Hidden Markov Models Summary
Popular technique to detect and classify a linear sequence of information in text
Disadvantage: the need for large amounts of training data
Related work:
System for extracting gene names and locations from scientific abstracts (Leek, 1997)
NERC (Bikel et al., 1997)
McCallum et al. (1999) extracted document segments that occur in a fixed or partially fixed order (title, author, journal)
Ray and Craven (2001): extraction of proteins, locations, genes and disorders, and their relationships
IE with Symbolic Techniques
Conceptual Dependency Theory (Schank, 1972; Schank, 1975): mainly aimed to extract semantic information about individual events from sentences at a conceptual level (i.e., the actor and an action)
Frame Theory (Minsky, 1975): a frame stores the properties or characteristics of an entity, action or event; it typically consists of a number of slots referring to the properties named by the frame
Berkeley FrameNet project (Baker, 1998; Fillmore and Baker, 2001): online lexical resource for English, based on frame semantics and supported by corpus evidence
FASTUS (Finite State Automaton Text Understanding System; Hobbs, 1996): uses a cascade of FSAs in a frame-based information extraction approach
IE with Machine Learning Techniques
Training data: documents marked up with ground truth
In contrast to text classification, local features are crucial. Features of: the contents; the text just before the item; the text just after the item; begin/end boundaries
Good Features for Information Extraction
Example word features:
identity of word; is in all caps; ends in “-ski”; is part of a noun phrase; is in a list of city names; is under node X in WordNet or Cyc; is in bold font; is in hyperlink anchor
Features of past & future: last person name was female; next two words are “and Associates”
More example features:
begins-with-number, begins-with-ordinal, begins-with-punctuation, begins-with-question-word, begins-with-subject, blank, contains-alphanum, contains-bracketed-number, contains-http, contains-non-space, contains-number, contains-pipe, contains-question-mark, contains-question-word, ends-with-question-mark, first-alpha-is-capitalized, indented, indented-1-to-4, indented-5-to-10, more-than-one-third-space, only-punctuation, prev-is-blank, prev-begins-with-ordinal, shorter-than-30
Creativity and Domain Knowledge Required!
Is capitalized; is mixed caps; is all caps; initial cap; contains digit; all lowercase; is initial; punctuation; period; comma; apostrophe; dash; preceded by HTML tag
Character n-gram classifier says string is a person name (80% accurate)
In stopword list (the, of, their, etc.)
In honorific list (Mr, Mrs, Dr, Sen, etc.)
In person suffix list (Jr, Sr, PhD, etc.)
In name particle list (de, la, van, der, etc.)
In Census lastname list, segmented by P(name)
In Census firstname list, segmented by P(name)
In locations lists (states, cities, countries)
In company name list (“J. C. Penny”)
In list of company suffixes (Inc, & Associates, Foundation)
Word features: lists of job titles, lists of prefixes, lists of suffixes, 350 informative phrases
HTML/formatting features: {begin, end, in} × {<b>, <i>, <a>, <hN>} × {lengths 1, 2, 3, 4, or longer}; {begin, end} of line
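A minimal sketch of a word-feature extractor in the spirit of these lists; the tiny lexicons here stand in for the real resources (Census name lists, honorific lists, etc.):

```python
import re

# Toy stand-ins for the lexicon features listed above.
HONORIFICS = {"Mr", "Mrs", "Dr", "Sen"}
PARTICLES  = {"de", "la", "van", "der"}

def word_features(tokens, i):
    w = tokens[i]
    feats = {"word=" + w.lower()}
    if w.isupper():                  feats.add("all-caps")
    elif w[:1].isupper():            feats.add("initial-cap")
    else:                            feats.add("all-lowercase")
    if any(c.isdigit() for c in w):  feats.add("contains-digit")
    if w.lower().endswith("ski"):    feats.add("ends-in-ski")
    if re.fullmatch(r"\W+", w):      feats.add("only-punctuation")
    if w.rstrip(".") in HONORIFICS:  feats.add("in-honorific-list")
    if w.lower() in PARTICLES:       feats.add("in-name-particle-list")
    if i > 0:                        feats.add("prev=" + tokens[i - 1].lower())
    if i + 1 < len(tokens):          feats.add("next=" + tokens[i + 1].lower())
    return feats

print(word_features(["Dr", ".", "Kowalski", "spoke"], 2))
# -> {'word=kowalski', 'initial-cap', 'ends-in-ski', 'prev=.', 'next=spoke'}
#    (set order may vary)
```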
Landscape of ML Techniques for IE:
Any of these models can be used to capture words, formatting or both.
Classify candidates: run a classifier over each candidate phrase in “Abraham Lincoln was born in Kentucky.”, asking “which class?”
Sliding window: run the classifier over a moving window of tokens, trying alternate window sizes
Boundary models: classify positions between tokens as BEGIN or END of a field, then pair them up
Finite state machines: find the most likely state sequence for the whole token sequence
Wrapper induction: learn and apply a pattern for a website, e.g. in “<b><i>Abraham Lincoln</i></b> was born in Kentucky.” the pair <b><i> … </i></b> delimits a PersonName
IE History
Pre-Web: mostly news articles
De Jong’s FRUMP [1982]: hand-built system to fill Schank-style “scripts” from news wire
Message Understanding Conference (MUC), DARPA [’87–’95], TIPSTER [’92–’96]
Most early work dominated by hand-built models, e.g. SRI’s FASTUS, hand-built FSMs
But by the 1990s, some machine learning: Lehnert, Cardie, Grishman, and then HMMs: Elkan [Leek ’97], BBN [Bikel et al. ’98]
Web: AAAI ’94 Spring Symposium on “Software Agents”: much discussion of ML applied to the Web (Maes, Mitchell, Etzioni)
Tom Mitchell’s WebKB, ’96: build KBs from the Web
Wrapper induction: initially hand-built, then ML: [Soderland ’96], [Kushmerick ’97], …
Summary
Information Extraction
Sliding Window
From FST (Finite State Transducer) to HMM
Wrapper Induction: wrapper toolkits, the LR wrapper
Readings
[1] I. Muslea, S. Minton, and C. Knoblock, “A hierarchical approach to wrapper induction,” in Proceedings of the Third Annual Conference on Autonomous Agents, Seattle, Washington, United States: ACM, 1999. (Recommended)