Information Extraction
CS 652 Information Extraction and Integration


Information Extraction (IE) Task
Information Retrieval (IR) and IE
History of IE
Evaluation Metrics
Approaches to IE
Free, Structured, and Semistructured Text
Web Documents
IE Systems
Discussion

IR and IE
IR
  Retrieves relevant documents from collections
  Information theory, probabilistic theory, and statistics
IE
  Extracts relevant information from documents
  Computational linguistics and natural language processing

History of IE

Large amount of both online and offline textual data
Message Understanding Conference (MUC)
  Quantitative evaluation of IE systems
  Tasks
    Latin American terrorism
    Joint ventures
    Microelectronics
    Company management changes

Evaluation Metrics
Precision (PR)

Recall (R)

F-measure
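The formulas for these three metrics did not survive on the slide. A minimal sketch, assuming the usual MUC-style definitions (precision = correct / extracted, recall = correct / expected, F-measure as the beta-weighted harmonic mean of the two); the function name `prf` and the counts in the usage lines are illustrative assumptions, not values from the course:

```python
def prf(correct, extracted, expected, beta=1.0):
    """Return (precision, recall, F-measure) from raw counts.

    correct   -- number of extracted items that are right
    extracted -- total number of items the system extracted
    expected  -- total number of items in the answer key
    beta      -- weight of recall relative to precision (1.0 = balanced)
    """
    precision = correct / extracted if extracted else 0.0
    recall = correct / expected if expected else 0.0
    denom = beta ** 2 * precision + recall
    # F = (beta^2 + 1) * P * R / (beta^2 * P + R); defined as 0 when P = R = 0
    f = (beta ** 2 + 1) * precision * recall / denom if denom else 0.0
    return precision, recall, f


# Hypothetical counts: 8 correct out of 10 extracted, 16 items in the answer key.
p, r, f = prf(correct=8, extracted=10, expected=16)
```

With `beta` above 1 the score leans toward recall, below 1 toward precision.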

Approaches to IE
Knowledge Engineering Approach
  Grammars are constructed by hand
  Domain patterns are discovered by human experts through introspection and inspection of a corpus
  Much laborious tuning and “hill climbing”
Automatic Training Approach
  Use statistical methods when possible
  Learn rules from annotated corpora
  Learn rules from interaction with user

Knowledge Engineering
Advantages
  With skills and experience, good performing systems are not conceptually hard to develop.
  The best performing systems have been hand crafted.
Disadvantages
  Very laborious development process
  Some changes to specifications can be hard to accommodate
  Required expertise may not be available

Automatic Training
Advantages
  Domain portability is relatively straightforward
  System expertise is not required for customization
  “Data driven” rule acquisition ensures full coverage of examples
Disadvantages
  Training data may not exist, and may be very expensive to acquire
  Large volume of training data may be required
  Changes to specifications may require reannotation of large quantities of training data

Texts
Free Text
  Natural language processing
Structured Text
  Textual information in a database or file following a predefined and strict format
Semistructured Text
  Ungrammatical
  Telegraphic

Web Documents

Web Document Categorization

[Hsu, 1998]
Structured
  Itemised information
  Uniform syntactic clues (e.g., delimiters, attribute orders, …)
Semistructured (e.g., missing attributes, multi-value attributes, …)
Unstructured (e.g., linguistic knowledge is required, …)

Free Text

AutoSlog
Liep
Palka
Hasten
Crystal
WebFoot
WHISK

AutoSlog [1993]

The Parliament building was bombed by Carlos.

LIEP [1995]


PALKA [1995]


HASTEN [1995]


Egraphs(SemanticLabel, StructuralElement)

CRYSTAL [1995]
The Parliament building was bombed by Carlos.

CRYSTAL + Webfoot [1997]

WHISK [1999]
The Parliament building was bombed by Carlos.

WHISK Rule:
*(PhyObj)*@passive *F ‘bombed’ * {PP ‘by’ *F (Person)}
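WHISK rules read much like regular expressions: skip some text, capture a phrase of a given semantic class, skip to a trigger, and so on. As a rough illustration only, the rule above can be approximated with an ordinary Python regex. This is a sketch under stated assumptions, not WHISK itself: the toy lexicons `PHY_OBJ` and `PERSON`, the slot names `target` and `perpetrator`, and the `was/were bombed` stand-in for the `@passive` tag are all hypothetical, and a real WHISK rule would rely on a syntactic analyzer and semantic tagger instead.

```python
import re

# Toy semantic lexicons (assumptions for illustration only).
PHY_OBJ = ["Parliament building", "embassy"]
PERSON = ["Carlos"]


def alt(words):
    """Build a regex alternation that matches any of the given phrases."""
    return "|".join(re.escape(w) for w in words)


# Regex analogue of the rule: skip text, capture a PhyObj, skip to a passive
# 'was/were bombed', skip to 'by', then capture a Person.
PATTERN = re.compile(
    rf".*?(?P<target>{alt(PHY_OBJ)})"
    rf".*?\b(?:was|were)\s+bombed\b"
    rf".*?\bby\s+(?P<perpetrator>{alt(PERSON)})"
)


def extract(sentence):
    """Return the filled slots as a dict, or None if the pattern does not match."""
    m = PATTERN.search(sentence)
    return m.groupdict() if m else None


result = extract("The Parliament building was bombed by Carlos.")
no_match = extract("Carlos visited the embassy.")
```

Both slots are filled by a single pattern, which is what makes this a multi-slot rule.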

Context-based patterns

Comparison
Criteria: Extraction granularity, Semantic class constraint, Single-slot rule, Multi-slot rule, Syntactic constraints
Systems: AutoSlog, Liep, Palka, Hasten, Crystal, WHISK

Web Documents
Semistructured and Unstructured
  RAPIER (E. Califf, 1997)
  SRV (D. Freitag, 1998)
  WHISK (S. Soderland, 1998)
Semistructured and Structured
  WIEN (N. Kushmerick, 1997)
  SoftMealy (C-H. Hsu, 1998)
  STALKER (I. Muslea, S. Minton, C. Knoblock, 1998)

Inductive Learning

Task
Inductive Inference
Learning Systems
  Zero-order
  First-order, e.g., Inductive Logic Programming (ILP)

RAPIER [1997]
Inductive Logic Programming
Extraction Rules
  Syntactic information
  Semantic information
Advantage
  Efficient learning (bottom-up)
Drawback
  Single-slot extraction

RAPIER Rule

SRV [1998]
Relational Algorithm (top-down)
Features
  Simple features (e.g., length, character type, …)
  Relational features (e.g., next-token, …)
Advantages
  Expressive rule representation
Drawbacks
  Single-slot rule generation
  Large volume of training data

SRV Rule

WHISK [1998]
Covering Algorithm (top-down)
Advantages
  Learns multi-slot extraction rules
  Handles various orders of items to be extracted
  Handles document types from free text to structured text
Drawbacks
  Must see all the permutations of items
  Less expressive feature set
  Needs a large volume of training data

WHISK Rule

Wrapper Induction

Wrapper: an IE application for one particular information source
Delimiter-based Rules
No linguistic constraints

WIEN [1997]
Assumes
  Items are always in fixed, known order
Introduces several types of wrappers
Advantages
  Fast to learn and extract
Drawbacks
  Cannot handle permutations and missing items
  Must label entire pages
  Does not use semantic classes

WIEN Rule
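The rule shown on the original slide is not recoverable here. As a stand-in, the following sketches a delimiter-based left-right wrapper in the spirit of WIEN: each attribute has one left and one right delimiter, and attributes are assumed to appear in a fixed order with none missing. The function name `lr_extract`, the sample page, and the delimiters are illustrative assumptions, and the sketch covers only extraction, not the induction of delimiters from labeled pages.

```python
def lr_extract(page, delimiters):
    """Extract tuples from `page` using one (left, right) delimiter pair per attribute.

    Attributes must occur in the same fixed order in every tuple.
    """
    tuples, pos = [], 0
    first_left = delimiters[0][0]
    # Keep going while another tuple appears to start after the current position.
    while page.find(first_left, pos) != -1:
        row = []
        for left, right in delimiters:
            start = page.find(left, pos)
            if start == -1:
                return tuples  # incomplete tuple: stop extracting
            start += len(left)
            end = page.find(right, start)
            if end == -1:
                return tuples
            row.append(page[start:end])
            pos = end + len(right)
        tuples.append(tuple(row))
    return tuples


# Hypothetical page listing countries and dialing codes.
PAGE = "<html><b>Congo</b> <i>242</i><br><b>Egypt</b> <i>20</i><br></html>"
rows = lr_extract(PAGE, [("<b>", "</b>"), ("<i>", "</i>")])
```

Because the loop consumes attributes strictly in order, a missing or reordered item derails the rest of the page, which is the drawback listed for WIEN.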

SoftMealy [1998]
Learns a transducer
Advantages
  Learns order of items
  Allows item permutations and missing items
  Allows both the use of semantic classes and disjunctions
Drawbacks
  Must see all possible permutations
  Cannot use delimiters that do not immediately precede and follow the relevant items

SoftMealy Rule

STALKER [1998,1999,2001]

Hierarchical Information Extraction
Embedded Catalog Tree (ECT) Formalism
Advantages
  Extracts nested data
  Allows item permutations and missing items
  Need not see all of the permutations
  One hard-to-extract item does not affect others
Drawbacks
  Does not exploit item order

STALKER Rule
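The rule from the original slide is not recoverable here. STALKER rules are built from landmarks that the extractor skips to in sequence, one independent rule per item. The following is a simplified sketch of that idea, assuming plain string landmarks (no token classes or wildcards), a forward-scanning end landmark, and no ECT nesting; the helper names `skip_to` and `extract_item` and the sample document are hypothetical.

```python
def skip_to(text, landmarks, pos=0):
    """Consume each landmark in turn.

    Returns the position just past the last landmark, or -1 if one is missing.
    """
    for landmark in landmarks:
        idx = text.find(landmark, pos)
        if idx == -1:
            return -1
        pos = idx + len(landmark)
    return pos


def extract_item(text, start_landmarks, end_landmark):
    """Extract the text between the last start landmark and the end landmark."""
    start = skip_to(text, start_landmarks)
    if start == -1:
        return None
    end = text.find(end_landmark, start)
    return text[start:end] if end != -1 else None


# Hypothetical restaurant listing; each item has its own rule.
DOC = "Name: <b>Yala</b><p>Cuisine: Thai<p>Phone: <i>(602) 508-1570</i>"
name = extract_item(DOC, ["Name:", "<b>"], "</b>")
phone = extract_item(DOC, ["Phone:", "<i>"], "</i>")
missing = extract_item(DOC, ["Address:", "<i>"], "</i>")
```

Each item is located on its own, so a missing or hard-to-extract item (here the address) does not disturb the others, and the order of items on the page is never used.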

Applications
Product Descriptions (ShopBot)
Restaurant Guides (STALKER)
Seminar Announcements (SRV)
Job Advertisement (RAPIER)
Executive Succession (WHISK)

Commercial Systems

Junglee [1996]
Jango [1997]
MySimon [1998]
…?…

Discussion