Empirical Methods in Information Extraction
Claire Cardie
Appeared in AI Magazine, 18(4):65-79, 1997
Summarized by Seong-Bae Park
Information Extraction
A particular natural language understanding task
Inherently domain-specific
Input: unrestricted text
Output: information in a structured form (see the example below)
Skims a text to find the relevant sections and focuses only on those sections
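To make the task concrete, here is a small, hypothetical example of the input/output contract, loosely modeled on the MUC-style terrorism domain the paper draws on; the template fields are illustrative assumptions, not the exact MUC schema.

```python
# Hypothetical IE input/output example (fields are assumptions,
# loosely modeled on MUC-style terrorism templates).

text = ("A bomb exploded outside the offices of El Comercio in Lima "
        "on Tuesday, injuring two security guards.")

# An IE system skims the text and emits a structured template:
template = {
    "incident_type":   "bombing",
    "location":        "Lima",
    "date":            "Tuesday",
    "physical_target": "offices of El Comercio",
    "human_target":    "two security guards",
}

for slot, filler in template.items():
    print(f"{slot:16s}: {filler}")
```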
Problems in IE
1. The accuracy and robustness of systems can still be greatly improved.
2. Building a system in a new domain is difficult and time-consuming.
Architecture of an IE System

Architecture (1)
Tokenizing and Tagging
Sentence Analysis
  Phrase identification
  Simple grammatical relations
  Find and label the semantic entities relevant to the extraction topic
  Difference from traditional parsers: in IE, a complete, detailed parse tree is not needed
Extraction
  Identify domain-specific relations among the relevant entities
Architecture (2)
Merging
  The main job: coreference resolution (anaphora resolution)
  Optional: resolving implicit subjects
Template Generation
  Determine the number of distinct events
  Map the individually extracted pieces onto each event
  Produce the output templates (pipeline sketched below)
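A minimal sketch, in Python, of how the phases above hand data to one another; all function bodies are hypothetical stubs (the paper prescribes an architecture, not an API).

```python
# Minimal sketch of the generic IE pipeline described above.
# Every function is a hypothetical placeholder; real systems plug in
# a tagger, a partial parser, learned extraction patterns, etc.

def tokenize_and_tag(text):
    # Phase 1: split the text into part-of-speech-tagged tokens (stubbed).
    return [(word, "UNK") for word in text.split()]

def sentence_analysis(tagged_tokens):
    # Phase 2: partial parsing -- identify phrases and simple grammatical
    # relations; no complete, detailed parse tree is built.
    return {"phrases": tagged_tokens, "relations": []}

def extract(parsed, patterns):
    # Phase 3: apply domain-specific extraction patterns to find
    # relations among the relevant entities.
    hits = (pattern(parsed) for pattern in patterns)
    return [h for h in hits if h is not None]

def merge(fragments):
    # Phase 4: coreference resolution merges fragments that describe
    # the same entity or event (identity stub here).
    return fragments

def generate_templates(fragments):
    # Phase 5: decide how many distinct events there are and map the
    # extracted pieces onto output templates (one event assumed here).
    return [{"event": fragments}] if fragments else []

def run_pipeline(text, patterns):
    parsed = sentence_analysis(tokenize_and_tag(text))
    return generate_templates(merge(extract(parsed, patterns)))
```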
Role of Corpus-Based Language Learning Algorithms

The catch: obtaining enough training data
  For general language tasks, annotated corpora like the Penn Treebank exist
  For some IE problems (learning extraction patterns, coreference resolution, template generation), ML techniques are difficult to apply
    No annotated corpora are available
    Semantic and domain-specific language-processing skill is needed
Learning Extraction Patterns

A good pattern is
  General enough to extract the correct information from more than one sentence
  Specific enough not to apply in inappropriate contexts
Learning methods differ in
  The class of patterns learned
  The training corpus required
  The amount and type of human feedback required
  The degree of preprocessing necessary
  The background knowledge required
  The biases inherent in the learning algorithm itself
AutoSlog (1)

Learns extraction patterns in the form of domain-specific "concept node" definitions for the CIRCUS parser
Concept node: a domain-specific semantic case frame that contains at most one slot per frame (see the sketch below)
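As an illustration only, a concept node can be pictured as a small record with a single slot; this sketch is an assumed rendering (trigger word, linguistic pattern, slot, semantic constraint), not CIRCUS's actual data structure.

```python
from dataclasses import dataclass

@dataclass
class ConceptNode:
    # Hypothetical rendering of an AutoSlog-style concept node:
    # a domain-specific case frame with at most one slot.
    concept_type: str        # e.g. "bombing-target"
    trigger: str             # word that activates the frame, e.g. "bombed"
    linguistic_pattern: str  # e.g. "<subject> passive-verb"
    slot_role: str           # syntactic position the filler comes from
    semantic_class: str      # constraint on the filler, e.g. "physical-object"

# e.g. a node that could be learned from "The embassy was bombed ...":
node = ConceptNode("bombing-target", "bombed",
                   "<subject> passive-verb", "subject", "physical-object")
```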
AutoSlog (2)

A one-shot learning algorithm
Training corpus
  A set of texts with noun phrases annotated with the appropriate concept type
  Or the associated answer keys, as in the MUC corpus
Requires
  A partial parser
  A small (approximately 13) set of general linguistic patterns
AutoSlog (3)

To derive a pattern for extracting a target noun phrase (sketched below):
1. Find the sentence from which the NP originated.
2. Present the sentence to the partial parser for processing.
3. Apply the linguistic patterns in order; identify the thematic role based on the syntactic position.
4. When a pattern applies, generate a concept node definition from the matched constituents, their context, the concept type provided in the annotation for the target NP, and the predefined semantic class for the filler.
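A highly simplified sketch of steps 1-4, assuming a partial parse represented as (role, phrase, verb, voice) tuples; the two linguistic patterns shown are merely representative of AutoSlog's set of roughly 13, and all helper names are invented for illustration.

```python
# Simplified sketch of AutoSlog's derivation procedure (steps 1-4).
# The parse representation and helper names are assumptions.

# Two representative linguistic patterns (AutoSlog uses ~13): each maps
# a syntactic configuration to the slot the filler is taken from.
PATTERNS = [
    ("subject", "passive-verb", "<subject> passive-verb"),
    ("dobj",    "active-verb",  "active-verb <dobj>"),
]

def derive_concept_node(target_np, concept_type, semantic_class, parse):
    # parse: list of (syntactic_role, phrase, verb, voice) tuples from
    # the partial parser, run on the sentence containing target_np.
    for role, phrase, verb, voice in parse:           # steps 1-2 assumed done
        if phrase != target_np:
            continue
        for pat_role, pat_voice, pat_string in PATTERNS:   # step 3
            if role == pat_role and voice == pat_voice:
                return {                                   # step 4
                    "concept_type": concept_type,
                    "trigger": verb,
                    "pattern": pat_string,
                    "slot": role,
                    "semantic_class": semantic_class,
                }
    return None  # no linguistic pattern applied

# e.g. the annotated NP "the embassy" in "The embassy was bombed":
parse = [("subject", "the embassy", "bombed", "passive-verb")]
print(derive_concept_node("the embassy", "bombing-target",
                          "physical-object", parse))
```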
Other Systems

PALKA (Kim and Moldovan, 1995)
  Background knowledge: a concept hierarchy, a set of predefined keywords that can be used to trigger each pattern, and a semantic class lexicon
CRYSTAL (Soderland et al., 1995)
  Learns extraction patterns in the form of semantic case frames
Huffman's LIEP system
Coreference Resolution (1)

An example from MUC-6
Major weaknesses of existing IE systems
  They use manually generated heuristics (do these generalize?)
  They assume the input is fully parsed, with grammatical functions and thematic roles available; errors accumulate sentence after sentence
A system must be able to handle the myriad forms of coreference
Coreference Resolution (2)

Empirical methods
  Inductive learning algorithms can be applied
  MLR (Aone and Bennett, 1995): Japanese
  RESOLVE (McCarthy and Lehnert, 1995): English
  Both use C4.5 as the learning algorithm
Datasets
  MLR: automatically generated
  RESOLVE: manually generated, noise-free
Coreference Resolution (3)

MLR feature set: 66 features
(1) lexical features of each phrase
(2) the grammatical role of the phrase
(3) semantic class information
(4) relative positional information
(5) whether each phrase contains a proper name (2 features)
(6) whether one or both phrases refer to the entity formed by a joint venture (3 features)
(7) whether one phrase contains an alias of the other (1 feature)
(8) whether the phrases have the same base noun phrase (1 feature)
(9) whether the phrases originate from the same sentence (1 feature)
(1)-(4): domain independent (a classification sketch follows)
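The learning setup behind both systems can be sketched as pairwise classification: build a feature vector for each pair of phrases and train a decision tree on coreferent/not-coreferent labels. scikit-learn's DecisionTreeClassifier stands in for C4.5 here (an assumption, for runnability), and only a few toy analogues of the feature categories above are used.

```python
# Sketch of coreference resolution as pairwise classification,
# with a decision tree standing in for C4.5 (assumption).
from sklearn.tree import DecisionTreeClassifier

def pair_features(p1, p2):
    # Toy analogues of a few of the feature categories listed above:
    return [
        int(p1["string"].lower() == p2["string"].lower()),     # lexical
        int(p1["is_proper_name"]), int(p2["is_proper_name"]),  # proper names
        int(p1["sentence"] == p2["sentence"]),                 # same sentence
        abs(p1["sentence"] - p2["sentence"]),                  # relative position
    ]

# Hypothetical phrases with made-up attributes:
phrases = [
    {"string": "IBM",         "is_proper_name": True,  "sentence": 0},
    {"string": "the company", "is_proper_name": False, "sentence": 1},
    {"string": "Lotus",       "is_proper_name": True,  "sentence": 1},
]
X = [pair_features(phrases[0], phrases[1]),
     pair_features(phrases[0], phrases[2])]
y = [1, 0]  # gold labels: coreferent / not coreferent

clf = DecisionTreeClassifier().fit(X, y)
print(clf.predict([pair_features(phrases[1], phrases[2])]))
```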
Coreference Resolution (4)

Tests of MLR and RESOLVE (metric definitions sketched below)
  Evaluated using 50-250 texts
RESOLVE
  Recall: 80-85%, Precision: 87-92%
  Default (always negative): about 74%
MLR
  Recall: 67-70%, Precision: 83-88%
Both significantly outperform manually developed IE systems.
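For reference, the recall and precision figures above reduce to simple counts over coreference decisions; a minimal sketch with made-up counts (not the paper's data):

```python
# Recall    = correct positives found / all true positives
# Precision = correct positives found / all positives proposed
tp, fp, fn = 80, 10, 20       # made-up counts, not from the paper
recall = tp / (tp + fn)       # 0.80
precision = tp / (tp + fp)    # ~0.89
print(f"recall={recall:.2f}, precision={precision:.2f}")
```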
Coreference Resolution (5)

Much research remains
  Should be tested on additional types of anaphors
  Can it work without domain-specific information?
  The errors contributed by the preceding phases must be investigated
Few attempts have addressed other discourse-level problems
Future Directions

Research in IE is very new; applying ML algorithms to it is even newer.
A number of exciting directions
  Unsupervised learning to sidestep the lack of annotated corpora
  How can the need for NLP experts be eliminated when moving IE systems to other domains?