Empirical Methods in Information Extraction
Claire Cardie
Appeared in AI Magazine, 18(4):65-79, 1997
Summarized by Seong-Bae Park
Information Extraction
A particular natural language understanding task
Inherently domain-specific
Input: unrestricted text
Output: information in a structured form (see the example below)
Skims a text to find the relevant sections and focuses only on those sections
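To make the task concrete, here is a small, hypothetical example of the input/output contract, loosely modeled on the MUC-style terrorism domain the paper draws on; the template fields are illustrative assumptions, not the exact MUC schema.

```python
# Hypothetical IE input/output example (fields are assumptions,
# loosely modeled on MUC-style terrorism templates).

text = ("A bomb exploded outside the offices of El Comercio in Lima "
        "on Tuesday, injuring two security guards.")

# An IE system skims the text and emits a structured template:
template = {
    "incident_type":   "bombing",
    "location":        "Lima",
    "date":            "Tuesday",
    "physical_target": "offices of El Comercio",
    "human_target":    "two security guards",
}

for slot, filler in template.items():
    print(f"{slot:16s}: {filler}")
```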
Problems in IE
1. The accuracy and robustness of systems can still be greatly improved.
2. Building a system in a new domain is difficult and time-consuming.
Architecture of an IE System

Architecture (1)
Tokenizing and Tagging
Sentence Analysis
  Phrase identification
  Simple grammatical relations
  Find and label the semantic entities relevant to the extraction topic
  Difference from traditional parsers: in IE, a complete, detailed parse tree is not needed
Extraction
  Identify domain-specific relations among the relevant entities
Architecture (2)
Merging
  The main job: coreference resolution (anaphora resolution)
  Optional: resolving implicit subjects
Template Generation
  Determine the number of distinct events
  Map the individually extracted pieces onto each event
  Produce the output templates (pipeline sketched below)
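A minimal sketch, in Python, of how the phases above hand data to one another; all function bodies are hypothetical stubs (the paper prescribes an architecture, not an API).

```python
# Minimal sketch of the generic IE pipeline described above.
# Every function is a hypothetical placeholder; real systems plug in
# a tagger, a partial parser, learned extraction patterns, etc.

def tokenize_and_tag(text):
    # Phase 1: split the text into part-of-speech-tagged tokens (stubbed).
    return [(word, "UNK") for word in text.split()]

def sentence_analysis(tagged_tokens):
    # Phase 2: partial parsing -- identify phrases and simple grammatical
    # relations; no complete, detailed parse tree is built.
    return {"phrases": tagged_tokens, "relations": []}

def extract(parsed, patterns):
    # Phase 3: apply domain-specific extraction patterns to find
    # relations among the relevant entities.
    hits = (pattern(parsed) for pattern in patterns)
    return [h for h in hits if h is not None]

def merge(fragments):
    # Phase 4: coreference resolution merges fragments that describe
    # the same entity or event (identity stub here).
    return fragments

def generate_templates(fragments):
    # Phase 5: decide how many distinct events there are and map the
    # extracted pieces onto output templates (one event assumed here).
    return [{"event": fragments}] if fragments else []

def run_pipeline(text, patterns):
    parsed = sentence_analysis(tokenize_and_tag(text))
    return generate_templates(merge(extract(parsed, patterns)))
```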
Role of Corpus-Based Language Learning Algorithms

The catch: obtaining enough training data
  For general language tasks, annotated corpora like the Penn Treebank exist
  For some IE problems (learning extraction patterns, coreference resolution, template generation), ML techniques are difficult to apply
    No annotated corpora are available
    Semantic and domain-specific language-processing skill is needed
Learning Extraction Patterns

A good pattern is
  General enough to extract the correct information from more than one sentence
  Specific enough not to apply in inappropriate contexts
Learning methods differ in
  The class of patterns learned
  The training corpus required
  The amount and type of human feedback required
  The degree of preprocessing necessary
  The background knowledge required
  The biases inherent in the learning algorithm itself
AutoSlog (1)

Learns extraction patterns in the form of domain-specific "concept node" definitions for the CIRCUS parser
Concept node: a domain-specific semantic case frame that contains at most one slot per frame (see the sketch below)
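As an illustration only, a concept node can be pictured as a small record with a single slot; this sketch is an assumed rendering (trigger word, linguistic pattern, slot, semantic constraint), not CIRCUS's actual data structure.

```python
from dataclasses import dataclass

@dataclass
class ConceptNode:
    # Hypothetical rendering of an AutoSlog-style concept node:
    # a domain-specific case frame with at most one slot.
    concept_type: str        # e.g. "bombing-target"
    trigger: str             # word that activates the frame, e.g. "bombed"
    linguistic_pattern: str  # e.g. "<subject> passive-verb"
    slot_role: str           # syntactic position the filler comes from
    semantic_class: str      # constraint on the filler, e.g. "physical-object"

# e.g. a node that could be learned from "The embassy was bombed ...":
node = ConceptNode("bombing-target", "bombed",
                   "<subject> passive-verb", "subject", "physical-object")
```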
AutoSlog (2)

A one-shot learning algorithm
Training corpus
  A set of texts with noun phrases annotated with the appropriate concept type
  Or the associated answer keys, as in the MUC corpus
Requires
  A partial parser
  A small (approximately 13) set of general linguistic patterns
AutoSlog (3)

To derive a pattern for extracting a target noun phrase (sketched below):
1. Find the sentence from which the NP originated.
2. Present the sentence to the partial parser for processing.
3. Apply the linguistic patterns in order; identify the thematic role based on the syntactic position.
4. When a pattern applies, generate a concept node definition from the matched constituents, their context, the concept type provided in the annotation for the target NP, and the predefined semantic class for the filler.
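A highly simplified sketch of steps 1-4, assuming a partial parse represented as (role, phrase, verb, voice) tuples; the two linguistic patterns shown are merely representative of AutoSlog's set of roughly 13, and all helper names are invented for illustration.

```python
# Simplified sketch of AutoSlog's derivation procedure (steps 1-4).
# The parse representation and helper names are assumptions.

# Two representative linguistic patterns (AutoSlog uses ~13): each maps
# a syntactic configuration to the slot the filler is taken from.
PATTERNS = [
    ("subject", "passive-verb", "<subject> passive-verb"),
    ("dobj",    "active-verb",  "active-verb <dobj>"),
]

def derive_concept_node(target_np, concept_type, semantic_class, parse):
    # parse: list of (syntactic_role, phrase, verb, voice) tuples from
    # the partial parser, run on the sentence containing target_np.
    for role, phrase, verb, voice in parse:           # steps 1-2 assumed done
        if phrase != target_np:
            continue
        for pat_role, pat_voice, pat_string in PATTERNS:   # step 3
            if role == pat_role and voice == pat_voice:
                return {                                   # step 4
                    "concept_type": concept_type,
                    "trigger": verb,
                    "pattern": pat_string,
                    "slot": role,
                    "semantic_class": semantic_class,
                }
    return None  # no linguistic pattern applied

# e.g. the annotated NP "the embassy" in "The embassy was bombed":
parse = [("subject", "the embassy", "bombed", "passive-verb")]
print(derive_concept_node("the embassy", "bombing-target",
                          "physical-object", parse))
```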
Other Systems

PALKA (Kim and Moldovan, 1995)
  Background knowledge: a concept hierarchy, a set of predefined keywords that can be used to trigger each pattern, and a semantic class lexicon
CRYSTAL (Soderland et al., 1995)
  Learns extraction patterns in the form of semantic case frames
Huffman's LIEP system
Coreference Resolution (1)

An example from MUC-6
Major weaknesses of existing IE systems
  They use manually generated heuristics (do these generalize?)
  They assume the input is fully parsed, with grammatical functions and thematic roles available; errors accumulate sentence after sentence
A system must be able to handle the myriad forms of coreference
Coreference Resolution (2)

Empirical methods
  Inductive learning algorithms can be applied
  MLR (Aone and Bennett, 1995): Japanese
  RESOLVE (McCarthy and Lehnert, 1995): English
  Both use C4.5 as the learning algorithm
Datasets
  MLR: automatically generated
  RESOLVE: manually generated, noise-free
Coreference Resolution (3)

MLR feature set: 66 features
(1) lexical features of each phrase
(2) the grammatical role of the phrase
(3) semantic class information
(4) relative positional information
(5) whether each phrase contains a proper name (2 features)
(6) whether one or both phrases refer to the entity formed by a joint venture (3 features)
(7) whether one phrase contains an alias of the other (1 feature)
(8) whether the phrases have the same base noun phrase (1 feature)
(9) whether the phrases originate from the same sentence (1 feature)
(1)-(4): domain independent (a classification sketch follows)
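The learning setup behind both systems can be sketched as pairwise classification: build a feature vector for each pair of phrases and train a decision tree on coreferent/not-coreferent labels. scikit-learn's DecisionTreeClassifier stands in for C4.5 here (an assumption, for runnability), and only a few toy analogues of the feature categories above are used.

```python
# Sketch of coreference resolution as pairwise classification,
# with a decision tree standing in for C4.5 (assumption).
from sklearn.tree import DecisionTreeClassifier

def pair_features(p1, p2):
    # Toy analogues of a few of the feature categories listed above:
    return [
        int(p1["string"].lower() == p2["string"].lower()),     # lexical
        int(p1["is_proper_name"]), int(p2["is_proper_name"]),  # proper names
        int(p1["sentence"] == p2["sentence"]),                 # same sentence
        abs(p1["sentence"] - p2["sentence"]),                  # relative position
    ]

# Hypothetical phrases with made-up attributes:
phrases = [
    {"string": "IBM",         "is_proper_name": True,  "sentence": 0},
    {"string": "the company", "is_proper_name": False, "sentence": 1},
    {"string": "Lotus",       "is_proper_name": True,  "sentence": 1},
]
X = [pair_features(phrases[0], phrases[1]),
     pair_features(phrases[0], phrases[2])]
y = [1, 0]  # gold labels: coreferent / not coreferent

clf = DecisionTreeClassifier().fit(X, y)
print(clf.predict([pair_features(phrases[1], phrases[2])]))
```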
Coreference Resolution (4)

Tests of MLR and RESOLVE (metric definitions sketched below)
  Evaluated using 50-250 texts
RESOLVE
  Recall: 80-85%, Precision: 87-92%
  Default (always negative): about 74%
MLR
  Recall: 67-70%, Precision: 83-88%
Both significantly outperform manually developed IE systems.
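For reference, the recall and precision figures above reduce to simple counts over coreference decisions; a minimal sketch with made-up counts (not the paper's data):

```python
# Recall    = correct positives found / all true positives
# Precision = correct positives found / all positives proposed
tp, fp, fn = 80, 10, 20       # made-up counts, not from the paper
recall = tp / (tp + fn)       # 0.80
precision = tp / (tp + fp)    # ~0.89
print(f"recall={recall:.2f}, precision={precision:.2f}")
```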
Coreference Resolution (5)

Much research remains
  Should be tested on additional types of anaphors
  Can it work without domain-specific information?
  The errors contributed by the preceding phases must be investigated
Few attempts have addressed other discourse-level problems
Future Directions

Research in IE is very new; applying ML algorithms to it is even newer.
A number of exciting directions
  Unsupervised learning to sidestep the lack of annotated corpora
  How can the need for NLP experts be eliminated when moving IE systems to other domains?