
Page 1: Plain Text Information Extraction  (based on Machine Learning )

Plain Text Information Extraction (based on Machine Learning)
Chia-Hui Chang
Department of Computer Science & Information Engineering
National Central University
[email protected]
9/24/2002

Page 2: Plain Text Information Extraction  (based on Machine Learning )

Introduction

- Plain Text Information Extraction
  - The task of locating specific pieces of data from a natural language document
  - To obtain useful structured information from unstructured text
- DARPA's MUC program
  - The extraction rules are based on a syntactic analyzer and a semantic tagger

Page 3: Plain Text Information Extraction  (based on Machine Learning )

Related Work

- On-line documents
  - SRV, AAAI-1998 (D. Freitag)
  - RAPIER, ACL-1997 and AAAI-1999 (M. E. Califf)
  - WHISK, Machine Learning 1999 (S. Soderland)
- Free-text documents
  - PALKA, MUC-5, 1993
  - AutoSlog, AAAI-1993 (E. Riloff)
  - LIEP, IJCAI-1995 (S. Huffman)
  - CRYSTAL, IJCAI-1995 and KDD-1997 (S. Soderland)

Page 4: Plain Text Information Extraction  (based on Machine Learning )

SRV
Information Extraction from HTML: Application of a General Machine Learning Approach

Dayne Freitag

[email protected]

AAAI-98

Page 5: Plain Text Information Extraction  (based on Machine Learning )

Introduction

- SRV
  - A general-purpose relational learner
  - A top-down relational algorithm for IE
  - Reliance on a set of token-oriented features
- Extraction pattern
  - First-order logic extraction patterns with predicates based on attribute-value tests

Page 6: Plain Text Information Extraction  (based on Machine Learning )

Extraction as Text Classification

- Identify the boundaries of field instances
- Treat each fragment as a bag-of-words
- Find the relations from the surrounding context (a minimal sketch follows)
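A minimal sketch of this framing (illustrative Python; helper names are not from the paper): enumerate every short token fragment, each to be judged as a field instance or not.

    # Sketch: extraction recast as classifying candidate fragments.
    def candidate_fragments(tokens, max_len=4):
        """Enumerate every token fragment up to max_len tokens long."""
        for start in range(len(tokens)):
            for length in range(1, max_len + 1):
                if start + length <= len(tokens):
                    yield tokens[start:start + length]

    tokens = "CSCI 442 Machine Learning Dayne Freitag".split()
    for frag in candidate_fragments(tokens, max_len=2):
        bag = {w.lower() for w in frag}  # the fragment as a bag of words
        print(bag)                       # a learned classifier would label each one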

Page 7: Plain Text Information Extraction  (based on Machine Learning )

Relational Learning

- Inductive Logic Programming (ILP)
  - Input: class-labeled instances
  - Output: a classifier for unlabeled instances
- Typical covering algorithm
  - Attribute-value tests are added greedily to a rule
  - The number of positive examples covered is heuristically maximized while the number of negative examples covered is heuristically minimized
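A minimal sketch of such a covering algorithm (illustrative Python, the generic scheme rather than SRV's exact procedure): rules are conjunctions of attribute-value tests, grown greedily and collected until all positives are covered.

    def matches(rule, example):
        """An example matches a rule if it satisfies every test in it."""
        return all(example.get(attr) == val for attr, val in rule)

    def learn_rules(positives, negatives, tests):
        rules, uncovered = [], list(positives)
        while uncovered:
            rule, pos, neg = [], list(uncovered), list(negatives)
            while neg:  # greedily add the test with the best positive/negative balance
                def gain(t):
                    return (sum(matches(rule + [t], e) for e in pos)
                            - sum(matches(rule + [t], e) for e in neg))
                best = max((t for t in tests if t not in rule), key=gain, default=None)
                if best is None or gain(best) <= 0:
                    break
                rule.append(best)
                pos = [e for e in pos if matches(rule, e)]
                neg = [e for e in neg if matches(rule, e)]
            if not rule or not pos:
                break  # no further progress on the remaining examples
            rules.append(rule)
            uncovered = [e for e in uncovered if not matches(rule, e)]
        return rules

    positives = [{"cap": True, "len": 1}, {"cap": True, "len": 2}]
    negatives = [{"cap": False, "len": 1}]
    tests = [("cap", True), ("cap", False), ("len", 1), ("len", 2)]
    print(learn_rules(positives, negatives, tests))  # [[('cap', True)]]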

Page 8: Plain Text Information Extraction  (based on Machine Learning )

Simple Features

Features on individual tokens:
- Length (e.g., single letter or multiple letters)
- Character type (e.g., numeric or alphabetic)
- Orthography (e.g., capitalized)
- Part of speech (e.g., verb)
- Lexical meaning (e.g., geographical_place)

Page 9: Plain Text Information Extraction  (based on Machine Learning )

Individual Predicates

Individual predicates:
- Length(=3): accepts only fragments containing three tokens
- Some(?A [] capitalizedp true): the fragment contains some token that is capitalized
- Every(numericp false): every token in the fragment is non-numeric
- Position(?A fromfirst <2): the token bound to ?A is either first or second in the fragment
- Relpos(?A ?B =1): the token bound to ?A immediately precedes the token bound to ?B

(A sketch of these predicate semantics follows.)
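A minimal sketch of these predicate semantics (illustrative Python; fragments are token lists and features are plain functions):

    capitalizedp = lambda tok: tok[:1].isupper()
    numericp = lambda tok: tok.isdigit()

    def length_is(fragment, n):                 # Length(=n)
        return len(fragment) == n

    def some(fragment, feature, value):         # Some(?A [] feature value)
        return any(feature(tok) == value for tok in fragment)

    def every(fragment, feature, value):        # Every(feature value)
        return all(feature(tok) == value for tok in fragment)

    def position_fromfirst_lt(index, bound):    # Position(?A fromfirst <bound)
        return index < bound

    def relpos(index_a, index_b, dist=1):       # Relpos(?A ?B =1)
        return index_b - index_a == dist

    frag = ["Machine", "Learning", "101"]
    assert length_is(frag, 3)
    assert some(frag, capitalizedp, True)
    assert not every(frag, numericp, False)     # "101" is numeric
    assert position_fromfirst_lt(1, 2) and relpos(0, 1)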

Page 10: Plain Text Information Extraction  (based on Machine Learning )

Relational Features

Relational feature types:
- Adjacency (next_token)
- Linguistic syntax (subject_verb)

Page 11: Plain Text Information Extraction  (based on Machine Learning )

Example

Page 12: Plain Text Information Extraction  (based on Machine Learning )

Search

- Predicates are added greedily, attempting to cover as many positive and as few negative examples as possible.
- At every step in rule construction, all documents in the training set are scanned and every text fragment of appropriate size is counted.
- Every legal predicate is assessed in terms of the number of positive and negative examples it covers.
- A position-predicate is not legal unless a some-predicate is already part of the rule.

Page 13: Plain Text Information Extraction  (based on Machine Learning )

Relational Paths

- Relational features are used only in the path argument to the some-predicate.
- Some(?A [prev_token prev_token] capitalizedp true): the fragment contains some token preceded by a capitalized token two tokens back.
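A minimal sketch of evaluating such a path (illustrative Python, assuming tokens are indexed in document order so each prev_token hop steps one position back):

    def some_with_path(doc_tokens, frag_start, frag_end, hops, feature, value):
        """Follow `hops` prev_token steps from each fragment token,
        then test the feature on the token reached."""
        for i in range(frag_start, frag_end):
            j = i - hops              # [prev_token prev_token] => 2 hops back
            if j >= 0 and feature(doc_tokens[j]) == value:
                return True
        return False

    doc = ["Professor", "of", "biology", "Smith"]
    # Is some fragment token preceded, two tokens back, by a capitalized token?
    print(some_with_path(doc, 2, 4, 2, lambda t: t[:1].isupper(), True))  # True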

Page 14: Plain Text Information Extraction  (based on Machine Learning )

Validation

- Training phase: 2/3 of the training data for learning, 1/3 for validation
- Testing
  - Bayesian m-estimates: all rules matching a given fragment are used to assign a confidence score.
  - Combined confidence: C = 1 - ∏_i (1 - c_i)
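A one-line check of this combination (assuming the noisy-or reading of the reconstructed formula above):

    # Combined confidence over all matching rules: C = 1 - prod(1 - c_i)
    from functools import reduce

    def combined_confidence(confidences):
        return 1 - reduce(lambda acc, c: acc * (1 - c), confidences, 1.0)

    print(combined_confidence([0.6, 0.5]))  # 0.8: independent rules reinforce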

Page 15: Plain Text Information Extraction  (based on Machine Learning )

Adapting SRV for HTML

Page 16: Plain Text Information Extraction  (based on Machine Learning )

Experiments

- Data source: four university computer science departments (Cornell, U. of Texas, U. of Washington, U. of Wisconsin)
- Data set
  - Course: title, number, instructor
  - Project: title, member
  - 105 course pages, 96 project pages
- Two experiments
  - Random: 5-fold cross-validation
  - LOUO (leave one university out): 4-fold experiments

Page 17: Plain Text Information Extraction  (based on Machine Learning )

OPD (one prediction per document)

Coverage: each rule has its own confidence. (Accuracy-coverage plot not recovered.)

Page 18: Plain Text Information Extraction  (based on Machine Learning )

MPD (multiple predictions per document)

(Results figure not recovered.)

Page 19: Plain Text Information Extraction  (based on Machine Learning )

Baseline Strategies

- A rote baseline that simply memorizes field instances
- A random guesser

(OPD and MPD results not recovered.)

Page 20: Plain Text Information Extraction  (based on Machine Learning )

Conclusions

- Increased modularity and flexibility: domain-specific information is separate from the underlying learning algorithm
- Top-down induction: from general to specific
- Accuracy-coverage trade-off: a confidence score is associated with each prediction
- Critique: single-slot extraction rules only

Page 21: Plain Text Information Extraction  (based on Machine Learning )

RAPIER
Relational Learning of Pattern-Match Rules for Information Extraction

M.E. Califf and R.J. Mooney

ACL-97, AAAI-1999

Page 22: Plain Text Information Extraction  (based on Machine Learning )

Rule Representation

- Single-slot extraction patterns
- Syntactic information (part-of-speech tagger)
- Semantic class information (WordNet)

Page 23: Plain Text Information Extraction  (based on Machine Learning )

The Learning Algorithm

- A specific-to-general search
  - The pre-filler pattern contains an item for each word
  - The filler pattern has one item for each word in the filler
  - The post-filler pattern has one item for each word
- Compress the rules for each slot
  - Generate the least general generalization (LGG) of each pair of rules
  - When the LGG of two constraints is a disjunction, create two alternatives: (1) the disjunction, (2) removal of the constraints (see the sketch after the example on the next pages)

Page 24: Plain Text Information Extraction  (based on Machine Learning )

Example

Located in Atlanta, Georgia.
Offices in Kansas City, Missouri.

(Initial rule patterns not recovered.)

Page 25: Plain Text Information Extraction  (based on Machine Learning )

Example:

Assume there is a semantic class for states, but not one for cities.

Located in Atlanta, Georgia.
Offices in Kansas City, Missouri.
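A minimal sketch of the LGG step on this example (illustrative Python; RAPIER's actual pattern language is richer, and SEM_CLASS is a toy stand-in for WordNet):

    # LGG of two single-word constraints from the city/state example.
    SEM_CLASS = {"georgia": "state", "missouri": "state"}  # no class for cities

    def lgg_alternatives(words1, words2):
        """Generalize two word constraints, returning candidate replacements."""
        if words1 == words2:
            return [words1]
        classes = {SEM_CLASS.get(w) for w in words1 | words2}
        if None not in classes and len(classes) == 1:
            return [("class", classes.pop())]      # both covered by one class
        # otherwise the LGG is a disjunction; keep the two alternatives:
        return [words1 | words2,                   # (1) the disjunction itself
                None]                              # (2) drop the constraint

    print(lgg_alternatives({"georgia"}, {"missouri"}))  # [('class', 'state')]
    print(lgg_alternatives({"atlanta"}, {"kansas"}))    # disjunction set, then None

The state items generalize to the semantic class, while the city words, which share no class, yield a disjunction and a constraint-free alternative: the two alternatives named on page 23.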

Page 26: Plain Text Information Extraction  (based on Machine Learning )
Page 27: Plain Text Information Extraction  (based on Machine Learning )

Experimental Evaluation

- 300 computer-related job postings
- 17 slots: employer, location, salary, job requirements, language, and platform

Page 28: Plain Text Information Extraction  (based on Machine Learning )

Experimental Evaluation

- 485 seminar announcements
- 4 slots: speaker, location, start time, end time

Page 29: Plain Text Information Extraction  (based on Machine Learning )

WHISK: Learning Information Extraction Rules for Semi-Structured and Free Text

S. Soderland

University of Washington

Machine Learning, 1999

Page 30: Plain Text Information Extraction  (based on Machine Learning )

Semi-structured Text

Page 31: Plain Text Information Extraction  (based on Machine Learning )

Free Text

(Example figure not recovered: a sentence annotated with person-name and position slots, anchored on verb stems.)

Page 32: Plain Text Information Extraction  (based on Machine Learning )

WHISK Rule Representation

For Semi-structured IE
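The slide's rule example did not survive extraction. For reference, a representative WHISK rule for semi-structured text, in the style of the rental-ad example from Soderland (1999):

    Pattern:: * ( Digit ) 'BR' * '$' ( Number )
    Output:: Rental {Bedrooms $1} {Price $2}

Each '*' skips characters until the next term matches, and the parenthesized terms are bound to the numbered slots of the output template.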

Page 33: Plain Text Information Extraction  (based on Machine Learning )

WHISK Rule Representation
For Free Text IE

(Rule figure not recovered: person-name and position slots anchored on verb stems.)

- Skip only within the same syntactic field

Page 34: Plain Text Information Extraction  (based on Machine Learning )

Example – Tagged by Users

Page 35: Plain Text Information Extraction  (based on Machine Learning )

The WHISK Algorithm

Page 36: Plain Text Information Extraction  (based on Machine Learning )

Creating a Rule from a Seed Instance

- Top-down rule induction
- Start from an empty rule
- Add terms within the extraction boundary (Base_1)
- Add terms just outside the extraction (Base_2)
- Repeat until the seed is covered (a simplified sketch follows)
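A much-simplified sketch of this process (illustrative Python: only the Base_2 step is shown, and the grown rule is a regex whose capture group must reproduce the seed extraction; real WHISK proposes one term at a time and tests each extension on the training set):

    import re

    def covers(rule, text, target):
        m = re.search(rule, text)
        return bool(m) and m.group(1) == target

    def grow_rule(text, start, end):
        target = text[start:end]
        before = text[:start].split()[-1:]   # Base_2 term just before the slot
        after = text[end:].split()[:1]       # Base_2 term just after the slot
        # anchor the capture group on its immediate context
        rule = r"(?:%s)\s*(.*?)\s*(?:%s)" % (
            re.escape(before[0]) if before else "^",
            re.escape(after[0]) if after else "$")
        return rule if covers(rule, text, target) else None

    text = "Capitol Hill - 1 br twnhme. $675."
    s = text.index("675")
    rule = grow_rule(text, s, s + 3)
    print(rule)                       # (?:\$)\s*(.*?)\s*(?:\.)
    print(covers(rule, text, "675"))  # True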

Page 37: Plain Text Information Extraction  (based on Machine Learning )

Example

Page 38: Plain Text Information Extraction  (based on Machine Learning )
Page 39: Plain Text Information Extraction  (based on Machine Learning )
Page 40: Plain Text Information Extraction  (based on Machine Learning )


Page 41: Plain Text Information Extraction  (based on Machine Learning )

AutoSlog: Automatically Constructing a Dictionary for Information Extraction Tasks

Ellen Riloff
Dept. of Computer Science, University of Massachusetts
AAAI-93

Page 42: Plain Text Information Extraction  (based on Machine Learning )

AutoSlog

- Purpose: automatically constructs a domain-specific dictionary for IE
- Extraction patterns (concept nodes)
  - Conceptual anchor: a trigger word
  - Enabling conditions: constraints

Page 43: Plain Text Information Extraction  (based on Machine Learning )

Concept Node Example

Physical target slot of a bombing template

Page 44: Plain Text Information Extraction  (based on Machine Learning )

Construction of Concept Nodes

1. Given a targeted piece of information,
2. AutoSlog finds the first sentence in the text that contains the string.
3. The sentence is handed to CIRCUS, which generates a conceptual analysis of the sentence.
4. The first clause in the sentence is used.
5. A set of heuristics is applied to suggest a good conceptual anchor point for a concept node.
6. If none of the heuristics is satisfied, AutoSlog searches for the next sentence and returns to step 3.

(A toy sketch of this loop follows.)
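A toy sketch of steps 2-6 for a single sentence (illustrative Python: the CIRCUS parser is replaced by a trivial voice check, and only one anchor-point heuristic, "<subject> passive-verb", is shown):

    def analyze(sentence):
        """Stand-in for CIRCUS: (subject, verb, voice) of the first clause."""
        words = sentence.rstrip(".").split()
        if "was" in words:                    # crude passive-voice detection
            i = words.index("was")
            return " ".join(words[:i]), words[i + 1], "passive"
        return words[0], words[1], "active"

    def propose_concept_node(sentence, target_string):
        subject, verb, voice = analyze(sentence)
        # heuristic "<subject> passive-verb", e.g. "<target> was bombed"
        if voice == "passive" and target_string in subject:
            return {"trigger": verb, "pattern": "<subject> passive-verb"}
        return None   # other heuristics omitted; AutoSlog would try the next sentence

    print(propose_concept_node("The embassy was bombed by terrorists.", "embassy"))
    # {'trigger': 'bombed', 'pattern': '<subject> passive-verb'}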

Page 45: Plain Text Information Extraction  (based on Machine Learning )

Conceptual Anchor Point Heuristics

(Heuristics table not recovered; Riloff (1993) lists linguistic patterns such as "<subject> passive-verb" and "active-verb <dobj>".)

Page 46: Plain Text Information Extraction  (based on Machine Learning )

Background Knowledge

- Concept node construction
  - Slot: the slot of the answer key; hard and soft constraints
  - Type: uses template types such as bombing, kidnapping
  - Enabling condition: heuristic pattern
- Domain specification
  - The type of a template
  - The constraints for each template slot

Page 47: Plain Text Information Extraction  (based on Machine Learning )

Another good concept node definition: perpetrator slot from a perpetrator template

Page 48: Plain Text Information Extraction  (based on Machine Learning )

A bad concept node definition: victim slot from a kidnapping template

Page 49: Plain Text Information Extraction  (based on Machine Learning )

Empirical Results

- Input
  - An annotated corpus of texts in which the targeted information is marked and annotated with semantic tags denoting the type of information (e.g., victim) and the type of event (e.g., kidnapping)
  - 1500 texts with 1258 answer keys containing 4780 string fillers
- Output
  - 1237 concept node definitions
  - Human intervention: 5 user-hours to sift through all generated concept nodes; 450 definitions are kept
- Performance: (results table not recovered)

Page 50: Plain Text Information Extraction  (based on Machine Learning )

Conclusion

- In 5 person-hours, AutoSlog creates a dictionary that achieves 98% of the performance of a hand-crafted dictionary
- Each concept node is a single-slot extraction pattern
- Reasons for bad definitions
  - A sentence contains the targeted string but does not describe the event
  - A heuristic proposes the wrong conceptual anchor point
  - CIRCUS incorrectly analyzes the sentence

Page 51: Plain Text Information Extraction  (based on Machine Learning )

CRYSTAL: Inducing a Conceptual Dictionary

S. Soderland, D. Fisher, J. Aseltine, W. Lehnert

University of Massachusetts

IJCAI’95

Page 52: Plain Text Information Extraction  (based on Machine Learning )

Concept Nodes (CN)

- CN-type
- Subtype
- Extracted syntactic constituents
- Linguistic patterns
- Constraints on syntactic constituents

Page 53: Plain Text Information Extraction  (based on Machine Learning )

The CRYSTAL Induction Tool

- Creating initial CN definitions: one for each instance
- Inducing generalized CN definitions: relaxing constraints for highly similar definitions
  - Word constraints: intersecting strings of words
  - Class constraints: moving up the semantic hierarchy

Page 54: Plain Text Information Extraction  (based on Machine Learning )
Page 55: Plain Text Information Extraction  (based on Machine Learning )

Inducing Generalized CN Definitions

1. Start from a CN definition, D.
2. Assume we have found a second definition D' which is similar to D; then:
   a) Create a new definition U, a generalization of D and D'.
   b) Delete from the dictionary all definitions covered by U, e.g., D and D'.
   c) Test whether U extracts only marked information:
      - If yes, set D = U and go to step 2.
      - If no, start from another definition as D.

(A minimal sketch of one generalization step follows.)
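A minimal sketch of one generalization step (illustrative Python: a definition here is reduced to a pair of constraints on the extracted buffer, a word set and a semantic class; PARENT is a toy stand-in for the semantic hierarchy):

    PARENT = {"nurse": "person", "patient": "person", "person": "entity"}

    def lca(c1, c2):
        """Hoist two class constraints to their lowest common ancestor."""
        chain = set()
        while c1 is not None:
            chain.add(c1)
            c1 = PARENT.get(c1)
        while c2 is not None and c2 not in chain:
            c2 = PARENT.get(c2)
        return c2

    def unify(d1, d2):
        (words1, class1), (words2, class2) = d1, d2
        words = (words1 & words2) or None     # intersect word strings, else drop
        return (words, lca(class1, class2))   # move classes up the hierarchy

    D  = (frozenset({"the", "nurse"}), "nurse")
    D2 = (frozenset({"the", "patient"}), "patient")
    print(unify(D, D2))  # (frozenset({'the'}), 'person'); keep this only if it
                         # extracts nothing but marked text in the training corpus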

Page 56: Plain Text Information Extraction  (based on Machine Learning )
Page 57: Plain Text Information Extraction  (based on Machine Learning )

Implementation Issues

- Finding similar definitions: indexing CN definitions by verbs and by extraction buffers
- Similarity metric: intersecting classes or intersecting strings of words
- Testing the error rate of a generalized definition: a database of instances segmented by the sentence analyzer is constructed

Page 58: Plain Text Information Extraction  (based on Machine Learning )

Experimental Results

- 385 annotated hospital discharge reports; 14,719 training instances
- The choice of the error-tolerance parameter is used to manipulate a tradeoff between precision and recall
- Output: CN definitions
  - 194 definitions with coverage ≥ 10
  - 527 definitions with 2 < coverage < 10

Page 59: Plain Text Information Extraction  (based on Machine Learning )

Comparison

- Bottom-up (from specific to general): CRYSTAL [Soderland, 1996], RAPIER [Califf & Mooney, 1997]
- Top-down (from general to specific): SRV [Freitag, 1998], WHISK [Soderland, 1999]

Page 60: Plain Text Information Extraction  (based on Machine Learning )

References

- I. Muslea, "Extraction Patterns for Information Extraction Tasks: A Survey," AAAI-99 Workshop on Machine Learning for Information Extraction, 1999.
- E. Riloff, "Automatically Constructing a Dictionary for Information Extraction Tasks," AAAI-93, pp. 811-816, 1993.
- S. Soderland, D. Fisher, J. Aseltine, and W. Lehnert, "CRYSTAL: Inducing a Conceptual Dictionary," IJCAI-95, 1995.
- D. Freitag, "Information Extraction from HTML: Application of a General Machine Learning Approach," AAAI-98, 1998.
- M. E. Califf and R. J. Mooney, "Relational Learning of Pattern-Match Rules for Information Extraction," AAAI-99, Orlando, FL, pp. 328-334, July 1999.
- S. Soderland, "Learning Information Extraction Rules for Semi-Structured and Free Text," Machine Learning, 1999.