Source: Feiyu Xu, Jakub Piskorski 2002 Language Technology Information Extraction Feiyu Xu Jakub Piskorski DFKI LT-Lab


Language Technology

Information Extraction

Feiyu Xu

Jakub Piskorski

DFKI LT-Lab


Overview

Introduction to information extraction

SPPC

Survey of information extraction approaches

MUC conferences

What is Information Extraction?

The creation of a structured representation (TEMPLATE) of selected information drawn from natural language text

Task is more limited than full text understanding: extracting relevant information, ignoring irrelevant information


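Example (Management Succession)

1. Dr. Hermann Wirth, bisheriger Leiter der Musikhochschule München, verabschiedete sich heute aus dem Amt. (Dr. Hermann Wirth, until now head of the Musikhochschule München, left office today.)
2. Der 65jährige tritt seinen wohlverdienten Ruhestand an. (The 65-year-old is entering his well-deserved retirement.)
3. Als seine Nachfolgerin wurde Sabine Klinger benannt. (Sabine Klinger was named as his successor.)
4. Ebenfalls neu besetzt wurde die Stelle des Musikdirektors. (The post of music director was also newly filled.)
5. Annelie Häfner folgt Christian Meindl nach. (Annelie Häfner succeeds Christian Meindl.)

Template 1:
  PersonOut: Dr. Hermann Wirth
  PersonIn: Sabine Klinger
  Position: Leiter
  Organization: Musikhochschule München
  TimeOut: Heute (today)
  TimeIn:

Template 2:
  PersonOut: Christian Meindl
  PersonIn: Annelie Häfner
  Position: Musikdirektor
  Organization: ?
  TimeOut:
  TimeIn: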
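The two filled templates of this example can be written down as a small data structure. The following is only a sketch: the class name `SuccessionTemplate`, the lower-case slot names and the helper `unfilled_slots` are assumptions made for illustration, not part of any system described in the slides.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class SuccessionTemplate:
    """One management-succession event; slots that were not filled stay None."""
    person_out: Optional[str] = None
    person_in: Optional[str] = None
    position: Optional[str] = None
    organization: Optional[str] = None
    time_out: Optional[str] = None
    time_in: Optional[str] = None

    def unfilled_slots(self):
        """Return the names of the slots that are still empty."""
        return [name for name, value in vars(self).items() if value is None]


# Slot values copied from the example (sentences 1-3 and 4-5).
first = SuccessionTemplate(
    person_out="Dr. Hermann Wirth",
    person_in="Sabine Klinger",
    position="Leiter",
    organization="Musikhochschule München",
    time_out="heute",
)
second = SuccessionTemplate(
    person_out="Christian Meindl",
    person_in="Annelie Häfner",
    position="Musikdirektor",
)

print(first.unfilled_slots())
print(second.unfilled_slots())
```

Listing the empty slots makes it visible that the second template still lacks its organization, which the text only implies.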

Template filling rule

NLP:
  VG: verabschiedet sich aus dem Amt
  Subject: [1] NP

Domain:
  IN:  PERSON: [2]  ORGANISATION: [3]  POSITION: [4]
  OUT: PERSON: [1]  ORGANISATION: [3]  POSITION: [4]

Template filling rule

NLP:
  VG: verabschiedete sich heute aus dem Amt
  Subject: [1]

Domain:
  IN:  PERSON: [2]  ORGANISATION: [3]  POSITION: [4]
  OUT: PERSON: [1]  ORGANISATION: [3]  POSITION: [4]

PERSON_NAME: [5]
POSITION_NAME: [4]
ORGANISATION_NAME: [3]
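How such a rule might fire can be sketched in a few lines. Everything here is an assumption for illustration: the input format (a dict with lemmatised verb group and an analysed subject), the function name `apply_leave_office_rule` and the trigger tuple are hypothetical, and adverbs such as "heute" are assumed to have been dropped from the verb group beforehand. The sketch only mirrors the co-indexing on the slide: the subject fills the OUT person, while ORGANISATION and POSITION are shared between IN and OUT.

```python
# Hypothetical trigger: lemmas of the verb group "verabschiedete sich aus dem Amt".
TRIGGER_LEMMAS = ("verabschieden", "sich", "aus", "dem", "Amt")


def apply_leave_office_rule(clause):
    """Fill a succession template from one analysed clause.

    `clause` is an assumed dict: {"vg_lemmas": [...], "subject": {...}}.
    Returns None when the verb group does not match the trigger.
    """
    if tuple(clause.get("vg_lemmas", ())) != TRIGGER_LEMMAS:
        return None
    subject = clause.get("subject", {})
    organisation = subject.get("ORGANISATION_NAME")
    position = subject.get("POSITION_NAME")
    return {
        "OUT": {
            "PERSON": subject.get("PERSON_NAME"),
            "ORGANISATION": organisation,
            "POSITION": position,
        },
        # ORGANISATION and POSITION are co-indexed with OUT;
        # the incoming person has to come from a later sentence.
        "IN": {"PERSON": None, "ORGANISATION": organisation, "POSITION": position},
    }


clause = {
    "vg_lemmas": ["verabschieden", "sich", "aus", "dem", "Amt"],
    "subject": {
        "PERSON_NAME": "Dr. Hermann Wirth",
        "POSITION_NAME": "Leiter",
        "ORGANISATION_NAME": "Musikhochschule München",
    },
}
template = apply_leave_office_rule(clause)
print(template["OUT"])
print(template["IN"])
```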


Traditional IE Architecture

[Figure: architecture diagram]

Main stages: Tokenization, Morphological and Lexical Processing, Parsing, Discourse Analysis

Further modules shown: Text Sectionizing and Filtering, Part of Speech Tagging, Word Sense Tagging, Fragment Processing, Fragment Combination, Scenario Pattern Matching, Coreference Resolution, Inference, Template Merging

Diagram label: Local Text Analysis

SPPC - Shallow Processing Production Center: Introduction

• System for shallow analysis of German free-text documents (a toolset of integrated shallow components)
• based on the shallow text processor of SMES (G. Neumann)
• extensive use of finite-state technology:
  - DFKI Finite-State Machine Toolkit
  - dynamic tries
• high linguistic coverage
• very good performance

Architecture

[Figure: SPPC architecture diagram]

Input: ASCII Documents
Components: Tokenizer, Lexical Processor, POS-Filtering, Named Entity Finder, Phrase Recognizer, Sentence Boundary Detection, XML-Output Interface
Output: XML Documents
Shared resources: Linguistic Knowledge Pool, Text Chart

EXAMPLE: rund 60 bis 70 Prozent der Steigerungsrate (about 60 to 70 percent increase)

- Tokenizer: rund: lowercase; 60: two-digit-integer
- Lexical Processor: Steigerungsrate: steigerung+[s]+rate; bis: prep|adv
- POS-Filtering: bis: adv
- Named Entity Finder: rund 60 bis 70 Prozent: percentage-NP
- Phrase Recognizer: rund 60 bis 70 Prozent: NP; der Steigerungsrate: NP


Tokenizer

• The goal of the TOKENIZER is to:
  - map sequences of consecutive characters into word-like units (tokens)
  - identify the type of each token, disregarding the context
  - perform word segmentation when necessary (e.g., splitting contractions into multiple tokens)

• more than 50 token classes overall (shown to simplify processing at later stages), for example:
  - NUMBER_WORD_COMPOUND („69er“)
  - ABBREVIATION (CANDIDATE_FOR_ABBREVIATION)
  - COMPLEX_COMPOUND_FIRST_CAPITAL („AT&T-Chief“)
  - COMPLEX_COMPOUND_FIRST_LOWER_DASH („d‘Italia-Chefs-“)

• represented as a single WFSA (406 KB)
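Context-free token classification can be illustrated with a toy tokenizer. This is a sketch under clear assumptions: SPPC compiles its classes into a single WFSA, whereas the code below stands in with a handful of regular expressions. The class names `NUMBER_WORD_COMPOUND`, `two-digit-integer` and `lowercase` come from the slides; `integer`, `first-capital` and `other` are hypothetical additions.

```python
import re

# Ordered from most to least specific; the first full match wins.
TOKEN_CLASSES = [
    ("NUMBER_WORD_COMPOUND", re.compile(r"\d+[a-zäöüß]+")),
    ("two-digit-integer", re.compile(r"\d{2}")),
    ("integer", re.compile(r"\d+")),                        # hypothetical class
    ("lowercase", re.compile(r"[a-zäöüß]+")),
    ("first-capital", re.compile(r"[A-ZÄÖÜ][a-zäöüß]+")),   # hypothetical class
]


def classify(token):
    """Return the class of a token, looking only at the token itself."""
    for name, pattern in TOKEN_CLASSES:
        if pattern.fullmatch(token):
            return name
    return "other"


def tokenize(text):
    """Split text into word-like units and punctuation, then classify each."""
    return [(tok, classify(tok)) for tok in re.findall(r"\w+|[^\w\s]", text)]


for token, token_class in tokenize("rund 60 bis 70 Prozent"):
    print(token, token_class)
print(classify("69er"))
```

Ordering the patterns from specific to general plays the role that longest or most specific match plays in an automaton.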


Lexical Processor

• Tasks of the LEXICAL PROCESSOR:
  - retrieval of lexical information
  - recognition of compounds („Autoradiozubehör“: car-radio equipment)
  - hyphen coordination („Leder-, Glas-, Holz- und Kunststoffbranche“: leather, glass, wood and synthetic materials industry)

• the lexicon currently contains more than 700 000 German full-form words (stored in tries)

• each reading is represented as a triple <STEM, INFLECTION, POS>

example: „wagen“ (to dare vs. a car)

Noun reading:
  STEM: „wag“
  INFL: (GENDER: m, CASE: nom, NUMBER: sg)
        (GENDER: m, CASE: akk, NUMBER: sg)
        (GENDER: m, CASE: dat, NUMBER: sg)
        (GENDER: m, CASE: nom, NUMBER: pl)
        (GENDER: m, CASE: akk, NUMBER: pl)
        (GENDER: m, CASE: dat, NUMBER: pl)
        (GENDER: m, CASE: gen, NUMBER: pl)
  POS: noun

Verb reading:
  STEM: „wag“
  INFL: (FORM: infin)
        (TENSE: pres, PERSON: anrede, NUMBER: sg)
        (TENSE: pres, PERSON: anrede, NUMBER: pl)
        (TENSE: pres, PERSON: 1, NUMBER: pl)
        (TENSE: pres, PERSON: 3, NUMBER: pl)
        (TENSE: subjunct-1, PERSON: anrede, NUMBER: sg)
        (TENSE: subjunct-1, PERSON: anrede, NUMBER: pl)
        (TENSE: subjunct-1, PERSON: 1, NUMBER: pl)
        (TENSE: subjunct-1, PERSON: 3, NUMBER: pl)
        (FORM: imp, PERSON: anrede)
  POS: verb
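A full-form lexicon stored in a trie, returning several readings for an ambiguous word form, might look roughly like this. The sketch is an assumption throughout: the class `TrieLexicon`, the nested-dict trie and the abbreviated inflection lists are illustrative and say nothing about how the DFKI toolkit implements its tries.

```python
class TrieLexicon:
    """Toy full-form lexicon: a character trie whose nodes may carry readings."""

    _READINGS = "$readings"  # multi-character key, cannot clash with a single letter

    def __init__(self):
        self.root = {}

    def add(self, word, stem, inflection, pos):
        """Store one <STEM, INFLECTION, POS> reading for a full-form word."""
        node = self.root
        for ch in word:
            node = node.setdefault(ch, {})
        node.setdefault(self._READINGS, []).append((stem, inflection, pos))

    def lookup(self, word):
        """Return all readings of a word form, or an empty list if unknown."""
        node = self.root
        for ch in word:
            node = node.get(ch)
            if node is None:
                return []
        return node.get(self._READINGS, [])


lexicon = TrieLexicon()
# Inflection lists shortened to one entry each for the example.
lexicon.add("wagen", "wag", [{"GENDER": "m", "CASE": "nom", "NUMBER": "sg"}], "noun")
lexicon.add("wagen", "wag", [{"FORM": "infin"}], "verb")

for stem, inflection, pos in lexicon.lookup("wagen"):
    print(stem, pos, inflection)
print(lexicon.lookup("wage"))
```

A prefix such as "wage" exists as a path in the trie but carries no readings, so the lookup returns an empty list.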


Part-of-Speech Filtering

• The task of the POS FILTER is to filter out implausible readings of ambiguous word forms

• a large number of German word forms are ambiguous (20% in the test corpus)

• contextual filtering rules (ca. 100)
  - example: „Sie bekannten, die bekannten Bilder gestohlen zu haben“ (They confessed to having stolen the famous pictures)
    „bekannten“: to confess vs. famous
    FILTERING RULE: if the previous word form is a determiner and the next word form is a noun, then filter out the verb reading of the current word form

• supplementary rules determined by Brill's tagger in order to achieve broader coverage

• rules are represented as FSTs; additional hard-coded rules filter out rare readings
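The filtering rule from the example can be sketched directly. This is a simplified illustration, not the FST encoding used in SPPC: the function name, the tag labels and the representation of a sentence as (word, set of readings) pairs are assumptions, and "die" is treated as an unambiguous determiner for brevity.

```python
def filter_verb_between_det_and_noun(sentence):
    """Drop the verb reading of a word form that stands between a determiner and a noun.

    `sentence` is a list of (word, set_of_pos_readings) pairs; a new list is returned.
    """
    result = []
    for i, (word, readings) in enumerate(sentence):
        prev_is_det = i > 0 and sentence[i - 1][1] == {"det"}
        next_is_noun = i + 1 < len(sentence) and sentence[i + 1][1] == {"noun"}
        if prev_is_det and next_is_noun and len(readings) > 1:
            readings = readings - {"verb"}
        result.append((word, readings))
    return result


sentence = [
    ("Sie", {"pron"}),
    ("bekannten", {"verb", "adj"}),
    (",", {"punct"}),
    ("die", {"det"}),          # simplification: treated as unambiguous
    ("bekannten", {"verb", "adj"}),
    ("Bilder", {"noun"}),
    ("gestohlen", {"verb"}),
    ("zu", {"part"}),
    ("haben", {"verb"}),
]
for word, readings in filter_verb_between_det_and_noun(sentence):
    print(word, sorted(readings))
```

Only the second „bekannten“ loses its verb reading; the first one keeps both readings because it does not follow a determiner.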


Named Entity Recognition

• The task of the NAMED ENTITY FINDER is the identification of:
  - entities: organizations, persons, locations
  - temporal expressions: time, date
  - quantities: monetary values, percentages, numbers

• identification of named entities in two steps:
  - recognition patterns expressed as WFSAs are used to identify phrases containing potential candidates for named entities (longest-match strategy)
  - additional constraints (depending on the type of candidate) are used for validating the candidates

• on-line base lexicon for geographical names and first names (persons)


Named Entity Recognition

• since named entities (companies, persons) may appear without designators, a dynamic lexicon is used for storing such named entities

• candidates for named entities:

example:

Da flüchten sich die einen ins Ausland, wie etwa der Münchner Strickwarenhersteller März GmbH oder der badische Strumpffabrikant Arlington Socks, GmbH. Ab kommendem Jahr strickt März knapp drei Viertel seiner Produktion in Ungarn.

(Some flee abroad, such as the Munich knitwear manufacturer März GmbH or the Baden hosiery maker Arlington Socks, GmbH. From next year, März will knit almost three quarters of its production in Hungary.)

• partial reference resolution (acronyms)

• resolution of type ambiguity using the dynamic lexicon:

example:

if an expression can be a person name or a company name (Martin Marietta Corp.), then the type of the last entry inserted into the dynamic lexicon is used to make the decision
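The role of the dynamic lexicon can be illustrated with a small sketch. All names here (`DESIGNATORS`, `DynamicLexicon`, `scan`) and the token-level matching are assumptions for illustration; SPPC itself uses WFSA patterns and type-specific constraints. The sketch shows two things from the slide: a name first seen with a designator ("März GmbH") is later recognized on its own ("März"), and a type-ambiguous expression takes the type of the most recently inserted entry.

```python
DESIGNATORS = {"GmbH": "ORG", "AG": "ORG", "Corp.": "ORG"}  # illustrative list


class DynamicLexicon:
    """Names recognized so far in the current text, with their entity types."""

    def __init__(self):
        self.entries = {}
        self.last_type = None

    def add(self, name, ne_type):
        self.entries[name] = ne_type
        self.last_type = ne_type

    def resolve(self, name, candidate_types):
        """Pick a type for an ambiguous expression."""
        if name in self.entries:
            return self.entries[name]
        if self.last_type in candidate_types:
            return self.last_type
        return None


def scan(tokens, lexicon):
    """Find names followed by a designator, and later bare mentions of them."""
    found = []
    for i, tok in enumerate(tokens):
        nxt = tokens[i + 1] if i + 1 < len(tokens) else None
        if nxt in DESIGNATORS and tok[:1].isupper():
            lexicon.add(tok, DESIGNATORS[nxt])
            found.append((tok + " " + nxt, DESIGNATORS[nxt]))
        elif tok in lexicon.entries:
            found.append((tok, lexicon.entries[tok]))
    return found


lexicon = DynamicLexicon()
tokens = ("der Münchner Strickwarenhersteller März GmbH . "
          "Ab kommendem Jahr strickt März knapp drei Viertel").split()
print(scan(tokens, lexicon))
print(lexicon.resolve("Martin Marietta", {"PERSON", "ORG"}))
```

Without the stored entry, the bare "März" would be indistinguishable from the month name.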


Phrase Recognition

• The task of the FRAGMENT RECOGNIZER is to recognize nominal and prepositional phrases and verb groups

example: [NP Das Unternehmen [ORG Merck]] [V geht] [DATE im Herbst 1999] [PP mit einem Viertel seines Kapitals] [PP an die [ORG Frankfurter Börse]]
„In autumn 1999 the company Merck will go public on the Frankfurt Stock Exchange with one fourth of its capital“

• extraction patterns expressed as WFSAs (mainly based on POS information and named entity type)

• phrasal grammars only consider continuous substrings (recognition of verb groups is therefore partial)

example: „Gestern [ist] der Linguist vom neuen Manager [entlassen worden].“
Yesterday, the linguist [has] [been fired] by the new manager.

Sentence Boundary Detection

• a few simple contextual rules are sufficient, since named entity recognition is performed earlier


Evaluation & Performance

Evaluation:

                        Recall    Precision
COMPOUND ANALYSIS:      98.53%    99.29%
POS-FILTERING:          74.50%    96.36%
NE-RECOGNITION:         85%       95.77%
FRAGMENTS (NPs, PPs):   76.11%    91.94%

Performance:

INPUT: corpus of the German business magazine „Wirtschaftswoche“ (1.2 MB)

TIME: ~15 sec. (~13400 words/sec) on a Pentium III, 850 MHz, 256 MB RAM

RECOGNIZED: 197118 tokens, 11575 named entities, 64839 phrases

Availability: C++ library for UNIX and Windows 98/NT; demo with GUI for Windows 98/NT


Two Approaches to Building IE Systems [Appelt & Israel, 99]

Knowledge Engineering Approach
• Grammars are constructed by hand
• Domain patterns are discovered by a human expert through introspection and inspection of a corpus
• Much laborious tuning

Automatically Trainable Systems
• Use statistical methods when possible
• Learn rules from annotated corpora, e.g., a statistical name recognizer
• Learn rules from interaction with the user, e.g., learning template filler rules


Knowledge Engineering [Appelt & Israel, 99]

Advantages
• With skill and experience, well-performing systems are not conceptually hard to develop
• The best performing systems have been hand-crafted

Disadvantages
• Very laborious development process
• Some changes to specifications can be hard to accommodate, e.g., name recognition rules based on upper and lower case
• Required experts may not be available


Trainable Systems [Appelt & Israel, 99]

Advantages
• Domain portability is relatively straightforward
• System expertise is not required for customization
• “Data-driven” rule acquisition ensures full coverage of the examples

Disadvantages
• Training data may not exist, and may be very expensive to acquire
• A large volume of training data may be required
• Changes to specifications may require reannotation of large quantities of training data


When Does Knowledge-Based IE Work Best? [Appelt & Israel, 99]

• Resources (e.g. lexicons, lists) are available

• Rule writers are available

• Training data is scarce or expensive to obtain

A mixture of knowledge-based and machine-learning-based approaches is also possible!


“pseudo-syntax” [Hobbs et al. 96]

The material between the end of the subject noun group and the beginning of the main verb group must be read over, for example:

• Read over prepositional phrases and relative clauses
  1. Subject {Preposition NounGroup}* VerbGroup
  2. Subject Relpro {NounGroup | Other}* VerbGroup
  Example: The mayor, who was kidnapped yesterday, was found dead today.

• Conjoined verb phrases: skip over the first conjunct and associate the subject with the verb group in the second conjunct
  Subject VerbGroup {NounGroup | Other}* Conjunction VerbGroup
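The first pattern, Subject {Preposition NounGroup}* VerbGroup, can be sketched as a scan over a sequence of chunk tags. The tag names (`NG`, `P`, `VG`), the function `attach_subject` and the example sentence are assumptions made for illustration; the code does not reproduce any FASTUS implementation.

```python
def attach_subject(chunks):
    """Match: Subject {Preposition NounGroup}* VerbGroup.

    `chunks` is a list of (tag, text) pairs with tags "NG", "P", "VG" or "Other".
    Returns (subject_text, verb_group_text) or None if the pattern does not match.
    """
    if not chunks or chunks[0][0] != "NG":
        return None
    i = 1
    # Read over any number of prepositional phrases after the subject.
    while i + 1 < len(chunks) and chunks[i][0] == "P" and chunks[i + 1][0] == "NG":
        i += 2
    if i < len(chunks) and chunks[i][0] == "VG":
        return chunks[0][1], chunks[i][1]
    return None


# Hypothetical example sentence, already split into chunks.
chunks = [
    ("NG", "The manager"),
    ("P", "of"),
    ("NG", "the company"),
    ("P", "in"),
    ("NG", "Munich"),
    ("VG", "resigned"),
]
print(attach_subject(chunks))
```

The intervening prepositional phrases are skipped without being analysed, which is what makes the approach "pseudo" syntax.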


Problem of “pseudo-syntax” [Hobbs et al. 96]

The same semantic content can be realized in different forms:
- GM manufactures cars.
- Cars are manufactured by GM.
- ... GM, which manufactures cars ...
- ... cars, which are manufactured by GM ...
- ... cars manufactured by GM ...
- GM is to manufacture cars.
- Cars are to be manufactured by GM.
- GM is a car manufacturer.

Question: How many rules are needed to extract all relevant patterns? Why not use a linguistic theory?

Performance vs. competence:
- TACITUS: 36 hours to process 100 messages
- FASTUS: 12 minutes to process 100 messages

Message Understanding Conferences

[MUC-7 98]

• U.S. Government-sponsored conferences intended to coordinate multiple research groups seeking to improve IE and IR technologies (since 1987)

• defined several generic types of information extraction tasks (MUC competition)

• MUC 1-2 focused on automated analysis of military messages containing textual information

• MUC 3-7 focused on information extraction from newswire articles:
  - terrorist events
  - international joint ventures
  - management succession events

Evaluation of IE systems in MUC

Participants receive description of the scenario along with the annotated training corpus in order to adapt their systems to the new scenario (1 to 6 months)

Participants receive new set of documents (test corpus) and use their systems to extract information from these documents and return the results to the conference organizer

The results are compared to the manually filled set of templates (answer key)

Evaluation of IE systems in MUC

precision and recall measures were adopted from the information retrieval research community

Sometimes an F-measure is used as a combined recall-precision score

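\[
\text{recall} = \frac{N_{\text{correct}}}{N_{\text{key}}}
\qquad
\text{precision} = \frac{N_{\text{correct}}}{N_{\text{correct}} + N_{\text{incorrect}}}
\]

\[
F = \frac{(\beta^{2} + 1) \cdot \text{precision} \cdot \text{recall}}{\beta^{2} \cdot \text{precision} + \text{recall}}
\]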
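The recall, precision and F-measure definitions on this slide can be turned into a few lines of code. This is a minimal sketch: the function names and the example counts (10 key slots, 6 correct and 2 incorrect fills) are made up for illustration and are not MUC results.

```python
def recall(n_correct, n_key):
    """Fraction of the answer-key fills that the system got right."""
    return n_correct / n_key


def precision(n_correct, n_incorrect):
    """Fraction of the system's fills that are right."""
    return n_correct / (n_correct + n_incorrect)


def f_measure(p, r, beta=1.0):
    """Combined score; beta weights recall against precision (beta = 1: equal weight)."""
    if p == 0 and r == 0:
        return 0.0
    return (beta ** 2 + 1) * p * r / (beta ** 2 * p + r)


# Hypothetical counts, for illustration only.
r = recall(n_correct=6, n_key=10)
p = precision(n_correct=6, n_incorrect=2)
print(round(r, 2), round(p, 2), round(f_measure(p, r), 4))
```

With beta = 1 the formula reduces to the harmonic mean of precision and recall.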

Generic IE tasks for MUC-7

(NE) Named Entity Recognition Task requires the identification and classification of named entities
• organizations
• locations
• persons
• dates, times, percentages and monetary expressions

(TE) Template Element Task requires the filling of small-scale templates for specified classes of entities in the texts
• Attributes of entities are slot fills (identifying the entities beyond the name level)
• Example: persons with slots such as name (plus name variants), title, nationality, description as supplied in the text, and subtype

“Captain Denis Gillespie, the commander of Carrier Air Wing 11”

Generic IE tasks for MUC-7

(TR) Template Relation Task requires filling a two-slot template representing a binary relation, with pointers to the template elements standing in the relation, which were previously identified in the TE task
• e.g., subsidiary relationship between two companies (employee_of, product_of, location_of)

EMPLOYEE_OF
  PERSON: (pointer to the PERSON element)
  ORGANIZATION: (pointer to the ORGANIZATION element)

PERSON
  NAME: Feiyu Xu
  DESCRIPTOR: researcher

ORGANIZATION
  NAME: DFKI
  DESCRIPTOR: research institute
  CATEGORY: GmbH

Generic IE tasks for MUC-7

(CO) Coreference Resolution requires the identification of expressions in the text that refer to the same object, set or activity
• variant forms of name expressions
• definite noun phrases and their antecedents
• pronouns and their antecedents

“The U.K. satellite television broadcaster said its subscriber base grew 17.5 percent during the past year to 5.35 million”

• bridge between NE task and TE task

Generic IE tasks for MUC-7

(ST) Scenario Template requires filling a template structure with extracted information involving several relations or events of interest
• intended to be the MUC approximation to a real-world information extraction problem
• e.g., identification of partners, products, profits and capitalization of joint ventures

JOINT_VENTURE
  NAME: Siemens GEC Communication Systems Ltd
  PARTNER-1: (pointer to ORGANIZATION ...)
  PARTNER-2: (pointer to ORGANIZATION ...)
  PRODUCT/SERVICE: (pointer to PRODUCT_OF)
  CAPITALIZATION: unknown
  TIME: February 18 1997

PRODUCT_OF
  PRODUCT: (pointer to PRODUCT ...)
  ORGANIZATION: (pointer to ORGANIZATION ...)

Tasks evaluated in MUC 3-7

[Chinchor, 98]

EVAL\TASK   NE    CO    TE    TR    ST
MUC-3                               YES
MUC-4                               YES
MUC-5                               YES
MUC-6       YES   YES   YES         YES
MUC-7       YES   YES   YES   YES   YES

Maximum Results Reported in MUC-7

MEASURE\TASK   NE    CO    TE    TR    ST
RECALL         92    56    86    67    42
PRECISION      95    69    87    86    65