Source: Feiyu Xu, Jakub Piskorski 2002
Language Technology
Information Extraction
Feiyu Xu
Jakub Piskorski
DFKI LT-Lab
Overview
Introduction to information extraction
SPPC
Survey of information extraction approaches
MUC conferences
What is Information Extraction?
The creation of a structured representation (TEMPLATE) of selected information drawn from natural language text
Task is more limited than full text understanding: extracting relevant information, ignoring irrelevant information
Example (Management Succession)
1. Dr. Hermann Wirth, bisheriger Leiter der Musikhochschule München, verabschiedete sich heute aus dem Amt. 2. Der 65jährige tritt seinen wohlverdienten Ruhestand an. 3. Als seine Nachfolgerin wurde Sabine Klinger benannt. 4. Ebenfalls neu besetzt wurde die Stelle des Musikdirektors. 5. Annelie Häfner folgt Christian Meindl nach.

(1. Dr. Hermann Wirth, until now head of the Musikhochschule München, left office today. 2. The 65-year-old is entering his well-earned retirement. 3. Sabine Klinger was named as his successor. 4. The position of music director was also newly filled. 5. Annelie Häfner succeeds Christian Meindl.)
Template 1:
  PersonOut:    Dr. Hermann Wirth
  PersonIn:     Sabine Klinger
  Position:     Leiter
  Organization: Musikhochschule München
  TimeOut:      heute
  TimeIn:

Template 2:
  PersonOut:    Christian Meindl
  PersonIn:     Annelie Häfner
  Position:     Musikdirektor
  Organization: ?
  TimeOut:
  TimeIn:
Template filling rule
NLP: VG: „verabschiedet sich aus dem Amt" (resigns from office), Subject: [1] NP
Domain:
  IN:  PERSON: [1]  ORGANISATION: [3]  POSITION: [4]
  OUT: PERSON: [2]  ORGANISATION: [3]  POSITION: [4]
Template filling rule
NLP: VG: „verabschiedete sich heute aus dem Amt", Subject: [1]
Domain:
  IN:  PERSON: [1]  ORGANISATION: [3]  POSITION: [4]
  OUT: PERSON: [2]  ORGANISATION: [3]  POSITION: [4]
PERSON_NAME: [5]  POSITION_NAME: [4]  ORGANISATION_NAME: [3]
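A template-filling rule of this kind can be sketched in code. The following is a minimal, illustrative Python version; the rule format, the names `RULE` and `fill_template`, and the regex trigger are assumptions for exposition, not SPPC's actual rule formalism.

```python
import re

# Hypothetical template-filling rule: a verb-group trigger plus a mapping
# from syntactic roles to template slots (illustrative, not from SPPC).
RULE = {
    "trigger": r"verabschiedete sich( heute)? aus dem Amt",  # "resigned from office"
    "slots": {"subject": "PersonOut"},
}

def fill_template(sentence, subject):
    """Return a partially filled template if the trigger verb group matches."""
    if not re.search(RULE["trigger"], sentence):
        return None
    template = {"PersonOut": None, "PersonIn": None,
                "Position": None, "Organization": None}
    template[RULE["slots"]["subject"]] = subject
    return template

t = fill_template(
    "Dr. Hermann Wirth, bisheriger Leiter der Musikhochschule München, "
    "verabschiedete sich heute aus dem Amt.",
    subject="Dr. Hermann Wirth",
)
```

Further rules (and coreference between sentences) would then fill the remaining slots and merge partial templates.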
Traditional IE Architecture
Text Sectionizing and Filtering
Tokenization
Morphological and Lexical Processing (Part-of-Speech Tagging, Word Sense Tagging)
Parsing (Fragment Processing, Fragment Combination)
Local Text Analysis (Scenario Pattern Matching)
Discourse Analysis (Coreference Resolution, Inference)
Template Merging
Introduction
• System for shallow analysis of German free text documents (toolset of integrated shallow components)
• based on shallow text processor of SMES (G. Neumann)
• exhaustive use of finite-state technology:
- DFKI Finite-State Machine Toolkit
- dynamic tries
• high linguistic coverage
• very good performance
SPPC - SHALLOW PROCESSING PRODUCTION CENTER
Architecture
ASCII Documents → Tokenizer → Lexical Processor → POS-Filtering → Named Entity Finder → Phrase Recognizer → Sentence Boundary Detection → XML-Output Interface → XML Documents

All components read from a shared Linguistic Knowledge Pool and record their results in a common Text Chart.

EXAMPLE: „rund 60 bis 70 Prozent der Steigerungsrate" (about 60 to 70 percent of the growth rate)
- token level: rund: lowercase; 60: two-digit-integer
- lexical level: Steigerungsrate: steigerung+[s]+rate; bis: prep|adv
- after POS-filtering: bis: adv
- phrase level: rund 60 bis 70 Prozent: percentage-NP; der Steigerungsrate: NP
Tokenizer
• The goal of the TOKENIZER is to:
- map sequences of consecutive characters into word-like units (tokens)
- identify the type of each token, disregarding the context
- perform word segmentation where necessary (e.g., splitting contractions into multiple tokens)
• more than 50 token classes overall (proved to simplify processing at higher stages):
- NUMBER_WORD_COMPOUND („69er")
- ABBREVIATION (CANDIDATE_FOR_ABBREVIATION)
- COMPLEX_COMPOUND_FIRST_CAPITAL („AT&T-Chief")
- COMPLEX_COMPOUND_FIRST_LOWER_DASH („d'Italia-Chefs-")
• represented as a single WFSA (406 KB)
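A minimal sketch of such a class-assigning tokenizer, using an ordered list of regular expressions in place of the WFSA; the class inventory here is a tiny illustrative subset of the 50+ classes.

```python
import re

# Ordered token classes: the first matching pattern wins (illustrative subset).
TOKEN_CLASSES = [
    ("NUMBER_WORD_COMPOUND", r"\d+er"),          # e.g. "69er"
    ("TWO_DIGIT_INTEGER",    r"\d{2}"),
    ("INTEGER",              r"\d+"),
    ("FIRST_CAPITAL",        r"[A-ZÄÖÜ][a-zäöüß]+"),
    ("LOWERCASE",            r"[a-zäöüß]+"),
]

def tokenize(text):
    """Split on whitespace and assign each token its first matching class."""
    tokens = []
    for tok in text.split():
        for cls, pattern in TOKEN_CLASSES:
            if re.fullmatch(pattern, tok):
                tokens.append((tok, cls))
                break
        else:
            tokens.append((tok, "OTHER"))
    return tokens
```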
Lexical Processor
• Tasks of the LEXICAL PROCESSOR:
- retrieval of lexical information
- recognition of compounds („Autoradiozubehör" - car-radio equipment)
- hyphen coordination („Leder-, Glas-, Holz- und Kunstoffbranche" - leather, glass, wooden and synthetic materials industry)
• the lexicon currently contains more than 700,000 German full-form words (stored in tries)
• each reading is represented as a triple <STEM, INFLECTION, POS>

example: „wagen" (the noun "car" vs. the verb "to dare")

Noun reading:
STEM: „wag", POS: noun
INFL: (GENDER: m, CASE: nom, NUMBER: sg) (GENDER: m, CASE: akk, NUMBER: sg) (GENDER: m, CASE: dat, NUMBER: sg) (GENDER: m, CASE: nom, NUMBER: pl) (GENDER: m, CASE: akk, NUMBER: pl) (GENDER: m, CASE: dat, NUMBER: pl) (GENDER: m, CASE: gen, NUMBER: pl)

Verb reading:
STEM: „wag", POS: verb
INFL: (FORM: infin) (TENSE: pres, PERSON: anrede, NUMBER: sg) (TENSE: pres, PERSON: anrede, NUMBER: pl) (TENSE: pres, PERSON: 1, NUMBER: pl) (TENSE: pres, PERSON: 3, NUMBER: pl) (TENSE: subjunct-1, PERSON: anrede, NUMBER: sg) (TENSE: subjunct-1, PERSON: anrede, NUMBER: pl) (TENSE: subjunct-1, PERSON: 1, NUMBER: pl) (TENSE: subjunct-1, PERSON: 3, NUMBER: pl) (FORM: imp, PERSON: anrede)
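The trie-based full-form storage can be sketched as follows; the `Trie` class and the two sample readings of „wagen" are illustrative, not the actual SPPC data structure.

```python
# Sketch of a full-form lexicon stored in a character trie: each word's
# final node carries its list of <STEM, INFLECTION, POS> readings.

class Trie:
    def __init__(self):
        self.children = {}
        self.readings = []   # list of (stem, inflection, pos) triples

    def insert(self, word, reading):
        node = self
        for ch in word:
            node = node.children.setdefault(ch, Trie())
        node.readings.append(reading)

    def lookup(self, word):
        node = self
        for ch in word:
            node = node.children.get(ch)
            if node is None:
                return []
        return node.readings

lexicon = Trie()
# "wagen" is ambiguous: noun reading ("car") and verb reading ("to dare")
lexicon.insert("wagen", ("wag", {"CASE": "nom", "NUMBER": "sg"}, "noun"))
lexicon.insert("wagen", ("wag", {"FORM": "infin"}, "verb"))
```

Lookup is linear in the word length regardless of lexicon size, which is what makes a 700,000-entry full-form lexicon practical.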
Part-of-Speech Filtering
• The task of the POS FILTER is to filter out implausible readings of ambiguous word forms
• a large share of German word forms are ambiguous (20% in the test corpus)
• contextual filtering rules (ca. 100)
- example: „Sie bekannten, die bekannten Bilder gestohlen zu haben" (They confessed to having stolen the famous pictures)
  „bekannten" - "confessed" (verb) vs. "famous" (adjective)
  FILTERING RULE: if the previous word form is a determiner and the next word form is a noun, then filter out the verb reading of the current word form
• supplementary rules determined with Brill's tagger in order to achieve broader coverage
• rules represented as FSTs; hard-coded rules filter out rare readings
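The filtering rule above, sketched over a toy tag representation; the tag names and the `filter_readings` helper are assumptions for exposition.

```python
# Contextual POS-filtering rule: between a determiner and a noun, drop the
# verb reading of an ambiguous word form (illustrative tag inventory).

def filter_readings(tagged):
    """tagged: list of (word, set_of_pos_readings); returns a filtered copy."""
    result = []
    for i, (word, readings) in enumerate(tagged):
        prev_det = i > 0 and tagged[i - 1][1] == {"det"}
        next_noun = i + 1 < len(tagged) and "noun" in tagged[i + 1][1]
        if prev_det and next_noun and len(readings) > 1:
            readings = readings - {"verb"}
        result.append((word, readings))
    return result

sentence = [
    ("die", {"det"}),
    ("bekannten", {"verb", "adj"}),   # ambiguous: "confessed" vs. "famous"
    ("Bilder", {"noun"}),
]
```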
Named Entity Recognition
• The task of the NAMED ENTITY FINDER is the identification of:
- entities: organizations, persons, locations
- temporal expressions: time, date
- quantities: monetary values, percentages, numbers
• identification of named entities in two steps:
- recognition patterns expressed as WFSA are used to identify phrases containing potential candidates for named entities (longest match strategy)
- additional constraints (depending on the type of candidate) are used for validating the candidates
• on-line base lexicon for geographical names, first names (persons)
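The two-step recognition can be sketched for company names; the candidate pattern and designator list below are illustrative stand-ins for SPPC's WFSA recognition patterns and validation constraints.

```python
import re

# Step 1: a recognition pattern proposes candidate phrases (longest match);
# step 2: a type-specific constraint validates each candidate.
COMPANY_DESIGNATORS = {"GmbH", "AG", "Corp.", "Ltd"}
CANDIDATE = re.compile(r"((?:[A-Z][\wäöüß&.-]*\s+)+(?:GmbH|AG|Corp\.|Ltd))")

def find_companies(text):
    entities = []
    for match in CANDIDATE.finditer(text):
        phrase = match.group(1)
        if phrase.split()[-1] in COMPANY_DESIGNATORS:   # validation step
            entities.append(phrase)
    return entities
```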
Named Entity Recognition
• since named entities may appear without designators (companies, persons) a dynamic lexicon for storing such named entities is used
• candidates for named entities:
example:
Da flüchten sich die einen ins Ausland, wie etwa der Münchner Strickwarenhersteller März GmbH oder der badische Strumpffabrikant Arlington Socks, GmbH. Ab kommendem Jahr strickt März knapp drei Viertel seiner Produktion in Ungarn.
(Some flee abroad, like the Munich knitwear manufacturer März GmbH or the Baden hosiery maker Arlington Socks, GmbH. From next year on, März will knit almost three quarters of its production in Hungary.)
• partial reference resolution (acronyms)
• resolution of type ambiguity using the dynamic lexicon:
example:
if an expression can be either a person name or a company name (e.g., Martin Marietta Corp.), then use the type of the last matching entry inserted into the dynamic lexicon to make the decision
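A sketch of such a dynamic lexicon; the class name and the matching policy (most recent entry whose stored name contains the mention) are illustrative assumptions.

```python
# Dynamic NE lexicon: names seen with a designator are stored with their
# type, so later designator-less or type-ambiguous mentions can be resolved
# from the most recently inserted matching entry.

class DynamicLexicon:
    def __init__(self):
        self.entries = []   # (name, type), in insertion order

    def add(self, name, ne_type):
        self.entries.append((name, ne_type))

    def resolve(self, name):
        """Return the type of the last entry matching `name`, if any."""
        for full_name, ne_type in reversed(self.entries):
            if name == full_name or name in full_name.split():
                return ne_type
        return None

lex = DynamicLexicon()
lex.add("März GmbH", "company")   # first mention, with designator
```

A later bare mention of "März" is then resolved against this entry.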
Phrase Recognition
• The task of the FRAGMENT RECOGNIZER is to recognize nominal phrases, prepositional phrases and verb groups

example: [NP Das Unternehmen] [ORG Merck] [V geht] [DATE im Herbst 1999] [PP mit einem Viertel seines Kapitals] [PP an die [ORG Frankfurter Börse]]
("In autumn 1999 the company Merck will go public on the Frankfurt Stock Exchange with one fourth of its capital")

• extraction patterns expressed as WFSAs (mainly based on POS information and named entity type)
• phrasal grammars only consider continuous substrings (recognition of verb groups is therefore partial)

example: „Gestern [ist] der Linguist vom neuen Manager [entlassen worden]." (Yesterday, the linguist [has] [been fired] by the new manager.)
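One way to sketch POS-based phrase patterns is to encode each token's POS tag as a single character and run a regular expression over the resulting string, as a stand-in for the WFSA extraction patterns; codes and the NP pattern here are illustrative.

```python
import re

# Encode POS tags as one character each, then match phrase patterns as
# regular expressions over the code string (illustrative inventory).
POS_CODE = {"det": "d", "adj": "a", "noun": "n", "prep": "p", "verb": "v"}
NP_PATTERN = re.compile(r"da*n|a*n")   # e.g. "das neue Unternehmen"

def find_nps(tagged):
    codes = "".join(POS_CODE[pos] for _, pos in tagged)
    return [tagged[m.start():m.end()] for m in NP_PATTERN.finditer(codes)]
```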
Sentence Boundary Detection
• a few simple contextual rules suffice, since named entity recognition has already been performed at this stage
Evaluation & Performance
                        Recall    Precision
COMPOUND ANALYSIS:      98.53%    99.29%
POS-FILTERING:          74.50%    96.36%
NE-RECOGNITION:         85.00%    95.77%
FRAGMENTS (NPs, PPs):   76.11%    91.94%

Evaluation:
INPUT: corpus of the German business magazine „Wirtschaftswoche" (1.2 MB)
RECOGNIZED: 197,118 tokens, 11,575 named entities, 64,839 phrases

Performance:
TIME: ~15 sec. (~13,400 words/sec) on a Pentium III, 850 MHz, 256 MB RAM
Availability: C++ Library for UNIX, Windows 98/NT, Demo with GUI for Windows 98/NT
Two Approaches to Building IE Systems[Appelt & Israel, 99]
Knowledge Engineering Approach
- grammars are constructed by hand
- domain patterns are discovered by a human expert through introspection and inspection of a corpus
- much laborious tuning

Automatically Trainable Systems
- use statistical methods where possible
- learn rules from annotated corpora, e.g., a statistical name recognizer
- learn rules from interaction with the user, e.g., learning template filler rules
Knowledge Engineering[Appelt & Israel, 99]
Advantages
- with skill and experience, well-performing systems are not conceptually hard to develop
- the best-performing systems have been hand-crafted

Disadvantages
- very laborious development process
- some changes to the specifications can be hard to accommodate, e.g., name recognition rules based on upper and lower case
- the required experts may not be available
Trainable Systems[Appelt & Israel, 99]
Advantages
- domain portability is relatively straightforward
- system expertise is not required for customization
- "data-driven" rule acquisition ensures full coverage of the examples

Disadvantages
- training data may not exist and may be very expensive to acquire
- a large volume of training data may be required
- changes to the specifications may require reannotation of large quantities of training data
When Does Knowledge-Based IE Work Best? [Appelt & Israel, 99]
Resources (e.g. lexicons, lists) are available
Rule writers are available
Training data scarce or expensive to obtain
A mixture of knowledge based and machine learning based approach is also possible!
“pseudo-syntax” [Hobbs et al. 96]
The material between the end of the subject noun group and the beginning of the main verb group must be read over, for example prepositional phrases and relative clauses:

1. Subject {Preposition NounGroup}* VerbGroup
2. Subject Relpro {NounGroup | Other}* VerbGroup

"The mayor, who was kidnapped yesterday, was found dead today."

For conjoined verb phrases, skip over the first conjunct and associate the subject with the verb group in the second conjunct:

Subject VerbGroup {NounGroup | Other}* Conjunction VerbGroup
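Pattern 1 above can be sketched over chunk labels rather than words; the labels `NG`, `P`, `VG` and the `subject_verb` helper are illustrative assumptions.

```python
import re

# Pseudo-syntax pattern 1: Subject {Preposition NounGroup}* VerbGroup,
# matched over a sequence of chunk labels (illustrative label inventory).

def subject_verb(chunks):
    """chunks: list of (label, text); return (subject, verb) or None."""
    labels = [label for label, _ in chunks]
    pattern = re.compile(r"^NG(?:\sP\sNG)*\sVG")   # NG {P NG}* VG
    if not pattern.match(" ".join(labels)):
        return None
    return chunks[0][1], chunks[-1][1]

chunks = [("NG", "The mayor"), ("P", "of"), ("NG", "the town"),
          ("VG", "was found")]
```

The intervening prepositional material ("of the town") is read over, and the subject is associated with the main verb group.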
Problem of “pseudo-syntax” [Hobbs et al. 96]
The same semantic content can be realized in different forms:
GM manufactures cars.
Cars are manufactured by GM.
... GM, which manufactures cars ...
... cars, which are manufactured by GM ...
... cars manufactured by GM ...
GM is to manufacture cars.
Cars are to be manufactured by GM.
GM is a car manufacturer.

Question: How many rules are needed to extract all relevant patterns? Why not use a linguistic theory?

Performance vs. Competence:
TACITUS: 36 hours to process 100 messages
FASTUS: 12 minutes to process 100 messages
Message Understanding Conferences
[MUC-7 98]
U.S. Government sponsored conferences with the intention to coordinate multiple research groups seeking to improve IE and IR technologies (since 1987)

defined several generic types of information extraction tasks (MUC competition)

MUC 1-2 focused on automated analysis of military messages containing textual information

MUC 3-7 focused on information extraction from newswire articles:
- terrorist events
- international joint ventures
- management succession events
Evaluation of IE systems in MUC
Participants receive a description of the scenario along with an annotated training corpus in order to adapt their systems to the new scenario (1 to 6 months)

Participants then receive a new set of documents (the test corpus), use their systems to extract information from these documents, and return the results to the conference organizer
The results are compared to the manually filled set of templates (answer key)
Evaluation of IE systems in MUC
precision and recall measures were adopted from the information retrieval research community

sometimes an F-measure is used as a combined recall-precision score
recall = N_correct / N_key

precision = N_correct / (N_correct + N_incorrect)

F = ((β² + 1) · precision · recall) / (β² · precision + recall)
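The three measures can be written directly as functions (a transcription of the formulas above; `beta = 1` gives the balanced F-measure, `beta > 1` weights recall higher).

```python
# MUC scoring measures: recall, precision, and the combined F-measure.

def recall(n_correct, n_key):
    # fraction of the answer-key fills that the system got right
    return n_correct / n_key

def precision(n_correct, n_incorrect):
    # fraction of the system's fills that are right
    return n_correct / (n_correct + n_incorrect)

def f_measure(p, r, beta=1.0):
    # F = ((beta^2 + 1) * P * R) / (beta^2 * P + R)
    return (beta ** 2 + 1) * p * r / (beta ** 2 * p + r)
```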
Generic IE tasks for MUC-7
(NE) Named Entity Recognition Task: requires the identification and classification of named entities
- organizations, locations, persons
- dates, times, percentages and monetary expressions

(TE) Template Element Task: requires the filling of small-scale templates for specified classes of entities in the texts
- attributes of entities are slot fills (identifying the entities beyond the name level)
- example: persons, with slots such as name (plus name variants), title, nationality, description as supplied in the text, and subtype

"Captain Denis Gillespie, the commander of Carrier Air Wing 11"
Generic IE tasks for MUC-7
(TR) Template Relation Task: requires filling a two-slot template representing a binary relation, with pointers to template elements standing in the relation, which were previously identified in the TE task
- examples: employee_of, product_of, location_of; the subsidiary relationship between two companies
EMPLOYEE_OF:
  PERSON:       <pointer to PERSON template>
  ORGANIZATION: <pointer to ORGANIZATION template>

PERSON:
  NAME:       Feiyu Xu
  DESCRIPTOR: researcher

ORGANIZATION:
  NAME:       DFKI GmbH
  DESCRIPTOR: research institute
  CATEGORY:
(CO) Coreference Resolution requires the identification of expressions in the text that refer to the same object, set or activity
variant forms of name expressions definite noun phrases and their antecedents pronouns and their antecedents
“The U.K. satellite television broadcaster said its subscriber base grew 17.5 percent during the past year to 5.35 million”
bridge between NE task and TE task
(ST) Scenario Template: requires filling a template structure with extracted information involving several relations or events of interest
- intended to be the MUC approximation to a real-world information extraction problem
- e.g., identification of partners, products, profits and capitalization of joint ventures
JOINT_VENTURE:
  NAME:            Siemens GEC Communication Systems Ltd
  PARTNER_1:       <pointer to ORGANIZATION template>
  PARTNER_2:       <pointer to ORGANIZATION template>
  PRODUCT/SERVICE: <pointer to PRODUCT_OF template>
  CAPITALIZATION:  unknown
  TIME:            18 February 1997

ORGANIZATION: ...        ORGANIZATION: ...

PRODUCT_OF:
  ORGANIZATION: ...
  PRODUCT:      <pointer to PRODUCT template>

PRODUCT: ...
Tasks evaluated in MUC 3-7
[Chinchor, 98]
EVAL\TASK   NE    CO    TE    TR    ST
MUC-3       -     -     -     -     YES
MUC-4       -     -     -     -     YES
MUC-5       -     -     -     -     YES
MUC-6       YES   YES   YES   -     YES
MUC-7       YES   YES   YES   YES   YES
Maximum Results Reported in MUC-7
MEASURE\TASK   NE    CO    TE    TR    ST
RECALL         92    56    86    67    42
PRECISION      95    69    87    86    65