Pastra et al., LREC 2002
How feasible is the reuse of grammars for Named Entity Recognition?
Katerina Pastra, Diana Maynard, Oana Hamza,
Hamish Cunningham and Yorick Wilks
Department of Computer Science, Natural Language Processing
Group, University of Sheffield, U.K.
The paradox
NER results: close to human performance
Reuse of NER resources: minimal
We will focus on:
Traditional rule-based NER systems
NER in text
Reuse of grammars for NER
Manual adaptation of grammars
What is it that hinders grammar reuse?
1) Grammar Formalism
2) Application Domain
3) Natural Language
The use of Flexible System Architectures guarantees
reusability of resources.
But is this a “sine qua non” solution?
Does the lack of such architectures render
reusability simply “not feasible”?
Grammar Formalism (1)
>> Current Practice: No standardised formalism
>> Traditional pattern-matching languages:
inappropriate for NER
>> Norm: Use of AV notations (allow for reference
to token attributes from multiple analysis levels).
• Translating formalisms: a time-effective solution?
• Time gained-information lost: is there a trade-off?
Grammar Formalism (2)
The need: NER for SOCIS (not main task – limited time)
The problem: Existing grammar in another formalism
>> NEA – JAPE Similarities: Declarative, context-sensitive, non-det PM…
>> NEA – JAPE Differences:
• BU rule invocation – FST cascades
• Appelt control mechanism – Appelt, First, Brill
• Rules augmented with PROLOG – JAVA
• Wildcards, “don’t care sequ”: not common
• Iterations, (!=): different mechanisms
Grammar Formalism (3)
The experiment: From the NEA notation to JAPE
NEA notation: A => B\C/D
JAPE: (B)(C) :label (D) :label.EntityType = {attr}
• one’s LHS another’s RHS
• same things handled in different ways
• differences in modules run before NER affect rules
STILL:
Original set in 2 months – SOCIS set in 1 week
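The mapping between the two notations can be illustrated with a minimal sketch. It assumes, from the schematic alone, that in `A => B\C/D` the symbol A names the entity type, C is the matched entity, and B and D are its left and right context; the slides do not spell this out. The names `NeaRule`, `parse_nea` and `to_jape_skeleton` are hypothetical, and the output mirrors the slide's schematic JAPE line, not a complete, loadable JAPE rule.

```python
import re
from typing import NamedTuple


class NeaRule(NamedTuple):
    """Parts of a schematic NEA rule 'A => B\\C/D' (roles are assumed)."""
    entity_type: str    # A: assumed to be the entity type assigned
    left_context: str   # B: assumed left context
    entity: str         # C: assumed span that becomes the entity
    right_context: str  # D: assumed right context


# Splits on '=>', then on the backslash and the slash of the schematic.
_NEA_PATTERN = re.compile(
    r"^\s*(?P<a>[^=]+?)\s*=>\s*(?P<b>[^\\]*?)\s*\\\s*(?P<c>[^/]+?)\s*/\s*(?P<d>.*?)\s*$"
)


def parse_nea(rule: str) -> NeaRule:
    """Parse the schematic NEA form into its four parts."""
    match = _NEA_PATTERN.match(rule)
    if match is None:
        raise ValueError(f"not in 'A => B\\C/D' form: {rule!r}")
    return NeaRule(match["a"], match["b"], match["c"], match["d"])


def to_jape_skeleton(rule: NeaRule, label: str = "label", attr: str = "attr") -> str:
    """Emit a string shaped like the slide's schematic JAPE line.

    The NEA left-hand side (the entity type) ends up on the JAPE
    right-hand side, which is the 'one's LHS another's RHS' point.
    """
    return (
        f"({rule.left_context})({rule.entity}) :{label} "
        f"({rule.right_context}) :{label}.{rule.entity_type} = {{{attr}}}"
    )


print(to_jape_skeleton(parse_nea(r"A => B\C/D")))
```

A purely syntactic rewrite like this covers only the reordering of rule parts. The other listed differences (control mechanisms, PROLOG versus JAVA augmentations, wildcards, iteration) would still need manual adaptation.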
Application Domain (1)
Is there a core set of grammar rules that are always domain independent ?
General purpose NER grammars:
• Developed to serve grammar reuse, but originated
themselves from specific applications
• They separate specific from general information.
• MUSE: automatic resource switches ~ text features
• HaSIE: company reports on health and safety issues
Application Domain (2)
From newswire text on Biotechnology
to … Crime Scene Police Reports
The experiment:
• The gazetteers were enriched with police and crime related information
• All original domain-specific rules were deleted
• Original results with no modifications to the grammar: close to 90%
• Only 1 change to the core set and addition of rules
Natural Language (1)
Parameters to consider:
• The relation of A and B (closely related or not)
determines the extent of reuse
• Nature of NEs (formation, syntagmatic relations)
unpredictable behaviour and structure
finite set
NER Grammar in language (A) + linguistic knowledge of NE in (B) = NER grammar for (B) ?
Natural Language (2)
Romanian NE (compared to English):
• Rich inflection
• Flexible word order
• Different word order (e.g. modifier follows noun)
The experiment:
Run NER grammar for English on Romanian text
Natural Language (3)
1st experiment: Romanian Gaz + English grammar
>> Overall Results: P = 0.82, R = 0.67
• Low recall even for entity types recognised with high P
(e.g. Org 0.75P – 0.39R)
2nd experiment: Romanian Gaz + Adapted grammar
>> Overall Results: P = 0.95, R = 0.94
Corpus: 1MB of Romanian newspaper texts
Manual marking of NEs – Romanian NER (3 weeks)
Natural Language (3)
1st experiment: Romanian Gaz + English grammar
Entity Type     Precision   Recall
Address         0.81        0.81
Date            0.67        0.77
Location        0.88        0.96
Money           0.82        0.47
Organisation    0.75        0.39
Percent         1.00        0.82
Person          0.68        0.78
Identifier      0.94        0.38
Overall         0.82        0.67

2nd experiment: Romanian Gaz + Adapted grammar
Entity Type     Precision   Recall
Address         0.96        0.93
Date            0.95        0.94
Location        0.92        0.97
Money           0.98        0.92
Organisation    0.95        0.89
Percent         1.00        0.99
Person          0.88        0.92
Identifier      0.99        0.96
Overall         0.95        0.94
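To make the effect of the adaptation easier to read off, the recall figures from the two result tables above can be compared per entity type. The sketch below is plain arithmetic on those reported values; the variable and function names are hypothetical, and nothing here is part of the original evaluation.

```python
# Recall values copied from the two result tables
# (English grammar vs. adapted grammar, both with the Romanian gazetteer).
RECALL_ENGLISH_GRAMMAR = {
    "Address": 0.81, "Date": 0.77, "Location": 0.96, "Money": 0.47,
    "Organisation": 0.39, "Percent": 0.82, "Person": 0.78, "Identifier": 0.38,
}
RECALL_ADAPTED_GRAMMAR = {
    "Address": 0.93, "Date": 0.94, "Location": 0.97, "Money": 0.92,
    "Organisation": 0.89, "Percent": 0.99, "Person": 0.92, "Identifier": 0.96,
}
OVERALL_RECALL = (0.67, 0.94)  # (1st experiment, 2nd experiment)


def recall_gains(before: dict, after: dict) -> dict:
    """Per-type recall difference, rounded to the tables' two decimals."""
    return {etype: round(after[etype] - before[etype], 2) for etype in before}


gains = recall_gains(RECALL_ENGLISH_GRAMMAR, RECALL_ADAPTED_GRAMMAR)
overall_gain = round(OVERALL_RECALL[1] - OVERALL_RECALL[0], 2)
largest = max(gains, key=gains.get)

# Print the types from the largest to the smallest recall gain.
for etype, gain in sorted(gains.items(), key=lambda item: item[1], reverse=True):
    print(f"{etype:<13} +{gain:.2f}")
print(f"{'Overall':<13} +{overall_gain:.2f}")
```

Sorting by gain shows which entity types benefited most from adapting the grammar, which matches the observation that recall was the main weakness of the unadapted English grammar.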
Conclusions
Reuse of existing NER grammars is time effective
and should be attempted even when the formalisms,
applications and languages involved are different
Further issues to be addressed:
• Reuse of NER grammars for spoken NEs
• Reuse in statistical/ML NER approaches
• Automating grammar reuse