Pastra et al., LREC 2002 How feasible is the reuse of grammars for Named Entity Recognition? Katerina Pastra, Diana Maynard, Oana Hamza, Hamish Cunningham and Yorick Wilks Department of Computer Science, Natural Language Processing Group, University of Sheffield, U.K.

How feasible is the reuse of grammars for Named Entity Recognition?

Katerina Pastra, Diana Maynard, Oana Hamza, Hamish Cunningham and Yorick Wilks

Department of Computer Science, Natural Language Processing Group, University of Sheffield, U.K.

The paradox

NER results: close to human performance

Reuse of NER resources: minimal

We will focus on:

Traditional rule-based NER systems

NER in text

Reuse of grammars for NER

Manual adaptation of grammars

What is it that hinders grammar reuse?

1) Grammar Formalism

2) Application Domain

3) Natural Language

The use of Flexible System Architectures guarantees reusability of resources.

>> But is this a “sine qua non” solution? Does the lack of such architectures render reusability simply “not feasible”?

Grammar Formalism (1)

>> Current Practice: No standardised formalism

>> Traditional pattern-matching languages: inappropriate for NER

>> Norm: use of attribute-value (AV) notations, which allow reference to token attributes from multiple analysis levels

• Translating formalisms: a time-effective solution?

• Time gained vs. information lost: is there a trade-off?

Grammar Formalism (2)

The need: NER for SOCIS (not main task – limited time)

The problem: existing grammar in another formalism

>> NEA – JAPE similarities: declarative, context-sensitive, non-deterministic pattern matching…

>> NEA – JAPE differences:

• Bottom-up rule invocation – FST cascades

• Appelt control mechanism – Appelt, First, Brill

• Rules augmented with PROLOG – Java

• Wildcards, “don’t care” sequences: not common

• Iterations, (!=): different mechanisms

Grammar Formalism (3)

The experiment: From the NEA notation to JAPE

NEA notation: A => B\C/D

JAPE: (B)(C) :label (D) :label.EntityType = {attr}

• one’s LHS another’s RHS

• same things handled in different ways

• differences in modules run before NER affect rules

STILL: original set in 2 months – SOCIS set in 1 week
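
The schematic correspondence on this slide can be illustrated with a short sketch. It is not the translation procedure used for SOCIS. It assumes that in `A => B\C/D` the symbol A names the entity type, C is the span to be annotated, and B and D are its left and right context. The function name `nea_to_jape`, the default rule name and the `rule` attribute are hypothetical, and B, C, D are treated as opaque placeholders rather than real JAPE annotation constraints.

```python
import re

# Schematic NEA rule as written on the slide: A => B\C/D
# Assumption: A = entity type, B = left context, C = span to annotate,
# D = right context. Real NEA rules are richer than this.
NEA_RULE = re.compile(
    r"^\s*(?P<a>\S+)\s*=>\s*(?P<b>[^\\/]+)\\(?P<c>[^\\/]+)/(?P<d>[^\\/]+?)\s*$"
)


def nea_to_jape(rule: str, name: str = "Translated") -> str:
    """Turn a schematic 'A => B\\C/D' rule into a JAPE-style rule skeleton."""
    match = NEA_RULE.match(rule)
    if match is None:
        raise ValueError(f"not a schematic NEA rule: {rule!r}")
    a, b, c, d = (match.group(k).strip() for k in "abcd")
    # NEA puts the entity type on the left of the rule; JAPE puts the
    # pattern on the left and the entity type on the right.
    return "\n".join([
        f"Rule: {name}",
        f"({b})",
        f"({c}):label",
        f"({d})",
        "-->",
        f':label.{a} = {{rule = "{name}"}}',
    ])


print(nea_to_jape(r"A => B\C/D"))
```

The output is only a skeleton: a working JAPE grammar would also need a phase header and real annotation constraints in place of B, C and D, which is where the "same things handled in different ways" point applies.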

Application Domain (1)

Is there a core set of grammar rules that are always domain-independent?

General purpose NER grammars:

• Developed to serve grammar reuse, but themselves originated from specific applications

• They separate specific from general information.

• MUSE: automatic resource switches ~ text features

• HaSIE: company reports on health and safety issues

Application Domain (2)

The experiment: from newswire text on Biotechnology to … Crime Scene Police Reports

• The gazetteers were enriched with police and crime related information

• All original domain-specific rules were deleted

• Original results with no modifications to the grammar: close to 90%

• Only 1 change to the core set and addition of rules

Natural Language (1)

NER grammar in language (A) + linguistic knowledge of NE in (B) = NER grammar for (B)?

Parameters to consider:

• The relation of A and B (closely related or not) determines the extent of reuse

• Nature of NEs (formation, syntagmatic relations):

   >> unpredictable behaviour and structure

   >> finite set

Natural Language (2)

Romanian NE (compared to English):

• Rich inflection

• Flexible word order

• Different word order (e.g. modifier follows noun)

The experiment:

Run NER grammar for English on Romanian text

Natural Language (3)

1st experiment: Romanian Gaz + English grammar

>> Overall Results: P = 0.82, R = 0.67

• Low recall even for entity types recognised with high precision (e.g. Organisation: P = 0.75, R = 0.39)

2nd experiment: Romanian Gaz + Adapted grammar

>> Overall Results: P = 0.95, R = 0.94

Corpus: 1MB of Romanian newspaper texts

Manual marking of NEs – Romanian NER (3 weeks)

Natural Language (3)

1st experiment (Romanian Gaz + English grammar):

Entity Type     Precision   Recall
Address         0.81        0.81
Date            0.67        0.77
Location        0.88        0.96
Money           0.82        0.47
Organisation    0.75        0.39
Percent         1.00        0.82
Person          0.68        0.78
Identifier      0.94        0.38
Overall         0.82        0.67

2nd experiment (Romanian Gaz + Adapted grammar):

Entity Type     Precision   Recall
Address         0.96        0.93
Date            0.95        0.94
Location        0.92        0.97
Money           0.98        0.92
Organisation    0.95        0.89
Percent         1.00        0.99
Person          0.88        0.92
Identifier      0.99        0.96
Overall         0.95        0.94
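
Precision and recall are often combined into a single balanced F-measure for comparison. The slides report only P and R, so the values computed by this sketch are derived here and are not figures from the presentation. The function name `f1` and the dictionary layout are illustrative.

```python
def f1(precision: float, recall: float) -> float:
    """Balanced F-measure: the harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Overall (P, R) as reported on the slides for the two experiments.
overall = {
    "Romanian Gaz + English grammar": (0.82, 0.67),
    "Romanian Gaz + Adapted grammar": (0.95, 0.94),
}

for name, (p, r) in overall.items():
    print(f"{name}: P={p:.2f} R={r:.2f} F1={f1(p, r):.3f}")
```

The same function can be applied row by row to the per-entity figures, for example to see how much the low recall for Organisation or Identifier in the first experiment pulls the combined score down.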

Conclusions

Reuse of existing NER grammars is time effective and should be attempted even when the formalisms, applications and languages involved are different

Further issues to be addressed:

• Reuse of NER grammars for spoken NEs

• Reuse in statistical/ML NER approaches

• Automating grammar reuse