Exploiting Named Entity Taggers in a Second Language

Thamar Solorio

Computer Science Department

National Institute of Astrophysics, Optics and Electronics

ACL 2005 Student Research Workshop

2/15

Abstract

Named Entity Recognition

  (X) Complex linguistic resources
  (X) A hand coded system
  (X) Any language dependent tools

The only information we use is automatically extracted from the documents, without human intervention.

Our approach even outperformed the hand coded system on NER in Spanish, and it achieved high accuracies in Portuguese.

3/15

Introduction

Most NER approaches have very low portability.

Many NE extractor systems rely heavily on complex linguistic resources, which are typically hand coded: for example, regular expressions, grammars, gazetteers and the like.

Adapting a system to a different collection or language requires a lot of human effort: rewriting the grammars, acquiring new dictionaries, searching for trigger words, and so on.

  (O) NE extractor system for Spanish + Portuguese corpus
  (X) Developing linguistic tools such as parsers, POS taggers, grammars and the like

4/15

Related Work (1/2)

Hidden Markov Models
  Zhou and Su, 2002
    Internal features: gazetteer information
    External features: the context of other NEs already recognized
  Bikel et al., 1997 and Bikel et al., 1999

Maximum entropy
  Borthwick, 1999: dictionaries and other orthographic information

Carreras et al., 2003 presented results of a NER system for Catalan using Spanish resources, with cross-linguistic features

5/15

Related Work (2/2)

Petasis et al., 2000
  Extending a proper noun dictionary
  An inductive decision-tree classifier
  Unsupervised probabilistic learning of syntactic and semantic context
  POS tags and morphological information

Arévalo et al., 2002
  External information provided by gazetteers and lists of trigger words
  A context free grammar, manually coded, is used for recognizing syntactic patterns

6/15

Data sets

The corpus in Spanish is that used in the CoNLL 2002 competitions for the NE extraction task.
  training: 20,308 NEs
  testa: 4,634 NEs
    to tune the parameters of the classifiers (development set)
    performed experiments with testa only
  testb: 3,948 NEs
    to compare the results of the competitors

The corpus in Portuguese is “HAREM: Evaluation contest on named entity recognition for Portuguese”. This corpus contains newspaper articles and consists of 8,551 words with 648 NEs.

7/15

Two-step Named Entity Recognition

Named Entity Delimitation (NED)
  Determining boundaries of named entities

Named Entity Classification (NEC)
  Classifying the named entities into categories
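The hand-off between the two steps can be sketched as follows. This is a minimal illustration, assuming the delimitation step emits one B, I or O tag per token (the BIO scheme named on the NED slides). The function name `bio_to_spans` and the example sentence are hypothetical and not taken from the original system.

```python
from typing import List, Tuple

def bio_to_spans(tags: List[str]) -> List[Tuple[int, int]]:
    """Turn a B/I/O tag sequence into (start, end) token spans, end exclusive."""
    spans = []
    start = None  # index where the currently open entity began, if any
    for i, tag in enumerate(tags):
        if tag == "B":
            if start is not None:      # a new entity closes the previous one
                spans.append((start, i))
            start = i
        elif tag == "I":
            if start is None:          # assumption: a stray I opens an entity
                start = i
        else:                          # "O" closes any open entity
            if start is not None:
                spans.append((start, i))
                start = None
    if start is not None:              # entity running to the end of the sentence
        spans.append((start, len(tags)))
    return spans

# Hypothetical NED output for a short Spanish sentence.
tokens = ["El", "presidente", "de", "Telefónica", "visitó", "Buenos", "Aires", "."]
tags = ["O", "O", "O", "B", "O", "B", "I", "O"]

# Each delimited span is what the NEC step would then assign a category to.
entities = [" ".join(tokens[s:e]) for s, e in bio_to_spans(tags)]
print(entities)
```

Keeping delimitation and classification separate means the second learner only ever sees complete entity candidates, which is the design the two-step slide describes.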

8/15

Named Entity Delimitation (1/3)

BIO scheme (B: beginning of a named entity, I: inside, O: outside)

We used a modified version of the C4.5 algorithm (Quinlan, 1993) implemented within the WEKA environment (Witten and Frank, 1999).

For each word we combined two types of features:
  Internal: the word itself, orthographic information (1~6 possible states) and the position in the sentence.
  External: POS tag and BIO tag

The features for a given word w are extracted using a window of five words anchored at w, each word described by the internal and external features mentioned previously.

Our classifier learns to discriminate among the three classes and assigns labels to all the words, processing them sequentially.
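The window-based feature extraction can be sketched roughly as below. This is not the original implementation. The six orthographic categories in `orthographic_state`, the `PAD` placeholder and the feature names are assumptions made for illustration, since the slides do not list the actual states. The external POS and BIO tags are simply passed in as lists here.

```python
from typing import Dict, List, Union

PAD = "<pad>"  # hypothetical filler for window positions outside the sentence

def orthographic_state(word: str) -> str:
    """Assumed six-way orthographic category; the real states are not given."""
    if word.isdigit():
        return "digits"
    if word.isupper():
        return "all_caps"
    if word[:1].isupper():
        return "init_cap"
    if word.islower():
        return "lower"
    if any(ch.isalnum() for ch in word):
        return "mixed"
    return "punct"

def window_features(words: List[str], pos_tags: List[str],
                    ext_bio_tags: List[str], i: int,
                    size: int = 5) -> Dict[str, Union[str, int]]:
    """Describe word i by the features of a window of `size` words centred on it."""
    half = size // 2
    feats: Dict[str, Union[str, int]] = {}
    for offset in range(-half, half + 1):
        j = i + offset
        inside = 0 <= j < len(words)
        # Internal features: word, orthographic state, position in the sentence.
        feats[f"word[{offset}]"] = words[j] if inside else PAD
        feats[f"orth[{offset}]"] = orthographic_state(words[j]) if inside else PAD
        feats[f"position[{offset}]"] = j if inside else -1
        # External features: POS tag and BIO tag supplied from outside.
        feats[f"pos[{offset}]"] = pos_tags[j] if inside else PAD
        feats[f"bio[{offset}]"] = ext_bio_tags[j] if inside else PAD
    return feats

# Hypothetical tagged sentence; the tag values are placeholders.
words = ["Juan", "vive", "en", "Madrid", "."]
pos_tags = ["NP", "VM", "SP", "NP", "Fp"]
ext_bio_tags = ["B", "O", "O", "B", "O"]
features = window_features(words, pos_tags, ext_bio_tags, 0)
```

Each word then becomes one training instance with 25 attributes (five per window position), which is the kind of fixed-length vector a decision-tree learner such as C4.5 expects.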

9/15

Named Entity Delimitation (2/3)

10/15

Named Entity Delimitation (3/3)

The hand coded system used in this work was developed by the TALP research center (Carreras and Padró, 2002).
  NLP analyzers for Spanish, English and Catalan
  Includes practical tools such as POS taggers, semantic analyzers and NE extractors.

This NER system is based on hand-coded grammars, lists of trigger words and gazetteer information.

11/15

NED - Experimental Results

Average of several runs of 10-fold cross-validation
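The evaluation protocol, averaging several runs of 10-fold cross-validation, can be sketched in plain Python. The number of runs (3 here), the seeding scheme and the `majority_baseline` scorer are assumptions for illustration. The slides only say "several runs" and the real scorer would train and test a C4.5 classifier.

```python
import random
from typing import Callable, List, Sequence

def k_fold_indices(n: int, k: int, seed: int) -> List[List[int]]:
    """Shuffle indices 0..n-1 and deal them into k disjoint folds."""
    indices = list(range(n))
    random.Random(seed).shuffle(indices)
    return [indices[fold::k] for fold in range(k)]

def repeated_cross_validation(examples: Sequence, evaluate: Callable,
                              k: int = 10, runs: int = 3) -> float:
    """Average evaluate(train, test) over `runs` independent k-fold splits."""
    scores = []
    for run in range(runs):
        # A different seed per run gives a different partition each time.
        for held_out in k_fold_indices(len(examples), k, seed=run):
            held_out_set = set(held_out)
            train = [ex for i, ex in enumerate(examples) if i not in held_out_set]
            test = [examples[i] for i in held_out]
            scores.append(evaluate(train, test))
    return sum(scores) / len(scores)

def majority_baseline(train, test) -> float:
    """Hypothetical stand-in scorer: accuracy of the most common training label."""
    labels = [label for _, label in train]
    majority = max(set(labels), key=labels.count)
    return sum(1 for _, label in test if label == majority) / len(test)

# Toy data: 40 (word, tag) pairs, one quarter tagged B and the rest O.
toy = [(f"w{i}", "O" if i % 4 else "B") for i in range(40)]
average_score = repeated_cross_validation(toy, majority_baseline)
print(average_score)
```

Repeating the split with different shuffles reduces the dependence of the reported figure on any single partition of the data.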

12/15

Named Entity Classification

To build a training set for the NEC learner:
  the same attributes as for the NED task
  the suffix of each word, with a maximum size of 5 characters

Spanish: person, organization, location and miscellaneous.

Portuguese: person, object, quantity, event, organization, artifact, location, date, abstraction and miscellaneous.

We believe that the learner will be capable of achieving good accuracies in the learning task.
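The additional suffix attribute can be illustrated in a few lines. This sketch assumes a single suffix per word, truncated to at most 5 characters. The slide could also be read as using every suffix of length 1 to 5, and the function names are hypothetical.

```python
from typing import List

def suffix_feature(word: str, max_len: int = 5) -> str:
    """Return the last `max_len` characters of the word (the whole word if shorter)."""
    return word[-max_len:]

def entity_suffixes(entity_tokens: List[str], max_len: int = 5) -> List[str]:
    """Suffix attribute for every word of a delimited named entity."""
    return [suffix_feature(tok, max_len) for tok in entity_tokens]

# Hypothetical entity produced by the delimitation step.
suffixes = entity_suffixes(["Universidad", "de", "Lisboa"])
print(suffixes)
```

Word endings are cheap to compute and need no language-specific resources, which fits the goal of keeping the attribute set portable across languages.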

13/15

NEC - Experimental Results (1/2)

Similarly to the NED case, we trained C4.5 classifiers for the NEC task

14/15

NEC - Experimental Results (2/2)

Only 4 instances

15/15

Conclusions

NER systems must be easy to port and robust, given the great variety of documents and languages for which it is desirable to have these tools available.

Our method does not require any language dependent features. The only information used in this approach is automatically extracted from the documents, without human intervention.

Our method has been shown to be robust and easy to port to other languages. The only requirement for using our method is a tokenizer for the target language.
