
Combining terminology resources and statistical methods for entity recognition: an evaluation

Angus Roberts, Robert Gaizauskas, Mark Hepple, Yikun Guo

presented by George Demetriou

Natural Language Processing Group, University of Sheffield, UK

Introduction

Combining techniques for entity recognition:
- Dictionary based term recognition
- Filtering of ambiguous terms
- Statistical entity recognition

How do the techniques compare: separately and in combination?

When combined, can we retain the advantages of both?


Semantic annotation of clinical text

Our basic task is semantic annotation of clinical text

For the purposes of this paper, we ignore:

- Modifiers such as negation
- Relations and coreference

These are the subject of other papers

Punch biopsy of skin. No lesion on the skin surface following fixation.

Entity recognition in specialist domains

Specialist domains, e.g. medicine, are rich in:
- Complex terminology
- Terminology resources and ontologies

We might expect these resources to be of use in entity recognition

We might expect annotation using these resources to add value to the text, providing additional information to applications

Ambiguity in term resources

Most term resources have not been designed with NLP applications in mind

When used for dictionary lookup, many suffer from problems of ambiguity

- I: Iodine, an Iodine test, or the personal pronoun
- be: bacterial endocarditis or the root of a verb

Various techniques can overcome this:
- Filtering or elimination of problematic terms
- Use of context: in our case, statistical models

Corpus: the CLEF gold standard

For experiments, we used a manually annotated gold standard

- Careful construction of a schema and guidelines
- Double annotation with a consensus step
- Measurement of Inter Annotator Agreement (IAA) (Roberts et al. 2008, LREC bio text mining workshop)

For the experiments reported, we use 77 gold standard documents

Entity types

Entity type     Brief description                          Instances
Condition       Symptom, diagnosis, complication, etc.     739
Drug or device  Drug or some other prescribed item         272
Intervention    Action performed by a clinician            298
Investigation   Tests, measurements and studies            325
Locus           Anatomical location, body substance, etc.  490
Total                                                      2124

[Diagram: Termino architecture. External terminologies, ontologies and databases are loaded into the Termino database; matchers and annotators are compiled from it, with links back to the source resources.]

Dictionary lookup: Termino

Termino is loaded from external resources

FSM matchers are compiled out of Termino

Finding entities with Termino

[Diagram: GATE application pipeline. Application texts → linguistic pre-processing → Termino term recognition → annotated texts.]

Termino loaded with selected terms from UMLS (600K terms)

Pre-processing includes tokenisation and morphological analysis

Lookup is against the roots of tokens
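Termino's actual matchers are compiled finite-state machines inside GATE; the sketch below is only an illustration of the underlying idea, dictionary lookup over the morphological roots of tokens with longest-match-first matching. All names and the toy term dictionary are invented for the example.

```python
# Illustrative sketch (not the actual Termino/GATE implementation):
# dictionary lookup over token roots, taking the longest match first.

def find_terms(token_roots, term_dict):
    """Return (start, end, label) spans whose root sequence is in term_dict."""
    max_len = max((len(t) for t in term_dict), default=0)
    spans = []
    i = 0
    while i < len(token_roots):
        match = None
        # try the longest candidate span at this position first
        for j in range(min(i + max_len, len(token_roots)), i, -1):
            candidate = tuple(token_roots[i:j])
            if candidate in term_dict:
                match = (i, j, term_dict[candidate])
                break
        if match:
            spans.append(match)
            i = match[1]
        else:
            i += 1
    return spans

# Because lookup is against roots, "biopsies" would match the term "biopsy"
terms = {("punch", "biopsy"): "Investigation", ("skin",): "Locus"}
roots = ["punch", "biopsy", "of", "skin"]
print(find_terms(roots, terms))
```

Matching on roots rather than surface forms is what lets inflected variants in the text hit the base forms stored in the terminology.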

Filtering problematic terms

Many UMLS terms are not suitable for NLP

Ambiguity with common general language words

To identify the most problematic of these, we ran Termino over a separate development corpus, and manually inspected the results

A supplementary list of missing terms was compiled by domain experts (6 terms)

Creation of these lists took a couple of hours

Creating the filter list

1. Add all unique terms of 1 character to the list

2. For all unique terms of <= 6 characters:

i. Add to the list if it matches a common general language word or abbreviation

ii. Add to the list if it has a numeric component

iii. Reject from the list if it is an obvious technical term

iv. Reject from the list if none of the above apply

3. Filter list size: 232 terms
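The steps above can be sketched as a simple heuristic. The common-word and technical-term sets below are hypothetical stand-ins for the manual inspection of the development corpus; they are not the lists used in the paper.

```python
# A minimal sketch of the filter-list heuristic, assuming hand-built
# word lists (these example entries are invented, not from the paper).

COMMON_WORDS = {"i", "be", "all", "as", "no"}   # assumed examples
TECHNICAL_TERMS = {"dna", "hiv", "ecg"}         # assumed examples

def build_filter_list(terms):
    filter_list = set()
    for term in terms:
        t = term.lower()
        if len(t) == 1:                        # step 1: all 1-character terms
            filter_list.add(term)
        elif len(t) <= 6:
            if t in TECHNICAL_TERMS:           # step 2.iii: keep technical terms
                continue
            if t in COMMON_WORDS:              # step 2.i: common general words
                filter_list.add(term)
            elif any(c.isdigit() for c in t):  # step 2.ii: numeric component
                filter_list.add(term)
    return filter_list

print(sorted(build_filter_list(["I", "be", "DNA", "T4", "peritoneum"])))
```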

Entities found by Termino

        UMLS    UMLS+filter  UMLS+filter+supplementary  IAA
P       0.2458  0.5224       0.5238
R       0.6999  0.6939       0.7042
F1      0.3638  0.5961       0.6008                     0.7373

UMLS alone gives poor precision, due to term ambiguity with general language words

Adding in the filter list improves precision with little loss in recall

Statistical entity recognition

Statistical entity recognition allows us to model context

We use an SVM implementation provided with GATE

Mapping of our multi-class entity recognition task to binary SVM classifiers is handled by GATE
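GATE performs this mapping internally; the sketch below only illustrates the general one-vs-rest reduction, with toy scoring functions standing in for trained SVM decision functions (all names here are invented).

```python
# A hedged sketch of one-vs-rest reduction of a multi-class task to
# binary classifiers. The per-class scorers are toy stand-ins, not SVMs.

def one_vs_rest_predict(features, binary_classifiers):
    """Each binary classifier scores 'its class' vs 'everything else';
    the class whose classifier gives the highest score wins."""
    scores = {label: clf(features) for label, clf in binary_classifiers.items()}
    return max(scores, key=scores.get)

# toy decision functions standing in for trained per-class SVMs
classifiers = {
    "Condition":     lambda f: 1.0 if "lesion" in f else -1.0,
    "Locus":         lambda f: 1.0 if "skin" in f else -1.0,
    "Investigation": lambda f: 1.0 if "biopsy" in f else -1.0,
}
print(one_vs_rest_predict({"skin"}, classifiers))
```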

Features for machine learning

Token kind (e.g. number, word)

Orthographic type (e.g. lower case, upper case)

Morphological root

Affix

Generalised part of speech (the first two characters of the Penn Treebank tag)

Termino recognised terms
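The feature set above can be illustrated with a small extractor for a single token. Feature names and value conventions here are simplified assumptions; the actual GATE feature configuration differs in detail.

```python
# Illustrative extractor for the kinds of per-token features listed above
# (names and values are simplified stand-ins, not the GATE configuration).

def token_features(token, pos_tag, root, termino_label=None):
    feats = {
        "kind": "number" if token.isdigit() else "word",   # token kind
        "orth": "upper" if token.isupper()                 # orthographic type
                else "initcap" if token[:1].isupper() else "lower",
        "root": root,                                      # morphological root
        "affix": token[-3:],                               # simple suffix stand-in
        "pos2": pos_tag[:2],                               # generalised POS
    }
    if termino_label is not None:                          # Termino-recognised term
        feats["termino"] = termino_label
    return feats

print(token_features("Biopsies", "NNS", "biopsy", termino_label="Investigation"))
```

Passing the Termino label in as a feature is what lets the statistical model exploit the dictionary lookup.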

Finding entities: ML

[Diagram: GATE training pipeline: gold standard annotated texts (human annotated) → linguistic processing → term model learning → statistical model of text. GATE application pipeline: application texts → linguistic processing → term model application (using the statistical model) → annotated texts.]

Finding entities: ML + Termino

[Diagram: as above, with Termino term recognition added to both pipelines. Training: gold standard annotated texts → linguistic processing → Termino term recognition → term model learning → statistical model of text. Application: application texts → linguistic processing → Termino term recognition → term model application → annotated texts.]

Entities found by SVM

        Best UMLS  SVM+tokens  SVM+tokens+termino  IAA
P       0.5238     0.7931      0.8065
R       0.7042     0.5417      0.6308
F1      0.6008     0.6423      0.7071              0.7373

Statistical entity recognition alone gives a higher P than dictionary lookup, but a lower R

The combined system gains from the higher R of dictionary lookup, with no loss in P

Linkage to external resources

The peritoneum contains deposits of tumour... the tumour cells are negative for desmin.

Semantic annotation allows us to link texts to existing domain resources

Giving more intelligent indexing and making additional information available to applications

Linkage to external resources

UMLS links terms to Concept Unique Identifiers (CUIs)

Where a recognised entity is associated with an underlying Termino term, we can likewise automatically link the entity to a CUI

If the SVM finds an entity when Termino has found nothing, the entity cannot be linked to a CUI
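A minimal sketch of this linkage step, assuming a term-to-CUIs mapping built from UMLS. The CUI strings below are invented placeholders, not real UMLS identifiers.

```python
# Sketch of CUI linkage: entities backed by a Termino term inherit its
# CUIs; SVM-only entities cannot be linked. CUIs here are placeholders.

TERM_TO_CUIS = {
    "peritoneum": ["C0000001"],               # invented CUI
    "tumour":     ["C0000002", "C0000003"],   # invented CUIs (ambiguous term)
}

def link_entity(entity_text, termino_term=None):
    if termino_term is None:
        return []                             # found by the SVM alone: no CUI
    return TERM_TO_CUIS.get(termino_term, [])

print(link_entity("peritoneum", termino_term="peritoneum"))
print(link_entity("new finding"))
```

Terms mapping to more than one CUI, like "tumour" above, are exactly the ambiguous cases counted in the table below.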

CUIs assigned  Number of terms  % of terms
0              146              16.94
1              486              56.38
2              190              22.04
3              31               3.60
4              6                0.70
5              3                0.35
>0             716              83.06

At least one CUI can be automatically assigned to 83% of the terms in the gold standard

Some are ambiguous, and resolution is needed

Availability

Most of the software is open source and can be downloaded as part of GATE

We are currently packaging Termino for public release

We are currently preparing a UK research ethics committee application for release of the annotated gold standard

Conclusions

- Dictionary lookup gives good recall but poor precision, due to term ambiguity
- Much ambiguity is due to a few terms, which can be filtered with little loss in recall
- Combining dictionary lookup with statistical models of context improves precision
- A benefit of dictionary lookup, linkage to external resources, can be retained in the combined system

Questions?

http://www.clinical-escience.org

http://www.clef-user.com