
Recognizing Names in Biomedical Texts: a Machine Learning Approach

GuoDong Zhou1,*, Jie Zhang1,2, Jian Su1, Dan Shen1,2 and ChewLim Tan2

1 Institute for Infocomm Research, Singapore
2 School of Computing, National Univ. of Singapore

(Bioinformatics, Vol. 20, No. 7, 2004, p. 1178-1190)

2/29

Abstract

Presents PowerBioNE, a named entity recognition system for the biomedical domain.

Evidential features: (1) word formation pattern; (2) morphological pattern, such as prefix and suffix; (3) POS; (4) head noun trigger; (5) special verb trigger and (6) name alias feature.

Hidden Markov model (HMM).

Uses the k-Nearest Neighbor algorithm to resolve the data sparseness problem.

Uses pattern-based post-processing to deal with the cascaded entity name phenomenon.

3/29

Special Naming Conventions (1/2)

Descriptive naming convention
  E.g. "normal thymic epithelial cells" -> difficulty in identifying the left boundaries.
  18.6% of names consist of at least four words in GENIA 3.0.

Conjunction and disjunction
  E.g. "91 and 84 kDa proteins".
  2.06% of names have such a construction in GENIA 3.0.

Non-standardized naming convention
  E.g. N-acetylcysteine, N-acetyl-cysteine, NAcetylCysteine.

4/29

Special Naming Conventions (2/2)

Abbreviation
  Frequently used in the biomedical domain.
  Ambiguous: 81.2% are ambiguous and have an average of 16.6 senses in MEDLINE abstracts.

Cascaded construction
  E.g. <PROTEIN><DNA>kappa 3</DNA> binding factor </PROTEIN>
  16.7% of names have such a construction in GENIA 3.0.

5/29

GENIA Corpus

GENIA V1.1
  670 MEDLINE abstracts of 123K words.
  Used to compare their work with others.

GENIA V2.1
  Adds POS annotation to GENIA V1.1.
  Used to train the POS tagger and evaluate the usefulness of POS.

GENIA V3.0
  2,000 MEDLINE abstracts of 360K words.
  Used for the broad range of experiments.

The GENIA ontology includes 23 distinct classes.

6/29

Features

Word formation pattern (FWFP)
Morphological pattern (FMP)
Part-of-speech (FPOS)
Semantic triggers
Name alias feature (FALIAS)

7/29

Word Formation Pattern

It is useful to distinguish between biomedical entity names and others.
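
As a rough illustration of what a word formation pattern feature can look like, the sketch below maps a token to a coarse orthographic category. The category names and regular expressions are illustrative assumptions, not the table from the paper.

```python
import re

# Illustrative categories only (assumed); the first matching pattern wins.
PATTERNS = [
    ("AllDigits", re.compile(r"^\d+$")),
    ("AllCaps", re.compile(r"^[A-Z]+$")),
    ("CapsAndDigits", re.compile(r"^(?=.*[A-Z])(?=.*\d)[A-Z0-9]+$")),
    ("ContainsHyphen", re.compile(r"^\S*-\S*$")),
    ("InitialCap", re.compile(r"^[A-Z][a-z]+$")),
    ("Lowercase", re.compile(r"^[a-z]+$")),
]


def word_formation_pattern(word):
    """Return the name of the first category whose pattern matches the word."""
    for name, rx in PATTERNS:
        if rx.match(word):
            return name
    return "Other"


if __name__ == "__main__":
    for w in ["IL2", "NF", "kinase", "N-acetylcysteine"]:
        print(w, word_formation_pattern(w))
```

Ordering the patterns from most to least specific keeps the function deterministic when a token could fit several categories.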

8/29

Morphological Pattern (1/2)

9/29

Morphological Pattern (2/2)

They count the frequency of each prefix/suffix in each entity class and group prefixes/suffixes with similar distributions among the entity classes into one category.

On average, 37 prefixes/suffixes are selected from the training data and further grouped into 23 categories.
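
The counting and grouping step can be sketched as follows. This is a minimal illustration for suffixes only; the fixed suffix length, the L1 distance and the greedy grouping with a threshold are assumptions, since the slide does not state how similarity between distributions is measured.

```python
from collections import Counter, defaultdict


def affix_distributions(names, affix_len=3):
    """names: iterable of (word, entity_class) pairs.
    Returns {suffix: {entity_class: relative frequency}}."""
    counts = defaultdict(Counter)
    for word, cls in names:
        if len(word) > affix_len:
            counts[word[-affix_len:].lower()][cls] += 1
    return {
        affix: {c: n / sum(cnt.values()) for c, n in cnt.items()}
        for affix, cnt in counts.items()
    }


def l1(p, q):
    """L1 distance between two class distributions (assumed similarity measure)."""
    return sum(abs(p.get(c, 0.0) - q.get(c, 0.0)) for c in set(p) | set(q))


def group_affixes(dists, threshold=0.5):
    """Greedy grouping: an affix joins the first group whose seed
    distribution lies within `threshold`; otherwise it starts a new group."""
    groups = []  # list of (seed distribution, member affixes)
    for affix in sorted(dists):
        for seed, members in groups:
            if l1(seed, dists[affix]) <= threshold:
                members.append(affix)
                break
        else:
            groups.append((dists[affix], [affix]))
    return [members for _, members in groups]
```

Each resulting group would then act as one value of the morphological pattern feature.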

10/29

Part-of-Speech

POS may provide useful evidence about the boundaries of biomedical entity names.

The authors adapt an HMM-based POS tagger to GENIA V2.1 by training on the PENN TreeBank (2,500 WSJ articles, 1M words) and 590 GENIA abstracts.

11/29

Semantic Triggers (1/2)

Head noun trigger (FHEAD)
  The head noun is the major noun of a noun phrase and often describes the function or the property of the NP.
  E.g. activated human B cells
  Extract unigram and bigram head nouns and rank them by frequency. Select the top 60% as head noun triggers.
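
The selection of head noun triggers can be sketched as below. This assumes, for illustration only, that the head noun is the last word of an entity name (unigram) or its last two words (bigram); the slide does not say how head nouns are located.

```python
from collections import Counter


def head_noun_triggers(entity_names, keep=0.6):
    """Rank unigram and bigram head nouns by frequency and keep the top share."""
    counts = Counter()
    for name in entity_names:
        tokens = name.lower().split()
        if tokens:
            counts[tokens[-1]] += 1                # unigram head noun (assumed: last word)
        if len(tokens) >= 2:
            counts[" ".join(tokens[-2:])] += 1     # bigram head noun (assumed: last two words)
    ranked = [head for head, _ in counts.most_common()]
    if not ranked:
        return []
    return ranked[: max(1, round(len(ranked) * keep))]


if __name__ == "__main__":
    names = ["activated human B cells", "normal thymic epithelial cells",
             "T cells", "NF-kappa B binding site"]
    print(head_noun_triggers(names))
```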

12/29

Semantic Triggers (2/2)

Special verb trigger (FVERB)
  Special verbs may provide evidence about the boundaries and the classes of biomedical entity names.

13/29

Name Alias Feature

Inter-sentential name alias phenomenon
  TCF is proposed as an entity name candidate, and the name alias algorithm is invoked.
  If ‘T cell Factor’ is a ‘Protein’ name recognized earlier in the document, ‘TCF’ is determined to be an alias of ‘T cell Factor’ with the name alias feature Protein3L3.

Inner-sentential abbreviation
  When an abbreviation in parentheses is detected, remove the abbreviation and the parentheses.
  Apply the HMM-based named entity recognizer to the sentence, then restore the abbreviation with its parentheses to its original position.
  The abbreviation is assigned the same class as the expanded form.
  The expanded form and its abbreviation are stored in the list of recognized biomedical entity names.
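
The inner-sentential abbreviation steps can be sketched roughly as follows. The regular expression for detecting an abbreviation and the `recognizer` callback (standing in for the HMM-based recognizer) are hypothetical; the sketch records where each abbreviation was removed so it can be put back, rather than rebuilding the tagged sentence.

```python
import re

# Assumed detection rule: a short alphanumeric token in parentheses.
ABBR = re.compile(r"\s*\(([A-Za-z][A-Za-z0-9-]{1,9})\)")


def strip_abbreviations(sentence):
    """Remove parenthesised abbreviations. Returns the cleaned sentence and
    (abbreviation, offset in the cleaned sentence) pairs."""
    pieces, removed = [], []
    last = shift = 0
    for m in ABBR.finditer(sentence):
        pieces.append(sentence[last:m.start()])
        removed.append((m.group(1), m.start() - shift))
        shift += m.end() - m.start()
        last = m.end()
    pieces.append(sentence[last:])
    return "".join(pieces), removed


def recognize_with_abbreviations(sentence, recognizer):
    """recognizer(cleaned) is a hypothetical callable returning
    (start, end, entity_class) spans over the cleaned sentence."""
    cleaned, removed = strip_abbreviations(sentence)
    spans = recognizer(cleaned)
    recognized = [(cleaned[s:e], cls) for s, e, cls in spans]
    for abbr, offset in removed:
        for s, e, cls in spans:
            if e == offset:  # the expanded form ends where the abbreviation stood
                recognized.append((abbr, cls))
                break
    return recognized
```

With a stub recognizer that tags "T cell factor" as a protein, the sentence "T cell factor (TCF) binds DNA." yields both the expanded form and "TCF" with the same class.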

14/29

HMM-based Biomedical Named Entity Recognition (1/2)

Given an output sequence $O_1^n = o_1 o_2 \ldots o_n$, the purpose of an HMM is to find the most likely tag (state) sequence $S_1^n = s_1 s_2 \ldots s_n$ that maximizes $P(S_1^n \mid O_1^n)$. Here, $o_i = \langle f_i, w_i \rangle$, where $w_i$ is the word and $f_i = \langle F_i^{WFP}, F_i^{MP}, F_i^{POS}, F_i^{HEAD}, F_i^{VERB}, F_i^{ALIAS} \rangle$ is the feature set of the word $w_i$, and $s_i = \mathrm{BOUNDARY}_i\_\mathrm{ENTITY}_i\_\mathrm{FEATURE}_i$, where $\mathrm{BOUNDARY}_i$ denotes the position of the current word in the entity, $\mathrm{ENTITY}_i$ indicates the class of the entity, and $\mathrm{FEATURE}_i$ is the feature set.

15/29

HMM-based Biomedical Named Entity Recognition (2/2)

Assume mutual information (MI) independence.

=>
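
The equations on this slide did not survive extraction. As a hedged reconstruction, the usual derivation under a mutual information independence assumption (each state $s_i$ depends on the output sequence $O_1^n$ but not on the other states when measuring mutual information) runs as follows; it is consistent with the term $P(s_i \mid O_1^n)$ that the k-NN slides estimate, but it is a sketch rather than a copy of the original slide.

```latex
\begin{align*}
\log P(S_1^n \mid O_1^n)
  &= \log P(S_1^n) + \log \frac{P(S_1^n, O_1^n)}{P(S_1^n)\,P(O_1^n)} \\
\text{MI independence:}\quad
\log \frac{P(S_1^n, O_1^n)}{P(S_1^n)\,P(O_1^n)}
  &= \sum_{i=1}^{n} \log \frac{P(s_i, O_1^n)}{P(s_i)\,P(O_1^n)} \\
\Rightarrow\quad \log P(S_1^n \mid O_1^n)
  &= \log P(S_1^n) - \sum_{i=1}^{n} \log P(s_i) + \sum_{i=1}^{n} \log P(s_i \mid O_1^n)
\end{align*}
```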

16/29

k-NN Algorithm for Computing $P(s_i \mid O_1^n)$ (1/2)

Assume $P(s_i \mid O_1^n) \approx P(s_i \mid E_i)$, where the pattern entry $E_i = o_{i-N} \ldots o_i \ldots o_{i+N}$.

The k-NN algorithm estimates $P(\cdot \mid E_i)$ by first finding the K nearest neighbors $E_i^1, E_i^2, \ldots, E_i^K$ of the initial pattern entry $E_i$ among the frequently occurring pattern entries, and then aggregating them to make a proper estimate of $P(\cdot \mid E_i)$.

17/29

k-NN Algorithm for Computing $P(s_i \mid O_1^n)$ (2/2)

Conditional state probability distribution

likelihood(E, E_i) is the likelihood that a pattern entry E is one of the K nearest neighbors of the initial pattern entry E_i.
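
The general idea of estimating $P(\cdot \mid E_i)$ from nearest neighbors can be sketched as below. The aggregation formula from the slide is not available, so this sketch assumes a likelihood-weighted average of the neighbors' state distributions, and it assumes a simple position-overlap score as the likelihood function; both are illustrative choices.

```python
from collections import Counter


def likelihood(entry, target):
    """Assumed similarity: fraction of aligned positions that match."""
    return sum(a == b for a, b in zip(entry, target)) / len(target)


def knn_state_distribution(target, table, k=3):
    """table maps frequently occurring pattern entries (tuples) to a Counter
    of observed states. Returns an estimate of P(state | target)."""
    neighbours = sorted(table, key=lambda e: likelihood(e, target), reverse=True)[:k]
    scores, total = Counter(), 0.0
    for entry in neighbours:
        weight = likelihood(entry, target)
        n = sum(table[entry].values())
        for state, count in table[entry].items():
            scores[state] += weight * count / n   # weighted relative frequency
        total += weight
    if total == 0:
        return {}
    return {state: value / total for state, value in scores.items()}
```

Because an unseen pattern entry borrows evidence from similar frequent entries, it still receives a usable state distribution, which is how the approach addresses data sparseness.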

18/29

Post-processing: Cascaded Entity Name Resolution (1/2)

Six patterns are extracted from GENIA:

<ENTITY> := <ENTITY> + head noun, e.g. <PROTEIN> binding motif -> <DNA>
<ENTITY> := <ENTITY> + <ENTITY>, e.g. <LIPID> <PROTEIN> -> <PROTEIN>
<ENTITY> := modifier + <ENTITY>, e.g. anti <PROTEIN> -> <PROTEIN>
<ENTITY> := <ENTITY> + word + <ENTITY>, e.g. <VIRUS> infected <MULTICELL> -> <MULTICELL>

19/29

Post-processing: Cascaded Entity Name Resolution (2/2)

<ENTITY> := modifier + <ENTITY> + head noun
<ENTITY> := <ENTITY> + <ENTITY> + head noun

In the experiments, all the rules of the above six patterns are extracted from the cascaded entity names in the training data to deal with the cascaded entity name phenomenon.
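
Applying such rules after recognition can be sketched as a simple rewrite over a sequence of words and entity tags. The rule table below reuses the slide's examples as if each were a rule mapping a sequence to an enclosing entity class, and the greedy longest-match strategy is an assumption; the sketch also keeps only the outermost class instead of the nested annotation.

```python
# Hypothetical rule table built from the examples on the slides.
RULES = {
    ("<PROTEIN>", "binding", "motif"): "<DNA>",
    ("<LIPID>", "<PROTEIN>"): "<PROTEIN>",
    ("anti", "<PROTEIN>"): "<PROTEIN>",
    ("<VIRUS>", "infected", "<MULTICELL>"): "<MULTICELL>",
}


def resolve_cascaded(items, rules=RULES):
    """Rewrite the leftmost match (longest rule first) until nothing matches.
    Every rewrite shortens the sequence, so the loop terminates."""
    items = list(items)
    by_length = sorted(rules, key=len, reverse=True)
    changed = True
    while changed:
        changed = False
        for i in range(len(items)):
            for pattern in by_length:
                if tuple(items[i:i + len(pattern)]) == pattern:
                    items[i:i + len(pattern)] = [rules[pattern]]
                    changed = True
                    break
            if changed:
                break
    return items


if __name__ == "__main__":
    print(resolve_cascaded(["the", "anti", "<PROTEIN>", "binding", "motif"]))
```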

20/29

Experiments (1/5)

Evaluate PowerBioNE on GENIA V3.0 and V1.1.

For GENIA V1.1, they select 80 abstracts for testing and the remaining 590 abstracts as the training data.

For GENIA V3.0, they select 200 abstracts as the test data and the remaining 1,800 abstracts as the training data.

All experiments are repeated 10 times and the evaluations are averaged over the test data.

On average, 63 rules are extracted from the cascaded entity names in GENIA V1.1, while on average 102 rules are extracted from the cascaded entity names in the training data of GENIA V3.0.

21/29

Experiments (2/5)

22/29

Experiments (2/4)

23/29

Experiments (3/5)

24/29

Analysis for Table 8

The contribution of the word formation pattern feature in the biomedical domain is very limited compared with that in the newswire domain.

The morphological pattern feature is useful.

POS after adaptation proves to be very useful in the biomedical domain.

The head noun trigger feature is very useful.

The use of the special verb trigger feature decreases the recall rate while keeping the precision.

The name alias feature only slightly improves the F-measure, by 0.6. This may be due to the complexity of the name alias phenomenon and the simple strategy applied in the system.

The pattern-based post-processing for cascaded entity name resolution proves to be very useful.

25/29

Experiments (4/5)

Adding more verb triggers only decreases the performance further.

26/29

Experiments (5/5)

27/29

Analysis for Table 10

It suggests that a stable and significant performance improvement can only be achieved by including POS with enough accuracy.

The performance of HMMs is the highest.

HMMs are better able to capture the locality of various biomedical entity names.

The feature vector-based classifiers, such as SVM, C4.5, C4.5 rules and RIPPER, cannot effectively capture the local context dependence because they assume independence between the features, while the baseline naïve Bayes classifier fails to capture local context dependence because it assumes conditional probability independence among the local context.

28/29

Error Analysis

Randomly choose 100 errors from the results.

Errors that are due to the strict annotation scheme and the annotation inconsistency in the GENIA corpus can be considered acceptable.

(total/acceptable)
Left boundary errors (15/12)
Cascaded entity name errors (17/13)
Misclassification errors (16/3)
True negative (29/12)
False positive (18/10)
Miscellaneous (11/1)

29/29

Conclusion

Proposes and integrates various evidential features, including word formation pattern, morphological pattern, POS, head noun trigger, special verb trigger and name alias feature.

The k-NN algorithm effectively resolves the data sparseness problem.

The pattern-based post-processing deals with the cascaded entity name phenomenon. It is the first system that deals with the cascaded entity name phenomenon.
