
Extracting candidate terms from medical texts

Imene Bentounsi

Lire Laboratory

Department of Software Technology and Information System

University of Constantine 2

Constantine, Algeria

[email protected]

Zizette Boufaida

Lire Laboratory

Department of Software Technology and Information System

University of Constantine 2

Constantine, Algeria

[email protected]

Abstract—In this paper, we present a new method for constructing a domain ontology from texts in the medical field. In our method, NLP tools are not used, since our goal is to reduce the amount of noise and the number of filters applied. This allows us to control the semantic quality of the Candidate Terms (CTs). The proposed method is based on the technique of semantic annotation for controlled terminology extraction via semantic resources, which we supplement with a Coloring Strategy Identification (CSI) method that identifies medical terms and extracts CTs according to their appearance in the text. In CSI, we exploit the metadata resulting from semantic annotation via some rules. Moreover, we apply a method of categorization by prototype model to overcome the perceived silence. The result of this extraction is structured in XML, representing the hierarchy of concepts that composes our medical ontology.

Keywords—knowledge extraction; noise; filters; semantic annotation; coloration; XML document; ontology.

I. INTRODUCTION

A corpus is a collection of texts used as a sample of the language [1]. From such texts, written in standard language, knowledge extraction is possible. This is a non-trivial process that builds a valid, new, potentially useful and understandable knowledge model [2]. But according to [3], the lists produced by extractors are not perfect: they include sequences that are not correct, and these errors are called noise.

Moreover, current work encounters a number of errors related to the use of extraction tools, such as segmentation errors in sentences and words, parsing errors… To overcome this noise, a set of filters is proposed in the literature. A filter is a program that processes a data stream in order to block the noise. We notice that there is a close relationship between the noise generated and the number of filters applied: the more noise an extractor produces, the more filters must be applied.
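To make this relationship concrete, here is a toy sketch (ours, not the filters of [4] or [5]) in which each filter blocks the candidate terms matching one kind of noise, and the filters run in sequence, each depending on the previous one's output:

# Toy noise filters over a list of candidate terms (hypothetical criteria).
def too_long(term):
    return len(term.split()) > 4          # size noise: too-long candidates

def too_general(term):
    return term.lower() in {"structure", "element"}  # meaning noise

def apply_filters(candidates, filters):
    for f in filters:                     # applied in a fixed order: each filter
        candidates = [c for c in candidates if not f(c)]  # sees the previous output
    return candidates

print(apply_filters(
    ["coronary bifurcation", "structure",
     "angiography of the anterior interventricular artery"],
    [too_long, too_general]))
# -> ['coronary bifurcation']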

In [4, 5], a set of filters is applied once the lexicon is analyzed. They refine the identification of units considered external to the domain of expertise. They are usually applied in large numbers and in a meaningful order, which makes them dependent on each other. Our challenge is to obtain less noise with fewer filters. For this, we propose a new method for extracting knowledge from texts. It is based on a controlled terminology extraction process supported by semantic resources, considered as a proven and certified reference.

Terms belonging to a known referential are identified by a coloring process exploiting metadata. We extract these terms in a structured way, enriched by the use of the principle of semantic annotation. An annotation is the equivalent of a note added by way of comment or explanation, or even the association of colors. The principle of semantic annotation is almost the same, except that the latter conveys the sense through metadata inserted in the document or in an external medium.

In this paper, we propose an architecture based on the controlled terminology extraction process of [5, 6]. Our contribution concerns, firstly, the elimination of morphosyntactic and syntactic analyses and, secondly, the use of CSI (Coloring Strategy Identification). Finally, we produce a structured document (in XML) with a view to the bottom-up construction of ontological resources.

We describe, in this paper, some existing work on knowledge extraction from texts; we are particularly interested in the extraction of word semantics, an aspect not considered in other works [4, 5] (section 2). Moreover, we present the general architecture of the proposed method and its various steps (section 3). We apply our approach to some sentences of a corpus and illustrate the impact of the proposed method (section 4). Then we conclude (section 5) with some prospects for future work.

II. RELATED WORK

In the literature, several processes for extracting knowledge from texts are available in different areas [4, 5, 6, 7]. One of them [4] proposes a linguistically equipped method for identifying the evolution of knowledge from texts in the spatial domain. Its purpose is to obtain a suitable list of CTs from the Syntex parser [8], but noise is generated in three forms. To remedy these errors, [4] applies a series of filters based on the frequency of the candidate, its grammatical category, its syntactic form and the domain it belongs to.

This process is based on free terminology extraction; this traditional method of extraction via the Syntex parser uses a tagged corpus (TreeTagger [9]) as its only source of information. In addition, it generates a result with a low semantic level.

The advantage of this method is low silence: most of the proposed terms are not forgotten. Unfortunately, it generates a large flow of information in the form of lists. These lists contain noise concerning syntax (parsing errors), size (too-long candidates) and meaning (too-general candidates).

On the other hand, a question/answer system for the medical field was created [5]. The proposed approach is the integration of semantic annotation using UMLS (Unified Medical Language System) [10] semantic resources. But using MetaMap [11] generates two types of errors. To reduce them, MetaMap's segmentation must be corrected upstream with LingPipe models and TreeTagger. In addition, two filters are applied according to two lists: one of the most frequent segmentation errors and another of the terms whose semantic types are Quantitative, Qualitative and Functional Concept.

The use of semantic resources allows the detection of concepts and relationships with more precision and at a higher level. But this process still produces a long and noisy list (syntax: segmentation into sentences and words; semantics: too-general candidates).

We notice that in [4] and [5] the noise is related to the use of syntactic (Syntex) and morphosyntactic (MetaMap) analysis tools. Indeed, the extraction is made on a cleaned and tagged corpus, where NLP tools are applied to generate long lists of CTs that remain to be filtered. The process may seem short, but we notice that in [4] three filters were used.

In [6], the authors focus on access to semantic content in a specialized language for the extraction of medical prescriptions. They rely on existing linguistic resources backed by extraction rules and lexicon lists, without using external tools such as taggers, parsers or lemmatizers. This allows rapid, quality results with little noise. But the obtained result is in a list format, which makes it difficult to use, as in [7] for the exploitation of medical corpora extracted from the Internet.

The treatment of a large volume of texts needs some organization. Thus, structuring the results of knowledge extraction makes the created resources easily reusable.

To overcome the problems previously mentioned, we propose a new method for extracting knowledge from a text corpus, based on a monitored and improved terminology extraction. We use a linguistic resource through the semantic annotation technique.

III. PROPOSED METHOD

For the extraction of knowledge from texts whose objective is the construction of a domain ontology, we move towards a controlled, rule-based extraction strategy in order to:

1. Reduce the noise generated by analyses.

2. Reduce the number of filters to remove noise.

Starting from our previous work [12], some improvements in the overall process are presented, where two main steps are considered:

Semantic Annotation: from linguistic resources, it sets the interpretation of a term by associating explicit and formal semantics with it through metadata.

Coloring Strategy Identification (CSI): the foundation of our contribution. In this step, we exploit the metadata resulting from the previous step via some rules [12]. Our goal is to reach a structured XML document without using syntactic and morphosyntactic analysis.

Fig. 1. Architecture of the proposed system

From unstructured data written in a standard language, the semantic annotation is applied. Through the CSI, we recover the resulting metadata using some rules. Our goal is the identification of domain terms by a coloring strategy in the text. This coloring allows the division of CTs according to their appearance in the text. We propose an XML output for the CTs extracted from the texts. The global architecture of our approach is summarized in Fig. 1.

In the following sections, we show the interest of applying rules to the metadata resulting from the semantic annotation step. We also explain the coloring process and all the steps necessary to reach a structured document.

We illustrate the different steps of our approach with the medical field as an application.

A. Description of the system

The architecture of our system is described as follows:

1) Corpus presentation: The corpus is made of medical reports collected manually (the collection is still in progress) from the archives of the Ibn Badis hospital of Constantine. They represent medical observations of cardiac patients written by specialists. Each report deals with the reason for admission, cardiovascular risk factors, patient treatment, allergies, the history of the disease and discharge requirements.

2) Semantic Annotation: Semantic resources allow quick access to relevant information, taking advantage of thesauri, Metathesauri and existing dictionaries to describe and represent domain knowledge simply.


Our goal is to build an ontology from medical reports. We exploit the UMLS Metathesaurus, a large, hybrid and multilingual medical terminology resource. It has two main sources of knowledge:

The Metathesaurus, which is based on unified medical concepts.

The semantic network, which specifies the semantic types used to categorize all the concepts defined in the medical Metathesaurus [13].

UMLS helps us build the basic structure (core) of our domain ontology. It identifies the standard terms of the medical domain (concept instances), classified according to semantic groups and types. This can match the levels of a hierarchical ontology. The UMLS also includes taxonomic (hierarchical) and non-taxonomic (semantic) relationships, which are accessible only through the UMLSKS web service. From there, several modifications and adaptations should be provided. The semantic annotation is based on the MetaMap tool, which makes the connection between the concepts of the UMLS and the terms of the corpus. We exploit the resulting metadata, which is encoded in XMLf (formatted XML). The latter is a structured, human-readable language (XML) in tree form. For each term, it specifies the concept name as matched in UMLS in the tag <CandidateMatched>, its lexical category <LexCat>, its semantic type <SemType>, its CUI <CandidateCUI>… The CUI (Concept Unique Identifier) is a code assigned by the UMLS to key concepts, where each concept has its own CUI. The semantic type describes the key concept. For example, the semantic types of the concept Ventricular are “spatial concept” and “body part, organ or organ component”.
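As an illustration of how this metadata can be exploited programmatically, here is a minimal sketch of ours (not part of the original method); the file name and the exact nesting of the XMLf output are assumptions:

import xml.etree.ElementTree as ET

def extract_candidates(xmlf_path):
    """Collect (matched concept, CUI, semantic types) triples from a
    MetaMap XMLf output file (path and nesting are hypothetical)."""
    tree = ET.parse(xmlf_path)
    candidates = []
    for cand in tree.iter("Candidate"):   # one element per matched term
        matched = cand.findtext("CandidateMatched")
        cui = cand.findtext("CandidateCUI")
        sem_types = [st.text for st in cand.iter("SemType")]
        if matched and cui:
            candidates.append((matched, cui, sem_types))
    return candidates

# e.g. extract_candidates("report.xmlf") might yield
# [("Angiography", "C0002978", ["diap"]), ...]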

3) Coloring Strategy Identification: The coloring identification strategy provides several advantages:

Identification of CTs without NLP tools.

Abstraction of irrelevant information: determiners and pronouns (the, he, she, it), the auxiliaries be and have...

Generation of new concepts, called composed concepts.

Distinction of the key concepts of the ontology.

Conservation of the meaning and structure of the text.

Construction of a semantically rich and reusable XML document.

This step consists of several sub-steps as follows:

a) Coloration from metadata: we retrieve the CandidateMatched tag and the CUI tag of the terms identified as belonging to the medical field. The result of this retrieval is the coloring (the assignment of a color) of these terms in the original text (a code sketch follows).
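A possible sketch of this sub-step (our illustration only; the color map and the plain-text color marks are invented for the example, since the real system colors terms visually):

import re

GROUP_COLORS = {"proc": "blue", "anat": "red", "conc": "green"}  # invented mapping

def colorize(text, candidates):
    """Mark every matched term in the original text with a color
    chosen by its semantic group; candidates = (term, group) pairs."""
    for term, group in candidates:
        color = GROUP_COLORS.get(group, "gray")
        text = re.sub(rf"\b{re.escape(term)}\b",
                      lambda m, c=color: f"[{c}]{m.group(0)}[/{c}]",
                      text, flags=re.IGNORECASE)
    return text

print(colorize("Angiography of the bifurcation coronary",
               [("Angiography", "proc"), ("coronary", "anat")]))
# -> [blue]Angiography[/blue] of the bifurcation [red]coronary[/red]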

b) Validation of the coloration step: we notice that there is silence in the UMLS Metathesaurus: it does not include all the medical vocabulary. Indeed, some terms such as coronarography, repolarization, sudation, precordialgia, retrosternal… are missing. Thus, we have to develop an English medical dictionary of synonyms for adding the concepts that are not found in UMLS.

c) Structuration: we generate a structured XML document using the following rules:

Tagging: an XML tag is a colored term or a sequence of colored terms.

Sorting: performed according to semantic types. A parent tag (x) contains a sub-tag whose <SemType> = x. But if the tag includes a sequence of more than one word whose semantic types differ, then the semantic type becomes heterogeneous.

We adapt an existing model of categorization [14] for sorting the new concepts (found during the validation of the coloration) according to semantic types, as follows (a code sketch is given at the end of this sub-step):

X is a medical term, D is a dictionary and UMLS is a Metathesaurus, with X ∈ D and X ∉ UMLS. {A, B, C} is the set of synonyms of X in D. The semantic type of X is found if and only if one of the terms of the set {A, B, C} is found in UMLS.

Categorizing is performed according to semantic groups, of which UMLS counts fifteen: Activities & Behaviors, Anatomy, Chemicals & Drugs…
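A minimal sketch of this synonym-based categorization, assuming hypothetical lookup functions for the dictionary D and for UMLS (both stand in for the real resources):

def semantic_type(term, synonyms_of, umls_types):
    """Return the UMLS semantic types of `term`, falling back on its
    dictionary synonyms when the term itself is absent from UMLS."""
    types = umls_types(term)
    if types:                                # X found directly in UMLS
        return types
    for synonym in synonyms_of(term):        # the set {A, B, C} taken from D
        types = umls_types(synonym)
        if types:                            # type of X := type of a known synonym
            return types
    return None                              # silence remains: X stays uncategorized

# Toy lookups: 'coronarography' is absent from UMLS, but its synonym is known.
umls = {"angiography": ["Diagnostic Procedure"]}
print(semantic_type("coronarography",
                    synonyms_of=lambda t: ["angiography"],
                    umls_types=lambda t: umls.get(t)))
# -> ['Diagnostic Procedure']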

B. Illustration

1) Applying semantic annotation via MetaMap to the phrase below gives the following XMLf code:

« Angiography of the bifurcation Coronary Inter ventricular anterior diagonal with active stents or aorta bypass coronary »

<MMOs>
<Command>metamap11.binary.x86-win32-nt-4 -L 2011 -Z 2011AA --XMLf …</Command>
…
<PhraseText>angiography</PhraseText>
<LexCat>noun</LexCat>
<CandidateCUI>C0002978</CandidateCUI>
<CandidateMatched>Angiography</CandidateMatched>
…

Each term matched in UMLS has its concept name in <CandidateMatched> and its identifier in <CandidateCUI>.

2) Coloration Strategy Identification: we project the matching result onto the initial sentence. Our goal is the identification of the key concepts in the text, as shown in Fig. 2.

Fig. 2. After coloration

3) Validation of the coloration step: in this case, all the key concepts composing the phrase are found.

4) XML structuration: we illustrate the results of the three sub-steps in the XML block of Fig. 3:


<?xml version="1.0" encoding="UTF-8"?>
<concepts>
  <proc><diap>Angiography</diap></proc>
  <bifurcationCoronaryInterVentricularAnteriorDiagonal>
    <proc><topp>bifurcation</topp></proc>
    <anat><bpoc>coronary</bpoc></anat>
    <conc><spco>inter</spco></conc>
    <conc><tmco>inter</tmco></conc>
    <anat><bpoc>ventricular</bpoc></anat>
    <conc><spco>ventricular</spco></conc>
    <conc><spco>anterior</spco></conc>
    <conc><spco>diagonal</spco></conc>
  </bifurcationCoronaryInterVentricularAnteriorDiagonal>
  <activestents>
    <conc><ftcn>active</ftcn></conc>
    <devi><medd>stents</medd></devi>
  </activestents>
  <aortabypasscoronary>
    <anat><bpoc>aorta</bpoc></anat>
    <proc><topp>bypass</topp></proc>
    <anat><bpoc>coronary</bpoc></anat>
  </aortabypasscoronary>
</concepts>

Fig. 3. XML code produced after the extraction of CTs via the CSI
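As a complement, the composed-concept tags of Fig. 3 (e.g. activestents, aortabypasscoronary) can be sketched as the concatenation of a run of adjacent colored terms; this is our illustration of the tagging rule, not the authors' code, and the lowercase join is an assumption:

def compose_tag(colored_run):
    """Concatenate a run of adjacent colored terms into one tag name."""
    return "".join(term.lower() for term in colored_run)

print(compose_tag(["active", "stents"]))             # -> activestents
print(compose_tag(["aorta", "bypass", "coronary"]))  # -> aortabypasscoronary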

IV. FROM TEXT TO DOMAIN ONTOLOGY

In the UMLS Metathesaurus, one term can have several semantic types, which leads to an encumbered hierarchy. By applying the previously described treatments to the concept hierarchy, we obtain new composed concepts through the CSI. This leads to a non-strict tree with significantly fewer inheritance relations. We have reproduced the concept hierarchy of the XML code of Fig. 3, as illustrated in Fig. 4.

Fig. 4. Application of our approach

From top to bottom, we find the semantic groups at the highest level. At the intermediate levels, there are specific concepts (semantic types), and at the lowest level there are the instances of concepts.
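To illustrate these three levels, the following sketch of ours walks XML in the form of Fig. 3 and prints group > type > instance triples; the group-name expansions are the standard UMLS semantic-group labels, which we assume the abbreviated tags stand for:

import xml.etree.ElementTree as ET

GROUPS = {"proc": "Procedures", "anat": "Anatomy",
          "conc": "Concepts & Ideas", "devi": "Devices"}  # assumed expansions

def print_hierarchy(xml_text):
    """Print semantic group > semantic type > concept instance."""
    root = ET.fromstring(xml_text)
    for group_node in root.iter():
        if group_node.tag in GROUPS:
            for type_node in group_node:     # semantic-type level
                term = (type_node.text or "").strip()
                print(f"{GROUPS[group_node.tag]} > {type_node.tag} > {term}")

print_hierarchy("""<concepts>
  <proc><diap>Angiography</diap></proc>
  <anat><bpoc>coronary</bpoc></anat>
</concepts>""")
# Procedures > diap > Angiography
# Anatomy > bpoc > coronary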

V. CONCLUSION

With the aim of constructing a domain ontology from texts, we have proposed a new method that generates an XML format instead of a list format. We also try to reduce the number of filters and the noise perceived during the extraction of knowledge from the corpus. In this controlled extraction method, we propose the combination of two annotation techniques: (1) the semantic annotation of the texts and (2) their coloration.

CSI exploits the metadata via some developed rules, which allow the extraction of CTs in a structured way. This leads to the extraction of CTs from the texts without using syntactic and morphosyntactic analysis, and allows us to significantly reduce the amount of noise and the number of filters attached to it.

The application of our approach to the cardiac reports directed us to assist the Metathesaurus because of the noticed silence, and the changes made via the CSI lead to semantically complete CTs. Their division depends on the localization of the key concepts in the text. This leads to a change in the initial UMLS hierarchical tree, favoring a less cluttered hierarchy.

REFERENCES

[1] J. Sinclair, “Corpus Typology. A Framework for Classification”, in G. Melchers and B. Warren (eds), Studies in Anglistics, Stockholm: Almquist and Wiksell International, 1995, vol. 8, pp. 17-34.

[2] Y. Toussaint, “Extraction de connaissances à partir de textes structurés”, Document numérique 2004/3, vol. 8, pp. 11-34. DOI: 10.3166/dn.8.3.11-34

[3] M. C. L’Homme, “Nouvelles technologies et recherche terminologique. Techniques d’extraction des données terminologiques et leur impact sur le travail du terminologue”, in L’impact des nouvelles technologies sur la gestion terminologique, Toronto: Université York, 2001.

[4] A. Picton, “Diachronie en langue de spécialité. Définition d’une méthode linguistique outillée pour repérer l’évolution des connaissances en corpus : un exemple appliqué au domaine spatial”, PhD thesis in language sciences, Université de Toulouse, October 2009.

[5] A. Ben Abacha, P. Zweigenbaum, “Annotation et interrogation sémantiques de textes médicaux”, in Actes Atelier Web Sémantique Médical 2010 à IC 2010, Nîmes, pp. 61-70.

[6] C. Grouin, L. Deléger, B. Cartoni, S. Rosset, P. Zweigenbaum, “Accès au contenu sémantique en langue de spécialité : extraction des prescriptions et concepts médicaux”, in Traitement Automatique des Langues Naturelles (TALN 2011), Montpellier, June 27-July 1, 2011, pp. 109-120.

[7] T. Delbecque, P. Zweigenbaum, “Exploitation de corpus médicaux extraits d’internet : une expérience”, in Le Web comme ressource pour le TAL, Journée d’étude ATALA, Paris, 2006.

[8] D. Bourigault, C. Fabre, C. Frérot, M.-P. Jacques, S. Ozdowska, “Syntex, analyseur syntaxique de corpus”, in Actes des 12èmes journées sur le Traitement Automatique des Langues Naturelles, Atelier EASY (Évaluation des Analyseurs SYntaxiques), Dourdan, June 2005.

[9] H. Schmid, “Probabilistic Part-of-Speech Tagging Using Decision Trees”, in Proceedings of the International Conference on New Methods in Language Processing (ICNLP), Manchester, 1994, pp. 44-49.

[10] C. Lindberg, “The Unified Medical Language System of the National Library of Medicine”, Journal of the American Medical Record Association, 1990, vol. 61(5), pp. 40-42.

[11] A. R. Aronson, “Effective mapping of biomedical text to the UMLS Metathesaurus: the MetaMap program”, Journal of the American Medical Informatics Association, vol. 8, pp. 17-21, 2001.

[12] I. Bentounsi, Z. Boufaida, “Réduction du nombre de filtres pour l’extraction des connaissances à partir d’un texte”, in Actes Extraction et Gestion des Connaissances–Maghreb (EGC-M), Hammamet, Tunisie, 2012, pp. 89-94.

[13] T. Merabti, H. Abdoune, T. Lecroq, M. Joubert, S. J. Darmoni, “Projection des relations SNOMED CT entre les termes de deux terminologies (CIM10 et SNOMED 3.5)”, Informatique et Santé, Springer, 2009, vol. 17, pp. 79-88.

[14] C. Mervis, E. Rosch, “Categorization of Natural Objects”, Annual Review of Psychology, 1981, vol. 32, pp. 89-113.