Extracting candidate terms from medical texts
Imene Bentounsi
Lire Laboratory
Department of Software Technology and Information
System
University of Constantine 2
Constantine, Algeria
bentimene@hotmail.fr
Zizette Boufaida
Lire Laboratory
Department of Software Technology and Information
System
University of Constantine 2
Constantine, Algeria
zboufaida@gmail.com
Abstract—In this paper, we present a new method for constructing a domain ontology from texts in the medical field. Our method does not use NLP tools, since our goal is to reduce both the amount of noise and the number of filters applied; this allows us to control the semantic quality of Candidate Terms (CT). The proposed method is based on the technique of semantic annotation for controlled terminology extraction via semantic resources, which we supplement with a Coloring Strategy Identification (CSI) method that identifies medical terms and extracts CTs according to their appearance in the text. In CSI, we exploit the metadata resulting from semantic annotation through a set of rules. Moreover, we apply a categorization method based on a prototype model to overcome the perceived silence. The result of this extraction is structured in XML, representing the hierarchy of concepts that composes our medical ontology.
Keywords—knowledge extraction; noise; filters; semantic
annotation; coloration; document in XML; ontology.
I. INTRODUCTION
A corpus is a collection of texts used as a sample of the language [1]. From such texts written in standard language, knowledge extraction is possible. This is a non-trivial process that builds a valid, new, potentially useful, and understandable knowledge model [2]. However, according to [3], the lists produced by extractors are not perfect: they include sequences that are not correct terms, and these errors are called noise.
Moreover, current work encounters a number of errors related to the use of extraction tools, such as segmentation errors in sentences and words, parsing errors, and so on. To overcome this noise, a set of filters has been proposed in the literature. A filter is a program that processes a data stream in order to block the noise. We notice a close relationship between the noise generated and the number of filters applied: indeed, when the number of filters increases, the quantity of noise also increases.
In [4, 5], a set of filters is applied once the lexicon is analyzed. These filters refine the identification of units considered external to the domain of expertise. They are usually applied in large numbers and in a meaningful order, which makes them dependent on each other. Our challenge is to obtain less noise with fewer filters. To this end, we propose a new method for extracting knowledge from texts, based on a controlled terminology extraction process supported by semantic resources considered as an experienced and certified reference.
Terms belonging to known referentials are identified by a coloring process exploiting metadata. We extract these terms in a structured way, enriched by the principle of semantic annotation. An annotation is the equivalent of a note added by way of comment or explanation, or even an association of colors. Semantic annotation follows almost the same principle, except that it conveys meaning through metadata inserted in the document itself or stored in an external medium.
In this paper, we propose an architecture based on the controlled terminology extraction process of [5, 6]. Our contribution concerns, firstly, the elimination of morphosyntactic and syntactic analyses and, secondly, the use of CSI (Coloring Strategy Identification). Finally, we produce a structured document (in XML) with a view to the bottom-up construction of ontological resources.
We first describe some existing work on knowledge extraction from texts; we are particularly interested in the extraction of word semantics, an aspect not considered in other works [4, 5] (Section 2). We then present the general architecture of the proposed method and its various steps (Section 3). We apply our approach to some sentences of a corpus and illustrate the impact of the proposed method (Section 4). Finally, we conclude (Section 5) with some prospects for future work.
II. RELATED WORK
In the literature, several processes for extracting knowledge from texts are available in different areas [4, 5, 6, 7]. One of them [4] proposes a linguistically equipped method for identifying the evolution of knowledge from texts in the spatial domain. Its purpose is to obtain a suitable list of CTs from the Syntex parser [8], but noise is generated in three forms. To remedy these errors, [4] applies a series of filters based on the frequency of the candidate, its grammatical category, its syntactic form, and the domain it belongs to. This process relies on free terminology extraction; this traditional extraction method via the Syntex parser uses a labeled corpus (TreeTagger [9]) as its only source of information. In addition, it generates a result with a low semantic level.
978-1-4799-0792-2/13/$31.00 ©2013 IEEE
The advantage of this method is its low silence; most of the proposed terms are not forgotten. Unfortunately, it generates a large flow of information in the form of lists. These lists contain noise in terms of syntax (parsing errors), size (candidates that are too long), and meaning (candidates that are too general).
On the other hand, a question/answer system for the medical field was created [5]. The proposed approach integrates semantic annotation using the UMLS (Unified Medical Language System) [10] semantic resources. However, using MetaMap [11] generates two types of errors. To reduce them, segmentation prior to MetaMap is required through the upstream use of LingPipe models and TreeTagger. In addition, two filters are applied according to two lists: one of the most frequent segmentation errors, and another of the terms whose semantic types are Quantitative, Qualitative, and Functional Concept.
The use of semantic resources allows concepts and relationships to be detected with more precision and at a higher level. However, this process still produces a long and noisy list (syntactic noise: segmentation into sentences and words; semantic noise: candidates that are too general).
We notice that in [4] and [5], the noise is related to the use of syntactic (Syntex) and morphosyntactic (MetaMap) analysis tools. Indeed, the extraction is performed on a cleaned and labeled corpus, where NLP tools are applied to generate long lists of CTs that remain to be filtered. The process may seem short, but we notice that in [4] three filters were used.
In [6], the authors focus on access to semantic content in a specialized language for the extraction of medical prescriptions. They rely on existing linguistic resources backed by extraction rules and lexicon lists, without using external tools such as taggers, parsers, or lemmatizers. This allows rapid, quality results with little noticed noise. However, the obtained result is in a list format, which makes its use difficult, as in [7] for the exploitation of a medical corpus extracted from the Internet.
The treatment of a large volume of texts needs some organization. Structuring the knowledge extraction results makes the created resources easily reusable. To overcome the problems mentioned above, we propose a new method for extracting knowledge from a text corpus based on monitored and improved terminology extraction. We use a linguistic resource through the semantic annotation technique.
III. PROPOSED METHOD
For the extraction of knowledge from texts whose objective is the construction of a domain ontology, we move towards a controlled extraction strategy based on rules in order to:
1. Reduce the noise generated by analyses.
2. Reduce the number of filters needed to remove noise.
Building on our previous work [12], some improvements in the overall process are presented, where two main steps are considered:
Semantic Annotation: from linguistic resources, it sets the interpretation of a term by associating explicit and formal semantics with it through metadata.
Coloring Strategy Identification (CSI): this is the foundation of our contribution. In this step, we exploit the metadata resulting from the previous step through some rules [12]. Our goal is to reach a structured XML document without using syntactic and morphosyntactic analysis.
Fig. 1. Architecture of the proposed system
Starting from unstructured data written in a standard language, semantic annotation is applied. Through the CSI, we recover the resulting metadata using some rules. Our goal is the identification of domain terms by a coloring strategy in the text. This coloring allows the division of CTs according to their appearance in the text. We propose an XML output for the CTs extracted from texts. The global architecture of our approach is summarized in Fig. 1.
In the following sections, we show the interest of applying rules to the metadata resulting from the semantic annotation step. We also explain the coloring process and all the steps necessary to reach a structured document. We illustrate the different steps of our approach with the medical field as an application.
A. Description of the system
The architecture of our system is described as follows:
1) Corpus presentation: The reports are collected manually (collection still in progress) from the archives of the Ibn Badis hospital of Constantine. They represent medical observations of cardiac patients written by specialists. Each report deals with the reason for admission, cardiovascular risk factors, patient treatment, allergies, the history of the disease, and discharge requirements.
2) Semantic Annotation: Semantic resources allow quick access to relevant information by taking advantage of existing thesauri, Metathesauri, and dictionaries to describe and represent domain knowledge simply.
Our goal is to build an ontology from medical reports. We exploit the UMLS Metathesaurus, a large, hybrid, and multilingual medical terminology resource. It has two main sources of knowledge:
The Metathesaurus, based on unified medical concepts.
The semantic network, which specifies the semantic types used to categorize all the concepts defined in the medical Metathesaurus [13].
UMLS helps us build the basic structure (core) of our domain ontology. It identifies the standard terms of the medical domain (concept instances), classified according to semantic groups and types; these can match the levels of a hierarchical ontology. The UMLS also includes different taxonomic (hierarchical) and non-taxonomic (semantic) relationships, which are accessible only from the UMLSKS web service. From there, several modifications and adaptations must be provided.
The semantic annotation relies on the MetaMap tool to make the connection between the UMLS concepts and the terms of the corpus. We exploit the resulting metadata, encoded in XMLf (Formatted XML). The latter is a structured language (XML), readable by humans and in tree form. For each term, it specifies the matched UMLS concept name in the tag <CandidateMatched>, its lexical category in <LexCat>, its semantic type in <SemType>, its CUI in <CandidateCUI>, and so on. The CUI (Concept Unique Identifier) is a code assigned by the UMLS to key concepts, where each concept has its own CUI. The semantic type describes the key concept. For example, the semantic types of the concept Ventricular are "Spatial Concept" and "Body Part, Organ, or Organ Component".
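As a minimal sketch of how the XMLf metadata could be consumed, the snippet below pulls (matched term, CUI, semantic types) out of a MetaMap-style fragment with Python's standard ElementTree. The tag names follow the excerpt given in this paper; the exact nesting of real XMLf output may differ, so the sample structure here is an assumption.

```python
# Sketch: extracting candidate metadata from XMLf-like output.
# Tag names (<CandidateMatched>, <CandidateCUI>, <SemType>) follow the
# paper's excerpt; the surrounding structure is a simplified assumption.
import xml.etree.ElementTree as ET

def extract_candidates(xmlf_string):
    """Return one dict per annotated candidate term."""
    root = ET.fromstring(xmlf_string)
    candidates = []
    # Search the whole tree, since candidates sit inside phrase elements.
    for cand in root.iter("Candidate"):
        candidates.append({
            "matched": cand.findtext("CandidateMatched"),
            "cui": cand.findtext("CandidateCUI"),
            "semtypes": [st.text for st in cand.iter("SemType")],
        })
    return candidates

# Self-contained sample mimicking the paper's fragment.
sample = """
<MMOs><Phrase>
  <PhraseText>angiography</PhraseText><LexCat>noun</LexCat>
  <Candidate>
    <CandidateCUI>C0002978</CandidateCUI>
    <CandidateMatched>Angiography</CandidateMatched>
    <SemTypes><SemType>diap</SemType></SemTypes>
  </Candidate>
</Phrase></MMOs>
"""
print(extract_candidates(sample))
```

The list of dicts produced here is the kind of metadata the CSI step then projects back onto the source text.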
3) Coloring Strategy Identification: The coloring identification strategy provides several advantages:
Identification of CTs without NLP tools.
Abstraction of irrelevant information: articles (the), pronouns (he, she, it), the auxiliaries be and have, etc.
Generation of new concepts, called composed concepts.
Distinction of the key concepts of the ontology.
Conservation of the meaning and structure of the text.
Construction of a semantically rich and reusable XML document.
This step consists of several sub-steps as follows:
a) Coloration from metadata: we retrieve the CandidateMatched tag and the CUI tag of the terms identified as belonging to the medical field. The result of this retrieval is the coloring (assignment of a color) of these terms in the original text.
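The coloration sub-step can be sketched as follows: terms recovered from the annotation metadata are marked where they occur in the original sentence (brackets stand in for colors here). The paper does not detail its matching procedure, so the plain case-insensitive scan below is an assumption.

```python
# Sketch of coloration: wrap every occurrence of each matched term in
# marker characters (a stand-in for assigning a color in the text).
import re

def color_terms(text, matched_terms, open_mark="[", close_mark="]"):
    """Mark each occurrence of each matched term in the text."""
    colored = text
    # Longer terms first, so multi-word terms are not split by shorter ones.
    for term in sorted(matched_terms, key=len, reverse=True):
        pattern = re.compile(re.escape(term), re.IGNORECASE)
        colored = pattern.sub(lambda m: open_mark + m.group(0) + close_mark,
                              colored)
    return colored

print(color_terms("Angiography of the bifurcation coronary",
                  ["angiography", "bifurcation", "coronary"]))
# Each recognized term is now visually delimited in the running text.
```

Words left unmarked (articles, auxiliaries, and so on) are exactly the irrelevant information the CSI abstracts away.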
b) Validation of the coloration step: we notice that there is silence in the UMLS Metathesaurus, which does not cover the whole medical vocabulary. Indeed, some terms such as coronarography, repolarization, sudation, precordialgia, and retrosternal are missing. Thus, we have to develop an English medical dictionary of synonyms to add the concepts not found in UMLS.
c) Structuration: we generate a structured XML document according to the following rules:
Tagging: an XML tag is a colored term or a sequence of colored terms.
Sorting: performed according to semantic types. A parent tag (x) contains a sub-tag whose <SemType> = x. If the tag includes a sequence of more than one word whose semantic types differ, the semantic type becomes heterogeneous.
We adapt an existing categorization model [14] for sorting the new concepts (found during coloration validation) according to semantic types, as follows:
X is a medical term, D is a dictionary, and UMLS is a Metathesaurus.
X ∈ D and X ∉ UMLS.
{A, B, C} is the set of synonyms of X in D.
The semantic type of X is found if and only if one of the terms of the set {A, B, C} is found in UMLS.
Categorizing: performed according to semantic groups, of which UMLS counts fifteen: Activities & Behaviors, Anatomy, Chemicals & Drugs, etc.
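The synonym-based categorization rule above can be sketched as a small lookup: a term absent from UMLS inherits the semantic type of the first of its dictionary synonyms that UMLS does know. Both resources are stubbed as plain Python mappings here, and the sample data is hypothetical.

```python
# Sketch of the prototype categorization rule: X ∈ D, X ∉ UMLS, and the
# semantic type of X is found iff one of its synonyms is found in UMLS.
def semantic_type_via_synonyms(term, dictionary, umls_types):
    """dictionary: term -> set of synonyms; umls_types: term -> semantic type."""
    if term in umls_types:                      # X ∈ UMLS: nothing to do
        return umls_types[term]
    for synonym in dictionary.get(term, ()):    # X ∈ D, X ∉ UMLS
        if synonym in umls_types:
            return umls_types[synonym]
    return None                                 # silence remains

# Hypothetical data: "coronarography" is absent from UMLS (the silence the
# paper reports), but its synonym "angiography" is present.
D = {"coronarography": {"coronary angiography", "angiography"}}
U = {"angiography": "diap"}
print(semantic_type_via_synonyms("coronarography", D, U))
```

A `None` result marks a residual silence that the hand-built synonym dictionary would need to be extended to cover.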
B. Illustration
1) Applying semantic annotation via MetaMap to the phrase below gives the following XMLf code:
« Angiography of the bifurcation Coronary Inter ventricular anterior diagonal with active stents or aorta bypass coronary »
<MMOs>
<Command>metamap11.binary.x86-win32-nt-4 -L 2011 -Z 2011AA –XMLf…</Command>
<PhraseText>angiography</PhraseText>
<LexCat>noun</LexCat>
<CandidateCUI>C0002978</CandidateCUI>
<CandidateMatched>Angiography</CandidateMatched>
Each term in UMLS has its concept name in <CandidateMatched> and its identifier in <CandidateCUI>.
2) Coloring Strategy Identification: we project the matching result onto the initial sentence. Our goal is the identification of the key concepts in the text, as shown in Fig. 2.
Fig. 2. After coloration
3) Validation of the coloration step: in this case, all the key concepts composing the phrase are found.
4) XML structuration: we illustrate the results of the three sub-steps in the following XML block.
<?xml version="1.0" encoding="UTF-8"?>
<concepts>
  <proc><diap>Angiography</diap></proc>
  <bifurcationCoronaryInterVentricularAnteriorDiagonal>
    <proc><topp>bifurcation</topp></proc>
    <anat><bpoc>coronary</bpoc></anat>
    <conc><spco>inter</spco></conc>
    <conc><tmco>inter</tmco></conc>
    <anat><bpoc>ventricular</bpoc></anat>
    <conc><spco>ventricular</spco></conc>
    <conc><spco>anterior</spco></conc>
    <conc><spco>diagonal</spco></conc>
  </bifurcationCoronaryInterVentricularAnteriorDiagonal>
  <activestents>
    <conc><ftcn>active</ftcn></conc>
    <devi><medd>stents</medd></devi>
  </activestents>
  <aortabypasscoronary>
    <anat><bpoc>aorta</bpoc></anat>
    <proc><topp>bypass</topp></proc>
    <anat><bpoc>coronary</bpoc></anat>
  </aortabypasscoronary>
</concepts>
Fig. 3. Extracted XML code produced after extraction of CT via the CSI
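The structuration rules illustrated above (each colored term becomes a tag, nested under its semantic type and semantic group) can be sketched with standard ElementTree. The `(term, group, type)` input triples and the flat nesting are simplifying assumptions; the paper's composed-concept tags are omitted for brevity.

```python
# Sketch of structuration: nest each concept instance under its semantic
# type, which in turn nests under its semantic group, as in the XML above.
import xml.etree.ElementTree as ET

def structure(triples):
    """triples: iterable of (term, semantic_group, semantic_type)."""
    root = ET.Element("concepts")
    for term, group, semtype in triples:
        group_el = ET.SubElement(root, group)       # e.g. <proc>, <anat>
        type_el = ET.SubElement(group_el, semtype)  # e.g. <topp>, <bpoc>
        type_el.text = term                         # the concept instance
    return ET.tostring(root, encoding="unicode")

xml = structure([("bifurcation", "proc", "topp"),
                 ("coronary", "anat", "bpoc")])
print(xml)
```

Emitting well-formed XML this way is what makes the extraction result reusable downstream, in contrast to the flat lists produced by the approaches discussed in Section II.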
IV. FROM TEXT TO DOMAIN ONTOLOGY
In the UMLS Metathesaurus, one term can have several semantic types, which leads to an encumbered hierarchy. By applying the previous treatments to the concept hierarchy, we obtain new composed concepts through the CSI. This leads to a non-strict tree with significantly fewer inheritance relations. We have reproduced the concept hierarchy of the XML code of Fig. 3 as illustrated in Fig. 4.
Fig. 4. Application of our approach
From top to bottom, we find the semantic groups at the highest level. At the intermediate levels, there are specific concepts (semantic types), and at the lowest level, there are the concept instances.
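The three-level hierarchy just described (semantic groups, then semantic types, then concept instances) can be read off the structured output as a nested mapping. The triples below are hypothetical inputs standing in for the extracted CTs.

```python
# Sketch of the ontology hierarchy: group -> semantic type -> instances.
from collections import defaultdict

def build_hierarchy(triples):
    """triples: (instance, semantic_group, semantic_type) -> nested dict."""
    tree = defaultdict(lambda: defaultdict(list))
    for instance, group, semtype in triples:
        tree[group][semtype].append(instance)
    # Convert to plain dicts for a stable, inspectable result.
    return {g: dict(types) for g, types in tree.items()}

h = build_hierarchy([("angiography", "Procedures", "diap"),
                     ("coronary", "Anatomy", "bpoc"),
                     ("aorta", "Anatomy", "bpoc")])
print(h)
```

Grouping instances under a single type per group is what keeps the resulting tree less cluttered than the raw UMLS hierarchy, where one term may carry several semantic types.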
V. CONCLUSION
With the aim of constructing a domain ontology from texts, we have proposed a new method that generates an XML format instead of a list format. We also try to reduce the number of filters and the noise perceived during the extraction of knowledge from a corpus. In this controlled extraction method, we propose the combination of two annotation techniques: (1) semantic annotation of texts and (2) its coloration.
CSI exploits the metadata through some developed rules. These rules allow the extraction of CTs in a structured way, which leads to the extraction of CTs from texts without using syntactic and morphosyntactic analysis. This allows us to significantly reduce the amount of noise and the number of filters attached to it.
The application of our approach to the cardiac reports leads us to assist the Metathesaurus because of the noticed silence, and the changes made via the CSI lead to semantically complete CTs. Their division depends on the localization of the key concepts in the text. This results in a change to the initial UMLS hierarchical tree, favoring a less cluttered hierarchy.
REFERENCES
[1] J. Sinclair, “Corpus Typology. A Framework for Classification.” G.Melchers & B.Warren (eds), Studies in Anglistics. Stockhom: Almquist and Wiksell International, 1995, vol 8, pp. 17-34.
[2] Y. Toussaint, ‘’Extraction de connaissances à partir de textes structurés’’, Document numérique 2004/3, Vol 8, p.11-34. DOI : 10.3166/dn.8.3.11-34
[3] M.C. L’Homme, ‘’Nouvelle technologies et recherches terminologique. Techniques d’extraction des données terminologiques et leur impact sur le travail du terminologies’’, in L’impact des nouvelles technologies sur la gestion terminologiques, Toronto : Université York. 2001
[4] A. Picton, ‘’Diachronie en langue de spécialité. Définition d’une méthode linguistique outillée pour repérer l’évolution des connaissances en corpus : Un exemple appliqué au domaine spatial ‘’, Thèse en vue de l'obtention du doctorat de l’université de Toulouse en Science du langage, (Octobre 2009)
[5] A.Ben Abacha, P. Zweigenbaum, ‘’Annotation et interrogation sémantiques de textes médicaux. In Actes Atelier Web Sémantique Médical 2010 à IC 2010, pp. 61-70, Nîmes.
[6] C. Grouin, L. Deléger, B. Cartoni, S. Rosset, P. Zweigenbauml, ‘’Accès au contenu sémantique en langue de spécialité : extraction des prescriptions et concepts médicaux’’, In Traitement Automatique de la Langue Naturelle TALN 2011, Montpellier, 27 juin- 1er juillet. p. 109-120
[7] T.Delbecque, P. Zweigenbaum, ‘’Exploitation de corpus médicaux extraits d'internet : une expérience.’’ , in Le Web comme ressource pour le TAL, Journée d'étude ATALA, Paris, 2006. ATALA.
[8] D.Bourigault, C.Fabre, C.Frérot, M.P. Jacques, S. Ozdowska, ‘’Syntex, analyseur syntaxique de corpus’’, in Acte des 12èmes journées sur le Traitement Automatique des Langues Naturelles, Atelier EASY (Évaluation des Analyseurs SYntaxiques), Dourdan, ( juin 2005)
[9] H. Schmid, ‘’Probabilistic Part-of-Speech Tagging Using Decision Trees’’, in Proceedings of International Conference on New Methods in Language Processing (ICNLP), Manchester,1994, pp. 44-49.
[10] C. Lindberg, ‘’The Unified Medical Language System of the National Library of Medicine’’. In Journal of the American Medical Record Association 1990, vol.61(5), pp.40-2.
[11] A. R. Aronson, ‘’Effective mapping of biomedical text to the UMLS metathesaurus : the MetaMap Program”, in Journal of the American Medical Informatics Association, vol. 8, pp. 17–21, 2001
[12] I.Bentounsi, Z.Boufaida, ‘’Réduction du nombre de filtres pour l’extraction des connaissances à partir d’un texte’’, in Actes Extraction et Gestion des Connaissances–Maghreb, EGC-M. Hammamet, Tunisie, 2012, pp 89-94.
[13] T. Merabti, H. Abdoune, T. Lecroq, M. Joubert, S.J. Darmoni, ‘’Projection des relations SNOMED CT entre les termes de deux terminologies (CIM10 et SNOMED 3.5)’’, Springer Journal of FIM, Informatique et Santé, 2009, vol.17, pp.79-88.
[14] C. Mervis, E. Rosch : “Categorization of Natural Objects”, in Annual Review of Psychology, 1981, vol. 32, pp. 89-113.