Louhi 2015
EMNLP 2015 Workshop, Lisbon, Portugal
Exploring Word Embedding for Drug Name Recognition
Isabel Segura-Bedmar, Víctor Suárez-Paniagua, Paloma Martínez
Computer Science Department,
Universidad Carlos III de Madrid, Spain
{isegura, vspaniag, pmf}@inf.uc3m.es
Machine Learning: Conditional Random Fields (CRF) achieves the best results in the recognition of drugs and chemical names.
Word Embedding Features: word clusters and word vectors generated using the Word2Vec tool.
DINTO ontology²: The first ontology providing a representation of drug-drug interactions (DDIs). It also contains drugs and drug classes.
Biomedical texts: The DDI corpus¹ is the dataset of SemEval-2013 Task 9.1 (Drug Name Recognition).
Results
ACKNOWLEDGEMENTS: This work was supported by TrendMiner project [FP7-ICT287863] and by eGovernAbility-Access project (TIN2014-52665-C2-2-R).
DATASET: The DDI corpus was manually annotated with a total of 18,502 pharmacological substances and 5,028 DDIs. It consists of two different datasets: DDI-DrugBank (792 texts from the DrugBank database) and DDI-Medline (233 Medline abstracts).
EVALUATION: We evaluate the recognition of pharmacological substances (exact criterion) and their classification into four types (strict criterion): drug (generic drug names), brand (branded drug names), group (drug group names) and drug-n (active substances not approved).
RESULTS: WBI is the system with the best results in SemEval-2013 Task 9.1 (Drug Name Recognition). CRF is the basic configuration. CRFD is CRF with DINTO.
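The two evaluation criteria can be illustrated with a small scoring sketch. It is not the official SemEval-2013 scorer: the entity representation as `(doc_id, start, end, type)` tuples and the function names `prf` and `evaluate` are assumptions made for this example. Under the exact criterion only the boundaries must match; under the strict criterion the type must match as well.

```python
def prf(gold, pred):
    """Precision, recall and F1 over two collections of hashable items."""
    gold, pred = set(gold), set(pred)
    tp = len(gold & pred)  # items present in both sets count as true positives
    p = tp / len(pred) if pred else 0.0
    r = tp / len(gold) if gold else 0.0
    f = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f


def evaluate(gold, pred):
    """Score predictions under the exact and strict criteria.

    Each entity is a hypothetical (doc_id, start, end, type) tuple.
    """
    def spans(entities):
        # Drop the type so that only document and boundaries are compared.
        return {(doc, start, end) for doc, start, end, _ in entities}

    return {
        "exact": prf(spans(gold), spans(pred)),  # boundaries only
        "strict": prf(gold, pred),               # boundaries and type
    }


# Toy data, invented for illustration only.
gold = [("d1", 0, 7, "drug"), ("d1", 20, 28, "brand"), ("d1", 40, 50, "group")]
pred = [("d1", 0, 7, "drug"), ("d1", 20, 28, "drug"), ("d1", 60, 65, "drug_n")]
scores = evaluate(gold, pred)
```

In this toy case the second prediction has correct boundaries but the wrong type, so it counts as a hit under the exact criterion and as a miss under the strict one.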
References
María Herrero-Zazo, Isabel Segura-Bedmar, Paloma Martínez, and Thierry Declerck. 2013. The DDI corpus: An annotated corpus with pharmacological substances and drug-drug interactions. Journal of Biomedical Informatics, 46(5):914–920.
¹ http://labda.inf.uc3m.es/ddicorpus
María Herrero Zazo. 2015. Semantic Resources in Pharmacovigilance: A Corpus and an Ontology for Drug-Drug Interactions. Ph.D. thesis, Carlos III University of Madrid, 5.
² http://www.obofoundry.org/cgibin/detail.cgi?id=DINTO
Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. 2013a. Efficient estimation of word representations in vector space. In ICLR 2013 Workshop Track.
Experiments
Motivation
We developed a machine learning-based approach that uses word embedding features and the DINTO ontology to recognize drug names in biomedical texts.
Pipeline for CRF machine learning-based Drug NER. The main hypothesis is that incorporating word embeddings and DINTO as features into a CRF model could help to recognize drugs in texts.
• A pipeline of GATE components was used to process the DDI corpus and to obtain the basic feature set composed of:
o The current token together with the previous three and the following three tokens.
o POS tags and lemmas of all these tokens.
o An orthographical feature:
upperInitial, allCaps, lowerCase and mixedCaps.
o A feature representing the type:
word, number, symbol or punctuation.
• DINTO is processed to produce a binary feature that indicates whether the current token was found in this ontology.
• We trained word embeddings using the Word2Vec tool on two different corpora, Wikipedia and MedLine:
o The word vector for the current token as new feature.
We tried different dimensions of vectors (50, 100 and 200).
o The word cluster for the current token as new feature.
We tried different k values in the k-means (50, 150 and 500).
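The per-token features listed above can be sketched in a few lines. This is only an illustration under stated assumptions, not the GATE pipeline used in the work: the function names, the `<PAD>` marker, the punctuation set, and the `lexicon` and `clusters` inputs (a set of lower-cased DINTO terms and a word-to-cluster-id mapping from k-means) are hypothetical. POS tags, lemmas and the raw word vectors are omitted because they require external tools and models.

```python
import re

# Assumed punctuation inventory; anything else non-alphanumeric is a symbol.
PUNCT = set(".,;:!?()[]{}'\"-")


def token_kind(token):
    """Classify a token as word, number, symbol or punctuation."""
    if re.fullmatch(r"\d+(?:[.,]\d+)*", token):
        return "number"
    if any(c.isalpha() for c in token):
        return "word"
    if all(c in PUNCT for c in token):
        return "punctuation"
    return "symbol"


def orthographic(token):
    """Map a word to upperInitial, allCaps, lowerCase or mixedCaps."""
    if token.isupper():
        return "allCaps"
    if token.islower():
        return "lowerCase"
    if token[:1].isupper() and token[1:].islower():
        return "upperInitial"
    return "mixedCaps"


def token_features(tokens, i, lexicon, clusters, window=3):
    """Build a feature dict for tokens[i] with a +/- `window` token context."""
    feats = {}
    for off in range(-window, window + 1):
        j = i + off
        # Positions outside the sentence get a padding marker.
        feats[f"tok[{off}]"] = tokens[j] if 0 <= j < len(tokens) else "<PAD>"
    tok = tokens[i]
    feats["kind"] = token_kind(tok)
    # The orthographic class is only computed for word tokens.
    feats["orth"] = orthographic(tok) if feats["kind"] == "word" else "none"
    # Binary ontology feature: is the token listed in the (hypothetical) lexicon?
    feats["in_dinto"] = tok.lower() in lexicon
    # Word-cluster feature: cluster id of the token, -1 if out of vocabulary.
    feats["cluster"] = clusters.get(tok.lower(), -1)
    return feats


# Toy inputs, invented for illustration only.
tokens = ["Aspirin", "may", "increase", "the", "effect", "of", "warfarin", "."]
feats = token_features(tokens, 6, {"aspirin", "warfarin"}, {"warfarin": 12})
```

Each resulting dictionary corresponds to one token; a CRF toolkit that accepts per-token feature dictionaries could consume a list of them per sentence.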