

Louhi 2015

EMNLP 2015 Workshop, Lisbon, Portugal

Exploring Word Embedding for Drug Name Recognition

Isabel Segura-Bedmar, Víctor Suárez-Paniagua, Paloma Martínez

Computer Science Department,

Universidad Carlos III de Madrid, Spain

{isegura, vspaniag, pmf}@inf.uc3m.es

Machine Learning: Conditional Random Fields (CRF) achieve the best results in the recognition of drug and chemical names.

Word Embedding Features: word clusters and word vectors generated using the Word2Vec tool.

DINTO ontology²: The first ontology providing a representation of drug-drug interactions (DDI). It also contains drugs and drug classes.

Biomedical texts: The DDI corpus¹ is the dataset of SemEval-2013 Task 9.1, Drug Name Recognition.

Results

ACKNOWLEDGEMENTS: This work was supported by the TrendMiner project [FP7-ICT287863] and by the eGovernAbility-Access project (TIN2014-52665-C2-2-R).

DATASET: The DDI corpus was manually annotated with a total of 18,502 pharmacological substances and 5,028 DDIs. It consists of two different datasets: DDI-DrugBank (792 texts from the DrugBank database) and DDI-Medline (233 Medline abstracts).

EVALUATION: Systems are evaluated on the recognition of pharmacological substances (exact criterion: the entity boundaries must match) and on their classification into four types (strict criterion: both boundaries and type must match): drug (generic drug names), brand (branded drug names), group (drug group names) and drug-n (active substances not approved).

RESULTS: WBI denotes the best results reported in SemEval-2013 Task 9.1, Drug Name Recognition. CRF is the basic configuration. CRFD is CRF with DINTO.
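The difference between the two criteria can be illustrated with a short sketch. It assumes the usual reading of the task definitions (exact: only the boundaries of a mention must match; strict: boundaries and type must match). The `Mention` tuple, the `prf` helper and the toy offsets are hypothetical names and data introduced for illustration; they are not part of the official evaluation script.

```python
from typing import NamedTuple


class Mention(NamedTuple):
    start: int  # character offset where the mention begins
    end: int    # character offset where the mention ends (exclusive)
    type: str   # one of: drug, brand, group, drug_n


def prf(gold, predicted, strict):
    """Return (precision, recall, F1) for one document.

    Assumption: exact matching compares boundaries only, while strict
    matching compares boundaries and entity type.
    """
    if strict:
        key = lambda m: (m.start, m.end, m.type)
    else:
        key = lambda m: (m.start, m.end)
    gold_keys = {key(m) for m in gold}
    pred_keys = {key(m) for m in predicted}
    true_positives = len(gold_keys & pred_keys)
    precision = true_positives / len(pred_keys) if pred_keys else 0.0
    recall = true_positives / len(gold_keys) if gold_keys else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)


# Toy example: the second mention has correct boundaries but the wrong type.
gold = [Mention(0, 7, "drug"), Mention(12, 20, "brand")]
pred = [Mention(0, 7, "drug"), Mention(12, 20, "drug")]

exact_scores = prf(gold, pred, strict=False)
strict_scores = prf(gold, pred, strict=True)
```

In this toy case the mistyped mention still counts under the exact criterion but not under the strict one, so the strict scores are lower.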

References

María Herrero-Zazo, Isabel Segura-Bedmar, Paloma Martínez, and Thierry Declerck. 2013. The DDI corpus: An annotated corpus with pharmacological substances and drug-drug interactions. Journal of Biomedical Informatics, 46(5):914–920.

María Herrero Zazo. 2015. Semantic Resources in Pharmacovigilance: A Corpus and an Ontology for Drug-Drug Interactions. Ph.D. thesis, Carlos III University of Madrid, 5.

Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. 2013a. Efficient estimation of word representations in vector space. In ICLR 2013 Workshop Track.

¹ http://labda.inf.uc3m.es/ddicorpus

² http://www.obofoundry.org/cgibin/detail.cgi?id=DINTO

Experiments

Motivation

We developed a machine learning-based approach that uses word embedding features and the DINTO ontology to recognize drug names in biomedical texts.

Pipeline for CRF machine learning-based drug NER.

The main hypothesis is that incorporating word embeddings and DINTO as features into a CRF model could help to recognize drugs in texts.

• A pipeline of GATE components was used to process the DDI corpus and to obtain the basic feature set composed of:

o Current token together with the previous three and the following three tokens.

o POS tags and lemmas of all these tokens.

o An orthographical feature: upperInitial, allCaps, lowerCase or mixedCaps.

o A feature representing the token type: word, number, symbol or punctuation.

• DINTO is processed to produce a binary feature that indicates whether the current token was found in this ontology.

• We trained word embeddings using the Word2Vec tool on two different corpora, Wikipedia and MedLine:

o The word vector for the current token as a new feature. We tried different vector dimensions (50, 100 and 200).

o The word cluster for the current token as a new feature. We tried different k values in the k-means (50, 150 and 500).
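The feature set listed above can be sketched as a per-token feature dictionary of the kind commonly passed to a CRF toolkit. This is a minimal illustration under several assumptions, not the GATE pipeline used in the poster: POS tags and lemmas are omitted because they require a tagger, `ontology_terms` is a hypothetical stand-in for the labels extracted from DINTO, and `word_clusters` is a hypothetical lookup of precomputed k-means cluster ids. All function names and the orthography and type rules are illustrative.

```python
PUNCTUATION = set(".,;:!?()[]{}\"'-")


def orthography(token):
    """Map a token to one of the four orthographic classes."""
    if token.isupper():
        return "allCaps"
    if token.islower():
        return "lowerCase"
    if token[:1].isupper() and token[1:].islower():
        return "upperInitial"
    return "mixedCaps"


def token_type(token):
    """Rough token kind: number, punctuation, symbol or word."""
    if token.isdigit():
        return "number"
    if all(ch in PUNCTUATION for ch in token):
        return "punctuation"
    if not any(ch.isalnum() for ch in token):
        return "symbol"
    return "word"


def token_features(tokens, i, ontology_terms, word_clusters, window=3):
    """Build the feature dictionary for the token at position i."""
    token = tokens[i]
    feats = {
        "orth": orthography(token),
        "type": token_type(token),
        # Binary feature: is the current token found in the ontology?
        "in_dinto": token.lower() in ontology_terms,
        # Cluster id of the current token, or UNK if it has no embedding.
        "cluster": word_clusters.get(token.lower(), "UNK"),
    }
    # Current token plus the previous three and following three tokens.
    for offset in range(-window, window + 1):
        j = i + offset
        inside = 0 <= j < len(tokens)
        feats[f"tok[{offset}]"] = tokens[j].lower() if inside else "PAD"
    return feats


tokens = ["Aspirin", "may", "increase", "the", "effect", "of", "warfarin", "."]
ontology_terms = {"aspirin", "warfarin"}               # hypothetical DINTO labels
word_clusters = {"aspirin": "c17", "warfarin": "c17"}  # hypothetical cluster ids

features = [
    token_features(tokens, i, ontology_terms, word_clusters)
    for i in range(len(tokens))
]
```

Using a cluster id rather than the raw vector gives the CRF a single categorical feature per token; adding the vector itself would mean one real-valued feature per dimension.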