Claudiu Mihăilă, Riza Theresa Batista-Navarro, Sophia Ananiadou Claudiu Mihăilă Analysing Entity Type Variation across Biomedical Subdomains National Centre for Text Mining School of Computer Science University of Manchester 26 May 2012

Analysing Entity Type Variation across Biomedical Subdomains


Page 1: Analysing Entity Type Variation across Biomedical Subdomains

Claudiu Mihăilă, Riza Theresa Batista-Navarro, Sophia Ananiadou

Claudiu Mihăilă

Analysing Entity Type Variation across Biomedical Subdomains

National Centre for Text Mining

School of Computer Science

University of Manchester

26 May 2012

Page 2: Analysing Entity Type Variation across Biomedical Subdomains

BioTxtM 2012

Introduction

• Named entities

o Atomic elements, classified into various categories (protein, gene, disease, treatment, metabolite, etc.)

[Figure: the example sentence below with entity and event annotations. Surviving labels: Organism, Pro, +Reg, Transcription, Theme]

In contrast to the phenotype of the pta ackA double mutant, pbgP transcription was reduced in the pmrD mutant.

Page 3: Analysing Entity Type Variation across Biomedical Subdomains

Introduction

• Corpora

Page 4: Analysing Entity Type Variation across Biomedical Subdomains

Methodology

• Full-text open-access journal articles from UKPMC
• 20 subdomains, 400 single broad-subject-termed articles

o Allergy & Immunology
o Biology
o Cell Biology
o Communicable Diseases
o Critical Care
o Environmental Health
o Genetics
o Health Services Research
o Medical Informatics
o Medicine
o Microbiology
o Neoplasms
o Neurology
o Pharmacology
o Physiology
o Public Health
o Pulmonary Medicine
o Rheumatology
o Tropical Medicine
o Virology

Page 5: Analysing Entity Type Variation across Biomedical Subdomains

Methodology

• NE source: $A_{\text{Silver}} = A_{\text{UKPMC}} \cup A_{\text{Oscar}} \cup A_{\text{NeMine}}$

[Figure: Corpus Annotation. Surviving labels: the 20 subdomains listed on the previous slide; the annotators UKPMC, NeMine and OSCAR; the subdomains Critical Care, Medicine, Physiology and Virology]
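The slide names three annotation sources that feed one silver annotation. A minimal sketch of that combination, assuming it is a plain set union and that each annotation can be represented as a (start, end, type) span, is given below. The `Annotation` type, the `silver_annotations` helper and the sample offsets are all hypothetical; the harmonisation actually applied to the corpus is not described on the slide.

```python
from typing import Iterable, List, NamedTuple


class Annotation(NamedTuple):
    """Hypothetical span representation: character offsets plus entity type."""
    start: int
    end: int
    ne_type: str


def silver_annotations(*sources: Iterable[Annotation]) -> List[Annotation]:
    """Union the annotations of all sources, dropping exact duplicates."""
    merged = set()
    for source in sources:
        merged |= set(source)
    # Sort by position so the output order is deterministic.
    return sorted(merged)


# Toy data with made-up offsets, for illustration only.
ukpmc = [Annotation(36, 39, "Gene|Protein")]
oscar = [Annotation(100, 107, "Chemical molecule")]
nemine = [Annotation(36, 39, "Gene|Protein"), Annotation(50, 57, "Disease")]

silver = silver_annotations(ukpmc, oscar, nemine)
for annotation in silver:
    print(annotation)
```

A set union keeps an annotation found by several sources only once. Overlapping spans with different types would need an explicit resolution policy, which this sketch does not attempt.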

Page 6: Analysing Entity Type Variation across Biomedical Subdomains

Methodology

• NeMine: Gene, Protein, Disease, Drug, Metabolite, Bacteria, Diagnostic process, General phenomenon, Indicator, Natural phenomenon, Organ, Pathologic function, Symptom, Therapeutic process

• OSCAR: Chemical molecule, Chemical adjective, Enzyme, Reaction

• UKPMC: Gene, Protein, Disease, Drug, Metabolite, Gene|Protein

Silver Annotation

Page 7: Analysing Entity Type Variation across Biomedical Subdomains

Methodology

• Feature vectors

Document d

Entity type          Count    Share
Enzyme                   2    0.45%
Chemical molecule       71   14.85%
Disease                  8    1.67%
Drug                    12    2.51%
Gene                    15    3.13%
Gene|Protein           155   32.43%
Metabolite               3    0.62%
Protein                188   39.33%
Reaction                24    5.02%
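The two views of document d suggest that each feature is an entity type's count divided by the total number of entities in the document. The sketch below works under that assumption; the function name `relative_frequencies` is illustrative and not taken from the slides.

```python
from typing import Dict


def relative_frequencies(counts: Dict[str, int]) -> Dict[str, float]:
    """Convert raw entity-type counts into percentage shares of the document."""
    total = sum(counts.values())
    if total == 0:
        # A document without any annotated entity gets an all-zero vector.
        return {ne_type: 0.0 for ne_type in counts}
    return {ne_type: 100.0 * count / total for ne_type, count in counts.items()}


# Counts for document d as listed on the slide.
counts_d = {
    "Enzyme": 2,
    "Chemical molecule": 71,
    "Disease": 8,
    "Drug": 12,
    "Gene": 15,
    "Gene|Protein": 155,
    "Metabolite": 3,
    "Protein": 188,
    "Reaction": 24,
}

shares_d = relative_frequencies(counts_d)
for ne_type, share in shares_d.items():
    print(f"{ne_type:<18}{share:7.2f}%")
```

With these counts the document contains 478 entities, so Protein (188 of 478) comes out at roughly 39.33%. Normalising in this way makes documents of different lengths comparable.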

Page 8: Analysing Entity Type Variation across Biomedical Subdomains

Methodology

Page 9: Analysing Entity Type Variation across Biomedical Subdomains

Methodology

Page 10: Analysing Entity Type Variation across Biomedical Subdomains

Methodology

• Chi-squared statistics
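The slide gives no formula for the chi-squared statistics. As an assumption, the sketch below uses the standard Pearson chi-squared statistic over a 2 × k contingency table of entity-type counts for two subdomains; the variant used in the study may differ, and the function name and sample counts are illustrative only.

```python
from typing import Dict


def chi_squared(counts_a: Dict[str, int], counts_b: Dict[str, int]) -> float:
    """Pearson chi-squared statistic comparing two entity-type count profiles."""
    total_a = sum(counts_a.values())
    total_b = sum(counts_b.values())
    if total_a == 0 or total_b == 0:
        raise ValueError("both subdomains need at least one entity")
    grand_total = total_a + total_b

    statistic = 0.0
    for ne_type in set(counts_a) | set(counts_b):
        observed_a = counts_a.get(ne_type, 0)
        observed_b = counts_b.get(ne_type, 0)
        column_total = observed_a + observed_b
        if column_total == 0:
            continue  # type absent from both subdomains, contributes nothing
        # Expected counts if both subdomains shared one distribution.
        expected_a = total_a * column_total / grand_total
        expected_b = total_b * column_total / grand_total
        statistic += (observed_a - expected_a) ** 2 / expected_a
        statistic += (observed_b - expected_b) ** 2 / expected_b
    return statistic


# Made-up counts: the two profiles have no entity type in common.
subdomain_a = {"Protein": 10, "Gene": 0}
subdomain_b = {"Protein": 0, "Gene": 10}
print(chi_squared(subdomain_a, subdomain_b))
```

Two subdomains with identical proportions give a statistic of 0, and the value grows as their entity-type profiles diverge.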

Page 11: Analysing Entity Type Variation across Biomedical Subdomains

Methodology

• Frobenius norm

1247.0725
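The Frobenius norm of a matrix is the square root of the sum of its squared entries; for a single vector it coincides with the Euclidean norm. The sketch below assumes that two subdomains are compared by taking the norm of the difference between their feature vectors. That reading, the function names and the sample values are assumptions, and the code does not attempt to reproduce the value shown on the slide.

```python
import math
from typing import Sequence


def frobenius_norm(matrix: Sequence[Sequence[float]]) -> float:
    """Square root of the sum of squared entries of a matrix."""
    return math.sqrt(sum(value * value for row in matrix for value in row))


def frobenius_distance(u: Sequence[float], v: Sequence[float]) -> float:
    """Frobenius norm of the difference between two equal-length vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    difference = [a - b for a, b in zip(u, v)]
    # A vector is treated as a one-row matrix.
    return frobenius_norm([difference])


# Made-up feature vectors for two subdomains.
vector_a = [4.0, 6.0, 1.0]
vector_b = [1.0, 2.0, 1.0]
print(frobenius_distance(vector_a, vector_b))
```

A larger distance means the two feature vectors, and so the two subdomains, differ more on the chosen features.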

Page 12: Analysing Entity Type Variation across Biomedical Subdomains

Feature evaluation

Frobenius norm of 2 vectors for each pair.

• Good features for
o Cell Biology
o Pharmacology
o Health Sciences
o Public Health

• Not-so-good features for
o Medical Informatics
o Medicine
o Microbiology
o Neoplasms
o Neurology

Page 13: Analysing Entity Type Variation across Biomedical Subdomains

Feature evaluation

• Mean Chi-Squared for every feature over all pairs

Page 14: Analysing Entity Type Variation across Biomedical Subdomains

Classifier selection

Random Forest F-score for each pair.

Classifier          Top result count        %
J48                                0       0%
JRip                               4    2.10%
Logistic                           2    1.05%
Random Tree                        0       0%
Random Forest                     86   45.26%
SMO                                0       0%
AdaBoost
  J48                              6    3.15%
  JRip                             7    3.68%
  Decision Stump                  16    8.42%
  Logistic                         0       0%
  Random Tree                      0       0%
  Random Forest                   68   35.78%
  SMO                              1    0.53%
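The counts in the table add up to 190, which is the number of unordered pairs that can be formed from 20 subdomains. A minimal sketch of such a tally is given below, assuming one winning classifier is recorded per pair. The `best_classifier` callback and the toy winner rule are hypothetical stand-ins for the real training and evaluation step, which the slides do not detail.

```python
from collections import Counter
from itertools import combinations
from typing import Callable, Dict, Sequence, Tuple


def top_result_counts(
    subdomains: Sequence[str],
    best_classifier: Callable[[str, str], str],
) -> Dict[str, Tuple[int, float]]:
    """Tally which classifier wins on each unordered pair of subdomains."""
    winners = Counter(
        best_classifier(first, second)
        for first, second in combinations(subdomains, 2)
    )
    n_pairs = sum(winners.values())
    if n_pairs == 0:
        return {}
    return {
        name: (count, 100.0 * count / n_pairs)
        for name, count in winners.items()
    }


def toy_best_classifier(first: str, second: str) -> str:
    """Hypothetical rule, for illustration only: no model is trained here."""
    return "Random Forest" if len(first) <= len(second) else "JRip"


subdomains = [f"subdomain_{index}" for index in range(20)]
tally = top_result_counts(subdomains, toy_best_classifier)
for name, (count, share) in tally.items():
    print(f"{name:<15}{count:4d}{share:8.2f}%")
```

Dividing each count by the number of pairs gives the percentage column; for example, 86 wins out of 190 pairs is about 45.26%.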

Page 15: Analysing Entity Type Variation across Biomedical Subdomains

Classifier evaluation

Random Forest F-score for each pair.

• Dissimilar subdomains
o Cell Biology
o Pharmacology
o Health Sciences
o Public Health

• Similar subdomains
o Medical Informatics
o Medicine
o Microbiology
o Neoplasms
o Neurology

Page 16: Analysing Entity Type Variation across Biomedical Subdomains

Conclusions

• To remember
o Significant semantic variation of biomedical sublanguages
o Distinguishable bio-subdomains using only NE types
o Caution needed when adapting NLP tools to subdomains

• To do
o Extension to bio-events
o Combination with lexical, syntactical, discourse features
o Extension to other domains

Page 17: Analysing Entity Type Variation across Biomedical Subdomains

Thank you!

http://misteringo.deviantart.com/art/Bunnies-Scream-Again-79745974