A Rule-Based NLP System in Tagging and Categorizing Phenotype Variables in dbGaP



Son Doan, Ko-Wei Lin, Rebecca Walker, Seena Farzaneh,

Neda Alipanah and Hyeon-Eui Kim

Division of Biomedical Informatics, UC San Diego

AMIA 2013, Washington DC, 11/18/2013

Roadmap

• Background
  o dbGaP
  o Challenges in using dbGaP
  o pFINDR program
• Phenotype standardization in dbGaP
• PhenDisco system
• Performance evaluation

NCBI’s database of Genotypes and Phenotypes (dbGaP)

“dbGaP was developed to archive and distribute the results of studies that have investigated the interaction of genotype and phenotype”

As of 11/14/2013:
- 411 top-level studies
- 139,238 phenotype variables
- 2,816 datasets
- 3,895 analyses

What topics are dbGaP users focusing on?

Most frequent phenotype topics in the dbGaP data requests:
• Genetic Disease / Congenital Abnormality (8.6%)
• Cardiovascular Disease (8.1%)

Based on 14,287 data requests from the dbGaP website


http://www.ncbi.nlm.nih.gov/gap


pFINDR (phenotype Finding IN Data Repositories)

• Funded by NHLBI/NIH
• To facilitate dbGaP use by improving accuracy and completeness of search returns
  – Standardized existing phenotype variables
  – Searchable study related information

Related work

PhenX: Defined the 287 most important phenotypes and manually mapped them to 16 dbGaP studies

eMERGE: Standardizes EMR phenotypes through a semi-automatic process: phenotypes are first mapped automatically to standardized vocabularies, then the outputs are returned to users for curation

Our approach

Using NLP to standardize phenotype variables in dbGaP

Integrating NLP components into a new phenotype search tool for dbGaP

[Figure: system architecture. dbGaP offers free text search and structured (advanced) search and returns unsorted, flat list results to the data user. PhenDisco adds standardization and annotation components (Study Description Annotator, Phenotype Variable Annotator, Ontology Mapper) that populate sdGaP, plus a Query Parser and Ranking Algorithms that serve free text and structured search and return ranked results.]

sdGaP contains standardized phenotype variables

Phenotype Variable Standardization Pipeline

Phenotype variables → Tagger → Categorizer
• Tagger: identify topic and subject of information
• Categorizer: identify semantic category of phenotypes

Variable | Topic | Subject | Category
Gender of the participant | Gender | participant | Demographics
CIGARETTES/DAY, EXAM 1 | medical examination | study subject | Smoking History; Healthcare Activity Finding
Weight in kg. at baseline | weighing patient | study subject | Clinical Attributes
AGE OF LIVING MOTHER | Age | mother (person) | Demographics


Topic: Main theme of phenotype variables
Subject of information: Bearer of the variable


Phenotype Variable Standardization Pipeline

Variable Descriptions → Normalization → MetaMap Processing → Semantic Role Assignment → Topic Filtering → Variable Categorization

Normalization
• Spell out abbreviations and shorthand expressions
• Drop question numbers and other unimportant characters

MetaMap Processing
• Generate CUIs, concept names, semantic types

Semantic Role Assignment
• Semantic types and keyword-based role identification

Topic Filtering
• Keep concepts that match SNOMED-CT clinical findings
• Remove problematic concepts

Variable Categorization
• Semantic types and keyword-based categorization

Two domain experts selected 15 semantic categories based on semantic types from MetaMap, including Demographics, Medical History, Clinical Attribute, Medication, and Lab Tests.

Components: Tagger, Categorizer

Creating an abbreviation list from dbGaP

bp: blood pressure
bmi: body mass index
bpm: beats per minute
bw: body weight
dbp: diastolic blood pressure
hbp: high blood pressure
htn: hypertension
hr: heart rate
Ht: height
lb: pounds
rr: respiration rate
sbp: systolic blood pressure
temp: temperature
TPR: temperature, pulse, respiration
wt: weight
yr: year
vs: vital signs

We compiled and reviewed a list of abbreviations used in dbGaP. The original list contained 50 abbreviations; the latest version contains 520.
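The abbreviation step can be illustrated with a minimal Python sketch. It assumes a simple whole-word, case-insensitive dictionary lookup; the slides do not describe how the lookup is actually implemented, and the function name `expand_abbreviations` and the dictionary subset below are illustrative only.

```python
import re

# A small subset of the abbreviation list shown above (the full list has 520 entries).
ABBREVIATIONS = {
    "bp": "blood pressure",
    "bmi": "body mass index",
    "dbp": "diastolic blood pressure",
    "sbp": "systolic blood pressure",
    "hr": "heart rate",
    "ht": "height",
    "wt": "weight",
    "yr": "year",
}


def expand_abbreviations(description):
    """Replace whole-word abbreviations (case-insensitive) with their long forms."""

    def replace(match):
        token = match.group(0)
        # Unknown tokens are returned unchanged.
        return ABBREVIATIONS.get(token.lower(), token)

    # Each run of letters is looked up as one token.
    return re.sub(r"[A-Za-z]+", replace, description)


print(expand_abbreviations("SBP and DBP at exam 1"))
```

A plain lookup like this cannot tell an abbreviation from an ordinary word with the same spelling, which is one reason a reviewed list matters.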

Remove number with exceptions

Rule | Example
1. If a number follows "type", keep the number | type 1 diabetes; type I diabetes; type 2 diabetes; Glycogen storage disease type I; type 1 hypersensitivity diseases
2. If a number follows "grade", keep the number | grade 1 Dupuytren's disease
3. If a number follows "stage", keep the number | stage 1 chronic kidney disease
4. If a number follows "bipolar", keep the number | bipolar 1 disorder; bipolar I disorder; bipolar II disorder; bipolar 2 disorder
5. If a number follows "class", keep the number | class I and II Newcastle disease; class 1 Newcastle disease
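The number-removal rules can be sketched in Python as follows. This is a simplified illustration, not the system's code: it assumes whitespace tokenization and treats only Arabic digits as numbers (Roman numerals such as "I" or "II" are left untouched because they are not digits). The names `KEEP_NUMBER_AFTER` and `remove_numbers` are hypothetical.

```python
# Words after which a number is part of the concept name and must be kept
# (the five exception rules listed above).
KEEP_NUMBER_AFTER = {"type", "grade", "stage", "bipolar", "class"}


def remove_numbers(description):
    """Drop standalone numeric tokens unless they follow an exception word."""
    kept = []
    previous = ""
    for token in description.split():
        is_number = token.isdigit()
        if is_number and previous.lower() not in KEEP_NUMBER_AFTER:
            # A question number or other stray number: skip it.
            previous = token
            continue
        kept.append(token)
        previous = token
    return " ".join(kept)


print(remove_numbers("77 age mother diagnosed stroke"))
print(remove_numbers("12 type 2 diabetes"))
```

The first call drops the leading question number; the second drops "12" but keeps the "2" that follows "type".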

Rules for tagger and categorizer: examples

Rule for Tagger:
IF CandidatePreferred contains "gender" or "sex"
AND SemType = organism attribute
THEN Topic = Gender

Rule for Categorizer:
IF Topic concepts = Pharmacologic Substance
OR Variable description contains the word “medication”
THEN Type = Medication

Two domain experts reviewed 300 randomly selected unique phenotype variables and created rules from them.
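The two example rules translate fairly directly into code. The sketch below is an assumption-laden illustration: the `Concept` class is a hypothetical stand-in for one MetaMap candidate (it is not MetaMap's output format), semantic types are written out in full, and "Type" in the categorizer rule is read as the semantic category.

```python
import re
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Concept:
    """Hypothetical container for one MetaMap candidate concept."""
    preferred_name: str  # corresponds to CandidatePreferred in the rule
    semantic_type: str   # corresponds to SemType, spelled out


def tag_topic(concepts: List[Concept]) -> Optional[str]:
    """Tagger rule: a gender/sex concept typed as organism attribute gives Topic = Gender."""
    for concept in concepts:
        name = concept.preferred_name.lower()
        is_attribute = concept.semantic_type.lower() == "organism attribute"
        if ("gender" in name or "sex" in name) and is_attribute:
            return "Gender"
    return None  # no rule fired


def categorize(topic_concepts: List[Concept], description: str) -> Optional[str]:
    """Categorizer rule: a drug concept or the word 'medication' gives Medication."""
    has_drug = any(
        c.semantic_type.lower() == "pharmacologic substance" for c in topic_concepts
    )
    words = re.findall(r"[a-z]+", description.lower())
    if has_drug or "medication" in words:
        return "Medication"
    return None  # no rule fired


print(tag_topic([Concept("Gender", "Organism Attribute")]))
print(categorize([], "Current medication use"))
```

A full rule set would chain many such checks; returning `None` when nothing fires makes unmapped variables easy to count.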

Example of tagging

Variable description: 77 age mom diagnosed–stroke (tia)
After normalization: age mother diagnosed stroke (tia)

MetaMap:
• C0001779: Age [Organism Attribute]
• C0026591: Mother (Mother (person)) [Family Group]
• C0038454: Stroke (Cerebrovascular accident) [Disease or Syndrome]
• ‘Diagnosed’

Result:
• Topic: Age (C0001779)
• Subject of Information: Mother (C0026591)

Phenotype Variable Standardization Pipeline

135,608 variables
• 116,957 phenotypes mapped to Topic
• 104,172 phenotypes mapped to Category

Evaluation:
- Random sample of 500 unique phenotypes
- Reviewed by 3 domain experts
- 73% accuracy for topic
- 71% accuracy for category
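The slide gives raw counts only. As a quick arithmetic check, the coverage of each step can be derived from them, assuming the 135,608 variables are the common denominator for both mapped counts (the slide does not state this explicitly).

```python
# Counts taken from the slide.
TOTAL_VARIABLES = 135_608
MAPPED_TO_TOPIC = 116_957
MAPPED_TO_CATEGORY = 104_172


def coverage(mapped, total):
    """Share of variables that received a mapping, in percent."""
    return 100.0 * mapped / total


topic_coverage = coverage(MAPPED_TO_TOPIC, TOTAL_VARIABLES)
category_coverage = coverage(MAPPED_TO_CATEGORY, TOTAL_VARIABLES)

print(f"Topic coverage: {topic_coverage:.1f}%")        # 86.2%
print(f"Category coverage: {category_coverage:.1f}%")  # 76.8%
```

Coverage (how many variables received any mapping) is distinct from the 73% and 71% accuracy figures, which were measured on the 500-variable expert-reviewed sample.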

Semantic category of phenotypes in dbGaP (as of July 1, 2013)

Mapping Failures

a. Unprocessed by MetaMap. Examples: 14 c  first arm othervessel max l lat obliq
b. Lexical problems: items with incorrect lexical forms, including typos. Examples: hba1 c collection date  month 057 fateat gm
c. ID variables or other administrative variables. Examples: form numberf124  documentation used form
d. Not covered by our rules. Examples: years treated pet for fleas what is your first language

PhenDisco: Put-it-all-together

[Figure: PhenDisco workflow. Labels: Free text → Query parser (NLP tools + MetaMap) → sdGaP (standardized phenotypes, built from dbGaP via information model mapping) → Relevant studies → BM25 ranking algorithm → Ranked studies]

Doan S, Lin KW, Conway M, Ohno-Machado L, et al. "PhenDisco: Phenotype Discovery System for the Database of Genotypes and Phenotypes (dbGaP)", JAMIA, 2013, doi:10.1136/amiajnl-2013-001882.
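The slide names BM25 as the ranking algorithm but gives no parameters. The following is a generic Okapi BM25 sketch, not PhenDisco's implementation: the parameter values `k1=1.5` and `b=0.75` are common defaults assumed here, the IDF uses the non-negative "+1" variant, and the toy documents stand in for tokenized study descriptions.

```python
import math
from collections import Counter


def bm25_scores(query_terms, documents, k1=1.5, b=0.75):
    """Score each tokenized document against the query with Okapi BM25."""
    if not documents:
        return []
    n_docs = len(documents)
    avg_len = sum(len(doc) for doc in documents) / n_docs
    # Document frequency: number of documents that contain each term.
    doc_freq = Counter(term for doc in documents for term in set(doc))

    scores = []
    for doc in documents:
        term_freq = Counter(doc)
        score = 0.0
        for term in query_terms:
            tf = term_freq[term]
            if tf == 0:
                continue  # term absent from this document
            df = doc_freq[term]
            idf = math.log((n_docs - df + 0.5) / (df + 0.5) + 1.0)
            # Term frequency saturates with k1 and is normalized by document length.
            norm = tf + k1 * (1 - b + b * len(doc) / avg_len)
            score += idf * tf * (k1 + 1) / norm
        scores.append(score)
    return scores


# Toy corpus (hypothetical study descriptions, already tokenized).
studies = [
    ["copd", "lung", "function"],
    ["breast", "cancer", "density"],
    ["copd", "copd", "smoking"],
]
scores = bm25_scores(["copd"], studies)
ranking = sorted(range(len(studies)), key=lambda i: scores[i], reverse=True)
print(ranking)
```

Sorting studies by descending score gives the ranked list; a study that mentions the query term more often ranks higher, with diminishing returns controlled by `k1`.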

PhenDisco system

Search
  o Term auto-complete
  o Synonym expansion

Display
  o Keyword highlighting
  o Ranking by relevance
  o Filter by study metadata
  o Cross-link related studies

Export to Excel
  o Selected study metadata
  o Selected phenotype variables

Advanced Search

Search by titles, platform, study

PhenDisco vs dbGaP Entrez

Basic Search (as of July 7, 2013)

Query | dbGaP Recall | dbGaP Precision | PhenDisco Recall | PhenDisco Precision
COPD | 100% | 41.67% | 80.00% | 100%
“macular degeneration” AND white | 100% | 42.86% | 100% | 85.71%
“breast cancer” AND “breast density” | 100% | 66.67% | 50.00% | 100%
schizophrenia | 100% | 46.88% | 86.67% | 92.86%
cardiomyopathy | 100% | 35.00% | 100% | 100%
Average | 100% | 46.61% | 83.33% | 95.71%

Average F-measure: dbGaP 0.64, PhenDisco 0.89
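The slide does not say how the average F-measure was derived. Taking the harmonic mean of the averaged precision and recall (the standard F1 formula) reproduces both reported values, as this short check shows; the function name `f_measure` is illustrative.

```python
def f_measure(precision, recall):
    """Balanced F-measure (F1): harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Average precision and recall from the comparison table, as fractions.
dbgap_f = f_measure(precision=0.4661, recall=1.0)
phendisco_f = f_measure(precision=0.9571, recall=0.8333)

print(round(dbgap_f, 2))      # 0.64
print(round(phendisco_f, 2))  # 0.89
```

dbGaP Entrez retrieves every relevant study but with many false positives, so its F-measure is pulled down by precision; PhenDisco trades some recall for much higher precision.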

Summary & Future work

• A rule-based approach is a simple yet efficient way to standardize phenotype variables in dbGaP

• Integration with machine learning methods will be investigated

• Identification of similar variables is in progress!

Acknowledgements

• Lucila Ohno-Machado (PI)
• Other PhenDisco team members:
  o Mike Conway
  o Alex Hsieh
  o Stephanie Feudjido Feupe
  o Asher Garland
  o Mindy Ross
  o Xiaoqian Jiang
  o Jing Zhang
• Early contributors
  o Wendy Chapman
  o Melissa Tharp
  o Jihoon Kim
• Collaborator:
  o Hua Xu
• SAB member and NHLBI officers
• Funding: UH2HL108785 from NHLBI/NIH

Questions?

Project Homepage: http://pfindr.net

PhenDisco: http://pfindr-data.ucsd.edu/_PhDVer1

Contact:

sondoan@ucsd.edu

hyk038@ucsd.edu

Source code and database of PhenDisco are publicly available
