DESCRIPTION
presented at DBMI, UCSD Journal club on 9/5/2013
PhenDisco: a new phenotype discovery system for the database
of genotypes and phenotypes
Son Doan, Hyeoneui Kim
Division of Biomedical Informatics University of California San Diego
Open Access Journal Club, 09/05/2013
Roadmap to the Presentation
• Background
  – dbGaP
  – Challenges in using dbGaP
  – pFINDR program
• PhenDisco development
  – User requirement analysis for PhenDisco
  – Data standardization (variables, study metadata)
  – System development: technical details
• PhenDisco demo
• Performance evaluation
Background
Overview on dbGaP
• Database of Genotypes and Phenotypes
• Developed by NCBI
• Stores and distributes the data and outputs of studies on the interactions of genotypes and phenotypes
• Provides 2 levels of access
  – Open access: variable information, including summary statistics, and study information
  – Controlled access: raw data, upon approval by the NIH DAC
A Typical Challenge in Using dbGaP
Potentially, dbGaP is great…it contains so many different types of studies and their data!
However, I find it very hard to reuse dbGaP data because there is no easy but robust way to filter studies by important study related information such as study design, analysis methods, analysis data produced by the studies.
Even if I find the studies that seem fitting to my needs, I still need to make sure that the studies have the genotype and/or the phenotype information that I need.
Of course, dealing with data values in all sorts of different formats is another challenge to go through…
(Erin Smith, PhD, Division of Genome Information Science, UCSD)
http://www.ncbi.nlm.nih.gov/gap
pFINDR (phenotype Finding IN Data Repositories)
• Funded by NHLBI
• To facilitate dbGaP use by improving accuracy and completeness of search returns
  – Standardized phenotype variables
  – Searchable study related information
User Requirement Analysis
Use-Case Driven Development
• User requirements collected from
  – Analysis of data use descriptions from data requests available in dbGaP (14,287 requests)
  – Online user survey (17 users)
  – User interviews (8 local dbGaP users)
  – NIH officers/Scientific Advisory Board recommendations and suggestions
Data Request Analysis
[Figure: chart of topics found in dbGaP data requests. Categories shown: Disease; Chemical or Biological Substance; Therapeutic or Preventive Procedure; Research Activity; Laboratory Procedure or Test; Pathologic Function; Signs or Symptoms; Diagnostic Procedure; Clinical Attributes; Mood, Emotion, and Individual Behavior; Qualitative Concept; Mental Process; Social Behavior; Organism Function; Daily Function or Activity; Health Care Activity; Food; Other. Labeled segments: Neoplasm/Cancer (30%), Psychiatric Disease (13%), Genetic Disease Congenital Abnormality (8.6%), Cardiovascular Disease (8.1%).]
Interviews, Survey and SAB/NIH officers’ feedback
• Functions that maximize search efficiency
• Examples
  – “option to expand search terms through synonyms”
  – “studies displayed in the order of relevancy”
  – “select studies from the returned list and save for later review”
  – “search results organized in a way that supports quick browsing”
Problems We Addressed
Focus areas:
• Completeness and accuracy of search results
  – Abbreviation expansion
  – Concept-based search
• Ease of result review
  – Sorting the results by relevancy
  – Highlighting search keywords in the retrieved records
• Additional functionality
  – Export of selected study and variable information
  – Categorization of variables
Data Standardization
• Variable Standardization
• Study Level Metadata Generation
Phenotype Variable Standardization
• Used variable descriptions
• Focused on identifying
  – Topic (main theme: “pain”, “walking”)
  – Subject of information (i.e., bearer: “study subject”)
• Mapped the topic and SOI concepts to UMLS Metathesaurus

Example variable:
Variable ID | Variable Name | Variable Description
Phv00116192.v2.p2 | C41RPACE | Get pain when walk at ordinary pace?
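The split between topic and subject of information (SOI) can be illustrated with a small sketch. It assumes a single rule, that concepts whose UMLS semantic type is "family group" name the SOI and everything else is a topic, with "study subject" as the default SOI. That rule, the function name and the tuple layout are assumptions for illustration; the slides only state that roles were assigned from semantic types and keywords.

```python
# Hypothetical rule set: semantic types treated as "subject of information".
SOI_SEMANTIC_TYPES = {"family group"}


def assign_roles(concepts):
    """Split (cui, name, semantic_type) tuples into topic and SOI concepts."""
    topics, subjects = [], []
    for cui, name, semantic_type in concepts:
        if semantic_type in SOI_SEMANTIC_TYPES:
            subjects.append((cui, name))
        else:
            topics.append((cui, name))
    if not subjects:
        # Assumed default: the variable describes the study participant.
        subjects.append((None, "study subject"))
    return {"topic": topics, "subject_of_information": subjects}


roles = assign_roles([
    ("C0001779", "age", "organism attribute"),
    ("C0026591", "Mother", "family group"),
    ("C0038454", "Stroke", "disease or syndrome"),
])
print(roles)
```

A real implementation would need more semantic types and the keyword rules mentioned in the talk. The sketch only shows the shape of the decision.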
Category Examples
Variable Descriptions | Topics | Subject of Information | Variable Categories
Gender of the participant | gender | study subject | Demographics
Last known smoking status | smoking | study subject | Smoking History
Cigarettes/day, exam 1 | smoking, medical examination | study subject | Smoking History, Healthcare Activity Finding
Age in years at uric acid measurement | age, uric acid measurement | study subject | Demographics, Lab Tests
AGE of living mother | age | mother | Demographics - Family
Age at dementia onset as defined by the DSM IV definition | age, dementia | study subject | Demographics, Medical History
Phenotype Variable Standardization
Variable Descriptions
• 135,608 variables
• Example: “77 age mom diagnosed – stroke (tia)”
Normalization
• Spell out abbreviations and shorthand expressions
• Drop question numbers and other unimportant characters
• Example: “age mother diagnosed stroke (tia)”
MetaMap Processing
• Generate CUIs, concept names, semantic types
• Example: C0001779: age [organism attribute], C0026591: Mother [family group], C0038454: Stroke [disease or syndrome]
Semantic Role Assignment
• Semantic types and keyword-based role identification
• Evaluation from random sample of 500: 73% accuracy
• Example: C0001779: age, C0038454: Stroke – topic; C0026591: Mother – subject of information
Variable Categorization
• Semantic types and keyword-based categorization
• Evaluation from random sample of 500: 71% accuracy
• Example: family history, demographics
Identification of Similar Variables (in progress)
• Same CUI, similar keywords, and same category
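The normalization step can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the project's code: the abbreviation table, the function name and the rule that a leading number is a question number are all hypothetical.

```python
import re

# Hypothetical abbreviation table; the real resource is not shown in the slides.
ABBREVIATIONS = {"mom": "mother", "dad": "father"}


def normalize_description(text, abbreviations=ABBREVIATIONS):
    """Drop a leading question number and stray dashes, then expand abbreviations."""
    # Assumption: a number at the very start is a question number, not content.
    text = re.sub(r"^\s*\d+[.):]?\s+", "", text)
    # Remove free-standing hyphens and en/em dashes used as separators.
    text = re.sub(r"\s[\u2013\u2014-]+\s", " ", text)
    words = [abbreviations.get(word.lower(), word) for word in text.split()]
    return " ".join(words)


print(normalize_description("77 age mom diagnosed \u2013 stroke (tia)"))
```

On the slide's example this yields "age mother diagnosed stroke (tia)". Descriptions that legitimately start with a number would need a stricter rule.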
Study Level Metadata Annotation
• Manual annotation of 422 studies (07/31/13)
• Metadata items generated
  – Disease topics (encoded with UMLS)
  – Geographical information (encoded with ISO 3166-2 subdivision code: state and country)
  – IRB approval (required or not)
  – Consent type (not restricted, restricted, unspecified)
  – Sample demographics (race and/or ethnicity, gender, age)
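One way to picture the annotated metadata is as a record per study that a search filter can test. The sketch below is hypothetical throughout: the class, field names, study identifiers and filter function are illustrative assumptions and do not describe PhenDisco's actual schema.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class StudyMetadata:
    study_id: str                                             # hypothetical identifier
    disease_topics: List[str] = field(default_factory=list)   # UMLS CUIs
    locations: List[str] = field(default_factory=list)        # ISO 3166-2 codes, e.g. "US-CA"
    irb_required: Optional[bool] = None                       # None when not annotated
    consent_type: str = "unspecified"                         # "not restricted", "restricted" or "unspecified"
    race_ethnicity: List[str] = field(default_factory=list)
    gender: List[str] = field(default_factory=list)
    age_range: Optional[str] = None


def filter_studies(studies, irb_required=None, gender=None):
    """Keep studies that match every metadata constraint that was supplied."""
    kept = []
    for study in studies:
        if irb_required is not None and study.irb_required != irb_required:
            continue
        if gender is not None and gender not in study.gender:
            continue
        kept.append(study)
    return kept


studies = [
    StudyMetadata("study-A", ["C0038454"], ["US-CA"], irb_required=False, gender=["female", "male"]),
    StudyMetadata("study-B", irb_required=True, gender=["female"]),
]
print([s.study_id for s in filter_studies(studies, irb_required=False)])
```

Constraints such as "IRB not required" or "female" in a search then become simple tests on these fields.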
System Development: Integration
PhenDisco: Put-it-all-together
[Figure: system architecture diagram. Labels shown: Free text, Query parser, sdGaP, Relevant studies, BM25 ranking algorithm, Ranked studies, dbGaP, NLP tools + MetaMap, Information Model Mapping.]
System Development: Query Parser
Contextual Query Language
• Query types:
  – Simple queries: keywords, phrases
  – Using Boolean logic: AND, OR, NOT
  – Can process index values, e.g., age > 40
• Build a language guideline:
  – BNF form
BNF form
cqlQuery ::= prefixAssignment cqlQuery | scopedClause
prefixAssignment ::= '>' prefix '=' uri | '>' uri
scopedClause ::= scopedClause booleanGroup searchClause | searchClause
booleanGroup ::= boolean [modifierList]
boolean ::= 'and' | 'or' | 'not' | 'prox'
searchClause ::= '(' cqlQuery ')' | index relation searchTerm | searchTerm
relation ::= comparitor [modifierList]
comparitor ::= comparitorSymbol | namedComparitor
comparitorSymbol ::= '=' | '>' | '<' | '>=' | '<=' | '<>' | '=='
namedComparitor ::= identifier
modifierList ::= modifierList modifier | modifier
modifier ::= '/' modifierName [comparitorSymbol modifierValue]
prefix, uri, modifierName, modifierValue, searchTerm, index ::= term
term ::= identifier | 'and' | 'or' | 'not' | 'prox' | 'sortby'
identifier ::= charString1 | charString2
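To make the grammar concrete, here is a minimal sketch of a parser for a small subset of it: search terms, quoted phrases, index comparisons, parentheses, and left-to-right and/or/not. It is a hand-written recursive-descent parser using only the standard library, so it is not the project's pyparsing implementation. Prefix assignments, modifiers, 'prox' and 'sortby' are left out, and the tuple-based tree shape is an assumption.

```python
import re

TOKEN_RE = re.compile(r'\s*("[^"]*"|<=|>=|<>|==|[()=<>]|[^\s()=<>"]+)')
BOOLEANS = {"and", "or", "not"}
COMPARATORS = {"=", ">", "<", ">=", "<=", "<>", "=="}


def tokenize(query):
    """Split a query into phrases, parentheses, comparison symbols and words."""
    query = query.strip()
    tokens, pos = [], 0
    while pos < len(query):
        match = TOKEN_RE.match(query, pos)
        if match is None:
            raise ValueError(f"cannot tokenize at position {pos}")
        tokens.append(match.group(1))
        pos = match.end()
    return tokens


def parse(query):
    """Return a nested tuple tree for the query, or raise ValueError."""
    tree, rest = _scoped_clause(tokenize(query))
    if rest:
        raise ValueError(f"unexpected token: {rest[0]!r}")
    return tree


def _scoped_clause(tokens):
    # scopedClause ::= scopedClause boolean searchClause | searchClause
    left, tokens = _search_clause(tokens)
    while tokens and tokens[0].lower() in BOOLEANS:
        operator = tokens[0].lower()
        right, tokens = _search_clause(tokens[1:])
        left = (operator, left, right)  # left-associative, no precedence
    return left, tokens


def _search_clause(tokens):
    # searchClause ::= '(' query ')' | index relation searchTerm | searchTerm
    if not tokens:
        raise ValueError("unexpected end of query")
    head = tokens[0]
    if head == "(":
        tree, rest = _scoped_clause(tokens[1:])
        if not rest or rest[0] != ")":
            raise ValueError("missing closing parenthesis")
        return tree, rest[1:]
    if head == ")" or head in COMPARATORS:
        raise ValueError(f"unexpected token: {head!r}")
    if len(tokens) >= 3 and tokens[1] in COMPARATORS:
        value = tokens[2]
        if value in COMPARATORS or value in ("(", ")"):
            raise ValueError(f"unexpected token: {value!r}")
        return ("compare", head, tokens[1], value.strip('"')), tokens[3:]
    return ("term", head.strip('"')), tokens[1:]


print(parse('"macular degeneration" AND white'))
print(parse('schizophrenia AND age > 40'))
```

Boolean operators are applied strictly left to right, matching the left-recursive scopedClause rule. Phrases must use straight double quotes in this sketch.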
System Development: Study Ranking
BM25 ranking algorithm
• N: total number of studies
• n_t: number of studies containing the term t
• c: field in study d
• w_c: boost factor for each field c
• tf: term frequency
• idf: inverse document frequency
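The scoring formula itself did not survive on this slide, so the following is a sketch of one common field-weighted BM25 formulation built from the symbols listed above. It should not be read as the exact variant used in PhenDisco. The parameters k1 and b, their default values, the function name and the data layout are assumptions.

```python
import math


def bm25f_score(query_terms, study, corpus, field_boosts, k1=1.2, b=0.75):
    """Score one study for a query with a field-weighted BM25.

    study and every corpus entry map a field name c to a list of tokens;
    field_boosts maps a field name c to its boost factor w_c.
    """
    n_studies = len(corpus)  # N

    def weighted_length(doc):
        return sum(field_boosts.get(c, 1.0) * len(tokens) for c, tokens in doc.items())

    average_length = sum(weighted_length(doc) for doc in corpus) / n_studies
    length_ratio = weighted_length(study) / average_length

    score = 0.0
    for term in query_terms:
        # n_t: number of studies that contain the term in any field
        n_t = sum(1 for doc in corpus if any(term in tokens for tokens in doc.values()))
        if n_t == 0:
            continue
        idf = math.log(1 + (n_studies - n_t + 0.5) / (n_t + 0.5))
        # tf: term frequency summed over fields, each weighted by w_c
        tf = sum(field_boosts.get(c, 1.0) * tokens.count(term) for c, tokens in study.items())
        score += idf * tf * (k1 + 1) / (tf + k1 * (1 - b + b * length_ratio))
    return score


corpus = [
    {"title": ["copd", "genetics"], "description": ["lung", "function"]},
    {"title": ["asthma", "cohort"], "description": ["copd", "subgroup"]},
    {"title": ["breast", "cancer"], "description": ["density", "imaging"]},
]
boosts = {"title": 2.0, "description": 1.0}
for doc in corpus:
    print(doc["title"], bm25f_score(["copd"], doc, corpus, boosts))
```

With a higher boost on the title field, a study that mentions the term in its title ranks above one that mentions it only in the description.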
Technical Infrastructure
• URL: http://pfindr-data.ucsd.edu/_PhDVer1/
• Linux machine: Ubuntu, 64-bit
• Memory: 32 GB RAM
• Database: MySQL 14.14
• Web server: Apache 2.2.20
• Programming languages: PHP, Python, JavaScript
• Python toolkits: pyparsing, Whoosh
System Demonstration
System Evaluation
• Search Accuracy
• User Interface
Evaluation on Basic Search (as of July 7, 2013)
Basic Search | dbGaP Recall | dbGaP Precision | PhenDisco Recall | PhenDisco Precision
COPD | 100% | 41.67% | 80.00% | 100%
“macular degeneration” AND white | 100% | 42.86% | 100% | 85.71%
“breast cancer” AND “breast density” | 100% | 66.67% | 50.00% | 100%
schizophrenia | 100% | 46.88% | 86.67% | 92.86%
cardiomyopathy | 100% | 35.00% | 100% | 100%
Average | 100% | 46.61% | 83.33% | 95.71%
Average F-measure: dbGaP 0.64, PhenDisco 0.89
Evaluation on Advanced Search (as of July 7, 2013)
Advanced Search in PhenDisco | Recall | Precision
“macular degeneration” AND white AND [whole genome genotyping] | 100% | 66.67%
“breast cancer” AND “breast density” AND [IRB not required] AND [whole genome genotyping] | 100% | 100%
schizophrenia AND [female] AND [AFFY_6.0] | 100% | 100%
cardiomyopathy AND [copy number variant analysis] | 100% | 100%
Average | 100% | 91.67%
Average F-measure: 0.96
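The average F-measures in the two evaluation tables are consistent with the balanced F-measure (harmonic mean) of the average precision and average recall. The slides do not state how the averages were combined, so treat this as a check under that assumption. The function name is illustrative.

```python
def f_measure(precision, recall):
    """Balanced F-measure: harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (average precision, average recall) as fractions, taken from the tables above
reported = {
    "dbGaP basic search": (0.4661, 1.0),
    "PhenDisco basic search": (0.9571, 0.8333),
    "PhenDisco advanced search": (0.9167, 1.0),
}
for name, (precision, recall) in reported.items():
    print(name, round(f_measure(precision, recall), 2))
```

Rounded to two decimals, these reproduce the reported 0.64, 0.89 and 0.96.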
Feedback on the User Interface (N=6)
Trainees
• Post-doctoral trainees
  – Ko-Wei Lin, DVM, PhD (Study Abstraction, Standardization, Evaluation)
  – Mindy Ross, MD, MBA (Study Abstraction, Ontology Building)
  – Neda Alipanah, PhD (Ontology Building)
  – Xiaoqian Jiang, PhD (Ranking Algorithm)
  – Mike Conway, PhD (Study Abstraction)
• Undergraduate trainees
  – Alexander Hsieh (Standardization)
  – Vinay Venkatesh (System Development)
  – Rafael Talavera (Evaluation)
  – Karen Truong (Study Abstraction)
  – Asher Garland (System Development)
Acknowledgements
• Lucila Ohno-Machado (PI)
• Collaborator
  – Hua Xu
• Other contribution
  – Jihoon Kim
  – Wendy Chapman
  – Melissa Tharp
• Staff
  – Stephanie Feudjio Feupe, MS
  – Seena Farzaneh, MS
  – Rebecca Walker, BS
• Funding: UH2HL108785 from NHLBI, NIH
Questions? Project Homepage: http://pfindr.net
PhenDisco: http://pfindr-data.ucsd.edu/_PhDVer1/index.php
Contact: [email protected]