Phenotyping from Electronic Health Records
Jimeng Sun
College of Computing
Georgia Tech
More info at sunlab.org
2
My research focuses on health analytics
Genomic data
Clinical data
Behavior data
Social data
Health Analytic Apps
Heart disease predictor for $5.99
Analytic cloud
Privacy engine
Visualization
User
Clinical Researchers
Training data
Research Challenges
Big data analytics on the cloud
Data mining and machine learning techniques
Privacy preserving data sharing
Visual analytic techniques
My focus
3
Outline
Phenotyping from EHR
Other work
– PARAMO: Large scale predictive modeling pipeline
– Patient Similarity
4
EHR
Phenotyping from Electronic Health Records
Demographic
Diagnosis
Medication
Lab Tests
Procedure
Medical Images
Medical Concepts (phenotypes)
Phenotyping
5
Motivation: Increasing Importance of Electronic Health Records
EHRs have become accepted data sources for clinical research
EHR data can enable much more research
Explosion in interest
HOW TO TURN EHR INTO PHENOTYPES?
6
This talk
Challenges in Phenotyping from EHR
Representation
– How to represent heterogeneous EHR data and phenotypes?
Speed
– How to construct diverse phenotypes in an unsupervised fashion?
Intuition
– How to validate and refine the phenotypes?
Adaptation
– How to adapt phenotypes from one site to another?
7
Constructing Feature Tensor
A tensor is a generalization of a matrix
– A matrix is a 2nd-order tensor
Tensors can better capture interactions among concepts
Data element types:
– Binary
– Count (integer)
– Continuous (numeric)
Mode
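A minimal sketch of how a count tensor could be assembled, assuming the raw data arrive as (patient, diagnosis, medication) events. The function name `build_count_tensor` and the sample events are hypothetical, and the sparse dictionary layout is an illustrative choice rather than the talk's implementation.

```python
from collections import Counter

def build_count_tensor(events):
    """Build a sparse 3rd-order count tensor from (patient, diagnosis, medication) events.

    Returns the sorted labels of each mode and a dict mapping
    (patient_idx, diagnosis_idx, medication_idx) -> co-occurrence count.
    """
    counts = Counter(events)
    # One sorted label list per mode: patients, diagnoses, medications.
    modes = [sorted({event[axis] for event in counts}) for axis in range(3)]
    index = [{label: i for i, label in enumerate(labels)} for labels in modes]
    entries = {
        tuple(index[axis][event[axis]] for axis in range(3)): n
        for event, n in counts.items()
    }
    return modes, entries

# Hypothetical events: patient p1 received an ACE inhibitor for hypertension twice.
events = [
    ("p1", "Hypertension", "ACE Inhibitors"),
    ("p1", "Hypertension", "ACE Inhibitors"),
    ("p2", "Diabetes", "Insulin"),
]
modes, entries = build_count_tensor(events)
```

Only nonzero cells are stored, which matters because most patient, diagnosis, and medication combinations never occur.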
8
Multiple Tensors
Diagnosis-Medication
Diagnostic Sources
Medication Reconciliation
Lab Results
Symptoms
Vital
9
Phenotyping through Tensor Factorization
[Figure: tensor ≈ λ1 · Phenotype 1 + … + λR · Phenotype R]
Each phenotype combines a patients factor, a diagnosis factor, and a medication factor
λ1, …, λR: phenotype importance
Elements of each factor sum to 1
Example Phenotype
Candidate Phenotype k (40% of patients), weighted by λk
– Diagnosis factor: Hypertension
– Medication factor: Beta Blockers Cardio-Selective; Thiazides and Thiazide-Like Diuretics; HMG CoA Reductase Inhibitors
– Patients factor
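A minimal sketch of the factorized form, assuming a 3rd-order tensor with patient, diagnosis, and medication modes. The function `reconstruct` and all numbers are hypothetical; the code only shows how weighted rank-one phenotypes add up to an approximation of the tensor, not how the factors are learned.

```python
def reconstruct(lambdas, patient_factors, diag_factors, med_factors):
    """Sum of R rank-one terms: lambda_r * (patient_r outer diagnosis_r outer medication_r)."""
    n_p = len(patient_factors[0])
    n_d = len(diag_factors[0])
    n_m = len(med_factors[0])
    tensor = [[[0.0] * n_m for _ in range(n_d)] for _ in range(n_p)]
    for lam, a, b, c in zip(lambdas, patient_factors, diag_factors, med_factors):
        for i in range(n_p):
            for j in range(n_d):
                for k in range(n_m):
                    tensor[i][j][k] += lam * a[i] * b[j] * c[k]
    return tensor

# Two hypothetical phenotypes; every factor vector sums to 1.
lambdas = [10.0, 4.0]
patient_factors = [[0.5, 0.5], [0.0, 1.0]]
diag_factors = [[1.0, 0.0], [0.0, 1.0]]
med_factors = [[0.25, 0.75], [1.0, 0.0]]

tensor = reconstruct(lambdas, patient_factors, diag_factors, med_factors)
total = sum(v for plane in tensor for row in plane for v in row)
```

Because each factor sums to 1, each λ equals the total count mass explained by its phenotype, which is one way to read λ as phenotype importance.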
11
Phenotyping Process using Tensor Factorization
[Figure: phenotyping process]
Count data → tensor factorization → phenotype definitions (λ1 + … + λR)
New patients' count data → projection onto phenotype definitions → phenotype matrix
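The slides do not say how the projection step is computed. One plausible sketch, assuming the learned diagnosis and medication factors are held fixed, is to fit nonnegative phenotype weights for each new patient with multiplicative updates for the KL divergence. The function `project_patient` and the two phenotypes below are hypothetical.

```python
def project_patient(counts, phenotypes, n_iter=50):
    """Fit nonnegative phenotype weights for one patient with the phenotypes held fixed.

    counts: {(diagnosis, medication): count} for the new patient.
    phenotypes: list of (diagnosis_factor, medication_factor) dicts, each summing to 1.
    """
    weights = [1.0] * len(phenotypes)

    def cell(r, d, m):
        # Value of phenotype r at the (diagnosis, medication) cell.
        diag_r, med_r = phenotypes[r]
        return diag_r.get(d, 0.0) * med_r.get(m, 0.0)

    for _ in range(n_iter):
        updated = []
        for r in range(len(phenotypes)):
            ratio_sum = 0.0
            for (d, m), x in counts.items():
                model = sum(w * cell(s, d, m) for s, w in enumerate(weights))
                if x > 0 and model > 0:
                    ratio_sum += cell(r, d, m) * x / model
            # The usual denominator (sum of phenotype r over all cells) is 1
            # because both of its factors sum to 1, so it is omitted.
            updated.append(weights[r] * ratio_sum)
        weights = updated
    return weights

# Hypothetical phenotype definitions and one new patient.
phenotypes = [
    ({"Hypertension": 1.0}, {"ACE Inhibitors": 0.5, "Thiazides": 0.5}),
    ({"Diabetes": 1.0}, {"Insulin": 1.0}),
]
counts = {("Hypertension", "ACE Inhibitors"): 2, ("Hypertension", "Thiazides"): 2}
weights = project_patient(counts, phenotypes)
```

Stacking the weight vectors of all new patients row by row gives a patients-by-phenotypes matrix that can serve as features downstream.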
12
CP-APR Model
KL divergence for count data
Nonnegative combinations
Stochastic constraint (elements in each factor sum to 1)
Element index
Chi, E.C. and Kolda, T.G. 2012. On tensors, sparsity, and nonnegative factorizations. SIAM Journal on Matrix Analysis and Applications. 33, 4 (2012), 1272–1299.
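For reference, the objective that these annotations describe can be written as below. The notation is an assumption taken from the cited Chi and Kolda paper rather than from the slide: the data tensor has entries x_i, the model tensor has entries m_i, i is the element (multi-)index, N is the number of modes, and R is the number of phenotypes.

```latex
\min_{\mathcal{M}} \; \sum_{\mathbf{i}} \left( m_{\mathbf{i}} - x_{\mathbf{i}} \log m_{\mathbf{i}} \right)
\quad \text{s.t.} \quad
\mathcal{M} = \sum_{r=1}^{R} \lambda_r \, \mathbf{a}_r^{(1)} \circ \cdots \circ \mathbf{a}_r^{(N)},
\qquad \lambda_r \ge 0,
\qquad \mathbf{a}_r^{(n)} \ge 0,
\qquad \bigl\| \mathbf{a}_r^{(n)} \bigr\|_1 = 1 .
```

The sum is the KL divergence between data and model up to terms that do not depend on the model, which corresponds to assuming Poisson-distributed counts.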
13
Constructing the Tensor
Medication orders from the Geisinger dataset
Diagnosis codes are aggregated into HCC codes
Medications are defined by pharmacy subclass
31,816 patients × 169 diagnoses × 471 medications
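A quick check of what these dimensions imply, as a sketch. The variable names are hypothetical; the flat feature count assumes diagnoses and medications are simply placed side by side, which appears to match the 640 features quoted for the baseline later in the talk.

```python
n_patients, n_diagnoses, n_medications = 31_816, 169, 471

# Number of cells in the dense patient x diagnosis x medication tensor.
n_cells = n_patients * n_diagnoses * n_medications

# Number of columns if diagnoses and medications are treated as independent features.
n_flat_features = n_diagnoses + n_medications

print(f"{n_cells:,} tensor cells, {n_flat_features} flat features")
```

With about 2.5 billion cells, the tensor is practical only in a sparse representation.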
14
Evaluation of Phenotypes: Classification
Task: predict patients with heart failure
Model: logistic regression with ℓ1 regularization
10 random even splits of the dataset (50% training)
Features:
1. Baseline using source independence matrix
2. Principal Component Analysis (PCA)
3. Nonnegative Matrix Factorization (NMF)
4. Phenotype Tensor Factorization (PTF)
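The split protocol can be sketched as follows, assuming each split is an independent random shuffle cut in half. The function `even_splits` and its seed are hypothetical, and fitting the ℓ1-regularized logistic regression on each split is left out.

```python
import random

def even_splits(patient_ids, n_splits=10, seed=0):
    """Yield n_splits random (train, test) partitions, each using 50% of the patients for training."""
    rng = random.Random(seed)
    ids = list(patient_ids)
    half = len(ids) // 2
    for _ in range(n_splits):
        shuffled = rng.sample(ids, len(ids))  # a shuffled copy; ids stays unchanged
        yield shuffled[:half], shuffled[half:]

# Hypothetical cohort of 100 patient ids.
splits = list(even_splits(range(100)))
```

Each feature set (baseline, PCA, NMF, PTF) would be evaluated on the same ten splits so that the comparison is paired.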
15
Predictive Performance Effect
A small number of phenotypes outperforms 640 features
[Figure: predictive performance vs. number of phenotypes]
Phenotype 1
Hypertension – Opioid Combinations
Disorders of the Vertebrae and Spinal Discs – Glucocorticosteroids
Disorders of the Vertebrae and Spinal Discs – Stimulant Laxatives
Disorders of the Vertebrae and Spinal Discs – Beta Blockers Cardio-Selective
Disorders of the Vertebrae and Spinal Discs – Sympathomimetics
Disorders of the Vertebrae and Spinal Discs – Anticonvulsants - Misc
Disorders of the Vertebrae and Spinal Discs – Central Muscle Relaxants
Disorders of the Vertebrae and Spinal Discs – HMG CoA Reductase Inhibitors
Disorders of the Vertebrae and Spinal Discs – Selective Serotonin Reuptake Inhibitors
Disorders of the Vertebrae and Spinal Discs – Surfactant Laxatives
Disorders of the Vertebrae and Spinal Discs – Proton Pump Inhibitors
Disorders of the Vertebrae and Spinal Discs – Cephalosporins – 1st Generation
Disorders of the Vertebrae and Spinal Discs – Analgesics Other
Disorders of the Vertebrae and Spinal Discs – Non-Barbiturate Hypnotics
Disorders of the Vertebrae and Spinal Discs – Electrolyte Mixtures
Minor Symptoms, Signs, Findings – Opioid Combinations
Post-Surgical States/Aftercare/Elective – Opioid Combinations
Post-Surgical States/Aftercare/Elective – Stimulant Laxatives
Post-Surgical States/Aftercare/Elective – Beta Blockers Cardio-Selective
Post-Surgical States/Aftercare/Elective – HMG CoA Reductase Inhibitors
Post-Surgical States/Aftercare/Elective – Proton Pump Inhibitors
Post-Surgical States/Aftercare/Elective – Opioid Agonists
Post-Surgical States/Aftercare/Elective – Cephalosporins – 1st Generation
Post-Surgical States/Aftercare/Elective – Analgesics Other
Post-Surgical States/Aftercare/Elective – Non-Barbiturate Hypnotics
Other Eye Disorders – Opioid Combinations
Other Eye Disorders – Stimulant Laxatives
Other Eye Disorders – Opioid Agonists
Other Eye Disorders – Cephalosporins – 1st Generation
Other Eye Disorders – Non-Barbiturate Hypnotics
Phenotype 2
Major Symptoms, Abnormalities – Stimulant Laxatives
Major Symptoms, Abnormalities – Beta Blockers Cardio-Selective
Major Symptoms, Abnormalities – Sympathomimetics
Major Symptoms, Abnormalities – Coumarin Anticoagulants
Major Symptoms, Abnormalities – Salicylates
Major Symptoms, Abnormalities – Surfactant Laxatives
Major Symptoms, Abnormalities – Insulin
Major Symptoms, Abnormalities – Proton Pump Inhibitors
Major Symptoms, Abnormalities – Anti-infective Agents - Misc
Major Symptoms, Abnormalities – Vasodilators
Hypertension – Opioid Combinations
Other Gastrointestinal Disorders – Surfactant Laxatives
Other Gastrointestinal Disorders – Insulin
Diabetes with No or Unspecified Complications – Insulin
Specified Heart Arrhythmias – Beta Blockers Cardio-Selective
Iron Deficiency and Other/Unspecified Anemias and Blood Disease – Hematopoietic Growth Factors
Urinary Tract Infection – Insulin
Other Endocrine/Metabolic/Nutritional Disorders – Insulin
Vascular Disease – Coumarin Anticoagulants
Vascular Disease – Insulin
History of Disease – Insulin
Unspecified Renal Failure – Coumarin Anticoagulants
Diabetes with Renal Manifestation – Insulin
NMF factors are not concise and are harder to interpret
Phenotype 3 (17.6% of patients): Uncomplicated Diabetes
– Diabetes with No or Unspecified Complications
– Sulfonylureas
– Biguanides
– Diagnostic Tests
– Insulin Sensitizing Agents
– Diabetic Supplies
– Meglitinide Analogues
– Antidiabetic Combinations
Phenotype 4 (31.1% of patients): Mild Hypertension
– Hypertension
– ACE Inhibitors
– Thiazides and Thiazide-Like Diuretics
Phenotype 5 (36.7% of patients): Chronic Respiratory Inflammation/Infection
– Other Ear, Nose, Throat, and Mouth Disorders
– Viral and Unspecified Pneumonia, Pleurisy
– Significant Ear, Nose, and Throat Disorders
– Cough/Cold/Allergy Combinations
– Azithromycin
– Fluoroquinolones
– Sympathomimetics
– Penicillin Combinations
– Antitussives
– Glucocorticosteroids
– Tetracyclines
– Anti-infective Misc. - Combinations
– Clarithromycin
– Cephalosporins - 2nd Generation
– Cephalosporins - 1st Generation
– Expectorants
PTF interpretation: Major disease phenotypes can be identified
Phenotype 4 (31.1% of patients)
– Hypertension
– ACE Inhibitors
– Thiazides and Thiazide-Like Diuretics
Mild Hypertension
Phenotype 6 (24.3% of patients)
– Hypertension
– Calcium Channel Blockers
– Antihypertensive Combinations
– Antiadrenergic Antihypertensives
– Potassium Sparing Diuretics
Severe Hypertension
Phenotype 2 (31.5% of patients)
– Hypertension
– Beta Blockers Cardio-Selective
– Angiotensin II Receptor Antagonists
– Loop Diuretics
– Potassium
– Nitrates
– Alpha-Beta Blockers
– Vasodilators
Moderate Hypertension
PTF interpretation: Disease subtypes can be automatically identified
Over 80% of phenotype factors are clinically meaningful
19
Summary: Phenotyping using Tensor Factorization
Nonnegative tensor factorization can be used to learn phenotypes without supervision
A small number of phenotypes outperforms a large number of features in a prediction task
[Figure: tensor ≈ λ1 · Phenotype 1 + … + λR · Phenotype R]
Few diagnoses
20
PARAMO: PARALLEL PREDICTIVE MODELING PLATFORM
System
21
Predictive Modeling Pipeline
There are many different models that need to be built and evaluated
– Different patient cohorts
– Different targets
– Different features
– Different algorithms
– Multiple training and testing splits in cross-validation
~100K different pipelines
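A small sketch of why the number of pipelines grows so quickly: every choice along one dimension multiplies the total. All option names and counts below are hypothetical placeholders, not PARAMO's actual configuration.

```python
from itertools import product

# Hypothetical options along each dimension of the modeling pipeline.
cohorts = ["cohort_a", "cohort_b", "cohort_c"]
targets = ["heart_failure_onset", "hypertension_control"]
feature_sets = ["diagnoses", "medications", "labs", "all"]
algorithms = ["logistic_regression", "random_forest", "naive_bayes"]
cv_splits = range(10)

# One pipeline per combination of choices.
pipelines = list(product(cohorts, targets, feature_sets, algorithms, cv_splits))
print(len(pipelines))
```

Even this toy grid gives 3 × 2 × 4 × 3 × 10 = 720 pipelines; a few more options per dimension reach the order of magnitude quoted above.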
22
Running Time vs. Parallelism level
Patient sets
– Small: 5,000 patients for hypertension control prediction
– Medium: 33K for predicting heart failure onset
– Large: 319K for hypertension diagnosis prediction
Dependency graph: 1808 nodes and 3610 edges
9 days
3 hours
72X speed up
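A sketch of how a task dependency graph exposes parallelism, using Python's standard `graphlib` module (Python 3.9 or later). The tiny graph and the function `parallel_batches` are hypothetical and say nothing about PARAMO's scheduler; the last line only checks the reported speed-up arithmetic.

```python
from graphlib import TopologicalSorter

def parallel_batches(dependencies):
    """Group tasks into batches; all tasks in one batch can run at the same time."""
    sorter = TopologicalSorter(dependencies)
    sorter.prepare()
    batches = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())  # tasks whose prerequisites are all done
        batches.append(ready)
        sorter.done(*ready)
    return batches

# Hypothetical pipeline: each task maps to the set of tasks it depends on.
dependencies = {
    "features": {"cohort"},
    "split_1": {"features"},
    "split_2": {"features"},
    "model_1": {"split_1"},
    "model_2": {"split_2"},
    "evaluate": {"model_1", "model_2"},
}
batches = parallel_batches(dependencies)

# 9 days of serial running time versus 3 hours in parallel.
speedup = (9 * 24) / 3
```

Here the two splits and the two models can run concurrently, and 216 hours divided by 3 hours gives the 72X figure.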
23
PATIENT SIMILARITY
Algorithm
24
Patient Similarity Problem
[Figure labels: Patient, Doctor, Supervision, Similarity search]
26
Summary on Patient Similarity
To learn a customized distance metric for a target [1]
Extension 1: Composite distance integration (Comdi) [2]
– How to combine multiple patient similarity measures?
Extension 2: Interactive metric update (iMet) [3]
– How to update an existing distance measure?
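The slides do not give the form of the learned metric. A common choice in supervised metric learning is a generalized Mahalanobis distance, sketched below under that assumption. The function `metric_distance` and the weight matrices are hypothetical, and learning the matrix from supervision is not shown.

```python
import math

def metric_distance(x, y, weights):
    """Generalized Mahalanobis distance sqrt((x - y)^T W (x - y)) for a symmetric PSD matrix W."""
    diff = [a - b for a, b in zip(x, y)]
    n = len(diff)
    squared = sum(diff[i] * weights[i][j] * diff[j] for i in range(n) for j in range(n))
    return math.sqrt(squared)

patient_a = [0.0, 0.0]
patient_b = [3.0, 4.0]

# With the identity matrix this reduces to Euclidean distance.
identity = [[1.0, 0.0], [0.0, 1.0]]
euclidean = metric_distance(patient_a, patient_b, identity)

# A hypothetical learned matrix that emphasizes the first feature and ignores the second.
learned = [[4.0, 0.0], [0.0, 0.0]]
customized = metric_distance(patient_a, patient_b, learned)
```

Changing the matrix changes which patients count as similar, which is what customizing the metric for a target amounts to in this form.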
1. Jimeng Sun, Fei Wang, Jianying Hu, Shahram Ebadollahi: Supervised Patient Similarity Measure of Heterogeneous Patient Records. ACM SIGKDD Explorations Newsletter 14: 16 (2012)
2. Fei Wang, Jimeng Sun, Shahram Ebadollahi: Integrating Distance Metrics Learned from Multiple Experts and its Application in Inter-Patient Similarity Assessment. SDM 2011: 59-70
3. Fei Wang, Jimeng Sun, Jianying Hu, Shahram Ebadollahi: iMet: Interactive Metric Learning in Healthcare Applications. SDM 2011: 944-955