Data Mining Cardiovascular
Bayesian Networks
Charles Twardy†, Ann Nicholson†, Kevin Korb†, John McNeil‡
(Danny Liew‡, Sophie Rogers‡, Lucas Hope†)
†School of Computer Science & Software Engineering
‡Dept. of Epidemiology & Preventive Medicine
Monash University
www.datamining.monash.edu.au/bnepi
Overview
Problem: assessment of risk for coronary heart disease (CHD)
1. Knowledge Engineering: medical experts, 2 epidemiological models
2. Data Mining: Busselton Study data, causal discovery (CaMML) + other learners
3. Evaluation
Bayesian network software (Netica)
Knowledge Engineering BNs from the medical literature
The Australian Busselton Study
» every 3 years, 1966-1981, > 8,000 participants
» mortality followup via WA death register + manually
» Cox proportional-hazards model, 2,258 from 1978 cohort
» CHD event base rates: 23% for men, 14% for women
The German PROCAM Study
» 1979-1985, followup every 2 years, > 25,000 participants
» Scoring model (based on Cox), ~5,000 men
» CHD event base rates: ~6%
General question: are models transferable across populations?
The Busselton BN: nodes
The Busselton BN: arcs
predictor variables
uninformative
10-year risk of CHD event
P(S,B,Al,At) = P(S) P(B|S) P(Al|S) P(At|S)
BNs summarize the joint distribution
All nodes have an associated conditional prob. distribution
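The factorization above can be sketched in a few lines of Python. This is only an illustration: the four nodes are treated as binary, and every number in the conditional probability tables is hypothetical rather than taken from the Busselton model. It shows that multiplying the per-node distributions yields a proper joint distribution, and that reasoning from evidence amounts to summing that joint.

```python
from itertools import product

# Hypothetical CPTs: S is the only parent of B, Al and At.
# All probabilities are made up for illustration.
p_s = {True: 0.3, False: 0.7}
p_b_given_s = {True: {True: 0.6, False: 0.4}, False: {True: 0.2, False: 0.8}}
p_al_given_s = {True: {True: 0.5, False: 0.5}, False: {True: 0.1, False: 0.9}}
p_at_given_s = {True: {True: 0.7, False: 0.3}, False: {True: 0.4, False: 0.6}}


def joint(s, b, al, at):
    """P(S,B,Al,At) = P(S) P(B|S) P(Al|S) P(At|S)."""
    return p_s[s] * p_b_given_s[s][b] * p_al_given_s[s][al] * p_at_given_s[s][at]


# The factored joint sums to 1 over all 16 configurations.
total = sum(joint(*values) for values in product([True, False], repeat=4))


def posterior_s_given_b(b):
    """P(S=True | B=b), obtained by summing out Al and At."""
    numerator = sum(
        joint(True, b, al, at) for al, at in product([True, False], repeat=2)
    )
    denominator = sum(
        joint(s, b, al, at) for s, al, at in product([True, False], repeat=3)
    )
    return numerator / denominator
```

Brute-force summation is fine for four binary nodes; tools such as Netica use more efficient inference for larger networks.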
The Busselton BN: discretization
discretization choices
binary nodes
The Busselton BN: reasoning
Bad cholesterol
Heavy smoking
Normal
More risk factors!
A risk assessment tool for clinicians
Previous tool: TAKEHEART
Combine risk assessment (probability) with costs.
Risk Assessment Tool: example
Young, predictor not observed – don’t treat
Old, predictor not observed – treat
Not so old, predictor not observed – treat
Young, predictor observed – don’t treat
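One simple way to combine a risk probability with costs is an expected-cost comparison. The sketch below is not the TAKEHEART rule or the tool described here; the function name, the cost figures, and the assumption that treatment removes a fixed fraction of the risk are all hypothetical.

```python
def should_treat(risk, cost_of_treatment, cost_of_event, risk_reduction):
    """Return True when treating has the lower expected cost.

    risk            -- probability of a CHD event without treatment
    risk_reduction  -- hypothetical fraction of that risk removed by treatment
    """
    expected_untreated = risk * cost_of_event
    expected_treated = (
        cost_of_treatment + risk * (1.0 - risk_reduction) * cost_of_event
    )
    return expected_treated < expected_untreated


# Hypothetical figures: the same costs give different decisions at different risks.
low_risk_decision = should_treat(0.05, 1000.0, 50000.0, 0.3)   # don't treat
high_risk_decision = should_treat(0.25, 1000.0, 50000.0, 0.3)  # treat
```

With fixed costs the decision flips once the risk passes a threshold, which is the pattern in the example cases above.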
CaMML: a causal learner
Developed at Monash University
Data mines BNs from epidemiological data
Minimum message length (MML) metric:
» Trades off complexity vs goodness of fit
MCMC search over model space
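The trade-off can be written as a generic two-part message length, where H is a candidate model and D the data. This is the standard MML form only; the specific encoding CaMML uses for network structures and parameters is not given here.

```latex
I(H, D) = I(H) + I(D \mid H) = -\log P(H) - \log P(D \mid H)
```

A more complex model raises the first term and usually lowers the second, so minimizing the total penalizes structure that the data do not pay for.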
CaMML: example BN
Evaluation
Predicting 10-year risk of CHD using Busselton data
Split data 90-10 training/testing
10-fold cross-validation
Metrics:
» Predictive Accuracy
» ROC Curves (area under curve): correct classification vs false positives
» Bayesian Information Reward (BIR)
Using Weka: Java environment for machine learning tools and techniques
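The 90-10 split with 10-fold cross-validation can be sketched independently of Weka. The function name, the fixed seed, and the round-robin fold assignment below are assumptions made for illustration; Weka's own splitting (including any stratification) may differ.

```python
import random


def k_fold_indices(n_cases, k=10, seed=0):
    """Yield (train, test) index lists; each case is tested exactly once."""
    indices = list(range(n_cases))
    random.Random(seed).shuffle(indices)
    # Deal the shuffled cases into k folds, round-robin.
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train, test


# With 100 cases and k=10, every split is 90 training / 10 testing.
splits = list(k_fold_indices(100, k=10))
```

Each learner is trained on the 90% portion and scored on the held-out 10%, and the scores are averaged over the ten folds.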
Predictive accuracy
Examine each joint observation in the sample
Add any available evidence for the other nodes
Update the network
Use the value with the highest probability as the predicted value
Compare the predicted value with the actual value
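The final two steps can be sketched as follows. The sketch assumes the network has already been updated for each test case, so the posterior over the target node is available as a dictionary from state to probability; the function name and the state labels are hypothetical.

```python
def predictive_accuracy(posteriors, actual_states):
    """Fraction of cases whose most probable state matches the actual state."""
    correct = 0
    for dist, actual in zip(posteriors, actual_states):
        predicted = max(dist, key=dist.get)  # state with the highest probability
        if predicted == actual:
            correct += 1
    return correct / len(actual_states)


# Hypothetical posteriors for three test cases.
example_posteriors = [
    {"chd": 0.7, "no_chd": 0.3},
    {"chd": 0.2, "no_chd": 0.8},
    {"chd": 0.6, "no_chd": 0.4},
]
example_actual = ["chd", "no_chd", "no_chd"]
example_accuracy = predictive_accuracy(example_posteriors, example_actual)
```

Accuracy looks only at the top-ranked state, so it ignores how confident the prediction was; the information reward is meant to capture that.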
Information Reward
Rewards calibration of probabilities
Zero reward for just reporting priors
Unbounded below for a bad prediction
Bounded above by a maximum that depends on priors
IR = 0
Repeat for each state i:
    If i == correct state
        IR += log( p[i] / prior[i] )
    else
        IR += log( (1 - p[i]) / (1 - prior[i]) )
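A minimal sketch of a reward with the properties listed above, assuming a prior-relative log-ratio form: that form is what makes reporting the priors score exactly zero. The function names, the dictionary representation, and the natural logarithm are assumptions; any averaging over states or test cases is left out.

```python
import math


def _log_ratio(numerator, denominator):
    """log(numerator / denominator), with -inf when the numerator is zero."""
    if numerator <= 0.0:
        return -math.inf
    return math.log(numerator / denominator)


def information_reward(posterior, prior, correct_state):
    """Sum the per-state rewards for one test case.

    posterior, prior -- dicts mapping each state to a probability
    Assumes every prior probability lies strictly between 0 and 1.
    """
    reward = 0.0
    for state, p in posterior.items():
        q = prior[state]
        if state == correct_state:
            reward += _log_ratio(p, q)
        else:
            reward += _log_ratio(1.0 - p, 1.0 - q)
    return reward


# Hypothetical prior; reporting it unchanged earns zero reward.
example_prior = {"chd": 0.2, "no_chd": 0.8}
reward_for_prior = information_reward(example_prior, example_prior, "chd")
```

A fully confident correct prediction earns the maximum, which depends on the prior of the correct state, and a fully confident wrong one scores negative infinity.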
Experimental Evaluation
Experiment 1:
» Compare Busselton, PROCAM and CaMML BNs
Experiment 2:
» Compare CaMML and other standard machine learners (from Weka)
Evaluation: Weka learners
Naïve Bayes
J48 (version of C4.5)
CaMML – Causal BN learner, using MML metric
AODE
TAN
Logistic
Experiment 1: ROC Results
Area under curve (AUC)
[Figure: ROC curves. Labelled points: priors, and the extremes "No-one at risk!" and "Everyone at risk!"]
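The area under the ROC curve equals the probability that a randomly chosen positive case receives a higher score than a randomly chosen negative one, with ties counted as half. The sketch below computes it that way by direct pairwise comparison; the function name and the example scores are hypothetical, and it is not the routine Weka uses.

```python
def auc(scores, labels):
    """Area under the ROC curve via pairwise comparison of scores.

    scores -- predicted probability of a CHD event for each case
    labels -- truthy where the event actually occurred
    """
    positives = [s for s, y in zip(scores, labels) if y]
    negatives = [s for s, y in zip(scores, labels) if not y]
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # a tie counts as half
    return wins / (len(positives) * len(negatives))


# A model that gives everyone the same score cannot rank cases at all.
constant_model_auc = auc([0.2, 0.2, 0.2, 0.2], [1, 0, 1, 0])
```

Because AUC depends only on the ranking of cases, it measures relative risk and says nothing about whether the probabilities themselves are calibrated.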
Experiment 2: ROC Results
Experiment 2: Bayesian Info Reward
Summary of Results
Experiment I (Models of whole data)
PROCAM model does at least as well as Busselton
» On Busselton data
» For both "relative" (ROC) and "absolute" (BIR) risk
CaMML models do as well
» But much simpler: only 4 nodes matter to CHD10!
Experiment II (Cross-validation of learners)
Logistic regression does best on both metrics
» Statistically powerful: only 1 parameter per arc
» No search required: structure is given
» No discretization necessary
Conclusions
Busselton & PROCAM models appear to perform equally well on Busselton data, using an absolute risk measure (BIR) from the literature
CaMML results suggest the data have high variance and are too weak to support inference to complex models. Combining data would help.
Future directions
Improve data mining by
» Adding prior knowledge to search
» Assessing whether data sources can be combined; if so, do so
Investigate combination of continuous and discrete variables in data mining and modeling
Develop new TAKEHEART model using BNs (taking the best from experts, literature, data mining)
» with intervention modeling (Causal Reckoner)
» with decision support
» with GUI, usable by clinicians
References
G. Assmann, P. Cullen and H. Schulte. Simple scoring scheme for calculating the risk of acute coronary events based on the 10-year follow-up of the Prospective Cardiovascular Munster (PROCAM) study. Circulation, 105(3):310-315, 2002.
M.W. Knuiman, H.T. Vu and H.C. Bartholomew. Multivariate risk estimation for coronary heart disease: the Busselton Health Study. Australian & New Zealand Journal of Public Health, 22:747-753, 1998.
C.S. Wallace and K.B. Korb. Learning Linear Causal Models by MML Sampling. In A. Gammerman, editor, Causal Models and Intelligent Data Management, pages 89-111. Springer-Verlag, 1999. www.datamining.monash.edu.au/software/camml
C.R. Twardy, A.E. Nicholson, K.B. Korb and J. McNeil. Data Mining Cardiovascular Bayesian Networks. Technical report 2004/165. School of Computer Science and Software Engineering, Monash University, 2004.
C.R. Twardy, A.E. Nicholson and K.B. Korb. Knowledge engineering cardiovascular Bayesian networks from the literature, Technical Report 2005/170, School of CSSE, Monash University, 2005.