
Data Mining Cardiovascular Bayesian Networks

Charles Twardy†, Ann Nicholson†, Kevin Korb†, John McNeil‡

(Danny Liew‡, Sophie Rogers‡, Lucas Hope†)

†School of Computer Science & Software Engineering
‡Dept. of Epidemiology & Preventive Medicine

Monash University
www.datamining.monash.edu.au/bnepi


Overview

Problem: assessment of risk for coronary heart disease (CHD)

1. Knowledge Engineering: medical experts, 2 epidemiological models

2. Data Mining: Busselton Study data, causal discovery (CaMML) + other learners

3. Evaluation

Bayesian network software (Netica)


Knowledge Engineering BNs from the medical literature

The Australian Busselton Study
» every 3 years, 1966-1981, > 8,000 participants
» mortality followup via WA death register + manually
» Cox proportional-hazards model, 2,258 from 1978 cohort
» CHD event base rates: 23% for men, 14% for women

The German PROCAM Study
» 1979-1985, followup every 2 years, > 25,000 participants
» Scoring model (based on Cox), ~5,000 men
» CHD event base rates: ~6%

General question: are models transferable across populations?


The Busselton BN: nodes


The Busselton BN: arcs

Figure annotations: predictor variables; uninformative; 10-year risk of CHD event

BNs summarize the joint distribution: P(S,B,Al,At) = P(S) P(B|S) P(Al|S) P(At|S)

All nodes have an associated conditional probability distribution
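The factorization on this slide can be checked numerically. The sketch below treats S, B, Al and At as binary variables and uses invented probabilities (they are assumptions for illustration, not values from the Busselton BN). It builds the joint from one prior and three conditional distributions, then answers a query by enumeration.

```python
from itertools import product

# Hypothetical numbers, invented for illustration only.
P_S = {True: 0.3, False: 0.7}            # P(S)
P_B_GIVEN_S = {True: 0.6, False: 0.2}    # P(B=True | S)
P_AL_GIVEN_S = {True: 0.5, False: 0.4}   # P(Al=True | S)
P_AT_GIVEN_S = {True: 0.1, False: 0.05}  # P(At=True | S)

BOOLS = (True, False)


def bernoulli(p_true, value):
    """Probability of `value` for a binary variable with P(True) = p_true."""
    return p_true if value else 1.0 - p_true


def joint(s, b, al, at):
    """P(S,B,Al,At) = P(S) P(B|S) P(Al|S) P(At|S)."""
    return (P_S[s]
            * bernoulli(P_B_GIVEN_S[s], b)
            * bernoulli(P_AL_GIVEN_S[s], al)
            * bernoulli(P_AT_GIVEN_S[s], at))


def posterior_s_given_b(b):
    """P(S=True | B=b), obtained by summing Al and At out of the joint."""
    numerator = sum(joint(True, b, al, at) for al, at in product(BOOLS, repeat=2))
    evidence = sum(joint(s, b, al, at) for s, al, at in product(BOOLS, repeat=3))
    return numerator / evidence


# The 16 joint entries sum to 1, as any probability distribution must.
total = sum(joint(*values) for values in product(BOOLS, repeat=4))
```

With four binary variables the full joint table needs 15 free parameters, while this factored form needs only 7 (one for S and two for each child).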


The Busselton BN: discretization

discretization choices

binary nodes
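A minimal sketch of what a discretization choice looks like in code. It maps a continuous measurement to a named state using sorted cut points. The cut points and labels are hypothetical, not the ones used in the Busselton BN.

```python
from bisect import bisect_right


def discretize(value, cut_points, labels):
    """Return the label of the bin that `value` falls into.

    `cut_points` must be sorted ascending; a value equal to a cut point
    goes into the upper bin. There must be one more label than cut points.
    """
    if len(labels) != len(cut_points) + 1:
        raise ValueError("need exactly len(cut_points) + 1 labels")
    return labels[bisect_right(cut_points, value)]


# Hypothetical age bins in years (an assumption for illustration).
AGE_CUTS = [50, 60, 70]
AGE_LABELS = ["<50", "50-59", "60-69", "70+"]

# A binary node is the special case of a single cut point.
BINARY_CUTS = [60]
BINARY_LABELS = ["under 60", "60 or over"]
```

Fewer bins mean smaller probability tables but coarser predictions, so the choice of cut points is a modeling decision.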


The Busselton BN: reasoning


The Busselton BN: reasoning


The Busselton BN: reasoning

Bad cholesterol

Heavy smoking

Normal


The Busselton BN: reasoning

More risk factors!


A risk assessment tool for clinicians

Previous tool: TAKEHEART

Combine risk assessment (probability) with costs.


Risk Assessment Tool: example

Young, predictor not observed – don’t treat
Old, predictor not observed – treat
Not so old, predictor not observed – treat
Young, predictor observed – don’t treat
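One way to read the treat / don't treat outcomes above is as an expected-cost comparison. The sketch below is an assumption about how risk and cost could be combined, not the TAKEHEART method. The cost figures and the relative risk under treatment are hypothetical.

```python
def expected_cost(p_event, cost_event, cost_treatment=0.0, relative_risk=1.0):
    """Expected cost: treatment cost plus event cost weighted by event risk.

    `relative_risk` scales the event probability when treatment is given
    (1.0 means no treatment effect).
    """
    return cost_treatment + p_event * relative_risk * cost_event


def should_treat(p_event, cost_event, cost_treatment, relative_risk):
    """Treat only if treating has lower expected cost than not treating."""
    treated = expected_cost(p_event, cost_event, cost_treatment, relative_risk)
    untreated = expected_cost(p_event, cost_event)
    return treated < untreated


# Hypothetical figures: an event costs 100,000, treatment costs 3,000,
# and treatment multiplies the event risk by 0.7.
low_risk_decision = should_treat(0.05, 100_000, 3_000, 0.7)
high_risk_decision = should_treat(0.25, 100_000, 3_000, 0.7)
```

Under these made-up numbers the break-even risk is 10%, so a young patient with a low posterior risk is not treated and an older patient with a higher risk is.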


CaMML: a causal learner

Developed at Monash University
Data mines BNs from epidemiological data
Minimum message length (MML) metric: trades off complexity vs goodness of fit
MCMC search over model space
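The complexity vs fit trade-off can be illustrated with a crude two-part code. This is not CaMML's MML coding and involves no MCMC search. The model cost uses a BIC-style stand-in of (k/2) log2 N bits for k free parameters, and the data cost is the negative log-likelihood in bits. The sketch compares "X and Y independent" against "X → Y" for two binary variables.

```python
import math
from collections import Counter


def data_bits(counts):
    """Bits to encode the outcomes using their maximum-likelihood frequencies."""
    n = sum(counts.values())
    return -sum(c * math.log2(c / n) for c in counts.values() if c > 0)


def two_part_length(data, with_arc):
    """Total message length in bits for a list of (x, y) binary pairs.

    Assumption: the model part costs (k/2) * log2(N) bits for k free
    parameters, a BIC-style approximation standing in for a real MML code.
    """
    n = len(data)
    bits = data_bits(Counter(x for x, _ in data))  # encode X
    if with_arc:
        # Encode Y separately for each value of X: P(Y | X), 2 parameters.
        for x_value in (0, 1):
            bits += data_bits(Counter(y for x, y in data if x == x_value))
        k = 3
    else:
        # Encode Y ignoring X: P(Y), 1 parameter.
        bits += data_bits(Counter(y for _, y in data))
        k = 2
    return 0.5 * k * math.log2(n) + bits


dependent = [(0, 0)] * 45 + [(1, 1)] * 45 + [(0, 1)] * 5 + [(1, 0)] * 5
independent = [(0, 0), (0, 1), (1, 0), (1, 1)] * 25
```

The extra arc pays for itself only when it shortens the data part by more than the cost of its extra parameter, which is the sense in which weak data favor simpler models.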


CaMML: example BN


CaMML: example BN


Evaluation

Predicting 10 year risk of CHD using Busselton data
Split data 90-10 training/testing
10 fold cross validation
Metrics:
» Predictive Accuracy
» ROC Curves (area under curve): correct classification vs false positives
» Bayesian Information Reward (BIR)

Using Weka: Java environment for machine learning tools and techniques
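The evaluation itself was run in Weka. As a language-neutral illustration, the sketch below shows the two mechanical pieces in plain Python: a 10-fold split that yields 90-10 train/test index sets, and area under the ROC curve computed as the chance that a random positive case outranks a random negative one. Function names and the seed are assumptions.

```python
import random


def k_fold_indices(n, k=10, seed=0):
    """Yield (train, test) index lists; each case is tested exactly once."""
    indices = list(range(n))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]
    for held_out in range(k):
        train = [j for f, fold in enumerate(folds) if f != held_out for j in fold]
        yield train, folds[held_out]


def auc(scores, labels):
    """Area under the ROC curve via pairwise ranking; ties count as half."""
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative case")
    wins = sum((p > q) + 0.5 * (p == q) for p in positives for q in negatives)
    return wins / (len(positives) * len(negatives))
```

A model that gives every case the same score, such as one reporting only the prior, gets an AUC of 0.5 under this definition.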


Predictive accuracy

Examine each joint observation in the sample
Add any available evidence for the other nodes
Update the network
Use the value with the highest probability as the predicted value
Compare the predicted value with the actual value
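The steps above reduce to an argmax-and-compare loop once the network has been updated for each case. The sketch assumes the posterior for the target node is already available as a dictionary from state to probability. The inference step that produces it is not shown.

```python
def predictive_accuracy(posteriors, actual_values):
    """Fraction of cases where the most probable state equals the actual one.

    `posteriors` is a list of dicts mapping each state of the target node
    to its posterior probability given the evidence for that case.
    """
    if not posteriors or len(posteriors) != len(actual_values):
        raise ValueError("need one posterior per actual value")
    correct = 0
    for distribution, actual in zip(posteriors, actual_values):
        predicted = max(distribution, key=distribution.get)
        correct += predicted == actual
    return correct / len(actual_values)


# Hypothetical posteriors for three test cases.
example_posteriors = [
    {"chd": 0.7, "no_chd": 0.3},
    {"chd": 0.4, "no_chd": 0.6},
    {"chd": 0.2, "no_chd": 0.8},
]
example_actuals = ["chd", "chd", "no_chd"]
```

Accuracy only looks at which state is most probable, so a prediction of 0.51 and one of 0.99 score the same. That is the gap the information reward metric addresses.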


Information Reward

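Rewards calibration of probabilities
Zero reward for just reporting priors
Unbounded below for a bad prediction
Bounded above by a maximum that depends on priors

IR = 0
For each state i of the target node:
    If i == correct state
        IR += log ( p'[i] / p[i] )
    else
        IR += log ( (1 - p'[i]) / (1 - p[i]) )

where p[i] is the prior probability of state i and p'[i] is the predicted probability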
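A runnable sketch of an information reward with the properties listed on this slide: zero for reporting the priors, unbounded below, and bounded above by a maximum that depends on the priors. It assumes the reward compares each predicted probability with the prior and averages over the states of the target node. The state names and the choice of log base 2 are assumptions.

```python
import math


def bayesian_information_reward(predicted, prior, correct_state):
    """Reward for one test case, in bits, averaged over the node's states.

    `predicted` and `prior` map each state to a probability.
    Assumes every prior is strictly between 0 and 1.
    """
    total = 0.0
    for state, p_prior in prior.items():
        p_hat = predicted[state]
        if state == correct_state:
            # Rewarded for putting probability on the true state.
            total += math.log2(p_hat / p_prior) if p_hat > 0 else -math.inf
        else:
            # Rewarded for keeping probability off the false states.
            total += (math.log2((1 - p_hat) / (1 - p_prior))
                      if p_hat < 1 else -math.inf)
    return total / len(prior)


# Hypothetical prior for a binary CHD node.
PRIOR = {"chd": 0.2, "no_chd": 0.8}
```

Reporting `PRIOR` itself scores zero, a fully confident correct prediction scores the prior-dependent maximum, and a fully confident wrong prediction scores negative infinity.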


Experimental Evaluation

Experiment 1:
» Compare Busselton, PROCAM and CaMML BNs

Experiment 2:
» Compare CaMML and other standard machine learners (from Weka)


Evaluation: Weka learners

Naïve Bayes
J48 (version of C4.5)
CaMML: causal BN learner, using MML metric
AODE
TAN
Logistic

Pr=1/3 Pr=1/3 Pr=1/3


Experiment 1: ROC Results

Area under curve (AUC)

priors

Extremes:

No-one at risk!

Everyone at risk!


Experiment 2: ROC Results


Experiment 2: Bayesian Info Reward


Summary of Results

Experiment I (Models of whole data)
PROCAM model does at least as well as Busselton
» On Busselton data
» For both "relative" (ROC) and "absolute" (BIR) risk
CaMML models do as well
» But much simpler: only 4 nodes matter to CHD10!

Experiment II (Cross-validation of learners)
Logistic regression does best on both metrics
» Statistically powerful: only 1 parameter per arc
» No search required: structure is given
» No discretization necessary
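The "1 parameter per arc" point can be made concrete by counting free parameters. The sketch compares a full conditional probability table for a discrete child with a logistic model that has one weight per parent plus an intercept. It assumes each parent enters the logistic model as a single numeric or binary input.

```python
from math import prod


def cpt_free_parameters(parent_arities, child_arity=2):
    """Free parameters in a full CPT: one distribution per parent combination."""
    return prod(parent_arities) * (child_arity - 1)


def logistic_free_parameters(n_parents):
    """One weight per incoming arc plus an intercept."""
    return n_parents + 1


# Four binary parents of a binary child.
cpt_count = cpt_free_parameters([2, 2, 2, 2])
logistic_count = logistic_free_parameters(4)
```

The table grows exponentially with the number of parents while the logistic model grows linearly, so the same data estimate each logistic parameter from far more cases.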


Conclusions

Busselton & PROCAM models appear to perform equally well on Busselton data, using an absolute risk measure (BIR) from the literature

CaMML results suggest the data have high variance and are too weak to support inference to complex models. Combining data would help.


Future directions

Improve data mining by
» Adding prior knowledge to search
» Assessing whether data sources can be combined; if so, do so

Investigate combination of continuous and discrete variables in data mining and modeling

Develop new TAKEHEART model using BNs (taking the best from experts, literature, data mining)
» with intervention modeling (Causal Reckoner)
» with decision support
» with GUI, usable by clinicians


References

G. Assmann, P. Cullen and H. Schulte. Simple scoring scheme for calculating the risk of acute coronary events based on the 10-year follow-up of the Prospective Cardiovascular Münster (PROCAM) study. Circulation, 105(3):310-315, 2002.

M.W. Knuiman, H.T. Vu and H.C. Bartholomew. Multivariate risk estimation for coronary heart disease: the Busselton Health Study. Australian & New Zealand Journal of Public Health, 22:747-753, 1998.

C.S. Wallace and K.B. Korb. Learning Linear Causal Models by MML Sampling. In A. Gammerman, editor, Causal Models and Intelligent Data Management, pages 89-111. Springer-Verlag, 1999. www.datamining.monash.edu.au/software/camml

C.R. Twardy, A.E. Nicholson, K.B. Korb and J. McNeil. Data Mining Cardiovascular Bayesian Networks. Technical Report 2004/165, School of Computer Science and Software Engineering, Monash University, 2004.

C.R. Twardy, A.E. Nicholson and K.B. Korb. Knowledge engineering cardiovascular Bayesian networks from the literature. Technical Report 2005/170, School of CSSE, Monash University, 2005.