Data Mining Cardiovascular Bayesian Networks

Charles Twardy†, Ann Nicholson†, Kevin Korb†, John McNeil‡
(Danny Liew‡, Sophie Rogers‡, Lucas Hope†)

†School of Computer Science & Software Engineering
‡Dept. of Epidemiology & Preventive Medicine
Monash University
www.datamining.monash.edu.au/bnepi

Overview

Problem: assessment of risk for coronary heart disease (CHD)

1. Knowledge Engineering
» Medical experts
» 2 epidemiological models

2. Data Mining
» Busselton Study data
» Causal discovery (CaMML)
» + Other learners

3. Evaluation

Bayesian network software (Netica)

Knowledge Engineering BNs from the medical literature

The Australian Busselton Study
» every 3 years, 1966-1981, > 8,000 participants
» mortality followup via WA death register + manually
» Cox proportional-hazards model, 2,258 from 1978 cohort
» CHD event base rates: 23% for men, 14% for women

The German PROCAM Study
» 1979-1985, followup every 2 years, > 25,000 participants
» Scoring model (based on Cox), ~5,000 men
» CHD event base rates: ~6%

General question: are models transferable across populations?

The Busselton BN: nodes

The Busselton BN: arcs

Figure labels: predictor variables; uninformative; 10-year risk of CHD event

BNs summarize the joint distribution:
P(S,B,Al,At) = P(S) P(B|S) P(Al|S) P(At|S)

All nodes have an associated conditional probability distribution
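The factorization above can be sketched in a few lines. This is a minimal illustration, assuming all four variables are binary; the probability values are made up for the example and are not taken from the Busselton model.

```python
from itertools import product

# Hypothetical conditional probability tables (all numbers are illustrative).
p_s = {True: 0.3, False: 0.7}             # P(S)
p_b_given_s = {True: 0.6, False: 0.2}     # P(B=True | S)
p_al_given_s = {True: 0.5, False: 0.4}    # P(Al=True | S)
p_at_given_s = {True: 0.25, False: 0.1}   # P(At=True | S)


def bernoulli(p_true, value):
    """Probability of a binary value given P(value=True)."""
    return p_true if value else 1.0 - p_true


def joint(s, b, al, at):
    """P(S,B,Al,At) = P(S) P(B|S) P(Al|S) P(At|S)."""
    return (p_s[s]
            * bernoulli(p_b_given_s[s], b)
            * bernoulli(p_al_given_s[s], al)
            * bernoulli(p_at_given_s[s], at))


def posterior_s_given_b(b):
    """P(S=True | B=b), obtained by summing the joint over Al and At."""
    states = [True, False]
    num = sum(joint(True, b, al, at) for al, at in product(states, repeat=2))
    den = sum(joint(s, b, al, at) for s, al, at in product(states, repeat=3))
    return num / den


# The factorized joint sums to 1 over all 16 configurations.
total = sum(joint(*cfg) for cfg in product([True, False], repeat=4))
```

Four small tables replace a full table of 16 joint entries, and any query (here P(S | B)) can be answered by summing the product.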

The Busselton BN: discretization

Figure labels: discretization choices; binary nodes

The Busselton BN: reasoning



Figure labels: Bad cholesterol; Heavy smoking; Normal; More risk factors!

A risk assessment tool for clinicians

Previous tool: TAKEHEART
Combine risk assessment (probability) with costs.

Risk Assessment Tool: example

Young, predictor not observed – don’t treat
Old, predictor not observed – treat
Not so old, predictor not observed – treat
Young, predictor observed – don’t treat
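One way to combine a risk probability with costs is to compare expected costs of the two actions. The sketch below is only an illustration of that idea: the cost figures, the `risk_reduction` parameter and the function names are hypothetical and do not come from TAKEHEART or from the slides.

```python
def expected_costs(risk, treat_cost, event_cost, risk_reduction):
    """Expected cost of each action for a patient with the given 10-year risk.

    All parameters are hypothetical: treat_cost is the cost of treatment,
    event_cost the cost of a CHD event, and risk_reduction the relative
    reduction in risk achieved by treatment.
    """
    dont_treat = risk * event_cost
    treat = treat_cost + risk * (1.0 - risk_reduction) * event_cost
    return {"treat": treat, "dont_treat": dont_treat}


def decide(risk, treat_cost=1000.0, event_cost=20000.0, risk_reduction=0.3):
    """Pick the action with the lower expected cost."""
    costs = expected_costs(risk, treat_cost, event_cost, risk_reduction)
    return min(costs, key=costs.get)


low_risk_decision = decide(0.05)   # e.g. a young patient
high_risk_decision = decide(0.30)  # e.g. an older patient
```

With these made-up numbers, treatment pays off only when the risk exceeds treat_cost / (event_cost * risk_reduction), about 0.17.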

CaMML: a causal learner

Developed at Monash University
Data mines BNs from epidemiological data
Minimum message length (MML) metric:
» Trades off complexity vs goodness of fit
MCMC search over model space

CaMML: example BN


Evaluation

Predicting 10-year risk of CHD using Busselton data
Split data 90-10 training/testing
10-fold cross-validation
Metrics:
» Predictive Accuracy
» ROC Curves (area under curve): correct classification vs false positives
» Bayesian Information Reward (BIR)
Using Weka: Java environment for machine learning tools and techniques
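The 90-10 split with 10-fold cross-validation can be sketched as follows. This is a plain-Python illustration, not the Weka procedure used in the study; the function name `k_fold_indices` and the fixed seed are assumptions made for the example.

```python
import random


def k_fold_indices(n, k=10, seed=0):
    """Yield (train, test) index lists for k-fold cross-validation.

    Each case appears in exactly one test fold; with k=10 every split is
    roughly 90% training and 10% testing.
    """
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [j for f_i, fold in enumerate(folds) if f_i != i for j in fold]
        yield train, test


# Example with 100 hypothetical cases.
splits = list(k_fold_indices(100, k=10))
```

Each learner is trained on the training indices and scored on the held-out indices, and the metric is averaged over the 10 folds.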

Predictive accuracy

Examining each joint observation in the sample
Adding any available evidence for the other nodes
Updating the network
Use value with highest probability as predicted value
Compare predicted value with the actual value
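The last two steps can be sketched as below, assuming the network update has already produced a posterior distribution over the target node for each test case. The state names and probabilities are invented for the example.

```python
def predict_state(posterior):
    """Return the state with the highest posterior probability."""
    return max(posterior, key=posterior.get)


def predictive_accuracy(posteriors, actual):
    """Fraction of cases where the most probable state equals the actual one."""
    hits = sum(predict_state(p) == a for p, a in zip(posteriors, actual))
    return hits / len(actual)


# Hypothetical posteriors for four test cases.
posteriors = [
    {"chd": 0.7, "no_chd": 0.3},
    {"chd": 0.2, "no_chd": 0.8},
    {"chd": 0.4, "no_chd": 0.6},
    {"chd": 0.1, "no_chd": 0.9},
]
actual = ["chd", "no_chd", "chd", "no_chd"]
accuracy = predictive_accuracy(posteriors, actual)
```

Note that accuracy only looks at the most probable state, so a prediction of 0.51 and one of 0.99 score the same; that is the gap the information reward metric addresses.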

Information Reward

Rewards calibration of probabilities
Zero reward for just reporting priors
Unbounded below for a bad prediction
Bounded above by a maximum that depends on priors

IR = 0

Repeat for each state i:

    If i == correct state

        IR += log ( 1 / p[i] )

    else

        IR += log ( 1 / (1 - p[i]) )
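The pseudocode above does not show where the priors enter, although the listed properties depend on them. A minimal sketch follows, assuming the reward for the correct state is the log ratio of the predicted probability to its prior, and for every other state the log ratio of the complements. That normalization is an assumption chosen to reproduce the listed properties, not a transcription of the slide.

```python
import math


def information_reward(probs, priors, correct):
    """Prior-normalized information reward for one test case (in bits).

    probs and priors map each state to a probability; correct is the
    state that actually occurred. The use of priors here is an assumption.
    """
    reward = 0.0
    for state, p in probs.items():
        prior = priors[state]
        if state == correct:
            # Reward for putting probability on the state that occurred.
            reward += math.log2(p / prior) if p > 0 else -math.inf
        else:
            # Reward for withholding probability from states that did not occur.
            reward += math.log2((1 - p) / (1 - prior)) if p < 1 else -math.inf
    return reward


priors = {"chd": 0.2, "no_chd": 0.8}
at_priors = information_reward(priors, priors, "chd")
confident_right = information_reward({"chd": 1.0, "no_chd": 0.0}, priors, "chd")
confident_wrong = information_reward({"chd": 0.0, "no_chd": 1.0}, priors, "chd")
```

Under this assumption, reporting the priors scores zero, a confident wrong prediction scores minus infinity, and the best possible score is fixed by the priors alone.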

Experimental Evaluation

Experiment 1
» Compare Busselton, PROCAM and CaMML BNs

Experiment 2
» Compare CaMML and other standard machine learners (from Weka)

Evaluation: Weka learners

Naïve Bayes
J48 (version of C4.5)
CaMML – Causal BN learner, using MML metric
AODE
TAN
Logistic


Experiment 1: ROC Results

Figure: ROC curves, compared by area under curve (AUC)
Figure labels: priors; Extremes: "No-one at risk!" and "Everyone at risk!"
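The area under an ROC curve can be computed without drawing the curve: it equals the probability that a randomly chosen positive case gets a higher score than a randomly chosen negative case, with ties counted as one half. The sketch below uses that pairwise form; the scores and labels are invented for the example.

```python
def roc_auc(scores, labels):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    scores are predicted risks; labels are 1 for a CHD event, 0 otherwise.
    Ties between a positive and a negative score count as half a win.
    """
    pos = [s for s, y in zip(scores, labels) if y]
    neg = [s for s, y in zip(scores, labels) if not y]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical predicted risks and outcomes.
scores = [0.9, 0.8, 0.35, 0.3, 0.1]
labels = [1, 1, 0, 1, 0]
auc = roc_auc(scores, labels)

# A model that reports the same prior for everyone cannot rank cases at all.
auc_constant = roc_auc([0.2] * 5, labels)
```

A constant prediction ties every pair and lands at 0.5, which is why AUC measures only relative ranking and says nothing about calibration.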

Experiment 2: ROC Results

Experiment 2: Bayesian Info Reward

Summary of Results

Experiment 1 (Models of whole data)
PROCAM model does at least as well as Busselton
» On Busselton data
» For both "relative" (ROC) and "absolute" (BIR) risk
CaMML models do as well
» But much simpler: only 4 nodes matter to CHD10!

Experiment 2 (Cross-validation of learners)
Logistic regression does best on both metrics
» Statistically powerful: only 1 parameter per arc
» No search required: structure is given
» No discretization necessary

Conclusions

Busselton & PROCAM models appear to perform equally well on Busselton data, using an absolute risk measure (BIR) from the literature

CaMML results suggest the data have high variance and are too weak to support inference to complex models. Combining data would help.

Future directions

Improve data mining by
» Adding prior knowledge to search
» Assessing whether data sources can be combined; if so, do so

Investigate combination of continuous and discrete variables in data mining and modeling

Develop new TAKEHEART model using BNs (taking the best from experts, literature, data mining)
» with intervention modeling (Causal Reckoner)
» with decision support
» with GUI, usable by clinicians

References

G. Assmann, P. Cullen and H. Schulte. Simple scoring scheme for calculating the risk of acute coronary events based on the 10-year follow-up of the Prospective Cardiovascular Munster (PROCAM) study. Circulation, 105(3):310-315, 2002.

M.W. Knuiman, H.T. Vu and H. C. Bartholomew. Multivariate risk estimation for coronary heart disease: the Busselton Health Study, Australian & New Zealand Journal of Public Health, 22:747-753, 1998.

C.S. Wallace and K.B. Korb. Learning Linear Causal Models by MML Sampling. In A. Gammerman, editor, Causal Models and Intelligent Data Management, pages 89-111. Springer-Verlag, 1999. www.datamining.monash.edu.au/software/camml

C.R. Twardy, A.E. Nicholson, K.B. Korb and J. McNeil. Data Mining Cardiovascular Bayesian Networks. Technical report 2004/165. School of Computer Science and Software Engineering, Monash University, 2004.

C.R. Twardy, A.E. Nicholson and K.B. Korb. Knowledge engineering cardiovascular Bayesian networks from the literature, Technical Report 2005/170, School of CSSE, Monash University, 2005.
