Constructing Binary Decision Tree for Predicting Deep Venous Thrombosis (DVT)


Christopher Nwosisi 1,2, Sung-Hyuk Cha 1, Yoo Jung An, Charles C. Tappert 1, Evan Lipsitz 2

1 Computer Science Department, Pace University, New York, USA

2 Vascular Laboratory, Montefiore Medical Center, New York, USA

Statement of Problem

• The use of decision tree algorithms such as ID3 and C4.5 in medical diagnostic applications is promising, but the trees they produce often suffer from excessive complexity and can even be incomprehensible.

• Especially in predicting DVTs, which have high mortality, a simple and accurate decision model is preferred by potential patients, medical technologists, and physicians before patients are sent for expensive medical examinations.

Proposed approach

• Use the Genetic Algorithm (GA) to minimize the complexity (size) and/or maximize the accuracy of the decision tree.

• The new approach found shorter and/or more accurate decision trees than the ones produced by the conventional ID3 and C4.5 algorithms.

DVT / VTE: Magnitude of the Problem

• DVT: 2 million

• PE: 600,000

• Silent PE: 1 million

• Death: 60,000

• Post-thrombotic syndrome: 800,000

• Pulmonary hypertension: 30,000

• Estimated cost of VTE care: $1.5 billion/year

Goldhaber SZ, et al. Lancet 1999;353:1386-19.

Clinical Problem

Patients with deep vein thrombosis have a painful, swollen leg, which limits their mobility.

Montefiore Hospital Vascular Laboratory, 2008

DVT-Duplex Evaluation

Criteria for positive diagnosis:

- incompressibility of a venous segment

- visualization of thrombus

- absence of flow


Database Overview

Two datasets are extracted from two databases:

• Medical History

• Physical Exam

• Diagnostic Tests

• 515 records from the Laboratory

- 350 patients are positive for DVT

- 165 patients are negative for DVT

• 620 records from the general registry

- 420 patients are positive for DVT

- 200 patients are negative for DVT
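The class balance implied by these counts can be checked with a little arithmetic. The sketch below only restates the numbers above; the dictionary layout and the function name `positive_rate` are illustrative choices, not part of the original study. The share of positive records is useful context when reading the performance rates reported later.

```python
# Record counts as listed above; the structure and names are illustrative.
datasets = {
    "Laboratory": {"positive": 350, "negative": 165},
    "General registry": {"positive": 420, "negative": 200},
}


def positive_rate(counts):
    """Fraction of records labelled positive for DVT."""
    total = counts["positive"] + counts["negative"]
    return counts["positive"] / total


for name, counts in datasets.items():
    total = counts["positive"] + counts["negative"]
    print(f"{name}: {total} records, {positive_rate(counts):.1%} positive")
```

Both sources turn out to be roughly two-thirds positive.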

Table 1 - Database Attributes

No.  Name      Description
1    Sex       1 = male; 0 = female
2    Age       Age in years {1-99}
3    Diabetes  0 = normal; 1 = patient is receiving some treatment
4    Smoking   0 = never smoked; 1 = patient is an active smoker; 2 = patient stopped smoking
5    Surgery   0 = never had surgery; 1 = patient who had previous surgery
6    Pain      0 = no pain in the leg; 1 = patient experienced pain in the leg {Right, Left or Bilateral}
7    Swelling  0 = no swelling below the knee; 1 = swelling in the leg
     DVT       0 = examination result indicates negative for DVT; 1 = examination result indicates positive for DVT

Table 2 – Database Attributes

No.  Name                          Description
1    Sex                           1 = male; 0 = female
2    Age                           Age in years {1-99}
3    Diabetes                      0 = normal; 1 = patient is receiving some treatment
4    Smoking                       0 = never smoked; 1 = patient is an active smoker; 2 = patient stopped smoking
5    Surgery                       0 = never had surgery; 1 = patient who had previous surgery
6    Swelling                      0 = no swelling below the knee; 1 = swelling in the leg
7    Chest Pain                    0 = none; 1 = pain in chest
8    Cancer                        0 = normal; 1 = positive for cancer
9    Cellulitis                    0 = normal; 1 = positive for cellulitis
10   Injury                        0 = no injury; 1 = previous and current injuries
11   Pulmonary embolism            0 = never diagnosed; 1 = previously diagnosed
12   Congestive heart failure      0 = never diagnosed; 1 = previously diagnosed
13   Obesity                       0 = obesity not specified; 1 = obesity specified
14   Accident                      0 = never had a fall; 1 = previously had a fall
15   Hyperlipidemia                0 = normal; 1 = patient is diagnosed
16   Cardiac dysrhythmia           0 = normal; 1 = patient is diagnosed
17   Lymphoproliferative disease   0 = normal; 1 = patient is diagnosed
     DVT                           0 = examination result indicates negative for DVT; 1 = examination result indicates positive for DVT

Attribute groups: Medical History, Physical Exam, Diagnostic Tests

Sex Age Diabetes Smoking surgery pain swelling DVT

M 77 y no y n n yes

M 53 n no y n n yes

M 55 n yes n n y yes

F 73 n no y n y yes

F 84 y no y n n yes

F 68 n yes y n n yes

F 81 n no y n n yes

M 84 y yes n n n yes

F 84 y no y n n yes

M 84 n no y n n yes

F 73 n no y n y yes

F 56 n no n n y yes

M 63 n no n n n yes

F 76 y no y n n yes

F 70 y no y n n yes

M 75 n no y n n yes

F 92 n no n n n no

F 73 n no y y n no

F 61 n stopped n y n no

M 63 y stopped y n n no

M 78 n no y n n no

F 96 n no y n n no

F 71 n no y n n no

M 71 n no n n y no

DVT database (Table 1)

Table 2.1.1.1 - DVT sample data set II

AGE SEX Ob Sm Swell CHF Canc Surg ChestPain Lip Lymp CardDysr DB OthrPE Acc/Fall LegInj LegCell DVT

50 M 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 1

82 F 0 0 0 0 0 1 0 1 0 0 0 0 0 0 0 1

88 F 0 0 0 1 0 0 1 0 0 0 0 0 0 0 0 1

67 F 0 0 0 0 0 1 0 0 1 0 0 0 0 0 0 1

83 F 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1

79 M 0 0 0 0 0 1 0 0 0 1 0 0 1 0 0 1

54 M 0 0 0 0 0 1 0 1 0 0 0 0 0 0 0 1

69 M 0 0 0 0 0 0 1 0 0 0 0 1 0 0 0 1

68 M 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 1

62 M 0 0 0 0 0 1 0 0 0 0 0 0 1 0 0 1

26 F 0 0 0 0 0 0 0 0 0 0 1 0 0 0 1 1

64 F 0 0 0 1 0 1 0 0 0 0 0 0 0 1 0 1

80 F 0 0 0 0 1 1 0 0 0 0 0 0 0 0 0 1

82 F 0 0 0 0 1 1 0 0 0 0 0 0 0 0 1 1

78 M 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 1

33 F 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 1

26 M 0 0 0 0 0 0 1 0 0 1 0 1 0 0 0 1

54 M 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1

45 F 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 1

47 F 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 1

74 F 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 1

60 F 0 0 0 0 0 1 0 0 0 0 1 0 0 0 0 1

58 M 0 0 0 1 0 0 0 0 0 0 0 0 0 0 1 1

42 F 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1

63 M 0 0 1 0 0 1 0 0 0 0 0 0 0 0 0 1

45 F 0 0 0 1 0 0 0 0 0 0 1 0 0 0 1 1

30 F 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0

87 F 0 0 0 1 0 0 0 0 0 0 1 0 0 0 0 0

77 F 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0

97 F 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0

88 F 0 0 0 1 0 0 0 0 0 0 1 0 0 0 0 0

18 M 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0

85 F 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0

35 M 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0

68 F 0 0 0 0 1 0 1 1 0 0 1 0 0 0 0 0

48 F 0 0 0 0 0 1 0 1 0 0 0 0 0 0 0 0

85 M 0 0 0 1 0 0 0 0 0 0 1 0 0 0 0 0

68 M 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0

42 F 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0

DVT database (Table II)

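Datasets Relationship

[Figure: diagram relating the attributes of Dataset I and Dataset II. Attribute labels shown: GN, CP, PN, SM, SB, SS, CR, CL, IJ, PE, HF, OB, AC, LP, CD, LD, A60, DB, SR, SW.]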

Preprocessing (Binarization)

Heterogeneous type attributes (original table)

Sex  Smoking  …  pain  DVT
M    no          N     yes
F    no          L     yes
F    yes         Bi    yes
F    no          N     yes
M    yes         N     yes
F    stopped     R     no
M    no          N     no

Homogeneous binary type attributes (binary table)

Sex  Smoking  …  pain  DVT
1    0 0         0 0   1
0    0 0         1 0   1
0    1 1         1 1   1
0    0 0         0 0   1
1    1 1         0 0   1
0    1 0         0 1   0
1    0 0         0 0   0

Why Binary Attributes?

• Applying the GA to non-binary attributes is extremely difficult and is currently an open problem.

• To use the GA to build a binary decision tree, all attributes must be binary.
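One way to see why binary attributes help: with only 0/1 tests, a depth-limited tree can be stored as a fixed-length list of attribute indices, which a GA can cross over and mutate directly. The sketch below is a minimal illustration under assumptions that are not taken from the slides: the heap-order chromosome, majority-vote leaves, truncation selection, one-point crossover, the mutation rate, and the toy data are all hypothetical choices, and fitness is plain training accuracy under a fixed depth limit.

```python
import random
from collections import Counter


def leaf_index(chrom, row, depth):
    """Walk a full binary tree stored in heap order; return the leaf slot reached."""
    node = 0
    for _ in range(depth):
        # Attribute value 0 goes to the left child, 1 to the right child.
        node = 2 * node + 1 + row[chrom[node]]
    return node - len(chrom)  # leaves are numbered after the internal nodes


def fit_leaves(chrom, data, depth):
    """Label each leaf with the majority class of the rows that reach it."""
    votes = [Counter() for _ in range(2 ** depth)]
    for row, label in data:
        votes[leaf_index(chrom, row, depth)][label] += 1
    # Leaves that no row reaches default to class 0 (an arbitrary assumption).
    return [v.most_common(1)[0][0] if v else 0 for v in votes]


def accuracy(chrom, data, depth):
    """Fitness: fraction of rows classified correctly by the encoded tree."""
    leaves = fit_leaves(chrom, data, depth)
    hits = sum(leaves[leaf_index(chrom, row, depth)] == label for row, label in data)
    return hits / len(data)


def evolve(data, n_attrs, depth=2, pop_size=30, generations=40, seed=0):
    """Assumed GA loop: truncation selection, one-point crossover, point mutation."""
    rng = random.Random(seed)
    size = 2 ** depth - 1  # number of internal nodes in a full tree
    pop = [[rng.randrange(n_attrs) for _ in range(size)] for _ in range(pop_size)]
    for _ in range(generations):
        pop.sort(key=lambda c: accuracy(c, data, depth), reverse=True)
        parents = pop[: pop_size // 2]  # the better half survives unchanged
        children = []
        while len(parents) + len(children) < pop_size:
            a, b = rng.sample(parents, 2)
            cut = rng.randrange(1, size) if size > 1 else 0
            child = a[:cut] + b[cut:]
            if rng.random() < 0.3:  # assumed mutation rate
                child[rng.randrange(size)] = rng.randrange(n_attrs)
            children.append(child)
        pop = parents + children
    return max(pop, key=lambda c: accuracy(c, data, depth))


# Hypothetical toy data: the label is 1 only when attributes 0 and 2 are both 1.
toy = [((a, b, c), a & c) for a in (0, 1) for b in (0, 1) for c in (0, 1)]
best = evolve(toy, n_attrs=3)
print(best, accuracy(best, toy, depth=2))
```

A nominal or numeric attribute would need a different test at each node (several branches, or a threshold), so the chromosome could no longer be a simple list of attribute indices.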

Age Distributions (numeric)

Nominal type attributes (|v| > 2)

Leg Pain {L, R, Bi, N}

LP  RP  v
1   1   Bi
1   0   L
0   1   R
0   0   None

Smoking {N, Stopped, Yes}

SB  SS  v
1   1   Smoking
1   0   Stopped
0   0   None
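The two-bit encodings above can be applied mechanically to each record. The following is a small sketch, assuming the value spellings used in the sample tables ("yes", "stopped", "no" for smoking; "Bi", "L", "R", "N" for leg pain); the function name `binarize` and the dictionary names are hypothetical. The age attribute is left out because the slides do not state its cut-off.

```python
# Two-bit codes copied from the encoding tables above.
LEG_PAIN_BITS = {"Bi": (1, 1), "L": (1, 0), "R": (0, 1), "N": (0, 0)}  # (LP, RP)
SMOKING_BITS = {"yes": (1, 1), "stopped": (1, 0), "no": (0, 0)}        # (SB, SS)


def binarize(sex, smoking, pain, dvt):
    """Map one heterogeneous record to a flat tuple of 0/1 values."""
    gender = 1 if sex == "M" else 0
    sb, ss = SMOKING_BITS[smoking]
    lp, rp = LEG_PAIN_BITS[pain]
    label = 1 if dvt == "yes" else 0
    return (gender, sb, ss, lp, rp, label)


rows = [
    ("M", "no", "N", "yes"),
    ("F", "no", "L", "yes"),
    ("F", "yes", "Bi", "yes"),
    ("F", "stopped", "R", "no"),
]
for row in rows:
    print(row, "->", binarize(*row))
```

The four rows are taken from the preprocessing example, so the output can be compared against the binary table shown there.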

A60 GN DB SM SR PN SW DVT

1 1 0 0 0 0 0 1

0 0 0 0 0 0 1 1

1 0 0 0 1 0 0 1

1 1 0 0 1 0 0 1

0 1 0 0 1 0 0 1

0 1 0 0 1 0 1 0

0 0 0 0 0 0 0 0

1 0 0 0 0 0 0 0

0 0 0 0 0 0 0 0

1 1 1 0 0 0 0 0

Dataset I Binarized Table

A60 GN DB OB SM SR SW HF CR CP HL LD CD PE AC IJ CL DVT

0 1 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 1

1 0 0 0 0 0 0 1 0 1 0 0 0 0 0 0 0 1

1 0 0 0 0 1 0 0 0 0 0 1 0 0 0 0 0 1

1 1 0 0 0 1 0 0 0 0 0 0 1 0 1 0 0 1

0 1 0 0 0 1 0 0 0 0 1 0 0 0 0 0 0 1

1 1 0 0 0 1 0 1 0 0 0 0 0 0 0 0 0 0

0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0

0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0

0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0

1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0

Dataset II Binarized Table

Decision Tree

Decision trees represent acquired knowledge in tree form, which is intuitive and generally easy for humans to assimilate. In general, DT classifiers have accuracy comparable to that of more complex classifiers but are simple to understand and visualize.

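[Figure: example decision tree with internal nodes SR, HF, PE, CR, and SW; branches are labeled 0 and 1; leaves are labeled pos or neg with counts such as (17/25), (12/13), (11/12), and (10/10).]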

Decision Tree Representation

• Decision trees classify instances by sorting them down the tree from the root to a leaf node, which provides the classification of the instance.

• Each internal node in the tree specifies a test of some attribute of the instance.

• Each branch descending from an internal node corresponds to one of the possible values of that attribute.

• Each leaf node assigns a classification.
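The representation described above can be sketched in a few lines. This is an illustrative data structure, not the one used in the study; the class names `Node` and `Leaf`, the function `classify`, and the toy tree are hypothetical, and the toy tree has no clinical meaning.

```python
from dataclasses import dataclass
from typing import Union


@dataclass
class Leaf:
    label: str  # classification assigned at this leaf


@dataclass
class Node:
    attribute: str                  # binary attribute tested at this node
    if_zero: Union["Node", Leaf]    # branch followed when the value is 0
    if_one: Union["Node", Leaf]     # branch followed when the value is 1


def classify(tree, instance):
    """Sort an instance down from the root until a leaf is reached."""
    node = tree
    while isinstance(node, Node):
        node = node.if_one if instance[node.attribute] == 1 else node.if_zero
    return node.label


# Hypothetical toy tree over two binarized attributes.
toy_tree = Node("SR", if_zero=Leaf("neg"),
                if_one=Node("SW", if_zero=Leaf("neg"), if_one=Leaf("pos")))

print(classify(toy_tree, {"SR": 1, "SW": 1}))
print(classify(toy_tree, {"SR": 0, "SW": 1}))
```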

Decision Trees from Dataset I

[Figure: three decision trees built from Dataset I: (a) 59.5% by C4.5; (b) 61.5% by GA; (c) 64.5% by GA. Internal nodes are binarized attributes (SR, PN, SB, SS, SW, DB, GN, A6); leaves are labeled p or n.]

Decision Trees from Dataset II – Figure 5

[Figure 5: four decision trees built from Dataset II: (a) C4.5 (72.25%), depth = 12; (b) 69.75% by GA, depth = 5; (c) 73.75% by GA; (d) 75.25% by GA, depth = 7. Internal nodes are binarized attributes (for example SR, PE, SW, CL, HF, CR, SM, AC, GN, DB, LD, HL, CP, CD, OB, IJ, PN, a6); leaves are labeled p or n.]

The Best Measure of Efficiency (shortness) for a DT

• Average number of questions required to obtain a prediction.

Other measures:

• the depth of the tree

• the number of nodes in the tree
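The three measures can be computed side by side on a small tree. The sketch below assumes the average is taken over the instances being classified (the slides do not say whether it is averaged over instances or over leaves). The nested-tuple tree format, the function names, and the toy data are hypothetical.

```python
def questions_asked(tree, instance):
    """Number of attribute tests on the path from the root to a leaf."""
    count = 0
    while isinstance(tree, tuple):
        attribute, if_zero, if_one = tree
        tree = if_one if instance[attribute] else if_zero
        count += 1
    return count


def average_questions(tree, instances):
    """Mean number of tests needed per instance (assumed definition)."""
    return sum(questions_asked(tree, x) for x in instances) / len(instances)


def depth(tree):
    """Length of the longest root-to-leaf path."""
    if not isinstance(tree, tuple):
        return 0
    return 1 + max(depth(tree[1]), depth(tree[2]))


def node_count(tree):
    """Internal nodes plus leaves."""
    if not isinstance(tree, tuple):
        return 1
    return 1 + node_count(tree[1]) + node_count(tree[2])


# Hypothetical tree: (attribute, subtree if 0, subtree if 1); strings are leaves.
toy_tree = ("SR", "neg", ("SW", "neg", "pos"))
patients = [{"SR": 0, "SW": 0}, {"SR": 0, "SW": 1},
            {"SR": 1, "SW": 0}, {"SR": 1, "SW": 1}]

print(average_questions(toy_tree, patients), depth(toy_tree), node_count(toy_tree))
```

Unlike depth or node count, the average depends on how the instances are distributed over the leaves, so a deep tree can still be cheap to use if most instances exit early.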

Complexity of Decision Trees

Algorithm  Depth limit  Performance rate  Average # of questions
GA         5            69.75             2.9525
GA         6            73.75             3.3725
GA         7            75.25             3.8955
GA         8            76.50             4.3275
GA         9            76.75             4.8225
GA         10           78.00             5.1225
GA         11           78.50             5.4675
GA         12           79.50             5.8675
GA         13           80.25             6.3075
C4.5       12           72.25             7.485
ID3        16           80.0              -

From both a depth and an average-number-of-questions perspective, the decision tree in Figure 5 (d) can be considered much more efficient (simpler) than the decision tree from the C4.5 algorithm (Figure 5 (a)).
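The size of the gap can be made concrete with the figures reported for the two trees: depth limit 7 for the GA tree in Figure 5 (d), and depth 12 for the C4.5 tree in Figure 5 (a). The snippet below is only arithmetic on those reported values; the variable names are illustrative.

```python
# Values as reported for Figure 5 (d) (GA) and Figure 5 (a) (C4.5).
ga_tree = {"depth": 7, "rate": 75.25, "avg_questions": 3.8955}
c45_tree = {"depth": 12, "rate": 72.25, "avg_questions": 7.485}

rate_gain = ga_tree["rate"] - c45_tree["rate"]
question_ratio = ga_tree["avg_questions"] / c45_tree["avg_questions"]
depth_ratio = ga_tree["depth"] / c45_tree["depth"]

print(f"performance rate gain: {rate_gain:.2f} points")
print(f"questions per prediction: {question_ratio:.0%} of C4.5")
print(f"depth: {depth_ratio:.0%} of C4.5")
```

On these numbers the GA tree asks roughly half as many questions per prediction while reporting a performance rate three points higher.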

[Figure: the trees of Figure 5 (d) (75.25% by GA, depth = 7) and Figure 5 (a) (C4.5, 72.25%, depth = 12) shown side by side, each with a depth scale.]

Optimal DT

[Figure: decision tree with internal nodes SR, HF, PE, A6, CR, SW, DB, and LP; leaves are labeled pos or neg with counts (17/25), (12/13), (30/43), (20/22), (6/8), (13/16), (11/12), (10/10), (56/79), and (43/52).]

This might be the optimal decision tree for the data. It indicates that combining human knowledge with machine processing speed can often produce a better result than either the human or the machine could produce separately.

Conclusion

• Experimental results on two datasets suggest that the GA can find more accurate and more efficient decision trees.

• The decision trees produced by the GA have significant clinical relevance.

• The results shown here improve the ability to predict whether a patient will develop or has had DVT, which advances the diagnosis of DVT.

Future Work

The decision trees found by the GA tend to be almost full binary trees, i.e., the width is large while the depth is short.

For future work, the C4.5 pruning mechanism could be applied to decision trees produced by the GA to make the trees sparser and to further avoid the potential over-fitting problem.
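As a rough illustration of how a nearly full tree can be made sparser, the sketch below collapses any test whose two branches end in the same leaf label. This is a much simpler stand-in and not C4.5's error-based pruning: it never changes a prediction, it only removes redundant questions. The nested-tuple tree format and the function name `collapse_redundant` are hypothetical.

```python
def collapse_redundant(tree):
    """Bottom-up: replace a test whose two branches are the same leaf by that leaf."""
    if not isinstance(tree, tuple):
        return tree  # a leaf label such as "p" or "n"
    attribute, if_zero, if_one = tree
    if_zero = collapse_redundant(if_zero)
    if_one = collapse_redundant(if_one)
    if not isinstance(if_zero, tuple) and if_zero == if_one:
        return if_zero  # the test cannot change the outcome
    return (attribute, if_zero, if_one)


# Hypothetical full tree of depth 2: (attribute, subtree if 0, subtree if 1).
full_tree = ("SR", ("SW", "n", "n"), ("DB", "p", "n"))
print(collapse_redundant(full_tree))
```

C4.5's own pruning goes further, because it can also replace a subtree whose branches disagree when the estimated error does not get worse.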
