Constructing Binary Decision Tree for Predicting Deep Venous Thrombosis (DVT)
Christopher Nwosisi1,2, Sung-Hyuk Cha1, Yoo Jung An, Charles C. Tappert1, Evan Lipsitz2
1Computer Science Department, Pace University, New York, USA
2Vascular Laboratory, Montefiore Medical Center, New York, USA
Statement of Problem
• The use of decision tree algorithms such as ID3 and C4.5 in medical diagnostic applications today is promising, but the resulting trees often suffer from excessive complexity and can even be incomprehensible (the greedy split criterion these algorithms use is sketched after this list).
• Especially in predicting DVT, which carries high mortality, a simple and accurate decision model is preferred for potential patients, medical technologists, and physicians before sending patients for expensive medical examinations.
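For context, ID3 and C4.5 grow a tree greedily, choosing at each node the attribute whose split most reduces class entropy (C4.5 uses the related gain ratio and adds pruning). A minimal sketch of that criterion for 0/1 class labels, written for illustration and not taken from the paper:

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Shannon entropy of a list of class labels (e.g. 0/1 for DVT)."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def information_gain(rows, labels, attr_index):
    """Entropy reduction obtained by splitting the data on one attribute."""
    n = len(labels)
    partitions = {}
    for row, y in zip(rows, labels):
        partitions.setdefault(row[attr_index], []).append(y)
    remainder = sum(len(part) / n * entropy(part) for part in partitions.values())
    return entropy(labels) - remainder

# ID3-style greedy choice: split on the attribute with the largest gain, e.g.
# best = max(range(len(rows[0])), key=lambda i: information_gain(rows, labels, i))
```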
Proposed approach
• Use the Genetic Algorithm (GA) to minimize the complexity (size) and/or maximize the accuracy of the decision tree (a possible fitness and selection step is sketched after this list).
• The new approach found shorter and/or more accurate decision trees than the ones produced by the conventional ID3 and C4.5 algorithms.
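The slides do not spell out the GA itself; one plausible shape for its fitness and selection steps is sketched below. The tree interface (`predict`, `node_count`), the penalty weight `alpha`, and tournament selection are illustrative assumptions, not the authors' exact formulation.

```python
import random

def fitness(tree, rows, labels, alpha=0.01):
    """Illustrative GA fitness: reward accuracy, penalize tree size.

    `tree` is assumed to expose predict(row) and node_count(); `alpha` is an
    assumed, tunable weight on the complexity penalty, not the paper's value.
    """
    correct = sum(1 for row, y in zip(rows, labels) if tree.predict(row) == y)
    return correct / len(labels) - alpha * tree.node_count()

def tournament_select(population, scores, k=3):
    """Standard GA selection: return the fittest of k random candidate trees."""
    contenders = random.sample(range(len(population)), k)
    return population[max(contenders, key=lambda i: scores[i])]
```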
DVT / VTE
Magnitude of the Problem
• DVT: 2 million
• PE: 600,000
• Silent PE: 1 million
• Death: 60,000
• Post-thrombotic syndrome: 800,000
• Pulmonary hypertension: 30,000
• Estimated cost of VTE care: $1.5 billion/year
Goldhaber SZ, et al. Lancet 1999;353:1386-89.
Clinical Problem
Patients with deep vein thrombosis have a painful, swollen leg, which limits their mobility.
Montefiore Hospital Vascular Laboratory, 2008
DVT-Duplex Evaluation
Criteria for positive diagnosis:
- incompressibility of a venous segment
- visualization of thrombus
- absence of flow
Montefiore Hospital Vascular Laboratory
Database Overview
Two datasets are extracted from two databases:
• Medical History
• Physical Exam
• Diagnostic Tests
• 515 records from the Laboratory
- 350 patients are positive for DVT
- 165 patients are negative for DVT
• 620 records from the general registry
- 420 patients are positive for DVT
- 200 patients are negative for DVT
Table 1 - Database Attributes (Dataset I)
1. Sex: 1 = male; 0 = female
2. Age: age in years {1-99}
3. Diabetes: 0 = normal; 1 = patient is receiving some treatment
4. Smoking: 0 = never smoked; 1 = patient is an active smoker; 2 = patient stopped smoking
5. Surgery: 0 = never had surgery; 1 = patient had previous surgery
6. Pain: 0 = no pain in the leg; 1 = patient experienced pain in the leg {right, left, or bilateral}
7. Swelling: 0 = no swelling below the knee; 1 = swelling in the leg
DVT: 0 = examination result indicates negative for DVT; 1 = examination result indicates positive for DVT
Table 2 - Database Attributes (Dataset II, Medical History)
1. Sex: 1 = male; 0 = female
2. Age: age in years {1-99}
3. Diabetes: 0 = normal; 1 = patient is receiving some treatment
4. Smoking: 0 = never smoked; 1 = patient is an active smoker; 2 = patient stopped smoking
5. Surgery: 0 = never had surgery; 1 = patient had previous surgery
6. Swelling: 0 = no swelling below the knee; 1 = swelling in the leg
7. Chest Pain: 0 = none; 1 = pain in the chest
8. Cancer: 0 = normal; 1 = positive for cancer
9. Cellulitis: 0 = normal; 1 = positive for cellulitis
10. Injury: 0 = no injury; 1 = previous and current injuries
11. Pulmonary embolism: 0 = never diagnosed; 1 = previously diagnosed
12. Congestive heart failure: 0 = never diagnosed; 1 = previously diagnosed
13. Obesity: 0 = obesity not specified; 1 = obesity specified
14. Accident: 0 = never had a fall; 1 = previously had a fall
15. Hyperlipidemia: 0 = normal; 1 = patient is diagnosed
16. Cardiac dysrhythmia: 0 = normal; 1 = patient is diagnosed
17. Lymphoproliferative disease: 0 = normal; 1 = patient is diagnosed
DVT: 0 = examination result indicates negative for DVT; 1 = examination result indicates positive for DVT
Sex Age Diabetes Smoking Surgery Pain Swelling DVT
M 77 y no y n n yes
M 53 n no y n n yes
M 55 n yes n n y yes
F 73 n no y n y yes
F 84 y no y n n yes
F 68 n yes y n n yes
F 81 n no y n n yes
M 84 y yes n n n yes
F 84 y no y n n yes
M 84 n no y n n yes
F 73 n no y n y yes
F 56 n no n n y yes
M 63 n no n n n yes
F 76 y no y n n yes
F 70 y no y n n yes
M 75 n no y n n yes
F 92 n no n n n no
F 73 n no y y n no
F 61 n stopped n y n no
M 63 y stopped y n n no
M 78 n no y n n no
F 96 n no y n n no
F 71 n no y n n no
M 71 n no n n y no
Table 2.1.1.1 - DVT sample data set I, DVT database (Table 1)
Age Sex Ob Sm Swell CHF Canc Surg ChestPain Lip Lymp CardDysr DB Othr/PE Acc/Fall LegInj LegCell DVT
50 M 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 1
82 F 0 0 0 0 0 1 0 1 0 0 0 0 0 0 0 1
88 F 0 0 0 1 0 0 1 0 0 0 0 0 0 0 0 1
67 F 0 0 0 0 0 1 0 0 1 0 0 0 0 0 0 1
83 F 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1
79 M 0 0 0 0 0 1 0 0 0 1 0 0 1 0 0 1
54 M 0 0 0 0 0 1 0 1 0 0 0 0 0 0 0 1
69 M 0 0 0 0 0 0 1 0 0 0 0 1 0 0 0 1
68 M 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 1
62 M 0 0 0 0 0 1 0 0 0 0 0 0 1 0 0 1
26 F 0 0 0 0 0 0 0 0 0 0 1 0 0 0 1 1
64 F 0 0 0 1 0 1 0 0 0 0 0 0 0 1 0 1
80 F 0 0 0 0 1 1 0 0 0 0 0 0 0 0 0 1
82 F 0 0 0 0 1 1 0 0 0 0 0 0 0 0 1 1
78 M 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 1
33 F 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 1
26 M 0 0 0 0 0 0 1 0 0 1 0 1 0 0 0 1
54 M 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1
45 F 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 1
47 F 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 1
74 F 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 1
60 F 0 0 0 0 0 1 0 0 0 0 1 0 0 0 0 1
58 M 0 0 0 1 0 0 0 0 0 0 0 0 0 0 1 1
42 F 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1
63 M 0 0 1 0 0 1 0 0 0 0 0 0 0 0 0 1
45 F 0 0 0 1 0 0 0 0 0 0 1 0 0 0 1 1
30 F 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0
87 F 0 0 0 1 0 0 0 0 0 0 1 0 0 0 0 0
77 F 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0
97 F 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0
88 F 0 0 0 1 0 0 0 0 0 0 1 0 0 0 0 0
18 M 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0
85 F 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
35 M 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0
68 F 0 0 0 0 1 0 1 1 0 0 1 0 0 0 0 0
48 F 0 0 0 0 0 1 0 1 0 0 0 0 0 0 0 0
85 M 0 0 0 1 0 0 0 0 0 0 1 0 0 0 0 0
68 M 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0
42 F 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0
DVT database (Table II)
Datasets Relationship
[Diagram: overlap of attributes between Dataset I and Dataset II; attribute abbreviations shown include GN, CP, PN, SM, SB, SS, CR, CL, IJ, PE, HF, OB, AC, LP, CD, LD, A60, DB, SR, SW.]
Preprocessing (Binarization)

Original table (heterogeneous attribute types):
Sex  Smoking  …  Pain  DVT
M    no       …  N     yes
F    no       …  L     yes
F    yes      …  Bi    yes
F    no       …  N     yes
M    yes      …  N     yes
F    stopped  …  R     no
M    no       …  N     no

Binary table (homogeneous binary attributes):
Sex  Smoking(SB,SS)  …  Pain(LP,RP)  DVT
1    0 0             …  0 0          1
0    0 0             …  1 0          1
0    1 1             …  1 1          1
0    0 0             …  0 0          1
1    1 1             …  0 0          1
0    1 0             …  0 1          0
1    0 0             …  0 0          0
Why Binary Attribute?
• Applying the GA to non-binary attributes is extremely difficult and currently an open problem.
• To use the GA to build a binary decision tree, the attribute types must be binary.
Age Distributions (numeric)
Nominal type attributes (|v| > 2) are encoded with two bits:

Leg Pain {L, R, Bi, N} → (LP, RP):
LP RP
1  1   Bi
1  0   L
0  1   R
0  0   None

Smoking {N, Stopped, Yes} → (SB, SS):
SB SS
1  1   Smoking (yes)
1  0   Stopped
0  0   None
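A minimal sketch of this binarization step in Python; the two-bit codes follow the mapping above, while the exact dictionary keys and the age-60 cut-off behind the A60 bit are assumptions for illustration, not stated on the slides.

```python
# Two-bit codes for nominal attributes with more than two values, per the
# mapping above; the age cut-off of 60 for A60 is an assumption inferred from
# the attribute name rather than given explicitly.
SMOKING_BITS = {"none": (0, 0), "stopped": (1, 0), "yes": (1, 1)}       # (SB, SS)
LEG_PAIN_BITS = {"N": (0, 0), "R": (0, 1), "L": (1, 0), "Bi": (1, 1)}   # (LP, RP)

def binarize_record(sex, age, smoking, leg_pain, dvt):
    """Turn one heterogeneous record into a homogeneous binary row."""
    gn = 1 if sex == "M" else 0            # GN: gender bit
    a60 = 1 if age >= 60 else 0            # A60: assumed age-60 threshold
    sb, ss = SMOKING_BITS[smoking]
    lp, rp = LEG_PAIN_BITS[leg_pain]
    return [a60, gn, sb, ss, lp, rp, 1 if dvt == "yes" else 0]

# Example: 77-year-old male, never smoked, no leg pain, positive for DVT.
print(binarize_record("M", 77, "none", "N", "yes"))   # -> [1, 1, 0, 0, 0, 0, 1]
```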
Dataset I Binarized Table (sample)
A60 GN DB SM SR PN SW DVT
1   1  0  0  0  0  0  1
0   0  0  0  0  0  1  1
1   0  0  0  1  0  0  1
1   1  0  0  1  0  0  1
0   1  0  0  1  0  0  1
0   1  0  0  1  0  1  0
0   0  0  0  0  0  0  0
1   0  0  0  0  0  0  0
0   0  0  0  0  0  0  0
1   1  1  0  0  0  0  0
A60 GN DB OB SM SR SW HF CR CP HL LD CD PE AC IJ CL DVT
0 1 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 1
1 0 0 0 0 0 0 1 0 1 0 0 0 0 0 0 0 1
1 0 0 0 0 1 0 0 0 0 0 1 0 0 0 0 0 1
1 1 0 0 0 1 0 0 0 0 0 0 1 0 1 0 0 1
0 1 0 0 0 1 0 0 0 0 1 0 0 0 0 0 0 1
1 1 0 0 0 1 0 1 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
Dataset II Binarized Table
Decision Tree
Their representation of acquired knowledge in tree form is intuitive and generally easy for humans to assimilate. In general, decision tree (DT) classifiers have accuracy comparable to other, more complex classifiers, but are simpler to understand and visualize.
[Example decision tree: root SR with internal nodes HF, PE, CR, SW; branches labeled 1 and 0; leaves labeled pos/neg with coverage counts such as (17/25), (12/13), (11/12), (10/10).]
• Decision trees classify instances by sorting them down the tree from the root to a leaf node, which provides the classification of the instance (see the traversal sketch below).
• Each internal node in the tree specifies a test of some attribute of the instance.
• Each branch descending from that node corresponds to one of the possible values of this attribute.
• Each leaf node assigns a classification.
Decision Tree Representation
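To make this procedure concrete, here is a minimal sketch of a binary decision tree node and the root-to-leaf traversal described above; the node layout and attribute indices are illustrative assumptions, not the authors' data structure.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Node:
    """A binary decision tree node: either a test on one attribute or a leaf."""
    attribute: Optional[int] = None   # index of the binary attribute tested here
    left: Optional["Node"] = None     # branch followed when the attribute is 0
    right: Optional["Node"] = None    # branch followed when the attribute is 1
    label: Optional[str] = None       # "pos"/"neg" classification at a leaf

def classify(node, instance):
    """Sort an instance down from the root to a leaf and return its class."""
    while node.label is None:                      # internal node: test an attribute
        node = node.right if instance[node.attribute] == 1 else node.left
    return node.label                              # leaf node: assign classification

# Tiny illustrative tree: test attribute 0, then attribute 3 on the 1-branch.
tree = Node(attribute=0,
            left=Node(label="neg"),
            right=Node(attribute=3, left=Node(label="pos"), right=Node(label="neg")))
print(classify(tree, [1, 0, 0, 0]))   # -> "pos"
```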
Decision Trees from Dataset I
[Three decision trees built from Dataset I: (a) 59.5% accuracy by C4.5; (b) 61.5% by GA; (c) 64.5% by GA. Node labels include SR, PN, SB, SS, DB, A6, GN, SW.]
Decision Trees from Dataset II - Figure 5
[Four decision trees built from Dataset II: (a) C4.5, 72.25% accuracy, depth = 12; (b) GA, 69.75%, depth = 5; (c) GA, 73.75%; (d) GA, 75.25%, depth = 7. Node labels include SR, HF, PE, CR, SW, SM, CL, DB, GN, A6, AC, OB, CD, HL, LD, IJ, CP, PN.]
The Best Measure of Efficiency (Shortness) for a DT
• Average number of questions required to obtain a prediction (see the sketch below).
Other measures:
• the depth of the tree
• the number of nodes in the tree

Complexity of Decision Trees
Algorithm  Depth limit  Performance rate (%)  Average # of questions
GA         5            69.75                 2.9525
GA         6            73.75                 3.3725
GA         7            75.25                 3.8955
GA         8            76.50                 4.3275
GA         9            76.75                 4.8225
GA         10           78.00                 5.1225
GA         11           78.50                 5.4675
GA         12           79.50                 5.8675
GA         13           80.25                 6.3075
C4.5       12           72.25                 7.485
ID3        16           80.0                  -
From both the depth and the average-number-of-questions perspectives, the decision tree in Figure 5(d) can be considered much more efficient (simpler) than the decision tree produced by the C4.5 algorithm (Figure 5(a)).
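A minimal sketch of this measure, reusing the `Node`/`classify` layout from the earlier sketch: the average number of questions is taken here as the mean number of attribute tests answered by the instances on their way to a leaf (an interpretation of the slide, not the authors' code).

```python
def questions_asked(node, instance):
    """Number of attribute tests an instance answers before reaching a leaf."""
    count = 0
    while node.label is None:
        count += 1
        node = node.right if instance[node.attribute] == 1 else node.left
    return count

def average_questions(tree, rows):
    """Average number of questions required to obtain a prediction."""
    return sum(questions_asked(tree, row) for row in rows) / len(rows)

# The table above reports, for example, 3.8955 questions on average for the
# depth-7 GA tree versus 7.485 for the C4.5 tree.
```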
[The decision trees of Figure 5(d) (GA, 75.25%, depth = 7) and Figure 5(a) (C4.5, 72.25%, depth = 12) are shown again.]
Optimal DT
[Proposed optimal decision tree: root SR; internal nodes include HF, PE, A6, CR, SW, DB, LP; branches lead to pos/neg leaves with coverage counts (56/79), (43/52), (30/43), (20/22), (17/25), (13/16), (12/13), (11/12), (10/10), (6/8).]
This might be the optimal decision tree for these data, and it indicates that combining human knowledge with machine processing speed can often produce a result superior to what either the human or the machine could produce separately.
Conclusion
• Experimental results on two datasets suggest that more accurate and more efficient decision trees can be found by the GA.
• The decision trees produced by the GA have significant clinical relevance.
• The results shown here improve the ability to predict whether a patient has had, or is likely to develop, DVT, which represents an advance in the diagnosis of DVT.
Future Works
The decision trees found by the GA tend to be almost full binary trees, i.e., the width is large while the depth is short.
For future work, the C4.5 pruning mechanism could be applied to the decision trees produced by the GA to make the trees sparser and to further avoid the potential over-fitting problem.
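As an illustration of the kind of post-pruning this future work points to, the sketch below performs a generic reduced-error pruning pass on a held-out set, reusing the `Node`/`classify` sketch from earlier; it is not C4.5's exact error-based pruning mechanism.

```python
def majority_label(labels):
    """Most common class label among the held-out instances reaching a subtree."""
    return max(set(labels), key=labels.count)

def prune(node, rows, labels):
    """Bottom-up reduced-error pruning: replace a subtree with a leaf whenever
    the leaf classifies the held-out rows at least as well as the subtree."""
    if node.label is not None or not rows:            # leaf, or no data reaches here
        return node
    left = [(r, y) for r, y in zip(rows, labels) if r[node.attribute] == 0]
    right = [(r, y) for r, y in zip(rows, labels) if r[node.attribute] == 1]
    node.left = prune(node.left, [r for r, _ in left], [y for _, y in left])
    node.right = prune(node.right, [r for r, _ in right], [y for _, y in right])
    subtree_errors = sum(1 for r, y in zip(rows, labels) if classify(node, r) != y)
    leaf = Node(label=majority_label(labels))
    leaf_errors = sum(1 for y in labels if y != leaf.label)
    return leaf if leaf_errors <= subtree_errors else node
```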