
MATH2319 Machine Learning Project Phase 2 Heart Disease Classification

Name: Laura Pritchard

Student ID: s3732959

June 9, 2019


Table of Contents

1 Executive Summary
2 Introduction
  2.1 Objective
  2.2 Dataset
    2.2.1 Target Feature
    2.2.2 Descriptive Features
3 Methodology
  3.1 k-Nearest Neighbor
  3.2 Decision Tree
  3.3 Support Vector Machine
  3.4 Test Data
  3.5 Feature Selection
4 Parameter tuning
  4.1 k-Nearest Neighbor
    4.1.1 Range of Values
    4.1.2 10-Fold Cross Validation
    4.1.3 Final Model
  4.2 Decision Tree
    4.2.1 Default Decision Tree
    4.2.2 Decision Tree – Constrained Parameters
    4.2.3 Final Model
  4.3 Support Vector Machine
    4.3.1 Range of Values
    4.3.2 10-Fold Cross Validation
    4.3.3 Final Model
5 Performance comparison
6 Critique
7 Summary
8 Conclusions
9 References
10 Appendix A – Imported packages


1 Executive Summary

Heart disease is the leading cause of death in Australia. For this reason, the ability to detect heart disease early, using a classification model based on a set of patient attributes, is critically important to the healthcare industry.

For Phase 2 of this project, we attempted to develop such a classification model using the heart disease dataset available from the UCI Repository, which we pre-processed and explored in Phase 1 of this project. The dataset covers 303 patients, of whom 164 had no heart disease and 139 had heart disease. While the original study measured 76 attributes, previous research found that only 13 were relevant to the diagnosis of heart disease, so these 13 attributes are the ones used in the modelling phase of this project.

The following three classifier models were developed, tuned and assessed:

1) k-Nearest Neighbor

2) Decision Tree

3) Support Vector Machine.

The k-Nearest Neighbor and Support Vector Machine models are each characterised by a single parameter, which was tuned using 10-fold cross validation. The Decision Tree model has four parameters, which were tuned using a grid search combined with 10-fold cross validation.

For the data modelling we chose to use 20% of the heart disease dataset to create our test data,

with the remaining 80% used for our training data. We tuned the parameters on the training dataset

and measured the performance using a combination of the confusion matrix, classification error

rates, precision and accuracy scores.

The accuracy scores and classification error rates of our final tuned models can be seen below. As the table shows, each of the three models performed reasonably well on the training and test data; however, the recommended model is the Decision Tree, as it had the highest accuracy score (82%) and the lowest classification error rate (18%).

Model                     Accuracy Score    Classification Error Rate
k-Nearest Neighbor        80.3%             19.7%
Decision Tree             82.0%             18.0%
Support Vector Machine    78.7%             21.3%


2 Introduction

According to the Australian Bureau of Statistics, the leading cause of death in Australia in 2017 was ischaemic heart disease, a condition that affects the supply of blood to the heart and includes acute myocardial infarction, angina and chronic ischaemic disease (ABS 2018).

Given the value of early detection of heart disease and the cost of health care to the economy, the ability to diagnose heart disease is crucial. As such, this study considers whether it is possible to develop a classification model to assist in the diagnosis of heart disease. To this end, we consider three classifiers: k-Nearest Neighbor, Decision Tree, and Support Vector Machine.

2.1 Objective

To reiterate, the goal of this project is to predict whether an individual has heart disease. The project has two phases. Phase 1 focused on data pre-processing and exploration. Phase 2, covered in this report, presents the model building and evaluation. This report was compiled from a Jupyter Notebook and contains both the narrative and the Python code used for model building and evaluation.

2.2 Dataset

As covered in detail in Phase 1 of this project, the data was obtained from a heart disease study conducted by the Cleveland Clinic Foundation and is available at the UCI Repository. The heart disease dataset contains 13 descriptive attributes in addition to the target attribute, and 303 observations (each representing a patient), of which 164 patients did not have heart disease and the remaining 139 did.

2.2.1 Target Feature

The target feature has two classes (0 = no heart disease, 1 = heart disease), so this is a binary classification problem.

2.2.2 Descriptive Features

For ease of reference, the descriptions of each attribute in the heart disease data are given in the table below:

Variable Name   Variable Type   Variable Description
age             Continuous      Age of the patient (in years)
sex             Categorical     Sex of the patient (1 = male; 0 = female)
cp              Categorical     Chest pain type
restbp          Continuous      Resting blood pressure
chol            Continuous      Serum cholesterol in mg/dl
fbs             Categorical     Fasting blood sugar > 120 mg/dl (1 = true; 0 = false)
restecg         Categorical     Resting electrocardiographic results
thalach         Continuous      Maximum heart rate achieved
exang           Categorical     Exercise induced angina (1 = yes; 0 = no)
oldpeak         Continuous      ST depression induced by exercise relative to rest
slope           Categorical     Slope of the peak exercise ST segment
ca              Categorical     Number of major vessels (0-3) coloured by fluoroscopy
thal            Categorical     3 = normal; 6 = fixed defect; 7 = reversable defect
target          Categorical     0 = no heart disease; 1 = heart disease


3 Methodology

Three classification models were developed and tested using the heart disease dataset: k-Nearest Neighbor, Decision Tree and Support Vector Machine.

3.1 k-Nearest Neighbor

The first classifier selected was the k-Nearest Neighbor (kNN), a nonparametric supervised learning model used for classification and regression. In this model an observation is classified by a majority vote among the k closest observations in the feature space. kNN therefore has a single parameter to select: the number of neighbours, k. We used two different approaches on the training dataset to find the optimal value of k: testing a range of pre-selected values, and 10-fold cross validation. The assessment criterion in both approaches was the highest accuracy score.
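To make the prediction rule concrete, the following is a minimal illustrative sketch (not the sklearn implementation used in section 4.1), assuming X_train and y_train are numpy arrays:

import numpy as np

def knn_predict(X_train, y_train, x_new, k=5):
    # Euclidean distances from the new observation to every training point
    dists = np.sqrt(((X_train - x_new) ** 2).sum(axis=1))
    # indices of the k closest training observations
    nearest = np.argsort(dists)[:k]
    # majority vote among the k neighbours (labels are 0/1)
    return int(y_train[nearest].sum() > k / 2)

With an odd k the vote cannot tie, which is why only odd values of k are tested in section 4.1.1.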

3.2 Decision Tree

The Decision Tree classifier recursively divides the dataset into smaller subsets until a chosen impurity measure (such as the Gini index or cross entropy) satisfies predefined thresholds. The initial dataset is referred to as the Root Node. The algorithm splits the Root Node into smaller subsets referred to as Decision Nodes, and Decision Nodes that are not split further are referred to as Terminal Nodes. These Terminal Nodes provide the labels of 0 (does not have heart disease) and 1 (has heart disease).

The Decision Tree has four parameters which needed to be calibrated:

• Minimum number of samples required to split a node (min_samples_split)

• Maximum depth of the tree (max_depth)

• Minimum number of samples required in a Terminal Node (min_samples_leaf)

• Maximum number of features to consider when searching for the best split (max_features).

The approach used to select these four parameters was to first grow the tree with no constraints (the default tree of section 4.2.1). Inspection of this tree suggested plausible ranges of values for the four parameters, which were then fed into a grid search with 10-fold cross validation to determine the final values, giving what we refer to as the Constrained Tree.
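For reference, the Gini index mentioned above is one minus the sum of squared class proportions in a node; a small illustrative sketch (the node counts are hypothetical):

import numpy as np

def gini(labels):
    # Gini impurity: 1 - sum_i p_i^2 over the class proportions p_i
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1 - (p ** 2).sum()

# e.g. a node holding 7 "no disease" and 3 "disease" patients:
print(gini(np.array([0] * 7 + [1] * 3)))   # 1 - (0.7^2 + 0.3^2) = 0.42

A pure node (all one class) has a Gini index of 0, which is what the splitting algorithm works towards.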

3.3 Support Vector Machine

The last classification model we developed and tested was the Support Vector Machine (SVM). The SVM is appropriate for both binary and multi-class classification problems and, with a nonlinear kernel, allows for nonlinear separating boundaries between the classes. Like kNN, our SVM has a single parameter to optimise, the regularisation parameter C. We used 10-fold cross validation to determine the best value of this parameter.
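For context (this is standard SVM theory, not specific to this report), C appears in the soft-margin objective:

    minimise   (1/2) ||w||^2 + C * sum_i xi_i
    subject to y_i (w . x_i + b) >= 1 - xi_i,  xi_i >= 0,

where the xi_i are slack variables for margin violations. A small C tolerates more misclassified training points (a wider, softer margin), while a large C penalises them heavily and risks overfitting.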

3.4 Test Data

To tune the parameters of the classifiers, the heart disease dataset is divided into a training dataset and a test dataset. The training dataset is used for tuning the classifier parameters, such as k in kNN or the four parameters of the Decision Tree. The test dataset is not used in the tuning process and is used only to assess the performance of the classifiers.

Given the small sample size of the heart disease dataset, the relative size of the training and test datasets matters. In principle the choice of test size is a bias/variance trade-off: a larger test set gives a lower-variance performance estimate but leaves less data for training. For this report we have used the common rule of thumb of an 80%/20% split; that is, 80% of the heart disease data is used for training and 20% for testing.


Before we can split the data into test and train datasets, we need to separate the attributes into descriptive features and the target, using the following code:

ColFeatures = ["age", "sex", "cp", "restbp", "chol", "fbs", "restecg",
               "thalach", "exang", "oldpeak", "slope", "ca", "thal"]
ColTarget = "target"
Heart_features = Heart_data[ColFeatures]
Heart_target = np.array(Heart_data[ColTarget])

Now we can divide the heart disease data into train and test datasets using a test size of 20%, as previously mentioned. This is done as follows:

X_train, X_test, y_train, y_test = train_test_split(
    Heart_features, Heart_target, test_size=0.2, random_state=0)
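As an aside, since the two classes are mildly imbalanced (164 versus 139 patients), a stratified split is an option worth noting (not used in this report); it preserves the class proportions in both subsets:

X_train, X_test, y_train, y_test = train_test_split(
    Heart_features, Heart_target, test_size=0.2,
    random_state=0, stratify=Heart_target)  # stratify keeps class ratios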

Note it is assumed that the required packages were imported as part of Phase 1 of this project. The

full code to import all required Python packages can be seen in Appendix A.

3.5 Feature Selection

As noted in Phase 1 of this project, the original study from which this dataset was obtained measured 76 attributes. However, previous research found that only 13 of these were relevant to the diagnosis of heart disease, and these 13 are the attributes used in our three classification models. For that reason, we have not undertaken further feature selection, as we are confident that each of the 13 features in our dataset plays a critical role in the diagnosis of heart disease.


4 Parameter tuning

4.1 k-Nearest Neighbor

The k-Nearest Neighbor classifier has just one parameter, k, the number of neighbours. The next two sections cover the two approaches we used to arrive at the optimal value of k (which we found to be k = 5).

4.1.1 Range of Values

Initial tuning considered a range of k values from 3 to 19. As a rule of thumb, k is taken up to about the square root of the number of observations in the dataset (√303 ≈ 17), so we capped the search at 19. We increment by 2 and loop through odd values only to avoid ties.

neighbors = list(range(3, 20, 2))   # odd values of k from 3 to 19
Model1_range_score = []
for i in neighbors:
    clf = KNeighborsClassifier(n_neighbors=i, weights="uniform",
                               metric="minkowski", p=2)
    Model1fit = clf.fit(X_train, y_train)
    scores = clf.score(X_train, y_train)
    Model1_range_score.append(scores)

plt.figure()
plt.plot(neighbors, Model1_range_score)
plt.xlabel('Number of Neighbors K')
plt.ylabel('Accuracy - range of kNN')
plt.title("kNN - Training Dataset")
plt.show()

knn_Range_optimal = neighbors[Model1_range_score.index(max(Model1_range_score))]
max_Model1_range_score = max(Model1_range_score)
MSE_Range = (1 - max_Model1_range_score) * 100
print("The optimal k parameter using a range of values is %d" % knn_Range_optimal)
print("The classification error rate using a range of values is %0.2f" % MSE_Range + '%')

The graph below shows the accuracy score for the varying k values, as produced by the code above. The highest accuracy score is 89.7%, corresponding to a k value of 3. Note, however, that these scores are computed on the training data itself, which inherently favours small values of k, so the cross validation results in the next section are a more reliable guide. (The default value of n_neighbors for the kNN classifier in the sklearn package is 5.)


4.1.2 10-Fold Cross Validation

The second approach was to use 10-fold cross validation on the training dataset. The code, which re-uses the neighbors list defined above, is shown below.

Model1_cv_score = []
for i in neighbors:
    clf = KNeighborsClassifier(n_neighbors=i, weights="uniform",
                               metric="minkowski", p=2)
    cvscores = cross_val_score(clf, X_train, y_train, cv=10)
    Model1_cv_score.append(cvscores.mean())

plt.figure()
plt.plot(neighbors, Model1_cv_score)
plt.xlabel('Number of Neighbors K')
plt.ylabel('Accuracy Cross Validation Score')
plt.title("kNN - Cross Validation")
plt.show()

knn_cv_optimal = neighbors[Model1_cv_score.index(max(Model1_cv_score))]
max_Model1_cv_score = max(Model1_cv_score)
MSE_cv = (1 - max_Model1_cv_score) * 100
print("The optimal k parameter using 10-fold cross validation is %d" % knn_cv_optimal)
print("The classification error rate using 10-fold cross validation is %0.2f" % MSE_cv + '%')

The graph below shows that the highest cross-validated accuracy score is 85.9%, corresponding to a k value of 5.


4.1.3 Final Model

Given that the cross validation approach to tuning parameters is the more reliable of the two, we select k = 5. The following code fits the final kNN model and prints the various scoring metrics: accuracy, precision, classification error rate and the confusion matrix.

final_X_train, final_X_test, final_y_train, final_y_test = train_test_split(
    Heart_features, Heart_target, test_size=0.2, random_state=0)
final_clf = KNeighborsClassifier(n_neighbors=5, weights="uniform",
                                 metric="minkowski", p=2)
final_model1fit = final_clf.fit(final_X_train, final_y_train)
final_model1_train_score = final_clf.score(final_X_train, final_y_train)
print("Model 1 (kNN) Final Training accuracy score is: %0.2f" % (100*final_model1_train_score) + '%')
print("Model 1 (kNN) Final Training Classification error rate: %0.2f" % (100-100*final_model1_train_score) + '%')
final_model1_test_score = final_model1fit.score(final_X_test, final_y_test)
print("Model 1 (kNN) Final Test accuracy score is: %0.2f" % (100*final_model1_test_score) + '%')
print("Model 1 (kNN) Final Test Classification error rate: %0.2f" % (100-100*final_model1_test_score) + '%')
final_model1_predict = final_model1fit.predict(final_X_test)
final_model1_cm = confusion_matrix(final_y_test, final_model1_predict)
print("Model 1 (kNN) Test Confusion Matrix")
print(final_model1_cm)
final_model1_classreport = classification_report(final_y_test, final_model1_predict)
print("Model 1 (kNN) Test Classification Report")
print(final_model1_classreport)

The output from the above code is as follows:

Model 1 (kNN) Final Training accuracy score is: 87.19%
Model 1 (kNN) Final Training Classification error rate: 12.81%
Model 1 (kNN) Final Test accuracy score is: 80.33%
Model 1 (kNN) Final Test Classification error rate: 19.67%
Model 1 (kNN) Test Confusion Matrix
[[31  4]
 [ 8 18]]
Model 1 (kNN) Test Classification Report
              precision    recall  f1-score   support
           0       0.79      0.89      0.84        35
           1       0.82      0.69      0.75        26
   micro avg       0.80      0.80      0.80        61
   macro avg       0.81      0.79      0.79        61
weighted avg       0.80      0.80      0.80        61

As we can see from the output above, our final test accuracy score for the kNN model is 80.3%.
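As a sanity check, the classification report values follow directly from the confusion matrix. For the heart disease class (1), for example:

    precision = 18 / (18 + 4) ≈ 0.82   (18 true positives, 4 false positives)
    recall    = 18 / (18 + 8) ≈ 0.69   (8 heart disease patients were missed)

which matches the report above.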


4.2 Decision Tree

The Decision Tree has four parameters which need to be tuned:

• Minimum number of samples required to split a node (min_samples_split)

• Maximum depth of the tree (max_depth)

• Minimum number of samples required in a Terminal Node (min_samples_leaf)

• Maximum number of features to consider when searching for the best split (max_features).

4.2.1 Default Decision Tree

Our initial Decision Tree classifier was developed using the default settings of the sklearn package, resulting in our default tree.

model2_default_clf = DecisionTreeClassifier(random_state=0)

model2_default_fit = model2_default_clf.fit(X_train,y_train)

The tree diagram for the default tree is shown below. Inspection of this shows the tree has a total depth of 10, and many of the decision nodes and terminal nodes have a very small sample size.

dot_data = tree.export_graphviz(
    model2_default_fit, out_file="test.dot",
    feature_names=["age", "sex", "cp", "restbp", "chol", "fbs", "restecg",
                   "thalach", "exang", "oldpeak", "slope", "ca", "thal"],
    class_names=["0", "1"],   # class names must follow the sorted class order (0, 1)
    filled=True, rounded=True, special_characters=True)


The code below prints the accuracy and precision metrics for our default decision tree:

model2_default_predict = model2_default_fit.predict(X_test)
model2_default_score = accuracy_score(y_test, model2_default_predict)
print("Model 2 Test (default) score: %0.2f" % (100*model2_default_score) + '%')
print("Model 2 Test (default) Classification error rate: %0.2f" % (100-100*model2_default_score) + '%')
model2_default_classreport = classification_report(y_test, model2_default_predict)
print("Model 2 Test (default) Classification Report")
print(model2_default_classreport)
model2_default_cm = confusion_matrix(y_test, model2_default_predict)
print("Model 2 Test (default) Confusion Matrix")
print(model2_default_cm)

From this we get the following output:

Model 2 Test (default) score: 73.77%
Model 2 Test (default) Classification error rate: 26.23%
Model 2 Test (default) Classification Report
              precision    recall  f1-score   support
           0       0.79      0.74      0.76        35
           1       0.68      0.73      0.70        26
   micro avg       0.74      0.74      0.74        61
   macro avg       0.73      0.74      0.73        61
weighted avg       0.74      0.74      0.74        61
Model 2 Test (default) Confusion Matrix
[[26  9]
 [ 7 19]]


4.2.2 Decision Tree – Constrained Parameters

We used a grid search with 10-fold cross validation to constrain the tree parameters. The ranges of values searched for the tree parameters are:

Parameter           Range
min_samples_split   2 to 29 by increments of 1
max_depth           1 to 19 by increments of 2
min_samples_leaf    10 to 25 by increments of 5
max_features        1 to 13 by increments of 1

The following code runs the grid search in order to find the best parameter values:

parameters = {'min_samples_split': range(2, 30, 1),
              'max_depth': range(1, 20, 2),
              'min_samples_leaf': range(10, 30, 5),
              'max_features': range(1, 14, 1)}
model2_constraint_clf = DecisionTreeClassifier(random_state=0)
model2_constraint_gridcv = GridSearchCV(model2_constraint_clf, parameters, cv=10)
model2_constraint_fit = model2_constraint_gridcv.fit(X_train, y_train)

The resulting tuned tree parameters from the grid search and 10-fold cross validation are:

Parameter           Optimal Value
min_samples_split   20
max_depth           5
min_samples_leaf    10
max_features        3
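These values can be read directly off the fitted grid search object; a minimal sketch, assuming the fit from the previous code block:

print(model2_constraint_gridcv.best_params_)   # the tuned parameter values above
print(model2_constraint_gridcv.best_score_)    # the best mean 10-fold CV accuracy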


4.2.3 Final Model

For our final Decision Tree model we create the tree using the optimal parameters found in the previous section and fit it to the training data:

final_model2_clf = DecisionTreeClassifier(min_samples_split=20, max_depth=5,
                                          min_samples_leaf=10, max_features=3,
                                          random_state=0)
final_model2_fit = final_model2_clf.fit(X_train, y_train)

The following produces the diagram of our final decision tree:

dot_data = tree.export_graphviz(
    final_model2_fit, out_file="test.dot",
    feature_names=["age", "sex", "cp", "restbp", "chol", "fbs", "restecg",
                   "thalach", "exang", "oldpeak", "slope", "ca", "thal"],
    class_names=["0", "1"],   # class names must follow the sorted class order (0, 1)
    filled=True, rounded=True, special_characters=True)
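The metrics below would be produced with the same scoring code as used for the default tree; a minimal sketch (the exact label strings are assumed to match the output):

final_model2_predict = final_model2_fit.predict(X_test)
final_model2_score = accuracy_score(y_test, final_model2_predict)
print("Model 2 Final Test Score: %0.2f" % (100*final_model2_score) + '%')
print("Model 2 Final Test Classification error rate: %0.2f" % (100-100*final_model2_score) + '%')
print("Model 2 Final Test Classification Report")
print(classification_report(y_test, final_model2_predict))
print("Model 2 Final Test Confusion Matrix")
print(confusion_matrix(y_test, final_model2_predict))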

Model 2 Final Test Score: 81.97%
Model 2 Final Test Classification error rate: 18.03%
Model 2 Final Test Classification Report
              precision    recall  f1-score   support
           0       0.82      0.89      0.85        35
           1       0.83      0.73      0.78        26
   micro avg       0.82      0.82      0.82        61
   macro avg       0.82      0.81      0.81        61
weighted avg       0.82      0.82      0.82        61
Model 2 Final Test Confusion Matrix
[[31  4]
 [ 7 19]]

As we can see from the output above, our final test accuracy score for the Decision Tree is 82.0%.


4.3 Support Vector Machine

Having considered the k-Nearest Neighbor and Decision Tree, we now consider the Support Vector Machine (SVM). Like the k-Nearest Neighbor classifier, our SVM has one parameter to tune, the regularisation parameter C, which controls the trade-off between maximising the margin and penalising misclassified training points, and thus largely determines the performance of the classifier.

We will use the same features as for the previous models, and the following code to split the data into test and train datasets:

X_train, X_test, y_train, y_test = train_test_split(
    Heart_features, Heart_target, test_size=0.2, random_state=0)

We now fit a default SVM model with the following code and print the accuracy metrics to help us

assess how good the default model is:

svcModel = svm.SVC(C=1, decision_function_shape='ovr', random_state=0,
                   probability=True)
print(svcModel)
Model3_fit = svcModel.fit(X_train, y_train)
Model3_scores = svcModel.score(X_train, y_train)
Model3_predicted = svcModel.predict(X_test)
print("Model 3 Train score: %0.2f" % (100*Model3_scores) + '%')
error_rate = 1 - Model3_scores
print("Model 3 Train classification error rate: %0.2f" % (error_rate*100) + '%')
Model3_test_score = svcModel.score(X_test, y_test)
print("Model 3 Test score: %0.2f" % (100*Model3_test_score) + '%')
error_rate = 1 - Model3_test_score
print("Model 3 Test classification error rate: %0.2f" % (error_rate*100) + '%')
Model3_ClassReport = classification_report(y_test, Model3_predicted)
print("Model 3 Test Classification Report")
print(Model3_ClassReport)
Model3_cm = confusion_matrix(y_test, Model3_predicted)
print("Model 3 Test Confusion Matrix")
print(Model3_cm)

The code above prints the following output, which allows us to assess the default model:

Model 3 Train score: 84.71%
Model 3 Train classification error rate: 15.29%
Model 3 Test score: 80.33%
Model 3 Test classification error rate: 19.67%
Model 3 Test Classification Report
              precision    recall  f1-score   support
           0       0.78      0.91      0.84        35
           1       0.85      0.65      0.74        26
   micro avg       0.80      0.80      0.80        61
   macro avg       0.82      0.78      0.79        61
weighted avg       0.81      0.80      0.80        61


Model 3 Test Confusion Matrix
[[32  3]
 [ 9 17]]

4.3.1 Range of Values

Initial tuning considered a set of C values between 0.1 and 10, spaced more finely around the default of C = 1, using the following Python code:

paramGrid = [0.1, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0, 1.1, 1.2,
             1.3, 1.4, 1.5, 2, 5, 10]
Model3_score_C = []
for i in paramGrid:
    svcModel = svm.SVC(C=i, decision_function_shape='ovr', random_state=0,
                       probability=True)
    Model3_C = svcModel.fit(X_train, y_train)
    Model3_score_C.append(Model3_C.score(X_train, y_train))

plt.figure()
plt.plot(paramGrid, Model3_score_C)
plt.xlabel('C value')
plt.ylabel('Accuracy Score')
plt.title("SVM - Testing range of values")
plt.show()

model3_C = paramGrid[Model3_score_C.index(max(Model3_score_C))]
print("Optimal C parameter via testing range is %d" % model3_C)
max_Model3_C = max(Model3_score_C)
MSE_range = (1 - max_Model3_C) * 100
print("Classification error rate via testing range is %0.2f" % MSE_range + '%')

The resulting graph plots the training accuracy score against the range of C values. As can be observed from the figure, the accuracy score rises steeply up to around C = 2 and then increases only gradually towards C = 10.


4.3.2 10-Fold Cross Validation

Using 10-fold cross validation on the training dataset, the highest accuracy score is 84.8%, corresponding to a C value of 2.

Model3_cv_score = []
for i in paramGrid:
    svcModel = svm.SVC(C=i, decision_function_shape='ovr',
                       random_state=0, probability=True)
    cvscores = cross_val_score(svcModel, X_train, y_train, cv=10)
    Model3_cv_score.append(cvscores.mean())

plt.figure()
plt.plot(paramGrid, Model3_cv_score)
plt.xlabel('C value')
plt.ylabel('Accuracy Cross Validation Score')
plt.title("SVM - Cross Validation")
plt.show()

model3_C_cv = paramGrid[Model3_cv_score.index(max(Model3_cv_score))]
print("Optimal C parameter via cross validation is %d" % model3_C_cv)
max_Model3_cv_score = max(Model3_cv_score)
MSE_cv = (1 - max_Model3_cv_score) * 100
print("Classification error rate via cross validation is %0.2f" % MSE_cv + '%')

The code above produces the following graph:


4.3.3 Final Model

From our findings in the previous sections, and given that the cross validation approach to tuning parameters is the more reliable, for our final SVM model we select the C parameter to be 2.

X_train, X_test, y_train, y_test = train_test_split(
    Heart_features, Heart_target, test_size=0.2, random_state=0)
svcModel_final = svm.SVC(C=2, decision_function_shape='ovr',
                         random_state=0, probability=True)
svcModel_final = svcModel_final.fit(X_train, y_train)
Model3_final_predicted = svcModel_final.predict(X_test)
Model3_final_scores = svcModel_final.score(X_train, y_train)
print("Model 3 (SVM) Final Train score: %0.2f" % (100*Model3_final_scores) + '%')
print("Model 3 (SVM) Final Train Classification error rate: %0.2f" % ((1-Model3_final_scores)*100) + '%')
Model3_final_test_score = svcModel_final.score(X_test, y_test)
print("Model 3 (SVM) Final Test score: %0.2f" % (100*Model3_final_test_score) + '%')
print("Model 3 (SVM) Final Test Classification error rate: %0.2f" % ((1-Model3_final_test_score)*100) + '%')
Model3_final_ClassReport = classification_report(y_test, Model3_final_predicted)
print("Model 3 (SVM) Final Test Classification Report")
print(Model3_final_ClassReport)
Model3_final_cm = confusion_matrix(y_test, Model3_final_predicted)
print("Model 3 (SVM) Final Test Confusion Matrix")
print(Model3_final_cm)
scores = cross_val_score(svcModel_final, X_train, y_train, cv=10)
print("Model 3 (SVM) Final CV (train) mean score: %0.2f" % (100*scores.mean()) + '%')
print("Model 3 (SVM) Final CV (train) Classification error mean rate: %0.2f" % (100-100*scores.mean()) + '%')

As we can see from the output below, our final test accuracy score for the SVM model is 78.7%.

Model 3 (SVM) Final Train score: 88.02%

Model 3 (SVM) Final Train Classification error rate: 11.98%
Model 3 (SVM) Final Test score: 78.69%
Model 3 (SVM) Final Test Classification error rate: 21.31%
Model 3 (SVM) Final Test Classification Report
              precision    recall  f1-score   support
           0       0.78      0.89      0.83        35
           1       0.81      0.65      0.72        26
   micro avg       0.79      0.79      0.79        61
   macro avg       0.79      0.77      0.78        61
weighted avg       0.79      0.79      0.78        61
Model 3 (SVM) Final Test Confusion Matrix
[[31  4]
 [ 9 17]]
Model 3 (SVM) Final CV (train) mean score: 84.76%
Model 3 (SVM) Final CV (train) Classification error mean rate: 15.24%


5 Performance comparison

From the outputs of each of our final models, we can observe and compare the performance metrics. As can be seen from the tables below, across all metrics, including precision (which is the most important metric in this medical context), the Decision Tree classifier performs the best.

k-Nearest Neighbor

              Precision   Recall   f1-score   Support
0             0.79        0.89     0.84       35
1             0.82        0.69     0.75       26
Micro avg     0.80        0.80     0.80       61
Macro avg     0.81        0.79     0.79       61
Weighted avg  0.80        0.80     0.80       61

              Prediction = 0   Prediction = 1
Actual = 0    31               4
Actual = 1    8                18

The final test accuracy score: 80.3%.

Decision Tree

              Precision   Recall   f1-score   Support
0             0.82        0.89     0.85       35
1             0.83        0.73     0.78       26
Micro avg     0.82        0.82     0.82       61
Macro avg     0.82        0.81     0.81       61
Weighted avg  0.82        0.82     0.82       61

              Prediction = 0   Prediction = 1
Actual = 0    31               4
Actual = 1    7                19

The final test accuracy score: 82.0%.

Support Vector Machine

              Precision   Recall   f1-score   Support
0             0.78        0.89     0.83       35
1             0.81        0.65     0.72       26
Micro avg     0.79        0.79     0.79       61
Macro avg     0.79        0.77     0.78       61
Weighted avg  0.79        0.79     0.78       61

              Prediction = 0   Prediction = 1
Actual = 0    31               4
Actual = 1    9                17

The final test accuracy score: 78.7%.


6 Critique

The purpose of this project was to determine whether we could derive a model which could accurately classify whether an individual has heart disease. Three models were investigated:

1. Model 1 – k-Nearest Neighbor with k = 5

2. Model 2 – Decision Tree with a maximum depth of 5

3. Model 3 – Support Vector Machine with C = 2

Based on the confusion matrices, and in particular the precision measure, the models rank in the order Decision Tree, then k-Nearest Neighbor, then Support Vector Machine.

All three classifiers have decent classification rates, precision and recall metrics, so all are potential candidates for classifying individuals as having heart disease or not. The Decision Tree model, however, has the advantage of the best precision score, which is important when dealing with a fatal disease.

Given more time, we would develop and test further models, in particular a logistic regression model, which may perform well on this type of data.
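As a pointer only, such a model could be fitted with a few lines of sklearn code (a sketch under the same train/test split; the solver choice is an assumption and the results have not been verified):

from sklearn.linear_model import LogisticRegression

logreg = LogisticRegression(solver='liblinear', random_state=0)  # solver choice assumed
logreg_fit = logreg.fit(X_train, y_train)
print("Logistic Regression Test accuracy: %0.2f" % (100*logreg_fit.score(X_test, y_test)) + '%')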


7 Summary

The objective of this project is to predict whether an individual has heart disease.

In Phase 1, as covered by the previous report, we focused on data pre-processing and exploration of

the heart disease dataset. The data pre-processing included checking for capital letter mismatches,

extra whitespace and typos as well as performing sanity checks, dealing with missing values and

ensuring appropriate data types. In the data exploration section, we analysed each attribute in the

heart disease dataset using appropriate visualisations.

In Phase 2, as covered by this report, we developed and tested the following three classification

models:

1) k-Nearest Neighbor

2) Decision Tree

3) Support Vector Machine.

Through parameter tuning using a range of values and 10-fold cross validation, we optimised each of

these models. A performance comparison showed us that the Decision Tree performed the best as it

gave us the best precision score and lowest classification error rate.


8 Conclusions

Given that the area of interest for this project is the diagnosis of heart disease, a potentially fatal illness, precision is the most appropriate metric for comparing the performance of the different models. It follows that our second model, the Decision Tree classifier, is the one recommended based on the precision scores (as can be seen in the output from the final models). Furthermore, Model 2 also provides the highest accuracy score and lowest classification error rate.


9 References

1. Australian Bureau of Statistics 2018, Causes of Death, Australia, 2017, cat. no. 3303.0, viewed 21 April 2019, <https://www.abs.gov.au/ausstats/>.

2. MATH2319 Machine Learning Project Phase 1 Report.

3. Boschetti, A. & Massaron, L., Python Data Science Essentials.

4. McKinney, W., Python for Data Analysis.

5. pandas documentation.

6. scikit-learn documentation.


10 Appendix A – Imported packages

import pandas as pd

from pandas.plotting import scatter_matrix

import urllib2

import numpy as np

import matplotlib as mpl

import matplotlib.pyplot as plt

from sklearn import model_selection

from sklearn.model_selection import train_test_split

from sklearn.neighbors import KNeighborsClassifier

from sklearn.tree import DecisionTreeClassifier

from sklearn.metrics import confusion_matrix

from sklearn.metrics import classification_report

from sklearn.model_selection import cross_val_score

from sklearn.model_selection import LeaveOneOut

from sklearn.metrics import accuracy_score

from sklearn.model_selection import GridSearchCV

from sklearn import tree

from sklearn.tree import export_graphviz

from sklearn import svm