

Machine Learning Classification Techniques on Cervical Cancer Identification

by Vanban Wu, Ph.D., PMP, PSM

Index Terms: Data Science, Data Processing, Data Modeling, Machine Learning, Logistic Regression, K-Nearest Neighbor (K-NN), Support Vector Machine (SVM) & Kernel SVM, Naïve Bayes, Decision Tree, Random Forest, Accuracy, False Negative Rate or Type-II error, Cumulative Accuracy Profile (CAP), Linear Discriminant Analysis (LDA), R Programming, Cervical Cancer Risk Classification

Abstract:

This paper evaluates the most common machine learning classification techniques applied to a medical research topic – cervical cancer risk identification – and highlights the unique characteristics and mathematical approach of each technique in the machine learning domain. The computational result of each mechanism is evaluated by two criteria: the accuracy rate and the False Negative Rate (Type-II error), which are essential factors in medical research. The most promising algorithm is further evaluated by the Cumulative Accuracy Profile (CAP) measurement and enhanced by the Linear Discriminant Analysis mechanism to achieve a perfect result.

1.0 Introduction:

1.1 Machine Learning Classification Techniques: Reflecting the progress of machine learning research over the past several years, the following are the common classification techniques in this domain:

• K-Nearest Neighbor (K-NN)
• Logistic Regression
• Support Vector Machine (SVM) and Kernel SVM
• Naïve Bayes
• Decision Tree
• Random Forest

Without a doubt, there are other approaches that can achieve similar objectives, such as deep learning and neural networks, but these techniques are left out of this discussion for the sake of simplicity.

In this paper, we use R to illustrate each of the techniques; the code demonstrates the effectiveness of the language's vectorized approach in implementing machine learning.


1.2 Medical Topic Selection: The cervical cancer dataset was chosen from the Kaggle database [1] to illustrate each machine learning technique and approach. Past literature has used some of these techniques to evaluate cervical cancer, but none has offered a comprehensive comparison of all the techniques described above, which makes this report a contribution in its own right.

For this medical topic, 35 independent medical features support the evaluation of the cervical cancer biopsy result. The 36 variables are listed below, with the last one being the dependent variable:

1. Age
2. Number of sexual partners
3. First sexual intercourse
4. Number of pregnancies
5. Smokes
6. Smokes (years)
7. Smokes (packs/year)
8. Hormonal Contraceptives
9. Hormonal Contraceptives (years)
10. IUD (Intrauterine Device)
11. IUD (years)
12. STDs (Sexually Transmitted Diseases)
13. STDs (number)
14. STDs: condylomatosis
15. STDs: cervical condylomatosis
16. STDs: vaginal condylomatosis
17. STDs: vulvo-perineal condylomatosis
18. STDs: syphilis
19. STDs: pelvic inflammatory disease
20. STDs: genital herpes
21. STDs: molluscum contagiosum
22. STDs: AIDS
23. STDs: HIV
24. STDs: Hepatitis B
25. STDs: HPV
26. STDs: Number of diagnoses
27. STDs: Time since the first diagnosis
28. STDs: Time since the last diagnosis
29. Dx: Cancer
30. Dx: CIN
31. Dx: HPV
32. Dx
33. Hinselmann
34. Schiller
35. Citology
36. Biopsy (Dependent variable)


The datafile contains only 858 specimens. From a preliminary observation, more than a few independent variables are sparsely populated in the dataset; for instance, STDs cervical condylomatosis and STDs AIDS contain only the value 0; STDs pelvic inflammatory disease, STDs genital herpes, STDs molluscum contagiosum, STDs HIV, and STDs Hepatitis B each have only 1 of 858 cases populated; and STDs HPV has 2 of 858 populated. In information theory, such distributions are known as a low-entropy phenomenon [11]. This raises a major concern that none of the techniques will perform well in such a situation, due to the lack of data guidance during machine learning construction.
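To make the concern concrete, a quick hypothetical calculation in R shows how little information such a column carries; a binary variable with 1 positive case out of 858 has near-zero Shannon entropy [11]:

# Shannon entropy of a binary column with 1 positive in 858 cases (hypothetical illustration)
p = c(1/858, 857/858)
sum(-p * log2(p))   # ~0.013 bits, versus 1 bit for a perfectly balanced column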

2.0 Preprocessing of Datafile:

The preprocessing phase contains the following steps:

• Resetting invalid values to a default setting, e.g. the "na" value
• Renaming some variables for easier access
• Replacing missing values with the mean of each variable
• Removing variables containing only the value 0, i.e. STDs cervical condylomatosis and STDs AIDS
• Reordering some of the columns for easier access
• Encoding some of the variables as factors wherever needed to improve the computational efficiency of the algorithms

Please refer to the R code in Attachment 1 for both the preprocessing and the subsequent machine learning phases.

3.0 Splitting of Data into Training and Testing Sets:

Per common practice, 75% of the dataset is randomly assigned to the training set to build the model and 25% to the test set to validate it. This split ratio is a sound measure for preventing overfitting during the model construction phase. As a result of the split, 643 records go to the training set and 215 to the test set.

4.0 Machine Learning:

In this section, we preview all the algorithms under discussion and highlight some of their essential tunable parameters.

4.1 K-Nearest Neighbor (K-NN):

The K-Nearest Neighbors algorithm (K-NN) is a non-parametric method used for classification [2]. In the K-NN classification, the output is a class membership. An object is classified by a majority vote of its neighbors, with the object being assigned to the class most common among its k nearest neighbors (k is a positive integer, typically small). If k = 1, then the object is simply assigned to the class of that single nearest neighbor.

There are two parameters one can select to fine-tune the outcome of this algorithm:


• Parameter K: This is the number of nearest neighbors used to classify a test sample to a certain class. Per the literature recommendation, K is usually set to $\sqrt{n}$, where n is the number of records in the training set [3]. Instead of using this recommendation, we select a small number from the Fibonacci series; this approach converges to a K value (e.g. K = 5) that performs far better than $\sqrt{n}$ (K ≈ 25).

• K-NN distances: There are many distance types one can choose for the computation, the most common being the Euclidean distance. Other options, e.g. the Manhattan or Chebyshev distance, can also be trialed if they fit an application's characteristics; please refer to Attachment 3 for their definitions. In this study, we only experiment with the Euclidean distance (see the sketch after this list).
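As an illustration, a minimal K-NN sketch in R might look as follows. This is not part of the paper's attachments; it assumes the preprocessed training_set and test_set from Attachment 1, with the Biopsy label in column 34.

# A minimal K-NN sketch (assumed data layout from Attachment 1)
library(class)
y_pred = knn(train = training_set[, -34],           # feature columns only
             test  = test_set[, -34],
             cl    = as.factor(training_set[, 34]), # training labels
             k     = 5)                             # K chosen from the Fibonacci series
cm = table(test_set[, 34], y_pred)                  # confusion matrix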

4.2 Logistic Regression:

In statistics, the logistic model uses a logistic function to model a binary dependent variable with two possible values labeled "0" and "1" [4]. In the logistic model, the log-odds (the logarithm of the odds) is a linear combination of one or more independent variables (see equation (2)). Basically, the algorithm tries to fit all independent variables into a Sigmoid function (equation (1)) by optimizing the coefficients $a_0, a_1, \dots, a_n$, where $p$ is the probability of occurrence (the predicted outcome) and the $X_i$ are the independent variables under consideration:

$p = \frac{1}{1+e^{-y}}$, where $y = a_0 + a_1X_1 + a_2X_2 + \dots + a_nX_n$   (1)

$\Leftrightarrow \ \ln\!\left(\frac{p}{1-p}\right) = a_0 + a_1X_1 + a_2X_2 + \dots + a_nX_n$   (2)

$\Leftrightarrow \ \frac{p}{1-p} = e^{a_0} \cdot e^{a_1X_1} \cdot e^{a_2X_2} \cdot \ldots \cdot e^{a_nX_n}$   (3)

Equation (3) reveals how each coefficient $a_i$ and its sign impact the predicted odds $p/(1-p)$ in either a positive or negative sense, a correlation between the independent and dependent variables worth noting.

For this algorithm, no parameter tuning is needed, thus simplifying its usage during a trial.
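As an illustration, a minimal logistic regression sketch in R might look as follows; it is not part of the paper's attachments and assumes the data layout of Attachment 1 (Biopsy label in column 34):

# A minimal logistic regression sketch (assumed data layout from Attachment 1)
classifier = glm(formula = Biopsy ~ ., family = binomial, data = training_set)
prob_pred = predict(classifier, type = 'response', newdata = test_set[-34])  # predicted p
y_pred = ifelse(prob_pred > 0.5, 1, 0)   # threshold the Sigmoid output at 0.5
cm = table(test_set[, 34], y_pred)       # confusion matrix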

4.3 Support Vector Machine (SVM) and Kernel SVM:

Support Vector Machine (SVM) is a supervised learning model that assigns new data to one category or the other such that the separating boundary between the categories is as wide as possible, defined by a Maximum Margin Hyperplane (or Maximum Margin Classifier) and support vectors from each category [5]. Any hyperplane can be written as the set of points $\vec{x}$ satisfying $\vec{w} \cdot \vec{x} - b = 0$ (Figure 4.3.1). The two hard margins are $\vec{w} \cdot \vec{x} - b = 1$ and $\vec{w} \cdot \vec{x} - b = -1$. The distance between these two hyperplanes is $2/\|\vec{w}\|$, so to maximize the distance between the planes we need to minimize $\|\vec{w}\|$.

In addition to performing linear classification, the SVM algorithm can also perform non-linear classification using what is called the kernel trick, implicitly mapping its inputs into high-dimensional feature spaces. Basically, the original function is mapped to a higher dimension by a kernel function, and the most common kernel functions used in non-linear


classification are the Gaussian (a.k.a. radial basis function; equation 4), Sigmoid (equation 5), and Polynomial (equation 6) functions.

(Figure 4.3.1 – from Wikipedia)

$K(\vec{x}, \vec{l}) = e^{-\|\vec{x}-\vec{l}\|^{2}/\tau}$, where $\vec{l}$ is the landmark and $\sigma$ is a constant   (4)

Here $\|\dots\|$ represents the distance between $\vec{x}$ and the landmark $\vec{l}$, and $\tau = 2\sigma^{2}$.

$K(X, Y) = \tanh(\gamma \cdot X^{T}Y + \tau)$   (5)

$K(X, Y) = (\gamma \cdot X^{T}Y + \tau)^{d}, \quad \gamma > 0$   (6)

In this report, we experiment with all four kernel types: linear, Gaussian, sigmoid, and polynomial, searching for the best kernel for the topic under study; a sketch of this comparison follows.
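The sketch below tries each kernel in turn with e1071::svm, the package used in Attachment 1; the svm arguments gamma, coef0, and degree correspond to the $\gamma$, $\tau$, and $d$ parameters of equations (4) through (6) and are left at their defaults here (assumed data layout from Attachment 1):

# Trying each kernel type in turn (assumed data layout from Attachment 1)
library(e1071)
for (krn in c('linear', 'radial', 'sigmoid', 'polynomial')) {
  classifier = svm(formula = Biopsy ~ ., data = training_set,
                   type = 'C-classification', kernel = krn)
  y_pred = predict(classifier, newdata = test_set[-34])
  print(krn)
  print(table(test_set[, 34], y_pred))  # confusion matrix per kernel
}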

4.4 Naïve Bayes:

Naïve Bayes classifiers are a family of simple "probabilistic classifiers" based on applying Bayes' theorem with strong (naïve) independence assumptions between features [6]. They are built from a conditional probability model: given a problem instance $X = (x_1, x_2, \dots, x_n)$ to be classified, the conditional probability that X belongs to class $C_k$ can be written as:

$p(C_k \mid X) = \frac{p(C_k)\, p(X \mid C_k)}{p(X)} > p(C_j \mid X) \quad \forall\, C_j \neq C_k$   (7)

In plain English, the above equation can be rephrased as follows:

$\text{posterior} = \frac{\text{prior} \times \text{likelihood}}{\text{evidence}}$

There are other variations of Naïve Bayes models, such as the Gaussian, Multinomial, and Bernoulli models; please refer to [6] for the mathematical formula of each alternative. In this study, we only consider the basic form specified by equation (7), illustrated in the sketch below.
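As an illustration, a minimal Naïve Bayes sketch with the e1071 package might look as follows; it is not part of the paper's attachments and assumes the data layout of Attachment 1 (the label must be a factor):

# A minimal Naïve Bayes sketch (assumed data layout from Attachment 1)
library(e1071)
classifier = naiveBayes(x = training_set[-34],
                        y = as.factor(training_set$Biopsy))
y_pred = predict(classifier, newdata = test_set[-34])
cm = table(test_set[, 34], y_pred)  # confusion matrix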


4.5 Decision Tree:

Decision Tree learning is the construction of a decision tree from class-labeled training tuples. It is a flow-chart-like structure, where each internal (non-leaf) node denotes a test on an attribute, each branch represents the outcome of a test, and each leaf (or terminal) node holds a class label [7].

Algorithms for constructing decision trees usually work top-down, by choosing a variable at each step that best splits the set of items. Different algorithms use different metrics for measuring the “best” choice. Some of the well-known metrics are Gini Impurity, Information Gain, and Variance Reduction. Information Gain is the metric used in the evaluation of this report.

Although the algorithm is simple to understand and interpret, and performs well with large datasets, a decision tree can be non-robust; a small change in the training set can result in a large change in the tree, and consequently in the final predictions. Also, the problem of learning an optimal decision tree is known to be NP-complete under several aspects of optimality.
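As an illustration, a minimal decision tree sketch with the rpart package might look as follows; it is not part of the paper's attachments, assumes the data layout of Attachment 1, and selects the information-gain split criterion named above via the parms argument:

# A minimal decision tree sketch using the Information Gain split metric
library(rpart)
classifier = rpart(formula = as.factor(Biopsy) ~ ., data = training_set,
                   method = 'class',
                   parms = list(split = 'information'))  # Information Gain
y_pred = predict(classifier, newdata = test_set[-34], type = 'class')
cm = table(test_set[, 34], y_pred)  # confusion matrix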

4.6 Random Forest:

Random Forests are an ensemble learning method for classification that constructs a multitude of decision trees at training time and outputs the class that is the mode of the classes (or the mean prediction) of the individual trees [8]. Random decision forests correct for decision trees' habit of overfitting to their training set.

The tunable parameter of this algorithm is the number of decision trees. We experimented with an initial N = 500 trees and eventually settled on N = 300, as it generated a better accuracy value; see the sketch below.
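As an illustration, a minimal random forest sketch with the randomForest package and the settled N = 300 trees might look as follows (not part of the paper's attachments; assumed data layout from Attachment 1):

# A minimal random forest sketch with N = 300 trees
library(randomForest)
set.seed(123)
classifier = randomForest(x = training_set[-34],
                          y = as.factor(training_set$Biopsy),
                          ntree = 300)  # settled on after trialing N = 500
y_pred = predict(classifier, newdata = test_set[-34])
cm = table(test_set[, 34], y_pred)  # confusion matrix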

5.0 Machine Learning Evaluation:

A performance evaluation method used in machine learning is the Confusion Matrix (CM), also known as an error matrix [9]. In the CM, each column represents the instances of a predicted class while each row represents the instances of an actual class (see Figure 5.0.1 for a configuration of two classes, 0 and 1). Each entry in the matrix bears a specific name: True Positive (TP), True Negative (TN), False Positive (FP), and False Negative (FN), interpreted as follows: TP is equivalent to a Hit, TN to a Correct Rejection, FP to a False Alarm or Type I error, and FN to a Miss or Type II error.

                 Predicted Class
                   0        1
Actual Class  0    TN       FP
              1    FN       TP

(Figure 5.0.1)

In this report, we use the following two criteria for the selection of an algorithm:


• Accuracy (ACC), defined by equation (8)
• False Negative Rate (FNR) or Type II Error Rate, defined by equation (9)

$\text{ACC} = \frac{TP + TN}{TP + TN + FP + FN}$   (8)

$\text{FNR} = \frac{FN}{FN + TP}$   (9)
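In R, both criteria follow directly from the confusion matrix cm, mirroring the computation at the end of Attachment 1 (rows are the actual class, columns the predicted class, as in Figure 5.0.1):

# Accuracy (8) and FNR (9) from the confusion matrix cm
acc = sum(diag(cm)) / sum(cm)            # (TP + TN) / (TP + TN + FP + FN)
fnr = cm[2, 1] / (cm[2, 1] + cm[2, 2])   # FN / (FN + TP)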

Accuracy can be used to compare the performance of machine learning algorithms. The accuracy results for all the computed algorithms are shown in Table 5.0.2 below:

Machine Learning Methods        Accuracy
K-Nearest Neighbor              0.935
Logistic Regression             0.953
Support Vector Machine          0.962
Kernel SVM (Polynomial)         0.921
Kernel SVM (Sigmoid)            0.916
Kernel SVM (Radial Kernel)      0.935
Naïve Bayes                     0.921
Decision Tree                   0.953
Random Forest                   0.949

(Table 5.0.2)

All accuracy rates fluctuate around 0.938 with a standard deviation of 0.017, which gives no option an obvious advantage. Besides, one needs to be aware of the accuracy paradox, which may trap the selection in favor of one algorithm over the others. Thus, we need to rely on an additional criterion, FNR, to assist in the decision making.

In medical applications, the FNR is an essential factor in forming a medical conclusion; a high FNR reflects serious misdiagnosis, declaring a malignant case as benign. Thus, combining the FNR information into the computation, we get the following result (Table 5.0.3):

The result shows the Support Vector Machine is the best method among all, as it has the lowest FNR (29%), followed by the Logistic Regression method with the second lowest FNR (36%). The three algorithms with the worst FNR happen to be the K-NN, Kernel SVM (Sigmoid), and Kernel SVM (Gaussian) methods.

Thirdly, we calculate the Cumulative Accuracy Profile (CAP) [10] from the test set of 215 records to assess the performance of the Support Vector Machine algorithm (Chart 5.0.4). At the 50% mark, the algorithm converges at 85.7% accuracy, a very good rate by the CAP standard. This performance level shows the algorithm to be reliable for the intended medical application. Furthermore, the CAP computation can be used to detect deterioration of the model over time.
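The paper exports test_set and y_pred to Excel for the CAP chart (see the end of Attachment 1); a rough equivalent sketch in R, assuming those same objects, might look as follows:

# A rough CAP-curve sketch: order cases with predicted positives first,
# then track the cumulative share of actual positives captured
actual = test_set[, 34]
ord = order(y_pred, decreasing = TRUE)
captured = cumsum(actual[ord] == 1) / sum(actual == 1)  # share of positives found
fraction = seq_along(actual) / length(actual)           # share of cases examined
plot(fraction, captured, type = 'l',
     xlab = 'Fraction of test set', ylab = 'Fraction of positives captured')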



Machine Learning Methods        Accuracy    FNR
K-Nearest Neighbor              0.935       100%
Logistic Regression             0.953       36%
Support Vector Machine          0.962       29%
Kernel SVM (Polynomial)         0.921       78%
Kernel SVM (Sigmoid)            0.916       100%
Kernel SVM (Radial Kernel)      0.935       100%
Naïve Bayes                     0.921       72%
Decision Tree                   0.953       43%
Random Forest                   0.949       57%

(Table 5.0.3)

(Chart 5.0.4)

6.0 Linear Discriminant Analysis (LDA) Reduction:

To further improve the result, we adopt a supervised dimensionality reduction technique called Linear Discriminant Analysis (LDA). This method is used in the pre-processing phase for pattern classification in machine learning, projecting the dataset onto a much lower-dimensional space [12].

Because the LDA algorithm has low tolerance for low-entropy variables, we have to delete the seven sparsely populated independent variables described at the end of Section 1.2. The final result after this modification yields a perfect FNR of 0% (see Table 6.0.1).

Please refer to Attachment 2 for part of the code enhancement.



                 Predicted Class
                   0        1
Actual Class  0    201      0
              1     0      14

(Table 6.0.1)

7.0 Concluding Remarks:

In conclusion, we have successfully selected a machine learning algorithm viable for the medical topic under study; the result is objectively justified using the accuracy and FNR rates. The CAP profile is also calculated to verify the robustness of the algorithm in determining the medical outcome. As the last step, we apply the LDA reduction method to further reduce the dimensionality of the dataset and achieve a perfect score on the FNR criterion.

One thing worth noting about the whole operation: at the beginning of the process, there was a concern that some of the features were sparsely populated. That drawback did not materialize, as each of the methods is equipped to work around those low-entropy features during the machine learning computation. With the introduction of LDA, however, we had to eliminate those variables, as the enhanced method does not work well with low-entropy variables. Due to this manual preprocessing intervention, any machine learning model needs to be reviewed periodically, e.g. every three to six months, to watch for pattern changes in the dataset.

Last but not least, both the machine learning model and the testing results are sensitive to the size and random distribution of the data selected for the training and test sets. Thus, it is impractical to compare the results of this report against past literature unless all studies choose the exact same settings.



References

[1] Cervical Cancer Risk Classification, https://www.kaggle.com/loveall/cervical-cancer-risk-classification
[2] K-Nearest Neighbors (K-NN), https://en.wikipedia.org/wiki/K-nearest_neighbors_algorithm
[3] How to determine the K value for the K-NN algorithm, https://stackoverflow.com/questions/18110951/how-to-determine-k-value-for-the-k-nearest-neighbours-algorithm-for-a-matrix-in
[4] Logistic Regression, https://en.wikipedia.org/wiki/Logistic_regression
[5] Support Vector Machine, https://en.wikipedia.org/wiki/Support-vector_machine
[6] Naïve Bayes, https://en.wikipedia.org/wiki/Naive_Bayes_classifier
[7] Decision Tree, https://en.wikipedia.org/wiki/Decision_tree_learning
[8] Random Forest, https://en.wikipedia.org/wiki/Random_forest
[9] Confusion Matrix, https://en.wikipedia.org/wiki/Confusion_matrix
[10] Cumulative Accuracy Profile, https://en.wikipedia.org/wiki/Cumulative_accuracy_profile
[11] Entropy (Information Theory), https://en.wikipedia.org/wiki/Entropy_(information_theory)
[12] Linear Discriminant Analysis, https://en.wikipedia.org/wiki/Linear_discriminant_analysis


Attachment 1 (R Programming)

We only show the R code for the linear Support Vector Machine (SVM). The code for the other algorithms is similar, except for the classifier part where the model is built.

# Kernel (Linear) SVM

# Importing the dataset
dataset = read.csv('CCDataset.csv', na.strings = c('?', '', ' ', 'NA'))

# Rename some of the columns for easier access
rename_map = c(
  'Number.of.sexual.partners'          = 'NoSexualPartners',
  'First.sexual.intercourse'           = 'FirstSexualIntercourse',
  'Num.of.pregnancies'                 = 'NoPregnancies',
  'Smokes..years.'                     = 'SmokesYears',
  'Smokes..packs.year.'                = 'SmokesPacksYear',
  'Hormonal.Contraceptives'            = 'HormonalContraceptives',
  'Hormonal.Contraceptives..years.'    = 'HormonalContraceptivesYears',
  'IUD..years.'                        = 'IUDYears',
  'STDs..number.'                      = 'STDsNo',
  'STDs.condylomatosis'                = 'STDsCondylomatosis',
  'STDs.cervical.condylomatosis'       = 'STDsCervicalCondylomatosis',
  'STDs.vaginal.condylomatosis'        = 'STDsVaginalCondylomatosis',
  'STDs.vulvo.perineal.condylomatosis' = 'STDsVulvoPerinealCondylomatosis',
  'STDs.syphilis'                      = 'STDsSyphilis',
  'STDs.pelvic.inflammatory.disease'   = 'STDsPelvicInflammatoryDisease',
  'STDs.genital.herpes'                = 'STDsGenitalHerpes',
  'STDs.molluscum.contagiosum'         = 'STDsMolluscumContagiosum',
  'STDs.AIDS'                          = 'STDsAIDS',
  'STDs.HIV'                           = 'STDsHIV',
  'STDs.Hepatitis.B'                   = 'STDsHepatitisB',
  'STDs.HPV'                           = 'STDsHPV',
  'STDs..Number.of.diagnosis'          = 'STDsNoDiagnosis',
  'STDs..Time.since.first.diagnosis'   = 'STDsTimeSinceFirstDiagnosis',
  'STDs..Time.since.last.diagnosis'    = 'STDsTimeSinceLastDiagnosis',
  'Dx.Cancer'                          = 'DxCancer',
  'Dx.CIN'                             = 'DxCIN',
  'Dx.HPV'                             = 'DxHPV')
for (old in names(rename_map))
  colnames(dataset)[colnames(dataset) == old] = rename_map[[old]]

# Replacing missing data with the (rounded) mean of each column
impute_cols = c('NoSexualPartners', 'FirstSexualIntercourse', 'NoPregnancies',
                'Smokes', 'SmokesYears', 'SmokesPacksYear', 'HormonalContraceptives',
                'IUD', 'IUDYears', 'STDs', 'STDsNo', 'STDsCondylomatosis',
                'STDsCervicalCondylomatosis', 'STDsVaginalCondylomatosis',
                'STDsVulvoPerinealCondylomatosis', 'STDsSyphilis',
                'STDsPelvicInflammatoryDisease', 'STDsGenitalHerpes',
                'STDsMolluscumContagiosum', 'STDsAIDS', 'STDsHIV', 'STDsHepatitisB',
                'STDsHPV', 'STDsTimeSinceFirstDiagnosis', 'STDsTimeSinceLastDiagnosis')
for (col in impute_cols)
  dataset[[col]] = ifelse(is.na(dataset[[col]]),
                          round(mean(dataset[[col]], na.rm = TRUE)),
                          dataset[[col]])

# HormonalContraceptivesYears keeps its unrounded mean
dataset$HormonalContraceptivesYears =
  ifelse(is.na(dataset$HormonalContraceptivesYears),
         mean(dataset$HormonalContraceptivesYears, na.rm = TRUE),
         dataset$HormonalContraceptivesYears)


# Remove two unused variables since they only contain a single value (0)
dataset$STDsCervicalCondylomatosis = NULL
dataset$STDsAIDS = NULL

# Splitting the dataset into the Training set and Test set
# install.packages('caTools')
library(caTools)
set.seed(123)
split = sample.split(dataset$Biopsy, SplitRatio = 0.75)
training_set = subset(dataset, split == TRUE)
test_set = subset(dataset, split == FALSE)

# Feature Scaling (centering only)
training_set[, 1:33] = scale(training_set[, 1:33], scale = FALSE)
test_set[, 1:33] = scale(test_set[, 1:33], scale = FALSE)

# Fitting the SVM to the Training set
# install.packages('e1071')
library(e1071)
options(warn = -1)  # suppress warning messages
classifier = svm(formula = Biopsy ~ .,
                 data = training_set,
                 type = 'C-classification',
                 kernel = 'linear')  # linear is better than the Gaussian and polynomial kernels here

# Predicting the Test set results
y_pred = predict(classifier, newdata = test_set[-34])

# Making the Confusion Matrix
cm = table(test_set[, 34], y_pred)

# Calculating the accuracy and Type II error of the model
accuracy = sum(diag(cm)) / sum(cm)
type2 = cm[2, 1] / (cm[2, 1] + cm[2, 2]) * 100
print("Confusion Matrix:")
print(cm)
print(sprintf("Accuracy Rate is: %f", accuracy))
print(sprintf("Type2 Error is: %g%%", type2))

# Export test_set and y_pred to csv files for CAP curve generation in Excel
write.csv(test_set, "test_set.csv")
write.csv(y_pred, "y_pred.csv")


Attachment 2 (R Programming)

We only show the part of the R code that applies Linear Discriminant Analysis at the end of the pre-processing phase. The rest of the code is similar to Attachment 1.

# Applying LDA
library(MASS)
lda = lda(formula = Biopsy ~ ., data = training_set)
training_set = as.data.frame(predict(lda, training_set))
training_set = training_set[c(4, 1)]
test_set = as.data.frame(predict(lda, test_set))
test_set = test_set[c(4, 1)]

# Fitting the SVM to the Training set
# install.packages('e1071')
library(e1071)
options(warn = -1)  # suppress warning messages
classifier = svm(formula = class ~ .,
                 data = training_set,
                 type = 'C-classification',
                 kernel = 'linear')

# Predicting the Test set results
y_pred = predict(classifier, newdata = test_set[-2])

# Making the Confusion Matrix
cm = table(test_set[, 2], y_pred)


Attachment 3

Types of Distances

For any two vectors $\vec{U} = (u_1, u_2, \dots, u_n)$ and $\vec{V} = (v_1, v_2, \dots, v_n) \in \mathbb{R}^n$, each type of distance follows the definition below (an R illustration follows the list):

• Euclidean distance: $d(\vec{U}, \vec{V}) = \sqrt{(u_1-v_1)^2 + (u_2-v_2)^2 + \dots + (u_n-v_n)^2}$

• Manhattan distance: $d(\vec{U}, \vec{V}) = |u_1-v_1| + |u_2-v_2| + \dots + |u_n-v_n|$

• Maximum distance: $d(\vec{U}, \vec{V}) = \max_{i=1}^{n} |u_i - v_i|$

• Minkowski distance, a general form that includes all three distances above: $d(\vec{U}, \vec{V}) = \left( \sum_{i=1}^{n} |u_i - v_i|^{p} \right)^{1/p}$

For p = 1 or 2, the formula corresponds to the Manhattan distance and the Euclidean distance, respectively. In the limiting case as p approaches infinity, it becomes the Chebyshev distance, a.k.a. the Tchebychev or maximum distance:

$\lim_{p \to \infty} \left( \sum_{i=1}^{n} |u_i - v_i|^{p} \right)^{1/p} = \max_{i=1}^{n} |u_i - v_i|$

• Canberra distance: $d(\vec{U}, \vec{V}) = \sum_{i=1}^{n} \frac{|u_i - v_i|}{|u_i| + |v_i|}$
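For reference, all of these distances are available in base R through stats::dist; below is a small sketch with two assumed example vectors:

# Each method name matches one of the definitions above
U = c(1, 2, 3); V = c(4, 6, 8)
dist(rbind(U, V), method = 'euclidean')
dist(rbind(U, V), method = 'manhattan')
dist(rbind(U, V), method = 'maximum')             # Chebyshev / maximum distance
dist(rbind(U, V), method = 'minkowski', p = 3)    # general Minkowski form
dist(rbind(U, V), method = 'canberra')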