Readmission of Diabetes Patients
Project Report by
Rahmawati Nusantari Maria D. Marroquin
Essenam Kakpo Hong Lu
Team 10
INSY 5339 – Th. 7 pm-9:50 pm
May 12, 2016
Table of Contents

Problem Domain
Data Summary
  Encounters
  Features
  Target Variable
  Prediction
Data Cleaning Process
  Data Cleaning Tools
  Missing Values
  Irrelevant Data
  Data Imbalance
  Past Cleaning Efforts
  Discretization
  Various SMOTE Percentages
Algorithms Utilized
  Classifiers
  Comparison of Bayes Classifiers
Factor Experimental Design
  Number of Attributes
  Noise
Experiments
  Combination Sets with Each Classifier
  Summary of Results
Analysis and Conclusion
  ROC Curves
  Additional Analysis
  Overall Observations
References
Problem Domain

Dataset Summary
The dataset was obtained from the UCI Machine Learning Repository, where it is listed under the name Diabetes 130-US Hospitals. According to the dataset description, the data was prepared to analyze factors related to readmission, as well as other outcomes pertaining to patients with diabetes. The dataset represents 10 years (1999-2008) of clinical care at 130 U.S. hospitals and integrated delivery networks. It contains 101,766 unique inpatient encounters (instances) with 50 attributes, for a total size of 5,088,300 cells.

Encounters (Records)
As stated on the UCI dataset information page, the dataset contains encounters that satisfied the following criteria:
• It is an inpatient encounter (a hospital admission).
• It is a diabetic encounter, that is, one during which any kind of diabetes was entered into the system as a diagnosis.
• The length of stay was at least 1 day and at most 14 days.
• Laboratory tests were performed during the encounter.
• Medications were administered during the encounter.
Features (Attributes)
The attributes represent patient and hospital outcomes. The dataset mostly contains nominal attributes, such as medical specialty and gender, but also includes a few ordinal attributes, such as age and weight, and continuous attributes, such as time in hospital (days) and number of medications. The following table lists each attribute, its description, and the percentage of missing values for that attribute.
Attributes and Target Variable Table

Feature name | Type | Description and values | % missing
Encounter ID | Numeric | Unique identifier of an encounter | 0%
Patient number | Numeric | Unique identifier of a patient | 0%
Race | Nominal | Values: Caucasian, Asian, African American, Hispanic, and other | 2%
Gender | Nominal | Values: male, female, and unknown/invalid | 0%
Age | Nominal | Grouped in 10-year intervals: [0, 10), [10, 20), ..., [90, 100) | 0%
Weight | Numeric | Weight in pounds | 97%
Admission type | Nominal | Integer identifier corresponding to 9 distinct values, for example, emergency, urgent, elective, newborn, and not available | 0%
Discharge disposition | Nominal | Integer identifier corresponding to 29 distinct values, for example, discharged to home, expired, and not available | 0%
Admission source | Nominal | Integer identifier corresponding to 21 distinct values, for example, physician referral, emergency room, and transfer from a hospital | 0%
Time in hospital | Numeric | Integer number of days between admission and discharge | 0%
Payer code | Nominal | Integer identifier corresponding to 23 distinct values, for example, Blue Cross/Blue Shield, Medicare, and self-pay | 52%
Medical specialty | Nominal | Integer identifier of the admitting physician's specialty, corresponding to 84 distinct values, for example, cardiology, internal medicine, family/general practice, and surgeon | 53%
Number of lab procedures | Numeric | Number of lab tests performed during the encounter | 0%
Number of procedures | Numeric | Number of procedures (other than lab tests) performed during the encounter | 0%
Number of medications | Numeric | Number of distinct generic names administered during the encounter | 0%
Number of outpatient visits | Numeric | Number of outpatient visits of the patient in the year preceding the encounter | 0%
Number of emergency visits | Numeric | Number of emergency visits of the patient in the year preceding the encounter | 0%
Number of inpatient visits | Numeric | Number of inpatient visits of the patient in the year preceding the encounter | 0%
Diagnosis 1 | Nominal | The primary diagnosis (coded as the first three digits of ICD-9); 848 distinct values | 0%
Diagnosis 2 | Nominal | Secondary diagnosis (coded as the first three digits of ICD-9); 923 distinct values | 0%
Diagnosis 3 | Nominal | Additional secondary diagnosis (coded as the first three digits of ICD-9); 954 distinct values | 1%
Number of diagnoses | Numeric | Number of diagnoses entered into the system | 0%
Glucose serum test result | Nominal | Indicates the range of the result or if the test was not taken. Values: ">200", ">300", "normal", and "none" if not measured | 0%
A1c test result | Nominal | Indicates the range of the result or if the test was not taken. Values: ">8" if the result was greater than 8%, ">7" if the result was greater than 7% but less than 8%, "normal" if the result was less than 7%, and "none" if not measured | 0%
Change of medications | Nominal | Indicates if there was a change in diabetic medications (either dosage or generic name). Values: "change" and "no change" | 0%
Diabetes medications | Nominal | Indicates if there was any diabetic medication prescribed. Values: "yes" and "no" | 0%
24 features for medications | Nominal | For the generic names metformin, repaglinide, nateglinide, chlorpropamide, glimepiride, acetohexamide, glipizide, glyburide, tolbutamide, pioglitazone, rosiglitazone, acarbose, miglitol, troglitazone, tolazamide, examide, sitagliptin, insulin, glyburide-metformin, glipizide-metformin, glimepiride-pioglitazone, metformin-rosiglitazone, and metformin-pioglitazone, the feature indicates whether the drug was prescribed or its dosage was changed. Values: "up" if the dosage was increased during the encounter, "down" if the dosage was decreased, "steady" if the dosage did not change, and "no" if the drug was not prescribed | 0%
Readmitted | Nominal | Days to inpatient readmission. Values: "<30" if the patient was readmitted in less than 30 days, ">30" if the patient was readmitted in more than 30 days, and "No" for no record of readmission | 0%
Target Variable
The last attribute in the table above is the class attribute: Readmission. Its distribution is as follows:

• Encounters of patients who were not readmitted (No) to the hospital: 54,864 encounters.
• Encounters of patients who were readmitted more than 30 days after discharge (>30): 35,545 encounters.
• Encounters of patients who were readmitted within 30 days of discharge (<30): 11,357 encounters.
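As a quick arithmetic check, the class counts above can be verified to sum to the dataset's 101,766 encounters, and the imbalance that motivates the use of SMOTE later in the report can be quantified (a minimal sketch in plain Python):

```python
# Class counts for the Readmitted target variable, as listed above
counts = {"No": 54_864, ">30": 35_545, "<30": 11_357}

total = sum(counts.values())
shares = {label: round(100 * n / total, 1) for label, n in counts.items()}

print(total)   # 101766 -- matches the number of encounters in the dataset
print(shares)  # the <30 class is only ~11% of encounters
```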
Prediction
We want to predict whether, and if so when, diabetes patients will be readmitted to the hospital, based on several factors (attributes).
Data Cleaning Process
Data cleaning is commonly defined as the process of detecting and correcting corrupt or inaccurate records in a dataset, table, or database.[1] Data quality is an important component of any data mining effort. For this reason, many data scientists spend 50% to 80% of their time preparing and cleaning their data before it can be mined for insights.[2] There are four broad categories of data quality problems: missing data, abnormal data (outliers), departure from models, and goodness-of-fit.[3] For this project, our team mainly dealt with missing data. We also addressed the imbalance in the class variable using SMOTE.

Data Cleaning Tools
Our team used Microsoft Excel to perform the data cleaning. To understand the variables and the meaning of the data, we consulted the research article that introduced the dataset: "Impact of HbA1c Measurement on Hospital Readmission Rates: Analysis of 70,000 Clinical Database Patient Records" by Beata Strack et al.

Missing Values
The article identified three attributes with the majority of their records missing: weight (97%), payer code (52%), and medical specialty (53%). Weight was not consistently recorded because the data predates the HITECH provisions of the American Recovery and Reinvestment Act of 2009, while payer code was deemed irrelevant by the researchers. As a result, these three attributes were deleted. There were also 23 medication attributes, such as metformin and other generic medications, with zero values in 79% to 99% of their records; a zero value indicates that the medication was not prescribed to the patient. All 23 of these attributes were deleted. Insulin was the only medication attribute retained, since more than half of its records had values and insulin is prevalent in diabetic patient care.
Irrelevant Data
The class attribute indicates whether a patient was readmitted to the hospital within 30 days, after more than 30 days, or not at all. The discharge disposition attribute takes 29 distinct values that indicate whether patients were discharged to home or another hospital, sent to hospice (for terminally ill patients), or passed away.
[1] https://en.wikipedia.org/wiki/Data_cleansing
[2] Steve Lohr, "For Big-Data Scientists, 'Janitor Work' Is Key Hurdle to Insights," The New York Times, August 17, 2014.
[3] Tamraparni Dasu and Theodore Johnson, Exploratory Data Mining and Data Quality, Wiley, 2004.
To include only living patients who were not in hospice, we removed records with discharge disposition codes 11, 13, 14, 19, 20, and 21. These codes correspond to patients who were deceased or sent to hospice. This step removed 2,423 instances.

Data Imbalance
SMOTE (Synthetic Minority Oversampling Technique) is a filter that resamples the data and alters the class distribution. It can be used to adjust the relative frequency between the minority and majority classes. SMOTE does not under-sample the majority classes; instead, it oversamples the minority class by creating synthetic instances using a k-nearest-neighbor approach. The user can specify the oversampling percentage and the number of neighbors to use when creating synthetic instances.[4] Our team applied SMOTE in different combinations and ultimately decided to apply a 200% synthetic minority oversample with 3 nearest neighbors, as shown below.

SMOTE filter in WEKA
[4] Ian H. Witten, Eibe Frank, and Mark A. Hall, Data Mining: Practical Machine Learning Tools and Techniques, 3rd edition, Elsevier, 2011.
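The SMOTE step described above can be sketched in plain Python (a simplified variant for illustration only; the project used Weka's SMOTE filter). With a 200% oversample, the <30 class of 11,357 encounters would gain roughly 22,714 synthetic ones:

```python
import random

def smote(minority, percent=200, k=3, rng=None):
    """Simplified SMOTE sketch: each synthetic sample interpolates between a
    random minority point and one of its k nearest neighbors. `minority` is a
    list of numeric feature vectors; returns only the synthetic points."""
    rng = rng or random.Random(0)
    n_new = len(minority) * percent // 100
    synthetic = []
    for _ in range(n_new):
        x = rng.choice(minority)
        # k nearest neighbors of x by squared Euclidean distance, excluding x
        neighbors = sorted(
            (p for p in minority if p is not x),
            key=lambda p: sum((a - b) ** 2 for a, b in zip(x, p)),
        )[:k]
        nb = rng.choice(neighbors)
        gap = rng.random()  # interpolation factor in [0, 1)
        synthetic.append(tuple(a + gap * (b - a) for a, b in zip(x, nb)))
    return synthetic

minority = [(1.0, 2.0), (1.5, 1.8), (2.0, 2.2), (1.2, 2.4)]
new_points = smote(minority, percent=200, k=3)
print(len(new_points))  # 8: a 200% oversample adds twice the minority size
```

Because each synthetic point lies on a segment between two real minority points, it always stays within the range of the observed minority data.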
The following graphs and matrices compare the data before SMOTE and after applying 200% SMOTE to the minority class (<30).

Class Distribution Graphs Before and After 200% SMOTE
(Panels: original data vs. SMOTE 200%)

Confusion Matrices Before and After 200% SMOTE
(Panels: original data vs. SMOTE 200%, using J48 and using BayesNet)
Past Cleaning Efforts

Discretization
As part of our initial data cleaning efforts, we recoded several nominal attributes that were stored as integer identifiers: diagnosis 1, diagnosis 2, diagnosis 3, admission type, discharge disposition, and admission source. The three diagnosis attributes are coded using ICD-9 (International Statistical Classification of Diseases and Related Health Problems, Ninth Revision). For example, codes 390-459 and 785 are diseases of the circulatory system. After converting all the integer identifiers into nominal values, however, the results did not show significant improvement.

Various SMOTE Percentages
We also applied different SMOTE percentages, mainly to the <30 minority class, but saw no significant improvement. We ultimately decided to apply a 200% increase to the <30 minority class, as mentioned earlier.

Class Distribution Graphs with Different SMOTE Percentages
(Panels: 250% on <30; 350% on <30; 350% on <30 with 50% on >30; 500% on <30)
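The ICD-9 grouping described under Discretization above can be sketched as a small mapping function. Only the circulatory range (390-459 and 785) is taken from the report; the diabetes and respiratory ranges shown here follow the standard ICD-9 chapters and are assumptions for illustration:

```python
def icd9_category(code: str) -> str:
    """Map an ICD-9 code (as stored in the diagnosis attributes) to a broad
    disease group. Circulatory range per the report; the diabetes and
    respiratory ranges are assumed from the standard ICD-9 chapters."""
    if code[:1] in ("V", "E"):   # supplementary/external-cause codes
        return "other"
    num = int(float(code))       # "250.83" -> 250, "428" -> 428
    if 390 <= num <= 459 or num == 785:
        return "circulatory"
    if num == 250:
        return "diabetes"
    if 460 <= num <= 519 or num == 786:
        return "respiratory"
    return "other"

print(icd9_category("428"))     # circulatory
print(icd9_category("250.83"))  # diabetes
```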
Algorithms Utilized
After data cleaning and pre-processing, we selected three classifiers for the experiment design:

Classifiers
• J48. A decision tree learner that repeatedly selects the attribute that best splits the data in order to maximize prediction accuracy.
• Naïve Bayes. A probabilistic classifier that assumes the attributes are independent of one another given the class.
• Bayes Net. A probabilistic graphical model that represents a set of random variables and their conditional dependencies through a directed acyclic graph.
Comparison of Bayes Classifiers
Since we selected two Bayes classifiers, we compared Naïve Bayes and Bayes Net. A Naïve Bayes classifier is a simple model describing a particular class of Bayesian network: one in which all the features are conditionally independent of each other given the class. Because of this assumption, there are certain problems that Naïve Bayes cannot solve well. An advantage of Naïve Bayes is that it requires only a small amount of training data to estimate the parameters necessary for classification.

A Bayesian network models relationships between features in a much more general way and makes no such blanket independence assumption; instead, every dependence in the network has to be modeled explicitly. If these relationships are known, or there is enough data to derive them, it may be appropriate to use a Bayesian network.

The following two examples illustrate the difference between the algorithms. In the first, a fruit may be considered an apple if it is red, round, and about 10 centimeters in diameter. A Naïve Bayes classifier treats each of these features as contributing independently to the probability that the fruit is an apple, regardless of any correlations between the color, roundness, and diameter features.
In the second example, presume that two events could cause grass to be wet: either the sprinkler is on, or it is raining. Presume also that rain has a direct effect on the use of the sprinkler (when it rains, the sprinkler is usually not turned on). This situation can be modeled with a Bayesian network in which all three variables take two possible values, T (true) and F (false), and the two parent attributes (Sprinkler and Rain) are correlated.
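The fruit example can be made concrete with a tiny categorical Naïve Bayes written from scratch (toy data for illustration only; the add-one smoothing used here is an assumption, not something discussed in the report):

```python
from collections import Counter, defaultdict

def train_nb(rows, labels):
    """Tiny categorical Naive Bayes sketch: estimate P(class) and
    P(feature value | class) from counts, treating every feature as
    independent given the class."""
    class_counts = Counter(labels)
    feat_counts = defaultdict(Counter)  # (feature index, class) -> value counts
    for row, y in zip(rows, labels):
        for i, v in enumerate(row):
            feat_counts[(i, y)][v] += 1

    def predict(row):
        def score(y):
            p = class_counts[y] / len(labels)
            for i, v in enumerate(row):
                # add-one smoothing so unseen values don't zero out the product
                p *= (feat_counts[(i, y)][v] + 1) / (class_counts[y] + 2)
            return p
        return max(class_counts, key=score)

    return predict

# Features: (color, shape); toy training data
rows = [("red", "round"), ("red", "round"), ("green", "round"), ("yellow", "long")]
labels = ["apple", "apple", "apple", "banana"]
predict = train_nb(rows, labels)
print(predict(("red", "round")))  # apple
```

Each feature multiplies into the class score independently, which is exactly the assumption a Bayesian network can relax by modeling dependencies such as Rain influencing Sprinkler.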
Factor Experimental Design
We selected two factors for our experiment design: Number of Attributes and Noise. Each combination, as shown in the 2-Factor Experimental Design table below, was run in an experiment with each algorithm using 10 random seeds.

Number of Attributes
Given how many attributes this dataset contained, both originally and after cleaning, we decided to analyze the effect of decreasing the number of attributes. We compared the results of experimental runs on the full, cleaned dataset against runs on a reduced, cleaned dataset. We used a Weka tool called InfoGain, which evaluates the worth of each attribute by measuring its information gain with respect to the class. The tool ranked all attributes, and we selected the top 10. We then compared experiment results for the dataset with all 22 attributes against the same dataset with only the top 10 attributes.

Noise
Noise refers to the modification of original values, such as distortion of a voice during a phone call or fuzziness on a computer screen. We wanted to observe the effect of added noise on classification performance, so noise was selected as our second factor. We added 10% noise to the target variable only, ran the experiments, and compared the results to those from the dataset without noise.

2-Factor Experimental Design Table
             ALL ATTRIBUTES                  SELECTED ATTRIBUTES
NO NOISE     C1: All Attributes & No Noise   C3: Selected Attributes & No Noise
NOISE (10%)  C2: All Attributes & Noise      C4: Selected Attributes & Noise
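The attribute ranking performed by Weka's InfoGain evaluator can be sketched as follows: the gain of an attribute is the entropy of the class minus the weighted entropy of the class within each attribute value (toy data for illustration):

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Shannon entropy of a list of class labels, in bits."""
    n = len(labels)
    return -sum(c / n * log2(c / n) for c in Counter(labels).values())

def info_gain(values, labels):
    """Information gain of a nominal attribute with respect to the class:
    H(class) minus the weighted entropy of the class within each value."""
    n = len(labels)
    remainder = 0.0
    for v in set(values):
        subset = [y for x, y in zip(values, labels) if x == v]
        remainder += len(subset) / n * entropy(subset)
    return entropy(labels) - remainder

# Toy example: a perfectly informative attribute vs. an uninformative one
labels = ["<30", "<30", "No", "No"]
print(round(info_gain(["a", "a", "b", "b"], labels), 3))  # 1.0
print(round(info_gain(["a", "b", "a", "b"], labels), 3))  # 0.0
```

Ranking all attributes by this score and keeping the top 10 reproduces, in miniature, the reduction from 22 to 10 attributes used for C3 and C4.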
Experiments
As previously mentioned, our experiment design was composed of two factors (Selected Attributes and Noise), giving us four different sets of experiments to run:

• C1: All Attributes & No Noise
• C2: All Attributes & Noise
• C3: Selected Attributes & No Noise
• C4: Selected Attributes & Noise
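The 10% class-noise factor described earlier can be sketched as a label-flipping step (a hypothetical sketch; Weka's AddNoise filter was presumably used in practice):

```python
import random

def add_class_noise(labels, fraction=0.10, classes=("<30", ">30", "No"), seed=1):
    """Sketch of the 10% class-noise factor: flip the labels of a random
    `fraction` of the instances to a different class value."""
    rng = random.Random(seed)
    noisy = list(labels)
    n_flip = int(len(labels) * fraction)
    for i in rng.sample(range(len(labels)), n_flip):
        noisy[i] = rng.choice([c for c in classes if c != noisy[i]])
    return noisy

labels = ["No"] * 80 + [">30"] * 15 + ["<30"] * 5
noisy = add_class_noise(labels, fraction=0.10)
changed = sum(a != b for a, b in zip(labels, noisy))
print(changed)  # 10: exactly 10% of the 100 labels were flipped
```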
Combination Sets with Each Classifier

C1
• E1: Performance of J48 for All Attributes, No Noise
• E2: Performance of Naïve Bayes for All Attributes, No Noise
• E3: Performance of Bayes Net for All Attributes, No Noise

C2
• E4: Performance of J48 for All Attributes, 10% Noise
• E5: Performance of Naïve Bayes for All Attributes, 10% Noise
• E6: Performance of Bayes Net for All Attributes, 10% Noise

C3
• E7: Performance of J48 for Selected Attributes, No Noise
• E8: Performance of Naïve Bayes for Selected Attributes, No Noise
• E9: Performance of Bayes Net for Selected Attributes, No Noise

C4
• E10: Performance of J48 for Selected Attributes, 10% Noise
• E11: Performance of Naïve Bayes for Selected Attributes, 10% Noise
• E12: Performance of Bayes Net for Selected Attributes, 10% Noise
Each of the experiments E1 through E12 was run 10 separate times with a different seed each time, ensuring that the algorithm used a slightly different training set on each run. For every experiment, the percentage split was 66% training and 34% testing. Within each of C1-C4, we used the three algorithms as follows:

• Experiments E1, E4, E7, and E10 use the J48 algorithm.
• Experiments E2, E5, E8, and E11 use the Naïve Bayes algorithm.
• Experiments E3, E6, E9, and E12 use the Bayes Net algorithm.
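The seeded 66/34 percentage split and the averaging over 10 runs can be sketched as follows (the evaluator is a hypothetical stand-in; any of the three classifiers would be plugged in at that point):

```python
import random
import statistics

def split_66_34(data, seed):
    """66% training / 34% test percentage split, shuffled with the given
    seed, mirroring the setup used for experiments E1-E12."""
    rng = random.Random(seed)
    shuffled = list(data)
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * 0.66)
    return shuffled[:cut], shuffled[cut:]

def average_over_seeds(evaluate, data, seeds=range(1, 11)):
    """Run one experiment per seed and report the mean accuracy and standard
    deviation, as in the results tables. `evaluate(train, test)` stands in
    for a full classifier run (a hypothetical callable)."""
    accs = [evaluate(*split_66_34(data, s)) for s in seeds]
    return statistics.mean(accs), statistics.stdev(accs)

# Demo with a dummy evaluator: the share of even numbers in the test split
data = list(range(1000))
mean_acc, sd = average_over_seeds(
    lambda train, test: sum(x % 2 == 0 for x in test) / len(test), data
)
print(round(mean_acc, 2))  # close to 0.5, since half the items are even
```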
The following tables are the results of the experiments conducted:
Results Tables
E1 (J48)
Run  Seed  Accuracy
 1    1    57.3161
 2    2    57.6682
 3    3    57.5814
 4    4    57.7960
 5    5    57.4174
 6    6    57.7502
 7    7    57.4680
 8    8    57.5838
 9    9    57.9696
10   10    58.0709
Average    57.6622
Std Dev     0.2397
E2 (Naïve Bayes)
Run  Seed  Accuracy
 1    1    56.3468
 2    2    56.4794
 3    3    56.4673
 4    4    56.7085
 5    5    56.6458
 6    6    57.0051
 7    7    56.7519
 8    8    56.7977
 9    9    57.1039
10   10    57.1208
Average    56.7427
Std Dev     0.2704
E3 (Bayes Net)
Run  Seed  Accuracy
 1    1    64.6274
 2    2    63.8679
 3    3    64.2633
 4    4    64.4418
 5    5    63.9474
 6    6    64.5816
 7    7    64.2416
 8    8    63.8968
 9    9    63.9354
10   10    64.4225
Average    64.2226
Std Dev     0.2931
E4 (J48)
Run  Seed  Accuracy
 1    1    53.5954
 2    2    53.6364
 3    3    53.2554
 4    4    53.7015
 5    5    53.4338
 6    6    53.2988
 7    7    53.1903
 8    8    53.4893
 9    9    53.7497
10   10    53.6219
Average    53.49725
Std Dev     0.196119
E5 (Naïve Bayes)
Run  Seed  Accuracy
 1    1    52.5826
 2    2    52.9274
 3    3    52.6405
 4    4    52.9708
 5    5    53.0769
 6    6    53.0311
 7    7    52.9539
 8    8    52.9395
 9    9    53.2771
10   10    53.3856
Average    52.97854
Std Dev     0.245655
E6 (Bayes Net)
Run  Seed  Accuracy
 1    1    59.4695
 2    2    59.3369
 3    3    59.5611
 4    4    59.8143
 5    5    59.5322
 6    6    59.5274
 7    7    59.4406
 8    8    59.2669
 9    9    59.3851
10   10    59.6889
Average    59.50229
Std Dev     0.162800
E7 (J48)
Run  Seed  Accuracy
 1    1    57.4150
 2    2    57.4849
 3    3    56.8700
 4    4    57.3113
 5    5    56.7760
 6    6    57.1907
 7    7    57.2631
 8    8    57.2004
 9    9    57.2438
10   10    57.1739
Average    57.19291
Std Dev     0.219753
E8 (Naïve Bayes)
Run  Seed  Accuracy
 1    1    55.2882
 2    2    55.4642
 3    3    55.4618
 4    4    55.4690
 5    5    55.2472
 6    6    56.0043
 7    7    55.6137
 8    8    55.4160
 9    9    55.5052
10   10    55.5341
Average    55.50037
Std Dev     0.207621
E9 (Bayes Net)
Run  Seed  Accuracy
 1    1    55.2882
 2    2    55.4642
 3    3    55.4618
 4    4    55.4690
 5    5    55.2472
 6    6    56.0043
 7    7    55.6137
 8    8    55.4160
 9    9    55.5052
10   10    55.5341
Average    55.50037
Std Dev     0.207621
E10 (J48)
Run  Seed  Accuracy
 1    1    52.6887
 2    2    53.5110
 3    3    52.9853
 4    4    53.1769
 5    5    52.7538
 6    6    52.9829
 7    7    52.7213
 8    8    52.7646
 9    9    53.1457
10   10    53.1022
Average    52.9832
Std Dev     0.2475
E11 (Naïve Bayes)
Run  Seed  Accuracy
 1    1    51.4082
 2    2    51.9701
 3    3    51.8519
 4    4    51.9471
 5    5    51.6639
 6    6    51.6831
 7    7    51.6772
 8    8    51.6674
 9    9    51.7206
10   10    51.9471
Average    51.7537
Std Dev     0.1668
E12 (Bayes Net)
Run  Seed  Accuracy
 1    1    51.4082
 2    2    51.9701
 3    3    51.8519
 4    4    51.9471
 5    5    51.6639
 6    6    51.6831
 7    7    51.6772
 8    8    51.6674
 9    9    51.7206
10   10    51.9471
Average    51.7537
Std Dev     0.1668
Summary of Results

Accuracy Averages (%) by Combination and Classifier
Combination   Naïve Bayes   J48        Bayes Net
C1            56.74272      57.66216   64.22257
C2            52.97854      53.49725   59.50229
C3            55.50037      57.19291   63.85893
C4            51.7537       52.9832    59.1443

From the accuracy averages above, we can infer that C1 (All Attributes & No Noise) is the best combination, since it gives the highest accuracy across all algorithms. Next is C3 (Selected Attributes & No Noise), which gives slightly lower results that are still significantly higher than those of C2 (third best) and C4 (fourth best). Among the algorithms, Bayes Net leads by a significant margin over J48 and Naïve Bayes across all four combinations. J48 comes second, performing up to about 1.7 percentage points higher than Naïve Bayes. The Selected Attributes used in C3 and C4 appear to capture most of the relevant information: controlling for noise, accuracy declines by less than 1.5 percentage points when switching from all attributes to the Selected Attributes.
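The claim that attribute reduction costs under 1.5 percentage points can be checked directly from the averaged accuracies:

```python
# Averaged accuracies (%) for C1 (all attributes, no noise) and
# C3 (top-10 attributes, no noise), as plotted in the summary above
c1 = {"J48": 57.66216, "NaiveBayes": 56.74272, "BayesNet": 64.22257}
c3 = {"J48": 57.19291, "NaiveBayes": 55.50037, "BayesNet": 63.85893}

drops = {m: c1[m] - c3[m] for m in c1}
for m, d in drops.items():
    print(f"{m}: -{d:.2f} points")  # each classifier loses well under 1.5 points
```

Naïve Bayes pays the largest price for the reduced attribute set, while Bayes Net loses the least.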
Standard Errors Graph
Looking at the standard errors across experiments, we noticed that C1 stands out, with relatively high values and the highest mean standard error among the four combinations. The mean standard errors for C2, C3, and C4 are roughly the same. The line graph of the standard errors for each model shows a different trend for each algorithm: the area under the J48 curve from C1 to C4 is roughly similar to that of Naïve Bayes, and both are relatively large, while BayesNet has a slightly lower standard error in general, with a smaller area under its curve. We can infer that Bayes Net has the smallest standard error among the three algorithms.
Analysis and Conclusion
To evaluate the performance of our algorithms, we use ROC curves.

ROC Curves
A receiver operating characteristic (ROC) curve is a graphical plot that shows the performance of a binary classifier as its discrimination threshold is varied. The curve is created by plotting the true positive rate (TPR) against the false positive rate (FPR) at various threshold settings. The true positive rate is also known as sensitivity, or recall in machine learning; the false positive rate is also known as the fall-out and can be calculated as 1 - specificity. To determine which algorithm performs better, we look at the shape of each curve: the closer the curve follows the Y-axis and then the top border of the plot, the larger the area under it, and the more accurate the test. To plot the curves, we used Weka's Knowledge Flow tool and loaded the following workflow:
Knowledge Flow
The above Knowledge Flow loads the specified file, assigns which attribute is the class, chooses which class value to plot the curve for, and lets us select a percentage split. The three algorithms are then run on the dataset with the selected parameters, their performance is recorded, and the results are used to plot the ROC curves.
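How a ROC curve and its area are computed from classifier scores can be sketched as follows (toy scores for illustration; Weka does this internally):

```python
def roc_points(scores, labels, positive="No"):
    """Trace a ROC curve: sweep the decision threshold down over the sorted
    scores and collect (FPR, TPR) points. `scores` are the classifier's
    probabilities for the positive class."""
    pairs = sorted(zip(scores, labels), reverse=True)
    P = sum(1 for y in labels if y == positive)
    N = len(labels) - P
    tp = fp = 0
    points = [(0.0, 0.0)]
    for _, y in pairs:
        if y == positive:
            tp += 1
        else:
            fp += 1
        points.append((fp / N, tp / P))
    return points

def auc(points):
    """Area under the ROC curve by the trapezoidal rule."""
    return sum(
        (x2 - x1) * (y1 + y2) / 2
        for (x1, y1), (x2, y2) in zip(points, points[1:])
    )

# Toy scores: a classifier that ranks the positive class mostly higher
scores = [0.9, 0.8, 0.7, 0.4, 0.3, 0.1]
labels = ["No", "No", ">30", "No", ">30", ">30"]
print(round(auc(roc_points(scores, labels)), 3))  # 0.889
```

A perfect ranking yields an area of 1.0, which is why a curve hugging the Y-axis and top border indicates a more accurate classifier.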
The following are the ROC curves for each experiment set when the class value is NO.

ROC Curve Graphs When Readmission is NO

C1
C2
C3
C4
From the above graphs, we observe that the area under the curve is greater for Naïve Bayes in each of the four experiment sets, C1 through C4. According to these ROC curves, therefore, Naïve Bayes is more accurate than the other algorithms.
Additional Analysis
Let us take a look at the confusion matrices obtained from C1 and C3, which gave us the most relevant models.

Confusion Matrix Tables for C1 and C3
From these confusion matrices, we can conclude that Bayes Net gives a higher percentage of true positives across the class values <30, >30, and NO. Bayes Net therefore appears to be the best predictor from the point of view of the confusion matrices.

Overall Observations
Considering the average accuracy, average standard deviation, ROC curves, attributes, and classifier evaluation, we recommend the following for the Readmission of Diabetes Patients dataset:

• Class balancing: SMOTE increased overall model accuracy (see the SMOTE Comparison Matrices below).
• Classifier: Bayes Net gives the highest accuracy. The Naïve Bayes classifier assumes independence between all attributes given the class, which is seldom true; that is why it is called "naïve." In contrast, a Bayesian network can model the problem in more detail using several layers of dependencies, tracking the cause-effect relationships among the attributes and the class while calculating and drawing the probabilistic graph.
• Attributes factor: using all attributes instead of the top 10 gives the highest accuracy.
SMOTE Comparison Matrices
• Original Data using J48
=== Confusion Matrix ===
a b c <-- classified as
15914 2664 117 | a = NO
8222 3728 141 | b = >30
2396 1371 47 | c = <30
• After SMOTE 200% using J48

=== Confusion Matrix ===
a b c <-- classified as
13827 2683 1231 | a = NO
7596 3282 1331 | b = >30
3427 1433 6660 | c = <30
• Original Data using Naïve Bayes
=== Confusion Matrix ===
a b c <-- classified as
16009 2138 548 | a = NO
8230 3168 693 | b = >30
2430 927 457 | c = <30
• After SMOTE 200% using Naïve Bayes

=== Confusion Matrix ===
a b c <-- classified as
15294 1439 1008 | a = NO
8445 2092 1672 | b = >30
4721 818 5981 | c = <30
• Original Data using BayesNet
=== Confusion Matrix ===
a b c <-- classified as
13302 4867 526 | a = NO
5138 6440 513 | b = >30
1662 1800 352 | c = <30
• After SMOTE 200% using BayesNet
=== Confusion Matrix ===
a b c <-- classified as
12746 4934 61 | a = NO
5802 6312 95 | b = >30
1714 2063 7743 | c = <30
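The effect of SMOTE on overall accuracy can be recomputed directly from the J48 matrices above (accuracy is the diagonal sum over the total):

```python
def accuracy(matrix):
    """Overall accuracy from a confusion matrix whose rows are actual
    classes and columns are predicted classes."""
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    total = sum(sum(row) for row in matrix)
    return correct / total

# J48 matrices transcribed from above (class order: NO, >30, <30)
original = [[15914, 2664, 117], [8222, 3728, 141], [2396, 1371, 47]]
smote200 = [[13827, 2683, 1231], [7596, 3282, 1331], [3427, 1433, 6660]]

print(round(100 * accuracy(original), 1))  # 56.9
print(round(100 * accuracy(smote200), 1))  # 57.3 -- SMOTE slightly raises it
```

Beyond the small gain in overall accuracy, note the large jump in correctly classified <30 encounters (from 47 to 6,660), which is the point of oversampling the minority class.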
References

• Beata Strack et al., "Impact of HbA1c Measurement on Hospital Readmission Rates: Analysis of 70,000 Clinical Database Patient Records," BioMed Research International, 2014. http://www.hindawi.com/journals/bmri/2014/781670/tab1/
• Ian H. Witten, Eibe Frank, and Mark A. Hall, Data Mining: Practical Machine Learning Tools and Techniques, 3rd edition, Elsevier, 2011.
• UCI Machine Learning Repository, Diabetes 130-US Hospitals for Years 1999-2008. https://archive.ics.uci.edu/ml/datasets/Diabetes+130-US+hospitals+for+years+1999-2008
• "Naive Bayes for Dummies: A Simple Explanation." http://blog.aylien.com/post/120703930533/naive-bayes-for-dummies-a-simple-explanation
• Steve Lohr, "For Big-Data Scientists, 'Janitor Work' Is Key Hurdle to Insights," The New York Times, August 17, 2014.
• Tamraparni Dasu and Theodore Johnson, Exploratory Data Mining and Data Quality, Wiley, 2004.
• Wikipedia: https://en.wikipedia.org/wiki/Data_cleansing ; https://en.wikipedia.org/wiki/Receiver_operating_characteristic ; https://en.wikipedia.org/wiki/Bayesian_network