
Kaggle Digits Analysis

Zachary Combs, Philip Remmele, M.S. Data Science Candidates

South Dakota State University

July 2, 2015


Introduction

In the following presentation we discuss our analysis of the Kaggle Digits data.

The Digits data set comprises a training set of 42,000 observations on 784 variables (not including the response) and a test set of 28,000 observations.

The variables contain pixel intensity values of handwritten digits, with class labels ranging from 0 to 9.

For more information on the Kaggle Digits data, please visit: https://www.kaggle.com/c/digit-recognizer.


Objective

Develop a classification model that accurately predicts the digit labels of the test set, where class labels are unknown.


Methods

Employed repeated 10-fold cross-validation to obtain stable estimates of classification accuracy.

Iteratively tuned model parameters (e.g., number of components, decay factor) to maximize accuracy.

Performed model comparison.

Selected the optimal model based on the accuracy measure.


!"#$%#&'!()

*+,-.'./"'01.1'-2.%'3#1-2451,-01.-%2'*".6

7%#'819/':%0",'3;+"

<6"'=>?@A7%,0'(5'.%'6","9.'./"':%0",B6'3C2-2D'!1#1&"."#6

*","9.'E"6.'&%0",'E16"0'%2'(5'F"6C,.6

G".'1'6"9%201#;'"6.-&1."'%$'199C#19;'E;'+#"0-9.-2D'H1,-01.-%2'6".

Data Exploration: Mean

[Figure: density plot of train data mean pixel values (x: Mean, 0-150; y: Density)]

Table 1: Train Data Summary Statistics

Mean Median

33.40891 7.2315
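
A minimal sketch of how Table 1 could be produced, assuming `train` is the Kaggle train.csv with the label in column 1 and the 784 pixel columns after it, and assuming the figure summarizes the column-wise (per-pixel) means:

pix_means <- colMeans(train[, -1])  # mean intensity of each of the 784 pixels
mean(pix_means)                     # ~33.41
median(pix_means)                   # ~7.23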


Data Exploration: Percent Unique

[Figure: density plot of the percent of unique pixel values in the train data (x: Percent Unique, 0-80; y: Density)]

Table 2: Train Data Summary Statistics

Max Percentage Unique

60.95238


Data Exploration: Max

[Figure: density plot of max pixel values in the training data (x: Max, 0-300; y: Density)]

Table 3: Train Data Summary Statistics

Maximum Pixel Values

255


Image of Kaggle Handwritten Digit Labels

[Figure: sixteen 28 x 28 training images with class labels 1, 0, 1, 4, 0, 0, 7, 3, 5, 3, 8, 9, 1, 3, 3, 1]

PCA With Different Transformations

[Figure: percent of total variance explained (0.25-1.00) vs. number of components (0-200), by transform type: Dr. Saunder's Transform, Log Transformation, No Transform, Square Root]

Kaggle Digits Data Variance Explained via PCA

[Figure: cumulative variance explained (top) and proportion of variance explained (bottom) vs. number of components (0-800)]
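
A minimal sketch of how these curves can be reproduced, again assuming `train` holds the label in column 1; `prcomp` is base R:

pca  <- prcomp(train[, -1], center = TRUE)  # principal components of the pixels
pve  <- pca$sdev^2 / sum(pca$sdev^2)        # proportion of variance explained
cpve <- cumsum(pve)                         # cumulative proportion
which(cpve >= 0.95)[1]                      # first component count reaching ~95%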


Two-dimensional Visualization of PCA

[Figure: pairwise scatter plots of the first three principal component scores: PC1 vs. PC2, PC1 vs. PC3, PC2 vs. PC3]

Shiny Applications: PCA Exploration

Shiny PCA 1

Shiny PCA 2


Data Partitioning

We created a 70/30 split of the data, stratified on the distribution of class labels, to form our training and validation sets.

training_index <- createDataPartition(y = training[, 1], p = .7, list = FALSE)
training <- training[training_index, ]
validation <- training[-training_index, ]

We kept 100 covariates, as they explain approximately 95% of the variation in the data and ease presentation.

dim(training)

## [1] 29404 101

dim(validation)

## [1] 8821 101


Class Proportions

[Figure: class-label proportion bar charts (labels 0-9, each roughly 9-11%) for the full train data, the training partition, and the validation partition]

Class Proportions Continued

Table 4: Class Proportions

0 1 2 3 4 5 6 7 8 9

Orig. 0.10 0.11 0.10 0.10 0.10 0.09 0.10 0.10 0.10 0.10

Train 0.10 0.11 0.10 0.10 0.10 0.09 0.10 0.10 0.10 0.10

Valid 0.10 0.11 0.10 0.11 0.10 0.09 0.10 0.11 0.10 0.10


Linear Discriminant Analysis

Discriminant Function

\delta_k(x) = x^{T}\Sigma^{-1}\mu_k - \frac{1}{2}\mu_k^{T}\Sigma^{-1}\mu_k + \log\pi_k

Estimating Class Probabilities

\widehat{\Pr}(Y = k \mid X = x) = \frac{\pi_k e^{\hat{\delta}_k(x)}}{\sum_{l=1}^{K}\pi_l e^{\hat{\delta}_l(x)}}

Assigning x to the class with the largest discriminant score \delta_k(x) yields the highest estimated probability for that classification. [James, 2013]
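
A minimal sketch for readers not using caret, with MASS::lda (the engine behind caret's method = "lda"); `train_scores`, `valid_scores`, and `train_labels` are hypothetical PCA score matrices and a factor of class labels:

library(MASS)

lda_fit <- lda(x = train_scores, grouping = train_labels)
post <- predict(lda_fit, valid_scores)$posterior  # estimated Pr(Y = k | X = x)
head(max.col(post))  # column index of the largest posterior per observation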


Model Fitting: LDA

ind <- seq(10, 100, 10)
lda_Ctrl <- trainControl(method = "repeatedcv", repeats = 3,
                         classProbs = TRUE, summaryFunction = defaultSummary)

accuracy_measure_lda <- NULL
ptm <- proc.time()
for (i in 1:length(ind)) {
  lda_Fit <- train(label ~ ., data = training[, 1:(ind[i] + 1)],
                   method = "lda", metric = "Accuracy",
                   maximize = TRUE, trControl = lda_Ctrl)
  accuracy_measure_lda[i] <- confusionMatrix(validation$label,
    predict(lda_Fit, validation[, 2:(ind[i] + 1)]))$overall[1]
}
proc.time() - ptm

##    user  system elapsed
##   22.83    2.44  129.86


LDA Optimal Model: Number of Components vs. Model Accuracy

[Figure: LDA classification accuracy (0.78-0.88) vs. number of components (25-100); accuracy levels off at 0.876]

LDA Optimal Model Summary Statistics

Table 5: Confusion Matrix (Columns: Predicted, Rows: Actual)

zero one two three four five six seven eight nine

zero 827 1 2 4 2 16 7 2 4 5

one 0 916 2 4 0 7 3 2 16 1

two 9 31 726 17 21 8 19 11 42 7

three 3 11 23 803 6 41 7 26 26 25

four 0 9 2 0 770 2 5 1 8 56

five 10 16 2 39 5 653 18 9 29 15

six 11 9 2 3 13 23 804 0 9 0

seven 2 26 9 4 16 4 0 791 3 76

eight 4 46 6 28 13 32 7 3 686 17

nine 8 5 1 16 28 1 1 29 5 748

Table 6: Overall Accuracy

Accuracy 0.8756377

AccuracyLower 0.8685703

AccuracyUpper 0.8824559


LDA Optimal Model Confusion Matrix Image

[Figure: heatmap of the LDA confusion matrix (Actual vs. Predicted), showing the counts from Table 5 and each cell's percentage of the validation set]

LDA Optimal Model Bar Plot

[Figure: LDA Optimal Model Predicted vs. Actual Class Labels — bar plot of label counts]

LDA Optimal Model Predictions for Test Set

[Figure: twenty-five 28 x 28 test images with LDA-predicted labels 2, 0, 9, 4, 3, 7, 0, 3, 0, 3, 5, 7, 4, 0, 4, 0, 2, 1, 9, 0, 9, 1, 8, 5, 7]

LDA Summary Statistics on Manually Labeled Test Set

Table 7: Confusion Matrix (Columns: Predicted, Rows: Actual)

zero one two three four five six seven eight nine

zero 92 1 1 0 1 3 1 0 3 0

one 0 111 0 0 0 1 0 0 3 0

two 1 6 62 2 3 1 1 3 4 0

three 1 1 4 100 0 4 1 5 5 1

four 0 0 0 0 100 1 0 1 0 6

five 0 2 0 3 1 83 0 0 4 2

six 2 0 1 0 0 1 92 0 4 0

seven 0 1 1 0 1 0 0 91 1 6

eight 0 8 1 2 1 5 0 0 65 4

nine 1 0 0 1 4 0 0 1 1 80

Table 8: Overall Accuracy

Accuracy 0.8760000

AccuracyLower 0.8539602

AccuracyUpper 0.8957969


Quadratic Discriminant Analysis

Discriminant Function

\delta_k(x) = -\frac{1}{2}(x - \mu_k)^{T}\Sigma_k^{-1}(x - \mu_k) - \frac{1}{2}\log\lvert\Sigma_k\rvert + \log\pi_k

Estimating Class Probabilities

\widehat{\Pr}(Y = k \mid X = x) = \frac{\pi_k f_k(x)}{\sum_{l=1}^{K}\pi_l f_l(x)}

When the f_k(x) are Gaussian densities with a different covariance matrix \Sigma_k for each class, we obtain Quadratic Discriminant Analysis. [James, 2013]


Model Fitting: QDA

qda_Ctrl <- trainControl(method = "repeatedcv", repeats = 3,
                         classProbs = TRUE, summaryFunction = defaultSummary)

accuracy_measure_qda <- NULL
ptm <- proc.time()
for (i in 1:length(ind)) {
  qda_Fit <- train(label ~ ., data = training[, 1:(ind[i] + 1)],
                   method = "qda", metric = "Accuracy",
                   maximize = TRUE, trControl = qda_Ctrl)
  accuracy_measure_qda[i] <- confusionMatrix(validation$label,
    predict(qda_Fit, validation[, 2:(ind[i] + 1)]))$overall[1]
}
proc.time() - ptm

##    user  system elapsed
##   20.89    2.16   66.20


QDA Optimal Model: Number of Components vs. Model Accuracy

[Figure: QDA classification accuracy (0.875-0.967) vs. number of components (25-100); peak accuracy 0.967]

QDA Optimal Model Summary Statistics

Table 9: Confusion Matrix (Columns: Predicted, Rows: Actual)

zero one two three four five six seven eight nine

zero 862 0 2 1 0 1 0 0 4 0

one 0 917 10 2 2 0 1 2 17 0

two 1 0 871 0 1 0 0 3 15 0

three 0 0 12 929 0 9 0 4 17 0

four 0 1 1 0 838 0 0 0 6 7

five 2 0 1 13 0 773 0 0 6 1

six 2 0 0 1 2 14 850 0 5 0

seven 3 4 15 3 3 3 0 874 11 15

eight 0 1 9 7 2 4 0 0 816 3

nine 1 0 5 12 5 1 0 9 9 800

Table 10: Overall Accuracy

Accuracy 0.9670105

AccuracyLower 0.9630690

AccuracyUpper 0.9706396


QDA Optimal Model Confusion Matrix Image

[Figure: heatmap of the QDA confusion matrix (Actual vs. Predicted), showing the counts from Table 9 and each cell's percentage of the validation set]

QDA Optimal Model Bar Plot

[Figure: QDA Optimal Model Predicted vs. Actual Class Labels — bar plot of label counts]

QDA Optimal Model Predictions for Test Set

[Figure: twenty-five 28 x 28 test images with QDA-predicted labels 2, 0, 9, 9, 3, 7, 0, 3, 0, 3, 5, 7, 4, 0, 4, 3, 3, 1, 9, 0, 9, 1, 8, 5, 7]

QDA Summary Statistics on Manually Labeled Test Set

Table 11: Confusion Matrix (Columns: Predicted, Rows: Actual)

zero one two three four five six seven eight nine

zero 99 0 0 0 0 1 0 0 1 1

one 0 111 1 0 0 0 0 0 3 0

two 0 0 79 1 1 0 1 1 0 0

three 0 0 1 117 0 0 0 0 4 0

four 0 0 0 0 107 0 0 1 0 0

five 0 0 0 1 0 93 0 0 1 0

six 0 0 0 0 0 1 98 0 1 0

seven 1 0 1 0 0 0 0 98 1 0

eight 0 0 0 0 1 1 0 0 84 0

nine 0 0 0 1 0 0 0 0 1 86

Table 12: Overall Accuracy

Accuracy 0.9720000

AccuracyLower 0.9597851

AccuracyUpper 0.9813153


K-Nearest Neighbor

KNN Algorithm

1. Each predictor in the training set represents a dimension in some space.

2. The values that an observation has for the predictors are that observation's coordinates in this space.

3. The similarity between points is based on a distance metric (e.g., Euclidean distance).

4. The class of an observation is predicted by taking the k closest data points to that observation and assigning it to the class most common among those neighbors (see the sketch below).
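
A minimal sketch of the algorithm with class::knn (an implementation of the same nearest-neighbor vote), using the same hypothetical score matrices and labels as before; `valid_labels` is the hypothetical factor of validation classes, and k = 5 matches the largest value in the tuning grid used later:

library(class)

knn_pred <- knn(train = train_scores, test = valid_scores,
                cl = train_labels, k = 5)  # vote among the 5 nearest neighbors
mean(knn_pred == valid_labels)             # plain accuracy on the validation set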


KNN Model Fitting and Parameter Tuning

[Figure: KNN accuracy (0.80-1.00) vs. number of neighbors (1-5), one curve per component count (10, 20, 30, 40)]

KNN: Number of Components vs. Accuracy

[Figure: KNN classification accuracy (0.92-0.97) vs. number of components (10-40); peak accuracy 0.972]

KNN: Optimal Model Fitting

knn_Ctrl <- trainControl(method = "repeatedcv", repeats = 3,
                         classProbs = TRUE, summaryFunction = defaultSummary)

knn_grid <- expand.grid(k = c(1, 2, 3, 4, 5))

knn_Fit_opt <- train(label ~ ., data = training[, 1:(knn_opt + 1)],
                     method = "knn", metric = "Accuracy", maximize = TRUE,
                     tuneGrid = knn_grid, trControl = knn_Ctrl)

accuracy_measure_knn_opt <- confusionMatrix(validation$label,
  predict(knn_Fit_opt, validation[, 2:(knn_opt + 1)]))


KNN Optimal Model Summary Statistics

Table 13: Confusion Matrix (Columns: Predicted, Rows: Actual)

zero one two three four five six seven eight nine

zero 868 0 0 0 0 0 2 0 0 0

one 0 945 1 0 0 0 0 2 2 1

two 1 0 879 0 0 0 1 8 2 0

three 0 0 6 949 0 7 0 4 4 1

four 0 3 0 0 835 0 1 1 0 13

five 2 1 0 4 0 781 7 0 0 1

six 1 0 0 0 1 1 871 0 0 0

seven 0 9 5 1 1 0 0 909 0 6

eight 0 3 1 2 4 6 2 1 822 1

nine 0 0 2 7 4 1 1 4 1 822

Table 14: Overall Accuracy

Accuracy 0.9841288

AccuracyLower 0.9812982

AccuracyUpper 0.9866327


KNN Optimal Model Confusion Matrix Image

[Figure: heatmap of the KNN confusion matrix (Actual vs. Predicted), showing the counts from Table 13 and each cell's percentage of the validation set]

KNN Optimal Bar Plot

[Figure: KNN Optimal Model Predicted vs. Actual Class Labels — bar plot of label counts]

KNN Optimal Model Predictions for Test Set

[Figure: twenty-five 28 x 28 test images with KNN-predicted labels 2, 0, 9, 0, 3, 7, 0, 3, 0, 3, 5, 7, 4, 0, 4, 3, 3, 1, 9, 0, 9, 1, 1, 5, 7]

KNN Summary Statistics on Manually Labeled Test Set

Table 15: Confusion Matrix (Columns: Predicted, Rows: Actual)

zero one two three four five six seven eight nine

zero 101 0 0 0 0 0 0 0 0 1

one 0 115 0 0 0 0 0 0 0 0

two 0 0 81 0 1 0 0 1 0 0

three 0 0 2 116 0 1 0 1 2 0

four 0 0 0 0 105 0 0 0 0 3

five 0 0 0 0 0 95 0 0 0 0

six 0 1 0 0 0 2 97 0 0 0

seven 0 1 0 0 1 0 0 99 0 0

eight 0 1 0 0 0 1 0 0 82 2

nine 0 0 0 0 0 1 0 0 1 86

Table 16: Overall Accuracy

Accuracy 0.9770000

AccuracyLower 0.9656877

AccuracyUpper 0.9853654


Random Forest

"A random forest is a classifier consisting of a collection of tree-structured classifiers {h(x, Θ_k), k = 1, ...} where the {Θ_k} are independent identically distributed random vectors and each tree casts a unit vote for the most popular class at input x." [Breiman, 2001]
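
A minimal sketch with the randomForest package (the implementation behind caret's rfFuncs used below), on the same hypothetical score matrices and factor labels:

library(randomForest)

rf_fit <- randomForest(x = train_scores, y = train_labels,
                       ntree = 500)          # each tree casts one vote per observation
rf_pred <- predict(rf_fit, valid_scores)     # the most popular class wins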


RF Model Fitting: Recursive Feature Selection

subsets <- c(1:40, seq(45, 100, 5))  # variable subset sizes for recursive feature selection

ptm <- proc.time()  # start timer for code execution

ctrl <- rfeControl(functions = rfFuncs, method = "repeatedcv",
                   number = 3, verbose = FALSE,
                   returnResamp = "all", allowParallel = FALSE)

rfProfile <- rfe(x = training[, -1],
                 y = as.factor(as.character(training$label)),
                 sizes = subsets, rfeControl = ctrl)

rf_opt <- rfProfile$optVariables

proc.time() - ptm

##     user  system elapsed
##  7426.48   64.87 7491.48


Random Forest: Accuracy vs. Number of Variables

[Figure: random forest recursive feature selection — repeated cross-validation accuracy (0.4-1.0) vs. number of variables (0-100)]

Random Forest Optimal Model Summary Statistics

Table 17: Confusion Matrix (Columns: Predicted, Rows: Actual)

eight five four nine one seven six three two zero

eight 842 0 0 0 0 0 0 0 0 0

five 0 796 0 0 0 0 0 0 0 0

four 0 0 853 0 0 0 0 0 0 0

nine 0 0 0 842 0 0 0 0 0 0

one 0 0 0 0 951 0 0 0 0 0

seven 0 0 0 0 0 931 0 0 0 0

six 0 0 0 0 0 0 874 0 0 0

three 0 0 0 0 0 0 0 971 0 0

two 0 0 0 0 0 0 0 0 891 0

zero 0 0 0 0 0 0 0 0 0 870

Table 18: Overall Accuracy

Accuracy 1.0000000

AccuracyLower 0.9995819

AccuracyUpper 1.0000000


Random Forest Optimal: Confusion Matrix Image

[Figure: heatmap of the random forest confusion matrix (Actual vs. Predicted), showing the counts from Table 17 and each cell's percentage of the set]

Random Forest Bar Plot

[Figure: Random Forest Actual vs. Predicted Class Labels — bar plot of label counts]

RF Summary Statistics on Manually Labeled Test Set

Table 19: Confusion Matrix (Columns: Predicted, Rows: Actual)

eight five four nine one seven six three two zero

eight 82 1 0 1 0 1 1 2 2 0

five 1 93 0 1 1 0 1 2 0 0

four 1 0 104 0 0 0 0 0 1 0

nine 0 0 1 84 0 0 0 1 0 0

one 2 0 0 0 114 0 0 0 0 0

seven 0 0 2 1 0 100 0 2 0 1

six 0 0 1 0 0 0 97 0 1 0

three 0 1 0 1 0 0 0 114 0 0

two 0 0 0 0 0 0 1 1 77 0

zero 0 0 0 0 0 0 0 0 2 101

Table 20: Overall Accuracy

Accuracy 0.9660000

AccuracyLower 0.9528106

AccuracyUpper 0.9763414


Conditional Inference Tree

General Recursive Partitioning Tree

1. Perform an exhaustive search over all possible splits.

2. Maximize an information measure of node impurity.

3. Select the covariate split that maximizes this measure.

CTREE

1. In each node, the partial null hypothesis H_0^j : D(Y \mid X_j) = D(Y) is tested against the global null hypothesis H_0 = \bigcap_{j=1}^{m} H_0^j.

2. If the global hypothesis can be rejected, the association between Y and each of the covariates X_j, j = 1, ..., m, is measured by p-value.

3. If we are unable to reject H_0 at the specified α, recursion is stopped (see the sketch below). [Hothorn, 2006]
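
A minimal sketch of a single conditional inference tree with partykit::ctree (one implementation of this algorithm), assuming `train_df` and `valid_df` are hypothetical data frames holding a factor `label` plus the retained components; `alpha` is the significance level used to stop recursion:

library(partykit)

ct_fit <- ctree(label ~ ., data = train_df,
                control = ctree_control(alpha = 0.05))  # stop when H0 is not rejected
ct_pred <- predict(ct_fit, newdata = valid_df)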


CTREE Model Fitting and Tuning

[Figure: CTREE classification accuracy (0.805-0.830) vs. number of components (10-30); peak accuracy 0.83]

CTREE: Optimal Model Fitting

ctree_Ctrl <- trainControl(method = "repeatedcv", repeats = 3,
                           classProbs = TRUE, summaryFunction = defaultSummary)

ctree_Fit_opt <- train(label ~ ., data = training[, 1:(ctree_opt + 1)],
                       method = "ctree", metric = "Accuracy",
                       tuneLength = 5, maximize = TRUE,
                       trControl = ctree_Ctrl)

accuracy_measure_ctree_opt <- confusionMatrix(validation$label,
  predict(ctree_Fit_opt, validation[, 2:(ctree_opt + 1)]))


CTREE Optimal Model Summary Statistics

Table 21: Confusion Matrix (Columns: Predicted, Rows: Actual)

zero one two three four five six seven eight nine

zero 825 0 7 8 1 6 13 2 6 2

one 0 924 2 3 1 7 0 7 5 2

two 10 11 797 14 5 7 11 16 16 4

three 15 3 20 847 6 23 8 8 33 8

four 5 8 7 7 749 6 10 14 10 37

five 15 6 4 37 9 671 14 7 26 7

six 23 4 13 9 5 16 799 1 2 2

seven 2 6 11 4 12 3 1 851 6 35

eight 12 10 15 31 5 25 5 10 720 9

nine 3 5 8 13 54 11 3 26 11 708

Table 22: Overall Accuracy

Accuracy 0.8945698

AccuracyLower 0.8879734

AccuracyUpper 0.9009042


CTREE Optimal Model Confusion Matrix Image

[Figure: heatmap of the CTREE confusion matrix (Actual vs. Predicted), showing the counts from Table 21 and each cell's percentage of the validation set]

CTREE Optimal Bar Plot

[Figure: CTREE Optimal Model Predicted vs. Actual Class Labels — bar plot of label counts]

CTREE Optimal Model Confusion Matrix on Manually Labeled Test Set

Table 23: Confusion Matrix (Columns: Predicted, Rows: Actual)

zero one two three four five six seven eight nine

zero 93 0 1 3 0 1 2 0 1 1

one 0 110 0 0 3 0 1 0 1 0

two 1 0 74 2 1 0 2 1 2 0

three 2 0 4 96 0 7 0 3 9 1

four 0 0 2 1 89 1 1 2 0 12

five 1 0 0 2 2 77 3 1 6 3

six 0 1 3 0 0 2 90 0 4 0

seven 0 0 2 1 4 2 0 90 0 2

eight 0 2 4 1 1 3 1 1 70 3

nine 0 0 1 1 11 1 0 1 3 70

Table 24: Overall Accuracy

Accuracy 0.8590000

AccuracyLower 0.8358734

AccuracyUpper 0.8799885


Multinomial Logistic Regression

Class Probabilities

\Pr(Y = k \mid X = x) = \frac{e^{\beta_{0k} + \beta_{1k}X_1 + \cdots + \beta_{pk}X_p}}{\sum_{l=1}^{K} e^{\beta_{0l} + \beta_{1l}X_1 + \cdots + \beta_{pl}X_p}}

This is the logistic regression model generalized to problems with more than two classes. [James, 2013]
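
A minimal sketch with nnet::multinom (a common engine for multinomial logistic regression, wrapped by caret's method = "multinom") on the same hypothetical data frames; MaxNWts is raised because ten classes times many components can exceed the default weight limit:

library(nnet)

mlr_fit <- multinom(label ~ ., data = train_df,
                    MaxNWts = 5000, trace = FALSE)  # one coefficient set per non-baseline class
mlr_pred <- predict(mlr_fit, newdata = valid_df)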


MLR Model Fitting and Tuning

[Figure: Multinomial Logistic Model — classification accuracy (0.80-0.88) vs. number of components (20-60)]

MLR Optimal Model Summary Statistics

Table 25: Confusion Matrix (Columns: Predicted, Rows: Actual)

zero one two three four five six seven eight nine

zero 802 0 5 8 0 43 6 0 2 4

one 0 900 16 6 0 14 4 2 9 0

two 25 19 674 28 34 7 54 15 31 4

three 11 12 27 730 5 90 8 12 60 16

four 5 8 3 4 672 9 22 9 7 114

five 27 19 9 68 14 585 14 15 31 14

six 16 20 29 7 12 31 748 3 6 2

seven 8 17 22 8 10 14 0 775 12 65

eight 6 31 39 68 6 48 6 5 608 25

nine 14 8 7 15 142 16 1 71 17 551

Table 26: Overall Accuracy

Accuracy 0.7986623

AccuracyLower 0.7901393

AccuracyUpper 0.8069875


MLR Optimal Model Confusion Matrix Image

[Figure: heatmap of the multinomial logistic confusion matrix (Actual vs. Predicted), showing the counts from Table 25 and each cell's percentage of the validation set]

MLR Optimal Bar Plot

[Figure: Multinomial Logistic Optimal Model Predicted vs. Actual Class Labels — bar plot of label counts]

MLR Optimal Model Confusion Matrix on Manually Labeled Test Set

Table 27: Confusion Matrix (Columns: Predicted, Rows: Actual)

zero one two three four five six seven eight nine

zero 93 0 0 0 1 4 3 1 0 0

one 0 109 2 0 0 1 1 1 1 0

two 1 1 74 3 2 0 1 1 0 0

three 1 0 0 108 0 4 1 3 0 5

four 0 0 0 0 104 0 0 1 0 3

five 2 1 0 3 4 81 1 0 2 1

six 0 0 1 0 0 1 97 1 0 0

seven 0 0 2 0 3 0 0 88 1 7

eight 0 1 0 2 2 11 0 0 62 8

nine 0 0 0 1 8 1 0 1 0 77

Table 28: Overall Accuracy

Accuracy 0.8930000

AccuracyLower 0.8721714

AccuracyUpper 0.9114796


Model Comparison: Summary Statistics

Table 29: Model Comparison: Summary Statistics

Min. 1st Qu. Median Mean 3rd Qu. Max. NA’s

KNN 0.9653 0.9685 0.9711 0.9713 0.9737 0.9779 0

LDA 0.8606 0.8681 0.8722 0.8706 0.8733 0.8851 0

QDA 0.9524 0.9575 0.9585 0.9590 0.9613 0.9667 0

RF 0.9422 0.9486 0.9521 0.9514 0.9548 0.9572 0

Log 0.8690 0.8800 0.8846 0.8857 0.8911 0.9062 0

Ctree 0.8158 0.8229 0.8254 0.8270 0.8314 0.8387 0


Testing for Normality: LDA

[Figure: density plot and normal Q-Q plot of the 30 resampled LDA accuracy estimates]

Table 30: Shapiro-Wilk Normality Test

Test-statistic (W) P-value

0.9224415 0.0310465
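
A minimal sketch of the test behind Tables 30-35, assuming `lda_acc` holds the 30 resampled accuracy estimates for LDA from the repeated cross-validation:

shapiro.test(lda_acc)  # returns the W statistic and p-value reported above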


Testing for Normality: QDA

[Figure: density plot and normal Q-Q plot of the 30 resampled QDA accuracy estimates]

Table 31: Shapiro-Wilk Normality Test

Test-statistic (W) P-value

0.9769401 0.7396847


Testing for Normality: KNN

[Figure: density plot and normal Q-Q plot of the 30 resampled KNN accuracy estimates]

Table 32: Shapiro-Wilk Normality Test

Test-statistic (W) P-value

0.9774543 0.7545886


Testing for Normality: RF

[Figure: density plot and normal Q-Q plot of the 30 resampled random forest accuracy estimates]

Table 33: Shapiro-Wilk Normality Test

Test-statistic (W) P-value

0.9504195 0.1734898


Testing for Normality: CTREE

[Figure: density plot and normal Q-Q plot of the 30 resampled CTREE accuracy estimates]

Table 34: Shapiro-Wilk Normality Test

Test-statistic (W) P-value

0.9686452 0.5028018


Testing for Normality: Log

[Figure: density plot and normal Q-Q plot of the 30 resampled multinomial logistic accuracy estimates]

Table 35: Shapiro-Wilk Normality Test

Test-statistic (W) P-value

0.9850217 0.9375558


Model Comparison: Statistical Inference

Table 36: Summary Statistics

nbr.val min max median mean var

KNN 30 0.96532 0.97788 0.97111 0.97133 1e-05

QDA 30 0.95236 0.96669 0.95852 0.95901 1e-05

Table 37: Wilcoxon Signed Rank Test

Test-statistic (V) P-value

Two-sided 465 0

Greater 465 0

Table 38: T-test

Test-statistic (t) P-value

Two-sided 15.75693 0

Greater 15.75693 0
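
A minimal sketch of these comparisons, assuming `knn_acc` and `qda_acc` are the paired vectors of 30 resampled accuracies for the two models:

wilcox.test(knn_acc, qda_acc, paired = TRUE)                            # two-sided
wilcox.test(knn_acc, qda_acc, paired = TRUE, alternative = "greater")
t.test(knn_acc, qda_acc, paired = TRUE, alternative = "greater")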


Model Comparison: Box Plot

[Figure: box plots of resampled Accuracy (left) and Kappa (right), 0.80-0.95, by model: Ctree, LDA, Log, RF, QDA, KNN]

Class Accuracy by Model

Table 39: Optimal Model Class Accuracy Measures

0 1 2 3 4 5 6 7 8 9

KNN 0.998 0.99 0.99 0.99 0.99 0.99 0.99 0.99 0.99 0.99

LDA 0.970 0.93 0.96 0.93 0.94 0.91 0.96 0.94 0.90 0.89

QDA 0.994 0.99 0.97 0.98 0.99 0.98 1.00 0.99 0.95 0.98

RF 0.983 0.98 0.94 0.92 0.94 0.95 0.97 0.95 0.93 0.92

Ctree 0.950 0.97 0.94 0.93 0.94 0.93 0.96 0.95 0.92 0.93

Log 0.934 0.93 0.89 0.87 0.86 0.83 0.93 0.92 0.87 0.83


Ensemble Predictions:

Goal: Develop a method through which the class accuracy of each 'optimized' model can be employed in making class predictions (a code sketch follows the conditions below).

Condition 1: Majority vote wins.

Condition 2: If each model predicts a different class label, go with the prediction from the model that has the maximum accuracy for its predicted class.

Condition 3: If there is a two-way tie or split vote, go with the class label that has the maximum mean accuracy across all models for that class.

Condition 4: If there is a three-way tie, go with the class label that has the maximum mean accuracy across all models for that class.
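
A minimal sketch of these rules, assuming `votes` is a hypothetical data frame with one column of predicted labels per model (column names matching the rows of `class_acc`, a model-by-class accuracy matrix like Table 39):

ensemble_vote <- function(votes, class_acc) {
  apply(votes, 1, function(v) {
    tab <- table(v)
    top <- names(tab)[tab == max(tab)]
    if (length(top) == 1) return(top)  # Condition 1: majority vote wins
    if (max(tab) == 1) {               # Condition 2: every model disagrees
      best <- which.max(mapply(function(m, lab) class_acc[m, lab], names(v), v))
      return(v[[best]])
    }
    # Conditions 3-4: tied vote -> class with the highest mean accuracy
    top[which.max(colMeans(class_acc[, top, drop = FALSE]))]
  })
}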


Ensemble Summary Statistics

Table 40: Confusion Matrix (Columns: Predicted, Rows: Actual)

0 1 2 3 4 5 6 7 8 9

0 101 0 1 0 0 0 0 0 0 0

1 1 114 1 4 2 3 1 3 4 1

2 0 0 78 1 0 0 0 0 0 0

3 0 0 0 112 0 1 0 0 0 1

4 0 0 1 0 105 0 0 0 1 0

5 0 1 0 0 0 91 1 0 1 1

6 0 0 1 0 0 0 98 0 0 0

7 0 0 1 1 1 0 0 97 0 1

8 0 0 0 4 0 0 0 1 79 0

9 0 0 0 0 0 0 0 0 1 84

Table 41: Overall Accuracy

Accuracy 0.9590000

AccuracyLower 0.9447875

AccuracyUpper 0.9704198


Conclusion

1. KNN was the best-performing model, with a classification accuracy of 0.978.

2. Examine the effectiveness of Support Vector Machine classifiers, as well as Neural Network models.

3. Examine the effectiveness of a hierarchical clustering technique for dimension reduction, and compare the results with principal component analysis.

4. Continue to explore the ensemble prediction method with a variety of logic rules.


Parallel Processing

[Figure: LDA and QDA parallel vs. non-parallel processing — elapsed time in seconds vs. number of components (25-100), for LDA, LDA 2 cores, QDA, QDA 2 cores]
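
A minimal sketch of the two-core setup, assuming the doParallel backend that caret's resampling can use:

library(doParallel)

cl <- makeCluster(2)    # two worker cores, as in the timing comparison above
registerDoParallel(cl)
# ... train() calls placed here run their resampling folds in parallel ...
stopCluster(cl)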


Parallel Processing Continued


References

Breiman, L. (2001). "Random forests." Machine Learning 45(1): 5-32.

Hothorn, T., et al. (2006). "Unbiased recursive partitioning: A conditional inference framework." Journal of Computational and Graphical Statistics 15(3): 651-674.

James, G., et al. (2013). An Introduction to Statistical Learning. Springer.

Kuhn, M. and Johnson, K. (2013). Applied Predictive Modeling. Springer.
