
Case-Sensitive Alphanumeric Character Recognition

ECE539 Project Report

Joseph Miller

May 12, 2016

1. Introduction

As more personal information and monetary transactions move towards the digital domain, cybersecurity becomes more and more critical. Each day, malicious programmers and criminals find new ways to gain access to secure information. To keep that information secure, companies have been working to find new and better ways to verify the identity of users.

Currently, a popular method to verify that an internet user is human and not machine is to use the "Prove you’re not a robot" test. An example is shown below.

Figure 1.1 An online Prove you’re not a robot Test

The idea here is that a human will be able to identify the characters and a machine will not. Therefore, if the correct sequence of characters is typed, the user’s identity is verified and he/she/it is allowed to continue deeper into the online program and possibly access secure information. A machine, on the other hand, should fail the test and therefore be blocked from accessing any information.

In this project, I will explore the difficulties of using machine learning to classify alphanumeric characters of varying fonts and contexts. Using a dataset of obscure characters, I will compare the recognition rates of human vs. machine to discern whether online "Prove you’re not a robot" tests are a good means of verifying whether a user is human or machine.

Note: Source code is referenced throughout the text and is included in section 11, Source Code.


2. Data Set

The data set consists of 6283 training samples and 6220 testing samples taken from Google Street View images. Each image is associated with a character label. Possible character labels include uppercase letters A-Z, lowercase letters a-z, and digits 0-9, for a total of 62 classes. Characters may be blurry, obscured, rotated in either direction by up to 90 degrees, or styled. In addition, each character may appear on any variety of background. Several examples are shown in Figure 2.1.

Figure 2.1 Example characters from left to right: A, q, H, H, E, Z, 2, D, Y

In addition, some of the images contain several full characters. Even if this is the case, the image is still

associated with just a single character label. Two examples are shown below.

Figure 2.2. Two sample images – both assigned to the label j

Note: These images are not used for "Prove you’re not a robot" tests. They are simply examples of characters that one could see when taking a "Prove you’re not a robot" test. Many of these images are more difficult than what would be seen during an actual test.

3. Recognition by Human

Ten people were each assigned 100 unique training images and asked to identify each. The human classification rate will later be compared with the classification rates of various machine classification algorithms. The human recognition results are given in Table 3.1. The participant classification data is stored in participant.csv and the grading script is saved as human.m.


Participant    Classification Rate
1              84%
2              82%
3              80%
4              83%
5              80%
6              84%
7              84%
8              80%
9              72%
10             85%
Total          81.4%

Table 3.1 Recognition rate of characters by human

You may be wondering how a human was unable to perfectly categorize the images. To illustrate the

difficulty of this dataset, you may attempt to categorize each of the images on this page. The correct

labels will be shown on the next page.

Figure 3.1 Attempt to classify these characters by sight.


The correct labels are as follows:

first row – I, O, O, d, m, I

second row – O, 0, O, o, 9, a

third row – 1, 0, a, l, O, 0, N, I

This dataset requires that we differentiate characters that are fundamentally very similar. For example, 0, O, and o are difficult to differentiate when taken out of context. Likewise, 1, l (lowercase L), and I (uppercase I) are just as difficult. Because images can be rotated, characters that would normally be easy to discern become very similar as well. For example, the letter d above could easily be mistaken for a rotated p, or even for an o, O, or 0 with an ignorable feature superimposed above it.

As described in the following sections, these ambiguities pose an even greater challenge for machine learning algorithms.

4. Preprocessing of Images

The set of downloaded images came in a variety of sizes, ranging from 14x29 pixels to 178x197 pixels. The K Nearest Neighbor (KNN) and Random Forest (RF) algorithms use images scaled to 20x20 pixels. The Multilayer Perceptron (MLP) algorithm uses images scaled to 40x40 pixels. The smaller image size was chosen for the first two algorithms to keep their run times manageable.

Because color does not aid in the classification of these images, all images are converted to grayscale.

This is all the preprocessing that is used for the KNN and RF algorithms. The MLP algorithm uses a more sophisticated approach, described in section 7, Recognition by Multilayer Perceptron.
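The two preprocessing steps above can be sketched as follows. This is a minimal illustration in Python rather than the Julia and MATLAB used for the project. The function names are hypothetical, and nearest-neighbour resampling is an assumption, since the report does not state which interpolation was used when scaling. Grayscale is taken as the mean of the three colour channels, as in the Julia code in section 11.

```python
def to_grayscale(rgb_image):
    """Average the three colour channels of a grid of (r, g, b) tuples."""
    return [[sum(pixel) / 3.0 for pixel in row] for row in rgb_image]


def resize_nearest(image, new_height, new_width):
    """Resize a 2-D grid of intensities with nearest-neighbour sampling."""
    old_height, old_width = len(image), len(image[0])
    return [
        [
            image[r * old_height // new_height][c * old_width // new_width]
            for c in range(new_width)
        ]
        for r in range(new_height)
    ]


def preprocess(rgb_image, side=20):
    """Grayscale, rescale to side x side, and flatten to one feature vector."""
    gray = resize_nearest(to_grayscale(rgb_image), side, side)
    return [value for row in gray for value in row]


# Hypothetical input with 29 rows and 14 columns of made-up pixel values.
sample = [[(r, c, 0) for c in range(14)] for r in range(29)]
features = preprocess(sample)
print(len(features))  # one value per pixel of the 20 x 20 image
```

With side=20 each image becomes a 400-element vector, which is the feature length the KNN and RF code expects (imageSize = 400).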

5. Recognition by K Nearest Neighbor

This KNN algorithm is written in the Julia programming language, which was developed at MIT and first released in 2012. This language was made to combine the usability of Python and the performance of C into a single language. The source code is provided in knn.jl. Much of this algorithm was created by working through an online tutorial of the Julia language.

This KNN algorithm tunes the value of K by using leave-one-out cross validation. The classification rate for each value of K is shown in Table 5.1.


K     Classification Rate
1     44.4%
2     44.4%
3     44.5%
4     43.9%
5     42.9%
6     42.1%
7     41.3%
8     40.9%
9     40.6%
10    40.57%
11    40.3%
12    40.2%
13    39.9%
14    39.7%
15    39.4%
16    39.2%
17    39.2%
18    39.2%
19    38.9%
20    39.6%

Table 5.1 Recognition rates of various values of K

After leave-one-out cross validation is performed, we can select the value of K that provides the best results. From Table 5.1, we see that K=3 gave the highest classification rate. Therefore, when the testing set is applied to the KNN system, we use a value of K=3.
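The tuning procedure can be illustrated with a small Python sketch (not the project's knn.jl). Each sample is held out in turn, its neighbours are ranked once, and the prediction is recorded for every K up to a maximum. The function names and the toy data are assumptions for illustration. Ties are broken in favour of the label seen first among the nearest neighbours, which may differ slightly from the Julia code.

```python
from collections import Counter


def squared_distance(a, b):
    """Squared Euclidean distance; the square root is not needed for ranking."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def loocv_accuracy(samples, labels, max_k):
    """Return {k: leave-one-out accuracy} for k = 1..max_k."""
    correct = {k: 0 for k in range(1, max_k + 1)}
    for i, query in enumerate(samples):
        # Rank every other sample by its distance to the held-out sample i.
        others = [j for j in range(len(samples)) if j != i]
        others.sort(key=lambda j: squared_distance(query, samples[j]))
        for k in range(1, max_k + 1):
            votes = Counter(labels[j] for j in others[:k])
            predicted = votes.most_common(1)[0][0]
            correct[k] += predicted == labels[i]
    return {k: correct[k] / len(samples) for k in correct}


# Toy one-dimensional data with two well-separated classes (made up).
samples = [[0.0], [0.1], [0.2], [5.0], [5.1], [5.2]]
labels = ["a", "a", "a", "b", "b", "b"]
rates = loocv_accuracy(samples, labels, max_k=3)
best_k = max(rates, key=rates.get)
```

On the real data, the same selection rule (take the K with the highest leave-one-out rate) picks K=3 from Table 5.1.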

The classification rate of KNN classifier on the testing set is 42.4%.

6. Recognition by Random Forest (RF)

The random forest learning algorithm expands upon decision tree learning by taking a majority vote of the classifications made by many decision trees. Individual decision trees are prone to learning irregular patterns, that is, to overfitting. The random forest technique reduces this tendency by training each decision tree on a bootstrap sample of the training data.
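The two ingredients named above, bootstrap sampling and majority voting, can be sketched in Python as follows. This is only an illustration of the idea, not the DecisionTree package used in the project. The function names are hypothetical, and a "tree" here is any callable that maps a feature vector to a label.

```python
import random
from collections import Counter


def bootstrap_sample(samples, labels, rng):
    """Draw len(samples) training pairs with replacement."""
    indices = [rng.randrange(len(samples)) for _ in samples]
    return [samples[i] for i in indices], [labels[i] for i in indices]


def majority_vote(predictions):
    """Return the label predicted by the largest number of trees."""
    return Counter(predictions).most_common(1)[0][0]


def forest_predict(trees, sample):
    """Ask every tree for a label and return the most common answer."""
    return majority_vote([tree(sample) for tree in trees])


# Three stand-in "trees" that disagree on an ambiguous character.
trees = [lambda x: "O", lambda x: "0", lambda x: "O"]
print(forest_predict(trees, [0.0] * 400))

rng = random.Random(0)
boot_x, boot_y = bootstrap_sample([[1], [2], [3]], ["a", "b", "c"], rng)
```

Because each tree sees a different resampled training set, the errors of individual trees are less correlated, and the vote averages them out.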

Random forest functions are available in libraries for a variety of languages. My RF classification method is written in the Julia programming language with aid from an online Julia tutorial and uses the following functions from the DecisionTree package:

build_forest

apply_forest

nfoldCV_forest


These functions require several parameters, such as the number of features chosen at each split and the number of trees in the forest. The number of features chosen at each split is typically set to √p, where p is the total number of features. In this case, each of the images is 20x20 pixels, so p = 400 and we set this parameter to √400 = 20. The number of trees is varied to achieve the best results. I used 4-fold cross validation to test my choice of parameters.
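As a small worked example of the two choices above, the following Python sketch computes the square-root heuristic for a 20x20 image and splits the 6283 training samples into four validation folds. The function names are hypothetical, and the contiguous, unshuffled folds are an assumption for illustration; the report does not describe how nfoldCV_forest partitions the data.

```python
import math


def features_per_split(num_features):
    """Square-root heuristic for the number of features tried at each split."""
    return round(math.sqrt(num_features))


def k_fold_indices(num_samples, num_folds):
    """Split range(num_samples) into num_folds contiguous validation folds."""
    fold_sizes = [
        num_samples // num_folds + (i < num_samples % num_folds)
        for i in range(num_folds)
    ]
    folds, start = [], 0
    for size in fold_sizes:
        folds.append(list(range(start, start + size)))
        start += size
    return folds


print(features_per_split(20 * 20))
folds = k_fold_indices(6283, 4)
print([len(fold) for fold in folds])
```

In each of the four rounds, one fold is held out for validation and the forest is trained on the remaining three.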

As the parameters passed into the RF functions vary, the following classification rates are achieved:

Number of Features      Number of Trees     4-Fold Classification
Chosen at Each Split    in the Forest       Rate
20                      50                  40.5%
20                      100                 45.3%
20                      150                 47.8%
20                      160                 46.3%
20                      200                 45.7%

Table 6.1 Classification rates of Random Forest classifier

The classification rate is highest when the number of trees in the forest is 150. Therefore, when the

testing set was applied to the RF, a forest size of 150 was used.

The classification rate of the RF classifier on the testing set is 48.4%.

7. Recognition by Multilayer Perceptron

Additional preprocessing of the data was performed for use in the MLP network. For this section, I attempted to artificially increase the number of training samples by applying various effects to each of the training images and adding these newly created images to the training set. These effects include random rotation between -30 and 30 degrees, random translation in any direction by up to 8 pixels, a random choice to invert the grayscale color, and a random choice to apply edge detection. I applied these effects to 5283 training samples to increase the number of training samples from 6283 to 11566. I reserved 1000 training samples to use in tuning the parameters of my network. Two augmented training samples are shown in Figure 7.1 next to their original images. The source code of the MLP preprocessing work is given in dataAugmentation.m. The purpose of increasing the number of training samples is to increase the variety of images that the network is exposed to. This data augmentation had an immediate effect on the classification rate, which rose by about 3% for each MLP architecture.
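Two of the augmentation effects described above, random translation and random inversion, can be sketched in plain Python as follows. This is an illustration only and is not dataAugmentation.m: rotation and edge detection are omitted, the function names are hypothetical, and padding the uncovered pixels with zeros is an assumption.

```python
import random


def translate(image, dx, dy, fill=0):
    """Shift a 2-D grid by (dx, dy) pixels, padding uncovered cells with fill."""
    height, width = len(image), len(image[0])
    shifted = [[fill] * width for _ in range(height)]
    for r in range(height):
        for c in range(width):
            nr, nc = r + dy, c + dx
            if 0 <= nr < height and 0 <= nc < width:
                shifted[nr][nc] = image[r][c]
    return shifted


def invert(image, max_value=255):
    """Invert the grayscale intensities of an image."""
    return [[max_value - value for value in row] for row in image]


def augment(image, rng, max_shift=8):
    """Return a randomly shifted and, half of the time, inverted copy."""
    out = translate(
        image,
        rng.randint(-max_shift, max_shift),
        rng.randint(-max_shift, max_shift),
    )
    if rng.random() < 0.5:
        out = invert(out)
    return out


# A made-up 40 x 40 image with one bright pixel near the centre.
original = [[0] * 40 for _ in range(40)]
original[20][20] = 255
augmented = augment(original, random.Random(1))
```

The augmented copy keeps the label of its original, so each call adds one new labeled training sample.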


Figure 7.1 Augmented training samples

Several MLP architectures were constructed. Three networks and their final classification rates on the testing set are shown below. The code for creating, training, and testing the MLP is provided in neuralNetwork.m.

Figure 7.2 MLP with one hidden layer consisting of 100 neurons. Classification Rate = 29%.


Figure 7.3 MLP with 4 hidden layers each consisting of 300 neurons. Classification Rate = 48%.

Figure 7.4 MLP with 5 hidden layers each consisting of 300 neurons. Classification Rate = 47%.

Additional architectures were used as well, but these three were chosen to show how increasing the depth and the number of neurons affects the classification rate. The network with 4 hidden layers of 300 neurons each gave the highest classification rate of 48.0%.

8. Recognition by Ensemble Classifier

A max vote ensemble classifier was implemented to determine a final classification rate. If the KNN, RF, and MLP classifiers gave conflicting labels, the ensemble classifier defaulted to the label given by the RF classifier because it gave the most accurate predictions. Source code for the ensemble classifier is provided in combine.m.

The classification rate of the ensemble classifier on the testing set is 49.7%.
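The voting rule can be sketched in a few lines of Python. This is an illustration, not combine.m, and it assumes one reading of the rule: a label chosen by at least two of the three classifiers wins, and the RF label is used only when all three disagree. The function name is hypothetical.

```python
from collections import Counter


def ensemble_label(knn_label, rf_label, mlp_label):
    """Majority vote of three classifiers; fall back to RF when all disagree."""
    votes = Counter([knn_label, rf_label, mlp_label])
    label, count = votes.most_common(1)[0]
    return label if count >= 2 else rf_label


print(ensemble_label("O", "0", "O"))  # KNN and MLP outvote RF
print(ensemble_label("O", "0", "o"))  # no majority, so the RF label is used
```

Under this reading the ensemble can differ from the RF classifier only where KNN and MLP agree with each other, which is consistent with its rate being close to, but not the same as, the RF rate.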

9. Discussion of Results

The seemingly low classification rates are expected. A significant amount of error in all three methods comes from the fact that many alphanumeric characters are very similar. For example, O, o, and 0 are nearly impossible to distinguish when taken out of context. To test the effect of similar characters, I reduced the number of classes so that misclassifications among similar characters are ignored. Table 9.1 shows which characters are considered similar. Because the images may be rotated, characters that are similar after rotation may also be considered similar. After similar characters are combined into single classes, the total number of classes is reduced from 62 to 41, which is still a very high number of classes.


Similar Characters

O o 0    |  1 I l
2 Z z    |  4 A
5 S s    |  6 b
8 B      |  C c
J j      |  K k
M m      |  P p
U u      |  V v
W w      |  X x
Y y      |

Table 9.1 Combine the characters in each cell into a single class so that misclassifications

cannot occur between similar characters.
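The class merging in Table 9.1 can be expressed as a lookup from each character to a representative of its group. The Python sketch below is an illustration under that assumption; the names SIMILAR_GROUPS, merged, and merged_accuracy are hypothetical, and the choice of the first character in each cell as the representative is arbitrary.

```python
import string

# One string per cell of Table 9.1.
SIMILAR_GROUPS = [
    "Oo0", "1Il", "2Zz", "4A", "5Ss", "6b", "8B", "Cc", "Jj",
    "Kk", "Mm", "Pp", "Uu", "Vv", "Ww", "Xx", "Yy",
]
CANONICAL = {ch: group[0] for group in SIMILAR_GROUPS for ch in group}


def merged(label):
    """Map a character to the representative of its similarity group."""
    return CANONICAL.get(label, label)


def merged_accuracy(true_labels, predicted_labels):
    """Classification rate when confusions inside a group are not errors."""
    hits = sum(merged(t) == merged(p) for t, p in zip(true_labels, predicted_labels))
    return hits / len(true_labels)


all_classes = string.ascii_letters + string.digits
num_merged_classes = len({merged(ch) for ch in all_classes})
print(num_merged_classes)
```

Applying merged to all 62 labels leaves 41 distinct classes, matching the count stated above.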

After the similar character misclassifications are ignored, the classification rate of the best MLP increased from 48% to 63%. This is a significant improvement and indicates that much of the remaining error comes from confusing similar characters, as expected.

Even so, the human classification rate of 81.4% is significantly higher than the ensemble classification rate of 49.7%. Suppose a "Prove you’re not a robot" test asks for the classification of 6 characters from this dataset and each character is classified independently. The human will then be registered as a human (81.4%)^6 = 29% of the time (clearly this dataset is more difficult than the characters used in real "Prove you’re not a robot" tests) and the ensemble classifier will be registered as a human only (49.7%)^6 = 1.5% of the time. Therefore, an actual human is roughly 20x more likely to be classified as a human than my ensemble classifier.
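The arithmetic behind these figures can be checked with a short Python sketch. It assumes, as the calculation above does, that every character in the test is classified independently with the same per-character rate. The function name is hypothetical.

```python
def pass_probability(per_character_rate, num_characters=6):
    """Chance of getting every character in the test right."""
    return per_character_rate ** num_characters


human = pass_probability(0.814)
machine = pass_probability(0.497)
print(f"human: {human:.1%}, ensemble: {machine:.1%}, ratio: {human / machine:.1f}x")
```

The ratio comes out slightly below 20, so the 20x figure is a rounded value.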

Is this a large enough margin of human over machine? I don’t believe so. My preprocessing and classification methods are trivial when compared to the most complex machine learning methods that an expert could use. For example, the world’s highest classification rate on this dataset is 85%. It utilizes a 6 layer convolutional network with a significant amount of image processing. That classification rate exceeds that of humans.

In conclusion, the "Prove you’re not a robot" test cannot ensure it passes only humans.


10. Resources

[1] T. E. de Campos, B. R. Babu and M. Varma, Character recognition in natural images, Proceedings of the International Conference on Computer Vision Theory and Applications (VISAPP), Lisbon, Portugal, February 2009.

[2] First Steps With Julia. (n.d.). Retrieved May 13, 2016, from https://www.kaggle.com/c/street-view-getting-started-with-julia

[3] Florian Muellerklein. (n.d.). Retrieved May 13, 2016, from http://florianmuellerklein.github.io/cnn_streetview/

[4] Random forest. (n.d.). Retrieved May 13, 2016, from https://en.wikipedia.org/wiki/Random_forest

[5] Hudson Beale, M., Hagan, M., & Demuth, H. (n.d.). Neuralnetworktoolboxuserguide.pdf. Retrieved from https://ay15-16.moodle.wisc.edu/prod/pluginfile.php/273453/mod_resource/content/1/neuralnetworktoolboxuserguide.pdf

11. Source Code


##############################################################
##############################################################
# KNN SOURCE CODE
##############################################################
##############################################################

# load packages
Pkg.add("Images")
Pkg.add("DataFrames")
using Images
using DataFrames

##############################################################
# read_data
# - loads the image files and converts them to grayscale image
#   vectors.
#
# typeData - "train" or "test"
# labelsInfo - the IDs of each image to be read
# imageSize - 20*20 pixels = 400
##############################################################
function read_data(typeData, labelsInfo, imageSize)
    # Initialize the x matrix
    x = zeros(size(labelsInfo, 1), imageSize)

    for (index, idImage) in enumerate(labelsInfo[:ID])
        # get the images
        nameFile = "C:/Users/Joey/OneDrive/2016__spring/ECE_539/Project/Julia/$(typeData)Resized/$(idImage).Bmp"
        img = load(nameFile)

        # Convert image to float values
        temp = convert(Image{Images.Gray}, img)

        # convert to grayscale by taking the mean of the 3 colors
        if ndims(temp) == 3
            temp = mean(temp.data, 1)
        end

        # Reshape image into a column vector
        x[index, :] = reshape(temp, 1, imageSize)
    end
    return x
end

##############################################################
# set up
##############################################################
# 20 x 20 pixel
imageSize = 400

# Read information about training data
labelsInfoTrain = readtable("C:/Users/Joey/OneDrive/2016__spring/ECE_539/Project/Julia/trainLabels.csv")

# Read training matrix
xTrain = read_data("train", labelsInfoTrain, imageSize)

# Read information about the testing data
labelsInfoTest = readtable("C:/Users/Joey/OneDrive/2016__spring/ECE_539/Project/Julia/sampleSubmission.csv")

# Read testing matrix
xTest = read_data("test", labelsInfoTest, imageSize)

# Get only first character of string
# Apply the function to each element of the column "Class"
yTrain = map(x -> x[1], labelsInfoTrain[:Class])

# Convert from character to int
yTrain = int(yTrain)

# Store one image per column
xTrain = xTrain'
xTest = xTest'

addprocs(3)

##############################################################
# Functions for KNN with LOOF-CV with K=1:20
# - euclidean_distance
# - get_k_nearest_neighbors
# - assign_label_each_k
##############################################################
# Squared Euclidean distance (sufficient for ranking neighbors)
@everywhere function euclidean_distance(a, b)
    distance = 0.0
    for index in 1:size(a, 1)
        distance += (a[index]-b[index]) * (a[index]-b[index])
    end
    return distance
end

@everywhere function get_k_nearest_neighbors(x, i, k)
    nRows, nCols = size(x)
    imageI = Array(Float32, nRows)
    for index in 1:nRows
        imageI[index] = x[index, i]
    end
    imageJ = Array(Float32, nRows)
    distances = Array(Float32, nCols)
    for j in 1:nCols
        for index in 1:nRows
            imageJ[index] = x[index, j]
        end
        distances[j] = euclidean_distance(imageI, imageJ)
    end
    sortedNeighbors = sortperm(distances)
    # Skip the first entry, which is image i itself
    kNearestNeighbors = sortedNeighbors[2:k+1]
    return kNearestNeighbors
end

@everywhere function assign_label_each_k(x, y, maxK, i)
    kNearestNeighbors = get_k_nearest_neighbors(x, i, maxK)
    labelsK = zeros(Int, 1, maxK)
    counts = Dict{Int, Int}()
    highestCount = 0
    mostPopularLabel = 0

    # keep track of current value of k
    for (k, n) in enumerate(kNearestNeighbors)
        labelOfN = y[n]
        if !haskey(counts, labelOfN)
            counts[labelOfN] = 0
        end
        counts[labelOfN] += 1
        if counts[labelOfN] > highestCount
            highestCount = counts[labelOfN]
            mostPopularLabel = labelOfN
        end
        # Save the most popular label
        labelsK[k] = mostPopularLabel
    end

    # Return the vector of labels for each K
    return labelsK
end

##############################################################
# Perform KNN with LOOF-CV with K = 1 to 20
##############################################################
tic()
maxK = 20
yPredictionsK = @parallel (vcat) for i in 1:size(xTrain, 2)
    assign_label_each_k(xTrain, yTrain, maxK, i)
end
for k in 1:maxK
    accuracyK = mean(yTrain .== yPredictionsK[:, k])
    println("The LOOF-CV accuracy of $(k)-NN is $(accuracyK)")
end
toc()

##############################################################
# Functions for KNN on testing set
# - get_k_nearest_neighbors
# - assign_label
# - also uses euclidean_distance from above
##############################################################
@everywhere function get_k_nearest_neighbors(xTrain, imageI, k)
    nRows, nCols = size(xTrain)
    imageJ = Array(Float32, nRows)
    distances = Array(Float32, nCols)
    for j in 1:nCols
        for index in 1:nRows
            imageJ[index] = xTrain[index, j]
        end
        distances[j] = euclidean_distance(imageI, imageJ)
    end
    sortedNeighbors = sortperm(distances)
    kNearestNeighbors = sortedNeighbors[1:k]
    return kNearestNeighbors
end

@everywhere function assign_label(xTrain, yTrain, k, imageI)
    kNearestNeighbors = get_k_nearest_neighbors(xTrain, imageI, k)
    counts = Dict{Int, Int}()
    highestCount = 0
    mostPopularLabel = 0
    for n in kNearestNeighbors
        labelOfN = yTrain[n]
        if !haskey(counts, labelOfN)
            counts[labelOfN] = 0
        end
        counts[labelOfN] += 1 #add one to the count
        if counts[labelOfN] > highestCount
            highestCount = counts[labelOfN]
            mostPopularLabel = labelOfN
        end
    end
    return mostPopularLabel
end

##############################################################
# Perform KNN with K=3 on test data
##############################################################
println("Running kNN on test data")
tic()
# CV shows K=3 is best
k = 3
yPredictions = @parallel (vcat) for i in 1:size(xTest, 2)
    nRows = size(xTrain, 1)
    imageI = Array(Float32, nRows)
    for index in 1:nRows
        imageI[index] = xTest[index, i]
    end
    assign_label(xTrain, yTrain, k, imageI)
end
toc()

#Convert int predictions to char
labelsInfoTest[:Class] = map(Char, yPredictions)

#Save predictions
writetable("C:/Users/Joey/OneDrive/2016__spring/ECE_539/Project/Julia/KNN_SUBMISSION.csv", labelsInfoTest, separator=',', header=true)


##############################################################
##############################################################
# RF SOURCE CODE
##############################################################
##############################################################

# load packages
Pkg.add("Images")
Pkg.add("DataFrames")
using Images
using DataFrames

##############################################################
# read_data
# - loads the image files and converts them to grayscale image
#   vectors.
#
# typeData - "train" or "test"
# labelsInfo - the IDs of each image to be read
# imageSize - 20*20 pixels = 400
##############################################################
function read_data(typeData, labelsInfo, imageSize)
    # Initialize the x matrix
    x = zeros(size(labelsInfo, 1), imageSize)

    for (index, idImage) in enumerate(labelsInfo[:ID])
        # get the images
        nameFile = "C:/Users/Joey/OneDrive/2016__spring/ECE_539/Project/Julia/$(typeData)Resized/$(idImage).Bmp"
        img = load(nameFile)

        # Convert image to float values
        temp = convert(Image{Images.Gray}, img)

        # convert to grayscale by taking the mean of the 3 colors
        if ndims(temp) == 3
            temp = mean(temp.data, 1)
        end

        # Reshape image into a column vector
        x[index, :] = reshape(temp, 1, imageSize)
    end
    return x
end

##############################################################
# set up
##############################################################
# 20 x 20 pixel
imageSize = 400

# Read information about training data
labelsInfoTrain = readtable("C:/Users/Joey/OneDrive/2016__spring/ECE_539/Project/Julia/trainLabels.csv")

# Read training matrix
xTrain = read_data("train", labelsInfoTrain, imageSize)

# Read information about the testing data
labelsInfoTest = readtable("C:/Users/Joey/OneDrive/2016__spring/ECE_539/Project/Julia/sampleSubmission.csv")

# Read testing matrix
xTest = read_data("test", labelsInfoTest, imageSize)

# Get only first character of string
# Apply the function to each element of the column "Class"
yTrain = map(x -> x[1], labelsInfoTrain[:Class])

# Convert from character to int
yTrain = int(yTrain)

##############################################################
# main code
# - applies random forest training, testing and four-fold
#   cross validation to test the accuracy of the model.
##############################################################
# load packages
Pkg.add("DecisionTree")
using DecisionTree

# Train random forest with
# 20 for number of features chosen at each random split
# 150 for number of trees
# 1.0 for ratio of subsampling
num_features = 20
num_trees = 150
ratio = 1.0
model = build_forest(yTrain, xTrain, num_features, num_trees, ratio)

# Get predictions for test data
predTest = apply_forest(model, xTest)

# Convert integer predictions to character
labelsInfoTest[:Class] = map(Char, predTest)

# Save predictions
writetable("C:/Users/Joey/OneDrive/2016__spring/ECE_539/Project/Julia/RF_SUBMISSION.csv", labelsInfoTest, separator=',', header=true)

# Run 4 fold cross validation
accuracy = nfoldCV_forest(yTrain, xTrain, num_features, num_trees, 4, ratio);


%% dataAugmentation.m % - convert images to 40x40 grayscale images % - randomly rotates some of the images by -10 to 10 degrees % - randomly translates some of the images in any direction % - randomly inverts the color of some of the images % - randomly applies edge detection to some of the images % - saves augmented trainings set as trainSet.mat % - saves preserved training set as testSet.mat % runtime = 106.56 sec % % note: make sure paths to training set is correct % % (C) 2016 by Joseph Miller % created: 5/10/2016 clear all close all tic; % initialize variables imHeight = 40; imSize = imHeight * imHeight; num_samples = 6283; num_test = 500; type = input('Press 1 to augment the training samples (default = enter) : '); if isempty(type) num_augmented = 0; else num_augmented = num_samples - num_test; end trainReshaped = zeros(imSize,num_samples+num_augmented); %% Create the feature vectors path = '..\train\'; for i = 1:num_samples file = strcat(path, int2str(i), '.bmp'); % get filename of each file % load each file and resize to imageSize image = imresize(imread(file), [imHeight imHeight]); % if the image has 3 dimensions, it is rgb --> convert to grayscale % else the image is 2D meaning it is already grayscale if(length(size(image)) == 3) image = rgb2gray(image); end % maximize contrast of images image = imadjust(image); % Samples 1-999 will be reserved for testing. Do not augment them. % Samples 1000-6284 will be used for training and therefore will be % augmented.

Page 19: ^ v ] ]À o Zv µu ] Z Z }Pv] ]}vhomepages.cae.wisc.edu/~ece539/project/s16/Miller_rpt.pdf · Each day, malicious programmers and criminals find new ways ... Three networks and their

% Apply one data augmentation technique to each training sample if i > num_samples - num_augmented aug_image = image; % % randomly apply edge detection % if mod(round(rand*10),2) == 0 % aug_image = edge(aug_image,'canny'); % end % randomly translate in any direction by 8 pixels if mod(round(rand*10),2) == 0 aug_image = imtranslate(aug_image,[rand*8, 0],'OutputView','same'); end if mod(round(rand*10),2) == 0 aug_image = imtranslate(aug_image,[-rand*8, 0],'OutputView','same'); end if mod(round(rand*10),2) == 0 aug_image = imtranslate(aug_image,[0, rand*8],'OutputView','same'); end if mod(round(rand*10),2) == 0 aug_image = imtranslate(aug_image,[0, -rand*8],'OutputView','same'); end % randomly rotate between -10 and 10 degrees if mod(round(rand*10),2) == 0 aug_image = imrotate(aug_image, rand*10, 'crop'); else aug_image = imrotate(aug_image, -rand*10, 'crop'); end % randomly invert the color if mod(round(rand*10),2) == 0 aug_image = imcomplement(aug_image); end % figure, imagesc(aug_image); colormap('gray'),title('Aug') % reshape aug_image to a column vector trainReshaped(:,i+num_augmented) = reshape(aug_image, imSize, 1); end % figure, imagesc(image); colormap('gray'), title('Orig') % reshape image to a column vector trainReshaped(:,i) = reshape(image, imSize, 1); end %% Create the labels % load labels fid = fopen('../trainLabels.csv'); labels_in = textscan(fid, '%d %c', 'HeaderLines', 1, 'Delimiter', ','); fclose(fid);

Page 20: ^ v ] ]À o Zv µu ] Z Z }Pv] ]}vhomepages.cae.wisc.edu/~ece539/project/s16/Miller_rpt.pdf · Each day, malicious programmers and criminals find new ways ... Three networks and their

% convert char labels to ASCII code labels labels_in = labels_in{2} - 0; % convert ASCII code labels to values 1 to 62 for easier use in indexing labels_in = labels_in - 47; % characters 0-9 are now in range labels_in(labels_in > 10) = labels_in(labels_in > 10) - 7; % A-Z in range labels_in(labels_in > 36) = labels_in(labels_in > 36) - 6; % a-z in range % there are 62 classes of characters: 10 ints, 26 upper case, 26 lower case labels = zeros(num_samples + num_augmented, 62); % convert to proper format where each row contains a single 1 to specify % the class. for i = 1:num_samples labels(i, labels_in(i)) = 1; end % Replicate the labels of the augmented data if ~isempty(type) labels(num_samples+1:end,:) = labels(num_test+1:num_samples,:); end trainLabels = labels(num_test+1:end,:)'; testLabels = labels(1:num_test,:)'; %% Save the data for use in other programs % save the reshaped images for use in other programs testSet = trainReshaped(:,1:num_test); trainSet = trainReshaped(:,num_test+1:end); save('trainSet.mat', 'trainSet'); save('testSet.mat', 'testSet'); % save the labels for use in other programs save('trainLabels.mat', 'trainLabels'); save('testLabels.mat', 'testLabels'); %% Convert test samples to proper formatting path = '..\test\'; testFullSet = zeros(1600,12503-6284); for i = 6284:12503 file = strcat(path, int2str(i), '.bmp'); % get filename of each file % load each file and resize to imageSize image = imresize(imread(file), [imHeight imHeight]); % if the image has 3 dimensions, it is rgb --> convert to grayscale % else the image is 2D meaning it is already grayscale if(length(size(image)) == 3) image = rgb2gray(image); end % maximize contrast of images image = imadjust(image); index = i-6283;

    % reshape image to a column vector
    testFullSet(:,index) = reshape(image, imSize, 1);
end

% save the reshaped matrices for use in other programs
save('testFullSet.mat', 'testFullSet');

toc

% neuralNetwork.m
%
% Solve a Pattern Recognition Problem with a Neural Network
% Script generated by Neural Pattern Recognition app
% Created Tue May 10 10:21:32 CDT 2016
%
% This script assumes these variables are defined:
%
%   trainReshaped - input data.
%   labels - target data.

% [300 300 300] -> 47%
% [100] -> 29%
% [200 200 200] -> 46%
% [300 300 300 300] -> 48%
% [300 300 300 300 300] -> 47%

tic;

load testSet;
load trainSet;
load testLabels;
load trainLabels;
load testFullSet

type = input('Press enter to use simplified labels: ');
if isempty(type)
    load labelsSIMP;
    testLabels = labelsSIMP(:,1:length(testLabels));
    trainLabels = [labelsSIMP(:,length(testLabels)+1:end) labelsSIMP(:,length(testLabels)+1:end)];
end

type = input('Press enter to retrain the network: ');
if isempty(type)
    % Create a Pattern Recognition Network
    hiddenLayerSize = [300 300 300 300];
    net = patternnet(hiddenLayerSize);

    % Setup Division of Data for Training, Validation, Testing
    net.divideParam.trainRatio = 70/100;
    net.divideParam.valRatio = 30/100;
    net.divideParam.testRatio = 0/100;

    % Train the Network
    [net, tr] = train(net, trainSet, trainLabels);

    out_labels = net(testSet);
    labels_num = vec2ind(testLabels);
    out_labels_num = vec2ind(out_labels);
    percentErrors = sum(labels_num ~= out_labels_num)/numel(labels_num);
    classificationRate = 1 - percentErrors;

    % View the Network
    view(net)

    % Plots
    % Uncomment these lines to enable various plots.
    %figure, plotperform(tr)
    %figure, plottrainstate(tr)
    %figure, plotconfusion(t,y)
    %figure, plotroc(t,y)

    toc;
end

%% Run the test samples through the network
type = input('Press enter to create submission file: ');
if isempty(type)
    out_labels = net(testFullSet);

    % obtain submission file
    out_labels_ascii = labelToASCII(out_labels);
    submission = [(6284:12503)', out_labels_ascii']';
    fileID = fopen('submission.csv','w');
    fprintf(fileID,'%2s,%5s\n','ID','Class');
    fprintf(fileID,'%d,%c\n',submission);
    fclose(fileID);
end

%% Study the effect of similar characters on misclassification
type = input('Press enter to test the effects of similar characters: ');
if isempty(type)
    out_labels = net(testSet);

    % convert the network outputs and the true labels to ASCII codes
    out_labels_ascii = labelToASCII(out_labels);
    labels_key_ascii = labelToASCII(testLabels);

    % count the samples that match when similar characters are treated as equal
    count = 0;
    for i = 1:length(out_labels)
        if testMatchIgnoreSimilar(out_labels_ascii(i),labels_key_ascii(i)) == 1
            count = count + 1;
        end
    end

    % classification rate of simplified class model
    CR_SIMP = count / length(out_labels);
end

% combine.m
%   - Max vote of MLP, Random Forest, Julia KNN
close all
clear all

%% Read in all of the submission files
% load results of MLP
fid = fopen('./sub300_300_300_300.csv');
labels_MLP = textscan(fid, '%d %c', 'HeaderLines', 1, 'Delimiter', ',');
fclose(fid);

% load results of Random Forest
fid = fopen('../RF_SUBMISSION.csv');
labels_RF = textscan(fid, '%d %s', 'HeaderLines', 1, 'Delimiter', ',');
fclose(fid);

% load results of Julia KNN
fid = fopen('../KNN_SUBMISSION.csv');
labels_KNN = textscan(fid, '%d %s', 'HeaderLines', 1, 'Delimiter', ',');
fclose(fid);

%% Combine the arrays and find the mode for each sample
% convert char labels to ASCII code labels
labels_MLP = labels_MLP{2} - 0;
labels_RF = char(labels_RF{2}) - 0;
labels_RF = labels_RF(:,2);
labels_KNN = char(labels_KNN{2}) - 0;
labels_KNN = labels_KNN(:,2);

% combine the arrays with RF coming first
labels = [labels_RF'; labels_MLP'; labels_KNN'];

% take the max vote of the 3 classifying methods.
labels_final = mode(labels);

% If all three are conflicting, mode() returns the smallest value rather
% than the first row, so fall back explicitly to the RF result because it
% is the most accurate classifier.
all_differ = labels(1,:) ~= labels(2,:) & labels(1,:) ~= labels(3,:) & ...
             labels(2,:) ~= labels(3,:);
labels_final(all_differ) = labels(1, all_differ);

%% obtain submission file
submission = [(6284:12503)', labels_final']';
fileID = fopen('combined.csv','w');
fprintf(fileID,'%2s,%5s\n','ID','Class');
fprintf(fileID,'%d,%c\n',submission);
fclose(fileID);

function labels_ascii = labelToASCII(labels)

%labelToASCII Converts an array of labels to its corresponding ASCII codes
%   label numbers 1-10 correspond to characters 0-9
%   label numbers 11-36 correspond to characters A-Z
%   label numbers 37-62 correspond to characters a-z
%
%   This function is used for faster troubleshooting and verification of
%   results.

% convert each label column to its class number (1 to 62)
labelNums = vec2ind(labels);

% convert to ASCII code
labelNums = labelNums + 47;                                  % 0-9
labelNums(labelNums > 57) = labelNums(labelNums > 57) + 7;   % A-Z
labelNums(labelNums > 90) = labelNums(labelNums > 90) + 6;   % a-z
labels_ascii = labelNums;

end

%% formatLabelsSIMP.m
%   - immediately improves categorization by over 10%
%   - saves labelsORIG.mat and labelsSIMP.mat which contains the labels of
%     the training set with the format specified below
%
% Class labels will go:
%   0 1 2 3 4 5 6 7 8 9 A B C ... X Y Z a b c ... x y z
% but similar characters will be combined into one class. For example,
% 0, O, and o will all be in a single class. Likewise, 1, I, and l will all
% be in a single class. There is a total of 41 classes.
%
% An image can only be of one class so each row will only contain a single
% 1 to specify to which class it belongs. All other slots in the row will
% contain 0s.
%
% (C) 2016 by Joseph Miller
% created: 4/26/2016

close all
clear all

% load labels
fid = fopen('../trainLabels.csv');
labels_in = textscan(fid, '%d %c', 'HeaderLines', 1, 'Delimiter', ',');
fclose(fid);

% convert char labels to ASCII code labels
labels_in = labels_in{2} - 0;

% convert ASCII code labels to values 1 to 62 for easier use in indexing
labels_in = labels_in - 47;                                  % characters 0-9 are now in range
labels_in(labels_in > 10) = labels_in(labels_in > 10) - 7;   % A-Z in range
labels_in(labels_in > 36) = labels_in(labels_in > 36) - 6;   % a-z in range

% there are 62 classes of characters: 10 ints, 26 upper case, 26 lower case
labelsORIG = zeros(length(labels_in), 62);

% convert to proper format where each row contains a single 1 to specify
% the class.
for i = 1:length(labelsORIG)
    labelsORIG(i, labels_in(i)) = 1;
end

% there are 41 classes of characters in the simplified set
labelsSIMP = zeros(length(labels_in), 41);

% convert to proper format where each row contains a single 1 to specify
% the class.
for i = 1:length(labelsSIMP)
    if(labels_in(i) == 1), labelsSIMP(i, 1) = 1;
    elseif(labels_in(i) == 2), labelsSIMP(i, 2) = 1;
    elseif(labels_in(i) == 3), labelsSIMP(i, 3) = 1;
    elseif(labels_in(i) == 4), labelsSIMP(i, 4) = 1;
    elseif(labels_in(i) == 5), labelsSIMP(i, 5) = 1;
    elseif(labels_in(i) == 6), labelsSIMP(i, 6) = 1;
    elseif(labels_in(i) == 7), labelsSIMP(i, 7) = 1;
    elseif(labels_in(i) == 8), labelsSIMP(i, 8) = 1;

    elseif(labels_in(i) == 9), labelsSIMP(i, 9) = 1;
    elseif(labels_in(i) == 10), labelsSIMP(i, 10) = 1;
    elseif(labels_in(i) == 11), labelsSIMP(i, 5) = 1;
    elseif(labels_in(i) == 12), labelsSIMP(i, 9) = 1;
    elseif(labels_in(i) == 13), labelsSIMP(i, 11) = 1;
    elseif(labels_in(i) == 14), labelsSIMP(i, 12) = 1;
    elseif(labels_in(i) == 15), labelsSIMP(i, 13) = 1;
    elseif(labels_in(i) == 16), labelsSIMP(i, 14) = 1;
    elseif(labels_in(i) == 17), labelsSIMP(i, 15) = 1;
    elseif(labels_in(i) == 18), labelsSIMP(i, 16) = 1;
    elseif(labels_in(i) == 19), labelsSIMP(i, 2) = 1;
    elseif(labels_in(i) == 20), labelsSIMP(i, 17) = 1;
    elseif(labels_in(i) == 21), labelsSIMP(i, 18) = 1;
    elseif(labels_in(i) == 22), labelsSIMP(i, 19) = 1;
    elseif(labels_in(i) == 23), labelsSIMP(i, 20) = 1;
    elseif(labels_in(i) == 24), labelsSIMP(i, 21) = 1;
    elseif(labels_in(i) == 25), labelsSIMP(i, 1) = 1;
    elseif(labels_in(i) == 26), labelsSIMP(i, 22) = 1;
    elseif(labels_in(i) == 27), labelsSIMP(i, 23) = 1;
    elseif(labels_in(i) == 28), labelsSIMP(i, 24) = 1;
    elseif(labels_in(i) == 29), labelsSIMP(i, 6) = 1;
    elseif(labels_in(i) == 30), labelsSIMP(i, 25) = 1;
    elseif(labels_in(i) == 31), labelsSIMP(i, 26) = 1;
    elseif(labels_in(i) == 32), labelsSIMP(i, 27) = 1;
    elseif(labels_in(i) == 33), labelsSIMP(i, 28) = 1;
    elseif(labels_in(i) == 34), labelsSIMP(i, 29) = 1;
    elseif(labels_in(i) == 35), labelsSIMP(i, 30) = 1;
    elseif(labels_in(i) == 36), labelsSIMP(i, 3) = 1;
    elseif(labels_in(i) == 37), labelsSIMP(i, 31) = 1;
    elseif(labels_in(i) == 38), labelsSIMP(i, 7) = 1;
    elseif(labels_in(i) == 39), labelsSIMP(i, 11) = 1;
    elseif(labels_in(i) == 40), labelsSIMP(i, 32) = 1;
    elseif(labels_in(i) == 41), labelsSIMP(i, 33) = 1;
    elseif(labels_in(i) == 42), labelsSIMP(i, 34) = 1;
    elseif(labels_in(i) == 43), labelsSIMP(i, 35) = 1;
    elseif(labels_in(i) == 44), labelsSIMP(i, 36) = 1;
    elseif(labels_in(i) == 45), labelsSIMP(i, 37) = 1;
    elseif(labels_in(i) == 46), labelsSIMP(i, 17) = 1;
    elseif(labels_in(i) == 47), labelsSIMP(i, 18) = 1;
    elseif(labels_in(i) == 48), labelsSIMP(i, 2) = 1;
    elseif(labels_in(i) == 49), labelsSIMP(i, 20) = 1;
    elseif(labels_in(i) == 50), labelsSIMP(i, 38) = 1;
    elseif(labels_in(i) == 51), labelsSIMP(i, 1) = 1;
    elseif(labels_in(i) == 52), labelsSIMP(i, 22) = 1;
    elseif(labels_in(i) == 53), labelsSIMP(i, 39) = 1;
    elseif(labels_in(i) == 54), labelsSIMP(i, 40) = 1;
    elseif(labels_in(i) == 55), labelsSIMP(i, 6) = 1;
    elseif(labels_in(i) == 56), labelsSIMP(i, 41) = 1;
    elseif(labels_in(i) == 57), labelsSIMP(i, 26) = 1;
    elseif(labels_in(i) == 58), labelsSIMP(i, 27) = 1;
    elseif(labels_in(i) == 59), labelsSIMP(i, 28) = 1;
    elseif(labels_in(i) == 60), labelsSIMP(i, 29) = 1;
    elseif(labels_in(i) == 61), labelsSIMP(i, 30) = 1;
    elseif(labels_in(i) == 62), labelsSIMP(i, 3) = 1;
    end
end

labelsSIMP = labelsSIMP';

% save the matrix for use in other programs

save('labelsSIMP.mat', 'labelsSIMP');

function match = testMatchIgnoreSimilar( key, data )
% testMatchIgnoreSimilar
%   - Returns 1 if the key and the data match.
%   - Differences between similar characters are ignored.

key = char(key);
data = char(data);
match = 0;

if key == 'O' | key == 'o' | key == '0' | key == 'Q'
    if data == 'O' | data == 'o' | data == '0' | data == 'Q'
        match = 1;
    end
elseif key == '1' | key == 'l' | key == 'I'
    if data == '1' | data == 'l' | data == 'I'
        match = 1;
    end
elseif key == '2' | key == 'Z' | key == 'z' | key == 'N'
    if data == '2' | data == 'Z' | data == 'z' | data == 'N'
        match = 1;
    end
elseif key == '3' | key == 'm' | key == 'E' | key == 'W' | key == 'w' | ...
        key == 'M'
    if data == '3' | data == 'm' | data == 'E' | data == 'W' | data == 'w' | ...
            data == 'M'
        match = 1;
    end
elseif key == '4' | key == 'A'
    if data == '4' | data == 'A'
        match = 1;
    end
elseif key == '5' | key == 'S' | key == 's'
    if data == '5' | data == 'S' | data == 's'
        match = 1;
    end
elseif key == '6' | key == 'b' | key == 'P' | key == 'p' | key == 'q' | ...
        key == '9' | key == 'g'
    if data == '6' | data == 'b' | data == 'P' | data == 'p' | data == 'q' | ...
            data == '9' | data == 'g'
        match = 1;
    end
elseif key == '8' | key == 'B'
    if data == '8' | data == 'B'
        match = 1;
    end
elseif key == 'C' | key == 'c' | key == 'U' | key == 'u' | key == 'n'
    if data == 'C' | data == 'c' | data == 'U' | data == 'u' | data == 'n'
        match = 1;
    end
elseif key == 'J' | key == 'j'
    if data == 'J' | data == 'j'
        match = 1;
    end
elseif key == 'K' | key == 'k'
    if data == 'K' | data == 'k'
        match = 1;
    end
elseif key == 'T' | key == '7' | key == 'L'
    if data == 'T' | data == '7' | data == 'L'
        match = 1;
    end
elseif key == 'V' | key == 'v'
    if data == 'V' | data == 'v'
        match = 1;
    end
elseif key == 'X' | key == 'x'
    if data == 'X' | data == 'x'
        match = 1;
    end
elseif key == 'Y' | key == 'y' | key == 'h'
    % compare data (not key) against 'h'; testing key here would make any
    % data match whenever the key is 'h'
    if data == 'Y' | data == 'y' | data == 'h'
        match = 1;
    end
else
    if key == data
        match = 1;
    end
end

end

% human.m
%   - grade the human classification
close all
clear all

%% Read in all of the submission files
% load labels of human classification
fid = fopen('./participant_test.csv');
labels_human = textscan(fid, ...
    '%d %c %d %c %d %c %d %c %d %c %d %c %d %c %d %c %d %c %d %c', ...
    'HeaderLines', 2, 'Delimiter', ',');
fclose(fid);

% load training labels
fid = fopen('../trainLabels.csv');
labels = textscan(fid, '%d %c', 'HeaderLines', 1, 'Delimiter', ',');
fclose(fid);

%% Grade each participant's answers against the training labels
% convert char labels to ASCII code labels
% (odd columns hold sample IDs, even columns hold the participant's answers)
human_array = zeros(100,20);
for i = 1:20
    human_array(:,i) = labels_human{i}(1:100) - 0;
end
labels = labels{2} - 0;

% count the correct answers of each of the 10 participants
counts = zeros(1,10);
for i = 1:10
    for j = human_array(1,i*2-1):(human_array(end,i*2-1))
        if human_array(j-human_array(1,i*2-1)+1,i*2) == labels(j)
            counts(i) = counts(i) + 1;
        end
    end
end

classification_rates = counts/(length(human_array));
classification_rate_total = mean(classification_rates);