
Page 1

Final Project: Project 9

Part 1: Neural Networks
Part 2: Overview of Classifiers

Aparna S. Varde

April 28, 2005

CS539: Machine Learning
Course Instructor: Prof. Carolina Ruiz

Page 2

Part 1: Neural Networks

• Data: The data sets used in this project are as follows (a loading sketch follows the list).
• CPU Data Set:
  – Attributes describe features of computer CPUs, such as vendor.
  – The target attribute denotes CPU performance and is real-valued.
  – The other attributes are a mixture of real and nominal.
  – 8 attributes, 209 instances.
• Iris Data Set:
  – Attributes describe features of Iris flowers, such as sepals and petals.
  – The target attribute denotes the species of Iris and is nominal.
  – 5 attributes, 150 instances.
• Covtype Data Set:
  – Attributes describe features of forests, such as soil type and elevation.
  – The target attribute denotes the forest cover type and is nominal.
  – 55 attributes, approximately 58,000 instances.
  – The attributes describe 12 features; some are Boolean (type1, type2, etc.), giving 54 attributes plus the target.
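The slides report results from the WEKA GUI; for reference, here is a minimal sketch of loading these data sets through the WEKA Java API. The ARFF file names are assumptions, not taken from the slides:

```java
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class LoadData {
    public static void main(String[] args) throws Exception {
        // File names are assumptions; the slides do not give exact paths.
        Instances cpu     = new DataSource("cpu.arff").getDataSet();
        Instances iris    = new DataSource("iris.arff").getDataSet();
        Instances covtype = new DataSource("covtype.arff").getDataSet();

        for (Instances data : new Instances[]{cpu, iris, covtype}) {
            // WEKA does not guess the target; by convention it is the last attribute.
            data.setClassIndex(data.numAttributes() - 1);
            System.out.println(data.relationName() + ": "
                    + data.numAttributes() + " attributes, "
                    + data.numInstances() + " instances");
        }
    }
}
```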

Page 3

Preprocessing of Data

• Resampling:
  – The WEKA instance-based unsupervised filter was used as a preprocessing step on the Covtype data to select subsets of instances for running experiments.
  – Three subsets were selected, with 5000, 3000, and 1000 instances respectively.
  – This was done to observe the impact of the neural network on data sets of different sizes.
• Supervised Discretization:
  – Used to convert continuous attributes to ranges for the Iris data, to observe the impact on accuracy.
  – Supervised discretization was done with the default MDL method, since the target class is nominal.
  – Done using the WEKA attribute-based supervised preprocessing filter.
  – Followed by nominal-to-binary conversion, which is done directly in the neural net. This will be discussed in the experiments section.
• Unsupervised Discretization:
  – Done for the CPU data set only, to be able to compare it with other classifiers such as decision trees and ZeroR.
  – The target is initially not nominal, so simple discretization was done by binning. Discretization was done for this data because the J4.8 classifier used for comparison works with nominal targets only.
  – Discretization was done using the WEKA attribute-based unsupervised preprocessing filter (all three filters are sketched below).
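A minimal sketch of the three preprocessing steps with the WEKA Java API (WEKA 3.x class names; the file names and the bin count are assumptions):

```java
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;
import weka.filters.Filter;
import weka.filters.unsupervised.instance.Resample;

public class Preprocess {
    public static void main(String[] args) throws Exception {
        // Unsupervised instance resampling: draw a ~1000-instance Covtype subset.
        Instances covtype = new DataSource("covtype.arff").getDataSet();
        covtype.setClassIndex(covtype.numAttributes() - 1);
        Resample resample = new Resample();
        resample.setSampleSizePercent(100.0 * 1000 / covtype.numInstances());
        resample.setInputFormat(covtype);
        Instances cov1000 = Filter.useFilter(covtype, resample);

        // Supervised MDL-based discretization (Iris): the nominal class must be set first.
        Instances iris = new DataSource("iris.arff").getDataSet();
        iris.setClassIndex(iris.numAttributes() - 1);
        weka.filters.supervised.attribute.Discretize mdl =
                new weka.filters.supervised.attribute.Discretize();
        mdl.setInputFormat(iris);
        Instances irisDisc = Filter.useFilter(iris, mdl);

        // Unsupervised binning of the numeric CPU target. Applied before the
        // target is declared as the class, since filters skip the class attribute.
        Instances cpu = new DataSource("cpu.arff").getDataSet();
        weka.filters.unsupervised.attribute.Discretize bins =
                new weka.filters.unsupervised.attribute.Discretize();
        bins.setAttributeIndices("last");   // the CPU performance column
        bins.setBins(10);                   // bin count is an assumption
        bins.setInputFormat(cpu);
        Instances cpuDisc = Filter.useFilter(cpu, bins);
        cpuDisc.setClassIndex(cpuDisc.numAttributes() - 1);
    }
}
```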

Page 4

Experiments with Covertype Data

Experiments conducted (4-fold cross-validation used for testing; see the sketch after this list):

1. Data set size: 1000, 3000, 5000 instances, other parameters default.
2. Learning Rate: 0.1, 0.2, 0.3 … 1.0, with best settings from above.
3. Momentum: 0.1, 0.2, 0.3 … 1.0, with best settings from above.
4. Number of epochs: 100, 200, 300 … 1000, with best settings from above.
5. Validation Set: 0%, 5%, 10%, 15% … 50% of the data set, with best settings from above.
6. Validation Threshold: 10, 20, 30 … 100 epochs, with best settings from above.
7. Number of Hidden Layer units: a, i, o, t, with best settings from above, where:
   • i = number of input values
   • o = number of output classes
   • t = i + o
   • a = (i + o)/2
8. Two Hidden Layers: “x,a” … “x,t”, where “x” is the best setting from above.
9. Normalization: True/False, experiments for default and best settings.
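As an illustration of how such a sweep can be scripted (a sketch, not the actual script used for the project), here is the learning-rate sweep of Experiment 2 using WEKA's MultilayerPerceptron with 4-fold cross-validation; the other sweeps only change which setter is varied. Setter names follow WEKA 3.x and the subset file name is an assumption:

```java
import java.util.Random;
import weka.classifiers.Evaluation;
import weka.classifiers.functions.MultilayerPerceptron;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class LearningRateSweep {
    public static void main(String[] args) throws Exception {
        // The resampled 1000-instance Covtype subset (file name assumed).
        Instances data = new DataSource("covtype-1000.arff").getDataSet();
        data.setClassIndex(data.numAttributes() - 1);

        for (int i = 1; i <= 10; i++) {
            double lr = i / 10.0;                  // 0.1, 0.2, ..., 1.0
            MultilayerPerceptron mlp = new MultilayerPerceptron();
            mlp.setLearningRate(lr);
            mlp.setMomentum(0.2);                  // defaults from the slides
            mlp.setTrainingTime(500);              // number of epochs
            mlp.setValidationSetSize(0);           // percent of data held out
            mlp.setHiddenLayers("a");              // a = (attribs + classes) / 2

            long start = System.currentTimeMillis();
            Evaluation eval = new Evaluation(data);
            eval.crossValidateModel(mlp, data, 4, new Random(1));  // 4-fold CV
            long elapsed = System.currentTimeMillis() - start;

            System.out.printf("LR=%.1f  accuracy=%.2f%%  time=%d ms%n",
                    lr, eval.pctCorrect(), elapsed);
        }
    }
}
```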

Page 5

Experiment 1: Effect of Data Set Size

• The highest accuracy is obtained for the data set with 1000 instances.
• This 1000-instance data set also requires the least time to model.
• The lowest accuracy is obtained for the 3000-instance data set.
• The highest time to model was with the 5000-instance data set.
• Based on this, the 1000-instance data set was selected for the remaining experiments.

• Default Settings: Learning Rate = 0.3, Momentum = 0.2, Number of epochs = 500, Validation Set = 0, Number of Hidden Layer Units = “a”, Normalization = True.

Page 6

Experiment 2: Effect of Learning Rate

• The maximum time to model is 202.59 seconds, for a learning rate of 0.1.
• The minimum time to model is 184.57 seconds, for a learning rate of 0.3.
• The lowest accuracy, 86%, is for a learning rate of 1.0.
• In general, as the learning rate increases, accuracy tends to decrease. Time to model also decreases, though the drop in time is not as steady as the drop in accuracy.
• The highest accuracy, 87.85%, is obtained for learning rates of 0.2 and 0.4.
• However, the time to model is less for a learning rate of 0.4 than for 0.2.
• Thus the learning rate of 0.4 is selected as the setting for further experiments.

• Settings: Data Size = 1000, Momentum = 0.2, Number of epochs = 500, Validation Set = 0, Number of Hidden Layer Units = “a”, Normalization = True.

Page 7

Experiment 3: Effect of Momentum

• The lowest accuracy, 34.28%, is obtained for a momentum of 0.9.
• The highest accuracy, 87.85%, is obtained for a momentum of 0.2.
• The longest time to model is 204.36 seconds, for a momentum of 0.8.
• The shortest time to model is 178.06 seconds, for a momentum of 1.0.
• In general, accuracy drops after a momentum of 0.7 and gets very low for a momentum of 1.0.
• The setting selected for further experiments is a momentum of 0.2, since it gives the highest accuracy of 87.85%.

• Settings: Data Size = 1000, Learning Rate = 0.4, Number of epochs = 500, Validation Set = 0, Number of Hidden Layer Units = “a”, Normalization = True.

Page 8

Experiment 4: Effect of Number of Epochs

• Accuracy increases as the number of epochs increases.
• The time to model naturally increases as the number of epochs increases.
• The best accuracy, 88.17%, is obtained for 900 epochs.
• The lowest accuracy, 84%, is for 100 epochs.
• The setting used for further experiments is Number of Epochs = 900.

• Settings: Data Size = 1000, Learning Rate = 0.4, Momentum = 0.2, Validation Set = 0, Number of Hidden Layer Units = “a”, Normalization = True.

Page 9

Experiment 5: Effect of Validation Set Size

• As the size of the validation set increases, the training time tends to go down in most cases.
• Accuracy tends to go down as the validation set size increases.
• The best accuracy is actually obtained for a validation set size of 0%, i.e., no validation set. However, this model runs the risk of overfitting the training data.
• Hence the setting selected for further experiments is one that is likely to avoid overfitting, i.e., one with a validation set.
• With a validation set of 50%, the accuracy is as low as 81%; this could be because less data is available for training.
• The setting selected is a validation set of 10%:
  – First, this model gives very high accuracy.
  – Second, this model is considerably fast compared to the others.

• Settings: Data Size = 1000, Learning Rate = 0.4, Momentum = 0.2, Number of Epochs = 900, Number of Hidden Layer Units = “a”, Normalization = True.
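The validation-set size used here and the validation threshold of the next experiment map onto two MultilayerPerceptron setters. A minimal sketch, assuming WEKA 3.x names:

```java
import weka.classifiers.functions.MultilayerPerceptron;

public class EarlyStoppingSettings {
    public static void main(String[] args) {
        MultilayerPerceptron mlp = new MultilayerPerceptron();
        // Percentage of the training data held out for early stopping;
        // 0 trains on everything (no early stopping, risk of overfitting).
        mlp.setValidationSetSize(10);
        // Number of consecutive epochs the validation error may worsen
        // before training stops (the "validation threshold" of Experiment 6).
        mlp.setValidationThreshold(20);
        System.out.println(String.join(" ", mlp.getOptions()));
    }
}
```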

Page 10

Experiment 6: Effect of Validation Threshold

• Accuracy stays constant at and after a validation threshold of 50.
• The time to model is also more or less the same after a validation threshold of 50.
• Validation thresholds of 10 and 20 require distinctly less time to model than the others.
• The setting selected for further experiments is the one that gives the highest accuracy, 87.79%: a validation threshold of 20, with a time to model of 32.48 seconds.

• Settings: Data Size = 1000, Learning Rate = 0.4, Momentum = 0.2, Validation Set = 10, Number of Epochs = 900, Number of Hidden Layer Units = “a”, Normalization = True.

Page 11

Experiment 7: Effect of Number of Units

• The number of hidden units t = i + o requires the longest time to model and gives the lowest accuracy.

• The number of hidden units a = (i + o)/2 requires the shortest time to model and gives the highest accuracy.

• The setting selected for the next experiments is number of hidden units = a, which gives an accuracy of 87.79% and a time to model of 32.48 seconds.

• Settings: Data Size = 1000, Learning Rate = 0.4, Momentum = 0.2, Validation Set = 10, Number of Epochs = 900, Validation Threshold = 20, Normalization = True.

Page 12

Experiment 8: Effect of 2 Hidden Layers

• The highest time to model and also the lowest accuracy are obtained for “a,i”, meaning “a” units in the first hidden layer and “i” units in the second.
• The fastest model is obtained with the “a,o” topology.
• The highest accuracy is with the “a,a” topology. This also happens to be the highest accuracy in all the experiments so far.
• Hence this is considered the best overall setting and is used for the next experiment (a sketch of the topology sweep follows).

• Settings: Data Size = 1000, Learning Rate = 0.4, Momentum = 0.2, Validation Set = 10, Number of Epochs = 900, Validation Threshold = 20, Normalization = True.
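The topology strings of Experiments 7 and 8 match the wildcards WEKA's MultilayerPerceptron accepts in its hidden-layers string, so both sweeps can be sketched together (covtype-1000.arff is an assumed file name):

```java
import java.util.Random;
import weka.classifiers.Evaluation;
import weka.classifiers.functions.MultilayerPerceptron;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class TopologySweep {
    public static void main(String[] args) throws Exception {
        Instances data = new DataSource("covtype-1000.arff").getDataSet();
        data.setClassIndex(data.numAttributes() - 1);

        // "a" = (attribs + classes)/2, "i" = attribs, "o" = classes,
        // "t" = attribs + classes; a comma adds a second hidden layer.
        for (String h : new String[]{"a", "i", "o", "t",
                                     "a,a", "a,i", "a,o", "a,t"}) {
            MultilayerPerceptron mlp = new MultilayerPerceptron();
            mlp.setLearningRate(0.4);              // best settings so far
            mlp.setMomentum(0.2);
            mlp.setTrainingTime(900);
            mlp.setValidationSetSize(10);
            mlp.setValidationThreshold(20);
            mlp.setHiddenLayers(h);
            Evaluation eval = new Evaluation(data);
            eval.crossValidateModel(mlp, data, 4, new Random(1));
            System.out.printf("H=%-4s accuracy=%.2f%%%n", h, eval.pctCorrect());
        }
    }
}
```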

Page 13

Experiment 9: Effect of Normalized Attributes

• The settings without normalization give distinctly lower accuracy than those with normalization, implying that normalization favorably affects accuracy.

• However, the settings with normalization require much more time to model, implying that normalization makes learning slower.

• The best accuracy obtained in all the Covertype experiments with neural nets so far is 89.14%, with the best settings from the previous experiments and with normalization. The time required to build this model is 508.72 seconds.

• Best Settings: Learning Rate = 0.4, Momentum = 0.2, Validation Set = 10, Number of Epochs = 900, Validation Threshold = 20, Hidden Units = “a,a”.

• Default Settings: Learning Rate = 0.3, Momentum = 0.2, Number of epochs = 500, Validation Set = 0, Number of Hidden Layer Units = “a”
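A minimal sketch of the toggle tested here, assuming WEKA 3.x naming; note that in the API normalization is a plain boolean, whereas the corresponding command-line flag is an inverse "do not normalize" switch (-I, an assumption worth verifying against your WEKA version):

```java
import weka.classifiers.functions.MultilayerPerceptron;

public class NormalizationToggle {
    public static void main(String[] args) {
        MultilayerPerceptron mlp = new MultilayerPerceptron();
        // The "Normalization = False" runs of this experiment.
        mlp.setNormalizeAttributes(false);
        System.out.println(String.join(" ", mlp.getOptions()));
    }
}
```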

Page 14

Experiments with CPU and Iris Data

• CPU
  – Exp 1: Learning Rate varied from 0.1 to 1.0, other parameters default.
  – Exp 2: Momentum varied from 0.1 to 1.0, with best settings from above.
  – Exp 3: Normalize Numeric Class, True/False, with default and best settings.
• Iris
  – Exp 1: Number of units in 1 hidden layer as “a”, “i”, “o”, “t”, with other parameters default.
  – Exp 2: Number of units in 2 hidden layers, with the 1st layer having the best setting from above.
  – Exp 3: Nominal-to-Binary Conversion, True/False, with default settings and best settings overall.

Page 15

CPU Experiment 1: Effect of Learning Rate

• The highest correlation coefficient is observed for a learning rate of 0.1.
• The lowest correlation coefficient is for a learning rate of 1.0.
• In general, the correlation coefficient decreases as the learning rate increases.
• The time to model is almost the same across settings and is very fast compared to the Covertype data set.
• For the next experiment, the setting selected is the one that gives the best correlation, i.e., a learning rate of 0.1.

• Settings: Momentum = 0.2, Number of epochs = 500, Validation Set = 0, Number of Hidden Layer Units = “a”, Normalize Numeric Class = True.
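Since the CPU target is numeric, WEKA reports a correlation coefficient instead of accuracy for these runs. A sketch of one such run (file name assumed; setter names as in WEKA 3.x):

```java
import java.util.Random;
import weka.classifiers.Evaluation;
import weka.classifiers.functions.MultilayerPerceptron;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class CpuRegression {
    public static void main(String[] args) throws Exception {
        Instances cpu = new DataSource("cpu.arff").getDataSet();
        cpu.setClassIndex(cpu.numAttributes() - 1);    // numeric target

        MultilayerPerceptron mlp = new MultilayerPerceptron();
        mlp.setLearningRate(0.1);                      // best value found here
        mlp.setNormalizeNumericClass(true);            // the switch of CPU Experiment 3

        Evaluation eval = new Evaluation(cpu);
        eval.crossValidateModel(mlp, cpu, 4, new Random(1));
        // With a numeric class, WEKA reports correlation rather than accuracy.
        System.out.printf("correlation = %.4f%n", eval.correlationCoefficient());
    }
}
```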

Page 16

CPU Experiment 2: Effect of Momentum

• The highest correlation is achieved for a momentum of 0.1.
• The lowest correlation is achieved for a momentum of 1.0.
• In most cases, the correlation coefficient tends to decrease as the momentum increases.
• The best setting selected is the one that shows the highest correlation: a momentum of 0.1.

• Settings: Learning Rate = 0.1, Number of epochs = 500, Validation Set = 0, Number of Hidden Layer Units = “a”, Normalize Numeric Class = True.

Page 17

CPU Experiment 3: Effect of Normalizing Numeric Class

• The default settings with no normalization give a negative correlation coefficient, implying that the attributes are not well correlated.
• For both settings, the correlation coefficient increases with normalization.
• The best overall setting for the CPU data set is the last one in the above table, i.e., learning rate = 0.1, momentum = 0.1, normalize numeric class = true, and other parameters default.

• Best Settings: Learning Rate = 0.1, Momentum = 0.1

• Default Settings: Learning Rate = 0.3, Momentum = 0.2

Page 18

Iris Experiment 1: Effect of Units in One Hidden Layer

• The highest accuracy, 98%, is observed for number of units = “i”.
• The lowest accuracy, and also the longest time to model, is observed for number of units = “t”.
• The shortest time to model is for number of units = “o”.
• The best setting selected is number of units = “i”, because it gives the highest accuracy of 98%.

• Settings: Learning Rate = 0.3, Momentum = 0.2, Number of epochs = 500, Validation Set = 0, Nominal to Binary = True

Page 19

Iris Experiment 2: Effect of Units in Two Hidden Layers

• In general, two hidden layers give lower accuracy than one hidden layer for this data set.
• The best accuracy obtained is for the “i,a” and “i,o” settings; however, this is still lower than the best accuracy with 1 hidden layer.
• The lowest accuracy is for the “i,i” and “i,t” topologies.
• The time to model is the longest with the “i,t” topology.
• The fastest time to model is with the “i,a” topology.

• Settings: Learning Rate = 0.3, Momentum = 0.2, Number of epochs = 500, Validation Set = 0, Units in 1st Hidden Layer = “i”, Nominal to Binary = True

Page 20

Iris Experiment 3: Effect of Nominal To Binary Conversion

• The best accuracy obtained is 96%, which is still lower than the best one with the raw data set. This is without nominal-to-binary conversion.

• The lowest accuracy is obtained for the same settings with nominal-to-binary conversion (a sketch of this toggle follows).

• Default Topology: Hidden Units = “a”

• Best Topology: Hidden Units = “i”

• Data: Discretized Data Set
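A sketch of the nominal-to-binary toggle on the discretized Iris data (file name assumed; setter name as in WEKA 3.x):

```java
import java.util.Random;
import weka.classifiers.Evaluation;
import weka.classifiers.functions.MultilayerPerceptron;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class NominalToBinaryToggle {
    public static void main(String[] args) throws Exception {
        // Discretized Iris data from the preprocessing step (assumed file).
        Instances iris = new DataSource("iris-discretized.arff").getDataSet();
        iris.setClassIndex(iris.numAttributes() - 1);

        for (boolean n2b : new boolean[]{true, false}) {
            MultilayerPerceptron mlp = new MultilayerPerceptron();
            mlp.setHiddenLayers("i");               // best Iris topology
            mlp.setNominalToBinaryFilter(n2b);      // the switch under test
            Evaluation eval = new Evaluation(iris);
            eval.crossValidateModel(mlp, iris, 4, new Random(1));
            System.out.printf("nominalToBinary=%b  accuracy=%.2f%%%n",
                    n2b, eval.pctCorrect());
        }
    }
}
```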

Page 21

Summary of Results

• L.R. is Learning Rate, M is Momentum, H is hidden units, V is validation set size (percent), and T is validation threshold.

• Covertype has the longest time to model; Iris has the shortest.
• Iris gives the highest accuracy.

Best Models Obtained

Page 22

Summary (Contd.)

• The CPU data set shows a negative correlation for ZeroR, while the best neural net model shows a very high positive correlation of 0.9967.

• The best accuracy for Iris is 98% with neural networks, which is better than that with decision trees.

• Covertype gives very high accuracy with decision trees, but the best neural net model gives an accuracy of 89.14%, which is even higher.

Comparison with Other Classifiers
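The baselines referred to above are WEKA's ZeroR and J48 (WEKA's C4.5 implementation, often written "J4.8"). A minimal sketch of such a comparison, shown here on Iris:

```java
import java.util.Random;
import weka.classifiers.Classifier;
import weka.classifiers.Evaluation;
import weka.classifiers.rules.ZeroR;
import weka.classifiers.trees.J48;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class BaselineComparison {
    public static void main(String[] args) throws Exception {
        Instances iris = new DataSource("iris.arff").getDataSet();
        iris.setClassIndex(iris.numAttributes() - 1);

        // ZeroR is the majority-class baseline; J48 is the decision tree.
        for (Classifier c : new Classifier[]{new ZeroR(), new J48()}) {
            Evaluation eval = new Evaluation(iris);
            eval.crossValidateModel(c, iris, 4, new Random(1));
            System.out.printf("%s: accuracy=%.2f%%%n",
                    c.getClass().getSimpleName(), eval.pctCorrect());
        }
    }
}
```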

Page 23

Part 2: Overview of Classifiers

• Decision Trees
• Neural Networks
• Bayesian Classifiers
• Genetic Algorithms
• Instance-Based Learning
• Classification Rules
• Final Project: Neural Networks Improved

Page 24
Page 25
Page 26
Page 27

Conclusions

• Machine Learning: Very good course
• Excellent Professor
• Great Classmates
• Very Interactive, Learned a Lot
• Thank you…