
A Comparative Study of Factor Analysis and Principal Component Analysis on Classification Performance Using a Neural Network


Page 1

In the name of Allah, the Most Gracious, the Most Merciful

University of Gezira
Faculty of Mathematical and Computer Sciences

A Dissertation Submitted to the University of Gezira in Partial Fulfillment of the Requirements for the Award of the Degree of Master of Science in Computer Sciences, entitled:

A Comparative Study of Factor Analysis and Principal Component Analysis on Classification Performance Using a Neural Network

By: Abuzer Hussein Ibrahim Ahmed

Supervisor: Dr. Murtada Khalfallah Elbashir

Page 2

Presentation Contents

Introduction
Research problem
Research objectives
Previous studies
Methodology
Results
Conclusions
Recommendations

Page 3

Data mining is the process of analyzing data from different perspectives and summarizing it into useful information that can be used, for example, to increase revenue. The objective of data mining is to identify valid, novel, and understandable correlations and patterns in existing data. Some of its major techniques are classification, association, and clustering; data mining is an active research area, and classification is one of the main problems in the field. Before a dataset can be used for classification it needs preprocessing such as data cleaning, data transformation, and data reduction. The last step is particularly important, because datasets are usually represented in a high-dimensional space and these dimensional spaces are often too large, so the size of the dataset needs to be reduced before a learning algorithm is applied. A common way to address this problem is to use dimensionality reduction techniques.

Introduction

Page 4

It is difficult to extract knowledge from a large amount of data.
It is difficult to maintain the intrinsic information of high-dimensional data when it is transformed to a low-dimensional space for analysis.
It is difficult to visualize data in high dimensions.

Research problem

Page 5

1. To explain the importance of using the Factor Analysis (FA) and Principal Component Analysis (PCA) algorithms with a neural network (NN).
2. To reduce the number of dimensions using the FA and PCA algorithms, obtaining new, reduced features without discarding the information carried by the original dimensions.
3. To compare FA and PCA in terms of performance measures.

Research Objectives

Page 6

Dataset features
→ Dimensionality reduction algorithm: Principal Component Analysis (PCA) or Factor Analysis (FA), applied through the MATLAB dimensionality reduction toolbox (drtoolbox)
→ New (reduced) features
→ Neural network classifier
→ Performance measures: accuracy and Receiver Operating Characteristics (ROC)

(A MATLAB sketch of this pipeline is given below.)

Methodology
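The whole pipeline can be illustrated with a short MATLAB sketch. This is not the thesis code: the variable names, the 10 target dimensions, the 70/30 hold-out split, and the single hidden layer of 10 neurons are all assumptions, and compute_mapping is the drtoolbox function introduced on the next slide.

```matlab
% Illustrative end-to-end pipeline (assumed setup, not the thesis code).
% X: n-by-d feature matrix, y: n-by-1 vector of 0/1 class labels.
[Xred, mapping] = compute_mapping(X, 'PCA', 10);   % or 'FactorAnalysis'; 10 dims assumed

cv  = cvpartition(y, 'HoldOut', 0.3);              % assumed 70/30 train/test split
Xtr = Xred(training(cv), :);  ytr = y(training(cv));
Xte = Xred(test(cv), :);      yte = y(test(cv));

net    = patternnet(10);                           % one hidden layer of 10 neurons (assumed)
net    = train(net, Xtr', [ytr'; 1 - ytr']);       % one-hot targets, row 1 = positive class
scores = net(Xte');                                % 2-by-numel(yte) class probabilities
ypred  = double(scores(1, :) >= 0.5)';             % hard 0/1 predictions

accuracy = mean(ypred == yte);                     % overall classification accuracy
```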

Page 7

The Matlab Toolbox for Dimensionality Reduction (drtoolbox) can be obtained from http://lvdmaaten.github.io/drtoolbox and is free to use. It implements 34 techniques for dimensionality reduction, including Principal Component Analysis (PCA) and Factor Analysis (FA).

Dimensionality Reduction Toolbox (drtoolbox)
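A minimal usage sketch, assuming the toolbox has been added to the MATLAB path; the technique-name strings ('PCA', 'FactorAnalysis') and the target dimensionality of 10 are illustrative values that should be checked against the documentation of the installed drtoolbox version.

```matlab
% X is an n-by-d data matrix; reduce it to 10 dimensions (assumed value).
[X_pca, map_pca] = compute_mapping(X, 'PCA', 10);
[X_fa,  map_fa ] = compute_mapping(X, 'FactorAnalysis', 10);
% The reduced n-by-10 matrices are then used as inputs to the neural network.
```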

Page 8

The computation of Principal Component Analysis (PCA):
1. Calculate the covariance matrix of the input data: Cov(X, Y) = Σ (Xi − mean(X)) (Yi − mean(Y)) / N.
2. Compute the eigenvalues and eigenvectors of the covariance matrix.
3. Form the transition matrix by taking a predefined number of components (eigenvectors).
4. Choose the components (eigenvectors) with the highest eigenvalues.
5. Finally, multiply the original features by the obtained transition matrix, which yields a lower-dimensional representation.

Principal component analysis (PCA)
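The five steps can be written directly in MATLAB. This is a generic sketch of the textbook computation, not the drtoolbox implementation, and the number of retained components k is an assumption.

```matlab
% Generic PCA following the steps above.
% X: n-by-d data matrix, one observation per row; k: assumed number of components.
k  = 10;
Xc = X - mean(X, 1);                    % center the data
C  = cov(Xc);                           % step 1: covariance matrix (d-by-d)
[V, D] = eig(C);                        % step 2: eigenvectors V, eigenvalues on diag(D)
[~, order] = sort(diag(D), 'descend');  % steps 3-4: rank eigenvectors by eigenvalue
W  = V(:, order(1:k));                  % transition matrix of the top-k eigenvectors
Xred = Xc * W;                          % step 5: lower-dimensional representation (n-by-k)
```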

Page 9

The computation of Factor Analysis (FA):

1. Explore and choose the relevant variables for constructing the covariance matrix.
2. Extract the initial factors (via principal components): the covariance matrix Σ is expressed in terms of its eigenvalue-eigenvector pairs (λi, ei), where λ1 ≥ λ2 ≥ … ≥ λp ≥ 0.
3. Choose the number of factors.
4. Choose an estimation method and estimate the model.
5. Estimate the factor loadings.
6. Rotate and interpret the factors.

Factor Analysis (FA)
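As a sketch of these steps, the factoran function in MATLAB's Statistics and Machine Learning Toolbox estimates the loadings by maximum likelihood and returns rotated factor scores. It is named here as a standard tool, not necessarily the implementation used in the thesis, and the number of factors m is an assumed value.

```matlab
% X: n-by-d data matrix; m: assumed number of common factors to extract.
m = 5;
[Lambda, Psi, T, stats, F] = factoran(X, m, 'Rotate', 'varimax');
% Lambda : d-by-m matrix of rotated factor loadings
% Psi    : specific variance of each observed variable
% F      : n-by-m factor scores, used here as the reduced feature set
Xred = F;
```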

Page 10

A neural network is a model of interconnected processing units (neurons) arranged in layers that learns, from one or more input variables, a mapping that determines an outcome; here it is used as the classifier applied to the reduced feature sets.

Neural network (NN)
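As an illustration of this step (the exact architecture used in the thesis is not restated here), a pattern-recognition network with one hidden layer of 10 neurons can be trained on the reduced features; the hidden-layer size and the 70/15/15 data division are assumptions.

```matlab
% Xred: n-by-k reduced features (from PCA or FA); y: n-by-1 vector of 0/1 labels.
inputs  = Xred';                         % patternnet expects features-by-samples
targets = [y'; 1 - y'];                  % one-hot targets: row 1 = positive class

net = patternnet(10);                    % one hidden layer with 10 neurons (assumed)
net.divideParam.trainRatio = 0.70;       % assumed 70/15/15 train/validation/test split
net.divideParam.valRatio   = 0.15;
net.divideParam.testRatio  = 0.15;

net    = train(net, inputs, targets);
scores = net(inputs);                    % row 1 = predicted probability of the positive class
ypred  = double(scores(1, :) >= 0.5)';   % hard class predictions
```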

Page 11

Accuracy: the proportion of true results (both true positives and true negatives) in the population.

Where:
TP: true positives (predicted positive, actual positive)
TN: true negatives (predicted negative, actual negative)
FP: false positives (predicted positive, actual negative)
FN: false negatives (predicted negative, actual positive)

Accuracy = (TP + TN) / (TP + TN + FP + FN)

Performance measures
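Given hard predictions ypred and true labels y, such as those produced in the previous sketch, the four counts and the accuracy follow directly; this is a generic illustration of the formula above.

```matlab
% ypred, y: n-by-1 vectors of predicted and actual labels (1 = positive class).
TP = sum(ypred == 1 & y == 1);   % true positives
TN = sum(ypred == 0 & y == 0);   % true negatives
FP = sum(ypred == 1 & y == 0);   % false positives
FN = sum(ypred == 0 & y == 1);   % false negatives

accuracy = (TP + TN) / (TP + TN + FP + FN);
```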

Page 12

Sensitivity (Recall): the proportion of actual positives that are predicted positive.

Sensitivity = TP / (TP + FN)

Specificity: the proportion of actual negatives that are predicted negative.

Specificity = TN / (TN + FP)

Page 13

Precision: the proportion of predicted positives that are actually positive.

Precision = TP / (TP + FP)

F-score: the harmonic mean of precision and recall; it tries to give a good combination of the two metrics.

F-score = 2 (Precision × Recall) / (Precision + Recall)
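Continuing the same sketch, the remaining scalar measures are computed from the same four counts.

```matlab
sensitivity = TP / (TP + FN);    % recall: actual positives predicted positive
specificity = TN / (TN + FP);    % actual negatives predicted negative
precision   = TP / (TP + FP);    % predicted positives that are actually positive
f_score     = 2 * (precision * sensitivity) / (precision + sensitivity);
```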

Page 14

In a Receiver Operating Characteristics (ROC) curve, the true positive rate (sensitivity) is plotted as a function of the false positive rate, which can be calculated as (1 − specificity), for different cut-off points.

Each point on the ROC curve represents a sensitivity/specificity pair corresponding to a particular decision threshold.

Receiver Operating Characteristics (ROC) analyses

Page 15

The area under the curve (AUC) lies between 0 and 1 and is increasingly recognized as a better measure for evaluating algorithm performance than accuracy: a bigger AUC value implies a better ranking performance for a classifier.

Area under the curve (AUC)
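Both the ROC curve and its AUC can be obtained from the continuous network scores with perfcurve (Statistics and Machine Learning Toolbox); the variable names follow the earlier sketches and are assumptions.

```matlab
% scores(1, :) holds the network's probabilities for the positive class; y are the true labels.
posteriors = scores(1, :)';
[fpr, tpr, ~, auc] = perfcurve(y, posteriors, 1);   % positive-class label assumed to be 1

plot(fpr, tpr);                                     % ROC: sensitivity vs. 1 - specificity
xlabel('False positive rate (1 - Specificity)');
ylabel('True positive rate (Sensitivity)');
title(sprintf('ROC curve, AUC = %.3f', auc));
```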

Page 16

Applying the dimensionality reduction toolbox (drtoolbox) in MATLAB with several performance measures and different datasets, using the FA and PCA algorithms, reveals a number of points:
1. In most of the performance measures (accuracy, specificity, sensitivity, precision, F-score, ROC curves, and area under the curve) the FA algorithm performs better than the PCA algorithm.
2. The FA algorithm gave better results on all the datasets, even though they differ in number of instances, number of attributes, and type of attributes, compared to the PCA algorithm.
3. Extraction of knowledge with FA and the NN is better than with PCA.

Results

Page 17

Dataset                             Accuracy      Specificity   Sensitivity   Precision     F-score
Climate Model Simulation Crashes    0.9452        0.9552        0.8333        0.6250        0.7142
Heart disease                       0.9198        0.9250        0.5000        0.7692        0.6095
Musk (Version 1)                    0.8889        0.8965        0.8824        0.9091        0.8956
Pima Indians Diabetes               0.7619        0.6667        0.8000        0.8571        0.8276
Wine Quality                        0.8400        0.9091        0.7857        0.8967        0.8375

Results of the performance measures for the neural network with the Principal Component Analysis (PCA) algorithm

Page 18

Dataset                             Accuracy      Specificity   Sensitivity   Precision     F-score
Climate Model Simulation Crashes    0.9589        0.9705        0.8000        0.6667        0.7273
Heart disease                       0.9441        0.9602        0.7273        0.5714        0.6395
Musk (Version 1)                    0.9367        0.8845        0.9723        0.9231        0.9471
Pima Indians Diabetes               0.8571        0.7857        0.8929        0.8928        0.8929
Wine Quality                        0.8800        0.9230        0.7857        0.9091        0.8429

Results of the performance measures for the neural network with the Factor Analysis (FA) algorithm

Page 19

4. The ROC curves of FA and PCA on the different datasets indicate that FA is better than PCA.
5. The value of the area under the curve for FA is bigger than the value for PCA, which also indicates that FA is better than PCA.
6. Visualization of the data with FA is better than with PCA.

Results (cont.)

Page 20

1. ROC curve for the Climate Model Simulation Crashes dataset

Page 21

2. ROC curve for the Heart disease dataset

Page 22

3. ROC curve for the Musk (Version 1) dataset

Page 23

4. ROC curve for the Pima Indians Diabetes dataset

Page 24

5. ROC curve for the Wine Quality dataset

Page 25

7. The area under the curve (AUC):

Dataset                             FA       PCA
Climate Model Simulation Crashes    0.866    0.808
Heart disease                       0.902    0.848
Musk (Version 1)                    0.795    0.689
Pima Indians Diabetes               0.955    0.881
Wine Quality                        0.819    0.809

Results (cont.)

Page 26

8. The neural network gives good efficiency when it is used with the feature selection methods.
9. The neural network (NN) maintains the intrinsic information of the high-dimensional data.

Results (cont.)

Page 27

Results of the performance measures for the neural network with all variables (no dimensionality reduction):

Dataset                             Accuracy      Specificity   Sensitivity   Precision     F-score
Climate Model Simulation Crashes    0.9315        0.9552        0.6667        0.5714        0.6153
Heart disease                       0.9142        0.9136        0.7042        0.6241        0.6617
Musk (Version 1)                    0.8412        0.8571        0.8286        0.8788        0.8523
Pima Indians Diabetes               0.7142        0.7307        0.7068        0.8541        0.7735
Wine Quality                        0.8000        0.9091        0.7142        0.9090        0.7991

Page 28

The comparison shows that FA gives relatively good results in terms of feature reduction and computational complexity.

The FA algorithm gave better results on all the datasets, even though they differ in the number of instances.

It can be stated that the neural network has proven to be a powerful classifier for high-dimensional datasets, and it also gives good efficiency when used with the feature selection methods.

It can be said that the FA algorithm seems to be the best method for dealing with these datasets.

Conclusions

Page 29

To obtain better results, this research recommends:
Using more than one classification algorithm, such as logistic regression (LR), decision trees, and support vector machines (SVM), and taking many more datasets.
Increasing the number of datasets beyond five.
Using more than one mathematical model, to obtain the best results and best performance.

Recommendations
