A Comparative Study of Factor Analysis and Principal Component Analysis on Classification...


In the name of God, the Most Gracious, the Most Merciful

University of Gezira

Faculty of Mathematical and Computer Sciences

A Dissertation Submitted to the University of Gezira in Partial Fulfillment of the Requirements for the Award of the Degree of Master of Science in Computer Sciences, entitled:

A Comparative Study of Factor Analysis and Principal Component Analysis on Classification Performance Using Neural Network

By Abuzer Hussein Ibrahim Ahmed

Supervisor: Dr Murtada Khalfallah Elbashir

1. Introduction
2. Research problem
3. Research Objectives
4. Previous studies
5. Methodology
6. Results
7. Conclusions
8. Recommendations

Presentation contents

Data mining is the process of analyzing data from different perspectives and summarizing it into useful information, for example information that can be used to increase revenue. Its objective is to identify valid, novel and understandable correlations and patterns in existing data. Major data mining techniques include classification, association and clustering. Data mining is an emerging research area for solving a variety of problems, and classification is one of its main problems. Before a dataset is used for classification it needs preprocessing such as data cleaning, data transformation and data reduction. Data reduction is especially important: datasets are usually represented in a high-dimensional space that is too large, so the dataset has to be reduced before a learning algorithm is applied. A common way to address this problem is to use dimensionality reduction techniques.

Introduction

It is difficult to extract knowledge from large amounts of data.

It is difficult to maintain the intrinsic information of high-dimensional data when it is transformed to a low-dimensional space for analysis.

It is difficult to visualize data in high dimensions.

Research problem

1. To explain the importance of using the Factor Analysis (FA) and Principal Component Analysis (PCA) algorithms with a neural network (NN).
2. To reduce the dimensions using the FA and PCA algorithms in order to obtain new reduced features while preserving the information in the original dimensions.
3. To compare FA and PCA in terms of performance measures.

Research Objectives

[Flowchart of the methodology]

1. Dataset features
2. Dimensionality reduction algorithm, using MATLAB (drtoolbox): Principal Component Analysis (PCA) or Factor Analysis (FA)
3. New features
4. Neural network
5. Calculate the performance measures: accuracy and Receiver Operating Characteristics (ROC)

Methodology

The dimensionality reduction toolbox (drtoolbox) is a MATLAB toolbox for dimensionality reduction. It can be obtained from http://lvdmaaten.github.io/drtoolbox and is free to use. The toolbox implements 34 techniques for dimensionality reduction, including Principal Component Analysis (PCA) and Factor Analysis (FA).

Dimensionality reduction toolbox (drtoolbox)

The computation of Principal Component Analysis (PCA):

1. Calculate the covariance matrix from the input data: $\mathrm{cov}(X, Y) = \frac{1}{N} \sum_{i=1}^{N} (X_i - \bar{X})(Y_i - \bar{Y})$
2. Compute the eigenvalues and eigenvectors of the covariance matrix.
3. Form the transition matrix from the predefined number of components (eigenvectors).
4. Choose the components (eigenvectors) with the highest eigenvalues.
5. Finally, multiply the original features by the obtained transition matrix, which yields a lower-dimensional representation.

Principal component analysis (PCA)
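The PCA steps listed on this slide can be sketched in a few lines. This is only an illustrative Python/NumPy version, not the drtoolbox implementation used in the study; the function name `pca_reduce` and the sample data are assumptions made for the example, and the data is centered before projection (a common convention that the slide does not state explicitly).

```python
import numpy as np

def pca_reduce(X, n_components):
    """Project the rows of X onto the top principal components (illustrative)."""
    X = np.asarray(X, dtype=float)
    # Step 1: center the data and compute the covariance matrix (dividing by N).
    X_centered = X - X.mean(axis=0)
    cov = X_centered.T @ X_centered / X.shape[0]
    # Step 2: eigen-decomposition; eigh is meant for symmetric matrices and
    # returns the eigenvalues in ascending order.
    eigenvalues, eigenvectors = np.linalg.eigh(cov)
    # Steps 3-4: keep the eigenvectors with the highest eigenvalues.
    order = np.argsort(eigenvalues)[::-1][:n_components]
    transition = eigenvectors[:, order]
    # Step 5: multiply the (centered) features by the transition matrix.
    return X_centered @ transition, eigenvalues[order]

# Hypothetical data: 5 samples, 2 features.
X = np.array([[2.5, 2.4], [0.5, 0.7], [2.2, 2.9], [1.9, 2.2], [3.1, 3.0]])
reduced, variances = pca_reduce(X, n_components=1)
print(reduced.shape)
```

The variance of each projected column equals the corresponding eigenvalue, which gives a quick sanity check on the result.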

The computation of Factor Analysis (FA):

1. Explore and choose the relevant variables for constructing the covariance matrix.
2. Extract the initial factors (via principal components) from the covariance matrix, which is specified in terms of its eigenvalue-eigenvector pairs $(\lambda_i, e_i)$, where $\lambda_1 \ge \lambda_2 \ge \dots \ge \lambda_p \ge 0$.
3. Choose the number of factors.
4. Choose the estimation method and estimate the model.
5. Calculate the estimated factor loadings.
6. Rotate and interpret.

Factor Analysis (FA)
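As a rough illustration of steps 2 to 5, the sketch below estimates factor loadings with the principal component method, in which each loading column is an eigenvector scaled by the square root of its eigenvalue. It is an assumption that this matches the estimation used in the study; the function name `pc_factor_loadings` and the data are hypothetical, and the rotation step is not covered.

```python
import numpy as np

def pc_factor_loadings(X, n_factors):
    """Estimate factor loadings by the principal component method (illustrative)."""
    X = np.asarray(X, dtype=float)
    # Covariance matrix of the chosen variables (dividing by N).
    cov = np.cov(X, rowvar=False, bias=True)
    # Eigenvalue-eigenvector pairs of the covariance matrix.
    eigenvalues, eigenvectors = np.linalg.eigh(cov)
    order = np.argsort(eigenvalues)[::-1][:n_factors]
    # Clip tiny negative eigenvalues caused by rounding before the square root.
    top_values = np.clip(eigenvalues[order], 0.0, None)
    # Loadings: each retained eigenvector scaled by sqrt(eigenvalue).
    loadings = eigenvectors[:, order] * np.sqrt(top_values)
    # Communality: variance of each variable explained by the common factors.
    communalities = (loadings ** 2).sum(axis=1)
    # Specific variance: what is left on the diagonal of the covariance matrix.
    specific_variances = np.diag(cov) - communalities
    return loadings, communalities, specific_variances

# Hypothetical data: 6 samples, 3 variables.
X = np.array([
    [4.0, 2.0, 0.60], [4.2, 2.1, 0.59], [3.9, 2.0, 0.58],
    [4.3, 2.1, 0.62], [4.1, 2.2, 0.63], [4.0, 1.9, 0.61],
])
loadings, communalities, specific_variances = pc_factor_loadings(X, n_factors=2)
print(loadings.shape)
```

When all factors are kept, the product of the loading matrix with its transpose reproduces the covariance matrix; with fewer factors, the remainder appears in the specific variances.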

A neural network is used here as a method for analyzing a dataset in which one or more independent variables determine an outcome.

Neural network (NN)

Accuracy: the proportion of true results (both true positives and true negatives) in the population.

$\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}$

Where:

TP: true positives (predicted positive, actual positive)

TN: true negatives (predicted negative, actual negative)

FP: false positives (predicted positive, actual negative)

FN: false negatives (predicted negative, actual positive)

performance measures

Sensitivity or Recall: the proportion of actual positives which are predicted positive.

$\text{Sensitivity} = \frac{TP}{TP + FN}$

Specificity: the proportion of actual negatives which are predicted negative.

$\text{Specificity} = \frac{TN}{TN + FP}$

Precision: the proportion of predicted positives which are actual positives.

$\text{Precision} = \frac{TP}{TP + FP}$

F Score: the harmonic mean of precision and recall. It tries to give a good combination of these two metrics.

$\text{F Score} = \frac{2 \cdot \text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}$
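The performance measures defined above can be computed directly from the four confusion-matrix counts. The following is a minimal sketch; the function name `classification_metrics` and the counts are hypothetical and are not taken from the study's experiments. It assumes that none of the denominators is zero.

```python
def classification_metrics(tp, tn, fp, fn):
    """Compute the performance measures from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    sensitivity = tp / (tp + fn)   # also called recall
    specificity = tn / (tn + fp)
    precision = tp / (tp + fp)
    # Harmonic mean of precision and recall.
    f_score = 2 * precision * sensitivity / (precision + sensitivity)
    return {
        "accuracy": accuracy,
        "sensitivity": sensitivity,
        "specificity": specificity,
        "precision": precision,
        "f_score": f_score,
    }

# Hypothetical counts for a binary classifier evaluated on 100 samples.
metrics = classification_metrics(tp=40, tn=45, fp=5, fn=10)
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```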

In a Receiver Operating Characteristics (ROC) curve, the true positive rate (sensitivity) is plotted as a function of the false positive rate, which can be calculated as (1 - specificity), for different cutoff points.

Each point on the ROC curve represents a sensitivity/specificity pair corresponding to a particular decision threshold.

Receiver Operating Characteristics (ROC) analyses
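To make the idea of cutoff points concrete, the sketch below builds the ROC points for a small set of classifier scores. The function name `roc_points`, the scores and the labels are made up for illustration; the sketch assumes binary labels (1 for positive, 0 for negative) with at least one sample of each class.

```python
def roc_points(scores, labels):
    """Return (false positive rate, true positive rate) pairs, one per cutoff."""
    positives = sum(labels)
    negatives = len(labels) - positives
    points = [(0.0, 0.0)]  # cutoff above every score: nothing predicted positive
    # Try each distinct score as a cutoff: predict positive when score >= cutoff.
    for cutoff in sorted(set(scores), reverse=True):
        tp = sum(1 for s, y in zip(scores, labels) if s >= cutoff and y == 1)
        fp = sum(1 for s, y in zip(scores, labels) if s >= cutoff and y == 0)
        # False positive rate is 1 - specificity; true positive rate is sensitivity.
        points.append((fp / negatives, tp / positives))
    return points

# Hypothetical classifier scores and true labels.
scores = [0.9, 0.8, 0.7, 0.4, 0.3, 0.1]
labels = [1, 1, 0, 1, 0, 0]
points = roc_points(scores, labels)
for fpr, tpr in points:
    print(f"FPR={fpr:.2f}  TPR={tpr:.2f}")
```

Lowering the cutoff can only add predicted positives, so both rates are non-decreasing along the list and the last point is always (1, 1).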

The area under the curve (AUC) is between 0 and 1 and is increasingly recognized as a better measure than accuracy for evaluating algorithm performance. A bigger AUC value implies a better ranking performance for a classifier.

Area under the curve (AUC)
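One common way to obtain the AUC from a set of ROC points is the trapezoidal rule. The sketch below assumes the points are given as (false positive rate, true positive rate) pairs; the function name `auc_trapezoid` and the example curves are hypothetical, and the study itself may have computed the AUC differently.

```python
def auc_trapezoid(points):
    """Area under a ROC curve given (FPR, TPR) points, by the trapezoidal rule."""
    pts = sorted(points)  # order by false positive rate, then true positive rate
    area = 0.0
    for (x0, y0), (x1, y1) in zip(pts, pts[1:]):
        # Width of the step times the average height of its two ends.
        area += (x1 - x0) * (y0 + y1) / 2
    return area

# Hypothetical ROC curves.
example_curve = [(0.0, 0.0), (0.0, 0.5), (0.5, 0.5), (0.5, 1.0), (1.0, 1.0)]
chance_curve = [(0.0, 0.0), (1.0, 1.0)]
perfect_curve = [(0.0, 0.0), (0.0, 1.0), (1.0, 1.0)]

print(auc_trapezoid(example_curve))
print(auc_trapezoid(chance_curve))
print(auc_trapezoid(perfect_curve))
```

The diagonal (chance) curve gives an area of 0.5 and a perfect classifier gives 1, which is why a bigger AUC indicates a better ranking.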

The results of applying the FA and PCA algorithms from the dimensionality reduction toolbox (drtoolbox) in MATLAB to different datasets, evaluated with several performance measures, reveal a number of points:

1. On the performance measures (accuracy, specificity, sensitivity, precision, F-score, ROC curves and area under the curve), the FA algorithm performs better than the PCA algorithm in most cases.
2. Compared with the PCA algorithm, the FA algorithm gave better results on all datasets, even though the datasets differ in the number of instances, the number of attributes and the type of attributes.
3. Extraction of knowledge with FA using NN is better than with PCA.

Results

| Dataset | Accuracy | Specificity | Sensitivity | Precision | F-score |
|---|---|---|---|---|---|
| Climate Model Simulation Crashes | 0.9452 | 0.9552 | 0.8333 | 0.6250 | 0.7142 |
| Heart disease | 0.9198 | 0.9250 | 0.5000 | 0.7692 | 0.6095 |
| Musk (Version 1) | 0.8889 | 0.8965 | 0.8824 | 0.9091 | 0.8956 |
| Pima Indians Diabetes | 0.7619 | 0.6667 | 0.8000 | 0.8571 | 0.8276 |
| Wine Quality | 0.8400 | 0.9091 | 0.7857 | 0.8967 | 0.8375 |

Results of the performance measures for the neural network with the Principal Component Analysis (PCA) algorithm

| Dataset | Accuracy | Specificity | Sensitivity | Precision | F-score |
|---|---|---|---|---|---|
| Climate Model Simulation Crashes | 0.9589 | 0.9705 | 0.8000 | 0.6667 | 0.7273 |
| Heart disease | 0.9441 | 0.9602 | 0.7273 | 0.5714 | 0.6395 |
| Musk (Version 1) | 0.9367 | 0.8845 | 0.9723 | 0.9231 | 0.9471 |
| Pima Indians Diabetes | 0.8571 | 0.7857 | 0.8929 | 0.8928 | 0.8929 |
| Wine Quality | 0.8800 | 0.9230 | 0.7857 | 0.9091 | 0.8429 |

Results of the performance measures for the neural network with the Factor Analysis (FA) algorithm

4. The ROC curves of FA and PCA for each dataset indicate that FA is better than PCA.

5. The value of the area under the curve is bigger for FA than for PCA, which indicates that FA is better than PCA.

6. Visualization of the data is better with FA than with PCA.

Results (cont.)

1. ROC curve for the Climate Model Simulation Crashes dataset

2. ROC curve for the Heart disease dataset

3. ROC curve for the Musk (Version 1) dataset

4. ROC curve for the Pima Indians Diabetes dataset

5. ROC curve for the Wine Quality dataset

7. The area under the curve (AUC):

| Dataset | FA | PCA |
|---|---|---|
| Climate Model Simulation Crashes | 0.866 | 0.808 |
| Heart disease | 0.902 | 0.848 |
| Musk (Version 1) | 0.795 | 0.689 |
| Pima Indians Diabetes | 0.955 | 0.881 |
| Wine Quality | 0.819 | 0.809 |

Results (cont.)

8. The neural network gives good efficiency when used with the feature selection methods.

9. The neural network (NN) maintains the intrinsic information of high-dimensional data.

Results (cont.)

Results of the performance measures for the neural network with all variables

| Dataset | Accuracy | Specificity | Sensitivity | Precision | F-score |
|---|---|---|---|---|---|
| Climate Model Simulation Crashes | 0.9315 | 0.9552 | 0.6667 | 0.5714 | 0.6153 |
| Heart disease | 0.9142 | 0.9136 | 0.7042 | 0.6241 | 0.6617 |
| Musk (Version 1) | 0.8412 | 0.8571 | 0.8286 | 0.8788 | 0.8523 |
| Pima Indians Diabetes | 0.7142 | 0.7307 | 0.7068 | 0.8541 | 0.7735 |
| Wine Quality | 0.8000 | 0.9091 | 0.7142 | 0.9090 | 0.7991 |

The comparison shows that FA gives relatively good results in feature reduction and computational complexity.

The FA algorithm gave better results on all datasets, even though the datasets differ in the number of instances.

The neural network has proven to be a powerful classifier for high-dimensional datasets, and it also gives good efficiency when used with the feature selection methods.

Overall, the FA algorithm seems to be the best method for dealing with these datasets.

Conclusions

To obtain the best results, this research recommends:

Using more than one classification algorithm, such as logistic regression (LR), decision trees and support vector machines (SVM).

Increasing the number of datasets beyond five.

Using more than one mathematical model to obtain the best results and the best performance.

Recommendations

