Mykola Pechenizkiy

Feature Extraction for Supervised Learning in Knowledge Discovery Systems

University of Jyväskylä, 2005

Copyright © 2005, by University of Jyväskylä


ABSTRACT 

Pechenizkiy, Mykola
Feature Extraction for Supervised Learning in Knowledge Discovery Systems
Jyväskylä: University of Jyväskylä, 2005, 86 p. (+ included articles)
(Jyväskylä Studies in Computing, ISSN 1456-5390; 56)
ISBN 951-39-2299-5
Finnish summary
Diss.

Knowledge discovery or data mining is the process of finding previously unknown and potentially interesting patterns and relations in large databases. The so-called “curse of dimensionality”, pertinent to many learning algorithms, denotes the drastic increase in computational complexity and classification error with data having a great number of dimensions. Beside this problem, individual features that are irrelevant or only indirectly relevant to the learnt concepts form a poor problem representation space. The purpose of this study is to develop the theoretical background and practical aspects of feature extraction (FE) as a means of (1) dimensionality reduction and (2) representation space improvement for supervised learning (SL) in knowledge discovery systems. The focus is on applying conventional Principal Component Analysis (PCA) and two class-conditional approaches for two targets: (1) the construction of a base-level classifier, and (2) the dynamic integration of base-level classifiers. The theoretical bases are derived from classical studies in data mining, machine learning and pattern recognition. The software prototype for the experimental study is built within WEKA, an open-source machine-learning library in Java. The experimental study on a number of benchmark and real-world data sets covers analyses of (1) the importance of using class information in the FE process; (2) the (dis)advantages of using either extracted features alone or both original and extracted features for SL; (3) applying FE globally to the whole data and locally within natural clusters; (4) the effect of sample reduction on FE for SL; and (5) the problem of selecting an FE technique for SL for the problem under consideration. The hypotheses and detailed results of this many-sided experimental research are reported in the corresponding papers included in the thesis. The main contributions of the thesis can be divided into contributions (1) to current theoretical knowledge and (2) to practical suggestions on applying FE for SL.

Keywords: feature extraction, dimensionality reduction, principal component analysis, data pre-processing, integration of data mining methods, supervised learning, knowledge discovery in databases


ACM Computing Reviews Categories

H.2.8 Information Systems: Database Management: Database Applications, Data Mining

I.2.6 Computing Methodologies: Artificial Intelligence: Learning

I.5.1 Computing Methodologies: Pattern Recognition: Models

I.5.2 Computing Methodologies: Pattern Recognition: Design Methodology

Author’s address: Mykola Pechenizkiy, Dept. of Computer Science and Information Systems, University of Jyväskylä, P.O. Box 35, 40351 Jyväskylä, Finland. E-mail: [email protected]

Supervisors: Prof. Dr. Seppo Puuronen, Dept. of Computer Science and Information Systems, University of Jyväskylä, Finland

Dr. Alexey Tsymbal, Department of Computer Science, Trinity College Dublin, Ireland

Prof. Dr. Tommi Kärkkäinen, Department of Mathematical Information Technology, University of Jyväskylä, Finland

Reviewers: Prof. Dr. Ryszard Michalski, Machine Learning and Inference Laboratory, George Mason University, USA

Prof. Dr. Peter Kokol, Department of Computer Science, University of Maribor, Slovenia

Opponent: Dr. Kari Torkkola, Motorola Labs, USA


ACKNOWLEDGEMENTS 

I would like to thank all those who helped me to carry out this research. I am thankful to my chief scientific supervisor Prof. Seppo Puuronen (Dept. of Computer Science and Information Systems, University of Jyväskylä), who never refused to share his time and scientific experience in directing the thesis through many fruitful discussions.

I am very grateful to Dr. Alexey Tsymbal, who has successfully combined the roles of my second scientific supervisor, colleague and friend since 2001. He has introduced many new things to me and helped me to become a researcher. His doctoral thesis “Dynamic Integration of Data Mining Methods in Knowledge Discovery Systems” was of great help in enabling me to better understand some concepts and to present my thesis as a collection of papers with a summary.

I would also like to thank Prof. Seppo Puuronen, Dr. Alexey Tsymbal and Dr. David Patterson for their co-operation and co-authoring of the articles included in this thesis.

I am thankful to my third scientific supervisor Prof. Tommi Kärkkäinen (Dept. of Mathematical Information Technology, University of Jyväskylä) for his valuable advice and comments. My special thanks go to Prof. Tommi Kärkkäinen and Licentiate Pekka Räsänen for their supervision of my M.Sc. project “PCA-based Feature Extraction for Classification”. That work was the starting point of the present doctoral thesis.

Prof. Ryszard Michalski from George Mason University and Prof. Peter Kokol from the University of Maribor acted as external reviewers (examiners) of the dissertation. Their constructive comments and suggestions helped me to improve this work.

I would like to thank Dr. David Patterson, who invited me to the Northern Ireland Knowledge Engineering Laboratory, University of Ulster (1.09 – 30.11.2002), as a visiting researcher, and Prof. Padraig Cunningham, who invited me to the Machine Learning Group, Trinity College Dublin (1.04 – 30.04.2003 and 17.07 – 7.08.2005). This cooperation gave me useful experience and generated new ideas for further research.

This research took place at the Department of Computer Science and Information Systems at the University of Jyväskylä, and I am thankful to many staff members for their support. I am also thankful to the Department for its financial support of conference travel.

This research has been funded by the Graduate School in Computing and Mathematical Sciences (COMAS) of the University of Jyväskylä. I am also grateful to the INFWEST.IT program for some practical arrangements.

I would also like to thank Steve Legrand for his support in checking the language.

I would also like to thank the UCI machine learning repository of databases, domain theories and data generators (UCI ML Repository) for the datasets, and the machine learning library in C++ (MLC++) and WEKA developers for the source code used in this research.

Last, but not least, I wish to thank my wife Ekaterina, my dear parents Anatoliy and Lyubov, my brother Oleksandr, and all my friends, who provided invaluable moral support.

Jyväskylä, November 2005
Mykola Pechenizkiy


LIST OF FIGURES  

FIGURE 1   Basic steps of KDD process
FIGURE 2   The integration of classifiers
FIGURE 3   High versus low quality representation spaces for concept learning
FIGURE 4   Random sampling
FIGURE 5   Stratified random sampling
FIGURE 6   kd-Tree based selective sampling
FIGURE 7   Stratified sampling with kd-tree based selection of instances
FIGURE 8   The feature-values representations of the instances
FIGURE 9   PCA for classification
FIGURE 10  Independent searches for the most appropriate FE and SL techniques
FIGURE 11  The joint search for a combination of the most appropriate FE and SL techniques
FIGURE 12  Scheme of the FEDIC approach
FIGURE 13  Data mining strategy selection via meta-learning and taking benefit of the constructive induction approach
FIGURE 14  A multimethodological approach to the construction of an artefact for DM

LIST OF TABLES

TABLE 1    Data sets used in the experimental study (in appendix)


LIST OF ACRONYMS

C4.5     C4.5 decision tree classifier
DIC      dynamic integration of classifiers
DB       database
DM       data mining
DS       dynamic selection
DV       dynamic voting
DVS      dynamic voting with selection
DSS      decision support system
FE       feature extraction
FS       feature selection
FEDIC    feature extraction for dynamic integration of classifiers
IS       information system
ISD      information system development
KDD      knowledge discovery in databases
KDS      knowledge discovery system
kNN      (k) nearest neighbour algorithm
LDA      linear discriminant analysis
ML       machine learning
MLC++    machine learning library in C++
OLE DB   Object Linking and Embedding DB
PCA      principal component analysis
RP       random projection
SL       supervised learning
WEKA     Waikato Environment for Knowledge Analysis
WNN      a weighted average of the nearest neighbours

CONTENTS

ABSTRACT
ACKNOWLEDGEMENTS
LIST OF FIGURES
LIST OF TABLES
LIST OF ACRONYMS
CONTENTS
LIST OF INCLUDED ARTICLES

1 INTRODUCTION
  1.1 Motivation
  1.2 Objectives
  1.3 Methods and Results
  1.4 Thesis overview

2 RESEARCH BACKGROUND
  2.1 Knowledge discovery in databases
    2.1.1 Knowledge discovery as a process
    2.1.2 Critical issues and perspectives in KDD systems
  2.2 Supervised learning and classifiers
    2.2.1 Supervised learning: the taxonomy of concepts
    2.2.2 Instance-based classifier
    2.2.3 Naïve-Bayes classifier
    2.2.4 C4.5 Decision Tree classifier
  2.3 Dynamic integration of classifiers
    2.3.1 Generation of base classifiers
    2.3.2 Integration of base classifiers
    2.3.3 Dynamic integration approaches used in the study
  2.4 The curse of dimensionality and dimensionality reduction
    2.4.2 Inferences of geometrical, statistical, and asymptotical properties of high dimensional spaces for supervised classification
    2.4.3 Dimensionality reduction techniques
    2.4.4 Feature extraction
  2.5 Feature extraction for supervised learning
    2.5.1 Principal component analysis
    2.5.2 The random projection approach
    2.5.3 Class-conditional feature extraction
    2.5.4 FE for supervised learning as an analogy to constructive induction
  2.6 Selecting representative instances for FE
    2.6.1 Random sampling
    2.6.2 Stratified random sampling
    2.6.3 kd-Tree based sampling
    2.6.4 Stratified sampling with kd-tree based selection of instances

3 RESEARCH PROBLEM
  3.1 How important is it to use class information in the FE process?
  3.2 Is FE a data- or hypothesis-driven constructive induction?
  3.3 Is FE for dynamic integration of base-level classifiers useful in a similar way as for a single base-level classifier?
  3.4 Which features – original, extracted or both – are useful for SL?
  3.5 How many extracted features are useful for SL?
  3.6 How to cope with the presence of contextual features in data, and data heterogeneity?
  3.7 What is the effect of sample reduction on the performance of FE for SL?
  3.8 When is FE useful for SL?
  3.9 Interpretability of the extracted features
  3.10 Putting all together: towards the framework of DM strategy selection

4 RESEARCH METHODS AND RESEARCH DESIGN
  4.1 DM research in the scope of ISD research methods
  4.2 Research methods used in the study
  4.3 Experimental approach
    4.3.1 Estimating the accuracy of the model learnt
    4.3.2 Tests of hypotheses
  4.4 Experimental design

5 RESEARCH RESULTS: SUMMARY OF THE INCLUDED ARTICLES
  5.1 Eigenvector-based feature extraction for classification
  5.2 PCA-based Feature Transformations for Classification: Issues in Medical Diagnostics
  5.3 On Combining Principal Components with Parametric LDA-based Feature Extraction for Supervised Learning
  5.4 The Impact of the Feature Extraction on the Performance of a Classifier: kNN, Naïve Bayes and C4.5
  5.5 Local Dimensionality Reduction within Natural Clusters for Medical Data Analysis
  5.6 The Impact of Sample Reduction on PCA-based Feature Extraction for Naïve Bayes Classification
  5.7 Feature extraction for dynamic integration of classifiers
  5.8 Feature extraction for classification in knowledge discovery systems
  5.9 Data mining strategy selection via empirical and constructive induction
  5.10 About the joint articles

6 CONCLUSIONS
  6.1 Contributions of the thesis
    6.1.1 Contributions to the theory
    6.1.2 Contributions to the practice (use) of FE for SL in a KDS
  6.2 Limitations and future work
    6.2.1 Limitations
    6.2.2 Future work
  6.3 Further challenges

REFERENCES

APPENDIX A. DATASETS USED IN THE EXPERIMENTS
FINNISH SUMMARY (YHTEENVETO)
ORIGINAL ARTICLES


LIST OF INCLUDED ARTICLES

I Tsymbal, A., Puuronen, S., Pechenizkiy, M., Baumgarten, M. & Patterson, D. 2002. Eigenvector-based Feature Extraction for Classification. In: S. Haller, G. Simmons (Eds.), Proceedings of 15th International FLAIRS Conference on Artificial Intelligence, FL, USA: AAAI Press, 354-358.

II Pechenizkiy, M., Tsymbal, A. & Puuronen, S. 2004. PCA-based Feature Transformation for Classification: Issues in Medical Diagnostics. In: R. Long et al. (Eds.), Proceedings of 17th IEEE Symposium on Computer-Based Medical Systems CBMS’2004, Bethesda, MD: IEEE CS Press, 535-540.

III Pechenizkiy, M., Tsymbal, A. & Puuronen, S. 2005. On Combining Principal Components with Fisher’s Linear Discriminants for Supervised Learning. (submitted to) Special Issue of Foundations of Computing and Decision Sciences “Data Mining and Knowledge Discovery” (as extended version of Pechenizkiy et al., 2005e).

IV Pechenizkiy, M. 2005. The Impact of the Feature Extraction on the Performance of a Classifier: kNN, Naïve Bayes and C4.5. In: B. Kegl & G. Lapalme (Eds.), Proceedings of 18th CSCSI Conference on Artificial Intelligence AI’05, LNAI 3501, Heidelberg: Springer Verlag, 268-279.

V Pechenizkiy, M., Tsymbal, A. & Puuronen, S. 2005. Supervised Learning and Local Dimensionality Reduction within Natural Clusters: Biomedical Data Analysis, (submitted to) IEEE Transactions on Information Technology in Biomedicine, Special Post-conference Issue "Mining Biomedical Data" (as extended version of Pechenizkiy et al., 2005c).

VI Pechenizkiy, M., Puuronen, S. & Tsymbal, A. 2006. The Impact of Sample Reduction on PCA-based Feature Extraction for Supervised Learning. (to appear) In: H. Haddad et al. (Eds.), Proceedings of 21st ACM Symposium on Applied Computing (SAC’06, Data Mining Track), ACM Press.

VII Pechenizkiy, M., Tsymbal, A., Puuronen, S. & Patterson D. 2005. Feature Extraction for Dynamic Integration of Classifiers, (submitted to) Fundamenta Informaticae, IOS Press (as extended version of Tsymbal et al., 2003).

VIII Pechenizkiy, M., Puuronen, S. & Tsymbal, A. 2003. Feature Extraction for Classification in Knowledge Discovery Systems. In: V. Palade, R. Howlett & L. Jain (Eds.), Proceedings of 7th International Conference on Knowledge-Based Intelligent Information and Engineering Systems KES’2003, LNAI 2773, Heidelberg: Springer-Verlag, 526-532.


IX Pechenizkiy, M. 2005. Data Mining Strategy Selection via Empirical and Constructive Induction. In: M. Hamza (Ed.), Proceedings of the IASTED International Conference on Databases and Applications DBA’05, Calgary: ACTA Press, 59-64.


1 INTRODUCTION

Knowledge discovery in databases (KDD) or data mining (DM) is the process of finding previously unknown and potentially interesting patterns and relations in large databases (Fayyad 1996). Numerous data mining methods have recently been developed to extract knowledge from these large databases. Selection of the most appropriate data mining method or a group of the most appropriate methods is usually not straightforward.

During the past several years, researchers in a variety of application domains have tried to learn how to manage the knowledge discovery process in their specific domains. This has resulted in a large number of “vertical solutions”. Data mining has evolved from less sophisticated first-generation techniques to today's cutting-edge ones. Currently there is a growing need for next-generation data mining systems to manage knowledge discovery applications. These systems should be able to discover knowledge by combining several available techniques, and provide a more automatic environment, or an application envelope, surrounding a highly sophisticated data mining engine (Fayyad & Uthurusamy 2002).

This thesis presents the study of data mining techniques integration and application for different benchmark and real-world data sets, with the focus on the study of feature extraction (FE) for supervised learning (SL). This introductory chapter presents the motivation for the research and the main research objectives, overviews the methods used in the study, and introduces the organization of the thesis.


1.1 Motivation

Classification is a typical data mining task where the value of some attribute for a new instance is predicted based on a given collection of instances for which all the attribute values are known (Aivazyan, 1989). The purpose of supervised learning (SL) is to learn to classify (predict the value of some attribute for) a new instance. In many applications, the data that is the subject of analysis and processing in data mining is multidimensional, being presented by a number of features. The so-called “curse of dimensionality” (Bellman, 1961), pertinent to many learning algorithms, denotes the drastic increase in computational complexity and classification error with data having a great number of dimensions (Aivazyan, 1989). Furthermore, the complexity of real-world problems, increased by the presence of many irrelevant or indirectly relevant features, nowadays challenges the existing learning algorithms. It is commonly accepted that one should not expect useful results to appear just by pushing a button.

Hence, attempts are often made to reduce the dimensionality of the feature space before SL is undertaken. Feature extraction (FE) is one of the dimensionality reduction techniques that extracts a subset of new features from the original feature set by means of some functional mapping, keeping as much information in the data as possible (Liu, 1998).

FE is an effective data pre-processing step aimed at reducing the dimensionality and improving the representation space of the problem under consideration.

Although rigorous research on both FE and SL has been going on for many years in applied statistics, pattern recognition and related fields, to the best of our knowledge there is no comprehensive, many-sided analysis of the integration of the FE and SL processes.

1.2 Objectives

In this thesis, the emphasis is on studying the integration of FE and SL. Both FE and SL are seen as constituent parts of a DM strategy. Our basic assumption is that each DM strategy is best suited to a certain problem. Therefore, our overall (long-term) research goal is to contribute to knowledge on the problem of DM strategy selection for a given DM problem. Our particular focus is on different combinations of FE techniques and SL techniques.

The main objective of this study is to develop the theoretical background and practical aspects of FE as a means of (1) dimensionality reduction and (2) representation space improvement for SL in knowledge discovery systems. The focus is on applying conventional Principal Component Analysis (PCA) and two class-conditional approaches for two targets: (1) the construction of a base-level classifier, and (2) the dynamic integration of the base-level classifiers. The different aspects of the study include analyses of (1) the importance of using class information in the FE process; (2) the advantages of using either extracted features or both original and extracted features for SL; (3) applying FE globally to the whole set of training instances and locally within natural clusters; and (4) the effect of sample reduction on FE for SL. Besides this, the more general problem of selecting an FE technique for SL for the dataset under consideration is analysed.

Related work on FE for SL also includes research on constructive induction (Michalski, 1997) and on latent semantic indexing for text mining applications (Deerwester et al., 1990).

1.3 Methods and Results

We consider a knowledge discovery system as a special kind of adaptive information system. We adapted the Information System Development (ISD) framework for the context of DM systems development. Three basic groups of IS research methods, including conceptual-theoretical, constructive, and experimental approaches are used in this study. These approaches are tightly connected and are applied in parallel. The theoretical background is exploited during the constructive work and the constructions are used for experimentation. The results of constructive and experimental work are used to refine the theory.

Consequently, the main results (besides the developed software prototype for experimental studies) come from the experimental study.

The results of our study show that:

− FE is an important step in the DM/KDD process that can be beneficial for SL and for the integration of classifiers, both in terms of classification accuracy and in terms of the time complexity of model learning and of classifying new instances.

− FE can improve the classification accuracy of a model produced by a learner even for datasets with a relatively small number of features. Therefore, FE can be considered a dimensionality reduction technique as well as a technique for constructing a better representation space for further supervised learning.

− The use of class information in the FE process is crucial for many datasets.

− Combining original features with extracted features can be beneficial for SL on some datasets.

− Local FE and SL models can outperform the corresponding global models in classification accuracy while using fewer features for learning.

− Training sample reduction affects the performance of SL with FE rather differently; when the proportion of training instances used to build the FE and learning models is relatively small, it is important to use an adequate sample reduction technique that selects more representative instances for the FE process.


In our experimental study we use mainly benchmark data sets from the UCI repository (Blake & Merz, 1998). However, part of the experimental study was done on data related to real problems of medical diagnostics (the classification of acute abdominal pain (Zorman et al., 2001) and problems of antibiotic resistance (Pechenizkiy et al., 2005h)). Besides benchmark and real-world data sets, we conducted some studies on synthetically generated data sets in which the desired data set characteristics are varied.

1.4 Thesis overview

The thesis consists of two parts. The first part presents the summary of the collection of papers presented in the second part (the included papers are listed before this introductory Chapter). The summary part of the thesis introduces the background of the study, presents the research problem of the study, describes basic research methods used, overviews the papers included in the thesis and concludes with the main contribution of the thesis and suggestions for further research directions.

The organization of the summary part of the thesis is as follows. In Chapter 2 the research background is considered. First, in Section 2.1, knowledge discovery and data mining concepts are discussed and a brief history of knowledge discovery systems is presented. Then a basic introduction is given to the problem of classification (Section 2.2), to the dynamic integration of classifiers (Section 2.3), to dimensionality reduction (Section 2.4), to FE for supervised learning (Section 2.5), which is the focus of this thesis, and to the selection of representative instances for FE (Section 2.6). In Chapter 3, the research problem of the thesis is stated; each aspect of the study is presented in a separate section. Chapter 4 introduces the research design of the thesis. First, the research methods used and the basic approaches for evaluating learned models in the experiments are discussed; then the experimental design is considered. Chapter 5 contains summaries of the articles included in the thesis; each section is a summary of the corresponding included article. In Chapter 6, the contribution of the thesis is summarized, and the limitations of the research and future work are discussed. Information about the datasets used in the experimental studies is given in Appendix A.


2 RESEARCH BACKGROUND

Knowledge discovery in databases (KDD) is a combination of data warehousing, decision support, and data mining – an innovative new approach to information management (Fayyad, 1996). KDD is an emerging area that covers such fields as statistics, machine learning, databases, pattern recognition, econometrics, and some others. In the following section we consider knowledge discovery as a process and discuss the perspectives in KDD systems. In the subsequent sections, the basics of supervised learning are introduced, the classifiers used in the study are considered, and some background on ensemble classification and dynamic integration of classifiers is presented. Then we introduce the problem known as “the curse of dimensionality”, before feature extraction techniques for supervised learning are considered. We finish the chapter with a review of the training sample selection techniques (used in this study) aimed at reducing the computational complexity of feature extraction and supervised learning.

2.1 Knowledge discovery in databases

The present history of KDD systems development consists of three main stages/generations (Piatetsky-Shapiro, 2000). The year 1989 can be associated with the first generation of KDD systems, when only a few single-task data mining (DM) tools, such as the C4.5 decision tree algorithm (Quinlan, 1993), existed. These tools were difficult to use and required significant preparation. Most such systems were based on a loosely coupled architecture, where the database and the data mining subsystems are realised as separate independent parts. This type of architecture demands continuous context switching between the data mining engine and the database (Imielinski & Mannila, 1996).

The year 1995 can be associated with the formation of the second-generation tool suites (Piatetsky-Shapiro, 2000). KDD started to be seen as “the nontrivial process of identifying valid, novel, potentially useful, and ultimately understandable patterns in data” (Fayyad 1996, 22).

2.1.1 Knowledge discovery as a process

The process of KDD comprises several steps, which involve data selection, data pre-processing, data transformation, the application of machine learning (ML) techniques (Mitchell, 1997), and the interpretation and evaluation of patterns. These basic steps of the KDD process, from raw data to the extracted knowledge, are presented in Figure 1 (adapted from Fayyad, 1996).

FIGURE 1 Basic steps of the KDD process (adapted from Fayyad 1996, 22). Solid arrows denote the processing steps of data towards discovered knowledge, while dotted-line arrows show that each of these steps may form a different iterative cycle and the results of one step can be used for any other step.

The process starts from target data selection, which is often related to the problem of building and maintaining useful data warehouses (Fayyad & Uthurusamy, 2002) and is recognized as the most time-consuming step in the KDD process (Fayyad, 1996). Target data selection is excluded from our research focus here. After selection, the target data is preprocessed in order to reduce the level of noise, to handle missing information, to reduce the data, and to remove obviously redundant features.

The data transformation step is aimed either at reducing the dimensionality of the problem, extracting the most informative features, or at enlarging the feature space, constructing additional (potentially useful) features. The linear FE techniques (a type of technique that can be applied during the transformation step) for subsequent classification are, in fact, the main focus of this thesis. During the next step, the search for patterns – summaries of a subset of the data, statistical or predictive models of the data, and relationships among parts of the data – i.e., the application of machine-learning techniques, takes place. In Fayyad (1996) this step is associated with data mining – the identification of interesting structure in the data. In this thesis we prefer to denote the step associated with the search for patterns as the application of ML techniques. The data mining concept is used when we consider the feature extraction/transformation processes, the application of ML techniques, and the evaluation processes as the core process of KDD. These processes are the focus of this thesis, namely the study of the FE and classification processes, their integration and evaluation. When we refer here to data mining strategy selection, we assume a selection of the most appropriate FE technique (as one type of data transformation technique), classifier and evaluator.

The interpretation/evaluation step helps the user by providing tools for the visualization (Fayyad et al., 2001) of the models built and patterns discovered, and for the generation of reports with discovery results and discovery process log analysis. The user has the possibility to interpret and evaluate the extracted patterns and models, to determine which patterns can be considered new knowledge, and to draw conclusions. Still, it should be noted that while evaluation is often mainly technically oriented, the interpretation of results requires close collaboration with domain experts.

Good examples of the knowledge discovery systems that follow Fayyad’s view on DM as the process are: SPSS Clementine (Clementine User Guide, Version 5, 1998), SGI Mineset (Brunk et al., 1997), and IBM Intelligent Miner (Tkach, 1998).

2.1.2 Critical issues and perspectives in KDD systems

Numerous KDD systems have recently been developed. At the beginning of this millennium there existed about 200 tools that could perform several tasks (such as clustering, classification, and visualization) for specialized applications (“vertical solutions”) (Piatetsky-Shapiro, 2000). This growing trend towards vertical solutions in DM (Fayyad & Uthurusamy, 2002) has been associated with the third generation of DM systems.

Next-generation database mining systems aim to manage KDD applications just the way SQL-based systems successfully manage business applications. These systems should integrate the data mining and database subsystems and automate (as far as needed) all the steps of the whole KDD process. They should be able to discover knowledge by combining several available KDD techniques. An essential part of the integrated KDD process is the subpart that enables situation-dependent selection of the appropriate KDD technique(s) at every step of the KDD process.

Because of the increasing number of such “vertical solutions” and the possibility to accumulate knowledge from these solutions, there is a growing potential for the appearance of next-generation database mining systems to manage KDD applications. While today's algorithms tend to be fully automatic and therefore fail to allow guidance from knowledgeable users at key stages in the search for data regularities, the researchers and developers who are involved in the creation of the next generation of data mining tools are motivated to provide a broader range of automated steps in the data mining process and to make this process more mixed-initiative, in which human experts collaborate more closely with the computer to form hypotheses and test them against the data. Moreover, initiatives to standardize the definition of data mining techniques and the process of knowledge discovery, and to provide APIs, are gaining strength (Grossman et al., 2002). Good examples are: the Predictive Model Markup Language (PMML, 2004), an XML-based language which provides a way for applications to define statistical and data mining models and to share models between PMML-compliant applications; the SQL Multimedia and Applications Packages Standard (Melton & Eisenberg, 2001), which specifies an SQL interface to data mining applications and services and provides an API for data mining applications to access data from SQL/MM-compliant relational databases; the Java Specification Request-73 (JSR, 2004), which defines a pure Java API supporting the building of data mining models and the creation, storage, and access to data and metadata; the Microsoft-supported OLE DB for DM, which defines an API for data mining for Microsoft-based applications (OLE DB, 2004); and the CRoss-Industry Standard Process for Data Mining (CRISP-DM, 2004), which captures the data mining process from business problems to deployment of the knowledge gained during the process.

2.2 Supervised learning and classifiers

A typical data mining task is to explain and predict the value of some attribute of the data given a collection of fields of some tuples with known attribute values (Chan & Stolfo, 1997). This task is often solved with inductive learning, the process of building a model from training data. The resulting model is then used to make predictions on previously unseen examples.

2.2.1 Supervised learning: the taxonomy of concepts

The task of placing an instance x into one of a finite set of possible categories c is called classification (Aivazyan, 1989). Often an instance (also called an example or a case) is defined by specifying the value of each feature. This is known as the feature-value (also called attribute-value) notation of the data that represents a problem, and an instance may be written as a row vector using the following notation:

x = [v(x1), v(x2), …, v(xd)],   (1)

where v(xi) denotes the value of feature (attribute) xi, and d is the number of features. Features that take on a value from an unordered set of possible values are called categorical (also called nominal). Continuous features are used whenever there is a linear ordering on the values, even if they are not truly continuous. The features used to define an instance are paired, as a rule, with an extra categorical feature that is called the class attribute (also called the output attribute). The range of all possible values of the features of instances is referred to as the instance space (also called the example space).

Typically, instances with a given classification value are used for supervised learning (building classifiers) and are called the training set (also called the learning set, or simply a dataset). The classifiers are usually applied to instances whose class value is unknown to the classifier, called test (also called unseen) instances, which constitute a test set. The classification of a test instance, C(xtest) ∈ range(c) = {c1, …, cc}, where the index c is the number of classes, is the process of predicting the most probable ci. However, the class values should be present in the test set so that an evaluator of the classifier is able to check the correctness of the prediction for each test instance.

A common measure of a classifier's performance is the error rate, calculated as the percentage of misclassified test instances (Merz, 1998); its complement, classification accuracy (also called generalization performance), is the percentage of correctly classified test instances. More generally, the accuracy of a classifier is the probability of correctly classifying a randomly selected instance (Kohavi, 1995b). The classification accuracy measure is used in this thesis to evaluate the performance of a data mining strategy (for example, a coupled combination of an FE technique and a classifier).

Classifiers vary widely, from simple rules to neural networks. However, we are mainly interested here in the instance-based, Naïve-Bayes and decision-tree learning techniques, briefly described in the following sections. These learning techniques are used in the experiments where the application of FE for supervised learning is analysed.

2.2.2 Instance-based classifier

An instance-based learning algorithm stores a series of training instances in its memory and uses a distance metric to compare new instances to those stored. Prediction on the new instance is based on the instance(s) closest to it (Aha et al., 1991). The simplest and most well studied instance-based learning algorithm is known as the “nearest neighbor” (NN) classifier.

The classifier stores all instances from the training set (this memorisation can hardly be called a training/learning phase) and classifies an unseen instance on the basis of a similarity measure. The distance from the unseen instance to all the training instances is calculated, and the class label corresponding to the closest training instance is assigned to the instance. The most elementary version of the algorithm is limited to continuous features with the Euclidean distance metric. Categorical features are binarised and then treated as numerical.

A more sophisticated version of the nearest neighbor classifier returns the most frequent class among the k closest training examples (denoted kNN) (Aha et al., 1991). A weighted average of the nearest neighbors can be used, for example in the weighted nearest neighbor (WNN) classifier (Cost & Salzberg, 1993): given a specific instance to be classified, the weight of a training example increases with its similarity to that instance. In this thesis we use the IBk instance-based learning algorithm from WEKA, the machine learning library in Java (Witten & Frank, 2000), the PEBLS instance-based learning algorithm (Cost & Salzberg, 1993), and the WNN classifier implemented within MLC++, the machine learning library in C++ (Kohavi et al., 1996), for the dynamic integration of classifiers (see Section 2.3).

A major problem of the simple approach of kNN is that the vector distance will not necessarily be the best measure for finding intuitively similar examples, especially if irrelevant attributes are present.
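As a concrete illustration of the basic procedure, the following is a minimal sketch of a nearest neighbour classifier of the kind described above (a plain-Python illustration written for this summary; the names `euclidean` and `knn_predict` are ours and do not refer to the WEKA or MLC++ implementations used in the experiments):

```python
from collections import Counter
import math

def euclidean(a, b):
    # Euclidean distance between two numeric feature vectors.
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def knn_predict(train_X, train_y, x, k=3):
    # "Learning" is just memorising the training set; classification
    # sorts the stored instances by distance to x and takes a
    # majority vote among the k nearest ones.
    nearest = sorted(zip(train_X, train_y),
                     key=lambda pair: euclidean(pair[0], x))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

# Toy usage: two classes in a 2-D feature space.
X = [(0.0, 0.0), (0.1, 0.2), (1.0, 1.0), (0.9, 1.1)]
y = ['a', 'a', 'b', 'b']
print(knn_predict(X, y, (0.95, 1.0), k=3))  # prints 'b'
```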

2.2.3 Naïve-Bayes classifier

The Naïve-Bayes (NB) classifier (John, 1997) uses Bayes rule to predict the class of a previously unseen example, given a training set. Bayes' theorem defines how to compute the probability of each class given the instance, assuming the features are conditionally independent given the class. The chosen class is the one that maximizes the conditional probability:

P(ci | xtest) = (P(ci) / P(xtest)) · ∏_{j=1}^{k} P(xj | ci),   (2)

where ci is the i-th class, xtest is a test example, P(A|B) is the conditional probability of A given B, and P(xtest | ci) is broken down into the product P(x1 | ci) ⋯ P(xk | ci), where xj is the value of the j-th feature in the example xtest.

More sophisticated Bayesian classifiers have been developed, for example by John (1997), but only the Naïve-Bayes classifier is used in the experiments in this study.

The Naïve-Bayes classifier relies on an assumption that is rarely valid in practical learning problems, and therefore has traditionally not been the focus of research. It has sometimes been used as the base against which more sophisticated algorithms are compared. However, it has been recently shown that, for classification problems where the predicted value is categorical, the independence assumption is less restrictive than might be expected (Domingos & Pazzani, 1996; Domingos & Pazzani, 1997; Friedman, 1997). Domingos and Pazzani (1997) have presented a derivation of necessary and sufficient conditions for the optimality of the simple Bayesian classifier showing that it can be optimal even when the independence assumption is violated by a wide margin. They showed that although the probability estimates that the Naïve-Bayes classifier produces can be inaccurate, the classifier often assigns maximum probability to the correct class.
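To make Eq. (2) concrete, the following minimal sketch (our own plain-Python illustration, not WEKA's implementation) trains and applies a Naïve-Bayes classifier for categorical features; the probabilities are estimated by unsmoothed relative frequencies, and the denominator P(xtest) is omitted because it is constant over the classes:

```python
from collections import Counter, defaultdict

def train_nb(X, y):
    # Estimate the prior P(c) and the conditionals P(x_j = v | c)
    # by relative frequencies, as assumed by Eq. (2).
    n = len(y)
    prior = {c: cnt / n for c, cnt in Counter(y).items()}
    cond = defaultdict(Counter)   # (class, feature index) -> value counts
    for xi, c in zip(X, y):
        for j, v in enumerate(xi):
            cond[(c, j)][v] += 1
    return prior, cond, Counter(y)

def predict_nb(model, x):
    # Choose the class maximising P(c) * prod_j P(x_j | c); the
    # denominator P(x) of Eq. (2) is constant over classes and omitted.
    prior, cond, class_count = model
    best_c, best_p = None, -1.0
    for c, pc in prior.items():
        p = pc
        for j, v in enumerate(x):
            p *= cond[(c, j)][v] / class_count[c]
        if p > best_p:
            best_c, best_p = c, p
    return best_c

# Toy usage with two categorical features.
X = [('sunny', 'hot'), ('sunny', 'mild'), ('rain', 'mild'), ('rain', 'cool')]
y = ['no', 'no', 'yes', 'yes']
print(predict_nb(train_nb(X, y), ('rain', 'mild')))  # prints 'yes'
```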

2.2.4 C4.5 Decision Tree classifier

Decision tree learning is one of the most widely used inductive learning methods (Breiman et al., 1984; Quinlan, 1996). A decision tree is represented as a set of nodes and arcs. Each node usually contains a feature (an attribute) and each arc leaving the node is labelled with a particular value (or range of values) for that feature. Together, a node and the arcs leaving it represent a decision about the path an example follows when being classified by the tree.

A decision tree is usually induced using “divide and conquer” or “recursive partitioning” approach to learning. Initially all the examples are in one partition and each feature is evaluated for its ability to improve the “purity” of the classes in the partitions it produces. The splitting process continues recursively until all of the leaf nodes are of one class.

The requirement that all the data be correctly classified may result in an overly complex decision tree. Extra nodes may be added in response to minor variations in the data. The problem of being overly sensitive to minor fluctuations in the training data is known as overfitting, and it is a general problem for all learning algorithms. A common strategy for avoiding overfitting in decision trees is to “prune” away those subtrees whose removal improves generalization performance, as estimated on a separate set of pruning (validation) examples.

The decision tree learning algorithm used in this thesis is WEKA’s implementation of the C4.5 decision tree learning algorithm (Quinlan, 1993), which is the most widely used decision tree learning approach. C4.5 uses gain ratio, a variant of mutual information, as the feature selection measure. C4.5 prunes by using the upper bound of a confidence interval on the resubstitution error as the error estimate; since nodes with fewer instances have a wider confidence interval, they are removed if the difference in error between them and their parents is not significant (Quinlan, 1993).
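As an illustration of the split measure, the sketch below computes the gain ratio of a categorical feature (a plain-Python illustration of the measure only, not WEKA's J48 code; the full C4.5 algorithm additionally performs recursive partitioning, handles continuous features, and prunes):

```python
import math
from collections import defaultdict, Counter

def entropy(labels):
    # Shannon entropy of a list of class labels, in bits.
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def gain_ratio(values, labels):
    # Information gain of splitting `labels` on the categorical feature
    # `values`, normalised by the split information (the entropy of the
    # partition itself), as in C4.5.
    n = len(labels)
    subsets = defaultdict(list)
    for v, c in zip(values, labels):
        subsets[v].append(c)
    gain = entropy(labels) - sum(len(s) / n * entropy(s)
                                 for s in subsets.values())
    split_info = entropy(values)
    return gain / split_info if split_info > 0 else 0.0

# A perfectly informative binary feature has gain ratio 1.0 here.
print(gain_ratio(['sunny', 'sunny', 'rain', 'rain'],
                 ['no', 'no', 'yes', 'yes']))  # prints 1.0
```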

2.3 Dynamic integration of classifiers

Recently the integration of classifiers (or ensemble of classifiers) has been under active research in machine learning (Dietterich, 1997), and different ensemble approaches have been considered (Chan & Stolfo, 1997). The integration of base classifiers into ensemble has been shown to yield higher accuracy than the most accurate base classifier alone in different real-world problems (Merz, 1998).

In general, the process of constructing an ensemble of classifiers can be described in the following way (see Figure 2). A set of base classifiers is formed during the learning phase. Each base classifier in the ensemble is trained using the training instances of the corresponding training subset. During the integration phase, an integration model that allows combining the results produced by the selected base classifiers is constructed. The integration model produces the final classification of the ensemble.

Use of an ensemble of classifiers gives rise to two basic questions: (1) what is the set of classifiers (often called base classifiers) that should be generated?; and (2) how should the classifiers be integrated? (Merz, 1998). In this thesis we will be interested in applying FE to improve the integration of classifiers.


FIGURE 2 The integration of classifiers (schematically: the training set is partitioned into subsets (subsetting); a base classifier is learnt from each subset (learning); an accuracy estimate is collected for each classifier (evaluating); and an integration model combines the base classifiers' predictions on the test set to produce the predicted class labels (integrating)).

2.3.1 Generation of base classifiers

One way of generating a diverse set of classifiers is to use learning algorithms with heterogeneous representations and search biases (Merz, 1998), such as decision trees, neural networks, instance-based learning, etc.

Another approach is to use models with homogeneous representations that differ in their method of search or in the data on which they are trained. This approach includes several techniques for generating base models, such as learning base models from different subsets of the training data. For example, two well-known ensemble methods of this type are bagging and boosting (Quinlan, 1996).

One particular way for building models with homogeneous representations, which proved to be effective, is the use of different subsets of features for each model. For example, in Oza and Tumer (1999) base classifiers are built on different feature subsets, where each feature subset includes features relevant for distinguishing one class label from the others (the number of base classifiers is equal to the number of classes). Finding a set of feature subsets for constructing an ensemble of accurate and diverse base models is also known as ensemble feature selection (Opitz & Maclin, 1999).

Ho (1998) has shown that simple random selection of feature subsets may be an effective technique for ensemble feature selection. This technique is called the random subspace method (RSM) and is derived from the theory of stochastic discrimination (Kleinberg, 2000). In the RSM, to construct each base classifier, one randomly selects a subset of features. The RSM has much in common with bagging, but instead of sampling instances, features are sampled (Skurichina & Duin, 2001).

In this thesis we use RSM in ensemble feature selection (see Article VII).
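The RSM idea can be sketched as follows (a minimal Python illustration of ours, independent of the Article VII implementation; `fit` stands for any learner that returns a prediction function, e.g. a closure around the kNN sketch in Section 2.2.2, and the base predictions are combined here by a simple majority vote):

```python
import random
from collections import Counter

def rsm_train(X, y, n_classifiers, subset_size, fit):
    # Random subspace method: sample features (not instances, as in
    # bagging) and train one base classifier per random feature subset.
    d = len(X[0])
    ensemble = []
    for _ in range(n_classifiers):
        feats = random.sample(range(d), subset_size)
        X_sub = [[xi[j] for j in feats] for xi in X]
        ensemble.append((feats, fit(X_sub, y)))
    return ensemble

def rsm_predict(ensemble, x):
    # Project the new instance onto each classifier's feature subset
    # and combine the base predictions by majority vote.
    votes = Counter(predict([x[j] for j in feats])
                    for feats, predict in ensemble)
    return votes.most_common(1)[0][0]
```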


2.3.2 Integration of base classifiers

The challenging problem of integration is to decide what the base classifiers should be or how to combine the results produced by the base classifiers. Two basic approaches have been suggested as a solution to the integration problem: (1) a combination approach, where the base classifiers produce their classifications and the final classification is composed using them (Merz, 1998); and (2) a selection approach, where one of the base classifiers is selected and the final classification is the result produced by it (Schaffer, 1993).

Techniques for combining or selecting classifiers can be divided into two subsets: static and dynamic. A static model does not depend on local information. The techniques belonging to the static selection approach propose one “best” method for the whole data space. Usually, better results can be achieved if the classifier integration is done dynamically taking into account the characteristics of each new instance. The basic idea of dynamic integration is that the information about a model’s errors in the instance space can be used for learning just as the original instances were used for learning the model. Both theoretical background and practical aspects of dynamic integration can be found in Tsymbal (2002). Gama (1999) showed that the distribution of the error rate over the instance space is not homogeneous for many types of classifiers. Depending on the classifier, the error rate will be more concentrated on certain regions of the instance space than in others.

2.3.3 Dynamic integration approaches used in the study

In this thesis, we will be interested in a dynamic integration approach that estimates the local accuracy of the base classifiers by analyzing their accuracy on instances near the instance to be classified (Puuronen et al., 1999). Instead of directly applying selection or combination as an integration method, cross-validation is used to collect information about the classification accuracies of the base classifiers, and this information is then used to estimate the local classification accuracies for each new instance. These estimates are based on the weighted nearest neighbor classification (WNN) (Cost & Salzberg, 1993).

In the study we use three different approaches based on the local accuracy estimates: Dynamic Selection (DS), Dynamic Voting (DV), and Dynamic Voting with Selection (DVS) (Tsymbal et al., 2001). All these are based on the same local accuracy estimates obtained using WNN. In DS a classifier with the least predicted local classification error is selected. In DV, each base classifier receives a weight that is proportional to the estimated local accuracy of the base classifier, and the final classification is produced by combining the votes of each classifier with their weights. In DVS, the base classifiers with the highest local classification errors are discarded (the classifiers with errors that fall into the upper half of the error interval of the base classifiers) and locally weighted voting (DV) is applied to the remaining base classifiers.
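To make the three schemes concrete, here is a hedged Python sketch (not the thesis's implementation) that combines the predictions of the base classifiers for one new instance, given the WNN-based local error estimates; the error-to-weight mapping (weight = 1 − error) and the names are our illustrative assumptions.

```python
import numpy as np

def dynamic_integration(local_errors, predictions, mode="DVS"):
    """Combine base-classifier predictions for one new instance.

    local_errors: predicted local error of each base classifier
    predictions:  class label predicted by each base classifier
    """
    local_errors = np.asarray(local_errors, dtype=float)
    predictions = np.asarray(predictions)
    if mode == "DS":                 # select the locally best classifier
        return predictions[np.argmin(local_errors)]
    keep = np.ones(len(predictions), dtype=bool)
    if mode == "DVS":                # discard the upper half of the error interval
        threshold = (local_errors.min() + local_errors.max()) / 2.0
        keep = local_errors <= threshold
    weights = 1.0 - local_errors     # locally weighted voting (DV)
    votes = {}
    for label, w in zip(predictions[keep], weights[keep]):
        votes[label] = votes.get(label, 0.0) + w
    return max(votes, key=votes.get)
```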


2.4 The curse of dimensionality and dimensionality reduction

In many applications, data, which is the subject of analysis and processing in data mining, is multidimensional, and presented by a number of features. The so-called “curse of dimensionality” (Bellman, 1961) pertinent to many learning algorithms, denotes the drastic increase in computational complexity and classification error with data having a large number of dimensions. In this section we consider some interesting properties of high dimensional spaces which motivate the reduction of space dimensionality that could probably be performed without a significant loss of important information for classification. Afterwards we give a brief general categorization of the dimensionality reduction and FE techniques.

2.4.2 Inferences of geometrical, statistical, and asymptotical properties of high dimensional spaces for supervised classification

In this section some unusual or unexpected hyperspace characteristics are discussed in order to show that a higher dimensional space is quite different from a lower dimensional one.

It was shown by Jimenez and Landgrebe (1995) that as dimensionality of a hyperspace increases:

− The volume of a hypercube concentrates in the corners.

− The volume of a hypersphere and a hyperellipsoid concentrates in the outside shell.

The above characteristics have two important implications for high dimensional data. The first one is that high dimensional space is mostly empty, which implies that multivariate data usually has an intrinsic lower dimensional structure. As a consequence, high dimensional data can be projected to a lower dimensional subspace without losing significant information, especially in terms of separability among the different classes. The second inference is that normally distributed data will have a tendency to concentrate in the tails, corners, outside shells etc., but not in the "main space". Support for this tendency can be found in the statistical behaviour of normally and uniformly distributed multivariate data at high dimension.

− The required number of labelled samples for supervised classification increases as a function of dimensionality.

It was proved by Fukunaga (1990) that the required number of training samples depends linearly on the dimensionality for a linear classifier and is related to the square of the dimensionality for a quadratic classifier. This fact is very relevant, especially because there exist circumstances where second order statistics are more appropriate than the first order statistics in discriminating among classes in high dimensional data (Fukunaga, 1990). In terms of nonparametric classifiers the situation is even more severe. It has been estimated that as the number of dimensions increases, the sample size needs to


increase exponentially in order to have an effective estimate of multivariate densities (Jimenez & Landgrebe, 1995).

It seems, therefore, that original high dimensional data should contain more discriminative information than a lower dimensional projection of the original data. But at the same time the above characteristics tell us that it is difficult with the current techniques, which are usually based on computations at full dimensionality, to extract such information unless the amount of available labelled data is substantial. The so-called Hughes phenomenon is a concrete example of this – with a limited number of training samples there is a penalty in classification accuracy as the number of features increases beyond some point (Hughes, 1968).

− For most high dimensional data sets, low-dimensional linear projections have the tendency to be normal, or a combination of normal distributions, as the dimension increases.

This is a significant characteristic of high dimensional data that is quite important to the analysis of such data. It has been proved by Hall and Li (1993) that as the dimensionality tends to infinity, lower-dimensional linear projections will approach a normality model. Normality in this case means a normal distribution or a combination of normal distributions.

As a consequence of the introduced properties it is possible to reduce the dimensionality without losing significant information and separability. As the dimensionality increases, an increasing number of labelled samples is required for supervised classification (where the computation is done at full dimensionality). So, there is a challenge to reduce dimensionality while trying to preserve the discriminative information. Therefore, there is an increasing tendency towards new methods that, instead of doing the computation at full dimensionality, use a lower dimensional subspace(s). This, besides computational benefits, will make the assumption of normality better grounded in reality, yielding a better estimation of parameters and better classification accuracy.

2.4.3 Dimensionality reduction techniques

In this section we follow the categorization of dimensionality reduction techniques according to the book by Liu (1998).

There are many techniques to achieve dimensionality reduction for data, including multidimensional heterogeneous data presented by a large number of features of different types. Usually these techniques are divided into dimensionality reduction for optimal data representation and dimensionality reduction for classification, according to their aim. According to the adopted strategy these techniques can also be divided into feature selection and feature transformation (also called feature discovery). The variants of the latter are FE and feature construction. The key difference between feature selection and feature transformation is that the former selects only a subset of the original features, while the latter is based on the generation of completely new features. Concerning the distinction between transformation techniques, feature construction implies discovering missing information about the relationships among features by inferring or creating additional features, while FE discovers a new feature space through a functional mapping.

If a subset of irrelevant and/or redundant features can be recognized and eliminated from the data, then feature selection techniques may work well. Unfortunately this is not always easy and sometimes not even possible, because a feature subset may be useful in one part of the instance space and at the same time useless or even misleading in another part of it. Moreover, all methods that simply assign weights to individual features have an essential drawback in that they are insensitive to interacting or correlated features. That is why the transformation of the given representation before weighting the features is often preferable.

In this thesis we are interested in the study of several data mining strategies that apply feature extraction for supervised learning (i.e. for subsequent classification). The following sections give a brief introduction to the FE techniques used throughout the study.

2.4.4 Feature extraction

Feature extraction (FE) is a process that extracts a subset of new features from the original set by means of some functional mapping. There are several interesting approaches to FE introduced in the literature. Among them are discriminant analysis, fractal encoding, the use of wavelet transformations, the use of mutual information, and different types of neural networks (Diamantaras & Kung, 1996). The most common technique is still probably principal component analysis (PCA) and its different variations and extensions (Liu, 1998). We would like to point out that some practitioners in the Pattern Recognition field use the term 'feature extraction' to refer to the process of extracting features from data (Duda et al., 2001).

Besides reducing computational complexity, dimensionality reduction by FE also helps to alleviate the problem of overfitting – the tendency of a classifier to assign importance to random variations in the data by declaring them important patterns, i.e., the classifier is tuned to the contingent rather than just the constitutive characteristics of the training data (Duda et al., 2001).

In general the FE process requires some domain knowledge and intuition about the problem, for the following reasons: different problem areas may require different approaches, and domain knowledge allows restricting the search space and thus helps to find relevant features effectively (Fayyad, 1996).

However, it should be noted that the transformed features are often not meaningful in terms of the original domain. Thus, additional constraints on the transformation process are required to guarantee comprehensibility if examination of the transformed classifier is necessary. Sometimes, domain knowledge helps to overcome the problem of interpretability too (Liu, 1998).

Depending on whether labelled (supervised) or unlabelled (unsupervised) data is available, FE methods may or may not use class information. Certainly, this question is crucial for classification purposes.


Also, one of the key issues in FE is the decision whether to proceed globally over the entire instance space or locally in different parts of the instance space. It can be seen that despite being globally high-dimensional and sparse, data distributions in some domain areas are locally low-dimensional and dense, for example in physical movement systems (Vijayakumar & Schaal, 1997).

2.5 Feature extraction for supervised learning

Generally, FE for supervised learning can be seen as a search process among all possible transformations of the original feature set for the best one, which preserves class separability as much as possible in the space with the lowest possible dimensionality (Aladjem, 1994). In other words we are interested in finding a projection w:

$y = w^T x$, (3)

where $y$ is a $k \times 1$ transformed data point (presented using $k$ features), $w$ is a $d \times k$ transformation matrix, and $x$ is a $d \times 1$ original data point (presented using $d$ features).

2.5.1 Principal component analysis

Principal Component Analysis (PCA) is a classical statistical method, which extracts a lower dimensional space by analyzing the covariance structure of multivariate statistical observations (Jolliffe, 1986).

The main idea behind PCA is to determine the features that explain as much of the total variation in the data as possible with as few of these features as possible. We are interested in PCA primarily as a widely used dimensionality reduction technique, although PCA is also used for example for the identification of the underlying variables, for visualization of multidimensional data, identification of groups of objects or outliers and for some other purposes (Jolliffe, 1986).

The computation of the PCA transformation matrix is based on the eigenvalue decomposition of the covariance matrix S (and therefore it is computationally rather expensive).

$w \leftarrow \text{eig\_decomposition}\Big(S = \sum_{i=1}^{n} (x_i - m)(x_i - m)^T\Big)$, (4)

where n is the number of instances, xi is the i-th instance, and m is the mean vector of the input data.

Computation of the principal components can be presented with the following algorithm:

1. Calculate the covariance matrix S from the input data.


2. Compute the eigenvalues and eigenvectors of S and sort them in a descending order with respect to the eigenvalues.

3. Form the actual transition matrix by taking the predefined number of components (eigenvectors).

4. Finally, multiply the original feature space with the obtained transition matrix, which yields a lower-dimensional representation.

The necessary cumulative percentage of variance explained by the principal axes is commonly used as a threshold that defines the number of components to be chosen.
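The four steps, together with the variance-threshold rule, can be sketched compactly in Python/NumPy (the 0.85 threshold is an arbitrary illustrative value; the thesis's own prototype is built within WEKA in Java):

```python
import numpy as np

def pca_transform(X, var_threshold=0.85):
    """PCA by eigendecomposition of the covariance matrix S (steps 1-4)."""
    m = X.mean(axis=0)
    S = np.cov(X, rowvar=False)                       # step 1
    eigvals, eigvecs = np.linalg.eigh(S)              # step 2 (ascending order)
    order = np.argsort(eigvals)[::-1]                 # sort descending
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    explained = np.cumsum(eigvals) / eigvals.sum()
    k = int(np.searchsorted(explained, var_threshold)) + 1  # threshold rule
    w = eigvecs[:, :k]                                # step 3: transition matrix
    return (X - m) @ w                                # step 4: y = w^T x per row
```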

In the case of high-dimensional data PCA is computationally expensive, especially if only a few of the first components are needed. Also, when new data points are observed and the PCA-based model is being updated, the covariance matrix and its eigenvalues require complete recalculation. Therefore, many algorithms for PCA, which extract only the desired number of principal components and which can adapt to new data have been introduced and examined (Weingessel & Hornik, 1998).

2.5.2 The random projection approach

In many application areas like market basket analysis, text mining, image processing etc., dimensionality of data is so high that commonly used dimensionality reduction techniques like PCA are almost inapplicable because of extremely high computational time/cost.

Recent theoretical and experimental results on the use of random projection (RP) as a dimensionality reduction technique have attracted the DM community (Bingham & Mannila, 2001). In RP a lower-dimensional projection is produced by means of a transformation like that in PCA, but the transformation matrix is generated randomly (although often with certain constraints).

The theory behind RP is based on the Johnson and Lindenstrauss Theorem (see for example Dasgupta & Gupta, 2003) that says that any set of n points in a d-dimensional Euclidean space can be embedded into a k-dimensional Euclidean space – where k is logarithmic in n and independent of d – so that all pairwise distances are maintained within an arbitrarily small factor (Achlioptas, 2001). The basic idea is that the transformation matrix has to be orthogonal in order to protect data from significant distortions and try to preserve distances between the data points. Generally, orthogonalization of the transformation matrix is computationally expensive, however, Achlioptas (2001) showed a very easy way of defining (and also implementing and computing) the transformation matrix for RP. So, according to Achlioptas (2001) the transformation matrix w can be computed simply either as:

$w_{ij} = \sqrt{3} \cdot \begin{cases} +1 & \text{with probability } 1/6 \\ \;\;\,0 & \text{with probability } 2/3 \\ -1 & \text{with probability } 1/6 \end{cases}$, or $\;\; w_{ij} = \begin{cases} +1 & \text{with probability } 1/2 \\ -1 & \text{with probability } 1/2 \end{cases}$ (5)
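A minimal NumPy sketch of generating such a matrix, assuming the common $1/\sqrt{k}$ scaling of the resulting projection (the scaling is not stated explicitly above):

```python
import numpy as np

def random_projection_matrix(d, k, sparse=True, seed=None):
    """Achlioptas-style transformation matrix w of equation (5)."""
    rng = np.random.default_rng(seed)
    if sparse:   # entries sqrt(3) * {+1, 0, -1} with probabilities 1/6, 2/3, 1/6
        return np.sqrt(3.0) * rng.choice([1.0, 0.0, -1.0],
                                         size=(d, k), p=[1/6, 2/3, 1/6])
    return rng.choice([1.0, -1.0], size=(d, k))   # +-1 with probability 1/2 each

# Projecting an n x d data matrix X onto k dimensions:
# Y = X @ random_projection_matrix(X.shape[1], k) / np.sqrt(k)
```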


RP as a dimensionality reduction technique was experimentally analyzed on image (noisy and noiseless) and text data (a newsgroup corpus) by Bingham and Mannila (2001). Their results demonstrate that RP preserves the similarity of data vectors rather well (even when data is projected onto relatively small numbers of dimensions).

Fradkin and Madigan (2003) performed experiments on 5 different data sets with RP and PCA for inductive supervised learning. Their results show that although PCA predictively outperformed RP, RP is still a rather useful approach because of its computational advantages. The authors also indicated a trend in their results, namely that the predictive performance of RP improves with increased dimensionality when combined with the right learning algorithm. It was found that for those 5 data sets RP is better suited to nearest neighbour methods, where preserving the distances between data points is more important than preserving the informativeness of individual features, in contrast to decision tree approaches, where the importance of these factors is reversed. However, further experimentation was encouraged.

Related work on RP includes use of RP as preprocessing of textual data, for further LSI (Papadimitriou et al., 1998), for indexing of audio documents with further LSI and use of SOM (Kurimo, 1999), for nearest-neighbor search in a high dimensional Euclidean space (Kleinberg, 1997; Indyk & Motwani, 1998), for learning high-dimensional Gaussian mixture models (Dasgupta 1999; 2000).

In general, the use of random methods (with regard to manipulations on the feature space) has a strong and lengthy tradition in the DM community, mainly because of the practical success of random forests (Breiman, 2001) and the random subspace method (RSM) (Ho, 1998).

2.5.3 Class-conditional feature extraction

Although PCA is still probably the most popular FE technique, it has a serious drawback: it gives high weights to features with higher variabilities, irrespective of whether they are useful for classification or not. This may give rise to a situation where the chosen principal component corresponds to an attribute with the highest variability but without any discriminating power (Oza & Tumer, 1999).

A usual approach to overcome the above problem is to use some class separability criterion (Aladjem, 1994), for example the criteria defined in Fisher’s linear discriminant analysis (Fisher, 1936) and based on the family of functions of scatter matrices:

$J(w) = \dfrac{w^T S_B w}{w^T S_W w}$, (6)

where $S_B$ in the parametric case is the between-class covariance matrix that shows the scatter of the expected vectors around the mixture mean, and $S_W$ is the within-class covariance matrix that shows the scatter of samples around their respective class expected vectors. Thus,


$S_W = \sum_{i=1}^{c} \frac{n_i}{n} \sum_{j=1}^{n_i} \big(x_j^{(i)} - m^{(i)}\big)\big(x_j^{(i)} - m^{(i)}\big)^T$, and $\;\; S_B = \sum_{i=1}^{c} n_i \big(m^{(i)} - m\big)\big(m^{(i)} - m\big)^T$, (7)

where $c$ is the number of classes, $n_i$ is the number of instances in class $i$, $x_j^{(i)}$ is the $j$-th instance of the $i$-th class, $m^{(i)}$ is the mean vector of the instances of the $i$-th class, and $m$ is the mean vector of all the input data.

The total covariance matrix shows the scatter of all samples around the mixture mean. It can be shown analytically that this matrix is equal to the sum of the within-class and between-class covariance matrices (Fukunaga, 1990). In this approach the objective is to maximize the distance between the means of the classes while minimizing the variance within each class. A number of other criteria were proposed by Fukunaga (1990).

The criterion (6) is optimized using the simultaneous diagonalization algorithm (see for example Fukunaga, 1990). The basic steps of the algorithm include the eigenvalue decomposition of $S_W$; transformation of the original space to the intermediate $x_W$ (whitening); calculation of $S_B$ in $x_W$; and eigenvalue decomposition of $S_B$. The transformation matrix $w$ can then finally be produced by a simple multiplication:

$w_{S_W} \leftarrow \text{eig\_decomposition}(S_W), \quad x \rightarrow x_W;$
$w_{S_B} \leftarrow \text{eig\_decomposition}(S_B \mid x_W);$
$w = w_{S_W} w_{S_B}$ (8)
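Putting (6)-(8) together, a compact Python/NumPy sketch of the parametric transformation might look as follows; the class weighting of $S_W$ and $S_B$ follows our reconstruction of (7), and the small floor on the eigenvalues is an illustrative numerical safeguard rather than part of the algorithm:

```python
import numpy as np

def parametric_fe(X, y, k):
    """Simultaneous diagonalization of S_W and S_B, equations (6)-(8)."""
    n, d = X.shape
    m = X.mean(axis=0)
    S_W, S_B = np.zeros((d, d)), np.zeros((d, d))
    for c in np.unique(y):
        Xi = X[y == c]
        mi, ni = Xi.mean(axis=0), len(Xi)
        S_W += (ni / n) * (Xi - mi).T @ (Xi - mi)   # within-class scatter
        S_B += ni * np.outer(mi - m, mi - m)        # between-class scatter
    lam, Phi = np.linalg.eigh(S_W)                  # eigendecomposition of S_W
    w_W = Phi / np.sqrt(np.maximum(lam, 1e-12))     # whitening: w_W^T S_W w_W = I
    lam_B, Psi = np.linalg.eigh(w_W.T @ S_B @ w_W)  # S_B in the whitened space
    order = np.argsort(lam_B)[::-1]
    w = w_W @ Psi[:, order[:k]]                     # final transformation matrix
    return (X - m) @ w
```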

The parametric approach considers one mean for each class and one total mixture mean when computing the between-class covariance matrix. Therefore, there is a fundamental problem with the parametric nature of the covariance matrices: the rank of $S_B$ is at most $c - 1$ (the number of classes minus one), and hence no more than this number of new features can be obtained through the FE process.

The nonparametric method overcomes this problem by trying to increase the number of degrees of freedom in the between-class covariance matrix, measuring the between-class covariances on a local basis. The k-nearest neighbor (kNN) technique is used for this purpose. In the nonparametric case the between-class covariance matrix is calculated as the scatter of the samples around the expected vectors of other classes’ instances in the neighborhood:

$S_B = \frac{1}{n} \sum_{i=1}^{c} \sum_{k=1}^{n_i} \sum_{j=1, j \neq i}^{c} w_{ik} \big(x_k^{(i)} - m_{ik}^{*(j)}\big)\big(x_k^{(i)} - m_{ik}^{*(j)}\big)^T$ (9)

where $m_{ik}^{*(j)}$ is the mean vector of the $n_{NN}$ instances of the $j$-th class that are nearest neighbors to $x_k^{(i)}$. The coefficient $w_{ik}$ is a special weighting coefficient, which shows the importance of each summand in (9). The goal of this coefficient is to assign more weight to those elements of the matrix which involve instances lying near the class boundaries and are thus more important for classification.
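For concreteness, the sketch below computes the nonparametric between-class matrix of (9) with the weighting coefficient $w_{ik}$ set to 1, i.e., the boundary-sensitive weighting described above is omitted; the parameter name n_nn stands in for $n_{NN}$:

```python
import numpy as np

def nonparametric_S_B(X, y, n_nn=5):
    """Nonparametric between-class matrix, equation (9), with w_ik = 1."""
    n, d = X.shape
    S_B = np.zeros((d, d))
    classes = np.unique(y)
    for c_i in classes:
        for x_k in X[y == c_i]:
            for c_j in classes:
                if c_j == c_i:
                    continue
                Xj = X[y == c_j]
                # local mean of the n_nn nearest neighbours of x_k in class j
                dist = np.linalg.norm(Xj - x_k, axis=1)
                m_star = Xj[np.argsort(dist)[:n_nn]].mean(axis=0)
                diff = x_k - m_star
                S_B += np.outer(diff, diff) / n
    return S_B
```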


Thus, the nonparametric approach is potentially more effective than the parametric one, because it constructs flexible boundaries between classes. However, the computational complexity of the nonparametric approach is higher. For further details on these class-conditional approaches, see Pechenizkiy et al. (2004).

2.5.4 FE for supervised learning as an analogy to constructive induction

Constructive induction (CI) is a learning process that consists of two intertwined phases, one of which is responsible for the construction of the "best" representation space and the other of which is concerned with generating hypotheses in the found space (Michalski, 1997).

In Figure 3 we can see two two-class ("+" and "–") problems – with a) a high quality and b) a low quality representation space (RS). In a), the points marked by "+" are easily separated from the points marked by "–" using a straight line or a rectangular border. In b), however, "+" and "–" are highly intermixed, which indicates the inadequacy of the original RS. A traditional approach is to search for complex boundaries to separate the classes, whereas the constructive induction approach is to search for a better representation space where the groups are much better separated, as in c).

Constructive induction systems view learning as a dual search process for an appropriate representation in the space of representational spaces and for an appropriate hypothesis in the specific representational space. Michalski introduced constructive (expand the representation space by attribute generation methods) and destructive (contract the representational space by feature selection or feature abstraction) operators. Bloedorn et al. (1993) consider meta-rules construction from meta-data to guide the selection of the operators.

FIGURE 3 High vs. low quality representation spaces (RS) for concept learning: a) high quality RS, b) low quality RS, c) improved RS due to CI (Arciszewski et al., 1995, 9)

2.6 Selecting representative instances for FE

When a data set contains a huge number of instances, some sampling approach is commonly applied to address the computational complexity of knowledge


discovery processes. In this thesis we are interested in the study of sample reduction effect on the considered FE techniques with regard to the classification performance of a supervised learner.

We use four different strategies to select samples: (1) random sampling, (2) stratified random sampling, (3) kd-tree based selective sampling, and (4) stratified sampling with kd-tree based selection.

2.6.1 Random sampling

Random sampling and stratified random sampling are the most commonly applied strategies as they are straightforward and extremely fast. In random sampling the information about the distribution of the instances by classes is disregarded. So, defining the percentage p of the total number N of training-set instances to take, we select a sample of S = (p/100)·N instances for building the FE model and the consequent supervised learning (Figure 4).


FIGURE 4 Random sampling

Intuitively, stratified sampling, which randomly selects instances from each chunk (group of instances) related to the corresponding class separately, might be preferable if we have the supervised learning process in mind.

2.6.2 Stratified random sampling

Figure 5 presents the basic idea of stratified random sampling. Conceptually, the first step (which can certainly be omitted algorithmically) is to divide the data into c chunks, where c equals the number of classes. Then, random sampling is applied to each data chunk, as in the sketch below.
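A minimal Python sketch of this stratified scheme (the function name and the rounding rule are illustrative assumptions):

```python
import numpy as np

def stratified_sample(X, y, p, seed=None):
    """Randomly take p percent of the instances from each class chunk."""
    rng = np.random.default_rng(seed)
    idx = []
    for c in np.unique(y):
        chunk = np.flatnonzero(y == c)                 # instances of class c
        s = max(1, round(len(chunk) * p / 100.0))      # S_i = p% of N_i
        idx.extend(rng.choice(chunk, size=s, replace=False))
    idx = np.asarray(idx)
    return X[idx], y[idx]
```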

However, the assumption that instances are not uniformly distributed and that some instances are more representative than others (Aha et al., 1991) motivates a selective sampling approach. Thus, the main idea of selective sampling is to identify and select representative instances, so that fewer instances are needed to achieve similar (or even better) performance. The common approach to selective sampling is data partitioning (or data indexing), which aims to find some structure in the data and then to select instances from each partition of the structure. Although there exist many data partitioning techniques (see Gaede & Gunther, 1998, for an overview), we chose the kd-tree for our study because of its simplicity and wide use.



FIGURE 5 Stratified random sampling

2.6.3 kd-Tree based sampling

A kd-tree is a generalization of the simple binary tree which uses k features instead of a single feature to split instances in a multi-dimensional space (Gaede & Gunther, 1998). The splitting is done recursively in each of the successor nodes until the node contains no more than a predefined number of instances (called bucket size) or cannot be split further. The order in which features are chosen to split can result in different kd-trees. As the goal of partitioning for selective sampling is to split instances into different (dissimilar) groups, a splitting feature is chosen if the data variance is maximized along the dimension associated with the splitting feature.

In Figure 6 the basic idea of selective sampling is presented graphically. First, a kd-tree is constructed from data, then a defined percentage of instances is selected from each leaf of the tree and added to the resulting sample to be used for FE models construction and supervised learning.


FIGURE 6 kd-Tree based selective sampling
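The following sketch implements this selective sampling idea under simplifying assumptions: median splits on the maximum-variance feature and an illustrative bucket size of 32.

```python
import numpy as np

def kdtree_leaves(X, idx=None, bucket_size=32):
    """Recursively split on the max-variance feature; yield the leaf buckets."""
    if idx is None:
        idx = np.arange(len(X))
    if len(idx) <= bucket_size:
        yield idx
        return
    f = int(np.argmax(X[idx].var(axis=0)))       # splitting feature
    median = np.median(X[idx, f])
    left, right = idx[X[idx, f] <= median], idx[X[idx, f] > median]
    if len(left) == 0 or len(right) == 0:        # node cannot be split further
        yield idx
        return
    yield from kdtree_leaves(X, left, bucket_size)
    yield from kdtree_leaves(X, right, bucket_size)

def kdtree_sample(X, p, seed=None):
    """Select p percent of the instances from every leaf of the kd-tree."""
    rng = np.random.default_rng(seed)
    take = [rng.choice(leaf, size=max(1, round(len(leaf) * p / 100.0)),
                       replace=False) for leaf in kdtree_leaves(X)]
    return np.concatenate(take)                  # indices of the selected sample
```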

2.6.4 Stratified sampling with kd-tree based selection of instances

Potentially, the combination of these approaches, so that both class information and information about data distribution are used, might be useful. This idea is


presented in Figure 7. It can be seen from the figure that in this approach, instead of constructing one global tree, several local kd-trees are constructed, one for each data chunk related to a certain class.


FIGURE 7 Stratified sampling with kd-tree based selection of instances


3 RESEARCH PROBLEM

The idea of learning from data is far from new. However, perhaps due to developments in information technology and database management, and the huge increase in the volumes of data accumulated in databases, interest in DM has become very intense. Numerous DM algorithms have recently been developed to extract knowledge from large databases. Nevertheless, nowadays the complexity of real-world problems, the high dimensionality of the data being analyzed, and poor representation spaces due to the presence of many irrelevant or indirectly relevant features challenge learning algorithms. It is commonly accepted that useful results should not be expected to appear just by pushing a button.

FE is an effective data pre-processing step aimed at reducing the dimensionality and improving the representation space of the problem at consideration. There exists a strong theoretical background on FE (techniques) and supervised learning (SL) from applied statistics, pattern recognition and related fields. However, many issues related to the integration of the FE and SL processes have not been studied intensively, perhaps due to the rare emphasis on DM/KDD as an iterative (and interactive) process (remember Figure 1) and due to the absence of a relatively large collection of benchmark data sets for conducting extensive experimental studies.

The purpose of this study is to develop theoretical background and practical aspects of FE as means of (1) dimensionality reduction, and (2) representation space improvement for SL in knowledge discovery systems. The focus is on applying conventional Principal Component Analysis (PCA) and two class-conditional approaches considered in Section 2.5 for two targets: (1)


for a base level classifier construction, and (2) for dynamic integration of the base level classifiers.

The main dimensions of research issues related to FE for SL can be recognized from Figure 8. Each instance of a dataset has an associated class value and, originally, feature values x1 … xk. By means of feature transformation new features can be extracted (constructed) by some functional mapping. Thus each instance will have additional values of features y1 … ym if these are added to the original ones. Intuitively, this might be useful when the number of original features is too small. By means of the feature selection process a number of original (hopefully redundant and irrelevant) features can be eliminated. In general, a different number of features can be selected from the k and m for each data cluster (partition). Besides the construction and selection of the most relevant features, the most representative instances for each class can be selected if the sample size is relatively large.

The rest of this Chapter is organized so that each section corresponds to one of the recognized research questions: “How important is it to use class information in the FE process?” (Section 3.1); “Is FE a data- or hypothesis-driven constructive induction?” (Section 3.2); “Is FE for dynamic integration of base-level classifiers useful in a similar way as for a single base-level classifier?” (Section 3.3); “Which features – original, extracted or both – are useful for SL?” (Section 3.4); “How many extracted features are useful for SL?” (Section 3.5); “How to cope with the presence of contextual features in data, and data heterogeneity?” (Section 3.6); “What is the effect of sample reduction on the performance of FE for SL?” (Section 3.7); “When is FE useful for SL?” (Section 3.8); “What is the effect of FE on interpretability of results and transparency of SL?” (Section 3.9); “How to make a decision about the selection of the appropriate DM strategy (particularly, the selection of FE and SL techniques) for a problem at consideration?” (Section 3.10).

FIGURE 8 The feature-values representations of the instances



3.1 How important is it to use class information in the FE process?

Giving large weights to features with higher variabilities, irrespective of whether they are useful for classification or not, can be dangerous. This may give rise to a situation where the chosen principal component corresponds to the attribute with the highest variability but has no discriminating power, as shown in Figure 9 (Oza & Tumer, 1999).

Our goal is to study the performance of conventional PCA for SL and compare it with class-conditional parametric and nonparametric approaches presented in Section 2.5.

FIGURE 9 PCA for classification: a) effective work of PCA, b) the case where an irrelevant principal component (PC(1)) was chosen from the classification point of view (O denotes the origin of the initial feature space x1, x2 and OT – the origin of the transformed feature space PC(1), PC(2))

3.2 Is FE a data- or hypothesis-driven constructive induction?

Constructive induction methods are classified into three categories: data driven (information from the training examples is used), hypothesis driven (information from the analysis of the form of intermediate hypothesis is used) and knowledge driven (domain knowledge provided by experts is used) methods (Arciszewski et al., 1995).

We consider FE for SL as an analogy of constructive induction (classification). Indeed, what we are trying to achieve by means of FE is the most appropriate data representation for the subsequent SL.

One approach is to select and perform FE keeping in mind the subsequent classification, and then to select a classifier (Figure 10). However, another approach – selecting a combination of an FE technique and a classifier jointly – may be preferable. In this case FE and classification cannot be separated into two independent processes (Figure 11).


Our goal is to address this intuitive reasoning and to study whether FE techniques have different effects on the performance of different widely used classifiers such as Naïve Bayes, C4.5 and kNN.

FIGURE 10 Independent searches for the most appropriate FE and SL techniques

FIGURE 11 The joint search for a combination of the most appropriate FE and SL techniques

3.3 Is FE for dynamic integration of base-level classifiers useful in a similar way as for a single base-level classifier?

Recent research has shown the integration of multiple classifiers to be one of the most important directions in machine learning and data mining (Dietterich, 1997). Generally, the whole space of original features is used to find the neighborhood of a new instance for local accuracy estimates in dynamic integration. We propose to use FE in order to cope with the curse of dimensionality in the dynamic integration of base-level classifiers (Figure 12).

Our main hypothesis to test is that with data sets where FE improves classification accuracy when employing a single classifier (such as kNN), it would also improve classification accuracy when a dynamic integration approach is employed. Conversely, with data sets where FE decreases (or has no effect on) classification accuracy with the use of a single classifier, FE


will also decrease (or have no effect on) classification accuracy when employing a dynamic integration approach.

(Figure legend: S – size of the ensemble, N – number of features, TS – training subset, BC – base classifier, NN – nearest neighborhood.)

FIGURE 12 Scheme of the FEDIC approach (see Article VII for the detailed description)

3.4 Which features – original, extracted or both – are useful for SL?

FE is often seen as a dimensionality reduction technique. An alternative is to see FE as a useful transformation that leads to representation space improvement, for example due to the elimination of correlated and uninformative features and the construction of uncorrelated and informative ones instead. However, when the number of original features is relatively small, some new features produced by means of the FE process may add value to the set of original ones.

Popelinsky (2001) used some transformed features as additional ones for a decision-tree learner, an instance-based learner and a Naïve Bayes learner, and found that adding principal components to the original dataset results in a decrease in error rate on many datasets for all three learners, although the decrease of error rate was significant only for the instance-based learner. Another interesting result was that for the decision-tree learner a decrease of error rate could be achieved without increasing the complexity of the decision tree.


Our goal is to study the advantages of using either the extracted features alone or both the original and extracted features for SL, with regard to the FE approaches considered in Section 2.5.

3.5 How many extracted features are useful for SL?

When the transformation matrix w for (3) (Section 2.5) has been found, one needs to decide how many extracted features to take, i.e., what the best subset of orthogonally transformed features for SL is.

One common method is to introduce some threshold, for example on the variance accounted for by a component to be selected. This results in selecting the principal components which correspond to the largest eigenvalues. The problem with this approach is that the magnitude of an eigenvalue depends on the data variance only and has nothing to do with class information. Jolliffe (1986) presents several real-life examples where principal components corresponding to the smallest eigenvalues are correlated with the output attribute. So, principal components important for classification may be excluded because they have small eigenvalues. In Figure 9 another simple example of such a situation was shown. Nevertheless, criteria for selecting the most useful transformed features are often based on the variance accounted for by the features to be selected.

An alternative approach is to use a ranking procedure and select principal components that have the highest correlations with the class attribute. Although this makes intuitive sense, there is criticism of such an approach. Almoy (1996) showed that this alternative approach worked slightly worse than using components with the largest eigenvalues in the prediction context.
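As a sketch of this alternative ranking (assuming the class attribute has been encoded numerically, which is only directly meaningful for two-class or ordered problems):

```python
import numpy as np

def rank_by_class_correlation(Z, y_numeric):
    """Rank extracted components by |correlation| with the class attribute,
    as an alternative to ranking by eigenvalue."""
    corr = np.array([abs(np.corrcoef(Z[:, j], y_numeric)[0, 1])
                     for j in range(Z.shape[1])])
    return np.argsort(corr)[::-1]    # most class-correlated components first
```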

Our goal is to analyze the performance of the different FE techniques when different numbers of extracted features are selected for SL.

3.6 How to cope with the presence of contextual features in data, and data heterogeneity?

For some datasets a feature subset may be useful in one part of the instance space, and at the same time useless or even misleading in another part of it. Therefore, it may be difficult or even impossible for some problem domains to remove irrelevant and/or redundant features from a data set and leave only useful ones by means of global feature selection (FS). However, if it is possible to find local homogeneous regions of heterogeneous data, there are more chances of successfully applying FS. For FE the decision whether to proceed globally over the entire instance space or locally in different parts of the instance space is also one of the key issues. It can be seen that despite being globally high-dimensional and sparse, data distributions in some domain areas are locally low-dimensional and dense, for example in physical movement systems.

One possible approach for local FS or local FE would be clustering


(partitioning) of the whole dataset into smaller regions. Generally, different clustering techniques can be used for this purpose, for example the k-means or EM techniques (Duda et al., 2001). However, in this thesis our focus is on the possibility of applying so-called natural clustering, aimed at using contextual features to split the whole heterogeneous data set into more homogeneous clusters. Often such (possibly hierarchical) contextual features may be constructed by domain experts. Usually, contextual (or environmental) features are assumed to be features that are not useful for classification by themselves but are useful in combination with other (context-sensitive) features (Turney, 1996).

Our goal is to analyse the performance of FE and SL when applied globally to the whole data and locally within natural clusters on a data set which is likely heterogeneous and contains contextual features.

3.7 What is the effect of sample reduction on the performance of FE for SL?

When a data set contains a huge number of instances, some sampling strategy is commonly applied before the FE or SL processes to reduce their computational time and cost.

In this thesis we are interested in studying the effect of sample reduction on FE for SL. The goal is to study whether it is important to take class information into account (to apply some sort of stratification) and to preserve the variance in the data, or to select the most representative instances, during the sampling process. The intuitive hypothesis is that the type of sampling approach is not important when the selected sample size is relatively large. However, it might be important to take into account both class information and information about the data distribution when the sample size to be selected is small.

Another goal is to find out whether sample reduction has a different effect on different FE approaches.

3.8 When is FE useful for SL?

An important issue is how to decide (for example by analyzing the space of original features, or meta-data if available) whether a PCA-based FE approach is appropriate for a certain problem or not. Since the main goal of PCA is to extract new uncorrelated features, it is logical to introduce some correlation-based criterion with a possibility to define a threshold value. One such criterion is the Kaiser-Meyer-Olkin (KMO) criterion, which accounts for both total and partial correlation:

$KMO = \dfrac{\sum_i \sum_j r_{ij}^2}{\sum_i \sum_j r_{ij}^2 + \sum_i \sum_j a_{ij}^2}$, (10)


where $r_{ij} = r(x^{(i)}, x^{(j)})$ is an element of the correlation matrix $R$ and $a_{ij}$ are the elements of $A$ (the partial correlation matrix), and

$a_{ij.X(i,j)} = \dfrac{-R_{ij}}{\sqrt{R_{ii} R_{jj}}}$, (11)

where $a_{ij.X(i,j)}$ is the partial correlation coefficient for $x^{(i)}$ and $x^{(j)}$ when the effect of all the features other than $i$ and $j$ (denoted as $X(i,j)$) is fixed (controlled), and $R_{kl}$ is the algebraic complement of $r_{kl}$ in the determinant of the correlation matrix $R$.

It can be seen that if two features share a common factor with other features, their partial correlation $a_{ij}$ will be small, indicating the unique variance they share. Thus, if the $a_{ij}$ are close to zero (the features are measuring a common factor), KMO will be close to one, while if the $a_{ij}$ are close to one (the variables are not measuring a common factor), KMO will be close to zero.

Generally, it is recommended to apply PCA for a data set only if KMO is greater than 0.5. Popelinsky (2001) recommended PCA for meta-learning tasks if KMO is greater than 0.6.
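A sketch of computing (10), assuming the sums run over $i \neq j$ and obtaining the partial correlations $a_{ij}$ from the inverse of $R$ (a standard identity not spelled out in the text):

```python
import numpy as np

def kmo(X):
    """Kaiser-Meyer-Olkin criterion, equation (10)."""
    R = np.corrcoef(X, rowvar=False)        # feature correlation matrix
    P = np.linalg.inv(R)                    # precision matrix
    D = np.sqrt(np.outer(np.diag(P), np.diag(P)))
    A = -P / D                              # partial correlation matrix
    off = ~np.eye(len(R), dtype=bool)       # exclude the diagonal
    r2, a2 = (R[off] ** 2).sum(), (A[off] ** 2).sum()
    return r2 / (r2 + a2)

# Rule of thumb from the text: apply PCA only if kmo(X) > 0.5
# (or > 0.6 for meta-learning tasks, following Popelinsky).
```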

Our goal is to analyze the appropriateness of criterion (10) for the decision-making process on usefulness of FE for a problem under consideration.

3.9 Interpretability of the extracted features

In DM applications, interpretability commonly refers to whether a classifier is easy to understand. It is commonly accepted that rule-based classifiers like decision trees and associative rules are very easy to interpret, while neural networks and other connectionist and "black-box" classifiers have low interpretability. kNN is considered to have very poor interpretability because the unstructured collection of training instances is far from readable, especially if there are many instances.

While interpretability concerns a typical classifier generated by a learning algorithm, transparency (or comprehensibility) refers to whether the principle of the method of constructing a classifier is easy to understand (that is, a user's subjective assessment). Thus, for example, a kNN classifier is scarcely interpretable, but the method itself is transparent because it appeals to the intuition of humans, who spontaneously reason from similar cases. Similarly, the interpretability of Naïve Bayes can be estimated as not very high, but the transparency of the method is good, for example for physicians who find that probabilistic explanations replicate their way of diagnosing, i.e., by summing evidence for or against a disease (Kononenko, 1993).


Our goal is to analyse the drawbacks and advantages (if any) of the FE process with respect to the interpretability of results and the transparency of different rule-based and instance-based SL algorithms.

3.10 Putting all together: towards the framework of DM strategy selection

The purpose of studying and developing pieces of theoretical background or practical aspects of FE for SL is certainly rigor-related research. However, in this thesis we also address relevance issues. In this respect our main long-term research goal is to contribute to knowledge on the problem of data mining strategy selection for a certain data mining problem. Our particular focus is on different combinations of the considered PCA-based FE techniques and SL techniques.

Our aim is to introduce a general framework of a KDD system that would incorporate a DSS approach to help the user select the most appropriate data mining strategy for a data set under consideration, and to allow mixed-initiative management of the automated KDD process (see Figure 13).

A key idea is to apply the meta-learning approach for automatic algorithm selection (see for example Kalousis (2002) for an overview). There exist two contexts of meta-learning. The first one is related to the so-called multi-classifier systems that apply different ensemble techniques (Dietterich, 1997). Their general idea is usually to select one classifier on a dynamic basis, taking into account the local performance (for example generalisation accuracy) in the instance space (see Article VII). In the second, multi-strategy learning applies a strategy selection approach which takes into account the characteristics of the classification problem (meta-data).

(Figure components: KDD-Manager, GUI, Meta-Model/ES/KB, Meta-Data, Meta-learning, Data set, Data generator, Data pre-processors, Feature manipulators, Instances manipulators, ML algorithms/classifiers, Evaluators, Post-processors/visualisers.)

FIGURE 13 DM strategy selection via meta-learning and taking benefit of CI approach (see Article IX for the detailed description)


4 RESEARCH METHODS AND RESEARCH DESIGN

In this chapter the research methods and research design of the thesis are considered. First, we introduce our view of DM research from the Information Systems Development (ISD) perspective and consider the three basic research approaches used in the study. Then, we focus our discussion on basic approaches for evaluating learned models and for evaluating the DM techniques that are used to construct these models.

4.1 DM research in the scope of ISD research methods

In Pechenizkiy et al. (2005a) we consider DM research as a continuous Information Systems Development (ISD) process. We refer to the traditional framework presented by Ives et al. (1980), which is widely known and has been used in the classification of Information Systems (IS) research literature. Drawing an analogy to this framework, we consider a DM system as a special kind of adaptive information system that processes data and helps to make use of it. Adaptation in this context is important because the DM system is often aimed at producing solutions to various real-world problems, and not to a single problem. On the one hand, a DM system is equipped with a number of techniques to be applied to a problem at hand. On the other hand, there exist a number of different problems, and current research has shown that no single technique can dominate some other technique over all possible data mining problems (Wolpert & Macready, 1996). Nevertheless, many empirical studies report that a technique or a group of techniques can perform


significantly better than any other technique on a certain DM problem or a group of problems (Kiang, 2003). Therefore DM research can be seen as a development process of a DM system aimed at efficient utilization of available DM techniques for solving a current problem.

Focusing on the ISD process, we consider the ISD framework of Nunamaker et al. (1990-91) adapted to DM artefact development. We discuss three basic groups of IS research methods. Namely, we consider the theoretical, constructive and experimental approaches with regard to Nunamaker's framework in the context of DM. We demonstrate how these approaches can be applied iteratively and/or in parallel for the development of an artefact – a DM tool – and contribute to theory creation and theory testing.

Iivari et al. (1999) relate the development process to the constructive type of research because of their philosophical belief that development always involves the creation of some new artefacts – conceptual ones (models, frameworks) or more technical artefacts (software implementations). A research approach is classified as constructive where scientific knowledge is used to produce either useful systems or methods, including the development of prototypes and processes. Iivari et al. (1999) argue for the importance of constructive research especially for applied disciplines of IS and computer science such as DM.

Nunamaker et al. (1990-91, 94) consider system development as a central part of a multi-methodological IS research cycle (Figure 14).

Theory Building (conceptual framework, mathematical models and methods) – System Development (artefact construction, technology transfer) – Experimentation (computer simulation, field experiments, lab experiments) – Observation (case studies, field studies)

FIGURE 14 A multimethodological approach to the construction of an artefact for DM (adapted from Nunamaker et al., 1990-91, 94)

Theory building involves the discovery of new knowledge in the field of study; however, it rarely contributes directly to practice. Nevertheless, the built theory often (if not always) needs to be tested in the real world to show its validity, recognize its limitations and make refinements according to observations made during its application. Therefore, research methods are subdivided into basic and applied research, as naturally both are common for


any large project (Nunamaker et al., 1990-91). A proposed theory leads to the development of a prototype system in order to illustrate the theoretical framework, and to test it through experimentation and observation with subsequent refinement of the theory and the prototype in an iterative manner. Such a view presents DM research as a complete, comprehensive and dynamic process. It allows multiple perspectives and flexible choices of methods to be applied during different stages of the research process. Furthermore, following this multimethodological approach a researcher can analyze how the results achieved through different research approaches relate to each other and search for contradictions in the results. It can be expected that such joint use of these approaches will give a better understanding of the introduced research goal and provide a more significant and sophisticated contribution to the knowledge in the area.

4.2 Research methods used in the study

Three basic research approaches are used in this thesis: the conceptual-theoretical approach, the constructive approach, and the experimental approach. These approaches are tightly connected and are applied in parallel. The theoretical background is exploited during the constructive work and the constructions are used for experimentation. The results of the constructive and experimental work are used to refine the theory. Accordingly, several research methods are applied.

In the conceptual-theoretical approach, the conceptual basics and formalisms of the integration of multiple DM methods in knowledge discovery systems, and especially dynamic integration, are reviewed and discussed. During the constructive part of the research, software that implements the developed theory and makes the experimental study and evaluation possible is developed. In the experimental part of the research, widely available benchmark databases (artificial and real-world ones) are used to evaluate the characteristics of the developed integration approach in order to obtain a deeper understanding of its behaviour in different subject domains.

The constructive approach, from the DM research point of view, can be seen as a means that helps to manipulate and coordinate the integrative work of different DM methods and to carry out the experimental approach. It is obvious that in order to construct a good artefact we need some background knowledge about the artefact's components (the basic DM techniques) and their appropriateness for certain data set characteristics. Thus, it is natural that theory-creating research has to be performed, during which the basics of the relevant DM techniques are elaborated. For these purposes, a literature survey and review was conducted. This helped us to understand the background of the problem and to analyse the previous findings in the area.

During the development process of our constructive research we used MLC++ (the machine learning library in C++) (Kohavi et al., 1996) and the WEKA machine-learning environment in Java (Witten & Frank, 2000). This allowed us to use tested and validated tools as the core/backbone of a new tool. We chose component-based development because it allows each component to be designed, implemented, tested and refined independently. Control over the individual components is easier to organize, and experiments can also be performed more easily on the separate components.

The evaluation process is an essential part of constructive research. Naturally, the experimental approach was used to evaluate the prototype; it was beneficial both for theory testing and for theory construction.

An experimental study can be done in the 'field' or in the 'laboratory'. In the first case, different approaches are tested on so-called real-world datasets with real users; in the second case, systematically controlled experiments can be organized. Controlled experiments might sometimes produce more beneficial results for theory creation since, unlike real-world datasets, synthetically generated data allow testing exactly the characteristics of interest while keeping all the others unchanged.

In the next two sections, the experimental approach and experiment design are considered in more detail.

4.3 Experimental approach

By the evaluation of a DM artefact we understand, first of all, (1) the evaluation of the learned models and meta-level models and (2) the testing of hypotheses about the superiority of one studied technique, or combination of techniques, over another. Some other important issues related to the use of a DM artefact are discussed in Pechenizkiy et al. (2005a). However, the experimental approach benefits not only the artefact evaluation and the testing of the theory used for artefact construction; it can also contribute new pieces of theory for the selection and/or combination of DM techniques for a given dataset. Meta-learning approaches are one good example of such attempts to contribute to the induction of new pieces of theory.

4.3.1 Estimating the accuracy of the model learnt 

For the purposes of algorithm comparison and selection, as well as for parameter setting, methods of estimating the performance of a set of learned models are needed. The goal of the model selection task is to estimate the generalization performance of a collection of learning algorithms and to select the algorithm with the lowest error estimate (Kohavi, 1995a).

When testing and validating a model, data miners use several techniques, including sampling, validation, cross-validation, stratification, Monte Carlo methods, and the division of the dataset into training, validation and test sets. The two most essential elements of any experimental design are randomization and experimental control of the adjustable variables and restriction of the known factors.

One way to estimate accuracy is to use the resubstitution estimate, in which the model is tested on the same data it was built on (Kohavi, 1995b). Although the resubstitution estimate is a highly optimistic estimate of accuracy, it has been noted that for large enough samples, and for some algorithms, there is no need to look further than the resubstitution estimator (Kohavi, 1995b).

However, the common approach is to use a sample of the available previously classified instances as a training set and the remaining instances as a test set. The training set is then used to learn the model, and the test set is used to test it. The major nonparametric statistical methods that follow this methodology are cross-validation, random sampling (Monte Carlo cross-validation), and bootstrapping (Merz, 1998).

In cross-validation (Schaffer, 1993) the examples are randomly split into v mutually exclusive partitions (folds) of approximately equal size. A sample is formed by setting aside one of the v folds as the test set, while the remaining folds make up the training set; this creates v possible samples. As each learned model is formed using one of the v training sets, its generalization performance is estimated on the corresponding test partition. Stratified cross-validation, where the folds are stratified so that they contain approximately the same proportions of classes as the original dataset, can give a better estimation (Kohavi, 1995a). Usually, multiple runs of cross-validation are used to stabilize the estimates (Kohavi, 1995b).
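To make the procedure concrete, a minimal sketch of stratified v-fold cross-validation is given below, written against the WEKA API used for the later experiments in this thesis. The `dataset.arff` file name and the choice of Naïve Bayes as the learner are illustrative assumptions, and `ConverterUtils.DataSource` assumes a reasonably recent WEKA 3.x release:

```java
import java.util.Random;

import weka.classifiers.Classifier;
import weka.classifiers.Evaluation;
import weka.classifiers.bayes.NaiveBayes;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class StratifiedCV {

    public static void main(String[] args) throws Exception {
        // load a dataset; the file name is a placeholder
        Instances data = DataSource.read("dataset.arff");
        data.setClassIndex(data.numAttributes() - 1);

        int v = 10;                     // number of folds
        data.randomize(new Random(1));  // randomization first...
        data.stratify(v);               // ...then stratification of the folds

        double errorSum = 0;
        for (int i = 0; i < v; i++) {
            Instances train = data.trainCV(v, i);  // v-1 folds for learning
            Instances test = data.testCV(v, i);    // the held-out fold for testing

            Classifier classifier = new NaiveBayes();
            classifier.buildClassifier(train);

            Evaluation eval = new Evaluation(train);
            eval.evaluateModel(classifier, test);
            errorSum += eval.errorRate();
        }
        System.out.println("Mean cross-validation error: " + errorSum / v);
    }
}
```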

Random sampling, or Monte Carlo cross-validation (Kohavi, 1995a), is a variant of cross-validation where a percentage of the examples (typically 2/3) is randomly placed in the training set, and the remaining examples are placed in the test set. After learning takes place on the training set, generalization performance is estimated on the test set. This whole process is repeated for many training/test splits (usually 30), and the algorithm with the best average generalization performance is selected. Random sampling is used in most experiments throughout this dissertation to evaluate the methods developed and to compare them with the existing ones.

Bootstrapping (Kohavi, 1995a) is the process of sampling with replacement from the available examples to form the training and test partitions. Kohavi (1995a) showed that, on average, cross-validation methods are better than bootstrapping and can be recommended for accuracy estimation and model selection.

The evaluation of a DM technique can be based either on the filter paradigm, where the evaluation process is independent of the learning algorithm and the most appropriate approach is chosen from the available ones according to certain data characteristics before the algorithm starts, or on the wrapper paradigm (Kohavi & John, 1998), which assumes interaction between the approach selection process and the performance of the integrative model. In this thesis the wrapper approach is used in the experimental studies.


4.3.2 Tests of hypotheses 

From the theory evaluation as well as from the artefact evaluation point of view, the general principle of evaluation (the new derivation or construct must be better than its best challenger) is applicable to DM as well. The 'goodness' criterion of a built theory or an artefact is multidimensional and sometimes difficult to define because of the mutual dependence between the component estimates. However, it is fairly easy to construct a criterion based on such estimates as the accuracy of a built model (including sensitivity and specificity, and various cost matrices) and its performance (time and memory resources). On the other hand, it is more difficult or even impossible to include in a criterion such important aspects as the interpretability of the artefact's output, because estimates of that kind are usually subjective and can be evaluated only by the end-users.

When a new DM technique is compared with an existing technique (a competitor), cross-validation methods are commonly used to estimate their generalization performance. As a rule, it is necessary to determine how significant the observed differences are. The resampled Student's t-test (also known as the resampled paired t-test) is one commonly used tool, although it has many potential drawbacks (Dietterich, 1998).

For the resampled paired t-test, a series of trials (usually 30) is conducted. In each trial, the available sample is randomly divided into a training set and a test set (for example, two thirds and one third of the data, respectively). Learning algorithms A and B are both trained on the training set, and the resulting classifiers are tested on the test set. Let $p_A^{(i)}$ (respectively $p_B^{(i)}$) be the observed proportion of test examples misclassified by algorithm A (respectively B) during trial i. If we assume that the 30 differences $p^{(i)} = p_A^{(i)} - p_B^{(i)}$ were drawn independently from a normal distribution, then we can apply the Student's t-test by computing the statistic

$$t = \frac{\bar{p} \cdot \sqrt{n}}{\sqrt{\frac{\sum_{i=1}^{n} \left( p^{(i)} - \bar{p} \right)^2}{n-1}}}, \qquad (12)$$

where $\bar{p} = \frac{1}{n} \sum_{i=1}^{n} p^{(i)}$ and n is the number of trials. Under the null hypothesis, this statistic has a t distribution with n-1 degrees of freedom. For example, for 30 trials the null hypothesis can be rejected if $|t| > t_{29,\,0.975} = 2.04523$.

Usually, neither the independence of the algorithms' error estimates ($p_A^{(i)}$ and $p_B^{(i)}$) nor the independence of each evaluation from the others (because of the overlapping of the training sets in the trials) is guaranteed. Recent studies (Dietterich, 1998; Salzberg, 1999) have shown that the resampled t-test and other commonly used significance tests have an unacceptably high probability of detecting a difference in generalization performance when no difference exists (Type I error). This is primarily due to the nature of the sampling process in the experimental design and the number of examples available.
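As a worked illustration of Equation (12), the following sketch computes the resampled paired t statistic from per-trial error rates. The thirty pA/pB values here are synthetic placeholders standing in for the error rates measured in the thirty trials:

```java
import java.util.Random;

public class ResampledPairedTTest {

    /** Computes the t statistic of Equation (12) from per-trial error rates. */
    static double tStatistic(double[] pA, double[] pB) {
        int n = pA.length;
        double[] p = new double[n];
        double mean = 0;
        for (int i = 0; i < n; i++) {
            p[i] = pA[i] - pB[i];  // per-trial difference p(i) = pA(i) - pB(i)
            mean += p[i];
        }
        mean /= n;                 // the average difference over the n trials

        double ss = 0;             // sum of squared deviations from the mean
        for (int i = 0; i < n; i++) {
            ss += (p[i] - mean) * (p[i] - mean);
        }
        return mean * Math.sqrt(n) / Math.sqrt(ss / (n - 1));
    }

    public static void main(String[] args) {
        // placeholder error rates standing in for 30 measured trials
        double[] pA = new double[30];
        double[] pB = new double[30];
        Random rnd = new Random(1);
        for (int i = 0; i < 30; i++) {
            pA[i] = 0.20 + 0.02 * rnd.nextGaussian();
            pB[i] = 0.23 + 0.02 * rnd.nextGaussian();
        }

        double t = tStatistic(pA, pB);
        // two-sided test at the 5% level with 29 degrees of freedom
        System.out.println("t = " + t + ", significant: " + (Math.abs(t) > 2.04523));
    }
}
```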


McNemar's test and the test for the difference of two proportions are claimed to have an acceptable Type I error (Dietterich, 1998). Nevertheless, in Tsymbal (2002) the same results were obtained with all the tests, sometimes with different levels of significance. Although no single procedure for comparing learning methods based on limited data satisfies all the constraints, each of them provides approximate confidence intervals that can be used in interpreting experimental comparisons of learning techniques (Mitchell, 1997).

4.4 Experimental design

In this section, the most common experimental settings used throughout this study are described. More detailed experimental settings can be found in the corresponding section of the related article included in the thesis.

To compare the developed algorithms with the existing ones, 30 or 70 test runs were made for each data set. In each test run a data set was first split into the training set, the validation set and the test set by stratified random sampling. Each time, 70 percent of the instances were included in the training set and the other 30 percent were used for the test set. When the validation set was used (for example, in the iterative refinement of an ensemble), 60 percent of the instances were included in the training set, and the other 40 percent were divided into validation and test sets of approximately equal size. The test set was used for the final estimation of the ensemble accuracy.

When needed, the values of continuous features were discretized by dividing the range of the feature's values into intervals of equal length. The whole experimental environment was implemented first within the MLC++ framework. The results described in Article I, Article VII and Article VIII were achieved using that experimental environment. For our further studies we implemented the experimental environment within the WEKA machine-learning environment in Java (WEKA 3, 2004).
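A minimal sketch of such equal-width discretization, using WEKA's unsupervised `Discretize` filter (the bin count of 10 is an illustrative assumption, not the setting used in the included articles):

```java
import weka.core.Instances;
import weka.filters.Filter;
import weka.filters.unsupervised.attribute.Discretize;

public class EqualWidthDiscretization {

    /** Discretizes all continuous features into equal-width intervals. */
    static Instances discretize(Instances data) throws Exception {
        Discretize filter = new Discretize();
        filter.setBins(10);                  // illustrative number of intervals
        filter.setUseEqualFrequency(false);  // equal-width rather than equal-frequency bins
        filter.setInputFormat(data);
        return Filter.useFilter(data, filter);
    }
}
```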

The datasets used in the experiments were taken from the University of California at Irvine Machine Learning Repository (Blake & Merz, 1998), except for the Acute Abdominal Pain (AAP) datasets, provided by the Laboratory for System Design, Faculty of Electrical Engineering and Computer Science, University of Maribor, Slovenia and the Theoretical Surgery Unit, Dept. of General and Trauma Surgery, Heinrich-Heine University Düsseldorf, Germany (Zorman et al., 2001), and the microbiology datasets (Antibioticograms), provided by the N. N. Burdenko Institute of Neurosurgery, Russian Academy of Medical Sciences, Moscow, Russia. A short summary and description of all the datasets used throughout the different experiments can be found in Appendix A.

In the experiments, the following techniques were used: four FE techniques considered in Section 2.5 (conventional PCA, random projections, and the parametric and nonparametric class-conditional FE techniques); three supervised learning techniques considered in Section 2.2 (Naïve Bayes, k-nearest neighbour, and the C4.5 decision tree); three integration techniques considered in Section 2.3 (dynamic selection, dynamic voting, and dynamic voting with selection); and four sampling techniques considered in Section 2.6 (random, stratified random, and kd-tree based selective sampling with and without stratification). In the evaluation of the different DM strategies composed of the above-mentioned techniques, we were interested in their generalization accuracy, the number of features required to construct a model, and the time taken to construct and test a model.


5 RESEARCH RESULTS: SUMMARY OF THE INCLUDED ARTICLES

This chapter presents a brief discussion of each article included in the thesis and of its main findings. Generally, each included article addresses the research problem(s) presented in the corresponding section(s) of Chapter 3.

5.1 “Eigenvector-based feature extraction for classification”

Reference: Tsymbal, A., Puuronen, S., Pechenizkiy, M., Baumgarten, M. & Patterson, D. 2002. Eigenvector-based Feature Extraction for Classification. In: S. M. Haller, G. Simmons (Eds.), Proceedings of the 15th International FLAIRS Conference on Artificial Intelligence, FL, USA: AAAI Press, 354-358.

PCA-based FE techniques are widely used for classification problems, though they generally do not take into account the class information and are based solely on the inputs. Although this approach can be of great help in unsupervised learning, there is no guarantee that the new axes are consistent with the discriminatory features in a classification problem.

This paper shows the importance of the use of class information in FE for SL and the inappropriateness of conventional PCA for FE for SL. We considered the two class-conditional eigenvector-based approaches to FE described in Subsection 2.5.3. We compared the two approaches with each other, with conventional PCA, and with plain nearest neighbor classification without FE.


First, a series of experiments was conducted to select the best α and k coefficients for the nonparametric approach. The parameter α was selected from the set of 9 values α ∈ {1/20, 1/10, 1/5, 1/3, 1, 3, 5, 10, 20}, and the number of nearest neighbors k from the set of 8 values k = 2^i − 1, i = 1, ..., 8, that is, k ∈ {1, 3, 7, 15, 31, 63, 127, 255}. The parameters were selected on a wrapper-like basis, optimizing the classification accuracy. For some data sets, for example LED and LED17, selection of the best parameters gave almost no improvement in comparison with the values considered by Fukunaga (1990), α = 1 and k = 3, and the classification accuracy varied within a range of one percent. It is necessary to note that the selection of the α and k parameters changed the ranking of the three feature extraction approaches from the accuracy point of view on only two data sets, demonstrating that the nonparametric approach is robust with regard to its built-in parameters. However, for some data sets the selection of the parameters had a significant positive effect on the classification accuracy. For example, on the MONK-2 data set, accuracy is 0.796 when α = 1 and k = 3, but it reaches 0.974 when α = 20 and k = 63.
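A sketch of this wrapper-like parameter selection is given below; `NonparametricFE` is a hypothetical stand-in for the nonparametric class-conditional transformation of Subsection 2.5.3, and 1-NN with 10-fold cross-validation is an illustrative choice of evaluation:

```java
import java.util.Random;

import weka.classifiers.Evaluation;
import weka.classifiers.lazy.IBk;
import weka.core.Instances;

public class AlphaKGridSearch {

    /** Hypothetical stand-in for the nonparametric class-conditional FE. */
    interface NonparametricFE {
        Instances transform(Instances data, double alpha, int k) throws Exception;
    }

    /** Returns {bestAlpha, bestK, bestAccuracy} found by the wrapper-like search. */
    static double[] search(Instances data, NonparametricFE fe) throws Exception {
        double[] alphas = {1 / 20.0, 1 / 10.0, 1 / 5.0, 1 / 3.0, 1, 3, 5, 10, 20};
        int[] ks = {1, 3, 7, 15, 31, 63, 127, 255};  // k = 2^i - 1, i = 1..8

        double bestAlpha = 1, bestAcc = -1;
        int bestK = 3;
        for (double alpha : alphas) {
            for (int k : ks) {
                Instances transformed = fe.transform(data, alpha, k);
                // wrapper-like evaluation: cross-validated accuracy of 1-NN
                // in the extracted feature space
                Evaluation eval = new Evaluation(transformed);
                eval.crossValidateModel(new IBk(1), transformed, 10, new Random(1));
                if (eval.pctCorrect() > bestAcc) {
                    bestAcc = eval.pctCorrect();
                    bestAlpha = alpha;
                    bestK = k;
                }
            }
        }
        return new double[] {bestAlpha, bestK, bestAcc};
    }
}
```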

The nonparametric approach had the best accuracy on average. It also performed much better on the categorical data, improving on the accuracy of the other approaches for this selection of data sets; however, further research is necessary to confirm this finding. The parametric approach was quite unstable and not robust to differing data set characteristics. Conventional PCA was the worst FE technique on average, and classification without FE was clearly the worst overall. This demonstrates the so-called "curse of dimensionality" and the necessity of FE.

Thus, the experimental results supported our expectations. Still, it is necessary to note that each feature extraction technique was significantly worse than all the other techniques on at least one data set (for example, the Heart data set for the nonparametric approach), and it is a question for further research to define the dependencies between the characteristics of a data set and the type and parameters of the feature extraction approach best suited for it.

5.2 “PCA-based Feature Transformations for Classification: Issues in Medical Diagnostics”

Reference: Pechenizkiy, M., Tsymbal, A. & Puuronen, S. 2004. PCA-based Feature Transformations for Classification: Issues in Medical Diagnostics. In: R. Long et al. (Eds.), Proceedings of the 17th IEEE Symposium on Computer-Based Medical Systems CBMS'2004, Bethesda, MD: IEEE CS Press, 535-540.

Current electronic data repositories, especially in medical domains, contain enormous amounts of data, including currently unknown and potentially interesting patterns and relations that can be uncovered using knowledge discovery and DM methods. Inductive learning systems have been successfully applied in a number of medical domains, for example in the localization of a primary tumor, prognostics of recurrence of breast cancer, diagnosis of thyroid diseases, and rheumatology.

However, researchers and practitioners realize that the effective use of these inductive learning systems requires data preprocessing before a learning algorithm is applied. This is especially important for multidimensional heterogeneous data consisting of a large number of features of different types. If a subset of irrelevant and/or redundant features can be recognized and eliminated from the data, then feature selection techniques may work well. Unfortunately, this is not always easy and sometimes not even possible, because a feature subset may be useful in one part of the instance space and at the same time useless or even misleading in another part of it. That is why the transformation of the given representation before weighting the features is often preferable.

FE is often seen as a dimensionality reduction technique. An alternative is to see FE as a useful transformation that leads to representation space improvement due to elimination of correlated and uninformative features and construction of uncorrelated and informative ones. However, when the number of original features is relatively small, some new features produced by means of FE process may give additional value for the set of original ones.

In this paper we studied the advantages of using either the extracted features alone or both the original and extracted features for SL.

We elaborated a test bench with a collection of medical data sets taken from the UCI machine learning repository and three data sets of cases of acute abdominal pain to conduct experiments with the considered FE techniques (Section 2.5) and the kNN classifier. We evaluated four combinations of kNN with the considered PCA-based FE techniques (conventional PCA, ranked PCA, the parametric eigenvalue-based approach, and the nonparametric eigenvalue-based approach). Then we evaluated the results of these four combinations against the best wrapper procedure. After that, we compared the same combinations to find out whether the replacement of the initial features by the extracted ones is better than their superposition.

Our experimental results showed that for the Diabetes and Thyroid data sets none of the feature transformation techniques could improve on a plain 3NN classifier. For the Heart and Cancer data sets 3NN achieves the highest accuracy when the new features extracted by the parametric approach are used instead of the original ones, and for the Liver data set the best results are achieved when the feature extracted by the parametric approach is used together with the original ones. The KMO criterion (Section 3.8), successfully used in factor analysis, turned out not to be useful for deciding whether FE will improve the representation space for SL: although KMO was higher than 0.5 for every data set, the principal components, when used instead of the original features, resulted in a lower accuracy of the 3NN classifier, and when used together with the original features, never improved the classification accuracy.

We also discussed some interpretability issues of the newly extracted features. We argued that, although for rule-like approaches (association rules) the interpretation of rules with respect to the initial (original) features may be difficult or even impossible, for case-based approaches (nearest neighbor) comparison by analogy may be easier. Additionally, we discussed whether the transformation formulas of the principal components may provide useful information for the interpretability of results, and whether interpretability can be improved with the help of a rotation of the new feature space and back-transformation (where such operations are appropriate and applicable).

5.3 “On Combining Principal Components with Fisher’s Linear Discriminants for Supervised Learning”

Reference: Pechenizkiy, M., Tsymbal, A. & Puuronen, S. 2005. On Combining Principal Components with Fisher's Linear Discriminants for Supervised Learning. Submitted to the Special Issue of Foundations of Computing and Decision Sciences "Data Mining and Knowledge Discovery" (as an extended version of Pechenizkiy et al., 2005e).

In this paper, principal component analysis (PCA), parametric feature extraction (FE) based on Fisher's linear discriminant analysis (LDA), and their combination as means of dimensionality reduction are analyzed with respect to the performance of a classifier. Three commonly used classifiers are taken for the analysis: kNN, Naïve Bayes and the C4.5 decision tree. Recently, it has been argued that it is extremely important to use class information in FE for supervised learning (SL). However, LDA-based FE, although it uses class information, has a serious shortcoming due to its parametric nature: the number of extracted components cannot be more than the number of classes minus one. Besides, as can be concluded from its name, LDA works mostly for linearly separable classes only.

In this paper we study whether it is possible to overcome these shortcomings by adding the most significant principal components to the set of features extracted with LDA. In experiments on 21 benchmark datasets from the UCI repository, these two approaches (PCA and LDA) are compared with each other and with their combination for each classifier.
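As an illustration of the combination itself, the sketch below merges the two extracted views attribute-wise with WEKA's `Instances.mergeInstances`; the `ldaFeatures` and `pcaFeatures` inputs are assumed to be precomputed transformations of the same instances:

```java
import weka.core.Instances;

public class CombinedFeatureSpace {

    /**
     * Merges LDA-based features with the first principal components
     * attribute-wise. Both inputs are assumed to describe the same
     * instances in the same order, with the class attribute kept as
     * the last attribute of ldaFeatures only.
     */
    static Instances combine(Instances pcaFeatures, Instances ldaFeatures) {
        Instances combined = Instances.mergeInstances(pcaFeatures, ldaFeatures);
        combined.setClassIndex(combined.numAttributes() - 1);
        return combined;
    }
}
```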

Our results show that such a combination approach has certain potential, especially when applied for C4.5 decision tree learning. However, from the practical point of view the combination approach cannot be recommended for Naïve Bayes, since its behaviour is very unstable on different datasets. Presumably, additional feature selection from the combined set of features would be useful for Naïve Bayes and kNN; with C4.5 such selection is done implicitly.


5.4 “The Impact of the Feature Extraction on the Performance of a Classifier: kNN, Naïve Bayes and C4.5”

Reference: Pechenizkiy, M. 2005. The Impact of the Feature Extraction on the Performance of a Classifier: kNN, Naïve Bayes and C4.5. In: B. Kegl & G. Lapalme (Eds.), Proceedings of the 18th CSCSI Conference on Artificial Intelligence AI'05, LNAI 3501, Heidelberg: Springer-Verlag, 268-279.

In this paper we analyzed FE from two different perspectives. The first is related to the "curse of dimensionality" problem and the necessity of dimensionality reduction (see Section 2.4). The second perspective comes from the assumption that in many data sets to be processed some individual features, being irrelevant or only indirectly relevant for the purpose of analysis, form a poor problem representation space. The corresponding ideas of constructive induction, which assume the improvement of the problem representation before the application of any learning technique, are presented (see Section 2.6).

FE accounts for both of these perspectives; therefore, when applied either on data sets with high dimensionality or on data sets including indirectly relevant features, FE can improve the performance of a classifier.

One main hypothesis is that different FE techniques might have different effects for different classifiers.

We conducted experiments with four different types of FE techniques (PCA, random projection, and two class-conditional approaches to FE) and three SL algorithms (nearest neighbour classification, Naïve Bayes, and C4.5 decision tree learning), analyzing the impact of the FE techniques on the classification performance on 20 UCI datasets.

The experimental results in this paper show that for many data sets FE does increase classification accuracy. Still, there is no best FE technique among the considered ones, and it is hard to say which one is best for a certain classifier and/or a certain problem; however, some major trends can be recognized from the experimental results.

The class-conditional approaches (and especially the nonparametric approach) were often the best ones. This indicates the importance of taking class information into account and not relying only on the distribution of variance in the data. At the same time, it is important to notice that the parametric FE was very often the worst, and for 3NN and C4.5 the parametric FE was the worst more often than RP. Such results highlight the very unstable behaviour of parametric FE. One possibility to improve the parametric FE would be to combine it with PCA or with a feature selection approach, so that a few principal components, or the features most useful for classification, are added to those extracted by the parametric approach. We experimentally evaluated this idea later in Article III; the results showed that parametric FE produces more stable results when its extracted features are combined with a few principal components for SL.


Although it is logical to assume that RP should be more successful in applications where the distances between the original data points are meaningful and/or for learning algorithms that use distances between the data points, our results show that this is not necessarily true. However, the data sets in our experiments have 48 features at most, while RP is usually applied to problems of much higher dimensionality.

The main conclusion of the paper is that FE techniques are powerful tools that can significantly increase classification accuracy by producing better representation spaces or by resolving the problem of "the curse of dimensionality". However, when applied blindly, FE may have no effect on the subsequent classification or can even deteriorate the classification accuracy.

5.5 “Local Dimensionality Reduction within Natural Clusters for Medical Data Analysis”

Reference: Pechenizkiy, M., Tsymbal, A. & Puuronen, S. 2005. Supervised Learning and Local Dimensionality Reduction within Natural Clusters: Biomedical Data Analysis. Submitted to the IEEE Transactions on Information Technology in Biomedicine, Special Post-conference Issue "Mining Biomedical Data" (as an extended version of Pechenizkiy et al., 2005c).

Inductive learning systems have been successfully applied in a number of medical domains. Nevertheless, the effective use of these systems requires data preprocessing before a learning algorithm is applied. This is especially important for multidimensional heterogeneous data presented by a large number of features of different types. Dimensionality reduction is one commonly applied approach. The goal of this paper was to study the impact of "natural" clustering on dimensionality reduction for classification. We compared several DM strategies that apply dimensionality reduction by means of FE or feature selection for subsequent SL on a selected part of a real clinical database, trying to construct data models that would help in the prediction of antibiotic resistance and in understanding its development.

Each instance of the data used in our analysis represents one sensitivity test and contains features related to the pathogen that is isolated during the microbe identification analysis, the antibiotic that is used in the sensitivity test, and the result of the sensitivity test (sensitive, resistant, or intermediate). The information about the sensitivity analysis is connected with a patient, his/her demographic data (sex, age) and hospitalization in the Institute (main department, whether the test was taken while the patient was in the ICU (Intensive Care Unit), days spent in the hospital before the test, etc.). We introduced grouping features for pathogens and antibiotics, so that 17 pathogens and 39 antibiotics were combined into 6 and 15 groups, respectively. Thus, each instance had 28 features that included the information corresponding to a single sensitivity test, augmented with data concerning the type of the antibiotic used and the isolated pathogen, the clinical features of the patient and his/her demographics, and the microbiology test result as the class attribute. The data is relatively high-dimensional and heterogeneous; the heterogeneity is represented by a number of contextual (environmental) features. In this study we applied natural clustering, aimed at using the contextual features to split a real clinical dataset into more homogeneous clusters in order to construct local data models that would help in better prediction of antibiotic resistance. Semantically, the sensitivity concept is related first of all to the pathogen and antibiotic concepts. For our study, the binary features that describe the pathogen grouping were selected as prior environmental features, and they were used for hierarchical natural clustering (the hierarchy was introduced by the grouping of the features). Thus, the whole dataset was divided into two nearly equal natural clusters: gram+ and gram–. Then, the gram+ cluster was divided into the staphylococcus and enterococcus clusters, and the gram– cluster into the enterobacteria and nonfermentes clusters.
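A minimal sketch of this scheme: the data is routed by one binary contextual feature into two natural clusters and a local model is trained per cluster. The `contextIdx` index and the use of 3NN as the local learner are illustrative assumptions; local FE would be applied inside each branch:

```java
import weka.classifiers.Classifier;
import weka.classifiers.lazy.IBk;
import weka.core.Instance;
import weka.core.Instances;

public class NaturalClusteringSketch {

    /**
     * Splits the data on one binary contextual feature (contextIdx is a
     * placeholder index, e.g. the gram+/gram- indicator) and trains one
     * local model per natural cluster.
     */
    static Classifier[] buildLocalModels(Instances data, int contextIdx) throws Exception {
        Instances cluster0 = new Instances(data, 0);  // e.g. gram-
        Instances cluster1 = new Instances(data, 0);  // e.g. gram+
        for (int i = 0; i < data.numInstances(); i++) {
            Instance inst = data.instance(i);
            if (inst.value(contextIdx) == 0) {
                cluster0.add(inst);
            } else {
                cluster1.add(inst);
            }
        }
        // local FE would be applied to each cluster here, before learning
        Classifier[] local = {new IBk(3), new IBk(3)};
        local[0].buildClassifier(cluster0);
        local[1].buildClassifier(cluster1);
        return local;
    }
}
```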

We analyzed experimentally whether local dimensionality reduction within the "natural" clusters performs better than a global search for a better feature space for classification.

In our experimental study we applied k-nearest neighbor classification (kNN) to build antibiotic sensitivity prediction models. We applied three different wrapper-based sequential FS techniques and three PCA-based FE techniques, both globally and locally, and analyzed their impact on the performance of the kNN classifier.

The results of our experiments showed that natural clustering is a very effective and efficient approach for coping with complex heterogeneous datasets, and that the proper selection of a local FE technique can lead to a significant increase of predictive accuracy in comparison with global kNN with or without FE. The number of features extracted or selected locally is always smaller than that in the global space, which shows the usefulness of natural clustering in coping with data heterogeneity.

5.6 “The Impact of Sample Reduction on PCA-based Feature Extraction for Naïve Bayes Classification”

Reference: Pechenizkiy, M., Puuronen, S. & Tsymbal, A. 2006. The Impact of Sample Reduction on PCA-based Feature Extraction for Supervised Learning. To appear in: Proceedings of the 21st ACM Symposium on Applied Computing (SAC'06, Data Mining Track), ACM Press.

When a data set contains a huge number of instances, some sampling approach is applied to address the computational complexity of the FE and classification processes. The focus of this paper is the study of the effect of sample reduction on FE techniques with regard to classification performance.

The main goal of this paper is to show the impact of sample reduction on the process of FE for classification. In our study we analyzed conventional Principal Component Analysis (PCA) and two eigenvector-based approaches that take class information into account (Section 2.5). The experiments were conducted on ten UCI data sets, using four different strategies to select samples: (1) random sampling, (2) stratified random sampling, (3) kd-tree based selective sampling, and (4) stratified sampling with kd-tree based selection.
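For illustration, stratified random sampling of roughly p percent of the data can be sketched with WEKA's fold machinery; this is a sketch under the assumption that the class index is already set, and kd-tree based selection is not shown:

```java
import java.util.Random;

import weka.core.Instances;

public class StratifiedSamplingSketch {

    /** Draws roughly p percent of the data while preserving class proportions. */
    static Instances stratifiedSample(Instances data, double p, int seed) {
        Instances copy = new Instances(data);       // keep the original intact
        copy.randomize(new Random(seed));
        int folds = (int) Math.round(100.0 / p);    // e.g. p = 10 gives 10 folds
        copy.stratify(folds);                       // class proportions per fold
        return copy.testCV(folds, 0);               // one stratified fold of ~p percent
    }
}
```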

The experimental results of our study showed that the type of sampling approach is not important when the selected sample size is relatively large. However, it is important to take into account both class information and information about data distribution when the sample size to be selected is small.

The kd-tree based sampling has a very similar effect to stratified random sampling, although the two are different in nature.

Comparing the results related to the four sampling strategies, we can conclude that no matter which of the four sampling strategies is used, if the sample size p is small (p ≈ 10%), then SL without FE yields the most accurate results; if p ≥ 20%, then the nonparametric class-conditional FE (NPAR) outperforms the other methods; and if p ≥ 30%, NPAR outperforms the other methods even if they use 100% of the sample. The best p for NPAR depends on the sampling method: for random and stratified sampling p = 70%, for kd-tree based sampling p = 80%, and for stratified sampling with kd-tree based selection p = 60%. PCA is the worst technique when applied on a small sample, especially when stratification or kd-tree indexing is used.

Generally, all sampling strategies have a similar effect on the final classification accuracy of NB for p > 30%; the significant differences in accuracy occur within 10% ≤ p ≤ 30%. The intuitive explanation is that when a very large proportion of the sample is taken, it does not matter which strategy is used, since most of the selected instances are likely to be the same ones (maybe chosen in different orders). However, the smaller the portion of the sample, the more important it is how the instances are selected.

5.7 “Feature extraction for dynamic integration of classifiers”

Reference: Pechenizkiy, M., Tsymbal, A., Puuronen, S. & Patterson, D. 2005. Feature Extraction for Dynamic Integration of Classifiers. Submitted to Fundamenta Informaticae, IOS Press (as an extended version of Tsymbal et al., 2003).

Recent research has shown the integration of multiple classifiers to be one of the most important directions in machine learning and DM. It has been shown that, for an ensemble to be successful, it should consist of accurate and diverse base classifiers. However, it is also important that the integration procedure in the ensemble properly utilizes the ensemble diversity. In this paper, we present an algorithm for the dynamic integration of classifiers in the space of extracted features (FEDIC). It is based on the technique of dynamic integration, in which local accuracy estimates are calculated for each base classifier of an ensemble in the neighbourhood of a new instance to be processed. Generally, the whole space of original features is used to find the neighbourhood of a new instance for the local accuracy estimates in dynamic integration. We propose to use FE in order to cope with the curse of dimensionality in the dynamic integration of classifiers. We consider classical principal component analysis and two eigenvector-based supervised FE methods that take class information into account, and their application to the dynamic selection, dynamic voting and dynamic voting with selection integration techniques (DS, DV and DVS). Experimental results show that, on some data sets, the use of FEDIC leads to significantly higher ensemble accuracies than the use of plain dynamic integration in the space of original features. As a rule, FEDIC outperforms plain dynamic integration on data sets on which dynamic integration works well (i.e., it outperforms static integration) and the considered FE techniques are able to successfully extract relevant features.
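To make the dynamic selection step concrete, the sketch below picks, for a new instance, the base classifier with the fewest errors in the neighbourhood of that instance; the `train` set and the instance `x` are assumed to be already mapped into the extracted feature space, the base classifiers already trained, and WEKA's `EuclideanDistance` utility (present in recent 3.x releases) is assumed to be available. This is a simplified reading of DS, not the full FEDIC algorithm:

```java
import java.util.Arrays;

import weka.classifiers.Classifier;
import weka.core.EuclideanDistance;
import weka.core.Instance;
import weka.core.Instances;

public class DynamicSelectionSketch {

    /**
     * Classifies x with the base classifier that makes the fewest errors
     * among the nn nearest training instances of x.
     */
    static double classify(Classifier[] ensemble, Instances train, Instance x, int nn)
            throws Exception {
        EuclideanDistance dist = new EuclideanDistance(train);

        // maintain the nn nearest neighbours seen so far (insertion pass)
        int[] nearest = new int[nn];
        double[] best = new double[nn];
        Arrays.fill(best, Double.MAX_VALUE);
        for (int i = 0; i < train.numInstances(); i++) {
            double d = dist.distance(x, train.instance(i));
            for (int j = 0; j < nn; j++) {
                if (d < best[j]) {
                    for (int m = nn - 1; m > j; m--) {  // shift worse entries down
                        best[m] = best[m - 1];
                        nearest[m] = nearest[m - 1];
                    }
                    best[j] = d;
                    nearest[j] = i;
                    break;
                }
            }
        }

        // local error of each base classifier in the neighbourhood of x
        int selected = 0;
        double lowestErrors = Double.MAX_VALUE;
        for (int c = 0; c < ensemble.length; c++) {
            double errors = 0;
            for (int j = 0; j < nn; j++) {
                Instance neighbour = train.instance(nearest[j]);
                if (ensemble[c].classifyInstance(neighbour) != neighbour.classValue()) {
                    errors++;
                }
            }
            if (errors < lowestErrors) {
                lowestErrors = errors;
                selected = c;
            }
        }
        return ensemble[selected].classifyInstance(x);
    }
}
```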

Our main hypothesis was that on data sets where FE improves classification accuracy when a single classifier (such as kNN) is employed, it would also improve classification accuracy when a dynamic integration approach is employed. Conversely, on data sets where FE decreases (or has no effect on) classification accuracy with a single classifier, FE would also decrease (or have no effect on) classification accuracy with a dynamic integration approach.

The results supported our hypothesis and showed that the proposed FEDIC algorithm outperforms the dynamic schemes on plain features only on those data sets on which FE for classification with a single classifier provides better results than classification on the plain features. When we analyzed this dependency further, we came to the conclusion that FE influenced the accuracy of dynamic integration in most cases in the same manner as it influenced the accuracy of the base classifiers.

We conducted further experimental analyses on those data sets on which FEDIC was found to produce significantly more accurate results than DIC. For each data set we compared the behavior of conventional PCA versus class-conditional approaches with respect to DS, DV and DVS, and vice versa, the behavior of integration strategies with respect to FE techniques.

A number of meta-features used to search for the nearest neighbours were compared for the cases with and without FE. We then analyzed how the FE techniques improve the neighbourhood in each data set on average, and found a strong correlation between these results and the generalization accuracy results.


5.8 “Feature extraction for classification in knowledge discovery systems”

Reference: Pechenizkiy, M., Puuronen, S. & Tsymbal, A. 2003. Feature extraction for classification in knowledge discovery systems. In: V. Palade, R. J. Howlett, L. C. Jain (Eds.), Proceedings of the 7th International Conference on Knowledge-Based Intelligent Information and Engineering Systems KES'2003, Lecture Notes in Artificial Intelligence, Vol. 2773, Heidelberg: Springer-Verlag, 526-532.

During recent years DM has evolved from less sophisticated first-generation techniques to today's cutting-edge ones. Currently there is a growing need for next-generation DM systems to manage knowledge discovery applications. These systems should be able to discover knowledge by combining several available techniques and should provide a more automatic environment, or an application envelope, surrounding this highly sophisticated DM engine.

In this paper we considered a decision support system (DSS) approach based on the methodology used in expert systems. The approach is aimed at combining FE techniques with different classification tasks. The main goal of such a system is to automate, as much as possible, the selection of the most suitable FE approach for a certain classification task on a given data set according to a set of criteria.

Although there is a huge number of FE methods (applying linear or nonlinear processing, globally or locally), currently, as far as we know, there is no FE technique that is best for all data sets in the classification task. Thus the adaptive selection of the most suitable FE technique for a given data set is a real challenge. Unfortunately, there exists no canonical knowledge, perfect mathematical model, or other relevant tool for selecting the best extraction technique. Instead, a volume of accumulated empirical findings, some trends, and some dependencies have been discovered.

In order to help manage the DM process by recommending the best-suited FE method and classifier for a given data set, we proposed to take advantage of an experimental approach to the discovery of the relevant knowledge. Pieces of knowledge discovered during the experimental research, in the form of association rules, may save a great amount of time when selecting, or at least properly initialising, the methods' parameters, and moreover when selecting or recommending the most appropriate combination(s) of FE and classification methods.

Thus, potentially, it might be possible to reach a performance close to that of the wrapper-type approach while actually using the filter paradigm, because the methods and their parameters are selected in advance according to a certain set of criteria.

During the pilot studies we did not find a simple correlation-based criterion to separate the situations where an FE technique would be beneficial for the classification. Nevertheless, we found that there exists a trend between the correlation ratio in a data set and the threshold level used in every FE method to control the amount of variation in the data set explained by the selected extracted features. This finding helps in the selection of the initial threshold value as a starting point in the search for the optimal threshold value. However, further research and experiments are required to verify these findings.

5.9 “Data mining strategy selection via empirical and constructive induction”

Reference: Pechenizkiy, M. 2005. Data mining strategy selection via empirical and constructive induction. In: M. H. Hamza (Ed.), Proceedings of the IASTED International Conference on Databases and Applications DBA'05, Calgary: ACTA Press, 59-64.

Recently, several meta-learning approaches have been applied for automatic technique selection by several researchers, but with little success. The goals of this paper were (1) to critically analyze such approaches, (2) to consider their main limitations, (3) to discuss why they were unsuccessful, and (4) to suggest ways for their improvement. We introduce a general framework for DM strategy selection via empirical and constructive induction, which is central to our analysis.

Our aim in proposing this framework was to contribute to knowledge on the problem of selecting a DM strategy for a certain DM problem at hand. Within the framework we proposed a DSS approach that recommends a DM strategy rather than a classifier or any other ML algorithm. The important difference here is that, in constituting a DM strategy, the system searches for the most appropriate ML algorithm together with the most suitable data representation for this algorithm. We believe that a deeper analysis of a limited set of DM techniques (particularly, FE techniques and classifiers) at both the theoretical and experimental levels is more beneficial than applying the meta-learning approach to the whole range of machine learning techniques at once. Combining the theoretical and (semiautomatic) experimental approaches requires the integration of the knowledge produced by a human expert and by the meta-learning approach.

In the framework we considered the constructive induction approach, which may include the FE, feature construction and feature selection processes as means of constructing a relevant representation space.

We considered pairwise comparison of classifiers at the meta-level to be more beneficial than regression and ranking approaches with respect to the contribution to knowledge, since pairwise comparison gives more insight into the advantages and weaknesses of the available algorithms and produces more specific characterizations.

With respect to meta-model construction, we recommended meta-rule extraction and learning by analogy rather than the induction of a meta-decision tree.


The argumentation is straightforward. A decision tree is a form of procedural knowledge: once it has been constructed, it is not easy to update according to changing decision-making conditions. So, if a feature related to a high-level node in the tree is unmeasured (for example, due to time- or cost-consuming processing), the decision tree can produce nothing but probabilistic reasoning. Decision rules, on the contrary, are a form of declarative knowledge. From a set of decision rules it is possible to construct many different, but logically equivalent or nearly equivalent, decision trees. Thus decision rules are a more stable approach to meta-learning than decision trees.

We considered the possibility of conducting experiments on synthetically generated datasets, which allows generating, testing and validating hypotheses on DM strategy selection for a dataset at hand under controlled settings, where some data characteristics are varied while the others are fixed. Besides this, experiments on synthetic datasets allow producing additional instances for the meta-dataset.

5.10 About the joint articles

The present introductory part and Article IV (Pechenizkiy, 2005b) and Article IX (Pechenizkiy, 2005a) have been written solely by the author.

The author of this thesis is the principal author of Article II (Pechenizkiy et al., 2004), Article III (Pechenizkiy et al., 2005d), Article V (Pechenizkiy et al., 2005e), Article VI (Pechenizkiy et al., 2006), Article VII (Pechenizkiy et al., 2005c), and Article VIII (Pechenizkiy et al., 2003). Article I (Tsymbal et al., 2002) was written in close collaboration by the authors. All the included articles have been refereed by at least two international reviewers and published. All the articles except Article II are full-paper refereed; Article II is extended-abstract refereed. Article I, Article IV, Article VIII and Article IX, as well as earlier versions of Article III and Article V, were presented by the author personally at the corresponding conferences.

The software prototype developed within the WEKA machine learning library in Java for the experimental studies, and some of the contents of the experimental sections in the included articles, also represent independent work done by the author. The analysis of the background and the review of related work in the included joint papers (for example, Section 3 in Article VII) were also done mainly by the author.


6 CONCLUSIONS

FE is an important step in the DM/KDD process that can be beneficial for SL in terms of the classification accuracy and the time complexity of model learning and of the classification of new instances (see, for example, Article I).

FE can be considered both as a dimensionality reduction technique and as a technique for constructing a better representation space for further supervised learning. FE can improve the classification accuracy of a model produced by a learner even for datasets with a relatively small number of features.

In this chapter we briefly summarize the main contributions of the thesis with regard to the rigor and relevance of the accomplished research, discuss its limitations, overview the directions for future work, and finally present the challenges for further research.

6.1 Contributions of the thesis

This thesis contributes to the problem of the integration of DM methods in the KDD process. All contributions of the thesis are summarized with respect to the research questions (RQ) stated in Chapter 3:

RQ 1: How important is it to use class information in the FE process? (Section 3.1)

RQ 2: Is FE a data- or hypothesis-driven constructive induction? (Section 3.2)

RQ 3: Is FE for dynamic integration of base-level classifiers useful in a similar way as for a single base-level classifier? (Section 3.3)


RQ 4: Which features – original, extracted or both – are useful for SL? (Section 3.4)

RQ 5: How many extracted features are useful for SL? (Section 3.5)

RQ 6: How to cope with the presence of contextual features in data, and with data heterogeneity? (Section 3.6)

RQ 7: What is the effect of sample reduction on the performance of FE for SL? (Section 3.7)

RQ 8: When is FE useful for SL? (Section 3.8)

RQ 9: Interpretability of the extracted features. (Section 3.9)

RQ 10: How to make a decision about the selection of the appropriate DM strategy (particularly, the selection of FE and SL techniques) for a problem at consideration?

The list of the main contributions of the thesis is also divided into two parts: first, the contributions related to the more theory-based results, and then the contributions related to the use of FE for SL in a knowledge discovery system (KDS). We denote the contribution related to research question RQi as CRQi. For every point we also provide a reference to the corresponding article(s) included in the thesis.

6.1.1. Contributions to the theory  

The results of our experimental studies showed that:

CRQ 1: The use of class information in the FE process is crucial for many datasets. Consequently, class-conditional FE can result in better classification accuracy of a learning model, whereas solely variance-based FE has no effect on the accuracy or deteriorates it. (Articles I and IV)

CRQ 2: The ranking of different FE techniques in line with the corresponding accuracy results of a SL technique can vary a lot across datasets, and different FE techniques also behave differently when integrated with different SL techniques. Thus, the FE process should correspond both to the dataset characteristics and to the type of SL that follows the FE process. (Article IV)

CRQ 3: FE can improve the dynamic integration of classifiers for those datasets where FE improves the accuracy of an instance-based (such as nearest neighbour) classifier. (Article VII)

CRQ 4: The combination of original features with extracted features can be beneficial for SL on some datasets, especially when tree-based inducers like C4.5 are used for classification (Article II). Similarly, combining linear discriminants with a few principal components may result in better classification accuracy (compared with the use of either of these approaches alone) when C4.5 is used. However, this combination is not beneficial for the Naïve Bayes classifier and results in very unstable behaviour on different datasets. (Article III)


6.1.2. Contributions to the practice (use) of FE for SL in a KDS 

First, it is worth recalling that in many of the experimental studies accomplished during the work on this thesis we used, besides artificial and benchmark datasets, real-world datasets from the medical and microbiology domains in order to (1) validate our findings also on dirty real data and (2) contribute to the domain area, primarily by improving the classification accuracy.

The results of our experimental studies showed that:

CRQ 5: The appropriate threshold values used in the FE process to account for the variance explained by the (selected) extracted features vary a lot from one dataset to another. (Articles I and VIII)

CRQ 6: Natural clustering is a very efficient approach in DM that allows building local FE and SL models, which outperform the corresponding global models in classification accuracy while using a smaller number of features for learning (due to the utilization of some background knowledge). (Article V)

CRQ 7: Training sample reduction affects the performance of SL with FE rather differently. In general, nonparametric FE results in similar or better accuracy of a classifier with a smaller number of training instances than parametric FE. Our results showed that when the proportion of training instances used to build the FE and learning models is relatively small, it is important to use an adequate sample reduction technique to select more representative instances for the FE process. (Article VI)

CRQ 8: Our preliminary experimental study showed that, in general, it is hard to predict in advance when (and for which type of dataset) FE is useful to apply with regard to SL, and which extracted features, and how many of them, should be used in supervised learning. (Article VIII)

We also presented our vision of the interpretability of results and the transparency of the learning process with regard to FE as a transformation of the original space.

CRQ 9: Our analysis shows that, depending on the kind of data, the meaning of the original features, the problem at consideration and the supervised learning technique, FE can be both beneficial and harmful in these respects. (Article II)

The results and conclusions from the experimental studies and the further conceptual-analytic research resulted in:

CRQ 10: the construction of a general framework for the selection of the most appropriate DM strategy according to the knowledge about the behaviour (use) of DM techniques and their combinations (that can constitute a DM strategy) on different kinds of datasets. (Article IX)


6.2 Limitations and future work

This section highlights some known limitations of the study and names the main aspects of future work.

6.2.1. Limitations 

In the thesis we considered a limited set of FE and SL techniques, and a limited set of data characteristics and method parameters was analysed.

FE techniques like the PCA transformation are limited by their reliance on second-order statistics: though uncorrelated, the principal components can still be statistically highly dependent (Hyvärinen et al., 2001). Independent component analysis (ICA), which can be seen as an extension of PCA (but uses some form of higher-order statistics, that is, information not contained in the covariance matrix), accounts for this problem.

If the data components have non-linear dependencies, a linear feature transformation will require a higher-dimensional representation than a non-linear technique would find. A number of non-linear implementations of PCA exist (see, for example, Oja, 1997). These and many other existing FE techniques have not been considered in this thesis; however, the same research design can be applied to these groups of techniques.

Most of the study is based on experimental research supported by constructive and theoretical approaches. We believe that stronger connections to the theoretical background of FE and SL techniques could help to make a more significant contribution to the field.

6.2.2. Future work 

We see several directions for further research. On the experimental side, datasets with a significantly higher number of features could be analysed. In this respect, further analysis of random projections, as a means of FE or as a pre-processing step before other FE techniques, may bring interesting findings. Significantly more experiments should be performed on synthetically generated datasets with predefined characteristics.

While in this thesis mainly the accuracy of the approaches was analysed, in further studies it would be interesting to estimate the algorithmic complexity of the different schemes more precisely. Another important and interesting direction is to study further the effect of FE on the transparency of the SL process and the interpretability of SL outcomes.

6.3 Further challenges

In this section we take the risk of anticipating the main future interests and challenges related to the topic of the thesis. Our strong belief is that the relevance of DM research should be taken more seriously, so that the rigor and relevance of research are well balanced.

From a practical point of view it is important to provide useful tools for DM practitioners. Therefore, the most challenging goals of further research in the area are likely to be related to the construction of a decision support system for DM strategy recommendation. The application of the meta-learning approach to the discovery of pieces of knowledge about the behaviour of different DM strategies on different types of data (Pechenizkiy, 2005b) would perhaps be the first step in addressing this challenge. However, knowledge management issues will naturally appear, including knowledge representation, organization, storage, and the processes of continuous distribution, integration (from multiple types of sources: DM experts and practitioners, results from laboratory experiments on synthetic datasets and from field experiments on real-world problems), and refinement. Besides this, research and business communities, or similar KDSs themselves, can organize different so-called trusted networks, where participants are motivated to share their knowledge. We tried to highlight these challenges in Pechenizkiy et al. (2005b).


REFERENCES

Achlioptas, D. 2001. Database-friendly random projections. In: P. Buneman (Ed.), Proceedings of the twentieth ACM SIGMOD-SIGACT-SIGART symposium on Principles of database systems, New York: ACM Press, 274- 281.

Aha, D., Kibler, D. & Albert, M. 1991. Instance-based learning algorithms. Machine Learning 6, 37-66.

Aivazyan, S. 1989. Applied statistics: classification and dimension reduction. Moscow: Finance and Statistics.

Aladjem, M. 1994. Multiclass discriminant mappings. Signal Processing 35(1), 1-18.

Almoy, T. 1996. A simulation study on the comparison of prediction methods when only a few components are relevant. Computational Statistics and Data Analysis 21(1), 87-107.

Arciszewski, T., Michalski, R. & Wnek, J. 1995. Constructive Induction: the Key to Design Creativity. In Proceedings of the 3rd International Round-Table Conference on Computational Models of Creative Design, Queensland, Australia, 397-425.

Bellman, R. 1961. Adaptive Control Processes: A Guided Tour, Princeton, Princeton University Press.

Bingham, E. & Mannila, H. 2001. Random projection in dimensionality reduction: applications to image and text data. Proceedings of the 7th ACM SIGKDD international conference on Knowledge discovery and data mining, ACM Press, San Francisco, California

Blake, C. & Merz, C. 1998. UCI repository of machine learning databases. Dept. of Information and Computer Science, University of California, Irvine, CA. http://www.ics.uci.edu/~mlearn/MLRepository.html

Bloedorn, E., Wnek, J. & Michalski, R. 1993. Multistrategy Constructive Induction: AQ17-MCI, Reports of the Machine Learning and Inference Laboratory, MLI 93-4, School of Information Technology and Engineering, George Mason University.

Breiman, L. 2001. Random Forests. Machine Learning 45(1), 5-32.

Brunk, C., Kelly, J. & Kohavi, R. 1997. MineSet: an integrated system for data mining. In D. Heckerman, H. Mannila, D. Pregibon (Eds.) Proceedings of the 3rd International Conference on Knowledge Discovery and Data Mining (KDD-97), AAAI Press, California, 135-138.

Chan, P. & Stolfo, S., 1997. On the accuracy of meta-learning for scalable data mining. Intelligent Information Systems, 8, 5-28.

Clementine User Guide, Version 5. 1998. Integral Solutions Limited.

Cost, S. & Salzberg, S. 1993. A weighted nearest neighbor algorithm for learning with symbolic features. Machine Learning 10(1), 57-78.

CRISP-DM. 2004. Cross Industry Standard Process for Data Mining; see www.crisp-dm.org.


Dasgupta, S. 1999. Learning Mixtures of Gaussians, Proceedings of 40th Annual Symposium on Foundations of Computer Science, 634.

Dasgupta, S. 2000. Experiments with Random Projection, Proceedings of 16th Conference on Uncertainty in Artificial Intelligence, 143-151.

Dasgupta, S. & Gupta, A. 2003. An elementary proof of a theorem of Johnson and Lindenstrauss. Random Structures and Algorithms, 22(1), 60-65.

Deerwester, S., Dumais, S., Furnas, G., Landauer, T. & Harshman, R. 1990. Indexing by latent semantic analysis. Journal of the American Society for Information Science, 41(6), 391-407.

Diamantras, K. & Kung S. 1996. Principal Component Neural Networks. John Wiley & Sons.

Dietterich, T. 1997. Machine learning research: four current directions, AI Magazine 18(4), 97-136.

Dietterich, T. 1998. Approximate statistical tests for comparing supervised classification learning algorithms. Neural Computation, 10 (7), 1895-1923.

Domingos, P. & Pazzani, M. 1996. Beyond independence: conditions for the optimality of the simple Bayesian classifier. In L. Saitta (Ed.) Proceedings of 13th International Conference on Machine Learning. San Francisco, CA: Morgan Kaufmann, 105-112.

Domingos, P. & Pazzani, M. 1997. On the optimality of the simple Bayesian classifier under zero-one loss. Machine Learning 29 (2,3), 103-130.

Duda, R., Hart, P. & Stork, D. 2001. Pattern classification, 2nd edition. Wiley, New York.

Eklund, P. 1999. Comparative study of public domain supervised machine-learning accuracy on the UCI database. In B. Dasarathy (Ed.) Data mining and knowledge discovery: theory, tools, and technology. Proceedings of SPIE, Vol. 3695. Bellingham, WA: SPIE, 39-50.

Fayyad, U. 1996. Data Mining and Knowledge Discovery: Making Sense Out of Data, IEEE Expert 11(5), 20-25.

Fayyad, U., Grinstein G. & Wierser A. 2001. Information Visualization in Data Mining and Knowledge Discovery. San Diego, Morgan Kaufmann.

Fayyad, U. & Uthurusamy, R. 2002. Evolving data into mining solutions for insights. Communications of the ACM 45(8), 28-31.

Fisher, R. 1936. The Use of Multiple Measurements in Taxonomic Problems, Annals of Eugenics, 7(2), 179-188.

Fradkin, D. & Madigan, D. 2003. Experiments with random projections for machine learning. Proceedings of the ninth ACM SIGKDD international conference on Knowledge discovery and data mining, ACM Press, Washington D.C., 517-522.

Friedman, J. 1997. On bias, variance, 0/1-loss, and the curse of dimensionality. Data Mining and Knowledge Discovery 1 (1), 55-77.

Fukunaga, K. 1990. Introduction to statistical pattern recognition. 2nd Edition, New York, Academic Press.

Gaede, V. & Günther, O. 1998. Multidimensional access methods, ACM Comput. Surv. 30 (2), 170-231.


Gama, J. 1999. Combining classification algorithms. Dept. of Computer Science, University of Porto, Portugal. PhD thesis.

Grossman, R., Hornick, M. & Meyer, G. 2002. Data mining standards initiatives, Communications of the ACM, 45(8), 59-61.

Hall, P. & Li, K. 1993. On Almost Linearity of Low Dimensional Projections From High Dimensional Data, The Annals of Statistics, 21(2), 867-889.

Ho, T. 1998. The random subspace method for constructing decision forests. IEEE Transactions on Pattern Analysis and Machine Intelligence, 20(8), 832–844.

Hughes, G. 1968. On the mean accuracy of statistical pattern recognizers. IEEE Transactions on Information Theory, 14(1), 55-63.

Hyvärinen, A., Karhunen, J. & Oja, E. 2001. Independent Component Analysis. New York: John Wiley & Sons, Inc.

Iivari, J., Hirscheim, R. & Klein, H. 1999. A paradigmatic analysis contrasting information systems development approaches and methodologies, Information Systems Research 9(2), 164-193.

Imielinski, T. & Mannila, H. 1996. A database perspective on knowledge discovery. Communications of the ACM, 39(11), 58-64.

Indyk, P. & Motwani, R. 1998. Approximate nearest neighbors: towards removing the curse of dimensionality, Proceedings of the 30th ACM symposium on Theory of Computing, 604-613.

Ives, B., Hamilton, S. & Davis, G. 1980. A Framework for Research in Computer-based Management Information Systems, Management Science 26(9), 910-934.

Jimenez, L. & Landgrebe, D. 1995. High dimensional feature reduction via projection pursuit. School of electrical and computer engineering, Purdue University, TR-ECE 96-5.

John, G. 1997. Enhancements to the data mining process. Dept. of Computer Science, Stanford University, Stanford, USA. PhD Thesis.

Jolliffe, I. 1986. Principal Component Analysis. New York: Springer.

JSR. 2004. Java Specification Request 73; also available from http://jcp.org/en/jsr/detail?id=073.

Kalousis, A. 2002. Algorithm Selection via Meta-Learning. University of Geneve, Department of Computer Science. PhD Thesis.

Kiang, M. 2003. A comparative assessment of classification methods, Decision Support Systems 35, 441-454.

Kleinberg, J. 1997. Two algorithms for nearest-neighbor search in high dimensions, Proceedings of the 29th ACM symposium on Theory of Computing, 599-608.

Kleinberg, E. 2000. On the algorithmic implementation of stochastic discrimination. IEEE Transactions on PAMI 22 (5), 473–490

Kohavi, R. 1995a. A study of cross-validation and bootstrap for accuracy estimation and model selection. In C. Mellish (Ed.), Proceedings of 14th International Joint Conference on Artificial Intelligence IJCAI-95, San Francisco, Morgan Kaufmann, 1137-1145.


Kohavi, R. 1995b. Wrappers for performance enhancement and oblivious decision graphs. Dept. of Computer Science, Stanford University, Stanford, USA. PhD Thesis.

Kohavi, R. & John, G. 1998. The wrapper approach. In: H. Liu & H. Motoda (Eds.) Feature Selection for Knowledge Discovery and Data Mining, Kluwer Academic Publishers, 33-50.

Kohavi, R., Sommerfield, D. & Dougherty, J. 1996. Data mining using MLC++: a machine learning library in C++. In M. Radle (Ed.) Proceedings of 8th IEEE Conference on Tools with Artificial Intelligence. Los Alamitos: IEEE CS Press, 234-245.

Kononenko, I. 1993. Inductive and Bayesian learning in medical diagnosis. Applied Artificial Intelligence 7(4), 317-337.

Kurimo, M. 1999. Indexing audio documents by using latent semantic analysis and SOM. In: E. Oja & S. Kaski, (Eds.), Kohonen Maps, Elsevier, 363-374.

Liu, H. 1998. Feature Extraction, Construction and Selection: A Data Mining Perspective, Boston, Kluwer Academic Publishers.

Liu, H., Motoda H. & Yu L. 2004. A selective sampling approach to active feature selection, Artificial Intelligence 159(1-2), 49-74.

Melton, J. & Eisenberg, A. 2001. SQL Multimedia and Application Packages (SQL/MM). ACM SIGMOD Record 30(4), 97-102.

Merz, C. 1998. Classification and regression by combining models, Dept. of Information and Computer Science, University of California, Irvine, USA, PhD Thesis.

Michalski, R. 1997. Seeking Knowledge in the Deluge of Facts, Fundamenta Informaticae 30, 283-297.

Mitchell, T. 1997. Machine Learning. McGraw-Hill.

Nunamaker, W., Chen, M. & Purdin, T. 1990-91. Systems development in information systems research, Journal of Management Information Systems 7(3), 89-106.

Oja, E. 1997. The nonlinear PCA learning rule in independent component analysis. Neurocomputing, 17(1), 25-46.

OLE DB. 2004. OLE DB for Data Mining Specification 1.0. Microsoft; www.microsoft.com/data/oledb/default.htm.

Opitz, D. & Maclin, D. 1999. Popular ensemble methods: an empirical study. Journal of Artificial Intelligence Research 11, 169-198.

Oza, N. & Tumer, K. 1999. Dimensionality reduction through classifier ensembles. Computational Sciences Division, NASA Ames Research Center, Moffett Field, CA. TR NASA-ARC-IC-1999-124.

Papadimitriou, C., Tamaki, H., Raghavan, P. & Vempala, S. 1998. Latent semantic indexing: a probabilistic analysis, Proceedings of 17th ACM SIGACT-SIGMOD-SIGART symposium on Principles of Database Systems, 159-168.

Pechenizkiy, M. 2005a. Data mining strategy selection via empirical and constructive induction. In: M.H. Hamza (Ed.) Proceedings of the IASTED International Conference on Databases and Applications DBA’05, ACTA Press, 59-64.

Pechenizkiy, M. 2005b. The Impact of the Feature Extraction on the Performance of a Classifier: kNN, Naïve Bayes and C4.5. In: B. Kegl & G. Lapalme (Eds.): Proceedings of 18th CSCSI Conference on Artificial Intelligence AI’05, LNAI 3501, Heidelberg: Springer-Verlag, 268-279.

Pechenizkiy, M., Puuronen S. & Tsymbal, A. 2003a. Feature extraction for classification in knowledge discovery systems, In: V.Palade, R.Howlett, L.Jain (Eds.), Proceedings of 7th International Conference on Knowledge-Based Intelligent Information and Engineering Systems KES’2003, LNAI, Vol.2773, Heidelberg: Springer-Verlag, 526-532.

Pechenizkiy, M., Puuronen, S. & Tsymbal, A. 2003b. Feature Extraction for Classification in the Data Mining Process. International Journal on Information Theories and Applications 10(1), Sofia, FOI-Commerce, 321-329.

Pechenizkiy, M., Puuronen, S. & Tsymbal, A. 2005a. On the Use of Information Systems Research Methods in Data Mining. In: O.Vasilecas et al. (Eds.) Proceedings of 13th International Conference on Information Systems Development: Advances in Theory, Practice and Education ISD’04, Springer, 487-499.

Pechenizkiy, M., Puuronen, S. & Tsymbal, A. 2006. The Impact of Sample Reduction on PCA-based Feature Extraction for Supervised Learning. (to appear) In: H. Haddad et al. (Eds.), Proceedings of 21st ACM Symposium on Applied Computing (SAC’06, Data Mining Track), ACM Press.

Pechenizkiy, M., Tsymbal, A. & Puuronen S. 2004. PCA-based feature transformation for classification: issues in medical diagnostics, In: R. Long et al. (Eds.), Proceedings of 17th IEEE Symposium on Computer-Based Medical Systems CBMS’2004, Bethesda, MD, IEEE CS Press, 2004, 535-540.

Pechenizkiy, M., Tsymbal, A. & Puuronen, S. 2005b. Knowledge Management Challenges in Knowledge Discovery Systems. In: IEEE Workshop Proceedings of DEXA’05, 6th Int. Workshop on Theory and Applications of Knowledge Management TAKMA’05, IEEE CS Press, 433-437.

Pechenizkiy, M., Tsymbal, A. & Puuronen, S. 2005c. Local Dimensionality Reduction within Natural Clusters for Medical Data Analysis. In: P. Cunningham & A. Tsymbal (Eds.), Proceedings of 18th IEEE International Symposium on Computer-Based Medical Systems CBMS’2005, Los Alamitos, CA: IEEE CS Press, 365-370.

Pechenizkiy, M., Tsymbal, A. & Puuronen, S. 2005d. On Combining Principal Components with Fisher’s Linear Discriminants for Supervised Learning. (submitted to) Special Issue of Foundations of Computing and Decision Sciences “Data Mining and Knowledge Discovery” (as extended version of Pechenizkiy et al., 2005e).

Pechenizkiy, M., Tsymbal, A. & Puuronen, S. 2005e. On Combining Principal Components with Parametric LDA-based Feature Extraction for Supervised Learning. In: T. Morzy et al. (Eds.), Proceedings of 1st ADBIS Workshop on Data Mining and Knowledge Discovery, ADMKD’05, Tallinn, Estonia, 47-56.

Pechenizkiy, M., Tsymbal, A. & Puuronen, S. 2005f. Supervised Learning and Local Dimensionality Reduction within Natural Clusters: Biomedical Data Analysis, (submitted to) IEEE Transactions on Information Technology in Biomedicine, Special Post-conference Issue "Mining Biomedical Data" (as extended version of Pechenizkiy et al., 2005c).

Pechenizkiy, M., Tsymbal, A., Puuronen, S. & Patterson, D. 2005g. Feature Extraction for Dynamic Integration of Classifiers, (submitted to) Fundamenta Informaticae, IOS Press.

Pechenizkiy, M., Tsymbal, A., Puuronen, S., Shifrin, M. & Alexandrova, I. 2005h. Knowledge Discovery from Microbiology Data: Many-sided Analysis of Antibiotic Resistance in Nosocomial Infections. In: K. Althoff et al. (Eds.) Post-Conference Proceedings of 3rd Conference on Professional Knowledge Management: Experiences and Visions, LNAI 3782, Heidelberg: Springer-Verlag, 360-372.

Piatetsky-Shapiro, G. 2000. Knowledge Discovery in Databases: 10 years after. SIGKDD Explorations 1(2), 59-61.

PMML. 2004. Predictive Model Markup Language. Data Mining Group, see www.dmg.org.

Popelinsky, L. 2001. Combining the Principal Components Method with Different Learning Algorithms. In Proceedings of 12th European Conference on Machine Learning, Workshop on Integrating Aspects of Data Mining, Decision Support and Meta-Learning.

Puuronen, S., Terziyan, V. & Tsymbal, A. 1999. A dynamic integration algorithm for an ensemble of classifiers. In Z.W.Ras & A.Skowron (Eds.) Foundations of intelligent systems: 11th International Symposium ISMIS’99, Warsaw, Poland. LNAI 1609. Berlin: Springer, 592-600.

Quinlan, J. 1993. C4.5 Programs for Machine Learning. San Mateo CA: Morgan Kaufmann.

Quinlan, J. 1996. Bagging, boosting, and C4.5. In Lenz (Ed.) Proceedings of the 13th National Conference on Artificial Intelligence, AAAI-96, New York, NY: Springer-Verlag, AAAI Press, 725-730.

Salzberg, S. 1999. On comparing classifiers: a critique of current research and methods. Data Mining and Knowledge Discovery 1, 1-12.

Schaffer, C. 1993. Selecting a classification method by cross-validation, Machine Learning 13, 135-143.

Skurichina, M. & Duin, R. 2001. Bagging and the random subspace method for redundant feature spaces, in: J. Kittler, F. Roli (Eds.), Proceedings of 2nd International Workshop on Multiple Classifier Systems MCS’01, Cambridge, UK, 1–10.

Thrun, S., Bala, J., Bloedorn, E., et al. 1991. The MONK’s problems – a performance comparison of different learning algorithms. Carnegie Mellon University, Pittsburg PA. Technical report CS-CMU-91-197.


Tkach, D. 1998. Information Mining with the IBM Intelligent Miner Family. An IBM Software Solutions White Paper.

Tsymbal, A. 2002. Dynamic Integration of Data Mining Methods in Knowledge Discovery Systems, Jyväskylä, University of Jyväskylä. PhD Thesis.

Tsymbal, A., Puuronen, S. & Skrypnyk, I. 2001. Ensemble feature selection with dynamic integration of classifiers. In International ICSC Congress on Computational Intelligence Methods and Applications, 558-564.

Tsymbal, A., Pechenizkiy, M., Puuronen, S. & Patterson, D. 2003. Dynamic integration of classifiers in the space of principal components, In: L.Kalinichenko, et al. (Eds.), Proceedings of Advances in Databases and Information Systems: 7th East-European Conference ADBIS'03, LNCS, Vol. 2798, Heidelberg: Springer-Verlag, 278-292.

Tsymbal, A., Puuronen, S., Pechenizkiy, M., Baumgarten, M. & Patterson, D. 2002. Eigenvector-based Feature Extraction for Classification. In: S.M. Haller, G. Simmons (Eds.), Proceedings of 15th International FLAIRS Conference on Artificial Intelligence, AAAI Press, 354-358.

Turney, P. 1996. The management of context-sensitive features: A review of strategies. In: Proceedings of Workshop on Learning in Context-Sensitive Domains at the 13th International Conference on Machine Learning, 60-66.

Vijayakumar, S. & Schaal, S. 1997. Local Dimensionality Reduction for Locally Weighted Learning, Proceedings of IEEE International Symposium on Computational Intelligence in Robotics and Automation, 220-225.

Weingessel, A. & Hornik, K. 1998. Local PCA Algorithms. IEEE Transactions on Neural Networks 8(5), 1208-1211.

WEKA 3. 2004. Data Mining Software in Java. Also available from http://www.cs.waikato.ac.nz/ml/weka/

Witten, I. & Frank, E. 2000. Data Mining: Practical machine learning tools with Java implementations, San Francisco: Morgan Kaufmann.

Wolpert, D. & MacReady, W. 1996. No free lunch theorems for optimization. IEEE Transactions on Evolutionary Computation 1(1), 67-82.

Zorman, M., Eich, H., Kokol, P. & Ohmann, C. 2001. Comparison of Three Databases with a Decision Tree Approach in the Medical Field of Acute Appendicitis, In: Patel et al. (Eds.), Proceedings of 10th World Congress on Health and Medical Informatics, Vol.2, Amsterdam: IOS Press, 1414-1418.


APPENDIX A. DATASETS USED IN THE EXPERIMENTS

The majority of the datasets used in the experiments were taken from the University of California at Irvine Machine Learning Repository (Blake & Merz, 1998). The Acute Abdominal Pain (AAP) datasets were provided by the Laboratory for System Design, Faculty of Electrical Engineering and Computer Science, University of Maribor, Slovenia, and the Theoretical Surgery Unit, Dept. of General and Trauma Surgery, Heinrich-Heine University Düsseldorf, Germany (Zorman et al., 2001). The Antibiotic Resistance dataset was collected in the Hospital of N.N. Burdenko Institute of Neurosurgery, Moscow, Russia. The main characteristics of the datasets used in the experiments throughout the thesis are presented in Table 1.

The table includes the name of the dataset, the number of instances included in the dataset, the number of different classes of instances, and the numbers of different kinds of features included in the instances.

The Acute Abdominal Pain (AAP) datasets represent the problem of separating acute appendicitis (class “appendicitis”), which is a special problem of acute abdominal pain, from other diseases that cause acute abdominal pain (class “other diagnoses”). The early and accurate diagnosis of acute appendicitis is still a difficult and challenging problem in everyday clinical routine. There are three datasets of increasing size: (1) Small-AAP I; (2) Medium-AAP II; and (3) Large-AAP III, with 1254, 2286, and 4020 instances, respectively (Zorman et al., 2001). The data for AAP I were collected from 6 surgical departments in Germany, for AAP II from 14 centers in Germany, and for AAP III from 16 centers in Central and Eastern Europe. Each dataset includes 18 features from history-taking and clinical examination (Zorman et al., 2001). These features are standardized by the World Organization of Gastroenterology (OMGE).

The Antibiotic Resistance dataset was collected in the Hospital of N.N. Burdenko Institute of Neurosurgery, Moscow, Russia, using the analyzer Vitek-60 (developed by bioMérieux, www.biomerieux.com) over the years 1997-2003 and the information systems Microbiologist (developed by the Medical Informatics Lab of the institute) and Microbe (developed by the Russian company MedProject-3). Each instance of the data used in the analysis represents one sensitivity test and contains the following features: the pathogen isolated during the bacterium identification analysis, the antibiotic used in the sensitivity test, and the result of the sensitivity test itself (sensitive S, resistant R, or intermediate I), obtained from Vitek according to the guidelines of the National Committee for Clinical Laboratory Standards (NCCLS). The information about a sensitivity test is connected with the patient, his or her demographic data (sex, age), and the hospitalization in the Institute (main department, days spent in the ICU, days spent in the hospital before the test, etc.). Each bacterium in a sensitivity test in the database is isolated from a single specimen that may be blood, liquor, urine, etc. In this study we focus on the analysis of meningitis cases only, where the specimen is liquor. For this purpose we selected 4430 instances of sensitivity tests related to the meningitis cases of the period January 2002 – August 2004. We introduced 5 grouping binary features for pathogens and 15 binary features for antibiotics; these binary features represent a hierarchical grouping of pathogens and antibiotics into 5 and 15 categories, respectively. Thus, each instance in the dataset had 34 features that included the information corresponding to a single sensitivity test, augmented with the data concerning the used antibiotic, the isolated pathogen, the sensitivity test result, and the clinical features of the patient and his/her demographics.

TABLE 1 Basic characteristics of the datasets

Dataset                    Instances  Classes  Categorical  Numerical  Total  Total*
Acute Abdominal Pain I          1251        2           17          1     18     89
Acute Abdominal Pain II         2279        2           17          1     18     89
Acute Abdominal Pain III        4020        2           17          1     18     89
Antibiotic Resistance           4430        3           28          6     34     47
Balance                          625        3            0          4      4      3
Breast Cancer Ljubljana          286        2            9          0      9     38
Car Evaluation                  1728        4            6          0      6     21
Pima Indians Diabetes            768        2            0          8      8      8
Glass Recognition                214        6            0          9      9      9
Heart Disease                    270        2            8          5      5     13
Ionosphere                       351        2            0         34     34     33
Iris Plants                      150        3            0          4      4      4
Kr-vs-kp                        3196        2           36          0     36     38
LED                              300       10            7          0      7      7
LED17                            300       10           24          0     24     24
Liver Disorders                  345        2            0          6      6      6
Lymphography                     148        4           15          3     18     36
MONK-1                           432        2            6          0      0     15
MONK-2                           432        2            6          0      0     15
MONK-3                           432        2            6          0      0     15
Soybean                           47        4            0         35     35     35
Thyroid                          215        3            0          5      5      5
Tic-Tac-Toe Endgame              958        2            9          0      9     27
Vehicle                          846        4            0         18     18     18
Voting                           435        2           16          0     16     17
Zoo                              101        7           16          0     16     17

* when categorical features are binarized.

The Balance dataset was generated to model psychological experimental results. Each example is classified as having the balance scale tip to the right, tip to the left, or remain balanced. The attributes are the left weight, the left distance, the right weight, and the right distance. The correct way to find the class is to compare (left-distance * left-weight) with (right-distance * right-weight): the greater product determines the side, and if they are equal, the scale is balanced.
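A purely illustrative sketch of this rule (ours; the label strings are hypothetical names):

    def balance_class(left_dist, left_weight, right_dist, right_weight):
        # Compare the two products, as described above.
        left = left_dist * left_weight
        right = right_dist * right_weight
        if left > right:
            return "tip to the left"
        if right > left:
            return "tip to the right"
        return "balanced"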

In the Breast Cancer Ljubljana dataset the task is to determine whether breast cancer will or will not recur. The data were originally obtained from the University Medical Centre, Institute of Oncology, Ljubljana, Yugoslavia.

The Car Evaluation dataset was derived from a simple hierarchical decision model that evaluates cars according to a concept structure. The Car Evaluation dataset contains examples with the structural information removed, i.e., directly relates a car to the six input attributes: buying, maint, doors, persons, lug_boot, and safety. The four classes are “unacceptable”, “acceptable”, “good”, and “very good”.

The task for the Pima Indians Diabetes dataset is to determine whether the patient shows signs of diabetes according to World Health Organization criteria. There are eight continuous features: number of times pregnant, plasma glucose concentration, diastolic blood pressure, triceps skin fold thickness, 2-hour serum insulin, body mass index, diabetes pedigree function, and age.

The DNA dataset is drawn from the field of molecular biology. Splice junctions are points on a DNA sequence at which “superfluous” DNA is removed during protein creation. The task is to recognize exon/intron boundaries, referred to as EI sites; intron/exon boundaries, referred to as IE sites; or neither. The features provide a window of 60 nucleotides. The classification is the middle point of the window, thus providing 30 nucleotides at each side of the junction.

In the Glass Recognition dataset the task is to identify which one of the six types of glass is present from the chemical elements in a sample.

The task for the Heart Disease dataset is to distinguish the presence or absence of heart disease in patients. The features include: age, sex, chest pain type, resting blood pressure, fasting blood sugar, max heart rate, etc.

The Ionosphere dataset includes radar data that was collected by a system in Goose Bay, Labrador. This system consists of a phased array of 16 high-frequency antennas with a total transmitted power on the order of 6.4 kilowatts. The targets were free electrons in the ionosphere. "Good" radar returns are those showing evidence of some type of structure in the ionosphere. "Bad" returns are those that do not; their signals pass through the ionosphere. Received signals were processed using an autocorrelation function whose arguments are the time of a pulse and the pulse number. There were 17 pulse numbers for the Goose Bay system. Instances in this dataset are described by 2 attributes per pulse number, corresponding to the complex values returned by the function resulting from the complex electromagnetic signal.

The Iris Plants dataset created by R.A. Fisher is perhaps the best known database in the machine learning literature. The task is to classify iris plants into one of three iris plants varieties: Iris Setosa, Iris Versicolour, and Iris Virginica. This is an exceedingly simple domain and very low error rates have been reached already long ago.


The Kr-vs-kp dataset represents one classical type of chess end-game – King with Rook versus King with Pawn on a7 and usually abbreviated as KRKPA7. The pawn on a7 means it is one square away from becoming a queen. It is the King with Rook's side (white) to move. White is deemed to be unable to win if the Black pawn can safely advance.

The LED dataset contains data about the LED display problem, where the goal is to learn to recognize decimal digits having information about whether the seven corresponding LED segments are on or off. The LED 17 dataset represents an extension of the LED display problem, with an additional 17 irrelevant attributes being added to the instance space. These attributes are randomly assigned the values of 0 or 1.

The Liver Disorders dataset was created by BUPA Medical Research Ltd, and the task is to predict liver disorders that might arise from excessive alcohol consumption.

The Lymphography dataset was obtained from the University Medical Centre, Institute of Oncology, Ljubljana, Yugoslavia. There are 15 categorical and 3 numerical attributes, and the classes being predicted are: “normal find”, “metastases”, “malign lymph”, and “fibrosis”.

The MONK’s problems are a collection of three artificial binary classification problems over the same six-attribute discrete domain (a1,…,a6). All MONK’s datasets contain 432 instances without missing values, representing the full truth tables in the space of the attributes. The ”true” concepts MONK-1, MONK-2, and MONK-3 underlying each MONK’s problem are given by: (a1=a2)or(a5=1) for MONK-1, exactly two of {a1=1, a2=1, a3=1, a4=1, a5=1, a6=1} for MONK-2, and (a5=3 and a4=1)or(a5<>4 and a2<>3) for MONK-3. MONK-3 has 5% additional noise (misclassifications) in the training set. The MONK’s problems were the basis of the first international comparison of learning algorithms (Thrun et al., 1991).
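Since the MONK’s attribute space is small, the full truth tables can be generated directly; the following sketch (ours; the attribute value ranges follow the standard MONK’s definition, and the 5% noise of MONK-3 is not modeled) illustrates the three concepts:

    from itertools import product

    def monk1(a): return a[0] == a[1] or a[4] == 1
    def monk2(a): return sum(v == 1 for v in a) == 2
    def monk3(a): return (a[4] == 3 and a[3] == 1) or (a[4] != 4 and a[1] != 3)

    # Attribute domains: a1, a2, a4 in {1,2,3}; a3, a6 in {1,2}; a5 in {1,2,3,4}.
    domains = [range(1, 4), range(1, 4), range(1, 3),
               range(1, 4), range(1, 5), range(1, 3)]
    instances = list(product(*domains))       # 3*3*2*3*4*2 = 432 instances
    labels_monk1 = [monk1(a) for a in instances]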

The Soybean dataset includes data about the soybean disease diagnosis. This is a small subset of the original Soybean-large database. There are 35 numerical attributes, and 4 classes, representing soybean diseases.

In the Thyroid dataset, five laboratory tests are used to try to predict whether a patient's thyroid is in the class “euthyroidism”, “hypothyroidism” or “hyperthyroidism”. The diagnosis (the class label) is based on a complete medical record, including anamnesis, scan etc.

The Tic-Tac-Toe Endgame dataset encodes the complete set of possible board configurations at the end of tic-tac-toe games, where ”x” is assumed to have played first. The target concept is ”win for x” (i.e., true when ”x” has one of 8 possible ways to create a ”three-in-a-row”). The dataset contains 958 instances without missing values, each with 9 attributes, corresponding to tic-tac-toe squares and taking on 1 of 3 possible values: ”x”, ”o”, and ”empty”.

In the Vehicle dataset, the goal is to classify a given silhouette as one of four types of vehicles (“Opel”, “Saab”, “Bus”, and “Van”), using a set of 18 numerical features extracted from the silhouette. The vehicle may be viewed from one of many different angles. This dataset comes from the Turing Institute, Glasgow, Scotland.

The Voting dataset includes the votes of each of the U.S. House of Representatives Congressmen on the 16 key votes identified by the Congressional Quarterly Almanac in 1984. The goal is to build a classification model that predicts whether the voting congressman is a democrat or a republican.

Zoo is a simple dataset created by Richard S. Forsyth, with instances containing 17 Boolean-valued attributes and representing 7 types of animals.

A survey of widely used learning algorithms (decision trees, neural networks, and rule-based classifiers) on twenty-nine datasets from the UCI machine learning repository is given in Eklund (1999). This survey connects the properties of the examined datasets with the selection of learning algorithms. Salzberg (1999) strongly criticized the use of datasets from the UCI repository, because in his opinion it is difficult to produce major new results using well-studied and widely shared data; hence, a new “significant” finding may turn out to be “a statistical accident”. However, we would prefer to interpret this message as a caution to be careful when stating final conclusions if research has been conducted only on such benchmarks.


YHTEENVETO (FINNISH SUMMARY)

Data mining aims to reveal regularities contained in the data stored in a database whose existence is not yet known. If the data in the database are highly multidimensional, containing numerous features, the performance of many machine learning methods deteriorates decisively. This phenomenon is called the “curse of dimensionality”, as it often leads to growth in both computational complexity and classification error. On the other hand, irrelevant or only indirectly relevant features possibly contained in the database provide a poor representation space for describing the concept structure of the database.

The goal of the research is to develop the theoretical background and the practical implementation of the extraction of features describing the data. This feature extraction aims either at reducing dimensionality or at improving the representation space (or both) for the needs of supervised machine learning. The work applies conventional principal component analysis and two analysis methods that utilize class-related information. The examination covers both the construction of a base-level classifier and an ensemble of classifiers, from the viewpoint of the integration of the classifiers it contains. The theoretical foundation of the examination is formed by the research areas of data mining, machine learning, and pattern recognition. The software built for the experimental part of the research was implemented on top of the open-source machine learning platform implemented in Java (WEKA).

The thesis consists of separate articles and a summary building on them, in which the results of the research are collected under the stated research problems (TOi). Of the research problems, the first five are more theory-oriented and the following five more practice-oriented. This summary presents both the research problem and the result of the work, problem by problem (in the text, a dataset means a set of cases sharing the same feature structure):

TO1: How important is the use of class information in the feature extraction process? The study found class information to be crucially important for many datasets. A feature extraction process that utilizes class-related information was found to be able to lead to a more accurate classifier for those datasets for which a purely variance-based feature extraction process does not affect classifier accuracy or weakens it.

TO2: Is feature extraction data-driven or hypothesis-driven constructive induction? The study found that the feature extraction process should suit both the dataset and the supervised learning method, since the mutual ranking of the feature extraction processes varied greatly with respect to both.

TO3: Is feature extraction useful in connection with the dynamic integration of an ensemble of classifiers, as it is in the case of individual classifiers? It was found that feature extraction can lead to improved accuracy in connection with dynamic integration for those datasets for which feature extraction leads to improved accuracy in connection with an individual instance-based classifier.

TO4: Are the original features, the extracted features, or both useful in supervised learning? A combination of the original and the extracted features can be beneficial in supervised learning for some datasets. The usefulness was observed especially with decision-tree-type classifiers such as C4.5. Combining a few principal components in support of linear discriminants was found to be able to lead to better classification accuracy with a C4.5-type classifier. At the same time, however, it was found that such combining does not pay off with a Bayes-type classifier, because in that context it leads to unstable classifier behaviour.

TO5: How many extracted features are useful in machine learning? The threshold values used in the feature extraction process for the variance explained by the features varied greatly from one dataset to another.

TO6: How to cope when the data contains context-describing features and heterogeneity of the problem space? Natural clustering proved to be a very effective approach for constructing local solutions in feature extraction and supervised machine learning.

TO7: How does sample reduction affect feature extraction for supervised machine learning? In general, nonparametric feature extraction leads to the same or better accuracy with a smaller number of training instances than parametric feature extraction. The results showed that when the proportion of training instances used for feature extraction and classifier construction is relatively small, it is important to use a suitable sample reduction technique so that the most representative cases are selected for the feature extraction process.

TO8: When is feature extraction for supervised machine learning practical? On the basis of preliminary results, it appears difficult to predict when feature extraction can be useful in connection with supervised machine learning. It is likewise difficult to predict which of the extracted features, and how many of them, should be used in supervised machine learning.

TO9: How to interpret the extracted features? On the basis of the study, it appears that, depending on the dataset being processed, the meaning of the original features, the problem under examination, and the supervised machine learning technique, feature extraction can be either beneficial or harmful from the viewpoint of interpretation.

TO10: How to select the combination of a feature extraction technique and a supervised machine learning technique for the problem at hand? The thesis presents a general framework for selecting the most suitable data mining strategy by collecting experiences of the use of data mining techniques and their combinations on different kinds of datasets.


I

EIGENVECTOR-BASED FEATURE EXTRACTION FOR CLASSIFICATION

Tsymbal A., Puuronen S., Pechenizkiy M., Baumgarten M., Patterson D. 2002. In: S.M. Haller & G. Simmons (Eds.), Proc. 15th Int. FLAIRS Conference on Artificial Intelligence, FL, USA: AAAI Press, 354-358. With permission from AAAI Press.

ORIGINAL PAPERS


Eigenvector-based Feature Extraction for Classification

Alexey Tsymbal1,3, Seppo Puuronen1, Mykola Pechenizkiy2, Matthias Baumgarten3, David Patterson3

1Department of Computer Science and Information Systems, University of Jyväskylä, P.O.Box 35, FIN-40351, Jyväskylä, Finland

[email protected]
2Niilo Mäki Institute, Jyväskylä, Finland

3 Northern Ireland Knowledge Engineering Laboratory, University of Ulster, U.K.

Abstract

This paper shows the importance of the use of class information in feature extraction for classification and inappropriateness of conventional PCA to feature extraction for classification. We consider two eigenvector-based approaches that take into account the class information. The first approach is parametric and optimizes the ratio of between-class variance to within-class variance of the transformed data. The second approach is a nonparametric modification of the first one based on local calculation of the between-class covariance matrix. We compare the two approaches with each other, with conventional PCA, and with plain nearest neighbor classification without feature extraction.

1. Introduction

Data mining is the process of finding previously unknown and potentially interesting patterns and relations in large databases. A typical data-mining task is to predict an unknown value of some attribute of a new instance when the values of the other attributes of the new instance are known and a collection of instances with known values of all the attributes is given.

In many applications, the data that is the subject of analysis and processing in data mining is multidimensional, presented by a number of features. The so-called “curse of dimensionality” pertinent to many learning algorithms denotes the drastic rise of computational complexity and classification error in high dimensions (Aha et al., 1991). Hence, the dimensionality of the feature space is often reduced before classification is undertaken.

Feature extraction (FE) is a dimensionality reduction technique that extracts a subset of new features from the original set by means of some functional mapping, keeping as much information in the data as possible (Fukunaga 1990). Conventional Principal Component Analysis (PCA) is one of the most commonly used feature extraction techniques; it is based on extracting the axes on which the data shows the highest variability (Jolliffe 1986). Although this approach “spreads” out the data in the new basis and can be of great help in regression problems and unsupervised learning, there is no guarantee that the new axes are consistent with the discriminatory features in a classification problem. Unfortunately, this often is not taken into account by data mining researchers (Oza 1999). There are many variations on PCA that use local and/or non-linear processing to improve dimensionality reduction (Oza 1999), though they generally are also based solely on the inputs.

In this paper we consider two eigenvector-based approaches that use the within- and between-class covariance matrices and thus do take into account the class information. In the next section we consider conventional PCA and give a simple example of why PCA is not always appropriate to feature extraction for classification.

2. Conventional PCA

PCA transforms the original set of features into a smaller subset of linear combinations that account for most of the variance of the original set (Jolliffe 1986).

The main idea of PCA is to determine the features which explain as much of the total variation in the data as possible with as few of these features as possible. In PCA we are interested in finding a projection w:

y = w^T x,   (1)

where y is a p' x 1 transformed data point, w is a p x p' transformation matrix, and x is a p x 1 original data point. PCA can be done through the eigenvalue decomposition of the covariance matrix S of the original data:

S = \sum_{i=1}^{n} (x_i - m)(x_i - m)^T,   (2)

where n is the number of instances, x_i is the i-th instance, and m is the mean vector of the input data.

Computation of the principal components can be presented with the following algorithm:

1. Calculate the covariance matrix S from the input data.
2. Compute the eigenvalues and eigenvectors of S and sort them in a descending order with respect to the eigenvalues.
3. Form the actual transition matrix by taking the predefined number of components (eigenvectors).
4. Finally, multiply the original feature space with the obtained transition matrix, which yields a lower-dimensional representation.

The necessary cumulative percentage of variance explained by the principal axes should be consulted in order to set a threshold, which defines the number of components to be chosen.
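To make the algorithm concrete, the following sketch (ours, not part of the original experiments) implements PCA through the eigendecomposition of the covariance matrix, with the number of components chosen by a cumulative variance threshold; it assumes NumPy and a data matrix X whose rows are instances:

    import numpy as np

    def pca_transform(X, threshold=0.85):
        # Center the data around the mean vector m.
        m = X.mean(axis=0)
        Xc = X - m
        # Covariance-type scatter matrix S, as in eq. (2).
        S = Xc.T @ Xc
        # Eigenvalues/eigenvectors of the symmetric matrix S;
        # eigh returns them in ascending order, so reverse it.
        eigvals, eigvecs = np.linalg.eigh(S)
        order = np.argsort(eigvals)[::-1]
        eigvals, eigvecs = eigvals[order], eigvecs[:, order]
        # Keep the smallest number of components whose cumulative
        # share of the total variance reaches the threshold.
        cumvar = np.cumsum(eigvals) / eigvals.sum()
        k = int(np.searchsorted(cumvar, threshold)) + 1
        w = eigvecs[:, :k]         # the p x p' transformation matrix
        return Xc @ w, w, m        # rows of Xc @ w are y = w^T x

The threshold value (0.85 here) is only a placeholder; as discussed in the thesis, the appropriate value varies considerably from one dataset to another.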

PCA has the following properties: (1) it maximizes the variance of the extracted features; (2) the extracted features are uncorrelated; (3) it finds the best linear approximation in the mean-square sense; and (4) it maximizes the information contained in the extracted features.

Although PCA has a number of advantages, there are some drawbacks. One of them is that PCA gives high weights to features with higher variabilities, disregarding whether they are useful for classification or not. Figure 1 shows why it can be dangerous not to use the class information (Oza 1999). In the first case PCA works properly: the first principal component corresponds to the variable with the highest discriminating power. The second case shows that the chosen principal component is not always good for class discrimination.

Fig. 1. PCA for classification: a) effective work of PCA; b) an irrelevant principal component was chosen with respect to classification. (Two scatter plots over axes x1, x2 with principal directions y1, y2.)

Nevertheless, conventional PCA is still often applied to feature extraction for classification by researchers.

3. Parametric Eigenvalue-based FE

Feature extraction for classification is a search among all possible transformations for the best one, which preserves class separability as much as possible in the space with the lowest possible dimensionality (Aladjem, 1994). The usual decision is to use some class separability criterion based on a family of functions of scatter matrices: the within-class covariance matrix, the between-class covariance matrix, and the total covariance matrix.

The within-class covariance matrix shows the scatter of samples around their respective class expected vectors:

S_W = \sum_{i=1}^{c} \sum_{j=1}^{n_i} (x_j^{(i)} - m^{(i)})(x_j^{(i)} - m^{(i)})^T,   (3)

where c is the number of classes, n_i is the number of instances in class i, x_j^{(i)} is the j-th instance of the i-th class, and m^{(i)} is the mean vector of the instances of the i-th class.

The between-class covariance matrix shows the scatter of the expected vectors around the mixture mean:

S_B = \sum_{i=1}^{c} n_i (m^{(i)} - m)(m^{(i)} - m)^T,   (4)

where c is the number of classes, n_i is the number of instances in class i, m^{(i)} is the mean vector of the instances of the i-th class, and m is the mean vector of all the input data.

The total covariance matrix shows the scatter of all samples around the mixture mean. It can be shown analytically that this matrix is equal to the sum of the within-class and between-class covariance matrices (Fukunaga 1990):

S = S_B + S_W.   (5)

One possible criterion, based on the between- and within-class covariance matrices (3) and (4), to be optimized for the feature extraction transformation (1) is defined in Fisher linear discriminant analysis:

J(w) = \frac{w^T S_B w}{w^T S_W w}.   (6)

A number of other criteria were proposed in (Fukunaga 1990). The criterion (6) and some other relevant criteria may be optimized by the following algorithm, often called simultaneous diagonalization (Fukunaga 1990):

1. Transform X to Y: Y = \Lambda^{-1/2} \Phi^T X, where \Lambda and \Phi are the eigenvalue and eigenvector matrices of S_W.
2. Compute S_B in the obtained Y space.
3. Select the m eigenvectors of S_B, \psi_1, ..., \psi_m, which correspond to the m largest eigenvalues.
4. Finally, obtain the new feature space Z = \Psi_m^T Y, where \Psi_m = [\psi_1, ..., \psi_m].

It should be noted that there is a fundamental problem with the parametric nature of the covariance matrices. The features extracted with the parametric approach are suboptimal in the Bayes sense. The rank of the between-class covariance matrix (4) is at most c-1 (because it is the sum of c rank-one matrices, of which only c-1 are independent), and hence no more than c-1 of the eigenvalues will be nonzero. The nonparametric method for feature extraction overcomes this problem.
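As an illustration of the procedure, the following sketch (ours; the helper names are hypothetical, NumPy is assumed, and a small ridge term is added to keep the whitening step numerically stable) computes the scatter matrices of eqs. (3)-(4) and performs the four steps above:

    import numpy as np

    def scatter_matrices(X, y):
        # Within-class (S_W) and between-class (S_B) scatter, eqs. (3)-(4).
        m = X.mean(axis=0)
        p = X.shape[1]
        S_W = np.zeros((p, p))
        S_B = np.zeros((p, p))
        for c in np.unique(y):
            Xc = X[y == c]
            mc = Xc.mean(axis=0)
            S_W += (Xc - mc).T @ (Xc - mc)
            d = (mc - m).reshape(-1, 1)
            S_B += len(Xc) * (d @ d.T)
        return S_W, S_B

    def simultaneous_diagonalization(X, y, m_components, ridge=1e-8):
        S_W, S_B = scatter_matrices(X, y)
        # Step 1: whiten with respect to S_W: Y = Lambda^{-1/2} Phi^T X.
        lam, phi = np.linalg.eigh(S_W + ridge * np.eye(S_W.shape[0]))
        W = phi @ np.diag(lam ** -0.5)
        # Step 2: between-class scatter in the whitened space.
        S_B_y = W.T @ S_B @ W
        # Steps 3-4: project onto the m leading eigenvectors of S_B_y.
        lam_b, psi = np.linalg.eigh(S_B_y)
        psi_m = psi[:, np.argsort(lam_b)[::-1][:m_components]]
        return X @ W @ psi_m

Because of the rank argument above, at most c-1 components are meaningful in the parametric case.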

4. Nonparametric Eigenvalue-based FE

The nonparametric method tries to increase the number of degrees of freedom in the between-class covariance matrix (4), measuring the between-class covariances on a local basis. The k-nearest neighbor (kNN) technique is used for this purpose.

A two-class nonparametric feature extraction method was considered in (Fukunaga 1990), and it is extended in this paper to the multiclass case. The algorithm for nonparametric feature extraction is the same as for the parametric extraction (Section 3). Simultaneous diagonalization is used as well; the difference is only in the calculation of the between-class covariance matrix. In the nonparametric between-class covariance matrix, the scatter of the samples around the expected vectors of other classes' instances in the neighborhood is calculated:

S_B = \frac{1}{n} \sum_{i=1}^{c} \sum_{k=1}^{n_i} w_{ik} \sum_{j=1, j \neq i}^{c} (x_k^{(i)} - m_{ik}^{*(j)})(x_k^{(i)} - m_{ik}^{*(j)})^T,   (7)

where m_{ik}^{*(j)} is the mean vector of the nNN instances of the j-th class which are nearest neighbors to x_k^{(i)}. The number of nearest instances nNN is a parameter that should be set in advance. In (Fukunaga 1990) it was proposed to use nNN equal to 3, but without any justification. The coefficient w_{ik} is a weighting coefficient, which shows the importance of each summand in (7). The goal of this coefficient is to assign more weight to those elements of the matrix which involve instances lying near the class boundaries, which are thus more important for classification. We generalize the two-class version of this coefficient proposed in (Fukunaga 1990) to the multiclass case:

w_{ik} = \frac{\min_j \{ d^{\alpha}(x_k^{(i)}, x_{nNN}^{(j)}) \}}{\sum_{j=1}^{c} d^{\alpha}(x_k^{(i)}, x_{nNN}^{(j)})},   (8)

where d(x_k^{(i)}, x_{nNN}^{(j)}) is the distance from x_k^{(i)} to its nNN-th nearest neighbor of class j, and \alpha is a parameter which should be set in advance. In (Fukunaga 1990) the parameter \alpha equal to 1 was used, but without any justification.
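The following sketch (ours; Euclidean distance is assumed, loops are kept explicit for readability, the instance itself is excluded from its own class neighborhood, and every class is assumed to have more than n_nn instances) computes the nonparametric between-class matrix of eqs. (7)-(8):

    import numpy as np

    def nonparametric_between_scatter(X, y, n_nn=3, alpha=1.0):
        n, p = X.shape
        classes = np.unique(y)
        S_B = np.zeros((p, p))
        for xk, yk in zip(X, y):
            d_nn, local_means = {}, {}
            for c in classes:
                Xc = X[y == c]
                dists = np.linalg.norm(Xc - xk, axis=1)
                order = np.argsort(dists)
                # Skip xk itself when scanning its own class.
                idx = order[1:n_nn + 1] if c == yk else order[:n_nn]
                # Distance to the n_nn-th nearest neighbor of class c.
                d_nn[c] = dists[idx[-1]] ** alpha
                # Local mean m*_ik^(j) of the n_nn neighbors, eq. (7).
                local_means[c] = Xc[idx].mean(axis=0)
            # Weighting coefficient w_ik of eq. (8).
            denom = sum(d_nn.values())
            w_ik = min(d_nn.values()) / denom if denom > 0 else 0.0
            for c in classes:
                if c == yk:
                    continue
                d = (xk - local_means[c]).reshape(-1, 1)
                S_B += w_ik * (d @ d.T)
        return S_B / n

This matrix then simply replaces the parametric S_B of eq. (4) in the simultaneous diagonalization procedure of Section 3.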

In the next section we consider our experiments, in which we analyze and compare the feature extraction techniques described above.

5. Experiments

The experiments were conducted on 21 data sets with different characteristics taken from the UCI machine learning repository (Blake et al., 1998). The main characteristics of the data sets are presented in Table 1, which includes the names of the data sets, the numbers of instances included in the data sets, the numbers of different classes of instances, and the numbers of different kinds of features (categorical and numerical) included in the instances. The pre-selected values for α and nNN are included in the table as well. In (Tsymbal et al., 2001) we have presented results of experiments with several feature selection techniques on these data sets.

In the experiments, the accuracy of 3-nearest neighbor classification based on the heterogeneous Euclidean-overlap metric was measured to test the feature extraction approaches. Categorical features were binarized as it was

done in the correlation-based feature selection experiments in (Hall et al., 2000). Each categorical feature was replaced with a redundant set of binary features, each corresponding to a value of the original feature.

Table 1. Characteristics of the data sets

Data set      Instances  Classes  Categorical  Numerical    α    nNN
Balance             625        3            0          4   1/3   255
Breast              286        2            9          0     5     1
Car                1728        4            6          0     5    63
Diabetes            768        2            0          8   1/5   127
Glass               214        6            0          9     1     1
Heart               270        2            0         13     1    31
Ionosphere          351        2            0         34     3   255
Iris Plants         150        3            0          4   1/5    31
LED                 300       10            7          0   1/3    15
LED17               300       10           24          0     5    15
Liver               345        2            0          6     3     7
Lymph               148        4           15          3     1     7
MONK-1              432        2            6          0     1     1
MONK-2              432        2            6          0    20    63
MONK-3              432        2            6          0   1/3     1
Soybean              47        4            0         35     1     3
Thyroid             215        3            0          5     3   215
Tic-Tac-Toe         958        2            9          0     1     1
Vehicle             846        4            0         18     3     3
Voting              435        2           16          0   1/3    15
Zoo                 101        7           16          0  1/20     7

For each data set, 70 test runs of Monte-Carlo cross-validation were made: first to select the best α and nNN parameters, and then to evaluate the classification accuracy with the three feature extraction approaches and without any feature extraction. In each run, the data set is first split into a training set and a test set by stratified random sampling, keeping the class distributions approximately the same. Each time, 30 percent of the instances of the data set are randomly assigned to the test set. The remaining 70 percent of the instances form the training set, which is used for finding the feature extraction transformation matrix (1). The test environment was implemented within the MLC++ framework (the machine learning library in C++) (Kohavi et al. 1996).
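A stratified Monte-Carlo splitting scheme of this kind can be sketched as follows (ours; scikit-learn's StratifiedShuffleSplit, applied to NumPy arrays X and y, serves as a modern stand-in for the MLC++ machinery of the original experiments):

    from sklearn.model_selection import StratifiedShuffleSplit

    def monte_carlo_runs(X, y, n_runs=70, test_fraction=0.3, seed=0):
        # 70 stratified random 70/30 train/test splits, as in the paper.
        splitter = StratifiedShuffleSplit(n_splits=n_runs,
                                          test_size=test_fraction,
                                          random_state=seed)
        for train_idx, test_idx in splitter.split(X, y):
            # The training part is used to fit the transformation (1)
            # and the 3-NN classifier; the test part only for evaluation.
            yield X[train_idx], y[train_idx], X[test_idx], y[test_idx]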

First, a series of experiments was conducted to select the best α and nNN coefficients for the nonparametric approach. The parameter α was selected from the set of 9 values $\alpha \in \{1/20, 1/10, 1/5, 1/3, 1, 3, 5, 10, 20\}$, and the number of nearest neighbors nNN from the set of 8 values $nNN = 2^i - 1$, $i = 1,\ldots,8$, i.e. $nNN \in \{1, 3, 7, 15, 31, 63, 127, 255\}$. The parameters were selected on a wrapper-like basis, optimizing the classification accuracy. For some data sets, e.g. LED and LED17, the selection of the best parameters gave almost no improvement over the values α = 1 and nNN = 3 considered in (Fukunaga 1990), and the classification accuracy varied within the range of one percent. It is necessary to note that the selection of the α and nNN parameters changed the accuracy-based ranking of the three feature extraction approaches only on two data sets, thus demonstrating that the nonparametric approach is robust with respect to the built-in parameters. However, for some data sets the selection of the parameters had a significant positive effect on the


classification accuracy. For example, on the MONK-2 data set, the accuracy is 0.796 when α = 1 and nNN = 3, but it reaches 0.974 when α = 20 and nNN = 63.

Then, we compared four classification techniques: the first three were based on the three feature extraction approaches considered above, and the last one did not use any feature extraction. For each feature extraction technique, we considered experiments with the best eigenvalue threshold from the following set: {0.65, 0.75, 0.85, 0.9, 0.95, 0.97, 0.99, 0.995, 0.999, 1}.

The basic results of the experiments are presented in Table 2. First, the average classification accuracies are given for the three extraction techniques, PCA, the parametric (Par) and the nonparametric (NPar) feature extraction, and for no feature extraction (Plain). The bold-faced and underlined accuracies represent the approaches that were significantly better than all the other approaches; the bold-faced only accuracies represent the approaches that were significantly worse on the corresponding data sets (according to the Student t-test with the 0.95 level of significance). Then, the corresponding average numbers of extracted features are given. The remaining part contains the average feature extraction time and the total expended time (in seconds) for the classification techniques. All the results are averaged over the 70 Monte-Carlo cross-validation runs.

Each row of Table 2 corresponds to one data set. The last two rows include the results averaged over all the data sets (the last row) and over the data sets containing categorical features (the row before the last one).

From Table 2 one can see that the nonparametric approach has the best accuracy on average (0.824). Comparing the total average accuracy with the average accuracy on the categorical data sets, one can see that the nonparametric approach performs much better on categorical data, improving on the accuracy of the other approaches (as on the MONK data sets and the Tic-Tac-Toe data set). The parametric approach is the second best. As we supposed, it is quite unstable and not robust to different data set characteristics (as on the MONK-1, MONK-2 and Glass data sets). The case with no feature extraction has the worst average accuracy.

The parametric approach extracts the smallest number of features on average (only 2.3), and it is the least time-consuming approach. The nonparametric approach is able to extract more features due to its nonparametric nature (9.9 on average), and it is still less time-consuming than PCA and Plain classification.

Still, it is necessary to note that each feature extraction technique was significantly worse than all the other techniques on at least one data set (e.g., the Heart data set for the nonparametric approach), and it is a question for further research to define the dependencies between the characteristics of a data set and the type and parameters of the feature extraction approach best suited for it. For each data set, we have also pairwise compared each feature extraction technique with the others using the paired Student t-test with the 0.95 level of significance.

Table 2. Results of the experiments

                         Accuracy                      Features               Extraction time, sec      Total time, sec
Data set            PCA   Par   NPar  Plain      PCA   Par  NPar  Plain      PCA   Par   NPar       PCA    Par   NPar  Plain
Balance            .827  .893  .863   .834       4.0   1.0   2.0    4.0      .00   .09   .21       3.11   1.02   1.87   2.55
Breast             .721  .676  .676   .724      16.5   1.0  33.7   51.0     2.66  3.10  4.00       5.33   3.31   9.32   5.88
Car                .824  .968  .964   .806      14.0   3.0   6.4   21.0      .38   .53   .64      12.02   3.08   6.43  12.07
Diabetes           .730  .725  .722   .730       7.0   1.0   3.8    8.0      .22   .24   .30       6.73   1.38   4.15   7.14
Glass              .659  .577  .598   .664       4.4   5.0   9.0    9.0      .11   .08   .13        .69    .69   1.19   1.01
Heart              .777  .806  .706   .790      13.0   1.0   4.4   13.0      .13   .23   .31       2.63    .44   1.21   2.14
Ionosphere         .872  .843  .844   .849       9.0   1.0   2.0   34.0     1.52  1.50  2.08       3.49   1.77   2.55   6.09
Iris               .963  .980  .980   .955       2.0   1.0   1.0    4.0      .01   .05   .04        .03    .13    .08    .20
LED                .646  .630  .635   .667       7.0   7.0   7.0   14.0      .13   .39   .49       1.61   1.92   1.99   2.17
LED17              .395  .493  .467   .378      24.0   6.7  11.4   48.0     1.88  2.46  3.10       5.66   3.54   4.91   5.48
Liver              .664  .612  .604   .616       4.9   1.0   3.1    6.0      .06   .15   .15       1.65    .53   1.17   1.88
Lymph              .813  .832  .827   .814      31.4   3.0  32.0   47.0     1.58  2.04  2.50       3.39   2.23   4.39   1.96
MONK-1             .767  .687  .952   .758      10.0   1.0   2.0   17.0      .39   .55   .67       4.47   1.06   1.57   4.94
MONK-2             .717  .654  .962   .504       8.0   1.0   2.0   17.0      .40   .60   .70       3.76   1.08   1.60   4.96
MONK-3             .939  .990  .990   .843      11.0   1.0   1.9   17.0      .37   .55   .69       4.89   1.07   1.54   4.94
Soybean            .992  .987  .986   .995       7.8   1.0   2.2   35.0      .17   .45   .44        .23    .46    .47    .07
Thyroid            .921  .942  .933   .938       4.0   2.0   2.0    5.0      .05   .03   .05        .52    .35    .33    .69
Tic-Tac-Toe        .971  .977  .984   .684      18.0   1.0   2.0   27.0      .80   .96  1.21      11.45   1.68   2.50  11.24
Vehicle            .753  .752  .778   .694      16.0   3.0  12.5   18.0      .55   .53   .67      10.34   2.39   8.02  10.42
Voting             .923  .949  .946   .921      15.9   1.0  61.7   82.0     3.37  4.29  5.76       5.56   4.46  14.05   7.88
Zoo                .937  .885  .888   .932      15.1   6.4   6.5   36.0      .62   .85  1.09       1.03   1.00   1.28    .78
Average (categ.)   .787  .795  .845   .730      15.5   2.9  15.1   34.3     1.14  1.48  1.90       5.38   2.22   4.51   5.66
Average (total)    .801  .803  .824   .766      11.6   2.3   9.9   24.4      .73   .94  1.20       4.22   1.60   3.36   4.50


The results of the comparison are given in Table 3. Columns 2-5 of the table contain the results of comparing the technique corresponding to the row against the technique corresponding to the column using the paired t-test. Each cell contains win/tie/loss information according to the t-test, and in parentheses the same results are given for the eleven data sets that include categorical features. For example, PCA has 8 wins against the parametric extraction on the 21 data sets, and 5 of them are on categorical data sets.
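As a sketch of this comparison procedure in Python (assuming arrays of per-run accuracies; scipy.stats.ttest_rel implements the paired Student t-test):

```python
from scipy import stats

def compare(acc_a, acc_b, level=0.95):
    """Paired Student t-test over the 70 Monte-Carlo runs: 'win' means the
    first technique is significantly better at the given level."""
    t, p = stats.ttest_rel(acc_a, acc_b)
    if p < 1.0 - level:
        return "win" if t > 0 else "loss"
    return "tie"
```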

Table 3. Results of the paired t-test (win/tie/loss information)

                    PCA             Parametric       Nonparametric    Plain
PCA                 -               8/3/10 (5/0/6)   8/1/13 (3/0/8)   9/8/4 (5/5/1)
Parametric          10/3/8 (6/0/5)  -                5/11/5 (2/6/3)   11/5/5 (7/0/4)
Nonparametric       13/1/8 (8/0/3)  5/11/5 (3/6/2)   -                11/3/8 (8/0/3)
Plain               4/8/9 (1/5/5)   5/5/11 (4/0/7)   8/3/11 (3/0/8)   -

From Tables 1 and 2 one can see that classification without feature extraction is clearly the worst technique, even for data sets with relatively small numbers of features. This demonstrates the so-called “curse of dimensionality” and the necessity of feature extraction.

According to Table 3, among the three feature extraction techniques, the parametric and nonparametric techniques are the best on average, with the nonparametric technique being only slightly better than the parametric (3 wins versus 2 on the categorical data sets).

Conventional PCA was the worst feature extraction technique on average, which supports our expectations, as it does not take the class information into account. However, it was surprisingly stable: it was the best technique on only four data sets, but it was also the worst one on only three data sets (the fewest among the techniques).

On the categorical data sets the results are almost the same as on the rest of the data sets; only the nonparametric technique performs much better on categorical data for this selection of data sets. However, further experiments are necessary to check this finding.

6. Conclusions

PCA-based techniques are widely used for classification problems, though they generally do not take the class information into account and are based solely on the inputs. Although such techniques can be of great help in unsupervised learning, there is no guarantee that the new axes are consistent with the discriminatory features in a classification problem.

The experimental results supported our expectations. Classification without feature extraction was clearly the worst, demonstrating the “curse of dimensionality” and the necessity of feature extraction. Conventional PCA was the worst feature extraction technique on average and, therefore, cannot be recommended for finding features that are useful for classification. The nonparametric technique was only slightly better than the parametric one on average. However, this can be explained by the selection of the data sets, which are relatively easy to learn and do not include significant nonnormal class distributions. Besides, better parameter tuning could be used to achieve better results with the nonparametric technique; this is an interesting topic for further research. The nonparametric technique performed much better on the categorical data for this selection of data sets; however, further research is necessary to check this finding.

Another important topic for further research is to define the dependencies between the characteristics of a data set and the type and parameters of the feature extraction approach best suited for it.

Acknowledgements. This research is partly supported by the COMAS Graduate School of the University of Jyväskylä, Finland and NEURE project of Niilo Mäki Institute, Finland. We would like to thank the UCI ML repository of databases, domain theories and data generators for the data sets, and the MLC++ library for the source code used in this study.

References

Aha, D., Kibler, D., Albert, M. 1991. Instance-based learning algorithms. Machine Learning 6, 37-66.
Aladjem, M. 1994. Multiclass discriminant mappings. Signal Processing 35, 1-18.
Blake, C.L., Merz, C.J. 1998. UCI Repository of Machine Learning Databases [http://www.ics.uci.edu/~mlearn/MLRepository.html]. Dept. of Information and Computer Science, University of California, Irvine, CA.
Fukunaga, K. 1990. Introduction to Statistical Pattern Recognition. Academic Press, London.
Hall, M.A. 2000. Correlation-based feature selection of discrete and numeric class machine learning. In Proc. Int. Conf. on Machine Learning (ICML-2000), Morgan Kaufmann, San Francisco, CA, 359-366.
Jolliffe, I.T. 1986. Principal Component Analysis. Springer, New York, NY.
Kohavi, R., Sommerfield, D., Dougherty, J. 1996. Data mining using MLC++: a machine learning library in C++. Tools with Artificial Intelligence, IEEE CS Press, 234-245.
Oza, N.C., Tumer, K. 1999. Dimensionality Reduction Through Classifier Ensembles. Technical Report NASA-ARC-IC-1999-124, Computational Sciences Division, NASA Ames Research Center, Moffett Field, CA.
Tsymbal, A., Puuronen, S., Skrypnyk, I. 2001. Ensemble feature selection with dynamic integration of classifiers. In Int. ICSC Congress on Computational Intelligence Methods and Applications CIMA'2001, Bangor, Wales, UK.


II

PCA-BASED FEATURE TRANSFORMATIONS FOR CLASSIFICATION: ISSUES IN MEDICAL DIAGNOSTICS

Pechenizkiy M., Tsymbal A., Puuronen S. 2004. In: R. Long et al. (Eds.), Proc. 17th IEEE Symp. on Computer-Based Medical Systems CBMS'2004, Los Alamitos, California: IEEE CS Press, 535-540. With permission from IEEE CS Press.


PCA-based Feature Transformation for Classification: Issues in Medical Diagnostics

Mykola Pechenizkiy*, Alexey Tsymbal**, Seppo Puuronen*

*Department of Computer Science and Information Systems, University of Jyväskylä, Finland, e-mails: [email protected], [email protected]
**Department of Computer Science, Trinity College Dublin, Ireland, e-mail: [email protected]

Abstract

The goal of this paper is to propose, evaluate, and compare several data mining strategies that apply feature transformation for subsequent classification, and to consider their application to medical diagnostics. We (1) briefly consider the necessity of dimensionality reduction and discuss why feature transformation may work better than feature selection for some problems; (2) analyze experimentally whether the extraction of new components and the replacement of the original features by them is better than keeping the original features as well; (3) consider how important the use of class information is in the feature extraction process; and (4) discuss some interpretability issues regarding the extracted features.

1. Introduction

Current electronic data repositories, especially in medical domains, contain enormous amounts of data, including currently unknown and potentially interesting patterns and relations, that can be uncovered using knowledge discovery and data mining methods [3]. Commonly, supervised machine learning is used, in which there exists a set of training instances (cases) represented by a vector of the values of features (attributes) and the value of the class label. An induction algorithm is used to learn a classifier, which maps the space of feature values into the set of class values. The classifier is later used to classify new instances with unknown classifications (class labels). Inductive learning systems have been successfully applied in a number of medical domains, e.g. in localization of a primary tumor, prognostics of recurrence of breast cancer, diagnosis of thyroid diseases, and rheumatology [3].

However, researchers and practitioners realize that the effective use of these inductive learning systems requires data preprocessing before a learning algorithm is applied. This is especially important for multidimensional heterogeneous data presented by a large number of features of different types. The so-called “curse of dimensionality”, pertinent to many learning algorithms, denotes the drastic increase in computational complexity and classification error with data having a large number of features. Hence, the dimensionality of the feature space is often reduced before classification is undertaken. There are a number of dimensionality reduction techniques, and according to the adopted reduction strategy they are usually divided into feature selection and feature transformation (also called feature discovery) approaches. The key difference between feature selection and feature transformation is that in the former only a subset of the original features is selected, while the latter is based on the generation of completely new features. The variants of the latter are feature extraction and feature construction. Feature construction implies discovering missing information about the relationships among the features

Page 102: PhD MPechenizkiy withPageNums › ~mpechen › publications › pubs › PechenizkiyPHD...datasets, and the machine learning library in C++ (MLC++) and WEKA developers for the source

by inferring or creating additional features while feature extraction discovers a new feature space having fewer dimensions through a functional mapping, keeping as much information in the data as possible [9].

For some problems a feature subset may be useful in one part of the instance space and at the same time useless or even misleading in another part of it. Therefore, it may be difficult or even impossible to remove irrelevant and/or redundant features from a data set and leave only the useful ones by means of feature selection. Feature selection techniques that just assign weights to individual features are also insensitive to interacting or correlated features. That is why the transformation of the given representation before weighting the features is often preferable.

In this paper we consider several approaches to PCA-based feature transformation for classification and discuss how important the use of class information is when transforming the original features and selecting the extracted ones. In Section 2, besides a brief discussion of PCA-based transformations, we consider how to decide whether a feature transformation is useful for a problem at hand. In Section 3 we address the problem of whether extracted features should be used together with or instead of the original feature space to learn a classifier, and which and how many extracted features are useful for classification. In Section 4 we present the results of experiments with the feature transformation techniques on some problems of medical diagnostics, conclude with preliminary findings, and consider directions of further research. Finally, in Section 5 some interpretability issues with respect to the use of feature transformation are discussed.

2. PCA-based feature transformation for classification

Conventional Principal Component Analysis (PCA) is one of the most commonly used feature extraction techniques. It is based on extracting the axes on which data shows the highest variability [7]. Although PCA “spreads out” data in the new basis, and can be of great help in unsupervised learning, there is no guarantee that the new axes are consistent with the discriminatory features in a classification problem.

Another approach is to take class information into account during the feature extraction process. One technique is to use some class separability criterion from Fisher's linear discriminant analysis, based on a family of functions of scatter matrices: the within-class covariance, the between-class covariance, and the total covariance matrices. In [15] parametric and nonparametric eigenvector-based approaches that use the within- and between-class covariance matrices, and thus do take the class information into account, were analyzed and compared. In [14] these approaches were applied to the dynamic integration of classifiers. Both the parametric and nonparametric approaches use the simultaneous diagonalization algorithm [1] to optimize the relation between the within- and between-class covariance matrices. However, the difference between the approaches is in the calculation of the between-class covariance matrix. The parametric approach uses one mean per class and the total mean, and thus it can extract at most the number of classes minus one features. The nonparametric method tries to increase the number of degrees of freedom in the between-class covariance matrix, measuring the between-class covariances on a local basis.

An important issue is how to decide whether a PCA-based feature transformation approach is appropriate for a certain problem. Since the main goal of PCA is to extract new uncorrelated features, it is logical to introduce some correlation-based criterion with the possibility to define a threshold value. One such criterion is the Kaiser-Meyer-Olkin (KMO) criterion, which accounts for both total and partial correlation:


$$ KMO = \frac{\sum_i \sum_j r_{ij}^2}{\sum_i \sum_j r_{ij}^2 + \sum_i \sum_j a_{ij}^2} \;, \qquad (1)$$

where $r_{ij} = r(x^{(i)}, x^{(j)})$ is an element of the correlation matrix R and $a_{ij}$ are the elements of the partial correlation matrix A:

$$ a_{ij.X(i,j)} = \frac{-R_{ij}}{\sqrt{R_{ii} R_{jj}}} \;, \qquad (2)$$

where $a_{ij.X(i,j)}$ is the partial correlation coefficient for $x^{(i)}$ and $x^{(j)}$ when the effect of all the features other than i and j, denoted as X(i,j), is fixed (controlled), and $R_{kl}$ is the algebraic complement (cofactor) of $r_{kl}$ in the determinant of the correlation matrix R.

It can be seen that if two features share a common factor with other features, their partial correlation $a_{ij}$ will be small, indicating the unique variance they share. Then, if the $a_{ij}$ are close to zero (the features measure a common factor), KMO will be close to one, while if the $a_{ij}$ are close to one (the features do not measure a common factor), KMO will be close to zero. Generally, it is recommended to apply PCA to a data set only if KMO is greater than 0.5. In [11] it was recommended to apply PCA for meta-learning tasks if KMO is greater than 0.6.

3. Which and how many principal components are useful for classification?

In this section we are interested in the problem of selecting the best subset of orthogonally transformed features for subsequent classification; that is, we are not searching for the best transformation but rather trying to find the best subset of transformed components that allows achieving the best classification.

One common method is to introduce some threshold, e.g. on the variance accounted for by a component to be selected. This results in selecting the principal components that correspond to the largest eigenvalues. The problem with this approach is that the magnitude of an eigenvalue depends on the data variance only and has nothing to do with the class information. In [6] Jolliffe presents several real-life examples where principal components corresponding to the smallest eigenvalues are correlated with the output attribute, so principal components important for classification may be excluded because they have small eigenvalues. In [10] another example of such a situation is shown. Nevertheless, criteria for selecting the most useful transformed features are often based on the variance accounted for by the features to be selected.

In [13], e.g., it is argued that selecting all the components whose corresponding eigenvalues are significantly greater than one produces results similar to defining the number of components to select according to the following formula:

$$ \mathrm{eigenvalue} > 1 + 2\sqrt{\frac{\#\mathit{features} - 1}{\#\mathit{instances} - 1}} \;, \qquad (3)$$

where the number of features and instances should be relatively large. An alternative approach is to use a ranking procedure and select principal components that


have the highest correlations with the class attribute. Although this makes intuitive sense, there is criticism of such an approach. In [1] this alternative approach was shown to work slightly worse than using components with the largest eigenvalues in the prediction context.

The problem of selecting useful transformed features is not so important for the parametric class-conditional approach since: (1) it takes into account class information, and (2) it extracts only the number of classes minus one component(s).

Another important issue is deciding whether the selected transformed features should be used together with or instead of the original features to learn a classifier. In [11] the use of selected transformed features as additional ones is recommended for a decision-tree learner, an instance-based learner and a Naïve Bayes learner. In [15] we show that using only the selected transformed features for an instance-based learner can significantly increase its accuracy for many problems.

4. Experimental results

In this study we experimented with conventional PCA feature transformation and a parametric class-conditional approach. We compared four different approaches that combine the 3-nearest neighbour classifier (3-NN) with conventional PCA feature transformation or parametric class-conditional feature transformation and use the transformed features either together with or instead of the original ones. We also compared them with 3-NN without any feature transformation. We limit our study to data sets that have numerical features only. The nonparametric approach, which in [15] was shown to produce better results on data sets with categorical features than on those with numerical ones, is excluded from this study.

The experiments were conducted on five data sets from the UCI repository with different problems of medical diagnostics: Pima Indians Diabetes, Heart Disease, Liver Disorders, Thyroid Gland, and Wisconsin Breast Cancer (www.ics.uci.edu/~mlearn/MLRepository.html).

In Table 1, for each data set we present the number of instances, the KMO value, the number of features used by every approach and the corresponding accuracy.

Table 1. Experimental results (accuracy of the 3-NN classifier and number of features used)

Dataset    Inst.  KMO    Orig. space     PCA          Par          Orig.+PCA    Orig.+Par
Diabetes    768   .549   .738    8      .706   8     .714   1     .734   11    .737    9
Heart       270   .533   .781   13      .659  12     .825   1     .788   17    .778   14
Liver       345   .551   .612    6      .591   5     .632   1     .594    8    .644    7
Thyroid     215   .568   .969    5      .951   4     .967   2     .970    6    .961    7
Cancer      569   .513   .968   30      .935  10     .978   1     .968   32    .971   31

For the Diabetes and Thyroid data sets, none of the feature transformation techniques improves on the plain 3-NN classifier. For the Heart and Cancer data sets, 3-NN achieves the highest accuracy when the new features extracted by the parametric approach are used instead of the original ones. For the Liver data set, the best results are achieved when the feature extracted by the parametric approach is used together with the original ones. It can be seen from the table that KMO is not a relevant criterion (at least for the considered data sets) for deciding whether a PCA-based feature transformation technique is worth being applied to a problem of medical diagnostics. Although for every data set KMO was higher than 0.5, the principal components, when used instead of the original features, resulted in lower accuracy of the 3-NN classifier, and when used together with the original features, never improved the classification accuracy. Therefore, some additional measures beside KMO need to be considered.


No theoretical analysis has yet been performed to answer whether transformed features should be used together with or instead of the original features. This is an interesting direction for further research.

5. Feature transformation and interpretability

Before arguing for and against feature transformation with respect to the interpretability issue, let us consider first what is commonly meant by interpretability.

Interpretability refers to whether a classifier is easy to understand. It is commonly accepted that rule-based classifiers like decision trees and association rules are very easy to interpret, while neural networks and other connectionist and “black-box” classifiers have low interpretability. kNN is considered to have very poor interpretability because the unstructured collection of training instances is far from readable, especially if there are many of them. While interpretability concerns a typical classifier generated by a learning algorithm, transparency (or comprehensibility) refers to whether the principle of the method of constructing a classifier is easy to understand (that is, a user's subjective assessment). Therefore, for example, a kNN classifier is scarcely interpretable, but the method itself is transparent because it appeals to the intuition of humans, who spontaneously reason from similar cases. Similarly, the interpretability of Naïve Bayes can be estimated as not very high, but the transparency of the method is good, for example, for physicians who find that probabilistic explanations replicate their way of diagnosing, i.e., by summing evidence for or against a disease [8].

However, when feature transformation is applied for rule-based approaches (e.g. association rules), interpretation of produced rules naturally becomes more difficult or even impossible. Since binarization of categorical attributes is required for PCA-based feature transformation, interpretation of results for a data set containing categorical features deteriorates drastically.

The common criticism of instance-based learning is that it does not provide explicit knowledge to the user as association rules do. However, each individual prediction can be explained transparently by uncovering the instances on which the decision is based. This is naturally useful in situations where the end user is more familiar with previous medical cases than with the complex measures that are used to characterize them. However, a major problem of simple approaches like kNN is that the Euclidean distance will not necessarily be suited for finding intuitively similar cases, especially if irrelevant attributes are present; feature extraction may be of great help in this context of explanations based on similar cases. In [15] we show how different feature extraction techniques “improve” the neighborhood of the instances with respect to their classes in the Euclidean space. An additional benefit of feature extraction is the possibility of visualizing cases (and performing visual analysis) by projecting them onto 2D or 3D plots.

PCA-based feature extraction for classification can be treated as means of constructive induction. In [2] it is argued that constructive induction, when generating new features, can produce new (emerging) concepts which in turn may lead to a new understanding of a problem and can produce additional knowledge, including new concepts and their relationship to the primary concepts.

PCA-based feature transformations make it possible to summarize the information from a large number of features into a limited number of components, i.e. linear combinations of the original features. It is argued that, unfortunately, principal components are often difficult to interpret, and many methods of rotating components to improve their interpretability have been proposed. The varimax rotation is the best known of the orthogonal rotation methods. Chapter 8 in


[5] contains a brief description of varimax along with several other rotation methods and relevant references. It should also be noted that the transformation formulae of the principal components themselves may provide useful information for the interpretability of results. In order to achieve better interpretability, principal components can be replaced with suboptimal but more interpretable “simple components” [12]. The idea is that the components may extract less variability and may be slightly correlated, but the approach can be advantageous in practice since the components are easier to interpret.

We would like to emphasise that the assessment of interpretability relies on the user's perception of the classifier, and the assessment of an algorithm's practicality depends very much on the user's background, preferences and priorities. Most of the characteristics related to practicality can be described only by reporting users' subjective evaluations. Thus, interpretability issues are still highly disputable and difficult to evaluate, and therefore many conclusions on interpretability are relative and rather subjective.

Acknowledgments: This research is partly supported by the COMAS Graduate School of the University of Jyväskylä, Finland. This material is based upon works supported by the Science Foundation Ireland under Grant No. S.F.I.-02IN.1I111. We would like to thank the developers of the WEKA machine learning library in Java for the source code used in this study.

References

[1] Almoy, T. A simulation study on the comparison of prediction methods when only a few components are relevant. Computational Statistics and Data Analysis 21, 1996, pp. 87-107.
[2] Arciszewski, T., Wnek, J., Michalski, R.S. An application of constructive induction to engineering design. In Proceedings of the IJCAI-93 Workshop on AI in Design, France, 1993.
[3] Fayyad, U., Piatetsky-Shapiro, G., Smyth, P., Uthurusamy, R. Advances in Knowledge Discovery and Data Mining. AAAI/MIT Press, 1997.
[4] Hadi, A.S., Ling, R.F. Some cautionary notes on the use of principal components regression. The American Statistician 52, 1998, pp. 15-19.
[5] Jackson, J.E. A User's Guide to Principal Components. Wiley & Sons, New York, 1991.
[6] Jolliffe, I.T. A note on the use of principal components in regression. Applied Statistics 31, 1982, pp. 300-303.
[7] Jolliffe, I.T. Principal Component Analysis. Springer-Verlag, New York, 1986.
[8] Kononenko, I. Inductive and Bayesian learning in medical diagnosis. Applied Artificial Intelligence 7(4), 1993, pp. 317-337.
[9] Liu, H. Feature Extraction, Construction and Selection: A Data Mining Perspective. Kluwer Academic Publishers, 1998.
[10] Oza, N.C., Tumer, K. Dimensionality Reduction Through Classifier Ensembles. Technical Report NASA-ARC-IC-1999-124, Computational Sciences Division, NASA Ames Research Center, Moffett Field, CA, 1999.
[11] Popelínský, L. Combining the principal components method with different learning algorithms. In Proc. of the 12th European Conference on Machine Learning, Workshop on Integrating Aspects of Data Mining, Decision Support and Meta-Learning, Freiburg, 2001.
[12] Rousson, V., Gasser, T. Simple Component Analysis. Manuscript, 2003. Available from http://www.unizh.ch/biostat/Manuscripts/sca3.pdf
[13] Saporta, G. Some simple rules for interpreting outputs of principal components and correspondence analysis. In: Bacelar-Nicolau, H., Nicolau, F.C., Janssen, J. (eds.), Proceedings of ASMDA-99, University of Lisbon, 1999.
[14] Tsymbal, A., Pechenizkiy, M., Puuronen, S., Patterson, D.W. Dynamic integration of classifiers in the space of principal components. In: Kalinichenko, L., Manthey, R., Thalheim, B., Wloka, U. (eds.), Proc. Advances in Databases and Information Systems: 7th East-European Conf. ADBIS'03, Dresden, Germany, Lecture Notes in Computer Science, Vol. 2798, Springer-Verlag, 2003, pp. 278-292.
[15] Tsymbal, A., Puuronen, S., Pechenizkiy, M., Baumgarten, M., Patterson, D. Eigenvector-based feature extraction for classification. In: Haller, S.M., Simmons, G. (eds.), Proc. 15th Int. FLAIRS Conference on Artificial Intelligence, AAAI Press, 2002, pp. 354-358.


III

ON COMBINING PRINCIPAL COMPONENTS WITH FISHER'S LINEAR DISCRIMINANTS FOR SUPERVISED LEARNING

Pechenizkiy M., Tsymbal A. & Puuronen S. Manuscript submitted to Special Issue of Foundations of Computing and Decision Sciences “Data Mining and Knowledge Discovery” (as extended version of Pechenizkiy et al., 2005e).


ON COMBINING PRINCIPAL COMPONENTS WITH FISHER’S LINEAR DISCRIMINANTS FOR SUPERVISED LEARNING

Mykola PECHENIZKIY*, Alexey TSYMBAL**, Seppo PUURONEN*

Abstract. “The curse of dimensionality” is pertinent to many learning algorithms, and it denotes the drastic increase of computational complexity and classification error in high dimensions. In this paper, principal component analysis (PCA), parametric feature extraction (FE) based on Fisher's linear discriminant analysis (LDA), and their combination as means of dimensionality reduction are analysed with respect to the performance of different classifiers. Three commonly used classifiers are taken for analysis: kNN, Naïve Bayes and the C4.5 decision tree. Recently, it has been argued that it is extremely important to use class information in FE for supervised learning (SL). However, LDA-based FE, although using class information, has a serious shortcoming due to its parametric nature: the number of extracted components cannot be more than the number of classes minus one. Besides, as its name suggests, LDA works mostly for linearly separable classes only. In this paper we study whether it is possible to overcome these shortcomings by adding the most significant principal components to the set of features extracted with LDA. In experiments on 21 benchmark datasets from the UCI repository these two approaches (PCA and LDA) are compared with each other, and with their combination, for each classifier. Our results demonstrate that such a combination approach has certain potential, especially when applied for C4.5 decision tree learning. However, from the practical point of view the combination approach cannot be recommended for Naïve Bayes, since its behaviour is very unstable on different datasets.

Keywords: principal component analysis (PCA), linear discriminant analysis (LDA), feature extraction, supervised learning

* Department of Computer Science and Information Systems, University of Jyväskylä, P.O. Box 35, 40351 Jyväskylä, Finland (phone: +358-14-2602472; fax: +358-14-2603011; e-mail: {mpechen,sepi}@cs.jyu.fi).
** Department of Computer Science, Trinity College Dublin, College Green, Dublin 2, Ireland (e-mail: [email protected]).


1. Introduction

Fayyad [5] introduced KDD as “the nontrivial process of identifying valid, novel, potentially useful, and ultimately understandable patterns in data”. The process comprises several steps, which involve data selection, data pre-processing, data transformation, application of machine learning techniques, and the interpretation and evaluation of patterns.

In this paper we analyse issues related to data transformation, which needs to be undertaken before applying machine learning techniques [8]. These issues are often considered from two different perspectives. The first one is related to the so-called “curse of dimensionality” problem [3] and the necessity of dimensionality reduction [1]. The second perspective comes from the assumption that in many datasets to be processed some individual features, being irrelevant or indirectly relevant to the purpose of the analysis, form a poor problem representation space. This may be the case even when there is no dimensionality problem. The corresponding ideas of constructive induction, which assume the improvement of the problem representation before the application of any learning technique, have been presented in [9].

Feature extraction (FE) for classification is aimed at finding a transformation of the original space that produces new features which preserve class separability as much as possible [6] and form a new lower-dimensional problem representation space. Thus, FE accounts for both perspectives, and therefore we believe that FE, when applied either to datasets with high dimensionality or to datasets including indirectly relevant features, can improve the performance of a classifier.

We consider Principal Component Analysis (PCA), which is perhaps the most commonly used FE technique, and class-conditional LDA-based FE in the context of supervised learning (SL). Recently, it has been argued that it is extremely important to use class information in FE for SL. Thus, in [10] it was shown that PCA gives high weights to features with higher variabilities irrespective of whether they are useful for classification or not. However, LDA-based FE, although using class information, also has a serious shortcoming due to its parametric nature: the number of extracted components cannot be more than the number of classes minus one. Besides, as its name suggests, LDA works mostly for linearly separable classes only. We discuss the main limitations of PCA and of parametric LDA-based FE and search for ways to overcome them. Our approach is to combine linear discriminants (LDs) with principal components (PCs) for SL. Particularly, in this paper we study whether it is possible to overcome the shortcomings of both approaches and construct a better representation space for SL by adding the most significant principal components to the set of features extracted with LDA.

We conduct a number of experiments on 21 UCI datasets, analyzing the difference in the impact of these three FE approaches (PCA, LDA and the combined use of LDs and PCs) on the classification performance of nearest neighbour classification, Naïve Bayes, and C4.5 decision tree learning. We compare PCA and LDA with each other, and with their combination, for each classifier. The results of these experiments show that linear discriminants, when combined with principal components, can significantly improve the supervised learning process in comparison with the case when either approach is used alone for FE. The results support our intuitive reasoning that when PCs are added to LDs, the representation space for SL is improved by providing more information about the variance in the data, and when LDs are added to PCs, the representation space is improved by providing more information about the data variance within and between classes. Indeed, our results show that such a combination approach has certain potential, especially when applied for C4.5 decision tree learning. However, from the practical point of view the combination approach cannot be recommended for Naïve Bayes, since its behaviour is very unstable on different datasets.

To the best of our knowledge, this is the first attempt to analyse the combination of Fisher's linear discriminants and principal components.

The rest of the paper is organised as follows. In Section 2 we first consider PCA (Section 2.1) and class-conditional LDA-based FE (Section 2.2) in the context of SL; then, Section 2.3 introduces the combination of the PCA and LDA-based FE approaches. In Section 3 we describe our experiments and present the main results. In Section 4 we summarise our work with the main conclusions and further research directions.

2. Feature Extraction for Supervised Learning

Generally, FE for classification can be seen as a search, among all possible transformations of the original feature set, for the best one, which preserves class separability as much as possible in the space with the lowest possible dimensionality [6]. In other words, having n independent observations of the variables $\mathbf{x}^T = (x_1, \ldots, x_d)$ centred about the mean, we are interested in finding a projection w:

$$ \mathbf{y} = \mathbf{w}^T \mathbf{x} \;, \qquad (1)$$

where $\mathbf{y}$ is a $k \times 1$ transformed data point (presented by k features), $\mathbf{w}$ is a $d \times k$ transformation matrix, and $\mathbf{x}$ is a $d \times 1$ original data point (presented by d features).

2.1. Principal Component Analysis (PCA)

Principal Component Analysis (PCA) is a well-known statistical method, which extracts a lower-dimensional space by analyzing the covariance structure of multivariate statistical observations [7].

The main idea behind PCA is to determine the features that explain as much of the total variation in the data as possible with as few features as possible. The computation of the PCA transformation matrix is based on the eigenvalue decomposition of the covariance matrix S (formula (2) below) and is therefore rather expensive computationally:

$$ \mathbf{w} \leftarrow \mathrm{eig\_decomposition}\Big( \mathbf{S} = \sum_{i=1}^{n} (\mathbf{x}_i - \mathbf{m})(\mathbf{x}_i - \mathbf{m})^T \Big) \;, \qquad (2)$$

where n is the number of instances, $\mathbf{x}_i$ is the i-th instance, and $\mathbf{m} = \frac{1}{n}\sum_{i=1}^{n}\mathbf{x}_i$ is the mean vector of the input data.

Computation of the PCs can be presented with the following algorithm:

1. Calculate the covariance matrix S from the input data.
2. Compute the eigenvalues and eigenvectors of S and sort them in descending order with respect to the eigenvalues.
3. Form the actual transition matrix by taking the predefined number of components (eigenvectors).
4. Finally, multiply the original feature space by the obtained transition matrix, which yields a lower-dimensional representation.

The cumulative percentage of variance explained by the principal axes is commonly used as a threshold, which defines the number of components to be chosen.
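A compact Python sketch of these four steps, with the variance threshold of the previous paragraph, is given below; this is our illustration, not the implementation used in the experiments.

```python
import numpy as np

def pca_transform(X, var_threshold=0.85):
    """Steps 1-4 above: eigendecompose the covariance matrix S and keep the
    leading components that explain var_threshold of the total variance."""
    Xc = X - X.mean(axis=0)               # centre the data about the mean m
    S = np.cov(Xc, rowvar=False)          # 1. covariance matrix
    vals, vecs = np.linalg.eigh(S)        # 2. eigenvalues and eigenvectors
    order = np.argsort(vals)[::-1]        #    ... sorted in descending order
    vals, vecs = vals[order], vecs[:, order]
    k = int(np.searchsorted(np.cumsum(vals) / vals.sum(), var_threshold)) + 1
    w = vecs[:, :k]                       # 3. transition matrix
    return Xc @ w, w                      # 4. lower-dimensional representation
```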

PCA has been perhaps one of the most popular FE techniques because of its many appealing properties. Thus, PCA maximises the variance of the extracted features, and the extracted features are uncorrelated. It finds the best linear approximation in the mean-square sense, because the truncation error is the sum of the lower eigenvalues, and the information contained in the extracted features is maximised. Besides this, the model parameters for PCA can be computed directly from the data, for example by diagonalizing the sample covariance of the data. For a more detailed description of PCA and its properties see, for example, [16].

In [10] it was shown that PCA has an important drawback: conventional PCA gives high weights to features with higher variabilities irrespective of whether they are useful for classification or not. This may give rise to a situation where the chosen principal component corresponds to the attribute with the highest variability but without any discriminating power (compare the two situations in Figure 1).

Figure 1. PCA for classification: a) effective work of PCA, b) the case where an irrelevant PC was chosen from the classification point of view (O denotes the origin of initial feature space x1, x2 and OT – the origin of transformed feature space PC(1), PC(2)).

2.2. Class-Conditional Eigenvector-Based FE

A usual approach to overcome the above problem is to use some class separability criterion [2], e.g. the criteria defined in Fisher's linear discriminant analysis, based on the family of functions of scatter matrices:

$$ J(\mathbf{w}) = \frac{\mathbf{w}^T \mathbf{S}_B \mathbf{w}}{\mathbf{w}^T \mathbf{S}_W \mathbf{w}} \;, \qquad (3)$$

where $\mathbf{S}_B$ in the parametric case is the between-class covariance matrix, which shows the scatter of the expected vectors around the mixture mean, and $\mathbf{S}_W$ is the within-class covariance matrix, which shows the scatter of samples around their respective class expected vectors:

$$ \mathbf{S}_W = \sum_{i=1}^{c} \frac{n_i}{n} \sum_{j=1}^{n_i} \big(\mathbf{x}_j^{(i)} - \mathbf{m}^{(i)}\big)\big(\mathbf{x}_j^{(i)} - \mathbf{m}^{(i)}\big)^T \;, \qquad (4)$$

$$ \mathbf{S}_B = \sum_{i=1}^{c} n_i \big(\mathbf{m}^{(i)} - \mathbf{m}\big)\big(\mathbf{m}^{(i)} - \mathbf{m}\big)^T \;, \qquad (5)$$

where c is the number of classes, $n_i$ is the number of instances in class i, $\mathbf{x}_j^{(i)}$ is the j-th instance of the i-th class, $\mathbf{m}^{(i)}$ is the mean vector of the instances of the i-th class, and $\mathbf{m}$ is the mean vector of all the input data.

The total covariance matrix shows the scatter of all samples around the mixture mean. It can be shown analytically that this matrix is equal to the sum of the within-class and between-class covariance matrices [6]. In this approach the objective is to maximise the distance between the means of the classes while minimising the variance within each class. A number of other criteria were proposed by Fukunaga [6].

The criterion (3) is optimised using the simultaneous diagonalization algorithm [6]. The basic steps of the algorithm include the eigenvalue decomposition of $\mathbf{S}_W$; the transformation of the original space into an intermediate space $\mathbf{x}_W$ (whitening); the calculation of $\mathbf{S}_B$ in $\mathbf{x}_W$; and the eigenvalue decomposition of this $\mathbf{S}_B$. The transformation matrix $\mathbf{w}$ is then finally produced by a simple multiplication:

$$ \mathrm{eig\_decomposition}(\mathbf{S}_W) \rightarrow \mathbf{w}_W, \quad \mathbf{x}_W = \mathbf{w}_W^T \mathbf{x}, \quad \mathrm{eig\_decomposition}\big(\mathbf{S}_B\big|_{\mathbf{x}_W}\big) \rightarrow \mathbf{w}_B, \quad \mathbf{w} = \mathbf{w}_W \mathbf{w}_B \;. \qquad (6)$$

It should be noticed that there is a fundamental problem with the parametric nature of the covariance matrices: the rank of $\mathbf{S}_B$ is at most the number of classes minus one, and hence no more than this number of new features can be obtained.

Some attempts have been made to overcome this shortcoming of LDA-based FE. One approach is to increase the number of degrees of freedom in the between-class covariance matrix by measuring the between-class covariances on a local basis. The k-nearest neighbor (kNN) technique can be used for this purpose; such an approach is known as nonparametric FE [6]. In our previous work (for example [14, 11, 13]) we experimentally compared these approaches for one-level supervised learning and for the dynamic integration of classifiers. Our results demonstrated that measuring the between-class covariance on a local basis does increase the performance of FE for supervised learning on many datasets. Yet, the nonparametric approach is much more time consuming.

Page 114: PhD MPechenizkiy withPageNums › ~mpechen › publications › pubs › PechenizkiyPHD...datasets, and the machine learning library in C++ (MLC++) and WEKA developers for the source

Another way to improve the performance of LDA-based FE is to discover the structure of (possibly heterogeneous) data and apply FE locally, for each region (cluster/partition) independently. For example, in [12] we applied the so-called natural clustering approach to perform local dimensionality reduction, also using parametric LDA-based FE.

In this paper we propose another approach, which tries to improve the parametric class-conditional LDA by adding a few principal components (PCs). We describe this approach in the following section.

2.3. Combination of Principal Components and Linear Discriminants for Supervised Learning

The basic idea behind the combination of PCs with LDs for SL is rather intuitive. The added PCs are selected so that they cover most of the variance in the data. By adding those PCs it is possible to have more than the number of classes minus one extracted features. The approach is illustrated in Figure 2.

Figure 2. Combining PCs and LDs for SL: PCA and LDA models (transformation matrices) are built on the training set; the training and test sets are transformed with both models independently, and the resulting PCs and LDs are merged before supervised learning and classifier evaluation.

Page 115: PhD MPechenizkiy withPageNums › ~mpechen › publications › pubs › PechenizkiyPHD...datasets, and the machine learning library in C++ (MLC++) and WEKA developers for the source

A dataset is divided into training and test sets. Both PCA and parametric LDA-based FE (LDA) are applied independently to the training data, producing a PCA model and an LDA model (transformation matrices) correspondingly. Then, the original training data is mapped onto the lower-dimensional spaces, independently with each approach. Thus, two transformed training sets are produced, one of which contains PCs instead of the original features and the other LDs. These transformed (lower-dimensional) datasets are then merged, so that the resulting training set contains both the LDs and PCs and the class attribute from the original training set. The constructed new representation space is used for learning a classifier.

To be able to evaluate the learnt classifier, the test set must also be transformed to the same format. This is done in a similar way: the test set is transformed independently with the PCA and LDA models, and PCs and LDs are constructed. Then, the LDs are merged with the PCs and the class attribute from the original test set.
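Putting the pieces together, here is a minimal Python sketch of this pipeline, building on the hypothetical pca_transform and parametric_lda helpers sketched in Sections 2.1 and 2.2:

```python
import numpy as np

def combine_lds_and_pcs(X_train, y_train, X_test, n_pcs=3):
    """Fit the PCA and LDA models on the training set only, transform both
    sets independently with each model, and merge the LDs with the first
    n_pcs PCs; the class attribute is carried over unchanged."""
    m = X_train.mean(axis=0)                  # training mean, reused for the test set
    _, w_pca = pca_transform(X_train)         # PCA model (transformation matrix)
    w_lda = parametric_lda(X_train, y_train)  # LDA model (transformation matrix)

    def transform(X):
        Xc = X - m                            # centre with the training statistics
        return np.hstack([Xc @ w_lda, (Xc @ w_pca)[:, :n_pcs]])

    return transform(X_train), transform(X_test)
```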

3. Experiments and Results

The experiments were conducted on 21 datasets with different characteristics (see Table 1) taken from the UCI machine learning repository [4]. For FE, each categorical feature was replaced with a redundant set of binary features, each corresponding to a value of the original feature.

Table 1. Datasets characteristics

                                          Features
Dataset       instances  classes    num.  categ.  binarised  num+bin.
Balance          625        3         0      4        20        20
Breast           286        2         0      9        38        38
Car             1728        4         0      6        21        21
Diabetes         768        2         8      0         0         8
Glass            214        6         9      0         0         9
Heart            270        2        13      0         0        13
Ionosphere       351        2        34      0         0        33
Iris Plants      150        3         4      0         0         4
LED              300       10         0      7         7         7
LED17            300       10         0     17        24        24
Liver            345        2         6      0         0         6
Lymph            148        4         3     15        33        36
Monk-1           432        2         0      6        15        15
Monk-2           432        2         0      6        15        15
Monk-3           432        2         0      6        15        15
Tic              958        2         0      9        27        27
Vehicle          846        4        18      0         0        18
Kr-vs-kp        3196        2         0     36        38        38
Waveform        3772        4         7     23        27        34
Vowel            990       11        10      2        16        26
Hypothyroid     5000        3        21      0         0        21

Page 116: PhD MPechenizkiy withPageNums › ~mpechen › publications › pubs › PechenizkiyPHD...datasets, and the machine learning library in C++ (MLC++) and WEKA developers for the source

In the experiments, the accuracies of the classifiers produced by 3-nearest neighbour classification (3NN), the Naïve-Bayesian (NB) learning algorithm, and the C4.5 decision tree (C4.5) [15] were calculated. All of them are well known in the DM community and represent three different approaches to learning from data. The main motivation for using the three different types of classifiers is that we expect a different impact of FE on the representation space, not only with respect to different datasets but also with respect to different classifiers. In particular, for kNN FE can produce a better neighbourhood, for C4.5 it produces better (more informative) individual features, and for Naïve Bayes it produces uncorrelated features.

For each dataset, 30 test runs of Monte-Carlo cross validation were made to evaluate the classification accuracies with and without FE. In each run, the dataset is first split into the training set and the test set by stratified random sampling to keep the class distributions approximately the same. Each time, 30 percent of the instances are randomly taken to the test set; the remaining 70 percent of the instances form the training set, which is used for finding the FE transformation matrix w. The test environment was implemented within the WEKA framework (the machine learning library in Java) [15]. The classifiers from this library were used with their default settings.

We took all the features extracted by the parametric FE, as their number was always equal to the number of classes minus one. In the experiments we added 1, 2, and 3 PCs, as well as the number of principal components that covers 85% of the variance. We found that the patterns of performance were quite similar for all these ways of combining LDs and PCs. Therefore, in this paper we limit our presentation to the case where the first three PCs were added to the set of features extracted with LDA.

The basic accuracy results for each dataset and each of the three classifiers are presented in Table 2. For each classifier three different accuracies are presented: for the case when parametric LDA-based FE was used (denoted as LDA), for the case when PCA was used as the FE technique (PCA), and for the case when both LDA and PCA were used to produce new features (denoted as LDA+PCA).

The main results are presented in Table 3 (placed after Figure 3 to prevent a table break) in qualitative form. Each cell contains information on whether one method was better than another. The notations used are: '++' denotes a significant win of LDA (or LDA+PCA) according to the paired Student's t-test with 0.95 level of significance; '+' denotes a win (average improvement of more than 0.5%, but without statistical significance) of LDA (or LDA+PCA); '=' denotes a tie (average difference of less than 0.5%); '–' denotes a loss (average decrease of accuracy of more than 0.5%, but not significant according to the paired Student's t-test with 0.95 level of significance) of LDA (or LDA+PCA); and '– –' denotes a significant loss of LDA (or LDA+PCA), also according to the paired Student's t-test with 0.95 level of significance.
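For concreteness, the mapping from the 30 paired per-run accuracies of two methods to this five-symbol notation could be sketched as follows (a hypothetical helper; the thesis does not provide code):

```python
import numpy as np
from scipy.stats import ttest_rel

def outcome(acc_a, acc_b, alpha=0.05, margin=0.005):
    """Map two paired per-run accuracy vectors to '++', '+', '=', '-' or '--'."""
    diff = float(np.mean(acc_a) - np.mean(acc_b))
    if abs(diff) < margin:                       # average difference below 0.5%
        return '='
    significant = ttest_rel(acc_a, acc_b).pvalue < alpha   # 0.95 significance
    if diff > 0:
        return '++' if significant else '+'
    return '--' if significant else '-'
```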

It can be seen from the table that there are many common patterns in the behavior of

techniques for the 3 different classifiers, yet there are some differences with some datasets

too.

First we analysed how many wins, losses and ties occurred for each pair of compared approaches. These results are presented in Figure 3. It can be seen from the figure that, according to the ranking results, PCA works better than LDA for C4.5, while LDA is better suited for NB. The accuracy results of the 3NN classifier are similar with either PCA or LDA FE. This kind of behaviour was observed and discussed earlier in [11]. The combination approach works very well with the C4.5 classifier, and it is also good with the 3NN classifier. However, with the NB classifier the combination approach demonstrates surprisingly poor behaviour.

Table 2. Basic accuracy results

             3NN                      NB                       C4.5
Dataset      LDA    PCA    LDA+PCA   LDA    PCA    LDA+PCA    LDA    PCA    LDA+PCA
balance      .881   .707   .829      .901   .804   .890       .903   .668   .902
breast       .715   .686   .702      .755   .723   .737       .745   .697   .743
car          .930   .842   .926      .914   .825   .902       .935   .835   .939
diabetes     .621   .711   .709      .672   .673   .674       .663   .703   .700
glass        .633   .638   .635      .464   .492   .462       .614   .640   .609
heart        .602   .640   .635      .619   .663   .649       .630   .663   .670
ionosphere   .771   .853   .860      .787   .891   .835       .800   .881   .853
iris         .944   .916   1         .903   .894   .979       .931   .930   1
led          .690   .697   .703      .715   .735   .714       .683   .690   .682
led17        .549   .318   .554      .633   .492   .642       .541   .359   .534
liver        .593   .577   .597      .568   .537   .511       .652   .594   .651
lymph        .619   .749   .762      .626   .787   .770       .619   .738   .733
monk1        .681   .962   .946      .720   .751   .716       .740   .896   .854
monk2        .651   .519   .552      .679   .634   .666       .677   .671   .676
monk3        .965   .985   .987      .971   .960   .969       .967   .944   .974
tic          .689   .926   .836      .707   .709   .699       .745   .802   .757
vehicle      .552   .528   .601      .487   .445   .465       .554   .545   .601
kr-vs-kp     .806   .920   .893      .836   .857   .801       .841   .857   .880
waveform     .848   .768   .842      .838   .820   .838       .862   .853   .858
vowel        .914   .927   .920      .667   .538   .664       .780   .754   .789
hypothyroid  .954   .951   .954      .927   .850   .857       .952   .953   .954
average      .743   .753   .783      .733   .718   .735       .754   .746   .779

[Figure 3: three bar charts, a) 3NN, b) NB, c) C4.5; each shows, on a 0-12 scale, the counts of '– –', '–', '=', '+' and '++' outcomes for the comparisons LDA vs PCA, LDA+PCA vs LDA, and LDA+PCA vs PCA.]

Figure 3. Ranking of the FE techniques according to the results on 21 UCI datasets


Table 3. Comparison of accuracy results

             LDA vs PCA          LDA+PCA vs LDA      LDA+PCA vs PCA
Dataset      3NN   NB    C4.5    3NN   NB    C4.5    3NN   NB    C4.5
Balance      ++    ++    ++      – –   –     =       ++    ++    ++
Breast       ++    ++    ++      –     –     =       +     +     ++
Car          ++    ++    ++      –     –     =       +     ++    ++
Diabetes     – –   =     – –     =     =     ++      =     =     =
Glass        =     – –   – –     =     =     =       =     – –   – –
Heart        – –   – –   – –     ++    ++    ++      =     –     ++
Ionosphere   – –   – –   – –     ++    ++    ++      +     – –   – –
Iris Plants  ++    +     =       ++    ++    ++      ++    ++    ++
LED          =     –     –       +     =     =       =     –     –
LED17        ++    ++    –       =     +     –       ++    +     ++
Liver        ++    ++    ++      =     – –   =       ++    – –   ++
Lymph        – –   – –   – –     ++    ++    ++      +     –     –
Monk-1       – –   – –   – –     ++    =     ++      – –   – –   – –
Monk-2       ++    ++    +       – –   –     =       ++    ++    =
Monk-3       – –   +     ++      ++    =     +       =     +     ++
Tic          – –   =     – –     ++    –     +       – –   –     – –
Vehicle      ++    ++    +       ++    – –   ++      ++    – –   ++
Kr-vs-kp     – –   – –   – –     ++    – –   ++      – –   – –   ++
Waveform     ++    ++    +       =     =     =       ++    +     =
Vowel        –     ++    ++      =     =     +       =     ++    ++
Hypothyroid  =     ++    =       =     – –   =       =     +     =
++           9     10    6       9     4     8       7     5     11
+            0     2     3       1     1     3       4     5     0
=            3     2     2       7     7     9       7     1     4
–            1     1     2       2     5     1       0     3     2
– –          8     6     8       2     4     0       3     7     4

The effect of combining PCs with LDs on classification accuracy of each supervised

learner with regard to the use of only PCs or only LDs has been estimated with the help of

state transition diagrams. Each transition from a state to the same state (i.e. no transition or

a loop) gives no score to an estimate. Each transition from state “– –“ to “–”, from “–” to

“=”, from “=” to “+”, and from “+” to “++” gives score +1; correspondingly, transition

from state “++“ to “+”, from “+” to “=”, from “=” to “–”, and from “–” to “– –” gives

score -1. Each transition from “– –” to “=”, from “–” to “+”, and from “=” to “++” gives

score +2; correspondingly, each transition from “++” to “=”, from “+” to “–”, and from “=”

to “– –” gives score -2. Each transition from “– –” to “+” and from “–” to “++” gives score

+3; and from “++” to “–” and from “+” to “– –” gives score -3. Finally, each transition

from “– –” to “++” gives score +4, and from “++” to “– –” score -4 correspondingly. We

summarise this information in Table 4.


Thus, the effect of combining LDs with PCs is estimated according to the following formula:

effect = \sum_j score_j \cdot tr_j ,   (7)

where score_j is the score for the j-th type of transition from state I to state II and tr_j is the number of such transitions.

Table 4. Scores for transition from state I to state II

state I    – –   ++    – –   –     ++    +     – –   –     =     ++    +     =
state II   ++    – –   +     ++    –     – –   =     +     ++    =     –     – –
score      +4    –4    +3    +3    –3    –3    +2    +2    +2    –2    –2    –2

state I    – –   –     =     +     ++    +     =     –     ++    +     –     =
state II   –     =     +     ++    +     =     –     – –   ++    +     –     =
score      +1    +1    +1    +1    –1    –1    –1    –1    0     0     0     0
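Note that each score in Table 4 equals the signed number of steps between the two states in the ordered sequence '– –' < '–' < '=' < '+' < '++', so formula (7) can be computed directly from two per-dataset state lists. A small illustrative sketch (function and variable names are ours; ASCII '--'/'-' stand for the '– –'/'–' symbols):

```python
ORDER = ['--', '-', '=', '+', '++']          # outcome states, worst to best

def effect(states_I, states_II):
    """Sum of transition scores between two lists of per-dataset states."""
    # Each score in Table 4 equals the signed number of steps between states,
    # so score(I -> II) = position(II) - position(I).
    return sum(ORDER.index(b) - ORDER.index(a)
               for a, b in zip(states_I, states_II))

# e.g. effect(['--', '++', '='], ['+', '++', '--']) == 3 + 0 - 2 == 1
```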

We present the state transition diagrams for each classifier in Figure 4. On the left side we place the diagrams that show the gain for a classifier due to the use of PCs together with LDs for the construction of a new representation space instead of only PCs, and on the right side the gain for a classifier due to the use of LDs together with PCs instead of only LDs. In Table 5 we summarise the analysis of the diagrams, giving the corresponding scores of the effect of merging LDs with PCs for each classifier.

[Figure 4: six state transition diagrams (two per classifier). Each diagram shows how the per-dataset outcome states ('– –', '–', '=', '+', '++') change between comparison I and comparison II; the state counts correspond to the column totals of Table 3.]

Figure 4a (3NN): "LDA vs PCA" (I) to "LDA+PCA vs PCA" (II), effect = +11; "PCA vs LDA" (I) to "PCA+LDA vs LDA" (II), effect = +14.
Figure 4b (NB): "LDA vs PCA" (I) to "LDA+PCA vs PCA" (II), effect = -11; "PCA vs LDA" (I) to "PCA+LDA vs LDA" (II), effect = +5.
Figure 4c (C4.5): "LDA vs PCA" (I) to "LDA+PCA vs PCA" (II), effect = +15; "PCA vs LDA" (I) to "PCA+LDA vs LDA" (II), effect = +17.

Figure 4. State transition diagrams for 3NN (a), NB (b) and C4.5 (c) classifiers: from "LDA vs PCA" to "LDA+PCA vs PCA" (on the left side) and from "PCA vs LDA" to "PCA+LDA vs LDA" (on the right side)

Table 5. Effect of combining PCs with LDs according to the state transition diagrams

        LDA+PCA vs PCA         PCA+LDA vs LDA
        (wrt LDA vs PCA)       (wrt PCA vs LDA)
kNN     +11                    +14
NB      -11                    +5
C4.5    +15                    +17

In Table 6, accuracy results averaged over the 21 datasets are presented for each classifier. We also show the average increase of accuracy due to the use of both LDs and PCs with respect to PCA and with respect to LDA. It can be seen from the table that the combination approach outperforms the use of LDs or PCs alone for all three classifiers, although for NB the average gain over LDA is only marginal (0.2%).

Table 6. Accuracy results averaged over the 21 datasets

        PCA     LDA     LDA+PCA   LDA+PCA vs PCA   LDA+PCA vs LDA
kNN     .753    .743    .783      3.0%             4.0%
NB      .718    .733    .735      1.7%             0.2%
C4.5    .746    .754    .779      3.3%             2.5%

In Figure 5 we visualise these averaged accuracies with a histogram.

However, from the practical point of view we are especially interested in those cases where the use of both LDs and PCs together outperforms both a supervised learning algorithm that uses only LDs and the same learning algorithm that uses only PCs to construct a model (e.g. the Iris dataset with any classifier, or the Vehicle dataset for 3NN and C4.5). We present the corresponding results for 3NN, NB and C4.5 in Figure 6. A (significant) loss is counted if either PCA or LDA outperforms the combination approach, and a tie is counted if either PCA or LDA has a tie with the combination approach, even if the other approach loses (significantly or not) to the combination approach.


[Figure 5: histogram of the LDA, PCA and LDA+PCA accuracies for 3NN, NB and C4.5; y-axis from 0.70 to 0.80.]

Figure 5. The accuracy of classifiers averaged over 21 datasets.

[Figure 6: bar chart counting, on a 0-12 scale, the '– –', '–', '=', '+' and '++' outcomes of the combination approach for 3NN, NB and C4.5.]

Figure 6. When combination of PCs and LDs is practically useful.

It can be seen from the figure that, from the practical point of view, the use of both LDs and PCs for supervised learning is safer and more beneficial for C4.5 than for NB and 3NN; only for this learning algorithm is the number of wins similar to the number of losses. On the contrary, for 3NN there were two significant wins against five significant losses. For NB the situation is even worse, since there are only one significant and one non-significant win against seven non-significant and eight significant losses. Therefore the combination approach cannot be recommended for use with NB.

It is very interesting to analyse why the behaviour is so different for the considered classifiers. Our hypothesis for the better behaviour with C4.5 is that decision trees use inherent feature selection, and thus implicitly select those LDs and/or PCs that are useful for classification from the combined set of features, discarding the less relevant and duplicate ones. Moreover, this feature selection is local, based on the recursive partitioning principle of decision tree construction.

4. Conclusions

“The curse of dimensionality” is a serious problem for machine learning and data mining, especially with present-day multi-feature datasets. Classification accuracy decreases and processing time increases dramatically in high dimensions with literally any learning technique, as has been demonstrated in a number of experimental and analytical studies. FE is a common way to cope with this problem: before applying a learning algorithm, the space of instances is transformed into a new space of lower dimensionality, trying to preserve the distances among instances and the class separability.

A classical approach that takes class information into account is Fisher’s LDA, which tries to minimise the within-class covariance and to maximise the between-class covariance in the extracted features. This approach is well studied and commonly used; however, it has one serious drawback: it has a parametric nature, and it extracts no more than the number of classes minus one features. It works well with linearly separable classes, and often provides informative features for classification in more complex problems.


However, due to the limited number of extracted features, it often fails to provide reasonably good classification accuracy even with fairly simple datasets where the intrinsic dimensionality exceeds that number. A number of ways have been considered to solve this problem; many approaches suggest non-parametric variations of LDA, which lead to greater numbers of extracted features.

In this paper we consider an alternative way to improve LDA-based FE for

classification, which consists of combining the extracted LDs with a few PCs. Our

experiments with the combination of LDs with PCs have demonstrated that the

discriminating power of LDA features can be improved by PCs for many datasets and

learning algorithms. The best performance is exhibited with C4.5 decision trees. A possible

explanation for the good behaviour with C4.5 is that decision trees use implicit local

feature selection, and thus implicitly select LDs and/or PCs, useful for classification, from

the combined set of features, discarding the less relevant and duplicate ones.

This hypothesis, and the fact that every dataset needs different FE technique(s), suggest a direction for future work. It would be interesting to consider a combination of the presented approach with explicit automatic feature selection (filter- or wrapper-based), which could lead to an increase in accuracy and better FE through the selection of a set of appropriate features of different nature. Local feature selection might be especially useful.

Another important direction for future work is the evaluation of this technique with real-world multidimensional datasets of different nature, including images and texts.

Acknowledgments: We would like to thank the UCI ML repository of databases, domain theories and data generators for the datasets, and the WEKA ML library in Java for the source code used in this study. This research is partly supported by the COMAS Graduate School of the University of Jyväskylä, Finland. This material is also based upon work supported by Science Foundation Ireland under Grant No. S.F.I.-02IN.1I111.

References

[1] Aivazyan S.A., Applied statistics: classification and dimension reduction, Finance and

Statistics, Moscow, 1989.

[2] Aladjem M., Multiclass discriminant mappings, Signal Processing, 35, 1994, 1-18.

[3] Bellman R., Adaptive Control Processes: A Guided Tour, Princeton University Press,

1961.

[4] Blake C.L., Merz C.J., UCI Repository of Machine Learning Databases, Dept. of

Information and Computer Science, University of California, Irvine CA, 1998.

[5] Fayyad U.M., Data Mining and Knowledge Discovery: Making Sense Out of Data,

IEEE Expert, 11, 5, 1996, 20-25.

[6] Fukunaga K., Introduction to statistical pattern recognition, Academic Press, London

1990.

[7] Jolliffe I.T., Principal Component Analysis, Springer, New York, NY, 1986.

[8] Liu H., Feature Extraction, Construction and Selection: A Data Mining Perspective,

ISBN 0-7923-8196-3, Kluwer Academic Publishers, 1998.

[9] Michalski R.S., Seeking Knowledge in the Deluge of Facts, Fundamenta Informaticae,

30, 1997, 283-297.


[10]Oza N.C., Tumer K., Dimensionality Reduction Through Classifier Ensembles,

Technical Report NASA-ARC-IC-1999-124, Computational Sciences Division, NASA

Ames Research Center, Moffett Field, CA, 1999.

[11]Pechenizkiy M., Impact of the Feature Extraction on the Performance of a Classifier:

kNN, Naïve Bayes and C4.5, in: B.Kegl, G.Lapalme (eds.), Proc. of 18th CSCSI Conference on Artificial Intelligence AI’05, LNAI 3501, Springer Verlag, 2005, 268-

279.

[12]Pechenizkiy M., Tsymbal A., Puuronen S., Local Dimensionality Reduction within

Natural Clusters for Medical Data Analysis, in Proc. 18th IEEE Symp. on Computer-Based Medical Systems CBMS’2005, IEEE CS Press, 2005, 365-37.

[13]Tsymbal A., Pechenizkiy M., Puuronen S., Patterson D.W., Dynamic integration of

classifiers in the space of principal components, in: L.Kalinichenko, R.Manthey,

B.Thalheim, U.Wloka (eds.), Proc. Advances in Databases and Information Systems: 7th East-European Conf. ADBIS'03, LNCS 2798, Heidelberg: Springer-Verlag, 2003,

278-292.

[14]Tsymbal A., Puuronen S., Pechenizkiy M., Baumgarten M., Patterson D., Eigenvector-

based feature extraction for classification, in: Proc. 15th Int. FLAIRS Conference on Artificial Intelligence, Pensacola, FL, USA, AAAI Press, 2002, 354-358.

[15]Witten I. and Frank E., Data Mining: Practical machine learning tools with Java implementations, Morgan Kaufmann, San Francisco, 2000.

[16]Dillon W.R., Goldstein M., Multivariate Analysis: Methods and Applications, ISBN 0-471-08317-8, John Wiley & Sons, 1984.

[17]Quinlan, J.R., C4.5 programs for machine learning, San Mateo CA: Morgan

Kaufmann, 1993.


IV

THE IMPACT OF THE FEATURE EXTRACTION ON THE PERFORMANCE OF A CLASSIFIER:

KNN, NAÏVE BAYES AND C4.5

Pechenizkiy, M. 2005. In: B. Kegl & G. Lapalme (Eds.), Proceedings of 18th CSCSI Conference on Artificial Intelligence AI’05, LNAI 3501, Heidelberg: Springer Verlag, 268-279.


The Impact of Feature Extraction on the Performance of a Classifier: kNN, Naïve Bayes and C4.5

Mykola Pechenizkiy
Dept. of Computer Science and Information Systems, University of Jyväskylä, Jyväskylä, Finland
[email protected]

Abstract. “The curse of dimensionality” is pertinent to many learning algorithms, and it denotes the drastic rise of computational complexity and classification error in high dimensions. In this paper, different feature extraction techniques as means of (1) dimensionality reduction, and (2) constructive induction are analyzed with respect to the performance of a classifier. Three commonly used classifiers are taken for the analysis: kNN, Naïve Bayes and C4.5 decision tree. One of the main goals of this paper is to show the importance of the use of class information in feature extraction for classification, and the (in)appropriateness of random projection or conventional PCA for feature extraction for classification on some data sets. Two eigenvector-based approaches that take the class information into account are analyzed. The first approach is parametric and optimizes the ratio of the between-class variance to the within-class variance of the transformed data. The second approach is a nonparametric modification of the first one based on the local calculation of the between-class covariance matrix. In experiments on benchmark data sets these two approaches are compared with each other, with conventional PCA, with random projection and with plain classification without feature extraction for each classifier.

1 Introduction

Knowledge discovery in databases (KDD) is a combination of data warehousing, decision support, and data mining that indicates an innovative approach to information management. KDD is an emerging area that considers the process of finding previously unknown and potentially interesting patterns and relations in large databases [8]. Current electronic data repositories are growing quickly and contain huge amounts of data from commercial, scientific, and other domain areas. The capabilities for collecting and storing all kinds of data far exceed the abilities to analyze, summarize, and extract knowledge from this data. Numerous data mining techniques have recently been developed to extract knowledge from these large databases. Fayyad in [8] introduced KDD as “the nontrivial process of identifying valid, novel, potentially useful, and ultimately understandable patterns in data”. The process comprises several steps, which involve data selection, data pre-processing, data transformation, application of machine learning techniques, and the interpretation and evaluation of patterns.


In this paper we analyze the problems related to data transformation before applying certain machine learning techniques. In Section 2 the data transformation approaches are seen from two different perspectives. The first one is related to the so-called “curse of dimensionality” problem [5] and the necessity of dimensionality reduction [2]. The second perspective comes from the assumption that in many data sets to be processed some individual features, being irrelevant or indirectly relevant for the purpose of analysis, form a poor problem representation space. Corresponding ideas of constructive induction, which assume the improvement of the problem representation before the application of any learning technique, are presented.

Feature extraction (FE) for classification is aimed at finding a transformation of the original space that produces new features which preserve class separability as much as possible and form a new lower-dimensional problem representation space. Thus, FE accounts for both perspectives, and therefore we believe that FE, when applied either to data sets with high dimensionality or to data sets including indirectly relevant features, can improve the performance of a classifier.

We consider different types of feature extraction techniques for classification in Section 3, including Principal Component Analysis (PCA), Random Projection (RP) and two class-conditional approaches to FE. We conduct a number of experiments on 20 UCI datasets, analyzing the impact of these FE techniques on the classification performance of nearest neighbour classification, Naïve Bayes, and C4.5 decision tree learning. The results of these experiments are reported in Section 4. Finally, in Section 5 we briefly summarize the main conclusions and further research directions.

2 Poor Representation Spaces: “the Curse of Dimensionality” and Indirectly Relevant Features

In this section we present the two main reasons why data transformation might be an important step to undertake before a certain machine learning technique is applied. The first issue is related to the so-called “curse of dimensionality” and the necessity for dimensionality reduction. The second issue is related to the potentially poor representation of the problem in terms of irrelevant or indirectly relevant features that represent the data, and the corresponding necessity to improve that representation.

2.1 Dimensionality Reduction

In many real-world applications, numerous features are used in an attempt to ensure accurate classification. If all those features are used to build up classifiers, then they operate in high dimensions, and the learning process becomes computationally and analytically complicated, often resulting in a drastic rise in classification error. Hence, there is a need to reduce the dimensionality of the feature space before classification. According to the adopted strategy, dimensionality reduction techniques are divided into feature selection and feature transformation (also called feature discovery). The key difference between feature selection and feature transformation is that the first process selects a subset of the original features only, while the second is based on the generation of completely new features [15]. Feature extraction is a dimensionality reduction technique that extracts a subset of new features from the original set of features by means of some functional mapping, keeping as much information in the data as possible [10].

The essential drawback of all the methods that just assign weights to individual features is their insensitivity to interacting or correlated features. Also, in many cases some features are useful in one example set but useless or even misleading in another. That is why the transformation of the given representation before weighting the features can be preferable in such cases. However, feature extraction and subset selection are not, of course, totally independent processes, and they can be considered as different ways of task representation. The use of such techniques is determined by the purpose at hand, and, moreover, feature extraction and selection methods are sometimes combined in order to improve the solution.

2.2 Constructive Induction

Even if the dimensionality of the problem is relatively low, most inductive learning approaches assume that the features used to represent instances are sufficiently relevant. However, it has been shown experimentally that this assumption often does not hold for many learning problems. Some features may not be directly relevant, and some features may be redundant or irrelevant. Even those inductive learning approaches that apply feature selection techniques, and can thus eliminate irrelevant features and somehow account for the problem of high dimensionality, often fail to find a good representation of the data. This happens because many features in their original representation are weakly or indirectly relevant to the problem. The existence of such features usually requires the generation of new, more relevant features that are some functions of the original ones. Such functions may vary from very simple ones, such as a product or a sum of a subset of the original features, to very complex ones, such as a feature that reflects whether some geometrical primitive is present or absent in an instance. The discretization (quantization) of continuous features may serve for the abstraction of some features when a reduction of the range of possible values is desirable. The original representation space can thus be improved for learning by removing less relevant features, adding more relevant features and abstracting features. We consider the constructive induction approach with respect to classification.

Constructive induction (CI) is a learning process that consists of two intertwined phases, one of which is responsible for the construction of the “best” representation space, while the second concerns generating hypotheses in the found space [16]. In Figure 1 we can see two problems, with a) a high-quality and b) a low-quality representation space (RS). In a), the points marked by “+” are easily separated from the points marked by “–” using a straight line or a rectangular border. But in b), “+” and “–” are highly intermixed, which indicates the inadequacy of the original RS. A common approach is to search for complex boundaries to separate the classes. The constructive induction approach suggests searching for a better representation space


where the groups are better separated, as in c). However, in this paper the focus is on constructing new features from the original ones by means of some functional mapping, which is known as feature extraction. We consider FE from both perspectives: as a constructive induction technique and as a dimensionality reduction technique.

[Figure 1: three schematic scatter plots of “+” and “–” instances: a) high-quality RS, b) low-quality RS, c) improved RS due to CI.]

Fig. 1. High vs. low quality representation spaces (RS) for concept learning. Constructive induction (CI) aims to improve the quality of the low-quality RS [16].

3 Feature Extraction for Classification

Generally, feature extraction for classification can be seen as a search, among all possible transformations of the original feature set, for the best one, which preserves class separability as much as possible in the space with the lowest possible dimensionality [10]. In other words, we are interested in finding a projection w:

y = w^T x ,   (1)

where y is a k×1 transformed data point (presented using k features), w is a d×k transformation matrix, and x is a d×1 original data point (presented using d features).

3.1 PCA

Principal Component Analysis (PCA) is a classical statistical method, which extracts a lower dimensional space by analyzing the covariance structure of multivariate statistical observations [12].

The main idea behind PCA is to determine the features that explain as much of the total variation in the data as possible with as few of these features as possible. The computation of the PCA transformation matrix is based on the eigenvalue decomposition of the covariance matrix S and therefore is computationally rather expensive.

w \leftarrow eig\_decomposition\Big( S = \sum_{i=1}^{n} (x_i - m)(x_i - m)^T \Big) ,   (2)


where n is the number of instances, xi is the i-th instance, and m is the mean vector of the input data.

The computation of the principal components can be presented with the following algorithm:
1. Calculate the covariance matrix S from the input data.
2. Compute the eigenvalues and eigenvectors of S and sort them in descending order with respect to the eigenvalues.
3. Form the actual transition matrix by taking the predefined number of components (eigenvectors).
4. Finally, multiply the original feature space with the obtained transition matrix, which yields a lower-dimensional representation.
The necessary cumulative percentage of variance explained by the principal axes is commonly used as a threshold, which defines the number of components to be chosen.
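Purely as an illustration of steps 1-4 and the cumulative-variance threshold, a compact NumPy sketch (not the implementation used in this study) could look like this:

```python
import numpy as np

def pca_transform(X, var_threshold=0.85):
    """Steps 1-4 of the PCA algorithm with a cumulative-variance threshold."""
    m = X.mean(axis=0)
    S = np.cov(X - m, rowvar=False)              # step 1: covariance matrix
    vals, vecs = np.linalg.eigh(S)               # step 2: eigen decomposition
    vals, vecs = vals[::-1], vecs[:, ::-1]       # ... sorted in descending order
    explained = np.cumsum(vals) / vals.sum()
    k = int(np.searchsorted(explained, var_threshold)) + 1
    w = vecs[:, :k]                              # step 3: transition matrix
    return (X - m) @ w                           # step 4: y = w^T x, row-wise
```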

3.2 The Random Projection Approach

In many application areas like market basket analysis, text mining, image processing etc., dimensionality of data is so high that commonly used dimensionality reduction techniques like PCA are almost inapplicable because of extremely high computational time/cost.

Recent theoretical and experimental results on the use of random projection (RP) as a dimensionality reduction technique have attracted the attention of the DM community [6]. In RP a lower-dimensional projection is produced by means of a transformation like in PCA, but the transformation matrix is generated randomly (although often with certain constraints).

The theory behind RP is based on the Johnson and Lindenstrauss Theorem that says that any set of n points in a d-dimensional Euclidean space can be embedded into a k-dimensional Euclidean space – where k is logarithmic in n and independent of d – so that all pairwise distances are maintained within an arbitrarily small factor [1]. The basic idea is that the transformation matrix has to be orthogonal in order to protect data from significant distortions and try to preserve distances between the data points. Generally, orthogonalization of the transformation matrix is computationally expensive, however, Achlioptas showed a very easy way of defining (and also implementing and computing) the transformation matrix for RP [1]. So, according to [1] the transformation matrix w can be computed simply either as:

w_{ij} = \sqrt{3} \cdot \begin{cases} +1 & \text{with probability } 1/6 \\ 0 & \text{with probability } 2/3 \\ -1 & \text{with probability } 1/6 \end{cases} \quad \text{or} \quad w_{ij} = \begin{cases} +1 & \text{with probability } 1/2 \\ -1 & \text{with probability } 1/2 \end{cases}   (3)
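A sketch of drawing such a matrix with NumPy, covering both variants of Eq. (3) (the helper name is ours):

```python
import numpy as np

def rp_matrix(d, k, sparse=True, seed=0):
    """Random d-by-k transformation matrix w of Eq. (3) (Achlioptas [1])."""
    rng = np.random.default_rng(seed)
    if sparse:  # entries +1 / 0 / -1 with probabilities 1/6, 2/3, 1/6
        return np.sqrt(3) * rng.choice([1.0, 0.0, -1.0], size=(d, k),
                                       p=[1/6, 2/3, 1/6])
    return rng.choice([1.0, -1.0], size=(d, k))  # +1 / -1 with prob. 1/2 each

# A lower-dimensional projection is then simply y = X @ rp_matrix(X.shape[1], k)
```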

RP as a dimensionality reduction technique was experimentally analyzed on image (noisy and noiseless) and text data (a newsgroup corpus) by Bingham and Mannila in [6]. Their results demonstrate that RP preserves the similarity of data vectors rather well (even when data is projected onto relatively small numbers of dimensions).

Fradkin and Madigan in [9] performed experiments (on 5 different data sets) with RP and PCA for inductive supervised learning. Their results show that although PCA predictively outperformed RP, RP is a rather useful approach because of its computational advantages. The authors also indicated a trend in their results that the predictive performance of RP improves with increasing dimensionality when combined with the right learning algorithm. It was found that for those 5 data sets RP is better suited for nearest neighbour methods, where preserving the distances between data points is more important than preserving the informativeness of individual features, in contrast to decision tree approaches, where the importance of these factors is reversed. However, further experimentation was encouraged.

3.3 Class-conditional Eigenvector-based FE

In [17] it was shown that although PCA is the most popular feature extraction technique, it has a serious drawback, namely that conventional PCA gives high weights to features with higher variabilities, irrespective of whether they are useful for classification or not. This may give rise to a situation where the chosen principal component corresponds to an attribute with the highest variability but without any discriminating power.

A usual approach to overcome the above problem is to use some class separability criterion [3], e.g. the criteria defined in Fisher’s linear discriminant analysis and based on the family of functions of scatter matrices:

J(w) = \frac{w^T S_B w}{w^T S_W w} ,   (4)

where S_B in the parametric case is the between-class covariance matrix, which shows the scatter of the expected vectors around the mixture mean, and S_W is the within-class covariance matrix, which shows the scatter of samples around their respective class expected vectors.
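As an illustration, the parametric criterion (4) can be optimized by solving the generalized eigenproblem S_B w = λ S_W w, which is equivalent to the simultaneous diagonalization mentioned below; the following sketch (with a small regularization term added by us for numerical stability) is not the original implementation:

```python
import numpy as np
from scipy.linalg import eigh

def parametric_fe(X, y):
    """Directions maximizing J(w) = (w^T S_B w)/(w^T S_W w); at most classes-1."""
    m = X.mean(axis=0)
    d = X.shape[1]
    classes = np.unique(y)
    S_B, S_W = np.zeros((d, d)), np.zeros((d, d))
    for c in classes:
        Xc = X[y == c]
        mc = Xc.mean(axis=0)
        S_B += len(Xc) * np.outer(mc - m, mc - m)   # scatter of class means
        S_W += (Xc - mc).T @ (Xc - mc)              # scatter within classes
    S_W += 1e-6 * np.eye(d)                         # keep S_W positive definite
    vals, vecs = eigh(S_B, S_W)                     # generalized eigenproblem
    w = vecs[:, ::-1][:, :len(classes) - 1]         # top classes-1 directions
    return X @ w
```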

A number of other criteria were proposed in [10]. Both parametric and nonparametric approaches optimize criterion (4) by using the simultaneous diagonalization algorithm [10].

It should be noticed that there is a fundamental problem with the parametric nature of the covariance matrices. The rank of S_B is at most the number of classes minus one, and hence no more than this number of new features can be obtained.

The nonparametric method overcomes this problem by trying to increase the number of degrees of freedom in the between-class covariance matrix, measuring the between-class covariances on a local basis. The k-nearest neighbor (kNN) technique is used for this purpose.

A two-class nonparametric feature extraction method was considered in [10], and it is extended in [20] to the multiclass case. The algorithm for nonparametric feature extraction is the same as for parametric extraction. Simultaneous diagonalization is used as well, and the only difference is in calculating the between-class covariance matrix SB. In the nonparametric case the between-class covariance matrix is calculated as the scatter of the samples around the expected vectors of other classes’ instances in the neighborhood.
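A simplified sketch of this local calculation, for illustration only (it omits the weighting function used in the referenced method, and the neighbourhood handling is our simplification):

```python
import numpy as np

def nonparametric_S_B(X, y, k=5):
    """Between-class scatter measured locally: each sample against the mean of
    its k nearest neighbours belonging to the other classes."""
    d = X.shape[1]
    S_B = np.zeros((d, d))
    for i, x in enumerate(X):
        others = X[y != y[i]]                                # other-class samples
        nearest = np.argsort(((others - x) ** 2).sum(axis=1))[:k]
        m_local = others[nearest].mean(axis=0)               # local expected vector
        S_B += np.outer(x - m_local, x - m_local)
    return S_B
```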

A number of experimental studies, in which parametric and nonparametric class-conditional FE have been applied to kNN [20], to the dynamic integration of classifiers [19], and to data with a small sample size and a high number of features [11], have been carried out.

4 Experiments and Results

The experiments were conducted on 20 data sets with different characteristics taken from the UCI machine learning repository [7]. The main characteristics of the data sets are presented in the first four columns of Table 1, which includes the names of the data sets, the numbers of instances included in the data sets, the numbers of different classes of instances, and the numbers of different kinds of features (binarized categorical plus numerical) included in the instances. Each categorical feature was replaced with a redundant set of binary features, each corresponding to a value of the original feature.

In the experiments, the accuracies of 3-nearest neighbour classification (3NN), the Naïve-Bayes (NB) learning algorithm, and the C4.5 decision tree (C4.5) [18] were calculated. All of them are well known in the data mining and machine learning communities and represent three different approaches to learning from data. The main motivation to use 3 different kinds of classifiers is that we expect a different impact of FE on the representation space not only with respect to different data sets but also with respect to different classifiers. In particular, for kNN it is expected that FE can produce a better neighbourhood, for C4.5 better (more informative) individual features, and for Naïve Bayes uncorrelated features.

For each data set, 30 test runs of Monte-Carlo cross validation were made to evaluate classification accuracy with the four feature extraction approaches and without feature extraction. In each run, the data set is first split into a training set and a test set by stratified random sampling to keep the class distributions approximately the same. Each time, 30 percent of the instances of the data set are randomly assigned to the test set; the remaining 70 percent of instances form the training set, which is used for finding the feature-extraction transformation matrix w. The test environment was implemented within the WEKA framework (the machine learning library in Java) [21]. The classifiers from this library were used with their default settings.

For PCA we used a 0.85 variance threshold, and for RP we took the number of projected features equal to 75% of the original space. We took all the features extracted by parametric FE, as their number was always equal to the number of classes minus one.

The main results are presented in the last three columns of Table 1. Each cell contains an ordered list of five symbols from A to E, which code the different FE techniques: A is RP, B is PCA, C is PAR (parametric FE), D is NPAR (nonparametric FE), and E is Plain (the case when no FE technique has been applied). The first position is occupied by the symbol corresponding to the highest accuracy and the last one by the symbol corresponding to the lowest accuracy. A hyphen used instead of a comma between symbols denotes that the difference between the corresponding accuracies is less than 1%.

It can be seen from the table that for some data sets FE has no effect on, or even deteriorates, the classification accuracy compared to the plain case E. In particular, for 3NN this is the case for 9 data sets out of 20: Breast, Diabetes, Glass, Heart, Iris, Led, Monk-3, Thyroid, and Tic. For NB it is the case for 6 data sets out of 20: Diabetes, Heart, Iris, Lymph, Monk-3, and Zoo. And for C4.5 it is the case for 11 data sets out of 20: Car, Glass, Heart, Ionosphere, Led, Led17, Monk-1, Monk-3, Vehicle, Voting, and Zoo. It can also be seen that often different FE techniques are the best for different classifiers and for different data sets. Nevertheless, the class-conditional FE approaches, especially the nonparametric one, are most often the best compared to PCA or RP. On the other hand, it is necessary to point out that the parametric FE was very often the worst, and for 3NN and C4.5 the parametric FE was the worst technique more often than RP. Such results highlight the very unstable behaviour of parametric FE.

Thus, as could be expected, different FE techniques are often suited to different contexts, not only to different data sets but also to different classifiers.

Table 1 – Dataset characteristics and relative accuracy results of the 3NN, Naïve Bayes and C4.5 classifiers applied in the different data spaces produced by the corresponding FE techniques.

Dataset      inst   class  feat   3NN          NB           C4.5
Balance       625   3      20     C,E,D,B,A    C,E,D,B,A    C,D,E,A,B
Breast        286   2      38     E,B,D,C,A    C,D,B-E,A    C,B,D-E,A
Car          1728   4      21     D,C,E,B,A    D-C,E,B,A    E,D,C,B,A
Diabetes      768   2      8      E,D,B,A,C    D-E,A,B,C    D,A-E,B,C
Glass         214   6      9      E,B-D,A,C    D,B,E,A,C    C-E,D-B-A
Heart         270   2      13     E,A,D,B,C    E,A,D,B,C    E,A,D,B,C
Ionosphere    351   2      33     D,B,E,A,C    D,B,A,E,C    A-B-D-E,C
Iris Plants   150   3      4      A-B-D-E,C    E-A,D-C,B    A,E,B-D,C
LED           300   10     7      E,B-C-D,A    B,C,D,E,A    B-C-D-E,A
LED17         300   10     24     C,E,B-D,A    C,E,D,B,A    E,C,D,B,A
Liver         345   2      6      D,E,B,A,C    D,B,C,E,A    B,E,C,D,A
Lymph         148   4      36     B,D-E,C,A    E,B,D,C,A    B,E,D,C,A
Monk-1        432   2      15     D,E,B,A,C    D,B,C-E,A    E,D,A,B,C
Monk-2        432   2      15     D,E,C,B,A    D,C,A-B,E    D,E,B,C,A
Monk-3        432   2      15     D-E,B,C,A    C-D-E,B,A    E,D,A-B,C
Thyroid       215   3      5      E,A-B,D,C    E,A,D-B,C    B,E,A,C-D
Tic           958   2      27     B-E,D,A,C    B,D,C,A-E    B,E,D,A,C
Vehicle       846   4      18     D,E,A,B,C    D,C,B,A-E    E,D,A,B,C
Voting        435   2      48     A,B-D-E,C    D,A-B-E,C    E,D,A-B,C
Zoo           101   7      16     C,D,B-E,A    D-E,C,A-B    E,B-A,D,C

Figure 2 summarizes the ranking results of the FE techniques according to the classifiers' performance on the 20 UCI data sets. Each bar on the histograms shows how many times an FE technique was the 1st, 2nd, 3rd, 4th, or 5th among the 20 possible. The number of times a certain technique got 1st-5th place is not necessarily an integer, since there were draws between 2, 3, or 4 techniques; in such cases each technique gets a 1/2, 1/3 or 1/4 score correspondingly.

It can be seen from the figure that there are many common patterns in the behaviour of the techniques for the 3 different classifiers, yet there are some differences too. According to the ranking results, RP behaves very similarly with every classifier, PCA works better for C4.5, and the parametric FE is better suited for NB. The nonparametric FE is also better suited for NB; it is also good with 3NN, but less successful for C4.5.


[Figure 2: four bar charts, a) RP, b) PCA, c) Par, d) NPar; each shows, on a 0-12 scale, how many times the technique ranked 1st-5th for kNN, NB and C4.5.]

Fig. 2. Ranking of the FE techniques according to the results on 20 UCI data sets

In Table 2, accuracy results averaged over the 20 data sets are presented for each classifier. It can be seen from the table that among the FE techniques the nonparametric approach is always the best on average for each classifier, the second best is PCA, then the parametric FE, and, finally, RP shows the worst results. Classification in the original space (Plain) was almost as good as in the space of extracted features produced by the nonparametric approach when kNN is used. However, when NB is used, the Plain accuracy is significantly lower compared to the situation where the nonparametric FE is applied. Still, this accuracy is as good as when PCA is applied, and significantly higher than when RP or the parametric FE is applied. For C4.5 the situation is again different: Plain classification is the best option on average. With respect to RP our results differ from the conclusions made in [9], where RP was found to be better suited for nearest neighbour methods and less satisfactory for decision trees (according to the results on 5 data sets). We can see from the two last columns of Table 2 that on average RP indeed suits kNN better than C4.5 (Plain-RP), although the difference is only 0.5%. However, if we take PCA into consideration (PCA-RP), as RP is often seen as an alternative to PCA in this context, we can see that RP in fact produces new spaces that are better suited for C4.5 than for kNN (1.6%). It is also interesting to analyse these differences for NB. It can be seen that for PCA-RP the difference is the greatest, while


for Plain-RP the difference is the least. We believe that this is due to the poor performance of NB compared to kNN and C4.5.

Table 2. Accuracy results averaged over the 20 data sets

        RP      PCA     PAR     NPAR    Plain   PCA-RP (%)   Plain-RP (%)
kNN     .725    .774    .733    .808    .804    4.9          7.9
NB      .704    .767    .749    .799    .769    6.3          6.4
C4.5    .741    .775    .761    .806    .825    3.3          8.4

In Figure 3 the decrease/increase of accuracy due to the use of the FE techniques, averaged over the 20 data sets, is presented. It can be seen from the figure that on average, for the 20 data sets analyzed, FE has no effect on, or deteriorates, the classification accuracy. The only exception is the combination of the nonparametric FE and the NB classifier, for which the averaged accuracy increases by 3%. It can also be seen that the nonparametric FE is the best among the considered approaches from the accuracy perspective.

[Figure 3: bar chart of the average accuracy change (from -5% to +10%) due to RP, PCA, PAR and NPAR for kNN, NB and C4.5.]

Fig. 3. The decrease/increase of accuracy due to the use of FE techniques, averaged over 20 data sets.

5 Conclusions

FE techniques are powerful tools that can significantly increase classification accuracy by producing better representation spaces or resolving the problem of “the curse of dimensionality”. However, when applied blindly, FE may have no effect on the subsequent classification or may even deteriorate the classification accuracy. Moreover, it is also possible that some data sets are so easy to learn that classification without any FE already achieves the maximal possible accuracy, and it is therefore hard to get any improvement from FE.

In this paper, the experimental results show that for many data sets FE does increase classification accuracy.

We could see from the results that there is no single best FE technique among the considered ones, and it is hard to say which one is best for a certain classifier and/or a certain problem; however, according to the experimental results some major trends can be recognized.


Class-conditional approaches (and especially the nonparametric approach) were often the best ones. This indicates how important it is to take class information into account and not to rely only on the distribution of variance in the data. At the same time it is important to notice that the parametric FE was very often the worst, and for 3NN and C4.5 the parametric FE was the worst more often than RP. Such results highlight the very unstable behaviour of parametric FE.

One possibility to improve the parametric FE, we think, is to combine it with PCA or with a feature selection approach, so that a few principal components, or the features most useful for classification, are added to those extracted by the parametric approach.

Although it is logical to assume that RP should be more successful in applications where the distances between the original data points are meaningful and/or for learning algorithms that use distances between the data points, our results show that this is not necessarily true. However, the data sets in our experiments have 48 features at most, and RP is usually applied to problems with much higher dimensionality.

The time taken to build classification models with and without FE is not reported in this study; neither do we present the analysis of the number of features extracted by each FE technique. These important issues will be presented in our further study.

A volume of empirical (and theoretical) findings has been accumulated, and some trends and dependencies with respect to data set characteristics and the use of FE techniques have been discovered or can be discovered. In particular, it has been shown that FE techniques are beneficial for data sets with highly correlated features [12]. The nonparametric FE was found to be successful especially for data sets with very limited sample sizes [11]. On the other hand, there are certain assumptions on the performance of classification algorithms under certain conditions.

Thus, the adaptive selection of the most suitable data mining techniques for the data set under consideration (which is a really challenging problem) might potentially be possible. We see our further research efforts in this direction.

We would also like to emphasize the possibility of conducting experiments on synthetically generated data sets, which is beneficial from two perspectives. First, it allows generating, testing and validating hypotheses on data mining strategy selection with respect to the data set at hand under controlled settings, where some data characteristics are varied while the others are held unchanged.

Acknowledgments: This research is partly supported by the COMAS Graduate School of the University of Jyväskylä, Finland. I would like to thank Dr. Alexey Tsymbal and Prof. Seppo Puuronen for their valuable comments and suggestions to the paper.

References

1. Achlioptas, D. Database-friendly random projections. In: Proc. 20th ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems, Santa Barbara, California, ACM Press (2001)
2. Aivazyan, S.A. Applied Statistics: Classification and Dimension Reduction. Finance and Statistics, Moscow (1989)
3. Aladjem, M. Multiclass discriminant mappings. Signal Processing, Vol. 35 (1994) 1-18
4. Aladjem, M. Parametric and nonparametric linear mappings of multidimensional data. Pattern Recognition, Vol. 24(6) (1991) 543-553
5. Bellman, R. Adaptive Control Processes: A Guided Tour. Princeton University Press (1961)
6. Bingham, E., Mannila, H. Random projection in dimensionality reduction: applications to image and text data. In: Proc. 7th ACM SIGKDD Int. Conf. on Knowledge Discovery and Data Mining, ACM Press, San Francisco, California (2001)
7. Blake, C.L., Merz, C.J. UCI Repository of Machine Learning Databases. Dept. of Information and Computer Science, University of California, Irvine, CA (1998)
8. Fayyad, U.M. Data Mining and Knowledge Discovery: Making Sense Out of Data. IEEE Expert, Vol. 11(5) (1996) 20-25
9. Fradkin, D., Madigan, D. Experiments with random projections for machine learning. In: Proc. 9th ACM SIGKDD Int. Conf. on Knowledge Discovery and Data Mining, ACM Press, Washington, D.C. (2003)
10. Fukunaga, K. Introduction to Statistical Pattern Recognition. Academic Press, London (1990)
11. Jimenez, L., Landgrebe, D. High Dimensional Feature Reduction via Projection Pursuit. PhD Thesis and School of Electrical & Computer Engineering Technical Report TR-ECE 96-5 (1995)
12. Jolliffe, I.T. Principal Component Analysis. Springer, New York, NY (1986)
13. Kiang, M. A comparative assessment of classification methods. Decision Support Systems, Vol. 35 (2003) 441-454
14. Kohavi, R., Sommerfield, D., Dougherty, J. Data mining using MLC++: a machine learning library in C++. In: Tools with Artificial Intelligence, IEEE CS Press (1996) 234-245
15. Liu, H. Feature Extraction, Construction and Selection: A Data Mining Perspective. ISBN 0-7923-8196-3, Kluwer Academic Publishers (1998)
16. Michalski, R.S. Seeking Knowledge in the Deluge of Facts. Fundamenta Informaticae, Vol. 30 (1997) 283-297
17. Oza, N.C., Tumer, K. Dimensionality Reduction Through Classifier Ensembles. Technical Report NASA-ARC-IC-1999-124, Computational Sciences Division, NASA Ames Research Center, Moffett Field, CA (1999)
18. Quinlan, J.R. C4.5 Programs for Machine Learning. San Mateo, CA: Morgan Kaufmann (1993)
19. Tsymbal, A., Pechenizkiy, M., Puuronen, S., Patterson, D.W. Dynamic integration of classifiers in the space of principal components. In: Kalinichenko, L., Manthey, R., Thalheim, B., Wloka, U. (Eds.), Proc. Advances in Databases and Information Systems: 7th East-European Conf. ADBIS'03, Lecture Notes in Computer Science, Vol. 2798, Heidelberg: Springer-Verlag (2003) 278-292
20. Tsymbal, A., Puuronen, S., Pechenizkiy, M., Baumgarten, M., Patterson, D. Eigenvector-based feature extraction for classification. In: Proc. 15th Int. FLAIRS Conference on Artificial Intelligence, Pensacola, FL, USA, AAAI Press (2002) 354-358
21. Witten, I., Frank, E. Data Mining: Practical Machine Learning Tools with Java Implementations. Morgan Kaufmann, San Francisco (2000)


V

SUPERVISED LEARNING AND LOCAL DIMENSIONALITY REDUCTION WITHIN NATURAL

CLUSTERS: BIOMEDICAL DATA ANALYSIS

Pechenizkiy, M., Tsymbal, A. & Puuronen, S. 2005. Manuscript submitted to IEEE Transactions on Information Technology in Biomedicine, Special Post-conference Issue "Mining Biomedical Data" (as extended version of Pechenizkiy et al., 2005c).



Abstract— Inductive learning systems have been successfully applied in a number of medical domains. Nevertheless, the effective use of these systems requires data preprocessing before applying a learning algorithm. This is especially important for multidimensional heterogeneous data presented by a large number of features of different types. Dimensionality reduction is one commonly applied approach. The goal of this paper is to study the impact of natural clustering on dimensionality reduction for supervised learning in the nosocomial infections domain. We compare several data mining strategies that apply dimensionality reduction by means of feature extraction or feature selection for subsequent supervised learning on real microbiological data. The results of our study show that local dimensionality reduction within natural clusters results in better feature spaces for supervised learning in comparison with the global search on the whole data for the best feature space.

Index Terms— 1,2,3,4

I. INTRODUCTION

Current electronic data repositories, especially in medical domains, contain huge amounts of data, including currently unknown and potentially interesting patterns and relations that can be found using knowledge discovery and data mining (DM) methods [5]. Inductive (supervised) learning systems have been successfully applied in a number of medical domains, for example, in the localization of a primary tumor, prognostics of recurrence of breast cancer, diagnosis of thyroid diseases, and rheumatology [10].

However, researchers and practitioners realize that the

effective use of these learning systems requires data

preprocessing prior to model induction. This is especially

important for multidimensional heterogeneous data, presented

by a large number of features of different types. The so-called

Manuscript received August 30, 2005. This work was partly supported by

COMAS Graduate School of the University of Jyväskylä, Finland and the

Science Foundation Ireland under Grant No. S.F.I.-02IN.1I111.

M. Pechenizkiy is with the Department of Computer Science and

Information Systems, University of Jyväskylä, P.O. Box 35, 40351 Jyväskylä,

Finland, (phone: +358-14-2602472; fax: +358-14-2603011; e-mail:

[email protected]).

A. Tsymbal is with the Department of Computer Science, Trinity College

Dublin, Ireland (e-mail: [email protected]).

S. Puuronen is with the Department of Computer Science and Information

Systems, University of Jyväskylä, Finland, (e-mail: [email protected]).

“curse of dimensionality” [2], pertinent to many learning algorithms, denotes the drastic rise of computational complexity and classification error on data having a large number of features. Hence, the dimensionality of the feature space is often reduced before supervised learning is undertaken. Generally, dimensionality reduction (DR) is only one effective approach to data reduction among others like instance selection or data selection [12]. We see the goal of DR as: (1) reducing the quantity of data with a focus on relevant data, and (2) improving the quality of data and/or its representation for a supervised learning method. Consequently, achievement of these goals results in a reduced amount of data, relevance of this reduced data to the domain and the supervised learning method applied, and, finally, improvement of the performance of the learning method.

There are a number of DR techniques; according to the adopted reduction strategy they are usually divided into feature selection (FS) and feature extraction (FE) (also called feature discovery) approaches [12]. The key difference between FS and FE is that the former selects only a subset of the original features, while the latter generates a completely new feature space through a functional mapping, retaining in fewer dimensions as much information about the data as possible [12]. Many FS techniques are sensitive to interacting or correlated features; that is why a transformation of the given representation might be preferable.

For some problem domains a feature subset may be useful in one part of the instance space, and at the same time useless or even misleading in another part of it. Therefore, it may be difficult or even impossible in some datasets to remove irrelevant and/or redundant features and leave only useful ones by means of global FS. However, if it is possible to find local homogeneous regions of heterogeneous data, then there are more chances to apply FS successfully (individually to each region). For FE, the decision whether to proceed globally over the entire instance space or locally on different parts of the instance space is also one of the key issues. It can be seen that despite being globally high dimensional and sparse, data distributions in some domain areas are locally low dimensional and dense, for example in physical movement systems [19].

One possible approach for local FS or local FE would be clustering (grouping) of the whole data set into smaller regions. Generally, different clustering techniques can be used for this purpose, for example the k-means or EM techniques [20]. However, in this paper we emphasize the possibility to apply natural clustering aimed at using contextual features for splitting the whole heterogeneous data space into more homogeneous clusters. Usually, features that are not useful for classification alone but are useful in combination with other (context-sensitive) features are called contextual (or environmental) features [18].

In this paper we apply our natural clustering approach to real clinical data, trying to construct local models that would help in the better prediction of antibiotic resistance and in understanding its development.

In our experimental study we apply k-nearest neighbor classification (kNN) to build antibiotic sensitivity prediction models. We apply the principle of natural clustering, grouping the instances into partitions related to certain pathogen types. We apply three different wrapper-based sequential FS techniques and three eigenvector-based FE techniques globally and locally, and analyze their impact on the performance of the kNN classifier.

The paper is organized as follows. In Section 2 we briefly consider the dimensionality reduction techniques used in the study. In Section 3 the data used in our experiments and its nature are described. In Section 4 we describe how the natural clustering approach was applied to our data. In Section 5 we present the results of experiments with the FE and FS techniques applied globally to the whole data set and locally in clusters for further classification. Finally, in Section 6 we briefly conclude with a summary and present directions of further research.

II. DIMENSIONALITY REDUCTION TECHNIQUES USED IN THE STUDY

A. Feature Extraction Techniques

Principal Component Analysis (PCA) is one of the most commonly used FE techniques. It is based on extracting the axes on which the data shows the highest variability [9]. Although PCA “spreads out” the data in the new basis (the new extracted axes) and can be of great help in unsupervised learning, there is no guarantee that the new axes are consistent with the discriminatory features in a classification problem.

Another approach is to account for class information during the FE process. One technique is to use some class separability criterion (for example, from Fisher’s linear discriminant analysis), based on a family of functions of scatter matrices: the within-class covariance, the between-class covariance, and the total covariance matrices. Parametric and nonparametric eigenvector-based approaches that use the within- and between-class covariance matrices, thus taking class information into account, have been analyzed and compared [17]. Both the parametric and nonparametric approaches use the simultaneous diagonalization algorithm to optimize the relation between the within- and between-class covariance matrices. The difference between the approaches is in the calculation of the between-class covariance matrix. The parametric approach accounts for one mean per class and one total mean, and therefore may extract at most number_of_classes − 1 features. The nonparametric method tries to increase the number of degrees of freedom in the between-class covariance matrix, measuring the between-class covariances on a local basis. Our previous experiments with the parametric and nonparametric FE approaches show that nonparametric FE is often more robust to different dataset characteristics and often results in higher classification accuracy of such basic supervised learning techniques as Naïve Bayes, C4.5 and kNN compared to parametric FE [13].
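For reference, the class separability criterion mentioned above is commonly written as follows. This is a standard formulation from Fisher-style discriminant analysis; the notation is introduced here for illustration, as the paper itself does not spell it out:

```latex
% Class separability criterion: find the transformation w that maximizes
% between-class scatter S_B relative to within-class scatter S_W.
J(w) = \frac{w^{T} S_{B} w}{w^{T} S_{W} w}, \qquad
S_{B} = \sum_{i=1}^{c} n_{i}\,(m_{i} - m)(m_{i} - m)^{T}, \qquad
S_{W} = \sum_{i=1}^{c} \sum_{x \in X_{i}} (x - m_{i})(x - m_{i})^{T}
```

where c is the number of classes, X_i, n_i and m_i are the instances, the instance count and the mean of class i, and m is the mixture mean. In the parametric case S_B is a sum of c rank-one terms constrained by the overall mean, so its rank is at most c − 1, which is exactly why the parametric approach can extract at most number_of_classes − 1 features.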

B. Feature Selection Techniques

Greedy hill climbing is one of the simplest search strategies that consider sequential changes to the current feature subset [4]; often, it is just the addition or deletion of a single feature from the subset at a time. We selected the most commonly used sequential strategies for FS [1]: forward feature selection (FFS), backward feature elimination (BFE), and bidirectional search (BS). The first strategy starts with no features and successively adds new ones. On the contrary, the second one begins with all the features and step-wisely deletes them one by one. Bidirectional search proceeds in both the forward and backward directions in turn.

Search algorithms that implement these strategies may consider all possible changes to the current subset and then select the best, or may simply choose the first change that improves the merit of the current feature subset. In either case, once a change is accepted, it is never reconsidered (hence the name “greedy”). The FS process can stop adding/deleting features when none of the evaluated subsets improves on the previous result or, alternatively, the search can continue to produce and evaluate new feature subsets while the result does not start to degrade. The evaluation of selected feature subsets in our study was based on the wrapper paradigm, which assumes interaction between the FS process and the classification model [8].
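To make the greedy wrapper search concrete, below is a minimal, self-contained Java sketch of forward feature selection. The SubsetEvaluator interface is a hypothetical stand-in for any merit estimate (e.g. the cross-validated accuracy of the classifier on the candidate subset); it is not part of WEKA's API:

```java
import java.util.ArrayList;
import java.util.List;

/** Estimates the merit (e.g. cross-validated accuracy) of a feature subset. */
interface SubsetEvaluator {
    double merit(List<Integer> featureSubset);
}

/** Greedy forward feature selection in the wrapper setting. */
public class ForwardFeatureSelection {

    /** Returns the indices of the selected features out of numFeatures candidates. */
    public static List<Integer> select(int numFeatures, SubsetEvaluator eval) {
        List<Integer> selected = new ArrayList<>();
        double bestMerit = eval.merit(selected);       // merit of the empty subset
        while (true) {
            int bestFeature = -1;
            // Consider every single-feature addition to the current subset.
            for (int f = 0; f < numFeatures; f++) {
                if (selected.contains(f)) continue;
                selected.add(f);
                double m = eval.merit(selected);
                selected.remove(Integer.valueOf(f));
                if (m > bestMerit) { bestMerit = m; bestFeature = f; }
            }
            if (bestFeature == -1) break;              // no addition improves: stop (greedy)
            selected.add(bestFeature);
        }
        return selected;
    }
}
```

Backward elimination is symmetric (it starts from the full set and tries single deletions), and bidirectional search alternates between the two kinds of move.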

III. DOMAIN: NOSOCOMIAL INFECTIONS

A. Nosocomial Infections

Nosocomial infections and antibiotic resistance (AR) are highly important problems that impact the morbidity and mortality of hospitalized patients as well as their cost of care. It is known that 3 to 40 percent of patients admitted to hospital acquire an infection during their stay, and that the risk of hospital-acquired, or nosocomial, infection has risen steadily in recent decades. Formally, nosocomial infections are defined as infections arising after 48 hours of hospital admission. Infections arising earlier are assumed to have arisen prior to admission, though this is not always true [7]. The frequency of nosocomial infections depends mostly on the type of conducted operation, being greater for “dirty” operations (10-40%) and smaller for “pure” operations (3-7%). For example, such a serious infectious disease as meningitis is often the result of nosocomial infection.

Analysis of microbiological data included in antibiograms collected in different institutions over different periods of time is considered one of the most important activities to restrain the spreading of AR and to avoid the negative consequences of this phenomenon. Traditional hospital infection control surveillance and localization of hospital infection often rely on the manual review of suspected cases of nosocomial infections and the tabulation of basic summary statistics, which requires considerable time and resources, and the produced measures and patterns are often not up-to-date. Advanced computer-based analysis methods might help to discover more potentially useful patterns more efficiently [3].

It has been widely recognized lately that sophisticated, active, and timely intra-hospital surveillance is needed. Computer-assisted infection control surveillance research has focused on identifying high-risk patients, the use of expert systems to identify possible cases of nosocomial infection, and the detection of deviations in the occurrence of predefined events [6].

Nosocomial infections are the inevitable consequence of long treatment, especially in Intensive Care Units (ICUs). The first step in the development of a nosocomial infection is the colonization of skin and mucous tunic by hospital microorganism cultures. The peculiarity of these cultures is the acquisition of unpredictable AR according to the policy of the use of antimicrobial medications in the given department or institution.

Treatment of nosocomial infections normally starts with a microbiological investigation. In this investigation pathogens are isolated, and for each isolated bacterium an antibiogram is built, representing the bacterium’s resistance to a series of antibiotics. The user of the test system can define the set of antibiotics used to test bacterial resistance. The result of the test is presented as an antibiogram, that is, a vector of (antibiotic, resistance) couples. The information included in this antibiogram is used to prescribe an antibiotic with a desired level of resistance for the isolated pathogen.

The antibiogram is not uniquely identified given a bacterium species; it can vary significantly for bacteria of the same species. This is due to the fact that bacteria of the same species may have evolved differently and developed different resistances to antibiotics. However, very often groups of antibiotics show similar resistance when tested on a given bacterium species, despite its strains [6].

AR is an especially difficult problem for nosocomial infections in hospitals because they attack critically ill patients who are more vulnerable to infections than the general population and therefore require more antibiotics. Heavy use of antibiotics in these patients hastens the mutations in bacteria that bring about drug resistance [16]. According to the Centers for Disease Control and Prevention (CDC) statistics, more than 70 percent of the bacteria that cause hospital-acquired infections are resistant to at least one of the antibiotics most commonly used to treat infections. Persons infected with drug-resistant organisms are more likely to have longer hospital stays and require treatment with second- or third-choice drugs that may be less effective, more toxic, and more expensive [16]. In short, antimicrobial resistance drives up health care costs, increases the severity of disease, and increases the death rates of some infections.

B. Source and Nature of Data

The data used in our analysis were collected in the Hospital of the N.N. Burdenko Institute of Neurosurgery using the “Vitek-60” analyzer (developed by bioMérieux) over the years 1997-2004, and the information systems “Microbiologist” (developed by the Medical Informatics Laboratory of the institute) and “Microbe” (developed by the Russian company “MedProject-3”).

In our previous pilot many-sided analysis of this data we applied a number of different DM techniques, trying to build an accurate predictive model, to explore and understand our data, and to find valuable association rules [14].

C. Data Organization

Each instance of the data used in the analysis represents one sensitivity test and contains the following features (Table 1): the pathogen that is isolated during the microbe identification analysis, the antibiotic that is used in the sensitivity test, and the result of the sensitivity test (sensitive S, resistant R, or intermediate I), obtained from “Vitek” according to the guidelines of the National Committee for Clinical Laboratory Standards (NCCLS) [6].

The information about a sensitivity test is related to a patient, his/her demographic data (sex, age) and hospitalization in the Institute (main department, whether the test was taken while the patient was in the ICU, days spent in the hospital before, etc.). Each instance of a microbiological test in the database corresponds to a single specimen, which may be blood, liquor, urine, etc. In this study we focus on the analysis of meningitis cases only, and the specimen is liquor.

For the purposes of this analysis we picked all 4430 instances of sensitivity tests related to the meningitis cases of the period January 2002 – July 2004. Groupings of binary features for pathogens and antibiotics were introduced so that 17 pathogens and 39 antibiotics were combined into 6 and 15 groups, respectively.

Thus, each instance in our data has 30 features (besides the ID-like attributes for records, patients, antibiotics, pathogens and so on), which include information corresponding to a single sensitivity test augmented with data concerning the type of the antibiotic used and the isolated pathogen, the clinical features of the patient and his/her demographics, and the microbiology test result as the class attribute (Table 1).

IV. NATURAL CLUSTERING APPROACH

We apply so-called natural clustering, i.e., clustering based on the knowledge of domain area experts. The main reason to do so was our belief (supported by expert opinion and a pilot exploratory many-sided data analysis) that the patterns of AR have different natures and, therefore, their behavior might be different for different natural clusters. Thus, we were interested in how the accuracy of classification models varies from one cluster to another, and whether it is possible to achieve better accuracy by applying a classifier locally in each cluster instead of using global classification.

The data for our analysis are relatively high dimensional and heterogeneous; the heterogeneity is introduced by a number of contextual (environmental) features. Semantically, the sensitivity concept is related first of all to the pathogen and antibiotic concepts. For our study, the binary features that describe the pathogen grouping were selected as prior environmental features, and they were used for hierarchical natural clustering (the hierarchy was introduced by the grouping of the features).

In our database, the whole data set can be divided into two nearly equal-sized natural clusters: gram+ and gram–. The gram+ cluster consists of the staphylococcus and enterococcus clusters, and the gram– cluster consists of the enterobacteria and nonfermenters clusters. This natural clustering is depicted in Figure 1.

We can see from the figure that the clusters gram– and gram+ are approximately of the same size. The further division of the gram– cluster results in less balanced clusters, while the further division of the gram+ cluster results in clusters that are highly unbalanced by the number of instances in each.
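Operationally, natural clustering amounts to partitioning the data on the values of the chosen contextual features. A minimal Java sketch of this step is given below; the representation of an instance as a string array and the use of a single contextual column are illustrative simplifications, not the implementation used in the experiments:

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Splits instances into natural clusters by a contextual (environmental) feature. */
public class NaturalClustering {

    /** values[contextIndex] holds the contextual feature, e.g. "gram+" or "gram-". */
    public static Map<String, List<String[]>> partition(List<String[]> data,
                                                        int contextIndex) {
        Map<String, List<String[]>> clusters = new HashMap<>();
        for (String[] instance : data) {
            String context = instance[contextIndex];
            clusters.computeIfAbsent(context, k -> new ArrayList<>()).add(instance);
        }
        return clusters;   // one local data set per natural cluster
    }
}
```

A local DR technique and a local classifier are then trained on each of the returned partitions; for the hierarchy of Figure 1, the same step is simply repeated inside the gram+ and gram– partitions.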

In our many-sided exploratory analysis of this data, we applied several classifiers to the whole dataset and individually to the gram– and gram+ clusters. The results of our experiments demonstrated that, for example, for the Naïve Bayes and C4.5 classifiers the differences in accuracy between the clusters were quite big, and at the same time the average accuracy over the clusters was always higher than the corresponding global accuracy achieved with the whole dataset. However, with instance-based classifiers (like the kNN classifier) the difference between the average of the local accuracies and the global accuracy was insignificant for the division of the dataset into gram– and gram+ clusters.

We also applied the same classification algorithms to one antibiotic group consisting of three subgroups. The experimental results demonstrated that the average accuracy of the classifiers (except C4.5) over the subgroups (i.e., when they are applied locally within each cluster) is higher in comparison with the accuracy of the global classifiers.

These results motivated us to continue the study of our natural clustering approach with regard to applying different DR techniques globally and locally for the further construction of local classification models.

V. EXPERIMENTAL STUDY

A. Experiment Design

In Figure 2 we present the main idea behind our experimental setup.

In our experiments we aimed at checking three major hypotheses:

(1) whether natural clustering is an efficient approach for the construction of local models;

(2) whether dimensionality reduction within natural clusters produces better representation spaces for further supervised learning in comparison with global dimensionality reduction;

(3) whether feature extraction techniques behave differently globally and within natural clusters in comparison with feature selection approaches with regard to their effect on generalization accuracy.

For these purposes we collected the accuracies for classification with and without DR on the whole data set, and on each natural cluster, likewise with and without DR.

In our experimental studies we used an instance-based classifier (kNN) and the FFS, BFE, and BS feature selection techniques available in the machine learning library with Java implementation “WEKA 3.4.2” [20]. We used the conventional PCA and the class-conditional parametric (Par) and nonparametric (NPar) FE techniques [13], which we implemented within the same library. We used k = 7 and the inverse distance for kNN, since these parameters were found to be the best combination for our data in our pilot studies. The threshold for the variance covered was set to 95% for each FE technique, and for NPar we used the parameter settings kNN = 7 and alpha = 1.

TABLE I. DATASET FEATURES

Patient and hospitalization related:
  sex                                         {Male, Female}
  age                                         [0;72], mean 29.8
  recurring stay                              {True, False}
  days of stay in NSI before test             [0;317], mean 87.5
  days of stay in ICU                         [0;237], mean 34
  days of stay in NSI before specimen
  was received                                [0;169], mean 31.6
  bacterium is isolated when patient in ICU   {True, False}
  main department                             {0,…,9}
  department of stay                          {0,…,11}

Pathogen and pathogen groups:
  pathogen name                               {Pat_name1, …, Pat_name17}
  group1 … group6                             {True, False}

Antibiotic and antibiotic groups:
  antibiotic name                             {Ant_name1, …, Ant_name39}
  group1 … group15                            {True, False}

Class attribute:
  sensitivity                                 {Sensitive, Intermediate, Resistant}

Major dataset characteristics: 4430 instances, 3 classes, 30 features (4 numerical, 2 categorical and 24 binary). After binarization of the categorical features we got 48 features, of which 4 are numerical and 44 binary.

FIGURE 1. NATURAL CLUSTERING OF DATA WITH REGARD TO PATHOGENS

AR data (4430 instances)
  gram+ (2134): staphylococcus (2013), enterococcus (121)
  gram– (2296): enterobacteria (783), nonfermenters (1513)

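As a concrete illustration of this classifier setup, the following sketch assumes the standard WEKA 3.4 API for the IBk (kNN) classifier; the ARFF file name is a placeholder, and the FS/FE step that precedes classification is omitted:

```java
import java.io.BufferedReader;
import java.io.FileReader;
import weka.classifiers.lazy.IBk;
import weka.core.Instances;
import weka.core.SelectedTag;

public class SevenNNSetup {
    public static void main(String[] args) throws Exception {
        // Load one cluster's data (file name is a placeholder).
        Instances data = new Instances(new BufferedReader(new FileReader("cluster.arff")));
        data.setClassIndex(data.numAttributes() - 1);  // sensitivity is the class attribute

        IBk knn = new IBk();
        knn.setKNN(7);                                 // k = 7, as found in the pilot studies
        // Weight the neighbors' votes by inverse distance.
        knn.setDistanceWeighting(new SelectedTag(IBk.WEIGHT_INVERSE, IBk.TAGS_WEIGHTING));
        knn.buildClassifier(data);
    }
}
```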

B. Experimental Results

The main results of the experiments are given in Table 2. Each row of the table contains the name of the dataset (cluster), the number of instances in it, and the accuracy of the 7-NN classifier together with the number of features used, for each FS (FFS, BFE, and BS) and FE (PCA, Par, NPar) technique, averaged over 30 test runs. The column noFS relates to the case when no FS was applied, and noFE to the case when no FE was applied but all the categorical features were binarized.

The first row corresponds to the (global) results on the whole data set. The last row corresponds to the overall accuracy achieved with the most appropriate selection of sub data sets (clusters): staphylococcus, enterococcus, and gram– (best). We have not analyzed the difference between the selected feature subsets. However, we can see that in many cases the number of features (original or transformed) selected differs between clusters, and this number depends on whether DR was applied globally or locally. This also supports our hypothesis about the heterogeneity of the data.

In Figure 3 a comparison of the local and global results of the 7-NN classifier for 7 different clusters (including the whole data set) is shown. The results show similar behavior of FS and FE across the 7 different clusters. Analyzing the histograms one by one, we can see that the DR techniques for our data result in the best classification accuracy when applied locally to the staphylococcus, enterococcus, and gram– clusters.

Applying 7-NN with DR locally in the gram+ and gram– clusters (see Figure 3a) does not outperform the global accuracy. However, we can see that the accuracy results for the gram+ cluster are much higher than for the gram– cluster. The FE methods were almost equally good for the gram+ cluster. For the gram– cluster, Par was the worst and NPar was the best, but still 7-NN without any FE performed slightly better for gram–. The FS methods had no effect on 7-NN accuracy for the gram– cluster, and BS was the only FS method that increased 7-NN accuracy for gram+.

In Figure 3b we can see that applying 7-NN and the DR techniques individually to the staphylococcus and enterococcus clusters significantly increases the overall accuracy. Local Par outperforms local 7-NN by 1.3% (avg Par vs. avg noFE), NPar decreases the accuracy of 7-NN by 2%, and PCA has no effect. Local FS decreases the performance of 7-NN by 3.5-4.5%. The relatively low accuracy for the enterococcus cluster does not decrease the average accuracy much, since this cluster is rather small and contains only 5.7% of the instances of gram+, while staphylococcus contains 94.3%. However, the analysis of how good or bad the performance of a certain local model is helps to understand which subsets of instances are harder to classify or which subsets of instances are very noisy.

In Figure 3c we can see that dividing the gram– cluster further into enterobacteria and nonfermenters does not increase the accuracy of 7-NN, either with or without local FE or FS.

Now, if we compare the FE and FS horizontal triples of histograms, we can see that for our data the sequential strategies for FS have no success; exceptionally, BS was successful when applied individually to the gram+ and gram– clusters. The FE methods show more diverse behaviour: PCA is the best for the enterococcus cluster, while Par is the best for staphylococcus (Figure 3b, top). NPar demonstrated the best accuracy for global FE on the whole data set. This leads to the idea of adaptive selection of the FE method for each cluster; that is, the use of PCA in one cluster, and Par or NPar in some other cluster, may result in significantly higher overall accuracy. Figure 4 shows the classification accuracies for our data with regard to the selection of the best DR techniques within the corresponding clusters: locally in gram+ and globally in gram–.

FIGURE 4. CLASSIFICATION ACCURACY WITH THE SELECTION OF THE BEST DR WITHIN EACH CORRESPONDING CLUSTER (bars for g+(avg), g–(global), avg and global; accuracy scale 0.68-0.80)

We compare the impact of FS and FE on classification either globally or locally with the most appropriate selection of clusters: staphylococcus and enterococcus (joined into the averaged results of gram+), and gram–. Due to space limitations we do not present a separate figure but list here the main conclusions of this comparison:

(1) Natural clustering is useful for our data (in terms of increasing generalization accuracy) only by means of FE with any (global or local) DR.

(2) FE was useful (it improved generalization accuracy) both when applied globally and locally, while FS increased the accuracy of 7-NN only when applied locally to the gram+ and gram– clusters. This fact supports the hypothesis about the heterogeneity of our data.

(3) FS applied locally on this data results in higher accuracy produced by 7-NN compared to local FE. However, we need to point out that this was due to the binarization of the categorical features (required for FE), which increases the number of redundant binary features; 7-NN produces almost 3% higher accuracy for the data presented by the original categorical (not binarized) features. Perhaps, by analyzing the possible reasons why the accuracy of 7-NN on this data decreases after binarization, we can improve the overall situation in the FS-FE competition.

(4) Par produced very poor results when applied globally, but performed surprisingly well in some of the clusters. NPar was quite stable across the different clusters, and it was the best FE technique when DR was applied globally.

VI. CONCLUSION AND FUTURE DIRECTIONS

DR is an effective approach to data reduction, aimed at focusing on relevant features and improving the quality of data representation for classification. We experimentally compared local and global DR by means of FS and FE and showed their benefits. In this study we applied the natural clustering approach, aimed at using contextual features for splitting a real-world clinical data set into more homogeneous clusters, in order to construct local models that would help in the better prediction of antibiotic resistance.

The results of our experiments demonstrate that the proper selection of a local DR technique can lead to a significant increase of predictive accuracy in comparison with global 7NN classification with or without DR. The number of features extracted or selected locally is always smaller than that in the global space, which also shows the usefulness of natural clustering in coping with data heterogeneity (and higher dimensionality).

Our future research efforts will be directed towards the comparison of a mixture of FE models for classification built on natural clusters and on clusters produced by traditional clustering techniques. Here we analyzed spatial contextual features related to the categorization of different pathogens; we believe that natural clustering according to features that contain implicit or explicit information about the timestamp of a certain instance may give interesting results in different time contexts.

Another challenging goal is the adaptive selection of the FE method for each cluster according to certain characteristics of the cluster; the appropriate use of PCA in one cluster, and Par or NPar in some other cluster, may result in significantly higher overall accuracy.

ACKNOWLEDGMENT

We would like to thank Dr. Michael Shifrin and Dr. Irina Alexandrova from the N. N. Burdenko Institute of Neurosurgery, Russian Academy of Medical Sciences, Moscow, Russia, for the provided dataset and helpful discussions. We also thank the WEKA developers for the source code used in our work.

REFERENCES

[1] Aha, D.W., Bankert, R. A comparative evaluation of sequential feature selection algorithms. In: D. Fisher and H. Lenz (eds.), Proc. 5th Int. Workshop on Artificial Intelligence and Statistics, 1995, pp. 1-7.

[2] Bellman, R. Adaptive Control Processes: A Guided Tour. Princeton University Press, Princeton, 1961.

[3] Brossette, S.E., Sprague, A.P., Jones, W.T., Moser, S.A. A data mining system for infection control surveillance. Methods of Information in Medicine 39(4-5), 2000, pp. 303-310.

[4] Caruana, R., Freitag, D. Greedy attribute selection. In: Proc. 11th Int. Conf. on Machine Learning (ICML'94), Morgan Kaufmann, San Francisco, CA, 1994, pp. 26-28.

[5] Fayyad, U., Piatetsky-Shapiro, G., Smyth, P., Uthurusamy, R. Advances in Knowledge Discovery and Data Mining. AAAI/MIT Press, 1997.

[6] Ferraro, M.J., et al. Methods for Dilution Antimicrobial Susceptibility Tests for Bacteria that Grow Aerobically: Approved Standard, Sixth Edition & Performance Standards for Antimicrobial Susceptibility Testing. National Committee for Clinical Laboratory Standards (NCCLS), Wayne, PA, 2004. (Documents M7-A6 and M100-S14, www.nccls.org)

[7] Gaynes, R.P. Surveillance of nosocomial infections: a fundamental ingredient for quality. Infection Control and Hospital Epidemiology 18(7), 1997, pp. 475-478.

[8] Hall, M.A., Smith, L.A. Feature selection for machine learning: comparing a correlation-based filter approach to the wrapper. In: Proc. FLAIRS'99 Conf. on Artificial Intelligence, AAAI Press, 1999.

[9] Jolliffe, I.T. Principal Component Analysis. Springer-Verlag, New York, 1986.

[10] Kononenko, I. Inductive and Bayesian learning in medical diagnosis. Applied Artificial Intelligence 7(4), 1993, pp. 317-337.

[11] Lamma, E., Manservigi, M., Mello, P., Nanetti, A., Riguzzi, F., Storari, S. The automatic discovery of alarm rules for the validation of microbiological data. In: 6th Int. Workshop on Intelligent Data Analysis in Medicine and Pharmacology (IDAMAP 2001), UK, 2001.

[12] Liu, H. Feature Extraction, Construction and Selection: A Data Mining Perspective. Kluwer, 1998.

[13] Pechenizkiy, M. Impact of the feature extraction on the performance of a classifier: kNN, Naïve Bayes and C4.5. In: B. Kegl, G. Lapalme (eds.), Proc. 18th CSCSI Conf. on Artificial Intelligence (AI'05), LNAI 3501, Springer-Verlag, 2005, pp. 268-279.

[14] Pechenizkiy, M., Tsymbal, A., Puuronen, S., Shifrin, M., Alexandrova, I. Knowledge discovery from microbiology data: many-sided analysis of antibiotic resistance in nosocomial infections. In: K.D. Althoff et al. (eds.), Post-Conference Proc. 3rd Conf. on Professional Knowledge Management: Experiences and Visions, Kaiserslautern, Germany, 2005 (to appear).

[15] Samore, M., Lichtenberg, D., Saubermann, L., et al. A clinical data repository enhances hospital infection control. In: Proc. American Medical Informatics Association Annual Fall Symposium, 1997, pp. 56-60.

[16] The Problem of Antibiotic Resistance, NIAID Fact Sheet. National Institute of Allergy and Infectious Diseases (NIAID), National Institutes of Health, U.S. Department of Health and Human Services, USA (available at www.niaid.nih.gov/factsheets/antimicro.htm).

[17] Tsymbal, A., Puuronen, S., Pechenizkiy, M., Baumgarten, M., Patterson, D. Eigenvector-based feature extraction for classification. In: Proc. 15th Int. FLAIRS Conf. on Artificial Intelligence, AAAI Press, 2002, pp. 354-358.

[18] Turney, P. The management of context-sensitive features: a review of strategies. In: Proc. Workshop on Learning in Context-Sensitive Domains at the 13th Int. Conf. on Machine Learning (ICML'96), 1996, pp. 60-66.

[19] Vijayakumar, S., Schaal, S. Local dimensionality reduction for locally weighted learning. In: Proc. IEEE Int. Symposium on Computational Intelligence in Robotics and Automation (CIRA'97), 1997, pp. 220-225.

[20] Witten, I., Frank, E. Data Mining: Practical Machine Learning Tools and Techniques with Java Implementations. Morgan Kaufmann, San Francisco, 2000.


FIGURE 2. EXPERIMENTAL APPROACH (the training data is either used directly for supervised learning (SL), optionally preceded by DR, or first split by natural clustering into Cluster1 … Clustern, with DR and SL applied within each cluster; the resulting classifiers C1 … Cn are evaluated for accuracy on the test data)

TABLE 2. THE BASIC EXPERIMENTAL RESULTS: ACCURACY OF THE 7NN CLASSIFIER FOR EACH APPROACH AND THE NUMBER OF FEATURES USED BY THE CORRESPONDING APPROACH FOR CLASSIFICATION, ON AVERAGE

Accuracy of the 7-NN classifier (and number of features used):

                       Feature Selection                    Feature Extraction
Dataset     Inst   FFS      BFE      BS       noFS     PCA      Par      NPar       noFE
global      4430   .742  8  .744  8  .738  8  .748 28  .696 24  .682  1  .734 39.1  .719 44
gram– (g–)  2296   .706  6  .709  6  .713  6  .706 24  .662 31  .622  1  .678 31.5  .685 35
gram+ (g+)  2134   .787  5  .784  5  .798  5  .788 24  .745 19  .752  1  .749 32.6  .738 35
enterobac.   783   .677  4  .677  4  .679  4  .677 23  .635 16  .612  1  .644 28    .643 31
nonferm.    1513   .716  7  .72   8  .72   9  .716 23  .680 30  .635  1  .700 26.8  .709 31
staphyloc.  2013   .799  5  .757  5  .756  5  .799 23  .766 20  .785  1  .754 33.5  .772 37
enteroc.     121   .736  3  .719  3  .727  4  .736 23  .658 11  .608  1  .631 21.8  .603 28
best        4430   .730  5  .731  5  .733  5  .749 24  .709 19  .696  1  .711 33    .722 36

FIGURE 3. ACCURACY RESULTS OF THE 7NN CLASSIFIER WITH DIFFERENT APPROACHES: a) whole dataset (gram–, gram+, avg, global), b) ‘gram+’ cluster (staph, enterococ, avg, g+(global)), c) ‘gram–’ cluster (enterobac, nonf, avg, g–(global)); one row of panels for the FE methods (PCA, Par, NPar, noFE) and one for the FS methods (FFS, BFE, BS, noFS); accuracy scale 0.60-0.80



VI

IMPACT OF SAMPLE REDUCTION ON PCA-BASED FEATURE EXTRACTION FOR SUPERVISED LEARNING

Pechenizkiy, M., Puuronen, S. & Tsymbal, A. 2006. (to appear) In: H. Haddad et al. (Eds.), Proceedings of 21st ACM Symposium on Applied Computing (ACM SAC’06, Data Mining Track), ACM Press.


The Impact of Sample Reduction on PCA-based Feature Extraction for Supervised Learning

Mykola Pechenizkiy, Department of CS&ISs, Univ. of Jyväskylä, P.O. Box 35, Finland-40351, [email protected]

Seppo Puuronen, Department of CS&ISs, Univ. of Jyväskylä, P.O. Box 35, Finland-40351, [email protected]

Alexey Tsymbal, Department of CS, Trinity College Dublin, Ireland, [email protected]

ABSTRACT

“The curse of dimensionality” is pertinent to many learning algorithms, and it denotes the drastic rise of computational complexity and classification error in high dimensions. In this paper, different feature extraction (FE) techniques are analyzed as means of dimensionality reduction and constructive induction, with respect to the performance of the Naïve Bayes classifier. When a data set contains a large number of instances, some sampling approach is applied to address the computational complexity of the FE and classification processes. The main goal of this paper is to show the impact of sample reduction on the process of FE for supervised learning. In our study we analyzed the conventional PCA and two eigenvector-based approaches that take class information into account. The first class-conditional approach is parametric and optimizes the ratio of the between-class variance to the within-class variance of the transformed data. The second approach is a nonparametric modification of the first one, based on the local calculation of the between-class covariance matrix. The experiments are conducted on ten UCI data sets, using four different strategies to select samples: (1) random sampling, (2) stratified random sampling, (3) kd-tree based selective sampling, and (4) stratified sampling with kd-tree based selection. Our experiments show that if the sample size for FE model construction is small, then it is important to take into account both class information and data distribution. Further, for supervised learning the nonparametric FE approach needs far fewer instances to produce a new representation space that results in the same or higher classification accuracy than the other FE approaches.

Categories and Subject Descriptors

H.2.8 [Information Systems]: Database Management – Database Applications, Data Mining

General Terms

Algorithms, Performance, Design, Experimentation

Keywords

Feature Extraction, Sample Reduction, Supervised Learning

1. INTRODUCTION

Numerous data mining techniques have recently been developed to extract knowledge from databases. Fayyad [7] introduced knowledge discovery from databases (KDD) as “the nontrivial process of identifying valid, novel, potentially useful, and ultimately understandable patterns in data”. The process comprises several steps, which involve data selection, data pre-processing, data transformation, application of machine learning techniques, and the interpretation and evaluation of patterns.

In this paper we analyze problems related to data transformation, the phase before applying a machine learning technique. We consider feature extraction (FE) for supervised learning (SL). It is aimed at finding such a transformation of the original space that would produce new features which preserve or improve class separability as much as possible and form a new, lower-dimensional problem representation space (RS). Thus, FE for SL addresses (1) the so-called problem of “the curse of dimensionality” [3], which requires dimensionality reduction [1], and (2) the problem of a poor RS caused by the presence of individual features that are irrelevant or only indirectly relevant. We consider these problems further in Section 2, and different types of FE techniques for SL in Section 3, including Principal Component Analysis (PCA) and two class-conditional approaches to FE.

When a data set contains a large number of instances, some sampling approach is often applied to address the computational complexity of the FE and classification processes. The focus of this paper is the study of the effect of sample reduction on three FE techniques with regard to classification performance. In Section 4 we consider the basic sampling strategies used in this study.

We conduct a number of experiments on ten UCI datasets, analyzing the impact of sample reduction on three FE techniques with regard to the classification performance of Naïve Bayes (NB). The results of these experiments are reported in Section 5. Then, in Section 6, we briefly summarize the main conclusions and further research directions.

2. REPRESENTATION SPACE AND SL

In many real-world applications, numerous features are used in an attempt to better describe instances. If all those features are used to build up classifiers, then the classifiers operate in high dimensions, and the learning process becomes computationally and analytically complicated, often resulting in a drastic rise of classification error. Hence, there is a need to reduce the dimensionality of the feature space before classification. Different feature selection (FS) and FE techniques are used to cope with “the curse of dimensionality” and produce a better RS. While FS is aimed at selecting a subset of original features only, FE generates new features by means of some functional mapping, keeping as much information in the data as possible [8].

The essential drawback of all the methods that just assign weights to individual features is their insensitivity to interacting or correlated features. Also, in many cases some features are useful on one example set but useless or even misleading in another. That is why a transformation of the given representation before weighting the features can be preferable in such cases. However, FE and subset selection are, of course, not totally independent processes; they can be considered as different ways of task representation, and the use of such techniques is determined by the purpose at hand. Moreover, FE and selection methods are sometimes combined in order to improve the solution.

Even if the dimensionality of the problem is relatively low, most inductive learning approaches assume that the features used to represent instances are sufficiently relevant, and it was shown experimentally that this assumption often does not hold for many learning problems [13]. Some features may not be directly relevant, and some features may be redundant or irrelevant. Even those inductive learning approaches that apply feature selection techniques, and can eliminate irrelevant features and thus somehow account for the problem of high dimensionality, often fail to find a good representation of the data. This happens because many features in their original representation are weakly or indirectly relevant to the problem. The existence of such features usually requires the generation of new, more relevant features that are some functions of the original ones. Such functions may vary from very simple ones, such as a product or a sum of a subset of the original features, to very complex ones, such as a feature that reflects whether some geometrical primitive is present or absent in an instance.

The original RS can be improved for learning by removing less relevant features, adding more relevant features and abstracting features.

Constructive induction (CI) is a learning process that consists of two intertwined phases, one of which is responsible for the construction of the “best” RS, while the second concerns generating hypotheses in the found space [13]. In Figure 1 we can see two problems: one with a) a high-quality RS and one with b) a low-quality RS. In a), points marked by “+” are easily separated from the points marked by “–” using a straight line or a rectangular border. But in b), “+” and “–” are highly intermixed, which indicates the inadequacy of the original RS. A common approach is to search for complex boundaries to separate the classes; the CI approach instead suggests searching for a better representation space where the groups are better separated, as in c).

Figure 1. High vs. low quality RSs for concept learning: a) high quality RS, b) low quality RS, c) improved RS due to CI. CI aims at improving the quality of the low-quality RS [13].

However, in this paper the focus is on constructing new features from the original ones by means of some functional mapping, which is known as FE. We consider FE from both perspectives: as a constructive induction technique and as a dimensionality reduction technique.

3. FE FOR SUPERVISED LEARNING

Generally, FE for SL can be seen as a search, among all possible transformations of the original feature set, for the best one, which preserves class separability as much as possible in the space with the lowest possible dimensionality [8]. In other words, we are interested in finding a projection w:

$y = w^{T} x$   (1)

where $y$ is a $d \times 1$ transformed data point (presented using $d$ features), $w$ is a $k \times d$ transformation matrix, and $x$ is a $k \times 1$ original data point (presented using $k$ features).

3.1 PCA

Principal Component Analysis (PCA) is a classical statistical method which extracts a lower-dimensional space by analyzing the covariance structure of multivariate statistical observations [11].

The main idea behind PCA is to determine the features that explain as much of the total variation in the data as possible with as few of these features as possible. The computation of the PCA transformation matrix is based on the eigenvalue decomposition of the covariance matrix S and therefore is computationally rather expensive.

$w = \mathrm{eig\_decomposition}(S), \qquad S = \sum_{i=1}^{n} (x_i - m)(x_i - m)^{T}$   (2)

where $n$ is the number of instances, $x_i$ is the $i$-th instance, and $m$ is the mean vector of the input data.

Computation of the principal components can be presented with the following algorithm:

1. Calculate the covariance matrix S from the input data.
2. Compute the eigenvalues and eigenvectors of S and sort them in a descending order with respect to the eigenvalues.
3. Form the actual transition matrix by taking the predefined number of components (eigenvectors).
4. Finally, multiply the original feature space with the obtained transition matrix, which yields a lower-dimensional representation.

The necessary cumulative percentage of variance explained by the principal axes is commonly used as a threshold, which defines the number of components to be chosen.
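The following self-contained Java sketch illustrates these four steps. It replaces a full eigendecomposition with power iteration plus deflation, which is adequate for illustration but is not how a production library (such as the WEKA implementation used in the experiments) computes eigenvectors:

```java
import java.util.ArrayList;
import java.util.List;

/** Minimal PCA: covariance matrix plus power iteration with deflation (a sketch). */
public class SimplePCA {

    /** Returns the top eigenvectors of the covariance matrix of data (rows = instances). */
    public static List<double[]> principalAxes(double[][] data, double varianceThreshold) {
        int n = data.length, k = data[0].length;

        // Step 1: covariance matrix S of the mean-centered data.
        double[] mean = new double[k];
        for (double[] x : data)
            for (int j = 0; j < k; j++) mean[j] += x[j] / n;
        double[][] s = new double[k][k];
        for (double[] x : data)
            for (int a = 0; a < k; a++)
                for (int b = 0; b < k; b++)
                    s[a][b] += (x[a] - mean[a]) * (x[b] - mean[b]) / (n - 1);

        double totalVariance = 0;                      // trace(S) = sum of all eigenvalues
        for (int j = 0; j < k; j++) totalVariance += s[j][j];

        // Steps 2-3: extract eigenvectors largest-first until the cumulative
        // variance threshold (the stopping rule described above) is reached.
        List<double[]> axes = new ArrayList<>();
        double covered = 0;
        while (covered / totalVariance < varianceThreshold && axes.size() < k) {
            double[] v = powerIteration(s, 1000);
            double lambda = rayleigh(s, v);            // eigenvalue = variance on this axis
            if (lambda <= 1e-12) break;                // remaining variance is negligible
            axes.add(v);
            covered += lambda;
            for (int a = 0; a < k; a++)                // deflate: remove found component
                for (int b = 0; b < k; b++)
                    s[a][b] -= lambda * v[a] * v[b];
        }
        return axes;
        // Step 4 (projection): y_j = v_j . (x - mean) for each retained axis v_j.
    }

    private static double[] powerIteration(double[][] s, int iters) {
        int k = s.length;
        double[] v = new double[k];
        v[0] = 1;                                      // arbitrary nonzero start vector
        for (int t = 0; t < iters; t++) {
            double[] w = new double[k];
            for (int a = 0; a < k; a++)
                for (int b = 0; b < k; b++) w[a] += s[a][b] * v[b];
            double norm = 0;
            for (double x : w) norm += x * x;
            norm = Math.sqrt(norm);
            if (norm == 0) return v;
            for (int a = 0; a < k; a++) v[a] = w[a] / norm;
        }
        return v;
    }

    private static double rayleigh(double[][] s, double[] v) {
        double num = 0;
        for (int a = 0; a < s.length; a++)
            for (int b = 0; b < s.length; b++) num += v[a] * s[a][b] * v[b];
        return num;                                    // v is unit length, so this is v^T S v
    }
}
```

The loop stops as soon as the retained eigenvalues cover the requested share of the total variance, mirroring the cumulative-variance threshold just described.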

3.2 Class-conditional Eigenvector-based FE

In [14] it was shown that although PCA is the most popular FE technique, it has a serious drawback, namely that the conventional PCA gives high weights to features with higher variabilities irrespective of whether they are useful for classification or not.

Page 153: PhD MPechenizkiy withPageNums › ~mpechen › publications › pubs › PechenizkiyPHD...datasets, and the machine learning library in C++ (MLC++) and WEKA developers for the source

This may give rise to the situation where the chosen principal component corresponds to the attribute with the highest variability but without any discriminating power.

A usual approach to overcome the above problem is to use some class separability criterion [2], e.g. the criteria defined in Fisher’s linear discriminant analysis and based on the family of functions of scatter matrices:

$J(w) = \dfrac{w^{T} S_{B} w}{w^{T} S_{W} w}$   (3)

where $S_B$ in the parametric case is the between-class covariance matrix, which shows the scatter of the expected vectors around the mixture mean, and $S_W$ is the within-class covariance matrix, which shows the scatter of samples around their respective class expected vectors.

A number of other criteria were proposed in [8]. Both the parametric and nonparametric approaches optimize criterion (3) by using the simultaneous diagonalization algorithm [8].

It should be noticed that there is a fundamental problem with the parametric nature of the covariance matrices: the rank of $S_B$ is at most the number of classes − 1, and hence no more than this number of new features can be obtained.

The nonparametric method overcomes this problem by trying to increase the number of degrees of freedom in the between-class covariance matrix, measuring the between-class covariances on a local basis. The k-nearest neighbor (kNN) technique is used for this purpose. The algorithm for nonparametric FE is the same as for parametric extraction; simultaneous diagonalization is used as well, and the only difference is in the calculation of the between-class covariance matrix $S_B$. In the nonparametric case the between-class covariance matrix is calculated as the scatter of the samples around the expected vectors of other classes’ instances in the neighborhood.
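For a two-class problem, one common form of this local between-class scatter, following Fukunaga's nonparametric discriminant analysis, can be sketched as follows; the exact multi-class weighting used in the implementation behind this paper may differ, so this should be read as an assumed illustration rather than the paper's definition:

```latex
% Nonparametric between-class scatter (two-class sketch): each instance x_i is
% scattered around the mean m_j^{kNN}(x_i) of its k nearest neighbors in the
% other class j; the weights w_i emphasize instances near the class boundary,
% with the parameter alpha controlling the weighting.
S_B = \frac{1}{n} \sum_{i=1}^{n} w_i
      \bigl(x_i - m_{j}^{kNN}(x_i)\bigr)\bigl(x_i - m_{j}^{kNN}(x_i)\bigr)^{T},
\qquad
w_i = \frac{\min\{ d^{\alpha}(x_i, x_{kNN}^{(i)}),\; d^{\alpha}(x_i, x_{kNN}^{(j)}) \}}
           {d^{\alpha}(x_i, x_{kNN}^{(i)}) + d^{\alpha}(x_i, x_{kNN}^{(j)})}
```

where $d(x_i, x_{kNN}^{(c)})$ denotes the distance from $x_i$ to its k-th nearest neighbor in class $c$. Under this reading, the kNN and alpha settings that configure NPar correspond to the neighborhood size and the weighting exponent.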

A number of experimental studies have considered parametric and nonparametric class-conditional FE applied for kNN, NB, and C4.5 [15], for dynamic integration of classifiers [16], and for data with a small sample size and a high number of features [10].

4. RANDOM, STRATIFIED AND KD-TREE BASED SAMPLING

When a data set contains a large number of instances, some sampling strategy is normally applied before the FE and classification processes to reduce their computational time and cost.

In our study we apply random sampling (the area in Figure 2 marked with a dashed box), stratified random sampling (the whole of Figure 2), kd-tree based selective sampling (the area in Figure 3 marked with a dashed box), and stratified sampling with kd-tree based selection (the whole of Figure 3).

Random sampling and stratified random sampling are the most commonly applied strategies, as they are straightforward and fast. In random sampling, information about the distribution of instances by classes is disregarded. Therefore, intuitively, stratified sampling, which randomly selects instances from each group of instances (related to the corresponding class) separately, is preferable.
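The contrast between the two strategies is small enough to show in a few lines of Java; this is a minimal sketch, and the list-based data representation is an illustrative simplification:

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Random;

/** Plain vs. stratified random sampling of a fraction p of the instances. */
public class Sampling {

    public static <T> List<T> random(List<T> data, double p, Random rnd) {
        List<T> copy = new ArrayList<>(data);
        Collections.shuffle(copy, rnd);                 // class labels are ignored
        return copy.subList(0, (int) Math.round(p * copy.size()));
    }

    /** labels.get(i) is the class of data.get(i); the fraction p is drawn per class. */
    public static <T> List<T> stratified(List<T> data, List<String> labels,
                                         double p, Random rnd) {
        Map<String, List<T>> byClass = new HashMap<>();
        for (int i = 0; i < data.size(); i++)
            byClass.computeIfAbsent(labels.get(i), k -> new ArrayList<>()).add(data.get(i));
        List<T> sample = new ArrayList<>();
        for (List<T> group : byClass.values())
            sample.addAll(random(group, p, rnd));       // same fraction in every class
        return sample;
    }
}
```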

However, the assumption that instances are not uniformly distributed and that some instances are more representative than others [12] motivates applying a selective sampling approach. The main idea of selective sampling is to identify and select representative instances, so that fewer instances are needed to achieve similar (or even better) performance. The common approach to selective sampling is data partitioning (or data indexing), which aims to find some structure in the data and then to select instances from each partition of that structure. Although there exist many data partitioning techniques (see e.g. [9] for an overview), we chose the kd-tree for our study because of its simplicity, wide use and, last but not least, availability in the WEKA library [17].

Figure 2. Stratified random sampling (a given percentage of the instances of each of the c classes is selected at random, and the per-class samples are merged into the resulting sample).

A kd-tree is a generalization of the simple binary tree which uses k features instead of a single feature to split instances in a multi-dimensional space [9]. The splitting is done recursively in each of the successor nodes until the node contains no more than a predefined number of instances (called the bucket size) or cannot be split further. The order in which features are chosen to split can result in different kd-trees. As the goal of partitioning for selective sampling is to split instances into different (dissimilar) groups, a splitting feature is chosen if the data variance is maximized along the dimension associated with the splitting feature.

In Figure 3 (the area inside the dashed box) the basic idea of selective sampling is presented graphically. First, a kd-tree is constructed from the data; then a defined percentage of instances is selected from each leaf of the tree and added to the resulting sample to be used for the construction of the FE and NB models.

Figure 3. Stratified sampling with kd-tree based selection of (representative) instances (a local kd-tree is built for the instances of each class; a defined percentage of instances is selected from each leaf, and the per-class samples are merged into the resulting sample passed to FE and NB).
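A simplified Java sketch of the kd-tree based selection is given below. It splits recursively at the median of the maximum-variance feature until the bucket size is reached, and then draws the requested fraction from each leaf; this is a rough illustration of the idea, not the WEKA kd-tree implementation used in the experiments:

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.Comparator;
import java.util.List;
import java.util.Random;

/** kd-tree based selective sampling: split on the max-variance feature, sample leaves. */
public class KdTreeSampling {

    public static List<double[]> sample(List<double[]> data, int bucketSize,
                                        double fraction, Random rnd) {
        List<double[]> result = new ArrayList<>();
        collect(data, bucketSize, fraction, rnd, result);
        return result;
    }

    private static void collect(List<double[]> node, int bucketSize, double fraction,
                                Random rnd, List<double[]> out) {
        if (node.size() <= bucketSize) {
            // Leaf: draw the requested fraction of its instances at random.
            List<double[]> copy = new ArrayList<>(node);
            Collections.shuffle(copy, rnd);
            int take = Math.max(1, (int) Math.round(fraction * copy.size()));
            out.addAll(copy.subList(0, take));
            return;
        }
        int dim = maxVarianceDim(node);
        // Split at the median along the maximum-variance dimension.
        List<double[]> sorted = new ArrayList<>(node);
        sorted.sort(Comparator.comparingDouble(x -> x[dim]));
        int mid = sorted.size() / 2;
        collect(sorted.subList(0, mid), bucketSize, fraction, rnd, out);
        collect(sorted.subList(mid, sorted.size()), bucketSize, fraction, rnd, out);
    }

    private static int maxVarianceDim(List<double[]> node) {
        int k = node.get(0).length, best = 0;
        double bestVar = -1;
        for (int j = 0; j < k; j++) {
            double mean = 0, var = 0;
            for (double[] x : node) mean += x[j] / node.size();
            for (double[] x : node) var += (x[j] - mean) * (x[j] - mean);
            if (var > bestVar) { bestVar = var; best = j; }
        }
        return best;
    }
}
```

For the stratified variant of Figure 3, sample() is simply invoked once per class on that class's instances, and the per-class samples are merged.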

Liu [12] proposed using the kd-tree based selective sampling approach for unsupervised feature selection.

Running a few steps ahead to the analysis of the results of the experimental study, we note that, although different in nature, stratified sampling and kd-tree based sampling have a similar effect with respect to the application of FE for NB classification. This fact is the main motivation to try the combination of these approaches, so that both class information and information about the data distribution are used (as presented in Figure 3). It can be seen from the figure that the main difference is in constructing several local kd-trees, one for each group of instances related to a certain class, instead of constructing one global tree.

5. EXPERIMENTS AND RESULTS

The experiments were conducted on 10 data sets with different characteristics taken from the UCI machine learning repository [4]. The main characteristics of the data sets are presented in Table 1: the names of the data sets, the numbers of instances, the numbers of different classes of instances, the numbers of numerical and categorical/binary features, and the total numbers of numerical plus binary or binarized categorical features. Each categorical feature was replaced with a redundant set of binary features, each corresponding to a value of the original feature.

In the experiments, the accuracy of the NB learning algorithm was calculated. Although NB relies on the assumption that the features used for deriving a prediction are independent of each other given the predicted value (which is rarely true in practice), it has recently been shown that NB can be optimal even when the independence assumption is violated by a wide margin [5]. It has also been shown that NB can be effectively used in ensemble techniques which also perform bias reduction, such as boosting [6].

Table 1. Dataset characteristics

Dataset     inst   class  Feat (num)  Feat (cat/bin)  Feat (num+bin)
Hypothyr.   3772   3       7          22              31
Ionosph.     351   2      33           0              33
Kr-vs-kp    3196   2       0          37              40
Liver        345   2       6           0               6
Monk-1       432   2       0           6              15
Monk-2       432   2       0           6              15
Monk-3       432   2       0           6              15
Tic          958   2       0           9              27
Vehicle      846   4      18           0              18
Waveform    5000   3      21           0              21

For example, Elkan’s application of the boosted NB won first place out of 45 entries in the KDD’97 data mining competition [6]. Besides this, when NB is applied to subproblems of lower dimensionality, as in the random subspace method, the error bias of the Bayesian probability estimates caused by the independence assumption becomes smaller. We can also take into consideration the fact that FE techniques like PCA are aimed not only at reducing the dimensionality but also at producing uncorrelated features.

For each data set, 30 test runs of Monte-Carlo cross validation were made to evaluate the classification accuracy with the four FE approaches and without FE. In each run, the data set is first split into the training set and the test set by stratified random sampling to keep the class distributions approximately the same: each time, 20 percent of the instances of the data set are randomly picked for the test set. The sampling approaches are then applied to the remaining 80 percent of the instances to form the training set, which is used for finding the feature-extraction transformation matrix w. We selected p = i·10% of the original sample, with i = {1, …, 10}. The bucket size for the kd-tree was selected proportionally for each data set, equal to 10% of the original number of instances.

For PCA and parametric FE we used a 0.85 variance threshold, and we took all the features extracted by parametric FE, as their number was always equal to the number of classes − 1. The test environment was implemented within the WEKA framework (the machine learning library in Java) [17].
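As an illustration of the evaluation loop, the sketch below assumes the standard WEKA API (Instances, NaiveBayes, Evaluation); the file name is a placeholder, and the sampling and FE steps described above would be applied to the training part before NB is built:

```java
import java.io.BufferedReader;
import java.io.FileReader;
import java.util.Random;
import weka.classifiers.Evaluation;
import weka.classifiers.bayes.NaiveBayes;
import weka.core.Instances;

/** 30 runs of Monte-Carlo cross validation with a stratified 80/20 split. */
public class MonteCarloCV {
    public static void main(String[] args) throws Exception {
        Instances data = new Instances(new BufferedReader(new FileReader("dataset.arff")));
        data.setClassIndex(data.numAttributes() - 1);

        double sum = 0;
        for (int run = 0; run < 30; run++) {
            Instances copy = new Instances(data);
            copy.randomize(new Random(run));
            copy.stratify(5);                         // 5 folds -> one fold = 20% test set
            Instances train = copy.trainCV(5, 0);     // 80%: sampling + FE would happen here
            Instances test = copy.testCV(5, 0);       // 20%: held out for evaluation

            NaiveBayes nb = new NaiveBayes();
            nb.buildClassifier(train);

            Evaluation eval = new Evaluation(train);
            eval.evaluateModel(nb, test);
            sum += eval.pctCorrect();
        }
        System.out.println("Mean NB accuracy over 30 runs: " + sum / 30 + "%");
    }
}
```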

In Figure 4 accuracies of NB classification are presented for different sample sizes (from 10% to 100%). The figure is divided into four parts, each presenting the results of one sampling strategy: a) random sampling, b) stratified random sampling, c) kd-tree based sampling, and d) stratified kd-tree based sampling. For each sampling strategy four approaches are compared: Plain

Figure 4. Naïve Bayes accuracy with different samples selected by a) random sampling, b) stratified random sampling, c) kd-tree based sampling, and d) stratified kd-tree based sampling. (Each panel plots accuracy, on a 0.67–0.79 scale, against the sample size p = 0.1, 0.2, …, 1.0.)

Plain denotes the situation where NB is applied without any preceding FE, while PCA, PAR, and NPAR denote, correspondingly, the situations where PCA, parametric FE, and nonparametric FE are applied before NB. With random sampling (Figure 4a), NB produces the highest accuracy when no FE is applied (Plain) for p < 20%, and when nonparametric FE (NPAR) is applied for p > 20%. For p > 30%, NPAR results in at least 2% higher NB accuracy than PCA, PAR, and Plain achieve even with p = 100%. NPAR produces its highest accuracy values when p ≥ 70%, and in this respect it behaves differently from the others, which produce their highest accuracy values with p = 100%.

PAR achieves the same level as Plain when p = 50% and slightly outperforms Plain when p > 80%. PCA is the worst when 10% < p < 70%; with p > 70%, PCA behaves very similarly to PAR.

When stratified sampling is applied (Figure 4b), Plain is best only when p = 10% and NPAR is the best when p ≥ 20%. As with random sampling, for p > 30% NPAR results in at least 2% higher NB accuracy than PCA, PAR, and Plain achieve even with p = 100%. NPAR produces its highest accuracy values when p ≥ 70%. PAR achieves the same level as Plain when p = 50% and slightly outperforms Plain when p > 70%. PCA is the worst when p < 80%; it achieves the same level as Plain when p ≥ 80%, and as PAR when p ≥ 90%.

The kd-tree based sampling (Figure 4c) has an effect on the performance of NB very similar to that of stratified sampling, although it is different in nature. The only difference is that NPAR produces its highest accuracy values when p ≥ 80%.

The kd-tree based sampling, when applied to each class of instances separately (Figure 4d), improves the positive effect of stratified sampling with respect to each FE technique for p < 50%. Also, Plain and NPAR are equal when p = 10%, NPAR is the best when p > 10%, and NPAR produces its highest accuracy values when p ≥ 60%.

Comparing the results related to the four sampling strategies, we can conclude that no matter which of the four sampling strategies is used, if the sample size is small (p = 10%), then Plain shows the best accuracy results; if p ≥ 20%, then NPAR outperforms the other methods; and if p ≥ 30%, NPAR outperforms the other methods even when they use 100% of the sample. The best p for NPAR depends on the sampling method: for random and stratified sampling p = 70%, for kd-tree based sampling p = 80%, and for stratified kd-tree based sampling p = 60%. PCA is the worst technique when applied on a small sample size, especially when stratification or kd-tree indexing is used.

Generally, all sampling strategies have a similar effect on the final classification accuracy of NB for p > 30%. The significant difference in performance lies within 10% ≤ p ≤ 30%, as can be seen from the figure.

The intuitive explanation for this is that when a very large portion of the data is taken, it does not matter which strategy is used, since most of the selected instances are likely to be the same (though perhaps chosen in a different order). However, the smaller the portion of the sample, the more important it is how the instances are selected.

Figure 5 shows how stratification improves the effect of kd-tree sampling for 10% ≤ p ≤ 30%. The left part of the figure shows the difference in NB accuracy due to the use of kd-tree based sampling compared to random sampling, and the right part the difference due to the use of stratified kd-tree based sampling compared to random sampling.

Figure 5. The kd-tree based sampling vs. the random sampling (left) and the stratified kd-tree based sampling vs. the random sampling (right). (Both panels plot the difference in NB accuracy, on a −1.5% to +2.5% scale, for PCA, PAR, NPAR, and PLAIN at p = 10%, 20%, and 30%.)

6. CONCLUSIONS AND FURTHER RESEARCH

FE techniques are powerful tools that can significantly increase classification accuracy by producing better representation spaces or resolving the problem of “the curse of dimensionality”. When a data set includes many instances, sample reduction techniques are used before applying DM techniques.

In this paper, we analyzed the impact of sample reduction on the process of FE for SL. The experimental results of our study show that the type of sampling approach is not important when the selected sample size is relatively large. However, it is important to take into account both class information and information about the data distribution when the sample size to be selected is small. With regard to this, we are planning to further analyze the performance of FE for SL when 5–30% of the instances are selected from a data set. It might also be interesting to see for which data sets there was a significant difference in the samples selected by the different strategies, e.g., how many instances of a random sample differ from the instances selected by stratified random sampling or kd-tree based selective sampling.

The actual time taken to build classification models with and without FE on the selected samples is not reported in this study; nor do we present an analysis of the number of features extracted by each FE technique. These important issues will be presented in our further study.

It would also be interesting to analyze the impact of sample reduction on other commonly applied learning algorithms, such as decision trees and lazy learning, and to compare the results with the findings reported here.

7. ACKNOWLEDGMENTS

This research is partly supported by the COMAS Graduate School of the University of Jyväskylä, Finland, and Science Foundation Ireland under Grant No. S.F.I.-02IN.1I111.

8. REFERENCES

[1] Aivazyan, S.A. Applied Statistics: Classification and Dimension Reduction. Finance and Statistics, Moscow, 1989.
[2] Aladjem, M. Parametric and nonparametric linear mappings of multidimensional data. Pattern Recognition 24(6), 1991, 543-553.
[3] Bellman, R. Adaptive Control Processes: A Guided Tour. Princeton University Press, 1961.
[4] Blake, C.L., Merz, C.J. UCI Repository of Machine Learning Databases. Dept. of Information and Computer Science, University of California, Irvine, CA, 1998.
[5] Domingos, P., Pazzani, M. On the optimality of the simple Bayesian classifier under zero-one loss. Machine Learning 29(2-3), 1997, 103-130.
[6] Elkan, C. Boosting and naïve Bayesian learning. Tech. Report CS97-557, Department of CS and Engineering, University of California, San Diego, USA, 1997.
[7] Fayyad, U.M. Data mining and knowledge discovery: making sense out of data. IEEE Expert 11(5), 1996, 20-25.
[8] Fukunaga, K. Introduction to Statistical Pattern Recognition. Academic Press, London, 1990.
[9] Gaede, V., Günther, O. Multidimensional access methods. ACM Comput. Surv. 30(2), 1998, 170-231.
[10] Jimenez, L., Landgrebe, D. High Dimensional Feature Reduction via Projection Pursuit. PhD Thesis and School of Electrical & Computer Engineering Technical Report TR-ECE 96-5, 1995.
[11] Jolliffe, I.T. Principal Component Analysis. Springer, New York, NY, 1986.
[12] Liu, H., Motoda, H., Yu, L. A selective sampling approach to active feature selection. Artificial Intelligence 159(1-2), 2004, 49-74.
[13] Michalski, R.S. Seeking knowledge in the deluge of facts. Fundamenta Informaticae 30, 1997, 283-297.
[14] Oza, N.C., Tumer, K. Dimensionality Reduction Through Classifier Ensembles. Technical Report NASA-ARC-IC-1999-124, Computational Sciences Division, NASA Ames Research Center, Moffett Field, CA, 1999.
[15] Pechenizkiy, M. Impact of the feature extraction on the performance of a classifier: kNN, Naïve Bayes and C4.5. In: B. Kegl, G. Lapalme (Eds.), Proc. of 18th CSCSI Conference on Artificial Intelligence AI'05, LNAI 3501, Springer-Verlag, 2005, 268-279.
[16] Tsymbal, A., Pechenizkiy, M., Puuronen, S., Patterson, D.W. Dynamic integration of classifiers in the space of principal components. In: L. Kalinichenko, R. Manthey, B. Thalheim, U. Wloka (Eds.), Proc. Advances in Databases and Information Systems: 7th East-European Conf. ADBIS'03, LNCS 2798, Springer-Verlag, 2003, 278-292.
[17] Witten, I., Frank, E. Data Mining: Practical Machine Learning Tools with Java Implementations. Morgan Kaufmann, San Francisco, 2000.


VII

FEATURE EXTRACTION FOR DYNAMIC INTEGRATION OF CLASSIFIERS

Pechenizkiy, M., Tsymbal, A., Puuronen, S. & Patterson, D. 2005. Manuscript submitted to Fundamenta Informaticae, IOS Press (as extended version of Tsymbal et al., 2003)


Feature Extraction for Dynamic Integration of Classifiers

Mykola Pechenizkiy [email protected] Dept. of Computer Science and Information Systems, University of Jyväskylä, Jyväskylä, Finland

Alexey Tsymbal [email protected] Dept. of Computer Science, Trinity College Dublin, Dublin, Ireland

Seppo Puuronen [email protected] Dept. of Computer Science and Information Systems, University of Jyväskylä, Jyväskylä, Finland

David W. Patterson [email protected] Northern Ireland Knowledge Engineering Laboratory, University of Ulster, Belfast, U.K.

Abstract. Recent research has shown the integration of multiple classifiers to be one of the most important directions in machine learning and data mining. In this paper, we present an algorithm for the dynamic integration of classifiers in the space of extracted features (FEDIC). It is based on the technique of dynamic integration, in which local accuracy estimates are calculated for each base classifier of an ensemble in the neighborhood of a new instance to be processed. Generally, the whole space of original features is used to find the neighborhood of a new instance for local accuracy estimates in dynamic integration. However, when the dynamic integration takes place in high dimensions, the search for the neighborhood of a new instance is problematic, since most of such a space is empty and the neighbors are in fact located far from each other. Furthermore, when noisy or irrelevant features are present, it is likely that irrelevant neighbors will also be associated with a test instance. In this paper, we propose to use feature extraction in order to cope with the curse of dimensionality in the dynamic integration of classifiers. We consider classical principal component analysis and two eigenvector-based class-conditional feature extraction methods that take into account class information. Experimental results show that, on some data sets, the use of FEDIC leads to significantly higher ensemble accuracies than the use of plain dynamic integration in the space of original features.

1 Introduction

Knowledge discovery in databases (KDD) is a combination of data warehousing, decision support, and data mining that indicates an innovative approach to information management. KDD is an emerging area that considers the process of finding previously unknown and potentially interesting patterns and relations in large databases (Fayyad, 1996). Current electronic data repositories are growing quickly and contain huge amounts of data from commercial, scientific, and other domain areas. The capabilities for collecting and storing all kinds of data far exceed the abilities to analyze, summarize, and extract knowledge from this data. Numerous data mining methods have recently been developed to extract knowledge from these large databases. Selection of the most appropriate data-mining method or a group of the most appropriate methods is usually not straightforward. Often the method selection is


done statically for all new instances of the domain area without analyzing each particular new instance. Usually better data mining results can be achieved if the method selection is done dynamically taking into account characteristics of each new instance (Tsymbal, 2002).

Recent research has proved the benefits of using ensembles of base classifiers for classification problems (Dietterich, 1997). The challenge of integrating base classifiers is to decide which of them to select, or how to combine their classifications into the final classification.

In many real-world applications, numerous features are used in an attempt to ensure accurate classification. If all of those features are used to build up classifiers, then they operate in high dimensions, and the learning process becomes computationally and analytically complicated. For instance, many classification techniques are based on Bayes decision theory or on nearest neighbor search, which suffer from the so-called “curse of dimensionality” (Bellman, 1961) due to the drastic rise of computational complexity and classification error in high dimensions (Aivazyan, 1989). Hence, there is a need to reduce the dimensionality of the feature space before classification. According to the adopted strategy, dimensionality reduction techniques are divided into feature selection and feature transformation (also called feature discovery). The variants of the latter are feature extraction and feature construction. The key difference between feature selection and feature transformation is that during the former only a subset of original features is selected, while the latter is based on the generation of completely new features; feature construction implies discovering missing information about the relationships among features by inferring or creating additional features (Liu, 1998). Feature extraction is a dimensionality reduction technique that extracts a subset of new features from the original set of features by means of some functional mapping, keeping as much information in the data as possible (Fukunaga, 1999).

An essential drawback of all methods that just assign weights to individual features is their insensitivity to interacting or correlated features. Also, in many cases some features are useful in one example set but useless or even misleading in another. That is why the transformation of the given representation before weighting the features can be preferable in such cases. However, feature extraction and subset selection are of course not totally independent processes, and they can be considered as different ways of representing a task. The use of such techniques is determined by the purpose at hand and, moreover, sometimes feature extraction and selection methods are combined in order to improve the solution.

In this paper, we consider the use of feature extraction in order to cope with the curse of dimensionality in the dynamic integration of classifiers. Feature extraction aims to find a more compact and better representation of instances, which is used during the dynamic integration of classifiers for finding the neighborhood of a new instance.

We propose the FEDIC (Feature Extraction for Dynamic Integration of Classifiers) algorithm, which combines the dynamic selection, dynamic voting and dynamic voting with selection integration techniques (DS, DV and DVS) with conventional PCA and two class-conditional eigenvector-based approaches (which use the within- and between-class covariance matrices). The first eigenvector-based approach is parametric, and the other one is nonparametric. Both take class information into account when extracting features, in contrast to PCA (Fukunaga, 1999).

One of our hypotheses is that with data sets where feature extraction improves classification accuracy when employing a single classifier (like kNN), it will also improve the classification accuracy of an ensemble when a dynamic integration approach is employed in the space of extracted features. Conversely, with data sets where feature extraction decreases (or has no effect on) classification accuracy with the use of a single classifier, feature extraction will also decrease (or have no effect on) classification accuracy when employing a dynamic integration approach. This assumption comes from the idea that providing a better representation of the instances, which are used during the dynamic integration for finding the neighborhood of a new instance, will result in more appropriate local accuracy estimates.

The paper is organized as follows. In the next section the dynamic integration of classifiers is discussed. Section 3 briefly considers PCA-based feature extraction techniques with respect to classification problems. In Section 4 we consider the FEDIC algorithm, which performs the


dynamic integration of classifiers in the transformed space. In Section 5 experiments conducted on a number of data sets from the UCI machine learning repository are described, and the results of the FEDIC algorithm are analyzed and compared to the results of both the static and dynamic selection techniques shown in the non-transformed space.

2 Dynamic Integration of Classifiers

Recently the integration of classifiers has been under active research in machine learning, and different approaches have been considered (Chan, 1996). The integration of an ensemble of classifiers has been shown to yield higher accuracy than the most accurate base classifier alone in different real-world problems (Dietterich, 1997).

The task of using an ensemble of models can be broken down into two basic questions: (1) what set of learned models should be generated, and (2) how should the predictions of the learned models be integrated (Merz, 1998)? Several approaches have been tried to generate a set of accurate and diverse classifiers and to integrate their predictions. In the next subsections we consider some of these approaches.

2.1 Techniques for the generation of base classifiers in ensembles

One way of generating a diverse set of models is to use learning algorithms with heterogeneous representations and search biases (Merz, 1998), such as decision trees, neural networks, instance-based learning, etc.

Another approach is to use models with homogeneous representations that differ in their method of search or in the data on which they are trained. This approach includes several techniques for generating base models, such as learning base models from different subsets of the training data. For example, two well-known ensemble methods of this type are bagging and boosting (Quinlan, 1996).

The base models with homogeneous representations may be binary classifiers that are integrated to implement a multiclass learner (i.e., where the number of class labels is greater than 2). Each classifier in such an ensemble is learnt to distinguish one class label from the others. For example, Dietterich and Bakiri (1995) map each class label onto a bit string prior to learning. Bit strings for class labels are designed to be well separated, thus serving as error-correcting output codes (ECOC). An off-the-shelf system for learning binary classifications (e.g., 0 or 1) can be used to build multiple classifiers, one for each bit in the output code. An instance is classified by predicting each bit of its output code (i.e., label), and then classifying the instance as the label with the “closest” matching output code.
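As a toy illustration of the ECOC idea just described (the codes and labels below are our own example, not taken from Dietterich and Bakiri), consider:
_________________________________________________________
import numpy as np

codes = {                        # class label -> error-correcting output code
    "A": np.array([0, 0, 1, 1, 0]),
    "B": np.array([1, 0, 0, 1, 1]),
    "C": np.array([1, 1, 1, 0, 0]),
}

def ecoc_classify(bit_predictions):
    # Each of the 5 binary classifiers predicts one bit of the output code;
    # the label with the closest matching code (Hamming distance) wins.
    pred = np.array(bit_predictions)
    return min(codes, key=lambda c: int(np.sum(codes[c] != pred)))

print(ecoc_classify([1, 0, 1, 1, 1]))   # one bit away from B's code -> "B"
_________________________________________________________
Because the codes are well separated, a single erroneous bit prediction can still be corrected, which is the point of the scheme.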

Also, natural randomisation in the process of model search (e.g., random weight setting in the backpropagation algorithm for training neural networks) can be used to build different models with homogeneous representation. The randomisation can also be injected artificially. For example, in Heath (1996) a randomised decision tree induction algorithm, which generates different decision trees every time it is run, was used for that purpose.

Another way for building models with homogeneous representations, which proved to be effective, is the use of different subsets of features for each model. For example, in Oza and Tumer (1999) base classifiers are built on different feature subsets, where each feature subset includes features relevant for distinguishing one class label from the others (the number of base classifiers is equal to the number of classes). Finding a set of feature subsets for constructing an ensemble of accurate and diverse base models is also known as ensemble feature selection (Opitz, 1999).

Sometimes, a combination of the techniques considered above can be useful in order to provide the desired characteristics of the generated models. For example, a combination of boosting and wagging (which is a kind of bagging technique) is considered by Webb (2000).

In addition to these general-purpose methods for generating a diverse ensemble of models, there are learning algorithm-specific techniques. For example, Opitz and Shavlik (1996) employ a genetic algorithm in backpropagation to search for a good population of neural network classifiers.

Ensemble feature selection is the focus of this paper, and in the next section we consider techniques for integration of an ensemble of models.


2.2 Techniques for integration of an ensemble of models

Brodley and Lane (1996) have shown that simply increasing the coverage of an ensemble through diversity is not enough to ensure increased prediction accuracy. If the integration method does not utilize the coverage, then no benefit arises from integrating multiple classifiers. Thus, the diversity and coverage of an ensemble are not in themselves sufficient conditions for ensemble accuracy. It is also important for ensemble accuracy to have a good integration method that will utilize the diversity of the base models.

The challenging problem of integration is to decide which one(s) of the classifiers to rely on or how to combine the results produced by the base classifiers. Techniques using two basic approaches have been suggested as a solution to the integration problem: (1) a combination approach, where the base classifiers produce their classifications and the final classification is composed using them; and (2) a selection approach, where one of the classifiers is selected and the final classification is the result produced by it.

Several effective techniques for the combination of classifiers have been proposed. One of the most popular and simplest techniques used to combine the results of the base classifiers is simple voting (also called majority voting and select all majority, SAM) (Bauer & Kohavi, 1999). In the voting technique, the classification of each base classifier is considered as an equally weighted vote for that particular classification. The classification that receives the greatest number of votes is selected as the final classification (ties are resolved arbitrarily). Often, weighted voting is used: each vote receives a weight, which is usually proportional to the estimated generalization performance of the corresponding classifier. Weighted voting (WV) usually works much better than simple majority voting (Bauer & Kohavi, 1999).

More sophisticated combination techniques include the SCANN method based on the correspondence analysis and using the nearest neighbor search in the correspondence analysis results (Merz, 1998, 1999); and techniques to combine minimal nearest neighbor classifiers within the stacked generalization framework (Skalak, 1997). Two effective classifier combination techniques based on stacked generalization called “arbiter” and “combiner” were presented by Chan (1996). Hierarchical classifier combination has also been considered. Experimental results of Chan and Stolfo (1996, 1997) showed that the hierarchical (multi-level) combination approach, where the dataset was distributed among a number of sites, was often able to sustain the same level of accuracy as a global classifier trained on the entire dataset.

A number of selection techniques have also been proposed to solve the integration problem. One of the most popular and simplest selection techniques is the cross-validation majority (CVM; we call it simply Static Selection, SS, in our experiments) (Schaffer, 1993). In CVM, the cross-validation accuracy for each base classifier is estimated using the training set, and then the classifier with the highest accuracy is selected (ties are resolved using voting). More sophisticated selection approaches include estimation of the local accuracy of the base classifiers by considering errors made on instances with similar predictions (Merz, 1995), and learning a number of meta-level classifiers (“referees”) that predict whether the corresponding base classifiers are correct or not for new instances (each “referee” is a C4.5 tree that recognizes two classes) (Koppel & Engelson, 1996). Todorovski and Dzeroski (2000) trained a meta-level decision tree, which dynamically selected a base model to be applied to the considered instance, using the level of confidence of the base models in correctly classifying the instance.

The approaches to classifier selection can be divided into two subsets: static and dynamic selection. The static approaches propose one “best” method for the whole data space, while the dynamic approaches take into account each new instance to be classified and its neighbourhood only. The CVM is an example of the static approach, while the other selection techniques considered above are examples of the dynamic approach.

Techniques for combining classifiers can be static or dynamic as well. For example, the widely used weighted voting (Bauer & Kohavi, 1999) is a static approach: the weights for each base classifier's vote do not depend on the instance to be classified. In contrast, the reliability-based weighted voting (RBWV) introduced in (Cordella, 1999) is a dynamic voting approach. It uses classifier-dependent estimation of the reliability of predictions for each particular instance. Usually, better results can be achieved if the classifier integration is done dynamically, taking into account the characteristics of each new instance.

2.3 Motivation of dynamic integration

In Figure 1, a simple example of the distribution of the errors of a model built by the C4.5 decision tree learning algorithm (Quinlan, 1993) over the instance space is considered. The classification problem includes two features, x1 and x2, and two classes, “class_1” and “class_2”. The target function is x1 > x2. The greyed areas represent instances which are incorrectly classified by the model. It can be seen that the errors are concentrated in triangular regions in this case.

Figure 1 – An example of the distribution of errors of a C4.5 decision tree (the instance space of features x1 and x2, with regions labeled class_1 and class_2).

In (Gama, 1999) it was shown that the distribution of the error rate over the instance space is not homogeneous for many types of classifiers. Depending on the classifier, the error rate will be more concentrated on certain regions of the instance space than in others.

The basic idea of dynamic integration is that the information about a model’s errors in the instance space can be used for learning just as the original instances were used for learning the model. In (Giacinto & Roli, 1999), a theoretical framework of dynamic classifier selection was presented showing that the accuracy of dynamic selection approaches the accuracy of the optimal Bayesian classifier when the number of instances in the dataset grows.

In this paper, a dynamic integration approach is presented that estimates the local accuracy of the base classifiers by analyzing their accuracy on instances near the instance to be classified (Puuronen et al., 1999a). Instead of directly applying selection or combination as an integration method, we use cross validation to collect information about the classification accuracies of the base classifiers and use this information to estimate the local classification accuracies for each new instance. These estimates are based on the weighted nearest neighbor classification (WNN) (Cost & Salzberg, 1993).

The proposed dynamic integration technique contains two main phases (Puuronen et al., 1999a; Puuronen & Tsymbal, 2001); its general idea can be seen in Figure 2.

First, at the learning phase, the training set is partitioned into folds. During the cross validation run, we estimate the local classification errors of each base classifier for each instance of the training set according to the 1/0 loss function. These local errors together with the features of the instances of the training set form a meta-level training set used by WNN. The learning phase finishes with training the base classifiers on the whole training set. The application phase begins with determining the K-nearest neighborhood for the new instance using a distance metric based on the values of its features. Then, the meta-level classifier (WNN) is used to predict the local classification errors of each base classifier for a new instance using the meta-level training set. In WNN, to weigh the classification errors for an instance k from the K-nearest neighborhood of the test instance, we use a tri-cube function (Hastie & Tibshirani, 1996):

w_k = (1 − (d_k / d_{K+1})^3)^3,   (1)

where d_k is the distance from the instance k to the test instance, and d_{K+1} is the distance from the test instance to the first nearest instance not included in the K-neighborhood.
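A minimal sketch of this weighting in Python/NumPy (the function name and the toy distances are illustrative only):
_________________________________________________________
import numpy as np

def tricube_weights(distances_sorted, K):
    # distances_sorted: ascending distances from the test instance;
    # returns the weights w_k of equation (1) for the K nearest neighbors
    d_k = distances_sorted[:K]
    d_K1 = distances_sorted[K]    # first neighbor not in the K-neighborhood
    return (1.0 - (d_k / d_K1) ** 3) ** 3

print(tricube_weights(np.array([0.1, 0.4, 0.5, 1.0]), K=3))
_________________________________________________________
The weight decays smoothly from nearly 1 for the closest neighbor to 0 at the edge of the neighborhood, so nearer neighbors dominate the local accuracy estimates.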

Figure 2 – Dynamic integration: learning and application phases. (In the learning phase, training subsets T_1, T_2, …, T_S of the training set T yield base classifiers h_1, h_2, …, h_S together with their accuracy estimates; feature extraction (FE) optionally transforms T into T′. In the application phase, a new instance (x, ?) is classified as h* = F(h_1, h_2, …, h_S), producing (x, y*).)

Three different approaches based on the local accuracy estimates have been proposed: Dynamic Selection (DS) (Puuronen et al., 1999a), Dynamic Voting (DV) (Puuronen et al., 1999a), and Dynamic Voting with Selection (DVS) (Tsymbal et al., 2001). All these are based on the same local accuracy estimates obtained using WNN and (1). DS simply selects a classifier with the least predicted local classification error, as was also proposed in (Giacinto & Roli, 1999; Woods et al., 1997). In DV, each base classifier receives a weight that is proportional to the estimated local accuracy of the base classifier, and the final classification is produced by combining the votes of each classifier with their weights. In DVS, the base classifiers with highest local classification errors are discarded (the classifiers with errors that fall into the upper half of the error interval of the base classifiers) and locally weighted voting (DV) is applied to the remaining base classifiers.

In this paper we consider the feature extraction process (depicted by dotted lines in Fig. 2) as a means of neighborhood enhancement with respect to the dynamic integration of classifiers during the application phase. This idea will be considered further in Section 4. In the next section we consider the basic ideas of PCA-based feature extraction for classification.

3 PCA-based Feature Extraction for Classification

In this section we consider the main idea of PCA-based feature transformation. The conventional PCA-based approach does not take class information into account. We consider a simple example that shows both good and bad performance of PCA from the class discrimination point of view.

Then, we introduce class-conditional feature extraction approaches that are based on Fisher linear discriminant analysis. The first approach is parametric, and the second is nonparametric.


In any case, all feature extraction methods have a similar property: they extract new features, which are different from the original set, and thus constitute a new representation space, which usually has a significantly smaller number of features.

3.1 PCA-based feature transformations

Principal Component Analysis (PCA) is a classical statistical method, which extracts a lower dimensional space by analyzing the covariance structure of multivariate statistical observations.

PCA transforms the original set of features into a smaller subset of linear combinations that account for most of the variance of the original set. The objectives of PCA are: dimensionality reduction; determination of linear combinations of variables; feature selection, i.e., choosing the most useful variables; and identification of underlying variables.

The main idea of PCA is to determine the features which explain as much of the total variation in the data as possible with as few of these features as possible (William & Goldstein, 1998; Jolliffe, 1986).

Consider X an n×p data matrix consisting of n independent observations of the variables x^T = (x_1, …, x_p), centered about the mean m = (1/n) Σ_{i=1}^n x_i so that the column totals are zero. The covariance matrix of X is estimated as:

S = Σ_{i=1}^n (x_i − m)(x_i − m)^T,   (2)

and any combination a^T x, where a and x are p×1 vectors, has estimated variance a^T S a. The features are extracted in such a way that the first principal component, defined as a^T x and denoted PC(1), accounts for the largest amount of total variation in the data, under the constraint a^T a = 1. This involves solving the equation d/da {a^T S a − λ a^T a} = 0, where λ is a Lagrange multiplier. This equation gives the system of equations that defines a:

(S − λI)a = 0.   (3)

In order to find a non-zero solution, S − λI should be a singular matrix, i.e.

|S − λI| = 0.   (4)

It is known that if the matrix S is symmetric and nonnegative definite (as is the case with any covariance matrix), equation (4) has p real nonnegative roots λ_1 ≥ λ_2 ≥ … ≥ λ_p ≥ 0, which are called the eigenvalues of the matrix S (Aivazyan, 1989).

Taking into account that the variance of a^T x must satisfy a^T S a = λ, we obtain that in order to maximize the variance of PC(1) the maximum should be chosen among the p eigenvalues of the matrix S. The standardization ensures that a_i^T a_i = 1, and the construction implies that a_i^T a_j = 0 for i ≠ j. The vector a_i contains the coefficients of PC(i), and X a_i gives the scores of the n individuals on PC(i). Thus, the first principal component is obtained as a linear combination of the eigenvector corresponding to the maximum eigenvalue of S and the original data X, i.e.

PC(1) = Σ_{j=1}^p a_{(1)j} x_j.   (5)

The second principal component PC(2) is then uncorrelated with the first linear combination and associated with the maximum amount of the remaining total variation. Analogously, the m-th principal component PC(m) = Σ_{j=1}^p a_{(m)j} x_j should have the largest variance of all linear combinations that are uncorrelated with all of the previously extracted principal components.

The uncorrelatedness of the principal components can be seen from the estimated variance of A^T X, which equals

A^T S A = diag(λ_1, λ_2, …, λ_p).   (6)

To conclude, in a PCA-based feature transformation we are interested in finding a projection A:

Y = A^T X,   (7)

where Y constitutes the transformed space of p′×1 transformed data points, A is an m×p′ transformation matrix, and X consists of m×1 original data points.

Computation of the principal components can be presented with the following algorithm:

1. Calculate the covariance matrix S from the input data.
2. Compute the eigenvalues and eigenvectors of S and sort them in descending order with respect to the eigenvalues.
3. Form the actual transition matrix by taking the predefined number of components (i.e., eigenvectors).
4. Finally, multiply the original feature space by the obtained transition matrix, which yields a lower-dimensional representation.

The necessary cumulative percentage of variance explained by the principal axes should be consulted in order to set a threshold which defines the number of components to be chosen.
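The four steps above can be sketched compactly in Python with NumPy, selecting the number of components by the cumulative-variance threshold just mentioned (a minimal illustration; the function name pca_transform and the toy data are ours):
_________________________________________________________
import numpy as np

def pca_transform(X, threshold=0.85):
    # X: n x p data matrix; returns the transformed data and transition matrix A
    Xc = X - X.mean(axis=0)                  # center about the mean m
    S = Xc.T @ Xc                            # step 1: covariance matrix (eq. 2)
    eigvals, eigvecs = np.linalg.eigh(S)     # step 2: eigensystem of S...
    order = np.argsort(eigvals)[::-1]        # ...sorted in descending order
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    explained = np.cumsum(eigvals) / np.sum(eigvals)
    k = int(np.searchsorted(explained, threshold)) + 1  # components to keep
    A = eigvecs[:, :k]                       # step 3: transition matrix
    return Xc @ A, A                         # step 4: project the data

rng = np.random.default_rng(0)
Y, A = pca_transform(rng.normal(size=(100, 5)))
print(Y.shape, A.shape)
_________________________________________________________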

The following list summarizes the properties of PCA that are interesting from the FE point of view: the variance of the extracted features is maximized; the extracted features are uncorrelated; PCA gives the best linear approximation in the mean-square sense, because the truncation error is the sum of the lower eigenvalues; and the information contained in the extracted features is maximized (William & Goldstein, 1998). Besides the properties pointed out above, PCA has the following advantages: the model parameters can be computed directly from the data, e.g. by diagonalizing the sample covariance of the data; and compression and decompression are quite easy operations, requiring only matrix multiplication.

Although PCA has a number of advantages, there are some drawbacks. One of them is that PCA gives high weight to features with higher variabilities, disregarding whether they are useful for classification or not. Figure 3, adopted from Oza and Tumer (1999), shows why it can be dangerous not to use class information. The first case shows the proper work of PCA, where the first principal component corresponds to the variable with the highest discriminating power along the first principal axis, while the second case shows that the chosen component may be not relevant for the problem at hand.

In order to overcome this problem, some variations of PCA with local and nonlinear processing are used to improve dimensionality reduction (Kambhatla & Leen, 1997). In spite of certain benefits of such methods for classification problems compared to global PCA, we have to note that they do not directly use class information.

In some cases it can be difficult to understand the set of new features, extracted by PCA, although this problem relates to every feature extraction method.

3.2 Class-conditional feature extraction

Feature extraction for classification is a search among all possible transformations for the best one, which preserves class separability as much as possible in the space with the lowest possible dimensionality (Aladjem, 1994).

Although PCA is still probably the most popular feature extraction technique, it has a serious drawback, namely giving high weights to features with higher variabilities, irrespective of whether they are useful for classification or not. This may give rise to the situation where the chosen principal component corresponds to an attribute with the highest variability but without any discriminating power (as was considered in Figure 3).

Figure 3 – PCA for classification: a) effective work of PCA, b) the case where an irrelevant principal component was chosen from the classification point of view (O denotes the origin of the initial feature space x1, x2 and OT the origin of the transformed feature space PC(1), PC(2)).

The usual decision is to use some class separability criterion, based on a family of functions of scatter matrices: the within-class covariance, the between-class covariance, and the total covariance matrices.

The within-class covariance matrix shows the scatter of samples around their respective class expected vectors:

S_W = Σ_{i=1}^c n_i Σ_{j=1}^{n_i} (x_j^(i) − m^(i))(x_j^(i) − m^(i))^T,   (8)

where c is the number of classes, n_i is the number of instances in class i, x_j^(i) is the j-th instance of the i-th class, and m^(i) is the mean vector of the instances of the i-th class.

The between-class covariance matrix shows the scatter of the expected vectors around the mixture mean:

S_B = Σ_{i=1}^c n_i (m^(i) − m)(m^(i) − m)^T,   (9)

where c is the number of classes, n_i is the number of instances in class i, m^(i) is the mean vector of the instances of the i-th class, and m is the mean vector of all the input data.

The total covariance matrix shows the scatter of all samples around the mixture mean. It can be shown analytically that this matrix is equal to the sum of the within-class and between-class covariance matrices (Fukunaga, 1990):

S = S_B + S_W.   (10)

One possible criterion based on this family of functions of the scatter matrices is the Fisher linear discriminant:

J(a) = (a^T S_B a) / (a^T S_W a).   (11)

A number of other criteria were proposed in (Fukunaga, 1990). The criterion (11) can be optimized by the use of the simultaneous diagonalization algorithm (Fukunaga, 1990):

1. Transformation of X to Y: Y = Λ^(−1/2) Φ^T X, where Λ and Φ are the eigenvalue and eigenvector matrices of S_W.
2. Computation of S_B in the obtained Y space.
3. Selection of the m eigenvectors of S_B, ψ_1, …, ψ_m, which correspond to the m largest eigenvalues.
4. Finally, the new feature space Z = Ψ_m^T Y, where Ψ = [ψ_1, …, ψ_m], can be obtained.
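A compact NumPy sketch of this simultaneous diagonalization follows (our own illustration; parametric_fe and its arguments are illustrative names, and a small constant guards against a singular S_W, which the algorithm above does not address):
_________________________________________________________
import numpy as np

def parametric_fe(X, y, n_components):
    # Parametric class-conditional extraction via simultaneous diagonalization.
    classes = np.unique(y)
    # within-class scatter S_W (eq. 8)
    Sw = sum(len(X[y == c]) *
             (X[y == c] - X[y == c].mean(0)).T @ (X[y == c] - X[y == c].mean(0))
             for c in classes)
    # step 1: whitening transformation Y = Lambda^(-1/2) Phi^T X
    lam, phi = np.linalg.eigh(Sw)
    lam = np.clip(lam, 1e-10, None)          # guard against singular S_W
    Yd = X @ (phi @ np.diag(lam ** -0.5))
    # step 2: between-class scatter S_B (eq. 9) in the Y space
    mY = Yd.mean(axis=0)
    Sb = sum(len(Yd[y == c]) *
             np.outer(Yd[y == c].mean(0) - mY, Yd[y == c].mean(0) - mY)
             for c in classes)
    # steps 3-4: keep the eigenvectors of S_B with the largest eigenvalues
    lb, psi = np.linalg.eigh(Sb)
    Psi = psi[:, np.argsort(lb)[::-1][:n_components]]
    return Yd @ Psi                          # Z = Psi_m^T Y for each instance

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (50, 4)), rng.normal(2, 1, (50, 4))])
y = np.array([0] * 50 + [1] * 50)
print(parametric_fe(X, y, n_components=1).shape)  # at most c - 1 useful features
_________________________________________________________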

3.3 Parametric vs. nonparametric feature extraction

In this subsection we consider the difference between the parametric and nonparametric approaches to class-conditional feature extraction.

In the previous subsection we introduced the main principle of class separability criterion, based on a family of functions of scatter matrices.

It should be noticed that there is a fundamental problem with the parametric nature of the covariance matrices. The rank of S_B is at most the number of classes minus one, and hence no more than this number of new features can be obtained.

The nonparametric method overcomes this problem by trying to increase the number of degrees of freedom in the between-class covariance matrix, measuring the between-class covariances on a local basis. The k-nearest neighbor (kNN) technique is used for this purpose.

A two-class nonparametric feature extraction method was considered in (Fukunaga, 1990), and it is extended in (Tsymbal et al., 2002) to the multiclass case. The algorithm for nonparametric feature extraction is the same as for the parametric extraction. Simultaneous diagonalization is used as well, and the only difference is in calculating the between-class covariance matrix S_B. In the nonparametric case the between-class covariance matrix is calculated as the scatter of the samples around the expected vectors of other classes’ instances in the neighborhood:

S_B = Σ_{i=1}^c n_i Σ_{k=1}^{n_i} Σ_{j=1, j≠i}^c w_ik (x_k^(i) − m*_ik^(j))(x_k^(i) − m*_ik^(j))^T,   (12)

where m*_ik^(j) is the mean vector of the nNN instances of the j-th class which are nearest neighbors to x_k^(i). The number of nearest instances nNN is a parameter, which should be set in advance. In (Fukunaga, 1990) it was proposed to use nNN equal to 3, but without any justification. The coefficient w_ik is a weighting coefficient, which shows the importance of each summand in (12). The goal of this coefficient is to assign more weight to those elements of the matrix which involve instances lying near the class boundaries and thus more important for classification. We generalize the two-class version of this coefficient proposed in (Fukunaga, 1990) to the multiclass case:

w_ik = min_j { d^α(x_k^(i), x_nNN^(j)) } / Σ_{j=1}^c d^α(x_k^(i), x_nNN^(j)),   (13)

where d(x_k^(i), x_nNN^(j)) is the distance from x_k^(i) to its nNN-nearest neighbor of class j, and α is a parameter which should be set in advance.

The two parameters (nNN and α) are thus used to assign more weight to those elements of the matrix which involve instances lying near the class boundaries and are therefore more important for classification. In (Fukunaga, 1990) the parameter α was set to 1 and nNN to 3, but without any strict justification. In (Tsymbal et al., 2002) it was shown that these parameters have different optimal values for each data set.
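A rough sketch of computing the nonparametric S_B of (12) with the weights (13) is given below (our own illustration; the distance computations are brute-force, and an instance's zero self-distance is excluded when its own class is examined):
_________________________________________________________
import numpy as np

def nonparametric_Sb(Y, y, nNN=3, alpha=1.0):
    classes, p = np.unique(y), Y.shape[1]
    Sb = np.zeros((p, p))
    for ci in classes:
        Xi = Y[y == ci]
        n_i = len(Xi)
        for xk in Xi:
            # distance from xk to its nNN-th nearest neighbor of each class (13)
            d = {}
            for cj in classes:
                dist = np.sort(np.linalg.norm(Y[y == cj] - xk, axis=1))
                if cj == ci:
                    dist = dist[1:]                 # drop the zero self-distance
                d[cj] = dist[min(nNN - 1, len(dist) - 1)]
            w = (min(d.values()) ** alpha) / sum(dv ** alpha for dv in d.values())
            for cj in classes:                      # scatter around local means (12)
                if cj == ci:
                    continue
                Xj = Y[y == cj]
                nn = np.argsort(np.linalg.norm(Xj - xk, axis=1))[:nNN]
                m_star = Xj[nn].mean(axis=0)        # local mean m*_ik^(j)
                Sb += n_i * w * np.outer(xk - m_star, xk - m_star)
    return Sb
_________________________________________________________
The eigenvectors of this matrix, computed in the whitened Y space, then replace those of the parametric S_B in the diagonalization algorithm of Section 3.2.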

The considered difference between the parametric and nonparametric approaches to the S_B calculation is depicted in Figure 4. Solid lines depict the distances between the class means and the global mean in the parametric approach, and the local distances between individuals not belonging to the same class in the nonparametric approach, while dashed lines depict distances within a class.

In the next section we will see how feature extraction is used in the dynamic integration of classifiers. A corresponding algorithm will be introduced.

4 Dynamic Integration of Classifiers with Instance Space Transformation

In order to address the curse of dimensionality in the dynamic integration of classifiers, we propose the FEDIC (Feature Extraction for Dynamic Integration of Classifiers) algorithm, which first performs feature extraction and then uses a dynamic scheme to integrate the classifiers.


Figure 4 – Differences in the between-class covariance matrix calculation for the nonparametric (left) and parametric (right) approaches for the two-class (ω1 and ω2) problem.

4.1 Scheme of the FEDIC algorithm

In Figure 5, a scheme that illustrates the components of the FEDIC approach is presented. The FEDIC learning model consists of two major phases: the training phase and the application phase. The training phase is further divided into the training of the base classifiers phase and the feature extraction phase (FE). The application phase is further divided into the meta-learning phase and the dynamic integration phase (DIC).

Figure 5 – Scheme of the FEDIC approach. (The data set is divided into training, validation, and test sets. In the training phase, RSM(S, N) forms training subsets TS_1, …, TS_S, on which base classifiers BC_1, …, BC_S are trained with accuracy estimation and feature subset refinement, producing trained base classifiers and local accuracy estimates; in parallel, PCA, parametric, and nonparametric feature extraction produce transformation models and the transformed training set. Together these constitute the meta-data. In the application phase, the nearest neighborhood of a new instance is searched for, WNN predicts the local errors of every BC for each nearest neighbor (meta-learning), and dynamic selection, dynamic voting, or dynamic voting with selection produces the final classification. S denotes the size of the ensemble, N the number of features, TS a training subset, BC a base classifier, and NN the nearest neighborhood.)

The model is built using a wrapper approach (Kohavi, 1995), where the variable parameters in FE and DIC can be adjusted in an iterative manner to improve performance, as measured at the model validation phase. These parameters include the threshold value that is related to the amount of variance covered by the first principal components and thus defines the number of output features in the transformed space (it is set up for each feature extraction method); the optimal values of the α and nNN parameters (as described in Section 3) in the nonparametric feature extraction technique; and the number of nearest neighbors in DIC, as described later.

The basic stages of the algorithm are shown in Fig. 6. First, the data set is divided into the training set, the validation set, and the test set using stratified random sampling. The training set is provided in two copies. One copy of the training set is used for building the base classifiers, and the other is used during the feature extraction phase. The validation set is used in the refinement cycle during the training of the base classifiers phase. The test set is used during the application phase.
_________________________________________________________
D, T, V, Tst – whole data set, train, validation and test subsets
M = {L, BC, FE, T′} – meta-data
E – local error estimates
L – local accuracy estimates of BCs in the neighborhood of a test instance
BC – set of base classifiers
FE = {PCA, Par, NPar} – transformation models
T′ – transformed train data sets
R – summary of results
thresh – percent of variance explained by the selected extracted features
kNN, alpha – parameters of the nonparametric feature extraction

function FEDIC(D) returns R
begin
  T, V, Tst ← divideDataSet(D)
  BC, E ← trainingBC(T, V)
  if (FE is NPar) set preselected kNN and alpha
  FE, T′ ← featureExtraction(T, thresh, kNN, alpha)
  L ← metaLearning(E, FE, Tst, T′)
  return R ← DIC(Tst, L, BC)
end
_________________________________________________________

Figure 6 – Main phases of the FEDIC algorithm

Then, the algorithm proceeds with the training of the base classifiers phase and the feature extraction phase, which can be performed independently of (in parallel with) each other. As a result of these phases the meta-data is constructed. The meta-data includes the trained base classifiers with corresponding local accuracy estimates, and the transformation models with the transformed training set. Once the meta-data is ready, the meta-learning phase starts. A test instance passes through the transformation process so that it is in the same feature space as the meta-data (the transformed training set). As the result of this phase, local accuracy estimates of the base classifiers in the neighborhood of the test instance are produced. These local estimates are then used in the dynamic integration of classifiers phase, where the corresponding integration procedure is applied and the final prediction is made.

In the next subsections we further consider the training phase, the feature extraction phase, the meta-learning phase and the dynamic integration of classifiers phase of the FEDIC algorithm. In Fig. 7 data set schemas corresponding to training, feature extraction and application phases are presented.

4.2 The training of the base classifiers phase

The training phase (see Fig. 8) consists of two main stages: (1) construction of the initial ensemble in random subspaces; and (2) iterative refinement of the ensemble members. The iterative refinement, based on hill-climbing search, is used to improve the accuracy and diversity of the base classifiers. For each feature subset, every feature is tentatively switched (included or excluded). If the resulting feature subset produces better performance on the validation set, that change is kept. This process is continued until no further improvements are possible. The process usually terminates after no more than four passes through the feature set.

Figure 7 – Data set schemas corresponding to the training of base classifiers, feature extraction, and application phases:

- Training BC, input (base-level): categorical_1, …, categorical_l, numerical_1, …, numerical_m, class;
- Training BC, output (meta-level): categorical_1, …, categorical_l, numerical_1, …, numerical_m, E_1, …, E_bc, class;
- FE phase, input (meta-level): numerical_1, …, numerical_p, class;
- FE phase, output (meta-level): pc_1, …, pc_p′, class;
- Application phase, input (meta-level): categorical_1, …, categorical_l, numerical_1, …, numerical_m, pc_1, …, pc_p′, class.

_________________________________________________________
T      training set
V      validation set
BC     set of trained base classifiers
FS     set of feature subsets for the base classifiers
c(x)   classification of instance x
Cj     j-th base classifier
Cj(x)  prediction of Cj on instance x
Ej(x)  estimation of error of Cj on instance x
E      error matrix
S      number of base classifiers
N      number of features

function trainingBC(T, V) returns BC, E
begin
  FS = RSM(S, N)                 {the random subspace method}
  for i = 1 to S                 {refinement of each feature subset}
    FS[i] = hillClimbing(FS[i])
    Cj = trainBC(FS[i])
    for each x from V
      compare Cj(x) with c(x) and derive Ej(x)
    E ← Ej(x)
    BC ← Cj
  return BC, E
end
_________________________________________________________

Figure 8 – The training of the base classifiers phase

As the feature subset’s performance measure, the fitness function proposed by Opitz (1999) for a genetic algorithm is used, where the fitness of a feature subset i is proportional to the classification accuracy and diversity of the corresponding classifier:

Fitness_i = acc_i + α · div_i,   (14)

where acc_i and div_i are the accuracy and diversity calculated over the validation set, and α is the coefficient determining the degree with which the diversity influences the fitness of the current feature subset.

The training set T is partitioned into v folds. Then, cross-validation is used to estimate the errors of the base classifiers Ej(x*) on the training set.

The training phase continues with training the base classifiers Cj on the whole training set. As a result of this phase a set of trained base classifiers BC and the error matrix E are produced.
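The refinement loop described above can be sketched as follows (a schematic illustration; evaluate is a stand-in for training a base classifier on the candidate subset and measuring acc_i and div_i on the validation set, per (14)):
_________________________________________________________
import numpy as np

def hill_climbing(subset, evaluate, alpha=1.0):
    # subset: boolean mask over the N features; evaluate(mask) -> (acc, div)
    acc, div = evaluate(subset)
    best = acc + alpha * div                 # Fitness_i = acc_i + alpha * div_i
    improved = True
    while improved:                          # usually <= 4 passes in practice
        improved = False
        for f in range(len(subset)):         # try switching each feature
            candidate = subset.copy()
            candidate[f] = not candidate[f]
            if not candidate.any():          # keep at least one feature
                continue
            acc, div = evaluate(candidate)
            if acc + alpha * div > best:
                best, subset, improved = acc + alpha * div, candidate, True
    return subset
_________________________________________________________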

4.3 The feature extraction phase

The feature extraction phase begins with preprocessing, which includes the binarization of categorical features. Each categorical feature is replaced with a redundant set of binary features, each corresponding to a value of the original feature. During this phase, feature extraction techniques are applied to the preprocessed training set T to produce transformed data with a reduced number of features. The pseudo-code of this process is shown in Fig. 9. The pre-processed data set is the input for the FE module, where one of three functions is used: (1) PCA_FE, which implements conventional PCA, (2) Par_FE, which implements parametric feature extraction, or (3) NPar_FE, which implements nonparametric feature extraction. The function getYspace is used to calculate an intermediate transformed space needed for the parametric and nonparametric approaches.

_________________________________________________________
T, T′       preprocessed train and transformed train data sets
S, Sb, Sw   total, between- and within-class covariance matrices
m           mean vector
Y           intermediate transformed space
Λ, Φ        eigenvalue and eigenvector matrices
threshold   the amount of variance in the selected PCs

function PCA_FE(T, threshold) returns T′
begin
  S ← Σ_{i=1}^n (x_i − m)(x_i − m)^T
  Λ, Φ ← eigs(S)                       {the eigensystem decomposition}
  P_PCA ← formP_PCA(threshold, Λ, Φ)   {forms the transformation matrix}
  return T′ ← P_PCA^T T
end

function Par_FE(T, threshold) returns T′
begin
  Y ← getYspace(T)
  S_B ← Σ_{i=1}^c n_i (m^(i) − m)(m^(i) − m)^T    {computing of Sb in the Y space}
  Λ, Φ ← eigs(S_B)
  P_Par ← formP_Par(threshold, Λ, Φ)
  return T′ ← P_Par^T Y
end

function NPar_FE(T, threshold, kNN, alpha) returns T′
begin
  Y ← getYspace(T)
  w_ik ← min_j{ d^α(x_k^(i), x_nNN^(j)) } / Σ_{j=1}^c d^α(x_k^(i), x_nNN^(j))
  S_B ← Σ_{i=1}^c n_i Σ_{k=1}^{n_i} Σ_{j=1, j≠i}^c w_ik (x_k^(i) − m*_ik^(j))(x_k^(i) − m*_ik^(j))^T
  Λ, Φ ← eigs(S_B)
  P_NPar ← formP_NPar(threshold, Λ, Φ)
  return T′ ← P_NPar^T Y
end

function getYspace(T) returns Y
begin
  S_W ← Σ_{i=1}^c n_i Σ_{j=1}^{n_i} (x_j^(i) − m^(i))(x_j^(i) − m^(i))^T
  Λ, Φ ← eigs(S_W)
  return Y ← Λ^(−1/2) Φ^T T
end
_________________________________________________________

Figure 9 – The feature extraction phase

4.4 The meta-learning phase

The instances from the test set Tst, the transformed training set T′, and the local error estimates E are the input for the meta-learning phase (Fig. 10), where the performance of each base classifier on a new test instance is predicted. A test instance passes through the transformation process (the corresponding FE transformation model is used) so that it is in the same feature space as the meta-data (the transformed training set). Then, the meta-learning phase proceeds with finding in the transformed training set T′ the nearest neighborhood NN of the transformed test instance x′.
_________________________________________________________
x      test instance from test set Tst
x′     the instance x transformed with P_PCA, P_Par or P_NPar
W      vector of weights for base classifiers
T′     transformed training set
Lj(x)  prediction of error of Cj on instance x

function metaLearning_phase(T′, E, x, FE) returns L
begin
  x′ ← transformTst(x, FE)
  NN = FindNeighborhood(T′, x′, nn)
  loop for j from 1 to m
    L_j ← (1/nn) · Σ_{i=1}^{nn} W_{NN_i} · E_j(x_{NN_i})   {WNN estimation}
    L ← L_j
  return L
end
_________________________________________________________

Figure 10 – The meta-learning phase

The size nn of the set NN is an adjustable parameter. The classification error Lj is predicted for each base classifier Cj using the WNN procedure described in Section 2.3. As a result of this phase, the local accuracy estimates L of the base classifiers in the neighborhood of the test instance are produced.
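As a hedged illustration of the WNN estimation above, the sketch below predicts the local error vector L from the error matrix E using distance-weighted nearest neighbors. The inverse-distance weighting and the plain Euclidean distance in the transformed space are assumptions made for the sketch; in the study itself the neighborhood is found with the HEOM metric and the WNN procedure of Section 2.3.
_________________________________________________________
import numpy as np

def metaLearning_phase(T_prime, E, x_prime, nn=15):
    # T_prime: (n, p') transformed training set; E: (n, m) error matrix,
    # E[i, j] = 1 if base classifier C_j erred on training instance i.
    # Returns L: m local error estimates for the transformed instance x_prime.
    d = np.linalg.norm(T_prime - x_prime, axis=1)
    idx = np.argsort(d)[:nn]              # the nn nearest neighbors
    w = 1.0 / (d[idx] + 1e-9)             # inverse-distance weights (assumed)
    w /= w.sum()
    return w @ E[idx]                     # weighted local error of each C_j
_________________________________________________________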


4.5 The dynamic integration phase

The local accuracy estimates L, the base classifiers BC, and the test instances Tst are the input for the dynamic integration phase, where the base classifier(s) are selected to produce the classification of a new instance. One of three functions is used for this purpose: DS_integration_phase, DV_integration_phase or DVS_integration_phase (Figure 11). The first function, DS_integration_phase, implements Dynamic Selection (DS): the classifier with the lowest local error (with the least global error in the case of ties) is selected to make the final classification. The second function, DV_integration_phase, implements Dynamic Voting (DV): each base classifier Cj receives a weight Wj that depends on its local performance, and the final classification is conducted by voting the classifier predictions Cj(x) with their weights Wj. In DVS_integration_phase, the base classifiers Cj with high local errors Lj are discarded (namely the classifiers whose local errors fall into the upper half of the error interval of the ensemble), after which dynamic voting (DV) is applied to the restricted set of classifiers.
_________________________________________________________
x       test instance from the test set Tst
W       vector of weights for the base classifiers
T       original training set
L_j(x)  prediction of the error of C_j on test instance x

function DS_integration_phase(x, BC, L) returns class of x
begin
  l ← argmin_j L_j   {index of the classifier with minimal L_j;
                      the least global error is used in the case of ties}
  return C_l(x)
end

function DV_integration_phase(x, BC, L) returns class of x
begin
  W ← 1 − L
  return Weighted_Voting(W, C_1(x), ..., C_m(x))
end

function DVS_integration_phase(x, BC, L) returns class of x
begin
  threshold ← lower L_j + ½ (upper L_j − lower L_j)
  list ← selectBadClassifiers(threshold)
  discardClassifiers(list)
  return DV_integration_phase(x, BC, L)
end
_________________________________________________________

Figure 11 – The dynamic integration phase
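The three integration functions can be sketched as follows (an illustrative Python rendering of Figure 11, not the original implementation; representing the base classifiers as callables and passing an explicit global-error vector for tie-breaking are assumptions of the sketch):
_________________________________________________________
import numpy as np

def DS_integration_phase(x, classifiers, L, global_errors):
    # Dynamic Selection: the classifier with the lowest local error,
    # ties broken by the lowest global error.
    candidates = np.flatnonzero(L == L.min())
    l = candidates[np.argmin(global_errors[candidates])]
    return classifiers[l](x)

def DV_integration_phase(x, classifiers, L, classes):
    # Dynamic Voting: weighted voting with weights W = 1 - L.
    W = 1.0 - L
    votes = {c: 0.0 for c in classes}
    for w, clf in zip(W, classifiers):
        votes[clf(x)] += w
    return max(votes, key=votes.get)

def DVS_integration_phase(x, classifiers, L, classes):
    # Dynamic Voting with Selection: discard classifiers whose local error
    # falls into the upper half of the ensemble's error interval, then vote.
    threshold = L.min() + 0.5 * (L.max() - L.min())
    keep = np.flatnonzero(L <= threshold)
    return DV_integration_phase(x, [classifiers[i] for i in keep],
                                L[keep], classes)
_________________________________________________________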

We do not devote a separate subsection to the model validation and model evaluation phases since they are performed in a straightforward manner. At the validation phase, the performance of the given model with the given parameter settings is tested on an independent validation data set, and the model, its parameter settings and its performance are recorded. At the final evaluation phase, the best of the obtained models is selected, i.e. the one with the optimal parameter settings according to the validation results. The selected model is then tested with a test data set. In the next section we consider our experiments, in which we analyzed and compared the feature extraction techniques described above applied to the dynamic integration of classifiers.


5 Experimental Studies

In this section we describe the experimental settings and present results, first on 21 UCI data sets (Blake & Merz, 1998) and then on 3 large medical data sets with real cases of acute abdominal pain.

5.1 Experimental settings

For each data set, 70 test runs were made. In each run the data set was first split into a training set, a validation set, and a test set by stratified random sampling: 60 percent of the instances were included in the training set, and the other 40 percent were divided into two sets of approximately equal size (the validation and test sets). The validation set was used in the iterative refinement of the ensemble, and the test set was used for the final estimation of the ensemble accuracy.
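For concreteness, one run's 60/20/20 stratified split could be produced as in the sketch below. It uses scikit-learn's train_test_split purely for illustration; the original experiments were implemented with the MLC++ library, so the tooling here is an assumption.
_________________________________________________________
from sklearn.model_selection import train_test_split

def split_one_run(X, y, seed):
    # 60% training; the remaining 40% halved into validation and test,
    # all splits stratified by the class label.
    X_tr, X_rest, y_tr, y_rest = train_test_split(
        X, y, train_size=0.6, stratify=y, random_state=seed)
    X_val, X_te, y_val, y_te = train_test_split(
        X_rest, y_rest, test_size=0.5, stratify=y_rest, random_state=seed)
    return (X_tr, y_tr), (X_val, y_val), (X_te, y_te)
_________________________________________________________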

To construct the ensembles of base classifiers we used the EFS_SBC (Ensemble Feature Selection for the Simple Bayesian Classification) algorithm introduced in (Tsymbal et al., 2003a). Initial base classifiers were built using Naïve Bayes on the training set and later refined using a hill-climbing cycle on the validation data set. The ensemble size was set to 25, since it has been shown that the biggest gain is achieved already with this number of base classifiers (Bauer & Kohavi, 1999). The diversity coefficient alpha was selected for each data set as recommended in (Tsymbal et al., 2003a).

At each run of the algorithm, we collected accuracies for the five types of integration of the base classifiers: Static Selection (SS), Weighted Voting (WV), Dynamic Selection (DS), Dynamic Voting (DV) and Dynamic Voting with Selection (DVS). In dynamic integration, the number of nearest neighbors for the local accuracy estimates was pre-selected for each data set from the set of six values 1, 3, 7, 15, 31, 63 (nn = 2^n − 1, n = 1, ..., 6). The Heterogeneous Euclidean-Overlap Metric (HEOM) (Wilson & Martinez, 1997) was used for the calculation of distances in dynamic integration. For each feature extraction technique (PCA (conventional PCA), Par (parametric approach), and NPar (nonparametric approach)) we first considered experiments with the best eigenvalue threshold from the set of ten values 0.65, 0.75, 0.85, 0.9, 0.95, 0.97, 0.99, 0.995, 0.999, 1, as was done in (Tsymbal et al., 2002). Subsequently, another ten values were used: 0.75, 0.85, 0.9, 0.95, 0.97, 0.99, 0.999, 0.99999, 0.9999999, 1, which seemed more relevant from the point of view of algorithm sensitivity.
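The HEOM distance itself is simple to state in code; the sketch below follows the definition of Wilson & Martinez (1997), assuming categorical features are encoded as numeric codes and NaN marks a missing value:
_________________________________________________________
import numpy as np

def heom(a, b, is_categorical, ranges):
    # Per-feature distance: 1 for missing values; 0/1 overlap on
    # categorical features; range-normalized absolute difference on
    # numeric ones. The overall distance is the Euclidean combination.
    d = np.empty(len(a))
    for f in range(len(a)):
        if np.isnan(a[f]) or np.isnan(b[f]):
            d[f] = 1.0
        elif is_categorical[f]:
            d[f] = 0.0 if a[f] == b[f] else 1.0
        else:
            d[f] = abs(a[f] - b[f]) / ranges[f]   # ranges[f] = max_f - min_f
    return float(np.sqrt(np.sum(d ** 2)))
_________________________________________________________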

All the settings used throughout the experimental study are summarized in Table 1.

Table 1. Experiment settings parameters

Feature extraction technique   PCA, Par, NPar
Threshold for every FE         0.75, 0.85, 0.9, 0.95, 0.97, 0.99, 0.999, 0.99999, 0.9999999, 1
kNN and alpha for NPar         preselected for each data set according to Tsymbal et al. (2002), presented in Table 2
Integration method             SS, WV, DS, DV, DVS
k in DS, DV, DVS               1, 3, 7, 15, 31, or 63
Diversity coefficient          preselected for each data set according to Tsymbal et al. (2003a), presented in Table 2

A multiplicative factor of 1 was used for the Laplace correction in simple Bayes. Numeric features were discretized into ten equal-length intervals (or one per observed value, whichever was less). Software for the experiments was implemented using the MLC++ machine learning library (Kohavi et al., 1996).

5.2 Experimental results on UCI data sets

We conducted the experiments on 21 data sets with different characteristics taken from the UCI machine learning repository (Blake & Merz, 1998). The main characteristics of the data sets, which include the name of a data set, the number of instances, the number of classes, and the numbers of features of different types (categorical and numerical), are presented in Table 2. In (Tsymbal et al., 2001), results of experiments with feature subset selection techniques with the dynamic selection of classifiers on these data sets were presented.

Table 2. UCI data sets used in the study

                          Original features
Data set        Inst  Classes  Cat-al  Num-al  Total  Num-al + binarized cat-al   kNN   alpha  Div alpha
Balance          625     3        4       0       4             20                 255    1/3      2
Breast Cancer    286     2        9       0       9             48                   1     5       1
Car             1728     4        6       0       6             21                  63     5       2
Diabetes         768     2        0       8       8              8                 127    1/5      1
Glass Recogn.    214     6        0       9       9              9                   1     1       4
Heart Disease    270     2        0      13      13             13                  31     1       1
Ionosphere       351     2        0      34      34             34                 255     3       1
Iris Plants      150     3        0       4       4              4                  31    1/5      2
LED              300    10        0       0       7              7                  15    1/3      1
LED17            300    10        0       0      24             24                  15     5      1/4
Liver            345     2        0       6       6              6                   7     3       1
Lymphography     148     4       15       3      18             38                   7     1       2
MONK-1           432     2        6       0       6             15                   1     1       4
MONK-2           432     2        6       0       6             15                  63    20      1/4
MONK-3           432     2        6       0       6             15                   1    1/3      2
Soybean           47     4        0      35      35             35                   3     1       1
Thyroid          215     3        0       5       5              5                 215     3       2
Tic-Tac-Toe      958     2        9       0       9             27                   1     1       4
Vehicle          846     4        0      18      18             18                   3     3       4
Voting           435     2       16       0      16             48                  15    1/3     1/2
Zoo              101     7       16       0      16             16                   7   1/20      2

The basic accuracy results of the experiments with the FEDIC algorithm on the 21 UCI data sets are presented in Appendix A (the results are averaged over 70 runs).

For an initial comparison of dynamic integration with feature extraction (FEDIC) against dynamic integration carried out in the space of original features, Table 3 presents the best results of dynamic integration with respect to each feature extraction technique. The same table also shows, for comparison, the classification accuracies of a 3-NN classifier applied together with each of the three feature extraction techniques, namely PCA, the parametric (Par) and the nonparametric (NPar) approach, as well as without feature extraction (Plain). Then, in the same order, the best accuracies (for the FEDIC approaches) over the dynamic selection, dynamic voting and dynamic voting with selection schemes are presented. The last column contains the best classification accuracy for the static integration of classifiers (SIC), selected from static selection and weighted voting. Each row of the table corresponds to a single data set; the last row gives the results averaged over all the data sets.

From Table 3 one can see that the nonparametric approach has the best accuracy on average with the base classifier and shares the best accuracy results with PCA in the dynamic integration scenarios. The parametric approach extracted the smallest number of features (and was the least time-consuming approach), but its performance was unstable. It has rather weak results for dynamic integration on the Breast, Glass, MONK-1, Tic-Tac-Toe and Vehicle data sets in comparison to the other feature extraction approaches, and the scenario with parametric feature extraction has the worst average accuracy (worse even than dynamic integration performed in the space of original features). This contradicts the results of the parametric approach with a single 3-NN classifier, where it had poor results only on the Glass and MONK-1 data sets and was the second best approach. On the contrary, PCA was the worst (although the most stable) feature extraction approach for the 3-NN classifier, but shows results on average as good as the nonparametric approach when applied for dynamic integration. Nevertheless, the behavior of the feature extraction approaches may differ from one data set to another.

The results of Table 3 show that, in some cases, dynamic integration in the space of extracted features results in significantly higher accuracies than dynamic integration in the space of original features.

Table 3. Accuracy results of 3-NN classifier and FEDIC algorithm (the best from DS, DV, DVS is selected)

                     3-NN                            FEDIC                     SIC
Data set    Plain   PCA    Par    NPar  |  Plain   PCA    Par    NPar  |  Plain
Balance     .834    .827   .893   .863  |  .898    .898   .897   .899  |  .896
Breast      .724    .721   .676   .676  |  .744    .747   .731   .747  |  .744
Car         .806    .824   .968   .964  |  .911    .920   .941   .942  |  .863
Diabetes    .730    .730   .725   .722  |  .763    .767   .761   .763  |  .761
Glass       .664    .659   .577   .598  |  .679    .674   .603   .621  |  .623
Heart       .790    .777   .806   .706  |  .839    .839   .838   .839  |  .839
Ionospher   .849    .872   .843   .844  |  .918    .920   .915   .917  |  .916
Iris        .955    .963   .980   .980  |  .941    .933   .940   .935  |  .929
LED         .667    .646   .630   .635  |  .745    .746   .745   .745  |  .751
LED17       .378    .395   .493   .467  |  .690    .690   .690   .690  |  .690
Liver       .616    .664   .612   .604  |  .625    .635   .621   .623  |  .615
Lymph       .814    .813   .832   .827  |  .841    .840   .828   .836  |  .824
Monk-1      .758    .767   .687   .952  |  .832    .838   .709   .942  |  .746
Monk-2      .504    .717   .654   .962  |  .665    .665   .663   .675  |  .664
Monk-3      .843    .939   .990   .990  |  .984    .975   .985   .987  |  .971
Soybean     .995    .992   .987   .986  |  1       1      1      1     |  1
Thyroid     .938    .921   .942   .933  |  .961    .958   .951   .955  |  .953
Tic         .684    .971   .977   .984  |  .930    .964   .783   .895  |  .730
Vehicle     .694    .753   .752   .778  |  .676    .717   .668   .717  |  .603
Voting      .921    .923   .949   .946  |  .953    .953   .945   .949  |  .951
Zoo         .932    .937   .885   .888  |  .960    .961   .959   .961  |  .948
Avg         .766    .801   .803   .824  |  .836    .840   .818   .840  |  .810

This is the situation with the Car, Liver, MONK-1, MONK-2, Tic-Tac-Toe and Vehicle data sets. We highlight these cases with bold font in the table and present a histogram in Figure 11, where for each data set the best accuracy from {ds_plain, dv_plain, dvs_plain} (marked plain) is compared with the best accuracy from {ds_pca, dv_pca, dvs_pca; ds_par, dv_par, dvs_par; ds_npar, dv_npar, dvs_npar} (marked bestFE). It can be seen that on the Car, Liver, MONK-1, MONK-2, Tic-Tac-Toe and Vehicle data sets, feature extraction for the dynamic integration of classifiers results in a significant increase of accuracy compared to the performance of dynamic integration in the space of original features. The corresponding average increases of accuracy for these data sets were 3.1% (.911 vs. .942), 1% (.625 vs. .635), 11% (.832 vs. .942), 1% (.665 vs. .675), 3.4% (.930 vs. .964) and 4.1% (.676 vs. .717).

There is an important trend in the results – as a rule the FEDIC algorithm outperforms dynamic integration on plain features only on those data sets, on which feature extraction for classification with a single classifier provides better results than the classification on the plain features. If we analyze this correlation further (see Table 4), we will come to the conclusion that feature extraction influences the accuracy of dynamic integration to a similar extent as feature extraction influences the accuracy of base classifiers. This trend supports our expectations about the behavior of the FEDIC algorithm.


Figure 11 – Accuracy results on 21 UCI data sets: DIC vs. FEDIC

Table 4. Accuracy improvement due to FE for the 3-NN classifier, due to DIC for the base classifier, and due to FE for DIC.

Data set    FE – plain (%)  DIC – SIC (%)  FEDIC – DIC (%)  Significant
Balance          5.9             0.2             0.1         + / – / –
Breast          -0.3             0.0             0.3         – / – / –
Car             16.2             4.8             3.1         + / + / +
Diabetes         0.0             0.2             0.4         – / – / –
Glass           -0.5             5.6            -0.5         – / + / –
Heart            1.6             0.0             0.0         + / – / –
Ionospher        2.3             0.2             0.2         + / – / –
Iris             2.5             1.2            -0.1         + / + / –
LED             -2.1            -0.6             0.1         – / – / –
LED17           11.5             0.0             0.0         + / – / –
Liver            4.8             1.0             1.0         + / + / +
Lymph            1.8             1.7            -0.1         + / + / –
Monk-1          19.4             8.6            11.0         + / + / +
Monk-2          45.8             0.1             1.0         + / – / +
Monk-3          14.7             1.3             0.3         + / + / –
Soybean         -0.3             0.0             0.0         – / – / –
Thyroid          0.4             0.8            -0.3         – / – / –
Tic             30.0            20.0             3.4         + / + / +
Vehicle          8.4             7.3             4.1         + / + / +
Voting           2.8             0.2             0.0         + / – / –
Zoo              0.5             1.2             0.1         – / + / –
Avg              5.8             2.6             0.4

However, it can also be seen from Table 4 that this rule does not always hold. For example, on the Iris data set FE improves the classification accuracy of kNN and DIC outperforms static integration; nevertheless, FEDIC does not improve on DIC in the space of original features. The situation is similar with the Lymph and MONK-3 data sets. The reason for this behavior is that both the meta-level learning process in dynamic integration and the base learning process in the base classifiers use the same feature space, although the output values in these two learning tasks are different (the local classification errors and the classes themselves, respectively). Thus, the feature space is the same, but the output values to be predicted are different, which explains why the influence of feature extraction on the accuracy of dynamic integration still differs to a certain degree from its influence on the accuracy of a single classifier.

A positive surprise was that in one case, on the MONK-2 data set, FEDIC outperformed DIC in the original space (as FE with kNN was better than plain kNN) even though DIC was not significantly better than static integration.

These six data sets where FEDIC outperforms DIC were selected for further analysis. For these data sets we pairwise compared each FEDIC technique with the others and with static integration using the paired Student t-test with the .95 level of significance. The results of the comparison are given in Table 5. Columns 2-6 of the table contain the results of comparing the technique corresponding to the row of a cell with the technique corresponding to the column; each cell contains the win/tie/loss counts according to the t-test. For example, PCA has 3 wins, 1 tie and 1 loss against the parametric extraction on five data sets.

It can be seen from Table 5 that different FE methods may result in better performance of the dynamic integration of classifiers.

Table 5. Results of the paired t-test (win/tie/loss information) for the six data sets on which FEDIC outperforms plain dynamic integration

              PCA_DIC  Par_DIC  Nonpar_DIC  Plain_DIC   SIC
PCA_DIC          –      3/1/1     2/0/3       4/1/0    4/1/0
Par_DIC        1/1/3      –       0/2/3       2/2/1    3/1/1
Nonpar_DIC     3/0/2    3/2/0       –         4/1/0    5/0/0
Plain_DIC      0/1/4    1/2/2     0/1/4         –      1/3/1
SIC            0/1/4    1/1/3     0/0/5       1/3/1      –

In Figure 12 below we further analyse the FEDIC accuracy results on the Car, Liver, MONK-1, MONK-2, Tic-Tac-Toe and Vehicle data sets. It can be seen that the behaviour of the different approaches differs from one data set to another.

On the Car data set (Figure 12, a), each dynamic integration procedure in the space of extracted features is better than the corresponding procedure in the space of original features. Integration methods with parametric and nonparametric FE behave in a very similar way, and these class-conditional approaches give a further significant increase in accuracy compared to PCA. This means that PCA disregards class information that is important for this data set (recall Figure 3). DS and DVS work better than DV in both the original and the extracted spaces. DS is better than DVS when applied in the original space and in the feature space extracted by PCA; however, this difference is not significant for the feature spaces extracted by the class-conditional approaches.

The situation with the Liver data set (Figure 12, b) is the opposite of the Car data set. The class-conditional approaches do not improve the work of dynamic integration, and DS with parametric and nonparametric FE even shows significantly lower accuracy. PCA, on the contrary, significantly increases the accuracy: by 1% for DV (the best integration method both in the original space and in the feature space extracted by PCA), 4% for DS and 3.5% for DVS.

For both the MONK-1 and MONK-2 problems (Figure 12, c and d), the good accuracy results are mainly due to nonparametric FE, with a corresponding average increase of accuracy of about 10%. Parametric FE shows the worst results with every integration method, and PCA provides no significant change for dynamic integration.

The behaviour of the feature extraction approaches on the Tic-Tac-Toe data set (Figure 12, e) is similar to their behaviour on the Liver data set: the class-conditional approaches, especially parametric FE, deteriorate the work of dynamic integration, while PCA, on the contrary, significantly increases the accuracy of every integration method.

On the Vehicle data set (Figure 12, f), PCA and nonparametric FE show similar results, but the parametric approach extracts too few features and fails because of this – its results are even worse than those of DIC in the original space. DVS is the best integration method here and, together with PCA or nonparametric FE, gives a 4% increase of accuracy.


[Figure 12: six bar charts, one per data set – a) Car, b) Liver, c) Monk-1, d) Monk-2, e) Tic-Tac-Toe, f) Vehicle – showing the accuracy of each integration method (ds, dv, dvs) in the original feature space (plain) and in the spaces extracted by PCA (pca), parametric FE (par) and nonparametric FE (npar).]

Figure 12 – FEDIC accuracy results for Car, Liver, Monk-1, Monk-2, Tic-Tac-Toe and Vehicle data sets

In Figure 13 below we grouped the same accuracy results by integration method (panels a-c) and by feature extraction method (panels d-f). For every integration method DS, DV and DVS (Figure 13, a-c), PCA is twice significantly better than the other FE approaches (on the Liver and Tic-Tac-Toe data sets); parametric FE is twice significantly worse than PCA, nonparametric FE and DS in the space of original features (on the Monk-1 and Tic-Tac-Toe data sets) and once significantly worse than PCA and nonparametric FE (on the Vehicle data set). Nonparametric FE was twice significantly better (on the Monk-1 and Monk-2 data sets) for DS, and once (on the Monk-1 data set) for DV and DVS. The class-conditional FE approaches work well on the Car data set but work badly (with DS) or slightly worse (with DV and DVS) on the Liver data set. Nevertheless, we can see that the patterns of behaviour of the FE approaches are very similar with regard to any integration method.

[Figure 13: six bar charts over the Car, Liver, MONK-1, MONK-2, Tic-Tac-Toe and Vehicle data sets – a) Dynamic Selection, b) Dynamic Voting, c) Dynamic Voting with Selection, each comparing the plain, pca, par and npar feature spaces; d) PCA, e) parametric FE, f) nonparametric FE, each comparing the ds, dv and dvs integration methods.]

Figure 13 – Accuracy results on the Car, Liver, Monk-1, Monk-2, Tic-Tac-Toe and Vehicle data sets grouped by integration method (a-c) and by feature extraction method (d-f).

For PCA (Figure 13, d), DV was three times the worst (on the Car, Monk-1, and Tic-Tac-Toe data sets) and never significantly better than DS and DVS. DS with PCA was the best on the Monk-1 data set and DVS with PCA was the best on the Vehicle data set. For parametric FE (Figure 13, e), DV was significantly better than DS on the Liver data set; otherwise there is no big difference in the accuracy results with respect to the selected integration procedure. For nonparametric FE (Figure 13, f), DV was the worst on the Car, Monk-1 and Tic-Tac-Toe data sets, and was slightly better than DS and DVS on the Liver data set. DS was the best on the Monk-1 and Tic-Tac-Toe data sets, and DVS is slightly better than DS and DV on the Vehicle data set.


Since our main assumption about the benefit of applying DIC in the space of extracted features was the enhancement of the neighborhood, we analysed how the neighborhood changes by means of FE. For each test instance we measured what proportion of the selected nearest neighbors belongs to the same class as the true class of the test instance. We calculated such proportions for 1, 3, 7, 15, 31 and 63 neighbors, i.e. the same set of values that was used in the FEDIC experiments. A summary of these results is presented in Figure 14, and all the results are given in Appendix B. We noticed a complete correspondence with the accuracy results in Table 4, irrespective of the number of nearest neighbors used by DIC. FE has no effect on the Breast, Diabetes, Glass, Iris, Soybean, Thyroid, and Zoo data sets. There is a small negative effect of FE on the LED data set, and a strong positive effect on the Balance, Car, LED17, Monk-1, Monk-2, Monk-3, Tic-Tac-Toe, and Vehicle data sets. The effect of FE on the Ionosphere, Liver, Lymph, and Voting data sets is not as strong, but still significant for the resulting proportion of instances belonging to the same class as a test instance. However, for FEDIC there is no effect of FE on the Balance, Heart, Ionosphere, LED17, Monk-3 and Voting data sets. We think this can be explained by the fact that these data sets are easy to learn for DIC: the same accuracy results are already achieved by DIC in the space of original features, so there is no room for further accuracy improvement.

[Figure 14: bar chart over the balance, car, led, led17, liver, lymph, monk-1, monk-2, monk-3, tic, vehicle and vote data sets, comparing the same-class neighbor proportion (0.2-1.0) for the Plain, PCA, Par and NPar feature spaces.]

Figure 14 – Proportion of the nearest neighbours having the same class label as the true class label of a new test instance, relative to the total number of nearest neighbours, averaged over the test set instances.
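The neighborhood quality measure plotted in Figure 14 can be sketched as below (an illustrative fragment; the Euclidean distance stands in for the distance actually used in the given feature space, which is an assumption of the sketch):
_________________________________________________________
import numpy as np

def same_class_proportion(X_train, y_train, X_test, y_test, nn):
    # For each test instance, the fraction of its nn nearest training
    # neighbors that carry the instance's true class label; the result
    # is averaged over the test set.
    props = []
    for x, c in zip(X_test, y_test):
        d = np.linalg.norm(X_train - x, axis=1)
        idx = np.argsort(d)[:nn]
        props.append(np.mean(y_train[idx] == c))
    return float(np.mean(props))
_________________________________________________________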

Besides the accuracy results, we also collected information about the threshold values that correspond to the highest accuracy, and about the average number of features extracted at those threshold values. In Table 7 we present the average threshold value and the average number of extracted features for the six data sets where FEDIC outperforms DIC in the original space.

Table 7 – The average threshold value and average number of features extracted for the six data sets where FEDIC outperforms DIC in the original space

                 3-NN                 PCA               Par               NPar
        PCA   Par   NPar     ds    dv    dvs     ds    dv    dvs     ds    dv    dvs
thresh  0.89  0.83  0.90    0.91  0.91  0.91    0.83  0.82  0.83    0.93  0.93  0.95
feat    11.8  1.7   4.7     11.7  11.7  11.7    1.7   1.5   1.7     6.4   6.4   6.7


The parametric approach extracted the smallest number of features and was the least time-consuming approach. The nonparametric approach extracts more features due to its nonparametric nature, yet it was still less time-consuming than PCA and than classification in the space of original features. We should also point out that feature extraction speeds up dynamic integration in the same way as it speeds up a single classifier. This is as one would expect, since the nearest-neighbor search for the prediction of local accuracies is the most time-consuming part of the application phase in dynamic integration, and it uses the same feature space as a single nearest-neighbor classifier does. Moreover, the two nearest-neighbor search processes (in dynamic integration and in a base classifier) are completely identical and differ only in the number of nearest neighbors used to define the neighborhood.

The threshold values and numbers of extracted features are very similar in base-level classification (3-NN) and in DIC. However, we found that the nonparametric FE approach extracts more features for DIC than for the single 3-NN classifier, and that for dynamic voting with selection the largest number of features is extracted on average.

6 Conclusion

Feature extraction as a dimensionality reduction technique helps to overcome the problems related to the "curse of dimensionality" with respect to the dynamic integration of classifiers. The experiments on 21 UCI data sets showed that the DIC approaches based on the plain feature sets had worse results in comparison to the results obtained using the FEDIC algorithm. This supports the fact that dimensionality reduction can often enhance a classification model.

The results showed that the proposed FEDIC algorithm outperforms the dynamic schemes on plain features usually only on those data sets on which feature extraction for classification with a single classifier provides better results than classification on plain features. When we analyzed this dependency further, we came to the conclusion that feature extraction influenced the accuracy of dynamic integration in most cases in a similar manner as it influenced the accuracy of a single classifier.

The nonparametric approach and PCA were the best on average; the parametric approach was the worst. It is necessary to note, however, that the parametric approach produces the most compact space of new features, so even when its resulting accuracy is not higher than that of PCA or the nonparametric approach, parametric FE can still be beneficial from the performance point of view.

PCA for DIC showed stable results, as it did with the 3-NN classifier. Surprisingly, PCA was as good as the nonparametric approach for DIC, while its accuracy results were significantly worse than those of the nonparametric approach when applied with the 3-NN classifier.

We performed deeper analyses for those six data sets where FEDIC significantly outperforms dynamic integration in the space of original features.

Further research is needed to define the dependencies between the characteristics of a data set and the type and parameters of the feature extraction approach that best suits it.

Some of the most important issues for future research raised by this work include how the algorithm could automatically determine whether FE is beneficial for the data set at hand, how the most suitable feature extraction method can be chosen from the characteristics of the input data, and what the optimal parameter settings for the selected feature extraction method are. Also of interest are how the most appropriate dynamic integration scheme could be automatically identified, and what the optimal number of nearest neighbors in DIC is.

An improvement in the accuracy results could be achieved if the class-conditional FE techniques extracted new features with respect to the base classifiers' errors rather than with respect to the initial class-label information of the instances (as given in the training set).

Another interesting direction for further research is to consider whether FE should be applied globally (over the entire instance space) or locally (differently in different parts of the instance space). Despite being globally high-dimensional and sparse, data distributions in some domain areas are locally low-dimensional and dense, e.g. in physical movement systems (Vijayakumar & Schaal, 1997).


Acknowledgments: This research is partly supported by COMAS Graduate School of the University of Jyväskylä, Finland and Science Foundation Ireland. We would like to thank the UCI ML repository of databases, domain theories and data generators for the data sets, and the MLC++ library for the source code used in this study.

References

1. Aivazyan, S.A. Applied Statistics: Classification and Dimension Reduction. Finance and Statistics, Moscow (1989).
2. Aladjem, M. Parametric and nonparametric linear mappings of multidimensional data. Pattern Recognition 24(6) (1991), 543-553.
3. Bauer, E., Kohavi, R. An empirical comparison of voting classification algorithms: bagging, boosting, and variants. Machine Learning 36(1-2) (1999), 105-139.
4. Bellman, R. Adaptive Control Processes: A Guided Tour. Princeton University Press (1961).
5. Blake, C.L., Merz, C.J. UCI repository of machine learning databases. Dept. of Information and Computer Science, University of California, Irvine, CA (1998).
6. Brodley, C., Lane, T. Creating and exploiting coverage and diversity. In: Proc. AAAI-96 Workshop on Integrating Multiple Learned Models, Portland, OR (1996), 8-14.
7. Cordella, L.P., Foggia, P., Sansone, C., Tortorella, F., Vento, M. Reliability parameters to improve combination strategies in multi-expert systems. Pattern Analysis and Applications 2(3) (1999), 205-214.
8. Cost, S., Salzberg, S. A weighted nearest neighbor algorithm for learning with symbolic features. Machine Learning 10(1) (1993), 57-78.
9. Dietterich, T.G., Bakiri, G. Solving multiclass learning problems via error-correcting output codes. Journal of Artificial Intelligence Research 2 (1995), 263-286.
10. Dietterich, T.G. Machine learning research: four current directions. AI Magazine 18(4) (1997), 97-136.
11. Fayyad, U.M. Data mining and knowledge discovery: making sense out of data. IEEE Expert 11(5) (1996), 20-25.
12. Fukunaga, K. Introduction to Statistical Pattern Recognition. Academic Press, London (1999).
13. Gama, J.M.P. Combining Classification Algorithms. PhD thesis, Dept. of Computer Science, University of Porto, Portugal (1999).
14. Giacinto, G., Roli, F. Methods for dynamic classifier selection. In: Proc. ICIAP '99, 10th Int. Conf. on Image Analysis and Processing, IEEE CS Press, Los Alamitos, CA (1999), 659-664.
15. Hall, M.A. Correlation-based feature selection of discrete and numeric class machine learning. In: Proc. Int. Conf. on Machine Learning (ICML-2000), Morgan Kaufmann, San Francisco, CA (2000), 359-366.
16. Hastie, T., Tibshirani, R. Discriminant adaptive nearest-neighbor classification. IEEE Transactions on Pattern Analysis and Machine Intelligence 18(6) (1996), 607-616.
17. Heath, D., Kasif, S., Salzberg, S. Committees of decision trees. In: B. Gorayska, J. Mey (eds.), Cognitive Technology: In Search of a Humane Interface, Elsevier Science (1996), 305-317.
18. Jolliffe, I.T. Principal Component Analysis. Springer, New York, NY (1986).
19. Kambhatla, N., Leen, T.K. Dimension reduction by local principal component analysis. Neural Computation 9 (1997), 1493-1516.
20. Kohavi, R. A study of cross-validation and bootstrap for accuracy estimation and model selection. In: C. Mellish (ed.), Proc. 14th Int. Joint Conf. on Artificial Intelligence IJCAI-95, Morgan Kaufmann, San Francisco, CA (1995), 1137-1145.
21. Kohavi, R. Wrappers for Performance Enhancement and Oblivious Decision Graphs. PhD thesis, Dept. of Computer Science, Stanford University, Stanford, USA (1995).
22. Kohavi, R., Sommerfield, D., Dougherty, J. Data mining using MLC++: a machine learning library in C++. In: M.G. Radle (ed.), Proc. 8th IEEE Conf. on Tools with Artificial Intelligence, IEEE CS Press, Los Alamitos, CA (1996), 234-245.
23. Koppel, M., Engelson, S. Integrating multiple classifiers by finding their areas of expertise. In: AAAI-96 Workshop on Integrating Multiple Learning Models for Improving and Scaling Machine Learning Algorithms, Portland, OR (1996), 53-58.
24. Krzanowski, W.J., Marriott, F.H.C. Multivariate Analysis Part 2: Classification, Covariance Structures and Repeated Measurements. Edward Arnold, London (1994).
25. Liu, H. Feature Extraction, Construction and Selection: A Data Mining Perspective. Kluwer Academic Publishers (1998). ISBN 0-7923-8196-3.
26. Merz, C.J. Dynamical selection of learning algorithms. In: D. Fisher, H.-J. Lenz (eds.), Learning from Data: Artificial Intelligence and Statistics, Springer-Verlag, NY (1996).
27. Merz, C.J. Using correspondence analysis to combine classifiers. Machine Learning 36(1-2) (1999), 33-58.
28. Opitz, D. Feature selection for ensembles. In: Proc. 16th National Conf. on Artificial Intelligence, AAAI Press (1999), 379-384.
29. Opitz, D., Shavlik, J. Generating accurate and diverse members of a neural network ensemble. In: D. Touretzky, M. Mozer, M. Hassemo (eds.), Advances in Neural Information Processing Systems 8, MIT Press (1996), 535-541.
30. Opitz, D., Maclin, D. Popular ensemble methods: an empirical study. Journal of Artificial Intelligence Research 11 (1999), 169-198.
31. Oza, N.C., Tumer, K. Dimensionality Reduction Through Classifier Ensembles. Technical report NASA-ARC-IC-1999-124, Computational Sciences Division, NASA Ames Research Center, Moffett Field, CA (1999).
32. Puuronen, S., Tsymbal, A. Local feature selection with dynamic integration of classifiers. Fundamenta Informaticae 47(1-2), special issue "Intelligent Information Systems" (2001), 91-117.
33. Puuronen, S., Terziyan, V., Tsymbal, A. A dynamic integration algorithm for an ensemble of classifiers. In: Z.W. Ras, A. Skowron (eds.), Foundations of Intelligent Systems: ISMIS'99, Lecture Notes in AI, Vol. 1609, Springer-Verlag, Warsaw (1999), 592-6.
34. Quinlan, J.R. Bagging, boosting, and C4.5. In: Proc. 13th National Conf. on Artificial Intelligence AAAI-96, Portland, OR, AAAI Press (1996), 725-730.
35. Quinlan, J.R. C4.5: Programs for Machine Learning. Morgan Kaufmann, San Mateo, CA (1993).
36. Todorovski, L., Dzeroski, S. Combining multiple models with meta decision trees. In: D.A. Zighed, J. Komorowski, J. Żytkow (eds.), Principles of Data Mining and Knowledge Discovery, Proc. PKDD 2000, Lyon, France, LNAI 1910, Springer (2000), 54-64.
37. Tsymbal, A. Dynamic Integration of Data Mining Methods in Knowledge Discovery Systems. PhD thesis, University of Jyväskylä, Finland (2002).
38. Tsymbal, A., Puuronen, S., Pechenizkiy, M., Baumgarten, M., Patterson, D. Eigenvector-based feature extraction for classification. In: Proc. 15th Int. FLAIRS Conference on Artificial Intelligence, Pensacola, FL, USA, AAAI Press (2002), 354-358.
39. Tsymbal, A., Puuronen, S., Skrypnyk, I. Ensemble feature selection with dynamic integration of classifiers. In: Int. ICSC Congress on Computational Intelligence Methods and Applications CIMA'2001, Bangor, Wales, UK (2001).
40. Tsymbal, A., Puuronen, S., Patterson, D. Ensemble feature selection with the simple Bayesian classification. Information Fusion, Special Issue "Fusion of Multiple Classifiers", 4(2) (2003), 87-100.
41. Vijayakumar, S., Schaal, S. Local dimensionality reduction for locally weighted learning. In: Proc. 1997 IEEE Int. Symposium on Computational Intelligence in Robotics and Automation (1997), 220-225.
42. Webb, G.I. MultiBoosting: a technique for combining boosting and wagging. Machine Learning 40(2) (2000), 159-196.
43. William, D.R., Goldstein, M. Multivariate Analysis: Methods and Applications. John Wiley & Sons (1984). ISBN 0-471-08317-8.
44. Wilson, D.R., Martinez, T.R. Improved heterogeneous distance functions. Journal of Artificial Intelligence Research 6(1) (1997), 1-34.
45. Wolpert, D. Stacked generalization. Neural Networks 5 (1992), 241-259.
46. Woods, K., Kegelmeyer, W.P., Bowyer, K. Combination of multiple classifiers using local accuracy estimates. IEEE Transactions on Pattern Analysis and Machine Intelligence 19(4) (1997), 405-410.


Appendix A – The basic accuracy results of the experiments on 21 UCI data sets

[Table: for each of the 21 UCI data sets, the accuracy of DIC in the original feature space (Plain) and of FEDIC with PCA, Par and NPar feature extraction (ds, dv and dvs for each), together with static integration (ss, wv) and the single Bayes classifier; the table layout could not be recovered from the extracted text.]


Appendix B – Proportion of the nearest neighbours with the same class label as the true class label of a new test instance, relative to the total number of nearest neighbours, averaged over the test set instances

[Table: for each data set and for each feature space (Plain, PCA, Par, NPar), the same-class neighbor proportions for 1, 3, 7, 15, 31 and 63 nearest neighbours; the table layout could not be recovered from the extracted text.]


VIII

FEATURE EXTRACTION FOR CLASSIFICATION IN KNOWLEDGE DISCOVERY SYSTEMS

Pechenizkiy M., Puuronen S., Tsymbal A. 2003. In: V. Palade, R.J. Howlett, L.C. Jain (Eds.), Proc. 7th Int. Conf. on Knowledge-Based Intelligent Information & Engineering Systems KES'2003, Lecture Notes in Artificial Intelligence, Vol. 2773, Heidelberg: Springer-Verlag, 526-532. With permission from Springer Science.


Feature Extraction for Classification in Knowledge Discovery Systems

Mykola Pechenizkiy1, Seppo Puuronen1, and Alexey Tsymbal2

1 University of Jyväskylä, Department of Computer Science and Information Systems, P.O. Box 35, FIN-40351, University of Jyväskylä, Finland
{mpechen, sepi}@cs.jyu.fi
2 Trinity College Dublin, Department of Computer Science, College Green, Dublin 2, Ireland
[email protected]

Abstract. Dimensionality reduction is a very important step in the data mining process. In this paper, we consider feature extraction for classification tasks as a technique to overcome problems occurring because of "the curse of dimensionality". We consider three different eigenvector-based feature extraction approaches for classification. The summary of obtained results concerning the accuracy of classification schemes is presented and the issue of search for the most appropriate feature extraction method for a given data set is considered. A decision support system to aid in the integration of the feature extraction and classification processes is proposed. The goals and requirements set for the decision support system and its basic structure are defined. The means of knowledge acquisition needed to build up the proposed system are considered.

1 Introduction

Data mining applies data analysis and discovery algorithms to discover information from vast amounts of data. A typical data-mining task is to predict an unknown value of some attribute of a new instance when the values of the other attributes of the new instance are known and a collection of instances with known values of all the attributes is given. In many applications, data, which is the subject of analysis and processing in data mining, is multidimensional, and presented by a number of features. The so-called "curse of dimensionality" pertinent to many learning algorithms denotes the drastic rise of computational complexity and classification error with data having a large number of dimensions [2]. Hence, the dimensionality of the feature space is often reduced before classification is undertaken.

Feature extraction (FE) is one of the dimensionality reduction techniques. FE extracts a subset of new features from the original feature set by means of some functional mapping, keeping as much information in the data as possible [5]. Conventional Principal Component Analysis (PCA) is one of the most commonly used feature extraction techniques. PCA extracts the axes on which the data shows the highest variability [7]. There exist many variations of PCA that use local and/or non-linear processing to improve dimensionality reduction, though they generally do not use class information [9].

In our research, beside PCA, we consider also two eigenvector-based approaches that use the within- and between-class covariance matrices and thus do take into account the class information. We analyse them with respect to the task of classification with regard to the learning algorithm being used and to the dynamic integration of classifiers (DIC).

During the last years data mining has evolved from less sophisticated first-generation techniques to today's cutting-edge ones. Currently there is a growing need for next-generation data mining systems to manage knowledge discovery applications. These systems should be able to discover knowledge by combining several available techniques, and provide a more automatic environment, or an application envelope, surrounding this highly sophisticated data mining engine [4].

In this paper we consider a decision support system (DSS) approach that is based on the methodology used in expert systems (ES). The approach combines feature extraction techniques with different classification tasks. The main goal of such a system is to automate as far as possible the selection of the most suitable feature extraction approach for a certain classification task on a given data set according to a set of criteria.

In the next sections we consider the feature extraction process for classification and present the summary of achieved results. Then we consider a decision support system that integrates the feature extraction and classification processes, describing its goals, requirements, structure, and the ways of knowledge acquisition. As a summary, the obtained preliminary results are discussed and the focus of further research is described.

2 Eigenvector-Based Feature Extraction

Generally, feature extraction for classification can be seen as a search, among all possible transformations of the feature set, for the best one that preserves class separability as much as possible in the space with the lowest possible dimensionality [5]. In other words, we are interested in finding a projection w:

y = w^T x                                                    (1)

where y is a p'×1 transformed data point (presented using p' features), w is a p×p' transformation matrix, and x is a p×1 original data point (presented using p features).

In [10] it was shown that the conventional PCA transforms the original set of features into a smaller subset of linear combinations that account for most of the variance of the original data set. Although it is the most popular feature extraction technique, it has a serious drawback: the conventional PCA gives high weights to features with higher variabilities irrespective of whether they are useful for classification or not. This may give rise to a situation where the chosen principal component corresponds to the attribute with the highest variability but without any discriminating power.

A usual approach to overcome the above problem is to use some class separability criterion [1], e.g. the criteria defined in Fisher linear discriminant analysis and based on the family of functions of scatter matrices:

J(w) = (w^T S_B w) / (w^T S_W w)                             (2)

where S_B is the between-class covariance matrix that shows the scatter of the expected vectors around the mixture mean, and S_W is the within-class covariance matrix that shows the scatter of samples around their respective class expected vectors.

A number of other criteria were proposed in [5]. Both parametric and nonparametric approaches optimize criterion (2) by using the simultaneous diagonalization algorithm [5].
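As a sketch, the simultaneous diagonalization of S_W and S_B amounts to solving the generalized eigenproblem S_B w = λ S_W w; the fragment below (an illustrative NumPy/SciPy rendering, assuming S_W is positive definite) returns the k directions maximizing criterion (2):
_________________________________________________________
import numpy as np
from scipy.linalg import eigh

def scatter_matrices(X, y):
    # S_W: scatter of samples around their class means;
    # S_B: scatter of the class means around the mixture mean.
    p = X.shape[1]
    m = X.mean(axis=0)
    S_W, S_B = np.zeros((p, p)), np.zeros((p, p))
    for c in np.unique(y):
        Xc = X[y == c]
        mc = Xc.mean(axis=0)
        S_W += (Xc - mc).T @ (Xc - mc)
        S_B += len(Xc) * np.outer(mc - m, mc - m)
    return S_W, S_B

def fisher_directions(X, y, k):
    # Generalized symmetric eigenproblem S_B w = lambda S_W w; the k
    # leading eigenvectors maximize J(w) of criterion (2).
    S_W, S_B = scatter_matrices(X, y)
    eigvals, eigvecs = eigh(S_B, S_W)      # ascending eigenvalues
    return eigvecs[:, ::-1][:, :k]         # top-k projection directions w
_________________________________________________________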

In [11] we analyzed the task of eigenvector-based feature extraction for classification in general; a 3NN classifier was used as an example. The experiments were conducted on 21 data sets from the UCI machine learning repository [3]. The experimental results supported our expectations: classification without feature extraction produced clearly the worst results. This shows the so-called "curse of dimensionality" with the considered data sets and the classifier, supporting the necessity to apply some kind of feature extraction in that context. In the experiments, the conventional PCA was the worst feature extraction technique on average. The nonparametric technique was only slightly better than the parametric one on average; however, the nonparametric technique performed much better on categorical data.

Still, it is necessary to note that each feature extraction technique was significantly worse than all the other techniques on at least one data set. Thus, among the tested techniques there does not exist "the overall best" one for classification with regard to all given data sets.

3 Managing Feature Extraction and Classification Processes

Currently, as far as we know, there is no feature extraction technique that would be the best for all data sets in the classification task. Thus the adaptive selection of the most suitable feature extraction technique for a given data set needs further research. There does not exist canonical knowledge, a perfect mathematical model, or any relevant tool to select the best extraction technique. Instead, a volume of accumulated empirical findings, some trends, and some dependencies have been discovered.

We consider the possibility of taking advantage of the discovered knowledge by developing a decision support system, based on the methodology of expert system design [6], to help manage the data mining process. The main goal of the system is to recommend the best-suited feature extraction method and classifier for a given data set. Achieving this goal would produce a great benefit, because it might be possible to reach the performance of the wrapper approach while using the filter approach. In the wrapper approach there is interaction between the feature selection process and the construction of the classification model, and parameter tuning is needed for every stage and every method. In the filter approach the evaluation process is independent of the learning algorithm, and the methods and their parameters are selected in advance according to a certain set of criteria. However, the additional goal of predicting a model's output performance requires further consideration.
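As an illustration of this distinction, the following is a minimal sketch (all names and the scoring functions are hypothetical placeholders, not the system's actual code) of the two evaluation regimes:

import java.util.List;
import java.util.function.BiFunction;

/** Sketch contrasting the two selection regimes described above. */
public class WrapperVsFilter {

    /** Filter: score a feature subset from the data alone,
     *  e.g. by a class-separability criterion such as (2). */
    static double filterScore(double[][] x, int[] y, List<Integer> subset) {
        // Placeholder criterion: favours small subsets (a stand-in for a
        // real measure such as J(w) or mutual information).
        return 1.0 / (1 + subset.size());
    }

    /** Wrapper: score a subset by actually training and cross-validating
     *  the target classifier on it (expensive, classifier-specific). */
    static double wrapperScore(double[][] x, int[] y, List<Integer> subset,
                               BiFunction<double[][], int[], Double> cvAccuracy) {
        double[][] reduced = new double[x.length][subset.size()];
        for (int r = 0; r < x.length; r++)
            for (int c = 0; c < subset.size(); c++)
                reduced[r][c] = x[r][subset.get(c)];
        return cvAccuracy.apply(reduced, y);   // e.g. 10-fold CV of a 3NN
    }

    public static void main(String[] args) {
        double[][] x = { { 1, 0, 5 }, { 2, 1, 4 } };
        int[] y = { 0, 1 };
        List<Integer> subset = List.of(0, 2);
        System.out.println("filter  = " + filterScore(x, y, subset));
        // Dummy stand-in for a real cross-validated classifier accuracy.
        System.out.println("wrapper = " + wrapperScore(x, y, subset, (xx, yy) -> 0.5));
    }
}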

The "heart" of the system is the Knowledge Base (KB), which contains a set of facts about the domain area and a set of rules in symbolic form describing the logical references between a concrete classification problem and recommendations about the best-suited model for that problem. The Vocabulary of the KB contains lists of terms that include feature extraction methods and their input parameters, classifiers and their input and output parameters, and three types of data set characteristics: simple measures, such as the number of instances, the number of attributes, and the number of classes; statistical measures, such as the departure from normality, correlation within attributes, and the proportion of total variation explained by the first k canonical discriminants; and information-theoretic measures, such as the noisiness of attributes, the number of irrelevant attributes, and the mutual information of class and attribute.

Filling in the knowledge base is among the most challenging tasks related to the development of the DSS. There are two potential sources of knowledge for the proposed system. The first is the background theory of the feature extraction and classification methods, and the second is the set of field experiments. The theoretical knowledge can be formulated and represented by an expert in the area of specific feature extraction methods and classification schemes. Generally, it is possible to categorise the facts and rules that will be present in the Knowledge Base. The categorisation can be done according to the way the knowledge has been obtained: from the analysis of experimental results or from the domain theory. Another categorisation criterion is the level of confidence of a rule. The expert may be sure of a certain fact, or may merely hypothesize about another. In a similar way, a distinction can be made between a rule that has just been generated from the analysis of experiments on artificially generated data sets but has never been verified on real-world data sets, and a rule that has been verified on a number of real-world problems. In addition, this "trust"-based categorisation of the rules makes it possible to adapt the system to a concrete researcher's needs and preferences by giving higher weights to the rules that originate from that user.
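As an illustration, here is a minimal sketch (all names, the trust weighting and the example rule are hypothetical design choices, not the implemented system) of how such a categorised rule could be represented:

/** Sketch of a KB rule with provenance and trust level. */
public class KbRule {
    enum Source { DOMAIN_THEORY, ARTIFICIAL_EXPERIMENTS, REAL_WORLD_EXPERIMENTS }

    final String condition;      // a predicate over data set characteristics
    final String recommendation; // a feature extractor / classifier pair
    final Source source;
    final double trust;          // higher for expert-confirmed or verified rules
    final double userWeight;     // lets a researcher promote his or her own rules

    KbRule(String condition, String recommendation,
           Source source, double trust, double userWeight) {
        this.condition = condition;
        this.recommendation = recommendation;
        this.source = source;
        this.trust = trust;
        this.userWeight = userWeight;
    }

    double effectiveWeight() { return trust * userWeight; }

    public static void main(String[] args) {
        // Hypothetical rule echoing the finding reported in Section 4 below.
        KbRule r = new KbRule(
            "noisy attributes OR many missing values",
            "prefer supervised (parametric/nonparametric) FE over PCA",
            Source.ARTIFICIAL_EXPERIMENTS, 0.6, 1.0);
        System.out.println(r.condition + " -> " + r.recommendation
            + " (weight " + r.effectiveWeight() + ")");
    }
}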

4 Knowledge Acquisition from the Experiments

Generally, the knowledge base is a dynamic part of the decision support system that can be supplemented and updated through the knowledge acquisition and knowledge refinement processes [6].

A potential contribution of knowledge to be included into the KB might be found by discovering a number of criteria from experiments conducted on artificially generated data sets with pre-defined characteristics. The results of the experiments can be examined by looking at the dependencies between the characteristics of a data set in general and the characteristics of every local partition of the instance space in particular. Further, the type and parameters of the feature extraction approach best suited for the data set will help to define a set of criteria that can be applied for the generation of the rules of the KB.

The results of our preliminary experiments support that approach. The artificially generated data sets were manipulated by changing the amount of irrelevant attributes, the level of noise in the relevant attributes, the ratio of correlation among the attributes, and the normality of the class distributions. In the experiments, supervised feature extraction (both the parametric and nonparametric approaches) performed better than the conventional PCA when noise was introduced to the data sets. A similar trend was found when the artificial data sets contained missing values. This finding was supported by the results of experiments on the LED17, Monk-3 and Voting UCI data sets (Table 1), which are known to contain irrelevant attributes, noise in the attributes and plenty of missing values. Thus, this criterion can be included in the KB and used to give preference to supervised methods when there is noise or there are missing values in a data set. Nonparametric feature extraction essentially outperforms the parametric approach on data sets that include significantly nonnormal class distributions and are not easy to learn. This initial knowledge about the nature of the parametric and nonparametric approaches and the results on artificial data sets were supported by the results of experiments on the Monk-1 and Monk-2 UCI data sets (Table 1).

Table 1. Accuracy results of the experiments

Dataset    PCA    Par    NPar   Plain
LED17      .395   .493   .467   .378
MONK-1     .767   .687   .952   .758
MONK-2     .717   .654   .962   .504
MONK-3     .939   .990   .990   .843
Voting     .923   .949   .946   .921

5 Discussions

So far we have not found a simple correlation-based criterion to separate the situations when a feature extraction technique would be beneficial for classification. Nevertheless, we found that there exists a trend between the correlation ratio in a data set and the threshold level used in every feature extraction method to address the amount of variation in the data set explained by the selected extracted features. This finding helps in the selection of the initial threshold value as a starting point in the search for the optimal threshold value. However, further research and experiments are required to check these findings.
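For reference, a minimal sketch (assuming the eigenvalues have already been computed and sorted by the chosen FE method; the values below are made up) of selecting the number of extracted features for a given variance threshold:

/** Sketch: choose the smallest number of extracted features whose
 *  eigenvalues explain at least `threshold` of the total variation. */
public class VarianceThreshold {

    static int componentsFor(double[] eigenvalues, double threshold) {
        double total = 0, cum = 0;
        for (double e : eigenvalues) total += e;
        for (int k = 0; k < eigenvalues.length; k++) {
            cum += eigenvalues[k];
            if (cum / total >= threshold) return k + 1;
        }
        return eigenvalues.length;
    }

    public static void main(String[] args) {
        double[] eig = { 4.2, 2.1, 0.9, 0.5, 0.3 };   // hypothetical, sorted
        System.out.println(componentsFor(eig, 0.85)); // -> 3
    }
}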

One of our further goals is to make the knowledge acquisition process semiautomatic, using the possibility of deriving new rules and updating old ones based on the analysis of results obtained during self-run experimenting. This process will include generating artificial data sets with known characteristics (simple, statistical and information-theoretic measures); running the experiments on the generated artificial data sets; deriving dependencies and defining criteria from the obtained results and updating the knowledge base; validating the constructed theory with a set of experiments on real-world data sets; and reporting on the success or failure of certain rules.

We consider a decision tree learning algorithm as a means of automatic rule extraction for the knowledge base. Decision tree learning is one of the most widely used inductive learning methods [12]. A decision tree is represented as a set of nodes and arcs. Each node contains a feature (an attribute), and each arc leaving the node is labelled with a particular value (or range of values) of that feature. Together, a node and the arcs leaving it represent a decision about the path an example follows when being classified by the tree. Given a set of training examples, a decision tree is induced in a "top-down" fashion by repeatedly dividing up the examples according to their values of a particular feature.

In this context, the data set characteristics mentioned above and a classification model's outputs, which include accuracy, sensitivity, specificity, time complexity and so on, represent the instance space, while the combination of a feature extraction method's and a classification model's names, with their parameter values, represents the class labels. By analysing the tree branches it is possible to generate "if-then" rules for the knowledge base. A rule reflects a certain relationship between meta-data-set characteristics and a combination of a feature extraction method and a classification model.
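As an illustration, a minimal sketch (with a toy, hand-built tree standing in for what a learner such as C4.5 would induce) of turning root-to-leaf branches into "if-then" rules:

import java.util.ArrayList;
import java.util.List;

/** Sketch: turning the branches of an induced (meta-)decision tree into
 *  "if-then" rules for the KB. */
public class TreeToRules {

    static class Node {
        String test;            // e.g. "noise > 0.1" (internal node), null at leaf
        Node left, right;       // branches for test == true / test == false
        String label;           // recommended FE method + classifier (leaf only)
    }

    /** Depth-first walk collecting one rule per root-to-leaf path. */
    static void extract(Node n, List<String> path, List<String> rules) {
        if (n.test == null) {                       // leaf
            rules.add("IF " + String.join(" AND ", path) + " THEN " + n.label);
            return;
        }
        path.add(n.test);
        extract(n.left, path, rules);
        path.remove(path.size() - 1);
        path.add("NOT(" + n.test + ")");
        extract(n.right, path, rules);
        path.remove(path.size() - 1);
    }

    public static void main(String[] args) {
        Node leafA = new Node(); leafA.label = "nonparametric FE + 3NN";
        Node leafB = new Node(); leafB.label = "PCA + 3NN";
        Node root = new Node(); root.test = "noise > 0.1";
        root.left = leafA; root.right = leafB;

        List<String> rules = new ArrayList<>();
        extract(root, new ArrayList<>(), rules);
        rules.forEach(System.out::println);
    }
}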

6 Conclusions

Feature extraction is one of the dimensionality reduction techniques that are often used to cope with the problems caused by the "curse of dimensionality". In this paper we considered three eigenvector-based feature extraction approaches, which were applied to different classification problems. We presented a summary of results that shows a high level of complexity in the dependencies between the data set characteristics and the data mining process. There is no feature extraction method that would be the most suitable for all classification tasks. Since there is no well-grounded strong theory that would help us to build an automated system for such feature extraction method selection, we proposed a decision support system that would accumulate separate facts, trends, and dependencies between the data characteristics and the output parameters of classification schemes performed in the spaces of extracted features.

We considered the goals of such a system, the basic ideas that define its structure, and the methodology of knowledge acquisition and validation. The Knowledge Base is the basis for the intelligence of the decision support system. That is why we recognised the problem of discovering rules from experiments on artificially generated data sets with known predefined simple, statistical, and information-theoretic measures, and of validating those rules on benchmark data sets, as a prior research focus in this area.

It should be noticed that the proposed approach has a serious limitation, namely the fragmentariness and incoherence (disconnectedness) of the components of knowledge to be produced. We definitely do not claim the completeness of our decision support system. Rather, certain constraints and assumptions on the domain area were made, and limited sets of feature extraction methods, classifiers and data set characteristics were considered, in order to guarantee the desired level of confidence in the system when solving a bounded set of problems.

Acknowledgments

This research is partly supported by the COMAS Graduate School of the University of Jyväskylä, Finland, and Science Foundation Ireland. We would like to thank the UCI ML repository of databases, domain theories and data generators for the data sets, and the MLC++ library for the source code used in this study.

References

[1] Aivazyan, S.A. Applied Statistics: Classification and Dimension Reduction. Finance and Statistics, Moscow, 1989.
[2] Bellman, R. Adaptive Control Processes: A Guided Tour. Princeton University Press, 1961.
[3] Blake, C.L., Merz, C.J. UCI Repository of Machine Learning Databases [http://www.ics.uci.edu/~mlearn/MLRepository.html]. Dept. of Information and Computer Science, University of California, Irvine, CA, 1998.
[4] Fayyad, U.M. Data Mining and Knowledge Discovery: Making Sense Out of Data. IEEE Expert 11(5), 1996, 20-25.
[5] Fukunaga, K. Introduction to Statistical Pattern Recognition. Academic Press, London, 1991.
[6] Jackson, P. Introduction to Expert Systems, 3rd edn. Addison Wesley Longman, Harlow, England, 1999.
[7] Jolliffe, I.T. Principal Component Analysis. Springer, New York, NY, 1986.
[8] Kohavi, R., Sommerfield, D., Dougherty, J. Data mining using MLC++: a machine learning library in C++. In: Tools with Artificial Intelligence, IEEE CS Press, 1996, 234-245.
[9] Liu, H. Feature Extraction, Construction and Selection: A Data Mining Perspective. Kluwer Academic Publishers, 1998.
[10] Oza, N.C., Tumer, K. Dimensionality Reduction Through Classifier Ensembles. Technical Report NASA-ARC-IC-1999-124, Computational Sciences Division, NASA Ames Research Center, Moffett Field, CA, 1999.
[11] Tsymbal, A., Puuronen, S., Pechenizkiy, M., Baumgarten, M., Patterson, D. Eigenvector-based feature extraction for classification. In: Proc. 15th Int. FLAIRS Conference on Artificial Intelligence, Pensacola, FL, USA, AAAI Press, 2002, 354-358.
[12] Quinlan, J.R. C4.5: Programs for Machine Learning. Morgan Kaufmann, San Mateo, CA, 1993.


IX

DATA MINING STRATEGY SELECTION VIA EMPIRICAL AND CONSTRUCTIVE INDUCTION

Pechenizkiy, M. 2005. In: M. Hamza (Ed.), Proceedings of the IASTED International Conference on Databases and Applications DBA’05, Calgary: ACTA Press, 59-64. With permission from ACTA Press.


DATA MINING STRATEGY SELECTION VIA EMPIRICAL AND CONSTRUCTIVE INDUCTION

Mykola Pechenizkiy
Department of Computer Science and Information Systems
University of Jyväskylä
P.O. Box 35, Jyväskylä 40351, Finland
[email protected]

ABSTRACT
Nowadays there exist a number of data-mining techniques to extract knowledge from large databases. Recent research has shown that no single technique can dominate every other technique on all possible data-mining problems. Nevertheless, many empirical studies report that a technique or a group of techniques can perform significantly better than any other technique on a certain data-mining problem or group of problems. Therefore, a data mining system faces the challenge of selecting the most appropriate technique(s) for a problem at hand. In the real world it is infeasible to perform a comparison of all applicable approaches. Several meta-learning approaches have been applied for automatic technique selection by several researchers, with little success. The goal of this paper is to consider and critically analyze such approaches. At the centre of our analysis we introduce a general framework for data mining strategy selection via empirical and constructive induction.

KEY WORDS
Data mining, meta-learning, constructive induction

1. Introduction
Data mining (DM) is an emerging area that considers the process of finding previously unknown and potentially interesting patterns and relations in large databases [1]. Numerous DM techniques have recently been developed to extract knowledge from these large databases. During the last years DM has evolved from less sophisticated first-generation techniques to today's cutting-edge ones. Currently there is a growing need for next-generation knowledge discovery systems, which should be able to discover knowledge by combining several available techniques. It was pointed out in [2] that, as classification systems become an integral part of an organizational decision support system (DSS), adaptability to variations in data characteristics and to the dynamics of business scenarios becomes increasingly important. Michalski, with respect to the orientation of machine-learning systems, emphasized the importance of moving from single-strategy systems to the development of systems that integrate two or more learning strategies [3]. The important issue here is that the integration should not be done in a predefined way. On the contrary, the goal is to develop a system that would integrate the whole spectrum of (machine) learning strategies and would be able to select the most suitable strategy or strategies for a given problem [3]. The selection of the most appropriate data-mining technique or group of techniques is usually not straightforward. Since the DM process comprises several steps, which involve data selection, data pre-processing, data transformation, application of machine learning techniques, and the interpretation and evaluation of patterns, the selection of a technique for each step is needed. In this paper we refer to the combination of selected DM techniques as a DM strategy. We consider meta-learning approaches (Section 2) that use meta-data, obtained from the categorization of problems and of the available techniques, which have been applied for automatic technique selection by several researchers. We then analyze the main limitations of such approaches, discuss why they were unsuccessful, suggest ways of improving them, and introduce a general framework for DM strategy selection (Section 3). We conclude with a discussion of key issues and the direction of further research (Section 4).

2. Automatic Algorithm Selection via Meta-learning

2.1 Meta-learning
Generally, it is hard to characterise an algorithm's performance as good or bad, or as better or worse than the performance of some other algorithm, since performance is context dependent. However, it is known from many empirical studies that an algorithm or a group of algorithms can perform significantly better than other algorithms on a certain problem or group of problems characterised by some properties [2]. Meta-learning, or bias learning, is the effort to automatically induce correlations between tasks and inductive strategies. Thus, different DM approaches can be not only inductively evaluated within a given domain but also correlated to domains' characteristics.


Successful meta-learning in the context of (automatic) algorithm selection would be very important and beneficial in DM practice. It is obvious that in a real-world situation it is very unlikely that one can perform a brute-force search comparing all applicable approaches. Several meta-learning approaches for automatic algorithm selection have been introduced in the literature. A comprehensive overview can be found in [4]. The most popular strategies for meta-learning are the characterisation of a dataset in terms of its statistical/information properties, the more recently introduced landmarking approach [5], and the characterisation of algorithms. The general idea of meta-learning with respect to the selection of a classifier for a dataset at hand is presented in Figure 1.

Fig. 1. Meta-learning system for suggestion of a classifier

Having a collection of datasets and a collection of classifiers, we can characterize them, producing meta-data as a result of their (algorithms' and datasets') characterisation. When meta-data is available, a machine-learning algorithm can be applied to it. As a result, a meta-learning model that maps dataset characteristics to classifier characteristics with respect to the introduced performance criteria is built. When a new dataset is introduced to the system, the necessary dataset characteristics are estimated so that the meta-model is able to suggest an algorithm or a combination of algorithms according to a performance criterion; a minimal sketch of this suggestion step is given at the end of this subsection. In the rest of this section we introduce the basic dimensions of automatic algorithm selection via meta-learning, which include potential meta-learning goals, meta-learning spaces, meta-data, and the characterization of the datasets and techniques available.

2.2 Meta-learning space
When providing a recommendation to a user, a suggestion may come as a list of applicable algorithms, the best algorithm, or a ranking of the algorithms. In the first case the pool of classifiers is divided into two sets: algorithms that are expected to achieve good results on the dataset at hand, and algorithms that are expected to have poor performance [6]. In the second case the single algorithm that is expected to produce the best results on the dataset under examination, according to the performance criterion used, is suggested [7]. In the third case, a ranking list of the potentially most adequate algorithms can be suggested [8]. A good ranking algorithm should not only produce the ordered list of ranks but also check whether the difference between algorithms is significant, and show that fact if it is not (e.g. by giving the same ranks).
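A minimal sketch of the suggestion step mentioned above (the meta-feature vectors, the performance table and the one-nearest-neighbour lookup are deliberate simplifications; in practice the characteristics would be normalised and a learned meta-model would be used):

/** Sketch: suggest the classifier that performed best on the most
 *  similar already-characterised dataset. All meta-data is made up. */
public class MetaSuggest {

    static double dist(double[] a, double[] b) {
        double s = 0;
        for (int i = 0; i < a.length; i++) s += (a[i] - b[i]) * (a[i] - b[i]);
        return Math.sqrt(s);
    }

    /** metaData[i] = characteristics of dataset i; best[i] = its best classifier. */
    static String suggest(double[][] metaData, String[] best, double[] newDataset) {
        int nearest = 0;
        for (int i = 1; i < metaData.length; i++)
            if (dist(metaData[i], newDataset) < dist(metaData[nearest], newDataset))
                nearest = i;
        return best[nearest];
    }

    public static void main(String[] args) {
        // (instances, attributes, classes, class entropy): hypothetical meta-data.
        double[][] metaData = { { 150, 4, 3, 1.58 }, { 3196, 36, 2, 1.0 } };
        String[] best = { "3NN", "C4.5" };
        System.out.println(suggest(metaData, best, new double[] { 200, 6, 3, 1.5 }));
    }
}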

In order to provide such recommendations, a corresponding meta-learning problem has to be established, since there is a need to learn something about the performance of the learning algorithms. The meta-learning space (MLS) can be considered as a space constituted by collections of meta-learning problems [4]. Thus, when someone is interested in a list of applicable algorithms for a dataset, the MLS consists of a number of (meta-)classification problems, each of which is associated with one of the algorithms under consideration. Another formulation of the MLS may include a number of meta-learning problems that correspond to all the pairwise comparisons of learners from a given pool. In this case the meta-models describe the conditions under which one algorithm is preferable to another. A regression approach can also be used to predict the accuracy of an algorithm. Then the MLS consists of one regression problem for each algorithm.

2.3 Characterisation of a dataset
Characteristics of a dataset that can be used for meta-learning are commonly divided into those that describe the nature of the attributes, the attributes themselves, associations between attributes, and associations between attributes and a target variable. To account for this, three categories of dataset characteristics can be used: (1) simple characteristics (number of examples, number of attributes, number of classes, etc.), (2) statistical characteristics (mean absolute correlation of attributes, mean skewness and mean kurtosis of attributes, etc.), and (3) information-theoretic characteristics (entropy of class, mean entropy of attributes, noise-signal ratio, etc.); a minimal sketch of such measures is given at the end of this subsection. The first characterisation of a dataset in terms of its statistical/information properties can be traced back to the framework of the Statlog project [9]. An exhaustive list of statistical and information measures of a dataset, and the data characterisation tool (DCT), is proposed in [6]. The second approach to characterising a dataset is landmarking. The idea is to directly characterise a dataset by relating the performance of some learners (landmarkers) to the performance of some other algorithm [5]. This idea has a reflection in the common practice of data miners: when a new problem is stated, some preliminary exploration is usually performed in order to quickly come up with a few potentially promising approaches. Model-based data characterisation attempts to make use of properties of the concepts that are learned with certain algorithms, or the direct use of the concepts themselves in higher-order learning settings. In relational data characterisation, the different characteristics produced by the DCT for each attribute can be summarised by the histogram-based approach [4]. It was shown that histograms, when used as an aggregation method (e.g. for the mutual information between symbolic attributes), can preserve more information about the DCT properties of the individual attributes compared to single aggregation functions (average, minimum and maximum).
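As a small illustration of the first and third categories of measures, a minimal sketch (the data values are made up) computing a few such characteristics for a toy labelled dataset:

/** Sketch: simple characteristics plus the information-theoretic
 *  class entropy for a labelled dataset. */
public class DatasetCharacteristics {

    static double classEntropy(int[] labels, int numClasses) {
        int[] counts = new int[numClasses];
        for (int y : labels) counts[y]++;
        double h = 0;
        for (int c : counts) {
            if (c == 0) continue;
            double p = (double) c / labels.length;
            h -= p * Math.log(p) / Math.log(2);   // entropy in bits
        }
        return h;
    }

    public static void main(String[] args) {
        double[][] x = { { 1, 2 }, { 2, 1 }, { 8, 9 }, { 9, 8 } }; // toy data
        int[] y = { 0, 0, 1, 1 };
        System.out.println("examples   = " + x.length);
        System.out.println("attributes = " + x[0].length);
        System.out.println("entropy(C) = " + classEntropy(y, 2)); // 1.0 bit
    }
}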


2.4 Characterisation of an algorithm
Naturally, besides the characterisation of a dataset from the data side, some application restrictions or priorities can be introduced. Thus, a user may like to define the most (un)desirable or the most crucial characteristic(s) of an algorithm to be selected for a certain application. The most common characteristics that are taken into account are: algorithm taxonomy, interpretability of the model, importance of results interpretability and algorithm transparency, explanation of decisions, training time, testing time, accuracy, and cost handling for misclassification. A good overview of potential algorithm characteristics is introduced in [10]. Three different types of sources from which the knowledge of the characteristics of a learning algorithm is drawn are considered. The first type includes basic requirements, capabilities and/or limitations of an algorithm. The corresponding characteristics are usually defined by the specification rather than determined empirically. Examples of such characteristics are the algorithm's run-time parameters, its ability to handle misclassification costs, and the data types supported. The second source is related to expert knowledge about the algorithms. An interpretability example is the fact that neural network results are assumed to be less interpretable compared to association rules; with respect to the performance of an algorithm, an example could be the known fact that kNN is very sensitive to irrelevant attributes. The third source of algorithm characterization is past learning experience. The potential problem with this source is that interactions of both the dataset and algorithm characteristics can distort the characteristics of the algorithm extracted from previous learning experience. From this perspective, systematically controlled experiments might be more beneficial, since, unlike real-world datasets, synthetically generated data allow one to test exactly the desired characteristics while keeping all the others unchanged.

3. A Framework for DM Strategy Selection via Empirical and Constructive Induction

3.1 Limitations of meta-learning approaches
In the previous section we considered the meta-learning approach as a means of automatic algorithm selection, particularly classifier selection for a dataset at hand. Despite the limited success reported by researchers, one needs to admit that the meta-learning approach as such has several shortcomings. Lindner and Studer [6] reported two general problems with meta-models that associate dataset and algorithm characteristics. The first problem is the representativeness of the meta-data examples. The possible space of learning problems, and thus the meta-learning space, is vast and grows larger with the invention of new algorithms and the consideration of new characteristics and parameters. But the size of the meta-datasets used in the studies is naturally rather small because of the computational complexity of producing a single meta-example: usually a time-consuming cross-validation process is used to estimate the performance of every algorithm used in the study. The other problem (related especially to the landmarking approach) is the computational complexity (up to O(n^3), where n is the number of examples) of some sophisticated statistical measures. Such measures are not scalable to large datasets, and actually that amount of resources could be spent on a sophisticated learner directly. In [11] a meta-learning framework consisting of coupling an IBL (kNN) algorithm with a ranking method was considered. The problem of algorithm selection/ranking has been divided into two distinct phases. During the first phase a subset of relevant datasets is identified by the introduced zooming technique. Zooming employs kNN with a distance function based on a set of dataset characteristics. In the second phase a ranking is constructed on the basis of the performance information of the candidate algorithms on the selected datasets that are similar to the one at hand. It was found that the selection of a ranking algorithm is actually much less important than the selection of the meta-characteristics, especially when kNN (which is sensitive to irrelevant features) is used in the second phase. In [4] several meta-learners have been evaluated and compared. The C5.0 decision tree algorithm as a meta-learner performed (non-significantly) better than the other inductive algorithms. It seems, however, that so far there is no agreement among researchers on which strategy is most suitable for meta-learning. And although up-to-date published related papers show a rather optimistic attitude to the meta-learning approach in general, the results presented in those papers are not so optimistic. Most meta-learning approaches for automatic algorithm selection (meta-decision trees, meta-instance-based learners) assume that the features used to represent meta-instances are sufficiently relevant. However, it was experimentally shown that this assumption often does not hold for many learning problems. Some features may not be directly relevant, and some features may be redundant or irrelevant. Even those meta-learning approaches that apply feature selection techniques, and thus can eliminate irrelevant features and somehow account for the problem of high dimensionality, often fail to find a good representation of the meta-data. This happens because many features in their original representation are weakly or indirectly relevant to the problem. The existence of such features usually requires the generation of new, more relevant features that are functions of the original ones. Such functions may vary from very simple ones, such as a product or a sum of a subset of the original features, to very complex ones, such as a feature that reflects whether some geometrical primitive is present or absent in an instance. The discretization (quantization) of continuous features may serve as an abstraction of some features when a reduction of the range of possible values is desirable. The original representation space can be improved for (meta-)learning by removing less relevant features, adding more relevant features and abstracting features. In the following section we consider the constructive induction approach with respect to classification in general and classifier selection in particular.

3.2 Constructive induction
Constructive induction (CI) is a learning process that consists of two intertwined phases, one of which is responsible for the construction of the "best" representation space, while the second concerns generating a hypothesis in the found space [12]. In Figure 2 we can see two problems, with (a) a high-quality and (b) a low-quality representation space (RS). In (a) the points marked by "+" are easily separated from the points marked by "–" using a straight line or a rectangular border. But in (b) the "+" and "–" points are highly intermixed, which indicates the inadequateness of the original RS. The traditional approach is to search for complex boundaries to separate the classes; the constructive induction approach is to search for a better representation space where the groups are much better separated, as in (c). Constructive induction systems consider learning as a dual search process: for an appropriate representation in the space of representational spaces, and for an appropriate hypothesis in the specific representational space. Michalski introduced constructive operators (which expand the representation space by attribute generation methods) and destructive operators (which contract the representational space by feature selection or feature abstraction); a minimal sketch of such operators is given after Figure 2. The construction of meta-rules from meta-data to guide the selection of these operators was also considered.

Fig. 2. High vs. low quality RS for concept learning [13]
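A minimal sketch of these operators applied to a raw feature matrix (the operator choices and data are illustrative only, not the framework's actual implementation):

import java.util.Arrays;

/** Sketch of constructive/destructive operators: constructing a product
 *  feature (expansion), removing a feature (selection), and
 *  discretizing one (abstraction). */
public class RepresentationOperators {

    /** Constructive: append x_i * x_j as a new feature. */
    static double[][] addProduct(double[][] x, int i, int j) {
        double[][] out = new double[x.length][];
        for (int r = 0; r < x.length; r++) {
            out[r] = Arrays.copyOf(x[r], x[r].length + 1);
            out[r][x[r].length] = x[r][i] * x[r][j];
        }
        return out;
    }

    /** Destructive: drop feature k. */
    static double[][] remove(double[][] x, int k) {
        double[][] out = new double[x.length][x[0].length - 1];
        for (int r = 0; r < x.length; r++)
            for (int c = 0, o = 0; c < x[r].length; c++)
                if (c != k) out[r][o++] = x[r][c];
        return out;
    }

    /** Abstraction: equal-width discretization of feature k into `bins` values. */
    static void discretize(double[][] x, int k, int bins, double min, double max) {
        for (double[] row : x)
            row[k] = Math.min(bins - 1, (int) ((row[k] - min) / (max - min) * bins));
    }

    public static void main(String[] args) {
        double[][] x = { { 1.0, 2.0 }, { 3.0, 4.0 } };
        x = addProduct(x, 0, 1);      // expand: adds x0*x1
        discretize(x, 1, 2, 0, 5);    // abstract feature 1 into {0, 1}
        x = remove(x, 0);             // contract: drop feature 0
        System.out.println(Arrays.deepToString(x));
    }
}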

Constructive induction methods are classified into three categories: data-driven (information from the training examples is used), hypothesis-driven (information from the analysis of the form of intermediate hypotheses is used) and knowledge-driven (domain knowledge provided by experts is used) methods.

3.3 Framework for DM strategy selection
In this section we consider a general framework for DM strategy selection, which is depicted in Figure 3. This framework follows Fayyad's vision of KDD as a process and incorporates both the idea of meta-learning for automatic algorithm selection, considered in Section 2, and the constructive induction approach.

Fig. 3. DM strategy selection via meta-learning, taking benefit of the constructive induction approach

The KDD-Manager plays the central role here from the data-mining strategy selection point of view. Based on the meta-model that maps the performance characteristics of the available ML techniques and their combinations onto dataset characteristics, the KDD-Manager recommends the most appropriate algorithms, which constitute a certain DM strategy. Besides this main duty, the KDD-Manager is responsible for the communication processes between the different components (when cooperation and/or switching between different techniques is the case), and between an end-user or DM researcher and the system. Data Pre-processors are used for preliminary data cleaning and handling missing values and, more importantly from the meta-learning perspective, for retrieving the necessary characteristics of the Dataset at hand. A dataset can be an artificial one, synthetically produced by the Data Generator. This possibility allows varying one group of dataset characteristics while controlling all the others (keeping them constant). Machine-learning algorithms are used to construct a classification or regression model, or to cluster the representation space, producing local regions, so that further steps toward knowledge discovery can be applied locally for each region. Instance Manipulators and Feature Manipulators are aimed at finding the most suitable representation of the problem for a machine-learning algorithm. A possible scenario of data-mining strategy selection could be, for example: k-means clustering, then ranked PCA-based feature extraction (FE) [14], and then a kNN classifier. An additional option is to apply a co-clustering technique, which refers to simultaneous clustering along multiple dimensions; in our case, along two dimensions: instances and features. Post-processors and Visualizers are aimed, correspondingly, at enhancing the induced models from both the performance and interpretability perspectives (examples are the rule grouping and pruning processes), and at presenting the achieved results to an end-user. The GUI (Graphical User Interface) is aimed at providing assistance for user input, recommendations, results reporting/visualisation/explanation, user feedback, and meta-model management.


The Meta-model can be seen as the brains of the KDD-Manager. The Meta-model is constituted by the available expert knowledge and by the knowledge acquired through the meta-learning process on the Meta-Data, which includes the algorithms' and datasets' characteristics. The data processing steps from an end-user point of view are similar to the ones introduced in [1]. The basic idea of the meta-learning cycle was considered in Section 2.1. However, we would like to focus here on some important features of the framework that are implied by constructive induction. Before applying a constructive induction approach, a decision needs to be made about how relevant the original representation space is. A landmarker approach can be used for this purpose, or the estimate may come directly from data characterization. Generally, different evaluation criteria can be used for constructive induction with respect to the subsequent learning strategy. The gain ratio, the gini index, and chi-square analysis are used when attributes are evaluated on the basis of their expected global performance, i.e. a new representation space is expected to discriminate among all classes. However, when each class is described independently of the other classes, the construction of the simplest and most accurate rules for each class is desired. Thus, a feature that has low global discriminating value (and is thus ignored by a decision tree that uses a global criterion) may at the same time characterize a single class very well. Michalski [12] shows this on the example of recognizing the upper-case letters of the English alphabet: the feature "has tail" is very useful in the rule-based approach to discriminate Qs from the other letters, but this feature is likely to be neglected since it has relatively low overall utility (because Qs normally occur rather rarely). When this is the case, a declarative knowledge representation may be more beneficial than a procedural one. We consider FE for classification [14] as an analogy of constructive induction. Indeed, what we are trying to achieve by means of FE is the most appropriate data representation for the subsequent classification. The notions of both destructive and constructive operators are relevant to our approach, and the appropriate use of one or both of them can be beneficial. One approach is to select and perform FE keeping in mind the subsequent classification, estimate how good the new RS is, and then perform the selection of a classifier. This is the data-driven approach. Another approach, the selection of a combination of an FE technique and a classifier, is more sophisticated. This is the hypothesis-driven approach; in this case FE and classification cannot be separated into two independent processes. Nevertheless, a better understanding of the combined FE technique and classifier as a whole and, respectively, better results in the recommendation of a DM strategy constituted by the combination can be achieved. It was shown that FE techniques are beneficial in cases with a very limited sample size [15] and highly correlated features [16].

Sohn [8] shows that many meta-features that are commonly considered are highly correlated. Therefore we believe that dimensionality reduction of the meta-level by means of FE can provide better results. Furthermore, if case-based reasoning (or just kNN) is used as a tool for meta-learning, then the similarity measure is very important. Euclidean-based similarity depends heavily on how good the instance space representation is, namely, how relevant the features are throughout the whole space. Hence, better results can be achieved if a (meta-)FE process is undertaken before a meta-learning technique is applied. Actually, we successfully applied a similar idea in [17], where FE is applied at the meta-level for the dynamic integration of classifiers.

4. Discussion and Concluding Remarks
In this paper we considered meta-learning approaches for automated algorithm selection, analyzed the main limitations of such approaches, discussed why they were unsuccessful, and suggested ways of improving them. In this section we summarise the key points of the proposed framework for DM strategy selection via empirical and constructive induction. Our main research goal is to contribute to knowledge on the problem of DM strategy selection for a certain data-mining problem at hand. Although there is a rather large number of DM techniques, one can hardly find a technique that would be the best for all datasets. This situation is very natural, and it is typical of any group of methods or techniques. For example (if one considers all datasets equally probable), it was proven in a number of so-called "no free lunch" theorems that, for any algorithm, any improved performance over one class of problems is exactly paid for by performance over another class of problems [18]. Although the NFL theorem is often criticized on the grounds that some datasets are more probable to appear in the real world, even critics of this theorem would agree that for every method it is possible to find a dataset (even if the probability of meeting it as a real-life problem is very small) where its accuracy is less than that of a rival method. Thus, the adaptive selection of the most suitable DM techniques that constitute a DM strategy is a real and challenging problem. Unfortunately, there does not exist canonical knowledge, a perfect mathematical model, or any relevant tool to select the best technique. Instead, a volume of accumulated empirical findings, some trends, and some dependencies have been discovered. On the other hand, there are certain assumptions on the performance of algorithms under certain conditions. In the framework we proposed a DSS approach to recommend a data-mining strategy rather than a classifier or any other single ML algorithm. The important difference here is that, in constituting a data-mining strategy, the system searches for the most appropriate ML algorithm with respect to the most suitable data representation (for this algorithm). In [17] we show, in particular, how different combinations of PCA-based FE techniques with a (meta-)classifier improve performance due to this dual search.


We believe that a deeper analysis of a limited set of DM techniques (particularly, FE techniques and classifiers) at both the theoretical and experimental levels is a more beneficial approach than the application of the meta-learning approach alone to the whole range of machine learning techniques at once. The combination of theoretical and (semiautomatic) experimental approaches requires the integration of the knowledge produced by a human expert with that produced by the meta-learning approach. In [19] we consider an approach based on the meta-learning paradigm as a way of acquiring knowledge from experiments, and on the methodology used in expert systems design for the representation, accumulation and use of expert knowledge and of the knowledge acquired through the meta-learning process. In the framework we considered the constructive induction approach, which may include the feature extraction, feature construction and feature selection processes as means of constructing a relevant representation space. We consider the pairwise comparison of classifiers at the meta-level as more beneficial, with respect to the contribution to knowledge, than the regression and ranking approaches, since the pairwise comparison gives more insight into the understanding of the advantages and weaknesses of the available algorithms, producing more specific characterizations. With respect to meta-model construction, we recommend meta-rule extraction and learning by analogy rather than inducing a meta-decision tree. The argumentation is straightforward. A decision tree is a form of procedural knowledge. Once it has been constructed, it is not easy to update it according to changing decision-making conditions. So, if a feature related to a high-level node in the tree is unmeasured (e.g. due to time- or cost-consuming processing), the decision tree can produce nothing but probabilistic reasoning. On the contrary, decision rules are a form of declarative knowledge. From a set of decision rules it is possible to construct many different, but logically equivalent or nearly equivalent, decision trees [13]. Thus decision rules are a more stable approach to meta-learning than decision trees. We would like to stress again the possibility of conducting experiments on synthetically generated datasets, which is beneficial from two perspectives. First, it allows generating, testing and validating hypotheses on DM strategy selection with respect to a dataset at hand under controlled settings, where some data characteristics are varied while the others are held constant. Besides this, experiments on synthetic datasets allow producing additional instances for the meta-dataset.

5. Acknowledgements
This research is partly supported by the COMAS Graduate School of the University of Jyväskylä, Finland.

I would like to thank Dr. Alexey Tsymbal and Prof. Seppo Puuronen for their valuable comments and suggestions on this paper.

References:
1. U.M. Fayyad. Data Mining and Knowledge Discovery: Making Sense Out of Data. IEEE Expert 11(5), 1996, 20-25.
2. M. Kiang. A comparative assessment of classification methods. Decision Support Systems 35, 2003, 441-454.
3. R. Michalski. Toward a unified theory of learning: multistrategy task-adaptive learning. In: B. G. Buchanan & D. C. Wilkins (Eds.), Readings in Knowledge Acquisition and Learning, 1993, 7-38.
4. A. Kalousis. Algorithm Selection via Meta-Learning. PhD Thesis, Dept. of CS, University of Geneva, 2002.
5. B. Pfahringer, H. Bensusan & C. Giraud-Carrier. Meta-learning by landmarking various learning algorithms. Proc. 17th Int. Conf. on Machine Learning, San Francisco, 2000, 743-750.
6. G. Lindner & R. Studer. AST: Support for algorithm selection with a CBR approach. Proc. 3rd European Conf. on PKDD, Prague, Springer-Verlag, 1999, 418-423.
7. L. Todorovski & S. Dzeroski. Combining multiple models with meta decision trees. Proc. 4th European Conf. on PKDD, Springer-Verlag, 2000, 54-64.
8. S. Sohn. Meta analysis of classification algorithms for pattern recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence 21, 1999, 1137-1144.
9. D. Michie, D. Spiegelhalter & C. Taylor. Machine Learning, Neural and Statistical Classification. Ellis Horwood, 1994.
10. M. Hilario & A. Kalousis. Characterizing Learning Models and Algorithms for Classification. TR UNIGE-AI-9-01, University of Geneva, 1999.
11. C. Soares & P. Brazdil. Zoomed ranking: Selection of classification algorithms based on relevant performance information. Proc. 4th PKDD, Springer, 2000, 126-135.
12. R. Michalski. Seeking knowledge in the deluge of facts. Fundamenta Informaticae 30, 1997, 283-297.
13. T. Arciszewski, R. Michalski & J. Wnek. Constructive induction: the key to design creativity. TR MLI 95-6, George Mason University, 1995.
14. K. Fukunaga. Introduction to Statistical Pattern Recognition. Academic Press, London, 1999.
15. L. Jimenez & D. Landgrebe. High dimensional feature reduction via projection pursuit. PhD Thesis, TR-96-5, 1996.
16. I. Jolliffe. Principal Component Analysis. Springer-Verlag, New York, 1986.
17. A. Tsymbal, M. Pechenizkiy, S. Puuronen & D. Patterson. Dynamic integration of classifiers in the space of principal components. Proc. 7th East-European Conf. ADBIS'03, Springer-Verlag, Heidelberg, 2003, 278-292.
18. D. Wolpert & W. MacReady. No free lunch theorems for optimization. IEEE Transactions on Evolutionary Computation 1(1), 1996, 67-82.
19. M. Pechenizkiy, S. Puuronen & A. Tsymbal. Feature extraction for classification in the data mining process. Information Theories and Applications 10(1), Sofia, FOI-Commerce, 2003, 321-329.

