FD FS High Dimensional Data


    FEATURE DISCRETIZATION AND SELECTION

TECHNIQUES FOR HIGH-DIMENSIONAL DATA

    Artur J. Ferreira

Supervisor: Prof. Mario A. T. Figueiredo

Instituto Superior de Engenharia de Lisboa / Instituto de Telecomunicações, Lisboa

Priberam Machine Learning Lunch Seminar, 17 April 2012


    Outline

    Introduction

Background: High-Dimensional Data; Feature Discretization (FD); Feature Selection (FS)

FD Proposals: Static Unsupervised Proposals; Static Supervised Proposal

FS Proposals: Feature Selection Proposals

    Conclusions

    Some Resources


Introduction and Motivation
Some well-known facts about machine learning:

1. An adequate (sometimes discrete) representation of the data is necessary

2. High-dimensional datasets are increasingly common

3. Learning from high-dimensional data is challenging

4. Sometimes the curse-of-dimensionality problem arises: a small number of instances n and a large dimensionality d (e.g., microarray or text categorization data); it must be addressed in order to have effective learning algorithms

5. Feature discretization (FD) and feature selection (FS) techniques address these problems: they achieve adequate representations and select an adequate subset of features with a convenient representation



High-Dimensional Data
Some high-dimensional datasets with different types of problems are available on the Web. Datasets with c classes and n instances are shown by increasing dimensionality d; notice that in some cases d >> n.

Dataset        d        c   n     Type of Data
Colon          2000     2   62    Microarray
SRBCT          2309     4   83    Microarray
AR10P          2400     10  130   Face
PIE10P         2420     10  210   Face
TOX-171        5748     4   171   Microarray
Example1       9947     2   50    Text, BoW
ORL10P         10304    10  100   Face
11-Tumors      12553    11  174   Microarray
Lung-Cancer    12601    5   203   Microarray
SMK-CAN-187    19993    2   187   Microarray
Dexter         20000    2   2600  BoW
GLI-85         22283    2   85    Microarray
Dorothea       1000000  2   1950  Drug Discovery


High-Dimensional Data
These datasets are available online:
  The well-known University of California at Irvine (UCI) repository, archive.ics.uci.edu/ml/datasets.html
  The Gene Expression Model Selector (GEMS) project, at www.gems-system.org/


High-Dimensional Data
The recently developed Arizona State University (ASU) repository, featureselection.asu.edu/datasets.php, has high-dimensional datasets.


Feature Discretization
Feature discretization (FD) aims at:
  representing a feature with a set of symbols from a finite set
  keeping enough information for the learning task
  ignoring minor (noisy/irrelevant) fluctuations in the data
FD can be performed by:
  unsupervised or supervised methods; the latter use class labels to compute the discretization intervals
  static or dynamic methods: static methods compute the discretization intervals using solely the training data, whereas dynamic methods rely on a wrapper approach with a quantizer and a classifier


Feature Discretization: Static Unsupervised Approaches
Three static unsupervised techniques are commonly used for FD (see the sketch below):
  equal-interval binning (EIB) - uniform quantization
  equal-frequency binning (EFB) - non-uniform quantization, to attain a uniform distribution of the discrete symbols
  proportional k-interval discretization (PkID) - the number and size of the discretization intervals are adjusted to the number of training instances
As compared to both EIB and EFB, PkID provides naive Bayes classifiers with:
  competitive classification performance on smaller datasets
  better classification performance on larger datasets
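As an illustration (ours, not the talk's code), a minimal MATLAB sketch of EIB and EFB for a single feature; the bin count k, the variable names, and the PkID rule quoted in the final comment are assumptions.

% Minimal sketch (our illustration): EIB and EFB discretization of one
% feature x (an n-by-1 column vector) into k bins.
n = numel(x);
k = 8;                                      % e.g., q = 3 bits -> 2^3 bins

% Equal-interval binning (EIB): uniform quantization of the range of x.
edgesEIB = linspace(min(x), max(x), k + 1);
xEIB = discretize(x, edgesEIB);             % bin index in {1, ..., k}

% Equal-frequency binning (EFB): non-uniform bins with roughly equal counts;
% bin edges are taken from the sorted (order-statistic) values of x.
xs = sort(x);
edgesEFB = unique(xs(round(linspace(1, n, k + 1))));
edgesEFB([1 end]) = [-inf, inf];            % open the outer bins
xEFB = discretize(x, edgesEFB);

% PkID (assumption about the usual rule, not stated on the slide): use about
% sqrt(n) intervals, each holding about sqrt(n) instances, k = floor(sqrt(n)).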


Feature Discretization: Static Supervised Approaches
Some static supervised techniques for FD:
  information entropy maximization (IEM), Fayyad and Irani, 1993
  minimal description length (MDL), Kononenko, 1995
  class-attribute interdependence maximization (CAIM), 2004
  class-attribute contingency coefficient (CACC), 2008
  correlation maximization (CM), 2011
Empirical evidence shows that:
  the IEM and MDL methods have good performance regarding accuracy and running time
  the CACC and CAIM methods have higher running time than both IEM and MDL; in some cases, they attain better results than these methods


Feature Discretization: Weka's Environment
We can find IEM (Fayyad and Irani, 1993) and MDL (Kononenko, 1995) in Weka's machine learning package, as weka.filters.supervised.attribute.Discretize
Note: the unsupervised PkID method is also available.


Feature Selection
Feature selection (FS) is a central problem in machine learning and pattern recognition:
  what is the best subset of features for a given problem?
  how many features should we choose?
There are many recent papers (published in 2012) regarding FS.
FS can be performed by:
  unsupervised or supervised methods; the latter use class labels
  filter, wrapper, or embedded approaches


Feature Selection: Filter Approach
The FS filter approach:
  assesses the quality of a given feature subset using solely characteristics of that subset
  does not rely on the use of any learning algorithm
There are many (successful) filter approaches for FS:
  unsupervised: Term-Variance (TV), Laplacian Score (LS), Laplacian Score Extended (LSE), SPEC, ...
  supervised: Relief, ReliefF, CFS, FiR, ...
  supervised: FCBF, mrMR, MIM, CMIM, IG, ...


Feature Selection: Information-Theoretic Filters
Claude Shannon's information theory (IT) has been the basis for the proposal of many filter methods (IT-based filters):

Brown, G., Pocock, A., Zhao, M., Lujan, M., 2012. Conditional likelihood maximisation: A unifying framework for information theoretic feature selection. Journal of Machine Learning Research 13, 27-66.


Feature Selection: Wrapper Approach
Wrapper approaches:
  use some method to search the space of all possible subsets of features
  assess the quality of a subset by learning and evaluating a classifier with that feature subset
Combinatorial search makes wrappers inadequate for high-dimensional data.
Some recent wrapper approaches for FS (2009-2011):
  greedy randomized adaptive search procedure (GRASP)
  a GRASP-based hybrid filter-wrapper FS method
  a sparse model for high-dimensional data, based on linear programming
  estimation of the redundancy between feature sets, by the conditional MI between feature subsets and each class


Feature Selection: Embedded Approach
The embedded (or integrated) approach:
  simultaneously learns the classifier and chooses a subset of features
  assigns weights to features
  the objective function encourages some of these weights to become zero
Examples of the embedded approach are:
  sparse multinomial logistic regression (SMLR)
  the sparse logistic regression (SLogReg) method
  Bayesian logistic regression (BLogReg), a more adaptive version of SLogReg
  joint classifier and feature optimization (JCFO), which applies FS inside a kernel function


Feature Selection on High-Dimensional Data
How difficult/challenging is FS on a given dataset?

The ratio [1]

    R_FS = n / (a c),

with n patterns (instances), c classes, and a the median arity of the features (discretized with EFB), measures this difficulty. Low values imply more difficult FS problems (a small computation sketch is given below, after the reference).

Another question: shouldn't the d/n ratio also be taken into account?

[1] Brown, G., Pocock, A., Zhao, M., Lujan, M., 2012. Conditional likelihood maximisation: A unifying framework for information theoretic feature selection. Journal of Machine Learning Research 13, 27-66.
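A minimal MATLAB sketch (our illustration, not from the talk) of how this ratio might be computed from an already EFB-discretized data matrix; the variable names Xd and y are assumptions.

% Minimal sketch: the FS-difficulty ratio R_FS = n / (a*c), computed from an
% EFB-discretized data matrix Xd (n-by-d integer codes) and class labels y.
[n, d] = size(Xd);
c = numel(unique(y));                     % number of classes
arity = zeros(1, d);
for i = 1:d
    arity(i) = numel(unique(Xd(:, i)));   % arity (number of symbols) of feature i
end
a = median(arity);                        % median feature arity
R_FS = n / (a * c);                       % lower value -> harder FS problem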


Feature Selection: an (outdated) categorization

Liu, H. and Yu, L., Toward Integrating Feature Selection Algorithms for Classification and Clustering, IEEE Transactions on Knowledge and Data Engineering, vol. 17, no. 4, April 2005, pp. 491-502.


Feature Discretization: Static U-LBG1 Algorithm
Recently, we have proposed the use of the Linde-Buzo-Gray (LBG) algorithm for unsupervised static FD [2] (see the sketch below, after the reference):
  The LBG algorithm is applied individually to each feature, leading to discrete features with minimum mean square error (MSE)
  Rationale: a low MSE(Xi, Q(Xi)) is adequate for learning! It is stopped when:
    the MSE distortion falls below some threshold, or
    the maximum number of bits q per feature is reached
  obtains a variable number of bits per feature
  uses the distortion threshold and q as input parameters; setting the threshold to 5% of the range of each feature and q in {4, ..., 10} is adequate

[2] A. Ferreira and M. Figueiredo, An unsupervised approach to feature discretization and selection, Pattern Recognition (Elsevier), DOI: 10.1016/j.patcog.2011.12.008
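A minimal MATLAB sketch (our illustration, not the authors' code) of the U-LBG1 idea: grow the number of bits per feature until the per-feature MSE falls below a threshold or the bit budget q is exhausted. The simple 1-D Lloyd iteration below is a stand-in for the LBG algorithm, and the interpretation of the 5%-of-range threshold is an assumption.

% U-LBG1-style discretization sketch: variable number of bits per feature.
function [Xq, bits] = ulbg1_sketch(X, q)
  [n, d] = size(X);
  Xq   = zeros(n, d);                      % discretized data (codeword indices)
  bits = zeros(1, d);                      % bits allocated to each feature
  for i = 1:d
    x = X(:, i);
    delta = 0.05 * (max(x) - min(x));      % distortion threshold (assumed: 5% of the range)
    for b = 1:q
      [idx, mse] = lloyd1d(x, 2^b);        % scalar quantizer with 2^b levels
      if mse <= delta || b == q
        Xq(:, i) = idx;  bits(i) = b;  break;
      end
    end
  end
end

function [idx, mse] = lloyd1d(x, k)
  % simple 1-D Lloyd iteration (stand-in for LBG)
  c = linspace(min(x), max(x), k)';        % initial codebook
  for it = 1:50
    [~, idx] = min(abs(x - c'), [], 2);    % nearest codeword for each sample
    for j = 1:k
      if any(idx == j), c(j) = mean(x(idx == j)); end
    end
  end
  mse = mean((x - c(idx)).^2);             % per-feature distortion
end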


Feature Discretization: U-LBG2 Algorithm
Similar to U-LBG1 (both aim at obtaining quantizers that represent the features with small distortion), with the following key differences:
  each discretized feature is given the same (maximum) number of bits q
  only one quantizer is learned for each feature


Experimental Results: Unsupervised Discretization
Discretization performance:
  total number of bits per pattern (T. Bits) and test set error rate (Err, average of ten runs)
  using up to q = 7 bits, with the naive Bayes classifier

                 Original   EFB             U-LBG1          U-LBG2
Dataset          Err        T. Bits  Err    T. Bits  Err    T. Bits  Err
Phoneme          21.30      30    22.30     9     22.80     30    20.60
Pima             25.30      48    25.20     30    25.20     48    25.80
Abalone          28.00      48    27.60     15    27.20     48    27.70
Contraceptive    34.80      54    31.40     15    38.00     54    34.80
Wine              3.73      78     4.80     27     3.20     78     3.20
Hepatitis        20.50      95    21.50     32    21.00     39    18.00
WBCD              5.87      180    5.13     60     5.87     180    5.67
Ionosphere       10.60      198    9.80     49    17.40     198   11.00
SpamBase         15.27      324   13.40     54    15.73     324   15.67
Lung             35.00      318   35.00     74    35.83     318   35.00
Arrhythmia       32.00      1392  51.56     553   30.22     1392  41.56


Some insight on the accuracy of each feature

Figure: test set error rate of naive Bayes using only a single feature, discretized with q = 4 bits by the U-LBG2 algorithm, on the WBCD dataset. The horizontal dashed line is the test set error rate using all p = 30 features.


Some insight on the accuracy of each feature

Figure: test set error rate of naive Bayes using only a single feature, discretized with q = 8 bits by the U-LBG2 algorithm, on the WBCD dataset. The horizontal dashed line is the test set error rate using all p = 30 features.


Some insight on progressive discretization 1/2

Figure: test set error rate of naive Bayes using solely feature 17 of the WBCD dataset (original feature and U-LBG2 discretized with q in {1, ..., 10}). The discrete versions do not provide higher accuracy!


Some insight on progressive discretization 2/2

Figure: test set error rate of naive Bayes using solely feature 25 of the WBCD dataset (original feature and U-LBG2 discretized with q in {1, ..., 10}). With q = 10, we get a small improvement in accuracy.


Experimental Results: Analysis
Some comments on these methods and results:
  Often, discretization improves classification accuracy
  EFB usually attains better results than EIB
  U-LBG2 is usually faster than U-LBG1, but allocates more bits per feature
  Often, one of the LBG methods attains better results than EFB
  The U-LBG procedures are more complex than either EIB or EFB
  PkID attains results close to EFB, when using the naive Bayes classifier


Feature Discretization: S-LBG Algorithm (ideas)
Supervised version of the U-LBG counterparts:
  each feature is discretized with LBG
  it uses the mutual information (MI) between the discretized feature and the class label to control the discretization procedure
  it stops at b bits, or when the relative increase of MI with respect to the previous quantizer is less than a given threshold
  each feature is discretized with an increasing number of bits, stopping only when:
    there is no significant increase in the relevance of the feature, or
    the maximum number of bits is reached
A sketch of the MI stopping criterion is given below.
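A minimal MATLAB sketch (our illustration, not the authors' code) of the empirical MI between a discretized feature and the class labels, with a commented-out example of the relative-increase stopping rule; quantize_with_b_bits and threshold are hypothetical names.

% Empirical mutual information between a discretized feature xq (integer
% codes, n-by-1) and the class labels y (n-by-1), in bits.
function mi = discrete_mi(xq, y)
  xv = unique(xq);  yv = unique(y);  n = numel(xq);
  mi = 0;
  for a = 1:numel(xv)
    for b = 1:numel(yv)
      pxy = sum(xq == xv(a) & y == yv(b)) / n;   % joint probability
      px  = sum(xq == xv(a)) / n;
      py  = sum(y  == yv(b)) / n;
      if pxy > 0
        mi = mi + pxy * log2(pxy / (px * py));
      end
    end
  end
end

% Example stopping rule: add bits while the relative MI gain is significant.
% for b = 1:q
%   xq = quantize_with_b_bits(x, b);             % hypothetical quantizer
%   mi_b = discrete_mi(xq, y);
%   if b > 1 && (mi_b - mi_prev) / mi_prev < threshold, break; end
%   mi_prev = mi_b;
% end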


Feature Discretization: MI Algorithm (ideas)
Ongoing work with the following key ideas:
  discretize each feature so as to maximize its MI with the class label
  do this in a progressive approach, scanning all features
  allocate a variable number of bits per feature
  check how the relevance of each feature changes when discretized with b+1 bits, as compared to b bits


Experimental Results
Discretization performance:
  total number of bits per pattern (TBits) and test set error rate (Err, average of ten runs)
  using up to q = 7 bits, with the naive Bayes classifier

              Base    EFB           U-LBG1        U-LBG2        S-LBG
Dataset       Err     TBits  Err    TBits  Err    TBits  Err    TBits  Err
Iris          2.6     24     4.5    7      9.0    24     2.6    15     3.4
Phoneme       21.3    30     22.3   9      22.8   30     20.6   20     22.5
Pima          25.3    48     25.2   30     25.2   48     25.8   40     24.4
Abalone       28.0    48     27.6   15     27.2   48     27.7   35     27.8
Contrac.      34.8    54     31.4   15     38.0   54     34.8   29     34.7
Wine          3.7     78     4.8    27     3.2    78     3.2    57     4.5
Hepatitis     20.5    95     21.5   32     21.0   39     18.0   29     19.0
WBCD          5.8     180    5.1    60     5.8    180    5.6    116    5.4
Ionosph.      10.6    198    9.8    49     17.4   198    11.0   177    11.0
SpamBase      15.2    324    13.4   54     15.7   324    15.6   220    14.6
Lung          35.0    318    35.0   74     35.8   318    35.0   135    35.0
Arrhyt.       32.0    1392   51.5   553    30.2   1392   41.5   1050   31.3


Feature Selection
Some comments on existing FS methods:
  Usually, wrappers perform better than embedded methods, but take longer
  Embedded methods perform better than filters, but are much slower
  On very high-dimensional datasets:
    both wrapper and embedded methods are too expensive
    filters are the only applicable option!
    even some filter FS methods can take a prohibitive time in the redundancy analysis and elimination stage


Feature Selection: Relevance-Redundancy (RR)
The Yu and Liu relevance-redundancy framework [3]: the optimal subset is provided by parts III and IV, without both irrelevant and redundant features.

[3] Yu, L., Liu, H., Dec. 2004. Efficient feature selection via analysis of relevance and redundancy. JMLR 5, 1205-1224.


Feature Selection: Relevance-Redundancy (RR)
The Yu and Liu relevance-redundancy framework [4]: first compute relevance, then redundancy; after this, find the optimal subset (parts III and IV).

[4] Yu, L., Liu, H., Dec. 2004. Efficient feature selection via analysis of relevance and redundancy. JMLR 5, 1205-1224.


A RR FS approach for high-dimensional data (ideas)
Key observations for filters on high-dimensional data:
  Some supervised filter methods (e.g., CFS and mrMR) are computationally expensive
  They waste time on subspace search and redundancy analysis
  Redundancy is typically found among the most relevant features
Our proposal for fast unsupervised and supervised filter RR FS on high-dimensional data:
  sorts the d features by decreasing relevance
  computes the redundancy between the most relevant features
  computes up to d-1 pairwise similarities


A RR FS approach for high-dimensional data (ideas)
  We keep only features with high relevance and low similarity (i.e., below some threshold MS) among themselves
  It is not expected that redundant features are consecutive in the ranked (sorted) feature list
  However, it is a waste of time to compute the redundancy between weakly relevant and irrelevant features!
  So, we perform the redundancy check only among the top relevant features


A RR FS approach for high-dimensional data (details)
Input:  X: n x d matrix, n patterns of a d-dimensional training set.
        m (<= d): maximum number of features to keep.
        MS: maximum allowed similarity between pairs of features.
Output: FeatKeep: an m'-dimensional array (with m' <= m) containing the indexes of the selected features.
        X~: n x m' matrix, the reduced-dimensional training set, with features sorted by decreasing relevance.

1: Compute the relevance r_i of each feature X_i (the columns of X), for i in {1, ..., d}, using one of the dispersion measures (MAD or MM).
2: Sort the features by decreasing order of r_i. Let i_1, i_2, ..., i_d be the resulting permutation of {1, ..., d} (i.e., r_{i_1} >= r_{i_2} >= ... >= r_{i_d}).
3: FeatKeep[1] = i_1; prev = 1; next = 2;
4: for f = 2 to d do
5:   s = S(X_{i_f}, X_{i_prev});
6:   if s < MS, keep the feature: FeatKeep[next] = i_f; prev = f; next = next + 1 (the loop ends when next > m or all d features have been checked).
A MATLAB sketch of this procedure is given below.
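A minimal MATLAB sketch (our illustration, not the authors' exact code) of the relevance-redundancy filter described above; relevance is measured by MAD and similarity by the absolute cosine, and the function names are ours.

% Relevance-redundancy (RR) filter sketch: keep at most m features that are
% highly relevant and mutually dissimilar (similarity below MS).
function [FeatKeep, Xr] = rr_fs_sketch(X, m, MS)
  d = size(X, 2);
  r = mean(abs(X - mean(X, 1)), 1);        % MAD relevance of each feature
  [~, order] = sort(r, 'descend');         % rank features by decreasing relevance
  FeatKeep = order(1);                     % always keep the most relevant feature
  prev = 1;                                % ranking index of the last kept feature
  for f = 2:d
    s = abscos(X(:, order(f)), X(:, order(prev)));
    if s < MS                              % low similarity -> keep this feature
      FeatKeep(end + 1) = order(f); %#ok<AGROW>
      prev = f;
    end
    if numel(FeatKeep) >= m, break; end    % keep at most m features
  end
  Xr = X(:, FeatKeep);                     % reduced training set
end

function s = abscos(a, b)
  s = abs(a' * b) / (norm(a) * norm(b));   % absolute cosine similarity
end

For example, [idx, Xr] = rr_fs_sketch(X, 1000, 0.8) would mimic the setting used later in the running-time experiments (MS = 0.8, m = 1000 features).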


Relevance Measures
For unsupervised learning, we found relevance to be proportional to dispersion. The mean absolute difference,

    MAD_i = (1/n) * sum_{j=1}^{n} |X_{ij} - mean(X_i)|,

and the mean-median,

    MM_i = |mean(X_i) - median(X_i)|,

i.e., the absolute difference between the mean and the median of X_i, are adequate measures of relevance.
They attain better results than variance.
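A minimal MATLAB sketch (our illustration) of both relevance measures computed for every column (feature) of an n-by-d data matrix X; the variable names are ours.

mu  = mean(X, 1);                          % per-feature mean (1 x d)
MAD = mean(abs(X - mu), 1);                % mean absolute difference per feature
MM  = abs(mu - median(X, 1));              % |mean - median| per feature
[~, ranking] = sort(MAD, 'descend');       % features ranked by decreasing relevance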


Redundancy Measure
The redundancy between two features, say X_i and X_j, is computed by the absolute cosine

    |cos(X_i, X_j)| = |<X_i, X_j>| / (||X_i|| ||X_j||)
                    = |sum_{k=1}^{n} X_{ik} X_{jk}| / ( sqrt(sum_{k=1}^{n} X_{ik}^2) sqrt(sum_{k=1}^{n} X_{jk}^2) ),    (1)

where <.,.> denotes the inner product and ||.|| the Euclidean norm. We have 0 <= |cos(X_i, X_j)| <= 1:
  0 means that the two features are orthogonal (maximally different)
  1 results from collinear features
This measure is similar to Pearson's correlation coefficient.
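A short MATLAB sketch (our illustration) of equation (1), vectorized to compare the i-th feature against every column of X at once; the variable names are ours.

Xi = X(:, i);
s  = abs(Xi' * X) ./ (norm(Xi) * sqrt(sum(X.^2, 1)));   % 1-by-d, values in [0, 1]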


Experimental Results: Relevance and Redundancy

Figure: the relevance and the similarity of the m = 1000 consecutive top-ranked features of the Brain-Tumor1 dataset (d = 5920 features). Among the top-ranked features, we have high redundancy!


Experimental Results: Unsupervised FS
Comparison with other unsupervised approaches, for 10-fold CV with a linear SVM (in the original slides, the best result is in bold face and the second best is underlined).

               --- Our Approach ---    -- Unsupervised --   Baseline
Dataset        MAD     MM     AMGM     TV     LS     SPEC   No FS
Colon          21.0    17.7   19.4     17.7   21.0   22.6   19.4
SRBCT          0.0     0.0    0.0      0.0    0.0    0.0    0.0
PIE10P         0.0     0.0    0.0      0.0    0.0    0.0    0.0
Lymphoma       2.2     2.2    2.2      2.2    2.2    2.2    2.2
Leukemia1      4.2     4.2    5.6      4.2    5.6    4.2    4.2
Brain-Tumor1   14.4    12.2   12.2     12.2   13.3   28.9   12.2
Leukemia       2.8     2.8    2.8      2.8    2.8    30.6   2.8
Example1       2.7     2.7    2.6      2.7    3.3    22.9   2.7
ORL0P          2.0     2.0    4.0      5.0    1.0    1.0    1.0
Lung-Cancer    4.9     5.9    4.9      5.9    5.4    6.4    4.9
SMK-CAN-187    41.7    41.7   41.7     41.7   26.2   25.7   26.2
Dexter         5.3     5.2    5.2      5.2    5.2    39.3   4.7
GLI-85         12.9    14.1   15.3     14.1   11.8   9.4    11.8
Dorothea       24.0    24.0   24.0     24.0   25.0   22.0   24.0


Experimental Results: Supervised FS
Comparison with other supervised approaches, for 10-fold CV with a linear SVM (in the original slides, the best result is in bold face and the second best is underlined).

           ---- Our Approach ----   ------- Supervised Filters -------   Base
Dataset    MM      FiR     MI       RF      CFS    FCBF    FiR    mrMR   No FS
Colon      24.2    22.6    24.2     19.4    25.8   22.6    19.4   21.0   21.0
SRBCT      0.0     0.0     0.0      0.0     0.0    4.8     0.0    4.8    0.0
PIE10P     0.0     0.0     0.5      0.0     *      1.0     0.0    24.8   0.0
Lymph.     2.2     2.2     2.2      2.2     *      3.3     2.2    22.8   2.2
Leuk1      5.6     2.8     6.9      6.9     *      5.6     4.2    9.7    5.6
B-Tum1     13.3    12.2    13.3     11.1    *      18.9    11.1   25.6   10.0
Leuk.      2.8     12.5    2.8      2.8     *      4.2     4.2    8.3    2.8
Example1   2.3     2.2     2.2      3.7     *      6.3     2.1    28.3   2.4
ORL0P      4.0     5.0     2.0      1.0     *      1.0     2.0    68.0   1.0
B-Tum2     34.0    22.0    30.0     22.0    *      36.0    24.0   42.0   26.0
P-Tumor    7.8     5.9     4.9      7.8     *      9.8     7.8    12.7   8.8
L-Cancer   5.9     6.4     4.9      4.9     *      6.4     5.4    11.8   5.9
SMK-187    41.7    40.6    53.5     24.6    *      33.2    23.5   33.2   24.1
Dexter     6.7     6.0     7.7      9.3     *      15.3    6.7    18.0   6.3
GLI-85     14.1    12.9    17.6     11.8    *      20.0    14.1   16.5   14.1
Dorot.     25.0    26.0    25.0     *       *      *       25.0   *      25.0


Experimental Results: Running Time
For each dataset:
  the first row contains the test set error rates (%) of a linear SVM, for 10-fold CV, for our RR method (with MS = 0.8) and other FS methods selecting m features
  the second row shows the total running time taken by each FS algorithm to select m features over the 10 folds

               - Our RR -   ---------------- Filter ----------------   Embedded
Dataset, m     MAD    FiR   LS     SPEC   RF     FCBF    FiR    mrMR   BReg
Colon          24.2   30.6  21.0   24.2   24.2   25.8    24.2   17.7   21.0
  m=1000       0.3    7.4   0.4    1.5    9.0    147.0   7.1    19.8   2.2
SRBCT          0.0    0.0   0.0    0.0    0.0    1.2     0.0    3.6    2.4
  m=1800       0.5    9.7   0.4    2.3    11.6   8.0     9.3    15.8   5.7
Lymph.         2.2    2.2   2.2    3.3    2.2    3.3     3.3    25.0   8.7
  m=2000       0.8    29.5  0.8    9.0    21.7   25.2    29.2   24.7   143.7
P-Tumor        8.8    6.9   10.8   9.8    5.9    10.8    9.8    13.7   7.8
  m=4000       1.8    21.6  2.4    13.5   47.8   29.1    21.1   48.8   46.1
L-Cancer       5.9    6.4   6.4    8.4    6.9    7.4     6.9    11.3   7.4
  m=8000       3.8    54.9  7.5    47.5   95.5   208.8   54.1   88.5   207.4


Conclusions
  High-dimensional datasets are increasingly common
  Learning from high-dimensional data is challenging
  Unsupervised and supervised FD and FS methods can alleviate the inherent complexity
  Wrapper and embedded methods are too costly
  Our filter FD and FS proposals in this talk:
    are both time and space efficient
    attain competitive results with state-of-the-art techniques
    can act as pre-processors to wrapper and embedded methods
  A promising avenue of research (our ongoing work): perform progressive FD and FS simultaneously!


    Calling Weka from MATLAB - FD


FD example: discretize dataset X using Kononenko's MDL method

...
t = weka.filters.supervised.attribute.Discretize();
config = wekaArgumentString('-R first-last', '-K');          % -K selects Kononenko's MDL criterion
t.setOptions(config);
wekaData = wekaCategoricalData(X, SY2MY(Y));
t.setInputFormat(wekaData);
t.useFilter(wekaData, t);                                    % applies the filter (static Filter.useFilter)
d = size(X, 2);
for i = 1:d
    cutPoints{i} = t.getCutPoints(i-1);                      % cut points of the i-th attribute (Java indices are 0-based)
end
...


    Calling Weka from MATLAB - FS


FS example: apply ReliefF to dataset X

...
t = weka.attributeSelection.ReliefFAttributeEval();
config = wekaArgumentString('-M', m, '-D', 1, '-K', k);      % sample size m, seed 1, k nearest neighbours
t.setOptions(config);
t.buildEvaluator(wekaCategoricalData(X, SY2MY(Y)));
nF = size(X, 2);                                             % number of features
out.W = zeros(1, nF);
for i = 1:nF
    out.W(i) = t.evaluateAttribute(i-1);                     % ReliefF weight of the i-th attribute (0-based in Java)
end
[~, out.fList] = sort(out.W, 'descend');                     % feature indices ranked by decreasing weight
...


    Some useful resources


Datasets available online:
  The University of California at Irvine (UCI) repository, archive.ics.uci.edu/ml/datasets.html
  The Gene Expression Model Selector (GEMS) project, at www.gems-system.org/
  The Arizona State University (ASU) repository, http://featureselection.asu.edu/datasets.php
  The World Health Organization (WHO) data, http://apps.who.int/ghodata/

Machine learning tools online:
  ENTool, machine learning toolbox, http://www.j-wichard.de/entool/
  PRTools machine learning toolbox, Delft University of Technology, http://www.prtools.org/
  Weka, http://www.cs.waikato.ac.nz/ml/weka/
  The ASU FS package, with filter and embedded methods, http://featureselection.asu.edu/software.php
