FEATURE DISCRETIZATION AND SELECTION
TECHNIQUES FOR HIGH-DIMENSIONAL DATA

Artur J. Ferreira
Supervisor: Prof. Mário A. T. Figueiredo
Instituto Superior de Engenharia de Lisboa / Instituto de Telecomunicações, Lisboa
Priberam Machine Learning Lunch Seminar, 17 April 2012
Outline

Introduction

Background
  High-Dimensional Data
  Feature Discretization (FD)
  Feature Selection (FS)

FD Proposals
  Static Unsupervised Proposals
  Static Supervised Proposal

FS Proposals
  Feature Selection Proposals

Conclusions

Some Resources
Introduction and Motivation

Some well-known facts about machine learning:

1. An adequate (sometimes discrete) representation of the data is necessary
2. High-dimensional datasets are increasingly common
3. Learning from high-dimensional data is challenging
4. Sometimes the curse-of-dimensionality problem arises:
   - small number of instances n and large dimensionality d (e.g., microarray data or text categorization data)
   - it must be addressed in order to have effective learning algorithms
5. Feature discretization (FD) and feature selection (FS) techniques address these problems:
   - achieve adequate representations
   - select an adequate subset of features with a convenient representation
High-Dimensional Data

Some high-dimensional datasets available on the Web, covering different types of problems. Datasets with c classes and n instances, shown by increasing dimensionality d; notice that in some cases d ≫ n.

Dataset        d        c   n     Type of Data
Colon          2000     2   62    Microarray
SRBCT          2309     4   83    Microarray
AR10P          2400     10  130   Face
PIE10P         2420     10  210   Face
TOX-171        5748     4   171   Microarray
Example1       9947     2   50    Text, BoW
ORL10P         10304    10  100   Face
11-Tumors      12553    11  174   Microarray
Lung-Cancer    12601    5   203   Microarray
SMK-CAN-187    19993    2   187   Microarray
Dexter         20000    2   2600  BoW
GLI-85         22283    2   85    Microarray
Dorothea       1000000  2   1950  Drug Discovery
These datasets are available online:
- The well-known University of California at Irvine (UCI) repository, archive.ics.uci.edu/ml/datasets.html
- The gene expression model selector (GEMS) project, at www.gems-system.org/
- The recently developed Arizona State University (ASU) repository, featureselection.asu.edu/datasets.php, which has high-dimensional datasets
Feature Discretization

Feature Discretization (FD) aims at:
- representing each feature with symbols from a finite set
- keeping enough information for the learning task
- ignoring minor (noisy/irrelevant) fluctuations in the data

FD can be performed by:
- unsupervised or supervised methods; the latter use class labels to compute the discretization intervals
- static or dynamic methods:
  - static: the discretization intervals are computed using solely the training data
  - dynamic: rely on a wrapper approach with a quantizer and a classifier
Feature Discretization: Static Unsupervised Approaches

Three static unsupervised techniques are commonly used for FD:
- equal-interval binning (EIB): uniform quantization
- equal-frequency binning (EFB): non-uniform quantization, to attain a uniform distribution of samples per bin
- proportional k-interval discretization (PkID): the number and size of the discretization intervals are adjusted to the number of training instances

Compared to both EIB and EFB, PkID provides naïve Bayes classifiers with:
- competitive classification performance on smaller datasets
- better classification performance on larger datasets
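To make the first two concrete, here is a minimal numpy sketch of EIB and EFB for a single feature; the q-bit codebook sizes and all names are illustrative choices of this sketch, not taken from the talk.

import numpy as np

def eib(x, q):
    # Equal-interval binning: uniform quantization into 2**q equal-width bins.
    edges = np.linspace(x.min(), x.max(), 2 ** q + 1)
    return np.digitize(x, edges[1:-1])   # symbols in {0, ..., 2**q - 1}

def efb(x, q):
    # Equal-frequency binning: bin edges at quantiles, so each of the 2**q
    # bins holds roughly the same number of samples (ties may unbalance it).
    edges = np.quantile(x, np.linspace(0, 1, 2 ** q + 1))
    return np.digitize(x, edges[1:-1])

rng = np.random.default_rng(0)
x = rng.standard_normal(1000)
print(np.bincount(efb(x, 2)))   # roughly 250 samples in each of the 4 bins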
Feature Discretization: Static Supervised Approaches

Some static supervised techniques for FD:
- information entropy maximization (IEM), Fayyad and Irani, 1993
- minimal description length (MDL), Kononenko, 1995
- class-attribute interdependence maximization (CAIM), 2004
- class-attribute contingency coefficient (CACC), 2008
- correlation maximization (CM), 2011

Empirical evidence shows that:
- the IEM and MDL methods perform well regarding both accuracy and running time
- the CACC and CAIM methods have higher running time than both IEM and MDL; in some cases, they attain better results than these methods
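The entropy-based family can be illustrated with a single binary cut: the sketch below picks the boundary on one feature that minimizes the weighted class entropy of the two resulting intervals. The full Fayyad-Irani IEM method applies this recursively, with an MDL criterion deciding when to stop splitting; that part is omitted here, and all names are choices of this sketch.

import numpy as np

def entropy(y):
    # Empirical class entropy, in bits.
    _, counts = np.unique(y, return_counts=True)
    p = counts / counts.sum()
    return -(p * np.log2(p)).sum()

def best_cut(x, y):
    # Boundary on feature x minimizing the weighted entropy of the two halves.
    # Naive O(n^2) scan, fine for a sketch; returns (None, inf) if x is constant.
    order = np.argsort(x)
    x, y = x[order], y[order]
    n = len(y)
    best, best_h = None, np.inf
    for k in range(1, n):
        if x[k] == x[k - 1]:
            continue                     # only cut between distinct values
        h = (k * entropy(y[:k]) + (n - k) * entropy(y[k:])) / n
        if h < best_h:
            best, best_h = (x[k - 1] + x[k]) / 2, h
    return best, best_h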
Feature Discretization: Weka's Environment

We can find IEM (Fayyad and Irani, 1993) and MDL (Kononenko, 1995) in the Weka machine learning package, as weka.filters.supervised.attribute.Discretize

Note: the unsupervised PkID method is also available.
Feature Selection

Feature Selection (FS) is a central problem in machine learning and pattern recognition:
- what is the best subset of features for a given problem?
- how many features should we choose?

There are many recent papers on FS, with many published in 2012 alone.

FS can be performed by:
- unsupervised or supervised methods; the latter use class labels
- filter, wrapper, or embedded approaches
Feature Selection: Filter Approach

The FS filter approach:
- assesses the quality of a given feature subset using solely characteristics of that subset
- does not rely on any learning algorithm

There are many (successful) filter approaches for FS:
- unsupervised: Term-Variance (TV), Laplacian Score (LS), Laplacian Score Extended (LSE), SPEC, ...
- supervised: Relief, ReliefF, CFS, FiR, ...
- supervised: FCBF, mrMR, MIM, CMIM, IG, ...
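The simplest entry on the unsupervised list, Term-Variance, fits in a few lines and shows the filter pattern (score each feature on its own, keep the top m); a minimal sketch, with names assumed for illustration:

import numpy as np

def tv_filter(X, m):
    # Term-Variance filter: rank features (columns) by variance, keep top m.
    scores = X.var(axis=0)
    idx = np.argsort(scores)[::-1][:m]   # indexes of the m highest-variance features
    return idx, X[:, idx]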
Feature Selection: Information-Theoretic Filters

Claude Shannon's information theory (IT) has been the basis for the proposal of many filter methods (IT-based filters):

Brown, G., Pocock, A., Zhao, M., Luján, M., 2012. Conditional likelihood maximisation: A unifying framework for information theoretic feature selection. Journal of Machine Learning Research 13, 27–66.
Feature Selection: Wrapper Approach

Wrapper approaches:
- use some method to search the space of all possible subsets of features
- assess the quality of each subset by learning and evaluating a classifier with it

Combinatorial search makes wrappers inadequate for high-dimensional data.

Some recent wrapper approaches for FS (2009-2011):
- greedy randomized adaptive search procedure (GRASP)
- a GRASP-based FS hybrid filter-wrapper method
- a sparse model for high-dimensional data, based on linear programming
- estimation of the redundancy between feature sets, by the conditional MI between feature subsets and each class
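As a concrete (and deliberately naive) wrapper, a sketch of greedy forward selection around a cross-validated classifier; scikit-learn and the LinearSVC choice are assumptions of this sketch, not from the talk. With d candidate features and m selected, it trains on the order of d*m classifiers, which is exactly why wrappers do not scale to high dimensions.

from sklearn.model_selection import cross_val_score
from sklearn.svm import LinearSVC

def forward_wrapper(X, y, m, clf=None):
    # Greedy forward selection: at each step, add the feature whose inclusion
    # maximizes the 5-fold cross-validated accuracy of the wrapped classifier.
    clf = clf if clf is not None else LinearSVC()
    selected, remaining = [], list(range(X.shape[1]))
    while len(selected) < m:
        score, best = max((cross_val_score(clf, X[:, selected + [j]], y, cv=5).mean(), j)
                          for j in remaining)
        selected.append(best)
        remaining.remove(best)
    return selected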
Feature Selection: Embedded Approach

The embedded (or integrated) approach:
- simultaneously learns the classifier and chooses a subset of features
- assigns weights to features
- the objective function encourages some of these weights to become zero

Examples of the embedded approach are:
- sparse multinomial logistic regression (SMLR*)
- the sparse logistic regression (SLogReg) method
- Bayesian logistic regression (BLogReg), a more adaptive version of SLogReg
- joint classifier and feature optimization (JCFO*), which applies FS inside a kernel function
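In the same spirit (though not any of the four methods above), a plain L1-regularized logistic regression shows the embedded mechanics in a few lines: the penalty drives most weights exactly to zero, and the surviving features are the selected subset. scikit-learn and the synthetic data are assumptions of this sketch.

import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic n << d problem: 100 instances, 500 features, 10 informative.
X, y = make_classification(n_samples=100, n_features=500, n_informative=10,
                           random_state=0)
# L1 penalty: classifier and feature subset are learned jointly;
# C controls sparsity (smaller C = sparser model).
clf = LogisticRegression(penalty='l1', solver='liblinear', C=0.1).fit(X, y)
selected = np.flatnonzero(clf.coef_[0])   # features with nonzero weight
print(len(selected), 'features kept out of', X.shape[1])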
Feature Selection on High-Dimensional Data

How difficult/challenging is FS on a given dataset?

The ratio(1)

    R_FS = n / (a c),

with n patterns (instances), c classes, and a the median arity of the features (discretized with EFB), measures this difficulty. Low values imply more difficult FS problems.

Another question: shouldn't the d/n ratio also be taken into account?

(1) Brown, G., Pocock, A., Zhao, M., Luján, M., 2012. Conditional likelihood maximisation: A unifying framework for information theoretic feature selection. Journal of Machine Learning Research 13, 27–66.
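As a rough worked example (the arity value is assumed here purely for illustration, it is not reported in the talk): for the Colon dataset from the earlier table (n = 62, c = 2), a median EFB arity of a = 3 would give R_FS = 62 / (3 × 2) ≈ 10.3, while a dataset with n = 5000 and the same a and c would give R_FS ≈ 833, a much easier selection problem by this measure.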
Feature Selection: an (outdated) categorization

[Figure: taxonomy of feature selection algorithms, reproduced from the reference below.]

Liu, H. and Yu, L., Toward Integrating Feature Selection Algorithms for Classification and Clustering, IEEE Transactions on Knowledge and Data Engineering, vol. 17, no. 4, April 2005, pp. 491–502.
Feature Discretization: Static U-LBG1 Algorithm

Recently, we have proposed the use of the Linde-Buzo-Gray (LBG) algorithm for unsupervised static FD(2):
- the LBG algorithm is applied individually to each feature, leading to discrete features with minimum mean square error (MSE)
- rationale: low MSE(X_i, Q(X_i)) is adequate for learning!
- it is stopped when:
  - the MSE distortion falls below some threshold Δ, or
  - the maximum number of bits q per feature is reached
- obtains a variable number of bits per feature
- uses (Δ, q) as input parameters; Δ set to 5% of the range of each feature and q ∈ {4, ..., 10} are adequate

(2) A. Ferreira and M. Figueiredo, An unsupervised approach to feature discretization and selection, Pattern Recognition (Elsevier), DOI: 10.1016/j.patcog.2011.12.008
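A numpy sketch of the per-feature LBG loop, under stated assumptions: the codebook is grown one bit at a time by splitting, Lloyd iterations refine it, and the loop stops when the distortion drops below Δ (taken as 5% of the feature's range; compared here against the RMSE, a convention of this sketch rather than of the paper) or when q_max bits are reached. U-LBG2, on the next slide, would instead run a single quantizer at the maximum q for every feature.

import numpy as np

def lbg_feature(x, q_max=10, delta_frac=0.05, n_iter=20):
    # Discretize one feature U-LBG1-style; returns (discrete symbols, bits used).
    span = x.max() - x.min()
    delta = delta_frac * span
    codebook = np.array([x.mean()])
    for q in range(1, q_max + 1):
        # Split step: each centroid becomes two slightly perturbed centroids.
        step = 1e-3 * span + 1e-12
        codebook = np.concatenate([codebook - step, codebook + step])
        for _ in range(n_iter):  # Lloyd iterations
            idx = np.argmin(np.abs(x[:, None] - codebook[None, :]), axis=1)
            for c in range(len(codebook)):
                if np.any(idx == c):
                    codebook[c] = x[idx == c].mean()  # centroid update
        rmse = np.sqrt(np.mean((x - codebook[idx]) ** 2))
        if rmse < delta:         # distortion threshold reached: stop early
            return idx, q
    return idx, q_max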
Feature Discretization: U-LBG2 Algorithm

Similar to U-LBG1 (both aim at obtaining quantizers that represent the features with small distortion), with the following key differences:
- each discretized feature is given the same (maximum) number of bits q
- only one quantizer is learned for each feature
Experimental Results: Unsupervised Discretization

Discretization performance:
- total number of bits per pattern (T. Bits) and test set error rate (Err, average of ten runs)
- using up to q = 7 bits, with the naïve Bayes classifier

                         EFB           U-LBG1        U-LBG2
Dataset        Orig.Err  T.Bits Err    T.Bits Err    T.Bits Err
Phoneme        21.30     30    22.30   9     22.80   30    20.60
Pima           25.30     48    25.20   30    25.20   48    25.80
Abalone        28.00     48    27.60   15    27.20   48    27.70
Contraceptive  34.80     54    31.40   15    38.00   54    34.80
Wine           3.73      78    4.80    27    3.20    78    3.20
Hepatitis      20.50     95    21.50   32    21.00   39    18.00
WBCD           5.87      180   5.13    60    5.87    180   5.67
Ionosphere     10.60     198   9.80    49    17.40   198   11.00
SpamBase       15.27     324   13.40   54    15.73   324   15.67
Lung           35.00     318   35.00   74    35.83   318   35.00
Arrhythmia     32.00     1392  51.56   553   30.22   1392  41.56
Some insight on the accuracy of each feature

[Figure] Test set error rate of naïve Bayes using only a single feature, discretized with q = 4 bits by the U-LBG2 algorithm, on the WBCD dataset. The horizontal dashed line is the test set error rate with all p = 30 features.
Some insight on the accuracy of each feature

[Figure] Test set error rate of naïve Bayes using only a single feature, discretized with q = 8 bits by the U-LBG2 algorithm, on the WBCD dataset. The horizontal dashed line is the test set error rate with all p = 30 features.
Some insight on progressive discretization (1/2)

[Figure] Test set error rate of naïve Bayes using solely feature 17 of the WBCD dataset (original feature, and U-LBG2 discretized with q ∈ {1, ..., 10}). The discrete versions do not provide higher accuracy!
Some insight on progressive discretization (2/2)

[Figure] Test set error rate of naïve Bayes using solely feature 25 of the WBCD dataset (original feature, and U-LBG2 discretized with q ∈ {1, ..., 10}). With q = 10, we get a small improvement in accuracy.
Experimental Results Analysis

Some comments on these methods and results:
- often, discretization improves classification accuracy
- EFB usually attains better results than EIB
- U-LBG2 is usually faster than U-LBG1, but allocates more bits per feature
- often, one of the LBG methods attains better results than EFB
- the U-LBG procedures are more complex than either EIB or EFB
- PkID attains results close to EFB when using the naïve Bayes classifier
Feature Discretization: S-LBG Algorithm (ideas)

Supervised version of its U-LBG counterparts:
- each feature is discretized with LBG
- it uses the mutual information (MI) between the discretized feature and the class label to control the discretization procedure
- it stops at b bits, or when the relative increase of MI with respect to the previous quantizer is less than a given threshold
- each feature is thus discretized with an increasing number of bits, stopping only when:
  - there is no significant increase in the relevance of the feature, or
  - the maximum number of bits is reached
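A sketch of the MI-controlled stopping rule. The MI estimator below uses the empirical contingency table (integer class labels assumed); for brevity a plain uniform quantizer stands in for LBG, and the threshold name tau is an assumption of this sketch.

import numpy as np

def mutual_information(xd, y):
    # MI (bits) between a discrete feature xd and integer class labels y,
    # estimated from the empirical joint distribution.
    joint = np.zeros((xd.max() + 1, y.max() + 1))
    np.add.at(joint, (xd, y), 1)
    joint /= joint.sum()
    px = joint.sum(axis=1, keepdims=True)
    py = joint.sum(axis=0, keepdims=True)
    nz = joint > 0
    return float((joint[nz] * np.log2(joint[nz] / (px @ py)[nz])).sum())

def uniform_q(x, b):
    # Uniform quantizer standing in for LBG, to keep the sketch short.
    edges = np.linspace(x.min(), x.max(), 2 ** b + 1)[1:-1]
    return np.digitize(x, edges)

def slbg_bits(x, y, b_max=10, tau=0.01):
    # Stop at b_max bits, or when the relative MI increase with respect to
    # the previous quantizer drops below tau (per the slide).
    prev = 0.0
    for b in range(1, b_max + 1):
        mi = mutual_information(uniform_q(x, b), y)
        if prev > 0 and (mi - prev) / prev < tau:
            return b - 1
        prev = mi
    return b_max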
Feature Discretization: MI Algorithm (ideas)

Ongoing work, with the following key ideas:
- discretize each feature so as to maximize MI with the class label
- do this in a progressive approach, scanning all features
- allocate a variable number of bits per feature
- check how the relevance of each feature changes when discretized with b+1 bits, as compared to b bits
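Scanning all features with the previous sketch's helpers gives the progressive allocation in a few lines; this is an illustrative reading of the slide (reusing slbg_bits from the sketch above), not the published algorithm.

import numpy as np

def progressive_fd(X, y, b_max=10, tau=0.01):
    # Allocate a variable number of bits to each feature (column of X):
    # each feature keeps gaining bits while its MI with the class label
    # improves by more than tau (relative).
    return np.array([slbg_bits(X[:, i], y, b_max=b_max, tau=tau)
                     for i in range(X.shape[1])])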
Experimental Results

Discretization performance:
- total number of bits per pattern (T. Bits) and test set error rate (Err, average of ten runs)
- using up to q = 7 bits, with the naïve Bayes classifier

                     EFB          U-LBG1       U-LBG2       S-LBG
Dataset    Base Err  TBits Err    TBits Err    TBits Err    TBits Err
Iris       2.6       24    4.5    7     9.0    24    2.6    15    3.4
Phoneme    21.3      30    22.3   9     22.8   30    20.6   20    22.5
Pima       25.3      48    25.2   30    25.2   48    25.8   40    24.4
Abalone    28.0      48    27.6   15    27.2   48    27.7   35    27.8
Contrac.   34.8      54    31.4   15    38.0   54    34.8   29    34.7
Wine       3.7       78    4.8    27    3.2    78    3.2    57    4.5
Hepatitis  20.5      95    21.5   32    21.0   39    18.0   29    19.0
WBCD       5.8       180   5.1    60    5.8    180   5.6    116   5.4
Ionosph.   10.6      198   9.8    49    17.4   198   11.0   177   11.0
SpamBase   15.2      324   13.4   54    15.7   324   15.6   220   14.6
Lung       35.0      318   35.0   74    35.8   318   35.0   135   35.0
Arrhyt.    32.0      1392  51.5   553   30.2   1392  41.5   1050  31.3
Feature Selection

Some comments on existing FS methods:
- usually, wrappers perform better than embedded methods, while taking longer
- embedded methods perform better than filters, while being much slower
- on very high-dimensional datasets:
  - both wrapper and embedded methods are too expensive
  - filters are the only applicable option!
  - even some filter FS methods can take a prohibitive time in the redundancy analysis and elimination stage
Feature Selection: Relevance-Redundancy (RR)

The Yu and Liu relevance-redundancy framework(3):
- first compute relevance, then redundancy
- the optimal subset is given by parts III and IV, without the irrelevant and the redundant features

(3) Yu, L., Liu, H., Dec. 2004. Efficient feature selection via analysis of relevance and redundancy. JMLR 5, 1205–1224.
A RR FS approach for high-dimensional data (ideas)

Key observations for filters on high-dimensional data:
- some supervised filter methods (e.g., CFS and mrMR) are computationally expensive
- they waste time on subspace search and redundancy analysis
- redundancy is typically found among the most relevant features

Our proposal for fast unsupervised and supervised filter RR FS on high-dimensional data:
- sorts the d features by decreasing relevance
- computes the redundancy between the most relevant features only
- computes up to d-1 pairwise similarities
A RR FS approach for high-dimensional data (ideas)

We keep only features with high relevance and low similarity among themselves (i.e., below some threshold MS):
- it is not expected that redundant features are consecutive in the relevance-sorted feature list
- however, it is a waste of time to compute the redundancy between weakly relevant and irrelevant features!
- so, we perform the redundancy check only among the top-ranked relevant features
A RR FS approach for high-dimensional data (details)

Input:   X: n x d matrix, n patterns of a d-dimensional training set.
         m (<= d): maximum number of features to keep.
         MS: maximum allowed similarity between pairs of features.
Output:  FeatKeep: an m'-dimensional array (with m' <= m) containing the indexes of the selected features.
         X': n x m' matrix, reduced-dimensional training set, with features sorted by decreasing relevance.

1: Compute the relevance r_i of each feature X_i (the columns of X), for i in {1, ..., d}, using one of the dispersion measures (MAD or MM).
2: Sort the features by decreasing order of r_i. Let i_1, i_2, ..., i_d be the resulting permutation of {1, ..., d} (i.e., r_{i_1} >= r_{i_2} >= ... >= r_{i_d}).
3: FeatKeep[1] = i_1; prev = 1; next = 2;
4: for f = 2 to d do
5:     s = S(X_{i_f}, X_{i_prev});
6:     if s < MS then FeatKeep[next] = i_f; prev = f; next = next + 1;
7:     if next > m then break;
8: end for
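A runnable numpy sketch of this pseudocode, using the MAD/MM relevance measures and the absolute-cosine redundancy measure detailed on the next two slides; the function names and the small numerical guard are choices of this sketch, not of the paper. MS = 0.8 matches the value used in the running-time experiments later in the talk.

import numpy as np

def mad(X):
    # Relevance as dispersion: mean absolute difference of each column.
    return np.abs(X - X.mean(axis=0)).mean(axis=0)

def mm(X):
    # Alternative relevance: |mean - median| of each column.
    return np.abs(X.mean(axis=0) - np.median(X, axis=0))

def abs_cosine(a, b):
    # Redundancy: absolute cosine between two feature vectors.
    return abs(a @ b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12)

def rr_filter(X, m, MS=0.8, relevance=mad):
    # Rank features by decreasing relevance, then walk down the ranking and
    # keep a feature only if its similarity to the previously kept feature
    # is below MS. At most d-1 pairwise similarities are computed.
    order = np.argsort(relevance(X))[::-1]
    keep = [order[0]]
    for i in order[1:]:
        if len(keep) == m:
            break
        if abs_cosine(X[:, i], X[:, keep[-1]]) < MS:
            keep.append(i)
    return np.array(keep), X[:, keep]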
Relevance Measures

For unsupervised learning, we found relevance proportional to dispersion.

The mean absolute difference

    MAD_i = (1/n) Σ_{j=1}^{n} |X_ij - mean(X_i)|,

and the mean-median

    MM_i = |mean(X_i) - median(X_i)|,

i.e., the absolute difference between the mean and the median of X_i, are adequate measures of relevance.

They attain better results than the variance.
Redundancy Measure

The redundancy between two features, say X_i and X_j, is computed by the absolute cosine

    |cos(X_i, X_j)| = |<X_i, X_j>| / (||X_i|| ||X_j||)
                    = |Σ_k X_ik X_jk| / (sqrt(Σ_k X_ik^2) sqrt(Σ_k X_jk^2)),    (1)

where <.,.> denotes the inner product and ||.|| the Euclidean norm.

We have 0 <= |cos(X_i, X_j)| <= 1:
- 0 means that the two features are orthogonal (maximally different)
- 1 results from collinear features

It is similar to Pearson's correlation coefficient.
Experimental Results: Relevance and Redundancy

[Figure] The relevance and similarity of the m = 1000 consecutive top-ranked features of the Brain-Tumor1 dataset (d = 5920 features).

Among the top-ranked features, we have high redundancy!
Experimental Results: Unsupervised FS

Comparison with other unsupervised approaches, for a 10-fold CV with linear SVM (test set error rates, %).

              Our Approach        Unsupervised        Baseline
Dataset       MAD   MM    AMGM    TV    LS    SPEC    No FS
Colon         21.0  17.7  19.4    17.7  21.0  22.6    19.4
SRBCT         0.0   0.0   0.0     0.0   0.0   0.0     0.0
PIE10P        0.0   0.0   0.0     0.0   0.0   0.0     0.0
Lymphoma      2.2   2.2   2.2     2.2   2.2   2.2     2.2
Leukemia1     4.2   4.2   5.6     4.2   5.6   4.2     4.2
Brain-Tumor1  14.4  12.2  12.2    12.2  13.3  28.9    12.2
Leukemia      2.8   2.8   2.8     2.8   2.8   30.6    2.8
Example1      2.7   2.7   2.6     2.7   3.3   22.9    2.7
ORL10P        2.0   2.0   4.0     5.0   1.0   1.0     1.0
Lung-Cancer   4.9   5.9   4.9     5.9   5.4   6.4     4.9
SMK-CAN-187   41.7  41.7  41.7    41.7  26.2  25.7    26.2
Dexter        5.3   5.2   5.2     5.2   5.2   39.3    4.7
GLI-85        12.9  14.1  15.3    14.1  11.8  9.4     11.8
Dorothea      24.0  24.0  24.0    24.0  25.0  22.0    24.0
Experimental Results: Supervised FS

Comparison with other supervised approaches, for a 10-fold CV with linear SVM (test set error rates, %).

           Our Approach       Supervised Filters              Base
Dataset    MM    FiR   MI     RF    CFS   FCBF  FiR   mrMR   No FS
Colon      24.2  22.6  24.2   19.4  25.8  22.6  19.4  21.0   21.0
SRBCT      0.0   0.0   0.0    0.0   0.0   4.8   0.0   4.8    0.0
PIE10P     0.0   0.0   0.5    0.0   *     1.0   0.0   24.8   0.0
Lymph.     2.2   2.2   2.2    2.2   *     3.3   2.2   22.8   2.2
Leuk1      5.6   2.8   6.9    6.9   *     5.6   4.2   9.7    5.6
B-Tum1     13.3  12.2  13.3   11.1  *     18.9  11.1  25.6   10.0
Leuk.      2.8   12.5  2.8    2.8   *     4.2   4.2   8.3    2.8
Example1   2.3   2.2   2.2    3.7   *     6.3   2.1   28.3   2.4
ORL10P     4.0   5.0   2.0    1.0   *     1.0   2.0   68.0   1.0
B-Tum2     34.0  22.0  30.0   22.0  *     36.0  24.0  42.0   26.0
P-Tumor    7.8   5.9   4.9    7.8   *     9.8   7.8   12.7   8.8
L-Cancer   5.9   6.4   4.9    4.9   *     6.4   5.4   11.8   5.9
SMK-187    41.7  40.6  53.5   24.6  *     33.2  23.5  33.2   24.1
Dexter     6.7   6.0   7.7    9.3   *     15.3  6.7   18.0   6.3
GLI-85     14.1  12.9  17.6   11.8  *     20.0  14.1  16.5   14.1
Dorot.     25.0  26.0  25.0   *     *     *     25.0  *      25.0
Experimental Results: Running Time

For each dataset:
- the first row contains the test set error rates (%) of linear SVM, for a 10-fold CV, for our RR method (with MS = 0.8) and other FS methods selecting m features
- the second row shows the total running time taken by each FS algorithm to select m features over the 10 folds

             Our RR        Filter                                     Embedded
#, m         MAD    FiR    LS    SPEC   RF     FCBF    FiR    mrMR   BReg
Colon        24.2   30.6   21.0  24.2   24.2   25.8    24.2   17.7   21.0
m=1000       0.3    7.4    0.4   1.5    9.0    147.0   7.1    19.8   2.2
SRBCT        0.0    0.0    0.0   0.0    0.0    1.2     0.0    3.6    2.4
m=1800       0.5    9.7    0.4   2.3    11.6   8.0     9.3    15.8   5.7
Lymph.       2.2    2.2    2.2   3.3    2.2    3.3     3.3    25.0   8.7
m=2000       0.8    29.5   0.8   9.0    21.7   25.2    29.2   24.7   143.7
P-Tumor      8.8    6.9    10.8  9.8    5.9    10.8    9.8    13.7   7.8
m=4000       1.8    21.6   2.4   13.5   47.8   29.1    21.1   48.8   46.1
L-Cancer     5.9    6.4    6.4   8.4    6.9    7.4     6.9    11.3   7.4
m=8000       3.8    54.9   7.5   47.5   95.5   208.8   54.1   88.5   207.4
Conclusions

- High-dimensional datasets are increasingly common
- Learning from high-dimensional data is challenging
- Unsupervised and supervised FD and FS methods can alleviate the inherent complexity
- Wrapper and embedded methods are too costly
- Our filter FD and FS proposals in this talk:
  - are both time and space efficient
  - attain competitive results with state-of-the-art techniques
  - can act as pre-processors to wrapper and embedded methods
- A promising avenue of research (our ongoing work): perform progressive FD and FS simultaneously!
Calling Weka from MATLAB - FD
FD example: discretize dataset X using Kononenko's MDL method. (wekaArgumentString, wekaCategoricalData and SY2MY are helpers from the MATLAB-Weka interface layer used in the talk.)

...
% supervised discretization filter; -K selects Kononenko's MDL criterion
t = weka.filters.supervised.attribute.Discretize();
config = wekaArgumentString('-R first-last,-K');
t.setOptions(config);
wekaData = wekaCategoricalData(X, SY2MY(Y));            % wrap (X, Y) as a Weka dataset
t.setInputFormat(wekaData);
wekaData = weka.filters.Filter.useFilter(wekaData, t);  % useFilter is static on weka.filters.Filter
d = size(X, 2);
for i = 1:d
    cutPoints{i} = t.getCutPoints(i-1);   % cut points per attribute (0-based index)
end
...
Calling Weka from MATLAB - FS
FS example: apply ReliefF to dataset X. (-M: number of sampled instances; -D: seed; -K: number of neighbours.)

...
t = weka.attributeSelection.ReliefFAttributeEval();
config = wekaArgumentString('-M', m, '-D', 1, '-K', k);
t.setOptions(config);
t.buildEvaluator(wekaCategoricalData(X, SY2MY(Y)));
nF = size(X, 2);                          % number of features
out.W = zeros(1, nF);
for i = 1:nF
    out.W(i) = t.evaluateAttribute(i-1);  % ReliefF weight (0-based index)
end
[~, out.fList] = sort(out.W, 'descend');  % feature indexes by decreasing weight
...
Some useful resources
Datasets available online:
- The University of California at Irvine (UCI) repository, archive.ics.uci.edu/ml/datasets.html
- The gene expression model selector (GEMS) project, at www.gems-system.org/
- The Arizona State University (ASU) repository, http://featureselection.asu.edu/datasets.php
- The World Health Organization (WHO) data, http://apps.who.int/ghodata/

Machine learning tools online:
- ENTool, machine learning toolbox, http://www.j-wichard.de/entool/
- PRTools machine learning toolbox, Delft University of Technology, http://www.prtools.org/
- Weka, http://www.cs.waikato.ac.nz/ml/weka/
- The ASU FS package, with filter and embedded methods, http://featureselection.asu.edu/software.php