Feature/Model Selection by Linear Programming SVM, Combined with State-of-Art Classifiers: What Can We Learn About the Data
Erinija Pranckeviciene, Ray Somorjai
Institute for Biodiagnostics, NRC Canada




Page 1:

Feature/Model Selection by Linear Programming SVM, Combined with State-of-Art

Classifiers: What Can We Learn About the Data

Erinija Pranckeviciene, Ray Somorjai,

Institute for Biodiagnostics, NRC Canada,

Page 2:

Outline of the presentation

• Description of the algorithm

• Results on Agnostic Learning vs. Prior Knowledge (AL vs. PK) challenge datasets

• Conclusions

Page 3:

Motivation to enter the Challenge

• For small-sample-size, high-dimensional datasets, the feature selection procedure adapts to the peculiarities of the training dataset (sample bias).

• An ideal model selection procedure would produce stable estimates of the classification error rate and the identities of discovered features would not vary much across the different random splits.

• Our experiments with Linear Programming SVM (LP-SVM) on biomedical datasets produced results more robust to the sample bias and demonstrated the property stated above.

We decided to test LP-SVM’s robustness property in a controlled experiment: the independent platform of the AL vs. PK challenge.

Page 4:

Classification with LP-SVM

• The formulation of LP-SVM (known as Liknon, Bhattacharyya et al.) is very similar to that of the conventional linear SVM, except for the objective function, which is linear due to the L1 norm of the regularization term.

• The solution of the LP-SVM is a linear discriminant, in which the weight magnitudes identify those original features that are important for class discrimination.

• Different values of the regularization parameter C in the optimization problem produce different discriminants.
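The primal of the L1-regularized linear SVM can be sketched as follows (a standard Liknon-style form; the notation is assumed, not reproduced from the slides):

```latex
\min_{\mathbf{w},\, b,\, \boldsymbol{\xi}} \;
\|\mathbf{w}\|_{1} + C \sum_{i=1}^{N} \xi_{i}
\quad \text{subject to} \quad
y_{i}\!\left(\mathbf{w}^{\top}\mathbf{x}_{i} + b\right) \ge 1 - \xi_{i},
\qquad \xi_{i} \ge 0 .
```

Replacing the usual quadratic term by the L1 norm turns the quadratic program into a linear program and drives many weights exactly to zero, which is what makes the weight magnitudes usable as a feature selector.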

Page 5:

Outline of the algorithm

1) The available training data are processed in 10-fold stratified cross-validation (preserving the existing class proportions): 9/10 of the data are used for training, 1/10 for independent testing.

2) The training portion is split randomly into balanced

training and unbalanced monitoring sets.

3) We perform 31 random splits.
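Steps 1)–3) can be sketched as follows (a minimal illustration, not the talk's exact recipe; the function names and the half-of-minority-class size of the balanced set are assumptions):

```python
import numpy as np

def stratified_folds(y, n_folds=10, seed=0):
    """Step 1: split sample indices into stratified folds that preserve the
    class proportions; each fold serves once as the independent test set."""
    rng = np.random.default_rng(seed)
    folds = [[] for _ in range(n_folds)]
    for label in np.unique(y):
        for i, j in enumerate(rng.permutation(np.where(y == label)[0])):
            folds[i % n_folds].append(int(j))
    return [np.array(sorted(f)) for f in folds]

def balanced_monitor_split(train_idx, y, seed=0):
    """Step 2: split the training portion into a class-balanced training set
    and an unbalanced monitoring set holding the leftover samples.
    (Taking half the minority-class count per class is an assumption.)"""
    rng = np.random.default_rng(seed)
    labels = np.unique(y[train_idx])
    n_bal = min((y[train_idx] == c).sum() for c in labels) // 2
    train, monitor = [], []
    for c in labels:
        idx = rng.permutation(train_idx[y[train_idx] == c])
        train.extend(idx[:n_bal])
        monitor.extend(idx[n_bal:])
    return np.array(train), np.array(monitor)

# Step 3: repeat balanced_monitor_split 31 times with different seeds.
```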

Page 6:

Evolution of the models

4) The training set is used to find several LP-SVM discriminants, determined by a sequence of values of the regularization parameter C, in every split. Increasing C increases the number of selected features.

5) A balanced error rate (BER) for every discriminant is estimated on the monitoring set.

6) The discriminant / model with the smallest monitoring BER is retained.
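Steps 5)–6) amount to picking, per split, the discriminant with the smallest monitoring BER. A sketch (the linear-rule interface and the {-1, +1} label encoding are assumptions):

```python
import numpy as np

def balanced_error_rate(y_true, y_pred):
    """BER: the average of the per-class error rates (the challenge metric)."""
    classes = np.unique(y_true)
    return float(np.mean([np.mean(y_pred[y_true == c] != c) for c in classes]))

def best_model(discriminants, X_mon, y_mon):
    """Step 6: keep the discriminant with the smallest monitoring BER.
    Each discriminant is assumed to be a (w, b) pair of a linear rule
    sign(X @ w + b), with labels in {-1, +1}."""
    bers = [balanced_error_rate(y_mon, np.sign(X_mon @ w + b))
            for w, b in discriminants]
    return int(np.argmin(bers)), min(bers)
```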

Page 7:

Example of the evolution of the models on synthetic data

Page 8:

Feature profiles and other classifiers

7) In a single fold, a feature profile is derived by counting the frequency of inclusion of the features in the best-BER discriminants (we have 31 best-BER discriminants, one per split).

8) As a result, we have an ensemble of linear discriminants operating on the selected features; the feature profile is then tested with other classifiers.

(Several thresholds of the frequency of inclusion were examined for the different datasets, to test state-of-the-art classification rules such as k-NN, Fisher's discriminant, etc.)
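Steps 7)–8) as a sketch (the nonzero-weight tolerance and the function names are assumptions):

```python
import numpy as np

def feature_profile(weight_vectors, tol=1e-12):
    """Step 7: frequency of inclusion of each original feature across the
    best-BER discriminants of the 31 splits (nonzero weight = included)."""
    W = np.asarray(weight_vectors)
    return (np.abs(W) > tol).mean(axis=0)

def select_by_threshold(profile, th=0.5):
    """Step 8: keep features whose inclusion frequency reaches the threshold
    Th; these are passed on to the other classifiers (k-NN, Fisher, ...)."""
    return np.where(profile >= th)[0]
```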

Page 9:

Final model selection

9) The performance of all competing models derived in a single fold is estimated by BER on the independent test set. Thus we have 10 estimates.

10) The final model is selected out of the 10 estimated models.

11) The identities of the features occurring in all profiles can also be examined separately.

Page 10:

Experimental setup: algorithmic parameters for AL vs. PK datasets

• T1+T2 - size of the training set;
• M1+M2 - size of the monitoring set;
• V1+V2 - size of the validation set;
• Dim - dimensionality of the data;
• Models - number of models tested;
• Th - threshold of the frequency of inclusion of a feature in the feature profile.

Page 11:

ADA results

• Identity: the identities of the features occurring in all profiles

100% - 2, 8, 9, 18, 20, 24, 30

Last 1, test err 0.181812

Last 2, test err 0.181899

Last 3, test err 0.183274

(Th 55%)

Page 12:

GINA results

• Identity: the identities of the features occurring in all profiles

More than 85% - 367, 815, 510, 648, 424

Last 3: knn1: 0.060, knn3: 0.058, ens: 0.153

Last 1, test err 0.0583085

Last 2, test err 0.0539862

Last 3, test err 0.0533136

(Th 50%)

Page 13:

HIVA results

• Identity: the identities of the features occurring in all profiles - 90%

Last 1, test err 0.299795

Last 2, test err 0.313

Last 3, test err 0.305

Th 20%

Best entry (former)

0.2939

Page 14:

NOVA results

• Identity: the identities of the features occurring in all profiles - 100%

Last 1, test err 0.075

Last 2, test err 0.074

Last 3, test err 0.081

Th 80%

Best entry (former)

0.0725

Page 15:

SYLVA results

• Identity: the identities of the features occurring in all profiles

• 100% - 202, 55

Last 1, test err 0.0195

Last 2, test err 0.0197

Last 3, test err 0.01897

Th 20%

Page 16:

Determination of C values

Given N1 and N2 measurements x of an individual feature k in the two classes, a candidate C value is computed for each feature:

• Sort the C values corresponding to d features in ascending order and solve a model for each.

• The idea comes from the analysis of the dual problem.

• With many features, many models would have to be solved; this is computationally infeasible, so the C values have to be condensed.

Page 17:

Different ways of condensing C

• The challenge submissions differed in how the C values were chosen.

• Initially, a histogram was used. Based on the final ranking, this method worked better for HIVA and NOVA.

• In the last submissions, the rate of change of a slightly modified primal objective function was used. This worked better for ADA, GINA and SYLVA.

• We are still looking for a less heuristic, more precise method…

Page 18:

Conclusions

• The main advantages of our method are simplicity and the interpretability of the results. The disadvantage is the high computational burden.

• Ensembles tend to perform better than individual rules, except for GINA.

• The same feature identities were consistently discovered in all splits and folds.

• The derived feature identities have to be compared with the ground truth in the Prior Knowledge track.

• Some arbitrariness, unavoidable in this experiment, will be dealt with in future work: the threshold in the feature profile, the number of samples for training and monitoring, the number of splits, and the number of models.

Page 19:

Many thanks

To Muoi Tran for discussions and support,

For your attention!