Feature/Model Selection by Linear Programming SVM, Combined with State-of-Art Classifiers: What Can We Learn About the Data
Erinija Pranckeviciene, Ray Somorjai
Institute for Biodiagnostics, NRC Canada




Page 1:

Feature/Model Selection by Linear Programming SVM, Combined with State-of-Art

Classifiers: What Can We Learn About the Data

Erinija Pranckeviciene, Ray Somorjai,

Institute for Biodiagnostics, NRC Canada,

Page 2:

Outline of the presentation

• Description of the algorithm

• Results on Agnostic Learning vs. Prior Knowledge (AL vs. PK) challenge datasets

• Conclusions

Page 3:

Motivation to enter the Challenge

• For small-sample-size, high-dimensional datasets, the feature selection procedure adapts to the peculiarities of the training dataset (sample bias).

• An ideal model selection procedure would produce stable estimates of the classification error rate and the identities of discovered features would not vary much across the different random splits.

• Our experiments with Linear Programming SVM (LP-SVM) on biomedical datasets produced results more robust to the sample bias and demonstrated the property stated above.

We decided to test LP-SVM’s robustness property in a controlled experiment: the independent platform of the AL vs. PK challenge.

Page 4:

Classification with LP-SVM

• The formulation of LP-SVM (known as Liknon, Bhattacharyya et al.) is very similar to that of the conventional linear SVM, except for the objective function, which is linear due to the L1 norm of the regularization term.

• The solution of the LP-SVM is a linear discriminant, in which the weight magnitudes identify those original features that are important for class discrimination.

• Different values of the regularization parameter C in the optimization problem produce different discriminants.
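The primal of the L1-regularized linear SVM can be sketched as follows (a standard Liknon-style form; the notation is assumed, not reproduced from the slides):

```latex
\min_{\mathbf{w},\, b,\, \boldsymbol{\xi}} \;
\|\mathbf{w}\|_{1} + C \sum_{i=1}^{N} \xi_{i}
\quad \text{subject to} \quad
y_{i}\!\left(\mathbf{w}^{\top}\mathbf{x}_{i} + b\right) \ge 1 - \xi_{i},
\qquad \xi_{i} \ge 0 .
```

Replacing the usual quadratic term by the L1 norm turns the quadratic program into a linear program and drives many weights exactly to zero, which is what makes the weight magnitudes usable as a feature selector.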

Page 5:

Outline of the algorithm

1) The available training data are processed in 10-fold stratified cross-validation (preserving the existing class proportions): 9/10 of the data are used for training, 1/10 for independent testing.

2) The training portion is split randomly into balanced

training and unbalanced monitoring sets.

3) We perform 31 random splits.
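Steps 1)–3) can be sketched as follows (a minimal illustration, not the talk's exact recipe; the function names and the half-of-minority-class size of the balanced set are assumptions):

```python
import numpy as np

def stratified_folds(y, n_folds=10, seed=0):
    """Step 1: split sample indices into stratified folds that preserve the
    class proportions; each fold serves once as the independent test set."""
    rng = np.random.default_rng(seed)
    folds = [[] for _ in range(n_folds)]
    for label in np.unique(y):
        for i, j in enumerate(rng.permutation(np.where(y == label)[0])):
            folds[i % n_folds].append(int(j))
    return [np.array(sorted(f)) for f in folds]

def balanced_monitor_split(train_idx, y, seed=0):
    """Step 2: split the training portion into a class-balanced training set
    and an unbalanced monitoring set holding the leftover samples.
    (Taking half the minority-class count per class is an assumption.)"""
    rng = np.random.default_rng(seed)
    labels = np.unique(y[train_idx])
    n_bal = min((y[train_idx] == c).sum() for c in labels) // 2
    train, monitor = [], []
    for c in labels:
        idx = rng.permutation(train_idx[y[train_idx] == c])
        train.extend(idx[:n_bal])
        monitor.extend(idx[n_bal:])
    return np.array(train), np.array(monitor)

# Step 3: repeat balanced_monitor_split 31 times with different seeds.
```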

Page 6:

Evolution of the models

4) The training set is used to find several LP-SVM discriminants, determined by a sequence of values of the regularization parameter C, in every split. Increasing C increases the number of selected features.

5) A balanced error rate (BER) for every discriminant is estimated on the monitoring set.

6) The discriminant / model with the smallest monitoring BER is retained.
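Steps 5)–6) amount to picking, per split, the discriminant with the smallest monitoring BER. A sketch (the linear-rule interface and the {-1, +1} label encoding are assumptions):

```python
import numpy as np

def balanced_error_rate(y_true, y_pred):
    """BER: the average of the per-class error rates (the challenge metric)."""
    classes = np.unique(y_true)
    return float(np.mean([np.mean(y_pred[y_true == c] != c) for c in classes]))

def best_model(discriminants, X_mon, y_mon):
    """Step 6: keep the discriminant with the smallest monitoring BER.
    Each discriminant is assumed to be a (w, b) pair of a linear rule
    sign(X @ w + b), with labels in {-1, +1}."""
    bers = [balanced_error_rate(y_mon, np.sign(X_mon @ w + b))
            for w, b in discriminants]
    return int(np.argmin(bers)), min(bers)
```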

Page 7:

Example of the evolution of the models on synthetic data

Page 8:

Feature profiles and other classifiers

7) In a single fold, a feature profile is derived by counting the frequency of inclusion of the features in the best-BER discriminants (we have 31 best-BER discriminants, one per split).

8) As a result, we have an ensemble of linear discriminants operating on the selected features; the feature profile is then tested with other classifiers.

(Several thresholds of the frequency of inclusion were examined for the different datasets, to test state-of-the-art classification rules such as k-NN, Fisher's discriminant, etc.)
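Steps 7)–8) as a sketch (the nonzero-weight tolerance and the function names are assumptions):

```python
import numpy as np

def feature_profile(weight_vectors, tol=1e-12):
    """Step 7: frequency of inclusion of each original feature across the
    best-BER discriminants of the 31 splits (nonzero weight = included)."""
    W = np.asarray(weight_vectors)
    return (np.abs(W) > tol).mean(axis=0)

def select_by_threshold(profile, th=0.5):
    """Step 8: keep features whose inclusion frequency reaches the threshold
    Th; these are passed on to the other classifiers (k-NN, Fisher, ...)."""
    return np.where(profile >= th)[0]
```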

Page 9:

Final model selection

9) The performance of all competing models derived in a single fold is estimated by BER on the independent test set. Thus we have 10 estimates.

10) The final model is selected out of the 10 estimated models.

11) The identities of the features occurring in all profiles can also be examined separately.

Page 10:

Experimental setup: algorithmic parameters for AL vs. PK datasets

• T1+T2 - size of the training set;
• M1+M2 - size of the monitoring set;
• V1+V2 - size of the validation set;
• Dim - dimensionality of the data;
• Models - number of models tested;
• Th - threshold of the frequency of inclusion of a feature in the feature profile.

Page 11:

ADA results

• Identity: the identities of the features occurring in all profiles

100% - 2, 8, 9, 18, 20, 24, 30

Last 1, test err 0.181812

Last 2, test err 0.181899

Last 3, test err 0.183274

(Th 55%)

Page 12:

GINA results

• Identity: the identities of the features occurring in all profiles

More than 85% - 367, 815, 510, 648, 424

Last 3: knn1: 0.060, knn3: 0.058, ens: 0.153

Last 1, test err 0.0583085

Last 2, test err 0.0539862

Last 3, test err 0.0533136

(Th 50%)

Page 13:

HIVA results

• Identity: the identities of the features occurring in all profiles - 90%

Last 1, test err 0.299795

Last 2, test err 0.313

Last 3, test err 0.305

Th 20%

Best entry (former)

0.2939

Page 14:

NOVA results

• Identity: the identities of the features occurring in all profiles - 100%

Last 1, test err 0.075

Last 2, test err 0.074

Last 3, test err 0.081

Th 80%

Best entry (former)

0.0725

Page 15:

SYLVA results

• Identity: the identities of the features occurring in all profiles

• 100% - 202, 55

Last 1, test err 0.0195

Last 2, test err 0.0197

Last 3, test err 0.01897

Th 20%

Page 16:

Determination of C values

Given N1 and N2 measurements x of an individual feature k in the two classes, a candidate C value is computed for each feature:

• Sort the C values corresponding to d features in ascending order and solve a model for each.

• The idea comes from the analysis of the dual problem.

• With many features, many models would have to be solved; this is computationally infeasible, so the C values have to be condensed.

Page 17:

Different ways of condensing C

• The challenge submissions differed in how the C values were chosen.

• Initially, a histogram was used. Based on the final ranking, this method worked better for HIVA and NOVA.

• In the last submissions, the rate of change of a slightly modified primal objective function was used. This worked better for ADA, GINA and SYLVA.

• We are still looking for a less heuristic, more precise method…

Page 18:

Conclusions

• The main advantages of our method are simplicity and the interpretability of the results. The disadvantage is the high computational burden.

• Ensembles tend to perform better than individual rules, except for GINA.

• The same feature identities were consistently discovered in all splits and folds.

• The derived feature identities have to be compared with the ground truth in the Prior Knowledge track.

• Some arbitrariness, unavoidable in this experiment, will be dealt with in future work: the threshold in the feature profile, the number of samples for training and monitoring, the number of splits, and the number of models.

Page 19:

Many thanks

To Muoi Tran for discussions and support,

For your attention!