Competent Undemocratic Committees

Włodzisław Duch, Łukasz Itert and Karol Grudziński

Department of Informatics, Nicholas Copernicus University, Torun, Poland.

http://www.phys.uni.torun.pl/kmk

Motivation

Combining information from different models is known as ensemble learning, mixture of experts, voting classification algorithms, or committees of models.

An important and popular subject in machine learning, with conferences and special issues of journals.

Useful for solving real problems, such as predicting the glucose levels of diabetic patients. Many classifiers are available for this problem; are 10 heads wiser than one?

Committees:

1) improve the accuracy of a single model;
2) decrease the variance, stabilizing the results.

Variability

Committees need different models.

Variability of committee models comes from:

1) Different samples taken from the same data: crossvalidation training, boosting, bagging, arcing ...

Bagging: train on bootstrap samples, i.e. randomly draw (with replacement) a fixed number of training data vectors from the pool containing all training vectors.

AdaBoost (Adaptive Boosting): assign weights to training instances, higher for incorrectly classified ones.

Arcing: simplified weighting of the training vectors.

2) Bias of models, due to the change of their complexity: the number of neurons, training parameters, pruning ...
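The bagging step can be sketched as follows. This is only an illustration, assuming the training vectors sit in a Python list; the names `bootstrap_sample` and `make_bagging_sets` are hypothetical and do not come from the original work.

```python
import random

def bootstrap_sample(data, n_draws, rng):
    """Draw n_draws items from data uniformly at random, with replacement."""
    return [rng.choice(data) for _ in range(n_draws)]

def make_bagging_sets(data, n_models, rng):
    """Build one bootstrap sample per committee member.

    Each sample has the same size as the pool (a common choice, assumed here).
    """
    return [bootstrap_sample(data, len(data), rng) for _ in range(n_models)]

rng = random.Random(0)            # fixed seed so the run is repeatable
pool = list(range(10))            # stand-in for indices of training vectors
bags = make_bagging_sets(pool, n_models=3, rng=rng)
for bag in bags:
    print(sorted(bag))            # duplicates and missing indices are expected
```

Because each bag repeats some vectors and omits others, models trained on different bags see different data, which is the source of variability the committee relies on.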

Voting

Let $P(C_i|X;M_l)$ be the posterior probability estimate given by model $M_l$, $l=1..m$, for class $C_i$, $i=1..K$.

How to determine the committee decision?

1. Majority voting – go with the crowd.

2. Average results of all models.

3. Select a model that gives the largest probability (highest confidence).

4. Set a threshold to select models with highest confidence and use majority voting for these models.

5. Use a linear combination of results.
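The five decision rules listed above can be sketched in a few lines. This is a minimal illustration, assuming posteriors are stored as `probs[l][i]` for model l and class i; all function names, the example numbers, and the fallback in rule 4 are assumptions of the sketch, not part of the original work.

```python
from collections import Counter

def argmax(values):
    """Index of the largest entry (the first one on ties)."""
    return max(range(len(values)), key=lambda i: values[i])

def majority_vote(probs):
    """Rule 1: each model votes for its top class; the most frequent class wins."""
    votes = Counter(argmax(p) for p in probs)
    return votes.most_common(1)[0][0]

def average_vote(probs):
    """Rule 2: average the posteriors over models, then take the top class."""
    n_classes = len(probs[0])
    mean = [sum(p[i] for p in probs) / len(probs) for i in range(n_classes)]
    return argmax(mean)

def most_confident_vote(probs):
    """Rule 3: follow the single model with the largest posterior."""
    return argmax(max(probs, key=max))

def thresholded_majority_vote(probs, threshold):
    """Rule 4: majority vote among models whose top posterior >= threshold.

    Falls back to all models if none passes (an assumption of this sketch).
    """
    confident = [p for p in probs if max(p) >= threshold]
    return majority_vote(confident or probs)

def linear_combination_vote(probs, weights):
    """Rule 5: score_i = sum over l of weights[l][i] * probs[l][i]."""
    n_classes = len(probs[0])
    scores = [sum(w[i] * p[i] for w, p in zip(weights, probs))
              for i in range(n_classes)]
    return argmax(scores)

# Three models, three classes; two hesitant models and one confident one.
probs = [
    [0.40, 0.35, 0.25],
    [0.45, 0.30, 0.25],
    [0.05, 0.90, 0.05],
]
weights = [[0.2] * 3, [0.2] * 3, [0.6] * 3]

print(majority_vote(probs))                   # two weak votes for class 0
print(average_vote(probs))                    # the confident model dominates
print(most_confident_vote(probs))
print(thresholded_majority_vote(probs, 0.8))
print(linear_combination_vote(probs, weights))
```

In this toy case majority voting follows the two hesitant models, while the other four rules follow the single confident one, which shows how much the committee decision can depend on the rule chosen.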

More on voting

Each model does not need to be accurate for all data, but should account well for a different subset of data.

• Krogh and Vedelsby: generalization error is small if highly accurate classifiers disagreeing with each other are used.

• Xin Yao: diversify models, create negative correlation between individual models and average results (no GA).

• Jacobs: mixture of experts, neural architecture with a gating network to select the most competent model.

• Ortega et al.: a "referee meta-model" deciding which model should contribute to the final decision.
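The Krogh and Vedelsby point is usually stated through their ambiguity decomposition: for a weighted-average ensemble under squared error, the ensemble error equals the weighted average of the member errors minus the weighted spread of the members around the ensemble output. The sketch below checks this identity numerically for one target value. Note that it uses the regression (squared error) setting, and that the function name and numbers are illustrative assumptions.

```python
def ambiguity_decomposition(preds, weights, target):
    """Return (ensemble error, average member error, ambiguity).

    preds[l] is the output of member l, weights sum to 1,
    and all errors are squared errors against the target.
    """
    ensemble = sum(w * f for w, f in zip(weights, preds))
    ensemble_error = (ensemble - target) ** 2
    average_error = sum(w * (f - target) ** 2 for w, f in zip(weights, preds))
    ambiguity = sum(w * (f - ensemble) ** 2 for w, f in zip(weights, preds))
    return ensemble_error, average_error, ambiguity

preds = [0.9, 1.2, 0.7]             # three members that disagree
weights = [1 / 3, 1 / 3, 1 / 3]
ens_err, avg_err, amb = ambiguity_decomposition(preds, weights, target=1.0)
print(ens_err, avg_err - amb)        # the two numbers agree up to rounding
```

For fixed average member error, more disagreement (larger ambiguity) gives a smaller ensemble error, which is why accurate but diverse members are preferred.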

Competent Models: idea

Democratic voting: all models always try to contribute.

Undemocratic voting: only experts on local issues should vote.

For each model identify feature space regions where it is incompetent.

Use penalty factor to decrease the influence of incompetent model during voting.

Biological inspiration:

• Only a small subset of cortical modules are used by the brain for a given task.

• Incompetent modules are inhibited by rough evaluation of inputs at the thalamic level.

• Conclusion: weights should be input dependent!

Competent Models: idea

Linear meta-model gives m additional parameters for each class:

$$p(C_i|X;M) = \sum_{l=1}^{m} W_{il}\, P(C_i|X;M_l)$$

Models that have small weights may still be useful in some areas of the feature space. Use input-dependent weights:

$$p(C_i|X;M) = \sum_{l=1}^{m} W_{il}(X)\, P(C_i|X;M_l)$$

to inhibit voting of the model $M_l$ for class $C_i$ around specific $X$.

Similarity Based Models use reference vectors; determine the areas of the input space where a given model is competent (makes few errors) and where it fails.
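A small sketch of why input-dependent weights help, assuming a one-dimensional input and two models that are each reliable on one half of the axis. The gating functions, weights and posteriors are invented for illustration; the hard 0/1 gates stand in for whatever smooth competence estimate is actually used.

```python
def combine(probs, weights):
    """p(C_i|X;M) = sum over l of weights[l][i] * probs[l][i]."""
    n_classes = len(probs[0])
    return [sum(w[i] * p[i] for w, p in zip(weights, probs))
            for i in range(n_classes)]

def gated_weights(x, base_weights, gates):
    """W_il(X) = W_il * g_l(X), with g_l(X) in [0, 1] (hypothetical gates)."""
    return [[w * gate(x) for w in row] for row, gate in zip(base_weights, gates)]

# Hypothetical setup: model 0 is competent for x < 0, model 1 for x >= 0.
gates = [lambda x: 1.0 if x < 0 else 0.0,
         lambda x: 1.0 if x >= 0 else 0.0]
base_weights = [[0.5, 0.5], [0.5, 0.5]]

x = 1.5                              # lies in the region where model 0 fails
probs = [[0.9, 0.1],                 # model 0: confident but unreliable here
         [0.3, 0.7]]                 # model 1: competent here

constant = combine(probs, base_weights)
gated = combine(probs, gated_weights(x, base_weights, gates))
print(constant)                      # class 0 wins with constant weights
print(gated)                         # class 1 wins once model 0 is inhibited
```

With constant weights the overconfident model decides the outcome; once its vote is inhibited around this X, the competent model decides.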

Committees of Competent Models

1. Optimize parameters for all models $M_l$, $l = 1 \ldots m$, on the training set using a cross-validation procedure.

2. For each model $l = 1 \ldots m$:
a) for all training vectors $R_i$ generate predicted classes $C_l(R_i)$;
b) if $C_l(R_i) \neq C(R_i)$, i.e. model $M_l$ makes an error for vector $R_i$, determine the area of incompetence of the model, finding the distance $d_{i,j}$ to the nearest vector that $M_l$ has correctly classified;
c) set parameters of the incompetence factor $F(\|X - R_i\|; M_l)$ in such a way that its value decreases significantly for $\|X - R_i\| \le d_{i,j}/2$.

3. The incompetence function for the model, $F(X; M_l)$, is a product of factors $F(\|X - R_i\|; M_l)$ for all training vectors that have been incorrectly handled.
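Steps 2a to 2c can be sketched for a single model as below. This assumes Euclidean distance and that the model's predictions on the training set are already available; the function name `incompetence_radii` and the toy data are hypothetical.

```python
import math

def incompetence_radii(vectors, true_labels, predicted_labels):
    """For every misclassified training vector R_i return (R_i, d_i / 2).

    d_i is the Euclidean distance from R_i to the nearest training vector
    that the model classified correctly. If the model got nothing right,
    the radius is infinite (an assumption of this sketch).
    """
    correct = [v for v, t, p in zip(vectors, true_labels, predicted_labels)
               if t == p]
    radii = []
    for v, t, p in zip(vectors, true_labels, predicted_labels):
        if t != p:
            d = min((math.dist(v, c) for c in correct), default=math.inf)
            radii.append((v, d / 2))
    return radii

# Toy training set: four 2-D vectors, one of them misclassified by the model.
vectors = [(0.0, 0.0), (1.0, 0.0), (4.0, 0.0), (5.0, 0.0)]
true_labels = [0, 0, 1, 1]
predicted_labels = [0, 1, 1, 1]

radii = incompetence_radii(vectors, true_labels, predicted_labels)
print(radii)
```

Each returned pair gives the centre of an area of incompetence and the distance within which the factor for that vector should pull the model's vote down.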

CCM

$$F(X; M_l) \approx \begin{cases} 1 & \text{in all areas of correct classification} \\ 0 & \text{in all areas of incorrect classification} \end{cases}$$

Examples of $F(\|X - R_i\|; M_l)$:

1. Gaussian function $F(\|X - R_i\|; M_l) = 1 - G(\|X - R_i\|^a; \sigma_i)$, where the coefficient $a$ is used to flatten the function.

2. $F(\|X - R_i\|; M_l) = 1/(1 + \|X - R_i\|^{-a})$, similar to the Gaussian.

3. Sum of two logistic functions, $\sigma(\|X - R_i\| - d_{i,j}/2) + \sigma(-\|X - R_i\| - d_{i,j}/2)$.
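The first two factor shapes can be sketched as plain functions of the distance r = ||X - R_i||. The exact form of G is not given here, so the sketch assumes an unnormalized Gaussian G(u; sigma) = exp(-u^2 / (2 sigma^2)); the function names and default values are likewise assumptions.

```python
import math

def gaussian_factor(r, sigma, a=1.0):
    """1 - G(r**a; sigma), assuming G(u; sigma) = exp(-u**2 / (2 * sigma**2)).

    Equals 0 at the misclassified vector (r = 0) and tends to 1 far from it.
    """
    u = r ** a
    return 1.0 - math.exp(-(u * u) / (2.0 * sigma * sigma))

def rational_factor(r, a=2.0):
    """1 / (1 + r**(-a)); defined as 0 at r = 0, where r**(-a) diverges."""
    if r == 0.0:
        return 0.0
    return 1.0 / (1.0 + r ** (-a))

for r in (0.0, 0.25, 0.5, 1.0, 2.0):
    print(r, round(gaussian_factor(r, sigma=0.5), 3), round(rational_factor(r), 3))
```

Both factors suppress a model near one of its errors and leave it untouched far away. One possible choice, again an assumption, is to tie sigma to the half-distance d_{i,j}/2 found for that vector.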

Vectors that cannot be correctly classified show up as errors that all models make, but some vectors that are erroneously classified by one model may be correctly handled by another.

CCM voting

Use confidence factors to modify the linear weights:

$$p(C_i|X;M) = \sum_{l=1}^{m} W_{il}(X)\, P(C_i|X;M_l) = \sum_{l=1}^{m} W_{il}\, F(X;M_l)\, P(C_i|X;M_l)$$

Confidence factors are products over local regions:

$$F(X;M_l) = \prod_{k} F(\|X - R_k\|; M_l)$$
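Putting the pieces together, CCM voting can be sketched as follows. It assumes Gaussian-type local factors of the form 1 - exp(-r^2 / (2 sigma^2)), Euclidean distance, and a list of (R_k, sigma_k) pairs per model describing where that model erred on the training set. All names and numbers are illustrative.

```python
import math

def competence_factor(x, error_regions):
    """F(X; M_l): product of local factors over the vectors R_k that the
    model misclassified. Each factor is close to 0 near R_k and close to 1
    far from it (Gaussian-type shape, assumed for this sketch)."""
    f = 1.0
    for r_k, sigma_k in error_regions:
        r = math.dist(x, r_k)
        f *= 1.0 - math.exp(-(r * r) / (2.0 * sigma_k * sigma_k))
    return f

def ccm_scores(x, probs, weights, error_regions_per_model):
    """p(C_i|X;M) = sum over l of W_il * F(X;M_l) * P(C_i|X;M_l)."""
    n_classes = len(probs[0])
    scores = [0.0] * n_classes
    for l, p in enumerate(probs):
        f = competence_factor(x, error_regions_per_model[l])
        for i in range(n_classes):
            scores[i] += weights[l][i] * f * p[i]
    return scores

# Model 0 erred at (1, 0) during training; model 1 made no errors.
error_regions_per_model = [[((1.0, 0.0), 0.5)], []]
weights = [[0.5, 0.5], [0.5, 0.5]]
x = (1.0, 0.0)                       # query right on top of model 0's error
probs = [[0.9, 0.1], [0.4, 0.6]]     # posteriors of both models at x

scores = ccm_scores(x, probs, weights, error_regions_per_model)
print(scores)                        # model 0 is silenced; class 1 wins
```

The scores are left unnormalized, as in the formula above; dividing by their sum would turn them back into probabilities if needed.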

Numerical experiment: data

Dataset: Telugu vowel data, 871 vectors, 3 features (dominant formants), 6 classes (vowels).

[1] Pal, S.K. and Mitra, S. (1999) Neuro-Fuzzy Pattern Recognition. J. Wiley, New York.

Models included in the committee:

1. k=10, Euclidean (M1)

2. k=13, Manhattan (M2)

3. k=5, Euclidean (M3)

4. k=5, Manhattan (M4)

Accuracy of models

Accuracy of all models for each class, in %:

Class     M1     M2     M3     M4
C1        50.0   45.8   65.3   62.5
C2        88.8   91.0   87.6   89.9
C3        84.3   84.3   84.9   84.7
C4        85.4   84.8   90.1   88.1
C5        91.3   88.4   90.3   90.1
C6        90.6   92.8   90.1   90.4
Average   85.1   84.6   86.1   86.0

Comparison of results

System             Accuracy        Remarks
CUC committee      88.2% ± 0.6%    2xCV (our calculation)
kNN                86.1% ± 0.6%    k=3, Euclidean, 2xCV (our calculation)
MLP                84.6%           2xCV, 10 neurons [1]
Fuzzy MLP          84.2%           2xCV, 10 neurons [1]
Bayes Classifier   79.2%           2xCV, [1]
Fuzzy Kohonen      73.5%           2xCV, [1]

Dataset: Telugu vowel (871 vectors, 3 features)

Comparison of committee results

Results for Telugu vowel data, accuracy in %:

Class     Majority   Confidence   Combination   + Competence
C1        54.2       58.3         62.5          65.3
C2        88.8       88.8         88.8          89.9
C3        84.3       84.9         84.3          84.9
C4        86.8       88.1         88.1          88.1
C5        92.3       92.8         92.3          93.8
C6        90.6       92.2         91.7          93.3
Average   85.9       87.0         87.0          88.2

Conclusions

• Assigning competence factors in various voting procedures is an attractive idea.

• Learning becomes modular: each model specializes in different subproblems.

Some ideas:

• Combine DT, kNN, NN models.

• Use CCM with adaptive boosting.

• Use ROC curves to increase the area under the curve (AUC) by providing a convex combination of individual ROC curves.

• Diversify models by adding explicit negative correlation.

• Use a constructive approach: add to the committee new models that correctly classify the remaining vectors.
