D-Confidence: an active learning strategy which efficiently identifies small classes
Learning from Incomplete Specifications
Nuno Filipe Escudeiro nfe@isep.ipp.pt Alípio Mário Jorge amjorge@fc.up.pt
NAACL HLT, June 6, 2010
Outline
1. Motivations
2. D-Confidence
3. Evaluation
4. Conclusions
5. Future Work
• Fraud detection
• Medical data, disease detection
• Web page classification
• Mail categorization
• …
Motivations | D-Confidence | Evaluation | Conclusions | Future Work
Automatic resource organization
• Large corpora
• Unlabeled text documents
• Labeling is expensive
Need to identify exemplary cases for all labels to learn… fast (with few labels)
Collecting and annotating exemplary cases
– Critical
– Costly
Labeling effort related to:
– Number of labels to learn
– Class distribution in the working set
– Sample representativeness
Learning settings
– Supervised: high labeling effort
– Unsupervised: low expressiveness
– Semi-supervised: unable to deal with incomplete specifications
– Active learning: judicious selection of cases to label
• Minimize error
• Availability of pre-labeled examples on all classes
Active Learning
Accuracy at low cost
from a complete specification
D-Confidence
Accuracy and Representativeness at low cost
from incomplete specification
D-Confidence
– Active learning strategy selecting queries with:
• Low confidence
– exploitation / accuracy
• High distance to known classes
– exploration / representativeness
Intuition
Combines low-confidence with high-distance to produce a bias towards cases from unknown classes located in unexplored regions in case space
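\[
\mathrm{dConf}(u) = \max_{k} \left( \frac{\mathrm{conf}(c_k \mid u)}{\mathrm{median}\big(\mathrm{dist}(u, x_{\mathrm{lab},k})\big)} \right)
\]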
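A minimal sketch of how the score on this slide might be computed, assuming it is read as the maximum, over known classes k, of the confidence conf(c_k|u) divided by the median distance from u to the labeled cases of class k, and assuming the query is the pool case with the lowest score (low confidence, high distance). The names `d_confidence`, `select_query`, `conf_fn`, and the Euclidean distance are illustrative assumptions, not taken from the slides.

```python
import math
from statistics import median


def d_confidence(conf_by_class, u, labeled_by_class, dist=math.dist):
    """Max over known classes of conf(c_k | u) / median distance to class k.

    conf_by_class: {class: confidence that u belongs to that class}
    labeled_by_class: {class: list of already labeled cases}
    dist: distance function (Euclidean here; an assumption)
    """
    scores = []
    for k, cases in labeled_by_class.items():
        med = median(dist(u, x) for x in cases)
        # Guard against a zero median when u coincides with labeled cases.
        scores.append(conf_by_class[k] / max(med, 1e-12))
    return max(scores)


def select_query(pool, conf_fn, labeled_by_class):
    """Pick the unlabeled case with the lowest d-confidence (assumed rule)."""
    return min(pool, key=lambda u: d_confidence(conf_fn(u), u, labeled_by_class))


# Toy example: two known classes, a flat (hypothetical) confidence function.
labeled = {"a": [(0.0, 0.0), (1.0, 0.0)], "b": [(10.0, 0.0)]}
pool = [(0.5, 0.0), (5.0, 5.0)]
flat_conf = lambda u: {"a": 0.5, "b": 0.5}
query = select_query(pool, flat_conf, labeled)
```

With equal confidences, the case far from every labeled example gets the lower score and is queried first, which is the exploration bias the slide describes.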
Effect on (SVM) confidence
[Figure: confidence (0 to 1) plotted against the signed distance to the dividing hyperplane (-5 to 6)]
D-Confidence
– Repository (UCI) datasets
– Text corpora
Class distribution
Dataset      #     1    2    3    4    5    6    7    8    9   10   11
Iris        150    50   50   50
Cleveland   298   161   53   36   35   13
Vowels      330    30   30   30   30   30   30   30   30   30   30   30
SatImg      500   125   48   96   46   67  118
Poker       500   270  170   34   12    4    3    3    2    1    1
1st hit, by class
Dataset     ActiveLearn    1    2    3    4    5    6    7    8    9   10   11
iris        Conf           1    7    3
iris        dConf          1    3    1
cleveland   Conf           3    7    8   19   40
cleveland   dConf          3   15    8    5    8
vowels      Conf           3   10   14   31   12   27   29   15   31   18   24
vowels      dConf          2   12   19   16   24   26   23    2   26    3   23
satimg      Conf          12   28   34   23   32    5
satimg      dConf          9    1    4   10    3   10
poker       Conf           1    3   20   43  113  112  147  223  279  277
poker       dConf          3    2    5    9   45   97   98   68  100   65
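The "1st hit" figures can be derived from the sequence of labels returned by the oracle. The sketch below assumes that "1st hit" means the 1-based index of the query that first reveals a case of each class; the function name `first_hits` and the toy label sequence are illustrative, not from the slides.

```python
def first_hits(queried_labels):
    """Map each class to the 1-based index of the query that first revealed it."""
    hits = {}
    for i, label in enumerate(queried_labels, start=1):
        # setdefault keeps the earliest index and ignores later repeats.
        hits.setdefault(label, i)
    return hits


# Hypothetical query sequence: class "c" is only found on the fifth query.
example = first_hits(["a", "a", "b", "a", "c"])
```

Under this reading, a lower number for a class means the strategy needed fewer labels before it discovered that class.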
Text corpora
20 Newsgroups
• 500 cases, 20 classes
• most frequent class: 35 cases
• least frequent class: 20 cases
Reuters-21578
• 1000 cases, 52 classes
• most frequent class: 435 cases
• least frequent class: 2 cases
• 42 out of 52 classes with frequency below 10
[Figures: strategies compared are Confidence, FarthestFirst and dConfidence]
– D-Confidence identifies classes faster (lower cost)
– This gain is bigger for minority classes
– D-Confidence performs better on imbalanced data
– Error may increase
• Exploration / exploitation
• Representativeness / accuracy
– Semi-supervised D-Confidence
– Retrieve cases when representativeness assumption fails
– Scalability
Thank you!
D-Confidence
– Simulated datasets
– Repository (UCI) datasets
– Text corpora
Simulated datasets
Levels (refer to training set properties)
Factor        1 (+)                          0 (-)
Colinearity   colinear centroids             non-colinear centroids
Balancing     imbalanced class distribution  balanced class distribution
Cohesion      isomorphic classes             polymorphic classes
Overlapping   overlapping                    separable
Response
ErrorGain = gen.error(dConfidence) – gen.error(Confidence)
Colinear Imbalanced Isomorphic Overlapping
1 (+) 1 (+) 1 (+) 1 (+)
1 (+) 1 (+) 1 (+) 0 (-)
1 (+) 1 (+) 0 (-) 1 (+)
1 (+) 1 (+) 0 (-) 0 (-)
1 (+) 0 (-) 1 (+) 1 (+)
1 (+) 0 (-) 1 (+) 0 (-)
1 (+) 0 (-) 0 (-) 1 (+)
1 (+) 0 (-) 0 (-) 0 (-)
0 (-) 1 (+) 1 (+) 1 (+)
0 (-) 1 (+) 1 (+) 0 (-)
0 (-) 1 (+) 0 (-) 1 (+)
0 (-) 1 (+) 0 (-) 0 (-)
0 (-) 0 (-) 1 (+) 1 (+)
0 (-) 0 (-) 1 (+) 0 (-)
0 (-) 0 (-) 0 (-) 1 (+)
0 (-) 0 (-) 0 (-) 0 (-)
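The sixteen rows above are the full two-level factorial design over the four factors. As a small illustration, they can be enumerated as follows; the variable names are illustrative, and the coding 1 for (+) and 0 for (-) follows the slide.

```python
from itertools import product

factors = ["Colinear", "Imbalanced", "Isomorphic", "Overlapping"]

# product((1, 0), repeat=4) varies the last factor fastest,
# which reproduces the row order shown on the slide.
design = [dict(zip(factors, levels)) for levels in product((1, 0), repeat=len(factors))]

for run in design:
    print("  ".join(f"{run[f]} ({'+' if run[f] else '-'})" for f in factors))
```

Each run corresponds to one simulated training set configuration on which the ErrorGain response would be measured.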
Error
Colinearity   Imbalanced   Isomorphic   Overlapping
4.241         -3.835       -15.459      1.296
Finding cases from all classes
Meta-Learning
Colinearity
– correlation coefficient, r, among cluster centroids
– colinear when |r| ~ 1
Balancing
– variance of nk
– balanced when var(nk) ~ 0
Cohesion
– #classes divided by #clusters
– cohesive when ~ 1
– representativeness fails (or highly overlapping clusters) when > 1
Overlapping
– inter-cluster inertia divided by intra-cluster inertia
– separable when >> 1
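A rough sketch of how these four meta-features might be computed on a clustered sample. Several choices here are assumptions, since the slide does not fix them: 2-D cases, Pearson's r between the x and y coordinates of the centroids, population variance for nk, and the usual sum-of-squares definitions of inter- and intra-cluster inertia. All function names are illustrative.

```python
import math
from statistics import mean, pvariance


def centroid(points):
    return tuple(mean(coord) for coord in zip(*points))


def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def colinearity(centroids):
    """Pearson r between x and y coordinates of 2-D centroids (assumed reading)."""
    xs, ys = zip(*centroids)
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)


def balancing(class_sizes):
    """Variance of the class sizes nk; near 0 means balanced."""
    return pvariance(class_sizes)


def cohesion(n_classes, n_clusters):
    """#classes divided by #clusters; near 1 means cohesive."""
    return n_classes / n_clusters


def overlapping(clusters):
    """Inter-cluster inertia divided by intra-cluster inertia; >> 1 means separable."""
    g = centroid([p for pts in clusters for p in pts])
    inter = sum(len(pts) * sq_dist(centroid(pts), g) for pts in clusters)
    intra = sum(sq_dist(p, centroid(pts)) for pts in clusters for p in pts)
    return inter / intra


# Toy data: three tight clusters whose centroids lie on a straight line.
clusters = [[(0, 0), (0, 2)], [(10, 10), (10, 12)], [(20, 20), (20, 22)]]
r = colinearity([centroid(pts) for pts in clusters])
sep = overlapping(clusters)
iris_balance = balancing([50, 50, 50])  # Iris class sizes from the table
```

On this toy data |r| is about 1 (colinear centroids) and the inertia ratio is far above 1 (separable clusters), while the Iris class sizes give a variance of 0 (balanced).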