RESULTS OF THE WCCI 2006 PERFORMANCE PREDICTION CHALLENGE
Isabelle Guyon, Amir Reza Saffari Azar Alamdari, Gideon Dror
Part I
INTRODUCTION
Model selection
• Selecting models (neural net, decision tree, SVM, …)
• Selecting hyperparameters (number of hidden units, weight decay/ridge, kernel parameters, …)
• Selecting variables or features (space dimensionality reduction).
• Selecting patterns (data cleaning, data reduction, e.g. by clustering).
Performance prediction
How good are you at predicting
how good you are?
• Practically important in pilot studies.
• Good performance predictions render model selection trivial.
Why a challenge?
• Stimulate research and push the state-of-the art.
• Move towards fair comparisons and give a voice to methods that work but may not be backed up by theory (yet).
• Find practical solutions to real problems.
• Have fun…
History
• USPS/NIST.
• Unipen (with Lambert Schomaker): 40 institutions share 5 million handwritten characters.
• KDD cup, TREC, CASP, CAMDA, ICDAR, etc.
• NIPS challenge on unlabeled data.
• Feature selection challenge (with Steve Gunn): success! ~75 entrants, thousands of entries.
• Pascal challenges.
• Performance prediction challenge …
[Timeline graphic spanning 1980–2005]
Challenge
• Date started: Friday September 30, 2005.
• Date ended: Monday March 1, 2006.
• Duration: 21 weeks.
• Estimated number of entrants: 145.
• Number of development entries: 4228.
• Number of ranked participants: 28.
• Number of ranked submissions: 117.
Datasets
Dataset  Domain          Type           Features  Train. ex.  Valid. ex.  Test ex.
ADA      Marketing       Dense                48        4147         415     41471
GINA     Digits          Dense               970        3153         315     31532
HIVA     Drug discovery  Dense              1617        3845         384     38449
NOVA     Text classif.   Sparse binary     16969        1754         175     17537
SYLVA    Ecology         Dense               216       13086        1308    130858
http://www.modelselect.inf.ethz.ch/
BER distribution
[Histograms of the test BER distribution for ADA, GINA, HIVA, NOVA, SYLVA; x-axis: test BER (0 to 0.5), y-axis: number of entries]
Results
Overall winners for ranked entries:
• Ave rank: Roman Lutz with LB tree mix cut adapted
• Ave score: Gavin Cawley with Final #2
• ADA: Marc Boullé with SNB(CMA) + 10k F(2D) tv or SNB(CMA) + 100k F(2D) tv
• GINA: Kari Torkkola & Eugene Tuv with ACE+RLSC
• HIVA: Gavin Cawley with Final #3 (corrected)
• NOVA: Gavin Cawley with Final #1
• SYLVA: Marc Boullé with SNB(CMA) + 10k F(3D) tv
• Best AUC: Radford Neal with Bayesian Neural Networks
Part II
PROTOCOL AND SCORING
Protocol
• Data split: training/validation/test.
• Data proportions: 10/1/100.
• Online feedback on validation data.
• Validation label release one month before the end of the challenge.
• Final ranking on test data using the five last complete submissions for each entrant.
Performance metrics
• Balanced Error Rate (BER): average of error rates of positive class and negative class.
• Guess error: ΔBER = abs(testBER − guessedBER).
• Area Under the ROC Curve (AUC).
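As a concrete illustration, here is a minimal Matlab sketch of how the BER and the guess error could be computed (the variable names y_true, y_pred and guessed_ber are ours, purely for illustration, not part of the challenge kit):

  % y_true, y_pred: vectors of +1/-1 labels (true and predicted).
  pos = (y_true == 1);
  neg = (y_true == -1);
  err_pos = mean(y_pred(pos) ~= 1);        % error rate on the positive class
  err_neg = mean(y_pred(neg) ~= -1);       % error rate on the negative class
  test_ber = 0.5 * (err_pos + err_neg);    % Balanced Error Rate
  delta_ber = abs(test_ber - guessed_ber); % guess error (deltaBER)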
Optimistic guesses
[Scatter plot: guessed BER (y-axis) vs. test BER (x-axis) for ADA, GINA, HIVA, NOVA, SYLVA; points below the diagonal are optimistic guesses]
Scoring method
E = testBER + ΔBER × [1 − exp(−ΔBER/σ)],   ΔBER = abs(testBER − guessedBER)
[Plot: challenge score E as a function of the guessed BER, for a fixed test BER]
[Log-scale plot: ΔBER/σ vs. test BER for ADA, GINA, HIVA, NOVA, SYLVA; when ΔBER/σ is large, E ≈ testBER + ΔBER]
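For reference, a small Matlab sketch of the scoring formula as we read it from the slide, where sigma is taken to be the error bar on the test BER (our reading; test_ber, guessed_ber and sigma are illustrative variable names):

  % Challenge score: test BER plus a penalty that only matters when the
  % guess error deltaBER is large relative to the error bar sigma.
  delta_ber = abs(test_ber - guessed_ber);
  E = test_ber + delta_ber * (1 - exp(-delta_ber / sigma));
  % deltaBER << sigma  =>  E ~ testBER;   deltaBER >> sigma  =>  E ~ testBER + deltaBER.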
Score
[Plot: challenge score on GINA as a function of log(γ) for entries by Roman Lutz, Gavin Cawley, Radford Neal, Corinne Dahinden, Wei Chu and Nicolai Meinshausen; E = testBER + ΔBER × [1 − exp(−ΔBER/σ)] moves between its two limits, testBER and testBER + ΔBER]
Score (continued)
[Plots: challenge score as a function of log(γ) for ADA, GINA, HIVA, NOVA and SYLVA]
Part III
RESULT ANALYSIS
What did we expect?
• Learn about new competitive machine learning techniques.
• Identify competitive methods of performance prediction, model selection, and ensemble learning (theory put into practice).
• Drive research in the direction of refining such methods (ongoing benchmark).
Method comparison
[Log-scale scatter plot: ΔBER vs. test BER, with entries grouped by method family (legend: X, TREE, NN/BNN, NB, LD/SVM/KLS/GP) and by dataset (ADA, GINA, HIVA, NOVA, SYLVA)]
Danger of overfitting
[Plot: BER as a function of time (days) over the course of the challenge for ADA, GINA, HIVA, NOVA, SYLVA; full line: test BER, dashed line: validation BER]
How to estimate the BER?
• Statistical tests (Stats): Compute it on training data; compare with a “null hypothesis” e.g. the results obtained with a random permutation of the labels.
• Cross-validation (CV): Split the training data many times into training and validation sets; average the validation results (a minimal sketch is given below).
• Guaranteed risk minimization (GRM): Use of theoretical performance bounds.
Stats / CV / GRM ???
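To make the CV option concrete, here is a minimal k-fold cross-validation sketch in Matlab (our own illustration; train_model and predict_model stand for any hypothetical learning machine and are not part of the challenge software):

  % k-fold cross-validation estimate of the BER.
  % X: n-by-d data matrix, Y: n-by-1 vector of +1/-1 labels.
  k = 10;
  n = length(Y);
  folds = mod(randperm(n), k) + 1;          % random assignment of examples to k folds
  fold_ber = zeros(k, 1);
  for i = 1:k
      val_idx   = (folds == i);             % held-out fold
      train_idx = ~val_idx;
      model = train_model(X(train_idx, :), Y(train_idx));  % hypothetical trainer
      Yhat  = predict_model(model, X(val_idx, :));         % hypothetical predictor
      Yval  = Y(val_idx);
      fold_ber(i) = 0.5 * (mean(Yhat(Yval == 1) ~= 1) + mean(Yhat(Yval == -1) ~= -1));
  end
  cv_ber = mean(fold_ber);                  % cross-validation estimate of the test BER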
Top ranking methods
• Performance prediction:
  – CV with many splits, 90% train / 10% validation
  – Nested CV loops
• Model selection:
  – Use of a single model family
  – Regularized risk / Bayesian priors
  – Ensemble methods
  – Nested CV loops, made computationally efficient with virtual leave-one-out (VLOO; see the sketch below)
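Virtual leave-one-out exploits the closed-form leave-one-out residuals available for ridge-type models, so leave-one-out estimates cost no retraining. A minimal sketch for ordinary ridge regression with ±1 targets (our own illustration, not the participants' actual code; X, Y and lambda are assumed inputs):

  % VLOO for ridge regression: the leave-one-out residual of example i equals
  % the ordinary residual divided by (1 - H_ii), with H the ridge hat matrix.
  lambda = 0.1;                                   % ridge (weight decay) parameter
  [n, d] = size(X);
  H = X * ((X' * X + lambda * eye(d)) \ X');      % hat matrix of ridge regression
  Yhat = H * Y;                                   % fitted values on the training set
  loo_pred = Y - (Y - Yhat) ./ (1 - diag(H));     % closed-form leave-one-out predictions
  loo_error = mean(sign(loo_pred) ~= Y);          % leave-one-out misclassification rate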
Other methods
• Use of training data only:
  – Training BER.
  – Statistical tests.
• Bayesian evidence.
• Performance bounds.
• Bilevel optimization.
Part IV
CONCLUSIONS AND FURTHER WORK
Open problems
Bridge the gap between theory and practice…
• What are the best estimators of the variance of CV?
• What should k be in k-fold?
• Are other cross-validation methods better than k-fold (e.g. bootstrap, 5x2CV)?
• Are there better “hybrid” methods?
• What search strategies are best?
• More than 2 levels of inference?
Future work
• Game of model selection.
• JMLR special topic on model selection.
• IJCNN 2007 challenge!
Benchmarking model selection?
• Performance prediction: Participants just need to provide a guess of their test performance. If they can solve that problem, they can perform model selection efficiently. Easy and motivating.
• Selection of a model from a finite toolbox: In principle a more controlled benchmark, but less attractive to participants.
CLOP
• CLOP=Challenge Learning Object Package.
• Based on the Spider developed at the Max Planck Institute.
• Two basic abstractions:
  – Data object
  – Model object
http://clopinet.com/isabelle/Projects/modelselect/MFAQ.html
CLOP tutorial
At the Matlab prompt:

  D = data(X, Y);                              % wrap inputs X and labels Y in a data object
  hyper = {'degree=3', 'shrinkage=0.1'};       % hyperparameters of the kernel ridge model
  model = kridge(hyper);                       % create a kernel ridge regression model object
  [resu, model] = train(model, D);             % train the model on D; resu holds the results
  tresu = test(model, testD);                  % apply the trained model to test data testD
  model = chain({standardize, kridge(hyper)}); % chain standardization with kernel ridge
Conclusions
• Twice the volume of participation of the feature selection challenge.
• Top methods as before (in a different order):
  – Ensembles of trees
  – Kernel methods (RLSC/LS-SVM, SVM)
  – Bayesian neural networks
  – Naïve Bayes
• Danger of overfitting.
• Triumph of cross-validation?