University of Ljubljana ..: Faculty of Electrical Engineering
[LDOS] ..: Digital Signal, Image and Video Processing Laboratory
How to improve the statistical power of the 10-fold cross
validation scheme in Recommender Systems
Andrej Košir
Ante Odić
Marko Tkalčič
Statistical power, replicability and reproducibility
What is:
Replicability: to get the same experimental result (on the same data)
Reproducibility: to get similar experimental results leading to the same
conclusion
Mackay, R., & Oldford, R. (2000). Scientific method, statistical method, and the speed of light
(Working paper 2000-02). Department of Statistics and Actuarial Science, University of Waterloo.
In terms of statistical testing
Higher power => better reproducibility
More likely to get to the same conclusions
On statistical hypothesis testing
When do we need statistical tests?
The results should not change if we repeat the experiment
When we need them: at later stages of development, where the results of the compared systems are similar
Elements of statistical testing
Working hypotheses
Null and alternative hypotheses: 𝐻0 and 𝐻1
p-value: 𝑝
Risk level: 𝛼
Decision on 𝐻0
[Figure: RS 1 and RS 2 evaluated on the same test data, giving scores F1 and F2; example values shown: 0.72, 0.89, 0.74]
On errors and statistical power
Errors in test decision:
Type I and type II errors
Effect size
Power:
Each test requires its own power analysis
Higher power is better
The best one can do
Task 1 - How to select the sample size: a priori power
Task 2 - How to estimate the achieved power: post hoc power
History:
1908 by William Sealy Gosset (Student): he did not need it
Mainly ignored until then
Software: G*Power
http://www.psycho.uni-duesseldorf.de/abteilungen/aap/gpower3/
Test decision vs. true state:
                 Decide 𝐻0      Decide 𝐻1
𝐻0 is true       OK             type I error
𝐻1 is true       type II error  OK
Power = 𝑃𝑟[decide 𝐻1 | 𝐻1 is true]
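The two power tasks above can be illustrated with a minimal sketch. It assumes a two-sided one-sample z-test under a normal approximation, which is simpler than the rank tests used later in the talk and is not what G*Power computes for them. The function names, the effect size of 0.5 and the target power of 0.8 are illustrative assumptions.

```python
from statistics import NormalDist

_STD_NORMAL = NormalDist()  # standard normal distribution


def z_test_power(effect_size: float, n: int, alpha: float = 0.05) -> float:
    """Power of a two-sided one-sample z-test (normal approximation).

    effect_size is the standardized mean difference (assumed known).
    """
    z_crit = _STD_NORMAL.inv_cdf(1.0 - alpha / 2.0)
    shift = effect_size * n ** 0.5
    # Probability of landing in either rejection region when H1 is true.
    upper = 1.0 - _STD_NORMAL.cdf(z_crit - shift)
    lower = _STD_NORMAL.cdf(-z_crit - shift)
    return upper + lower


def required_sample_size(effect_size: float, target_power: float = 0.8,
                         alpha: float = 0.05, n_max: int = 100_000) -> int:
    """Task 1 (a priori): smallest n whose power reaches target_power."""
    for n in range(2, n_max + 1):
        if z_test_power(effect_size, n, alpha) >= target_power:
            return n
    raise ValueError("target power not reachable within n_max")


# Task 1: how many observations are needed for the assumed effect size?
n_needed = required_sample_size(effect_size=0.5, target_power=0.8)
# Task 2: which power is achieved with 10 observations (e.g. 10 folds)?
achieved = z_test_power(effect_size=0.5, n=10)
print(f"n needed: {n_needed}, power at n=10: {achieved:.2f}")
```

With an effect size of zero the function returns the risk level 𝛼, which is a quick sanity check of the formula.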
The application we were working on: contextual variables
Which contextual variables are relevant:
What is context
Candidates: time, weather, mood, ...
Can we simply use it all?
• Irrelevant context can worsen the performance of RS
Test if a given context is relevant
How: compare RS with and without it
ODIĆ, Ante, TKALČIČ, Marko, TASIČ, Jurij F., KOŠIR, Andrej. Predicting and detecting the
relevant contextual information in a movie-recommender system. Interact. comput.. [Print ed.], 2013,
vol. 25, no. 1, pp. 74-90, ilustr., doi:10.1093/iwc/iws003. [COBISS.SI-ID 9650260]
ODIĆ, Ante, TKALČIČ, Marko, TASIČ, Jurij F., KOŠIR, Andrej. Impact of the context relevancy on
ratings prediction in a movie-recommender system. Automatika (Zagreb), 2013, vol. 54, no. 2, pp. 252-
262, ilustr., doi:10.7305/automatika.54-2.258. [COBISS.SI-ID 9782356]
The problem we observed: cross validation scheme
There were differences among folds, but not in conclusion
What is wrong?
Paired / unpaired?
What is usually done:
Confusion matrix computation is actually unpaired
ODIĆ, Ante, TKALČIČ, Marko, TASIČ, Jurij F., KOŠIR, Andrej. Predicting and detecting the relevant contextual
information in a movie-recommender system. Interact. comput., vol. 25, no. 1, pp. 74-90, 2013.
Proposed solution
The procedure outline:
1. Select the scalar comparison measure (such as precision or F-measure).
2. Store the evaluation results of each fold and each method separately.
3. According to the specific features of the evaluation results (distributions
etc.), select the most powerful test that meets these specific features.
4. Perform the paired version of the selected test.
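The procedure can be sketched as follows, assuming the F-measure as the scalar measure and the Wilcoxon signed-rank test as the selected paired test. The per-fold scores are hypothetical, and the exact test is written out in plain Python for illustration only; it enumerates all sign patterns, which is feasible for 10 folds.

```python
from itertools import product


def average_ranks(values):
    """Rank values from 1..n, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def wilcoxon_signed_rank_exact(scores_a, scores_b):
    """Two-sided exact Wilcoxon signed-rank p-value for paired samples."""
    # Rounding guards against floating-point noise; zero differences are dropped.
    diffs = [round(x - y, 9) for x, y in zip(scores_a, scores_b)]
    diffs = [d for d in diffs if d != 0]
    if not diffs:
        return 1.0
    ranks = average_ranks([abs(d) for d in diffs])
    half = sum(ranks) / 2.0
    observed = abs(sum(r for r, d in zip(ranks, diffs) if d > 0) - half)
    # Under H0 every sign pattern is equally likely: enumerate all 2^n.
    extreme = 0
    for signs in product((0, 1), repeat=len(ranks)):
        w_plus = sum(r for r, s in zip(ranks, signs) if s)
        if abs(w_plus - half) >= observed:
            extreme += 1
    return extreme / 2 ** len(ranks)


# Step 2: per-fold F-measures stored per method (hypothetical values).
fold_scores = {
    "with_context":    [0.74, 0.69, 0.81, 0.77, 0.72, 0.85, 0.66, 0.79, 0.75, 0.70],
    "without_context": [0.72, 0.68, 0.78, 0.73, 0.67, 0.79, 0.59, 0.71, 0.66, 0.60],
}

# Step 4: paired test on the fold-wise pairs, decision at risk level alpha.
alpha = 0.05
p_value = wilcoxon_signed_rank_exact(fold_scores["with_context"],
                                     fold_scores["without_context"])
print(f"p = {p_value:.4f}, reject H0: {p_value < alpha}")
```

The pairing relies on both methods being evaluated on exactly the same folds, so the i-th entries of the two lists must refer to the same fold.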
Materials and methods (1)
Dataset:
Context Movie Dataset (LDOS-CoMoDa)
1611 ratings from 89 users to 946 items with associated contextual factors.
Contextual variables
• time (morning, afternoon, evening, night),
• daytype (working day, weekend, holiday),
• season (spring, summer, autumn, winter),
• Location (home, public place, friend's house),
• weather (sunny/clear, rainy, stormy, snowy, cloudy),
• social (alone, partner, friends, colleagues, parents, public, family),
• endEmo (sad, happy, scared, surprised, angry, disgusted, neutral),
• dominantEmo (sad, happy, scared, surprised, angry, disgusted, neutral),
• mood (positive, neutral, negative),
• physical (healthy, ill), decision (user's choice, given by other), interaction (first, n-th)
Publicly available: LDOS-CoMoDa contextual dataset, available at www.ldos.si/comoda.html.
Used by 29 researchers at this moment.
Materials and methods (2), results
Experimental design
10-fold cross validation
Two procedures: ProcPaired, ProcIndep
Results – which contextual variable improves MF?
Tests: Wilcoxon signed-rank test (ProcPaired) and
Mann-Whitney U test (ProcIndep)
The achieved (post hoc) statistical power (pw) and the computed p-values (p) for the
paired test and for the independent test:
Id  Var 1        Var 2    pw paired  p paired  pw indep.  p indep.
1   Physical     Weather  0.42       0.001     0.14       0.24
2   Decision     Social   0.99       0.004     0.25       0.19
3   Interaction  Social   0.06       <0.001    0.05       0.43
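The gap between the paired and the independent p-values can be reproduced on hypothetical data. The sketch below assumes per-fold scores that vary strongly between folds while one method is slightly better in every fold. It runs an exact Wilcoxon signed-rank test (paired) and an exact Mann-Whitney U test (independent) on the same numbers. The data are made up for illustration and are not the LDOS-CoMoDa results; both tests assume no ties.

```python
from itertools import combinations, product


def ranks_no_ties(values):
    """Ranks 1..n; assumes all values are distinct."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0] * len(values)
    for rank, idx in enumerate(order, start=1):
        ranks[idx] = rank
    return ranks


def paired_p(a, b):
    """Exact two-sided Wilcoxon signed-rank p-value (no ties, no zero differences)."""
    diffs = [round(x - y, 9) for x, y in zip(a, b)]
    ranks = ranks_no_ties([abs(d) for d in diffs])
    half = sum(ranks) / 2
    observed = abs(sum(r for r, d in zip(ranks, diffs) if d > 0) - half)
    # Enumerate all 2^n sign patterns, equally likely under H0.
    hits = sum(
        abs(sum(r for r, s in zip(ranks, signs) if s) - half) >= observed
        for signs in product((0, 1), repeat=len(ranks))
    )
    return hits / 2 ** len(ranks)


def independent_p(a, b):
    """Exact two-sided Mann-Whitney U p-value via the rank sum (no ties)."""
    ranks = ranks_no_ties(list(a) + list(b))
    n = len(a)
    expected = n * (len(ranks) + 1) / 2  # mean rank sum of group a under H0
    observed = abs(sum(ranks[:n]) - expected)
    hits = total = 0
    # Enumerate every way of assigning n of the pooled ranks to group a.
    for subset in combinations(ranks, n):
        total += 1
        hits += abs(sum(subset) - expected) >= observed
    return hits / total


# Hypothetical per-fold F-measures: large fold-to-fold spread, small consistent gain.
with_ctx = [0.741, 0.652, 0.813, 0.774, 0.705, 0.866, 0.627, 0.798, 0.689, 0.830]
without_ctx = [0.731, 0.632, 0.783, 0.734, 0.655, 0.806, 0.557, 0.718, 0.599, 0.730]

p_paired = paired_p(with_ctx, without_ctx)
p_indep = independent_p(with_ctx, without_ctx)
print(f"paired p = {p_paired:.4f}, independent p = {p_indep:.4f}")
```

The paired test only looks at the fold-wise differences, so the variation between folds does not mask the consistent gain; the independent test has to detect the same gain against that variation.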
Discussion
Power improvements (independent → paired):
The first combination (physical vs. weather): 0.14 → 0.42, still low but useful;
The second combination (decision vs. social): 0.25 → 0.99, the difference in
power is again substantial;
The third combination (interaction vs. social): 0.05 → 0.06, negligible;
The paired procedure does not require substantial additional work
Worth the effort
Further work
We limited ourselves to 10-fold cross validation and simple tests only. Other
validation schemes and tests exist.
We will concentrate on comparing RS with respect to the selected final task
(such as best five), not only on scalar performance measures (such as
precision at five).
More sophisticated statistical approaches are available, such as
multi-level repeated binomial regression
In my opinion, they will not be used frequently
THANK YOU
Invitation: International Conference on Automatic Face and Gesture
Recognition FG2015, http://www.fg2015.org/
Presentation structure
The goal
What does it have to do with replicability and reproducibility?
Selected items from statistics
Our case & problem statement
Proposed solution & comments
Experimental results
Future work
Take away notes