October 8, 2014

Unifying Nearest Neighbors Collaborative Filtering


Page 1: Unifying Nearest Neighbors Collaborative Filtering

October  8,  2014  

Page 2: Unifying Nearest Neighbors Collaborative Filtering

Unifying Nearest Neighbors Collaborative Filtering

Koen Verstrepen, Bart Goethals

Page 3: Unifying Nearest Neighbors Collaborative Filtering

Data: Binary, Positive-only

3  

Page 4: Unifying Nearest Neighbors Collaborative Filtering

Task: Find the best recommendation for

4  

Page 5: Unifying Nearest Neighbors Collaborative Filtering

5  

By computing scores

Page 6: Unifying Nearest Neighbors Collaborative Filtering

6  

Focus on computing

Page 7: Unifying Nearest Neighbors Collaborative Filtering

7  

Multiple solutions for computing s(   ,   ):
Matrix factorization
Nearest Neighbors methods
…

Page 8: Unifying Nearest Neighbors Collaborative Filtering

8  

User-Based Nearest Neighbors:

Item-Based Nearest Neighbors:

Two solutions for computing s(   ,   )

Page 9: Unifying Nearest Neighbors Collaborative Filtering

9  

User-Based Nearest Neighbors:

Item-Based Nearest Neighbors:

Two solutions for computing s(   ,   )

Page 10: Unifying Nearest Neighbors Collaborative Filtering

Two solutions for computing

10

User-Based Nearest Neighbors:

Item-Based Nearest Neighbors:

s(   ,   )
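The two formulas themselves appear only as images on the slides. As a rough, hedged sketch (not taken from the paper or its released code), user-based and item-based nearest-neighbors scoring with cosine similarity on binary, positive-only data could look like the following; the toy matrix R, the neighborhood size k and all function names are assumptions made for this illustration.

```python
import numpy as np

def cosine_sim(M):
    """Pairwise cosine similarity between the rows of a binary matrix M."""
    norms = np.sqrt(M.sum(axis=1, keepdims=True))   # sqrt of row counts
    norms[norms == 0] = 1.0
    S = (M @ M.T) / (norms * norms.T)
    np.fill_diagonal(S, 0.0)                         # exclude self-similarity
    return S

def user_based_scores(R, k=2):
    """s_UB(u, i): sum of similarities of u's k nearest users that prefer i."""
    S = cosine_sim(R)                                # user-user similarities
    scores = np.zeros_like(R, dtype=float)
    for u in range(R.shape[0]):
        knn = np.argsort(S[u])[::-1][:k]             # k most similar users
        scores[u] = S[u, knn] @ R[knn]
    return scores

def item_based_scores(R, k=2):
    """s_IB(u, i): sum of similarities between i's k nearest items and u's known items."""
    S = cosine_sim(R.T)                              # item-item similarities
    scores = np.zeros_like(R, dtype=float)
    for i in range(R.shape[1]):
        knn = np.argsort(S[i])[::-1][:k]             # k most similar items
        scores[:, i] = R[:, knn] @ S[i, knn]
    return scores

# Toy binary, positive-only preference matrix (rows = users, columns = items).
R = np.array([[1, 0, 1, 0],
              [1, 1, 0, 0],
              [0, 1, 1, 1]])
print(user_based_scores(R))
print(item_based_scores(R))
```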

Page 11: Unifying Nearest Neighbors Collaborative Filtering

We observed

Different scores, because different information is used.
Both are valuable; each ignores the other type of information.

11  

≠  

Page 12: Unifying Nearest Neighbors Collaborative Filtering

Research Question

Can we clearly characterize how both algorithms ignore information?

Can we avoid ignoring information?

12  

Page 13: Unifying Nearest Neighbors Collaborative Filtering

Contribution 1 & 2

13  

Page 14: Unifying Nearest Neighbors Collaborative Filtering

Contribution 1 & 2: Unifying Reformulation

14

Instead of the traditional perspective:

A novel, unifying reformulation:

Page 15: Unifying Nearest Neighbors Collaborative Filtering

Contribution 1 & 2: Unifying Reformulation

15

User-based and Item-based have different w(   ,   )
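A hedged sketch, in LaTeX, of what such a unifying reformulation can look like: both algorithms are written as one weighted sum over all known preferences (v, j), and differ only in the weight function w. The notation below is assumed for illustration; the precise definition is in the paper.

```latex
% Sketch of the unifying form: the score of item i for user u is a weighted
% sum over all known preferences (v, j), i.e. all pairs with R_{vj} = 1.
s(u,i) \;=\; \sum_{v \in \mathcal{U}} \sum_{j \in \mathcal{I}} R_{vj}\; w\big((u,i),(v,j)\big)
```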

Page 16: Unifying Nearest Neighbors Collaborative Filtering

16  

Contribution 1 & 2: A novel w(   ,   )

KUNN: Unified Nearest Neighbors

Page 17: Unifying Nearest Neighbors Collaborative Filtering

Details  

17  

Page 18: Unifying Nearest Neighbors Collaborative Filtering

18  

Computation of w(   ,   )

User-based (reformulation of existing alg.)

Item-based (reformulation of existing alg.)
KUNN (novel alg.)
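The LaTeX block below is a sketch of how the three weight functions might be written, inferred from the walkthrough on the following slides (the two indicator factors, the +1 per sufficiently similar neighbor, and the 1/√(...) normalization). Here c(x) denotes the number of preferences of user or item x and KNN(·) a k-nearest-neighbors set; these are reconstructions for illustration, not formulas copied from the paper.

```latex
% Reconstructed sketch of the weight of a known preference (v, j) for the
% pair (u, i); consult the paper for the exact definitions.
\begin{align*}
w_{\mathrm{UB}}\big((u,i),(v,j)\big)   &= R_{vi}\,R_{uj}\;\frac{[v \in \mathrm{KNN}(u)]}{\sqrt{c(u)\,c(v)}}\\
w_{\mathrm{IB}}\big((u,i),(v,j)\big)   &= R_{vi}\,R_{uj}\;\frac{[j \in \mathrm{KNN}(i)]}{\sqrt{c(i)\,c(j)}}\\
w_{\mathrm{KUNN}}\big((u,i),(v,j)\big) &= R_{vi}\,R_{uj}\;\frac{[v \in \mathrm{KNN}(u)] + [j \in \mathrm{KNN}(i)]}{\sqrt{c(u)\,c(v)\,c(i)\,c(j)}}
\end{align*}
```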

Page 19: Unifying Nearest Neighbors Collaborative Filtering

19  

If                                                    ,  we  have  evidence  to  work  with  

1234 : UB - IB - KUNN

Page 20: Unifying Nearest Neighbors Collaborative Filtering

20  

If                                                    ,  we  have  evidence  to  work  with  

=  1  

1234 : UB - IB - KUNN

Page 21: Unifying Nearest Neighbors Collaborative Filtering

                                                   only  relevant  to                if  also                                                            only  relevant  to                  if  also      

21  

&  

=  1  

1234 : UB - IB - KUNN

Page 22: Unifying Nearest Neighbors Collaborative Filtering

                                                   only  relevant  to                if  also                                                            only  relevant  to                  if  also      

22  

&  

=  1  ×  1  

1234 : UB - IB - KUNN

Page 23: Unifying Nearest Neighbors Collaborative Filtering

           needs  to  be  sufficiently  similar  to                    

23  

=  1  ×  1  

1234 : UB - IB - KUNN

Page 24: Unifying Nearest Neighbors Collaborative Filtering

           needs  to  be  sufficiently  similar  to                    

24  

=  1  ×  1  ×  1  

1234 : UB - IB - KUNN

Page 25: Unifying Nearest Neighbors Collaborative Filtering

             needs  to  be  sufficiently  similar  to                  

25  

=  1  ×  1  

1234 : UB - IB - KUNN

Page 26: Unifying Nearest Neighbors Collaborative Filtering

             needs  to  be  sufficiently  similar  to                  

26  

=  1  ×  1  ×  1  

1234 : UB - IB - KUNN

Page 27: Unifying Nearest Neighbors Collaborative Filtering

sufficiently similar to     → +1
sufficiently similar to     → +1
otherwise → 0

27  

=  1  ×  1  

1234 : UB - IB - KUNN

Page 28: Unifying Nearest Neighbors Collaborative Filtering

sufficiently similar to     → +1
sufficiently similar to     → +1
otherwise → 0

28  

=  1  ×  1  ×  2  

1234 : UB - IB - KUNN

Page 29: Unifying Nearest Neighbors Collaborative Filtering

's are less informative if     and     like many things

=  1  ×  1  ×  1  

1234 : UB - IB - KUNN

29  

Page 30: Unifying Nearest Neighbors Collaborative Filtering

's are less informative if     and     like many things

=  1  ×  1  ×  1  ×    

Unifying Nearest Neighbors Collaborative Filtering

Koen Verstrepen Bart Goethals

University of Antwerp

Antwerp, Belgium

{koen.verstrepen,bart.goethals}@uantwerp.be

ABSTRACT
We study collaborative filtering for applications in which there exists for every user a set of items about which the user has given binary, positive-only feedback (one-class collaborative filtering). Take for example an on-line store that knows all past purchases of every customer. An important class of algorithms for one-class collaborative filtering are the nearest neighbors algorithms, typically divided into user-based and item-based algorithms. We introduce a reformulation that unifies user- and item-based nearest neighbors algorithms and use this reformulation to propose a novel algorithm that incorporates the best of both worlds and outperforms state-of-the-art algorithms. Additionally, we propose a method for naturally explaining the recommendations made by our algorithm and show that this method is also applicable to existing user-based nearest neighbors methods.

Categories and Subject Descriptors: H.3.3 [Information Storage and Retrieval]: Information filtering

Keywords: One-class collaborative filtering, nearest neighbors, explaining recommendations, top-N recommendation, recommender systems.

1. INTRODUCTION

1/√(2·3)   1/√(1·2)   1/√(2·3·1·2)


u is a user

i is an item

u, v ∈ U are users

i, j ∈ I are items

The score for recommending item i to user u

Sum over all user-item-pairs (v, j)

Typically, the training data for collaborative filtering is represented by a matrix in which the rows represent the users and the columns represent the items. A value in this matrix can be unknown or can reflect the preference of the respective user for the respective item. Here, we consider the specific setting of binary, positive-only preference feedback. Hence, every value in the preference matrix is 1 or 0, with 1 representing a known preference and 0 representing the unknown. Pan et al. [10] call this setting one-class collaborative filtering (OCCF). Applications that correspond to this version of the collaborative filtering problem are likes on social networking sites, tags for photos, websites visited during a surfing session, articles bought by a customer, etc. In this setting, identifying the best recommendations can be formalized as ranking all user-item pairs (u, i) according to the probability that u prefers i.

One class of algorithms for solving OCCF are the nearest neighbors algorithms. Sarwar et al. [13] proposed the well known user-based algorithm that uses cosine similarity and Deshpande et al. [2] proposed several widely used item-based algorithms.

Another class of algorithms for solving OCCF are matrix factorization algorithms. The state-of-the-art algorithms in this class are Weighted Rating Matrix Factorization [7, 10], Bayesian Personalized Ranking Matrix Factorization [12] and Collaborative Less is More Filtering [14].

This work builds upon the different item-based algorithms by Deshpande et al. [2] and the user-based algorithm by Sarwar et al. [13] that uses cosine similarity. We introduce a reformulation of these algorithms that unifies their existing formulations. From this reformulation, it becomes clear that the existing user- and item-based algorithms unnecessarily discard important parts of the available information. Therefore, we propose a novel algorithm that combines both user-based and item-based information.

1234 : UB - IB - KUNN

30  

Page 31: Unifying Nearest Neighbors Collaborative Filtering

31  

's are less informative if     and     are popular

=  1  ×  1  ×  1  

1234 : UB - IB - KUNN

Page 32: Unifying Nearest Neighbors Collaborative Filtering

32  

's are less informative if     and     are popular

=  1  ×  1  ×  1  ×    


1234 : UB - IB - KUNN

Page 33: Unifying Nearest Neighbors Collaborative Filtering

's are less informative if     and     like many things
's are less informative if     and     are popular

33  

=  1  ×  1  ×  2  

1234 : UB - IB - KUNN

Page 34: Unifying Nearest Neighbors Collaborative Filtering

's are less informative if     and     like many things
's are less informative if     and     are popular

34  

=  1  ×  1  ×  2  ×    


1234 : UB - IB - KUNN

Page 35: Unifying Nearest Neighbors Collaborative Filtering

In  summary  

NN algorithms reformulated as a sum over user-item pairs; one such pair is (   ,   ).
For computing the contribution of the pair (   ,   ):
Existing user-based methods ignore info about the target item.
Existing item-based methods ignore info about the target user.
KUNN combines info about both the user and the item.

35  
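Putting the walkthrough together, here is a compact and deliberately naive sketch of how a KUNN-style score could be computed for one user-item pair. It mirrors the factors walked through on slides 19 to 34 (the pair (v, j) only counts if v likes i and u likes j, +1 per sufficiently similar neighbor, normalization by the four preference counts); the helper names, the neighborhood choice and the toy matrix are assumptions, and the released source code linked in the paper is the authoritative reference.

```python
import numpy as np

def knn_users(R, u, k=2):
    """Indices of the k users most cosine-similar to user u (excluding u)."""
    c = R.sum(axis=1)
    sims = (R @ R[u]) / np.sqrt(np.maximum(c * c[u], 1))
    sims[u] = -np.inf
    return set(np.argsort(sims)[::-1][:k])

def knn_items(R, i, k=2):
    """Indices of the k items most cosine-similar to item i (excluding i)."""
    return knn_users(R.T, i, k)

def kunn_score(R, u, i, k=2):
    """KUNN-style score s(u, i): sum over all known preferences (v, j)."""
    c_user = R.sum(axis=1)          # c(u): number of preferences per user
    c_item = R.sum(axis=0)          # c(i): number of preferences per item
    nn_u, nn_i = knn_users(R, u, k), knn_items(R, i, k)
    score = 0.0
    for v, j in zip(*np.nonzero(R)):            # every known preference (v, j)
        if R[v, i] == 0 or R[u, j] == 0:        # (v, j) only counts if v likes i and u likes j
            continue
        evidence = (v in nn_u) + (j in nn_i)    # +1 per sufficiently similar neighbor
        score += evidence / np.sqrt(c_user[u] * c_user[v] * c_item[i] * c_item[j])
    return score

# Toy binary, positive-only preference matrix (rows = users, columns = items).
R = np.array([[1, 0, 1, 0],
              [1, 1, 0, 0],
              [0, 1, 1, 1]])
print(kunn_score(R, u=0, i=1))
```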

Page 36: Unifying Nearest Neighbors Collaborative Filtering

One  More  Thing  

36  

Page 37: Unifying Nearest Neighbors Collaborative Filtering

Contribution 3

37  

Page 38: Unifying Nearest Neighbors Collaborative Filtering

Explaining recommendations

38  

Page 39: Unifying Nearest Neighbors Collaborative Filtering

Explaining recommendations

39  

Page 40: Unifying Nearest Neighbors Collaborative Filtering

Explaining recommendations

40  

Page 41: Unifying Nearest Neighbors Collaborative Filtering

Explaining recommendations

41  

Page 42: Unifying Nearest Neighbors Collaborative Filtering

Read our paper because:

1. Our reformulation could inspire you

2. You might want to benchmark with KUNN (link to source code in the paper)

3. You want to explain NN recommendations (incl. user-based)

 

42  

Page 43: Unifying Nearest Neighbors Collaborative Filtering
Page 44: Unifying Nearest Neighbors Collaborative Filtering

Accuracy Results Relative to POP

44  

experimental setups. In the user selected setup (Sec. 5.1), the users selected which items they rated. In the random selected setup (Sec. 5.2), users were asked to rate randomly chosen items.

5.1 User Selected Setup
This experimental setup is applicable to the Movielens and the Yahoo!Music_user datasets. In both datasets, users chose themselves which items they rated. Following Deshpande et al. [2] and Rendle et al. [12], one preference of every user is randomly chosen to be the test preference for that user. If a user has only one preference, no test preference is chosen. The remaining preferences are represented as a 1 in the training matrix R. All other entries of R are zero. We define the hit set H_u of a user u as the set containing all test preferences of that user. For this setup, this is a singleton, denoted as {h_u}, or the empty set if no test preference is chosen. Furthermore, we define U_t as the set of users with a test preference, i.e. U_t = {u ∈ U | |H_u| > 0}. Table 2 summarizes some characteristics of the datasets.

For every user u ∈ U_t, every algorithm ranks the items {i ∈ I | R_ui = 0} based on R. We denote the rank of the test preference h_u in such a ranking as r(h_u).

Then, every set of rankings is evaluated using three measures. Following Deshpande et al. [2] we use hit rate at 10 and average reciprocal hit rate at 10. In general, 10 can be replaced by any natural number N ≤ |I|. We follow Deshpande et al. [2] and choose N = 10.

Hit rate at 10 is given by

HR@10 = \frac{1}{|U_t|} \sum_{u \in U_t} |H_u \cap \mathrm{top10}(u)|,

with top10(u) the 10 highest ranked items for user u. Hence, HR@10 gives the percentage of test users for which the test preference is in the top 10 recommendations.

Average reciprocal hit rate at 10 is given by

ARHR@10 = \frac{1}{|U_t|} \sum_{u \in U_t} |H_u \cap \mathrm{top10}(u)| \cdot \frac{1}{r(h_u)}.

Unlike hit rate, average reciprocal hit rate takes into account the rank of the test preference in the top 10 of a user.

Following Rendle et al. [12], we also use the AMAN version of area under the curve, which is for this experimental setup given by

AUC_{AMAN} = \frac{1}{|U_t|} \sum_{u \in U_t} \frac{|I| - r(h_u)}{|I| - 1}.

AMAN stands for All Missing As Negative, meaning that a missing preference is evaluated as a dislike. Like average reciprocal hit rate, the area under the curve takes into account the rank of the test preference in the recommendation list for a user. However, AUC_AMAN decreases slower than ARHR@10 when r(h_u) increases.

We repeat all experiments five times, drawing a different random sample of test preferences every time.
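A minimal sketch of the three measures defined above, assuming the ranks r(h_u) have already been computed for every test user; the function name, variable names and the toy input are illustrative assumptions, not the authors' evaluation code.

```python
import numpy as np

def user_selected_metrics(ranks, n_items, n=10):
    """HR@n, ARHR@n and AUC_AMAN for a leave-one-out ('user selected') setup.

    ranks: for every test user u in U_t, the rank r(h_u) of the single test
           preference among the items with R_ui = 0 (1 = best).
    """
    ranks = np.asarray(ranks, dtype=float)
    hits = ranks <= n
    hr = hits.mean()                                   # HR@n
    arhr = np.where(hits, 1.0 / ranks, 0.0).mean()     # ARHR@n
    auc_aman = ((n_items - ranks) / (n_items - 1)).mean()
    return hr, arhr, auc_aman

# Example: three test users whose test preferences were ranked 3rd, 12th and
# 1st out of 100 candidate items.
print(user_selected_metrics([3, 12, 1], n_items=100))
```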

5.2 Random Selected Setup
The user selected experimental setup introduces two biases in the evaluation. Firstly, popular items get more ratings. Secondly, the majority of the ratings is positive. These two biases can have strong influences on the results and are thoroughly discussed by Pradel et al. [11]. The random selected test setup avoids these biases.

Following Pradel et al. [11], the training dataset is constructed in the user selected way: users chose to rate a number of items they selected themselves. The test dataset, however, is the result of a voluntary survey in which random items were presented to the users and a rating was asked. In this way, both the popularity and the positivity bias are avoided.

This experimental setup is applicable to the dataset Yahoo!Music_random. The training data, R, of this dataset is identical to the full Yahoo!Music_user dataset. Additionally, this dataset includes a test dataset in which the rated items were randomly selected. For a given user u, the hit set H_u contains all preferences of this user which are present in the test dataset. The set of dislikes M_u contains all dislikes of this user which are present in the test dataset. We define U_t = {u ∈ U | |H_u| > 0, |M_u| > 0}, i.e. all users with both preferences and dislikes in the test dataset. Table 2 summarizes some characteristics of the dataset.

For every user u ∈ U_t, every algorithm ranks the items in H_u ∪ M_u based on the training data R. The rank of an item i in such a ranking is denoted r(i).

Then, following Pradel et al. [11], we evaluate every set of rankings with the AMAU version of area under the curve, which is given by

AUC_{AMAU} = \frac{1}{|U_t|} \sum_{u \in U_t} AUC_{AMAU}(u),

with

AUC_{AMAU}(u) = \sum_{h \in H_u} \frac{|\{m \in M_u \mid r(m) > r(h)\}|}{|H_u|\,|M_u|}.

AMAU stands for All Missing As Unknown, meaning that a missing preference does not influence the evaluation. Hence, AUC_AMAU(u) measures, for a user u, the fraction of dislikes that is, on average, ranked behind the preferences. A big advantage of this measure is that it only relies on known preferences and known dislikes. Unlike the other measures, it does not make the bold assumption that items not preferred by u in the past are disliked by u.

Because of the natural split in a test and training dataset, repeating the experiment with different test preferences is not applicable.
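A minimal sketch of AUC_AMAU as defined above, assuming the ranks of the test preferences and dislikes within H_u ∪ M_u are given; the function names and the toy input are illustrative assumptions.

```python
import numpy as np

def auc_amau_user(pref_ranks, dislike_ranks):
    """AUC_AMAU(u): fraction of dislikes ranked behind the preferences, on average."""
    pref_ranks = np.asarray(pref_ranks)
    dislike_ranks = np.asarray(dislike_ranks)
    behind = sum((dislike_ranks > r).sum() for r in pref_ranks)
    return behind / (len(pref_ranks) * len(dislike_ranks))

def auc_amau(per_user):
    """Average AUC_AMAU(u) over all test users; per_user is a list of
    (preference ranks, dislike ranks) pairs for the items in H_u ∪ M_u."""
    return np.mean([auc_amau_user(h, m) for h, m in per_user])

# Example: one user whose two test preferences are ranked 1 and 3, and whose
# dislikes are ranked 2, 4 and 5 within H_u ∪ M_u.
print(auc_amau([([1, 3], [2, 4, 5])]))   # (3 + 2) / (2 * 3) ≈ 0.83
```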

5.3 Parameter Selection
Every personalization algorithm in the experimental evaluation has at least one parameter. For every experiment we try to find the best set of parameters using grid search. An experiment is defined by (1) the outer training dataset R, (2) the outer hit set ∪_{u∈U_t} H_u, (3) the outer dislikes ∪_{u∈U_t} M_u, (4) the evaluation measure, and (5) the algorithm. Applying grid search, we first choose a finite number of parameter sets. Secondly, from the training data R, we create five inner data splits, defined by R^k, ∪_{u∈U_t} H^k_u, and ∪_{u∈U_t} M_u, for k ∈ {1, 2, 3, 4, 5}. Then, for every parameter set, we rerun the algorithm on all five inner training datasets R^k and evaluate them on the corresponding inner test datasets with the chosen evaluation measure. Finally, the parameter set with the best average score over the five inner data splits is chosen to be the best parameter set for the outer training dataset R.
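A schematic sketch of the grid-search procedure described above; the five inner splits, the callback signatures and all names are assumptions about the general shape of such a procedure, not the authors' code.

```python
import itertools
import numpy as np

def grid_search(R, param_grid, make_inner_split, train_and_score, n_splits=5):
    """Pick the parameter set with the best average score over inner splits.

    param_grid:       dict mapping parameter name -> list of candidate values.
    make_inner_split: function (R, k) -> (R_k, test_k) producing the k-th inner
                      training matrix and inner test data.
    train_and_score:  function (R_k, test_k, params) -> evaluation score.
    """
    candidates = [dict(zip(param_grid, values))
                  for values in itertools.product(*param_grid.values())]
    splits = [make_inner_split(R, k) for k in range(n_splits)]
    best_params, best_score = None, -np.inf
    for params in candidates:
        score = float(np.mean([train_and_score(R_k, test_k, params)
                               for R_k, test_k in splits]))
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score

# Dummy usage with stand-in functions:
R = np.ones((4, 4))
best = grid_search(
    R,
    param_grid={"k": [1, 2, 3]},
    make_inner_split=lambda R, k: (R, None),
    train_and_score=lambda R_k, test_k, p: -abs(p["k"] - 2),  # pretend k=2 is best
)
print(best)   # ({'k': 2}, 0.0)
```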


[Results chart on Movielens 1M and Yahoo!Music; algorithms: KUNN, WRMF, BPRMF, UB+IB, UB, IB, POP]