
Active Learning Strategies for Rating Elicitation in Collaborative Filtering: A System-Wide Perspective

MEHDI ELAHI and FRANCESCO RICCI, Free University of Bozen-Bolzano
NEIL RUBENS, University of Electro-Communications

The accuracy of collaborative-filtering recommender systems largely depends on three factors: the quality of the rating prediction algorithm, and the quantity and quality of available ratings. While research in the field of recommender systems often concentrates on improving prediction algorithms, even the best algorithms will fail if they are fed poor-quality data during training, that is, garbage in, garbage out. Active learning aims to remedy this problem by focusing on obtaining better-quality data that more aptly reflects a user's preferences. However, traditional evaluation of active learning strategies has two major flaws, which have significant negative ramifications on accurately evaluating the system's performance (prediction error, precision, and quantity of elicited ratings). (1) Performance has been evaluated for each user independently (ignoring system-wide improvements). (2) Active learning strategies have been evaluated in isolation from unsolicited user ratings (natural acquisition).

In this article we show that an elicited rating has effects across the system, so a typical user-centric evaluation which ignores any changes of rating prediction of other users also ignores these cumulative effects, which may be more influential on the performance of the system as a whole (system centric). We propose a new evaluation methodology and use it to evaluate some novel and state-of-the-art rating elicitation strategies. We found that the system-wide effectiveness of a rating elicitation strategy depends on the stage of the rating elicitation process, and on the evaluation measures (MAE, NDCG, and Precision). In particular, we show that using some common user-centric strategies may actually degrade the overall performance of a system. Finally, we show that the performance of many common active learning strategies changes significantly when evaluated concurrently with the natural acquisition of ratings in recommender systems.

Categories and Subject Descriptors: H.3.3 [Information Storage and Retrieval]: Information Search and Retrieval

General Terms: Algorithms, Experimentation, Performance

Additional Key Words and Phrases: Recommender systems, active learning, rating elicitation, cold start

ACM Reference Format:
Elahi, M., Ricci, F., and Rubens, N. 2013. Active learning strategies for rating elicitation in collaborative filtering: A system-wide perspective. ACM Trans. Intell. Syst. Technol. 5, 1, Article 13 (December 2013), 33 pages.
DOI: http://dx.doi.org/10.1145/2542182.2542195

Authors' addresses: M. Elahi (corresponding author) and F. Ricci, Free University of Bozen-Bolzano, Bozen-Bolzano, Italy; email: [email protected]; N. Rubens, University of Electro-Communications, Tokyo, Japan.
Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies show this notice on the first page or initial screen of a display along with the full citation. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, to republish, to post on servers, to redistribute to lists, or to use any component of this work in other works requires prior specific permission and/or a fee. Permissions may be requested from Publications Dept., ACM, Inc., 2 Penn Plaza, Suite 701, New York, NY 10121-0701 USA, fax +1 (212) 869-0481, or [email protected].
© 2013 ACM 2157-6904/2013/12-ART13 $15.00
DOI: http://dx.doi.org/10.1145/2542182.2542195

1. INTRODUCTION

Choosing the right product to consume or purchase is nowadays a challenging problem due to the growing number of products and the variety of eCommerce services. While increasing the number of choices gives a consumer the opportunity to acquire products satisfying her special personal needs, it may at the same time overwhelm her by providing too many choices [Anderson 2006]. Recommender Systems (RSs) tackle this problem by providing personalized suggestions for digital content, products, or services that better match the user's needs and constraints than the mainstream products [Resnick and Varian 1997; Ricci et al. 2011b; Jannach et al. 2010]. In this article we focus on the collaborative filtering recommendation approach, and on techniques aimed at identifying what information about the user's tastes should be elicited by the system to generate effective recommendations.

1.1. Recommender Systems and Social Networks

A Collaborative Filtering (CF) recommender system uses ratings for items provided by a network of users to recommend items which the target user has not yet considered but will likely enjoy [Koren and Bell 2011; Desrosiers and Karypis 2011]. A collaborative filtering system computes its recommendations by exploiting relationships and similarities between users. These relations are defined by observing how the users access the items managed by the system. For instance, consider the users actively using Amazon or Last.fm. They browse and search items to buy or to listen to. They read users' comments and get recommendations computed by the system using the access logs and the ratings of the users' community.

The recommendations computed by a CF system are based on the network structure created by the users accessing the system. For instance, classical neighbor-based CF systems evaluate user-to-user or item-to-item similarities based on the corating structure of the users or items. Two users are considered similar if they rate items in a correlated way. Analogously, two items are considered similar if the network of users has rated them in a correlated way. The recommendations for an active user are then computed by suggesting the items that have a high average rating in the group of users similar to the active one (user-based CF). In item-based approaches, a user is recommended items that are similar to those she liked in the past, where similar means that users rated the two items in a correlated way. Even the more novel matrix factorization recommendation techniques [Koren and Bell 2011], which are considered in this article, model users and items with vectors of abstract factors that are learned by mining the rating behavior of a network of users. In these factor models, in addition to the similarity of users with users and items with items, it is also possible to establish the similarity of users with items, since both are represented uniformly with a vector of factors.

Recommender systems are often integrated into eCommerce applications to suggest items to buy or consume, but RSs are now also frequently used in social networks, that is, applications primarily designed to support social interactions between their users. Moreover, in many popular social networks such as Facebook, Myspace, and Google Plus, several applications have been introduced for eliciting user preferences and characteristics, and then for providing recommendations. For instance, in Facebook, the largest social network, there are applications where users can rate friends, movies, photos, or links. Examples of such applications are Rate My Friends (with more than 6,000 monthly active users), Rate my Photo, My Movie Rating, and LinkR. Some of these applications collect user preferences (ratings) only to create a user profile page. However, some of them also use the data to make recommendations to the users. For instance, using Rate My Friends, the user is requested to rate her friends. Then the application ranks her friends based on the ratings and presents the top-scored users that she may be interested in connecting to. LinkR, another example of a recommender system integrated into a social network, recommends a number of links and allows the user to rate them. Moreover, Facebook itself also collects ratings by offering a "like" button on partner sites and exploits its usage to discover which friends, groups, apps, links, or games a particular user may like.

It is worth noting that all of the applications mentioned before must implement a rating elicitation strategy, that is, identify items to present to the user in order to collect her ratings. In this article we propose and evaluate some strategies for accomplishing this task. Hence, social networks can benefit from the techniques introduced here to generate better recommendations for establishing new social relationships, thus improving the core service of social networks.

1.2. Collaborative Filtering and Rating Acquisition

The CF rating prediction accuracy depends on the characteristics of the prediction algorithm. Hence, in recent years several variants of CF algorithms have been proposed. Koren and Bell [2011] and Desrosiers and Karypis [2011] provide an up-to-date survey of memory- and model-based methods.

In addition to the rating prediction technique, the number, the distribution, and the quality of the ratings known by the system influence the system's performance. In general, the more informative about the user preferences the available ratings are, the higher the recommendation accuracy. Therefore, it is important to keep acquiring new and useful ratings from the users in order to maintain or improve the quality of the recommendations. This is especially true in the cold start stage, when a new user or a new item is added to the system [Schein et al. 2002; Liu et al. 2011; Zhou et al. 2011; Golbandi et al. 2011].

It is worth noting that RSs usually deal with huge catalogues; for example, Netflix, the popular American provider of on-demand Internet streaming media, manages almost one million movies. Hence, if the recommender system wants to explicitly ask the user to rate some items, this set must be carefully chosen. First of all, the system should ask for ratings of items that the user has experienced, otherwise no useful information can be acquired. This is not easy, especially when the user is new to the system and there is not much knowledge that can be leveraged to predict what items the user actually experienced in the past. Additionally, the system should exploit techniques to identify those items that, if rated by the user, would generate ratings data that improve the precision of future recommendations, not only for the target user but for all of the system's users. Informative ratings can provide additional knowledge about the preferences of the users as well as fix errors in the rating prediction model.

1.3. Approach and Goals of this Article

In this work we focus on understanding the behavior of several rating acquisition strategies, such as "provide your ratings for these top ten movies". The goal of a rating acquisition strategy is to enlarge the set of available data in the way that is optimal for the whole system performance, by eliciting the most useful ratings from each user. In practice, an RS user interface can be designed so that users browsing the existing items can rate them if they wish. But new ratings can also be acquired by explicitly asking users. In fact, some RSs ask the users to rate the recommended items, mixing recommendations with user preference elicitation. We will show that this approach is feasible but must be used with care, since relying on just one single strategy, such as asking the user's opinion only for the items that the system believes the user likes, has a potentially dangerous impact on the system effectiveness. Hence a careful selection of the elicitation strategy is in order.

In this article we extend our previous work [Elahi et al. 2011a, 2011b], where we provided an initial evaluation of active learning rating elicitation strategies in collaborative filtering. In addition to the "pure" strategies, namely those implementing a single heuristic, we also consider "partially randomized" ones. Randomized strategies, in addition to asking (simulated) users to rate the items selected by a "pure" strategy, also ask them to rate some randomly selected items. Randomized strategies can diversify the item list presented to the user. But, more importantly, randomized strategies make it possible to cope with the nonmonotonically improving behavior of the system effectiveness that we observed during the simulation of certain "pure" strategies. In fact, we discovered (as hypothesized by Rashid et al. [2002]) that certain strategies, for instance, requesting to rate the items with the highest predicted ratings, may generate a system-wide bias and inadvertently increase the system error.

RSs can be evaluated online and offline [Herlocker et al. 2004; Shani and Gunawardana 2010; Cremonesi et al. 2010]. In the first case, one or more RSs are run and experiments on real users are performed. This requires building or accessing one or more fully developed RSs with a large user community, which is expensive and time consuming. Moreover, it is hard to test several algorithms online, such as those proposed here. Therefore, similarly to many previous experimental analyses, we performed offline experiments. We developed a program which simulates the real process of rating elicitation in a community of users (Movielens and Netflix), the consequent growth of the rating database starting from a relatively small one (cold start), and the system adaptation (retraining) to the new set of data. Moreover, in this article we evaluate the proposed strategies in two scenarios: when the simulated users are confined to rating only items that are presented to them by the active learning strategy, and when they can also voluntarily add ratings on their own.

In the experiments performed here we used a state-of-the-art matrix factorization rating prediction algorithm [Koren and Bell 2011; Timely Development 2008]. Hence our results can provide useful guidelines for managing real RSs, which nowadays often rely on this technique. In factor models both users and items are assigned factor vectors of the same size. These vectors are obtained from the user ratings matrix with optimization techniques that try to approximate the original rating matrix. Each element of the factor vector assigned to an item reflects how well the item represents a particular latent aspect [Koren and Bell 2011]. For our experiments we employed a gradient descent optimization technique as proposed by Simon Funk [2006].
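
To make the factor model concrete, the following is a minimal sketch of Funk-style matrix factorization trained with stochastic gradient descent; the learning rate, regularization constant, and number of epochs are illustrative defaults, not the exact values used in the article's experiments.

```python
import numpy as np

def train_funk_svd(ratings, n_users, n_items, n_factors=16,
                   lr=0.005, reg=0.02, n_epochs=20, seed=0):
    """Learn user/item factor vectors from (user, item, rating) triples
    with stochastic gradient descent, in the spirit of Funk [2006]."""
    rng = np.random.default_rng(seed)
    P = 0.1 * rng.standard_normal((n_users, n_factors))   # user factors
    Q = 0.1 * rng.standard_normal((n_items, n_factors))   # item factors
    for _ in range(n_epochs):
        for u, i, r in ratings:
            pu = P[u].copy()
            err = r - pu @ Q[i]                  # prediction error on this rating
            P[u] += lr * (err * Q[i] - reg * pu) # gradient step on user factors
            Q[i] += lr * (err * pu - reg * Q[i]) # gradient step on item factors
    return P, Q

def predict(P, Q, u, i):
    """A predicted rating is the dot product of the user and item factor vectors."""
    return float(P[u] @ Q[i])
```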

1.4. Article Contribution

The main contribution of our research is the introduction and the empirical evaluationof a set of rating elicitation strategies for collaborative filtering with respect to theirsystem-wide utility. Some of these strategies are new and some come from the liter-ature and the common practice. An important differentiating aspect of our study ismeasuring the effect of each strategy on several RSs evaluation measures and showingthat the best strategy depends on the evaluation measure. Previous works focussedonly on the rating prediction accuracy (Mean Absolute Error), and on the number ofacquired ratings. We analyze those aspects, but in addition we consider the recom-mendation precision, and the goodness of the recommendations’ ranking, measuredwith Normalized Discounted Cumulative Gain (NDCG). These measures are crucialfor determining the value of the recommendations [Shani and Gunawardana 2010].
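
As a reference point, a minimal sketch of how MAE and NDCG can be computed from predicted and true ratings is given below; the logarithmic discount and the truncation at k are common conventions rather than details prescribed by the article, and Precision is omitted for brevity.

```python
import numpy as np

def mae(y_true, y_pred):
    """Mean Absolute Error over a set of test ratings."""
    return float(np.mean(np.abs(np.asarray(y_true, float) - np.asarray(y_pred, float))))

def ndcg_at_k(true_ratings, pred_ratings, k=10):
    """NDCG@k for one user: rank items by predicted rating and compare the
    discounted gain of their true ratings with the ideal ranking."""
    true_ratings = np.asarray(true_ratings, dtype=float)
    order = np.argsort(-np.asarray(pred_ratings, dtype=float))[:k]  # predicted ranking
    ideal = np.sort(true_ratings)[::-1][:k]                         # best possible ranking
    discounts = 1.0 / np.log2(np.arange(2, k + 2))
    dcg = np.sum(true_ratings[order] * discounts[:len(order)])
    idcg = np.sum(ideal * discounts[:len(ideal)])
    return float(dcg / idcg) if idcg > 0 else 0.0
```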

Moreover, another major contribution of our work is the analysis of the performance of the elicitation strategies taking into account the size of the rating database. We show that different strategies can improve different aspects of the recommendation quality at different stages of the rating database development. We show that in some stages elicitation strategies may induce a bias on the system and ultimately result in a decrease of the recommendation accuracy.


In summary, this article provides a realistic, comprehensive evaluation of several applicable rating elicitation strategies, providing guidelines and conclusions that could help with their deployment in real RSs.

1.5. Novelty of the Proposed Approach

Rating elicitation has also been tackled in a few previous works [McNee et al. 2003; Rashid et al. 2002, 2008; Carenini et al. 2003; Jin and Si 2004; Harpale and Yang 2008] that will be surveyed in Section 2. But these papers focused on a problem that is different from the one we consider here. Namely, they measured the benefit of the ratings elicited from one user, for example, in the sign-up stage, for improving the quality of the recommendations for that user. Conversely, we consider the impact of an elicitation strategy on the overall system behavior, for example, the prediction accuracy averaged over all the system's users. In other words, we try to identify strategies that can elicit from a user ratings that will contribute to the improvement of the system performance for all of the users, and not just for the target user.

Previously conducted evaluations have assumed rather artificial conditions, that is, all the users and items have some ratings from the beginning of the evaluation process and the system asks the simulated user only for ratings that are present in the dataset. In other words, previous studies did not consider the new-item and the new-user problems. Moreover, only a few evaluations simulated users with limited knowledge about the items (e.g., Harpale and Yang [2008]). We generate initial conditions for the rating dataset based on the temporal evolution of the system; hence, in our experiments, new users and new items are present in a similar manner as in real settings. Moreover, the system does not know what items the simulated user has experienced, and may ask for ratings of items that the user will not be able to provide. This better simulates a realistic scenario where not all rating requests can be satisfied by a user.

It is also important to note that previous analyses considered the situation where the active learning rating elicitation strategy was the only tool used to collect new ratings from the users. Hence, elicitation strategies were evaluated in isolation from ongoing system usage, where users can freely enter new ratings. We propose a more realistic evaluation setting where, in addition to the ratings acquired by the elicitation strategies, ratings are also added by users on a voluntary basis. Hence, for the validation experiments, we have also utilized a simulation process in which active learning is combined with the natural acquisition of the users' ratings.

The rest of the article is structured as follows. In Section 2 we review related work. In Section 3 we introduce the rating elicitation strategies that we have analyzed. In Section 4 we present the first simulation procedure that we designed to more accurately evaluate the system's recommendation performance (MAE, NDCG, and Precision). The results of our experiments are shown in Sections 5 and 6. Then, in Section 7 we present the analysis of the active learning strategies when active learning is mixed with the natural acquisition of user ratings. Finally, in Section 8 we summarize the results of this research and outline directions for future work.

2. RELATED WORK

Active learning in RSs aims at actively acquiring user preference data to improve the output of the RS [Boutilier et al. 2003; Rubens et al. 2011]. Active learning for RSs is a form of preference elicitation [Bonilla et al. 2010; Pu and Chen 2008; Chen and Pu 2012; Braziunas and Boutilier 2010; Guo and Sanner 2010; Birlutiu et al. 2012], but current research on active learning for recommender systems has focussed on collaborative filtering, and in particular on the new-user problem. In this setting, it is assumed that a user has not rated any items, and the system is able to actively ask the user to rate some items in order to generate recommendations for her. In this survey we will focus on AL in collaborative filtering.

In many previous works, which we will describe shortly, the evaluation of a rating elicitation strategy is performed by simulating the interaction with a new user while the system itself is not in a cold start stage, that is, it has already acquired many ratings from users.

Conversely, as we mentioned in the Introduction, in our work we simulate the application of several rating elicitation strategies in a more diverse set of scenarios, beyond the typical setting in which the new user has not rated any items while the system already possesses many ratings provided by other users. We consider a more general scenario where the user repeatedly comes back to the system for receiving recommendations, that is, while the system has possibly elicited ratings from other users. Moreover, we simulate a scenario where the system initially has a small overall knowledge of the users' preferences, that is, a small set of ratings to train the prediction model. Then, step by step, as the users come to the system, new ratings are elicited. Another important difference compared to the state of the art is that we consider the impact of an elicitation strategy on the overall system behavior. This aims to measure how the ratings elicited from one user can contribute to the improvement of the system performance even when making recommendations for other users.

Finally, we have also investigated a more realistic evaluation scenario where active learning is combined with the natural addition of ratings, that is, some ratings are freely added by the users without being requested. This scenario has not been considered previously.

2.1. Rating Elicitation at Sign-Up

The first research on AL for recommender systems was motivated by the need to implement more effective sign-up processes and used the classical neighbor-based approaches to collaborative filtering [Desrosiers and Karypis 2011]. Rashid et al. [2002] focus explicitly on the sign-up process, that is, when a new user starts using a collaborative filtering recommender system and must rate some items in order to provide the system with some initial information about her preferences. Rashid et al. [2002] considered six techniques for explicitly determining the items to ask a user to rate: entropy, where items with the largest rating entropy are preferred; random request; popularity, measured as the number of ratings for an item, so that the most frequently rated items are selected; log(popularity) ∗ entropy, where items that are both popular and have diverse ratings are selected; and finally item-item personalized, where random items are proposed until the user rates one, and then a recommender is used to predict what items the user is likely to have seen based on the ratings already provided by the user; these predicted items are requested from the user to rate. The behavior of an item-to-item collaborative filtering system [Desrosiers and Karypis 2011] was evaluated with respect to MAE in an offline setting that simulated the sign-up process. The process was repeated multiple times and averaged over all the test users. In that scenario the log(popularity) ∗ entropy strategy was found to be the best. For this reason we have also evaluated log(popularity) ∗ entropy in our study. However, it is worth noting that their result cannot be automatically extended to the scenario that we consider in this work, that is, the evolution of the global system performance under the application of an active learning strategy applied to all the users. In fact, as we mentioned earlier, in our experiments we simulate the simultaneous acquisition of ratings from all the users, by asking each user in turn to rate some items, and we repeat this process several times. This simulates the long-term usage of a recommender system, where users utilize the system repeatedly to get new recommendations, and ratings provided by a user are also used to generate better recommendations for other users (system performance).

2.2. Conversational Approaches

Subsequently, researchers understood that in order to devise more effective rating elicitation strategies the system should be conversational: it should better motivate the rating requests, focusing on the user preferences, and the user should be able to enter her ratings more freely, even without being explicitly requested.

In Carenini et al. [2003] a user-focussed approach is considered. They propose a set of techniques to intelligently select items to rate when the user is particularly motivated to provide such information. They present a conversational and collaborative interaction model that elicits ratings so that the benefit of doing so is clear to the user, thus increasing the motivation to provide a rating. Item-focused techniques that elicit ratings to improve the rating prediction for a specific item are also proposed. Popularity, entropy, and their combination are tested, as well as their item-focused modifications. The item-focused techniques differ from the classical ones in that popularity and entropy are not computed on the whole rating matrix, but only on the matrix of the user's neighbors that have rated the item for which the prediction accuracy is being improved. Results have shown that item-focused strategies are consistently better than unfocused ones.

McNee et al. [2003] address an even more general problem, aiming at understanding which, among the following methods, is the best solution for rating elicitation in the start-up phase: (a) allowing a user to enter items and her ratings freely, (b) proposing to a user a list of items and asking her to rate them, or (c) combining the two approaches. They compare three interfaces for eliciting information from new users that implement the aforementioned approaches. They performed an online experiment, which shows that the two pure approaches produced more accurate user models than the mixed model with respect to MAE.

2.3. Bayesian Approaches

In another group of approaches AL is modeled as a Bayesian reasoning process. Harpale and Yang [2008] developed such an approach, extending and criticizing a previous one introduced in Jin and Si [2004]. In fact, Jin and Si [2004], as is rather common in most AL techniques and evaluation studies, make the unrealistic assumption that a user can provide a rating for any presented item. Conversely, Harpale and Yang [2008] propose a revised Bayesian item selection approach which does not make such an assumption, and which introduces an estimate of the probability that a user has consumed an item in the past and is able to provide a rating. Their results show that the personalized Bayesian selection outperforms Bayesian selection and the random strategy with respect to MAE. Their simulation setting is similar to that used in Rashid et al. [2002], hence for the same reason their results are not directly comparable with ours. There are other important differences between their experiment and ours: their strategies elicit only one rating per request, while we assume that the system makes many rating requests at the same time; they compare the proposed approach only with the random strategy, while we study the performance of several strategies; they do not consider the new-user problem, since in their simulations all the users have at least three ratings at the beginning of the experiment, whereas in our experiments there are users that have no ratings at all in the initial stage; and they use a different rating prediction algorithm (Bayesian versus matrix factorization). All these differences make the two sets of experiments and conclusions hard to compare. Moreover, in their simulations they assume that the system has a larger number of known ratings than in our experiments.


2.4. Decision-Tree-Based Methods

Many recent approaches to rating elicitation in RSs identify the items that the user should be asked to rate as those providing the most useful knowledge for reducing the prediction error of the recommender system. Many of these approaches exploit decision trees to model the conditional selection of an item to be rated, with regard to the ratings provided by the user for the items presented previously.

In Rashid et al. [2008] the authors extend their former work [Rashid et al. 2002] using a rating elicitation approach based on decision trees. The proposed technique, called IGCN, builds a tree where each node is labelled by a particular item that the user is asked to rate. According to the user's rating for that item a different branch is followed, and a new node, labelled with another item to rate, is reached. In order to build this decision tree, they first cluster the users into groups of users with similar profiles, assigning each user to one of these clusters. The tree is incrementally extended by selecting for each node the item that provides the highest information gain for correctly classifying the user into the right cluster. Hence, the items whose ratings are more important to correctly classify the users into the right cluster are selected earlier in the tree. They also considered two alternative strategies. The first one is entropy0, which differs from the more classical entropy strategy mentioned before because the missing value is considered as a possible rating (category 0). The second one is called HELF, where items with the largest value of the harmonic mean of the entropy and popularity are selected. They conducted offline and online simulations, and concluded that IGCN and entropy0 perform the best with respect to MAE.
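
As an illustration of these two scoring heuristics, the sketch below computes entropy0 and a HELF-style score from the ratings of a single item; the exact normalization used by Rashid et al. [2008] may differ, here both entropy and log-popularity are simply rescaled to [0, 1] before taking the harmonic mean.

```python
import numpy as np

def entropy0(item_ratings, n_users):
    """Rating entropy of an item, treating 'not rated' as an extra category 0."""
    r = np.asarray(item_ratings, dtype=int)
    counts = np.array([n_users - len(r)] + [int(np.sum(r == v)) for v in range(1, 6)], dtype=float)
    p = counts[counts > 0] / counts.sum()
    return float(-np.sum(p * np.log2(p)))

def helf(item_ratings, n_users):
    """HELF-style score: harmonic mean of normalized entropy and normalized log-popularity."""
    r = np.asarray(item_ratings, dtype=int)
    if len(r) == 0:
        return 0.0
    p = np.array([np.mean(r == v) for v in range(1, 6)])
    p = p[p > 0]
    ent = float(-np.sum(p * np.log2(p))) / np.log2(5)   # normalized rating entropy
    pop = np.log(1 + len(r)) / np.log(1 + n_users)      # normalized log-popularity
    return 0.0 if ent + pop == 0 else 2 * ent * pop / (ent + pop)
```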

They evaluate the improvement of the rating prediction accuracy for the particular user whose ratings are elicited, while, as we mentioned previously, we measure the overall system-wide effectiveness of a rating elicitation strategy. Moreover, in their experiments they use a very different rating prediction algorithm, that is, a standard neighbor-based approach [Desrosiers and Karypis 2011], while we use matrix factorization [Koren and Bell 2011].

In a more recent work, Golbandi et al. [2010] use three strategies for rating elicitation in collaborative filtering. For the first method, GreedyExtend, the items that minimize the Root Mean Square Error (RMSE) of the rating prediction (on the training set) are selected. For the Var method, the items with the largest √popularity ∗ variance are selected, that is, those items that have many ratings in the training set and with diverse values. Finally, for the Coverage method, the items with the largest coverage are selected. They defined the coverage of an item as the total number of users who corated both the selected item and any other item. They evaluated the performance of these strategies and compared them with previously proposed ones (popularity, entropy, entropy0, HELF, and random). In their experiments, every strategy ranks and picks the top 200 items to be presented to new users. Then, considering the ratings of the users for these items as the training set, they predict the ratings in a Netflix test set for every single user and compute the RMSE. They show that GreedyExtend outperforms the other strategies. In fact, this strategy is quite effective, as it obtains after having acquired just 10 ratings the same error rate that the second best strategy (Var) achieves after 26 ratings. However, despite this remarkable achievement, GreedyExtend is static, that is, it selects the items without considering the ratings previously entered by the user. Here too the authors focus on the new-user problem. In our work we do not make such an assumption, and we propose and evaluate strategies that can be used at all stages, not only at start-up.

Even more recently, the same authors developed an adaptive version of their approach [Golbandi et al. 2011]. Here, the items selected for a user depend on the previous ratings she has provided. They propose a technique based on decision trees where at each node there is a test based on a particular item (movie). The node divides the users into three groups based on their rating for that movie: lovers, who rated the movie highly; haters, who rated the movie low; and unknowns, who did not rate the movie. In order to build the decision tree, at each node the movie whose rating knowledge produces the largest reduction of the RMSE is selected. The rating prediction is computed (approximated) as the weighted average of the ratings given by the users that belong to that node. They evaluated their approach using the Netflix training dataset (100M ratings) to construct the trees, and evaluated the performance of the proposed strategy on the Netflix test set (2.8M ratings). The proposed strategy has shown a significant reduction of RMSE compared with the GreedyExtend, Var, and HELF strategies. They were able to achieve with only 6 ratings the same accuracy that the next best strategy, namely GreedyExtend, achieves after acquiring over 20 ratings, and that the Var and HELF strategies obtain only after acquiring more than 30 ratings.

It should be noted that their results are again rather difficult to compare with ours. They simulate a scenario where the system is trained and the decision tree is constructed from a large training dataset, so they assume a large initial knowledge of the system. Then, they focus on completely new users, that is, those without a single rating in the training set. In contrast, in our work we assume that the system has a very limited global knowledge of the users. In our experiments this is simulated by giving the system only 2% of the rating dataset. Moreover, we analyze the system dynamics as users are repeatedly asked to enter more ratings.

2.5. Time-Dependent Evolution of a Recommender System

Finally, we want to mention an interesting related work that does not address the active learning process of rating elicitation but studies the time-dependent evolution of a recommender system as new ratings are acquired. In Burke [2010] the author analyzes the temporal properties of a standard user-based collaborative filtering algorithm [Herlocker et al. 1999] and of Influence Limiter [Resnick and Sami 2007], a collaborative filtering algorithm developed to counteract profile injection attacks by considering the time at which a user has rated an item.

They evaluate the accuracy of these two prediction algorithms while the users are rating items and the database is growing. This is radically different from the typical evaluations that we mentioned earlier, where the rating dataset is decomposed into training and testing sets without considering the timestamps of the ratings. In Burke [2010] it is argued that considering the time at which the ratings were added to the system gives a better picture of the real user experience, in terms of recommendation accuracy, during the interactions with the system. They conducted their analysis on the large Movielens dataset (1M ratings) and discovered that when using Influence Limiter, MAE does not decrease with the addition of more data, indicating that the algorithm is not effective in terms of accuracy improvement. For the standard user-based collaborative filtering algorithm they observed the presence of two time segments: the start-up period, until day 70, with MAE dropping gradually, and the remaining period, where MAE was dropping much more slowly.

This analysis is complementary to our study. That work analyzes the performance of a recommendation algorithm while the users are adding their ratings in a natural manner, that is, without being explicitly requested to rate items selected by an active learning strategy. We have investigated the situation where, in addition to this natural stream of ratings coming from the users, the system selectively chooses additional items and presents them to the users to get their ratings.


3. ELICITATION STRATEGIES

A rating dataset R is an n × m matrix of real values (ratings) with possible null entries. The variable rui denotes the entry of the matrix in position (u, i), and contains the rating assigned by user u to item i. rui may store a null value, representing the fact that the system does not know the opinion of the user on that item. In the Movielens and Netflix datasets the rating values are integers between 1 and 5 (inclusive).

A rating elicitation strategy S is a function S(u, N, K, Uu) = L which returns a list of items L = {i1, . . . , iM}, M ≤ N, whose ratings should be elicited from the user u. Here N is the maximum number of items that the strategy should return, and K is the dataset of known ratings, that is, the ratings (of all the users) that have already been acquired by the RS. K is also an n × m matrix containing entries with real or null values; the not-null entries represent the knowledge of the system at a certain point of the RS evolution. Finally, Uu is the set of items whose ratings have not yet been elicited from u, hence potentially interesting. The elicitation strategy enforces that L ⊂ Uu and will not repeatedly ask a user to rate the same item; that is, after the items in L are shown to a user they are removed from Uu.

Every elicitation strategy analyzes the dataset of known ratings K and scores the items in Uu. If the strategy can score at least N different items, then the N items with the highest score are returned. Otherwise a smaller number of items M ≤ N is returned. It is important to note that the user may not have experienced the items whose ratings are requested; in this case the system will not increase the number of known ratings. In practice, following one strategy may result in collecting a larger number of ratings, while following another may result in fewer but more informative ratings. These two properties (rating quantity and quality) play a fundamental role in rating elicitation.
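
In code, a strategy can be viewed as a scoring function over the not-yet-requested items of a user. The sketch below fixes a hypothetical interface of this kind (the class and method names are ours, not the article's), into which the scoring rules of Section 3.1 can be plugged; K is assumed to be a numpy matrix with np.nan for null entries.

```python
import numpy as np

class ElicitationStrategy:
    """S(u, N, K, Uu): score the items in Uu using the known-rating matrix K
    and return (at most) the N top-scoring items to ask user u about."""

    def score(self, u, item, K):
        raise NotImplementedError

    def select(self, u, N, K, Uu):
        scored = [(self.score(u, i, K), i) for i in Uu]
        scored = [(s, i) for s, i in scored if s is not None]  # keep items the strategy can score
        scored.sort(key=lambda x: -x[0])
        return [i for _, i in scored[:N]]

class PopularityStrategy(ElicitationStrategy):
    """Non-personalized: an item's score is its number of known (not-null) ratings."""
    def score(self, u, item, K):
        return int(np.sum(~np.isnan(K[:, item])))
```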

3.1. Individual Strategies

We considered two types of strategies: pure and partially randomized. The former implement a unique heuristic, whereas the latter hybridize a pure strategy by adding some requests for randomly selected items whose ratings are still unknown to the system. As we mentioned in the Introduction, these strategies add some diversity to the system requests and, as we will show later, can cope with an observed problem of the pure strategies, which may in some cases increase the system error.

The pure strategies that we have considered are as follows (a code sketch of a few of these scoring rules is given after the list).

—Popularity. For all the users, the score for an item i ∈ Uu is equal to the number of not-null ratings for i contained in K, that is, the number of known ratings for the item. This strategy ranks the items according to the popularity score and then selects the top N items. Note that this strategy is not personalized, that is, the same N items are proposed to be rated by every user. The rationale of this strategy is that more popular items are more likely to be known by the user, and hence it is more likely that a request for such a rating will increase the size of the rating database.

—Log(popularity) * entropy. The score for an item i ∈ Uu is computed by multiplying the logarithm of the popularity of i by the entropy of the ratings for i in K. Also in this case, as for any strategy, the top N items according to the computed score are proposed to be rated by the user. This strategy tries to combine the effect of the popularity score, discussed previously, with the heuristic that favors items with more diverse ratings (larger entropy), which may provide more useful (discriminative) information about the user's preferences [Carenini et al. 2003; Rashid et al. 2002].

—Binary Prediction. The matrix K is transformed into a matrix B with the same number of rows and columns by mapping null entries in K to 0, and not-null entries to 1. Hence, the matrix B models only whether a user rated (bui = 1) or did not rate (bui = 0) an item, regardless of its value [Koren 2008]. A factor model, similar to what is done for standard rating prediction, is built using the matrix B as training data, to compute predictions for the entries in B that are 0 [Koren and Bell 2011]. In this case predictions are numbers between 0 and 1. The larger the predicted value for the entry bui, the larger the predicted probability that the user u has consumed item i, and hence may be able to rate it. Finally, the score for an item i ∈ Uu is the predicted value for bui. Hence, by selecting the top N items with the highest score this strategy tries to select the items that the user has most likely experienced, in order to maximize the likelihood that the user can provide the requested rating. In that sense it is similar to the popularity strategy, but it tries to make a better prediction of what items the user can rate by exploiting the knowledge of the items the user has rated in the past. Note that the better the predictions bui for the items in Uu, the larger the number of ratings that this strategy can acquire.

—Highest Predicted. A rating prediction r̂ui, based on the ratings in K, is computed for all the items i ∈ Uu, and the score for i is this predicted value r̂ui. Then, the top N items according to this score are selected. The idea is that the items with the highest predicted ratings are supposed to be the items that the user likes the most. Hence, it could also be more likely that the user has experienced these items. Moreover, her ratings could also reveal important information on what the user likes. We also note that this is the default strategy for RSs, that is, enabling the user to rate the recommendations.

—Lowest Predicted. This uses the opposite heuristic compared to highest predicted: for all the items i ∈ Uu the prediction r̂ui is computed, but the score for i is Maxr − r̂ui, where Maxr is the maximum rating value (e.g., 5). This ensures that the items with the lowest predicted ratings will get the highest score and therefore will be selected for elicitation. Lowest predicted items are likely to reveal what the user dislikes, but are likely to elicit few ratings, since users tend not to rate items that they do not like, as reflected by the distributions of the ratings voluntarily provided by the users [Marlin et al. 2011].

—Highest and Lowest Predicted. For all the items i ∈ Uu a prediction r̂ui is computed. The score for an item is |(Maxr + Minr)/2 − r̂ui|, where Minr is the minimum rating value (e.g., 1). This score is simply the distance of the predicted rating of i from the midpoint of the rating scale. Hence, this strategy selects the items with extreme predicted ratings, that is, the items that the user either hates or loves.

—Random. The score for an item i ∈ Uu is a random integer from 1 to 5. Hence the top N items in Uu according to this score are simply randomly chosen. This is a baseline strategy, used for comparison.

—Variance. The score for an item i ∈ Uu is equal to the variance of its ratings in the dataset K. Hence this strategy selects the items in Uu that have been rated in the most diverse way by the users. This is representative of the strategies that try to collect more useful ratings, assuming that the opinions of the user on items with more diverse ratings are more useful for the generation of correct recommendations.

—Voting. The score for an item i is the number of votes given by a committee of strategies including popularity, variance, entropy, highest-lowest predicted, binary prediction, and random. Each of these strategies produces its top 10 candidates for rating elicitation, and then the items appearing most often in these lists are selected. This strategy depends on the selected voting strategies. We have also included the random strategy so as to impose an exploratory behavior.
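
The sketch referenced above illustrates a few of these scoring rules on a known-rating matrix K stored with np.nan for null entries; it is a simplified illustration (in particular, the rating predictor used by the highest/lowest predicted strategies is abstracted behind a predict callable), not the exact implementation used in our experiments.

```python
import numpy as np

def popularity_score(K, i):
    """Number of known (not-null) ratings for item i."""
    return int(np.sum(~np.isnan(K[:, i])))

def entropy_score(K, i, values=(1, 2, 3, 4, 5)):
    """Entropy of the known rating distribution of item i."""
    col = K[:, i]
    col = col[~np.isnan(col)]
    if len(col) == 0:
        return 0.0
    p = np.array([np.mean(col == v) for v in values])
    p = p[p > 0]
    return float(-np.sum(p * np.log2(p)))

def log_pop_entropy_score(K, i):
    """log(popularity) * entropy: favor popular items with diverse ratings."""
    pop = popularity_score(K, i)
    return np.log(pop) * entropy_score(K, i) if pop > 1 else 0.0

def variance_score(K, i):
    """Variance of the known ratings of item i."""
    col = K[:, i]
    col = col[~np.isnan(col)]
    return float(np.var(col)) if len(col) > 0 else 0.0

def highest_lowest_score(predict, u, i, max_r=5, min_r=1):
    """Distance of the predicted rating from the midpoint of the rating scale;
    `predict(u, i)` stands for any rating predictor (e.g., the factor model)."""
    return abs((max_r + min_r) / 2.0 - predict(u, i))
```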


Finally, we would like to note that we have also evaluated other strategies: entropy and log(pop) ∗ variance. But, since their observed behavior is very similar to that of some of the previously mentioned strategies, we did not include them.

3.2. Partially Randomized Strategies

A pure strategy may not be able to return the requested number of items. For instance, there are cases where no rating predictions can be computed by the RS for the user u. This happens, for instance, when u is a new user and none of her ratings is known. In this situation the highest predicted strategy is not able to score any of the items, whereas the randomized version of the strategy can generate purely random items for the user to rate.

A partially randomized strategy modifies the list of items returned by a pure strategy by introducing some random items. As we mentioned in the Introduction, the partially randomized strategies have been introduced to cope with some problems of the pure strategies (see Section 5). More precisely, the randomized version Ran of the strategy S with randomness p ∈ [0, 1] is a function Ran(S(u, N, K, Uu), p) returning a new list of items L′ computed as follows.

(1) L = S(u, N, K, Uu) is obtained.
(2) If L is an empty list, that is, the strategy S for some reason could not generate the elicitation list, then L′ is computed by taking N random items from Uu.
(3) If |L| < N, then L′ = L ∪ {i1, . . . , iN−|L|}, where each ij is a random item in Uu.
(4) If |L| = N, then L′ = {l1, . . . , lM, iM+1, . . . , iN}, where each lj is a random item in L, M = N ∗ (1 − p), and each ij is a random item in Uu.
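
A minimal sketch of this randomization wrapper is given below; it assumes a pure strategy is supplied as a callable returning a ranked item list (e.g., the select method of the hypothetical interface sketched in Section 3.1) and rounds M = N(1 − p) to an integer, a detail the definition above leaves implicit.

```python
import random

def randomized(strategy, u, N, K, Uu, p, rng=random):
    """Wrap a pure strategy: keep roughly N*(1-p) of its picks and fill the
    rest of the request list with random items from Uu."""
    L = strategy(u, N, K, Uu)
    Uu = list(Uu)
    if not L:                              # case (2): strategy returned nothing
        return rng.sample(Uu, min(N, len(Uu)))
    if len(L) < N:                         # case (3): pad with random items
        extra = [i for i in Uu if i not in L]
        return L + rng.sample(extra, min(N - len(L), len(extra)))
    M = int(round(N * (1 - p)))            # case (4): mix strategy picks and random items
    kept = rng.sample(L, M)
    extra = [i for i in Uu if i not in kept]
    return kept + rng.sample(extra, min(N - M, len(extra)))
```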

4. EVALUATION APPROACH

In order to study the effects of the considered elicitation strategies we set up the following simulation procedure. The goal is to simulate the influence of the elicitation strategies on the evolution of an RS's performance. To achieve this, we partition all the available (not-null) ratings in R into three different matrices with the same number of rows and columns as R.

—K contains the ratings that are considered to be known by the system at a certain point in time.

—X contains the ratings that are considered to be known by the users but not by the system. These ratings are incrementally elicited, that is, they are transferred into K if the system asks the (simulated) users for them.

—T contains a portion of the ratings that are known by the users but are withheld from X for evaluating the elicitation strategies, that is, to estimate the evaluation measures (defined later).

In Figure 1(b) we illustrate graphically how the partition of the available ratings in a dataset could look. As defined in the previous section, Uu is the set of items whose ratings are not known to the system and may therefore be selected by the elicitation strategies. That means that kui has a null value and the system has not yet asked u for it. In Figure 1(b) these are the items that, for a certain user, have ratings that are not marked with grey boxes. In this setting, a request to rate an item identified by a strategy S may end up either with a new (not-null) rating kui inserted in K, if the user has experienced the item i, that is, if xui is not null, or with no action, if xui has a null value in the matrix X. The first case corresponds to the situation where the items are marked with a black box for user u in Figure 1(b). In any case, the system will remove the item i from Uu so as to avoid asking the user to rate the same item again.


Fig. 1. Comparison of the ratings data configurations used for evaluating user-centered and system-centered active learning strategies.

We will discuss later how the simulation is initialized, namely, how the matrices K, X, and T are built from the full rating dataset R. In any case, these three matrices partition the full dataset R: if rui has a not-null value, then that value is assigned to exactly one of kui, xui, or tui, that is, only one of the three entries is not null.

The test of a strategy S proceeds in the following way.

(1) The not-null ratings in R are partitioned into the three matrices K, X, and T.
(2) MAE, Precision, and NDCG are measured on T, training the rating prediction model on K.
(3) For each user u:
    (a) only the first time that this step is executed, Uu, the unclear set of user u, is initialized to all the items i with a null value kui in K;
    (b) using strategy S (pure or randomized) a set of items L = S(u, N, K, Uu) is computed;
    (c) the set Le, containing only the items in L that have a not-null rating in X, is created;
    (d) the ratings for the items in Le, as found in X, are assigned to the corresponding entries in K;
    (e) the items in L are removed from Uu (Uu = Uu \ L) and from X.
(4) MAE, Precision, and NDCG are measured on T, and the prediction model is retrained on the new set of ratings contained in K.
(5) Steps 3 and 4 (one iteration) are repeated I times.
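
A condensed sketch of this simulation loop is shown below; it assumes the elicitation strategy is a callable as sketched in Section 3, represents K, X, and T as dictionaries from (user, item) pairs to ratings, stands in `train` and `predict` for the factor-model code, and tracks only MAE for brevity.

```python
import numpy as np

def simulate(strategy, K, X, T, users, items, N=10, iterations=170,
             train=None, predict=None):
    """Iteratively elicit ratings with `strategy`, move them from X to K,
    retrain the predictor, and track system-wide MAE on the fixed test set T."""
    unclear = {u: {i for i in items if (u, i) not in K} for u in users}
    history = []
    for _ in range(iterations):
        for u in users:
            L = strategy(u, N, K, unclear[u])      # items to ask user u about
            for i in L:
                if (u, i) in X:                    # the user has experienced the item
                    K[(u, i)] = X.pop((u, i))      # its rating becomes known to the system
                unclear[u].discard(i)              # never ask about this item again
        model = train(K)                           # retrain on the enlarged K
        errors = [abs(r - predict(model, u, i)) for (u, i), r in T.items()]
        history.append(float(np.mean(errors)))     # MAE after this iteration
    return history
```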

It is important to note here the peculiarity of this evaluation strategy that hasbeen mentioned already in Section 2. Traditionally, the evaluation of active learningstrategies has been user centered; that is, the usefulness of the elicited rating wasjudged based on the improvement in the user’s prediction error. This is illustrated inFigure 1(a). In this scenario the system is supposed to have a large number of ratingsfrom several users, and focusing on a new user (the first one in Figure 1(a)) it firstelicits ratings from this new user that are in X, and then system predictions for thisuser are evaluated on the test set T . Hence these traditional evaluations focussed onthe new-user problem and measured how the ratings elicited from a new user mayhelp the system to generate good recommendations for this particular user. We notethat eliciting a rating from a user may improve not only the rating predictions forthat user, but also the predictions for the other users, which is what we are evaluatingin our experiments and it is graphically illustrated in Figure 1(b). To illustrate this


To illustrate this point, let us consider an extreme example in which a new item is added to the system. The traditional user-centered AL strategy, when trying to identify the items that a particular user u should rate, may ignore obtaining his rating for that new item. In fact, this item has not been rated by any other user and therefore its ratings cannot contribute to improving the rating predictions for u. However, the rating of u for the new item would allow the system to bootstrap the predictions for the rest of the users, and hence from the system's perspective the elicited rating is indeed very informative.

The MovieLens [Miller et al. 2003] and Netflix rating databases were used for our experiments. Movielens consists of 100,000 ratings from 943 users on 1682 movies. From the full Netflix dataset, which contains 1,000,000 ratings, we extracted the first 100,000 ratings that were entered into the system. They come from 1491 users on 2380 items, so this sample of Netflix data is 2.24 times sparser than the Movielens data.

We also performed some experiments with the larger versions of both the Movielens and Netflix datasets (1,000,000 ratings) and obtained very similar results [Elahi et al. 2011a]. However, using the full set of Netflix data required much longer times to perform our experiments, since we train and test a rating prediction model at each iteration, that is, every time new ratings elicited from the simulated users are added to K. After having observed a very similar performance in some initial experiments, we focused on the smaller datasets in order to be able to run more experiments.

When deciding how to split the available data into the three matrices K, X, and T, an obvious choice is to follow the actual time evolution of the dataset, that is, to insert in K the first ratings acquired by the system, then to use a second temporal segment to populate X, and finally to use the remaining ratings for T. An approach that follows this idea is detailed in Section 7.

However, it is not sufficient to test the performance of the proposed strategies for a particular evolution of the rating dataset. Since we want to study the evolution of a rating dataset under the application of a new strategy, we cannot test it only against the temporal distribution of the data that was generated by a particular (unknown) previously used elicitation strategy. Hence we first followed the approach also used in Harpale and Yang [2008] to randomly split the rating data, but unlike Harpale and Yang [2008] we generated several random splits of the ratings into K, X, and T. This allows us to generate rating configurations where there are users and items that initially have no ratings in the known dataset K. We believe that this approach provided us with a very realistic experimental setup, letting us address both the new-user and the new-item problems [Ricci et al. 2011a].

Finally, for both datasets the experiments were conducted by partitioning (randomly) the 100,000 not-null ratings of R in the following manner: 2,000 ratings in K (i.e., very limited knowledge at the beginning), 68,000 ratings in X, and 30,000 ratings in T. Moreover, |L| = 10, which means that at each iteration the system asks a user for his opinion on at most 10 items. The number of iterations was set to I = 170, since after that stage almost all the ratings have been acquired and the system performance no longer changes. Moreover, the number of factors in the SVD prediction model was set to 16, which enabled the system to obtain a very good prediction accuracy, not very different from configurations using hundreds of factors, as shown in Koren and Bell [2011]. Note that, since the factor model is trained at each iteration and for each strategy, learning the factor model is the major computational bottleneck of the conducted experiments; for this reason we did not use a very large number of factors. Moreover, in these experiments we wanted to compare the system performance under the application of several strategies; hence the key measure is the relative performance of the system rather than its absolute value. All the experiments were performed five times, and the results presented in the following sections were obtained by averaging these five repetitions.
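For illustration, the random partitioning described above can be sketched as follows (our own code, not the authors'; the ratings are assumed to be a list of (user, item, rating) triples).

```python
import random

def split_ratings(ratings, n_known=2000, n_test=30000, seed=0):
    """Randomly partition (user, item, rating) triples into the K, X, and T sets."""
    shuffled = list(ratings)                       # leave the caller's list untouched
    random.Random(seed).shuffle(shuffled)
    K = shuffled[:n_known]                         # initial, very limited system knowledge
    T = shuffled[n_known:n_known + n_test]         # held-out test ratings
    X = shuffled[n_known + n_test:]                # ratings users can supply when asked
    return K, X, T
```

With the 100,000-rating samples used here, this yields |K| = 2,000, |T| = 30,000, and |X| = 68,000; repeating the split with different seeds gives the five repetitions mentioned above.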


We considered three evaluation measures: Mean Absolute Error (MAE), Precision, and Normalized Discounted Cumulative Gain (NDCG) [Herlocker et al. 2004; Shani and Gunawardana 2010; Manning 2008]. For computing precision we extracted, for each user, the top 10 recommended items (whose ratings also appear in T) and considered as relevant the items with true ratings equal to 4 or 5.
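A minimal sketch of this precision computation (ours; `predicted` and `test_ratings` are assumed to be dictionaries mapping one user's item ids to predicted and true ratings):

```python
def precision_at_10(predicted, test_ratings):
    """Precision over the top-10 predicted items whose ratings also appear in T.
    An item counts as relevant if its true rating is 4 or 5."""
    candidates = [i for i in predicted if i in test_ratings]   # items with a rating in T
    top10 = sorted(candidates, key=lambda i: predicted[i], reverse=True)[:10]
    if not top10:
        return 0.0
    relevant = sum(1 for i in top10 if test_ratings[i] >= 4)
    return relevant / len(top10)
```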

Discounted Cumulative Gain (DCG) is a measure originally used to evaluate the effectiveness of information retrieval systems [Jarvelin and Kekalainen 2002], and is also used for evaluating collaborative filtering RSs [Weimer et al. 2008; Liu and Yang 2008]. In RSs the relevance is measured by the rating value of the item in the predicted recommendation list. Assume that the recommendations for u are sorted according to the predicted rating values; then DCG_u is defined as

DCG_u = \sum_{i=1}^{N} \frac{r_{iu}}{\log_2(i + 1)},    (1)

where r_{iu} is the true rating (as found in T) for the item ranked in position i for user u, and N is the length of the recommendation list. The normalized discounted cumulative gain for user u is then calculated as

NDCG_u = \frac{DCG_u}{IDCG_u},    (2)

where IDCG_u stands for the maximum possible value of DCG_u, which would be obtained if the recommended items were ordered by decreasing value of their true ratings. We also measured the overall average normalized discounted cumulative gain NDCG by averaging NDCG_u over the full population of users.
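Equations (1) and (2) translate directly into a short helper; the sketch below is ours and assumes the same dictionary representation as in the previous snippet.

```python
import math

def ndcg_at_n(predicted, test_ratings, n=10):
    """NDCG of the top-n items ranked by predicted rating, with true ratings from T as gains."""
    items = [i for i in predicted if i in test_ratings]
    ranked = sorted(items, key=lambda i: predicted[i], reverse=True)[:n]
    if not ranked:
        return 0.0
    # DCG of the predicted ranking: position i contributes r / log2(i + 1), with i starting at 1
    dcg = sum(test_ratings[i] / math.log2(pos + 2) for pos, i in enumerate(ranked))
    # IDCG: the same items reordered by decreasing true rating
    ideal = sorted((test_ratings[i] for i in ranked), reverse=True)
    idcg = sum(r / math.log2(pos + 2) for pos, r in enumerate(ideal))
    return dcg / idcg if idcg > 0 else 0.0
```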

5. EVALUATION OF THE PURE STRATEGIES

In this section we present the results of a first set of experiments in which the pure strategies have been evaluated. We first illustrate how the system MAE is changing as the system is acquiring new ratings with the proposed strategies. Then we show how the NDCG and the system precision are affected by the considered rating acquisition strategies.

5.1. Mean Absolute Error

The MAE computed on the test matrix T at the successive iterations of the application of the elicitation strategies (for all the users) is depicted in Figure 2. First of all, we can observe that the considered strategies have a similar behavior on both datasets (Netflix and MovieLens). Moreover, there are two clearly distinct groups of strategies.

(1) Monotone error-decreasing strategies are lowest-highest predicted, lowest predicted, voting, and random.

(2) Nonmonotone error-decreasing strategies are binary predicted, highest predicted, popularity, log(popularity)*entropy, and variance.

Strategies of the first group show an overall better performance (MAE) in the middle stage, but not at the beginning and the end stages. At the very beginning, that is, during iterations 1–3, the best performing strategy is binary predicted. Then, during iterations 4–11, the voting strategy obtains the lowest MAE on the Movielens dataset. Then the random strategy becomes the best, and it is overtaken by the lowest-highest-predicted strategy only at iteration 46. The results on the Netflix dataset differ as follows. Binary predicted is the best strategy for a longer period, that is, from the beginning until iteration 7, and then voting outperforms it until iteration 46, where lowest-highest predicted starts exhibiting the lowest error.


Fig. 2. System MAE evolution under the effect of the pure rating elicitation strategies.

At iteration 80, the MAE stops changing for all of the prediction-based strategies. This occurs because the known set K at that point has already reached the largest possible size for those strategies, that is, all the ratings in X that can be elicited by these strategies have been transferred to K. Conversely, the MAE of the voting and random strategies keeps decreasing until all of the ratings in X have been moved to K. It is important to note that the prediction-based strategies (e.g., highest predicted) cannot elicit ratings for items for which a prediction cannot be made, for example, if a movie has no ratings in K.


Table I. The Distribution of the Rating Values for the Ratings Elicited by the Highest Predicted Strategy at Two Stages of its Application

                     Percentage of elicited ratings of value r
Iterations        r = 1     r = 2     r = 3     r = 4     r = 5
from 1 to 5       2.06%     4.48%    16.98%    36.56%    39.90%
from 35 to 39     6.01%    13.04%    29.33%    34.06%    17.53%

The behavior of the nonmonotone strategies can be divided into three stages. First, they all decrease the MAE at the beginning (approximately iterations 1–5). Second, they slowly increase it, up to a point where the MAE reaches a peak (approximately iterations 6–35). Third, they slowly decrease the MAE until the end of the experiment (approximately iterations 36–80). This behavior occurs because the strategies in the second group have a strong selection bias with regard to the properties of the items, which may negatively affect MAE. For instance, the highest predicted strategy in the initial iterations (1–5) elicits primarily items with high ratings; however, this behavior does not persist, as can be seen from the rating distribution for iterations 35–39 (Table I). As a result, in the beginning stages this strategy adds to the known matrix (K) disproportionately more high ratings than low ones, and this ultimately biases the rating prediction towards overestimating the ratings.

Low-rated movies are selected for elicitation by the highest predicted strategy in two cases: (1) when a low-rated item is predicted to have a high rating, and (2) when all the highest predicted ratings have already been elicited or marked as "not available" (they are not present in X and have been removed from Uu). Looking into the data we discovered that at iteration 36 the highest-predicted strategy has already elicited most of the highest ratings. Hence, the ratings elicited next are actually average or low ratings, which reduces the bias in K and also the prediction error. The random and lowest-highest predicted strategies do not introduce such a bias, and this results in a constant decrease of MAE.

5.2. Number of Acquired Ratings

In addition to measuring the quality of the elicited ratings (as described in the previous section), it is also important to measure the number of elicited ratings. In fact, certain strategies can acquire more ratings by better estimating what items the user has actually experienced and is therefore able to rate. We simulate the limited knowledge of users by making available only the ratings in the matrix X. Conversely, while a strategy may not be able to acquire many ratings, those actually acquired can be very useful for improving recommendations.

Figure 3 shows the number of ratings in K that are known to the system as the strategies elicit new ratings from the simulated users. It is worth noting, even in this case, the strong similarity of the behavior of the elicitation strategies on both datasets. The only strategy that behaves substantially differently in the two datasets is random. This is clearly caused by the larger number of users and items that are present in the Netflix data. In fact, while both datasets contain 100,000 ratings, the sparsity of the Netflix sample is much higher: it contains only 2.8% of the possible ratings (1491 * 2380) versus the 6.3% of the possible ratings (943 * 1682) contained in the Movielens dataset. This larger sparsity makes it more difficult for a pure random strategy to select items that are known to the user. In general this is a major limitation of any random strategy, that is, a very slow rate of addition of new ratings. Hence for relatively small problems (with regard to the number of items and users) the random strategy may be applicable, but for larger problems it is rather impractical.


Fig. 3. Evolution of the number of ratings elicited by the AL strategies.

In fact, observing Figure 3, one can see that in the Movielens simulations, after 70 iterations, in which 70 ∗ 10 ∗ 943 = 660,100 rating requests were made (iterations * number-of-rating-requests * users), the system has acquired on average only 28,000 new ratings (the system was initialized with 2,000 ratings, hence bringing the total number of ratings to 30,000). This means that only for about one out of 23 random rating requests is the user able to provide a rating.


In the Netflix dataset this ratio is even worse. It is interesting to note that even the popularity strategy has a poor performance in terms of the number of elicited ratings; it elicited the first 28,000 ratings at a rate of one rating for every 6.7 rating requests. We also observe that, quite surprisingly, according to our results the higher sparsity of the Netflix sample has produced a substantially different impact only on the random strategy.

It is also clear that certain strategies are not able to acquire all the ratings in X. For instance, lowest-highest predicted, lowest predicted, and highest predicted stop acquiring new ratings once they have collected 50,000 ratings (Movielens). This is due to the fact that these strategies, in order to acquire from a user her ratings for some items, need the recommender system to generate rating predictions for those items. This is not possible when those items have no ratings anywhere in the known dataset K, and hence matrix factorization cannot derive any rating predictions for them.

Figure 4 illustrates a related aspect, namely to what degree the acquired ratings are useful for the effectiveness of the system, that is, how the same number of ratings acquired by different strategies can reduce MAE. From Figure 4 it is clear that in the first stage of the process, that is, when a small number of ratings are present in the known matrix K, the random and lowest-predicted strategies collect ratings that are more effective in reducing MAE. Subsequently, the lowest-highest-predicted strategy acquires more useful ratings. This is an interesting result, showing that the items with the lowest predicted ratings and random items provide more useful information, even though these ratings are difficult to acquire.

5.3. Normalized Discounted Cumulative Gain

In this section we analyze the results of the experiments with regard to the NDCG metric. As discussed in Section 4, in order to compute NDCG for a particular user, first the ratings for the items in the recommendation list are predicted. Then, the Normalized Discounted Cumulative Gain (NDCG) is computed by dividing the DCG of the ranked list of the recommendations by the DCG obtained by the best ranking of the same items for that user. NDCG is computed on the top 10 recommendations for every user. Moreover, recommendation lists are created only with items that have ratings in the testing dataset. This is necessary in order to compute DCG. We note that sometimes the testing set contains fewer than 10 items for some users. In this case NDCG is computed on this smaller set.

Moreover, when computing NDCG, in some cases the rating prediction algorithm (matrix factorization) cannot generate rating predictions for all 10 items that are in the test set of a user. This happens when the user's ratings in the test set T have no corresponding ratings anywhere in the known dataset K, and hence matrix factorization cannot derive any rating predictions for them. It is important to notice that the ideal recommendation lists for a user are rather stable during the experiments that use the same dataset. Therefore, if an algorithm is not able to generate a predicted recommendation list of size 10, a list of whatever size is available is used, which results in smaller NDCG values.

Figure 5 depicts the NDCG curves for the pure strategies. A higher NDCG value corresponds to higher-rated items being present in the predicted recommendation lists. Popularity is the best strategy at the beginning of the experiment. But at iteration 3 in the Movielens dataset, and at iteration 9 in the Netflix dataset, the voting strategy overtakes the popularity strategy and then remains the best one. In Movielens the random strategy overtakes the voting strategy at iteration 70, but this is not observed in the Netflix data.


Fig. 4. System MAE evolution versus the number of ratings elicited.

Excluding the voting and random strategies, popularity, log(popularity)*entropy, and variance are the best in both datasets. Lowest predicted is by far the worst, and this is quite surprising considering how effective it is in reducing MAE. By further analyzing the experiment data we discovered that the lowest-predicted strategy is not effective for NDCG because it elicits more ratings for the lowest-ranked items, which are useless for predicting the ranking of the top items.


Fig. 5. System NDCG evolution under the application of the pure rating elicitation strategies.

Another striking difference from the MAE experiments is that all the strategies improve NDCG monotonically. It is also important to note that here the random strategy is by far the best. This is again different from its behavior in the MAE experiments.


Fig. 6. System precision under the application of the pure rating elicitation strategies (Movielens).

5.4. Precision

As we have already observed with regard to MAE and NDCG, very similar results were observed for the Netflix and Movielens datasets in the initial experiments. For this reason, in the rest of this article we use just the Movielens dataset. Precision, as described in Section 4, measures the proportion of items rated 4 or 5 that are found in the recommendation list. Figure 6 depicts the evolution of the system precision when the elicitation strategies are applied. Here, highest predicted is the best performing strategy for the largest part of the test. Starting from iteration 50 it is as good as the binary-predicted and the lowest-highest-predicted strategies. It is also interesting to note that all the strategies monotonically increase the precision. Moreover, the random strategy, differently from NDCG, does not perform so well when compared with the highest-predicted strategy. This is again related to the fact that the random strategy substantially increases the coverage by introducing new users. But for new users the precision is significantly smaller, as the system does not have enough ratings to produce good predictions.

In summary, from these experiments one can conclude that among the evaluated strategies there is no single best strategy that dominates the others for all the evaluation measures. The random and voting strategies are the best for NDCG, whereas for MAE lowest-highest predicted performs quite well, and finally for Precision lowest-highest predicted, highest predicted, and voting work well.

6. EVALUATION OF THE PARTIALLY RANDOMIZED STRATEGIES

Among the pure strategies only the random one is able to elicit ratings for items that have not been evaluated by the users already present in K. Partially randomized strategies address this problem by asking new users to rate random items (see Section 3). In this section we have used partially randomized strategies with p = 0.2, that is, at least 2 of the 10 items that the simulated users are requested to rate are chosen at random.
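A partially randomized selection with p = 0.2 can be sketched as follows (an illustration of ours, not the authors' code; `pure_ranking` stands for the ranked list produced by the underlying pure strategy over the candidate items):

```python
import random

def partially_randomized(pure_ranking, candidate_items, n=10, p=0.2, seed=None):
    """Select n items to ask about: roughly n*p at random, the rest from the pure ranking."""
    rng = random.Random(seed)
    n_random = int(round(n * p))                    # e.g., 2 random items out of 10
    random_part = rng.sample(list(candidate_items), min(n_random, len(candidate_items)))
    rest = [i for i in pure_ranking if i not in random_part]
    return random_part + rest[:n - len(random_part)]
```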


Fig. 7. System MAE evolution under the application of the partially randomized strategies (Movielens).

Figure 7 depicts the system MAE evolution during the experimental process. We note that here all the curves are monotone, that is, it is sufficient to add just a small portion of randomly selected ratings to the elicitation lists to reduce the bias of the pure, prediction-based strategies.

It should be mentioned that we have not evaluated the partially randomized voting strategy because it already includes the random strategy as one of the voting strategies. The best performing partially randomized strategies, with respect to MAE, are, at the beginning of the process, the partially randomized binary predicted, and subsequently the low-high predicted (similarly to the pure strategies case).

Figure 8 shows the NDCG evolution under the effect of the partially randomized strategies. During iterations 1–6, the partially randomized popularity strategy obtains the best NDCG. During iterations 7–170, namely, for the largest part of the test, the best strategy is the partially randomized highest predicted. Again, as we observed for the pure strategy version, the worst is the lowest predicted. It is important to note that the strategies that show good performance at the beginning (the partially randomized highest-predicted and binary-predicted strategies) are those aimed at finding items that a user may know and therefore is able to rate. Hence, these strategies are very effective in the early stage, when there are many users with very few items in the known dataset K.

Figure 9 shows the precision of the partially randomized strategies. The partially randomized highest-predicted strategy again shows the best results during most of the test, as for NDCG. During iterations 1–6 the best strategy with respect to precision is the partially randomized binary-predicted strategy, but then the classical approach of requesting the user to rate the items that the system considers the best recommendations (highest predicted) is the winner. During iterations 111–170 the partially randomized variance, popularity, log(popularity)*entropy, highest-predicted, and binary-predicted strategies have very similar precision values. Similarly to NDCG, the worst strategy is the lowest predicted, that is, eliciting ratings for the items that the user dislikes does little to improve the recommender's precision. Interestingly, this is not the case if the goal is to improve MAE.


Fig. 8. System NDCG evolution under the application of the partially randomized strategies (Movielens).

Fig. 9. System precision under the application of the randomized strategies (Movielens).

7. COMBINING ACTIVE LEARNING AND NATURAL ACQUISITION OF RATINGS

For these experiments, we designed a procedure to simulate the evolution of an RS's performance by mixing the usage of active learning strategies with the natural acquisition of ratings. We are interested in observing the temporal evolution of the quality of the recommendations generated by the system when, in addition to exploiting an active learning strategy for requesting the user to rate some items, the users are able to voluntarily add ratings without being explicitly requested to do so, just as happens in actual settings.


To accomplish this goal, we have used the larger version of the Movielens dataset (1,000,000 ratings), for which we considered only the ratings of users that were active and rated movies for at least 8 weeks (2 months). The resulting dataset consists of 377,302 ratings from 1,236 users on 3,574 movies. The ratings are timestamped with values ranging from 25/04/2000 to 28/02/2003. We measure the performance of the recommendation algorithm on a test set as more and more ratings are added to the known set K, while the simulated time advances from 25/04/2000 to 28/02/2003. We combined this natural acquisition of the ratings with active learning as described next.

We split the available data into three matrices K, X, and T, as we did previously, but now we also consider the timestamps of the ratings. Hence, we initially insert in K the ratings acquired by Movielens in the first week (3,705 ratings). Then we split randomly the remaining ratings to obtain 70% of the ratings in X (261,730) and 30% in T (111,867).

For these new experiments, we perform a simulated iteration every week. That is, each simulated day (starting from the second week) an active learning strategy asks each user who already has some nonnull ratings in K, that is, some ratings known by the system at that point in time, to rate 40 items. If these ratings are present in X, they are added to K. This procedure is repeated for 7 days (1 week). Then, all the ratings in the Movielens dataset that according to the timestamps were acquired in that week are also added to K. Finally, the system is trained using the ratings in K. To achieve a realistic setting for evaluating the predictive performance of the RS, we use only the items in T that users actually experienced during the following week (according to the timestamps). This procedure is repeated for I = 48 weeks (approximately 1 year).

In order to justify the large number of rating requests that the system makes each week, it is important to note that the simulated application of an active learning strategy, as we do in our experiments, is able to add many fewer ratings than could be elicited in a real setting. In fact, the number of ratings that the simulated users are supposed to know is limited by the number of ratings that have actually been acquired in the Movielens dataset. In Elahi et al. [2011a] it was estimated that the number of items really known by a user is more than 4 times larger than what is typically observed in the simulations. Hence, many of our elicitation requests would go unfulfilled even though the user in actuality would have been able to rate the item. Therefore, instead of asking for 10 items as is typically done, we ask for 4 times as many items (40 items) to adjust for the discrepancy between the knowledge of the actual and simulated users.

In order to precisely describe the evaluation procedure, we use the following notation, where n is the week index.

—Kn is the set of ratings known by the system at the end of week n. These are the ratings that have been acquired up to week n. They are used to train the prediction model, compute the active learning rating elicitation strategies for week n + 1, and test the system's performance using the ratings contained in the test set of the next week, Tn+1.

—Tn+1 is the set of ratings timestamped during week n + 1 that are used as the test set to measure the system performance after the ratings in the previous weeks have been added to Kn.

—ALn is the set of ratings elicited by a particular elicitation strategy and added to the known set (Kn) at week n. We note that these are ratings that are present in X but not in T. This is required to assure that the active learning strategies do not modify the test set and that the system performance, under the application of the strategies, is consistently tested on the same set of ratings.


—Xn is the set of ratings in X timestamped in week n that are not in the test set Tn. These ratings, together with the ratings in Tn, are all of the ratings acquired in Movielens during week n, and are therefore considered to have been naturally provided by the (simulated) users without being asked by the system (natural acquisition). We note that it may happen that an elicitation strategy has already acquired some of these ratings, that is, the intersection of ALn and Xn may not be empty. In this case, only those not yet actively acquired are added to Kn.

The testing of an active learning strategy S now proceeds in the following way (a code sketch of this weekly loop is given after the list).

—System initialization: week 1
   (1) All the ratings are partitioned randomly into the two matrices X and T.
   (2) The not-null ratings in X1 and T1 are added to K1: K1 = X1 ∪ T1.
   (3) Uu, the unclear set of user u, is initialized to all the items i with a null value kui in K1.
   (4) The rating prediction model is trained on K1, and MAE, Precision, and NDCG are measured on T2.
—For all the weeks n starting from n = 2:
   (1) Initialize Kn with all the ratings in Kn−1.
   (2) For each user u with at least 1 rating in Kn−1:
      —using strategy S a set of items L = S(u, N, Kn−1, Uu) is computed;
      —the set Le is created, containing only the items in L that have a not-null rating in X; the ratings for the items in Le are added to ALn;
      —the items in L are removed from Uu: Uu = Uu \ L.
   (3) Add to Kn the ratings timestamped in week n and those elicited by S: Kn = Kn−1 ∪ ALn ∪ Xn ∪ Tn.
   (4) Train the factor model on Kn.
   (5) Compute MAE, Precision, and NDCG on Tn+1.
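The following sketch (ours, not the authors' code; the factor model, the elicitation strategy, and the metric computation are supplied by the caller as placeholders, and the Uu bookkeeping is omitted for brevity) illustrates this weekly interleaving of active learning and natural acquisition.

```python
def run_weekly_simulation(X_by_week, T_by_week, train_model, select_items, evaluate,
                          n_weeks=48, n_requests=40):
    """Simulate n_weeks of mixed active learning and natural rating acquisition.

    X_by_week / T_by_week: dicts mapping a week index to {(user, item): rating}.
    train_model(K) -> model; select_items(model, user, K, n) -> items to ask the user about;
    evaluate(model, T) -> dict with MAE/Precision/NDCG.
    """
    K = {**X_by_week.get(1, {}), **T_by_week.get(1, {})}            # week 1: bootstrap K1
    X_all = {k: v for week in X_by_week.values() for k, v in week.items()}
    results = []
    for n in range(2, n_weeks + 1):
        model = train_model(K)                                      # trained on K_{n-1}
        AL = {}                                                     # ratings actively elicited this week
        for user in {u for (u, _) in K}:                            # users with >= 1 known rating
            for item in select_items(model, user, K, n_requests):
                if (user, item) in X_all and (user, item) not in K:
                    AL[(user, item)] = X_all[(user, item)]          # the simulated user knows it
        K.update(AL)                                                # AL_n
        K.update(X_by_week.get(n, {}))                              # natural acquisition X_n
        K.update(T_by_week.get(n, {}))                              # T_n joins K after being used
        model = train_model(K)
        results.append(evaluate(model, T_by_week.get(n + 1, {})))   # measured on T_{n+1}
    return results
```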

7.1. Results

Figure 10 shows the MAE time evolution for the different strategies. It should be noted that there is a huge fluctuation of MAE from week to week. This is caused by the fact that every week we train the system on the previous weeks' data and test its performance on the next week's ratings in the test set. Hence, the difficulty of making good predictions may differ from week to week. For this reason, in the figure we focus on a time range: weeks 1–17. In this figure the value at week n is obtained after the system has acquired the ratings for that week, and it is the result of evaluating the system's performance on week n + 1 (see the description of the simulation procedure in the previous section). The natural acquisition curve shows the MAE of the system without using items acquired by the AL strategies, that is, the added ratings are only those that were acquired during that week in the Movielens dataset.

The results show that in the second week the performance of all the strategies is very close. But starting from the third week, popularity and log(popularity)*entropy both perform better than the others. These two strategies share similar characteristics and outperform all the other strategies over the whole rating elicitation process. Voting, variance, and random are the next best strategies in terms of MAE.

In order to better show the results of our experiments, in Figure 11 we plot three strategies that are representative of the others. We have chosen log(popularity)*entropy since it is one of the state-of-the-art strategies, highest predicted since it performs very similarly to the other prediction-based strategies, and voting, which is a novel strategy.

Considering the MAE obtained by the natural acquisition of ratings as a baseline, we can observe that highest predicted does not perform very differently from the baseline.


Fig. 10. System MAE evolution under the simultaneous application of active learning strategies and natural acquisition of ratings (Movielens).

Fig. 11. System MAE evolution under the application of three selected active learning strategies and natural acquisition of ratings (Movielens).

The main reason is that this strategy does not acquire additional ratings besides those already collected by the natural process, that is, the user would rate these items on his own initiative anyway. The other strategies, in addition to these ratings, are capable of eliciting more ratings, including those that the user would rate later on, namely in the successive weeks. We observe that here, differently from the previous experiments, all the strategies show a nonmonotone behavior.


Fig. 12. System MAE evolution under the application of active learning strategies and natural acquisition of ratings (Movielens). MAE values are normalized with respect to the MAE of the system that acquires new ratings only using the natural acquisition. The number of new users entering the system every week is also shown.

But, in this case, it is due to the fact that the test set, every week, is a subset of the ratings entered in Movielens during the following week. The predictive difficulty of this test set can therefore change from week to week, and hence influence the performance of the competing strategies.

In order to examine the results further, we have also plotted in Figure 12 the MAE of the strategies normalized with respect to the MAE of the baseline, that is, the system without ratings obtained by active learning strategies: (MAE_{Strategy} / MAE_{Baseline}) − 1. We also plot in Figure 13 this normalized behavior only for the three selected strategies. This figure shows more clearly the benefit of an active learning strategy in comparison with the natural process. Moreover, in Figure 13 the number of new users entering the system every week is also plotted, so as to understand the effect of new users entering the system on the system performance under the application of the considered strategies. The left y-axis in this figure shows the number of new users in the known set Kn and the right y-axis shows the MAE normalized by the baseline. The gray solid line depicts the number of new users entering the system every week.

Comparing the strategies in Figure 13, we can distinguish two types of strategies: the first type corresponds to the highest-predicted strategy, whose normalized MAE is very close to the baseline. The second type includes the log(popularity)*entropy and voting strategies, which exhibit larger variations in performance and substantially differ from the baseline (excluding week 10). The overall performance of these strategies is better than the performance of the first type. Moreover, observing the number of new users at each week, we can see that the largest numbers of new users enter at weeks 9, 10, and 14. For these weeks the normalized MAE shows the worst performance, with the largest value of MAE at week 10. Hence, the bad news is that in the presence of many new users none of the strategies is effective, and better solutions need to be developed.

Despite the fact that new users are detrimental to the accuracy of the prediction, in the long term more users entering the system result in a better recommender system.


Fig. 13. System MAE evolution under the application of three selected active learning strategies and natural acquisition of ratings (Movielens). MAE values are normalized with respect to the MAE of the system that acquires new ratings only using the natural acquisition. The number of new users entering the system every week is also shown.

Table II. Correlation of MAE with the Number of Users in the Known Set Kn

Strategy                 Correlation Coefficient    p-value
Natural Acquisition      −0.4430                    0.0016
Variance                 −0.5021                    0.0003
Random                   −0.5687                    0.0000
Popularity               −0.5133                    0.0002
Lowest predicted         −0.4933                    0.0004
Low-high predicted       −0.5083                    0.0002
Highest predicted        −0.5153                    0.0002
Binary prediction        −0.5126                    0.0002
Voting                   −0.5215                    0.0001
Log(pop)*entropy         −0.5028                    0.0003

Thus, we have computed the correlation coefficients between the MAE curves of the strategies and the total number of users in the known set Kn. Table II shows these correlations as well as the corresponding p-values. There is a clear negative correlation with the total number of users in the system. This means that the more users enter the system, the lower the MAE becomes.
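Such correlation coefficients and p-values can be computed, for instance, with SciPy; the snippet below is an illustrative sketch of ours (the weekly series are assumed to be equal-length lists), not the computation actually used by the authors.

```python
from scipy.stats import pearsonr

def mae_user_correlation(weekly_mae, weekly_n_users):
    """Pearson correlation between a strategy's weekly MAE and the number of users in Kn."""
    coefficient, p_value = pearsonr(weekly_n_users, weekly_mae)
    return coefficient, p_value

# A negative coefficient with a small p-value indicates that MAE tends to
# decrease as more users enter the system, as reported in Table II.
```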

Another important aspect to consider is the number of ratings that are elicited by the considered strategies in addition to the natural acquisition of ratings. As discussed before, certain strategies can acquire more ratings by better estimating which items are likely to have been experienced by the user. Figure 14 illustrates the size of the known set Kn as the strategies acquire more ratings from the simulated users. As shown in the figure, although the number of ratings added naturally is by far larger than that of any strategy (more than 314,000 ratings in week 48), the considered strategies can still elicit many ratings. Popularity and log(popularity)*entropy are the strategies that add the most ratings, totaling more than 161,000 at the end of the experiment.


Fig. 14. Size evolution of the known set under the application of rating elicitation strategies (Movielens).

Table III. Strategies Performance Summary (✓ = good, ✗ = bad).

On the other hand, voting is the strategy that elicits the smallest number of ratings overall. This can be due to the fact that sometimes most of the combined strategies vote for similar sets of items; the selected items then mostly overlap with the naturally acquired ratings, which results in fewer ratings being added to the known set. However, the remarkably good performance of voting may indicate that this strategy focuses more on the informativeness of the items than on their ratability.

8. CONCLUSIONS AND FUTURE WORK

In this work we have addressed the problem of selecting items to present to users for acquiring their ratings; this is also defined as the rating elicitation problem. We have proposed and evaluated a set of rating elicitation strategies.


Some of them have been proposed in previous work [Rashid et al. 2002] (popularity, log(popularity)*entropy, random, variance), and some, which we define as prediction-based strategies, are new: binary prediction, highest predicted, lowest predicted, and highest-lowest predicted. Moreover, we have studied the behavior of other novel strategies: partially randomized, which adds random items to the elicitation lists computed by the aforementioned strategies, and voting, which requests the user to rate the items that are selected by the largest number of voting strategies. We have evaluated these strategies with regard to their system-wide effectiveness by implementing a simulation loop that models the day-by-day process of rating elicitation and rating database growth. We have taken into account the limited knowledge of the users, which means that the users may not be able to rate all the items that the system proposes to them. During the simulation we have measured several metrics at different phases of the rating database growth. These metrics include MAE, to measure the improvement in prediction accuracy; Precision, to measure the relevance of recommendations; Normalized Discounted Cumulative Gain (NDCG), to measure the quality of the produced ranking; and coverage, to measure the proportion of items over which the system can form predictions.

The evaluation (summarized in Table III) has shown that different strategies can improve different aspects of the recommendation quality and at different stages of the rating database development. Moreover, we have discovered that some pure strategies may incur the risk of increasing the system MAE if they keep adding only ratings with a certain value, for example, the highest ones, as for the highest-predicted strategy, an approach that is often adopted in real RSs. In addition, prediction-based strategies are able to address neither the new-user problem nor the new-item problem. The popularity and variance strategies are able to select items for new users, but cannot select items that have no ratings.

Partially randomized strategies experience fewer problems because they elicit ratings for random items that had no ratings at all. In this case, the lowest-highest (highest) predicted strategy is a good alternative if MAE (precision) is the targeted effectiveness measure. These strategies are easy to implement and, as the experiments have shown, can produce considerable benefits.

Moreover, our results have shown that mixing active learning strategies with the natural acquisition of ratings influences the performance of the strategies. This is an important conclusion, and no previous experiments have addressed and illustrated this issue. In this setting we show that the popularity and log(popularity)*entropy strategies outperform the other strategies. Our proposed voting strategy has shown good performance, in terms of MAE but especially NDCG, both with and without the natural acquisition.

This research identified a number of new problems that definitely need to be studied further. First of all, it is important to note that the results presented in this work clearly depend, as in any experimental study, on the chosen simulation setup, which can only partially reflect the real evolution of a recommender system. In our work we assume that a randomly chosen set of ratings, among those that the user really gave to the system, represents the ratings known by the user but not yet known by the system. However, this set does not completely reflect all the user's knowledge; it contains only the ratings acquired using the specific recommender system. For instance, Movielens used a combined random and popularity technique for rating elicitation. In reality, many more items are known by the user, but his ratings for them are not included in the dataset. This is a common problem of any offline evaluation of a recommender system, where the performance of the recommendation algorithm is estimated on a test set that never coincides with the recommendation set. The recommendation set is composed of the items with the largest predicted ratings; but if such an item is not present in the test set, an offline evaluation will never be able to check whether that prediction is correct.


Moreover, we have already observed that the performance of some strategies (e.g., random and voting) depends on the sparsity of the rating data. The Movielens data and the Netflix sample that we used still have a considerably low sparsity compared to other, larger datasets. For example, if the data sparsity were higher, there would be only a very low probability for the random strategy to select an item that a user has consumed in the past and can provide a rating for. So the partially randomized strategies may perform worse in reality.

Furthermore, there remain many unexplored possibilities for sequentially applying several strategies that use different approaches depending on the state of the system [Elahi 2011]. For instance, one may ask a user to rate popular items when the system does not know any of the user's ratings yet, and use another strategy at a later stage.

REFERENCES

ANDERSON, C. 2006. The Long Tail. Random House Business.
BIRLUTIU, A., GROOT, P., AND HESKES, T. 2012. Efficiently learning the preferences of people. Mach. Learn. 90, 1, 1–28.
BONILLA, E. V., GUO, S., AND SANNER, S. 2010. Gaussian process preference elicitation. In Proceedings of the 24th Annual Conference on Neural Information Processing Systems (NIPS'10). 262–270.
BOUTILIER, C., ZEMEL, R. S., AND MARLIN, B. M. 2003. Active collaborative filtering. In Proceedings of the 19th Conference in Uncertainty in Artificial Intelligence (UAI'03). 98–106.
BRAZIUNAS, D. AND BOUTILIER, C. 2010. Assessing regret-based preference elicitation with the utpref recommendation system. In Proceedings of the 11th ACM Conference on Electronic Commerce (EC'10). 219–228.
BURKE, R. 2010. Evaluating the dynamic properties of recommendation algorithms. In Proceedings of the 4th ACM Conference on Recommender Systems (RecSys'10). ACM Press, New York, 225–228.
CARENINI, G., SMITH, J., AND POOLE, D. 2003. Towards more conversational and collaborative recommender systems. In Proceedings of the International Conference on Intelligent User Interfaces. 12–18.
CHEN, L. AND PU, P. 2012. Critiquing-based recommenders: Survey and emerging trends. User Model. User-Adapt. Interact. 22, 1–2, 125–150.
CREMONESI, P., KOREN, Y., AND TURRIN, R. 2010. Performance of recommender algorithms on top-n recommendation tasks. In Proceedings of the ACM Conference on Recommender Systems (RecSys'10). 39–46.
DESROSIERS, C. AND KARYPIS, G. 2011. A comprehensive survey of neighborhood-based recommendation methods. In Recommender Systems Handbook, F. Ricci, L. Rokach, B. Shapira, and P. B. Kantor, Eds., Springer, 107–144.
ELAHI, M. 2011. Adaptive active learning in recommender systems. In Proceedings of the 19th International Conference on User Modeling, Adaption and Personalization (UMAP'11). 414–417.
ELAHI, M., REPSYS, V., AND RICCI, F. 2011a. Rating elicitation strategies for collaborative filtering. In Proceedings of the 12th International Conference on E-Commerce and Web Technologies (EC-Web'11). 160–171.
ELAHI, M., RICCI, F., AND REPSYS, V. 2011b. System-wide effectiveness of active learning in collaborative filtering. In Proceedings of the International Workshop on Social Web Mining, co-located with the International Joint Conference on Artificial Intelligence. F. Bonchi, W. Buntine, R. Gavald, and S. Gu, Eds. http://users.cecs.anu.edu.au/~sguo/swm2011 submission 4.pdf.
FUNK, S. 2006. Netflix update: Try this at home. http://sifter.org/~simon/journal/20061211.html.
GOLBANDI, N., KOREN, Y., AND LEMPEL, R. 2010. On bootstrapping recommender systems. In Proceedings of the 19th ACM International Conference on Information and Knowledge Management (CIKM'10). ACM Press, New York, 1805–1808.
GOLBANDI, N., KOREN, Y., AND LEMPEL, R. 2011. Adaptive bootstrapping of recommender systems using decision trees. In Proceedings of the 4th International Conference on Web Search and Web Data Mining (WSDM'11). 595–604.
GUO, S. AND SANNER, S. 2010. Multiattribute bayesian preference elicitation with pairwise comparison queries. In Proceedings of the 7th International Conference on Advances in Neural Networks. Springer, 396–403.
HARPALE, A. S. AND YANG, Y. 2008. Personalized active learning for collaborative filtering. In Proceedings of the 31st Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR'08). ACM Press, New York, 91–98.
HERLOCKER, J. L., KONSTAN, J. A., BORCHERS, A., AND RIEDL, J. 1999. An algorithmic framework for performing collaborative filtering. In Proceedings of the 22nd Annual ACM SIGIR International Conference on Research and Development in Information Retrieval (SIGIR'99). ACM Press, New York, 230–237.
HERLOCKER, J. L., KONSTAN, J. A., TERVEEN, L. G., AND RIEDL, J. T. 2004. Evaluating collaborative filtering recommender systems. ACM Trans. Inf. Syst. 22, 1, 5–53.
JANNACH, D., ZANKER, M., FELFERNIG, A., AND FRIEDRICH, G. 2010. Recommender Systems: An Introduction. Cambridge University Press.
JARVELIN, K. AND KEKALAINEN, J. 2002. Cumulated gain-based evaluation of ir techniques. ACM Trans. Inf. Syst. 20, 4, 422–446.
JIN, R. AND SI, L. 2004. A bayesian approach toward active learning for collaborative filtering. In Proceedings of the 20th Conference in Uncertainty in Artificial Intelligence (UAI'04). 278–285.
KOREN, Y. 2008. Factorization meets the neighborhood: A multifaceted collaborative filtering model. In Proceedings of the 14th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD'08). ACM Press, New York, 426–434.
KOREN, Y. AND BELL, R. 2011. Advances in collaborative filtering. In Recommender Systems Handbook, F. Ricci, L. Rokach, B. Shapira, and P. Kantor, Eds., Springer, 145–186.
LIU, N. N., MENG, X., LIU, C., AND YANG, Q. 2011. Wisdom of the better few: Cold start recommendation via representative based rating elicitation. In Proceedings of the ACM Conference on Recommender Systems (RecSys'11). 37–44.
LIU, N. N. AND YANG, Q. 2008. Eigenrank: A ranking-oriented approach to collaborative filtering. In Proceedings of the 31st Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR'08). ACM Press, New York, 83–90.
MANNING, C. 2008. Introduction to Information Retrieval. Cambridge University Press.
MARLIN, B. M., ZEMEL, R. S., ROWEIS, S. T., AND SLANEY, M. 2011. Recommender systems, missing data and statistical model estimation. In Proceedings of the 22nd International Joint Conference on Artificial Intelligence (IJCAI'11). 2686–2691.
MCNEE, S. M., LAM, S. K., KONSTAN, J. A., AND RIEDL, J. 2003. Interfaces for eliciting new user preferences in recommender systems. In Proceedings of the International Conference on User Modeling. 178–187.
MILLER, B. N., ALBERT, I., LAM, S. K., KONSTAN, J. A., AND RIEDL, J. 2003. Movielens unplugged: Experiences with an occasionally connected recommender system. In Proceedings of the 8th International Conference on Intelligent User Interfaces (IUI'03). ACM Press, New York, 263–266.
PU, P. AND CHEN, L. 2008. User-involved preference elicitation for product search and recommender systems. AI Mag. 29, 4, 93–103.
RASHID, A. M., ALBERT, I., COSLEY, D., LAM, S. K., MCNEE, S. M., KONSTAN, J. A., AND RIEDL, J. 2002. Getting to know you: Learning new user preferences in recommender systems. In Proceedings of the International Conference on Intelligent User Interfaces (IUI'02). ACM Press, New York, 127–134.
RASHID, A. M., KARYPIS, G., AND RIEDL, J. 2008. Learning preferences of new users in recommender systems: An information theoretic approach. SIGKDD Explor. Newslett. 10, 90–100.
RESNICK, P. AND SAMI, R. 2007. The influence limiter: Provably manipulation-resistant recommender systems. In Proceedings of the ACM Conference on Recommender Systems (RecSys'07). ACM Press, New York, 25–32.
RESNICK, P. AND VARIAN, H. R. 1997. Recommender systems. Comm. ACM 40, 3, 56–58.
RICCI, F., ROKACH, L., AND SHAPIRA, B. 2011a. Introduction to recommender systems handbook. In Recommender Systems Handbook, F. Ricci, L. Rokach, B. Shapira, and P. Kantor, Eds., Springer, 1–35.
RICCI, F., ROKACH, L., SHAPIRA, B., AND KANTOR, P. B., EDS. 2011b. Recommender Systems Handbook. Springer.
RUBENS, N., KAPLAN, D., AND SUGIYAMA, M. 2011. Active learning in recommender systems. In Recommender Systems Handbook, F. Ricci, L. Rokach, B. Shapira, and P. Kantor, Eds., Springer, 735–767.
SCHEIN, A. I., POPESCUL, A., UNGAR, L. H., AND PENNOCK, D. M. 2002. Methods and metrics for cold-start recommendations. In Proceedings of the 25th Annual ACM SIGIR International Conference on Research and Development in Information Retrieval (SIGIR'02). ACM Press, New York, 253–260.
SHANI, G. AND GUNAWARDANA, A. 2010. Evaluating recommendation systems. In Recommender Systems Handbook, F. Ricci, L. Rokach, and B. Shapira, Eds., Springer, 257–298.
TIMELY DEVELOPMENT, L. 2008. Netflix prize. http://www.timelydevelopment.com/demos/NetflixPrize.aspx.
WEIMER, M., KARATZOGLOU, A., AND SMOLA, A. 2008. Adaptive collaborative filtering. In Proceedings of the ACM Conference on Recommender Systems (RecSys'08). ACM Press, New York, 275–282.
ZHOU, K., YANG, S.-H., AND ZHA, H. 2011. Functional matrix factorizations for cold-start recommendation. In Proceedings of the 34th International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR'11). 315–324.

Received February 2012; revised July 2012; accepted November 2012
