The Impact of Popularity Bias on Fairness and Calibration in Recommendation

Himan Abdollahpouri, University of Colorado Boulder, USA, [email protected]

Masoud Mansoury, Eindhoven University of Technology, Eindhoven, the Netherlands, [email protected]

Robin Burke, University of Colorado Boulder, USA, [email protected]

Bamshad Mobasher, DePaul University, USA, [email protected]

arXiv:1910.05755v3 [cs.IR] 16 Oct 2019

ABSTRACT
Recently there has been a growing interest in fairness-aware recommender systems, including fairness in providing consistent performance across different users or groups of users. A recommender system could be considered unfair if the recommendations do not fairly represent the tastes of a certain group of users while other groups receive recommendations that are consistent with their preferences. In this paper, we use a metric called miscalibration for measuring how responsive a recommendation algorithm is to users' true preferences, and we consider how various algorithms may result in different degrees of miscalibration. A well-known type of bias in recommendation is popularity bias, where a few popular items are over-represented in recommendations while the majority of other items do not get significant exposure. We conjecture that popularity bias is one important factor leading to miscalibration in recommendation. Our experimental results using two real-world datasets show that there is a strong correlation between how different user groups are affected by algorithmic popularity bias and their level of interest in popular items. Moreover, we show that the more a group is affected by the algorithmic popularity bias, the more their recommendations are miscalibrated. Finally, we show that algorithms with greater popularity bias amplification tend to have greater overall miscalibration.

KEYWORDS
Recommender systems; Popularity bias; Long-tail recommendation; Fairness; Calibrated recommendation

1 INTRODUCTION
Recommender systems have been widely used in a variety of different domains such as e-commerce, movies, music, entertainment, and online dating. Their goal is to help users find relevant items which are difficult or otherwise time-consuming to find in the absence of such systems. Recommendations are typically evaluated using measures such as precision, diversity, and novelty. Under such measures, depending on the situation, a recommended list of items may be considered good if it is relevant to the user, is diverse, and also helps the user discover products that s/he would not have been able to discover in the absence of the recommender system.

One of the important considerations in recommendation generation that has emerged recently is fairness. Recommendation fairness may have different meanings depending on the domain in which the recommender system is operating, the characteristics of different users or groups of users (e.g. protected vs. unprotected), and the goals of the system designers. For instance, Ekstrand et al. [12] defined fairness as consistent accuracy across different groups of users. In their experiments, they observed that certain groups, such as women, get lower accuracy results than men.

One of the metrics used to measure recommendation quality is calibration, which measures whether the recommendations delivered to a user are consistent with the spectrum of items the user has previously rated. For example, if a user has rated 70% action movies and 30% romance, the user might expect to see a similar pattern in the recommendations [27]. If this ratio differs from the one in the user's profile, we say the recommendations are miscalibrated. Miscalibration by itself may not be considered unfair, as it could simply mean the recommendations are not personalized enough. However, if different users or groups of users experience different levels of miscalibration in their recommendations, this may indicate an unfair treatment of a group of users. For example, the authors in [28] define several fairness metrics which focus on having consistent performance in terms of estimation error across different user groups.

One well-known limitation of collaborative recommender systems is the problem of popularity bias [7]: popular items are recommended frequently, in some cases even more than their popularity would warrant, while the majority of other items do not get proportional attention. We define algorithmic popularity bias as the tendency of an algorithm to amplify existing popularity differences across items in this way. We measure this amplification through the metric of popularity lift, which quantifies the difference between average item popularity in the input and output of an algorithm. Such popularity bias could be problematic for a variety of reasons: long-tail (non-popular) items are important for generating a fuller understanding of users' preferences [21, 24]. In addition, long-tail recommendation can also be understood as a social good; a market that suffers from popularity bias will lack opportunities to discover more obscure products and will be, by definition, dominated by a few large brands or well-known artists [10]. Such a market will be more homogeneous and offer fewer opportunities for innovation and creativity.

In this paper, we conjecture that popularity bias is an important factor leading to miscalibration of the recommendation lists. We also show that users with different levels of interest in popular items experience different levels of miscalibration, resulting in unfair treatment of different groups of users by the recommendation algorithms. We show that, across multiple algorithms, there is a strong negative association between users' interest in popular items and the popularity lift of their recommendations: users with less interest in popular items are the most affected by popularity bias. In addition, we show that algorithms with higher popularity bias tend to also have higher miscalibration, showing the correlation between popularity bias and miscalibration in recommendation algorithms.

Our contributions are as follows:

• Differential impact of popularity bias: We show that different groups of users are affected differently by popularity bias.

• Connection between popularity bias and miscalibration: We show that algorithms with a higher popularity amplification tend to also have a higher degree of overall miscalibration.

• Association of popularity bias with miscalibration: We show that when the popularity lift is higher for a group, its miscalibration is also higher.

2 RELATED WORK
The problem of popularity bias and the challenges it creates for recommender systems has been well studied by other researchers [6, 8, 23]. The authors of these works have mainly explored the overall accuracy of recommendations in the presence of a long-tail distribution in rating data. Moreover, other researchers have proposed algorithms that can control this bias and give long-tail items a greater chance of being recommended [2, 3, 5, 17].

Moreover, the concept of fairness in recommendation has also been gaining a lot of attention recently [16, 28]: for example, finding solutions that remove algorithmic discrimination against users belonging to certain demographic groups [29], or making sure that items from different categories (e.g. long-tail items or items belonging to different providers) [9, 20] get fair exposure in the recommendations. Our definition of fairness in this paper is aligned with the fairness objectives introduced by Yao and Huang [28], who define unfairness as having inconsistent estimation error across different users. We can generalize the estimation error to be any kind of system performance. For instance, Steck defines fairness as concerning the various interests of a user, with the goal of reflecting them according to their corresponding proportions [27] (i.e. calibration). In this paper, we use the same definition of fairness: a system is unfair if it delivers different degrees of miscalibration to different users.

With regard to the performance of recommender systems for different user groups, Ekstrand et al. [12] showed that some recommendation algorithms give significantly less accurate recommendations to groups of certain ages or genders. In addition, Abdollahpouri et al. [1] discuss the importance of recommendation evaluation with respect to the distribution of utilities given to different stakeholders. For instance, the degree of calibration of the recommendations for each user group (i.e. a stakeholder) could be considered as its utility and, therefore, a balanced distribution of these utility values is important in a fair recommender system.

(a) The inconsistency of the recommender system's performance for users A and B in terms of miscalibration. User B's recommendations are highly consistent (calibrated) with her profile while user A's interests have been distorted in the recommendations.

(b) The inconsistency of the recommender system's performance for users A and B in terms of popularity bias. For user B there is no amplification of popularity bias while user A has been highly affected.

Figure 1: Two different types of inconsistency of the recommender system's performance across different users.

Finally, Jannach et al. [15] compared different recommendation algorithms in terms of accuracy and popularity bias. In that paper, they observed that some algorithms concentrate more on popular items than others. In our work, we are mainly interested in viewing popularity bias from the users' expectations perspective. Our work is the first attempt at connecting popularity bias and the concept of miscalibration.

3 POPULARITY BIAS AND MISCALIBRATION
Popularity bias and miscalibration are both aspects of an algorithm's performance that are computed by comparing the attributes of the input data with the properties of the recommendations that are produced for users. In this section, we define these terms more precisely.

3.1 Miscalibration
One of the interpretations of fairness in recommendation is in terms of whether the recommender provides consistent performance across different users or groups of users. A recommender system could be considered unfair if the recommendations do not fairly represent the tastes of a certain group of users while other groups receive recommendations that are consistent with their preferences. In this paper we use a metric called miscalibration [27] for measuring how responsive a recommendation algorithm is to users' true preferences, and we consider how various algorithms may result in different degrees of miscalibration. As we mentioned earlier, miscalibration, if it exists across all users, could simply mean failure of the algorithm to provide accurate personalization. But when different groups of users experience different levels of miscalibration, this could indicate unfair treatment of certain user groups. From this standpoint, we call a recommender system unfair if it has different levels of miscalibration for different user groups.


Calibration is a general concept in machine learning that has recently experienced a resurgence in the context of fairness of machine learning algorithms. A classification algorithm is called calibrated if the predicted proportions of the various classes agree with the actual proportions of classes in the training data. Extending this notion to recommender systems, a calibrated recommender system is one that reflects the various interests of a user in the recommended list, and with their appropriate proportions. Figure 1a shows two users, A and B, and their profiles. User A has 70% Drama movies and 30% Action movies in her profile. Therefore, one would expect to see the same ratio in her recommendations but, as shown, this proportion has been distorted: she receives 55% drama (less than expected) and 45% action (more than expected). On the other hand, user B has also rated 70% Drama and 30% Action, and her recommendations perfectly match her expectations. This is an unfair scenario in which the recommender system does not perform as well for user A as it does for user B.

For measuring the miscalibration of the recommendations, we use the metric introduced in [27]. Let u be a user and i an item. Suppose that for each item i there is a set of features C describing the item; for example, a song could be pop or jazz, and a movie could have genres such as action, romance, or comedy. We use c for each of these individual categories. We also assume that each user has rated one or more items, showing interest in the features c belonging to those items. We consider two distributions for each user u, one corresponding to the distribution of categories c across all items rated by u, and another representing the distribution of categories c across all items recommended to u:

• p_u(c|u): the distribution over feature c of the set of items Γ rated by user u in the past:

$$p_u(c|u) = \frac{\sum_{i \in \Gamma} w_{u,i} \cdot p(c|i)}{\sum_{i \in \Gamma} w_{u,i}} \qquad (1)$$

where w_{u,i} is the weight of item i, e.g., how recently it was rated by user u. In this paper, we set w_{u,i} = 1 to focus more specifically on differences in feature distribution rather than on temporal aspects of user profiles.

• q_u(c|u): the distribution over feature c of the list of items recommended to user u:

$$q_u(c|u) = \frac{\sum_{i \in \Lambda} w_{r(i)} \cdot p(c|i)}{\sum_{i \in \Lambda} w_{r(i)}} \qquad (2)$$

where Λ is the set of recommended items and w_{r(i)} is the weight of item i due to its rank r(i) in the recommendations. Possible choices include the weighting schemes used in ranking metrics such as Mean Reciprocal Rank (MRR) or normalized Discounted Cumulative Gain (nDCG). As above, we set w_{r(i)} = 1 in order to focus on the feature distribution across the recommendation set and to ensure the q_u values are comparable to the p_u values. A computational sketch of both distributions follows this list.

The degree of dissimilarity between q_u(c|u) and p_u(c|u) is used to compute the miscalibration of the recommendations. There are various established methods for determining whether two finite distributions are similar, such as statistical hypothesis testing with the null hypothesis that the two distributions are the same, as mentioned in [27]. The authors of that paper used Kullback-Leibler (KL) divergence as the miscalibration metric. In our data, there are many profiles with no ratings for some subset of the features, leading to zero values in the p_u distribution; similarly, recommendation lists may concentrate only on certain features, causing zero values in the q_u distribution for some users. KL divergence is undefined where there are no observations. As an alternative, we use the Hellinger distance, H, as suggested in [27] for situations with many zeros. So, we measure the miscalibration for user u, MC_u(p_u, q_u), as follows:

$$MC_u(p_u, q_u) = H(p_u, q_u) = \frac{\lVert \sqrt{p_u} - \sqrt{q_u} \rVert_2}{\sqrt{2}} \qquad (3)$$

By definition, the Hellinger distance is a metric satisfying the triangle inequality. The √2 in the definition ensures that H(p_u, q_u) ≤ 1 for all probability distributions [11]. The overall miscalibration metric MC_G for each group G is obtained by averaging MC_u(p_u, q_u) across all users u in group G. That is:

$$MC_G(p, q) = \frac{\sum_{u \in G} MC_u(p_u, q_u)}{|G|} \qquad (4)$$

MC_G(p, q) is 0 when p and q are similar, and it increases as their difference grows.
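The following sketch implements Equations (3) and (4) directly on the dictionaries produced above; it is illustrative rather than the authors' implementation. The usage example plugs in the 70%/30% profile and 55%/45% recommendation split of user A from Figure 1a.

```python
import math

def miscalibration(p, q):
    """Hellinger distance between two genre distributions (Eq. 3).

    p, q: dicts mapping genre -> probability. Genres missing from one dict are
    treated as probability 0, which is exactly the case where KL divergence
    would be undefined.
    """
    genres = set(p) | set(q)
    sq_diff = sum((math.sqrt(p.get(g, 0.0)) - math.sqrt(q.get(g, 0.0))) ** 2 for g in genres)
    return math.sqrt(sq_diff) / math.sqrt(2.0)

def group_miscalibration(profiles, recommendations):
    """Average per-user miscalibration over a group (Eq. 4).

    profiles, recommendations: dicts mapping user id -> genre distribution.
    """
    users = list(profiles)
    return sum(miscalibration(profiles[u], recommendations[u]) for u in users) / len(users)

# User A from Figure 1a: 70% Drama / 30% Action in the profile,
# 55% Drama / 45% Action in the recommendations -> MC_u is roughly 0.11.
p_A = {"Drama": 0.7, "Action": 0.3}
q_A = {"Drama": 0.55, "Action": 0.45}
print(round(miscalibration(p_A, q_A), 3))
```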

3.1.1 Fairness. Similar to [27], in this paper we define a system to be unfair if it delivers different levels of miscalibration to different user groups. One could define a group of users in many different ways based on features such as gender, age, occupation, or education. Likewise, one could define a group of users based on how similar their interests are. For instance, the authors in [4] categorize users into different groups based on their degree of interest in popular items (i.e. whether users are interested in niche movies versus blockbuster ones). In this paper we use the same grouping methodology, because we are interested in measuring how popularity bias impacts users with different levels of interest in popular items and how it is correlated with the miscalibration of the recommendation lists for different users. Some insights based on grouping by gender are also reported in Section 6.

3.2 Popularity Bias in Rating Data
As noted earlier, recommendation algorithms are known to suffer from popularity bias. That is, a few items are recommended to many users, while the majority of other items do not get the exposure they deserve. This bias could be due to the inherent nature of rating data, which is skewed towards popular items, and also due to the algorithmic amplification of such bias. Figure 1b shows the percentage of rated items for two users A and B. We can see that user A's recommendations are highly affected by popularity bias, while for user B there is no amplification of popularity bias in her recommendations.

In many domains, rating data is skewed towards more popular items: there are a few popular items with the majority of ratings, while the rest of the items have far fewer ratings. Figure 2 shows the long-tail distribution of item popularity in the well-known MovieLens 1M and Yahoo Movies datasets. Similar distributions can be found in other datasets as well. Although it is true that popular items are popular for a reason, algorithmic popularity bias often amplifies this bias to a great extent.

(a) MovieLens (b) Yahoo Movies

Figure 2: The long-tail of item popularity in rating data in the MovieLens and Yahoo Movies datasets.

(a) MovieLens (b) Yahoo Movies

Figure 3: Users' propensity towards popular items in the MovieLens and Yahoo Movies datasets.

Not every user has the same degree of interest in popular items [4, 22]. There may be users who are interested in less popular, niche items, and the recommender algorithm should be able to address the needs of those users as well. Figure 3 shows the average popularity of rated items in different users' profiles for both the MovieLens and Yahoo Movies datasets; users are sorted by the average popularity of the items in their profiles before plotting. On the MovieLens dataset, there are a few users with extreme average item popularity at the far right- and left-hand sides of the plot, while the majority of users fall in the middle of the distribution with an average item popularity between 0.10 and 0.15. In Yahoo Movies, there are more users with low-popularity profiles, but otherwise the distribution is similar. Both of these plots confirm that users have different degrees of interest in popular items.

Due to this common imbalance in the original rating data, algorithms often propagate and, in many cases, amplify the bias by over-recommending the popular items, giving them a higher opportunity of being rated by more users. Repeated recommendation cycles will continue to perpetuate this bias, resulting in a rich-get-richer and poor-get-poorer vicious cycle. However, not every recommendation algorithm has the same amplification power for popularity bias. In the next sections, we will define a metric for measuring the degree to which popularity bias is propagated by the recommendation algorithm. We will empirically evaluate the performance of different algorithms with respect to popularity bias amplification. We will also empirically show the relationship between popularity bias and miscalibration across different algorithms and datasets.

Table 1: Precision@10 for different algorithms on the MovieLens and Yahoo Movies datasets.

Algorithm       MovieLens   Yahoo Movies
ItemKNN         0.223       0.127
UserKNN         0.214       0.13
Most popular    0.182       0.1
SVD++           0.122       0.2
BMF             0.107       0.047

4 METHODOLOGY
We conducted our experiments on two publicly available datasets. The first one is the MovieLens 1M dataset, which contains 1,000,209 anonymous ratings of approximately 3,900 movies made by 6,040 MovieLens users [14]. Each movie in this dataset is associated with at least one genre, with a total of 18 unique genres in the entire dataset. The second dataset we used is Yahoo Movies (https://webscope.sandbox.yahoo.com/catalog.php?datatype=r), which originally contained 211,231 ratings provided by 7,642 users on 11,916 movies. There are many items with only one rating, making this dataset extremely sparse. Therefore, we created a core-10 sample (each user and each movie should have at least 10 ratings). Our final sample of the Yahoo Movies dataset contains 173,676 ratings on 2,131 movies provided by 7,012 users. Analogous to the MovieLens dataset, each movie in the Yahoo Movies dataset is associated with at least one genre, with a total of 24 genres in the entire dataset. For all experiments, we set aside a random selection of 80% of the rating data as the training set and the remaining 20% as the test set. The reason we use these two datasets is that they both have meaningful features associated with items that can represent users' tastes, which is needed for measuring miscalibration. For example, the genre of a movie is indeed a meaningful characteristic that can represent a user's taste in movies.
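As a rough sketch of the core-10 sampling step (not the exact script used for the paper), the filter below repeatedly drops users and items with fewer than 10 ratings until both conditions hold, assuming the ratings sit in a pandas DataFrame with 'user' and 'item' columns.

```python
import pandas as pd

def k_core(ratings: pd.DataFrame, k: int = 10) -> pd.DataFrame:
    """Iteratively keep only users and items with at least k ratings.

    Dropping sparse items can push some users below k (and vice versa),
    so the filter is repeated until the frame stops shrinking.
    """
    while True:
        before = len(ratings)
        user_counts = ratings["user"].value_counts()
        item_counts = ratings["item"].value_counts()
        ratings = ratings[ratings["user"].map(user_counts) >= k]
        ratings = ratings[ratings["item"].map(item_counts) >= k]
        if len(ratings) == before:
            return ratings

# Example (hypothetical file name):
# ratings = pd.read_csv("yahoo_movies.csv")  # columns: user, item, rating
# core10 = k_core(ratings, k=10)
```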

We used several recommendation algorithms from the open-source Java recommendation library LibRec [13], including user-based collaborative filtering (UserKNN) [25], item-based collaborative filtering (ItemKNN) [26], singular value decomposition (SVD++) [18], and biased matrix factorization (BMF) [19], to cover both neighborhood-based and matrix factorization based algorithms. We also included the most-popular method (a non-personalized algorithm recommending the most popular items to every user) as an algorithm with extreme popularity bias. We tuned each algorithm to achieve its best performance in terms of precision. Table 1 shows the precision values for each algorithm on both the MovieLens and Yahoo Movies datasets. On the MovieLens data, ItemKNN has the highest precision, while on Yahoo Movies, SVD++ outperforms the other algorithms. We set the size of the generated recommendation list for each user to 10.
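For concreteness, the most-popular baseline and Precision@10 can be sketched as follows. This is an illustrative stand-in for the LibRec implementations used in the experiments; the (user, item) pair representation and the division by the fixed list size n are assumptions, not details taken from the paper.

```python
from collections import Counter

def most_popular_recommender(train, n=10):
    """Return a function that recommends the n most-rated training items
    the target user has not rated yet (non-personalized baseline)."""
    popularity = Counter(item for _, item in train)           # train: list of (user, item) pairs
    ranked = [item for item, _ in popularity.most_common()]
    seen = {}
    for user, item in train:
        seen.setdefault(user, set()).add(item)
    def recommend(user):
        rated = seen.get(user, set())
        return [i for i in ranked if i not in rated][:n]
    return recommend

def precision_at_n(recommend, test, n=10):
    """Average fraction of the top-n list that appears in each user's test items."""
    relevant = {}
    for user, item in test:
        relevant.setdefault(user, set()).add(item)
    scores = [len(set(recommend(u)[:n]) & items) / n for u, items in relevant.items()]
    return sum(scores) / len(scores)

# Toy usage with (user, item) pairs:
train = [(1, "a"), (1, "b"), (2, "a"), (2, "c"), (3, "a"), (3, "b")]
test = [(1, "c"), (2, "b"), (3, "c")]
rec = most_popular_recommender(train, n=2)
print(rec(1), round(precision_at_n(rec, test, n=2), 2))
```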

In this paper, we are interested in seeing how different groups of users with varying degrees of interest in popular items are treated by the recommender system. Therefore, we grouped the users in both datasets into an arbitrary number of groups (10 in this paper) based on their degree of interest in popular items. That is, we first measure the average popularity of the rated items in each user's profile and then organize the users into 10 groups, with the first group having the lowest average item popularity (extremely niche users) and the last group having the highest average item popularity (heavily blockbuster-focused users). We denote these groups as G1 through G10. In Section 6 we also discuss grouping based on gender.
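The paper does not spell out whether the 10 groups are equal-width bins over average profile popularity or equal-sized deciles; the sketch below (illustrative, with assumed data structures) uses equal-width bins, which is consistent with the varying group sizes reported later in Figure 7. Here θ(i) is the fraction of users who rated item i.

```python
from collections import defaultdict

def group_users_by_profile_popularity(train, n_groups=10):
    """Assign users to G1..G10 by the average popularity of the items they rated.

    train: list of (user, item) pairs from the training ratings.
    Returns a list of n_groups lists of user ids; groups[0] is G1 (most niche),
    groups[-1] is G10 (most blockbuster-focused).
    """
    users_of_item = defaultdict(set)
    items_of_user = defaultdict(set)
    for user, item in train:
        users_of_item[item].add(user)
        items_of_user[user].add(item)
    n_users = len(items_of_user)
    theta = {i: len(us) / n_users for i, us in users_of_item.items()}      # item popularity
    avg_pop = {u: sum(theta[i] for i in items) / len(items)
               for u, items in items_of_user.items()}                       # profile popularity
    lo, hi = min(avg_pop.values()), max(avg_pop.values())
    width = (hi - lo) / n_groups or 1.0                                     # guard against hi == lo
    groups = [[] for _ in range(n_groups)]
    for u, pop in avg_pop.items():
        groups[min(int((pop - lo) / width), n_groups - 1)].append(u)
    return groups
```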

In order to measure how much each algorithm amplifies the popularity bias in the recommendations it generates for different user groups, we define popularity lift as a measure of the amplification power of different algorithms on different user groups. First, we measure the average item popularity of a group G (i.e. the Group Average Popularity) as follows:

$$GAP_p(G) = \frac{\sum_{u \in G} \frac{\sum_{i \in \Gamma_u} \theta(i)}{|\Gamma_u|}}{|G|} \qquad (5)$$

where θ(i) is the popularity value for item i (i.e. the ratio of users who rated that item), Γ_u is the set of items rated by user u, and the subscript p refers to the profiles of the users.

We also measure the average popularity of the items recommended to the users in the same group as:

$$GAP_q(G) = \frac{\sum_{u \in G} \frac{\sum_{i \in \Lambda_u} \theta(i)}{|\Lambda_u|}}{|G|} \qquad (6)$$

where Λ_u is the list of items recommended to user u and the subscript q refers to the recommendations. The popularity lift (PL) for group G is therefore defined as:

$$PL(G) = \frac{GAP_q(G) - GAP_p(G)}{GAP_p(G)} \qquad (7)$$

Positive values of PL indicate amplification of popularity bias by the algorithm. A negative value of PL occurs when, on average, the recommendations are less concentrated on popular items than the users' profiles. A PL value of 0 means there is no popularity bias amplification.
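Equations (5)-(7) translate into a few lines of code. The sketch below assumes per-user item lists and the θ popularity values computed earlier; it is illustrative rather than the authors' implementation.

```python
def group_average_popularity(item_lists, theta):
    """GAP over a group: mean over users of the mean popularity of their items (Eq. 5 / 6).

    item_lists: dict mapping user id -> list of item ids
                (profile items for GAP_p, recommended items for GAP_q)
    theta: dict mapping item id -> popularity (fraction of users who rated it)
    """
    per_user = [sum(theta[i] for i in items) / len(items) for items in item_lists.values()]
    return sum(per_user) / len(per_user)

def popularity_lift(profile_items, recommended_items, theta):
    """PL(G) = (GAP_q - GAP_p) / GAP_p  (Eq. 7)."""
    gap_p = group_average_popularity(profile_items, theta)
    gap_q = group_average_popularity(recommended_items, theta)
    return (gap_q - gap_p) / gap_p

# Toy usage for a two-user group:
theta = {"a": 0.9, "b": 0.5, "c": 0.1}
profiles = {1: ["b", "c"], 2: ["c"]}                       # niche profiles, GAP_p = 0.2
recs = {1: ["a", "b"], 2: ["a"]}                           # popular recommendations, GAP_q = 0.8
print(round(popularity_lift(profiles, recs, theta), 2))    # 3.0 -> strong amplification
```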

Our goal is to show how popularity lift (PL) and miscalibration (MC) are correlated for different algorithms. In particular, we are interested in discovering how popularity bias in recommendation algorithms affects groups of users with different degrees of interest in popular items. That will reveal whether there is any discrimination in terms of imposing popular items on different user groups. In addition, we want to show how the degree to which these groups are affected by algorithmic popularity bias (i.e. popularity lift) is correlated with how much miscalibration they experience in their recommendations.

5 RESULTS
5.1 Popularity Bias Propagation
Not every recommendation algorithm has the same amplification power for popularity bias. We first look at the performance of different algorithms in terms of popularity bias propagation, without paying attention to how they perform for different users or groups of users.

Figure 4 shows the correlation between the number of times an item is rated and how frequently it is recommended by different algorithms on the MovieLens dataset. It can be seen that for all algorithms there are many items that are almost never recommended (the items falling on the horizontal tail of the scatter plot). As expected, the Most popular algorithm seems to have the strongest correlation between the number of times an item is rated and the number of times it is recommended. Both UserKNN and ItemKNN also show strong correlations.

Looking at the plot for the BMF algorithm, there does not seem to be a strong correlation between how often an item is rated and how often it is recommended. Nevertheless, its very sparse scatter plot shows that the number of recommended items is still low: many items are recommended very rarely (even some of those that are rated more frequently) while a few items are recommended frequently. The same applies to SVD++. For each point on the scatter plot, the average rating of the corresponding item is also shown, to illustrate the quality of the items recommended by each algorithm.

Figure 5 shows the same information for the Yahoo Movies data. Note that, for all algorithms, items already rated by a user are not recommended to that user, which is why there is an inflection point around the popularity value of 0.35 for MovieLens and 0.2 for Yahoo Movies in the plot for the Most popular algorithm.
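The quantity plotted in Figures 4 and 5 (how often each item is rated versus how often it is recommended) can be tabulated as in the sketch below; summarizing the scatter with a Spearman rank correlation is our own choice here, not a statistic reported in the paper.

```python
from collections import Counter
from scipy.stats import spearmanr

def rated_vs_recommended_counts(train, recommendations):
    """Pair each item's rating count with its recommendation count.

    train: list of (user, item) pairs
    recommendations: dict mapping user id -> list of recommended item ids
    Items that are never recommended get a count of 0 (the horizontal tail in Figures 4 and 5).
    """
    rated = Counter(item for _, item in train)
    recommended = Counter(i for recs in recommendations.values() for i in recs)
    items = sorted(rated)
    return [rated[i] for i in items], [recommended.get(i, 0) for i in items]

# Usage: x, y = rated_vs_recommended_counts(train, recs)
#        rho, _ = spearmanr(x, y)   # rank correlation between popularity and recommendation frequency
```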

Figure 6 shows the total popularity lift for different algorithms on the MovieLens and Yahoo Movies datasets. As can be observed, the Most popular algorithm has the highest popularity lift, followed by ItemKNN and UserKNN on both datasets. SVD++ and BMF show a negative popularity lift on the Yahoo Movies dataset, meaning the recommendations, on average, are less concentrated on popular items than the users' profiles. That is because these two algorithms do not have much popularity bias on the Yahoo Movies data and generally recommend items with smaller average popularity, contributing to an overall negative popularity lift. However, these two algorithms still show a positive popularity lift on the MovieLens dataset, though to a much lesser degree than their neighborhood-based counterparts.

Figure 7 shows the impact of popularity bias on the groups we created based on their degree of interest in popular items. The average popularity of each group is shown within the parentheses next to each group's label, with smaller values representing groups with more niche tastes and larger values representing more blockbuster-focused groups. The percentage of users that fall within each group is shown on the y-axis of this figure. We can see that for the MovieLens dataset there is a roughly normal distribution which reaches its peak between 0.10 and 0.13, while only a few users fall within the more extreme groups on the left and right sides of the plot. For the Yahoo Movies dataset, however, the percentage of users in each group is more consistent, except for the groups with the highest average popularity, where there are only a few users.

5.2 Relationship Between Popularity Bias and Miscalibration

In this part, we show the connection between popularity bias and unfairness in recommendation. Making this connection would be helpful in many fairness-aware recommendation scenarios, because fixing the popularity bias could be used as an approach to tackle this type of unfairness.

(a) BMF (b) SVD++ (c) ItemKNN (d) UserKNN (e) Most popular

Figure 4: MovieLens: the correlation between the popularity of items and how frequently they are recommended.

(a) BMF (b) SVD++ (c) ItemKNN (d) UserKNN (e) Most popular

Figure 5: Yahoo Movies: the correlation between the popularity of items and how frequently they are recommended.

An illustration of the effect of the algorithmic popularity bias on different user groups is shown in Figure 8. Each dot represents a group; the average popularity of the users' profiles in that group is shown on the x-axis, and the popularity lift of different algorithms on each user group is depicted on the y-axis. It can be seen that groups with the lowest average popularity (niche tastes) are affected the most by the algorithmic popularity bias, and the higher the average popularity of the group, the less the group is affected by the popularity bias. This shows how unfairly popularity bias affects different user groups. Table 2 shows the popularity lift experienced by two extreme groups: G1 (the most niche group) and G10 (the most blockbuster-focused group). It can be seen that, for all of the algorithms, the popularity lift experienced by G1 is significantly higher than that of G10 (p < 0.01, t-test).
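The significance test can be reproduced along the following lines: compute a per-user popularity lift for the users in G1 and G10 and compare the two samples. The per-user lift definition and the use of Welch's two-sample t-test from scipy are our assumptions, since the paper does not spell out the exact test procedure.

```python
from scipy.stats import ttest_ind

def per_user_popularity_lift(profile_items, recommended_items, theta):
    """Per-user analogue of Eq. (7): lift of the recommendation list over the profile."""
    lifts = {}
    for u in profile_items:
        gap_p = sum(theta[i] for i in profile_items[u]) / len(profile_items[u])
        gap_q = sum(theta[i] for i in recommended_items[u]) / len(recommended_items[u])
        lifts[u] = (gap_q - gap_p) / gap_p
    return lifts

def compare_groups(lifts, g1_users, g10_users):
    """Two-sample t-test between the niche group (G1) and the blockbuster group (G10)."""
    sample_g1 = [lifts[u] for u in g1_users]
    sample_g10 = [lifts[u] for u in g10_users]
    return ttest_ind(sample_g1, sample_g10, equal_var=False)   # Welch's t-test

# Usage: stat, p_value = compare_groups(lifts, groups[0], groups[-1])
```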

(a) MovieLens (b) Yahoo Movies

Figure 6: Total popularity lift for different recommendation algorithms.

(a) MovieLens (b) Yahoo Movies

Figure 7: Histogram of groups' average item popularity in the MovieLens and Yahoo Movies datasets.

(a) MovieLens (b) Yahoo Movies

Figure 8: Average item popularity of user groups (G1 through G10 from left to right) and their observed popularity lift in the MovieLens and Yahoo Movies datasets.

Figure 9 shows the correlation between total popularity lift and miscalibration for different algorithms on the MovieLens and Yahoo Movies datasets. We can see a general trend for these algorithms: the more an algorithm imposes popularity bias on users (the higher its total popularity lift), the higher the degree of miscalibration, showing a positive correlation between the two. Algorithms with the highest popularity lift, such as UserKNN and ItemKNN, also have higher miscalibration compared to the other algorithms. SVD++ and BMF have the lowest popularity lift and they also have the lowest miscalibration. This is an interesting finding, as it could be useful in the design of recommender systems where sensitivity to this type of miscalibration may dictate the choice of the underlying algorithm.

Table 3 shows the miscalibration values for the two groups G1 and G10 (the two groups at the extreme ends of the popularity spectrum, as discussed earlier). It can be seen that, for all algorithms, the miscalibration value for G1 is significantly higher than the one for G10 (p < 0.01, t-test). This shows that the group with the lowest average item popularity has experienced the highest miscalibration in their recommendations. Moreover, we also saw in Table 2 that this group experienced the highest popularity lift. This shows again how popularity lift might lead to miscalibration.


Table 2: The popularity lift of different recommendation algorithms on the two groups G1 and G10.

                MovieLens           Yahoo Movies
Algorithm       G10       G1        G10       G1
ItemKNN         0.4579    12.19     -0.26     11.57
UserKNN         0.348     9.17      -0.31     5.738
Most-popular    0.563     15.7      -0.25     15.13
SVD++           -0.09     7.063     -0.84     1.96
BMF             0.086     10.60     -0.871    3.079

Table 3: The miscalibration of different recommendation algorithms on the two groups G1 and G10.

                MovieLens           Yahoo Movies
Algorithm       G10       G1        G10       G1
ItemKNN         0.250     0.418     0.345     0.4660
UserKNN         0.248     0.446     0.345     0.3953
Most-popular    0.277     0.501     0.342     0.471
SVD++           0.380     0.556     0.272     0.315
BMF             0.396     0.635     0.290     0.345

(a) MovieLens (b) Yahoo Movies

Figure 9: Comparison of different algorithms in terms of their total popularity lift and miscalibration.

6 GENDER ANALYSIS
The grouping methodology based on users' interest in popular items was intended to illustrate how different levels of interest in item popularity affect the recommendations and their miscalibration. In particular, we intended to show how different recommendation algorithms treat user groups with varying degrees of interest in popular items.

In this section, we use gender as the grouping criterion to analyse the impact of popularity bias on different user groups and its correlation with the miscalibration of the recommendations for these two groups. Of the two datasets used in the previous sections, only MovieLens has gender information about the users; therefore, in this section we only report results for the different algorithms on this dataset. Out of the 6,040 users in this dataset, 1,708 are women and 4,330 are men, which shows the imbalanced nature of this dataset. Table 4 shows the popularity lift of different recommendation algorithms for men and women. It can be observed that for all algorithms women have experienced a significantly higher degree of popularity lift (p < 0.05, except for SVD++, which treated both groups equally). What is interesting, and also consistent with our results in Section 5, is that the group who experienced the higher popularity lift (women) also experienced higher miscalibration, as can be seen in Table 5. The popularity lift and miscalibration of SVD++ for men and women are very close to each other, which shows the consistency of this algorithm in delivering fairer recommendations. BMF imposes a higher popularity lift on women but treats men and women equally when it comes to miscalibration. Note that the larger group size for men (4,330 vs. 1,708) could be another reason for the higher miscalibration for women, as the algorithms are trained on a larger number of ratings from men, which could bias the algorithms towards learning men's rating behavior. Further analysis of the rating behavior of these two groups, such as the consistency of their ratings and the informativeness of their profiles, would help shed light on the results reported here.

Table 4: The popularity lift of different recommendation algorithms on two groups, men and women (MovieLens). Bold values show a significant difference with p < 0.05.

Algorithm       Men      Women
ItemKNN         1.58     1.76
UserKNN         1.33     1.52
Most-popular    1.71     1.91
SVD++           0.33     0.33
BMF             0.87     1.01

Table 5: The miscalibration of different recommendation algorithms on two groups, men and women (MovieLens). Bold values show a significant difference with p < 0.05.

Algorithm       Men      Women
ItemKNN         0.368    0.42
UserKNN         0.36     0.4
Most-popular    0.40     0.48
SVD++           0.41     0.43
BMF             0.37     0.37

7 DISCUSSION
In Figure 9 we observed a positive correlation between the popularity lift and the overall miscalibration of a recommendation algorithm. Investigating which one causes the other requires a more in-depth experimental design and further analysis. However, we believe that popularity lift causes miscalibration and not the other way around. When a list is miscalibrated, it is due to over-representation or under-representation of some genres relative to what the user expects given the items that s/he has rated. Popularity lift increases the recommendation frequency of popular movies and the genres associated with them. As a result, these popular genres become over-represented (at the cost of suppressing non-popular movies and their associated genres) and thus cause overall miscalibration.

We also observed in Tables 2 and 3 that user groups who experience higher popularity lift also experience higher miscalibration. As we saw in Figure 8, user groups with less interest in popular items are affected more by the popularity bias of a recommendation algorithm (i.e. a higher degree of popularity lift), because their interests are less likely to lie within the set of popular items. The mechanism suggested above would imply that these groups would also experience greater miscalibration, and that is in fact what we found.

8 CONCLUSION AND FUTURE WORK
Recommendation algorithms often suffer from the popularity bias problem. This bias can be problematic for several reasons: recommending non-popular items helps build a fuller understanding of users' preferences, and avoiding the neglect of certain items while a few items receive most of the attention can be seen as a social good.

In this paper, we looked at the popularity bias problem from the user's perspective and observed that different groups of users can be affected differently by this bias, depending on how much they are interested in popular items. We also showed that popularity bias has a strong correlation with miscalibration, which measures how consistent the recommendations are with users' true preferences. That is, algorithms with high popularity bias (popularity lift) tend to also have higher total miscalibration. In particular, for the two extreme groups on the spectrum of item popularity, we showed that the group that is less interested in popular items is affected the most by popularity bias and also has the highest level of miscalibration. Consistent with these results, we observed that on the MovieLens dataset the group who experienced the higher popularity lift (women) also experienced higher miscalibration compared to men.

We also showed that different algorithms behave differently with respect to miscalibration. In particular, commonly used neighborhood-based algorithms are highly susceptible to propagating popularity bias and hence to miscalibration. On the other hand, factorization-based algorithms seem more resistant to this effect.

One limitation of our work is that all of our experiments only indicate a correlation between popularity bias, miscalibration, and unfairness. For future work, we intend to study the causal effect of popularity bias on these issues. In particular, we will design experiments, such as sampling methods that control popularity bias in the data, to see its effect on miscalibration and fairness. We will also investigate the effect of algorithms for controlling algorithmic popularity bias on miscalibration and fairness.

REFERENCES
[1] Himan Abdollahpouri, Gediminas Adomavicius, Robin Burke, Ido Guy, Dietmar Jannach, Toshihiro Kamishima, Jan Krasnodebski, and Luiz Pizzato. 2019. Beyond Personalization: Research Directions in Multistakeholder Recommendation. arXiv preprint arXiv:1905.01986 (2019).
[2] Himan Abdollahpouri, Robin Burke, and Bamshad Mobasher. 2017. Controlling Popularity Bias in Learning to Rank Recommendation. In Proceedings of the 11th ACM Conference on Recommender Systems. ACM, 42–46.
[3] Himan Abdollahpouri, Robin Burke, and Bamshad Mobasher. 2019. Managing Popularity Bias in Recommender Systems with Personalized Re-ranking. In Florida AI Research Symposium (FLAIRS). To appear.
[4] Himan Abdollahpouri, Masoud Mansoury, Robin Burke, and Bamshad Mobasher. 2019. The Unfairness of Popularity Bias in Recommendation. In Workshop on Recommendation in Multistakeholder Environments (RMSE).
[5] Gediminas Adomavicius and YoungOk Kwon. 2012. Improving Aggregate Recommendation Diversity Using Ranking-Based Techniques. IEEE Transactions on Knowledge and Data Engineering 24, 5 (2012), 896–911. https://doi.org/10.1109/TKDE.2011.15
[6] Chris Anderson. 2006. The Long Tail: Why the Future of Business Is Selling Less of More. Hyperion.
[7] Erik Brynjolfsson, Yu Jeffrey Hu, and Michael D. Smith. 2006. From Niches to Riches: Anatomy of the Long Tail. Sloan Management Review 47, 4 (2006), 67–71.
[8] Erik Brynjolfsson, Yu Jeffrey Hu, and Michael D. Smith. 2006. From Niches to Riches: Anatomy of the Long Tail. Sloan Management Review (2006), 67–71.
[9] Robin Burke, Nasim Sonboli, Masoud Mansoury, and Aldo Ordonez-Gauger. 2017. Balanced Neighborhoods for Fairness-aware Collaborative Recommendation. In Workshop on Responsible Recommendation (FATRec).
[10] Oscar Celma and Pedro Cano. 2008. From Hits to Niches? Or How Popular Artists Can Bias Music Recommendation and Discovery. In Proceedings of the 2nd KDD Workshop on Large-Scale Recommender Systems and the Netflix Prize Competition. ACM, 5.
[11] David A. Cieslak, T. Ryan Hoens, Nitesh V. Chawla, and W. Philip Kegelmeyer. 2012. Hellinger Distance Decision Trees are Robust and Skew-insensitive. Data Mining and Knowledge Discovery 24, 1 (2012), 136–158.
[12] Michael D. Ekstrand, Mucun Tian, Ion Madrazo Azpiazu, Jennifer D. Ekstrand, Oghenemaro Anuyah, David McNeill, and Maria Soledad Pera. 2018. All The Cool Kids, How Do They Fit In?: Popularity and Demographic Biases in Recommender Evaluation and Effectiveness. In Conference on Fairness, Accountability and Transparency. 172–186.
[13] Guibing Guo, Jie Zhang, Zhu Sun, and Neil Yorke-Smith. 2015. LibRec: A Java Library for Recommender Systems. In UMAP Workshops.
[14] F. Maxwell Harper and Joseph A. Konstan. 2015. The MovieLens Datasets: History and Context. ACM Transactions on Interactive Intelligent Systems (TiiS) 5, 4 (2015), 19.
[15] Dietmar Jannach, Lukas Lerche, Iman Kamehkhosh, and Michael Jugovac. 2015. What Recommenders Recommend: An Analysis of Recommendation Biases and Possible Countermeasures. User Modeling and User-Adapted Interaction 25, 5 (2015), 427–491.
[16] Toshihiro Kamishima, Shotaro Akaho, Hideki Asoh, and Jun Sakuma. 2012. Fairness-aware Classifier with Prejudice Remover Regularizer. In Proc. of ECML PKDD 2012, Part II. 35–50.
[17] Toshihiro Kamishima, Shotaro Akaho, Hideki Asoh, and Jun Sakuma. 2014. Correcting Popularity Bias by Enhancing Recommendation Neutrality. In Poster Proceedings of the 8th ACM Conference on Recommender Systems (RecSys 2014), Foster City, Silicon Valley, CA, USA, October 6-10, 2014. http://ceur-ws.org/Vol-1247/recsys14_poster10.pdf
[18] Yehuda Koren. 2008. Factorization Meets the Neighborhood: A Multifaceted Collaborative Filtering Model. In Proceedings of the 14th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM, 426–434.
[19] Yehuda Koren, Robert Bell, and Chris Volinsky. 2009. Matrix Factorization Techniques for Recommender Systems. Computer 42, 8 (2009), 30–37.
[20] Weiwen Liu and Robin Burke. 2018. Personalizing Fairness-aware Re-ranking. arXiv preprint arXiv:1809.02921 (2018). Presented at the 2nd FATRec Workshop held at RecSys 2018, Vancouver, CA.
[21] Tien T. Nguyen, Pik-Mai Hui, F. Maxwell Harper, Loren Terveen, and Joseph A. Konstan. 2014. Exploring the Filter Bubble: The Effect of Using Recommender Systems on Content Diversity. In Proceedings of the 23rd International Conference on World Wide Web. ACM, 677–686.
[22] Jinoh Oh, Sun Park, Hwanjo Yu, Min Song, and Seung-Taek Park. 2011. Novel Recommendation Based on Personal Popularity Tendency. In 2011 IEEE 11th International Conference on Data Mining. IEEE, 507–516.
[23] Yoon-Joo Park and Alexander Tuzhilin. 2008. The Long Tail of Recommender Systems and How to Leverage It. In Proceedings of the 2008 ACM Conference on Recommender Systems. ACM, 11–18.
[24] Paul Resnick, R. Kelly Garrett, Travis Kriplean, Sean A. Munson, and Natalie Jomini Stroud. 2013. Bursting Your (Filter) Bubble: Strategies for Promoting Diverse Exposure. In Proceedings of the 2013 Conference on Computer Supported Cooperative Work Companion. ACM, 95–100.
[25] Paul Resnick and Hal R. Varian. 1997. Recommender Systems. Commun. ACM 40, 3 (1997), 58.
[26] Badrul Sarwar, George Karypis, Joseph Konstan, and John Riedl. 2001. Item-based Collaborative Filtering Recommendation Algorithms. In Proceedings of the 10th International Conference on World Wide Web. ACM, 285–295.
[27] Harald Steck. 2018. Calibrated Recommendations. In Proceedings of the 12th ACM Conference on Recommender Systems. ACM, 154–162.
[28] Sirui Yao and Bert Huang. 2017. Beyond Parity: Fairness Objectives for Collaborative Filtering. CoRR abs/1705.08804 (2017). http://arxiv.org/abs/1705.08804
[29] Ziwei Zhu, Xia Hu, and James Caverlee. 2018. Fairness-Aware Tensor-Based Recommendation. In Proceedings of the 27th ACM International Conference on Information and Knowledge Management. ACM, 1153–1162.