

This article has been accepted for inclusion in a future issue of this journal. Content is final as presented, with the exception of pagination.

IEEE TRANSACTIONS ON CYBERNETICS

Lazy Collaborative Filtering for Data Sets With Missing Values

Yongli Ren, Gang Li, Member, IEEE, Jun Zhang, Member, IEEE, and Wanlei Zhou, Senior Member, IEEE

Abstract—As one of the biggest challenges in research on recommender systems, the data sparsity issue is mainly caused by the fact that users tend to rate a small proportion of items from the huge number of available items. This issue becomes even more problematic for the neighborhood-based collaborative filtering (CF) methods, as even fewer ratings are available in the neighborhood of the query item. In this paper, we aim to address the data sparsity issue in the context of neighborhood-based CF. For a given query (user, item), a set of key ratings is first identified by taking the historical information of both the user and the item into account. Then, an auto-adaptive imputation (AutAI) method is proposed to impute the missing values in the set of key ratings. We present a theoretical analysis to show that the proposed imputation method effectively improves the performance of the conventional neighborhood-based CF methods. The experimental results show that our new method of CF with AutAI outperforms six existing recommendation methods in terms of accuracy.

Index Terms—Imputation, neighborhood-based collaborative filtering (CF), recommender systems.

I. INTRODUCTION

THE RAPID development of data collection methods, database systems, and processing technology has led the "information age" toward a new and exciting stage. Researchers and practitioners from a number of diverse disciplines are confronted with the issue of "information overload": we face an explosion of data resources, which makes it difficult to identify those relevant to us. Manually evaluating all of these alternatives is simply infeasible. As a consequence, automatically identifying the most relevant items has become an urgent requirement in many areas. In the past decade, one promising way to efficiently alleviate information overload has been the recommender system, which attempts to automatically provide personalized recommendations based on the historical records of users' activities. Thus far, recommender systems have successfully found applications in a variety of e-commerce web services, ranging from movie recommendations on Netflix.com to book recommendations on Amazon.com. Recommender systems are commonly constructed on the basis of collaborative filtering (CF) methods because of their insensitivity to the detailed features of users or items. In general, there are two types of CF methods: the neighborhood-based method and the model-based method [1], [2]. Although much research attention has been given to the model-based method, a lesson learned from the well-known Netflix Competition highlights that neighborhood-based methods and model-based methods explore very different levels of data patterns in the data set and, therefore, neither will consistently produce optimal results on its own [3]. In this paper, we aim to address the data sparsity issue in the context of the neighborhood-based method with theoretical analysis.

Manuscript received May 22, 2012; revised September 12, 2012 and November 8, 2012; accepted November 24, 2012. This paper was recommended by Associate Editor H. Wang.

The authors are with the School of Information Technology, Deakin University, Vic. 3125, Australia (e-mail: [email protected]).

Color versions of one or more of the figures in this paper are available online at http://ieeexplore.ieee.org.

Digital Object Identifier 10.1109/TSMCB.2012.2231411

The neighborhood-based CF method, a lazy yet nonparametric method, has been widely applied in practice and enjoys popularity because of its simplicity, justifiability, efficiency, and stability [1], [2]. However, in practice, there have been serious problems for practitioners trying to use this kind of method in areas where data are sparse. Data sparsity is the problem that frustrates most recommender system methods. It refers to insufficient information in a user's rating history, which makes the corresponding user × item rating matrix extremely sparse. When missing values prevail, two like-minded users may not show any similarity. To overcome this problem for recommendation purposes, two major categories of approaches exist: one is to utilize more suitable similarity metrics to identify a better neighborhood [1], [4], [5], and the other is to design better aggregation methods that integrate the item ratings given by all the neighbors [6], [7]. A third category of approaches, based on data imputation [5], [8]–[10], to which this work belongs, has begun to emerge over the past few years. Data imputation mainly involves selecting a set of (user, item) pairs whose values are missing or unavailable in the user × item rating matrix and filling them with imputed values before predicting the ratings for the active user. In a recent attempt, Ma et al. proposed a missing data prediction algorithm that takes imputation confidence into consideration, predicting only those missing data that satisfied their criteria and that they were confident to impute [9]. Xue et al. proposed another imputation-based CF algorithm that fuses the model-based and the neighborhood-based methods [8]. Previous research has also investigated the effectiveness of different imputation techniques [11]–[13]. However, despite evidence of improved predictive accuracy, the application of data imputation to CF is still in its infancy. Issues such as whether all missing data are equally important for recommendation, how to select the most informative missing data to impute, and how to trade off imputation benefit against imputation error remain largely unexplored.

2168-2267/$31.00 © 2012 IEEE


In this paper, we propose an auto-adaptive imputation (AutAI) method that automatically identifies a key set of missing data and adaptively imputes them according to each individual user's rating history. Through this novel method, we can maximize the information contained in the neighborhood of the active user while minimizing the imputation error introduced. Inspired by Cover's [14] research on the nearest-neighbor rule, which showed that the nearest neighbor contains at least 50% of the information in an infinite training set, we argue that not all missing data possess equivalent information for a particular prediction and that some missing data are more informative than others. We therefore propose an AutAI method that adaptively identifies the critical key set of missing data for each active user so as to impute the most informative missing data. The proposed imputation method can thus lead to more accurate recommendations. We provide a theoretical analysis of the performance benefit of the proposed AutAI method and propose a CF algorithm that uses the AutAI method from both the user and item aspects (AutAI-Fusion).

The major contributions of this paper are as follows:

• We propose a novel imputation method (AutAI) to identify a key set of missing data for each rating prediction to address the problem of data sparsity.

• We theoretically analyze the performance benefit of the proposed AutAI method.

• Based on the AutAI method, we propose a novel efficient CF algorithm (AutAI-Fusion) by simultaneously considering both user activity and item popularity.

• We conduct a large set of experiments to examine the performance of the AutAI method with various similarity metrics and compare the proposed AutAI-Fusion algorithm with six state-of-the-art CF algorithms, including other representative imputation-based algorithms.

The rest of this paper is organized as follows: Section II presents the background and related work. Section III defines the AutAI method in the context of neighborhood-based CF, together with a theoretical analysis of the imputation benefit and error. Section IV presents the experimental results, followed by the conclusion in Section V.

II. BACKGROUND AND RELATED WORK

Here, we briefly recap the neighborhood-based CF method and the model-based CF method.

A. Notation

Let U = {u1, u2, . . . , um} be a set of m users and T = {t1, t2, . . . , tn} be a set of n items. For m users and n items, the user × item rating matrix R is represented as an m × n matrix.

The user × item matrix R can be decomposed into row vectors: R = [u1, u2, . . . , um]^T and ui = [ri1, ri2, . . . , rin], where T denotes transpose. The row vector ui corresponds to user ui's rating information, and rij denotes the rating that user ui gave to item tj. Nk(ux) denotes the set of user ux's k-nearest neighbors (KNNs). Txy = {ti ∈ T | rxi ≠ ⊥, ryi ≠ ⊥} denotes the set of items corated by both ux and uy, where ⊥ denotes a missing rating. ua denotes the active user for whom the recommendation is being produced, and ts denotes the active item on which the recommendation is being produced. ūa and t̄s denote the average rating of user ua and the average rating of item ts, respectively.
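To make the notation concrete, here is a toy sketch (illustrative values only, with np.nan standing in for the missing-rating symbol ⊥) of a small rating matrix R and the corated item set Txy:

```python
import numpy as np

# Toy 4-user x 5-item rating matrix R; np.nan marks a missing rating.
# All values are made up for illustration.
R = np.array([
    [5, 3, np.nan, 1, np.nan],       # u1
    [4, np.nan, np.nan, 1, np.nan],  # u2
    [1, 1, np.nan, 5, 4],            # u3
    [np.nan, 1, 5, 4, np.nan],       # u4
])

def corated_items(R, x, y):
    """Txy: indices of items rated by both user x and user y."""
    return np.flatnonzero(~np.isnan(R[x]) & ~np.isnan(R[y]))

print(corated_items(R, 0, 1))  # u1 and u2 corated items t1 and t4 -> [0 3]
```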

B. Neighborhood-Based CF

The neighborhood-based CF method is generally based on the KNN rule. It is a type of lazy learning method in which the selection of the nearest neighbors is delayed until a query (user, item) reaches the system. Its input is usually the entire user × item rating matrix R. One of the pioneering works in neighborhood-based CF was proposed by Resnick et al. [4]. Their system, GroupLens, utilizes users' ratings as input and then predicts how much the active user will like an unrated item. Specifically, ratings are exploited in a twofold way: 1) to identify similar users and 2) to predict the rating on item ts by the active user ua. Two stages are involved in this process: neighbor selection and rating aggregation.

In the stage of neighbor selection, similarity may be evaluated between any two rows or any two columns in the user × item rating matrix, corresponding to the user-based methods [5] or the item-based methods [15], respectively. It is clear that the similarity metric is one of the foundational problems in CF. When the user × item rating matrix is sparse, this problem becomes even more important. Several methods have been proposed to address it. One is to design better similarity metrics that can tolerate the sparsity issue, e.g., the two popular metrics of Pearson correlation coefficient (PCC) [4] and cosine-based similarity [1], [5]. Another is to tackle the sparsity issue directly through data imputation, such as default voting [5], Effective Missing Data Prediction (EMDP) [9], etc. This method is discussed in detail in Section II-D.

In the stage of rating aggregation, for any item ti, all the ratings on ti by users in Nk(ua) are aggregated into the predicted rating rai for user ua [1]. The weighted majority prediction algorithm predicts the rating on new items based on a weighted sum of the users' ratings on those items [16]. Hence, determining a suitable set of weights becomes an essential problem. In 2007, Bell and Koren cast weight determination as an optimization problem, which they solved by linear regression [17]. Recently, Garcin et al. investigated the performance of three aggregators: the mean, the median, and the mode [18]. They found that both the median and the mode lead to more accurate recommendations than the mean. Larson and Hanjalic [7] proposed an approach based on rated-item pools, which characterizes a user's preference as "positive," "neutral," or "negative." The item set rated by a user is accordingly divided into three subsets; three similarities are then obtained based on these subsets, and the final rating is predicted by aggregating ratings using the weights from these three subsets. While most conventional work relies on the similarity between users or items to determine the weight [19], data sparsity also affects this stage.
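As a minimal illustration of the three aggregators compared by Garcin et al., assuming a hypothetical set of neighbor ratings on the active item:

```python
from statistics import mean, median, mode

# Hypothetical ratings on the active item by the selected neighbors.
neighbor_ratings = [4, 4, 5, 2, 4]

# The mean is pulled down by the outlier rating 2, while the
# median and the mode are robust to it.
print(mean(neighbor_ratings))    # 3.8
print(median(neighbor_ratings))  # 4
print(mode(neighbor_ratings))    # 4
```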


C. Model-Based CF

In contrast to the neighborhood-based CF, the model-based CF methods construct a model from the given rating matrix and utilize the model to predict ratings on new items. A wide range of machine learning techniques has been adopted, such as supervised learning [20], unsupervised learning [21]–[23], matrix decomposition [24], [25], etc.

For supervised learning techniques, Su and Khoshgoftaar applied belief nets to treat rating prediction as a multiclass classification problem [20]. For unsupervised learning techniques, clustering can improve the scalability of CF methods by grouping similar users and using the cluster as the neighborhood [21]. Lemire and Maclachlan proposed the slope one algorithm, based on the popularity differential principle between items, meaning that user rating behavior is influenced by both the user's rating style (e.g., the user's average rating) and the item's popularity (e.g., the average rating of the item) [23]. For matrix decomposition techniques, Billsus and Pazzani proposed using singular value decomposition to exploit the "latent structure" in user ratings [24].
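As a rough sketch of the popularity differential idea behind slope one (the basic, unweighted scheme rather than the authors' exact variant; the matrix values are made up):

```python
import numpy as np

# Toy 3-user x 3-item rating matrix; np.nan marks a missing rating.
R = np.array([
    [5.0, 3.0, 2.0],
    [3.0, 4.0, np.nan],
    [np.nan, 2.0, 5.0],
])

def slope_one_predict(R, user, j):
    """Predict R[user, j] by averaging R[user, i] + dev(j, i) over rated items i,
    where dev(j, i) is the mean rating differential between items j and i."""
    preds = []
    for i in range(R.shape[1]):
        if i == j or np.isnan(R[user, i]):
            continue
        both = ~np.isnan(R[:, i]) & ~np.isnan(R[:, j])  # users rating both items
        if not both.any():
            continue
        dev = float(np.mean(R[both, j] - R[both, i]))   # popularity differential
        preds.append(R[user, i] + dev)
    return float(np.mean(preds)) if preds else float('nan')

print(slope_one_predict(R, user=1, j=2))  # 2.5
```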

Compared with the model-based CF methods, the neighborhood-based CF methods also enjoy popularity due to their simplicity, justifiability, efficiency, and stability [2]. In the simplest form, only one parameter is required, namely, the number of neighbors. The selected neighbors and their ratings can help the active user better understand the produced recommendation. Moreover, this can also form the basis for an interactive system in which users choose neighbors according to their own needs [26]. The absence of a costly training phase enables neighborhood-based CF to work in applications involving millions of users and items. Moreover, a lesson learned from the well-known Netflix Competition indicates that neighborhood-based methods and model-based methods explore very different levels of data patterns in the data set; therefore, neither can always produce optimal results on its own [3].

D. Data Sparsity and Imputation

Data sparsity is one of the most challenging issues in recommendation techniques. Since users tend to rate only a small fraction of items in a system, the user × item rating matrix is usually very sparse, with a density of around 1% [15]. Furthermore, this sparsity may make the neighborhood-based CF methods incapable of finding neighbors and, therefore, cause them to fail to make accurate recommendations [15], [27]. To overcome this problem, a number of methods have been proposed. One category is to design a similarity measurement that can tolerate data sparsity in the recommendation field [28]. Another is to design better aggregation methods that integrate the item ratings given by all neighbors [6], [7]. The third category is to fill in the missing data by imputation, e.g., default voting [5], the smoothing method [8], and missing data prediction [9]. In this paper, we focus on the imputation-based methods.

Default voting is a straightforward imputation-based method [5] that assumes default values for the missing ratings, such as exploiting the average ratings of a small group of users as the default ratings for other items in order to increase the size of the corated item set [29]. Xue et al. proposed the use of certain machine learning methods to smooth all missing data in the user × item rating matrix [8]. Ma et al. took imputation confidence into consideration and only filled in missing data when the imputation was confident [9]. This idea works well, as it prevents poor imputation; however, their EMDP algorithm treats all the missing data equally. Other recent imputation-related work has focused on various imputation techniques [11]–[13]. For example, Zhu et al. proposed a nonparametric iterative imputation method for mixed-attribute data sets that creates a mixture kernel to infer the probability density for independent attributes [11]. Zhang et al. investigated an imputation method that utilizes the information in incomplete observations [12]. Zhang proposed a new shell-neighbor method for data imputation that utilizes only the left and right nearest neighbors of each data attribute [13]. Little work has investigated how to impute and which missing data should be imputed. This question is interesting and important, as imputed data bring in imputation errors as well as alleviating the sparsity issue. Hence, we argue that there is a tradeoff between imputation benefit and imputation error. Is it possible to quantify the accompanying imputation error? Can we guarantee better performance by imputing the missing data? In this paper, we investigate these questions and propose an AutAI method for the neighborhood-based CF methods. The proposed method can be proven to consistently improve the performance of the conventional neighborhood-based CF methods, and we demonstrate this by providing theoretical and empirical analysis of the neighbor selection stage of the neighborhood-based CF methods in the following sections.

III. AUTAI-BASED CF

Here, we propose a novel imputation method to effectively improve the performance of neighborhood-based CF. Moreover, a theoretical analysis of the performance benefit of the proposed method is also provided. To the best of our knowledge, this paper is the first to analyze imputation-based methods from a theoretical perspective in the area of CF.

A. Formulation of Neighborhood-Based CF

We first formulate the neighborhood-based CF using probability theory [30]. The neighborhood-based CF provides recommendations by estimating how much the active user may like unrated items, which is known as the rating prediction task. Given two variables, the user's consumption history u and the available ratings r, the rating prediction task can be formulated as μ(u) = E(r|u), the expectation of the dependent variable r given the independent variable u. For the recommendation purpose, it is of interest to estimate the value ras on an unrated item ts for a single independent variable value ua. The estimator for ua is then

μ(ua) = E(ras|ua).    (1)


From the perspective of probability theory, the observation values sampled at ua can be used to estimate μ(ua).

However, there are no observation values at ua in the context of CF. To tackle this problem, certain similar users of ua are selected and used to estimate μ(ua). This is the assumption of CF: similar users tend to give similar ratings to items. Specifically, to estimate μ(ua), we first select several neighbors of ua and then treat the ratings of these neighbors as samples at ua. Finally, we apply the above estimation method as usual. This is the well-known KNN estimator [2], i.e.,

μ(ua) = E(ras|ua) = (1/k) Σ_{ux ∈ Nk(ua)} rxs    (2)

where k is the number of selected neighbors, ux is one of ua's neighbors, Nk(ua) is the set of KNNs, and rxs is the rating of neighbor ux on ts. To further reduce the estimation bias, KNN estimation usually applies the weighted average [31], i.e.,

μ(ua) = E(ras|ua) = (1/η) Σ_{ux ∈ Nk(ua)} wax · rxs    (3)

where wax is the weight of neighbor ux with respect to ua, and η = Σ_{ux ∈ Nk(ua)} wax is the normalizing factor. The weights reflect how close the neighbors are to ua.
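The weighted KNN estimator in (3) can be sketched as follows (the neighbor ratings and weights below are hypothetical):

```python
import numpy as np

# A sketch of the weighted KNN estimator (3): the predicted rating is the
# weighted average of the neighbors' ratings on the active item.
def knn_estimate(ratings, weights):
    ratings = np.asarray(ratings, dtype=float)   # r_xs for each neighbor u_x
    weights = np.asarray(weights, dtype=float)   # w_ax for each neighbor u_x
    eta = weights.sum()                          # normalizing factor
    return float(np.dot(weights, ratings) / eta)

# Three neighbors rated the active item 4, 5, and 2,
# with weights 0.9, 0.6, and 0.5 to the active user.
print(knn_estimate([4, 5, 2], [0.9, 0.6, 0.5]))
```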

Recent research shows that the data sparsity issue brings much difficulty to the neighborhood-based CF methods [32]. Data sparsity can lead to the critical problem of unreliable similarity, since the similarity between two users is actually calculated over a very small set of corated items. The user relationship measured by such unreliable similarity cannot capture the overall relationship between two users [27]. Moreover, the similarities of an active user ua to two users ux and uy are computed on two different sets of corated items, which results in incomparable similarities. Consequently, the performance of neighborhood-based CF is seriously affected by inaccurate similarities due to the data sparsity issue.
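A toy illustration of why similarities computed on tiny corated sets are unreliable: with only two corated items, the PCC always degenerates to ±1, regardless of how similar the users' full tastes actually are (values are made up):

```python
import numpy as np

# Pearson correlation coefficient between two rating vectors
# restricted to the corated items.
def pcc(a, b):
    a, b = np.asarray(a, float), np.asarray(b, float)
    da, db = a - a.mean(), b - b.mean()
    return float(np.dot(da, db) / np.sqrt(np.dot(da, da) * np.dot(db, db)))

# Two users sharing only two corated items: any two non-constant
# rating pairs are "perfectly" correlated or anti-correlated.
print(pcc([5, 1], [3, 2]))   # 1.0
print(pcc([5, 1], [2, 3]))   # -1.0
```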

B. Novel AutAI Method

Here, we propose a novel AutAI method. To make a rating prediction on the active item ts for the active user ua, we argue that not all missing data in the user × item rating matrix possess equivalent information for this rating prediction; rather, there is a key set of missing data that is most informative for this particular prediction. Accordingly, this key set should be different for every prediction, even for the same user.

The proposed AutAI method can identify which missing data should be imputed automatically, with the imputed set adaptively determined according to a user's own rating history. Specifically, to make a rating prediction on item ts for user ua, the imputed set is identified by two factors, namely, the users related to ua and the items related to ts. We denote the related users as Ua and the related items as Ts, i.e.,

Ua = {ua′ | ra′s ≠ ⊥}    (4)

Ts = {ti | ti ∈ [Taa′1 ∪ · · · ∪ Taa′i ∪ · · · ∪ Taa′l]}    (5)

where Taa′i is the corated item set between the active user ua and ua′i ∈ Ua, and l = |Ua|. For example, suppose that the rating histories of two users ua and ua′ are represented as

ua = [ra1, 0, 0, ra4, ra5, 0, . . . , ran]
ua′ = [ra′1, ra′2, 0, 0, ra′5, 0, . . . , 0]

where 0 indicates that the corresponding rating is missing. Then, Taa′ = [t1, t5, . . .], since both users ua and ua′ have rated t1, t5, etc. Therefore, Ts is the union of Taa′ over all ua′ ∈ Ua. Furthermore, with respect to item ts, we define the key neighborhood for the active user ua as

Na,s = {ra′i | ua′ ∈ Ua, ti ∈ Ts}    (6)

where ra′i can be either an observed or a missing rating. Normally, due to the sparsity of the user × item rating matrix, this selected key neighborhood is also sparse. We define all the missing data in this key neighborhood as the key set of missing data for the prediction ras. Each observed rating ra′i in Na,s plays a key role in the prediction of ras, as both user ua′ and item ti are highly related to ras. Even the missing data in Na,s have equal importance for the prediction of ras. Note that Na,s is defined from the user's perspective; hence, we call this the user-based AutAI. After Na,s is determined, for each missing datum ra′i in Na,s, we use the following equation to impute its value:

ra′i = ūa′ + [Σ_{ux ∈ Nk(ua′)} sim(ua′, ux) · (rxi − ūx)] / [Σ_{ux ∈ Nk(ua′)} sim(ua′, ux)]    (7)

where sim(ua′, ux) is the similarity between ua′ and ux, defined as

sim(ua′, ux) = [Σ_{ti ∈ Ta′x} (ra′i − ūa′)(rxi − ūx)] / [√(Σ_{ti ∈ Ta′x} (ra′i − ūa′)²) · √(Σ_{ti ∈ Ta′x} (rxi − ūx)²)]    (8)

which is the PCC between ua′ and ux [4]. The imputed ratings in Na,s are then put back into the user × item rating matrix R and used for the rating prediction on item ts for user ua. It is clear that the missing data are adaptively imputed for user ua on item ts, and that the key set of missing data is automatically identified from the user's perspective. The pseudocode of the user-based AutAI method is presented in Algorithm 1.
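The imputation rule (7) can be sketched as follows (all inputs, i.e., user averages, neighbor ratings, and similarities, are hypothetical; in the full method the similarities would come from (8)):

```python
import numpy as np

# A sketch of the AutAI imputation rule (7): a missing rating r_{a'i} is
# imputed as u_{a'}'s average rating plus the similarity-weighted,
# mean-centered deviations of its k nearest neighbors' ratings on item t_i.
def impute(ua_mean, neighbor_ratings, neighbor_means, sims):
    neighbor_ratings = np.asarray(neighbor_ratings, float)  # r_xi
    neighbor_means = np.asarray(neighbor_means, float)      # each neighbor's average
    sims = np.asarray(sims, float)                          # sim(u_a', u_x)
    dev = np.dot(sims, neighbor_ratings - neighbor_means) / sims.sum()
    return float(ua_mean + dev)

# u_a' averages 3.0; two neighbors rated t_i as 4 and 2, with
# averages 3.5 and 2.5 and similarities 0.8 and 0.2 to u_a'.
print(impute(3.0, [4, 2], [3.5, 2.5], [0.8, 0.2]))
```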

Algorithm 1 User-based AutAI
Input: the user × item rating matrix R; the active user ua; the active item ts.
Output: R′: the imputed matrix.
1: for each ux ∈ U do
2:   for each uy ∈ U and uy ≠ ux do
3:     calculate sim(ux, uy) according to (8);


4:   end for
5: end for
6: R′ ← R;
7: Ua ← {ua′ | ra′s ≠ ⊥};
8: Ts ← ∅;
9: for each ua′ ∈ Ua do
10:   Taa′ ← {ti | rai ≠ ⊥, ra′i ≠ ⊥};
11:   Ts ← Ts ∪ Taa′;
12: end for
13: Na,s ← {ra′i | ua′ ∈ Ua, ti ∈ Ts};
14: for each ra′i ∈ Na,s do
15:   if ra′i = ⊥ then
16:     calculate ra′i according to (7);
17:     R′(ua′, ti) ← ra′i;
18:   end if
19: end for
20: return R′;
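The flow of Algorithm 1 can be sketched compactly as below. To keep the sketch short, the imputation in step 16 is replaced by a simple column-mean fill rather than the paper's rule (7), and the matrix values are made up:

```python
import numpy as np

# A compact sketch of user-based AutAI (Algorithm 1): identify the related
# users Ua and related items Ts, then fill only the missing entries of the
# key neighborhood Na,s. Column-mean fill is a stand-in for rule (7).
def user_based_autai(R, a, s):
    R2 = R.copy()
    # Step 7: users (other than ua) who rated the active item ts.
    Ua = [u for u in range(R.shape[0]) if u != a and not np.isnan(R[u, s])]
    # Steps 8-12: Ts is the union of corated item sets between ua and each ua'.
    Ts = set()
    for u in Ua:
        Ts |= set(np.flatnonzero(~np.isnan(R[a]) & ~np.isnan(R[u])))
    # Steps 13-19: impute only the missing entries of Na,s.
    for u in Ua:
        for i in Ts:
            if np.isnan(R2[u, i]):
                R2[u, i] = np.nanmean(R[:, i])   # stand-in for rule (7)
    return R2

R = np.array([
    [5.0, np.nan, 1.0],
    [4.0, 2.0, np.nan],
    [np.nan, 1.0, 2.0],
])
print(user_based_autai(R, a=0, s=1))
```

Note that only the key neighborhood is touched: entries outside Na,s (including the query entry itself) remain missing.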

Similarly, AutAI can also work in an item-based manner. In this case, the key neighborhood for the active user ua with respect to the active item ts is defined as N′a,s = {rxs′ | ux ∈ U′a, ts′ ∈ T′s}, where T′s = {ts′ | ras′ ≠ ⊥}, U′a = {ux | ux ∈ [Uss′1 ∪ · · · ∪ Uss′i ∪ · · · ∪ Uss′l]}, Uss′i denotes the set of users who rated both ts and ts′i, and l = |T′s|. Consequently, the missing data in N′a,s form the key set of missing data in this item-based manner. Imputing these missing data is called the item-based AutAI.

The imputed missing data in Na,s contribute in two ways. First, Cover's [14] research on the nearest-neighbor rule shows that the nearest neighbor contains at least 50% of the information in an infinite training set. In this sense, the missing data in Na,s contain much more information than the missing data outside Na,s. Hence, imputing these missing data brings much more benefit than imputing other missing data. Moreover, imputation in this way also brings in fewer associated imputation errors; we provide a theoretical analysis of this point in the following section. Second, AutAI measures the similarity between each neighbor and the active user in the same Ts space. Based on the common assumption of the neighborhood-based CF algorithms that if two users rated n items similarly, they tend to rate other items similarly [5], [15], the AutAI method measures the similarity between the active user and any other neighbor on the same Ts items rather than on the corated items between the two users. Formally, we denote the similarity between two users ua and ux on imputed data as sim′(ua, ux), which can be measured by any similarity metric; for example, its PCC-based version is formulated as

sim′(ua, ux) = [Σ_{ti ∈ Ts} (rai − ūa)(rxi − ūx)] / [√(Σ_{ti ∈ Ts} (rai − ūa)²) · √(Σ_{ti ∈ Ts} (rxi − ūx)²)]    (9)

where ū_x is the average rating of user ux. On the other hand, the imputed missing data in N′_{a,s} contribute in a similar way to their counterpart in Na,s; hence, we do not list them separately. The similarity between two items, after applying the item-based AutAI, can be calculated in a similar way, and its PCC-based version is defined as

$$\mathrm{sim}'(t_s, t_i) = \frac{\sum_{u_{a'} \in U'_a} (r_{a's} - \bar{t}_s)(r_{a'i} - \bar{t}_i)}{\sqrt{\sum_{u_{a'} \in U'_a} (r_{a's} - \bar{t}_s)^2} \sqrt{\sum_{u_{a'} \in U'_a} (r_{a'i} - \bar{t}_i)^2}} \tag{10}$$

where t̄_s is the average rating of item ts.
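Once both rating vectors are aligned on the same item set Ts (with missing entries already imputed), the similarity of (9) reduces to a standard Pearson correlation over two equal-length vectors. A minimal sketch, with a helper name of our own choosing:

```python
from math import sqrt

def sim_prime(r_a, r_x):
    """PCC of Eq. (9): both inputs are rating vectors over the same T_s,
    with missing entries already imputed, so the sums align index-wise."""
    mean_a = sum(r_a) / len(r_a)
    mean_x = sum(r_x) / len(r_x)
    num = sum((a - mean_a) * (x - mean_x) for a, x in zip(r_a, r_x))
    den = sqrt(sum((a - mean_a) ** 2 for a in r_a)) * \
          sqrt(sum((x - mean_x) ** 2 for x in r_x))
    return num / den if den else 0.0   # guard against constant vectors
```

The item-based version (10) is identical in shape, with the roles of users and items swapped.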

C. Performance Analysis

It is well established that the neighbors selected in KNN-based methods are a determinant of their performance. Failure to pick the right neighbors leads to poor performance, as this directly violates the assumption of KNN-based methods [31], [33]: the regression curve around point ua is smooth, and the selected neighbors are treated as samples at ua. This indicates that failing to find the nearest neighbors will select wrong samples and make the KNN rule stop working. Here, we theoretically analyze how the proposed AutAI method affects the selection of neighbors and guarantees to find proper neighbors more accurately than the conventional neighborhood-based CF algorithms. To the best of our knowledge, this is the first work to theoretically analyze why and how imputation can alleviate the data sparsity issue in the field of CF.

Neighbor selection determines the performance of the neighborhood-based CF, and it is based on the measured similarities among users [1], [31]. Let us consider a general case: suppose the active user ua and two other users, ux and uy, have the following rating histories:

$$u_a = [r_{a1}, r_{a2}, \ldots, r_{ai}, \ldots, r_{al}]$$
$$u_x = [r_{x1}, r_{x2}, \ldots, r_{xi}, \ldots, r_{xl}]$$
$$u_y = [r_{y1}, r_{y2}, \ldots, r_{yi}, \ldots, r_{yl}]$$

where rxi may be an imputed rating on item ti for ux, and l = |Ts|. According to (5), ua contains only the real ratings of ua, but ux contains both real ratings and imputed ratings. Hence, both ua and ux can be divided into two subsets: Tax and Ts \ Tax, where Tax denotes the corated item set between ua and ux, and Ts \ Tax denotes the relative complement of Tax in Ts [30]. Therefore, ua and ux can be represented as

$$u_a = [\overbrace{r_{a(1)}, \ldots, r_{a(p)}}^{|T_{ax}|},\ \overbrace{r_{a(p+1)}, \ldots, r_{a(l)}}^{|T_s \setminus T_{ax}|}]$$
$$u_x = [\overbrace{r_{x(1)}, \ldots, r_{x(p)}}^{|T_{ax}|},\ \overbrace{r_{x(p+1)}, \ldots, r_{x(l)}}^{|T_s \setminus T_{ax}|}]$$

where p = |Tax| and l = |Ts|.

We measure the similarity between ua and ux by the distance in their item space. The distance on each item can be considered an estimate of their similarity; the distance on item ti is denoted as d_ti. Then, the distance dax between ua and ux can be expressed by

$$d_{ax} = \begin{cases} d_{t_i}, & \text{if } r_{xi} \text{ is a real rating} \\ d_{t_i} + \varepsilon_i, & \text{if } r_{xi} \text{ is imputed} \end{cases} \tag{11}$$

where ε_i = r̂_xi − r_xi denotes the imputation error on item ti for user ux, and r̂_xi denotes the imputed value.


Fig. 1. Distribution of dti .

In this paper, for two users, we assume that the distance on any item is independent and identically distributed (i.i.d.). We also adopt the normal distribution for the theoretical analysis. For example, Fig. 1 shows the distribution of d_ti in the well-known MovieLens data set. Specifically, the dashed blue line shows the real distribution of d_ti in the experimental results, whereas the solid black line shows a reference normal distribution with the mean and standard deviation of d_ti. We observe that the real distribution of d_ti is close to the normal distribution, which indicates that it is reasonable to apply the normal distribution in a theoretical analysis. Therefore, the probability density function (pdf) of d_ti is defined as

$$p(d_{t_i}) \sim \begin{cases} N(\mu_1, \sigma_1^2), & \text{for } u_a \text{ and } u_x \\ N(\mu_2, \sigma_2^2), & \text{for } u_a \text{ and } u_y. \end{cases} \tag{12}$$

Furthermore, Fig. 2 shows the distribution of ε in the experimental results, where the dashed blue line and the solid black line show the real and the simulated distributions of ε, respectively. One can see that the real distribution of ε closely resembles the normal distribution. Therefore, in this theoretical study, we suppose that the imputation error ε is also i.i.d., and its pdf is

$$p(\varepsilon) \sim N(\mu_\varepsilon, \sigma_\varepsilon^2). \tag{13}$$

Then, the distance dax between ua and ux is formulated as the expectation over all possible items, i.e., dax = E(d_t). Practically, the expectation can be estimated by the average of all observations; hence

$$d_{ax} = E(d_t) = \frac{1}{n} \sum_{i=1}^{n} d_{t_i} \tag{14}$$

Fig. 2. Distribution of ε.

where n is the number of items taken into account when measuring dax. In the user-based AutAI method, this set of items is Ts, and d^aai_ax denotes the corresponding distance dax. Since Ts consists of two subsets, Tax and Ts \ Tax, d^aai_ax is affected by items coming from both, represented by (1/p)Σ_{i=1..p} d_ti and (1/q)Σ_{j=1..q} d̂_tj, where p = |Tax|, q = |Ts \ Tax|, and d̂_tj = d_tj + ε_j represents the estimate from imputed values as defined in (11). Due to the existence of the imputation error ε, the similarity estimate coming from Ts \ Tax must take it into account. The cumulative imputation error in the estimate of d^aai_ax over Ts \ Tax is

$$\varepsilon^{aai} = \frac{1}{q} \sum_{j=1}^{q} \hat{d}_{t_j} - \frac{1}{q} \sum_{j=1}^{q} d_{t_j} = \frac{1}{q} \left( \sum_{j=1}^{q} (d_{t_j} + \varepsilon_j) - \sum_{j=1}^{q} d_{t_j} \right) = \frac{1}{q} \sum_{j=1}^{q} \varepsilon_j. \tag{15}$$

Consequently, d^aai_ax in AutAI is represented as the estimate based on real ratings plus the cumulative imputation error ε^aai, i.e., (1/l)Σ_{i=1..l} d_ti + ε^aai, where l = |Ts| and ε^aai = (1/q)Σ_{j=1..q} ε_j.

According to (13) and (15), ε^aai is also normally distributed, and its pdf is

$$p(\varepsilon^{aai}) \sim N\!\left(\mu_\varepsilon, \frac{\sigma_\varepsilon^2}{q}\right) \tag{16}$$

where q = |Ts \ Tax|. Normally, because |Ts \ Tax| ≫ 1, σ_ε²/q is much smaller than σ_ε². For example, in the MovieLens data set, mean(|Ts \ Tax|) = 85.08, which means that σ_ε²/q is around two orders of magnitude smaller than σ_ε². That is, ε^aai varies little around its mean value. Fig. 3 shows the distribution of the imputation error ε (the user-based CF is used as the imputation algorithm in our experiments) and the distribution of ε^aai in the MovieLens data set. An important observation is that the variance of ε^aai is much smaller than that of ε. This is exactly in line with the above analysis; hence, the variance of ε^aai can be neglected. Consequently, the measurement of d^aai_ax over Ts is defined as

$$d^{aai}_{ax} = E(d_t) = \frac{1}{l} \sum_{i=1}^{l} d_{t_i} + \mu_\varepsilon \tag{17}$$


Fig. 3. Distribution of ε and εaai.

where l = |Ts|. According to (12), (13), and (17), we obtain that

$$p\left(d^{aai}_{ax}\right) \sim N\!\left(\mu_1 + \mu_\varepsilon, \frac{\sigma_1^2}{l}\right). \tag{18}$$

Similarly, the pdf of the distance between ua and uy in AutAI, i.e., d^aai_ay, is

$$p\left(d^{aai}_{ay}\right) \sim N\!\left(\mu_2 + \mu_\varepsilon, \frac{\sigma_2^2}{l}\right). \tag{19}$$

We define the distance divergence ψ between the distances of two users ux and uy to the active user ua as

$$\psi = d_{ax} - d_{ay} \tag{20}$$

which determines their ordering as neighbors of the active user. It is well known that the confidence interval (CI) can be used to indicate the reliability of an estimate [30], and it has been widely used in research on neighborhood-based models [31]. Therefore, in this study, we apply the CI to measure the reliability of the estimate of ψ. A CI is a range of values that quantifies the uncertainty of an estimate, and a narrow CI means high precision [30]. According to (18) and (19), the 100(1 − α)% CI for ψ in AutAI can be formulated as follows:

$$\mathrm{CI}(\psi^{aai}) = \left((\mu_1 + \mu_\varepsilon) - (\mu_2 + \mu_\varepsilon)\right) \pm z_{\alpha/2} \sqrt{\frac{\sigma_1^2/l}{l} + \frac{\sigma_2^2/l}{l}} = (\mu_1 - \mu_2) \pm z_{\alpha/2} \sqrt{\frac{\sigma_1^2}{l^2} + \frac{\sigma_2^2}{l^2}} \tag{21}$$

where z_{α/2} is the standard normal variate that is exceeded with a probability of α/2. We define σ_ψ^aai as the standard error of ψ^aai, i.e.,

$$\sigma_{\psi^{aai}} = \sqrt{\frac{\sigma_1^2}{l^2} + \frac{\sigma_2^2}{l^2}}. \tag{22}$$

Please note that the width of CI(ψ^aai) is proportional to σ_ψ^aai.

Now, let us consider the conventional neighborhood-based CF. The distance between ua and ux, i.e., d^knn_ax, is estimated over Tax, which is

$$d^{knn}_{ax} = E(d_t) = \frac{1}{p} \sum_{i=1}^{p} d_{t_i} \tag{23}$$

where p = |Tax|. According to (12) and (23), we have

$$p\left(d^{knn}_{ax}\right) \sim N\!\left(\mu_1, \frac{\sigma_1^2}{p_1}\right) \tag{24}$$

where p₁ = |Tax|. Similarly, d^knn_ay, the distance between ua and uy in the conventional neighborhood-based CF, also follows the normal distribution, and its pdf is

$$p\left(d^{knn}_{ay}\right) \sim N\!\left(\mu_2, \frac{\sigma_2^2}{p_2}\right) \tag{25}$$

where p₂ = |Tay|. Therefore, according to (24) and (25), the 100(1 − α)% CI for ψ^knn, i.e., ψ in neighborhood-based CF approaches, is

$$\mathrm{CI}(\psi^{knn}) = (\mu_1 - \mu_2) \pm z_{\alpha/2} \sqrt{\frac{\sigma_1^2/p_1}{p_1} + \frac{\sigma_2^2/p_2}{p_2}} = (\mu_1 - \mu_2) \pm z_{\alpha/2} \sqrt{\frac{\sigma_1^2}{p_1^2} + \frac{\sigma_2^2}{p_2^2}} \tag{26}$$

where z_{α/2} is the standard normal variate that is exceeded with a probability of α/2, and σ_ψ^knn is the standard error of the estimated ψ^knn, i.e.,

$$\sigma_{\psi^{knn}} = \sqrt{\frac{\sigma_1^2}{p_1^2} + \frac{\sigma_2^2}{p_2^2}} \tag{27}$$

where p₁ = |Tax| and p₂ = |Tay|. Furthermore, the width of CI(ψ^knn) is proportional to σ_ψ^knn.

According to (5), it is clear that l = |Ts| ≥ |Tax| = p₁ and l = |Ts| ≥ |Tay| = p₂. For example, in the benchmark data set MovieLens, mean(|Ts|) = 103.47 and mean(|Tax|) = 18.4. Together with (22) and (27), we obtain

$$\sigma_{\psi^{aai}} \leq \sigma_{\psi^{knn}} \tag{28}$$

with equality if and only if |Ts| = |Tax| and |Ts| = |Tay|, which does not hold in the presence of the data sparsity issue in the field of CF. Furthermore, according to (21) and (26), CI(ψ^aai) is narrower than or equal to CI(ψ^knn), since σ_ψ^aai is smaller than or equal to σ_ψ^knn. Based on the above theoretical analysis, we conclude that the proposed AutAI method effectively improves the performance of the conventional neighborhood-based CF methods through more accurate nearest-neighbor selection, by exploiting the sparse rating matrix in a novel way.
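The effect of averaging over the larger set Ts can be illustrated numerically. The figures below (means, variances, and set sizes) are arbitrary stand-ins of our own choosing, not values from the paper; the simulation checks that estimating each distance from more items makes wrong neighbor rankings rarer.

```python
import random

random.seed(0)

# Illustrative stand-ins for the paper's quantities (not its actual values)
mu1, sigma1 = 0.8, 0.5      # per-item distance distribution for (u_a, u_x)
mu2, sigma2 = 1.0, 0.5      # per-item distance distribution for (u_a, u_y)
l, p1, p2 = 100, 18, 18     # |T_s| vs. |T_ax|, |T_ay|

# Standard errors of psi per Eqs. (22) and (27): AutAI averages over the
# larger set T_s, so its CI for the distance divergence is never wider.
se_aai = (sigma1**2 / l**2 + sigma2**2 / l**2) ** 0.5
se_knn = (sigma1**2 / p1**2 + sigma2**2 / p2**2) ** 0.5
assert se_aai <= se_knn     # Eq. (28)

def flip_rate(n_items, trials=20000):
    """Fraction of trials in which the estimated distances rank u_x and u_y
    in the wrong order (u_x is truly closer, since mu1 < mu2)."""
    flips = 0
    for _ in range(trials):
        dax = sum(random.gauss(mu1, sigma1) for _ in range(n_items)) / n_items
        day = sum(random.gauss(mu2, sigma2) for _ in range(n_items)) / n_items
        flips += dax > day
    return flips / trials

f_aai, f_knn = flip_rate(l), flip_rate(p1)
assert f_aai < f_knn        # more items per estimate -> fewer wrong rankings
```

Under these stand-in parameters, the estimate over l = 100 items misranks the two neighbors far less often than the estimate over p = 18 corated items, mirroring the CI comparison above.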


Algorithm 2 The AutAI-Fusion Algorithm
Input: R: user-item rating matrix; ua: the active user; ts: the active item; k: the number of neighbors; λ: tradeoff parameter;
Output: r̂as: rating on item ts by user ua

 1: Ua ← {ua′ | ra′s ≠ ⊥};
 2: Ts ← {ti | ti ∈ [Taa′1 ∪ · · · ∪ Taa′i ∪ · · · ∪ Taa′l]};
 3: impR ← AutAI(R, ua, ts);
 4: for each ua′ ∈ Ua do
 5:   calculate sim′(ua, ua′) according to (9);
 6: end for
 7: Nk(ua) ← the top-k nearest neighbors of ua;
 8: calculate r̂^u_as according to (29);
 9: T′s ← {ts′ | ras′ ≠ ⊥};
10: U′a ← {ux | ux ∈ [Uss′1 ∪ · · · ∪ Uss′i ∪ · · · ∪ Uss′l]};
11: impR′ ← AutAI(Rᵀ, ts, ua);
12: for each ts′ ∈ T′s do
13:   calculate sim′(ts, ts′) according to (10);
14: end for
15: Nk(ts) ← the top-k nearest neighbors of ts;
16: calculate r̂^i_as according to (30);
17: r̂as ← λ r̂^u_as + (1 − λ) r̂^i_as;
18: return r̂as;

D. Rating Prediction

After the imputation is done, the question is how to predict a rating for the active user ua on item ts. By taking both the users' activity and the items' popularity into consideration, we propose a novel algorithm that applies AutAI from both the user and the item perspectives (AutAI-Fusion).

To take users' activity into consideration, we first perform a user-based AutAI imputation, as described in (6), and then make a prediction as follows. The prediction for ras in this user-based manner is calculated as

$$\hat{r}^{u}_{as} = \bar{u}_a + \frac{\sum_{u_x \in N_k(u_a)} \mathrm{sim}'(u_a, u_x) \times (r_{xs} - \bar{u}_x)}{\sum_{u_x \in N_k(u_a)} \mathrm{sim}'(u_a, u_x)} \tag{29}$$

where ū_a is the average rating of ua, and sim′(ua, ux) is defined in (9).

Similarly, to take item popularity into consideration, we perform an item-based AutAI imputation, as described in Section III-B. Then, the prediction for ras in this item-based manner is calculated as

$$\hat{r}^{i}_{as} = \bar{t}_s + \frac{\sum_{t_i \in N_k(t_s)} \mathrm{sim}'(t_s, t_i) \times (r_{ai} - \bar{t}_i)}{\sum_{t_i \in N_k(t_s)} \mathrm{sim}'(t_s, t_i)} \tag{30}$$

where t̄_s is the average rating of ts, and sim′(ts, ti) is defined in (10).

After imputing and predicting from both the user and the item perspectives, the final prediction for ras in the proposed AutAI-Fusion algorithm is calculated as

$$\hat{r}_{as} = \lambda \hat{r}^{u}_{as} + (1 - \lambda) \hat{r}^{i}_{as}. \tag{31}$$
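The prediction and fusion steps of (29)-(31) can be sketched as follows. This is an illustrative reading, not the authors' code: the neighbor triples and similarities are assumed to have been computed on the imputed matrix as described above, and the function names are ours.

```python
def predict_user_based(mean_a, neighbours):
    """Eq. (29): neighbours is a list of (sim, r_xs, mean_x) triples drawn
    from N_k(u_a); each holds a neighbour's similarity to u_a, its rating
    on the active item, and its average rating."""
    den = sum(s for s, _, _ in neighbours)
    if den == 0:
        return mean_a                      # fall back to the user's mean
    num = sum(s * (r - m) for s, r, m in neighbours)
    return mean_a + num / den

def fuse(r_user, r_item, lam=0.4):
    """Eq. (31): lambda trades off the user-based prediction against the
    item-based one; Eq. (30) has the same shape with item means."""
    return lam * r_user + (1 - lam) * r_item
```

The item-based prediction of (30) is obtained from the same helper by passing item similarities and item averages instead.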

TABLE I. MAE COMPARISON ON DIFFERENT SIMILARITY MEASUREMENT METRICS ON THE Given DATA SET (A SMALLER VALUE MEANS BETTER PERFORMANCE)

TABLE II. MAE COMPARISON ON DIFFERENT SIMILARITY MEASUREMENT METRICS ON THE All-But-One DATA SET (A SMALLER VALUE MEANS BETTER PERFORMANCE)

The parameter λ determines to what extent the final prediction is based on the user-based or the item-based AutAI imputation prediction. When λ = 1, the prediction is completely generated from user activity through a user-based AutAI prediction; when λ = 0, the prediction is entirely estimated from item popularity through an item-based AutAI prediction. The value of λ can be determined by cross-validation, which is further discussed in the experiment section. The pseudocode of the AutAI-Fusion algorithm is presented in Algorithm 2.

IV. EXPERIMENT

Here, we conduct several experiments to examine the performance of the proposed AutAI method.

A. Experimental Setup

The data set we experiment with is the popular benchmark data set MovieLens¹, which includes around 1 million ratings collected from 6040 users on 3900 movies.

Of particular interest in CF research is the relationship between data sparsity and the generated recommendations. To evaluate the performance thoroughly, we extract a subset of 2000 users from MovieLens who rated at least 30 movies, and we set up several different experimental configurations. Specifically, we split the selected subset into two sets, namely, the Training set and the Test set. The size of the Training set varies over the first 500, 1000, and 1500 users, denoted as M500, M1000, and M1500, respectively. The remaining 500 users are treated as the Test set. For each active user within the Test set, we alter the number of rated items provided from 10 and 20 to 30, represented as Given10, Given20, and Given30, respectively. This protocol is widely used in CF research [8], [9], [34]. Furthermore, we also apply the

1http://www.grouplens.org/


Fig. 4. Sensitivity to the size of the neighborhood on the Given data set for PCC and COS. (a) PCC on M500Given10. (b) PCC on M500Given30. (c) PCC on M1500Given30. (d) COS on M500Given10. (e) COS on M500Given30. (f) COS on M1500Given30.

All-But-One configuration, in which we randomly select one single rating for each user in the data set and then try to predict its value while observing all the other ratings the user has given.

For consistency with the literature [8], [9], [34], we apply the mean absolute error (MAE) as the measurement metric, which is defined as

$$\mathrm{MAE} = \frac{\sum_{(a,s) \in X} |\hat{r}_{as} - r_{as}|}{|X|} \tag{32}$$

where ras is the true rating given by user ua on item ts, r̂as is the predicted rating, X denotes the test data set, and |X| represents its size. A smaller MAE value means better performance.
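The metric of (32) is straightforward to compute; a minimal sketch with a helper name of our own:

```python
def mae(predicted, actual):
    """Mean absolute error over the test set X, per Eq. (32)."""
    assert len(predicted) == len(actual) and predicted
    return sum(abs(p, ) if False else abs(p - a)
               for p, a in zip(predicted, actual)) / len(predicted)
```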

B. Performance With Different Similarity Metrics

In order to evaluate the performance of AutAI, we apply the proposed AutAI method to two state-of-the-art similarity metrics in CF [1], [2], [8], [9], [34], namely, the PCC and the cosine-based similarity (COS). We implement four algorithms in the user-based manner, namely, the user-based PCC algorithm, the user-based COS algorithm, and the user-based AutAI with PCC and with COS, respectively. Then, we compare their prediction performance, with the number of neighbors set to 30. We examine the AutAI performance on all the experimental configurations, including the Given and All-But-One data sets.

The results on the Given and All-But-One data sets are shown in Tables I and II, respectively. The results show that AutAI achieves significant improvements on both similarity metrics in all experimental configurations. This indicates that the proposed AutAI method works well and is robust in different sparsity situations. On the All-But-One data set, AutAI achieves even larger improvements on both the PCC and COS metrics, as shown in Table II. For PCC, the MAE is reduced from 0.7477 to 0.6910, i.e., by 7.58%, and for COS, it is reduced from 0.7652 to 0.6932, i.e., by 9.41%. This is mainly because there is no limitation on the users' rating history; hence, AutAI can work fully in this natural setting.

Moreover, in order to examine AutAI's performance thoroughly, we vary the number of neighbors from 5 to 40 to examine its sensitivity to the neighborhood size. We report the results on M500Given10, M500Given30, and M1500Given30 for both PCC and COS, as shown in Fig. 4; Fig. 5 shows the results on the All-But-One data set. We observe that the neighborhood size does affect the performance. The AutAI method outperforms its counterpart across all neighborhood sizes from 5 to 40 on all data sets. This indicates that the AutAI method identifies the neighborhood relationship more accurately than the conventional neighborhood-based CF methods, as we demonstrate in Section III-C, and, therefore, achieves better performance across all neighborhood sizes. Specifically, when the number of ratings for each user is fixed, for example, in the Given30 data set in Fig. 4(b) and (c), the performance of the user-based CF method improves as the size of the training data increases from M500 to M1500. However, the AutAI method always yields a smaller MAE than the user-based CF at all neighborhood sizes. Alternatively, when the size of the training set is fixed, for example, the M500 data set in Fig. 4(d) and (e), the user-based CF also achieves improved performance as the number of given ratings increases from Given10 to Given30. Similarly, the proposed AutAI method keeps


Fig. 5. Sensitivity to the size of neighborhood on All-But-One. (a) PCC on All-But-One. (b) COS on All-But-One.

TABLE III. PAIRED t-TEST FOR MAE ON THE Given DATA SET AND THE All-But-One DATA SET

performing better than its counterpart across all neighborhood sizes. Similar trends are observed on all the other data sets. In Fig. 5(a) and (b), it is clearly observed that AutAI achieves much larger improvements on the All-But-One data set than on the Given data sets. This is because the full rating history of the users is available, which allows AutAI to discover the key neighborhood as completely and as accurately as possible.

The two-tailed paired t-test with a 95% confidence level has been applied to evaluate the performance of AutAI on both similarity metrics. The results show that the difference in performance with and without AutAI is statistically significant. The detailed paired t-statistics are shown in Table III.

C. Comparison With Other Methods

Here, we examine the proposed AutAI-Fusion algorithm by comparing it with other state-of-the-art imputation-based algorithms, including the default voting method [5], the EMDP method [9], and the Scalable Cluster-based Pearson Correlation Coefficient (SCBPCC) method [8]; two traditional CF algorithms, namely, the user-based CF (UPCC) and the item-based CF (IPCC); and one model-based algorithm, namely, the Slope One algorithm [23]. The parameters in SCBPCC are set as λ = 0.35 and the cluster number K = 20. All the parameters in EMDP are set as in [9], namely, λ = 0.7, γ = 30, δ = 25, and η = θ = 0.4. Please note that EMDP is an imputation-based algorithm that fuses the user-based CF algorithm and the item-based CF algorithm. The parameter λ in our algorithm is set to 0.4.

The results on the Given and the All-But-One data sets are shown in Tables V and IV, respectively. Clearly, it is observed that AutAI-Fusion outperforms all of the other six algorithms on all configurations. Specifically, on the M500Given10 data set, although all imputation-based methods (AutAI-Fusion, EMDP, SCBPCC, and default voting) achieve better performance than UPCC and IPCC, AutAI-Fusion achieves the largest improvement. This trend is evident on all data sets, as shown by the figures in bold in Tables IV and V. This indicates that, when the training set is small and the rating history of users is limited, AutAI-Fusion can still identify the most determinant missing values to fill in, in order to identify more appropriate relationships among neighbors. On the All-But-One data set, as shown in Table IV, AutAI-Fusion obtains even better results than all compared algorithms, achieving a much smaller MAE of 0.6864. This is mainly because there is no limitation on the rating history of the users, and AutAI can work more efficiently in this natural configuration. Table VI lists the paired t-test statistics (with a 95% confidence level) between AutAI-Fusion and these algorithms, and it is clear that the differences between the performance of AutAI-Fusion and that of the other methods are statistically significant.

D. Comparison on Imputation Percentage

A particular issue of interest regarding imputation-based methods is how much of the missing data in the user × item rating matrix should be filled in to improve performance. Here, we examine this issue by comparing the proposed AutAI-Fusion algorithm with three other state-of-the-art imputation-based methods, namely, the default voting method [5], the EMDP method [9], and the SCBPCC method [8]. We define the imputation percentage as the percentage of imputed missing data, which represents how much of the missing data is imputed. Clearly, a smaller imputation percentage means better performance in terms of imputation complexity.

Table VII shows the imputation percentage of the examined algorithms on the Given data set. We can see that the proposed AutAI-Fusion algorithm requires the lowest imputation percentage on all Given data sets. Specifically, on M500Given10, AutAI-Fusion fills in 43.37% of the missing data, whereas default voting needs to fill in 70.15% of the missing data,


TABLE IV. MAE COMPARISON WITH OTHER METHODS ON THE All-But-One DATA SET (A SMALLER VALUE MEANS BETTER PERFORMANCE)

TABLE V. MAE COMPARISON WITH OTHER METHODS ON THE Given DATA SET (A SMALLER VALUE MEANS BETTER PERFORMANCE)

TABLE VI. PAIRED t-TEST FOR MAE ON THE Given DATA SET AND THE All-But-One DATA SET

EMDP needs to fill in 73.66%, and SCBPCC fills in all the missing data. Please note that SCBPCC makes predictions for all missing data by using clustering algorithms; hence, its imputation percentage is always 100.00%.

Table VIII shows the imputation percentage of the compared algorithms on the All-But-One data set. It is also observed that the proposed AutAI-Fusion algorithm obtains the lowest imputation percentage. Specifically, AutAI-Fusion imputes 79.22% of the missing data, whereas the other compared algorithms impute almost all of it: default voting fills in 93.49% of the missing data, EMDP imputes 98.55%, and SCBPCC imputes all of it. Together with the results shown in Table VII, the proposed AutAI-Fusion algorithm consistently imputes the least amount of missing data among all compared imputation-based methods and achieves the best accuracy, as shown in Tables IV and V. This indicates that AutAI-Fusion can identify the most informative missing data to impute and, consequently, achieve better performance in terms of both accuracy and imputation complexity.

TABLE VII. IMPUTATION PERCENTAGE COMPARISON WITH OTHER IMPUTATION-BASED METHODS ON THE Given DATA SET (A SMALLER VALUE MEANS LESS IMPUTATION COMPLEXITY)

TABLE VIII. IMPUTATION PERCENTAGE COMPARISON WITH OTHER IMPUTATION-BASED METHODS ON THE All-But-One DATA SET (A SMALLER VALUE MEANS LESS IMPUTATION COMPLEXITY)

E. Impact of Parameter λ

As discussed in Section III-D, we introduced the parameter λ to balance the predictions from the user-based AutAI and the item-based AutAI, so as to simultaneously consider both the activity of users and the popularity of items. We conducted several experiments to determine the impact of λ on the proposed AutAI-Fusion algorithm. Specifically, we vary the λ value from 0 to 1 in steps of 0.05. The results are presented in Fig. 6.

When λ = 0, the prediction depends entirely on the item-based AutAI imputation; when λ = 1, it depends entirely on the user-based AutAI imputation. The results on the Given data sets show that, when there are few training users or few ratings from the active user, the item-based AutAI achieves a significant improvement over the user-based AutAI. For example, on the data sets M500Given10 and M500Given30, as shown in Fig. 6(a) and (b), the performance of AutAI-Fusion with λ = 0 is better than its performance with λ = 1. This is because a limited user rating history suppresses the effectiveness of the user-based AutAI. However, AutAI-Fusion achieves an even better performance with λ = 0.4. On the other hand, Fig. 6(c) shows that the user-based and the item-based AutAI achieve similar performance on the All-But-One configuration, which indicates that, in a natural situation, there is no large difference between them. Yet, on all configurations, it is clear that better accuracy can be obtained by combining the imputations from users and items.
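The cross-validation mentioned above for choosing λ can be sketched as a simple grid search. Here `evaluate_mae` is a hypothetical callback, not part of the paper, that would run AutAI-Fusion with a given λ on a held-out validation split and return its MAE:

```python
def select_lambda(evaluate_mae, step=0.05):
    """Grid-search lambda in [0, 1], keeping the value with the lowest
    held-out MAE. evaluate_mae(lam) is a hypothetical callback running
    AutAI-Fusion with the given lambda on a validation split."""
    grid = [round(i * step, 2) for i in range(int(round(1 / step)) + 1)]
    return min(grid, key=evaluate_mae)
```

With a validation curve shaped like the ones in Fig. 6, such a search would land near λ = 0.4, the value used in the comparison experiments.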


Fig. 6. Impact of parameter λ on MAE on both the Given and the All-But-One data set. (a) M500 Given10. (b) M500 Given30. (c) All-But-One.

F. Discussion

Here, we present a short discussion of the practicability of the proposed method based on our theoretical and empirical study.

As the proposed AutAI method is independent of the similarity metric used for the identification of nearest neighbors, practitioners from different fields can use it in conjunction with any desired similarity metric. In addition, AutAI determines the imputation area automatically, based on historical user/item information, without administrative interaction. This makes it very convenient to configure for industrial use. Compared with other state-of-the-art CF methods [5], [8], [9], [23], the proposed AutAI-Fusion algorithm displays higher accuracy and robustness.

In industry, e.g., Amazon.com and Netflix.com, there are large numbers of users and items, which represent a large number of samples for each rating prediction. Taking this into account, the proposed AutAI method can significantly reduce the bias and variance of rating predictions because it identifies the neighborhood relationship more accurately than the conventional neighborhood-based CF methods. On the other hand, a large number of users and items usually leads to a large number of key missing data to impute. However, compared with other state-of-the-art imputation-based methods [5], [8], [9], AutAI also has relatively lower imputation complexity, as shown in Section IV-D. From this point of view, the proposed AutAI method is more practical than other state-of-the-art imputation-based methods.

V. CONCLUSION

In this paper, by treating each rating given by a user on an item as an observation for the prediction of this item for other users, we define the notion of a key set of missing data for the rating prediction and propose AutAI, a method to automatically and adaptively identify these key missing data for each prediction from both the user and the item perspectives. Theoretical and empirical analyses show that imputation based on this key set of missing data leads to a more accurate identification of neighborhood relationships than the conventional neighborhood-based CF methods. Therefore, the AutAI-based algorithm makes more accurate predictions than state-of-the-art neighborhood-based CF algorithms, including other imputation-based algorithms. To the best of our knowledge, this paper presents the first theoretical analysis of imputation-based CF methods, and the proposed AutAI method is guaranteed to work regardless of which similarity metric is applied. In the future, we plan to investigate how to incorporate context information into the proposed approach in order to further improve the quality of a neighborhood.

REFERENCES

[1] G. Adomavicius and A. Tuzhilin, "Toward the next generation of recommender systems: A survey of the state-of-the-art and possible extensions," IEEE Trans. Knowl. Data Eng., vol. 17, no. 6, pp. 734-749, Jun. 2005.
[2] C. Desrosiers and G. Karypis, "A comprehensive survey of neighborhood-based recommendation methods," in Recommender Systems Handbook, F. Ricci, L. Rokach, B. Shapira, and P. B. Kantor, Eds. New York: Springer-Verlag, 2011, ch. 4, pp. 107-144.
[3] R. M. Bell and Y. Koren, "Lessons from the Netflix prize challenge," ACM SIGKDD Explor. Newslett., vol. 9, no. 2, pp. 75-79, Dec. 2007.
[4] P. Resnick, N. Iacovou, M. Suchak, P. Bergstrom, and J. Riedl, "GroupLens: An open architecture for collaborative filtering of netnews," in Proc. ACM Conf. Comput. Supported Coop. Work, 1994, pp. 175-186.
[5] J. Breese, D. Heckerman, and C. Kadie, "Empirical analysis of predictive algorithms for collaborative filtering," in Proc. 14th Conf. Uncertainty Artif. Intell., 1998, pp. 43-52.
[6] L. M. D. Campos, J. M. Fernández-Luna, J. F. Huete, and M. A. Rueda-Morales, "Measuring predictive capability in collaborative filtering," in Proc. 3rd ACM Conf. Recommender Syst., 2009, pp. 313-316.
[7] M. Larson and A. Hanjalic, "Exploiting user similarity based on rated-item pools for improved user-based collaborative filtering," in Proc. 3rd ACM Conf. Recommender Syst., 2009, pp. 125-132.
[8] G.-R. Xue, C. Lin, Q. Yang, W. Xi, H.-J. Zeng, Y. Yu, and Z. Chen, "Scalable collaborative filtering using cluster-based smoothing," in Proc. SIGIR, 2005, pp. 114-121.
[9] H. Ma, I. King, and M. R. Lyu, "Effective missing data prediction for collaborative filtering," in Proc. SIGIR, 2007, pp. 39-46.
[10] X. Su, T. M. Khoshgoftaar, and R. Greiner, "A mixture imputation-boosted collaborative filter," in Proc. FLAIRS Conf., 2008, pp. 312-316.
[11] X. Zhu, S. Zhang, Z. Jin, Z. Zhang, and Z. Xu, "Missing value estimation for mixed-attribute data sets," IEEE Trans. Knowl. Data Eng., vol. 23, no. 1, pp. 110-121, Jan. 2011.
[12] S. Zhang, Z. Jin, and X. Zhu, "Missing data imputation by utilizing information within incomplete instances," J. Syst. Softw., vol. 84, no. 3, pp. 452-459, Mar. 2011.
[13] S. Zhang, "Shell-neighbor method and its application in missing data imputation," Appl. Intell., vol. 35, no. 1, pp. 123-133, Aug. 2011.
[14] T. Cover, "Estimation by the nearest neighbor rule," IEEE Trans. Inf. Theory, vol. IT-14, no. 1, pp. 50-55, Jan. 1968.
[15] B. Sarwar, G. Karypis, J. Konstan, and J. Riedl, "Item-based collaborative filtering recommendation algorithms," in Proc. 10th Int. Conf. World Wide Web, 2001, pp. 285-295.
[16] S. A. Goldman and M. K. Warmuth, "Learning binary relations using weighted majority voting," Mach. Learn., vol. 20, no. 3, pp. 245-271, Sep. 1995.
[17] R. M. Bell and Y. Koren, "Improved neighborhood-based collaborative filtering," in Proc. KDD-Cup Workshop, 13th Int. Conf. ACM SIGKDD, 2007, pp. 7-14.


[18] F. Garcin, B. Faltings, R. Jurca, and N. Joswig, "Rating aggregation in collaborative filtering systems," in Proc. 3rd ACM Conf. Recommender Syst., 2009, pp. 349–352.

[19] A. Nakamura and N. Abe, "Collaborative filtering using weighted majority prediction algorithms," in Proc. Int. Conf. Mach. Learn., 1998, pp. 395–403.

[20] X. Su and T. Khoshgoftaar, "Collaborative filtering for multi-class data using belief nets algorithms," in Proc. IEEE 18th ICTAI, Nov. 2006, pp. 497–504.

[21] B. Sarwar, G. Karypis, and J. Konstan, "Recommender systems for large-scale e-commerce: Scalable neighborhood formation using clustering," in Proc. 5th Int. Conf. Comput. Inf. Technol., 2002, pp. 1–6.

[22] T. Hofmann and J. Puzicha, "Latent class models for collaborative filtering," in Proc. Int. Joint Conf. Artif. Intell., 1999, vol. 16, pp. 688–693.

[23] D. Lemire and A. Maclachlan, “Slope one predictors for online rating-based collaborative filtering,” in Proc. SDM, 2005, pp. 1–5.

[24] D. Billsus and M. Pazzani, "Learning collaborative information filters," in Proc. 15th Int. Conf. Mach. Learn., 1998, pp. 46–54.

[25] Y. Koren, "Factorization meets the neighborhood: A multifaceted collaborative filtering model," in Proc. SIGKDD, 2008, pp. 426–434.

[26] R. Bell, Y. Koren, and C. Volinsky, "Modeling relationships at multiple scales to improve accuracy of large recommender systems," in Proc. 13th Int. Conf. ACM SIGKDD, 2007, pp. 95–104.

[27] M. R. McLaughlin and J. L. Herlocker, "A collaborative filtering algorithm and evaluation metric that accurately model the user experience," in Proc. SIGIR, 2004, pp. 329–336.

[28] C. Desrosiers and G. Karypis, "A novel approach to compute similarities and its application to item recommendation," in Proc. PRICAI, 2010, pp. 39–51.

[29] S. Chee, J. Han, and K. Wang, "RecTree: An efficient collaborative filtering method," in Proc. Data Warehousing Knowl. Discovery, 2001, pp. 141–151.

[30] R. Johnson and G. K. Bhattacharyya, Statistics: Principles and Methods, 6th ed. Hoboken, NJ: Wiley, 2009.

[31] N. S. Altman, "An introduction to kernel and nearest-neighbor nonparametric regression," Amer. Stat., vol. 46, no. 3, pp. 175–181, Aug. 1992.

[32] X. Su and T. M. Khoshgoftaar, "A survey of collaborative filtering techniques," Adv. Artif. Intell., vol. 2009, pp. 421425-1–421425-19, Jan. 2009.

[33] J. Zhang, Y. Xiang, Y. Wang, W. Zhou, Y. Xiang, and Y. Guan, "Network traffic classification using correlation information," IEEE Trans. Parallel Distrib. Syst., vol. 24, no. 1, pp. 104–117, Jan. 2013.

[34] J. Wang, A. P. de Vries, and M. J. T. Reinders, "Unifying user-based and item-based collaborative filtering approaches by similarity fusion," in Proc. SIGIR, 2006, pp. 501–508.

Yongli Ren was born in Henan, China, in 1983. He received the B.Eng. and M.Eng. degrees from Zhengzhou University, Zhengzhou, China, in 2006 and 2009, respectively. He is currently working toward the Ph.D. degree in the School of Information Technology, Deakin University, Australia.

His research interests include recommender systems, data mining, and pattern recognition.

Dr. Ren was a recipient of the "best paper" award at the IEEE/Association for Computing Machinery (ACM) Advances in Social Networks Analysis and Mining (ASONAM) 2012 Conference.

Gang Li (M'11) received the Ph.D. degree from Deakin University, Melbourne, Australia, in 2005.

He is a Senior Lecturer with the School of Information Technology, Deakin University, Australia. His research interests include data mining, machine learning, and multimedia analysis.

Mr. Li has coauthored three papers that have won "best paper" prizes, including the IEEE/Association for Computing Machinery (ACM) Advances in Social Networks Analysis and Mining (ASONAM) 2012 Conference "best paper" award and the 2007 Nightingale Prize by the Springer journal Medical and Biological Engineering and Computing. He has also conducted projects on tourism and hospitality research. He served on the Program Committee for over 40 international conferences in artificial intelligence, data mining, machine learning, and tourism and hospitality management. He is a regular reviewer for international journals in the areas of data mining, computer network, and tourism/hospitality management.

Jun Zhang (M'12) received the Ph.D. degree from the University of Wollongong, Wollongong, Australia, in 2011.

He is currently with the School of Information Technology, Deakin University, Australia. He has published more than 30 research papers in refereed international journals and conferences, such as the IEEE TRANSACTIONS ON IMAGE PROCESSING, the IEEE TRANSACTIONS ON PARALLEL AND DISTRIBUTED SYSTEMS, and the IEEE TRANSACTIONS ON INFORMATION FORENSICS AND SECURITY. His research interests include network and system security, pattern recognition, and multimedia retrieval.

Dr. Zhang was a recipient of the 2009 Chinese Government Award for Outstanding Self-Financed Student Abroad.

Wanlei Zhou (M'91–SM'09) received the Ph.D. degree from the Australian National University, Canberra, Australia, in 1991 and the D.Sc. degree from Deakin University, Australia, in 2002.

He is currently the Chair Professor of information technology and the Head of the School of Information Technology, Deakin University, Melbourne, Australia. He has published more than 200 papers in refereed international journals and refereed international conference proceedings. His research interests include distributed and parallel systems, network security, mobile computing, bioinformatics, and e-learning.