
Recommendations Using Coupled Matrix Factorization

No Author Given

No Institute Given

Abstract. Consider the task of recommending movies to users, such as the Netflix challenge. Given additional information for the movies, e.g. features derived from the Internet Movie Database, are we able to improve the recommendation quality? In this work, we pose the aforementioned problem (which encompasses a variety of different scenarios, all under the family of recommender systems/collaborative filtering) as a regularized Coupled Matrix Factorization instance. We demonstrate the effectiveness of our approach over traditional collaborative filtering approaches that employ matrix factorization. In particular, we show that for two different recommendation tasks, viz. movie recommendation and question recommendation for an "ask an expert" web site, our model outperforms state-of-the-art matrix factorization models in terms of root mean squared error.

1 Introduction

Given a set of users, a set of movies, as well as a rating (typically a number from 1-5) that each user gave to every movie they watched, the objective is to predict the rating that a particular user would give to a movie, if they were to watch it; this prediction will effectively be used as a movie recommendation, if the predicted rating exceeds a certain threshold. This problem belongs to the general family of recommender systems, a problem family rich in real-world applications.

One of the most prominent approaches to tackling this problem is Collaborative Filtering, a class of algorithms which attempt to use information from all users in order to come up with a prediction. In particular, doing so leverages similarities amongst users and movies, and has been shown to be very effective. A very popular solution to Collaborative Filtering is through Matrix Factorization; drawing from the movie recommendation example, we form a matrix whose rows correspond to movies and whose columns correspond to users; each element of the matrix contains either the rating of a user for a particular movie, or zero if no rating is provided. The end goal is to fill in the missing entries of that matrix in such a way that a particular cost function is minimized.

Suppose now that we gather additional data for each movie; for instance, we might have genre information, or rating information from some online repository such as the Internet Movie Database (IMDb). Can this additional information prove helpful in the aforementioned task? And if so, how can we put this additional information into play, framing the problem as an instance of collaborative filtering? Recently, there has been an emerging interest in this type of information fusion; e.g. in [26], the authors combine user friendship information (a user-user matrix) with user interests (a user-service matrix), in order to enhance the recommendation task at both ends. In this work, we propose to express this problem as an instance of Coupled Matrix Factorization (CMF).

Additionally, this problem formulation proves to be highly generic, so that it accommodates a variety of other applications; in particular, we also apply the model to perform collaborative filtering for users of stackoverflow.com [3]. This is a forum where registered users can post questions and assign relevant tags to them. Multiple users provide answers to a question, and one of them may be chosen as the best answer. Irrespective of whether an answer is chosen as best or not, all answers receive votes/points from the community. These points, along with the tags of the question, are added to the profile of the user who provided the answer. Hence, the tags and points aggregated over all answers provided by a user make up a set of technology tags of interest to that user. Ideally, questions with related tags that have not yet been attached to a user's profile should also be shown to him/her. We use a similar approach as for Netflix to provide question recommendations to a user based on tags. The rows of the matrix correspond to users and the columns to tag points; each element of the matrix contains either the points a user has accumulated for a tag, or zero if none were awarded. The final goal is, as in the case of Netflix, to fill in the missing entries. Here, however, the additional information is collected from stackoverflow.com itself rather than from a different source such as IMDb; the additional data for each user is collected from a different user who has similar and more tags attached to his/her profile.

Our main contributions are:

– We formulate the problem of recommendation with additional information as a Coupled Matrix Factorization with two different types of regularization, namely on the ℓ1 and ℓ2 norms, respectively.

– Our approach is agnostic to the type of additional information, and thus generic.

– We experiment on two diverse, real world applications and demonstrate the effectiveness of our approach in terms of rating prediction.

In Section 2, we describe the cost function along with the method formulation and algorithm. Section 3 describes the mapping of raw data into our method formulation (described in Section 2). Experiments and evaluation against standard techniques are shown in Section 4. Section 5 introduces related work on how the fusion of additional information helps in improving the objective; moreover, we also survey the current state of the art in terms of leveraging additional information for recommendations. In Section 6, we conclude and provide future directions for improvement.

2 Proposed Method

Notation

Scalars are denoted by lowercase, italic letters, e.g. x. Matrices are denoted by uppercase, boldface letters, e.g. X. We use Matlab notation for matrix indexing, i.e. the (i, j)-th element is denoted by X(i, j).

2.1 Method Description

We have focused on a broad spectrum of aspects related to this work: the algorithm development, the acquisition and preprocessing of data, a comprehensive means of comparison with existing baselines, and the application to different domains.

Let X denote the matrix whose values we wish to complete (e.g. the Netflix movie-by-user ratings matrix) and Y a matrix of additional information that shares its row dimension with X (e.g. the IMDb movie-by-feature matrix). We propose to jointly factorize X and Y into a set of low rank components, such that an appropriate loss function is minimized. The most commonly used loss function is the Frobenius norm, which reflects the sum of squared errors. We additionally regularize the loss function in two different ways. Our first model is a Coupled Matrix Factorization with ℓ2 regularization on the factors:

$$\min_{\mathbf{A},\mathbf{B},\mathbf{C}}\; \alpha\,\|\mathbf{X}-\mathbf{A}\mathbf{B}^T\|_F^2 + \beta\,\|\mathbf{Y}-\mathbf{A}\mathbf{C}^T\|_F^2 + \mu\big(\|\mathbf{A}\|_F^2 + \|\mathbf{B}\|_F^2 + \|\mathbf{C}\|_F^2\big) \quad (1)$$

where α + β = 1. Note that the left factor matrix for both X and Y is the same; this reflects the joint nature of our factorization. The main reason for the ℓ2 regularization is to avoid uncontrollably large values in the factors, and thus to avoid overfitting.
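To make Eq. 1 concrete, here is a minimal NumPy sketch that evaluates the regularized coupled objective for given factors (our implementation is in Matlab; this Python function and its argument names are purely illustrative):

```python
import numpy as np

def cmf_l2_objective(X, Y, A, B, C, alpha, mu):
    # Eq. 1: alpha*||X - A B^T||_F^2 + beta*||Y - A C^T||_F^2
    #        + mu*(||A||_F^2 + ||B||_F^2 + ||C||_F^2), with beta = 1 - alpha.
    beta = 1.0 - alpha
    fit = (alpha * np.linalg.norm(X - A @ B.T, 'fro') ** 2
           + beta * np.linalg.norm(Y - A @ C.T, 'fro') ** 2)
    reg = mu * (np.linalg.norm(A, 'fro') ** 2
                + np.linalg.norm(B, 'fro') ** 2
                + np.linalg.norm(C, 'fro') ** 2)
    return fit + reg
```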

The second means of regularization is adding ℓ1 norm penalties on the three factor matrices. This penalization is done for two reasons: 1) as before, in order to avoid overfitting, and 2) by using the ℓ1 norm, we produce a sparse latent embedding of the data in a lower dimension, which effectively suppresses noise and retains the signal that is useful for collaborative filtering. The objective function is the following:

$$\min_{\mathbf{A},\mathbf{B},\mathbf{C}}\; \alpha\,\|\mathbf{X}-\mathbf{A}\mathbf{B}^T\|_F^2 + \beta\,\|\mathbf{Y}-\mathbf{A}\mathbf{C}^T\|_F^2 + \lambda\Big(\sum_i\sum_f |\mathbf{A}(i,f)| + \sum_j\sum_f |\mathbf{B}(j,f)| + \sum_l\sum_f |\mathbf{C}(l,f)|\Big) \quad (2)$$

A similar model has been used in [2], in an entirely different setting, that of exploratory biomarker identification in metabolomics.
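As a companion to the sketch above, the ℓ1-penalized objective of Eq. 2 differs only in the penalty term; again, a purely illustrative sketch rather than our implementation:

```python
import numpy as np

def cmf_l1_objective(X, Y, A, B, C, alpha, lam):
    # Eq. 2: the same coupled fit terms as Eq. 1, but with an l1 penalty
    # that promotes sparsity in all three factor matrices.
    beta = 1.0 - alpha
    fit = (alpha * np.linalg.norm(X - A @ B.T, 'fro') ** 2
           + beta * np.linalg.norm(Y - A @ C.T, 'fro') ** 2)
    return fit + lam * (np.abs(A).sum() + np.abs(B).sum() + np.abs(C).sum())
```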

As stated initially, our work revolves around Equations 1 and 2. By using the Frobenius norm as a loss function, we are able to solve the problem using Alternating Least Squares, a block coordinate descent optimization technique; in a nutshell, block coordinate descent fixes all variables except a specific block and optimizes over that block. This is done in an alternating fashion, until no change in the cost function is observed.

Fig. 1: Pictorial representation of our proposed method. In order to fill in the missing values in X, we leverage matrix Y, which shares the same row dimension as X; we obtain a joint low rank decomposition of both matrices, as in Eq. 2, and we reconstruct X. We then obtain estimates of the missing values using the reconstructed matrix.


In our particular case, we may obtain a locally optimal solution for Eqs. 1 and 2 by fixing two of the three matrices at each step, optimizing over the third, and doing so in a "circular" way. The optimization problem is linear in any one of the three matrix variables; thus, each step of the block coordinate descent is simply either a Ridge Regression [8] (for the ℓ2 penalty), or a Lasso regression problem [22] (i.e. Least Squares regression with ℓ1 regularization on the variable that we regress on).

Our algorithm is guaranteed to converge to a locally optimal solution, since each step is solved optimally for the specific block that we optimize for, and thus the cost is guaranteed to decrease monotonically. Finally, we incorporate non-negativity constraints on all three factor matrices A, B, C by projecting the solution of each step of the algorithm onto the [0, ∞) interval.
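For concreteness, the following is a minimal NumPy sketch of this alternating scheme for Eq. 1: each ridge subproblem is solved in closed form and the result is projected onto [0, ∞). It is a sketch under the stated assumptions, not our Matlab implementation; for Eq. 2, each ridge solve would be replaced by a Lasso solve.

```python
import numpy as np

def cmf_l2_als(X, Y, F, alpha=0.8, mu=0.05, n_iter=50, seed=0):
    # ALS for Eq. 1. X is I x J, Y is I x K; they share the row mode.
    beta = 1.0 - alpha
    rng = np.random.default_rng(seed)
    (I, J), K = X.shape, Y.shape[1]
    A, B, C = (rng.random((n, F)) for n in (I, J, K))
    for _ in range(n_iter):
        # A-step: stack the scaled coupled matrices, ridge-solve,
        # then project onto [0, inf) as described above.
        Z = np.hstack([np.sqrt(alpha) * X, np.sqrt(beta) * Y])
        W = np.vstack([np.sqrt(alpha) * B, np.sqrt(beta) * C])
        M = W.T @ W + mu * np.eye(F)
        A = np.maximum(np.linalg.solve(M, (Z @ W).T).T, 0)
        # B-step and C-step: ordinary ridge regressions against fixed A.
        G = A.T @ A
        B = np.maximum(np.linalg.solve(G + (mu / alpha) * np.eye(F),
                                       (X.T @ A).T).T, 0)
        C = np.maximum(np.linalg.solve(G + (mu / beta) * np.eye(F),
                                       (Y.T @ A).T).T, 0)
    return A, B, C
```

Note that the A-step stacks the (scaled) X and Y side by side, which is precisely what makes the factorization coupled.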

In Figure 1, we illustrate our proposed method. Using Netflix as an example, say that X is the movie-by-user matrix. After we obtain the low rank factors A, B, C, we may reconstruct X ≈ AB^T as a matrix of lower rank. As we review in the Related Work section, low rank factorization models are behind many state-of-the-art collaborative filtering algorithms.

By decomposing X into a small number of rank one factors, what we essentially accomplish is to express the data as a sum of parts; consequently, we express each user as a weighted sum of latent movie concepts, defined by the columns of A. We may thus compute this weighted sum of latent movie concepts and obtain a new vector representation for each user. Essentially, as shown in Fig. 1, we need to reconstruct X by multiplying out AB^T.

After we reconstruct X, we may simply look at the coefficients of interest and obtain an estimate of the missing values. Furthermore, we may also post-process these predicted values in a way that makes them more suitable for interpretation, e.g. quantizing them to the 1-5 rating scale. In [14], the interested reader may find a concise and intuitive tutorial on the logic of using latent factor models in collaborative filtering.

There are two particularly interesting observations for our proposed approach, which differentiate it from existing work that uses factorization models. First, by coupling matrices X, Y, as in Fig. 1, we incorporate the additional information, contained in Y, into the latent factor A; thus, our low rank representation is richer in information, compared to a factorization performed solely on X. Second, under our proposed formulation, we can perform two-way "recommendation"/completion, since we obtain a joint low rank factorization for both data matrices. Therefore, in the very same way, we may approximate matrix Y as AC^T and obtain our estimates, as sketched below. In the Experimental section, we demonstrate this two-way collaborative filtering property of our method.
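As a sketch of the completion step (illustrative names; `missing_idx` is assumed to hold the row and column indices of the hidden cells):

```python
import numpy as np

def predict_missing(A, B, missing_idx, quantize=True):
    # Reconstruct X ~= A B^T and read off the entries we want to predict;
    # the same call with C in place of B completes Y ~= A C^T.
    preds = (A @ B.T)[missing_idx]
    # Optional post-processing: quantize to the 1-5 rating scale.
    return np.clip(np.rint(preds), 1, 5) if quantize else preds
```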

3 Data Description

As we mentioned in the Introduction, we apply our algorithm to two diverse recommendation problems. The reason we do so is that our algorithm is agnostic to the type of additional information given. In the case of movie recommendations to Netflix users, the additional information from IMDb contains extra features, such as genres, and ratings from a different audience; the set of movies is shared between Netflix and IMDb. On the other hand, for stackoverflow.com there is no external source like IMDb; here, the additional information consists of a different set of users having similar and/or more interests than the set of users for whom recommendations are to be provided.

We shall see in the Experimental Evaluation section how the choice of the type of additional information reflects on the recommendations.

Netflix & IMDb Data The Netflix dataset contains ratings given by anonymized users to movies. The number of movies in this dataset is about 18K and the number of users who provided ratings (on a scale from 1-5) is about 2.5 million. The Netflix data corresponds to matrix X in our formulation, whose rows correspond to movies and whose columns correspond to users.

In terms of matrix Y, we use information crawled from IMDb, ultimately on a shared subset of movies that exist in the Netflix data. About 200k movies have been crawled from the IMDb website. For each movie, we collect the rating (on a scale from 1-10), as well as genre information, which is encoded as binary features.

Stack Overflow Data Here we look at the publicly available data dump for stackoverflow.com [1]. The collection of data is divided into two steps, viz. Data Aggregation and Data Cleaning.

Data Aggregation: To make recommendations for a user Ui, we first look at all answers provided by the user and aggregate all points/up-votes received for each answer. Every question posted on stackoverflow.com has a set of tags. A tag is a keyword which acts as metadata to describe the question and allows for easy browsing or searching. Hence, when a registered user of stackoverflow.com provides an answer to a question and the answer gets upvoted, each tag, along with the number of votes for the answer, is attached to the user's profile to reflect his/her set of interests. Algorithm 1 below formally describes the aggregation of interest sets from the data dump.

Algorithm 1: Create interest set IntSi for a user Ui

Require: Set of answers Ai posted by the user Ui and the corresponding set of questions Q for each answer in Ai
1: Set the interest set IntSi for Ui to a null vector
2: for each a in Ai do
3:   Find the set of tags T of the question to which user Ui provided answer a
4:   for each t in T do
5:     IntSi(t) = IntSi(t) + Upvotes(a)
6:   end for
7: end for
8: return vector IntSi
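A compact Python rendering of Algorithm 1 could look as follows; the accessors `question_tags` and `upvotes` are hypothetical stand-ins for lookups into the parsed data dump:

```python
from collections import defaultdict

def interest_set(answers, question_tags, upvotes):
    # answers: ids of answers posted by user U_i;
    # question_tags[a]: tags of the question that answer a responds to;
    # upvotes[a]: points/up-votes received by answer a.
    IntS = defaultdict(int)        # the "null vector" of step 1
    for a in answers:
        for t in question_tags[a]:
            IntS[t] += upvotes[a]  # steps 4-6: add the answer's points
    return dict(IntS)
```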

Data Cleaning: Since many tags appear in no more than a couple of hundred questions, we restrict the interest-set vector of each user to at most 200 entries, chosen by tag popularity; we observed that tags ranked below the 200th mark by popularity appear in fewer than 0.2% of questions.

To form the X matrix, tag points correspond to the columns and users correspond to the rows. We randomly draw the rows of X from our pool of users P. For Y, we keep the number of columns the same as in X. Each row Yi is a row of P such that the following function is maximized (a code sketch follows the equation):

$$\mathbf{Y}_i = \arg\max_{\mathbf{P}_j \in \mathbf{P}\setminus\{\mathbf{X}_i\}} \; \mathbf{X}_i \cdot \mathbf{P}_j$$
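In code, this row-matching step can be sketched as follows, assuming the user pool P and the sampled matrix X are NumPy arrays and X's rows are drawn from P:

```python
import numpy as np

def build_Y(X, P):
    # For each user row X_i, pick the row of P (other than X_i itself)
    # that maximizes the dot product X_i . P_j, and use it as Y_i.
    Y = np.empty_like(X)
    for i, x in enumerate(X):
        scores = (P @ x).astype(float)
        scores[np.all(P == x, axis=1)] = -np.inf  # exclude X_i from P
        Y[i] = P[np.argmax(scores)]
    return Y
```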

4 Experimental Evaluation

For the purposes of evaluating our approach we apply the following procedure: choose, uniformly at random, a set of data entries (movie ratings in the case of the Netflix matrix, or genre and rating information in the case of the IMDb matrix) and set them to zero (bearing in mind that zero, in the case of ratings, has no physical interpretation other than the fact that a particular entry is missing). After having hidden a set of elements, we solve Eq. 1 or 2 and predict them as described earlier. Additionally, we test the algorithm's performance in the reverse setting, where the roles of matrices X and Y are reversed (we use X as additional data for Y).
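A sketch of this hide-and-predict protocol, hiding a fraction p of the observed (nonzero) entries:

```python
import numpy as np

def hide_entries(X, p=0.10, seed=0):
    # Zero denotes "missing" here, so hiding an entry means setting it to 0.
    rng = np.random.default_rng(seed)
    rows, cols = np.nonzero(X)                      # observed entries
    pick = rng.choice(len(rows), int(p * len(rows)), replace=False)
    X_masked = X.copy()
    X_masked[rows[pick], cols[pick]] = 0
    return X_masked, (rows[pick], cols[pick])       # matrix + hidden cells
```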

4.1 Baselines

Since our method is based on matrix factorization, our baselines are two basic, albeit strong, matrix factorization methods, which may be applied to only one of the two matrices X, Y at a time. Since both baselines operate under the factorization regime, the prediction of missing values is done in the same fashion, after we obtain the low rank factors. The two baseline methods are:

Singular Value Decomposition: By the Eckart-Young theorem, the truncated SVD of X is the best low rank approximation of X in the least squares sense. The model is written as:

$$\mathbf{X} = \mathbf{U}\boldsymbol{\Sigma}\mathbf{V}^T$$

However, optimality in the least squares sense does not give any guarantees for the completion quality when values of X are missing. As we review in the Related Work section, many state-of-the-art techniques are based on the SVD.

Non-negative Matrix Factorization: This model was popularized by [15] and is a small but very insightful modification of the bilinear (SVD) factorization. In particular, it subjects the two factors to non-negativity constraints, which makes a lot of sense, especially for interpretation purposes (but not restricted to them), when the original data is non-negative (as in our case, and in many real world scenarios):

$$\min_{\mathbf{A},\mathbf{B}\geq 0} \|\mathbf{X}-\mathbf{A}\mathbf{B}^T\|_F^2$$
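Both baselines are available off the shelf; the sketch below uses scikit-learn's TruncatedSVD and NMF for illustration (our experiments used Matlab implementations, so this is merely an assumed equivalent):

```python
from sklearn.decomposition import NMF, TruncatedSVD

def svd_reconstruct(X, F=10):
    svd = TruncatedSVD(n_components=F)
    US = svd.fit_transform(X)          # U * Sigma
    return US @ svd.components_        # rank-F reconstruction U Sigma V^T

def nmf_reconstruct(X, F=10):
    nmf = NMF(n_components=F, init='nndsvd', max_iter=500)
    A = nmf.fit_transform(X)           # non-negative left factor
    return A @ nmf.components_         # rank-F non-negative reconstruction
```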

4.2 Datasets

The three datasets we used were the following:

– Netflix/IMDb: The X matrix is the Netflix movie-user matrix. Matrix Y is the IMDb matrix that contains ratings and genres. Because this dataset is relatively large for Matlab to handle, we take samples of 100 users at a time; we then execute our method (and the baselines) 10 times, in order to estimate their performance.

– Stack Overflow: Matrices X and Y are the ones described previously for the Stack Overflow data.

– Synthetic: We create synthetic data as follows: first, we draw random sparse matrices A of size I × F, B of size J × F and C of size K × F. We then take X = AB^T and Y = AC^T.

4.3 Root Mean Squared Error

We measure our method's performance, compared to the baselines, by recording the Root Mean Squared Error (RMSE) on the missing portion of the matrix, i.e. we only count the error on the ratings that we wish to predict. In Figure 2, we show the RMSE as a function of the rank of the factorization (i.e. the number of latent factors that we extract). In all three cases, we see that our method outperforms the baselines, indicating that additional information is beneficial for collaborative filtering. The figures we show are for p = 10% of the ratings missing; for higher percentages of missing ratings, we observed a similar trend of our approach outperforming the baselines; however, as the number of missing values increases, the prediction quality decreases. We determined the parameters via trial-and-error: in particular, for Netflix/IMDb α = 0.8, λ = 0.005 (CMF-ℓ1) and α = 0.8, µ = 0.05 (CMF-ℓ2); for StackOverflow α = 0.4, λ = µ = 1; and for Synthetic α = 0.8, λ = 5, and µ = 1.
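Restricted to the hidden cells, the error metric itself reads:

```python
import numpy as np

def rmse_on_missing(X_true, X_hat, missing_idx):
    # RMSE over the hidden ratings only, as reported in Figure 2.
    diff = X_true[missing_idx] - X_hat[missing_idx]
    return float(np.sqrt(np.mean(diff ** 2)))
```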

(Plots: RMSE vs. rank, with curves for CMF-ℓ1, CMF-ℓ2, SVD, and NMF. Panels: (a) Netflix/IMDb, (b) StackOverflow, (c) Synthetic.)

Fig. 2: Completion RMSE for 10% missing values. In all three cases, both our approaches (CMF-ℓ1 & CMF-ℓ2) outperform the two baselines in terms of RMSE.

Taking our analysis a step further, we assess whether reversing the roles of X and Y yields the same, better, or worse completion quality. We indeed observe that the roles of X, Y, for the IMDb/Netflix dataset, are not reciprocal. That is, trying to complete values of Y using X proves to be only as effective as completing the values of Y via traditional techniques.

Parameter Sensitivity Analysis In our model, we introduce parameters α, β, λ, µ, which may be useful in smoothing out the differences between the two pieces of data we combine (α, β), in controlling regularization/overfitting (µ), and in controlling the sparsity levels of the factor matrices (λ). Determining a set of appropriate values for α, β proves to be a very challenging research subject on which there is some amount of work, e.g. [24].

For the previous experiments, we used hand-tuned parameters; however, we conducted an experiment in order to evaluate the behaviour of our models with respect to the α = 1 − β parameter, in terms of RMSE. Figure 3 shows the results for the Netflix/IMDb and StackOverflow datasets.
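The sweep itself is a simple grid over α and the regularization weight; the sketch below reuses the `cmf_l2_als` and `rmse_on_missing` helpers from the earlier sketches, with an illustrative grid:

```python
import numpy as np

def sensitivity_sweep(X, Y, X_true, missing_idx, F=10,
                      weights=(0.05, 1, 5, 10)):
    # For each regularization weight (lambda or mu, depending on the model),
    # sweep alpha (beta = 1 - alpha) and record the completion RMSE.
    results = {}
    for w in weights:
        for alpha in np.linspace(0.1, 0.9, 9):
            A, B, _ = cmf_l2_als(X, Y, F, alpha=alpha, mu=w)
            results[(w, round(alpha, 1))] = rmse_on_missing(
                X_true, A @ B.T, missing_idx)
    return results
```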

(Plots: RMSE on a log scale vs. α, with one curve per λ or µ in {0.05, 1, 5, 10}. Panels: (a) Netflix/IMDb, CMF-ℓ1; (b) StackOverflow, CMF-ℓ1; (c) Netflix/IMDb, CMF-ℓ2; (d) StackOverflow, CMF-ℓ2.)

Fig. 3: Parameter sensitivity analysis for our two datasets. We vary λ (resp. µ) and α and measure the RMSE, for 10% of the values missing. We observe that in both cases, very small values of λ, µ perform better.

5 Related Work

Collaborative Filtering The public release of the Netflix data [4] resulted in a number of approaches developed in a short time for improving recommendation systems. In general, most of the current approaches to collaborative filtering are either neighborhood based methods or latent factor based approaches. We mostly focus on factorization methods; however, we provide a brief overview of neighborhood methods too. In neighborhood based approaches, the goal revolves around estimating the similarity between pairs of entities [18]. Here, similarities can be computed between user-user pairs or item-item pairs. In a user based recommendation scenario, we first try to identify the most similar users for a given target user [17]. The similarity measure between any two users can be computed in a number of ways, such as vector cosine similarity, the Pearson coefficient, etc. [21][19][9][20].

In factorization based approaches, the use of SVD has been proposed in the literature [5], [7]. One of the major drawbacks in using SVD-type approaches, however, is that we have to deal with the sparsity of the user-item matrix, because several of the user-item ratings are unknown. Also, modelling based only on the observed ratings could result in serious overfitting. Approaches like imputation have been used to make the user-item matrix more dense [10]; however, this procedure might insert noise into the data. Several recent works have focused on modelling the observed user-item matrix directly through a regularized model, such as [11], [25]. By employing regularization, these methods attenuate the effects of overfitting.

An overview of Matrix Factorization techniques for Collaborative Filtering can be found in [14]. Most previous works, however, are based on the explicit feedback that was already provided by the users, whereas one of the major advantages of our approach is the fusion of side information from additional sources: by jointly factorizing all the matrices, we hope to improve the accuracy of the predicted ratings obtained by using Netflix ratings alone. Another major advantage of our approach is that our joint factorization model is completely generic and can be applied to any domain where additional side information can be garnered (Stack Overflow, friendship recommendation).

The Netflix Prize winner applied matrix factorization with temporal dynamics [13], which captures time dependent behaviours of users. Moreover, in the final year of the competition, they also added predictors [12] which capture effects with respect to a user or movie. For example, a user Bob may have a tendency to rate movies a certain percentage below the average rating for all movies. Similar effects, which are independent of user interactions, were incorporated in the final model for prediction. Our matrix factorization approach does not consider such extra parameters, thereby reducing the overall computational cost.

Incorporating Additional Information There exists a fairly recent line of work which, as in the present paper, incorporates some form of additional information in order to boost collaborative filtering. In [14], the authors mention the use of implicit feedback as additional information for collaborative filtering. The authors of [23] study collaborative filtering for marketing, and also conclude that additional information improves predicted ratings. In [27], the authors attempt to model the context of a user's choice of a movie (e.g. preference for a certain actor vs. preference for action movies in general) within the latent factorization collaborative filtering framework. In [26], the authors combine friendship information with user interests to improve recommendations. The authors of [16] introduce three successful ways of combining social network information with MF collaborative filtering, outperforming SVD. Finally, in [6], the authors provide a tensor based method of incorporating social data into cross-domain collaborative filtering.

The advantage of our approach is that it is agnostic to the additional information, and thus flexible; namely, the matrix X whose values we need to complete, along with another matrix that matches one of X's dimensions, suffices for our method to operate and, as we demonstrated, to outperform the baselines in diverse, real world applications.

6 Conclusions

In this paper:

– We formulate the problem of recommendation with additional information as a regularized Coupled Matrix Factorization; one of our proposed methods includes sparse modelling of the entities participating in the collaborative filtering process.

– The way we formulate the problem is generic enough to accommodate diverse applications.

– We experiment on two such real world applications and show that our approach outperforms the two baselines.

As future work, we may consider incorporating temporal information (handling tensors instead of matrices), as well as experimenting with different loss functions, which might be more appropriate for certain applications where ratings are highly skewed.

References

1. Stack Exchange Data dump. http://blog.stackoverflow.com/2011/09/creative-commons-data-dump-sep-11/. [Accessed September 4, 2013].

2. Evrim Acar, Gozde Gurdeniz, Morten A. Rasmussen, Daniela Rago, Lars O. Dragsted, and Rasmus Bro. Coupled matrix factorization with sparse factors to identify potential biomarkers in metabolomics. In Data Mining Workshops (ICDMW), 2012 IEEE 12th International Conference on, pages 1-8. IEEE, 2012.

3. Jeff Atwood and Joel Spolsky. Stack Overflow. http://stackoverflow.com/. [Accessed September 4, 2013].

4. James Bennett and Stan Lanning. The Netflix prize. In Proceedings of KDD Cup and Workshop, volume 2007, page 35, 2007.

5. Daniel Billsus and Michael J. Pazzani. Learning collaborative information filters. In ICML, volume 98, pages 46-54, 1998.

6. Wei Chen, Wynne Hsu, and Mong Li Lee. Making recommendations from multiple domains. In Proceedings of the 19th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 892-900. ACM, 2013.

7. Ken Goldberg, Theresa Roeder, Dhruv Gupta, and Chris Perkins. Eigentaste: A constant time collaborative filtering algorithm. Information Retrieval, 4(2):133-151, 2001.

8. Arthur E. Hoerl and Robert W. Kennard. Ridge regression: Biased estimation for nonorthogonal problems. Technometrics, 12(1):55-67, 1970.

9. George Karypis. Evaluation of item-based top-N recommendation algorithms. In Proceedings of the Tenth International Conference on Information and Knowledge Management, pages 247-254. ACM, 2001.

10. Dohyun Kim and Bong-Jin Yum. Collaborative filtering based on iterative principal component analysis. Expert Systems with Applications, 28(4):823-830, 2005.

11. Yehuda Koren. Factorization meets the neighborhood: a multifaceted collaborative filtering model. In Proceedings of the 14th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 426-434. ACM, 2008.

12. Yehuda Koren. The BellKor solution to the Netflix grand prize. 2009.

13. Yehuda Koren. Collaborative filtering with temporal dynamics. Communications of the ACM, 53(4):89-97, 2010.

14. Yehuda Koren, Robert Bell, and Chris Volinsky. Matrix factorization techniques for recommender systems. Computer, 42(8):30-37, 2009.

15. Daniel D. Lee, H. Sebastian Seung, et al. Learning the parts of objects by non-negative matrix factorization. Nature, 401(6755):788-791, 1999.

16. Min Li, Zhiwei Jiang, Bin Luo, Jiubin Tang, Qing Gu, and Daoxu Chen. Product and user dependent social network models for recommender systems. In Advances in Knowledge Discovery and Data Mining, pages 13-24. Springer, 2013.

17. Matthew R. McLaughlin and Jonathan L. Herlocker. A collaborative filtering algorithm and evaluation metric that accurately model the user experience. In Proceedings of the 27th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 329-336. ACM, 2004.

18. Paul Resnick, Neophytos Iacovou, Mitesh Suchak, Peter Bergstrom, and John Riedl. GroupLens: an open architecture for collaborative filtering of netnews. In Proceedings of the 1994 ACM Conference on Computer Supported Cooperative Work, pages 175-186. ACM, 1994.

19. Gerard Salton and Michael J. McGill. Introduction to Modern Information Retrieval. 1986.

20. Badrul Sarwar, George Karypis, Joseph Konstan, and John Riedl. Analysis of recommendation algorithms for e-commerce. In Proceedings of the 2nd ACM Conference on Electronic Commerce, pages 158-167. ACM, 2000.

21. Badrul Sarwar, George Karypis, Joseph Konstan, and John Riedl. Item-based collaborative filtering recommendation algorithms. In Proceedings of the 10th International Conference on World Wide Web, pages 285-295. ACM, 2001.

22. Robert Tibshirani. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society, Series B (Methodological), pages 267-288, 1996.

23. Akhmed Umyarov and Alexander Tuzhilin. Improving collaborative filtering recommendations using external data. In Data Mining, 2008. ICDM'08. Eighth IEEE International Conference on, pages 618-627. IEEE, 2008.

24. T. Wilderjans, E. Ceulemans, and I. Van Mechelen. Simultaneous analysis of coupled data blocks differing in size: A comparison of two weighting schemes. Computational Statistics & Data Analysis, 53(4):1086-1098, 2009.

25. Mingrui Wu. Collaborative filtering via ensembles of matrix factorizations. In Proceedings of KDD Cup and Workshop, volume 2007, 2007.

26. S.H. Yang, B. Long, A. Smola, N. Sadagopan, Z. Zheng, and H. Zha. Like like alike: joint friendship and interest propagation in social networks. In Proceedings of the 20th International Conference on World Wide Web, pages 537-546. ACM, 2011.

27. Shuang-Hong Yang, Bo Long, Alexander J. Smola, Hongyuan Zha, and Zhaohui Zheng. Collaborative competitive filtering: learning recommender using context of user choice. In Proceedings of the 34th International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 295-304. ACM, 2011.