Fachgebiet Datenbanksysteme und Informationsmanagement, Technische Universität Berlin
http://www.dima.tu-berlin.de/
AIM3 – Scalable Data Analysis and Data Mining
11 – Latent factor models for Collaborative Filtering
Sebastian Schelter, Christoph Boden, Volker Markl
Recap: Item-Based Collaborative Filtering
• compute the pairwise similarities of the columns of the rating matrix using some similarity measure
• store the top 20 to 50 most similar items per item in the item-similarity matrix
• prediction: use a weighted sum over all items similar to the unknown item that have been rated by the current user

$$p_{u,i} = \frac{\sum_{j \in S(i,u)} s_{i,j} \, r_{u,j}}{\sum_{j \in S(i,u)} \left| s_{i,j} \right|}$$

where $S(i,u)$ denotes the items similar to $i$ that have been rated by user $u$
Drawbacks of similarity-based neighborhood methods
• the assumption that a rating is determined by all of the user's ratings for commonly co-rated items is hard to justify in general
• lack of bias correction
• every co-rated item is looked at in isolation: if a movie is similar to "Lord of the Rings", do we want each part of the trilogy to contribute as a separate similar item?
• the best choice of similarity measure is determined by experimentation, not by mathematical reasoning
Latent factor models
■ Idea
• ratings are deeply influenced by a set of factors that are very specific to the domain (e.g. the amount of action in a movie, the complexity of its characters)
• these factors are generally not obvious; we might be able to think of some of them, but it is hard to estimate their impact on the ratings
• the goal is to infer these so-called latent factors from the rating data using mathematical techniques
Latent factor models
■ Approach
• users and items are characterized by latent factors; each user and each item is mapped onto a latent feature space
• each rating is approximated by the dot product of the user feature vector and the item feature vector (see the sketch below)
• prediction of unknown ratings also uses this dot product
• squared error as a measure of loss

$$\hat{r}_{ij} = u_i^T m_j, \qquad \left( r_{ij} - u_i^T m_j \right)^2, \qquad u_i, m_j \in \mathbb{R}^{n_f}$$
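A minimal sketch of the prediction step, assuming the feature vectors have already been learned (the concrete numbers are illustrative):

```python
import numpy as np

# Predicting a rating as the dot product of user and item feature vectors;
# the vectors here are illustrative, in practice they are learned from data.
u_i = np.array([0.8, 0.1, 1.2])   # user i's affinity to n_f = 3 latent factors
m_j = np.array([1.1, 0.3, 0.9])   # item j's relation to the same factors
r_hat = u_i @ m_j                 # predicted rating r_hat_ij = u_i^T m_j
print(r_hat)                      # 0.8*1.1 + 0.1*0.3 + 1.2*0.9 = 1.99
```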
Latent factor models
■ Approach
• decomposition of the rating matrix into the product of a user feature and an item feature matrix
• row in U: vector of a user's affinity to the features
• row in M: vector of an item's relation to the features
• closely related to Singular Value Decomposition, which produces an optimal low-rank approximation of a matrix (see the sketch below)
$$R \approx U M^T$$
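To make the factorization concrete, a minimal NumPy sketch of a truncated SVD on a small, fully known toy matrix (the ratings are illustrative):

```python
import numpy as np

# Truncated SVD as an optimal low-rank approximation of a fully known
# matrix; this only works because no entries are missing (see below for
# why it breaks down on partially defined rating matrices).
R = np.array([[5.0, 3.0, 1.0],
              [4.0, 3.0, 1.0],
              [1.0, 1.0, 5.0]])
Q, s, Vt = np.linalg.svd(R, full_matrices=False)
k = 2                                  # keep the k strongest latent features
U = Q[:, :k] * s[:k]                   # user feature matrix
M = Vt[:k].T                           # item feature matrix
print(np.linalg.norm(R - U @ M.T))     # small reconstruction error: R ≈ U Mᵀ
```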
Latent factor models
■ Properties of the decomposition
• automatically ranks the features by their "impact" on the ratings
• features might not necessarily be intuitively understandable
Latent factor models
■ Problematic situation with explicit feedback data
• the rating matrix is not only sparse, but only partially defined; missing entries cannot be interpreted as 0, they are simply unknown
• standard decomposition algorithms like the Lanczos method for SVD are therefore not applicable
■ Solution
• the decomposition has to be computed using the known ratings only
• find the set of user and item feature vectors that minimizes the squared error to the known ratings (see the sketch below)

$$\min_{U,M} \sum_{(i,j)} \left( r_{ij} - u_i^T m_j \right)^2$$

where the sum runs over the known ratings only
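A short sketch of evaluating this objective over the known ratings only; representing unknown entries as NaN is an assumption made for illustration:

```python
import numpy as np

# Squared error over the known ratings only; unknown entries are NaN and
# are excluded from the sum, never treated as zeros. Data is illustrative.
R = np.array([[5.0, np.nan, 1.0],
              [np.nan, 3.0, np.nan]])
U = np.random.rand(2, 2)               # toy user factors, n_f = 2
M = np.random.rand(3, 2)               # toy item factors
known = ~np.isnan(R)                   # mask of observed ratings
error = np.sum((R - U @ M.T)[known] ** 2)
```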
Latent factor models
■ the quality of the decomposition is not measured by the reconstruction error on the original data, but by the generalization to unseen data
■ regularization is necessary to avoid overfitting
■ the model has hyperparameters (regularization, learning rate) that need to be chosen
■ process: split the data into training, test and validation sets (see the sketch below)
□ train the model using the training set
□ choose the hyperparameters according to performance on the test set
□ evaluate generalization on the validation set
□ ensure that each datapoint is used in each set once (cross-validation)
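A minimal sketch of the three-way split, assuming the ratings are available as (user, item, rating) triples; the 60/20/20 proportions are an illustrative choice:

```python
import numpy as np

# Shuffle the known ratings once, then cut them into training, test and
# validation sets; the data and proportions are illustrative.
ratings = np.array([(0, 0, 5.0), (0, 1, 3.0), (1, 0, 4.0),
                    (1, 2, 1.0), (2, 1, 2.0), (2, 2, 5.0)])
rng = np.random.default_rng(0)
rng.shuffle(ratings)                   # shuffles the rows in place
n = len(ratings)
train, test, validation = np.split(ratings, [int(0.6 * n), int(0.8 * n)])
```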
Stochastic Gradient Descent
• add a regularization term
• loop through all ratings in the training set and compute the associated prediction error
• modify the parameters in the opposite direction of the gradient (see the sketch below)
• problem: the approach is inherently sequential (although recent research might have unveiled a parallelization technique)

$$\min_{U,M} \sum_{(i,j)} \left( r_{ij} - u_i^T m_j \right)^2 + \lambda \left( \| u_i \|^2 + \| m_j \|^2 \right)$$

$$e_{ij} = r_{ij} - u_i^T m_j$$
$$u_i \leftarrow u_i + \gamma \left( e_{ij} \, m_j - \lambda \, u_i \right)$$
$$m_j \leftarrow m_j + \gamma \left( e_{ij} \, u_i - \lambda \, m_j \right)$$
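A minimal sketch of these updates; the hyperparameter values and the toy data are illustrative assumptions, not taken from the slides:

```python
import numpy as np

# Stochastic gradient descent over the known ratings, with L2 regularization.
def sgd_factorization(ratings, num_users, num_items, n_f=10,
                      gamma=0.01, lam=0.05, epochs=20, seed=0):
    rng = np.random.default_rng(seed)
    U = rng.normal(scale=0.1, size=(num_users, n_f))  # user vectors u_i
    M = rng.normal(scale=0.1, size=(num_items, n_f))  # item vectors m_j
    for _ in range(epochs):
        rng.shuffle(ratings)               # visit the ratings in random order
        for i, j, r in ratings:            # known ratings only
            e = r - U[i] @ M[j]            # prediction error e_ij
            u_old = U[i].copy()            # update both vectors from old values
            U[i] += gamma * (e * M[j] - lam * U[i])
            M[j] += gamma * (e * u_old - lam * M[j])
    return U, M

# usage: list of (user, item, rating) triples
ratings = [(0, 0, 5.0), (0, 1, 3.0), (1, 0, 4.0), (1, 2, 1.0)]
U, M = sgd_factorization(ratings, num_users=2, num_items=3)
print(U @ M.T)                             # predicted rating matrix
```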
Alternating Least Squares with Weighted λ-Regularization
■ Model
• the feature matrices are modeled directly, using only the observed ratings
• add a regularization term to avoid overfitting
• minimize the regularized error of:

$$f(U,M) = \sum_{(i,j)} \left( r_{ij} - u_i^T m_j \right)^2 + \lambda \left( \sum_i n_{u_i} \| u_i \|^2 + \sum_j n_{m_j} \| m_j \|^2 \right)$$

where $n_{u_i}$ and $n_{m_j}$ denote the number of ratings of user $i$ and item $j$, respectively
■ Solving technique
• fixing one of the unknown variables turns this into a simple quadratic problem
• rotate between fixing u and m until convergence ("Alternating Least Squares", see the sketch below)
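A minimal ALS-WR sketch, assuming a dense rating matrix with NaN marking unknown entries; parameter values are illustrative:

```python
import numpy as np

# Alternate between solving a regularized least-squares problem per user
# (with M fixed) and per item (with U fixed).
def als_wr(R, n_f=10, lam=0.05, iterations=15, seed=0):
    rng = np.random.default_rng(seed)
    num_users, num_items = R.shape
    known = ~np.isnan(R)                   # mask of observed ratings
    U = rng.normal(scale=0.1, size=(num_users, n_f))
    M = rng.normal(scale=0.1, size=(num_items, n_f))
    for _ in range(iterations):
        for i in range(num_users):         # fix M, solve for each u_i
            J = known[i]                   # items rated by user i
            n_ui = J.sum()                 # weighted lambda: scale by #ratings
            if n_ui == 0:
                continue                   # no ratings, nothing to solve
            A = M[J].T @ M[J] + lam * n_ui * np.eye(n_f)
            U[i] = np.linalg.solve(A, M[J].T @ R[i, J])
        for j in range(num_items):         # fix U, solve for each m_j
            I = known[:, j]                # users who rated item j
            n_mj = I.sum()
            if n_mj == 0:
                continue
            A = U[I].T @ U[I] + lam * n_mj * np.eye(n_f)
            M[j] = np.linalg.solve(A, U[I].T @ R[I, j])
    return U, M
```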
ALS-WR is scalable
■ Which properties make this approach scalable?
• all feature vectors in one iteration can be computed independently of each other
• only a small portion of the data is necessary to compute a feature vector
■ Parallelization with Map/Reduce
• computing user feature vectors: the mappers need to send each user's rating vector and the feature vectors of his/her rated items to the same reducer
• computing item feature vectors: the mappers need to send each item's rating vector and the feature vectors of the users who rated it to the same reducer (see the sketch below)
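A conceptual, single-process imitation of the map/reduce grouping for the user-feature step; the function names and the toy data are illustrative, not an actual Hadoop or Mahout API:

```python
import numpy as np
from collections import defaultdict

def map_user_step(ratings, M):
    # ratings: (user, item, rating) triples; M: current item feature matrix.
    # Emit each rating together with the rated item's feature vector, keyed
    # by user, so everything needed for u_i reaches the same reducer.
    for user, item, r in ratings:
        yield user, (M[item], r)

def reduce_user_step(user, values, n_f, lam):
    # One reducer recomputes u_i by solving the same regularized
    # least-squares problem as in the ALS sketch above.
    F = np.array([m for m, _ in values])   # stacked item feature vectors
    r = np.array([r for _, r in values])   # the user's known ratings
    A = F.T @ F + lam * len(values) * np.eye(n_f)
    return user, np.linalg.solve(A, F.T @ r)

# the shuffle phase groups the emitted values by key
grouped = defaultdict(list)
M = np.random.rand(3, 2)                   # toy item features, n_f = 2
for user, value in map_user_step([(0, 0, 5.0), (0, 2, 1.0)], M):
    grouped[user].append(value)
new_U = dict(reduce_user_step(u, v, n_f=2, lam=0.05) for u, v in grouped.items())
```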
Incorporating biases
■ Problem: explicit feedback data is highly biased
□ some users tend to rate more extremely than others
□ some items tend to get higher ratings than others
■ Solution: explicitly model the biases
□ the bias of a rating is modeled as a combination of the overall average rating, the user bias and the item bias
□ the rating bias can be incorporated into the prediction (see the sketch below)

$$b_{ij} = \mu + b_i + b_j \qquad \hat{r}_{ij} = b_{ij} + u_i^T m_j$$

where $\mu$ is the overall average rating, $b_i$ the user bias and $b_j$ the item bias
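A minimal sketch of bias-aware prediction; mu and the bias vectors would be estimated from the training data, and all names here are illustrative:

```python
import numpy as np

def predict_with_bias(mu, b_user, b_item, U, M, i, j):
    # rating bias: overall average + user bias + item bias
    b_ij = mu + b_user[i] + b_item[j]
    # bias plus the latent factor interaction
    return b_ij + U[i] @ M[j]
```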
Latent factor models
■ implicit feedback data is very different from explicit data!
□ e.g. use the number of clicks on a product page of an online shop
□ the whole matrix is defined!
□ no negative feedback
□ interactions that did not happen produce zero values
□ however, we should have only little confidence in these (maybe the user never had the chance to interact with these items)
□ standard decomposition techniques like SVD would give us a decomposition that is biased towards the zero entries, so they are again not applicable
Latent factor models
■ Solution for working with implicit data: weighted matrix factorization
■ create a binary preference matrix P
■ each entry in this matrix is weighted by a confidence function
□ zero values should get low confidence
□ values that are based on a lot of interactions should get high confidence
■ the confidence is incorporated into the model
□ the factorization will 'prefer' more confident values (see the sketch below)

$$p_{ij} = \begin{cases} 1 & r_{ij} > 0 \\ 0 & r_{ij} = 0 \end{cases} \qquad c(i,j) = 1 + \alpha \, r_{ij}$$

$$f(U,M) = \sum_{i,j} c(i,j) \left( p_{ij} - u_i^T m_j \right)^2 + \lambda \left( \| u_i \|^2 + \| m_j \|^2 \right)$$
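A minimal sketch of the confidence-weighted user-vector update from Hu et al., assuming a dense interaction-count matrix; alpha and lam are illustrative values (the paper also derives a faster formulation):

```python
import numpy as np

def implicit_user_step(R, M, lam=0.05, alpha=40.0):
    # R: interaction counts (num_users x num_items); M: item feature matrix.
    num_users, _ = R.shape
    n_f = M.shape[1]
    P = (R > 0).astype(float)          # binary preference matrix p_ij
    C = 1.0 + alpha * R                # confidence c(i,j) = 1 + alpha * r_ij
    U = np.zeros((num_users, n_f))
    for i in range(num_users):
        Ci = np.diag(C[i])             # per-user confidence weights
        A = M.T @ Ci @ M + lam * np.eye(n_f)
        U[i] = np.linalg.solve(A, M.T @ Ci @ P[i])
    return U
```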
Sources
• Sarwar et al.: "Item-Based Collaborative Filtering Recommendation Algorithms", 2001
• Koren et al.: "Matrix Factorization Techniques for Recommender Systems", 2009
• Funk: "Netflix Update: Try This at Home", http://sifter.org/~simon/journal/20061211.html, 2006
• Zhou et al.: "Large-scale Parallel Collaborative Filtering for the Netflix Prize", 2008
• Hu et al.: "Collaborative Filtering for Implicit Feedback Datasets", 2008