Fachgebiet Datenbanksysteme und Informationsmanagement, Technische Universität Berlin
http://www.dima.tu-berlin.de/
AIM3 – Scalable Data Analysis and Data Mining
11 – Latent factor models for Collaborative Filtering
Sebastian Schelter, Christoph Boden, Volker Markl
Recap: Item-Based Collaborative Filtering
• compute the pairwise similarities of the columns of the rating matrix using some similarity measure
• store the top 20 to 50 most similar items per item in the item-similarity matrix
• prediction: use a weighted sum over all items similar to the unknown item that have been rated by the current user

$$p_{u,i} = \frac{\sum_{j \in S(i,u)} s_{i,j} \, r_{u,j}}{\sum_{j \in S(i,u)} \left| s_{i,j} \right|}$$

where $S(i,u)$ denotes the items similar to $i$ that have been rated by user $u$
Drawbacks of similarity-based neighborhood methods
• the assumption that a rating is determined by all of the user's ratings for commonly co-rated items is hard to justify in general
• lack of bias correction
• every co-rated item is looked at in isolation: if a movie is similar to "Lord of the Rings", do we want each part of the trilogy to contribute as a separate similar item?
• the best choice of similarity measure is determined by experimentation, not by mathematical reasoning
Latent factor models
■ Idea
• ratings are deeply influenced by a set of factors that are very specific to the domain (e.g. the amount of action in a movie, the complexity of its characters)
• these factors are generally not obvious; we might be able to think of some of them, but it is hard to estimate their impact on the ratings
• the goal is to infer these so-called latent factors from the rating data using mathematical techniques
Latent factor models
■ Approach
• users and items are characterized by latent factors; each user and each item is mapped onto a latent feature space
• each rating is approximated by the dot product of the user feature vector and the item feature vector (see the sketch below)
• prediction of unknown ratings also uses this dot product
• squared error as a measure of loss

$$\hat{r}_{ij} = u_i^T m_j, \qquad \left( r_{ij} - u_i^T m_j \right)^2, \qquad u_i, m_j \in \mathbb{R}^{n_f}$$
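A minimal sketch of the prediction step, assuming the feature vectors have already been learned (the concrete numbers are illustrative):

```python
import numpy as np

# Predicting a rating as the dot product of user and item feature vectors;
# the vectors here are illustrative, in practice they are learned from data.
u_i = np.array([0.8, 0.1, 1.2])   # user i's affinity to n_f = 3 latent factors
m_j = np.array([1.1, 0.3, 0.9])   # item j's relation to the same factors
r_hat = u_i @ m_j                 # predicted rating r_hat_ij = u_i^T m_j
print(r_hat)                      # 0.8*1.1 + 0.1*0.3 + 1.2*0.9 = 1.99
```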
Latent factor models
■ Approach
• decomposition of the rating matrix into the product of a user feature and an item feature matrix
• row in U: vector of a user's affinity to the features
• row in M: vector of an item's relation to the features
• closely related to Singular Value Decomposition, which produces an optimal low-rank approximation of a matrix (see the sketch below)
$$R \approx U M^T$$
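To make the factorization concrete, a minimal NumPy sketch of a truncated SVD on a small, fully known toy matrix (the ratings are illustrative):

```python
import numpy as np

# Truncated SVD as an optimal low-rank approximation of a fully known
# matrix; this only works because no entries are missing (see below for
# why it breaks down on partially defined rating matrices).
R = np.array([[5.0, 3.0, 1.0],
              [4.0, 3.0, 1.0],
              [1.0, 1.0, 5.0]])
Q, s, Vt = np.linalg.svd(R, full_matrices=False)
k = 2                                  # keep the k strongest latent features
U = Q[:, :k] * s[:k]                   # user feature matrix
M = Vt[:k].T                           # item feature matrix
print(np.linalg.norm(R - U @ M.T))     # small reconstruction error: R ≈ U Mᵀ
```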
Latent factor models
■ Properties of the decomposition
• automatically ranks the features by their "impact" on the ratings
• features might not necessarily be intuitively understandable
Latent factor models
■ Problematic situation with explicit feedback data
• the rating matrix is not only sparse, but only partially defined; missing entries cannot be interpreted as 0, they are simply unknown
• standard decomposition algorithms like the Lanczos method for SVD are therefore not applicable
■ Solution
• the decomposition has to be computed using the known ratings only
• find the set of user and item feature vectors that minimizes the squared error to the known ratings (see the sketch below)

$$\min_{U,M} \sum_{(i,j)} \left( r_{ij} - u_i^T m_j \right)^2$$

where the sum runs over the known ratings only
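A short sketch of evaluating this objective over the known ratings only; representing unknown entries as NaN is an assumption made for illustration:

```python
import numpy as np

# Squared error over the known ratings only; unknown entries are NaN and
# are excluded from the sum, never treated as zeros. Data is illustrative.
R = np.array([[5.0, np.nan, 1.0],
              [np.nan, 3.0, np.nan]])
U = np.random.rand(2, 2)               # toy user factors, n_f = 2
M = np.random.rand(3, 2)               # toy item factors
known = ~np.isnan(R)                   # mask of observed ratings
error = np.sum((R - U @ M.T)[known] ** 2)
```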
Latent factor models
■ the quality of the decomposition is not measured by the reconstruction error on the original data, but by the generalization to unseen data
■ regularization is necessary to avoid overfitting
■ the model has hyperparameters (regularization, learning rate) that need to be chosen
■ process: split the data into training, test and validation sets (see the sketch below)
□ train the model using the training set
□ choose the hyperparameters according to performance on the test set
□ evaluate generalization on the validation set
□ ensure that each datapoint is used in each set once (cross-validation)
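A minimal sketch of the three-way split, assuming the ratings are available as (user, item, rating) triples; the 60/20/20 proportions are an illustrative choice:

```python
import numpy as np

# Shuffle the known ratings once, then cut them into training, test and
# validation sets; the data and proportions are illustrative.
ratings = np.array([(0, 0, 5.0), (0, 1, 3.0), (1, 0, 4.0),
                    (1, 2, 1.0), (2, 1, 2.0), (2, 2, 5.0)])
rng = np.random.default_rng(0)
rng.shuffle(ratings)                   # shuffles the rows in place
n = len(ratings)
train, test, validation = np.split(ratings, [int(0.6 * n), int(0.8 * n)])
```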
Stochastic Gradient Descent
• add a regularization term
• loop through all ratings in the training set and compute the associated prediction error
• modify the parameters in the opposite direction of the gradient (see the sketch below)
• problem: the approach is inherently sequential (although recent research might have unveiled a parallelization technique)

$$\min_{U,M} \sum_{(i,j)} \left( r_{ij} - u_i^T m_j \right)^2 + \lambda \left( \| u_i \|^2 + \| m_j \|^2 \right)$$

$$e_{ij} = r_{ij} - u_i^T m_j$$
$$u_i \leftarrow u_i + \gamma \left( e_{ij} \, m_j - \lambda \, u_i \right)$$
$$m_j \leftarrow m_j + \gamma \left( e_{ij} \, u_i - \lambda \, m_j \right)$$
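A minimal sketch of these updates; the hyperparameter values and the toy data are illustrative assumptions, not taken from the slides:

```python
import numpy as np

# Stochastic gradient descent over the known ratings, with L2 regularization.
def sgd_factorization(ratings, num_users, num_items, n_f=10,
                      gamma=0.01, lam=0.05, epochs=20, seed=0):
    rng = np.random.default_rng(seed)
    U = rng.normal(scale=0.1, size=(num_users, n_f))  # user vectors u_i
    M = rng.normal(scale=0.1, size=(num_items, n_f))  # item vectors m_j
    for _ in range(epochs):
        rng.shuffle(ratings)               # visit the ratings in random order
        for i, j, r in ratings:            # known ratings only
            e = r - U[i] @ M[j]            # prediction error e_ij
            u_old = U[i].copy()            # update both vectors from old values
            U[i] += gamma * (e * M[j] - lam * U[i])
            M[j] += gamma * (e * u_old - lam * M[j])
    return U, M

# usage: list of (user, item, rating) triples
ratings = [(0, 0, 5.0), (0, 1, 3.0), (1, 0, 4.0), (1, 2, 1.0)]
U, M = sgd_factorization(ratings, num_users=2, num_items=3)
print(U @ M.T)                             # predicted rating matrix
```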
Alternating Least Squares with Weighted λ-Regularization
■ Model
• the feature matrices are modeled directly, using only the observed ratings
• add a regularization term to avoid overfitting
• minimize the regularized error of:

$$f(U,M) = \sum_{(i,j)} \left( r_{ij} - u_i^T m_j \right)^2 + \lambda \left( \sum_i n_{u_i} \| u_i \|^2 + \sum_j n_{m_j} \| m_j \|^2 \right)$$

where $n_{u_i}$ and $n_{m_j}$ denote the number of ratings of user $i$ and item $j$, respectively
■ Solving technique
• fixing one of the unknown variables turns this into a simple quadratic problem
• rotate between fixing u and m until convergence ("Alternating Least Squares", see the sketch below)
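A minimal ALS-WR sketch, assuming a dense rating matrix with NaN marking unknown entries; parameter values are illustrative:

```python
import numpy as np

# Alternate between solving a regularized least-squares problem per user
# (with M fixed) and per item (with U fixed).
def als_wr(R, n_f=10, lam=0.05, iterations=15, seed=0):
    rng = np.random.default_rng(seed)
    num_users, num_items = R.shape
    known = ~np.isnan(R)                   # mask of observed ratings
    U = rng.normal(scale=0.1, size=(num_users, n_f))
    M = rng.normal(scale=0.1, size=(num_items, n_f))
    for _ in range(iterations):
        for i in range(num_users):         # fix M, solve for each u_i
            J = known[i]                   # items rated by user i
            n_ui = J.sum()                 # weighted lambda: scale by #ratings
            if n_ui == 0:
                continue                   # no ratings, nothing to solve
            A = M[J].T @ M[J] + lam * n_ui * np.eye(n_f)
            U[i] = np.linalg.solve(A, M[J].T @ R[i, J])
        for j in range(num_items):         # fix U, solve for each m_j
            I = known[:, j]                # users who rated item j
            n_mj = I.sum()
            if n_mj == 0:
                continue
            A = U[I].T @ U[I] + lam * n_mj * np.eye(n_f)
            M[j] = np.linalg.solve(A, U[I].T @ R[I, j])
    return U, M
```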
ALS-WR is scalable
■ Which properties make this approach scalable?
• all feature vectors in one iteration can be computed independently of each other
• only a small portion of the data is necessary to compute a feature vector
■ Parallelization with Map/Reduce
• computing user feature vectors: the mappers need to send each user's rating vector and the feature vectors of his/her rated items to the same reducer
• computing item feature vectors: the mappers need to send each item's rating vector and the feature vectors of the users who rated it to the same reducer (see the sketch below)
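A conceptual, single-process imitation of the map/reduce grouping for the user-feature step; the function names and the toy data are illustrative, not an actual Hadoop or Mahout API:

```python
import numpy as np
from collections import defaultdict

def map_user_step(ratings, M):
    # ratings: (user, item, rating) triples; M: current item feature matrix.
    # Emit each rating together with the rated item's feature vector, keyed
    # by user, so everything needed for u_i reaches the same reducer.
    for user, item, r in ratings:
        yield user, (M[item], r)

def reduce_user_step(user, values, n_f, lam):
    # One reducer recomputes u_i by solving the same regularized
    # least-squares problem as in the ALS sketch above.
    F = np.array([m for m, _ in values])   # stacked item feature vectors
    r = np.array([r for _, r in values])   # the user's known ratings
    A = F.T @ F + lam * len(values) * np.eye(n_f)
    return user, np.linalg.solve(A, F.T @ r)

# the shuffle phase groups the emitted values by key
grouped = defaultdict(list)
M = np.random.rand(3, 2)                   # toy item features, n_f = 2
for user, value in map_user_step([(0, 0, 5.0), (0, 2, 1.0)], M):
    grouped[user].append(value)
new_U = dict(reduce_user_step(u, v, n_f=2, lam=0.05) for u, v in grouped.items())
```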
Incorporating biases
■ Problem: explicit feedback data is highly biased
□ some users tend to rate more extremely than others
□ some items tend to get higher ratings than others
■ Solution: explicitly model the biases
□ the bias of a rating is modeled as a combination of the overall average rating, the user bias and the item bias
□ the rating bias can be incorporated into the prediction (see the sketch below)

$$b_{ij} = \mu + b_i + b_j \qquad \hat{r}_{ij} = b_{ij} + u_i^T m_j$$

where $\mu$ is the overall average rating, $b_i$ the user bias and $b_j$ the item bias
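A minimal sketch of bias-aware prediction; mu and the bias vectors would be estimated from the training data, and all names here are illustrative:

```python
import numpy as np

def predict_with_bias(mu, b_user, b_item, U, M, i, j):
    # rating bias: overall average + user bias + item bias
    b_ij = mu + b_user[i] + b_item[j]
    # bias plus the latent factor interaction
    return b_ij + U[i] @ M[j]
```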
Latent factor models
■ implicit feedback data is very different from explicit data!
□ e.g. use the number of clicks on a product page of an online shop
□ the whole matrix is defined!
□ no negative feedback
□ interactions that did not happen produce zero values
□ however, we should have only little confidence in these (maybe the user never had the chance to interact with these items)
□ standard decomposition techniques like SVD would give us a decomposition that is biased towards the zero entries, so they are again not applicable
Latent factor models
■ Solution for working with implicit data: weighted matrix factorization
■ create a binary preference matrix P
■ each entry in this matrix is weighted by a confidence function
□ zero values should get low confidence
□ values that are based on a lot of interactions should get high confidence
■ the confidence is incorporated into the model
□ the factorization will 'prefer' more confident values (see the sketch below)

$$p_{ij} = \begin{cases} 1 & r_{ij} > 0 \\ 0 & r_{ij} = 0 \end{cases} \qquad c(i,j) = 1 + \alpha \, r_{ij}$$

$$f(U,M) = \sum_{i,j} c(i,j) \left( p_{ij} - u_i^T m_j \right)^2 + \lambda \left( \| u_i \|^2 + \| m_j \|^2 \right)$$
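A minimal sketch of the confidence-weighted user-vector update from Hu et al., assuming a dense interaction-count matrix; alpha and lam are illustrative values (the paper also derives a faster formulation):

```python
import numpy as np

def implicit_user_step(R, M, lam=0.05, alpha=40.0):
    # R: interaction counts (num_users x num_items); M: item feature matrix.
    num_users, _ = R.shape
    n_f = M.shape[1]
    P = (R > 0).astype(float)          # binary preference matrix p_ij
    C = 1.0 + alpha * R                # confidence c(i,j) = 1 + alpha * r_ij
    U = np.zeros((num_users, n_f))
    for i in range(num_users):
        Ci = np.diag(C[i])             # per-user confidence weights
        A = M.T @ Ci @ M + lam * np.eye(n_f)
        U[i] = np.linalg.solve(A, M.T @ Ci @ P[i])
    return U
```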
Sources
• Sarwar et al.: "Item-Based Collaborative Filtering Recommendation Algorithms", 2001
• Koren et al.: "Matrix Factorization Techniques for Recommender Systems", 2009
• Funk: "Netflix Update: Try This at Home", http://sifter.org/~simon/journal/20061211.html, 2006
• Zhou et al.: "Large-scale Parallel Collaborative Filtering for the Netflix Prize", 2008
• Hu et al.: "Collaborative Filtering for Implicit Feedback Datasets", 2008