Factorization Meets the Item Embedding: Regularizing Matrix Factorization with Item Co-occurrence

Dawen Liang

Columbia University/Netflix

Jaan Altosaar Laurent Charlin David Blei

A simple trick to boost the performance of your recommender system without using any additional data

• User-item interaction is commonly encoded in a user-by-item matrix

• In the form of (user, item, preference) triplets

• Matrix factorization is the standard method to infer latent user preferences

Motivation

[Figure: user-by-item matrix (rows: users, columns: items) with unobserved entries marked "?"]

• Alternatively we can model item co-occurrence across users

• Analogy: modeling a set of documents (users) as a bag of co-occurring words (items): e.g., “Pluto” and “planet”


Can we combine these two views in a single model?

YES

[Figure: click matrix Y (# users × # items) ≈ user latent factors θ (# users × K) times item latent factors β (K × # items); unobserved entries are marked "?"]

“Collaborative filtering for implicit feedback datasets”, Y. Hu, Y. Koren, C. Volinsky, ICDM 08.

$$\mathcal{L}_{\mathrm{mf}} = \sum_{u,i} c_{ui}\,(y_{ui} - \theta_u^\top \beta_i)^2$$

• Skip-gram word2vec

• Learn a low-dimensional word embedding in a continuous space

• Predict context words given the current word

Word embedding

Item embedding


We can embed item sequences in the same fashion

Levy & Goldberg show that skip-gram word2vec is implicitly factorizing (some variation of) the pointwise mutual information (PMI) matrix

“Neural Word Embedding as Implicit Matrix Factorization”, Levy & Goldberg, NIPS 14.

co-occurrence patterns for rare items (items not consumed by many users). Finally, we discuss future extensions of this model such as taking advantage of user co-occurrence information in addition to item co-occurrence.

2. RELATED WORK

The user and item factors learned through matrix factorization are usually regularized with their $\ell_2$-norm. It is also standard to further regularize the factors with side information [21]. Side information is information, in addition to the user-item preference data, about users, items, or both users and items (e.g., item metadata or user demographics) that may be indicative of preferences. It is customary to either inform the factors by conditioning on the side information or to constrain the factors by having them generate the side information (in which case the side information is modeled as a random variable). Representative examples of each case are given in [1] and [23] respectively. Recent work also proposes an item regularizer that can generate the words of the reviews associated with an item [2]. The generative model uses a neural-net based language model similar to word embeddings. The distinguishing feature of our work is that the regularization comes from a (deterministic) transformation of the original user-item preference data instead of additional data. A similar idea has also been applied to music and product recommendations [11, 7] by explicitly conditioning on the item contents.

This kind of regularization through incorporation of side information can alternatively be viewed as enforcing more complex structure in the prior of the latent factors. For example, Ranganath et al. [16] find that imposing a deep exponential family prior on the matrix factorization model, which implicitly conditions on consumption counts in a non-linear fashion, can be helpful in reducing the effect of extremely popular (or extremely unpopular) items on held-out recommendation tasks. This is analogous to our findings with the CoFactor model.

Guardia-Sebaoun et al. [5] observe that user trajectories over items can be modeled much like sequences of words. In detail, they use a word embedding model [13] to learn item representations. User embeddings are then inferred so as to predict the next item in the trajectory (given the previous item). User and item embeddings are finally used as covariates in a subsequent regression model of user-item preferences. Learning item embeddings for recommendations shares commonalities with our work. The difference is that we treat the items consumed by each user as exchangeable (i.e., we do not assume that the trajectories are given; rather, we assume that we are given an unordered set of items for each user). Additionally, we show that in our model jointly learning all parameters yields higher performance.

3. THE COFACTOR MODEL

In this section, we first introduce the two building blocks of the CoFactor model: matrix factorization (MF) and item embedding. Then we describe the CoFactor model and how to perform model inference.

Matrix factorization. MF is standard in collaborative filtering [9]. Given the sparse user-item interaction matrix $Y \in \mathbb{R}^{U \times I}$ from $U$ users and $I$ items, MF decomposes it into the product of user and item latent factors denoted $\theta_u \in \mathbb{R}^K$ ($u = 1, \dots, U$) and $\beta_i \in \mathbb{R}^K$ ($i = 1, \dots, I$), respectively.

In this paper, our focus is on implicit feedback data [8, 15]. However, we would like to point out that this is not a limiting aspect of our CoFactor model: it can be readily extended to the explicit feedback setting. In the matrix factorization for implicit feedback model [8], the following objective is optimized:

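$$\mathcal{L}_{\mathrm{mf}} = \sum_{u,i} c_{ui}\,(y_{ui} - \theta_u^\top \beta_i)^2 + \lambda_\theta \sum_{u} \|\theta_u\|_2^2 + \lambda_\beta \sum_{i} \|\beta_i\|_2^2.$$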

In this MF objective, the scaling parameter $c_{ui}$ is a hyperparameter that is normally set so that $c_{y=1} > c_{y=0}$. It can be tuned to balance the unobserved ratings ($y = 0$) against the observed ratings ($y = 1$) in the click matrix. $\lambda_\theta$ and $\lambda_\beta$ are regularization parameters. The solution to this objective $\mathcal{L}_{\mathrm{mf}}$ can be interpreted as the maximum a posteriori estimate for the following probabilistic model:

$$\theta_u \sim \mathcal{N}(0, \lambda_\theta^{-1} I_K)$$
$$\beta_i \sim \mathcal{N}(0, \lambda_\beta^{-1} I_K)$$
$$y_{ui} \sim \mathcal{N}(\theta_u^\top \beta_i, c_{ui}^{-1}).$$
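To make the link between the probabilistic model and the objective explicit, the negative log posterior can be written out term by term. The following is a sketch that treats $c_{ui}$, $\lambda_\theta$, and $\lambda_\beta$ as fixed constants, so that the Gaussian normalizers only contribute an additive constant:

```latex
\begin{aligned}
-\log p(\theta, \beta \mid Y)
  &= \sum_{u,i} \frac{c_{ui}}{2}\,(y_{ui} - \theta_u^\top \beta_i)^2
   + \frac{\lambda_\theta}{2} \sum_{u} \|\theta_u\|_2^2
   + \frac{\lambda_\beta}{2} \sum_{i} \|\beta_i\|_2^2
   + \text{const} \\
  &= \tfrac{1}{2}\,\mathcal{L}_{\mathrm{mf}} + \text{const}.
\end{aligned}
```

Minimizing $\mathcal{L}_{\mathrm{mf}}$ and maximizing the posterior therefore yield the same $\theta$ and $\beta$.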

We choose to introduce MF from an optimization perspective since our contribution results in a model which does not have a clear generative interpretation.
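As an illustration of the objective from the optimization perspective, the sketch below evaluates the weighted MF loss on a small dense click matrix. The function name `wmf_objective`, the default weights `c1=10.0` and `c0=1.0`, and the default regularization values are assumptions made for the example, not values taken from the paper.

```python
import numpy as np

def wmf_objective(Y, theta, beta, c1=10.0, c0=1.0, lam_theta=1e-2, lam_beta=1e-2):
    """Weighted MF loss for a dense binary click matrix Y of shape (U, I).

    theta has shape (U, K), beta has shape (I, K).
    c1 and c0 are the (assumed) weights for y = 1 and y = 0, with c1 > c0.
    """
    Y = np.asarray(Y, dtype=float)
    C = np.where(Y > 0, c1, c0)               # confidence c_ui
    resid = Y - theta @ beta.T                # y_ui - theta_u^T beta_i
    fit = np.sum(C * resid ** 2)              # weighted squared error
    reg = lam_theta * np.sum(theta ** 2) + lam_beta * np.sum(beta ** 2)
    return fit + reg
```

A dense matrix keeps the example short; a practical implementation would exploit the sparsity of Y.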

Word embedding. Word embedding models (e.g., word2vec [13]) have gained success in many natural language processing tasks. Given a sequence of words, these models attempt to embed each word into a low-dimensional (relative to the vocabulary size) continuous space. In the skip-gram model, the objective is to predict context words (surrounding words within a fixed window) given the current word. Stochastic gradient descent (SGD) with negative sampling is normally used to train a word embedding model (see Mikolov et al. [13] for more details).

Levy and Goldberg [10] show that word2vec with a negative sampling value of $k$ can be interpreted as implicitly factorizing the pointwise mutual information (PMI) matrix shifted by $\log k$. PMI between a word $i$ and its context word $j$ is defined as:

$$\mathrm{PMI}(i, j) = \log \frac{P(i, j)}{P(i)\,P(j)}$$

Empirically, it is estimated as:

$$\mathrm{PMI}(i, j) = \log \frac{\#(i, j) \cdot D}{\#(i) \cdot \#(j)}.$$

Here $\#(i, j)$ is the number of times word $j$ appears in the context of word $i$. $D$ is the total number of word-context pairs. $\#(i) = \sum_j \#(i, j)$ and $\#(j) = \sum_i \#(i, j)$.

After making the connection between word2vec and matrix factorization, Levy and Goldberg [10] further proposed to perform word embedding by spectral dimensionality reduction (e.g., singular value decomposition) on the shifted positive PMI (SPPMI) matrix:

$$\mathrm{SPPMI}(i, j) = \max\{\max\{\mathrm{PMI}(i, j), 0\} - \log k,\ 0\}$$

This is attractive since it does not require learning rate and hyperparameter tuning as in SGD. In our CoFactor model, we will follow the same approach to decompose the SPPMI matrix.
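The empirical PMI estimate and the SPPMI transformation can be sketched in a few lines of NumPy. This is a minimal dense illustration: the function name `sppmi` is hypothetical, and pairs that never co-occur are assigned 0, which is consistent with the formula because their PMI is negative infinity.

```python
import numpy as np

def sppmi(counts, k=1):
    """Shifted positive PMI from a matrix of co-occurrence counts #(i, j)."""
    counts = np.asarray(counts, dtype=float)
    D = counts.sum()                              # total number of pairs
    row = counts.sum(axis=1, keepdims=True)       # #(i) = sum_j #(i, j)
    col = counts.sum(axis=0, keepdims=True)       # #(j) = sum_i #(i, j)
    out = np.zeros_like(counts)
    nz = counts > 0                               # only observed pairs
    pmi = np.log(counts[nz] * D / (row @ col)[nz])
    # max{max{PMI, 0} - log k, 0}
    out[nz] = np.maximum(np.maximum(pmi, 0.0) - np.log(k), 0.0)
    return out
```

Larger values of `k` shift more entries down to zero, so the resulting matrix becomes sparser.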

Item embedding. Users consume items sequentially. Sequences of items are analogous to sequences of words, so

[Figure: co-occurrence matrix indexed by current word/item and context word/item]

• PMI(“Pluto”, “planet”) > PMI(“Pluto”, “RecSys”)

Jointly factorize both the click matrix and co-occurrence PMI matrix with a shared item representation/embedding

CoFactor

• Item representation must account for both user-item interactions and item-item co-occurrence

• Alternative interpretation: regularizing the traditional MF objective with item embeddings learned by factorizing the item co-occurrence matrix

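$$\mathcal{L}_{\mathrm{co}} = \underbrace{\sum_{u,i} c_{ui}\,(y_{ui} - \theta_u^\top \beta_i)^2}_{\text{Matrix factorization}} + \underbrace{\sum_{m_{ij} \neq 0} (m_{ij} - \beta_i^\top \gamma_j - w_i - c_j)^2}_{\text{Item embedding}}$$

Shared item representation/embedding: $\beta_i$ appears in both terms.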
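To show how the two terms share the item factors, here is a hedged sketch that evaluates the joint objective without regularization terms, as it appears on the slide. The function name `cofactor_objective`, the argument names (`gamma` for context embeddings, `w` and `c_bias` for the item and context biases), and the default weights are assumptions for illustration; `M` stands for the co-occurrence (SPPMI) matrix with entries $m_{ij}$.

```python
import numpy as np

def cofactor_objective(Y, M, theta, beta, gamma, w, c_bias, c1=10.0, c0=1.0):
    """Joint loss: weighted MF term plus item-embedding term.

    Y: (U, I) click matrix, M: (I, I) co-occurrence matrix,
    theta: (U, K), beta: (I, K), gamma: (I, K), w: (I,), c_bias: (I,).
    """
    Y = np.asarray(Y, dtype=float)
    M = np.asarray(M, dtype=float)
    C = np.where(Y > 0, c1, c0)                       # confidence c_ui
    mf_term = np.sum(C * (Y - theta @ beta.T) ** 2)   # user-item fit
    pred = beta @ gamma.T + w[:, None] + c_bias[None, :]
    mask = M != 0                                     # only non-zero m_ij
    emb_term = np.sum((M - pred)[mask] ** 2)          # item-item fit
    return mf_term + emb_term
```

Because `beta` enters both terms, a gradient step on the item factors is pulled by the click matrix and the co-occurrence matrix at the same time.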

Problem/application-specific

• Define context as the entire user click history

• #(i, j) is the number of users who clicked on both item i and item j

• Do not require any additional information beyond standard MF model
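With the context defined as the entire click history, the co-occurrence counts follow directly from the binarized click matrix. The sketch below is one way to compute them; the function name `item_cooccurrence` is hypothetical, and zeroing the diagonal (an item paired with itself) is an assumption of this example rather than something stated above.

```python
import numpy as np

def item_cooccurrence(Y):
    """#(i, j): number of users who clicked on both item i and item j."""
    B = (np.asarray(Y) > 0).astype(np.int64)   # binarize clicks, shape (U, I)
    X = B.T @ B                                # (I, I) co-click counts
    np.fill_diagonal(X, 0)                     # assumption: ignore self-pairs
    return X
```

The result is symmetric and uses nothing beyond the data already given to standard MF.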

How to define “co-occur”

• Data preparation: 70/20/10 train/test/validation

• Make sure train/validation do not overlap in time with test

• Metrics: Recall@20, 50, NDCG@100, MAP@100

Empirical study

                   ArXiv     ML-20M     TasteProfile
# of users         25,057    111,148    221,830
# of items         63,003    11,711     22,781
# interactions     1.8M      8.2M       14.0M
% interactions     0.12%     0.63%      0.29%
with timestamps    yes       yes        no

Table 1: Attributes of datasets after preprocessing. Interactions are non-zero entries (listening counts, watches, and clicks). % interactions refers to the density of the user-item interaction matrix ($Y$). For datasets with timestamps, we ensure there is no overlap in time between the training and test sets.

why jointly factoring both the user click matrix and item co-occurrence matrix boosts the performance by exploring the model fits.

• We also demonstrate the importance of joint learning by comparing with a model that does word2vec for item embedding followed by MF in a two-step process.

4.1 Datasets

Throughout this study we use three medium- to large-scale user-item consumption datasets from various domains: 1) scientific articles data from the arXiv²; 2) users’ movie viewing data from MovieLens [6]; 3) the taste profile subset of the million song dataset [3]. In more detail:

ArXiv: user-paper clicks derived from log data collected in the first half of 2012 from the arXiv pre-print server. The data are binarized (multiple clicks by the same user on a single paper are considered to be a single click). We preprocess the data to ensure that all users and items have a minimum of ten clicks.

MovieLens-20M (ML-20M): user-movie ratings collected from a movie recommendation service. We follow the standard procedure of binarizing explicit data by only keeping the ratings of four or higher and interpreting them as implicit feedback. We only keep users who have watched at least five movies.

TasteProfile: user-song play counts collected by the music intelligence company Echo Nest (now owned by Spotify).³ We binarize the play counts and interpret them as implicit preference data. We further preprocess the dataset by only keeping users with at least 20 songs in their listening history and songs that are listened to by at least 50 users.

To create the training/validation/test splits for the datasets with timestamps, we order all the user-item interactions by time and take the first 80% as training data, from which we randomly select 10% as the validation set. For the remaining 20% of the data, we only keep the users and items that appear in the training and validation sets to obtain the test set. For TasteProfile, for which we do not have timestamps, we randomly split the observed user-item interactions into training/validation/test sets with 70/10/20 proportions. The final dimensions of all the datasets after preprocessing are summarized in Table 1.
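The time-based split described above can be sketched as follows. The function name `temporal_split`, the parallel-array input format, and the `seed` argument are assumptions made for this illustration; the sketch returns index arrays into the original interactions.

```python
import numpy as np

def temporal_split(users, items, timestamps, train_frac=0.8, val_frac=0.1, seed=0):
    """Split interactions by time; returns (train_idx, val_idx, test_idx)."""
    users, items, timestamps = map(np.asarray, (users, items, timestamps))
    order = np.argsort(timestamps, kind="stable")     # oldest first
    n_head = int(train_frac * len(order))
    head, tail = order[:n_head], order[n_head:]       # first 80% / last 20%

    # Randomly hold out a fraction of the earlier interactions for validation.
    rng = np.random.default_rng(seed)
    shuffled = rng.permutation(head)
    n_val = int(val_frac * len(shuffled))
    val_idx, train_idx = shuffled[:n_val], shuffled[n_val:]

    # Keep only test interactions whose user and item were seen earlier.
    keep = np.isin(users[tail], users[head]) & np.isin(items[tail], items[head])
    test_idx = tail[keep]
    return train_idx, val_idx, test_idx
```

Filtering the test set this way avoids evaluating on users or items for which no latent factors could have been learned.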

² http://arxiv.org
³ http://the.echonest.com

4.2 Metrics

As is typical in evaluating implicit feedback recommendation challenges, we turn to ranking-based metrics: Recall@M, truncated normalized discounted cumulative gain (NDCG@M), and mean average precision (MAP@M). For each user, all the metrics compare the predicted rank of (unobserved) items with their true rank. While Recall@M considers all items ranked within the first M to be equivalent, NDCG@M and MAP@M use a monotonically increasing discount to emphasize the importance of higher ranks versus lower ones. Formally, let $\pi$ be a permutation over all the items, $\mathbb{1}\{\cdot\}$ the indicator function, and $u(\pi(i))$ a function that returns 1 if user $u$ has consumed item $\pi(i)$. In our case the model learns to predict the ranking $\pi$ for each user by sorting the predicted preference $\theta_u^\top \beta_i$ for $i = 1, \dots, I$. Recall@M for user $u$ is defined as

$$\mathrm{Recall@}M(u, \pi) := \sum_{i=1}^{M} \frac{\mathbb{1}\{u(\pi(i)) = 1\}}{\min\left(M, \sum_{i'=1}^{I} \mathbb{1}\{u(\pi(i')) = 1\}\right)}.$$

The expression in the denominator evaluates to the minimum of M and the number of items consumed by user u. In this way, Recall@M is normalized to have a maximum of 1, which corresponds to successfully retrieving all the relevant items in the top M of the list.
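The normalized recall can be computed per user as in the sketch below. The function name `recall_at_m` is hypothetical, and the sketch assumes the caller has already removed training items from `scores` and `relevant`.

```python
import numpy as np

def recall_at_m(scores, relevant, M):
    """Recall@M for one user, normalized by min(M, number of relevant items)."""
    scores = np.asarray(scores, dtype=float)
    relevant = np.asarray(relevant, dtype=bool)
    top = np.argsort(-scores, kind="stable")[:M]   # indices of the top-M items
    hits = relevant[top].sum()                     # relevant items retrieved
    denom = min(M, int(relevant.sum()))
    return hits / denom if denom > 0 else 0.0
```

For example, with scores `[0.9, 0.1, 0.8, 0.3]` and held-out items at positions 0 and 3, the top two predictions contain one of the two relevant items, so Recall@2 is 0.5.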

DCG@M for user u is defined as

$$\mathrm{DCG@}M(u, \pi) := \sum_{i=1}^{M} \frac{2^{\mathbb{1}\{u(\pi(i)) = 1\}} - 1}{\log(i + 1)}$$

Normalized DCG (NDCG) is the DCG normalized to the 0–1 range, where a one signifies a perfect ranking.
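A per-user NDCG@M can be sketched as the DCG of the predicted ranking divided by the DCG of an ideal ranking that places all relevant items first. The function name `ndcg_at_m` is hypothetical; the natural logarithm is used to match the formula, although the base cancels in the ratio.

```python
import numpy as np

def ndcg_at_m(scores, relevant, M):
    """NDCG@M for one user with binary relevance."""
    scores = np.asarray(scores, dtype=float)
    relevant = np.asarray(relevant, dtype=float)
    top = np.argsort(-scores, kind="stable")[:M]       # predicted top-M items
    discounts = 1.0 / np.log(np.arange(2, M + 2))      # 1 / log(i + 1), i = 1..M
    gains = 2.0 ** relevant[top] - 1.0                 # 1 if relevant, else 0
    dcg = np.sum(gains * discounts[:len(top)])
    n_rel = int(min(M, relevant.sum()))                # best achievable hits
    idcg = np.sum(discounts[:n_rel])                   # DCG of ideal ranking
    return dcg / idcg if idcg > 0 else 0.0
```

Averaging this quantity over all test users gives the NDCG@M figures of the kind reported in the tables.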

Mean average precision (MAP@M) calculates the mean of users’ average precision (AP). The average precision AP@M for a user u is:

$$\mathrm{AP@}M(u, \pi) := \sum_{i=1}^{M} \frac{\mathrm{Precision@}i(u, \pi)}{\min\left(i, \sum_{i'=1}^{I} \mathbb{1}\{u(\pi(i')) = 1\}\right)}$$

4.3 Experiment protocols

We compare the CoFactor model with the weighted matrix factorization (WMF) model [8]. Given the similarity between the objective of CoFactor $\mathcal{L}_{\mathrm{co}}$ and that of WMF $\mathcal{L}_{\mathrm{mf}}$, we can attribute the improvement to the regularization imposed by the co-occurrence SPPMI matrix.

In all our experiments, the dimension of the latent space K is set to 100. The CoFactor model is trained following the inference algorithm described in Section 3.1. We monitor the convergence of the algorithm using NDCG@100 on the validation set for both CoFactor and WMF.

Both CoFactor and WMF require hyperparameter tuning. We first select the best hyperparameters for WMF (the weights $c_{y=0}$ and $c_{y=1}$, the regularization parameters $\lambda = \lambda_\theta = \lambda_\beta$) based on their performance on the validation set. For CoFactor, we then keep the same $c_{y=1}/c_{y=0}$ ratio and regularization parameters, and grid search for the best relative scale $\ell \in \{0.01, 0.05, 0.1, \dots, 1, 5, 10\}$. Note that when scaling $c_{ui}$, we also scale $\lambda_\theta$ and $\lambda_\beta$. The selected relative scale indicates the importance of regularization with co-occurrence information. If a large scaling parameter $c_{ui}$ is selected, it means the MF part of the model can be effective at recommendation on its own without significant help from the information provided by the co-occurrence matrix (as the scaling parameter goes to infinity, CoFactor is reduced to WMF). On the other hand, a smaller scaling

Quantitative results

             ArXiv              ML-20M             TasteProfile
             WMF    CoFactor    WMF    CoFactor    WMF    CoFactor
Recall@20    0.063  0.067       0.133  0.145       0.198  0.208
Recall@50    0.108  0.110       0.165  0.177       0.286  0.300
NDCG@100     0.076  0.079       0.160  0.172       0.257  0.268
MAP@100      0.019  0.021       0.047  0.055       0.103  0.111

Table 2: Comparison between the widely-used weighted matrix factorization (WMF) model [8] and our CoFactor model. CoFactor significantly outperforms WMF on all the datasets across all metrics. The improvement is most pronounced on the movie watching (ML-20M) and music listening (TasteProfile) datasets.

parameter indicates that the model benefits from accounting for co-occurrence patterns in the observed user behavior data. We also grid search for the negative sampling values $k \in \{1, 2, 5, 10, 50\}$, which effectively modulate how much to shift the empirically estimated PMI matrix.

4.4 Analyzing the CoFactor model fits

Table 2 summarizes the quantitative results. Each metric is averaged across all users in the test set. As we can see, CoFactor outperforms WMF [8] on all datasets and across all metrics. The improvement is very clear for the MovieLens (ML-20M) and music (TasteProfile) data. Note that both models make use of the same data and optimize similar objectives, with CoFactor benefiting from an extra co-occurrence regularization term.

Exploratory analysis

When does CoFactor do better/worse? Figure 1 shows the breakdown of NDCG@100 by user activity in the training set for all three datasets. Although details vary across datasets, the CoFactor model is able to consistently improve recommendation performance for users who have only consumed a small number of items. This is understandable since the standard MF model is known to have the limitation of not being able to accurately infer inactive users’ preferences, while CoFactor explicitly makes use of the additional signal from item co-occurrence patterns to help learn better latent representations, even when user-item interaction data is scarce. For active users (the rightmost panel of each plot), WMF does worse than CoFactor on ArXiv, better on ML-20M, and roughly the same on TasteProfile. However, since active users are the minority (most recommendation datasets have a long tail), the standard error is also bigger.

To understand when CoFactor makes better recommendations, we take a look at how it ranks relatively rare items differently from WMF. Figure 2 shows the histogram of ranks for movies with fewer than 10 viewings in the training set (which consists of 3,239 movies out of the 11,711 total movies). We show four randomly selected users on the MovieLens ML-20M data (the general trend is consistent across the entire dataset, and we observe a similar pattern for TasteProfile as well). We can see that WMF mostly ranks rare movies in the middle of its recommendations, which is expected because the collaborative filtering model is driven by item popularity. On the other hand, CoFactor can push these rare movies both to the top and bottom of its recommendations. We did not observe as clear a pattern on the ArXiv data. We conjecture that this is due to the fact that the ArXiv dataset is less popularity-biased than ML-20M and TasteProfile: the mean (median) of users who consumed an

(a) ArXiv

(b) ML-20M

(c) TasteProfile

Figure 1: Average normalized discounted cumulative gain (NDCG@100) breakdown for both CoFactor and weighted matrix factorization (WMF) [8] based on user activity on different datasets. Error bars correspond to one standard error. There is some variation across datasets, but CoFactor is able to consistently improve recommendation performance for users who have only consumed a small number of items.

• We get better results by simply re-using the data

• Item co-occurrence is in principle available to the MF model, but the MF model (bi-linear) has limited modeling capacity to make use of it

[Figure: average NDCG@100 for CoFactor and WMF on TasteProfile, broken down by the number of songs the user has listened to (< 50; ≥ 50, < 100; ≥ 100, < 150; ≥ 150, < 500; ≥ 500), with user activity running from low to high]

We observe a similar trend for the other datasets as well

User’s watch history: Toy Story (24659), Fight Club (18728), Kill Bill: Vol. 1 (8728), Mouchette (32), Army of Shadows (L'armée des ombres) (96)

Top recommendation by CoFactor: The Silence of the Lambs (37217), Pulp Fiction (37445), Finding Nemo (9290), L’Atalante (90), Diary of a Country Priest (Journal d'un curé de campagne) (68)

Top recommendation by WMF: Rain Man (11862), Pulp Fiction (37445), Finding Nemo (9290), The Godfather: Part II (15325), That Obscure Object of Desire (Cet obscur objet du désir) (300)

Numbers in parentheses: number of users who watched the movie in the training set

How important is joint learning?

item is 597 (48) for ML-20M and 441 (211) for TasteProfile, while only 21 (12) for ArXiv.

Figure 2: The histogram of ranks from four randomly selected users for movies with fewer than 10 watches in the MovieLens ML-20M training set for both CoFactor and weighted matrix factorization (WMF) [9]. The extra regularization in CoFactor allows it to push these rare items further up and down its recommendation list for each user.

In Figure 3, we compare CoFactor and WMF for a particular user who has watched many popular movies (e.g., “Toy Story” and “Fight Club”), as well as some rare French movies (e.g., “Mouchette” and “Army of Shadows”). We find that WMF highly ranks popular movies and places them on top of its recommendations. Even when it recommends a French movie (“That Obscure Object of Desire”) to this user, it is a relatively popular one (as indicated by the number of users who have watched it). In contrast, CoFactor better balances the popular movies and relevant (but relatively rare) French movies in its recommendations. This makes sense because rare French movies co-occur in the training set, and this is captured by the SPPMI matrix. We find similar exploratory examples in TasteProfile.

This explains the performance gap between CoFactor and WMF among active users on ML-20M, as CoFactor could potentially rank popular movies lower than WMF. Active users are more likely to have watched popular movies in both the training and test sets. When we look at the top 100 users where WMF does better than CoFactor in terms of NDCG@100, we find that the performance difference is never caused by WMF recommending a rare movie in the test set that CoFactor fails to recommend.

How important is joint learning? One of the main contributions of the proposed CoFactor model is incorporating both MF and item embedding in a joint learning objective. This “end-to-end” approach has proven effective in many machine learning models. To demonstrate the advantage of joint learning, we conduct an alternative experiment on ArXiv where we perform the learning in a two-stage process. First, we train a skip-gram word2vec model on the click data, treating users’ entire click histories as the context; this is equivalent to factorizing the SPPMI matrix in the CoFactor model. We then fix these learned item embeddings from

             WMF    CoFactor    word2vec + reg
Recall@20    0.063  0.067       0.052
Recall@50    0.108  0.110       0.095
NDCG@100     0.076  0.079       0.065
MAP@100      0.019  0.021       0.016

Table 3: Comparison between joint learning (CoFactor) and learning from a separate two-stage (word2vec + reg) process on ArXiv. Even though they make similar modeling assumptions, CoFactor provides superior performance.

word2vec as the latent factors $\beta_i$ in the MF model, and learn user latent factors $\theta_u$. Learning $\theta_u$ in this way is the same as one iteration of CoFactor, which is effectively doing a weighted ridge regression.
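The weighted ridge regression for the user factors has a closed-form solution when the item factors are held fixed. The sketch below solves it one user at a time on a dense click matrix; the function name `update_user_factors` and the default values of `c1`, `c0`, and `lam` are assumptions for illustration, and the update is derived from the MF objective rather than copied from the authors' implementation.

```python
import numpy as np

def update_user_factors(Y, beta, c1=10.0, c0=1.0, lam=1e-2):
    """Solve for theta_u with item factors beta (I, K) held fixed.

    Minimizes sum_i c_ui (y_ui - theta_u^T beta_i)^2 + lam * ||theta_u||^2.
    """
    Y = np.asarray(Y, dtype=float)
    U, K = Y.shape[0], beta.shape[1]
    theta = np.zeros((U, K))
    for u in range(U):
        c_u = np.where(Y[u] > 0, c1, c0)                         # weights c_ui
        A = (beta * c_u[:, None]).T @ beta + lam * np.eye(K)     # sum_i c_ui beta_i beta_i^T + lam I
        b = beta.T @ (c_u * Y[u])                                # sum_i c_ui y_ui beta_i
        theta[u] = np.linalg.solve(A, b)
    return theta
```

In the two-stage setup, `beta` would hold the fixed word2vec item embeddings, so only this single pass over users is needed.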

We evaluate the model learned from this two-stage process (word2vec + reg) and report the quantitative results in Table 3. (The results for WMF and CoFactor are copied from Table 2 for comparison.) The performance of the two-stage model is much worse. This is understandable because the item embeddings learned from word2vec are certainly not as well-suited for recommendation as the item latent factors learned from MF. Using this item embedding in the second stage to infer user preferences will likely lead to inferior performance. On the other hand, CoFactor is capable of finding the proper balance between MF and item embeddings to get the best of both worlds.

5. DISCUSSION

It is interesting that CoFactor improves over WMF by re-using the preference data in a different encoding. The item co-occurrence information is, in principle, available from the user-item interaction data to MF. However, as a bi-linear model with limited capacity, MF cannot uncover such information. On the other hand, highly non-linear models, such as deep neural networks, have shown limited success at the task of preference prediction [21, 17]. Our approach is somewhat in between, where we have used a deterministic non-linear transformation to re-encode the input.

A natural question is whether it is always better to incorporate item co-occurrence information. This depends on the problem and specific data. However, as mentioned in Section 4.3, as the relative scale on the weight $c_{ui}$ goes to infinity, CoFactor is effectively reduced to WMF. Therefore, as long as there is additional information that is useful in the co-occurrence matrix, in principle, the performance of CoFactor is lower bounded by that of WMF.

5.1 Possible extensions to the model

It is straightforward to extend our CoFactor model to explicitly handle sessions of user-item consumption data by constructing a session-based co-occurrence matrix, as discussed in Section 2. This is likely to yield improved performance in some settings, such as in purchase behavior or music listening data, where the notion of a ‘shopping cart’ or ‘playlist’ induces session-based patterns.

Adding user-user co-occurrence to our model to regularize user latent factors is another natural extension, which we leave to future work. We anticipate this extension to yield further improvements, especially for users in the long tail. The type of regularization we have added to MF models can

Extension

• User-user co-occurrence

• Higher-order co-occurrence patterns

• Add the same type of item-item co-occurrence regularization in other collaborative filtering methods, e.g., BPR, factorization machine, or SLIM

Conclusion

• We present the CoFactor model:

• Jointly factorize both user-item click matrix and item-item co-occurrence matrix

• Motivated by the recent success of word embedding models (e.g., word2vec)

• Explore the results both quantitatively and qualitatively to investigate the pros/cons

Source code available: https://github.com/dawenl/cofactor


Thank you
