
Review: Cold Start Problem

Machine Learning for Big Data CSE547/STAT548, University of Washington

Emily Fox February 18th, 2014

©Emily Fox 2014

Case Study 4: Collaborative Filtering

Cold-Start Problem

- Challenge: cold-start problem (new movie or user)
- Methods: use features of the movie/user


Cold-Start Problem More Formally

- Consider a new user u′ and predicting that user's ratings
  - No previous observations
  - Objective considered so far:

$$\min_{L,R}\ \frac{1}{2}\sum_{r_{uv}} \left(L_u \cdot R_v - r_{uv}\right)^2 + \frac{\lambda_u}{2}\lVert L\rVert_F^2 + \frac{\lambda_v}{2}\lVert R\rVert_F^2$$

  - Optimal user factor: with no observed ratings for u′, only the regularizer involves L_{u′}, so the optimum is L_{u′} = 0
  - Predicted user ratings: L_{u′} · R_v = 0 for every movie v, so the factorization is uninformative for a brand-new user

An Alternative Formulation

- A simpler model for collaborative filtering
  - We would not have this issue if we assumed all users were identical
  - What about for new movies? What if we had side information?
  - What dimension should w be?
  - Fit linear model:
  - Minimize: (both left blank on the slide; see the sketch below)
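The model and objective are handwritten fill-ins on the slide; a plausible reconstruction, assuming φ(v) denotes a feature vector for movie v (so w has dimension equal to the number of movie features):

$$\hat r_{uv} = w \cdot \phi(v), \qquad \min_w\ \frac{1}{2}\sum_{r_{uv}} \left(w \cdot \phi(v) - r_{uv}\right)^2 + \frac{\lambda}{2}\lVert w\rVert_2^2$$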


Personalization

- If we don't have any observations about a user, use the wisdom of the crowd
  - Addresses the cold-start problem
- Clearly, not all users are the same
- Just as in personalized click prediction, consider a model with global and user-specific parameters (a sketch follows this list)
- As we gain more information about the user, forget the crowd
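The global-plus-personal model is left unwritten here; a plausible form, consistent with the unified objective later in this lecture, keeps a shared weight vector w plus a per-user offset w_u:

$$\hat r_{uv} = (w + w_u) \cdot \phi(v)$$

With few observations for user u, regularization keeps w_u near 0 and the prediction falls back on the crowd model; as ratings accumulate, w_u moves away from 0 and personalizes the prediction.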

User Features…

- In addition to movie features, may have information about the user:
- Combine with features of the movie:
- Unified linear model: (see the sketch below)
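The feature vectors and model are fill-ins; a plausible reading, assuming user features and movie features are stacked (possibly with interaction terms) into a joint vector φ(u, v):

$$\hat r_{uv} = (w + w_u) \cdot \phi(u, v)$$

This is the feature term that reappears in the unified objective a few slides below.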


Feature-based Approach versus Matrix Factorization

- Feature-based approach:
  - Feature representation of users and movies is fixed
  - Can address the cold-start problem
- Matrix factorization approach:
  - Suffers from the cold-start problem
  - User and movie features are learned from the data
- A unified model: (see below)
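The unified model is left blank here, but its prediction rule can be read off the objective on the next slide:

$$\hat r_{uv} = L_u \cdot R_v + (w + w_u) \cdot \phi(u, v)$$

The learned factors L_u, R_v capture what the fixed features miss, while the feature term keeps predictions sensible for cold-start users and movies.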

Unified Collaborative Filtering via SGD

- Unified objective:

$$\min_{L,R,w,\{w_u\}}\ \frac{1}{2}\sum_{r_{uv}} \Big(L_u \cdot R_v + (w + w_u)\cdot\phi(u,v) - r_{uv}\Big)^2 + \frac{\lambda_u}{2}\lVert L\rVert_F^2 + \frac{\lambda_v}{2}\lVert R\rVert_F^2 + \frac{\lambda_w}{2}\lVert w\rVert_2^2 + \frac{\lambda_{w_u}}{2}\sum_u \lVert w_u\rVert_2^2$$

- Gradient step observing r_uv, where ε_t = L_u^{(t)} · R_v^{(t)} + (w^{(t)} + w_u^{(t)}) · φ(u,v) − r_uv is the prediction residual
  - For L, R:

$$\begin{bmatrix} L_u^{(t+1)} \\ R_v^{(t+1)} \end{bmatrix} = \begin{bmatrix} (1-\eta_t\lambda_u)\,L_u^{(t)} - \eta_t\,\epsilon_t\, R_v^{(t)} \\ (1-\eta_t\lambda_v)\,R_v^{(t)} - \eta_t\,\epsilon_t\, L_u^{(t)} \end{bmatrix}$$

  - For w and w_u, left blank on the slide, the updates follow the same pattern:

$$w^{(t+1)} = (1-\eta_t\lambda_w)\,w^{(t)} - \eta_t\,\epsilon_t\,\phi(u,v), \qquad w_u^{(t+1)} = (1-\eta_t\lambda_{w_u})\,w_u^{(t)} - \eta_t\,\epsilon_t\,\phi(u,v)$$
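A minimal NumPy sketch of one pass of this SGD scheme; the function and variable names (`sgd_epoch`, `phi`, `W_user`) are illustrative, not from the lecture:

```python
import numpy as np

def sgd_epoch(ratings, L, R, w, W_user, phi, eta, lam):
    """One SGD pass over observed (u, v, r) triples for the unified model.

    L: (n_users, k) user factors; R: (n_movies, k) movie factors
    w: (d,) global weights; W_user: (n_users, d) user-specific weights
    phi: function (u, v) -> (d,) feature vector
    lam: dict with keys 'u', 'v', 'w', 'wu' (regularization strengths)
    """
    for u, v, r in ratings:
        x = phi(u, v)
        # Prediction residual eps_t for the unified model
        eps = L[u] @ R[v] + (w + W_user[u]) @ x - r
        Lu_old = L[u].copy()
        # Factor updates: shrink by the regularizer, step along the gradient
        L[u] = (1 - eta * lam['u']) * L[u] - eta * eps * R[v]
        R[v] = (1 - eta * lam['v']) * R[v] - eta * eps * Lu_old
        # Global and user-specific weight updates
        w[:] = (1 - eta * lam['w']) * w - eta * eps * x
        W_user[u] = (1 - eta * lam['wu']) * W_user[u] - eta * eps * x
    return L, R, w, W_user
```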


What you need to know…

- Cold-start problem
- Feature-based methods for collaborative filtering
  - Help address the cold-start problem
- Unified approach


Connections with Probabilistic Matrix Factorization

Machine Learning for Big Data CSE547/STAT548, University of Washington

Emily Fox February 18th, 2014

©Emily Fox 2014

Case Study 4: Collaborative Filtering


Matrix Completion Problem

- Filling in missing data?

[Figure: the rating matrix X, with rows indexing users and columns indexing movies (the slide also annotates the transposed view, rows = movies and columns = users); X_ij is known for the black cells and unknown for the white cells.]

Coordinate Descent for Matrix Factorization: Alternating Least-Squares

- Overall objective:

$$\min_{L,R}\ \sum_{(u,v)\,:\,r_{uv}\neq ?} \left(L_u \cdot R_v - r_{uv}\right)^2$$

- Fix movie factors, optimize for user factors
  - Independent least-squares problem for each user:

$$\min_{L_u}\ \sum_{v \in V_u} \left(L_u \cdot R_v - r_{uv}\right)^2$$

- Fix user factors, optimize for movie factors
  - Independent least-squares problem for each movie:

$$\min_{R_v}\ \sum_{u \in U_v} \left(L_u \cdot R_v - r_{uv}\right)^2$$

- System may be underdetermined: e.g., a user with fewer than k ratings yields a rank-deficient least-squares problem (regularization resolves this)
- Converges to a local optimum of the (non-convex) objective; see the sketch below
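A compact NumPy sketch of the alternating loop; a small ridge term `lam` is included because the per-user and per-movie systems can be singular, and all names are illustrative:

```python
import numpy as np

def als(ratings_by_user, ratings_by_movie, n_users, n_movies, k,
        lam=0.1, n_iters=20, seed=0):
    """Alternating least-squares for matrix completion.

    ratings_by_user[u]  -> list of (v, r) pairs observed for user u
    ratings_by_movie[v] -> list of (u, r) pairs observed for movie v
    """
    rng = np.random.default_rng(seed)
    L = rng.normal(scale=0.1, size=(n_users, k))
    R = rng.normal(scale=0.1, size=(n_movies, k))
    for _ in range(n_iters):
        # Fix R, solve an independent ridge regression for each user
        for u, obs in ratings_by_user.items():
            Rv = np.array([R[v] for v, _ in obs])        # (m_u, k)
            r = np.array([rating for _, rating in obs])  # (m_u,)
            L[u] = np.linalg.solve(Rv.T @ Rv + lam * np.eye(k), Rv.T @ r)
        # Fix L, solve an independent ridge regression for each movie
        for v, obs in ratings_by_movie.items():
            Lu = np.array([L[u] for u, _ in obs])
            r = np.array([rating for _, rating in obs])
            R[v] = np.linalg.solve(Lu.T @ Lu + lam * np.eye(k), Lu.T @ r)
    return L, R
```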


Probabilistic Matrix Factorization (PMF)

- A generative process:
  - Pick user factors
  - Pick movie factors
  - For each observed (user, movie) pair: pick the rating as L_u · R_v + noise
- Joint probability: (the distributional choices are sketched below; the factorization appears on the next slide)
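The specific distributions are handwritten on the slide; the standard PMF choices, consistent with the Gaussian MAP derivation two slides below, are:

$$L_u \sim \mathcal N\!\left(0,\, \sigma_u^2 I_k\right), \qquad R_v \sim \mathcal N\!\left(0,\, \sigma_v^2 I_k\right), \qquad r_{uv} \mid L_u, R_v \sim \mathcal N\!\left(L_u \cdot R_v,\ \sigma_r^2\right)$$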

PMF Graphical Model

- Graphically:

$$P(L, R \mid X) \propto P(L)\,P(R)\,P(X \mid L, R)$$


Maximum A Posteriori for Matrix Completion

$$P(L, R \mid X) \propto P(L, R, X) = p(L)\,p(R)\,p(X \mid L, R)$$

$$\propto\ e^{-\frac{1}{2\sigma_u^2}\sum_{u=1}^{n}\sum_{i=1}^{k} L_{ui}^2}\ \, e^{-\frac{1}{2\sigma_v^2}\sum_{v=1}^{m}\sum_{i=1}^{k} R_{vi}^2}\ \, e^{-\frac{1}{2\sigma_r^2}\sum_{r_{uv}} \left(L_u \cdot R_v - r_{uv}\right)^2}$$

MAP versus Regularized Least-Squares for Matrix Completion

- MAP under the Gaussian model:

$$\max_{L,R}\ \log P(L, R \mid X) = -\frac{1}{2\sigma_u^2}\sum_u\sum_i L_{ui}^2 \;-\; \frac{1}{2\sigma_v^2}\sum_v\sum_i R_{vi}^2 \;-\; \frac{1}{2\sigma_r^2}\sum_{r_{uv}} \left(L_u \cdot R_v - r_{uv}\right)^2 + \text{const}$$

- Least-squares matrix completion with L2 regularization:

$$\min_{L,R}\ \frac{1}{2}\sum_{r_{uv}} \left(L_u \cdot R_v - r_{uv}\right)^2 + \frac{\lambda_u}{2}\lVert L\rVert_F^2 + \frac{\lambda_v}{2}\lVert R\rVert_F^2$$

- Understanding this as a probabilistic model is very useful! E.g.:
  - Change priors
  - Incorporate other sources of information or dependencies
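The equivalence deserves one explicit step, not written out on the slide: multiplying the log-posterior by −σ_r² and dropping the constant recovers the regularized least-squares objective with

$$\lambda_u = \frac{\sigma_r^2}{\sigma_u^2}, \qquad \lambda_v = \frac{\sigma_r^2}{\sigma_v^2}$$

so stronger priors (smaller σ_u², σ_v²) correspond to heavier regularization.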


What you need to know…

- Probabilistic model for collaborative filtering
  - Models, choice of priors
  - MAP equivalent to optimization for matrix completion


Gibbs Sampling for Bayesian Inference

Machine Learning for Big Data CSE547/STAT548, University of Washington

Emily Fox February 18th, 2014

©Emily Fox 2014

Case Study 4: Collaborative Filtering


Posterior Computations

- MAP estimation focuses on point estimation:

$$\hat\theta_{\mathrm{MAP}} = \arg\max_{\theta}\ p(\theta \mid x)$$

- What if we want a full characterization of the posterior?
  - Maintain a measure of uncertainty
  - Estimators other than the posterior mode (different loss functions)
  - Predictive distributions for future observations
- Often no closed-form characterization (e.g., mixture models, PMF, etc.)

Bayesian PMF Example

- Latent user and movie factors: L_u for users u = 1, …, n and R_v for movies v = 1, …, m
- Observations: ratings r_uv
- Hyperparameters: σ_u, σ_v, σ_r
- Want to predict a new movie rating (formalized on the next slide)

[Plate diagram: L_u in a plate over u = 1, …, n, R_v in a plate over v = 1, …, m, with the observed rating r_uv where the plates intersect.]


Bayesian PMF Example

- Ideally, we would compute the full posterior predictive:

$$p(r^*_{uv} \mid X, \sigma) = \int p(r^*_{uv} \mid L_u, R_v)\ p(L, R \mid X, \sigma)\ dL\,dR$$

- Monte Carlo methods: approximate the integral with samples (see the estimate below)

[Plate diagram as before, now with the hyperparameters σ_u, σ_v, σ_r shown.]
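The Monte Carlo estimate itself is left as a fill-in; the standard form, given posterior samples (L^{(k)}, R^{(k)}), is:

$$p(r^*_{uv} \mid X, \sigma) \;\approx\; \frac{1}{K}\sum_{k=1}^{K} p\!\left(r^*_{uv} \mid L_u^{(k)}, R_v^{(k)}\right)$$

This matches how the predictive distribution is computed in the Salakhutdinov and Mnih results quoted later.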

Bayesian PMF Example

- Want posterior samples:

$$\left(L^{(k)}, R^{(k)}\right) \sim p(L, R \mid X, \sigma)$$

- What can we sample from?
  - Hint: same reasoning as behind ALS, but sampling rather than maximization


Bayesian PMF Example

- For user u:

$$p(L_u \mid X, R, \sigma) \;\propto\; p(L_u \mid \sigma_u)\prod_{v \in V_u} p(r_{uv} \mid L_u, R_v, \sigma_r)$$

- Symmetrically for R_v conditioned on L (breaks down over movies)
- Luckily, we can use this to get our desired posterior samples
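Because the prior and likelihood are both Gaussian, this conditional is Gaussian in closed form; the computation is not spelled out on the slide, but the standard result is:

$$p(L_u \mid X, R, \sigma) = \mathcal N(\mu_u, \Sigma_u), \qquad \Sigma_u = \left(\frac{1}{\sigma_u^2} I_k + \frac{1}{\sigma_r^2}\sum_{v \in V_u} R_v R_v^{\top}\right)^{-1}, \qquad \mu_u = \frac{1}{\sigma_r^2}\,\Sigma_u \sum_{v \in V_u} r_{uv}\, R_v$$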

Gibbs Sampling

- Want draws: (L^{(k)}, R^{(k)}) ~ p(L, R | X, σ)
- Construct a Markov chain whose steady-state distribution is the target posterior
- Then, asymptotically correct
- Simplest case: (see below)
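The "simplest case" is a fill-in on the slide; given the slide's title, it is presumably the Gibbs sampler: cycle through the variables, sampling each from its full conditional given the current values of all the others:

$$\theta_j^{(t+1)} \sim p\!\left(\theta_j \,\middle|\, \theta_1^{(t+1)}, \ldots, \theta_{j-1}^{(t+1)},\ \theta_{j+1}^{(t)}, \ldots, \theta_J^{(t)}\right), \qquad j = 1, \ldots, J$$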


Bayesian PMF Gibbs Sampler

- Outline of the Bayesian PMF sampler (handwritten on the slide; a sketch follows)
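A minimal NumPy sketch of one plausible sampler using the Gaussian full conditionals above, with the hyperparameters σ_u, σ_v, σ_r held fixed (the paper quoted below also samples hyperparameters); all names are illustrative:

```python
import numpy as np

def gibbs_bpmf(ratings_by_user, ratings_by_movie, L, R,
               sigma_u, sigma_v, sigma_r, n_samples):
    """Gibbs sampler for Bayesian PMF with fixed hyperparameters.

    Alternates sampling every L_u given R, then every R_v given L,
    from their closed-form Gaussian full conditionals.
    """
    k = L.shape[1]
    rng = np.random.default_rng(0)
    samples = []
    for _ in range(n_samples):
        # Sample each user factor L_u | R, X
        for u, obs in ratings_by_user.items():
            Rv = np.array([R[v] for v, _ in obs])
            r = np.array([rating for _, rating in obs])
            prec = np.eye(k) / sigma_u**2 + Rv.T @ Rv / sigma_r**2
            cov = np.linalg.inv(prec)
            mean = cov @ (Rv.T @ r) / sigma_r**2
            L[u] = rng.multivariate_normal(mean, cov)
        # Sample each movie factor R_v | L, X
        for v, obs in ratings_by_movie.items():
            Lu = np.array([L[u] for u, _ in obs])
            r = np.array([rating for _, rating in obs])
            prec = np.eye(k) / sigma_v**2 + Lu.T @ Lu / sigma_r**2
            cov = np.linalg.inv(prec)
            mean = cov @ (Lu.T @ r) / sigma_r**2
            R[v] = rng.multivariate_normal(mean, cov)
        samples.append((L.copy(), R.copy()))
    return samples  # average p(r* | L, R) over samples for predictions
```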

Bayesian PMF Results

- Netflix data with:
  - Training set: 100,480,507 ratings from 480,189 users on 17,770 movie titles
  - Validation set: 1,408,395 ratings
  - Test set: 2,817,131 user/movie pairs with the ratings withheld

[Figure 2 (plot residue removed). Caption: Left panel: performance of SVD, PMF, logistic PMF, and Bayesian PMF using 30D feature vectors on the Netflix validation data; the y-axis displays RMSE (root mean squared error) and the x-axis shows the number of epochs, or passes through the entire training set, with the Netflix baseline score marked. Right panel: RMSE for the Bayesian PMF models on the validation set as a function of the number of samples generated; the two curves are for the models with 30D and 60D feature vectors, annotated with running times from 5.7 to 188 hrs.]

4.2. Training PMF models

For comparison, we have trained a variety of linear PMF models using MAP, choosing their regularization parameters using the validation set. In addition to linear PMF models, we also trained logistic PMF models, in which we pass the dot product between user- and movie-specific feature vectors through the logistic function σ(x) = 1/(1 + exp(−x)) to bound the range of predictions:

$$p(R \mid U, V, \alpha) = \prod_{i=1}^{N}\prod_{j=1}^{M}\Big[\mathcal N\!\left(R_{ij} \mid \sigma(U_i^{\top} V_j),\ \alpha^{-1}\right)\Big]^{I_{ij}} \tag{15}$$

The ratings 1, …, 5 are mapped to the interval [0, 1] using the function t(x) = (x − 1)/4, so that the range of valid rating values matches the range of predictions our model can make. Logistic PMF models can sometimes provide slightly better results than their linear counterparts.

To speed up training, instead of performing full batch learning, we subdivided the Netflix data into mini-batches of size 100,000 (user/movie/rating triples) and updated the feature vectors after each mini-batch. We used a learning rate of 0.005 and a momentum of 0.9 for training the linear as well as logistic PMF models.

4.3. Training Bayesian PMF models

We initialized the Gibbs sampler by setting the model parameters U and V to their MAP estimates obtained by training a linear PMF model. We also set µ₀ = 0, ν₀ = D, and W₀ to the identity matrix, for both user and movie hyperpriors. The observation noise precision α was fixed at 2. The predictive distribution was computed using Eq. 10 by running the Gibbs sampler with samples {U_i^{(k)}, V_j^{(k)}} collected after each full Gibbs step.

4.4. Results

In our first experiment, we compared a Bayesian PMF model to an SVD model, a linear PMF model, and a logistic PMF model, all using 30D feature vectors. The SVD model was trained to minimize the sum-squared distance to the observed entries of the target matrix, with no regularization applied to the feature vectors. Note that this model can be seen as a PMF model trained using maximum likelihood (ML). For the PMF models, the regularization parameters λ_U and λ_V were set to 0.002. Predictive performance of these models on the validation set is shown in Fig. 2 (left panel). The mean of the predictive distribution of the Bayesian PMF model achieves an RMSE of 0.8994, compared to an RMSE of 0.9174 of a moderately regularized linear PMF model, an improvement of over 1.7%.

The logistic PMF model does slightly outperform its linear counterpart, achieving an RMSE of 0.9097. However, its performance is still considerably worse than that of the Bayesian PMF model. A simple SVD achieves an RMSE of about 0.9280 and after about 10 epochs begins to overfit heavily. This experiment clearly demonstrates that SVD and MAP-trained PMF models can overfit and that the predictive accuracy can be improved by integrating out model parameters and hyperparameters.

From Salakhutdinov and Mnih, ICML 2008


Bayesian PMF Results

- The Bayesian model better controls for overfitting by averaging over possible parameters (instead of committing to one)

[Figure 3 (plot residue removed). Caption: Samples from the posterior over the user and movie feature vectors generated by each step of the Gibbs sampler. The two dimensions with the highest variance are shown for two users and two movies: User A (4 ratings), User C (319 ratings), Movie X (5 ratings), Movie Y (142 ratings). The first 800 samples were discarded as "burn-in".]

Table 1. Performance of Bayesian PMF (BPMF) and linear PMF on Netflix validation and test sets.

        |     Validation RMSE      |        Test RMSE
   D    |   PMF    BPMF   Inc. (%) |   PMF    BPMF   Inc. (%)
   30   |  0.9154  0.8994   1.74   |  0.9188  0.9029   1.73
   40   |  0.9135  0.8968   1.83   |  0.9170  0.9002   1.83
   60   |  0.9150  0.8954   2.14   |  0.9185  0.8989   2.13
   150  |  0.9178  0.8931   2.69   |  0.9211  0.8965   2.67
   300  |  0.9231  0.8920   3.37   |  0.9265  0.8954   3.36

We then trained larger PMF models with D = 40 and D = 60. Capacity control for such models becomes a rather challenging task. For example, a PMF model with D = 60 has approximately 30 million parameters. Searching for appropriate values of the regularization coefficients becomes a very computationally expensive task. Table 1 further shows that for the 60-dimensional feature vectors, Bayesian PMF outperforms its MAP counterpart by over 2%. We should also point out that even the simplest possible Bayesian extension of the PMF model, where Gamma priors are placed over the precision hyperparameters α_U and α_V (see Fig. 1, left panel), significantly outperforms the MAP-trained PMF models, even though it does not perform as well as the Bayesian PMF models.

It is interesting to observe that as the feature dimensionality grows, the performance accuracy for the MAP-trained PMF models does not improve, and controlling overfitting becomes a critical issue. The predictive accuracy of the Bayesian PMF models, however, steadily improves as the model complexity grows. Inspired by this result, we experimented with Bayesian PMF models with D = 150 and D = 300 feature vectors. Note that these models have about 75 and 150 million parameters, and running the Gibbs sampler becomes computationally much more expensive. Nonetheless, the validation set RMSEs for the two models were 0.8931 and 0.8920. Table 1 shows that these models not only significantly outperform their MAP counterparts but also outperform Bayesian PMF models that have fewer parameters. These results clearly show that the Bayesian approach does not require limiting the complexity of the model based on the number of the training samples. In practice, however, we will be limited by the available computer resources.

For completeness, we also report the performance results on the Netflix test set. These numbers were obtained…

What you need to know…

- Idea of full posterior inference vs. MAP estimation
- Gibbs sampling as an MCMC approach
- Example of inference in the Bayesian probabilistic matrix factorization model


Matrix Factorization and Probabilistic LFMs for Network Modeling

Machine Learning for Big Data CSE547/STAT548, University of Washington

Emily Fox February 18th, 2014

©Emily Fox 2014

Case Study 4: Collaborative Filtering

Network Data

- Structure of network data


©Emily Fox 2014 31

n  Similarities to Netflix data: ¨  Matrix ¨  High-dimensional ¨  Sparse

n  Differences ¨  Square ¨  Binary

Properties of Data Source

Matrix Factorization for Network Data

- Vanilla matrix factorization approach:
- What to return for link prediction?
- Slightly fancier: (possible reconstructions of these blanks follow)
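These items are handwritten fill-ins; consistent with the closing summary slide ("just use standard MF and threshold output", "introduce link functions to constrain predicted values"), plausible readings are a thresholded score and a logistic link:

$$\hat r_{uv} = \mathbf{1}\!\left[L_u \cdot R_v > \tau\right] \qquad \text{or} \qquad P(r_{uv} = 1) = \frac{1}{1 + e^{-L_u \cdot R_v}}$$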


Probabilistic Latent Space Models

- Assume features (covariates) of the user or relationship
- Each user has a "position" in a k-dimensional latent space
- Probability of link: (formulas on the next slide)

[Figure (plot residue removed): an example latent space embedding, with users plotted by latent coordinates Z1 vs. Z2 (roughly −10 to 10 on each axis) and points shaded by a "KabulDist" covariate ranging from about 200 to 600.]

Probabilistic Latent Space Models

- Probability of link, in distance and inner-product forms:

$$\operatorname{log\ odds}\ p(r_{uv} = 1 \mid L_u, L_v, x_{uv}, \beta) = \beta_0 + \beta^{\top} x_{uv} - \lVert L_u - L_v\rVert$$

$$\operatorname{log\ odds}\ p(r_{uv} = 1 \mid L_u, L_v, x_{uv}, \beta) = \beta_0 + \beta^{\top} x_{uv} - \lvert L_u^{\top} L_v\rvert$$

- Bayesian approach:
  - Place a prior on the user factors and regression coefficients
  - Place a hyperprior on the user factor hyperparameters
- Many other options and extensions (e.g., can use a GMM for L_u, giving a clustering of users in the latent space)
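A small NumPy sketch of the distance-form link probability; the parameter values in the usage example are made up for illustration:

```python
import numpy as np

def link_probability(L_u, L_v, x_uv, beta0, beta):
    """Latent space model: P(r_uv = 1) from covariates and latent positions.

    Log odds = beta0 + beta . x_uv - ||L_u - L_v||  (distance form)
    """
    log_odds = beta0 + beta @ x_uv - np.linalg.norm(L_u - L_v)
    return 1.0 / (1.0 + np.exp(-log_odds))

# Example: two nearby users with a favorable covariate
p = link_probability(np.array([0.5, 1.0]), np.array([0.7, 0.9]),
                     x_uv=np.array([1.0]), beta0=-1.0, beta=np.array([0.8]))
print(f"link probability: {p:.3f}")
```

In a full Bayesian treatment, beta0, beta, and the latent positions would be learned or sampled rather than supplied by hand.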


What you need to know…

- Representation of network data as a matrix
  - Adjacency matrix
- Similarities and differences between adjacency matrices and general matrix-valued data
- Matrix factorization approaches for network data
  - Just use standard MF and threshold the output
  - Introduce link functions to constrain predicted values
- Probabilistic latent space models
  - Model link probabilities using the distance between latent factors