
Page 1

Performance of Recommender Algorithms on Top-N Recommendation Tasks

RecSys 2010

Intelligent Database Systems Lab.

School of Computer Science & Engineering

Seoul National University

Center for E-Business Technology, Seoul National University, Seoul, Korea

Presented by Sangkeun Lee, 1/14/2011

Paolo Cremonesi, Yehuda Koren, Roberto Turrin (Politecnico di Milano; Yahoo! Research, Haifa, Israel; Neptuny, Milan, Italy)

Page 2

Introduction

Comparison of recommender systems

Typically by evaluating error metrics such as RMSE (root mean squared error)

The average error between estimated ratings and actual ratings

Why is the majority of the literature focused on error metrics?

Logical & convenient

However, many commercial systems perform top-N recommendation tasks

The system suggests a few specific items that are likely to be very appealing to the user

Page 3

Introduction: Top-N Performance

Classical error measures (e.g., RMSE, MAE) do not really measure top-N performance

Measures for top-N performance

Accuracy metrics

– Recall and Precision

In this paper,

The authors present an extensive evaluation of several state-of-the-art recommender algorithms and naïve non-personalized algorithms

And they draw some insights from the experimental results

On the Netflix & MovieLens datasets

Page 4

Testing Methodology: Dataset

For each dataset, the known ratings are split into two subsets:

Training set M and test set T

The test set T contains only 5-star ratings

– So, we can reasonably state that T contains items relevant to the respective users

For the Netflix dataset,

Training set = the ~100M ratings of the Netflix Prize training dataset

Test set = the 5-star ratings from the Netflix Prize probe dataset (|T| = 384,573)

For the MovieLens dataset,

Randomly sub-sampled 1.4% of the ratings to create the test set

Page 5

Testing Methodology: measuring precision and recall

1) Train the model over the ratings in M

2) For each item i rated 5 stars by user u in T

Randomly select 1000 additional items unrated by user u

Predict the ratings for the test item i and for the additional 1000 items

Form a ranked list by ordering all 1001 items according to the predicted ratings. Let p denote the rank of the test item i within this list. (The best result: p = 1)

Form a top-N recommendation list by picking the N top-ranked items from the list. If p <= N we have a hit; otherwise we have a miss. (See the sketch below.)
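A minimal sketch of this protocol, assuming a hypothetical predict(u, i) scoring function and simple in-memory data structures (not the authors' code):

```python
import random

def evaluate_top_n(test_cases, all_items, rated_by, predict, n=10, n_random=1000, seed=0):
    """test_cases: (user, item) pairs rated 5 stars in T.
    rated_by[u]: set of items user u has rated; predict(u, i): model score."""
    rng = random.Random(seed)
    hits = 0
    for u, i in test_cases:
        # 1000 random items the user has not rated, presumed irrelevant
        candidates = rng.sample([j for j in all_items if j not in rated_by[u]], n_random)
        # rank the test item against the random items by predicted score
        scores = [(predict(u, j), j) for j in candidates] + [(predict(u, i), i)]
        scores.sort(key=lambda t: -t[0])
        rank = [j for _, j in scores].index(i) + 1   # p, best case p = 1
        hits += 1 if rank <= n else 0                # hit if the test item enters the top-N
    recall = hits / len(test_cases)
    precision = recall / n
    return recall, precision
```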

Page 6

Testing Methodology: measuring precision and recall

For any single test case,

Recall for a single test case can assume either 0 (miss) or 1 (hit)

Precision for a single test case can assume either 0 (miss) or 1/N (hit)

The overall recall and precision are defined by averaging over all test cases (see the formulas below)
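Averaging hits over the |T| test cases gives the usual closed forms (a restatement of the bullets above):

$$\mathrm{recall}(N) = \frac{\#\text{hits}}{|T|}, \qquad \mathrm{precision}(N) = \frac{\#\text{hits}}{N \cdot |T|} = \frac{\mathrm{recall}(N)}{N}$$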

Page 7

Rating distribution: popular items vs. long-tail

About 33% of the ratings collected by Netflix involve only the 1.7% most popular items

To evaluate the accuracy of recommender algorithms in suggesting non-trivial items, T has been partitioned into T_head and T_long
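A minimal sketch of such a split (the head_share threshold, names, and data layout are illustrative assumptions, not the paper's exact procedure):

```python
from collections import Counter

def split_head_long(train_ratings, test_cases, head_share=0.33):
    """Partition test cases into T_head / T_long.
    train_ratings: (user, item) pairs; test_cases: (user, item) pairs from T."""
    counts = Counter(item for _, item in train_ratings)
    total = sum(counts.values())
    head, covered = set(), 0
    for item, c in counts.most_common():      # most popular items first
        if covered >= head_share * total:     # stop once ~33% of ratings are covered
            break
        head.add(item)
        covered += c
    t_head = [(u, i) for u, i in test_cases if i in head]
    t_long = [(u, i) for u, i in test_cases if i not in head]
    return t_head, t_long
```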

Page 8

Algorithms

Non-personalized models

Movie Rating Average (MovieAvg) – ranks items by their average rating

Top Popular (TopPop) – ranks items by their number of ratings; not applicable for measuring error metrics

Collaborative Filtering models

Neighborhood models

– The most common approaches

– Based on similarity among either users or items

Latent factor models

– Find hidden factors

– Model users and items in the same latent factor space

– Predict ratings using proximity (e.g., the inner product)

Page 9

Neighborhood Models

Correlation Neighborhood (CorNgbr)

The predicted rating takes the neighborhood form

$$\hat{r}_{ui} = b_{ui} + \frac{\sum_{j \in D^k(u;i)} d_{ij}\,(r_{uj} - b_{uj})}{\sum_{j \in D^k(u;i)} d_{ij}}, \qquad d_{ij} = \frac{n_{ij}}{n_{ij} + \lambda_1}\, s_{ij}$$

$b_{ui}$ denotes the rating bias of user u for item i (e.g., average ratings)

$D^k(u;i)$ denotes the set of k most similar items rated by u

$d_{ij}$ represents the shrunk similarity

$n_{ij}$ is the number of common raters

$s_{ij}$ is the similarity between items (cosine similarity)

Non-normalized Cosine Neighborhood (NNCosNgbr)

Higher ranking for items with many similar neighbors

The score is no longer an estimated rating, but it can still be used for top-N recommendation tasks
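For reference, a LaTeX rendering of the non-normalized variant, which, as described in the paper, simply drops the normalizing denominator of the CorNgbr formula above (with cosine similarities):

$$\hat{r}_{ui} = b_{ui} + \sum_{j \in D^k(u;i)} d_{ij}\,(r_{uj} - b_{uj})$$

Because the sum is not normalized, items with many strong neighbors receive larger scores, which matches the bullet above.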

Page 10

Latent Factor Models

The key idea is to factorize the user-item rating matrix into two lower-rank matrices

One matrix containing user-factors

One matrix containing item-factors

Rating estimation is computed from the proximity (e.g., the inner product) of the user and item factor vectors

SVD is undefined in the presence of unknown values

Replace unknown ratings with baseline estimations

Learn factor vectors through a suitable objective function which minimizes the prediction error

And so on (further details are out of scope here)

Two state-of-the-art algorithms

Asymmetric-SVD (AsySVD)

SVD++ (high accuracy in terms of RMSE)

Page 11

Latent Factor Models: PureSVD

Now,

We are interested only in a correct item ranking

We don’t need exact rating prediction

PureSVD

Considering all missing values in the user rating matrix as zeros

Let us define the factorization of the zero-filled user rating matrix R via a conventional (truncated) SVD:

$$\hat{R} = U \cdot \Sigma \cdot Q^{T}$$

$P = U \cdot \Sigma$: the u-th row of P represents the user factor vector $p_u$

$Q$: the i-th row of Q represents the item factor vector $q_i$

The recommendation score is $\hat{r}_{ui} = \mathbf{r}_u \cdot Q \cdot \mathbf{q}_i^{T}$, where $\mathbf{r}_u$ is the u-th row of the user rating matrix

The score is no longer an estimated rating, but it can still be used for top-N recommendation tasks
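A minimal PureSVD sketch following the description above, using SciPy's truncated SVD on the zero-filled rating matrix (function and variable names are mine, not the authors'):

```python
import numpy as np
from scipy.sparse import csr_matrix
from scipy.sparse.linalg import svds

def pure_svd_scores(ratings, n_users, n_items, n_factors=50):
    """ratings: list of (user, item, rating); missing entries are treated as zeros."""
    rows, cols, vals = zip(*ratings)
    R = csr_matrix((vals, (rows, cols)), shape=(n_users, n_items))
    U, sigma, Qt = svds(R.astype(np.float64), k=n_factors)  # R ~= U * diag(sigma) * Qt
    Q = Qt.T                                                 # item factors, one row per item
    # score of item i for user u: r_u . Q . q_i^T  (r_u = u-th row of R)
    return (R @ Q) @ Q.T                                     # dense n_users x n_items score matrix

# Example: top-10 items for user 0 (before filtering out already-rated items)
# scores = pure_svd_scores(ratings, n_users, n_items, n_factors=50)
# top10 = np.argsort(-scores[0])[:10]
```

Note that only the item factors Q are needed at recommendation time; scoring a user amounts to projecting their rating row onto the item-factor subspace.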

Page 12

RMSE Ranking

SVD++ 0.8911

AsySVD 0.9000

CorNgbr 0.9406

MovieAvg 1.053

Note that TopPop, NNCosNgbr, and PureSVD are not applicable for measuring error metrics

Page 13

Result: MovieLens dataset

All items / recall at N=10:

AsySVD is about 0.28, TopPop about 0.29

SVD++ and NNCosNgbr are about 0.44

PureSVD reaches about 0.52 (with 50 factors)

All items / precision:

PureSVD outperforms the others; TopPop and AsySVD perform similarly

The (widely used) CorNgbr underperforms!

Long-tail:

The accuracy of TopPop dramatically falls

PureSVD is still the best (with 150 factors)

SVD++ is the best among the RMSE-oriented algorithms

Page 14

Result: Netflix dataset

All items:

TopPop outperforms CorNgbr

AsySVD and SVD++ perform slightly better than TopPop (note that these algorithms were possibly better tuned for the Netflix data)

NNCosNgbr works well

PureSVD is still the best

Long-tail:

CorNgbr significantly underperforms on the head, but it performs well on long-tail data (this probably explains why CorNgbr has been so widely used)

Page 15

PureSVD??

Poor design in terms of rating estimation

The authors did not expect the result

PureSVD

Easy to code & good computational performance in both the offline and online phases

When moving to longer-tail items, accuracy improves when raising the dimensionality of the PureSVD model (from 50 to 150 factors)

– This could mean that the first latent factors capture properties of popular items, while additional factors capture properties of long-tail items

Page 16

Conclusions

Error metrics have been more popular

Mathematical convenience

Formal optimization

However, it is well recognized that accuracy measures may be more natural

In summary,

(1) There is no monotonic (trivial) relation between error metrics and accuracy metrics

(2) Test cases should be carefully selected, as the experimental results show (long-tail vs. head); watch out for possible pitfalls!

(3) New variants of existing algorithms improve top-N performance

Page 17

Q&A

Thank you
