
Page 1

Performance of Recommender Algorithms on Top-N Recommendation Tasks

RecSys 2010

Intelligent Database Systems Lab.

School of Computer Science & Engineering

Seoul National University

Center for E-Business Technology, Seoul National University, Seoul, Korea

Presented by Sangkeun Lee, 1/14/2011

Paolo Cremonesi, Yehuda Koren, Roberto Turrin (Politecnico di Milano; Yahoo! Research, Haifa, Israel; Neptuny, Milan, Italy)

Page 2

Introduction

Comparison of recommender systems

Typically by evaluating error metrics such as RMSE (root mean squared error)

The average error between estimated ratings and actual ratings

Why is the majority of the literature focused on error metrics?

Logical & convenient

However, many commercial systems perform top-N recommendation tasks

The system suggests a few specific items that are likely to be very appealing to the user

Page 3

Introduction: Top-N Performance

Classical error measures (e.g., RMSE, MAE) do not really measure top-N performance

Measures for top-N performance

Accuracy metrics

– Recall and Precision

In this paper,

The authors present an extensive evaluation of several state-of-the-art recommender algorithms and naïve non-personalized algorithms

And they draw some insights from the experimental results

On the Netflix & MovieLens datasets

Page 4

Testing Methodology: Dataset

For each dataset, the known ratings are split into two subsets:

Training set M and test set T

The test set T contains only 5-star ratings

– So, we can reasonably state that T contains items relevant to the respective users

For the Netflix dataset,

Training set = the ~100M ratings of the Netflix Prize training dataset

Test set = the 5-star ratings from the Netflix Prize probe dataset (|T| = 384,573)

For the MovieLens dataset,

Randomly sub-sampled 1.4% of the ratings to create the test set

Page 5

Testing Methodology: measuring precision and recall

1) Train the model over the ratings in M

2) For each item i rated 5 stars by user u in T

Randomly select 1000 additional items unrated by user u

Predict the ratings for the test item i and for the additional 1000 items

Form a ranked list by ordering all 1001 items according to the predicted ratings. Let p denote the rank of the test item i within this list. (The best result: p = 1)

Form a top-N recommendation list by picking the N top-ranked items from the list. If p <= N we have a hit; otherwise we have a miss. (See the sketch below.)
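A minimal sketch of this protocol, assuming a hypothetical predict(u, i) scoring function and simple in-memory data structures (not the authors' code):

```python
import random

def evaluate_top_n(test_cases, all_items, rated_by, predict, n=10, n_random=1000, seed=0):
    """test_cases: (user, item) pairs rated 5 stars in T.
    rated_by[u]: set of items user u has rated; predict(u, i): model score."""
    rng = random.Random(seed)
    hits = 0
    for u, i in test_cases:
        # 1000 random items the user has not rated, presumed irrelevant
        candidates = rng.sample([j for j in all_items if j not in rated_by[u]], n_random)
        # rank the test item against the random items by predicted score
        scores = [(predict(u, j), j) for j in candidates] + [(predict(u, i), i)]
        scores.sort(key=lambda t: -t[0])
        rank = [j for _, j in scores].index(i) + 1   # p, best case p = 1
        hits += 1 if rank <= n else 0                # hit if the test item enters the top-N
    recall = hits / len(test_cases)
    precision = recall / n
    return recall, precision
```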

Page 6

Testing Methodology: measuring precision and recall

For any single test case,

Recall for a single test case can assume either 0 (miss) or 1 (hit)

Precision for a single test case can assume either 0 (miss) or 1/N (hit)

The overall recall and precision are defined by averaging over all test cases (see the formulas below)
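Averaging hits over the |T| test cases gives the usual closed forms (a restatement of the bullets above):

$$\mathrm{recall}(N) = \frac{\#\text{hits}}{|T|}, \qquad \mathrm{precision}(N) = \frac{\#\text{hits}}{N \cdot |T|} = \frac{\mathrm{recall}(N)}{N}$$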

Page 7

Rating distribution: popular items vs. long-tail

About 33% of the ratings collected by Netflix involve only the 1.7% most popular items

To evaluate the accuracy of recommender algorithms in suggesting non-trivial items, T has been partitioned into T_head and T_long
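A minimal sketch of such a split (the head_share threshold, names, and data layout are illustrative assumptions, not the paper's exact procedure):

```python
from collections import Counter

def split_head_long(train_ratings, test_cases, head_share=0.33):
    """Partition test cases into T_head / T_long.
    train_ratings: (user, item) pairs; test_cases: (user, item) pairs from T."""
    counts = Counter(item for _, item in train_ratings)
    total = sum(counts.values())
    head, covered = set(), 0
    for item, c in counts.most_common():      # most popular items first
        if covered >= head_share * total:     # stop once ~33% of ratings are covered
            break
        head.add(item)
        covered += c
    t_head = [(u, i) for u, i in test_cases if i in head]
    t_long = [(u, i) for u, i in test_cases if i not in head]
    return t_head, t_long
```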

Page 8

Algorithms

Non-personalized models

Movie Rating Average (MovieAvg) – ranks items by their average rating

Top Popular (TopPop) – ranks items by their number of ratings; not applicable for measuring error metrics

Collaborative Filtering models

Neighborhood models

– The most common approaches

– Based on similarity among either users or items

Latent factor models

– Find hidden factors

– Model users and items in the same latent factor space

– Predict ratings using proximity (e.g., the inner product)

Page 9

Neighborhood Models

Correlation Neighborhood (CorNgbr)

The predicted rating takes the neighborhood form

$$\hat{r}_{ui} = b_{ui} + \frac{\sum_{j \in D^k(u;i)} d_{ij}\,(r_{uj} - b_{uj})}{\sum_{j \in D^k(u;i)} d_{ij}}, \qquad d_{ij} = \frac{n_{ij}}{n_{ij} + \lambda_1}\, s_{ij}$$

$b_{ui}$ denotes the rating bias of user u for item i (e.g., average ratings)

$D^k(u;i)$ denotes the set of k most similar items rated by u

$d_{ij}$ represents the shrunk similarity

$n_{ij}$ is the number of common raters

$s_{ij}$ is the similarity between items (cosine similarity)

Non-normalized Cosine Neighborhood (NNCosNgbr)

Higher ranking for items with many similar neighbors

The score is no longer an estimated rating, but it can still be used for top-N recommendation tasks
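For reference, a LaTeX rendering of the non-normalized variant, which, as described in the paper, simply drops the normalizing denominator of the CorNgbr formula above (with cosine similarities):

$$\hat{r}_{ui} = b_{ui} + \sum_{j \in D^k(u;i)} d_{ij}\,(r_{uj} - b_{uj})$$

Because the sum is not normalized, items with many strong neighbors receive larger scores, which matches the bullet above.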

Page 10

Latent Factor Models

The key idea is to factorize the user-item rating matrix into two lower-rank matrices

One matrix containing user-factors

One matrix containing item-factors

Rating estimation is computed from the proximity (e.g., the inner product) of the user and item factor vectors

SVD is undefined in the presence of unknown values

Replace unknown ratings with baseline estimations

Learn factor vectors through a suitable objective function which minimizes the prediction error

And so on (further details are out of scope here)

Two state-of-the-art algorithms

Asymmetric-SVD (AsySVD)

SVD++ (high accuracy in terms of RMSE)

Page 11

Latent Factor Models: PureSVD

Now,

We are interested only in a correct item ranking

We don’t need exact rating prediction

PureSVD

Considering all missing values in the user rating matrix as zeros

Let us define the factorization of the zero-filled user rating matrix R via a conventional (truncated) SVD:

$$\hat{R} = U \cdot \Sigma \cdot Q^{T}$$

$P = U \cdot \Sigma$: the u-th row of P represents the user factor vector $p_u$

$Q$: the i-th row of Q represents the item factor vector $q_i$

The recommendation score is $\hat{r}_{ui} = \mathbf{r}_u \cdot Q \cdot \mathbf{q}_i^{T}$, where $\mathbf{r}_u$ is the u-th row of the user rating matrix

The score is no longer an estimated rating, but it can still be used for top-N recommendation tasks
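A minimal PureSVD sketch following the description above, using SciPy's truncated SVD on the zero-filled rating matrix (function and variable names are mine, not the authors'):

```python
import numpy as np
from scipy.sparse import csr_matrix
from scipy.sparse.linalg import svds

def pure_svd_scores(ratings, n_users, n_items, n_factors=50):
    """ratings: list of (user, item, rating); missing entries are treated as zeros."""
    rows, cols, vals = zip(*ratings)
    R = csr_matrix((vals, (rows, cols)), shape=(n_users, n_items))
    U, sigma, Qt = svds(R.astype(np.float64), k=n_factors)  # R ~= U * diag(sigma) * Qt
    Q = Qt.T                                                 # item factors, one row per item
    # score of item i for user u: r_u . Q . q_i^T  (r_u = u-th row of R)
    return (R @ Q) @ Q.T                                     # dense n_users x n_items score matrix

# Example: top-10 items for user 0 (before filtering out already-rated items)
# scores = pure_svd_scores(ratings, n_users, n_items, n_factors=50)
# top10 = np.argsort(-scores[0])[:10]
```

Note that only the item factors Q are needed at recommendation time; scoring a user amounts to projecting their rating row onto the item-factor subspace.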

Page 12

RMSE Ranking

SVD++ 0.8911

AsySVD 0.9000

CorNgbr 0.9406

MovieAvg 1.053

Note that TopPop, NNCosNgbr, and PureSVD are not applicable for measuring error metrics

Page 13

Result: MovieLens dataset

All items / recall at N=10:

AsySVD is about 0.28, TopPop about 0.29

SVD++ and NNCosNgbr are about 0.44

PureSVD reaches about 0.52 (with 50 factors)

All items / precision:

PureSVD outperforms the others; TopPop and AsySVD perform similarly

The (widely used) CorNgbr underperforms!

Long-tail:

The accuracy of TopPop dramatically falls

PureSVD is still the best (with 150 factors)

SVD++ is the best among the RMSE-oriented algorithms

Page 14

Result: Netflix dataset

All items:

TopPop outperforms CorNgbr

AsySVD and SVD++ perform slightly better than TopPop (note that these algorithms were possibly better tuned for the Netflix data)

NNCosNgbr works well

PureSVD is still the best

Long-tail:

CorNgbr significantly underperforms on the head, but it performs well on long-tail data (this probably explains why CorNgbr has been so widely used)

Page 15

PureSVD??

Poor design in terms of rating estimation

The authors did not expect the result

PureSVD

Easy to code & good computational performance in both the offline and online phases

When moving to longer-tail items, accuracy improves when raising the dimensionality of the PureSVD model (from 50 to 150 factors)

– This could mean that the first latent factors capture properties of popular items, while additional factors capture properties of long-tail items

Page 16

Conclusions

Error metrics have been more popular

Mathematical convenience

Formal optimization

However, it is well recognized that accuracy measures may be more natural

In summary,

(1) There is no monotonic (trivial) relation between error metrics and accuracy metrics

(2) Test cases should be carefully selected, as the experimental results show (long-tail vs. head); watch out for possible pitfalls!

(3) New variants of existing algorithms improve top-N performance

Page 17

Q&A

Thank you
