Training and Testing of Recommender Systems on Data Missing Not at Random
Harald Steck, KDD, July 2010
Bell Labs, Murray Hill


Page 1

Training and Testing of Recommender Systems on Data Missing Not at Random

Harald Steck at KDD, July 2010

Bell Labs, Murray Hill

Page 2

Overview

Real-World Problem:

Make personalized recommendations to users that they find “relevant”:

1. from all items (in store)

2. pick a few for each user

3. with the goal: each user finds recommended items “relevant”.

e.g., “relevant” = 5-star rating in the Netflix data

Define Data Mining Goal (how to test):

- off-line test with historical rating data

- high accuracy

- RMSE on observed ratings (popular)

- nDCG on observed ratings [Weimer et al. ‘08]

Find (approximate) solution to Goal defined above:

- choose model(s)

- appropriate training-objective function

- efficient optimization method

(The “approx.” arrows in the original diagram indicate that each step only approximates the one above.)

Page 3

Overview

(This slide repeats the Overview above, adding the annotation “this talk”.)

Page 4

Data

[Figure: the (unknown) complete rating matrix; rows are users u, columns are items i.]

Page 5

Data

[Figure: the (unknown) complete rating matrix with the observed ratings highlighted (e.g., about 1% of entries in the Netflix data); rows are users u, columns are items i.]

Page 6

Data

[Figure: the (unknown) complete rating matrix, the observed ratings (e.g., 1% in the Netflix data), and the missing-data mechanism that determines which ratings are observed.]

- The (general) missing-data mechanism cannot be ignored [Rubin ’76; Marlin et al. ’09,’08,’07].

- Missing at random [Rubin ’76; Marlin et al. ’09,’08,’07]:

- the rating value has no effect on the probability that it is missing

- correct results are obtained even when the missing ratings are ignored.

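To make the MAR condition concrete, here is one standard formalization in notation of my own (R_ui and O_ui do not appear on the slides): let R_ui denote user u's rating of item i, and let O_ui ∈ {0,1} indicate whether that rating is observed. The slide's simplified MAR condition then reads:

```latex
% Simplified MAR condition (my notation, not from the slides):
% the probability that a rating is observed does not depend on its value.
P(O_{ui} = 1 \mid R_{ui} = r) = P(O_{ui} = 1) \qquad \text{for all rating values } r.
```

Under MNAR this equality fails; e.g., if users tend to skip rating items they dislike, low rating values are less likely to be observed.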

Page 7

Ratings are missing not at random (MNAR): Empirical Evidence

Graphs from [Marlin & Zemel ‘09]:

Survey: ask users to rate a random list of items; this approximates complete data.

Typical data: users are free to choose which items to rate -> the available data are MNAR: instead of giving low ratings, users tend not to give a rating at all.

Page 8

Overview

Real-World Problem: (as on the previous Overview slide)

Define Data Mining Goal (how to test):

- off-line test with historical rating data

- high accuracy

- RMSE, nDCG,… on observed ratings

- Top-k Hit-Rate,… on all items

Find (approximate) solution to Goal defined above:

- choose model(s)

- appropriate training-objective function

- efficient optimization method

(annotation on the slide: “this talk”)

Page 9

Test Performance Measures on MNAR Data

- many popular performance measures cannot readily deal with missing ratings

- only a few from among all items can be recommended

- Top-k Hit Rate w.r.t. all items:


Page 10

Test Performance Measures on MNAR Data

- most popular performance measures cannot readily deal with missing ratings

- only a few from among all items can be recommended

- Top-k Hit Rate w.r.t. all items:

- when comparing different recommender systems on fixed data and fixed k: recall ∝ precision

- under a mild assumption: recall on MNAR data is an unbiased estimate of recall on the (unknown) complete data.

Assumption: The relevant ratings are missing at random.

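As a concrete reading of the proportionality claim above (my notation, not from the slides): for user u, let N^+_u be the number of relevant items and h_u(k) the number of relevant items among the top k of u's ranked list of all items. Then

```latex
\mathrm{recall}_u(k) = \frac{h_u(k)}{N^+_u},
\qquad
\mathrm{precision}_u(k) = \frac{h_u(k)}{k}
                        = \mathrm{recall}_u(k) \cdot \frac{N^+_u}{k},
```

so on a fixed data set (fixed N^+_u) and for fixed k, precision is proportional to recall, and comparisons between recommender systems come out the same under either measure.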

Page 11

Test Performance Measures on MNAR Data

Top-k Hit-Rate:

- depends on k

- ignores ranking

[Plot: TOPK as a function of k; the k-axis is normalized w.r.t. the number of items.]
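Concretely (reusing the notation introduced above, which is mine, not the slides'): TOPK(k) can be read as recall at cutoff k, averaged over users, with k expressed as a fraction of the number of items:

```latex
\mathrm{TOPK}(k) = \frac{1}{|\mathcal{U}|} \sum_{u \in \mathcal{U}} \frac{h_u(k)}{N^+_u},
```

which is why it depends on the choice of k and is insensitive to how items are ranked within the top k.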

Page 12

Test Performance Measures on MNAR Data

Top-k Hit-Rate:

- depends on k

- ignores ranking

Area under TOPK curve (ATOP):

- independent of k

- in [0,1], larger is better

- captures ranking of all items

- agrees with area under ROC curve in leading order if # relevant items << # items

- unbiased estimate from MNAR data for the (unknown) complete data, under the assumption above

[Plot: TOPK curve over k (normalized w.r.t. the number of items); ATOP is the area under this curve.]
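In symbols (my formalization of the slide's description), with k normalized to [0,1]:

```latex
\mathrm{ATOP} = \int_0^1 \mathrm{TOPK}(k) \, \mathrm{d}k .
```

Under this reading, ATOP equals the average, over all relevant items, of the fraction of items ranked below them, which is why it captures the ranking of all items without committing to a particular cutoff k.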

Page 13

Overview

Real-World Problem: (as on the previous Overview slides)

Define Data Mining Goal (how to test):

- off-line test with historical rating data

- high accuracy

- TOPK, ATOP,… on all items

(annotation on the slide: “this talk”)

Find (approximate) solution to Goal defined above:

- choose model(s)

- appropriate training objective function

- efficient optimization

Page 14

Low-rank Matrix Factorization Model

Matrix of predicted ratings: R_hat = r_m + P Q^T

- rating offset: r_m

- rank of the matrices P, Q: the dimension of the low-dimensional latent space, e.g., d_0 = 50

[Figure: the matrix of predicted ratings, with users u as rows and items i as columns.]
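Element-wise, the model above reads as follows (the factor matrices P and Q are from the slide; the dimension labels are mine):

```latex
\hat{R}_{ui} = r_m + \left( P Q^{\top} \right)_{ui}
             = r_m + \sum_{k=1}^{d_0} P_{uk} \, Q_{ik},
\qquad
P \in \mathbb{R}^{\#\text{users} \times d_0}, \;
Q \in \mathbb{R}^{\#\text{items} \times d_0}.
```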

Page 15

Training Objective Function: AllRank

a minimal modification of the usual least-squares problem:

- account for all items per user: observed and missing ratings

- imputed value for missing ratings: r_m

- create balanced training set: weights (1 if observed, w_m if missing)

- (usual) regularization of matrix elements: lambda

Efficient Optimization:

- alternating least squares

- the tuning parameters r_m, w_m, and lambda have to be optimized as well (e.g., w.r.t. ATOP)
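Written out, the bullets above describe a weighted, regularized least-squares objective over all user-item pairs; one way to state it (my notation) is

```latex
\min_{P,Q} \;
\sum_{\text{all } (u,i)} W_{ui} \left( R^{o\&i}_{ui} - \hat{R}_{ui} \right)^2
+ \lambda \left( \lVert P \rVert_F^2 + \lVert Q \rVert_F^2 \right),
```

where R^{o&i}_{ui} is the observed rating when available and the imputed value r_m otherwise, and W_{ui} is 1 for observed entries and w_m for missing ones.

A minimal, naive sketch of the alternating-least-squares loop, assuming small dense NumPy arrays (function and variable names are mine, not the author's; an efficient implementation would exploit the structure of the weight matrix, which this sketch does not):

```python
import numpy as np

def allrank_als(R, observed, d0=50, r_m=2.0, w_m=0.005, lam=0.1, n_iters=10):
    """Hypothetical sketch of AllRank training, not the author's code.

    R        : (n_users, n_items) rating matrix; entries where `observed`
               is False are ignored and replaced by the imputed value r_m.
    observed : boolean mask of observed ratings, same shape as R.
    """
    n_users, n_items = R.shape
    T = np.where(observed, R, r_m) - r_m      # targets as offsets around r_m
    W = np.where(observed, 1.0, w_m)          # weights: 1 observed, w_m missing
    rng = np.random.default_rng(0)
    P = 0.1 * rng.standard_normal((n_users, d0))
    Q = 0.1 * rng.standard_normal((n_items, d0))
    reg = lam * np.eye(d0)
    for _ in range(n_iters):
        for u in range(n_users):              # fix Q: weighted ridge per user
            Wu = W[u][:, None]
            P[u] = np.linalg.solve((Q * Wu).T @ Q + reg, (Q * Wu).T @ T[u])
        for i in range(n_items):              # fix P: weighted ridge per item
            Wi = W[:, i][:, None]
            Q[i] = np.linalg.solve((P * Wi).T @ P + reg, (P * Wi).T @ T[:, i])
    return P, Q                               # predicted ratings: r_m + P @ Q.T
```

The tuning parameters r_m, w_m, and lambda would then be chosen on validation data, e.g., w.r.t. ATOP, as noted above.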

Page 16

Experimental Results on Netflix Data: Imputed Rating Value r_m

- an optimum for the imputed value exists

- optimal r_m ≈ 2

- the optimal r_m may be interpreted as the mean of the missing ratings

- the exact imputation value (below 2) is not critical

- imputed value < observed mean

[Plot: test performance vs. imputed rating value r_m (ratings: 1…5 stars); the observed mean is marked.]

Page 17

Experimental Results on Netflix Data: Weight of Missing Ratings w_m

w_m = 1: standard SVD (plus penalty term), as in Latent Semantic Analysis

w_m ≈ 0.005 is optimal; compare to the fraction of observed ratings = 0.01

w_m = 0: ignores the missing ratings, and is worst w.r.t. ATOP

Page 18

Experimental Results on Netflix Data: Top-k Hit-Rate

Comparison of approaches:

- AllRank (RMSE = 1.106)

- ignore missing ratings (RMSE = 0.921)

[Plot: Top-k Hit-Rate over k for the two approaches.]

Page 19

Experimental Results on Netflix Data: Top-k Hit-Rate

(Same comparison as on the previous slide, zoomed into the top 2% of items.)

Page 20

Experimental Results on Netflix Data: Top-k Hit-Rate

Comparison of approaches:

- AllRank (RMSE = 1.106)

- ignore missing ratings (RMSE = 0.921)

- integrated model [Koren ’08] (RMSE = 0.887), trained to minimize RMSE

AllRank achieves a 39% to 50% larger Top-k Hit-Rate than the integrated model, despite its worse RMSE.

[Plot: Top-k Hit-Rate, zoomed into the top 2% of items.]

Page 21

Experimental Results on Netflix Data: Top-k Hit-Rate

(Same comparison as on the previous slide.)

Large increase in Top-k Hit-Rate when missing ratings are also accounted for during training on MNAR data.

Page 22

Related Work

explicit feedback data (ratings):

- improved RMSE on observed data also increases Top-k Hit-Rate on all items [Koren ’08]

- ratings are missing not at random:

- improved models: conditional RBM, NSVD1/2, SVD++ [Salakhutdinov ’07; Paterek ’07; Koren ‘08]

- test on “complete” data, train multinomial mixture model on MNAR data [Marlin et al. ’07,’09]

implicit feedback data (clickstream data, TV consumption, tags, bookmarks, purchases, …):

- [Hu et al. ’07; Pan et al. ’07]:

- binary data where only positives are observed -> missing entries are assumed negative

- trained matrix-factorization model with weighted least-squares objective function

- claimed difference from explicit feedback data: the latter provides both positive and negative observations

Page 23

Conclusions and Future Work

- considered explicit feedback data missing not at random (MNAR)

- test performance measures ((Area under) Top-k Hit-Rate, ...):

- close to the real-world problem

- unbiased on MNAR data (under a mild assumption)

- efficient surrogate objective function for training:

- AllRank: accounting for missing ratings leads to large improvements in Top-k Hit-Rate

Future Work:

- better test performance measures, training objective functions and models

- results obtained w.r.t. RMSE need not hold w.r.t. Top-k Hit-Rate on MNAR data, e.g., for collaborative filtering vs. content-based methods

Page 24

www.alcatel-lucent.com