The BellKor 2008 Solution to the Netflix Prize

by Leenarat Leelapanyalert

Netflix Dataset

• Over 100 million date-stamped movie ratings (100,480,507 ratings)
• M = 17,770 movies
• N = 480,189 customers
• 1 star = no interest, 5 stars = strong interest
• Ratings collected from Dec 31, 1999 to Dec 31, 2005

The user-item matrix has N × M = 8,532,958,530 elements

About 98.8% of the values are missing
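The size and sparsity figures on this slide can be checked with a few lines of arithmetic. The sketch below uses only the counts quoted above; the variable names are illustrative. With these counts the missing share comes out at roughly 98.8%.

```python
# Counts quoted on the dataset slide.
n_users = 480_189
n_movies = 17_770
n_ratings = 100_480_507

# Every (user, movie) pair is one cell of the user-item matrix.
n_cells = n_users * n_movies

# Share of cells that hold no rating.
missing_share = 1 - n_ratings / n_cells

print(f"cells in the matrix: {n_cells:,}")
print(f"missing share: {missing_share:.1%}")
```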

Netflix Competition

• About 4.2 million of the 100 million ratings are set aside for evaluation
– Training set (includes the Probe set)
– Qualifying set (Quiz set & Test set)

• Scoring
– The RMSE achieved on the Quiz set is shown to the competitors
– The best RMSE on the Test set → THE WINNER!!
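Since every result in this talk is reported as an RMSE, a minimal sketch of the metric may help. The function name and the list-based interface are assumptions made for illustration, not part of the competition tooling.

```python
import math


def rmse(predicted, actual):
    """Root mean squared error between two equally long rating lists."""
    if len(predicted) != len(actual) or not actual:
        raise ValueError("need two non-empty lists of the same length")
    squared_error = sum((p - a) ** 2 for p, a in zip(predicted, actual))
    return math.sqrt(squared_error / len(actual))


# Two predictions that are each off by two stars give an RMSE of 2.
print(rmse([1, 5], [3, 3]))
```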

Outline

• Necessary index letters
• Baseline predictors → adjust for the deviations of each user (rater, customer) and item (movie)
– with temporal effects
• Latent factor models → compare items and users by SVD
– with temporal effects
• Neighborhood models → compute the relationship between items (or users)
– with temporal effects
• Integrated models → combine latent factor models and neighborhood models
• Extra: other methods → new ideas
– Shrinking towards recent actions
– Blending multiple solutions

Index Letters

• u, v → users, raters, or customers
• i, j → movies, or items
• rui → the rating given by user u to movie i
• r̂ui → the predicted value of rui
• tui → the time of rating rui
• K → the training set, i.e. the (u, i) pairs for which rui is known
• R(u) → all the items rated by user u
• R(i) → the set of users who rated item i
• N(u) → all the items for which u provided a rating, even if the rating value is unknown (implicit feedback)

Baseline Predictors (bui)

µ → the overall average rating
bu → deviation of user u
bi → deviation of item i

$$ b_{ui} = \mu + b_u + b_i $$

Example: µ = 3.7, Simha (bu) = −0.3, Titanic (bi) = 0.5, so bui = 3.7 − 0.3 + 0.5 = 3.9 stars

Estimate Parameter (bu, bi) – Formula

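$$ b_{ui} = \mu + b_u + b_i $$

$$ b_i = \frac{\sum_{u \in R(i)} (r_{ui} - \mu)}{\lambda_1 + |R(i)|} $$

$$ b_u = \frac{\sum_{i \in R(u)} (r_{ui} - \mu - b_i)}{\lambda_2 + |R(u)|} $$

The regularization parameters (𝜆1, 𝜆2) are determined by validation on the Probe set. In this case: 𝜆1 = 25, 𝜆2 = 10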
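The two shrunk averages on this slide can be sketched in a few lines. This is a minimal illustration, assuming the ratings arrive as `(user, item, rating)` tuples; the function name and the data layout are hypothetical. Item deviations are computed first because the user deviations subtract them.

```python
from collections import defaultdict


def estimate_biases(ratings, lam1=25.0, lam2=10.0):
    """Return (mu, b_u, b_i) from a list of (user, item, rating) tuples."""
    mu = sum(r for _, _, r in ratings) / len(ratings)

    # b_i: shrunk average of (r_ui - mu) over the users who rated item i.
    item_sum, item_cnt = defaultdict(float), defaultdict(int)
    for _, i, r in ratings:
        item_sum[i] += r - mu
        item_cnt[i] += 1
    b_i = {i: item_sum[i] / (lam1 + item_cnt[i]) for i in item_sum}

    # b_u: shrunk average of (r_ui - mu - b_i) over the items rated by user u.
    user_sum, user_cnt = defaultdict(float), defaultdict(int)
    for u, i, r in ratings:
        user_sum[u] += r - mu - b_i[i]
        user_cnt[u] += 1
    b_u = {u: user_sum[u] / (lam2 + user_cnt[u]) for u in user_sum}
    return mu, b_u, b_i


toy = [("a", "x", 5), ("a", "y", 3), ("b", "x", 4)]
mu, b_u, b_i = estimate_biases(toy)
print(mu, b_u, b_i)
```

With so few ratings the deviations stay close to zero, which is the intended effect of 𝜆1 and 𝜆2: parameters supported by little data are pulled towards the global mean.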

Estimate Parameter (bu, bi) – The Least Squares Problem

$$ b_{ui} = \mu + b_u + b_i $$

$$ \min_{b_*} \sum_{(u,i) \in K} (r_{ui} - \mu - b_u - b_i)^2 + \lambda_1 \Big( \sum_u b_u^2 + \sum_i b_i^2 \Big) $$

• The first term fits the given ratings
• The second term avoids overfitting by penalizing the magnitudes of the parameters
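One common way to approach a regularized least squares problem of this kind is stochastic gradient descent. The sketch below is an illustration under stated assumptions, not the solver used by the authors: the learning rate, the regularization weight `lam` (applied once per rating, which is a simplification of the objective) and the epoch count are hypothetical values.

```python
def fit_biases_sgd(ratings, lam=0.02, lr=0.01, epochs=200):
    """Fit mu + b_u + b_i to (user, item, rating) tuples by SGD."""
    mu = sum(r for _, _, r in ratings) / len(ratings)
    b_u = {u: 0.0 for u, _, _ in ratings}
    b_i = {i: 0.0 for _, i, _ in ratings}
    for _ in range(epochs):
        for u, i, r in ratings:
            err = r - (mu + b_u[u] + b_i[i])
            # Move each parameter along the error, shrinking it towards zero.
            b_u[u] += lr * (err - lam * b_u[u])
            b_i[i] += lr * (err - lam * b_i[i])
    return mu, b_u, b_i


toy = [("a", "x", 5), ("a", "y", 5), ("b", "x", 1), ("b", "y", 1)]
mu, b_u, b_i = fit_biases_sgd(toy)
print(round(mu + b_u["a"] + b_i["x"], 2))
```

In this toy data user a always rates high and user b always rates low, so the fit puts almost all of the deviation into the user terms and leaves the item terms near zero.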

Time Change VS Baseline Predictors

• An item's popularity may change over time
• Users change their baseline ratings over time

$$ b_{ui} = \mu + b_u + b_i \quad\longrightarrow\quad b_{ui} = \mu + b_u(t_{ui}) + b_i(t_{ui}) $$

bi(tui)

• We do not expect movie likeability to fluctuate on a daily basis

• Time periods → Bins
• 30 bins

$$ b_i(t) = b_i + b_{i,\mathrm{Bin}(t)} $$

$$ b_{ui} = \mu + b_u(t_{ui}) + b_i(t_{ui}) $$
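A minimal sketch of the Bin(t) lookup, assuming the 30 bins are of equal width and span the date range of the dataset. The equal-width choice, the function name and the default dates are assumptions made for illustration.

```python
from datetime import date


def time_bin(day, first=date(1999, 12, 31), last=date(2005, 12, 31), n_bins=30):
    """Map a rating date to a bin index in 0 .. n_bins - 1."""
    span = (last - first).days + 1      # number of days covered, inclusive
    offset = (day - first).days
    if not 0 <= offset < span:
        raise ValueError("date outside the covered range")
    return offset * n_bins // span


print(time_bin(date(1999, 12, 31)), time_bin(date(2005, 12, 31)))
```

The item bias for a rating made on `day` would then be looked up as the stable part plus the offset stored for `time_bin(day)`.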

bu(tui)

• Unlike movies, user effects can change on a daily basis

• Time deviation

$$ \mathrm{dev}_u(t) = \mathrm{sign}(t - t_u) \cdot |t - t_u|^{\beta} $$

tu → the mean date of rating by user u
t → the date on which user u rated the movie
β = 0.4, chosen by validation on the Probe set

bu(tui)

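• Well suited to gradual drifts

$$ \mathrm{dev}_u(t) = \mathrm{sign}(t - t_u) \cdot |t - t_u|^{\beta} $$

$$ b_u^{(1)}(t) = b_u + \alpha_u \cdot \mathrm{dev}_u(t) $$

$$ b_{ui} = \mu + b_u(t_{ui}) + b_i(t_{ui}) $$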
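The time deviation and the linear drift of the user bias can be sketched as follows. Dates are assumed to be plain day numbers, and the function names are illustrative.

```python
import math


def dev(t, t_mean, beta=0.4):
    """Signed, dampened distance between day t and the user's mean rating day."""
    diff = t - t_mean
    return math.copysign(abs(diff) ** beta, diff)


def user_bias_linear(t, t_mean, b_u, alpha_u, beta=0.4):
    """Stable user bias plus a gradual drift proportional to dev(t)."""
    return b_u + alpha_u * dev(t, t_mean, beta)


# 32 days after the mean date, 32 ** 0.4 equals 4.
print(dev(32, 0), dev(-32, 0))
print(user_bias_linear(32, 0, b_u=0.1, alpha_u=0.05))
```

The exponent below 1 dampens large distances, so a rating made years from the user's mean date does not dominate the drift term.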

bu(tui)

• How about sudden drifts?
– We found that multiple ratings that a user gives in a single day tend to concentrate around a single value
• A user rates on 40 different days on average
• Thus, bu,t requires about 40 parameters per user

$$ b_u^{(3)}(t) = b_u + \alpha_u \cdot \mathrm{dev}_u(t) + b_{u,t} $$

Baseline Predictors

$$ b_{ui} = \mu + \underbrace{b_u + \alpha_u \cdot \mathrm{dev}_u(t_{ui}) + b_{u,t_{ui}}}_{\text{Bu (user bias)}} + \underbrace{b_i + b_{i,\mathrm{Bin}(t_{ui})}}_{\text{Bi (movie bias)}} $$

Baseline Predictors

• Movie bias is not completely user-independent

cu(t) → time-dependent scaling feature
cu → stable part
cut → day-specific variable

$$ c_u(t) = c_u + c_{u,t} $$

$$ b_{ui} = \mu + \underbrace{b_u + \alpha_u \cdot \mathrm{dev}_u(t_{ui}) + b_{u,t_{ui}}}_{\text{Bu (user bias)}} + \underbrace{\big( b_i + b_{i,\mathrm{Bin}(t_{ui})} \big)}_{\text{Bi (movie bias)}} \cdot c_u(t_{ui}) $$

RMSE = 0.9555
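Putting the pieces of this slide together, a single baseline prediction could be assembled as below. This is only a sketch of the formula: all arguments are already-learned numbers passed in by the caller, and the parameter names are hypothetical.

```python
def baseline(mu, b_u, alpha_u, dev_ut, b_ut, b_i, b_i_bin, c_u=1.0, c_ut=0.0):
    """Time-aware baseline: mu + user bias + scaled movie bias."""
    user_bias = b_u + alpha_u * dev_ut + b_ut      # stable + drift + day-specific
    movie_bias = (b_i + b_i_bin) * (c_u + c_ut)    # binned item bias, user-scaled
    return mu + user_bias + movie_bias


# With every temporal term switched off this reduces to mu + b_u + b_i.
print(baseline(3.7, -0.3, 0.0, 0.0, 0.0, 0.5, 0.0))
print(baseline(3.7, -0.3, 0.01, 4.0, 0.1, 0.5, 0.1, c_u=1.0, c_ut=0.5))
```

The defaults `c_u=1.0` and `c_ut=0.0` are chosen here so that an unscaled movie bias is the starting point; they are an assumption of the sketch.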

Frequencies (additional)

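• The number of ratings a user gave on a specific day is a SIGNIFICANT signal

Fui → the overall number of ratings that user u gave on day tui
bif → the bias specific to item i at log-frequency f

$$ f_{ui} = \lfloor \log_a F_{ui} \rfloor $$

$$ b_{ui} = \mu + b_u + \alpha_u \cdot \mathrm{dev}_u(t_{ui}) + b_{u,t_{ui}} + \big( b_i + b_{i,\mathrm{Bin}(t_{ui})} \big) \cdot c_u(t_{ui}) + b_{i,f_{ui}} $$

RMSE: 0.9555 → 0.9278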

Why Frequencies Work?

• Frequencies work badly when combined with user-movie interaction terms
• They add nothing when combined with user-related parameters
• Rating a lot in bulk → the ratings are not made close to the actual watching day
– Positive approach
– Negative approach

• High frequencies (or bulk ratings) do not represent much change in people’s taste, but mostly biased selection of movies

Predicting Future Days

• The day-specific parameters should be set to their default values
– cu(tui) = cu
– bu,t = 0

• The transient temporal model doesn’t attempt to capture future changes.

Latent Factor Models

• Transform both items and users to the same latent factor space
– Obvious dimensions
   • Comedy vs. drama
   • Amount of action
   • Orientation to children
– Less well-defined dimensions
   • Depth of character development

• Tool → SVD

Singular Value Decomposition (SVD)

• Factoring matrices into a series of linear approximations that expose the underlying structure of the matrix


Predicted Score = User Baseline Rating * Movie Average Score

         A    B    C
Simha    4    4    4
Ateeq    5    5    5
Smith    3    3    3
Greg     4    4    4
Mcq      4    4    4
Ramin    4    4    4
Xiao     4    4    4
Wu       3    3    3
Riz      5    5    5

= column (4, 5, 3, 4, 4, 4, 4, 3, 5) * row (1, 1, 1)


         A    B    C
Simha    4    4    5
Ateeq    4    5    5
Smith    3    3    2
Greg     4    5    4
Mcq      4    4    4
Ramin    3    5    4
Xiao     4    4    3
Wu       2    4    4
Riz      5    5    5



         A       B       C
Simha    3.95    4.64    4.34
Ateeq    4.27    5.02    4.69
Smith    2.42    2.85    2.66
Greg     3.97    4.67    4.36
Mcq      3.64    4.28    4.00
Ramin    3.69    4.33    4.05
Xiao     3.33    3.92    3.66
Wu       3.08    3.63    3.39
Riz      4.55    5.35    5.00

= column (4.34, 4.69, 2.66, 4.36, 4.00, 4.05, 3.66, 3.39, 5.00) * row (0.91, 1.07, 1.00)


Residual: the difference between the actual ratings and the rank-1 approximation

         A        B        C
Simha    0.05    -0.64     0.66
Ateeq   -0.28    -0.02     0.31
Smith    0.58     0.15    -0.66
Greg     0.03     0.33    -0.36
Mcq      0.36    -0.28     0.00
Ramin   -0.69     0.67    -0.05
Xiao     0.67     0.08    -0.66
Wu      -1.08     0.37     0.61
Riz      0.45    -0.35     0.00

Second factor pair, fitted to the residual:
column (-0.18, -0.38, 0.80, 0.15, 0.35, -0.67, 0.89, -1.29, 0.44) * row (0.82, -0.20, -0.53)


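         A    B    C
Simha    4    4    5
Ateeq    4    5    5
Smith    3    3    2
Greg     4    5    4
Mcq      4    4    4
Ramin    3    5    4
Xiao     4    4    3
Wu       2    4    4
Riz      5    5    5

= user factors * movie factors

User factors (one row per user, one column per factor):
Simha    4.34    -0.18    -0.90
Ateeq    4.69    -0.38    -0.15
Smith    2.66     0.80     0.40
Greg     4.36     0.15     0.47
Mcq      4.00     0.35    -0.29
Ramin    4.05    -0.67     0.68
Xiao     3.66     0.89     0.33
Wu       3.39    -1.29     0.14
Riz      5.00     0.44    -0.36

Movie factors (one row per factor, columns A, B, C):
 0.91     1.07     1.00
 0.82    -0.20    -0.53
-0.21     0.76    -0.62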
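The first factor pair of the example can be approximated with a plain power iteration, which finds the best rank-1 fit in the least squares sense. This is a sketch under assumptions: the slides do not say which fitting procedure produced their numbers, the function name is hypothetical, and the rescaling that gives the last movie a weight of 1 simply mirrors the scaling seen in the example.

```python
def rank1_factors(matrix, iterations=100):
    """Return (user_column, movie_row) of the best rank-1 approximation."""
    n_rows, n_cols = len(matrix), len(matrix[0])
    v = [1.0] * n_cols
    for _ in range(iterations):
        # u = A v, then v = A^T u, renormalized: power iteration on A^T A.
        u = [sum(row[j] * v[j] for j in range(n_cols)) for row in matrix]
        v = [sum(matrix[i][j] * u[i] for i in range(n_rows)) for j in range(n_cols)]
        norm = sum(x * x for x in v) ** 0.5
        v = [x / norm for x in v]
    # Rescale so the last movie has weight 1 (assumes that weight is non-zero).
    scale = v[-1]
    v = [x / scale for x in v]
    vv = sum(x * x for x in v)
    u = [sum(row[j] * v[j] for j in range(n_cols)) / vv for row in matrix]
    return u, v


ratings = [
    [4, 4, 5], [4, 5, 5], [3, 3, 2], [4, 5, 4], [4, 4, 4],
    [3, 5, 4], [4, 4, 3], [2, 4, 4], [5, 5, 5],
]
users, movies = rank1_factors(ratings)
print([round(x, 2) for x in movies])
print([round(x, 2) for x in users])
```

Subtracting the resulting rank-1 matrix from the ratings and repeating the procedure on the residual would give further factor pairs, which is the step-by-step construction the example walks through.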

Latent Factor Models

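pu → user-factors vector
qi → item-factors vector

$$ \hat{r}_{ui} = b_{ui} + p_u^T q_i $$

• Add implicit feedback
– Asymmetric-SVD

$$ \hat{r}_{ui} = b_{ui} + q_i^T \Big( |R(u)|^{-\frac{1}{2}} \sum_{j \in R(u)} (r_{uj} - b_{uj}) x_j + |N(u)|^{-\frac{1}{2}} \sum_{j \in N(u)} y_j \Big) $$

– SVD++

$$ \hat{r}_{ui} = b_{ui} + q_i^T \Big( p_u + |N(u)|^{-\frac{1}{2}} \sum_{j \in N(u)} y_j \Big) $$

60 factors, RMSE = 0.8966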
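A single SVD++ prediction can be sketched as follows, assuming the factor vectors have already been learned and are passed in as plain lists. The function name and argument layout are hypothetical.

```python
def svdpp_predict(b_ui, q_i, p_u, y_implicit):
    """Baseline plus the dot product of q_i with the enriched user vector.

    y_implicit holds one factor vector y_j for every item j in N(u).
    """
    user_vec = list(p_u)
    if y_implicit:
        norm = len(y_implicit) ** -0.5          # |N(u)| ** (-1/2)
        for y_j in y_implicit:
            for k, value in enumerate(y_j):
                user_vec[k] += norm * value
    return b_ui + sum(q * x for q, x in zip(q_i, user_vec))


# Without implicit feedback this is the plain factor model b_ui + p_u . q_i.
print(svdpp_predict(3.9, [1.0, 0.0], [0.2, 0.5], []))
print(svdpp_predict(3.9, [1.0, 0.0], [0.2, 0.5], [[0.4, 0.0]] * 4))
```

The second call shows how the set of items a user touched shifts the user vector even though no rating values are involved.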

Temporal Effects

• Time
– Movie biases – movies go in and out of popularity over time → bi
– User biases – users change their baseline ratings over time → bu
– User preferences – genre, perception of actors and directors, household → pu

$$ \hat{r}_{ui} = b_{ui}(t) + q_i^T \Big( p_u(t) + |N(u)|^{-\frac{1}{2}} \sum_{j \in N(u)} y_j \Big) $$

Temporal Effects

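• The same way we treat the user bias, we can also treat the user preferences

$$ p_u(t)^T = \big( p_{u1}(t), p_{u2}(t), \dots, p_{uf}(t) \big) $$

$$ b_u^{(1)}(t) = b_u + \alpha_u \cdot \mathrm{dev}_u(t) \quad\longrightarrow\quad p_{uk}^{(1)}(t) = p_{uk} + \alpha_{uk} \cdot \mathrm{dev}_u(t), \quad k = 1, 2, \dots, f $$

$$ b_u^{(3)}(t) = b_u + \alpha_u \cdot \mathrm{dev}_u(t) + b_{u,t} \quad\longrightarrow\quad p_{uk}^{(3)}(t) = p_{uk}^{(1)}(t) + p_{uk,t}, \quad k = 1, 2, \dots, f $$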

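RMSE

f = 500, RMSE = 0.8815

f = 500, RMSE = 0.8841 !!

• Most accurate factor model (add frequencies)

f = 500, RMSE = 0.8784
f = 2000, RMSE = 0.8762

$$ p_{uk}^{(1)}(t) = p_{uk} + \alpha_{uk} \cdot \mathrm{dev}_u(t) $$

$$ p_{uk}^{(3)}(t) = p_{uk}^{(1)}(t) + p_{uk,t} $$

$$ \hat{r}_{ui} = b_{ui}(t) + q_i^T \Big( p_u(t) + |N(u)|^{-\frac{1}{2}} \sum_{j \in N(u)} y_j \Big) $$

$$ \hat{r}_{ui} = b_{ui}(t) + \big( q_i^T + q_{i,f_{ui}}^T \big) \Big( p_u(t) + |N(u)|^{-\frac{1}{2}} \sum_{j \in N(u)} y_j \Big) $$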

Neighborhood Models

• Compute the relationship between items
• Evaluate the score of a user for an item based on the ratings of similar items by the same user

The Similarity Measure

• The Pearson correlation coefficient, ρij, shrunk with λ2 = 100

$$ s_{ij} \overset{\mathrm{def}}{=} \frac{n_{ij}}{n_{ij} + \lambda_2} \, \rho_{ij} $$

sij → similarity
nij → the number of users that rated both i and j

• A weighted average of the ratings of neighborhood items

$$ \hat{r}_{ui} = b_{ui} + \frac{\sum_{j \in S^k(i;u)} s_{ij} (r_{uj} - b_{uj})}{\sum_{j \in S^k(i;u)} s_{ij}} $$

Sk(i;u) → the set of k items rated by u that are most similar to i
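The shrunk similarity and the weighted-average prediction can be sketched like this. It is a minimal illustration: the neighbor list is assumed to be precomputed as `(s_ij, r_uj, b_uj)` triples for the items the user rated, and falling back to the baseline when no usable neighbor exists is a choice made for this sketch.

```python
def shrunk_similarity(rho_ij, n_ij, lam2=100.0):
    """Pearson correlation shrunk towards zero when few users rated both items."""
    return n_ij / (n_ij + lam2) * rho_ij


def knn_predict(b_ui, neighbors, k):
    """neighbors: list of (s_ij, r_uj, b_uj) for the items j rated by user u."""
    top = sorted(neighbors, key=lambda n: n[0], reverse=True)[:k]
    denom = sum(s for s, _, _ in top)
    if denom <= 0:
        return b_ui                      # no usable neighbor: keep the baseline
    return b_ui + sum(s * (r - b) for s, r, b in top) / denom


print(shrunk_similarity(0.8, 100))
print(knn_predict(3.5, [(0.4, 5, 4.0), (0.2, 3, 3.5), (0.1, 1, 3.0)], k=2))
```

In the second call only the two most similar items contribute, so the low rating on the least similar item does not affect the prediction.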

Problem With The Model

$$ \hat{r}_{ui} = b_{ui} + \frac{\sum_{j \in S^k(i;u)} s_{ij} (r_{uj} - b_{uj})}{\sum_{j \in S^k(i;u)} s_{ij}} $$

• Isolates the relations between 2 items
• Fully relies on the neighbors, even if they are absent

$$ \hat{r}_{ui} = b_{ui} + \sum_{j \in R(u)} (r_{uj} - b_{uj}) w_{ij} $$

• The wij's are not user-specific
• Sums over all items rated by u

Improving The Model

$$ \hat{r}_{ui} = b_{ui} + \sum_{j \in R(u)} (r_{uj} - b_{uj}) w_{ij} $$

• The wij's are not user-specific
• Sums over all items rated by u

$$ \hat{r}_{ui} = b_{ui} + \sum_{j \in R(u)} (r_{uj} - b_{uj}) w_{ij} + \sum_{j \in N(u)} c_{ij} $$

• Uses not only what the user rated, but also what he did not rate
• cij is expected to be high if j is predictive of i

Improving The Model

• The current model somewhat overemphasizes the dichotomy between heavy raters and those that rarely rate

• Moderate this behavior by normalization

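The sums are scaled by $|R(u)|^{-\alpha}$ and $|N(u)|^{-\alpha}$:

• 𝛼 = 0 → non-normalized rule – encourages greater deviations
• 𝛼 = 1 → fully normalized rule – eliminates the effect of the number of ratings
• In this case, 𝛼 = 0.5

$$ \hat{r}_{ui} = b_{ui} + |R(u)|^{-\frac{1}{2}} \sum_{j \in R(u)} (r_{uj} - b_{uj}) w_{ij} + |N(u)|^{-\frac{1}{2}} \sum_{j \in N(u)} c_{ij} $$

RMSE = 0.9002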

Improving The Model

RMSE = 0.9002

• Reduce the model by pruning parameters

Sk(i) → the set of k items most similar to i

$$ R^k(i;u) \overset{\mathrm{def}}{=} R(u) \cap S^k(i) $$

$$ N^k(i;u) \overset{\mathrm{def}}{=} N(u) \cap S^k(i) $$

$$ \hat{r}_{ui} = b_{ui} + |R^k(i;u)|^{-\frac{1}{2}} \sum_{j \in R^k(i;u)} (r_{uj} - b_{uj}) w_{ij} + |N^k(i;u)|^{-\frac{1}{2}} \sum_{j \in N^k(i;u)} c_{ij} $$

k = 17,770 → RMSE = 0.8906
k = 2000 → RMSE = 0.9067

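Integrated Models

• Baseline predictors + Factor models + Neighborhood models

$$ \hat{r}_{ui} = \mu + b_i(t) + b_u^{(1)}(t) + q_i^T \Big( p_u^{(1)}(t) + |N(u)|^{-\frac{1}{2}} \sum_{j \in N(u)} y_j \Big) + |R^k(i;u)|^{-\frac{1}{2}} \sum_{j \in R^k(i;u)} (r_{uj} - b_{uj}) w_{ij} + |N^k(i;u)|^{-\frac{1}{2}} \sum_{j \in N^k(i;u)} c_{ij} $$

f = 170, k = 300 → RMSE = 0.8827

• To further improve accuracy, we add a more elaborate temporal model for the user bias

$$ \hat{r}_{ui} = \mu + b_i(t) + b_u^{(3)}(t) + q_i^T \Big( p_u^{(1)}(t) + |N(u)|^{-\frac{1}{2}} \sum_{j \in N(u)} y_j \Big) + |R^k(i;u)|^{-\frac{1}{2}} \sum_{j \in R^k(i;u)} (r_{uj} - b_{uj}) w_{ij} + |N^k(i;u)|^{-\frac{1}{2}} \sum_{j \in N^k(i;u)} c_{ij} $$

f = 170, k = 300 → RMSE = 0.8786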

EXTRA: Shrinking Towards Recent Actions

• To correct r̂ui
• Shrink r̂ui towards the average rating of u on day t
• The single-day effect is among the strongest temporal effects in the data

$$ \hat{r}_{ui} \leftarrow \frac{\hat{r}_{ui} + c_{ut} \cdot r_{ut}}{1 + c_{ut}} $$

$$ c_{ut} = \alpha \cdot n_{ut} \cdot \exp(-\beta \cdot V_{ut}) $$

α = 8
β = 11
nut → the number of ratings u gave on day t
rut → the mean rating of u on day t
Vut → the variance of u's ratings on day t
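The single-day correction can be sketched as below, assuming the confidence weight has the form c = α · n · exp(−β · V) and that the user's known ratings from the same day are available as a list. The function name and the use of the population variance are assumptions of this sketch.

```python
import math


def shrink_to_day_mean(r_hat, day_ratings, alpha=8.0, beta=11.0):
    """Pull a prediction towards the user's mean rating on the same day."""
    n = len(day_ratings)
    if n == 0:
        return r_hat                     # nothing known about that day
    mean = sum(day_ratings) / n
    variance = sum((r - mean) ** 2 for r in day_ratings) / n
    c = alpha * n * math.exp(-beta * variance)
    return (r_hat + c * mean) / (1 + c)


# Two identical ratings on the day: strong pull towards their mean of 5.
print(shrink_to_day_mean(3.0, [5, 5]))
# Widely spread ratings on the day: the prediction is left almost unchanged.
print(shrink_to_day_mean(4.0, [1, 5]))
```

The exponential term makes the correction fade quickly once the ratings of that day disagree with each other, so only consistent days move the prediction.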

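Shrinking Towards Recent Actions

• A stronger correction accounts for periods longer than a single day
• And tries to characterize the recent user behavior on similar movies

$$ w_{ij}^u = s_{ij} \cdot \exp(-|t_{ui} - t_{uj}|) $$

$$ n_{ui} = \sum_{j:\, u \text{ rated } j} w_{ij}^u $$

$$ \bar{r}_{ui} = \frac{\sum_{j:\, u \text{ rated } j} w_{ij}^u \, r_{uj}}{\sum_{j:\, u \text{ rated } j} w_{ij}^u} $$

$$ V_{ui} = \frac{\sum_{j:\, u \text{ rated } j} w_{ij}^u \, (r_{uj})^2}{\sum_{j:\, u \text{ rated } j} w_{ij}^u} - (\bar{r}_{ui})^2 $$

$$ c_{ui} = \alpha \cdot n_{ui} \cdot \exp(-\beta \cdot V_{ui}) $$

$$ \hat{r}_{ui} \leftarrow \frac{\hat{r}_{ui} + c_{ui} \cdot \bar{r}_{ui}}{1 + c_{ui}} $$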

Q & A
