
Rakuten Viki Data Challenge: Recommendation Metric

Robin M. E. Swezey, Ph.D.
Intelligence Domain Group
Rakuten Institute of Technology
http://rit.rakuten.co.jp

2

Self-Introduction

• Specs
  – Born near Paris in 1985 (27 yrs old)
  – Dual citizen
  – D.E. @ Nagoya Institute of Technology
  – Rakuten / R.I.T. since July 2013
  – Currently consulting at Viki
• Previous works in Rakuten
  – Recommendation (next slide)
  – Advertisement
    • Streaming content matching
    • Distributed response prediction
  – Others
    • Women’s health application
    • ML Evangelism

3

Self-Introduction

• Work in Recommendation
  – Recommendation
    • For golf booking deals - Swezey R., Chung Y.: Recommending short-lived dynamic packages for golf booking services, CIKM 2015 + 2 patents
    • For travel (extension of the above)
    • Testing of recommender systems with an offline simulator (RecoMiX)

[Diagram: RecoMiX offline simulator — a website simulator replays user action logs and issues API calls to the recommender system; statistical results and feedback are collected; prior training set: 2.7 million books browsed from May to Nov. 2014]

Accuracy Metric

5

Accuracy Metric

• Expected Weighted Average Precision

• EWAP@3 gives the final evaluation score in the leaderboard, and is the expectation of WAP@3 over all users
• S(k) is the importance of user k in terms of viewing
• w(k) is the normalized weight of user k over the set of N users

$$\mathrm{EWAP@3} = \sum_{k=1}^{N} w(k)\,[\mathrm{WAP@3}]_k$$

$$w(k) = \frac{S(k)}{\sum_{k=1}^{N} S(k)} \quad\text{and}\quad S(k) = \Big[\sum_{j} y_j\Big]_{\text{@user } k}$$

6

Building the Metric

• Basic regression metric
  – Score-based regression metric: RMSE
• Disadvantages
  – RMSE is generally inadequate for recommendation problems
    » Difficult to convey meaning
    » We don’t care about exact score prediction
  – What matters in information retrieval is the final set of results

7

Building the Metric

• Basic classification metrics
  – Precision @n
    • Proportion of matches (true positives) among the n recommendations
    • How precise my n recommendations are
  – Recall @n
    • Proportion of matches (true positives) in the user history (h)
    • How many true positives I am able to recall with my n recommendations
  – Advantages
    • Easier to convey than RMSE
    • Focused on the set of recommendations
  (a short code sketch of both metrics follows this list)
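A minimal Python sketch of Precision@n and Recall@n as defined above, assuming the recommendations are an ordered list of video ids and the user history is a set (the function names are illustrative, not from the challenge code):

```python
def precision_at_n(recommendations, history, n=3):
    """Fraction of the top-n recommendations found in the user's history."""
    top_n = recommendations[:n]
    true_positives = sum(1 for item in top_n if item in history)
    return true_positives / n


def recall_at_n(recommendations, history, n=3):
    """Fraction of the user's history recovered by the top-n recommendations."""
    top_n = recommendations[:n]
    true_positives = sum(1 for item in top_n if item in history)
    return true_positives / len(history)


# Mirrors the slides' example: 2 of 3 recommendations match a 3-video history
history = {"v1", "v2", "v3"}
recs = ["v1", "v2", "v9"]
print(precision_at_n(recs, history))  # 2/3
print(recall_at_n(recs, history))     # 2/3
```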

8

Context

• Focusing the problem
  – In our setting, n is fixed to 3
  – Constraints
    1. User history sizes vary from few to many videos
    2. Order of recommendations matters
    3. Engagement of users matters

9

Precision

• Example
  – Precision @n: # of true positives / # of recommendations (n)

User history set Recommendations Precision @3

10

Precision

• Example
  – Precision @n: # of true positives / # of recommendations (n)

User history set Recommendations Precision @3

P = 2/3 = 0.66

11

Precision

• Example
  – Precision @n: # of true positives / # of recommendations (n)

User history set Recommendations Precision @3

P = 2/3 = 0.66

P = 1/3 = 0.33

12

Recall

• Example
  – Recall @n: # of true positives / history size (h)

User history set Recommendations Recall @3

13

Recall

• Example
  – Recall @n: # of true positives / history size (h)

User history set Recommendations Recall @3

R = 2/3 = 0.66

14

Recall

• Example
  – Recall @n: # of true positives / history size (h)

User history set Recommendations Recall @3

R = 2/3 = 0.66

R = 1/4 = 0.25

15

Back to Context

• Constraints
  1. User history sizes vary from few to many videos
  2. Order of recommendations matters
  3. Engagement of users matters

16

Constraint 1: History sizes

• Example 1
  – A user with a big history

User history set Recommendations Recall @3

17

Constraint 1: History sizes

• Example 1
  – If the user history > n, recall never reaches 1

– Everything matches but recall is stuck at 50%

User history set Recommendations Recall @3

R = 3/6 = 0.5

18

Constraint 1: History sizes

• Example 2
  – A user with a small history

User history set Recommendations Precision @3

19

Constraint 1: History sizes

• Example 2
  – If the user history < n, precision never reaches 1

– Everything has been recalled but precision is stuck at 33%

User history set Recommendations Precision @3

P = 1/3 = 0.33

20

Constraint 1: History sizes

• Solution
  – Use min(h, n) as the denominator: big-history user

User history set Recommendations Precision @3

P = 3/3 = 1

21

Constraint 1: History sizes

• Solution
  – Use min(h, n) as the denominator: small-history user
  – Precision and recall are the same metric in that case
    • P = R = |tp| / min(h, n)
  (a short code sketch of this adjusted metric follows below)

User history set Recommendations Precision @3

P = 1/1 = 1
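A minimal sketch of this adjusted metric, reusing the same input conventions as the earlier snippet (illustrative names, not challenge code):

```python
def precision_at_n_adjusted(recommendations, history, n=3):
    """Precision@n with min(h, n) as the denominator: a perfect
    recommendation list scores 1 regardless of the user's history size."""
    top_n = recommendations[:n]
    true_positives = sum(1 for item in top_n if item in history)
    return true_positives / min(len(history), n)


# Big-history user: 3 hits, history of 6 videos -> 3/3 = 1
print(precision_at_n_adjusted(["v1", "v2", "v3"],
                              {"v1", "v2", "v3", "v4", "v5", "v6"}))
# Small-history user: 1 hit, history of 1 video -> 1/1 = 1
print(precision_at_n_adjusted(["v1", "v8", "v9"], {"v1"}))
```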

22

Back to Context

• Constraints
  1. User history sizes vary from few to many videos
  2. Order of recommendations matters
  3. Engagement of users matters

23

Constraint 2: Order

• Example
  – We only measure P@3, so for this user:

User history set Recommendations Precision @3

[Illustration: ranked recommendation lists (positions 1–3) from Algo 1 and Algo 2]

24

Constraint 2: Order

• Example
  – We only measure P@3, so for this user:

User history set Recommendations Precision @3

P = 1/3 = 0.33

25

Constraint 2: Order

• Example
  – Wait, what?

User history set Recommendations Precision @3

Algo 1: P = 1/3 = 0.33
Algo 2: P = 1/3 = 0.33

26

Constraint 2: Order

• Solution
  1. Average Precision (ap@n)
    • Take precision@k from k = 1 to n and average it over n: $ap@n = \frac{1}{n}\sum_{k=1}^{n} P@k$
    • Note: this ap@n is slightly different from regular IR ap@n (a short code sketch follows below)
      – P@k is not set to 0 when the k-th item is not a true positive
      – Integration over n, not over recall
      – Stronger weight on good ordering
      – Later wrong predictions are less penalized
      – For practical purposes, because of score weighting
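A minimal sketch of this modified ap@n, under the same assumptions as the earlier snippets (ordered recommendation list, history set; illustrative names):

```python
def ap_at_n(recommendations, history, n=3):
    """Modified average precision: average P@k over every k = 1..n,
    without zeroing out P@k when the k-th item is not a true positive."""
    hits = 0
    total = 0.0
    for k, item in enumerate(recommendations[:n], start=1):
        if item in history:
            hits += 1
        total += hits / k  # P@k is counted at every position
    return total / n


history = {"v1"}
print(ap_at_n(["v1", "v8", "v9"], history))  # Algo 1: 1/3 (1 + 1/2 + 1/3) ≈ 0.61
print(ap_at_n(["v8", "v9", "v1"], history))  # Algo 2: 1/3 (0 + 0 + 1/3) ≈ 0.11
```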

27

Constraint 2: Order

• Solution
  – AP@3 for the 2 algos:

User history set Recommendations AP@3

P@1 = 1/1

28

Constraint 2: Order

• Solution
  – AP@3 for the 2 algos:

User history set Recommendations AP@3

P@2 = 1/2

29

Constraint 2: Order

• Solution
  – AP@3 for the 2 algos:

User history set Recommendations AP@3

P@3 = 1/3

30

Constraint 2: Order

• Solution
  – AP@3 for the 2 algos:

User history set Recommendations AP@3

AP@3 = 1/3 (1/1 + 1/2 + 1/3) = 0.61

31

Constraint 2: Order

• Solution
  – AP@3 for the 2 algos:

User history set Recommendations AP@3

Algo 1: ap@3 = 1/3 (1/1 + 1/2 + 1/3) = 0.61
Algo 2: ap@3 = 1/3 (0 + 0 + 1/3) = 0.11

32

Averaging over Users

• From AP@n to MAP
  1. Average Precision (AP@n)
    • Take precision@k from k = 1 to n and average it over n: $ap@n = \frac{1}{n}\sum_{k=1}^{n} P@k$
  2. MAP
    • Take the mean of AP@n over users: $MAP@n = \frac{1}{N}\sum_{k=1}^{N} [ap@n]_k$
  (a short code sketch follows below)
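A minimal sketch of MAP built on the ap_at_n helper above (same illustrative assumptions):

```python
def map_at_n(users, n=3):
    """Mean of ap@n over a collection of (recommendations, history) pairs."""
    return sum(ap_at_n(recs, hist, n) for recs, hist in users) / len(users)


# The two ranked lists from the Algo 1 / Algo 2 example, treated as two users
users = [(["v1", "v8", "v9"], {"v1"}),
         (["v8", "v9", "v1"], {"v1"})]
print(map_at_n(users))  # (0.61 + 0.11) / 2 ≈ 0.36
```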

33

Back to Context

• Constraints
  1. User history sizes vary from few to many videos
  2. Order of recommendations matters
  3. Engagement of users matters

34

Constraint 3: Engagement

• Improving MAP for Viki
  – Viewing time is of crucial importance in the Viki funnel
  – Each viewed video is scored from 1 to 3 based on viewing time (equidistant binning) to quantify engagement (a small binning sketch follows the diagram below)

[Diagram: the Viki funnel — Unique Visitors → # of Video Starts → # of Available Ads → # of Filled Ads → Monetize; drivers: Frequency, Engagement, Retention, Coverage; the main KPI driver is highlighted]
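The 1-to-3 scoring could be sketched as follows; the basis of the equidistant binning (here assumed to be the fraction of the video watched) is an assumption for illustration, not specified by the slides:

```python
def engagement_score(watched_seconds, duration_seconds):
    """Map viewing time to a 1-3 engagement score with three equal-width bins.

    Assumption: bins are over the fraction of the video watched,
    [0, 1/3) -> 1, [1/3, 2/3) -> 2, [2/3, 1] -> 3.
    """
    fraction = min(watched_seconds / duration_seconds, 1.0)
    return min(int(fraction * 3) + 1, 3)


print(engagement_score(10 * 60, 60 * 60))  # ~17% watched -> score 1
print(engagement_score(45 * 60, 60 * 60))  # 75% watched -> score 3
```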

35

Constraint 3: Engagement

• Solution part 1: from AP@n to WAP@n
  1. Sort the user's scores by engagement score, descending
  2. Take the first min(h, n) scores (here: the first 3) → yi

User history set Sorted true scores (yi)

[Illustration: user history with scores {3, 2, 1, 2}; sorted true scores yi = (3, 2, 2)]

36

Constraint 3: Engagement

• Solution part 1: from AP@n to WAP@n
  3. Make your ordered list of recommendations
  4. Take each prediction's score for this user → pi

Sorted scores list (yi) Recommendations (pi) WAP @3

[Illustration: sorted true scores yi = (3, 2, 2); recommendation scores pi = (2, 3, 0)]

37

Constraint 3: Engagement

• Solution part 1: from AP@n to WAP@n
  5. For each score-weighted precision @i:
    A) Use the cumulative sum of pi as numerator
    B) Use the cumulative sum of yi as denominator

Sorted scores list (yi) Recommendations (pi) WAP @3

WP@1 = 2/3
WP@2 = (2+3) / (3+2)
WP@3 = (2+3+0) / (3+2+2)

38

Constraint 3: Engagement

• Solution part 1: from AP@n to WAP@n
  6. Final step: average over 3

Sorted scores list (yi) Recommendations (pi) WAP @3

WAP@3 = 1/3 [ 2/3 + (2+3)/(3+2) + (2+3+0)/ (3+2+2) ] = 0.79


39

Constraint 3: Engagement

• Solution part 1: from AP@n to WAP@n
  – Compute WAP from the sorted scores:

• Weighted Average Precision, calculated for each user

User | Videos watched | Video scores | Recommended videos
u1 | v1, v2, v3, v4, v5, v6 | 3, 3, 2, 2, 1, 1 | v1, v2, v3
u2 | v1, v2, v3, v4, v5, v6 | 3, 3, 2, 2, 1, 1 | v3, v2, v1
u3 | v1, v2, v3, v4, v5, v6 | 3, 3, 2, 2, 1, 1 | v4, v5, v6

$$[\mathrm{WAP@3}]_k = \left[\,\frac{1}{3}\sum_{i=1}^{3}\frac{\sum_{j=1}^{i} p_j}{\sum_{j=1}^{\min(i,\,n)} y_j}\,\right]_{\text{@user } k}$$

(a short code sketch of this computation follows)
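A minimal sketch of per-user WAP@3 following steps 1–6 above, assuming the history is a dict mapping video id to its 1–3 engagement score and that a recommendation outside the history contributes a score of 0; the denominator is capped at the available sorted scores, an assumption for the small-history case following the min(h, n) convention on the earlier slides (illustrative names, not challenge code):

```python
def wap_at_n(recommendations, history_scores, n=3):
    """Weighted Average Precision @n for a single user.

    history_scores: dict {video_id: engagement score in 1..3}
    recommendations: ordered list of recommended video ids
    """
    # Steps 1-2: sort the true scores descending, keep the first min(h, n)
    y = sorted(history_scores.values(), reverse=True)[:min(len(history_scores), n)]
    # Steps 3-4: each prediction's true score for this user (0 if unwatched)
    p = [history_scores.get(item, 0) for item in recommendations[:n]]
    # Steps 5-6: average the cumulative score-weighted precision over n
    total = 0.0
    for i in range(1, n + 1):
        numerator = sum(p[:i])
        denominator = sum(y[:min(i, len(y))])  # cap at the available true scores
        total += numerator / denominator
    return total / n


# Worked example from the slides: yi = (3, 2, 2), pi = (2, 3, 0) -> ≈ 0.79
history = {"v1": 3, "v2": 2, "v3": 2, "v4": 1}
recs = ["v2", "v1", "v9"]  # true scores 2, 3, and 0 (v9 was not watched)
print(wap_at_n(recs, history))
```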

40

Constraint 3: Engagement

• Solution part 2: weigh our user for the final expectation
  1. Sum user k's video scores to get his engagement → Sk

User k history set Engagement of user k (Sk)

2 + 1 + 2 + 3 = 8

41

Constraint 3: Engagement

• Solution part 2: weigh our user for the final expectation
  2. Total engagement is the sum of the engagement of all users
    • e.g. 200
  3. Divide user k's engagement Sk by the total engagement → wk

User k history set Weight of user k (wk)

2 + 1 + 2 + 3 = 8
8 / 200 = 0.04

42

Constraint 3: Engagement

• Solution part 2: from MAP@n to EWAP@n

• EWAP@3 gives the final evaluation score in the leaderboard, and is the expectation of WAP@3 over all users
• S(k) is the total engagement of user k
• w(k) is the normalized weight of user k over the set of N users

$$\mathrm{EWAP@3} = \sum_{k=1}^{N} w(k)\,[\mathrm{WAP@3}]_k$$

$$w(k) = \frac{S(k)}{\sum_{k=1}^{N} S(k)} \quad\text{and}\quad S(k) = \Big[\sum_{j} y_j\Big]_{\text{@user } k}$$

(a short code sketch of the final aggregation follows)
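A minimal sketch of the final aggregation, built on the wap_at_n helper above; each user is represented as a (recommendations, history_scores) pair (illustrative structure, not the challenge evaluation code):

```python
def ewap_at_n(users, n=3):
    """Expected WAP@n: engagement-weighted average of per-user WAP@n.

    users: list of (recommendations, history_scores) pairs
    """
    # S(k) is the total engagement of user k; w(k) = S(k) / sum over all users
    engagements = [sum(hist.values()) for _, hist in users]
    total_engagement = sum(engagements)
    return sum(
        (s_k / total_engagement) * wap_at_n(recs, hist, n)
        for (recs, hist), s_k in zip(users, engagements)
    )


# With a single user, EWAP reduces to that user's WAP@3 (≈ 0.79 from above)
print(ewap_at_n([(["v2", "v1", "v9"],
                  {"v1": 3, "v2": 2, "v3": 2, "v4": 1})]))
```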

43

Back to Context

• Constraints
  1. User history sizes vary from few to many videos
  2. Order of recommendations matters
  3. Engagement of users matters

44

Conclusion

• Constraints satisfied
  – WAP
    • Takes values in [0, 1] for each user regardless of user history size
    • Measures the ordering of the retrieved videos
    • Measures the engagement value of the videos retrieved by participants
  – EWAP
    • Weighs users according to engagement in the final integration
  – More examples in the Challenge Statement

45

Thank you

Any questions?