Utrecht University - Universiteit Utrecht

Recommendation Systems

Unsupervised Machine Learning

Prof. Yannis Velegrakis, Utrecht University

[email protected]

velgias.github.io


Disclaimer

The following set of slides originates from a number of different presentations and courses of the following people:
• Yannis Velegrakis (Utrecht University)
• Jeff Ullman (Stanford University)
• Bill Howe (U of Washington)
• Martin Fowler (ThoughtWorks)
• Ekaterini Ioannou (Tilburg University)
• Themis Palpanas (U of Paris-Descartes)

Copyright stays with the authors.

No distribution is allowed without prior permission by the authors.

Example: Recommender Systems

• Customer X
  ◦ Buys Metallica CD
  ◦ Buys Megadeth CD
• Customer Y
  ◦ Does search on Metallica
  ◦ Recommender system suggests Megadeth from data collected about customer X

Recommendations

[Figure: items (products, web sites, blogs, news items, …) reached either through search or through recommendations, with examples.]

From Scarcity to Abundance

• Shelf space is a scarce commodity for traditional retailers
  ◦ Also: TV networks, movie theaters, …
• Web enables near-zero-cost dissemination of information about products
  ◦ From scarcity to abundance
• More choice necessitates better filters
  ◦ Recommendation engines
  ◦ How Into Thin Air made Touching the Void a bestseller: http://www.wired.com/wired/archive/12.10/tail.html

Sidenote: The Long Tail

Source: Chris Anderson (2004)


Physical vs. Online

Read http://www.wired.com/wired/archive/12.10/tail.html to learn more!


Types of Recommendations

• Editorial and hand curated
  ◦ List of favorites
  ◦ Lists of “essential” items
• Simple aggregates
  ◦ Top 10, Most Popular, Recent Uploads
• Tailored to individual users
  ◦ Amazon, Netflix, …

Formal Model

• X = set of Customers
• S = set of Items
• Utility function u: X × S → R
  ◦ R = set of ratings
  ◦ R is a totally ordered set
  ◦ e.g., 0-5 stars, real number in [0,1]

Utility Matrix

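[Figure: utility matrix with rows Alice, Bob, Carol, David and columns Avatar, LOTR, Matrix, Pirates. Only a few cells hold known ratings in [0,1] (the values 0.2, 0.3, 0.4, 0.5 and 1 appear); the remaining cells are empty.]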

Key Problems

• (1) Gathering “known” ratings for the matrix
  ◦ How to collect the data in the utility matrix
• (2) Extrapolating unknown ratings from the known ones
  ◦ Mainly interested in high unknown ratings
    ▪ We are not interested in knowing what you don’t like but what you like
• (3) Evaluating extrapolation methods
  ◦ How to measure success/performance of recommendation methods

(1) Gathering Ratings

• Explicit
  ◦ Ask people to rate items
  ◦ Doesn’t work well in practice: people can’t be bothered
• Implicit
  ◦ Learn ratings from user actions
    ▪ E.g., purchase implies high rating
  ◦ What about low ratings?

(2) Extrapolating Utilities

• Key problem: Utility matrix U is sparse
  ◦ Most people have not rated most items
  ◦ Cold start:
    ▪ New items have no ratings
    ▪ New users have no history
• Two approaches to recommender systems:
  ◦ 1) Content-based
  ◦ 2) Collaborative

Content-based Recommender Systems


Content-based Recommendations

• Main idea: Recommend items to customer x similar to previous items rated highly by x

Example:
• Movie recommendations
  ◦ Recommend movies with same actor(s), director, genre, …
• Websites, blogs, news
  ◦ Recommend other sites with “similar” content

Plan of Action

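[Figure: the content-based pipeline. From the items a user likes (drawn as red circles and triangles), build item profiles, build a user profile from them, then match the user profile against item profiles to recommend new items.]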

Item Profiles

• For each item, create an item profile
• Profile is a set (vector) of features
  ◦ Movies: author, title, actor, director, …
  ◦ Text: Set of “important” words in document
• How to pick important features?
  ◦ Usual heuristic from text mining is TF-IDF (Term Frequency × Inverse Document Frequency)
    ▪ Term … Feature
    ▪ Document … Item

Sidenote: TF-IDF

$f_{ij}$ = frequency of term (feature) i in doc (item) j

Note: we normalize TF to discount for “longer” documents

$n_i$ = number of docs that mention term i
N = total number of docs

TF-IDF score: $w_{ij} = TF_{ij} \times IDF_i$

Doc profile = set of words with highest TF-IDF scores, together with their scores
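To make the scoring concrete, here is a minimal sketch of building doc profiles from TF-IDF scores. The exact normalization is an assumption of this sketch: it uses the common textbook choice $TF_{ij} = f_{ij} / \max_k f_{kj}$ and $IDF_i = \log_2(N / n_i)$. The function name `tf_idf_profiles`, the parameter `top_k`, and the tiny corpus are hypothetical.

```python
import math
from collections import Counter

def tf_idf_profiles(docs, top_k=3):
    """Return, per document, the top_k (term, score) pairs by TF-IDF.

    Assumed definitions (one common choice, not the only one):
      TF_ij = f_ij / max_k f_kj   (normalized by the most frequent term in doc j)
      IDF_i = log2(N / n_i)
    """
    counts = [Counter(doc.lower().split()) for doc in docs]
    n_docs = len(docs)
    # n_i: number of docs that mention term i (each doc counted once per term)
    doc_freq = Counter(term for c in counts for term in c)

    profiles = []
    for c in counts:
        max_f = max(c.values())
        scores = {
            term: (f / max_f) * math.log2(n_docs / doc_freq[term])
            for term, f in c.items()
        }
        # Highest score first; ties broken alphabetically for determinism.
        ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
        profiles.append(ranked[:top_k])
    return profiles

# Hypothetical three-document corpus, for illustration only.
docs = [
    "space ship alien space battle",
    "love story paris love",
    "alien love story",
]
profiles = tf_idf_profiles(docs)
```

A term that occurs in every document gets IDF 0 and therefore never enters a profile with a positive score, which is the intended filtering effect.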

User Profiles

Example 1: Boolean Utility Matrix

Example 2: Star Ratings

Making Predictions

User Profiles and Prediction

• User profile possibilities:
  ◦ Weighted average of rated item profiles
  ◦ Variation: weight by difference from average rating for item
  ◦ …
• Prediction heuristic:
  ◦ Given user profile x and item profile i, estimate $u(\mathbf{x}, \mathbf{i}) = \cos(\mathbf{x}, \mathbf{i}) = \dfrac{\mathbf{x} \cdot \mathbf{i}}{\|\mathbf{x}\| \cdot \|\mathbf{i}\|}$
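The prediction heuristic can be sketched as follows, assuming item profiles are numeric feature vectors. The feature order (SciFi, Romance, Action), the item names and the ratings are made up for illustration; `build_user_profile` implements only the plain rating-weighted average, not the mean-offset variation.

```python
import numpy as np

def cosine(x, i):
    """Cosine of the angle between user profile x and item profile i."""
    x, i = np.asarray(x, dtype=float), np.asarray(i, dtype=float)
    denom = np.linalg.norm(x) * np.linalg.norm(i)
    return float(x @ i / denom) if denom > 0 else 0.0

def build_user_profile(item_profiles, ratings):
    """Rating-weighted average of the profiles of the items the user rated."""
    profiles = np.asarray(item_profiles, dtype=float)
    weights = np.asarray(ratings, dtype=float)
    return (weights[:, None] * profiles).sum(axis=0) / weights.sum()

# Hypothetical features per item: [SciFi, Romance, Action]
rated_profiles = [[1, 0, 1], [1, 0, 0], [0, 1, 0]]
ratings = [5, 4, 1]
x = build_user_profile(rated_profiles, ratings)

# Hypothetical unseen items to score against the user profile.
candidates = {"space_opera": [1, 0, 1], "rom_com": [0, 1, 0]}
scores = {name: cosine(x, profile) for name, profile in candidates.items()}
best = max(scores, key=scores.get)
```

Items are then recommended in decreasing order of `scores`; here the user's high ratings on SciFi items pull the profile toward the first candidate.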

Pros: Content-based Approach

• +: No need for data on other users
  ◦ No cold-start or sparsity problems
• +: Able to recommend to users with unique tastes
• +: Able to recommend new & unpopular items
  ◦ No first-rater problem
• +: Able to provide explanations
  ◦ Can provide explanations of recommended items by listing content-features that caused an item to be recommended

Cons: Content-based Approach

• –: Finding the appropriate features is hard
  ◦ E.g., images, movies, music
• –: Recommendations for new users
  ◦ How to build a user profile?
• –: Overspecialization
  ◦ Never recommends items outside user’s content profile
  ◦ People might have multiple interests
  ◦ Unable to exploit quality judgments of other users

Collaborative Filtering

Harnessing the judgment of other users

Collaborative Filtering

• Consider user x
• Find set N of other users whose ratings are “similar” to x’s ratings
• Estimate x’s ratings based on ratings of users in N

Option 1: Jaccard Similarity

Option 2: Cosine Similarity

Option 3: Centered Cosine

Centered Cosine Similarity (2)

Rating Predictions
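As a rough illustration of the user-user scheme, the three similarity options and a similarity-weighted rating prediction might look like this. The small utility matrix is illustrative data, not taken from the text above. Treating unknown ratings as 0 in plain cosine, and keeping only neighbors with positive similarity in `predict`, are assumptions of this sketch.

```python
import numpy as np

nan = np.nan
# Illustrative utility matrix: rows = users A..D, columns = 7 items, nan = unknown.
R = np.array([
    [4,   nan, nan, 5,   1,   nan, nan],   # A
    [5,   5,   4,   nan, nan, nan, nan],   # B
    [nan, nan, nan, 2,   4,   5,   nan],   # C
    [nan, 3,   nan, nan, nan, nan, 3],     # D
])

def rated(r):
    return ~np.isnan(r)

def jaccard(a, b):
    """Overlap of the sets of rated items; ignores the rating values."""
    both = np.sum(rated(a) & rated(b))
    either = np.sum(rated(a) | rated(b))
    return float(both / either) if either else 0.0

def cosine(a, b):
    """Cosine similarity with unknown ratings treated as 0."""
    a, b = np.nan_to_num(a), np.nan_to_num(b)
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float(a @ b / denom) if denom else 0.0

def centered_cosine(a, b):
    """Cosine after subtracting each user's mean; unknowns become 0 (= average)."""
    ca = np.where(rated(a), a - np.nanmean(a), 0.0)
    cb = np.where(rated(b), b - np.nanmean(b), 0.0)
    return cosine(ca, cb)

def predict(R, x, i, sim=centered_cosine, k=2):
    """Similarity-weighted average of item i's ratings over the k users most similar to x."""
    candidates = [(sim(R[x], R[y]), R[y, i])
                  for y in range(len(R)) if y != x and not np.isnan(R[y, i])]
    top = sorted((c for c in candidates if c[0] > 0), reverse=True)[:k]
    if not top:
        return None  # no usable neighbor has rated item i
    return sum(s * r for s, r in top) / sum(s for s, _ in top)
```

On this data Jaccard ranks C as closer to A than B, although A and C disagree on the items they share; centered cosine makes that disagreement visible as a negative similarity.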

Item-Item Collaborative Filtering

• So far: User-user collaborative filtering
• Another view: Item-item
  ◦ For item i, find other similar items
  ◦ Estimate rating for item i based on ratings for similar items
  ◦ Can use same similarity metrics and prediction functions as in user-user model

$$r_{xi} = \frac{\sum_{j \in N(i;x)} s_{ij} \cdot r_{xj}}{\sum_{j \in N(i;x)} s_{ij}}$$

$s_{ij}$ … similarity of items i and j
$r_{xj}$ … rating of user x on item j
$N(i;x)$ … set of items rated by x that are similar to i

Item-Item CF (|N|=2)

Utility matrix (rows = movies, columns = users; "." = unknown rating, known ratings are between 1 and 5):

          users
movie    1  2  3  4  5  6  7  8  9 10 11 12
  1      1  .  3  .  ?  5  .  .  5  .  4  .
  2      .  .  5  4  .  .  4  .  .  2  1  3
  3      2  4  .  1  2  .  3  .  4  3  5  .
  4      .  2  4  .  5  .  .  4  .  .  2  .
  5      .  .  4  3  4  2  .  .  .  .  2  5
  6      1  .  3  .  3  .  .  2  .  .  4  .

Goal: estimate the rating of movie 1 by user 5 (the "?" cell).

Neighbor selection: identify movies similar to movie 1 that were rated by user 5.

Here we use Pearson correlation as similarity:
1) Subtract the mean rating m_i from each movie i
   m_1 = (1+3+5+5+4)/5 = 3.6
   row 1: [-2.6, 0, -0.6, 0, 0, 1.4, 0, 0, 1.4, 0, 0.4, 0]
2) Compute cosine similarities between rows

movie m      1      2      3      4      5      6
sim(1,m)   1.00  -0.18   0.41  -0.10  -0.31   0.59

Compute similarity weights: the two most similar movies rated by user 5 are movies 3 and 6, with s_{1,3} = 0.41 and s_{1,6} = 0.59.

Predict by taking the weighted average:

r_{1,5} = (0.41*2 + 0.59*3) / (0.41+0.59) = 2.6
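The worked example can be reproduced with a short sketch. The column positions of the ratings in `R` are a reconstruction, chosen to be consistent with the centered row 1 and the similarities quoted above; treat that placement as an assumption. The function names are hypothetical.

```python
import numpy as np

nan = np.nan
# rows = movies 1..6, columns = users 1..12 (nan = unknown rating)
R = np.array([
    [1,   nan, 3,   nan, nan, 5,   nan, nan, 5,   nan, 4, nan],
    [nan, nan, 5,   4,   nan, nan, 4,   nan, nan, 2,   1, 3],
    [2,   4,   nan, 1,   2,   nan, 3,   nan, 4,   3,   5, nan],
    [nan, 2,   4,   nan, 5,   nan, nan, 4,   nan, nan, 2, nan],
    [nan, nan, 4,   3,   4,   2,   nan, nan, nan, nan, 2, 5],
    [1,   nan, 3,   nan, 3,   nan, nan, 2,   nan, nan, 4, nan],
])

def centered(R):
    """Subtract each movie's mean rating; unknown cells become 0."""
    means = np.nanmean(R, axis=1, keepdims=True)
    return np.where(np.isnan(R), 0.0, R - means)

def item_similarities(R, i):
    """Cosine similarity between mean-centered row i and every row."""
    C = centered(R)
    norms = np.linalg.norm(C, axis=1)
    return C @ C[i] / (norms * norms[i])

def predict(R, i, x, n_neighbors=2):
    """Weighted average of user x's ratings on the movies most similar to movie i."""
    sims = item_similarities(R, i)
    rated = [j for j in range(len(R)) if j != i and not np.isnan(R[j, x])]
    top = sorted(rated, key=lambda j: sims[j], reverse=True)[:n_neighbors]
    weights = sims[top]
    return float(weights @ R[top, x] / weights.sum())

sims = item_similarities(R, 0)   # similarities of movie 1 to all movies
r_15 = predict(R, 0, 4)          # movie 1 (index 0), user 5 (index 4)
```

With unrounded similarities the estimate comes out slightly below 2.6 and rounds to the same value.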

Item-Item vs. User-User

[Figure: utility matrix with rows Alice, Bob, Carol, David and columns Avatar, LOTR, Matrix, Pirates.]

• In practice, it has been observed that item-item often works better than user-user
• Why? Items are simpler, users have multiple tastes

Implementing Collaborative Filtering

Collaborative Filtering: Complexity

Pros/Cons of Collaborative Filtering

• + Works for any kind of item
  ◦ No feature selection needed
• - Cold Start:
  ◦ Need enough users in the system to find a match
• - Sparsity:
  ◦ The user/ratings matrix is sparse
  ◦ Hard to find users that have rated the same items
• - First rater:
  ◦ Cannot recommend an item that has not been previously rated
  ◦ New items, Esoteric items
• - Popularity bias:
  ◦ Cannot recommend items to someone with unique taste
  ◦ Tends to recommend popular items

Hybrid Methods

• Implement two or more different recommenders and combine predictions
  ◦ Perhaps using a linear model
• Add content-based methods to collaborative filtering
  ◦ Item profiles for new item problem
  ◦ Demographics to deal with new user problem

Global Baseline Estimate

Combining Global Baseline Estimate with CF

Evaluation

[Figure: utility matrix R, 480,000 users × 17,700 movies, sparsely filled with known ratings between 1 and 5.]

Evaluation

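[Figure: matrix R (480,000 users × 17,700 movies) split into a training data set of known ratings and a test data set whose ratings are withheld (shown as “?”); $r_{3,6}$ marks one such cell.]

$$\text{RMSE} = \sqrt{\frac{1}{|R|} \sum_{(i,x) \in R} \left(\hat{r}_{xi} - r_{xi}\right)^2}$$

where $\hat{r}_{xi}$ is the predicted rating and $r_{xi}$ is the true rating of user x on item i. In this formula R denotes the set of (item, user) pairs in the test data set.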
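A minimal sketch of the RMSE computation over a held-out test set follows. The dictionary layout keyed by (user, item) and the sample values are hypothetical choices made for this example.

```python
import math

def rmse(predicted, actual):
    """Root-mean-square error over the (user, item) pairs present in `actual`."""
    errors = [(predicted[key] - rating) ** 2 for key, rating in actual.items()]
    return math.sqrt(sum(errors) / len(errors))

# Hypothetical held-out ratings and model predictions, keyed by (user, item).
test_ratings = {("u1", "m1"): 4, ("u1", "m2"): 2, ("u2", "m1"): 5}
predictions = {("u1", "m1"): 3.5, ("u1", "m2"): 2.5, ("u2", "m1"): 4.0}

score = rmse(predictions, test_ratings)
```

Because errors are squared before averaging, the single prediction that is off by a full star dominates the two that are off by half a star.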

Problems with Error Measures

• Narrow focus on accuracy sometimes misses the point
  ◦ Prediction Diversity
  ◦ Prediction Context
  ◦ Order of predictions
• In practice, we care only to predict high ratings:
  ◦ RMSE might penalize a method that does well for high ratings and badly for others

Collaborative Filtering: Complexity

• Expensive step is finding k most similar customers: O(|X|)
• Too expensive to do at runtime
  ◦ Could pre-compute
• Naïve pre-computation takes time O(k · |X|)
  ◦ X … set of customers
• We already know how to do this!
  ◦ Near-neighbor search in high dimensions (LSH)
  ◦ Clustering
  ◦ Dimensionality reduction

Tip: Add Data

• Leverage all the data
  ◦ Don’t try to reduce data size in an effort to make fancy algorithms work
  ◦ Simple methods on large data do best
• Add more data
  ◦ e.g., add IMDB data on genres
• More data beats better algorithms: http://anand.typepad.com/datawocky/2008/03/more-data-usual.html

Performance of Various Methods

RMSE of each method (lower is better):
• Global average: 1.1296
• User average: 1.0651
• Movie average: 1.0533
• Netflix: 0.9514
• Basic Collaborative filtering: 0.94
• Collaborative filtering++: 0.91
• Latent factors: 0.90
• Latent factors+Biases: 0.89
• Latent factors+Biases+Time: 0.876
• Grand Prize: 0.8563

When no prize… ☹ Getting desperate. Try a “kitchen sink” approach!

Dimensionality Reduction

• Assumption: Data lies on or near a low d-dimensional subspace
• Axes of this subspace are an effective representation of the data

• Compress / reduce dimensionality:
  ◦ 10^6 rows; 10^3 columns; no updates
  ◦ Random access to any cell(s); small error: OK

[Figure: example data matrix.]

The above matrix is really “2-dimensional.” All rows can be reconstructed by scaling [1 1 1 0 0] or [0 0 0 1 1]

Rank is “Dimensionality”

• Cloud of points in 3D space:
  ◦ Think of point positions as a matrix, with 1 row per point: A, B, C
• We can rewrite coordinates more efficiently!
  ◦ Old basis vectors: [1 0 0] [0 1 0] [0 0 1]
  ◦ New basis vectors: [1 2 1] [-2 -3 1]
  ◦ Then A has new coordinates: [1 0], B: [0 1], C: [1 1]
    ▪ Notice: We reduced the number of coordinates!


• Goal of dimensionality reduction is to discover the axis of data!

Rather than representing every point with 2 coordinates we represent each point with 1 coordinate (corresponding to the position of the point on the red line).

By doing this we incur a bit of error as the points do not exactly lie on the line

Why Reduce Dimensions?

• Discover hidden correlations/topics
  ◦ Words that occur commonly together
• Remove redundant and noisy features
  ◦ Not all words are useful
• Interpretation and visualization
• Easier storage and processing of the data

SVD - Definition

$A_{[m \times n]} = U_{[m \times r]} \, S_{[r \times r]} \, (V_{[n \times r]})^T$

• A: Input data matrix
  ◦ m x n matrix (e.g., m documents, n terms)
• U: Left singular vectors
  ◦ m x r matrix (m documents, r concepts)
• S: Singular values
  ◦ r x r diagonal matrix (strength of each ‘concept’)
  ◦ (r: rank of the matrix A)
• V: Right singular vectors
  ◦ n x r matrix (n terms, r concepts)

SVD - Properties

It is always possible to decompose a real matrix A into A = U S V^T, where

• U, S, V: unique (S always; U and V up to simultaneous sign flips of matching columns when the singular values are distinct)
• U, V: column orthonormal
  ◦ U^T U = I; V^T V = I (I: identity matrix)
  ◦ (Columns are orthogonal unit vectors)
• S: diagonal
  ◦ Entries (singular values) are positive, and sorted in decreasing order (σ1 ≥ σ2 ≥ ... ≥ 0)

SVD – Example: Users-to-Movies

• A = U S V^T - example: Users to Movies

A (users × movies); the first four users mostly rate SciFi movies, the last three mostly Romance movies:

           Matrix  Alien  Serenity  Casablanca  Amelie
SciFi         1      1       1          0          0
SciFi         3      3       3          0          0
SciFi         4      4       4          0          0
SciFi         5      5       5          0          0
Romance       0      2       0          4          4
Romance       0      0       0          5          5
Romance       0      1       0          2          2

U (users × concepts):

 0.13   0.02  -0.01
 0.41   0.07  -0.03
 0.55   0.09  -0.04
 0.68   0.11  -0.05
 0.15  -0.59   0.65
 0.07  -0.73  -0.67
 0.07  -0.29   0.32

S (strength of each concept):

12.4    0      0
 0      9.5    0
 0      0      1.3

V^T (concepts × movies):

 0.56   0.59   0.56   0.09   0.09
 0.12  -0.02   0.12  -0.69  -0.69
 0.40  -0.80   0.40   0.09   0.09

• The first column of U and the first row of V^T correspond to the SciFi-concept; the second column of U and the second row of V^T correspond to the Romance-concept
• U is the “user-to-concept” similarity matrix
• The first singular value, 12.4, is the “strength” of the SciFi-concept
• V is the “movie-to-concept” similarity matrix
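The decomposition of the users-to-movies matrix can be checked numerically. This is a sketch using `numpy.linalg.svd`; the signs of the computed singular vectors may differ from the values printed above, and the singular values match them only up to rounding. The threshold `1e-10` for the numerical rank is an arbitrary choice for this example.

```python
import numpy as np

# Users-to-movies matrix from the example
# (columns: Matrix, Alien, Serenity, Casablanca, Amelie).
A = np.array([
    [1, 1, 1, 0, 0],
    [3, 3, 3, 0, 0],
    [4, 4, 4, 0, 0],
    [5, 5, 5, 0, 0],
    [0, 2, 0, 4, 4],
    [0, 0, 0, 5, 5],
    [0, 1, 0, 2, 2],
], dtype=float)

# Thin SVD: U is 7x5, s holds 5 singular values (descending), Vt is 5x5.
U, s, Vt = np.linalg.svd(A, full_matrices=False)

# Keep only the r = rank(A) non-zero singular values, as in the definition.
r = int(np.sum(s > 1e-10))
U, s, Vt = U[:, :r], s[:r], Vt[:r, :]

reconstruction = U @ np.diag(s) @ Vt
```

The kept columns of `U` and rows of `Vt` are orthonormal, and `reconstruction` equals `A` up to floating-point error.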

Dimensionality Reduction with SVD


SVD – Dimensionality Reduction

• Goal: Minimize the sum of reconstruction errors between the “old” and the “new” coordinates of the points
• SVD gives ‘best’ axis to project on:
  ◦ ‘best’ = minimizing the reconstruction errors
• In other words, minimum reconstruction error

[Figure: scatter plot of Movie 1 rating against Movie 2 rating, with v1, the first right singular vector, drawn as the projection axis.]

SVD - Interpretation #2

More details
• Q: How exactly is dim. reduction done?
• A: Set smallest singular values to zero

In the running example the smallest singular value, 1.3, is set to zero. This removes the third column of U and the third row of V^T, so the 7 × 5 users-to-movies matrix A is approximated by A ≈ U S V^T with

U:

 0.13   0.02
 0.41   0.07
 0.55   0.09
 0.68   0.11
 0.15  -0.59
 0.07  -0.73
 0.07  -0.29

S:

12.4    0
 0      9.5

V^T:

 0.56   0.59   0.56   0.09   0.09
 0.12  -0.02   0.12  -0.69  -0.69
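Setting the smallest singular values to zero can be sketched as a rank-k truncation. The helper name `truncate_svd` is hypothetical, and `numpy.linalg.norm` without further arguments computes the Frobenius norm of a matrix.

```python
import numpy as np

# Users-to-movies matrix from the running example.
A = np.array([
    [1, 1, 1, 0, 0],
    [3, 3, 3, 0, 0],
    [4, 4, 4, 0, 0],
    [5, 5, 5, 0, 0],
    [0, 2, 0, 4, 4],
    [0, 0, 0, 5, 5],
    [0, 1, 0, 2, 2],
], dtype=float)

def truncate_svd(A, k):
    """Rank-k approximation: keep the k largest singular values, drop the rest."""
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    approx = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]
    return approx, s

A2, s = truncate_svd(A, k=2)

# Frobenius-norm reconstruction error of the rank-2 approximation.
error = np.linalg.norm(A - A2)
```

Since A has rank 3, dropping one singular value leaves a Frobenius error equal to that dropped value (about 1.3), which is small compared with the two retained concept strengths.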
