About Patrick
• B.S. in “Discovery Informatics”, first graduate in 2007
• Ph.D. in Computer Science from UMD (College Park)
• National Cancer Institute (NIH)
• Miner & Kasch
• Worked on data science projects across a broad variety of fields
What we’ll talk about
• What is collaborative filtering?
• What does it do for us?
• How do we do it?
• What is good / not so good about it?
• How can we modify it to address shortcomings?
What’s Collaborative Filtering?
• A method of making automatic, personalized recommendations
• “Collaborative filtering” because recommendations are derived from collaborative input from other users
• Has had success in a variety of web-based markets, including Amazon, iTunes, Netflix, and LastFM
• Two main types:
  • Item-Based Collaborative Filtering
  • User-Based Collaborative Filtering
Item-Based Collaborative Filtering
• Provides recommendations for a particular item
• Based on the item’s similarity to other items
• Similarity is defined by which users did or did not prefer the items
“People who liked this item also liked this other item”
Item-Based Collaborative Filtering
• Item-based filtering in essence chains recommendations off of a particular item
• It can recommend items to users we know nothing about, except for the item they are currently looking at
User-Based Collaborative Filtering
• Provides recommendations for a particular user
• Based on the user’s similarity to other users
• Similarity is defined by which items the users did or did not prefer
“People who like a lot of the same stuff you like also like this other stuff”
User-Based Collaborative Filtering
• Takes the overall preferences of a user into account
• Can be more personalized and diverse
What do we need for Collaborative Filtering?
• Records of user ratings on items
• A rating might just be purchased (1) or not (0), or some other indication

Ratings (required):

| column  | type |
|---------|------|
| user ID | int  |
| item ID | int  |
| rating  | int  |

Useful other things (optional):
• Time of rating
• Tags of info on items
• Categories of items
• Newness of item
• …
User-Based Collaborative Filtering
1. Create user-to-item vectors
2. Calculate user similarity (cosine similarity)
3. Compute recommendation scores with weighted averages
1. Create user-to-item vectors

import pandas

# read ratings into a DataFrame from a DB table
# (engine is assumed to be a SQLAlchemy engine/connection)
query = "select * from ratings"
ratings_df = pandas.read_sql(query, con=engine)

# pivot to get item bit vectors for each user
user_item_vectors = ratings_df.pivot_table(
    index='user_id', columns='item_id', values='rating', fill_value=0)
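As a runnable sketch of the pivot step (the ratings here are made-up toy data standing in for the database table; in the real pipeline they would come from `read_sql`):

```python
import pandas

# toy ratings in long form (hypothetical data, not from a database)
ratings_df = pandas.DataFrame({
    'user_id': ['A', 'A', 'B', 'B', 'C'],
    'item_id': [1, 4, 1, 8, 7],
    'rating':  [1, 1, 1, 1, 1],
})

# one row per user, one column per item, 0 where there is no rating
user_item_vectors = ratings_df.pivot_table(
    index='user_id', columns='item_id', values='rating', fill_value=0)
print(user_item_vectors)
```

Each row of the result is a user's bit vector over the items, which is exactly the shape the similarity computation below needs.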
User-Item Vectors

|        | Item 1 | Item 2 | Item 3 | Item 4 | Item 5 | Item 6 | Item 7 | Item 8 | Item 9 |
|--------|--------|--------|--------|--------|--------|--------|--------|--------|--------|
| User A | 1 | 0 | 0 | 1 | 0 | 1 | 1 | 1 | 1 |
| User B | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 1 |
| User C | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |
| User D | 0 | 0 | 1 | 1 | 1 | 0 | 0 | 0 | 0 |
2. Calculate user similarities
• Cosine similarity measures the cosine of the angle between two vectors
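As a quick sketch, here is the cosine similarity of the bit vectors for Users A and B from the example (computed directly, before vectorizing it over all users):

```python
import numpy

user_a = numpy.array([1, 0, 0, 1, 0, 1, 1, 1, 1])
user_b = numpy.array([1, 0, 0, 1, 0, 0, 0, 1, 1])

# cosine similarity = dot product divided by the product of the magnitudes
cos_ab = numpy.dot(user_a, user_b) / (
    numpy.linalg.norm(user_a) * numpy.linalg.norm(user_b))
print(round(cos_ab, 6))  # 0.816497
```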
2. Calculate user similarities
Computing the pairwise cosine similarity of the user-item vectors fills in a 4 × 4 User Similarity Matrix (Users A–D against Users A–D).
2. Calculate user similarities

import numpy

# base similarity matrix (all dot products)
sim = numpy.dot(user_item_vectors, user_item_vectors.T)

# squared magnitude (total rating) of each vector
square_mag = numpy.diag(sim)

# inverse squared magnitude (guard against all-zero vectors)
inv_square_mag = 1 / square_mag
inv_square_mag[numpy.isinf(inv_square_mag)] = 0

# inverse of the magnitude
inv_mag = numpy.sqrt(inv_square_mag)

# cosine similarity (elementwise multiply by inverse magnitudes)
cosine = sim * inv_mag
cosine = cosine.T * inv_mag
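Run against the example user-item vectors, this code reproduces the values in the User Similarity Matrix (a self-contained sketch; in practice `user_item_vectors` comes from the earlier pivot):

```python
import numpy

# rows: Users A-D; columns: Items 1-9
user_item_vectors = numpy.array([
    [1, 0, 0, 1, 0, 1, 1, 1, 1],  # User A
    [1, 0, 0, 1, 0, 0, 0, 1, 1],  # User B
    [0, 0, 0, 0, 0, 0, 1, 0, 0],  # User C
    [0, 0, 1, 1, 1, 0, 0, 0, 0],  # User D
])

sim = numpy.dot(user_item_vectors, user_item_vectors.T)
square_mag = numpy.diag(sim)
inv_square_mag = 1 / square_mag
inv_square_mag[numpy.isinf(inv_square_mag)] = 0
inv_mag = numpy.sqrt(inv_square_mag)
cosine = sim * inv_mag
cosine = cosine.T * inv_mag

print(numpy.round(cosine, 6))
```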
2. Calculate user similarities

User Similarity Matrix

|        | User A   | User B   | User C   | User D   |
|--------|----------|----------|----------|----------|
| User A | 1        | 0.816497 | 0.408248 | 0.235702 |
| User B | 0.816497 | 1        | 0        | 0.288675 |
| User C | 0.408248 | 0        | 1        | 0        |
| User D | 0.235702 | 0.288675 | 0        | 1        |
3. Compute recommendation scores
• Take the top K most similar users
• Score all items with a similarity-weighted average of their ratings

Example: User B, K = 2. For each user for whom we want recommendations:
Weight each neighbor’s ratings by their similarity to User B (only the top K = 2 neighbors, User A at 0.816497 and User D at 0.288675, will contribute to the score):

Similarity-Weighted User-Item Vectors

|        | Item 1   | Item 2 | Item 3   | Item 4   | Item 5   | Item 6   | Item 7   | Item 8   | Item 9   |
|--------|----------|--------|----------|----------|----------|----------|----------|----------|----------|
| User A | 0.816497 | 0 | 0 | 0.816497 | 0 | 0.816497 | 0.816497 | 0.816497 | 0.816497 |
| User B | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 1 |
| User C | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |
| User D | 0 | 0 | 0.288675 | 0.288675 | 0.288675 | 0 | 0 | 0 | 0 |
3. Compute recommendation scores
Averaging the weighted ratings of the top K = 2 neighbors (Users A and D), i.e. summing their rows and dividing by K, gives the item scores for User B:

| Item 1   | Item 2 | Item 3   | Item 4   | Item 5   | Item 6   | Item 7   | Item 8   | Item 9   |
|----------|--------|----------|----------|----------|----------|----------|----------|----------|
| 0.408248 | 0 | 0.144338 | 0.552586 | 0.144338 | 0.408248 | 0.408248 | 0.408248 | 0.408248 |
Last step
• Remove items User B already has, leaving us with the final recommendations:

Collaborative Filtering Item Scores for User B with K = 2

| Item 1   | Item 2 | Item 3   | Item 4   | Item 5   | Item 6   | Item 7   | Item 8   | Item 9   |
|----------|--------|----------|----------|----------|----------|----------|----------|----------|
| 0.408248 | 0 | 0.144338 | 0.552586 | 0.144338 | 0.408248 | 0.408248 | 0.408248 | 0.408248 |

Final recommendations (items User B has not rated):

| Item 2 | Item 3   | Item 5   | Item 6   | Item 7   |
|--------|----------|----------|----------|----------|
| 0 | 0.144338 | 0.144338 | 0.408248 | 0.408248 |
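The whole of step 3 for User B can be sketched end to end (a hedged illustration: the similarity row is taken from the example matrix, and "weighted average" here means the similarity-weighted sum over the top K neighbors divided by K):

```python
import numpy

# rows: Users A-D; columns: Items 1-9
vectors = numpy.array([
    [1, 0, 0, 1, 0, 1, 1, 1, 1],  # User A
    [1, 0, 0, 1, 0, 0, 0, 1, 1],  # User B
    [0, 0, 0, 0, 0, 0, 1, 0, 0],  # User C
    [0, 0, 1, 1, 1, 0, 0, 0, 0],  # User D
])
# row "User B" of the User Similarity Matrix
sims_to_b = numpy.array([0.816497, 1.0, 0.0, 0.288675])

K = 2
others = [0, 2, 3]  # every user except B
# top K most similar neighbors of B (Users A and D here)
top_k = sorted(others, key=lambda u: sims_to_b[u], reverse=True)[:K]

# similarity-weighted average of the neighbors' ratings
scores = sum(sims_to_b[u] * vectors[u] for u in top_k) / K

# drop items B already has; keys are 1-based item IDs
b_has = vectors[1] == 1
recommendations = {item + 1: round(s, 6)
                   for item, s in enumerate(scores) if not b_has[item]}
print(recommendations)
```

Sorting the remaining dictionary by value gives the final ranked recommendation list.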
Item-Based Collaborative Filtering
1. Create user-to-item vectors
2. Calculate item similarity (cosine similarity)
3. Compute recommendation scores with weighted averages

All we need to do is transpose the user-item vectors matrix! We end up with an item-to-item similarity matrix.
Change: sim = numpy.dot(user_item_vectors, user_item_vectors.T)
To:     sim = numpy.dot(user_item_vectors.T, user_item_vectors)
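On the example data, the item-based variant looks like this (a sketch; a `numpy.where` guard stands in for the `isinf` guard so that items nobody has rated, such as Item 2, get zero similarity):

```python
import numpy

# rows: Users A-D; columns: Items 1-9
user_item_vectors = numpy.array([
    [1, 0, 0, 1, 0, 1, 1, 1, 1],
    [1, 0, 0, 1, 0, 0, 0, 1, 1],
    [0, 0, 0, 0, 0, 0, 1, 0, 0],
    [0, 0, 1, 1, 1, 0, 0, 0, 0],
])

# item-to-item dot products: the transpose goes on the other side
sim = numpy.dot(user_item_vectors.T, user_item_vectors)
square_mag = numpy.diag(sim)
inv_square_mag = numpy.where(square_mag > 0,
                             1.0 / numpy.maximum(square_mag, 1), 0.0)
inv_mag = numpy.sqrt(inv_square_mag)
item_cosine = sim * inv_mag
item_cosine = item_cosine.T * inv_mag

print(item_cosine.shape)            # (9, 9)
print(round(item_cosine[0, 3], 6))  # similarity of Item 1 and Item 4
```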
Notes
• All the math works out just the same when using real-valued ratings instead of just 0’s and 1’s
• Similarity matrices and recommendations can all be computed in batch and stored/cached for quick recall
• There is a big memory vs. speed tradeoff in the method chosen to compute similarity matrices from the user-item vectors:
  • Vectorized operations (very fast, but lots of memory on big data sets)
  • Looping (doesn’t need much memory, but quite slow without parallelization)
Strengths of Collaborative Filtering
• Content-agnostic
  • Does not require items or users to be tagged with content information
• Recommendations are very personalized because they’re based on like-minded people
• Adaptive to the user base
  • Can pick up on fitting recommendations that would be difficult or impossible to identify with a content-based method
  • Inherently adaptive to preference changes over time
Weaknesses of Collaborative Filtering
• Cold start problem
  • Needs observed user ratings before it can provide recommendations
• Computationally expensive over huge amounts of data
• Lack of “heterophilous diffusion”
  • Not good at meeting users’ potential desire to be recommended items from users who are not like them
• Can have a strong bias towards popular, older items
Modifying Collaborative Filtering
• Can address some of the possible drawbacks of collaborative filtering through score modifications
• Weight base scores by content-based info:
  • Age of item
  • Category of item
  • Popularity of item
  • Etc.

Collaborative Filtering → Base pool of recommendations with scores → Score Modifications → Final Recommendation Set
Modifying Collaborative Filtering
• Vary responsiveness to the user’s recent vs. old tastes
• Multiply ratings by an age decay function reflecting how age should affect rating importance
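A minimal sketch of such a decay, assuming a hypothetical exponential half-life weighting (the 30-day half-life is purely illustrative, and the function name is made up):

```python
def age_decayed_rating(rating, age_days, half_life_days=30.0):
    """Down-weight a rating exponentially in its age: every
    half_life_days, the rating counts half as much."""
    return rating * 0.5 ** (age_days / half_life_days)

print(age_decayed_rating(1.0, 0))   # 1.0
print(age_decayed_rating(1.0, 30))  # 0.5
print(age_decayed_rating(1.0, 60))  # 0.25
```

Applying this to the ratings matrix before computing similarities or scores makes the recommender favor a user's recent tastes; a longer half-life makes it more conservative.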
Hybrid Recommender System Approach
• Combine collaborative filtering with content- and model-based methods
• Create recommendations with both methods and combine them
• Content-based methods are complementary to collaborative filtering
• Use clustering, classification, and content-based grouping to limit the population for which to compute similarities
Practical Advantages of Item-Based
• Many businesses have more customers than items
  • The item-item similarity matrix is much smaller than the user-user one
• Item similarity scores tend to converge over time
Summary
• Collaborative filtering can make personalized recommendations based on ‘social’ recommendations
• All we need to do collaborative filtering is ratings of users on items
• Implementation is relatively straightforward (with the caveat of a significant speed/memory tradeoff)
• A great variety of creative modifications and ways to use it are possible to “make it work” for a specific use case