About Patrick
• B.S. in “Discovery Informatics”, first graduate in 2007
• Ph.D. in Computer Science from UMD (College Park)
• National Cancer Institute (NIH)
• Miner & Kasch
• Worked on data science projects across a broad variety of fields
What we’ll talk about
• What is collaborative filtering?
• What does it do for us?
• How do we do it?
• What is good / not so good about it?
• How can we modify it to address shortcomings?
What’s Collaborative Filtering?
• A method of making automatic, personalized recommendations
• “Collaborative filtering” because recommendations are derived from collaborative input from other users
• Has had success in a variety of web-based markets, including Amazon, iTunes, Netflix, and LastFM
• Two main types:
  • Item-Based Collaborative Filtering
  • User-Based Collaborative Filtering
Item-Based Collaborative Filtering
• Provides recommendations for a particular item
• Based on the item’s similarity to other items
• Similarity is defined by which users did or did not prefer the items
“People who liked this item also liked this other item”
Item-Based Collaborative Filtering
• Item-based filtering in essence chains recommendations off of a particular item
• It can recommend items to users we know nothing about, except for the item they are currently looking at
User-Based Collaborative Filtering
• Provides recommendations for a particular user
• Based on the user’s similarity to other users
• Similarity is defined by which items the users did or did not prefer
“People who like a lot of the same stuff you like also like this other stuff”
User-Based Collaborative Filtering
• Takes the overall preferences of a user into account
• Can be more personalized and diverse
What do we need for Collaborative Filtering?
• Records of user ratings on items
• A rating might just be purchased (1) or not (0), or some other indication

Ratings (required):

| column  | type |
|---------|------|
| user ID | int  |
| item ID | int  |
| rating  | int  |

Useful other things (optional):
• Time of rating
• Tags of info on items
• Categories of items
• Newness of item
• …
User-Based Collaborative Filtering
1. Create user-to-item vectors
2. Calculate user similarity (cosine similarity)
3. Compute recommendation scores with weighted averages
1. Create user-to-item vectors

import pandas

# read ratings into a DataFrame from a DB table
# (engine is assumed to be a SQLAlchemy engine/connection)
query = "select * from ratings"
ratings_df = pandas.read_sql(query, con=engine)

# pivot to get item bit vectors for each user
user_item_vectors = ratings_df.pivot_table(
    index='user_id', columns='item_id', values='rating', fill_value=0)
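As a runnable sketch of the pivot step (the ratings here are made-up toy data standing in for the database table; in the real pipeline they would come from `read_sql`):

```python
import pandas

# toy ratings in long form (hypothetical data, not from a database)
ratings_df = pandas.DataFrame({
    'user_id': ['A', 'A', 'B', 'B', 'C'],
    'item_id': [1, 4, 1, 8, 7],
    'rating':  [1, 1, 1, 1, 1],
})

# one row per user, one column per item, 0 where there is no rating
user_item_vectors = ratings_df.pivot_table(
    index='user_id', columns='item_id', values='rating', fill_value=0)
print(user_item_vectors)
```

Each row of the result is a user's bit vector over the items, which is exactly the shape the similarity computation below needs.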
User-Item Vectors

|        | Item 1 | Item 2 | Item 3 | Item 4 | Item 5 | Item 6 | Item 7 | Item 8 | Item 9 |
|--------|--------|--------|--------|--------|--------|--------|--------|--------|--------|
| User A | 1 | 0 | 0 | 1 | 0 | 1 | 1 | 1 | 1 |
| User B | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 1 |
| User C | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |
| User D | 0 | 0 | 1 | 1 | 1 | 0 | 0 | 0 | 0 |
2. Calculate user similarities
• Cosine similarity measures the cosine of the angle between two vectors
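As a quick sketch, here is the cosine similarity of the bit vectors for Users A and B from the example (computed directly, before vectorizing it over all users):

```python
import numpy

user_a = numpy.array([1, 0, 0, 1, 0, 1, 1, 1, 1])
user_b = numpy.array([1, 0, 0, 1, 0, 0, 0, 1, 1])

# cosine similarity = dot product divided by the product of the magnitudes
cos_ab = numpy.dot(user_a, user_b) / (
    numpy.linalg.norm(user_a) * numpy.linalg.norm(user_b))
print(round(cos_ab, 6))  # 0.816497
```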
2. Calculate user similarities
Computing the pairwise cosine similarity of the user-item vectors fills in a 4 × 4 User Similarity Matrix (Users A–D against Users A–D).
2. Calculate user similarities

import numpy

# base similarity matrix (all dot products)
sim = numpy.dot(user_item_vectors, user_item_vectors.T)

# squared magnitude (total rating) of each vector
square_mag = numpy.diag(sim)

# inverse squared magnitude (guard against all-zero vectors)
inv_square_mag = 1 / square_mag
inv_square_mag[numpy.isinf(inv_square_mag)] = 0

# inverse of the magnitude
inv_mag = numpy.sqrt(inv_square_mag)

# cosine similarity (elementwise multiply by inverse magnitudes)
cosine = sim * inv_mag
cosine = cosine.T * inv_mag
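Run against the example user-item vectors, this code reproduces the values in the User Similarity Matrix (a self-contained sketch; in practice `user_item_vectors` comes from the earlier pivot):

```python
import numpy

# rows: Users A-D; columns: Items 1-9
user_item_vectors = numpy.array([
    [1, 0, 0, 1, 0, 1, 1, 1, 1],  # User A
    [1, 0, 0, 1, 0, 0, 0, 1, 1],  # User B
    [0, 0, 0, 0, 0, 0, 1, 0, 0],  # User C
    [0, 0, 1, 1, 1, 0, 0, 0, 0],  # User D
])

sim = numpy.dot(user_item_vectors, user_item_vectors.T)
square_mag = numpy.diag(sim)
inv_square_mag = 1 / square_mag
inv_square_mag[numpy.isinf(inv_square_mag)] = 0
inv_mag = numpy.sqrt(inv_square_mag)
cosine = sim * inv_mag
cosine = cosine.T * inv_mag

print(numpy.round(cosine, 6))
```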
2. Calculate user similarities

User Similarity Matrix

|        | User A   | User B   | User C   | User D   |
|--------|----------|----------|----------|----------|
| User A | 1        | 0.816497 | 0.408248 | 0.235702 |
| User B | 0.816497 | 1        | 0        | 0.288675 |
| User C | 0.408248 | 0        | 1        | 0        |
| User D | 0.235702 | 0.288675 | 0        | 1        |
3. Compute recommendation scores
• Take the top K most similar users
• Score all items with a similarity-weighted average of their ratings

Example: User B, K = 2. For each user for whom we want recommendations:
Weight each neighbor’s ratings by their similarity to User B (only the top K = 2 neighbors, User A at 0.816497 and User D at 0.288675, will contribute to the score):

Similarity-Weighted User-Item Vectors

|        | Item 1   | Item 2 | Item 3   | Item 4   | Item 5   | Item 6   | Item 7   | Item 8   | Item 9   |
|--------|----------|--------|----------|----------|----------|----------|----------|----------|----------|
| User A | 0.816497 | 0 | 0 | 0.816497 | 0 | 0.816497 | 0.816497 | 0.816497 | 0.816497 |
| User B | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 1 |
| User C | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |
| User D | 0 | 0 | 0.288675 | 0.288675 | 0.288675 | 0 | 0 | 0 | 0 |
3. Compute recommendation scores
Averaging the weighted ratings of the top K = 2 neighbors (Users A and D), i.e. summing their rows and dividing by K, gives the item scores for User B:

| Item 1   | Item 2 | Item 3   | Item 4   | Item 5   | Item 6   | Item 7   | Item 8   | Item 9   |
|----------|--------|----------|----------|----------|----------|----------|----------|----------|
| 0.408248 | 0 | 0.144338 | 0.552586 | 0.144338 | 0.408248 | 0.408248 | 0.408248 | 0.408248 |
Last step
• Remove items User B already has, leaving us with the final recommendations:

Collaborative Filtering Item Scores for User B with K = 2

| Item 1   | Item 2 | Item 3   | Item 4   | Item 5   | Item 6   | Item 7   | Item 8   | Item 9   |
|----------|--------|----------|----------|----------|----------|----------|----------|----------|
| 0.408248 | 0 | 0.144338 | 0.552586 | 0.144338 | 0.408248 | 0.408248 | 0.408248 | 0.408248 |

Final recommendations (items User B has not rated):

| Item 2 | Item 3   | Item 5   | Item 6   | Item 7   |
|--------|----------|----------|----------|----------|
| 0 | 0.144338 | 0.144338 | 0.408248 | 0.408248 |
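The whole of step 3 for User B can be sketched end to end (a hedged illustration: the similarity row is taken from the example matrix, and "weighted average" here means the similarity-weighted sum over the top K neighbors divided by K):

```python
import numpy

# rows: Users A-D; columns: Items 1-9
vectors = numpy.array([
    [1, 0, 0, 1, 0, 1, 1, 1, 1],  # User A
    [1, 0, 0, 1, 0, 0, 0, 1, 1],  # User B
    [0, 0, 0, 0, 0, 0, 1, 0, 0],  # User C
    [0, 0, 1, 1, 1, 0, 0, 0, 0],  # User D
])
# row "User B" of the User Similarity Matrix
sims_to_b = numpy.array([0.816497, 1.0, 0.0, 0.288675])

K = 2
others = [0, 2, 3]  # every user except B
# top K most similar neighbors of B (Users A and D here)
top_k = sorted(others, key=lambda u: sims_to_b[u], reverse=True)[:K]

# similarity-weighted average of the neighbors' ratings
scores = sum(sims_to_b[u] * vectors[u] for u in top_k) / K

# drop items B already has; keys are 1-based item IDs
b_has = vectors[1] == 1
recommendations = {item + 1: round(s, 6)
                   for item, s in enumerate(scores) if not b_has[item]}
print(recommendations)
```

Sorting the remaining dictionary by value gives the final ranked recommendation list.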
Item-Based Collaborative Filtering
1. Create user-to-item vectors
2. Calculate item similarity (cosine similarity)
3. Compute recommendation scores with weighted averages

All we need to do is transpose the user-item vectors matrix! We end up with an item-to-item similarity matrix.
Change: sim = numpy.dot(user_item_vectors, user_item_vectors.T)
To:     sim = numpy.dot(user_item_vectors.T, user_item_vectors)
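On the example data, the item-based variant looks like this (a sketch; a `numpy.where` guard stands in for the `isinf` guard so that items nobody has rated, such as Item 2, get zero similarity):

```python
import numpy

# rows: Users A-D; columns: Items 1-9
user_item_vectors = numpy.array([
    [1, 0, 0, 1, 0, 1, 1, 1, 1],
    [1, 0, 0, 1, 0, 0, 0, 1, 1],
    [0, 0, 0, 0, 0, 0, 1, 0, 0],
    [0, 0, 1, 1, 1, 0, 0, 0, 0],
])

# item-to-item dot products: the transpose goes on the other side
sim = numpy.dot(user_item_vectors.T, user_item_vectors)
square_mag = numpy.diag(sim)
inv_square_mag = numpy.where(square_mag > 0,
                             1.0 / numpy.maximum(square_mag, 1), 0.0)
inv_mag = numpy.sqrt(inv_square_mag)
item_cosine = sim * inv_mag
item_cosine = item_cosine.T * inv_mag

print(item_cosine.shape)            # (9, 9)
print(round(item_cosine[0, 3], 6))  # similarity of Item 1 and Item 4
```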
Notes
• All the math works out just the same when using real-valued ratings instead of just 0’s and 1’s
• Similarity matrices and recommendations can all be computed in batch and stored/cached for quick recall
• There is a big memory vs. speed tradeoff in the method chosen to compute similarity matrices from the user-item vectors:
  • Vectorized operations (very fast, but lots of memory on big data sets)
  • Looping (doesn’t need much memory, but quite slow without parallelization)
Strengths of Collaborative Filtering
• Content-agnostic
  • Does not require items or users to be tagged with content information
• Recommendations are very personalized because they’re based on like-minded people
• Adaptive to the user base
  • Can pick up on fitting recommendations that would be difficult or impossible to identify with a content-based method
  • Inherently adaptive to preference changes over time
Weaknesses of Collaborative Filtering
• Cold start problem
  • Needs observed user ratings before it can provide recommendations
• Computationally expensive over huge amounts of data
• Lack of “heterophilous diffusion”
  • Not good at meeting users’ potential desire to be recommended items from users who are not like them
• Can have a strong bias towards popular, older items
Modifying Collaborative Filtering
• Can address some of the possible drawbacks of collaborative filtering through score modifications
• Weight base scores by content-based info:
  • Age of item
  • Category of item
  • Popularity of item
  • Etc.

Collaborative Filtering → Base pool of recommendations with scores → Score Modifications → Final Recommendation Set
Modifying Collaborative Filtering
• Vary responsiveness to the user’s recent vs. old tastes
• Multiply ratings by an age decay function reflecting how age should affect rating importance
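A minimal sketch of such a decay, assuming a hypothetical exponential half-life weighting (the 30-day half-life is purely illustrative, and the function name is made up):

```python
def age_decayed_rating(rating, age_days, half_life_days=30.0):
    """Down-weight a rating exponentially in its age: every
    half_life_days, the rating counts half as much."""
    return rating * 0.5 ** (age_days / half_life_days)

print(age_decayed_rating(1.0, 0))   # 1.0
print(age_decayed_rating(1.0, 30))  # 0.5
print(age_decayed_rating(1.0, 60))  # 0.25
```

Applying this to the ratings matrix before computing similarities or scores makes the recommender favor a user's recent tastes; a longer half-life makes it more conservative.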
Hybrid Recommender System Approach
• Combine collaborative filtering with content- and model-based methods
• Create recommendations with both methods and combine them
• Content-based methods are complementary to collaborative filtering
• Use clustering, classification, and content-based grouping to limit the population for which to compute similarities
Practical Advantages of Item-Based
• Many businesses have more customers than items
  • The item-item similarity matrix is much smaller than the user-user one
• Item similarity scores tend to converge over time
Summary
• Collaborative filtering can make personalized recommendations based on ‘social’ recommendations
• All we need to do collaborative filtering is ratings of users on items
• Implementation is relatively straightforward (with the caveat of a significant speed/memory tradeoff)
• A great variety of creative modifications and ways to use it are possible to “make it work” for a specific use case