By Coen Stevens, Lead Recommendations Engineer at Wakoopa. Presented at http://recked.org
Coen Stevens, Lead Recommendation Engineer
How to build a recommender system?
Wakoopa use case
Mission: Discover software & games
Mac, Windows, Linux
Software tracker
Your profile
Updates
Software pages
Recommendations
Building a recommender system: Approach and challenges
Data: what do we have?
Usage (implicit) vs. Ratings (explicit)
Usage (implicit):
• Noisy
• Only positive feedback
• Easy to collect
Ratings (explicit):
• Accurate
• Positive and negative feedback
• Hard to collect
Data: what do we use?
• Active users (Tracker activity in the past month): ~9,000
• Actively used software items (in the past month): ~10,000
• We calculate recommendations separately per OS, each OS taken together with Web applications
Recommender system methods
• Item-based collaborative filtering
• User-based collaborative filtering (we use this only to calculate user similarities, to find people like you)
• Combining both methods
Collaborative recommendations: The user will be recommended items that people with similar tastes and preferences liked (used) in the past
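Item-Based Collaborative Filtering: User software usage matrix
(rows: users, columns: software items; "-" marks no recorded usage)
220 90 - 180 - 22 -
280 12 42 - 80 - -
175 210 - 210 - 45 -
165 - 35 195 13 25 -
- 100 50 185 - 35 190
- 60 - 65 - - 185
Users
Software items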
User software usage matrix [0, 1]
1 1 0 1 0 1 0
1 1 1 0 1 0 0
1 1 0 1 0 1 0
1 0 1 1 1 1 0
0 1 1 1 0 1 1
0 1 0 1 0 0 1
Users
Software items
How do we predict the probability that I would like to use Gmail?
1 1 0 1 0 1 0
1 1 1 0 1 0 0
1 1 ? 1 0 1 0
1 0 1 1 1 1 0
0 1 1 1 0 1 1
0 1 0 1 0 0 1
Users
Software items
Calculate the similarities between Gmail and the other software items.
1 1 0 1 0 1 0
1 1 1 0 1 0 0
1 1 0 1 0 1 0
1 0 1 1 1 1 0
0 1 1 1 0 1 1
0 1 0 1 0 0 1
Users
Software items
Cosine Similarity(Firefox, Gmail)
Popularity correction: we put less trust in popular software
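As a minimal sketch of the similarity step, the code below computes the cosine similarity between two item columns of the binary usage matrix. The column positions of Firefox and Gmail are assumptions (the slides do not label the columns), and the popularity correction is not included because the slides do not give its formula.

```python
import math

# Binary usage matrix from the slides (rows: users, columns: software items).
USAGE = [
    [1, 1, 0, 1, 0, 1, 0],
    [1, 1, 1, 0, 1, 0, 0],
    [1, 1, 0, 1, 0, 1, 0],
    [1, 0, 1, 1, 1, 1, 0],
    [0, 1, 1, 1, 0, 1, 1],
    [0, 1, 0, 1, 0, 0, 1],
]


def column(matrix, j):
    """Return column j of a row-major matrix as a list."""
    return [row[j] for row in matrix]


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Assumption: column 0 is Firefox and column 2 is Gmail.
FIREFOX, GMAIL = 0, 2
sim = cosine_similarity(column(USAGE, FIREFOX), column(USAGE, GMAIL))
```

Because the similarity is computed between columns, two items are similar when largely the same users use both of them.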
Item-item correlation matrix
1 0.1 0.6 0.1 0.1 0.1 0.7
0.2 1 0.8 0.5 0.8 0.1 0.9
0.1 0.6 1 0.5 0.7 0.2 0.3
0.2 0.6 0.4 1 0.8 0.2 0.3
0.5 0.4 0.4 0.4 1 0.1 0.2
0.5 0.5 0.3 0.5 0.3 1 0.3
0.2 0.6 0.3 0.8 0.7 0.7 1
Gmail similarities: 0.6, 0.8, 0.4, 0.4, 0.3, 0.3
K-nearest neighbor approach
• Performance vs quality
• We take only the ‘K’ most similar items (say 4)
• Space complexity: O(m + Kn)
• Computational complexity: O(m + n²)
Gmail similarities: 0.6, 0.8, 0.4, 0.4, 0.3, 0.3
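Selecting the K most similar items can be sketched as follows. The similarity values are the ones listed for Gmail; the item names are hypothetical placeholders, since the slides do not say which item each value belongs to.

```python
import heapq

# Gmail's similarity to the six other items, as listed on the slide.
# The keys are hypothetical placeholder names.
gmail_similarities = {
    "item_a": 0.6,
    "item_b": 0.8,
    "item_c": 0.4,
    "item_d": 0.4,
    "item_e": 0.3,
    "item_f": 0.3,
}


def k_nearest(similarities, k):
    """Return the k (item, similarity) pairs with the highest similarity."""
    return heapq.nlargest(k, similarities.items(), key=lambda pair: pair[1])


neighbors = k_nearest(gmail_similarities, k=4)
```

Keeping only K neighbors per item is what bounds the stored similarities to K entries per item instead of a full item-by-item matrix.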
Calculate the predicted value for Gmail
User usage: 1, 1, 1, 1
Gmail similarities: 0.6, 0.8, 0.4, 0.4
Usage correction: more usage results in a higher score [0, 1]
User usage: 0.9, 0.8, 0.6, 0.2
Gmail similarities: 0.6, 0.8, 0.4, 0.4
Predicted value for Gmail = ((0.6 * 0.9) + (0.8 * 0.8) + (0.4 * 0.6)) / (0.6 + 0.8 + 0.4 + 0.4) = 1.42 / 2.2 ≈ 0.65
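The prediction is a similarity-weighted average of the user's usage over Gmail's K neighbors. The sketch below assumes, based on the three-term numerator on the slide, that the user uses three of the four neighbors and has no usage for the fourth; the function name `predict` is illustrative.

```python
def predict(similarities, usage):
    """Similarity-weighted average of a user's usage over an item's K neighbors.

    similarities: similarity of the target item to each of its K neighbors.
    usage: the user's usage score in [0, 1] for the same neighbors (0 if unused).
    """
    total = sum(similarities)
    if total == 0:
        return 0.0
    return sum(s * u for s, u in zip(similarities, usage)) / total


gmail_neighbors = [0.6, 0.8, 0.4, 0.4]

# Binary usage: the user uses three of the four neighbors (assumption).
binary_score = predict(gmail_neighbors, [1, 1, 1, 0])

# Usage correction: the same three neighbors, weighted by how much they are used.
weighted_score = predict(gmail_neighbors, [0.9, 0.8, 0.6, 0.0])
```

With binary usage the score is 1.8 / 2.2, about 0.82; with the usage-corrected values it is 1.42 / 2.2, about 0.65.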
• User feedback
• Contacts usage
• Commercial vs Free
Calculate all unknown values and show the Top-N recommendations to each user
1 1 ? 1 ? 1 ?
1 1 1 ? 1 ? ?
1 1 ? 1 ? 1 ?
1 ? 1 1 1 1 ?
? 1 1 1 ? 1 1
? 1 ? 1 ? ? 1
Users
Software items
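Putting the steps together, a rough end-to-end sketch: compute item-item cosine similarities from the binary matrix, score every item a user does not use yet from its K nearest neighbors, and return the N highest scores. The function names and the defaults `k=4` and `top_n=2` are illustrative assumptions, and the popularity and usage corrections are left out.

```python
import math

USAGE = [
    [1, 1, 0, 1, 0, 1, 0],
    [1, 1, 1, 0, 1, 0, 0],
    [1, 1, 0, 1, 0, 1, 0],
    [1, 0, 1, 1, 1, 1, 0],
    [0, 1, 1, 1, 0, 1, 1],
    [0, 1, 0, 1, 0, 0, 1],
]


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def item_similarities(matrix):
    """Full item-by-item cosine similarity matrix, computed over the columns."""
    n_items = len(matrix[0])
    cols = [[row[j] for row in matrix] for j in range(n_items)]
    return [[cosine(cols[i], cols[j]) for j in range(n_items)] for i in range(n_items)]


def recommend(matrix, user, k=4, top_n=2):
    """Return up to top_n (item index, score) pairs for items the user does not use."""
    sims = item_similarities(matrix)
    row = matrix[user]
    scores = {}
    for item, used in enumerate(row):
        if used:
            continue  # only unknown values are predicted
        # The k other items most similar to this one.
        neighbors = sorted(
            (j for j in range(len(row)) if j != item),
            key=lambda j: sims[item][j],
            reverse=True,
        )[:k]
        total = sum(sims[item][j] for j in neighbors)
        if total > 0:
            scores[item] = sum(sims[item][j] * row[j] for j in neighbors) / total
    return sorted(scores.items(), key=lambda pair: pair[1], reverse=True)[:top_n]


recs = recommend(USAGE, user=0)
```

Recomputing the full similarity matrix on every call is wasteful; in practice it would be computed once and only the K neighbors per item kept.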
Explainability: Why did I get this recommendation?
• Overlap between the item’s (K) neighbors and your usage
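One simple way to produce such an explanation is to list the recommended item's neighbors that the user already uses. This is a sketch only; the software names below are hypothetical examples, not data from the slides.

```python
def explain(neighbors, used_items):
    """Return the neighbors of a recommended item that the user already uses.

    The result keeps the order of `neighbors` (most similar first).
    """
    used = set(used_items)
    return [item for item in neighbors if item in used]


# Hypothetical data, for illustration only.
gmail_neighbors = ["Thunderbird", "Firefox", "Google Calendar", "Adium"]
my_usage = {"Firefox", "Adium", "TextMate"}
reasons = explain(gmail_neighbors, my_usage)
```

The resulting list can be shown to the user as "recommended because you use ...".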
User-Based Collaborative Filtering: Finding people like you
1 1 0 1 0 1 0
1 1 1 0 1 0 0
1 1 0 1 0 1 0
1 1 1 1 1 1 0
0 1 1 1 0 1 1
0 1 0 1 0 0 1
Cosine Similarity(Coen, Menno)
Applying inverse user frequency
log(n / n_i), where n_i is the number of users that use item i and n is the total number of users in the database
The fact that you both use TextMate tells you more than the fact that you both use Firefox
0.1 0.2 0 0.4 0 0.4 0
0.1 0.2 0.6 0 0.8 0 0
0.1 0.2 0 0.4 0 0.4 0
0.1 0.2 0.6 0.4 0.8 0.4 0
0 0.2 0.6 0.4 0 0.4 0.2
0 0.2 0 0.4 0 0 0.2
Cosine Similarity(Coen, Menno)
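A minimal sketch of inverse user frequency weighting followed by a user-user cosine similarity, using the binary matrix from the user-based slide. Which rows belong to Coen and Menno is not stated, so rows 0 and 1 are an arbitrary choice, and the weights computed here are not meant to reproduce the illustrative numbers on the slide.

```python
import math

# Binary usage matrix from the user-based slide (rows: users, columns: items).
USAGE = [
    [1, 1, 0, 1, 0, 1, 0],
    [1, 1, 1, 0, 1, 0, 0],
    [1, 1, 0, 1, 0, 1, 0],
    [1, 1, 1, 1, 1, 1, 0],
    [0, 1, 1, 1, 0, 1, 1],
    [0, 1, 0, 1, 0, 0, 1],
]


def inverse_user_frequency(matrix):
    """Return log(n / n_i) per item; 0.0 for items nobody uses."""
    n = len(matrix)
    weights = []
    for j in range(len(matrix[0])):
        n_i = sum(row[j] for row in matrix)
        weights.append(math.log(n / n_i) if n_i else 0.0)
    return weights


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


weights = inverse_user_frequency(USAGE)
weighted = [[x * w for x, w in zip(row, weights)] for row in USAGE]

# Rows 0 and 1 stand in for two users; the choice is arbitrary.
sim = cosine(weighted[0], weighted[1])
```

An item that every user has gets weight log(1) = 0, so it no longer contributes to the similarity between users, while rarely used items count the most.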
User-user correlation matrix
1 0.8 0.6 0.5 0.7 0.2
0.8 1 0.4 0.7 0.5 0.5
0.6 0.4 1 0.4 0.9 0.1
0.5 0.8 0.4 1 0.6 0.4
0.8 0.5 0.9 0.6 1 0.2
0.2 0.5 0.1 0.4 0.2 1
Performance: measure for success
• Cross-validation: Train-Test split (80-20)
• Precision and Recall:
  - precision = size(hit set) / size(total given recs)
  - recall = size(hit set) / size(test set)
• Root mean squared error (RMSE)
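The three measures above can be sketched as follows, with the hit set taken as the recommended items that also appear in the held-out test set. The function names and the sample data are illustrative assumptions.

```python
import math


def precision_recall(recommended, test_set):
    """Return (precision, recall) for one user's recommendations.

    hit set = recommended items that also appear in the held-out test set.
    """
    hits = set(recommended) & set(test_set)
    precision = len(hits) / len(recommended) if recommended else 0.0
    recall = len(hits) / len(test_set) if test_set else 0.0
    return precision, recall


def rmse(predicted, actual):
    """Root mean squared error between two equal-length, non-empty sequences."""
    if len(predicted) != len(actual) or not predicted:
        raise ValueError("sequences must be non-empty and of equal length")
    squared = sum((p - a) ** 2 for p, a in zip(predicted, actual))
    return math.sqrt(squared / len(predicted))


# Hypothetical example: 4 recommendations, 3 held-out items, 2 hits.
precision, recall = precision_recall(["a", "b", "c", "d"], ["b", "d", "e"])
error = rmse([0.5, 1.0], [1.0, 1.0])
```

In an 80-20 split, the test set for a user would be the 20% of that user's items hidden from the recommender during training.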
Implementation
• Ruby Enterprise Edition (garbage collection)
• MySQL database
• Built our own C libraries
• Amazon EC2:
  - Low cost
  - Flexibility
  - Ease of use
• Open source
Future challenges
• What is the best algorithm for Wakoopa? (or you)
• Reducing space-time complexity (scalability):
  - Parallelization (Clojure)
  - Distributed computing (Hadoop)