The 7th Annual IndicThreads Pune Conference was held on 14-15 December 2012. http://pune12.indicthreads.com/
How to Build a Recommendation Engine Using Apache Mahout
Viraj Paripatyadar
GS Lab
Contents
• A recommendation problem
• What is a recommender
• Building a recommender using Mahout
• Tips and tweaks
• Recommender considerations
A book store
• Sells books:
  • By various authors
  • Of various categories
  • On different subjects
  • From various publishers
• Readers/buyers are asked to rate
• Readers/buyers can provide reviews
You walk into the store (to buy something for a friend)
The store owner
• Asks you what:
  • your friend reads (already owns)
  • your friend usually likes more
• Has data on what:
  • his customers buy
  • his customers rate and review
• Uses a few strategies
1 - Find similar books
Depending on which books your friend has, pick books:
  • by the same author
  • on the same or a similar subject
  • in the same category
  • from the same publisher
(those with the highest sales numbers)
2 - Find books with similar readership
• Define some similarity
  • e.g. two books are as similar as the number of readers who rated both of them
• Define some limit of relevance
  • e.g. only consider books that have more than 4 readers in common
• Look for all books which are similar to books your friend owns
Pick books from this set that your friend doesn’t own
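A minimal sketch of this strategy in plain Python (not Mahout code), run on the rating triples of the deck's small example data set. The function names and the `more_than` parameter are made up for illustration. The cut-off is lowered from 4 to 1 here only because the example has just five readers.

```python
from collections import defaultdict

# (user, book, rating) triples from the small example data set.
RATINGS = [
    (1, 101, 5.0), (1, 102, 3.0), (1, 103, 2.5),
    (2, 101, 2.0), (2, 102, 2.5), (2, 103, 5.0), (2, 104, 2.0),
    (3, 101, 2.5), (3, 104, 4.0), (3, 105, 4.5), (3, 107, 5.0),
    (4, 101, 5.0), (4, 103, 3.0), (4, 104, 4.5), (4, 106, 4.0),
    (5, 101, 4.0), (5, 102, 3.0), (5, 103, 2.0), (5, 104, 4.0),
    (5, 105, 3.5), (5, 106, 4.0),
]


def readers_by_book(ratings):
    """Map each book to the set of users who rated it."""
    readers = defaultdict(set)
    for user, book, _rating in ratings:
        readers[book].add(user)
    return dict(readers)


def books_with_similar_readership(owned, ratings, more_than):
    """Score each unowned book by the readers it shares with owned books.

    Similarity of two books = number of users who rated both.
    Only similarities strictly greater than `more_than` are counted.
    """
    readers = readers_by_book(ratings)
    scores = defaultdict(int)
    for book, book_readers in readers.items():
        if book in owned:
            continue
        for owned_book in owned:
            common = len(book_readers & readers.get(owned_book, set()))
            if common > more_than:
                scores[book] += common
    # Highest score first; ties broken by book ID.
    return sorted(scores.items(), key=lambda pair: (-pair[1], pair[0]))


recommended = books_with_similar_readership({101, 102, 103}, RATINGS, more_than=1)
print(recommended)
```

Summing the similarities is only one possible way to rank the candidates; the slide itself does not prescribe a ranking.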
3 - Find people with similar tastes
• Define some similarity
  • e.g. two people are as similar as the number of books they like from the same category
• Define some limit of relevance
  • e.g. only consider the top 3 people when ordered by how similar they are to your friend
• Look for users similar to your friend and see what they read
Pick books which these people like and your friend doesn’t own
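A rough sketch of this strategy in plain Python (not Mahout code). The example data carries no categories, so the sketch swaps in a simpler similarity: two people are as similar as the number of books they both like. The "like" cut-off of 3.0 stars, the function names and the ranking by summed similarity are all assumptions made for illustration.

```python
from collections import defaultdict

# (user, book, rating) triples from the small example data set.
RATINGS = [
    (1, 101, 5.0), (1, 102, 3.0), (1, 103, 2.5),
    (2, 101, 2.0), (2, 102, 2.5), (2, 103, 5.0), (2, 104, 2.0),
    (3, 101, 2.5), (3, 104, 4.0), (3, 105, 4.5), (3, 107, 5.0),
    (4, 101, 5.0), (4, 103, 3.0), (4, 104, 4.5), (4, 106, 4.0),
    (5, 101, 4.0), (5, 102, 3.0), (5, 103, 2.0), (5, 104, 4.0),
    (5, 105, 3.5), (5, 106, 4.0),
]


def liked_books(ratings, threshold):
    """Map each user to the books they rated at or above `threshold`."""
    liked = defaultdict(set)
    for user, book, rating in ratings:
        if rating >= threshold:
            liked[user].add(book)
    return dict(liked)


def recommend_from_similar_people(friend, ratings, top_n=3, threshold=3.0):
    liked = liked_books(ratings, threshold)
    owned = {book for user, book, _ in ratings if user == friend}
    friend_likes = liked.get(friend, set())

    # Similarity = number of books both people like (assumed stand-in
    # for "books they like from the same category").
    similarity = {
        user: len(friend_likes & books)
        for user, books in liked.items()
        if user != friend
    }
    # Limit of relevance: the top_n most similar people, ignoring zero overlap.
    neighbours = sorted(
        (user for user, score in similarity.items() if score > 0),
        key=lambda user: (-similarity[user], user),
    )[:top_n]

    # Books the neighbours like and the friend does not own.
    scores = defaultdict(int)
    for user in neighbours:
        for book in liked[user] - owned:
            scores[book] += similarity[user]
    return sorted(scores.items(), key=lambda pair: (-pair[1], pair[0]))


picks = recommend_from_similar_people(friend=1, ratings=RATINGS)
print(picks)
```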
Example data (each entry is user ID, book ID, rating; read down each column)
1,101,5.0 3,101,2.5 4,106,4.0
1,102,3.0 3,104,4.0 5,101,4.0
1,103,2.5 3,105,4.5 5,102,3.0
2,101,2.0 3,107,5.0 5,103,2.0
2,102,2.5 4,101,5.0 5,104,4.0
2,103,5.0 4,103,3.0 5,105,3.5
2,104,2.0 4,104,4.5 5,106,4.0
• Your friend owns three books:
  • Gave 5 stars to book 101 (likes it hugely and talks about it all the time)
  • Gave 3 stars to book 102 (has shown some liking for it)
  • Gave 2.5 stars to book 103 (has read it, but didn’t say bad things about it)
Now we need to recommend books your friend hasn’t seen
A pictorial representation
[Figure: grid of users 1 to 5 (rows) against books 101 to 107 (columns)]
Visualize
[Figure: grid of users 1 to 5 (rows) against books 101 to 107 (columns)]
A (slightly) bigger example
1,101,5.0 3,111,2.5 6,103,2.0
1,102,3.0 4,101,5.0 6,106,4.0
1,103,2.5 4,103,3.0 6,113,3.0
1,109,3.5 4,104,4.5 6,115,5.0
1,112,4.0 4,106,4.0 7,103,4.5
2,101,2.0 4,109,2.0 7,104,2.5
2,102,2.5 4,111,2.5 7,108,4.0
2,103,5.0 5,101,4.0 7,109,3.5
2,104,2.0 5,102,3.0 7,110,3.5
2,107,4.5 5,103,2.0 7,112,2.5
2,113,3.5 5,104,4.0 8,101,2.0
3,101,2.5 5,105,3.5 8,105,4.0
3,104,4.0 5,106,4.0 8,106,4.5
3,105,4.5 5,109,3.0 8,110,3.0
3,107,5.0 5,112,4.0 8,114,5.0
3,115,4.0 6,101,4.5 8,115,3.5
A pictorial representation
[Figure: grid of users 1 to 8 against books 101 to 115]
Clearly, not a viable option
Mahout to the rescue
What is Apache Mahout
• Apache Mahout
  • A machine learning library
  • Works with Apache Hadoop
• Use cases:
  • Recommenders
  • Clustering
  • Classification
Recommenders in Mahout
• Recommenders use data culled from user behavior
• Recommending using Mahout
  • Similarity between users or items
    • Expressed as a number between 0 and 1
  • Neighborhood of users/items
  • Recommendation using this info and an algorithm
    • Generic
    • Specialized
Similarity
• Various algorithms:
  • Euclidean distance
  • Pearson correlation
  • Cosine measure
  • Spearman correlation
  • Tanimoto coefficient
  • Log-likelihood
• Effectiveness depends on the input data
• The choice influences running time and memory
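To make a few of these measures concrete, here is a small plain-Python sketch that compares users 1 and 5 of the example data. It uses textbook formulas, not Mahout's implementations, which may differ in details such as weighting or normalization. The function names are made up for illustration; the `1 / (1 + distance)` conversion from distance to similarity is one common convention and is an assumption here. Spearman correlation and log-likelihood are left out.

```python
from math import sqrt

# Ratings of users 1 and 5 from the small example data set (book -> rating).
USER_1 = {101: 5.0, 102: 3.0, 103: 2.5}
USER_5 = {101: 4.0, 102: 3.0, 103: 2.0, 104: 4.0, 105: 3.5, 106: 4.0}


def common_ratings(a, b):
    """Return the two users' ratings for the books both have rated."""
    books = sorted(set(a) & set(b))
    return [a[book] for book in books], [b[book] for book in books]


def euclidean_similarity(a, b):
    """1 / (1 + Euclidean distance) over co-rated books."""
    xs, ys = common_ratings(a, b)
    distance = sqrt(sum((x - y) ** 2 for x, y in zip(xs, ys)))
    return 1.0 / (1.0 + distance)


def pearson_correlation(a, b):
    """Pearson correlation over co-rated books; None if undefined."""
    xs, ys = common_ratings(a, b)
    n = len(xs)
    if n < 2:
        return None
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    numerator = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    denominator = sqrt(
        sum((x - mean_x) ** 2 for x in xs) * sum((y - mean_y) ** 2 for y in ys)
    )
    return numerator / denominator if denominator else None


def cosine_measure(a, b):
    """Cosine of the angle between the two co-rated rating vectors."""
    xs, ys = common_ratings(a, b)
    norm = sqrt(sum(x * x for x in xs)) * sqrt(sum(y * y for y in ys))
    return sum(x * y for x, y in zip(xs, ys)) / norm if norm else None


def tanimoto_coefficient(a, b):
    """Shared books divided by all books either user rated (ratings ignored)."""
    union = set(a) | set(b)
    return len(set(a) & set(b)) / len(union) if union else 0.0


for name, measure in [
    ("euclidean", euclidean_similarity),
    ("pearson", pearson_correlation),
    ("cosine", cosine_measure),
    ("tanimoto", tanimoto_coefficient),
]:
    print(name, round(measure(USER_1, USER_5), 3))
```

The same pair of users scores quite differently under each measure, which is one reason effectiveness depends on the input data.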
Neighborhood
• Nearest N neighborhood (say, 4)
• Threshold neighborhood (say, > 0.8)
[Figure: two diagrams of user U surrounded by users 1 to 5, one for each neighborhood type]
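The two neighborhood types can be sketched in a few lines of plain Python (not Mahout code). The similarity values below are the rounded Pearson correlations between user 1 and the other users of the small example data set; user 3 is left out because only one book is shared with user 1, so the correlation is undefined. The function names are hypothetical, and N is set to 3 rather than 4 because only three candidates exist.

```python
# Rounded Pearson correlation of each user with user 1 (example data).
SIMILARITY_TO_USER_1 = {2: -0.76, 4: 1.00, 5: 0.94}


def nearest_n(similarities, n):
    """The n most similar users, however low their similarity is."""
    ranked = sorted(similarities, key=lambda user: (-similarities[user], user))
    return ranked[:n]


def above_threshold(similarities, threshold):
    """Every user whose similarity exceeds the threshold, however many."""
    return sorted(user for user, s in similarities.items() if s > threshold)


print(nearest_n(SIMILARITY_TO_USER_1, 3))
print(above_threshold(SIMILARITY_TO_USER_1, 0.8))
```

With N = 3 the nearest-N neighborhood also admits user 2, whose tastes run opposite to user 1's, while the 0.8 threshold keeps only users 4 and 5. A fixed N guarantees a neighborhood size; a threshold guarantees a minimum quality.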
Recommender
• Recommenders
  • Generic recommender
    • User based
    • Item based
  • Slope-one recommender
  • Singular Value Decomposition based
  • Linear interpolation based
  • Cluster-based
• Recommender rescorer
• Recommender evaluator
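As an illustration of what a user-based recommender does, the sketch below chains the three pieces together in plain Python: Pearson similarity, a nearest-2 neighborhood, and a similarity-weighted average of the neighbors' ratings. It is a simplified stand-in, not Mahout's implementation, and every function name in it is hypothetical.

```python
from collections import defaultdict
from math import sqrt

# (user, book, rating) triples from the small example data set.
RATINGS = [
    (1, 101, 5.0), (1, 102, 3.0), (1, 103, 2.5),
    (2, 101, 2.0), (2, 102, 2.5), (2, 103, 5.0), (2, 104, 2.0),
    (3, 101, 2.5), (3, 104, 4.0), (3, 105, 4.5), (3, 107, 5.0),
    (4, 101, 5.0), (4, 103, 3.0), (4, 104, 4.5), (4, 106, 4.0),
    (5, 101, 4.0), (5, 102, 3.0), (5, 103, 2.0), (5, 104, 4.0),
    (5, 105, 3.5), (5, 106, 4.0),
]


def build_model(ratings):
    """user -> {book: rating}"""
    model = defaultdict(dict)
    for user, book, rating in ratings:
        model[user][book] = rating
    return dict(model)


def pearson(a, b):
    """Pearson correlation over co-rated books; None if undefined."""
    books = sorted(set(a) & set(b))
    if len(books) < 2:
        return None
    xs, ys = [a[k] for k in books], [b[k] for k in books]
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    num = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    den = sqrt(sum((x - mx) ** 2 for x in xs) * sum((y - my) ** 2 for y in ys))
    return num / den if den else None


def neighborhood(model, user, n):
    """The n users most similar to `user`, as (neighbor, similarity) pairs."""
    scored = []
    for other in model:
        if other == user:
            continue
        similarity = pearson(model[user], model[other])
        if similarity is not None:
            scored.append((other, similarity))
    scored.sort(key=lambda pair: (-pair[1], pair[0]))
    return scored[:n]


def recommend(model, user, n_neighbors=2):
    """Estimate ratings for unseen books as a similarity-weighted average."""
    totals, weights = defaultdict(float), defaultdict(float)
    for neighbor, similarity in neighborhood(model, user, n_neighbors):
        for book, rating in model[neighbor].items():
            if book not in model[user]:
                totals[book] += similarity * rating
                weights[book] += similarity
    estimates = [(b, totals[b] / weights[b]) for b in totals if weights[b] > 0]
    return sorted(estimates, key=lambda pair: (-pair[1], pair[0]))


model = build_model(RATINGS)
recommendations = recommend(model, user=1)
for book, estimate in recommendations:
    print(book, round(estimate, 2))
```

For user 1 the neighborhood is users 4 and 5, and book 104 comes out on top. A production recommender would add safeguards this sketch skips, for example discarding estimates that rest on a single neighbor.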
A real-life Web application
• News aggregator-cum-reader
  • Fetches news from a news service
  • Shows the news in a uniform UI
  • Lets readers read, like/dislike and comment on news
  • Lets readers link social networks and share
• Make this a personalized newspaper
  • Track user actions
  • Derive and store preferences
  • Generate recommendations
  • Leverage social accounts, etc.
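One way to "derive preferences" from tracked actions is to map each action to a score and keep one value per (user, article) pair. The sketch below is purely illustrative: the action names come from the slide, but the weights, the rule that an explicit like/dislike overrides implicit signals, and the function names are assumptions, not the application's actual logic. The output rows follow the same user,item,value layout as the deck's example data.

```python
# Hypothetical weights on a 1 to 5 scale; tune these for a real application.
IMPLICIT_WEIGHTS = {"read": 3.0, "comment": 4.0, "share": 4.5}
EXPLICIT_WEIGHTS = {"like": 5.0, "dislike": 1.0}


def derive_preferences(events):
    """Collapse (user, article, action) events into one preference per pair."""
    preferences = {}
    explicit = set()  # pairs whose value was set by a like or dislike
    for user, article, action in events:
        key = (user, article)
        if action in EXPLICIT_WEIGHTS:
            # The latest explicit signal always wins.
            preferences[key] = EXPLICIT_WEIGHTS[action]
            explicit.add(key)
        elif action in IMPLICIT_WEIGHTS and key not in explicit:
            # Among implicit signals, keep the strongest one seen so far.
            preferences[key] = max(preferences.get(key, 0.0), IMPLICIT_WEIGHTS[action])
        # Unknown actions are ignored.
    return preferences


def to_rows(preferences):
    """Render preferences as user,article,value lines, sorted by user and article."""
    return [
        f"{user},{article},{value}"
        for (user, article), value in sorted(preferences.items())
    ]


events = [
    (1, 900, "read"), (1, 900, "share"),
    (2, 900, "read"), (2, 900, "dislike"), (2, 900, "comment"),
    (2, 901, "read"),
]
rows = to_rows(derive_preferences(events))
print("\n".join(rows))
```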
Overall design
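[Diagram: three back-end services, each exposed over REST: user and application data (MySQL); news aggregation and storage (HBase); preferences and recommender (Mahout). A controller API (REST) sits in front of them and serves the web application, phone/tablet applications and third-party applications.]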
Recommender
[Diagram: a REST service (Grizzly, Tomcat) used to fetch recommendations and to input user actions; a recommender that runs offline and periodically; a MySQL database that supplies the input table dump.]
How to extract data – one dimension
[Chart: news article readership against number of news articles; x axis 1 to 9, logarithmic y axis 1 to 10000; data labels for x = 1 to 9: 4299, 511, 128, 51, 13, 4, 4, 1, 2]
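A distribution like this can be extracted from raw read events with a couple of counters. The sketch below assumes the chart counts how many articles were read by exactly x distinct readers; that reading of the axes, the event layout and the function name are assumptions for illustration.

```python
from collections import Counter, defaultdict


def readership_histogram(reads):
    """Map x -> number of articles read by exactly x distinct readers.

    `reads` is an iterable of (reader, article) pairs; repeat reads by the
    same reader count once.
    """
    readers = defaultdict(set)
    for reader, article in reads:
        readers[article].add(reader)
    return Counter(len(article_readers) for article_readers in readers.values())


reads = [
    (1, "a"), (2, "a"), (3, "a"),  # article a: three readers
    (1, "b"), (1, "b"),            # article b: one reader, read twice
    (2, "c"),                      # article c: one reader
]
histogram = readership_histogram(reads)
for x in sorted(histogram):
    print(x, histogram[x])
```

The same function applied to (reader, topic) pairs gives the topic-level view, which is the "add dimensions" step: topics collect far more readers each than individual articles do.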
How to extract data – add dimensions
[Chart: news article readership and topic readership against number of news articles / topics; x axis 1 to 57, logarithmic y axis 1 to 10000]
How more data helps
[Chart: number of readers with x articles each and number of readers with x topics each, against number of news articles/topics; x axis 0 to 800, y axis 0 to 40]
How more data helps
[Chart: number of readers with x articles each and number of readers with x topics each, against number of news articles/topics; x axis 5 to 95, y axis 0 to 9]
How more data helps
[Chart: number of readers with x articles each and number of readers with x topics each, against number of news articles/topics; x axis 95 to 395, y axis 0 to 3.5]
Learnings
• Know thy user
  • Frequency of visits
  • Preference logic with respect to the user
• Know thy items
  • Should have enough items per user
  • Maximize items per action
  • Should have enough intersections
  • Should not be transient
• Use tweaking abilities
  • Sharpen the saw
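"Enough items per user" and "enough intersections" can be checked on a data set before any recommender is built. The sketch below, in plain Python with hypothetical function names, computes the mean number of items per user and the share of user pairs that have at least `min_shared` items in common, using the small example data set. What counts as "enough" is left to the reader; the slide gives no figures.

```python
from collections import defaultdict
from itertools import combinations

# (user, book, rating) triples from the small example data set.
RATINGS = [
    (1, 101, 5.0), (1, 102, 3.0), (1, 103, 2.5),
    (2, 101, 2.0), (2, 102, 2.5), (2, 103, 5.0), (2, 104, 2.0),
    (3, 101, 2.5), (3, 104, 4.0), (3, 105, 4.5), (3, 107, 5.0),
    (4, 101, 5.0), (4, 103, 3.0), (4, 104, 4.5), (4, 106, 4.0),
    (5, 101, 4.0), (5, 102, 3.0), (5, 103, 2.0), (5, 104, 4.0),
    (5, 105, 3.5), (5, 106, 4.0),
]


def items_by_user(ratings):
    """Map each user to the set of items they have a preference for."""
    items = defaultdict(set)
    for user, item, _rating in ratings:
        items[user].add(item)
    return dict(items)


def mean_items_per_user(ratings):
    items = items_by_user(ratings)
    return sum(len(owned) for owned in items.values()) / len(items)


def share_of_overlapping_pairs(ratings, min_shared):
    """Fraction of user pairs sharing at least `min_shared` items."""
    items = items_by_user(ratings)
    pairs = list(combinations(sorted(items), 2))
    if not pairs:
        return 0.0
    overlapping = sum(
        1 for a, b in pairs if len(items[a] & items[b]) >= min_shared
    )
    return overlapping / len(pairs)


print(mean_items_per_user(RATINGS))
print(share_of_overlapping_pairs(RATINGS, min_shared=2))
```

A low overlap share means most similarity computations have little or nothing to work with, whichever similarity measure is chosen.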
Questions
?