Indic threads pune12-recommenders-apache-mahout


DESCRIPTION

The 7th Annual IndicThreads Pune Conference was held on 14-15 December 2012. http://pune12.indicthreads.com/


How to Build a Recommendation Engine Using Apache Mahout

Viraj Paripatyadar
GS Lab


Contents

• A recommendation problem
• What is a recommender
• Building a recommender using Mahout
• Tips and tweaks
• Recommender considerations

A book store

• Sells books:
  • By various authors
  • Of various categories
  • On different subjects
  • From various publishers
• Readers/buyers are asked to rate
• Readers/buyers can provide reviews

You walk into the store (to buy something for a friend)

The store owner

• Asks you what:
  • your friend reads (already owns)
  • your friend usually likes more
• Has data on what:
  • his customers buy
  • his customers rate and review
• Uses a few strategies

1 - Find similar books

Depending on which books your friend has, pick books:
• by the same author
• on the same or similar subjects
• in the same category
• from the same publisher

(those with the highest sales numbers)

2 - Find books with similar readership

• Define some similarity
  • e.g. two books are as similar as the number of readers rating both of them
• Define some limit of relevance
  • e.g. only consider books that have more than 4 readers in common
• Look for all books which are similar to books your friend owns

Pick books from this set that your friend doesn’t own
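The second strategy can be sketched in a few lines of Python. This is an illustration only, not Mahout code: the function names `co_readers` and `similar_unowned` are made up for this example, the ratings are the five-reader sample shown in this talk, and the relevance limit is lowered to 2 shared readers (an assumption) because the sample is so small.

```python
from collections import defaultdict
from itertools import combinations

# reader -> {book: rating}; the five-reader sample from this talk
RATINGS = {
    1: {101: 5.0, 102: 3.0, 103: 2.5},
    2: {101: 2.0, 102: 2.5, 103: 5.0, 104: 2.0},
    3: {101: 2.5, 104: 4.0, 105: 4.5, 107: 5.0},
    4: {101: 5.0, 103: 3.0, 104: 4.5, 106: 4.0},
    5: {101: 4.0, 102: 3.0, 103: 2.0, 104: 4.0, 105: 3.5, 106: 4.0},
}


def co_readers(ratings):
    """For every ordered pair of books, count the readers who rated both."""
    counts = defaultdict(int)
    for books in ratings.values():
        for a, b in combinations(sorted(books), 2):
            counts[(a, b)] += 1
            counts[(b, a)] += 1
    return counts


def similar_unowned(ratings, reader, min_readers):
    """Books the reader does not own, ranked by their strongest link to an owned book."""
    owned = set(ratings[reader])
    best = {}
    for (a, b), n in co_readers(ratings).items():
        # Keep pairs that lead from an owned book to an unowned one
        # and that meet the relevance limit.
        if a in owned and b not in owned and n >= min_readers:
            best[b] = max(best.get(b, 0), n)
    return sorted(best.items(), key=lambda kv: (-kv[1], kv[0]))


if __name__ == "__main__":
    print(similar_unowned(RATINGS, reader=1, min_readers=2))
```

For reader 1 this ranks book 104 first, because four readers rated both 101 and 104, followed by books 105 and 106.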

3 - Find people with similar tastes

• Define some similarity
  • e.g. two people are as similar as the number of books they like from the same category
• Define some limit of relevance
  • e.g. only consider the 3 top people when ordered according to how similar they are to your friend
• Look for users similar to your friend and see what they read

Pick books which these people like and your friend doesn’t own

Example data

1,101,5.0 3,101,2.5 4,106,4.0

1,102,3.0 3,104,4.0 5,101,4.0

1,103,2.5 3,105,4.5 5,102,3.0

2,101,2.0 3,107,5.0 5,103,2.0

2,102,2.5 4,101,5.0 5,104,4.0

2,103,5.0 4,103,3.0 5,105,3.5

2,104,2.0 4,104,4.5 5,106,4.0

• Your friend owns three books:
  • Gave 5 stars to book 101 (likes it hugely and talks about it all the time)
  • Gave 3 stars to book 102 (has shown some liking for it)
  • Gave 2.5 stars to book 103 (has read it, but didn’t say bad things about it)

Now we need to recommend books that your friend hasn’t seen

A pictorial representation

[Figure: readers 1 to 5 plotted against books 101 to 107]

Visualize

[Figure: readers 1 to 5 plotted against books 101 to 107]

A (slightly) bigger example

1,101,5.0 3,111,2.5 6,103,2.0

1,102,3.0 4,101,5.0 6,106,4.0

1,103,2.5 4,103,3.0 6,113,3.0

1,109,3.5 4,104,4.5 6,115,5.0

1,112,4.0 4,106,4.0 7,103,4.5

2,101,2.0 4,109,2.0 7,104,2.5

2,102,2.5 4,111,2.5 7,108,4.0

2,103,5.0 5,101,4.0 7,109,3.5

2,104,2.0 5,102,3.0 7,110,3.5

2,107,4.5 5,103,2.0 7,112,2.5

2,113,3.5 5,104,4.0 8,101,2.0

3,101,2.5 5,105,3.5 8,105,4.0

3,104,4.0 5,106,4.0 8,106,4.5

3,105,4.5 5,109,3.0 8,110,3.0

3,107,5.0 5,112,4.0 8,114,5.0

3,115,4.0 6,101,4.5 8,115,3.5

A pictorial representation

[Figure: readers 1 to 8 plotted against books 101 to 115]

Clearly, not a viable option

Mahout to the rescue

What is Apache Mahout

• Apache Mahout
  • A machine learning library
  • Works with Apache Hadoop
• Use cases:
  • Recommenders
  • Clustering
  • Classification

Recommenders in Mahout

• Recommenders use data culled from user behavior
• Recommending using Mahout
  • Similarity between users or items
    • Expressed as a number between 0 and 1
  • Neighborhood of users/items
  • Recommendation using this info and an algorithm
    • Generic
    • Specialized

Similarity

• Various algorithms:
  • Euclidean distance
  • Pearson correlation
  • Cosine measure
  • Spearman correlation
  • Tanimoto coefficient
  • Log-likelihood
• Effectiveness dependent on the input data
• Influences running time and memory
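To see why the choice of similarity depends on the input data, here is a minimal Python sketch of two of the measures applied to the example ratings. It is not Mahout code: the function names are made up, and the `1 / (1 + distance)` mapping used to turn a Euclidean distance into a similarity is an assumption chosen for simplicity, not necessarily the exact formula Mahout applies.

```python
from math import sqrt

# reader -> {book: rating}; the five-reader sample from this talk
RATINGS = {
    1: {101: 5.0, 102: 3.0, 103: 2.5},
    2: {101: 2.0, 102: 2.5, 103: 5.0, 104: 2.0},
    3: {101: 2.5, 104: 4.0, 105: 4.5, 107: 5.0},
    4: {101: 5.0, 103: 3.0, 104: 4.5, 106: 4.0},
    5: {101: 4.0, 102: 3.0, 103: 2.0, 104: 4.0, 105: 3.5, 106: 4.0},
}


def tanimoto(a, b):
    """Overlap of the two sets of rated books; the rating values are ignored."""
    books_a, books_b = set(a), set(b)
    return len(books_a & books_b) / len(books_a | books_b)


def euclidean_similarity(a, b):
    """1 / (1 + distance) over co-rated books; 0.0 when nothing is shared."""
    shared = set(a) & set(b)
    if not shared:
        return 0.0
    distance = sqrt(sum((a[book] - b[book]) ** 2 for book in shared))
    return 1.0 / (1.0 + distance)


if __name__ == "__main__":
    for other in (2, 5):
        print(other,
              round(tanimoto(RATINGS[1], RATINGS[other]), 3),
              round(euclidean_similarity(RATINGS[1], RATINGS[other]), 3))
```

The two measures disagree on this data. Tanimoto puts reader 2 closest to reader 1 because their book lists overlap most, while the Euclidean measure prefers reader 5 because the actual ratings are closer. Which answer is better depends on whether the rating values carry real meaning in the input.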

Neighborhood

• Nearest N neighborhood (say, 4)
• Threshold neighborhood (say, > 0.8)

[Figure: target user U surrounded by users 1 to 5, drawn once for each neighborhood type]
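The two neighborhood types differ only in how they cut off the list of similar users. A minimal Python sketch, assuming the similarity of each user to the target user U has already been computed; the scores below are hypothetical values invented for this example, and the function names are not Mahout's.

```python
# Hypothetical similarity of users 1 to 5 to the target user U (made-up values)
SIMILARITY_TO_U = {1: 0.95, 2: 0.85, 3: 0.60, 4: 0.82, 5: 0.30}


def nearest_n(similarities, n):
    """The n most similar users, however low their scores may be."""
    ranked = sorted(similarities, key=lambda user: (-similarities[user], user))
    return ranked[:n]


def threshold(similarities, minimum):
    """Every user whose similarity exceeds the minimum, however many there are."""
    return sorted(user for user, score in similarities.items() if score > minimum)


if __name__ == "__main__":
    print(nearest_n(SIMILARITY_TO_U, 4))
    print(threshold(SIMILARITY_TO_U, 0.8))
```

With these scores, nearest-4 keeps user 3 even though its similarity is only 0.60, while the 0.8 threshold keeps just users 1, 2 and 4. A fixed N guarantees a neighborhood size; a threshold guarantees a neighborhood quality.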

Recommender

• Recommenders
  • Generic recommender
    • User based
    • Item based
  • Slope-one recommender
  • Singular Value Decomposition based
  • Linear interpolation based
  • Cluster-based
• Recommender rescorer
• Recommender evaluator
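As a rough picture of what a generic user-based recommender does, the following Python sketch chains the three pieces together on the example ratings: a similarity (Pearson correlation here), a nearest-N neighborhood, and a similarity-weighted average as the estimate. It is a simplified stand-in, not Mahout's implementation; the function names, the choice to skip neighbors with undefined or non-positive correlation, and the neighborhood size of 2 are all assumptions made for this example.

```python
from math import sqrt

# reader -> {book: rating}; the five-reader sample from this talk
RATINGS = {
    1: {101: 5.0, 102: 3.0, 103: 2.5},
    2: {101: 2.0, 102: 2.5, 103: 5.0, 104: 2.0},
    3: {101: 2.5, 104: 4.0, 105: 4.5, 107: 5.0},
    4: {101: 5.0, 103: 3.0, 104: 4.5, 106: 4.0},
    5: {101: 4.0, 102: 3.0, 103: 2.0, 104: 4.0, 105: 3.5, 106: 4.0},
}


def pearson(a, b):
    """Pearson correlation over co-rated books, or None when it is undefined."""
    shared = sorted(set(a) & set(b))
    if len(shared) < 2:
        return None
    mean_a = sum(a[i] for i in shared) / len(shared)
    mean_b = sum(b[i] for i in shared) / len(shared)
    cov = sum((a[i] - mean_a) * (b[i] - mean_b) for i in shared)
    var_a = sum((a[i] - mean_a) ** 2 for i in shared)
    var_b = sum((b[i] - mean_b) ** 2 for i in shared)
    if var_a == 0 or var_b == 0:
        return None
    return cov / sqrt(var_a * var_b)


def recommend(ratings, user, n_neighbors):
    """Rank unseen books by the similarity-weighted average of the neighbors' ratings."""
    scored = []
    for other in ratings:
        if other != user:
            score = pearson(ratings[user], ratings[other])
            # Assumption: ignore undefined and non-positive correlations.
            if score is not None and score > 0:
                scored.append((score, other))
    neighbors = sorted(scored, reverse=True)[:n_neighbors]

    totals, weights = {}, {}
    for score, other in neighbors:
        for book, rating in ratings[other].items():
            if book not in ratings[user]:
                totals[book] = totals.get(book, 0.0) + score * rating
                weights[book] = weights.get(book, 0.0) + score
    estimates = {book: totals[book] / weights[book] for book in totals}
    return sorted(estimates.items(), key=lambda kv: (-kv[1], kv[0]))


if __name__ == "__main__":
    for book, estimate in recommend(RATINGS, user=1, n_neighbors=2):
        print(book, round(estimate, 2))
```

With two neighbors, reader 1's closest matches are readers 4 and 5, and book 104 comes out on top with an estimated rating of about 4.26, ahead of books 106 and 105.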

A real-life Web application

• News aggregator-cum-reader
  • Fetches news from a news service
  • Shows the news in a uniform UI
  • Lets readers read, like/dislike and comment on news
  • Link social networks and share
• Make this a personalized newspaper
  • Track user actions
  • Derive and store preferences
  • Generate recommendations
  • Leverage social accounts, etc.
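One possible way to derive preferences from tracked user actions is to give each action type a weight and sum the weights per reader and article. The Python sketch below shows the idea only; the action names, the weights, and the 1-to-5 clamping range are all hypothetical values chosen for illustration, not the ones used in the application described here.

```python
from collections import defaultdict

# Hypothetical weight of each tracked action (assumed values, tune per application)
ACTION_WEIGHTS = {"read": 1.0, "comment": 1.5, "like": 2.0, "dislike": -2.0}


def derive_preferences(actions, low=1.0, high=5.0):
    """Fold (user, article, action) events into one clamped preference per pair."""
    raw = defaultdict(float)
    for user, article, action in actions:
        raw[(user, article)] += ACTION_WEIGHTS[action]
    return {pair: min(high, max(low, value)) for pair, value in raw.items()}


def to_csv_lines(preferences):
    """Render preferences as user,item,value lines, the layout of the example data."""
    return [f"{user},{article},{value:.1f}"
            for (user, article), value in sorted(preferences.items())]


if __name__ == "__main__":
    events = [
        (1, 101, "read"), (1, 101, "like"),
        (2, 101, "read"), (2, 101, "dislike"),
    ]
    print("\n".join(to_csv_lines(derive_preferences(events))))
```

The resulting lines have the same user, item, value shape as the example data, so they could be stored and handed to a recommender as its input.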

Overall design

 

[Diagram]

• Back-end services, each exposed over REST:
  • User, application data (MySQL)
  • News aggregation, storage (HBase)
  • Preferences, Recommender (Mahout)
• Controller API (REST)
• Clients:
  • Web application
  • Phone/tablet applications
  • Third party applications

Recommender

[Diagram]

• REST service (Grizzly, Tomcat)
  • Fetch recommendations
  • Input user actions
• Recommender (offline, run periodically)
• MySQL database
  • Input table dump

How to extract data – one dimension

[Chart: News article readership. X axis: Number of News Articles (1 to 9). Y axis: logarithmic, 1 to 10000. Data labels: 4299, 511, 128, 51, 13, 4, 4, 1, 2]

How to extract data – add dimensions

[Chart: News article readership and Topic readership. X axis: Number of News articles / Topics (1 to 57). Y axis: logarithmic, 1 to 10000]
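Adding a dimension can be pictured as rolling article-level readership up to a coarser level such as topics, so that readers who never read the same article can still overlap. The Python sketch below illustrates this; the article-to-topic mapping, the reader names, and the reading log are all made up for the example.

```python
from collections import defaultdict

# Hypothetical article -> topic mapping and reading log (made up for illustration)
ARTICLE_TOPIC = {"a1": "cricket", "a2": "cricket", "a3": "politics", "a4": "politics"}
READS = {"u1": ["a1", "a3"], "u2": ["a2", "a4"]}


def topic_counts(reads, article_topic):
    """Count, per reader, how many articles they read in each topic."""
    counts = {}
    for reader, articles in reads.items():
        per_topic = defaultdict(int)
        for article in articles:
            per_topic[article_topic[article]] += 1
        counts[reader] = dict(per_topic)
    return counts


def shared_items(a, b):
    """Items (articles or topics) that both readers have in common."""
    return set(a) & set(b)


if __name__ == "__main__":
    topics = topic_counts(READS, ARTICLE_TOPIC)
    print(shared_items(READS["u1"], READS["u2"]))
    print(shared_items(topics["u1"], topics["u2"]))
```

Here the two readers share no articles at all, yet they share both topics, which gives a similarity measure something to work with.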

How more data helps

[Chart: No. of readers with x articles each and No. of readers with x topics each. X axis: Number of news articles/topics (0 to 800). Y axis: 0 to 40]

How more data helps

[Chart: No. of readers with x articles each and No. of readers with x topics each. X axis: Number of news articles/topics (5 to 95). Y axis: 0 to 9]

How more data helps

[Chart: No. of readers with x articles each and No. of readers with x topics each. X axis: Number of news articles/topics (95 to 395). Y axis: 0 to 3.5]

Learnings

• Know thy user
  • Frequency of visits
  • Preference logic with respect to the user
• Know thy items
  • Should have enough items per user
  • Maximize items per action
  • Should have enough intersections
  • Should not be transient
• Use tweaking abilities
  • Sharpen the saw

Questions

?

Thank you

viraj@gslab.com

viraj.paripatyadar@gmail.com