A QUICK TUTORIAL ON MAHOUT’S RECOMMENDATION ENGINE (V 0.4)
Jee Vang, Ph.D.

A Quick Tutorial on Mahout's Recommendation Engine is licensed under a Creative Commons Attribution 3.0 Unported License.

Slide Version 3.1


DESCRIPTION

A quick tutorial on Mahout's recommendation algorithm.


Page 2: A Quick Tutorial on Mahout’s Recommendation Engine (v 0.4)

What is recommendation?

  • Recommendation involves the prediction of what new items a user would like or dislike, based on preferences of, or associations to, previous items
  • (Made-up) Example: A user, John Doe, likes the following books (items):
    • A Tale of Two Cities
    • The Great Gatsby
    • For Whom the Bell Tolls
  • Recommendations will predict which new books (items) John Doe will like:
    • Jane Eyre
    • The Adventures of Tom Sawyer


Page 3: A Quick Tutorial on Mahout’s Recommendation Engine (v 0.4)

What is Mahout?

  • Mahout is a machine learning application programming interface (API) built on Hadoop
    • MapReduce (MR or M/R)
    • Hadoop Distributed File System (HDFS)
  • Mahout is written in Java
  • Mahout has machine learning algorithms in the following areas:
    • Clustering
    • Pattern mining
    • Classification
    • Regression
    • Evolutionary algorithms
    • Recommenders/Collaborative filtering


Page 4: A Quick Tutorial on Mahout’s Recommendation Engine (v 0.4)

How does Mahout’s Recommendation Engine Work?

S x U = R

  • S is the similarity matrix between items
  • U is the user’s preferences for items
  • R is the predicted recommendations
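To make the product S x U = R concrete, here is a minimal sketch in plain Python. The function name `multiply` and the 3-item numbers are made up for illustration; they are not the values from the slides, and Mahout itself does not compute the product this way (it distributes the work over MapReduce).

```python
def multiply(S, U):
    """Return R = S x U for a square matrix S (list of rows) and a column vector U."""
    return [sum(s_jk * u_k for s_jk, u_k in zip(row, U)) for row in S]

# Made-up 3-item example: S holds item-item similarities, U one user's preferences.
S = [[2, 1, 0],
     [1, 3, 1],
     [0, 1, 2]]
U = [1.0, 0.0, 2.0]

R = multiply(S, U)
print(R)  # one predicted score per item
```

Each entry of R is the dot product of one row of S with U, so items that are similar to many of the user's preferred items receive higher scores.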


Page 5: A Quick Tutorial on Mahout’s Recommendation Engine (v 0.4)

What is the similarity matrix, S?

  • S is an n x n (square) matrix
  • Each element, e, in S is indexed by row (j) and column (k), e_jk
  • Each e_jk in S holds a value that describes how similar its corresponding j-th and k-th items are
  • In this example, the similarity of the j-th and k-th items is determined by the frequency of their co-occurrence (when the j-th item is seen, the k-th item is seen as well)
    • In general, any similarity measure may be used to produce these values
  • We see in this example that Items 1 and 2 co-occur 3 times, Items 1 and 3 co-occur 4 times, and so on…

[Figure: S, a 7 x 7 matrix with rows and columns labeled Item 1 through Item 7]
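As an illustration of co-occurrence as a similarity measure, the following sketch counts how often each pair of items appears together in the same user's item list. The helper name `cooccurrence_matrix`, the zero-based item indices, and the sample data are assumptions made for this example, and leaving the diagonal at zero is a choice of this sketch rather than something the slides specify.

```python
from itertools import combinations

def cooccurrence_matrix(baskets, n_items):
    """Build a symmetric n x n matrix of pairwise co-occurrence counts."""
    S = [[0] * n_items for _ in range(n_items)]
    for items in baskets:
        # Every unordered pair of distinct items seen together counts once.
        for j, k in combinations(sorted(set(items)), 2):
            S[j][k] += 1
            S[k][j] += 1
    return S

# Made-up data: each inner list holds the items one user has expressed a preference for.
baskets = [[0, 1, 2], [0, 2], [1, 2], [0, 2]]
S = cooccurrence_matrix(baskets, 3)
for row in S:
    print(row)
```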


Page 6: A Quick Tutorial on Mahout’s Recommendation Engine (v 0.4)

What are the user’s preferences, U?

  • The user’s preferences are represented as a column vector
  • Each value in the vector represents the user’s preference for the j-th item
  • In general, this column vector is sparse
  • Values of zero, 0, represent no recorded preference for the j-th item

[Figure: U, a column vector with rows labeled Item 1 through Item 7]
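A sparse vector only needs to store its non-zero entries. The sketch below uses a plain Python dict for this; the helper names `to_sparse` and `preference` and the preference values are made up for illustration, and Mahout uses its own Vector classes rather than dicts.

```python
def to_sparse(dense):
    """Keep only the non-zero entries, keyed by zero-based item index."""
    return {j: value for j, value in enumerate(dense) if value != 0}

def preference(sparse, j):
    """Look up the preference for item j; a missing entry means no recorded preference."""
    return sparse.get(j, 0.0)

# Made-up preferences for 7 items; zeros mean "no recorded preference".
U_dense = [2.0, 0.0, 0.0, 4.0, 4.5, 0.0, 5.0]
U = to_sparse(U_dense)
print(U)
print(preference(U, 1))
```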


Page 7: A Quick Tutorial on Mahout’s Recommendation Engine (v 0.4)

What is the recommendation, R?

  • R is a column vector in which the j-th value is the predicted recommendation score of the j-th item for the user
  • R is computed from the multiplication of S and U
    • S x U = R
  • In this running example, the user has already expressed positive preferences for Items 1, 4, 5, and 7, so we look only at Items 2, 3, and 6
  • We would recommend Items 3, 2, and 6, in this order, to the user

[Figure: R, a column vector with rows labeled Item 1 through Item 7]
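The selection step described above (skip items the user already has a preference for, then rank the rest by score) can be sketched as follows. The function name `recommend` and the numbers in R and U are illustrative assumptions; the values were chosen only so that the resulting order matches the one stated on the slide.

```python
def recommend(R, U, top_n=10):
    """Rank items with no recorded preference by their predicted score, highest first."""
    candidates = [(j, score) for j, score in enumerate(R) if U[j] == 0]
    candidates.sort(key=lambda pair: pair[1], reverse=True)
    return candidates[:top_n]

# Illustrative values only (zero-based indices, so index 0 is Item 1).
R = [40.0, 18.5, 24.5, 40.0, 26.0, 16.5, 15.5]
U = [2.0, 0.0, 0.0, 4.0, 4.5, 0.0, 5.0]

ranked_items = [j + 1 for j, _ in recommend(R, U)]
print(ranked_items)
```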


Page 8: A Quick Tutorial on Mahout’s Recommendation Engine (v 0.4)

What data format does Mahout’s recommendation engine expect?

  • For Mahout v0.4, look at RecommenderJob (org.apache.mahout.cf.taste.hadoop.item.RecommenderJob)
  • Each line of the input file should have the following format: userID,itemID[,preferencevalue]
    • userID is parsed as a long
    • itemID is parsed as a long
    • preferencevalue is parsed as a double and is optional

Format 1
123,345
123,456
123,789
…
789,458

Format 2
123,345,1.0
123,456,2.2
123,789,3.4
…
789,458,1.2
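A line in either format can be parsed with a few lines of code. This is a minimal sketch, not Mahout's parser: the function name `parse_preference` is hypothetical, and falling back to 1.0 when the preference value is absent is an assumption of this example.

```python
def parse_preference(line):
    """Parse 'userID,itemID[,preferencevalue]' into (user_id, item_id, value)."""
    fields = line.strip().split(",")
    user_id = int(fields[0])
    item_id = int(fields[1])
    # Assumption: a missing preference value is treated as 1.0.
    value = float(fields[2]) if len(fields) > 2 else 1.0
    return user_id, item_id, value

print(parse_preference("123,345"))      # Format 1
print(parse_preference("123,456,2.2"))  # Format 2
```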


Page 9: A Quick Tutorial on Mahout’s Recommendation Engine (v 0.4)

How do you run Mahout’s recommendation engine?

  • Requirements
    • Hadoop cluster on GNU/Linux
    • Java 1.6.x
    • SSH
  • Assuming you have a Hadoop cluster installed and configured correctly with the data loaded into HDFS, run

    $HADOOP_INSTALL$/bin/hadoop jar $TARGET$/mahout-core-0.4-job.jar org.apache.mahout.cf.taste.hadoop.item.RecommenderJob -Dmapred.input.dir=$INPUT$ -Dmapred.output.dir=$OUTPUT$

    • $HADOOP_INSTALL$ is the location where you installed Hadoop
    • $TARGET$ is the directory where you have the Mahout jar file
    • $INPUT$ is the input file name
    • $OUTPUT$ is the output file name
  • There are plenty of runtime options (check the javadocs)
    • --usersFile (path): optional; a file containing userIDs; recommendations will be computed only for these userIDs
    • --itemsFile (path): optional; a file containing itemIDs; only these items will be used in the recommendation predictions
    • --numRecommendations (integer): number of recommendations to compute per user; default 10
    • --booleanData (boolean): treat input data as having no preference values; default false
    • --maxPrefsPerUser (integer): maximum number of preferences considered per user in the final recommendation phase; default 10
    • --similarityClassname (classname): similarity measure (cooccurrence, euclidean, log-likelihood, pearson, tanimoto coefficient, uncentered cosine, cosine)


Page 10: A Quick Tutorial on Mahout’s Recommendation Engine (v 0.4)

What are the mechanics of Mahout’s recommendation engine?

  • Mahout is built on Hadoop’s MapReduce (MR) API
    • <K1,V1> → map → <K2,V2>
    • <K2,List(V2)> → reduce → <K3,V3>
  • A series of MR phases (jobs) is called to accomplish the task of predicting recommendations
    • ItemIDIndexMapper, ItemIDIndexReducer
    • ToItemPrefsMapper, ToUserVectorReducer
    • CountUsersMapper, CountUsersReducer
    • …
    • PartialMultiplyMapper, AggregateAndRecommendReducer
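The map, group-by-key, reduce flow can be imitated in memory to see how keys and values move between the two steps. This is only a toy sketch: `run_mapreduce`, `item_mapper`, and `count_reducer` are hypothetical names, the input lines are made up, and nothing here uses the Hadoop API.

```python
from collections import defaultdict

def run_mapreduce(records, mapper, reducer):
    """Apply mapper to each <K1,V1>, group values by K2, then apply reducer per key."""
    groups = defaultdict(list)
    for k1, v1 in records:
        for k2, v2 in mapper(k1, v1):
            groups[k2].append(v2)
    output = []
    for k2 in sorted(groups):  # keys are processed in sorted order
        output.extend(reducer(k2, groups[k2]))
    return output

def item_mapper(offset, line):
    """<line offset, text> -> <itemID, 1>"""
    item_id = line.split(",")[1]
    yield int(item_id), 1

def count_reducer(item_id, ones):
    """<itemID, List(1)> -> <itemID, number of preferences>"""
    yield item_id, sum(ones)

lines = list(enumerate(["1,10", "1,20", "2,10", "3,10,4.5"]))
counts = run_mapreduce(lines, item_mapper, count_reducer)
print(counts)
```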


Page 11: A Quick Tutorial on Mahout’s Recommendation Engine (v 0.4)

Mahout’s Recommender Engine: Phase 1, Generate List of ItemIDs

ItemIDIndexMapper
  • Input: <LongWritable,Text>
  • Output: <VarIntWritable,VarLongWritable>
  • Parses out itemID_long
  • Converts itemID to int, itemID_int
  • Emits <itemID_int,itemID_long>

ItemIDIndexReducer
  • Input: <VarIntWritable,List(VarLongWritable)>
  • Output: <VarIntWritable,VarLongWritable>
  • Finds the smallest value in the list of values, itemID_long_min
  • Emits <itemID_int,itemID_long_min>
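The sketch below imitates this phase in plain Python. The folding scheme in `id_to_index` (XOR of the upper and lower 32 bits, masked to a non-negative int) is an assumption about how a long might be reduced to an int, not a confirmed copy of Mahout's code; `build_index` is a hypothetical helper. Because several long IDs can fold to the same int, the reducer step keeps the smallest one.

```python
from collections import defaultdict

def id_to_index(item_id):
    """Fold a 64-bit ID into a non-negative 32-bit index (assumed hashing scheme)."""
    item_id &= 0xFFFFFFFFFFFFFFFF           # treat the value as an unsigned 64-bit word
    return (item_id ^ (item_id >> 32)) & 0x7FFFFFFF

def build_index(item_ids):
    """Mapper: emit <index, itemID>. Reducer: keep the smallest itemID per index."""
    groups = defaultdict(list)
    for item_id in item_ids:
        groups[id_to_index(item_id)].append(item_id)
    return {index: min(ids) for index, ids in groups.items()}

# 344 and 2**32 + 345 fold to the same index under this scheme.
index = build_index([345, 456, 344, 2**32 + 345])
print(index)
```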

Page 12: A Quick Tutorial on Mahout’s Recommendation Engine (v 0.4)

Mahout’s Recommender Engine: Phase 2, Create Preference Vector

ToItemPrefsMapper
  • Input: <LongWritable,Text>
  • Output: <VarLongWritable,VarLongWritable>
  • Parses out userID and itemID
  • Emits <userID,itemID>

ToUserVectorReducer
  • Input: <VarLongWritable,List(VarLongWritable)>
  • Output: <VarLongWritable,VectorWritable>
  • Creates preferences, U
    • U is a sparse Vector
  • Emits <userID,U>
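Grouping the parsed lines by user yields one sparse preference vector per user, which is the essence of this phase. In this sketch `to_user_vectors` is a hypothetical name, the vectors are keyed by the raw itemID rather than by an int index, and a missing preference value defaults to 1.0; all three are simplifying assumptions.

```python
from collections import defaultdict

def to_user_vectors(lines):
    """Group 'userID,itemID[,value]' lines into {userID: {itemID: value}}."""
    vectors = defaultdict(dict)
    for line in lines:
        fields = line.strip().split(",")
        user_id, item_id = int(fields[0]), int(fields[1])
        value = float(fields[2]) if len(fields) > 2 else 1.0  # assumed default
        vectors[user_id][item_id] = value
    return dict(vectors)

user_vectors = to_user_vectors(["123,345,1.0", "123,456,2.2", "789,458,1.2"])
print(user_vectors)
```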

Page 13: A Quick Tutorial on Mahout’s Recommendation Engine (v 0.4)

Mahout’s Recommender Engine: Phase 3, Count Unique Users

CountUsersMapper
  • Input: <LongWritable,Text>
  • Output: <CountUsersKeyWritable,VarLongWritable>
  • Parses out userID
  • Emits <userID,userID>

CountUsersReducer
  • Input: <CountUsersKeyWritable,List(VarLongWritable)>
  • Output: <VarIntWritable,NullWritable>
  • Counts all unique users, numUsers
  • Emits <numUsers,null>
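One way to count unique users when the userIDs arrive in sorted order, as keys do at a reducer, is to increment a counter each time the ID changes. The sketch below takes that approach; `count_unique_users` is a hypothetical name and the approach is an illustration, not a description of Mahout's exact implementation.

```python
def count_unique_users(lines):
    """Count distinct userIDs by scanning them in sorted order."""
    user_ids = sorted(int(line.split(",")[0]) for line in lines)
    count, previous = 0, None
    for user_id in user_ids:
        if user_id != previous:  # a new userID starts here
            count += 1
            previous = user_id
    return count

num_users = count_unique_users(["123,345", "789,458", "123,456", "123,789"])
print(num_users)
```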

Page 14: A Quick Tutorial on Mahout’s Recommendation Engine (v 0.4)

Mahout’s Recommender Engine: Phase 4, Transpose Preferences Vectors

MaybePruneRowsMapper
  • Input: <VarLongWritable,VectorWritable>
    • Uses MR output from Phase 2
  • Output: <IntWritable,DistributedRowMatrix.MatrixEntryWritable>
  • Transposes MR output from Phase 2
    • MR Phase 2 output had users as rows and items as cols
    • Now, items are rows and users are cols
  • Each element, e_jk, is transposed, e_kj
  • Emits <k,e_kj>

ToItemVectorsReducer
  • Input: <IntWritable,List(DistributedRowMatrix.MatrixEntryWritable)>
  • Output: <IntWritable,VectorWritable>
  • Writes transposed user preferences vectors, U’
  • Emits <row,U’>
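Transposing sparse user vectors means re-keying every entry by item instead of by user. A minimal sketch, assuming the nested-dict representation {userID: {itemID: value}} (a choice made for this example, with `transpose` as a hypothetical helper name):

```python
from collections import defaultdict

def transpose(user_vectors):
    """Turn {user: {item: value}} into {item: {user: value}}."""
    item_vectors = defaultdict(dict)
    for user_id, prefs in user_vectors.items():
        for item_id, value in prefs.items():
            # Mapper: emit <item, (user, value)>; reducer: collect entries per item.
            item_vectors[item_id][user_id] = value
    return dict(item_vectors)

users = {123: {345: 1.0, 456: 2.2}, 789: {345: 3.0}}
items = transpose(users)
print(items)
```

Applying `transpose` twice returns the original structure, which is a quick sanity check for the re-keying.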

Page 15: A Quick Tutorial on Mahout’s Recommendation Engine (v 0.4)

Mahout’s Recommender Engine: Phase 5.1, RowSimilarityJob, Compute Weights

RowWeightMapper
  • Input: <IntWritable,VectorWritable>
    • Uses MR output from Phase 4
  • Output: <VarIntWritable,WeightedOccurrence>
  • For each element, e_jk, computes its weighted occurrence, w_jk
  • Emits <k,w_jk>

WeightedOccurrencesPerColumnReducer
  • Input: <VarIntWritable,List(WeightedOccurrence)>
  • Output: <VarIntWritable,WeightedOccurrenceArray>
  • Transfers the weighted occurrences to an array and writes the results
  • Emits <k,w_jk>

Page 16: A Quick Tutorial on Mahout’s Recommendation Engine (v 0.4)

Mahout’s Recommender Engine: Phase 5.2, RowSimilarityJob, Compute Similarities

CooccurrencesMapper
  • Input: <VarIntWritable,WeightedOccurrenceArray>
    • Uses MR output from Phase 5.1
  • Output: <WeightedRowPair,Cooccurrence>
  • For each pair of rows, p, writes its column cooccurrences, c
  • Emits <p,c>

SimilarityReducer
  • Input: <WeightedRowPair,List(Cooccurrence)>
  • Output: <SimilarityMatrixEntryKey,MatrixEntryWritable>
  • Computes the row similarity between row_a and row_b, and writes it to the corresponding position in the matrix
  • Emits <row_j,matrix entry>
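The weight-then-similarity pattern of Phases 5.1 and 5.2 can be illustrated with uncentered cosine similarity, where each row's weight is its Euclidean norm. This is a sketch under that assumption; `row_similarities` is a hypothetical helper, the data are made up, and other similarity measures would define the weight and the final formula differently.

```python
import math
from collections import defaultdict
from itertools import combinations

def row_similarities(rows):
    """rows: {row: {col: value}} -> {(row_a, row_b): cosine similarity}."""
    # Phase 5.1 analogue: one weight per row (here, the Euclidean norm).
    weights = {r: math.sqrt(sum(v * v for v in cols.values()))
               for r, cols in rows.items()}
    # Regroup the entries by column so rows sharing a column meet.
    by_col = defaultdict(list)
    for r, cols in rows.items():
        for c, v in cols.items():
            by_col[c].append((r, v))
    # Phase 5.2 mapper analogue: emit one product per pair of rows per shared column.
    products = defaultdict(list)
    for entries in by_col.values():
        for (a, va), (b, vb) in combinations(sorted(entries), 2):
            products[(a, b)].append(va * vb)
    # Phase 5.2 reducer analogue: combine the products with the two row weights.
    return {pair: sum(p) / (weights[pair[0]] * weights[pair[1]])
            for pair, p in products.items()}

rows = {1: {100: 3.0, 200: 4.0}, 2: {100: 3.0, 200: 4.0}, 3: {200: 2.0}}
sims = row_similarities(rows)
print(sims)
```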

Page 17: A Quick Tutorial on Mahout’s Recommendation Engine (v 0.4)

Mahout’s Recommender Engine: Phase 5.3, RowSimilarityJob, Similarity Matrix

Mapper
  • Input: <SimilarityMatrixEntryKey,MatrixEntryWritable>
    • Uses MR output from Phase 5.2
  • Output: <SimilarityMatrixEntryKey,MatrixEntryWritable>
  • Writes similarity matrix entry key, sme, and matrix entry, me, as is
    • sme is basically each row
    • me is basically each row-col entry of the similarity matrix
  • Emits <sme,me>

EntriesToVectorsReducer
  • Input: <SimilarityMatrixEntryKey,List(MatrixEntryWritable)>
  • Output: <IntWritable,VectorWritable>
  • Writes the row and its associated vector out
  • Emits <row,vector>

Page 18: A Quick Tutorial on Mahout’s Recommendation Engine (v 0.4)

Mahout’s Recommender Engine: Phase 6, Pre-partial multiply, Similarity Matrix

SimilarityMatrixRowWrapperMapper
  • Input: <IntWritable,VectorWritable>
    • Uses MR output from Phase 5.3
  • Output: <IntWritable,VectorOrPrefWritable>
  • Wraps the similarity vector, v1, into a different vector format, v2
  • Emits <row,v2>

Reducer
  • Input: <IntWritable,List(VectorOrPrefWritable)>
  • Output: <IntWritable,VectorOrPrefWritable>
  • Writes the row and each of its associated vectors out
  • Emits <row,vector>

Page 19: A Quick Tutorial on Mahout’s Recommendation Engine (v 0.4)

Mahout’s Recommender Engine: Phase 7, Pre-partial multiply, Preferences

UserVectorSplitterMapper
  • Input: <VarLongWritable,VectorWritable>
    • Uses MR output from Phase 2
  • Output: <VarIntWritable,VectorOrPrefWritable>
  • Maps userID and preference vector, U
  • Emits <userID,U>

Reducer
  • Input: <IntWritable,List(VectorOrPrefWritable)>
  • Output: <IntWritable,VectorOrPrefWritable>
  • Writes the row and each of its associated vectors out
  • Emits <row,vector>

Page 20: A Quick Tutorial on Mahout’s Recommendation Engine (v 0.4)

Mahout’s Recommender Engine: Phase 8, Partial Multiply

Mapper
  • Input: <VarLongWritable,VectorWritable>
    • Uses MR outputs from Phases 6 and 7
  • Output: <VarIntWritable,VectorOrPrefWritable>
  • Maps row and vector, v
  • Emits <row,v>

ToVectorAndPrefReducer
  • Input: <VarIntWritable,List(VectorOrPrefWritable)>
  • Output: <IntWritable,VectorOrPrefWritable>
  • Writes the row and its associated similarity vector, userIDs, and preference values
  • Emits <row,vector>

Page 21: A Quick Tutorial on Mahout’s Recommendation Engine (v 0.4)

Mahout’s Recommender Engine: Phase 9, Filters Items

ItemFilterMapper
  • Input: <LongWritable,Text>
  • Output: <VarLongWritable,VarLongWritable>
  • Parses userID and itemID
  • Emits <itemID,userID>

ItemFilterAsVectorAndPrefReducer
  • Input: <VarLongWritable,List(VarLongWritable)>
  • Output: <VarIntWritable,VectorOrPrefWritable>
  • Writes itemID and vector of userIDs and preferences
  • Emits <itemID,vector>

Page 22: A Quick Tutorial on Mahout’s Recommendation Engine (v 0.4)

Mahout’s Recommender Engine: Phase 10, Aggregate and Recommend

PartialMultiplyMapper
  • Input: <VarIntWritable,VectorAndPrefsWritable>
    • Uses MR outputs from Phases 8 and 9
  • Output: <VarLongWritable,PrefAndSimilarityColumnWritable>
  • Writes userID and recommendations
  • Emits <userID,recommendation>

AggregateAndRecommendReducer
  • Input: <VarLongWritable,List(PrefAndSimilarityColumnWritable)>
  • Output: <VarLongWritable,RecommendedItemsWritable>
  • Writes userID and vector of recommendations
  • Emits <userID,vector>
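The idea behind partial multiplication is that S x U can be built up one item at a time: each preference scales that item's similarity column, and the scaled columns are summed per user. The sketch below shows this in plain Python; `aggregate_and_recommend`, the nested-dict data layout, and the sample numbers are assumptions made for illustration, not Mahout's data structures.

```python
from collections import defaultdict

def aggregate_and_recommend(similarity_columns, user_vectors, top_n=10):
    """Sum preference-scaled similarity columns per user, then rank unseen items."""
    recommendations = {}
    for user_id, prefs in user_vectors.items():
        totals = defaultdict(float)
        for item_id, pref in prefs.items():
            # Partial product: this item's similarity column scaled by the preference.
            for other_item, similarity in similarity_columns.get(item_id, {}).items():
                totals[other_item] += pref * similarity
        # Drop items the user already has a preference for, then sort by score.
        unseen = [(item, score) for item, score in totals.items() if item not in prefs]
        unseen.sort(key=lambda pair: pair[1], reverse=True)
        recommendations[user_id] = unseen[:top_n]
    return recommendations

# Made-up similarity columns {item: {other item: similarity}} and user preferences.
S_columns = {1: {2: 3.0, 3: 1.0}, 2: {1: 3.0, 3: 2.0}, 3: {1: 1.0, 2: 2.0}}
prefs_by_user = {7: {1: 2.0}, 8: {1: 1.0, 2: 1.0}}
recs = aggregate_and_recommend(S_columns, prefs_by_user)
print(recs)
```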

Page 23: A Quick Tutorial on Mahout’s Recommendation Engine (v 0.4)

Summary and Conclusion

  • Mahout is a machine learning API built on top of Hadoop which includes clustering, pattern mining, classification, regression, evolutionary algorithms, and recommenders
  • Mahout’s recommender engine transforms an expected input format into predicted recommendations
    • Uses a series of MR phases to accomplish predicting recommendations


