REAL-TIME RECOMMENDATION SYSTEMS

PRANAB GHOSH, Big Data Consultant

Big Data Cloud Meetup

April 3 2014, Sunnyvale, CA

CONTENTS

Recommendation processing concepts

Hadoop, Storm & Redis based recommendation engine implementation in ‘Sifarish’

Content based recommendation and social recommendation

Key distinguishing features of ‘Sifarish’ compared to Apache Mahout

Real time social recommendations

HADOOP AT 30,000 FT

Power of functional programming and parallel processing join hands to create Hadoop

Basically a parallel processing framework running on a cluster of commodity machines

Stateless functional programming because processing of each row of data does not depend upon any other row or any state

Divide and conquer parallel processing. Data gets partitioned and each partition gets processed by a separate mapper or reducer task.

STORM AT 30,000 FT

Clustered framework for scalable real time stream processing

Like Hadoop, parallel processing framework running on cluster of commodity machines

Instead of processes as in Hadoop, uses a combination of processes and threads for parallelism

Unlike 2 processing stages in Hadoop (map and reduce) there can be multiple processing stages defined in a Storm topology.

Unlike a Hadoop job, a topology once deployed runs continuously.

REDIS AT 30,000 FT

It’s a wonderful glue for the Big Data ecosystem

Can be thought of as a distributed data structure server

Can be used as a list, queue, cache etc.

Supports master slave replication

There is no sharding support

RECOMMENDATION SYSTEMS

• You know recommender systems if you have visited Amazon or Netflix.

• Very computationally intensive, ideal for Big Data processing.

• In memory-based recommendation engines, the entire data set is used directly, e.g., user behavior based recommendation (a.k.a. social recommendation) or content based recommendation. This is our focus.

• In model based recommendation, a model is built first by training on the data and then predictions are made, e.g., Bayesian, decision tree

CONTENT BASED RECOMMENDATION

Recommendation is based on innate attributes of items under consideration

Each item is considered to be a point in an n dimensional feature space, where the item has n attributes

Distance between items in n dimensional space is computed to find similarities between items.

Similarity is inversely proportional to distance

Attributes can be numerical, categorical or text.

Not effective for cross sell recommendation

Essential for bootstrapping a recommender system

CONTENT BASED RECOMMENDATION

Distance between numerical attributes is simply the difference in values

Distance between categorical attributes is 0 if same, 1 otherwise

Distance between text attributes is based on either Jaccard distance or cosine distance

Distances between corresponding attributes are aggregated to find the distance between items

Different weights can be assigned to different attributes in the aggregation to control the contribution of a particular attribute
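The per-attribute distances and the weighted aggregation on this slide can be sketched in Python. This is a minimal illustration under stated assumptions, not the Sifarish implementation: the schema layout, the function names, the normalization of numerical differences by a value range, and the weighted-mean aggregation are all hypothetical choices.

```python
def jaccard_distance(text_a, text_b):
    """1 - (shared words / all words), computed on lower-cased word sets."""
    a, b = set(text_a.lower().split()), set(text_b.lower().split())
    if not a and not b:
        return 0.0
    return 1.0 - len(a & b) / len(a | b)


def item_distance(item_a, item_b, schema):
    """Weighted mean of per-attribute distances.

    schema maps attribute name -> (kind, weight, value_range). value_range is
    used only for numerical attributes (an assumed normalization so that every
    attribute distance falls in [0, 1]).
    """
    total, weight_sum = 0.0, 0.0
    for name, (kind, weight, value_range) in schema.items():
        a, b = item_a[name], item_b[name]
        if kind == "numerical":
            d = abs(a - b) / value_range
        elif kind == "categorical":
            d = 0.0 if a == b else 1.0
        else:  # text
            d = jaccard_distance(a, b)
        total += weight * d
        weight_sum += weight
    return total / weight_sum


# Hypothetical schema and items, for illustration only.
schema = {
    "price": ("numerical", 2.0, 100.0),
    "brand": ("categorical", 1.0, None),
    "title": ("text", 1.0, None),
}
phone_a = {"price": 80.0, "brand": "acme", "title": "slim smart phone"}
phone_b = {"price": 60.0, "brand": "acme", "title": "smart phone case"}
distance = item_distance(phone_a, phone_b, schema)
```

Raising the weight of an attribute (here `price`) makes differences in that attribute dominate the aggregated distance.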

COLD START

When bootstrapping a business no user behavior data is available.

Content based recommendation is the only option.

Distance calculation is performed between user profile and items.

Two different kinds of entities. Attributes from one entity are mapped to attributes of the other entity.

User profile may have been provided explicitly by the user or derived from user behavior, e.g., pages visited, search terms etc.

WARM START

Refers to the case when some limited amount of interaction data is available

The user may have browsed and / or bought some item

We use content based recommendation again, but we find similarities between items of the same type (e.g., product)

Use SameTypeSimilarity MR to find the distance between pairs of items for all possible pairs

SOCIAL RECOMMENDATION

• Customers are fully engaged and significant amount of user behavior data is available

• Recommendation algorithms are based on user behavior data only

• Consider a matrix of users and items, a.k.a. the utility matrix. Items are rows and users are columns. The matrix is sparse

• The cell value could be boolean, e.g., whether the user has purchased an item or shown interest in some way

• The cell value could also be numeric, representing a rating. The rating could be explicit or derived from user behavior data

SOCIAL RECOMMENDATION

• The purpose of recommenders is to fill in the blanks in the utility matrix

• If a user has rated A, then enough users must have rated A as well as other items for recommendation to be effective

• Effective in cross sell recommendation.

• The utility matrix is dynamic, causing drift in the underlying model.

• Periodic re-computation is necessary, depending upon the rate of change

DISTANCE BASED SOCIAL RECOMMENDATION

• Consider rows of the utility matrix, which are item vectors. The vector is n dimensional if there are n users

• We can find distances between pairs of item vectors

ITEM CORRELATION

• We can find distances between pairs of item vectors, using the distance algorithms discussed earlier.

• ItemDynamicAttributeSimilarity is the MR used. Distance or correlation algorithm can be configured to Jaccard, Cosine or Pearson.

• This is known as item based correlation. The other, although less preferred, approach is user based correlation.
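A small Python sketch of item based correlation follows. It treats each item vector as a sparse mapping from user ID to rating and shows two of the algorithms named above. The function names and the sample ratings are assumptions for illustration; ItemDynamicAttributeSimilarity itself is a Hadoop MR job and may compute these differently.

```python
import math


def cosine_similarity(vec_a, vec_b):
    """Cosine similarity of two sparse item vectors (dict: userID -> rating)."""
    common = set(vec_a) & set(vec_b)
    dot = sum(vec_a[u] * vec_b[u] for u in common)
    norm_a = math.sqrt(sum(r * r for r in vec_a.values()))
    norm_b = math.sqrt(sum(r * r for r in vec_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def jaccard_similarity(vec_a, vec_b):
    """Jaccard similarity on the sets of users who interacted with each item."""
    users_a, users_b = set(vec_a), set(vec_b)
    union = users_a | users_b
    return len(users_a & users_b) / len(union) if union else 0.0


# Hypothetical rating values for two items rated by overlapping users.
i2 = {"u1": 4, "u4": 3, "u5": 5}
i6 = {"u1": 4, "u3": 2, "u5": 5}
cosine = cosine_similarity(i2, i6)
jaccard = jaccard_similarity(i2, i6)
```

Jaccard ignores the rating values and suits boolean cells; cosine uses the values and suits numeric ratings.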

UTILITY MATRIX

Item/User  u1   u2   u3   u4   u5   u6
i1         -    -    -    -    -    -
i2         r21  -    -    r24  r25  -
i3         r31  r32  r33  -    -    -
i4         -    -    r43  -    -    r46
i5         -    -    -    r54  r55  -
i6         r61  -    r63  -    r65  -
i7         -    r72  -    r74  r75  -

IMPLICIT RATING ESTIMATE

• Generally users don’t explicitly rate items. Explicit ratings also tend to be biased, because users with extreme views tend to rate more

• The MR ImplicitRatingEstimator converts user engagement data (e.g., browsing product description page, product review page, placing item in shopping cart etc.) to a rating value.

• This is an optional processing phase, necessary only when explicit rating data is not available

RATING PREDICTOR

• Based on the rating by a user u1 for item i1, the rating for an item i2 is predicted using the correlation between i1 and i2

• The MR job for rating prediction is UtilityPredictor

• The correlation between items can be multiplicative or additive. The type of correlation to be used can be set through a configuration parameter.

• For multiplicative correlation, the algorithms are Jaccard, Cosine or Pearson, as mentioned earlier.

• The next slide is on additive correlation
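For the multiplicative case, one simple reading is that a predicted rating is the known rating scaled by the item correlation. The sketch below assumes exactly that, with correlations in [0, 1]; the function name and data layout are hypothetical, and UtilityPredictor may apply additional scaling.

```python
def predict_ratings(user_ratings, correlations):
    """Return (target_item, source_item, predicted_rating) triples.

    user_ratings: dict itemID -> rating given by one user.
    correlations: dict itemID -> dict of correlated itemID -> correlation.
    Assumed rule: predicted rating = known rating * correlation.
    """
    predictions = []
    for source, rating in user_ratings.items():
        for target, corr in correlations.get(source, {}).items():
            if target not in user_ratings:  # only predict unrated items
                predictions.append((target, source, rating * corr))
    return predictions


# Hypothetical correlations for item i1.
correlations = {"i1": {"i2": 0.8, "i3": 0.5}}
predictions = predict_ratings({"i1": 4.0}, correlations)
```

Each rated item can produce a prediction for the same target item, which is why a later aggregation step is needed.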

ADDITIVE ITEM CORRELATION

• Also known as Slope One Recommender

• If a set of users have rated two items i1 and i2, we find the average rating difference between i1 and i2.

• If a user has a rating for i2, we can predict the rating for i1 based on the average difference

• The steps can be repeated, e.g. find the average rating difference between i3 and i1 and, if the user has a rating for i3, get another prediction for the rating of i1.
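The steps above can be sketched as a plain, unweighted Slope One predictor. The data layout, function names and sample ratings are assumptions; production variants usually weight each difference by the number of users who rated both items.

```python
def average_difference(ratings, target, source):
    """Average of (rating[target] - rating[source]) over users who rated both.

    ratings: dict userID -> dict itemID -> rating.
    Returns None when no user has rated both items.
    """
    diffs = [r[target] - r[source] for r in ratings.values()
             if target in r and source in r]
    return sum(diffs) / len(diffs) if diffs else None


def slope_one_predict(ratings, user, target):
    """Average the per-item predictions (source rating + average difference)."""
    predictions = []
    for source, rating in ratings[user].items():
        diff = average_difference(ratings, target, source)
        if source != target and diff is not None:
            predictions.append(rating + diff)
    return sum(predictions) / len(predictions) if predictions else None


# Hypothetical ratings: u3 has not rated i1.
ratings = {
    "u1": {"i1": 5, "i2": 3, "i3": 2},
    "u2": {"i1": 3, "i2": 4},
    "u3": {"i2": 2, "i3": 1},
}
predicted = slope_one_predict(ratings, "u3", "i1")
```

Here i1 is rated on average 0.5 higher than i2 and 3 higher than i3, so the two predictions for u3 are 2.5 and 4, which average to 3.25.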

AGGREGATION OF PREDICTED RATING

• If a user u1 has rated items i1, i2, ..i5, all of them could be correlated to an item i9. All 5 items will contribute towards the prediction of the rating for item i9

• The MR UtilityAggregator aggregates predicted ratings.

• We can take either the average or the median of all predicted ratings. The choice can be made through configuration
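A minimal sketch of the aggregation step, assuming the predictions for one user arrive as (itemID, rating) pairs. The `method` argument stands in for the configuration option and its values are hypothetical.

```python
from statistics import mean, median


def aggregate_predictions(predictions, method="average"):
    """Collapse several predicted ratings per item into one value.

    predictions: iterable of (itemID, predicted_rating) for a single user.
    method: "average" or "median" (hypothetical configuration values).
    """
    by_item = {}
    for item, rating in predictions:
        by_item.setdefault(item, []).append(rating)
    combine = median if method == "median" else mean
    return {item: combine(values) for item, values in by_item.items()}


# Five hypothetical predictions for item i9, one per rated item.
predictions = [("i9", 4.0), ("i9", 3.0), ("i9", 5.0), ("i9", 1.0), ("i9", 4.0)]
by_average = aggregate_predictions(predictions)
by_median = aggregate_predictions(predictions, method="median")
```

The median is less sensitive to a single outlying prediction (the 1.0 above) than the average.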

BUSINESS GOAL INJECTION

• This is an optional processing phase, where items are associated with scores indicative of business interest (e.g. preferring items with excess inventory) in recommending an item

• Final recommendation score is a weighted average between predicted rating and the business goal score. The relative weights are configurable.

• The MR for this processing is BusinessGoalInjector
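The weighted average can be written out as a short function. This is only a sketch: it assumes the predicted rating and the business goal score are on the same scale, and the default weights are illustrative values, not Sifarish defaults.

```python
def final_score(predicted_rating, business_score,
                rating_weight=0.7, business_weight=0.3):
    """Weighted average of predicted rating and business goal score.

    Both inputs are assumed to share one scale; the weights are hypothetical
    stand-ins for the configurable relative weights.
    """
    total = rating_weight + business_weight
    return (rating_weight * predicted_rating
            + business_weight * business_score) / total


score = final_score(4.0, 2.0)
```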

GROUP BY USER

• This is an optional task that groups the recommended items produced by the processing steps discussed so far by user ID

• The MR class TextSorter performs this task

TIME SENSITIVE RECOMMENDATION

• A timestamp is associated with rating data. Each cell in the rating matrix has an associated timestamp.

• When processing, past rating data beyond a specified time window is discarded.

• Time window can be specified as a configuration parameter.
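Discarding ratings outside the time window amounts to a filter on the timestamp. The row layout and parameter names below are assumptions for illustration.

```python
def within_window(ratings, now, window_seconds):
    """Keep only (userID, itemID, rating, timestamp) rows inside the window."""
    cutoff = now - window_seconds
    return [row for row in ratings if row[3] >= cutoff]


# Hypothetical rows with timestamps in seconds.
ratings = [
    ("u1", "i1", 4, 1_000),
    ("u1", "i2", 3, 5_000),
    ("u2", "i1", 5, 9_000),
]
recent = within_window(ratings, now=10_000, window_seconds=6_000)
```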

USER SEGMENTATION

• When the user population is not homogeneous, it is better to segment the users by clustering or other means

• Separate utility matrix should be built for each segment.

• Ratings should be predicted for each segment separately by running the MR pipeline for each segment

KEY DISTINGUISHING FEATURES OF SIFARISH

• Implicit rating generation from explicit user engagement events for social recommendation

• Semantic matching using RDF model for knowledge representation for content based recommendation

• Supports time window, location attributes for content based recommendation

• Time sensitive social recommendation

• Business goal infused social recommendation

• Real time social recommendation

• Serendipity and novelty in social recommendation (planned)

[Figure: architecture diagram. Components: Application Servers, HDFS, MapReduce (multiple jobs), Redis Cache, Redis Queue, Redis Queue or Cache, Storm. The diagram is split into BATCH PROCESSING and REAL TIME PROCESSING regions, with data flows numbered 1 to 9.]

REAL TIME RECOMMENDATION PROCESSING FLOW

• 1 - Copy historical event click stream data to HDFS

• 2 - Copy output of multiple MR jobs, i.e., the item correlation matrix, to Redis cache. This needs to be done whenever the correlation matrix is recomputed

• 3 - Copy event mapping metadata to Redis cache. This is a one time operation.

• 4 - Write real time event click stream data to Redis queue

• 5 - Storm consumes event mapping metadata from Redis cache when the Storm topology starts up.

REAL TIME RECOMMENDATION PROCESSING FLOW

• 6 - Storm consumes item correlation matrix from Redis cache

• 7 - Storm consumes event click stream data from Redis queue

• 8 - Storm writes recommended items for a user to Redis queue or cache

• 9 - Application server consumes recommended items from Redis queue or cache

REAL TIME RECOMMENDATION PROCESSING

• Only recent user engagement data is used. Recency is defined per session, by time window or event count.

• However, historical user engagement event data is used to compute the item correlation matrix using Hadoop.

• Historical user engagement event data is converted to implicit rating by Hadoop MR which is consumed by several more Hadoop MR to generate the item correlation matrix.

• Item correlation matrix is saved in Redis as a map for later consumption by Storm

REAL TIME RECOMMENDATION PROCESSING

• Storm ingests real time user engagement click stream data from a Redis queue and uses the item correlation matrix generated by Hadoop to make real time recommendations

• Storm writes recommended items to another Redis queue or cache

• In the next several slides we will go through some details of the steps involved

GENERATE IMPLICIT RATING

• As mentioned earlier this is generated by a Hadoop MR ImplicitRatingEstimator.

• Uses preprocessed click stream data consisting of (userID, sessionID, eventType, timestamp).

• There are different event types indicative of the user’s level of intent or interest for an item, e.g., purchased item, in checkout, placed in shopping cart, browsed from search results etc.

• Events with the strongest intent level are extracted from the click stream along with the counts for such events. This information is mapped to an implicit rating based on some heuristics.
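The heuristic can be sketched in Python: for each user and item, find the event type with the strongest intent, count its occurrences, and look up a rating. Everything concrete here is an assumption: the event type names, the intent ordering, the count thresholds, and the presence of an itemID in each event (the tuple on the slide does not list one).

```python
from collections import defaultdict

# Hypothetical intent levels (higher = stronger).
INTENT = {"browse": 1, "cart": 2, "checkout": 3, "purchase": 4}
# Hypothetical rules: event type -> [(minimum count, rating)], ascending;
# the last rule whose minimum count is met wins.
RATING_RULES = {
    "browse": [(1, 1), (3, 2)],
    "cart": [(1, 3)],
    "checkout": [(1, 4)],
    "purchase": [(1, 5)],
}


def implicit_ratings(events):
    """events: iterable of (userID, itemID, eventType).

    Returns a dict mapping (userID, itemID) -> implicit rating.
    """
    counts = defaultdict(lambda: defaultdict(int))
    for user, item, event_type in events:
        counts[(user, item)][event_type] += 1
    ratings = {}
    for key, by_type in counts.items():
        strongest = max(by_type, key=INTENT.get)  # strongest intent seen
        rating = 0
        for min_count, value in RATING_RULES[strongest]:
            if by_type[strongest] >= min_count:
                rating = value
        ratings[key] = rating
    return ratings


events = [
    ("u1", "i1", "browse"), ("u1", "i1", "browse"), ("u1", "i1", "browse"),
    ("u1", "i2", "browse"), ("u1", "i2", "cart"),
    ("u2", "i1", "purchase"),
]
ratings = implicit_ratings(events)
```

In this sketch weaker events are ignored once a stronger one is present, so a single cart event outranks any number of browse events.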

CONVERTING IMPLICIT RATING TO A COMPACT FORM

• Implicit rating generated in the previous step is of the format (userID, itemID, rating)

• However, the item correlation generating MR ItemDynamicAttributeSimilarity expects data in a compact format, as (itemID1, userID1:rating1, userID2:rating2,..)

• The format conversion is done through the Hadoop MR CompactRatingFormatter. It’s essentially a group by operation.
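The group by operation can be illustrated in a few lines of Python. This is a single-process sketch of what CompactRatingFormatter does as a MapReduce job; the function name and the comma separated text output are assumptions.

```python
from collections import defaultdict


def compact_ratings(rows):
    """Group (userID, itemID, rating) rows by item into compact text lines."""
    by_item = defaultdict(list)
    for user, item, rating in rows:
        by_item[item].append(f"{user}:{rating}")
    return [",".join([item] + pairs) for item, pairs in sorted(by_item.items())]


rows = [("u1", "i1", 4), ("u2", "i1", 2), ("u1", "i2", 5)]
lines = compact_ratings(rows)
```

In MapReduce terms, the mapper would emit the item ID as the key and the reducer would concatenate the user:rating pairs for each key.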

ITEM CORRELATION

• The MR ItemDynamicAttributeSimilarity generates item correlation with the output format (itemID1, itemID2, corr1)

• There are many configuration parameters involved, the most important being the correlation algorithm, the choices being Jaccard, Cosine and Pearson

• For real time processing the correlation data needs to be in a sparse matrix form.

• The MR CorrelationMatrixBuilder does the necessary transformation, with the output being of the format (itemID1, itemID2:corr1, itemID3:corr2, …)
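The transformation from pairwise correlations to one row per item can be sketched as follows. The function name is hypothetical, and the sketch assumes correlation is symmetric so that each pair fills two rows; whether CorrelationMatrixBuilder does the same is not stated on the slide.

```python
from collections import defaultdict


def build_correlation_rows(pairs):
    """Turn (itemA, itemB, corr) pairs into one sparse row per item.

    Correlation is assumed symmetric, so each pair contributes to both rows.
    """
    rows = defaultdict(dict)
    for item_a, item_b, corr in pairs:
        rows[item_a][item_b] = corr
        rows[item_b][item_a] = corr
    return dict(rows)


pairs = [("i1", "i2", 0.8), ("i1", "i3", 0.5), ("i2", "i3", 0.3)]
matrix = build_correlation_rows(pairs)
```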

CACHING ITEM CORRELATION

• The item correlation matrix is loaded into a Redis map using a Python script. The map key is the item ID and the value is the list of correlated item IDs along with the corresponding correlation coefficients

• A Storm bolt reads the correlated items and coefficients from Redis when it receives a new user engagement tuple from the Redis queue. The Storm bolt also caches the correlation values in an in-memory Google Guava cache.
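A loading script along these lines might look like the sketch below. It is not the script shipped with Sifarish: the hash name `itemCorrelation`, the `item:corr` text encoding and both function names are assumptions. The `client` argument is expected to be a `redis.Redis` instance from the redis-py package; only its `hset(name, key, value)` method is used.

```python
def load_correlation_rows(client, rows, map_name="itemCorrelation"):
    """Store each matrix row in a Redis hash: field = itemID, value = row text.

    client: an object with redis-py's hset(name, key, value) method.
    map_name: hypothetical name of the Redis hash.
    """
    for item, correlated in rows.items():
        value = ",".join(f"{other}:{corr}" for other, corr in correlated.items())
        client.hset(map_name, item, value)


def parse_row(value):
    """Inverse of the encoding above: 'i2:0.8,i3:0.5' -> {'i2': 0.8, 'i3': 0.5}."""
    row = {}
    if not value:
        return row
    for entry in value.split(","):
        other, corr = entry.split(":")
        row[other] = float(corr)
    return row
```

With redis-py, the call would be `load_correlation_rows(redis.Redis(decode_responses=True), rows)`; without `decode_responses=True`, `hget` returns bytes that must be decoded before `parse_row`.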

CACHING USER EVENT TO RATING MAPPING METADATA

• This mapping metadata is used by the Storm bolt to convert real time user engagement event data to implicit ratings

• The metadata JSON file content is loaded into a Redis cache by a Python script.

• This is a one time operation. However, it needs to be reloaded if the metadata is changed.

STORM PROCESSING

• A Storm Redis Spout consumes user event data from a Redis queue.

• The event data is distributed across multiple Storm Bolt instances. The data is partitioned by userID (field grouped in Storm terminology)

• On receipt of the event data, the bolt estimates a rating based on recent user engagement event click stream data

• It also looks up the corresponding row of the item correlation matrix from the in memory Google Guava cache using itemID as the key

• Guava cache loads from Redis cache in case of cache miss.

STORM PROCESSING

• Predicted ratings are calculated for items correlated with the item in the user event using the estimated rating and the item correlation row vector

• The predicted rating vector is aggregated with the cumulative predicted rating vector

• The cumulative predicted rating vectors are sorted by rating value and the top n items along with associated predicted ratings are written to a Redis queue.

• Output is written to the Redis queue in the format (userID1, itemID1:rating1, itemID2:rating2)

• Optionally, the recommendation output can be written to a Redis cache with userID as the key and recommended items as the value
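The per-event bolt logic can be sketched in Python, outside of Storm. The function names are hypothetical, and summing predictions into the cumulative vector is an assumed aggregation rule, since the slides do not say how the vectors are combined.

```python
import heapq


def update_recommendations(cumulative, estimated_rating, correlation_row, top_n=3):
    """Fold one event into a user's cumulative predicted ratings and rank them.

    cumulative: dict itemID -> accumulated predicted rating (mutated in place).
    correlation_row: dict itemID -> correlation with the item in the event.
    """
    for item, corr in correlation_row.items():
        cumulative[item] = cumulative.get(item, 0.0) + estimated_rating * corr
    return heapq.nlargest(top_n, cumulative.items(), key=lambda kv: kv[1])


def format_output(user, ranked):
    """Render as userID,itemID1:rating1,itemID2:rating2,..."""
    return ",".join([user] + [f"{item}:{rating:.2f}" for item, rating in ranked])


# Two hypothetical events for user u1.
cumulative = {}
update_recommendations(cumulative, 4.0, {"i2": 0.5, "i3": 0.25})
ranked = update_recommendations(cumulative, 2.0, {"i3": 0.75, "i4": 0.5}, top_n=2)
line = format_output("u1", ranked)
```

Because events are field grouped by userID, all events for one user reach the same bolt instance, so the cumulative vector can live in that instance's memory.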

EVENT CLICK STREAM

• The Storm bolt maintains a window of recent user engagement event click stream data in an in-memory cache

• The click stream data expiry in the window can be managed in several ways, driven by configuration

• If data is expired by session, whenever a new session is encountered for a user, the window is cleared

• If data is expired by time span, any event data older than the time span is discarded from the window

• If data is expired by a maximum count, older data is discarded from the window when the window size exceeds the limit
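The three expiry policies can be sketched as a small per-user window class. The class name, the policy names and the event tuple layout are assumptions; the actual bolt is written for Storm and is configured differently.

```python
from collections import deque


class EventWindow:
    """Per-user window of recent events with three expiry policies.

    policy: "session", "time" or "count" (hypothetical configuration values).
    """

    def __init__(self, policy, max_age=None, max_count=None):
        self.policy, self.max_age, self.max_count = policy, max_age, max_count
        self.events = deque()  # (sessionID, itemID, timestamp), oldest first
        self.session = None

    def add(self, session_id, item_id, timestamp):
        """Add one event, expire old ones, and return the current window."""
        if self.policy == "session" and session_id != self.session:
            self.events.clear()  # new session: start over
        self.session = session_id
        self.events.append((session_id, item_id, timestamp))
        if self.policy == "time":
            while self.events and timestamp - self.events[0][2] > self.max_age:
                self.events.popleft()
        elif self.policy == "count":
            while len(self.events) > self.max_count:
                self.events.popleft()
        return list(self.events)
```

For example, `EventWindow("count", max_count=2)` keeps only the two most recent events, while `EventWindow("time", max_age=600)` drops events more than 600 time units older than the latest one.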

RESOURCES

• Sifarish GitHub repository: https://github.com/pranab/sifarish

• Various related blog posts on Sifarish for details: http://pkghosh.wordpress.com/?s=recommendation

THANK YOU

Q & A

http://www.linkedin.com/in/pkghosh/

All Details @ Sifarish.org
