Discovery

A FINDERBOTS.COM PRODUCTION

The Guide to Predictive Analytics

DISCOVERY


FINDERBOTS.COM

• Independent Consulting Service

• Specialize in Big-data Predictive Analytics

• Recommenders

• Personalized discovery

• Search optimization and personalization

• Committer to open source machine learning projects (Apache Mahout, Finderbots Solr-recommender)

Pat Ferrel

[email protected]


DISCOVERY:

• Browse

• editorial categories

• user generated content—tags, hashtags, comments, likes, shares

• realtime predictive analytics driven “concepts”

• Search

• keywords are not enough

• inferred keywords (from usage data)

• personalized search (from collaborative filtering data, just like Google)

• Recommendations

• profile based, content based, usage based

• entire catalog can be skewed by predictive analytics

• required

• why?


Netflix—80% of views

Amazon—60% of sales

Yahoo News—40% increase in TOS

Better Discovery = Better Engagement


NOT JUST RECOMMENDATIONS

Pervasive Content Personalization


• Search for “leather laptop bag”

• Hmm, some are ok but not quite right

• Put some in “wishlist”

• Look at recommendations

• Add and remove as you like… things improve!

• Never knew I wanted a “Messenger bag with a leather strap”

• Didn’t know what one was so would never have searched for it

RECOMMENDATIONS CAN DO WHAT SEARCH CANNOT


• Search for “leather laptop bag”

• Buy “leather messenger bag with leather strap”

• With the right usage data we can infer “messenger bag” = “laptop bag”

• Now the words I know will get me the object I want, even though I didn’t know how to ask for it

SEARCH THAT KNOWS WHAT THE USER MEANS


THE CUTTING EDGE IN PREDICTIVE ANALYTICS

• Uses any number of user actions—entire user clickstream

• Uses metadata—from user profile or item

• Uses context—on-site, time, location

• Uses content—unstructured text or semi-structured

• Personalizes recommendations even when content-based

• Mixes any number of “indicators” to increase quality or tune to specific context

• Solves the “cold-start” problem—items with too short a lifespan

• Can recommend to new users in realtime

• Improves Search

• Personalizes Search


THE GOOD NEWS

• 90% of these features come from 3 technologies

• Search engine (Solr, Elasticsearch)

• Mahout

• Spark

• 90% of the flexibility comes at runtime via query—not from new analytical models.


Technical Overview

THE UNIVERSAL RECOMMENDER


ARCHITECTURE

[Architecture diagram. Background: action logging writes action logs to HDFS, and Mahout 1.0 spark-itemsimilarity (on Spark) turns them into cooccurrence indicators. Content or metadata from catalog creation and editing is kept in a scalable store (HDFS or DB); it serves as intrinsic indicators or is turned into content indicators by Mahout 1.0 spark-rowsimilarity. The search engine indexes the indicators. Realtime: the application sends a recs request as a query to the search engine and receives recommendations.]


ANATOMY OF A RECOMMENDATION

r = recommendations

hp = a user’s history of some primary action (purchase for instance)

P = the history of all users’ primary action; rows are users, columns are items

[PtP] = compares column to column using log-likelihood based cooccurrence

r = hp[PtP]
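A minimal sketch of r = hp[PtP] on a toy data set may make the shapes concrete. It uses raw cooccurrence counts in place of the log-likelihood weighting, and the item ids, the `purchases` data, and both helper names are illustrative assumptions rather than anything from Mahout.

```python
from collections import defaultdict
from itertools import permutations

# Hypothetical purchase histories: one set of item ids per user (the rows of P).
purchases = {
    "u1": {"laptop-bag", "messenger-bag"},
    "u2": {"laptop-bag", "messenger-bag", "laptop-sleeve"},
    "u3": {"messenger-bag", "laptop-sleeve"},
}

def cooccurrence(histories):
    """Count, for each ordered pair of distinct items, the users who acted on both.

    This is a raw-count stand-in for [PtP] (the diagonal is left out).
    """
    counts = defaultdict(lambda: defaultdict(int))
    for items in histories.values():
        for a, b in permutations(sorted(items), 2):
            counts[a][b] += 1
    return counts

def recommend(history, indicator):
    """r = hp[PtP]: add up the indicator rows of every item in the user's history."""
    scores = defaultdict(int)
    for item in history:
        for other, weight in indicator[item].items():
            if other not in history:  # skip items the user already has
                scores[other] += weight
    # Highest score first, ties broken by item id.
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))

ptp = cooccurrence(purchases)
recs = recommend({"laptop-bag"}, ptp)
```

Here a user who bought only a laptop bag gets the messenger bag ranked ahead of the laptop sleeve, because more laptop-bag buyers also bought it.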


THE UNIVERSAL RECOMMENDER

• Virtually all collaborative filtering type recommenders can use only one indicator of preference—one action

• But the theory doesn’t stop there

• Virtually all user actions can be used to improve recommendations—purchase, view, category view…

r = hp[PtP]

r = hp[PtP] + hv[VtP] + hc[CtP] + …


A COOCCURRENCE INDICATOR

• [PtP] is an indicator matrix for some primary action like purchase

• The input P has rows = users, columns = items, boolean data

• Compares cooccurring interactions using the log-likelihood ratio—column-wise similarity

• LLR finds important cooccurrences and filters out the rest

• Comparing the history of the primary action to other actions finds the secondary actions that lead to the primary—the effect is to scrub secondary actions of non-meaningful ones
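The log-likelihood ratio test behind these bullets can be sketched in a few lines. This is the standard entropy-based form of the test for a 2x2 contingency table, written from scratch; the function names and the example counts are assumptions for illustration, not Mahout's API.

```python
import math

def x_log_x(x):
    """x * ln(x), with the usual convention that 0 * ln(0) = 0."""
    return 0.0 if x == 0 else x * math.log(x)

def entropy(*counts):
    """Unnormalized Shannon entropy of a group of counts."""
    return x_log_x(sum(counts)) - sum(x_log_x(c) for c in counts)

def llr(k11, k12, k21, k22):
    """Log-likelihood ratio for two items A and B.

    k11: users who acted on both A and B
    k12: users who acted on A but not B
    k21: users who acted on B but not A
    k22: users who acted on neither
    """
    row = entropy(k11 + k12, k21 + k22)
    col = entropy(k11 + k21, k12 + k22)
    mat = entropy(k11, k12, k21, k22)
    # Clamp tiny negative values caused by floating-point rounding.
    return max(0.0, 2.0 * (row + col - mat))

# Made-up counts: A and B always occur together vs. occur independently.
strong = llr(10, 0, 0, 10)
independent = llr(10, 10, 10, 10)
```

A pair whose score is near zero, like `independent`, cooccurs no more than chance would predict and can be dropped; only pairs with a high score are kept as indicators.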


CROSS-COOCCURRENCE INDICATORS

hi = a user’s history of an action

P, V, C = all users’ history of some action (purchase, view, category view)

[XtP] = the pairwise comparison of column to column, where X is P, V, or C; the comparison may be across two actions but is always anchored by the primary action

r = hp[PtP] + hv[VtP] + hc[CtP] + …
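To illustrate the sum above, the sketch below builds raw-count versions of [PtP] and [VtP] and scores a new user who has viewed items but bought nothing. The counts stand in for LLR-filtered indicators, and the data and function names are hypothetical.

```python
from collections import defaultdict

# Hypothetical histories for two actions, keyed by user id.
purchases = {"u1": {"messenger-bag"}, "u2": {"messenger-bag"}, "u3": {"laptop-sleeve"}}
views = {
    "u1": {"laptop-bag", "messenger-bag"},
    "u2": {"laptop-bag"},
    "u3": {"laptop-sleeve", "laptop-bag"},
}

def cross_cooccurrence(secondary, primary):
    """Raw-count [XtP]: for each item of the secondary action, count the users
    who also performed the primary action on each primary item."""
    counts = defaultdict(lambda: defaultdict(int))
    for user, secondary_items in secondary.items():
        for s in secondary_items:
            for p in primary.get(user, ()):
                counts[s][p] += 1
    return counts

def add_scores(history, indicator, scores):
    """Accumulate one term of the sum, e.g. hv[VtP], into scores."""
    for item in history:
        for target, weight in indicator.get(item, {}).items():
            scores[target] += weight
    return scores

ptp = cross_cooccurrence(purchases, purchases)  # the primary action against itself
vtp = cross_cooccurrence(views, purchases)      # views anchored by purchases

# A new user: no purchases yet, one viewed item.
scores = defaultdict(int)
add_scores(set(), ptp, scores)           # hp[PtP] contributes nothing
add_scores({"laptop-bag"}, vtp, scores)  # hv[VtP] still produces candidates
```

Because every indicator is anchored by the primary action, every term scores the same kind of item (things to purchase), so the terms can simply be added.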


CROSS-COOCCURRENCE SO WHAT?

• The user’s entire clickstream can be used

• Items clicked

• Terms searched

• Categories viewed

• Items shared

• People followed

• Items liked or disliked

• Video watched

• Virtually any action the user can take makes it easier to predict what they will like in the future.


FROM INDICATOR TO RECOMMENDATION

• r = hp[PtP] actually means: take the user’s history hp and compare it to rows of the indicator matrix [PtP]

• TF-IDF weighting of indicators would be nice to mitigate popular items

• Query the indicator with user history

• Sort these by similarity strength and keep only the highest—you have recommendations

• Sound familiar?

• That is exactly what a search engine does—except for calculating indicators

r = hp[PtP]
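The search-engine analogy can be sketched without a search engine. Below, each item is a "document" whose `purchase` field holds its indicator item ids, the user's history is the query, and matches are weighted by a simple inverse document frequency. The index contents, the IDF formula, and the function names are assumptions for illustration; a real engine such as Solr or Elasticsearch applies its own scoring.

```python
import math

# Hypothetical index: one document per item; the "purchase" field lists the
# item ids kept as that item's cooccurrence indicators.
index = {
    "messenger-bag": {"purchase": ["laptop-bag", "laptop-sleeve"]},
    "laptop-sleeve": {"purchase": ["laptop-bag", "messenger-bag"]},
    "phone-case": {"purchase": ["laptop-bag"]},
    "laptop-bag": {"purchase": ["messenger-bag"]},
}

def idf(term, field):
    """Down-weight terms that appear in many documents (popular items)."""
    df = sum(1 for doc in index.values() if term in doc[field])
    return math.log(1 + len(index) / df) if df else 0.0

def query(history, field, top_n=2):
    """Score each document by the IDF-weighted overlap with the user's history."""
    scores = {}
    for item, doc in index.items():
        if item in history:
            continue  # do not return items the user already has
        score = sum(idf(term, field) for term in history if term in doc[field])
        if score > 0:
            scores[item] = score
    # Sort by similarity strength and keep only the highest.
    return sorted(scores, key=lambda i: (-scores[i], i))[:top_n]

recs = query({"laptop-bag", "laptop-sleeve"}, "purchase")
```

The popular "laptop-bag" term matches three documents and so counts for less than the rarer "laptop-sleeve" term, which is the mitigation the TF-IDF bullet asks for.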


INDICATOR TYPES

• Cooccurrence and cross-cooccurrence

• Calculated from user actions as discussed

• Create with Mahout 1.0 spark-itemsimilarity

• Content or metadata

• Tags, categories, description text, anything describing an item

• Create with Mahout 1.0 spark-rowsimilarity

• Intrinsic

• Tags, genres, categories, popularity rank, geo-location, anything describing an item

• Some may be derived from usage data like popularity rank, or hotness

• Is a known or specially calculated property of the item


CONTENT INDICATORS

• Finds similar items based on their content—not which users preferred them

• Examples: text descriptions, tags, categories, genres

r = recommended items, based on tags

ht = a user’s history of an action on items with tags

[TTt] = item similarity based on similar tags—a content indicator

• This personalizes even content based recommendations

r = ht[TTt]
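A small sketch of r = ht[TTt] follows. Jaccard overlap of tag sets is swapped in here as the item-to-item similarity purely for brevity; it is not necessarily what spark-rowsimilarity computes. The tags, item ids, and function names are made up.

```python
# Hypothetical item tags (rows of T are items, columns are tags).
tags = {
    "messenger-bag": {"leather", "bag", "strap"},
    "laptop-bag": {"leather", "bag", "padded"},
    "laptop-sleeve": {"padded", "neoprene"},
    "wallet": {"leather"},
}

def tag_similarity(a, b):
    """Jaccard overlap of two tag sets, standing in for one entry of [TTt]."""
    return len(tags[a] & tags[b]) / len(tags[a] | tags[b])

def recommend_by_tags(history, top_n=2):
    """r = ht[TTt]: score each candidate by its tag similarity to the items
    in this user's own history, which is what personalizes the result."""
    scores = {}
    for candidate in tags:
        if candidate in history:
            continue
        score = sum(tag_similarity(h, candidate) for h in history)
        if score > 0:
            scores[candidate] = score
    return sorted(scores, key=lambda i: (-scores[i], i))[:top_n]

recs = recommend_by_tags({"laptop-bag"})
```

No other user's behavior is involved, so this works for items that have no usage data yet.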


INTRINSIC INDICATORS

• Attributes of items

• Genre, subject, category, tags

• Specially calculated based on business rules

• Popularity, hotness

• Based on demographics

• Preferred by people using mobile access

• Preferred by city dwellers

• Preferred by people in warmer climes

• Query by value—not user history

r = v*I
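Querying by value rather than by history might look like the sketch below: the query is a single attribute value (here a demographic segment), and matching items are ranked by a precomputed popularity score. The catalog, the field names `popularity` and `preferred_by`, and the function are all hypothetical.

```python
# Hypothetical catalog: each item carries intrinsic indicator fields.
catalog = {
    "messenger-bag": {"popularity": 0.9, "preferred_by": {"mobile", "city"}},
    "laptop-bag": {"popularity": 0.7, "preferred_by": {"city"}},
    "sun-hat": {"popularity": 0.8, "preferred_by": {"warm-climate"}},
    "beach-tote": {"popularity": 0.4, "preferred_by": {"warm-climate", "mobile"}},
}

def query_by_value(value, top_n=2):
    """r = v*I: select the items whose intrinsic indicator contains the value,
    then rank them by popularity (ties broken by item id)."""
    matches = [item for item, fields in catalog.items()
               if value in fields["preferred_by"]]
    return sorted(matches, key=lambda i: (-catalog[i]["popularity"], i))[:top_n]

warm = query_by_value("warm-climate")
mobile = query_by_value("mobile")
```

The same call works for a brand-new user, since nothing in the query depends on that user's history.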


THE UNIVERSAL RECOMMENDER

“Unified” means one query on all indicators at once

Unified query:

query: users-history-of-purchases; field: purchase

query: users-history-of-views; field: view

query: users-history-of-categories-viewed; field: category

query: users-history-of-purchases; field: tags

query: users-location; field: geo-location-preferred

…

r = hp[PtP] + hv[VtP] + hc[CtP] + ht[TTt] + l*L …
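One way the unified query could be expressed is as a single Elasticsearch-style `bool` query with one `terms` clause per indicator field. The sketch below only builds the request body as a Python dict; the field names follow the slide, while the `user` record, the `boosts` parameter, and the function name are assumptions, and how each clause is scored is left to the engine.

```python
def unified_query(user, boosts=None):
    """Build one bool/should query with a terms clause per indicator field."""
    boosts = boosts or {}
    # (indicator field, values to query it with), in the order of the slide.
    clause_spec = [
        ("purchase", user["purchases"]),
        ("view", user["views"]),
        ("category", user["categories_viewed"]),
        ("tags", user["purchases"]),
        ("geo-location-preferred", [user["location"]]),
    ]
    should = [
        {"terms": {field: list(values), "boost": boosts.get(field, 1.0)}}
        for field, values in clause_spec
        if values  # skip indicators this user has no history for
    ]
    return {"query": {"bool": {"should": should}}}

# Hypothetical user record.
user = {
    "purchases": ["messenger-bag"],
    "views": ["laptop-bag", "laptop-sleeve"],
    "categories_viewed": [],
    "location": "seattle",
}
body = unified_query(user, boosts={"purchase": 2.0})
```

Changing the boosts or dropping a clause changes the flavor of the recommendations at query time, with no retraining.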


ONE OR MANY

• One query—one trip to one scalable search engine

• Many flavors—customize in the query

• Customize for content context

• Customize for user context

• Profile, location, time, …

• Customize for special indicators

• Trending, hot, new, popular

• All personalized


POLISH THE APPLE

• Auto-optimize via explore-exploit (important): randomize some returned recs. If they are acted upon, they become part of the new training data and are more likely to be recommended in the future

• Visibility control:

• Don’t show dups, or show dups at some rate

• Filter items the user has already seen

• Generate some intrinsic indicators like hotness, popularity—helps solve the “cold-start” problem

• Asymmetric train vs query management—for instance query with most recent actions, train on all ingested

• On-demand cross-validation scoring for tuning purposes

• A/B testing integration with explore-exploit
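Two of the steps above, filtering items the user has already seen and randomizing some returned recs, can be sketched together. This is a simple epsilon-greedy style mix under assumed names (`polish`, `explore_rate`, the item ids); it is one possible reading of explore-exploit, not a prescribed implementation.

```python
import random

def polish(ranked, seen, candidates, n=5, explore_rate=0.2, rng=None):
    """Filter seen items, then replace some slots with random unseen candidates.

    ranked:       recommendations from the engine, best first
    seen:         items the user has already seen (visibility control)
    candidates:   the pool that exploration may draw from
    explore_rate: chance that any one slot is given to a random candidate
    """
    rng = rng or random.Random()
    result = [item for item in ranked if item not in seen][:n]
    pool = [c for c in candidates if c not in seen and c not in result]
    for slot in range(len(result)):
        if pool and rng.random() < explore_rate:
            # Explore: show something the model would not have ranked here.
            result[slot] = pool.pop(rng.randrange(len(pool)))
    return result

ranked = ["a", "b", "c", "d"]
catalog = ["a", "b", "c", "d", "e", "f", "g"]
exploit_only = polish(ranked, {"b"}, catalog, n=2, explore_rate=0.0)
explored = polish(ranked, {"b"}, catalog, n=2, explore_rate=1.0,
                  rng=random.Random(0))
```

Any explored item the user then acts on enters the action logs like any other event, so the next training run can pick it up.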
