
CSE 291: Trends in Recommender Systems and Human Behavioral Modeling

Week 4 presentations

RECURRENT RECOMMENDER NETWORKS

Rishab Gulati

About The Paper
- First paper to address movie recommendation in a fully causal and integrated manner.
- A recommender system that can:
  ● Capture interactions between users and movies
  ● Capture the temporal dynamics of users and movies
- Uses a nonparametric model
- Uses a combination of LSTMs and low-rank factorization
- Tuples: (user_id, item_id, timestamp) → rating

Recurrent Neural Network
- Captures the temporal dynamics
- Basic rule: the hidden state at each step is a function of the current input and the previous hidden state (reconstructed below)
- Uses LSTMs; equations below.
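The equations on this slide were rendered as images. For reference, the basic RNN rule and the standard LSTM cell equations (the paper's building blocks, up to notational choices) are:

$$h_t = \sigma(W x_t + U h_{t-1} + b)$$

$$
\begin{aligned}
i_t &= \sigma(W_i x_t + U_i h_{t-1} + b_i) \\
f_t &= \sigma(W_f x_t + U_f h_{t-1} + b_f) \\
o_t &= \sigma(W_o x_t + U_o h_{t-1} + b_o) \\
\tilde{c}_t &= \tanh(W_c x_t + U_c h_{t-1} + b_c) \\
c_t &= f_t \odot c_{t-1} + i_t \odot \tilde{c}_t, \qquad h_t = o_t \odot \tanh(c_t)
\end{aligned}
$$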

MODEL
- Uses 2 separate LSTM networks: one for user states and one for movie states
- Update functions: see the reconstruction below
- Throughout training, the model learns the functions f, g, and h
- Advantages:
  1. Concise
  2. Not affected by the sparsity of the dataset
  3. New data can be incorporated without re-training
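The update functions were also shown as images; from the two-LSTM structure described above, they take the form (notation assumed):

$$u_{it} = g(u_{i,t-1}, x_{it}), \qquad m_{jt} = h(m_{j,t-1}, x_{jt}), \qquad \hat{r}_{ij|t} = f(u_{it}, m_{jt})$$

where $u_{it}$ and $m_{jt}$ are the user and movie states, and $x_{it}$, $x_{jt}$ are the rating inputs at step t.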

How it works:

- Dataset of M movies
- x_t → rating vector of the user at time t, x_t ∈ ℝ^M
- x_tj = k if the user gave rating k to movie j at time t, else 0
- τ_t → wall-clock time at step t
- 1_newbie = 1 → the user is new
- W_embed → transformation matrix that projects this information into the embedding space
- y_t → input to the LSTM at time t (reconstructed below)
- Note: steps where the user has not rated any movies are not fed to the LSTM
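The input construction was shown as an image; reconstructed from the quantities above (matching the paper up to notation):

$$y_t = W_{embed}\,[\,x_t,\; \mathbb{1}_{newbie},\; \tau_t,\; \tau_{t-1}\,]$$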

- Stationary components are also present: u_i and m_j
  - User profile / long-term preferences
  - Movie genre
- The rating function then changes to combine both (reconstructed below):
  - Standard factorization for stationary effects
  - LSTMs for long-range dynamic updates
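The combined rating function was an image; a reconstruction consistent with the paper is:

$$\hat{r}_{ij|t} = f(u_{it}, m_{jt}) := \langle \tilde{u}_{it}, \tilde{m}_{jt} \rangle + \langle u_i, m_j \rangle$$

where $\tilde{u}_{it}$, $\tilde{m}_{jt}$ are (affine transforms of) the dynamic LSTM states and $u_i$, $m_j$ are the stationary factors.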

Prediction:

- Uses extrapolated states:

Latest observation (input) → updated states → predictions

- So it takes into account the causal effects of previous ratings.

- Loss Function:
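The loss was shown as an image; a reconstruction consistent with the description (squared error over observed ratings, plus regularization) is:

$$\min_{\theta}\; \sum_{(i,j) \in \mathcal{O}} \bigl(r_{ij} - \hat{r}_{ij|t_{ij}}\bigr)^2 + R(\theta)$$

where $\mathcal{O}$ is the set of observed ratings and $t_{ij}$ is the time of rating $r_{ij}$.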

Backpropagation:

- A rating depends on both the user-state RNN and the movie-state RNN

- Joint backpropagation through both coupled RNNs is computationally prohibitive

- Uses a subspace descent method: alternately update one side's parameters while holding the other side fixed (a toy sketch follows)
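A toy sketch of such alternating "subspace" updates (an illustrative assumption, not the paper's code; a bilinear model stands in for the two coupled LSTMs):

```python
import torch

torch.manual_seed(0)
n_users, n_movies, k = 50, 40, 8
ratings = torch.randint(1, 6, (n_users, n_movies)).float()

U = torch.randn(n_users, k, requires_grad=True)    # user-side parameters
M = torch.randn(n_movies, k, requires_grad=True)   # movie-side parameters
opt_u = torch.optim.Adam([U], lr=0.05)
opt_m = torch.optim.Adam([M], lr=0.05)

def mse():
    # Squared error between predicted and observed ratings
    return ((U @ M.T - ratings) ** 2).mean()

for epoch in range(100):
    # Phase 1: update the user subspace with movie parameters held fixed
    opt_u.zero_grad(); loss = mse(); loss.backward(); opt_u.step()
    # Phase 2: update the movie subspace with user parameters held fixed
    opt_m.zero_grad(); loss = mse(); loss.backward(); opt_m.step()
```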

EXPERIMENTS

DATASET:

Model Setup:

- Single-layer LSTM with 40 hidden neurons
- 40-dimensional input embeddings
- 20-dimensional dynamic states
- 20-dimensional and 160-dimensional stationary latent states for Netflix and IMDB, respectively
- ADAM optimizer for the LSTM, SGD for the stationary factors
- Train the stationary factors first, then the whole model
- Stationary latent states initialized by:
  - PMF for the Netflix dataset
  - AutoRec for the IMDB dataset
- Runs for 10 epochs

Compared With:

- PMF
- TimeSVD++
- AutoRec

RRN with 20-dimensional stationary states and 40-dimensional embeddings in the RNN beats PMF and TimeSVD++ with 160 dimensions, and AutoRec with 500-dimensional embeddings.

- It is also more robust.

Temporal Dynamics
- Automatically models a variety of temporal effects:

1) Exogenous dynamics on movies

2) Endogenous dynamics on movies
   - Age effect
   - Different perspectives
   - Change in the rating mechanism

Performance in Cold-Start Conditions
- Compared with PMF in conditions where the number of ratings is very low.
- (Figure: scatter plot; marker size denotes user counts.)

Incorporation of New Data

- Allows newly observed ratings to be incorporated without re-training:

Observed info (input) → rating → extrapolated states → predictions

- Fluctuation = avg_rating(highest-rated period) − avg_rating(lowest-rated period)
- Tested on movies with different levels of rating fluctuation.

MAIN TAKE-AWAYS

- First paper to implement movie recommendation in a causal and integrated manner.

- Uses LSTMs to model temporal dynamics and standard factorization for stationary features.

- Upon training, learns the functions rather than the latent states.
- Automatically models a variety of temporal effects.
- Outperforms the recommender systems it was compared against.

Questions?

Neural Collaborative Filtering
He et al., 2017

Presenter: Kulshreshth Dhiman

Slides adapted from He's WWW'17 presentation

Overview

• An ensemble of:
  • Generalized Matrix Factorization (GMF): a weighted inner product of the user and item latent vectors, passed through an activation function a_out
  • Multi-Layer Perceptron (MLP): concatenated user and item latent vectors passed to a deep network to learn the user-item interaction
• Implicit feedback model
• MovieLens and Pinterest datasets

Matrix Factorization

• Some notation:
  ▪ p_u = latent vector for user u
  ▪ q_i = latent vector for item i
  ▪ y_ui = interaction (implicit feedback)
• MF estimates the interaction y_ui as the inner product of p_u and q_i: ŷ_ui = p_u^T q_i

Limitations of Matrix Factorization

• The simple choice of an inner product as the interaction function can limit the expressiveness of an MF model
  (E.g., assuming unit length, the predicted similarity is cos(p_u, q_i))
• Example from the paper: ground-truth similarities sim(u1, u2) = 0.5, sim(u3, u1) = 0.4, sim(u3, u2) = 0.66 can be laid out consistently in the latent space; but for a new user u4 with sim(u4, u1) = 0.6, sim(u4, u2) = 0.2, sim(u4, u3) = 0.4, any placement of p4 forces the model to rank s42 > s43, contradicting the ground truth (✗)
• The inner product can thus incur a large ranking loss for MF
• A large number of latent factors can help, but may hurt the generalization error
• Idea: learn the interaction function from the data instead of using a fixed inner product

Neural Collaborative Filtering (NCF) Framework

• NCF uses a multi-layer model to learn the user-item interaction function

• Input: feature vectors for user u (v_u) and item i (v_i)
  • Here, binarized sparse vectors with one-hot encodings of userID and itemID

• Output: predicted score ŷ_ui (how likely item i is relevant to user u)

• Cast as a binary classification problem:
  • Activation function: logistic function
  • Objective function: log loss / binary cross-entropy
  • y_ui labeled 1 if item i is relevant to user u, otherwise 0

Interaction function

Generalized Matrix Factorization (GMF)

• NCF can express and generalize MF:
  • Layer 1: element-wise product of the user and item embeddings
  • Output layer: fully connected layer without bias
  • h allows varying importance of latent dimensions
  • The a_out activation function can model non-linearities
(See the reconstructed formula below.)
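The equation image is missing; from the structure just described (element-wise product followed by a bias-free output layer), the GMF prediction in the paper is:

$$\hat{y}_{ui} = a_{out}\bigl(h^{\top} (p_u \odot q_i)\bigr)$$

With $a_{out}$ the identity and $h$ a vector of ones, this recovers vanilla MF.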

Multi-Layer Perceptron (MLP)

• NCF can provide more non-linearity to learn user-item interactions
• Layer 1: concatenation of the user and item embeddings
• Remaining layers: standard fully-connected layers (reconstructed below)
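The layer equations were images; in the paper they are:

$$z_1 = \begin{bmatrix} p_u \\ q_i \end{bmatrix}, \qquad z_l = a_l\bigl(W_l z_{l-1} + b_l\bigr),\; l = 2, \dots, L, \qquad \hat{y}_{ui} = \sigma\bigl(h^{\top} z_L\bigr)$$

with ReLU activations $a_l$.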

MF vs MLP

• MF uses an inner product as the interaction function:
  • Latent factors are independent of each other
  • It empirically has good generalization ability for CF modelling
• MLP uses nonlinear functions to learn the interaction function:
  • Latent factors are not independent of each other
  • The interaction function is learnt from data, which conceptually gives better representation ability

• Can we fuse the two models to get a more powerful one?

Fusion of GMF and MLP (NeuMF)

• GMF and MLP have separate sets of embeddings

• Concatenate the last hidden layers of the two models for the final prediction (a sketch follows)
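A minimal PyTorch sketch of this fusion (layer sizes and names are illustrative assumptions, not the paper's configuration):

```python
import torch
import torch.nn as nn

class NeuMF(nn.Module):
    """GMF and MLP with separate embeddings; their last layers are
    concatenated before the final prediction layer."""
    def __init__(self, n_users, n_items, k_gmf=8, k_mlp=8, hidden=(32, 16, 8)):
        super().__init__()
        self.p_gmf = nn.Embedding(n_users, k_gmf)   # GMF user embeddings
        self.q_gmf = nn.Embedding(n_items, k_gmf)   # GMF item embeddings
        self.p_mlp = nn.Embedding(n_users, k_mlp)   # MLP user embeddings
        self.q_mlp = nn.Embedding(n_items, k_mlp)   # MLP item embeddings
        layers, d = [], 2 * k_mlp
        for width in hidden:
            layers += [nn.Linear(d, width), nn.ReLU()]
            d = width
        self.mlp = nn.Sequential(*layers)
        self.out = nn.Linear(k_gmf + d, 1)          # plays the role of h

    def forward(self, u, i):
        gmf = self.p_gmf(u) * self.q_gmf(i)         # element-wise product path
        mlp = self.mlp(torch.cat([self.p_mlp(u), self.q_mlp(i)], dim=-1))
        fused = torch.cat([gmf, mlp], dim=-1)       # fuse the two paths
        return torch.sigmoid(self.out(fused)).squeeze(-1)

# Toy usage: score two (user, item) pairs
model = NeuMF(n_users=1000, n_items=2000)
y_hat = model(torch.tensor([3, 7]), torch.tensor([42, 99]))
```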

Experimental Setup

• Two public datasets, from MovieLens and Pinterest

• MovieLens ratings are transformed to 0/1 implicit feedback

• Evaluation:
  • Leave-one-out: hold out each user's latest rating as the test item
  • Randomly sample 100 items the user has not interacted with, and rank the test item among these 100
  • HR@10 (Hit Ratio): captures whether the test item is present in the top-10 list
  • NDCG@10: accounts for the position of the hit by assigning higher scores to hits at top ranks

Baselines

• ItemPop. Items are ranked by their popularity.

• ItemKNN [Sarwar et al, WWW’01]

The standard item-based CF method.

• BPR [Rendle et al, UAI’09]

Bayesian Personalized Ranking optimizes MF model with a pairwise ranking loss, which is tailored for implicit feedback and item recommendation.

• eALS [He et al, SIGIR’16]

The state-of-the-art CF method for implicit data. It optimizes MF model with a varying-weighted regression loss.

Results

• Predictive factors = size of the last hidden layer
• NeuMF outperforms eALS and BPR with about 5% relative improvement
• Of the three NCF methods: NeuMF > GMF > MLP
• Three MF methods with different objective functions:
  • GMF (log loss) >= eALS (weighted regression loss) > BPR (pairwise ranking loss)

Is Deep Learning Helpful?

• For same capability (i.e., same number of predictive factors), stacking more nonlinear layers improves the performance.

Conclusion

• We explored neural architectures for collaborative filtering:
  • Devised a general framework, NCF
  • Presented three instantiations: GMF, MLP, and NeuMF
• Experiments show promising results:
  • Deeper models are helpful
  • Combining deep models with MF in the latent space leads to better results

Discussion Questions

• Q1: Do users and items end up being embedded in the same latent space? How does backpropagation work so that the latent vectors are correctly identified?

• Q2: How is GMF (log loss, pointwise) better than the BPR loss function (pairwise)?

• Q3: How does the DNN resolve the inner-product issue?

• Q4: Converting ratings to implicit feedback: rating 5 vs. rating 1, both being mapped to "1"

• Q5: Cold-start problem? Incorporate auxiliary features

• Q6: Computationally expensive?

Collaborative Variational Autoencoder for Recommender Systems
Xiaopeng Li, James She

Presenter: Ajitesh Gupta
Slides adapted from the KDD 2017 presentation

Motivation
● An increasing variety of content is available to online users
  ○ E.g., movies, images, songs, articles
● Companies want to target as many users as possible, as well as possible
● Current systems work mostly on collaborative filtering
  ○ Predict recommendations based on the user's previous interaction history (ratings, likes, views)
● Two major problems:
  ○ Sparsity: the rating matrix can be, and in many cases actually is, extremely sparse
  ○ Cold start: no ratings available for new items

Use content
● As mentioned earlier, items come with quite a lot of content
  ○ Images, textual descriptions
● Understanding content helps understand user preferences
  ○ Maybe they prefer a certain style of narrative
  ○ Maybe they like the presence of certain instruments in songs
● Combine collaborative methods with content methods → hybrid method
  ○ Previous history + subjective interests = best of both worlds
● Loosely coupled vs. tightly coupled
  ○ Loosely coupled hybrid systems
    ■ Separate content and collaborative predictions
    ■ Final prediction is a linear combination / regression
  ○ Tightly coupled hybrid systems
    ■ Unified model of content and collaborative prediction

Challenges
● How to select features?
  ○ Content features are not directly suitable for recommendation
  ○ Very high dimensional
  ○ Not optimized for the task
● Moreover, how to integrate them with the collaborative part?
● The authors look towards deep generative models in order to learn such suitable representations

Variational Autoencoder

● The latent space is modelled as a Gaussian
  ○ Can simply sample from the Gaussian to generate
  ○ The Gaussian makes the relevant functions easy to calculate
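The objective itself was shown as an image; for reference, the standard VAE evidence lower bound (not specific to this paper) is:

$$\mathcal{L}(\theta, \phi; x) = \mathbb{E}_{q_\phi(z \mid x)}\bigl[\log p_\theta(x \mid z)\bigr] - \mathrm{KL}\bigl(q_\phi(z \mid x)\,\|\,p(z)\bigr)$$

with a Gaussian $q_\phi(z \mid x) = \mathcal{N}\bigl(\mu_\phi(x), \operatorname{diag}(\sigma_\phi^2(x))\bigr)$ and prior $p(z) = \mathcal{N}(0, I)$.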

Collaborative Variational Autoencoder
● Bayesian generative model: both content and ratings are generated from latent variables
  ○ Ratings through the graphical model
  ○ Content through the generation network

Collaborative Variational Autoencoder
(Figure: graphical model linking the rating, the user latent variable, the item latent variable, the item latent collaborative variable, the item latent content variable, and the item content.)

Collaborative Variational Autoencoder: Generative Process

1. For each item j:
   a. Draw the latent collaborative variable v_j* from its prior
   b. Draw the latent content variable z_j from its prior
   c. Draw the item content x_j from the generation network
   d. Draw the item latent variable v_j (combining v_j* and z_j)

2. For each user i:
   a. Draw the user variable u_i from its prior

3. For each rating R_ij:
   a. Draw the rating given u_i and v_j

Joint Probability

Inference
● The intractable posterior requires approximate inference
● Hence, variational inference is used
● A Gaussian variational distribution is used for the latent content variable

Inference
● The inference network infers z through two paths:
  ○ Generating the content x
  ○ Generating the rating R
● It learns a z that is good both for representing the useful information in x and for better recommendation

MAP Estimation for Learning
(The objective, shown as an image on the slides, combines:)
- The error between predicted and true ratings
- The contribution of content to recommendation
- The reconstruction error and KL divergence of the VAE
- Regularization
(Optimization proceeds in alternating rounds, labeled Round 1 and Round 2 on the slides.)

Prediction
● Let D be the observed data. After all the parameters U, V (or μ, Λ) and the weights of the inference network and the generation network are learned, predictions are made by taking the expectation of the rating under the model.

● For point estimation, the prediction simplifies (reconstructed from the surrounding text) to

R̂_ij ≈ u_i^T (v_j* + E[z_j]),

where E[z_j] = μ_j is the mean vector obtained through the inference network for each item j. Items that have never been seen before have no v_j* term, but E[z_j] can still be inferred through the content.

Experiments
Dataset: scholarly articles saved by users (CiteULike)

              citeulike-a    citeulike-t
# users       5,551          7,947
# items       16,950         25,975
# ratings     204,956        134,850
Density       0.22%          0.07%
Content       text           text

Preprocessing
● Bag of words
● TF-IDF term selection
● Normalization

Training Settings
● Sparse: each user has 1 rating
● Dense: each user has 10 ratings

Measure

Other methods
● Collaborative Topic Regression (CTR)
  ○ LDA content features + collaborative filtering
  ○ Tightly coupled
● DeepMusic
  ○ Uses a neural network to regress content onto latent factors
  ○ Latent factors obtained from Weighted Matrix Factorization
  ○ Loosely coupled
● Collaborative Deep Learning (CDL)
  ○ Similar to CVAE, but with a Stacked Denoising Autoencoder + collaborative filtering
  ○ Tightly coupled

Results

Results - Effect of penalizing reconstruction error (λr)

Results - Effect of increasing content dimensions (K)

Qualitative Results

Extensions to various types of content
● Different neural network architectures for different types of content
  ○ Convolutional neural networks
  ○ Recurrent neural networks
● Taking advantage of different types of content
  ○ Adversarial networks

Thank You

TransNets: Learning to Transform for Recommendation

Rose Catherine, William Cohen. 2017

Presenter: Siyu Jiang

Motivation

• Task: reviews → ratings
• Reviews include: a user's reviews of items, and an item's received reviews from users

• Use deep learning's advantages in recommender systems.

• The state-of-the-art method, Deep Cooperative Neural Networks (DeepCoNN), has flaws.

Related Work: DeepCoNN
• Takes review texts as input
• Word embedding
  • Pre-trained, no updates
• Convolutional layer
  • With different kernels
• Max-pooling layer
  • Produces a fixed-size vector
  • Location invariant
• Fully connected layer
• Estimates ratings using Factorization Machines

(Figure: CNN text processor)

Limitations of DeepCoNN
• User A's rating on item B = F(user A's reviews, item B's reviews)
• Let Rev(A,B) denote user A's review of item B:
  • User A's reviews = Rev(A,B) + reviews of other items
  • Item B's reviews = Rev(A,B) + reviews from other users

• DeepCoNN used Rev(A,B) at both training and testing time.

• But it is unreasonable to recommend an item to users only after they have experienced it.

• DeepCoNN's performance:
  • Train with Rev(A,B), test with Rev(A,B): MSE = 1.21
  • Train with Rev(A,B), test without Rev(A,B): MSE = 1.89
  • Train without Rev(A,B), test without Rev(A,B): MSE = 1.70

TransNets

• Inspiration:
  • Rev(A,B) is important for predicting Rating(A,B)
  • Rev(A,B) is available during training

• Consists of two networks:

  • A Target Network to process Rev(A,B)

  • A Source Network to process the texts of user A and item B

TransNet Architecture

Target Network

Source Network
• Input: reviews by user A (excluding the review of item B) and reviews of item B (excluding the review by user A)
• Output: Rating(A,B)

• Transform layer: transforms the user and item vectors into an approximation of the features of Rev(A,B)

• Only the source network is used during testing.

Training TransNet

• Three-step training (sketched below):

• Step 1: Train the target network on the actual review Rev(A,B).

• Step 2: Learn to transform the source representation to match the target's.

• Step 3: Train a predictor on the transformed input.
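A plausible reconstruction of the three objectives (notation assumed: $z_T$ is the target network's representation of Rev(A,B), $z_S$ the source representation, and $T$ the transform layer):

$$
\begin{aligned}
\text{Step 1:} &\quad \min\; \bigl(\mathrm{FM}_T(z_T) - r_{AB}\bigr)^2 \\
\text{Step 2:} &\quad \min\; \lVert T(z_S) - z_T \rVert_2 \\
\text{Step 3:} &\quad \min\; \bigl(\mathrm{FM}_S(T(z_S)) - r_{AB}\bigr)^2
\end{aligned}
$$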


Dataset

Result

• TransNet-Ext: includes user and item identities in the inputs.

• DeepCoNN: train with Rev(A,B), test without Rev(A,B).

• DeepCoNN: train without Rev(A,B), test without Rev(A,B).

• DeepCoNN + test reviews: train with Rev(A,B), test with Rev(A,B).

Thank you! & Questions?

What Your Images Reveal: Exploiting Visual Content for Point-of-Interest Recommendation
Wang, Wang, Tang, et al. Arizona State Univ. & Michigan State Univ.

Presented by Stephanie Chen
Oct. 25, 2017

This paper’s goal

To improve the performance of Point-of-Interest (POI) recommendation by incorporating visual content from check-in data on Location-Based Social Networks (LBSNs)

Previous Work

Existing POI Recommendation Methods:

1. Temporal patterns
2. Geographical influence
3. Social correlations
4. Textual content indications

Location-Based Social Networks

● Focused on check-ins with images
● These provide unique information about user preferences and additional POI properties

● Authors' hypothesis: users who post lots of images of food have more incentive to visit restaurants.

Extracting Visual Content

VGG16 convolutional neural network (CNN) model

Optimization Framework - Section 4

● Gradient descent
● Negative sampling to simplify the gradients:
  ○ For each image p_k ∈ P_ui, randomly sample r images from those not posted by user u_i
  ○ Maximize the similarity between u_i and the visual content of p_k
  ○ Minimize the similarity between u_i and the r randomly-sampled images
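A toy sketch of this negative-sampling update (all names and the exact update rule are illustrative assumptions in the spirit of the objective, not the paper's code):

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def negative_sampling_update(u, pos_feat, other_feats, r=5, lr=0.01):
    """Pull user vector u toward the visual feature of an image the user
    posted; push it away from r randomly sampled other images."""
    # Maximize similarity with the posted image p_k
    u = u + lr * (1 - sigmoid(u @ pos_feat)) * pos_feat
    # Minimize similarity with r randomly sampled negatives
    idx = rng.choice(len(other_feats), size=r, replace=False)
    for neg in other_feats[idx]:
        u = u - lr * sigmoid(u @ neg) * neg
    return u

# Toy usage: 64-dim CNN features for 100 images
feats = rng.normal(size=(100, 64))
u = rng.normal(size=64)
u = negative_sampling_update(u, feats[0], feats[1:])
```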

Data Sets - Section 5.1
● Instagram check-ins from New York City & Chicago
  ○ Both with and without location tags
● Selected only:
  ○ Locations visited by at least 2 unique users
  ○ Users who have checked in at at least 8 distinct locations
● Threw out all selfies to reduce noise & improve performance
  ○ Selfies give too little information about POIs, or about the user's interest in them, because the human body/face takes up the entire image
  ○ The authors checked for selfies "manually"

Experiment

● Training set: for each individual user in the check-in matrix, randomly select x% of all POIs where the user has checked in.
● Test set: the rest of the observed user-POI pairs.
● Remove images associated with check-ins in the test set, to ensure no test data is exposed during training.

Performance Comparison - Section 5.2

● Matrix-factorization-based POI recommender systems outperform user-oriented collaborative filtering (UCF)

● VPOI obtains better performance than the MF-based baseline methods, suggesting that incorporating visual content can improve recommendation performance

● VPOI handles the noisy images from Instagram better than Visual Bayesian Personalized Ranking (VBPR) because:
  ○ VPOI uses the images to learn latent features of users and POIs, whereas VBPR uses them directly as descriptions of locations to predict preference scores

Handling Cold-Start Users - Section 5.3

● Cold-start users: users without check-in history, or who never add geo-tags to posted photos
● <30% of Instagram images are tagged with POIs
● To build data containing cold-start users, for each user:
  ○ Randomly select x% of all POIs for training, leaving the rest for the test set
  ○ Remove images associated with check-ins in the test set
  ○ Randomly select 5% of the users from the training set, and:
    ■ Remove their check-ins from the training set
    ■ Remove their geo-tagged images
  ○ This results in 5% of the data becoming cold-start users

Cold-Start Performance Comparison

● Introducing cold-start users deteriorates performance for all the recommender systems

● Relative to the other systems' results, VPOI's performance degeneration is much smaller, because VPOI can still learn user latent factors for cold-start users

Conclusion - Section 6

● Used a CNN to extract features from visual content (images) and to learn latent user and POI features

● VPOI's experimental results show it outperforms representative state-of-the-art POI recommender systems

● Future work:
  ○ Incorporate the original four factors to see if even better performance can be achieved
  ○ Use streaming recommender-system techniques, because user check-in records arrive as streams

Neat Notes

3 of Julian’s works were cited by these authors!

Room for Improvement

Dataset was not actually available on the first author’s webpage.

So many grammatical and spelling errors!

e.g. section 1: “how to extract useful visual contents from images as we are lack of ground truth of what are contained in the images”

e.g. table 4 title “perforamnce”

Appendix Slides

Notation - Section 3

3 types of objects: users, locations, and images

U = {u_1, u_2, . . . , u_n} = the set of users
L = {l_1, l_2, . . . , l_m} = the set of locations
P = {p_1, . . . , p_N} = the set of photos

where n, m, and N are the numbers of users, POIs, and images, respectively.

R denotes the check-in matrix.

Notation - Section 3
Users can check in at locations.
  X ∈ ℝ^(n×m) denotes the user-POI check-in matrix
  X_ij denotes the check-in frequency or rating of u_i on l_j
Users can upload images to LBSNs.
  P_ui = the set of images uploaded by u_i
Users can also choose to add locations to images.
  P_lj = the set of images tagged with l_j

Acronyms

CNN = Convolutional Neural Network

LBSN = Location-Based Social Network

MF = Matrix Factorization

POI = Point of Interest

UCF = User-oriented Collaborative Filtering

VBPR = Visual Bayesian Personalized Ranking

VPOI = Visual Content Enhanced POI Recommendation

Neural Factorization Machines for Sparse Predictive Analytics

Xiangnan He, Tat-Seng Chua. 2017

Presenter: Chester Holtz

Motivation

• Many predictive tasks involving web applications need to model categorical variables, such as user IDs or demographics.

• To apply standard machine learning techniques, these categorical predictors are typically converted to a set of binary features via one-hot encoding, making the resultant feature vector highly sparse.

• To learn from such sparse data effectively, it is crucial to account for the interactions between features.

Factorization Machines

• FM can directly model explicit features
• Rather than projecting the data into a latent vector space, FM projects each feature into the latent space
• FM estimates the target by modelling all interactions between each pair of features via factorized interaction parameters (the standard FM equation is reconstructed below):
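The FM equation, shown as an image on the slide, is the standard one:

$$\hat{y}_{FM}(x) = w_0 + \sum_{i=1}^{n} w_i x_i + \sum_{i=1}^{n} \sum_{j=i+1}^{n} \langle v_i, v_j \rangle\, x_i x_j$$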

Factorization Machines Limitations

• FM is still a linear model
• FM may not capture the higher-order and nonlinear feature interactions present in the data

Neural Factorization Machines

• Unifies linear FMs and neural networks for sparse data modelling
• Performs a non-linear transformation on the latent space of the second-order feature interactions, while capturing higher-order feature interactions; f(x) is modeled as a neural network

Bi-Interaction Layer

• The proposed Bi-Interaction layer takes a set of dense k-dimensional feature embeddings and maps them to a single k-dimensional vector (reconstructed below).
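The Bi-Interaction pooling operation (reconstructed from the paper) is:

$$f_{BI}(\mathcal{V}_x) = \sum_{i=1}^{n} \sum_{j=i+1}^{n} x_i v_i \odot x_j v_j = \tfrac{1}{2}\Bigl[\Bigl(\sum_{i=1}^{n} x_i v_i\Bigr)^2 - \sum_{i=1}^{n} (x_i v_i)^2\Bigr]$$

where the square is element-wise; the right-hand identity lets it be computed in time linear in the number of non-zero features.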

Formulation & Properties of NFM

• Formulation: ŷ_NFM(x) = w_0 + Σ_i w_i x_i + f(x), where f(x) is the neural network stacked on the Bi-Interaction layer
• Generalizes FM (proof sketched in the paper)
• Model evaluation complexity is linear in the number of non-zero features (proof sketched in the paper)

Evaluation

• Datasets:
  • MovieLens: movie tag recommendation
  • Frappe: app recommendation

• Baselines:
  • LibFM
  • HOFM
  • Wide & Deep
  • DeepCross

Results & Discussion

• (RQ1) Dropout improves generalization; batch normalization speeds up training

• (RQ2) The Bi-Interaction layer effectively encodes second-order feature interactions; a simple non-linear function on top is sufficient to capture higher-order interactions

Conclusion and Future Directions

• Proposed a novel neural network model, NFM, which ties the linearity of FM with the representation ability of non-linear neural networks for sparse predictive analytics.

• Proposed the Bi-Interaction operation, which gives the model the ability to learn more informative feature interactions at the lower level.

• Experiments on two real-world datasets show that with only one hidden layer, NFM outperforms FM, higher-order FM, and state-of-the-art deep learning approaches.

Future Directions

• Improve the efficiency of NFM by applying hashing techniques
• Study its performance on other IR tasks, such as search ranking and targeted advertising
• Extend the objective function with regularizers like the graph Laplacian
• Explore Bi-Interaction pooling for recurrent neural networks (RNNs) for sequential data modelling

Discussion Questions

1. How can NFM be combined with field-aware factorization machines, so that a feature gets several embedding vectors, one per field of the features it interacts with?

2. The authors sampled 2 negative instances to pair with each positive instance. Why was this able to ensure generalization, and how exactly was it implemented?

3. For sequential data, does replacing the FNN with an RNN yield better performance?

4. Dropout intuition: prevents complex feature co-adaptations

Deep Neural Networks for YouTube Recommendations

Paul Covington, Jay Adams, Emre Sargin

Presented By: Tushar Bansal

Motivation
Recommend videos to YouTube users based on their past activity.

Scale: Many existing recommendation algorithms proven to work well on small problems fail to operate at YouTube's scale.

Freshness: The recommendation system should be responsive enough to model newly uploaded content as well as the latest actions taken by the user.

Noise: Historical user behavior on YouTube is inherently difficult to predict due to sparsity and a variety of unobservable external factors. Ground truth of user satisfaction is rare.

Model Overview
A combination of two models:

1. Candidate Generation: the enormous YouTube corpus is winnowed down to hundreds of videos that may be relevant to the user.

2. Ranking: rank the candidate videos based (mostly) on impression data.

Candidate Model
● Multi-class classification problem with each video as a class
● Uses implicit feedback data instead of explicit feedback
● Training: softmax function (reconstructed below), with v_j = candidate video embedding and u = user embedding
● Minimizes cross-entropy loss
● Test time: nearest-neighbor search
● Fully connected ReLU layers in a tower pattern
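The softmax, missing as an image, is (from the paper):

$$P(w_t = i \mid U, C) = \frac{e^{v_i^{\top} u}}{\sum_{j \in V} e^{v_j^{\top} u}}$$

where V is the video corpus.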

Candidate Model: Features● User’s watch history represented by a sequence of sparse Video ID mapped

to dense vector representation via embeddings.● User’s search queries tokenized into unigrams and bigrams with each token

embedded● Example Age feature: Important to promote fresh content

Candidate Model: Features● Fix number of training samples per user.● Representing search queries as unordered bag of tokens.● Predicting the next watch rather than predicting held-out watch.● Train examples from other website embeddings too.

Candidate Model: Evaluation

*MAP: Mean average precision

Ranking Model
● Many more features describing the video and the user-video relationship
● The ranking objective is tuned based on live A/B testing results, but is generally a simple function of expected watch time per impression
● Training: weighted logistic regression (see the note below)
  ○ Positive samples weighted by the observed watch time on the video
  ○ Negative samples receive unit weight
● Fully connected ReLU layers in a tower pattern
● Candidates come from multiple candidate sources
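A brief note on why this yields watch-time ranking (paraphrasing the paper's argument, details hedged): with positives weighted by watch time, the odds learned by the logistic model approximate the expected watch time, so at serving time videos are scored with $e^{Wx + b}$ and thereby ranked by (approximately) expected watch time.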

Ranking Model: Features
● User's interaction with the video or similar items (e.g., the channel)
● Past video impressions
● Last search queries, last N videos watched, etc.

Ranking Model: Evaluation

Discussion

1. YouTube has many user-created playlists, but the paper does not seem to consider that feature. If we wanted to use it, how should the model change?

2. The paper doesn't mention explicit feedback from the user, such as channel subscriptions and likes, which may greatly affect candidate generation. Even though explicit feedback is sparse, using it may improve the recommendations. Is it still advisable not to use it?

3. Recommendations may well be periodic in nature: based on the time of day or the day of the week, the recommendations should vary. Would explicitly including a time factor during the candidate generation phase help?

4. How can we ensure diversity of results?
5. Could we give raw input to the model instead of embeddings?
6. What is the motivation for using extreme multiclass classification? Why wouldn't a logistic loss function or pairwise ranking work?

Thank You.

Deep Learning based Large Scale Visual Recommendation and Search for E-commerce
By Devashish Shankar, Sujay Narumanchi, H. A. Ananya, Pramod Kompalli, Krishnendu Chaudhury

Presenter: Dhruv Sharma

Problems it talks about
● Visual Recommendation
  ○ Retrieve a ranked list of catalog images similar to another catalog image
  ○ A deep ranking model for recommending items using images
● Visual Search
  ○ Retrieve a ranked list of images similar to a "wild" (user-uploaded) image
  ○ Uses the same deep-net model and nearest-neighbor (NN) search
● Scalability
  ○ Scaling the solution to:
    ■ 50M items
    ■ 100K ingestions/hour (add, delete, or modify an item)
    ■ 2,000 queries/second at 100ms latency
● Domain: fashion

Catalog and Wild Images
(Figure: example catalog images vs. wild images)

Datasets

1. Flipkart Fashion Dataset (Evaluation)2. Exact Street2Shop (Training)

Problem being solved

Model Architecture

Training Procedure
● Triplets of images <q, p, n> (query, positive, negative)
● Relative similarity learned using a hinge loss (reconstructed below)
● Trained using in-class and out-of-class negative triplets
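The hinge loss was shown as an image; a standard triplet hinge loss consistent with the slide's description (notation assumed) is:

$$L(q, p, n) = \max\bigl(0,\; g + D(f(q), f(p)) - D(f(q), f(n))\bigr)$$

where $f$ is the embedding network, $D$ a distance in embedding space, and $g$ the margin.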

Training Data Generation
● Use Basic Image Similarity Scorer (BISS) models to approximate similarity
  ○ Each BISS identifies the 1000 images closest to q
● Two types of training triplets:
  ○ Catalog image triplets
    ■ All images are sampled from the catalog images
  ○ Wild image triplets
    ■ Query images are sampled from wild images; positive and negative images are sampled from catalog images

Training Data Generation

● Catalog image triplets:
  ○ q is sampled from the set of catalog images
  ○ p is sampled from the catalog images with high similarity scores, using the union of the top 200 images from multiple BISS models
  ○ n is sampled from the catalog images with low similarity scores, using the rest of the images from the union of the BISS models
    ■ In-class negative images: sampled from the union of the top 500-1000 of each BISS
    ■ Out-of-class negative images: sampled from images of the same category outside the top 1000 of each BISS

Training Data Generation

● Wild image triplets:
  ○ q is sampled from the cropped wild images from Exact Street2Shop
  ○ p is the ground-truth matching image in the dataset
  ○ n is sampled from the catalog images with low similarity scores, from a VisNet trained on catalog images
    ■ n is sampled from both in-class and out-of-class images
● Object localization for wild images (search):
  ○ Uses Faster R-CNN (FRCNN) trained on the Fashionista dataset

Production Pipeline (Scalability)

Evaluation

Limitations & Confusions
1. The model is non-personalized
2. Recall is reduced for VisNet-FRCNN
3. A separate deep ranking model is needed for each category
