CSE 291: Trends in Recommender Systems and Human Behavioral Modeling
Week 4 presentations
RECURRENT RECOMMENDER NETWORKS
Rishab Gulati
About The Paper
- First paper addressing movie recommendation in a fully causal and integrated manner.
- Recommender system that can:
● Capture interaction between users and movies
● Capture the temporal dynamics of users and movies
- Uses a nonparametric model
- Uses a combination of LSTMs and low-rank factorization
- Tuples: (User_id, Item_id, timestamp) -> Rating
Recurrent Neural Network
- Captures the temporal dynamics
- Basic rule:
- Uses LSTMs. Equations:
MODEL
- Uses 2 separate LSTM networks
- Update functions:
- Throughout training, the model learns the functions f, g, and h
- Advantages:
1. Concise
2. Not affected by sparsity of the dataset
3. New data can be incorporated without re-training
How it works:
- Dataset of M movies
- x_t -> rating vector of the user at time t, x_t ∈ ℝ^M
- x_tj = k if the user gave rating k to movie j at time t, else 0
- τ_t -> wallclock at time t
- 1_newbie = 1 -> new user
- W_embed -> transformation matrix that projects the information into the embedding space
- y_t -> input to the LSTM at time t
- Note: Steps where the user has not rated any movies are not fed to the LSTM
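The input construction above can be sketched as follows. This is a minimal illustration, assuming the LSTM input is the concatenation of x_t, the newbie indicator, and the wallclock, projected by W_embed; the function name `build_lstm_input`, the toy dimensions, and the weight values are hypothetical and not taken from the slides.

```python
def build_lstm_input(x_t, is_newbie, tau_t, W_embed):
    """Project [x_t, 1_newbie, tau_t] into the embedding space (assumed form)."""
    if not any(x_t):
        return None  # steps without any rating are not fed to the LSTM
    features = list(x_t) + [1.0 if is_newbie else 0.0, float(tau_t)]
    # Matrix-vector product: one output value per row of W_embed.
    return [sum(w * f for w, f in zip(row, features)) for row in W_embed]

# Toy example: M = 3 movies, 2-dimensional embedding (hypothetical values).
W_embed = [[0.1, 0.0, 0.2, 0.5, 0.01],
           [0.0, 0.3, 0.1, -0.5, 0.02]]
y_t = build_lstm_input([4, 0, 0], True, 10.0, W_embed)
skipped = build_lstm_input([0, 0, 0], False, 11.0, W_embed)
```

Returning `None` for empty steps is just one way to express the note that such steps are skipped.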
- Stationary components are also present: u_i and m_j
- User profile, long-term preference
- Movie genre
- Rating function changes to:
- Standard factorization for stationary effects
- LSTMs for long-range dynamic updates
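A minimal sketch of a rating function with both parts, assuming the prediction is the sum of a dynamic inner product (LSTM-derived states) and a stationary inner product (u_i and m_j). The additive form, the name `predict_rating`, and the toy vectors are assumptions for illustration.

```python
def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def predict_rating(u_dyn, m_dyn, u_stat, m_stat):
    """Dynamic part (time-varying states) plus stationary part (assumed additive)."""
    return dot(u_dyn, m_dyn) + dot(u_stat, m_stat)

# Hypothetical 2-dimensional states for one user-movie pair at time t.
r = predict_rating([0.5, -0.2], [1.0, 0.5], [1.0, 2.0], [1.5, 0.5])
```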
Prediction:
- Uses extrapolated states:
Latest Observation(Input) → Updates states → Predictions
So, takes into account the causal effects of previous rating.
- Loss Function:
Backpropagation:
- Rating depends on both user-state RNN and movie-state RNN
- Backpropagation is computationally prohibitive
- Uses Subspace descent method
EXPERIMENTS
DATASET:
Model Setup:
- Single-layer LSTM with 40 hidden neurons
- 40-dimensional input embeddings
- 20-dimensional dynamic states
- 20-dimensional and 160-dimensional stationary latent states for Netflix and IMDB respectively
- ADAM optimizer for the LSTM and SGD for the stationary factors
- Train the stationary factors first and then the whole model
- Stationary latent states initialized by:
- PMF for the Netflix dataset
- AutoRec for the IMDB dataset
- Runs for 10 epochs
Compared With:
- PMF
- TimeSVD++
- AutoRec
- 20-dim stationary states and 40-dim embeddings in the RNN beat PMF and TimeSVD++ of 160 dim and AutoRec with 500-dim embeddings
- Also more robust
Temporal Dynamics
- Automatically models a variety of temporal effects
1) Exogenous Dynamics on Movies
2) Endogenous Dynamics on Movies
- Age Effect
- Different Perspectives
- Change in rating mechanism
Performance in Cold Start Conditions
- Compared with PMF in conditions where #ratings is very low.
- Marker size denotes user counts
Incorporation of New Data
- Allows newly observed ratings to be incorporated without re-training.
Observed info. (Input) → Rating → Extrapolate states → Predictions
- Fluctuation = avg_rating(highest-rated t) - avg_rating(lowest-rated t)
- Tested on movies with different levels of rating fluctuations.
MAIN TAKE-AWAYS
- First paper to implement movie recommendation in a causal and integrated manner.
- Uses LSTMs to model temporal dynamics and standard factorization for stationary features.
- Learns the functions, not the latent states, during training.
- Automatically models a variety of temporal effects.
- Outperforms the compared baselines (PMF, TimeSVD++, AutoRec).
Questions?
Neural Collaborative Filtering
He et al., 2017
Presenter: Kulshreshth Dhiman
Slides adapted from He WWW17 presentation
Overview
• An ensemble of
• Generalized Matrix Factorization: weighted inner product of user and item latent vectors, passed through an activation function a_out
• Multi-layer Perceptron: concatenated user and item latent vectors passed on to a deep network to learn the user-item interaction
• Implicit feedback model
• MovieLens and Pinterest datasets
Matrix Factorization
• Some notation
▪ p_u = latent vector for user u
▪ q_i = latent vector for item i
▪ y_ui = interaction (implicit feedback)
• MF estimates the interaction y_ui as the inner product of p_u and q_i
Limitations of Matrix Factorization
• The simple choice of inner product function can limit the expressiveness of a MF model
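(E.g., assuming unit-length latent vectors, so the inner product equals cos(p_u, q_i))
sim(u1, u2) = 0.5
sim(u3, u1) = 0.4
sim(u3, u2) = 0.66
sim(u4, u1) = 0.6 *****
sim(u4, u2) = 0.2 *
sim(u4, u3) = 0.4 ***
[Figure: u1, u2, u3 and the new user u4 in the latent space; MF places u4 so that S42 > S43 (X)]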
• The inner product can incur a large ranking loss for MF
• A large number of latent factors can be used, but that may hurt generalization
• Learn interaction function from the data instead of using fixed inner product
Neural Collaborative Filtering (NCF) Framework
• NCF uses a multi-layer model to learn user-item interaction function
• Input: feature vectors for user u (v_u) and item i (v_i)
• Here, binarized sparse vectors with one-hot encoding of userID and itemID
• Output: predicted score ŷ_ui (how likely i is relevant to u)
• Convert it to a binary classification problem
• Activation function: logistic function
• Objective function: log loss / binary cross entropy
• y_ui labeled as 1 if item i is relevant to user u
• Otherwise 0
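The log loss objective above can be sketched in a few lines. This is a minimal illustration of binary cross entropy over (label, predicted score) pairs; the function name `log_loss` and the clipping constant `eps` are choices made here, not details from the slides.

```python
import math

def log_loss(labels, scores, eps=1e-12):
    """Mean binary cross entropy between 0/1 labels and predicted probabilities."""
    total = 0.0
    for y, p in zip(labels, scores):
        p = min(max(p, eps), 1.0 - eps)  # clip to avoid log(0)
        total -= y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return total / len(labels)

# One observed interaction (label 1) and one sampled negative (label 0).
loss = log_loss([1, 0], [0.5, 0.5])
```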
Interaction function
Generalized Matrix Factorization (GMF)
• NCF can express and generalize MF:
• Layer 1: element-wise product
• Output Layer: fully connected layer without bias
• h allows varying importance of latent dimensions
• The a_out activation function can model non-linearities
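A minimal sketch of the GMF scoring described above: an element-wise product of the user and item latent vectors, a weight vector h, and an output activation a_out. The function name `gmf_score` and the toy vectors are hypothetical. With h set to all ones and an identity activation, the score reduces to the plain MF inner product.

```python
import math

def gmf_score(p_u, q_i, h, a_out):
    """a_out(h^T (p_u * q_i)), where * is the element-wise product."""
    elementwise = [p * q for p, q in zip(p_u, q_i)]
    return a_out(sum(w * e for w, e in zip(h, elementwise)))

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

p_u, q_i = [0.5, 1.0, -1.0], [2.0, 0.5, 1.0]
mf = gmf_score(p_u, q_i, [1.0, 1.0, 1.0], lambda z: z)  # plain inner product
gmf = gmf_score(p_u, q_i, [0.2, 1.0, 0.1], sigmoid)     # weighted, squashed to (0, 1)
```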
Multi-Layer Perceptron (MLP)
• NCF can provide more non-linearity to learn user-item interactions
Layer 1:
Remaining Layers:
MF vs MLP
• MF uses an inner product as the interaction function:
• Latent factors are independent of each other;
• It empirically has good generalization ability for CF modelling
• MLP uses nonlinear functions to learn the interaction function:
• Latent factors are not independent of each other;
• The interaction function is learnt from data, which conceptually has a better representation ability.
• Can we fuse two models to get a more powerful model?
Fusion of GMF and MLP (NeuMF)
• GMF and MLP have separate sets of embeddings
• Concatenate weights of the two models
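The fusion can be sketched as a forward pass. This is a simplified illustration, assuming ReLU hidden layers, a sigmoid output, and that the GMF element-wise product is concatenated with the last MLP hidden layer before the output weights h; all names and toy values are hypothetical.

```python
import math

def relu(v):
    return [max(0.0, x) for x in v]

def dense(W, b, v):
    """Fully connected layer: W v + b."""
    return [sum(w * x for w, x in zip(row, v)) + bi for row, bi in zip(W, b)]

def neumf_score(p_gmf, q_gmf, p_mlp, q_mlp, layers, h):
    gmf_vec = [p * q for p, q in zip(p_gmf, q_gmf)]  # GMF branch
    hidden = list(p_mlp) + list(q_mlp)               # MLP branch: concatenation
    for W, b in layers:
        hidden = relu(dense(W, b, hidden))
    fused = gmf_vec + hidden                         # concatenate both branches
    z = sum(w * x for w, x in zip(h, fused))
    return 1.0 / (1.0 + math.exp(-z))                # logistic output

# Separate embeddings per branch (toy values), one hidden MLP layer.
layers = [([[1.0, -1.0], [0.5, 0.5]], [0.0, 0.0])]
score = neumf_score([1.0, 0.5], [0.5, 2.0], [1.0], [2.0], layers, [1.0] * 4)
```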
Experimental Setup
• Two public datasets from MovieLens and Pinterest:
• Transform MovieLens ratings to 0/1 implicit case
• Evaluation
• Leave-one-out: hold out the latest rating of each user as the test item
• Randomly sample 100 items that the user has not interacted with
• Rank the test item among the 100 items
• HR (Hit Ratio)@10: captures whether the test item is present in the top-10 list
• NDCG@10: accounts for position of the hit by assigning higher scores to hits at top ranks
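The two metrics can be sketched for a single user as follows, assuming one held-out test item ranked against the sampled negatives. The helper names and the toy scores are hypothetical; with a single relevant item, the ideal DCG is 1, so NDCG reduces to 1 / log2(rank + 2) for a 0-based rank.

```python
import math

def rank_of_test_item(test_score, negative_scores):
    """0-based rank of the held-out item among the sampled negatives."""
    return sum(1 for s in negative_scores if s > test_score)

def hit_ratio_at_k(rank, k=10):
    return 1.0 if rank < k else 0.0

def ndcg_at_k(rank, k=10):
    return 1.0 / math.log2(rank + 2) if rank < k else 0.0

# Toy example with 3 negatives instead of 100.
rank = rank_of_test_item(0.9, [0.95, 0.5, 0.3])
hr, ndcg = hit_ratio_at_k(rank), ndcg_at_k(rank)
```

Averaging these per-user values over all users gives the reported HR@10 and NDCG@10.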
Baselines
• ItemPop. Items are ranked by their popularity.
• ItemKNN [Sarwar et al, WWW’01]
The standard item-based CF method.
• BPR [Rendle et al, UAI’09]
Bayesian Personalized Ranking optimizes MF model with a pairwise ranking loss, which is tailored for implicit feedback and item recommendation.
• eALS [He et al, SIGIR’16]
The state-of-the-art CF method for implicit data. It optimizes MF model with a varying-weighted regression loss.
Results
• Predictive factors = size of last hidden layer
• NeuMF outperforms eALS and BPR with about 5% relative improvement.
• Of the three NCF methods: NeuMF > GMF > MLP
• Three MF methods with different objective functions:
• GMF (log loss) >= eALS (weighted regression loss) > BPR (pairwise ranking loss)
Is Deep Learning Helpful?
• For same capability (i.e., same number of predictive factors), stacking more nonlinear layers improves the performance.
Conclusion
• We explored neural architectures for collaborative filtering.
• Devised a general framework NCF;
• Presented three instantiations GMF, MLP and NeuMF.
• Experiments show promising results:
• Deeper models are helpful.
• Combining deep models with MF in the latent space leads to better results.
Discussion Questions
• Q1: Do users and items end up being embedded in the same latent space?
How does backpropagation work so that latent vectors are correctly identified?
• Q2: How is GMF (log loss, pointwise) better than the BPR loss function (pairwise)?
• Q3: How does a DNN resolve the inner-product issue?
• Q4: Converting ratings to implicit feedback. Rating 5 vs rating 1, both being mapped “1”
• Q5: Cold start problem? Incorporate auxiliary features
• Q6: Computationally expensive?
Collaborative Variational Autoencoder for Recommender Systems
Xiaopeng Li, James She
Presenter - Ajitesh Gupta
Slides adapted from KDD 2017
Motivation
● An increasing variety of content is available to online users
○ E.g. Movies, Images, Songs, Articles
● Companies want to target as many users as possible, as best as possible
● Current systems work mostly on collaborative filtering
○ Predict recommendations based on previous history of user interaction (ratings, likes, views)
● Two major problems -
○ Sparsity - Rating matrix can be and actually is extremely sparse in many cases
○ Cold-start - No ratings available for new items
Use content
● As mentioned earlier, items have quite a lot of content
○ Images, textual descriptions
● Understanding contents helps understand user preferences
○ Maybe they prefer a certain style of narrative
○ Maybe they like the presence of certain instruments in songs
● Combine collaborative methods with content methods -> Hybrid Method
○ Previous history + subjective interests = best of both worlds
● Loosely coupled vs Tightly coupled
○ Loosely coupled hybrid systems
■ Separate content and collaborative prediction
■ Final prediction is linear combination / regression
○ Tightly coupled hybrid systems
■ Unified model of content and collaborative prediction
Challenges
● How to select features?
○ Content features are not directly suitable for recommendation
○ Very high dimensional
○ Not optimized for the task
● Moreover, how to integrate that with the collaborative part?
● The authors look towards deep generative models in order to learn such suitable representations
Variational Autoencoder
● Latent space modelled in the form of a Gaussian
○ Can just sample from the Gaussian to generate
○ Gaussian makes functions easier to calculate
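Both points can be illustrated with a small sketch, assuming a diagonal Gaussian posterior parameterized by a mean and a log-variance per dimension and a standard normal prior. The function names are hypothetical; the sampling uses the usual reparameterization z = μ + σ·ε, and the KL term has a closed form, which is one reason the Gaussian choice is convenient.

```python
import math
import random

def sample_latent(mu, log_var, rng):
    """Reparameterized sample: z = mu + sigma * eps, with eps ~ N(0, 1)."""
    return [m + math.exp(0.5 * lv) * rng.gauss(0.0, 1.0)
            for m, lv in zip(mu, log_var)]

def kl_to_standard_normal(mu, log_var):
    """Closed-form KL( N(mu, diag(sigma^2)) || N(0, I) )."""
    return 0.5 * sum(m * m + math.exp(lv) - 1.0 - lv
                     for m, lv in zip(mu, log_var))

rng = random.Random(0)
z = sample_latent([0.0, 1.0], [0.0, -2.0], rng)
kl_zero = kl_to_standard_normal([0.0, 0.0], [0.0, 0.0])  # posterior equals prior
kl_shifted = kl_to_standard_normal([1.0], [0.0])          # mean shifted by 1
```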
Collaborative Variational Autoencoder
● Bayesian generative model - both content and rating are generated using latent variables
○ Ratings through graphical model
○ Content through generation network
Collaborative Variational Autoencoder
[Figure: graphical model. Labels: rating, user latent variable, item latent variable, item latent collaborative variable, item latent content variable, item content]
Collaborative Variational Autoencoders
Generative Process
1. For each item j:
a. Draw the latent collaborative variable v_j* from prior
b. Draw the latent content variable z_j from prior
c. Draw the item content from
d. Draw the item variable
2. For each user i:
a. Draw the user variable u_i from prior
3. For each rating R_ij:
a. Draw rating from
Joint Probability
Inference
● Intractable posterior requires approximate inference
● Hence we do variational inference
● Gaussian distribution for
Inference
● Inference network makes inference of z through two paths
○ Generating content x
○ Generating the rating R
● Learns a good z both for representing useful information of x and for better recommendation
MAP Estimation for learning
[Equation: MAP objective. Annotated terms: error between predicted and true ratings; contribution of content for recommendation; reconstruction error and KL for VAE; regularization. Further labels: Round 1, Round 2]
Prediction
● Let D be the observed data. After all the parameters U, V (or μ, Λ) and the weights of the inference network and generation network are learned, the predictions can be made by
● For point estimation, the prediction can be simplified as
where E[z_j] = μ_j, the mean vector obtained through the inference network for each item j. Items that have never been seen before have no v_j* term, but E[z_j] can be inferred through the content.
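A minimal sketch of the point-estimate prediction, assuming the item vector is the sum of the collaborative offset v_j* and the content mean E[z_j], and that the rating is its inner product with u_i. The function name `predict_cvae` and the toy vectors are hypothetical. For a never-seen item, the offset is treated as zero so the score comes from the content alone.

```python
def predict_cvae(u_i, z_mean_j, v_offset_j=None):
    """Score u_i^T (v_j* + E[z_j]); the offset is dropped for new items."""
    if v_offset_j is None:
        v_offset_j = [0.0] * len(z_mean_j)  # cold-start item: no v_j* term
    v_j = [o + z for o, z in zip(v_offset_j, z_mean_j)]
    return sum(u * v for u, v in zip(u_i, v_j))

u_i = [1.0, 2.0]
z_mean = [0.5, 0.25]                                # from the inference network
seen_item = predict_cvae(u_i, z_mean, [0.5, 0.25])  # has a learned offset
new_item = predict_cvae(u_i, z_mean)                # content only
```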
Experiments
Dataset - User written articles
            citeulike-a   citeulike-t
# users     5,551         7,947
# items     16,950        25,975
# ratings   204,956       134,850
Density     0.22%         0.07%
Content     text          text
Preprocessing
● Bag of words
● Tf-idf term selection
● Normalization
Training Setting
● Sparse - each user has 1 rating
● Dense - each user has 10 ratings
Measure
Other methods
● Collaborative Topic Regression
○ LDA content features + collaborative filtering
○ Tight coupling
● DeepMusic
○ Use of neural network for linear regression to convert content to latent factors
○ Latent factors obtained from Weighted Matrix Factorization
○ Loosely coupled
● Collaborative Deep Learning
○ Similar to CVAE but with Stacked Denoising Autoencoder + Collaborative Filtering
○ Tightly coupled
Results
Results - Effect of penalizing reconstruction error (λr)
Results - Effect of increasing content dimensions (K)
Qualitative Results
Extensions to various types of content
● Different neural network architectures for different types of content
○ Convolutional neural networks
○ Recurrent neural networks
● Taking advantage of different types of content
○ Adversarial networks
Thank You
TransNets: Learning to Transform for Recommendation
Rose Catherine, William Cohen. 2017
Presenter: Siyu Jiang
Motivation
• Task: Reviews -> Ratings
• Reviews include:
• A user’s reviews for items.
• An item’s received reviews from users.
• Use deep learning’s advantages in recommender systems.
• The state-of-the-art method, Deep Cooperative Neural Networks (DeepCoNN), has flaws.
Related Work: DeepCoNN
• Take review texts as input
• Word Embedding
• pre-trained, no updates.
• Convolutional Layer
• With different kernels.
• Max-pooling Layer
• Produce fixed size vector
• Location invariant
• Fully Connected Layer
• Estimate ratings using Factorization Machines.
[Figure label: CNN Text Processor]
Limitations of DeepCoNN
• User A’s rating on Item B = F(User A’s reviews, Item B’s reviews)
• Let Rev(A,B) denote user A’s review for item B.
• User A’s reviews = Rev(A,B) + reviews for other items
• Item B’s reviews = Rev(A,B) + reviews from other users.
• DeepCoNN used Rev(A,B) at both training and testing.
• It is unreasonable to recommend an item to users after they have experienced it.
• DeepCoNN’s Performance:
• Train with Rev(A, B), Test with Rev(A, B): MSE = 1.21
• Train with Rev(A, B), Test without Rev(A, B): MSE = 1.89
• Train without Rev(A, B), Test without Rev(A, B): MSE = 1.70
TransNets
• Inspirations
• Rev(A, B) is important to predicting Rating(A, B).
• Rev(A, B) is available during training.
• Consists of two networks.
• Target Network to process Rev(A, B)
• Source Network to process texts of (user A and Item B).
TransNet Architecture
Target Network
Source Network
• Input: Reviews by User A (excluding the review for Item B) and Reviews for Item B (excluding the review by User A).
• Output: Rating(A, B).
• Transform layer: transform the user and the item vectors into an approximate feature of rev(A, B).
• Only source network is used during testing.
Training TransNet
• Three-step training.
• Step 1: Train target network on actual review.
• Step 2: Learn to transform.
• Step 3: Train a predictor on the transformed input.
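The three steps can be summarized by the loss each one minimizes. This is a hedged sketch only: it assumes an absolute-error loss on ratings for Steps 1 and 3 and a squared-error loss between the transformed source representation and the target network's representation of Rev(A, B) for Step 2. The function name, argument names, and toy values are hypothetical, and the networks themselves are omitted.

```python
def transnet_step_losses(r_true, r_target, x_target, z_source, r_source):
    """Return the (step 1, step 2, step 3) losses for one training example."""
    # Step 1: target network predicts the rating from the actual review.
    loss_target = abs(r_target - r_true)
    # Step 2: transform layer output should approximate the target representation.
    loss_transform = sum((z - x) ** 2 for z, x in zip(z_source, x_target))
    # Step 3: predictor on the transformed input should recover the rating.
    loss_source = abs(r_source - r_true)
    return loss_target, loss_transform, loss_source

losses = transnet_step_losses(
    r_true=4.0, r_target=3.5,
    x_target=[1.0, 0.0], z_source=[0.5, 0.5],
    r_source=3.0,
)
```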
Dataset
Result
• TransNet-Ext: includes user and item identity in the inputs.
• DeepCoNN: Train with Rev(A, B), Test without Rev(A, B).
• DeepCoNN: Train without Rev(A, B), Test without Rev(A, B).
• DeepCoNN + Test Reviews: Train with Rev(A, B), Test with Rev(A, B).
Thank you! & Questions?
What Your Images Reveal: Exploiting Visual Content for Point-of-Interest Recommendation
Wang, Wang, Tang, et al. Arizona State Univ. & Michigan State Univ.
Presented by Stephanie Chen
Oct. 25, 2017
This paper’s goal
To improve the performance of Point-of-Interest (POI) recommendation by incorporating visual content from check-in data from Location-Based Social Networks (LBSN)
Previous Work
Existing POI Recommendation Methods:
1. Temporal patterns
2. Geographical influence
3. Social correlations
4. Textual content indications
Location-Based Social Networks
● Focused on check-ins with images
● Provides unique information about user preferences and additional POI properties
● Authors’ hypothesis:
Users who post lots of images about food have more incentive to visit restaurants.
Extracting Visual Content
VGG16 model Convolutional Neural Network (CNN)
Optimization Framework - Section 4
● Gradient descent
● Negative sampling to simplify gradients
○ For each image p_k ∈ P_ui, randomly sample r images from those that are not posted by user u_i
○ Maximize the similarity between u_i and the visual content of p_k
○ Minimize the similarity between u_i and the r randomly sampled images
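The sampling step above can be sketched as follows. This is a minimal illustration, assuming images are identified by hashable IDs; the function name `sample_negative_images` and the toy IDs are hypothetical, and the similarity terms of the objective are not shown.

```python
import random

def sample_negative_images(user_images, all_images, r, rng):
    """Pick r images that were not posted by the given user."""
    candidates = [p for p in all_images if p not in user_images]
    return rng.sample(candidates, min(r, len(candidates)))

rng = random.Random(0)
posted = {"p1", "p2"}
negatives = sample_negative_images(posted, ["p1", "p2", "p3", "p4", "p5"], 2, rng)
```

In a full training loop this would be called once per posted image p_k, so each positive is paired with its own r negatives.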
Data Sets - Section 5.1
● Instagram check-ins from New York City & Chicago
○ Both with and without location tags
● Selected only:
○ locations visited by at least 2 unique users
○ users who have checked in at 8 or more distinct locations
● Threw out all selfies to reduce noise & improve performance
○ Not enough information on POIs or user’s interests towards POIs because human body/face took up entire image
○ Authors checked “manually” for selfies
Experiment
● Training Set: For each individual user in the check-in matrix, randomly select x% of all POIs where they have checked in.
● Test Set: The rest of the observed user-POI pairs.
● Remove images associated with check-ins in the test set to ensure no test data is exposed during training.
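A minimal sketch of the per-user split, assuming each user's check-ins are given as a set of POI IDs. The function name `split_checkins`, the rounding rule, and the toy data are assumptions; removing the images tied to test check-ins is not shown.

```python
import random

def split_checkins(user_pois, train_fraction, rng):
    """Randomly keep train_fraction of each user's POIs for training."""
    train, test = {}, {}
    for user, pois in user_pois.items():
        shuffled = sorted(pois)  # sort first so the shuffle is reproducible
        rng.shuffle(shuffled)
        cut = int(round(train_fraction * len(shuffled)))
        train[user] = set(shuffled[:cut])
        test[user] = set(shuffled[cut:])
    return train, test

checkins = {"u1": {"a", "b", "c", "d"}, "u2": {"a", "e"}}
train, test = split_checkins(checkins, 0.5, random.Random(0))
```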
Performance Comparison - Section 5.2
● Matrix-factorization-based POI recommender systems outperform user-oriented collaborative filtering (UCF)
● VPOI obtains better performance than baseline methods based on MF, suggesting that incorporating visual content can improve recommendation performance
● VPOI is better at handling the noisy images from Instagram than Visual Bayesian Personalized Ranking (VBPR) because:
○ VPOI uses the images to learn latent features of users and POIs, whereas VBPR directly uses them as descriptions of locations to predict preference scores
Handling Cold-Start Users - Section 5.3
● Cold-Start Users: User without check-in history, or who never adds geo-tags to posted photos
● <30% of Instagram images are tagged with POIs
● To grab data containing only cold-start users, for each user:
○ Randomly select x% of all POIs for training, leaving the rest for the test set
○ Remove images associated w/ check-ins in the test set
○ Randomly select 5% of users from the training set, and:
■ remove their check-ins from the training set
■ remove their images with geo-tags
○ Results in 5% of data becoming cold-start users
Cold-Start Performance Comparison
● Introducing cold-start users deteriorates performance for all the recommendation systems
● Relative to the other systems’ results, VPOI performance degeneration was much smaller because VPOI learns user latent factors for cold-start users
Conclusion - Section 6
● Used CNN to extract features from visual content (images) and to learn latent user and POI features
● VPOI’s experimental results show it outperforms representative state-of-the-art POI recommender systems
● Future work:
○ Incorporate the original four factors to see if even better performance can be achieved
○ Use streaming recommender system techniques because user check-in records are streaming
Neat Notes
3 of Julian’s works were cited by these authors!
Room for Improvement
Dataset was not actually available on the first author’s webpage.
So many grammatical and spelling errors!
e.g. section 1: “how to extract useful visual contents from images as we are lack of ground truth of what are contained in the images”
e.g. table 4 title “perforamnce”
Appendix Slides
Notation - Section 3
3 types of objects: users, locations and images
U = {u1, u2, . . . , un} = the set of users,
L = {l1, l2, . . . , lm} = the set of locations,
P = {p1, . . . , pN} = the set of photos,
where n, m and N are the number of users, POIs and images, respectively.
R denotes the check-in matrix.
Notation - Section 3
Users can check in at locations.
X ∈ ℝ^(n×m) denotes the user-POI check-in matrix
Xij denotes the check-in frequency or rating of ui on lj
Users can upload images to LBSNs.
Pui = the set of images uploaded by ui
Users can also choose to add locations to images.
Plj denotes the set of images that are tagged lj
Acronyms
CNN = Convolutional Neural Network
LBSN = Location-Based Social Network
MF = Matrix Factorization
POI = Point of Interest
UCF = User-oriented Collaborative Filtering
VBPR = Visual Bayesian Personalized Ranking
VPOI = Visual Content Enhanced POI Recommendation
Neural Factorization Machines for Sparse Predictive Analytics
Xiangnan He, Tat-Seng Chua. 2017
Presenter: Chester Holtz
Motivation
• Many predictive tasks involving web applications need to model categorical variables, such as user IDs or demographics.
• To apply standard machine learning techniques, these categorical predictors are typically converted to a set of binary features via one-hot encoding, making the resultant feature vector highly sparse.
• To learn from such sparse data effectively, it is crucial to account for the interactions between features.
Factorization Machines
• FM can directly model explicit features
• Rather than projecting data into a latent vector space, FM projects each feature into the latent space.
• FM estimates the target by modelling all interactions between each pair of features via factorized interaction parameters:
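A minimal sketch of the standard second-order FM prediction, with a global bias w0, linear weights w, and one latent vector per feature in V, so that each pairwise interaction weight is the inner product of two latent vectors. The function name `fm_predict` and the toy parameters are hypothetical.

```python
def fm_predict(x, w0, w, V):
    """w0 + sum_i w_i x_i + sum_{i<j} <v_i, v_j> x_i x_j."""
    linear = w0 + sum(wi * xi for wi, xi in zip(w, x))
    pairwise = 0.0
    for i in range(len(x)):
        for j in range(i + 1, len(x)):
            interaction = sum(a * b for a, b in zip(V[i], V[j]))
            pairwise += interaction * x[i] * x[j]
    return linear + pairwise

# Sparse one-hot style input: features 0 and 2 are active.
x = [1.0, 0.0, 1.0]
V = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
y_hat = fm_predict(x, w0=0.5, w=[0.1, 0.2, 0.3], V=V)
```

The double loop is written for readability; it is quadratic in the number of features.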
Factorization Machines Limitations
• FM is still a linear model
• FM may not capture higher-order and nonlinear feature interactions present in the data.
Neural Factorization Machines
• Unifies linear FMs and neural networks for sparse data modelling
• Performs a non-linear transformation on the latent space of the second-order feature interactions while capturing higher-order feature interactions, where f(x) is modeled as a neural network.
Bi-Interaction Layer
• The proposed Bi-Interaction Layer takes a set of dense k-dimensional feature embeddings and maps them to a single k-dimensional vector.
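A minimal sketch of Bi-Interaction pooling, assuming the layer sums the element-wise products of all pairs of feature embeddings x_i·v_i. It uses the identity 0.5 · [(Σ x_i v_i)² − Σ (x_i v_i)²], applied per dimension, so the cost is linear in the number of features. The function names and toy values are hypothetical; the naive pairwise version is included only to check the identity.

```python
def bi_interaction(x, V):
    """Linear-time pooling of all pairwise element-wise products."""
    k = len(V[0])
    sum_vec, sum_sq = [0.0] * k, [0.0] * k
    for xi, vi in zip(x, V):
        for d in range(k):
            e = xi * vi[d]
            sum_vec[d] += e
            sum_sq[d] += e * e
    return [0.5 * (s * s - q) for s, q in zip(sum_vec, sum_sq)]

def bi_interaction_naive(x, V):
    """Reference version: explicit sum over pairs i < j."""
    k = len(V[0])
    out = [0.0] * k
    for i in range(len(x)):
        for j in range(i + 1, len(x)):
            for d in range(k):
                out[d] += (x[i] * V[i][d]) * (x[j] * V[j][d])
    return out

x = [1.0, 0.0, 1.0]
V = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
pooled = bi_interaction(x, V)
```

Summing the pooled vector gives the pairwise term of a plain FM, which is the sense in which NFM generalizes FM.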
Formulation & Properties of NFM
• Formulation
• Generalizes FM (pf)
• Model evaluation complexity (pf):
Evaluation
• Datasets
• MovieLens
• Movie tag recommendation
• Frappe
• App recommendation
• Baselines
• LibFM
• HOFM
• Wide & Deep
• DeepCross
Results & Discussion
• (RQ1)Dropout Improves Generalization, Batch Normalization Speeds up Training
• (RQ2)The BI-layer has effectively encoded second-order feature interactions - a simple non-linear function is sufficient to capture higher-order interactions.
Conclusion and Future Directions
• Proposed a novel neural network model NFM, which ties linear FM with the representation ability of non-linear neural networks for sparse predictive analytics.
• Proposed Bi-Interaction operation which provides the model the ability to learn more informative feature interactions at the lower level.
• Experiments on two real-world datasets show that, with only one hidden layer, NFM outperforms FM, higher-order FM, and state-of-the-art deep learning approaches.
Future Directions
• Improve the efficiency of NFM by applying hashing techniques
• Study its performance for other IR tasks, such as search ranking and targeted advertising.
• Extend the objective function with regularizers like the graph Laplacian.
• Explore Bi-Interaction pooling for recurrent neural networks (RNNs) for sequential data modelling
Discussion Questions
1. How can NFM combine with field-aware factorization machines to associate several embedding vectors for a feature to different interactions with other features in another field?
2. The authors sampled 2 negative instances to pair with one positive instance. Why was this able to ensure generalization, and how exactly was it implemented?
3. For sequential data, does the replacement of FNN with RNN yield better performance?
4. Dropout intuitions - prevent complex feature co-adaptations
Deep Neural Networks for YouTube Recommendations
Paul Covington, Jay Adams, Emre Sargin
Presented By:Tushar Bansal
Motivation
Recommend videos to YouTube users based on past activity.
Scale: Many existing recommendation algorithms proven to work well on small problems fail to operate on YouTube’s scale.
Freshness: Recommendation system should be responsive enough to model newly uploaded content as well as the latest actions taken by the user.
Noise: Historical user behavior on YouTube is inherently difficult to predict due to sparsity and a variety of unobservable external factors. Ground truth of user satisfaction is rare.
Model Overview
A combination of two models:
1. Candidate Generation: The enormous YouTube corpus is winnowed down to hundreds of videos that may be relevant to the user.
2. Ranking: Rank the candidate videos, based mainly on impression data.
Candidate Model
● Multi-class classification problem with each video as a class.
● Use implicit feedback data instead of explicit feedback.
● Training: Softmax function
v_j = candidate video embedding, u = user embedding
● Minimize cross entropy loss
● Test: Nearest Neighbor
● Fully connected ReLU layers in a tower pattern
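The training and test behaviour can be sketched as follows, assuming the softmax logit for video j is the inner product of v_j and u. The function names and toy embeddings are hypothetical, and the brute-force scoring below only stands in for the nearest-neighbor lookup used at serving time.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def softmax_watch_probs(u, video_embeddings):
    """P(watch = j | u) with logits v_j . u."""
    logits = [dot(v, u) for v in video_embeddings]
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def top_n_candidates(u, video_embeddings, n):
    """Indices of the n videos with the highest inner product with u."""
    scores = [dot(v, u) for v in video_embeddings]
    order = sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)
    return order[:n]

u = [1.0, 0.0]
videos = [[0.0, 1.0], [2.0, 0.0], [1.0, 1.0]]
probs = softmax_watch_probs(u, videos)
candidates = top_n_candidates(u, videos, 2)
```

Because the softmax is monotonic in the logits, ranking by inner product gives the same order as ranking by probability, which is why the normalization can be skipped at test time.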
Candidate Model: Features
● User’s watch history represented by a sequence of sparse video IDs mapped to dense vector representations via embeddings.
● User’s search queries tokenized into unigrams and bigrams, with each token embedded
● Example Age feature: Important to promote fresh content
Candidate Model: Features
● Fix the number of training samples per user.
● Represent search queries as an unordered bag of tokens.
● Predict the next watch rather than a held-out watch.
● Training examples come from watches embedded on other websites too.
Candidate Model: Evaluation
*MAP: Mean average precision
Ranking Model
● Many more features describing the video and the user-video relationship.
● Ranking objective is based on live A/B testing results but is generally a simple function of expected watch time per impression.
● Training: Weighted logistic regression
○ Positive samples weighted by observed watch time on the video
○ Negative samples receive unit weight
● Fully connected ReLU layers in a tower pattern
● Candidates from other candidate sources
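The weighting scheme can be sketched as a loss function. This is a minimal illustration, assuming each impression is a tuple of predicted click probability, a clicked flag, and observed watch time; the function name `weighted_log_loss` and the tuple layout are hypothetical.

```python
import math

def weighted_log_loss(impressions):
    """Cross entropy with positives weighted by watch time, negatives by 1."""
    total = 0.0
    for prob, clicked, watch_time in impressions:
        if clicked:
            total -= watch_time * math.log(prob)  # positive: weight = watch time
        else:
            total -= math.log(1.0 - prob)         # negative: unit weight
    return total

# One clicked impression watched for 2 time units, one unclicked impression.
loss = weighted_log_loss([(0.5, True, 2.0), (0.5, False, 0.0)])
```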
Ranking Model: Features
● User’s interaction with video or similar items (e.g. channel)
● Past video impressions.
● Last search queries, last N videos watched etc.
Ranking Model: Evaluation
Discussion
1. YouTube has many playlists from users, but the paper does not seem to consider that feature. If we would like to consider it, how should the model change?
2. The paper doesn’t mention explicit feedback from the user, such as channel subscriptions and likes, which may greatly affect candidate generation. Even though explicit feedback is sparse, using it may improve the recommendations. Is it still advisable not to use it?
3. Recommendations may well be periodic in nature: they should vary with the time of day or day of week. Would including a time factor explicitly during the candidate generation phase help?
4. How can we ensure diversity of results?
5. Give raw input to the model instead of embeddings?
6. What is the motivation for using extreme multiclass classification? Why would a logistic loss function or a pairwise ranking not work?
Thank You.
Deep Learning based Large Scale Visual Recommendation and Search for E-commerce
By Devashish Shankar, Sujay Narumanchi, H A Ananya, Pramod Kompalli, Krishnendu Chaudhury
- Dhruv Sharma
Problems it talks about
● Visual Recommendation
○ Retrieve a ranked list of catalog images similar to another catalog image
○ A Deep Ranking model for recommending items using images
● Visual Search
○ Retrieve a ranked list of images similar to a “Wild” (user-uploaded) image
○ Uses the same deepnet model and NN search
● Scalability
○ Scaling the solution to:
■ 50 M items
■ 100K / hr ingestions (add, delete, modify item)
■ 2000 queries / second with 100ms latency
● Domain: Fashion
Catalog and Wild Images
[Figure: example catalog images and wild images]
Datasets
1. Flipkart Fashion Dataset (Evaluation)
2. Exact Street2Shop (Training)
Problem being solved
Model Architecture
Training Procedure
● Triplet of images <q,p,n>
● Relative similarity using Hinge loss
● Train using In-class and Out-of-class negative triplets
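A minimal sketch of a hinge loss on a triplet <q, p, n>, assuming squared Euclidean distance between image embeddings and a fixed margin. The distance choice, the margin value, and the function names are assumptions for illustration. The loss is zero once the positive is closer to the query than the negative by at least the margin.

```python
def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def triplet_hinge_loss(q, p, n, margin=1.0):
    """max(0, margin + D(q, p) - D(q, n)) on embedding vectors."""
    return max(0.0, margin + squared_distance(q, p) - squared_distance(q, n))

q = [0.0, 0.0]
satisfied = triplet_hinge_loss(q, p=[1.0, 0.0], n=[3.0, 0.0])  # positive is closer
violated = triplet_hinge_loss(q, p=[3.0, 0.0], n=[1.0, 0.0])   # negative is closer
```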
Training Data Generation
● Use Basic Image Similarity Scorer (BISS) models to approximate similarity
○ Each BISS identifies the 1000 closest images to q
● Two types of training triplets.
○ Catalog Image Triplets
■ All images are sampled from catalog images
○ Wild Image Triplets
■ Query images are sampled from wild images, and positive and negative images are sampled from catalog images
Training Data Generation
● Catalog Image Triplets
○ q is sampled from the set of catalog images
○ p is sampled from the set of catalog images with high similarity scores, using the union of the top 200 images from multiple BISS models
○ n is sampled from the set of catalog images with low similarity scores, using the rest of the images from the union of BISS models
■ In-class negative images: sampled from a union of the top 500-1000 from each BISS
■ Out-of-class negative images: sampled from images of the same category outside the top 1000 from each BISS
Training Data Generation
● Wild Image Triplets
○ q is sampled from the cropped wild images from Exact Street-to-Shop
○ p is the ground truth image match in the data set
○ n is sampled from the set of catalog images with low similarity scores from Visnet trained on catalog images
■ n is sampled from in-class and out-of-class images
● Object Localization for wild images (Search)
○ Used Faster R-CNN (FRCNN) trained on Fashionista dataset
Production Pipeline (Scalability)
Evaluation
Limitations & Confusions
1. The model is non-personalized
2. Recall reduced for Visnet-FRCNN
3. Separate Deep Ranking model for each category