CSE 291: Trends in Recommender Systems and Human Behavioral Modeling Week 4 presentations

Source: cseweb.ucsd.edu/classes/fa17/cse291-b/pdfs/week4.pdf

  • CSE 291: Trends in Recommender Systems and Human Behavioral Modeling

    Week 4 presentations

  • RECURRENT RECOMMENDER NETWORKS

    Rishab Gulati

  • About The Paper

    - First paper addressing movie recommendation in a fully causal and integrated manner.
    - Recommender system that can:
      ● Capture interactions between users and movies
      ● Capture the temporal dynamics of users and movies
    - Uses a nonparametric model
    - Uses a combination of LSTMs and low-rank factorization
    - Tuples: (User_id, Item_id, timestamp) -> Rating

  • Recurrent Neural Network

    - Captures the temporal dynamics
    - Basic rule:

    - Uses LSTMs. Equations:
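    For reference, one step of a textbook LSTM cell can be sketched as below. This is the standard formulation, not necessarily the exact variant used in the paper; the function name `lstm_step` and the stacked gate layout are assumptions made for illustration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x, h_prev, c_prev, W, U, b):
    """One step of a textbook LSTM cell (illustrative sketch).

    x: input (D,), h_prev / c_prev: previous hidden and cell state (H,).
    W: (4H, D), U: (4H, H), b: (4H,), with gates stacked in the order
    [input, forget, output, candidate] (an assumed layout).
    """
    H = h_prev.shape[0]
    z = W @ x + U @ h_prev + b
    i = sigmoid(z[0:H])            # input gate
    f = sigmoid(z[H:2 * H])        # forget gate
    o = sigmoid(z[2 * H:3 * H])    # output gate
    g = np.tanh(z[3 * H:4 * H])    # candidate cell update
    c = f * c_prev + i * g         # new cell state
    h = o * np.tanh(c)             # new hidden state
    return h, c
```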

  • MODEL

    - Uses 2 separate LSTM networks
    - Update functions:

    - Throughout training, the model learns the functions f, g, and h
    - Advantages:
      1. Concise
      2. Not affected by sparsity of the dataset
      3. New data can be incorporated without re-training

  • How it works:

    - Dataset of M movies
    - x_t -> rating vector of the user at time t, x_t ∈ ℝ^M
    - x_tj = k if the user gave rating k to movie j at time t, else 0
    - τ_t -> denotes the wallclock at time t
    - 1_newbie = 1 -> new user
    - W_embed -> transformation matrix to project information into the embedding space
    - y_t -> input to the LSTM at time t
    - Note: Steps where the user has not rated any movies are not fed to the LSTM
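    The input construction above can be sketched as follows. The exact concatenation order [x_t, 1_newbie, τ_t, τ_{t-1}] and the helper name `lstm_inputs` are assumptions for illustration; the slide only states that W_embed projects this information into the embedding space and that empty steps are skipped.

```python
import numpy as np

def lstm_inputs(steps, W_embed):
    """Build the LSTM inputs y_t for one user (illustrative sketch).

    steps: list of (x_t, tau_t) pairs, where x_t is the length-M rating vector.
    W_embed: matrix of shape (embed_dim, M + 3).
    Steps with no ratings are skipped, as noted on the slide.
    """
    inputs = []
    tau_prev = None
    is_newbie = True
    for x_t, tau_t in steps:
        x_t = np.asarray(x_t, dtype=float)
        if not np.any(x_t):
            continue  # the user rated nothing at this step
        if tau_prev is None:
            tau_prev = tau_t  # assumption: first step reuses its own wallclock
        features = np.concatenate([x_t, [1.0 if is_newbie else 0.0, tau_t, tau_prev]])
        inputs.append(W_embed @ features)
        is_newbie = False
        tau_prev = tau_t
    return inputs
```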

  • Stationary components present: u_i and m_j

    - User profile, long-term preference
    - Movie genre
    - Rating function changes to:

    - Standard factorization for stationary effects
    - LSTMs for long-range dynamic updates
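    One way to read the combined rating function is as a dynamic inner product plus a stationary one. The sketch below assumes the form r̂ = ⟨ũ_it, m̃_jt⟩ + ⟨u_i, m_j⟩, where ũ_it and m̃_jt are affine maps of the LSTM states; the parameter names (`A_u`, `b_u`, `A_m`, `b_m`) are hypothetical.

```python
import numpy as np

def predict_rating(u_state, m_state, u_stat, m_stat, A_u, b_u, A_m, b_m):
    """Dynamic + stationary rating prediction (illustrative sketch).

    u_state, m_state: LSTM states of the user and the movie at time t.
    u_stat, m_stat: stationary factors u_i and m_j.
    A_*, b_*: assumed affine maps from LSTM states to dynamic factors.
    """
    u_dyn = A_u @ u_state + b_u
    m_dyn = A_m @ m_state + b_m
    return float(u_dyn @ m_dyn + u_stat @ m_stat)
```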

  • Prediction:

    - Uses extrapolated states:

    Latest Observation(Input) → Updates states → Predictions

    So it takes into account the causal effects of previous ratings.

    - Loss Function:

  • Backpropagation:

    - Rating depends on both user-state RNN and movie-state RNN

    - Backpropagation is computationally prohibitive

    - Uses Subspace descent method

  • EXPERIMENTS

    DATASET:

  • Model Setup:

    - Single-layer LSTM with 40 hidden neurons
    - 40-dimensional input embeddings
    - 20-dimensional dynamic states
    - 20-dimensional and 160-dimensional stationary latent states for Netflix and IMDB respectively
    - ADAM optimizer for LSTM and SGD for stationary factors
    - Train stationary factors first and then the whole model
    - Stationary latent states initialized by:
      - PMF for Netflix dataset
      - AutoRec for IMDB dataset
    - Runs for 10 epochs

  • Compared With:

    - PMF
    - TimeSVD++
    - AutoRec
    - 20-dim stationary states and 40-dim embeddings in the RNN beat PMF and TimeSVD++ with 160 dims and AutoRec with 500-dim embeddings
    - Also more robust

  • Temporal Dynamics

    - Automatically models a variety of temporal effects

    1) Exogenous Dynamics on Movies

  • 2) Endogenous Dynamics on Movies

    - Age Effect - Different Perspectives

  • - Change in rating mechanism

  • Performance in Cold Start Conditions

    - Compared with PMF in conditions where #ratings is very low.
    - Marker size denotes user counts

  • Incorporation of New Data

    - Allows newly observed ratings to be incorporated without re-training.

    Observed info. (Input) → Rating → extrapolate states → Predictions

    - Fluctuation = avg_rating_{highest_rated_t} - avg_rating_{lowest_rated_t}
    - Tested on movies with different levels of rating fluctuations.

  • MAIN TAKE-AWAYS

    - First paper to implement movie recommendation in causal and integrated manner.

    - Uses LSTMs to model temporal dynamics and standard factorization for stationary features

    - Learns the functions and not the latent state upon training.
    - Automatically models a variety of temporal effects.
    - Outperforms the compared baselines (PMF, TimeSVD++, AutoRec).

  • Questions?

  • Neural Collaborative Filtering

    He et al., 2017

    Presenter: Kulshreshth Dhiman

    Slides adapted from He WWW17 presentation

  • Overview

    • An ensemble of
      • Generalized Matrix Factorization: weighted inner product of user and item latent vectors, passed through an activation function a_out
      • Multi-Layer Perceptron: concatenated user and item latent vectors passed on to a deep network to learn the user-item interaction
    • Implicit feedback model
    • MovieLens and Pinterest datasets

  • Matrix Factorization

    • Some notation
      ▪ p_u = latent vector for user u
      ▪ q_i = latent vector for item i
      ▪ y_ui = interaction (implicit feedback)
    • MF estimates the interaction y_ui as the inner product of p_u and q_i

  • Limitations of Matrix Factorization

    • The simple choice of the inner product function can limit the expressiveness of an MF model

    (E.g., assuming unit length, so that similarity = cos(p_u, q_i))

    sim(u1, u2) = 0.5
    sim(u3, u1) = 0.4
    sim(u3, u2) = 0.66

    sim(u4, u1) = 0.6
    sim(u4, u2) = 0.2
    sim(u4, u3) = 0.4

    Placing u4 closest to u1 in the latent space gives S42 > S43 (X), although the data say S43 > S42.

    • The inner product can incur a large ranking loss for MF
    • Can use a large number of latent factors, but that may hurt generalization
    • Learn the interaction function from the data instead of using a fixed inner product

  • Neural Collaborative Filtering (NCF) Framework

    • NCF uses a multi-layer model to learn user-item interaction function

    • Input: feature vectors for user u (v_u) and item i (v_i)
      • Here, binarized sparse vectors with one-hot encoding of userID and itemID
    • Output: predicted score ŷ_ui (how likely item i is relevant to user u)
    • Convert it to a binary classification problem
      • Activation function: logistic function
      • Objective function: log loss / binary cross entropy
      • y_ui is labeled 1 if item i is relevant to user u, otherwise 0

  • Generalized Matrix Factorization (GMF)

    • NCF can express and generalize MF:

    • Layer 1: element-wise product
    • Output layer: fully connected layer without bias
    • h allows varying importance of latent dimensions
    • The a_out activation function can model non-linearities
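    The two layers above amount to ŷ_ui = a_out(hᵀ(p_u ⊙ q_i)). A minimal NumPy sketch follows; the function name `gmf_score` is hypothetical. With h set to all ones and a_out the identity, it reduces to the plain MF inner product, which is the sense in which GMF generalizes MF.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gmf_score(p_u, q_i, h, a_out=sigmoid):
    """GMF prediction: a_out(h^T (p_u * q_i)) (illustrative sketch).

    p_u, q_i: user and item latent vectors of equal length.
    h: output-layer weights (no bias), one weight per latent dimension.
    a_out: output activation; sigmoid suits the 0/1 implicit-feedback setting.
    """
    return float(a_out(h @ (p_u * q_i)))
```

    Usage: `gmf_score(p, q, np.ones_like(p), a_out=lambda z: z)` returns the ordinary inner product of p and q.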

  • Multi-Layer Perceptron (MLP)

    • NCF can provide more non-linearity to learn user-item interactions

    Layer 1:

    Remaining Layers:

  • MF vs MLP

    • MF uses an inner product as the interaction function:
      • Latent factors are independent of each other;
      • It empirically has good generalization ability for CF modelling
    • MLP uses nonlinear functions to learn the interaction function:
      • Latent factors are not independent of each other;
      • The interaction function is learnt from data, which conceptually has a better representation ability.

    • Can we fuse two models to get a more powerful model?

  • Fusion of GMF and MLP (NeuMF)

    • GMF and MLP have separate sets of embeddings

    • Concatenate weights of the two models

  • Experimental Setup

    • Two public datasets, from MovieLens and Pinterest:
      • Transform MovieLens ratings to the 0/1 implicit case
    • Evaluation
      • Leave-one-out: hold out the latest rating of each user as the test item
      • Randomly sample 100 items that the user has not interacted with
      • Rank the test item among the 100 items
      • HR (Hit Ratio)@10: captures whether the test item is present in the top-10 list
      • NDCG@10: accounts for the position of the hit by assigning higher scores to hits at top ranks
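    The leave-one-out metrics can be sketched per user as below. The function names are hypothetical, and counting ties against the test item is an assumption; with a single held-out item the ideal DCG is 1, so NDCG reduces to 1 / log2(rank + 2) for a 0-based rank.

```python
import math

def rank_of_test_item(test_score, negative_scores):
    """0-based rank of the held-out item among the sampled negatives.

    Ties are counted against the test item (a conservative assumption).
    """
    return sum(1 for s in negative_scores if s >= test_score)

def hit_ratio_at_k(rank, k=10):
    """1.0 if the test item appears in the top-k list, else 0.0."""
    return 1.0 if rank < k else 0.0

def ndcg_at_k(rank, k=10):
    """NDCG with a single relevant item: higher for hits nearer the top."""
    return 1.0 / math.log2(rank + 2) if rank < k else 0.0
```

    Averaging `hit_ratio_at_k` and `ndcg_at_k` over all users gives HR@10 and NDCG@10.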

  • Baselines

    • ItemPop. Items are ranked by their popularity.

    • ItemKNN [Sarwar et al, WWW’01] The standard item-based CF method.

    • BPR [Rendle et al, UAI’09] Bayesian Personalized Ranking optimizes the MF model with a pairwise ranking loss, which is tailored for implicit feedback and item recommendation.

    • eALS [He et al, SIGIR’16] The state-of-the-art CF method for implicit data. It optimizes the MF model with a varying-weighted regression loss.

  • Results

    • Predictive factors = size of last hidden layer
    • NeuMF outperforms eALS and BPR with about 5% relative improvement.
    • Of the three NCF methods: NeuMF > GMF > MLP
    • Three MF methods with different objective functions:
      • GMF (log loss) >= eALS (weighted regression loss) > BPR (pairwise ranking loss)

  • Is Deep Learning Helpful?

    • For same capability (i.e., same number of predictive factors), stacking more nonlinear layers improves the performance.

  • Conclusion

    • We explored neural architectures for collaborative filtering.
      • Devised a general framework NCF;
      • Presented three instantiations GMF, MLP and NeuMF.
    • Experiments show promising results:
      • Deeper models are helpful.
      • Combining deep models with MF in the latent space leads to better results.

  • Discussion Questions

    • Q1: Do users and items end up being embedded in the same latent space? How does backpropagation work so that latent vectors are correctly identified?
    • Q2: How is GMF (log loss, pointwise) better than the BPR loss function (pairwise)?
    • Q3: How does a DNN resolve the inner-product issue?
    • Q4: Converting ratings to implicit feedback: rating 5 vs rating 1, both being mapped to “1”
    • Q5: Cold start problem? Incorporate auxiliary features
    • Q6: Computationally expensive?

  • Collaborative Variational Autoencoder for Recommender Systems

    Xiaopeng Li, James She

    Presenter - Ajitesh Gupta

    Slides adapted from KDD 2017

  • Motivation

    ● An increasing variety of content is available to online users
      ○ E.g. Movies, Images, Songs, Articles
    ● Companies want to target as many users as possible, as best as possible
    ● Current systems work mostly on collaborative filtering
      ○ Predict recommendations based on previous history of user interaction (ratings, likes, views)
    ● Two major problems -
      ○ Sparsity - Rating matrix can be and actually is extremely sparse in many cases
      ○ Cold-start - No ratings available for new items

  • Use content

    ● As mentioned earlier, items have quite a lot of content
      ○ Images, textual descriptions
    ● Understanding content helps understand user preferences
      ○ Maybe they prefer a certain style of narrative
      ○ Maybe they like the presence of certain instruments in songs
    ● Combine collaborative methods with content methods -> Hybrid Method
      ○ Previous history + subjective interests = best of both worlds
    ● Loosely coupled vs Tightly coupled
      ○ Loosely coupled hybrid systems
        ■ Separate content and collaborative prediction
        ■ Final prediction is a linear combination / regression
      ○ Tightly coupled hybrid systems
        ■ Unified model of content and collaborative prediction

  • Challenges

    ● How to select features?
      ○ Content features are not directly suitable for recommendation
      ○ Very high dimensional
      ○ Not optimized for the task
    ● Moreover, how to integrate that with the collaborative part?
    ● The authors look towards deep generative models in order to learn such suitable representations

  • Variational Autoencoder

    ● Latent space modelled in the form of a Gaussian
      ○ Can just sample from the Gaussian to generate
      ○ Gaussian makes functions easier to calculate

  • Collaborative Variational Autoencoder

    ● Bayesian generative model - both content and rating are generated using latent variables
      ○ Ratings through graphical model
      ○ Content through generation network

  • Collaborative Variational Autoencoder

    Rating

    User latent variable

    Item latent variable

    Item latent collaborative variable

    Item latent content variable

    Item content

  • Collaborative Variational Autoencoders

    Generative Process

    1. For each item j:
       a. Draw the latent collaborative variable v_j* from the prior
       b. Draw the latent content variable z_j from the prior
       c. Draw the item content x_j from the generation network, given z_j
       d. Draw the item variable v_j
    2. For each user i:
       a. Draw the user variable u_i from the prior
    3. For each rating R_ij:
       a. Draw the rating R_ij, given u_i and v_j

    Joint Probability

  • Inference

    ● Intractable posterior requires approximate inference
    ● Hence we do variational inference
    ● Gaussian distribution for the approximate posterior

  • Inference

    ● Inference network makes inference of z through two paths
      ○ Generating content x
      ○ Generating the rating R
    ● Learns a good z both for representing useful information of x and for better recommendation

  • MAP Estimation for learning

    Error between predicted and true ratings

    Contribution of content for recommendation

    Reconstruction error and KL for VAE

    Regularization

  • MAP Estimation for learning

    Round 1

    Round 2

  • Prediction

    ● Let D be the observed data. After all the parameters U, V (or μ, Λ) and the weights of the inference network and generation network are learned, the predictions can be made by

    ● For point estimation, the prediction can be simplified as

    where E[z_j] = μ_j, the mean vector obtained through the inference network for each item j. Items that have never been seen before will have no v_j* term, but E[z_j] can be inferred through the content.
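    The point-estimate prediction described above can be sketched as follows, assuming the item vector decomposes as v_j = v_j* + E[z_j] so that the predicted rating is u_iᵀ(v_j* + E[z_j]). The function name `predict_ratings` is hypothetical; cold-start items are represented by an all-zero v_j* row.

```python
import numpy as np

def predict_ratings(U, mu_z, V_star=None):
    """Point-estimate rating predictions (illustrative sketch).

    U: user latent matrix, shape (n_users, K).
    mu_z: E[z_j] from the inference network, shape (n_items, K).
    V_star: collaborative offsets v_j*, shape (n_items, K); rows of zeros
            (or None for all items) model items with no ratings yet.
    Returns an (n_users, n_items) matrix of predicted ratings.
    """
    V = mu_z if V_star is None else mu_z + V_star
    return U @ V.T
```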

  • Experiments

    Dataset - User written articles

                  citeulike-a   citeulike-t
      # users     5,551         7,947
      # items     16,950        25,975
      # ratings   204,956       134,850
      Density     0.22%         0.07%
      Content     text          text

    Preprocessing
    ● Bag of words
    ● Tf-idf term selection
    ● Normalization

    Training Setting
    ● Sparse - each user has 1 rating
    ● Dense - each user has 10 ratings

    Measure

  • Other methods

    ● Collaborative Topic Regression
      ○ LDA content features + collaborative filtering
      ○ Tightly coupled
    ● DeepMusic
      ○ Use of a neural network for linear regression to convert content to latent factors
      ○ Latent factors obtained from Weighted Matrix Factorization
      ○ Loosely coupled
    ● Collaborative Deep Learning
      ○ Similar to CVAE but with Stacked Denoising Autoencoder + Collaborative Filtering
      ○ Tightly coupled

  • Results

  • Results - Effect of penalizing reconstruction error (λr)

  • Results - Effect of increasing content dimensions (K)

  • Qualitative Results

  • Extensions to various types of content

    ● Different neural network architectures for different types of content
      ○ Convolutional neural networks
      ○ Recurrent neural networks
    ● Taking advantage of different types of content
      ○ Adversarial networks

  • Thank You

  • TransNets: Learning to Transform for Recommendation

    Rose Catherine, William Cohen. 2017

    Presenter: Siyu Jiang

  • Motivation

    • Task: Reviews -> Ratings
    • Reviews include:
      • A user’s reviews for items.
      • An item’s received reviews from users.
    • Use deep learning’s advantages in recommender systems.
    • The state-of-the-art method, Deep Cooperative Neural Networks (DeepCoNN), has flaws.

  • Related Work: DeepCoNN

    • Takes review texts as input
    • Word embedding
      • Pre-trained, no updates.
    • Convolutional layer
      • With different kernels.
    • Max-pooling layer
      • Produces a fixed-size vector
      • Location invariant
    • Fully connected layer
    • Estimates ratings using Factorization Machines.

    (Figure: CNN Text Processor)

  • Limitations of DeepCoNN

    • User A’s rating on Item B = F(User A’s reviews, Item B’s reviews)
      • Let Rev(A,B) denote user A’s review for item B.
      • User A’s reviews = Rev(A,B) + reviews for other items
      • Item B’s reviews = Rev(A,B) + reviews from other users.
    • DeepCoNN used Rev(A,B) at both training and testing.
      • It is unreasonable to recommend an item to users after they have experienced it.
    • DeepCoNN’s performance:
      • Train with Rev(A, B), test with Rev(A, B): MSE = 1.21
      • Train with Rev(A, B), test without Rev(A, B): MSE = 1.89
      • Train without Rev(A, B), test without Rev(A, B): MSE = 1.70

  • TransNets

    • Inspirations
      • Rev(A, B) is important to predicting Rating(A, B).
      • Rev(A, B) is available during training.

  • TransNet Architecture

    • Consists of two networks.

    • Target Network to process Rev(A, B)

    • Source Network to process texts of (user A and Item B).

  • Target Network

  • Source Network

    • Input: Reviews by User A (excluding the review for Item B) and reviews for Item B (excluding the review by User A).
    • Output: Rating(A, B).
    • Transform layer: transforms the user and item vectors into an approximate feature of Rev(A, B).
    • Only the source network is used during testing.

  • Training TransNet

    • Three-step training.
      • Step 1: Train the target network on the actual review.
      • Step 2: Learn to transform.
      • Step 3: Train a predictor on the transformed input.
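    The three training steps can be read as three separate losses, sketched below on precomputed network outputs. The loss choices (L1 for the two rating predictors, squared L2 between the transformed source representation and the target's review representation) and all names are assumptions for illustration; the slide only names the steps.

```python
import numpy as np

def l1_loss(pred, target):
    return float(np.mean(np.abs(np.asarray(pred) - np.asarray(target))))

def l2_loss(a, b):
    # Mean over the batch of squared Euclidean distances between rows.
    diff = np.asarray(a) - np.asarray(b)
    return float(np.mean(np.sum(diff ** 2, axis=1)))

def transnet_step_losses(x_target, r_target_pred, z_source, r_source_pred, r_true):
    """Losses for one batch (illustrative sketch).

    x_target: target network's representation of Rev(A, B), shape (batch, d).
    r_target_pred: ratings predicted by the target network from x_target.
    z_source: source network's transformed representation, shape (batch, d).
    r_source_pred: ratings predicted by the source network from z_source.
    r_true: observed ratings.
    """
    return {
        "step1_target_rating": l1_loss(r_target_pred, r_true),  # train target network
        "step2_transform": l2_loss(z_source, x_target),         # learn to transform
        "step3_source_rating": l1_loss(r_source_pred, r_true),  # train the predictor
    }
```

    At test time only `z_source` and the source predictor would be needed, which matches the point that Rev(A, B) is unavailable when recommending.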

  • Dataset

  • Result

    • TransNet-Ext: includes user and item identity in the inputs.
    • DeepCoNN: Train with Rev(A, B), Test without Rev(A, B).
    • DeepCoNN: Train without Rev(A, B), Test without Rev(A, B).
    • DeepCoNN + Test Reviews: Train with Rev(A, B), Test with Rev(A, B).

  • Thank you! & Questions?

  • What Your Images Reveal: Exploiting Visual Content for Point-of-Interest Recommendation

    Wang, Wang, Tang, et al. Arizona State Univ. & Michigan State Univ.

    Presented by Stephanie Chen

    Oct. 25, 2017

  • This paper’s goal

    To improve performance of Point-of-Interest (POI) recommendation by incorporating visual content from check-in data from Location-Based Social Networks (LBSN)

  • Previous Work

    Existing POI Recommendation Methods:

    1. Temporal patterns
    2. Geographical influence
    3. Social correlations
    4. Textual content indications

  • Location-Based Social Networks

    ● Focused on check-ins with images
    ● Provides unique information about user preferences and additional POI properties
    ● Authors’ hypothesis:

    Users who post lots of images about food have more incentive to visit restaurants.

  • Extracting Visual Content

    VGG16 model Convolutional Neural Network (CNN)

  • Optimization Framework - Section 4

    ● Gradient descent
    ● Negative sampling to simplify gradients
      ○ For each image p_k ∈ P_ui, randomly sample r images from those that are not posted by user u_i
      ○ Maximize the similarity between u_i and the visual content of p_k
      ○ Minimize the similarity between u_i and the r randomly sampled images
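    The sampling step can be sketched as below: for every image a user posted, draw r images that the user did not post. The function names and the use of plain image IDs are assumptions for illustration; the similarity terms themselves are not modelled here.

```python
import random

def sample_negative_images(posted, all_images, r, rng):
    """Draw r distinct images that are not in the user's posted set."""
    candidates = [p for p in all_images if p not in posted]
    if len(candidates) < r:
        raise ValueError("not enough images outside the user's set")
    return rng.sample(candidates, r)

def build_training_pairs(user_images, all_images, r, seed=0):
    """Pair each posted image p_k with r sampled negatives (illustrative sketch).

    Returns a list of (p_k, [negative image ids]) tuples; training would raise
    the user's similarity to p_k and lower it for the negatives.
    """
    rng = random.Random(seed)
    posted = set(user_images)
    return [(p_k, sample_negative_images(posted, all_images, r, rng))
            for p_k in user_images]
```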

  • Data Sets - Section 5.1

    ● Instagram check-ins from New York City & Chicago
      ○ Both with and without location tags
    ● Selected only:
      ○ locations visited by at least 2 unique users
      ○ users who have checked in at 8 or more distinct locations
    ● Threw out all selfies to reduce noise & improve performance
      ○ Not enough information on POIs or users’ interests towards POIs because the human body/face took up the entire image
      ○ Authors checked “manually” for selfies

  • Experiment

    ● Training Set: For each individual user in the check-in matrix, randomly select x% of all POIs where they have checked in.
    ● Test Set: The rest of the observed user-POI pairs.
    ● Remove images associated with check-ins in the test set to ensure no test data is exposed during training.

  • Performance Comparison - Section 5.2

    ● Matrix-factorization-based POI recommender systems outperform user-oriented collaborative filtering (UCF)

    ● VPOI obtains better performance than baseline methods based on MF, suggesting that incorporating visual content can improve recommendation performance

    ● VPOI is better at handling the noisy images from Instagram than Visual Bayesian Personalized Ranking (VBPR) because:

    ○ VPOI uses the images to learn latent features of users and POIs, whereas VBPR directly uses them as descriptions of locations to predict preference scores

  • Handling Cold-Start Users - Section 5.3

    ● Cold-Start Users: User without check-in history, or who never adds geo-tags to posted photos

  • Cold-Start Performance Comparison

    ● Introducing cold-start users deteriorates performance for all the recommendation systems

    ● Relative to the other systems’ results, VPOI performance degeneration was much smaller because VPOI learns user latent factors for cold-start users

  • Conclusion - Section 6

    ● Used CNN to extract features from visual content (images) and to learn latent user and POI features

    ● VPOI’s experimental results show it outperforms representative state-of-the-art POI recommender systems

    ● Future work:
      ○ Incorporate the original four factors to see if even better performance can be achieved
      ○ Use streaming recommender system techniques because user check-in records are streaming

  • Neat Notes

    3 of Julian’s works were cited by these authors!

  • Room for Improvement

    Dataset was not actually available on the first author’s webpage.

    So many grammatical and spelling errors!

    e.g. section 1: “how to extract useful visual contents from images as we are lack of ground truth of what are contained in the images”

    e.g. table 4 title “perforamnce”

  • Appendix Slides

  • Notation - Section 3

    3 types of objects: users, locations and images

    U = {u1, u2, . . . , un} = the set of users,

    L = {l1, l2, . . . , lm} = the set of locations,

    P = {p1, . . . , pN} = the set of photos,

    where n, m and N are the number of users, POIs and images, respectively.

    R denotes the check-in matrix.

  • Notation - Section 3

    Users can check in at locations.

    X ∈ ℝ^(n×m) denotes the user-POI check-in matrix

    X_ij denotes the check-in frequency or rating of u_i on l_j

    Users can upload images to LBSNs.

    P_ui = the set of images uploaded by u_i

    Users can also choose to add locations to images.

    P_lj denotes the set of images that are tagged l_j

  • Acronyms

    CNN = Convolutional Neural Network

    LBSN = Location-Based Social Network

    MF = Matrix Factorization

    POI = Point of Interest

    UCF = User-oriented Collaborative Filtering

    VBPR = Visual Bayesian Personalized Ranking

    VPOI = Visual Content Enhanced POI Recommendation

  • Neural Factorization Machines for Sparse Predictive Analytics

    Xiangnan He, Tat-Seng Chua. 2017

    Presenter: Chester Holtz

  • Motivation

    • Many predictive tasks involving web applications need to model categorical variables, such as user IDs or demographics.

    • To apply standard machine learning techniques, these categorical predictors are typically converted to a set of binary features via one-hot encoding, making the resultant feature vector highly sparse.

    • To learn from such sparse data effectively, it is crucial to account for the interactions between features.

  • Factorization Machines

    • FM can directly model explicit features
    • Rather than projecting data into a latent vector space, FM projects each feature into the latent space.
    • FM estimates the target by modelling all interactions between each pair of features via factorized interaction parameters:
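    A second-order FM prediction in its standard form, ŷ(x) = w_0 + Σ_i w_i x_i + Σ_{i<j} ⟨v_i, v_j⟩ x_i x_j, can be sketched as follows. The function names are hypothetical; the second function uses the well-known reformulation of the pairwise term that avoids the double loop.

```python
import numpy as np

def fm_predict_naive(x, w0, w, V):
    """FM prediction with an explicit loop over feature pairs.

    x: feature vector (n,), w0: global bias, w: linear weights (n,),
    V: factor matrix (n, k) with one latent vector v_i per feature.
    """
    n = len(x)
    y = w0 + w @ x
    for i in range(n):
        for j in range(i + 1, n):
            y += (V[i] @ V[j]) * x[i] * x[j]
    return float(y)

def fm_predict_fast(x, w0, w, V):
    """Same prediction using 0.5 * sum_f [(sum_i x_i v_if)^2 - sum_i (x_i v_if)^2]."""
    xv = V * x[:, None]                       # row i is x_i * v_i
    pairwise = 0.5 * np.sum(xv.sum(axis=0) ** 2 - (xv ** 2).sum(axis=0))
    return float(w0 + w @ x + pairwise)
```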

  • Factorization Machines Limitations

    • FM is still a linear model
    • FM may not capture higher-order and nonlinear feature interactions present in the data.

  • Neural Factorization Machines

    • Unifies linear FMs and neural networks for sparse data modelling
    • Performs a non-linear transformation on the latent space of the second-order feature interactions while capturing higher-order feature interactions, where f(x) is modeled as a neural network.

  • Bi-Interaction Layer

    • The proposed Bi-Interaction Layer takes a set of dense k-dimensional feature embeddings and maps them to a single k-dimensional vector.
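    Bi-Interaction pooling sums the element-wise products of all pairs of weighted embeddings, f_BI = Σ_{i<j} (x_i v_i) ⊙ (x_j v_j), which can be computed in linear time as ½[(Σ_i x_i v_i)² − Σ_i (x_i v_i)²]. A minimal sketch, with a hypothetical function name:

```python
import numpy as np

def bi_interaction(x, V):
    """Bi-Interaction pooling (illustrative sketch).

    x: feature values (n,), typically sparse with mostly zeros.
    V: embedding matrix (n, k), one k-dimensional embedding per feature.
    Returns a single k-dimensional vector.
    """
    xv = V * x[:, None]                        # weighted embeddings x_i * v_i
    return 0.5 * (xv.sum(axis=0) ** 2 - (xv ** 2).sum(axis=0))
```

    Summing the entries of the returned vector gives the pairwise interaction term of a plain FM, which is one way to see that NFM without hidden layers recovers FM.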

  • Formulation & Properties of NFM

    • Formulation

    • Generalizes FM (pf)
    • Model evaluation complexity (pf):

  • Evaluation

    • Datasets
      • MovieLens
        • Movie tag recommendation
      • Frappe
        • App recommendation
    • Baselines
      • LibFM
      • HOFM
      • Wide & Deep
      • DeepCross

  • Results & Discussion

    • (RQ1)Dropout Improves Generalization, Batch Normalization Speeds up Training

    • (RQ2)The BI-layer has effectively encoded second-order feature interactions - a simple non-linear function is sufficient to capture higher-order interactions.

  • Conclusion and Future Directions

    • Proposed a novel neural network model NFM, which ties linear FM with the representation ability of non-linear neural networks for sparse predictive analytics.

    • Proposed Bi-Interaction operation which provides the model the ability to learn more informative feature interactions at the lower level.

    • Experiments on two real-world datasets show that, with only one hidden layer, NFM outperforms FM, higher-order FM, and state-of-the-art deep learning approaches.

  • Future Directions

    • Improve the efficiency of NFM by applying hashing techniques
    • Study its performance for other IR tasks, such as search ranking and targeted advertising.
    • Extend the objective function with regularizers like the graph Laplacian.
    • Explore Bi-Interaction pooling for recurrent neural networks (RNNs) for sequential data modelling

  • Discussion Questions

    1. How can NFM be combined with field-aware factorization machines to associate several embedding vectors for a feature with different interactions with other features in another field?

    2. The authors sampled 2 negative instances to pair with one positive instance. Why was this able to ensure generalization, and how exactly was it implemented?

    3. For sequential data, does replacing the FNN with an RNN yield better performance?

    4. Dropout intuitions - prevent complex feature co-adaptations

  • Deep Neural Networks for YouTube Recommendations

    Paul Covington, Jay Adams, Emre Sargin

    Presented By: Tushar Bansal

  • Motivation

    Recommend videos to YouTube users based on past activity.

    Scale: Many existing recommendation algorithms proven to work well on small problems fail to operate on Youtube’s scale.

    Freshness: Recommendation system should be responsive enough to model newly uploaded content as well as the latest actions taken by the user.

    Noise: Historical user behavior on YouTube is inherently difficult to predict due to sparsity and a variety of unobservable external factors. Ground truth of user satisfaction is rare.

  • Model Overview

    A combination of two models:

    1. Candidate Generation: The enormous YouTube corpus is winnowed down to hundreds of videos that may be relevant to the user.

    2. Ranking: Rank the candidate videos, based mainly on impression data.

  • Candidate Model

    ● Multi-class classification problem with each video as a class.
    ● Use implicit feedback data instead of explicit feedback.
    ● Training: Softmax function

    v_j = candidate video embedding, u = user embedding

    ● Minimize cross entropy loss
    ● Test: Nearest Neighbor
    ● Fully connected ReLU layers in a tower pattern
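    The training objective and the test-time lookup can be sketched as below: a softmax over dot products u · v_j during training, and a nearest-neighbor search in the same dot-product space at serving time. The function names are hypothetical, and scoring every video exhaustively is only for illustration; at real scale this would need sampled softmax and approximate nearest-neighbor search.

```python
import numpy as np

def softmax_probs(u, V):
    """P(watch = j | user) proportional to exp(v_j . u), over all videos."""
    logits = V @ u
    logits = logits - logits.max()   # subtract the max for numerical stability
    e = np.exp(logits)
    return e / e.sum()

def cross_entropy(u, V, watched_idx):
    """Training loss for one example: -log P(watched video | user)."""
    return float(-np.log(softmax_probs(u, V)[watched_idx]))

def top_n_candidates(u, V, n):
    """Test time: the n videos whose embeddings score highest against u."""
    scores = V @ u
    return np.argsort(-scores)[:n]
```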

  • Candidate Model: Features

    ● User’s watch history represented by a sequence of sparse video IDs mapped to dense vector representations via embeddings.
    ● User’s search queries tokenized into unigrams and bigrams, with each token embedded
    ● Example Age feature: Important to promote fresh content

  • Candidate Model: Features

    ● Fix the number of training samples per user.
    ● Represent search queries as an unordered bag of tokens.
    ● Predict the next watch rather than a held-out watch.
    ● Training examples include watches embedded on other websites too.

  • Candidate Model: Evaluation

    *MAP: Mean average precision

  • Ranking Model

    ● Many more features describing the video and the user-video relationship.
    ● Ranking objective is based on live A/B testing results but is generally a simple function of expected watch time per impression.
    ● Training: Weighted logistic regression
      ○ Positive samples weighted by observed watch time on the video
      ○ Negative samples receive unit weight
    ● Fully connected ReLU layers in a tower pattern
    ● Candidates from other candidate sources

  • Ranking Model: Features

    ● User’s interaction with the video or similar items (e.g. channel)
    ● Past video impressions.
    ● Last search queries, last N videos watched etc.

  • Ranking Model: Evaluation

  • Discussion

    1. YouTube has many user playlists, but the paper does not seem to consider that feature. If we would like to use it, how should the model change?

    2. The paper doesn’t mention explicit feedback from the user, such as channel subscriptions and likes, which may greatly affect candidate generation. Even though explicit feedback is sparse, using it may improve the recommendations. Is it still advisable not to use it?

    3. Recommendations may well be periodic in nature and should vary with the time of day or day of week. Would including a time factor explicitly during the candidate generation phase help?

    4. How can we ensure diversity of results?

    5. Give raw input to the model instead of embeddings?

    6. What is the motivation for using extreme multiclass classification? Why would a logistic loss function or pairwise ranking not work?

  • Thank You.

  • Deep Learning based Large Scale Visual Recommendation and Search for E-commerce

    By Devashish Shankar, Sujay Narumanchi, H A Ananya, Pramod Kompalli, Krishnendu Chaudhury

    - Dhruv Sharma

  • Problems it talks about

    ● Visual Recommendation
      ○ Retrieve a ranked list of catalog images similar to another catalog image
      ○ A Deep Ranking model for recommending items using images
    ● Visual Search
      ○ Retrieve a ranked list of images similar to a “Wild” (user-uploaded) image
      ○ Uses the same deepnet model and NN search
    ● Scalability
      ○ Scaling the solution to:
        ■ 50 M items
        ■ 100K / hr ingestions (add, delete, modify item)
        ■ 2000 queries / second with 100ms latency
    ● Domain: Fashion

  • Catalog and Wild Images

    Catalog Images Wild Images

  • Datasets

    1. Flipkart Fashion Dataset (Evaluation)
    2. Exact Street2Shop (Training)

  • Problem being solved

  • Model Architecture

  • Training Procedure

    ● Triplet of images
    ● Relative similarity using hinge loss
    ● Train using in-class and out-of-class negative triplets
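    A triplet hinge loss on (query, positive, negative) embeddings can be sketched as below. Using squared Euclidean distance and a margin of 1.0 are assumptions for illustration; the slide only states that relative similarity is trained with a hinge loss.

```python
import numpy as np

def triplet_hinge_loss(f_q, f_p, f_n, margin=1.0):
    """Hinge loss on one triplet of embeddings (illustrative sketch).

    f_q, f_p, f_n: embeddings of the query, positive and negative images.
    The loss is zero once the negative is farther from the query than the
    positive by at least `margin`.
    """
    d_pos = np.sum((f_q - f_p) ** 2)   # squared distance query-positive
    d_neg = np.sum((f_q - f_n) ** 2)   # squared distance query-negative
    return float(max(0.0, margin + d_pos - d_neg))
```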

  • Training Data Generation

    ● Use Basic Image Similarity Scorer (BISS) models to approximate similarity
      ○ Each BISS identifies the 1000 closest images to q
    ● Two types of training triplets.
      ○ Catalog Image Triplets
        ■ All images are sampled from catalog images
      ○ Wild Image Triplets
        ■ Query images are sampled from wild images, and positive and negative images are sampled from catalog images

  • Training Data Generation

    ● Catalog Image Triplets
      ○ q is sampled from the set of catalog images
      ○ p is sampled from the set of catalog images with high similarity scores, using the union of the top 200 images from multiple BISS models
      ○ n is sampled from the set of catalog images with low similarity scores, using the rest of the images from the union of BISS models
        ■ In-class negative images: sampled from a union of the top 500-1000 from each BISS
        ■ Out-of-class negative images: sampled from images of the same category outside the top 1000 from each BISS

  • Training Data Generation

    ● Wild Image Triplets
      ○ q is sampled from the cropped wild images from Exact Street-to-Shop
      ○ p is the ground-truth image match in the data set
      ○ n is sampled from the set of catalog images with low similarity scores from Visnet trained on catalog images
        ■ n is sampled from in-class and out-of-class images
    ● Object Localization for wild images (Search)
      ○ Used Faster R-CNN (FRCNN) trained on the Fashionista dataset

  • Production Pipeline (Scalability)

  • Evaluation

  • Limitations & Confusions

    1. The model is non-personalized
    2. Recall reduced for Visnet-FRCNN
    3. Separate Deep Ranking model for each category