Nonnegative Shared Subspace Learning and Its Application to Social Media Retrieval
Presenter: Andy Lim
The Problem
• Rise in popularity of social image and video sharing platforms
• Precision of tag-based media retrieval
• Tags are
  • Noisy
  • Ambiguous
  • Incomplete
  • Subjective
• Lack of constraints
  • Free-text tags (e.g. “djfja;sldfkj”)
Tags: hotdog, chinese, trololol, aidjishi, sandwich, bread
Previous Research (Internal)
• Improving tag relevance
• Sigurbjornsson and Zwol
  • Developed a method of recommending a set of relevant tags based on tag popularity
• Li et al.
  • List all images for a given tag and determine tag relevance from visual similarity
• All are confined to noisy tags within the primary dataset
The Approach
• Internal vs. External
• Leverage external auxiliary sources of information to improve the target tagging system (which is presumably much noisier than the auxiliary source)
• Exploit disparate characteristics of target domain using auxiliary source
• Note: What is the optimal level of joint modeling such that the target domain still benefits from the auxiliary source?
Assumptions
• There is a common underlying subspace shared by the primary and secondary domains
• The primary domain is much noisier than the secondary domains
Nonnegative Matrix Factorization
• X (M x N data matrix) holds N documents, each represented over a vocabulary of M words
• F (M x R nonnegative matrix) represents R basis vectors
• H (R x N nonnegative matrix) contains the coordinates of each document in that basis, so that X ≈ FH
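A minimal sketch of the factorization above, assuming a squared Frobenius-norm objective and the standard Lee-Seung multiplicative updates (the slide does not say which solver is used). The function name `nmf` and the toy data are illustrative only.

```python
import numpy as np

def nmf(X, R, n_iter=200, eps=1e-9, seed=0):
    """Approximate nonnegative X (M x N) as F @ H, with F (M x R) and H (R x N) nonnegative."""
    rng = np.random.default_rng(seed)
    M, N = X.shape
    F = rng.random((M, R)) + eps   # R basis vectors over the M vocabulary words
    H = rng.random((R, N)) + eps   # coordinates of each of the N documents
    for _ in range(n_iter):
        # Multiplicative updates keep every entry nonnegative.
        H *= (F.T @ X) / (F.T @ F @ H + eps)
        F *= (X @ H.T) / (F @ H @ H.T + eps)
    return F, H

# Toy data (assumption): a nonnegative matrix with 6 words and 8 documents.
rng = np.random.default_rng(1)
X = rng.random((6, 2)) @ rng.random((2, 8))
F, H = nmf(X, R=2)
rel_err = np.linalg.norm(X - F @ H) / np.linalg.norm(X)
```

Here `rel_err` is the relative reconstruction error; it does not increase from one iteration to the next under these updates.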
Joint Shared Nonnegative Matrix Factorization (JSNMF)
• Input:
  • X (target domain)
  • Y (auxiliary domain)
  • R1 and R2 (dimensionality of the underlying subspaces of X and Y)
  • K (number of shared basis vectors)
• Output:
  • W (joint shared subspace)
  • U (remaining subspace in target domain)
  • V (remaining subspace in auxiliary domain)
  • H (coordinate matrix for target domain)
  • L (coordinate matrix for auxiliary domain)
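The inputs and outputs above can be sketched as follows. This is a hedged illustration, not the presented algorithm: it assumes the model X ≈ [W U]H and Y ≈ [W V]L, an equally weighted sum of the two squared reconstruction errors, and multiplicative updates derived from that objective. Any weighting between the two domains is omitted, and the function name `jsnmf` is hypothetical.

```python
import numpy as np

def jsnmf(X, Y, R1, R2, K, n_iter=200, eps=1e-9, seed=0):
    """Sketch: X ~ [W U] @ H and Y ~ [W V] @ L, all factors nonnegative.

    W has K columns shared by both domains; U has R1 - K and V has R2 - K columns.
    """
    rng = np.random.default_rng(seed)
    M = X.shape[0]
    assert Y.shape[0] == M and 0 <= K <= min(R1, R2)
    W = rng.random((M, K)) + eps
    U = rng.random((M, R1 - K)) + eps
    V = rng.random((M, R2 - K)) + eps
    H = rng.random((R1, X.shape[1])) + eps
    L = rng.random((R2, Y.shape[1])) + eps
    for _ in range(n_iter):
        A = np.hstack([W, U])            # full basis for the target domain
        B = np.hstack([W, V])            # full basis for the auxiliary domain
        H *= (A.T @ X) / (A.T @ A @ H + eps)
        L *= (B.T @ Y) / (B.T @ B @ L + eps)
        AH, BL = A @ H, B @ L
        Hw, Hu = H[:K], H[K:]            # coordinates on shared / target-only bases
        Lw, Lv = L[:K], L[K:]            # coordinates on shared / auxiliary-only bases
        # W receives evidence from both domains; U and V from one domain each.
        W *= (X @ Hw.T + Y @ Lw.T) / (AH @ Hw.T + BL @ Lw.T + eps)
        U *= (X @ Hu.T) / (AH @ Hu.T + eps)
        V *= (Y @ Lv.T) / (BL @ Lv.T + eps)
    return W, U, V, H, L

# Toy data (assumption): both domains share the same 10-word vocabulary.
rng = np.random.default_rng(0)
X = rng.random((10, 30))
Y = rng.random((10, 20))
W, U, V, H, L = jsnmf(X, Y, R1=4, R2=3, K=2)
```

With K = 0 nothing is shared and the two factorizations are independent; with K = min(R1, R2) at least one domain has no private basis vectors.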
Retrieval using JSNMF
• Input: W, U, H, query sentence SQ, number of images (or videos) to be retrieved N, and the image (or video) dataset
• Output: Return top N retrieved images (or videos)
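One plausible way to realize this retrieval step is sketched below. The slide does not spell out the procedure, so the details are assumptions: the query sentence is turned into a word-count vector, projected onto the target subspace [W U] by nonnegative least squares (again via multiplicative updates), and items are ranked by cosine similarity between the projected query and the columns of H. The names `project_query`, `retrieve`, and `vocab` are hypothetical.

```python
import numpy as np

def project_query(q, A, n_iter=200, eps=1e-9):
    """Find nonnegative h minimizing ||q - A @ h||^2 for a fixed basis A."""
    h = np.ones(A.shape[1])
    AtA, Atq = A.T @ A, A.T @ q
    for _ in range(n_iter):
        h *= Atq / (AtA @ h + eps)
    return h

def retrieve(W, U, H, query_words, vocab, n_top, eps=1e-9):
    """Return indices and scores of the n_top items most similar to the query."""
    A = np.hstack([W, U])
    q = np.zeros(len(vocab))
    for word in query_words:
        if word in vocab:                # vocab maps word -> row index of X
            q[vocab[word]] += 1.0
    h_q = project_query(q, A)
    sims = (h_q @ H) / (np.linalg.norm(h_q) * np.linalg.norm(H, axis=0) + eps)
    order = np.argsort(-sims)[:n_top]
    return order, sims[order]

# Toy example (assumption): 3 words, 3 items, identity-like basis.
vocab = {"cloud": 0, "water": 1, "road": 2}
W = np.array([[1.0], [0.0], [0.0]])
U = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])
H = np.array([[1.0, 0.0, 0.5],
              [0.0, 1.0, 0.5],
              [0.0, 0.0, 0.0]])
order, scores = retrieve(W, U, H, ["cloud"], vocab, n_top=2)
```

In this toy case item 0 is tagged only with "cloud" and item 2 is split between "cloud" and "water", so they are ranked first and second for the query "cloud".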
Experiment
• Use LabelMe tags (auxiliary) to improve
  • Image retrieval in Flickr
  • Video retrieval in YouTube
• Why LabelMe?
  • Object image tagging
  • Controlled vocabulary
Flickr Dataset
• Downloaded 50,000 images from Flickr
• Average number of distinct tags per image = 8
• Removed
  • Rare tags (appearing fewer than 5 times)
  • Images with no tags or non-English tags
• Obtained 20,000 labeled images
• 7,000 examples are kept for investigating internal auxiliary dataset
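The cleanup described on this slide can be sketched as below. The order of the steps and the English check are assumptions: the slide does not say how non-English tags were detected, so `is_english` is a crude ASCII-only stand-in, and `clean_tagged_items` is a hypothetical helper name.

```python
from collections import Counter

def clean_tagged_items(items, min_count=5, is_english=lambda tag: tag.isascii()):
    """Drop non-English tags, then rare tags, then items left with no tags.

    items: dict mapping an item id to its list of tags.
    min_count: a tag must appear on at least this many items to be kept.
    """
    # Step 1: keep only tags that pass the (stand-in) English check.
    filtered = {k: [t for t in tags if is_english(t)] for k, tags in items.items()}
    # Step 2: count on how many items each distinct tag appears.
    counts = Counter(t for tags in filtered.values() for t in set(tags))
    # Step 3: drop rare tags and any item that ends up with no tags.
    cleaned = {}
    for key, tags in filtered.items():
        kept = [t for t in tags if counts[t] >= min_count]
        if kept:
            cleaned[key] = kept
    return cleaned

# Toy example (assumption): threshold lowered to 2 so the effect is visible.
toy = {
    "img_a": ["cloud", "water"],
    "img_b": ["cloud"],
    "img_c": ["caf\u00e9"],
    "img_d": [],
}
cleaned = clean_tagged_items(toy, min_count=2)
```

The same helper would apply to the other datasets by changing `min_count`.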
YouTube Dataset
• Downloaded metadata for 18,000 videos (tags, URL, category, title, comments, etc.)
• Average number of distinct tags per video = 7
• Removed
  • Rare tags (appearing fewer than 2 times)
  • Videos with no tags or non-English tags
• Obtained dataset corresponding to 12,000 videos
• Again, kept 7,000 examples to be used as an internal auxiliary dataset
LabelMe Dataset
• Added 7,000 images with tags from LabelMe
• Average number of distinct tags per image = 32
• Removed
  • Rare tags (appearing fewer than 2 times)
• Cleanup does not reduce the size of the dataset
Evaluation Measures
• Defined query set Q
  • {cloud, man, street, water, road, leg, table, plant, girl, drawer, lamp, bed, cable, bus, pole, laptop, plate, kitchen, river, pool, flower}
• Manually annotated the two datasets (Flickr and YouTube) with respect to the query set (no benchmark dataset available)
• A query term and an image (or video) are considered relevant if the concept is clearly visible in the image (or video)
Results with JSNMF
• Precision-Scope Curve
• Fix recall at 0.1
  • Users are usually only interested in the first few results
• 10% improvement
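For reference, the two measures named on this slide can be computed as in the sketch below. It assumes the usual definitions: precision at scope k is the fraction of the top k results that are relevant, and precision at a fixed recall is the precision at the first rank where that recall is reached. The function names and the toy ranking are illustrative.

```python
def precision_at_scope(ranked_ids, relevant_ids, scope):
    """Fraction of the top `scope` retrieved items that are relevant."""
    top = ranked_ids[:scope]
    if not top:
        return 0.0
    return sum(1 for i in top if i in relevant_ids) / len(top)

def precision_at_recall(ranked_ids, relevant_ids, recall_level):
    """Precision at the first rank where recall reaches `recall_level`."""
    if not relevant_ids:
        return 0.0
    hits = 0
    for rank, item in enumerate(ranked_ids, start=1):
        if item in relevant_ids:
            hits += 1
            if hits / len(relevant_ids) >= recall_level:
                return hits / rank
    return 0.0  # the requested recall is never reached

# Toy ranking (assumption): 5 retrieved items, 3 of them relevant.
ranked = [3, 7, 1, 9, 4]
relevant = {3, 1, 4}
curve = [precision_at_scope(ranked, relevant, k) for k in (1, 2, 5)]
# curve == [1.0, 0.5, 0.6]
p_low_recall = precision_at_recall(ranked, relevant, 0.1)
# p_low_recall == 1.0
```

Evaluating `precision_at_scope` over a range of scopes gives one precision-scope curve per query; averaging over the query set gives a single curve per method.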
Results with JSNMF
• Under-representation
  • Shares very few basis vectors
• Over-representation
  • Forces many basis vectors to represent both datasets
• Appropriate level of representation
Flickr Retrieval Results
• Results are better with LabelMe
• As recall increases, precision decreases
• When K=0 (no sharing) or K=40 (full sharing), precision is lower than with K=15