
Multi-Task Deep Visual-Semantic Embedding for Video Thumbnail Selection

Wu Liu¹, Tao Mei², Yongdong Zhang¹, Cherry Che², Jiebo Luo³
¹Institute of Computing Technology, Chinese Academy of Sciences  ²Microsoft Research  ³University of Rochester

Given the tremendous growth of online videos, the video thumbnail, as the common visualization form of video content, is becoming increasingly important in shaping users' browsing and searching experience. In this paper, we study the problem of video thumbnail selection with side semantic information, which has been overlooked in previous research. We investigate how to embed this side semantic information (e.g., title, description, query, and transcript) together with the visual content to select semantically meaningful thumbnails. One such example is shown in Figure 1. Compared with the thumbnails selected by a conventional method in (b), the thumbnails selected by our proposed method in (c) not only represent the video content well, but also help video browsers and searchers quickly find the videos they are interested in.

To this end, we propose a multi-task deep visual-semantic embedding method that serves as a bridge between the diverse side semantic information and the visual content. The embedding model has a wide variety of real-world applications, such as video thumbnail selection, online video summarization, video tag localization, and video captioning. Our main idea is to learn a deep visual-semantic embedding model that directly maps the two views (textual and visual) into a latent semantic embedding space, where the relevance between the two otherwise incomparable views can be computed through their projections. Different from existing work [1], we employ large-scale click-through video and image data to learn a robust embedding model, and we close the domain gap between video and image with a multi-task learning strategy. We demonstrate the effectiveness of this method on the query-dependent thumbnail selection task. To the best of our knowledge, this paper represents one of the first attempts at visual-semantic embedding for selecting video thumbnails.
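To make this idea concrete, the following is a minimal NumPy sketch of how the relevance between the two otherwise incomparable views can be computed once both are brought into a common space. The dimensions, random values, and the name M are purely illustrative, not the paper's implementation.

```python
import numpy as np

# Illustrative dimensions: a 300-d GloVe space for text, 4096-d CNN features for thumbnails.
d_text, d_visual = 300, 4096

rng = np.random.default_rng(0)
t = rng.random(d_text)               # embedded side semantic information (e.g., the query)
v = rng.random(d_visual)             # CNN feature of one candidate thumbnail
M = rng.random((d_text, d_visual))   # learned projection into the shared semantic space

relevance = t @ (M @ v)              # S(t, v): dot-product similarity in the latent space
```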

The structure of the deep visual-semantic embedding model is as follows. For the input textual query, the model directly employs the "GloVe" word embedding [5] to map the query into the latent semantic space. The "GloVe" model is pre-trained on a corpus of 840 billion tokens of web data and is used here for its strong overall performance. For the input candidate thumbnails, we leverage a deep convolutional neural network (CNN) by adapting the released C++ implementation of [3]. The original CNN consists of two parts: 1) the input layers, five convolution layers, and max-pooling layers, and 2) two fully connected layers "FC1" and "FC2" followed by the output layers. To map the thumbnails into the latent semantic space, we change the output to the semantic vector representation of the query related to the thumbnail; accordingly, the softmax prediction layer is replaced by a projection layer M. As the model's performance highly depends on large-scale training data, we train the deep visual-semantic embedding model on a click-through-based video and image dataset to exploit the relevance between a query and the clicked thumbnail or image. Compared with manually labeled data, such click-through data are large-scale, freely accessible, and more useful for understanding the relevance between queries and visual content.
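The Python sketch below outlines the two branches of such a model, assuming an average of pre-trained GloVe word vectors as the query representation and a generic `cnn_feature_fn` standing in for the FC2 activations of the adapted CNN; the class and parameter names are illustrative and not the authors' C++ implementation.

```python
import numpy as np

class VisualSemanticEmbedding:
    """Illustrative sketch of the two branches of the embedding model, with the
    CNN softmax layer replaced by a linear projection layer M."""

    def __init__(self, cnn_feature_fn, M):
        self.cnn_feature_fn = cnn_feature_fn   # maps an image to its FC2 activations
        self.M = M                             # (glove_dim, feat_dim) projection layer

    def embed_query(self, query_tokens, glove):
        # Map the textual query into the semantic space with pre-trained GloVe
        # vectors; averaging the word vectors is an assumed pooling choice.
        vecs = [glove[w] for w in query_tokens if w in glove]
        return np.mean(vecs, axis=0)

    def embed_thumbnail(self, image):
        # CNN feature of a candidate thumbnail, projected by M into the same
        # latent semantic space as the query.
        return self.M @ self.cnn_feature_fn(image)
```

Here `glove` would be a lookup table of the pre-trained 840B-token GloVe vectors, and `M` is the projection layer trained on the click-through data as described next.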

However, directly training the model on the fused dataset neglects the gap between images and videos. To solve this problem, we adopt a multi-task learning strategy, which refers to the joint training of multiple tasks while enforcing a common intermediate parameterization or representation [4] to improve each individual task's performance. Following the multi-task learning setting with K learning tasks (K = 2 in our setting), we redefine the goal of learning the deep visual-semantic embedding model as Equation (1):

$$\min_{M_k}\;\; \tau_0\,\lVert M_0 - I\rVert_F^2 \;+\; \sum_{k=1}^{2}\Big\{\tau_k\,\lVert \Delta M_k\rVert_F^2 + \max\big[0,\;\gamma - S(t_k^{+},v) + S(t_k^{-},v)\big]\Big\} \qquad (1)$$

where M0 and Mk denote the projection layers in the CNN: M0 picks up the general trends shared across the datasets, while Mk = M0 + ΔMk specializes to each particular task.

This is an extended abstract. The full paper is available at the Computer Vision Foundation webpage.

Figure 1: Examples of query-dependent thumbnails. (a) The video and its side semantic information (title: "How to replace an AC compressor in the car"; query: "AC compressor replacement videos"); (b) thumbnails selected by the visual-representativeness-based method; (c) query-dependent thumbnails selected by multi-task deep visual-semantic embedding. Compared with (b), the thumbnails in (c) are more representative and semantically meaningful.

The minimization of ∥M0 − I∥_F² and ∥ΔMk∥_F² ensures that the learning algorithm does not put too much emphasis on the shared or the individual data. The minimization of max[0, γ − S(t_k^+, v) + S(t_k^-, v)], with S(t_k, v) = t_k^T M_k v, ensures that the model is trained to produce a higher dot-product similarity for a clicked query-thumbnail/image pair (t_k^+, v) than for a non-clicked pair (t_k^-, v), by at least a margin γ. The τk ≥ 0, k = 0, 1, 2, are trade-off parameters that control the regularization of Mk. The training of the multi-task deep visual-semantic embedding model consists of two stages: we first train M0 on the combined dataset, and then fine-tune M1 for the specific query-dependent thumbnail selection task on the click-through video dataset. Consequently, the learned embedding model avoids overfitting on the click-through video dataset while adequately exploiting the query-thumbnail relationships in both the image and video datasets.
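A minimal NumPy sketch of the objective in Equation (1), evaluated for a single clicked/non-clicked triplet per task, is given below; the variable names, the margin value, and the one-triplet simplification are assumptions of the sketch, not the paper's training code.

```python
import numpy as np

def multitask_embedding_loss(M0, dMs, triplets, taus, gamma=0.1):
    """Objective of Eq. (1) for one training triplet per task (illustrative).

    M0       : (d_t, d_v) shared projection, regularized toward the identity
    dMs      : list of per-task offsets dM_k, so that M_k = M0 + dM_k
    triplets : per task, (t_pos, t_neg, v) with the clicked query, a non-clicked
               query, and the visual feature of the thumbnail/image
    taus     : (tau_0, tau_1, tau_2) trade-off weights
    gamma    : ranking margin (an assumed value)
    """
    eye = np.eye(M0.shape[0], M0.shape[1])
    loss = taus[0] * np.sum((M0 - eye) ** 2)        # tau_0 * ||M0 - I||_F^2
    for k, dM in enumerate(dMs):
        Mk = M0 + dM                                # task-specific projection M_k
        t_pos, t_neg, v = triplets[k]
        s_pos = t_pos @ Mk @ v                      # S(t_k^+, v)
        s_neg = t_neg @ Mk @ v                      # S(t_k^-, v)
        loss += taus[k + 1] * np.sum(dM ** 2)       # tau_k * ||dM_k||_F^2
        loss += max(0.0, gamma - s_pos + s_neg)     # ranking hinge loss
    return loss
```

In the two-stage training described above, M0 would first be fit on the combined data before ΔM1 is fine-tuned on the click-through video data.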

Finally, the learned multi-task deep visual-semantic embedding model is employed to select query-dependent thumbnails. First, we extract eight different video representativeness attributes [2] to select the 20 most visually representative keyframes as candidate thumbnails. Then, we use the trained embedding model to map the side semantic information (i.e., query and title) and the visual content into the latent embedding space and calculate their relevance. Next, the visual representativeness and query relevance scores are fused to select the final thumbnails. Experiments on a collection of 1,000 query-video pairs labeled by 191 workers on Amazon Mechanical Turk show that 74.83% of the thumbnails selected by our approach achieve user satisfaction, which is nearly 6% higher than the baseline.
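The fusion step could look like the sketch below, where min-max normalization and a weighted linear combination (weight `alpha`) are assumptions of the sketch; the paper states only that the representativeness and relevance scores are fused.

```python
import numpy as np

def select_thumbnails(cand_feats, rep_scores, text_vec, M1, alpha=0.5, top_n=1):
    """Select thumbnails by fusing visual representativeness with query relevance.

    cand_feats : (N, d_v) CNN features of the ~20 candidate keyframes
    rep_scores : (N,)     visual representativeness scores of the candidates
    text_vec   : (d_t,)   embedded side semantic information (query and title)
    M1         : (d_t, d_v) task-specific projection for thumbnail selection
    """
    rel_scores = cand_feats @ M1.T @ text_vec        # S(t, v) for every candidate

    def normalize(s):                                # rescale both scores to [0, 1]
        return (s - s.min()) / (s.max() - s.min() + 1e-8)

    fused = alpha * normalize(rep_scores) + (1 - alpha) * normalize(rel_scores)
    return np.argsort(-fused)[:top_n]                # indices of the selected thumbnails
```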

[1] A. Frome, G. S. Corrado, J. Shlens, S. Bengio, J. Dean, M. Ranzato, and T. Mikolov. DeViSE: A deep visual-semantic embedding model. In NIPS, pages 2121–2129, 2013.

[2] H. W. Kang and X. S. Hua. To learn representativeness of video frames. In ACM Multimedia, pages 423–426, 2005.

[3] A. Krizhevsky, I. Sutskever, and G. E. Hinton. ImageNet classification with deep convolutional neural networks. In NIPS, pages 1106–1114, 2012.

[4] S. Parameswaran and K. Q. Weinberger. Large margin multi-task metric learning. In NIPS, pages 1867–1875, 2010.

[5] J. Pennington, R. Socher, and C. D. Manning. GloVe: Global vectors for word representation. In EMNLP, pages 1532–1543, 2014.

The dataset is released at http://mcg.ict.ac.cn/mcg-vts.html as a benchmark for video thumbnail selection in this community.
