Video Scene-Aware Dialog Track in DSTC7

Chiori Hori∗, Tim K. Marks∗, Devi Parikh∗∗, and Dhruv Batra∗∗

∗Mitsubishi Electric Research Laboratories, Cambridge, MA, USA
{chori, tmarks}@merl.com

∗∗School of Interactive Computing, Georgia Tech
{parikh, batra}@gatech.edu

Abstract

Dialog systems need to understand scenes in order to have conversations with users about the objects and events around them. Scene-aware dialog systems could be developed by integrating state-of-the-art technologies from multiple research areas, including: end-to-end dialog technologies, which generate system responses using models trained from dialog data; visual question answering (VQA) technologies, which answer questions about images using learned image features; and video description technologies, in which videos are described/narrated using multimodal information. This article proposes a video scene-aware dialog track for the 7th Dialog System Technology Challenges (DSTC7) workshop. The task is to generate system responses in a dialog about an input video.

1 Introduction

Figure 1: A sample of VQA [1].

Recently, spoken dialog technologies have been applied in real-world man-machine interfaces including smart phone digital assistants, car navigation, voice-controlled speakers, and human-facing robots. In these applications, however, all conversation is triggered by user speech input, and the contents of system responses are limited by the training data (sets of dialogs). Current dialog systems cannot understand scenes using multimodal sensor-based input such as vision and non-speech audio, so machines using such dialog systems cannot have a conversation about what's going on in their surroundings. To develop machines that can carry on a conversation about objects and events taking place around users, scene-aware dialog technology is essential.

To interact with humans about visual information, Visual Question Answering (VQA) has been intensively researched in the field of computer vision [1, 2, 3]. The goal of VQA is to generate answers to questions about an imaged scene, using the information present in a static image. Example questions from a popular VQA dataset are shown in Figure 1.

Figure 2: The VQA model of [1], which encodes questions using a two-layer LSTM and encodes images using the last fully connected layer of VGGNet.

Figure 2 shows a recent VQA system [1] that encodes the questions using a two-layer long short-term memory (LSTM) network and encodes the images using features from VGGNet [4]. Both the question features and the image features are transformed to a common space and fused via element-wise multiplication. The result is then passed through a fully connected layer followed by a softmax layer to obtain a distribution over answers.
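As a rough sketch of this kind of architecture, the following PyTorch-style code illustrates the fusion scheme described above. The layer sizes, nonlinearities, and all names are illustrative assumptions, not the exact implementation of [1].

```python
import torch
import torch.nn as nn

class SimpleVQA(nn.Module):
    """Late-fusion VQA model in the spirit of [1]: an LSTM question encoder
    and precomputed CNN image features are projected into a common space,
    fused by element-wise multiplication, and classified over a fixed
    answer vocabulary. All dimensions here are illustrative."""

    def __init__(self, vocab_size, num_answers, embed_dim=300,
                 hidden_dim=512, img_feat_dim=4096, common_dim=1024):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, num_layers=2,
                            batch_first=True)              # two-layer question encoder
        self.q_proj = nn.Linear(hidden_dim, common_dim)    # question -> common space
        self.v_proj = nn.Linear(img_feat_dim, common_dim)  # CNN features -> common space
        self.classifier = nn.Linear(common_dim, num_answers)

    def forward(self, question_ids, image_feats):
        # question_ids: (batch, seq_len) word indices
        # image_feats:  (batch, img_feat_dim), e.g., VGGNet fully connected features
        _, (h, _) = self.lstm(self.embed(question_ids))
        q = torch.tanh(self.q_proj(h[-1]))       # final hidden state of the top LSTM layer
        v = torch.tanh(self.v_proj(image_feats))
        fused = q * v                            # element-wise multiplication
        return self.classifier(fused)            # logits; softmax is applied in the loss
```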

In the proposed track of DSTC7, we extend the goal of VQA in two ways. First, we extend the interaction from simple single-turn question answering to multi-turn dialogs, in which the utterances in each turn may reference information from previous turns of the dialog. (See Figure 3 for an example.) Second, we extend the subject of the interaction from unimodal static images to multimodal videos, where input features could come from multiple domains including image features, motion features, non-speech audio, and speech audio. There is recent work on each of these two extensions individually, as described below. However, we are not aware of any existing research that combines the two extensions, as this challenge track proposes.

1.1 Extension from VQA to Visual Dialog

Figure 3: A sample of Visual Dialog [5]. The task of Visual Dialog requires an AI agent to hold a meaningful dialog with humans in natural, conversational language about visual content.

While VQA takes a significant step towards human-machine interaction, it still represents only a single round of a dialog. Unlike in human conversations, there is no scope for follow-up questions, no memory in the system of previous questions asked by the user, and no consistency with respect to previous answers provided by the system.

As a step towards conversational visual AI, we introduced [5] the new task of visual dialog, which requires an AI agent to hold a meaningful dialog with humans in natural, conversational language about visual content [6], as shown in the example in Figure 3. Specifically, given an image I, a history of a dialog consisting of a sequence of question-answer pairs, and a natural language follow-up question, the task for the machine is to answer the question in free-form natural language.
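For concreteness, a single example in this setting can be thought of as the following kind of record. This is a minimal illustrative sketch; the field names and values are hypothetical and do not reproduce the released dataset format.

```python
# Hypothetical structure of one Visual Dialog example; field names and
# values are illustrative only, not the official dataset schema.
example = {
    "image": "path/to/image.jpg",                 # the grounding image I
    "caption": "a mug of coffee on a wooden desk",
    "history": [                                  # earlier question-answer pairs
        ("is the mug full?", "yes, almost to the top"),
        ("what color is it?", "white with a blue logo"),
    ],
    "question": "where is it located in the image?",  # follow-up question to answer
    "answer": "on the right side of the desk",        # free-form ground-truth answer
}
```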

This task is the visual analogue of the Turing Test. Visual dialog is disentangled enough from a specific downstream task so as to serve as a general test of machine intelligence, while being sufficiently grounded in vision to allow objective evaluation of individual responses and benchmark progress. In order to be successful, a visual dialog agent must possess a host of multimodal AI capabilities: the ability to infer context from dialog history and resolve co-references (what does 'it' refer to?), the ability to ground the question in the image (where is 'it', the mug, located in the image?), and the ability to be consistent in its responses over time.

1.2 Extension from Image Description to Video Description

Figure 4: An encoder-decoder based sentence generator with attention-based multimodal fusion for video description.

To understand scenes for dialogs, existing VQA technologies could be combined with technologies for automatic video description, also known as video captioning. Video description systems generate natural language descriptions for scenes, such as outputting a sentence that summarizes an input video.

Some video description systems are extensions of systems designed for image description, a.k.a. image captioning, in which the input is a single static image, and the output is a natural-language description such as a sentence. Recent work on RNN-based image captioning includes [7, 8]. To improve image captioning performance, [9] added an attention mechanism to enable focusing on specific parts of the image when generating each word of the description.

In addition to being used for image description, encoder-decoder networks have also been applied to the task of video description [10]. Recently, we [11] introduced a multimodal attention mechanism that selectively attends to different input modalities (feature types), in addition to different times in the input video, to improve the performance of encoder-decoder-based video description (see Figure 4). Table 1 shows sample results of video descriptions from [11].
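A compact sketch of this idea is given below (PyTorch-style). The single-step formulation, dimensions, and parameterization are simplified assumptions for illustration and do not reproduce the exact model of [11].

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultimodalAttentionFusion(nn.Module):
    """Simplified attention-based multimodal fusion: at each decoder step,
    temporal attention summarizes each modality's feature sequence into a
    context vector, and a second attention weights the modalities before
    they condition the word decoder. Dimensions are illustrative."""

    def __init__(self, feat_dims, dec_dim=512, att_dim=256):
        super().__init__()
        self.proj = nn.ModuleList([nn.Linear(d, att_dim) for d in feat_dims])
        self.dec_proj = nn.Linear(dec_dim, att_dim)
        self.time_score = nn.ModuleList([nn.Linear(att_dim, 1) for _ in feat_dims])
        self.mod_score = nn.ModuleList([nn.Linear(att_dim, 1) for _ in feat_dims])

    def forward(self, decoder_state, feats):
        # decoder_state: (batch, dec_dim); feats: list of (batch, T_k, D_k), one per modality
        s = self.dec_proj(decoder_state).unsqueeze(1)                # (batch, 1, att_dim)
        contexts, scores = [], []
        for k, x in enumerate(feats):
            h = self.proj[k](x)                                      # (batch, T_k, att_dim)
            alpha = F.softmax(self.time_score[k](torch.tanh(h + s)), dim=1)  # temporal attention
            c = (alpha * h).sum(dim=1)                               # per-modality context vector
            contexts.append(c)
            scores.append(self.mod_score[k](torch.tanh(c + s.squeeze(1))))
        beta = F.softmax(torch.cat(scores, dim=1), dim=1)            # attention over modalities
        return (beta.unsqueeze(-1) * torch.stack(contexts, dim=1)).sum(dim=1)
```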

Table 1: Sample video description results from [11] on the YouTube2Text dataset. The first row of descriptions was generated by a unimodal system with only image features (VGG-16) and temporal attention. "V", "A", and "AV" respectively denote results from using visual features alone, audio features alone, and a combination of visual and audio features.

                         Video 1                          Video 2                      Video 3
Unimodal (VGG-16)        a monkey is running              a man is slicing a potato    a man is singing
Naïve Fusion (V)         a dog is playing                 a woman is cutting an onion  a man is singing
Naïve Fusion (AV)        a monkey is running              a woman is peeling an onion  a man is playing a guitar
Attentional Fusion (V)   a monkey is pulling a dogs tail  a man is slicing a potato    a man is playing a guitar
Attentional Fusion (AV)  a monkey is playing              a woman is peeling an onion  a man is playing a violin

Discussion: For the first video, Attentional Fusion (V) (i.e., multimodal attention on visual features) worked best. For the second video, our inclusion of audio features enabled the "peeling" action to be identified. For the third video, both audio features and multimodal attention are needed to identify "violin".

Due to significant opportunities for new breakthroughs at the intersection of dialog and computer vision technologies, we are proposing a challenge track for the 7th Dialog System Technology Challenges (DSTC7) workshop.

The focus of this new challenge track is to train end-to-end conversation models from human-to-human conversations about video scenes. Challenge participants will train end-to-end dialog models using paired data comprising short videos (with audio) and the text of human-to-human dialogs about each video. The goal of the system is to generate natural and informative sentences in response to a user's questions or comments, given the video and the previous lines of the dialog.

2 Tasks

In the proposed challenge track, a system must generate sentence(s) in response to a user input in a given dialog context, which consists of the dialog history (previous utterances by both user and system) in addition to the video and audio information that comprises the scene. The quality of automatically generated sentences is evaluated using objective measures to determine whether or not the generated sentences are natural and informative. This track consists of two tasks:

Task 1: All or part of the provided training data may be used to train conversation models. External data may not be used for training, except that publicly available pre-trained feature extraction models (e.g., VGGNet) are permitted.

Task 2: In addition to the training data provided, any additional publicly available data (such as publicly available text, image, and video datasets and pretrained models) may be used as external knowledge to train the system. However, the training data should not overlap with the validation and test data provided by the organizers.

Challenge participants may choose to submit entries for Task 1, Task 2, or both. The training data and a baseline system will be released to all participants of DSTC7.

3 Data collection

3.1 Video Scene-Aware Dialog Data Collection

We will collect dialog data using text-based conversations about short videos from existing video description datasets, such as CHARADES (http://allenai.org/plato/charades/) and Kinetics (https://deepmind.com/research/open-source/open-source-datasets/kinetics/). The data collection paradigm will be similar to the one we used in [5], in which for each image, two different Mechanical Turk workers interacted via a text interface to yield a dialog. In [5], each dialog consisted of a sequence of questions and answers about an image (see Figure 3).

3.2 Objective evaluation

The quality of the automatically generated sentences will be evaluated with objective measures that quantify the similarity between the generated sentences and ground truth sentences. For the challenge track, we will use nlg-eval (https://github.com/Maluuba/nlg-eval), a publicly available tool supporting various unsupervised automated metrics for natural language generation, for objective evaluation of system outputs. The supported metrics include word-overlap-based metrics such as BLEU, METEOR, ROUGE_L, and CIDEr, and embedding-based metrics such as SkipThoughts Cosine Similarity, Embedding Average Cosine Similarity, Vector Extrema Cosine Similarity, and Greedy Matching Score. Details of these metrics are described in [12].
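As a rough usage sketch, a system response could be scored against one or more reference responses as shown below. This assumes the Python entry point documented in the nlg-eval repository; the exact API should be checked against the current version of the tool.

```python
from nlgeval import NLGEval  # provided by https://github.com/Maluuba/nlg-eval

# Instantiate once so the embedding models are loaded a single time.
nlgeval = NLGEval()

references = ["a woman is peeling an onion",      # one or more ground-truth responses
              "someone peels an onion in a kitchen"]
hypothesis = "a woman is cutting an onion"        # system-generated response

# Returns a dict mapping metric names (BLEU, METEOR, ROUGE_L, CIDEr, and the
# embedding-based metrics listed above) to their scores.
scores = nlgeval.compute_individual_metrics(references, hypothesis)
print(scores)
```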

4 Summary

This article described the proposed Video Scene-Aware Dialog track for the 7th Dialog System Technology Challenges (DSTC7) workshop. The information provided to participants will include a detailed description of the baseline system, instructions for submitting results for evaluation, and details of the evaluation scheme.

References

[1] Stanislaw Antol, Aishwarya Agrawal, Jiasen Lu, Margaret Mitchell, Dhruv Batra, C. Lawrence Zitnick, and Devi Parikh, "VQA: Visual Question Answering," in International Conference on Computer Vision (ICCV), 2015.

[2] Peng Zhang, Yash Goyal, Douglas Summers-Stay, Dhruv Batra, and Devi Parikh, "Yin and Yang: Balancing and answering binary visual questions," in Conference on Computer Vision and Pattern Recognition (CVPR), 2016.

[3] Yash Goyal, Tejas Khot, Douglas Summers-Stay, Dhruv Batra, and Devi Parikh, "Making the V in VQA matter: Elevating the role of image understanding in Visual Question Answering," in Conference on Computer Vision and Pattern Recognition (CVPR), 2017.

[4] Karen Simonyan and Andrew Zisserman, "Very deep convolutional networks for large-scale image recognition," CoRR, vol. abs/1409.1556, 2014.

[5] Abhishek Das, Satwik Kottur, Khushi Gupta, Avi Singh, Deshraj Yadav, José M. F. Moura, Devi Parikh, and Dhruv Batra, "Visual dialog," CoRR, vol. abs/1611.08669, 2016.

[6] Abhishek Das, Satwik Kottur, José M. F. Moura, Stefan Lee, and Dhruv Batra, "Learning cooperative visual dialog agents with deep reinforcement learning," in International Conference on Computer Vision (ICCV), 2017.

[7] Junhua Mao, Wei Xu, Yi Yang, Jiang Wang, and Alan L. Yuille, "Deep captioning with multimodal recurrent neural networks (m-RNN)," CoRR, vol. abs/1412.6632, 2014.

[8] Oriol Vinyals, Alexander Toshev, Samy Bengio, and Dumitru Erhan, "Show and tell: A neural image caption generator," in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Boston, MA, USA, June 7-12, 2015, pp. 3156–3164.

[9] Kelvin Xu, Jimmy Ba, Ryan Kiros, Kyunghyun Cho, Aaron C. Courville, Ruslan Salakhutdinov, Richard S. Zemel, and Yoshua Bengio, "Show, attend and tell: Neural image caption generation with visual attention," in Proceedings of the 32nd International Conference on Machine Learning (ICML), Lille, France, 6-11 July 2015, pp. 2048–2057.

[10] Subhashini Venugopalan, Huijuan Xu, Jeff Donahue, Marcus Rohrbach, Raymond J. Mooney, and Kate Saenko, "Translating videos to natural language using deep recurrent neural networks," in NAACL HLT 2015, The 2015 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Denver, Colorado, USA, May 31 - June 5, 2015, pp. 1494–1504.

[11] Chiori Hori, Takaaki Hori, Teng-Yok Lee, Ziming Zhang, Bret Harsham, John R. Hershey, Tim K. Marks, and Kazuhiko Sumi, "Attention-based multimodal fusion for video description," in IEEE International Conference on Computer Vision (ICCV), Oct 2017.

[12] Shikhar Sharma, Layla El Asri, Hannes Schulz, and Jeremie Zumer, "Relevance of unsupervised metrics in task-oriented dialogue for evaluating natural language generation," CoRR, vol. abs/1706.09799, 2017.
