
Premise-based Multimodal Reasoning: A Human-like Cognitive Process

Qingxiu Dong1, Ziwei Qin1, Heming Xia1,2, Tian Feng1,2, Shoujie Tong1,2, Haoran Meng1,2, Lin Xu1,2, Tianyu Liu1, Zuifang Sui1, Weidong Zhan1,3, Baobao Chang1, Sujian Li1 and Zhongyu Wei4

1 The MOE Key Laboratory of Computational Linguistics, Peking University
2 School of Software & Microelectronics, Peking University
3 Department of Chinese Language and Literature, Peking University
4 School of Data Science, Fudan University

[email protected], {xiaheming}@pku.edu.cn, {zwqin}@stu.pku.edu.cn

1 Introduction

Reasoning is one of the major challenges of human-like AI (Lake et al., 2016) and has recently attracted intensive attention from natural language processing (NLP) researchers (Dagan et al., 2006; MacCartney and Manning, 2007; Bowman and Zhu, 2019). Cross-modal reasoning, however, remains underexplored.

For cross-modal reasoning, we observe that most methods fall into shallow feature matching without in-depth, human-like reasoning. For example, Fukui et al. (2016), Yang et al. (2015) and Baltrušaitis et al. (2017) have shown that a number of cross-modal reasoning tasks can be solved simply by concatenating image representations and text representations. The reason is that existing cross-modal tasks ask questions directly about an image. Human reasoning in real scenes, however, is often performed under specific background information, a process studied by the ABC theory (see Section 3) in social psychology.

We propose a shared task named "Premise-based Multimodal Reasoning" (PMR), which requires participating models to reason after establishing a profound understanding of background information. We believe the proposed PMR would contribute to and help shed light on human-like, in-depth reasoning.

We summarize the significance of this task as follows:

- We introduce a premise into cross-modal reasoning, so that image-text interaction can be specified by the given premise; this may fit real-world applications such as vision-enabled smart home devices.

- Different from existing cross-modal reasoning tasks, PMR's negative answers are not irrelevant to the image but are possible understandings of it. PMR is therefore more challenging and can further push forward the fusion of image and text information in multimodal reasoning.

- We apply the ABC Theory to task design by introducing the premise as the prior belief in the rational-emotive process. The proposed task paves the way for cognitive cross-modal reasoning.

2 Task Description

We describe our PMR task with the example in Figure 1 (panels I and III). Given an image and a premise, we expect the machine to understand the image in combination with the premise so as to choose the single correct answer among four hypothetical actions (see Appendix A). The premise serves as background knowledge or domain-specific common sense for the given image. In this way, we expect reasoning models to make the correct choice by considering the interaction between the premise and the image, similar to the human cognitive process. In the multiple-choice setting, all four hypothetical actions describe what will happen next, but only one of them is correct under the specific premise.
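To make the setting concrete, the sketch below shows one way a PMR instance and the accuracy metric could be represented in code. The class and field names (PMRInstance, image_id, etc.) are our own illustrative assumptions, not part of an official release.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class PMRInstance:
    """One PMR example: an image, a premise, and four hypothetical actions."""
    image_id: int        # index into the released image/annotation collection
    premise: str         # e.g. "[person2] is the mother of [person5]."
    actions: List[str]   # four hypothetical actions; exactly one is correct
    answer: int          # index (0-3) of the correct action

def accuracy(instances, predict):
    """Accuracy, the task's main metric: the fraction of instances for which
    `predict` (a function from a PMRInstance to an action index) picks the
    correct action."""
    return sum(predict(ex) == ex.answer for ex in instances) / len(instances)
```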

Due to its cross-modal nature and its resemblance to human cognition, PMR should interest researchers working on cross-modal reasoning and understanding. The task would push forward the study of fusion between different modalities and of human-like cognitive reasoning, which is both a challenge and an opportunity for future research. Moreover, we will provide detailed annotation information to reduce the hassle of image processing, which makes the task friendly to NLP researchers.



[Figure 1 here. Panel I: the source image with object boxes. Panel II: eligible hypothetical actions without a premise, e.g. "[person4] is going to say how cute [person2]'s children are.", "[person3] comforts [person2] softly.", "[person3] dropped the photo jealously and angrily.", "[person4] sighs and says, “My children's parents are too busy at work.”" Panel III: Premise: "[person2] is the mother of [person5]." Question: "Please infer what will happen next and choose the hypothetical action from the following options." A. [person4] is going to say how cute [person2]'s children are. B. [person5] is going to say how cute [person2]'s children are. C. [person4] sighs and says, “The children's parents are too busy at work.” D. [person1] sighs and says, “The children's parents are too busy at work.” Answer: A.]

Figure 1: We demonstrate a source image with object boxes (e.g., '[person1]', '[person2]') in I. II shows a variety of eligible hypothetical actions for source image I without the specification of a premise. An instance of PMR is shown in III, which requires reasoning models to answer the question according to the source image and the given premise.

3 Related Work

The ABC Theory The design of PMR is inspired by the ABC Theory (Ellis, 1995) in social psychology. The theory is a widely accepted framework for how one's behavioral patterns are created: it asserts that human behavior does not stem directly from events but from the interpretations we make of them. We apply the ABC theory to the design of PMR by formulating the given premise as the prior belief, as shown in Figure 2.

Figure 2: Connections between the ABC theory and PMR.

Visual Commonsense Reasoning (VCR) VCR (Zellers et al., 2018) requires machines to understand an image and answer a multiple-choice question. Specifically, questions are like 'what is going to happen next', 'infer the relationship between [personA] and [personB]' and 'why is [personA] smiling', together with the rationale for why the answer is true. Since VCR collects its data from movie clips (without notifying movie names), we argue that the answers may vary according to the specific movie type; e.g., a question like 'why is [personA] smiling' might have divergent answers in a horror movie and a romantic movie.

Taking one step further, different people who observe an image from different aspects or under different premises would have divergent understandings, and would thus make different interpretations of the same image. As shown in the example in Figure 1 (panels I and II), without a specified premise, all four hypothetical actions are reasonable in VCR, whereas in PMR we provide the premise "[person2] is the mother of [person5]." To make the correct choice, machines have to perceive the image under the guidance of the premise. In essence, the premise in the PMR task effectively narrows down the eligible actions with respect to the given image, and thus requires a higher level of human-like cognition over premises, questions and images. PMR is orthogonal to the contributions of VCR: we encourage in-depth, human-like reasoning under a specific premise, while VCR focuses on commonsense reasoning in the cross-modal setting.

4 Data

4.1 Data Preparation

Images and their annotation data are modified from the VCR image collection (see Appendix B). To generate premises compatible with the images, we manually wrote 52 premise templates belonging to six categories: relationship, antecedent, personality, emotion, environment and identity. We then expanded them to 32,142 premise sentences by data augmentation; a sketch of such an expansion follows.
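As an illustration of how templates of this kind could be expanded, the sketch below fills person-tag slots in a few hypothetical templates. The template syntax, the example templates, and the augmentation vocabulary are our assumptions for exposition; the paper does not specify the exact procedure.

```python
from itertools import permutations

# Hypothetical premise templates; "{a}" / "{b}" are person-tag slots.
TEMPLATES = {
    "relationship": ["{a} is the mother of {b}.", "{a} and {b} are colleagues."],
    "emotion": ["{a} is jealous of {b}."],
}

def expand_templates(templates, person_tags):
    """Expand each template with every ordered pair of person tags."""
    premises = []
    for category, patterns in templates.items():
        for pattern in patterns:
            for a, b in permutations(person_tags, 2):
                premises.append((category, pattern.format(a=a, b=b)))
    return premises

premises = expand_templates(TEMPLATES, ["[person1]", "[person2]", "[person5]"])
print(len(premises), premises[0])  # 18 premises from 3 templates x 6 pairs
```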


4.2 Crowdsourcing Annotations

By randomly combining each image with premises from each category, we constructed 36k examples to be labeled. In each example, we ask AMT workers to choose one of the given premises as the image's supplementary information. After that, the workers write two hypothetical actions describing what will happen next: Action-True contains image information and meets the chosen premise, whereas Action-False contains image information but contradicts the chosen premise. To ensure quality, we designed multi-stage quality control (see Appendix C).

4.3 Post-processing

We designed a series of rules to generate interference options, such as changing the order or gender of the characters in Action-True and Action-False and replacing keywords that appear in them, so that the newly generated options do not match the image (a sketch of such rules is given below). As a result, we provide four different types of hypothetical actions, of which only one is a reasonable inference based on the premise after understanding the image. We use accuracy as the main evaluation metric for this task. We will provide human performance as a reference and set up baseline models for PMR.
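The exact rules are not released with this proposal, but a minimal sketch of the two kinds of rule named above (swapping person order and swapping gendered keywords) might look as follows; the swap table is our own assumption.

```python
import re

# Hypothetical keyword substitutions used to derail an option from the image.
GENDER_SWAPS = {"mother": "father", "son": "daughter", "his": "her", "her": "his"}

def swap_person_order(action):
    """Swap the first two distinct person tags, e.g. [person4] <-> [person2]."""
    tags = list(dict.fromkeys(re.findall(r"\[person\d+\]", action)))
    if len(tags) < 2:
        return action
    placeholder = "\x00"
    return (action.replace(tags[0], placeholder)
                  .replace(tags[1], tags[0])
                  .replace(placeholder, tags[1]))

def swap_gender_keywords(action):
    """Replace gendered keywords so the option no longer matches the image."""
    return " ".join(GENDER_SWAPS.get(w, w) for w in action.split())

print(swap_person_order("[person4] is going to say how cute [person2]'s children are."))
# -> "[person2] is going to say how cute [person4]'s children are."
```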

4.4 Copyright

We do not have any copyright issues (see Appendix D).

5 Pilot Task

Pilot Annotation To obtain high-quality annotations, we conducted three pilot annotations, adjusting the annotation process and platform between rounds. Based on the pilot annotations, we estimated the final data size and the associated cost.

Name    Participant  Size  Price  Time  Rate
PA1     AMT workers  462   $0.1   2'28  53%
PA2     AMT workers  162   $0.05  1'40  48%
PA3     Students     460   $0.1   7'27  81%
Expect  Students     36k   $0.1   5'00  90%

Table 1: Information on the three pilot annotations (PA1-PA3) and resource estimation for the complete data. Students are native English speakers or graduate students in English-related majors. Price refers to the reward per assignment, Time to the average time per assignment, and Rate to the pass rate.

Based on the pilot annotations, we optimized the process and the annotation website. We invited proficient students as annotators and gave them extra training before annotating. In addition, inspectors will check annotation quality in real time, and unqualified labels will be reworked.

Pilot Experiment We sampled 50 manually annotated examples as testing data to evaluate the task. We converted them into the form of VCR questions¹: concretely, we constructed the VCR-form question by concatenating "What is going to happen next given the fact" with the premise (sketched below). Then, we used the VL-BERT (Su et al., 2019) model trained on VCR as our baseline. VL-BERT is a simple yet powerful pre-trainable generic representation for visual-linguistic tasks; we will therefore provide this model for participants who have no experience in image processing, while participants may also use their own image processing modules. VL-BERT's performance on VCR and on PMR is shown in Table 2. Compared with VCR, PMR is more challenging for multimodal reasoning, which makes it of great research significance.
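The conversion itself is simple string manipulation; a sketch following the recipe above (prefixing the premise with the fixed question stem) could look like this. The function name and the surrounding scaffolding are our own.

```python
def to_vcr_form(premise, actions):
    """Build a VCR-style (question, answer choices) pair from a PMR example by
    concatenating the fixed stem from the pilot experiment with the premise."""
    question = "What is going to happen next given the fact " + premise
    return question, actions

question, choices = to_vcr_form(
    "[person2] is the mother of [person5].",
    ["[person4] is going to say how cute [person2]'s children are.",
     "[person5] is going to say how cute [person2]'s children are.",
     '[person4] sighs and says, "The children\'s parents are too busy at work."',
     '[person1] sighs and says, "The children\'s parents are too busy at work."'],
)
print(question)
```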

Name         Accuracy  Model
VCR (Q → A)  73.6%     VL-BERT-base
PMR (Q → A)  40.0%     VL-BERT-base

Table 2: Pilot study with the VL-BERT baseline.

6 Task Organizers

Zuifang Sui has co-organized CLP-2014 Task 1 and CCL 2021 Task 2. Her research interests include Natural Language Processing, Text Mining and Language Knowledge Engineering. She has published more than 60 research papers in top-tier conferences and journals, such as ACL, IJCAI, EMNLP and COLING.

Weidong Zhan has co-organized CCL 2021 Task 2. His research interests include Modern Chinese Formal Grammar, Chinese Information Processing and Chinese Language Knowledge Engineering.

Information about the other co-organizers can be found in Appendix E.

¹ Testing data can be found at https://github.com/dqxiu/PMR.


References

Tadas Baltrušaitis, Chaitanya Ahuja, and Louis-Philippe Morency. 2017. Multimodal Machine Learning: A Survey and Taxonomy. arXiv preprint.

Samuel Bowman and Xiaodan Zhu. 2019. Deep learning for natural language inference. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Tutorials, pages 6-8.

Ido Dagan, Oren Glickman, and Bernardo Magnini. 2006. The PASCAL Recognising Textual Entailment Challenge. In Machine Learning Challenges. Evaluating Predictive Uncertainty, Visual Object Classification, and Recognising Tectual Entailment, pages 177-190, Berlin, Heidelberg. Springer Berlin Heidelberg.

Albert Ellis. 1995. Changing rational-emotive therapy (RET) to rational emotive behavior therapy (REBT). Journal of Rational-Emotive and Cognitive-Behavior Therapy, 13(2):85-89.

Akira Fukui, Dong Huk Park, Daylen Yang, Anna Rohrbach, Trevor Darrell, and Marcus Rohrbach. 2016. Multimodal compact bilinear pooling for visual question answering and visual grounding. arXiv preprint arXiv:1606.01847.

Brenden M. Lake, Tomer D. Ullman, Joshua B. Tenenbaum, and Samuel J. Gershman. 2016. Building Machines That Learn and Think Like People. arXiv preprint.

Bill MacCartney and Christopher D. Manning. 2007. Natural logic for textual inference. In Proceedings of the ACL-PASCAL Workshop on Textual Entailment and Paraphrasing, pages 193-200.

Weijie Su, Xizhou Zhu, Yue Cao, Bin Li, Lewei Lu, Furu Wei, and Jifeng Dai. 2019. VL-BERT: Pre-training of generic visual-linguistic representations. arXiv preprint arXiv:1908.08530.

Zichao Yang, Xiaodong He, Jianfeng Gao, Li Deng, and Alex Smola. 2015. Stacked Attention Networks for Image Question Answering. arXiv preprint.

Rowan Zellers, Yonatan Bisk, Ali Farhadi, and Yejin Choi. 2018. From Recognition to Cognition: Visual Commonsense Reasoning. arXiv preprint.



A Data Format

Here is an example² of what the data would look like:

{
  Premise: "[person2] is the mother of [person5].",
  Image id: 1528,
  # You can get the image and its annotation information from this value
  Options: {
    1: "[person4] is going to say how cute [person2]'s children are.",
    2: "[person5] is going to say how cute [person2]'s children are.",
    3: "[person4] sighs and says, “The children's parents are too busy at work.”",
    4: "[person1] sighs and says, “The children's parents are too busy at work.”"
  },
  # Four hypothetical actions; two of them are reasonable inferences from the image, the others are interference options.
  Answer: "1"
  # The only correct answer that can be derived from understanding the image based on the premise.
}

B Image Preparation

The VCR image collection comes from the Large Scale Movie Description Challenge and YouTube movie clips. It has been screened by an "interestingness filter" to retain meaningful images, 110k high-quality images in total. To ensure the labeling quality of PMR, we further screened out images with low brightness, more than five people, or more than 15 tags, and finally obtained 30k images for labeling; this screening is sketched below.
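A compact restatement of the three screening conditions as a filter function; the annotation field names and the brightness threshold are assumptions, since the paper only names the conditions.

```python
def keep_image(annotation, min_brightness=40):
    """Screening rules from Appendix B: drop images with low brightness,
    more than five people, or more than 15 tags."""
    return (annotation["brightness"] >= min_brightness
            and annotation["num_people"] <= 5
            and annotation["num_tags"] <= 15)

sample = {"brightness": 75, "num_people": 3, "num_tags": 9}
print(keep_image(sample))  # True: this image passes all three rules
```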

C Quality Control

Labeling Quality Control Before post-processing, we adopted a fast secondary crowdsourcing process to ensure labeling quality. We show the image, the context and an action (with its true or false attribute hidden) obtained from the first crowdsourced annotation to two annotators, and ask each annotator to judge whether the action is true or false. If both judgments are consistent with the true or false attribute from the first annotation, the annotation is regarded as valid; otherwise it is discarded.

² More examples can be found at https://github.com/dqxiu/PMR.

Question Quality Control After post-processing, we control the quality of the multiple-choice questions through crowdsourcing. We randomly hand the constructed multiple-choice questions to annotators on AMT. In addition to the four hypothetical actions of each question, a "Wrong" option is added to indicate that the question is flawed, i.e. the answer is not unique or there is no correct answer. If both annotators choose the correct answer, the question is regarded as valid; if both select the "Wrong" option, the question is discarded; in all other cases, the organizers of this task will check the question. This decision rule is restated as code below.
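A compact sketch of the rule; the enum and function names are ours.

```python
from enum import Enum

class Verdict(Enum):
    VALID = "keep the question"
    DISCARD = "drop the question"
    MANUAL_CHECK = "route to the task organizers"

def question_verdict(choice1, choice2, gold):
    """Question quality-control rule from Appendix C. `choice1`/`choice2` are
    the two annotators' picks ("A".."D" or "Wrong"); `gold` is the intended
    correct option."""
    if choice1 == gold and choice2 == gold:
        return Verdict.VALID
    if choice1 == "Wrong" and choice2 == "Wrong":
        return Verdict.DISCARD
    return Verdict.MANUAL_CHECK

print(question_verdict("A", "A", gold="A"))  # Verdict.VALID
```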

D Data Availability and Copyright

According to Section 107 of the Copyright Law³ and Sections 28A and 30 of the Copyright Acts⁴, there is one exception to copyright infringement, namely fair use (or fair dealing). Fair use is appropriate for public-benefit purposes such as research, and our use is not of a commercial nature. Besides, we only use texts that are publicly available, and the source will be stated according to law. Users will download the images directly from the original source.

E Other Organizers

Baobao Chang is an associate professor at Peking University. He has published over 60 high-quality academic papers in international conferences and journals, including top-tier conferences such as ACL, EMNLP, COLING, SIGIR and IJCAI. He has served on the Program Committee of various international conferences including ACL, EMNLP and COLING.

Sujian Li is an associate professor at Peking University. She has published more than 100 research papers, many of them in top-tier conferences such as ACL, KDD, AAAI, SIGIR and EMNLP. She has served on the Program Committee of various international conferences including ACL, AAAI, IJCAI, EMNLP, COLING and WWW, and as Area Chair of ACL 2017, CCL 2015/2017 and NLPCC 2016.

³ https://www.copyright.gov/title17/92chap1.html
⁴ https://www.gov.uk/government/publications/copyright-acts-and-related-laws


Zhongyu Wei is an associate professor at Fudan University. His research focuses on natural language processing and machine learning, with special emphasis on multimodal information understanding and generation across vision and language. He has published more than 60 papers at top-tier conferences in related research fields. He served as area co-chair for language grounding at EMNLP 2020.

Qingxiu Dong is a Ph.D. student at Peking University. Her research interests include natural language generation, debiasing and data mining.

Ziwei Qin is a Master's student at Peking University. His research interests include information extraction and multimodal learning.

Heming Xia is a Master's student at Peking University. His research interests include multimodal and few-shot learning.

Haoran Meng is a Master's student at Peking University. His research interests include natural language generation and multimodal machine learning.

Shoujie Tong is a Master's student at Peking University. His research interests include natural language processing and data mining.

Tian Feng is a Master's student at Peking University. Her research interests include natural-language question answering, neural question generation and data mining.

Lin Xu is a Master's student at Peking University. His research interest is knowledge graphs.

Tianyu Liu is a Ph.D. student at Peking University. His research interest is natural language generation.