

Ask & Answer by Machine: Closing The Loop for Visual Turing Challenge

Raveesh Mayya*, Rashmi Sankepally, Richard Neal, Sanghyun Hong, Thang Nguyen
Department of Computer Science, *Robert H. Smith School of Business, University of Maryland, College Park

Monday 29th February, 2016

Can a machine be intelligent? Is the ability to solve a complex arithmetic problem considered intelligence? If so, machines clearly are intelligent, in fact far surpassing the intelligence of most humans. Creating machines to understand language (the first question) and vision (the second question) has been the core motivation for the development of Artificial Intelligence (AI) research. Sixty years ago, Alan Turing proposed a test of language capacity that evaluates how well a machine converses with a human being: the Turing Test. Recently, scientists proposed a Visual Turing Challenge, in which a machine is required to answer basic questions using visual information such as images. However, the gap between how well humans and computers handle such questions is still so large that the questions in the Visual Turing Challenge are too difficult to provide useful evaluation. Computer Vision researchers at the University of Maryland have therefore proposed a novel middle-layer approach: they first let a computer generate questions based on images, and then design a deep learning method to answer those questions.

What is VQA?

Humans generate huge numbers of images every day, through social networks, the creation of digital content, and more. With this amount of data at their disposal, merely recognizing objects in images no longer meets the bar for computer vision. A machine should be able to look at millions of images and digest the information needed to answer important questions, ranging from simple ones like counting objects to far more complicated ones involving inference, such as whether a person sitting alone at a table is waiting for someone. Efforts have been made to publish image datasets and associated questions for researchers to evaluate their algorithms (see http://www.visualqa.org). However, most recent results on Visual Question Answering (VQA) benchmarks are driven largely by language-only information, and the contribution of visual recognition is unclear.

Figure 1: An example of self-talk by the presented system; whether the answer comes out affirmative or questioning is decided by the confidence score from the visual answering executive [3]

Research at UMD

Overcoming a formidable task like the Visual Turing Challenge is nontrivial. The most time-consuming (and probably the most expensive) step is to collect a vast amount of labeled data (both questions and answers) annotated by humans. Researchers at the University of Maryland, College Park (UMD) attack this problem by letting the machine generate the questions on its own, taking only images as input.

As illustrated in Figure 1, their approach, Neural Self Talk, simulates a self-talk process in which an intelligent algorithm looks at images and asks human-like questions. Self-talk is well known in psychology as a special form of intrapersonal communication that facilitates learning and understanding. The model that generates questions from images is called the Visual Question Generation (VQG) module; it is built on an image captioning system. A state-of-the-art Visual Question Answering (VQA) model such as [2] is then deployed to generate answers given the questions and the original image. Figure 2 shows this simple approach in the form of a flow diagram. The two component models, VQG and VQA, are detailed below, and Figure 3 illustrates the intuition behind them.
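To make the loop concrete, here is a purely hypothetical sketch in Python: the stub functions stand in for the real VQG and VQA models, and the function names and the 0.5 confidence threshold are illustrative placeholders rather than the authors' actual code.

import random

def generate_question(image):
    # Stub for the VQG module; a trained model would condition on image features.
    return random.choice(["what is on the table", "what color is the cat"])

def answer_question(image, question):
    # Stub for the VQA module; a trained model would return an answer and its confidence.
    return random.choice(["a cup", "black"]), random.random()

def self_talk(image, n_questions=3, confidence_threshold=0.5):
    dialogue = []
    for _ in range(n_questions):
        question = generate_question(image)
        answer, confidence = answer_question(image, question)
        # Low-confidence answers come out tentative rather than affirmative,
        # mirroring the behavior illustrated in Figure 1.
        if confidence < confidence_threshold:
            answer += "?"
        dialogue.append((question, answer))
    return dialogue

for q, a in self_talk(image=None):
    print("Q:", q, "| A:", a)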

Figure 2: The flow chart of the approach [3]

Visual Question Generation (VQG)

This module automatically generates questions from images. It extends a data-driven image captioning system based on Recurrent Neural Network (RNN) language models, proposed by Karpathy and Fei-Fei [1]. In the training step, the RNN uses a small set of images and their questions (each a sequence of words) written by human annotators to learn how to generate a word given the previous word and the image pixels. In the generation step, the RNN uses the image pixels (from unannotated images) as the starting point and gradually fills in the missing words. Because the model guesses each word with an associated probability, many questions can be obtained for a given image, similar to what human annotators usually do.
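As a rough illustration of this word-by-word sampling idea, the following sketch implements a tiny LSTM question generator in PyTorch; the vocabulary, layer sizes, and random image features are assumptions made for illustration and do not reflect the actual VQG model.

import torch
import torch.nn as nn

# Toy vocabulary; a real system would use the training set's full vocabulary.
VOCAB = ["<end>", "what", "is", "the", "color", "of", "cat", "on", "table", "?"]

class QuestionGenerator(nn.Module):
    # An LSTM seeded with CNN image features that emits a question word by word.
    def __init__(self, vocab_size, img_dim=4096, embed_dim=256, hidden_dim=256):
        super().__init__()
        self.img_proj = nn.Linear(img_dim, embed_dim)  # map image features into word space
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def generate(self, img_feat, max_len=10):
        inp = self.img_proj(img_feat).unsqueeze(1)  # image features act as the first "word"
        state, words = None, []
        for _ in range(max_len):
            hidden, state = self.lstm(inp, state)
            probs = torch.softmax(self.out(hidden[:, -1, :]), dim=-1)
            # Sampling (rather than taking the argmax) lets the model produce
            # several different questions for the same image.
            word_id = torch.multinomial(probs, 1).item()
            if VOCAB[word_id] == "<end>":
                break
            words.append(VOCAB[word_id])
            inp = self.embed(torch.tensor([[word_id]]))
        return " ".join(words)

torch.manual_seed(0)
generator = QuestionGenerator(len(VOCAB))
fake_image_features = torch.randn(1, 4096)  # stand-in for pre-computed CNN features
print(generator.generate(fake_image_features))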

Visual Question Answering (VQA)

Given the vast number of questions generated automatically by the VQG, a state-of-the-art VQA model [2] deploys a deep learning approach to answer a question given an image. This model uses a pre-trained 19-layer convolutional neural network (CNN) together with a general-purpose skip-gram word embedding. The VQG and VQA are trained simultaneously.
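A minimal sketch of this answering step, assuming a simple concatenation of image features and question embeddings, is shown below; the answer list, layer sizes, and fusion scheme are placeholders rather than the exact architecture of [2].

import torch
import torch.nn as nn

ANSWERS = ["yes", "no", "red", "blue", "one", "two", "cat", "table"]  # toy answer list
IMG_DIM, Q_DIM = 4096, 300  # e.g. a deep CNN fc layer and a skip-gram embedding size

class AnswerClassifier(nn.Module):
    # Concatenate image features with the question embedding, then classify.
    def __init__(self, hidden=512):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(IMG_DIM + Q_DIM, hidden),
            nn.ReLU(),
            nn.Linear(hidden, len(ANSWERS)),
        )

    def forward(self, img_feat, q_embed):
        return self.net(torch.cat([img_feat, q_embed], dim=-1))

torch.manual_seed(0)
model = AnswerClassifier()
img_feat = torch.randn(1, IMG_DIM)  # stand-in for pre-trained CNN features
q_embed = torch.randn(1, Q_DIM)     # stand-in for averaged skip-gram word vectors
confidence, idx = torch.softmax(model(img_feat, q_embed), dim=-1).max(dim=-1)
# A confidence score like this is what decides affirmative versus tentative answers in Figure 1.
print(ANSWERS[idx.item()], float(confidence))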

Figure 3: Internals of the modules: A is the VQG module and B is the VQA module; picture taken from [3]

Experiments & Results

Experiments were performed on two standard image datasets: the DAQUAR dataset, which has 12,468 human-generated question-answer pairs on 1,449 images of indoor scenes, and the COCO dataset for VQA, with 369,861 questions and 3,698,610 ground-truth answers based on 123,287 MSCOCO images. The evaluation is performed at two levels. On one level, the questions generated by the VQG are evaluated using standard linguistic and image captioning measures. On the other level, the researchers use Amazon Mechanical Turk (MTurk) to check the readability, grammar, correctness, and accuracy of the description of the image in the answers.
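On the automatic side, such captioning-style comparisons can be made with standard toolkits; the snippet below computes a sentence-level BLEU score with NLTK purely as an illustrative choice, not necessarily one of the metrics used in [3].

from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

human_question = "what is the color of the cat".split()
generated_question = "what color is the cat".split()

# Smoothing avoids zero scores when some higher-order n-grams never match.
score = sentence_bleu([human_question], generated_question,
                      smoothing_function=SmoothingFunction().method1)
print(f"BLEU: {score:.3f}")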

The team found that the questions generated by the models are similar to those asked by humans, based on standard linguistic and captioning metrics. Even in the MTurk evaluation, the generated questions achieved near-human readability; Table 1 summarizes these results.


Dataset     Readability   Correctness   Human-like
DAQUAR      3.35±1.92     2.50±1.03     2.49±1.02
COCO-VQA    3.39±1.18     2.70±1.29     2.40±1.33

Table 1: Results from the Amazon Mechanical Turk (MTurk) evaluation

For the self-talk evaluation, turkers rated the answers above the “a bit like human being” level of correctness but below the “half-human, half-machine” level. In essence, a significant number of these turkers felt that the answers were “roboty” and found the very concept of a self-talking robot humorous. Figures 4 and 5 show how they felt about the self-talk. The researchers acknowledge that the project is still a long way from achieving the maturity and accuracy it sets out to achieve.

Figure 4: Results from MTurk for the DAQUAR dataset

Figure 5: Results from MTurk for the COCO dataset

Conclusions

Neural Self Talk treats the image understanding problem as a sequence of consecutive self-questioning and answering steps. The two key components, the Visual Question Generation (VQG) module and the Visual Question Answering (VQA) module, together construct the question-and-answer pairs. The evaluation results from MTurk users demonstrate the promising future potential of this research: robots may behave like human beings in the near future. To improve the performance of intelligent self-talk, the researchers propose two promising directions: 1) incorporating common-sense knowledge, and 2) constructing a story line that mimics human communication. A mature version of these improvements could make users feel fondness for their robot companions and could boost current robotics research by integrating intelligent self-talk models into robots.

References

[1] A. Karpathy and L. Fei-Fei. Deep visual-semantic alignments for generating image descriptions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3128-3137, 2015.

[2] M. Ren, R. Kiros, and R. Zemel. Image question answering: A visual semantic embedding model and a new dataset. arXiv preprint arXiv:1505.02074, 2015.

[3] Y. Yang, Y. Li, C. Fermüller, and Y. Aloimonos. Neural self talk: Image understanding via continuous questioning and answering. arXiv preprint arXiv:1512.03460, 2015.

Contributors

Raveesh K Mayya ([email protected])
Rashmi Sankepally ([email protected])
Richard Neal ([email protected])
Sanghyun Hong ([email protected])
Thang Dai Nguyen ([email protected])
