What’s in that picture? VQA system
Juanita Ordóñez
Department of Computer Science, Stanford University
ordonez2@stanford.edu
Introduction
Dataset
Feature Extraction
Approach
Qualitative Results
Metric
Discussion & Future Work
References
• Images – Used a VGG [2] CNN pre-trained on ImageNet; scaled images to 224x224x3 before feeding them into the network and extracted features from the FC-7 layer.
• Text – Removed all punctuation, converted to lowercase, and built the vocabulary on the training set.
• Answers – Extracted the top 1000 most frequent answers from the training set. The model predicts a score for each.
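The text and answer preprocessing described above can be sketched as follows. This is a minimal illustration under my own naming; the function names and the exact punctuation-stripping rule are assumptions, not taken from the poster.

```python
import re
from collections import Counter

def preprocess_question(question):
    """Lowercase, strip punctuation, and tokenize, as in the Text bullet."""
    return re.sub(r"[^\w\s]", "", question.lower()).split()

def build_vocab(questions):
    """Build the word vocabulary from the training questions."""
    vocab = {}
    for q in questions:
        for tok in preprocess_question(q):
            vocab.setdefault(tok, len(vocab))
    return vocab

def top_answers(train_answers, k=1000):
    """Keep the k most frequent training answers as the output classes."""
    return [a for a, _ in Counter(train_answers).most_common(k)]

print(preprocess_question("What's in that picture?"))  # ['whats', 'in', 'that', 'picture']
```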
• Used the Visual Question Answering (VQA) [1] dataset
• 204,721 images
• 3 questions per image
• 10 candidate answers per question
• Wide variety of image dimensions, RGB and grayscale
Figure 2: Visual representation of the preprocessing step.
Visual Question Answering is a complex task which aims at answering a question about an image. This task requires a model that can analyze the actions within a visual scene and express answers about such a scene in natural language. This project focuses on building a model that answers open-ended questions.
Figure 1: VQA dataset image, question, and candidate answer examples.
• Softmax layer over the 1000 answer classes.
• Encode the question using an LSTM network.
• Keep image and question information throughout the MLP by concatenating each FC layer's output with the question-image context vector.
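One way to read the recursive-MLP idea in these bullets is the shape-level sketch below. The weights are random and the layer sizes (4096-d VGG FC-7 features, 512-d question encoding, 1024-unit FC layers, two layers) are illustrative assumptions, not the poster's actual hyper-parameters.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

# Illustrative sizes: 4096-d VGG FC-7 image features,
# 512-d LSTM question encoding, 1000 answer classes.
img_feat = rng.normal(size=4096)
q_feat = rng.normal(size=512)

# Context vector: question encoding concatenated with image features.
context = np.concatenate([q_feat, img_feat])

hidden = context
for _ in range(2):  # two "recursive" MLP layers (depth is an assumption)
    W = rng.normal(size=(1024, hidden.size)) * 0.01
    fc_out = np.maximum(W @ hidden, 0.0)  # FC + ReLU
    # Re-inject the context at every layer, per the bullet above.
    hidden = np.concatenate([fc_out, context])

W_out = rng.normal(size=(1000, hidden.size)) * 0.01
scores = softmax(W_out @ hidden)  # distribution over the 1000 answers
print(scores.shape)  # (1000,)
```

The point of the sketch is only the wiring: the question-image context vector is concatenated back onto every FC output rather than being consumed once at the input.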
Figure 3: High-level visual representation of the model.
Table 1: Results evaluated on the val-test dataset. Each model was trained for a total of 50 epochs with the same hyper-parameters. We show evaluations for the following models: MLP baseline, recursive MLP with bag of words, and LSTM-RMLP. We also show the results of a language-only LSTM-RMLP model in which no image information is used.
Our results show that encoding the question with an LSTM, as we do in the LSTM-RMLP model, raised our VQA score by 3.87%. The language-only model did only around 4% worse than the full-information LSTM-RMLP. This result is surprising, as it means the model answers questions about an image quite well without ever seeing it. For my next steps I will remove the softmax and generate a response in a way similar to sequence-to-sequence models. I would also like to explore reinforcement learning training techniques. Finally, I want to experiment with training the VGG-16 model end-to-end.
Figure 4: Qualitative results of model predictions; red indicates the model gave an incorrect answer and green indicates a correct one.
[1] Antol, Stanislaw, et al. "VQA: Visual question answering." Proceedings of the IEEE International Conference on Computer Vision. 2015.
[2] Simonyan, Karen, and Andrew Zisserman. "Very deep convolutional networks for large-scale image recognition." arXiv preprint arXiv:1409.1556 (2014).
[3] https://www.tensorflow.org/get_started/summaries_and_tensorboard
Each word vector is fed into an LSTM cell, and the last hidden state is concatenated with the VGG features.
Results
Model performance was evaluated using the VQA score metric, which measures how well the model's answer matches the candidate responses given for each question.
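Assuming the standard VQA accuracy from [1] (an answer counts as fully correct when at least 3 of the 10 human annotators gave it), the score can be computed as:

```python
def vqa_score(predicted, human_answers):
    """Standard VQA accuracy: min(#matching human answers / 3, 1)."""
    matches = sum(1 for a in human_answers if a == predicted)
    return min(matches / 3.0, 1.0)

print(vqa_score("yes", ["yes"] * 8 + ["no"] * 2))  # 1.0
print(vqa_score("no", ["yes"] * 8 + ["no"] * 2))   # 2/3
```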