What’s in that picture? VQA system
Juanita Ordóñez
Department of Computer Science, Stanford University
ordonez2@stanford.edu
Introduction
Dataset
Feature Extraction
Approach
Qualitative Results
Metric
Discussion & Future Work
References
• Images – Used a VGG [2] CNN pre-trained on ImageNet; scaled images to 224x224x3 before feeding them into the network and extracted features from the FC-7 layer.
• Text – Removed all punctuation, converted to lowercase, and built the vocabulary on the training set.
• Answers – Extracted the top 1000 most frequent answers from the training set. The model predicts a score for each.
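The text and answer preprocessing described above can be sketched as follows. This is a minimal illustration under my own naming; the function names and the exact punctuation-stripping rule are assumptions, not taken from the poster.

```python
import re
from collections import Counter

def preprocess_question(question):
    """Lowercase, strip punctuation, and tokenize, as in the Text bullet."""
    return re.sub(r"[^\w\s]", "", question.lower()).split()

def build_vocab(questions):
    """Build the word vocabulary from the training questions."""
    vocab = {}
    for q in questions:
        for tok in preprocess_question(q):
            vocab.setdefault(tok, len(vocab))
    return vocab

def top_answers(train_answers, k=1000):
    """Keep the k most frequent training answers as the output classes."""
    return [a for a, _ in Counter(train_answers).most_common(k)]

print(preprocess_question("What's in that picture?"))  # ['whats', 'in', 'that', 'picture']
```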
• Used the Visual Question Answering (VQA) [1] dataset
• 204,721 images
• 3 questions per image
• 10 candidate answers per question
• Wide variety of image dimensions, RGB and grayscale
Figure 2: Visual representation of the preprocessing step.
Visual Question Answering is a complex task which aims at answering a question about an image. This task requires a model that can analyze the actions within a visual scene and express answers about such a scene in natural language. This project focuses on building a model that answers open-ended questions.
Figure 1: VQA dataset image, question, and candidate answer examples.
• Softmax layer over the 1000 answer classes.
• Encode the question using an LSTM network.
• Keep image and question information throughout the MLP by concatenating each FC layer's output with the question-image context vector.
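One way to read the recursive-MLP idea in these bullets is the shape-level sketch below. The weights are random and the layer sizes (4096-d VGG FC-7 features, 512-d question encoding, 1024-unit FC layers, two layers) are illustrative assumptions, not the poster's actual hyper-parameters.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

# Illustrative sizes: 4096-d VGG FC-7 image features,
# 512-d LSTM question encoding, 1000 answer classes.
img_feat = rng.normal(size=4096)
q_feat = rng.normal(size=512)

# Context vector: question encoding concatenated with image features.
context = np.concatenate([q_feat, img_feat])

hidden = context
for _ in range(2):  # two "recursive" MLP layers (depth is an assumption)
    W = rng.normal(size=(1024, hidden.size)) * 0.01
    fc_out = np.maximum(W @ hidden, 0.0)  # FC + ReLU
    # Re-inject the context at every layer, per the bullet above.
    hidden = np.concatenate([fc_out, context])

W_out = rng.normal(size=(1000, hidden.size)) * 0.01
scores = softmax(W_out @ hidden)  # distribution over the 1000 answers
print(scores.shape)  # (1000,)
```

The point of the sketch is only the wiring: the question-image context vector is concatenated back onto every FC output rather than being consumed once at the input.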
Figure 3: High-level visual representation of the model.
Table 1: Results evaluated on the val-test dataset. Each model was trained for a total of 50 epochs with the same hyper-parameters. We show evaluations for the following models: MLP baseline, recursive MLP with bag of words, and LSTM-RMLP. We also show the results of a language-only LSTM-RMLP model in which no image information is used.
Our results show that encoding the question with an LSTM, as we do in the LSTM-RMLP model, raised our VQA score by 3.87%. The language-only model did only around 4% worse than the full-information LSTM-RMLP. This result is surprising, as it means the model answers questions about an image quite well without ever seeing it. For my next steps I will remove the softmax and generate a response in a way similar to sequence-to-sequence models. I would also like to explore reinforcement learning training techniques. Finally, I want to experiment with training the VGG-16 model end-to-end.
Figure 4: Qualitative results of model predictions; red indicates the model gave an incorrect answer and green indicates a correct one.
[1] Antol, Stanislaw, et al. "VQA: Visual question answering." Proceedings of the IEEE International Conference on Computer Vision. 2015.
[2] Simonyan, Karen, and Andrew Zisserman. "Very deep convolutional networks for large-scale image recognition." arXiv preprint arXiv:1409.1556 (2014).
[3] https://www.tensorflow.org/get_started/summaries_and_tensorboard
Each word vector is fed into an LSTM cell, and the last hidden state is concatenated with the VGG features.
Results
Model performance was evaluated using the VQA score metric, which measures how well the model's answer matches the candidate responses given for each question.
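Assuming the standard VQA accuracy from [1] (an answer counts as fully correct when at least 3 of the 10 human annotators gave it), the score can be computed as:

```python
def vqa_score(predicted, human_answers):
    """Standard VQA accuracy: min(#matching human answers / 3, 1)."""
    matches = sum(1 for a in human_answers if a == predicted)
    return min(matches / 3.0, 1.0)

print(vqa_score("yes", ["yes"] * 8 + ["no"] * 2))  # 1.0
print(vqa_score("no", ["yes"] * 8 + ["no"] * 2))   # 2/3
```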