Every Picture Tells a Story: Generating Sentences from Images
by Ali Farhadi, Mohsen Hejrati, Mohammad Sadeghi, Peter Young, Cyrus Rashtchian, Julia Hockenmaier, and David Forsyth
PRESENTATION BY KERRY SEITZ
The Problem
Generate sentences from images
Short, descriptive
A man is getting out of the red convertible.
A woman in green is talking on a phone.
[Image: Laptev et al. 2008]
Challenges
Lack of Dataset
Out-of-vocabulary words
Phrasing variations
◦ “Will gets out of the Chevrolet.”
◦ “A black car pulls up. Two army officers get out.”
◦ “Erin exits her new truck.”
[Image: Laptev et al. 2008]
Challenges
Synecdoche
◦ “Will you watch my animal this weekend?”
◦ “I just got a hot new set of wheels.”
Dataset
Based on PASCAL 2008 images
Randomly select 50 images from each category, for 1,000 images total
Generate 5 captions for each image
◦ Using Amazon’s Mechanical Turk
Manually add triples: <object, action, scene>
Mapping Images to Meaning
Discrete set of values for each label
Solve small multi-label Markov random field
Use a greedy method for inference
◦ Linear combination of feature functions
◦ Trained so that the ground truth triple scores highest
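The greedy inference over this small MRF can be sketched as coordinate ascent: initialize each of the three nodes to its best unary label, then repeatedly re-optimize one node while holding the other two fixed. This is an illustrative reconstruction, not the authors' code; `node_pot` and `edge_pot` stand in for the learned feature-function scores:

```python
def greedy_mrf_inference(labels, node_pot, edge_pot, sweeps=10):
    """Coordinate-ascent inference for the <object, action, scene> MRF.

    labels:   dict node name -> list of candidate values
    node_pot: dict (node, value) -> unary score
    edge_pot: dict (value_a, value_b) -> pairwise score (0.0 if absent)
    """
    # Start each node at its best unary label.
    state = {n: max(vals, key=lambda v: node_pot[(n, v)])
             for n, vals in labels.items()}
    edges = [("object", "action"), ("action", "scene"), ("object", "scene")]

    def score(st):
        total = sum(node_pot[(n, st[n])] for n in st)
        total += sum(edge_pot.get((st[a], st[b]), 0.0) for a, b in edges)
        return total

    # Greedy sweeps: re-optimize one node at a time, holding the rest fixed.
    for _ in range(sweeps):
        changed = False
        for n, vals in labels.items():
            best = max(vals, key=lambda v: score({**state, n: v}))
            if best != state[n]:
                state[n] = best
                changed = True
        if not changed:   # local optimum reached
            break
    return state["object"], state["action"], state["scene"]
```

Greedy sweeps are not guaranteed to find the global optimum, but the triple MRF is small enough that they work well in practice.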
Image Features
Deformable Parts Model
◦ Get a prediction for each class
◦ Consider max confidence of detectors, bounding box center, bounding box aspect ratio, and scale
Hoiem et al. classification
◦ Based on geometry, HOG features, and detection responses
Gist-based scene classification
◦ Global information
◦ Adaboost-style classifiers
Node Potentials
For a test image, find its k nearest neighbors in the training set
◦ By matching image features
◦ By features derived from classifiers and detectors
Compute average node features over the neighbors
◦ Computed from the image side
◦ Computed from the sentence side
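The averaging step can be sketched as follows; the feature vectors here are hypothetical placeholders, whereas the paper's neighbors come from the image features and classifier/detector responses described above:

```python
import numpy as np

def knn_node_features(test_feat, train_feats, train_node_feats, k=5):
    """Average the node features of the k training images nearest the test image."""
    # Euclidean distance from the test image to every training image.
    dists = np.linalg.norm(train_feats - test_feat, axis=1)
    nearest = np.argsort(dists)[:k]          # indices of the k nearest neighbors
    return train_node_feats[nearest].mean(axis=0)
```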
Edge Potentials
Find edge weights such that ground truth triples score highest
Linear combination of four estimates (from node A to node B):
◦ Normalized frequency of word A in the corpus, f(A)
◦ Normalized frequency of word B in the corpus, f(B)
◦ Normalized frequency of A and B occurring together, f(A, B)
◦ The ratio f(A, B) / (f(A) · f(B))
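These four estimates can be computed from a corpus of triples as follows (a toy sketch; the paper normalizes frequencies over its own annotated corpus):

```python
from collections import Counter

def edge_estimates(corpus, a, b):
    """Four co-occurrence estimates from a corpus of (object, action, scene)
    triples; frequencies are normalized by the number of triples."""
    n = len(corpus)
    word_count = Counter(w for t in corpus for w in t)
    f_a = word_count[a] / n                                   # f(A)
    f_b = word_count[b] / n                                   # f(B)
    f_ab = sum(1 for t in corpus if a in t and b in t) / n    # f(A, B)
    ratio = f_ab / (f_a * f_b) if f_a and f_b else 0.0        # f(A,B)/(f(A)f(B))
    return f_a, f_b, f_ab, ratio
```

The ratio term behaves like a (non-logarithmic) pointwise mutual information: it is above 1 when A and B co-occur more often than independence would predict.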
Sentence Potentials
Use a parser to extract
◦ Subject and direct object (object, action)
◦ Head nouns of prepositional phrases (scene)
◦ Head noun of the phrase “X in the background” (scene)
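Assuming the parse is available as (head, relation, dependent) tuples — a hypothetical simplified format, not the actual parser's output — the extraction might look like:

```python
def extract_triple_parts(deps):
    """Pick out triple candidates from dependency tuples.

    The subject fills the object slot and its head verb fills the action slot;
    heads of prepositional objects are scene candidates."""
    subj = verb = dobj = None
    scenes = []
    for head, rel, dep in deps:
        if rel == "nsubj":
            subj, verb = dep, head
        elif rel == "dobj":
            dobj = dep
        elif rel == "pobj":
            scenes.append(dep)
    return subj, verb, dobj, scenes
```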
Matching Triplets Between an Image and a Sentence
Matching score approximation
◦ Take the top k triples ranked by the sentence, and compute each one’s rank under the image model
◦ Take the top k triples ranked by the image, and compute each one’s rank under the sentence model
◦ Sum the inverse ranks from both directions, which emphasizes the stronger triples
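One way to realize this approximation, given ranked lists of triples from each side (a sketch, not the paper's exact formula):

```python
def matching_score(image_ranking, sentence_ranking, k=5):
    """Symmetric image-sentence matching score.

    Take each side's top-k triples and add the inverse of their rank on the
    other side, so triples that rank highly on both sides dominate the score."""
    def one_way(top, other):
        total = 0.0
        for t in top[:k]:
            if t in other:
                total += 1.0 / (other.index(t) + 1)   # 1-based inverse rank
        return total
    return (one_way(sentence_ranking, image_ranking)
            + one_way(image_ranking, sentence_ranking))
```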
Evaluation of Mappings to Meaning Space
Compare all triple elements
If the ground truth is (dog, sit, ground), which is better:
◦ (cat, sit, mat) or (bike, ride, street)?
◦ (cat, sit, mat) or (object, do, scene)?
Tree-F1 Measure
Example taxonomy:
Object
◦ Animal: Cat, Dog
◦ Vehicle: Bicycle, Car
Precision = (# edges on the predicted path that match the ground truth path) / (# edges on the predicted path)

Recall = (# matched edges) / (# edges on the ground truth path)

F1 = (2 × Precision × Recall) / (Precision + Recall)
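Tree-F1 can be sketched directly from these definitions by comparing root-to-node paths in the taxonomy. The `parent` map in the test encodes the example tree above; this is an illustration, not the authors' code:

```python
def tree_f1(predicted, truth, parent):
    """Tree-F1 between a predicted and a ground-truth taxonomy node.

    parent maps each node to its parent (the root maps to None); a path is
    the set of (parent, child) edges from the root down to the node."""
    def path_edges(node):
        edges = set()
        while parent[node] is not None:
            edges.add((parent[node], node))
            node = parent[node]
        return edges

    p_edges, t_edges = path_edges(predicted), path_edges(truth)
    if not p_edges or not t_edges:
        return 0.0
    matched = len(p_edges & t_edges)
    precision = matched / len(p_edges)
    recall = matched / len(t_edges)
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)
```

Notice how the measure gives partial credit: predicting Cat when the truth is Dog still matches the Object→Animal edge, so it scores higher than predicting Car.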
BLEU Measure
Check whether a triple is valid
◦ E.g. (bottle, walk, street) is not valid
◦ A triple is valid if it ever appeared in the corpus
Results – Generating Sentences
Trained annotators to evaluate sentences
◦ 1 – sentence is accurate
◦ 2 – sentence gives a rough idea of the image
◦ 3 – sentence is not even close
Generated 10 sentences per image
Averages
◦ Overall average score: 2.33
◦ # sentences with score 1 per image: 1.48
◦ # sentences with score 2 per image: 3.8
Summary
Sentences are a descriptive and compact representation of information
This work can generate good sentences for images
The intermediate meaning representation is crucial, and it also allows looking up images for a given sentence
Future Work
Sentence model is oversimplified
Iterative procedure for better sentence understanding
Identify adjectives and adverbs once sentence is generated