Every Picture Tells a Story: Generating Sentences from Images
by Ali Farhadi, Mohsen Hejrati, Mohammad Sadeghi, Peter Young, Cyrus Rashtchian, Julia Hockenmaier, and David Forsyth
PRESENTATION BY KERRY SEITZ
The Problem
Generate sentences from images
Short, descriptive
A man is getting out of the red convertible.
A woman in green is talking on a phone.
[Image: Laptev et al. 2008]
Challenges
Lack of Dataset
Out-of-vocabulary words
Phrasing variations
◦ “Will gets out of the Chevrolet.”
◦ “A black car pulls up. Two army officers get out.”
◦ “Erin exits her new truck.”
[Image: Laptev et al. 2008]
Challenges
Synecdoche
◦ “Will you watch my animal this weekend?”
◦ “I just got a hot new set of wheels.”
Dataset
Based on PASCAL 2008 images
Randomly select 50 images from each category, for 1,000 images total
Generate 5 captions for each image
◦ Using Amazon’s Mechanical Turk
Manually add triples: <object, action, scene>
Mapping Images to Meaning
Discrete set of values for each label
Solve small multi-label Markov random field
Use a greedy method for inference
◦ Linear combination of feature functions
◦ Trained so that the ground truth triple scores highest
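The greedy inference over this small MRF can be sketched as coordinate ascent: initialize each of the three nodes to its best unary label, then repeatedly re-optimize one node while holding the other two fixed. This is an illustrative reconstruction, not the authors' code; `node_pot` and `edge_pot` stand in for the learned feature-function scores:

```python
def greedy_mrf_inference(labels, node_pot, edge_pot, sweeps=10):
    """Coordinate-ascent inference for the <object, action, scene> MRF.

    labels:   dict node name -> list of candidate values
    node_pot: dict (node, value) -> unary score
    edge_pot: dict (value_a, value_b) -> pairwise score (0.0 if absent)
    """
    # Start each node at its best unary label.
    state = {n: max(vals, key=lambda v: node_pot[(n, v)])
             for n, vals in labels.items()}
    edges = [("object", "action"), ("action", "scene"), ("object", "scene")]

    def score(st):
        total = sum(node_pot[(n, st[n])] for n in st)
        total += sum(edge_pot.get((st[a], st[b]), 0.0) for a, b in edges)
        return total

    # Greedy sweeps: re-optimize one node at a time, holding the rest fixed.
    for _ in range(sweeps):
        changed = False
        for n, vals in labels.items():
            best = max(vals, key=lambda v: score({**state, n: v}))
            if best != state[n]:
                state[n] = best
                changed = True
        if not changed:   # local optimum reached
            break
    return state["object"], state["action"], state["scene"]
```

Greedy sweeps are not guaranteed to find the global optimum, but the triple MRF is small enough that they work well in practice.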
Image Features
Deformable Parts Model
◦ Get a prediction for each class
◦ Consider max confidence of detectors, bounding box center, bounding box aspect ratio, and scale
Hoiem et al. classification
◦ Based on geometry, HOG features, and detection responses
Gist-based scene classification
◦ Global information
◦ Adaboost-style classifiers
Node Potentials
For a test image, find its k nearest neighbors in the training set
◦ By matching image features
◦ By features derived from classifiers and detectors
Compute average node features over the neighbors
◦ Computed from the image side
◦ Computed from the sentence side
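The averaging step can be sketched as follows; the feature vectors here are hypothetical placeholders, whereas the paper's neighbors come from the image features and classifier/detector responses described above:

```python
import numpy as np

def knn_node_features(test_feat, train_feats, train_node_feats, k=5):
    """Average the node features of the k training images nearest the test image."""
    # Euclidean distance from the test image to every training image.
    dists = np.linalg.norm(train_feats - test_feat, axis=1)
    nearest = np.argsort(dists)[:k]          # indices of the k nearest neighbors
    return train_node_feats[nearest].mean(axis=0)
```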
Edge Potentials
Find edge weights such that ground truth triples score highest
Linear combination of four estimates (from node A to node B):
◦ Normalized frequency of word A in the corpus, f(A)
◦ Normalized frequency of word B in the corpus, f(B)
◦ Normalized frequency of A and B occurring together, f(A, B)
◦ The ratio f(A, B) / (f(A) · f(B))
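These four estimates can be computed from a corpus of triples as follows (a toy sketch; the paper normalizes frequencies over its own annotated corpus):

```python
from collections import Counter

def edge_estimates(corpus, a, b):
    """Four co-occurrence estimates from a corpus of (object, action, scene)
    triples; frequencies are normalized by the number of triples."""
    n = len(corpus)
    word_count = Counter(w for t in corpus for w in t)
    f_a = word_count[a] / n                                   # f(A)
    f_b = word_count[b] / n                                   # f(B)
    f_ab = sum(1 for t in corpus if a in t and b in t) / n    # f(A, B)
    ratio = f_ab / (f_a * f_b) if f_a and f_b else 0.0        # f(A,B)/(f(A)f(B))
    return f_a, f_b, f_ab, ratio
```

The ratio term behaves like a (non-logarithmic) pointwise mutual information: it is above 1 when A and B co-occur more often than independence would predict.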
Sentence Potentials
Use a parser to extract
◦ Subject and direct object (object, action)
◦ Head nouns of prepositional phrases (scene)
◦ Head noun of the phrase “X in the background” (scene)
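Assuming the parse is available as (head, relation, dependent) tuples — a hypothetical simplified format, not the actual parser's output — the extraction might look like:

```python
def extract_triple_parts(deps):
    """Pick out triple candidates from dependency tuples.

    The subject fills the object slot and its head verb fills the action slot;
    heads of prepositional objects are scene candidates."""
    subj = verb = dobj = None
    scenes = []
    for head, rel, dep in deps:
        if rel == "nsubj":
            subj, verb = dep, head
        elif rel == "dobj":
            dobj = dep
        elif rel == "pobj":
            scenes.append(dep)
    return subj, verb, dobj, scenes
```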
Matching Triplets Between an Image and a Sentence
Matching score approximation
◦ Take the top k triples ranked by the sentence, and compute each one’s rank under the image model
◦ Take the top k triples ranked by the image, and compute each one’s rank under the sentence model
◦ Sum the inverse ranks from both directions, which emphasizes the stronger triples
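One way to realize this approximation, given ranked lists of triples from each side (a sketch, not the paper's exact formula):

```python
def matching_score(image_ranking, sentence_ranking, k=5):
    """Symmetric image-sentence matching score.

    Take each side's top-k triples and add the inverse of their rank on the
    other side, so triples that rank highly on both sides dominate the score."""
    def one_way(top, other):
        total = 0.0
        for t in top[:k]:
            if t in other:
                total += 1.0 / (other.index(t) + 1)   # 1-based inverse rank
        return total
    return (one_way(sentence_ranking, image_ranking)
            + one_way(image_ranking, sentence_ranking))
```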
Evaluation of Mappings to Meaning Space
Compare all triple elements
If the ground truth is (dog, sit, ground), which is better:
◦ (cat, sit, mat) or (bike, ride, street)?
◦ (cat, sit, mat) or (object, do, scene)?
Tree-F1 Measure
Example taxonomy:
Object
◦ Animal: Cat, Dog
◦ Vehicle: Bicycle, Car
Precision = (# edges on the predicted path that match the ground truth path) / (# edges on the predicted path)

Recall = (# matched edges) / (# edges on the ground truth path)

F1 = (2 × Precision × Recall) / (Precision + Recall)
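Tree-F1 can be sketched directly from these definitions by comparing root-to-node paths in the taxonomy. The `parent` map in the test encodes the example tree above; this is an illustration, not the authors' code:

```python
def tree_f1(predicted, truth, parent):
    """Tree-F1 between a predicted and a ground-truth taxonomy node.

    parent maps each node to its parent (the root maps to None); a path is
    the set of (parent, child) edges from the root down to the node."""
    def path_edges(node):
        edges = set()
        while parent[node] is not None:
            edges.add((parent[node], node))
            node = parent[node]
        return edges

    p_edges, t_edges = path_edges(predicted), path_edges(truth)
    if not p_edges or not t_edges:
        return 0.0
    matched = len(p_edges & t_edges)
    precision = matched / len(p_edges)
    recall = matched / len(t_edges)
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)
```

Notice how the measure gives partial credit: predicting Cat when the truth is Dog still matches the Object→Animal edge, so it scores higher than predicting Car.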
BLEU Measure
Check whether a triple is valid
◦ E.g. (bottle, walk, street) is not valid
◦ A triple is valid if it ever appeared in the corpus
Results – Generating Sentences
Trained annotators to evaluate sentences
◦ 1 – sentence is accurate
◦ 2 – sentence gives a rough idea of the image
◦ 3 – sentence is not even close
Generated 10 sentences per image
Averages
◦ Overall average score: 2.33
◦ # sentences with score 1 per image: 1.48
◦ # sentences with score 2 per image: 3.8
Summary
Sentences are a descriptive and compact representation of information
This work can generate good sentences for images
The intermediate meaning representation is crucial, and it also allows looking up images for a given sentence
Future Work
Sentence model is oversimplified
Iterative procedure for better sentence understanding
Identify adjectives and adverbs once sentence is generated