Data-driven Generation of Image Descriptions Vicente Ordonez-Roman The State University of New York Previously: Advisor: Tamara Berg

Page 1:

Data-driven Generation of Image Descriptions

Vicente Ordonez-Roman

The State University of New York

Previously:

Advisor: Tamara Berg

Page 2:

What most Computer Vision systems aim to say about a picture

sky, trees, water, building, bridge, river, tree

Computer Vision

Page 3:

What we are able to say about a picture

One of the many stone bridges in town that carry the gravel carriage roads.

An old bridge over dirty green water.

A stone bridge over a peaceful river.

Our Goal

Page 4:

Let’s just borrow captions from similar images!

Im2Text: Describing Images Using 1 Million Captioned Photographs. Vicente Ordonez, Girish Kulkarni, Tamara L. Berg.

Advances in Neural Information Processing Systems. NIPS 2011.

Page 5:

Harness the Web!

Smallest house in paris between red (on right) and beige (on left).

Bridge to temple in Hoan Kiem lake.

The water is clear enough to see fish swimming around in it.

A walk around the lake near our house with Abby.

Hangzhou bridge in West lake.

The daintree river by boat.

. . .

Images + Captions from the Web

Transfer Caption(s)

Matching using Global Image Features (GIST + Color)

e.g. “The water is clear enough to see fish swimming around in it.”
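The matching step on this slide can be sketched as a nearest-neighbor lookup over precomputed global descriptors. This is only an illustration: the function name `transfer_captions`, the toy 3-dimensional vectors, and the choice of Euclidean distance are assumptions, not details given in the talk.

```python
import math

def euclidean(a, b):
    """Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def transfer_captions(query_vec, database, k=1):
    """Return the captions of the k database images closest to the query.

    `database` is a list of (feature_vector, caption) pairs. The vectors
    stand in for precomputed global descriptors (e.g. GIST + color);
    computing them is out of scope for this sketch.
    """
    ranked = sorted(database, key=lambda item: euclidean(query_vec, item[0]))
    return [caption for _, caption in ranked[:k]]

# Toy 3-d descriptors (hypothetical values; real descriptors are much longer).
database = [
    ([0.9, 0.1, 0.3], "Bridge to temple in Hoan Kiem lake."),
    ([0.2, 0.8, 0.5], "The water is clear enough to see fish swimming around in it."),
    ([0.4, 0.4, 0.9], "The daintree river by boat."),
]
query = [0.25, 0.75, 0.5]
print(transfer_captions(query, database, k=1))
```

With a database of a million images, the linear scan would normally be replaced by an approximate nearest-neighbor index; the ranking logic stays the same.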

Page 6:

GIST

Page 7:

Use the web to collect images + captions

Flickr: 6,000,000,000 photographs! (*)
A lot of them with captions (lots of them publicly available)

Facebook: 90,000,000,000 pictures! (**)
A lot of them with captions (a lot of them not publicly available)

(*) http://blog.flickr.net/en/2011/08/04/6000000000/
(**) http://www.quora.com/How-many-photos-are-uploaded-to-Facebook-each-day

Page 8:

Dog with a ball in its mouth running around like crazy on the green grass.

cat in a sink

A 10-kg cat called Hercules.. and got caught in a pet door when trying to sneak into another house to steal dog food. 'Nuff said

Flickr images + captions

Page 14:

Solution: Collect hundreds of millions of captions

Filter out the bad ones

We found that “good captions” contain visual concepts and relation words such as “by”, “in”, “over”, “beside”, and “on top of”

~1 “good caption” for every 1000 “bad captions”
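A minimal sketch of such a caption filter, assuming a caption counts as "good" when it contains at least one visual-concept word and at least one relation word. The relation words come from the slide; the `VISUAL_TERMS` vocabulary and the exact rule are made-up stand-ins, since the slide does not spell them out.

```python
import re

# Relation words as listed on the slide. VISUAL_TERMS is a hypothetical
# stand-in for the visual-concept vocabulary, which the slide does not give.
RELATION_WORDS = ["on top of", "by", "in", "over", "beside"]
VISUAL_TERMS = {"dog", "cat", "bridge", "river", "grass", "sink", "ball"}

def tokenize(caption):
    """Lowercase the caption and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", caption.lower())

def is_good_caption(caption):
    """Keep captions with at least one visual term and one relation word."""
    tokens = tokenize(caption)
    # Pad with spaces so multi-word relations match on whole words only.
    text = " " + " ".join(tokens) + " "
    has_visual = any(tok in VISUAL_TERMS for tok in tokens)
    has_relation = any(" " + rel + " " in text for rel in RELATION_WORDS)
    return has_visual and has_relation

captions = [
    "Dog with a ball in its mouth running around like crazy on the green grass.",
    "cat in a sink",
    "IMG_8783",
    "Happy birthday Abby!",
]
good = [c for c in captions if is_good_caption(c)]
print(good)
```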

Im2Text: Describing Images Using 1 Million Captioned Photographs. Vicente Ordonez, Girish Kulkarni, Tamara L. Berg.

Advances in Neural Information Processing Systems. NIPS 2011.

Page 15:

SBU Captioned Photo Dataset

Our dog Zoe in her bed

Interior design of modern white and brown living room furniture against white wall with a lamp hanging.

The Egyptian cat statue by the floor clock and perpetual motion machine in the pantheon

Man sits in a rusted car buried in the sand on Waitarere beach

Emma in her hat looking super cute

Little girl and her dog in northern Thailand. They both seemed interested in what we were doing

1 million captioned photos!

Page 16:

Results

(1) while walking by the water
(2) plane flying over the sun
(3) shot this in a moving car at the nkve highway
(4) sunset over creve coeur lake and the page bridge
(5) sunset on 12th sep 2009 as seen from the field polder near my house
(6) window over yellow door
(7) sunset over capitol hill as seen from the roof of my building
(8) an orange sky over the irish sea
(9) beautiful golden sunset reflected in the waves of the ocean
(10) red sky probably caused by volcanic ash from iceland
(11) a view of sunset over river brahmaputa from koliyabhumura bridge
(12) red sky in the morning

Page 17:

Results

(1) burnt wooden door in derelict building portugal
(2) peterborough cathedral norman door in south wall
(3) amazing wooden door with wider light above
(4) door in wall
(5) girl looking in a classroom window
(6) a interesting cross in a window of an ancient city
(7) this mirror decorated with fruit painting was left behind by the previous owners
(8) unusual exterior wall postbox at st albans post office in st peters street al1
(9) door in oxford uk in black and white
(10) 19 plate behind glass in brass mat and preserver
(11) this is some of the window decoration external on the house just over the porch 0364
(12) cat in a window

Page 18:

Results

(1) img8783 ginger in the red chair
(2) red sky in the morning
(3) the cat is in the bag and the bag is in the river quot
(4) the light in the kitchen made everythin glow my little girl is growing up
(5) my cat in a box that is far too small for her
(6) one of the towel animals in the cabin edno ot jivotnite napraveno ot havlieni karpi v kabinata
(7) baby in her later years turned from green to red but she never went fully red all over
(8) if you take pictures through the hole in the bottom of a flower pot the whole of the eldritch world is revealed
(9) glazed ceramic poop form in orange wooden box
(10) rock garden in library
(11) it s funny to capture the preciousest cat in the house at his most devillicious
(12) the pink will get replaced by orange and blue in the fall

Page 19:

Results

(1) starfish from the book toys to knit dashing dachs superwash sock yarn in goldfish backing is orange fabric stuffing is pillow stuffing
(2) mural of birds and trees in the crypt of wat ratburana ayutthaya
(3) carvings in the rock wall
(4) acrylic on paper scarlet macaws communicate in the color red with yellow and blue as visual grammar
(5) epsom and table salt crystals growing in concentrated green tea solution
(6) the hops dried to a golden green in a matter of a few days almost too pretty to bag up
(7) after staring at the gorgeous colors of the leaves claes discovered that there were about 100 birds sleeping in the tree
(8) you know you re in wisconsin when the beach has pine needles in the sand
(9) i was walking down the sidewalk and i saw this glove craft dropped in the dirt it seemed really unusual
(10) made by fusing plastic bags
(11) bark pattern from a ponderosa pine tree in grand canyon national park
(12) the peasant that found a statue of the black virgin on a rock in a river

Page 20:

What to do next?

Page 21:

Use High Level Content to Rerank (Objects, Stuff, People, Scenes, Captions)

The bridge over the lake on Suzhou Street.

The Daintree river by boat.

Bridge over Cacapon river.

Iron bridge over the Duck river.

. . .

Transfer Caption(s)

e.g. “The bridge over the lake on Suzhou Street.”
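One way to picture the reranking step is a weighted sum of per-cue similarity scores applied to the candidates returned by global matching. The sketch below assumes a plain linear scoring function with hand-picked weights; the results table later in the deck mentions linear regression and a linear SVM for learning the combination, and the cue names and score values here are hypothetical.

```python
def rerank(candidates, weights):
    """Sort candidates by a weighted sum of their per-cue similarity scores.

    Each candidate is (caption, scores), where `scores` maps a cue name
    (e.g. "global", "object", "scene") to a similarity in [0, 1].
    The weights would normally be learned; here they are hand-picked.
    """
    def combined(item):
        _, scores = item
        return sum(weights.get(cue, 0.0) * value for cue, value in scores.items())
    return sorted(candidates, key=combined, reverse=True)

# Hypothetical scores for three candidates retrieved by global matching.
candidates = [
    ("The Daintree river by boat.",
     {"global": 0.80, "object": 0.10, "scene": 0.40}),
    ("The bridge over the lake on Suzhou Street.",
     {"global": 0.70, "object": 0.90, "scene": 0.80}),
    ("Iron bridge over the Duck river.",
     {"global": 0.60, "object": 0.85, "scene": 0.50}),
]
weights = {"global": 0.5, "object": 0.3, "scene": 0.2}

for caption, _ in rerank(candidates, weights):
    print(caption)
```

In this toy example the candidate with the best global score drops to the bottom once object and scene agreement are taken into account.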

Page 22:

Some success…

Amazing colours in the sky at sunset with the orange of the cloud and the blue of the sky behind.

Strange cloud formation literally flowing through the sky like a river in relation to the other clouds out there.

Fresh fruit and vegetables at the market in Port Louis Mauritius.

Tree with red leaves in the field in autumn.

A female mallard duck in the lake at Luukki Espoo

The sun was coming through the trees while I was sitting in my chair by the river

Under the sky of burning clouds.

Stained glass window in Eusebius church.

Page 23:

Still far from perfect

Kentucky cows in a field.

The cat in the window.

Incorrect objects

Page 24:

Still far from perfect

The sky is blue over the Gherkin.

The boat ended up a kilometre from the water in the middle of the airstrip.

Tree beside the river.

Water over the road.

Incorrect context

Completely wrong

Page 25:

How to Evaluate?

• “Ground truth”: The car is parked next to the train station besides a building.

• Candidates:
  “There is car parked in front of an office building”
  “This is the building that hosted the ceremony”
  “A vehicle stopped next to my house”

Similar to evaluation on Machine Translation
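To make the machine-translation analogy concrete, here is a simplified unigram BLEU score (clipped unigram precision times the brevity penalty) applied to the example sentences on this slide. It is a sketch of the general idea only; the exact BLEU configuration used for the reported numbers is not stated in the talk, and the tokenization here is an assumption.

```python
import math
import re
from collections import Counter

def tokens(sentence):
    """Lowercase word tokens; punctuation is dropped."""
    return re.findall(r"[a-z0-9]+", sentence.lower())

def bleu1(candidate, reference):
    """Clipped unigram precision times the BLEU brevity penalty."""
    cand, ref = tokens(candidate), tokens(reference)
    if not cand:
        return 0.0
    ref_counts = Counter(ref)
    # Each candidate word is credited at most as often as it occurs in the reference.
    overlap = sum(min(count, ref_counts[word]) for word, count in Counter(cand).items())
    precision = overlap / len(cand)
    # Penalize candidates that are shorter than the reference.
    penalty = 1.0 if len(cand) >= len(ref) else math.exp(1 - len(ref) / len(cand))
    return precision * penalty

reference = "The car is parked next to the train station besides a building."
candidates = [
    "There is car parked in front of an office building",
    "This is the building that hosted the ceremony",
    "A vehicle stopped next to my house",
]
for c in candidates:
    print(f"{bleu1(c, reference):.3f}  {c}")
```

Because the score only counts word overlap, a candidate that shares function words with the reference can score close to one that actually describes the same scene, which is one reason the talk also uses human evaluation.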

Page 26:

Method                                          BLEU score
Global matching (1k)                            0.0774
Global matching (10k)                           0.0909
Global matching (100k)                          0.0917
Global matching (1 million)                     0.1177
Global + Content matching (linear regression)   0.1215
Global + Content matching (linear SVM)          0.1259

BLEU score evaluation against Human Captions

Page 29:

Human Visual Verification

View overlooking Kuala Lumpur from my office building

Please choose the image that better corresponds to the given caption:

Caption from Flickr

Random image

Caption used Success rate

Original human caption 96.0%

Top caption 66.7%

Best from our top 4 captions 92.7%

Page 30:

Human Visual Evaluation

Caption used Success rate

Original human caption 96.0%

Top caption 66.7%

Best from our top 4 captions 92.7%

The view from the 13th floor of an apartment building in Nakano awesome.

Please choose the image that better corresponds to the given caption:

Caption produced by our system

Random image

Page 32:

What to do next?

Page 33:

Let’s not borrow captions from other images, let’s just borrow short phrases!

Collective Generation of Natural Image Descriptions. Polina Kuznetsova, Vicente Ordonez, Alexander C. Berg, Tamara L. Berg, Yejin Choi.

Association for Computational Linguistics. ACL 2012.

Large Scale Retrieval for Image Description Generation. Vicente Ordonez, Xufeng Han, Polina Kuznetsova, Girish Kulkarni, Margaret Mitchell, Kota Yamaguchi, Karl Stratos, Amit Goyal, Jesse Dodge, Alyssa Mensch, Hal Daume III, Alexander C. Berg, Yejin Choi, Tamara L. Berg.

In submission to IJCV special issue on Big Data.

Page 34:

Retrieving noun phrases from similar object detections

Page 35:

this dog was laying in the middle of the road on a back street in jaco

Closeup of my dog sleeping under my desk.

Detect: dog

Find matching dog detections by visual similarity

Peruvian dog sleeping on city street in the city of Cusco, (Peru)

Contented dog just laying on the edge of the road in front of a house..

Retrieving verb phrases from similar object detections

Page 36:

Find matching region detections using appearance + arrangement

Mini Nike soccer ball all alone in the grass

Comfy chair under a tree.

I positioned the chairs around the lemon tree -- it's like a shrine

Object: car

Cordoba - lonely elephant under an orange tree...

Retrieving prepositional phrases from region + detection matches

Page 37:

Retrieving prepositional phrases from scene matches

View from our B&B in this photo

Extract scene descriptor

Find matching images by scene similarity

Pedestrian street in the Old Lyon with stairs to climb up the hill of fourviere

I'm about to blow the building across the street over with my massive lung power.

Only in Paris will you find a bottle of wine on a table outside a bookstore

Page 38:

Data Processing

1 million images:
– Run object detectors
– Run region-based stuff detectors (e.g. grass, sky, etc.)
– Run global scene classifiers
– Parse captions associated with images and retrieve phrases referring to objects (NPs, VPs), region relationships (PPstuff), and general scene context (PPscene).
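The output of this processing can be pictured as an index of phrases, each tied to the detection or scene it was extracted for, so that phrases can later be retrieved by visual similarity. The record layout, field names, cosine similarity, and toy descriptors below are all illustrative assumptions rather than the authors' data format.

```python
from dataclasses import dataclass

@dataclass
class PhraseEntry:
    """One phrase harvested from a database caption (all fields illustrative)."""
    phrase_type: str   # "NP", "VP", "PPstuff", or "PPscene"
    label: str         # detector or classifier label the phrase was linked to
    descriptor: list   # appearance vector of the matched detection or scene
    text: str          # the phrase itself

def similarity(a, b):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = (sum(x * x for x in a) ** 0.5) * (sum(y * y for y in b) ** 0.5)
    return dot / norm

def retrieve_phrases(index, phrase_type, label, descriptor, k=2):
    """Return the k phrases of the given type and label with the most similar descriptor."""
    pool = [e for e in index if e.phrase_type == phrase_type and e.label == label]
    pool.sort(key=lambda e: similarity(e.descriptor, descriptor), reverse=True)
    return [e.text for e in pool[:k]]

# Toy index with 2-d descriptors (hypothetical values).
index = [
    PhraseEntry("VP", "dog", [0.9, 0.1], "sleeping under my desk"),
    PhraseEntry("VP", "dog", [0.2, 0.9], "running around like crazy"),
    PhraseEntry("NP", "dog", [0.8, 0.2], "Contented dog"),
    PhraseEntry("PPstuff", "grass", [0.5, 0.5], "in the grass"),
]
print(retrieve_phrases(index, "VP", "dog", [1.0, 0.0], k=1))
```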

Page 39:

Recognition, aka Vision, is hard

Detecting one hundred objects

Page 40:

Sometimes you can make it (a little) better

Detecting “mentioned” objects

The background is a vintage paint by number painting I have and the fabulous forest dress is by candyjunky!

Kevin’s mom, so punxrawk in Kev’s black flag hat

Look in the mountain for a lion face

Ecuador, amazon basin, near coca, rain forest, passion fruit flower

Page 41:

Everything together

Objects: bird

Actions: looking for food

Stuff: in water

Scene: in Lincoln City Oregon coast

Page 42:

Everything together

Retrieved phrases

bird, in water, on the beach, looking for food

bird, in water, in Lincoln City Oregon coast, looking for food

bird, in water, in Atlantic City, looking for food

Page 43:

Binary Integer Linear Programming

Phrase s_ij at position k: scored by phrase vision confidence

Phrase s_ij at position k followed by phrase s_pq at position k+1: scored by pairwise phrase cohesion

Pairwise phrase cohesion = n-gram cohesion + head-word co-occurrence
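One way to write down the objective suggested by this diagram is given below. The symbols are assumptions introduced for illustration: \alpha_{ijk} is a binary variable that is 1 when phrase s_{ij} is placed at position k, F_{ij} is the vision confidence of that phrase, and F_{ijpq} is the cohesion between phrases s_{ij} and s_{pq} when they are adjacent. The talk itself does not show the formula, so treat this as a sketch.

```latex
\max_{\alpha} \;
  \sum_{i,j,k} F_{ij}\,\alpha_{ijk}
  \;+\; \sum_{i,j,p,q,k} F_{ijpq}\,\alpha_{ijk}\,\alpha_{pq(k+1)},
\qquad \alpha_{ijk} \in \{0, 1\},
\qquad
F_{ijpq} = F^{\text{ngram}}_{ijpq} + F^{\text{co-occ}}_{ijpq}
```

The product of two binary variables is not linear as written; in an integer linear program it is usually replaced by an auxiliary binary variable constrained to equal that product.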

Page 44:

Composing Descriptions

Compose descriptions from phrases with ILP approach

• Linguistic constraints
  – Allow only one phrase of each type
  – Enforce plural/singular agreement between NP and VP

• Discourse constraints
  – Prevent inclusion of repeated phrasing

• Phrase cohesion constraints
  – n-gram statistics between phrases
  – Co-occurrence statistics between head words of phrases (last word or main verb) to encourage longer range cohesion
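The composition step can be illustrated on a toy scale without an ILP solver. The sketch below swaps in exhaustive search: it picks exactly one phrase per type (a simplification of the "only one phrase of each type" constraint) and the ordering that maximizes vision confidence plus cohesion between adjacent phrases. All scores are invented, and the agreement and discourse constraints are left out.

```python
from itertools import permutations, product

def compose(phrases, confidence, cohesion):
    """Exhaustively pick one phrase per type and an ordering that maximizes
    vision confidence plus cohesion between adjacent phrases.

    Brute force stands in for the ILP solver and is only feasible for tiny inputs.
    """
    best_score, best_order = float("-inf"), None
    # One phrase of each type.
    for choice in product(*phrases.values()):
        unary = sum(confidence[p] for p in choice)
        for order in permutations(choice):
            pairwise = sum(cohesion.get((a, b), 0.0) for a, b in zip(order, order[1:]))
            if unary + pairwise > best_score:
                best_score, best_order = unary + pairwise, order
    return " ".join(best_order), best_score

# Hypothetical candidate phrases and scores.
phrases = {
    "NP": ["bird"],
    "VP": ["looking for food"],
    "PPstuff": ["in water"],
    "PPscene": ["on the beach", "in Atlantic City"],
}
confidence = {
    "bird": 0.9, "looking for food": 0.6, "in water": 0.7,
    "on the beach": 0.5, "in Atlantic City": 0.3,
}
cohesion = {
    ("bird", "in water"): 0.8,
    ("in water", "looking for food"): 0.6,
    ("looking for food", "on the beach"): 0.7,
    ("bird", "looking for food"): 0.5,
    ("looking for food", "in water"): 0.2,
    ("in water", "on the beach"): 0.3,
    ("looking for food", "in Atlantic City"): 0.4,
}

sentence, score = compose(phrases, confidence, cohesion)
print(sentence)
```

The search space grows factorially with the number of phrase types and candidates, which is why a solver-based formulation is attractive at real scale.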

Page 45:

Good Results

This is a sporty little red convertible made for a great day in Key West FL. This car was in the 4th parade of the apartment buildings.

Taken in front of my cat sitting in a shoe box. Cat likes hanging around in my recliner.

This is a brass viking boat moored on beach in Tobago by the ocean.

Page 46:

Bad Results

Not relevant: This is a shoulder bag with a blended rainbow effect

Grammatically incorrect: One of the most shirt in the wall of the house.

Cognitive absurdity: Here you can see a cross by the frog in the sky.

Page 47:

Method                                  BLEU score
HMM (using cognitive phrases)           0.111
HMM (without using cognitive phrases)   0.114
ILP (using cognitive phrases)           0.114
ILP (without using cognitive phrases)   0.116

BLEU score evaluation

Page 48:

Human Forced Choice Evaluation

Comparison                                          ILP selected
ILP vs. HMM (no images, no cognitive phrases)       67.2%
ILP vs. HMM (no images, with cognitive phrases)     66.3%
ILP vs. HMM (with images, no cognitive phrases)     53.17%
ILP vs. HMM (with images, with cognitive phrases)   54.5%
ILP vs. NIPS 2011 (Global matching 1M)              71.8%
ILP vs. HUMAN                                       16%

Page 49:

Visual Turing Test

In some cases (16%), ILP generated captions were preferred over human written ones!

Us vs Original Human Written Caption

Page 50:

What’s next?

Page 51:

Meaning from large-scale computer vision

To be presented at ICCV 2013

Images with the word “house”

Images recognized as more likely to produce the word “house”

Page 52:

Meaning from large-scale computer vision

Images with the word “girl”

Images recognized as more likely to produce the word “girl”

To be presented at ICCV 2013

Page 53:

Meaning from large-scale computer vision

Mammals, Birds, Instruments, Structures, Plants, Other

Weights learned to recognize images with “desk” in caption

Top weighted classifier outputs

Weights learned over outputs of ~8k classifiers

To be presented at ICCV 2013
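The idea of learning weights over classifier outputs can be sketched as a linear model that predicts whether a word appears in an image's caption. The slide only says that weights are learned over the outputs of about 8k classifiers; logistic regression with stochastic gradient descent, the three classifier names, and the toy data below are assumptions made for this illustration.

```python
import math

def train_word_model(features, labels, epochs=200, lr=0.5):
    """Fit logistic-regression weights predicting whether a word appears in the caption.

    `features[i]` holds the classifier outputs for image i; `labels[i]` is 1
    if the target word occurs in that image's caption, else 0.
    """
    weights = [0.0] * len(features[0])
    bias = 0.0
    for _ in range(epochs):
        for x, y in zip(features, labels):
            z = bias + sum(w * v for w, v in zip(weights, x))
            pred = 1.0 / (1.0 + math.exp(-z))
            error = pred - y
            # Gradient step on the log loss for this single example.
            weights = [w - lr * error * v for w, v in zip(weights, x)]
            bias -= lr * error
    return weights, bias

# Hypothetical outputs of three classifiers standing in for the ~8k in the talk.
classifier_names = ["oak", "guitar", "sparrow"]
features = [
    [0.9, 0.1, 0.3],
    [0.8, 0.0, 0.6],
    [0.1, 0.9, 0.1],
    [0.2, 0.7, 0.2],
    [0.7, 0.2, 0.5],
    [0.0, 0.8, 0.0],
]
labels = [1, 1, 0, 0, 1, 0]   # 1 = caption contains "tree"

weights, bias = train_word_model(features, labels)
ranked = sorted(zip(classifier_names, weights), key=lambda item: item[1], reverse=True)
print(ranked[0][0])
```

Sorting the learned weights gives the "top weighted classifier outputs" view shown on the slide: the classifiers whose responses are most predictive of the word.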

Page 54:

Meaning from large-scale computer vision

Mammals, Birds, Instruments, Structures, Plants, Other

Weights learned to recognize images with “tree” in caption

Top weighted classifier outputs

Weights learned over outputs of ~8k classifiers

To be presented at ICCV 2013

Page 56:

Questions?