Beyond Attributes -> Describing Images
Tamara L. Berg
UNC Chapel Hill
Descriptive Text
“It was an arresting face, pointed of chin, square of jaw. Her eyes were pale green without a touch of hazel, starred with bristly black lashes and slightly tilted at the ends. Above them, her thick black brows slanted upward, cutting a startling oblique line in her magnolia-white skin–that skin so prized by Southern women and so carefully guarded with bonnets, veils and mittens against hot Georgia suns”
Scarlett O’Hara described in Gone with the Wind.
Berg, Attributes Tutorial CVPR13
More Nuance than Traditional Recognition…
car
shoe
person
Toward Complex Structured Outputs
car
pink car: attributes of objects
car on road: relationships between objects
Telling the “story of an image”: Little pink smart car parked on the side of a road in a London shopping district.
… Complex structured recognition outputs
Learning from Descriptive Text
Visually descriptive language provides:
• Information about the world, especially the visual world.
• Information about how people construct natural language for imagery.
• Guidance for visual recognition.
How do people describe the world?
How does the world work?
What should we recognize?
Methodology
Generation Methods:
1) Compose descriptions directly from recognized content
2) Retrieve relevant existing text given recognized content
Natural language description
A random Pink Smart Car seen driving around Lambeth Roundabout and onto Lambeth Bridge.
Smart Car. It was so adorable and cute in the parking lot of the post office, I had to stop and take a picture.
Pink Car, Sign, Door, Motorcycle, Tree, Brick building, Dirty Road, Sidewalk, London, Shopping district
Related Work
• Compose descriptions given recognized content: Yao et al. (2010), Yang et al. (2011), Li et al. (2011), Kulkarni et al. (2011)
• Generation as retrieval: Farhadi et al. (2010), Ordonez et al. (2011), Gupta et al. (2012), Kuznetsova et al. (2012)
• Generation using pre-associated relevant text: Leong et al. (2010), Aker and Gaizauskas (2010), Feng and Lapata (2010a)
• Other (image annotation, video description, etc.): Barnard et al. (2003), Pastra et al. (2003), Gupta et al. (2008), Gupta et al. (2009), Feng and Lapata (2010b), del Pero et al. (2011), Krishnamoorthy et al. (2012), Barbu et al. (2012), Das et al. (2013)
Method 1: Recognize & Generate
Baby Talk: Understanding and Generating Simple Image Descriptions
Girish Kulkarni, Visruth Premraj, Sagnik Dhar, Siming Li, Yejin Choi, Alexander C Berg, Tamara L Berg
CVPR 2011
“This picture shows one person, one grass, one chair, and one potted plant. The person is near the green grass, and in the chair. The green grass is by the chair, and near the potted plant.”
Kulkarni et al, CVPR11
Methodology
• Vision -- detection and classification
• Text inputs -- statistics from parsing lots of descriptive text
• Graphical model (CRF) to predict best image labeling given vision and text inputs
• Generation algorithms to generate natural language
Kulkarni et al, CVPR11
Vision is hard!
World knowledge (from descriptive text) can be used to smooth noisy vision predictions!
Green sheep
Kulkarni et al, CVPR11
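The "green sheep" point can be illustrated with a small sketch. It is not the CRF from the paper: it assumes a much simpler per-object log-linear combination of a vision score and a text-derived prior, and all numbers, the `weight` parameter and the function name are made up for illustration.

```python
import math

def smooth_attribute_scores(vision_scores, text_prior, weight=1.0, floor=1e-6):
    """Combine vision scores with a text prior for one object's attributes.

    Each attribute gets log(vision) + weight * log(prior); the results are
    exponentiated and normalized so that they sum to 1.
    """
    combined = {}
    for attr, v in vision_scores.items():
        p = text_prior.get(attr, floor)
        log_score = math.log(max(v, floor)) + weight * math.log(max(p, floor))
        combined[attr] = math.exp(log_score)
    total = sum(combined.values())
    return {attr: s / total for attr, s in combined.items()}

# Hypothetical numbers: a noisy classifier slightly prefers "green" for a
# sheep, but "green sheep" is rare in descriptive text.
vision = {"green": 0.40, "white": 0.35, "black": 0.25}
prior = {"green": 0.01, "white": 0.70, "black": 0.29}

smoothed = smooth_attribute_scores(vision, prior)
best = max(smoothed, key=smoothed.get)
print(best)
```

With these toy values the vision-only choice is "green", while the smoothed choice follows the text statistics instead.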
Methodology
• Vision -- detection and classification
• Text -- statistics from parsing lots of descriptive text
• Graphical model (CRF) to predict best image labeling given vision and text inputs
• Generation algorithms to generate natural language
Kulkarni et al, CVPR11
Learning from Descriptive Text
Attributes
Relationships
green green grass by the lake
a very shiny car in the car museum in my hometown of upstate NY.
Our cat Tusik sleeping on the sofa near a hot radiator.
very little person in a big rocking chair
Kulkarni et al, CVPR11
Methodology
• Vision -- detection and classification
• Text -- statistics from parsing lots of descriptive text
• Model (CRF) to predict best image labeling given vision and text based potentials
• Generation algorithms to compose natural language
Kulkarni et al, CVPR11
System Flow
Input Image
Extract Objects/stuff
a) dog
b) person
c) sofa
Predict attributes (one score list per object)
brown 0.32, striped 0.09, furry .04, wooden .2, feathered .04 ...
brown 0.94, striped 0.10, furry .06, wooden .8, feathered .08 ...
brown 0.01, striped 0.16, furry .26, wooden .2, feathered .06 ...
Predict prepositions
near(a,b) 1, near(b,a) 1, against(a,b) .11, against(b,a) .04, beside(a,b) .24, beside(b,a) .17 ...
near(a,c) 1, near(c,a) 1, against(a,c) .3, against(c,a) .05, beside(a,c) .5, beside(c,a) .45 ...
near(b,c) 1, near(c,b) 1, against(b,c) .67, against(c,b) .33, beside(b,c) .0, beside(c,b) .19 ...
Predict labeling – vision potentials smoothed with text potentials
<<null,person_b>,against,<brown,sofa_c>>
<<null,dog_a>,near,<null,person_b>>
<<null,dog_a>,beside,<brown,sofa_c>>
Generate natural language description
This is a photograph of one person and one brown sofa and one dog. The person is against the brown sofa. And the dog is near the person, and beside the brown sofa.
Kulkarni et al, CVPR11
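The last step of the system flow, going from predicted <<attribute, object>, preposition, <attribute, object>> triples to text, can be sketched with a simple template. This is an assumption about how such templating could look, not the paper's generation algorithm; the function names are hypothetical and the output is deliberately plainer than the slide's sentence (no merging of clauses).

```python
def noun_phrase(node):
    """Render an (attribute, object) pair, e.g. ("brown", "sofa") -> "the brown sofa"."""
    attribute, obj = node
    return f"the {attribute} {obj}" if attribute else f"the {obj}"

def describe(triples):
    """Turn (subject, preposition, target) triples into template sentences."""
    # Collect each distinct object once, in order of first mention.
    objects = []
    for subject, _, target in triples:
        for node in (subject, target):
            if node not in objects:
                objects.append(node)
    listing = " and ".join(
        f"one {attr} {obj}" if attr else f"one {obj}" for attr, obj in objects
    )
    sentences = [f"This is a photograph of {listing}."]
    # One sentence per predicted relationship.
    for subject, prep, target in triples:
        phrase = f"{noun_phrase(subject)} is {prep} {noun_phrase(target)}."
        sentences.append(phrase[0].upper() + phrase[1:])
    return " ".join(sentences)

# The labeling from the system flow, with "null" attributes written as None.
triples = [
    ((None, "person"), "against", ("brown", "sofa")),
    ((None, "dog"), "near", (None, "person")),
    ((None, "dog"), "beside", ("brown", "sofa")),
]
description = describe(triples)
print(description)
```

Templates like this explain why the generated text is faithful to the detections but reads stiffly.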
Some good results
This is a picture of one sky, one road and one sheep. The gray sky is over the gray road. The gray sheep is by the gray road.
Here we see one road, one sky and one bicycle. The road is near the blue sky, and near the colorful bicycle. The colorful bicycle is within the blue sky.
This is a picture of two dogs. The first dog is near the second furry dog.
Kulkarni et al, CVPR11
Some bad results
Missed detections:
Here we see one potted plant.
This is a picture of one dog.
False detections:
There are one road and one cat. The furry road is in the furry cat.
This is a picture of one tree, one road and one person. The rusty tree is under the red road. The colorful person is near the rusty tree, and under the red road.
This is a photograph of two sheeps and one grass. The first black sheep is by the green grass, and by the second black sheep. The second black sheep is by the green grass.
Incorrect attributes:
This is a photograph of two horses and one grass. The first feathered horse is within the green grass, and by the second feathered horse. The second feathered horse is within the green grass.
Kulkarni et al, CVPR11
Algorithm vs Humans
“This picture shows one person, one grass, one chair, and one potted plant. The person is near the green grass, and in the chair. The green grass is by the chair, and near the potted plant.”
H1: A Lemonaide stand is manned by a blonde child with a cookie.
H2: A small child at a lemonade and cookie stand on a city corner.
H3: Young child behind lemonade stand eating a cookie.
Sounds unnatural!
Kulkarni et al, CVPR11
Method 2: Retrieval based generation
Every picture tells a story, describing images with meaningful sentences
Ali Farhadi, Mohsen Hejrati, Amin Sadeghi, Peter Young, Cyrus Rashtchian, Julia Hockenmaier, David Forsyth
ECCV 2010
Slides provided by Ali Farhadi
A Simplified Problem
Represent image/text content as subject-verb-scene triple
Good triples:
• (ship, sail, sea)
• (boat, sail, river)
• (ship, float, water)
Bad triples:
• (boat, smiling, sea) – bad relations
• (train, moving, rail) – bad words
• (dog, speaking, office) – both
Farhadi et al, ECCV10
The Expanded Model
• Map from Image Space to Meaning Space
• Map from Sentence Space to Meaning Space
• Retrieve Sentences for Images via Meaning Space
Farhadi et al, ECCV10
Retrieval through meaning space
• Map from Image Space to Meaning Space
• Map from Sentence Space to Meaning Space
• Retrieve Sentences for Images via Meaning Space
Farhadi et al, ECCV10
Image Space -> Meaning Space
Predict image content using trained classifiers
Farhadi et al, ECCV10
Sentence Space -> Meaning Space
• Extract subject, verb and scene from sentences in the training data
black cat over pink chair
A black color cat sitting on chair in a room.
cat sitting on a chair looking in a mirror.
Subject: Cat
Verb: Sitting
Scene: room
• Use taxonomy trees
Object
  Vehicle: Car, Train, Bike
  Human
  Animal: Cat, Horse, Dog
Farhadi et al, ECCV10
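A rough sketch of how taxonomy trees and a shared meaning space could support retrieval follows. It assumes a plain path-distance similarity over the small taxonomy on the slide and an unweighted sum over the (subject, verb, scene) slots; the paper's actual similarity measures and learned mappings are not reproduced here, and the second database sentence is invented for the example.

```python
# Parent links for the small taxonomy shown on the slide.
PARENT = {
    "Vehicle": "Object", "Human": "Object", "Animal": "Object",
    "Car": "Vehicle", "Train": "Vehicle", "Bike": "Vehicle",
    "Cat": "Animal", "Horse": "Animal", "Dog": "Animal",
}

def ancestors(word):
    """Return the path from a word up to the root, starting with the word itself."""
    path = [word]
    while path[-1] in PARENT:
        path.append(PARENT[path[-1]])
    return path

def taxonomy_similarity(a, b):
    """1.0 for identical words, lower the further away the first shared ancestor is."""
    path_a, path_b = ancestors(a), ancestors(b)
    for steps_a, node in enumerate(path_a):
        if node in path_b:
            return 1.0 / (1.0 + steps_a + path_b.index(node))
    return 0.0  # words outside the taxonomy only match themselves

def retrieve(image_triple, sentence_db):
    """Return the (sentence, triple) entry whose triple best matches the image triple."""
    def score(entry):
        return sum(taxonomy_similarity(x, y) for x, y in zip(image_triple, entry[1]))
    return max(sentence_db, key=score)

sentence_db = [
    ("A black color cat sitting on chair in a room.", ("Cat", "Sitting", "room")),
    ("A train moving through a station.", ("Train", "Moving", "station")),  # invented entry
]
# Hypothetical image prediction: the classifier says "Dog" instead of "Cat".
best_sentence, best_triple = retrieve(("Dog", "Sitting", "room"), sentence_db)
print(best_sentence)
```

Because Dog and Cat share the Animal parent, the mistaken subject still retrieves the cat sentence rather than the unrelated one.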
Data
Rashtchian et al 2010, Farhadi et al 2010: 1,000 images, 5 descriptions per image, 20 object categories
Image-Clef challenge: 20,000 images, 2 descriptions per image, select image categories
Large amounts of paired data can help us study the image-language relationship
More data needed?
Data exists, but buried in junk!
Through the smoke Duna Portrait #5
Mirror and gold the cat lounging in the sink
SBU Captioned Photo Dataset
http://tamaraberg.com/sbucaptions
Our dog Zoe in her bed
Interior design of modern white and brown living room furniture against white wall with a lamp hanging.
The Egyptian cat statue by the floor clock and perpetual motion machine in the pantheon
Man sits in a rusted car buried in the sand on Waitarere beach
Emma in her hat looking super cute
Little girl and her dog in northern Thailand. They both seemed interested in what we were doing
1 million captioned photos!
“Im2Text: Describing Images Using 1 Million Captioned Photographs”
Vicente Ordonez, Girish Kulkarni, Tamara L. Berg
NIPS 2011
Big Data Driven Generation
One of the many stone bridges in town that carry the gravel carriage roads.
An old bridge over dirty green water.
A stone bridge over a peaceful river.
Generate natural sounding descriptions using existing captions
Ordonez et al, NIPS11
Harness the Web!
Smallest house in paris between red (on right) and beige (on left).
Bridge to temple in Hoan Kiem lake.
The water is clear enough to see fish swimming around in it.
A walk around the lake near our house with Abby.
Hangzhou bridge in West lake.
The daintree river by boat.
…
SBU Captioned Photo Dataset: 1 million captioned images!
Global Matching (GIST + Color)
Transfer Caption(s), e.g. “The water is clear enough to see fish swimming around in it.”
Ordonez et al, NIPS11
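The global matching step amounts to nearest-neighbor search over whole-image descriptors followed by copying the neighbors' captions. The sketch below assumes each image is already summarized as a fixed-length vector (standing in for GIST + color features) and uses Euclidean distance; the three-dimensional descriptors are invented toy values, and only the captions come from the slide.

```python
import math

def euclidean(a, b):
    """Euclidean distance between two equal-length descriptor vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def transfer_captions(query_descriptor, database, k=1):
    """Return the captions of the k database images closest to the query."""
    ranked = sorted(database, key=lambda item: euclidean(query_descriptor, item[0]))
    return [caption for _, caption in ranked[:k]]

# (descriptor, caption) pairs; the descriptors are hypothetical.
database = [
    ((0.9, 0.1, 0.2), "Smallest house in paris between red (on right) and beige (on left)."),
    ((0.2, 0.8, 0.7), "Bridge to temple in Hoan Kiem lake."),
    ((0.1, 0.9, 0.9), "The water is clear enough to see fish swimming around in it."),
    ((0.3, 0.6, 0.5), "Hangzhou bridge in West lake."),
]

query = (0.1, 0.85, 0.95)  # descriptor of the image to be captioned
top = transfer_captions(query, database, k=2)
print(top[0])
```

A linear scan like this is only practical for a toy database; at the scale of a million images an approximate nearest-neighbor index would be the usual substitute.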
Use High Level Content to Rerank (Objects, Stuff, People, Scenes, Captions)
The bridge over the lake on Suzhou Street.
The Daintree river by boat.
Bridge over Cacapon river.
Iron bridge over the Duck river.
. . .
Transfer Caption(s), e.g. “The bridge over the lake on Suzhou Street.”
Ordonez et al, NIPS11
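Reranking can be sketched as rescoring the globally matched captions by how well they agree with recognized image content. The version below is a deliberately crude stand-in: it assumes the detectors output a set of content words and counts word overlap with each caption, whereas the paper combines object, stuff, people and scene estimates. The detected set is hypothetical.

```python
def rerank(candidates, detected_content):
    """Order candidate captions by how many detected content words they mention."""
    detected = {word.lower() for word in detected_content}

    def overlap(caption):
        words = {word.strip(".,").lower() for word in caption.split()}
        return len(detected & words)

    # sorted() is stable, so ties keep their global-matching order.
    return sorted(candidates, key=overlap, reverse=True)

# Candidates in the order returned by global matching.
candidates = [
    "The Daintree river by boat.",
    "The bridge over the lake on Suzhou Street.",
    "Bridge over Cacapon river.",
    "Iron bridge over the Duck river.",
]
detected_content = {"bridge", "lake"}  # hypothetical detector output

reranked = rerank(candidates, detected_content)
print(reranked[0])
```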
Results
Good:
Amazing colours in the sky at sunset with the orange of the cloud and the blue of the sky behind.
Fresh fruit and vegetables at the market in Port Louis Mauritius.
A female Mallard duck in the lake at Luukki Espoo.
Cat in sink.
Bad:
The cat in the window.
The boat ended up a kilometre from the water in the middle of the airstrip.
Ordonez et al, NIPS11
Next… Composing novel captions from pieces of existing ones
Composing captions guessing game
a) monkey playing in the tree canopy, Monte Verde in the rain forest
b) capuchin monkey in front of my window
c) monkey spotted in Apenheul Netherlands under the tree
d) a white-faced or capuchin in the tree in the garden
e) the monkey sitting in a tree, posing for his picture
“Collective Generation of Natural Image Descriptions”
Polina Kuznetsova, Vicente Ordonez, Alexander C. Berg, Tamara L. Berg and Yejin Choi
ACL 2012
Composing Descriptions
Example Composed Description:
the dirty sheep meandered along a desolate road in the highlands of Scotland through frozen grass
NP: the dirty sheep (object appearance)
VP: meandered along a desolate road (object pose)
PP: in the highlands of Scotland (scene appearance)
PP: through frozen grass (region appearance & relationship)
Kuznetsova et al, ACL12
Data Processing
1,000,000 images:
o Run object detectors
o Run region based stuff detectors (grass, sky, etc.)
o Run global scene classifiers
o Parse captions associated with images and retrieve phrases referring to objects (NPs, VPs), region relationships (PPstuff), and general scene context (PPscene).
Kuznetsova et al, ACL12
Image Description Generation
Computer Vision -> Objects, Actions, Stuff, Scenes -> Phrase Retrieval -> Generation -> Description
Kuznetsova et al, ACL12
Retrieving VPs
Detect: dog
Find matching detections by pose similarity
this dog was laying in the middle of the road on a back street in jaco
Closeup of my dog sleeping under my desk.
Peruvian dog sleeping on city street in the city of Cusco, (Peru)
Contented dog just laying on the edge of the road in front of a house..
Kuznetsova et al, ACL12
Retrieving NPs
Detect: fruit
Find matching detections by appearance similarity
Tray of glace fruit in the market at Nice, France
Fresh fruit in the market
A box of oranges was just catching the sun, bringing out detail in the skin.
The street market in Santanyi, Mallorca is a must for the oranges and local crafts.
An orange tree in the backyard of the house.
mandarin oranges in glass bowl
Kuznetsova et al, ACL12
Retrieving PPstuff
Detect: stuff
Find matching regions by appearance + arrangement similarity
Mini Nike soccer ball all alone in the grass
Comfy chair under a tree.
I positioned the chairs around the lemon tree -- it's like a shrine
Cordoba - lonely elephant under an orange tree...
Kuznetsova et al, ACL12
Retrieving PPscene
Extract scene descriptor
Find matching images by global scene similarity
View from our B&B in this photo
Pedestrian street in the Old Lyon with stairs to climb up the hill of fourviere
I'm about to blow the building across the street over with my massive lung power.
Only in Paris will you find a bottle of wine on a table outside a bookstore
Kuznetsova et al, ACL12
Object NPs: birds, the bird
Actions VPs: are standing, looking for food
Stuff PPs: in water, over water
Scene PPs: in the ocean, near Salt Pond
Position 1: birds
Position 2: over water
Position 3: are standing
Position 4: in the ocean
Kuznetsova et al, ACL12
Possible Assignments
Each of Position 1, Position 2, Position 3 and Position 4 can take any candidate phrase:
birds
the bird
are standing
in the ocean
…
Kuznetsova et al, ACL12
Phrases of the Same Type
Kuznetsova et al, ACL12
Singular/Plural Relationships
Kuznetsova et al, ACL12
ILP Optimization
Optimize for:
Vision scores
o Visual detection/classification scores
Phrase cohesion
o n-gram statistics between phrases
o Co-occurrence statistics between phrase head words
Subject to:
Linguistic constraints
o Allow at most one phrase of each type
o Enforce plural/singular agreement between NP and VP
Discourse constraints
o Prevent inclusion of repeated phrasing
Kuznetsova et al, ACL12
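The objective and constraints above can be made concrete on a toy example. The sketch below replaces the ILP solver with exhaustive search, which is only feasible for a handful of candidates, and it fixes the phrase order as NP, VP, PPstuff, PPscene rather than also choosing positions. All scores, the cohesion table, the stopword list and the number annotations are invented for illustration.

```python
from itertools import product

# Hypothetical candidates: (phrase, vision score, grammatical number or None).
CANDIDATES = {
    "NP": [("birds", 0.6, "plural"), ("the bird", 0.7, "singular")],
    "VP": [("are standing", 0.9, "plural"), ("looking for food", 0.3, None)],
    "PPstuff": [("over water", 0.6, None), ("in water", 0.5, None)],
    "PPscene": [("in the ocean", 0.7, None), ("near Salt Pond", 0.4, None)],
}
# Hypothetical cohesion bonus between adjacent phrases (stand-in for n-gram statistics).
COHESION = {("birds", "are standing"): 0.2, ("are standing", "over water"): 0.1}
STOPWORDS = {"in", "the", "over", "near", "are", "for"}

def agrees(np, vp):
    """Linguistic constraint: NP and VP must not disagree in number."""
    return np[2] is None or vp[2] is None or np[2] == vp[2]

def repeats_words(phrases):
    """Discourse constraint: reject selections that reuse a content word."""
    seen = set()
    for text, _, _ in phrases:
        for word in text.lower().split():
            if word in STOPWORDS:
                continue
            if word in seen:
                return True
            seen.add(word)
    return False

def best_description():
    """Enumerate all selections and keep the highest-scoring feasible one."""
    best_score, best_text = float("-inf"), None
    # One slot per phrase type enforces "at most one phrase of each type";
    # None lets the optional prepositional phrases be left out.
    options = [CANDIDATES["NP"], CANDIDATES["VP"],
               CANDIDATES["PPstuff"] + [None], CANDIDATES["PPscene"] + [None]]
    for choice in product(*options):
        chosen = [c for c in choice if c is not None]
        if not agrees(choice[0], choice[1]) or repeats_words(chosen):
            continue
        score = sum(c[1] for c in chosen)
        score += sum(COHESION.get((a[0], b[0]), 0.0) for a, b in zip(chosen, chosen[1:]))
        if score > best_score:
            best_score, best_text = score, " ".join(c[0] for c in chosen)
    return best_text, best_score

text, score = best_description()
print(text)
```

Without the agreement constraint the highest vision scores would pair "the bird" with "are standing"; with it, the search settles on the plural noun phrase.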
Good Examples
This is a sporty little red convertible made for a great day in Key West FL. This car was in the 4th parade of the apartment buildings.
This is a brass viking boat moored on beach in Tobago by the ocean.
The clock made in Korea.
Kuznetsova et al, ACL12
Visual Turing Test
In some cases (16%), ILP generated captions were preferred over human written ones!
Us vs Original Human Written Caption
Kuznetsova et al, ACL12
Bad Results
Grammatically Incorrect
Cognitive Absurdity
Not Relevant
Computer Vision Error
This is a shoulder bag with a blended rainbow effect.
Here you can see a cross by the frog in the sky.
One of the most shirt in the wall of the house.
Kuznetsova et al, ACL12
Questions?