Distributional semantics with eyes
Enriching corpus-based models of word meaning with automatically extracted visual features

Marco Baroni
Center for Mind/Brain Sciences, University of Trento

Computational Linguistics Colloquium
Computational Linguistics & Phonetics Department, Saarland University
May 2012
Collaborators

Jasper Uijlings
Giang Binh Tran
Nam Khanh Tran
Gemma Boleda
Elia Bruni
Warning!

This talk is NOT about:

- image retrieval
- image annotation
- object recognition
- connecting specific images to captions/phrases/words
- improving computer vision

The talk is about using information extracted from images to improve the semantic representation of word types.
The distributional hypothesis
Harris, Charles and Miller, Firth, Wittgenstein? ...

The meaning of a word is (can be approximated by, derived from) the set of contexts in which it occurs in texts.
The distributional hypothesis in real life
McDonald & Ramscar 2001

He filled the wampimuk, passed it around and we all drunk some

We found a little, hairy wampimuk sleeping behind the tree
Distributional semantics
Landauer and Dumais (1997), Turney and Pantel (2010), ...

Corpus concordance for "moon":

  he curtains open and the  moon  shining in on the barely
  ars and the cold , close  moon  " . And neither of the w
  rough the night with the  moon  shining so brightly , it
   made in the light of the moon  . It all boils down
   surely under a crescent  moon  , thrilled by ice-white
  sun , the seasons of the  moon  ? Home , alone , Jay pla
   m is dazzling snow , the moon  has risen full and cold
  un and the temple of the  moon  , driving out of the hug
   in the dark and now the  moon  rises , full and amber
  bird on the shape of the  moon  over the trees in front
   But I could n't see the  moon  or the stars , only the
  rning , with a sliver of  moon  hanging among the stars
   they love the sun , the  moon  and the stars . None of
  the light of an enormous  moon  . The plash of flowing w
  man 's first step on the  moon  ; various exhibits , aer
  the inevitable piece of   moon  rock . Housing The Airsh
  oud obscured part of the  moon  . The Allied guns behind
Distributional semantics
Distributional meaning as co-occurrence vector

        planet  night  full  shadow  shine  crescent
moon        10     22    43      16     29        12
sun         14     10     4      15     45         0
dog          0      4     2      10      0         0
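Counting context words like this can be sketched in a few lines. This is a toy illustration, assuming a pre-tokenized corpus and a symmetric context window; the function name `cooccurrence_vector` is hypothetical, not from the talk:

```python
from collections import Counter

def cooccurrence_vector(target, corpus, window=2):
    """Count context words within +/-window tokens of each occurrence of target."""
    counts = Counter()
    for sentence in corpus:
        for i, tok in enumerate(sentence):
            if tok == target:
                lo, hi = max(0, i - window), i + window + 1
                # count neighbors, excluding the target word itself
                counts.update(w for w in sentence[lo:hi] if w != target)
    return counts

corpus = [
    "the moon shines at night".split(),
    "the full moon casts a shadow".split(),
]
vec = cooccurrence_vector("moon", corpus)  # e.g. vec["the"] == 2
```

Real models apply the same idea over billions of tokens and then reweight the raw counts (e.g. with pointwise mutual information).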
Distributional semantics
The geometry of meaning

        shadow  shine
moon        16     29
sun         15     45
dog         10      0

[Plot: shadow on the x-axis (0 to 20), shine on the y-axis (0 to 50), with points dog (10,0), sun (15,45), moon (16,29)]
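In this geometric view, word similarity is standardly measured as the cosine of the angle between vectors. A minimal sketch using the shadow/shine coordinates from the slide:

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

# shadow/shine coordinates from the slide
moon, sun, dog = (16, 29), (15, 45), (10, 0)

# moon points in nearly the same direction as sun, but not as dog
assert cosine(moon, sun) > cosine(moon, dog)
```

Cosine is preferred over Euclidean distance here because it ignores overall frequency (vector length) and compares only the direction, i.e. the relative distribution over contexts.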
Distributional semantics: A general-purpose representation of lexical meaning
Baroni and Lenci, 2010

- Similarity (cord-string vs. cord-smile)
- Synonymy (zenith-pinnacle)
- Concept categorization (car ISA vehicle; banana ISA fruit)
- Selectional preferences (eat topinambur vs. *eat sympathy)
- Analogy (mason is to stone like carpenter is to wood)
- Relation classification (exam-anxiety are in a CAUSE-EFFECT relation)
- Qualia (TELIC ROLE of novel is to entertain)
- Salient properties (car-wheels, dog-barking)
- Argument alternations (John broke the vase - the vase broke, John minces the meat - *the meat minced)
The ungrounded nature of distributional semantics
Glenberg and Robertson 2000; Andrews, Vigliocco and Vinson 2009; Riordan and Jones 2010 ...

Describing tigers ...

Humans (McRae et al., 2005):
- have stripes
- have teeth
- are black
- ...

State-of-the-art distributional model (Baroni et al., 2010):
- live in jungle
- can kill
- risk extinction
- ...
The ungrounded nature of distributional semantics

[Figure: breakdown over property types TAXO, RELENT, PART, QUALITY, ACTIVITY, FUNCTION, LOCATION]

Baroni and Lenci 2008
Interlude: "blind" semantics?

[Figure: SIGHTED vs. BLIND subjects, over categories ABS_ENT, ABS_PROP, CONC_ENT, CONC_PROP, EVENT, PART, SPACE, TAXO, TIME]

Cazzolli, Baroni, Lenci and Marotta, in preparation
The distributional hypothesis, generalized

The meaning of a word is (can be approximated by, derived from) the set of contexts in which it occurs (with "in texts" struck out on the slide).
Context in the 2010s
Multimodal distributional semantics
Using textual and visual collocates

        planet  night  [visual word 1]  [visual word 2]
moon        10     22               22                0
sun         14     10               15                0
dog          0      4                0               20

(Visual-word column labels are images in the original slide.)
Bags of visual words
Sivic and Zisserman, 2003

[Image with example visual-word histogram: 3 3 0 0 2 3 1 1]
Determining the visual vocabulary

[Figure: local descriptors clustered into a discrete vocabulary of visual words]
Representing images as bags-of-visual-word vectors

[Figure: an image mapped to a vector of visual-word counts]
Associating bags-of-visual-word vectors to word labels

[Figure: visual-word count vectors aggregated over images carrying the same word label]
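The pipeline on the last few slides (cluster local descriptors into a visual vocabulary, then represent each image as a histogram over that vocabulary) can be sketched with a toy k-means over random "descriptors". This is a minimal numpy-only sketch; real systems extract SIFT descriptors and use much larger vocabularies, and the names `build_vocabulary` and `bovw_histogram` are hypothetical:

```python
import numpy as np

rng = np.random.default_rng(0)

def build_vocabulary(descriptors, k, iters=10):
    """Toy k-means: cluster local descriptors into k 'visual words' (centroids)."""
    centers = descriptors[rng.choice(len(descriptors), k, replace=False)]
    for _ in range(iters):
        # distance of every descriptor to every center, shape (n, k)
        d = np.linalg.norm(descriptors[:, None] - centers[None], axis=2)
        assign = d.argmin(axis=1)
        for j in range(k):
            if (assign == j).any():
                centers[j] = descriptors[assign == j].mean(axis=0)
    return centers

def bovw_histogram(descriptors, centers):
    """Map an image's descriptors to counts over the visual vocabulary."""
    d = np.linalg.norm(descriptors[:, None] - centers[None], axis=2)
    return np.bincount(d.argmin(axis=1), minlength=len(centers))

# fake 128-dimensional SIFT-like descriptors
train = rng.normal(size=(200, 128))
vocab = build_vocabulary(train, k=8)
hist = bovw_histogram(rng.normal(size=(50, 128)), vocab)  # one image's BoVW vector
```

The resulting histograms play the same role for images that context-word counts play for text: each image becomes a bag of discrete "words".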
Simple multimodal matrices
Bruni, G.B. Tran and Baroni 2011

[Table: word-by-feature matrix concatenating textual collocate counts and visual-word counts; entries garbled in the source]

See also Feng and Lapata 2010, Leong and Mihalcea 2011
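The simplest way to build such a matrix is to normalize each modality's vector for a word and concatenate them. A minimal sketch; the weighting parameter `alpha` is a hypothetical addition for illustration, not from the talk:

```python
import numpy as np

def l2_normalize(v):
    """Scale a vector to unit length (leave zero vectors alone)."""
    n = np.linalg.norm(v)
    return v / n if n else v

def multimodal_vector(text_vec, visual_vec, alpha=0.5):
    """Normalize each modality separately, then concatenate with weight alpha on text."""
    return np.concatenate([alpha * l2_normalize(text_vec),
                           (1 - alpha) * l2_normalize(visual_vec)])

# toy values for "moon": textual collocates (planet, night) and two visual-word counts
moon = multimodal_vector(np.array([10.0, 22.0]), np.array([22.0, 0.0]))
```

Normalizing before concatenation matters: raw textual counts are typically orders of magnitude larger than visual-word counts and would otherwise dominate any similarity computation.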
The ESP labeled-image data set
von Ahn, 2003

Example image labels:
- frame, face, alien, bold, grey, smile
- mall, devil, red, picture, man
- bed, motel, white, lamp, flower, breakfast, pillows, fruit, hotel
Distribution of related concepts in text- vs. image-based space
BLESS data set, 184 concrete concepts

[Boxplots for the Text and Image spaces over relation types COORD, HYPER, MERO, ATTRI, EVENT, RAN.N, RAN.J, RAN.V; y-axis: standardized scores, roughly -2 to 2]
Nearest attributes of BLESS concepts

concept     text      image
cabbage     leafy     white
carrot      fresh     orange
cherry      ripe      red
deer        wild      brown
dishwasher  electric  white
hat         white     old
hatchet     sharp     short
onion       fresh     white
oven        electric  new
plum        juicy     red
sparrow     wild      little
tanker      heavy     grey
Multimodal fusion
Bruni, N.K. Tran and Baroni submitted

(1) Build a textual feature matrix from a text corpus and a visual feature matrix from labeled image data.
(2) Latent multimodal smoothing: normalize and concatenate the two matrices, split into blocks, and derive a smoothed textual matrix and a smoothed visual matrix.
(3) Multimodal similarity estimation, in two variants:
    - Feature combination: combine the textual and visual features, then compute a single similarity estimate.
    - Score combination: compute a similarity estimate per modality, then combine the two scores.
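The two variants in step (3) can be sketched as follows. Toy vectors only; in the actual models the combination weights are presumably tuned, here they are fixed at 0.5 as an assumption:

```python
import numpy as np

def cos(u, v):
    """Cosine similarity of two numpy vectors."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

def feature_combine_sim(t1, v1, t2, v2):
    """Feature combination: concatenate normalized modalities, then one cosine."""
    a = np.concatenate([t1 / np.linalg.norm(t1), v1 / np.linalg.norm(v1)])
    b = np.concatenate([t2 / np.linalg.norm(t2), v2 / np.linalg.norm(v2)])
    return cos(a, b)

def score_combine_sim(t1, v1, t2, v2, w=0.5):
    """Score combination: a cosine per modality, then a weighted average of the scores."""
    return w * cos(t1, t2) + (1 - w) * cos(v1, v2)

# toy textual and visual vectors for two words
t1, v1 = np.array([1.0, 2.0]), np.array([3.0, 1.0])
t2, v2 = np.array([1.0, 1.0]), np.array([2.0, 2.0])
fc = feature_combine_sim(t1, v1, t2, v2)
sc = score_combine_sim(t1, v1, t2, v2)
```

The difference is where fusion happens: feature combination mixes the modalities before measuring similarity, score combination keeps the modalities separate and mixes only the resulting judgments.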
Predicting human semantic relatedness judgments

                  Window 2          Window 20
Model             MEN   WordSim     MEN   WordSim
Text              0.73  0.70        0.68  0.70
Image             0.43  0.36        0.43  0.36
SmoothedText      0.77  0.73        0.74  0.75
FeatureCombine    0.78  0.72        0.76  0.75
ScoreCombine      0.78  0.71        0.77  0.72
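The figures in this table are rank correlations between model cosines and human relatedness ratings. A minimal Spearman implementation, assuming no tied scores (real evaluations would use e.g. `scipy.stats.spearmanr`, which also handles ties):

```python
def ranks(xs):
    """Rank of each element (0 = smallest), assuming no ties."""
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    r = [0] * len(xs)
    for rank, i in enumerate(order):
        r[i] = rank
    return r

def spearman(xs, ys):
    """Spearman rho: the Pearson correlation of the two rank vectors."""
    rx, ry = ranks(xs), ranks(ys)
    n = len(xs)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    vx = sum((a - mx) ** 2 for a in rx)
    vy = sum((b - my) ** 2 for b in ry)
    return cov / (vx * vy) ** 0.5
```

For a model, `xs` would be its cosine scores over the word pairs of MEN or WordSim and `ys` the corresponding human ratings; a perfectly monotone relationship gives rho = 1.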
Similar concepts best captured by Text vs. FeatureCombine

Text                 FeatureCombine
dawn/dusk            pet/puppy
sunrise/sunset       candy/chocolate
canine/dog           paw/pet
grape/wine           bicycle/bike
foliage/plant        apple/cherry
foliage/petal        copper/metal
skyscraper/tall      military/soldier
cat/feline           paws/whiskers
pregnancy/pregnant   stream/waterfall
misty/rain           cheetah/lion
Distributional semantics in Technicolor
Bruni, Boleda, Baroni and N.K. Tran 2012

Experiment 1: find the typical color of 52 concrete objects: cardboard is brown, coal is black, forest is green (typical colors assigned by two judges by consensus).

Experiment 2: distinguish literal and non-literal usages of color adjectives: blue uniform, blue shark, blue note (342 adjective-noun pairs, 227 literal, 115 non-literal, as decided by two judges by consensus).
Experiment 1 results
Median rank of correct color and number of top matches

Model          Exp 1
TEXT30K        3 (11)
LAB128         1 (27)
SIFT40K        3 (15)
TEXT+LAB128    1 (27)
TEXT+SIFT40K   2 (17)
Experiment 1 examples

word         gold    LAB     SIFT    TEXT
banana       yellow  yellow  blue    orange
cauliflower  white   green   yellow  orange
cello        brown   brown   black   blue
deer         brown   green   blue    red
froth        white   brown   black   orange
gorilla      black   black   red     grey
grass        green   green   green   green
pig          pink    pink    brown   brown
sea          blue    blue    blue    grey
weed         green   green   yellow  purple
Experiment 2 results
Average difference in normalized adjective-noun cosines in literal vs. non-literal conditions, with t-test significance

Model          Exp 1    Exp 2
TEXT30K        3 (11)   .53***
LAB128         1 (27)   .25*
SIFT40K        3 (15)   .57***
TEXT+LAB128    1 (27)   .36***
TEXT+SIFT40K   2 (17)   .73***
Experiment 2 color breakdown

[Boxplots of normalized adjective-noun cosines in the literal (L) vs. non-literal (N) conditions, per color (black, blue, green, red, white), in the Vision and Text spaces]

Example phrases: black (issue, culture, business); blue (note, shark, shield); green (future, politics, energy); red (meat, belt, face)
What would you rather eat?
Bergsma and Goebel 2011

- migas?
- zeolite?
- carillons?
- a ficus?
- a mamey?
- manioc?
What would you rather eat?
Bergsma and Goebel 2011

Figure 1: Which out-of-vocabulary nouns are plausible direct objects for the verb eat? Each row corresponds to a noun: 1. migas, 2. zeolite, 3. carillon, 4. ficus, 5. mamey and 6. manioc.

[Excerpt from Bergsma and Goebel 2011, beginning mid-sentence:]

...sponding classifier that scores noun arguments on the basis of various textual features. We use this discriminative framework to incorporate the visual information as new, visual features.

Our experiments evaluate the ability of these classifiers to correctly predict the selectional preferences of a small set of verbs. We evaluate two cases: 1) the case where the nouns are all assumed to be out-of-vocabulary, and the classifiers must make predictions without any corpus-based co-occurrence information, and 2) the case where we assume access to noun-verb co-occurrence information derived from web-scale N-gram data. We show that visual features are useful for some verbs, but not for others. For verbs taking abstract arguments without definitive visual features, the classifier can often learn to disregard the visual data. On the other hand, for verbs taking physical arguments (such as food, animals, or people), the classifier can make accurate predictions using the nouns' visual properties. In these cases, visual information remains useful even after incorporating the web-scale statistics.

2 Visual Selectional Preference

Consider determining whether the nouns carillon, migas and mamey are plausible arguments for the verb eat. Existing systems are unlikely to have such words in their training data, let alone information about their edibility. However, after inspecting a few images returned by a Google search for these words (Figure 1), a human might reasonably predict which words are edible. Humans make this determination by observing both intrinsic visual properties (pits, skins, rounded shapes and fruity colors) and extrinsic visual context (circular plates, bowls, and other food-related tools) (Oliva and Torralba, 2007).

We propose using similar information to predict the plausibility of arbitrary verb-noun pairs. That is, we aim to learn the distinguishing visual features of all nouns that are plausible arguments for a given verb. This differs from work that has aimed to recognize, annotate and retrieve objects defined by a single phrase, such as tree or wrist watch (Feng and Lapata, 2010a). These approaches learn from labeled images during training in order to assign words to unlabeled images during testing. In contrast, we analyze labeled images (during training and testing) in order to determine their visual compatibility with a given predicate. Our approach does not need labeled training images for a specific noun in order to assess that noun during testing; e.g. we can make a reasonable prediction for the plausibility of eat mamey even if we've never encountered mamey before.

We now specify how we automatically 1) download a set of images for each noun, 2) extract visual features from each image, and 3) combine the visual features from multiple images into plausibility scores. Scripts, code and data are available at: www.clsp.jhu.edu/~sbergsma/ImageSP/.

2.1 Mining noun images from the web

To obtain a set of images for a particular noun argument, we submit the noun as a query to either the Flickr photo-sharing website (www.flickr.com), or Google's image search (www.google.com/imghp). In both cases, we download the thumbnails on the results page directly rather than downloading the source images. Flickr returns images by matching the query against user-provided tags and accompanying text. Google retrieves images based on the image caption, file-name, and surrounding text (Feng and Lapata, 2010a). Images obtained from Google are known to be competitive with "hand prepared datasets" for training object recognizers (Fergus et al., 2005).
Coda: the illustrated distributional hypothesis
Bruni, Uijlings and Baroni, rejected

The meaning of a visually depicted concept is (can be approximated by, derived from) the set of contexts in which it occurs in images.
The illustrated distributional hypothesis
Spearman correlation of image-based models with semantic relatedness intuitions for 20 concrete Pascal concepts

Segmentation:
Area               No    Manual  Automatic
Concept            NA    39      36
Surround           NA    50      51
Concept+Surround   47    54      54

[Example images illustrating "concept" vs. "surround" regions]