Distributional semantics with eyes
Enriching corpus-based models of word meaning with automatically extracted visual features

Marco Baroni
Center for Mind/Brain Sciences, University of Trento

Computational Linguistics Colloquium
Computational Linguistics & Phonetics Department, Saarland University
May 2012
Collaborators

Jasper Uijlings
Giang Binh Tran
Nam Khanh Tran
Gemma Boleda
Elia Bruni
Warning!

This talk is NOT about:

- image retrieval
- image annotation
- object recognition
- connecting specific images to captions/phrases/words
- improving computer vision

The talk is about using information extracted from images to improve the semantic representation of word types.
The distributional hypothesis
Harris, Charles and Miller, Firth, Wittgenstein? ...

The meaning of a word is (can be approximated by, derived from) the set of contexts in which it occurs in texts.
The distributional hypothesis in real life
McDonald & Ramscar 2001

He filled the wampimuk, passed it around and we all drunk some

We found a little, hairy wampimuk sleeping behind the tree
Distributional semantics
Landauer and Dumais (1997), Turney and Pantel (2010), ...

Corpus concordance for "moon":

  he curtains open and the  moon  shining in on the barely
  ars and the cold , close  moon  " . And neither of the w
  rough the night with the  moon  shining so brightly , it
   made in the light of the moon  . It all boils down
   surely under a crescent  moon  , thrilled by ice-white
  sun , the seasons of the  moon  ? Home , alone , Jay pla
   m is dazzling snow , the moon  has risen full and cold
  un and the temple of the  moon  , driving out of the hug
   in the dark and now the  moon  rises , full and amber
  bird on the shape of the  moon  over the trees in front
   But I could n't see the  moon  or the stars , only the
  rning , with a sliver of  moon  hanging among the stars
   they love the sun , the  moon  and the stars . None of
  the light of an enormous  moon  . The plash of flowing w
  man 's first step on the  moon  ; various exhibits , aer
  the inevitable piece of   moon  rock . Housing The Airsh
  oud obscured part of the  moon  . The Allied guns behind
Distributional semantics
Distributional meaning as co-occurrence vector

        planet  night  full  shadow  shine  crescent
moon        10     22    43      16     29        12
sun         14     10     4      15     45         0
dog          0      4     2      10      0         0
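Counting context words like this can be sketched in a few lines. This is a toy illustration, assuming a pre-tokenized corpus and a symmetric context window; the function name `cooccurrence_vector` is hypothetical, not from the talk:

```python
from collections import Counter

def cooccurrence_vector(target, corpus, window=2):
    """Count context words within +/-window tokens of each occurrence of target."""
    counts = Counter()
    for sentence in corpus:
        for i, tok in enumerate(sentence):
            if tok == target:
                lo, hi = max(0, i - window), i + window + 1
                # count neighbors, excluding the target word itself
                counts.update(w for w in sentence[lo:hi] if w != target)
    return counts

corpus = [
    "the moon shines at night".split(),
    "the full moon casts a shadow".split(),
]
vec = cooccurrence_vector("moon", corpus)  # e.g. vec["the"] == 2
```

Real models apply the same idea over billions of tokens and then reweight the raw counts (e.g. with pointwise mutual information).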
Distributional semantics
The geometry of meaning

        shadow  shine
moon        16     29
sun         15     45
dog         10      0

[Plot: shadow on the x-axis (0 to 20), shine on the y-axis (0 to 50), with points dog (10,0), sun (15,45), moon (16,29)]
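In this geometric view, word similarity is standardly measured as the cosine of the angle between vectors. A minimal sketch using the shadow/shine coordinates from the slide:

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

# shadow/shine coordinates from the slide
moon, sun, dog = (16, 29), (15, 45), (10, 0)

# moon points in nearly the same direction as sun, but not as dog
assert cosine(moon, sun) > cosine(moon, dog)
```

Cosine is preferred over Euclidean distance here because it ignores overall frequency (vector length) and compares only the direction, i.e. the relative distribution over contexts.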
Distributional semantics: A general-purpose representation of lexical meaning
Baroni and Lenci, 2010

- Similarity (cord-string vs. cord-smile)
- Synonymy (zenith-pinnacle)
- Concept categorization (car ISA vehicle; banana ISA fruit)
- Selectional preferences (eat topinambur vs. *eat sympathy)
- Analogy (mason is to stone like carpenter is to wood)
- Relation classification (exam-anxiety are in a CAUSE-EFFECT relation)
- Qualia (TELIC ROLE of novel is to entertain)
- Salient properties (car-wheels, dog-barking)
- Argument alternations (John broke the vase - the vase broke, John minces the meat - *the meat minced)
The ungrounded nature of distributional semantics
Glenberg and Robertson 2000; Andrews, Vigliocco and Vinson 2009; Riordan and Jones 2010 ...

Describing tigers ...

Humans (McRae et al., 2005):
- have stripes
- have teeth
- are black
- ...

State-of-the-art distributional model (Baroni et al., 2010):
- live in jungle
- can kill
- risk extinction
- ...
The ungrounded nature of distributional semantics

[Figure: breakdown over property types TAXO, RELENT, PART, QUALITY, ACTIVITY, FUNCTION, LOCATION]

Baroni and Lenci 2008
Interlude: "blind" semantics?

[Figure: SIGHTED vs. BLIND subjects, over categories ABS_ENT, ABS_PROP, CONC_ENT, CONC_PROP, EVENT, PART, SPACE, TAXO, TIME]

Cazzolli, Baroni, Lenci and Marotta, in preparation
The distributional hypothesis, generalized

The meaning of a word is (can be approximated by, derived from) the set of contexts in which it occurs (with "in texts" struck out on the slide).
Context in the 2010s
Multimodal distributional semantics
Using textual and visual collocates

        planet  night  [visual word 1]  [visual word 2]
moon        10     22               22                0
sun         14     10               15                0
dog          0      4                0               20

(Visual-word column labels are images in the original slide.)
Bags of visual words
Sivic and Zisserman, 2003

[Image with example visual-word histogram: 3 3 0 0 2 3 1 1]
Determining the visual vocabulary

[Figure: local descriptors clustered into a discrete vocabulary of visual words]
Representing images as bags-of-visual-word vectors

[Figure: an image mapped to a vector of visual-word counts]
Associating bags-of-visual-word vectors to word labels

[Figure: visual-word count vectors aggregated over images carrying the same word label]
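The pipeline on the last few slides (cluster local descriptors into a visual vocabulary, then represent each image as a histogram over that vocabulary) can be sketched with a toy k-means over random "descriptors". This is a minimal numpy-only sketch; real systems extract SIFT descriptors and use much larger vocabularies, and the names `build_vocabulary` and `bovw_histogram` are hypothetical:

```python
import numpy as np

rng = np.random.default_rng(0)

def build_vocabulary(descriptors, k, iters=10):
    """Toy k-means: cluster local descriptors into k 'visual words' (centroids)."""
    centers = descriptors[rng.choice(len(descriptors), k, replace=False)]
    for _ in range(iters):
        # distance of every descriptor to every center, shape (n, k)
        d = np.linalg.norm(descriptors[:, None] - centers[None], axis=2)
        assign = d.argmin(axis=1)
        for j in range(k):
            if (assign == j).any():
                centers[j] = descriptors[assign == j].mean(axis=0)
    return centers

def bovw_histogram(descriptors, centers):
    """Map an image's descriptors to counts over the visual vocabulary."""
    d = np.linalg.norm(descriptors[:, None] - centers[None], axis=2)
    return np.bincount(d.argmin(axis=1), minlength=len(centers))

# fake 128-dimensional SIFT-like descriptors
train = rng.normal(size=(200, 128))
vocab = build_vocabulary(train, k=8)
hist = bovw_histogram(rng.normal(size=(50, 128)), vocab)  # one image's BoVW vector
```

The resulting histograms play the same role for images that context-word counts play for text: each image becomes a bag of discrete "words".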
Simple multimodal matrices
Bruni, G.B. Tran and Baroni 2011

[Table: word-by-feature matrix concatenating textual collocate counts and visual-word counts; entries garbled in the source]

See also Feng and Lapata 2010, Leong and Mihalcea 2011
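The simplest way to build such a matrix is to normalize each modality's vector for a word and concatenate them. A minimal sketch; the weighting parameter `alpha` is a hypothetical addition for illustration, not from the talk:

```python
import numpy as np

def l2_normalize(v):
    """Scale a vector to unit length (leave zero vectors alone)."""
    n = np.linalg.norm(v)
    return v / n if n else v

def multimodal_vector(text_vec, visual_vec, alpha=0.5):
    """Normalize each modality separately, then concatenate with weight alpha on text."""
    return np.concatenate([alpha * l2_normalize(text_vec),
                           (1 - alpha) * l2_normalize(visual_vec)])

# toy values for "moon": textual collocates (planet, night) and two visual-word counts
moon = multimodal_vector(np.array([10.0, 22.0]), np.array([22.0, 0.0]))
```

Normalizing before concatenation matters: raw textual counts are typically orders of magnitude larger than visual-word counts and would otherwise dominate any similarity computation.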
The ESP labeled-image data set
von Ahn, 2003

Example image labels:
- frame, face, alien, bold, grey, smile
- mall, devil, red, picture, man
- bed, motel, white, lamp, flower, breakfast, pillows, fruit, hotel
Distribution of related concepts in text- vs. image-based space
BLESS data set, 184 concrete concepts

[Boxplots for the Text and Image spaces over relation types COORD, HYPER, MERO, ATTRI, EVENT, RAN.N, RAN.J, RAN.V; y-axis: standardized scores, roughly -2 to 2]
Nearest attributes of BLESS concepts

concept     text      image
cabbage     leafy     white
carrot      fresh     orange
cherry      ripe      red
deer        wild      brown
dishwasher  electric  white
hat         white     old
hatchet     sharp     short
onion       fresh     white
oven        electric  new
plum        juicy     red
sparrow     wild      little
tanker      heavy     grey
Multimodal fusion
Bruni, N.K. Tran and Baroni submitted

(1) Build a textual feature matrix from a text corpus and a visual feature matrix from labeled image data.
(2) Latent multimodal smoothing: normalize and concatenate the two matrices, split into blocks, and derive a smoothed textual matrix and a smoothed visual matrix.
(3) Multimodal similarity estimation, in two variants:
    - Feature combination: combine the textual and visual features, then compute a single similarity estimate.
    - Score combination: compute a similarity estimate per modality, then combine the two scores.
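The two variants in step (3) can be sketched as follows. Toy vectors only; in the actual models the combination weights are presumably tuned, here they are fixed at 0.5 as an assumption:

```python
import numpy as np

def cos(u, v):
    """Cosine similarity of two numpy vectors."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

def feature_combine_sim(t1, v1, t2, v2):
    """Feature combination: concatenate normalized modalities, then one cosine."""
    a = np.concatenate([t1 / np.linalg.norm(t1), v1 / np.linalg.norm(v1)])
    b = np.concatenate([t2 / np.linalg.norm(t2), v2 / np.linalg.norm(v2)])
    return cos(a, b)

def score_combine_sim(t1, v1, t2, v2, w=0.5):
    """Score combination: a cosine per modality, then a weighted average of the scores."""
    return w * cos(t1, t2) + (1 - w) * cos(v1, v2)

# toy textual and visual vectors for two words
t1, v1 = np.array([1.0, 2.0]), np.array([3.0, 1.0])
t2, v2 = np.array([1.0, 1.0]), np.array([2.0, 2.0])
fc = feature_combine_sim(t1, v1, t2, v2)
sc = score_combine_sim(t1, v1, t2, v2)
```

The difference is where fusion happens: feature combination mixes the modalities before measuring similarity, score combination keeps the modalities separate and mixes only the resulting judgments.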
Predicting human semantic relatedness judgments

                  Window 2          Window 20
Model             MEN   WordSim     MEN   WordSim
Text              0.73  0.70        0.68  0.70
Image             0.43  0.36        0.43  0.36
SmoothedText      0.77  0.73        0.74  0.75
FeatureCombine    0.78  0.72        0.76  0.75
ScoreCombine      0.78  0.71        0.77  0.72
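The figures in this table are rank correlations between model cosines and human relatedness ratings. A minimal Spearman implementation, assuming no tied scores (real evaluations would use e.g. `scipy.stats.spearmanr`, which also handles ties):

```python
def ranks(xs):
    """Rank of each element (0 = smallest), assuming no ties."""
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    r = [0] * len(xs)
    for rank, i in enumerate(order):
        r[i] = rank
    return r

def spearman(xs, ys):
    """Spearman rho: the Pearson correlation of the two rank vectors."""
    rx, ry = ranks(xs), ranks(ys)
    n = len(xs)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    vx = sum((a - mx) ** 2 for a in rx)
    vy = sum((b - my) ** 2 for b in ry)
    return cov / (vx * vy) ** 0.5
```

For a model, `xs` would be its cosine scores over the word pairs of MEN or WordSim and `ys` the corresponding human ratings; a perfectly monotone relationship gives rho = 1.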
Similar concepts best captured by Text vs. FeatureCombine

Text                 FeatureCombine
dawn/dusk            pet/puppy
sunrise/sunset       candy/chocolate
canine/dog           paw/pet
grape/wine           bicycle/bike
foliage/plant        apple/cherry
foliage/petal        copper/metal
skyscraper/tall      military/soldier
cat/feline           paws/whiskers
pregnancy/pregnant   stream/waterfall
misty/rain           cheetah/lion
Distributional semantics in Technicolor
Bruni, Boleda, Baroni and N.K. Tran 2012

Experiment 1: find the typical color of 52 concrete objects: cardboard is brown, coal is black, forest is green (typical colors assigned by two judges by consensus).

Experiment 2: distinguish literal and non-literal usages of color adjectives: blue uniform, blue shark, blue note (342 adjective-noun pairs, 227 literal, 115 non-literal, as decided by two judges by consensus).
Experiment 1 results
Median rank of correct color and number of top matches

Model          Exp 1
TEXT30K        3 (11)
LAB128         1 (27)
SIFT40K        3 (15)
TEXT+LAB128    1 (27)
TEXT+SIFT40K   2 (17)
Experiment 1 examples

word         gold    LAB     SIFT    TEXT
banana       yellow  yellow  blue    orange
cauliflower  white   green   yellow  orange
cello        brown   brown   black   blue
deer         brown   green   blue    red
froth        white   brown   black   orange
gorilla      black   black   red     grey
grass        green   green   green   green
pig          pink    pink    brown   brown
sea          blue    blue    blue    grey
weed         green   green   yellow  purple
Experiment 2 results
Average difference in normalized adjective-noun cosines in literal vs. non-literal conditions, with t-test significance

Model          Exp 1    Exp 2
TEXT30K        3 (11)   .53***
LAB128         1 (27)   .25*
SIFT40K        3 (15)   .57***
TEXT+LAB128    1 (27)   .36***
TEXT+SIFT40K   2 (17)   .73***
Experiment 2 color breakdown

[Boxplots of normalized adjective-noun cosines in the literal (L) vs. non-literal (N) conditions, per color (black, blue, green, red, white), in the Vision and Text spaces]

Example phrases: black (issue, culture, business); blue (note, shark, shield); green (future, politics, energy); red (meat, belt, face)
What would you rather eat?
Bergsma and Goebel 2011

- migas?
- zeolite?
- carillons?
- a ficus?
- a mamey?
- manioc?
What would you rather eat?
Bergsma and Goebel 2011

Figure 1: Which out-of-vocabulary nouns are plausible direct objects for the verb eat? Each row corresponds to a noun: 1. migas, 2. zeolite, 3. carillon, 4. ficus, 5. mamey and 6. manioc.

[Excerpt from Bergsma and Goebel 2011, beginning mid-sentence:]

...sponding classifier that scores noun arguments on the basis of various textual features. We use this discriminative framework to incorporate the visual information as new, visual features.

Our experiments evaluate the ability of these classifiers to correctly predict the selectional preferences of a small set of verbs. We evaluate two cases: 1) the case where the nouns are all assumed to be out-of-vocabulary, and the classifiers must make predictions without any corpus-based co-occurrence information, and 2) the case where we assume access to noun-verb co-occurrence information derived from web-scale N-gram data. We show that visual features are useful for some verbs, but not for others. For verbs taking abstract arguments without definitive visual features, the classifier can often learn to disregard the visual data. On the other hand, for verbs taking physical arguments (such as food, animals, or people), the classifier can make accurate predictions using the nouns' visual properties. In these cases, visual information remains useful even after incorporating the web-scale statistics.

2 Visual Selectional Preference

Consider determining whether the nouns carillon, migas and mamey are plausible arguments for the verb eat. Existing systems are unlikely to have such words in their training data, let alone information about their edibility. However, after inspecting a few images returned by a Google search for these words (Figure 1), a human might reasonably predict which words are edible. Humans make this determination by observing both intrinsic visual properties (pits, skins, rounded shapes and fruity colors) and extrinsic visual context (circular plates, bowls, and other food-related tools) (Oliva and Torralba, 2007).

We propose using similar information to predict the plausibility of arbitrary verb-noun pairs. That is, we aim to learn the distinguishing visual features of all nouns that are plausible arguments for a given verb. This differs from work that has aimed to recognize, annotate and retrieve objects defined by a single phrase, such as tree or wrist watch (Feng and Lapata, 2010a). These approaches learn from labeled images during training in order to assign words to unlabeled images during testing. In contrast, we analyze labeled images (during training and testing) in order to determine their visual compatibility with a given predicate. Our approach does not need labeled training images for a specific noun in order to assess that noun during testing; e.g. we can make a reasonable prediction for the plausibility of eat mamey even if we've never encountered mamey before.

We now specify how we automatically 1) download a set of images for each noun, 2) extract visual features from each image, and 3) combine the visual features from multiple images into plausibility scores. Scripts, code and data are available at: www.clsp.jhu.edu/~sbergsma/ImageSP/.

2.1 Mining noun images from the web

To obtain a set of images for a particular noun argument, we submit the noun as a query to either the Flickr photo-sharing website (www.flickr.com), or Google's image search (www.google.com/imghp). In both cases, we download the thumbnails on the results page directly rather than downloading the source images. Flickr returns images by matching the query against user-provided tags and accompanying text. Google retrieves images based on the image caption, file-name, and surrounding text (Feng and Lapata, 2010a). Images obtained from Google are known to be competitive with "hand prepared datasets" for training object recognizers (Fergus et al., 2005).
Coda: the illustrated distributional hypothesis
Bruni, Uijlings and Baroni, rejected

The meaning of a visually depicted concept is (can be approximated by, derived from) the set of contexts in which it occurs in images.
The illustrated distributional hypothesis
Spearman correlation of image-based models with semantic relatedness intuitions for 20 concrete Pascal concepts

Segmentation:
Area               No    Manual  Automatic
Concept            NA    39      36
Surround           NA    50      51
Concept+Surround   47    54      54

[Example images illustrating "concept" vs. "surround" regions]