Visual Sense Disambiguation: A Multimodal Approach
PhD thesis by Kate Saenko
Computer Science and AI Lab
Massachusetts Institute of Technology
Advisor: Trevor Darrell
PhD defense by Kate Saenko 2
The challenge of large scale object recognition
• How to get examples of 10,000+ categories?
– Collection of training images is time-consuming, subjective
– But the Web has billions of images!
• How to build precise models based on unlabeled image data?
• How to learn visual models on the fly, based on user input?
Multimodal context
speech text
image
collective knowledge
Main Contributions
• Proposed a method that combines images and spoken utterances to learn object models
• Developed an unsupervised approach that learns visual models from unlabeled images, text, and dictionaries
This is a bag…
… The Tote is the perfect example of two handbag design principles that ... The lines of this tote are incredibly sleek, but ... The semi buckles that form the handle attachments are ...
This is a bag
Noun
• bag, container (a flexible container with a single opening)
• bag, handbag, pocketbook, purse (a container used for carrying money and small personal items or accessories (especially by women))
• bag, travelling bag, suitcase (a rectangular container for carrying clothes)
bag
… The Tote is the perfect example of two handbag design principles that ... The lines of this tote are incredibly sleek, but ... The semi buckles that form the handle attachments are ...
Outline
• Audio-visual object recognition
– Related work
– Fusion model and experiments*
• Unsupervised text and image models
– Related work
– WISDOM: probabilistic dictionary-based image sense model
– Concrete WISDOM: identifying tangible objects
*see Saenko and Darrell. Object category recognition using probabilistic fusion of speech and image classifiers. MLMI, 2007.
Audio-visual object recognition
speech text
image
dictionary
Task: object recognition with audio-visual input*
Speech recognizer
Speech DB
*e.g. BIRON robot, see S. Li and B. Wrede. “Why and how to model multi-modal interaction for a mobile robot companion,” In Proc. AAAI, 2007.
lamp lamp lamp lamp
label <lamp>+
Speech, image can be ambiguous…
a pan...
That’s a pen!
Copy machine..
ant → fan
face → bass
piano → cannon
Proposal: use both channels to help disambiguate underlying word
object recognizer
Fusion of speech and image classifiers
Object Classifier
Speech recognizer
Speech DB
Image Classifier
Image DB
lamp
• Improve existing method by using both modalities
• Explore late fusion of classifier outputs
– Mean rule
– Product rule
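A minimal sketch of the two late-fusion rules named above, assuming each classifier outputs a posterior over the same label set. The function names and the toy posteriors are illustrative assumptions, not the experimental setup from the thesis.

```python
from math import prod

def normalize(scores):
    """Scale non-negative scores so they sum to 1."""
    total = sum(scores.values())
    return {label: s / total for label, s in scores.items()}

def fuse(posteriors, rule="product"):
    """Late fusion of per-modality class posteriors.

    posteriors: list of dicts mapping label -> P(label | modality input).
    rule: "mean" averages the posteriors, "product" multiplies them.
    """
    labels = posteriors[0].keys()
    if rule == "mean":
        fused = {c: sum(p[c] for p in posteriors) / len(posteriors) for c in labels}
    elif rule == "product":
        fused = {c: prod(p[c] for p in posteriors) for c in labels}
    else:
        raise ValueError(f"unknown rule: {rule}")
    return normalize(fused)

# Hypothetical posteriors: speech confuses "pen" and "pan", the image does not.
speech = {"pen": 0.45, "pan": 0.50, "lamp": 0.05}
image = {"pen": 0.70, "pan": 0.10, "lamp": 0.20}

mean_fused = fuse([speech, image], rule="mean")
product_fused = fuse([speech, image], rule="product")
best_mean = max(mean_fused, key=mean_fused.get)
best_product = max(product_fused, key=product_fused.get)
```

In this toy case the speech channel alone would pick "pan", while both fusion rules recover "pen" because the image posterior disagrees strongly.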
Experiments with 101 objects
• Asked users to speak object name for Caltech101, added noise
• Plot shows benefit from fusion across noise levels
Remaining issues…
object recognizer
Unsupervised object models
speech text
image
dictionary
Next
• Audio-visual object recognition
– Related work
– Fusion model and experiments
• Unsupervised text and image models
– Related work
– WISDOM: probabilistic dictionary-based image sense model
– Concrete WISDOM: identifying tangible objects
How can we learn a rich variety of visual concepts?
Image Sense Disambiguation
Would rather watch… Suicide watch
Hurricane, tornado watch
Watch out!
Celebrity watch
Text contexts
icrystal rfid wrist watch features watch masterpiece innovative watch making craftsmanship absolute precision fine charm high scratch resistance anti-allergenic characteristics make chronometer true jewel s wrist water proof sleek stylish wrist watch solar powered available watch ticket key purse identity card special offer place order rfid wrist watch absolutely free rfid watch black wrist strap rfid watch orange wrist strap rfid watch stainless steel privacy disclaimer copyright icrystal pty website
Topic 1: rolex service repair battery omega replica tag heuer breitling swiss replace gucci button price band …
Topic 2: new world media right said house april obama islam march bush war american time …
Latent Dirichlet allocation (LDA) (Blei et al. ‘03)
• One of several techniques for discovering latent dimensions in bag-of-words data
[Figure: LDA plate diagram with nodes α, θ, z, w, β, φ and plates K, M, N_d; labels: document d, topic, word, P(z|d), P(w|z)]
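The generative story behind LDA can be sketched as follows: each document draws a topic mixture P(z|d), and each word draws a topic and then a word from P(w|z). This is only an illustration; the vocabulary, the two hand-set topics, and the helper names are assumptions, and a real model would learn the topic-word distributions from data rather than fix them.

```python
import random

def sample_dirichlet(alpha, rng):
    """Draw from a Dirichlet distribution via normalized gamma variates."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]

def sample_categorical(probs, rng):
    """Draw an index according to the given probabilities."""
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]

def generate_document(alpha, topic_word, vocab, n_words, rng):
    """Generate one document under LDA.

    theta ~ Dirichlet(alpha) is the document's topic mixture P(z|d);
    each word first draws a topic z ~ theta, then a word w ~ P(w|z).
    """
    theta = sample_dirichlet(alpha, rng)
    topics, words = [], []
    for _ in range(n_words):
        z = sample_categorical(theta, rng)
        w = sample_categorical(topic_word[z], rng)
        topics.append(z)
        words.append(vocab[w])
    return theta, topics, words

# Toy vocabulary and two hand-set topics (illustrative values only).
vocab = ["rolex", "battery", "strap", "obama", "war", "media"]
topic_word = [
    [0.40, 0.30, 0.30, 0.00, 0.00, 0.00],  # P(w | z=0): wristwatch-like topic
    [0.00, 0.00, 0.00, 0.40, 0.30, 0.30],  # P(w | z=1): news-like topic
]
rng = random.Random(0)
theta, topics, words = generate_document([0.5, 0.5], topic_word, vocab, 8, rng)
```

Inference runs this story in reverse: given only the words, it estimates P(z|d) for each document and P(w|z) for each topic.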
Latent Topics
icrystal rfid wrist watch features watch masterpiece innovative watch making craftsmanship absolute precision fine charm high scratch resistance anti-allergenic characteristics make chronometer true jewel wrist water proof sleek stylish wrist watch solar powered available watch ticket key purse identity card special offer place order rfid wrist watch absolutely free rfid watch black wrist strap rfid watch orange wrist strap rfid watch stainless steel privacy disclaimer copyright icrystal pty website
Overview of approaches to web-based object model learning
• Some learn only from image features
– (Li et al. 07) bootstrap from labeled images
– (Fergus et al. 05) select correct image topic
• Some incorporate text features
– (Schroff et al. 07) use a category-independent text classifier
– (Berg and Forsyth 06) ask user to sort text topics
• None address polysemy directly
– (Loeff et al. 06) do image sense discrimination, not identification
• All rely on labeled images of correct sense
Next
• Audio-visual object recognition
– Related work
– Fusion model and experiments
• Unsupervised text and image models
– Related work
– WISDOM: probabilistic dictionary-based image sense model*
– Concrete WISDOM: identifying tangible objects
*see Saenko and Darrell. Unsupervised learning of visual sense models for polysemous words. NIPS 2008.
How can we ground image senses in the absence of labeled examples?
WORDNET: Noun
• S: (n) watch, ticker (a small portable timepiece)
• S: (n) watch (a period of time (4 or 2 hours) during which some of a ship's crew are on duty)
• S: (n) watch, vigil (a purposeful surveillance to guard or observe)
• S: (n) watch (the period during which someone (especially a guard) is on duty)
• S: (n) lookout, lookout man, sentinel, sentry, watch, spotter, scout, picket (a person employed to keep watch for some anticipated event)
• S: (n) vigil, watch (the rite of staying awake for devotional purposes (especially on the eve of a religious festival))
WIKIPEDIA: Watch may also refer to:
• Watch system, a period of work duty
• Tropical cyclone warnings and watches, alerts issued to coastal areas threatened by severe storms
• Watch (Unix), a Unix command
• Watch (TV channel), a TV station launching in Autumn 2008
• Watch (computer programming)
• Help:Watching pages on Wikipedia
• Watch (dog), name of the pet dog in the Boxcar Children
D. Yarowsky. Unsupervised word sense disambiguation rivaling supervised methods. ACL, 1995.
Sense-specific classifier
training images
Web Image Sense DictiOnary Model
Search Engine Watch: Search Engine Watch is the authoritative guide to search engine marketing (SEM) and search engine optimization (SEO), offering the latest news about search ... (searchenginewatch.com/)
watch - MDC: Watches for assignment to a property named prop in this object, calling handler(prop, oldval, newval) whenever prop is set and storing the return value in ... (developer.mozilla.org/en/Core_JavaScript_1.5_Reference/Global_Objects/Object/watch)
dictionary definitions
unlabeled text
dictionary model P( sense | data)
WISDOM does:
1. image sense disambiguation
2. dataset collection
3. classification of unseen images
noun
web images
fosil wrist watch a (800 x 628, 107k, jpg, amgmedia.com)
watch-1 (ticker)
WISDOM: Using dictionary entries to ground senses
• Use entry text to learn a probability distribution over words for that sense
• Problem: entries contain very little text
– Expand by adding synonyms, example sentences, etc.
– Still, very few words are covered!
WordNet entry:
• S: (n) mouse (any of numerous small rodents typically resembling diminutive rats having pointed snouts and small ears on elongated bodies with slender usually hairless tails)
direct hyponym / full hyponym:
• S: (n) house mouse, Mus musculus (brownish-grey Old World mouse now a common household pest worldwide)
• S: (n) harvest mouse, Micromyx minutus (small reddish-brown Eurasian mouse inhabiting e.g. cornfields)
• S: (n) field mouse, fieldmouse (any nocturnal Old World mouse of the genus Apodemus inhabiting woods and fields and gardens)
• S: (n) nude mouse (a mouse with a genetic defect that prevents them from growing hair and also prevents them from immunologically rejecting human cells and tissues; widely used in preclinical trials)
• S: (n) wood mouse (any of various New World woodland mice)
direct hypernym / inherited hypernym / sister term:
• S: (n) rodent, gnawer (relatively small placental mammals having a single pair of constantly growing incisor teeth specialized for gnawing)
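The coverage problem can be illustrated with a small sketch: build a unigram distribution from the entry text, then measure how many tokens of a web context it covers before and after expansion. The stopword list, the tokenizer, and the example web context are assumptions made for illustration.

```python
from collections import Counter

STOPWORDS = {"a", "an", "any", "of", "the", "and", "on", "with", "that", "having"}

def tokenize(text):
    """Lowercase, split on non-letters, and drop stopwords."""
    cleaned = "".join(ch if ch.isalpha() else " " for ch in text.lower())
    return [tok for tok in cleaned.split() if tok not in STOPWORDS]

def entry_distribution(texts):
    """Maximum-likelihood unigram distribution over the entry's words."""
    counts = Counter(tok for text in texts for tok in tokenize(text))
    total = sum(counts.values())
    return {word: n / total for word, n in counts.items()}

def coverage(distribution, context):
    """Fraction of context tokens that appear in the entry distribution."""
    tokens = tokenize(context)
    return sum(tok in distribution for tok in tokens) / len(tokens)

definition = ("mouse any of numerous small rodents typically resembling "
              "diminutive rats having pointed snouts and small ears")
expansion = [
    "house mouse brownish-grey Old World mouse now a common household pest",
    "rodent gnawer relatively small placental mammals",
]

# Hypothetical web context surrounding an image of the rodent sense.
context = "cute pet mouse for sale cage food rodent breeder"

base = entry_distribution([definition])
expanded = entry_distribution([definition] + expansion)
base_cov = coverage(base, context)
expanded_cov = coverage(expanded, context)
```

Even after adding hyponym and hypernym text, only two of the nine context tokens are covered here, which is the gap the latent topic space is meant to bridge.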
WISDOM: Probabilistic dictionary-based model
• Main idea:
– Using LDA, learn latent sense-like dimensions on a large amount of related text,
– Model dictionary senses in LDA space:
• Map image contexts to topics• Map topics to senses
unlabeled text
LDA
WISDOM sense model
• Given a query word with sense s with values in set {1,…,S}, and a text document d, the probability of sense is
[Figure: graphical model with nodes d, z, s and plate N]
• Define the likelihood of topic z given sense s with entry words e_s = w_1,…,w_{E_s} as
• To compute probability of sense given topic
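The slide's equations did not survive extraction, so the following is only a sketch of how the three quantities could fit together. It assumes the sense posterior marginalizes over topics, P(s|d) = Σ_z P(s|z) P(z|d), that P(s|z) follows from P(z|s) by Bayes' rule, and, as a stand-in for the missing definition, that P(z|s) is the normalized average of P(w|z) over the entry words. All names and numbers are hypothetical.

```python
def topic_given_sense(entry_words, word_given_topic):
    """Assumed form of P(z|s): average P(w|z) over entry words, normalized over z."""
    scores = [sum(p_w.get(w, 0.0) for w in entry_words) / len(entry_words)
              for p_w in word_given_topic]
    total = sum(scores)
    return [x / total for x in scores]

def sense_given_topic(p_z_given_s, prior):
    """Bayes' rule: P(s|z) is proportional to P(z|s) P(s)."""
    n_topics = len(next(iter(p_z_given_s.values())))
    result = []
    for z in range(n_topics):
        joint = {s: p_z_given_s[s][z] * prior[s] for s in p_z_given_s}
        norm = sum(joint.values())
        result.append({s: v / norm for s, v in joint.items()})
    return result  # result[z][s] = P(s | z)

def sense_given_document(p_s_given_z, p_z_given_d):
    """Marginalize over topics: P(s|d) = sum_z P(s|z) P(z|d)."""
    senses = p_s_given_z[0].keys()
    return {s: sum(p_s_given_z[z][s] * p_z_given_d[z]
                   for z in range(len(p_z_given_d))) for s in senses}

# Toy topic-word probabilities P(w|z) for two topics (illustrative only).
word_given_topic = [
    {"wrist": 0.5, "strap": 0.3, "timepiece": 0.1, "duty": 0.1},
    {"guard": 0.5, "duty": 0.4, "wrist": 0.1},
]
# Entry words for two senses of "watch", abbreviated from the dictionary glosses.
entries = {
    "timepiece": ["small", "portable", "timepiece", "wrist"],
    "vigil": ["surveillance", "guard", "duty"],
}
prior = {"timepiece": 0.5, "vigil": 0.5}

p_z_given_s = {s: topic_given_sense(ws, word_given_topic) for s, ws in entries.items()}
p_s_given_z = sense_given_topic(p_z_given_s, prior)
# A document whose inferred topic mixture P(z|d) leans toward topic 0.
p_s_given_d = sense_given_document(p_s_given_z, [0.8, 0.2])
```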
WISDOM: Incorporating Image Features
• Use LDA to discover visual topics v=1,…,L,
• Then estimate the conditional probability P(s|v)
• Given a test image di*, we can compute
• Combine contributions of image and text:
WISDOM classifier
SVM classifier
training images
dictionary definitions
unlabeled text
dictionary model P( sense | data)
noun
web images
fosil wrist watch a (800 x 628, 107k, jpg, amgmedia.com)
watch-1 (ticker)
Evaluation datasets
• Collected by querying Image Search
– MIT-ISD: bass, face, mouse, speaker, watch
– MIT-OFFICE: cellphone, fork, hammer, keyboard, mug, pliers, scissors, stapler, telephone, watch
– UIUC-ISD: bass, crane, squash
Image labels: core, related, unrelated, ???
Experimental Setup
1. Task: Image sense disambiguation (ISD) in search results
– Separate images according to visual sense
– “core” labels are positive class, “related” and “unrelated” negative
– Metrics: true positives vs. false positives (ROC), recall-precision curve (RPC)
2. Task: object classification in a novel image
– Classify image as having correct object category or not
– “core” labels are positive class, other keyword’s “core” senses are negative class
– Metric: percent correct
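As a rough illustration of the two ranking metrics, the sketch below computes ROC and recall-precision points from classifier scores and binary labels. The scores and labels are made up, and ties between scores are not handled specially.

```python
def ranked_labels(scores, labels):
    """Sort binary labels by descending classifier score."""
    return [label for _, label in sorted(zip(scores, labels), key=lambda p: -p[0])]

def roc_points(scores, labels):
    """(false positive rate, true positive rate) after each ranked item."""
    ranked = ranked_labels(scores, labels)
    n_pos = sum(ranked)
    n_neg = len(ranked) - n_pos
    tp = fp = 0
    points = [(0.0, 0.0)]
    for label in ranked:
        tp += label
        fp += 1 - label
        points.append((fp / n_neg, tp / n_pos))
    return points

def rpc_points(scores, labels):
    """(recall, precision) after each ranked item."""
    ranked = ranked_labels(scores, labels)
    n_pos = sum(ranked)
    tp = 0
    points = []
    for rank, label in enumerate(ranked, start=1):
        tp += label
        points.append((tp / n_pos, tp / rank))
    return points

# Hypothetical scores; 1 = "core" image, 0 = "related" or "unrelated".
scores = [0.9, 0.8, 0.6, 0.4, 0.2]
labels = [1, 1, 0, 1, 0]
roc = roc_points(scores, labels)
rpc = rpc_points(scores, labels)
```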
ISD example results
squash: sports
squash: vegetable
bass: musical instrument
bass: fish
bass: raw web image data
squash: raw web image data
ISD Results: ROC using each WordNet sense for BASS
[Figure: ROC curves for BASS; x-axis: false positive rate; y-axis: true positive rate; curves: yahoo; musical range; polyph. range; male singer; sea bass; freshwater bass; basso, voice; instrument; spiny fish]
ISD Results: RPC using true sense
Retrieval of core senses on UIUC-ISD (curves: yahoo, wisdom)
Results: object classification
• Baseline approach:
– Automatically generate sense-specific keywords from WordNet
– Append word to synonyms and direct hypernyms
– Limit queries to 3 terms
– Example: mouse + computer, mouse + electronic device
• Plot shows average accuracy across five objects in the MIT-ISD dataset (each is a two-class problem with chance performance of 50%)
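One way to read the baseline's query construction is sketched below: pair the keyword with each synonym or direct hypernym, drop the keyword where it already occurs in the related term, and discard queries longer than three terms. The function name and the hand-copied term lists are assumptions; the baseline reads them from WordNet.

```python
def sense_queries(word, synonyms, hypernyms, max_terms=3):
    """Build sense-specific queries: the word plus a synonym or direct hypernym.

    Occurrences of the word inside a related term are dropped, so
    "computer mouse" contributes only "computer". Queries with more than
    max_terms terms are discarded.
    """
    queries = []
    for related in synonyms + hypernyms:
        extra = [t for t in related.split() if t != word]
        if not extra:
            continue  # the related term is just the word itself
        terms = [word] + extra
        query = " ".join(terms)
        if len(terms) <= max_terms and query not in queries:
            queries.append(query)
    return queries

# Hand-copied terms for the computer sense of "mouse" (illustrative input).
synonyms = ["mouse", "computer mouse"]
hypernyms = ["electronic device"]
queries = sense_queries("mouse", synonyms, hypernyms)
```

For this input the sketch yields the two example queries from the slide, "mouse computer" and "mouse electronic device".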
[Figure: accuracy (55% to 85%) vs. number of training images (50 to 300); curves: baseline, wisdom]
Next
• Audio-visual object recognition
– Related work
– Fusion model and experiments
• Unsupervised text and image models
– Related work
– WISDOM: probabilistic dictionary-based image sense model
– Concrete WISDOM: identifying tangible objects*
*see Saenko and Darrell, Filtering Abstract Senses From Image Search Results, NIPS 2009.
Query Word: “cup”
Online Dictionary: search for “cup” (Noun)
• cup (a small open container usually used for drinking; usually has a handle) “he put the cup back in the saucer”; “the handle of the cup was missing”
• cup, loving cup (a large metal vessel with two handles that is awarded as a trophy to the winner of a competition) “the school kept the cups in a special glass case”
• a major sporting event or competition “the world cup”, “the Stanley cup”
Concrete WISDOM
Object Sense: drinking container
Abstract Sense: sporting event
Object Sense: loving cup (trophy)
Removing Abstract Senses
How can we identify abstract senses?
• Idea: use the ontological information available via WordNet
– semantic relations between concepts (hypernym, part, etc.)
– lexical tags:
Mouse: Noun
• <noun.animal> S: (n) mouse (any of numerous small rodents typically resembling diminutive rats having pointed snouts and small ears on elongated bodies with slender usually hairless tails)
• <noun.state> S: (n) shiner, black eye, mouse (a swollen bruise caused by a blow to the eye)
• <noun.person> S: (n) mouse (person who is quiet or timid)
• <noun.artifact> S: (n) mouse, computer mouse (a hand-operated electronic device that controls the coordinates of a cursor on your computer screen…)
[Figure: WordNet hierarchy fragment with nodes mouse, rodent, beaver, mammal, cow, …]
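Filtering by lexical tag could look like the sketch below. The tag names are WordNet lexicographer categories, but the choice of which tags count as concrete is an assumption made here for illustration, and the sense records are a hand-built stand-in for a WordNet lookup.

```python
# Assumption: these lexicographer tags mark physical, picturable senses.
CONCRETE_TAGS = {
    "noun.animal", "noun.artifact", "noun.body",
    "noun.food", "noun.object", "noun.plant",
}

def concrete_senses(senses, concrete_tags=CONCRETE_TAGS):
    """Keep only the senses whose lexical tag is in the concrete set."""
    return [s for s in senses if s["tag"] in concrete_tags]

# Senses of "mouse" with their tags, abbreviated from the entry above.
mouse_senses = [
    {"tag": "noun.animal", "gloss": "small rodent"},
    {"tag": "noun.state", "gloss": "a swollen bruise caused by a blow to the eye"},
    {"tag": "noun.person", "gloss": "person who is quiet or timid"},
    {"tag": "noun.artifact", "gloss": "hand-operated electronic device"},
]
kept = concrete_senses(mouse_senses)
kept_tags = [s["tag"] for s in kept]
```

Under this assumed tag set, the rodent and the computer device are kept, while the bruise and the timid person are filtered out.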
Experimental Setup
Table: Concrete Senses Identified by WISDOM
• Task: ISD using concrete-sense WISDOM
– all “core” and “related” labels of keyword are positive class, “unrelated” labels are negative class
Results: Filtering visual senses
Yahoo Search: “telephone”
DICTIONARY
1: (n) telephone, phone, telephone set (electronic equipment that converts sound into electrical signals that can be transmitted over distances and then converts received signals back into sounds)
2: (n) telephone, telephony (transmitting speech at a distance)
Results: Filtering visual senses
Artifact sense: “telephone”
DICTIONARY
1: (n) telephone, phone, telephone set (electronic equipment that converts sound into electrical signals that can be transmitted over distances and then converts received signals back into sounds)
2: (n) telephone, telephony (transmitting speech at a distance)
Results: RPC of all concrete senses
Retrieval of core+related concrete senses on UIUC-ISD
yahoo wisdom
Further Improvement: Topic adaptation
• Original LDA topics are learned on text-only unlabeled data
• Adapt to image-text data via semi-supervised Gibbs sampling
• E.g.: one of “fork” topics:
product bike null tool tube seal set price oil knife spoon spring ship use item accessory handle shop order remove store custom home weight steel supply cap clamp fit false . . .
cutlery knife spoon product set price handle steel tool item stainless null bike tube seal oil knive kitchen utensil ship order use table spring supply design piece carve weight shop . . .
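The slides do not spell out the adaptation procedure, so the following is a generic stand-in rather than the thesis method: a collapsed Gibbs sampler for LDA whose topic-word counts start from the original topics (treated as pseudo-counts) and are then updated by resampling topic assignments on the new documents. The function name, hyperparameters, and toy data are all assumptions.

```python
import random

def adapt_topics(docs, prior_counts, vocab_size, alpha=0.5, beta=0.1, sweeps=20, seed=0):
    """Collapsed Gibbs sampling for LDA, started from existing topic-word counts.

    docs: list of documents, each a list of word ids.
    prior_counts[k][w]: pseudo-counts carried over from the original topics.
    Returns the topic-word counts after resampling assignments on docs.
    """
    rng = random.Random(seed)
    n_topics = len(prior_counts)
    topic_word = [row[:] for row in prior_counts]
    topic_total = [sum(row) for row in topic_word]
    doc_topic = [[0] * n_topics for _ in docs]
    assign = []
    # Random initial topic for every word token in the new documents.
    for d, doc in enumerate(docs):
        zs = []
        for w in doc:
            z = rng.randrange(n_topics)
            zs.append(z)
            doc_topic[d][z] += 1
            topic_word[z][w] += 1
            topic_total[z] += 1
        assign.append(zs)
    for _ in range(sweeps):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                z = assign[d][i]
                # Remove the token's current assignment from the counts.
                doc_topic[d][z] -= 1
                topic_word[z][w] -= 1
                topic_total[z] -= 1
                # P(z=k | rest) is proportional to
                # (n_dk + alpha) * (n_kw + beta) / (n_k + V * beta).
                weights = [
                    (doc_topic[d][k] + alpha) * (topic_word[k][w] + beta)
                    / (topic_total[k] + vocab_size * beta)
                    for k in range(n_topics)
                ]
                z = rng.choices(range(n_topics), weights=weights, k=1)[0]
                assign[d][i] = z
                doc_topic[d][z] += 1
                topic_word[z][w] += 1
                topic_total[z] += 1
    return topic_word

# Word ids: 0 = knife, 1 = spoon, 2 = bike, 3 = tube (toy vocabulary).
prior_counts = [[5, 5, 5, 5], [1, 1, 1, 1]]  # hypothetical original topics
docs = [[0, 1, 0, 1, 1], [2, 3, 2, 3], [0, 1, 1, 0]]
adapted = adapt_topics(docs, prior_counts, vocab_size=4)
```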
“fork”: using original topics
unrelated: fork lift, road fork, bike fork, knife…
“fork”: using adapted topics
unrelated: fork lift, road fork, bike fork, knife…
Results on MIT-OFFICE
• The average area under the RPC improves from 0.47 to 0.57
• Detailed RPCs:
yahoo wisdom
Conclusions
• Showed that combining speech with image input may be advantageous for object recognition
• Presented WISDOM, an unsupervised method to learn sense-specific object models from images and text harvested from the web
• Extended WISDOM to filter out non-physical word senses based on WordNet semantic structure
Future work: WISDOM-enabled interactive training
speech text
image
dictionary
supervised classifier
WISDOM