Towards recognition of everyday objects
Prof. Trevor Darrell
Where we are…
• Great progress in categorical object recognition in the computer vision community – advances on Caltech-101, PASCAL, etc.
• Great progress in robotic sensing, esp. mapping, navigation, etc. – SLAM, Grand Challenge, etc.
• Yet broad, category-level robotic object recognition in general environments is still nearly nonexistent!
CV: faces and instances
Aachen Cathedral
snaptell.com
like.com
Robotics: Mapping, Navigation…
A modest proposal…
• A robot that can recognize / find every object in my office / kitchen / kid’s playroom?
– without requiring a grad student to collect multi-view training data of each object?
• I don’t think we are on the path to solve this with conventional SIFT + fancy kernels + supervised learning…
– even with LabelMe, ImageNet, MechTurk…
• It will likely involve:
– multiple sensing modalities (“views”) and semi-supervised learning (both manifold learning and co-training flavors)
– local features that respect the physics of image formation
– active learning at training, and attentive learning at test…
– limited “natural” interaction with a user
Computer Vision vs. Robotic Vision: Divergent Paradigms?
Computer Vision:
– Machine Learning paradigm
– Whoever has the largest dataset wins.
– Leads to a least common denominator in terms of features to use
→ category recognition focus; weak features…
Robotic Vision:
– Sensing paradigm; sensors are cheap; add them! …3-D sensing, multi-spectral, ultra-high-res…
– Whoever has the most sensors wins…
– But then we can only get training data from our environment (in situ)!
→ strong features; instance recognition focus…
Which one is right?
• Both!
• Neither!
• Key technical problems for next-generation robotic visual recognition systems: how to…
– bridge category- and instance-level learning
– fully leverage scene and task context
– simultaneously exploit labeled online data and unlabeled or weakly labeled in-situ data
• Robotics will drive next generation of object recognition challenges….
Rough evolution of visual object recognition research
1970s/80s   1990s   2000s   2010+
Get my bag…
?
• hierarchical labels
• scene/task context
• multimodal
• interactive
• robotic…
• Fusing multiple cues and discovering shared representations across categories…
• Visual Sense Disambiguation…
• Transparent local features…
Recent Progress: Combining Features, Overcoming Ambiguity
Multiple Cues and Context
Many Local Representations…
Wide variety of proposed local feature representations:
• Superpixels [Ren et al.]
• Shape context [Belongie et al.]
• Maximally Stable Extremal Regions [Matas et al.]
• Geometric Blur [Berg et al.]
• SIFT [Lowe]
• Salient regions [Kadir et al.]
• Harris-Affine [Schmid et al.]
• Spin images [Johnson and Hebert]
How to Compare Sets of Features?
• Each feature type yields a set of vectors per image
• Unordered, varying number of vectors per instance
Pyramid Match
An efficient approximation to the optimal partial matching between unordered feature sets:
• Optimal matching
• Greedy matching
• Pyramid match
[Grauman and Darrell, ICCV 2005; JMLR 2007]
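A toy sketch of the pyramid match idea (an illustrative reimplementation, not the authors' code): intersect multi-resolution histograms of the two feature sets, and weight matches formed at finer levels more heavily, which approximates the optimal partial matching without solving an assignment problem.

```python
# Pyramid match kernel sketch: histogram intersections at multiple
# resolutions; new matches at finer (smaller-cell) levels count more.
import numpy as np

def pyramid_match(X, Y, num_levels=4, d_max=1.0):
    """X, Y: (n, d) arrays of features scaled into [0, d_max). Returns a match score."""
    score, prev_matches = 0.0, 0.0
    for level in range(num_levels):
        cell = d_max / (2 ** (num_levels - 1 - level))  # finest bins first
        hx, hy = {}, {}
        for h, data in ((hx, X), (hy, Y)):
            for f in data:
                key = tuple((f // cell).astype(int))
                h[key] = h.get(key, 0) + 1
        # histogram intersection counts matches formed up to this resolution
        matches = sum(min(hx[k], hy.get(k, 0)) for k in hx)
        new_matches = matches - prev_matches
        score += new_matches / (2 ** level)  # coarser new matches get less weight
        prev_matches = matches
    return score
```

Identical sets match perfectly at the finest level, so the score equals the set size; disjoint sets only match at coarse levels and are discounted.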
Cue Combination
• Feature / kernel combination is a very active topic
– SVM MKL schemes (Varma et al.)
– cross-validation approaches (recent ICCV papers…)
– Naïve Bayes Nearest Neighbor schemes (Irani)
• Our Gaussian Process formulation has a significant efficiency advantage in the 1-vs.-all setting
– most of the computation is inherently shared across categories
• Good news: significant accuracy improvement!
• Bad news: combination methods all perform about the same, at least on Caltech datasets…
– but combination does help when there are non-informative kernels…
[See http://www.eecs.berkeley.edu/Pubs/TechRpts/2009/EECS-2009-96.html for details]
[Kapoor, Urtasun, Grauman, Darrell, IJCV, to appear]
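As a minimal point of reference (not the Gaussian Process formulation cited above, just the simplest combination baseline): average the per-cue kernel matrices after normalizing their scales. The combined matrix is computed once and can then be shared by every one-vs.-all classifier, which is the efficiency point made on this slide.

```python
# Baseline cue combination: trace-normalize each cue's kernel matrix,
# then take a weighted average. Function names are illustrative.
import numpy as np

def combine_kernels(kernels, weights=None):
    """kernels: list of (n, n) Gram matrices, one per feature channel."""
    weights = weights or [1.0 / len(kernels)] * len(kernels)
    combined = np.zeros_like(kernels[0], dtype=float)
    for w, K in zip(weights, kernels):
        combined += w * (K / np.trace(K) * K.shape[0])  # equalize kernel scale
    return combined
```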
Combining 2D and 3D
• How to best exploit 3-D sensors?
– very hard to get good local shape estimates
• Exploit 3-D sensing as context for 2-D recognition
• 3-D scene context
– 3D can provide a summary of the overall environment
– find support surfaces
• Estimate and exploit absolute size constraints
– model size variation of the overall object and local patches based on training data or external knowledge
Indoor Support surfaces
Table surface extraction from 3-D scene data:
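An illustrative sketch of how such a support plane could be extracted (an assumption-laden stand-in, not the implementation behind this slide): RANSAC over triples of 3-D points, keeping the plane with the most inliers.

```python
# RANSAC plane fit: sample 3 points, form a plane, count points within
# `threshold` of it, keep the best model (e.g. a table top in a point cloud).
import numpy as np

def fit_dominant_plane(points, n_iters=200, threshold=0.02, rng=None):
    """points: (n, 3) array. Returns (normal, d) with normal·p + d ≈ 0 for inliers."""
    rng = rng or np.random.default_rng(0)
    best_inliers, best_model = 0, None
    for _ in range(n_iters):
        p0, p1, p2 = points[rng.choice(len(points), 3, replace=False)]
        normal = np.cross(p1 - p0, p2 - p0)
        norm = np.linalg.norm(normal)
        if norm < 1e-9:
            continue  # degenerate (collinear) sample
        normal /= norm
        d = -normal @ p0
        inliers = np.sum(np.abs(points @ normal + d) < threshold)
        if inliers > best_inliers:
            best_inliers, best_model = inliers, (normal, d)
    return best_model
```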
Indoor search constraints
• Use surface and size constraints…
Notion of Absolute Size
• Absolute feature size
• Absolute object size
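The absolute-size constraint can be illustrated with basic pinhole geometry (a toy example with made-up numbers, not the slide's model): given a calibrated focal length and a depth estimate, an image extent in pixels maps to a metric extent, so a size prior over a category can veto implausible detections.

```python
# Pinhole-camera size check: metric_extent = pixel_extent * depth / focal_px.
def absolute_size(pixel_extent, depth_m, focal_px):
    """Metric extent (m) of something spanning `pixel_extent` px at `depth_m` meters."""
    return pixel_extent * depth_m / focal_px

def plausible(pixel_extent, depth_m, focal_px, size_range_m):
    """True if the implied metric size falls inside the category's size prior."""
    lo, hi = size_range_m
    return lo <= absolute_size(pixel_extent, depth_m, focal_px) <= hi
```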
Branch & Bound for Fast Detection
[Lampert]
• Feature computation
• Codebook matching
• Discriminative weights
• Sliding window
• Detections
Object Detection
• Exhaustive search is costly, in particular for 3-D
• Previous work on branch & bound provides a speed-up for 2-D
• Use upper bounds on bounding-box intervals
[Figure: branch & bound over sets of candidate boxes – starting from all possible boxes, coordinate intervals (e.g. 2–4 and 3–5) are split into tighter ones (e.g. 2–2.5 and 3–4), and branches with low upper bounds are pruned.]
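A compact sketch of the 2-D branch & bound idea in the style of Lampert's efficient subwindow search (a simplified reimplementation, not the 3-D extension discussed here): a set of boxes is described by one interval per coordinate; the upper bound gives all positive mass to the largest box in the set and charges the negative mass of the smallest; the widest interval is split until a single box is popped, which is then provably optimal.

```python
# Branch & bound box search over per-pixel scores, with O(1) bounds
# from integral images.
import heapq
import numpy as np

def ess(scores):
    """scores: (H, W) per-pixel weights. Returns (best_value, (top, bottom, left, right))."""
    H, W = scores.shape
    ipos = np.maximum(scores, 0).cumsum(0).cumsum(1)  # integral image, positive part
    ineg = np.minimum(scores, 0).cumsum(0).cumsum(1)  # integral image, negative part

    def box_sum(I, t, b, l, r):  # inclusive box sum via an integral image
        s = I[b, r]
        if t > 0:
            s -= I[t - 1, r]
        if l > 0:
            s -= I[b, l - 1]
        if t > 0 and l > 0:
            s += I[t - 1, l - 1]
        return s

    def bound(t, b, l, r):
        # largest box in the set collects all positive mass; the smallest
        # box (if non-empty) must keep its negative mass
        ub = box_sum(ipos, t[0], b[1], l[0], r[1])
        if t[1] <= b[0] and l[1] <= r[0]:
            ub += box_sum(ineg, t[1], b[0], l[1], r[0])
        return ub

    start = ((0, H - 1), (0, H - 1), (0, W - 1), (0, W - 1))
    heap = [(-bound(*start), start)]
    while heap:
        ub, ivals = heapq.heappop(heap)
        widths = [hi - lo for lo, hi in ivals]
        if max(widths) == 0:  # fully specified box: bound is exact, hence optimal
            t, b, l, r = ivals
            return -ub, (t[0], b[0], l[0], r[0])
        i = int(np.argmax(widths))  # split the widest coordinate interval
        lo, hi = ivals[i]
        mid = (lo + hi) // 2
        for half in ((lo, mid), (mid + 1, hi)):
            cand = list(ivals)
            cand[i] = half
            cand = tuple(cand)
            # keep only sets whose largest box is non-empty
            if cand[0][0] <= cand[1][1] and cand[2][0] <= cand[3][1]:
                heapq.heappush(heap, (-bound(*cand), cand))
```

Because the bound never underestimates any box in a set, the first fully specified box popped from the priority queue is the global optimum, usually after examining a small fraction of all windows.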
Current Detection Demo Results
Joint Category Learning
Standard “1 vs. all” paradigm….
SVM/GPC – Category 1
SVM/GPC – Category 2
SVM/GPC – Category 3
SVM/GPC – Category 4
SVM/GPC – Category 256
…
SVM/GPC – Category 10,000?
How to exploit shared structure?
Consider ensemble of classifiers
SVM/GPC – Category 1 w1
SVM/GPC – Category 2 w2
SVM/GPC – Category 3 w3
SVM/GPC – Category 4 w4
SVM/GPC – Category 256
…
SVM/GPC – Category 10,000   w10,000
classifier weights
Consider ensemble of classifiers
SVM/GPC – Category 1 w1
SVM/GPC – Category 2 w2
SVM/GPC – Category 256
…
SVM/GPC – Category n   wn
W = [ w1  w2  …  wn ]
Related tasks and/or object part structure will lead to correlated patterns in W…
[Quattoni, Collins, Darrell, CVPR 2007] explore Ando+Zhang-style structure learning for scene recognition tasks.
Learn W jointly? [Quattoni, Collins, Darrell, CVPR 2008] explore joint sparse optimization via a matrix-norm penalty.
[Quattoni, Carreras, Collins, Darrell, ICML 2009] report an efficient learning scheme for this approach…
Joint Sparse Approximation
• Consider learning a single sparse linear classifier of the form: $f(x) = \mathbf{w} \cdot x$
That is, we want only a few features with non-zero coefficients.
• L1 regularization is well known to yield sparse solutions:
$\min_{\mathbf{w}} \sum_{(x,y) \in D} l(f(x), y) + C \sum_{j=1}^{d} |w_j|$
The first term is the classification error; the L1 term penalizes non-sparse solutions.
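A concrete illustration of why the L1 term induces sparsity (a least-squares surrogate solved by iterative soft-thresholding, not the hinge-loss formulation used in the cited work): the proximal step zeroes out every coefficient whose gradient update stays below the threshold.

```python
# ISTA for min_w 0.5*||Xw - y||^2 + C*||w||_1: gradient step on the smooth
# part, then soft-thresholding (the proximal operator of the L1 penalty).
import numpy as np

def soft_threshold(w, tau):
    return np.sign(w) * np.maximum(np.abs(w) - tau, 0.0)

def lasso_ista(X, y, C=1.0, n_iters=500):
    step = 1.0 / np.linalg.norm(X, 2) ** 2  # 1 / Lipschitz constant of the gradient
    w = np.zeros(X.shape[1])
    for _ in range(n_iters):
        grad = X.T @ (X @ w - y)
        w = soft_threshold(w - step * grad, step * C)  # zeroes small coordinates
    return w
```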
Joint Sparse Approximation
Optimization over several tasks jointly, with a linear classifier $f_k(x) = \mathbf{w}_k \cdot x$ per task:
$\min_{\mathbf{w}_1, \ldots, \mathbf{w}_m} \sum_{k=1}^{m} \frac{1}{|D_k|} \sum_{(x,y) \in D_k} l(f_k(x), y) + C\, R(\mathbf{w}_1, \ldots, \mathbf{w}_m)$
The first term is the average loss on training set $k$; the regularizer $R$ penalizes solutions that utilize too many features.
Key idea: use a matrix norm… [Obozinski et al. 2006, Argyriou et al. 2006, Amit et al. 2007]
Joint Regularization Penalty
How do we penalize solutions that use too many features? Collect the per-task weight vectors as the columns of a matrix
$W = \begin{bmatrix} W_{1,1} & W_{1,2} & \cdots & W_{1,m} \\ W_{2,1} & W_{2,2} & \cdots & W_{2,m} \\ \vdots & \vdots & & \vdots \\ W_{d,1} & W_{d,2} & \cdots & W_{d,m} \end{bmatrix}$
so that row $i$ holds the coefficients for feature $i$ across classifiers and column $k$ holds the coefficients for classifier $k$.
The natural penalty $R(W) = \#(\text{non-zero rows of } W)$ would lead to a hard combinatorial problem.
Joint Regularization Penalty
We use an L1-∞ norm [Tropp 2006]:
$R(W) = \sum_{i=1}^{d} \max_k |W_{i,k}|$
This norm combines:
• an L1 norm on the maximum absolute values of the coefficients across tasks, which promotes sparsity → use few features
• an L∞ norm on each row, which promotes non-sparsity within rows → share features
The combination of the two norms results in a solution where only a few features are used, but the features that are used contribute to solving many classification problems.
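The penalty itself is a one-liner, shown here directly from its definition (an illustrative computation, not the optimizer from the cited papers):

```python
# L1-inf penalty: L-infinity across tasks within each feature row,
# summed (L1) over feature rows.
import numpy as np

def l1_inf(W):
    """W: (d, m) matrix — d features, m tasks. R(W) = sum_i max_k |W[i, k]|."""
    return np.abs(W).max(axis=1).sum()
```

Note the sharing behavior: once a feature row is active, adding another task's coefficient of equal magnitude to that row costs nothing extra, whereas activating a new row always increases the penalty.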
Joint Sparse Approximation
Using the L1-∞ norm we can rewrite our objective function as:
$\min_W \sum_{k=1}^{m} \frac{1}{|D_k|} \sum_{(x,y) \in D_k} l(f_k(x), y) + C \sum_{i=1}^{d} \max_k |W_{i,k}|$
For any convex loss this is a convex objective.
For the hinge loss the optimization problem can be expressed as a linear program. [Quattoni et al. CVPR 2008]
See also [Quattoni et al ICML 2009] for efficient large scale solutions.
News Image Classification Experiments
[Figure: Reuters dataset results – mean EER vs. number of training examples per task (15 to 240), comparing L2, L1, and L1-∞ regularization.]
[Figure: absolute weight matrices (feature × task) under L1 and under L1-∞ regularization, for tasks such as Super Bowl, Danish cartoons, Sharon, Australian Open, trapped coal miners, Golden Globes, Grammys, figure skating, Academy Awards, and Iraq.]
Visual Sense Disambiguation
Goal: Object recognition in situated environments
• Imagine using natural dialogue to instantiate object models in a robot
That’s a cat over there…
This is one of my purses.
There’s a lamp…
Speech, image can be complementary…
a pan...
That’s a pen!
Copy machine..
ant → fan, face → bass, piano → cannon
Using image to aid speech recognition
Object recognition
Experiments on Caltech101
• Asked users to speak the object name, added noise
• Showed benefit from fusion at all noise levels
ant → fan, face → bass, piano → cannon
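A hedged sketch of the fusion step (the distributions, labels, and weighting are made up for illustration; the actual system's model is described in the cited NIPS papers): combine the speech recognizer's posterior over object names with the visual classifier's posterior log-linearly, so that the image can rescue acoustically confusable words like "ant" vs. "fan".

```python
# Late log-linear fusion of two posteriors over the same label set.
import numpy as np

def fuse(p_speech, p_image, alpha=0.5):
    """alpha weights the speech stream; renormalizes to a proper distribution."""
    log_p = alpha * np.log(p_speech + 1e-12) + (1 - alpha) * np.log(p_image + 1e-12)
    p = np.exp(log_p - log_p.max())  # subtract max for numerical stability
    return p / p.sum()
```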
Large Object Vocabulary?
Object recognition
Large Object Vocabulary?
Category Discovery: “Watch”…
Sources of visual polysemy
• Would rather watch…
• Suicide watch
• Hurricane, tornado watch
• Watch out!
• Celebrity watch
Taking advantage of text contexts
icrystal rfid wrist watch features watch masterpiece innovative watch making craftsmanship absolute precision fine charm high scratch resistance anti-allergenic characteristics make chronometer true jewel s wrist water proof sleek stylish wrist watch solar powered available watch ticket key purse identity card special offer place order rfid wrist watch absolutely free rfid watch black wrist strap rfid watch orange wrist strap rfid watch stainless steel privacy disclaimer copyright icrystal pty website
Dictionary model
• Use entry text to learn a probability distribution over words for that sense
• Problem: entries contain very little text
– Expand by adding synonyms, hyponyms, 1st-level hypernyms
– Still, very few words are covered!
• S: (n) mouse (any of numerous small rodents typically resembling diminutive rats having pointed snouts and small ears on elongated bodies with slender usually hairless tails)
direct hyponym / full hyponym:
• S: (n) house mouse, Mus musculus (brownish-grey Old World mouse now a common household pest worldwide)
• S: (n) harvest mouse, Micromyx minutus (small reddish-brown Eurasian mouse inhabiting e.g. cornfields)
• S: (n) field mouse, fieldmouse (any nocturnal Old World mouse of the genus Apodemus inhabiting woods and fields and gardens)
• S: (n) nude mouse (a mouse with a genetic defect that prevents them from growing hair and also prevents them from immunologically rejecting human cells and tissues; widely used in preclinical trials)
• S: (n) wood mouse (any of various New World woodland mice)
direct hypernym / inherited hypernym / sister term:
• S: (n) rodent, gnawer (relatively small placental mammals having a single pair of constantly growing incisor teeth specialized for gnawing)
Topic space
Idea: use a large collection of unlabeled text to learn hidden topics which align with different senses/uses of the word
live mice pet rodent ear price pest house bait old animal human tube need tail species gene head breed body love color care friend wood cat weight white water . . .
Learning visual senses: overview
Search Engine Watch – Search Engine Watch is the authoritative guide to search engine marketing (SEM) and search engine optimization (SEO), offering the latest news about search ... searchenginewatch.com/ - 38k - Cached - Similar pages - Note this
watch - MDC – Watches for assignment to a property named prop in this object, calling handler(prop, oldval, newval) whenever prop is set and storing the return value in ... developer.mozilla.org/en/Core_JavaScript_1.5_Reference/Global_Objects/Object/watch - 30k - Cached - Similar pages - Note this
[Figure: pipeline – unlabeled text → latent topic space; dictionary definitions → dictionary model P(sense | page); unlabeled images+text → training images → visual sense classifier]
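An illustrative sketch of the dictionary-model scoring step (toy glosses and a bare word-overlap score, standing in for the probability distribution the pipeline actually learns): each retrieved page is compared against a per-sense word distribution built from the dictionary entry text.

```python
# Score each sense of a query word for a page by normalized word overlap
# with that sense's gloss-derived word list.
from collections import Counter

def sense_scores(page_words, sense_glosses):
    """page_words: list of tokens; sense_glosses: {sense_name: list of gloss tokens}."""
    page = Counter(page_words)
    scores = {}
    for sense, gloss in sense_glosses.items():
        gloss_counts = Counter(gloss)
        overlap = sum(min(page[w], gloss_counts[w]) for w in gloss_counts)
        scores[sense] = overlap / max(sum(gloss_counts.values()), 1)
    return scores
```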
Clustering + Word Sense Model
Search word: “cup” – senses from an online dictionary:
• Object sense (drinking container): cup (a small open container usually used for drinking; usually has a handle) "he put the cup back in the saucer"; "the handle of the cup was missing"
• Object sense (loving cup / trophy): cup, loving cup (a large metal vessel with two handles that is awarded as a trophy to the winner of a competition) "the school kept the cups in a special glass case"
• Abstract sense (sporting event): a major sporting event or competition – "the world cup", "the Stanley cup"
Filtering Abstract Senses
Concrete vs. abstract senses
Mouse: Noun
• <noun.animal> S: (n) mouse (any of numerous small rodents typically resembling diminutive rats having pointed snouts and small ears on elongated bodies with slender usually hairless tails)
• <noun.state> S: (n) shiner, black eye, mouse (a swollen bruise caused by a blow to the eye)
• <noun.person> S: (n) mouse (person who is quiet or timid)
• <noun.artifact> S: (n) mouse, computer mouse (a hand-operated electronic device that controls the coordinates of a cursor on your computer screen as you move it around on a pad; on the bottom of the device is a ball that rolls on the surface of the pad) "a mouse takes much more room than a trackball"
• How can we determine if a sense is concrete or abstract?
– Use a natural language processing method to learn a classifier
– Use existing dictionary information: e.g. WordNet’s lexical file tags
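A toy version of the lexical-file-tag filter (the tag set below mirrors WordNet's lexicographer file convention, but the sense data here is hand-written for illustration): keep only senses whose tag marks a physical, depictable entity.

```python
# Filter word senses by WordNet-style lexicographer file tags:
# artifact/animal/etc. are likely depictable; state/person/act are not.
CONCRETE_TAGS = {"noun.artifact", "noun.animal", "noun.object", "noun.food",
                 "noun.plant", "noun.body", "noun.substance"}

def filter_visual_senses(senses):
    """senses: list of (gloss, lexfile_tag) pairs. Returns the likely-visual glosses."""
    return [gloss for gloss, tag in senses if tag in CONCRETE_TAGS]
```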
Filtering visual senses
Yahoo Search: “fork” – DICTIONARY
1: (n) fork (cutlery used for serving and eating food)
2: (n) branching, ramification, fork, forking (the act of branching out or dividing into branches)
3: (n) fork, crotch (the region of the angle formed by the junction of two branches) "they took the south fork"; "he climbed into the crotch of a tree"
4: (n) fork (an agricultural tool used for lifting or digging; has a handle and metal prongs)
5: (n) crotch, fork (the angle formed by the inner sides of the legs where they join the human trunk)
Filtering visual senses
Artifact sense of “fork” – DICTIONARY
1: (n) fork (cutlery used for serving and eating food)
2: (n) branching, ramification, fork, forking (the act of branching out or dividing into branches)
3: (n) fork, crotch (the region of the angle formed by the junction of two branches) "they took the south fork"; "he climbed into the crotch of a tree"
4: (n) fork (an agricultural tool used for lifting or digging; has a handle and metal prongs)
5: (n) crotch, fork (the angle formed by the inner sides of the legs where they join the human trunk)
Filtering visual senses
Yahoo Search: “telephone” DICTIONARY
1: (n) telephone, phone, telephone set (electronic equipment that converts sound into electrical signals that can be transmitted over distances and then converts received signals back into sounds)
2: (n) telephone, telephony (transmitting speech at a distance)
Filtering visual senses
Artifact sense: “telephone” DICTIONARY
1: (n) telephone, phone, telephone set (electronic equipment that converts sound into electrical signals that can be transmitted over distances and then converts received signals back into sounds)
2: (n) telephone, telephony (transmitting speech at a distance)
Topic adaptation
• Original LDA topics are learned on text-only unlabeled data
• Adapt to image-text data via semi-supervised Gibbs sampling
• E.g., one of the “fork” topics:
Before: product bike null tool tube seal set price oil knife spoon spring ship use item accessory handle shop order remove store custom home weight steel supply cap clamp fit false . . .
After: cutlery knife spoon product set price handle steel tool item stainless null bike tube seal oil knive kitchen utensil ship order use table spring supply design piece carve weight shop . . .
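A crude stand-in for the semi-supervised adaptation step (a simple count-blending heuristic, not the Gibbs sampler used in the actual work): mix a topic's original word distribution with word counts from in-domain, image-associated text, so kitchen words rise in the "fork" topic.

```python
# Blend a topic's word distribution with in-domain word counts.
from collections import Counter

def adapt_topic(topic_probs, in_domain_words, blend=0.5):
    """topic_probs: {word: prob}; in_domain_words: list of tokens from image pages."""
    counts = Counter(in_domain_words)
    total = sum(counts.values())
    words = set(topic_probs) | set(counts)
    adapted = {w: (1 - blend) * topic_probs.get(w, 0.0)
                  + blend * counts[w] / total for w in words}
    norm = sum(adapted.values())
    return {w: p / norm for w, p in adapted.items()}  # renormalize
```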
“fork”: using original topics
fork lift, road fork, bike fork, etc.
“fork”: using adapted topics
knife, spoon, cutlery, bike fork, etc.
Work in progress…
• Showed that combining speech and image classifiers improves object reference resolution
• Proposed an unsupervised method to learn sense-specific object models from web text and image data
• Integrated large-scale demo forthcoming
• See Kate’s NIPS 2009 & 2008 papers…
Transparent Local Features
Dealing with Transparency
Motivation
• Transparent objects made out of glass or plastic are ubiquitous in domestic environments
• Traditional local feature approach inappropriate
• Full physical model intractable
Local Additive Feature Model
• Significant variation in patch appearance
• … but common latent structure
→ new LDA-SIFT model
LDA-SIFT
Transparent Visual Words
• For each patch we infer the latent mixture activations that characterize the additive structure
• We model the glass by learning a spatial layout of discrete “transparent local feature” activations
Transparent Visual Words
[Figure: latent components, their average occurrence maps on training data, and occurrences on test data]
Recognition Architecture
[Figure: LDA topic model over patch variables (X, Y, T) feeding a glass vs. background classifier]
Results: general vocabulary
• Training on 4 different glasses in front of screen
• Testing on 49 glass instances in home environment
• Sliding-window linear SVM detection
Recognition Architecture
[Figure: supervised LDA (sLDA) over patch variables (X, Y, T), coupling the latent topics to the glass vs. background label]
Results: sLDA
• Training on 4 different glasses in front of screen
• Testing on 49 glass instances in home environment
• Sliding-window linear SVM detection
Conclusion
• Traditional local feature models (VQ, NN) are poorly suited for transparent object recognition
• Proposed additive local feature models can detect superimposed image structures
• Developed a statistical approach to learn such representations using probabilistic topic models
• Sparse factorization of local gradient statistics
• Encouraging results on real-world data
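The "sparse factorization of local gradient statistics" idea can be illustrated with a plain non-negative matrix factorization (a stand-in for the topic-model inference actually used): SIFT-like histograms are decomposed into additive latent components, so a patch seen through glass can be explained as a superposition.

```python
# Lee-Seung multiplicative-update NMF: factor non-negative V (n, d) as
# H @ W with H (n, k) >= 0 (activations) and W (k, d) >= 0 (components).
import numpy as np

def nmf(V, k, n_iters=500, rng=None):
    rng = rng or np.random.default_rng(0)
    n, d = V.shape
    H = rng.uniform(0.1, 1.0, (n, k))
    W = rng.uniform(0.1, 1.0, (k, d))
    for _ in range(n_iters):
        # multiplicative updates for squared Frobenius reconstruction error;
        # positivity is preserved automatically
        H *= (V @ W.T) / (H @ W @ W.T + 1e-9)
        W *= (H.T @ V) / (H.T @ H @ W + 1e-9)
    return H, W
```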
Future Work
• Different feature representations; extend model in hierarchical fashion
• Investigate addition of material property cues; discriminative inverse local light transport models
• Explore benefits for opaque object recognition; understand relationship to sparse image coding as well as to biologically motivated models
• Fusing multiple cues and discovering shared representations across categories…
• Visual Sense Disambiguation…
• Transparent local features…
Recent Progress: Combining Features, Overcoming Ambiguity
For more information…
• Probabilistic multi-kernel fusion – Christoudias, Urtasun, Darrell, CVPR 2009
• Joint regularization across categories – Quattoni, Carreras, Collins, Darrell, ICML 2009
• Machine learning for multimodal sense grounding – Saenko and Darrell, NIPS 2008, NIPS 2009
• Local feature models for transparent objects – Fritz, Bradski, Black, and Darrell, NIPS 2009