Weakly-supervised learning
Cordelia Schmid
Weakly-supervised learning – motivation
Massive and ever-growing amount of digital image and video content
– Flickr and YouTube
– Audiovisual archives (BBC, INA)
– Personal collections
Comes with meta-data
– Text, audio, user click data, …
Meta-data is a sparse and noisy, yet rich and diverse source of annotation
Weakly-supervised learning – motivation
Object detection
Weakly supervised large-scale learning
Action recognition
Overview
• Multi-fold MIL for weakly-supervised learning from images
• Unsupervised learning from images based on matching
• Weakly-supervised learning from videos with motion segmentation
Weakly-supervised learning for images
• Given a set of images with positive and negative labels, determine the object region and learn a detector
• Avoids costly annotation of object regions
Our approach – descriptors
• Extract selective search regions [Uijlings et al., ICCV'13]
• Regions described with high-dimensional Fisher vectors or CNNs
• Images labeled as positive or negative
Standard MIL (Fisher vectors)
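Standard MIL training alternates between re-localization (picking the top-scoring window in each positive image) and re-training the detector on the picked windows. A minimal sketch, assuming a hypothetical `train` callable that fits a linear window scorer from labeled feature vectors:

```python
# Standard MIL alternation for weakly supervised localization (sketch).
# Each image is a "bag" of candidate windows, each window a feature vector;
# positive images contain the object in some (unknown) window.

def dot(w, x):
    return sum(wi * xi for wi, xi in zip(w, x))

def standard_mil(pos_bags, neg_windows, train, n_iters=10):
    """pos_bags: list of bags, each a list of feature vectors.
    neg_windows: feature vectors from negative images.
    train: hypothetical callable (positives, negatives) -> weight vector."""
    # Initialize by training on a proxy window per positive bag
    # (here simply the first window; the real system uses full-image features).
    w = train([bag[0] for bag in pos_bags], neg_windows)
    selected = None
    for _ in range(n_iters):
        # Re-localization: pick the highest-scoring window in each positive bag.
        selected = [max(bag, key=lambda x: dot(w, x)) for bag in pos_bags]
        # Re-training: refit the detector on the selected windows.
        w = train(selected, neg_windows)
    return w, selected
```

Note the weakness this talk addresses: the detector scores the same windows it was trained on, so it tends to re-select its own training windows.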
Multi-fold MIL
[Cinbis, Verbeek & Schmid, Multi-fold MIL for WS object localization, CVPR'14]
Multi-fold MIL
Avoids relocalization bias, since the windows used for training and evaluation are different
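The multi-fold re-localization step can be sketched as follows: the positives are split into k folds, and each fold's windows are scored by a detector trained only on the other folds. `train` is again a hypothetical callable fitting a linear window scorer, and the round-robin fold split is an illustrative choice:

```python
def dot(w, x):
    return sum(wi * xi for wi, xi in zip(w, x))

def multifold_relocalize(pos_bags, neg_windows, train, k=3):
    """Multi-fold MIL re-localization (sketch): windows in each fold are
    re-localized with a detector that never saw them at training time,
    avoiding degenerate self-selection."""
    n = len(pos_bags)
    folds = [list(range(i, n, k)) for i in range(k)]  # round-robin split
    selected = [None] * n
    for f in range(k):
        # Train on positives outside fold f (first window as an init proxy).
        train_windows = [pos_bags[i][0]
                         for g in range(k) if g != f for i in folds[g]]
        w = train(train_windows, neg_windows)
        # Re-localize the held-out bags with the held-out detector.
        for i in folds[f]:
            selected[i] = max(pos_bags[i], key=lambda x: dot(w, x))
    return selected
```

In the full algorithm this re-localization and a re-training step alternate, exactly as in standard MIL.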
Comparing standard and multi-fold
Performance over iterations (Fisher vectors)
Our approach: multi-fold training for MIL
Localization examples
Failure cases
Refinement of selected boxes
Window refinement by local search to align windows with contours [Edge boxes: locating object proposals from edges, Zitnick & Dollár, ECCV'14]
Refinement of selected boxes
• Locally refine the top 10 scoring boxes with the "edgebox" score
• "Edgebox" score: encourages alignment with long contours, discourages contours straddling the window
• Final score: "edgebox" score + "selection" score
[Cinbis, Verbeek & Schmid, WS Object Localization with Multi-fold MIL, arXiv’15]
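The refinement step can be sketched as a greedy local search over box coordinates maximizing the combined score. `edgebox_score` and `selection_score` are stand-ins for the real contour-alignment and MIL selection scores, which the sketch takes as given callables:

```python
def refine_box(box, selection_score, edgebox_score, steps=20, delta=2):
    """Greedy local refinement (sketch): perturb each box coordinate by
    +/- delta and keep moves that improve edgebox + selection score.
    box = (x1, y1, x2, y2)."""
    def total(b):
        return edgebox_score(b) + selection_score(b)

    best, best_s = box, total(box)
    for _ in range(steps):
        improved = False
        for dim in range(4):          # try moving each coordinate
            for d in (-delta, delta):
                cand = list(best)
                cand[dim] += d
                cand = tuple(cand)
                # Keep only valid boxes that strictly improve the score.
                if cand[0] < cand[2] and cand[1] < cand[3] and total(cand) > best_s:
                    best, best_s = cand, total(cand)
                    improved = True
        if not improved:
            break                     # local maximum reached
    return best
```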
Comparison to the state of the art
Comparison to the state of the art
Summary and future work
• State-of-the-art results for WS localization
• Further improve “initial” and “selected” windows
• Update the CNN features (fine-tuning)
• Dealing with noisy or missing image labels (e.g. Google image download)
Overview
• Multi-fold MIL for weakly-supervised learning from images
• Unsupervised learning from images based on matching
• Weakly-supervised learning from videos with motion segmentation
How much supervision for localization?
Strong: positives + bounding boxes, negatives
Weak: positives, negatives
Very weak: positives only
None
Object detection (Leibe et al.'08; Felzenszwalb et al.'10; Girshick et al.'14)
Weakly supervised localization (Chum'07; Pandey'11; Deselaers'12; Siva'12; Shi'13; Cinbis'14; Wang'14)
Co-segmentation/localization (Rother'06; Russell'06; Joulin'10; Kim'11; Vicente'11; Joulin'14; Tang'14)
Unsupervised discovery (Grauman & Darrell'05; Sivic et al.'05,'08; Kim et al.'05,'09)
Supervision
Correspondence
(Russell et al.’06; Cho et al.’10; Rubinstein & Joulin’13; Rubio et al.’13)
Our approach
• Correspondences
  – as a substitute for supervision
  – between parts and objects
  – picked from bottom-up segmentation proposals
  – and k-nearest-neighbor images
• How?
  – Probabilistic Hough matching
  – Stand-out scoring of part hierarchies
[Cho, Kwak, Schmid & Ponce, Unsupervised Object Discovery and Localization in the Wild: Part-based Matching with Bottom-up Region Proposals, CVPR'15]
Finding parts and objects among region candidates
Here: bottom-up segmentation proposals (Manen et al.'13, Uijlings et al.'13) and HOG descriptors (Dalal & Triggs'05)
Matching model – Probabilistic Hough matching

m: match, d: data, c: configuration

• Probabilistic model
  P(m | d) = Σ_c P(m | c, d) P(c | d)
           = P(m_a | d) Σ_c P(m_g | c) P(c | d)
  (m_a: appearance term, m_g: geometry term)

• Probabilistic Hough transform
  P(c | d) ≈ H(c | d) = Σ_m P(m | c, d) = Σ_m P(m_g | c) P(m_a | d)
  (Hough'59; Ballard'81; Stephens'91; Leibe et al.'04; Maji & Malik'09; Barinova et al.'12)

• Region confidence
  C(r' | [d', d'']) = max_r'' P((r', r'') | [d', d''])

• Two images -> multiple images
  C_d'(r') = Σ_d'' C(r' | [d', d''])

Appearance only vs. PHM (matching examples)
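A toy sketch of the Hough-matching idea under simplifying assumptions (configurations quantized to a grid of 2-D offsets, a 0/1 geometry term, appearance likelihoods given): matches vote for a common configuration, and the accumulated Hough votes feed back to rescore each individual match.

```python
from collections import defaultdict

def hough_matching(matches):
    """matches: list of (p_a, offset) where p_a = appearance likelihood
    P(m_a | d) and offset = quantized displacement the match votes for.
    Returns a rescored confidence P(m | d) per match (sketch)."""
    # Hough transform: H(c | d) = sum_m P(m_g | c) P(m_a | d).
    # Toy geometry term: P(m_g | c) = 1 iff the match votes exactly for c.
    hough = defaultdict(float)
    for p_a, offset in matches:
        hough[offset] += p_a
    # Rescoring: P(m | d) = P(m_a | d) * sum_c P(m_g | c) H(c | d),
    # which reduces to p_a * hough[offset] under the toy geometry term.
    return [p_a * hough[offset] for p_a, offset in matches]
```

Matches that agree on a common offset reinforce each other, while geometrically isolated matches keep low scores even with good appearance.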
Stand-out scoring of part hierarchies

• Object regions should contain
  – more foreground than part regions
  – less background than larger regions

• S(r) = C(r) − max_{r' ⊃ r} C(r')
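A direct sketch of the stand-out score over a region hierarchy: a region stands out if its confidence exceeds that of every region containing it. `confidence` and `contains` are assumed given (e.g. PHM confidences and box containment):

```python
def stand_out_scores(regions, confidence, contains):
    """S(r) = C(r) - max over regions r' strictly containing r of C(r').
    regions: list of region ids; confidence: dict id -> C(r);
    contains: callable (a, b) -> True iff region a strictly contains b."""
    scores = {}
    for r in regions:
        # Confidences of all regions that contain r (its "backgrounds").
        parents = [confidence[p] for p in regions if p != r and contains(p, r)]
        scores[r] = confidence[r] - (max(parents) if parents else 0.0)
    return scores
```

Parts inside a confident object get penalized by the object's confidence, and whole-image regions with low confidence never dominate, which is what pushes the top score toward object-level regions.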
An iterative algorithm – iteration 1
Retrieve 10 NN with GIST (Oliva & Torralba’06)
An iterative algorithm – iteration 1
Probabilistic Hough Matching with 10 NN
An iterative algorithm – iteration 1
Localize top 5 scoring windows with stand-out score
An iterative algorithm – iteration 1
For all images: localize the 5 top-scoring windows
An iterative algorithm – next iterations
Exploit selected regions: retrieve 10 NN using PHM with the top-confidence regions
An iterative algorithm – next iterations
Probabilistic Hough Matching with 10 NN
An iterative algorithm – next iterations
Localize 5 top scoring windows per image with stand-out score
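The iterative algorithm above can be sketched as a loop; `retrieve`, `match_score` and `standout` are hypothetical callables standing in for GIST/PHM retrieval, Hough matching, and stand-out scoring, and images are assumed to expose their region proposals via a `regions` attribute:

```python
def discover(images, retrieve, match_score, standout, n_iters=5, k_nn=10, top=5):
    """Iterative unsupervised discovery (sketch): retrieve neighbors,
    match regions, keep the top stand-out windows, repeat."""
    selections = {img: None for img in images}
    for _ in range(n_iters):
        new_selections = {}
        for img in images:
            # Iteration 1 uses generic retrieval (e.g. GIST); later
            # iterations retrieve neighbors with the selected regions.
            neighbors = retrieve(img, selections, k_nn)
            # Score candidate regions by matching against the neighbors,
            # then keep the top windows under the stand-out score.
            scored = [(standout(match_score(r, neighbors)), r)
                      for r in img.regions]
            scored.sort(key=lambda t: -t[0])
            new_selections[img] = [r for _, r in scored[:top]]
        selections = new_selections
    return selections
```

Localization and retrieval improve jointly: better selected regions give better neighbors, which in turn give better matches.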
Localization improvement over iterations
Results shown after iterations 1 and 3
Retrieval improvement over iterations
1st iteration
5th iteration
Comparative evaluation

Two benchmarks:
• Object discovery dataset (Rubinstein et al.'13)
  – Subset of 300 images from 3 classes
  – Includes between 7 and 18 outliers per class
• Pascal'07 – all (Everingham et al.'07)
  – 4548 images from 20 classes
  – From the train/val set, minus difficult/truncated images
Computing time: < 1h for 500 images on 10-core desktop
Performance metrics:
• CorLoc: percentage of boxes such that intersection/union > 0.5
• CorRet: percentage of the 10 retrieved NNs in the same class as the query
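A minimal sketch of the localization metric: CorLoc reduces to an intersection-over-union test per image (CorRet, not shown, is the analogous fraction over retrieved neighbors):

```python
def iou(a, b):
    """Intersection-over-union of axis-aligned boxes (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    union = area(a) + area(b) - inter
    return inter / union if union > 0 else 0.0

def corloc(predicted, ground_truth, thresh=0.5):
    """CorLoc: fraction of images whose predicted box overlaps the
    ground-truth box with IoU above the threshold."""
    hits = sum(1 for p, g in zip(predicted, ground_truth) if iou(p, g) > thresh)
    return hits / len(predicted)
```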
Experimental results: Object discovery dataset

CorLoc – separate classes
CorLoc / CorRet – mixed classes without labels
Examples – mixed classes without labels
Experimental results: Pascal’07 – all
CorLoc – separate classes
CorLoc and CorRet – mixed classes without labels
Experimental results: Pascal’07 – all
Examples – mixed classes without labels: successes
Experimental results: Pascal’07 – all
Examples – mixed classes without labels: failures
Discussion and future work
• Effective method for object discovery and localization in challenging unlabeled scenarios
• No use of saliency or objectness measures
• No use of negative examples or pretrained features
• Next:
  – Image categorization and object detection
  – Handling multiple objects per image
Overview
• Multi-fold MIL for weakly-supervised learning from images
• Unsupervised learning from images based on matching
• Weakly-supervised learning from videos with motion segmentation
Learning from videos
• Easier to separate object from background → reduces the need for bounding-box annotation
• A video shows a range of variations for an object → easier to learn multi-view, articulation, illumination
• Many frames, easy to access (e.g. YouTube) → lots of extra data!
Automatic extraction of objects
[Prest, Leistner, Civera, Schmid & Ferrari, Learning object detectors from weakly annotated video, CVPR’12]
Automatic extraction of objects
Data collection
1. Pick 10 moving classes from PASCAL VOC
2. Collect 9–24 videos per class from YouTube (~500 shots per class)
3. Shot change detection chunks the videos into shots
• Total: 0.57 million frames
• Video-level labels only (i.e. some shots without the class are kept)
Step 1: localize object tubes

Candidate tubes
Dense point tracks
[N. Sundaram et al., Dense point trajectories by GPU-accelerated large displacement optical flow, ECCV 2010]
Candidate tubes

Motion segmentation
Selecting tubes

• Jointly select one tube per shot by energy minimization
Selecting tubes

• Unary potential: homogeneity within a tube, location prior
• Pairwise potential: similarity between tubes from different shots, based on appearance descriptors (BOW, HOG) extracted for a fixed number of frames per tube
• Find the states minimizing the sum of potentials; inference with TRW-S [V. Kolmogorov, Convergent tree-reweighted message passing for energy minimization, PAMI'06]
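A sketch of the joint selection objective: one label (tube) per shot, unary costs per tube, pairwise costs between tubes of different shots. For a toy number of shots the energy can be minimized exhaustively, whereas the actual system uses TRW-S message passing on the fully connected graph. `unary` and `pairwise` are assumed given (e.g. from tube homogeneity/location prior and appearance dissimilarity):

```python
from itertools import product

def select_tubes(unary, pairwise):
    """unary[s][t]: cost of picking tube t in shot s.
    pairwise(s1, t1, s2, t2): cost between tubes of two different shots.
    Exhaustive minimization over all joint assignments (sketch only;
    exponential in the number of shots)."""
    n = len(unary)
    best, best_e = None, float('inf')
    for assign in product(*[range(len(u)) for u in unary]):
        # Sum of unary potentials for the chosen tubes.
        e = sum(unary[s][assign[s]] for s in range(n))
        # Sum of pairwise potentials over all shot pairs.
        e += sum(pairwise(s1, assign[s1], s2, assign[s2])
                 for s1 in range(n) for s2 in range(s1 + 1, n))
        if e < best_e:
            best, best_e = assign, e
    return best, best_e
```

The pairwise term is what lets shots agree on a common appearance: a tube that looks like the tubes selected in other shots is preferred even when its unary cost is not the lowest.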
Selecting tubes
Motion segments → candidate tubes → automatically selected tube
Over-segmentation
Wrong tube selection
Heavy occlusion and lack of motion
Experiments: tube quality
10 object classes that move from PASCAL• aeroplane, bird, boat, car, cat, cow, dog, horse, motorbike, train• ~500 shots per class from YouTube
Evaluate on 100–290 manually annotated frames per class (1407 in total)
• Performance = detection rate as defined in Pascal (IoU > 0.5)
Detection rate (%):
• Best segment: 34.8
• Automatically selected: 28.5
+ Auto selection picks the best available tube 80% of the time
− Motion segmentation is far from perfect (the best tube covers 35% of objects)
Train detector
Localize object tubes
Experiments: detection in PASCAL VOC
Test on the Pascal 2007 test set
• 4952 test images; multiple classes per image
• Many variations: scale, viewpoint, illumination, occlusion, intra-class
Standard Pascal evaluation protocol
DPM object detector [Felzenszwalb 2010]
Experiments: detection in PASCAL VOC
mAP with a DPM detector:
• Image GT: 31.6
• Video CVPR'12 seg: 15.2
About the same number of training instances (500/class)
• Only about half the mAP! Big gap!
Experiments: detection in PASCAL VOC

mAP with a DPM detector:
• Image GT: 31.6
• Video CVPR'12 seg: 15.2
• Video ICCV'13 seg: 17.1
• Video GT: 18.6
• Induced by VOC: 20.5
• Image GT + Video induced by VOC: 33.6
• Adding video data to Image GT: +2% mAP
• But only with some domain adaptation; otherwise negative transfer
Training + testing object detectors

Video / Still images / Combination
Still images from PASCAL VOC 2007
Summary and discussion

• Video: significantly more training data
• Motion as an additional cue
• Improve extraction of spatio-temporal tubes
• Domain shift factors need to be investigated
• Construction of more complete models