Visual Recognition in the Wild: Image Retrieval, Faces, and Text
EECVC 2016, Odessa, Ukraine
James Pritts
Center for Machine Perception
Czech Technical University in Prague
Who are we?
Filip Radenović (Image Retrieval), PhD candidate
Lukáš Neumann (Text Detection), PhD candidate
James Pritts (Repeated Patterns), PhD candidate
Michal Bušta (Text Detection), Researcher
Jiří Matas, Professor
Vojtěch Franc, Assistant Professor
Ondřej Chum, Associate Professor
Jan Čech, Senior Researcher
Goals of the talk
Demo robust and working systems: state-of-the-art performance (or nearly); real-time operation for facial landmarks and text
"In the wild" means images taken unconstrained
Present selected applied research from CMP
Machine learning is integral to vision (even in pipelines)
Rigorous analysis in the related publications
Part 1: Image Retrieval 2.0
Filip Radenović (Image Retrieval), PhD candidate
Jiří Matas, Professor
Ondřej Chum, Associate Professor
Retrieval Tasks
1.0: Standard image retrieval problems: visually most similar; all visually similar
2.0: Beyond similarity retrieval: new (unseen) information. What/where is this? What is interesting here? Where should I look?
2.1: Image retrieval for 3D reconstruction
2.0: Speak about advances that go beyond standard image retrieval
Standard Image Retrieval Evaluation
Average Precision (AP): area under the precision-recall curve
[Precision-recall plot; precision and recall axes from 0 to 1]
Query: 10 database images, 5 relevant images. Ranking: [example ranking on slide]
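As a small illustration of the evaluation above, average precision can be computed directly from a ranked list; the ranking and the relevant set below are made-up toy data matching the slide's 10-image, 5-relevant setup.

```python
# Average Precision (AP): the area under the precision-recall curve,
# computed as the mean of precision values at the ranks where relevant
# images appear in the ranked result list.

def average_precision(ranking, relevant):
    relevant = set(relevant)
    hits, precisions = 0, []
    for rank, image_id in enumerate(ranking, start=1):
        if image_id in relevant:
            hits += 1
            precisions.append(hits / rank)
    return sum(precisions) / len(relevant) if relevant else 0.0

# Perfect ranking: all 5 relevant images come first, so AP = 1.0.
ap_perfect = average_precision([1, 2, 3, 4, 5, 6, 7, 8, 9, 10], {1, 2, 3, 4, 5})
```

A worse ranking interleaving irrelevant images lowers every precision value and therefore the area under the curve.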
Is this what we want?
Visually most similar: results identical to the query for large datasets
All visually similar: output of varying length; ground truth hard to obtain
Users won't look at tens of near-duplicate images!
Show Google Images of golden gate bridge: NO
Bag of Words: Off-line Stage
Feature descriptor instead of SIFT
Bag of Words Image Model
[Figure: images mapped to a visual vocabulary A, B, C, D, with per-image word counts]
An image is represented by the histogram of detected visual words
Term frequency (tf): e.g. visual word D occurs twice in the image
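A minimal sketch of the bag-of-words model above: local features are quantized to visual words, and the image is represented by their term-frequency histogram. The vocabulary A-D and the detected words are illustrative.

```python
from collections import Counter

# Bag-of-words image representation: count how often each visual word
# of the vocabulary was detected in the image.

def bow_histogram(visual_words, vocabulary):
    counts = Counter(visual_words)
    return [counts[w] for w in vocabulary]

# Word "D" is detected twice, as in the slide's tf example.
hist = bow_histogram(["A", "D", "C", "D", "B"], ["A", "B", "C", "D"])
```

In a real system the histogram would be tf-idf weighted and stored sparsely via an inverted file.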
Bag of Words: On-line Stage
IN: query q (BoW + geometries)
1. Inverted file: posting list per visual word
2. Image ranking (BoW score per image ID)
3. Spatial verification (#inliers and zoom per shortlisted image)
4. Re-ranked shortlist (top N images)
5. Query expansion (query combined with verified images)
OUT: ranked list R
Re-rank top-ranked images (removing false positives) with RANSAC
NOTE: Standard BoW score ranking is performed without geometric information
IMPORTANT: Geometric verification is crucial for query expansion
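A deliberately simplified sketch of RANSAC-style spatial verification: each sampled tentative correspondence hypothesizes a pure translation, and the hypothesis with the most inliers gives the verification score. Real systems estimate an affine or similarity transform from local feature geometry; the correspondences below are toy data.

```python
import random

# Translation-only RANSAC verifier: sample a correspondence, hypothesize
# the translation it implies, and count how many other correspondences
# agree with that translation within a tolerance.

def count_inliers(correspondences, tol=2.0, seed=0):
    rng = random.Random(seed)
    best = 0
    for _ in range(min(50, len(correspondences))):
        (qx, qy), (dx, dy) = rng.choice(correspondences)
        tx, ty = dx - qx, dy - qy  # hypothesized translation
        inliers = sum(
            1 for (ax, ay), (bx, by) in correspondences
            if abs(ax + tx - bx) <= tol and abs(ay + ty - by) <= tol
        )
        best = max(best, inliers)
    return best

# Eight correspondences related by translation (5, 3), plus two outliers.
inlier_pairs = [((float(i), float(2 * i)), (float(i + 5), float(2 * i + 3)))
                for i in range(8)]
outlier_pairs = [((0.0, 0.0), (100.0, 100.0)), ((1.0, 1.0), (-50.0, 7.0))]
score = count_inliers(inlier_pairs + outlier_pairs)
```

The inlier count is exactly the "#inliers" used to re-rank the shortlist.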
Query Expansion
Query image → results → spatial verification → new query → new results
Chum, Philbin, Sivic, Isard, Zisserman: Total Recall, ICCV 2007
QE is important to create a sequence of images, a path from query to result
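A minimal average-query-expansion sketch in the spirit of Total Recall (Chum et al., ICCV 2007): the BoW vectors of spatially verified results are averaged with the original query to form a new, richer query. The vectors below are toy assumptions, not real tf-idf data.

```python
# Average query expansion: combine the query's BoW vector with the
# vectors of spatially verified results, then re-query with the mean.

def expand_query(query, verified_results):
    vectors = [query] + list(verified_results)
    n = len(vectors)
    return [sum(v[i] for v in vectors) / n for i in range(len(query))]

# Two verified results contribute words the query itself lacked.
expanded = expand_query([1, 0, 0, 1], [[1, 1, 0, 1], [1, 0, 1, 1]])
```

The expanded vector now has non-zero weight on words seen only in the verified results, which is what lets QE build a path from the query to distant results.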
Query Expansion: Step by Step
Query image, retrieved image, originally not retrieved
Query Expansion: Step by Step
Query Expansion: Step by Step
Retrieval Tasks
1.0: Standard image retrieval problems: visually most similar; all visually similar
2.0: Beyond similarity retrieval: new (unseen) information. What/where is this? What is interesting here? Where should I look?
2.1: Image retrieval for 3D reconstruction
2.0: Speak about advances that go beyond standard image retrieval
CMP Image Retrieval 2.0 Live Demo
Other Retrieval Problems
What is this?
…and what is that? Let's zoom in!
Different Retrieval Problems
Query 1 and Query 2. Top: visually most similar. Bottom: zoom-in.
Mikulík, Chum, Matas: Image Retrieval for Online Browsing in Large Image Collections, SISAP 2013
Put slide 32 after this
Standard Retrieval and Details
[Figure: queries and the retrieval ranks of matching images; DIFFICULT vs. EASY cases]
Ask Ondra why this image looks like this
Zoom-in: On-line Stage
IN: query q (BoW + geometries)
1. Inverted file: posting list per visual word
2. Image ranking: geometry compressed in the inverted file is taken into account during scoring; problem-specific ranking function, e.g. maximize scale change
3. Spatial verification (#inliers and zoom per image)
4. Re-ranked shortlist (top N images)
5. Query expansion from already zoomed images
OUT: R
Zoom-out: On-line Stage
IN: query q (BoW + geometries)
1. Inverted file: posting list per visual word
2. Image ranking: geometry compressed in the inverted file is taken into account during scoring; problem-specific ranking function, e.g. maximize scale change
3. Spatial verification (#inliers and zoom per image; zoom factors below 1 for zoom-out)
4. Re-ranked shortlist (top N images)
5. Query expansion from already zoomed images
OUT: R
Context: what you see in the retrieved image is as small as possible and surrounded by new information
Zoom-in: Example
?
Zoom-in: Query Expansion
Zoom-in: Example
Zoom-in: Query Expansion
Zoom-in: Query Expansion
Zoom-out: Iterate
Zoom-out: Iterate
Zoom-out: Iterate
Do the reverse in the demo
What is interesting here?
The must-sees!
TASK: For any pixel in the query, find the frequency of close-ups
Most interesting: detail size 0-1%, 1-3%, 3-10%
Mikulík, Radenović, Chum, Matas: Efficient Image Detail Mining, ACCV 2014
Tourists tend to frame objects in their photographs
Highest Resolution Transform
TASK: For any pixel in the query, find the max-resolution image containing the pixel
[Examples with zoom factors 37.3×, 27.0×, 22.8×, 21.9×, 21.6×]
Mikulík, Radenović, Chum, Matas: Efficient Image Detail Mining, ACCV 2014
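A toy version of the highest-resolution transform described above: each verified result covers a rectangle of the query at some zoom factor, and every query pixel records the maximum zoom of any image containing it. The rectangles and zoom values are illustrative, not from the paper's data.

```python
import numpy as np

# Per-pixel max-zoom map over the query image: start at 1x (the query
# covers itself) and take elementwise maxima over the verified results'
# covered rectangles.

def highest_resolution_transform(shape, results):
    hrt = np.ones(shape)
    for zoom, (r0, r1, c0, c1) in results:
        region = hrt[r0:r1, c0:c1]
        np.maximum(region, zoom, out=region)  # in-place max on the view
    return hrt

# Two hypothetical verified close-ups over a tiny 4x4 query.
hrt = highest_resolution_transform(
    (4, 4), [(5.0, (0, 2, 0, 2)), (20.0, (1, 3, 1, 3))]
)
```

Thresholding this map by detail size gives the "most interesting" regions of the previous slide.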
All Details: On-line Stage
IN: query q (BoW + geometries)
1. Inverted file: posting list per visual word
2. Image ranking
3. Spatial verification (#inliers and zoom per image)
4. Re-ranked shortlist (top N images)
5. Query expansion
OUT: R
All Details: Hierarchical Query Expansion
IN: shortlist R (BoW + geometries)
1. Grouped images: group G1, ..., group Gn
2. Geometric consistency: the transforms Aq,i, Aq,j, and Aj,i between the query q and images i, j must be mutually consistent
OUT: expanded queries q1, q2, ..., qn (e.g. one from image 1573 and image 45, another from image 1761 and image 33)
Retrieval Tasks
1.0: Standard image retrieval problems: visually most similar; all visually similar
2.0: Beyond similarity retrieval: new (unseen) information. What/where is this? What is interesting here? Where should I look?
2.1: Image retrieval for 3D reconstruction
2.0: Speak about advances that go beyond standard image retrieval
Structure-from-Motion 3D Reconstruction
Thousands of images: exhaustive matching of all image pairs [Snavely, Seitz, Szeliski: Photo Tourism, SIGGRAPH 2006]
+ fine details are reconstructed
- infeasible for large photo collections
Millions of images: matching images through standard image retrieval [Heinly, Schönberger, Dunn, Frahm: Reconstructing the World in Six Days, CVPR 2015]
+ efficient and scalable image matching
- details are not reconstructed
Retrieval for 3D Reconstruction
Visually most similar search: many near-duplicates; details lost
Zoom-in and details search: details retrieved; transition images to match the details
Zoom-out search: viewpoint change; more context
Sideways crawl: significant viewpoint change; more context
Schönberger, Radenović, Chum, Frahm: From Single Image Query to Detailed 3D Reconstruction, CVPR 2015
Sideways image crawl
Schönberger, Radenović, Chum, Frahm: From Single Image Query to Detailed 3D Reconstruction, CVPR 2015
Sideways crawl: On-line Stage
IN: query q (BoW + geometries)
1. Inverted file: posting list per visual word
2. Image ranking
3. Spatial verification
4. Re-ranked shortlist (top N images)
5. Query expansion: using geometry to find adequate features for expansion (left/right); building an expanded query using only sideways features
OUT: R
We could estimate the full two-view geometry for each pair of image results to get the relative camera pose, but that would come at a high cost. So we look for an approximate geometric constraint that does the same job: features on the right in the results matching features on the left in the query.
Sideways Left: Step by Step
Sideways Left: Step by Step
Retrieval for 3D Reconstruction
Schönberger, Radenović, Chum, Frahm: From Single Image Query to Detailed 3D Reconstruction, CVPR 2015
Summary
Visually most similar
Zoom-in / details
Zoom-out
Sideways right
Image retrieval is not just about efficient search for the most visually similar images: what / where is this; what is to the right / left; analysis of details; detailed 3D reconstruction
The use of geometry, context, and query expansion significantly enhances performance and the set of problems that are solvable
Can we go further? Temporal constraints: day / night, summer / winter
Illumination constraints
During the day most of the structure is visible, but at night much of it is not. So illumination changes but structure doesn't.
Part 2: 3D Facial Landmarks
Jiří Matas, Professor
Vojtěch Franc, Assistant Professor
Jan Čech, Senior Researcher
Čech, Franc, Uřičář, Matas: Multi-view facial landmark detection by using a 3D shape model, Image and Vision Computing 2016, volume 47
Face detection and facial landmarks
Robust face detection: viewpoint invariant
Facial landmarks are salient keypoints; accurate detection is critical for tasks such as gender, age, and identity recognition; real-time processing is needed
William (M34), Kate (F28)
True age of both in the photo: 29. Both William and Kate were born in 1982; the wedding was in 2011.
Landmarks and head-pose estimation
Input: an image (arbitrary viewpoint of a face)
Output: facial landmarks (up to 51 landmarks); 6-DOF head pose estimate; 3D face reconstruction
Landmark and pose estimation pipeline
Initialization of 3D shape and pose, then refinement of 3D shape and pose
[Pipeline figure: face detector → bounding box → image normalization (roll) → 2D DPM (yaw) → pose solver (initialization R0, t0 from the mean shape X0) → 3D optimizer → refined pose R*, t* and landmarks x*, with landmark visibility V]
Face detector
The face detector is robust to viewpoint changes and provides rough estimates of roll and yaw
Based on WaldBoost [Šochman, CVPR 2005]; weak classifiers are Haar-like features
Commercial detector by Eyedea Recognition, Ltd., http://eyedea.cz
2D landmark detection
Image normalization: bounding box axis-aligned using roll and scaled to 80×60 px
2D Deformable Part Model (DPM): one of 5 DPM models is selected using the yaw estimate; the selected model defines the set of visible landmarks V
2D Deformable Part Models
MRF on a tree; 2D landmark location model; globally optimal solution
Loss function: [equation on slide]
Local landmark classifiers
Each 2D landmark is learned independently
LBP computed at every position in the ROI, over 3 scales of the ROI: 20×20, 10×10, 5×5
LBPs are concatenated to construct a high-dimensional sparse binary descriptor: 256 · (18² + 8² + 3²) = 101,632 dimensions, with 18² + 8² + 3² = 397 non-zero elements
We construct the feature descriptor, a function of position x and image I, by concatenating the Local Binary Patterns (a 256-valued code assigned to each 3×3 patch) computed at all positions of the cropped patch normalized to 20×20, 10×10, and 5×5 pixels, respectively. The resulting sparse binary descriptor is to some extent invariant to scale and lighting conditions. The side of the cropped square patch is 0.3 of the bounding-box side returned by the face detector.
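A sketch of the multi-scale LBP descriptor described above: an 8-bit LBP code is computed at every 3×3 position of the patch at each of the three scales, and each code is one-hot encoded into 256 bins. The random patches stand in for real normalized face crops.

```python
import numpy as np

# 8-bit LBP code per 3x3 window: compare each of the 8 neighbours
# against the window's center pixel and pack the results into a byte.

def lbp_codes(patch):
    h, w = patch.shape
    codes = np.zeros((h - 2, w - 2), dtype=np.int32)
    offsets = [(-1, -1), (-1, 0), (-1, 1), (0, 1),
               (1, 1), (1, 0), (1, -1), (0, -1)]
    center = patch[1:h - 1, 1:w - 1]
    for bit, (dr, dc) in enumerate(offsets):
        neighbour = patch[1 + dr:h - 1 + dr, 1 + dc:w - 1 + dc]
        codes |= (neighbour >= center).astype(np.int32) << bit
    return codes

def lbp_descriptor(patches):
    """Concatenate one-hot encoded LBP codes over all scales."""
    parts = []
    for p in patches:
        codes = lbp_codes(p).ravel()
        one_hot = np.zeros((codes.size, 256), dtype=np.uint8)
        one_hot[np.arange(codes.size), codes] = 1
        parts.append(one_hot.ravel())
    return np.concatenate(parts)

rng = np.random.default_rng(0)
patches = [rng.integers(0, 256, (s, s)) for s in (20, 10, 5)]
desc = lbp_descriptor(patches)
```

For 20×20, 10×10, and 5×5 patches there are 18², 8², and 3² window positions, so the descriptor has 256 · 397 dimensions with exactly 397 non-zero entries.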
Training Landmark Classifiers
Distance to the annotated landmark is part of the loss
The training set consists of images with hand-annotated landmarks
3D pose initialization
2D landmarks to 3D head pose: use the mean model X0 for the 3D points
Perspective-n-Point (PnP) problem ⇒ camera pose
Estimation with RANSAC followed by maximum-likelihood refinement
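The pose solver treats the detected 2D landmarks plus the mean 3D shape X0 as a Perspective-n-Point problem. As a minimal linear sketch of that step (no RANSAC and no likelihood refinement, which the actual pipeline adds), a 3×4 projection matrix can be estimated by Direct Linear Transform from n ≥ 6 3D-2D correspondences; the camera matrix and points below are made up.

```python
import numpy as np

# DLT camera resection: each correspondence X_i <-> x_i gives two linear
# equations in the 12 entries of P; the null vector of the stacked
# system (via SVD) is the projection matrix up to scale.

def dlt_resection(X, x):
    """X: (n, 3) 3D points, x: (n, 2) image points -> P (3, 4)."""
    rows = []
    for (Xw, Yw, Zw), (u, v) in zip(X, x):
        p = [Xw, Yw, Zw, 1.0]
        rows.append([0, 0, 0, 0] + [-c for c in p] + [v * c for c in p])
        rows.append(p + [0, 0, 0, 0] + [-u * c for c in p])
    _, _, Vt = np.linalg.svd(np.array(rows))
    return Vt[-1].reshape(3, 4)

def project(P, X):
    Xh = np.hstack([X, np.ones((len(X), 1))])
    xh = (P @ Xh.T).T
    return xh[:, :2] / xh[:, 2:3]

# Synthetic ground-truth camera and generic 3D points in front of it.
P_true = np.array([[800.0, 0, 320, 10], [0, 800, 240, 20], [0, 0, 1, 5]])
X = np.array([[0, 0, 1], [1, 0, 2], [0, 1, 2], [1, 1, 1],
              [-1, 1, 3], [2, -1, 2], [-1, -1, 1], [1, 2, 3]], dtype=float)
x = project(P_true, X)
P_est = dlt_resection(X, x)
```

In the real pipeline, RANSAC over such minimal solves rejects misdetected landmarks before the maximum-likelihood refinement.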
Tracking with Initial Guess (DPM)
Joint estimation of landmarks, pose, and sparse reconstruction
Shape and pose are jointly estimated; visibility is fixed; landmarks come from reprojections
[Objective combines the i-th classifier score of the projected model point with a 3D shape model (PCA), over the visible landmarks]
Summary
- 2D model with MRF prior giving a globally optimal 2D initialization
- Sparse 3D face pose and shape estimation
- Straightforward handling of landmark visibility; real-time
- Competitive landmark detection accuracy
Tracking with 3D refinement I
First pose by the proposed initialization; subsequent poses initialized from the prior frame; no re-detection
Tracking with 3D refinement II
First pose by the proposed initialization; subsequent poses initialized from the prior frame; no re-detection
Single-view 3D face reconstruction
Possible applications in face interpretation tasks (identity, age, gender recognition)
Computer graphics: face-texture transfer
Experiments: Selected results on the AFLW dataset
3D Facial Landmark summary
Initialization is fast and globally optimal
Joint detection of landmarks, estimation of head pose, and sparse face reconstruction
Super real-time performance (> 50 fps)
Competitive landmark detection accuracy and pose-estimation performance
Part 3: Text Detection in the Wild
Jiří Matas, Professor
Lukáš Neumann (Text Detection), PhD candidate
Michal Bušta (Text Detection), Researcher
The "Neumann-Matas algorithm" for scene text detection (based on the CVPR 2012 paper) is now part of OpenCV 3.x thanks to Lluis Gomez (https://github.com/Itseez/opencv_contrib/tree/master/modules/text).
The FASText detector source code (published at ICCV 2015) is available at https://github.com/MichalBusta/FASText.
Problem Introduction
Text: Anything that can be represented as a Unicode string
Problem Introduction
Scene Text (Text in the Wild): typically snippets of text, with arbitrary script and orientation, out-of-vocabulary words, and complex backgrounds; image or video taken by a camera
Text in the Wild vs. Other Text
http://rrc.cvc.uab.es/?ch=4&com=evaluation
Region-based Pipeline
1. Tentative text fragments are segmented
2. Segmentations are classified: character, multi-character, or background
3. Segmentations are grouped into text lines
MSER/CSR detection is done with AdaBoost using fast weak classifiers (see slide), so only character-like regions are kept; trained on a combination of synthetic and ICDAR data.
Character and grouping classification: multi-label SVM (one-against-all) using stroke-area ratio, aspect ratio, and compactness; trained with synthetic and hand-annotated ICDAR data.
Text-line hypotheses: Hough transform between centers of k-nearest connected components.
Segmentation: graph cut, where the unaries are color distributions and the pairwise term is a Potts model.
Targeting mobile phones: running a CNN on the entire image requires a TITAN GPU and still takes ~10 seconds, which cannot be done on an Android phone; with ~1000 components entered into the CNN it runs in real time.
Recognition: http://www.robots.ox.ac.uk/~vgg/research/text/
FASText keypoints
Observation: text fragments in any script are formed from strokes
Fast stroke detector: keypoints are stroke endings and stroke bends
Live demo on an Android phone
[Figure: Stroke End, Stroke Bend]
FASText keypoint details
Stroke End keypoint and Stroke Bend keypoint: LBP-like pixel-intensity comparison to surrounding pixels
FASText keypoints are used to initialize a flood fill to obtain segmentations of letters
Imprecision is the ratio of output components to ground truth on ICDAR data
For every true component there will be ~10 false positives, so even for 100 characters it is still real time
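A toy version of the keypoint test sketched above: intensities on a ring around a candidate pixel are compared against the center (an LBP-like binary pattern), and the number of contiguous "stroke-like" arcs distinguishes the keypoint types: one arc suggests a stroke ending, two a stroke bend. The ring layout, margin, and intensity values are simplified assumptions, not the detector's actual parameters.

```python
# Count contiguous arcs of ring pixels whose intensity is close to the
# candidate (stroke) intensity; the arc count is the keypoint cue.

def contiguous_arcs(ring, center, margin=10):
    similar = [abs(v - center) <= margin for v in ring]
    if all(similar) or not any(similar):
        return 0  # flat region or isolated point: not a stroke keypoint
    # each arc contributes two transitions around the circular ring
    changes = sum(similar[i] != similar[i - 1] for i in range(len(similar)))
    return changes // 2

# A dark 3 px stroke ending on a bright background: the ring crosses the
# stroke once, giving a single similar arc.
ending_ring = [200, 200, 200, 60, 55, 60, 200, 200]
ending_arcs = contiguous_arcs(ending_ring, center=58)

# A bend: the stroke enters and leaves the ring, giving two arcs.
bend_ring = [60, 200, 200, 60, 55, 200, 200, 200]
bend_arcs = contiguous_arcs(bend_ring, center=58)
```

Because the test touches only a small pixel ring, it is cheap enough to run densely, which is what makes the detector suitable for phones.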
Component classification
Components are typically assumed to be one character
Assumption relaxed by classifying components as: part of a character, character, group of characters, or word
SVM features: aspect ratio, compactness, number of holes, horizontal crossings, stroke-support pixels
1-vs-all SVM for classification
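A dependency-free sketch of the 1-vs-all scheme above: one binary scorer per class over toy shape features, predicting the argmax. Regularized least squares stands in for the SVM's hinge loss here, and the feature vectors are fabricated, so this shows the one-vs-all structure rather than the deck's trained classifier.

```python
import numpy as np

# One-vs-all linear classification: fit one linear scorer per class
# against +/-1 targets, then predict the class with the highest score.

class OneVsAll:
    def fit(self, X, y, reg=1e-3):
        X = np.hstack([X, np.ones((len(X), 1))])  # append bias term
        self.classes_ = np.unique(y)
        T = np.where(y[:, None] == self.classes_[None, :], 1.0, -1.0)
        A = X.T @ X + reg * np.eye(X.shape[1])
        self.W = np.linalg.solve(A, X.T @ T)  # one weight column per class
        return self

    def predict(self, X):
        X = np.hstack([X, np.ones((len(X), 1))])
        return self.classes_[np.argmax(X @ self.W, axis=1)]

# Three well-separated toy classes in a 2D feature space.
X = np.array([[0, 0], [0.5, 0], [0, 0.5], [5, 0], [5.5, 0], [5, 0.5],
              [0, 5], [0.5, 5], [0, 5.5]], dtype=float)
y = np.array([0, 0, 0, 1, 1, 1, 2, 2, 2])
model = OneVsAll().fit(X, y)
```

With a real SVM each column of W would come from an independent binary hinge-loss problem, but prediction by argmax over the per-class scores is the same.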
Stroke Support Pixels
Most discriminative feature
Works for diverse scripts and fonts
[Examples: character, multi-character, and non-character MSERs]
Stroke Support Pixels
The estimated stroke area As is the product of the stroke axis length sl and the stroke width sw
The stroke-area ratio As / A, relative to the component area A, is discriminative
[Figure: stroke S with axis length sl and width sw]
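The stroke-area ratio reduces to simple arithmetic; the lengths, widths, and areas below are made-up examples rather than measured components.

```python
# Stroke-area ratio: estimated stroke area (axis length * width)
# divided by the component's actual pixel area. Values near 1 suggest
# a stroke-like, character-like component.

def stroke_area_ratio(stroke_len, stroke_width, component_area):
    return (stroke_len * stroke_width) / component_area

# An "I"-like vertical bar, 20 px long and 3 px wide, 60 px of area.
bar = stroke_area_ratio(20, 3, 60)
# A filled blob whose area far exceeds its stroke estimate.
blob = stroke_area_ratio(10, 3, 300)
```

The bar scores 1.0 while the blob scores far lower, which is why the feature separates characters from clutter across scripts.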
Text Line Detection
Neighboring pairs of text fragments vote for quantized text directions
Corresponding text directions form text-line clusters
Text components are clustered with the Hough transform
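A sketch of the Hough-style voting above: each pair of component centers votes for a quantized direction, and the dominant bin gives the text-line orientation. The bin count and the toy centers are illustrative choices.

```python
import math
from collections import Counter

# Quantized direction voting: every pair of centers casts one vote into
# an angle bin (undirected, so angles are folded into [0, pi)).

def dominant_direction(centers, n_bins=18):
    votes = Counter()
    for i, (x1, y1) in enumerate(centers):
        for x2, y2 in centers[i + 1:]:
            angle = math.atan2(y2 - y1, x2 - x1) % math.pi
            votes[int(angle / math.pi * n_bins) % n_bins] += 1
    return votes.most_common(1)[0][0]

# Character centers lying roughly on a horizontal line: the dominant
# votes land in the near-zero-angle bin.
centers = [(0, 0), (10, 1), (20, 0), (30, 1), (40, 0)]
```

In the full pipeline, voting is restricted to k-nearest neighbors and the consistent pairs are then grouped into text-line clusters.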
Character sequence model
* Slide from the presentation "Reading Text in the Wild" by Max Jaderberg, Visual Geometry Group, Oxford
Summary
Arbitrary Text Fragments detected in a single pass
Efficient strokeness feature discriminates between Text Fragments and clutter
FAST Stroke Detector suitable for embedded systems
Thank you!
Questions?
BACKUP SLIDES: Performance Measures, Details
Bag of Words Scoring
[Figure: posting lists for query visual words 1-3 over database images 1-10]
Geometric Re-ranking
Re-rank top-ranked images (removing false positives) with RANSAC
NOTE: Standard BoW score ranking is performed without geometric information
IMPORTANT: Geometric verification is crucial for query expansion
Sivic, Zisserman: Video Google, ICCV 2003
Philbin, Chum, Isard, Sivic, Zisserman: Object retrieval with large vocabularies and fast spatial matching, CVPR 2007
Experiments: Landmark detection accuracy, 300W results (49 landmarks)
- Face detector miss-rate: 46/554 = 8.3%
- Error relative to inter-ocular distance
Experiments: Landmark detection accuracy, AFLW results (15 landmarks)
- Face detector miss-rate: 12%
- Error relative to the nose-eye-to-mouth-centers distance (due to profile views); face size / inter-ocular distance ≈ 1.1
Experiments: Landmark detection accuracy
Datasets:
- AFLW (Annotated Facial Landmarks in the Wild, 2011), Graz University of Technology: 24k images downloaded from Flickr (test set: 4k images, cross-validation); 15 selected landmarks, manually annotated (limited accuracy); very challenging: multi-view, varying image quality, facial expressions
- 300W (300 Faces In-The-Wild Challenge, 2014), Imperial College London: 5k images, test set ~600 images (indoor, outdoor); 51/68 landmarks semi-manually annotated; mostly near-frontal faces
Tested algorithms:
- 2D DPM, an unpublished extension of [Uricar, VISAPP 2012]
- Chehra [Asthana, CVPR 2014]
- The proposed method, an unpublished extension of [Cech, ICPR 2014]
Landmark visibility examples
Experiments: Accuracy of pose estimation
MultiPIE dataset [CMU]: multi-view dataset, 250 subjects; ground-truth position and orientation by SfM (multiple cameras; 21 annotated landmarks ⇒ person-specific 3D model)
[Figures: per-view landmark naming schemes, e.g. center, canthus-ll/lr/rl/rr, ear-l/r, nose, nose-l/r, mouth-corner-l/r, chin]
[Plots: occurrence [%] vs. localization error. Left: error in inter-ocular-distance units, curves Chehra, Ours, DPM. Right: error relative to face size, curves Chehra, Ours, and DPM, each for all poses and for 0-30° yaw]