James Pritts - Visual Recognition in the Wild: Image Retrieval, Faces, and Text


Visual Recognition in the Wild: Image Retrieval, Faces, and Text
EECVC 2016, Odessa, Ukraine

James Pritts

Center for Machine Perception
Czech Technical University in Prague

Who are we?

Filip Radenović, Image Retrieval, PhD candidate
Lukáš Neumann, Text Detection, PhD candidate
James Pritts, Repeated Patterns, PhD candidate
Michal Bušta, Text Detection, Researcher
Jiří Matas, Professor
Vojtěch Franc, Assistant Professor
Ondřej Chum, Associate Professor
Jan Čech, Senior Researcher

Goals of the talk

Demo robust and working systems
  state-of-the-art performance (or nearly)
  real-time operation for facial landmarks and text

"In the wild" means images taken under unconstrained conditions

Present selected applied research from CMP

Machine learning integral to vision (even pipelines)

Rigorous analysis in related publications


Part 1: Image Retrieval 2.0

Filip Radenović, Image Retrieval, PhD candidate
Jiří Matas, Professor
Ondřej Chum, Associate Professor

Retrieval Tasks

1.0: Standard image retrieval problems
  Visually most similar
  All visually similar

2.0: Beyond similarity retrieval
  New (unseen) information
  What/where is this?
  What is interesting here?
  Where should I look?

2.1: Image retrieval for 3D reconstruction

2.0: Speak about advances that go beyond standard image retrieval

Standard Image Retrieval Evaluation

[Figure: precision-recall curve, precision (y-axis) against recall (x-axis), both from 0 to 1. Average Precision (AP) is the area under the curve.]

Query: 10 database images, 5 relevant images
Ranking: [figure]

#
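As a rough illustration of the evaluation measure on this slide, the sketch below computes AP for one ranked list. It uses the common non-interpolated form (mean of the precision values at the ranks of the relevant images), which is an assumption; the talk only states that AP is the area under the precision-recall curve. The ranking itself is made up for the example.

```python
def average_precision(ranked_relevance, num_relevant):
    """Non-interpolated AP: mean precision at the ranks holding a relevant image."""
    hits = 0
    precision_sum = 0.0
    for rank, is_relevant in enumerate(ranked_relevance, start=1):
        if is_relevant:
            hits += 1
            precision_sum += hits / rank  # precision at this rank
    return precision_sum / num_relevant if num_relevant else 0.0


# Hypothetical ranking of 10 database images, 5 of which are relevant.
ranking = [True, True, False, True, False, False, True, False, False, True]
ap = average_precision(ranking, num_relevant=5)
```

A ranking that puts all relevant images first scores 1.0; every irrelevant image ranked above a relevant one lowers the score.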

Is this what we want?

Visually most similar
  Results identical to the query for large datasets

All visually similar
  Output of varying length
  Ground truth hard to obtain

Users won't look at tens of near-duplicate images!

#

Show Google Images results for the Golden Gate Bridge: NO

Bag of Words: Off-line Stage

#

Feature descriptor instead of SIFT


Bag of Words Image Model

[Figure: images, the visual vocabulary (visual words A, B, C, D) and the resulting per-image word histograms]

An image is represented by the histogram of detected visual words

Term frequency (tf): e.g. visual word D occurs twice in the image

#
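A minimal sketch of the image model above: an image becomes a term-frequency histogram over the visual vocabulary. The vocabulary labels follow the slide (A to D); the list of detected words is a made-up example, and assigning descriptors to words (quantization) is assumed to have happened already.

```python
from collections import Counter


def bow_histogram(visual_words, vocabulary):
    """Term-frequency histogram of the detected visual words, in vocabulary order."""
    counts = Counter(visual_words)
    return [counts[word] for word in vocabulary]


vocabulary = ["A", "B", "C", "D"]
image_words = ["A", "C", "D", "D"]  # hypothetical detections; D occurs twice
hist = bow_histogram(image_words, vocabulary)
```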

Bag of Words: On-line Stage

[Figure: on-line retrieval pipeline; the inverted file stores BoW entries together with feature geometries]

IN: query q
1. Inverted file: posting list per visual word (word, image IDs)
2. Image ranking (score, image ID): 0.87, image 5; 0.75, image 1573; 0.52, image 11202; 0.001, image 32
Shortlist: top N images
3. Spatial verification (#inliers, zoom, image ID)
4. Re-ranked shortlist
5. Query expansion (query + image 1573 + image 45)
OUT: R

#
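Steps 1 and 2 of the on-line stage can be sketched as follows. This is a simplified illustration, not the CMP implementation: it scores with the cosine similarity of raw term-frequency vectors (practical systems typically add idf weighting), and the tiny database is invented. Only the posting lists of the query's visual words are visited, which is the point of the inverted file.

```python
import math
from collections import Counter, defaultdict


def build_index(database):
    """database: {image_id: [visual word ids]} -> (inverted file, vector norms)."""
    inverted = defaultdict(list)  # word -> [(image_id, tf), ...]
    norms = {}
    for image_id, words in database.items():
        tf = Counter(words)
        norms[image_id] = math.sqrt(sum(c * c for c in tf.values()))
        for word, count in tf.items():
            inverted[word].append((image_id, count))
    return inverted, norms


def rank_images(query_words, inverted, norms, top_n):
    """Return the shortlist [(score, image_id), ...] of the top_n images."""
    q_tf = Counter(query_words)
    q_norm = math.sqrt(sum(c * c for c in q_tf.values()))
    dot = defaultdict(float)
    for word, q_count in q_tf.items():
        for image_id, count in inverted.get(word, []):  # one posting list
            dot[image_id] += q_count * count
    scored = [(d / (q_norm * norms[i]), i) for i, d in dot.items()]
    scored.sort(key=lambda s: (-s[0], s[1]))
    return scored[:top_n]


# Hypothetical database: image id -> visual words found in that image.
database = {5: [1, 2, 3, 3], 1573: [1, 2, 7], 32: [8, 9]}
inverted, norms = build_index(database)
shortlist = rank_images([1, 2, 3], inverted, norms, top_n=2)
```

Image 32 shares no visual word with the query, so it is never touched during scoring.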

Re-rank the top-ranked images with RANSAC (removing false positives)

NOTE: Standard BoW score ranking is performed without geometric information
IMPORTANT: Geometric verification is crucial for query expansion
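A hedged sketch of the re-ranking idea: count how many tentative correspondences agree with a single geometric transformation. Instead of random sampling, this toy version enumerates one hypothesis per correspondence and restricts the model to isotropic scale plus translation; both choices are simplifying assumptions, and the feature tuples and tolerance are illustrative.

```python
import math


def spatial_verification(matches, tol=5.0):
    """matches: [((qx, qy, qscale), (dx, dy, dscale)), ...] tentative matches.

    Each match proposes a scale + translation; the hypothesis with the most
    inliers wins. Returns (number of inliers, estimated scale change).
    """
    best_inliers, best_scale = 0, None
    for (qx, qy, qs), (dx, dy, ds) in matches:
        s = ds / qs                      # scale change from the feature scales
        tx, ty = dx - s * qx, dy - s * qy
        inliers = sum(
            1
            for (ax, ay, _), (bx, by, _) in matches
            if math.hypot(s * ax + tx - bx, s * ay + ty - by) <= tol
        )
        if inliers > best_inliers:
            best_inliers, best_scale = inliers, s
    return best_inliers, best_scale


# Three matches consistent with scale 2 and shift (10, 20), plus one outlier.
matches = [
    ((0, 0, 1), (10, 20, 2)),
    ((10, 0, 1), (30, 20, 2)),
    ((0, 10, 2), (10, 40, 4)),
    ((5, 5, 1), (200, 300, 1)),
]
num_inliers, scale_change = spatial_verification(matches)
```

The inlier count can be used to re-rank the shortlist, and the estimated scale change is the kind of quantity a zoom-in or zoom-out ranking would use.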

Query Expansion

[Figure labels: query image, results, spatial verification, new query, new results]

Chum, Philbin, Sivic, Isard, Zisserman: Total Recall, ICCV 2007

#

QE is important for creating a sequence of images: a path from the query to the result
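One simple way to build the "new query" is to average the BoW vectors of the original query and of the spatially verified results. The sketch below shows only that averaging step; it is a simplification, since it omits the back-projection of result features into the query region, and the example vectors are invented.

```python
def expand_query(query_tf, verified_tfs):
    """Average the tf vectors of the query and of spatially verified results.

    query_tf and each element of verified_tfs map visual word -> count.
    """
    vectors = [query_tf] + list(verified_tfs)
    words = set().union(*(v.keys() for v in vectors))
    return {w: sum(v.get(w, 0) for v in vectors) / len(vectors) for w in words}


# Hypothetical query and one verified result.
new_query = expand_query({1: 1, 2: 1}, [{1: 1, 3: 2}])
```

Visual word 3 was absent from the original query but enters the expanded one, which is how images that were originally not retrieved become reachable.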

Query Expansion: Step by Step

[Figure legend: query image, retrieved image, originally not retrieved]

#


#


#





CMP Image Retrieval 2.0 Live Demo

Other Retrieval Problems

What is this?

... and what is that? Let's zoom in!

#

Different Retrieval Problems

Query 1, Query 2
Top: visually most similar. Bottom: zoom-in.
Mikulik, Chum, Matas: Image Retrieval for Online Browsing in Large Image Collections, SISAP 2013.

#

Put slide 32 after this

Standard Retrieval and Details

[Figure: two queries and retrieved details shown with their ranks in the standard retrieval result, ordered from EASY to DIFFICULT]

#

Ask Ondra: why is this image like this?

Zoom-in: On-line Stage

[Figure: on-line retrieval pipeline; the inverted file stores BoW entries together with feature geometries]

IN: query q
1. Inverted file: posting list per visual word
2. Image ranking
  Geometry compressed in the inverted file is taken into account during scoring
  Problem-specific ranking function, e.g. maximize scale change
Shortlist: top N images
5. Query expansion
  Query expansion from already zoomed images
OUT: R

#

Zoom-out: On-line Stage

[Figure: on-line retrieval pipeline; the inverted file stores BoW entries together with feature geometries]

IN: query q
1. Inverted file: posting list per visual word
2. Image ranking
  Geometry compressed in the inverted file is taken into account during scoring
  Problem-specific ranking function, e.g. maximize scale change
Shortlist: top N images (#inliers, zoom, image ID): 8, 1/37x, image 1573; 105, 1/17x, image 5; 17, 1/7x, image 11202; 247, 1/2x, image 75213
5. Query expansion
  Query expansion from already zoomed images
OUT: R

#

Context: what you see in the query should be as small as possible in the retrieved image and surrounded by new information

Zoom-in: Example

?

#

Zoom-in: Query Expansion

#

Zoom-in: Example

#


#



#

Zoom-out: Iterate

#

Zoom-out: Iterate

#

Zoom-out: Iterate

#

Do the reverse in the demo

What is interesting here?

#

The must sees!

#

Most interesting

TASK: For any pixel in the query, find the frequency of close-ups
[Figure legend: detail size 0-1 %, 1-3 %, 3-10 %]
Mikulík, Radenović, Chum, Matas: Efficient Image Detail Mining, ACCV 2014

#

Tourists tend to frame objects in their photographs

Highest Resolution Transform

TASK: For any pixel in the query, find the max-res image containing the pixel
[Figure: retrieved details with zoom factors 37.3x, 27.0x, 22.8x, 21.9x, 21.6x]
Mikulík, Radenović, Chum, Matas: Efficient Image Detail Mining, ACCV 2014

#
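The task statement can be illustrated with a brute-force sketch. It assumes each retrieved image has already been registered to the query and is summarized by an axis-aligned footprint in query pixels plus a zoom factor; the real footprints come from the estimated geometry and need not be axis-aligned, and the numbers below are invented.

```python
def highest_resolution_map(width, height, details):
    """For every query pixel, the largest zoom among images covering it.

    details: [(x0, y0, x1, y1, zoom), ...] with half-open pixel ranges.
    Pixels not covered by any detail keep 1.0 (the query itself).
    """
    best = [[1.0] * width for _ in range(height)]
    for x0, y0, x1, y1, zoom in details:
        for y in range(max(0, y0), min(height, y1)):
            for x in range(max(0, x0), min(width, x1)):
                best[y][x] = max(best[y][x], zoom)
    return best


# Hypothetical 4x3 query with two overlapping close-ups.
res_map = highest_resolution_map(4, 3, [(0, 0, 2, 2, 5.0), (1, 1, 4, 3, 9.0)])
```

Counting the covering images per pixel instead of taking the maximum would give the "frequency of close-ups" map from the previous slide.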

All Details: On-line Stage

[Figure: on-line retrieval pipeline; the inverted file stores BoW entries together with feature geometries]

IN: query q
1. Inverted file: posting list per visual word
2. Image ranking
Shortlist: top N images
5. Query expansion
OUT: R

#

All Details: Hierarchical Query Expansion

[Figure: hierarchical query expansion over the inverted file (BoW and geometries)]

IN: R
1. Grouped images: groups G1, ..., Gn
2. Geometric consistency: for images i and j of a group, checked using the transformations A_{q,i}, A_{q,j} and A_{j,i}
OUT: expanded queries q1, q2, ..., qn (in the figure, q1 is built from image 1573 and image 45, qn from image 1761 and image 33)

#





Structure-from-Motion 3D Reconstruction

Thousands of images
  Exhaustive matching of all image pairs
  [Snavely, Seitz, Szeliski: Photo tourism, SIGGRAPH 2006]
  + Fine details are reconstructed
  - Infeasible for large photo collections

Millions of images
  Matching images through standard image retrieval
  [Heinly, Schonberger, Dunn, Frahm: Reconstructing the World in Six Days, CVPR 2015]
  + Efficient and scalable image matching
  - Details not reconstructed

#

Retrieval for 3D Reconstruction

Visually most similar search
  Many near duplicates
  Details lost

Zoom-in and details search
  Details retrieved
  Transition images to match the details

Zoom-out search
  Viewpoint change
  More context

Sideways crawl
  Significant viewpoint change
  More context

Schoenberger, Radenović, Chum, Frahm: From Single Image Query to Detailed 3D Reconstruction, CVPR 2015

#

Sideways image crawl
Schoenberger, Radenović, Chum, Frahm: From Single Image Query to Detailed 3D Reconstruction, CVPR 2015

#

Sideways crawl: On-line Stage

[Figure: on-line retrieval pipeline; the inverted file stores BoW entries together with feature geometries]

IN: query q
1. Inverted file: posting list per visual word
2. Image ranking
  Using geometry to find adequate features for expansion (left-right)
Shortlist: top N images
5. Query expansion
  Building an expanded query using only sideways features
OUT: R

#

We could estimate the full two-view geometry for each pair of result images to get the relative camera pose, but that would come at a high cost. So we look for an approximate geometric constraint that does the same job: features on the right in the results matching features on the left in the query.
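The approximate constraint in this note can be sketched as a simple filter on matched feature positions. Splitting each image at its vertical midline is an assumption made for illustration; the talk does not say how "left" and "right" are thresholded, and the matches are made up.

```python
def sideways_matches(matches, query_width, result_width, direction="left"):
    """Keep matches that suggest the result looks further to the given side.

    matches: [((qx, qy), (rx, ry)), ...] matched feature positions.
    direction "left": query feature on the left, result feature on the right.
    direction "right": query feature on the right, result feature on the left.
    """
    selected = []
    for (qx, qy), (rx, ry) in matches:
        q_left = qx < query_width / 2
        r_left = rx < result_width / 2
        if direction == "left" and q_left and not r_left:
            selected.append(((qx, qy), (rx, ry)))
        elif direction == "right" and not q_left and r_left:
            selected.append(((qx, qy), (rx, ry)))
    return selected


# Hypothetical matches between a 100 px wide query and a 100 px wide result.
matches = [((10, 5), (90, 5)), ((80, 5), (20, 5)), ((10, 5), (15, 5))]
left = sideways_matches(matches, 100, 100, direction="left")
right = sideways_matches(matches, 100, 100, direction="right")
```

Only the selected features would then be used to build the expanded query for the next crawl step.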

Sideways Left: Step by Step

#

Sideways Left: Step by Step

#

Retrieval for 3D Reconstruction
Schoenberger, Radenović, Chum, Frahm: From Single Image Query to Detailed 3D Reconstruction, CVPR 2015

#

Summary

Visually most similar

Zoom-in / details

Zoom-out

Sideways right

#

Image retrieval is not just about efficient search for the most visually similar images
  what / where is this, what is to the right / left
  analysis of details, detailed 3D reconstruction

The use of geometry, context and query expansion significantly enhances performance and enlarges the set of problems that are solvable

Can we go further?
  Temporal constraints: day / night, summer / winter

Illumination constraints

#

During the day most of the scene is visible, but at night much of it is not. So illumination changes but structure doesn't.

Part 2: 3D Facial Landmarks

Jiří Matas, Professor
Vojtěch Franc, Assistant Professor
Jan Čech, Senior Researcher

Čech, Franc, Uřičář, Matas: Multi-view facial landmark detection by using a 3D shape model, Image and Vision Computing 2016, volume 47

Face detection and facial landmarks

Robust face detection
  Viewpoint invariant

Facial landmarks
  Are salient keypoints
  Accurate detection is critical for tasks: gender, age, identity
  Need real-time processing

William (M34) Kate (F28)

#

True age of both in the photo: 29. Both William and Kate were born in 1982; the wedding was in 2011.

Landmarks and head-pose estimation

Input: an image (arbitrary viewpoint of the face)

Output:
  Facial landmarks (up to 51 landmarks)
  6 DOF head pose estimation
  3D face reconstruction

#


Landmark and pose estimation pipeline

[Figure: pipeline. Image I -> face detector (bbox, roll, yaw) -> image normalization -> 2D DPM (models for yaw 0, 45, 75; gives 2D landmarks and visibility V) -> pose solver (uses K and the mean shape X0; gives R0, t0) -> 3D optimizer (gives X*, x*, R*, t*)]

Initialization of 3D shape and pose
Refinement of 3D shape and pose

#


Face detector

  Face detector is robust to viewpoint changes
  Provides rough estimates of roll and yaw
  Based on WaldBoost [Sochmann-CVPR-2005]
  Weak classifiers are Haar-like features
  Commercial detector by Eyedea Recognition, Ltd. http://eyedea.cz

#


2D landmark detection

Image normalization
  Bounding box axis-aligned using roll and scaled to 80x60 px

2D Deformable Part Model (DPM)
  One of 5 DPM models selected using the yaw estimate
  Selected model defines the set of visible landmarks V

#


2D Deformable Part Models
  MRF on a tree
  2D landmark location model
  Globally optimal solution
  Loss function: [equation image not preserved]

#
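The "globally optimal solution" on a tree-structured MRF comes from dynamic programming. The sketch below is a generic max-sum solver, not the authors' model: the unary scores stand in for local landmark classifier responses, the pairwise term for a deformation cost, and all numbers are invented. Nodes are assumed to be numbered so that every parent index is smaller than its child's.

```python
def solve_tree(parents, unary, pairwise):
    """Maximize sum_i unary[i][p_i] - sum_i pairwise(i, p_i, p_parent(i)).

    parents[i]: parent of node i (-1 for the root, which must be node 0,
                and parents[i] < i for every other node).
    unary[i][p]: score of placing node i at candidate position p.
    pairwise(i, p, pp): deformation cost of child i at p with parent at pp.
    Returns (best total score, list of chosen candidate indices).
    """
    n, k = len(parents), len(unary[0])
    score = [list(u) for u in unary]          # best score of each subtree
    choice = [[0] * k for _ in range(n)]      # best child position per parent position
    for i in range(n - 1, 0, -1):             # children are processed before parents
        par = parents[i]
        for pp in range(k):
            best_p = max(range(k), key=lambda p: score[i][p] - pairwise(i, p, pp))
            choice[i][pp] = best_p
            score[par][pp] += score[i][best_p] - pairwise(i, best_p, pp)
    placement = [0] * n
    placement[0] = max(range(k), key=lambda p: score[0][p])
    for i in range(1, n):                     # backtrack from the root
        placement[i] = choice[i][placement[parents[i]]]
    return score[0][placement[0]], placement


# Toy chain of 3 landmarks with 3 candidate positions each.
parents = [-1, 0, 1]
unary = [[5, 0, 0], [0, 0, 4], [0, 3, 0]]
best_score, placement = solve_tree(parents, unary, lambda i, p, pp: abs(p - pp))
```

The cost is linear in the number of landmarks and quadratic in the number of candidate positions, which is what makes exact inference practical here.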


Local landmark classifiers
  Each 2D landmark is learned independently
  LBP computed at every position in the ROI
  Computed over 3 scales of the ROI: 20x20, 10x10, 5x5
  LBPs concatenated to construct the descriptor
  High-dimensional binary descriptor, 256*(18^2 + 8^2 + 3^2) = 101632

#

We construct the feature descriptor (a function of the position x and the image I) by concatenating the Local Binary Patterns (a 256-valued code assigned to each 3x3 patch) computed at all positions of the cropped patch normalized to sizes 20x20, 10x10 and 5x5 pixels, respectively. By this process we obtain a 256*(18^2 + 8^2 + 3^2) = 101632-dimensional sparse (18^2 + 8^2 + 3^2 = 397 non-zero elements) binary feature descriptor whose values are to some extent invariant to scale and lighting conditions. The side of the cropped square patch is 0.3 of the bounding box side returned by the face detector.
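The descriptor size follows from counting 3x3 neighbourhoods: a 20x20 patch has 18x18 of them, a 10x10 patch 8x8 and a 5x5 patch 3x3, and each position contributes one active entry in its own block of 256. The sketch below builds such a sparse descriptor to make the counts concrete. The bit ordering of the LBP code, the nearest-neighbour resizing and the test patch are assumptions made for illustration.

```python
def lbp_code(patch, y, x):
    """8-bit LBP of the 3x3 neighbourhood centred at (y, x)."""
    center = patch[y][x]
    offsets = [(-1, -1), (-1, 0), (-1, 1), (0, 1), (1, 1), (1, 0), (1, -1), (0, -1)]
    code = 0
    for bit, (dy, dx) in enumerate(offsets):
        if patch[y + dy][x + dx] >= center:
            code |= 1 << bit
    return code


def downscale(patch, size):
    """Nearest-neighbour resize of a square patch (assumed resampling)."""
    n = len(patch)
    return [[patch[(r * n) // size][(c * n) // size] for c in range(size)]
            for r in range(size)]


def sparse_lbp_descriptor(patch20):
    """Return (descriptor dimension, indices of the non-zero entries)."""
    active, offset = [], 0
    for size in (20, 10, 5):
        scaled = patch20 if size == 20 else downscale(patch20, size)
        inner = size - 2                       # positions with a full 3x3 window
        for y in range(1, size - 1):
            for x in range(1, size - 1):
                position = (y - 1) * inner + (x - 1)
                active.append(offset + position * 256 + lbp_code(scaled, y, x))
        offset += inner * inner * 256
    return offset, active


# Hypothetical 20x20 grey-level patch.
patch = [[(r * 7 + c * 13) % 256 for c in range(20)] for r in range(20)]
dimension, nonzero = sparse_lbp_descriptor(patch)
```

Because only the indices of the active entries need to be stored, evaluating a linear classifier on this descriptor reduces to summing 397 weights.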

Training Landmark Classifiers
  Distance to the annotated landmark is part of the loss
  Training set consists of images with hand-annotated landmarks

#


3D pose initialization
  2D landmarks to 3D head pose
  Use the mean model X0 for the 3D points
  Perspective-n-Point problem => camera pose (R0, t0)
  Estimation with RANSAC followed by likelihood maximization

#

Tracking with Initial Guess (DPM)

#


Joint estimation of landmarks and pose and sparse reconstruction
  Shape and pose jointly estimated
  Visibility fixed
  Landmarks from reprojections

[Equation legend: i-th classifier score of the projected model point; 3D shape model (PCA); visibility]

#


Summary
  2D model with MRF prior giving a globally optimal 2D initialization
  Sparse 3D face pose and shape estimation
  Straightforward handling of landmark visibility
  Real-time
  Competitive landmark detection accuracy

#


Tracking with 3D-refinement I
  First pose by the proposed initialization
  Subsequent poses initialized from the prior frame
  No redetection

#

Tracking with 3D-refinement II
  First pose by the proposed initialization
  Subsequent poses initialized from the prior frame
  No redetection

#

Single-view 3D face reconstruction
  Possible applications in face interpretation tasks (id, age, gender recognition)
  Computer graphics: face-texture transfer

#


Experiments: Selected results on the AFLW dataset

#

3D Facial Landmark summary

Initialization is fast and globally optimal

Joint detection of landmarks and estimation of head pose and sparse face reconstruction

Super real-time performance (> 50 fps)

Competitive landmark detection accuracy and pose-estimation performance

#

Part 3: Text Detection in the Wild

Jiří Matas, Professor
Lukáš Neumann, Text Detection, PhD candidate
Michal Bušta, Text Detection, Researcher

The "Neumann-Matas algorithm" for scene text detection (based on the CVPR 2012 paper) is now part of OpenCV 3.x thanks to Lluis Gomez (https://github.com/Itseez/opencv_contrib/tree/master/modules/text)

The source code of the FASText detector (published at ICCV 2015) is available (https://github.com/MichalBusta/FASText)

Problem Introduction

Text: Anything that can be represented as a Unicode string

#


Problem Introduction

Scene Text (Text in the Wild)
  Typically snippets of text, arbitrary script and orientation, out-of-vocabulary words, complex backgrounds
  Image or video taken by a camera

[Figure labels: Text in the Wild, Other Text]

http://rrc.cvc.uab.es/?ch=4&com=evaluation

#


Region-based Pipeline

Tentative text fragments are segmented

Segmentations are classified: Character, Multi-character, Background

Segmentations are grouped into text lines

#

MSER/CSR detection is done with AdaBoost using the following weak classifiers (see slide). These HAVE to be fast, so they only pick out character-like regions. Trained on a combination of synthetic and ICDAR data.

Character and grouping classification: multi-label SVM (one-against-all) on stroke-area ratio, aspect ratio and compactness, trained with synthetic and hand-annotated ICDAR data

Text-line hypotheses: Hough transform between the centers of k-nearest connected components

Graph cut: unaries are color distributions, the pairwise term is a Potts model

Targeting mobile phones: running a CNN on the entire image requires a TITAN GPU and still takes 10 seconds, so it cannot be done on an Android phone. With about 1000 components entered into the CNN, it runs in real time.

Recognition: http://www.robots.ox.ac.uk/~vgg/research/text/

FASText keypoints

Observation: Text fragments in any script are formed from strokes

Fast Stroke Detector: keypoints are stroke ends and stroke bends

Live demo on an Android phone

[Figure labels: Stroke End, Stroke Bend]

#


FASText keypoint details

[Figure labels: Stroke End Keypoint, Stroke Bend Keypoint]

LBP-like pixel intensity comparison to surrounding pixels

FASText detector: [figure]

#

FASText keypoints are used to initialize a flood fill that yields the segmentation of letters

Imprecision is the ratio of output components to ground-truth components on ICDAR data

In the end, for every true component there will be about 10 false positives, so for 100 characters it is still real time

Component classification

Components are typically assumed to be one character

Assumption relaxed by classifying components as:
  Part of a Character
  Character
  Group of Characters
  Word

SVM features:
  Aspect ratio
  Compactness
  Number of holes
  Horizontal crossings
  Stroke support pixels

1-vs-all SVM for classification

#
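Three of the listed features can be computed directly from a binary component mask, as in the sketch below. The exact definitions are assumptions: compactness is taken as sqrt(area) / perimeter, the perimeter counts exposed pixel edges, and horizontal crossings are sampled at 1/6, 3/6 and 5/6 of the component height. The mask is a made-up "H"-like shape.

```python
import math


def component_features(mask):
    """mask: rows of 0/1 for one component, cropped to its bounding box."""
    height, width = len(mask), len(mask[0])
    area = sum(map(sum, mask))
    perimeter = 0                          # region/background pixel edges
    for y in range(height):
        for x in range(width):
            if not mask[y][x]:
                continue
            for dy, dx in ((-1, 0), (1, 0), (0, -1), (0, 1)):
                ny, nx = y + dy, x + dx
                outside = not (0 <= ny < height and 0 <= nx < width)
                if outside or not mask[ny][nx]:
                    perimeter += 1

    def crossings(row):
        """Number of background/region transitions along one row."""
        padded = [0] + list(row) + [0]
        return sum(1 for a, b in zip(padded, padded[1:]) if a != b)

    rows = [(height * k) // 6 for k in (1, 3, 5)]
    return {
        "aspect_ratio": width / height,
        "compactness": math.sqrt(area) / perimeter,
        "horizontal_crossings": [crossings(mask[r]) for r in rows],
    }


# Hypothetical 3x6 component resembling the letter H.
h_mask = [[1, 0, 1], [1, 0, 1], [1, 0, 1], [1, 1, 1], [1, 0, 1], [1, 0, 1]]
features = component_features(h_mask)
```

All three are cheap to compute and independent of scale, which is what a classifier running on many candidate components needs.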


Stroke Support Pixels

Most discriminative feature

Works for diverse scripts and fonts

[Figure labels: Character, Multi-character, Non-character MSER]

#


Stroke Support Pixels

Stroke area A_s is the product of the stroke axis length s_l and the stroke width s_w

Stroke area ratio A_s / A is discriminative

[Figure labels: s_w, s_l, S]

#


Text Line Detection

Text components are clustered with a Hough transform

[Figure panels: neighboring pairs of text fragments; corresponding text directions; quantized text direction votes; text line clusters]

#
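The voting step in the figure can be sketched as follows: each text fragment is paired with its nearest neighbours, each pair votes for the quantized direction of the line joining the two centers, and the bin with the most votes gives the dominant text direction. This is a simplification that votes on direction only (a full Hough transform would also accumulate line offsets); the number of neighbours, the bin count and the fragment centers are illustrative.

```python
import math
from collections import Counter


def dominant_text_direction(centers, k=2, bins=18):
    """Vote for quantized directions between each fragment and its k nearest neighbours.

    Returns (direction in degrees at the start of the winning bin, vote counts).
    Directions are taken modulo 180 degrees.
    """
    votes = Counter()
    bin_width = 180.0 / bins
    for i, (x, y) in enumerate(centers):
        neighbours = sorted(
            (c for j, c in enumerate(centers) if j != i),
            key=lambda c: math.hypot(c[0] - x, c[1] - y),
        )[:k]
        for nx, ny in neighbours:
            angle = math.degrees(math.atan2(ny - y, nx - x)) % 180.0
            votes[int(angle // bin_width) % bins] += 1
    best_bin, _ = votes.most_common(1)[0]
    return best_bin * bin_width, votes


# Four fragments on a horizontal line plus one piece of clutter.
centers = [(0, 0), (10, 0), (20, 0), (30, 0), (15, 40)]
direction, votes = dominant_text_direction(centers)
```

Fragments whose pairs voted for the winning bin would then be grouped into a text line, while the clutter's votes fall into other bins.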


Character sequence model

* Slide from the presentation of Max Jaderberg, Visual Geometry Group, Oxford: Reading Text in the Wild

#


Summary

Arbitrary Text Fragments detected in a single pass

Efficient strokeness feature discriminates between Text Fragments and clutter

FAST Stroke Detector suitable for embedded systems

#


Thank you!

Questions?

#

BACKUP SLIDES
  Performance Measures
  Details

#

Bag of Words Scoring

[Figure: posting lists of query visual words 1, 2 and 3 are traversed and votes are accumulated for database images 1 to 10]

#

Geometric Re-ranking

Re-rank the top-ranked images with RANSAC (removing false positives)

NOTE: Standard BoW score ranking is performed without geometric information

IMPORTANT: Geometric verification is crucial for query expansion

Sivic, Zisserman: Video Google, ICCV 2003

Philbin, Chum, Isard, Sivic, Zisserman: Object retrieval with large vocabularies and fast spatial matching, CVPR 2007

#

Experiments: Landmark detection accuracy, 300W results
  49 landmarks
  Error relative to inter-ocular distance
  Face detector miss-rate: 46/554 = 8.3%

#
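The error measure used on this slide can be written down in a few lines. The sketch assumes the mean point-to-point distance over all landmarks, divided by the distance between the two eye positions; how the eye positions are derived from the annotations is not stated in the talk, and the coordinates below are invented.

```python
import math


def normalized_error(predicted, ground_truth, left_eye, right_eye):
    """Mean landmark localization error as a fraction of the inter-ocular distance."""
    iod = math.dist(left_eye, right_eye)
    errors = [math.dist(p, g) for p, g in zip(predicted, ground_truth)]
    return sum(errors) / len(errors) / iod


# Hypothetical face with two landmarks and eyes 35 px apart.
err = normalized_error(
    predicted=[(0, 3), (4, 0)],
    ground_truth=[(0, 0), (0, 0)],
    left_eye=(0, 0),
    right_eye=(35, 0),
)
```

Normalizing by a face-dependent distance makes errors comparable across face sizes; for profile views, where one eye may be hidden, a different reference distance is needed.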

Experiments: Landmark detection accuracy, AFLW results
  15 landmarks
  Error relative to the nose-eye-to-mouth-centers distance (due to profile views)
  face-size / iod ~ 1.1
  Face detector miss-rate: 12%

#


Experiments: Landmark detection accuracy

Datasets:
  AFLW (Annotated Facial Landmarks in the Wild, 2011)
    Graz University
    24k images downloaded from Flickr (test set: 4k images, cross-validation)
    15 landmarks (selected), manually annotated (limited accuracy)
    Very challenging: multiview, varying image quality, facial expressions
  300W (300 Faces In-The-Wild Challenge, 2014)
    Imperial College London
    5k images, test set: ~600 images (indoor, outdoor)
    51/68 landmarks, semi-manually annotated
    Mostly near-frontal faces

Tested algorithms:
  2D DPM, unpublished extension of [Uricar-VISAPP-2012]
  Chehra [Asthana-CVPR-2014]
  Proposed method, unpublished extension of [Cech-ICPR-2014]

#

Landmark visibility examples

#

Experiments: Accuracy of pose estimation

MultiPie dataset [CMU]
  Multiview dataset, 250 subjects
  Ground-truth position and orientation by SFM (multiple cameras, 21 landmarks annotated => person-specific 3D model)

#

[Figure: landmark sets of three DPM models]
7 landmarks: s0 center, s1 canthus-rr, s2 nose, s3 nose-r, s4 ear-r, s5 mouth-corner-r, s6 chin
13 landmarks: s00 center, s01 canthus-ll, s02 canthus-lr, s03 canthus-rl, s04 canthus-rr, s05 ear-l, s06 nose-l, s07 nose, s08 nose-r, s09 ear-r, s10 mouth-corner-l, s11 mouth-corner-r, s12 chin
11 landmarks: s00 center, s01 canthus-ll, s02 canthus-lr, s03 canthus-rl, s04 canthus-rr, s05 nose, s06 nose-r, s07 ear-r, s08 mouth-corner-l, s09 mouth-corner-r, s10 chin


[Figure: occurrence [%] against localization error [iod]; curves: Chehra, Ours, DPM]

[Figure: occurrence [%] against localization error [face size]; curves: Chehra, Ours and DPM, each for "all" and for "0-30"]
