Describable Visual Attributes for Face Verification and Image Search

Dept. of CSE, AMC Engineering College

    CHAPTER 1

    INTRODUCTION

Pedanius Dioscorides wrote perhaps the earliest known field guide for botanists, giving pictures and written descriptions of nearly 600 plant species and showing how each could be found and identified. This work laid out the rules of modern taxonomy. Works such as this have in common an effort to teach the reader how to identify a plant or animal by describable aspects of its visual appearance.

While the use of describable visual attributes for identification has been around since antiquity, it has not been the focus of work by researchers in computer vision and related disciplines. Existing methods work by extracting low-level features from images, such as pixel values, gradient directions, histograms of oriented gradients [6], SIFT [7], etc., which are then used to directly train classifiers for identification or detection.

In contrast, the method used in this paper first uses low-level image features to learn intermediate representations, in which images are labeled with an extensive list of descriptive visual attributes. Although these attributes could clearly be useful in a variety of domains (such as object recognition, species identification, architectural description, action recognition, etc.), we focus solely on faces in this paper. These face attributes can range from simple demographic information such as gender, age, or ethnicity; to physical characteristics of a face such as nose size, mouth shape, or eyebrow thickness; and even to environmental aspects such as lighting conditions, facial expression, or image quality.

The attributes are used to train classifiers, which can then identify attributes in new images. The classifier outputs can in turn be used to verify faces and to search through large image collections, and they also seem promising for many other tasks, such as image exploration or automatic description generation. The attributes can be combined to produce descriptions at multiple levels, including object categories, objects, or even instances of objects.


    CHAPTER 2

    IMAGE ATTRIBUTES

2.1 Why might one need these attributes?

• Visual attributes, much like words, can be composed, offering tremendous flexibility and efficiency. Attributes can be combined to produce descriptions at multiple levels, including object categories, objects, or even instances of objects. For example, one can describe "white male" at the category level (a set of people), or "white male, brown hair, green eyes, scar on forehead" at the object level (a specific person), or add "smiling, lit from above, seen from left" to the previous for an instance of the object (a particular image of a person). Refer to Fig. 1.1.

• Moreover, attributes are generalizable; one can learn a set of attributes from large image collections and then apply them in almost arbitrary combinations to novel images, objects, or categories. Perhaps most importantly, these attributes can be chosen to align with the domain-appropriate vocabulary that people have developed over time for describing different types of objects. For faces, this includes descriptions at the coarsest level (such as gender and age), more subtle aspects (such as expressions and the shape of face parts), and highly face-specific marks (such as moles and scars).

• While describable visual attributes are one of the most natural ways of describing faces, a person's appearance can also be described in terms of the similarity of a part of their face to the same part of another individual's. For example, someone's mouth might be like Angelina Jolie's, or their nose like Brad Pitt's. Dissimilarities also provide useful information, e.g., her eyes are not like Jennifer Aniston's. We call these descriptions similes.


2.2 Uses of trained classifiers

Two major uses of classifiers trained on describable visual attributes and similes are face verification and image search.

2.2.1 Face Verification

Face verification is the problem of determining whether two face images are of the same individual. What makes this problem difficult is the enormous variability in the manner in which an individual's face presents itself to a camera: not only might the pose differ, but so might the expression and hairstyle. Making matters worse, at least for researchers in biometrics, is that the illumination direction, camera type, focus, resolution, and image compression are all almost certain to vary as well. These manifold differences in images of the same person have confounded methods for automatic face recognition and verification, often limiting the reliability of automatic algorithms to the domain of more controlled settings with cooperative subjects. We approach the unconstrained face verification problem (with non-cooperative subjects) by comparing faces using our attribute and simile classifier outputs, instead of low-level features directly. Fig. 1.1 shows the outputs of various attribute classifiers for (a) two images of the same person and (b) images of two different people. Note that in (a), most attribute values are in strong agreement, despite the changes in pose, illumination, and expression, while in (b), the values are almost perfectly contrasting. By training a classifier that uses these labels as inputs for face verification, we achieve close to state-of-the-art performance on the Labeled Faces in the Wild (LFW) dataset [8], at 85.54% accuracy.

    2.2.2 Image Search

Another application of describable visual attributes is image search. The ability of current search engines to find images based on facial appearance is limited to images with text annotations. Yet, there are many problems with annotation-based image search:

• the manual labeling of images is time-consuming;
• the annotations are often incorrect or misleading, as they may refer to other content on a webpage;


• and finally, the vast majority of images are simply not annotated.

Figs. 1.2a and 1.2b show the results of the query "smiling Asian men with glasses" using a conventional image search engine (Yahoo Image Search, as of November 2010) and our search engine, respectively. The difference in the quality of the search results is clearly visible. Yahoo's reliance on text annotations causes it to find some images that have no relevance to the query, while our system returns only images that match the query. In addition, many of the correct results on Yahoo point to stock photography websites, which can afford to manually label their images with keywords, but only because their collections are of a limited size, and they label only the coarsest attributes. Clearly, this approach does not scale.

Fig. 1.1 An attribute classifier can be trained to recognize the presence or absence of a describable visual attribute. The responses of several such attribute classifiers are shown for (a) two images of the same person and (b) two images of different individuals. In (a), notice how most attribute values are in strong agreement, despite the changes in pose, illumination, expression, and image quality. Conversely, in (b), the values differ completely despite the similarity in these same environmental aspects. We train a verification classifier on these outputs to perform face verification, achieving 85.54% accuracy on the Labeled Faces in the Wild (LFW) benchmark [8], comparable to the state-of-the-art.


Fig. 1.2 Results for the query "smiling Asian men with glasses" using (a) the Yahoo image search engine (as of November 2010) and (b) our face search engine. Conventional image search engines rely on text annotations, such as file metadata, manual labels, or surrounding text, which are often incorrect, ambiguous, or missing. In contrast, we use attribute classifiers to automatically label images with faces in them, and store these labels in a database. At search time, only this database needs to be queried, and results are returned instantaneously. The attribute-based search results are much more relevant to the query.


    CHAPTER 3

OVERVIEW OF RELEVANT WORK

    3.1 Attribute Classification

Prior research on attribute classification has focused mostly on gender and ethnicity classification. Early works used neural networks to perform gender classification on small datasets. The Fisherfaces work [2] showed that linear discriminant analysis could be used for simple attribute classification, such as glasses/no glasses. Later, Moghaddam and Yang [3] used Support Vector Machines (SVMs) trained on small face-prints to classify the gender of a face, showing good results. The works of Shakhnarovich et al. and of Baluja and Rowley used Adaboost [4] to select a linear combination of weak classifiers, allowing for almost real-time classification of face attributes, with results in the latter case again demonstrated on the FERET database. These methods differ in their choice of weak classifiers: the former uses the Haar-like features of the Viola-Jones face detector [5], while the latter uses simple pixel comparison operators. In a more general setting, Ferrari and Zisserman [6] described a probabilistic approach for learning simple attributes such as colors and stripes.

In contrast to both of these approaches, which try to find relations across different categories, we concentrate on finding relations between objects in a single category: faces. Faces have many advantages compared to generic object categories:

• There is a well-established and consistent reference frame to use for aligning images;
• differentiating objects is conceptually simple (e.g., it is unclear whether two cars of the same model should be considered the same object or not, whereas no such difficulty exists for two faces);
• most attributes can be shared across all people (unlike, e.g., "4-legged", "gothic", or "dual-exhaust", which are applicable to animals, architecture, and automobiles, respectively, but not to each other).

All of these benefits make it possible for us to train more reliable and useful classifiers, and to demonstrate results comparable to the state-of-the-art.


    CHAPTER 4

    BUILDING THE IMAGE DATASET

Internet services have made collecting and labeling image data easy in the following ways:

• Large internet photo-sharing sites such as flickr.com and picasa.com are growing exponentially and host billions of public images, some with textual annotations and comments. In addition, search engines such as Google Images allow searching for images of particular people (albeit not perfectly).
• Efficient marketplaces for online labor, such as Amazon's Mechanical Turk (MTurk), make it possible to label thousands of images easily and with very low overhead.

By exploiting both of these trends, we could create a large dataset of real-world images with attribute and identity labels, as shown in Fig. 4.1 and described next.

Fig. 4.1 Creating labeled image datasets: Our system downloads images from the internet. These images span many sources of variability, including pose, illumination, expression, cameras, and environment. Next, faces and fiducial points are detected using a commercial detector and stored in the Face Database. A subset of these faces are submitted to the Amazon Mechanical Turk service, where they are labeled with attributes or identity, which are used to create the datasets.

4.1 Collecting Face Images

    4.1.1 Procedure

• Using a variety of online sources, including search engines such as Yahoo Images and photo-sharing websites such as flickr.com, we collect face images. Depending on the type of data needed, one can either search for particular people's names (to build a dataset


    labeled by identity) or for default image filenames assigned by digital cameras (to use for

    labeling with attributes).

• Next, we apply the OKAO face detector [40] to the downloaded images to extract faces. This detector also returns the pose angles of each face, as well as the locations of six fiducial points: the corners of both eyes and the corners of the mouth. These fiducial points are used to align faces to a canonical pose.

The 3.1 million aligned faces collected using this procedure comprise the Columbia Face Database.

4.1.2 Conclusions

The following empirical conclusions can be drawn from the dataset of images:

• From the statistics of the randomly-named images, it appears that a significant fraction of them contain faces (25.7%), and on average, each image contains 0.5 faces. Thus, it is clear that faces are ubiquitous and an important case to understand.
• The dataset is a truly real-world dataset, with completely uncontrolled lighting and environments, taken using unknown cameras, unlike existing image datasets.

4.2 Collecting attribute and identity labels

For labeling images in the Columbia Face Database, the Amazon Mechanical Turk (MTurk) service is used. This service matches workers to online jobs created by requesters, who can optionally set quality controls such as requiring confirmation of results by multiple workers, filters on minimum worker experience, etc. We submitted 110,000 attribute labeling jobs showing 30 images to 3 workers per job, presenting a total of over 10 million images to users. The jobs asked workers to select the face images which exhibited a specified attribute.


4.3 The FaceTracer Dataset

The FaceTracer dataset is a subset of the Columbia Face Database, and it includes attribute labels. Each of the 15,000 faces in the dataset has a variety of metadata and fiducial points marked. The attributes labeled include demographic information such as age and race, facial features like mustaches and hair color, and other attributes such as expression, environment, etc. There are 5,000 labels in all.

4.4 The PubFig Dataset

The PubFig dataset is a complement to the LFW dataset [8]. It consists of 58,797 images of 200 public figures. The larger number of images per person (as compared to LFW) allows one to construct subsets of the data across different poses, lighting conditions, and expressions for further study. Fig. 4.2(c) shows the variation present in all the images of a single individual. In addition, this dataset is well-suited for recognition experiments.

Fig. 4.2(a) PubFig Development set (60 individuals)


Fig. 4.2(b) PubFig Evaluation set (140 individuals)

Fig. 4.2(c) All 170 images of Steve Martin

Fig. 4.2 The PubFig dataset consists of 58,797 images of 200 public figures (celebrities and politicians), partitioned into (a) a development set of 60 individuals and (b) an evaluation set of 140 individuals. Below each thumbnail is shown the number of photos of that person. There is no overlap in either identity or image between the development set and any dataset that we evaluate on, including Labeled Faces in the Wild (LFW) [8]. The immense variability in appearance captured by PubFig can be seen in (c), which shows all 170 images of Steve Martin (a celebrity).


    CHAPTER 5

    TRAINING THE CLASSIFIER

Given a particular describable visual attribute, say gender, how can one train a classifier for the attribute? Attributes can be thought of as functions ai that map images I to real values ai(I). Large positive values of ai(I) indicate the presence or strength of the ith attribute, while negative values indicate its absence.

5.1 Training Architecture

An overview of the attribute training architecture is shown in Fig. 5.1. The key idea is to leverage the many efficient and effective low-level features that have been developed by the computer vision community, choosing amongst a large set of them to find the ones suited for learning a particular attribute. This process should ideally be done in a generic, application- and domain-independent way, but with the ability to take advantage of domain-specific knowledge where available.

Fig. 5.1 Overview of the attribute training architecture. Given a set of labeled positive and negative training images, low-level feature vectors are extracted using a large pool of low-level feature options. An automatic, iterative selection process then picks the best set of features for correctly classifying the input data. The outputs are the selected features and the trained attribute classifier.


For the domain of faces, this knowledge consists of an affine alignment procedure and the use of low-level features which have proven to be very useful in a number of leading vision techniques, especially for faces. The alignment takes advantage of the fact that all faces have a common structure (i.e., two eyes, a nose, a mouth, etc.) and that we have fiducial point detections available from a face detector [40].
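To make the alignment step concrete, the following is a minimal Python sketch (using OpenCV and NumPy, which the paper does not prescribe) that warps a face to a canonical coordinate system from three fiducial points, here taken as the two eye centers and the mouth center, which could be derived from the six detected corner points. The canonical coordinates, crop size, and function names are illustrative assumptions, not the system's actual values.

    import cv2
    import numpy as np

    # Assumed canonical positions (pixels) in the aligned output image for
    # left eye center, right eye center, and mouth center; illustrative only.
    CANONICAL_POINTS = np.float32([[60.0, 70.0], [140.0, 70.0], [100.0, 150.0]])
    ALIGNED_SIZE = (200, 200)  # (width, height) of the aligned face crop

    def align_face(image, fiducials):
        # fiducials: 3x2 array of detected point locations in the original
        # image (left eye, right eye, mouth), e.g. averaged from the six
        # corner points returned by the face detector.
        src = np.float32(fiducials)
        # Affine transform mapping detected points onto the canonical ones.
        M = cv2.getAffineTransform(src, CANONICAL_POINTS)
        return cv2.warpAffine(image, M, ALIGNED_SIZE)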

5.2 Low-Level Features

Face images are first aligned using an affine transformation. A set of k low-level feature extractors fj is applied to an aligned input image I to form a feature set F(I):

F(I) = {f1(I), ..., fk(I)}

We describe each extractor fj in terms of four choices:

• the region of the face to extract features from;
• the type of pixel data to use;
• the kind of normalization to apply to the data;
• the level of aggregation to use.

The complete set of our 10 regions is shown in Fig. 5.2. The regions correspond to functional parts of a face, such as the nose, mouth, etc. From each region, one can extract different types of information, as categorized in Table 5.1. The types of pixel data to extract include various color spaces (RGB, HSV) as well as edge magnitudes and orientations. To remove lighting effects and better generalize across a limited number of training images, one can optionally normalize these extracted values.

Finally, one can aggregate the normalized values over the region rather than simply concatenating them. This can be as simple as using only the mean and variance, or include more information by computing a histogram of values over the region. A complete feature type is created by choosing a region from Fig. 5.2 and one entry from each column of Table 5.1. (Of course, not all possible combinations are valid.)
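The pool of candidate feature types can thus be pictured as the cross product of these four choices. The sketch below enumerates such a pool in Python; the region names and the single validity rule are assumptions for illustration, not the exact options used in the system.

    from itertools import product

    # Option lists mirroring Fig. 5.2 and Table 5.1. The region names are
    # illustrative stand-ins for the ten manually defined face regions.
    REGIONS = ["whole_face", "hair", "forehead", "eyebrows", "eyes",
               "nose", "cheeks", "mouth", "chin", "ears"]
    PIXEL_TYPES = ["rgb", "hsv", "intensity", "edge_magnitude", "edge_orientation"]
    NORMALIZATIONS = ["none", "mean", "energy"]
    AGGREGATIONS = ["none", "histogram", "mean_variance"]

    def valid(pixel_type, normalization, aggregation):
        # Filter out combinations that make no sense. This particular rule
        # (edge orientations are angles, so mean/energy normalization does
        # not apply) is an assumed example, not the paper's validity test.
        if pixel_type == "edge_orientation" and normalization != "none":
            return False
        return True

    # Enumerate the pool of candidates that feature selection draws from.
    FEATURE_TYPES = [
        (r, p, n, a)
        for r, p, n, a in product(REGIONS, PIXEL_TYPES, NORMALIZATIONS, AGGREGATIONS)
        if valid(p, n, a)
    ]
    print(len(FEATURE_TYPES), "candidate region-feature combinations")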

  • 8/6/2019 Eshan Content

    14/29

    Describable Visual Attributes for Face Verification and Image Search

    Dept. Of CSE, AMC Engineering College 14

Fig. 5.2 The face regions used for automatic feature selection are shown here on an affine-aligned face image. There is (a) one region for the whole face, and (b) nine regions corresponding to functional parts of the face, such as the mouth, eyes, nose, etc. Regions are large enough to contain the face part across changes in pose, small errors in alignment, and differences between individuals. The regions are manually defined, once, in the affine-aligned coordinate system, and can then be used automatically for all aligned input faces.

Table 5.1 Feature type options. A complete feature type is constructed by first converting the pixels in a given region (see Fig. 5.2) to one of the pixel value types from the first column, then applying one of the normalizations from the second column, and finally aggregating these values into the output feature vector using one of the options from the last column.

Pixel Value Types     Normalizations          Aggregation
RGB                   None                    None
HSV                   Mean Normalization      Histogram
Image Intensity       Energy Normalization    Mean/Variance
Edge Magnitude
Edge Orientation

5.3 Attribute Classifiers

Attribute classifiers Ci are built using a supervised learning approach. Training requires a set of labeled positive and negative images for each attribute, examples of which are shown in Fig. 5.3. The goal is to build a classifier that best classifies this training data by choosing an appropriate



subset of the feature set F(I) described in the previous section. We do this iteratively, using forward feature selection. In each iteration, we first train several individual classifiers on the current set of features in the output set, concatenated with a single region-feature combination. Each classifier's performance is evaluated using cross-validation. The features used in the classifier with the highest cross-validation accuracy are added to the output set. We continue adding features until the accuracy stops improving. For computational reasons, we drop the lowest-scoring 70% of features at each round, but always keep at least 10 features.
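A minimal sketch of this greedy selection loop follows, using scikit-learn SVMs as a stand-in classifier and cross-validation accuracy as the score. The data layout (a feature_bank dict mapping each region-feature combination to a precomputed feature matrix) is an assumption for illustration.

    import numpy as np
    from sklearn.model_selection import cross_val_score
    from sklearn.svm import SVC

    def forward_select(feature_bank, labels, cv=5):
        # feature_bank: dict mapping a feature-type id to an (n_samples, dim)
        # array of extracted features; labels: +1/-1 attribute labels.
        remaining = list(feature_bank)
        selected, best_acc = [], 0.0
        while remaining:
            # Train one candidate per remaining feature type: the
            # already-selected features concatenated with that candidate.
            scores = {}
            for f in remaining:
                X = np.hstack([feature_bank[g] for g in selected + [f]])
                scores[f] = cross_val_score(SVC(kernel="rbf"), X, labels,
                                            cv=cv).mean()
            best = max(scores, key=scores.get)
            if scores[best] <= best_acc:
                break                      # accuracy stopped improving
            best_acc = scores[best]
            selected.append(best)
            remaining.remove(best)
            # Drop the lowest-scoring 70% of candidates, keeping at least 10.
            ranked = sorted(remaining, key=scores.get, reverse=True)
            remaining = ranked[:max(10, int(0.3 * len(ranked)))]
        return selected, best_acc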

While the design of the classifier architecture is flexible enough to handle a large variety of attributes, it is important to ensure that we have not sacrificed accuracy in the process. We therefore compare our approach to three previous state-of-the-art methods for attribute classification: full-face SVMs using brightness-normalized pixel values, Adaboost using Haar-like features, and Adaboost using pixel comparison features. Results are shown in Table 5.2.

Using the Columbia Face Database and the learning procedure just described, a total of 73 attribute classifiers have been trained. Their cross-validation accuracies are shown in Table 5.3, and typically range from 80% to 90% (random performance would be 50% for each attribute).

Fig. 5.3 Training data for the attribute classifiers consists of face images that match the given attribute label (positive examples) and those that don't (negative examples). Shown here are a few of the training images used for four different attributes. Final classifier accuracies for all 73 attributes are shown in Table 5.3.


Table 5.2 Comparison of attribute classification performance for the gender and smiling attributes.

Table 5.3 Cross-validation accuracies of the 73 attribute classifiers.

    5.4 Simile Classifiers


Simile classifiers measure the similarity of part of a person's face to the same part on a set of reference people. We use the 60 individuals from the development set of PubFig as the reference people. The left part of Fig. 5.4 shows examples of four regions selected from two reference people as positive examples. On the right are negative examples, which are simply the same region extracted from other individuals' images.

Fig. 5.4 Each simile classifier is trained using several images of a specific reference person, limited to a small face region such as the eyes, nose, or mouth. We show here three positive and three negative examples each, for four regions on two of the reference people used to train these classifiers.

Two points are worth noting:

• the individuals chosen as reference people do not appear in LFW or in other benchmarks on which we produce results;
• we train simile classifiers to recognize similarity to part of a reference person's face in many images, not similarity to a single image.

For each reference person, we train support vector machines to distinguish a region (e.g., eyebrows, eyes, nose, mouth) on their face from the same region on other faces.
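The following sketch shows what training one such simile classifier might look like with scikit-learn. The per-region feature extraction is assumed to happen elsewhere (e.g., using the region-based features of Chapter 5), and the RBF-kernel SVM and function names are illustrative choices.

    import numpy as np
    from sklearn.svm import SVC

    def train_simile_classifier(ref_region_feats, other_region_feats):
        # ref_region_feats: features from one region (e.g., the nose) across
        # many images of the reference person (positive examples).
        # other_region_feats: the same region extracted from other people's
        # images (negative examples).
        X = np.vstack([ref_region_feats, other_region_feats])
        y = np.hstack([np.ones(len(ref_region_feats)),
                       -np.ones(len(other_region_feats))])
        return SVC(kernel="rbf").fit(X, y)

    # One classifier per (reference person, region) pair, e.g. 60 reference
    # people x 4 regions:
    # similes = {(person, region): train_simile_classifier(pos[person][region],
    #                                                      neg[region])
    #            for person in reference_people
    #            for region in ("eyebrows", "eyes", "nose", "mouth")}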

    CHAPTER 6


    FACE VERIFICATION

Existing methods for face verification, when deciding whether two faces are of the same person, often make mistakes that would seem to be avoidable: men being confused for women, young people for old, Asians for Caucasians, etc. On the other hand, small changes in pose, expression, or lighting can cause two otherwise similar images of the same person to be misclassified by an algorithm as different. Based on this observation, it is hypothesized that attribute and simile classifiers could avoid such mistakes.

    6.1 Training a Verification Classifier

Fig. 6.1 illustrates how attribute-based face verification is performed on a new pair of input images. In order to decide whether two face images I1 and I2 show the same person, one can train a verification classifier V that compares the attribute vectors C(I1) and C(I2) and returns v(I1, I2), the verification decision. These vectors are constructed by concatenating the results of n different attribute and/or simile classifiers.

To build V, let us make some observations about the particular form of our classifiers:

• Values Ci(I1) and Ci(I2) from the ith classifier should be similar if the images are of the same individual, and different otherwise.
• Classifier values are raw outputs of binary classifiers, where the objective function tries to separate examples around 0.

Let ai = Ci(I1) and bi = Ci(I2) be the outputs of the ith trait classifier for each face (1 ≤ i ≤ n). One would like to combine these values in such a way that our second-stage verification classifier V can make sense of the data. This means creating values that are large (and positive) when the two inputs are of the same individual, and negative otherwise. From observation, we see that both the absolute difference |ai − bi| and the product ai·bi will yield the desired outputs. Putting both terms together yields the tuple pi:

pi = { |ai − bi| , ai·bi }    (2)

  • 8/6/2019 Eshan Content

    19/29

    Describable Visual Attributes for Face Verification and Image Search

    Dept. Of CSE, AMC Engineering College 19

The concatenation of these tuples for all n attribute/simile classifier outputs forms the input to the verification classifier V:

v(I1, I2) = V(p1, . . . , pn)    (3)

Training V requires pairs of positive examples (two images of the same person) and negative examples (images of two different people).
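A Python sketch of Eqs. (2) and (3) follows. The feature construction is exactly the tuple concatenation described above; the choice of an SVM for V, and the data layout, are illustrative assumptions.

    import numpy as np
    from sklearn.svm import SVC

    def verification_features(a, b):
        # a = C(I1), b = C(I2): the n attribute/simile classifier outputs
        # for the two faces. For each i, build p_i = (|a_i - b_i|, a_i * b_i)
        # and concatenate all tuples into one length-2n vector.
        a, b = np.asarray(a), np.asarray(b)
        return np.column_stack([np.abs(a - b), a * b]).ravel()

    def train_verifier(pairs, same):
        # pairs: list of (attr_vec1, attr_vec2) tuples;
        # same: 1 for same-person pairs, 0 for different-person pairs.
        X = np.vstack([verification_features(a, b) for a, b in pairs])
        return SVC(kernel="rbf").fit(X, same)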

Fig. 6.1 The face verification pipeline. A pair of input images are run through a face and fiducial detector [40], and the fiducials are then used to align both faces to a canonical coordinate system. The aligned face images are fed to each of our attribute and simile classifiers individually to obtain a set of attribute values. Finally, these values are compared using a verification classifier to make the output determination, which is returned along with the distance to the decision boundary. The entire process is fully automatic.

    6.2 Experimental Setup


Face verification experiments are performed on the Labeled Faces in the Wild (LFW) benchmark [8] and also on the PubFig benchmark. For each computational experiment, one set of pairs of face images is presented for training, and a second set of pairs is presented for testing.

    6.3 Attribute Classifier Results on LFW

Fig. 6.2 shows results on LFW for our attribute classifiers (red line), simile classifiers (blue line), and a hybrid of the two (green line), along with several other methods (dotted lines). The highest accuracy of 85.54% is comparable to the 86.83% accuracy of the current state-of-the-art method on LFW. The small bump in performance from combining the attribute and simile classifiers suggests that while they contain much of the same kind of information, there are still some interesting differences. This can be better seen in Fig. 6.2, where similes do better in the low-false-positive regime, but attributes do better in the high-detection-rate regime.

Fig. 6.2 Face verification performance on LFW of our attribute classifiers, simile classifiers, and a hybrid of the two, shown in solid red, blue, and green, respectively. Dashed lines are existing methods.


    6.4 Human Attribute Labels on LFW

It is interesting to consider how well attribute classifiers could potentially do. There are several reasons to believe that our results are only first steps towards this ultimate goal:

• We have currently trained 73 attribute classifiers. Adding more attributes, especially fine-scale ones such as the presence and location of highly discriminative facial features including moles, scars, and tattoos, should greatly improve performance.
• Of the 73 attributes, many are not discriminative for verification. For example, facial expression, scene illumination, and image quality are all unlikely to aid in verification. There is also a severe imbalance in LFW of many basic attributes such as gender and age, which reduces the expected benefit of using these attributes for verification.
• The attribute functions were trained as binary classifiers rather than as continuous regressors. While we use the distance to the separation boundary as a measure of the degree of the attribute, using regression may improve results.

Fig. 6.3 shows a comparison of face verification performance on LFW using either these human attribute labels (blue line) or our automatically-computed classifier outputs (red line), for increasing numbers of attributes. In both cases, the labels are fed to the verification classifier V and training proceeds identically, as described earlier. The set of attributes used for each corresponding point on the graphs was chosen manually (and was identical for both). Verification results using the human attribute labels reach 91.86% accuracy with 18 attributes, significantly outperforming our computed labels at 81.57% for the same 18 attributes. Moreover, the drop in error rates from computational to human labels actually increases with more attributes, suggesting that adding more attributes could further improve accuracies.


Fig. 6.3 Comparison of face verification performance on LFW using human attribute labels (blue line) vs. automatically-computed classifier outputs (red line).

    6.5 Human Verification on LFW

The high accuracies obtained in the previous section lead to a natural question: How well do people perform on the verification task itself? While many algorithms for automatic face verification have been designed and evaluated on LFW, there are no published results about how well people perform on this benchmark.

Results are shown in Fig. 6.4. Using the original LFW images (red curve), people have 99.20% accuracy, essentially perfect. The task was then made tougher by cropping the images, leaving only the face visible (including at least the eyes, nose, and mouth, and possibly parts of the hair, ears, and neck). This experiment measures how much people are helped by the context


(sports shot, interview, press conference, etc.), background (some images of individuals were taken with the same background), and hair (although sometimes it is partially visible). The results (blue curve) show that performance drops to 97.53%, a tripling of the error rate.

Fig. 6.4 Face verification performance on LFW by humans.

6.6 Attribute Classifier Results on PubFig

Face verification is performed on 20,000 pairs of images of 140 people, divided into 10 cross-validation folds with mutually disjoint sets of 14 people each. These people are separate from the 60 people in the development set of PubFig, which were used for training the simile classifiers. The performance of the attribute classifiers on this benchmark is shown in Fig. 6.5, and it is indeed much lower than on LFW, with an accuracy of 78.65%.


Fig. 6.5 Face verification results on the PubFig evaluation benchmark using the attribute classifiers.


    CHAPTER 7

    FACE SEARCH

Image search engines are currently dependent on textual metadata. This data can be in the form of filenames, manual annotations, or surrounding text. However, for the vast majority of images on the internet (and in people's private collections), this data is often ambiguous, incorrect, or simply not present. This presents a great opportunity to use attribute classifiers on images with faces, thereby making them searchable. To facilitate fast searches on a large collection of images, all images are labeled in an offline process using the attribute classifiers. The resulting attribute labels are stored for fast online searches using the FaceTracer engine.

The FaceTracer engine uses simple text-based queries as inputs, since these are both familiar and accessible to most internet users, and correspond well to describable visual attributes. Search results are ranked by confidence, so that the most relevant images are shown first. The engine uses the computed distance to the classifier decision boundary as a measure of confidence. For searches with multiple query terms, the confidences of the different attribute labels are combined such that the final ranking shows images in decreasing order of relevance to all search terms. To prevent high confidences for one attribute from dominating the search results, we first convert the confidences into probabilities by fitting Gaussian distributions to attribute scores computed on a held-out set of positive and negative examples, and then use the product of the probabilities as the sort criterion. This ensures that the images with high confidences for all attributes are shown first.
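A sketch of this ranking scheme is given below. The use of two fitted Gaussians (one for positive and one for negative held-out scores) combined as a likelihood ratio with equal priors is an assumed reading of the conversion step, and all function names are hypothetical.

    import numpy as np
    from scipy.stats import norm

    def fit_score_models(pos_scores, neg_scores):
        # Fit one Gaussian to classifier scores on held-out positive
        # examples and one to scores on held-out negatives.
        return norm(*norm.fit(pos_scores)), norm(*norm.fit(neg_scores))

    def attribute_probability(score, pos_model, neg_model):
        # P(attribute present | score) from the two fitted Gaussians,
        # assuming equal priors (an assumption of this sketch).
        p, q = pos_model.pdf(score), neg_model.pdf(score)
        return p / (p + q)

    def rank_images(image_scores, models):
        # image_scores[i][k]: image i's raw confidence for query attribute k;
        # models[k]: the (pos_model, neg_model) pair for attribute k.
        # Rank by the product of per-attribute probabilities, so no single
        # confident attribute dominates the results.
        relevance = [
            np.prod([attribute_probability(s, *models[k])
                     for k, s in enumerate(scores)])
            for scores in image_scores
        ]
        return np.argsort(relevance)[::-1]  # most relevant first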

Example queries on the search engine are shown in Figs. 7.1a and 7.1b. The returned results are all highly relevant. Fig. 7.1b additionally demonstrates two other interesting things:

• it was run on a personalized dataset of images from a single user, showing that this method can be applied to specialized image collections as well as general ones;
• it shows that we can learn useful things about an image using just the appearance of the faces within it, in this case determining whether the image was taken indoors or outdoors.


This attribute-based search engine can be used in many other applications, replacing or augmenting existing tools. In law enforcement, eyewitnesses to crimes could use this system to quickly narrow a list of possible suspects and then identify the actual criminal from the reduced list, saving time and increasing the chances of finding the right person. On the internet, our face search engine is a perfect match for social networking websites such as Facebook, which contain large numbers of images with people. Additionally, the community aspect of these websites would allow for collaborative creation of new attributes. Finally, people could use our system to more easily organize and manage their own personal photo collections. For example, searches for blurry or other poor-quality images can be used to find and remove all such images from a collection.


    CHAPTER 8

    CONCLUSION

These seem to be promising first steps in a new direction, and there are many avenues to explore. The experiments with human attribute labeling suggest that adding more attributes and improving the attribute training process could yield great benefits for face verification. Another direction to explore is how best to combine attribute and simile classifiers with low-level image cues. Finally, an open question is how attributes can be applied to domains other than faces. It seems that for reliable and accurate attribute training, analogues to the detection and alignment process must be found.

The set of attributes used in this work was chosen in an ad-hoc way; how to select attributes dynamically in a more principled manner is an interesting topic to consider. In particular, a system with a user in the loop could be used to suggest new attributes. Thanks to Amazon Mechanical Turk, such a system would be easy to set up and could operate autonomously.


    REFERENCES

    Books:

[1] Bahram Javidi, "Image Recognition and Classification", Marcel Dekker, 5(2), 2001.

[2] Takeo Kanade, Anil K. Jain, Nalini Kanta Ratha, "Audio- and Video-Based Biometric Person Authentication", International Association for Pattern Recognition, 2005.

[3] Shigeo Abe, "Support Vector Machines for Pattern Classification", Series in Machine Perception and Artificial Intelligence, Springer, pp. 891-895, 2005.

[4] De-Shuang Huang, "Advanced Intelligent Computing Theories and Applications", Springer, 2010.

[5] Cha Zhang, Zhengyou Zhang, "Face Detection and Adaptation", Morgan & Claypool Publishers, 2011.

    Links:

[6] Histogram of Oriented Gradients, http://portal.acm.org/citation.cfm?id=1069007

    [7] SIFT, http://en.wikipedia.org/wiki/Scale-invariant_feature_transform

    IEEE:

[8] Hieu V. Nguyen and Li Bai, "Cosine Similarity Metric Learning", Computer Vision - ACCV 2010: 10th Asian Conference on Computer Vision, 2010.

[9] Neeraj Kumar, Alexander C. Berg, Peter N. Belhumeur, and Shree K. Nayar, "Describable Visual Attributes for Face Verification and Image Search", IEEE Transactions on Pattern Analysis and Machine Intelligence.