VIP: Finding Important People in Images
Clint Solomon Mathialagan, Andrew C. Gallagher, Dhruv Batra

CVPR 2015


Outline
• Introduction
• Approach
• Results
• Importance vs Saliency
• Application: Improving Im2Text
• Conclusions


Introduction
• Project: https://computing.ece.vt.edu/~mclint/vip/
• Demo: http://cloudcv.org/vip/


Introduction
• The goal of this paper is to automatically predict the importance of individuals in group photographs.


Introduction
• Who are the most important individuals in these pictures?

• Humans have a remarkable ability to understand social roles and identify important players, even without knowing the identities of the people in the images.


Introduction
• What is importance? It could be defined from the perspective of:
1. the photographer
2. the subjects
3. neutral third-party human observers

• In this work, we rely on the wisdom of the crowd to estimate the “ground-truth” importance of a person in an image.


Introduction
• Applications
– Im2Text
– Photo cropping algorithms
– Social networking sites and image search applications

• Contributions
– We learn a model for predicting the importance of individuals in photos.
– We collect two importance datasets.
– We show that we can automatically predict the importance of people with high accuracy, and that incorporating this predicted importance improves applications.

Approach
• Framework

[Framework diagram: person features (distance, scale, sharpness, face pose, face occlusion) are extracted from the dataset and fed to a model M(pi, pj) ≈ si − sj, evaluated at the image level and the corpus level.]

Approach
• We model importance in two ways:

1. Image-Level Importance: “Given an image, who is the most important individual?”

2. Corpus-Level Importance: “Given multiple images, in which image is a specific person most important?”


Approach(1)-Dataset Collection
• Image-Level Dataset: In this setting, we need a dataset of images containing at least three people with varying levels of importance. Source: Flickr.

• Corpus-Level Dataset: In this setting, we need a dataset that has multiple pictures of the same person, and multiple sets of such photos. Source: a TV series (‘Big Bang Theory’).


Approach(2)-Importance Annotation

Annotation Interfaces used with MTurk

Image-Level Importance Annotation: Hovering over a button (A or B) highlights the person associated with it.

Corpus-Level Importance Annotation: Hovering over a frame shows where the person is located in that frame.


Approach(2)-Importance Annotation
• (pi, pj): each annotated pair of faces
• si, sj: the relative importance scores (ranging from 0 to +1)
• Note that si and sj are not absolute, as they are not calibrated for comparison to another person, say pk from another pair.


Approach(2)-Importance Annotation
• The table shows a breakdown of both datasets along the magnitude of differences in importance.


Approach(3)-Importance Model
• The objective is to build a model M that regresses to the difference in ground-truth importance scores:

M(pi, pj) ≈ si − sj

• We use a linear model M(pi, pj) = wᵀφ(pi, pj), where
φ(pi, pj): the features extracted for this pair
w: the regressor weights
We use ν-Support Vector Regression to learn these weights.
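As a concrete sketch (not the authors' code), the pairwise regression can be reproduced with scikit-learn's NuSVR; the feature dimension and data below are synthetic placeholders:

```python
import numpy as np
from sklearn.svm import NuSVR

# Synthetic stand-in data: each row is a pairwise feature vector
# phi(p_i, p_j); the target is the importance difference s_i - s_j.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))
w_true = rng.normal(size=10)
y = X @ w_true + 0.05 * rng.normal(size=200)

# A linear nu-SVR learns the weights w of the linear model
# M(p_i, p_j) = w^T phi(p_i, p_j) ~= s_i - s_j.
svr = NuSVR(kernel="linear", nu=0.5, C=10.0)
svr.fit(X, y)
pred = svr.predict(X)
```

At test time, ranking the people in an image by summing (or voting over) the predicted pairwise differences gives a per-person importance estimate.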


Approach(3)-Importance Model
• We compared two ways of composing these individual face features into a pairwise feature:

1. Using the difference of the two feature vectors

2. Concatenating the two individual feature vectors
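A minimal sketch of the two composition schemes, assuming per-face features are NumPy vectors (names are illustrative):

```python
import numpy as np

def compose_pair_features(f_i, f_j, mode="difference"):
    """Combine two per-face feature vectors into one pairwise vector.

    'difference' subtracts the features; 'concat' stacks them, letting
    the regressor weight each face's features independently.
    """
    f_i, f_j = np.asarray(f_i), np.asarray(f_j)
    if mode == "difference":
        return f_i - f_j
    if mode == "concat":
        return np.concatenate([f_i, f_j])
    raise ValueError(f"unknown mode: {mode}")

f_a = np.array([0.2, 0.9, 0.5])
f_b = np.array([0.4, 0.1, 0.5])
diff_feat = compose_pair_features(f_a, f_b)            # length 3
concat_feat = compose_pair_features(f_a, f_b, "concat")  # length 6
```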


Approach(4)-Person Features
• Distance Features
We first scale the image to a size of (1, 1), so the image center is at (0.5, 0.5), and compute two distance features:

1. Distance from center: the distance d of the face center from the image center
2. Weighted distance from center: d normalized by the largest dimension of the face box

We compute two more features to capture how far a person is from the center of a group:

1. Normalized distance from centroid
2. Normalized distance from weighted centroid

The weighted centroid is the weighted average of the center points of the faces:

x_cm = (m1·x1 + m2·x2 + m3·x3) / (m1 + m2 + m3)
y_cm = (m1·y1 + m2·y2 + m3·y3) / (m1 + m2 + m3)

where the weight of a face, mk, is the area of the head divided by the total area of faces in the image.
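An illustrative implementation of these distance features, assuming normalized (x, y, w, h) face boxes and using face-box area as the weight (not the authors' code):

```python
import numpy as np

def distance_features(face_boxes):
    """Distance features per face, on an image scaled to a unit square.

    face_boxes: array of (x, y, w, h) in normalized [0, 1] coordinates.
    Returns, per face: distance from the image center, distance from
    the face centroid, and distance from the area-weighted centroid.
    """
    boxes = np.asarray(face_boxes, dtype=float)
    centers = boxes[:, :2] + boxes[:, 2:] / 2.0      # face center points
    areas = boxes[:, 2] * boxes[:, 3]
    weights = areas / areas.sum()                    # face area / total face area

    d_center = np.linalg.norm(centers - 0.5, axis=1)
    centroid = centers.mean(axis=0)
    w_centroid = (weights[:, None] * centers).sum(axis=0)
    d_centroid = np.linalg.norm(centers - centroid, axis=1)
    d_wcentroid = np.linalg.norm(centers - w_centroid, axis=1)
    return d_center, d_centroid, d_wcentroid

d_c, d_g, d_w = distance_features([(0.1, 0.4, 0.2, 0.2),
                                   (0.45, 0.4, 0.1, 0.1),
                                   (0.7, 0.4, 0.2, 0.2)])
```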

Approach(4)-Person Features
• Scale

• Sharpness
We apply a Sobel filter and compute the sum of the gradient energy in a face bounding box, normalized by the sum of the gradient energy in all the bounding boxes in the image.
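A sketch of this normalization, assuming grayscale images and pixel-coordinate boxes (scipy's Sobel filter stands in for the authors' implementation):

```python
import numpy as np
from scipy import ndimage

def face_sharpness(gray, boxes):
    """Sobel gradient energy in each face box, normalized by the total
    gradient energy over all face boxes in the image.

    gray: 2-D grayscale image; boxes: list of (x, y, w, h) pixel boxes.
    """
    gx = ndimage.sobel(gray.astype(float), axis=1)
    gy = ndimage.sobel(gray.astype(float), axis=0)
    energy = gx ** 2 + gy ** 2
    per_face = np.array([energy[y:y + h, x:x + w].sum()
                         for x, y, w, h in boxes])
    return per_face / per_face.sum()

rng = np.random.default_rng(1)
img = np.zeros((64, 64))
img[8:24, 8:24] = rng.random((16, 16))   # textured (sharp) face region
img[36:60, 36:60] = 0.5                  # flat (blurry) face region
scores = face_sharpness(img, [(8, 8, 16, 16), (40, 40, 16, 16)])
```

The sharper, textured face absorbs essentially all of the gradient energy, so its normalized score dominates.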


Approach(4)-Person Features
• Face Pose Features
DPM face pose feature:
– We resize the face bounding box patch from the image to 128×128 pixels.
– We run the face pose and landmark estimation algorithm of Zhu et al. [28].
– Our pose feature is the resulting component id, which can range from 1 to 13.

[28] X. Zhu and D. Ramanan. Face detection, pose estimation and landmark localization in the wild. In CVPR, 2012.


Approach(4)-Person Features
• Face Pose Features
Aspect ratio: While the aspect ratio of a face is typically 1:1, this ratio can differentiate between some head poses.

DPM face pose difference: We compute the pose of the person minus the average pose of every other person in the image.
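The pose-difference feature is simple enough to sketch directly (illustrative, assuming pose ids are the DPM component indices 1–13):

```python
import numpy as np

def pose_differences(pose_ids):
    """For each face, its pose id minus the mean pose id of the others."""
    p = np.asarray(pose_ids, dtype=float)
    mean_of_others = (p.sum() - p) / (len(p) - 1)
    return p - mean_of_others

diffs = pose_differences([3, 5, 13])  # one pose id per face in the image
```

A face turned away from the group's dominant direction gets a large-magnitude difference.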


Approach(4)-Person Features
• Face Occlusion
DPM face scores: We use the scores for each of the 13 components in the face detection model of [28] as a feature.

Face detection success: This is a binary feature indicating whether the face detection API [22] we used was successful in detecting the face, or whether it required human annotation. The API achieved a nearly zero false-positive rate on our dataset.

[28] X. Zhu and D. Ramanan. Face detection, pose estimation and landmark localization in the wild. In CVPR, 2012.
[22] SkyBiometry. https://www.skybiometry.com/.


Results
• Baselines
We compare our proposed approach to three natural baselines: the center, scale, and sharpness baselines.
We also used the method of Harel et al. [10, 12] to produce saliency maps and computed the fraction of saliency intensities inside each face as a measure of its importance.
We measure inter-human agreement in a leave-one-human-out manner.

[10] J. Harel. A saliency implementation in MATLAB. http://www.klab.caltech.edu/~harel/share/gbvs.php.
[12] J. Harel, C. Koch, and P. Perona. Graph-based visual saliency. In Advances in NIPS, 2006.


Results
• Metrics
We use mean squared error to measure the performance of our relative importance regressors.
In addition, we convert the regressor output into a binary classification by thresholding against zero.
For each pair of faces (pi, pj), we use a weighted classification accuracy measure, where the weight is the ground-truth importance score of the more important of the two, i.e. max{si, sj}.
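This weighted accuracy can be sketched as follows (illustrative code, not the authors'):

```python
import numpy as np

def weighted_pair_accuracy(pred_diff, s_i, s_j):
    """Weighted binary accuracy over pairs (p_i, p_j).

    A pair counts as correct when the sign of the predicted difference
    matches the sign of s_i - s_j; each pair is weighted by
    max(s_i, s_j), the importance of the more important face.
    """
    pred_diff, s_i, s_j = map(np.asarray, (pred_diff, s_i, s_j))
    correct = np.sign(pred_diff) == np.sign(s_i - s_j)
    w = np.maximum(s_i, s_j)
    return float((w * correct).sum() / w.sum())

acc = weighted_pair_accuracy(pred_diff=[0.3, 0.2, 0.1],
                             s_i=[0.9, 0.2, 0.6],
                             s_j=[0.1, 0.8, 0.4])
```

The weighting means mistakes on pairs involving highly important people cost more than mistakes on unimportant pairs.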


Results
• Image-Level Importance Results
Overall, we achieve an improvement of 3.17% over the best baseline (a 3.54% relative improvement). The mean squared error for our SVR is 0.1489.


Results
• Image-Level Importance Results
Table 4 shows a break-down of the accuracies into the three categories of annotations.


Results
• Corpus-Level Importance Results



Results
• Corpus-Level Importance Results
Table 6 shows the category breakdown.


Results
• Image-Level and Corpus-Level
Fig. 4 shows some qualitative results for the image-level and corpus-level experiments.


Results
• Image-Level and Corpus-Level
Table 7 reports results from an ablation study, which shows the impact of the features on the final performance.


Importance vs Saliency
• We measured the correlation between importance and saliency rankings using Kendall’s Tau. The Kendall’s Tau was 0.5256. The most salient face was also the most important person in 52.56% of the cases.

• Fig. 5 shows qualitative examples of individuals who are judged by humans to be salient but not important, important but not salient, both salient and important, and neither.
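This comparison can be sketched with scipy's Kendall tau; the per-face scores below are illustrative, not values from the paper:

```python
import numpy as np
from scipy.stats import kendalltau

# Illustrative scores for one image: fraction of saliency inside each
# face vs. predicted importance of each face.
saliency = [0.40, 0.25, 0.20, 0.15]
importance = [0.35, 0.30, 0.10, 0.25]

# Rank correlation between the two orderings of the faces.
tau, _ = kendalltau(saliency, importance)

# Does the most salient face coincide with the most important one?
top_match = int(np.argmax(saliency) == np.argmax(importance))
```

Aggregating `top_match` over many images gives the "most salient face is also most important" percentage reported above.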

Application: Improving Im2Text
[Qualitative figure: setup, predictions, and results of the im2text experiment.]


Conclusions
• We proposed the task of automatically predicting the importance of individuals in group photographs.

• Compared to previous work in visual saliency, the proposed person importance is correlated but not identical.

• We showed that our method can successfully predict the importance of people from purely visual cues, and that incorporating predicted importance provides a significant improvement in im2text.


References
• Narrow depth-of-field: https://goo.gl/EfxN2Q
• Sobel filter: http://goo.gl/BmBCx9
