Automatic Image Tagging - Northwestern Universityusers.eecs.northwestern.edu/.../Tagging_notes.pdf · Why Do We Need to Tag Images? I Image tags provide high-level descriptions to

Automatic Image Tagging

Ying Wu

Electrical Engineering and Computer Science
Northwestern University, Evanston, IL 60208

[email protected]

http://www.eecs.northwestern.edu/~yingwu

1 / 35

Outline

Introduction to the Image Tagging Problem

General Approaches to Image Tagging

The TagProp method

The FastTag method

2 / 35

What is Image Tagging?

- A picture is worth a thousand words.
- But a few words are enough to describe it roughly.
- Image tags
  - a small set of words that describe the image well
  - you can never tag an image “completely”
  - image tags are usually sparse and incomplete
- The task:
  - given a predefined dictionary of words, mostly nouns naming entities; the dictionary may contain a few thousand words
  - the tags provided for a training image are usually sparse and incomplete
  - for an unannotated image, automatically assign a set of words (tags) that describe the image content

3 / 35

Why Do We Need to Tag Images?

- Image tags provide high-level descriptions of the images
- They provide natural image categories and indices
- Some example applications
  - Annotate large-scale image datasets for text-based image retrieval (e.g., Google, Yahoo, Flickr)
  - Organize personal photo albums (e.g., Google Photos)
  - Generate natural-language descriptions (image captioning) to help visually impaired people

4 / 35

How Difficult is Image Tagging?

- All we can obtain directly from images is low-level visual content
  - visual features
  - image segments, shapes, structures
  - etc.
- However, the tags are high-level semantic concepts
- There is a large semantic gap between low-level visual content and high-level semantic concepts.
- Challenges
  - It is difficult to map between low-level content and high-level concepts.
  - The dictionary contains a large number of words, so the output tag space is combinatorially large.
  - Tag frequencies in the training data are severely imbalanced.
  - Image tags in the training dataset are incomplete.
  - Image tags used for training are error-prone if they are crawled automatically from web data.

5 / 35

Problem Definition

Figure: The image tagging problem

Given $\{(x_i, y_i)\}_{i=1}^{N}$, where $x_i \in \mathbb{R}^d$ is an image representation and $y_i \in \{0, 1\}^T$ are the provided tags, learn a relevance function $f(x, y; W)$ to estimate the tag set of an unannotated image.

6 / 35

Problem Definition

- Given $\{(x_i, y_i)\}_{i=1}^{N}$, where $x_i \in \mathbb{R}^d$ is an image representation and $y_i \in \{0, 1\}^T$ are the provided tags, learn a relevance function $f(x, y; W)$ to estimate the tag set of an unannotated image, where $W$ denotes the parameters.
- We can also call it image tag prediction
- Maximum a posteriori prediction:

  $$\max_{y} f(x, y; W)$$

- Different dictionary settings give different problems:
  - Let $D_{tr}$ denote the dictionary used in training and $D_{te}$ the dictionary used in testing.
  - Image tagging: $D_{te} = D_{tr}$
  - Zero-shot learning: $D_{te} \cap D_{tr} = \emptyset$
  - Open vocabulary tagging: $D_{tr} \subset D_{te}$

7 / 35

Key Issues

- Image tagging bridges the image space and the semantic space, i.e., it seeks a mapping between the two
- Image space
  - how do you model the image space?
  - image features and representations
  - metrics in the image space
- Semantic space
  - how do you model the semantic space?
  - semantic ontology prior (e.g., WordNet)
  - topic models: tags that frequently occur in a certain topic (e.g., sand, beach)
  - tag co-occurrence in the tagged image dataset
  - handling incomplete tags
- Relevance mapping function
  - cross-modal projection between the visual and semantic spaces
  - methods can be categorized by their relevance function

8 / 35


General Approaches to Image Tagging

- Regression-based approaches
- Nearest neighbor-based approaches
- Transduction-based approaches

10 / 35

Regression-based Approaches

- The natural idea is to obtain a direct regression from the image space to the semantic space
- Parametric regression models can be used
- E.g., we can use linear regression

  $$y = Wx$$

  and the relevance function is:

  $$f(x, y; W) = e^{-\|Wx - y\|^2}$$

- Example: FastTag: Fast image tagging [2]
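A minimal NumPy sketch of the linear relevance function above may help make it concrete. The helper names `relevance` and `predict_tags`, the toy matrix `W`, and the 0.5 threshold are illustrative assumptions rather than part of the slides. Thresholding each entry of $Wx$ at 0.5 picks the binary vector closest to $Wx$, which is the maximizer of $f$ over $y \in \{0,1\}^T$.

```python
import numpy as np

def relevance(W, x, y):
    """f(x, y; W) = exp(-||W x - y||^2) for one image x and one tag vector y."""
    residual = W @ x - y
    return float(np.exp(-residual @ residual))

def predict_tags(W, x, threshold=0.5):
    """Round the regression output W x to the nearest binary tag vector."""
    return (W @ x >= threshold).astype(int)

# Toy example with d = 3 features and T = 2 tags (values chosen by hand).
W = np.array([[1.0, 0.0, 0.0],
              [0.0, 0.5, 0.5]])
x = np.array([1.0, 0.2, 0.0])

y_hat = predict_tags(W, x)               # W x = [1.0, 0.1], so tags are [1, 0]
score_pred = relevance(W, x, y_hat)      # exp(-(0.0**2 + 0.1**2))
score_flip = relevance(W, x, 1 - y_hat)  # the opposite tag vector scores lower
```

The predicted tag vector receives a higher relevance score than its complement, as expected from the definition of $f$.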

11 / 35

Nearest Neighbor-based Approaches

- Assumption: the tags of “similar images” are “semantically similar”
- Exemplar-based nonparametric methods
- Local structures of the two spaces are constructed
- Reconstruct an image based on its visual nearest neighbors
- Reconstruct a concept based on its semantic nearest neighbors
- The complexity of the learned hypotheses grows as the amount of training data increases.
- Example: TagProp: Discriminative Metric Learning in Nearest Neighbor Models for Image Auto-Annotation [1]

12 / 35

Transduction-based Approaches

- Transduction is a type of learning
- It does not distinguish between training and testing
- The majority of transduction-based approaches are founded on matrix factorization.

13 / 35


TagProp: Discriminative Metric Learning

Basic Ideas:

- Assume visually similar images share similar tag sets.
- Adapt flexibly to patterns in the dataset as more data becomes available.
- Learn the visual metric by maximizing the likelihood of the annotations in the set of training images.

15 / 35

TagProp: Predict the Tag Presence in Semantic Space

- Notation:
  - $N(i)$ is the set of images in the neighborhood of image $x_i$
  - $w$ is a tag
  - $y_i(w)$ takes a binary value representing the presence of tag $w$
  - $I(y_j(w)) \in \{\epsilon, 1 - \epsilon\}$ takes the value $1 - \epsilon$ iff image $j$ is annotated with tag $w$. This elevates rare tags.
- Weighting the nearest neighbors: to predict the presence likelihood of tag $w$ for image $i$:

  $$\hat{y}_i(w) = \sum_{x_j \in N(i)} \pi_{ij}\, I(y_j(w)) \qquad (1)$$

- Label score calibration by a sigmoid function:

  $$p(y_i(w)) = \frac{1}{1 + e^{-a_w \hat{y}_i(w) - b_w}} \qquad (2)$$

  where $\hat{y}_i(w)$ is the neighbor prediction of Eq. (1), $p(y_i(w))$ is the probability that tag $w$ is present in image $x_i$, and $(a_w, b_w)$ are the parameters to be learned for tag $w$.
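The two prediction steps, Eq. (1) and Eq. (2), can be sketched in a few lines of plain Python. The function names, the neighbor weights, and the values of `a_w` and `b_w` below are made up for illustration; in TagProp the weights come from the learned metric and the sigmoid parameters are learned per tag.

```python
import math

def neighbor_prediction(pi_i, neighbor_has_tag, eps=1e-5):
    """Eq. (1): weighted vote of the neighbors, with I(.) in {eps, 1 - eps}."""
    return sum(
        weight * ((1.0 - eps) if has_tag else eps)
        for weight, has_tag in zip(pi_i, neighbor_has_tag)
    )

def tag_probability(y_hat, a_w, b_w):
    """Eq. (2): per-tag sigmoid calibration of the neighbor prediction."""
    return 1.0 / (1.0 + math.exp(-a_w * y_hat - b_w))

pi_i = [0.5, 0.3, 0.2]                  # neighbor weights, summing to 1
neighbor_has_tag = [True, False, True]  # which neighbors carry tag w

# eps = 0 keeps the arithmetic easy to follow: 0.5 + 0.2 = 0.7
y_hat = neighbor_prediction(pi_i, neighbor_has_tag, eps=0.0)
p = tag_probability(y_hat, a_w=4.0, b_w=-2.0)   # sigmoid(4 * 0.7 - 2)
```

A larger `a_w` sharpens the response to the neighbor vote, which is one way to read the claim that the per-tag sigmoid helps rare tags: a small vote can still be mapped to a high probability.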

16 / 35

TagProp: Distance Metric in the Visual Space

- Define $d(x_i, x_j)$ as the distance between images $x_i$ and $x_j$ in the image space
- The distance metric $d(x_i, x_j)$ determines the nearest neighbors
- Softmax normalization:

  $$\pi_{ij} = \frac{e^{-d(x_i, x_j)}}{\sum_{x_l \in N(i)} e^{-d(x_i, x_l)}} \qquad (3)$$

  where $\pi_{ij}$ is the probability that $x_j$ is the nearest neighbor of $x_i$
- Examples of distance metrics:
  - Mahalanobis distance: $d_M(x_i, x_j) = (x_i - x_j)^T M (x_i - x_j)$, where $M$ is a positive semi-definite metric matrix.
  - Linearly combined distance: $d_\theta(x_i, x_j) = \theta^T d_{ij}$, where $d_{ij}$ is a vector collecting a set of base distances, e.g., SIFT $L_2$ distance, GIST $\chi^2$ distance.
- TagProp uses the linearly combined distance as the visual metric, parameterized by $\theta$ (to be learned)
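As a small illustration of Eq. (3) with the linearly combined distance, the sketch below turns a matrix of base distances into neighbor weights. The function name `neighbor_weights`, the array layout (one row of base distances per neighbor), and the numbers are assumptions for the example only.

```python
import numpy as np

def neighbor_weights(theta, base_distances):
    """Eq. (3) with d_theta = theta^T d_ij.

    base_distances has shape (K, M): M base distances for each of K neighbors.
    """
    d = base_distances @ theta     # combined distance to each neighbor
    logits = -(d - d.min())        # shifting by a constant leaves the softmax unchanged
    weights = np.exp(logits)
    return weights / weights.sum()

theta = np.array([1.0, 0.5])
base_distances = np.array([[0.2, 0.4],   # neighbor 0
                           [1.0, 1.0],   # neighbor 1, farther away
                           [0.2, 0.4]])  # neighbor 2, same distances as neighbor 0
pi = neighbor_weights(theta, base_distances)
```

Neighbors at equal combined distance receive equal weight, and the farther neighbor receives less, which is the behavior the softmax normalization is meant to give.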

17 / 35

TagProp: Metric Learning Objective

- Suppose the dictionary size is $T$
- Parameters of this model: $\left(\theta, \{a_w, b_w\}_{w=1}^{T}\right)$
- Objective function: maximize the data likelihood:

  $$\max_{\theta,\, a_w,\, b_w} L(\theta, a_w, b_w) = \sum_i \Big[ \sum_{w:\, y_i(w) = +1} c_w^{+} \log p(y_i(w)) + \sum_{w:\, y_i(w) = -1} c_w^{-} \log\big(1 - p(y_i(w))\big) \Big]$$

- $p(y_i(w))$ is defined in Eq. (2); $c_w^{-}$ and $c_w^{+}$ are the weights for negative and positive tags; define $c_{iw} \in \{c_w^{-}, c_w^{+}\}$.
- Projected gradient descent is used to enforce the constraint that all elements of $\theta$ are non-negative.
- Alternate between optimizing the two sets of parameters: $\{a_w, b_w\}_{w=1}^{T}$ and $\theta$.

18 / 35

TagProp: Learning Visual Metric

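- W.l.o.g., the loss function takes the form

  $$L(\theta) = \sum_{i,w} c_{iw} \log p(y_i(w))$$

- Fix $a_w, b_w$ and take the gradient with respect to the distance metric parameters:

  $$\frac{\partial L}{\partial \theta} = \sum_{i,w} c_{iw}\, [1 - p(y_i(w))]\, a_w\, \frac{\partial \hat{y}_i(w)}{\partial \theta} \qquad (4)$$

  where $\hat{y}_i(w)$ is the KNN prediction defined in Eq. (1).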

19 / 35

Projected gradient descent

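$$
\begin{aligned}
\frac{\partial \hat{y}_i(w)}{\partial \theta}
&= \sum_j \frac{\partial \pi_{ij}}{\partial \theta}\, I(y_j(w)) \\
&= \sum_j \left[ \frac{1}{Z_i} \frac{\partial e^{-\theta^T d_{ij}}}{\partial \theta} + e^{-\theta^T d_{ij}}\, \frac{\partial Z_i^{-1}}{\partial \theta} \right] I(y_j(w)) \\
&= \sum_j \pi_{ij} \Big[ -d_{ij} + \sum_l \pi_{il}\, d_{il} \Big] I(y_j(w)) \\
&= \sum_j \pi_{ij} \big[ \bar{d}_i - d_{ij} \big] I(y_j(w))
\end{aligned}
\qquad (5)
$$

- Here $\hat{y}_i(w)$ is the KNN prediction of Eq. (1), and $d_{ij}$ is the vector of base distances, so that $d_\theta(x_i, x_j) = \theta^T d_{ij}$.
- Normalization factor: $Z_i = \sum_j e^{-\theta^T d_{ij}}$.
- $\bar{d}_i = \sum_l \pi_{il}\, d_{il}$ is the ($\pi$-weighted) average distance over the K nearest neighbors.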
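One way to gain confidence in the closed-form gradient of Eq. (5) is to compare it with central finite differences on a toy neighborhood. The sketch below does that for a single image and a single tag. The function names, the random base distances, and the values of `I` and `theta` are assumptions made for the check, not data from the slides.

```python
import numpy as np

def weights(theta, D):
    """Softmax weights of Eq. (3) for d_theta = theta^T d_ij; D has shape (K, M)."""
    e = np.exp(-(D @ theta))
    return e / e.sum()

def prediction(theta, D, I):
    """Eq. (1): KNN prediction as the pi-weighted sum of I(y_j(w))."""
    return weights(theta, D) @ I

def prediction_grad(theta, D, I):
    """Eq. (5): sum_j pi_ij (d_bar_i - d_ij) I(y_j(w))."""
    pi = weights(theta, D)
    d_bar = pi @ D                      # weighted average base-distance vector
    return (pi * I) @ (d_bar - D)

rng = np.random.default_rng(0)
D = rng.uniform(0.0, 2.0, size=(5, 3))    # K = 5 neighbors, M = 3 base distances
I = np.array([0.9, 0.1, 0.9, 0.1, 0.1])   # I(y_j(w)) with eps = 0.1
theta = np.array([0.5, 1.0, 0.2])

analytic = prediction_grad(theta, D, I)

# Central finite differences along each coordinate of theta.
h = 1e-6
numeric = np.array([
    (prediction(theta + h * e, D, I) - prediction(theta - h * e, D, I)) / (2 * h)
    for e in np.eye(3)
])
```

The two vectors should agree to within the finite-difference error, which supports the sign convention $\bar{d}_i - d_{ij}$ in the last line of Eq. (5).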

20 / 35

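TagProp: Training Algorithm

Iterative projected gradient descent:

Input: image dataset $\{(x_i, y_i)\}$
Output: model $\theta$, $\{a_w, b_w\}$
while the stopping criterion is not met do
  1. Update:

     $$\theta^{t} = \theta^{t-1} + \alpha_t \frac{\partial L}{\partial \theta}$$

     where $\partial L / \partial \theta$ is given by Eqs. (4) and (5). The sign is positive because $L$ is the log-likelihood being maximized; equivalently, this is a descent step on $-L$.
  2. Select the step size $\alpha_t$ by backtracking line search.
  3. Project: $\theta = \max(0, \theta)$.
  4. for each tag $w$ do
       Update $a_w^t, b_w^t$ using the gradient of the log-likelihood with $p$ from Eq. (2):

       $$\sum_i c_{iw} \left( 1 - \frac{1}{1 + \exp(-a_w \hat{y}_i(w) - b_w)} \right) y_i(w)\, \hat{y}_i(w)$$

       where $\hat{y}_i(w)$ is the prediction of Eq. (1) and $y_i(w)$ is the tag label.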

21 / 35

TagProp: Summary

- The nearest neighbor method identifies non-linear decision boundaries based on multiple feature spaces from different image features.
- Distance metric learning adjusts distances in the visual space so that visual nearest neighbors align with nearest neighbors in the label-set (semantic) space.
- The word-specific sigmoid function improves recall for rare words.

22 / 35


FastTag: Basic Idea

Basic Ideas:

- Unlike TagProp, there is no need to compute the similarity of every pair of samples, which has complexity $\Omega(N^2)$.
- Learn a visual projection $W: \mathbb{R}^d \to \mathbb{R}^T$ that maps an image to its complete tag set.
- Learn a tag enrichment projection $B: \{0, 1\}^T \to \mathbb{R}^T$ that, given the existing tags, turns on the tags likely to co-occur with them.

24 / 35

FastTag: Co-regularized Regression

- Co-regularized regression learning

  $$L(B, W) = \frac{1}{N} \sum_{i=1}^{N} \|B y_i - W x_i\|^2 + \lambda \|W\|_2^2 + \gamma\, r(B) \qquad (6)$$

- With $B$ fixed, this is ridge regression in $W$. The regularization $r(B)$ learns the tag completion projection.
- The function is jointly convex in the parameters $W$ and $B$.
- It has a closed-form solution when one of them is fixed and the other is optimized.

25 / 35

Regularizing the Tag Enrichment Projection B

- Marginalized blank-out regularization $r(B)$:

  $$r(B) = \frac{1}{N} \sum_{i=1}^{N} E\big[ \|B \tilde{y}_i - y_i\|^2 \big]_{p(\tilde{y}_i | y_i)} \qquad (7)$$

  where $y_i$ is the (incomplete) tag vector provided by the training data and $\tilde{y}_i$ is a further corrupted copy of it, from which $B$ should recover $y_i$.
- Assume the observed tags are corrupted, with each tag corrupted independently.
- The regularization is the expected reconstruction error under the corruption distribution.
- Simulate the corruption from $y_i$ to $\tilde{y}_i$ by randomly removing entries of $y_i$ with probability $p$:
  - $p(\tilde{y}_i(t) = 0) = p$ and $p(\tilde{y}_i(t) = y_i(t)) = 1 - p$
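The blank-out corruption described above is simple to simulate. The sketch below uses only the standard library. The function name `blank_out`, the example tag vector, and the fixed seed are assumptions made for illustration. Averaging many corrupted copies should approach $(1 - p)\, y$, the expectation that the marginalized regularizer relies on.

```python
import random

def blank_out(y, p, rng):
    """Return a corrupted copy of y: each entry is set to 0 with probability p."""
    return [0 if rng.random() < p else value for value in y]

rng = random.Random(0)     # fixed seed, only for reproducibility
y = [1, 0, 1, 1, 0]        # an observed binary tag vector
p = 0.3

samples = [blank_out(y, p, rng) for _ in range(10000)]

# Empirical mean of each entry over the corrupted copies.
mean = [sum(s[t] for s in samples) / len(samples) for t in range(len(y))]
```

Entries that were already 0 stay 0, and entries that were 1 survive with probability $1 - p$, so the corruption can only remove tags, never add them.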

26 / 35

FastTag: Handling Co-occurring Tags

- How does the projection $B$ complete co-occurring tags?
  - Consider a frequently co-occurring tag pair $y_s, y_t$.
- $P(y_s, y_t, \tilde{y}_s, \tilde{y}_t) = P(y_s, y_t) \cdot p(\tilde{y}_s | y_s)\, p(\tilde{y}_t | y_t)$, where $y_s, y_t$ are two arbitrary dimensions of $y$.
- If $y_s$ and $y_t$ are correlated, there are likely to be many samples with one of the two tags missing after corruption.
- $P(y_s, y_t, \tilde{y}_s, \tilde{y}_t) > P(y_s) P(y_t) \cdot C^2$, where $C = p(\tilde{y}_i | y_i)$ is the same for all tags.

27 / 35

FastTag: Rewrite the Regularization

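- We can rewrite the regularization as

  $$
  \begin{aligned}
  r(B) &= \frac{1}{N} \sum_{i=1}^{N} E\big[ \|B \tilde{y}_i - y_i\|^2 \big]_{p(\tilde{y}_i | y_i)} \\
       &= \frac{1}{N}\, \mathrm{trace}\big( B Q B^T - 2 P B + Y Y^T \big)
  \end{aligned}
  \qquad (8)
  $$

  where $P = \sum_{i=1}^{N} y_i\, E[\tilde{y}_i]^T$ and $Q = \sum_{i=1}^{N} E[\tilde{y}_i \tilde{y}_i^T]$.
- Uniform blank-out corruption has:
  - Expected value of the corruption: $E[\tilde{y}_i]_{p(\tilde{y}_i | y_i)} = (1 - p)\, y_i$
  - Variance matrix: $V[\tilde{y}_i]_{p(\tilde{y}_i | y_i)} = p(1 - p)\, \sigma(y_i y_i^T)$, where $\sigma(\cdot)$ keeps the diagonal of a matrix and sets all off-diagonal entries to zero
- So we have:

  $$P = (1 - p)\, Y Y^T, \qquad Q = (1 - p)^2\, Y Y^T + p(1 - p)\, \sigma(Y Y^T)$$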
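The closed forms for $P$ and $Q$ can be checked exactly on a tiny example by enumerating every possible blank-out mask and weighting it by its probability. The sketch below does this for three tags. The function names and the toy matrix `Y` (one column per image) are assumptions for the example. A single mask is shared by all images inside the loop, which is enough because the expectations of $P$ and $Q$ are sums of per-image terms.

```python
import itertools
import numpy as np

def closed_form_PQ(Y, p):
    """P and Q for uniform blank-out with removal probability p; Y is (T, N)."""
    S = Y @ Y.T
    P = (1 - p) * S
    Q = (1 - p) ** 2 * S + p * (1 - p) * np.diag(np.diag(S))
    return P, Q

def enumerated_PQ(Y, p):
    """The same expectations, by enumerating all keep/remove masks (small T only)."""
    T = Y.shape[0]
    P = np.zeros((T, T))
    Q = np.zeros((T, T))
    for mask in itertools.product([0, 1], repeat=T):
        m = np.array(mask)
        prob = np.prod(np.where(m == 1, 1 - p, p))   # probability of this mask
        Y_tilde = Y * m[:, None]                     # blank out the masked tags
        P += prob * (Y @ Y_tilde.T)
        Q += prob * (Y_tilde @ Y_tilde.T)
    return P, Q

Y = np.array([[1, 0, 1, 1],
              [1, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)   # T = 3 tags, N = 4 images
P_cf, Q_cf = closed_form_PQ(Y, p=0.3)
P_en, Q_en = enumerated_PQ(Y, p=0.3)
```

The diagonal correction in $Q$ appears because a tag is always kept or removed together with itself: $E[\tilde{y}(t)^2] = (1 - p)\, y(t)^2$, whereas two distinct tags survive jointly with probability $(1 - p)^2$.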

28 / 35

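FastTag: Putting All into the Objective Function

- Putting all of this together, the objective function is:

  $$
  \begin{aligned}
  L(B, W) ={} & \frac{1}{N} \sum_{i=1}^{N} \|B y_i - W x_i\|^2 + \lambda \|W\|_2^2 \\
              & + \frac{\gamma}{N}\, \mathrm{trace}\big( B Q B^T - 2 P B + Y Y^T \big)
  \end{aligned}
  \qquad (9)
  $$

- Training algorithm: block-coordinate descent optimization:

  Input: dataset $X, Y$
  Output: projection parameters $W$ and $B$
  while $W$ and $B$ have not converged do
    1. Update $W = B Y X^T (X X^T + N \lambda I)^{-1}$
    2. Update $B = (\gamma P + W X Y^T)(\gamma Q + Y Y^T)^{-1}$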
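The alternating updates above can be sketched in NumPy as follows. This is an illustrative sketch under stated assumptions, not the authors' implementation: the function names, the identity initialization of `B`, the fixed iteration count, the hyperparameter values, and the random toy data are all made up for the example. Data are stored column-wise (`X` is $d \times N$, `Y` is $T \times N$), and the matrix inverses are applied through linear solves.

```python
import numpy as np

def blankout_stats(Y, p):
    """P and Q for uniform blank-out with removal probability p."""
    S = Y @ Y.T
    P = (1 - p) * S
    Q = (1 - p) ** 2 * S + p * (1 - p) * np.diag(np.diag(S))
    return P, Q

def objective(W, B, X, Y, P, Q, lam, gamma):
    """Regression loss, ridge penalty, and gamma-weighted trace regularizer."""
    N = X.shape[1]
    fit = np.sum((B @ Y - W @ X) ** 2) / N
    # trace(P B^T) equals trace(P B) here because P is symmetric.
    reg = np.trace(B @ Q @ B.T - 2 * P @ B.T + Y @ Y.T) / N
    return fit + lam * np.sum(W ** 2) + gamma * reg

def fasttag_fit(X, Y, lam=0.1, gamma=1.0, p=0.3, n_iter=10):
    """Alternate the two closed-form updates; X is (d, N), Y is (T, N)."""
    d, N = X.shape
    P, Q = blankout_stats(Y, p)
    A = X @ X.T + N * lam * np.eye(d)    # symmetric positive definite
    C = gamma * Q + Y @ Y.T              # symmetric
    B = np.eye(Y.shape[0])               # assumed starting point
    history = []
    for _ in range(n_iter):
        # W = B Y X^T A^{-1}, computed with a linear solve
        W = np.linalg.solve(A, (B @ Y @ X.T).T).T
        # B = (gamma P + W X Y^T) C^{-1}
        B = np.linalg.solve(C, (gamma * P + W @ X @ Y.T).T).T
        history.append(objective(W, B, X, Y, P, Q, lam, gamma))
    return W, B, history

rng = np.random.default_rng(0)
X = rng.normal(size=(4, 30))                    # d = 4 features, N = 30 images
Y = (rng.random((3, 30)) < 0.4).astype(float)   # T = 3 sparse tags
Y[:, 0] = 1.0                                   # every tag occurs at least once
W, B, history = fasttag_fit(X, Y)
scores = W @ X[:, 0]                            # test-time tag scores for one image
```

Because each update minimizes the objective exactly over one block of parameters, the recorded objective values should never increase. At test time only the product `W @ x` is needed.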

29 / 35

FastTag: Extension

- Prior knowledge of tag corruption could be added to $r(B)$.
- Language models could also be used for tag correlation modeling.

30 / 35

Evaluation metric

- As the image tagging problem was motivated by text-based image retrieval, performance is usually evaluated by retrieval precision and recall for individual tags.
- All images are annotated with the five most relevant tags (i.e., the tags that have the highest prediction values).
- Precision (P) and recall (R) are computed for each tag.
- Both are combined in the F1 score

  $$F_1 = \frac{2 P R}{P + R}$$
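The evaluation protocol above can be sketched in plain Python. The helper names, the toy scores, and the choice of $k = 2$ (instead of five, to keep the example small) are assumptions for illustration.

```python
def f1_score(precision, recall):
    """F1 = 2PR / (P + R), taken to be 0 when both are 0."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

def top_k_tags(scores, k=5):
    """Indices of the k highest-scoring tags for one image."""
    return sorted(range(len(scores)), key=lambda t: scores[t], reverse=True)[:k]

def per_tag_precision_recall(predicted, truth, tag):
    """predicted and truth hold one set of tag indices per image."""
    retrieved = sum(tag in p for p in predicted)
    relevant = sum(tag in t for t in truth)
    correct = sum(tag in p and tag in t for p, t in zip(predicted, truth))
    precision = correct / retrieved if retrieved else 0.0
    recall = correct / relevant if relevant else 0.0
    return precision, recall

# Three images, four tags; annotate each image with its k = 2 best tags.
scores = [[0.9, 0.1, 0.8, 0.2],
          [0.2, 0.7, 0.1, 0.6],
          [0.5, 0.4, 0.9, 0.1]]
truth = [{0}, {1, 2}, {2}]
predicted = [set(top_k_tags(s, k=2)) for s in scores]

precision, recall = per_tag_precision_recall(predicted, truth, tag=2)
f1 = f1_score(precision, recall)
```

Tag 2 is predicted for two images and is correct for one of them, and it is relevant for two images, so precision and recall are both 0.5. The same function applied to percentages, `f1_score(45, 34)`, gives about 38.7.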

31 / 35

Evaluation: Precision and Recall

- IAPRTC12 dataset:
  - 19,627 images of sports, action, people, landscapes, etc.
  - 291 tags are used.
- Results:

  Name      P (%)   R (%)   F1 (%)
  TagProp   45      34      39
  FastTag   47      26      34

- FastTag performs slightly worse than TagProp in F1 (34 vs. 39).
- However, FastTag achieves a very significant speedup over TagProp in both training and testing.

32 / 35

Evaluation: Training Time Complexity

Figure: F1 score and training times on the IAPRTC12 dataset, in log scale. The graphs compare the results of FastTag (red dot) with the least-squares baseline (green square) and the TagProp algorithm (blue diamond).

33 / 35

Evaluation: Testing Time Complexity

- TagProp has $O(n)$ test-time complexity, where $n$ is the number of training examples, because each query example requires a neighbor lookup during testing.
- FastTag has constant test-time complexity with respect to $n$

34 / 35

References

[1] M. Guillaumin, T. Mensink, J. Verbeek, and C. Schmid. TagProp: Discriminative metric learning in nearest neighbor models for image auto-annotation. In International Conference on Computer Vision (ICCV), 2009.

[2] M. Chen, A. Zheng, and K. Q. Weinberger. Fast image tagging. In Proceedings of the 30th International Conference on Machine Learning, 2013.

35 / 35
