Automatic Image Tagging

Ying Wu
Electrical Engineering and Computer Science
Northwestern University, Evanston, IL 60208
[email protected]
http://www.eecs.northwestern.edu/~yingwu



Outline

Introduction to the Image Tagging Problem

General Approaches to Image Tagging

The TagProp method

The FastTag method


What is Image Tagging?

• A picture is worth a thousand words.
• But a few words are enough to describe it roughly.
• Image tags
  • a small set of words that describe the image well
  • an image can never be tagged "completely"
  • image tags are usually sparse and incomplete
• The task:
  • given a predefined dictionary of words, mostly nouns naming entities; the dictionary may contain a few thousand words
  • the tags provided for an image are usually sparse and incomplete
  • for an unannotated image, automatically assign a set of words (tags) that describe the image content


Why Do We Need to Tag Images?

• Image tags provide high-level descriptions of images.
• They provide natural image categories and indices.
• Some example applications
  • Annotate large-scale image datasets for text-based image retrieval (e.g. Google, Yahoo, Flickr).
  • Organize personal photo albums (e.g. Google Photos).
  • Generate natural-language descriptions (image captioning) to help visually impaired people.


How Difficult is Image Tagging?

• What we can obtain from images is only low-level visual content
  • visual features
  • image segments, shapes, structures
  • etc.
• However, the tags are high-level semantic concepts.
• There is a big semantic gap between low-level visual content and high-level semantic concepts.
• Challenges
  • It is difficult to map between low-level content and high-level concepts.
  • The dictionary contains a large number of words, so the output tag space is combinatorially large.
  • Severe frequency imbalance of tags in the training data.
  • Incomplete image tags in the training dataset.
  • Image tags used for training are error-prone if they are crawled automatically from web data.


Problem Definition

Figure: The image tagging problem

Given $\{(x_i, y_i)\}_{i=1}^{N}$, where $x_i \in \mathbb{R}^d$ is an image representation and $y_i \in \{0, 1\}^T$ is the vector of provided tags, learn a relevance function $f(x, y; W)$ to estimate the tag set of an unannotated image.


Problem Definition

• Given $\{(x_i, y_i)\}_{i=1}^{N}$, where $x_i \in \mathbb{R}^d$ is an image representation and $y_i \in \{0, 1\}^T$ is the vector of provided tags, learn a relevance function $f(x, y; W)$ to estimate the tag set of an unannotated image, where $W$ denotes the parameters.
• We can also call it image tag prediction.
• Maximum posterior prediction:

$$\max_{y} f(x, y; W)$$

• Various problems arise from different dictionary settings:
  • Let $D_{tr}$ be the dictionary used in training and $D_{te}$ the dictionary used in testing.
  • Image tagging: $D_{te} = D_{tr}$
  • Zero-shot learning: $D_{te} \cap D_{tr} = \emptyset$
  • Open vocabulary tagging: $D_{tr} \subset D_{te}$


Key Issues

• Image tagging bridges the image space and the semantic space, i.e. it seeks a mapping between the two.
• Image space
  • how do you model the image space?
  • image features and representations
  • metrics in the image space
• Semantic space
  • how do you model the semantic space?
  • semantic ontology prior (e.g. WordNet)
  • topic models: tags that frequently occur under a certain topic (e.g. sand, beach)
  • tag co-occurrence in the tagged image dataset
  • handling incomplete tags
• Relevance mapping function
  • cross-modal projection between the visual and semantic spaces
  • methods can be categorized by their relevance function


General Approaches to Image Tagging

• Regression-based approaches
• Nearest neighbor-based approaches
• Transduction-based approaches


Regression-based Approaches

• The natural idea is to obtain a direct regression from the image space to the semantic space.
• Parametric regression models can be used.
• E.g., we can use linear regression

$$y = Wx$$

and the relevance function is:

$$f(x, y; W) = e^{-\|Wx - y\|^2}$$

• Example: FastTag: Fast image tagging [2]
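
The linear-regression relevance function above can be sketched in a few lines of plain Python. This is only an illustration: the matrix `W`, the feature vector `x`, and the candidate tag vectors are made-up toy values, and searching over an explicit candidate list is an assumption that only works for tiny dictionaries.

```python
import math

def relevance(W, x, y):
    """f(x, y; W) = exp(-||Wx - y||^2), with W given as a list of T rows."""
    prediction = [sum(w_k * x_k for w_k, x_k in zip(row, x)) for row in W]
    squared_error = sum((p - t) ** 2 for p, t in zip(prediction, y))
    return math.exp(-squared_error)

# Hypothetical toy problem: T = 3 tags, d = 2 features.
W = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
x = [1.0, 0.0]

f_match = relevance(W, x, [1, 0, 1])   # Wx equals this tag vector exactly
f_other = relevance(W, x, [0, 1, 0])   # every entry is off by one

# Pick the most relevant tag vector from a small candidate list.
candidates = [[1, 0, 1], [0, 1, 0], [1, 1, 1]]
best = max(candidates, key=lambda y: relevance(W, x, y))
```

The relevance is 1 when the regression output matches the tag vector exactly and decays exponentially with the squared error.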


Nearest Neighbor-based Approaches

• Assumption: the tags of "similar images" are "semantically similar".
• Exemplar-based nonparametric methods.
• Local structures of the two spaces are constructed.
• Reconstruct an image based on its visual nearest neighbors.
• Reconstruct a concept based on its semantic nearest neighbors.
• The complexity of the learned hypotheses grows as the amount of training data increases.
• Example: TagProp: Discriminative Metric Learning in Nearest Neighbor Models for Image Auto-Annotation [1]


Transduction-based Approaches

• Transduction is a type of learning.
• It does not distinguish between training and testing.
• The majority of transduction-based approaches are founded on matrix factorization.


TagProp: Discriminative Metric Learning

Basic Ideas:

• Assume visually similar images share similar tag sets.
• Adapt flexibly to patterns in the dataset as more data becomes available.
• Learn visual metrics by maximizing the likelihood of the annotations in a set of training images.


TagProp: Predict the Tag Presence in Semantic Space

• Notation:
  • $N(i)$ is the set of images in the neighborhood of image $x_i$
  • $w$ is a tag
  • $y_i(w)$ takes a binary value, representing the presence of tag $w$
  • $I(y_j(w)) \in \{\epsilon, 1 - \epsilon\}$ takes the value $1 - \epsilon$ iff image $j$ is annotated with tag $w$. This elevates rare tags.
• Weighting the nearest neighbors: to predict the likelihood that tag $w$ is present in image $i$:

$$\hat{y}_i(w) = \sum_{x_j \in N(i)} \pi_{ij}\, I(y_j(w)) \qquad (1)$$

• Label score calibration by a sigmoid function:

$$p(y_i(w)) = \frac{1}{1 + e^{-a_w \hat{y}_i(w) - b_w}} \qquad (2)$$

where $p(y_i(w))$ is the probability that tag $w$ is present in image $x_i$, and $(a_w, b_w)$ are the parameters to be learned for tag $w$.
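
Eq. (1) and Eq. (2) can be sketched as two small functions. This is a minimal illustration, assuming the neighbor weights $\pi_{ij}$ are already given; the values of `eps`, `a_w`, and `b_w` below are arbitrary placeholders, not learned parameters.

```python
import math

def weighted_vote(pi_i, neighbor_has_tag, eps=1e-5):
    """Eq. (1): sum over neighbors of pi_ij * I(y_j(w)), with I in {eps, 1 - eps}."""
    return sum(p * ((1.0 - eps) if has_tag else eps)
               for p, has_tag in zip(pi_i, neighbor_has_tag))

def calibrated_probability(score, a_w, b_w):
    """Eq. (2): sigmoid applied to a_w * score + b_w."""
    return 1.0 / (1.0 + math.exp(-a_w * score - b_w))

# Three neighbors of image i; the first and third carry tag w.
pi_i = [0.5, 0.3, 0.2]
neighbor_has_tag = [True, False, True]

score = weighted_vote(pi_i, neighbor_has_tag)
prob = calibrated_probability(score, a_w=4.0, b_w=-2.0)
```

With these placeholder parameters the sigmoid crosses 0.5 at a score of 0.5, so a weighted vote of about 0.7 maps to a probability above one half.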


TagProp: Distance Metric in the Visual Space

• Let $d(x_i, x_j)$ be the distance between images $x_i$ and $x_j$ in the image space.
• The distance metric $d(x_i, x_j)$ determines the nearest neighbors.
• Softmax normalization:

$$\pi_{ij} = \frac{e^{-d(x_i, x_j)}}{\sum_{x_l \in N(i)} e^{-d(x_i, x_l)}} \qquad (3)$$

where $\pi_{ij}$ is the probability that $x_j$ is the nearest neighbor of $x_i$.
• Examples of distance metrics:
  • Mahalanobis distance: $d_M(x_i, x_j) = (x_i - x_j)^T M (x_i - x_j)$, where $M$ is a positive semi-definite metric matrix.
  • Linearly combined distance: $d_\theta(x_i, x_j) = \theta^T d_{ij}$, where $d_{ij}$ is a vector collecting a set of base distances, e.g. SIFT $L_2$ distance, GIST $\chi^2$ distance.
• TagProp uses the linearly combined distance as the visual metric, parameterized by $\theta$ (to be learned).
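
The softmax weights of Eq. (3) with the linearly combined distance can be sketched as follows. The base distances and `theta` are hypothetical numbers, and subtracting the smallest distance before exponentiating is an added numerical-stability step that does not change the result.

```python
import math

def combined_distance(theta, d_ij):
    """d_theta(x_i, x_j) = theta^T d_ij for one vector of base distances."""
    return sum(t * d for t, d in zip(theta, d_ij))

def neighbor_weights(theta, base_distances):
    """Eq. (3): softmax of the negative combined distances over the neighbors of x_i."""
    dists = [combined_distance(theta, d_ij) for d_ij in base_distances]
    d_min = min(dists)  # shift for numerical stability; cancels in the ratio
    unnormalized = [math.exp(-(d - d_min)) for d in dists]
    z = sum(unnormalized)
    return [u / z for u in unnormalized]

# Two base distances (say, one per feature type) to three neighbors.
base_distances = [[0.2, 0.4], [1.0, 2.0], [0.2, 0.4]]
pi_i = neighbor_weights([1.0, 0.5], base_distances)
uniform = neighbor_weights([0.0, 0.0], base_distances)
```

Setting `theta` to zero makes all neighbors equally weighted, which shows how `theta` controls the influence of each base distance.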


TagProp: Metric Learning Objective

• Suppose the dictionary size is $T$.
• Parameters of this model: $(\theta, \{a_w, b_w\}_{w=1}^{T})$
• Objective function: maximize the data likelihood (an absent tag is coded here as $y_i(w) = -1$):

$$\max_{\theta, a_w, b_w} L(\theta, a_w, b_w) = \sum_i \Big[ \sum_{w:\, y_i(w) = +1} c_w^{+} \log p(y_i(w)) + \sum_{w:\, y_i(w) = -1} c_w^{-} \log\big(1 - p(y_i(w))\big) \Big]$$

• $p(y_i(w))$ is defined in Eq. (2); $c_w^{-}$ and $c_w^{+}$ are the weights for negative and positive tags; define $c_{iw} \in \{c_w^{-}, c_w^{+}\}$.
• Projected gradient descent keeps all elements of $\theta$ non-negative.
• Alternate the optimization between the two sets of parameters: $\{a_w, b_w\}_{w=1}^{T}$ and $\theta$.


TagProp: Learning Visual Metric

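• Without loss of generality, the objective takes the form

$$L(\theta) = \sum_{i,w} c_{iw} \log p(y_i(w))$$

• Fix $\{a_w, b_w\}$ and take the gradient with respect to the distance-metric parameters:

$$\frac{\partial L}{\partial \theta} = \sum_{i,w} c_{iw}\, [1 - p(y_i(w))]\, a_w\, \frac{\partial \hat{y}_i(w)}{\partial \theta} \qquad (4)$$

where $\hat{y}_i(w)$ is the KNN prediction defined in Eq. (1).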


Projected gradient descent

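$$\begin{aligned}
\frac{\partial \hat{y}_i(w)}{\partial \theta}
&= \sum_j \frac{\partial \pi_{ij}}{\partial \theta}\, I(y_j(w)) \\
&= \sum_j \Big[ \frac{1}{Z_i} \frac{\partial e^{-\theta^T d_{ij}}}{\partial \theta} + e^{-\theta^T d_{ij}} \frac{\partial Z_i^{-1}}{\partial \theta} \Big]\, I(y_j(w)) \\
&= \sum_j \pi_{ij} \Big[ -d_{ij} + \sum_l \pi_{il}\, d_{il} \Big]\, I(y_j(w)) \\
&= \sum_j \pi_{ij}\, [\bar{d}_i - d_{ij}]\, I(y_j(w))
\end{aligned} \qquad (5)$$

• Normalization factor: $Z_i = \sum_j e^{-\theta^T d_{ij}}$.
• $\bar{d}_i = \sum_l \pi_{il}\, d_{il}$ is the ($\pi$-weighted) average distance over the K nearest neighbors.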
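
A quick way to sanity-check Eq. (5) is to compare it with a finite-difference derivative of the prediction in Eq. (1). The sketch below does that on made-up data; `D` (base-distance vectors to the neighbors), `I` (the indicator values), and `theta` are hypothetical, and the distance is assumed to be the linear combination $\theta^T d_{ij}$.

```python
import math

def weights(theta, D):
    """Softmax weights pi_ij from the combined distances theta^T d_ij."""
    scores = [-sum(t * d for t, d in zip(theta, d_ij)) for d_ij in D]
    top = max(scores)  # shift for numerical stability
    exps = [math.exp(s - top) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]

def prediction(theta, D, I):
    """Eq. (1): weighted vote of the neighbors."""
    return sum(p * i for p, i in zip(weights(theta, D), I))

def prediction_gradient(theta, D, I):
    """Eq. (5): sum_j pi_ij * (d_bar - d_ij) * I_j, one component per base distance."""
    pi = weights(theta, D)
    dims = range(len(theta))
    d_bar = [sum(pi[j] * D[j][k] for j in range(len(D))) for k in dims]
    return [sum(pi[j] * (d_bar[k] - D[j][k]) * I[j] for j in range(len(D)))
            for k in dims]

D = [[0.2, 0.9], [1.0, 0.1], [0.5, 0.5]]
I = [0.99999, 0.00001, 0.99999]
theta = [0.4, 1.1]

analytic = prediction_gradient(theta, D, I)
h = 1e-6
numeric = []
for k in range(len(theta)):
    up = list(theta); up[k] += h
    down = list(theta); down[k] -= h
    numeric.append((prediction(up, D, I) - prediction(down, D, I)) / (2 * h))
```

The two lists `analytic` and `numeric` should agree up to the finite-difference error.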


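TagProp: Training Algorithm

Iterative projected gradient method:

Input: image dataset $\{(x_i, y_i)\}$
Output: model $\theta$, $\{a_w, b_w\}$
while the stop criterion is not met do
  1. Update $\theta^{t} = \theta^{t-1} + \alpha_t \frac{\partial L}{\partial \theta}$, where $\frac{\partial L}{\partial \theta}$ is given by Eq. (4) and (5) (a step along the gradient, since $L$ is maximized);
  2. Select the step size $\alpha_t$ by backtracking line search;
  3. Project $\theta = \max(0, \theta)$;
  4. for each tag $w$ do
       Update $a_w^{t}, b_w^{t}$ by the gradient on Eq. (2), shown here for $a_w$ (the gradient for $b_w$ drops the factor $\hat{y}_i(w)$):

$$\sum_i c_{iw} \Big(1 - \frac{1}{1 + \exp(-a_w \hat{y}_i(w) - b_w)}\Big)\, \hat{y}_i(w)\, y_i(w)$$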


TagProp: Summary

• The nearest-neighbor method identifies non-linear decision boundaries based on multiple feature spaces from different image features.
• Distance metric learning adjusts distances in the visual space so that visual nearest neighbors align with nearest neighbors in the semantic (label-set) space.
• The word-specific sigmoid function improves recall for rare words.


FastTag: Basic Idea

Basic Ideas:

• No need to compute the similarity of every sample pair as TagProp does, whose complexity is $\Omega(N^2)$.
• Learn a visual projection $W: \mathbb{R}^d \to \mathbb{R}^T$ that maps an image to its complete tag set.
• Learn a tag enrichment projection $B: \{0, 1\}^T \to \mathbb{R}^T$ that turns on tags likely to co-occur with the existing ones.


FastTag: Co-regularized Regression

• Co-regularized regression learning

$$L(B, W) = \frac{1}{N} \sum_{i=1}^{N} \|B y_i - W x_i\|^2 + \lambda \|W\|_2^2 + \gamma\, r(B) \qquad (6)$$

• With $B$ fixed, this is ridge regression. The regularizer $r(B)$ learns the tag completion projection.
• The function is jointly convex in the parameters $W$ and $B$.
• It has a closed-form solution when one is fixed and the other is optimized.
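
As a worked step for the closed-form claim, the $W$ update can be derived from Eq. (6) with $B$ fixed. This is a sketch under stated assumptions: the data term is the squared Euclidean norm, $\|W\|_2^2$ is read as the squared Frobenius norm, and the samples are stacked as columns of $X \in \mathbb{R}^{d \times N}$ and $Y \in \{0,1\}^{T \times N}$.

```latex
\begin{aligned}
L(W) &= \tfrac{1}{N}\,\|BY - WX\|_F^2 + \lambda\,\|W\|_F^2 + \text{const}, \\
\frac{\partial L}{\partial W} &= \tfrac{2}{N}\,(WX - BY)\,X^T + 2\lambda W = 0, \\
W\,(XX^T + N\lambda I) &= BYX^T, \\
W &= BYX^T\,(XX^T + N\lambda I)^{-1}.
\end{aligned}
```

For $\lambda > 0$ the matrix $XX^T + N\lambda I$ is positive definite, so the inverse exists and the solution is unique.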


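Regularizing the Tag Enrichment Projection B

• Marginalized blank-out regularization $r(B)$:

$$r(B) = \frac{1}{N} \sum_{i=1}^{N} E\big[\|B \tilde{y}_i - y_i\|^2\big]_{p(\tilde{y}_i | y_i)} \qquad (7)$$

where $\tilde{y}_i$ is an incomplete tag vector and $y_i$ is the complete tag vector that $B$ should recover from it. The complete tags are not available, so the tags provided by the training data are used as $y_i$ and $\tilde{y}_i$ is obtained by corrupting them, as described in the following points.
• Assume the observed tags are corrupted, with each tag corrupted independently.
• The regularizer is the expected reconstruction error under the corruption distribution.
• Simulate the corruption of $y_i$ into $\tilde{y}_i$ by randomly removing entries of $y_i$ with probability $p$.
• $p(\tilde{y}_i(t) = 0) = p$ and $p(\tilde{y}_i(t) = y_i(t)) = 1 - p$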
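
The blank-out corruption can be simulated directly. The sketch below is only illustrative: the tag vector, the corruption probability, the sample count, and the fixed seed are arbitrary choices. Note that the regularizer itself is marginalized over the corruption, so sampling like this serves only to build intuition.

```python
import random

def blank_out(y, p, rng):
    """Return a corrupted copy of y: each entry is set to 0 with probability p."""
    return [0 if rng.random() < p else value for value in y]

rng = random.Random(0)  # fixed seed, only for reproducibility
y = [1, 0, 1, 1, 0]
p = 0.3

samples = [blank_out(y, p, rng) for _ in range(2000)]

# Empirical mean of each entry; it should be close to (1 - p) * y.
mean = [sum(s[t] for s in samples) / len(samples) for t in range(len(y))]
```

Entries that are already 0 stay 0, and entries equal to 1 survive with probability $1 - p$, so corruption only removes tags.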


FastTag: Handling Co-occurring Tags

• How does the projection $B$ complete co-occurring tags?
• Frequently co-occurring tag pair: $y_s, y_t$
• $P(y_s, y_t, \tilde{y}_s, \tilde{y}_t) = P(y_s, y_t) \cdot p(\tilde{y}_s | y_s)\, p(\tilde{y}_t | y_t)$, where $y_s, y_t$ are random dimensions of $y$.
• If $y_s$ and $y_t$ are correlated, there are likely many samples with one of the two tags missing after corruption.
• $P(y_s, y_t, \tilde{y}_s, \tilde{y}_t) > P(y_s) P(y_t) \cdot C^2$, where $C = p(\tilde{y}_i | y_i)$ is the same for all tags.


FastTag: Rewrite the Regularization

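• We can rewrite the regularization:

$$r(B) = \frac{1}{N} \sum_{i=1}^{N} E\big[\|B \tilde{y}_i - y_i\|^2\big]_{p(\tilde{y}_i | y_i)} = \frac{1}{N}\, \mathrm{trace}(B Q B^T - 2 P B + Y Y^T) \qquad (8)$$

where $P = \sum_{i=1}^{N} y_i\, E[\tilde{y}_i]^T$ and $Q = \sum_{i=1}^{N} E[\tilde{y}_i \tilde{y}_i^T]$.
• Uniform blank-out corruption has:
  • Expected value of the corruption: $E[\tilde{y}_i]_{p(\tilde{y}_i | y_i)} = (1 - p)\, y_i$
  • Variance matrix: $V[\tilde{y}_i]_{p(\tilde{y}_i | y_i)} = p(1 - p)\, \sigma(y_i y_i^T)$, where $\sigma(\cdot)$ keeps the diagonal of a matrix and sets its off-diagonal entries to zero
• So we have:

$$P = (1 - p)\, Y Y^T, \qquad Q = (1 - p)^2\, Y Y^T + p(1 - p)\, \sigma(Y Y^T)$$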
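
The two moments behind $P$ and $Q$ can be checked exactly for a single tag vector by enumerating every keep/remove pattern. The sketch below does this for a made-up vector `y` and probability `p`; the function names are illustrative only.

```python
from itertools import product

def exact_moments(y, p):
    """E[y~] and E[y~ y~^T] by enumerating all 2^T blank-out patterns."""
    T = len(y)
    mean = [0.0] * T
    second = [[0.0] * T for _ in range(T)]
    for keep in product([0, 1], repeat=T):
        prob = 1.0
        for k in keep:
            prob *= (1.0 - p) if k else p
        corrupted = [value * k for value, k in zip(y, keep)]
        for s in range(T):
            mean[s] += prob * corrupted[s]
            for t in range(T):
                second[s][t] += prob * corrupted[s] * corrupted[t]
    return mean, second

def closed_form_moments(y, p):
    """(1 - p) y and (1 - p)^2 y y^T + p (1 - p) diag(y y^T)."""
    T = len(y)
    mean = [(1.0 - p) * value for value in y]
    second = [[(1.0 - p) ** 2 * y[s] * y[t]
               + (p * (1.0 - p) * y[s] * y[t] if s == t else 0.0)
               for t in range(T)] for s in range(T)]
    return mean, second

y = [1, 0, 1, 1]
p = 0.3
mean_exact, second_exact = exact_moments(y, p)
mean_closed, second_closed = closed_form_moments(y, p)
```

Summing these per-sample moments over all training vectors gives the matrices $P$ and $Q$ stated above.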


FastTag: Putting All into the Objective Function

• Putting all these together, the objective function is:

$$L(B, W) = \frac{1}{N} \sum_{i=1}^{N} \|B y_i - W x_i\|^2 + \lambda \|W\|_2^2 + \frac{\gamma}{N}\, \mathrm{trace}(B Q B^T - 2 P B + Y Y^T) \qquad (9)$$

• Training algorithm: block-coordinate descent optimization:

Input: dataset $X, Y$
Output: projection parameters $W$ and $B$
repeat until $W$ and $B$ change little:
  1. Update $W = B Y X^T (X X^T + N \lambda I)^{-1}$
  2. Update $B = (\gamma P + W X Y^T)(\gamma Q + Y Y^T)^{-1}$
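
The two alternating updates can be sketched with NumPy. This is an illustration under stated assumptions rather than the authors' implementation: samples are stored as columns, the data term is squared, the regularizer is weighted by `gamma`, `B` starts at the identity, every tag is assumed to occur at least once (so the second inverse exists), and the random data and default hyperparameters are made up.

```python
import numpy as np

def fasttag_objective(B, W, X, Y, P, Q, lam, gamma):
    """Eq. (9): squared data term, ridge penalty on W, blank-out regularizer on B."""
    N = X.shape[1]
    data_term = np.sum((B @ Y - W @ X) ** 2) / N
    reg_b = np.trace(B @ Q @ B.T - 2.0 * P @ B + Y @ Y.T) / N
    return data_term + lam * np.sum(W ** 2) + gamma * reg_b

def fasttag_fit(X, Y, p=0.5, lam=0.1, gamma=1.0, max_iter=100, tol=1e-8):
    """Block-coordinate descent; X is d x N, Y is T x N."""
    d, N = X.shape
    T = Y.shape[0]
    S = Y @ Y.T
    P = (1.0 - p) * S
    Q = (1.0 - p) ** 2 * S + p * (1.0 - p) * np.diag(np.diag(S))
    ridge_inv = np.linalg.inv(X @ X.T + N * lam * np.eye(d))
    tags_inv = np.linalg.inv(gamma * Q + S)  # needs every tag to occur at least once
    B = np.eye(T)                            # hypothetical initialization
    W = np.zeros((T, d))
    for _ in range(max_iter):
        W_new = B @ Y @ X.T @ ridge_inv                    # step 1
        B_new = (gamma * P + W_new @ X @ Y.T) @ tags_inv   # step 2
        change = max(np.abs(W_new - W).max(), np.abs(B_new - B).max())
        W, B = W_new, B_new
        if change < tol:
            break
    return W, B, P, Q

rng = np.random.default_rng(0)
X = rng.normal(size=(4, 30))
Y = (rng.random(size=(3, 30)) < 0.4).astype(float)
Y[:, 0] = 1.0  # make sure every tag occurs at least once

W, B, P, Q = fasttag_fit(X, Y)
start = fasttag_objective(np.eye(3), np.zeros((3, 4)), X, Y, P, Q, 0.1, 1.0)
end = fasttag_objective(B, W, X, Y, P, Q, 0.1, 1.0)
tag_scores = W @ X[:, :1]  # predicted tag scores for the first image
```

Because each step minimizes the objective exactly over one block, `end` cannot exceed `start`; at test time only the product `W @ x` is needed.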


FastTag: Extension

• Prior knowledge of tag corruption could be added to $r(B)$.
• Language models could also be used for tag correlation modeling.


Evaluation metric

• As the image tagging problem was motivated by text-based image retrieval, performance is usually evaluated by retrieval precision and recall for individual tags.
• All images are annotated with the five most relevant tags (i.e. the tags that have the highest prediction values).
• Precision (P) and recall (R) are computed for each tag.
• Both factors are combined in the F1-score:

$$F_1 = \frac{2 P R}{P + R}$$
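
The evaluation protocol can be sketched as follows. The helper names and the toy predictions are hypothetical, and returning 0 when a tag is never predicted or never present is a convention chosen here, not something the notes specify.

```python
def top_k_tags(scores, k=5):
    """Indices of the k tags with the highest prediction scores."""
    order = sorted(range(len(scores)), key=lambda t: scores[t], reverse=True)
    return set(order[:k])

def per_tag_prf(predicted, truth, num_tags):
    """Precision, recall and F1 for each tag over a list of images."""
    results = []
    for t in range(num_tags):
        hits = sum(1 for p, g in zip(predicted, truth) if t in p and t in g)
        n_pred = sum(1 for p in predicted if t in p)
        n_true = sum(1 for g in truth if t in g)
        precision = hits / n_pred if n_pred else 0.0
        recall = hits / n_true if n_true else 0.0
        total = precision + recall
        f1 = 2 * precision * recall / total if total else 0.0
        results.append((precision, recall, f1))
    return results

# Three images, two tags; each set holds the tag indices of one image.
predicted = [{0, 1}, {0}, {1}]
truth = [{0}, {0, 1}, {1}]
scores = per_tag_prf(predicted, truth, num_tags=2)
top2 = top_k_tags([0.1, 0.9, 0.5], k=2)
```

Tag 0 is predicted correctly on both images that carry it, while tag 1 has one false positive and one miss, giving precision and recall of 0.5.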


Evaluation: Precision and Recall

• IAPRTC12 dataset:
  • 19,627 images of sports, action, people, landscapes, etc.
  • 291 tags are used.
• Results:

Name      P    R    F1
TagProp   45   34   39
FastTag   47   26   34

• FastTag has a lower F1 than TagProp (34 vs. 39): its precision is slightly higher (47 vs. 45), but its recall is lower (26 vs. 34).
• However, FastTag achieves a very significant speedup over TagProp in both training and testing.


Evaluation: Training Time Complexity

Figure: F1 score and training time on the IAPRTC12 dataset, in log scale. The graphs compare the results of FastTag (red dot) with the least-squares baseline (green square) and the TagProp algorithm (blue diamond).


Evaluation: Testing Time Complexity

• TagProp has $O(n)$ test-time complexity, where $n$ is the number of training examples, because each query example requires a neighbor lookup during testing.
• FastTag has constant test-time complexity with respect to $n$.


References

[1] M. Guillaumin, T. Mensink, J. Verbeek, and C. Schmid. TagProp: Discriminative metric learning in nearest neighbor models for image auto-annotation. In International Conference on Computer Vision (ICCV), 2009.

[2] M. Chen, A. Zheng, and K. Q. Weinberger. Fast image tagging. In Proceedings of the 30th International Conference on Machine Learning, January 2013.
