ICME 2013

IEEE International Conference on Multimedia & Expo 2013

Augmenting Descriptors for Fine-grained Visual Categorization Using Polynomial Embedding

Hideki Nakayama

Graduate School of Information Science and Technology The University of Tokyo

Outline

Background and Motivation Our solution

Polynomial embedding

Experiment Fine-grained categorization Comparison with state-of-the-art

Conclusion

2

Fine-grained visual categorization (FGVC) Distinguish hundreds of fine-grained objects

under a certain domain (e.g., species of animals and plants)

Complement to traditional object recognition problems

Caltech-256 [Griffin et al., 2007]

Caltech-Bird-200 [Welinder et al., 2010]

Generic Object Recognition FGVC

Yellow Warbler

Pririe

Warbler Pine Warbler Airplane Monitor Dog

V.S.

3

Motivation

We need highly discriminative features to distinguish visually very similar categories

Especially at local level.

4

Two basic ideas

1. Co-occurrence (correlation) of neighboring local descriptors Shaplet [Sabzmeydani et al., 2007] Covariance feature [Tuzel et al., 2006]

GGV [Harada et al., 2012]

Expected to capture middle-level local information Results in high-dimensional local features

2. State-of-the-art bag-of-words representation

Based on higher-order statistics of local features Fisher vector [Perronnin et al., 2010]

VLAD [Jegou et al., 2010]

Remarkably high-performance, enables linear classification Dimensionality increases in linear to the size of local features

☹ ☺

☺ ☹

(N: number of visual words, D: size of local features) ND

2ND

conflict

5

Our approach

Compress polynomials of neighboring local features vector with supervised dimensionality reduction Discriminative latent descriptor

Encode by means of bag-of-words (Fisher vector) Logistic regression classifier

1,000～1,0000 dim 64 dim Descriptor

(e.g. SIFT)

Dense sampling

polynomialvectors

latentdescriptor

categorylabel

CCA

(training)

Fishervector

logisticregressionclassifier

6

★● ●

Exploit co-occurrence information

e.g. SIFT

( )( )( )

=

+

−

Tyxyx

Tyxyx

Tyxyx

yx

yx

Vec

Vec

upperVec

),(),(

),(),(

),(),(

),(

2),(

δ

δ

vv

vv

vv

v

p

),( yxv),( yx σ−v ),( yx σ+vNeighbor

(Left) Neighbor (Right)

Descriptor at target position

×

×

Polynomial Vector

a Matrix of vector flattened:)(Vec

7

Exploit co-occurrence information More spatial information can be integrated with more

neighbors (but become high-dimensional)

( )( )( )

=

+

−

Tyxyx

Tyxyx

Tyxyx

yx

yx

Vec

Vec

upperVec

),(),(

),(),(

),(),(

),(

2),(

δ

δ

vv

vv

vv

v

p

( )

= T

yxyx

yxyx upperVec ),(),(

),(0),( vv

vp

( )( )( )( )( )

=

+

+

−

−

Tyxyx

Tyxyx

Tyxyx

Tyxyx

Tyxyx

yx

yx

Vec

Vec

Vec

Vec

upperVec

),(),(

),(),(

),(),(

),(),(

),(),(

),(

4),(

δ

δ

δ

δ

vv

vv

vv

vv

vv

v

p

★

★● ●

★● ●

●

●

0-neighbor

2-neighbors

4-neighbors

2,144dim

10,336dim

18,528dim

8

Descriptor(e.g. SIFT)

Dense sampling

polynomialvectors

latentdescriptor

Fishervector


categorylabel

CCA

(training)

9

Patch feature and label pairs Category label: Binary occurrence vector

Strong supervision assumption

Most patches should be related to the content (category) (Somewhat) justified for FGVC considering the applications Users will target the object, sometimes can give segmentation

Supervised dimensionality reduction

Allium triquetrum

0

010

0

010

0

010

0

010

(We do not perform manual segmentation in this work, though) 10

Canonical Correlation Analysis (CCA) [Hotelling, 1936]

( ) ( ) tslltpps

lp

andbetweenncorrelatiothemaximizethat,tionstransformalinearfinds CCA

featurelabel: ls),(polynomiafeaturepatch:

−=−= TT BA

Supervised dimensionality reduction

( )( )IBCBBCBCCC

IACAACACCC

llT

llplpplp

ppT

pplpllpl

=Λ=

=Λ=−

−

21

21

nscorrelatiocanonical:matricescovariance:

ΛC

p l

Canonical space

s t

s

t

Image feature Labels feature

( )pps −= TALatent descriptor

1,000～1,0000 dim

64 dim

(discriminative)

11

Descriptor(e.g. SIFT)

Dense sampling

polynomialvectors

latentdescriptor

Fishervector


categorylabel

CCA

(training)

12

Fisher Vector [Perronnin et al., 2010]

State-of-the-art bag-of-words encoding method using higher-level statistics of descriptors (mean and var)

http://www.image-net.org/challenges/LSVRC/2010/ILSVRC2010_XRCE.pdf

13

Experiments

Experimental setup

FGVC Datasets Oxford-Flowers-102 Caltech-Bird-200

Descriptors SIFT, C-SIFT, Opponent-SIFT, Self Similarity Compressed into 64dim using several methods

Fisher Vector 64 Gaussians (visual wods) Global + 3 horizontal spatial regions

Classifier Logistic regression

Evaluation Mean classification accuracy

Flowers

Birds

15

Results: comparison with PCA and CCA

Our method substantially improves performance for all descriptors

Just applying CCA to concatenated neighbors does not improve performance Polynomial embedding makes sense (non-linear convolution)

0

10

20

30

40

50

60

70

80

90

SIFT C-SIFT Opp.-SIFT SSIM

PCA (baseline)

CCA (4-neighbors)

PolEmb (4-neighbors)

0

5

10

15

20

25


Flower Bird

Classification performance (%) with different embedding methods (all 64dim)

Baseline (PCA)

Ours

16

CCA without

Pol.

Results: number of neighbors

Including more neighbors improves performance

0

10

20

30

40

50

60

70

80

90

SIFT C-SIFT Opp.-SIFT SSIM0

5

10

15

20

25


Classification performance (%) of our method with different number of neighbors

Flower Bird

★

★● ●

★● ●

●

●

17

Comparison with other work

Our final system Combine four descriptors in late-fusion

approach (SIFT, C-SIFT, Opp.-SIFT, SSIM)

Sum of log-likelihoods output by each classifier (weighted by its individual confidence)

Descriptor 1(e.g. SIFT)

Dense sampling

polynomialvectors

categorylabel

CCA

latentdescriptor

Fishervector




(training)

Descriptor 2

Descriptor K

+

・・・・・・

Same as above

Same as above

Alliumtriquetrum

19

Comparison on FGVC datasets

Our method outperforms previous work on bird and flower datasets

For the bird dataset, [32] uses the bounding box only for training images, therefore the result is not directly comparable to ours.

(PCA) (PolEmb) (PCA+PolEmb)

← baseline

Mean classification accuracy (%)

20

ImageCLEF 2013 Plant Identification

Flower FruitLeaf Stem Entire

KakiPersimmon

Silverbirch

Boxeldermapple

Identify 250 plant species from different organs (Leaf, Flower, Fruit, etc.)

Got the 1st place in Natural Background task, and in 4/5 subtasks.

(Coming in Sept., 2013.)

21

Conclusion A simple but effective method for FGVC

Embedding co-occurrence patterns of neighboring descriptors Obtain discriminative and small-dimensional latent descriptor

to use together with Fisher vector Polynomial embedding greatly improves the performance,

indicating the importance of non-linearity

Patch-level strong supervision approximation

Not always perfect but reasonable for FGVC problems

Future work Theoretical analysis (probabilistic interpretation) Multiple instance dimensionality reduction

22

Thank you!

Any questions?

23

Object and scene categorization

Caltech-101 (Object dataset)

MIT-Indoor-67 (Scene dataset)

24

Results: Object and scene categorization

Our method seems to be not as effective as in FGVC problems

Combining PCA feature + our feature consistent improves performance

Mean classification accuracy (%)

0

10

20

30

40

50

60

70

80

SIFT C-SIFT Opp.-SIFT0

10

20

30

40

50

60

SIFT C-SIFT Opp.-SIFT

Caltech-101 MIT-Indoor-67

0

10

20

30

40

50

60

70

80

SIFT C-SIFT Opp.-SIFT0

10

20

30

40

50

60

SIFT C-SIFT Opp.-SIFT

25