
Source: cmp.felk.cvut.cz/~radenfil/publications/Radenovic-ECCV16poster.pdf



CNN Image Retrieval Learns from BoW: Unsupervised Fine-Tuning with Hard Examples

Filip Radenović, Giorgos Tolias, Ondřej Chum
Center for Machine Perception, Czech Technical University in Prague

1. Instance Retrieval Challenges

Query examples illustrate occlusion, illumination changes, and viewpoint changes.

2. CNNs and Large Training Datasets

[Figure: examples of a building class vs. a landmark class]

Pre-trained on ImageNet [Gong et al. ECCV’14, Babenko et al. ICCV’15, Kalantidis et al. arXiv’15, Tolias et al. ICLR’16]
• Internal activations as descriptors
• Inappropriate data for instance matching

Fine-tuned on landmark classes [Babenko et al. ECCV’14]
• Internal activations as descriptors
• Better than ImageNet, but still inadequate for instance matching

Weakly supervised fine-tuning [Arandjelovic et al. CVPR’16]
• Descriptor directly optimized
• Requires a GPS-annotated dataset

Our work
• Training data without any human supervision
• Abundant training data, better suited to instance matching
• Strong automatic annotation for hard-negative and hard-positive mining, and for supervised whitening

[Figure: hard positives and hard negatives; spatially closest ≠ matching]

3. Training Data without any Human Interaction

Image database: 7.4M images. Retrieval-SfM [Schonberger et al. CVPR’15] [Radenovic et al. CVPR’16] reconstructs 3D models: 551 for training, 162 for validation.

Hard negatives for a query (database ranked by CNN descriptor similarity):
• naive hard negatives: top k by CNN
• diverse hard negatives: top k by CNN, one per 3D model

Hard positives for a query:
• top 1 by CNN, i.e. the most similar descriptor (used in NetVLAD)
• top 1 by BoW
• random from top k by BoW (harder positives)
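The diverse hard-negative rule above (top k by CNN descriptor similarity, at most one image per 3D model, none from the query's own model) can be sketched as follows; all names (`descriptors`, `model_of`, `k`) are illustrative stand-ins, not code from the authors:

```python
import numpy as np

def diverse_hard_negatives(q, descriptors, model_of, query_model, k):
    """Return indices of the k most similar non-matching images,
    keeping at most one image per 3D model (diverse hard negatives).

    q           -- L2-normalized query descriptor, shape (D,)
    descriptors -- L2-normalized database descriptors, shape (N, D)
    model_of    -- 3D-model id of each database image, shape (N,)
    query_model -- 3D-model id of the query; its images are never negatives
    """
    sims = descriptors @ q              # cosine similarity for unit vectors
    picked, seen = [], {query_model}
    for i in np.argsort(-sims):         # most similar first
        if model_of[i] not in seen:     # at most one negative per 3D model
            picked.append(int(i))
            seen.add(model_of[i])
            if len(picked) == k:
                break
    return picked
```

The naive variant would simply take the top k indices regardless of which 3D model they come from.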

4. CNN Siamese Learning

Two branches with shared weights process the query and the positive/negative image. Each input image ($w \times h \times 3$) passes through the convolutional layers, giving $W \times H \times K$ activations; the MAC layer then applies global max pooling and L2 normalization, producing the $K \times 1$ MAC vector $\bar{\mathbf{f}}$.

Each training pair carries a label $Y(i,j)$: 1 for a positive (matching) pair, 0 for a negative pair. The network is trained with the contrastive loss

$$
L(i,j) =
\begin{cases}
\frac{1}{2}\,\big\|\bar{\mathbf{f}}(i) - \bar{\mathbf{f}}(j)\big\|^2 & \text{if } Y(i,j) = 1, \\
\frac{1}{2}\,\big(\max\{0,\ \tau - \|\bar{\mathbf{f}}(i) - \bar{\mathbf{f}}(j)\|\}\big)^2 & \text{if } Y(i,j) = 0,
\end{cases}
$$

where $\tau$ is a margin for non-matching pairs.
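A minimal NumPy sketch of the MAC layer and the contrastive loss; the default margin 0.7 is illustrative, not taken from the poster:

```python
import numpy as np

def mac(activations):
    """MAC descriptor: global max pooling over the W x H spatial grid
    of the last convolutional layer, followed by L2 normalization.

    activations -- conv activations, shape (W, H, K)
    returns     -- L2-normalized MAC vector, shape (K,)
    """
    f = activations.max(axis=(0, 1))
    return f / np.linalg.norm(f)

def contrastive_loss(f_i, f_j, label, tau=0.7):
    """label 1 = positive (matching) pair, 0 = negative pair;
    tau is the margin that non-matching pairs are pushed beyond."""
    d = np.linalg.norm(f_i - f_j)
    if label == 1:
        return 0.5 * d ** 2                 # pull matching pairs together
    return 0.5 * max(0.0, tau - d) ** 2     # push non-matching pairs apart
```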

5. Supervised Whitening

Intraclass and interclass covariance matrices are computed from the same labeled pairs:

$$
C_S = \sum_{Y(i,j)=1} \big(\bar{\mathbf{f}}(i) - \bar{\mathbf{f}}(j)\big)\big(\bar{\mathbf{f}}(i) - \bar{\mathbf{f}}(j)\big)^\top,
\qquad
C_D = \sum_{Y(i,j)=0} \big(\bar{\mathbf{f}}(i) - \bar{\mathbf{f}}(j)\big)\big(\bar{\mathbf{f}}(i) - \bar{\mathbf{f}}(j)\big)^\top.
$$

Whitening: the inverse square root of the intraclass covariance matrix, $C_S^{-1/2}$.
Rotation: PCA of the interclass covariance matrix in the whitened space, $\mathrm{eig}\big(C_S^{-1/2} C_D\, C_S^{-1/2}\big)$.
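The whitening and rotation above can be sketched as follows; a sketch assuming $C_S$ is full rank (in practice a small regularizer would be added), with all names illustrative:

```python
import numpy as np

def learn_whitening(X, pairs):
    """Learn the supervised whitening projection from labeled pairs.

    X     -- descriptors, shape (N, D)
    pairs -- iterable of (i, j, y); y = 1 matching, y = 0 non-matching
    Returns a D x D projection P, applied to a descriptor x as P @ x.
    """
    D = X.shape[1]
    CS = np.zeros((D, D))                   # intraclass covariance
    CD = np.zeros((D, D))                   # interclass covariance
    for i, j, y in pairs:
        d = (X[i] - X[j])[:, None]
        if y == 1:
            CS += d @ d.T
        else:
            CD += d @ d.T
    # whitening: inverse square root of CS (assumes CS is nonsingular)
    w, V = np.linalg.eigh(CS)
    CS_isqrt = V @ np.diag(w ** -0.5) @ V.T
    # rotation: eigendecomposition (PCA) of CD in the whitened space
    _, U = np.linalg.eigh(CS_isqrt @ CD @ CS_isqrt)
    return U.T @ CS_isqrt
```

For dimensionality reduction, the rows of the rotation would additionally be ordered by descending eigenvalue (`eigh` returns them in ascending order).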

6. Implicit Correspondences

7. Experiments

Contrastive loss is consistently better than triplet loss: +1.7 mAP on Oxford, +3.5 on Paris.

Improvement by hard examples (mAP):

  Positive + negative mining                 Oxford 5k   Paris 6k
  Off-the-shelf                                 44.2       51.6
  top 1 CNN + top k CNN                         56.2       63.1
  top 1 CNN + top 1 / model CNN                 56.7       63.9
  top 1 BoW + top 1 / model CNN                 59.7       67.1
  random{top k BoW} + top 1 / model CNN         60.2       67.5
    + our learned whitening                     62.2       68.9

Fine-tuning with few 3D models (mAP):

                               Oxford 5k   Paris 6k
  10 reconstructed models         56.6       62.0
  100 reconstructed models        59.4       65.8
  all reconstructed models        60.2       67.5

Negative examples: images from 3D models other than the query’s.
Positive examples: images from the same 3D model as the query.
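The labeling rule above can be sketched as a simple pair generator; `model_of`, `queries`, and `n_neg` are hypothetical names. In the actual method the positive and the negatives are not random but the hard examples mined as in Section 3; this sketch only encodes the 3D-model membership labels:

```python
import random

def make_pairs(model_of, queries, n_neg=5, seed=0):
    """Build (query, image, label) training tuples: label 1 for an image
    from the query's own 3D model, label 0 for images from other models."""
    rng = random.Random(seed)
    pairs = []
    for q in queries:
        same = [i for i, m in enumerate(model_of) if m == model_of[q] and i != q]
        other = [i for i, m in enumerate(model_of) if m != model_of[q]]
        if same:                                        # one positive pair
            pairs.append((q, rng.choice(same), 1))
        for n in rng.sample(other, min(n_neg, len(other))):
            pairs.append((q, n, 0))                     # negative pairs
    return pairs
```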