
Source: cmp.felk.cvut.cz/~radenfil/publications/Radenovic-ECCV16poster.pdf



CNN Image Retrieval Learns from BoW: Unsupervised Fine-Tuning with Hard Examples

Filip Radenović, Giorgos Tolias, Ondřej Chum
Center for Machine Perception, Czech Technical University in Prague

1. Instance Retrieval Challenges

Query examples illustrate occlusion, illumination changes, and viewpoint changes.

2. CNNs and Large Training Datasets

[Figure: examples of a building class vs. a landmark class]

Pre-trained on ImageNet [Gong et al. ECCV’14, Babenko et al. ICCV’15, Kalantidis et al. arXiv’15, Tolias et al. ICLR’16]
• Internal activations as descriptors
• Inappropriate data for instance matching

Fine-tuned on landmark classes [Babenko et al. ECCV’14]
• Internal activations as descriptors
• Better than ImageNet, but still inadequate for instance matching

Weakly supervised fine-tuning [Arandjelovic et al. CVPR’16]
• Descriptor directly optimized
• Requires a GPS-annotated dataset

Our work
• Training data without any human supervision
• Abundant training data, better suited to instance matching
• Strong automatic annotation for hard-negative and hard-positive mining, and for supervised whitening

[Figure: hard positives and hard negatives; spatially closest ≠ matching]

3. Training Data without any Human Interaction

Image database: 7.4M images. Retrieval-SfM [Schonberger et al. CVPR’15] [Radenovic et al. CVPR’16] reconstructs 3D models: 551 for training, 162 for validation.

Hard negatives for a query (database ranked by CNN descriptor similarity):
• naive hard negatives: top k by CNN
• diverse hard negatives: top k by CNN, one per 3D model

Hard positives for a query:
• top 1 by CNN, i.e. the most similar descriptor (used in NetVLAD)
• top 1 by BoW
• random from top k by BoW (harder positives)
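The diverse hard-negative rule above (top k by CNN descriptor similarity, at most one image per 3D model, none from the query's own model) can be sketched as follows; all names (`descriptors`, `model_of`, `k`) are illustrative stand-ins, not code from the authors:

```python
import numpy as np

def diverse_hard_negatives(q, descriptors, model_of, query_model, k):
    """Return indices of the k most similar non-matching images,
    keeping at most one image per 3D model (diverse hard negatives).

    q           -- L2-normalized query descriptor, shape (D,)
    descriptors -- L2-normalized database descriptors, shape (N, D)
    model_of    -- 3D-model id of each database image, shape (N,)
    query_model -- 3D-model id of the query; its images are never negatives
    """
    sims = descriptors @ q              # cosine similarity for unit vectors
    picked, seen = [], {query_model}
    for i in np.argsort(-sims):         # most similar first
        if model_of[i] not in seen:     # at most one negative per 3D model
            picked.append(int(i))
            seen.add(model_of[i])
            if len(picked) == k:
                break
    return picked
```

The naive variant would simply take the top k indices regardless of which 3D model they come from.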

4. CNN Siamese Learning

Two branches with shared weights process the query and the positive/negative image. Each input image ($w \times h \times 3$) passes through the convolutional layers, giving $W \times H \times K$ activations; the MAC layer then applies global max pooling and L2 normalization, producing the $K \times 1$ MAC vector $\bar{\mathbf{f}}$.

Each training pair carries a label $Y(i,j)$: 1 for a positive (matching) pair, 0 for a negative pair. The network is trained with the contrastive loss

$$
L(i,j) =
\begin{cases}
\frac{1}{2}\,\big\|\bar{\mathbf{f}}(i) - \bar{\mathbf{f}}(j)\big\|^2 & \text{if } Y(i,j) = 1, \\
\frac{1}{2}\,\big(\max\{0,\ \tau - \|\bar{\mathbf{f}}(i) - \bar{\mathbf{f}}(j)\|\}\big)^2 & \text{if } Y(i,j) = 0,
\end{cases}
$$

where $\tau$ is a margin for non-matching pairs.
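A minimal NumPy sketch of the MAC layer and the contrastive loss; the default margin 0.7 is illustrative, not taken from the poster:

```python
import numpy as np

def mac(activations):
    """MAC descriptor: global max pooling over the W x H spatial grid
    of the last convolutional layer, followed by L2 normalization.

    activations -- conv activations, shape (W, H, K)
    returns     -- L2-normalized MAC vector, shape (K,)
    """
    f = activations.max(axis=(0, 1))
    return f / np.linalg.norm(f)

def contrastive_loss(f_i, f_j, label, tau=0.7):
    """label 1 = positive (matching) pair, 0 = negative pair;
    tau is the margin that non-matching pairs are pushed beyond."""
    d = np.linalg.norm(f_i - f_j)
    if label == 1:
        return 0.5 * d ** 2                 # pull matching pairs together
    return 0.5 * max(0.0, tau - d) ** 2     # push non-matching pairs apart
```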

5. Supervised Whitening

Intraclass and interclass covariance matrices are computed from the same labeled pairs:

$$
C_S = \sum_{Y(i,j)=1} \big(\bar{\mathbf{f}}(i) - \bar{\mathbf{f}}(j)\big)\big(\bar{\mathbf{f}}(i) - \bar{\mathbf{f}}(j)\big)^\top,
\qquad
C_D = \sum_{Y(i,j)=0} \big(\bar{\mathbf{f}}(i) - \bar{\mathbf{f}}(j)\big)\big(\bar{\mathbf{f}}(i) - \bar{\mathbf{f}}(j)\big)^\top.
$$

Whitening: the inverse square root of the intraclass covariance matrix, $C_S^{-1/2}$.
Rotation: PCA of the interclass covariance matrix in the whitened space, $\mathrm{eig}\big(C_S^{-1/2} C_D\, C_S^{-1/2}\big)$.
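The whitening and rotation above can be sketched as follows; a sketch assuming $C_S$ is full rank (in practice a small regularizer would be added), with all names illustrative:

```python
import numpy as np

def learn_whitening(X, pairs):
    """Learn the supervised whitening projection from labeled pairs.

    X     -- descriptors, shape (N, D)
    pairs -- iterable of (i, j, y); y = 1 matching, y = 0 non-matching
    Returns a D x D projection P, applied to a descriptor x as P @ x.
    """
    D = X.shape[1]
    CS = np.zeros((D, D))                   # intraclass covariance
    CD = np.zeros((D, D))                   # interclass covariance
    for i, j, y in pairs:
        d = (X[i] - X[j])[:, None]
        if y == 1:
            CS += d @ d.T
        else:
            CD += d @ d.T
    # whitening: inverse square root of CS (assumes CS is nonsingular)
    w, V = np.linalg.eigh(CS)
    CS_isqrt = V @ np.diag(w ** -0.5) @ V.T
    # rotation: eigendecomposition (PCA) of CD in the whitened space
    _, U = np.linalg.eigh(CS_isqrt @ CD @ CS_isqrt)
    return U.T @ CS_isqrt
```

For dimensionality reduction, the rows of the rotation would additionally be ordered by descending eigenvalue (`eigh` returns them in ascending order).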

6. Implicit Correspondences

7. Experiments

Contrastive loss is consistently better than triplet loss: +1.7 mAP on Oxford, +3.5 on Paris.

Improvement by hard examples (mAP):

  Positive + negative mining                 Oxford 5k   Paris 6k
  Off-the-shelf                                 44.2       51.6
  top 1 CNN + top k CNN                         56.2       63.1
  top 1 CNN + top 1 / model CNN                 56.7       63.9
  top 1 BoW + top 1 / model CNN                 59.7       67.1
  random{top k BoW} + top 1 / model CNN         60.2       67.5
    + our learned whitening                     62.2       68.9

Fine-tuning with few 3D models (mAP):

                               Oxford 5k   Paris 6k
  10 reconstructed models         56.6       62.0
  100 reconstructed models        59.4       65.8
  all reconstructed models        60.2       67.5

Negative examples: images from 3D models other than the query’s.
Positive examples: images from the same 3D model as the query.
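The labeling rule above can be sketched as a simple pair generator; `model_of`, `queries`, and `n_neg` are hypothetical names. In the actual method the positive and the negatives are not random but the hard examples mined as in Section 3; this sketch only encodes the 3D-model membership labels:

```python
import random

def make_pairs(model_of, queries, n_neg=5, seed=0):
    """Build (query, image, label) training tuples: label 1 for an image
    from the query's own 3D model, label 0 for images from other models."""
    rng = random.Random(seed)
    pairs = []
    for q in queries:
        same = [i for i, m in enumerate(model_of) if m == model_of[q] and i != q]
        other = [i for i, m in enumerate(model_of) if m != model_of[q]]
        if same:                                        # one positive pair
            pairs.append((q, rng.choice(same), 1))
        for n in rng.sample(other, min(n_neg, len(other))):
            pairs.append((q, n, 0))                     # negative pairs
    return pairs
```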