CNN Image Retrieval Learns from BoW: Unsupervised Fine-Tuning with Hard Examples
Filip Radenović, Giorgos Tolias, Ondřej Chum
Center for Machine Perception, Czech Technical University in Prague
1. Instance Retrieval Challenges
[Figure: query images and matching database images illustrating occlusion, illumination changes, and viewpoint changes.]
2. CNNs and Large Training Datasets
[Figure: example training images grouped into a building class and a landmark class.]

Pre-trained on ImageNet
• Internal activations as descriptors
• Inappropriate data for instance matching
[Gong et al. ECCV'14, Babenko et al. ICCV'15, Kalantidis et al. arXiv'15, Tolias et al. ICLR'16]

Fine-tuned on landmark classes
• Internal activations as descriptors
• Better than ImageNet, but still inadequate for instance matching
[Babenko et al. ECCV'14]

Weakly supervised fine-tuning
• Descriptor directly optimized
• Requires a GPS-annotated dataset
[Arandjelovic et al. CVPR'16]

Our work
• Training data without any human supervision
• Abundant training data, better suited to instance matching
• Strong automatic annotation for hard-negative mining, hard-positive mining, and supervised whitening
[Figure: examples of mined hard positives and hard negatives; for positives, spatially closest ≠ matching.]
3. Training Data without any Human Interaction
Image database: 7.4M images
Retrieval + SfM reconstruction [Schonberger et al. CVPR'15, Radenovic et al. CVPR'16]
3D models: 551 for training / 162 for validation
Negative selection (from the images most similar to the query by CNN descriptor):
• naive hard negatives: top k by CNN
• diverse hard negatives: top k by CNN, at most one per 3D model

Positive selection:
• top 1 by CNN (used in NetVLAD)
• top 1 by BoW
• random from top k by BoW (harder positives)
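The diverse hard-negative strategy above can be sketched in plain Python (a minimal sketch; the candidate tuple layout and the Euclidean distance on plain lists are our assumptions, and real descriptors are high-dimensional CNN outputs):

```python
def diverse_hard_negatives(query_desc, candidates, k):
    """Select the k hardest negatives: the non-matching images closest to the
    query in CNN descriptor space, keeping at most one image per 3D model.
    `candidates` is a list of (image_id, model_id, descriptor) tuples drawn
    from 3D models other than the query's."""
    def l2(d):
        return sum((a - b) ** 2 for a, b in zip(query_desc, d)) ** 0.5

    negatives, seen_models = [], set()
    for image_id, model_id, desc in sorted(candidates, key=lambda c: l2(c[2])):
        if model_id in seen_models:
            continue  # the naive top-k variant would keep this near-duplicate
        seen_models.add(model_id)
        negatives.append(image_id)
        if len(negatives) == k:
            break
    return negatives
```

Dropping the `seen_models` check recovers the naive top-k variant, which tends to fill the negative set with near-identical views of the same landmark.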
4. CNN Siamese Learning

[Figure: two-branch network. The query image (w×h×3) and a positive or negative image each pass through shared convolutional layers (W×H×K feature maps), followed by a MAC layer (global max pooling + L2-norm) producing a K×1 MAC descriptor per branch; the two descriptors feed the contrastive loss.]

Pair label Y(i, j): 1 for a positive (matching) pair, 0 for a negative pair.

Contrastive loss:

$$
L(i,j) =
\begin{cases}
\tfrac{1}{2}\,\bigl\|\bar{\mathbf{f}}(i)-\bar{\mathbf{f}}(j)\bigr\|^{2} & \text{if } Y(i,j)=1 \text{ (positive pair)}\\[4pt]
\tfrac{1}{2}\,\Bigl(\max\bigl\{0,\ \tau-\bigl\|\bar{\mathbf{f}}(i)-\bar{\mathbf{f}}(j)\bigr\|\bigr\}\Bigr)^{2} & \text{if } Y(i,j)=0 \text{ (negative pair)}
\end{cases}
$$

where $\bar{\mathbf{f}}$ denotes the L2-normalized MAC descriptor and $\tau$ is the margin.
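The MAC descriptor and the contrastive loss can be sketched in plain Python (a minimal sketch on nested lists; actual training runs on GPU tensors with backpropagation through the convolutional layers, and the margin value below is a placeholder, not the tuned value):

```python
import math

def mac_descriptor(activations):
    """MAC: global max pooling over each of the K feature maps of the last
    convolutional layer, followed by L2 normalization, giving a K x 1 vector.
    `activations` is a K x H x W nested list."""
    pooled = [max(max(row) for row in fmap) for fmap in activations]  # one max per channel
    norm = math.sqrt(sum(v * v for v in pooled))
    return [v / norm for v in pooled]

def contrastive_loss(f_i, f_j, label, tau=0.7):
    """Contrastive loss on a pair of L2-normalized descriptors.
    label = 1 for a matching (positive) pair, 0 for a non-matching pair;
    tau is the margin (placeholder value)."""
    dist = math.sqrt(sum((a - b) ** 2 for a, b in zip(f_i, f_j)))
    if label == 1:
        return 0.5 * dist ** 2                   # pull matching pairs together
    return 0.5 * max(0.0, tau - dist) ** 2       # push others beyond the margin
```

A positive pair with identical descriptors contributes zero loss, as does a negative pair already farther apart than the margin.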
5. Supervised Whitening

Whitening: square root of the intraclass (matching-pair) covariance matrix: $C_S^{-1/2}$.
Rotation: PCA of the interclass covariance matrix in the whitened space: $\operatorname{eig}\bigl(C_S^{-1/2}\, C_D\, C_S^{-1/2}\bigr)$, where

$$
C_S = \sum_{Y(i,j)=1} \bigl(\bar{\mathbf{f}}(i)-\bar{\mathbf{f}}(j)\bigr)\bigl(\bar{\mathbf{f}}(i)-\bar{\mathbf{f}}(j)\bigr)^{\top},
\qquad
C_D = \sum_{Y(i,j)=0} \bigl(\bar{\mathbf{f}}(i)-\bar{\mathbf{f}}(j)\bigr)\bigl(\bar{\mathbf{f}}(i)-\bar{\mathbf{f}}(j)\bigr)^{\top}.
$$
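A minimal sketch of this learned-whitening projection in plain Python, specialized to 2-D descriptors so the symmetric eigen-decomposition has a closed form (real MAC descriptors have hundreds of dimensions and would use a linear-algebra library; the function names are ours):

```python
import math

def matmul2(A, B):
    """Product of two 2x2 matrices."""
    return [[sum(A[i][k] * B[k][j] for k in range(2)) for j in range(2)]
            for i in range(2)]

def transpose2(A):
    return [[A[0][0], A[1][0]], [A[0][1], A[1][1]]]

def sym_eig2(C):
    """Closed-form eigen-decomposition of a symmetric 2x2 matrix.
    Returns eigenvalues (largest first) and a matrix whose columns are
    the corresponding unit eigenvectors."""
    a, b, c = C[0][0], C[0][1], C[1][1]
    mean = (a + c) / 2.0
    radius = math.hypot((a - c) / 2.0, b)
    l1, l2 = mean + radius, mean - radius
    if abs(b) > 1e-12:
        v1, v2 = (b, l1 - a), (b, l2 - a)
    else:  # already diagonal
        v1, v2 = ((1.0, 0.0), (0.0, 1.0)) if a >= c else ((0.0, 1.0), (1.0, 0.0))
    def unit(v):
        n = math.hypot(v[0], v[1])
        return (v[0] / n, v[1] / n)
    v1, v2 = unit(v1), unit(v2)
    return (l1, l2), [[v1[0], v2[0]], [v1[1], v2[1]]]

def inv_sqrt2(C):
    """C^(-1/2) for a symmetric positive-definite 2x2 matrix."""
    (l1, l2), V = sym_eig2(C)
    D = [[1.0 / math.sqrt(l1), 0.0], [0.0, 1.0 / math.sqrt(l2)]]
    return matmul2(matmul2(V, D), transpose2(V))

def learned_whitening(C_S, C_D):
    """Projection P = R^T C_S^(-1/2): whitening by the matching-pair covariance,
    then rotation R from PCA of the non-matching-pair covariance in the
    whitened space. P maps C_S to the identity and diagonalizes C_D."""
    W = inv_sqrt2(C_S)
    _, R = sym_eig2(matmul2(matmul2(W, C_D), W))
    return matmul2(transpose2(R), W)
```

Since P maps C_S to the identity and C_D to a diagonal matrix, descriptor dimensions can be ranked (and truncated) by their discriminative variance.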
6. Implicit Correspondences
7. Experiments

The contrastive loss is consistently better than the triplet loss: +1.7 mAP on Oxford and +3.5 mAP on Paris.

Improvement by hard examples (mAP):

  Method                                   Oxford 5k   Paris 6k
  Off-the-shelf                               44.2       51.6
  top 1 CNN + top k CNN                       56.2       63.1
  top 1 CNN + top 1 / model CNN               56.7       63.9
  top 1 BoW + top 1 / model CNN               59.7       67.1
  random{top k BoW} + top 1 / model CNN       60.2       67.5
  + our learned whitening                     62.2       68.9

Fine-tuning with few 3D models (mAP):

  Training data                Oxford 5k   Paris 6k
  10 reconstructed models         56.6       62.0
  100 reconstructed models        59.4       65.8
  all reconstructed models        60.2       67.5
Negative examples: images from different 3D models than the query.
Positive examples: images from the same 3D model as the query.
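This labeling rule can be sketched as follows (a minimal sketch; the mapping from image ids to 3D-model ids is a hypothetical structure standing in for the SfM cluster assignments):

```python
def label_pairs(query_image, image_to_model):
    """Label each database image relative to a query: 1 (positive) if it comes
    from the same 3D model as the query, 0 (negative) otherwise.
    `image_to_model` maps image ids to the id of the 3D model containing them."""
    q_model = image_to_model[query_image]
    return {img: int(model == q_model)
            for img, model in image_to_model.items() if img != query_image}
```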