Supplementary Material for paper “Hi-CMD: Hierarchical Cross-Modality Disentanglement for Visible-Infrared Person Re-Identification”

Seokeon Choi, Sumin Lee, Youngeun Kim, Taekyung Kim, Changick Kim
Korea Advanced Institute of Science and Technology, Daejeon, Korea
{seokeon, suminlee94, youngeunkim, tkkim93, changick}@kaist.ac.kr

A. Parameter Analysis

We report a hyper-parameter analysis on RegDB [6] and SYSU-MM01 [10]. The total loss of our Hi-CMD is defined in Equation (10) in the main manuscript. This total loss can be reformulated as a weighted sum of the ID-PIG and HFL loss terms, $L = (1 - \beta) L_P + \beta L_H$, where $L_P$ and $L_H$ represent the ID-PIG and HFL loss terms, respectively, and $\beta \in [0, 1]$ denotes a balancing parameter. The ID-PIG loss term is expressed as $L_P = L_{recon} + \lambda_{kl} L_{kl} + \lambda_{adv} L_{adv}$, and the HFL loss term is formulated as $L_H = \lambda_{ce} L_{ce} + \lambda_{trip} L_{trip}$. The impact of different β values on the ReID performance is shown in Fig. 1. We observe that our method achieves the best performance when β is set to 0.5, which indicates that the current weights for the ID-PIG and HFL losses are properly balanced.
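The reformulated loss can be illustrated with a minimal Python sketch. The function and class names (`LossWeights`, `id_pig_loss`, `hfl_loss`, `total_loss`) are hypothetical, the individual loss values are assumed to be already computed scalars, and the default λ values are placeholders rather than the settings used in the main manuscript.

```python
from dataclasses import dataclass


@dataclass
class LossWeights:
    # Placeholder values for illustration only; the actual lambdas
    # are given in the main manuscript, not in this supplement.
    kl: float = 1.0
    adv: float = 1.0
    ce: float = 1.0
    trip: float = 1.0


def id_pig_loss(recon: float, kl: float, adv: float, w: LossWeights) -> float:
    """L_P = L_recon + lambda_kl * L_kl + lambda_adv * L_adv."""
    return recon + w.kl * kl + w.adv * adv


def hfl_loss(ce: float, trip: float, w: LossWeights) -> float:
    """L_H = lambda_ce * L_ce + lambda_trip * L_trip."""
    return w.ce * ce + w.trip * trip


def total_loss(l_p: float, l_h: float, beta: float) -> float:
    """L = (1 - beta) * L_P + beta * L_H, with beta restricted to [0, 1]."""
    if not 0.0 <= beta <= 1.0:
        raise ValueError("beta must lie in [0, 1]")
    return (1.0 - beta) * l_p + beta * l_h
```

With β = 0 only the ID-PIG term contributes, with β = 1 only the HFL term contributes, and β = 0.5 weights the two equally.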

B. Results for Additional Evaluation Protocols

Different query settings on RegDB. We evaluate the performance under different query settings on the RegDB dataset, as discussed in [11, 13, 12, 3]. Unlike the original evaluation protocol, where visible images form the query set, here infrared images are provided as the query set. Table 1 shows that our method outperforms competing methods regardless of the query modality, which indicates that the proposed Hi-CMD is robust to different query settings.

Indoor search on SYSU-MM01. We also evaluate the performance of our method under the single-shot indoor-search mode on SYSU-MM01. A detailed description of this evaluation protocol is given in [10]. This protocol is less challenging than the single-shot all-search mode since the outdoor scenes are excluded. The performance shown in Table 2 demonstrates that our Hi-CMD still outperforms the state-of-the-art methods under this evaluation protocol.

[Figure 1: two plots, Rank-1 identification rates (%) and mAP (%), each versus the balancing parameter β (0.1, 0.3, 0.5, 0.7, 0.9, 0.95, 0.99), with curves for SYSU-MM01 and RegDB.]

Figure 1. Performance of our Hi-CMD method for different balancing parameters β on RegDB and SYSU-MM01.

Dataset: RegDB [6] (Infrared to visible)

Methods            R=1    R=10   R=20   mAP
Zero padding [10]  16.63  34.68  44.25  17.82
TONE [11]          13.86  30.08  40.05  16.98
TONE+HCML [11]     21.70  45.02  55.58  22.24
BDTR [13]          32.72  57.96  68.86  31.10
eBDTR [12]         34.21  58.74  68.64  32.49
HSME [3]           40.67  65.35  75.27  37.50
D-HSME [3]         50.15  72.40  81.07  46.16
Ours (Hi-CMD)      68.21  84.85  89.76  62.41

Table 1. Comparison with state-of-the-art methods under different query settings on RegDB. The infrared image is given as the query set, as opposed to the original evaluation protocol in the main manuscript. Re-identification rates (%) at rank R and mAP (%).
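For readers unfamiliar with the two metrics reported in the tables, the following is a simplified Python sketch of how rank-R rates and mAP can be computed from a query-to-gallery distance matrix. It is an illustration under stated assumptions, not the official evaluation code: the function name `rank_k_and_map` is hypothetical, and dataset-specific rules such as the camera filtering of the SYSU-MM01 protocol [10] are omitted.

```python
def rank_k_and_map(dist, query_ids, gallery_ids, ranks=(1, 10, 20)):
    """Compute rank-R rates (%) and mAP (%) from a distance matrix.

    dist[q][g] is the feature distance between query q and gallery item g.
    Queries with no matching gallery identity are skipped.
    """
    hits = {r: 0 for r in ranks}
    ap_sum = 0.0
    valid = 0
    for q, row in enumerate(dist):
        # Sort gallery indices by ascending distance to the query.
        order = sorted(range(len(row)), key=lambda g: row[g])
        matches = [gallery_ids[g] == query_ids[q] for g in order]
        if not any(matches):
            continue
        valid += 1
        # Rank-R counts a hit if the first correct match is within the top R.
        first = matches.index(True)
        for r in ranks:
            if first < r:
                hits[r] += 1
        # Average precision: mean of precision values at each correct match.
        found = 0
        precisions = []
        for pos, is_match in enumerate(matches, start=1):
            if is_match:
                found += 1
                precisions.append(found / pos)
        ap_sum += sum(precisions) / len(precisions)
    if valid == 0:
        raise ValueError("no query has a matching gallery identity")
    cmc = {r: 100.0 * hits[r] / valid for r in ranks}
    return cmc, 100.0 * ap_sum / valid
```

Swapping which modality supplies the rows (queries) and which supplies the columns (gallery) of `dist` corresponds to switching between the visible-to-infrared and infrared-to-visible settings.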

Dataset: SYSU-MM01 [10] (Indoor-search)

Methods              R=1    R=10   R=20   mAP
SVDNet [7]           20.24  64.32  83.62  28.74
PCB [8]              22.63  65.24  83.92  30.46
Zero padding [10]    20.58  68.38  85.79  26.92
TONE [11]            20.82  68.86  84.46  26.38
cmGAN [1]            31.63  77.23  89.18  42.19
eBDTR (alex) [12]    27.62  74.95  88.12  38.40
eBDTR (resnet) [12]  32.46  77.42  89.62  42.46
Ours (Hi-CMD)        37.11  81.63  91.36  47.11

Table 2. Comparison with the state-of-the-art methods under the single-shot indoor-search mode on the SYSU-MM01 dataset.

C. Network Architectures

The proposed Hi-CMD method consists of the two prototype encoders $E^p_1$, $E^p_2$, the two attribute encoders $E^a_1$, $E^a_2$, the two discriminators $D_1$, $D_2$, the single decoder $G$, and the single feature embedding network $H$. We use ResNet-50 [4] pretrained on ImageNet [2] as the attribute encoder $E^a$ and the feature embedding network $H$. The attribute encoder has the same structure up to the average pooling layer of ResNet-50, and the feature embedding network borrows the Conv3 and Conv4 layers of ResNet-50.

Table 3 shows the remaining three network architectures: (1) The prototype encoder $E^p$ contains several convolutional layers and residual blocks [4] with Instance Normalization (IN) [9]. We also add Atrous Spatial Pyramid Pooling (ASPP) to exploit multi-scale features. (2) The residual decoder $G$ consists of several convolutional layers, upsampling layers, and residual blocks. The Adaptive Instance Normalization (AdaIN) [5] layers in the residual blocks help the generator synthesize various attributes in a person image. (3) The discriminator $D$ contains several convolutional layers and residual blocks. We use PatchGAN [14] at three different scales of the input image: 256 × 128, 128 × 64, and 64 × 32.
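To make the role of AdaIN concrete, here is a minimal pure-Python sketch of the operation as defined in [5]: each channel is normalized by its own per-instance mean and standard deviation, then rescaled and shifted by externally supplied parameters. The function names and the flat-list feature layout are assumptions for illustration; how the decoder derives the scale and shift from the attribute code is not described here, so they are passed in as plain arguments.

```python
import math


def adain_channel(values, gamma, beta, eps=1e-5):
    """Apply AdaIN to one channel of one sample (flat list of H*W activations)."""
    n = len(values)
    mean = sum(values) / n
    var = sum((v - mean) ** 2 for v in values) / n
    inv_std = 1.0 / math.sqrt(var + eps)
    # Normalize with the channel's own statistics, then apply the
    # externally supplied scale (gamma) and shift (beta).
    return [gamma * (v - mean) * inv_std + beta for v in values]


def adain(feature_map, gammas, betas, eps=1e-5):
    """feature_map: list of channels, each a flat list of activations."""
    return [
        adain_channel(channel, g, b, eps)
        for channel, g, b in zip(feature_map, gammas, betas)
    ]
```

After the operation, each output channel has mean close to its `beta` and standard deviation close to its `gamma`, which is how the supplied parameters can steer appearance attributes while the spatial layout of the features is preserved.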

D. Retrieved Examples

We analyze the top-5 retrieval results of 24 query examples on the RegDB and SYSU-MM01 datasets, as illustrated in Figs. 2 and 3. For each dataset, two different experiments are performed. In the case of RegDB, visible-to-infrared retrieval is the original setting, and infrared-to-visible retrieval is also applied, as introduced in Table 1. In the case of SYSU-MM01, infrared-to-visible retrieval is the original setting. Note that we use all the images in the gallery set and query set of the SYSU-MM01 dataset, so the retrieval results differ slightly from those obtained under the SYSU evaluation protocols, such as single-shot all-search and single-shot indoor-search.

The images in the first column are the query images, and the retrieved images are sorted from left to right in ascending order of feature distance. The query and retrieved images differ in pose and illumination because the cross-modality and intra-modality characteristics are intricately entangled in each image, which shows that this cross-modality retrieval task is very challenging. Even in this difficult setting, the proposed method achieves successful cross-modality matching by finding common ID-discriminative factors, such as the clothing type, pattern, and human body size. Although there are a few failure cases, we observe that in those cases the ID-discriminative characteristics are similar enough to confuse even a human observer. In conclusion, our Hi-CMD method is robust to ID-excluded factors and effectively finds hidden ID-discriminative clues.
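The ranking step behind such retrieval figures can be sketched as follows. This is an illustrative example only: the text states that results are sorted by ascending feature distance but does not name the metric here, so Euclidean distance is an assumption, and `retrieve_top_k` is a hypothetical helper name.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def retrieve_top_k(query_feat, gallery_feats, k=5):
    """Return (gallery_index, distance) pairs for the k nearest gallery items.

    Results are ordered by ascending distance, matching the left-to-right
    order of the retrieved images in the figures.
    """
    scored = [(euclidean(query_feat, g), idx) for idx, g in enumerate(gallery_feats)]
    scored.sort()
    return [(idx, dist) for dist, idx in scored[:k]]
```

A retrieved item counts as a correct match when its identity label equals that of the query, which is what the green and red markings in the figures indicate.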

References

[1] Pingyang Dai, Rongrong Ji, Haibin Wang, Qiong Wu, and Yuyu Huang. Cross-modality person re-identification with generative adversarial training. In Proceedings of the International Joint Conference on Artificial Intelligence (IJCAI), pages 677–683, 2018.

[2] Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. ImageNet: A large-scale hierarchical image database. In 2009 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 248–255. IEEE, 2009.

[3] Yi Hao, Nannan Wang, Jie Li, and Xinbo Gao. HSME: Hypersphere manifold embedding for visible thermal person re-identification. In Proceedings of the AAAI Conference on Artificial Intelligence (AAAI), volume 33, pages 8385–8392, 2019.

[4] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 770–778, 2016.

[5] Xun Huang and Serge Belongie. Arbitrary style transfer in real-time with adaptive instance normalization. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), pages 1501–1510, 2017.

[6] Dat Nguyen, Hyung Hong, Ki Kim, and Kang Park. Person recognition system based on a combination of body images from visible light and thermal cameras. Sensors, 17(3):605, 2017.

[7] Yifan Sun, Liang Zheng, Weijian Deng, and Shengjin Wang. SVDNet for pedestrian retrieval. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), pages 3800–3808, 2017.

[8] Yifan Sun, Liang Zheng, Yi Yang, Qi Tian, and Shengjin Wang. Beyond part models: Person retrieval with refined part pooling (and a strong convolutional baseline). In Proceedings of the European Conference on Computer Vision (ECCV), pages 480–496, 2018.

[9] Dmitry Ulyanov, Andrea Vedaldi, and Victor Lempitsky. Improved texture networks: Maximizing quality and diversity in feed-forward stylization and texture synthesis. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 6924–6932, 2017.

[10] Ancong Wu, Wei-Shi Zheng, Hong-Xing Yu, Shaogang Gong, and Jianhuang Lai. RGB-infrared cross-modality person re-identification. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), pages 5380–5389, 2017.

[11] Mang Ye, Xiangyuan Lan, Jiawei Li, and Pong C Yuen. Hierarchical discriminative learning for visible thermal person re-identification. In Proceedings of the AAAI Conference on Artificial Intelligence (AAAI), 2018.

[12] Mang Ye, Xiangyuan Lan, Zheng Wang, and Pong C Yuen. Bi-directional center-constrained top-ranking for visible thermal person re-identification. IEEE Transactions on Information Forensics and Security, 2019.

[13] Mang Ye, Zheng Wang, Xiangyuan Lan, and Pong C Yuen. Visible thermal person re-identification via dual-constrained top-ranking. In Proceedings of the International Joint Conference on Artificial Intelligence (IJCAI), pages 1092–1099, 2018.

[14] Jun-Yan Zhu, Taesung Park, Phillip Isola, and Alexei A Efros. Unpaired image-to-image translation using cycle-consistent adversarial networks. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), pages 2223–2232, 2017.

Module Name                     Layer Name  Input Shape → Output Shape         Parameters                          Layer Description

Prototype Encoder E^p_1, E^p_2  Conv1       (H, W, 3) → (H/2, W/2, 32)         [3×3, 32]                           S-2, IN, LReLU
                                Conv2       (H/2, W/2, 32) → (H/2, W/2, 64)    [3×3, 64]                           S-1, IN, LReLU
                                Conv3       (H/2, W/2, 64) → (H/2, W/2, 64)    [3×3, 64]                           S-1, IN, LReLU
                                Conv4       (H/2, W/2, 64) → (H/4, W/4, 128)   [3×3, 128]                          S-2, IN, LReLU
                                ResBlocks   (H/4, W/4, 128) → (H/4, W/4, 128)  [3×3, 128; 3×3, 128] × 4            S-1, IN, LReLU
                                ASPP        (H/4, W/4, 128) → (H/4, W/4, 256)  [1×1, 64], [1×1, 64; 3×3, 64] × 3   S-1, IN, LReLU
                                Conv5       (H/4, W/4, 256) → (H/4, W/4, 256)  [1×1, 256]                          S-1, IN

Residual Decoder G              ResBlocks   (H/4, W/4, 256) → (H/4, W/4, 256)  [3×3, 256; 3×3, 256] × 4            S-1, AdaIN, LReLU
                                Upsampling  (H/4, W/4, 256) → (H/2, W/2, 256)  -                                   -
                                Conv1       (H/2, W/2, 256) → (H/2, W/2, 128)  [5×5, 128]                          S-1, LN, LReLU
                                Upsampling  (H/2, W/2, 128) → (H, W, 128)      -                                   -
                                Conv2       (H, W, 128) → (H, W, 64)           [5×5, 64]                           S-1, LN, LReLU
                                Conv3       (H, W, 64) → (H, W, 64)            [3×3, 64]                           LReLU
                                Conv4       (H, W, 64) → (H, W, 64)            [3×3, 64]                           LReLU
                                Conv5       (H, W, 64) → (H, W, 3)             [1×1, 3]                            -

Discriminator D_1, D_2          Conv1       (H, W, 3) → (H, W, 32)             [1×1, 32]                           S-1, LReLU
                                Conv2       (H, W, 32) → (H, W, 32)            [3×3, 32]                           S-1, LReLU
                                Conv3       (H, W, 32) → (H/2, W/2, 32)        [3×3, 32]                           S-2, LReLU
                                Conv4       (H/2, W/2, 32) → (H/2, W/2, 32)    [3×3, 32]                           S-1, LReLU
                                Conv5       (H/2, W/2, 32) → (H/4, W/4, 64)    [3×3, 64]                           S-2, LReLU
                                ResBlocks   (H/4, W/4, 64) → (H/4, W/4, 64)    [3×3, 64; 3×3, 64] × 4              S-1, LReLU
                                Conv6       (H/4, W/4, 64) → (H/4, W/4, 1)     [1×1, 1]                            S-1

Table 3. Network architectures within the proposed overall framework. We express the characteristics of each layer as S: stride size, IN: Instance Normalization, AdaIN: Adaptive Instance Normalization, and LN: Layer Normalization. H and W denote the height and width of the input image, respectively. Both visible and infrared images are resized to 256 × 128 × 3.
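As a sanity check on the shapes listed for the prototype encoder in Table 3, the following Python sketch traces the tensor shape through the layers for a 256 × 128 × 3 input. It assumes that padding is chosen so that each layer changes the spatial size only through its stride, which Table 3 implies but does not state; `trace_shapes` and `PROTOTYPE_ENCODER` are hypothetical names.

```python
# (layer name, stride, output channels) for the prototype encoder in Table 3.
PROTOTYPE_ENCODER = [
    ("Conv1", 2, 32),
    ("Conv2", 1, 64),
    ("Conv3", 1, 64),
    ("Conv4", 2, 128),
    ("ResBlocks", 1, 128),
    ("ASPP", 1, 256),
    ("Conv5", 1, 256),
]


def trace_shapes(layers, height, width, channels):
    """Return [(layer_name, (H, W, C))] after each layer.

    Assumes padding keeps the spatial size unchanged apart from the
    division by the stride.
    """
    shapes = []
    for name, stride, out_channels in layers:
        if height % stride or width % stride:
            raise ValueError(f"{name}: size not divisible by stride {stride}")
        height //= stride
        width //= stride
        channels = out_channels
        shapes.append((name, (height, width, channels)))
    return shapes


if __name__ == "__main__":
    for name, shape in trace_shapes(PROTOTYPE_ENCODER, 256, 128, 3):
        print(f"{name:<10} {shape}")
```

Under this assumption the encoder output is a 64 × 32 × 256 feature map, that is, (H/4, W/4, 256), which matches the input shape of the first residual block of the decoder.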

[Figure 2: two panels, (a) and (b), of query images with their top-5 retrieved images; the last rows of each panel are labeled "Failure cases".]

Figure 2. The top-5 retrieval results of our Hi-CMD on the RegDB dataset. Correct and incorrect matches are indicated in green and red, respectively. (a) Visible images are given as the query set, which is the original evaluation protocol for RegDB. (b) Infrared images are given as the query set, as introduced in Table 1. Best viewed in color.

[Figure 3: two panels, (a) and (b), of query images with their top-5 retrieved images; the last rows of each panel are labeled "Failure cases".]

Figure 3. The top-5 retrieval results of our Hi-CMD on the SYSU-MM01 dataset. The SYSU-MM01 dataset is relatively challenging compared to the RegDB dataset. Correct and incorrect matches are indicated in green and red, respectively. (a) Infrared images are given as the query set, which is the original evaluation protocol for SYSU-MM01. (b) Visible images are given as the query set. Best viewed in color.