
An Enhanced Deep Feature Representation for Person Re-identification∗

Shangxuan Wu  Ying-Cong Chen  Xiang Li  An-Cong Wu  Jin-Jie You  Wei-Shi Zheng†

Intelligence Science and System Lab, Sun Yat-sen University, China
Guangdong Provincial Key Laboratory of Computational Science, China

[email protected], [email protected], [email protected]

[email protected], [email protected], [email protected]

Abstract

Feature representation and metric learning are two critical components in person re-identification models. In this paper, we focus on the feature representation and claim that hand-crafted histogram features can be complementary to Convolutional Neural Network (CNN) features. We propose a novel feature extraction model called Feature Fusion Net (FFN) for pedestrian image representation. In FFN, back propagation makes the CNN features constrained by the hand-crafted features. Utilizing color histogram features (RGB, HSV, YCbCr, Lab and YIQ) and texture features (multi-scale and multi-orientation Gabor features), we obtain a new deep feature representation that is more discriminative and compact. Experiments on three challenging datasets (VIPeR, CUHK01, PRID450s) validate the effectiveness of our proposal.

1. Introduction

Person re-identification aims at matching people from different views under surveillance cameras, and has been studied extensively in the past five years. To address the re-identification problem, existing methods exploit either cross-view invariant features [9, 7, 27, 19, 14, 33, 12, 20, 18] or cross-view robust metrics [4, 5, 12, 17, 33, 23, 3, 28, 34, 25].

Recently, Convolutional Neural Networks (CNNs) have been adopted in person re-identification, e.g. [16, 1, 26, 10]. Deep learning provides a powerful and adaptive approach to handle computer vision problems without excessive hand-crafting of image features. The back propagation algorithm dynamically adjusts the parameters in the CNN, which unifies the feature extraction and pairwise comparison processes in a single network.

∗Citation for this paper: Shangxuan Wu, Ying-Cong Chen, Xiang Li, An-Cong Wu, Jin-Jie You, and Wei-Shi Zheng. An Enhanced Deep Feature Representation for Person Re-identification. In IEEE WACV, 2016.
†Corresponding author

(a) VIPeR (b) CUHK01 (c) PRID450s

Figure 1: Sample images from the VIPeR, CUHK01 and PRID450s datasets. Images in the same column represent the same person.

However, in real-world person re-identification, a person's appearance often undergoes large variations across non-overlapping camera views, due to significant changes in view angle, lighting, background clutter and occlusion (see Fig. 1). Hand-crafted concatenations of different appearance features, e.g. RGB and HSV color histograms and the LBP descriptor, which are designed to overcome cross-view appearance variations in re-identification tasks, can sometimes be more distinctive and reliable.

In order to effectively combine hand-crafted features and deeply learned features, we investigate the combination and complementarity of multi-colorspace hand-crafted features (ELF16) and deep features extracted from a CNN. A deep Feature Fusion Network (FFN) is proposed to use the hand-crafted features to regularize the CNN process, so that the convolutional neural network extracts features complementary to the hand-crafted ones. After extracting features with our FFN, traditional metric learning methods can be applied to boost the performance.


Experimental results on three challenging person re-identification datasets (VIPeR, CUHK01, PRID450s) demonstrate the effectiveness of our new features. A significant improvement in Rank-1 matching rate (8.09%, 7.98% and 11.2% on the three datasets, respectively) is achieved compared to state-of-the-art methods. In short, we show that hand-crafted features can improve the extraction of CNN features in FFN, achieving a more robust image representation.

2. Related Works

Hand-crafted Features. Color and texture are two of the most useful characteristics in image representation. For example, HSV and LAB color histograms are used to measure the color information in an image, while the LBP histogram [22] and Gabor filters describe its textures. Recent papers use a combination of different features to produce more effective representations [27, 9, 7, 32, 33, 20].

Recently, features specifically designed for person re-identification have significantly boosted the matching rate. Local Descriptors encoded by Fisher Vectors (LDFV) [19] build descriptors on the Fisher Vector. Color invariants (ColorInv) [14] use color distributions as the sole cue to achieve good recognition performance. Symmetry-Driven Accumulation of Local Features (SDALF) [7] shows that the symmetry structure of body segments can improve performance significantly, and that accumulating features provides robustness to image distortions. Local Maximal Occurrence features (LOMO) [18] analyze the horizontal occurrence of local features and maximize the occurrence to stably represent re-identification images.

Deep Learning. Convolutional Neural Networks have been widely used in many computer vision problems, but only a few papers concern deep learning for person re-identification.

Li et al. first proposed the deep Filter Pairing Neural Network (FPNN) [16], which used a patch-matching layer and a maxout pooling layer to handle pose and viewpoint variations; FPNN was also the first work to employ deep learning for person re-identification. Ahmed et al. improved the deep learning architecture by specifically designing a cross-input neighbourhood difference layer [1]. Later, the deep metric learning in [26] used a "siamese" deep neural structure and a cosine layer to deal with large variations of person images. Hu et al. proposed Deep Transfer Metric Learning (DTML) [10], which transfers cross-domain visual knowledge to target datasets.

These deep methods combine feature extraction and image-pair classification in a single CNN network. Pairwise comparison and symmetry structures are widely used among them, which could be seen as inheritances from traditional metric learning methods [9, 7, 27, 19, 14, 33, 12, 20, 18, 34, 25]. Since pairwise comparison is used to learn the deep neural network, a large number of pairs must be formed for each probe image, and deep convolution must be performed on all of these pairs. Compared to these works, our FFN is not based on pairwise input but directly extracts deep features from a single image, so that our deep architecture can be followed by any conventional classifier, while existing deep learning works cannot.

3. Methodology

3.1. Network Architecture

We use our modification of a convolutional neural network, the Feature Fusion Net (FFN), to learn new features. The network architecture is shown in Fig. 2. Our Feature Fusion Network consists of two parts. The first part applies traditional convolution, pooling and activation neurons to the input image; the second part processes additional hand-crafted feature representations of the same image. These two sub-networks are finally linked together to produce a full-fledged image description, so the second part regularizes the first part during learning. Finally, our new feature (a 4096D vector) is extracted from the last fully connected layer (the Fusion Layer) of FFN.
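To make the two-branch structure concrete, the following is a minimal sketch of an FFN-style network in PyTorch. Only the 4096D Buffer Layers, the Fusion Layer with ReLU and dropout, and the softmax classification head follow the description above; the convolutional stack and its layer sizes are placeholders rather than the exact AlexNet-derived configuration used in the paper, and the number of identities is whatever the training set provides.

```python
# Sketch of the Feature Fusion Net (FFN) structure described above. Assumptions:
# the conv stack below stands in for the paper's conv/pooling/LRN stack, and
# num_ids should equal the number of training identities.
import torch
import torch.nn as nn

class FeatureFusionNet(nn.Module):
    def __init__(self, elf_dim=8064, num_ids=1501):
        super().__init__()
        # Upper branch: convolutional feature extractor producing a 4096-D vector.
        self.cnn = nn.Sequential(
            nn.Conv2d(3, 96, kernel_size=11, stride=4), nn.ReLU(), nn.MaxPool2d(3, 2),
            nn.Conv2d(96, 256, kernel_size=5, padding=2), nn.ReLU(), nn.MaxPool2d(3, 2),
            nn.Conv2d(256, 256, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d((4, 4)),           # 256 * 4 * 4 = 4096
        )
        self.cnn_buffer = nn.Linear(4096, 4096)      # Buffer Layer for CNN features
        self.elf_buffer = nn.Linear(elf_dim, 4096)   # Buffer Layer for ELF16 features
        self.fusion = nn.Linear(2 * 4096, 4096)      # Fusion Layer -> new 4096-D feature
        self.drop = nn.Dropout(0.5)
        self.classifier = nn.Linear(4096, num_ids)   # softmax / cross-entropy head

    def forward(self, image, elf16):
        c = torch.relu(self.cnn_buffer(self.cnn(image).flatten(1)))
        e = torch.relu(self.elf_buffer(elf16))
        fused = self.drop(torch.relu(self.fusion(torch.cat([c, e], dim=1))))
        return self.classifier(fused), fused         # logits for training, feature for re-id
```

At test time only the 4096D fused vector is kept and handed to a conventional metric learning method.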

3.2. CNN Features

The upper part of Fig. 2 describes a traditional pipeline of convolution and pooling. Every convolution layer is followed by a pooling layer and a local response normalization (LRN) layer [13], except for the 3rd layer. Finally, the output of the 5th pooling layer is a 4096D vector, which we regard as the CNN features.

Most re-identification models regard the CNN as a whole binary classifier with direct image input, such as DeepReID [16] and Ahmed's Improved Deep Re-id Model [1]. However, the work in [6] gives us a strong reason to treat the convolution layers as a feature extractor. One major characteristic of re-identification images is that they are whole-body images captured under different camera views. Most body parts can be found in all camera views, but they suffer from serious malposition, distortion and misalignment. The convolution in a CNN allows part displacement and visual changes to be alleviated in higher-level convolution layers. Multiple convolution kernels provide different descriptions of pedestrian images. In addition, the pooling and LRN layers provide a nonlinear expression of the corresponding description, which significantly reduces overfitting. These layers contribute to a stable convolutional neural network that can be applied to new datasets (see Section 4 for the detailed training process).


Figure 2: Feature Fusion Net (FFN) for ELF16 features and CNN features.

3.3. Hand-crafted Features

The lower part of Fig. 2 extracts conventional hand-crafted features widely used in person re-identification. In this work, we employ the Ensemble of Local Features (ELF) [9], as improved in [32, 33]. It extracts RGB, HSV and YCbCr histograms from 6 horizontal stripes of the input image. Also, 8 Gabor filters and 13 Schmid filters are applied to obtain the corresponding texture information.

We modify the ELF feature by improving the color space and stripe division [3]. The input image is equally partitioned into 16 horizontal stripes, and our features are composed of color features (RGB, HSV, LAB, XYZ, YCbCr and NTSC) and texture features (Gabor, Schmid and LBP). A 16D histogram is extracted for each channel and then normalized by the L1-norm. All histograms are concatenated together to form a single vector. In this work, we denote this type of hand-crafted feature as ELF16.
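For illustration, here is a small numpy sketch of the stripe-histogram idea behind ELF16 under simplifying assumptions: only the three RGB channels are shown, with a 16-bin, L1-normalized histogram per channel on each of 16 horizontal stripes; the remaining color spaces and the Gabor/Schmid/LBP texture channels would be appended in the same way.

```python
# Simplified ELF16-style color histogram extraction (RGB channels only).
import numpy as np

def stripe_color_hist(image, n_stripes=16, n_bins=16):
    """image: H x W x 3 uint8 array; returns concatenated per-stripe histograms."""
    h = image.shape[0]
    feats = []
    for s in range(n_stripes):
        stripe = image[s * h // n_stripes:(s + 1) * h // n_stripes]
        for c in range(stripe.shape[2]):                     # one histogram per channel
            hist, _ = np.histogram(stripe[..., c], bins=n_bins, range=(0, 256))
            hist = hist.astype(np.float64)
            feats.append(hist / max(hist.sum(), 1e-12))      # L1 normalization
    return np.concatenate(feats)

# A 128 x 48 crop gives 16 stripes * 3 channels * 16 bins = 768 dimensions here;
# the full ELF16 descriptor with all color and texture channels is 8064-D (Table 5).
vec = stripe_color_hist(np.random.randint(0, 256, (128, 48, 3), dtype=np.uint8))
print(vec.shape)  # (768,)
```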

3.4. Proposed New Features

We aim to jointly map CNN features and hand-crafted features to a unitary feature space. A feature fusion deep neural network is proposed in order to use the hand-crafted features to regularize the CNN features, so as to make the CNN extract complementary features. In our framework, through back propagation, the parameters of the whole CNN network can be affected by the hand-crafted features. In general, as a result of the fusion, the regularized CNN features output by our proposed network should be more discriminative than both the plain CNN features and the employed hand-crafted features.

Fusion Layer and Buffer Layer. Our Fusion Layer uses full connection to provide self-adaptation to person re-identification problems. Both the ELF16 features and the CNN features are followed by a fully connected layer with 4096D output (the Buffer Layer), which acts as a buffer for the fusion. The Buffer Layer is essential in our architecture, since it bridges the gap between two features with a huge difference and guarantees the convergence of FFN.

If the input of the Fusion Layer is

$$x = [\,\text{ELF16},\ \text{CNN Features}\,], \qquad (1)$$

then the output of this layer is computed by

$$Z_{\text{Fusion}}(x) = h(W_{\text{Fusion}}^{T}x + b_{\text{Fusion}}), \qquad (2)$$

where $h(\cdot)$ denotes the activation function. ReLU and dropout layers are adopted, with a dropout ratio of 0.5. According to the back propagation algorithm, the parameters of the $l$th layer after a new iteration are written as

$$W^{(l)}_{\text{new}} = W^{(l)} - \alpha\Big[\frac{1}{m}\Delta W^{(l)} + \lambda W^{(l)}\Big], \qquad (3)$$

$$b^{(l)}_{\text{new}} = b^{(l)} - \alpha\Big[\frac{1}{m}\Delta b^{(l)}\Big], \qquad (4)$$


where the parameters $\alpha$, $m$ and $\lambda$ are set under the guidance of [2].
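The following numpy fragment restates Eqs. (2)-(4) in code form: the fusion-layer forward pass with a ReLU activation, and one SGD step with weight decay. The weight-decay value lam is an assumed placeholder; alpha and m follow the training settings given in Section 4.2.

```python
# Fusion-layer forward pass (Eq. 2) and SGD update (Eqs. 3-4); illustrative only.
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def fusion_forward(elf16, cnn_feat, W, b):
    """Eq. (2): Z_Fusion(x) = h(W^T x + b), with x = [ELF16, CNN features]."""
    x = np.concatenate([elf16, cnn_feat])
    return relu(W.T @ x + b)

def sgd_step(W, b, dW_sum, db_sum, alpha=1e-5, m=25, lam=5e-4):
    """Eqs. (3)-(4): alpha = learning rate, m = mini-batch size, lam = weight decay
    (lam's value here is an assumption; the paper sets it following [2])."""
    W_new = W - alpha * (dW_sum / m + lam * W)
    b_new = b - alpha * (db_sum / m)
    return W_new, b_new
```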

Existing deep networks for person re-identification adopt the Deviance Loss [26] or Maximum Mean Discrepancy [1] as the loss function. However, we aim at extracting deep features from every image effectively rather than performing pairwise comparison through a deep neural network. Therefore, the softmax loss function is applied in our model; intuitively, a more discriminative feature representation should also result in a lower softmax loss. For a single input vector $x$ and a single output node $j$ in the last layer, the output probability is calculated by:

$$p(y = j \mid x; \theta) = \frac{e^{\theta_j^T x}}{\sum_{k=1}^{n} e^{\theta_k^T x}}. \qquad (5)$$

The last layer of our network is designed to minimize the cross-entropy loss:

$$J = -\sum_{k=1}^{n} p_k \log p_k, \qquad (6)$$

in which the number of output nodes $n$ varies with the training set, as described in Section 4.
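As a concrete restatement of Eqs. (5)-(6), the snippet below computes the softmax probability of each output node and the corresponding cross-entropy against a one-hot identity label (the one-hot target is the usual reading of the loss and should be treated as an assumption about the exact form used).

```python
# Softmax probability (Eq. 5) and cross-entropy objective (Eq. 6), numpy sketch.
import numpy as np

def softmax_prob(theta, x):
    """p(y = j | x; theta) for every output node j; theta has shape (n, d)."""
    scores = theta @ x
    scores = scores - scores.max()          # for numerical stability
    e = np.exp(scores)
    return e / e.sum()

def cross_entropy(p, target):
    """J = -sum_k target_k * log(p_k) for a one-hot target vector."""
    return -np.sum(target * np.log(p + 1e-12))
```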

3.5. How do Hand-crafted Features Influence the Extraction of CNN Features?

If the parameters of the network are influenced by the ELF16 features, i.e., the gradients of the network parameters are adjusted according to them, then the ELF16 features in the lower part of FFN make the CNN features more complementary to them, since the final objective of FFN is to make our features more discriminative across different images.

Denote the CNN features (in the FC7 layer) as $x$ and the ELF16 features as $\tilde{x}$. Denote the weight connecting the $j$th node in the $n$th layer and the $i$th node in the $(n+1)$th layer as $W^n_{ij}$. Let $Z^n_j = \sum_i W^{n-1}_{ji} a^{n-1}_i$, where $a^{n-1}_i = h(Z^{n-1}_i)$. Denote

$$\delta^n_i = \frac{\partial J}{\partial Z^n_i}. \qquad (7)$$

Note that $Z^8_j = \sum_i W^7_{ji} x_i$. We show that, by using back propagation, $\partial J / \partial W^7_{ij}$ is influenced by $\tilde{x}$. In this way, the CNN branch learns parameters that form features complementary to the ELF16 features $\tilde{x}$. Note that

$$\frac{\partial J}{\partial W^7_{ij}} = x_j\,\delta^8_i, \qquad (8)$$

where

$$\delta^8_i = \Big(\sum_j W^8_{ji}\,\delta^9_j\Big)\, h'(Z^8_i), \qquad (9)$$

$$\delta^9_j = \Big(\sum_k W^9_{kj}\,\delta^{10}_k\Big)\, h'(Z^9_j). \qquad (10)$$

$\delta^9_j$ is influenced by $\tilde{x}$ in two ways. Firstly,

$$Z^9_k = \sum_j W^8_{kj}\, a^8_j + \sum_j \widetilde{W}^8_{kj}\, \tilde{a}^8_j, \qquad (11)$$

where

$$\tilde{a}^8_j = h\Big(\sum_i \widetilde{W}^7_{ji}\, \tilde{x}_i\Big). \qquad (12)$$

In other words, the information in the ELF16 features $\tilde{x}$ propagates through $h'(Z^9_j)$, and thus the convolution filters of the deep feature extraction part adapt themselves according to $\tilde{x}$. Secondly, the output of the softmax loss layer is influenced by $\tilde{x}$ during the forward propagation, and thus $\delta^{10}_k$ is also influenced by $\tilde{x}$.
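The toy example below illustrates this argument numerically under heavy simplifications (tiny dimensions, random weights, a linear buffer activation, a single fusion layer feeding two output nodes): the gradient with respect to the CNN-branch weights changes when the hand-crafted input changes, which is exactly the coupling described above.

```python
# Toy check: dJ/dW7 of the CNN branch depends on the hand-crafted input x_tilde.
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=3)                      # toy CNN (FC7) features
W7 = rng.normal(size=(4, 3))                # CNN-branch buffer weights
W7_t = rng.normal(size=(4, 3))              # hand-crafted-branch buffer weights
W8 = rng.normal(size=(2, 8))                # fusion layer feeding 2 output nodes
target = np.array([1.0, 0.0])               # one-hot identity label

def grad_W7(x_tilde):
    a = W7 @ x                              # CNN branch (linear buffer for simplicity)
    a_t = W7_t @ x_tilde                    # hand-crafted branch
    z = W8 @ np.concatenate([a, a_t])       # fused pre-softmax scores
    p = np.exp(z - z.max()); p /= p.sum()   # softmax, cf. Eq. (5)
    delta_out = p - target                  # gradient of the loss w.r.t. z
    delta_a = W8[:, :4].T @ delta_out       # back-propagated into the CNN branch
    return np.outer(delta_a, x)             # dJ/dW7, cf. Eq. (8)

g1, g2 = grad_W7(rng.normal(size=3)), grad_W7(rng.normal(size=3))
print(np.allclose(g1, g2))                  # False: the gradient depends on x_tilde
```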

4. Settings for Feature Fusion Network

4.1. Training Dataset

Market-1501 is a multi-shot person re-identification dataset recently reported by [31]. It consists of 38,195 images from 1,501 identities, which makes it the largest public person re-identification dataset available. We trained our Feature Fusion Network on Market-1501 and used it to extract features in Section 5.

4.2. Training Strategies

Our training strategy applies mini-batch stochastic gradient descent (SGD) for faster back propagation and smoother convergence [2]. In each iteration of the training phase, 25 images form a mini-batch and are forwarded to the softmax loss layer. The initial learning rate is $\gamma = 10^{-5}$, which is significantly smaller than in most other CNN models. Every 20,000 iterations the learning rate is decreased by $\gamma_{\text{new}} = 0.1\gamma$. We finetuned our network from the ImageNet [13] model provided by [11]. Our FFN model took 50,000 iterations to converge (about 4 hours on a Tesla K20m GPU). In order to improve the adaptation of our model, we further use difficult samples to finetune the network.

Hard negative mining [1] gives us a logical way to emphasize difficult samples in CNN training. This strategy was originally designed to balance the positive and negative samples in pairwise comparison for person re-identification, and we apply it to our Feature Fusion Network as well. About 12,000 images of 630 IDs were wrongly labeled by the previously trained network and were manually picked out for further finetuning. We replaced the last softmax loss layer with one having fewer output nodes and continued to finetune our model on these difficult samples, with a lower learning rate ($10^{-6}$) and fewer iterations (about 10,000). The whole training process took about 5-6 hours to converge to a tolerable training loss (typically about 0.05).
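The schedule above can be summarized in a few lines; all values below are taken directly from the text (batch size 25, initial learning rate 1e-5, a 10x decay every 20,000 iterations, 50,000 iterations, then finetuning on hard samples at 1e-6 for roughly 10,000 iterations).

```python
# Training schedule of the FFN as described in Section 4.2, collected as a config.
solver = {
    "batch_size": 25,
    "base_lr": 1e-5,
    "lr_step": 20000,      # decay interval (iterations)
    "lr_gamma": 0.1,       # gamma_new = 0.1 * gamma
    "max_iter": 50000,
    "finetune_lr": 1e-6,   # hard-negative finetuning stage
    "finetune_iter": 10000,
}

def learning_rate(iteration, cfg=solver):
    """Step schedule: multiply the learning rate by lr_gamma every lr_step iterations."""
    return cfg["base_lr"] * cfg["lr_gamma"] ** (iteration // cfg["lr_step"])

assert abs(learning_rate(45000) - 1e-7) < 1e-12   # after two decays: 1e-5 -> 1e-6 -> 1e-7
```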


Figure 3: CMC curves (matching rate versus rank) on the three datasets: (a) VIPeR, (b) CUHK01 and (c) PRID450s under the L1-norm; (d) VIPeR, (e) CUHK01 and (f) PRID450s under LFDA; (g) VIPeR, (h) CUHK01 and (i) PRID450s under Mirror KMFA. Each legend entry is labeled with the Rank-1 matching rate of the corresponding feature. The yellow CMC curves in the last row indicate the final model used in Section 5.4.

5. Experiments

This section evaluates our new features from different perspectives. We present extensive experimental results on three benchmark datasets in order to clearly demonstrate the effectiveness of our features.

5.1. Datasets and Experiment Protocols

Our tests were based on three publicly available datasets: VIPeR [8], CUHK01 [15] and PRID450s [24]. Each dataset is captured in two disjoint camera views, with significant misalignment, lighting change and body-part distortion. Table 1 briefly introduces these three datasets, and some sample images are shown in Fig. 1.

In each individual experiment, we randomly selected half of the identities as the training set and the other half as the testing set. The training set was used to learn the projection matrix $W$ (in metric learning methods). On the testing set, $x' = W^T x$ gives the final projection of $x$, which is used to measure the distance between a pair of input images. For the reliability and stability of our results, each experiment was repeated 10 times and the average Rank-i accuracy was computed. Cumulative Matching Characteristic (CMC) curves are also provided in Fig. 3, giving a more intuitive comparison between different algorithms.

We applied the single-shot protocol in our experiments; that is, during the testing phase, one image was chosen from View 2 as the probe and all images in View 1 were regarded as the gallery. For CUHK01 specifically, which has 2 images of the same person in each camera view, we randomly chose one image of each identity as the gallery.
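The single-shot evaluation described above boils down to ranking the gallery for each probe and accumulating a CMC curve. The sketch below assumes the features have already been projected by whatever metric was learned (e.g. $x' = W^T x$) and uses plain Euclidean distance for illustration.

```python
# Single-shot CMC evaluation sketch: one probe per identity from View 2, ranked
# against the View 1 gallery; element 0 of the returned curve is the Rank-1 rate.
import numpy as np

def cmc_curve(probe_feats, probe_ids, gallery_feats, gallery_ids, max_rank=20):
    hits = np.zeros(max_rank)
    for feat, pid in zip(probe_feats, probe_ids):
        dists = np.linalg.norm(gallery_feats - feat, axis=1)   # distance to every gallery image
        order = np.argsort(dists)
        rank = np.where(gallery_ids[order] == pid)[0][0]       # position of the correct match
        if rank < max_rank:
            hits[rank:] += 1                                   # a hit at rank r counts for all ranks >= r
    return hits / len(probe_ids)
```

Averaging this curve over the 10 random train/test splits corresponds to the protocol behind the numbers reported in Fig. 3 and Tables 2-4.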

Mirror Kernel Marginal Fisher Analysis (KMFA), proposed by [3], provides a high-performance metric learning algorithm for person re-identification. This method was adopted in Section 5.3.2, with the chi-square kernel embedded and parameters set to the optimal values according to [3].


                                VIPeR   CUHK01   PRID450s
No. of images                    1264     3884        900
No. of identities                 632      971        450
No. of images in training set     316      485        225
No. of camera views                 2        2          2
No. of images per view per ID       1        2          1

Table 1: Re-identification datasets used in our experiments.

5.2. Features

Six feature extraction approaches were evaluated in our experiments for comparison: LDFV [19], gBiCov [20], ImageNet [13] CNN features, LOMO features [18], ELF16 features and our proposed features¹. For the ImageNet CNN features alone, the FC7 layer output, which produced the highest accuracy in our tests, was chosen. Local Maximal Occurrence Representation (LOMO) is another high-performance feature representation specifically designed for the re-identification problem. LDFV features were evaluated only on the VIPeR dataset due to the copyright of its code. All images were resized to 224×224 for our feature extraction.

In order to demonstrate the effectiveness of our new feature, two compound features (ELF16+CNN-FC7 and Ours+LOMO) were also added to the comparison. ELF16+CNN-FC7 denotes the concatenation of the normalized CNN-FC7 feature with the ELF16 feature. Ours+LOMO denotes the concatenation of our new features with normalized LOMO features.
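Both compound features are plain concatenations; for example, Ours+LOMO normalizes the LOMO vector and appends it to the 4096D FFN feature, giving the 31056D representation used later with Mirror KMFA. L2 normalization is shown below as an assumption about the normalization step.

```python
# Compound feature construction (Ours+LOMO): normalize LOMO, then concatenate.
import numpy as np

def ours_plus_lomo(ffn_feat, lomo_feat):
    lomo_norm = lomo_feat / (np.linalg.norm(lomo_feat) + 1e-12)   # normalization (assumed L2)
    return np.concatenate([ffn_feat, lomo_norm])                  # 4096 + 26960 = 31056 dims
```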

All of these features were extracted and evaluated at their default dimensions (see Table 5).

5.3. Evaluations on Features

5.3.1 Unsupervised Method

Fig. 3 (a)-(c) shows the performance of our features compared to other features under the L1-norm, which evaluates each feature's capability from a raw, unsupervised perspective. Our features significantly outperform the other stand-alone features (see Fig. 3 (a)-(c)), suggesting that the raw information provided by our feature is more accurate for representing re-identification images in most cases.

ELF16+CNN-FC7 performed second best and outperformed both ELF16 and CNN-FC7, which supports our assumption that traditional features and CNN features are complementary. Also, our new features significantly outperformed ELF16+CNN-FC7, which may be because of the following two reasons:

• CNN features in our network were trained to be complementary to the traditional features, while in ELF16+CNN-FC7 the CNN features are simply concatenated with the ELF16 features, which may not be optimal.

• The Buffer Layer and Fusion Layer automatically tune the weights for each feature, which makes the fused feature perform much better.

¹Our proposed features are available at http://isee.sysu.edu.cn/resource

LOMO features were specifically designed to describe person re-identification images. However, they ranked seventh on VIPeR and third on CUHK01, and are therefore not stable enough under the L1-norm.

5.3.2 Metric Learning Methods

To demonstrate the maximal effectiveness of our image description, we fed it into two metric learning methods, LFDA [23] and Mirror KMFA [3], along with other widely used features. We used each feature to learn a distance metric between each probe image and the gallery set. In these experiments, we evaluated their capability under supervised metric learning methods.

Fig. 3 (d)-(i) shows the CMC curves on the three datasets, with the Rank-1 identification rate labeled for each feature type. Note that LDFV performed badly with the chi-square kernel, so we adopted Mirror MFA without the kernel trick for it in the comparison.

The results clearly show the outstanding performance of our proposed features, which exceed all stand-alone features on VIPeR and CUHK01. Compared to the ELF16 and CNN-FC7 features alone, our new features yield much better results. Also, the simple concatenation of these two features (ELF16+CNN-FC7) cannot represent the image as well as ours, which indicates the necessity of the Fusion Layer in the proposed FFN.

Rank                         1      5      10     20
Our Model                  51.06  81.01  91.39  96.90
Deep Feature Learning [6]  40.50  60.80  70.40  84.40
LOMO+XQDA [18]             40.00  67.40  80.51  91.08
Mirror KMFA (Rχ2) [3]      42.97  75.82  87.28  94.84
mFilter+LADF [30]          43.39  73.04  84.87  93.70
mFilter [30]               29.11  52.10  67.20  80.14
SalMatch [28]              30.16  52.31  65.54  79.15
LFDA [23]                  24.18  52.85  67.12  78.96
LADF [17]                  29.34  61.04  75.98  88.10
RDC [33]                   15.66  38.42  53.86  70.09
KISSME [12]                24.75  53.48  67.44  80.92
LMNN-R [5]                 19.28  48.71  65.49  78.34
PCCA [21]                  19.28  48.89  64.91  80.28
L2-norm                    10.89  22.37  32.34  45.19
L1-norm                    12.15  26.01  32.09  34.72

Table 2: Top matching rank on VIPeR (sorted by time of proposal).

Our proposed features are better than the LOMO features in most cases. Since LOMO emphasizes HSV and SILTP histograms, it performed better on PRID450s, which exhibits specific lighting conditions; but on the other datasets, our new features remain better than LOMO.


Rank                         1      5      10     20
Our Model                  55.51  78.40  83.68  92.59
Mirror KMFA (Rχ2) [3]      40.40  64.63  75.34  84.08
Ahmed's Deep Re-id [1]     47.53  72.10  80.53  88.49
mFilter [30]               34.30  55.12  64.91  74.53
SalMatch [28]              28.45  45.85  55.67  67.95
DeepReID [16]              27.87  64.01  82.50  87.36
ITML [4]                   15.98  35.22  45.60  59.81
eSDC [29]                  19.67  32.72  40.29  50.58
LFDA [23]                  22.08  41.56  53.85  64.51
KISSME [12]                14.02  32.20  44.44  56.61
LMNN-R [5]                 13.45  31.33  42.25  54.11
L2-norm                     5.63  16.00  22.89  30.63
L1-norm                    10.80  15.51  37.57  35.57

Table 3: Top matching rank on CUHK01 (sorted by time of proposal).

Rank                         1      5      10     20
Our Model                  66.62  86.84  92.84  96.89
Mirror KMFA (Rχ2) [3]      55.42  79.29  87.82  93.87
Ahmed's Deep Re-id [1]     34.81  63.72  76.24  81.90
ITML [4]                   24.27  47.82  58.67  70.89
LFDA [23]                  36.18  61.33  72.40  82.67
KISSME [12]                36.31  65.11  75.42  83.69
LMNN-R [5]                 28.98  55.29  67.64  78.36
L2-norm                    11.33  24.50  33.22  43.89
L1-norm                    25.50  25.33  51.73  53.07

Table 4: Top matching rank on PRID450s (sorted by time of proposal).

The concatenation of these two features (Ours+LOMO) has strong discriminative ability and outperformed all other features with Mirror KMFA. This indicates that our deeply learned features are complementary to the LOMO features. Thus, we regard this 31056D mixed feature as the final image representation in the Mirror KMFA person re-identification model.

5.4. Comparison with State-of-the-Art

This experiment compares the overall performance of state-of-the-art person re-identification models and ours. Our model is based on Mirror KMFA, using the concatenation of our new features and normalized LOMO features (Ours+LOMO).

Tables 2-4 summarize some of the best-performing models on VIPeR, CUHK01 and PRID450s, including LOMO+XQDA [18], Mirror KMFA [3], Ahmed's Improved Deep Re-id [1] and the Mid-level Filter [30]. Our model beats them by about 10% in Rank-1 matching rate.

Three deep learning methods (DeepReID [16], Ahmed's Deep Re-id [1], and Ding's Deep Feature Learning [6]) are specifically listed in Tables 2-4. All of them modified the CNN for pairwise comparison and employed unique layers to match two views of the input images.

In comparison, our model regards the CNN as a feature extractor, while adopting metric learning to calculate the relative distance between different images. This not only contributes to the improvement in accuracy, but also enables us to use larger datasets in the CNN training process. Our model also clearly exceeds their performance on CUHK01 (by 7.98%) and PRID450s (by 11.2%).

Feature              Extraction Time     Default Dimension
gBiCov               13.6152s            5940
LOMO                 0.2610s             26960
ELF16                0.5720s             8064
CNN-FC7              0.1773s             4096
Ours (with ELF16)    0.1769s + 0.5720s   4096

Table 5: Average time of extracting features from a single 64x128 image (evaluated on a 2.00GHz Xeon CPU with 16 cores).

5.5. Running Time

We evaluated the running time of these feature extraction algorithms, as shown in Table 5. The reported time is the average feature extraction time for a single 48×128 image on the VIPeR dataset (at its default dimension). Note that the time for extracting ELF16 features is included in the last row. It can be seen that our Feature Fusion Network is even faster than some of the hand-crafted methods (such as gBiCov), which breaks the stereotype of a huge and clumsy convolutional neural network. Also, most of the time is spent on the extraction of ELF16 features. Compared to LOMO features, our features have a much lower dimension and thus run faster in the subsequent metric learning step. With a balance between speed and dimensional complexity, our Feature Fusion Network can easily be applied in practice. Besides, compared to other CNN-based models, our FFN does not need to be finetuned on the target datasets, which makes it faster to apply.

6. Conclusion

In this paper, we have presented a novel and effective feature extraction method for person re-identification called the Feature Fusion Network (FFN). This model jointly utilizes CNN features and hand-crafted features, including RGB, HSV, YCbCr, Lab and YIQ color features and Gabor texture features. It adjusts the weights of this information automatically through the back propagation process of the neural network. We have also shown that FFN regularizes the CNN process so as to make the CNN focus on extracting complementary features. Experiments on three challenging person re-identification datasets (VIPeR, CUHK01, PRID450s) show the effectiveness of our learned deep features.


By using Mirror Kernel Marginal Fisher Analysis (KMFA), our proposed features significantly outperform the state-of-the-art person re-identification models on these three datasets by 8.09%, 7.98% and 11.2% in Rank-1 accuracy, respectively.

Acknowledgements

This research was partly supported by the Guangdong Provincial Government of China through the Computational Science Innovative Research Team Program, and partially by the Natural Science Foundation of China (Nos. 61472456, 61522115, 61573387), the Guangzhou Pearl River Science and Technology Rising Star Project under Grant 2013J2200068, the Guangdong Natural Science Funds for Distinguished Young Scholar under Grant S2013050014265, and the GuangDong Program (No. 2015B010105005).

References

[1] E. Ahmed, M. Jones, and T. K. Marks. An improved deep learning architecture for person re-identification. In IEEE CVPR, 2015.
[2] L. Bottou. Stochastic gradient descent tricks. In Neural Networks: Tricks of the Trade. 2012.
[3] Y.-C. Chen, W.-S. Zheng, and J. Lai. Mirror representation for modeling view-specific transform in person re-identification. In IJCAI, 2015.
[4] J. V. Davis, B. Kulis, P. Jain, S. Sra, and I. S. Dhillon. Information-theoretic metric learning. In ICML, 2007.
[5] M. Dikmen, E. Akbas, T. S. Huang, and N. Ahuja. Pedestrian recognition with a learned metric. In ACCV. 2011.
[6] S. Ding, L. Lin, G. Wang, and H. Chao. Deep feature learning with relative distance comparison for person re-identification. Pattern Recognition, 2015.
[7] M. Farenzena, L. Bazzani, A. Perina, V. Murino, and M. Cristani. Person re-identification by symmetry-driven accumulation of local features. In IEEE CVPR, 2010.
[8] D. Gray, S. Brennan, and H. Tao. Evaluating appearance models for recognition, reacquisition, and tracking. In IEEE PETS Workshop, 2007.
[9] D. Gray and H. Tao. Viewpoint invariant pedestrian recognition with an ensemble of localized features. In ECCV. 2008.
[10] J. Hu, J. Lu, and Y.-P. Tan. Deep transfer metric learning. In IEEE CVPR, 2015.
[11] Y. Jia, E. Shelhamer, J. Donahue, S. Karayev, J. Long, R. Girshick, S. Guadarrama, and T. Darrell. Caffe: Convolutional architecture for fast feature embedding. arXiv:1408.5093, 2014.
[12] M. Koestinger, M. Hirzer, P. Wohlhart, P. M. Roth, and H. Bischof. Large scale metric learning from equivalence constraints. In IEEE CVPR, 2012.
[13] A. Krizhevsky, I. Sutskever, and G. E. Hinton. Imagenet classification with deep convolutional neural networks. In NIPS, 2012.
[14] I. Kviatkovsky, A. Adam, and E. Rivlin. Color invariants for person reidentification. IEEE TPAMI, 35(7):1622–1634, 2013.
[15] W. Li, R. Zhao, and X. Wang. Human reidentification with transferred metric learning. In ACCV, 2012.
[16] W. Li, R. Zhao, T. Xiao, and X. Wang. Deepreid: Deep filter pairing neural network for person re-identification. In IEEE CVPR, 2014.
[17] Z. Li, S. Chang, F. Liang, T. S. Huang, L. Cao, and J. R. Smith. Learning locally-adaptive decision functions for person verification. In IEEE CVPR, 2013.
[18] S. Liao, Y. Hu, X. Zhu, and S. Z. Li. Person re-identification by local maximal occurrence representation and metric learning. In IEEE CVPR, 2015.
[19] B. Ma, Y. Su, and F. Jurie. Local descriptors encoded by fisher vectors for person re-identification. In ECCV, 2012.
[20] B. Ma, Y. Su, and F. Jurie. Covariance descriptor based on bio-inspired features for person re-identification and face verification. IVC, 32(6):379–390, 2014.
[21] A. Mignon and F. Jurie. PCCA: A new approach for distance learning from sparse pairwise constraints. In IEEE CVPR, 2012.
[22] T. Ojala, M. Pietikainen, and T. Maenpaa. Multiresolution gray-scale and rotation invariant texture classification with local binary patterns. IEEE TPAMI, 24(7):971–987, 2002.
[23] S. Pedagadi, J. Orwell, S. Velastin, and B. Boghossian. Local fisher discriminant analysis for pedestrian re-identification. In IEEE CVPR, 2013.
[24] P. M. Roth, M. Hirzer, M. Kostinger, C. Beleznai, and H. Bischof. Mahalanobis distance learning for person re-identification. In Person Re-Identification, pages 247–267. 2014.
[25] X. Wang, W.-S. Zheng, X. Li, and J. Zhang. Cross-scenario transfer person re-identification. IEEE TCSVT, PP(99):1–1, 2015.
[26] D. Yi, Z. Lei, S. Liao, and S. Z. Li. Deep metric learning for person re-identification. In IEEE ICPR, 2014.
[27] Y. Zhang and S. Li. Gabor-lbp based region covariance descriptor for person re-identification. In IEEE ICIG, 2011.
[28] R. Zhao, W. Ouyang, and X. Wang. Person re-identification by salience matching. In IEEE ICCV, 2013.
[29] R. Zhao, W. Ouyang, and X. Wang. Unsupervised salience learning for person re-identification. In IEEE CVPR, 2013.
[30] R. Zhao, W. Ouyang, and X. Wang. Learning mid-level filters for person re-identification. In IEEE CVPR, 2014.
[31] L. Zheng, L. Shen, L. Tian, S. Wang, J. Wang, J. Bu, and Q. Tian. Scalable person re-identification: A benchmark. In IEEE ICCV, 2015.
[32] W.-S. Zheng, S. Gong, and T. Xiang. Person re-identification by probabilistic relative distance comparison. In IEEE CVPR, 2011.
[33] W.-S. Zheng, S. Gong, and T. Xiang. Reidentification by relative distance comparison. IEEE TPAMI, 35(3):653–668, 2013.
[34] W.-S. Zheng, S. Gong, and T. Xiang. Towards open-world person re-identification by one-shot group-based verification. IEEE TPAMI, PP(99):1–1, 2015.