Learning Multiple Kernel Metrics for Iterative Person

Re-Identification

HUSHENG DONG, Soochow University, China and Suzhou Institute of Trade and Commerce, China

PING LU, Suzhou Institute of Trade and Commerce, China

CHUNPING LIU, Soochow University, China, Jilin University, China, and Collaborative Innovation

Center of Novel Software Technology and Industrialization, China

YI JI, Soochow University, China

SHENGRONG GONG, Changshu Institute of Science and Technology, China, Soochow University,

China, and Beijing Jiaotong University, China

In person re-identification, most metric learning methods learn from the training data only once and are then deployed for testing. Although impressive performance has been achieved, the discriminative information from successfully identified test samples is ignored. In this work, we present a novel re-identification framework termed Iterative Multiple Kernel Metric Learning (IMKML). Specifically, there are two main modules in IMKML. In the first module, multiple metrics are learned via a newly derived Kernel Marginal Nullspace Learning (KMNL) algorithm. By learning a discriminative nullspace from the neighborhood manifold, KMNL effectively tackles the Small Sample Size (SSS) problem in re-identification distance metric learning. The second module constructs a pseudo training set by performing re-identification on the testing set. The pseudo training set, which consists of test image pairs that are highly probable correct matches, is then inserted into the labeled training set to retrain the metrics. By iteratively alternating between the two modules, many more samples are involved in training and significant performance gains can be achieved. Experiments on four challenging datasets, including VIPeR, PRID450S, CUHK01, and Market-1501, show that the proposed method performs favorably against state-of-the-art approaches, especially at the lower ranks.

CCS Concepts: • Computing methodologies → Computer vision tasks; Kernel methods;

Additional Key Words and Phrases: Person re-identification, kernel method, nullspace, metric learning

This work was partially supported by National Natural Science Foundation of China (NSFC Grants No. 61170124, No. 61272258, No. 61301299, No. 61272005, No. 61572085, No. 61702055), Provincial Natural Science Foundation of Jiangsu (Grants No. BK20151254, No. BK20151260), Six Talents Peaks Project of Jiangsu (DZXX-027), Key Laboratory of Symbolic Computation and Knowledge Engineering of Ministry of Education, Jilin University (Grant No. 93K172016K08), and Collaborative Innovation Center of Novel Software Technology and Industrialization.

Authors’ addresses: H. Dong, Soochow University, School of Computer Science and Technology, Suzhou 215006, China and Suzhou Institute of Trade and Commerce, Suzhou 215009, China; email: [email protected]; P. Lu, Suzhou Institute of Trade and Commerce, Suzhou 215009, China; email: [email protected]; C. Liu, Soochow University, School of Computer Science and Technology, Suzhou, China; Jilin University, Key Laboratory of Symbolic Computation and Knowledge Engineering of Ministry of Education, Changchun 130012, China; and Collaborative Innovation Center of Novel Software Technology and Industrialization, Nanjing 210046, China; email: [email protected]; Y. Ji, Soochow University, School of Computer Science and Technology, Suzhou, China; email: [email protected]; S. Gong, Changshu Institute of Science and Technology, Changshu 215500, China; Soochow University, School of Computer Science and Technology, Suzhou, China; and Beijing Jiaotong University, School of Computer and Information Technology, Beijing, China; email: [email protected].

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

© 2018 ACM 1551-6857/2018/08-ART78 $15.00

https://doi.org/10.1145/3234929

ACM Trans. Multimedia Comput. Commun. Appl., Vol. 14, No. 3, Article 78. Publication date: August 2018.


78:2 H. Dong et al.

ACM Reference format:

Husheng Dong, Ping Lu, Chunping Liu, Yi Ji, and Shengrong Gong. 2018. Learning Multiple Kernel Metrics for Iterative Person Re-Identification. ACM Trans. Multimedia Comput. Commun. Appl. 14, 3, Article 78 (August 2018), 24 pages. https://doi.org/10.1145/3234929

1 INTRODUCTION

Person re-identification aims at matching an individual observed by multiple cameras with non-overlapping fields of view. It serves as a fundamental step in many surveillance applications, such as cross-camera action analysis [33], multi-person association [49], and multi-camera pedestrian tracking [16]. However, person re-identification is an inherently challenging task. Due to the large variation in viewing conditions arising from viewpoint, illumination, pose, and occlusion, the appearance of one person may change significantly across cameras. Moreover, there may be many passers-by wearing similar clothes in one camera view, which makes the re-identification task more difficult.

Current literature generally takes person re-identification as a fine-grained retrieval problem. Given one pedestrian image as the probe, the objective is to rank the gallery images according to their distances (or similarities) to the probe. To address the visual appearance challenges, existing approaches either focus on designing robust feature representations [11, 20, 34, 37, 47] or emphasize learning discriminative metrics [19, 23, 26, 45, 65]. Both try to bring positive image pairs (images containing the same person) closer while pushing negative ones (images of different persons) far apart. Among them, metric learning based methods are more prevalent and have shown great success.

Although impressive performance has been achieved, there are still two limitations of metric learning based models. The first is that most of them suffer heavily from the classical Small Sample Size (SSS) problem. Specifically, to capture rich person appearance, the feature dimension in current re-identification works usually amounts to thousands or even tens of thousands. In contrast, most re-identification datasets are relatively small, normally with only hundreds of samples (e.g., VIPeR [14] and PRID450S [40]). As a result, the sample size is much smaller than the feature dimension, which leads to a singular within-class scatter matrix. Consequently, many metric learning algorithms that rely on its inverse will run into numerical problems, such as Local Fisher Discriminant Analysis (LFDA) [44] and Keep It Simple and Straightforward MEtric (KISSME) [19]. Although techniques like Principal Component Analysis (PCA) and matrix regularization can alleviate the SSS problem, they may in turn make the learned metric sub-optimal and less discriminative [55, 64].
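To make the singularity concrete, here is a toy NumPy check; the sizes are arbitrary illustrations, much smaller than real re-identification features:

```python
import numpy as np

# With fewer samples (N) than feature dimensions (d), the scatter matrix
# is rank-deficient, so metrics that need its inverse run into trouble.
rng = np.random.default_rng(0)
N, d = 50, 200                      # toy sizes for illustration only
X = rng.normal(size=(N, d))
Xc = X - X.mean(axis=0)             # center the samples
S = Xc.T @ Xc                       # d x d scatter matrix
rank = np.linalg.matrix_rank(S)     # at most N - 1, hence singular since N < d
```

Any criterion that inverts such a scatter matrix must either regularize it or avoid the inverse altogether, which motivates the nullspace approach of Section 3.2.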

The second limitation is that most metric learning methods learn from the training data only once and are then deployed on the testing set. Therefore, the discriminative information from successfully identified test image pairs is ignored. However, in a real-world scenario, such pairs can be inserted into the labeled training set in an online manner. With gradually increased training data, the learned metric becomes more and more discriminative. Besides, in manual operating procedures the top-ranked gallery images are usually checked by multiple human experts, and the candidates agreed upon by most experts are considered correct matches with high confidence. Unfortunately, few works have yet mimicked these characteristics.

To address the above issues, we present a novel person re-identification framework named Iterative Multiple Kernel Metric Learning (IMKML). Different from previous approaches, re-identification in IMKML is performed iteratively by learning multiple kernel metrics from not only the labeled training data but also some image pairs picked out from the testing set. To


Fig. 1. Schematic illustration of the proposed IMKML person re-identification framework. The training data is augmented by a pseudo training set of highly probable matched test pairs, and multiple kernel metrics are learned from them. The re-identification procedure is performed iteratively on the testing set.

guarantee the selected pairs are highly probable correct matches, they should obtain rank-1 consensus and meet a distance gap under all learned metrics. Taking advantage of learning the nullspace of training samples [55, 64], we derive a Kernel Marginal Nullspace Learning (KMNL) algorithm to learn the metrics while bypassing the SSS problem. By iteratively alternating between learning the metrics and picking out test pairs to augment the training set, much more discriminative metrics will be obtained, and therefore significant re-identification performance improvement can be gained. A schematic illustration of the proposed IMKML framework is shown in Figure 1.

The main contributions of this work are three-fold:

(1) We propose an iterative person re-identification framework via learning multiple kernel metrics. The framework can exploit the discriminative information from not only the training data but also successfully matched test sample pairs.

(2) We introduce an effective pseudo training set construction method that picks out highly confident positive test pairs to augment the training set.

(3) We derive a kernel nullspace learning algorithm for matching cross-view pedestrian images. By learning the nullspace of training samples, the method can well address the SSS problem in re-identification distance metric learning.

We conduct extensive experiments to validate the efficacy of the proposed method. Experimental results demonstrate that it achieves more favorable performance than state-of-the-art approaches.

The rest of this article is organized as follows. Section 2 briefly reviews the related work. Section 3 details the proposed method. In Section 4, the experimental results are presented. In Section 5, we conduct some analysis of the proposed method. Finally, we draw conclusions in Section 6.

2 RELATED WORK

There exists extensive work to tackle the person re-identification problem after years of research, as recently surveyed in [13, 48, 63]. Here, we briefly review the most related works from the aspects of feature representation, metric learning, and re-ranking.


2.1 Feature Representation in Person Re-Identification

Various feature representations have been designed to describe pedestrian appearance under varying conditions. To implement accurate part-wise matching, earlier works generally extracted features from part-based models, such as Ensemble of Local Features (ELF) [14], Symmetry Driven Accumulation of Local Features (SDALF) [11], Custom Pictorial Structures (CPS) [8], and unsupervised saliency learning based matching [59].

Recently, some new descriptors with impressive performance have greatly advanced the research of person re-identification. For example, the Biologically Inspired Features (BIF) [34] obtained good robustness against illumination changes and background clutter. The Salient Color Names (SCN) [53] obtained a more discriminative feature than the color histogram by computing the pixel value distribution over some standard colors. The Weighted Histograms of Overlapping Stripes (WHOS) [28] well captured the holistic appearance from local stripes. By applying a max pooling operation to local features, the Local Maximal Occurrence (LOMO) [25] obtained a representation stable against viewpoint changes. Inspired by LOMO, the Gaussian of Gaussian (GOG) [37] described a pedestrian image as a set of hierarchical Gaussian distributions and achieved better discrimination.

2.2 Metric Learning in Person Re-Identification

A number of metric learning approaches have been devised to measure cross-view pairwise distances, and they have dominated the re-identification literature. Among them, the Large Margin Nearest Neighbor (LMNN) [50] is one of the most representative methods; it tries to learn a metric that separates positive sample pairs from negative ones by a large margin. From a ranking view, the metric in [65] was learned from relative distance comparison. Without heavy optimization, the KISSME algorithm [19] derived a closed-form metric that can be computed very efficiently. In [25], an extension of KISSME was proposed to learn a more discriminative metric along with a subspace for dimension reduction. To address the problem of imbalanced training pairs, the metric in [26] was learned from a logistic formulation with an asymmetric weighting strategy. In [45], the metric was integrated with some latent variables to tackle the cross-view misalignment problem.

To deal with the highly non-linear patterns of pedestrian appearance, the kernel trick is commonly employed, as in Kernel Local Fisher Discriminant Analysis (KLFDA) [44], Null Foley-Sammon Transformation (NFST) [55], and Kernel Canonical Correlation Analysis (KCCA) [27, 29]. There are also some works that solve person re-identification with deep neural networks [6, 57, 58, 66]. All of them can be viewed as non-linear mapping based metric learning models.

In practice, most of the above methods suffer heavily from the SSS problem. This is especially true for deep models, which typically require a large number of training samples. Besides, the discriminative information from successfully identified test pairs is not exploited, as these methods learn from the training set only once. In this work, we address the SSS problem by extending Kernel Marginal Fisher Analysis (KMFA) [51] to nullspace learning. Compared to NFST [55], which shares the same idea of collapsing within-class samples to a single point, our method has the advantage of learning from the local neighborhood, and no assumptions on the data distribution are required. Our work also differs from [18], which leveraged local neighborhood topology to adapt the results of linear Support Vector Machines (SVMs), as we learn a discriminative nullspace instead. To utilize the discriminative information of correctly matched test samples, we iteratively pick them out to augment the training data, so that much more discriminative metrics can be obtained progressively. Although there are multiple metrics to be learned in our IMKML framework, it is quite different from the works in [17, 35, 36, 52]: they focus on learning the best weighting strategy to fuse multiple metrics, while our work tries to mimic human experts to pool the decisions of different kernels.


2.3 Re-Ranking for Person Re-Identification

In visual object retrieval, it has been a common practice to post-optimize the initial ranking result [38]. Borrowing the same idea, a new trend of re-ranking the ranked gallery images has arisen in re-identification. Liu et al. [30] first investigated the post-optimization of the ranking list by exploiting human-in-the-loop feedback. García et al. [12] introduced an unsupervised post-ranking framework by exploiting discriminative contextual information. Ye et al. [54] optimized the initial ranking list by aggregating both similarity and dissimilarity cues. Inspired by exploiting the sparse contextual activation for visual re-ranking [1], Zhong et al. [67] proposed to re-rank the matching result by fusing the original Mahalanobis distance with a Jaccard distance computed from k-reciprocal nearest neighbors. Based on the idea of applying a diffusion process to suppress irrelevant instances [3, 10], Bai et al. [2] proposed to boost re-identification accuracy by learning affinities on a supervised smoothed manifold. Different from these approaches, there are multiple ranking lists for each probe in our work, and they are obtained by performing re-identification iteratively rather than by post-optimization. The ranking lists in our work are then used for pooling decisions to construct a pseudo training set. Besides, our method does not require human intervention or additional visual expansion, which are rather time-consuming [12, 30].

3 THE APPROACH

3.1 System Overview

As shown in Figure 1, the proposed IMKML framework is a cyclic system which can be divided into two main modules: (1) learning multiple kernel metrics, and (2) constructing a pseudo training set.

The first module resembles a common re-identification distance metric learning procedure and defines the basis of IMKML. Let $\mathcal{T} = \{X, y\}$ be the training set, where $X = (x_1, x_2, \ldots, x_N) \in \mathbb{R}^{d \times N}$ is the feature matrix of $N$ images in the $d$-dimensional space, and $y \in \mathbb{R}^N$ is the corresponding identity vector. If $x_i$ and $x_j$ share the same identity (i.e., $y_i = y_j$), then $(x_i, x_j)$ is a positive pair; otherwise it is a negative pair. In person re-identification, $d$ is usually very high, and this incurs the typical SSS problem in metric learning. The objective of the first module is to learn multiple discriminative subspaces $\{W_m\}_{m=1}^{M}$ with KMNL (Section 3.2), such that the positive pairs of each pedestrian are separated from negative ones. Meanwhile, the SSS problem is bypassed.

The purpose of the second module is to perform re-identification on the testing set $\mathcal{T}'$ and construct a pseudo training set of highly confident correct pairs. The pseudo training set is then added to the labeled training set to retrain the kernel metrics, so as to enhance their discrimination. With the obtained metrics, multiple ranking lists $\{R_m\}_{m=1}^{M}$ are computed for each probe image. If one gallery image appears at rank-1 in all the ranking lists of a probe, then the pair is selected as one item of a candidate set. To filter out some visually similar but wrongly matched pairs, a distance gap constraint is further applied to refine the candidate set. As shown in Figure 2, we only keep the pairs whose distance gap between Probe-Rank2 and Probe-Rank1 is larger than a specified threshold; all others are discarded. This follows the idea of LMNN [50] of setting a margin between positive pairs and negative impostors.

By iteratively alternating between the two modules, much more discriminative metrics will be obtained due to the augmentation of training data. The re-identification on the testing set is carried out iteratively until the pseudo training set no longer changes, and the final ranking is obtained by averaging the distances from all metrics.
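The pair-selection rule above (rank-1 consensus under every metric, plus a distance-gap margin) can be sketched in a few lines. This is an illustrative NumPy sketch with our own function and parameter names (`build_pseudo_set`, `gap`), not the exact implementation:

```python
import numpy as np

def build_pseudo_set(dist_list, gap=0.1):
    """Select highly confident probe-gallery pairs.

    dist_list: list of M arrays of shape (n_probe, n_gallery),
               one cross-view distance matrix per learned metric.
    A pair (p, g) is kept only if g is ranked first for probe p
    under every metric (rank-1 consensus) and the gap between the
    rank-2 and rank-1 distances exceeds `gap` under every metric.
    """
    pseudo = []
    n_probe = dist_list[0].shape[0]
    for p in range(n_probe):
        top1 = [np.argsort(d[p])[0] for d in dist_list]
        if len(set(top1)) != 1:          # no rank-1 consensus
            continue
        g = top1[0]
        gaps = [np.sort(d[p])[1] - np.sort(d[p])[0] for d in dist_list]
        if min(gaps) > gap:              # distance-gap constraint
            pseudo.append((p, g))
    return pseudo
```

The returned pairs, treated as newly labeled positives, would then be appended to the training set before the metrics are retrained.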


Fig. 2. Illustration of the Probe-Rank1 and Probe-Rank2 distances for each pair in the candidate set.

3.2 Kernel Marginal Nullspace Learning

Following the idea of learning a nullspace of the training samples for matching cross-view pairs [56, 64], we aim to map the samples into a discriminative subspace such that each class is shrunk to a single point. The advantage is that the SSS problem can be addressed effectively in this way. We achieve this by extending KMFA [51] to nullspace learning while maintaining some of its superiorities, such as the kernel trick, no assumptions on the data distribution, and learning from the local manifold.

Let $\phi: \mathbb{R}^d \to F$ be a nonlinear mapping from the $d$-dimensional space to an implicit space $F$ of much higher dimension, and let $k(x, z) = \phi(x)^\top \phi(z)$ be a kernel function. We consider dimensionality reduction in the kernel space by assuming that each projection direction can be linearly represented by the training samples in $F$ [41]. Similar to KMFA, an intrinsic graph $G_w = \{\phi(X), A_w\}$ and a penalty graph $G_b = \{\phi(X), A_b\}$ are first defined to enforce neighborhood constraints in KMNL, where $\phi(X) = (\phi(x_1), \ldots, \phi(x_N))$ is the vertex set of mapped samples, and $A_w \in \mathbb{R}^{N \times N}$ and $A_b \in \mathbb{R}^{N \times N}$ are two symmetric K-Nearest Neighbor (KNN) connectivity matrices defined as follows:

$$A_w(i, j) = \begin{cases} 1, & \text{if } i \in N_w(j) \text{ or } j \in N_w(i) \\ 0, & \text{otherwise,} \end{cases} \tag{1}$$

$$A_b(i, j) = \begin{cases} 1, & \text{if } i \in N_b(y_j) \text{ or } j \in N_b(y_i) \\ 0, & \text{otherwise,} \end{cases} \tag{2}$$

where $N_w(i)$ is the index set of the $N_w$ within-class nearest neighbors of sample $x_i$, while $N_b(y_i)$ is the index set consisting of the $N_b$ nearest neighbors from classes different from $y_i$.

To obtain $A_w$ and $A_b$, the Euclidean distance is chosen to compute the distance between each vertex pair $(\phi(x_i), \phi(x_j))$. Although it is impossible to compute these distances explicitly due to the implicit mapping $\phi$, we can take advantage of the kernel trick. That is,

$$d^2(\phi(x_i), \phi(x_j)) = \|\phi(x_i) - \phi(x_j)\|_2^2 = K_{ii} + K_{jj} - 2K_{ij}, \tag{3}$$

where $K \in \mathbb{R}^{N \times N}$ is the kernel matrix with $K_{ij} = k(x_i, x_j)$. The objective function of KMFA is

$$\min_{\beta} \; \frac{\beta^\top \phi(X) L_w \phi(X)^\top \beta}{\beta^\top \phi(X) L_b \phi(X)^\top \beta}, \tag{4}$$

where $\beta$ is one of the projection directions to be learned, and $L_w$ and $L_b$ are the Laplacian matrices of $G_w$ and $G_b$, defined as

$$L_s = D_s - A_s, \quad D_s(i, i) = \sum_{j \neq i} A_s(i, j), \quad \forall i, \; s \in \{w, b\}. \tag{5}$$
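Equations (1)–(5) translate directly into code. The following NumPy sketch (function names are our own) computes feature-space distances with the kernel trick of Equation (3), builds the symmetric connectivity matrices, and forms the graph Laplacians:

```python
import numpy as np

def kernel_distances(K):
    """Squared feature-space distances from a kernel matrix (Eq. 3)."""
    diag = np.diag(K)
    return diag[:, None] + diag[None, :] - 2.0 * K

def knn_graphs(K, y, n_w=3, n_b=3):
    """Intrinsic (within-class) and penalty (between-class) KNN graphs."""
    D = kernel_distances(K)
    N = len(y)
    Aw = np.zeros((N, N))
    Ab = np.zeros((N, N))
    for i in range(N):
        same = np.where((y == y[i]) & (np.arange(N) != i))[0]
        diff = np.where(y != y[i])[0]
        for j in same[np.argsort(D[i, same])][:n_w]:
            Aw[i, j] = Aw[j, i] = 1.0    # symmetric, Eq. (1)
        for j in diff[np.argsort(D[i, diff])][:n_b]:
            Ab[i, j] = Ab[j, i] = 1.0    # symmetric, Eq. (2)
    return Aw, Ab

def laplacian(A):
    """Graph Laplacian L = D - A of Eq. (5)."""
    return np.diag(A.sum(axis=1)) - A
```

With a linear kernel $K = X^\top X$, `kernel_distances` reproduces the ordinary squared Euclidean distances, which is a convenient sanity check.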


Note that $D_w$ and $D_b$ are two diagonal matrices here. Let $S_w = \phi(X) L_w \phi(X)^\top$ be the within-class scatter matrix, $S_b = \phi(X) L_b \phi(X)^\top$ be the between-class scatter matrix, and $\beta = \sum_{i=1}^{N} \phi(x_i) \alpha_i = \phi(X) \alpha$ ($\alpha \in \mathbb{R}^N$); then Equation (4) can be rewritten as

$$\min_{\alpha} \; \frac{\alpha^\top S'_w \alpha}{\alpha^\top S'_b \alpha}, \tag{6}$$

where $S'_w = \phi(X)^\top S_w \phi(X) = K L_w K$ and $S'_b = \phi(X)^\top S_b \phi(X) = K L_b K$. The optimal $\alpha$ in Equation (6) can be obtained by solving the generalized Eigen problem $S'_w \alpha = \lambda S'_b \alpha$ [51]. However, in the case of small sample size this may run into numerical problems due to a singular $S_w$. To address this, KMFA has to apply matrix regularization to $S_w$, but this may lead to the degenerate eigenvalue problem [64].
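For reference, the regularized KMFA baseline can be sketched with SciPy's generalized symmetric eigensolver; the ridge term `eps` below is our own illustrative regularizer, precisely the kind of fix that may degrade the learned metric:

```python
import numpy as np
from scipy.linalg import eigh

def kmfa_directions(K, Lw, Lb, n_dirs=2, eps=1e-6):
    """Solve S'_w alpha = lambda S'_b alpha with ridge regularization.

    Minimizing Eq. (6) corresponds to the generalized eigenvectors
    with the smallest eigenvalues; `eps * I` keeps S'_w and S'_b
    positive definite when the sample size is small.
    """
    N = K.shape[0]
    Sw = K @ Lw @ K + eps * np.eye(N)
    Sb = K @ Lb @ K + eps * np.eye(N)
    vals, vecs = eigh(Sw, Sb)       # ascending generalized eigenvalues
    return vals[:n_dirs], vecs[:, :n_dirs]
```

The nullspace derivation that follows avoids this regularization entirely by never inverting $S_w$.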

Here, we propose to learn the nullspace of $S_w$ to avoid the computation of its inverse. As a result, the SSS problem can be bypassed. In such a nullspace, the nearest samples of the same class are collapsed to a single point. This leads to an optimal metric for matching cross-view image pairs if the between-class distances are nonzero. For one projection direction $w$, our objective is

$$\max_{w} \; w^\top S_b w \quad s.t. \quad w^\top S_w w = 0, \; w^\top S_b w > 0. \tag{7}$$

For convenience of deriving the optimal $w$, we denote the nullspace of a matrix $A$ as $N(A)$, that is, $N(A) = \{x \mid Ax = 0\}$, and let $\overline{N}(A)$ represent the orthogonal complement of $N(A)$ in the following. Besides, we define the total scatter matrix of the samples in space $F$ as

$$S_t = \sum_{i=1}^{N} (\phi(x_i) - \mu)(\phi(x_i) - \mu)^\top = \phi(X)(I - M)(I - M)^\top \phi(X)^\top, \tag{8}$$

where $\mu = \frac{1}{N} \sum_{i=1}^{N} \phi(x_i)$ is the mean vector of all mapped samples, $I$ is the $N \times N$ identity matrix, and $M$ is an $N \times N$ matrix with all entries equal to $1/N$.

Lemma 3.1. Let $W = (w_1, \ldots, w_r)$ be a projection matrix in which $\{w_i\}_{i=1}^{r}$ are obtained from Equation (7). Assuming the projection by $W$ preserves the similarities between each sample pair in $F$, then the within-class scatter matrix $\tilde{S}_w$ of the projected samples is a zero matrix.

Proof. Note that for every PSD matrix $A$ (i.e., $A \succeq 0$), we have $x^\top A x = 0$ iff $Ax = 0$ [64]. According to the definition of $S_w$, we know that $S_w \succeq 0$. Because each projection direction satisfies $w^\top S_w w = 0$, we have $W^\top S_w W = 0$, and this gives $S_w W = 0$. Let $Y = W^\top \phi(X)$ be the projected samples. Based on the similarity-preserving assumption, the Laplacian matrix in the transformed space is the same as $L_w$. According to the definition of the within-class scatter matrix, we have $\tilde{S}_w = Y L_w Y^\top = W^\top S_w W = 0$. $\square$

Lemma 3.2. Let ϕ (X) = (ϕ (x i ) − μ, . . . ,ϕ (xN ) − μ) be the centered samples in space F, then the

optimal w of Equation (7) lies in the space of N (St ) ∩ N (Sw ) ∩ N (Sb ).

Proof. It is obvious that w ∈ N (Sw ), so we only prove w ∈ N (St ) and w ∈ N (Sb ).

With all samples centered, we have St = ϕ (X)ϕ (X)�, Sw = ϕ (X)Lwϕ (X)�, and Sb =

ϕ (X)Lbϕ (X)�; all of them are PSD matrices. If w ∈ N (St ), then Stw = 0, and this gives w�Stw =

w�ϕ (X)ϕ (X)�w = 0, thus ϕ (X)�w = 0. This further gives w�Sww = 0 and w�Sbw = 0 simulta-

neously, which does not satisfy the constraints in Equation (7). Whereas, w ∈ N (St ) can ensure

w�Sbw > 0 because St � 0. This proves that w ∈ N (St ).Similarly, ifw ∈ N (Sb ), we havew�Sb w = 0, which does not meet the demand. On the contrary,

w ∈ N (Sb ) can guaranteew�Sb w > 0 always holds due to the PSD property of Sb . Thus, the desired

optimal w lies in N (Sb ).

ACM Trans. Multimedia Comput. Commun. Appl., Vol. 14, No. 3, Article 78. Publication date: August 2018.


Because w ∈ N̄(St), w ∈ N(Sw), and w ∈ N̄(Sb) hold simultaneously, the optimal projection w lies in N̄(St) ∩ N(Sw) ∩ N̄(Sb). □

With Lemmas 3.1 and 3.2, we know where the projections w_1, ..., w_r lie. Although Lemma 3.1 relies on the assumption that Lw is unchanged after projection, this assumption is reasonable: the low-dimensional representations are supposed to preserve the discrimination of the original samples.

In the following, we show how to enforce w ∈ N̄(St), w ∈ N(Sw), and w ∈ N̄(Sb) one by one.

The task of making w ∈ N̄(St) is identical to finding the basis of the range of St. By applying eigendecomposition to St with the Kernel Principal Component Analysis (KPCA) method [41], the eigenvectors corresponding to the nonzero eigenvalues are exactly what we want. From [41] we know that the desired eigenvectors can be represented as

ξ = Σ_{i=1}^{N} (ϕ(x_i) − μ) u_i = ϕ(X)(I − M)u,    (9)

where u ∈ R^N is a real vector. KPCA then reduces to solving the eigen-equation K̃u = λu, where K̃ = (I − M)⊤K(I − M). Let Λ ∈ R^{n×n} be a diagonal matrix composed of the n descendingly sorted nonzero eigenvalues of K̃, and let U ∈ R^{N×n} be the matrix of the corresponding eigenvectors. We then compute Ũ = UΛ^{−1/2} to obtain the normalized ξ, which now are

ξ_i = ϕ(X)(I − M)Ũ_{:,i},  i = 1, ..., n.    (10)

Here, Ũ_{:,i} is the ith column of Ũ. Each ξ_i is one projection direction that lies in N̄(St).
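This KPCA step can be sketched in a few lines of NumPy. This is an illustrative sketch, not the authors' code; it assumes, as in standard kernel centering, that M is the N×N matrix with all entries 1/N:

```python
import numpy as np

def kpca_basis(K, tol=1e-10):
    """Eigendecompose the centered kernel (I-M)K(I-M) and return the
    normalized coefficient matrix of Eq. (10), i.e. U Lambda^{-1/2}."""
    N = K.shape[0]
    C = np.eye(N) - np.full((N, N), 1.0 / N)   # I - M, with M_ij = 1/N
    lam, U = np.linalg.eigh(C @ K @ C)         # C is symmetric: C.T K C = C K C
    order = np.argsort(lam)[::-1]              # descending eigenvalues
    lam, U = lam[order], U[:, order]
    keep = lam > tol * lam[0]                  # drop numerically zero eigenvalues
    return U[:, keep] / np.sqrt(lam[keep]), C

# Check on a toy linear kernel: the implicit directions
# xi_i = phi(X)(I-M)U_tilde[:, i] come out orthonormal.
rng = np.random.default_rng(1)
X = rng.standard_normal((5, 8))                # 5-dim features, N = 8 samples
K = X.T @ X
U_t, C = kpca_basis(K)
G = U_t.T @ (C @ K @ C) @ U_t                  # Gram matrix of the xi_i
assert np.allclose(G, np.eye(U_t.shape[1]), atol=1e-8)
```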

Because {ξ_i}_{i=1}^n are basis vectors of N̄(St), any projection direction w that lies in N̄(St) ∩ N(Sw) can be linearly represented as

w = Σ_{i=1}^{n} ξ_i v_i = ϕ(X)(I − M)Ũv,    (11)

where v ∈ R^n is a real vector. Substituting Equation (11) into the constraint w⊤Sw w = 0 in Equation (7), we obtain v⊤S̃w v = 0, where

S̃w = Ũ⊤(I − M)⊤ S′w (I − M)Ũ.    (12)

This means v lies in the nullspace of S̃w. So, we apply eigendecomposition to S̃w and choose the eigenvectors corresponding to the zero eigenvalues to construct a matrix V. Then, the projection directions that lie in N̄(St) ∩ N(Sw) are

η_i = ϕ(X)(I − M)ŨV_{:,i},  i = 1, ..., s.    (13)

Similarly, V_{:,i} denotes the ith column of V here.

It is worth noting that η_i (i = 1, ..., s) also form basis vectors of N̄(St) ∩ N(Sw), so any w ∈ N̄(St) ∩ N(Sw) ∩ N̄(Sb) can be represented as w = Σ_{i=1}^{s} η_i h_i = ϕ(X)(I − M)ŨVh. By plugging this into Equation (7), we obtain

max_h h⊤S̃b h   s.t.  h⊤S̃b h > 0,    (14)

where

S̃b = V⊤Ũ⊤(I − M)⊤ S′b (I − M)ŨV.    (15)

Note that S̃b is a PSD matrix, and the constraint w⊤Sw w = 0 is ignored here because it has already been satisfied. By applying eigendecomposition to S̃b, the obtained eigenvectors corresponding


to the descendingly sorted nonzero eigenvalues are just the optimal h, and we construct a matrix H with them.

With the obtained Ũ, V, and H, the columns of ϕ(X)(I − M)ŨVH are the optimal projection directions of Equation (7). Let r be the chosen subspace dimension; then our final projection matrix can be represented as

W = (ϕ(X)(I − M)ŨVH)_{:,1:r},    (16)

where (·)_{:,1:r} denotes the first r columns of a matrix.

A summary of the whole KMNL procedure is given in Algorithm 1. If we use M different mappings, multiple metrics {Wm}_{m=1}^M will be obtained.

ALGORITHM 1: KMNL

Input: training set T = {X, y}.
Output: Ũ, V, H.
compute kernel matrix K;
compute Lw, Lb according to Equations (1), (2), and (5);
compute S′w = K Lw K and S′b = K Lb K;
compute K̃ = (I − M)⊤K(I − M);
obtain U by applying eigendecomposition to K̃;
compute Ũ = UΛ^{−1/2};
compute S̃w according to Equation (12);
obtain V by applying eigendecomposition to S̃w;
compute S̃b according to Equation (15);
obtain H by applying eigendecomposition to S̃b.
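Putting the steps together, Algorithm 1 can be sketched in NumPy. This is an illustrative re-implementation, not the authors' code: for simplicity, Lw and Lb below are Laplacians of fully connected within-/between-class graphs, whereas the paper builds them from neighborhoods via Equations (1), (2), and (5):

```python
import numpy as np

def kmnl(K, y, tol=1e-8):
    """Sketch of KMNL (Algorithm 1). Returns the expansion coefficients
    T with W = phi(X) @ T, i.e., Eq. (16) before truncation to r columns."""
    N = len(y)
    same = (y[:, None] == y[None, :]).astype(float)
    Aw, Ab = same - np.eye(N), 1.0 - same        # simplified graphs (see lead-in)
    Lw = np.diag(Aw.sum(1)) - Aw
    Lb = np.diag(Ab.sum(1)) - Ab
    Swp, Sbp = K @ Lw @ K, K @ Lb @ K            # S'_w = K Lw K, S'_b = K Lb K

    C = np.eye(N) - np.full((N, N), 1.0 / N)     # I - M (symmetric)
    lam, U = np.linalg.eigh(C @ K @ C)           # eigendecomposition of K~
    order = np.argsort(lam)[::-1]                # descending eigenvalues
    lam, U = lam[order], U[:, order]
    keep = lam > tol * lam[0]
    Ut = U[:, keep] / np.sqrt(lam[keep])         # U~ = U Lambda^{-1/2}

    Swt = Ut.T @ C @ Swp @ C @ Ut                # Equation (12)
    mu, V = np.linalg.eigh(Swt)
    V = V[:, np.abs(mu) < tol * max(1.0, np.abs(mu).max())]  # nullspace of S~w

    Sbt = V.T @ Ut.T @ C @ Sbp @ C @ Ut @ V      # Equation (15)
    nu, H = np.linalg.eigh(Sbt)
    H = H[:, nu > tol * max(1.0, nu.max())][:, ::-1]  # nonzero eigs, descending
    return C @ Ut @ V @ H                        # T = (I - M) U~ V H

# Toy check with a linear kernel: 4 identities, 3 samples each.
rng = np.random.default_rng(2)
y = np.repeat(np.arange(4), 3)
X = rng.standard_normal((20, 12))                # 20-dim features, N = 12
K = X.T @ X
T = kmnl(K, y)
Z = T.T @ K                                      # projected samples W^T phi(X)
for c in np.unique(y):
    # zero within-class scatter: each class collapses to one point
    assert np.allclose(Z[:, y == c].std(axis=1), 0, atol=1e-4)
assert Z.std(axis=1).min() > 1e-3                # classes remain separated
```

On this toy problem the learned directions collapse each class to a single point while keeping different classes apart, which is exactly the nullspace property established by Lemma 3.1.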

3.3 Pseudo Training Set Construction

Once the kernel metrics {Wm}_{m=1}^M are obtained, we compute the pairwise distances for all cross-view image pairs in the testing set. Under metric Wm, the distance between a probe image x_i^p and a gallery image x_j^g is computed as

d_m²(x_i^p, x_j^g) = ‖Wm⊤ϕ(x_i^p) − Wm⊤ϕ(x_j^g)‖_2² = ‖(T⊤K_{x_i^p} − T⊤K_{x_j^g})_{1:r}‖_2²,    (17)

where T = (I − M)ŨVH, and K_x = (k(x_1, x), ..., k(x_N, x))⊤ is the kernel mapping of sample x using the mth kernel function. For each x_i^p, we rank the gallery images according to the distances under all metrics. As shown in Figure 1, this leads to M ranking lists {Rm}_{m=1}^M, which are used for selecting candidate image pairs of a pseudo training set.

In a manual person re-identification process, if an image pair is judged by multiple experts to contain the same person, the two images are almost certainly a true match. Similarly, a probe image and its rank-1 gallery image form a pair with high confidence of being correctly matched, especially when this gallery image is ranked first by multiple metrics. So we select the gallery images obtaining rank-1 consensus under all metrics, together with their corresponding probes, to form a candidate set C:

C = {(x_i^p, x_j^g) | (x_i^p, x_j^g) ∈ T′, arg rank_j(Rm(i, j)) = 1, ∀i, m = 1, 2, ..., M},    (18)

where arg rank(Rm(i, j)) denotes the rank of a gallery image x_j^g in the mth ranking list of probe image x_i^p.
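Assuming each ranking list is represented by its probe-by-gallery distance matrix, the rank-1 consensus of Equation (18) reduces to an argmin agreement across metrics; a hypothetical sketch:

```python
import numpy as np

def rank1_consensus(dists):
    """dists: list of M probe-by-gallery distance matrices, one per metric.
    Returns (probe, gallery) index pairs whose gallery image is ranked
    first under every metric, as in Equation (18)."""
    top1 = np.stack([D.argmin(axis=1) for D in dists])   # (M, num_probe)
    agree = (top1 == top1[0]).all(axis=0)                # rank-1 consensus
    return [(int(i), int(top1[0, i])) for i in np.flatnonzero(agree)]

# Two metrics agree on probe 0 (gallery 2) but disagree on probe 1.
D1 = np.array([[0.9, 0.8, 0.1],
               [0.2, 0.5, 0.6]])
D2 = np.array([[0.7, 0.9, 0.2],
               [0.4, 0.1, 0.8]])
assert rank1_consensus([D1, D2]) == [(0, 2)]
```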


Fig. 3. (a) Examples of some probe images and their corresponding 15 top-ranked gallery images in the testing set; the correct matches are highlighted in yellow. (b) Ranks and corresponding distances between all probe-gallery image pairs in the training set; each line corresponds to one probe image.

In spite of obtaining rank-1 consensus from different metrics, there may still be some wrongly matched pairs in candidate set C due to similar appearances. Figure 3(a) shows some probe images with their 15 top-ranked gallery images in the testing set of the VIPeR [14] dataset; all the rank-1 gallery images are ranked first by three different kernel metrics. The first pair in each row is selected into the candidate set. It can be found that although some of them are visually similar, they are in fact captured from different persons (see the last two rows, for example). If there are too many such hard negative pairs, the discrimination of the re-trained metrics will be harmed. To filter out such negative pairs as much as possible, a refinement mechanism based on the distance gap between Probe-Rank2 and Probe-Rank1 is further employed.

Figure 3(b) shows the distances between 316 probe images and their corresponding 316 gallery images under the RBF kernel metric on the training set of the VIPeR dataset. Each line corresponds to the ranking list of a probe image. It is worth noting that all the rank-1 gallery images are correct matches. We can see that there is clearly a gap between the Probe-Rank2 and Probe-Rank1 distances. Inspired by this, we assume there should also be a distance gap on the testing set that separates the correctly matched rank-1 gallery images from the others. Therefore, image pairs in candidate set C that do not meet the distance gap should be discarded. As a result, our final pseudo training set P is

P = {(x_i^p, x_j^g) | (x_i^p, x_j^g) ∈ C, d_m²(x_i^p, x^g_{Rm(i,2)}) − d_m²(x_i^p, x_j^g) > Δm, ∀i, m = 1, 2, ..., M},    (19)

where x^g_{Rm(i,2)} is the gallery image ranked second in the mth ranking list of x_i^p, and Δm is the distance gap under the mth kernel metric. We stress that the pairs in P must meet the distance gap under all metrics, and pseudo labels are assigned to them once P is obtained.

Δm plays an important role in building the final pseudo training set, but choosing a proper value for it is nontrivial because we do not know the true labels of the testing samples. Here, we set it to the parameterized average pairwise distance of the test pairs, that is,

Δm = ρ · (1/N′) Σ_{(x_i^p, x_j^g) ∈ T′} d_m²(x_i^p, x_j^g),    (20)

where N′ is the number of cross-view image pairs in the testing set, and ρ ∈ (0, 1) is a hyperparameter to tune Δm. The influence of ρ will be discussed in Section 5.3.
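The distance-gap refinement of Equations (19) and (20) can be sketched the same way, with each metric represented by its probe-by-gallery distance matrix; Δm is then just ρ times the mean of the whole matrix. A hypothetical sketch:

```python
import numpy as np

def pseudo_pairs(dists, rho=0.2):
    """Rank-1 consensus (Eq. (18)) followed by the distance-gap test of
    Eq. (19), with Delta_m = rho * mean distance under metric m (Eq. (20))."""
    top1 = np.stack([D.argmin(axis=1) for D in dists])
    agree = (top1 == top1[0]).all(axis=0)
    kept = []
    for i in np.flatnonzero(agree):
        gaps_ok = all(
            np.sort(D[i])[1] - np.sort(D[i])[0] > rho * D.mean()
            for D in dists                       # gap must hold under EVERY metric
        )
        if gaps_ok:
            kept.append((int(i), int(top1[0, i])))
    return kept

D1 = np.array([[0.9, 0.8, 0.1],
               [0.2, 0.5, 0.6]])
D2 = np.array([[0.7, 0.9, 0.2],
               [0.4, 0.1, 0.8]])
assert pseudo_pairs([D1, D2], rho=0.2) == [(0, 2)]   # gap large enough
assert pseudo_pairs([D1, D2], rho=0.99) == []        # stricter gap rejects it
```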


Fig. 4. Example image pairs from the VIPeR, PRID450S, CUHK01, and Market-1501 datasets. Images in each column belong to the same person.

3.4 Iterative Metric Learning for Person Re-Identification

By iteratively alternating between learning the metrics and constructing the pseudo training set, more correctly identified test pairs will be involved in training, and this leads to much more discriminative metrics in return. Although there may still be a few wrongly matched pairs in the pseudo training set, they are the ones that passed the double tests of rank-1 consensus and the distance gap. Therefore, they can only be few in number and must have rather similar visual appearances; the harm they bring is affordable.

In practice, the complexity of computing a nonlinear kernel like RBF is O(d(N + |P|)²), where |P| is the cardinality of P. The total time cost will be considerable with M kernels over T iterations. To reduce the computation cost, the feature matrix X of training set T and the probe and gallery feature matrices Xp and Xg of testing set T′ are composited into a new matrix X̃, that is, X̃ = (X|Xp|Xg). We pre-compute the kernel matrices {Km}_{m=1}^M of X̃ using different kernel functions offline and cache them. As a result, there is no need to compute the kernel matrices {K^t_m}_{m=1}^M of T ∪ P^t in each iteration, where t is the iteration number. Instead, they can be quickly referenced from {Km}_{m=1}^M, and this saves a lot of time in experiments.

The iteration of metric learning and pseudo training set construction terminates when the pseudo training set no longer changes or the maximum iteration number T is reached. The final distance between a test pair (x_i^p, x_j^g) is computed by averaging the distances from all the metrics as

dis(x_i^p, x_j^g) = (1/M) Σ_{m=1}^{M} d_m²(x_i^p, x_j^g).    (21)

The obtained dis(x_i^p, x_j^g) is used to compute the final ranking list R*.
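The fused distance of Equation (21) and the final ranking then amount to an average and an argsort; a minimal sketch with toy distance matrices:

```python
import numpy as np

# dists: M per-metric distance matrices of shape (num_probe, num_gallery)
rng = np.random.default_rng(5)
dists = [rng.random((4, 6)) for _ in range(3)]       # M = 3 toy metrics

dis = np.mean(dists, axis=0)                         # Eq. (21): average over metrics
R_star = np.argsort(dis, axis=1)                     # final ranking list per probe

assert dis.shape == (4, 6)
assert (dis[0, R_star[0]] == np.sort(dis[0])).all()  # each row sorted by rank
```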

A summary of the proposed IMKML re-identification framework is given in Algorithm 2.

4 EXPERIMENTS

In this section, we first introduce the experimental settings. Then we evaluate the proposed method on four widely used person re-identification datasets, namely, VIPeR [14], PRID450S [40], CUHK01 [60], and Market-1501 [61]. Figure 4 shows some example image pairs from these datasets.

4.1 Experimental Settings

4.1.1 Feature Representation. The fusion of the widely used LOMO [25] feature and its stripe-based variant is used as the representation of pedestrian images. The LOMO feature has shown great robustness against viewpoint changes by concatenating the local maximal pattern of the joint HSV histogram and the SILTP descriptor. However, LOMO is not good at capturing holistic appearance information from larger regions due to its computation over dense blocks. To compensate for this, we fuse LOMO with its enhanced elementary features computed


ALGORITHM 2: IMKML

Input: training set T = {X, y}, testing set T′ = {Xp, Xg}, parameter ρ, maximum iteration number T.
Output: R*.
X̃ = (X|Xp|Xg); P^0 = ∅;
compute {Km}_{m=1}^M offline;
for t = 0, 1, ..., T do
    T^t = T ∪ P^t;
    for m = 1, 2, ..., M do
        reference K^t_m from Km according to T^t;
        compute Ũ, V, and H for K^t_m with KMNL;
        compute d_m²(x_i^p, x_j^g) for image pairs in T′;
        compute ranking list R^t_m;
        compute Δm according to Equation (20);
    end
    obtain candidate set C from {R^t_m}_{m=1}^M according to Equation (18);
    update P^{t+1} according to Equation (19);
    if P^{t+1} \ P^t = ∅ then
        break;
    end
end
compute dis(x_i^p, x_j^g) for image pairs in T′;
compute final ranking list R*.
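The kernel-caching trick of Section 3.4 amounts to slicing one big precomputed kernel matrix instead of re-evaluating kernel functions each iteration. A sketch with hypothetical indices and shapes:

```python
import numpy as np

rng = np.random.default_rng(3)
X_all = rng.standard_normal((10, 50))      # columns of (X | Xp | Xg)

# Offline: one big RBF kernel matrix over all samples, cached once.
sq = (X_all ** 2).sum(axis=0)
D2 = sq[:, None] + sq[None, :] - 2.0 * X_all.T @ X_all   # pairwise squared dists
K_big = np.exp(-D2 / (2.0 * np.median(D2)))

# Iteration t: the active set is the labeled training samples plus the
# current pseudo-pair samples -- slice the cached matrix, don't recompute.
train_idx = np.arange(20)
pseudo_idx = np.array([25, 41])            # hypothetical pseudo-pair indices
active = np.concatenate([train_idx, pseudo_idx])
K_t = K_big[np.ix_(active, active)]

# Sanity check: slicing agrees with direct evaluation on the subset.
Xa = X_all[:, active]
sqa = (Xa ** 2).sum(axis=0)
D2a = sqa[:, None] + sqa[None, :] - 2.0 * Xa.T @ Xa
assert np.allclose(K_t, np.exp(-D2a / (2.0 * np.median(D2))))
```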

from a pyramid space of overlapping stripes. Specifically, we first divide the foreground images into a number of horizontal stripes as in [28], and then compute the joint HSV histogram and SILTP from each stripe with the same settings as in LOMO. To further enhance the discrimination, we also extract the SCN descriptor [53] and the joint RGB histogram. Different from LOMO, there is no max-pooling operation in computing these features. Finally, we concatenate LOMO and the features computed from all stripes, obtaining a descriptor of 44,990 dimensions. In this way, the fine details from LOMO and the holistic appearance from the stripe-based features can both be utilized.

4.1.2 Kernels and Parameters. In consideration of computation time, the widely used Linear, RBF, and Polynomial kernels are chosen for learning the metrics in IMKML. Their functions are k(x, z) = x⊤z, k(x, z) = exp(−‖x − z‖²/(2σ²)), and k(x, z) = (a·x⊤z + c)^b, respectively. In experiments, the bandwidth σ of the RBF kernel is set to the mean pairwise distance of all training image pairs, and the parameters a, b, c of the Polynomial kernel are set to 1, 2, and 0.05, obtained via grid search. For KMNL, the parameters Nb and r are set to 12 and 100, obtained by cross-validation, and Nw is set to the minimal within-class number of training samples. The parameter ρ gating the distance gap in IMKML is tuned to 0.2, and the maximum iteration number T is set to 5.
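The three kernel functions and the automatic RBF bandwidth can be written compactly; a sketch using the reported Polynomial parameters (a = 1, b = 2, c = 0.05) as defaults and, for simplicity, taking the mean over all sample pairs including self-pairs:

```python
import numpy as np

def linear_kernel(X, Z):
    return X.T @ Z                                   # k(x, z) = x^T z

def rbf_kernel(X, Z, sigma):
    d2 = ((X[:, :, None] - Z[:, None, :]) ** 2).sum(axis=0)
    return np.exp(-d2 / (2.0 * sigma ** 2))          # k(x, z) = exp(-|x-z|^2/2s^2)

def poly_kernel(X, Z, a=1.0, b=2, c=0.05):
    return (a * (X.T @ Z) + c) ** b                  # k(x, z) = (a x^T z + c)^b

def mean_pairwise_distance(X):
    """Bandwidth heuristic for the RBF kernel: mean pairwise distance."""
    d2 = ((X[:, :, None] - X[:, None, :]) ** 2).sum(axis=0)
    return np.sqrt(d2).mean()

rng = np.random.default_rng(4)
X = rng.standard_normal((5, 8))                      # 5-dim features, 8 samples
sigma = mean_pairwise_distance(X)                    # automatic RBF bandwidth
K = rbf_kernel(X, X, sigma)
assert np.allclose(np.diag(K), 1.0)                  # k(x, x) = 1 for RBF
assert np.linalg.eigvalsh(K).min() > -1e-10          # kernel matrix is PSD
assert linear_kernel(X, X).shape == poly_kernel(X, X).shape == (8, 8)
```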

4.1.3 Evaluation Protocol. Following the evaluation protocol in [14], each dataset is first split into non-overlapping training and testing sets. In the testing set, all gallery images are matched with every image in the probe set, and then they are ranked according to the distances. The Cumulative Matching Characteristic (CMC) curve and mean Average Precision (mAP) are employed to measure the performance of the compared algorithms. For easier comparison with published results, the cumulative matching accuracies at selected top ranks are also listed in tables. Except for


Fig. 5. Performance comparison on the VIPeR dataset: (a) results of baselines with and without the IMKML framework; (b) IMKML (+KMNL) and the state-of-the-art approaches. The results are shown as rank-1 matching rates and CMC curves.

the fixed dataset split of Market-1501, 10 trials with random training/testing partition are run toaverage the results on the other three datasets.

4.2 Experiments on VIPeR

The VIPeR [14] is one of the most popular person re-identification datasets and has been widely used for performance evaluation of re-identification algorithms. It was captured in an outdoor environment with 632 persons involved, and each person has one pair of images observed from two different camera views. All images in this dataset are normalized to 128×48 pixels. Due to severe variations in illumination, viewpoint, and background, VIPeR is considered a rather challenging re-identification dataset. Following the commonly used partition protocol, the persons are divided into two groups of 316 persons each; one group is used for training and the other for testing.

4.2.1 Comparison with Baselines. It should be noted that our IMKML re-identification framework is not restricted to the proposed KMNL; other kernel metric learning algorithms can also be plugged in as the baseline model. To demonstrate the effectiveness of IMKML, we also evaluate it with NFST [55], KMFA [51], and KLFDA [44], three representative kernel metric learning algorithms. Figure 5(a) presents the re-identification results obtained by the baseline models with and without the IMKML framework. The results of the baseline models are all obtained using the RBF kernel, and their metrics are learned from the labeled training set only.

From the performance comparison, we can observe that IMKML obtains remarkable performance gains over the baseline models, especially on the lower ranks. The rank-1 re-identification rates of KMNL, NFST, KMFA, and KLFDA are 53.6%, 52.8%, 52.1%, and 51.6%, respectively, whereas 58.4%, 56.7%, 57.4%, and 56.7% are obtained after applying the IMKML framework, thus improving the baseline models by 4.8%, 3.9%, 5.3%, and 5.1%. The performance gains confirm that much more discriminative metrics can be learned by adding the successfully matched test pairs into the training set. With more discriminative metrics, the matching results improve as a consequence.

From Figure 5(a), we can also find that KMNL and NFST perform better than KMFA and KLFDA, which shows the superiority of learning the nullspace. Since our feature dimension is as high as 44,990, the latter two suffer the SSS problem heavily, whereas KMNL and NFST can well address


Table 1. Comparison of Cumulative Matching Rates (%) at Ranks 1, 5, 10, and 20 on the VIPeR Dataset

Method            Rank-1   Rank-5   Rank-10   Rank-20
IMKML              58.4     83.8     91.5      96.8
SpindleNet [57]    53.8     74.1     83.2      92.1
SSM [2]            53.7     -        91.5      96.1
SCSP [4]           53.5     82.6     91.5      96.7
NFST [55]          51.2     82.1     90.5      95.9
ME [39]            50.6     77.3     88.6      95.9
CRAFT [6]          50.3     80.0     89.6      95.5
GOG [37]           49.7     79.7     88.7      94.5
MPCNN [7]          47.8     74.7     84.8      91.1
MC-KCCA [27]       47.2     -        87.3      94.7
SSSVM [56]         42.7     74.2     84.3      91.9
KEPLER [36]        42.4     -        82.4      90.7
MLAPG [26]         40.7     69.9     82.3      92.4
XQDA [25]          40.0     68.1     80.5      91.1
MLF [60]           29.1     52.3     66.0      79.9
KISSME [19]        19.6     49.4     62.2      77.0
PRDC [65]          15.7     38.4     53.9      70.1

it. Compared to NFST, better performance is obtained by KMNL. We believe this is because KMNL has the advantage of learning from the neighborhood manifold.

4.2.2 Comparison to the State of the Art. We then compare the proposed method with a number of existing methods, including SpindleNet [57], SSM [2], CRAFT [6], SCSP [4], MPCNN [7], MC-KCCA [27], NFST [55], GOG [37], SSSVM [56], ME [39], MLAPG [26], XQDA [25], MLF [60], KEPLER [36], KISSME [19], and PRDC [65]. From the comparison results shown in Figure 5(b) and Table 1, it can be observed that our IMKML (using KMNL as the baseline) outperforms all previous methods at the most important rank-1. In particular, we achieve a matching accuracy of 58.4%, outperforming the second best, SpindleNet, by 4.6%. Note that SCSP fuses the scores of the global image and multiple local regions, and SSM applies post-ranking optimization to the initial results; they are still inferior to IMKML. Considering the inherent differences in boosting initial results, we believe that better performance can be achieved by integrating IMKML, SSM, and spatial constraints together. Compared to the deep-learning-based models, namely SpindleNet, CRAFT, and MPCNN, our IMKML is clearly superior. Due to limited training samples, their power is difficult to demonstrate fully on the VIPeR dataset.

4.2.3 Comparison with Re-Ranking Methods. Finally, we compare IMKML with some re-ranking methods, since it too improves the initial matching results of baseline models. The comparison results are shown in Table 2. For better inspecting the performance gains of IMKML, both the feature in Section 4.1.1 (denoted as eLOMO) and LOMO are considered. As in existing re-ranking methods, the matching accuracies before and after re-ranking are both reported at rank-1, while at other ranks only the re-ranked accuracies are presented.

From Table 2 we can find that DCIA+KCCA [12] achieved the best rank-1 matching accuracy of 63.9% after re-ranking, with an improvement of 21.8% over the initial result. However, the boosting effects of DCIA [12] were not stable: the improvements were 5.1% and 4.2% when


Table 2. Comparison with Re-Ranking Methods on the VIPeR Dataset

Method                  Rank-1       Rank-5   Rank-10   Rank-20
IMKML+KMNL (eLOMO)      53.6→58.4    83.8     91.5      96.8
IMKML+NFST (eLOMO)      52.8→56.7    81.6     90.5      96.6
IMKML+KMFA (eLOMO)      52.1→57.4    83.5     91.4      97.3
IMKML+KLFDA (eLOMO)     51.6→56.7    82.6     91.0      97.3
IMKML+KMNL (LOMO)       40.7→43.5    69.7     80.5      89.0
IMKML+NFST (LOMO)       40.1→43.0    68.7     79.4      88.2
IMKML+KMFA (LOMO)       39.6→41.6    69.6     80.2      89.4
IMKML+KLFDA (LOMO)      38.9→41.3    69.5     81.2      90.7
SSM+XQDA [2]            53.3→53.7    -        91.5      96.1
DCIA+KCCA [12]          42.1→63.9    78.5     87.5      -
DCIA+KISSME [12]        33.8→38.9    68.0     82.0      -
DCIA+LADF [12]          40.5→44.7    71.5     83.6      -
POP+RankSVM [30]        14.9→59.1    61.0     63.1      -
CCRR+KISSME [22]        20.0→22.0    49.0     69.0      -

using KISSME and LADF [24] as baselines. Although POP+RankSVM [30] reported the highest improvement of 44.2%, it requires users in the loop. For SSM [2] and CCRR [22], the improvements were not noticeable: only 0.4% and 2.0%. As there is no re-ranking procedure in IMKML, it is not sensitive to the baselines. The improvements are generally above 4% at rank-1 when our eLOMO is applied, and more than 2% with the LOMO feature. Although such a boosting effect is not as significant as POP+RankSVM or DCIA+KISSME, it needs no human intervention or visual expansion. It is worth noting that the visual expansion in DCIA and POP is rather time-consuming [12], while our IMKML is very efficient. This will be further discussed in Section 5.4.

4.3 Experiments on PRID450S

The PRID450S dataset [40] contains 450 persons observed from two disjoint camera views, and each person has one image in each view. This is also a challenging dataset due to serious viewpoint changes, partial occlusion, and background clutter. We scale the images to 128×64 pixels, and select half of the persons for training and the remainder for testing.

We first evaluate the proposed IMKML with different baselines. The performance comparison is shown in Figure 6(a). It can be found that IMKML improves the initial results of all baselines. Specifically, the rank-1 matching rate increases from 76.4% to 82.3% using KMNL as the baseline model, from 76.0% to 81.8% using NFST, from 75.2% to 79.1% using KMFA, and from 74.7% to 78.8% using KLFDA. Note that the baselines learn from the labeled training set only, while our IMKML learns not only from the training data but also from some correctly matched test samples, so more samples are involved in training. Similar to the results on the VIPeR dataset, we find that KMNL outperforms the other three baselines, and the boosting effects of IMKML mainly reside on the lower ranks.

Next, we compare IMKML with some approaches that have reported results on the PRID450S dataset, including SSM [2], GOG [37], SSSVM [56], MirrorRep [5], SCNCD [53], ECM [32], KISSME [19], and EIML [15]. From the performance comparison shown in Figure 6(b) and Table 3, it can be observed that IMKML outperforms all competitors by a large margin. In particular, the obtained rank-1 matching rate is as high as 82.3%, which improves the state-of-the-art 73.0% by


Fig. 6. Performance comparison on the PRID450S dataset: (a) results of different baselines and those after applying the IMKML framework; (b) IMKML (+KMNL) and the state-of-the-art approaches. The results are shown as rank-1 matching rates and CMC curves.

Table 3. Comparison of Cumulative Matching Rates (%) at Ranks 1, 5, 10, and 20 on the PRID450S Dataset

Method           Rank-1   Rank-5   Rank-10   Rank-20
IMKML             82.3     95.7     97.6      98.7
SSM [2]           73.0     -        96.8      99.1
GOG [37]          68.4     88.8     94.5      97.8
SSSVM [56]        60.5     82.9     88.6      93.6
MirrorRep [5]     55.4     79.3     87.8      93.9
SCNCD [53]        42.4     69.2     79.6      88.4
ECM [32]          41.9     66.3     76.9      84.9
KISSME [19]       33.5     59.8     70.8      79.5
EIML [15]         34.7     57.7     67.9      77.3

9.3%. Besides, IMKML is the only one reporting a higher than 90% matching rate at rank-5. At rank-20, IMKML yields the second best accuracy of 98.7%, slightly lower than the 99.1% reported by SSM. Because there are only hundreds of training samples in PRID450S and most of the compared approaches learn from the training set only, they suffer the SSS problem heavily. In contrast, our KMNL can well address the SSS problem, and IMKML can exploit discriminative information from both the training set and the successfully identified test samples. Therefore, a much better result is obtained on this small dataset.

4.4 Experiments on CUHK01

The CUHK01 [60] dataset contains 971 persons observed by two cameras on a campus. One camera recorded the front or back view, and the other recorded the side views. There are two images per person in each camera view, so the total image number is 3,884. All images are scaled to 160×60 pixels. As can be seen from Figure 4, the images in this dataset are of much higher quality and resolution than those of the previous datasets. In experiments, we randomly select 485 persons to build the training set, and the remaining 486 persons are used for testing. As one person has two


Table 4. Comparison of Cumulative Matching Rates (%) at Ranks 1, 5, 10, and 20 on the CUHK01 Dataset

Method            Rank-1   Rank-5   Rank-10   Rank-20
IMKML              91.4     99.0     100       100
EDM [43]           86.6     -        -         -
SpindleNet [57]    79.9     94.4     97.1      98.6
DLPAR [58]         75.0     93.5     95.7      97.7
CRAFT [6]          74.5     91.2     94.8      97.1
NFST [55]          69.1     86.9     91.8      95.4
GOG [37]           67.3     86.9     91.8      95.9
SSSVM [56]         66.0     -        -         -
MLAPG [26]         64.2     85.4     90.8      94.9
XQDA [25]          63.2     83.9     90.0      94.2
ME [39]            55.2     77.5     84.5      92.5
MRT-CNN [21]       52.6     81.6     88.2      -
MLF [60]           34.3     55.1     65.0      74.9

images in each view, we sum the cross-view distances of each person to perform Multiple-shot vs. Multiple-shot (MvsM) re-identification.

Table 4 presents the comparison of our IMKML with the state-of-the-art approaches under the same settings. We can find that IMKML outperforms all existing results by a large margin. In particular, the rank-1 matching rate obtained by IMKML is as high as 91.4%. To the best of our knowledge, this is the first work reporting a higher than 90% rank-1 matching accuracy on this dataset. Compared to the second best, 86.6% reported by EDM [43], there is an improvement of 4.8%. Such an encouraging result was unexpected on this dataset. Note that, except for IMKML, the approaches with rank-1 matching accuracy higher than 70% are all deep learning related.

We think there are two main reasons for the success of IMKML. First, the employed feature representation may fit the pedestrian appearances in this dataset well. Since the feature is a fusion of the dense-block-based LOMO and its stripe-based variant, the coarse appearance and fine details are both captured. Due to the small range of appearance variation in CUHK01 (see Figure 4, for example), the matching accuracy should be higher than on other datasets. Second, as can be observed from Figure 8, the initial matching result of KMNL is already quite good (84.8% at rank-1). Therefore, with the boosting effect of IMKML, a higher than 90% matching accuracy is finally obtained.

4.5 Experiments on Market-1501

Market-1501 [62] is one of the biggest person re-identification datasets to date, containing 32,668 images of 1,501 pedestrians. This dataset was captured by six cameras from different views, and each pedestrian is observed by at least two cameras. We evaluate our IMKML with the provided fixed dataset split: 12,936 "bounding-box-train" images are used for training, while 3,368 query images and 19,732 "bounding-box-test" images are used for testing. Both single-query and multi-query settings are considered.

Table 5 shows the performance comparison of the proposed IMKML with a number of state-of-the-art approaches. Since there are many more training images, the SSS problem is not serious on Market-1501, and the potential of deep learning can be fully demonstrated. We can find that the top five results are all achieved by deep learning based approaches. Our IMKML ranks only sixth; the


Fig. 7. The precision of the pseudo training set in each iteration on the Market-1501 dataset.

Fig. 8. Summary of rank-1 matching accuracy gains on the four considered datasets.

Table 5. Comparison of Rank-1 Matching Rate and mAP (%) on the Market-1501 Dataset

                   Single Query       Multiple Query
Method             Rank-1   mAP       Rank-1   mAP
IMKML               66.1    40.7       76.6    50.7
CSAdap [68]         88.1    68.7       -       -
MSP-CNN [42]        81.9    63.6       -       -
IDE [66]            79.5    59.9       85.8    70.3
PSS-DFL [69]        70.7    44.3       85.8    55.7
CRAFT [6]           68.7    42.3       77.0    50.3
Gated S-CNN [46]    65.9    39.6       76.0    48.5
NFST [55]           61.0    35.7       71.6    46.0
SCSP [4]            51.9    26.4       -       -
MS-TCNN [31]        45.1    -          55.4    -
BOW [62]            34.4    14.1       42.6    19.5

obtained rank-1 matching rates are 66.1% and 76.6% under the single-query and multi-query settings, and the mAPs are 40.7% and 50.7%, respectively. Even though it does not achieve the best results, our IMKML still outperforms Gated S-CNN [46] and MS-TCNN [31], as well as the traditional algorithms NFST [55] and SCSP [4].

Figure 7 presents the change of the pseudo training set's precision in each iteration, as well as the numbers of positive pairs and of all selected pairs. It can be observed that although more positive pairs are picked out as the iterations increase, the precision gradually drops, which means more negative pairs are involved. Because the negative pairs may bring some noise, and the selected positive pairs are far fewer than the labeled training set, the performance gains are not significant. As can be observed from Figure 8, the rank-1 matching rate is only improved from 63.3% to 66.1% under the single-query setting. The good news is that, with the pseudo set, our IMKML improves the mAP from 34.0% to 40.7%. We attribute this to the visual similarity of the chosen rank-1 gallery images with their within-class ones.

It should be noted that there are multiple images of the same person in the gallery set during testing; our IMKML can only choose the image ranked first. Although we can relax the rank-1 consensus to more ranks, we find that the performance degrades heavily in experiments. Therefore, we comply with the pseudo-pair selection mechanism detailed in Section 3.3.

Fig. 9. (a) Performance comparison of each kernel and their fusion. (b) The change of cumulative matching rates at ranks 1, 5, 10, and 20 w.r.t. iteration number.

ACM Trans. Multimedia Comput. Commun. Appl., Vol. 14, No. 3, Article 78. Publication date: August 2018.

5 ANALYSIS

In this section, we analyze the proposed method in four aspects: the contribution of each kernel, the effect of iterative re-identification, the influence of the distance gap, and the training time. The analysis is performed on the VIPeR dataset by randomly splitting it into two non-overlapping sets of equal size; one is used for training, and the other for testing.

5.1 Contribution of Each Kernel

The three kernels, Linear, RBF, and Polynomial, act as a pool of experts in the proposed IMKML re-identification framework. Figure 9(a) shows their contributions by plotting the final matching accuracies of each kernel, as well as their fusion, at rank-{1, 5, 10, 20}. It can be found that the performances of the three kernels are close to each other; the rank-1 matching rates are 57.4%, 56.3%, and 58.3% for Linear, RBF, and Polynomial, respectively. Interestingly, the Linear kernel performs better than the RBF kernel; we think this may be a consequence of the high dimensionality of the used feature, in which case the automatically determined bandwidth parameter may lead to over-fitting. The close but different matching results not only avoid performance degradation in fusion, but also help to select the pseudo training pairs. We can find that the fusion of the three kernels leads to slightly higher performance than each kernel alone. Note that we only use a simple average weighting strategy for the final distance; more sophisticated combinations may lead to higher performance.
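The average-weighting fusion can be sketched as follows. This toy version fuses the feature-space distances induced directly by the three kernel functions; in IMKML the distances would instead come from the learned KMNL metrics, so the kernel definitions and parameters here (`gamma`, `degree`, `c`) are illustrative assumptions, not the paper's settings.

```python
import math

def linear(x, y):
    return sum(a * b for a, b in zip(x, y))

def rbf(x, y, gamma=0.5):
    sq = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * sq)

def poly(x, y, degree=2, c=1.0):
    return (linear(x, y) + c) ** degree

def fused_distance(x, y, kernels, weights=None):
    """Average-weighted fusion of per-kernel distances. Each per-kernel
    distance is the kernel-induced feature-space distance
    sqrt(k(x,x) + k(y,y) - 2 k(x,y))."""
    if weights is None:
        weights = [1.0 / len(kernels)] * len(kernels)  # simple averaging
    d = 0.0
    for k, w in zip(kernels, weights):
        d += w * math.sqrt(max(k(x, x) + k(y, y) - 2.0 * k(x, y), 0.0))
    return d
```

Averaging keeps the fusion parameter-free; a learned weighting would be one of the "more sophisticated combinations" mentioned above.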

5.2 Effect of Iterative Re-Identification

In IMKML, the metric learning and re-identification on the testing set are coupled together. To investigate the influence of iterative metric learning on re-identification, we plot the curves of re-identification accuracies at rank-{1, 5, 10, 20} w.r.t. the iteration number in Figure 9(b). As can be observed, the matching rates at rank-{1, 5, 10} increase noticeably in the first four iterations, while the rank-20 accuracy degrades in the first five iterations; afterwards, they all remain stable. Therefore, iterative metric learning improves the re-identification performance significantly on the lower ranks, but may harm the accuracy on larger ranks. We think this is due to the hard negative pairs in the pseudo training set. Although their similar appearances can help to learn more discriminative metrics, they are always misidentified themselves and thus lead to a matching accuracy drop on larger ranks. Nevertheless, it is still worthwhile to learn from the pseudo training set, because the lower ranks are more important and the accuracies on them are improved significantly. From Figure 9(b), we can also find that the matching accuracies only change in the first five iterations and then remain unchanged, which means our IMKML converges quickly, within five iterations. From the precision change of the pseudo training set shown in Figure 7, we can find that the convergence of IMKML is similar on Market-1501. Therefore, we set the maximum iteration number T to 5 in experiments.

Fig. 10. Influence of distance gap: (a) cumulative matching accuracies at rank-{1, 5, 10, 20} w.r.t. parameter ρ; (b) the precision and recall of the pseudo training set w.r.t. parameter ρ; (c) the CMC curves at different precision and recall levels; only five selected ranks are plotted. The "P" and "R" in the legend indicate precision and recall, respectively.

5.3 Influence of Distance Gap

The distance gap plays an important role in choosing the candidate pairs of the pseudo training set. To explore its influence on the final performance, we evaluate IMKML with different values of the parameter ρ; the results are presented in Figure 10(a). When ρ = 0, there is no distance-gap constraint, and all rank-1 gallery images obtaining consensus with their probes are added into the pseudo training set. It can be found that in this case a promising 58.3% matching accuracy is obtained at rank-1. However, the performance on other ranks is rather low. With increasing ρ, the rank-1 accuracy descends gradually, while the accuracies on other ranks increase and become stable when ρ ≥ 0.2.

Figure 10(b) shows the influence of the distance gap on the precision and recall of the pseudo training set. We can observe that the precision ascends with increasing ρ, while the recall decreases gradually. Although a larger distance gap can boost the precision of the pseudo training set, it may lead to fewer pseudo training pairs. In particular, when ρ = 1, no pairs are selected into the pseudo training set at all, which fails to exploit the discrimination of the successfully identified test pairs. In Figure 10(c) we plot the CMC curves at different precision and recall levels. It can be found that the overall performance is best when P = 91.4% and R = 25.8%, which corresponds to ρ = 0.2 in Figure 10(b). Therefore, we choose ρ = 0.2 in experiments.
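A minimal sketch of the distance-gap rule, under the assumption that the gap is measured as the relative difference between the best and second-best gallery distances; the exact normalization and the cross-kernel consensus check are defined in Section 3.3 and omitted here, and all names are illustrative.

```python
def select_pseudo_pairs(probe_dists, rho=0.2):
    """For each probe, accept its rank-1 gallery image as a pseudo positive
    pair only if the (assumed) relative gap between the second-best and
    best distances is at least rho. probe_dists maps each probe id to a
    dict of {gallery_id: distance} with at least two gallery entries."""
    selected = []
    for probe, dists in probe_dists.items():
        order = sorted(dists, key=dists.get)   # gallery ids by distance
        best, second = order[0], order[1]
        gap = (dists[second] - dists[best]) / max(dists[second], 1e-12)
        if gap >= rho:                          # confident rank-1 match
            selected.append((probe, best))
    return selected
```

With rho = 0 every rank-1 match is accepted (maximal recall, lowest precision), while rho close to 1 accepts almost nothing, mirroring the precision/recall trade-off in Figure 10(b).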

5.4 Training Time

The training time is an important criterion for evaluating metric learning algorithms. Although IMKML needs to update the pseudo training set and learn multiple kernel metrics iteratively, it is still very efficient due to the closed-form solution of KMNL and the offline computation of the kernel matrices. Table 6 shows a comparison of average training time on the VIPeR dataset over 10 random trials. All compared algorithms are state-of-the-art metric learning algorithms implemented in MATLAB. The training is performed on a desktop PC with an Intel [email protected] CPU. From the comparison, we can observe that our IMKML requires 20.3 seconds for training. Although this is much slower than the baseline KMNL, as well as other methods that learn a single closed-form metric such as NFST, KISSME, and XQDA, it is still much faster than LMNN, MLAPG, and ITML [9], which require heavy optimization.

Table 6. Comparison of Training Time (Seconds) on the VIPeR Dataset

Method  KISSME  XQDA  NFST  KMNL  IMKML  LMNN   MLAPG  ITML
Time    3.7     3.5   1.4   1.5   20.3   269.3  62.1   195.8

6 CONCLUSION AND FUTURE WORK

In this article, we have presented a novel person re-identification framework called IMKML. In particular, IMKML first learns multiple metrics via a newly derived kernel nullspace learning algorithm named KMNL, and then selects the highly confident positive test pairs to augment the training data. By iteratively alternating between the two modules, the discriminative information from successfully identified test samples can be well exploited to learn more discriminative metrics. As a result, significant re-identification performance gains can be achieved. Experiments show that IMKML outperforms the state-of-the-art approaches by a large margin on small datasets, and it also achieves comparable performance on a large dataset. In future work, we will further improve IMKML in two aspects. The first is to address the problem that IMKML may degrade the initial matching accuracy on larger ranks; the other is to broaden its application, since at the current stage only one image can be selected into the pseudo training set even if there are two or more images of the same person in the gallery set.

REFERENCES

[1] S. Bai and X. Bai. 2016. Sparse contextual activation for efficient visual re-ranking. IEEE Transactions on Image Processing 25, 3 (2016), 1056–1069.

[2] Song Bai, Xiang Bai, and Qi Tian. 2017. Scalable person re-identification on supervised smoothed manifold. In IEEE Conference on Computer Vision and Pattern Recognition. 3356–3365.

[3] Song Bai, Zhichao Zhou, Jingdong Wang, Xiang Bai, Longin Jan Latecki, and Qi Tian. 2017. Ensemble diffusion for retrieval. In IEEE International Conference on Computer Vision. 774–783.

[4] Dapeng Chen, Zejian Yuan, Badong Chen, and Nanning Zheng. 2016. Similarity learning with spatial constraints for person re-identification. In IEEE Conference on Computer Vision and Pattern Recognition. 1268–1277.

[5] Ying Cong Chen, Wei Shi Zheng, and Jianhuang Lai. 2015. Mirror representation for modeling view-specific transform in person re-identification. In International Conference on Artificial Intelligence. 3402–3408.

[6] Ying Cong Chen, Xiatian Zhu, Wei Shi Zheng, and Jian Huang Lai. 2018. Person re-identification by camera correlation aware feature augmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence 40, 2 (2018), 392–408.

[7] De Cheng, Yihong Gong, Sanping Zhou, Jinjun Wang, and Nanning Zheng. 2016. Person re-identification by multi-channel parts-based CNN with improved triplet loss function. In IEEE Conference on Computer Vision and Pattern Recognition. 1335–1344.

[8] Dong Seon Cheng, Marco Cristani, Michele Stoppa, Loris Bazzani, and Vittorio Murino. 2011. Custom pictorial structures for re-identification. In British Machine Vision Conference. 1–11.

[9] Jason V. Davis, Brian Kulis, Prateek Jain, Suvrit Sra, and Inderjit S. Dhillon. 2007. Information-theoretic metric learning. In ACM International Conference on Machine Learning. 209–216.

[10] Michael Donoser and Horst Bischof. 2013. Diffusion processes for retrieval revisited. In IEEE Conference on Computer Vision and Pattern Recognition. 1320–1327.

[11] Michela Farenzena, Loris Bazzani, Alessandro Perina, Vittorio Murino, and Marco Cristani. 2010. Person re-identification by symmetry-driven accumulation of local features. In IEEE Conference on Computer Vision and Pattern Recognition. 2360–2367.


[12] Jorge García, Niki Martinel, Alfredo Gardel, Ignacio Bravo, Gian Luca Foresti, and Christian Micheloni. 2017. Discriminant context information analysis for post-ranking person re-identification. IEEE Transactions on Image Processing 26, 4 (2017), 1650–1665.

[13] Shaogang Gong, Marco Cristani, Shuicheng Yan, and Chen Change Loy. 2014. Person Re-identification. Springer.

[14] Douglas Gray and Hai Tao. 2008. Viewpoint invariant pedestrian recognition with an ensemble of localized features. In European Conference on Computer Vision. 262–275.

[15] Martin Hirzer, Peter M. Roth, and Horst Bischof. 2012. Person re-identification by efficient impostor-based metric learning. In IEEE Conference on Advanced Video and Signal-Based Surveillance. 203–208.

[16] Weiming Hu, Min Hu, Xue Zhou, Tieniu Tan, Jianguang Lou, and Steve Maybank. 2006. Principal axis-based correspondence between multiple cameras for people tracking. IEEE Transactions on Pattern Analysis and Machine Intelligence 28, 4 (2006), 663–671.

[17] Jieru Jia, Qiuqi Ruan, Gaoyun An, and Yi Jin. 2017. Multiple metric learning with query adaptive weights and multi-task re-weighting for person re-identification. Computer Vision and Image Understanding 160 (2017), 87–99.

[18] Svebor Karaman, Giuseppe Lisanti, Andrew D. Bagdanov, and Alberto Del Bimbo. 2014. Leveraging local neighborhood topology for large scale person re-identification. Pattern Recognition 47, 12 (2014), 3767–3778.

[19] Martin Köstinger, Martin Hirzer, Paul Wohlhart, Peter M. Roth, and Horst Bischof. 2012. Large scale metric learning from equivalence constraints. In IEEE Conference on Computer Vision and Pattern Recognition. 2288–2295.

[20] Igor Kviatkovsky, Amit Adam, and Ehud Rivlin. 2013. Color invariants for person reidentification. IEEE Transactions on Pattern Analysis and Machine Intelligence 35, 7 (2013), 1622–1634.

[21] Bogdan Kwolek. 2017. Person re-identification using multi-region triplet convolutional network. In ACM International Conference on Distributed Smart Cameras. 82–87.

[22] Qingming Leng, Ruimin Hu, Chao Liang, Yimin Wang, and Jun Chen. 2015. Person re-identification with content and context re-ranking. Multimedia Tools and Applications 74, 17 (2015), 6989–7014.

[23] Wei Li and Xiaogang Wang. 2013. Locally aligned feature transforms across views. In IEEE Conference on Computer Vision and Pattern Recognition. 3594–3601.

[24] Zhen Li, Shiyu Chang, Feng Liang, Thomas Huang, Liangliang Cao, and John Smith. 2013. Learning locally-adaptive decision functions for person verification. In IEEE Conference on Computer Vision and Pattern Recognition. 3610–3617.

[25] Shengcai Liao, Yang Hu, Xiangyu Zhu, and Stan Z. Li. 2015. Person re-identification by local maximal occurrence representation and metric learning. In IEEE Conference on Computer Vision and Pattern Recognition. 2197–2206.

[26] Shengcai Liao and Stan Z. Li. 2015. Efficient PSD constrained asymmetric metric learning for person re-identification. In IEEE International Conference on Computer Vision. 3685–3693.

[27] Giuseppe Lisanti, Svebor Karaman, and Iacopo Masi. 2017. Multichannel-kernel canonical correlation analysis for cross-view person reidentification. ACM Transactions on Multimedia Computing, Communications, and Applications 13, 2 (2017), 13–32.

[28] Giuseppe Lisanti, Iacopo Masi, Andrew D. Bagdanov, and Alberto Del Bimbo. 2015. Person re-identification by iterative re-weighted sparse ranking. IEEE Transactions on Pattern Analysis and Machine Intelligence 37, 8 (2015), 1629–1642.

[29] Giuseppe Lisanti, Iacopo Masi, and Alberto Del Bimbo. 2014. Matching people across camera views using kernel canonical correlation analysis. In ACM International Conference on Distributed Smart Cameras. 1–6.

[30] Chunxiao Liu, Chen Change Loy, Shaogang Gong, and Guijin Wang. 2013. POP: Person re-identification post-rank optimisation. In IEEE International Conference on Computer Vision. 441–448.

[31] Jiawei Liu, Zheng Jun Zha, Qi Tian, Dong Liu, Ting Yao, Qiang Ling, and Tao Mei. 2016. Multi-scale triplet CNN for person re-identification. In ACM Multimedia Conference. 192–196.

[32] Xiaokai Liu, Hongyu Wang, Yi Wu, and Jimei Yang. 2015. An ensemble color model for human re-identification. In Applications of Computer Vision. 868–875.

[33] Chen Change Loy, Tao Xiang, and Shaogang Gong. 2009. Multi-camera activity correlation analysis. In IEEE Conference on Computer Vision and Pattern Recognition. 1988–1995.

[34] Bingpeng Ma, Yu Su, and Frederic Jurie. 2014. Covariance descriptor based on bio-inspired features for person re-identification and face verification. Image and Vision Computing 32, 6 (2014), 379–390.

[35] Lianyang Ma, Xiaokang Yang, and Dacheng Tao. 2014. Person re-identification over camera networks using multi-task distance metric learning. IEEE Transactions on Image Processing 23, 8 (2014), 3656–3670.

[36] Niki Martinel, Christian Micheloni, and Gian Luca Foresti. 2015. Kernelized saliency-based person re-identification through multiple metric learning. IEEE Transactions on Image Processing 24, 12 (2015), 5645–5658.

[37] Tetsu Matsukawa, Takahiro Okabe, Einoshin Suzuki, and Yoichi Sato. 2016. Hierarchical Gaussian descriptor for person re-identification. In IEEE Conference on Computer Vision and Pattern Recognition. 1363–1372.


[38] Tao Mei, Yong Rui, Shipeng Li, and Qi Tian. 2014. Multimedia search reranking: A literature survey. ACM Computing Surveys 46, 3 (2014), 1–38.

[39] Sakrapee Paisitkriangkrai, Lin Wu, Chunhua Shen, and Anton Van Den Hengel. 2017. Structured learning of metric ensembles with application to person re-identification. Computer Vision and Image Understanding 156 (2017), 51–65.

[40] Peter M. Roth, Martin Hirzer, Martin Köstinger, Csaba Beleznai, and Horst Bischof. 2014. Mahalanobis distance learning for person re-identification. 247–267.

[41] B. Schölkopf, A. Smola, and K. Müller. 1998. Nonlinear component analysis as a kernel eigenvalue problem. Neural Computation 10, 5 (1998), 1299–1319.

[42] Chen Shen, Zhongming Jin, Yiru Zhao, Zhihang Fu, Rongxin Jiang, Yaowu Chen, and Xian Sheng Hua. 2017. Deep siamese network with multi-level similarity perception for person re-identification. In ACM Multimedia Conference. 1942–1950.

[43] Hailin Shi, Yang Yang, Xiangyu Zhu, Shengcai Liao, Zhen Lei, Weishi Zheng, and Stan Z. Li. 2016. Embedding deep metric for person re-identification: A study against large variations. In European Conference on Computer Vision. 732–748.

[44] Masashi Sugiyama. 2007. Dimensionality reduction of multimodal labeled data by local Fisher discriminant analysis. Journal of Machine Learning Research 8, 1 (2007), 1027–1061.

[45] Chong Sun, Dong Wang, and Huchuan Lu. 2017. Person re-identification via distance metric learning with latent variables. IEEE Transactions on Image Processing 26, 1 (2017), 23–34.

[46] Rahul Rama Varior, Mrinal Haloi, and Gang Wang. 2016. Gated siamese convolutional neural network architecture for human re-identification. In European Conference on Computer Vision. 791–808.

[47] Rahul Rama Varior, Gang Wang, Jiwen Lu, and Ting Liu. 2016. Learning invariant color features for person reidentification. IEEE Transactions on Image Processing 25, 7 (2016), 3395–3410.

[48] Roberto Vezzani, Davide Baltieri, and Rita Cucchiara. 2013. People reidentification in surveillance and forensics: A survey. ACM Computing Surveys 46, 2 (2013), 1–37.

[49] Bing Wang, Gang Wang, Kap Luk Chan, and Li Wang. 2014. Tracklet association with online target-specific metric learning. In IEEE Conference on Computer Vision and Pattern Recognition. 1234–1241.

[50] Kilian Q. Weinberger and Lawrence K. Saul. 2009. Distance metric learning for large margin nearest neighbor classification. Journal of Machine Learning Research 10 (2009), 207–244.

[51] Shuicheng Yan, Dong Xu, Benyu Zhang, Hong-Jiang Zhang, Qiang Yang, and Stephen Lin. 2007. Graph embedding and extensions: A general framework for dimensionality reduction. IEEE Transactions on Pattern Analysis and Machine Intelligence 29, 1 (2007), 40–51.

[52] Xun Yang, Meng Wang, Richang Hong, and Yong Rui. 2017. Enhancing person re-identification in a self-trained subspace. ACM Transactions on Multimedia Computing, Communications, and Applications 13, 3 (2017), 27–41.

[53] Yang Yang, Jimei Yang, Junjie Yan, Shengcai Liao, Dong Yi, and Stan Z. Li. 2014. Salient color names for person re-identification. In European Conference on Computer Vision. 536–551.

[54] Mang Ye, Chao Liang, Yi Yu, Zheng Wang, Qingming Leng, Chunxia Xiao, Jun Chen, and Ruimin Hu. 2016. Person reidentification via ranking aggregation of similarity pulling and dissimilarity pushing. IEEE Transactions on Multimedia 18, 12 (2016), 2553–2566.

[55] Li Zhang, Tao Xiang, and Shaogang Gong. 2016. Learning a discriminative null space for person re-identification. In IEEE Conference on Computer Vision and Pattern Recognition. 1239–1248.

[56] Ying Zhang, Baohua Li, Huchuan Lu, Atshushi Irie, and Ruan Xiang. 2016. Sample-specific SVM learning for person re-identification. In IEEE Conference on Computer Vision and Pattern Recognition. 1278–1287.

[57] Haiyu Zhao, Maoqing Tian, Shuyang Sun, Jing Shao, Junjie Yan, Shuai Yi, Xiaogang Wang, and Xiaoou Tang. 2017. Spindle net: Person re-identification with human body region guided feature decomposition and fusion. In IEEE Conference on Computer Vision and Pattern Recognition. 907–915.

[58] Liming Zhao, Xi Li, Jingdong Wang, and Yueting Zhuang. 2017. Deeply-learned part-aligned representations for person re-identification. In IEEE International Conference on Computer Vision. 3239–3248.

[59] Rui Zhao, Wanli Ouyang, and Xiaogang Wang. 2013. Unsupervised salience learning for person re-identification. In IEEE Conference on Computer Vision and Pattern Recognition. 3586–3593.

[60] Rui Zhao, Wanli Ouyang, and Xiaogang Wang. 2014. Learning mid-level filters for person re-identification. In IEEE Conference on Computer Vision and Pattern Recognition. 144–151.

[61] Liang Zheng, Liyue Shen, Lu Tian, Shengjin Wang, Jingdong Wang, and Qi Tian. 2015. Scalable person re-identification: A benchmark. In IEEE International Conference on Computer Vision. 1116–1124.

[62] Liang Zheng, Liyue Shen, Lu Tian, Shengjin Wang, Jingdong Wang, and Qi Tian. 2015. Scalable person re-identification: A benchmark. In IEEE International Conference on Computer Vision.


[63] Liang Zheng, Yi Yang, and Alexander G. Hauptmann. 2016. Person re-identification: Past, present and future. ArXiv:1610.02984.

[64] Wenming Zheng, Li Zhao, and Cairong Zou. 2005. Foley-Sammon optimal discriminant vectors using kernel approach. IEEE Transactions on Neural Networks 16, 1 (2005), 1–9.

[65] Wei-Shi Zheng, Shaogang Gong, and Tao Xiang. 2013. Reidentification by relative distance comparison. IEEE Transactions on Pattern Analysis and Machine Intelligence 35, 3 (2013), 653–668.

[66] Zhedong Zheng, Liang Zheng, and Yi Yang. 2018. A discriminatively learned CNN embedding for person re-identification. ACM Transactions on Multimedia Computing, Communications, and Applications 14, 1 (2018), Article 13.

[67] Zhun Zhong, Liang Zheng, Donglin Cao, and Shaozi Li. 2017. Re-ranking person re-identification with k-reciprocal encoding. In IEEE Conference on Computer Vision and Pattern Recognition. 3652–3661.

[68] Zhun Zhong, Liang Zheng, Zhedong Zheng, Shaozi Li, and Yi Yang. 2018. Camera style adaptation for person re-identification. In IEEE Conference on Computer Vision and Pattern Recognition.

[69] Sanping Zhou, Jinjun Wang, Jiayun Wang, Yihong Gong, and Nanning Zheng. 2017. Point to set similarity based deep feature learning for person re-identification. In IEEE Conference on Computer Vision and Pattern Recognition. 5028–5037.

Received December 2017; revised May 2018; accepted May 2018
