
W. LIU ET AL.: JOINTLY LEARNING NON-NEGATIVE PROJECTION AND DICTIONARY 1

Jointly Learning Non-negative Projection and Dictionary with Discriminative Graph Constraints for Classification

Weiyang Liu 1,2
[email protected]

Zhiding Yu 3
[email protected]

Yandong Wen 3
[email protected]

Rongmei Lin 4
[email protected]

Meng Yang* 1
[email protected]

1 College of Computer Science & Software Engineering, Shenzhen University, China

2 School of ECE, Peking University, China

3 Dept. of ECE, Carnegie Mellon University, USA

4 Dept. of Math & Computer Science, Emory University, USA

Abstract

Sparse coding with dictionary learning (DL) has shown excellent classification performance. Despite the considerable number of existing works, how to obtain features on top of which dictionaries can be better learned remains an open and interesting question. Many currently prevailing DL methods directly adopt well-performing crafted features. While such a strategy may empirically work well, it ignores certain intrinsic relationships between dictionaries and features. We propose a framework in which features and dictionaries are jointly learned and optimized. The framework, named "joint non-negative projection and dictionary learning" (JNPDL), enables interaction between the input features and the dictionaries. The non-negative projection leads to discriminative parts-based object features, while DL seeks a more suitable representation. Discriminative graph constraints are further imposed to simultaneously maximize intra-class compactness and inter-class separability. Experiments on both image and image set classification show the excellent performance of JNPDL, which outperforms several state-of-the-art approaches.

1 Introduction

Sparse coding has been widely applied in a variety of computer vision problems where one seeks to represent a signal as a sparse linear combination of bases (dictionary atoms). The dictionary plays an important role, as it is expected to robustly represent components of the query signal. [22] proposed sparse representation-based classification (SRC), in which the entire training set is treated as a structured dictionary. Methods taking off-the-shelf bases (e.g., wavelets) as the dictionary were also proposed [7]. While such a strategy is straightforward and convenient, research also indicates that it may not be optimal for classification

© 2016. The copyright of this document resides with its authors. It may be distributed unchanged freely in print or electronic forms.


Figure 1. An illustration of JNPDL. The dictionary and non-negative projection are jointly learned with discriminative graph constraints.

tasks. Many currently prevailing approaches choose to learn dictionaries directly from training data and have led to good results in face recognition [8, 10, 24, 26, 29] and image classification [25]. These dictionary updating strategies are referred to as dictionary learning (DL). DL has received significant attention for its excellent representation power. Such an advantage mainly comes from the fact that allowing the update of dictionary atoms often provides additional flexibility to discover better intrinsic signal patterns, therefore leading to more robust representations.

DL methods can be broadly divided into unsupervised DL and supervised DL. Due to the lack of label information, unsupervised DL is only guaranteed to discover signal patterns that are representative, but not necessarily discriminative. Supervised DL exploits the information among classes and requires the learned dictionary to be discriminative. Related literature includes discriminative K-SVD [29] and label-consistent K-SVD [8]. Dictionary regularization terms that take label information into account were also introduced [1, 16, 17]. For stronger discrimination, [24, 26] tried to learn a discriminative dictionary via the Fisher discrimination criterion. More recently, [25] proposed a latent DL method that jointly learns a latent matrix to adaptively build the relationship between dictionary atoms and class labels.

Different from the wide variety of conventional discriminative DL literature, our work casts an alternative view on this problem. One major purpose of this paper is to jointly learn a feature projection that improves DL. Instead of continuing to exploit additional discrimination from the dictionary representation, we consider optimizing the input features to further improve the learned dictionary. We believe such a process can considerably influence the quality of the learned dictionary, while a better learned dictionary may directly improve subsequent classification performance.

Given that mid-level object parts are often discriminative for classification, we aim to learn a feature projection that mines these discriminative patterns. It is well known that non-negative matrix factorization (NMF) [11] can learn part-like components. In the light of NMF and projective NMF (PNMF) [13], we consider the projective self-representation (PSR) model, where the set of training samples $\mathbf{Y}$ is approximately factorized as $\mathbf{Y} \approx \mathbf{M}\mathbf{P}\mathbf{Y}$. The model jointly learns both the intermediate basis matrix $\mathbf{M}$ and the projection matrix $\mathbf{P}$ under non-negativity constraints, such that the additive (non-subtractive) combinations lead to projected features $\mathbf{P}\mathbf{Y}$ that accentuate spatial object parts. In this paper, we propose a novel
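To make the shapes in the PSR model concrete, the following is a minimal numpy sketch of the factorization $\mathbf{Y} \approx \mathbf{M}\mathbf{P}\mathbf{Y}$. The dimensions and random data are purely illustrative assumptions; this is not the paper's learning procedure, only the bookkeeping.

```python
import numpy as np

# Illustrative shapes for the PSR model Y ≈ M P Y (not the actual learning
# procedure): s = input dimension, s_p = projected dimension, N = samples.
rng = np.random.default_rng(0)
s, s_p, N = 20, 8, 30
Y = np.abs(rng.standard_normal((s, N)))     # training samples (non-negative)
P = np.abs(rng.standard_normal((s_p, s)))   # non-negative projection matrix
M = np.abs(rng.standard_normal((s, s_p)))   # non-negative basis matrix

Y_proj = P @ Y                                          # projected parts-based features
psr_error = np.linalg.norm(Y - M @ P @ Y, 'fro') ** 2   # ||Y - MPY||_F^2
```

Because $\mathbf{P}$ and $\mathbf{Y}$ are non-negative, the projected features $\mathbf{P}\mathbf{Y}$ are built from purely additive combinations, which is what encourages parts-based structure.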


NMF-like feature projection learning framework on top of the PSR model that simultaneously incorporates label information through discriminative graph constraints. As one shall see, our proposed framework can be viewed as a tradeoff between NMF and feature learning [30].

The dictionary representation is further discriminatively learned given the projected input features. An overview of the joint non-negative projection and dictionary learning (JNPDL) framework is illustrated in Fig. 1. The construction of the discriminative graph constraints in both non-negative projection learning and dictionary learning follows the graph embedding framework [23]. While the inputs of the graph constraints are essentially the same, they form different regularization terms for the convenience of optimization. Finally, a discriminative reconstruction constraint is also adopted so that coding coefficients represent samples from their own classes well but represent samples from other classes poorly. We test JNPDL on both image classification and image set classification with comprehensive evaluations, showing its excellent performance.

2 Related Works

Only a few works [3, 15, 28] have discussed similar ideas, and all have reported more competitive performance than conventional DL methods. [15] proposed a framework which simultaneously learns the feature and the dictionary; their work focused on learning a reconstructive feature (filterbank), with a similar idea to [30]. The work in [3] jointly learns a dimensionality reduction matrix and a dictionary for face recognition. Unlike JNPDL, that model focused on low-dimensional representation without discriminative constraints; the source of discrimination purely comes from a Fisher-like constraint on the coding coefficients. [28] presented a simultaneous projection and dictionary learning method with a carefully designed discriminative sigmoid reconstruction error. Their method represents the input samples with the multiplication of the projection matrix and the dictionary, which differs significantly from JNPDL. [5] presented a dictionary pair learning approach for pattern classification, which also partially resembles our framework.

3 The Proposed JNPDL Model

Let $\mathbf{Y}$ be a set of $s$-dimensional training samples, i.e., $\mathbf{Y} = \{\mathbf{Y}_1, \mathbf{Y}_2, \cdots, \mathbf{Y}_K\}$, where $\mathbf{Y}_i$ denotes the training samples from class $i$. The learned structured (class-specific) dictionary is denoted by $\mathbf{D} = \{\mathbf{D}_1, \mathbf{D}_2, \cdots, \mathbf{D}_K\}$, and the corresponding sparse representation of the training samples over the dictionary $\mathbf{D}$ is defined as $\mathbf{X} = \{\mathbf{X}_1, \mathbf{X}_2, \cdots, \mathbf{X}_K\}$. $\mathbf{M}$ is the intermediate non-negative basis matrix, while $\mathbf{P} \in \mathbb{R}^{s_p \times s}$ denotes the projection matrix. In order to avoid high inter-class correlations and benefit subsequent sparse coding, JNPDL first projects the training samples to a more discriminative space before dictionary encoding; a similar process applies at test time. At training, the projection, dictionary and coding coefficients are jointly learned with the following model:

$$\langle \mathbf{D},\mathbf{M},\mathbf{P},\mathbf{X}\rangle = \arg\min_{\mathbf{D},\,\mathbf{M}\ge\mathbf{0},\,\mathbf{P}\ge\mathbf{0},\,\mathbf{X}} \big\{ R(\mathbf{D},\mathbf{P},\mathbf{X}) + \alpha_1 G_p(\mathbf{P},\mathbf{M}) + \alpha_2 G_c(\mathbf{X}) + \alpha_3 \|\mathbf{X}\|_1 \big\}, \quad (1)$$

where $\alpha_1, \alpha_2$ and $\alpha_3$ are scalar constants, $R(\mathbf{D},\mathbf{P},\mathbf{X})$ is the discriminative reconstruction error, $G_p(\mathbf{P},\mathbf{M})$ is the graph-based projection term learning the NMF-like feature projection, and $G_c(\mathbf{X})$ is the graph-based coding coefficients term imposing discriminative label information on the DL.
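Structurally, the objective in Eq. (1) is a weighted sum of four terms. The sketch below assembles it from term callables; `R`, `G_p` and `G_c` are placeholders for the terms defined in the following subsections, so only the combination structure is shown.

```python
import numpy as np

# Sketch of the JNPDL objective in Eq. (1). R, G_p and G_c are passed in as
# callables (stand-ins for the terms of Sections 3.1-3.3); the l1 sparsity
# penalty on the coefficients X is written out explicitly.
def jnpdl_objective(D, M, P, X, R, G_p, G_c, alpha1, alpha2, alpha3):
    return (R(D, P, X)
            + alpha1 * G_p(P, M)
            + alpha2 * G_c(X)
            + alpha3 * np.abs(X).sum())
```

With the settings reported in Section 6.1 ($\alpha_1 = \alpha_2 = 1$, $\alpha_3 = 0.05$), the three structural terms are summed with equal weight and a small weight on sparsity.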


3.1 Discriminative Reconstruction Error

The discriminative reconstruction error targets the following three objectives: minimizing the global reconstruction error, minimizing the local reconstruction error¹, and maximizing the non-local reconstruction error. The term is defined as:

$$R(\mathbf{D},\mathbf{P},\mathbf{X}) = \|\mathbf{P}\mathbf{Y}-\mathbf{D}\mathbf{X}\|_F^2 + \sum_{i=1}^{K} \|\mathbf{P}\mathbf{Y}_i - \mathbf{D}_i\mathbf{X}_i^i\|_F^2 + \sum_{i=1}^{K}\sum_{j=1,\,j\neq i}^{K} \|\mathbf{D}_j\mathbf{X}_i^j\|_F^2, \quad (2)$$

where $\mathbf{X}_i^j$ denotes the coding coefficients of the samples $\mathbf{Y}_i$ associated with the sub-dictionary $\mathbf{D}_j$. $\sum_{i=1}^{K}\|\mathbf{P}\mathbf{Y}_i - \mathbf{D}_i\mathbf{X}_i^i\|_F^2$ is the local reconstruction error, which requires each sub-dictionary $\mathbf{D}_i$ to represent the samples from class $i$ well. $\|\mathbf{D}_j\mathbf{X}_i^j\|_F^2$ is further minimized so that the inter-class coefficients $\mathbf{X}_i^j,\ i\neq j$, are relatively small compared with $\mathbf{X}_i^i$.
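A direct numpy transcription of Eq. (2) might look as follows; `PY_cls[i]` holds the projected samples of class $i$ and `X_cls[i][j]` the block $\mathbf{X}_i^j$. All names and shapes are illustrative assumptions of this sketch.

```python
import numpy as np

# Sketch of the discriminative reconstruction error in Eq. (2).
# PY     : projected training samples P @ Y, shape (p, N)
# PY_cls : list of per-class column blocks of PY
# D_cls  : list of class-specific sub-dictionaries D_i
# X_cls  : X_cls[i][j] = X_i^j, coefficients of class-i samples over D_j
def recon_error(PY, PY_cls, D_cls, X_cls):
    K = len(D_cls)
    D = np.hstack(D_cls)                               # full dictionary [D_1 ... D_K]
    X = np.hstack([np.vstack([X_cls[i][j] for j in range(K)])
                   for i in range(K)])                 # full coefficient matrix
    err = np.linalg.norm(PY - D @ X, 'fro') ** 2       # global term
    for i in range(K):
        # local term: class i should be well represented by its own D_i
        err += np.linalg.norm(PY_cls[i] - D_cls[i] @ X_cls[i][i], 'fro') ** 2
        # non-local term: cross-class contributions D_j X_i^j are suppressed
        for j in range(K):
            if j != i:
                err += np.linalg.norm(D_cls[j] @ X_cls[i][j], 'fro') ** 2
    return err
```

When each class is perfectly reconstructed by its own sub-dictionary and all cross-class blocks are zero, every term vanishes, which is exactly the configuration Eq. (2) rewards.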

3.2 Graph-based Coding Coefficients Term

This term seeks to constrain the intra-class coding coefficients to be similar and the inter-class ones to be significantly dissimilar. We first construct an intrinsic graph for intra-class compactness and a penalty graph for inter-class separability. Rewriting the coding coefficients $\mathbf{X}$ as $\mathbf{X} = \{\mathbf{x}_1, \mathbf{x}_2, \cdots, \mathbf{x}_N\}$, where $N$ is the number of training samples, the similarity matrix of the intrinsic graph is defined as:

$$\{\mathbf{W}_c\}_{ij} = \begin{cases} 1, & \text{if } i \in S_{k_1}^{+}(j) \text{ or } j \in S_{k_1}^{+}(i) \\ 0, & \text{otherwise}, \end{cases} \quad (3)$$

where $S_{k_1}^{+}(i)$ indicates the index set of the $k_1$ nearest intra-class neighbors of sample $\mathbf{x}_i$. Similarly, the similarity matrix of the penalty graph is defined as:

$$\{\mathbf{W}_c^p\}_{ij} = \begin{cases} 1, & \text{if } j \in S_{k_2}^{-}(i) \text{ or } i \in S_{k_2}^{-}(j) \\ 0, & \text{otherwise}, \end{cases} \quad (4)$$

where $S_{k_2}^{-}(i)$ denotes the index set of the $k_2$ nearest inter-class neighbors of $\mathbf{x}_i$. The coding coefficients term is then defined as:

$$G_c(\mathbf{X}) = \mathrm{Tr}(\mathbf{X}^T \mathbf{L}_c \mathbf{X}) - \mathrm{Tr}(\mathbf{X}^T \mathbf{L}_c^p \mathbf{X}), \quad (5)$$

where $\mathbf{L}_c = \mathbf{B}_c - \mathbf{W}_c$ with the diagonal matrix $\{\mathbf{B}_c\}_{ii} = \sum_{j\neq i}\{\mathbf{W}_c\}_{ij}$, and $\mathbf{L}_c^p = \mathbf{B}_c^p - \mathbf{W}_c^p$ with $\{\mathbf{B}_c^p\}_{ii} = \sum_{j\neq i}\{\mathbf{W}_c^p\}_{ij}$.

Imposing the graph-based discrimination makes the coding coefficients more discriminative. Interestingly, most existing discriminative coding coefficient terms, such as those in [15, 26], are special cases of this graph-based discrimination constraint.
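For concreteness, a minimal numpy sketch of Eqs. (3)-(5): build the k-NN intrinsic/penalty similarity matrices from the coefficients, form the graph Laplacians, and evaluate $G_c$. The Euclidean neighbor metric and the columns-as-samples layout are assumptions of this sketch.

```python
import numpy as np

# Sketch of Eqs. (3)-(5). Coefficients are stored column-wise, X: (dim, N);
# Euclidean distance is used as the neighbour metric (cf. Section 6.1).
def knn_graph(X, labels, k, same_class):
    N = X.shape[1]
    d = ((X[:, :, None] - X[:, None, :]) ** 2).sum(axis=0)  # pairwise sq. dist.
    W = np.zeros((N, N))
    for i in range(N):
        mask = (labels == labels[i]) if same_class else (labels != labels[i])
        mask[i] = False
        cand = np.where(mask)[0]
        nearest = cand[np.argsort(d[i, cand])[:k]]
        W[i, nearest] = W[nearest, i] = 1.0     # symmetrised "or" rule of Eqs. (3)/(4)
    return W

def laplacian(W):
    return np.diag(W.sum(axis=1)) - W           # L = B - W

def G_c(X, W_c, W_cp):
    # Eq. (5) in the columns-as-samples layout: intrinsic minus penalty scatter
    return np.trace(X @ laplacian(W_c) @ X.T) - np.trace(X @ laplacian(W_cp) @ X.T)
```

With compact classes and well-separated inter-class pairs, the intrinsic scatter is small and the penalty scatter large, so $G_c$ becomes strongly negative, which is what the minimization in Eq. (1) favors.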

3.3 Graph-based Non-negative Projection Term

This term aims to learn a non-negative projection that maps the training samples to a discriminative space. Inspired by [13], we design a structured projection matrix by dividing the projection matrix $\mathbf{P}$ into two parts:

$$\mathbf{Y}^{proj} = \begin{bmatrix} \bar{\mathbf{Y}}^{proj} \\ \tilde{\mathbf{Y}}^{proj} \end{bmatrix} = \mathbf{P}\mathbf{Y} = \begin{bmatrix} \bar{\mathbf{P}} \\ \tilde{\mathbf{P}} \end{bmatrix}\mathbf{Y}, \quad (6)$$

¹Stands for the reconstruction with atoms from the same class.


where $\bar{\mathbf{Y}}^{proj} = \{\bar{\mathbf{y}}_1^{proj}, \cdots, \bar{\mathbf{y}}_N^{proj}\} \in \mathbb{R}^{q\times N}$ serves the purpose of graph embedding, and $\tilde{\mathbf{Y}}^{proj} = \{\tilde{\mathbf{y}}_1^{proj}, \cdots, \tilde{\mathbf{y}}_N^{proj}\} \in \mathbb{R}^{(s_p-q)\times N}$ contains the additional information for data reconstruction ($\tilde{\mathbf{Y}}^{proj}$ is a relaxed matrix that compensates for the information loss in $\bar{\mathbf{Y}}^{proj}$). Note that $\bar{\mathbf{Y}}^{proj}$ preserves the discriminative graph properties, while the whole $\mathbf{Y}^{proj}$ is used for data reconstruction. Therefore the purposes of data reconstruction and graph embedding coexist harmoniously and do not mutually compromise, as in conventional formulations with multiple objectives. The basis matrix $\mathbf{M}$ is correspondingly divided into two parts, $\mathbf{M} = \{\bar{\mathbf{M}}, \tilde{\mathbf{M}}\}$, with $\bar{\mathbf{M}} \in \mathbb{R}^{s\times q}$ and $\tilde{\mathbf{M}} \in \mathbb{R}^{s\times (s_p-q)}$. $\tilde{\mathbf{M}}$ can be considered the complementary space of $\bar{\mathbf{M}}$.

We first define $\bar{\mathbf{y}}_j^{proj}$ as the $j$th column vector of $\bar{\mathbf{Y}}^{proj}$, and then construct the intrinsic graph and the penalty graph using the same procedure as for the graph-based coding coefficients term. The construction of the similarity matrices $\mathbf{W}_p$ and $\mathbf{W}_p^p$ for the intrinsic and penalty graphs is identical to that of $\mathbf{W}_c$ and $\mathbf{W}_c^p$, except that $\mathbf{W}_p$ and $\mathbf{W}_p^p$ measure the similarities among features (with different parameters), whereas $\mathbf{W}_c$ and $\mathbf{W}_c^p$ measure similarities among coding coefficients. As [23] suggests, we have two objectives to preserve the graph properties and enhance discrimination:

$$\begin{cases} \max_{\bar{\mathbf{Y}}^{proj}} \sum_{i\neq j} \|\bar{\mathbf{y}}_i^{proj} - \bar{\mathbf{y}}_j^{proj}\|_2^2 \{\mathbf{W}_p^p\}_{ij} \\ \min_{\bar{\mathbf{Y}}^{proj}} \sum_{i\neq j} \|\bar{\mathbf{y}}_i^{proj} - \bar{\mathbf{y}}_j^{proj}\|_2^2 \{\mathbf{W}_p\}_{ij} \end{cases} \quad (7)$$

Since $\sum_{i\neq j} \|\mathbf{y}_i^{proj} - \mathbf{y}_j^{proj}\|_2^2 \{\mathbf{W}_p\}_{ij} = \sum_{i\neq j} \|\bar{\mathbf{y}}_i^{proj} - \bar{\mathbf{y}}_j^{proj}\|_2^2 \{\mathbf{W}_p\}_{ij} + \sum_{i\neq j} \|\tilde{\mathbf{y}}_i^{proj} - \tilde{\mathbf{y}}_j^{proj}\|_2^2 \{\mathbf{W}_p\}_{ij}$, for a specific $\mathbf{Y}^{proj}$, minimizing the objective function with respect to $\tilde{\mathbf{Y}}^{proj}$ is equivalent to maximizing the objective function associated with the complementary part, i.e., $\bar{\mathbf{Y}}^{proj}$. Thus we constrain the projection matrix with the following equivalent objective:

$$\begin{cases} \min_{\bar{\mathbf{P}}} \mathrm{Tr}(\bar{\mathbf{P}}\mathbf{Y}\mathbf{L}_p\mathbf{Y}^T\bar{\mathbf{P}}^T) \\ \min_{\tilde{\mathbf{P}}} \mathrm{Tr}(\tilde{\mathbf{P}}\mathbf{Y}\mathbf{L}_p^p\mathbf{Y}^T\tilde{\mathbf{P}}^T) \end{cases} \quad (8)$$

where $\mathbf{L}_p = \mathbf{B}_p - \mathbf{W}_p$ with $\{\mathbf{B}_p\}_{ii} = \sum_{j\neq i}\{\mathbf{W}_p\}_{ij}$, and $\mathbf{L}_p^p = \mathbf{B}_p^p - \mathbf{W}_p^p$ with $\{\mathbf{B}_p^p\}_{ii} = \sum_{j\neq i}\{\mathbf{W}_p^p\}_{ij}$. We formulate the graph-based projection term as follows:

$$G_p(\mathbf{P},\mathbf{M}) = \|\mathbf{Y} - \mathbf{M}\mathbf{P}\mathbf{Y}\|_F^2 + \beta\,\mathrm{Tr}(\bar{\mathbf{P}}\mathbf{Y}\mathbf{L}_p\mathbf{Y}^T\bar{\mathbf{P}}^T) + \beta\,\mathrm{Tr}(\tilde{\mathbf{P}}\mathbf{Y}\mathbf{L}_p^p\mathbf{Y}^T\tilde{\mathbf{P}}^T) + \|\mathbf{M} - \mathbf{P}^T\|_F^2, \quad (9)$$

where $\beta$ is a scaling constant and each column of $\mathbf{M}$ is normalized to unit $l_2$ norm. We use $\|\mathbf{Y}-\mathbf{M}\mathbf{P}\mathbf{Y}\|_F^2 + \|\mathbf{M}-\mathbf{P}^T\|_F^2$ to incorporate the projection matrix into an NMF-like framework; these two terms further ensure the reconstruction ability of the projection $\mathbf{P}$ and avoid trivial solutions for $\mathbf{P}$. They serve a similar role to the auto-encoder style reconstruction penalty terms in [18, 30, 31]. The other terms in $G_p(\mathbf{P},\mathbf{M})$ preserve the graph properties and enhance discrimination.
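As a shape-level sketch, the projection term can be evaluated as below, taking $\bar{\mathbf{P}}$ as the first $q$ rows of $\mathbf{P}$ and $\tilde{\mathbf{P}}$ as the remainder (this row split is an assumption following Eq. (6); the Laplacians are supplied precomputed).

```python
import numpy as np

# Sketch of the graph-based projection term G_p in Eq. (9). P is split row-wise
# into P_bar (first q rows, the graph-embedding part) and P_tilde (the
# complementary part); L_p / L_pp are the intrinsic / penalty graph Laplacians.
def G_p(Y, M, P, q, L_p, L_pp, beta):
    P_bar, P_tilde = P[:q], P[q:]
    nmf_fit = np.linalg.norm(Y - M @ P @ Y, 'fro') ** 2      # PSR reconstruction
    intra = np.trace(P_bar @ Y @ L_p @ Y.T @ P_bar.T)        # intra-class compactness
    penalty = np.trace(P_tilde @ Y @ L_pp @ Y.T @ P_tilde.T)
    tie = np.linalg.norm(M - P.T, 'fro') ** 2                # avoids trivial P
    return nmf_fit + beta * (intra + penalty) + tie
```

Since graph Laplacians are positive semi-definite, every term is non-negative, so $G_p$ acts purely as a penalty inside Eq. (1).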

4 Optimization

We adopt a standard iterative learning scheme to jointly learn $\mathbf{X}$, $\mathbf{P}$, $\mathbf{M}$ and $\mathbf{D}$. The proposed algorithm is shown in Algorithm 1. Due to limited space, we present the detailed optimization framework in Appendix A and provide its convergence analysis in Appendix B. All appendices are included in the supplementary material, which can be found at http://wyliu.com/.


Algorithm 1 Training Procedure of JNPDL

Input: Training samples $\mathbf{Y} = \{\mathbf{Y}_1, \cdots, \mathbf{Y}_K\}$, intrinsic graphs $\mathbf{W}_c, \mathbf{W}_p$, penalty graphs $\mathbf{W}_c^p, \mathbf{W}_p^p$, parameters $\alpha_1, \alpha_2, \alpha_3, \beta$, iteration number $T$.
Output: Non-negative projection matrix $\mathbf{P}$, dictionary $\mathbf{D}$, coding coefficient matrix $\mathbf{X}$.

Step 1: Initialization
1: $t = 1$.
2: Randomly initialize the columns of $\mathbf{D}^0, \mathbf{M}^0$ with unit $l_2$ norm.
3: Initialize $\mathbf{x}_i,\ 1\le i\le N$, with $((\mathbf{D}^0)^T(\mathbf{D}^0)+\lambda_2\mathbf{I})^{-1}(\mathbf{D}^0)^T \mathbf{y}_i$, where $\mathbf{y}_i$ is the $i$th training sample (regardless of label).

Step 2: Search local optimum
4: while not converged and $t < T$ do
5: &nbsp;&nbsp;Solve $\mathbf{P}^t, \mathbf{M}^t$ iteratively with fixed $\mathbf{D}^{t-1}$ and $\mathbf{X}^{t-1}$ via non-negative projection learning.
6: &nbsp;&nbsp;Solve $\mathbf{X}^t$ with fixed $\mathbf{M}^t$, $\mathbf{D}^{t-1}$ and $\mathbf{P}^t$.
7: &nbsp;&nbsp;Solve $\mathbf{D}^t$ with fixed $\mathbf{M}^t$, $\mathbf{P}^t$ and $\mathbf{X}^t$ via discriminative DL.
8: &nbsp;&nbsp;$t \leftarrow t+1$.
9: end while

Step 3: Output
10: Output $\mathbf{P} = \mathbf{P}^t$, $\mathbf{D} = \mathbf{D}^t$ and $\mathbf{X} = \mathbf{X}^t$.

5 Classification Strategy

Once the projection and the dictionary have been learned, we project the testing image via the learned projection, code the projected sample over the learned dictionary, and obtain its coding coefficients, which are expected to be discriminative. We first project the testing sample $\mathbf{y}$ and then code it over the learned dictionary via

$$\mathbf{x} = \arg\min_{\mathbf{x}} \big\{ \|\mathbf{P}\mathbf{y} - \mathbf{D}\mathbf{x}\|_2^2 + \lambda_1 \|\mathbf{x}\|_1 \big\}, \quad (10)$$

where $\lambda_1$ is a constant. After obtaining the coding coefficients $\mathbf{x}$, we classify the testing sample via

$$\mathrm{label}(\mathbf{y}) = \arg\min_i \big\{ \|\mathbf{P}\mathbf{y} - \mathbf{D}_i\delta_i(\mathbf{x})\|_2^2 + \sigma\|\mathbf{x}-\mathbf{m}_i\|_2^2 \big\}, \quad (11)$$

where $\sigma$ is a weight that balances the two terms, $\delta_i(\cdot)$ is the characteristic function that selects the coefficients associated with the $i$th class, and $\mathbf{m}_i$ is the mean vector of the learned coding coefficient matrix of class $i$, i.e., $\mathbf{X}_i$. Incorporating the term $\|\mathbf{x}-\mathbf{m}_i\|_2^2$ makes the best of the discrimination within the dictionary, because the dictionary is learned to make coding coefficients from the same class similar and those from different classes dissimilar.
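The test-time pipeline of Eqs. (10)-(11) can be sketched as follows. To keep the sketch dependency-free, the $l_1$ coding problem is replaced with a ridge ($l_2$) surrogate, so this is not the exact sparse coder of Eq. (10); `class_atoms` and `class_means` are illustrative names for the sub-dictionary sizes and the vectors $\mathbf{m}_i$.

```python
import numpy as np

# Sketch of Eqs. (10)-(11). The l1 coding of Eq. (10) is approximated here by
# a ridge surrogate; class_atoms[i] = number of atoms in D_i, class_means[i]
# = m_i, the mean coefficient vector of class i.
def classify(y, P, D, class_atoms, class_means, lam=1e-3, sigma=0.05):
    Py = P @ y
    x = np.linalg.solve(D.T @ D + lam * np.eye(D.shape[1]), D.T @ Py)
    scores, start = [], 0
    for i, n_i in enumerate(class_atoms):
        delta = np.zeros_like(x)                       # delta_i(x): keep class-i block
        delta[start:start + n_i] = x[start:start + n_i]
        scores.append(np.linalg.norm(Py - D @ delta) ** 2
                      + sigma * np.linalg.norm(x - class_means[i]) ** 2)
        start += n_i
    return int(np.argmin(scores))
```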

6 Experiments

6.1 Implementation Details

We evaluate JNPDL on both image classification and image set classification. We construct $\mathbf{W}_p$ and $\mathbf{W}_p^p$ with correlation similarity, and set the number of nearest neighbors to $\min(n_c-1, 5)$, where $n_c$ is the number of training samples in class $c$. The number of shortest pairs from different classes is 20 for each class. For $\mathbf{W}_c$ and $\mathbf{W}_c^p$, we first remove the graph-based coding coefficients term from Eq. (1) and solve the optimization; the Euclidean distances among the coding coefficients of the training samples are used as the initial neighbor metric. We set $k_1 = \min(n_c-1, 5)$ and $k_2 = 30$. For all experiments, we fix $\alpha_1 = 1$, $\alpha_2 = 1$ and $\alpha_3 = 0.05$.


Table 1. Recognition accuracy (%) on Extended Yale B and AR.

Method        ExYaleB  AR Face
SVM           88.8     87.1
SRC [22]      91.0     88.5
D-KSVD [29]   75.3     85.4
LC-KSVD [8]   86.8     90.2
JDDRDL [3]    90.1     90.9
FDDL [26]     91.9     92.1
DSRC [28]     89.6     88.2
JNPDL         94.1     94.7

Table 2. Accuracy (%) vs. feature dimension on the AR dataset.

Dimension    100   200   300   500
SRC [22]     84.0  87.3  88.5  89.7
FDDL [26]    85.7  88.5  92.0  92.2
DSRC [28]    84.8  86.9  88.2  89.1
JDDRDL [3]   82.5  87.7  90.9  91.6
JNPDL        88.3  92.4  94.7  95.1

Other parameters for JNPDL are obtained via 5-fold cross-validation to avoid over-fitting. Specifically, we use $\beta = 0.7$, $\lambda_1 = 5\times 10^{-6}$ and $\sigma = 0.05$ for image classification. For image set based face recognition, we set $\beta = 0.8$ and $\lambda_2 = 0.001 \times N/700$, where $N$ is the number of training samples. For all baselines, we use their original settings or carefully implement them following the corresponding papers. All results are averaged over 10 independent runs.

6.2 Application to Image Classification

6.2.1 Face recognition

We evaluate JNPDL on the Extended Yale B (ExYaleB)² and AR Face³ datasets. For both ExYaleB and AR Face, we exactly follow the training/testing set selection in [24]. The dictionary size is set to half the number of training samples. SRC [22] uses all training samples as the dictionary.

Comparison with state-of-the-art approaches. We compare JNPDL with state-of-the-art DL approaches including D-KSVD [29], LC-KSVD [8] and FDDL [26]. DSRC [28] and JDDRDL [3], which share a similar philosophy, are also compared. JNPDL, DSRC and JDDRDL use the original images for training and set the feature dimension after projection to 300; all the other methods use 300-dimensional Eigenface features. SRC and a linear SVM are used as baselines. Results are shown in Table 1. JNPDL achieves promising recognition accuracy on both datasets, with 2.2% and 2.6% improvements over the second-best approaches, respectively.

Accuracy vs. feature dimensionality. We vary the feature dimension after projection to evaluate the performance of JNPDL on the AR dataset. For SRC and FDDL, dimensionality reduction is performed by Eigenface. Table 2 indicates that the jointly learned projection preserves much discriminative information even at low feature dimensions.

Joint projection learning vs. separate projection. The projection and dictionary can also be learned separately; we compare this with joint learning to validate our motivation for JNPDL. We additionally remove projection learning from JNPDL and use Eigenface features with

²The Extended Yale B dataset consists of 2,414 frontal face images of 38 individuals. All images are normalized to 54×48.

³The AR dataset consists of over 4,000 frontal images of 126 individuals. For each individual, 26 pictures were taken in two separate sessions. All images in AR are normalized to 60×43.


Table 3. Accuracy (%) of different projections.

Method                      ExYaleB  AR Face
JNPDL (with Eigenface)      91.8     92.2
JNPDL (separate learning)   92.1     92.5
JNPDL (joint learning)      94.1     94.7

Table 4. Recognition accuracy (%) on the LFWa dataset.

Method     Acc.   Method       Acc.
SVM        63.0   COPAR [10]   72.6
SRC        72.7   FDDL         74.8
D-KSVD     65.9   LC-KSVD      66.0
LDL [25]   77.2   JNPDL        78.1

dimension set to 300. Results are shown in Table 3: jointly learning the projection and dictionary achieves the best accuracy.

Face recognition in the wild. We apply JNPDL to a more challenging face recognition task on the LFWa dataset [21], an aligned version of LFW. We use the 143 subjects with no fewer than 11 samples per subject in LFWa (4,174 images in total). The first 10 samples of each subject are selected for training and the rest for testing. Following [25], histograms of uniform LBP are extracted by partitioning each face into 10×8 patches, and PCA is used to reduce the dimension to 1,000. Results are shown in Table 4, with JNPDL achieving the best accuracy.

6.2.2 Object categorization

We perform the object categorization experiment on the 17 Category Oxford Flower dataset and use the default experimental setup as in [26]. We compare JNPDL with MTJSRC [27], COPAR, JDDRDL, DSRC, FDDL, SDL, LLC, and two baselines: SRC and SVM. For fair comparison, we use the Frequent Local Histogram (FLH) feature to generate a kernel-based feature descriptor, the same as [26]. Table 5 shows that JNPDL is slightly worse than FDDL but better than most competitive approaches. We believe this is because the kernel features are already quite discriminative, so the projection does not help much.

6.3 Application to Image Set Classification

6.3.1 Classification strategy for image set classification

Applying the classification strategy in Section 5 to each video frame, together with a voting strategy, JNPDL can be easily extended to image set classification. Given a testing video $\mathbf{Y}^{te} = \{\mathbf{y}_1^{te}, \mathbf{y}_2^{te}, \cdots, \mathbf{y}_{K_t}^{te}\}$, in which $\mathbf{y}_j^{te}$ is the $j$th frame and $K_t$ is the number of frames in the video, we project each frame via the learned non-negative projection $\mathbf{P}$ and obtain its coding coefficients with Eq. (10); the label of each frame can then be obtained by Eq. (11). After obtaining the labels of all frames, we perform majority voting to decide the label of the image set. For testing efficiency, we replace the $l_1$ norm $\|\mathbf{x}\|_1$ with an $l_2$ norm $\|\mathbf{x}\|_2^2$ and derive the decision:

$$\mathrm{label}(\mathbf{y}_j^{te}) = \arg\min_i \big\{ \|\mathbf{P}\mathbf{y}_j^{te} - \mathbf{D}_i\delta_i(\mathbf{D}^{\dagger}\mathbf{P}\mathbf{y}_j^{te})\|_2^2 \big\}, \quad (12)$$

where $\mathbf{D}^{\dagger} = (\mathbf{D}^T\mathbf{D} + \lambda_2\mathbf{I})^{-1}\mathbf{D}^T$. Eventually, majority voting decides the label of the video (image set).
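Under the same ridge coder, Eq. (12) plus majority voting can be sketched as below (names and shapes illustrative; the projection is applied to each frame before coding, consistent with Eq. (10)).

```python
import numpy as np
from collections import Counter

# Sketch of Eq. (12) with majority voting. D_dag is the ridge pseudo-coder
# (D^T D + lambda_2 I)^{-1} D^T; each projected frame is coded, labelled by
# class-wise reconstruction error, and the set takes the majority label.
def classify_set(frames, P, D, class_atoms, lam2=1e-3):
    D_dag = np.linalg.solve(D.T @ D + lam2 * np.eye(D.shape[1]), D.T)
    labels = []
    for y in frames:
        Py = P @ y
        x = D_dag @ Py                          # l2 coding of the frame
        scores, start = [], 0
        for n_i in class_atoms:
            delta = np.zeros_like(x)            # delta_i(x): keep class-i block
            delta[start:start + n_i] = x[start:start + n_i]
            scores.append(np.linalg.norm(Py - D @ delta) ** 2)
            start += n_i
        labels.append(int(np.argmin(scores)))
    return Counter(labels).most_common(1)[0][0]
```

Precomputing $\mathbf{D}^{\dagger}$ once per set is what makes the $l_2$ surrogate cheap: per-frame coding reduces to a single matrix-vector product.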


Table 5. Recognition accuracy (%) on the 17 Category Oxford Flower dataset.

Method           Accuracy   Method    Accuracy
SVM              88.6       MTJSRC    88.4
SRC              88.4       COPAR     88.6
LLC (20 bases)   89.7       SDL       91.0
JDDRDL           87.7       FDDL      91.7
DSRC             88.9       JNPDL     92.1

Table 6. Recognition accuracy (%) on the Honda, MoBo and YTC datasets.

Method      Honda  MoBo  YTC    Method       Honda  MoBo  YTC
DCC [9]     94.9   88.1  64.8   SANP [6]     93.6   96.1  68.3
MMD [20]    94.9   91.7  66.7   LMKML [14]   98.5   96.3  78.2
MDA [19]    97.4   94.4  68.1   SFDL [15]    100    96.7  76.7
CHISD [2]   92.5   95.8  67.4   JNPDL        100    97.1  77.4

6.3.2 Image set based face recognition

Three video face recognition benchmark datasets, Honda/UCSD [12]⁴, CMU MoBo [4]⁵ and YouTube Celebrities (YTC)⁶, are used to evaluate the proposed JNPDL. For fair comparison, we follow the experimental setup in [15]. We use the Viola-Jones face detector to capture faces and then resize them to 30×30 intensity images; each image frame is cropped to 30×30 according to the provided eye coordinates. Each video is thus represented as an image set. Following the standard experimental protocol in [2, 15], the detected face images are histogram equalized with no further preprocessing, and the image features are raw pixel values.

Comparison with state-of-the-art approaches. For both the Honda/UCSD and CMU MoBo datasets, we randomly select one face video per person for training and use the rest for testing. For the YTC dataset, we equally divide the whole dataset into five folds, each containing 9 videos per person. In each fold, we randomly select 3 face videos per person for training and use the rest for testing. We compare JNPDL with DCC [9], MMD [20], MDA [19], CHISD [2], SANP [6], LMKML [14] and SFDL [15]; the settings of these approaches are basically the same as in [15]. We report the best accuracy that JNPDL achieves with projected dimensions of 50, 100, 150, 200 and 300. The results in Table 6 show the superiority of the proposed method.

Accuracy vs. frame number. We evaluate the performance of JNPDL on the YTC dataset when videos contain different numbers of image frames. We randomly select 50, 100 and 200 frames from each face image set for training, and another 50, 100 and 200 frames for testing. If an image set does not have enough frames, all frames are used. We report the best accuracy JNPDL achieves with the projected dimension chosen from {50, 100, 150, 200, 300}. Fig. 2 shows that JNPDL achieves better accuracy than the other approaches, indicating that the discriminative power of JNPDL remains strong even with a small training set.

Efficiency. The computational time for JNPDL to recognize a query image set is approximately 5 seconds on a 3.4GHz dual-core CPU with 16GB RAM, which is comparable to [2, 15] and slightly higher than [9, 19, 20].

4 Contains 59 face videos of 20 individuals with large pose and expression variations, with an average length of 400 frames.

5 Consists of 96 videos from 24 subjects, each subject having 5 videos of different walking patterns.

6 Contains 1910 video sequences of 47 celebrities from YouTube. Most videos contain noisy and low-resolution image frames.


[Figure 2: bar chart on YTC comparing DCC, MMD, MDA, CHISD, SANP, LMKML, SFDL and JNPDL; vertical axis: recognition accuracy (0.5 to 0.75); three bars per method, for 200, 100 and 50 frames.]

Figure 2. Recognition accuracy with different number of frames.

7 Concluding Remarks

In this paper, we proposed a novel joint non-negative projection and dictionary learning framework in which a non-negative feature projection and a dictionary are simultaneously learned with discriminative graph constraints. The graph constraints guarantee the discrimination of both the projected training samples and the coding coefficients. We also proposed a multiplicative non-negative updating algorithm for the projection learning with a convergence guarantee. The learned feature projection considerably improves the quality of the learned dictionary, leading to better classification performance. Experimental results have validated the excellent performance of JNPDL on both image classification and image set classification.

Possible future work includes handling nonlinear cases using methods such as the kernel trick or other non-linear mappings, adding more discriminative regularizations to the learned projection matrix, and learning a multi-layered (hierarchical) projection.

8 Acknowledgement

This work is partially supported by the National Natural Science Foundation for Young Scientists of China (Grant no. 61402289) and the National Science Foundation of Guangdong Province (Grant no. 2014A030313558).

References

[1] Alexey Castrodad and Guillermo Sapiro. Sparse modeling of human actions from motion imagery. IJCV, 100(1):1–15, 2012.

[2] Hakan Cevikalp and Bill Triggs. Face recognition based on image sets. In CVPR, 2010.

[3] Zhizhao Feng, Meng Yang, Lei Zhang, Yan Liu, and David Zhang. Joint discriminative dimensionality reduction and dictionary learning for face recognition. Pattern Recognition, 46(8):2134–2143, 2013.

[4] Ralph Gross and Jianbo Shi. The CMU Motion of Body (MoBo) database. 2001.

[5] Shuhang Gu, Lei Zhang, Wangmeng Zuo, and Xiangchu Feng. Projective dictionary pair learning for pattern classification. In NIPS, 2014.

[6] Yiqun Hu, Ajmal S Mian, and Robyn Owens. Sparse approximated nearest points for image set classification. In CVPR, 2011.


[7] Ke Huang and Selin Aviyente. Sparse representation for signal classification. In NIPS, 2006.

[8] Zhuolin Jiang, Zhe Lin, and Larry S Davis. Label consistent K-SVD: learning a discriminative dictionary for recognition. IEEE T. PAMI, 35(11):2651–2664, 2013.

[9] Tae-Kyun Kim, Josef Kittler, and Roberto Cipolla. Discriminative learning and recognition of image set classes using canonical correlations. IEEE T. PAMI, 29(6):1005–1018, 2007.

[10] Shu Kong and Donghui Wang. A dictionary learning approach for classification: Separating the particularity and the commonality. In ECCV, 2012.

[11] Daniel D Lee and H Sebastian Seung. Learning the parts of objects by non-negative matrix factorization. Nature, 401(6755):788–791, 1999.

[12] Kuang-Chih Lee, Jeffrey Ho, Ming-Hsuan Yang, and David Kriegman. Video-based face recognition using probabilistic appearance manifolds. In CVPR, 2003.

[13] Xiaobai Liu, Shuicheng Yan, and Hai Jin. Projective nonnegative graph embedding. IEEE T. IP, 19(5):1126–1137, 2010.

[14] Jiwen Lu, Gang Wang, and Pierre Moulin. Image set classification using holistic multiple order statistics features and localized multi-kernel metric learning. In ICCV, 2013.

[15] Jiwen Lu, Gang Wang, Weihong Deng, and Pierre Moulin. Simultaneous feature and dictionary learning for image set based face recognition. In ECCV, 2014.

[16] Qiang Qiu, Vishal Patel, and Rama Chellappa. Information-theoretic dictionary learning for image classification. IEEE T. PAMI, 36(11):2173–2184, 2014.

[17] Ignacio Ramirez, Pablo Sprechmann, and Guillermo Sapiro. Classification and clustering via dictionary learning with structured incoherence and shared features. In CVPR, 2010.

[18] Anran Wang, Jiwen Lu, Gang Wang, Jianfei Cai, and Tat-Jen Cham. Multi-modal unsupervised feature learning for RGB-D scene labeling. In ECCV, 2014.

[19] Ruiping Wang and Xilin Chen. Manifold discriminant analysis. In CVPR, 2009.

[20] Ruiping Wang, Shiguang Shan, Xilin Chen, and Wen Gao. Manifold-manifold distance with application to face recognition based on image set. In CVPR, 2008.

[21] Lior Wolf, Tal Hassner, and Yaniv Taigman. Effective unconstrained face recognition by combining multiple descriptors and learned background statistics. IEEE T. PAMI, 33(10):1978–1990, 2011.

[22] John Wright, Allen Y Yang, Arvind Ganesh, Shankar S Sastry, and Yi Ma. Robust face recognition via sparse representation. IEEE T. PAMI, 31(2):210–227, 2009.

[23] Shuicheng Yan, Dong Xu, Benyu Zhang, Hong-Jiang Zhang, Qiang Yang, and Stephen Lin. Graph embedding and extensions: a general framework for dimensionality reduction. IEEE T. PAMI, 29(1):40–51, 2007.

[24] Meng Yang, David Zhang, and Xiangchu Feng. Fisher discrimination dictionary learning for sparse representation. In ICCV, 2011.

[25] Meng Yang, Dengxin Dai, Lilin Shen, and Luc Van Gool. Latent dictionary learning for sparse representation based classification. In CVPR, 2014.


[26] Meng Yang, Lei Zhang, Xiangchu Feng, and David Zhang. Sparse representation based Fisher discrimination dictionary learning for image classification. IJCV, 109(3):209–232, 2014.

[27] Xiao-Tong Yuan, Xiaobai Liu, and Shuicheng Yan. Visual classification with multitask joint sparse representation. IEEE T. IP, 21(10):4349–4360, 2012.

[28] Haichao Zhang, Yanning Zhang, and Thomas S Huang. Simultaneous discriminative projection and dictionary learning for sparse representation based classification. Pattern Recognition, 46(1):346–354, 2013.

[29] Qiang Zhang and Baoxin Li. Discriminative K-SVD for dictionary learning in face recognition. In CVPR, 2010.

[30] Will Zou, Shenghuo Zhu, Kai Yu, and Andrew Y Ng. Deep learning of invariant features via simulated fixations in video. In NIPS, 2012.

[31] Zhen Zuo, Gang Wang, Bing Shuai, Lifan Zhao, Qingxiong Yang, and Xudong Jiang. Learning discriminative and shareable features for scene classification. In ECCV, 2014.


Supplementary Material: Jointly Learning Non-negative Projection and Dictionary with Discriminative Graph Constraints for Classification

Weiyang Liu1, Zhiding Yu2, Yandong Wen2, Rongmei Lin3, Meng Yang4

1Peking University, 2Carnegie Mellon University, 3Emory University, 4Shenzhen University

Summary

The JNPDL model is motivated by the current drawbacks of dictionary learning approaches, and each constraint is carefully designed (novel discriminative graph constraints are proposed, and all constraints are designed to be easily optimized). Aiming to bridge the gap between features and the dictionary, we believe the proposed idea of learning a feature projection jointly with the dictionary is worth noting. The given training data is usually not naturally discriminative, so the discrimination of the learned dictionary is limited by the training data. Jointly learning the discriminative feature projection and the discriminative dictionary lets each improve the other. The key of JNPDL is to learn a discriminative projection (non-negativity is one way to achieve discrimination) that works well with the dictionary.

In this supplementary material, we give the detailed optimization framework for the JNPDL model in Appendix A, and then provide a detailed proof in Appendix B showing that the convergence of the updating rules for P and M is theoretically guaranteed. In other words, the proposed multiplicative updating algorithm for the non-negative projection P in the paper is proved to be convergent.

Appendix A: Optimization Framework

We adopt a standard iterative learning framework to jointly learn the sparse representation X, the non-negative projection matrix P, the intermediate non-negative basis matrix M and the dictionary D. The proposed algorithm is summarized in Algorithm 1. The non-negative projection learning converges, as we prove in Appendix B.
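As a shape-level sketch of this alternating scheme (the three inner solvers below are hypothetical stubs standing in for the real updates of Eqs. (1), (10) and (12); only the loop structure follows Algorithm 1):

```python
import numpy as np

# Trivial stand-ins for the real sub-solvers (Eqs. (1), (10), (12)).
def update_projection(P, M, D, X, Y):
    return P, M

def update_codes(P, D, Y):
    return np.linalg.lstsq(D, P @ Y, rcond=None)[0]

def update_dictionary(P, X, Y):
    D = (P @ Y) @ np.linalg.pinv(X)
    return D / np.maximum(np.linalg.norm(D, axis=0), 1e-12)  # unit-norm atoms

def train_jnpdl(Y, n_atoms, dim, T=20, tol=1e-4, seed=0):
    """Skeleton of the alternating JNPDL training loop (Algorithm 1)."""
    rng = np.random.default_rng(seed)
    d, _ = Y.shape
    D = rng.random((dim, n_atoms)); D /= np.linalg.norm(D, axis=0)
    M = rng.random((d, dim)); M /= np.linalg.norm(M, axis=0)
    P = rng.random((dim, d))                        # non-negative projection
    X = update_codes(P, D, Y)                       # cheap initialization
    prev = np.inf
    for _ in range(T):
        P, M = update_projection(P, M, D, X, Y)     # Eq. (1)
        X = update_codes(P, D, Y)                   # Eq. (10)
        D = update_dictionary(P, X, Y)              # Eq. (12)
        obj = np.linalg.norm(P @ Y - D @ X) ** 2
        if abs(prev - obj) < tol * max(prev, 1.0):  # simple convergence check
            break
        prev = obj
    return P, D, X

P, D, X = train_jnpdl(np.random.default_rng(1).random((50, 40)), n_atoms=20, dim=10)
```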

Non-negative projection learning

To learn the non-negative projection, we optimize P, M with D, X fixed. The JNPDL model is thus rewritten as

\min_{P \ge 0,\, M \ge 0} \Big\{ \|PY - DX\|_F^2 + \sum_{i=1}^{K} \|PY_i - D_i X_i^i\|_F^2 + \alpha_1 \|Y - MPY\|_F^2 + \alpha_1\beta \, \mathrm{Tr}(\bar{P} Y L_p Y^T \bar{P}^T) + \alpha_1\beta \, \mathrm{Tr}(\underline{P} Y L_p^p Y^T \underline{P}^T) + \alpha_1 \|M - P^T\|_F^2 \Big\}   (1)

(here \bar{P} and \underline{P} denote the row blocks of P associated with the intrinsic graph Laplacian L_p and the penalty graph Laplacian L_p^p, respectively), which is essentially a projective non-negative matrix factorization problem [6, 2]. We use the multiplicative iterative solution [6, 2, 4] to solve Eq. (1). Specifically, we transform it into tractable sub-problems and optimize M and P by a multiplicative non-negative iterative procedure.

Because M is the basis matrix, it is necessary to require each column m_i to have unit l2 norm, i.e., \|m_i\| = 1. This extra constraint makes the optimization more complicated, so we compensate the norms of the basis matrix into the coefficient matrix as in [4] and replace \alpha_1\beta \, \mathrm{Tr}(\bar{P} Y L_p Y^T \bar{P}^T) + \alpha_1\beta \, \mathrm{Tr}(\underline{P} Y L_p^p Y^T \underline{P}^T) with

\alpha_1\beta \cdot \Big( \mathrm{Tr}(\bar{Q} \bar{P} Y L_p Y^T \bar{P}^T \bar{Q}^T) + \mathrm{Tr}(\underline{Q} \underline{P} Y L_p^p Y^T \underline{P}^T \underline{Q}^T) \Big)   (2)

where \bar{Q} = \mathrm{diag}\{\|m_1\|, \cdots, \|m_q\|\} and \underline{Q} = \mathrm{diag}\{\|m_{q+1}\|, \cdots, \|m_s\|\}.


On optimizing M with P, D, X fixed. We can further rewrite Eq. (2) as \mathrm{Tr}(M G_m M^T), where

G_m = G_m^+ - G_m^- = \begin{bmatrix} \bar{P} Y (\alpha_1\beta B_p) Y^T \bar{P}^T & 0 \\ 0 & \underline{P} Y (\alpha_1\beta B_p^p) Y^T \underline{P}^T \end{bmatrix} \odot I - \begin{bmatrix} \bar{P} Y (\alpha_1\beta W_p) Y^T \bar{P}^T & 0 \\ 0 & \underline{P} Y (\alpha_1\beta W_p^p) Y^T \underline{P}^T \end{bmatrix} \odot I   (3)

where \odot denotes element-wise matrix multiplication and I is an identity matrix. We then incorporate the non-negative constraint on M into the objective function by defining \psi_{ij} as the Lagrange multiplier for M_{ij} \ge 0. With \Psi = [\psi_{ij}], the Lagrangian \mathcal{L}(M) is defined as

\mathcal{L}(M) = \|PY - DX\|_F^2 + \sum_i \|PY_i - D_i X_i^i\|_F^2 + \alpha_1 \|Y - MPY\|_F^2 + \mathrm{Tr}(M G_m M^T) + \alpha_1 \|M - P^T\|_F^2 + \mathrm{Tr}(\Psi M^T).   (4)

Thus the partial derivative of \mathcal{L} with respect to M is

\frac{\partial \mathcal{L}}{\partial M} = -2\alpha_1 Y Y^T P^T + 2\alpha_1 M P Y Y^T P^T + 2 M G_m + 2\alpha_1 M - 2\alpha_1 P^T + \Psi.   (5)

According to the Karush-Kuhn-Tucker (KKT) condition (\psi_{ij} M_{ij} = 0) and \frac{\partial \mathcal{L}}{\partial M} = 0, we obtain the update rule

M_{ij}^{(t+1)} = M_{ij}^{(t)} \, \frac{(\alpha_1 Y Y^T P^T + M G_m^- + \alpha_1 P^T)_{ij}}{(\alpha_1 M P Y Y^T P^T + M G_m^+ + \alpha_1 M)_{ij}}.   (6)
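A NumPy sketch of one step of Eq. (6), with assumed shapes (Y is d×n, P is q×d, M is d×q); the graph matrices G_m^± from Eq. (3) default to zero here, which drops the graph term purely for brevity:

```python
import numpy as np

def update_M(M, P, Y, Gm_plus=None, Gm_minus=None, alpha1=1.0, eps=1e-12):
    """One multiplicative step of Eq. (6); non-negativity of M is preserved
    because every factor in the ratio is non-negative."""
    q = M.shape[1]
    Gp = np.zeros((q, q)) if Gm_plus is None else Gm_plus
    Gn = np.zeros((q, q)) if Gm_minus is None else Gm_minus
    PYYt_Pt = P @ Y @ Y.T @ P.T                       # P Y Y^T P^T
    numer = alpha1 * (Y @ Y.T @ P.T) + M @ Gn + alpha1 * P.T
    denom = alpha1 * (M @ PYYt_Pt) + M @ Gp + alpha1 * M + eps
    return M * numer / denom                          # element-wise update

def objective(M, P, Y, alpha1=1.0):                   # Eq. (18) with G_m = 0
    return (alpha1 * np.linalg.norm(Y - M @ P @ Y) ** 2
            + alpha1 * np.linalg.norm(M - P.T) ** 2)

rng = np.random.default_rng(0)
Y, P = rng.random((8, 20)), rng.random((5, 8))
M = rng.random((8, 5))
f0 = objective(M, P, Y)
for _ in range(50):
    M = update_M(M, P, Y)
f1 = objective(M, P, Y)          # non-increasing, per the proof in Appendix B
```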

On optimizing P with M, D, X fixed. After updating M, we normalize the columns of M and transfer the norms to the projection matrix P, namely P_i \leftarrow P_i \times \|m_i\|_2, \; m_i \leftarrow m_i / \|m_i\|_2, \forall i. With the normalized M, Eq. (2) simplifies back to \alpha_1\beta \, \mathrm{Tr}(\bar{P} Y L_p Y^T \bar{P}^T) + \alpha_1\beta \, \mathrm{Tr}(\underline{P} Y L_p^p Y^T \underline{P}^T). Thus the Lagrangian \mathcal{L}(P) is

\mathcal{L}(P) = \|PY - DX\|_F^2 + \sum_i \|PY_i - D_i X_i^i\|_F^2 + \alpha_1\beta \, \mathrm{Tr}(\bar{P} Y L_p Y^T \bar{P}^T) + \alpha_1\beta \, \mathrm{Tr}(\underline{P} Y L_p^p Y^T \underline{P}^T) + \alpha_1 \|Y - MPY\|_F^2 + \alpha_1 \|M - P^T\|_F^2 + \mathrm{Tr}(\Phi P^T)   (7)

where \phi_{ij} is the Lagrange multiplier for the constraint P_{ij} \ge 0 and \Phi = [\phi_{ij}]. Setting \frac{\partial \mathcal{L}}{\partial P} = 0 and applying the KKT condition (\phi_{ij} P_{ij} = 0), we obtain the update rule for P:

P_{ij}^{(t+1)} = P_{ij}^{(t)} \, \frac{\Big( DXY^T + \sum_i D_i X_i^i Y_i^T + \alpha_1 M^T Y Y^T + \alpha_1 M^T + \alpha_1\beta \begin{bmatrix} P^t Y W_p Y^T \\ P^t Y W_p^p Y^T \end{bmatrix} \Big)_{ij}}{\Big( P^t Y Y^T + \sum_i P^t Y_i Y_i^T + \alpha_1 P^t + \alpha_1 M^T M P^t Y Y^T + \alpha_1\beta \begin{bmatrix} P^t Y B_p Y^T \\ P^t Y B_p^p Y^T \end{bmatrix} \Big)_{ij}}.   (8)

Both Eq. (6) and Eq. (8) are non-negative multiplicative updates. The convergence of the updating rules for P and M is guaranteed; the detailed proof is given in Appendix B.

Discriminative dictionary learning

The discriminative dictionary learning is optimized using a standard iterative optimization procedure that is widely adopted in sparse coding.
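The l1-regularized sub-problems below are solved exactly with the feature-sign search algorithm [1]. As a simple illustrative stand-in (not the algorithm used in the paper), a plain ISTA proximal-gradient iteration for the generic problem min_x \|b - Ax\|_2^2 + \lambda\|x\|_1 looks like:

```python
import numpy as np

def soft_threshold(v, tau):
    """Proximal operator of tau * ||.||_1 (element-wise shrinkage)."""
    return np.sign(v) * np.maximum(np.abs(v) - tau, 0.0)

def ista(A, b, lam, n_iter=500):
    """Minimize ||b - A x||_2^2 + lam * ||x||_1 by proximal gradient descent."""
    L = 2.0 * np.linalg.norm(A, 2) ** 2       # Lipschitz constant of the gradient
    x = np.zeros(A.shape[1])
    for _ in range(n_iter):
        grad = 2.0 * A.T @ (A @ x - b)
        x = soft_threshold(x - grad / L, lam / L)
    return x

rng = np.random.default_rng(1)
A = rng.standard_normal((30, 50))
x_true = np.zeros(50); x_true[:4] = [1.5, -2.0, 1.0, 0.8]
b = A @ x_true
x_hat = ista(A, b, lam=0.05)                  # sparse approximate solution
```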

On optimizing X with D, P, M fixed. With D, P fixed, the optimization of the JNPDL model becomes

\min_X \Big\{ \|PY - DX\|_F^2 + \sum_{i=1}^{K} \|PY_i - D_i X_i^i\|_F^2 + \alpha_3 \|X\|_1 + \sum_{i=1}^{K} \sum_{j=1, j \neq i}^{K} \|D_j X_i^j\|_F^2 + \alpha_2 \mathrm{Tr}(X^T L' X) \Big\}   (9)


Algorithm 1 Training Procedure of JNPDL

Input: Training samples Y = [Y_1, \cdots, Y_N], intrinsic graphs W_c, W_p, penalty graphs W_c^p, W_p^p, parameters \alpha_1, \alpha_2, \alpha_3, \beta, iteration number T.
Output: Non-negative projection matrix P, dictionary D, coding coefficient matrix X.

Step 1: Initialization
1: t = 1.
2: Randomly initialize the columns of D^0 and M^0 with unit l2 norm.
3: Initialize x_i, 1 \le i \le N, with ((D^0)^T D^0 + \lambda_2 I)^{-1} (D^0)^T y_i, where y_i is the ith training sample (regardless of label).

Step 2: Search local optima
4: while not converged and t < T do
5:   Solve P^t, M^t iteratively with fixed D^{t-1} and X^{t-1} via Eq. (1).
6:   Solve X^t with fixed M^t, D^{t-1} and P^t via Eq. (10).
7:   Solve D^t with fixed M^t, P^t and X^t via Eq. (12).
8:   t \leftarrow t + 1.
9: end while

Step 3: Output
10: Output P = P^t, D = D^t and X = X^t.

where L' = L_c - L_c^p. Eq. (9) can be solved using the feature-sign search algorithm [1] after a certain reformulation based on [7, 3]. We optimize X class by class. Following [1], we update the columns of X_i one by one within the ith class. We define x_{i,j} as the coding coefficients of the jth sample in the ith class and reformulate the problem as

\min_{x_{i,j}} \Big\{ \|Py_{i,j} - D x_{i,j}\|_2^2 + \|Py_{i,j} - D_i x_{i,j}^i\|_2^2 + \sum_{n=1, n \neq i}^{K} \|D_n x_{i,j}^n\|_2^2 + \alpha_2 Q(x_{i,j}) + \alpha_3 \|x_{i,j}\|_1 \Big\}   (10)

where y_{i,j} denotes the jth sample of the ith class and Q(x_{i,j}) = x_{i,j}^T X_i L'_j + (X_i L'_j)^T x_{i,j} - x_{i,j}^T x_{i,j} L'_{jj}, in which L'_j is the jth column of L' and L'_{jj} is the entry in the jth row and jth column of L'. x_{i,j} can then be solved via the feature-sign search algorithm as in [7, 3].

On optimizing D with P, X, M fixed. By fixing P, M and X, the JNPDL model is rewritten as

\min_D \Big\{ \|PY - DX\|_F^2 + \sum_{i=1}^{K} \|PY_i - D_i X_i^i\|_F^2 + \sum_{i=1}^{K} \sum_{j=1, j \neq i}^{K} \|D_j X_i^j\|_F^2 \Big\}   (11)

for which we update D class by class. When we update D_i, the sub-dictionaries D_j, j \neq i, associated with the other classes are fixed. Thus Eq. (11) can be further rewritten as

\min_{D_i,\, i \in \{1, 2, \cdots, K\}} \Big\{ \|PY_i - D_i X_i\|_F^2 + \|PY_i - D_i X_i^i\|_F^2 + \sum_{j=1, j \neq i}^{K} \|D_j X_i^j\|_F^2 \Big\}   (12)

which is essentially a quadratic programming problem and can be directly solved by the algorithm presented in [5] (updating D_i atom by atom). Note that each atom in the dictionary should have unit l2 norm.
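A simplified sketch of the atom-by-atom update for a single data-fidelity term \|Z - DX\|_F^2, with X fixed and unit-norm atoms, in the spirit of the metaface algorithm [5]; the full Eq. (12) stacks additional class-specific residual terms analogously, so this is only an illustrative special case:

```python
import numpy as np

def update_atoms(Z, D, X, eps=1e-12):
    """Update each atom d_k to reduce ||Z - D X||_F^2 with X fixed.
    With ||d_k||_2 = 1 enforced, the minimizer over the unit sphere is the
    normalized direction of E_k x_k, where E_k is the residual without atom k."""
    D = D.copy()
    for k in range(D.shape[1]):
        xk = X[k, :]
        if xk @ xk < eps:
            continue                                 # unused atom: keep as is
        Ek = Z - D @ X + np.outer(D[:, k], xk)       # residual excluding atom k
        dk = Ek @ xk
        D[:, k] = dk / max(np.linalg.norm(dk), eps)  # re-impose unit l2 norm
    return D

rng = np.random.default_rng(2)
Z = rng.standard_normal((20, 40))
D0 = rng.standard_normal((20, 6)); D0 /= np.linalg.norm(D0, axis=0)
X = rng.standard_normal((6, 40))
D1 = update_atoms(Z, D0, X)       # error is non-increasing, atoms stay unit-norm
```

Because each new atom maximizes the inner product with its residual over the unit sphere (which includes the old atom), the fitting error cannot increase at any step.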

Appendix B: Proof of the convergence of the updating rules for P and M

Proof. Before proving the convergence of the updating rules, we first introduce some necessary preliminaries.

Definition 1. A function G(A, A') is an auxiliary function for a function F(A) if the conditions

G(A, A') \ge F(A), \quad G(A, A) = F(A)   (13)

are satisfied.


Lemma 1. If G(A, A') is an auxiliary function for F(A), then F(A) is non-increasing under the update

A^{t+1} = \arg\min_A G(A, A^t)   (14)

where t denotes the tth iteration.

Proof. From Eq. (14), we have

F(A^t) = G(A^t, A^t) \ge G(A^{t+1}, A^t).   (15)

Because G(A, A') is an auxiliary function for F(A), we obtain from Eq. (13) the inequality

G(A^{t+1}, A^t) \ge F(A^{t+1})   (16)

which leads to

F(A^{t+1}) \le F(A^t).   (17)

Thus F(A) is non-increasing under the updating rule in Eq. (14). The lemma is proved.
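Lemma 1 is the standard majorization-minimization argument behind multiplicative updates. Its effect is easy to check numerically with the classical multiplicative NMF updates of Lee and Seung [11], which are derived from exactly this kind of auxiliary function; the Frobenius objective never increases:

```python
import numpy as np

rng = np.random.default_rng(3)
V = rng.random((15, 25))                     # non-negative data matrix
W, H = rng.random((15, 4)), rng.random((4, 25))
eps = 1e-12
errs = []
for _ in range(100):
    # Each step minimizes an auxiliary function G(., .) that majorizes
    # F(W, H) = ||V - W H||_F^2, so F is non-increasing (Lemma 1).
    H *= (W.T @ V) / (W.T @ W @ H + eps)
    W *= (V @ H.T) / (W @ H @ H.T + eps)
    errs.append(np.linalg.norm(V - W @ H))
```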

We first consider the scenario where P is fixed. With P fixed, we rewrite the optimization objective (Eq. (10) in the paper) as

F(M) = \alpha_1 \|Y - MPY\|_F^2 + \mathrm{Tr}(M G_m M^T) + \alpha_1 \|M - P^T\|_F^2.   (18)

We denote by F_{ij} the part of F(M) relevant to M_{ij}, and compute the first-order and second-order derivatives as follows:

F'_{ij}(M) = \alpha_1 (-2 Y Y^T P^T + 2 M P Y Y^T P^T)_{ij} + (2 M G_m)_{ij} + \alpha_1 (2M - 2P^T)_{ij},   (19)

F''_{ij}(M) = 2\alpha_1 (P Y Y^T P^T + I)_{jj} + 2 (G_m)_{jj}   (20)

where I denotes an identity matrix of matching size. We construct the function G(M_{ij}, M_{ij}^t) as

G(M_{ij}, M_{ij}^t) = F_{ij}(M_{ij}^t) + F'_{ij}(M_{ij}^t)(M_{ij} - M_{ij}^t) + \frac{(\alpha_1 M^t P Y Y^T P^T + M^t G_m^+ + \alpha_1 M^t)_{ij}}{M_{ij}^t} (M_{ij} - M_{ij}^t)^2.   (21)

Lemma 2. G(M_{ij}, M_{ij}^t) in Eq. (21) is an auxiliary function for F_{ij}(M).

Proof. Since G(M_{ij}, M_{ij}) = F_{ij}(M_{ij}) holds trivially, we only need to prove that G(M_{ij}, M_{ij}^t) \ge F_{ij}(M_{ij}). We first write the Taylor series expansion of F_{ij}(M):

F_{ij}(M_{ij}) = F_{ij}(M_{ij}^t) + F'_{ij}(M_{ij}^t)(M_{ij} - M_{ij}^t) + \frac{1}{2} F''_{ij}(M_{ij}^t)(M_{ij} - M_{ij}^t)^2.   (22)

Because the following inequalities are satisfied:

(M^t P Y Y^T P^T)_{ij} = \sum_v M^t_{iv} (P Y Y^T P^T)_{vj} \ge M^t_{ij} (P Y Y^T P^T)_{jj},   (23)

(M^t G_m^+)_{ij} = \sum_v M^t_{iv} (G_m^+)_{vj} \ge M^t_{ij} (G_m)_{jj},   (24)

M^t_{ij} \ge M^t_{ij} I_{jj},   (25)

the following relation holds:

\frac{(\alpha_1 M^t P Y Y^T P^T + M^t G_m^+ + \alpha_1 M^t)_{ij}}{M^t_{ij}} \ge \alpha_1 (P Y Y^T P^T + I)_{jj} + (G_m)_{jj}.   (26)

That is, the quadratic coefficient in Eq. (21) dominates \frac{1}{2} F''_{ij}(M^t_{ij}) in Eq. (22), so G(M_{ij}, M^t_{ij}) \ge F_{ij}(M_{ij}). The lemma is proved.


Theorem 1. The updating rule for M in Eq. (6) is obtained by minimizing the auxiliary function G(M_{ij}, M^t_{ij}).

Proof. We set the derivative of G(M_{ij}, M^t_{ij}) with respect to M_{ij} to zero, namely

\frac{\partial G(M_{ij}, M^t_{ij})}{\partial M_{ij}} = 2 \, \frac{(\alpha_1 M^t P Y Y^T P^T + M^t G_m^+ + \alpha_1 M^t)_{ij}}{M^t_{ij}} (M_{ij} - M^t_{ij}) + F'_{ij}(M^t_{ij}) = 0   (27)

from which we derive

M^{t+1}_{ij} = M^t_{ij} \, \frac{(\alpha_1 Y Y^T P^T + M^t G_m^- + \alpha_1 P^T)_{ij}}{(\alpha_1 M^t P Y Y^T P^T + M^t G_m^+ + \alpha_1 M^t)_{ij}},   (28)

which is identical to the updating rule Eq. (6) used in the paper. Thus the theorem is proved.

Now consider the other scenario, where M is fixed. After updating M via Eq. (28), we normalize the column vectors m_i of M and transfer the norms to the projection matrix P, namely

P_i \leftarrow P_i \times \|m_i\|, \quad m_i \leftarrow m_i / \|m_i\|   (29)

where P_i is the ith column vector of the projection matrix P. Considering Eq. (29) and the fixed M, we rewrite the optimization objective (Eq. (10) in the paper) as

F(P) = \|PY - DX\|_F^2 + \sum_i \|PY_i - D_i X_i^i\|_F^2 + \alpha_1\beta \, \mathrm{Tr}(\bar{P} Y L_p Y^T \bar{P}^T) + \alpha_1\beta \, \mathrm{Tr}(\underline{P} Y L_p^p Y^T \underline{P}^T) + \alpha_1 \|Y - MPY\|_F^2 + \alpha_1 \|M - P^T\|_F^2.   (30)

Denoting by F_{ij} the part of F(P) relevant to P_{ij}, we have the following derivatives:

F'_{ij}(P) = 2(PYY^T)_{ij} - 2(DXY^T)_{ij} + 2\Big(\sum_i PY_iY_i^T\Big)_{ij} - 2\sum_i (D_iX_i^iY_i^T)_{ij} - 2\alpha_1 (M^TYY^T)_{ij} + 2\alpha_1 (M^TMPYY^T)_{ij} + 2\alpha_1\beta \begin{bmatrix} P Y L_p Y^T \\ P Y L_p^p Y^T \end{bmatrix}_{ij} + \alpha_1 (2P - 2M^T)_{ij},   (31)

F''_{ij}(P) = 2(YY^T)_{jj} + 2\Big(\sum_i Y_iY_i^T\Big)_{jj} + 2\alpha_1 (M^TM)_{ii} (YY^T)_{jj} + 2\alpha_1\beta \begin{bmatrix} Y L_p Y^T \\ Y L_p^p Y^T \end{bmatrix}_{jj} + 2\alpha_1 I_{jj}.   (32)

The auxiliary function for F_{ij}(P) is designed as

G(P_{ij}, P^t_{ij}) = F_{ij}(P^t_{ij}) + F'_{ij}(P^t_{ij})(P_{ij} - P^t_{ij}) + \frac{\Big( P^tYY^T + \sum_i P^tY_iY_i^T + \alpha_1 P^t + \alpha_1 M^TMP^tYY^T + \alpha_1\beta \begin{bmatrix} P^t Y B_p Y^T \\ P^t Y B_p^p Y^T \end{bmatrix} \Big)_{ij}}{P^t_{ij}} (P_{ij} - P^t_{ij})^2.   (33)

Lemma 3. G(P_{ij}, P^t_{ij}) in Eq. (33) is an auxiliary function for F_{ij}(P).

Proof. Since G(P_{ij}, P_{ij}) = F_{ij}(P_{ij}) obviously holds, we only need to prove that G(P_{ij}, P^t_{ij}) \ge F_{ij}(P_{ij}). We first write the Taylor series expansion of F_{ij}(P):

F_{ij}(P_{ij}) = F_{ij}(P^t_{ij}) + F'_{ij}(P^t_{ij})(P_{ij} - P^t_{ij}) + \frac{1}{2} F''_{ij}(P^t_{ij})(P_{ij} - P^t_{ij})^2.   (34)

The following relations hold:

(P^tYY^T)_{ij} = \sum_v P^t_{iv} (YY^T)_{vj} \ge P^t_{ij} (YY^T)_{jj},   (35)


\Big(\sum_i P^tY_iY_i^T\Big)_{ij} = \sum_v P^t_{iv} \Big(\sum_i Y_iY_i^T\Big)_{vj} \ge P^t_{ij} \Big(\sum_i Y_iY_i^T\Big)_{jj},   (36)

(M^TMP^tYY^T)_{ij} = \sum_v (M^TMP^t)_{iv} (YY^T)_{vj} \ge (M^TMP^t)_{ij} (YY^T)_{jj} = \sum_v (M^TM)_{iv} P^t_{vj} (YY^T)_{jj} \ge P^t_{ij} (M^TM)_{ii} (YY^T)_{jj},   (37)

\begin{bmatrix} P^t Y B_p Y^T \\ P^t Y B_p^p Y^T \end{bmatrix}_{ij} = \begin{cases} \sum_v P^t_{iv} (Y B_p Y^T)_{vj}, & \text{if } j \le q \\ \sum_v P^t_{iv} (Y B_p^p Y^T)_{vj}, & \text{otherwise} \end{cases} \ge \begin{cases} P^t_{ij} (Y B_p Y^T)_{jj}, & \text{if } j \le q \\ P^t_{ij} (Y B_p^p Y^T)_{jj}, & \text{otherwise} \end{cases} \ge \begin{cases} P^t_{ij} (Y L_p Y^T)_{jj}, & \text{if } j \le q \\ P^t_{ij} (Y L_p^p Y^T)_{jj}, & \text{otherwise} \end{cases} = P^t_{ij} \begin{bmatrix} Y L_p Y^T \\ Y L_p^p Y^T \end{bmatrix}_{jj},   (38)

P^t_{ij} \ge P^t_{ij} I_{jj},   (39)

so we have G(P_{ij}, P^t_{ij}) \ge F_{ij}(P_{ij}). Therefore the lemma is proved.

Theorem 2. The updating rule for P in Eq. (8) is obtained by minimizing the auxiliary function G(P_{ij}, P^t_{ij}).

Proof. Setting \partial G(P_{ij}, P^t_{ij}) / \partial P_{ij} = 0, we have

2 \, \frac{\Big( P^tYY^T + \sum_i P^tY_iY_i^T + \alpha_1 P^t + \alpha_1 M^TMP^tYY^T + \alpha_1\beta \begin{bmatrix} P^t Y B_p Y^T \\ P^t Y B_p^p Y^T \end{bmatrix} \Big)_{ij}}{P^t_{ij}} (P_{ij} - P^t_{ij}) + F'_{ij}(P^t_{ij}) = 0   (40)

from which we derive the updating rule for P:

P^{(t+1)}_{ij} = P^{(t)}_{ij} \, \frac{\Big( DXY^T + \sum_i D_iX_i^iY_i^T + \alpha_1 M^TYY^T + \alpha_1 M^T + \alpha_1\beta \begin{bmatrix} P^t Y W_p Y^T \\ P^t Y W_p^p Y^T \end{bmatrix} \Big)_{ij}}{\Big( P^tYY^T + \sum_i P^tY_iY_i^T + \alpha_1 P^t + \alpha_1 M^TMP^tYY^T + \alpha_1\beta \begin{bmatrix} P^t Y B_p Y^T \\ P^t Y B_p^p Y^T \end{bmatrix} \Big)_{ij}}.   (41)

Thus the theorem is proved.

According to Lemma 1, Theorem 1 and Theorem 2, the convergence of the updating rules for P and M is theoretically guaranteed.

References

[1] Honglak Lee, Alexis Battle, Rajat Raina, and Andrew Y Ng. Efficient sparse coding algorithms. In NIPS, 2006.

[2] Xiaobai Liu, Shuicheng Yan, and Hai Jin. Projective nonnegative graph embedding. IEEE T. IP, 19(5):1126–1137, 2010.

[3] Jiwen Lu, Gang Wang, Weihong Deng, and Pierre Moulin. Simultaneous feature and dictionary learning for image set based face recognition. In ECCV, 2014.

[4] Changhu Wang, Zheng Song, Shuicheng Yan, Lei Zhang, and Hong-Jiang Zhang. Multiplicative nonnegative graph embedding. In CVPR, 2009.


[5] Meng Yang, Lei Zhang, Jian Yang, and David Zhang. Metaface learning for sparse representation based face recognition. In ICIP, 2010.

[6] Zhirong Yang and Erkki Oja. Linear and nonlinear projective nonnegative matrix factorization. IEEE T. NN, 21(5):734–749, 2010.

[7] Miao Zheng, Jiajun Bu, Chun Chen, Can Wang, Lijun Zhang, Guang Qiu, and Deng Cai. Graph regularized sparse coding for image representation. IEEE T. IP, 20(5):1327–1336, 2011.
