
ORTHOGONAL SPARSITY PRESERVING PROJECTIONS FOR FEATURE EXTRACTION

CHAO LAN, XIAO-YUAN JING, QIAN LIU, SHI-QIANG GAO, YONG-FANG YAO

College of Automation, Nanjing University of Posts and Telecommunications, Nanjing 210003, China. E-MAIL: [email protected], [email protected]

Abstract: Sparse representation has been extensively studied in the signal processing community, where it has been shown, surprisingly, that one target signal can be accurately represented as a linear combination of very few measurement signals, often called atoms, in a given dictionary. This discovery was soon brought to the field of pattern recognition and has more recently given rise to a newly developed unsupervised feature extraction method named sparsity preserving projections (SPP), which seeks a linear embedding space in which the sparse reconstructive relations among the data in the dictionary are preserved. However, SPP is non-orthogonal and leaves room for improvement. Specifically, by taking into consideration the preservation of certain neat properties of a dictionary, this paper presents orthogonal sparsity preserving projections (OSPP). OSPP iteratively calculates a projective vector that preserves the sparse reconstructive relations as SPP does, while enforcing it to be orthogonal to all previously obtained vectors. Empirical study shows that OSPP has more powerful sparsity preserving ability than SPP and hence is expected to have better classification performance, since sparsity is potentially related to discrimination. Experiments on the public Yale face database validate the effectiveness of OSPP, as compared with several representative unsupervised feature extraction methods.

Keywords: Sparse Representation; Feature Extraction; Orthogonal Projective Vectors; Unsupervised Learning

1. Introduction

Surprisingly enough, recent study of sparse representations in the signal processing community has revealed that a target signal can be accurately represented by a linear combination of very few measurement signals, often called atoms, in a dictionary, and thus sampling can potentially be carried out below the Shannon-Nyquist bound [1]. This favorable discovery has had a great impact on the signal processing community and has soon been applied to many other tasks such as image separation [2], blind source separation [3], image coding [4] and image denoising [5]. Pattern recognition has also benefited from this technique in many aspects. Huang [6] imposed a sparse constraint on the objective function of discriminant analysis for signal classification; Yang [7] examined sparseness for feature selection in face recognition; and Wright exploited sparse representations for robust face recognition in his work [8].

Unsupervised feature extraction methods, such as PCA [9], LPP [10] and others [16-17], which explore the intrinsic structure of the data set, have always been an active research area in pattern recognition. More recently, the sparse representation technique has also advanced the development of this field. Qiao developed in his work [11] an unsupervised feature extraction method called sparsity preserving projections (SPP), which seeks a low-dimensional subspace where the sparse reconstructive relations among the data in the dictionary can be preserved. The author showed that SPP achieves better recognition results than other extraction methods. However, there are some favorable properties of the dictionary that SPP does not consider during training. The dictionary used for calculating the sparse coefficients usually has the nice property of being orthogonal [12] or incoherent [13], which cannot be well kept by SPP projections due to their non-orthogonal nature.

Motivated by the above analysis, we propose in this paper an unsupervised feature extraction method named orthogonal sparsity preserving projections (OSPP), based on SPP. We iteratively calculate each projective vector so that it minimizes the mean reconstructive error of the target samples, as SPP does, while enforcing it to be orthogonal to all previously obtained ones. In this manner, the sparsity can be well preserved, along with other properties of the dictionary. Surprisingly, we found by empirical study that OSPP has more powerful sparsity preserving ability than SPP. Since sparsity is potentially related to discriminating power, OSPP is expected to acquire better classification performance. Experiments on the public Yale face database validate the effectiveness of OSPP, as compared with several other unsupervised feature extraction methods, including PCA, LPP and SPP.


The rest of the paper is organized as follows: in section 2, we briefly review sparsity preserving projections (SPP) and present the motivation of OSPP; in section 3, we introduce orthogonal SPP (OSPP); experiments are given in section 4 and the conclusion is drawn in section 5.

2. Sparsity Preserving Projections, a Brief Review

Let $x_i$ be the $i$-th sample in the dictionary $X = [x_1, x_2, \ldots, x_N]$ of size $N$, and let $\alpha_i \in \mathbb{R}^{N \times 1}$ be the sparse coefficient vector associated with $x_i$. SPP first calculates $\alpha_i$ by the following formula:

$$\hat{\alpha}_i = \arg\min_{\alpha_i} \|\alpha_i\|_1 \quad \text{s.t.} \quad x_i = X\alpha_i, \; e^T \alpha_i = 1, \qquad (1)$$

where $e$ is a vector of all ones. As presented in [8],[11], the hard constraint in formula (1) can be replaced with a soft one that can tolerate a certain degree of noise, i.e.

$$\hat{\alpha}_i = \arg\min_{\alpha_i} \|\alpha_i\|_1 \quad \text{s.t.} \quad \|x_i - X\alpha_i\|_2 < \varepsilon, \; e^T \alpha_i = 1, \qquad (2)$$

where $\varepsilon$ is a small threshold value.
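For concreteness, the constrained problem in formula (2) can be handed to a generic convex solver. The following is a minimal sketch using the cvxpy package (our choice for illustration; the paper does not prescribe a solver), where `X`, `x_i` and `eps` are assumed given, and the strict inequality is relaxed to a non-strict one as numerical solvers require:

```python
import cvxpy as cp
import numpy as np

def sparse_coefficients(X, x_i, eps):
    """Solve formula (2): min ||a||_1  s.t. ||x_i - X a||_2 <= eps, e^T a = 1."""
    N = X.shape[1]
    a = cp.Variable(N)
    constraints = [cp.norm(x_i - X @ a, 2) <= eps,  # soft reconstruction constraint
                   cp.sum(a) == 1]                  # e^T a = 1
    cp.Problem(cp.Minimize(cp.norm1(a)), constraints).solve()
    return a.value
```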

To preserve the sparse reconstructive relations among the data in the projected space, SPP calculates the optimal linear projective vector $v$ by:

$$v = \arg\min_{v} \sum_{i=1}^{N} \|v^T x_i - v^T X \alpha_i\|^2 \quad \text{s.t.} \quad v^T X X^T v = 1, \qquad (3)$$

where the constraint $v^T X X^T v = 1$ can be regarded as guaranteeing a stable solution. As shown in SPP, formula (3) can be written in matrix form as:

$$v = \arg\min_{v} v^T X (I - S - S^T + S^T S) X^T v \quad \text{s.t.} \quad v^T X X^T v = 1, \qquad (4)$$

where $S = [\alpha_1, \alpha_2, \ldots, \alpha_N]$. By employing Lagrange multipliers, the solution $v$ turns out to be the eigenvector of the matrix $(XX^T)^{-1} X (I - S - S^T + S^T S) X^T$ associated with the smallest eigenvalue.
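As an illustrative sketch of this step (not the authors' code), the SPP projections can be computed by solving the generalized eigenproblem with scipy; it assumes $XX^T$ is positive definite, which the PCA preprocessing used later is intended to ensure:

```python
import numpy as np
from scipy.linalg import eigh

def spp_projections(X, S, n_components):
    """X: d x N data matrix; S: N x N sparse coefficient matrix [a_1, ..., a_N]."""
    N = S.shape[0]
    B = np.eye(N) - S - S.T + S.T @ S   # I - S - S^T + S^T S
    A = X @ B @ X.T                     # objective matrix of formula (4)
    G = X @ X.T                         # constraint matrix, assumed positive definite
    # Generalized eigenproblem A v = lambda G v; eigh returns ascending eigenvalues,
    # so the leading columns are the vectors with the smallest eigenvalues.
    _, V = eigh(A, G)
    return V[:, :n_components]
```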

Next we show the motivation for obtaining an orthogonal projective vector set. Suppose $W$ is the overall projective vector set, $X$ is an orthogonal dictionary, i.e. $X^T X = I$, and $Y = W^T X$ is the projected dictionary. If $W$ is orthogonal, then

$$Y^T Y = X^T W W^T X = X^T X = I, \qquad (5)$$

and thus $Y$ is also orthogonal.

On the other hand, if the initial dictionary is incoherent, i.e.,

$$\max_{i \neq j} \langle x_i, x_j \rangle \le d < 1, \quad x_i, x_j \in X, \qquad (6)$$

then we may hope the projected dictionary $Y$ remains incoherent. If $W$ is orthogonal, then

$$\langle y_i, y_j \rangle = (W^T x_i)^T (W^T x_j) = x_i^T W W^T x_j = x_i^T x_j = \langle x_i, x_j \rangle, \qquad (7)$$

where $y_i = W^T x_i$, and thus

$$\max_{i \neq j} \langle y_i, y_j \rangle \le d < 1, \quad y_i, y_j \in Y, \qquad (8)$$

and the projected dictionary remains incoherent.
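The incoherence argument in formulas (7)-(8) reduces to the invariance of inner products under a square orthogonal $W$; a quick illustrative check under the same assumption:

```python
import numpy as np

d, N = 8, 5
W, _ = np.linalg.qr(np.random.randn(d, d))  # square orthogonal W, so W W^T = I
X = np.random.randn(d, N)
Y = W.T @ X
# Formula (7): all pairwise inner products are unchanged by the projection.
assert np.allclose(Y.T @ Y, X.T @ X)
```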

3. Orthogonal Sparsity Preserving Projections

In this section we impose the orthogonality constraint on the expected projective vectors. The sparse coefficient $s_k$ is calculated using a generalized regularized form [6][14] as:

$$s_k = \arg\min_{s} \left( \|x_k - X s\|_2 + \lambda \|s\|_1 \right), \qquad (9)$$

where the regularization scalar $\lambda$ balances the tradeoff between the reconstruction error and the sparseness of the solution.
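Formula (9) is the familiar lasso problem [14]. As an illustrative sketch (not the authors' implementation), each column of $S$ could be computed with scikit-learn's Lasso, leaving each sample out of its own dictionary so that it cannot trivially reconstruct itself; note that sklearn's Lasso uses a squared, scaled reconstruction term, so its `alpha` is only a proxy for $\lambda$ in (9):

```python
import numpy as np
from sklearn.linear_model import Lasso

def sparse_weight_matrix(X, lam):
    """Build the N x N weight matrix S column by column via formula (9); X is d x N."""
    d, N = X.shape
    S = np.zeros((N, N))
    for k in range(N):
        mask = np.arange(N) != k                      # exclude x_k from its own dictionary
        model = Lasso(alpha=lam, fit_intercept=False)
        model.fit(X[:, mask], X[:, k])
        S[mask, k] = model.coef_
    return S
```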

Based on the analysis presented in section 2, we propose orthogonal sparsity preserving projections (OSPP). By minimizing the reconstructive errors among the data associated with the sparse coefficients, as SPP does, we iteratively extract one projective vector at a time and simultaneously require it to be orthogonal to all previously obtained ones. Thus, the optimal $l$-th projective vector $\hat{\nu}_l$ is derived by:

$$\hat{\nu}_l = \arg\min_{\nu_l} \sum_{k=1}^{N} \|\nu_l^T x_k - \nu_l^T X s_k\|^2 \quad \text{s.t.} \quad \nu_l^T X X^T \nu_l = 1, \; \nu_l^T \nu_1 = \cdots = \nu_l^T \nu_{l-1} = 0, \qquad (10)$$

where $\nu_l$ represents the $l$-th candidate solution.

In the following we show how to obtain the solution to formula (10). Following SPP, let $B = I - S - S^T + S^T S$; the objective function of formula (10) can then be written as:

$$\hat{\nu}_l = \arg\min_{\nu_l} \nu_l^T X B X^T \nu_l. \qquad (11)$$

By employing Lagrange multipliers on formula (10), we have:

$$Q(\nu_l) = \nu_l^T X B X^T \nu_l - \lambda (\nu_l^T X X^T \nu_l - 1) - \sum_{i=1}^{l-1} \alpha_i \nu_l^T \nu_i. \qquad (12)$$

Setting the derivative of formula (12) to zero, we have:

$$2 X B X^T \nu_l - 2\lambda X X^T \nu_l - \sum_{i=1}^{l-1} \alpha_i \nu_i = 0. \qquad (13)$$

Successively left-multiplying formula (13) by $\nu_1^T (XX^T)^{-1}, \nu_2^T (XX^T)^{-1}, \ldots, \nu_{l-1}^T (XX^T)^{-1}$ gives rise to a series of equations:

$$\begin{aligned}
2\nu_1^T (XX^T)^{-1} X B X^T \nu_l &= \alpha_1 \nu_1^T (XX^T)^{-1} \nu_1 + \cdots + \alpha_{l-1} \nu_1^T (XX^T)^{-1} \nu_{l-1} \\
2\nu_2^T (XX^T)^{-1} X B X^T \nu_l &= \alpha_1 \nu_2^T (XX^T)^{-1} \nu_1 + \cdots + \alpha_{l-1} \nu_2^T (XX^T)^{-1} \nu_{l-1} \\
&\;\;\vdots \\
2\nu_{l-1}^T (XX^T)^{-1} X B X^T \nu_l &= \alpha_1 \nu_{l-1}^T (XX^T)^{-1} \nu_1 + \cdots + \alpha_{l-1} \nu_{l-1}^T (XX^T)^{-1} \nu_{l-1}
\end{aligned} \qquad (14)$$

Define $[M^{l-1}]_{ij} = \nu_i^T (XX^T)^{-1} \nu_j$, $A_{l-1} = [\alpha_1, \ldots, \alpha_{l-1}]^T$ and $V_{l-1} = [\nu_1, \ldots, \nu_{l-1}]$, so that

$$M^{l-1} = [V_{l-1}]^T (XX^T)^{-1} V_{l-1}; \qquad (15)$$

then formula (14) can be written in matrix form as:

$$M^{l-1} A_{l-1} = 2 [V_{l-1}]^T (XX^T)^{-1} X B X^T \nu_l, \qquad (16)$$

and thus

$$A_{l-1} = 2 [M^{l-1}]^{-1} [V_{l-1}]^T (XX^T)^{-1} X B X^T \nu_l. \qquad (17)$$

Left-multiplying formula (13) by $(XX^T)^{-1}$, we have:

$$2 (XX^T)^{-1} X B X^T \nu_l - 2\lambda \nu_l - (XX^T)^{-1} V_{l-1} A_{l-1} = 0. \qquad (18)$$

Combining formulas (17)-(18), we arrive at an eigenproblem:

$$\left\{ I - (XX^T)^{-1} V_{l-1} [M^{l-1}]^{-1} [V_{l-1}]^T \right\} (XX^T)^{-1} X B X^T \nu_l = \lambda \nu_l. \qquad (19)$$

The solution $\hat{\nu}_l$ is the eigenvector associated with the smallest eigenvalue. It is worth noting that the derivation is motivated by [15].
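Computationally, each iteration solves the (generally non-symmetric) eigenproblem in formula (19). A minimal numpy sketch, under the assumption that $XX^T$ and $M^{l-1}$ are invertible (function name is ours):

```python
import numpy as np

def ospp_next_vector(X, B, V_prev):
    """Solve formula (19) for the l-th vector, given V_prev = [v_1, ..., v_{l-1}]."""
    d = X.shape[0]
    G_inv = np.linalg.inv(X @ X.T)   # (X X^T)^{-1}, assumed invertible after PCA
    C = G_inv @ X @ B @ X.T          # (X X^T)^{-1} X B X^T
    M = V_prev.T @ G_inv @ V_prev    # M^{l-1}, formula (15)
    P = np.eye(d) - G_inv @ V_prev @ np.linalg.inv(M) @ V_prev.T
    # P @ C is not symmetric in general, so use the general eigensolver and
    # take the eigenvector with the smallest (real part of the) eigenvalue.
    w, U = np.linalg.eig(P @ C)
    return U[:, np.argmin(w.real)].real
```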

Based on formula (11), we can measure the sparsity preserving ability of a projective vector $\nu_i$ by:

$$J(\nu_i) = \nu_i^T X (S + S^T - S^T S) X^T \nu_i, \quad \|\nu_i\| = 1. \qquad (20)$$
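As a small illustrative sketch (in numpy; the function name is ours), criterion (20) can be computed as:

```python
import numpy as np

def sparsity_preserving_ability(X, S, v):
    """Formula (20): J(v) = v^T X (S + S^T - S^T S) X^T v with ||v|| = 1."""
    v = v / np.linalg.norm(v)                 # enforce the unit-norm constraint
    C = X @ (S + S.T - S.T @ S) @ X.T
    return float(v @ C @ v)
```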

We examine and compare the sparsity preserving abilities of all the projective vectors extracted by both SPP and OSPP on the Yale face database in Figure 1. The x-axis indicates the index of the projective vector.

Figure 1. Sparsity preserving ability of OSPP and SPP.

As can be seen, the projective vectors of OSPP have greater sparsity preserving ability and thus can be expected to achieve better classification performance than those of SPP. The algorithmic procedure of OSPP is given in Figure 2.

Step 1. Calculate the sparse reconstructive weight matrix $S$ by formula (9).
Step 2. Calculate the first projective vector $\nu_1$ by formula (4).
Step 3. Initialize the matrix $V_1 = [\nu_1]$ and $M^1 = [V_1]^T (XX^T)^{-1} V_1$.
Step 4. For $l = 2 : n$:
  1) Calculate the $l$-th projective vector $\nu_l$ by formula (19).
  2) Update the matrix $V_l = [V_{l-1}, \nu_l]$ and $M^l$ as defined in formula (15).
Step 5. $V_n$ is the final projective vector set.

Figure 2. Algorithmic procedure of OSPP
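Putting the steps of Figure 2 together, a hedged end-to-end sketch is given below; it reuses the helper functions sketched earlier (sparse_weight_matrix, spp_projections, ospp_next_vector, all our own illustrative names), with n the number of projective vectors to extract:

```python
import numpy as np

def ospp(X, lam, n):
    """Algorithmic procedure of Figure 2 (illustrative sketch, X is d x N)."""
    S = sparse_weight_matrix(X, lam)          # Step 1: formula (9)
    B = np.eye(S.shape[0]) - S - S.T + S.T @ S
    V = spp_projections(X, S, 1)              # Step 2: first vector by formula (4)
    for l in range(2, n + 1):                 # Step 4
        v = ospp_next_vector(X, B, V)         # formula (19)
        V = np.column_stack([V, v])           # update V_l
    return V                                  # Step 5: final projective vector set
```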

4. Experiments

We conduct our experiments on the public Yale face database and compare the proposed OSPP with several representative unsupervised feature extraction methods, including PCA, LPP and SPP, in terms of recognition rate.

The Yale database contains 15 individuals with 11 images per person. The images are taken under varying facial expressions and lighting conditions. In our experiments, we crop each image to size 32×32. Demo images of one subject are illustrated in Figure 3.

Figure 3. Demo images of one subject in Yale face database

In our experiments, LPP, SPP and OSPP involve a PCA phase (with all components kept) to avoid the singularity problem. As a comparison, the baseline method indicates classification of the raw data. For each method, we select the first r samples of each subject for training and the rest for testing. The regularization scalar in formula (9) and the neighborhood size of LPP are determined by searching over a large range of potential values and choosing the one that yields the best classification result. We use the 1NN classifier with Euclidean distance for its simplicity. Recognition rates versus feature dimensions are illustrated in Figure 4. The best results with the corresponding feature numbers are given in Table 1, where OSPP generally outperforms the other methods.
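The evaluation protocol can be sketched as follows (an assumed reimplementation, not the authors' code; ospp is the function sketched above, columns of the data matrices are samples, and the 1NN classifier comes from scikit-learn):

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

def evaluate(X_train, y_train, X_test, y_test, lam, n_dims):
    """Train OSPP on the training samples and report 1NN accuracy on the test set."""
    V = ospp(X_train, lam, n_dims)            # d x n_dims projection matrix
    clf = KNeighborsClassifier(n_neighbors=1, metric='euclidean')
    clf.fit(X_train.T @ V, y_train)           # project samples, then fit 1NN
    return clf.score(X_test.T @ V, y_test)
```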


Figure 4. Recognition rate versus feature dimension on the Yale database: (a) 3 training samples, (b) 4 training samples, (c) 5 training samples, (d) 6 training samples.

TABLE 1. COMPARISON OF RECOGNITION RATES (%) ON YALE DATABASE (FEATURE DIMENSION IN PARENTHESES)

Method    | Train 3    | Train 4    | Train 5    | Train 6
Baseline  | 57.5       | 61.9       | 63.33      | 62.67
PCA       | 59.17 (40) | 63.81 (48) | 67.78 (35) | 66.67 (46)
LPP       | 59.17 (41) | 64.76 (45) | 68.89 (27) | 72 (23)
SPP       | 63.33 (43) | 67.62 (59) | 70 (12)    | 70.67 (85)
OSPP      | 63.33 (38) | 68.57 (37) | 71.11 (31) | 72 (51)

5. Conclusion

In this paper, we propose an unsupervised feature extraction method called orthogonal sparsity preserving projections (OSPP), based on SPP, which aims at preserving the sparse reconstructive relations among the data, as well as some nice properties of the dictionary, by computing an orthogonal projective vector set. Empirical study shows that OSPP has greater sparsity preserving ability than SPP and can be expected to acquire better classification performance. Experimental results validate the effectiveness of OSPP, which outperforms several representative unsupervised feature extraction methods including PCA, LPP and SPP.

References

[1] E. Candès, “Compressive sampling”, Proceedings of the International Congress of Mathematicians, Madrid, August 2006.

[2] J. Starck, M. Elad and D. Donoho, “Image decomposition via the combination of sparse representation and a variational approach”, IEEE Trans. Image Processing, Vol 14, No.10, pp. 1570-1582, Sep. 2005.

[3] Y. Li, A. Cichocki and S. Amari, “Analysis of sparse representation and blind source separation”, Neural Computation, Vol 16, No.6, pp. 1193-1234, Jun. 2004.

[4] B. Olshausen, P. Sallee and M. Lewicki, “Learning sparse image codes using a wavelet pyramid architecture”, Neural Information Processing Systems (NIPS), Vancouver, pp.887-893, Dec. 2001.

[5] M. Elad and M. Aharon, “Image denoising via learned dictionaries and sparse representation”, Proceedings of the Conference on Computer Vision and Pattern Recognition (CVPR), New York, pp. 895-900, Jun. 2006.

[6] K. Huang and S. Aviyente, “Sparse representation for signal classification”, Neural Information Processing Systems (NIPS), Vancouver, pp. 609-616, Dec. 2006.

[7] A. Yang, J. Wright, Y. Ma and S. Sastry, “Feature selection in face recognition: A sparse representation perspective”, UC Berkeley Tech Report UCB/EECS-2007-99, 2007.

[8] J. Wright, A. Yang, S. Sastry and Y. Ma, “Robust face recognition via sparse representation”, IEEE Trans. Pattern Anal. Mach. Intell., Vol 31, No.2, pp. 210-227, Feb. 2009.

[9] M. Turk and A. Pentland, “Eigenfaces for recognition”, Journal of Cognitive Neuroscience, Vol 3, No.1, pp. 71-86, 1991.

[10] X. He, S. Yan, Y. Hu, P. Niyogi and H. J. Zhang, “Face recognition using laplacian faces”, IEEE Trans. Pattern Anal. Mach. Intell., Vol 27, No.3, pp.328-340, Mar. 2005.

[11] L. Qiao, S. Chen and X. Tan. “Sparsity Preserving Projections with Applications to Face Recognition”, Pattern Recognition, Vol 43, No.1, pp. 331-341, Jan. 2010.

[12] D. L. Donoho and X. Huo, “Uncertainty principles and ideal atomic decomposition”, IEEE Trans. Information Theory, Vol 47, No.7, pp. 2845-2862, Nov. 2001.

[13] J. A. Tropp, “Greed is good: Algorithmic results for sparse approximation”, IEEE Trans. Information Theory, Vol 50, No.10, pp.2231-2242, Oct. 2004.

[14] R. Tibshirani, “Regression shrinkage and selection via the LASSO”, Journal of the Royal Statistical Society B, Vol 58, No.1, pp.267-288, 1996.

[15] J. Duchene and S. Leclercq, “An optimal transformation for discriminant and principal component analysis”, IEEE Trans. Pattern Anal. Mach. Intell., Vol 10, No.6, pp.978-983, Nov. 1988.

[16] B. Fang, M. Cheng, Y. Y. Tang, and G. He, “Improving the discriminant ability of local margin based learning method by incorporating the global between-class separability criterion”, Neurocomputing, Vol 73, No.1-3, pp. 536-541, 2009.

[17] M. Cheng, B. Fang, Y. Y. Tang, T. Zhang, and J. Wen, “Incremental embedding and learning in the local discriminant subspace with application to face recognition”, IEEE Trans. Syst., Man, Cybern. C, Appl. Rev., accepted, to appear.
