
A Deep Semi-NMF Model for Learning Hidden Representations

George Trigeorgis  GEORGE.TRIGEORGIS08@IMPERIAL.AC.UK
Konstantinos Bousmalis  K.BOUSMALIS@IMPERIAL.AC.UK
Stefanos Zafeiriou  S.ZAFEIRIOU@IMPERIAL.AC.UK
Björn W. Schuller  BJOERN.SCHULLER@IMPERIAL.AC.UK

Department of Computing, Imperial College London, United Kingdom

Proceedings of the 31st International Conference on Machine Learning, Beijing, China, 2014. JMLR: W&CP volume 32. Copyright 2014 by the author(s).

Abstract

Semi-NMF is a matrix factorization technique that learns a low-dimensional representation of a dataset that lends itself to a clustering interpretation. It is possible that the mapping between this new representation and our original features contains rather complex hierarchical information with implicit lower-level hidden attributes that classical one-level clustering methodologies cannot interpret. In this work we propose a novel model, Deep Semi-NMF, that is able to learn such hidden representations, which lend themselves to an interpretation of clustering according to different, unknown attributes of a given dataset. We show that by doing so, our model is able to learn low-dimensional representations that are better suited for clustering, outperforming not only Semi-NMF but also other NMF variants.

1. Introduction

Matrix factorization is a particularly useful family of techniques in data analysis. In recent years, there has been a significant amount of research on factorization methods that focus on particular characteristics of both the data matrix and the resulting factors. Non-negative matrix factorization (NMF), for example, focuses on the decomposition of a non-negative multivariate data matrix X into factors Z and H that are also non-negative, such that X ≈ ZH. The application area of the family of NMF algorithms has grown significantly during the past years. It has been shown that they can be a successful dimensionality reduction technique over a variety of areas including, but not limited to, environmetrics (Paatero & Tapper, 1994), microarray data analysis (Brunet et al., 2004; Devarajan, 2008), document clustering (Berry & Browne, 2005), face recognition (Zafeiriou et al., 2006; Kotsia et al., 2007) and more. What makes NMF algorithms particularly attractive is the non-negativity constraints imposed on the factors they produce, allowing for better interpretability. Moreover, it has been shown that NMF variants (such as the Semi-NMF) are equivalent to k-means clustering, and that, in fact, NMF variants are expected to perform better than k-means clustering, particularly when the data is not distributed in a spherical manner (Ding et al., 2010; Ding et al., 2005). Nonlinear extensions of NMF have also been studied recently (Zafeiriou & Petrou, 2010).

In order to extend the applicability of NMF in cases where our data matrix X is not strictly non-negative, Ding et al. (2010) introduced the Semi-NMF, an NMF variant that imposes non-negativity constraints only on the second factor H, but allows mixed signs in both the data matrix X and the first factor Z. This was motivated from a clustering perspective, where Z represents cluster centroids and H represents soft membership indicators for every data point, allowing Semi-NMF to learn new lower-dimensional features from the data that have a convenient clustering interpretation.

It is possible that the mapping Z between this new representation H and our original features X contains rather complex hierarchical and structural information. Consider for example the problem of mapping images of faces to their identities: a face image also contains information about attributes like pose and expression that can help identify the person depicted. One could argue that by further factorizing this mapping Z, in a way that each factor adds an extra layer of abstraction while minimizing the dimensionality of the representation, one could automatically learn such latent attributes and the intermediate hidden representations that are implied, allowing for a better higher-level feature representation H, as demonstrated in Figure 1. In this work, we propose a novel Deep Semi-NMF approach, which applies the concept of Semi-NMF to a multi-layer structure that is able to learn hidden representations of the original data.


Figure 1. (a) A Semi-NMF model results in a linear transformation of the initial input space. (b) Deep Semi-NMF learns a hierarchy of hidden representations that aid in uncovering the final lower-dimensional representation of the data.

As Semi-NMF has a close relation to k-means clustering, by extending the model to a deep one we allow our model to learn new representations of our original data that continue to have a clustering interpretation according to the different latent attributes of our dataset, as demonstrated in Figure 2.

Closest to our proposal is recent work that has presented NMF variants that factorize X into more than 2 factors. Specifically, Ahn et al. (2004) have demonstrated the concept of Multi-layer NMF on a set of facial images, and (Lyu & Wang, 2013; Cichocki & Zdunek, 2006; Song & Lee, 2013) have proposed similar NMF models that can be used for blind source separation, and for classification of digit images (MNIST) and documents. The representations of the Multi-layer NMF, however, do not lend themselves to a clustering interpretation, as do the representations learned by our model. Although the Multi-layer NMF is a promising technique for learning hierarchies of features from data, we show in this work that our proposed model, the Deep Semi-NMF, outperforms the Multi-layer NMF and, in fact, all models we compared it with on the task of feature learning for clustering images of faces.

The novelty of this work can be summarized as follows: (1) we outline a novel deep framework for matrix factorization suitable for clustering of multimodally distributed objects such as faces, (2) we present a greedy algorithm to optimize the factors of the Semi-NMF problem, inspired by recent advances in deep learning (Hinton & Salakhutdinov, 2006), and (3) we evaluate the representations learned by different NMF variants in terms of clustering performance.

2. Background

In this work, we assume that our data is provided in a matrix form X ∈ R^{p×n}, i.e., X = [x1, x2, ..., xn] is a collection of n data vectors as columns, each with p features. Matrix factorization aims at finding factors of X that satisfy certain constraints. In Singular Value Decomposition (SVD) (Golub & Reinsch, 1970), the method that underlies Principal Component Analysis (PCA) (Wold et al., 1987), we factorize X into two factors: the loadings or bases Z ∈ R^{p×k} and the features or components H ∈ R^{k×n}, without imposing any sign restrictions on either our data or the resulting factors. In Non-negative Matrix Factorization (NMF) (Lee & Seung, 2001) we assume that all matrices involved contain only non-negative elements¹, so we try to approximate a factorization X⁺ ≈ Z⁺H⁺.
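As a concrete illustration of this notation (ours, not part of the original paper), the following minimal numpy/scikit-learn sketch factorizes a hypothetical non-negative X ∈ R^{p×n}. Note that scikit-learn's NMF treats rows as samples, so we factorize X.T and transpose back to recover Z and H in the column-wise convention above.

```python
import numpy as np
from sklearn.decomposition import NMF

# Hypothetical data: n = 100 samples with p = 50 non-negative features,
# stored column-wise (X in R^{p x n}) as in the paper's convention.
X = np.abs(np.random.default_rng(0).standard_normal((50, 100)))

# scikit-learn factorizes X.T ~ W C (samples as rows); transposing back
# gives X ~ C.T W.T, i.e. Z = C.T (p x k) and H = W.T (k x n).
model = NMF(n_components=10, init="nndsvd", max_iter=500)
W = model.fit_transform(X.T)
Z, H = model.components_.T, W.T
print("relative error:", np.linalg.norm(X - Z @ H) / np.linalg.norm(X))
```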

    2.1. Semi-NMF

In turn, Semi-NMF (Ding et al., 2010) relaxes the non-negativity constraints of NMF and allows the data matrix X and the loadings matrix Z to have mixed signs, while restricting only the features matrix H to comprise strictly non-negative components, thus approximating the following factorization:

$X^{\pm} \approx Z^{\pm} H^{+}$  (1)

This is motivated from a clustering perspective. If we view Z = [z1, z2, ..., zk] as the cluster centroids, then H = [h1, h2, ..., hn] can be viewed as the cluster indicators for each datapoint. In fact, if we had a matrix H that was not only non-negative but also orthogonal, then every column vector would have only one positive element, making Semi-NMF equivalent to k-means. Thus Semi-NMF, which does not impose an orthogonality constraint on its features matrix, can be seen as a soft clustering method where the features matrix describes the compatibility of each component with a cluster centroid, a base in Z.
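To make this concrete, here is a minimal numpy sketch (ours, not the authors' released code) of the alternating Semi-NMF updates of Ding et al. (2010), written in the X ≈ ZH convention above; a simple random non-negative H stands in for the k-means initialization the original authors propose.

```python
import numpy as np

def semi_nmf(X, k, n_iter=200, eps=1e-9, seed=0):
    """Semi-NMF sketch: X (p x n, mixed signs) ~ Z (p x k, mixed) @ H (k x n, >= 0)."""
    rng = np.random.default_rng(seed)
    H = np.abs(rng.standard_normal((k, X.shape[1])))
    pos = lambda A: (np.abs(A) + A) / 2.0  # element-wise positive part
    neg = lambda A: (np.abs(A) - A) / 2.0  # element-wise negative part
    for _ in range(n_iter):
        # Z update: unconstrained least squares, Z = X H^T (H H^T)^(-1)
        Z = X @ H.T @ np.linalg.pinv(H @ H.T)
        # H update: multiplicative rule that preserves non-negativity
        ZtX, ZtZ = Z.T @ X, Z.T @ Z
        H *= np.sqrt((pos(ZtX) + neg(ZtZ) @ H) / (neg(ZtX) + pos(ZtZ) @ H + eps))
    return Z, H
```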

2.2. State-of-the-art for learning features for clustering based on NMF variants

In this work, we compare our method with, among others, the state-of-the-art NMF techniques for learning features for the purpose of clustering. Cai et al. (2011) proposed a graph-regularized NMF (GNMF) which takes into account the intrinsic geometric and discriminating structure of the data space, which is essential to real-world applications, especially in the area of clustering. To accomplish this, GNMF constructs a nearest-neighbor graph to model the manifold structure. By preserving the graph structure, it allows the learned features to have more discriminating power than the standard NMF algorithm, in cases where the data are sampled from a submanifold which lies in a higher-dimensional ambient space. This combines two different notions, that of NMF and of graph Laplacian regularization algorithms (Belkin & Niyogi, 2001).

¹ When not clear from the context, we will use the notation A⁺ to state that a matrix A contains only non-negative elements. Similarly, when not clear, we will use the notation A± to state that A may contain any real number.

Another state-of-the-art matrix factorization technique is NeNMF (Guan et al., 2012). NeNMF makes use of Nesterov's optimal gradient method to alternately optimize one factor with the other fixed. This allows for faster NMF optimization without the time-consuming line-search procedure or the numerical instability problems of traditional NMF. Guan et al. (2012) showed that it outperformed existing NMF solvers in terms of reconstruction error and document clustering performance.

3. Deep Semi-NMF

In Semi-NMF the goal is to construct a low-dimensional representation H⁺ of our original data X±, with the bases matrix Z± serving as the mapping between our original data and its lower-dimensional representation (see Equation 1). In many cases the data we wish to analyze is rather complex and has a collection of distinct, often unknown, attributes. In this work, for example, we deal with datasets of human faces, where the variability in the data does not only stem from the difference in the appearance of the subjects, but also from other attributes, such as the pose of the head in relation to the camera, or the facial expression of the subject. In addition, faces comprise mainly hierarchical features, and thus face clustering problems can be better solved using our deep framework, as each subsequent layer can capture the hierarchical structure.

We propose here the Deep Semi-NMF model, which factorizes a given data matrix X into m + 1 factors, as follows:

$X^{\pm} \approx Z_1^{\pm} Z_2^{\pm} \cdots Z_m^{\pm} H_m^{+}$  (2)

This formulation, as illustrated in Figures 1 and 2, allows for a hierarchy of m layers of implicit representations of our data, given by the following factorizations:

$H_{m-1}^{+} \approx Z_m^{\pm} H_m^{+}$
$\vdots$
$H_2^{+} \approx Z_3^{\pm} \cdots Z_m^{\pm} H_m^{+}$
$H_1^{+} \approx Z_2^{\pm} \cdots Z_m^{\pm} H_m^{+}$

As one can see above, we further restrict these implicit representations ($H_1^{+}, \ldots, H_{m-1}^{+}$) to also be non-negative. By doing so, every layer of this hierarchy of representations also lends itself to a clustering interpretation. By examining Figure 2 one can acquire a better intuition of how that happens. In this case the input to the model, our data X, is a collection of face images from different subjects (identity), expressing a variety of emotions (expressions), taken from many angles (pose). A Semi-NMF model would find a representation H of X, which would be useful for performing clustering according to the identity of the subjects, and Z, the mapping between these identities and the face images. A Deep Semi-NMF model also finds a representation of our data that has a similar interpretation at the top layer, its last factor Hm. However, the mapping from identities to face images is now further analyzed as a product of three factors Z = Z1 Z2 Z3, with Z3 corresponding to the mapping of identities to emotions, Z2 Z3 corresponding to the mapping of identities to poses, and finally Z1 Z2 Z3 corresponding to the mapping of identities to the face images. That means that, as shown in Figure 2, we are able to decompose our data in 3 different ways according to our 3 different attributes:

$X^{\pm} \approx Z_1^{\pm} H_1^{+}$
$X^{\pm} \approx Z_1^{\pm} Z_2^{\pm} H_2^{+}$
$X^{\pm} \approx Z_1^{\pm} Z_2^{\pm} Z_3^{\pm} H_3^{+}$

Our hypothesis is that by further factorizing Z we are able to construct a deep model that is able to (1) automatically learn what this latent hierarchy of attributes is; (2) find representations of the data that are most suitable for clustering according to the attribute that corresponds to each layer in the model; and (3) find a better high-level, final-layer representation for clustering according to the attribute with the lowest variability, in our case the identity of the face depicted. In our example in Figure 2 we would expect to find better features for clustering according to identities, H3, by learning the hidden representations at each layer most suitable for each of the attributes in our data; in this example: H1 ≈ Z2 Z3 H3 for clustering our original images in terms of poses, and H2 ≈ Z3 H3 for clustering the face images in terms of expressions.

In order to expedite the approximation of the factors in our model, we pre-train each of the layers to obtain an initial approximation of the matrices Zi, Hi, as this greatly improves the training time of the model. This is a tactic that has been employed successfully before (Hinton & Salakhutdinov, 2006) on deep autoencoder networks. To perform the pre-training, we first decompose the initial data matrix X ≈ Z1 H1, where Z1 ∈ R^{p×k1} and H1 ∈ R⁺^{k1×n}. Following this, we decompose the features matrix H1 ≈ Z2 H2, where Z2 ∈ R^{k1×k2} and H2 ∈ R⁺^{k2×n}, continuing to do so until we have pre-trained all of the layers. Afterwards, we can fine-tune the weights of each layer, by employing alternating minimization (with respect to the objective function in Equation 3) of the two factors in each layer, in order to reduce the total reconstruction error of the model, according to the cost function in Equation 3.
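A minimal sketch of this greedy pre-training stage, reusing the semi_nmf() sketch above (layer_sizes such as [625, 60] follows the architectures reported in Section 4.1):

```python
def pretrain_deep_semi_nmf(X, layer_sizes, **kwargs):
    """Greedy layer-wise pre-training: X ~ Z1 H1, then H1 ~ Z2 H2, and so on."""
    Zs, Hs, data = [], [], X
    for k in layer_sizes:
        Z, H = semi_nmf(data, k, **kwargs)
        Zs.append(Z)
        Hs.append(H)
        data = H  # the features of layer i become the "data" of layer i + 1
    return Zs, Hs
```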


Figure 2. A Deep Semi-NMF model learns a hierarchical structure of features, with each layer learning a representation suitable for clustering according to the different attributes of our data. In this simplified example (for demonstration purposes) from the CMU Multi-PIE database, a Deep Semi-NMF model is able to simultaneously learn features for pose clustering (H1), for expression clustering (H2), and for identity clustering (H3), with k-means applied to each level. Each of the images in X has an associated color coding that indicates its memberships according to each of these attributes (pose/expression/identity); the example poses shown are 0°, -30°, and 30°, and the example expressions are surprise, neutral, and squint.

The cost function of the deep model is:

$C_{\mathrm{deep}} = \frac{1}{2} \left\lVert X - Z_1 Z_2 \cdots Z_m H_m \right\rVert_F^2 = \mathrm{tr}\!\left[ X^{\top} X - 2 X^{\top} Z_1 Z_2 \cdots Z_m H_m + H_m^{\top} Z_m^{\top} Z_{m-1}^{\top} \cdots Z_1^{\top} Z_1 Z_2 \cdots Z_m H_m \right]$  (3)

Update rule for the weights matrix Z. We fix the rest of the weights for the i-th layer and we minimize the cost function with respect to Zi. That is, we set ∂C_deep/∂Zi = 0, which gives us the update:

$Z_i = (\Psi^{\top}\Psi)^{-1} \Psi^{\top} X \tilde{H}_i^{\top} (\tilde{H}_i \tilde{H}_i^{\top})^{-1} = \Psi^{\dagger} X \tilde{H}_i^{\dagger}$  (4)

where $\Psi = Z_1 \cdots Z_{i-1}$, $\dagger$ denotes the Moore–Penrose pseudo-inverse, and $\tilde{H}_i$ is the reconstruction of the i-th layer's feature matrix.

Update rule for the features matrix H. Utilizing a proof similar to that of Ding et al. (2010), we can formulate the update rule for Hi as follows:

$H_i = H_i \odot \sqrt{\dfrac{[\Psi^{\top} X]^{\mathrm{pos}} + [\Psi^{\top}\Psi]^{\mathrm{neg}} H_i}{[\Psi^{\top} X]^{\mathrm{neg}} + [\Psi^{\top}\Psi]^{\mathrm{pos}} H_i}}$  (5)
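A numpy sketch of one alternating pass of Equations 4 and 5 (our reading, not the authors' released code): for each layer we form $\tilde{H}_i$ from the layers above it, update $Z_i$ by pseudo-inverses, and then apply the multiplicative $H_i$ update. For the $H_i$ step we take $\Psi$ to include the freshly updated $Z_i$, so that the dimensions in Equation 5 match.

```python
import numpy as np
from functools import reduce

def finetune_step(X, Zs, Hs, eps=1e-9):
    """One alternating pass over all layers (Equations 4 and 5). Modifies Zs, Hs."""
    pos = lambda A: (np.abs(A) + A) / 2.0
    neg = lambda A: (np.abs(A) - A) / 2.0
    m = len(Zs)
    for i in range(m):
        # H~_i: reconstruction of layer i's features from the layers above it
        # (H~_m = H_m for the top layer).
        Ht = reduce(lambda acc, Z: Z @ acc, reversed(Zs[i + 1:]), Hs[-1])
        # Psi = Z_1 ... Z_{i-1} (the identity for the first layer).
        Psi = reduce(np.matmul, Zs[:i], np.eye(X.shape[0]))
        Zs[i] = np.linalg.pinv(Psi) @ X @ np.linalg.pinv(Ht)  # Equation 4
        Psi = Psi @ Zs[i]  # extend Psi with the updated Z_i for the H_i step
        PtX, PtP = Psi.T @ X, Psi.T @ Psi
        Hs[i] *= np.sqrt((pos(PtX) + neg(PtP) @ Hs[i])
                         / (neg(PtX) + pos(PtP) @ Hs[i] + eps))  # Equation 5
    return Zs, Hs
```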

Supplementary material, including the implementation of Algorithm 1 and the proof of its convergence, can be found at http://trigeorgis.com/deepseminmf.

Complexity. The computational complexity of the pre-training stage of Deep Semi-NMF is of order $O(mt(pnk + nk^2 + kp^2 + kn^2))$, where m is the number of layers, t the number of iterations until convergence, and k the maximum number of components out of all the layers. The complexity of the fine-tuning stage is $O(mt_f(pnk + (p + n)k^2))$, where $t_f$ is the number of additional iterations needed.

Non-linear Update Rules. One can use a non-linear function g(·) between each of the implicit representations ($H_1^{+}, \ldots, H_{m-1}^{+}$), in order to better approximate the non-linear manifolds on which the given data matrix X originally lies. Thus, one can represent the i-th feature matrix Hi by

$g(H_i) \approx Z_{i+1} H_{i+1}$  (6)

This creates new update rules, as the derivatives of the objective function become:

$\dfrac{\partial C_{\mathrm{deep}}}{\partial Z_i} = (N_i - R_i) H_i^{\top}$  (7)

$\dfrac{\partial C_{\mathrm{deep}}}{\partial H_i} = Z_i^{\top} (N_i - R_i)$  (8)



Algorithm 1. Suggested algorithm for training a Deep Semi-NMF model. Initially we approximate the factors greedily using the SEMI-NMF algorithm (Ding et al., 2010) and we fine-tune the factors until we reach the convergence criterion.

function DEEPSEMINMF
  Input: X ∈ R^{p×n}, list of layer sizes
  Output: weight matrices Zi and feature matrices Hi for each of the layers

  // Initialize layers (with H0 = X)
  for all layers i do
    Zi, Hi ← SEMINMF(H_{i−1}, layers(i))
  end for

  repeat
    for all layers i do
      H̃i ← Hi if i = m, else Z_{i+1} H̃_{i+1}
      Ψ ← ∏_{k=1}^{i−1} Zk
      Zi ← Ψ† X H̃i†
      Hi ← Hi ⊙ sqrt( ([Ψ⊤X]^pos + [Ψ⊤Ψ]^neg Hi) / ([Ψ⊤X]^neg + [Ψ⊤Ψ]^pos Hi + ε) )
    end for
  until stopping criterion is reached
end function
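Putting the sketches together, an end-to-end training loop in the spirit of Algorithm 1 might look as follows, reusing pretrain_deep_semi_nmf() and finetune_step() from above, together with the stopping rule $E_{i-1} - E_i \le \kappa \max(1, E_{i-1})$ described in Section 4.2:

```python
import numpy as np
from functools import reduce

def deep_semi_nmf(X, layer_sizes, max_iter=300, kappa=1e-6, **kwargs):
    """Sketch of Algorithm 1: greedy pre-training, then alternating fine-tuning."""
    Zs, Hs = pretrain_deep_semi_nmf(X, layer_sizes, **kwargs)
    prev_err = None
    for _ in range(max_iter):
        Zs, Hs = finetune_step(X, Zs, Hs)
        err = 0.5 * np.linalg.norm(X - reduce(np.matmul, Zs) @ Hs[-1]) ** 2
        if prev_err is not None and prev_err - err <= kappa * max(1.0, prev_err):
            break  # reconstruction error has stopped improving
        prev_err = err
    return Zs, Hs
```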

which gives the following update rules:

$Z_i = \begin{cases} X \tilde{H}_i^{\dagger} & \text{if } i = 1 \\ g(H_{i-1}) \tilde{H}_i^{\dagger} & \text{otherwise} \end{cases}$  (9)

$H_i = H_i \odot \dfrac{[Z_i^{\top} R_i]^{\mathrm{pos}} + [Z_i^{\top} N_i]^{\mathrm{neg}}}{[Z_i^{\top} R_i]^{\mathrm{neg}} + [Z_i^{\top} N_i]^{\mathrm{pos}}}$  (10)

where, for all i ≤ m, Ni and Ri are auxiliary matrices that facilitate the computation of the factors in each layer, A^pos is a matrix that has the negative elements of matrix A replaced with 0, and similarly A^neg is one that has the positive elements of A replaced with 0. The base case values (i = 1) for these matrices are:

$N_1 = (Z_1 \tilde{H}_1) \odot \nabla g^{-1}(Z_1 H_1)$  (11)
$R_1 = X \odot \nabla g^{-1}(Z_1 H_1)$  (12)

and for the cases where i > 1 they are computed as follows:

$N_{i+1} = (Z_i^{\top} N_i) \odot \nabla g^{-1}(Z_i H_i)$  (13)
$R_{i+1} = (Z_i^{\top} R_i) \odot \nabla g^{-1}(Z_i H_i)$  (14)

4. Experiments

Our main hypothesis is that a Deep Semi-NMF is able to learn better high-level representations of our original data than a one-layer Semi-NMF for clustering according to the attribute with the lowest variability in the dataset. In order to evaluate this hypothesis, we have compared the performance of Deep Semi-NMF with that of other methods on the task of clustering images of faces in 3 distinct datasets. These datasets are:

• CMU MultiPIE: The first dataset we examine is the CMU Multi Pose, Illumination, and Expression (MultiPIE) Database (Gross et al., 2010), which contains around 750,000 images of 337 subjects, captured under laboratory conditions in four different sessions. In this work, we used a subset of 13,230 images of 147 subjects in 5 different poses and 3 different illumination conditions, expressing 6 different emotions. Using the annotations from (Sagonas et al., 2013a;b), we aligned these images based on a common frame by using piece-wise affine warping. After that, we resized them to a smaller resolution of 30 × 30. The database comes with labels for each of the attributes mentioned above: identity, illumination, pose, expression.

• CMU PIE: We also used a freely available version of CMU PIE (Sim et al., 2003), which comprises 2,856 grayscale 32 × 32 face images of 68 subjects. Each person has 42 facial images under different light and illumination conditions. In this database we only know the identity of the face in each image, and we could not use piece-wise affine warping, since we did not have facial point annotations for it.

• XM2VTS: The Extended Multi Modal Verification for Teleservices and Security applications (XM2VTS) database (Messer et al., 1999) contains 2,360 frontal images of 295 different subjects. Each subject has two available images for each of the four different laboratory sessions, for a total of 8 images. The images were aligned based on the same piece-wise affine warping technique as the Multi-PIE dataset, after resizing the original images to 42 × 30.

In order to evaluate the performance of our model, we compared it against not only Semi-NMF (Ding et al., 2010), but also against other NMF variants that could be useful in learning such representations. More specifically, for each of our three datasets we performed the following experiments:

• Pixel Intensities: By using only the pixel intensities of the images in each of our datasets, which of course give us a strictly non-negative input data matrix X, we compare the reconstruction error and the clustering performance of our Deep Semi-NMF method against the Semi-NMF, NMF with multiplicative update rules (Lee & Seung, 2001), Multi-layer NMF (Song & Lee, 2013), GNMF (Cai et al., 2011), and NeNMF (Guan et al., 2012).

Table 1. The mean reconstruction error for each of the algorithms with a variable number of components. The error quantifies the deviation of the reconstruction from the original matrix X, using the Euclidean objective function (1/N)‖X − X̃‖², with N the number of samples. Deep Semi-NMF has a comparable reconstruction error to that of Semi-NMF.

                                            Number of components
Name                               30      40      50      60      70
CMU Multi-PIE – Pixel Intensities
NMF                              2.68    2.52    2.41    2.32    2.23
NeNMF                            2.40    2.20    2.04    1.92    1.80
GNMF                             3.05    3.03    2.98    3.02    3.02
Semi-NMF                         2.37    2.16    2.01    1.89    1.78
Multi-layer NMF                  2.71    2.54    2.42    2.31    2.22
Deep Semi-NMF                    2.40    2.20    2.07    1.96    1.86
CMU Multi-PIE – Image Gradient Orientations
Semi-NMF                         0.14    0.13    0.13    0.12    0.12
Deep Semi-NMF                    0.14    0.13    0.13    0.13    0.12

• Image Gradient Orientations (IGO): In general, the trend in Computer Vision is to use complicated engineered features like HoGs, SIFT, LBPs, etc. As a proof of concept, we choose to conduct experiments with simple gradient orientations (Zafeiriou et al., 2012) as features, instead of pixel intensities, which results in a data matrix of mixed signs (a sketch of this feature extraction is given after this list); we expect that we can learn better data representations for clustering faces according to identities. In this case, we only compared our Deep Semi-NMF with its one-layer Semi-NMF equivalent, as the other techniques are not able to deal with mixed-sign matrices.
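For concreteness, one common IGO construction maps each image to the stacked cosines and sines of its gradient orientations, giving mixed-sign values in [-1, 1]. The sketch below is ours (the paper's exact normalization may differ); it doubles the input dimensionality, consistent with the layer sizes reported for the IGO experiments.

```python
import numpy as np

def igo_features(img):
    """IGO sketch: gradient orientations phi mapped to [cos(phi); sin(phi)]."""
    gy, gx = np.gradient(img.astype(float))  # image gradients along rows/columns
    phi = np.arctan2(gy, gx)                 # gradient orientation at every pixel
    return np.concatenate([np.cos(phi).ravel(),
                           np.sin(phi).ravel()]) / np.sqrt(phi.size)

# Hypothetical usage: stack IGO vectors column-wise into the data matrix X,
# e.g. for 32 x 32 CMU PIE crops this yields p = 2 * 1024 = 2048 features.
# X = np.stack([igo_features(im) for im in images], axis=1)
```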

Finally, we evaluated our secondary hypothesis, i.e., that the hidden representation in each layer is in fact most suited for clustering according to the attribute that corresponds to that layer. We performed clustering experiments by using the features learned in each layer of a two-layer Deep Semi-NMF in the case of CMU Multi-PIE, as this was the only dataset for which we had labels for attributes other than the identity.

    4.1. Implementation Details

Figure 3. CMU MultiPIE – Pixel Intensities: NMI and Accuracy for clustering based on the representations learned by each model with respect to identities, plotted against the number of components (plots omitted). The deep architectures are comprised of 2 representation layers (1988-625-a) and the representations used were from the top layer. AUC scores: NMI: Deep Semi-NMF 17.75, Semi-NMF 13.79, NMF (MUL) 14.74, Multi-layer NMF 15.10, NeNMF 14.04, GNMF 14.16. Accuracy: Deep Semi-NMF 7.92, Semi-NMF 4.14, NMF (MUL) 4.50, Multi-layer NMF 4.89, NeNMF 4.11, GNMF 3.95.

In order to initiate the matrix factorization process, NMF and Semi-NMF algorithms start from some initial point (Z⁰, H⁰), where usually Z⁰ and H⁰ are randomly initialized matrices. In order to speed up the convergence rate of NMF, Boutsidis & Gallopoulos (2008) suggested Non-negative Double Singular Value Decomposition (NNDSVD), a method based on two SVD processes: one to approximate the initial data matrix X and the other to approximate the positive sections of the resulting partial SVD factors. Although the initialization of Semi-NMF proposed by its authors uses the k-means algorithm (Ding et al., 2010), we found empirically that it was too computationally heavy when the number of components k was fairly high (k > 100). As an alternative, we used NNDSVD to obtain an initial and deterministic approximation, after forcing the initial data to have non-negative values by setting all negative values to zero.
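A sketch of the NNDSVD initialization (Boutsidis & Gallopoulos, 2008) as we read its use here: clip the data to be non-negative, then build a deterministic (Z⁰, H⁰) from the leading singular vectors.

```python
import numpy as np

def nndsvd_init(X, k, tiny=1e-12):
    """NNDSVD sketch: deterministic non-negative (Z0, H0) from max(X, 0)."""
    A = np.maximum(X, 0.0)                     # force non-negative data
    U, S, Vt = np.linalg.svd(A, full_matrices=False)
    Z, H = np.zeros((A.shape[0], k)), np.zeros((k, A.shape[1]))
    Z[:, 0] = np.sqrt(S[0]) * np.abs(U[:, 0])  # leading pair is sign-fixed
    H[0, :] = np.sqrt(S[0]) * np.abs(Vt[0, :])
    for j in range(1, k):
        u, v = U[:, j], Vt[j, :]
        up, un = np.maximum(u, 0), np.maximum(-u, 0)
        vp, vn = np.maximum(v, 0), np.maximum(-v, 0)
        # keep whichever signed section of the singular pair carries more energy
        mp = np.linalg.norm(up) * np.linalg.norm(vp)
        mn = np.linalg.norm(un) * np.linalg.norm(vn)
        x, y, sig = (up, vp, mp) if mp >= mn else (un, vn, mn)
        Z[:, j] = np.sqrt(S[j] * sig) * x / (np.linalg.norm(x) + tiny)
        H[j, :] = np.sqrt(S[j] * sig) * y / (np.linalg.norm(y) + tiny)
    return Z, H
```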

For the GNMF experimental setup, we chose a suitable number of neighbours (in our case 5), using visualization of the datasets with Laplacian Eigenmaps (Belkin & Niyogi, 2001), such that we had visually distinct clusters.

Important for the experimental setup is the selected structure of the multi-layered models. After careful preliminary experimentation, we focused on experiments that involve two-hidden-layer architectures for the Deep Semi-NMF and Multi-layer NMF. Although we experimented with a higher number of layers, approximating the additional factors/layers did not seem to have a significant impact on our results for these datasets, and was significantly more computationally expensive. We specifically experimented with models that had a first hidden representation H1 with 625 features, and a second representation H2 with a number of features that ranged from 20 to 70. Again, we chose these numbers after preliminary results showed us that the computational burden of additional features outweighed the performance increase obtained by increasing these numbers.
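In terms of the sketches above, such an architecture corresponds to a call like the following (hypothetical shapes):

```python
# X: 900 x 13230 pixel-intensity matrix (30 x 30 Multi-PIE crops, column-wise)
# Zs, Hs = deep_semi_nmf(X, layer_sizes=[625, 60])
# Hs[-1] (60 x 13230) is the top-layer representation used for clustering.
```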

    4.2. Reconstruction Error Results

Our first experiment was to evaluate whether the extra layers, which naturally introduce more factors and are therefore more difficult to optimize, result in a lower-quality local optimum.

Figure 4. XM2VTS – Pixel Intensities: NMI and Accuracy for clustering based on the representations learned by each model with respect to identities, plotted against the number of components (plots omitted). The deep architectures are comprised of 2 representation layers (1260-625-a) and the representations used were from the top layer. AUC scores: NMI: Deep Semi-NMF 43.63, Semi-NMF 42.82, NMF (MUL) 41.77, Multi-layer NMF 41.45, NeNMF 41.65, GNMF 42.24. Accuracy: Deep Semi-NMF 31.31, Semi-NMF 29.77, NMF (MUL) 27.59, Multi-layer NMF 27.21, NeNMF 27.66, GNMF 28.89.

Figure 5. CMU PIE – Pixel Intensities: NMI and Accuracy for clustering based on the representations learned by each model with respect to identities, plotted against the number of components (plots omitted). The deep architectures are comprised of 2 representation layers (1024-625-a) and the representations used were from the top layer. AUC scores: NMI: Deep Semi-NMF 37.92, Semi-NMF 32.17, NMF (MUL) 35.37, Multi-layer NMF 36.41, NeNMF 37.74, GNMF 32.60. Accuracy: Deep Semi-NMF 31.04, Semi-NMF 22.79, NMF (MUL) 26.17, Multi-layer NMF 29.11, NeNMF 29.14, GNMF 21.23.

We evaluated how well the matrix decomposition is performed by calculating the reconstruction error, the Frobenius norm of the difference between the original data and the reconstruction, for all the methods we compared. Note that, in order to have comparable results, all of the methods have the same stopping criterion rules. We set the maximum number of iterations to 300 (usually ~100 epochs are enough) and we use the convergence rule $E_{i-1} - E_i \le \kappa \max(1, E_{i-1})$ in order to stop the process when the reconstruction error ($E_i$) between the current and previous update is small enough. In our experiments we set $\kappa = 10^{-6}$. Table 1 shows the change in reconstruction error with respect to the selected number of features in H2 for all the methods we used on the Multi-PIE dataset. The results for the other datasets and for a larger variety of numbers of components were similar and are excluded due to lack of space.

The results show that Semi-NMF and Deep Semi-NMF consistently manage to reach a much lower reconstruction error than the rest, which matches our expectations, as they do not constrain the weights Z to be non-negative. What is important to note here is that the Deep Semi-NMF models do not have a significantly higher reconstruction error than the equivalent Semi-NMF models, even though the approximation involves more factors. This is in contrast to the Multi-layer NMF and GNMF, which sacrifice reconstruction quality in return for uncovering more meaningful features than their NMF counterpart.

Table 2. Clustering accuracy according to pose and identity labels on the CMU Multi-PIE dataset, using two Deep Semi-NMF models with two hidden layers each: (1) 625-60 and (2) 625-70.

      Pose Accuracy (%)    Identity Accuracy (%)
#       H1       H2           H1       H2
1     27.58    23.11         9.37    16.59
2     28.18    23.25         9.48    16.67

    4.3. Clustering Results

After achieving a satisfactory reconstruction error for our method, we proceeded to evaluate the features learned at the final representation layer by using k-means clustering, following the experimental protocol in (Cai et al., 2011). To assess the clustering quality of the representations produced by each of the algorithms we compared, we take advantage of the fact that the datasets are already labelled. The two metrics used were the accuracy (AC) and the normalized mutual information metric (NMI), as those are defined in (Xu et al., 2003).
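Since the AC metric needs cluster indices matched to ground-truth labels, a common implementation uses the Hungarian algorithm over the confusion matrix. The sketch below (assuming scikit-learn and scipy are available; our reading of the protocol, not the authors' code) computes both metrics for a feature matrix in the column-wise convention of this paper.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment
from sklearn.cluster import KMeans
from sklearn.metrics import confusion_matrix, normalized_mutual_info_score

def clustering_scores(H, y_true, n_clusters):
    """AC and NMI of k-means on a (k x n) feature matrix H, columns = samples."""
    y_pred = KMeans(n_clusters=n_clusters, n_init=10).fit_predict(H.T)
    C = confusion_matrix(y_true, y_pred)
    row, col = linear_sum_assignment(-C)  # best one-to-one label matching
    ac = C[row, col].sum() / len(y_true)
    nmi = normalized_mutual_info_score(y_true, y_pred)
    return ac, nmi
```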

Figures 3-5 show the comparison in clustering accuracy and NMI when using k-means on the feature representations produced by each of the techniques we compared, when our input matrix contained only the pixel intensities of each image. Our method significantly outperforms every method we compared it with on all three datasets. The difference is not as obvious in Figure 5, which could perhaps be a sign that our method is able to use the warping technique we used to its advantage more so than the other techniques we used.

By using IGOs, the Deep Semi-NMF was able to outperform the single-layer Semi-NMF, as shown in Figures 6-8. Making use of these simple mixed-sign features improved the clustering accuracy considerably, especially for XM2VTS and CMU PIE. It should be noted that in all cases, with the exception of the XM2VTS experiment with IGOs, our Deep Semi-NMF outperformed all other methods with a difference in performance that is statistically significant (paired t-test, p ≪ 0.01).


Figure 6. CMU MultiPIE – IGO: NMI and Accuracy scores for clustering based on the representations learned by each model with respect to identities, plotted against the number of components (plots omitted). The deep architectures are comprised of 2 hidden layers (1988-625-a) and the representations used were from the top layer. AUC scores: NMI: Deep Semi-NMF 16.46, Semi-NMF 15.93. Accuracy: Deep Semi-NMF 6.69, Semi-NMF 6.27.

Figure 7. XM2VTS – IGO: NMI and Accuracy scores for clustering based on the representations learned by each model with respect to identities, plotted against the number of components (plots omitted). The deep architectures are comprised of 2 hidden layers (2520-625-a) and the representations used were from the top layer. AUC scores: NMI: Deep Semi-NMF 45.92, Semi-NMF 45.85. Accuracy: Deep Semi-NMF 36.44, Semi-NMF 36.13.

    4.4. Clustering with Respect to Different Attributes

Finally, we conducted experiments by performing k-means clustering on each of the two representations learned by our Deep Semi-NMF models, using the raw intensities of our warped images of CMU Multi-PIE. We only used Multi-PIE since we only had identity labels for our other datasets. Since the warping technique we use removes most of the variability shown in expressions, we evaluated how well we did on clustering according to pose and identities. As one can see in Table 2, our first layer indeed learns representations that are better suited for clustering according to poses. On the other hand, we do confirm that our final layer indeed learns representations that are better suited for clustering according to identities, thus confirming our secondary hypothesis, i.e., that the hidden representation in each layer is in fact most suited for clustering according to the attribute that corresponds to that layer. As one can see in Figure 3, by learning the hidden representation that relates to poses, we are able to achieve significantly better results compared to the Semi-NMF, where we learn only one level of representation.

Figure 8. CMU PIE – IGO: NMI and Accuracy scores for clustering based on the representations learned by each model with respect to identities, plotted against the number of components (plots omitted). The deep architectures are comprised of 2 hidden layers (2048-625-a) and the representations used were from the top layer. AUC scores: NMI: Deep Semi-NMF 43.70, Semi-NMF 41.37. Accuracy: Deep Semi-NMF 36.04, Semi-NMF 32.43.

5. Conclusion

We have introduced a novel deep architecture for semi-non-negative matrix factorization, the Deep Semi-NMF, that is able to automatically learn a hierarchy of attributes of a given dataset, as well as representations suited for clustering according to these attributes. We have also presented an algorithm for optimizing the factors of our Deep Semi-NMF, and we evaluated its performance compared to the single-layered Semi-NMF and other related work on the problem of clustering faces with respect to their identities. We have shown that our technique is able to learn a high-level, final-layer representation for clustering with respect to the attribute with the lowest variability in the case of three popular datasets of face images, outperforming the other NMF-based techniques.

The next obvious step is to explore the suitability of such learned hidden representations for performing classification. Another line of work that would aid the learning process is the initialization scheme for pre-training the Deep Semi-NMF model, as currently we used a rough approximation of the initial Z⁰, H⁰ matrices using the NNDSVD algorithm. Finally, future avenues include experimenting with other applications, e.g. in the area of speech recognition, especially multi-source speech recognition, and we will investigate multilinear extensions of the proposed framework (Zafeiriou, 2009b;a).

Acknowledgements

We would like to thank our anonymous reviewers for their useful comments. George Trigeorgis is a recipient of the fellowship of the Department of Computing, Imperial College London, and this work was partially funded by it. The work of Konstantinos Bousmalis was funded partially by the Google Europe Fellowship in Social Signal Processing. The work of Stefanos Zafeiriou was partially funded by the EPSRC project EP/J017787/1 (4D-FAB).


References

Ahn, Jong-Hoon, Choi, Seungjin, and Oh, Jong-Hoon. A multiplicative up-propagation algorithm. In Proceedings of the Twenty-First International Conference on Machine Learning, pp. 3. ACM, 2004.

Belkin, Mikhail and Niyogi, Partha. Laplacian eigenmaps and spectral techniques for embedding and clustering. In NIPS, volume 14, pp. 585–591, 2001.

Berry, Michael W. and Browne, Murray. Email surveillance using non-negative matrix factorization. Computational & Mathematical Organization Theory, 11(3):249–264, 2005.

Boutsidis, Christos and Gallopoulos, Efstratios. SVD based initialization: A head start for nonnegative matrix factorization. Pattern Recognition, 41(4):1350–1362, 2008.

Brunet, Jean-Philippe, Tamayo, Pablo, Golub, Todd R., and Mesirov, Jill P. Metagenes and molecular pattern discovery using matrix factorization. Proceedings of the National Academy of Sciences, 101(12):4164–4169, 2004.

Cai, Deng, He, Xiaofei, Han, Jiawei, and Huang, Thomas S. Graph regularized nonnegative matrix factorization for data representation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 33(8):1548–1560, 2011.

Cichocki, Andrzej and Zdunek, Rafal. Multilayer nonnegative matrix factorization. Electronics Letters, 42:947–948, 2006.

Ding, C., He, X., and Simon, H. D. On the equivalence of nonnegative matrix factorization and spectral clustering. In Proc. SIAM Data Mining, 2005.

Devarajan, Karthik. Nonnegative matrix factorization: an analytical and interpretive tool in computational biology. PLoS Computational Biology, 4(7):e1000029, 2008.

Ding, Chris H. Q., Li, Tao, and Jordan, Michael I. Convex and semi-nonnegative matrix factorizations. IEEE Transactions on Pattern Analysis and Machine Intelligence, 32(1):45–55, 2010.

Golub, Gene H. and Reinsch, Christian. Singular value decomposition and least squares solutions. Numerische Mathematik, 14(5):403–420, 1970.

Gross, Ralph, Matthews, Iain, Cohn, Jeffrey, Kanade, Takeo, and Baker, Simon. Multi-PIE. Image and Vision Computing, 28(5):807–813, 2010.

Guan, Naiyang, Tao, Dacheng, Luo, Zhigang, and Yuan, Bo. NeNMF: an optimal gradient method for nonnegative matrix factorization. IEEE Transactions on Signal Processing, 60(6):2882–2898, 2012.

Hinton, Geoffrey E. and Salakhutdinov, Ruslan R. Reducing the dimensionality of data with neural networks. Science, 313(5786):504–507, 2006.

Kotsia, Irene, Zafeiriou, Stefanos, and Pitas, Ioannis. A novel discriminant non-negative matrix factorization algorithm with applications to facial image characterization problems. IEEE Transactions on Information Forensics and Security, 2(3-2):588–595, 2007.

Lee, Daniel D. and Seung, H. Sebastian. Algorithms for non-negative matrix factorization. Advances in Neural Information Processing Systems, 13:556–562, 2001.

Lyu, Siwei and Wang, Xin. On algorithms for sparse multi-factor NMF. In Advances in Neural Information Processing Systems, pp. 602–610, 2013.

Messer, Kieron, Matas, Jiri, Kittler, Josef, Luettin, Juergen, and Maitre, Gilbert. XM2VTSDB: The extended M2VTS database. In Second International Conference on Audio and Video-based Biometric Person Authentication, volume 964, pp. 965–966, 1999.

Paatero, Pentti and Tapper, Unto. Positive matrix factorization: A non-negative factor model with optimal utilization of error estimates of data values. Environmetrics, 5(2):111–126, 1994.

Sagonas, C., Tzimiropoulos, G., Zafeiriou, S., and Pantic, M. 300 faces in-the-wild challenge: The first facial landmark localization challenge. In Computer Vision Workshops (ICCVW), 2013 IEEE International Conference on, pp. 397–403, Dec 2013a.

Sagonas, C., Tzimiropoulos, G., Zafeiriou, S., and Pantic, M. A semi-automatic methodology for facial landmark annotation. In Computer Vision and Pattern Recognition Workshops (CVPRW), 2013 IEEE Conference on, pp. 896–903, June 2013b.

Sim, Terence, Baker, Simon, and Bsat, Maan. The CMU pose, illumination, and expression database. IEEE Transactions on Pattern Analysis and Machine Intelligence, 25(12):1615–1618, 2003.

Song, Hyun Ah and Lee, Soo-Young. Hierarchical data representation model - multi-layer NMF. International Conference on Learning Representations, abs/1301.6316, 2013.

Wold, Svante, Esbensen, Kim, and Geladi, Paul. Principal component analysis. Chemometrics and Intelligent Laboratory Systems, 2(1):37–52, 1987.

Xu, Wei, Liu, Xin, and Gong, Yihong. Document clustering based on non-negative matrix factorization. In Proceedings of the 26th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 267–273. ACM, 2003.

Zafeiriou, Stefanos. Algorithms for nonnegative tensor factorization. In Tensors in Image Processing and Computer Vision, pp. 105–124. Springer, 2009a.

Zafeiriou, Stefanos. Discriminant nonnegative tensor factorization algorithms. IEEE Transactions on Neural Networks, 20(2):217–235, 2009b.

Zafeiriou, Stefanos and Petrou, Maria. Nonlinear non-negative component analysis algorithms. IEEE Transactions on Image Processing, 19(4):1050–1066, 2010.

Zafeiriou, Stefanos, Tefas, Anastasios, Buciu, Ioan, and Pitas, Ioannis. Exploiting discriminant information in nonnegative matrix factorization with application to frontal face verification. IEEE Transactions on Neural Networks, 17(3):683–695, 2006.

Zafeiriou, Stefanos, Tzimiropoulos, Georgios, Petrou, Maria, and Stathaki, Tania. Regularized kernel discriminant analysis with a robust kernel for face recognition and verification. IEEE Transactions on Neural Networks and Learning Systems, 23(3):526–534, 2012.

    Zafeiriou, Stefanos, Tzimiropoulos, Georgios, Petrou, Maria, andStathaki, Tania. Regularized kernel discriminant analysis witha robust kernel for face recognition and verification. NeuralNetworks and Learning Systems, IEEE Transactions on, 23(3):526–534, 2012.