SUBMITTED TO IEEE TRANSACTIONS ON IMAGE ...anand/pdf/TIP_Image...SUBMITTED TO IEEE TRANSACTIONS ON IMAGE PROCESSING 3 its rank [see Section II-A and Section III]. There exists a large

SUBMITTED TO IEEE TRANSACTIONS ON IMAGE PROCESSING 1

1


A Method for Compact Image Representation usingSparse Matrix and Tensor Projections onto

Exemplar Orthonormal BasesKarthik S. Gurumoorthy, Ajit Rajwade, Arunava Banerjee andAnand Rangarajan

Abstract—We present a new method for compact representa-tion of large image datasets. Our method is based on treatingsmall patches from a 2D image as matrices as opposed to theconventional vectorial representation, and encoding these patchesas sparse projections onto a set of exemplar orthonormal bases,which are learned a priori from a training set. The end resultis a low-error, highly compact image/patch representationthathas significant theoretical merits and compares favorably withexisting techniques (including JPEG) on experiments involvingthe compression of ORL and Yale face databases, as well asa database of miscellaneous natural images. In the context oflearning multiple orthonormal bases, we show the easy tunabilityof our method to efficiently represent patches of different com-plexities. Furthermore, we show that our method is extensible in atheoretically sound manner to higher-order matrices (‘tensors’).We demonstrate applications of this theory to compression ofwell-known color image datasets such as the GaTech and CMU-PIE face databases and show performance competitive withJPEG. Lastly, we also analyze the effect of image noise on theperformance of our compression schemes.

Index Terms—compression, compact representation, sparseprojections, singular value decomposition (SVD), higher-ordersingular value decomposition (HOSVD), greedy algorithm, tensordecompositions.

I. I NTRODUCTION

Most conventional techniques of image analysis treat imagesas elements of a vector space. Lately, there has been a steadygrowth of literature which regards images as matrices, e.g.[1], [2], [3], [4]. As compared to a vectorial method, thematrix-based representation helps to better exploit the spatialrelationships between image pixels, as an image is essentially atwo-dimensional function. In the image processing community,this notion of an image has been considered by Andrews andPatterson in [5], in the context of image coding by usingsingular value decomposition (SVD). Given any matrix, sayJ ∈ RM1×M2 , the SVD of the matrix expresses it as a productof a diagonal matrixS ∈ RM1×M2 of singular values, withtwo orthonormal matricesU ∈ RM1×M1 and V ∈ RM2×M2 .This decomposition is unique upto a sign factor on the vectorsof U andV if the singular values are all distinct [6]. For thespecific case of natural images, it has been observed that thevalues along the diagonal ofS are rapidly decreasing1, which

The authors are with the Department of Computer and Information Scienceand Engineering, University of Florida, Gainesville, FL 32611, USA. e-mail:{ksg,avr,arunava,anand}@cise.ufl.edu.

1A related empirical observation is the rapid decrease in thepower spectraof natural images with increase in spatial frequency [7].

allows for the low-error reconstruction of the image by low-rank approximations. This property of SVD, coupled with thefact that it provides the best possible lower-rank approximationto a matrix, has motivated its application to image compressionin the work of Andrews and Patterson [5], and Yang and Lu[8]. In [8], the SVD technique is further combined with vectorquantization for compression applications.

To the best of our knowledge, the earliest technique onobtaining acommonmatrix-based representation for a set ofimages was developed by Rangarajan in [1]. In [1], asinglepair of orthonormal bases(U, V ) is learned from a set of someN images, and each imageIj , 1 ≤ j ≤ N is representedby means of a projectionSj onto these bases, i.e. in theform Ij = USjV

T . Similar ideas are developed by Ye in[2] in terms of optimal lower-rank image approximations.In [3], Yang et al. develop a technique named 2D-PCA,which computes principal components of a column-columncovariance matrix of a set of images. In [9], Dinget al.presenta generalization of 2D-PCA, named as 2D-SVD, and someoptimality properties of 2D-SVD are investigated. The workin [9] also unifies the approaches in [3] and [2], and providesa non-iterative approximate solution with bounded error totheproblem of finding the single pair of common bases. In [4], Heet al. develop a clustering application, in which a single set oforthonormal bases is learned in such a way that projections ofa set of images on these bases are neighborhood-preserving.

It is quite intuitive to observe that, as compared to an entireimage, a small image patch is a simpler, more local entity,and hence can be more accurately represented by means of asmaller number of bases. Following this intuition, we chooseto regard an image as a set of matrices (one per image patch)instead of using a single matrix for the entire image as in[1], [2]. Furthermore, there usually exists a great deal ofsimilarity between a large number of patches in one imageor across several images of a similar kind. We exploit thisfact to learn a small number of full-sized orthonormal bases(as opposed to a single set of low-rank bases learned in [1],[2] and [9], or a single set of low-rank projection vectorslearned in [3]) to reconstruct a set of patches from a trainingset by means ofsparseprojections with least possible error.As we shall demonstrate later, this change of focus fromlearning lower-rank bases to learning full-rank bases but withsparse projection matrices, brings with itself several significanttheoretical and practical advantages. This is because it ismucheasier to adjust the sparsity of a projection matrix in ordertomeet a certain reconstruction error threshold than to adjust for


its rank [see Section II-A and Section III].There exists a large body of work in the field of sparse

image representation. Coding methods which express imagesas sparse combinations of bases using techniques such asthe discrete cosine transform (DCT) or Gabor-filter baseshave been quite popular for a long period. The DCT isalso an essential ingredient of the widely used JPEG imagecompression standard [10]. These developments have alsospurred interest in methods thatlearn a set of bases from aset of natural images as well as asparseset of coefficientsto represent images in terms of those bases. Pioneering workin this area has been done by Olshausen and Field [11], witha special emphasis on learning an over-complete set of basesand their sparse combinations, and also by Lewickiet al [12].Some recent noteworthy contributions in the field of over-complete representations include the highly interesting workin [13] by Aharon et al., which encodes image patches asa sparse linear combination of a set of dictionary vectors(learned from a training set). There also exist other techniquessuch as sparsified non-negative matrix factorization (NMF)[14] by Hoyer, which represent image datasets as a product oftwo large sparse non-negative matrices. An important featureof all such learning-based approaches (as opposed to thosethat use a fixed set of bases such as DCT) is their tunability todatasets containing aspecifictype of images. In this paper, wedevelop such a learning technique, but with the key differencethat our technique is matrix-based, unlike the aforementionedvector-based learning algorithms. The main advantage of ourtechnique is as follows: we learn a small number of pairs oforthonormal bases to represent ensembles of image patches.Given any such pair, the computation of a sparse projectionfor any image patch (matrix) with least possible reconstructionerror, can be accomplished by means of a very simple andoptimal greedy algorithm. On the other hand, the computationof the optimal sparse linear combination of an over-completeset of basis vectors to represent another vector is a well-knownNP-hard problem [15]. We demonstrate the applicability ofour algorithm to the compression of databases of face images,with favorable results in comparison to existing approaches.Our main algorithm on ensembles of 2D images or imagepatches, was presented earlier by us in [16]. In this paper, wepresent more detailed comparisons, including with JPEG, andalso study the effect of various parameters on our method,besides showing more detailed derivations of the theory.

In the current work, we bring out another significantextension of our previous algorithm - namely its elegantapplicability to higher-order matrices (commonly and usuallymistakenly termed tensors), with sound theoretical founda-tions. In this paper, we represent patches from color imagesas tensors (third-order matrices). Next, we learn a smallnumber of orthonormal matrices and represent these patchesas sparse tensor projections onto these orthonormal matrices.The tensor-representation for image datasets, as such, is notnew. Shashua and Levin [17] regard an ensemble of gray-scale(face) images, or a video sequence, as a single third-ordertensor, and achieve compact representation by means of lower-rank approximations of the tensor. However, quite unlike thecomputation of the rank of a matrix, the computation of the

rank of a tensor2 is known to be an NP-complete problem [19].Moreover, while the SVD is known to yield the best lower-rank approximation of a matrix in the 2D case, its higher-dimensional analog (known as higher-order SVD or HOSVD)does not necessarily produce the best lower-rank approxi-mation of a tensor (see the work of Lathauwer in [20] andLathauweret al. in [18] for an extensive development of thetheory of HOSVD and several of its properties). Nevertheless,for the sake of simplicity, HOSVD has been used to producelower-rank approximations of datasets (albeit a theoreticallysub-optimal one [18]) which work well in the context ofapplications ranging from face recognition under change inpose, illumination and facial expression as in [21] by Vasilescuand Terzopoulos (though in that particular paper, each imageis still represented as a vector) to dynamic texture synthesisfrom gray-scale and color videos in [22] by Costantinietal. In these applications, the authors demonstrate the efficacyof the multilinear representation in accounting for variabilityover several different modes (such as pose, illumination andexpression in [21], or spatial, chromatic and temporal in [22]),as opposed to simple vector-based image representations.

An iterative scheme for a lower-rank tensor approximationis developed in [20], but the corresponding energy functionis susceptible to the problem of local minima. In [17], twonew lower-rank approximation schemes are designed: a closed-form solution under the assumption that the actual tensor rankequals the number of images (which usually may not be truein practice), or else an iterative approximation in other cases.Although the latter iterative algorithm is proved to be conver-gent, it is not guaranteed to yield a global minimum. Wang andAhuja [23] also develop a new iterative rank-R approximationscheme using an alternating least-squares formulation, andalso present another iterative method that is specially tunedfor the case of third-order tensors. Very recently, Dingetal. [24] have derived upper and lower bounds for the errordue to low-rank truncation of the core tensor obtained inHOSVD (which is a closed-form decomposition), and use thistheory to find asingle common triple of orthonormal basesto represent a database of color images (represented as 3Darrays) with minimal error in theL2 sense. Nevertheless, thereis still no method of directly obtaining theoptimal lower-rank tensor approximation which is non-iterative in nature.Furthermore, the error bounds in [24] are applicable only whenthe entire set of images is coded using a single common tripleof orthonormal bases. Likewise, the algorithms presented in[23] and [17] also seek to find asinglecommon basis.

The algorithm we present here differs from the aforemen-tioned ones in the following ways: (1) All the aforementionedmethods learn a common basis, which may not be sufficientto account for the variability in the images. We do not learn asingle common basis-tuple, but asetof someK orthonormalbases to represent someN patches in the database of images,with each image and each patch being represented as a higher-order matrix. Note thatK is much less thanN . (2) We do notseek to obtain lower-rank approximations to a tensor. Rather,

2The rank of a tensor is defined as the smallest number of rank-1tensorswhose linear combination gives back the tensor, with a tensor being of rank-1if it is equal to the outer product of several vectors [18].


we represent the tensor as a sparse projection onto a chosentuple of orthonormal bases. This sparse projection, as we shalllater on show, turns out to be optimal and can be obtainedby a very simple greedy algorithm. We use our extension forthe purpose of compression of a database of color images,with promising results. Note that this sparsity based approachhas advantages in terms of coding efficiency as compared tomethods that look for lower-rank approximations, just as inthe 2D case mentioned before.

Our paper is organized as follows. We describe the theoryand the main algorithm for 2D datasets in Section II. SectionIII presents experimental results and comparisons with existingtechniques. Section IV presents our extension to higher-ordermatrices, with experimental results and comparisons in SectionV. We conclude in Section VI.

II. T HEORY: 2D IMAGES

Consider a set of images, each of sizeM1 × M2. Wedivide each image into non-overlapping patches of sizem1 ×m2, m1 ≪ M1, m2 ≪ M2, and treat each patch as a separatematrix. Exploiting the similarity inherent in these patches, weeffectively represent them by means of sparse projections onto(appropriately created) orthonormal bases, which we call as‘exemplar bases’. We learn these exemplarsa priori from aset of training image patches. Before describing the learningprocedure, we first explain the mathematical structure of theexemplars.

A. Exemplar Bases and Sparse Projections

Let P ∈ Rm1×m2 be an image patch. Using singular valuedecomposition (SVD), we can representP as a combinationof orthonormal basesU ∈ Rm1×m1 and V ∈ Rm2×m2 inthe form P = USV T , whereS ∈ Rm1×m2 is a diagonalmatrix of singular values. HoweverP can also be representedas a combination ofany set of orthonormal basesU and V ,different from those obtained from the SVD ofP . In thiscase, we haveP = USV T whereS turns out to be anon-diagonal matrix 3. Contemporary SVD-based compressionmethods leverage the fact that the SVD provides the bestlow-rank approximation to a matrix [8], [25]. We choose to departfrom this notion, and instead answer the following question:What sparsematrix W ∈ Rm1×m2 will reconstructP froma pair of orthonormal basesU and V with the least error‖P − UWV T ‖2? Sparsity is quantified by an upper boundT

on theL0 norm ofW , i.e. on the number of non-zero elementsin W (denoted as‖W‖0)4. We prove that theoptimalW withthis sparsity constraint is obtained by nullifying the least (inabsolute value)m1m2−T elements of the estimated projectionmatrix S = UT P V . Due to the ortho-normality ofU and V ,this simple greedy algorithm turns out to be optimalas weprove below:.Theorem 1: Given a pair of orthonormal bases(U , V ), the

3The decompositionP = USV T exists for anyP even if U and V

are not orthonormal. We still follow ortho-normality constraints to facilitateoptimization and coding. See section II-C and III-B.

4See section III for the merits of our sparsity-based approach over thelow-rank approach.

optimal sparse projection matrixW with ‖W‖0 = T isobtained by setting to zerom1m2 −T elements of the matrixS = UT P V having least absolute value.Proof: We haveP = USV T . The error in reconstructinga patch P using some other matrixW is e = ‖U(S −W )V T ‖2 = ‖S − W‖2 as U and V are orthonormal. LetI1 = {(i, j)|Wij = 0} and I2 = {(i, j)|Wij 6= 0}. Thene =

∑(i,j)∈I1

S2ij +

∑(i,j)∈I2

(Sij −Wij)2. This error will be

minimized whenSij = Wij in all locations whereWij 6= 0andWij = 0 at those indices where the corresponding valuesin S are as small as possible. Thus if we want‖W‖0 = T , thenW is the matrix obtained by nullifyingm1m2−T entries fromS that have the least absolute value and leaving the remainingelements intact.Hence, the problem of finding an optimal sparse projectionof a matrix (image patch) onto a pair of orthonormal bases,is solvable inO(m1m2(m1 + m2)) time as it requires justtwo matrix multiplications (S = UT PV ). On the other hand,the problem of finding the optimal sparse linear combinationsof an over-complete set of basis vectors in order to representanother vector (i.e. the vectorized form of an image patch)is a well-known NP-hard problem [15]. In actual applications[13], approximate solutions to these problems are sought, bymeans of pursuit algorithms such as OMP [26]. Unfortunately,the quality of the approximation in OMP is dependent onT ,with an upper bound on the error that is directly proportionalto

√1 + 6T [see [27], Theorem (C)] under certain conditions.

Similar problems exist with other pursuit approximation al-gorithms such as Basis Pursuit (BP) as well [27]. For largevalues ofT , there can be difficulties related to convergence,when pursuit algorithms are put inside an iterative optimizationloop. Our technique avoids any such dependencies.

B. Learning the Bases

The essence of this paper lies in a learning method to pro-duceK exemplar orthonormal bases{(Ua, Va)}, 1 ≤ a ≤ K,to encode a training set ofN image patchesPi ∈ Rm1×m2

(1 ≤ i ≤ N ) with least possible error (in the sense of theL2

norm of the difference between the original and reconstructedpatches). Note thatK ≪ N . In addition, we impose a sparsityconstraint that everySia (the matrix used to reconstructPi

from (Ua, Va)) has at mostT non-zero elements. The mainobjective function to be minimized is:

E({Ua, Va, Sia, Mia}) =

N∑

i=1

K∑

a=1

Mia‖Pi −UaSiaV Ta ‖2 (1)

subject to the following constraints:

(1)UTa Ua = V T

a Va = I, ∀a. (2) ‖Sia‖0 ≤ T, ∀(i, a).

(3)∑

a

Mia = 1, ∀i andMia ∈ {0, 1}, ∀i, a. (2)

HereMia is a binary matrix of sizeN × K which indicateswhether theith patch belongs to the space defined by(Ua, Va).Notice that we are not trying to use a mixture of orthonormalbases for the projection, but just project onto a single suitablebasis pair instead. The optimization of the above energyfunction is difficult, asMia is binary. A simple K-means


will lead to local minima issues, and so we relax the binarymembership constraint so that nowMia ∈ (0, 1), ∀(i, a),subject to

∑K

a=1 Mia = 1, ∀i. The naive mean field theoryline of development starting with [28] and culminating in [29],(with much of the theory developed already in [30]) leads tothe following deterministic annealing energy function:


N∑

i=1

K∑

a=1

Mia‖Pi − UaSiaV Ta ‖2+

1

β

∑

ia

Mia log Mia +∑

i

µi(∑

a

Mia − 1)+

∑

a

trace[Λ1a(UTa Ua − I)] +

∑

a

trace[Λ2a(V Ta Va − I)].

(3)

Note that in the above equation,{µi} are the Lagrangeparameters,{Λ1a, Λ2a} are symmetric Lagrange matrices, andβ a temperature parameter.

We first initialize {Ua} and {Va} to random orthonormalmatrices∀a, andMia = 1

K, ∀(i, a). As {Ua} and {Va} are

orthonormal, the projection matrixSia is computed by thefollowing rule:

Sia = UTa PiVa. (4)

Note that this is the solution to the least squares energyfunction‖Pi −UaSiaV T

a ‖2. Thenm1m2 −T elements inSia

with least absolute value are nullified to get the best sparseprojection matrix. The updates toUa andVa are obtained asfollows. Denoting the sum of the terms of the cost functionin Eqn. (1) that are independent ofUa as C, we rewrite thecost function as follows:


N∑

i=1

K∑

a=1

Mia‖Pi − UaSiaV Ta ‖2+

∑

a

trace[Λ1a(UTa Ua − I)] + C

=

N∑

i=1

K∑

a=1

Mia[trace(Pi − UaSiaV Ta )T (Pi − UaSiaV T

a ))]+

∑

a

trace[Λ1a(UTa Ua − I)] + C

=

N∑

i=1

K∑

a=1

Mia[trace(PTi Pi) − 2trace(PT

i UaSiaV Ta )

+trace(STiaSia)] +

∑

a

trace(Λ1a(UTa Ua − I)) + C.

(5)

Now taking derivatives w.r.t.Ua, we get:

∂E

∂Ua

= −2

N∑

i=1

Mia(PiVaSTia) + 2UaΛ1a = 0. (6)

Re-arranging the terms and eliminating the Lagrange matrices,we obtain the following update rule forUa:

Z1a =∑

i

MiaPiVaSTia; Ua = Z1a(ZT

1aZ1a)−1

2 . (7)

An SVD of Z1a will give us Z1a = Γ1aΨΥT1a where Γ1a

andΥ1a are orthonormal matrices andΨ is a diagonal matrix.Using this, we can re-writeUa as follows:

Ua = (Γ1aΨΥT1a)((Γ1aΨΥT

1a)T (Γ1aΨΥT

1a))−1

2

= (Γ1aΨΥT1a)(Υ1aΨΓT

1aΓ1aΨΥT1a)−

1

2

= (Γ1aΨΥT1a)(Υ1aΨ2ΥT

1a)−1

2

= (Γ1aΨΥT1a)(Υ1aΨ−1ΥT

1a)

= Γ1aΥT1a. (8)

Notice that the steps above followed only owing to the factthat Ua and henceΓ1a andΥ1a were full-rank matrices. Thisgives us an update rule forUa. A very similar update rule forVa follows along the same lines:

Z2a =

K∑

i=1

MiaPTi UaSia; Va = Z2a(ZT

2aZ2a)−1

2 = Γ2aΥT2a. (9)

whereΓ2a andΥ2a are obtained from the SVD ofZ2a. Themembership values are obtained by the following update:

Mia =e−β‖Pi−UaSiaV T

a‖2

∑K

b=1 e−β‖Pi−UbSibV T

b‖2

. (10)

The matrices{Sia, Ua, Va} andM are then updated sequen-tially following one another for a fixedβ value, until conver-gence. The value ofβ is then increased and the sequentialupdates are repeated. The entire process is repeated until anintegrality condition is met.Our algorithm could be modified to learn a set of orthonormalbases (single matrices and not a pair) for sparse represen-tation of vectorized images, as well. Note that even then,our technique should not be confused with the ‘mixtures ofPCA’ approach [31]. The emphasis of the latter is again onlow-rank approximations and not on sparsity. Furthermore,in our algorithm we do not compute an actual probabilisticmixture (also see Sections II-C and III-E). However, we havenot considered this vector-based variant in our experiments,because it does not exploit the fact that image patches aretwo-dimensional entities.

C. Application to Compact Image Representation

Our framework is geared towards compact butlow-errorpatch reconstruction. We are not concerned with thediscrim-inating assignment of aspecific kindof patches to aspecificexemplar, quite unlike in a clustering or classification appli-cation. In our method, after the optimization, each trainingpatch Pi (1 ≤ i ≤ N ) gets represented as a projectiononto one out of theK exemplar orthonormal bases, whichproduces the least reconstruction error, i.e. thekth exemplaris chosen if‖Pi − UkSikV T

k ‖2 ≤ ‖Pi − UaSiaV Ta ‖2, ∀a ∈

{1, 2, ..., K}, 1 ≤ k ≤ K. For patchPi, we denote thecorresponding ‘optimal’ projection matrix asS⋆

i = Sik, andthe corresponding exemplar as(U⋆

i , V ⋆i ) = (Uk, Vk). Thus

the entire training set is approximated by (1) thecommonsetof basis-pairs{(Ua, Va)}, 1 ≤ a ≤ K (K ≪ N ), and (2) theoptimal sparse projection matrices{S⋆

i } for each patch, with atmostT non-zero elements each. The overall storage per image


is thus greatly reduced (see also section III-B). Furthermore,these bases{(Ua, Va)} can now be used to encode patchesfrom a new set of images that are somewhat similar to the onesexisting in the training set. However, a practical applicationdemands that the reconstruction meet a specific error thresholdon unseen patches, and hence theL0 norm of the projectionmatrix of the patch is adjusted dynamically in order to meet theerror. Experimental results using such a scheme are describedin the next section.

III. E XPERIMENTS: 2D IMAGES

In this section, we first describe the overall methodologyfor the training and testing phases of our algorithm based onour earlier work in [16]. As we are working on a compressionapplication, we then describe the details of our image codingand quantization scheme. This is followed by a discussion ofthe comparisons between our technique and other competi-tive techniques (including JPEG), and an enumeration of theexperimental results.

A. Training and Testing Phases

We tested our algorithm for compression of the entire ORLdatabase [32] and the entire Yale database [33]. We divided theimages in each database into patches of fixed size (12 × 12),and these sets of patches were segregated into training andtest sets. The test set consisted of many more images than thetraining set, in the case of either database. For the purposeof training, we learned a total ofK = 50 orthonormal basesusing a fixed value ofT to control the sparsity. For testing, weprojected each patchPi onto that exemplar(U⋆

i , V ⋆i ) which

produced thesparsestprojection matrixS⋆i that yielded an

average per-pixel reconstruction error‖Pi−U⋆

iS⋆

iV ⋆

i

T ‖2

m1m2

of nomore than some chosenδ. Note that different test patchesrequired differentT values, depending upon their inherent‘complexity’. Hence, we varied the sparsity of the projectionmatrix (but keeping its size fixed to12 × 12), by greedilynullifying the smallest elements in the matrix, without lettingthe reconstruction error go aboveδ. This gave us the flexibilityto adjust to patches of different complexities, without alteringthe rank of the exemplar bases(U⋆

i , V ⋆i ). As any patch

Pi is projected onto exemplar orthonormal bases that aredifferent from those produced by its own SVD, the projectionmatrices turn out to be non-diagonal. Hence, there is nosuch thing as a hierarchy of ‘singular values’ as in ordinarySVD. As a result, we cannot resort to restricting the rank ofthe projection matrix (and thereby the rank of(U⋆

i , V ⋆i )) to

adjust for patches of different complexity (unless we learnaseparate set ofK bases, each set for a different rankr where1 < r < min(m1, m2), which would make the training andeven the image coding [see Section III-B] very cumbersome).This highlights an advantage of our approach over that ofalgorithms that adjust the rank of the projection matrices.

B. Details of Image Coding and Quantization

We obtainS⋆i by sparsifyingU⋆

iT PiV

⋆i . As U⋆

i andV ⋆i are

orthonormal, we can show that the values inS⋆i will always

lie in the range[−m, m] for m × m patches, if the values ofPi lie in [0, 1]. This can be proved as follows: We know thatS⋆

i = U⋆i

T PiV⋆i . Hence we write the element ofS⋆

i in theath

row andbth column as follows:

S⋆i ab =

∑

k,l

U⋆i

T

akPiklV⋆lb ≤

∑

kl

1√m

Pikl

1√m

=1

m

∑

kl

Pikl = m. (11)

The first step follows because the maximum value ofS⋆i ab

is obtained when all values ofU⋆i

Tak are equal to

√m. The

second step follows becauseS⋆i ab will meet its upper bound

when all them2 values in the patch are equal to 1. We cansimilarly prove that the lower bound onS⋆

i ab is −m. As inour case,m = 12, the upper and lower bounds are+12 and−12 respectively.In our experiment, we Huffman-encoded [34] the integer partsof the values in the{S⋆

i } matrices over the whole image(giving us an average of someQ1 bits per entry) and quantizedthe fractional parts withQ2 bits per entry. Thus, we neededto store the following information per test-patch to createthecompressed image: (1) the index of the best exemplar, usinga1 bits, (2) the location and value of each non-zero elementin its S⋆

i matrix, usinga2 bits per location andQ1 + Q2 bitsfor the value, and (3) the number of non-zero elements perpatch encoded usinga3 bits. Hence the total number of bitsper pixel for the whole image is given by:

RPP =N(a1 + a3) + T whole(a2 + Q1 + Q2)

M1M2. (12)

whereT whole =∑N

i=1 ‖S⋆i ‖0. The values ofa1, a2 and a3

were obtained by Huffman encoding.After performing quantization on the values of the coefficientsin S⋆

i , we obtained new quantized projection matrices, denotedasS⋆

i . Following this, the PSNR for each image was measuredas 10 log10

Nm1m2P

N

i=1‖Pi−U⋆

iS⋆

iV ⋆

i

T ‖2, and then averaged over the

entire test set. The average number of bits per pixel (RPP)was calculated as in Eqn. (12) for each image, and thenaveraged over the whole test set. We repeated this procedurefor different δ values from8 × 10−5 to 8 × 10−3 (range ofimage intensity values was[0, 1]) and plotted an ROC curveof average PSNR vs. average RPP.

C. Comparison with other techniques:

We pitted our method against four existing approaches, eachof which are both competitive and recent:

1) The KSVD algorithm from [13], for which we used441 unit norm dictionary vectors of size 144. In thismethod, each patch is vectorized and represented asa sparse linear combination of the dictionary vectors.During training, the value ofT is kept fixed, and duringtesting it is dynamically adjusted depending upon thepatch so as to meet the appropriate set of error thresholdsrepresented byδ ∈ [8×10−5, 8×10−3]. For the purposeof comparison between our method and KSVD, we used


the exact same values ofT andδ as in our algorithm, tofacilitate a fair comparison. The method used to find thesparse projection onto the dictionary was the orthogonalmatching pursuit (OMP) algorithm [27].

2) The over-complete DCT dictionary with 441 unit normvectors of size 144 created by sampling cosine waves ofvarious frequencies, again with the sameT value, andusing the OMP technique for sparse projection.

3) The SSVD method from [25], which is not a learningbased technique. It is based on the creation of sub-sampled versions of the image by traversing the imagein a manner different from usual raster-scanning.

4) The JPEG standard (for which we used the implementa-tion provided in MATLAB R©), for which we measuredthe RPP from the number of bytes for storage of the file.

For KSVD and over-complete DCT, there exist bounds onthe values of the coefficients that are very similar to thosefor our technique, as presented in Eqn. (11). For both thesemethods, the RPP value was computed using the formula in[13], Eqn. (27), with the modification that the integer partsof the coefficients of the linear combination were Huffman-encoded and the fractional parts separately quantized (as itgave a better ROC curves for these methods). For the SSVDmethod, the RPP was calculated as in [25], section (5).

D. Results

For the ORL database, we created a training set of patchesof size 12 × 12 from images of 10 different people, with10 images per person. Patches from images of the remaining30 people (10 images per person) were treated as the testset. From the training set, a total of 50 pairs of orthonormalbases were learned using the algorithm described in SectionII-B. The T value for sparsity of the projection matriceswas set to 10 during training. As shown in Figure III-D(c),we also computed the variance in the RPP values for everydifferent pre-quantization error value, for each image in theORL database. Note that the variance in RPP decreases withincrease in specified error. This is because at very low errors,different images require different number of coefficients inorder to meet the error.

The same experiment was run on the Yale database with avalue ofT = 3 on patches of size12 × 12. The training setconsisted of a total of 64 images of one and the same personunder different illumination conditions, whereas the testingset consisted of 65 images each of the remaining 38 peoplefrom the database (i.e. 2470 images in all), under differentillumination conditions. The ROC curves for our method weresuperior to those of other methods over a significant rangeof δ, for the ORL as well as the Yale database, as seen inFigures 1(a) and 1(b). Sample reconstructions for an imagefrom the ORL database are shown in Figures 1(d), 1(f), 1(g)and 1(h) for δ = 3 × 10−4. For this image, our methodproduced a better PSNR to RPP ratio than others. Also Figure2 shows reconstructed versions of another image from theORL database using our method with different values ofδ.For experiments on the ORL database, the number of bitsused to code the fractional parts of the coefficients of the

0 2 4 620

25

30

35

40

45

RPP

PS

NR

ORL: PSNR vs RPP ; T = 10

(1) OurMethod(2) KSVD(3) OverComplete DCT(4) SSVD(5) JPEG

(a)

0 0.5 1 1.520

25

30

35

40

45

RPP

PS

NR

Yale: PSNR vs RPP ; T = 3

(1) OurMethod(2) KSVD(3) OverComplete DCT(4) SSVD(5) JPEG

1

2 34

5

(b)

0 1 2 3 4 5 6 7 8

x 10−3

0

0.05

0.1

0.15

0.2

0.25

0.3

0.35

Error

Var

ianc

e in

RP

P

Variance in RPP vs. error: ORL Database

(c) (d) (e) (f)

(g) (h)

Fig. 1. ROC curves on (a) ORL and (b) Yale Databases. Legend- Red (1):Our Method, Blue (2): KSVD, Green (3): Over-complete DCT, Black (4):SSVD and (5): JPEG (Magenta). (c) Variance in RPP versus pre-quantizationerror for ORL. (d) Original image from ORL database. Sample reconstructionswith δ = 3 × 10−4 of (d) by using (e) Our Method [RPP: 1.785, PSNR:35.44], (f) KSVD [RPP: 2.112, PSNR: 35.37], (g) Over-complete DCT [RPP:2.929, PSNR: 35.256], (h) SSVD [RPP: 2.69, PSNR: 34.578].These plots areviewed best in color.

Original Error 0.003 Error 0.001

Error 0.0005 Error 0.0003 Error 0.0001

Fig. 2. Reconstructed versions of an image from the ORL database using5 different error values. The original image is on the extreme left and topcorner.


Method Dictionary Size (number of scalars)Our Method 50 × 2 × 12 × 12 = 14400

KSVD 441 × 12 × 12 = 63504Overcomplete DCT 441 × 12 × 12 = 63504

SSVD -JPEG -

TABLE ICOMPARISON OF DICTIONARY SIZE FOR VARIOUS METHODS FOR

EXPERIMENTS ON THEORL AND YALE DATABASES.

0 1 2 3 420

25

30

35

40

RPP

PS

NR

ORL: PSNR vs. RPP, Diff. Patch Size

8X8 with T = 512X12 with T = 1015X15 with T = 1520X20 with T = 20

(a)

0 1 2 3 4 520

25

30

35

40

RPP

PS

NR

ORL: PSNR vs RPP; Diff. Values of T

T = 5T = 10T = 20T = 50

(b)

Fig. 3. ROC curves for the ORL database using our technique with (a)different patch size and (b) different value ofT for a fixed patch size of12 × 12. These plots are viewed best in color.

projection matrices [i.e.Q2 in Eqn. (12)] was set to 5. Forthe Yale database, we often obtained pre-quantization errorssignificantly less than the chosenδ, and hence using a valueof Q2 less than 5 bits often did not raise the post-quantizationerror aboveδ. Keeping this in mind, we varied the value ofQ2 dynamically, for each pre-quantization error value. Thesame variation was applied to the KSVD and over-completeDCT techniques as well. For the experiments on these twodatabases, we also summarize the dictionary size of all therelevant methods being compared in Table I.

E. Effect of Different Parameters on the Performance of ourMethod:

The different parameters in our method include: (1) the sizeof the patches for training and testing, (2) the number of pairsof orthonormal bases, i.e.K, and (3) the sparsity of each patchduring training, i.e.T . In the following, we describe the effectof varying these parameters:

1) Patch Size: If the size of the patch is too large, itbecomes more and more unlikely that a fixed number oforthonormal matrices will serve as adequate bases forthese patches in terms of a low-error sparse representa-tion, owing to the curse of dimensionality. Furthermore,if the patch size is too large, any algorithm will lose outon the advantages of locality and simplicity. However,if the patch size is too small (say2 × 2), there isvery little a compression algorithm can do in termsof lowering the number of bits required to store thesepatches. We stress that the patch size as a free parameteris something common toall algorithms to which weare comparing our technique (including JPEG). Also,the choice of this parameter is mainly empirical. We

tested the performance of our algorithm withK = 50pairs of orthonormal bases on the ORL database, usingpatches of size8 × 8, 12 × 12, 15 × 15 and 20 × 20,with appropriately different values ofT . As shown inFigure 3(a), the patch size of12 × 12 using T = 10yielded the best performance, though other patch sizesalso performed quite well.

2) The number of orthonormal basesK: The choice ofthis number is not critical, and can be set to as higha value as desired without affecting the accuracy ofthe results. This parameter should not be confused withthe number of mixture components in standard mixture-model based density estimation, because in our methodeach patch gets projected only onto asingle set oforthonormal bases, i.e. we do not compute a combinationof projections onto all theK different orthonormal basispairs. The only down-side of a higher value ofK is theadded computational cost during training. Again notethat this parameter will be part of any learning-basedalgorithm for compression that uses a set of bases toexpress image patches.

3) The value ofT during training: The value ofT isfixed only during training, and is varied for each patchduring the testing phase so as to meet the requirederror threshold. A very high value ofT during trainingcan cause the orthonormal basis pairs to overfit to thetraining data (variance), whereas a very low value couldcause a large bias error. This is an instance of thewell-known bias-variance tradeoff common in machinelearning algorithms [35]. Our choice ofT was empirical,though the value ofT = 10 performed very wellon nearly all the datasets we ran our experiments on.The effect of different values ofT on the compressionperformance for the ORL database is shown in Figure3(b). We again emphasize that the issue with the choiceof an ‘optimal’ T is not an artifact of our algorithmper se. For instance, this issue of a choice ofT willalso be encountered in pursuit algorithms to find theapproximately optimal linear combination of unit vectorsfrom a dictionary. In the latter case, however, there willalso be the problem of poorer and poorer approximationerrors asT increases, under certain conditions on thedictionary of overcomplete basis vectors.

F. Performance on Random Collections of Images

We performed additional experiments to study the behaviorof our algorithm when the training and test sets were verydifferent from one another. To this end, we used the databaseof 155 natural images (in uncompressed format) from theComputer Vision Group at the University of Granada [36].The database consists of images of faces, houses, naturalscenery, inanimate objects and wildlife. We converted allthe given images to grayscale. Our training set consisted ofpatches (of size12 × 12) from 11 images. Patches from theremaining images were part of the test set. Though the imagesin the training and test sets were very different, our algorithmproduced excellent results superior to JPEG upto an RPP value


0 0.5 1 1.5 2 2.5 3 3.520

25

30

35

40

RPP

PS

NR

CVG : RPP vs PSNR , T0 = 10

OurMethod

JPEG

(a)

0 1 2 3 4 520

25

30

35

40

RPP

PS

NR

UCID : RPP vs PSNR

Our Method T0=5

JPEG

(b)

Fig. 4. PSNR vs. RPP curve for our method and JPEG on (a) the CVGGranada database and (b) the UCID database.

of 3.1 bits, as shown in Figure 4(a). Three training images andreconstructed versions of three different test images for errorvalues of 0.0002, 0.0006 and 0.002 respectively, are shownin Figure III-F. To test the effect of minute noise on theperformance of our algorithm, a similar experiment was runon the UCID database [37] with a training set of 10 imagesand the remaining 234 test images from the database. Thetest images were significantly different from the ones used fortraining. The images were scaled down by a factor of two(i.e. to320×240 instead of640×480) for faster training andsubjected to zero mean Gaussian noise of variance 0.001 (ona scale from 0 to 1). Our algorithm was yet again competitivewith JPEG as shown in Figure 4(b).

G. Comparison with Wavelet Coding

In this section, we present experiments comparing theperformance of our method with JPEG2000 and also with awavelet-based compression method involving a very simplequantization scheme that employs Huffman encoding, similarto the quantization scheme we employed for our technique. Wewould like to emphasize that JPEG2000 is an industry standardthat was developed by several researchers for over a decade.Its strongest point is its superlative quantization and encodingscheme, rather than the mere fact that it uses wavelet bases[38]. On the other hand, a wavelet-coding method employinga much simpler quantization scheme (similar to the one weused in our technique) performs quite poorly in practice (much

(a) (b) (c)

(d) (e) (f)

(g) (h) (i)

(j) (k) (l)

Fig. 5. (a) to (c): Training images from the CVG Granada database. (d) to(l): Reconstructed versions of three test images for error values of 0.0002,0.0006 and 0.002 respectively.

worse than JPEG), with one of its major problems being thatthere are no bounds that can be imposed on the value of thewavelet coefficients in terms of the image intensity range orpatch size. In our experiments, the typical range of coefficientvalues turned out to be around104. As opposed to this, thevalues of the coefficients of the projected matrices in ourtechnique are (provably) bounded between−m and +m forpatch sizes ofm × m (as shown in Section (3B)).In our wavelet coding strategy, we performed a decompositionof upto four levels (level 1 giving us higher RPP and higherPSNR, and level 4 giving us low RPP and low PSNR).For any decomposition, the LL sub-band coefficients (alsocalled the approximation coefficients) of the final level werenever thresholded. The integer parts of these coefficientswere Huffman encoded (just as in our technique) and 4 bitswere allocated for the fractional part. Having more bits forthe fractional part did not improve the ROC curves in ourexperiments. The coefficients of the other sub-bands (HH,HL, LH) at every level were thresholded using the commonlyused Birge-Massart strategy (which is used in the MATLABwavelet toolbox for several applications including compres-sion). Different thresholds were used for each subband at eachlevel. Again, the thresholded values were coded using Huffmanencoding on the integer part followed by 4 bits for the floatingpoint. The specific type of wavelet that we used was ‘sym4’


0 0.5 1 1.5 2 2.5 3 3.5 4 4.520

25

30

35

40

45

Our Method

JASPER

JPEG

Wavelet SYM4

Wavelet DB7

WAVELET BIOR6.8

(a)

0 0.2 0.4 0.6 0.8 1 1.2 1.422

24

26

28

30

32

34

36

38

40

42

Our MethodJASPERJPEGWavelet SYM4

(b)

0 0.5 1 1.5 2 2.5 3 3.5 4 4.520

22

24

26

28

30

32

34

36

38

40


(c)

Fig. 6. PSNR vs. RPP curve for our method, JPEG, JPEG2000 (JASPER)and wavelet coding with ‘sym4’ wavelets on (a) ORL Database,(b) YaleDatabase and (c) CVG Database.

as it gave better performance than other wavelets that we tried(such as ‘db7’, biorthogonal wavelets etc).We performed experiments comparing the compression perfor-mance of the following four techniques: our method, JPEG,JPEG2000 (the specific implementation being the commonlyavailable Jasper package [39]), wavelet methods using ‘sym4’followed by a simple quantizer (as decribed earlier). In Figure6, we show results on ORL, Yale and CVG databases. Forthe ORL database, we also present results with db7 andbiorthogonal wavelets. As these performed much worse thansym4, we do not show results with them on other databases. Asis evident from Figure 6, our method as well as JPEG (usingDCT) far outperformed the simple wavelet based techniques,though they did not outperform JPEG2000. Note that there areother reports in the literature (such as [40]) that indicatea farsuperior performance of JPEG over wavelet-based methodsthat did not use sophisticated quantizers, in particular forhigher bit rates.

Thus we believe that our results will improve further ifa more sophisticated quantizer/coder is used in conjunctionwith bases learned by our method. The improvement of thequantization scheme for our technique is well beyond thescope of the current paper, and is an avenue for future re-search. Nevertheless, we observed that the performance of ourtechnique was competitive with JPEG2000 for low bit rates,when negligible amounts of noise (i.i.d. zero-mean Gaussiannoise of variance8 × 10−4 or variance10−3) were addedto the images. It should be noted that the visual effect ofsuch small amounts of noise is not obvious to the human eye.At lower bit rates, we observed that our method sometimesbeat JPEG2000. This is despite the fact that wetrained ourmethod on noiseless images, and tested on noisy images.Also,in all our experiments, the size of the test size wasseveraltimes larger than the size of the training set. The results ofour method, JPEG, JPEG2000 and wavelet coding with sym4

0 0.5 1 1.5 2 2.5 3 3.5 422

24

26

28

30

32

34

36

38


(a)

0 0.5 1 1.5 2 2.5 322

24

26

28

30

32

34

36

38


(b)

0 0.5 1 1.5 2 2.5 3 3.5 4 4.520

22

24

26

28

30

32

34

36

38


(c)

Fig. 7. PSNR vs. RPP curve for our method, JPEG, JPEG2000 (JASPER)and wavelet coding with ‘sym4’ wavelets on (a) ORL Database,(b) YaleDatabase and (c) CVG Database, with the images subjected to iid Gaussiannoise of mean zero and variance8 × 10−4.

0 0.5 1 1.5 2 2.5 3 3.5 422

24

26

28

30

32

34

36

38


(a)

0 0.5 1 1.5 2 2.522

24

26

28

30

32

34

36

38

40


(b)

0 0.5 1 1.5 2 2.5 3 3.5 420

22

24

26

28

30

32

34

36


(c)

Fig. 8. PSNR vs. RPP curve for our method, JPEG, JPEG2000 (JASPER)and wavelet coding with ‘sym4’ wavelets on (a) ORL Database,(b) YaleDatabase and (c) CVG Database, with the images subjected to iid Gaussiannoise of mean zero and variance10−3.

wavelets are shown in Figures 7 and 8 for two levels of noiseon the ORL, Yale and CVG databases.

IV. T HEORY: 3D (OR HIGHER-D) IMAGES

We now consider a set of images represented as third-ordermatrices (say RGB face images), each of sizeM1×M2×M3.We divide each image into non-overlapping patches of sizem1 × m2 × m3, m1 ≪ M1, m2 ≪ M2, m3 ≪ M3, andtreat each patch as a separate tensor. Just as before, we startby exploiting the similarity inherent in these patches, andrepresent them by means of sparse projections onto a triple ofexemplar orthonormal bases. Again, we learn these exemplars


a priori from a set of training image patches. We would liketo point out that our extension to higher dimensions is non-trivial, especially when considering that a key property oftheSVD in 2D (namely that SVD provides the optimal lower-rankreconstruction by simple nullification of the lowest singularvalues) does not extend into higher-dimensional analogs suchas HOSVD [18]. In the following, we now describe themathematical structure of the exemplars. We would like toemphasize that though the derivations presented in this paperare for 3D matrices, a very similar treatment is applicable tolearningn-tuples of exemplar orthonormal bases to representpatches fromn-D matrices (wheren ≥ 4).

A. Exemplar Bases and Sparse Projections

Let P ∈ Rm1×m2×m3 be an image patch. Using HOSVD,we can representP as a combination of orthonormal basesU ∈ Rm1×m1 , V ∈ Rm2×m2 and W ∈ Rm3×m3 in theform P = S ×1 U ×2 V ×3 W , whereS ∈ Rm1×m2×m3 istermed the core-tensor. The operators×i refer to tensor-matrixmultiplication over different axes. The core-tensor has specialproperties such as all-orthogonality and ordering. For furtherdetails of the core tensor and other aspects of multilinearalgebra, the reader is referred to [18]. Now,P can also berepresented as a combination ofany set of orthonormal basesU , V andW , different from those obtained from the HOSVDof P . In this case, we haveP = S ×1 U ×2 V ×3 W whereS is not guaranteed to be an all-orthogonal tensor, nor is itguaranteed to obey the ordering property.

As the HOSVD does not necessarily provide the bestlow-rank approximation to a tensor [18], we choose to departfrom this notion (instead of settling for a sub-optimal approx-imation), and instead answer the following question: WhatsparsetensorQ ∈ Rm1×m2×m3 will reconstructP from atriple of orthonormal bases(U , V , W ) with the least error‖P − Q ×1 U ×2 V ×3 W‖2? Sparsity is again quantifiedby an upper boundT on the L0 norm of Q (denoted as‖Q‖0). We now prove that theoptimal Q with this sparsityconstraint is obtained by nullifying the least (in absolutevalue)m1m2m3 − T elements of the estimated projection tensorS = P ×1 UT ×2 V T ×3 WT . Due to the ortho-normalityof U , V andW , this simple greedy algorithm turns out to beoptimal (see Theorem 2).Theorem 2: Given a triple of orthonormal bases(U , V , W ),the optimal sparse projection tensorQ with ‖Q‖0 = T isobtained by setting to zerom1m2m3 − T elements of thetensorS = P ×1 UT ×2 V T ×3 WT having least absolutevalue.Proof: We have P = S ×1 U ×2 V ×3 W . The errorin reconstructing a patchP using some other matrixQ ise = ‖(S − Q) ×1 U ×2 V ×3 W‖2. For any tensorX , wehave‖X‖2 = ‖X(n)‖2 (i.e. the Frobenius norm of the tensorand itsnth unfolding are the same [18]). Also, by the matrixrepresentation of HOSVD, we haveX(n) = U ·S(n)·(V ⊗W )5.Hence, it follows thate = ‖U · (S−Q)(1) · (V ⊗W )T ‖2. Thisgives use = ‖S − Q‖2. The last step follows becauseU ,

5HereA⊗B refers to the Kronecker product of matricesA ∈ RE1×E2 andB ∈ RF1×F2 , which is given asA ⊗ B = (Ae1e2

B)1≤e1≤E1,1≤e2≤E2.

V and W , and henceV ⊗ W are orthonormal matrices. LetI1 = {(i, j, k)|Qijk = 0} and I2 = {(i, j, k)|Qijk 6= 0}.Then e =

∑(i,j,k)∈I1

S2ijk +

∑(i,j,k)∈I2

(Sijk − Qijk)2. Thiserror will be minimized whenSijk = Qijk in all locationswhere Qijk 6= 0 and Qijk = 0 at those indices where thecorresponding values inS are as small as possible. Thus if wewant ‖Q‖0 = T , thenQ is the tensor obtained by nullifyingm1m2m3−T entries fromS that have the least absolute valueand leaving the remaining elements intact.We wish to re-emphasize that a key feature of our approach isthe fact that the same technique used for 2D images scales tohigher dimensions. Tensor decompositions such as HOSVDdo not share this feature, because the optimal low-rank re-construction property for SVD does not extend to HOSVD.Furthermore, though the upper and lower error bounds forcore-tensor truncation in HOSVD derived in [24] are veryinteresting, they are applicable only when the entire set ofimages has a common basis (i.e. a commonU , V and W

matrix), which may not be sufficient to compactly account forthe large variability in real-world datasets.

B. Learning the Bases

We now describe a method to learnK exemplar orthonormalbases{(Ua, Va, Wa)}, 1 ≤ a ≤ K, to encode a trainingset of N image patchesPi ∈ Rm1×m2×m3 (1 ≤ i ≤ N )with least possible error (in the sense of theL2 norm of thedifference between the original and reconstructed patches).Note that K ≪ N . In addition, we impose a sparsityconstraint that everySia (the tensor used to reconstructPi

from (Ua, Va, Wa)) has at mostT non-zero elements. Themain objective function to be minimized is:

E({Ua, Va, Wa, Sia, Mia}) =N∑

i=1

K∑

a=1

Mia‖Pi − Sia ×1 Ua ×2 Va ×3 Wa‖2 (13)

subject to the following constraints:

(1)UTa Ua = V T

a Va = WTa Wa = I, ∀a. (2) ‖Sia‖0 ≤ T, ∀(i, a).

(3)∑

a

Mia = 1, ∀i, andMia ∈ {0, 1}, ∀i, a.

(14)

Here Mia is a binary matrix of sizeN × K which indi-cates whether theith patch belongs to the space defined by(Ua, Va, Wa). Just as for the 2D case, we relax the binarymembership constraint so that nowMia ∈ (0, 1), ∀(i, a),subject to

∑K

a=1 Mia = 1, ∀i. Using Lagrange parameters{µi}, symmetric Lagrange matrices{Λ1a}, {Λ2a} and{Λ3a},and a temperature parameterβ, we obtain the following


deterministic annealing objective function:

E({Ua, Va, Wa, Sia, Mia}) =N∑

i=1

K∑

a=1

Mia‖Pi − Sia ×1 Ua ×2 Va ×3 Wa‖2+

1

β

∑

ia

Mia log Mia +∑

i

µi(∑

a

Mia − 1)+

∑

a

trace(Λ1a(UTa Ua − I))+

∑

a

trace(Λ2a(V Ta Va − I)) +

∑

a

trace(Λ3a(WTa Wa − I)).

(15)

We first initialize{Ua}, {Va} and{Wa} to random orthonor-mal tensors∀a, and Mia = 1

K, ∀(i, a). Secondly, using the

fact that{Ua}, {Va} and{Wa} are orthonormal, the projectionmatrix Sia is computed by the rule:

Sia = Pi ×1 UTa ×2 V T

a ×3 WTa , ∀(i, a). (16)

This is the minimum of the energy function‖Pi − Sia ×1

Ua ×2 Va ×3 Wa‖2. Thenm1m2m3 −T elements inSia withleast absolute value are nullified. Thereafter,Ua, Va andWa

are updated as follows. Let us denote the sum of the terms inthe previous energy function that are independent ofUa, asC. Then we can write the energy function as:

E({Ua, Va, Wa, Sia, Mia}) =∑

ia

Mia‖Pi − (Sia ×1 Ua ×2 Va ×3 Wa)‖2+

∑

a

trace(Λ1a(UTa Ua − I)) + C

=∑

ia

Mia‖Pi(1) − (Sia ×1 Ua ×2 Va ×3 Wa)(1)‖2+

∑

a

trace(Λ1a(UTa Ua − I)) + C. (17)

Note that herePi(n) refers to thenth unfolding of the tensorPi (refer to [18] for exact details). Also note that the abovestep uses the fact that for any tensorX , we have‖X‖2 =‖X(n)‖2. Now, the objective function can be further expressedas follows:


ia

Mia‖Pi(1) − Ua · Sia(1) · (Va ⊗ Wa)T ‖2+

∑

a


(18)

Further simplification gives:


ia

Miatrace[(Pi(1) − Ua · Sia(1) · (Va ⊗ Wa)T )T

(Pi(1) − Ua · Sia(1) · (Va ⊗ Wa)T )]

+∑

a


=∑

ia

Mia[trace(PTi(1)Pi(1))−

2trace(PTi(1)Ua · Sia(1) · (Va ⊗ Wa)T )+

trace(STia(1)Sia(1))] +

∑

a

trace(Λ1a(UTa Ua − I)) + C. (19)

Taking the partial derivative ofE w.r.t. Ua and setting it tozero, we have:

∂E

∂Ua

=∑

i

−2MiaPi(1)(Va ⊗ Wa)STia(1) + 2UaΛ1a = 0.

(20)

After a series of manipulations to eliminate the Lagrangematrices, we arrive at the following update rule forUa:

ZUa =∑

i

MiaPi(1)(Va ⊗ Wa)STia(1);

Ua = ZUa(ZTUaZUa)−

1

2 = Γ1aΥT1a. (21)

Here Γ1a and Υ1a are orthonormal matrices obtained fromthe SVD of ZUa. The updates forVa and Wa are obtainedsimilarly and are mentioned below:

ZV a =∑

i

MiaPi(2)(Wa ⊗ Ua)STia(2);

Va = ZV a(ZTV aZV a)−

1

2 = Γ2aΥT2a. (22)

ZWa =∑

i

MiaPi(3)(Ua ⊗ Va)STia(3);

Wa = ZWa(ZTWaZWa)−

1

2 = Γ3aΥT3a. (23)

Here again,Γ2a and Υ2a refer to the orthonormal matricesobtained from the SVD ofZV a, andΓ3a andΥ3a refer to theorthonormal matrices obtained from the SVD ofZWa. Themembership values are obtained by the following update:

Mia =e−β‖Pi−Sia×1Ua×2Va×3Wa‖

2

∑K

b=1 e−β‖Pi−Sib×1Ub×2Vb×3Wb‖2. (24)

The core tensors{Sia} and the matrices{Ua, Va, Wa}, M arethen updated sequentially following one another for a fixedβ value, until convergence. The value ofβ is then increasedand the sequential updates are repeated. The entire processisrepeated until an integrality condition is met.

C. Application to Compact Image Representation

Quite similar to the 2D case, after the optimization duringthe training phase, each training patchPi (1 ≤ i ≤ N ) getsrepresented as a projection onto one out of theK exemplarorthonormal bases, which produces the least reconstructionerror, i.e. thekth exemplar is chosen if‖P − Sik ×1 Uk ×2


Vk ×3 Wk‖2 ≤ ‖P − Sia ×1 Ua ×2 Va ×3 Wa‖2, ∀a ∈{1, 2, ..., K}, 1 ≤ k ≤ K. For patch trainingPi, we denotethe corresponding ‘optimal’ projection tensor asS⋆

i = Sik, andthe corresponding exemplar as(U⋆

i , V ⋆i , W ⋆

i ) = (Uk, Vk, Wk).Thus the entire set of patches can be closely approximated by(1) thecommonset of basis-pairs{(Ua, Va, Wa)}, 1 ≤ a ≤ K

(K ≪ N ), and (2) the optimal sparse projection tensors{S⋆i }

for each patch, with at mostT non-zero elements each. Theoverall storage per image is thus greatly reduced. Furthermore,these bases{(Ua, Va, Wa)} can now be used to encode patchesfrom a new set of images that are similar to the ones existingin the training set, though just as in the 2D case, the sparsity ofthe patch will be adjusted dynamically in order to meet a givenerror threshold. Experimental results for this are provided inthe next section.

V. EXPERIMENTS: COLOR IMAGES

In this section, we describe the experiments we performedon color images represented in the RGB color scheme. Eachimage of sizeM1×M2×3 was treated as a third-order matrix.Our compression algorithm was tested on the GaTech FaceDatabase [41], which consists of RGB color images with 15images each of 50 different people. The images in the databaseare already cropped to include just the face, but some of themcontain small portions of a distinctly cluttered background.The average image size is150×150 pixels, and all the imagesare in the JPEG format. For our experiments, we divided thisdatabase into a training set of one image each of 40 differentpeople, and a test set of the remaining 14 images each of these40 people, and all 15 images each of the remaining 10 people.Thus, the size of the test set was 710 images. The patch-sizewe chose for the training and test sets was12 × 12 × 3 andwe experimented withK = 100 different orthonormal baseslearned during training. The value ofT during training was setto 10. At the time of testing, for our method, each patch wasprojected onto that triple of orthonormal bases(U⋆

i , V ⋆i , W ⋆

i )which gave the sparsest projection tensorS⋆

i such that the

per-pixel reconstruction error‖Pi−Si×1U⋆

i×2V ⋆

i×3W ⋆

i‖2

m1m2

was nogreater than a chosenδ. Note that, in calculating the per-pixel reconstruction error, we did not divide by the number ofchannels, i.e. 3, because at each pixel, there are three valuesdefined. We experimented with different reconstruction errorvaluesδ ranging from8 × 10−5 to 8 × 10−3. Following thereconstruction, the PSNR for the entire image was measured,and averaged over the entire test test. The total number ofbits per pixel, i.e. RPP, was also calculated for each imageand averaged over the test set. The details of the trainingand testing methodology, and also the actual quantization andcoding step are the same as presented previously in SectionIII-A and III-B.

A. Comparisons with Other Methods

The results obtained by our method were compared to thoseobtained by the following techniques:

1) KSVD, for which we used patches of size12 × 12 × 3reshaped to give vectors of size 432, and used these

to train a dictionary of 1340 vectors using a value ofT = 30.

2) Our algorithm for 2D images from Section II with anindependent (separate) encoding of each of the threechannels. As an independent coding of the R, G and Bslices would fail to account for the inherent correlationbetween the channels (and hence give inferior com-pression performance), we used principal componentsanalysis (PCA) to find the three principal componentsof the R, G, B values of each pixel from the trainingset. The R, G, B pixel values from the test images werethen projected onto these principal components to givea transformed image in which the values in each of thedifferent channels are decorrelated. A similar approachhas been taken earlier in [42] for compression of colorimages of faces using vector quantization, where thePCA method is empirically shown to produce channelsthat are even more decorrelated than those from the Y-Cb-Cr color model. The orthonormal bases were learnedon each of the three (decorrelated) channels of the PCAimage. This was followed by the quantization and codingstep similar to that described in Section III-B. Howeverin the color image case, the Huffman encoding step forfinding the optimal values ofa1, a2 anda3 in Eqn. (12)was performed using projection matrices from all threechannels together. This was done to improve the codingefficiency.

3) The JPEG standard (its MATLABR© implementation),for which we calculated RPP from the number of bytesof file storage on the disk. See Section V-B for moredetails.

We would like to mention here that we did not compare ourtechnique with [42]. The latter technique uses patches fromcolor face images and encodes them using vector quantization.The patches from more complex regions of the face (suchas the eyes, nose and mouth) are encoded using a separatevector quantization step for better reconstruction. The methodin [42] requires prior demarcation of such regions, which initself is a highly difficult task to automate, especially undervarying illumination and pose. Our method does not requireany such prior segmentation, as we manually tune the numberof coefficients to meet the pre-specified reconstruction error.

B. Results

As can be seen from Figure 9(a), all three methods performwell, though the higher-order method produced the best resultsafter a bit-rate of around 2 per pixel. For the GaTech database,we did not compare our algorithm directly with JPEG becausethe images in the database are already in the JPEG format.However, to facilitate comparison with JPEG, we used a subsetof 54 images from the CMU-PIE database [43]. The CMU-PIEdatabase contains images of several people against clutteredbackgrounds with a large variation in pose, illumination, facialexpression and occlusions created by spectacles. All the im-ages are available in an uncompressed (.ppm) format, and theirsize is631×467 pixels. We chose 54 images belonging to oneand the same person, and used exactly one image for training,


0 2 4 6 820

25

30

35

40

45

RPP

PS

NR

PSNR vs RPP: Training and testing on clean images

Our Method: 3DOur Method: 2DKSVD

(a)

0 2 4 6 8 1018

20

22

24

26

28

30

32

RPP

PS

NR

PSNR vs RPP: Clean Training, Noisy Testing

Our Method: 3DOur Method: 2DKSVDJPEG

(b)

0 5 10 1518

20

22

24

26

28

30

32

RPP

PS

NR

PSNR vs RPP: Noisy Training Noisy Testing

Our Method (3D)Our Method (2D)KSVDJPEG

(c)

Fig. 9. ROC curves for the GaTech database: (a) when trainingand testingwere done on the clean original images, (b) when training wasdone on cleanimages, but testing was done with additive zero mean Gaussian noise ofvariance8×10−4 added to the test images, and (c) when training and testingwere both done with zero mean Gaussian noise of variance8 × 10−4 addedto the respective images. The methods tested were our higher-order method,our method in 2D, KSVD and JPEG. Parameters for our methods:K = 100andT = 10 during training.These plots are viewed best in color.

Method Dictionary Size (number of scalars)Our Method (3D) 100 × 3 × 12 × 12 = 43200Our Method (2D) 100 × 2 × 3 × 12 × 12 = 86400

KSVD 1340 × 12 × 12 × 3 = 578880JPEG -

TABLE IICOMPARISON OF DICTIONARY SIZE FOR VARIOUS METHODS FOR

EXPERIMENTS ON THEGATECH AND CMU-PIE DATABASES.

and all the remaining for testing. Experiments with our higher-order method, our method involving separate channels andalso KSVD, revealed performance that is competitive withJPEG. For a bit rate of greater than 1.5 per pixel, our methodsproduced performance that was superior to that of JPEG interms of the PSNR for a given RPP, as seen in Figure V-B. Theparameters for this experiment wereK = 100 andT = 10 fortraining. Sample reconstructions of an image from the CMU-PIE database using our higher-order method for different errorvalues are shown in Figure V-B. This is quite interesting, sincethere is considerable variation between the training imageandthe test images, as is clear from Figure V-B. We would like toemphasize that the experiments were carried out on uncroppedimages of the full size with the complete cluttered background.Also, the dictionary sizes for the experiments on each of thesedatabases are summarized in Table II. For color-image patchesof sizem1×m2×3 usingK sets of bases, our method in 3Drequires a dictionary of size3Km1m2, whereas our method in2D requires a dictionary of size2Km1m2 per channel, whichis 6Km1m2 in total.

Training Image Original Test Image Error 0.003

Error 0.001 Error 0.0005 Error 0.0003

Error 0.0001

Fig. 10. Sample reconstructions of an image from the CMU-PIEdatabasewith different error values using our higher order method. The original imageis on the top-left.These images are viewed best in color.

0 2 4 6 8 1020

25

30

35

40

RPP

PS

NR

PSNR vs RPP: CMU−PIE

Our Method (3D)Our Method (2D)KSVDJPEG

Fig. 11. ROC curves for the CMU-PIE database using our methodin 3D, ourmethod in 2D, KSVD and JPEG. Parameters for our methods: withK = 100andT = 10 during training.These plots are viewed best in color.

C. Comparisons with JPEG on Noisy Datasets

As mentioned before, we did not directly compare ourresults to the JPEG technique for the GaTech Face Database,because the images in the database are already in the JPEGformat. Instead, we added zero-mean Gaussian noise of vari-ance8 × 10−4 (on a color scale of[0, 1]) to the images ofthe GaTech database and converted them to a raw format.Following this, we converted these raw images back to JPEG(using MATLAB R©) and measured the RPP and PSNR. Thesefigures were pitted against those obtained by our higher-ordermethod, our method on separate channels, as well as KSVD,as shown in Figures 9(b) and 9(c). The performance of JPEGwas distinctly inferior despite the fact that the noise addeddid not have a very high variance. The reason for this isthat the algorithms used by JPEG cash in on the fact thatwhile representing most natural images, the lower frequenciesstrongly dominate. This assumption can fall apart in case ofsensor noise. As a result, the coefficients produced by the DCTstep of JPEG on noisy images will possess prominently highervalues, giving rise to higher bit rates for the same PSNR. Forthe purposes of comparison, we ran two experiments using ourhigher-order method, our separate channel method and KSVDas well. In the first experiment, noise was added only to thetest set, though the orthonormal bases or the dictionary were


learned on a clean training set (i.e. without noise being addedto the training images). In the second experiment, noise wasadded to every image from the training set, and all methodswere trained on these noisy images. The testing was performedon (noisy) images from the test set, and the ROC curves wereplotted as usual. As can be seen from Figures 9(b) and 9(c),the PSNR values for JPEG begin to plateau off rather quickly.Note that in all these experiments, the PSNR is calculated fromthe squared difference between the reconstructed image andthe noisy training image. This is because the ‘original clean’image would be usually unavailable in a practical applicationthat required compression of noisy data.

VI. CONCLUSION

We have presented a new technique for sparse representationof image patches, which is based on the notion of treatingan image as a 2D entity. We go beyond the usual singularvalue decomposition of individual images or image-patchestolearn a set of common, full-sized orthonormal bases forsparserepresentation, as opposed to low-rank representation. For theprojection of a patch onto these bases, we have provided avery simple and provably optimal greedy algorithm, unlikethe approximation algorithms that are part and parcel of pro-jection pursuit techniques, and which may require restrictiveassumptions on the nature or properties of the dictionary[27]. Based on the developed theory, we have demonstrated asuccessful application of our technique for the purpose of com-pression of two well-known face databases. Our compressionmethod is able to handle the varying complexity of differentpatches by dynamically altering the number of coefficients(and hence the bit-rate for storage) in the projection matrixof the patch. Furthermore, this paper also presents a cleanand elegant extension to higher order matrices, for whichwe have presented applications to color-image compression.Unlike decompositions like HOSVD which do not retain theoptimal low-rank reconstruction property of SVD, our methodscales cleanly into higher dimensions. The experimental resultsshow that our technique compares very favorably to otherexisting approaches from recent literature, including theJPEGstandard. We have also empirically examined the performanceof our algorithm on noisy color image datasets.

Directions for future work include: (1) investigation of al-ternative matrix representations tuned for specific applications(as opposed to using the default rows and columns of theimage), (2) application of our technique for compression ofgray-scale and color video (which can be represented as 3Dand 4D matrices respectively), (3) application of our techniquefor denoising or classification, (4) a theoretical study of themethod in the context of natural image statistics, (5) a study onutilizing a perceptually driven quality metric [44] as opposedto a simple L2 norm, and (6) a method of incorporatingincoherent bases in our framework, inspired by recent researchon compressive sensing [45].

REFERENCES

[1] A. Rangarajan, “Learning matrix space image representations”, in En-ergy Minimization Methods in Computer Vision and Pattern Recognition(EMMCVPR). 2001, vol. 2134 ofLNCS, pp. 153–168, Springer Verlag.

[2] J. Ye, “Generalized low rank approximations of matrices”, MachineLearning, vol. 61, pp. 167–191, 2005.

[3] J. Yang, D. Zhang, A. F. Frangi, and J. y. Yang, “Two-dimensionalPCA: A new approach to appearance-based face representation andrecognition”, IEEE Trans. Pattern Anal. Mach. Intell., vol. 26, no. 1,2004.

[4] X. He, D. Cai, and P. Niyogi, “Tensor subspace analysis”,in NeuralInformation Processing Systems, 2005, pp. 499–506.

[5] H. C. Andrews and C. L. Patterson, “Singular value decomposition(SVD) image coding”,IEEE Transactions on Communications, pp. 425–432, 1976.

[6] G. Golub and C. van Loan,Matrix Computations, The Johns HopkinsUniversity Press, October 1996.

[7] A. van der Schaaf and J. H. van Hateren, “Modelling the power spectraof natural images: statistics and information”,Vision Research, vol. 36,pp. 2759–2770, 1996.

[8] J.-F. Yang and C.-L. Lu, “Combined techniques of singular value decom-position and vector quantization for image coding”,IEEE Transactionson Image Processing, vol. 4, no. 8, pp. 1141–1146, 1995.

[9] C. Ding and J. Ye, “Two-dimensional singular value decomposition(2DSVD) for 2D maps and images”, inSIAM Int’l Conf. Data Mining,2005, pp. 32–43.

[10] G. Wallace, “The JPEG still picture compression standard”, IEEETransactions on Consumer Electronics, vol. 38, no. 1, pp. 18–34, 1992.

[11] B. Olshausen and D. Field, “Natural image statistics and efficientcoding”, Network, vol. 7, pp. 333–339, 1996.

[12] M. Lewicki, T. Sejnowski, and H. Hughes, “Learning overcompleterepresentations”,Neural Computation, vol. 12, pp. 337–365, 2000.

[13] M. Aharon, M. Elad, and A. Bruckstein, “The K-SVD: An algorithmfor designing of overcomplete dictionaries for sparse representation”,IEEE Transactions on Signal Processing, vol. 54, no. 11, pp. 4311–4322, November 2006.

[14] P. Hoyer, “Non-negative matrix factorization with sparseness con-straints”, Journal of Machine Learning Research, pp. 1457–1469, 2004.

[15] D. Mallat and M. Avellaneda, “Adaptive greedy approximations”,Journal of Constructive Approximations, vol. 13, pp. 57–98, 1997.

[16] K. Gurumoorthy, A. Rajwade, A. Banerjee, and A. Rangarajan, “Be-yond SVD - sparse projections onto exemplar orthonormal bases forcompact image representation”, inInternational Conference on PatternRecognition (ICPR), 2008.

[17] A. Shashua and A. Levin, “Linear image coding for regression andclassification using the tensor-rank principle”, inCVPR(1), 2001, pp.42–49.

[18] L. de Lathauwer, B. de Moor, and J. Vandewalle, “A multilinear singularvalue decomposition”,SIAM J. Matrix Anal. Appl., vol. 21, no. 4, pp.1253–1278, 2000.

[19] J. Hastad, “Tensor rank is NP-Complete”, inICALP ’89: Proceedingsof the 16th International Colloquium on Automata, Languages andProgramming, 1989, pp. 451–460.

[20] L. de Lathauwer,Signal Processing Based on Multilinear Algebra, PhDthesis, Katholieke Universiteit Leuven, Belgium, 1997.

[21] M.A.O. Vasilescu and D. Terzopoulos, “Multilinear analysis of imageensembles: Tensorfaces”, inInternational Conference on Pattern Recog-nition (ICPR), 2002, pp. 511–514.

[22] R. Costantini, L. Sbaiz, and S. Susstrunk, “Higher order SVD analysisfor dynamic texture synthesis”,IEEE Transactions on Image Processing,vol. 17, no. 1, pp. 42–52, Jan. 2008.

[23] H. Wang and N. Ahuja, “Rank-R approximation of tensors:Usingimage-as-matrix representation”, inIEEE Conference on ComputerVision and Pattern Recognition (CVPR), 2005, vol. 2, pp. 346–353.

[24] C. Ding, H. Huang, and D. Luo, “Tensor reduction error analysis- applications to video compression and classification”, inIEEEConference on Computer Vision and Pattern Recognition (CVPR), 2008.

[25] A. Ranade, S. Mahabalarao, and S. Kale, “A variation on SVD basedimage compression”,Image and Vision Computing, vol. 25, no. 6, pp.771–777, 2007.

[26] Y. Pati, R. Rezaiifar, and P. Krishnaprasad, “Orthogonal matchingpursuit: Recursive function approximation with applications to waveletdecomposition”, in27th Asilomar Conference on Signals, Systems andComputation, 1993, vol. 1, pp. 40–44.

[27] J. Tropp, “Greed is good: algorithmic results for sparse approximation”,IEEE Transactions on Information Theory, vol. 50, no. 10, pp. 2231–2242, 2004.

[28] C. Petersen and B. Soderberg, “A new method for mappingoptimizationproblems onto neural networks”, International Journal of NeuralSystems, vol. 1, pp. 3–22, 1989.


[29] A. Yuille and J. Kosowsky, “Statistical physics algorithms that con-verge”, Neural Computation, vol. 6, no. 3, pp. 341–356, 1994.

[30] R. Hathaway, “Another interpretation of the EM algorithm for mixturedistributions”,Statistics and Probability Letters, vol. 4, pp. 53–56, 1986.

[31] M. Tipping and C. Bishop, “Mixtures of probabilistic principal com-ponent analysers”,Neural Computation, vol. 11, no. 2, pp. 443–482,1999.

[32] “The ORL Database of Faces”, http://www.cl.cam.ac.uk/research/dtg/attarchive/facedatabase.html.

[33] “The Yale Face Database”, http://cvc.yale.edu/projects/yalefaces/yalefaces.html.

[34] D. Huffman, “A method for the construction of minimum-redundancycodes”, inProceedings of the I.R.E., 1952, pp. 1098–1102.

[35] S. Geman, E. Bienenstock, and R. Doursat, “Neural networks and thebias/variance dilemma”,Neural Computation, vol. 4, no. 1, pp. 1–58,1992.

[36] “The CVG Granada Database”, http://decsai.ugr.es/cvg/dbimagenes/index.php.

[37] “The Uncompressed Colour Image Database”, http://vision.cs.aston.ac.uk/datasets/UCID/ucid.html.

[38] M. Marcellin, M. Lepley, A. Bilgin, T. Flohr, T. Chinen,and J. Kasner,“An overview of quantization in jpeg 2000”,Signal Processing: ImageCommunication, vol. 17, no. 1, pp. 73–84, 2000.

[39] M. Adams and R. Ward, “Jasper: A portable flexible open-sourcesoftware toolkit for image coding/processing”, inIEEE InternationalConference on Acoustics, Speech and Signal Processing, 2004, pp. 241–244.

[40] M. Hilton, B. Jawerth, and A. Sengupta, “Compressing still and movingimages with wavelets”,Multimedia Systems, vol. 2, no. 5, pp. 218–227,1994.

[41] “The Georgia Tech Face Database”, http://www.anefian.com/face reco.htm.

[42] J. Huang and Y. Wang, “Compression of color facial images usingfeature correction two-stage vector quantization”,IEEE Transactionson Image Processing, vol. 8, pp. 102–109, 1999.

[43] “The CMU Pose, Illumination and Expression (PIE) Database”, http://www.ri.cmu.edu/projects/project418.html.

[44] M. Eckert and A. Bradley, “Perceptual quality metrics applied to stillimage compression”,Signal Processing, vol. 70, no. 3, pp. 177–200,1998.

[45] E. Candes, J. Romberg, and T. Tao, “Robust uncertaintyprinciples: Exactsignal reconstruction from highly incomplete frequency information”,IEEE Transactions on Information Theory, vol. 56, no. 2, pp. 489–509,2006.

Documents

SUBMITTED TO IEEE TRANSACTIONS ON IMAGE ...anand/pdf/TIP_Image...SUBMITTED TO IEEE TRANSACTIONS ON IMAGE PROCESSING 3 its rank [see Section II-A and Section III]. There exists a large