
USING THE KERNEL TRICK IN COMPRESSIVE SENSING: ACCURATE SIGNAL RECOVERY FROM FEWER MEASUREMENTS

Hanchao Qi, Shannon Hughes
Department of Electrical, Computer, and Energy Engineering

University of Colorado at Boulder

ABSTRACT

Compressive sensing accurately reconstructs a signal that is sparse in some basis from measurements, generally consisting of the signal's inner products with Gaussian random vectors. The number of measurements needed is based on the sparsity of the signal, allowing for signal recovery from far fewer measurements than is required by the traditional Shannon sampling theorem. In this paper, we show how to apply the kernel trick, popular in machine learning, to adapt compressive sensing to a different type of sparsity. We consider a signal to be "nonlinearly $K$-sparse" if the signal can be recovered as a nonlinear function of $K$ underlying parameters. Images that lie along a low-dimensional manifold are good examples of this type of nonlinear sparsity, and natural images have been shown to exhibit it as well [1]. We show how to accurately recover these nonlinearly $K$-sparse signals from approximately $2K$ measurements, which is often far lower than the number of measurements usually required under the assumption of sparsity in an orthonormal basis (e.g. wavelets). In experimental results, we find that we can recover images far better for small numbers of compressive sensing measurements, sometimes reducing the mean square error (MSE) of the recovered image by an order of magnitude or more, with little computation. A bound on the error of our recovered signal is also proved.

Index Terms— Compressive sensing, Kernel methods.

1. INTRODUCTION

The "kernel trick" in machine learning is a way to easily adapt linear algorithms to nonlinear situations. For example, by applying the kernel trick to the support vector machines (SVM) algorithm, which constructs the best linear hyperplane separating data points belonging to two different classes, we obtain the kernel SVM algorithm, an algorithm that constructs the best curved boundary separating data points belonging to two different classes. Similarly, principal components analysis (PCA) selects the best linear projection of the data to minimize error between the original and projected data. Kernel PCA finds the best smooth polynomial mapping to represent the data.

The key idea of the kernel trick is that, conceptually, we map our data from the original data space $\mathbb{R}^P$ to a much higher-dimensional feature space $\mathcal{F}$ using the nonlinear mapping $\Phi: \mathbb{R}^P \to \mathcal{F}$ before applying the usual linear algorithm such as SVM or PCA in the feature space. As an example, we might map a point $(x_1, x_2) \in \mathbb{R}^2$ onto the higher-dimensional vector with components $x_1$, $x_2$, $x_1^2$, $x_2^2$, $x_1 x_2$, $x_1^3$, etc. before applying SVM or PCA. A linear boundary in the higher-dimensional feature space ($\sum_j a_j \Phi(x)_j = C$) can be expressed as a polynomial boundary in the original space ($a_0 x_1 + a_1 x_2 + a_2 x_1^2 + \ldots = C$). Similarly, a linear mapping of the data becomes a polynomial mapping.

However, this view of the kernel trick is purely conceptual. In reality, we avoid the complexity of mapping to and working in the high-dimensional feature space. When the original algorithm can be written in terms of only inner products between points, not the points themselves, we can replace the original inner product $\langle x, y \rangle$ with the new inner product $k(x, y) = \langle \Phi(x), \Phi(y) \rangle$ and run the original algorithm without additional computation. For example, a popular choice of $k(x, y)$ is $(\langle x, y \rangle + c)^d$, which produces a $\Phi$ of monomials as described above. As an illustration, for $x, y \in \mathbb{R}^2$, $c = 0$, $d = 2$, $k(x, y) = \langle x, y \rangle^2 = \langle (x_1^2, \sqrt{2}\,x_1 x_2, x_2^2), (y_1^2, \sqrt{2}\,y_1 y_2, y_2^2) \rangle$, so $\Phi(x) = (x_1^2, \sqrt{2}\,x_1 x_2, x_2^2)$.
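As a quick numerical sanity check (ours, not from the paper), the identity $k(x, y) = \langle \Phi(x), \Phi(y) \rangle$ for this degree-2 polynomial kernel can be verified directly; `poly_kernel` and `phi_deg2` are hypothetical helper names used only for this sketch:

```python
import numpy as np

def poly_kernel(x, y, c=0.0, d=2):
    """Polynomial kernel k(x, y) = (<x, y> + c)^d."""
    return (x @ y + c) ** d

def phi_deg2(x):
    """Explicit degree-2 feature map for x in R^2 with c = 0."""
    x1, x2 = x
    return np.array([x1**2, np.sqrt(2) * x1 * x2, x2**2])

rng = np.random.default_rng(0)
x, y = rng.standard_normal(2), rng.standard_normal(2)

# Evaluating the kernel in R^2 gives the same number as the inner product in feature space.
assert np.isclose(poly_kernel(x, y), phi_deg2(x) @ phi_deg2(y))
```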

Kernel PCA [2] often reveals low-dimensional representationsof the dataset that reflect its underlying degrees of freedom. For ex-ample, in the synthetic “sculpture faces” dataset of Fig. 1, each faceimage is a highly nonlinear, but deterministic, function of three un-derlying variables: two pose angles and one lighting angle. A kernelPCA, performed with a well-chosen kernel function, is able to pickout two of these degrees of freedom as the first two dimensions cho-sen in kernel PCA. (Note that an ordinary PCA will not.) The resultsof kernel PCA can thus reflect a type of nonlinear sparsity in thedataset. We could represent each image fairly accurately, knowingonly its coordinates in this two-dimensional representation.

Indeed, we may be able to build a better approximation of theimage knowing its first m coordinates in a nonlinearly sparse rep-resentation such as kernel PCA than we can knowing its largest mFourier, wavelet or curvelet coefficients. Fig. 1(c) shows a compari-son of the mean-squared error for an individual image of the “sculp-ture faces” dataset when approximated from m kernel PCA compo-nents vs. m wavelet coefficients. The MSE decays much faster forkernel PCA components, showing that the image is more nonlinearlysparse than linearly sparse. Like this simple toy dataset, natural im-ages have been shown to be nonlinearly sparse: patches of naturalimages tend to lie along low-dimensional manifolds [1].
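For concreteness, here is a minimal numpy sketch of the standard kernel PCA procedure of [2] (our own illustration, not code from the paper), which maps a set of signals to their first few kernel PCA coordinates via the centered kernel matrix; the synthetic data and kernel parameters below are arbitrary:

```python
import numpy as np

def kernel_pca(K, n_components):
    """Kernel PCA from an m x m kernel matrix K, following [2].
    Returns the expansion coefficients alpha (m x n_components) of the principal
    directions in feature space and the training data's projections onto them."""
    m = K.shape[0]
    ones = np.full((m, m), 1.0 / m)
    Kc = K - ones @ K - K @ ones + ones @ K @ ones   # center the mapped data in feature space

    eigvals, eigvecs = np.linalg.eigh(Kc)            # ascending eigenvalues
    idx = np.argsort(eigvals)[::-1][:n_components]   # keep the largest components
    lam, A = eigvals[idx], eigvecs[:, idx]

    alpha = A / np.sqrt(lam)                         # normalize so each direction has unit norm
    coords = Kc @ alpha                              # projections onto the principal directions
    return alpha, coords

# Example: a degree-5 polynomial kernel on synthetic "images".
rng = np.random.default_rng(0)
X = rng.standard_normal((50, 16))                    # 50 signals with 16 samples each
c = 0.5 * np.mean(np.cov(X, rowvar=False))
K = (X @ X.T + c) ** 5
alpha, coords = kernel_pca(K, n_components=2)        # 2-D kernel PCA representation
```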

In view of this nonlinear sparsity, consider compressive sensing [3, 4]. Recent work in compressive sensing has shown that we can achieve perfect reconstruction of a signal with far fewer samples than Shannon-Nyquist traditionally requires, provided the signal is approximately sparse in some basis. In fact, in practice, we can achieve a near-perfect, or even perfect, reconstruction of the signal from about $5K$ measurements, each of which is the $K$-sparse signal's inner product with a random vector.

In this paper, we will show how the kernel trick can be used to adapt this paradigm of reconstructing a linearly sparse signal from a linear set of measurements to the case of reconstructing a nonlinearly sparse signal from either nonlinear or linear measurements. The key idea is that a signal that is nonlinearly sparse can, with a proper choice of kernel, become linearly sparse in feature space, as our "sculpture faces" did above. We can thus reconstruct it from random measurements in feature space, which can be easily obtained from the usual random measurements for some kernels. Experimentally, we find that when the signal to be reconstructed is nonlinearly sparse, our method reconstructs it from far fewer compressive sensing measurements, sometimes using an order of magnitude fewer measurements to achieve the same MSE.

Section 2 outlines our recovery algorithm. Section 3 presents experimental results showing its power on sample datasets. Finally, in Section 4, we present some error analysis for our recovered signal.


[Fig. 1 appears here: (a) Image Samples, (b) Arranged by KPCA, (c) Error Dropoff Comparison.]

Fig. 1. (a) A set of images that have 3 underlying degrees of freedom. (b) Each image is now plotted at the location in 2-dimensional space that a kernel PCA (keeping 2 principal components) would map it to. The two underlying degrees of freedom of pose angle are nicely captured. (c) A plot of MSE between the original image and its best $m$-term approximation in wavelets vs. MSE between the image and its best $m$-term kernel PCA approximation. The image is better approximated in kernel PCA components.

2. SIGNAL RECOVERY FROM CS MEASUREMENTS USING THE KERNEL TRICK

Suppose we wish to recover an unknown $y$ that is approximately nonlinearly $d$-sparse, meaning it can be reconstructed with small error from its first $d$ coordinates in some orthonormal basis of feature space, i.e.
$$\Phi(y) \approx \sum_{k=1}^{d} \beta_k v_k.$$

Here, unlike traditional compressive sensing, we do not expect the $v_k$ to be some unknown subset of the standard basis for $\mathcal{F}$, which are typically monomials. Hence, $\{v_i\}_{i=1}^{d}$ in $\mathcal{F}$ will typically be estimated via kernel PCA from other data that is expected to be nonlinearly sparse in the same way, i.e. from other samples $\{x^{(i)}\}_{i=1}^{m}$ of the manifold of images that our image belongs to, or from other natural images.

Now suppose we have measurements of $y$ in the form of linear inner products $\langle y, e_i \rangle$, or nonlinear inner products $k(y, e_i) = \langle \Phi(y), \Phi(e_i) \rangle$, where $\{e_i\}_{i=1}^{n}$ are known random vectors. Our goal is to reconstruct $y$ from the random measurements as in typical compressive sensing. In the case that linear inner products are provided, we shall assume the kernel defining our feature space is of the form $f(\langle y, e_i \rangle)$, so that $k(y, e_i)$ is known as well.

If we complete the orthonormal basis with $v_{d+1}, \ldots, v_q$, where $q$ is the dimension of $\mathcal{F}$, we can write $\Phi(y)$ as
$$\Phi(y) = \sum_{k=1}^{d} \beta_k v_k + \sum_{k=d+1}^{q} \gamma_{k-d} v_k \qquad (1)$$

Each random measurement can then be expressed as
$$k(y, e_i) = \langle \Phi(y), \Phi(e_i) \rangle = \sum_{k=1}^{d} \beta_k \langle v_k, \Phi(e_i) \rangle + \varepsilon_i$$
where $\varepsilon_i$ is a small error term, since $\sum_{k=d+1}^{q} \gamma_{k-d} v_k$ is small by assumption. In Section 4, we will analyze the error incurred by this. In matrix form, this becomes
$$\begin{bmatrix} k(y, e_1) \\ \vdots \\ k(y, e_n) \end{bmatrix} = \begin{bmatrix} \langle v_1, \Phi(e_1) \rangle & \cdots & \langle v_d, \Phi(e_1) \rangle \\ \vdots & \ddots & \vdots \\ \langle v_1, \Phi(e_n) \rangle & \cdots & \langle v_d, \Phi(e_n) \rangle \end{bmatrix} \begin{bmatrix} \beta_1 \\ \vdots \\ \beta_d \end{bmatrix} + \begin{bmatrix} \varepsilon_1 \\ \vdots \\ \varepsilon_n \end{bmatrix}$$

Writing the above equation as
$$M = G\beta + \varepsilon, \qquad (2)$$
we can use the least squares estimator to estimate $\beta$: $\hat{\beta} = G^{+}M$. We shall later show that we must have $n \geq d$ for good results, so we will always have $G^{+} = (G^{T}G)^{-1}G^{T}$.
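In practice the least squares step is a single call; the sketch below (our notation, not code from the paper) solves Eq. 2 with `numpy.linalg.lstsq`, which is numerically preferable to explicitly forming $(G^{T}G)^{-1}G^{T}$ and requires the $n \geq d$ condition noted above:

```python
import numpy as np

def estimate_beta(G, M):
    """Least-squares estimate beta_hat = G^+ M for the system M = G beta + eps (Eq. 2).
    G has shape (n, d) with n >= d, and M is the length-n vector of measurements."""
    beta_hat, *_ = np.linalg.lstsq(G, M, rcond=None)
    return beta_hat
```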

We then estimate $\Phi(y)$ as $\widehat{\Phi(y)} = \sum_{k=1}^{d} \hat{\beta}_k v_k$. Finally, we need to invert the one-to-one mapping $\Phi$ to find our estimate $\hat{y}$ of $y$. This problem, of recovering $z$ from $\Phi(z)$, is called the preimage problem in the kernel methods literature. If an exact preimage $z$ such that $\Phi(z) = \widehat{\Phi(y)}$ exists, then we can estimate our original signal $y$ via $\hat{y} = \Phi^{-1}\big(\widehat{\Phi(y)}\big)$. However, more often, due to estimation error in $\widehat{\Phi(y)}$, an exact preimage will not exist, and we will set $\hat{y} = \arg\min_z \|\Phi(z) - \widehat{\Phi(y)}\|_2^2$.

2.1. Avoiding Computation in Feature Space

Often, we want to avoid working with $\Phi$ or elements in $\mathcal{F}$, such as $\{v_k\}_{k=1}^{d}$, to avoid complexity. We now show that our estimate from the previous subsection can be obtained without working in $\mathcal{F}$ if we rely on the data $\{x^{(i)}\}_{i=1}^{m}$ from which $\{v_k\}_{k=1}^{d}$ were estimated.

In the following description, for simplicity, we assume that the mapped data $\Phi(x^{(j)})$ are centered in $\mathcal{F}$. Otherwise, we must replace $\Phi(x^{(j)})$ with the centered $\tilde{\Phi}(x^{(j)}) = \Phi(x^{(j)}) - \frac{1}{m}\sum_{i=1}^{m} \Phi(x^{(i)})$.

In the usual kernel PCA procedure (see [2]), we recover not the $v_k$ themselves, which are cumbersome, but rather their coefficients $\alpha_i^k$ in the expansion
$$v_k = \sum_{i=1}^{m} \alpha_i^k \Phi(x^{(i)}) \qquad (3)$$
(It can be easily shown that the optimal $v_k$ for PCA must lie in the subspace spanned by the $\Phi(x^{(i)})$.) This allows us to express $\langle v_l, \Phi(e_k) \rangle$ as $\sum_{j=1}^{m} \alpha_j^l k(x^{(j)}, e_k)$, and thus $G$ as

$$G = \begin{bmatrix} k(x^{(1)}, e_1) & \cdots & k(x^{(m)}, e_1) \\ \vdots & \ddots & \vdots \\ k(x^{(1)}, e_n) & \cdots & k(x^{(m)}, e_n) \end{bmatrix} \begin{bmatrix} \alpha_1^1 & \cdots & \alpha_1^d \\ \vdots & \ddots & \vdots \\ \alpha_m^1 & \cdots & \alpha_m^d \end{bmatrix}$$

Once we have found $\hat{\beta} = G^{+}M$, we then have
$$\widehat{\Phi(y)} = \sum_{k=1}^{d} \hat{\beta}_k v_k = \sum_{i=1}^{m} \sum_{k=1}^{d} \hat{\beta}_k \alpha_i^k \Phi(x^{(i)}).$$
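A hedged sketch of this computation (function and variable names are ours): given a kernel function, the training data, the kernel PCA coefficients $\alpha$ of Eq. 3, the random vectors $e_i$, and the measurements, we can form $G$ and the coefficients of $\widehat{\Phi(y)}$ in the span of the $\Phi(x^{(i)})$ without ever entering feature space. Centering in feature space is omitted here for brevity:

```python
import numpy as np

def recover_feature_coeffs(kernel, X, alpha, E, M):
    """Return the coefficients c_i of Phi_hat(y) = sum_i c_i Phi(x_i), as in Sec. 2.1.
    kernel(A, B): matrix of k(a_i, b_j) for rows of A and B
    X: (m, p) training data, alpha: (m, d) coefficients of Eq. 3
    E: (n, p) random measurement vectors, M: length-n measurements k(y, e_i)."""
    K_xe = kernel(X, E)                  # (m, n), entries k(x_j, e_i)
    G = K_xe.T @ alpha                   # (n, d), G_ik = sum_j alpha_j^k k(x_j, e_i)
    beta_hat, *_ = np.linalg.lstsq(G, M, rcond=None)
    return alpha @ beta_hat              # c_i = sum_k beta_hat_k alpha_i^k
```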

We can recover the exact preimage from $\hat{\beta}$ without venturing into feature space using the following technique from [5]. Consider an expansion $\Psi = \sum_{i=1}^{m} c_i \Phi(x^{(i)})$ in $\mathcal{F}$. If there exists $z \in \mathbb{R}^p$ such that $\Phi(z) = \Psi$, and an invertible function $f_k$ such that $k(x, y) = f_k(\langle x, y \rangle)$, then
$$z = \sum_{i=1}^{p} \langle z, u_i \rangle u_i = \sum_{i=1}^{p} f_k^{-1}\!\left(\sum_{j=1}^{m} c_j k(x^{(j)}, u_i)\right) u_i \qquad (4)$$

where $\{u_i\}_{i=1}^{p}$ is any orthonormal basis of $\mathbb{R}^p$. As an example kernel, we could choose $k(x, y) = f_k(\langle x, y \rangle) = (\langle x, y \rangle + c)^d$, with $d$ odd. Approximate preimages can also be constructed without venturing into feature space (for example, see [5] and [6]). However, we will not detail these methods here, since we found that the simple preimage recovery method in Eq. 4 was sufficient.
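A minimal sketch of Eq. 4 for this example kernel (our code, assuming $k(x, y) = (\langle x, y \rangle + c)^d$ with $d$ odd and the standard basis of $\mathbb{R}^p$ as $\{u_i\}$):

```python
import numpy as np

def preimage(coeffs, X, c, d):
    """Exact preimage of Psi = sum_i coeffs[i] * Phi(x_i) via Eq. 4,
    for k(x, y) = (<x, y> + c)^d with d odd, using the standard basis of R^p."""
    p = X.shape[1]
    U = np.eye(p)                              # orthonormal basis {u_i} of R^p
    s = coeffs @ ((X @ U.T + c) ** d)          # s_i = sum_j coeffs_j k(x_j, u_i)
    # f_k^{-1}(s) = (real d-th root of s) - c; the root is well defined since d is odd
    return np.sign(s) * np.abs(s) ** (1.0 / d) - c   # coordinates <z, u_i>
```

As a sanity check, if `coeffs` is the indicator of a single training point $x^{(1)}$, the function returns $x^{(1)}$ exactly.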

Fig. 2. Experimental Results: A visual results comparison (left) and plots (in log-log scale) of MSE vs. number of measurements (right) for each of the "sculpture face" (top), "Frey face" (middle), and "Handwritten digits" (bottom) datasets. Our method (KTCS) is compared with traditional L1 and Total Variation Minimization (L1-MIN, TVM).

3. EXPERIMENTAL RESULTS

In this section, we apply our method to sample signals to compare it with existing compressive sensing signal recovery techniques. For our analysis, we choose three datasets in which the images show the type of "nonlinear sparsity" previously described. The "sculpture face" dataset includes 698 images of size 64 × 64 of a sculpture face rendered according to 3 different input parameters, two of pose angle and one of lighting angle. The "Frey face" dataset includes 1964 images of size 16 × 16 of the same person's face as he shows different emotions. The "Handwritten digits" dataset includes 60000 training images of size 32 × 32 of digits from 0 to 9.

All experiments were done using the kernel $k(x, y) = (\langle x, y \rangle + c)^5$, setting $c$ as 0.5 times the mean of all entries of the covariance matrix formed from the points $\{x^{(k)}\}_{k=1}^{m}$, so that it is of approximately the same scale as the inner products, and for i.i.d. $e_k \sim \mathcal{N}(0, \frac{1}{n}I)$. To mimic a real-world situation, we estimate $\{v_k\}_{k=1}^{d}$ via kernel PCA on the other data samples, choosing $d = \frac{n}{2}$, and avoid the complexity of feature space computation as in Sec. 2.1.
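A hedged sketch of this setup (all names and the seed are ours); it produces the kernel parameter $c$, the random vectors $e_k$, the assumed sparsity level $d = n/2$, and the kernel function for training data `X` of shape (m, p):

```python
import numpy as np

def experiment_setup(X, n, degree=5, seed=0):
    """Sketch of the Sec. 3 setup: degree-5 polynomial kernel with c tied to the
    data scale, i.i.d. measurement vectors e_k ~ N(0, (1/n) I), and d = n/2."""
    rng = np.random.default_rng(seed)
    p = X.shape[1]

    # c = 0.5 * mean of all entries of the covariance matrix of the training points
    c = 0.5 * np.mean(np.cov(X, rowvar=False))

    E = rng.normal(0.0, 1.0 / np.sqrt(n), size=(n, p))   # rows are the e_k
    d = n // 2                                            # assumed nonlinear sparsity level

    def kernel(A, B):
        return (A @ B.T + c) ** degree

    return c, E, d, kernel
```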

Fig. 2 compares our method (KTCS) with the popular compressive sensing recovery methods L1-Minimization (L1-MIN) and Total Variation Minimization (TVM), giving the reconstructed images for varying numbers of measurements and plotting MSE vs. number of measurements (in logarithmic scale) for each method. We notice that our method outperforms traditional compressive sensing recovery methods for small numbers of measurements, producing either a greatly reduced MSE for the same number of measurements or the same MSE from far fewer measurements. L1-MIN and TVM do catch up and surpass our method for very large numbers of measurements and very small recovery error, most likely due to small inaccuracies in estimating the preimage from $\widehat{\Phi(y)}$. These inaccuracies should decrease with exact knowledge of $\{v_k\}_{k=1}^{d}$ or an increased set of data $\{x^{(i)}\}_{i=1}^{m}$ from which to estimate them. Even though these are simple datasets, this serves as a proof-of-concept that nonlinearly sparse signals can be recovered remarkably well from very few measurements, and it suggests that more accurate recovery should also be possible for other nonlinearly sparse signals, such as natural images.

Finally, since real-world signals are not exactly sparse, but rather show a rapidly decaying approximation error with increasing number of components kept, Fig. 3 attempts to experimentally find the optimal level of nonlinear sparsity to assume for a given number of measurements $n$. It plots the MSE between the original image and our reconstructed image for different combinations of assumed sparsity level $d$ and number of measurements $n$ for the "sculpture faces" dataset. In Fig. 3(a), we see a sharp drop in MSE for each curve, occurring precisely when $n$ becomes larger than $d$. We shall understand this drop better using our error analysis in Section 4. Given $n > d$, we obtain a smaller MSE for larger $d$. In Fig. 3(b), for each $n$, the MSE is smallest at about $d \approx \frac{n}{2}$. Based on this guideline, we chose $d = \frac{n}{2}$ in our experiments.

4. ERROR ANALYSIS OF OUR ESTIMATOR

Theorem. Suppose $v_1, \ldots, v_q$ is an orthonormal basis for the feature space $\mathcal{F}$. Now consider $\Phi(y) \in \mathcal{F}$ and its representation in terms of this basis:
$$\Phi(y) = \sum_{i=1}^{d} \beta_i v_i + \sum_{j=d+1}^{q} \gamma_{j-d} v_j \qquad (5)$$
Suppose we take $n$ measurements of $\Phi(y)$ by taking inner products $M_k = \langle e_k, \Phi(y) \rangle$ for vectors $\{e_k\}_{k=1}^{n}$ in feature space $\mathcal{F}$ that are i.i.d. drawn from an isotropic Gaussian distribution $e_k \sim \mathcal{N}(0, \frac{1}{n}I)$. Let $\widehat{\Phi(y)} = \sum_{j=1}^{d} \hat{\beta}_j v_j$ for $\hat{\beta} = G^{+}M$ and $G$ as in Eq. 2. Then,
$$\|\Phi(y) - \widehat{\Phi(y)}\|_2^2 \leq \frac{\|\gamma\|_2^2 + b}{\left(1 - \sqrt{d/n} - r\right)^2} + \|\gamma\|_2^2 \qquad (6)$$
with probability at least $\frac{b^2}{b^2 + \frac{2}{n}\|\gamma\|_2^4}\left(1 - e^{-nr^2/2}\right)$. $\square$


Fig. 3. MSE of our recovered image as a function of $n$ and $d$ for the "sculpture face" dataset for (a) fixed $n$, varying $d$ and (b) fixed $d$, varying $n$.

We note that because of the rapid decay of error with number of kernel PCA components for real signals, as displayed in Fig. 1, $\|\gamma\|_2^2$ should decay rapidly for increasing $d$, becoming very small and ensuring a tight bound. Also, as $n \to \infty$, we find that $\|\Phi(y) - \widehat{\Phi(y)}\|_2^2 \leq 2\|\gamma\|_2^2$ with probability 1.

Proof. The measurement vector $M$ can be written as
$$M_{n \times 1} = G_{n \times d}\,\beta_{d \times 1} + H_{n \times (q-d)}\,\gamma_{(q-d) \times 1}$$
where the subscripts indicate dimensions and $H$ is the matrix with $i,j$th entry $H_{ij} = \langle e_i, v_{j+d} \rangle$. The error is thus
$$\|\Phi(y) - \widehat{\Phi(y)}\|_2^2 = \left\| \sum_{i=1}^{d} (\beta_i - \hat{\beta}_i) v_i + \sum_{j=d+1}^{q} \gamma_{j-d} v_j \right\|_2^2 = \|\beta - \hat{\beta}\|_2^2 + \|\gamma\|_2^2 \qquad (7)$$

where
$$\|\beta - \hat{\beta}\|_2 = \|\beta - G^{+}M\|_2 = \|\beta - G^{+}(G\beta + H\gamma)\|_2 \leq \|(I - G^{+}G)\beta\|_2 + \|G^{+}H\gamma\|_2 \qquad (8)$$

Here, $G$ and $H$ are both i.i.d. matrices of Gaussian random variables (entries are linear projections of an isotropic Gaussian onto orthogonal vectors), each distributed according to $\mathcal{N}(0, \frac{1}{n})$. $G$ and $H$ are also independent from each other. Hence, for the first term $\|(I - G^{+}G)\beta\|_2$ in (8), since $n > d$, the columns of $G$ are linearly independent with probability 1. So $G^{T}G$ is full rank and
$$\|(I - G^{+}G)\beta\|_2 = \|(I - (G^{T}G)^{-1}G^{T}G)\beta\|_2 = 0 \ \text{with prob. 1} \qquad (9)$$

For the second term $\|G^{+}H\gamma\|_2$ in (8), we first note that
$$\|G^{+}H\gamma\|_2 \leq \|G^{+}\|_{op}\|H\gamma\|_2 \qquad (10)$$
where $\|A\|_{op} = \max_{x \in \mathbb{R}^n, x \neq 0} \frac{\|Ax\|}{\|x\|}$. Hence, since $G$ and $H$ are independent,
$$\Pr\left(\|G^{+}H\gamma\|_2 \leq K_1 K_2\right) \geq \Pr\left(\|G^{+}\|_{op} \leq K_1\right)\Pr\left(\|H\gamma\|_2 \leq K_2\right) \qquad (11)$$
Since $G$ is an $n \times d$ matrix with $d \leq n$ with entries i.i.d. drawn from $\mathcal{N}(0, \frac{1}{n})$, we use a theorem of Szarek [7], which states that the smallest singular value of $G$, $\sigma_{\min}(G)$, obeys
$$\Pr\left(\sigma_{\min}(G) > 1 - \sqrt{d/n} - r\right) \geq 1 - e^{-nr^2/2}$$
so since we can verify that $\|G^{+}\|_{op} = \|(G^{T}G)^{-1}G^{T}\|_{op} \leq \frac{1}{\sigma_{\min}(G)}$, we have
$$\Pr\left(\|G^{+}\|_{op} \leq \frac{1}{1 - \sqrt{d/n} - r}\right) \geq 1 - e^{-nr^2/2} \qquad (12)$$

Now consider $\|H\gamma\|_2$:
$$\|H\gamma\|_2^2 = \left\| \begin{bmatrix} \langle e_1, v_{d+1} \rangle & \cdots & \langle e_1, v_q \rangle \\ \vdots & \ddots & \vdots \\ \langle e_n, v_{d+1} \rangle & \cdots & \langle e_n, v_q \rangle \end{bmatrix} \begin{bmatrix} \gamma_1 \\ \vdots \\ \gamma_{q-d} \end{bmatrix} \right\|_2^2 = \frac{1}{n} \sum_{k=1}^{n} \left\langle \sqrt{n}\, e_k, \sum_{i=1}^{q-d} \gamma_i v_{d+i} \right\rangle^2$$
Let $a_k = \langle \sqrt{n}\, e_k, \sum_{i=1}^{q-d} \gamma_i v_{d+i} \rangle$, so each $a_k$ is a linear projection of a Gaussian random variable, and thus $a_k$ is still Gaussian. Moreover, note $E(a_k) = 0$, $\mathrm{var}(a_k) = \|\gamma\|_2^2$, and $\{a_k\}_{k=1}^{n}$ are independent. Each $a_k$ is thus i.i.d. in $\mathcal{N}(0, \|\gamma\|_2^2)$.

Hence, $E(\|H\gamma\|_2^2) = \|\gamma\|_2^2$ and $\mathrm{var}(\|H\gamma\|_2^2) = \frac{2}{n}\|\gamma\|_2^4$. Using the one-sided Chebyshev inequality,
$$\Pr\left(W - E(W) \geq b\right) \leq \frac{\mathrm{var}(W)}{\mathrm{var}(W) + b^2} \qquad (13)$$
to bound the tail decay of this random variable, we have
$$\Pr\left(\|H\gamma\|_2^2 \leq \|\gamma\|_2^2 + b\right) \geq 1 - \frac{\frac{2}{n}\|\gamma\|_2^4}{\frac{2}{n}\|\gamma\|_2^4 + b^2} = \frac{b^2}{\frac{2}{n}\|\gamma\|_2^4 + b^2} \qquad (14)$$
Combining (7), (8), (9), (11), (12), (14) proves the theorem. $\blacksquare$
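As a small empirical illustration (ours, not part of the paper), the Szarek-type bound used in (12) can be checked by simulation; the values of $n$, $d$, $r$, and the trial count below are arbitrary choices for the sketch:

```python
import numpy as np

rng = np.random.default_rng(1)
n, d, r, trials = 200, 100, 0.1, 500

# Empirical frequency of the event sigma_min(G) > 1 - sqrt(d/n) - r
hits = 0
for _ in range(trials):
    G = rng.normal(0.0, 1.0 / np.sqrt(n), size=(n, d))   # entries i.i.d. N(0, 1/n)
    if np.linalg.svd(G, compute_uv=False)[-1] > 1 - np.sqrt(d / n) - r:
        hits += 1

print(f"empirical frequency: {hits / trials:.3f}   lower bound: {1 - np.exp(-n * r**2 / 2):.3f}")
```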

While our measurement vectors may not always be i.i.d. Gaussian in feature space, we note that the above proof primarily relied on lower bounding the smallest singular value of $G$ and upper bounding $\|H\gamma\|_2$, which should also be possible for the $\Phi$ corresponding to many standard kernels. We hope to address this further in future work.

5. CONCLUSION

We have demonstrated that signals which show nonlinear sparsity, i.e. which can be represented in terms of a small number of components in feature space under an appropriate choice of kernel, can be reconstructed from far fewer random measurements than is typically required using traditional compressive sensing, indeed sometimes an order of magnitude fewer. Moreover, using the kernel trick framework, these gains can be achieved with very little computation, in a time comparable to that of traditional compressive sensing recovery algorithms. We hope to explore applications of this framework to other signals that show nonlinear sparsity, such as patches of natural images, in future work.

6. REFERENCES

[1] G. Carlsson et al., "On the local behavior of spaces of natural images," Int. Jour. Computer Vision, pp. 1-12, 2008.

[2] B. Scholkopf et al., "Kernel principal component analysis," in ICANN, pp. 583-588. Springer, 1997.

[3] D.L. Donoho, "Compressed sensing," IEEE Trans. on Information Theory, pp. 1289-1306, Apr. 2006.

[4] E.J. Candes and M.B. Wakin, "An introduction to compressive sampling," IEEE Signal Proc. Mag., pp. 21-30, Mar. 2008.

[5] B. Scholkopf and A.J. Smola, Learning with Kernels: Support Vector Machines, Regularization, Optimization, and Beyond, MIT Press, Cambridge, MA, 2001.

[6] J. Kwok et al., "The pre-image problem in kernel methods," IEEE Trans. on Neural Networks, pp. 1517-1525, Nov. 2004.

[7] S. J. Szarek, "Condition numbers of random matrices," J. Complexity, pp. 131-149, 1991.