
Image Hallucination Using Neighbor Embedding over Visual Primitive Manifolds

Wei Fan & Dit-Yan Yeung
Department of Computer Science and Engineering, Hong Kong University of Science and Technology

{fwkevin,dyyeung}@cse.ust.hk

Abstract

In this paper, we propose a novel learning-based method for image hallucination, with image super-resolution being a specific application that we focus on here. Given a low-resolution image, its underlying higher-resolution details are synthesized based on a set of training images. In order to build a compact yet descriptive training set, we investigate the characteristic local structures contained in large volumes of small image patches. Inspired by recent progress in manifold learning research, we adopt the assumption that small image patches in the low-resolution and high-resolution images form manifolds with similar local geometry in the corresponding image feature spaces. This assumption leads to a super-resolution approach which reconstructs the feature vector corresponding to an image patch from its neighbors in the feature space. In addition, the residual errors associated with the reconstructed image patches are also estimated to compensate for the information loss in the local averaging process. Experimental results show that our hallucination method can synthesize higher-quality images compared with other methods.

1. Introduction

Image super-resolution refers to the process by which a higher-resolution enhanced image is synthesized from one or more low-resolution images. It finds a number of real-world applications, which include restoring historic photographs, enlarging "thumbnail" images on web pages, and image-based rendering for high-quality display purposes. Practical super-resolution methods may make use of a single still image or a sequence of consecutive video frames with sub-pixel translation for synthesizing a higher-resolution image. In this paper, we focus on the problem of single-image "hallucination" with the goal of inferring some high-resolution details missing in the original image that cannot be recovered by simple sharpening.

In recent years, there has been a good deal of research into learning-based approaches for image hallucination as well as other related low-level vision problems [2, 4, 5, 6, 7, 11]. These learning-based methods share the common characteristic of using a training set of image (observation) and scene (state) pairs to build a co-occurrence model. With the learnt model, one can then predict the missing details in the observed input image by "borrowing" information from some similar examples in the training set.

Due to the contiguous nature of objects and surfaces in visual environments, images from natural scenes only constitute a minuscule fraction of the space of all possible images [9]. However, it is difficult, if not totally impossible, to precisely model the probability distribution of natural images for the generic super-resolution task. Instead, what we can do is to study the distribution of small image patches and see what kinds of local image structures (e.g., edges or corners) are likely to occur in the image. These local image patches, from either the low-resolution or high-resolution image, are the building blocks of our super-resolution or image hallucination approach. They are expected to lie along a continuous nonlinear manifold embedded in a high-dimensional image space. Inspired by a well-known manifold learning method called locally linear embedding (LLE) [10], we assume that small image patches in the low-resolution and high-resolution images form manifolds with similar local geometry in the corresponding image spaces. This assumption generally holds as long as the image patches are associated with image primitives and the feature descriptions for the two corresponding images are both isometric. Our contribution is to devise an effective method for generic image hallucination using locally linear fitting and a learnt image primitive model.

A flowchart of our image hallucination approach is shown in Figure 1. We highlight the major steps of our approach here. In the learning phase, large volumes of image primitive patches are extracted from both the low-resolution and high-resolution images used for training. A training set is constructed by analyzing the local neighborhood relationships between the low-resolution and high-resolution image patches and keeping only the isometric regions along the two manifolds. During the synthesis phase, a low-resolution image is presented to the system. Each element in the target high-resolution image comes from an optimal linear reconstruction by its nearest neighbors in the training set. Moreover, the residual errors associated with the reconstructed image patches are also estimated to compensate for the information loss in the local averaging process. Finally, we enforce the local compatibility and smoothness constraints between patches in the target high-resolution image through overlapping.

Figure 1. Flowchart of our image hallucination approach (training set construction, local neighbor embedding, and residual error estimation).

The rest of this paper is organized as follows. In Section 2, we give a brief overview of the background and some related work. The problem setting and detailed algorithm are described in Sections 3 and 4, respectively. Section 5 presents some experimental results, demonstrating the effectiveness and efficiency of our method. Finally, we conclude our paper in Section 6.

2. Related work

Image super-resolution is intrinsically an ill-posed problem since, theoretically, many high-resolution images can give rise to the same low-resolution image through some operations such as smoothing and subsampling. Traditional pixel interpolation methods with smoothness priors, such as pixel replication and cubic-spline interpolation, introduce artifacts and blurred edges. Reconstruction-based methods [8] aim to restore the lost image details by requiring the down-sampled version of the high-resolution reconstructed image to be as close to the original low-resolution image as possible. However, in the absence of additional information, these generic constraints are not very effective for synthesizing perceptually plausible images which are of higher quality than the original low-resolution images.

Over the past few years, learning-based approaches have produced compelling results for various low-level vision tasks, including image hallucination [2, 4, 5, 7, 11], image analogy [6], and texture synthesis [3]. Despite some implementation-level differences, these algorithms are all similar in spirit. In the learning stage, they learn the underlying scene details that correspond to different image regions observed in the input. During the inference stage, they use those learned relationships to predict missing details in another image, which is the target high-resolution image for super-resolution problems. Thus each element in the target image comes from only one "best" example in the training set. The neighbor embedding method proposed by [2] introduces a more general way of using the training examples. In their method, multiple training examples can contribute simultaneously to the generation of each image patch in the high-resolution image. The underlying assumption of the method is that small image patches in the low-resolution and high-resolution images form manifolds with similar local geometry in the two corresponding image spaces. However, they use uniformly sampled image patches from several training images to build a medium-sized training set, which may not satisfy the manifold assumption and hence may not lead to good generalization.

Motivated by the work of [11] in applying primal sketch priors for image hallucination, we conjecture that the image patches associated with the image primitives (e.g., edges, corners, and blobs, similar to the primal sketches in [11]) are of greater significance for building a good training set. This conjecture will be verified empirically in Section 4 through statistical analysis, showing that these image patches preserve well the local isometric relationships between the manifolds for the low-resolution and high-resolution image patches.

Our method also benefits from the face hallucination work of [7], in which the relationships between the low-resolution and high-resolution residues are learnt by coupled PCA to refine the final hallucinated face image. In our case, the residual error vectors of neighboring examples are consistent, since they are expected to lie on a subspace perpendicular to the linear tangent plane of the curved manifold. Adding a simple average of these residues is sufficient for error compensation.

In the next two sections, we will formulate the problem more precisely and then present details of different components of our method.

3. Problem setting

The single-image super-resolution problem that we want to solve can be formulated as follows. Given a low-resolution image $L_t$ as input, we estimate the target high-resolution image $H_t$ with the help of a training set of one or more low-resolution images $L_s$ and the corresponding high-resolution images $H_s$. We represent each low-resolution or high-resolution image as a set of small overlapping image patches.

Ideally, each patch generated for the high-resolution image $H_t$ should not only be related appropriately to the corresponding patch in the low-resolution image $L_t$, but should also preserve some inter-patch relationships with adjacent patches in $H_t$. The former determines the accuracy while the latter determines the local compatibility and smoothness of the high-resolution image. To satisfy these requirements as much as possible, our method has the following properties: (a) Each patch in $H_t$ is associated with multiple patch transformations learned from the training set. (b) Local relationships between patches in $L_t$ should be preserved in $H_t$. (c) Neighboring patches in $H_t$ are constrained through overlapping to enforce local compatibility and smoothness.

4. Details of our method

4.1. Preprocessing

In the preprocessing step, a high-resolution natural image $H$ (Figure 2(c)) is blurred and sub-sampled to generate a corresponding low-resolution image $L$ (Figure 2(a)). Applying an initial enhancement through bilinear interpolation to $L$, we obtain an image $H^l$ (Figure 2(b)) which has the same size as $H$ but lacks the high-resolution details. In the training set, we only need to store the differences between $H$ and $H^l$ (Figure 2(e)), which correspond to the missing high-frequency components caused by the image degradation process. Through band-pass filtering, we further decompose each interpolated image $H^l$ into the sum of two images containing the medium and low spatial frequencies, respectively. Following the assumption in [5] that the highest spatial-frequency components of the low-resolution image are most important for predicting the extra details of $H$, we only store the example patches from the medium-frequency layer (Figure 2(d)). Finally, to achieve good generalization, the high- and medium-frequency image pairs are contrast normalized by a local measure of energy in the image. We undo this normalization step later when we reconstruct the high-resolution image. The final output is the sum of the interpolated low-resolution image and the high-frequency predictions.
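As a rough illustration of this preprocessing pipeline (not the authors' code), the NumPy/SciPy sketch below follows the steps above; the Gaussian widths, the difference-of-Gaussians band-pass, and the 7-pixel local-energy window are our own assumptions:

```python
import numpy as np
from scipy import ndimage

def preprocess_pair(H, factor=3, sigma=1.0, eps=1e-6):
    """Build one training layer pair from a high-resolution grayscale image H.
    Returns the contrast-normalized medium-frequency layer of the interpolated
    image H_l and the corresponding normalized high-frequency layer H - H_l."""
    # Blur and sub-sample H to simulate the low-resolution image L.
    L = ndimage.gaussian_filter(H, sigma)[::factor, ::factor]
    # Initial enhancement: bilinear interpolation of L back to the size of H.
    H_l = ndimage.zoom(L, factor, order=1)[:H.shape[0], :H.shape[1]]
    # High-frequency layer: the details missing from the interpolated image.
    high_freq = H - H_l
    # Medium-frequency (band-pass) layer of H_l via a difference of Gaussians.
    medium_freq = ndimage.gaussian_filter(H_l, sigma) - ndimage.gaussian_filter(H_l, 2.0 * sigma)
    # Contrast-normalize both layers by a local energy measure of the medium layer.
    energy = np.sqrt(ndimage.uniform_filter(medium_freq ** 2, size=7)) + eps
    return medium_freq / energy, high_freq / energy
```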

4.2. Image primitives for training

An essential factor contributing to the success of learning-based approaches is how to construct a good training set, one that is descriptive enough to give useful information about the image-scene relationships and also compact enough for computational efficiency and good generalization. The patches extracted from the preprocessed images can be regarded as points in a vector space, with each dimension corresponding to one pixel in the patch. As natural images contain characteristic statistical regularities, we expect these feature points to lie on a continuous nonlinear manifold. Different regions of the manifold may correspond to characteristic image primitives such as edges, corners, and blobs. When building our training set, we put emphasis on the patches associated with these image primitives for two main reasons. First, the missing high-frequency details to be estimated are densely distributed over the regions of image primitives, as shown in Figures 3(b) and 3(c). Focusing on these regions can lead to a significant speedup, as fewer patches need to be transformed. Second, we believe the local neighborhood relationships between low-resolution and high-resolution primitive patches in the two feature spaces are more consistent than those between general image patches. This is supported by the experimental investigation reported in Section 4.

Figure 2. Image preprocessing steps. (a) low-resolution image; (b) initial interpolation of (a) to a higher resolution; (c) original high-resolution image; (d) band-pass filtered and contrast-normalized version of (b); (e) high-pass filtered and contrast-normalized version of (c).

The primitive patches are extracted by convolving the interpolated low-resolution image $H^l$ with a bank of maximum response (MR) filters (Figure 3(a)) [13]. The filter bank consists of a Gaussian kernel and a Laplacian of Gaussian kernel, which are arranged in three scales and six orientations each. To achieve scale invariance, the outputs are "collapsed" by recording only the maximum filter response across all scales. This reduces the number of responses for each pixel from 36 (six orientations at three scales for each of two oriented filters) to 12 (six orientations for each of two filters). Figure 3(c) depicts the magnitude map of the filtered responses for a low-resolution image. Compared with the high-frequency difference image shown in Figure 3(b), we can clearly see the consistent relationships between them.
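For concreteness, a simplified filter bank of this flavor can be sketched as follows; the anisotropic Gaussian-derivative kernels, the kernel size, and the scales are our own assumptions and only approximate the MR filters of [13]:

```python
import numpy as np
from scipy import ndimage

def oriented_kernel(size, sigma, theta, order):
    """Anisotropic Gaussian-derivative kernel at orientation theta.
    order=1 gives an edge-like filter, order=2 a bar-like filter."""
    half = size // 2
    y, x = np.mgrid[-half:half + 1, -half:half + 1]
    xr = x * np.cos(theta) + y * np.sin(theta)      # along the structure
    yr = -x * np.sin(theta) + y * np.cos(theta)     # across the structure
    g = np.exp(-(xr ** 2 / (2 * (3 * sigma) ** 2) + yr ** 2 / (2 * sigma ** 2)))
    k = -yr / sigma ** 2 * g if order == 1 else (yr ** 2 / sigma ** 4 - 1 / sigma ** 2) * g
    return k - k.mean()                              # zero-mean kernel

def primitive_energy_map(img, sigmas=(1, 2, 4), n_orient=6, size=19):
    """Collapse oriented responses over scales (36 -> 12) and return the
    magnitude map used to select primitive patches (cf. Figure 3(c))."""
    collapsed = []
    for order in (1, 2):                             # two oriented filter types
        for i in range(n_orient):
            theta = i * np.pi / n_orient
            per_scale = [np.abs(ndimage.convolve(img, oriented_kernel(size, s, theta, order)))
                         for s in sigmas]
            collapsed.append(np.max(per_scale, axis=0))   # max across scales
    return np.max(collapsed, axis=0)
```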

Figure 3. (a) Filter bank used for primitive extraction; (b) high-frequency difference image to be estimated; (c) magnitude map of the filtered responses for the low-resolution image.

To sum up, each example in the training set is in the form of a pair of primitive patches. These pairs capture the statistical relationships that we are interested in. We represent each image primitive by a 7 × 7 image patch, which is selected from the regions in Figure 3(c) with high magnitude value or energy level.
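A minimal sketch of this patch-pair extraction might look as follows; the quantile-based cutoff is an illustrative stand-in for the hard threshold mentioned in Section 4.4, and the layer names follow the preprocessing sketch above:

```python
import numpy as np

def extract_primitive_pairs(medium, high, energy, patch=7, thresh_quantile=0.9):
    """Collect 7x7 (medium-frequency, high-frequency) patch pairs centered at
    pixels whose primitive energy exceeds a hard threshold (assumed here)."""
    half = patch // 2
    thresh = np.quantile(energy, thresh_quantile)
    rows, cols = np.where(energy > thresh)
    pairs = []
    for r, c in zip(rows, cols):
        # Skip centers too close to the border for a full 7x7 window.
        if half <= r < energy.shape[0] - half and half <= c < energy.shape[1] - half:
            lo = medium[r - half:r + half + 1, c - half:c + half + 1].ravel()
            hi = high[r - half:r + half + 1, c - half:c + half + 1].ravel()
            pairs.append((lo, hi))
    return pairs
```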

4.3. Local neighbor embedding

The manifold structure of image patches characterizes the smooth variation corresponding to some regular transformations in natural images. For instance, we can expect that the manifold coordinates of an edge structure correspond to its orientation, translation, and blurring variations. In super-resolution problems, we further assume that the manifolds of the small patches in the low-resolution and high-resolution images bear similar local geometry in the two spaces. This assumption holds as long as the patches are associated with image primitives and the two feature descriptions are isometric.

In recent years, manifold learning (or nonlinear dimensionality reduction) methods have emerged as powerful tools for discovering a "faithful" low-dimensional representation of the original data embedded in some high-dimensional observation space [1, 10, 12, 14]. The main computations for these methods are based on tractable, polynomial-time optimizations, such as shortest path problems, least squares fits, semidefinite programming, and matrix diagonalization. Our super-resolution method to be described below has been inspired by the LLE algorithm [10]. Its key idea is that the local geometry in the neighborhood of each data point can be characterized by linear coefficients that reconstruct the data point from its neighbors.

For convenience, we use $l_s^p$, $h_s^p$, $l_t^q$ and $h_t^q$ to denote the feature vectors as well as the corresponding low- and high-resolution image patches, and $L_s$, $H_s$, $L_t$ and $H_t$ to denote the sets of feature vectors as well as the corresponding images. The neighbor embedding algorithm of our method can be summarized as follows:

Algorithm - Local neighbor embedding

Input: low-resolution test patches $L_t = \{l_t^1, l_t^2, \ldots, l_t^{N_t}\}$
Output: high-resolution patches $H_t = \{h_t^1, h_t^2, \ldots, h_t^{N_t}\}$

Begin

1. For each patch $l_t^q$ in image $L_t$:

   (a) Find the set $N_q$ of $K$ nearest neighbors in $L_s$.

   (b) Compute the reconstruction weights of the neighbors that minimize the error of reconstructing $l_t^q$.

   (c) Compute the initial high-resolution embedding $h_t^q$ using the appropriate high-resolution features of the $K$ nearest neighbors and the reconstruction weights.

   (d) Estimate the high-resolution residual error vector $e_t^q$ using the average residual error vector of the $K$ nearest neighbors in $N_q$, and update $h_t^q$ with $h_t^q + e_t^q$.

2. Construct the target high-resolution image $H_t$ by enforcing the local compatibility and smoothness constraints between adjacent patches obtained in step 1(d).

End

We implement step 1(a) by using Euclidean distance to define the neighborhood. Based on the $K$ nearest neighbors identified, step 1(b) seeks to find the best reconstruction weights for each patch $l_t^q$ in $L_t$. Optimality is achieved by minimizing the local reconstruction error for $l_t^q$

$\varepsilon^q = \big\| l_t^q - \sum_{l_s^p \in N_q} \omega_{qp}\, l_s^p \big\|^2$    (1)

which is the squared distance between $l_t^q$ and its reconstruction, subject to the constraints $\sum_{l_s^p \in N_q} \omega_{qp} = 1$ and $\omega_{qp} = 0$ for any $l_s^p \notin N_q$. Minimizing $\varepsilon^q$ subject to the constraints is a constrained least squares problem. We define a local Gram matrix $G_q$ for $l_t^q$ as

$G_q = (l_t^q \mathbf{1}^T - L)^T (l_t^q \mathbf{1}^T - L)$    (2)

where $\mathbf{1}$ is a column vector of ones and $L$ is a $D \times K$ matrix with its columns being the neighbors of $l_t^q$. Moreover, we group the weights of the neighbors to form a $K$-dimensional weight vector $w_q$ by reordering the subscript $p$ of each weight $\omega_{qp}$. The constrained least squares problem has the following closed-form solution:

$w_q = \dfrac{G_q^{-1} \mathbf{1}}{\mathbf{1}^T G_q^{-1} \mathbf{1}}$    (3)

Instead of inverting $G_q$, we apply a more efficient method by solving the linear system of equations $G_q w_q = \mathbf{1}$ and then normalizing the weights so that $\sum_{l_s^p \in N_q} \omega_{qp} = 1$. After repeating steps 1(a) and 1(b) for all $N_t$ patches in $L_t$, the reconstruction weights obtained form a weight matrix $W = [\omega_{qp}]_{N_t \times N_s}$.

Step 1(c) computes the initial value of $h_t^q$ based on $W$:

$h_t^q = \sum_{l_s^p \in N_q} \omega_{qp}\, h_s^p$    (4)
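A compact NumPy sketch of steps 1(a)-(c) is given below, assuming the training features and patches are stored as row-vector arrays; the regularization term added to the Gram matrix (a common stabilizer in LLE-style solvers) is our own addition, not part of Eqs. (1)-(4):

```python
import numpy as np

def neighbor_embedding_patch(l_q, L_s, H_s, K=5, reg=1e-6):
    """Steps 1(a)-(c): reconstruct one low-resolution feature vector l_q from its
    K nearest neighbors in L_s and transfer the weights to the high-resolution
    patches H_s.  L_s: (N_s, D) low-res features, H_s: (N_s, D_h) high-res patches."""
    # 1(a): K nearest neighbors under Euclidean distance.
    dists = np.linalg.norm(L_s - l_q, axis=1)
    nbr_idx = np.argsort(dists)[:K]
    # 1(b): constrained least squares weights via the local Gram matrix, Eqs. (2)-(3).
    diff = l_q - L_s[nbr_idx]                    # K rows of (l_q - neighbor)
    G = diff @ diff.T                            # local Gram matrix G_q
    G += reg * np.trace(G) * np.eye(K)           # small regularizer (assumption)
    w = np.linalg.solve(G, np.ones(K))           # solve G_q w_q = 1 instead of inverting
    w /= w.sum()                                 # enforce the sum-to-one constraint
    # 1(c): initial high-resolution embedding, Eq. (4).
    h_q = w @ H_s[nbr_idx]
    return h_q, nbr_idx, w
```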

Step 1(d) estimates the high-resolution residual error vector $e_t^q$ of the linear reconstruction

$e_t^q = \frac{1}{K} \sum_{l_s^p \in N_q} e_s^p = \frac{1}{K} \sum_{l_s^p \in N_q} \Big( h_s^p - \sum_{h_s^r \in N_p} \omega_{pr}\, h_s^r \Big)$    (5)

where $e_s^p$ is the residual error of the neighboring patch $l_s^p \in N_q$. It is calculated in the learning phase and stored together with its associated $l_s^p$. We use the average of the $e_s^p$ to estimate $e_t^q$, with the assumption that these neighboring residual error vectors are in approximately the same direction, which is perpendicular to the tangent plane at $h_t^q$.

In step 2, we use a simple method to enforce inter-patch relationships by averaging the feature values in regions where adjacent patches overlap. Other more sophisticated methods may also be used.
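The residual compensation of step 1(d) and the overlap averaging of step 2 can be sketched as follows; the helper assumes the per-example residuals $e_s^p$ have been precomputed and stored at training time, and the accumulation-plus-count blending is our own, simple choice for the averaging:

```python
import numpy as np

def compensate_residual(h_q, nbr_idx, E_s):
    """Step 1(d): add the average stored residual of the K neighbors, Eq. (5).
    E_s: (N_s, D_h) residual vectors e_s^p computed in the learning phase."""
    return h_q + E_s[nbr_idx].mean(axis=0)

def assemble_image(patches, positions, out_shape, patch_size=7):
    """Step 2: place 7x7 high-frequency patches at their (row, col) top-left
    positions and average the values where adjacent patches overlap."""
    acc = np.zeros(out_shape)
    count = np.zeros(out_shape)
    for p, (r, c) in zip(patches, positions):
        acc[r:r + patch_size, c:c + patch_size] += p.reshape(patch_size, patch_size)
        count[r:r + patch_size, c:c + patch_size] += 1
    return acc / np.maximum(count, 1)            # simple overlap averaging
```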

4.4. Training set revisited

An essential assumption of our local neighbor embedding method is that the linear reconstruction weight vector of $l_s^p$ and that of its corresponding $h_s^p$ should be approximately the same in the two corresponding image spaces. In this subsection, we evaluate this assumption by computing the standard linear correlation coefficient $R(w_l, w_h)$ between the two groups of weight vectors, $w_l$ and $w_h$. Evaluation is performed under two settings, either using a randomly sampled patch set or using an image primitive patch set. Each data set contains around 25,000 pairs of low-resolution and high-resolution patches. Figure 4 shows the histograms of the correlation coefficient under the two settings, with its value ranging from −1 to 1. From this comparison, we can clearly see that the local neighborhood relationships between low-resolution and high-resolution primitive patches are more consistent than those between general image patches.

However, since the image primitives are extracted using a hard threshold on the filtered low-resolution image, it is inevitable that some "noisy" patches will be included in the training set, resulting in the low-correlation part of Figure 4(b). In our experiment, we only keep those primitive patches and their $K$ nearest neighbors whose correlation coefficients are greater than 0.7. These patches form a compact training set which can characterize well the manifold structure of natural images.
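As an illustration of this consistency check (not the authors' code), the correlation for one training pair can be computed as below, assuming the same $K$ corresponding neighbors are used in both spaces; the small regularizer in the weight solver is again our own assumption:

```python
import numpy as np

def lle_weights(x, nbrs, reg=1e-6):
    """Sum-to-one reconstruction weights of x from its neighbors (rows of nbrs)."""
    diff = x - nbrs
    G = diff @ diff.T + reg * np.eye(len(nbrs))
    w = np.linalg.solve(G, np.ones(len(nbrs)))
    return w / w.sum()

def weight_correlation(l_p, h_p, L_nbrs, H_nbrs):
    """R(w_l, w_h): correlation between the weight vectors computed for a
    low-resolution patch and its corresponding high-resolution patch,
    using the same K neighbors in the two feature spaces."""
    w_l = lle_weights(l_p, L_nbrs)
    w_h = lle_weights(h_p, H_nbrs)
    return np.corrcoef(w_l, w_h)[0, 1]

# Training pairs whose R(w_l, w_h) exceeds 0.7 would be kept in the compact set.
```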

Figure 4. Histograms of the correlation coefficient $R(w_l, w_h)$ between reconstruction weight vectors for $l_s^p$ and $h_s^p$ in the two image spaces under two settings: (a) randomly sampled image patches; (b) image primitive patches. The red arrows indicate the median values.

5. Experiments

We build training sets for the super-resolution algorithm from band-pass and high-pass pairs taken from a set of training images (see Figure 5). All the eight representative natural images were downloaded from a public web site.¹ They were taken with a Canon EOS D60 digital camera with a resolution of 500 × 433 pixels. About 400,000 primitive examples have been extracted from these training images.

¹ http://www.the-digital-picture.com/Gallery/

Figure 5. Eight training images (downloaded from http://www.the-digital-picture.com/Gallery/) used in our experiments.

Our method has only three parameters to determine. The first parameter is the number of nearest neighbors $K$ for neighbor embedding. Our experiments show that the super-resolution result is not very sensitive to the choice of $K$. We set $K$ to 5 for all our experiments. The second and third parameters are the patch size and the degree of overlap between adjacent patches. For both the low-resolution and high-resolution images, we use 7 × 7 pixel patches and let the overlap between adjacent patches be 4 pixels. The corresponding low-resolution and high-resolution image patches are properly aligned by their geometrical centers in the image plane. Principal component analysis (PCA) is performed on the low-resolution patch set to reduce its dimensionality to 15, which covers more than 98% of the total variance. The high-resolution feature vector is represented by concatenating all 7 × 7 pixels in the patch, since we cannot find an 'elbow' at which the eigenvalue curve ceases to decrease significantly with added dimensions. Note that we perform hallucination on the image intensity only, because humans are more sensitive to the brightness information. The color channels are simply interpolated by a bilinear function.
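For reference, the low-resolution feature reduction could be implemented roughly as below; the use of scikit-learn and the printed variance check are assumptions for illustration, not details given in the paper:

```python
import numpy as np
from sklearn.decomposition import PCA

def build_lowres_features(patches, n_components=15):
    """Project 7x7 low-resolution (medium-frequency) patches onto the first
    15 principal components, reported to cover more than 98% of the variance."""
    X = np.asarray(patches).reshape(len(patches), -1)   # flatten to 49-dim vectors
    pca = PCA(n_components=n_components)
    feats = pca.fit_transform(X)
    print("explained variance: %.3f" % pca.explained_variance_ratio_.sum())
    return feats, pca   # keep the fitted PCA model to project test patches later
```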

We compare our approach with bicubic interpolation and the neighbor embedding method of Chang et al. [2] on different super-resolution examples, all with a magnification factor of 3 (Figure 6). When implementing the method in [2], we use uniformly sampled patches in the training images to build a training set of the same size as ours. It is clear that bicubic interpolation gives the smoothest result. Chang et al.'s method gives better results for some details in the images. On the other hand, sharper and smoother contours are hallucinated by our approach (e.g., see the edges of the leaf in the first example).²

² Readers are recommended to see the enlarged electronic version of the figure.

We also calculate the RMS error between the super-resolution image generated and the ground-truth image as the number of nearest neighbors $K$ varies over a range. Figure 7 shows the results for the four examples discussed above. As we can see, the RMS error attains its lowest value when $K$ is between 4 and 6, showing that using multiple nearest neighbors (as opposed to only one nearest neighbor as in the existing methods) does give improved results.

6. Conclusion

In this paper, we have proposed a learning-based method for image hallucination based on local neighbor embedding over the image primitive manifold constructed from the training images. In particular, we study image hallucination in the context of the single-image super-resolution problem. A compact yet descriptive training set is constructed from characteristic regions in images where the manifold assumption holds well. Compared with other generic single-image super-resolution methods, our method can synthesize higher-quality images. In our future work, we will apply a similar approach to other image hallucination problems.

Acknowledgments

This research has been supported by research grants HKUST621305 and N-HKUST602/05 from the Research Grants Council (RGC) of Hong Kong and the National Natural Science Foundation of China (NSFC).


References

[1] M. Belkin and P. Niyogi. Laplacian eigenmaps for dimensionality reduction and data representation. Neural Computation, 15(6):1373–1396, June 2003.

[2] H. Chang, D. Yeung, and Y. Xiong. Super-resolution through neighbor embedding. In Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, volume 1, pages 275–282, 2004.

[3] A. Efros and T. Leung. Texture synthesis by non-parametric sampling. In Proceedings of the Seventh IEEE International Conference on Computer Vision, volume 2, pages 1033–1038, 1999.

[4] W. Freeman, T. Jones, and E. Pasztor. Example-based super-resolution. IEEE Computer Graphics and Applications, 22(2):55–65, 2002.

[5] W. Freeman, E. Pasztor, and O. Carmichael. Learning low-level vision. International Journal of Computer Vision, 40(1):25–47, 2000.

[6] A. Hertzmann, C. E. Jacobs, N. Oliver, B. Curless, and D. Salesin. Image analogies. In SIGGRAPH, 2001.

[7] W. Liu, D. Lin, and X. Tang. Hallucinating faces: Tensor-patch super-resolution and coupled residue compensation. In Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, volume 2, pages 478–484, 2005.

[8] B. S. Morse and D. Schwartzwald. Image magnification using level-set reconstruction. In Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, volume 1, pages 275–282, 2001.

[9] B. Olshausen and D. Field. Natural image statistics and efficient coding. Network: Computation in Neural Systems, 7(2):333–339, 1996.

[10] S. Roweis and L. Saul. Nonlinear dimensionality reduction by locally linear embedding. Science, 290(5500):2323–2326, December 2000.

[11] J. Sun, N. Zheng, H. Tao, and H. Shum. Image hallucination with primal sketch priors. In Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, volume 2, pages 729–736, 2003.

[12] J. Tenenbaum, V. de Silva, and J. Langford. A global geometric framework for nonlinear dimensionality reduction. Science, 290(5500):2319–2323, December 2000.

[13] M. Varma and A. Zisserman. A statistical approach to texture classification from single images. International Journal of Computer Vision, 67(1-2):61–81, 2005.

[14] K. Weinberger and L. Saul. Unsupervised learning of image manifolds by semidefinite programming. In Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, volume 1, pages 275–282, 2004.

Figure 6. Test images magnified by three times using (a) bicubic interpolation, (b) Chang et al.'s method, and (c) our approach; (d) original high-resolution image.

Figure 7. RMS error between the super-resolution image generated and the ground-truth image as a function of the number of nearest neighbors used: (a) leaf image; (b) flower image; (c) face image; (d) pattern image.