
IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS FOR VIDEO TECHNOLOGY, VOL. 12, NO. 4, APRIL 2002 217

Super-Resolution Still and Video Reconstruction From MPEG-Coded Video

Yucel Altunbasak, Andrew J. Patti, and Russell M. Mersereau

Abstract—There are a number of useful methods for creating high-quality video or still images from a lower quality video source. The best of these involve motion compensating a number of video frames to produce the desired video or still. These methods are formulated in the space domain and they require that the input be expressed in that format. More and more frequently, however, video sources are presented in a compressed format, such as MPEG, H.263, or DV. Ironically, there is important information in the compressed-domain representation that is lost if the video is first decompressed and then used with a spatial-domain method. In particular, quantization information is lost once the video has been decompressed. Here, we propose a motion-compensated, transform-domain super-resolution procedure for creating high-quality video or still images that directly incorporates the transform-domain quantization information by working with the compressed bit stream. We apply this new formulation to MPEG-compressed video and demonstrate its effectiveness.

Index Terms—Constrained optimization, image reconstruction, MAP, POCS, resolution enhancement, super-resolution, transform-domain restoration, video quality.

I. INTRODUCTION

A COMMON and important problem that arises in visual communications is the need to create an enhanced-resolution video image sequence from a lower resolution input video stream. This can be accomplished by exploiting the spatial correlations that exist between successive video frames using super-resolution (SR) reconstruction. As we are using the term, SR refers to the task of increasing the spatial resolution through multiple frame processing as in [1]. This can be extended to video by using a moving window of video frames, as depicted in Fig. 1.

One example of where this need arises is in High Definition Television (HDTV). The U.S. Federal Communications Commission (FCC) has mandated that all television broadcasters in the United States must simulcast digital TV (DTV) and traditional NTSC signals. Consumers have already started buying high-definition displays (HDTV sets), and eventually most will buy such displays. Nonetheless, even in the distant future, it is predicted that only a fraction of programming will be in the HDTV format.

Manuscript received January 2, 2001; revised December 3, 2001. This work was supported in part by the National Science Foundation under Award CCR-0113681 and by the Office of Naval Research (ONR) under Award N000140110619. This paper was recommended by Associate Editor A. Luthra.

Y. Altunbasak and R. M. Mersereau are with the Center for Signal and Image Processing, School of Electrical and Computer Engineering, Georgia Institute of Technology, Atlanta, GA 30332-0250 USA (e-mail: [email protected]; [email protected]).

A. J. Patti is with Liberate Technologies, Redwood Shores, CA 94065 USA (e-mail: [email protected]).

Publisher Item Identifier S 1051-8215(02)04288-X.

Fig. 1. SR reconstruction methods may be applied with a moving window to form a higher spatial resolution video from a low-resolution source video. Both video sequences operate at the same video frame rate.

Instead, most of the bandwidth is expected to be used for additional standard definition TV (SDTV) programming along with data, Internet, and interactive TV applications. There is a clear need and a technical opportunity to design systems to enhance the quality of SDTV signals to match the quality and capabilities of high-definition displays.

A related need arises when an especially interesting portion of a video sequence is used to create an enhanced-resolution still image. Often, it is important for the still image to be produced at a higher resolution than the video format naturally affords. For example, NTSC video yields at most 480 vertical lines, whereas easily twice that many lines are required to print with reasonable resolution on modern 300-dpi and higher resolution printers.

Most of the work directed toward video SR to date, such as [1]–[15], has been based upon a spatial-domain formulation and has been implemented in that domain. However, the low-resolution input video sequences that are likely to be most important will normally be presented in a compressed format, such as MPEG, for both of the applications cited above. In such cases, one natural solution is to first decompress the video and then apply a spatial-domain SR algorithm, but this may not be a good approach. In our extensive tests, we have concluded that spatial-domain SR algorithms may not perform well when used with previously compressed video, especially when the amount of compression is high.

Fig. 2(a) shows the result of applying the spatial-domain algorithm of [1] to four low-resolution compressed images acquired with a video camera. Fig. 2(b) shows an enlarged image formed by bilinearly interpolating the first frame only. We see here that the bilinear interpolation result is less objectionable than the SR result for this example, in which the compression ratio is high.



Fig. 2. (a) Result of spatial-domain SR image reconstruction from four low-resolution compressed images using the method of [1]. (b) Result of bilinear interpolation applied to only the first low-resolution image.

When the video is available only in a transform-coded format, spatial-domain SR formulations fail to make the most effective use of all of the data that is available. Specifically, they ignore any quantization information that is available in the coded video. Zakhor [16] showed in 1992 how this information could be employed to reduce the coding artifacts associated with compressed still images. In this paper, we propose a novel compressed-domain SR formulation that borrows from spatial-domain video SR and from Zakhor's artifact-reduction procedure to make explicit use of the quantization information that is embedded in the video stream.

II. PRIOR WORK

When considering the problem of creating an SR still image from transform-coded video, there are two areas of research that are relevant: 1) spatial-domain SR methods for video and still images¹ and 2) block-artifact reduction methods for compressed images and video. Until now, the development of these two areas has been separate.

A. Spatial-Domain SR

By carrying out a frequency-domain analysis of the SR problem, Tsai and Huang [11] showed that for SR to be effective, there must be frequency aliasing present in the available lower resolution video frames. Maximum a posteriori (MAP) estimators have been shown to provide a simple and effective means of incorporating various regularization constraints into spatial-domain SR reconstruction [3], [6]. Schultz and Stevenson [6], [17] used a discontinuity-preserving prior model within a MAP estimation context, where the likelihood of edges was controlled by the Huber edge penalty function. The projections onto convex sets (POCS)-based methods developed in [2], [18] allow for noise to be taken into account by using a spatial noise variance estimate on a pixel-by-pixel basis to constrain the solution directly.

¹As we use the term, spatial-domain SR refers to SR reconstruction with input video expressed in the spatial domain.

If one were to use a spatial-domain POCS method for transform-coded SR, it would be possible to account for quantization noise using this mechanism. A better method, however, which has been employed within a different context, will be discussed in Section III.

B. Blocking-Artifact Reduction

The second body of research that is relevant to the SR of transform-coded video has been directed toward removing blocking artifacts in transform-coded images. References [16], [19], [20] are especially relevant in the context of this paper, since they all employ a POCS solution that makes direct use of the quantization information that is available. This suggests a solution to SR that first performs the best possible job of post-processing the individual still frames and then proceeds with spatio-temporal SR using these frames. There are two important reasons, however, why this approach is not optimal. First, in addition to applying convex sets to create bounded data-consistency constraints, the block-reduction approach attempts to remove artifacts by removing the higher frequency content in the decoded images. But, as Tsai and Huang showed in [11], it is important that there be aliasing present in the decoded images for SR to be effective. Block-artifact reduction does not preserve this necessary aliasing information. Second, none of the formulations that use the POCS data-consistency sets on single frames are capable of applying similar information from neighboring frames to enable a form of motion-compensated filtering. This is extremely important, because blocking artifacts can be naturally mitigated by an appropriate formulation that uses the POCS data-consistency sets to include data from surrounding frames, thus exploiting the same temporal correlations used in SR formulations. The formulation we propose does precisely this within the framework of SR image reconstruction.

III. THEORY

There are two identifiable components in spatial-domain SR formulations. The first models the imaging process as a system of equations and constraints that must be satisfied by a feasible solution. These equations are generally linear, but space-varying (LSV). The second component is a method for finding a solution to this system of constrained equations. The fact that the imaging process is space-varying complicates this step. In transform image and video coding, the transform used is usually also linear and, in the case of the block DCT, space-varying. This allows us to extend the modeling step to incorporate the coding transform, as well as the imaging model, while still maintaining a linear system of equations. This is the approach that we use.

The block diagram that depicts the system to be modeled is shown in Fig. 3. The ideal continuous-time and space video signal $f_c(x_1, x_2, t)$ represents the actual input to the camera, where $x_1$ and $x_2$ denote continuous spatial variables in the horizontal and vertical directions, respectively. In that figure, however, we hypothesize the existence of a (non-observable) discrete, high-resolution (HR) video scene $f(m_1, m_2, k)$ from which we wish to extract an image at some reference time $t_r$. It is this signal that our algorithm seeks to reconstruct.


Fig. 3. Block diagram illustrating the steps involved in video acquisition, digitization, and MPEG compression.

Fig. 4. The trajectory of a point $(x_1, x_2)$ on the imaging plane at time $t$ can be described by a motion mapping $\mathbf{M}$.

To incorporate $f$ into the model, we posit that $f_c(x_1, x_2, t_r)$ is the image that could be reconstructed from the discrete-time and space HR image $f(m_1, m_2)$ through the interpolation filter $h_r$ shown in the block diagram.

Next we model the combination of image acquisition, sampling, and MPEG compression. The continuous HR image $f_c(x_1, x_2, t)$ is first blurred by the imaging optics. This is assumed to be a spatially invariant operation characterized by an impulse response $h_a(x_1, x_2)$ that results in the spatially and temporally continuous video signal $g_c(x_1, x_2, t)$. The video $g_c$ can be modeled by the convolution of $f_c$ and $h_a$

$$g_c(x_1, x_2, t) = \iint h_a(x_1 - z_1,\, x_2 - z_2)\, f_c(z_1, z_2, t)\, dz_1\, dz_2. \qquad (1)$$

Since there is motion between the frames in a video sequence, we need to incorporate the motion model into the image acquisition process. To do this, we assume that the point $(z_1, z_2)$ in Fig. 4 on the imaging plane at the reference time $t_r$ moves to the point $(x_1, x_2)$ at time $t$. The two-headed arrow represents the trajectory of this point at different times. It can be modeled using a motion mapping $\mathbf{M}$ defined as

$$(z_1, z_2) = \mathbf{M}(x_1, x_2; t, t_r). \qquad (2)$$

$\mathbf{M}$ is a vector-valued function that specifies the coordinates of the point at the reference time $t_r$ that was at location $(x_1, x_2)$ at time $t$, while $\mathbf{M}^{-1}$ is the inverse mapping. If we assume that the only changes in the scene are those caused by motion, the motion model implies

$$f_c(x_1, x_2, t) = f_c\big(\mathbf{M}(x_1, x_2; t, t_r),\, t_r\big). \qquad (3)$$

Employing the inverse motion mapping and (3), we can perform a change of variables in (1), which then becomes

$$g_c(x_1, x_2, t) = \iint h_a\big(x_1 - M_1^{-1}(z_1, z_2),\, x_2 - M_2^{-1}(z_1, z_2)\big)\, \big|\det \mathbf{J}(z_1, z_2)\big|\, f_c(z_1, z_2, t_r)\, dz_1\, dz_2 \qquad (4)$$

where $M_1^{-1}$ and $M_2^{-1}$ denote the components of $\mathbf{M}^{-1}(z_1, z_2; t, t_r)$ with the time arguments suppressed for brevity, and $\mathbf{J}$ is the Jacobian matrix given by

$$\mathbf{J}(z_1, z_2) = \begin{bmatrix} \partial M_1^{-1} / \partial z_1 & \partial M_1^{-1} / \partial z_2 \\ \partial M_2^{-1} / \partial z_1 & \partial M_2^{-1} / \partial z_2 \end{bmatrix}. \qquad (5)$$

Defining $h_c$ as

$$h_c(x_1, x_2; z_1, z_2, t) = h_a\big(x_1 - M_1^{-1}(z_1, z_2),\, x_2 - M_2^{-1}(z_1, z_2)\big)\, \big|\det \mathbf{J}(z_1, z_2)\big| \qquad (6)$$

we obtain

$$g_c(x_1, x_2, t) = \iint h_c(x_1, x_2; z_1, z_2, t)\, f_c(z_1, z_2, t_r)\, dz_1\, dz_2 \qquad (7)$$

where the effects of motion modeling are incorporated into the blur function itself. Equation (7) is important. It states that (within the limits of the motion assumptions) any continuous blurred image at any time $t$ can be obtained by a linear shift-varying blur function acting on a single image.

Next, the system must be discretized for practical implementation. The continuous image $f_c(x_1, x_2, t_r)$ is assumed to have been generated from the HR discrete image $f(m_1, m_2, k)$, where $(m_1, m_2)$ denote the indices of the normalized HR sampling lattice. We will drop $k$ in $f(m_1, m_2, k)$ with the understanding that the HR image $f(m_1, m_2)$ is sampled at the reference time $t_r$. Assuming that $f$ is interpolated through the reconstruction impulse response $h_r(x_1, x_2)$ and that the spatial sampling interval has been normalized to unity, $f_c(x_1, x_2, t_r)$ can be expressed as

$$f_c(x_1, x_2, t_r) = \sum_{m_1} \sum_{m_2} f(m_1, m_2)\, \big[\delta(x_1 - m_1,\, x_2 - m_2) \ast h_r(x_1, x_2)\big]. \qquad (8)$$

Substituting (8) into (7) yields

$$g_c(x_1, x_2, t) = \iint h_c(x_1, x_2; z_1, z_2, t) \sum_{m_1} \sum_{m_2} f(m_1, m_2)\, \big[\delta(z_1 - m_1,\, z_2 - m_2) \ast h_r(z_1, z_2)\big]\, dz_1\, dz_2. \qquad (9)$$

If we assume that the summations and integrals in (9) converge, we can interchange their order. This gives

$$g_c(x_1, x_2, t) = \sum_{m_1} \sum_{m_2} f(m_1, m_2) \iint h_c(x_1, x_2; z_1, z_2, t)\, \big[\delta(z_1 - m_1,\, z_2 - m_2) \ast h_r(z_1, z_2)\big]\, dz_1\, dz_2. \qquad (10)$$

Next, we can evaluate the convolution of the impulse functions with the reconstruction filter to get the simpler result

$$g_c(x_1, x_2, t) = \sum_{m_1} \sum_{m_2} f(m_1, m_2) \iint h_c(x_1, x_2; z_1, z_2, t)\, h_r(z_1 - m_1,\, z_2 - m_2)\, dz_1\, dz_2. \qquad (11)$$

Now, if we define $h$ as

$$h(x_1, x_2; m_1, m_2, t) = \iint h_c(x_1, x_2; z_1, z_2, t)\, h_r(z_1 - m_1,\, z_2 - m_2)\, dz_1\, dz_2 \qquad (12)$$

we end up with the next key result

$$g_c(x_1, x_2, t) = \sum_{m_1} \sum_{m_2} h(x_1, x_2; m_1, m_2, t)\, f(m_1, m_2). \qquad (13)$$

This states that the continuous video signal $g_c$ can be obtained by applying a continuous linear operator $h$ to a single discrete image $f$ (within the limits of the motion model).

Next in the block diagram, the sampling of $g_c$ on a lattice is performed to arrive at the low-resolution observed images. For the case of MPEG-1 video, this would simply represent progressively scanned video, whereas for MPEG-2, it might represent interlaced video sampling. The sampling is modeled by evaluating (13) at the low-resolution sampling lattice indices $(n_1, n_2, k)$ as

$$g(n_1, n_2, k) = g_c(x_1, x_2, t)\Big|_{(x_1, x_2, t) = \Lambda(n_1, n_2, k)} \qquad (14)$$

where $\Lambda(\cdot)$ maps the lattice indices to their spatio-temporal sampling locations.

If we denote the discretized blurring function as

$$h_d(n_1, n_2, k; m_1, m_2) = h(x_1, x_2; m_1, m_2, t)\Big|_{(x_1, x_2, t) = \Lambda(n_1, n_2, k)} \qquad (15)$$

we obtain the following equation:

$$g(n_1, n_2, k) = \sum_{m_1} \sum_{m_2} h_d(n_1, n_2, k; m_1, m_2)\, f(m_1, m_2) \qquad (16)$$

which shows how the low-resolution discrete images (observations) are related to a single image sampled on an HR sampling lattice at a reference time instant $t_r$ through the discrete shift-varying blur function $h_d$. The actual motion that occurs in the video sequence will greatly affect the form of the blur $h_d$. It is important to differentiate this actual motion from the less critical MPEG estimated motion that is used only for motion-compensated prediction.
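To make (16) concrete, here is a minimal sketch (our own illustration, not code from the paper) that synthesizes one low-resolution observation from an HR image. It assumes purely translational motion and a small uniform blur standing in for $h_d$; all function and variable names are ours.

    import numpy as np

    def forward_model(f_hr, shift, blur, scale=2):
        # Warp: an integer translation stands in for the motion mapping M.
        f_warp = np.roll(f_hr, shift, axis=(0, 1))
        # Blur: 2-D convolution with the (assumed) discretized kernel h_d.
        B = blur.shape[0]
        pad = B // 2
        f_pad = np.pad(f_warp, pad, mode='edge')
        g_hr = np.zeros_like(f_hr, dtype=float)
        for i in range(B):
            for j in range(B):
                g_hr += blur[i, j] * f_pad[i:i + f_hr.shape[0], j:j + f_hr.shape[1]]
        # Sample on the low-resolution lattice, as in (16).
        return g_hr[::scale, ::scale]

    # Example: a 64x64 HR image, 3x3 uniform blur, one shifted LR frame.
    f = np.random.rand(64, 64)
    g = forward_model(f, shift=(1, 0), blur=np.ones((3, 3)) / 9.0)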

We now consider the more MPEG-related portions of Fig. 3. The motion-compensated prediction $\hat{g}(n_1, n_2, k)$ is formed by motion compensating the appropriate previously decoded frame or frames using the MPEG motion vectors. The specific procedure varies, depending on whether the current frame at $k$ is an I, P, or B frame. The prediction is then subtracted from the blurred and sampled video to produce the residual signal. Although motion compensation is important for compression, it has very little effect on HR reconstruction since it can be viewed as the addition of a known offset. Thus, rather than treating I, P, and B frames separately, we treat all frame types equally and note that the motion-compensated offset is zero for an I frame. The motion-compensated residual frame is given by

$$r(n_1, n_2, k) = g(n_1, n_2, k) - \hat{g}(n_1, n_2, k) + v(n_1, n_2, k) \qquad (17)$$

where $v(n_1, n_2, k)$ is additive noise. The motion-compensated residual image is then transformed using a number of 8 × 8 block-based DCTs. This operation has a significant impact on the overall modeling. The DCT coefficients, denoted by $R(l_1, l_2, k)$, are given by

$$R(l_1, l_2, k) = G(l_1, l_2, k) - \hat{G}(l_1, l_2, k) + V(l_1, l_2, k) \qquad (18)$$

where $G$, $\hat{G}$, and $V$ denote the 8 × 8 block DCTs of $g$, $\hat{g}$, and $v$, respectively. The block DCTs are arranged into a single "image" whose size is the same as the original. This enables the same indices $(l_1, l_2, k)$ to be used both before and after the transform operations. Using (16), $G(l_1, l_2, k)$ can be written as

$$G(l_1, l_2, k) = \sum_{n_1 = \beta(l_1)}^{\beta(l_1) + 7}\; \sum_{n_2 = \beta(l_2)}^{\beta(l_2) + 7} C(n_1 \bmod 8,\, n_2 \bmod 8;\, l_1 \bmod 8,\, l_2 \bmod 8) \sum_{m_1} \sum_{m_2} h_d(n_1, n_2, k; m_1, m_2)\, f(m_1, m_2) \qquad (19)$$

where the function $\beta(\cdot)$ that appears in the limits of summation is defined by

$$\beta(l) = 8 \lfloor l / 8 \rfloor. \qquad (20)$$


Fig. 5. The shaded portion of the HR image gives rise to different DCT coefficients on the low-resolution MPEG-compressed frames through different blur functions $h_d$.

$C(\cdot)$ is the 8 × 8 DCT coefficient-generating function, given by

$$C(n_1, n_2; l_1, l_2) = \frac{c(l_1)\, c(l_2)}{4} \cos\frac{(2 n_1 + 1)\, l_1 \pi}{16} \cos\frac{(2 n_2 + 1)\, l_2 \pi}{16} \qquad (21)$$

and $c(l)$ is the normalization constant, with $c(0) = 1/\sqrt{2}$ and $c(l) = 1$ otherwise. Now, we define the blur operator to be

$$h_{\mathrm{DCT}}(l_1, l_2, k; m_1, m_2) = \sum_{n_1 = \beta(l_1)}^{\beta(l_1) + 7}\; \sum_{n_2 = \beta(l_2)}^{\beta(l_2) + 7} C(n_1 \bmod 8,\, n_2 \bmod 8;\, l_1 \bmod 8,\, l_2 \bmod 8)\, h_d(n_1, n_2, k; m_1, m_2) \qquad (22)$$

and we obtain

$$G(l_1, l_2, k) = \sum_{m_1} \sum_{m_2} h_{\mathrm{DCT}}(l_1, l_2, k; m_1, m_2)\, f(m_1, m_2). \qquad (23)$$

Equation (23) is the fundamental equation in our SR formulation. It relates the DCT coefficients associated with any low-resolution frames, regardless of their picture type, to a single HR frame, as depicted in Fig. 5. In the final step in the block diagram of Fig. 3, $R(l_1, l_2, k)$ undergoes quantization.
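To make (18) and (21)–(23) concrete, here is a small sketch (our own illustration, not the authors' code): block_dct8 applies the 8 × 8 block DCT of (18) using SciPy's orthonormal DCT-II (which may differ from MPEG's exact normalization), and h_dct_row assembles one row of the operator in (22) so that its inner product with $f$ gives one coefficient $G(l_1, l_2)$ via (23).

    import numpy as np
    from scipy.fft import dctn

    def block_dct8(image):
        # 8x8 block DCT of (18); coefficients are stored at the pixel
        # positions of their block, so (l1, l2) match the input indexing.
        out = np.empty_like(image, dtype=float)
        for b1 in range(0, image.shape[0], 8):
            for b2 in range(0, image.shape[1], 8):
                out[b1:b1 + 8, b2:b2 + 8] = dctn(
                    image[b1:b1 + 8, b2:b2 + 8], norm='ortho')
        return out

    def dct_factor(n, l):
        # 1-D factor of the coefficient-generating function C in (21).
        c = 1.0 / np.sqrt(2.0) if l == 0 else 1.0
        return (c / 2.0) * np.cos((2 * n + 1) * l * np.pi / 16.0)

    def h_dct_row(h_d_block, l1, l2):
        # h_d_block[n1, n2] holds h_d(n1, n2; m1, m2) over the 8x8 block
        # containing (l1, l2): an array of shape (8, 8, M1, M2).
        row = np.zeros(h_d_block.shape[2:])
        for n1 in range(8):
            for n2 in range(8):
                row += (dct_factor(n1, l1 % 8) * dct_factor(n2, l2 % 8)
                        * h_d_block[n1, n2])
        return row  # G(l1, l2) = sum over (m1, m2) of row * f, per (23)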

We can use (23) to calculate an approximation to $G(l_1, l_2, k)$, given an estimate of $f(m_1, m_2)$, provided that the motion field and optical blur of the imaging system are given. This reblurred image can then be compared with information from the low-resolution frames. Although we do not know the exact values of the DCT coefficients $R(l_1, l_2, k)$ since they have been quantized, we do know the intervals within which they lie. This suggests the use of set-theoretic methods in the inversion process to estimate $f(m_1, m_2)$. We will, therefore, employ a set-theoretic formulation (see [21] and [22]) and use a projections onto convex sets (POCS) algorithm to solve for $f(m_1, m_2)$.

For this approach to work, however, we need an accurate estimate of the motion field. The motion vectors that are available as part of the MPEG bit stream are inadequate for this purpose for several reasons: they are not available for intracoded blocks and frames, and they may not accurately capture the true motion of the scene. To determine the precise form for $h_d$, and thus $h_{\mathrm{DCT}}$, requires that motion estimation be explicitly performed as part of the SR process. Furthermore, the function $h_{\mathrm{DCT}}$ will only be valid where the motion can be accurately modeled. We thus avoid creating constraint sets for image blocks where occlusion and uncovered background exist. Omitting such blocks is not difficult.

In order to apply POCS, we must first define a number of convex constraint sets. The objective is to define a set at every point $(l_1, l_2, k)$ where both DCT data is available and (23) is valid. As depicted in Fig. 3, quantization is applied to the signal $R(l_1, l_2, k)$. Thus, using the MPEG quantization bounds that can be derived from the MPEG bitstream and ignoring $V(l_1, l_2, k)$ for the moment, we define the family of convex sets

$$\mathcal{C}_{l_1, l_2, k} = \Big\{ f(m_1, m_2) : q_l \le \sum_{m_1} \sum_{m_2} h_{\mathrm{DCT}}(l_1, l_2, k; m_1, m_2)\, f(m_1, m_2) - \hat{G}(l_1, l_2, k) \le q_u \Big\} \qquad (24)$$

where $q_l$ and $q_u$ are the lower and upper bounds, respectively, of the quantization level. If we choose to take the noise into account, then the interval in (24) can be trivially extended by a fraction of the standard deviation $\sigma_v$ of the noise process. The modified convex set is

$$\mathcal{C}'_{l_1, l_2, k} = \Big\{ f(m_1, m_2) : q_l - \delta \sigma_v \le \sum_{m_1} \sum_{m_2} h_{\mathrm{DCT}}(l_1, l_2, k; m_1, m_2)\, f(m_1, m_2) - \hat{G}(l_1, l_2, k) \le q_u + \delta \sigma_v \Big\} \qquad (25)$$

where $\sigma_v$ is the standard deviation of the noise and $\delta$ is a constant. However, in our implementation we neglect the noise term based on the assumption that quantization noise is the dominant noise.
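For illustration, under a simplified uniform-quantizer model (the actual MPEG intra/inter quantizers add per-coefficient weighting and a dead zone), the interval $[q_l, q_u]$ of (24) for a decoded level could be obtained as follows; the helper name is ours:

    def quant_bounds(level, step):
        # Interval containing the unquantized DCT residual for an assumed
        # mid-tread uniform quantizer with step size `step`.
        r = level * step
        return r - step / 2.0, r + step / 2.0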

The case of the two-pixel image depicted in Fig. 6 will help to visualize the projection onto $\mathcal{C}_{l_1, l_2, k}$. In that figure, every HR image appears as a single point; the distance of the point along each axis denotes the intensity of the corresponding pixel. The shaded region shows the set $\mathcal{C}_{l_1, l_2, k}$. Those HR images that lie within this region are the only ones that are consistent with the constraint. The two hyperplanes that determine this region are given by the equations $\sum h_{\mathrm{DCT}}\, f - \hat{G} = q_l$ and $\sum h_{\mathrm{DCT}}\, f - \hat{G} = q_u$. They share the same normal vector, which is the normalized blur function $h_{\mathrm{DCT}}$. Because we have only a two-pixel image, this trivial example can only have a blur with a support of two pixels. It is easy to visualize how the space would grow if we were to extend it to a more realistic size. If the current HR image estimate $\hat{f}^{(i)}(m_1, m_2)$ is the point shown in the figure, then the projection is accomplished by adding to it a scaled version of $h_{\mathrm{DCT}}$ such that the result is a point on the closer of the two hyperplanes defining $\mathcal{C}_{l_1, l_2, k}$.



Fig. 6. In the POCS method, an initial HR image is projected onto the convex sets sequentially.


The projection $P_{l_1, l_2, k}$ that takes the current HR image estimate $\hat{f}^{(i)}(m_1, m_2)$ and projects it onto $\mathcal{C}_{l_1, l_2, k}$ can thus be written, following [23], in terms of the reblurred DCT residual

$$\tilde{R} = \sum_{m_1} \sum_{m_2} h_{\mathrm{DCT}}(m_1, m_2)\, \hat{f}^{(i)}(m_1, m_2) - \hat{G} \qquad (26)$$

and the squared norm of the blur operator

$$\|h_{\mathrm{DCT}}\|^2 = \sum_{m_1} \sum_{m_2} h_{\mathrm{DCT}}^2(m_1, m_2) \qquad (27)$$

as

$$\hat{f}^{(i+1)}(m_1, m_2) = \hat{f}^{(i)}(m_1, m_2) + \begin{cases} \dfrac{q_l - \tilde{R}}{\|h_{\mathrm{DCT}}\|^2}\, h_{\mathrm{DCT}}(m_1, m_2), & \tilde{R} < q_l \\[6pt] \dfrac{q_u - \tilde{R}}{\|h_{\mathrm{DCT}}\|^2}\, h_{\mathrm{DCT}}(m_1, m_2), & \tilde{R} > q_u \\[6pt] 0, & \text{elsewhere.} \end{cases} \qquad (28)$$

The dependencies on $(l_1, l_2, k)$ have been omitted to avoid clutter.

Equation (28) is the main projection required to generate the HR image since it simultaneously computes the inverse DCT, inverts the blur model under the interpolation constraint, and uses the aliasing present in the LR images. Note that the bounds are precisely set by the DCT quantization used during compression.
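A sketch of the projection (26)–(28) for a single constraint set, with the blur row held as a dense array over the HR support for clarity (our own rendering of the formula, not the authors' implementation):

    import numpy as np

    def project_onto_set(f, h, g_hat, q_l, q_u):
        # (26): reblurred DCT residual of the current estimate.
        r = h.ravel() @ f.ravel() - g_hat
        # (27): squared norm of the blur row.
        norm2 = h.ravel() @ h.ravel()
        if r < q_l:
            return f + (q_l - r) / norm2 * h   # move to the lower hyperplane
        if r > q_u:
            return f + (q_u - r) / norm2 * h   # move to the upper hyperplane
        return f                               # already feasible: identity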

In addition to (24), other constraint sets can be added as well, such as the well-known clipping constraint that forces the resulting image amplitude values to lie in a prescribed interval. The POCS algorithm proceeds by performing projections for all DCT coefficients where the motion estimation and modeling are determined to be valid. Then, any additional projections onto the clipping (or other) set(s) are performed. This constitutes a single iteration of the algorithm. Iterations proceed until some suitable stopping criterion has been satisfied.
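One iteration of the overall algorithm then sweeps the valid data-consistency sets and finishes with the clipping projection; schematically (assuming the projection helper sketched above):

    import numpy as np

    def pocs_iteration(f, constraints, lo=0.0, hi=255.0):
        # Each entry is (h, g_hat, q_l, q_u) for one DCT coefficient whose
        # motion model was judged valid.
        for h, g_hat, q_l, q_u in constraints:
            f = project_onto_set(f, h, g_hat, q_l, q_u)
        # Projection onto the amplitude (clipping) constraint set.
        return np.clip(f, lo, hi)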

Four comments are in order before proceeding to the experimental results.

1) Dealing with Macroblock Types and Frame Types: (23) is valid regardless of the macroblock mode, frame type, or video format. In its derivation, we only assumed a hybrid MC-DCT video codec, where the quantization interval for any DCT coefficient is known. Motion compensation is taken into account through an offset. In the case of "intra" mode, the motion-compensation offset is zero. The macroblock mode or video format (e.g., progressive or interlaced video) does not matter, since we are only interested in the quantization interval for a given DCT coefficient. Of course, macroblock modes are taken into account indirectly since they are used in determining the quantization bounds.

2) Coping with Inaccurate Motion Estimates: Any SR algorithm, whether it be implemented in the spatial or compressed domain, requires accurate motion estimates. However, for a typical video scene, we will almost surely have inaccurate motion estimates at some frames or regions because of the ill-posed nature of motion estimation. We need to deal effectively with these model failure (MF) regions for successful application of the proposed SR algorithm to general video sequences. We should first note that it is trivial to choose reliable DCT values and use only those values. If we are uncertain about the modeling efficacy of a given DCT value, we simply do not define a set at that location and, therefore, make no use of it. Thus, our approach is to use a binary model-failure mask (MFM). The MFM defines whether the motion estimate is reliable at each pixel location (or DCT coefficient). The MFM can be obtained simply by thresholding the motion-compensated frame difference (MCFD), as sketched after this list. If the MCFD lies below a threshold, it is likely that the motion estimate is correct. However, there could be cases where the MCFD is small but the motion estimate is inaccurate. Luckily, these motion estimates will not cause an artifact in the reconstruction process as long as the MCFD is small, since the projected intensities match. The second observation is that the MPEG compression motion vectors are only used to generate the prediction $\hat{g}(n_1, n_2, k)$. As previously mentioned, more precise vectors must be estimated to model the blurring process. This estimation needs to be part of the restoration algorithm.

3) Complexity Analysis and Real-Time Implementations: There are both real-time and off-line applications of the proposed compressed-domain resolution-enhancement algorithm. The execution time requirements, which depend on the application, vary considerably. However, it is still useful to estimate the computational complexity of the algorithm in order to verify the feasibility of a real-time hardware implementation. The computational complexity of SR reconstruction is approximately $N_1 N_2\, B^2\, K\, I$ multiply–add operations, where the image size, blur size, number of low-resolution frames, and number of iterations are $N_1 \times N_2$, $B \times B$, $K$, and $I$, respectively.



TABLE I. HBM PARAMETERS

Fig. 7. A portion of a $20 bill is scanned, and a sequence of 16 low-resolution images is generated through a process of simulated motion, blurring, and down-sampling the original scan. The sequence was then MPEG-compressed. Restoration by bilinear interpolation.

If we use VHS-quality video as the input signal and assume typical parameter settings, the total computational requirement is approximately 275 million operations per SR reconstruction (a worked example follows this list). These initial calculations indicate that a real-time implementation is possible on a high-end DSP processor. However, to further reduce the computation cost, we can reduce the number of projections. We observed that the projections of the higher AC coefficients (in MPEG) often correspond to an identity projection. This is because the quantization of these coefficients is so coarse that the initial HR estimate already lies within the bounds and thus does not need to be projected onto the constraint set. The projections can be limited to the lowest frequency coefficients (about one-quarter of the total number) of each block without a significant quality degradation in the reconstructed image. This means that the theoretical complexity numbers can be reduced by a factor of four.

4) Performance Limits: The performance of SR depends on many factors, including optical blur, the amount of motion, the compression ratio, and the number of frames used in the reconstruction process. Theoretically, the resolution-enhancement ratio is equal to the number of frames in the reconstruction process, provided that there exists sub-pixel movement between the frames and no aggressive optical blurring takes place. As the variance of the blur function becomes larger, we expect to achieve less improvement in resolution. The dependence on the compression ratio is likely to be more complicated. The determination of performance limits is beyond the scope of this paper and will be investigated in a forthcoming research paper.
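The mask construction described in comment 2 reduces to a threshold test on the MCFD; a minimal sketch (the threshold value and names are our assumptions):

    import numpy as np

    def model_failure_mask(frame, mc_prediction, threshold=8.0):
        # True where the motion-compensated frame difference is small
        # enough for the motion estimate to be trusted; DCT constraint
        # sets are defined only over blocks that pass the test.
        mcfd = np.abs(frame.astype(float) - mc_prediction.astype(float))
        return mcfd < threshold

As an illustrative check of the operation count quoted in comment 3 (the specific parameter values below are our assumptions, not figures from the paper), one consistent setting is

$$N_1 N_2\, B^2\, K\, I = (352 \times 240) \times 5^2 \times 16 \times 8 \approx 2.7 \times 10^8$$

multiply–add operations, in line with the 275-million estimate; restricting the projections to the lowest quarter of the coefficients would cut this to roughly $7 \times 10^7$.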

IV. RESULTS

A. Motion Estimation

The complexity of the modeling described in Section III for computing the blur $h_d$ is determined by the motion model. In the simplest case, the motion from the low-resolution images to the reference can be modeled as a spatially uniform translation. In practice, however, we have found this model to be inadequate. As a result, we use hierarchical block matching (HBM) [24]. The HBM parameters that are used are documented in Table I. In HBM, the motion is assumed to be locally translational. When warping effects are small, this approximation can be quite effective. The performance of the proposed POCS-based SR algorithm will ultimately be limited by the effectiveness of the motion estimation and modeling. Note also that the input video is decompressed, and the low-resolution decompressed images are bilinearly interpolated for the purpose of motion estimation.
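A compact sketch of hierarchical block matching in the spirit of [24] (block size, search radius, and level count here are illustrative defaults of ours, not the Table I settings):

    import numpy as np

    def block_match(ref, cur, block=16, radius=4, init=None):
        # Exhaustive matching of `cur` blocks against `ref` within a search
        # radius, optionally refined around an initial vector field.
        H, W = cur.shape
        mv = np.zeros((H // block, W // block, 2), dtype=int)
        for bi in range(H // block):
            for bj in range(W // block):
                y, x = bi * block, bj * block
                cb = cur[y:y + block, x:x + block]
                dy0, dx0 = init[bi, bj] if init is not None else (0, 0)
                best, best_v = np.inf, (dy0, dx0)
                for dy in range(dy0 - radius, dy0 + radius + 1):
                    for dx in range(dx0 - radius, dx0 + radius + 1):
                        yy, xx = y + dy, x + dx
                        if 0 <= yy <= H - block and 0 <= xx <= W - block:
                            sad = np.abs(cb - ref[yy:yy + block, xx:xx + block]).sum()
                            if sad < best:
                                best, best_v = sad, (dy, dx)
                mv[bi, bj] = best_v
        return mv

    def hierarchical_block_match(ref, cur, levels=3):
        # Match on a decimated pyramid first, then double and refine the
        # vectors at each finer level. Assumes dimensions divisible by
        # block * 2**(levels - 1).
        pyr = [(ref.astype(float), cur.astype(float))]
        for _ in range(levels - 1):
            r, c = pyr[-1]
            pyr.append((r[::2, ::2], c[::2, ::2]))
        mv = None
        for r, c in reversed(pyr):
            if mv is not None:
                mv = 2 * np.repeat(np.repeat(mv, 2, axis=0), 2, axis=1)
            mv = block_match(r, c, init=mv)
        return mv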


Fig. 8. Restoration by spatial-domain SR for the same video sequence used in Fig. 7.

Fig. 9. Restoration by DCT-domain SR for the same video sequence used in Fig. 7.

B. Experiments

The proposed POCS SR algorithm has been very successfully applied to several real MPEG sequences compressed at various bit rates. Two sets of results are provided to demonstrate its effectiveness.

In the first example, a portion of a $20 bill was scanned, and a sequence of 16 low-resolution images was generated from it through a process of adding motion, blurring, and downsampling the original scan. We then MPEG-1 compressed the sequence at 1.5 Mbits/s. Figs. 7–9 depict the results of bilinear interpolation, spatial SR, and DCT-domain SR, respectively. Both the bilinear interpolation and spatial-domain SR results contain visible aliasing artifacts, especially within the oval grid. The reader may choose to look at a real $20 bill to see the fine details in the grid. The proposed algorithm produces a result which is essentially aliasing free.

In the second experiment, the video was initially taken with an analog camcorder (Hi-8), digitized at 320 × 240 pixels, then compressed using a standard MPEG-1 coder at 0.5 Mbits/s. Shown in Fig. 10 are the results of creating an SR image using eight MPEG-domain LR images. The MPEG images are of size 320 × 240 pixels and the SR image is of size 640 × 480 pixels.

Fig. 11 shows a portion of those images that has been cropped and zoomed. For comparison, results are shown for a space-domain SR algorithm and the proposed algorithm. It is clear from the figure that the transform-domain SR formulation does a superior job of removing coding artifacts and increasing resolution.


Fig. 10. A video sequence is captured with a Hi-8 camera, digitized, and MPEG-compressed. The results of the (a) spatial and (b) transform-domain SR reconstruction with four low-resolution frames are depicted.


Fig. 11. Portions of the images in Fig. 10 are zoomed. (a) Spatial and (b) DCT-domain SR.


V. CONCLUSION

We propose a novel multiframe video resolution-enhancement method that makes use of the quantization information embedded in MPEG video sequences. The principles underlying the development apply equally to other linear transforms. Applications that may benefit from the proposed research are also discussed.

REFERENCES

[1] A. J. Patti, M. I. Sezan, and A. M. Tekalp, "Robust methods for high-quality stills from interlaced video in the presence of dominant motion," IEEE Trans. Circuits Syst. Video Technol., vol. 7, pp. 328–342, Apr. 1997.

[2] A. J. Patti, M. I. Sezan, and A. M. Tekalp, "Superresolution video reconstruction with arbitrary sampling lattices and nonzero aperture time," IEEE Trans. Image Processing, vol. 6, pp. 1064–1076, Aug. 1997.

[3] M. Elad and A. Feuer, "Restoration of a single superresolution image from several blurred, noisy and undersampled measured images," IEEE Trans. Image Processing, vol. 6, pp. 1646–1658, Dec. 1997.

[4] M. Elad and A. Feuer, "Super-resolution reconstruction of image sequences," IEEE Trans. Pattern Anal. Machine Intell., vol. 21, pp. 817–834, Sept. 1999.

[5] R. R. Schultz, "Extraction of high-resolution video stills from MPEG image sequences," Proc. IEEE Int. Conf. Image Processing, vol. 2, pp. 465–469, Oct. 1998.

[6] R. R. Schultz and R. L. Stevenson, "Improved definition video frame enhancement," in Proc. 1995 Int. Conf. Acoustics, Speech, and Signal Processing, vol. 4, May 1995, pp. 2169–2172.

[7] B. R. Hunt, "Imagery super-resolution: Emerging prospects," in Proc. SPIE Applications of Digital Image Processing XIV, vol. 1567, San Diego, CA, July 1991, pp. 600–608.

[8] B. R. Hunt, "Super-resolution of images: Algorithms, principles, performance," Int. J. Imaging Syst. Technol., vol. 6, no. 4, pp. 297–304, 1995.

[9] B. Tom and A. Katsaggelos, "Reconstruction of a high resolution image by simultaneous registration, restoration and interpolation of low resolution images," Proc. Int. Conf. Image Processing, vol. 2, pp. 539–542, 1995.

[10] B. Tom and A. Katsaggelos, "Reconstruction of a high resolution image from multiple degraded mis-registered low resolution images," in Proc. SPIE Visual Communication and Image Processing, vol. 2308, 1994, pp. 971–981.

[11] R. Y. Tsai and T. S. Huang, "Multiframe image restoration and registration," in Advances in Computer Vision and Image Processing, T. S. Huang, Ed. Greenwich, CT: JAI Press, 1984.

[12] M. Irani and S. Peleg, "Motion analysis for image enhancement: Resolution, occlusion, and transparency," J. Visual Commun. Image Represent., vol. 4, pp. 324–335, Dec. 1993.

[13] S. Mann and R. Picard, "Virtual bellows: Constructing high-quality images from video," Proc. IEEE Int. Conf. Image Processing, Nov. 1994.

[14] A. Tanaka, H. Imai, M. Miyakoshi, and T. Da-Te, "Enlargement of digital images with multiresolution analysis," Trans. Inst. Electron., Inf., Commun. Eng. D-II, vol. J79D-II, no. 5, pp. 819–825, May 1996.

[15] L. Xiao-Zheng, M. Kudo, J. Toyama, and M. Shimbo, "Knowledge-based enhancement of low spatial resolution images," IEICE Trans. Inf. Syst., vol. E81-D, no. 5, pp. 457–463, May 1998.

[16] A. Zakhor, "Iterative procedures for reduction of blocking effects in transform image coding," IEEE Trans. Circuits Syst. Video Technol., vol. 2, pp. 91–95, Mar. 1992.

[17] R. L. Stevenson, B. E. Schmitz, and E. J. Delp, "Discontinuity preserving regularization of inverse visual problems," IEEE Trans. Syst., Man, Cybern., vol. 24, pp. 455–469, 1994.

[18] P. E. Eren, M. I. Sezan, and A. M. Tekalp, "Robust, object-based high-resolution image reconstruction from low-resolution video," IEEE Trans. Image Processing, vol. 6, pp. 1446–1451, Oct. 1997.

[19] H. Paek, R. C. Kim, and S. U. Lee, "On the POCS-based postprocessing technique to reduce the blocking artifacts in transform coded images," IEEE Trans. Circuits Syst. Video Technol., vol. 8, pp. 358–367, June 1998.

[20] S. J. Reeves and S. L. Eddins, "Comments on iterative procedures for reduction of blocking effects in transform image coding," IEEE Trans. Circuits Syst. Video Technol., vol. 3, pp. 439–440, Dec. 1993.

[21] P. L. Combettes, "Convex set theoretic image recovery by extrapolated iterations of parallel subgradient projections," IEEE Trans. Image Processing, vol. 6, pp. 493–506, Apr. 1997.

[22] M. I. Sezan and H. J. Trussell, "Prototype image constraints for set-theoretic image restoration," IEEE Trans. Signal Processing, vol. 39, pp. 2275–2285, 1991.

[23] H. J. Trussell and M. R. Civanlar, "Feasible solution in signal restoration," IEEE Trans. Acoust., Speech, Signal Processing, vol. 32, pp. 201–212, 1984.

[24] M. Bierling, "Displacement estimation by hierarchical block matching," in Proc. SPIE Visual Communications and Image Processing '88, 1988, pp. 942–951.

Yucel Altunbasak received the B.S. degree from Bilkent University, Ankara, Turkey, in 1992 with highest honors. He received the M.S. and Ph.D. degrees from the University of Rochester, Rochester, NY, in 1993 and 1996, respectively. His Ph.D. research involved object-scalable mesh-based video representation and coding with emphasis on interactive multimedia applications.

He joined Hewlett-Packard Research Laboratories (HPL), Palo Alto, CA, in July 1996. His position at HPL provided him with the opportunity to work on a diverse set of research topics, such as video processing, coding and communications, streaming and networking, multimedia, and inverse problems in signal processing such as demosaicing and super-resolution. He also taught digital video and signal processing courses at Stanford University, Stanford, CA, and San Jose State University, San Jose, CA, as a Consulting Assistant Professor.

Dr. Altunbasak joined the School of Electrical and Computer Engineering, Georgia Institute of Technology, Atlanta, in 1999 as an Assistant Professor. He is currently working on industrial- and government-sponsored projects related to media communications, networked video, and interactive video. His research interests include video and multimedia signal processing, scalable video coding, inverse problems in imaging, and network distribution of compressed multimedia content. His work has appeared in more than 50 publications.

Dr. Altunbasak is an Area/Associate Editor for Signal Processing: Image Communication and for the Journal of Circuits, Systems and Signal Processing. He has also served as a Session Chair in technical conferences, as a Panel Reviewer for government funding agencies, and currently as a Technical Reviewer for various journals and conferences in the field of signal processing and communications. He received the National Science Foundation (NSF) CAREER Award in 2002. He is a member of the IEEE Signal Processing Society's IMDSP Technical Committee.

Andrew J. Patti was born in Utica, NY. He received the B.S.E.E. degree from Clarkson University, Potsdam, NY, in 1987, and the M.S.E.E. and Ph.D. degrees from the University of Rochester, Rochester, NY, in 1991 and 1995, respectively.

From 1987 to 1990, he was an Engineer at the Westinghouse Defense Electronics Center, Baltimore, MD, designing hardware for electrooptical systems. He is currently with Liberate Technologies, Redwood Shores, CA. His research interests are in the area of digital video processing, including motion estimation, standards conversion, resolution enhancement, and restoration and segmentation.

Russell M. Mersereau received the B.S. and M.S. degrees in 1969 and the D.Sc. degree in 1973 from the Massachusetts Institute of Technology, Cambridge, MA.

He joined the School of Electrical and Computer Engineering, Georgia Institute of Technology, Atlanta, in 1975. His current research interests are in the development of algorithms for the enhancement, modeling, and coding of computerized images, synthetic aperture radar, and computer vision. In the past, this research has been directed to problems of reconstructing distorted signals from partial information of those signals, computer image processing and coding, the effect of image coders on human perception of images, and applications of digital signal processing methods in speech processing, digital communications, and pattern recognition. He is the co-author of the text Multidimensional Digital Signal Processing (Englewood Cliffs, NJ: Prentice-Hall, 1984).

Dr. Mersereau has served on the Editorial Board of the Proceedings of the IEEE and as Associate Editor for the IEEE Transactions on Acoustics, Speech, and Signal Processing and for the IEEE Signal Processing Letters. He is currently the Vice President for Awards and Membership of the Signal Processing Society. He was a co-recipient of the 1976 Browder J. Thompson Memorial Prize of the IEEE for the best technical paper by an author under the age of 30, and a recipient of the 1977 Research Unit Award of the Southeastern Section of the ASEE and three teaching awards. He was awarded the Society Award of the Signal Processing Society in 1990.