
Optimizing Visible Objects Embedding Towards Realtime Interactive Internet TV

Bin Yu, Klara Nahrstedt
Computer Science Department
University of Illinois at Urbana-Champaign
{binyu, klara}@cs.uiuc.edu

ABSTRACT
Embedding new visible objects such as video or images into MPEG video has many applications in newscasting, pay-per-view, Interactive TV and other distributed video applications. Because the embedded foreground content interferes with the original motion compensation process of the background stream, we need to decode macroblocks in I and P frames via motion compensation and re-encode all macroblocks with broken reference links via motion re-estimation. Although previous work has explored DCT-compressed-domain algorithms and provided a heuristic approach for motion re-estimation, the computation-intensive motion compensation step remains largely unoptimized and still prevents efficient realtime embedding. In this work, we optimize previous work to enable realtime embedding that can be applied to Interactive Internet TV applications. We study the motion compensation process and show that, on average, up to 90% of the macroblocks decoded are never used. To exploit this phenomenon, we propose to buffer a GOP (Group-Of-Pictures) of frames and apply a backtracking process that identifies the minimum set of macroblocks that need to go through the decoding operation. At the price of a one-GOP delay, this approach greatly speeds up the whole embedding process and enables on-line software embedding capable even of processing HDTV streams. Further optimizations are discussed and a real-world application scenario is presented. Experimental results confirm that this approach is much more efficient than previous solutions and yields equally good video quality.

1. INTRODUCTION
The MPEG2 standard [1] has become the most widely used format for video transmission and storage in various video applications, such as Standard Definition TV and High Definition Digital TV (HDTV, [2]) broadcast, Internet Video-On-Demand, interactive TV and video conferencing. At the same time, broadband networking technologies such as cable modems [3] and xDSL [4] are bringing our homes and office buildings Internet connections of 10Mbps or higher at acceptable and falling prices, and technologies of even higher capacity, such as Fast Ethernet [5] and Gigabit Ethernet [6], can carry multiple high quality video streams simultaneously. Meanwhile, high-end personal computers are becoming more powerful, with processors over 1GHz, large memories and large storage devices.

With all these exciting technologies, we can expect much better video distribution applications offering higher visual quality, greater interactivity, more flexibility and finer service customization. Aiming at this goal, two originally separate approaches, PC plus Internet and TV plus Set-Top-Box, are rapidly converging to provide a common set of services, including pay-per-view, watching multiple channels via Picture-in-Picture (PiP), web browsing, email access, and stock/weather/sports/news updates through tickers on the screen. However, for lack of efficient software solutions capable of compositing high-volume video streams on-line, most current TV programs are edited with special hardware at the TV station or in the Set-Top-Box. The major drawback of this end-to-end approach is that it cannot easily scale up to multiple video sources, so the huge amount of content and the joint computing power of the Internet cannot be fully utilized. Also, since all the video processing is done in the closed world from the TV station to the Set-Top-Box through predefined operation modules, third-party service providers are prevented from offering more flexible and customizable services to the users.

Against this background, we have started a project to provide an open, flexible and scalable software solution for composing multiple MPEG video streams on-line. Our particular vision is that almost all the value-adding services mentioned above can be characterized as one problem – the visible objects embedding (VOE) problem. VOE refers to any operation that embeds visible objects (video, images or even text) into an original video stream, including content merging applications such as PiP, logo insertion and captioning, and control interface presentation such as displaying menus and buttons on the TV screen for the user to interact with. Our major task therefore maps to constructing a realtime VOE algorithm for digital video streams. Note that overlaying can easily be done at the end host via the pixel-domain window overlapping of a windowing system or in Set-Top-Boxes, but we argue that a pure MPEG-in-MPEG-out scenario with the TV as an output device has its own advantages in terms of scalability, interactivity and customization. To verify our approach, we present and concentrate on the Interactive Internet TV application using MPEG streams. We currently choose MPEG2 because it is the standard for digital TV broadcast and is widely used in most video distribution applications today.

In this application, we merge visible objects into streams in the MPEG compressed domain. Compressed-domain video composition algorithms have been studied before [7, 8, 9] and achieve 10% to 30% speed-ups compared to the spatial-domain approach. However, we discover that they


Figure 1: The Video Objects Embedding Process (original MPEG-compressed stream → header parsing & inverse VLC → MC domain data → motion compensation → RD domain data → foreground overlapping with additional visible objects → composite MC domain data → fixing wrong references → header packing & VLC → result MPEG-compressed stream)

still suffer from the computation-intensive macroblock compensation operation, and so are relatively slow and inefficient. Consequently, they are generally considered only for off-line editing such as watermarking, rather than for on-line processing as in interactive video applications. Therefore, we have to greatly optimize previous work to enable realtime VOE processing for TV streams. We will discuss in detail how we reduce this workload to the minimum amount necessary, leading to an average 10-fold speed-up of the motion compensation process, along with other optimization choices that reduce computational complexity at the price of reduced visual quality.

This paper is organized as follows. First, in section 2, we describe the wrong reference problem we must solve for compressed-domain VOE, and previous solutions. In section 3 we present a more efficient and flexible approach for realtime VOE processing. Section 4 is dedicated to an example scenario that demonstrates how our approach can greatly improve distributed video applications. Analysis and experimental results are presented in section 5, and section 6 concludes the paper and outlines future work.

2. BACKGROUND AND PREVIOUS WORK

2.1 Visible Object Embedding in MPEG Compressed Domain
Figure 1 shows the primary operations in the VOE process. Our goal is to embed visible objects into an MPEG stream directly in the compressed domain. To achieve this, we take the original MPEG compressed frame and parse through the macroblock headers to reach the so-called MC domain. The MC domain refers to macroblock headers with information such as motion vectors and prediction errors in quantized DCT format. After inverse VLC, the VOE process executes motion compensation to reach the RD domain. The RD domain is the "Reconstructed DCT Domain", which holds the DCT-formatted values of each image block. At this point we overlap the additional visible objects, in DCT representation, over the original frame in the foreground area. However, the merging causes a major problem because of the temporal dependencies between MPEG frames: it gives some of the macroblocks wrong references. For the VOE process to work correctly, this problem must be solved by redoing motion estimation on RD domain data to obtain new MC domain data. In this paper, we show how to perform the foreground overlapping and correct wrong references while executing VOE.
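To make the dataflow concrete, the per-frame pipeline can be sketched as follows. This is a schematic rendering of Figure 1, not the authors' implementation: every helper is a stand-in (the identity function), so the fragment runs as written.

```python
# Stand-ins for the real MPEG parsing / motion compensation / re-encoding
# stages; each is the identity so this sketch is executable end to end.
parse_and_ivlc = motion_compensate = pack_and_vlc = lambda x: x
overlap_foreground = fix_wrong_references = lambda x, *rest: x

def voe_frame(bits, fg_objects, alpha):
    mc = parse_and_ivlc(bits)                       # bitstream -> MC domain
    rd = motion_compensate(mc)                      # MC domain -> RD domain
    rd = overlap_foreground(rd, fg_objects, alpha)  # merge in the DCT domain
    mc_new = fix_wrong_references(rd, mc)           # motion re-estimation
    return pack_and_vlc(mc_new)                     # MC domain -> bitstream
```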

2.2 Object Overlapping and the Wrong Reference Problem

The visible information overlapping process itself is very simple. The result pixels are normally a linear combination of the foreground and background pixels, i.e.,

Pnew(i, j) = α·Pf(i, j) + (1 − α)·Pb(i, j)    (1)

where Pnew, Pf, Pb and α are the new pixels, foreground pixels, background pixels and the transparency factor, respectively. Since the operation is linear, it can also be done in the DCT domain as follows:

DCT(P̄new) = α·DCT(P̄f) + (1 − α)·DCT(P̄b)    (2)

where P̄ represents the block-wise data in DCT format. For opaque overlapping, α is 1 and the overlapped background image is discarded. Note that we assume the foreground object to be overlaid is available in the DCT domain, which is true for many video and image formats, such as MPEG, H.263, MJPEG and JPEG. For text data or other images such as bitmaps, we first need to apply the DCT to convert them into a "VOE-friendly" format; that is, we need to provide motion modes and DCT-formatted image data, which will be discussed in detail after the main algorithm is presented.
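Because the DCT is linear, equation 2 translates directly into code operating on coefficient blocks. A minimal numpy sketch (the 8x8 arrays stand in for the DCT(P̄) quantities in the text):

```python
import numpy as np

def blend_dct_blocks(dct_fg, dct_bg, alpha):
    """Equation (2): alpha-blend foreground and background DCT blocks."""
    return alpha * dct_fg + (1.0 - alpha) * dct_bg

# Opaque overlapping (alpha = 1) discards the background block entirely:
fg, bg = np.ones((8, 8)), np.zeros((8, 8))
assert np.array_equal(blend_dct_blocks(fg, bg, 1.0), fg)
```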

Figure 2: The Wrong Reference Problem (prediction chain MB0 → MB1 → MB2 across frames I → P1 → P2; FG = foreground area, BG = background frame; MB3, MB4 are further macroblocks in the chain)

After executing equation 2, we receive changed macroblocks in the foreground area. At this point the Wrong Reference Problem arises, because some of the changed macroblocks are used as prediction references for future macroblocks. More specifically, consider the chain of I → P1 → P2 frames in Figure 2. The foreground area is the smaller rectangle (FG) and the background frame area is the large rectangle (BG); the small squares represent macroblocks (MB). Assume MB2 in frame P2 uses the data of MB1 in frame P1 as a content predictor. Due to the VOE, the data in MB1 is changed by the content of the overlapping objects. If MB2 uses this new data for its prediction and adds the original prediction error, the final result is obviously wrong. To make matters worse, this wrongly decoded macroblock may itself be used as a reference for macroblocks in later frames, causing the error to propagate down the motion prediction link until the next I frame appears. To fix MB2's MC data, we need to know MB2's RD domain data, and therefore MB1's. Also note that MB1 itself may rely on the data of another macroblock (MB0 in Figure 2) for prediction reference. Therefore, to know MB1's value, MB0 will also


need to be decoded to the RD domain. In the worst case, for a GOP pattern of IBBPBBPBBPBBPBB, the data of one macroblock in the first I frame may be referenced 4 times to derive the content of macroblocks in all 4 following P frames, and of many other macroblocks in B frames. With a maximum search distance of 16 macroblocks used by the encoder when searching for the optimal prediction block, the prediction link may stride 4×16 = 64 macroblocks across the frame. Therefore, all the macroblocks in the I and P frames are decoded to the RD domain in case future macroblocks need them. In the following discussion, we call macroblocks like MB0 or MB1 "d-MBs", since their Data needs to be decoded to the RD domain for future reference by other macroblocks. Similarly, we call macroblocks like MB2 "c-MBs", as their reference blocks are wrong and so their MC data has been Changed.

2.3 Previous Solutions
Many MPEG compressed-domain algorithms have been developed, among which [7], [8] and [9] provide a good starting point for our work.

In [7], Noguchi, Messerschmitt and Chang thoroughly defined how to do motion compensation and motion estimation in the DCT domain. The major difficulty for DCT-domain algorithms comes from the fact that reference blocks are often not aligned with the regular 8x8 image block boundaries; their values therefore have to be derived in the DCT domain from 2 or 4 neighboring image blocks. Moreover, when the prediction is based on an interlaced reference, 2 such blocks need to be reconstructed to form a 16x8 block, which is then vertically sub-sampled to obtain the final reference block. For the motion estimation of c-MBs, the standard block searching approach used by MPEG encoders is not suitable because of the massive computation involved. Instead, the authors proposed a much cheaper heuristic way of obtaining the motion vectors by "inference": they examine only the 2 most likely candidate reference macroblocks located at the edge of the foreground window, which greatly simplifies motion re-estimation without causing much perceptible image quality degradation.

Although the approach of [7] saves some computation compared to the brute-force spatial-domain approach (10% to 30% speed-up according to [7]), it is still a costly process overall, and the motion compensation part that produces the RD domain data becomes the most costly bottleneck. Even with this speedup, realtime processing is still not readily feasible. For example, for an HDTV stream with a resolution of 1920x1088, the decoding step alone is not feasible for realtime processing on general-purpose single-processor PCs.

Based on [7], Meng and Chang proposed a faster solution [8] in the sub-domain of visible watermark embedding. The situation there is different in that, instead of merging the watermark with the original MPEG frame according to equation 2, the watermark block is simply added to the original background data, giving the new value E′ij:

E′ij = Eij + Wij    (3)

E′ij = Eij − Wreference + Wij    (4)

For intra-coded macroblocks, equation 3 is used to add the watermark value Wij to the image value Eij of the ij-th block. For inter-coded macroblocks, equation 4 is used, because we first need to subtract the watermark Wreference that was added to the reference macroblock from the original prediction error Eij. This way, the motion compensation work for reference macroblocks is eliminated: we only need to know how they have changed, and the prediction errors can be adjusted based on changes calculated from the embedded watermark alone.
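The two adjustments can be written down in a few lines. A hedged sketch of equations 3 and 4 (E and W are quantized DCT coefficient blocks, e.g. numpy arrays; the function names are ours, not from [8]):

```python
def embed_intra(E, W):
    # Equation (3): intra-coded macroblock, add the watermark directly.
    return E + W

def embed_inter(E, W, W_reference):
    # Equation (4): inter-coded macroblock. The reference was already
    # watermarked and the decoder's prediction will re-introduce
    # W_reference, so subtract it; only W then shows up in the output.
    return E - W_reference + W
```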

After this work, [9] pointed out that the approach of [8] may not work if the embedded watermark or captions are so strong that equation 3 makes Eij overflow the [16, 235] bound on a pixel's luminance value. In that case the result is normalized, but then the changes made to the prediction error also depend on the original reference macroblock's data. Therefore, to truly prevent overflow, the maximum luminance value of the reference macroblock is needed again, and the savings of [8] no longer apply. The authors of [9] avoid this problem heuristically by using the average luminance of the background block, which is the DC coefficient in the DCT domain and can be easily acquired, to estimate the caption threshold. The potential problem is that when the maximum pixel value differs greatly from the average, overflow will still occur, but the authors claim that the resulting image is of acceptably high quality.

Though the algorithm of [9] works fine for text embedding, it cannot be applied to more general image/text overlapping cases, where the foreground image must be shown clearly and stably while the background image is weighted much less. Formally, if we keep the original motion vectors, then the difference between the original and new prediction errors (DCT(P̄diff)) depends on both the original reference macroblock value (DCT(P̄b)) and its new value (DCT(P̄new)) after VOE, as in equation 5:

DCT(P̄diff) = DCT(P̄new) − DCT(P̄b)
            = α·DCT(P̄f) + (1 − α)·DCT(P̄b) − DCT(P̄b)
            = α·(DCT(P̄f) − DCT(P̄b)).    (5)

Therefore, to know the changes made to the background block, we still need to decode the background block in the first place. An extreme case is opaque overlapping (α = 1): the resulting block after overlapping is simply the foreground block, and the prediction error needs to be adjusted by the difference between the foreground block (DCT(P̄f)) and the background block (DCT(P̄b)).

In summary, for general video/image overlapping, we must obtain the original values of both the reference macroblocks and the dependent macroblocks in the RD domain via motion compensation, and so cannot avoid it as [8, 9] do. We therefore have to face the computation-intensive motion compensation operation, and find a new solution that minimizes the amount of computation involved.

3. OUR SOLUTION

3.1 The Backtracking Process
After the thorough discussion above, it is clear that to greatly accelerate the VOE process, we must attack the expensive motion compensation work done for all macroblocks in I and P frames. The key observation is that the previous work does much more than the minimum


Figure 3: A typical inverse motion prediction link discovered by backtracking (a c-MB near the foreground window FG fans out to d-MBs spreading farther into the background frame BG)

decoding necessary for rendering correct results. Originally, the reconstructed RD domain data serves two goals: getting the correct values of the d-MBs and c-MBs, and calculating new MC domain data through motion estimation. Since [7] simplified the motion estimation part by examining only the macroblocks of the background frame surrounding the foreground area, the second goal is satisfied as long as we always decode these edge macroblocks – what we call the "gold edge". For the first goal, when there are only a few c-MBs, the corresponding number of d-MBs will also be small, and the total number of macroblocks that must be reconstructed is much smaller than the number of all macroblocks in I and P frames. Our experimental results also confirm that only macroblocks surrounding the foreground window are heavily affected by the VOE process, and most other macroblocks are not used at all. The only reason the solution of [7] makes the effort to completely reconstruct all reference frames is that the future motion prediction pattern is not known in advance. For example, in Figure 2, when we decode frame P1, we do not know that only a few d-MBs like MB1 will be used as references for future macroblocks like MB2 in frame P2.

The insight behind our approach is that, in most distributed video applications, various delays already exist, such as queueing delay inside the network, buffering/synchronization delay at the receiver side, and delays due to jitter control and traffic policing. Therefore, in many cases the user may accept the VOE service even if it introduces a slightly larger delay, or a larger response time for interactive applications, as long as the extra content is embedded promptly. In other words, we treat delay as another QoS (Quality of Service) dimension that can be flexibly controlled, and try to balance it against other QoS dimensions such as computational complexity and timeliness according to the user's preference.

Under this assumption, we can simulate "predicting the future" by "buffering the past". That is, we decode each frame to the MC domain and buffer the motion vectors and quantized DCT coefficients. After we have gone through the whole GOP, all the c-MBs can be identified by testing whether a macroblock's motion vectors point somewhere inside the foreground area. Similarly, d-MBs can be identified when some future d-MB or c-MB uses them as a prediction reference. We can therefore define the "backtracking" operation as follows:

From the last B frame to the first I frame in a GOP, for each macroblock MB_current in the frame, if it uses reference macroblock(s) MB_reference in the foreground area, then mark MB_current as a c-MB and MB_reference as a d-MB. Also, for every d-MB, mark all the macroblocks it refers to as d-MBs. In addition, all macroblocks on the gold edge are marked as d-MBs by default.

In this way, all the "active" prediction links relevant to motion compensation are found. Typically these links take the form "c-MB → d-MB → ... → d-MB", such as the link "MB2 → MB1 → MB0" in Figure 2; Figure 3 shows what a typical inverse prediction link discovered this way looks like. Note that at each step the number of macroblocks may double or quadruple, since the prediction does not follow regular 8x8 block boundaries. Once the c-MBs and d-MBs are marked, we resume the decoding and overlaying process from the I frame at the beginning of the GOP. Since by this time we already have the MC domain data, the VOE processing continues as follows: only for c-MBs and d-MBs do we perform motion compensation to obtain their RD domain data, and only for c-MBs do we perform motion estimation to obtain their new motion modes and prediction errors. For all other macroblocks we do nothing, and their MC domain data is used directly in the later re-encoding phase. We also want to point out that this reference tracking is needed within the motion compensation process of any solution, so it incurs no extra cost; we merely batch the tracking for a GOP of frames until the end of the GOP instead of doing it separately for each frame. A great saving can be expected, since only c-MBs and d-MBs are decoded via motion compensation, and they are typically far fewer than the total number of macroblocks in I and P frames. This VOE solution is especially useful for timely delivery of multiple multimedia streams from different sources to a single TV receiver. In addition, the idea of identifying the minimum set of macroblocks for motion compensation through backtracking can also be applied to many other video manipulation operations, such as video clipping and video juxtaposition.
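The marking rule can be stated compactly in code. Below is a minimal, self-contained Python sketch of the backtracking pass under a simplified data model of our own, in which each macroblock is reduced to the list of (frame, macroblock) index pairs it predicts from; a real MPEG reference block straddles up to 4 macroblocks, which this list representation absorbs. It illustrates the algorithm, not the actual implementation.

```python
def backtrack(gop, in_foreground, gold_edge):
    """Mark c-MBs and d-MBs for one buffered GOP.

    gop           -- frames in coding order; gop[f][m] is the list of
                     reference pairs (f', m') of macroblock m in frame f,
                     empty for intra-coded macroblocks
    in_foreground -- predicate over (f, m): is the macroblock inside the
                     foreground area?
    gold_edge     -- set of (f, m) pairs on the edge surrounding the
                     foreground window, always decoded for re-estimation
    """
    c_mbs, d_mbs = set(), set(gold_edge)
    for f in range(len(gop) - 1, -1, -1):      # last B frame -> I frame
        for m, refs in enumerate(gop[f]):
            cur = (f, m)
            fg_refs = [r for r in refs if in_foreground(r)]
            if fg_refs:                  # reference content was overwritten
                c_mbs.add(cur)
                d_mbs.update(fg_refs)
            if cur in d_mbs:             # extend the inverse prediction link
                d_mbs.update(refs)
    return c_mbs, d_mbs
```

Since references always point to frames earlier in coding order, every d-MB mark on a frame is already in place by the time the backward sweep reaches that frame.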

3.2 Optimizations
Since the VOE process is quite complicated and involves many operations, there are many places where we can make trade-offs and optimize towards a certain QoS metric. The basic idea is that computational complexity can be decreased further through heuristics, at the price of a slight degradation in the resulting visual quality. The following are 3 examples.

3.2.1 Bi-directional to Uni-directional Prediction
For a c-MB that is bidirectionally predicted, an interesting phenomenon is that very likely only one of its prediction links points into the foreground area and the other does not. The physical meaning can be explained as follows. Assume a jumping ball in a video moving directly towards the foreground area; Figure 4 shows the positions of the 3 macroblocks that describe this scenario in three neighboring P-B-P frames. When encoding a macroblock of the middle B frame, the encoder may choose to predict it bidirectionally, using the average of the two macroblocks (P_MB_1 and P_MB_2) in the 2 P frames. Since P_MB_2

Page 5: Optimizing Visible Objects Embedding Towards Realtime ... · Optimizing Visible Objects Embedding Towards Realtime Interactive Internet TV Bin Yu, Klara Nahrstedt Computer Science

Figure 4: Bi-directional prediction (B_MB is predicted from P_MB_1 and P_MB_2 as the ball moves towards the foreground area FG in the background frame BG)

is located inside the foreground area, B_MB will be marked as a c-MB and the 2 P-frame macroblocks will be marked as d-MBs. Since B_MB is not very different from the 2 P macroblocks, we can exploit this phenomenon to reduce processing time by discarding the broken prediction. That is, we change the macroblock in the B frame to be uni-directionally predicted using only the macroblock that is not in the foreground area. The same prediction error can be kept, since all 3 macroblocks have essentially the same content; the only operations needed are to change the motion compensation mode from bi-directional to uni-directional and to delete one motion vector. In this example, we only need to change B_MB to be unidirectionally predicted from P_MB_1, and none of these 3 macroblocks is marked as a c-MB or d-MB at all. Although the resulting picture has slightly poorer quality, our experimental results show that the difference is not obvious to human eyes, and the saving in processing power justifies this cost.
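A sketch of the conversion, assuming a hypothetical macroblock record (the field names are ours, not MPEG syntax elements):

```python
from dataclasses import dataclass
from typing import Optional, Tuple

@dataclass
class MacroBlock:
    mode: str                              # "forward", "backward", "bi", ...
    fwd_mv: Optional[Tuple[int, int]] = None
    bwd_mv: Optional[Tuple[int, int]] = None

def demote_to_unidirectional(mb, fwd_in_fg, bwd_in_fg):
    """Drop the one prediction of a bidirectional MB that points into the
    foreground area; the prediction error is kept unchanged."""
    if mb.mode != "bi" or fwd_in_fg == bwd_in_fg:
        return False               # not applicable (neither or both broken)
    if fwd_in_fg:
        mb.mode, mb.fwd_mv = "backward", None
    else:
        mb.mode, mb.bwd_mv = "forward", None
    return True                    # no c-MB/d-MB marking is needed any more
```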

3.2.2 Sensitive Area
Sometimes the foreground window occupies only a small portion of the background frame, so many macroblocks of the background stream may never be affected by the VOE operation. We can therefore define a "sensitive area" for each frame for the backtracking process: a "d-sensitive-area" specifies the area that may contain d-MBs, and a "c-sensitive-area" the area that may contain c-MBs. The benefit of defining sensitive areas comes from saving the decoding work for macroblocks outside them. If a whole slice is "insensitive", we can copy the slice from the input stream directly to the output stream without even decoding it to the MC domain. If part of a slice lies inside the sensitive region, we still need to decode the slice to the MC domain, but macroblocks outside the sensitive area need not be checked for c-MBs or d-MBs during backtracking. For convenience, we define S to be the maximum search distance used by the encoder during motion estimation (normally 16 macroblocks, or 256 pixels) and define "Region-i" (i ≥ 0) as follows: Region-0 is the foreground window plus the "gold edge" – the one circle of macroblocks in the background frame surrounding the foreground window. Region-i is obtained by enlarging Region-(i−1) in all 4 directions by a step size of S; where the enlarged region exceeds the frame boundary, the surplus part is cut off. This is shown in Figure 5.

Figure 5: Sensitive regions: candidate macroblocks for backtracking (Region-0 = FG + gold edge; Region-1, Region-2, Region-3, ... each grow the previous region by S, cut off at the frame boundary)

For every inter-coded frame, the c-MBs can only refer to macroblocks inside Region-0 (the bidirectional case can be eliminated by transforming to unidirectional prediction), and a macroblock whose motion vector reaches Region-0 can be at most S away from it, so the c-sensitive-area of every frame is Region-1. The d-sensitive-area changes as the backtracking progresses: the further back we go, the wider the area of macroblocks the prediction links may touch. For the last P frame, the d-MBs are located inside the foreground area plus the gold edge (as needed for motion re-estimation), which is Region-0; its final sensitive area is therefore the superset of its c- and d-sensitive-areas, which is Region-1. All these macroblocks may use macroblocks in the second-to-last P frame as references, and those reference macroblocks (candidate d-MBs) are at most S away from the gold edge. Therefore, the d-sensitive-area of the second-to-last P frame is S larger than Region-1, which is Region-2. By similar deduction we obtain the final sensitive area of every frame, shown in Table 1. The numbers are Region indices ("-" means not applicable), and the final sensitive area of each frame is the superset of that frame's c- and d-sensitive-areas.

Frame   I  B  B  P  B  B  P  B  B  P  B  B  P  B  B
c       -  1  1  -  1  1  1  1  1  1  1  1  1  1  1
d       5  -  -  4  -  -  3  -  -  2  -  -  0  -  -
Final   5  1  1  4  1  1  3  1  1  2  1  1  1  1  1

Table 1: Sensitive regions for each frame

When the foreground window is located on the boundary of the background frame, which is the common case, many slices and macroblocks fall into the insensitive area and require little decoding/encoding work.
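The Region-i construction reduces to growing a rectangle, clipped at the frame boundary. A small illustrative helper in macroblock units (the rectangle representation is our assumption, not the paper's data structure):

```python
def region(i, fg, frame_w, frame_h, S=16):
    """Region-i as an inclusive (left, top, right, bottom) rectangle.

    fg is the foreground window in macroblock coordinates; Region-0 adds
    the one-macroblock gold edge, and each further step grows the previous
    region by S, the maximum motion search distance."""
    grow = 1 + i * S
    left, top, right, bottom = fg
    return (max(0, left - grow), max(0, top - grow),
            min(frame_w - 1, right + grow), min(frame_h - 1, bottom + grow))
```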

3.2.3 Shortening the Delay
The longer delay and/or response time caused by buffering a GOP of frames is the major cost of our approach. Since Delay = GOP_size / frame_rate, for a video clip at 30 frames per second, a GOP of 15 frames means a 0.5-second delay. As analyzed above, this delay buys knowledge about future reference patterns so that unnecessary decoding can be avoided. However, we can mitigate the delay in 2 ways. First, we can select a shorter GOP size at encoding time to facilitate video editing, or insert a transcoding proxy that shortens the GOP size; the delay then shrinks accordingly. For a GOP size of 6 frames at 30 frames per second, the delay is only 200ms, and a higher frame rate shortens it further. Of course, a shorter GOP or a higher frame rate means a larger bandwidth requirement for the same video quality, or reduced video quality at the same bandwidth. The choice of GOP size is therefore an engineering decision balancing video quality, bandwidth requirement and processing delay, and we propose to adapt it to different user preferences and

Page 6: Optimizing Visible Objects Embedding Towards Realtime ... · Optimizing Visible Objects Embedding Towards Realtime Interactive Internet TV Bin Yu, Klara Nahrstedt Computer Science

resource availability. Second, since the last P frames have relatively small sensitive areas, we can start the backtracking process earlier if necessary by assuming that all the macroblocks in the sensitive area are d-MBs. For example, after decoding the second-to-last P frame, if we assume all macroblocks in its d-sensitive-area are d-MBs, we can start the backtracking immediately instead of waiting for the remaining 4 B frames and one P frame to arrive. This 5/15 = 33% reduction in buffering delay comes at the price of more d-MBs and thus more motion compensation decoding: the earlier we start the backtracking, the more d-MBs we decode without using them. Pushed to the extreme, all the macroblocks in the I frame are assumed to be d-MBs, and we are back to the situation of [7]: no delay and no saving in decoding. This provides another way of balancing delay against processing time; the more processing power we have, the less delay we can achieve.
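The arithmetic of this trade-off is simple enough to spell out (illustrative only):

```python
def buffering_delay(gop_size, frame_rate, early_start=0):
    """Seconds of buffering delay when backtracking starts early_start
    frames before the end of the GOP (at the cost of extra d-MBs)."""
    return (gop_size - early_start) / frame_rate

print(buffering_delay(15, 30))      # 0.5 s: full 15-frame GOP at 30 fps
print(buffering_delay(6, 30))       # 0.2 s: shorter GOP
print(buffering_delay(15, 30, 5))   # ~0.33 s: start 5 frames early (33% less)
```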

3.3 Discussions
Though the general idea of our approach is straightforward, several details must be handled carefully for an efficient implementation. Currently we consider only opaque embedding, which applies to most situations and can be extended to semi-transparent embedding.

3.3.1 Streaming Mode
The first problem is how to perform VOE filtering with the backtracking delay in streaming mode. A naive way is to buffer a whole GOP, run the backtracking process to mark c-MBs and d-MBs, resume the decoding and VOE process, and re-encode the whole GOP before working on the next GOP. Obviously, the VOE gateway would never catch up with the incoming realtime stream. We solve this problem by "pipelining" the decoding and encoding operations along the sequence of frames. In the initialization phase, a GOP of frames is decoded to the MC domain and buffered, and the backtracking is done at this point. Starting from the second GOP, we decode one new frame of the next GOP to the MC domain and simultaneously perform the embedding and re-encoding of one old frame of the current GOP. This process can continue endlessly for the whole stream; at any specific time we are decoding one frame and encoding another. The only difference from normal filtering is that the frame being VOE-processed is always a GOP-length "older" than the frame being decoded.
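The pipeline can be sketched as a generator that emits one re-encoded frame per incoming frame once the initial GOP is buffered. decode_to_mc and embed_and_encode are identity stand-ins for the real stages, and the per-GOP backtracking would run each time a full GOP sits in the buffer:

```python
from collections import deque

def decode_to_mc(frame):       # stand-in for header parsing + inverse VLC
    return frame

def embed_and_encode(frame):   # stand-in for overlapping + re-encoding
    return frame

def voe_stream(frames_in, gop_size):
    buf = deque()
    for frame in frames_in:
        buf.append(decode_to_mc(frame))            # decode one new frame
        if len(buf) > gop_size:                    # a GOP-old frame exists:
            yield embed_and_encode(buf.popleft())  # embed and re-encode it
    while buf:                                     # drain the stream tail
        yield embed_and_encode(buf.popleft())
```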

3.3.2 New Content Preparation
We divide the preparation of the new content to be embedded into three cases.

I. The first case is video merging, where the new content is also an MPEG video stream with the same frame rate and GOP pattern but a smaller frame size. This foreground video may come from downscaling another TV program, or from lower-resolution video clips or digital camcorders.

In opaque overlaying, the foreground video macroblocks are not affected by the background stream and so can retain their motion modes. The 2 MPEG streams' GOP distributions need to be exactly matched – for example, I frames need to be embedded into I frames. This may require delaying one of the 2 streams if they are not synchronized in advance. Also, since they are treated the same as original background-stream macroblocks, they will be decoded by future players using the background stream's quantization table. As a result, if the 2 streams' quantization tables differ, we need to decode the foreground video to the MC domain and re-quantize it, and the results are then re-encoded like other macroblocks of the background stream. If the 2 streams happen to use the same quantization table, we can copy the foreground video's slices directly to the output stream. This "bit string copy" saves a lot of computation and simplifies the VOE implementation, since we do not need to handle decoding and motion compensation for the new stream.

II. The second case is image embedding, where the original image may be JPEG-formatted or simply raw pixels such as a bitmap. To embed this type of static content into an MPEG stream, we need to prepare the image in MC domain format. The image data is DCT-transformed and quantized if necessary, using the same quantization table as the background stream. We then have to make up macroblock headers for it, following a "show once, copy many" approach (see the sketch after this list): in the first I frame, the content is encoded as intra-coded macroblocks; in following frames, we put "dummy copy macroblocks" in the foreground area. A "dummy copy macroblock" uses forward prediction with a zero motion vector and zero prediction error. In our implementation, the "skipped macroblock" syntax is used to achieve this effect wherever feasible. This results in the embedded image being refreshed at GOP boundaries. Depending on how frequently we want the image updated, we can also encode the intra-coded macroblocks of the image content in P frames for a higher refresh rate, or even in B frames if this does not exceed the maximum number of intra-coded macroblocks allowed in B frames by MPEG.

III. For text embedding, we currently use the same method as for image embedding. Text is first converted to raw pixels and then DCT-transformed before embedding; the remaining work is the same as in the case above. For transparent text embedding, we plan to incorporate the results of [9].
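The header make-up of case II, the "show once, copy many" scheme, can be sketched as follows; the dict layout is purely illustrative, not an MPEG syntax encoder:

```python
def image_macroblocks(dct_blocks, frame_kind):
    if frame_kind == "I":
        # Show once: the image content as intra-coded macroblocks.
        return [{"mode": "intra", "coeffs": b} for b in dct_blocks]
    # Copy many: forward prediction, zero motion vector, zero prediction
    # error, so the decoder repeats the co-located macroblock (the
    # "skipped macroblock" syntax achieves the same effect more cheaply).
    return [{"mode": "forward", "mv": (0, 0), "error": 0}
            for _ in dct_blocks]
```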

4. APPLICATION SCENARIO: INTERACTIVE TV

We have derived an efficient realtime VOE algorithm; the next question is what benefits it brings to distributed video applications. To justify our approach, we present a new Interactive TV scenario enabled by our VOE algorithm and compare it with the conventional TV rendering approach.

Figure 6 illustrates a typical scenario for conventional TV program broadcasting. The TV program is transmitted to the end host via satellite or cable broadcast, and then decoded by a Set-Top-Box or an equivalent internal decoding device of a TV set. Meanwhile, the Internet connection brings in useful information from data servers on the world wide web, such as email or http servers. Typically, a viewer interacts with the Set-Top-Box via buttons on a


Figure 6: A typical architecture for TV program delivery (satellite server → satellite dish, TV station → cable network → Set-Top-Box → YUV signals → TV set; Internet → http/email servers; user with a remote controller)

remote controller. To provide PiP functions or present command menus, all data streams are decoded to image pixels and then composed by the Set-Top-Box. Similar effects can be achieved by the windowing system on a PC with the same kind of pixel-domain overlay operation, if the output is a PC monitor. Note that this requires all overlay operations to be done after the TV stream is decoded, so all information processing is pushed to a single "hot spot", such as the Set-Top-Box in this example. A major problem is that this centralized scheme does not scale well as the number of incoming streams grows and the ways of presenting them together become more diversified, which is very likely as the Internet continues to grow exponentially. In that case, the hot spot will soon "catch fire" under the burden of maintaining connection states, synchronizing video/data streams, responding to user requests, processing content, etc. Currently this problem is solved by sacrificing flexibility and customization. For example, the viewer cannot access email while watching TV, and the content source and the position of the overlay window in PiP are limited to predefined settings. Also, as mentioned in the introduction, the service generally cannot be programmed in an open and customizable way. If the user wants the foreground window to be smaller or at a different position on the screen, or wants to show graphic data other than email or web pages from the Internet, there is no way to "enter" the Set-Top-Box and make this happen.

To fundamentally solve this problem, we propose the alternative setup shown in Figure 7. The TV programs are captured as MPEG2 streams, and after all the VOE processing, the resulting MPEG2 stream is decoded into TV format. New DV transmission standards such as IEEE 1394 [10] even allow the MPEG2 stream to be sent directly to 1394-enabled TV sets, saving the cost of an extra decoding device. Many more information sources appear in this picture, and several changes have been made to accommodate these information streams and distribute the computation cost. First, an "Internet Request/Reply" component cooperates with a network of application-level VOE processing gateways placed inside the Internet to collect and process data coming from various kinds of servers according to the user's requests. Even the TV station could utilize this Internet channel for more interactivity with TV viewers. This service model fits well into many existing architectures, such as Active Networks [11] and Content Service Networks [12], and the video composition can be done wherever it is most appropriate in terms of efficiency, cost and

convenience. Secondly, a "Content Preparation" component collects data from devices in the Home Network and transforms it into a more "VOE-friendly" format, that is, DCT domain data plus certain motion modes. Lastly, the viewer has many more ways to talk to the VOE system: mice and keyboards convey much more information than remote controllers, and highly customized, friendly GUIs can be supported via hand-held devices.

In general, our fast VOE algorithm brings the following benefits to many video applications:

• Computation Distribution
Since our algorithm can process streaming video in realtime, it decreases the resource requirements at application-level gateways and enables load balancing for efficient distributed video applications. For example, multiple video streams can be merged at the most appropriate distributed nodes on the path rather than at the client, with the resulting stream kept in standard MPEG2 format. This greatly reduces the state maintenance burden at intermediate gateways and end hosts in terms of multiple connections, inter-stream synchronization and buffer management. This scalability allows multiple servers on the Internet to cooperate in a "multi-sink" (reverse multicast) tree fashion. Multicast sessions also benefit, since the embedding can be done right before the multicast point to save end hosts' effort. Furthermore, our algorithm enables a cost-effective way to insert information such as logos and other location/time-relevant metadata for security or copyright purposes in an on-demand manner, since the new content becomes an integral part of the MPEG stream before the stream arrives at the client player. Also note that at the end hosts, the computation can be divided into parallel software components, which is suitable for utilizing future uniform and cheap computing power.

• On-line VOE Processing for High Quality Video
From the very beginning, we have been after a fast approach that can perform fairly complicated VOE operations in realtime on high quality MPEG streams. The speedup gained by "doing the minimum necessary computation" achieves goals not possible with previous approaches. For example, although single-processor PCs are not fast enough even to decode and play HDTV MPEG streams, we can embed visible objects into such a stream directly in the compressed domain


Figure 7: A new setup for distributed VOE services (TV station → capture device → MPEG2 → VOE processing → decoding device → YUV signal → TV set; VOE service gateways and video/image/data servers on the Internet feed an Internet Request/Reply component; a Content Preparation component gathers content from Home Network devices such as DVD/VCR players, digital cameras, telephones, refrigerators and storage devices; the user interacts via mouse, keyboard and hand-held devices)

in realtime, because the total amount of "real work" required is well affordable. As TV streams go to high definition, the cost-effectiveness of our approach becomes more and more obvious. Also, current video editing software relies on spatial-domain algorithms to render high quality results, which requires a tedious compilation pass before even a small change can be evaluated. With our algorithm, the user can first "preview" the overall editing effects in realtime, and use spatial-domain processing only for the final production.

• More Interactivity and Customization
With a fast VOE algorithm, the bottom line is that we can provide all the functionality current Set-Top-Boxes offer, such as presenting multiple programs via Picture-in-Picture, accessing email and the web on the TV screen, and embedding stock/weather tickers. Furthermore, thanks to the flexibility of the software approach, we can do much better, allowing customization of the number, position, resolution and content source of the small overlay windows in PiP, or even mimicking a "desktop" with buttons and icons on the TV screen for the user to interact with. More importantly, since our algorithm is based on the open MPEG2 standard and the VOE interface can be described in a simple and standardized way, it enables an open VOE service model: application service providers can offer standard VOE processing services, and third-party content providers only need to supply the content together with a standard description of how it should be presented. This programmability provides a much higher level of control and customization than the current closed approaches.

• Flexible Resource and QoS Control
The VOE process is generally complicated and computation-intensive, so there are many places where different resource requirements and QoS settings conflict, such as occupied bandwidth, delay, jitter, response time, processor cycles and memory space. Fortunately, fine-grained control of resource utilization over a wide parameter space is possible, and it is the user's requirements and preferences that ultimately decide in which direction the algorithm should be optimized.

5. EXPERIMENTAL RESULTS
We have implemented the first version of the realtime VOE filter, with functions such as video Picture-in-Picture and image embedding, as part of the Active Space project [13]. HDTV-quality MPEG2 streams are multicast through a Gigabit LAN, and client PCs receive the stream and output it to dedicated plasma displays. Our VOE service gateway intercepts the stream, provides the VOE service in realtime, and multicasts the resulting stream to clients. We analyze the performance of the proposed VOE approach in 3 respects:

• Realtime Processing
The VOE service gateway runs on a general-purpose PC with a single Pentium IV 1.4GHz processor and 512MB of memory. The filter is written in Visual C++ and runs on the Windows 2000 platform. As expected, we can perform many VOE functions in realtime, introducing an extra delay of at most 0.5 seconds. After initialization, processor utilization levels off at an average of 70% when no other applications are running on the gateway machine. For this experiment, we chose as the background stream an HD-quality (1920x1088, 30 frames per second) stream, "Trees1.mpg", and as the foreground stream a standard-resolution MPEG video, "football_sd.mpg" (480x256, 30 frames per second). Figure 8 is a screen shot of the resulting stream. Note that the foreground window can be located anywhere in the frame as long as it follows the 16x16 macroblock boundaries.

• Computational Complexity Reduction


Figure 8: An example screen shot for PiP

Figure 9: Distribution of c-MBs

Figure 10: Distribution of d-MBs

As we analyzed above, and as also pointed out by [7], the most expensive work in the [7] approach is the decoding of reference macroblocks from the MC domain to the RD domain. Our approach attacks exactly this problem, based on the observation that normally a great portion of these macroblocks are not used at all.

To demonstrate this, we only need to compare the number of macroblocks decoded to the RD domain by the 2 approaches. In the [7] approach, all the macroblocks in all I and P frames are counted, so for the "football in trees" example in Figure 8, the total number of macroblocks decoded to the RD domain over 500 frames is (1920 × 1088)/(16 × 16) × (6/15) × 500 = 1,632,000. For our approach, the total count depends on the nature of the background stream, since we only work on affected macroblocks. For 3

                    Football   Stars     Trees
Previous approach   1632000    1632000   1632000
Our approach        139455     55509     56613

Table 2: Comparison of the number of macroblocks that need motion compensation

different background streams, the total number of macroblocks reconstructed through motion compensation is given in Table 2. The table shows that with our approach the most expensive motion compensation operation needs to be performed on less than 10% of the macroblocks of the previous approach, and for background streams such as Stars and Trees this percentage is even smaller. To further explain the saving, we plotted the distribution of c-MBs and d-MBs over the whole background area, as shown in Figures 9 and 10. As expected, c-MBs appear only around the foreground area, and the number of d-MBs decreases dramatically farther away from the foreground area.

Figure 11: Effects of the optimization: “bi-directional to uni-directional prediction”

As discussed in section 3.2, the "bi-directional to uni-directional prediction" optimization eliminates one potential c-MB and at least one potential d-MB per converted macroblock, and so reduces the amount of computation needed. Figure 11 compares the number of c-MBs in each frame with and without this optimization in the "football-in-stars" scenario. I and P frames contain no bidirectionally predicted macroblocks at all, so the two curves meet at those frames. For B frames, the reduction averages more than 50%.


• Resulting Video Quality
Since in general we use the same DCT-domain

Figure 12: pSNR of background image for 3 video clips

approach as [7], except for the changes in reconstruction, we can expect the resulting image quality to follow the same analysis as [7]. The quality degradation comes primarily from DCT quantization, the simplified motion estimation and some of the optimizations discussed above. For the visual quality test, because the foreground video is not changed in opaque overlaying, we calculated the pSNR (peak signal-to-noise ratio, equation 6) values only for the background streams.

pSNR(X, Y) = 20·log10( 255 / sqrt( (1/MN)·Σ_{i=1..M} Σ_{j=1..N} (X_ij − Y_ij)^2 ) )    (6)

Specifically, we used a standard MPEG2 decoder to decode both the original background streams and the resulting streams after embedding to the pixel domain, and used their difference (excluding the foreground area) as the noise value. The results in Figure 12 show that the I frames are not affected at all, while the pSNR of the other frames varies across video clips: the more motion there is, the more noise results from the VOE process.

In addition, we conducted a thorough subjective test: over 100 viewers watched the resulting streams of our realtime VOE, and no perceptible quality degradation was observed.

6. CONCLUSION AND FUTURE WORK
Visible Object Embedding is a very useful video compositing operation that appears in many video applications, and a software realtime VOE algorithm in the MPEG compressed domain has great benefits that can be applied immediately to many scenarios in everyday life. In this paper, we proposed a new cost-effective and flexibly controllable approach for VOE in the MPEG compressed domain. Compared with previous work, our approach greatly reduces computational complexity by eliminating unnecessary motion compensation operations, at the price of a small amount of delay. Specifically, a backtracking operation determines the set of macroblocks whose image values are needed for future reference and the set whose prediction references are wrong. This way, only the minimum amount of work necessary for correct embedding results is done, and for the most expensive motion compensation part the savings can be over 90%. Another property of our approach is flexible control over the balance between different QoS metrics, such as delay, computational complexity and bandwidth; an

optimal service setting can easily be configured based on the user's preferences. We have implemented a first version of the realtime VOE filter that can embed video and images in realtime even into high definition MPEG streams, discussed many subtle problems we have encountered, and proposed several ways of optimizing the VOE process based on our first-hand experience.

7. REFERENCES
[1] ISO/IEC International Standard 13818. Generic coding of moving pictures and associated audio information. 1994.
[2] C. Bezzan. High definition TV: its history and perspective. SBT/IEEE International Telecommunications Symposium (ITS '90), Symposium Record, 1990.
[3] A. Dutta-Roy. An overview of cable modem technology and market perspectives. IEEE Communications Magazine, 39(6):81-88, June 2001.
[4] Emerging high-speed xDSL access services: architectures, issues, insights, and implications. IEEE Communications Magazine, 37(11):106-114, Nov. 1999.
[5] J. Spragins. Fast Ethernet: Dawn of a New Network [New Books and Multimedia]. IEEE Network, 10(2):4, March-April 1996.
[6] Gigabit Ethernet. Tutorial Guide, IEEE International Symposium on Circuits and Systems (ISCAS 2001), pages 9.4.1-9.4.16, 2001.
[7] Y. Noguchi, D.G. Messerschmitt, and S.-F. Chang. MPEG video compositing in the compressed domain. IEEE International Symposium on Circuits and Systems (ISCAS '96), Volume 2, 1996.
[8] J. Meng and S.-F. Chang. Embedding Visible Watermark in Compressed Video Stream. Proceedings, 1998 International Conference on Image Processing (ICIP '98), Chicago, Illinois, October 1998.
[9] J. Nang, O. Kwon, and S. Hong. Caption processing for MPEG video in MC-DCT compressed domain. Proceedings of the 8th ACM International Conference on Multimedia, October 2000.
[10] D. Thompson. IEEE 1394: changing the way we do multimedia communications. IEEE Multimedia, 7(2):94-100, April-June 2000.
[11] D. Tennenhouse, J. Smith, W. Sincoskie, D. Wetherall, and G. Minden. A Survey of Active Network Research. IEEE Communications Magazine, 35(1):80-86, January 1997.
[12] W.-Y. Ma, B. Shen, and J. Brassil. Content Services Network: the Architecture and Protocols. The 6th International Workshop on Web Caching and Content Distribution, pages 83-101, Boston, June 2000.
[13] GAIA: Active Spaces for Ubiquitous Computing. http://devius.cs.uiuc.edu/gaia/.