
Advanced Digital Video Processing

An Introduction to MPEG4

Dr. Anil Kokaram [email protected]

1 The Layer Model

The concept of an image being created by the superposition of object layers has been used by photograph editors for many years. Adobe Photoshop has been using this idea since 1989. In 1994, Wang and Adelson presented an article in which they gave serious credence to the idea that each frame in an image sequence can be constructed from layers of object planes. This is a much more realistic idea. The rubber sheet model for all objects is replaced by a series of sheets, one corresponding to each object in the sequence. To create the observed frame, these sheets are blended¹ with each other.

Assume that there are R objects, labelled r = 1, ..., R, in the sequence. For each object at frame n, the corresponding image is I_n^r. Each such object plane is associated with a mask α_n^r at each frame instant. This α map is either a binary map or a continuous valued map. The α map therefore indicates, in each frame, which objects (or portions of objects) are occluding which other objects (or portions of objects). The final observed image I_n at frame n is then created as follows.

I_n(x) = \sum_{r=1}^{R} I_n^r(x) \, \alpha_n^r(x)        (1)

There must be a constraint on α of course, and it is sensible to use \sum_r \alpha_n^r(x) = 1. With this constraint one cannot have two objects superimposed exactly upon each other. When α is continuous valued, the model can allow for transparent motion.

This model is extremely powerful and overcomes all the problems of allowing for discontinuities in motion. Once the α-maps are extracted along with the object planes from any given image sequence, there is no longer any need to quantify discontinuities further. The layer model is illustrated for a simple image/sequence in figure 1.
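To make equation (1) concrete, here is a minimal numpy sketch (the array names are mine, purely illustrative) that composites two object layers with binary α-maps into one observed frame:

```python
import numpy as np

def composite(layers, alphas):
    """Blend object layers into one frame: I_n(x) = sum_r I_n^r(x) * alpha_n^r(x)."""
    alphas = np.stack(alphas)
    # the constraint from the text: the alpha maps must sum to 1 at every pixel
    assert np.allclose(alphas.sum(axis=0), 1.0)
    return sum(I * a for I, a in zip(layers, alphas))

# Two-layer example: a flat background and a square foreground object.
H, W = 64, 64
background = np.full((H, W), 0.2)
foreground = np.full((H, W), 0.9)
alpha_fg = np.zeros((H, W))
alpha_fg[20:40, 20:40] = 1.0      # binary alpha: the object's support
alpha_bg = 1.0 - alpha_fg         # enforces sum_r alpha = 1
frame = composite([background, foreground], [alpha_bg, alpha_fg])
```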

Needless to say, estimation of these object maps, as well as the motion between the maps, from a real image sequence is an interesting problem.

¹ Using, for instance, the α-blend idea used in TV/cinema post-production houses for some time.


Figure 1: The layer model of an image sequence. Each frame is built as the superposition of two layers, each layer comprising a support map, an intensity map and a velocity map, combined via its α-map. The support and α maps are white for 1 and black for 0. The support map was not identified in the text as part of the model, but it is part of the strict definition of the object inter-layer interaction.


This is because α now introduces a new variable to be estimated at each image site; also, the object planes may never be completely observed during the image sequence. Very few reliable algorithms have been proposed that operate under a wide range of conditions. The topic of Video Segmentation is high on the current research agenda.

2 MPEG4

There are two principal differences between MPEG4 and MPEG2, from which most of the ethos of the new standard can be understood.

1. The video sequence is thought of as being created by a number of interacting visual objects, each represented at each frame time by a Video Object Plane. This comes directly from the work in video processing research during the last few years (since 1994).

2. There are more error resilience modes and more efficient options for error resilience than MPEG2.

The standard was approved by the ISO in 1998 and version 2 is currently completing development (March 2001). The visual standard is called ISO 14496-2. MPEG4 is optimised for bitrates varying from 10 Kbits/sec to 3 Mbits/sec. The ITU standard H.263 was one of the starting points for MPEG4. The following items constitute the whole MPEG4 standard.

• ISO 14496-1: MPEG-4 Systems

• ISO 14496-2: MPEG-4 Visual

• ISO 14496-3: MPEG-4 Audio

• ISO 14496-4: MPEG-4 Conformance

• ISO 14496-5: MPEG-4 Software

• ISO 14496-6: MPEG-4 DMIF

• ISO 14496-7: Optimised Visual Reference Software

• ISO 14496-8: Carriage of MPEG-4 contents over IP Networks

All reference software and documents are now available on the ISO website www.iso.ch.

This section discusses some of the aspects of MPEG4 that are appropriate for the modern industrial context. Two excellent references for learning more about MPEG4 are Multimedia Systems, Standards and Networks [1] and Compressed Video over Networks [2].

MPEG4 is very similar to the H.264 standard, which is the basis for video encoding on BluRay HD discs. MPEG4 and H.264 are also used as the de facto standards for mobile video transmission.


3 Basis

As discussed previously, the image sequence is better thought of as a series of interacting layers. Each layer independently represents an object moving through a sequence of images. MPEG4 formalises the description of image sequences using this concept. The goal is not only better compression but also to allow the user increased interactivity, even to the level of selective layer decoding within the multimedia stream. Thus the sequence is built from a number of Video Objects (VOs) that interact at each frame instant to build each observed video frame. The location and shape of each such object at each frame instant is represented as a Video Object Plane (VOP). Thus rather than coding each image (corresponding to the entire observed frame) as in MPEG2, each image is separated into a number of VOPs. It is the sequence of VOPs making up a Video Object (VO) that is coded in MPEG4. Figure 2 illustrates this idea with a sequence of 3 frames consisting of 3 video objects, each represented by 3 VOPs corresponding to the 3 frames in the sequence.

A sequence is therefore compressed by extracting the VOPs (automatically, semi-automatically, or manually), serialising the VOPs and then compressing groups of VOPs belonging to the same Video Object. The compression of each Group of VOPs (GOV) is performed similarly to that in MPEG2: a combination of transform compression (8 × 8 blocks) and DPCM is used. GOVs are compressed using I, P and B frames in a similar manner to MPEG2.

MPEG4 also allows for a special type of object layer called a Sprite Object. This is a background image (or clean plate as it is called in the post-production industry) that is transmitted to the decoder separately. The VOPs are then overlaid on warped versions of the sprite to create each image frame.

Since VOPs and thus GOVs contain valid data only within the object that is coded, there is a need to code the shape information, or α-plane data (as per the model introduced previously in these lectures). Therefore a major departure from MPEG2 is the allowance in MPEG4 for coding this shape information. Both shape and image data are treated in 8 × 8 blocks embedded within 16 × 16 macroblocks.

Tools for motion compensation are much improved in MPEG4, and the range of motion that is coded is also increased. Motion vectors can be specified for 8 × 8 blocks as well as the 16 × 16 size in MPEG4. The differential coding of the vector field is also improved, using a 2D prediction neighbourhood instead of the simple single vector predictor used for MPEG2.

MPEG4 also contains methods for compressing synthetic data such as face animations, mesh and texture objects and text-to-speech (synthetic audio). It is interesting to note that the texture objects can be coded using the DWT and the Embedded Zero Tree Wavelet (EZW) algorithm. We will not discuss these in any detail in this course. Aside from the improved error resilience in MPEG4, the standard also discusses methods for post processing to remove ringing and blocking artefacts.

As in MPEG2, the interoperability of codecs is quantified through the definition of MPEG4 Profiles and Levels. This is discussed next.


Figure 2: Defining VOPs. The video sequence contains three video objects, VO0, VO1 and VO2; the Video Object Planes of each object at the three frame instants are labelled VOP1, VOP2 and VOP3.


4 MPEG4 Profiles and Levels

MPEG4 conformance is organised using three different concepts. Profiles and Levels are as in MPEG2, but in addition the idea was introduced that Object Types could be selected by the codec. Object Types group together tools that give a certain functionality; Profiles delineate a set of Object Types and therefore define a subset of syntax and semantics. A Level defines a maximum value for the parameters of the data that can be manipulated, in much the same way as MPEG2 (but not exactly).

OBJECT TYPES in the standard define sets of algorithms to be used for compression. For the Simple, Core and Main OTs, each is a superset of the algorithms found in the OT below it. For the other OTs this is not quite the case, as shown in the following table.

Object Type               Algorithms

SCALABLE STILL TEXTURE    Scalable Still Texture (using DWT/EZW); 3 layers with SNR
                          and Spatial Scalable coding modes; 4:2:2 YUV picture format

SIMPLE SCALABLE           Supports all functionality of the SIMPLE OT plus B-VOP,
                          Temporal and Spatial Scalable coding modes; 4:0:0 YUV

MAIN                      Supports all functionality of the CORE OT plus Gray Shape,
                          Sprite Coding, Coding Interlaced Video; 4:2:0 YUV

CORE                      All functionality of the SIMPLE profile plus B-picture
                          prediction modes, P-VOP Temporal Scalability, Binary Shape

SIMPLE                    Nonscalable coding supporting coding of progressive video,
                          random access (GOV), I- and P-picture prediction modes,
                          4:2:0 YUV, 4MV, unrestricted motion, coefficient prediction

I have omitted the following Object Types from the above table: N-Bit, Basic Animated Texture, Animated 2D Mesh and Simple Face.

As in MPEG2, Profiles define subsets of algorithms for compression. In MPEG4, this is done by specifying groups of OTs within each Profile.


Profile            Object Types

HYBRID             All Object Types of the MAIN profile plus Basic Animated Texture,
                   Animated 2D Mesh and Simple Face

SIMPLE SCALABLE    Simple, Simple Scalable

MAIN               Main, Core, Simple, Scalable Still Texture

CORE               Core, Simple

SIMPLE             Simple

I have omitted the N-Bit, Scalable Texture, Basic Animated Texture and Simple Facial Animation profiles.

Within each Profile there are a number of options for the range of parameters that can be supported. The upper bound of each such range, e.g. picture size, frame rate, bit rate, is defined in the scope of a Level as follows.

Profile   Level   Upper Bound on Parameters

Simple    L1      QCIF, 64 Kbits/sec, maximum of 4 objects
          L2      CIF, 128 Kbits/sec, maximum of 4 objects
          L3      CIF, 384 Kbits/sec, maximum of 4 objects

Core      L1      QCIF, 384 Kbits/sec, maximum of 4 objects
          L2      CIF, 2 Mbits/sec, maximum of 16 objects

Main      L2      CIF, 2 Mbits/sec, maximum of 16 objects
          L3      ITU-R 601, 15 Mbits/sec, maximum of 32 objects
          L4      1920 × 1088, 38.4 Mbits/sec, maximum of 32 objects

5 MPEG4 Applications

MPEG4 incorporates many improvements to compression efficiency over MPEG2, but the standard has also been designed to increase the level of content based interactivity and accessibility to content. Many of the video applications suitable for MPEG4 rely on exploiting one or more of these aspects.

Compression Efficiency By allowing for better motion compensation, and differential prediction of both motion and DCT information, MPEG4 achieves better quality pictures at higher compression rates than MPEG1/2. To give an example, MPEG4 at about 512 Kbits/sec ≡ MPEG1 at 768 Kbits/sec.

Content Based Interactivity Since MPEG4 allows the compression of video as object layers, enabling content interactivity is one of the major new features introduced by the standard. Not only are objects represented separately, and efficiently, object based scalability is also allowed.


Universal Access By incorporating new tools that enable the creation of error-resilient bitstreams, MPEG4 content can be decoded across a wide variety of media. This includes, in particular, mobile content access over wireless channels.

Applications for MPEG4 include the following.

Mobile Mediaphones Mobile devices are typically constrained by low bandwidth, variable channel bit error rate and variable computing resources. By exploiting the scalability (hence QoS), error-resilience and compression tools in MPEG4, such devices can become realisable.

Streaming Internet Video This application can exploit the compression efficiencies and error-resilience tools in MPEG4. This is because of the low bandwidth to home PCs and the need for on-time communications when dealing with multimedia content.

Networked Video Games Gaming is a huge industry. It relies heavily on the rapid generation and manipulation of visual representations of 3D objects. By adding VOs to the games, they can be made more lifelike, and these applications can therefore exploit the mesh and 3D animation tools in MPEG4 for lifelike video overlays.

Virtual Studio Some studio footage is generally shot against green or blue backgrounds for manipulation later. By storing the studio objects independently of background material, much compression efficiency can be gained. In addition, re-distributors of the content can alter the background content as required.

Interactive Digital TV Picture-in-Picture, Captions, Logos, Customised Advertising and Multiwindow TV can all exploit the Video Object paradigm in MPEG4.

6 MPEG4 Natural Video Coding I

The basic principle of video compression in MPEG4 is the same as in MPEG2: to achieve compression, the video signal is transformed into motion compensated frame differences. This implies that, like MPEG2, MPEG4 uses I, P, and B frames. However, the important distinction in MPEG4 is that each image frame is further subdivided into object layers. The time snapshot of each object (equivalent to an MPEG2 frame) is called a Video Object Plane. This implies that the data for each VOP must include information about the shape of the object at the instant that the VOP is to be visualised. Coding this shape information is also a major new contribution within this standard. This section treats the underlying technology in coding Natural Video Objects. A familiarity with MPEG2 concepts is assumed.


Figure 3: The 2D prediction for DC coefficients. Within the macroblock, blocks B, C, D lie in the row above blocks A, X, Y; the possible prediction directions for the DC coefficient of the current block X are from A (left) or C (above).

6.1 I-VOP Compression

Each I-VOP is equivalent to the Intra-frame in MPEG2 except that an I-VOP delineates only one object within a frame. Thus the I-VOP contains a series of 16 × 16 macroblocks, each consisting of four 8 × 8 blocks. The improvements introduced here are as follows.

DC Coefficient Prediction Instead of using only the left hand block as the prediction for the DC coefficient of the current block, MPEG4 allows for vertical prediction as well. This implies that across vertical edges in the image, a better prediction for the current DC coefficient can be extracted.

This situation is illustrated in figure 3. Recall that at the decoder, the current DC coefficient is not known before reconstruction from its prediction. Thus the decision about which direction to use for prediction must be made on the basis of the already decoded blocks. The decision about whether the prediction of the DC coefficient for the current block X is taken from block C or block A is therefore based on analysing the DC gradient using blocks A, B, C only. Given the quantised values of the DC coefficients as DC_{A,B,C}/8, the prediction \hat{DC}_X for DC_X is formed as follows.


Figure 4: The 2D prediction for AC coefficients. Only the first row (from the block above) or the first column (from the block to the left) of AC coefficients of the current block X is predicted.

g_v^{DC} = |DC_A - DC_B|, \qquad g_h^{DC} = |DC_C - DC_B|

\hat{DC}_X = \begin{cases} DC_C & \text{if } g_h^{DC} > g_v^{DC} \\ DC_A & \text{otherwise} \end{cases}

This process is repeated for each block within the macroblock, using the relevant blocks as required. For instance, X, C, D will be used for predicting the DC coefficient of Y.
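A minimal sketch of this decision rule (function and variable names are mine, not from the standard):

```python
def dc_prediction(dc_a, dc_b, dc_c):
    """Choose the DC predictor for the current block X from its decoded
    neighbours: A (left), B (above-left), C (above)."""
    g_v = abs(dc_a - dc_b)   # vertical DC gradient
    g_h = abs(dc_c - dc_b)   # horizontal DC gradient
    if g_h > g_v:
        # a strong vertical edge is likely: predict from the block above
        return dc_c, "vertical"
    return dc_a, "horizontal"  # otherwise predict from the block to the left
```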

Prediction of AC Coefficients MPEG4 allows for prediction of a subset of AC coefficients. MPEG2 has no such prediction. The situation is illustrated in figure 4. Only prediction of the first row or first column of AC coefficients is allowed. This again exploits the local correlation structure in the image.

Because of the presence of local texture, this prediction may need to be turned off in some blocks. To avoid the increased overhead of switching on a block basis, this is only allowed at the macroblock level. The direction of the prediction is defined as the same direction used for predicting the DC coefficient. This means that the AC coefficients of the block to be coded can be used as part of the decision indicating whether or not to use AC coefficient prediction. Thus the only option available for AC coefficient prediction is to turn the prediction on or off over the whole macroblock.


Figure 5: The three scan patterns possible in MPEG4: alternate vertical, alternate horizontal and zig-zag, each ordering the coefficients from 0 to 63.

It only makes sense to code the AC coefficients predictively if the magnitude of the difference (residual) is less than the actual values of the coefficients. Thus the first value to be calculated is the magnitude of this 'energy' as follows.

E = \sum_{m=0}^{7} |T_X(m)| - \sum_{m=0}^{7} |T_X(m) - T_{A/C}(m)|        (2)

where T(m) is the mth AC coefficient along the first row (if C has been selected for DC prediction) or the first column (if A has been selected for DC prediction). AC prediction is set if \sum E ≥ 0 and is not set otherwise. \sum E is calculated as the sum of E above over each of the 4 blocks in a macroblock.
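A sketch of the per-macroblock decision of equation (2), assuming the relevant coefficients are already available as lists (the helper names are illustrative):

```python
def ac_prediction_gain(t_x, t_pred):
    """E for one block: the saving from coding the first row/column of AC
    coefficients differentially against the chosen neighbour block."""
    direct = sum(abs(c) for c in t_x)                        # cost proxy, no prediction
    residual = sum(abs(c - p) for c, p in zip(t_x, t_pred))  # cost proxy, with prediction
    return direct - residual

def use_ac_prediction(blocks):
    """blocks: four (t_x, t_pred) pairs, one per 8x8 block of the macroblock.
    AC prediction is switched on for the whole macroblock iff sum(E) >= 0."""
    return sum(ac_prediction_gain(tx, tp) for tx, tp in blocks) >= 0
```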

Alternate DCT Scans MPEG4 additionally allows alternate horizontal and vertical scans of each DCT block, in order to encourage more beneficial run length encoding of each block. The three scans are shown in figure 5. Ideally the choice of scan would be made by coding the block under consideration using each of the three possible scans, and then selecting the scan that created the most efficient run/level representation. However, to minimise this scan selection overhead, the DC prediction direction and the AC prediction flag are used instead.

If the AC prediction flag is off (no AC prediction is used for the macroblock), then the normal zig-zag scan is used. Otherwise a selection must be made between the horizontal and vertical scan options. This is done by observing the DC prediction direction. If block A was used for DC prediction, then the vertical DCT block scan is used (alt-vert scan); otherwise the horizontal scan is used (alt-hor scan). This selection logic is sketched below.
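The resulting selection logic is simple enough to state directly (a sketch; the flag and direction names are mine):

```python
def select_scan(ac_pred_flag, dc_direction):
    """Pick the DCT coefficient scan for a block from decisions already coded."""
    if not ac_pred_flag:
        return "zig-zag"               # no AC prediction: normal zig-zag scan
    if dc_direction == "horizontal":   # DC predicted from block A (left)
        return "alternate-vertical"
    return "alternate-horizontal"      # DC predicted from block C (above)
```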

Alternate VLC Table Selection MPEG4 allows different VLC Tables for the Intra and Inter-VOPs.


Figure 6: Creating the prediction for the motion vector MV. The median of the three vectors MV1, MV2, MV3 is used; the positions of MV1, MV2 and MV3 depend on which block within the macroblock is being predicted.

6.2 P/B-VOP Compression

Each P/B-VOP is equivalent to the P/B frame in MPEG2 except for the object specification inherent in VOPs. Thus the predicted VOPs contain a series of 16 × 16 macroblocks that are made up of four 8 × 8 blocks. The improvements introduced here are as follows.

8 × 8 Blocks for Motion Instead of allowing only one motion vector per macroblock, MPEG4 optionally allows each macroblock to be split into four 8 × 8 blocks for compensation with 4 motion vectors. This is the same as in H.263.

Prediction of Motion Vectors The differential prediction of MPEG2 is extended to allow a 2D prediction neighbourhood. This can be done on a block basis, with the prediction neighbourhood being 3 vectors whose positions vary depending on the block within the macroblock that is under consideration. This situation is illustrated in figure 6. The prediction is the median of the vectors in the neighbourhood. The median is calculated by using the horizontal and vertical components of each vector separately to generate the corresponding component of the prediction vector.

If only one motion vector is used for the macroblock then the prediction neighbourhood is that of the top left hand block within the macroblock.
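A minimal sketch of the componentwise median predictor (a hypothetical helper, not code from the standard):

```python
import statistics

def predict_mv(mv1, mv2, mv3):
    """Median prediction of a motion vector from its 3 neighbours.
    Each mv is an (x, y) tuple; the median is taken componentwise."""
    xs, ys = zip(mv1, mv2, mv3)  # gathers (x1, x2, x3) and (y1, y2, y3)
    return (statistics.median(xs), statistics.median(ys))

# Only the difference between the actual vector and this prediction is coded.
assert predict_mv((2, -1), (3, 0), (10, -1)) == (3, -1)
```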


Figure 7: Temporal prediction of vectors for B-VOPs, shown for frames 0, 1, 2, 3 coded as I B B P. The vector MV coded in the P frame is used for predicting the vector at the required location in the B frame: the forward prediction is MV/3, and the backward vector is −(2MV)/3 if the residual is 0, else MVF − MV.

Overlapped MC There is an overlapped motion compensation mode that will not be discussed here.

Temporal Motion Prediction MPEG4 allows B-frame motion vectors to be represented by prediction from the temporal motion vectors generated when compensating P-VOPs.

More about B-VOP coding

B-VOP coding deserves special mention. B-VOPs are coded in one of four modes. In the forward, backward and bidirectional modes, motion compensation is performed ONLY on 16 × 16 macroblocks. A single motion vector is used to compensate those blocks. These modes are identical to the modes in MPEG1/2. As the names imply, in the forward mode a single motion vector is used to retrieve data from the past I or P VOP; in the backward mode a single vector is used to retrieve data from the future I or P VOP; and in the bidirectional mode two vectors are used to combine motion compensated data from the past and future I/P VOPs.

The direct mode is the only one in which 8 × 8 blocks are allowed. It uses the idea that little acceleration can reasonably be expected between consecutive I and/or P frames. Scaling these vectors can give a very good prediction for the motion of a block in a B-VOP. The reference vector (the one from which the prediction is generated) is the one from the co-located macroblock in the most recently decoded I or P VOP. Using I-VOP motion vectors implies using 0 as the motion vector prediction. If the co-located macroblock is in a transparent VOP, or in a location where the VO does not exist in the VOP, then a 0 motion vector is used.

Figure 7 illustrates the situation with an (I/P)BBP GOV segment. Considering a block with centre at position x in frame n, and given that the most recently decoded P frame is n + 2, with the previous P or I frame at n − 1, say, the motion vectors are coded as follows.

d_{n,n-1}(x) = d_{n+2,n-1}(x)/3 + \Delta_{n,n-1}(x)

d_{n,n+2}(x) = -2\,d_{n+2,n-1}(x)/3 + \Delta_{n,n-1}(x)


It is the prediction error ∆ that is coded.
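A sketch of the direct-mode scaling for the (I/P)BBP example above; the trb/trd generalisation to other temporal distances is my extrapolation of the 1/3 and −2/3 factors:

```python
def direct_mode_vectors(mv_p, delta, trb=1, trd=3):
    """Derive forward/backward vectors for a B block from the co-located
    P-frame vector mv_p; only the prediction error `delta` is transmitted.

    trb: distance from the past reference to the B frame (1 for the first B).
    trd: distance from the past reference to the P reference (3 for (I/P)BBP).
    """
    fwd = tuple(trb * c / trd + d for c, d in zip(mv_p, delta))          # mv_p/3 + delta
    bwd = tuple((trb - trd) * c / trd + d for c, d in zip(mv_p, delta))  # -2*mv_p/3 + delta
    return fwd, bwd
```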

Coding the motion prediction

This is done using the same method as MPEG2. The prediction error for a component of motion is therefore given by

|∆| = (motion_data − 1) × 2^{r_size} + motion_residual + 1

r_size = vop_fcode − 1

The motion vector range that can be coded is dictated by vop_fcode. In half-pel units it is −32 × 2^{r_size} ... (32 × 2^{r_size}) − 1, i.e. a range of 64 × 2^{r_size}.

Only motion_residual and motion_data are coded in the bitstream. The motion_residual is coded directly as an unsigned integer using r_size bits. Table 1 illustrates how the code operates for a variety of values of ∆.

Range of motion to be coded in half-pel units:

vop_fcode   Range
1           [-32, 31]
2           [-64, 63]
3           [-128, 127]
4           [-256, 255]
5           [-512, 511]
6           [-1024, 1023]
7           [-2048, 2047]

Zero motion is coded by setting motion_data to zero regardless of the vop_fcode value.
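As a sketch, a decoder-side reconstruction of one motion component following the formula above might look like this (my helper, not reference code):

```python
def decode_motion_component(motion_data, motion_residual, vop_fcode):
    """Reconstruct one motion vector component in half-pel units."""
    if motion_data == 0:
        return 0                     # zero motion, regardless of vop_fcode
    r_size = vop_fcode - 1
    magnitude = (abs(motion_data) - 1) * (1 << r_size) + motion_residual + 1
    return magnitude if motion_data > 0 else -magnitude

# vop_fcode = 2 (range [-64, 63]): motion_data = 32, residual = 1 -> 64
assert decode_motion_component(32, 1, 2) == 64
```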

6.3 Error Resilience

There are four error resilience tools in MPEG4: Resynchronisation, Data Partitioning, Reversible Variable Length Codes and the Header Extension Code. These tools assist in error detection and correction as well as synchronisation within the bitstream.

6.3.1 Resynchronisation

The bitstream is parsed as a serial data stream. This means that if an error occurs, bits parsed subsequent to the error may be misinterpreted. For instance, loss of some DC coefficient information may cause the subsequent AC coefficient information to be wrongly interpreted as DC information. Assuming that the decoder can detect such errors, it needs to be able to skip to the next correct data and know what it corresponds to. This is called resynchronisation.


vop_fcode (range)   motion_data   motion_residual   ∆

1 (32)              -32           0                 -32
                    -31           0                 -31
                    ...                             ...
                    32                              32

2 (64)              -32           0                 -65
                                  1                 -64
                    -31           0                 -63
                                  1                 -62
                    -30           0                 -61
                                  1                 -60
                    ...           ...               ...
                    32            1                 64

3 (128)             -32           00                -131
                                  01                -130
                                  10                -129
                                  11                -128
                    ...           ...               ...

Table 1: Motion Coding in Half-Pel Units

In the example given, this implies that the decoder should be able to skip the corrupted DC data and then process the AC data knowing that it is indeed AC data.

To allow the decoder to resynchronise, it is common to insert special markers into the bitstream. These markers mark the beginning of some known bitstream state. Thus when the decoder encounters an error, it can skip data until it encounters a resynchronisation marker and then begin decoding again.

In MPEG2 the resynchronisation tool is the Slice header. It is allowed to occur only at the start of a macroblock, and a new slice is always started at the left hand edge of each coded image. In MPEG4, the position of the resynchronisation marker is not limited to any particular bitstream locations. The marker is inserted at the start of video packets. Each of these video packets is made up of an integer number of macroblocks. The mode of operation adopted by the MPEG4 encoder is to insert the Resynch marker periodically in the bitstream, every K bits. This means that in high activity areas, which consume more bits than low activity areas, there will be more Resynch markers than in the low activity areas. Assuming that the high activity areas are more important than the low activity areas, this mode of operation implies a higher level of protection against loss of data in high activity areas.
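A toy sketch of this encoder-side behaviour (the bitstring interface and threshold are illustrative only):

```python
RESYNC_MARKER = "0" * 16 + "1"   # the 17-bit gob_resync_marker

def packetise(macroblock_bits, threshold=512):
    """Group macroblock bitstrings into video packets: a new packet starts at
    the next macroblock boundary once a packet exceeds `threshold` bits."""
    packets, current, size = [], [], 0
    for bits in macroblock_bits:
        if size > threshold:
            packets.append(current)          # close the packet at a MB boundary
            current, size = [], 0
        if not current:
            current.append(RESYNC_MARKER)    # byte alignment omitted in this sketch
        current.append(bits)
        size += len(bits)
    if current:
        packets.append(current)
    return packets
```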

It is typical to insert Resynch markers at 512 or 600 bit intervals in the bitstream. The markers are byte aligned, and the idea is that once the number of bits in a Video Packet exceeds a threshold (512 or 600 bits typically), a new video packet is created at the start of the next macroblock. It is the Resynch marker that indicates the start of another video packet. This packet approach is similar to the Group of Blocks (GOB) structure utilised by the ITU-T Recommendations H.261 and H.263.

Figure 8: The structure of a video packet with Resynchronisation marker. The packet header carries the Resynchronisation Marker, the Macroblock Number, the Quantisation Parameter and the HEC flag (with optional HEC data). Combined motion and DCT data then follow for each macroblock: COD (1 bit to indicate whether the macroblock is coded), MCBPC (macroblock mode: Inter/Intra/Inter 4V, Intra+Q), DQUANT (2 bit incremental modification of quantisation from the previous macroblock), CBPY (indicates which of the 4 blocks follow), the encoded MVs and the DCT data for each block.

The Resynch Marker is denoted as gob_resync_marker in the MPEG4 documentation. It is a fixed length code of 17 bits having the value 0000 0000 0000 0000 1. Thus it is distinguishable from VOP start codes and other VLC codes that are used in MPEG4.

In order for the mechanism to work, the data in each Video Packet must be independently decodable from data in previous Video Packets. This is achieved by inserting header information at the start of the Video Packet, after the Resynch marker. The syntactical structure of a video packet is shown in figure 8.

The information in the packet is, firstly, the macroblock address² (number) of the first macroblock in the packet, followed by the default quantisation scale used for the macroblocks in the packet. Additional information may also follow, including the modulo time base, vop_time_increment, vop_coding_type, intra_dc_vlc_thr, vop_fcode_forward and vop_fcode_backward, depending on whether the Header Extension Code is set to 1 or not. If the VOP contains motion information, then the prediction for the first macroblock is set to 0. This makes sure that there are no predictions across packet boundaries.

² The spatial position of the macroblock in the image.


Figure 9: Data partitioning in MPEG4 video packets. After the Resynchronisation Marker, Macroblock Number, Quantisation Parameter and HEC (with optional HEC data), the shape and motion data (encoded MVs) for all macroblocks come first, terminated by the MBM (Motion Boundary Marker), followed by the DCT data for each block, with the COD, MCBPC, DQUANT and CBPY fields as in figure 8.

6.4 Data Partitioning

In MPEG2 the motion and DCT data are coded within the code for each macroblock. This means that an error in any of these components can cause the loss of all information until the next slice boundary. In MPEG4 the DCT and motion data can be inserted separately into each video packet. ALL the motion data for ALL the macroblocks in a video packet is coded first, followed by ALL the DCT data for the macroblocks. The two data components are separated by a Motion Boundary Marker (MBM). This is the 17-bit binary string 1 1111 0000 0000 0001. It is only present when the data_partitioned flag is set to '1'. This flag is present in the Video Object Layer header information.

Separating motion information from DCT information in this way offers two advantages.

• An error can be detected in the motion data if the MBM cannot be detected. If such an error is detected, the packet data can be discarded and decoding continues at the start of the next packet.

• An error in the texture portion of the video packet can be detected because the subsequent Resynch marker at the start of the next VOP probably will not occur at the required location. Also, if the motion information is correctly received, the number of macroblocks present in the video packet is implicitly known. Thus errors in the texture component of the packet can also be detected when the number of macroblocks decoded does not agree with this known amount. When such an error occurs, the macroblocks that were affected can still be reconstructed using motion compensation with the correctly received motion information.

Figure 10: The use of RVLCs for recovering corrupted portions of the MPEG4 bitstream. Forward decoding proceeds until an error is detected; decoding then restarts backwards from the next Resynchronisation marker through the RVLC-coded motion and DCT data, recovering data that would otherwise be discarded, so that only the region around the error is lost.

The format of the data partitioned video packet is shown in figure 9.

DATA PARTITIONING IS NOT USED FOR B-VOPs, SO THE MOTION COMPENSATION IS KNOWN TO BE ON 16 × 16 MACROBLOCKS ONLY.

6.5 Reversible variable length codes (RVLCs)

When an error is encountered in the bitstream, the subsequent data must typically be discarded until the next resynchronisation point presents itself. MPEG4 allows the use of RVLCs to alleviate this problem. These codes can be decoded in both the forward and backward directions. Thus after the next resynchronisation point is located, the bitstream can be decoded in the reverse direction until an error is again encountered or the processed data is past the correctly decoded data in the forward direction. Figure 10 illustrates how this may improve the error resilience of the bitstream. RVLCs can be used for both motion and texture information. For the technique to work, all the data using RVLCs must be contained in the same locality in the bitstream. Thus the use of RVLCs must be accompanied by the use of the data partitioning mode.
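As a toy illustration (this is not the codebook from the standard), here is a tiny code that is both prefix-free and suffix-free, so the same bitstring parses unambiguously from either end — the property a real RVLC table provides:

```python
CODE = {"a": "0", "b": "11", "c": "101"}          # toy RVLC codebook
FWD = {v: k for k, v in CODE.items()}             # forward parse table
BWD = {v[::-1]: k for k, v in CODE.items()}       # backward parse table

def decode(bits, table):
    """Greedy VLC parse; returns the symbols decoded before any failure."""
    out, buf = [], ""
    for bit in bits:
        buf += bit
        if buf in table:
            out.append(table[buf])
            buf = ""
    return out

bits = "".join(CODE[s] for s in "abcab")
assert decode(bits, FWD) == list("abcab")        # forward decoding
assert decode(bits[::-1], BWD) == list("bacba")  # backward decoding (reversed order)
```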

6.6 Header Extension Code (HEC)

The HEC is an optional set of header information placed at the start of a video packet. This header repeats the information necessary for the decoding of the VOP and the macroblocks in the packet. The HEC mechanism therefore further increases the independence between video packets and hence improves error resilience.


Figure 11: The contexts used for coding binary shape with arithmetic coding (Intra CAE and Inter CAE). The pixels used for creating the context template are marked with a + and the pixel to be predicted is marked with a ◦. The context for Inter-CAE also uses a 3 × 3 block of pixels from the previous frame, centred at the same location as the current pixel in the current frame.

7 Natural Video Coding II

Coding video objects is not used in the Simple profile. The use of video objects is the major new feature in MPEG4, and therefore it is only when decoders start to implement the Main and Core profiles that we will see some truly novel applications arising for video communications. However, coding VOPs involves the need to handle portions of the image, since the VO may not occupy the whole image frame. This section briefly considers the various techniques that have been developed to cope with this new problem.

7.1 Coding Shape Information

The shape information corresponding to a VOP can either be binary (values 0−1) or gray scale (values 0−255). Gray scale shape information allows semi-transparent objects to interact within the scene. MPEG4 uses arithmetic coding to deal with binary shape, and the DCT to deal with gray scale shape information. Coding gray scale shape information is therefore identical to coding the image data corresponding to a VOP.

Binary shape coding uses a Context-based Arithmetic Encoding (CAE) scheme. Arithmetic coding is an extremely efficient method for assigning codewords to data and can approach the entropy limit very closely. We will not describe the method here. The idea for binary shape coding is to use a causal neighbourhood around the current pixel to index into a probability table that acts as a prediction for the state of the current pixel (0/1). Figure 11 shows the contexts used for Inter and Intra VOPs.
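A sketch of the context-indexing idea; the template below is a generic 10-pixel causal neighbourhood, not the exact template of figure 11, and the probability table itself would be the one defined by the standard:

```python
# Generic causal template: offsets (dx, dy) of already-decoded shape pixels.
TEMPLATE = [(-1, -2), (0, -2), (1, -2),
            (-2, -1), (-1, -1), (0, -1), (1, -1), (2, -1),
            (-2, 0), (-1, 0)]

def context_index(mask, x, y, template=TEMPLATE):
    """Pack the template pixels into an integer context; the context indexes
    a table of probabilities P(pixel = 1) driving the arithmetic coder."""
    h, w = len(mask), len(mask[0])
    ctx = 0
    for dx, dy in template:
        xx, yy = x + dx, y + dy
        bit = mask[yy][xx] if 0 <= xx < w and 0 <= yy < h else 0  # outside reads as 0
        ctx = (ctx << 1) | bit
    return ctx   # in [0, 2**len(template))
```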


Figure 12: A VOP showing the 8 × 8 blocks. Internal blocks are treated as normal; blocks at a boundary must be padded. Some blocks straddle an object boundary and must be treated differently for texture coding.

7.1.1 Texture VOPs and Padding

The scheme for coding texture within a VOP (Intra or Inter) uses the DCT as discussed before. For blocks that are fully contained within the shape to be coded, the process of compression is as normal (i.e. like MPEG2). Thus the DCT of the block is quantised and scanned using one of the 3 scan patterns, and the resulting data is run/level coded. For blocks that lie across the shape boundaries, however, the DCT is likely to contain spurious high frequency coefficients that, if coded and quantised, would cause ringing artefacts when decoded. This situation is illustrated in figure 12.

To avoid this problem, blocks that straddle shape boundaries are padded to decrease the presence of high frequency coefficients. Two methods have been provided in the standard: Repetitive Padding followed by the normal DCT, or Repetitive Padding followed by the Shape Adaptive DCT (SA-DCT).

Repetitive Padding This involves filling the pixels in the block that lie outside the boundary with some value. Two methods can be used.

Mean Padding All the pixels outside the shape boundary are assigned the mean value of the pixels within the shape inside the block.

Recursive Averaging Each pixel outside the shape is assigned a value equal to the mean of the 4 pixels in the 4-connected neighbourhood around the current pixel. This is performed in a raster scan fashion over the block. Given that the current site is [i, j], the pixel value that is assigned, I(i, j), is therefore

I(i, j) = \frac{1}{4} \big( I(i-1, j) + I(i+1, j) + I(i, j-1) + I(i, j+1) \big)        (3)
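A sketch of mean padding, the simpler of the two (numpy; assumes a binary shape mask for the block):

```python
import numpy as np

def mean_pad(block, shape_mask):
    """Fill pixels outside the VOP shape with the mean of the in-shape pixels,
    so that the block DCT is not polluted by spurious high frequencies.

    block:      (8, 8) array of pixel values.
    shape_mask: (8, 8) boolean array, True inside the object.
    """
    padded = block.astype(float).copy()
    padded[~shape_mask] = block[shape_mask].mean()
    return padded
```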


Shape-Adaptive DCT The shape adaptive DCT uses a separable DCT operation on the columns and then the rows of a block. The pixels within the shape in the block are first shifted to the top of the columns in which they lie, and the DCT of each resulting column is then taken. The resulting coefficients are then shifted to the leftmost position in each respective row and the row DCT is then taken. This leaves coefficients that are concentrated at the top left hand corner of the block. This arrangement is then more amenable to zig-zag scanning to create a more efficient run/level representation. This method requires the binary mask of the shape within the block in order to decode the coefficients correctly.
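A sketch of the forward SA-DCT using scipy's DCT-II as the variable-length transform (the orthonormal normalisation is my choice; the standard defines its own scaling):

```python
import numpy as np
from scipy.fft import dct

def sa_dct(block, shape_mask):
    """Shape-adaptive DCT of an 8x8 block: shift in-shape pixels to the top of
    each column, DCT each column at its own length, then repeat on the rows."""
    n = block.shape[0]
    cols = np.zeros_like(block, dtype=float)
    for j in range(n):                               # column pass
        v = block[shape_mask[:, j], j]               # in-shape pixels, shifted up
        if v.size:
            cols[:v.size, j] = dct(v, norm="ortho")  # variable-length DCT
    lengths = shape_mask.sum(axis=0)                 # coefficients per column
    out = np.zeros_like(cols)
    for i in range(n):                               # row pass
        v = cols[i, lengths > i]                     # row-i coefficients, shifted left
        if v.size:
            out[i, :v.size] = dct(v, norm="ortho")
    return out   # energy packed towards the top left hand corner
```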

7.1.2 Shape and Motion Compensation

A similar problem arises when motion compensating VOPs that do not cover the entire frame. For blocks that are completely within the shape, the motion compensation is as normal. For those that straddle shape boundaries, the block is first padded using repetitive padding as discussed above, and then standard motion compensation is used.

For motion coding, transparent blocks are skipped. If the blocks from which prediction motion vectors are to be obtained are not valid, then there are some rules to create a set of prediction vectors from the set of 3 (see the sketch after this list).

• When only one prediction vector is not valid, then that prediction vector is set to 0.

• When only 2 are not valid, then the two corresponding vectors are set to the value of the third (valid) vector. Thus the prediction is the value of the single valid vector.

• When all 3 vectors are not valid then the prediction vector is set to 0.
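A sketch combining these rules with the componentwise median predictor of section 6.2 (names are illustrative; invalid candidates are passed as None):

```python
import statistics

def predict_mv_with_rules(c1, c2, c3):
    """Apply the validity rules to the 3 candidate vectors, then take the
    componentwise median to form the prediction vector."""
    cands = [c1, c2, c3]
    valid = [c for c in cands if c is not None]
    if not valid:
        return (0, 0)              # all 3 invalid: the prediction is 0
    if len(valid) == 1:
        return valid[0]            # 2 invalid: use the single valid vector
    cands = [(0, 0) if c is None else c for c in cands]  # 1 invalid: set it to 0
    xs, ys = zip(*cands)
    return (statistics.median(xs), statistics.median(ys))
```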

For binary shape motion compensation, only integer pixel motion accuracy is allowed. The vector neighbourhood for shape block motion prediction is slightly different from that of texture block motion prediction.

7.2 Sprites and Global Motion

Sprites have long been used in computer graphics and by the gaming community. Anyone familiar with the old 1980s Commodore 64 will remember that the ability of that machine to animate user defined 'sprites' made it an awesome gaming machine. A sprite is a Video Object that persists over time. In MPEG4 it can also exist outside of the view of the current VOP. The classic example of a sprite useful for compression purposes is the background scene at a tennis match. The sprite consisting of the entire background area can be transmitted once to the decoder, and then subsequent VOPs can be pasted onto warped portions of the sprite to create each new image frame.

The creation of sprites from natural imagery is a complex business (no matter what you may hear elsewhere). Creation of a bad sprite is relatively easy, but creation of a sprite that stands up to close scrutiny demands a good understanding of the camera distortion caused by rotation and perspective, as well as a good understanding of video models. You probably know enough to put together a good sprite extraction algorithm from the knowledge you have gained from the Video Models section of this course, but we will not go into this further as it is the subject of some very interesting research.

Sprites are transmitted as a (possibly large) I-frame (a rectangular I-VOP). To construct the image at the decoder, portions of the sprite have to be warped and combined with the transmitted VOP for each frame. The warping required is generally equal to the global motion of the scene content. MPEG4 allows for a parameterisation of the motion using the following mappings.

x' = \frac{a_0 + a_1 x + a_2 y}{a_6 x + a_7 y + 1}, \qquad y' = \frac{a_3 + a_4 x + a_5 y}{a_6 x + a_7 y + 1}

where (x', y') is the location of the motion compensated pixel in the previous frame and (x, y) is the location of the pixel in the current frame. This model is the full model for camera motion.

Rather than transmitting these warping parameters, which would require a new data section to be designed for the bitstream, MPEG4 parameterises the warp by using the displacements between control points in the sprite. Thus for a translation only warp, only one control point and a single motion vector for that point need be transmitted. For an affine transformation, 3 control points and their associated displacements are needed, and for a full perspective transformation 4 control points and their displacements are needed. The displacements are quantised to 0.5 pel accuracy.
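A sketch of applying the perspective mapping itself, given parameters a0..a7 (estimating the parameters from the 4 control-point displacements is a small linear solve, omitted here):

```python
def perspective_warp(x, y, a):
    """Map a current-frame pixel (x, y) to its motion compensated location
    (x', y') in the sprite/previous frame, using parameters a = (a0, ..., a7)."""
    denom = a[6] * x + a[7] * y + 1.0
    xp = (a[0] + a[1] * x + a[2] * y) / denom
    yp = (a[3] + a[4] * x + a[5] * y) / denom
    return xp, yp

# Pure translation is the special case a = (tx, 1, 0, ty, 0, 1, 0, 0),
# which is why one control point displacement suffices for that warp.
```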

7.2.1 Global Motion Compensation (GMC)

By allowing for global motion, the compression of motion information can be increased dramatically. This is because in typical scenes a large percentage of the scene is undergoing the same motion transformation. Instead of applying warping to a sprite VOP, each Inter-block is allowed to select either a Global Motion Vector or a Local Motion Vector. This decision should be taken using a criterion that acknowledges the fact that a block using GMC does not have to be associated with motion information. Thus having a low DFD is not the only consideration. Furthermore, at very low bit rates the advantage of using GMC becomes greater, since the ratio between motion information and texture information increases. The GMC warping parameters are transmitted as for sprites.

8 Final Comments

We have not touched on the synthetic coding parts of the MPEG4 standard. The discussion here has remained primarily in the area of Natural Video Coding. It is unclear at this stage how much of the MPEG4 standard will ever be widely adopted in the industry. The Simple Profile, together with the improved compression and error-resilience tools, will clearly be important for video over low bandwidth links and thus mobile applications. However, adoption of the rest of the standard is hampered by the very difficult problem of automatically segmenting image sequences for the purposes of object based coding. In cases where there are two principal objects this is certainly possible (e.g. sports), but for more general cases the problem remains yet to be efficiently solved.

A VLC Tables

VLC                 motion_data      VLC                 motion_data
0000 0000 0010 1    -16              1                   0
0000 0000 0011 1    -15.5            010                 0.5
0000 0000 0101      -15              0010                1
0000 0000 0111      -14.5            0001 0              1.5
0000 0000 1001      -14              0000 110            2
0000 0000 1011      -13.5            0000 1010           2.5
0000 0000 1101      -13              0000 1000           3
0000 0000 1111      -12.5            0000 0110           3.5
0000 0001 001       -12              0000 0101 10        4
0000 0001 011       -11.5            0000 0101 00        4.5
0000 0001 101       -11              0000 0100 10        5
0000 0001 111       -10.5            0000 0100 010       5.5
0000 0010 001       -10              0000 0100 000       6
0000 0010 011       -9.5             0000 0011 110       6.5
0000 0010 101       -9               0000 0011 100       7
0000 0010 111       -8.5             0000 0011 010       7.5
0000 0011 001       -8               0000 0011 000       8
0000 0011 011       -7.5             0000 0010 110       8.5
0000 0011 101       -7               0000 0010 100       9
0000 0011 111       -6.5             0000 0010 010       9.5
0000 0100 001       -6               0000 0010 000       10
0000 0100 011       -5.5             0000 0001 110       10.5
0000 0100 11        -5               0000 0001 100       11
0000 0101 01        -4.5             0000 0001 010       11.5
0000 0101 11        -4               0000 0001 000       12
0000 0111           -3.5             0000 0000 1110      12.5
0000 1001           -3               0000 0000 1100      13
0000 1011           -2.5             0000 0000 1010      13.5
0000 111            -2               0000 0000 1000      14
0001 1              -1.5             0000 0000 0110      14.5
0011                -1               0000 0000 0100      15
011                 -0.5             0000 0000 0011 0    15.5
                                     0000 0000 0010 0    16

VLC Table for motion_data


References

[1] Atul Puri and Tsuhan Chen. Multimedia Systems, Standards and Networks. Marcel Dekker, 2000.

[2] Ming-Ting Sun and Amy R. Reibman. Compressed Video over Networks. Marcel Dekker, 2001.
