
Linköpings universitet, SE–581 83 Linköping

+46 13 28 10 00, www.liu.se

Linköping University | Department of Electrical Engineering
Master thesis, 30 ECTS | Informationskodning

2017 | LiTH-ISY-EX--17/5105--SE

Screen Content Coding in HEVC
Mixed raster content with matching pursuit

Ching-Hsiang Yang

Supervisor: Harald Nautsch
Examiner: Ingemar Ragnemalm



Copyright

The publishers will keep this document online on the Internet – or its possible replacement – for a period of 25 years starting from the date of publication barring exceptional circumstances. The online availability of the document implies permanent permission for anyone to read, to download, or to print out single copies for his/her own use and to use it unchanged for non-commercial research and educational purpose. Subsequent transfers of copyright cannot revoke this permission. All other uses of the document are conditional upon the consent of the copyright owner. The publisher has taken technical and administrative measures to assure authenticity, security and accessibility. According to intellectual property law the author has the right to be mentioned when his/her work is accessed as described above and to be protected against infringement. For additional information about the Linköping University Electronic Press and its procedures for publication and for assurance of document integrity, please refer to its www home page: http://www.ep.liu.se/.

© Ching-Hsiang Yang


Abstract

Screen content coding is used to improve the coding efficiency of synthetic content in videos, such as text and UI elements, as opposed to content captured with photographic equipment, which most video codecs are optimized for. One way of improving screen content coding efficiency is to use mixed block coding with matching pursuit. By separating the prediction and transformation steps for overlay and background elements, better contrast and signal-to-noise ratio can be achieved. This thesis describes the implementation of such an algorithm within the HEVC reference encoder, and discusses the experimental results on several test images.


Acknowledgments

First and foremost, I would like to thank my family for their support throughout my education. Without them I could not have reached where I am now.

So many people at Linköping University have helped me during the thesis study. Harald Nautsch gave me the knowledge of data compression and video encoding during his fantastic lectures, and helped me settle on the thesis topic. Ingemar Ragnemalm supervised the thesis, and provided constructive advice on improving this report. Anton Kovalev, as my opponent, gave precious feedback on the thesis presentation and report. Each of these contributions was invaluable to the completion of my thesis work.

Last but not least, I would like to thank my friends, whose company always keeps me refreshed and happy. Special thanks to Jerome for helping me through life, and to Dai for his advice and the fun hours at his apartment.


Contents

Abstract
Acknowledgments
Contents
List of Figures
1 Introduction
    1.1 Background
    1.2 Motivation
    1.3 Aim
    1.4 Research questions
    1.5 Delimitations
2 Theory
    2.1 Video coding concepts
    2.2 HEVC
    2.3 Screen Content Coding
3 Method
    3.1 HEVC Test Model
    3.2 Implementation
    3.3 Changes from the original algorithm
    3.4 Test environment
4 Evaluation
5 Results and Discussion
    5.1 Results
    5.2 Method
6 Conclusion
Bibliography


List of Figures

2.1 H.264/AVC intra-prediction modes for 4×4 luma blocks
2.2 Encoder Overview
2.3 CUs and PUs in a CTU
2.4 HEVC Intra-prediction angular modes
2.5 Original image
2.6 Mask
2.7 Reconstructed foreground block
2.8 Reconstructed background block
2.9 Combined reconstructed blocks
4.1 Test Image 1
4.2 Test Image 1 Result
4.3 Test Image 2
4.4 Test Image 2 Result
4.5 Test Image 3
4.6 Test Image 3 Result


List of Acronyms

AVC advanced video coding

CABAC context-adaptive binary arithmetic coding

CAVLC context-adaptive variable-length coding

CTU coding tree unit

CU coding unit

DCT discrete cosine transform

DST discrete sine transform

HEVC High Efficiency Video Coding

HSAD absolute sum of Hadamard transformed coefficients

HM HEVC Test Model

JCT-VC Joint Collaborative Team on Video Coding

MPM most probable mode

MRC Mixed Raster Content

PSNR peak signal-to-noise ratio

PU prediction unit

QP quantization parameter

RDO rate-distortion optimization

RDPCM residual differential pulse code modulation

RMD rough mode decision

SBAC Syntax-based context-adaptive Binary Arithmetic Coding

SCC screen content coding


SATD sum of absolute transformed differences

TU transform unit


1 Introduction

1.1 Background

Video is a very common type of multimedia content. A video clip is an ordered sequence of images, displayed one by one to form a sense of animation. Each image consists of pixels that record the light intensity of a small region in the frame, and optionally different color channels to represent color content.

Since videos are essentially large collections of images, it takes significant disk space to store a digital video, and a large bandwidth to transfer one. What makes today's video streaming services and high definition digital television possible is the field of video compression, which deals with finding efficient ways of transmitting and storing video.

As with other types of data compression, data can be compressed by removing redundant information. In the context of video encoding, there is spatial redundancy, meaning that some information can be inferred from surrounding pixels in the same image. For example, a pixel surrounded by large areas of white pixels is likely to be of a similar color. There is also temporal redundancy, which can be exploited by referring to previous or later frames. For example, a video of a still scenery has more temporal redundancy, since most pixels will not change color over time. By finding and reducing these redundancies, videos can be compressed very efficiently to satisfy transmission and storage demands.

1.2 Motivation

The High Efficiency Video Coding (HEVC) standard [1], first approved as an ITU-T standard in 2013, is a further development of the H.264/MPEG-4 advanced video coding (AVC) standard. HEVC incorporates many improvements to enhance coding efficiency, specifically for better compression of ultra-high resolution videos [2].

Apart from the usual use case of capturing physical scenes, synthetic content such as game streaming and computer screen recording has become a common use case for the video encoding tools we have today. These types of videos call for screen content coding (SCC) tools, which can take advantage of the underlying structure of the image to further increase coding efficiency.


There are many different coding tools that can provide superior performance in the context of SCC, and the HEVC working group, the Joint Collaborative Team on Video Coding (JCT-VC), has also proposed several new modules to be included in upcoming standardization efforts, such as Palette Mode and Adaptive Color Transform [3].

Mixed Raster Content (MRC) is a coding method that combines several layers of an image into one, exploiting the different properties of the layers to improve coding efficiency. Applying this method to SCC, Nautsch and Ostermann [4] proposed using a mask layer to separate video macroblocks into foreground and background layers, and utilizing the matching pursuit algorithm [5] to find the corresponding coefficients of the incomplete blocks. Their implementation, based on the AVC standard, shows gains in peak signal-to-noise ratio (PSNR) from 0.1 dB to 1 dB at the same bitrate, depending on the contents of the encoded image.

Adapting the algorithm to HEVC is thus an interesting research topic. With several new tools and a completely different block coding scheme in use, whether MRC can further increase coding efficiency comes down to implementing it on top of HEVC and conducting experiments.

1.3 Aim

This thesis project aims to implement a mixed block coding algorithm, with matching pursuit used for sample filling, as described in [4], using the HEVC reference encoder, the HEVC Test Model (HM). Test video sequences from the JCT-VC will be used to evaluate the efficiency of the algorithm compared to standard HEVC encoding.

1.4 Research questions

The following research questions will be discussed:

1. How can the algorithm be applied to the HEVC encoder?

2. How much does HEVC gain from utilizing the algorithm, in different types of videos?

1.5 Delimitations

The algorithm is implemented only for intra frames, and only on the luma component of the video frame. Consequently, the test material consists of grayscale still images taken from the HEVC test suite video clips.

The prediction and coding method sometimes deviates from the original paper [4] where necessary, so as to retain the advantages of the new coding tools in HEVC. The differences are detailed in later chapters.


2 Theory

Improvements in coding efficiency have enabled fast transmission of video over the Internet, and streaming real-time video has become a commodity for the general public. The latest video encoding standards focus on obtaining better efficiency for higher definition videos, in order to meet the demand for ever higher quality video and to utilize the full potential of better video capture equipment.

In this chapter, an overview of relevant background theory is presented.

2.1 Video coding concepts

We start by introducing some concepts commonly used in video coding.

Luma and chroma

To represent an image with an analog or digital signal, a color space is used to define a mapping between values and colors. The most widely used color space in video and image compression is YCbCr, which represents a color using three components: one luma component representing the brightness, and two chroma (chrominance) components representing the color.

Hybrid video encoding

Since the ratification of the H.261 video codec [6] in 1988, nearly all modern video coding standards have been structured around the principle of hybrid video coding, which combines intra-frame and inter-frame coding. A frame in a video is predicted either using other parts of the same frame, or from nearby frames before or after it. The difference between the original image and the prediction is then transformed (usually with the discrete cosine transform), quantized and entropy encoded.

Blockwise coding

An image is first split into blocks for encoding. From H.261 until H.264, the fixed-size 16×16 pixel macroblock has been the basic processing unit for this family of video codecs. Each macroblock is subdivided into prediction blocks for prediction and transform blocks for transformation.
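As an illustration of blockwise processing (a minimal sketch for this report, not part of any codec), a grayscale frame can be split into fixed-size 16×16 macroblocks like this:

    import numpy as np

    def split_into_blocks(frame, block_size=16):
        """Split a grayscale frame into non-overlapping block_size x block_size blocks.
        The frame is padded by edge replication if its dimensions are not multiples
        of the block size, which is a common way codecs handle odd resolutions."""
        h, w = frame.shape
        pad_h = (-h) % block_size
        pad_w = (-w) % block_size
        padded = np.pad(frame, ((0, pad_h), (0, pad_w)), mode="edge")
        blocks = []
        for y in range(0, padded.shape[0], block_size):
            for x in range(0, padded.shape[1], block_size):
                blocks.append(padded[y:y + block_size, x:x + block_size])
        return blocks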


The macroblock is replaced by the coding tree unit in the HEVC standard, which will be detailed in the next section.

Prediction

Prediction is used to reduce redundancy in time or space. Each prediction block or prediction unit is created using either intra-prediction or inter-prediction.

The difference between the prediction and the original image is called the residue, and is transformed, quantized, entropy encoded and included in the output bitstream. The more similar the prediction is to the actual block we want to encode, the less energy remains in the residue, and thus the better the block can be compressed.

Intra-prediction Using surrounding pixels in the same frame, a block of the image can be predicted by extending the pixels in the direction of the texture. This is called intra-prediction, and can be done without referring to other frames.

Figure 2.1 shows the nine intra-prediction modes used in the AVC standard when predicting a 4×4 luma block (shown in light blue). The pixels labeled with letters from A to M are surrounding reconstructed pixels that are used as reference pixels. For mode 2 (DC mode) the mean of the reference pixels is used as the prediction.

The remaining eight modes are angular modes. The arrows indicate the directions in which the pixels are referenced. For each predicted pixel, its location is projected onto the reference row of pixels using the prediction direction. The two closest reference pixels are then interpolated, with weights calculated from the projected position of the pixel on the reference.

These angular modes efficiently capture edges and textures of specific angles, and in HEVC the number of modes has been drastically increased, which results in even more precise prediction and better compression.

[Figure: the nine 4×4 intra-prediction modes of AVC, 0 (Vertical), 1 (Horizontal), 2 (DC, mean), 3 (Diagonal down-left), 4 (Diagonal down-right), 5 (Vertical-right), 6 (Horizontal-down), 7 (Vertical-left) and 8 (Horizontal-up), each drawn with the reference pixels A to M.]

Figure 2.1: H.264/AVC intra-prediction modes for 4×4 luma blocks


Inter-prediction Prediction made using image blocks from other frames is called inter-prediction. A motion vector is used to capture movement between frames, and the prediction is made by copying the moved image segment from the direction of the motion vector.

DCT and DST

The discrete cosine transform (DCT) is a Fourier-related transform that can represent data points using a sum of cosine functions. The DCT is used to make the image more compressible, since most of the signal energy can be represented in the low-frequency components of the DCT. Similarly, the discrete sine transform (DST) represents a signal using a sum of sine functions with different frequencies and amplitudes.

Nearly all video and image codecs use the DCT, including the JPEG format, the H.264/AVC standard and the HEVC standard. HEVC additionally uses the DST as an alternative transform when processing 4×4 luma blocks in intra-prediction mode [2].

Quantization

The transformed coefficients are quantized to reduce information and make them easier to compress. This is the main lossy process during encoding, since the coefficients are constrained to discrete quantization steps, introducing distortion.

The quantization parameter (QP) is an encoder setting that controls the extent to which the coefficients are quantized. This lets users control the trade-off between video bitrate and quality. The larger the QP value, the coarser the coefficients are quantized, resulting in smaller file sizes but lower quality video.
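As a rough sketch of this trade-off, the commonly cited HEVC relation that the quantization step size approximately doubles for every increase of 6 in QP can be written as follows; the dead-zone offsets and per-frequency scaling lists used by real encoders are omitted:

    import numpy as np

    def quantize(coeffs, qp):
        """Uniform quantization of transform coefficients.
        The step size roughly doubles for every +6 in QP (HEVC convention);
        rounding offsets and scaling lists are left out of this sketch."""
        qstep = 2.0 ** ((qp - 4) / 6.0)
        return np.round(coeffs / qstep).astype(np.int32)

    def dequantize(levels, qp):
        """Reconstruct approximate coefficients from quantized levels (decoder side)."""
        qstep = 2.0 ** ((qp - 4) / 6.0)
        return levels * qstep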

Entropy encoding

The last step in encoding a block is the entropy encoding stage, which is a lossless compression of the quantized coefficients along with other metadata such as the intra-prediction mode and inter-prediction motion vectors.

Some entropy coding schemes have been developed specifically for video encoding applications, to take advantage of several characteristics of quantized coefficients, such as:

• Most coefficients, especially high frequency components, will be zero after quantization.

• Neighboring blocks are correlated.

• The coefficients representing low frequencies are usually larger than the high frequency coefficients.

Special codes such as context-adaptive variable-length coding (CAVLC) and context-adaptive binary arithmetic coding (CABAC), both used in the AVC standard, are designed to exploit these properties to improve the compression ratio.

Rate-distortion optimization

Rate-distortion optimization (RDO) is a method of finding the best parameters for compressing a block. While an exhaustive search for the best quality decoded image can be used to minimize distortion, it does not take into account the number of bits required to store the encoded image. A block could use a disproportionate number of bits without significant improvement in quality.

RDO compares parameters by associating a cost with an encoded block. The block is predicted, transformed, quantized and entropy encoded, yielding the distortion and the number of bits used. The cost of the block is then determined by the number of bits used (the rate) and the distortion.


Since the full RDO process requires all steps of the encoding to be carried out, the encoder would take very long to determine the most appropriate settings. For HEVC, several methods have been proposed [7][8][9] to make fast decisions on parameters and reduce the number of full RDO evaluations needed.
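A minimal sketch of such a rate-distortion comparison is shown below; encode_with_mode is a hypothetical placeholder for the full predict/transform/quantize/entropy-code chain, and lam is the Lagrange multiplier derived from the QP:

    def best_mode_rdo(block, candidate_modes, encode_with_mode, lam):
        """Pick the candidate mode with the lowest rate-distortion cost
        J = D + lambda * R, where D is the distortion and R the bits used."""
        best_mode, best_cost = None, float("inf")
        for mode in candidate_modes:
            distortion, bits = encode_with_mode(block, mode)
            cost = distortion + lam * bits
            if cost < best_cost:
                best_mode, best_cost = mode, cost
        return best_mode, best_cost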

2.2 HEVC

HEVC is an evolution of the H.264/MPEG-4 AVC standard, and uses a similar hybrid video encoding design. For each slice (frame), the image is broken down into blocks, which are predicted either by intra-prediction (using redundancy in the same image) or inter-prediction (using redundancy from previous or later images). The error of the prediction is then transformed into coefficients and quantized. The mode and coefficients are encoded with an entropy encoder and then packed into the video stream.

Figure 2.2 shows a high-level overview of the encoding procedure as a block diagram. Note that the prediction and transformation can be done on different block sizes, which will be detailed in the next subsection.

[Block diagram: Input → Predict → Transform → Quantize → Entropy Coder → Output, with the prediction mode, prediction error and coefficients passed between the stages.]

Figure 2.2: Encoder Overview

Coding tree unit

One of the most important evolutions from AVC to HEVC is the introduction of coding tree units (CTUs) as the basic unit for encoding. Previous standards such as AVC rely on the fixed-size “macroblock” and its subdivisions for the prediction and transformation steps. With higher resolution video, this becomes less efficient as the texture details grow larger than the block size, leaving the coding tools underutilized.

With the advent of coding tree units, or CTUs, HEVC coding units are not tied to a specific size, but are rather determined by the underlying structure of the image. A frame in HEVC is divided into CTUs, and each CTU can be recursively split into several square coding units (CUs).

A CU contains prediction units (PUs), which are the units on which predictions are made. In the case of intra-coded slices, each CU corresponds to a PU of the same size, with the exception of the smallest CU, which can be divided into four smaller PUs.

Figure 2.3 depicts one possible example of the CUs and PUs contained in a CTU, under intra-prediction mode. For intra-prediction, the PU size is always the same as the CU that contains it, with the exception of the smallest CU size, which can contain one level of further subdivided PUs. The blocks with dotted lines are the PUs that have been split further from the smallest allowed CU size.

Each CU is also recursively divided into smaller transform units (TUs), on which transformation and quantization are performed. This ensures that relatively tiny details can be coded efficiently while still having the flexibility to deal with large blocks, making it suitable for ultra-high resolution video.


The decision of CU, PU and TU sizes is up to the encoder, which has to find the best combination for compression efficiency.

Figure 2.3: CUs and PUs in a CTU
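The split decision can be pictured as a recursive cost comparison, sketched below with a hypothetical rd_cost function that returns the RDO cost of coding a square region as a single CU; the real HM decision is considerably more involved:

    def best_cu_partition(x, y, size, rd_cost, min_size=8):
        """Recursively decide whether to code a square region as one CU or to
        split it into four quadrants, keeping whichever has the lower total cost.
        Returns (cost, list of (x, y, size) leaf CUs)."""
        whole_cost = rd_cost(x, y, size)
        if size <= min_size:
            return whole_cost, [(x, y, size)]
        half = size // 2
        split_cost, split_parts = 0.0, []
        for dy in (0, half):
            for dx in (0, half):
                cost, parts = best_cu_partition(x + dx, y + dy, half, rd_cost, min_size)
                split_cost += cost
                split_parts += parts
        if split_cost < whole_cost:
            return split_cost, split_parts
        return whole_cost, [(x, y, size)]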

Prediction

For each PU in an intra slice, a prediction is created from surrounding pixels. HEVC supports a total of 35 intra prediction modes, compared to 9 for AVC [10]. Mode 0 indicates planar intra prediction, which is similar to a bilinear interpolation from the reference pixels. Mode 1 denotes DC prediction, which fills the whole block with the mean of the available reference pixels.

The remaining 33 modes are angular prediction modes, which predict using the reference pixels along a prediction direction, much like the eight angular modes of AVC depicted in figure 2.1. In HEVC the predicted pixels are interpolated from the two reference pixels closest to the projection in the prediction direction. The weighting of the interpolation is determined by the subpixel location in between the reference pixels. Figure 2.4 shows the 33 available prediction directions of the HEVC angular intra-prediction modes.
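A simplified sketch of that interpolation for a vertical-ish direction is shown below; reference sample preparation, negative angles (which require the left reference column) and the reference filtering done by HM are omitted, and the angle is the per-row displacement in 1/32-pixel units:

    def angular_predict_vertical(ref_top, size, angle):
        """Simplified HEVC-style angular prediction for a positive vertical angle.
        ref_top holds the reconstructed reference row above the block (assumed to
        contain enough samples to the right). Each predicted pixel is a two-tap
        interpolation of the two reference samples closest to its projection."""
        pred = [[0] * size for _ in range(size)]
        for y in range(size):
            offset = (y + 1) * angle          # displacement in 1/32-pixel units
            idx, frac = offset >> 5, offset & 31
            for x in range(size):
                a = ref_top[x + idx]
                b = ref_top[x + idx + 1]
                pred[y][x] = ((32 - frac) * a + frac * b + 16) >> 5
        return pred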

The reason for such a drastic increase in the number of prediction directions is to better capture the characteristics of the predicted images, especially at edges with strong contrast.

To determine the best prediction mode, the RDO process can be used. Each mode is tested by actually calculating the distortion and using the entropy encoder to test encode it, giving the number of bits needed to code the block. These modes are then compared, using the number of bits used and the distortion, to find the best prediction mode for encoding the PU.

Transformation

The TUs are transformed using a scaled integer approximation of the DCT, or alternatively a DST for 4×4 luma residual blocks, which has similar properties to the DCT. Since the transformation is used extremely frequently, using only integer operations reduces the computational complexity.
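For reference, the 4-point core transform matrix specified in HEVC (an integer approximation of the DCT) can be applied to a 4×4 residual block as sketched below; the intermediate right-shift and scaling stages defined by the standard are left out for clarity:

    import numpy as np

    # HEVC 4-point core transform matrix (integer DCT approximation)
    T4 = np.array([[64,  64,  64,  64],
                   [83,  36, -36, -83],
                   [64, -64, -64,  64],
                   [36, -83,  83, -36]], dtype=np.int64)

    def forward_transform_4x4(residual):
        """2-D separable transform of a 4x4 residual block (rows, then columns).
        Real encoders insert right-shifts between the two stages to keep the
        dynamic range bounded; that normalization is omitted here."""
        return T4 @ np.asarray(residual, dtype=np.int64) @ T4.T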


[Figure: the 33 angular prediction directions of HEVC, labeled V-32 to V+32 for the vertical directions and H-26 to H+32 for the horizontal directions.]

Figure 2.4: HEVC Intra-prediction angular modes

Entropy encoding

In AVC, both CAVLC and CABAC are specified as available entropy encoders, with the latter being the more efficient method. In HEVC, only CABAC is used.

2.3 Screen Content Coding

Screen content refers to synthetic content including text, animation and mixed content with natural imagery. Several coding tools have been proposed for the HEVC standard, such as residual differential pulse code modulation (RDPCM) [11]. This thesis focuses on an application of Mixed Raster Content (MRC), combined with the matching pursuit algorithm.

Mixed raster content

Mixed raster content [12][13][14] was proposed as a format for efficient storage of multi-layered images. Such images can contain separate layers of text, photos, pictures and color masks. By coding the layers using different methods, better efficiency can be achieved than by directly coding the resulting combined image.


MRC is especially suited for SCC, where text can be overlaid on top of another video, and synthetic material is mixed with captured video. In [4], the authors presented an algorithm to improve SCC coding efficiency in AVC. For each block, in addition to regular AVC coding, a mask is determined using a two-step quantization, which separates the foreground and background components of the block. Those two blocks are predicted and transformed separately, then combined to form the reconstructed block. If the resulting cost (determined by the combined cost of the two constituent blocks plus one bit for signaling) is lower, the block is coded using the mixed block algorithm.

Experiments have shown that the algorithm obtains up to a 1 dB increase in PSNR, and that it benefits the most when encoding images with a lot of text in pixelated fonts. Better contrast can be observed for UI elements in screen captures.

Matching pursuit

A key element in the algorithm described above is the transformation step for the separated blocks. After masking, the blocks are no longer square, with masked “holes” in them. Several algorithms for filling the missing samples have been proposed, with varying efficiency [15][16]. Instead of filling in the samples, Nautsch and Ostermann proposed using the matching pursuit algorithm [5] to directly find the coefficients of each incomplete block.

Matching pursuit is an iterative algorithm that finds the most significant coefficients one at a time, reconstructing the signal from the transform basis and recalculating the error until it falls below a criterion. The result is a set of transform coefficients that approximate the non-masked samples of the incomplete block. [4] showed that matching pursuit performs well compared to other sample filling algorithms.
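A minimal sketch of matching pursuit over a masked block is given below, assuming a 2-D DCT basis built with NumPy (an illustrative reimplementation, not the HM-integrated code of this project). It greedily picks the basis image most correlated with the remaining error on the unmasked samples, and stops after a fixed number of iterations or when the error is small enough, mirroring the safeguards discussed in section 3.2.

    import numpy as np

    def dct_basis(n):
        """Return the n*n 2-D DCT-II basis images of an n x n block,
        indexed as basis[k, l, y, x]."""
        x = np.arange(n)
        scale = np.sqrt(np.where(np.arange(n) == 0, 1.0 / n, 2.0 / n))
        cos = np.cos(np.pi * (2 * x[None, :] + 1) * np.arange(n)[:, None] / (2 * n))
        cos = scale[:, None] * cos                   # 1-D basis, rows = frequency
        return np.einsum("ky,lx->klyx", cos, cos)    # separable 2-D basis images

    def matching_pursuit(block, mask, max_iter=64, tol=1.0):
        """Greedy matching pursuit on the unmasked samples of `block`.
        `mask` is a boolean array selecting the samples belonging to this layer;
        DCT coefficients approximating those samples are returned."""
        n = block.shape[0]
        coeffs = np.zeros((n, n))
        residual = np.where(mask, block.astype(float), 0.0)
        masked_basis = dct_basis(n) * mask           # atoms restricted to the layer
        norms = np.sum(masked_basis ** 2, axis=(2, 3))
        norms[norms == 0] = np.inf                   # never pick atoms with no support
        for _ in range(max_iter):
            # correlation of the residual with every masked atom
            corr = np.einsum("yx,klyx->kl", residual, masked_basis)
            k, l = np.unravel_index(np.argmax(np.abs(corr) / np.sqrt(norms)), corr.shape)
            step = corr[k, l] / norms[k, l]
            coeffs[k, l] += step
            residual -= step * masked_basis[k, l]
            if np.sum(residual ** 2) < tol * mask.sum():
                break
        return coeffs

The iteration cap plays the same role as in the thesis implementation: because the masked atoms are not orthogonal, the greedy updates may otherwise keep circling without ever reaching the error bound.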

Example Below is an example of the matching pursuit algorithm performed on an 8×8 block, shown in figure 2.5. First a mask is created (figure 2.6, border added for clarity) by comparing the luma value of each pixel to the average over the block, which is a rough way to classify pixels as foreground or background.

Figure 2.5: Original image
Figure 2.6: Mask

With the image and the mask, the matching pursuit algorithm finds the coefficients that minimize the mean square error between the inverse transformed block and the original image. Note that for illustration, the prediction step is ignored and we transform the original image directly. The resulting coefficients, when inverse transformed with the DCT, yield reconstructed blocks of the foreground and background components (figure 2.7 and figure 2.8).

The final step is simply to combine the two reconstructed blocks, taking pixels from either block according to the mask. The result is figure 2.9, which is our reconstruction of the original image.
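Using the matching_pursuit and dct_basis sketches above, the whole example (mask creation, per-layer matching pursuit and recombination) can be written as follows; again this only illustrates the principle, not the encoder integration:

    import numpy as np

    def reconstruct_from_coeffs(coeffs):
        """Inverse of the 2-D DCT used in the sketch above."""
        n = coeffs.shape[0]
        return np.einsum("kl,klyx->yx", coeffs, dct_basis(n))

    def mixed_block_example(block):
        """Mask the block by its mean luma, run matching pursuit on each layer
        and merge the two layer reconstructions according to the mask."""
        mask_fg = block > block.mean()               # rough foreground/background split
        coeffs_fg = matching_pursuit(block, mask_fg)
        coeffs_bg = matching_pursuit(block, ~mask_fg)
        recon_fg = reconstruct_from_coeffs(coeffs_fg)
        recon_bg = reconstruct_from_coeffs(coeffs_bg)
        return np.where(mask_fg, recon_fg, recon_bg)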


Figure 2.7: Reconstructed foreground block
Figure 2.8: Reconstructed background block

Figure 2.9: Combined reconstructed blocks


3 Method

In this chapter the implementation of the project is discussed, including the HEVC reference encoder upon which the project is built, and the test environment used to evaluate the resulting coding efficiency.

3.1 HEVC Test Model

The algorithm is implemented on top of the HEVC Test Model (HM), the reference encoder and decoder suite for the HEVC standard. HM contains a standalone encoder application that can produce standard-compliant HEVC video from raw video input (uncompressed image sequences), as well as the resulting decoded video.

HM is intended as a demonstration of the efficiency of HEVC, and can produce very good results at the cost of longer encoding times compared to production-quality encoders. During development of the HEVC standard, HM serves as a workbench for evaluating various coding tools and trade-offs. Each new coding tool and optimization is experimented with in HM before being formally proposed for the HEVC standard. Thus it is well suited as the test environment for this project.

In the standard setting we use for testing, the HM encoder works in two passes. For each CTU in a frame, the best CTU structure and prediction modes are first decided during the trial compression stage, by recursively building all possible CTU tree structures and test encoding each of the possible combinations of prediction modes. Then the best parameters, the ones with the least cost during rate-distortion optimization, are used to encode the CTU, producing the bitstream.

To test our proposed algorithm, the HM encoder pipeline is modified to produce results from both the original HEVC standard and our modified algorithm. We can then analyze the gains we obtain by comparing the bitrate and distortion of both algorithms.

3.2 Implementation

Implementing the original algorithm was not straightforward, and some decisions had to be researched and made, including the following:


Macroblock and CTU

The original algorithm works on a per-macroblock basis, which is natural since the macroblock is the main coding unit of the AVC standard. In HEVC there is no macroblock, and we need to decide on which unit the matching pursuit algorithm should be applied: the CU, PU or TU, as described in section 2.2.

The masking step could be performed on units larger than the other steps, but for simplicity the masking step is performed on the same unit as the prediction and matching pursuit steps, right before prediction.

The prediction step is most naturally performed on PUs, in line with the semantics of the HEVC coding tree structure. The transform step, if we follow HEVC semantics, should be performed on TUs. For simplicity, all three steps are performed on each PU, but other assignments could possibly work, for example calculating a mask per CU and then doing the matching pursuit on smaller units within that CU.

Transform: scaling and rounding

The transforms used in HEVC are scaled integer approximations of the DCT and DST. For each iteration of matching pursuit, the basis needs to be scaled properly. With the integer transform, the matching pursuit algorithm can sometimes fail to converge, due to accumulated errors. This is prevented by limiting the number of iterations and setting a less strict upper bound on the error.

Entropy encoder context

The modified CABAC encoder in HEVC, named Syntax-based context-adaptive Binary Arithmetic Coding (SBAC), depends heavily on context, and provides sophisticated interfaces for loading and saving the state of the entropy encoder. Because of this, the trial compression stage used to determine the coding tree structure and prediction modes does not report the accurate final encoded bit count, but only serves as a tool for the RDO process. It also prevents integrating other coding tools into the stream, since doing so would disturb the context and undermine the integrity of the HEVC bitstream.

Here a compromise is made by evaluating the cost of mixed block coding only during the trial compression phase. The costs are comparable to those of normally encoded PU blocks, since they share the same context at the moment of compression, and the context does not propagate (the trial compression phase resets the context for each coding unit). The resulting bits saved are not necessarily the same as they would have been if a stream containing mixed block coding were actually integrated, but they give a rough estimate of the cost differences from using the algorithm.

Mode decision

The cost is obtained from the cost function used by HEVC:

    D + λ · R

where D is the sum of squared errors and R is the number of bits used; λ depends on the current QP value. For mixed blocks, the sum of squared errors is the error of the combined block, and the bits used include the bits for the mask, the bits for the foreground block coefficients and the bits for the background coefficients. For comparison with single block coding, the 1 bit for flagging mixed block encoding is omitted.

The resulting bitrate saving is estimated by summing up the difference in bits between single and mixed block coding during trial compression, whenever mixed block coding is selected. As stated in the previous section, this is not the real number of bits used in the final compression, but it is comparable.
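A sketch of that decision is shown below; cost_single and cost_mixed are hypothetical helpers returning (distortion, bits) for the two ways of coding a PU, not functions of the actual HM code:

    def choose_mixed_or_single(pu, lam, cost_single, cost_mixed):
        """Compare single block coding against mixed block coding for one PU.
        cost_mixed is assumed to return the combined distortion of the foreground
        and background layers, and the bits for the mask plus both sets of
        coefficients; the one-bit signaling flag is ignored, as described above."""
        d_single, r_single = cost_single(pu)
        d_mixed, r_mixed = cost_mixed(pu)
        j_single = d_single + lam * r_single
        j_mixed = d_mixed + lam * r_mixed
        use_mixed = j_mixed < j_single
        bits_saved = (r_single - r_mixed) if use_mixed else 0
        return use_mixed, bits_saved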


3.3 Changes from the original algorithm

Some changes were made to the original algorithm described in [4], due to the inherent differences between AVC and HEVC, and the complexity of having to rewrite significant parts of the encoder pipeline.

Mixed block prediction

In the original implementation, mixed blocks are predicted such that if surrounding blocks are also coded as mixed blocks, the reference samples are taken from the corresponding layer. For example, if the previous block is a mixed block, and the current block prediction uses reference samples from that block, then the foreground block is predicted using the previously reconstructed foreground samples, and vice versa.

In this project the reference samples are always taken from the final reconstruction of the reference blocks, due to the complexity of the availability patterns that arise in a coding tree structure. The foreground and background blocks still use separate prediction modes.

3.4 Test environment

Three images were chosen from the test videos for the HEVC-SCC evaluations. These include a screenshot of a computer desktop with office software open and a lot of text; video footage of a sports event, with scrolling text at the bottom of the frame; and a screenshot of a computer game, with UI elements and some text.

These test images represent various scenarios where matching pursuit may prove beneficial, and are also common use cases of SCC, including screen text, UI overlays and computer graphics. As we only evaluate the algorithm under intra-prediction mode, a single frame from each video is chosen, since we do not depend on other frames during encoding.

The configurations are taken from the main intra coding configuration file detailed in [17], which serves as a common test condition for evaluating coding efficiency. The QP value of the encoder is set from 27 to 45, in increments of 3. This gives eight data points for each test image, which are then used to plot the PSNR and bitrate gain from using the algorithm.
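Each point in the rate-PSNR plots of the next chapter corresponds to one such encode and is computed as follows (a standard 8-bit PSNR computation; the names are illustrative):

    import numpy as np

    def rate_psnr_point(original, decoded, stream_bits):
        """One data point for a rate-PSNR plot: bits per pixel and PSNR of an
        8-bit grayscale image against its decoded reconstruction."""
        mse = np.mean((original.astype(float) - decoded.astype(float)) ** 2)
        psnr = 10.0 * np.log10(255.0 ** 2 / mse)
        bits_per_pixel = stream_bits / original.size
        return bits_per_pixel, psnr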


4 Evaluation

The results from encoding the three test images are shown below, including the original image and the corresponding rate–PSNR graph.

Each test image is used as input to the encoder, with both the original HEVC and our modified algorithm running. Several quantization parameter (QP) values are used, to show the results under different rate requirements.

Test Image 1 The first test image is a screenshot of a typical office setting, with a lot of text and UI elements. The background is mostly solid colors or smooth gradients. It is expected that the matching pursuit algorithm can improve efficiency on the text parts, while simple smooth textures still benefit from the large block sizes of HEVC.

Figure 4.1: Test Image 1


In figure 4.2, the PSNR curve for mixed block coding with matching pursuit dips slightly below the single block coding curve at some points, but is around 3 dB better at lower QP.

[Rate–PSNR plot: PSNR [dB] versus rate [bits/pixel] for single block coding and matching pursuit.]

Figure 4.2: Test Image 1 Result

Test Image 2 The second image is camera-captured footage with a strip of text overlay and a watermark. Here we compare the algorithm against HEVC in terms of large text over smooth background gradients.

Figure 4.3: Test Image 2

The two curves in figure 4.4 coincide, indicating that mixed block coding almost always performs worse than the normal coding methods.


[Rate–PSNR plot: PSNR [dB] versus rate [bits/pixel] for single block coding and matching pursuit.]

Figure 4.4: Test Image 2 Result

Test Image 3 The third test image is a screenshot from a computer game, with many UI elements and text. The scene has some grainy backgrounds, but also objects with clearly defined edges, such as the car and the silhouettes of trees and structures. The wires in the top left corner are close to what the mixed block coding algorithm is designed for: very different properties of foreground and background texture, so it is interesting to see how this compares to normal HEVC.

Figure 4.5: Test Image 3

In figure 4.6 the mixed block coding algorithm has nearly 1 dB better PSNR than normal single block coding at lower QP.


[Rate–PSNR plot: PSNR [dB] versus rate [bits/pixel] for single block coding and matching pursuit.]

Figure 4.6: Test Image 3 Result


5 Results and Discussion

In this chapter the results from the experiments, as well as the methodology, are discussed.

5.1 Results

The three test images were chosen based on their content with regard to screen content coding.

The first test image (figure 4.1) is an example of a typical office computer environment. The majority of the text is in bitmap fonts, which the masking step can capture and reconstruct perfectly, leading to the very high PSNR. This can be considered the best case scenario for this particular algorithm.

The second test image (figure 4.3) is mostly natural imagery, with only a few overlaid texts and a bar of gradually changing texture. HEVC can code the material very well, so the mixed block coding method never won out during the trial compression RDO.

The third test image (figure 4.5) has less text than the first one, but more complicated UI elements. The text is also in a bitmap font, which makes it easy to capture with the masking step. Also notice that the wires in the top left corner against the sky are completely pixelated, so the mixed block coding method is often chosen in that area as well.

In general, images with strong contrast in small details, such as bitmap font text and pixelated UI elements, benefit more from our algorithm. If the test images contained more on-screen text, we could gain more by using this masked approach.

5.2 Method

Although the expected results were obtained, there are some quirks in the implementation of the project. In the first test image, mixed block coding at one of the data points seems to have slightly worse coding efficiency than normal HEVC. As previously discussed, the decision to use the mixed block coding mode is made during the trial compression stage, when the contexts for the entropy coder are not correct. It is entirely possible that the choice could turn out to be wrong in the actual encoding phase, when all decisions on coding tree structure and prediction modes have been made.

On a similar note, there is no guarantee that the number of bits reported during trial compression is correct. Since the context model is not designed with mixed block coding in mind, it is likely that integrating the method into the final encoding would degrade the coding efficiency of other blocks.

For mixed blocks, separating not only the prediction modes but also the reference samples, as described in section 3.3, could provide further improvements in coding efficiency.

The implementation is far from practical. The algorithm takes more than 200 seconds on a single 1920×1080 frame, on a commodity personal computer from 2015 running single-threaded, so it would require aggressive optimization to gain wider use. One such optimization would be to incorporate the mixed coding algorithm into the rough mode decision (RMD) and most probable mode (MPM) decision process, to reduce the number of prediction modes that need full RDO.

HEVC is designed with hardware implementations in mind, so memory could also become a bottleneck, since in the worst case more than double the memory is needed to hold the reference samples.


6 Conclusion

In chapter 3, the problems encountered when adapting the algorithm to HM were discussed, as well as the decisions made to overcome or circumvent them. Chapter 5 presented the coding efficiency gains of the algorithm applied to HEVC, and commented on the possible reasons behind them.

In this thesis the mixed block coding algorithm with matching pursuit has been implemented on top of the HEVC reference encoder, with some minor changes. The experimental results show that under certain circumstances the algorithm can outperform the normal encoding routines of HEVC, albeit with a compression speed penalty.

The results show that MRC-derived methods, while not a new concept, remain adequate coding tools when it comes to screen content coding. Combined with matching pursuit, they show potential for improving the compression ratio of pixel-oriented material. With the continuous advancement of video encoding methods, this remains an effective coding tool worth considering.


Bibliography

[1] Benjamin Bross, Woo-Jin Han, Jens-Rainer Ohm, Gary J Sullivan, Ye-Kui Wang, and Thomas Wiegand. “High efficiency video coding (HEVC) text specification draft 10”. In: JCTVC-L1003 1 (2013).

[2] Gary J Sullivan, Jens Ohm, Woo-Jin Han, and Thomas Wiegand. “Overview of the high efficiency video coding (HEVC) standard”. In: IEEE Transactions on Circuits and Systems for Video Technology 22.12 (2012), pp. 1649–1668.

[3] Jizheng Xu, Rajan Joshi, and Robert A Cohen. “Overview of the emerging HEVC screen content coding extension”. In: IEEE Transactions on Circuits and Systems for Video Technology 26.1 (2016), pp. 50–62.

[4] Harald Nautsch and Jörn Ostermann. “Transform Coding of Compound Images Using Matching Pursuit”. In: Proceedings of the 2012 Picture Coding Symposium. 2012.

[5] S. Mallat and Z. Zhang. “Matching pursuit with time-frequency dictionaries”. In: IEEE Transactions on Signal Processing. Vol. 41. 1993, pp. 3397–3415.

[6] H.261: Video codec for audiovisual services at p×384 kbit/s. Recommendation. The International Telegraph and Telephone Consultative Committee, Nov. 1988.

[7] Liang Zhao, Li Zhang, Siwei Ma, and Debin Zhao. “Fast mode decision algorithm for intra prediction in HEVC”. In: Visual Communications and Image Processing (VCIP), 2011 IEEE. IEEE. 2011, pp. 1–4.

[8] Thaísa L Da Silva, Luciano V Agostini, and Luis A da Silva Cruz. “Fast HEVC intra prediction mode decision based on EDGE direction information”. In: Signal Processing Conference (EUSIPCO), 2012 Proceedings of the 20th European. IEEE. 2012, pp. 1214–1218.

[9] Dongdong Zhang, Youwei Chen, and Ebroul Izquierdo. “Fast intra mode decision for HEVC based on texture characteristic from RMD and MPM”. In: Visual Communications and Image Processing Conference, 2014 IEEE. IEEE. 2014, pp. 510–513.

[10] Jani Lainema, Frank Bossen, Woo-Jin Han, Junghye Min, and Kemal Ugur. “Intra coding of the HEVC standard”. In: IEEE Transactions on Circuits and Systems for Video Technology 22.12 (2012), pp. 1792–1801.

[11] Jewon Kang, Rajan Joshi, Joel Sole, and Marta Karczewicz. “Explicit residual DPCM for screen contents coding”. In: Consumer Electronics (ISCE 2014), The 18th IEEE International Symposium on. IEEE. 2014, pp. 1–2.


[12] T.44: Mixed Raster Content (MRC). Recommendation. ITU-T Study Group 16, Jan. 2005.

[13] ISO/IEC 16485:2000 Information technology – Mixed Raster Content (MRC). Standard.International Organization for Standardization, Sept. 2000.

[14] Ricardo de Queiroz, Robert Buckley, and Ming Xu. “Mixed Raster Content (MRC) Model for Compound Image Compression”. In: Visual Communications and Image Processing ’99. 1998.

[15] Gopal Lakhani and Rajesh Subedi. “Optimal filling of FG/BG layers of compound document images”. In: Image Processing, 2006 IEEE International Conference on. IEEE. 2006, pp. 2273–2276.

[16] Ricardo L De Queiroz. “On data filling algorithms for MRC layers”. In: Image Processing, 2000. Proceedings. 2000 International Conference on. Vol. 2. IEEE. 2000, pp. 586–589.

[17] Frank Bossen. “Common test conditions and software reference configurations”. In: Joint Collaborative Team on Video Coding (JCT-VC) of ITU-T SG16 WP3 and ISO/IEC JTC1/SC29/WG11, 5th meeting, Jan. 2011.
