H.264/AVC Based Perceptual Video Coding
Design, Implementation and Evaluation
Ana Sofia Clemente Cabrita
Dissertation for obtaining the Master of Science Degree in
Electrical and Computer Engineering
Jury
President: Prof. José Bioucas Dias
Supervisor: Prof. Fernando Pereira
Co-Supervisor: Dr. Matteo Naccari
Member: Prof. Luís Ducla Soares
July 2010
Acknowledgments
First and foremost, I would like to thank my thesis supervisor, Prof. Fernando Pereira, for his valuable
guidance, advice and dedication to this work.
I would also like to thank the whole IT Image Group for their support, especially Matteo Naccari and
Catarina Brites, for their availability and help with technical issues when I needed it most.
Finally, an honorable mention goes to my family and friends for their understanding and support in
completing this project. Without the help of all those mentioned above, I would have faced many
difficulties in carrying out this work. A deep acknowledgement to all of them, as well as to everyone
close to me who, in their own way, contributed to the development of this Thesis.
To my parents
Abstract
Nowadays, the demand for higher video compression efficiency keeps growing, following the increasingly
wide-scale deployment of applications where video compression plays a key enabling role. Some of these
important video applications are part of our daily life, notably the storage of films and video games on
DVD and Blu-ray discs, Internet video streaming such as YouTube, digital television over various
broadcasting channels, mobile TV, and real-time applications such as videotelephony and
videoconferencing.
Current video compression standards are mainly optimized for objective quality metrics such as the
mean squared error, which do not take into account the features of the human visual system. The challenge
addressed in this Thesis regards the inclusion, in the state-of-the-art H.264/AVC video compression
standard, of additional coding tools able to increase its rate-distortion performance by exploiting the
perceptual features of the human visual system, in short, to provide the same subjective
quality at a lower bitrate.
This challenge has been addressed by introducing perceptually driven modifications into the basic
H.264/AVC architecture. First, a perceptual coefficients pruning method was implemented, which uses the
just noticeable distortion (JND) thresholds provided by a selected JND model to set some of the DCT
coefficients to zero. Afterwards, a JND adaptive quantization method was implemented, which adapts the
initial quantization parameter (QP) based on the same JND thresholds, with the target of avoiding the
coding of video data which is not perceptually relevant.
The main conclusion of the work reported in this Thesis is that it is worthwhile to adopt a perceptual
approach in the optimization of the H.264/AVC video codec, reaching bitrate reductions of up to
approximately 35% for the same perceptual quality.
Keywords: H.264/AVC; perceptual video coding; just noticeable distortion model; coefficients pruning;
adaptive quantization
Resumo
Nowadays, the needs in terms of higher video compression efficiency keep growing, following the
large-scale increase in the number of applications where video compression technologies play a
fundamental role. Some of the most important video applications are already part of daily life, notably
the recording of films and video games on DVD and Blu-ray discs, Internet video streaming such as
YouTube, digital television using various transmission media, mobile television and real-time
applications such as videotelephony and videoconferencing.
Current video compression standards were essentially optimized for objective quality metrics, such as
the mean squared error, which do not take into account the characteristics of the human visual system.
The main challenge considered in this Thesis consists in introducing, into the basic architecture of the
H.264/AVC video compression standard, additional perceptual tools aimed at increasing its quality versus
bitrate performance. With this objective, a perceptual coefficient selection method was first
implemented, which uses the thresholds provided by a selected just noticeable distortion model to
eliminate some of the DCT coefficients deemed not perceptually relevant. Next, an adaptive quantization
model for the DCT coefficients was implemented, which adapts the initial quantization step based on the
same thresholds with the objective of avoiding the coding of video information which is not perceptually
relevant, by controlling the precision of the transmitted coefficients.
Thus, the main conclusion of the work presented in this Thesis is that it is worthwhile to adopt a
perceptual approach to optimize the H.264/AVC video codec, reaching bitrate reductions of up to
approximately 35% for the same perceptual quality.
Keywords: H.264/AVC; perceptual video coding; just noticeable distortion model; coefficient selection;
adaptive quantization
Table of Contents
Chapter 1 ...................................................................................................................................................... 1
Introduction ................................................................................................................................................... 1
1.1. Context and Motivation ................................................................................................................. 1
1.2. Background ................................................................................................................................... 2
1.3. Objective ....................................................................................................................................... 4
1.4. Thesis Organization ...................................................................................................................... 4
Chapter 2 ...................................................................................................................................................... 5
Reviewing Perceptual Video Coding Concepts and Tools ........................................................................... 5
2.1. Brief Overview on the Human Visual System ............................................................................... 5
2.1.1. Human Visual System Features ........................................................................................... 5
2.1.2. Perceptual Models ................................................................................................................ 9
2.2. Brief Overview on Video Coding Standards................................................................................ 10
2.3. Reviewing Perceptual Video Coding Solutions ........................................................................... 18
2.3.1. H.264/AVC Perceptual Video Coding based on a Foveated JND Model ........................... 18
2.3.2. H.264/AVC Coding with JND Model based Coefficients Filtering ....................................... 24
2.3.3. H.264/AVC Inter Coding based on Structural Similarity driven Motion Estimation ............. 28
2.3.4. H.264/AVC Bitrate Control based on 4D Perceptual Quantization Modeling ..................... 33
Chapter 3 .................................................................................................................................................... 41
A JND Model based Coefficients Pruning Method for H.264/AVC Video Coding ...................................... 41
3.1. The Perceptual Coefficients Pruning Method ............................................................................. 41
3.1.1. Objective ............................................................................................................................. 41
3.1.2. Architecture ......................................................................................................................... 41
3.1.3. Walkthrough ........................................................................................................................ 42
3.1.4. Novel Tools Description ...................................................................................................... 43
3.2. Performance Evaluation .............................................................................................................. 47
3.2.1. Test Conditions ................................................................................................................... 47
3.2.2. Results and Analysis ........................................................................................................... 49
3.2.3. Conclusion........................................................................................................................... 62
Chapter 4 .................................................................................................................................................... 64
A JND Model based Adaptive Quantization Method for H.264/AVC Video Coding ................................... 64
4.1. The JND Adaptive Quantization Method..................................................................................... 64
4.1.1. Objective ............................................................................................................................. 64
4.1.2. Architecture ......................................................................................................................... 64
4.1.3. Walkthrough ........................................................................................................................ 66
4.1.4. Description of New Tool ...................................................................................................... 66
4.2. Performance Evaluation .............................................................................................................. 68
4.2.1. Test Conditions ................................................................................................................... 68
4.2.2. Results and Analysis ........................................................................................................... 70
4.2.3. Conclusion........................................................................................................................... 87
Chapter 5 .................................................................................................................................................... 88
Conclusions and Future Work ..................................................................................................................... 88
5.1. Summary and Conclusions ......................................................................................................... 88
5.2. Future Work ................................................................................................................................ 89
References .................................................................................................................................................. 91
Annex A ....................................................................................................................................................... 94
Annex B ....................................................................................................................................................... 97
Index of Figures
Figure 2.1 – Eye structure [3] ........................................................................................................................ 6
Figure 2.2 – Internal layer structure of the eye [6] ........................................................................................ 7
Figure 2.3 – For the same image: (a) The noise is added randomly, (b) The noise power is weighted on
the texture activity of each block [11] ............................................................................................................ 8
Figure 2.4 – For the same image: (a) The noise is added randomly, (b) The noise power is weighted
based on the intensity contrast of each MB [11] ........................................................................................... 8
Figure 2.5 – Variation of the contrast sensitivity function with the spatial frequency [9] .............................. 8
Figure 2.6 – Simultaneous contrast: The two smaller squares have equal luminance although the right
one appears brighter [9] ................................................................................................................................ 9
Figure 2.7 – Perceptual factors: (a) Proximity; (b) Similarity; (c) Closure; (d) Symmetry [12] ..................... 9
Figure 2.8 – Weber's law [9] ....................................................................................................................... 10
Figure 2.9 – Chronology of the video recommendations/standards developed by ITU-T VCEG and
ISO/IEC MPEG [16] .................................................................................................................................... 11
Figure 2.10 – Ellipse [25] ............................................................................................................................ 18
Figure 2.11 – Architecture of the H.264/AVC Perceptual Video Coding based on a Foveated JND Model
solution ........................................................................................................................................................ 19
Figure 2.12 – DMOS comparisons for the H.264/AVC based coding solutions [23]. ................................. 23
Figure 2.13 – Portions of decoded frames for the test sequence Stefan. Stefan frame coded with (a) JM
(e) FJND; Fixation point of Stefan frame coded with (b) JM (f) FJND; Texture region away from fixation
point coded with (c) JM (g) FJND; Non-fixation point coded with (d) JM (h) FJND [23] ............................. 23
Figure 2.14 – Architecture of the H.264/AVC Coding with JND Model based Coefficients Filtering solution
.................................................................................................................................................................... 24
Figure 2.15 – Bitrate changes at different QP for I, P, and B frames [26] .................................................. 27
Figure 2.16 – Architecture of the H.264/AVC Inter Coding based on SSIM driven ME solution ................ 29
Figure 2.17 – Illustration of the H.264/AVC Bitrate Control based on 4D Perceptual Quantization
Modeling rate control main components (major revisions are marked by double stars) [28] ..................... 33
Figure 2.18 – Architecture of the H.264/AVC Bitrate Control based on 4D Perceptual Quantization
Modeling solution ........................................................................................................................................ 34
Figure 2.19 – H.264/AVC bitrate control procedure using the 4D perceptual quantization model [28] ...... 35
Figure 2.20 – Comparison of the image quality for the Carphone sequence at 24 kbps (significant
differences are marked with black circles) [28] ........................................................................................... 40
Figure 2.21 – Comparison of the image quality for the Foreman sequence at 128 kbps (significant
differences are marked with black circles) [28] ........................................................................................... 40
Figure 3.1 – Improved H.264/AVC codec architecture including the JND model and the coefficients
pruning method. .......................................................................................................................................... 42
Figure 3.2 – First frame of each test sequence: (a) Foreman; (b) Mobile; (c) Panslow; (d) Spincalendar;
(e) Playing_cards; (f) Toys_and_calendar .................................................................................................. 48
Figure 3.3 – PSNR RD performance for the Foreman sequence .............................................................. 50
Figure 3.4 – PSNR RD performance for the Mobile sequence ................................................................... 50
Figure 3.5 – PSNR RD performance for the Panslow sequence ................................................................ 51
Figure 3.6 – PSNR RD performance for the Spincalendar sequence ........................................................ 51
Figure 3.7 – PSNR RD performance for the Playing_cards sequence ....................................................... 51
Figure 3.8 – PSNR RD performance for the Toys_and_calendar sequence.............................................. 52
Figure 3.9 – MS-SSIM RD performance for the Foreman sequence ......................................................... 53
Figure 3.10 – MS-SSIM RD performance for the Mobile sequence ........................................................... 53
Figure 3.11 – MS-SSIM RD performance for the Panslow sequence ........................................................ 53
Figure 3.12 – MS-SSIM RD performance for the Spincalendar sequence ................................................. 54
Figure 3.13 – MS-SSIM RD performance for the Playing_cards sequence ............................................... 54
Figure 3.14 – MS-SSIM RD performance for the Toys_and_calendar sequence ...................................... 54
Figure 3.15 – Average number of zeroed coefficients for the Foreman sequence .................................... 56
Figure 3.16 – Average number of zeroed coefficients for the Mobile sequence ........................................ 56
Figure 3.17 – Average number of zeroed coefficients for the Panslow sequence ..................................... 56
Figure 3.18 – Average number of zeroed coefficients for the Spincalendar sequence ............................. 57
Figure 3.19 – Average number of zeroed coefficients for the Playing_cards sequence ............................ 57
Figure 3.20 – Average number of zeroed coefficients for the Toys_and_calendar sequence ................... 57
Figure 3.21 – Average zigzag position of the zeroed coefficients for the 4x4 blocks in the Foreman
sequence ..................................................................................................................................................... 59
Figure 3.22 – Average zigzag position of the zeroed coefficients for the 4x4 blocks in the Mobile
sequence ..................................................................................................................................................... 59
Figure 3.23 – Average zigzag position of the zeroed coefficients for the 4x4 blocks in the Panslow
sequence ..................................................................................................................................................... 60
Figure 3.24 – Average zigzag position of the zeroed coefficients for the 4x4 blocks in the Spincalendar
sequence ..................................................................................................................................................... 60
Figure 3.25 – Average zigzag position of the zeroed coefficients for the 4x4 blocks in the Playing_cards
sequence ..................................................................................................................................................... 60
Figure 3.26 – Average zigzag position of the zeroed coefficients for the 4x4 blocks in the
Toys_and_calendar sequence .................................................................................................................... 61
Figure 3.27 – 4x4 block with the average zeroed positions highlighted ..................................................... 63
Figure 4.1 – Encoder architecture of the proposed solution with the JND related quantization modules .. 65
Figure 4.2 – Encoder architecture of the proposed solution including the JND related quantization and
DCT coefficients pruning modules .............................................................................................................. 65
Figure 4.3 – PSNR RD performance for the Foreman sequence ............................................................... 71
Figure 4.4 – PSNR RD performance for the Mobile sequence ................................................................... 71
Figure 4.5 – PSNR RD performance for the Panslow sequence ................................................................ 71
Figure 4.6 – PSNR RD performance for the Spincalendar sequence ........................................................ 72
Figure 4.7 – PSNR RD performance for the Playing_cards sequence ........................................................ 72
Figure 4.8 – PSNR RD performance for the Toys_and_calendar sequence.............................................. 72
Figure 4.9 – MS-SSIM RD performance for the Foreman sequence ......................................................... 75
Figure 4.10 – MS-SSIM RD performance for the Mobile sequence ........................................................... 75
Figure 4.11 – MS-SSIM RD performance for the Panslow sequence ........................................................ 75
Figure 4.12 – MS-SSIM RD performance for the Spincalendar sequence ................................................. 76
Figure 4.13 – MS-SSIM RD performance for the Playing_cards sequence ............................................... 76
Figure 4.14 – MS-SSIM RD performance for the Toys_and_calendar sequence ...................................... 76
Figure 4.15 – VQM RD performance for the Foreman sequence ............................................................... 78
Figure 4.16 – VQM RD performance for the Mobile sequence .................................................................. 79
Figure 4.17 – VQM RD performance for the Panslow sequence ............................................................... 79
Figure 4.18 – VQM RD performance for the Spincalendar sequence ........................................................ 79
Figure 4.19 – VQM RD performance for the Playing_cards sequence ...................................................... 80
Figure 4.20 – VQM RD performance for the Toys_and_calendar sequence ............................................. 80
Figure 4.21 – RP compensated VQM RD performance for the Foreman sequence .................................. 83
Figure 4.22 – RP compensated VQM RD performance for the Mobile sequence ...................................... 83
Figure 4.23 – RP compensated VQM RD performance for the Panslow sequence ................................... 84
Figure 4.24 – RP compensated VQM RD performance for the Spincalendar sequence ........................... 84
Figure 4.25 – RP compensated VQM RD performance for the Playing_cards sequence .......................... 84
Figure 4.26 – RP compensated VQM RD performance for the Toys_and_calendar sequence ................ 85
Index of Tables
Table 2.1 – Results of the FJND validation tests [23] ................................................................................. 23
Table 2.2 – Bitrate reduction for the JND-thresholded sequences and their MOS [26] ............................. 28
Table 2.3 – MEBSS results with QP=10 [27]. ............................................................................................. 32
Table 2.4 – Comparison of overall coding performance for the GOP IPPP pattern using PQrc and JM10.2
(average PSNR gain: 0.515 dB) [28] .......................................................................................................... 39
Table 2.5 – Comparison of overall coding performance for the GOP IBBP pattern using PQrc and JM10.2
(average PSNR gain: 0.35dB) [28] ............................................................................................................. 39
Table A-1 – Average rate, PSNR and MS-SSIM for each sequence for various RD points, and their
variation in percentage. ............................................................................................................................... 95
Table A-2 – Overall average of the average zigzag position for a 4x4 block and the variance of the
average zigzag position for a 4x4 block for each RD point......................................................................... 96
Table B-1 – Average rate for each sequence for various RD points for different codecs and their variation
in percentage. ............................................................................................................................................. 98
Table B-2 – Average PSNR for each sequence for various RD points for different codecs and their
variation in percentage. ............................................................................................................................... 99
Table B-3 – Average MS-SSIM for each sequence for various RD points for different codecs and their
variation in percentage. ............................................................................................................................. 100
Table B-4 – Average VQM for each sequence for various RD points for different codecs and their
variation in percentage. ............................................................................................................................. 101
Table B-5 – Average RP compensated VQM for each sequence for various RD points for different codecs
and their variation in percentage. .............................................................................................................. 102
List of Acronyms
AVC Advanced Video Coding
BCQ Bit Complexity Quantization
BPF Band-Pass Filter
CABAC Context-Adaptive Binary Arithmetic Coding
CAVLC Context-Adaptive Variable Length Coding
CD Compact Disc
CI Confidence Interval
CIF Common Intermediate Format
CRT Cathode Ray Tube
CSF Contrast Sensitivity Function
DCT Discrete Cosine Transform
DMOS Differential MOS
DP Data Partitioning
DSCQS Double Stimulus Continuous Quality Scale
DSIS Double Stimulus Impairment Scale
DVD Digital Video Disc
FJND Foveated JND
FMO Flexible Macroblock Ordering
GOP Group Of Pictures
GSTN General Switched Telephone Network
HVS Human Visual System
ICT Integer Discrete Cosine Transform
IEC International Electrotechnical Commission
ISDN Integrated Services Digital Network
ISO International Organization for Standardization
ITS Institute for Telecommunications Sciences
ITU International Telecommunication Union
ITU-T ITU Telecommunication standardization sector
JM Joint Model
JND Just Noticeable Distortion
JPEG Joint Photographic Experts Group
JVT Joint Video Team
MAD Mean Absolute Difference
MB MacroBlock
MC Motion Compensation
ME Motion Estimation
MEBSS Motion Estimation based on SSIM
MOS Mean Opinion Score
MPEG Moving Picture Experts Group
MSE Mean Squared Error
MSSIM Mean SSIM
MS-SSIM Multi-Scale SSIM
NAL Network Abstraction Layer
NTSC National Television System(s) Committee
PAL Phase Alternating Line
PQrc Perceptual Quantization Rate Control
PSNR Peak Signal to Noise Ratio
PSTN Public Switched Telephone Network
QCIF Quarter Common Intermediate Format
QP Quantization Parameter
RD Rate Distortion
RDO Rate Distortion Optimization
RP Resolving Power
SAD Sum of Absolute Differences
SCSF Spatial Contrast Sensitivity Function
SGI Silicon Graphics
SIF Source Input Format
SJND Spatial Just Noticeable Distortion
SNR Signal to Noise Ratio
SSIM Structural Similarity
TJND Temporal Just Noticeable Distortion
TV Television
VCEG Video Coding Experts Group
VHS Video Home System
VLC Variable Length Coding
VQM Video Quality Metric
Chapter 1
Introduction
This chapter presents the main objectives of this Thesis, after providing its context, motivation and
technical background. Finally, the structure of this report is presented.
1.1. Context and Motivation
Following the increasingly wide-scale deployment of daily life applications where video compression plays
a key enabling role, the demand for more video compression keeps growing. Some of these important video
applications are the storage of films and video games on DVD and Blu-ray discs, Internet video streaming
such as YouTube, digital television using various broadcasting channels, mobile TV, and real-time
applications such as videotelephony and videoconferencing.
Video compression, or video coding, technologies aim to efficiently represent digital video data in order
to ease its transmission and storage. This implies a complex balance between video quality, coding
bitrate, encoding and decoding complexity, robustness to data losses and errors, ease of editing, random
access, end-to-end delay, and a number of other relevant factors depending on the target application
scenario.
The main challenge in video coding is, thus, to reduce the compressed video data size for a target video
quality, or to maximize the video quality for a target coding rate; in other words, to find an efficient
balance between quality and bitrate. To compress the video data, compression algorithms exploit the
redundancy and irrelevance in the original (non-compressed) digital data through widely used tools such
as the discrete cosine transform (DCT), motion estimation and compensation, quantization and entropy
coding. For these tools to provide an efficient quality versus bitrate trade-off, advanced rate-distortion
(RD) mechanisms have to be used. These mechanisms also have to consider the relevant application
constraints in terms of available channel bandwidth and buffer sizes, which are limited in most
application scenarios. Hence the critical need to design rate control solutions that achieve the best
visual quality under the relevant constraints, which are deeply tied to the application [1].
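To make the role of the transform and quantization tools mentioned above concrete, the sketch below implements an orthonormal 4x4 2-D DCT with uniform scalar quantization and its inverse. This is a conceptual floating-point illustration only; H.264/AVC actually specifies an integer approximation of the DCT with its own scaling and QP-dependent step sizes.

```python
import math

N = 4  # block size, matching the H.264/AVC 4x4 transform

def _c(k):
    # Orthonormal scaling factors of the DCT-II
    return math.sqrt(1.0 / N) if k == 0 else math.sqrt(2.0 / N)

def dct2(block):
    """Orthonormal 2-D DCT-II of an NxN pixel block."""
    return [[_c(u) * _c(v) * sum(
                block[x][y]
                * math.cos((2 * x + 1) * u * math.pi / (2 * N))
                * math.cos((2 * y + 1) * v * math.pi / (2 * N))
                for x in range(N) for y in range(N))
             for v in range(N)] for u in range(N)]

def idct2(coefs):
    """Inverse of dct2: reconstructs the pixel block from its coefficients."""
    return [[sum(_c(u) * _c(v) * coefs[u][v]
                 * math.cos((2 * x + 1) * u * math.pi / (2 * N))
                 * math.cos((2 * y + 1) * v * math.pi / (2 * N))
                 for u in range(N) for v in range(N))
             for y in range(N)] for x in range(N)]

def quantize(coefs, step):
    """Uniform scalar quantization: the lossy step of the codec."""
    return [[round(c / step) for c in row] for row in coefs]

def dequantize(levels, step):
    return [[lvl * step for lvl in row] for row in levels]
```

For a flat 4x4 block of value 100, all the energy is compacted into the single DC coefficient (400 with this normalization) and the AC coefficients are zero, which is why the DCT is so effective on smooth image regions.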
Traditionally, video coding optimization has been based on simple objective quality metrics such as the
mean squared error (MSE) and the peak signal-to-noise ratio (PSNR); this has been the case for the
very popular MPEG and ITU-T video compression standards. These metrics do not accurately express
the perceived distortion/quality as assessed by human subjects through the Human Visual System (HVS),
and more appropriate quality metrics may be used to maximize the subjective quality for a certain target
bitrate. This maximization is especially interesting in the context of the H.264/AVC (Advanced Video
Coding) standard, which is nowadays widely used for all types of services and products, from mobile TV
to Blu-ray, and represents the state of the art in video coding.
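For reference, the two metrics just mentioned can be computed as follows; this is a minimal sketch for one-dimensional sequences of 8-bit samples.

```python
import math

def mse(ref, deg):
    """Mean squared error between two equally sized sample sequences."""
    assert len(ref) == len(deg)
    return sum((a - b) ** 2 for a, b in zip(ref, deg)) / len(ref)

def psnr(ref, deg, peak=255):
    """Peak signal-to-noise ratio in dB; higher means closer to the reference."""
    e = mse(ref, deg)
    return float('inf') if e == 0 else 10.0 * math.log10(peak * peak / e)
```

A uniform error of 10 grey levels gives MSE = 100 and a PSNR of roughly 28.1 dB regardless of the underlying image content, which is precisely why PSNR can disagree with perceived quality.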
1.2. Background
The most relevant background concepts and tools for this Thesis are the human visual system, with its
features and behaviors, together with the best available video compression solution, i.e. the
H.264/AVC standard.
The HVS includes the eye and also the part of the brain dedicated to processing visual information,
using memory and knowledge. The eye is structured into three layers, each with a specific function; the
most important elements are the photoreceptors, which transform luminous stimuli into nervous impulses.
Some HVS properties which have recently been exploited for video compression are texture masking,
intensity contrast masking, spatial frequency sensitivity and the preservation of object boundaries.
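As a minimal illustration of the luminance adaptation behavior underlying such masking effects, Weber's law states that the just noticeable luminance difference grows roughly in proportion to the background luminance. In the sketch below, the 2% Weber fraction is an illustrative assumption, not the JND model actually used later in this Thesis.

```python
def weber_jnd(background_luminance, weber_fraction=0.02):
    """Smallest luminance difference likely to be noticed on a uniform
    background, per Weber's law: delta_I / I is approximately constant.
    The 2% fraction is an illustrative value only."""
    return weber_fraction * background_luminance
```

Under this rule, a distortion of 2 grey levels is invisible on a background of luminance 100 but visible on a background of luminance 50, which is the basic intuition a JND model quantifies per DCT coefficient.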
Since the 90s, several video compression standards have been developed by the Video Coding Experts
Group (VCEG) from the International Telecommunications Union (ITU), more precisely the ITU
Telecommunication standardization sector (ITU-T), and the Motion Picture Experts Group (MPEG) from
the International Standards Organization/ International Electrotechnical Commission (ISO/IEC) [2]. It all
began with the ITU-T H.261 Recommendation which was designed to address bidirectional and
unidirectional visual communications at data rates multiples of 64 kbit/s, between 40 kbit/s and 2000
kbit/s, and had videotelephony and videoconference as main target applications [3]. Afterwards, the
MPEG-1 Video standard was designed to compress Video Home System (VHS) quality raw digital video,
that is, for a digital quality video equivalent to the VHS quality, down to around 1.2 Mbit/s without
excessive quality loss, so to allow digital movie storage in CD-ROMs [4] [5]. In 1994, after a joint effort
between the ITU-T and ISO/IEC, the H.262/MPEG-2 Video standard was developed targeting digital
television and, thus, interlaced video at higher rates and spatial resolutions than MPEG-1 Video, notably
up to High Definition (HD) [6] [7]. Soon after, the ITU-T H.263 Recommendation was developed to
optimize video quality for lower bitrates, especially targeting applications such as visual telephony in
copper telephone lines and mobile networks [8]. Meanwhile, ISO/IEC developed the MPEG-4 Visual
standard, providing increased compression efficiency but also adopting an object-based representation
framework with high flexibility and advanced interaction capabilities. The main MPEG-4 Visual standard
target applications ranged from digital television to mobile and Internet video streaming and games
[9].
More recently, the H.264/AVC video compression standard was developed with the intent of increasing by
50% the compression performance provided by all previously available video compression standards
while also providing a “network-friendly” video representation. The target applications included
conversational, e.g. videotelephony and videoconference, and non-conversational scenarios, e.g.
storage, broadcast and streaming, using bitrates from 50 kbps to 8 Mbps and more [10] [11] [12].
To achieve a superior video compression performance, the H.264/AVC standard specifies many
normative novel tools and also requires the additional implementation of some non-normative tools and
features [12]. Normative tools are those specified by the standard and whose implementation as in the
standard is essential for interoperability. On the contrary, the non-normative tools do not require a
normative specification since they are not essential for interoperability, implying that their precise
implementation is left to the implementer’s criterion. From the full H.264/AVC set of coding tools, a few
are especially important for the provided increased coding performance, notably:
• Variable block sizes for block prediction – block sizes from 4x4 to 16x16 are allowed to more
efficiently capture the properties of the video frame regions, notably in terms of motion.
• Smaller size (4x4) DCT transform – makes the transform coefficients more localized in space
since there is less spatial redundancy to exploit due to the better temporal prediction; as such,
these coefficients are able to better express the visual properties of a region in a frame.
• Quarter pixel Motion Estimation (ME) – provides a more accurate prediction of
translational motion, thus reducing the prediction error.
• De-Blocking In-Loop filter – helps to reduce the blocking artifacts typical of block-based video
codecs in a rather block adaptive way, especially at low bitrates.
• Flexible macroblock (MB) ordering – facilitates the grouping of MBs into slices which can be
used either for error resilience or more efficient video coding; this grouping may be performed
based on the perceptual importance of the MBs (for increased error resilience) or based on
similarity properties among the various MBs (for coding efficiency).
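To give a concrete flavor of the smaller 4x4 transform mentioned above, the following sketch applies the well-known H.264/AVC forward core transform matrix to a residual block. This is an illustrative sketch only: the per-coefficient scaling that H.264/AVC folds into the quantization stage is omitted, so the output is the unscaled integer transform, not a standard-compliant implementation; all function names are this sketch's own.

```python
# Sketch of the H.264/AVC 4x4 forward core transform Y = Cf * X * Cf^T.
# The per-coefficient scaling normally folded into quantization is omitted.

CF = [
    [1,  1,  1,  1],
    [2,  1, -1, -2],
    [1, -1, -1,  1],
    [1, -2,  2, -1],
]

def matmul(a, b):
    """Integer 4x4 matrix product."""
    return [[sum(a[i][k] * b[k][j] for k in range(4)) for j in range(4)]
            for i in range(4)]

def transpose(m):
    return [list(row) for row in zip(*m)]

def forward_transform(block):
    """Unscaled H.264/AVC 4x4 forward transform of a residual block."""
    return matmul(matmul(CF, block), transpose(CF))

# A flat residual block concentrates all its energy in the DC coefficient.
flat = [[3] * 4 for _ in range(4)]
coeffs = forward_transform(flat)
print(coeffs[0][0])  # DC term: 3 * 16 = 48; all other coefficients are 0
```

Since all integer arithmetic stays exact, the transform is losslessly invertible once the matching inverse transform and scaling are applied, which is what allows the standard to avoid the encoder/decoder drift of earlier floating-point DCT designs.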
Some of the above mentioned tools make the H.264/AVC coding standard a good candidate for
benefiting from the consideration of HVS perceptual aspects in the maximization of the achieved video
quality. In this Thesis, a perceptual video coder-decoder (codec) is any video coding solution somehow
exploiting the HVS characteristics to increase the codec performance. All the video coding standards
mentioned above will be further detailed in Section 2.2.
1.3. Objective
The main objective of this work is to design, implement and evaluate perceptually driven coding tools to
be integrated into a standard H.264/AVC video codec, targeting the improvement of the rate distortion
performance notably measured using advanced objective video quality metrics, ideally without
significantly increasing the encoding complexity. In this context, no formal subjective assessments are
foreseen considering the logistic complexity involved in performing this type of evaluation. The
improvement of the H.264/AVC video codec must be made through perceptually driven coding tools
based on a just noticeable distortion (JND) model, trying to only code video information which is
perceptually relevant. In this context, two perceptually driven tools will be developed, namely a DCT
coefficients pruning method and a JND adaptive quantization method. To evaluate the improvements,
notably in terms of RD performance, a set of objective quality metrics with different correlation with the
subjective assessment should be used, notably the PSNR, Multi-Scale Structural Similarity (MS-SSIM),
Video Quality Metric (VQM) and Resolving Power (RP) compensated VQM.
1.4. Thesis Organization
This Thesis reports in detail the design, implementation and assessment of perceptual video codecs
based on the H.264/AVC standard. The process is described in five chapters, including the current one,
which introduces the context, motivations, main background, objectives and structure of this Thesis.
After, Chapter 2 provides a detailed review on the perceptual video coding concepts and technologies in
the literature, starting with a review of the HVS properties, the available video coding standards and,
finally, the most relevant perceptual video coding solutions.
Next, Chapter 3 presents the first perceptually driven coding tool implemented in this work, notably a DCT
perceptual coefficients pruning mechanism targeting the elimination of non-perceptually relevant DCT
coefficients using a JND model. With this purpose, the architecture, walkthrough and metrics used are
first presented; after, the performance results are presented and analysed to derive the main conclusions
associated to the performance of the implemented method.
Chapter 4 presents another perceptually driven coding tool implemented in the context of a H.264/AVC
codec, notably a quantization mechanism introducing error according to a previously selected JND model.
As in the previous chapter, the architecture, walkthrough and relevant assessment metrics are presented
first and after the results are presented and analysed to derive the main conclusions.
Finally, Chapter 5 is reserved for the conclusions and the presentation of eventual further work.
Chapter 2
Reviewing Perceptual Video Coding Concepts and Tools
The first objective of this chapter is to briefly review the Human Visual System (HVS) structure,
associated features and properties. After, the evolution of video coding standards and recommendations
is reviewed, notably in terms of their goals and relative improvements. Finally, the most relevant
perceptual video coding solutions in the literature are described. As already mentioned in Chapter 1, in
the context of this Thesis, a perceptual video coder-decoder (codec) is any video coding solution
exploiting the HVS characteristics in some way to increase the codec performance; as it will be seen
later, there are several ways to design a perceptual video codec, depending on how the HVS features
and related tools impact and integrate in the video codec.
2.1. Brief Overview on the Human Visual System
The HVS includes not only the eye but also the part of the brain dedicated to process the visual
information, notably in terms of memory and knowledge. To understand how the human visual system
works, the anatomic features of the eye and, afterwards, the cognitive components responsible for the
perceptual vision, will be presented in the following.
2.1.1. Human Visual System Features
To better understand the anatomic features of the eye, its basic structure is presented in Figure 2.1. The
structure of the eye is formed by three layers:
• External layer, which includes two structures, notably the sclera and cornea with the functions of
eye movement and allowing the luminous rays to enter the eye, respectively;
• Middle layer, with three elements, notably the iris, ciliary muscle and choroid; they have the
function of nourishing the structures without their own irrigation, including the sensory elements of
the retina;
• Internal layer, named retina, which includes the photoreceptors (cones and rods) and the
sustentation cells, see Figure 2.2; they have the function of receiving the projected luminous rays.
Figure 2.1 – Eye structure [13]
Some of the main anatomic features of the eye are:
• Vision is constrained to the frequencies in the visible region, notably wavelengths λ ∈ [350, 780] nm;
• The photoreceptors are especially important since they transform the luminous stimuli into nervous
impulses.
o The cones are high-precision cells specialized to detect red, green, or blue light (day light
vision). They are generally located at the center of the retina in a region of high acuity, called
the fovea, see Figure 2.2;
o On the other hand, rods are very sensitive to changes in contrast, even in low light levels,
providing the black and white vision (night vision); consequently, they are imprecise in terms of
position detection, due to light scatter. Rods are generally located in the periphery of the retina,
see Figure 2.2 [14] [15] [16].
Figure 2.2 – Internal layer structure of the eye [16]
The HVS properties can be used to correct shortcomings of mathematical models used as distortion
metrics, such as the Mean Squared Error (MSE). The MSE expresses the mean squared difference
between the original and the decoded sequence. For every data point, the vertical distance
from the point to the corresponding y value on the curve fit (the error) is taken and squared [17]. In other
words, the MSE quantifies the difference between an estimator and the true value of the quantity being
estimated as presented in equation (1).
MSE(θ̂) = E[(θ̂ − θ)²] (1)
So, the MSE indicates a value of distortion; however, under certain conditions, the HVS can tolerate more
distortion than the MSE expresses. On the other hand, there are some types of distortion that the MSE
does not measure and express in the same way as they are perceived.
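To make the metric concrete, the following sketch (illustrative Python, with made-up sample values) computes the MSE between an original and a decoded signal, together with the PSNR derived from it for 8-bit samples; the function names are this sketch's own.

```python
import math

def mse(original, decoded):
    """Mean squared error between two equally sized sample sequences."""
    n = len(original)
    return sum((o - d) ** 2 for o, d in zip(original, decoded)) / n

def psnr(original, decoded, peak=255.0):
    """Peak signal-to-noise ratio in dB, assuming 8-bit samples by default."""
    m = mse(original, decoded)
    if m == 0:
        return float("inf")  # identical signals
    return 10.0 * math.log10(peak * peak / m)

# Illustrative values: a 4-sample "frame" and its distorted reconstruction.
orig = [100, 110, 120, 130]
dec = [101, 108, 120, 131]
print(mse(orig, dec))              # (1 + 4 + 0 + 1) / 4 = 1.5
print(round(psnr(orig, dec), 2))   # ≈ 46.37 dB
```

Note that the two errors of amplitude 1 and the single error of amplitude 2 contribute very differently to the sum, yet the metric is blind to where in the picture they occur, which is precisely the limitation the HVS-based metrics discussed in this Thesis try to address.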
Some of the most recently explored HVS properties are:
• Texture masking: HVS is less sensitive to details (distortion) in the image areas with intense
texture. This means that more noise can be tolerated in textured areas; see the examples in Figure
2.3(a) and (b), which show the same image, one with random noise added and the other with the noise
power weighted by the texture activity of each block.
• Intensity contrast masking: Intensity contrast regards the difference in the mean intensities of the
background and object, thus characterizing the intensity difference between the object and
background [18], in other words, for example the difference in brightness between the light and dark
areas of a picture. For lower/medium contrast areas, more noise can be hidden in the darker areas.
In the examples in Figure 2.4(a) and (b), the noise is first randomly added and after the noise power
is weighted based on the intensity contrast for each MB. It is important to notice that the visual
perception is sensitive to the luminance contrast but not to the absolute value of the luminance, as it
can be seen in Figure 2.6, where it is shown that the background luminance makes the brightness of
the same object appear differently [19].
• Spatial frequency sensitivity: The HVS acts as a band-pass filter (BPF) with a peak at around four
cycles per degree of visual angle and declining very fast; the spatial frequency regards the number
of bright plus dark bars per centimeter on the screen or per degree of visual angle. This HVS feature
allows hiding more noise in the areas with higher spatial frequencies. This is the main concept
behind the Contrast Sensitivity Function (CSF) used in Joint Photographic Experts Group (JPEG)
2000 [20]. Figure 2.5 shows the variation of the visual sensitivity as a function of the spatial
frequency.
• Preservation of object boundaries: The HVS is very sensitive to unpreserved object edges in a
scene. Usually, a bad selection of motion vector or an inappropriate selection of the coding mode is
the main cause for the edge-misalignment of a solid object in a scene. This type of distortion is more
likely to happen at very low bitrates [21].
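The texture masking effect illustrated in Figure 2.3 can be sketched in code: using the local variance of a block as a crude texture-activity measure, the noise power injected into each block is scaled by its relative activity, so busy blocks receive more noise than flat ones. This is an illustrative sketch only; the activity measure and all names are assumptions of this sketch, not a method from the literature reviewed here.

```python
import random

def block_variance(block):
    """Sample variance of a block, used here as a crude texture-activity measure."""
    n = len(block)
    mean = sum(block) / n
    return sum((x - mean) ** 2 for x in block) / n

def add_masked_noise(blocks, base_sigma=2.0, seed=0):
    """Scale the noise power of each block by its relative texture activity,
    so textured blocks receive more noise than flat ones (texture masking)."""
    rng = random.Random(seed)
    activities = [block_variance(b) for b in blocks]
    max_act = max(activities) or 1.0
    noisy = []
    for block, act in zip(blocks, activities):
        sigma = base_sigma * (act / max_act)  # flat block -> (almost) no noise
        noisy.append([x + rng.gauss(0.0, sigma) for x in block])
    return noisy

flat = [128.0] * 16       # no texture: variance 0, so no noise is added
busy = [0.0, 255.0] * 8   # strong texture: full noise power
out_flat, out_busy = add_masked_noise([flat, busy])
print(out_flat == flat)   # True: zero activity means zero noise power
```

A perceptual codec exploits the same principle in reverse: rather than adding noise, it allocates coarser quantization (i.e. larger coding error) to the blocks where that error is masked.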
(a)
(b)
Figure 2.3 – For the same image: (a) The noise is added randomly, (b) The noise power is weighted on the texture activity of each block [21]
(a)
(b)
Figure 2.4 – For the same image: (a) The noise is added randomly, (b) The noise power is weighted based on the intensity contrast of each MB [21]
Figure 2.5 – Variation of the contrast sensitivity function with the spatial frequency [19]
Simultaneous contrast consists in the fact that the perceived brightness depends not only on the object
intensity but also on the background intensity. An example is shown in Figure 2.6: although the small
squares have exactly the same intensity, they appear progressively darker as the background becomes lighter.
Figure 2.6 – Simultaneous contrast: The two smaller squares have equal luminance although the right one appears brighter [19]
2.1.2. Perceptual Models
A model is a representation of a system that allows for investigation of the properties and, in some cases,
prediction of future outcomes. Accordingly, a perceptual model is a model that represents the perceptual
features of the HVS.
First, it is important to know that there exist several types of visual perception, notably: the perception of
shapes (e.g. faces and associated emotions), spatial relations (e.g. depth, orientation and movement),
colours and luminance.
The study of visual perception began with Helmholtz, who defended the unconscious inference, i.e.,
vision is the result of making assumptions and drawing conclusions from incomplete data, based on previous
experiences [22]. Afterwards, the Gestalt theory focused on the understanding of visual components as a
collection or an organized whole. In this theory, there are six main factors which determine how humans
group things in agreement with the visual perception, notably proximity, similarity, closure, symmetry,
common fate and continuity [22]. Figure 2.7 (a) represents the proximity factor, with the distance between
objects defining the groups: objects closer to each other are perceived as groups, and independent from
objects further apart. Figure 2.7 (b) shows the similarity factor effect, since objects that are similar in
shape or size or colour are interpreted as a group. Figure 2.7 (c) represents the closure factor, that is, the
brain adds components that are missing to interpret a partial object as a whole. Finally, Figure 2.7 (d)
represents the symmetry factor, with the symmetric objects more easily grouped in collections than the
non-symmetric objects.
Figure 2.7 – Perceptual factors: (a) Proximity; (b) Similarity; (c) Closure; (d) Symmetry [22]
Visual perception can be studied through psychophysics which studies the relationship between physical
stimuli and their subjective perception, in other words, “the analysis of perceptual processes by studying
the effect on a subject’s experience or behaviour of systematically varying the properties of a stimulus
along one or more physical dimensions” [23]. An important concept is the so-called just noticeable
distortion (JND), which is a threshold defining the smallest detectable difference between a starting and a
secondary level of a particular sensory stimulus.
Visual perception is typically studied by means of psychophysical methods. Those methods have the
ambition of testing the subjects’ perception using stimulus detection and difference detection
experiments. The main psychophysical methods are the determination of thresholds of absolute and
relative detection (Weber’s law), the equalization of perceptions and the estimation of the amplitude
perception (Stevens’ law) [24].
Weber’s law relates the minimum detectable increment in the stimulus with the intensity of the stimulus,
and states that the change in a stimulus that will be just noticeable is a constant ratio of the original
stimulus. This law is represented in Figure 2.8 [19] and equation (2) with S being the intensity of the
stimulus, ∆S the minimum detectable increment and c a constant:

∆S / S = c (2)

Figure 2.8 – Weber's law [19]
Stevens’ law establishes the relation between the intensity of a stimulus and the intensity as perceived by
a human. This law, represented in equation (3), proposes a relation between the magnitude of the
stimulus intensity and the perceived intensity, where S represents the magnitude of the stimulus intensity,
P is the psychophysics function expressing the subjective magnitude invoked by the stimulus, n is an
exponent depending on the type of stimulus and K is a constant depending on the type of stimulus and
the units used [25]:

P = K·S^n (3)
This law is considered to supersede Weber’s law since it describes a wider range of sensations. In
addition, a distinction has been made between local psychophysics, where stimuli are discriminated only
with a certain probability, and global psychophysics, where stimuli would be discriminated correctly with a
near certainty. Weber’s law is generally applied in local psychophysics, whereas Stevens’ law is usually
applied in global psychophysics.
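The two laws can be contrasted numerically. The sketch below (illustrative Python; the Weber fraction c and the Stevens constants K and n are arbitrary values chosen for the example, since both depend on the type of stimulus) computes the just noticeable increment predicted by equation (2) and the perceived magnitude predicted by equation (3).

```python
def weber_jnd(stimulus, c=0.02):
    """Weber's law: the just noticeable increment ΔS grows linearly with the
    stimulus intensity S, since ΔS / S = c for a constant Weber fraction c."""
    return c * stimulus

def stevens_perceived(stimulus, k=1.0, n=0.33):
    """Stevens' power law P = K * S^n; the exponent n and constant K depend on
    the type of stimulus (the values used here are arbitrary illustrations)."""
    return k * stimulus ** n

# Weber: doubling the stimulus doubles the detectable increment.
print(weber_jnd(100.0))               # 2.0
print(weber_jnd(200.0))               # 4.0
# Stevens with n < 1: perceived intensity grows slower than the stimulus.
print(stevens_perceived(8.0, n=1/3))  # ≈ 2.0 (cube root of 8)
```

The JND-based coding tools developed in Chapters 3 and 4 rest on exactly this kind of threshold: distortion kept below the just noticeable increment should, in principle, remain invisible.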
2.2. Brief Overview on Video Coding Standards
Since the beginning of the digital video era, several video coding standards have been developed; many
of them are referenced in Figure 2.9 in chronological order. These standards come from two main
standardization institutions: the H.xxx recommendations are developed by the Video Coding Experts
Group (VCEG) from the International Telecommunications Union (ITU), more precisely the ITU
Telecommunication standardization sector (ITU-T) while the MPEG standards are developed by the
Motion Picture Experts Group (MPEG) from the International Standards Organisation/ International
Electrotechnical Commission (ISO/IEC).
Figure 2.9 – Chronology of the video recommendations/standards developed by ITU-T VCEG and ISO/IEC MPEG
[11]
In the following, a brief description of each video coding recommendation/standard is provided.
ITU-T H.261 Recommendation
• Objectives
Digital video compression started, in practice, with this ITU-T recommendation [3], issued in 1990. This
recommendation was a pioneering effort and used a hybrid video coding scheme [2]. The H.261 codec
was designed to be used for bidirectional or unidirectional visual communications at data rates multiples
of 64 kbit/s, in the range of [40, 2000] kbit/s, targeting synchronous Integrated Services Digital Network
(ISDN) channels.
• Main target applications
The main target applications are personal communications, notably videotelephony and
videoconference. This recommendation supports two video frame sizes: Common Intermediate Format
(CIF) (352x288 pixels) and Quarter Common Intermediate Format (QCIF) (176x144 pixels) using a
4:2:0 chrominance subsampling scheme (this means the chrominance is sampled with half the samples
in each direction) and non-interlaced pictures, occurring at approximately 29.97 frames per second.
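The 4:2:0 subsampling arithmetic just mentioned can be sketched as follows (illustrative Python; the function name is this sketch's own): chrominance is kept at half resolution in each direction, so each chroma plane holds a quarter of the luminance samples.

```python
def yuv420_sample_counts(width, height):
    """Return (luma, cb, cr) sample counts for a 4:2:0 picture: each chroma
    component is subsampled by 2 both horizontally and vertically."""
    luma = width * height
    chroma = (width // 2) * (height // 2)
    return luma, chroma, chroma

# CIF picture as used by H.261: 352x288 luminance samples.
y, cb, cr = yuv420_sample_counts(352, 288)
print(y)                          # 101376 luminance samples
print(cb, cr)                     # 25344 25344 chrominance samples each
print(y + cb + cr == y * 3 // 2)  # True: 1.5 samples per pixel overall
```

The overall rate of 1.5 samples per pixel (versus 3 for a full-resolution colour representation) is why 4:2:0 halves the raw data volume before any compression tool is even applied.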
• Main coding tools in addition to previous standards
Since this has been the first relevant video coding standard, it has established a reference in terms of
the video coding tools set, notably:
o Hybrid video coding algorithm based on translational motion compensated prediction to exploit
the temporal redundancy, DCT coding with quantization to exploit the spatial redundancy and
visual irrelevance and, finally, entropy coding to exploit the statistical redundancy;
o Huffman coding as a variable length entropy coding tool;
o Motion estimation (ME) and motion compensation (MC) for 16×16 luminance macroblocks (MB)
with full-pixel precision;
o DCT coefficients coded as a zig-zag scanned set of (run, level) pairs which are entropy coded
with 2D Huffman variable length coding (VLC) [26] [8].
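The run-level idea can be sketched as follows (illustrative Python; H.261 applies it to 8x8 blocks, while a 4x4 block is used here for brevity, and the helper names are this sketch's own): the quantized coefficients are scanned in zig-zag order, and each nonzero coefficient (level) is paired with the count of zero coefficients (run) preceding it.

```python
def zigzag_order(n):
    """Zig-zag scan order for an n x n block: traverse anti-diagonals,
    alternating direction, starting to the right of the DC position."""
    order = []
    for s in range(2 * n - 1):
        rng = range(s + 1)
        if s % 2 == 0:
            rng = reversed(rng)  # even diagonals run bottom-left to top-right
        for i in rng:
            j = s - i
            if i < n and j < n:
                order.append((i, j))
    return order

def run_level_pairs(block):
    """Encode a block as (run, level) pairs: each nonzero coefficient (level)
    is preceded by the count of zero coefficients (run) skipped before it."""
    pairs, run = [], 0
    for i, j in zigzag_order(len(block)):
        coeff = block[i][j]
        if coeff == 0:
            run += 1
        else:
            pairs.append((run, coeff))
            run = 0
    return pairs

block = [
    [9, 2, 0, 0],
    [1, 0, 0, 0],
    [0, 0, 0, 0],
    [0, 0, 0, 1],
]
print(run_level_pairs(block))  # [(0, 9), (0, 2), (0, 1), (12, 1)]
```

Since quantization concentrates the nonzero coefficients near the low-frequency (top-left) corner, the zig-zag scan produces long zero runs that the (run, level) pairs summarize compactly before entropy coding.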
MPEG-1 Video standard
• Objectives
First MPEG video coding standard which was designed to compress Video Home System (VHS) quality
raw digital video down to 1.5 Mbit/s (video with associated audio) without excessive quality loss; this
standard has a quantization matrix for Intra MBs, that is frequency dependent, thus exploiting
perceptual visual features. With this tool, information in certain frequencies and areas of the picture that
the human eye has limited ability to fully perceive is reduced or completely discarded [5].
• Main target applications
The main target application is compact disc (CD) storage with the following important functionalities:
o Random access, the possibility to access any part of the audiovisual data in a limited amount of
time;
o Fast forward/reverse search, faster play (with time compression) in the usual and opposite time
directions;
o Reverse playback, playing at regular speed against the usual temporal direction;
MPEG-1 videos are most commonly seen using Source Input Format (SIF) resolutions of 352x240,
352x288, or 320x240, but the standard also supports CIF and higher spatial resolutions. The bitrate is
typically less than 1.5 Mbit/s; normally, 1.25 Mbit/s are used for the video data.
• Main coding tools in addition to previous standards
o Quantization weighting matrix is a string of 64 values (0-255) telling the encoder the relative
importance of each spatial frequency, considering the human visual system; each value in the
matrix corresponds to a certain frequency component of each 8×8 block;
o ME with half pixel precision;
o Bidirectional ME, meaning that each MB may have one forward vector and/or one backward
vector with half pixel accuracy.
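The quantization weighting matrix mechanism described above can be sketched as follows. This is an illustrative Python sketch only: the weight values below are made up for the example (they are not the MPEG-1 default matrix), a 4x4 block is used instead of MPEG-1's 8x8, and the convention of treating a weight of 16 as neutral is an assumption of this sketch; the principle shown is that each DCT coefficient is divided by the quantizer scale times its frequency-dependent weight, so high-frequency coefficients are quantized more coarsely.

```python
def quantize(coeffs, weights, qscale):
    """Frequency-weighted quantization: divide each DCT coefficient by the
    quantizer scale times its weight, so larger weights (assigned to the
    frequencies the eye resolves poorly) discard more information."""
    n = len(coeffs)
    return [[round(coeffs[i][j] / (qscale * weights[i][j] / 16.0))
             for j in range(n)] for i in range(n)]

# Made-up 4x4 weighting matrix (for illustration only; MPEG-1 uses 8x8):
# weights grow towards the high-frequency (bottom-right) corner.
WEIGHTS = [
    [ 8, 16, 24, 32],
    [16, 24, 32, 40],
    [24, 32, 40, 48],
    [32, 40, 48, 56],
]

coeffs = [[400, 200, 100, 50],
          [200, 100,  50, 25],
          [100,  50,  25, 12],
          [ 50,  25,  12,  6]]
q = quantize(coeffs, WEIGHTS, qscale=8)
print(q[0][0])  # DC kept with fine granularity: round(400 / 4) = 100
print(q[3][3])  # high frequency coarsely quantized: round(6 / 28) = 0
```

The highest-frequency coefficient is rounded away entirely, which is exactly the perceptually motivated information discarding described in the text.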
H.262/MPEG-2 Video standard
• Objectives
In 1994, the ITU-T and ISO/IEC joined efforts to create the H.262/MPEG-2 Video standard with the goal
of defining a coding syntax suitable for interlaced video at higher resolutions and rates than MPEG-1
Video (up to 40 Mbit/s) [7].
• Main target applications
This standard is oriented towards digital television (TV) and allows resolutions such as SIF, CIF, QCIF,
National Television System(s) Committee (NTSC) - 720x480 and Phase Alternating Line (PAL) -
720x576 using a 4:2:0 chrominance subsampling scheme. The recommended bitrate is in the [3, 5]
Mbit/s interval for applications like broadcasting (to the users), and in the [8, 10] Mbit/s interval for
contribution, e.g. transmission between studios.
Profiles and levels are new concepts providing a trade-off between implementation complexity for a
certain class of applications and interoperability between applications while guaranteeing the necessary
compression efficiency capability required by the class of applications in question and limiting the codec
complexity and associated costs. A profile is a subset of coding tools able to provide a certain level of
compression efficiency for a certain complexity; on the other hand, a level establishes some relevant
parameters, notably the bitrate and the total amount of pixels/image. There are four levels: low level (up
to 360x288 for the luminance); main level (up to 720x576 for the luminance); high-1440 level (up to
1440x1152 for the luminance); high level (up to 1920x1152 for the luminance), and six profiles: simple
profile; main profile; 4:2:2 profile; Signal to Noise Ratio (SNR) profile; spatial profile; and high profile.
Several types of compatibility may be provided, which can be separated into two groups, notably:
1. Compatibility between different resolutions of encoders and decoders.
o Upward compatibility means the decoder can decode the pictures generated by a lower
resolution encoder;
o Downward compatibility implies that a decoder can decode the pictures generated by a higher
resolution encoder.
2. Compatibility between different codecs.
o Forward compatibility allows a MPEG-2 Video decoder to be able to decode a coded bitstream
compliant with a previous available standard, e.g. MPEG-1 Video;
o Backward compatibility allows a decoder compliant with a previously available standard to be
able to, totally or partially, decode in a useful way a bitstream compliant with MPEG-2 Video.
• Main coding tools in addition to previous standards
o Scalable coding, this concept refers to the possibility of obtaining a useful reproduction of a
compressed video signal, decoding just a part of the full compressed information (bitstream).
There are four main types of scalability in the MPEG-2 Video standard:
▪ Spatial scalability, corresponding to different spatial resolutions;
▪ Fidelity, SNR or quality scalability, corresponding to different video qualities (e.g. SNR);
▪ Temporal scalability, corresponding to different frame rates;
▪ Frequency scalability, corresponding to different quantized transform coefficients for each
block;
o Interlaced video coding can choose two image structures, notably frame and field coding. The
frame structure just divides the picture into MBs (frame-pictures). The field structure considers
two fields, top and bottom fields, which are interleaved. The top field consists in the odd lines,
while the bottom field consists in the even lines. So the field structure is composed of MBs just
from the top field and MBs just from the bottom field (field-pictures). Since the coded pictures
are classified as frame-picture or field-pictures, the main prediction modes are:
▪ Frame mode for frame-pictures;
▪ Field mode for field-pictures;
▪ Field mode for frame-pictures;
▪ 16x8 MC for field-pictures.
These modes will determine the MBs that may be used to predict the current MB. For example, in the
case of field mode for P-field-pictures, the prediction MBs may come from either of the two most recently
coded I- or P-fields: the top field may take its prediction MBs from either the top field or the bottom field
of the reference field picture, while the bottom field may take its prediction MBs from either the top field of
the P-field-picture being coded or the bottom field of the reference field picture.
ITU H.263 Recommendation
• Objectives
The main goal of this recommendation is the optimization of video quality for low bitrates, notably down
to 28.8 kbit/s to provide significantly better picture quality than the already existing ITU-T
recommendation H.261 [8].
• Main target applications
Although H.263 is, in principle, network-independent and can be used for a very large range of networks
and applications, its main target application is visual telephony in the Public Switched Telephone
Network (PSTN) and mobile networks; in fact, its target networks are low-bitrate networks, like the
General Switched Telephone Network (GSTN), ISDN, and wireless networks. This recommendation
supports the following resolutions: sub-QCIF, QCIF, CIF, 4CIF and 16CIF [8].
• Main coding tools in addition to previous standards
The main improvements are the motion compensation with half pixel precision and the four
optional/negotiable options: unrestricted motion vectors; syntax-based arithmetic coding; advance
prediction; and forward and backward frame prediction. Other tools added are:
o 3D VLC coding to improve the coding efficiency of the DCT coefficients;
o More efficient coding of MB and block signalling overhead such as the information of which
blocks are coded and the information on the quantization step size changes;
o Median prediction for motion vectors to have improved coding efficiency and error resilience;
o Six optional algorithmic coding modes (five of these six modes are not found in H.261):
� Allows sending multiple video streams within a single video channel;
� Extended range of motion vectors values for more efficient performance with high
resolutions and large amounts of motion;
� Arithmetic coding to provide higher coding efficiency;
� Variable block-size motion compensation and overlapped-block motion compensation for
higher coding efficiency and reduced blocking artefacts;
� Representation of pairs of pictures as a single unit for a low-overhead form of bidirectional
prediction .
15
MPEG-4 Visual standard
• Objectives
The MPEG-4 Visual standard (MPEG-4 Part 2), from 1998, had as main goal not only to provide
increased compression efficiency but also to specify an object-based representation framework with
high flexibility and interaction capabilities .
• Main target applications
Application areas range from digital television, streaming video, to mobile multimedia and games. This
video standard is designed for a large range of bitrates and spatial resolutions.
• Main coding tools in addition to previous standards
Highly flexible toolkit of coding techniques making it possible to deal with a wide range of visual data,
including rectangular frames, arbitrarily shaped video objects, still images and hybrids of natural and
synthetic visual information. Those tools are clustered in profiles, since it is unlikely that all applications
would require all the tools available in the MPEG-4 Visual coding framework, notably considering this
implies a substantial complexity. Some of the key features and tools are:
o Efficient compression of progressive and interlaced “rectangular” video sequences; the core
compression tools are based on the ITU-T H.263 recommendation and can outperform MPEG-1
and MPEG-2 Video compression.
o Coding of arbitrarily shaped video objects which are part of a video scene; this is a new concept for
standard-based video coding and enables the independent coding of foreground and
background objects in a video scene.
o Support for effective transmission over practical networks through error resilience tools which
may help a decoder to recover from errors and maintain a successful video connection in error-
prone network environments.
o Scalable coding tools that can help to support flexible transmission at a range of bitrates.
o Coding of still “textures”, this meaning image data; for example, still images can be coded and
transmitted within the same framework of moving video sequences; texture coding tools may
also be used in conjunction with animation-based rendering.
o Coding of animated visual objects such as 2D and 3D polygonal meshes, animated faces and
animated human bodies.
o Coding for specialist applications such as “studio” quality video; in this type of application, visual
quality is perhaps more important than high compression [9].
H.264/AVC (Advanced Video Coding) standard
• Objectives
This standard was jointly published as Part 10 of the MPEG-4 standard and H.264 recommendation by
ITU-T. The main goals of the H.264/AVC standardization effort have been enhanced compression
performance (around 50% bitrate savings for the same quality relative to previous standards) and the
provision of a “network-friendly” video representation [10].
• Main target applications
The target applications are conversational (videotelephony) and non-conversational (storage,
broadcast, or streaming). Bitrates used range from 50 kbps to 8 Mbps, depending on the application.
The resolution also depends on the application and may range from sub-QCIF(128x96), QCIF(176x144)
or CIF(352x288) to 4CIF(704x576), 640x480, 1280x720 and 1920x1080. For example, 4CIF is
appropriate for standard-definition television and DVD-video; CIF and QCIF are popular for
videoconferencing applications; QCIF and sub-QCIF are appropriate for mobile multimedia
applications where the display resolution and the bitrate are limited.
• Main coding tools in addition to previous standards
Input signal
o Flexible interlaced-scan video coding features;
o Support of monochrome, 4:2:0, 4:2:2 (this means that the chrominance is sampled with half
the columns and all the rows of the luminance), and 4:4:4 chroma subsampling (this means
that the chrominance is sampled with all the columns and all the rows of the luminance).
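The chroma plane dimensions implied by each supported sampling format can be sketched as follows; this is a minimal illustration of the column/row relations described above, not part of any standard API.

```python
# Chroma plane dimensions implied by each chroma sampling format,
# for a given luma resolution.

def chroma_dims(width, height, fmt):
    """Return (chroma_width, chroma_height) for a luma plane of width x height."""
    if fmt == "4:2:0":      # half the columns, half the rows of the luminance
        return width // 2, height // 2
    if fmt == "4:2:2":      # half the columns, all the rows of the luminance
        return width // 2, height
    if fmt == "4:4:4":      # all the columns, all the rows of the luminance
        return width, height
    if fmt == "monochrome": # no chroma planes at all
        return 0, 0
    raise ValueError(f"unknown format: {fmt}")

# For a CIF (352x288) luma plane:
print(chroma_dims(352, 288, "4:2:0"))  # (176, 144)
print(chroma_dims(352, 288, "4:2:2"))  # (176, 288)
```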
Prediction design improvement
o Variable block-size MC with small block sizes (the minimum block size is 4x4);
o Quarter-sample accurate MC, that is, quarter-sample motion vector accuracy;
o Motion vectors over picture boundaries;
o Multiple reference pictures motion compensation;
o Decoupling of referencing order from display order: the encoder is allowed to choose the
ordering of pictures for referencing and display purposes with a high degree of flexibility
constrained only by a total memory capacity bound imposed to ensure decoding ability;
o Decoupling of picture representation methods from picture referencing capability, that is, the
encoder is provided with more flexibility and, in many cases, an ability to use a picture for
referencing that is a closer approximation to the picture being encoded;
o Weighted prediction, the motion-compensated prediction signal is weighted and offset by
amounts specified by the encoder;
o Improved “skipped” and “direct” motion inference: in previous standards, the “skipped” coding
mode was only used for static MBs; now, if several MBs have the same motion vector, the first
is coded as previously, but the others are considered skipped with inference of motion from
the first;
o Directional spatial prediction for intra coding: this is a new technique of extrapolating the
edges of the previously-decoded parts of the current picture, applied in regions of pictures
that are coded as intra;
o In-the-loop deblocking filtering, which is included in the MC prediction loop;
Quantization design improvement
o Logarithmic step size control;
o Frequency-customized quantization scaling matrices selected by the encoder for perceptual-
based quantization optimization;
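The logarithmic step size control above means the quantization step size doubles for every increase of 6 in the quantization parameter (QP); a minimal sketch of this mapping, where the six base values are the standard H.264/AVC step sizes for QP 0 to 5:

```python
# Logarithmic QP-to-step-size mapping of H.264/AVC: the quantization step
# size doubles for every increase of 6 in the quantization parameter.
# The six base values are the standard Qstep values for QP 0..5.

QSTEP_BASE = [0.625, 0.6875, 0.8125, 0.875, 1.0, 1.125]

def qstep(qp):
    """Quantization step size for a QP in [0, 51]."""
    assert 0 <= qp <= 51
    return QSTEP_BASE[qp % 6] * (1 << (qp // 6))

print(qstep(4))   # 1.0
print(qstep(10))  # 2.0  (six QP steps above 4 -> step size doubled)
print(qstep(51))  # 224.0 (the largest step size)
```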
Coding efficiency improvement
o Small block-size transform, notably 4x4 instead of 8x8; this allows the encoder to represent
signals in a more locally-adaptive fashion;
o Hierarchical block transform;
o Short word-length transform;
o Exact-match transform;
o Arithmetic entropy coding;
o Context-adaptive entropy coding:
� Context-Adaptive Binary Arithmetic Coding (CABAC) for the quantized transform
coefficients;
� Context-Adaptive Variable-Length Coding (CAVLC);
� Exponential-Golomb coding;
Robustness to data errors/losses and flexibility for operation over a variety of network
environments
o Parameter set structure, that is, “global” parameters for a sequence such as picture
dimensions, video format and macroblock allocation map;
o Network Abstraction Layer (NAL) unit syntax structure, which allows greater customization of
the method of carrying the video content in a manner appropriate for each specific network;
o Flexible MB Ordering (FMO): allows the partition of the picture into regions called slice
groups, with each slice becoming an independently decodable subset of a slice group. A
slice group is a subset of the MBs in a coded picture and may contain one or more slices [9];
o Data Partitioning (DP): H.264/AVC allows the syntax of each slice to be separated into up to
three different partitions for transmission, depending on a categorization of syntax elements;
o Redundant pictures allow an encoder to send redundant representations of regions of
pictures, enabling an additional representation of regions of pictures for which the primary
representation has been lost during data transmission;
o Frame numbering allows the creation of “sub-sequences”, enabling temporal scalability by the
optional inclusion of extra pictures between other pictures, and the detection of missing
pictures, which can occur due to network packet losses or channel errors;
o Flexible slice size, allowing the encoder to trade error resilience against the coding efficiency
loss caused by increased header data and decreased prediction effectiveness;
o Arbitrary slice ordering;
o Instead of I-, P-, B-pictures, the H.264/AVC supports I-, P-, B-slices;
o SP/SI synchronization switching slices:
� Switching P (SP) slice allows efficient switching between different pre-coded pictures; in
other words, SP slice is an inter-coded slice used for switching between coded bitstreams;
� Switching I (SI) slice allows an exact match of a MB in an SP slice for random access and
error recovery purposes; SI slice is an intra-coded slice used for switching between coded
bitstreams [12] [9].
In the following section, perceptual video codecs using the H.264/AVC video coding standard as
background video codec to include HVS related novel coding tools will be presented. These tools may be
integrated still keeping the compatibility with the H.264/AVC standard (coded stream and decoder) or
eventually losing this compatibility, which may be a drawback in terms of market deployment.
2.3. Reviewing Perceptual Video Coding Solutions
This section intends to present a brief review on the most relevant perceptual video coding solutions in
the literature since this Thesis targets this type of video compression approach. The main selection
criteria for the coding solutions to be reviewed in the following are their compression performance as well
as the novel way they integrate HVS features and associated tools in the video codec. Reviewing these
solutions is fundamental to have a basic understanding of the state-of-the-art on perceptual video coding
and, thus, to better decide which coding path to follow after in this Thesis.
2.3.1. H.264/AVC Perceptual Video Coding based on a Foveated JND
Model
Basic Approach
In [27] and [28], Chen and Guillemot propose a H.264/AVC perceptual video coding solution based on a
Foveated Just Noticeable Distortion (FJND) model [27]. The foveated model is developed to further
exploit the perceptual redundancy, focusing on the fact that the visual acuity decreases with an increased
eccentricity; this means the visibility threshold increases with the pixel distance from the fixation point. The
eccentricity e, defined as the fraction of the distance along the semi-major axis at which the focus lies, is
illustrated in Figure 2.10 and expressed in equation (4), where c is the distance between the center of the
ellipse and the focus and a is the size of the semi-major ellipse axis; the focus corresponds to the fovea.

e = c / a (4)
Figure 2.10 – Ellipse [29]
In this solution, bit allocation and rate distortion optimization (RDO) algorithms are proposed based on the
foveated JND model with the target to achieve better visual quality for the same rate. In summary:
• Basic idea: use a foveation JND model to further exploit the perceptual redundancy by controlling
the H.264/AVC Quantization Parameter (QP).
• Target: achieve a better visual quality for the same rate.
• HVS property explored: the visual acuity decreases with the increment of the eccentricity.
Architecture and Walkthrough
The solution proposed in this section adopts an improved H.264/AVC coding architecture including the
proposed MB quantization adjustment and a RDO solution based on the adopted FJND model. The
architecture of the proposed solution is presented in Figure 2.11 and the modules changed are the coder
control, which now includes the FJND model, and the quantization.
In the quantization module, the proposed change refers to the determination of the QP for each MB,
which is optimized based on the FJND information. This information is given by the FJND model
integrated in the coder control module. The FJND model includes several JND components, notably the
spatial JND (SJND), the temporal JND (TJND) and the foveation model. The coder control module uses a
Lagrange multiplier for the rate distortion optimization which is adapted so the MB distortion is lower than
the noticeable distortion threshold for each MB. It is assumed that the noticeable distortion is equal for all
MBs in one video frame; it is computed as in equation (5), where D is the noticeable distortion for a MB, ω
denotes the noticeable distortion weight, Q is the quantizer and α is a constant.

D = ω · Q^α (5)
Figure 2.11 – Architecture of the H.264/AVC Perceptual Video Coding based on a Foveated JND Model solution
Forward/Encoding Path
1. MB Division: First, the input video frame is divided in MBs;
2. Motion Estimation: For each MB, the motion vectors best describing the motion/translation from
one image to another are determined, usually using the adjacent frames in the video sequence. To
choose the best motion vector for each MB, a distortion metric, in this case the Sum of Absolute
Differences (SAD), is typically used;
3. MB Prediction: Each MB is encoded using INTRA (spatial) or INTER (temporal) prediction:
a. If INTRA coding is used, a prediction (PRED) is formed using the already decoded samples in
the current slice, this means the samples that have already been encoded and decoded;
b. If INTER coding is used, a prediction (PRED) is formed by motion compensation from 1 or 2
pictures selected from the available reference frames (in the so-called list 0 and/or list 1);
4. Residual Computation: The prediction is subtracted from the current original MB to produce a
residual MB;
5. Transformation: The blocks in the residual MB are transformed by the H.264/AVC Integer DCT
(ICT);
6. FJND model coder control: The FJND thresholds are computed for each pixel, considering three
components:
a. SJND – refers to the luminance contrast and spatial masking effect;
b. TJND – refers to large temporal masking effects resulting from large inter-frame differences;
c. Foveation model – refers to the contrast sensitivity as a function of eccentricity which is
computed based on the foveated weighting model and the background luminance function;
7. Quantization: The ICT coefficients are quantized, controlled by the FJND model:
a. The MB quantization is determined taking into account the relation between the reference
noticeable distortion weight (ωr = 1), the noticeable distortion weight (ωi) and a reference
quantizer (Qr) determined by the frame-level rate control, as presented in equation (6);

Qi = (ωr / ωi)^(1/α) · Qr (6)

The noticeable distortion weight for MB i is calculated by equation (7), where a = 0.7, b = 0.6,
m = 0, n = 1, c = 4, si is the average FJND of MB i, and s* is the average FJND of the frame:

ωi = a + b · (1 + m · exp(−c (si − s*) / s*)) / (1 + n · exp(−c (si − s*) / s*)) (7)
8. Entropy coding: The quantized coefficients are entropy coded with CABAC.
Decoding Path (also within the encoder)
1. Scaling & Inv. Transform: The quantized ICT coefficients are scaled (Q^-1) and inverse transformed
(T^-1) to produce the residual block D'n;
2. Motion Compensation and Reconstruction: The motion compensated prediction block PRED is
added to D’n to create a reconstructed block;
3. Deblocking Filter: A deblocking filter is applied to the previously reconstructed blocks to reduce the
blocking effects and the decoded picture is created from a series of filtered blocks F’n.
Main Coding Tools
The FJND model is a combination of three elementary models, notably the SJND, TJND, and foveation
models.
• Spatial Just-Noticeable-Distortion Model
The perceptual redundancy in the spatial domain is mainly based on the HVS sensitivity to the
luminance contrast and spatial masking effect which is expressed by equation (8) where f1 and f2 are
functions to estimate the spatial masking and luminance contrast and bg and mg are the average
background luminance and the maximum weighted average of luminance differences, respectively.
SJND(x, y) = max{ f1(bg(x, y), mg(x, y)), f2(bg(x, y)) } (8)
• Temporal Just-Noticeable-Distortion Model
Usually, large inter-frame differences result in larger temporal masking effects. The TJND is defined
as in equation (9), where τ = 0.8 and ∆(x, y, t) denotes the average inter-frame luminance difference
between frame t and the previous frame t − 1.

TJND(x, y, t) = max( τ/2, 8 · exp(−0.15/256 · (∆(x, y, t) + 255)) + τ ), if ∆(x, y, t) ≤ 0
TJND(x, y, t) = max( τ/2, 3.2 · exp(−0.15/256 · (255 − ∆(x, y, t))) + τ ), if ∆(x, y, t) > 0 (9)
• Foveation Model
An analytical model is proposed to measure the contrast sensitivity as a function of the eccentricity,
where the contrast sensitivity function CS(f, e) is defined as the reciprocal of the contrast threshold
CT(f, e). The foveation JND model takes into account the foveated weighting model and the background
luminance function, as presented in equation (10).

F(x, y, v, e) = CT(fm(x, y), v, e) (10)
The foveated JND model is the combination of the three models presented.
• Foveated Just-Noticeable-Distortion Model
The foveated JND is defined as in equation (11).

FJND(x, y, t, v, e) = SJND(x, y) · TJND(x, y, t) · F(x, y, v, e) (11)
When there are multiple fixation points, and since a fixed v is assumed to calculate F(x, y, v, e) for
each pixel, F(x, y, v, e) can be calculated by only considering the closest fixation point, which results
in the smallest e and the minimum F(x, y, v, e). So, the FJND with multiple fixation points is defined as
in equation (12).

F(x, y, v, e) = min over i ∈ {1, …, K} of Fi(x, y, v, e) (12)
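The combination of the three models in equations (11) and (12) can be sketched as follows; the component models are stubbed out with hypothetical threshold values, so only the combination logic is shown.

```python
# Per-pixel FJND threshold as the product of the spatial JND, temporal JND
# and foveation weight (equation (11)); multiple fixation points are handled
# by keeping the minimum foveation weight (equation (12)).

def fjnd(sjnd, tjnd, foveation_weights):
    """foveation_weights: one F(x, y, v, e) value per fixation point."""
    f = min(foveation_weights)  # closest fixation point -> smallest e -> min F
    return sjnd * tjnd * f

# Hypothetical thresholds for one pixel with two fixation points:
print(fjnd(3.0, 1.2, [1.5, 1.1]))  # 3.0 * 1.2 * 1.1 = approx. 3.96
```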
The quantization step is based on the JND thresholds of the FJND model: the noticeable distortion
weights of the reference and of the current MB are determined and the relation between them is then
used to compute the new quantization step, as shown in equation (6).
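A minimal sketch of this MB-level quantizer adaptation around equations (5)–(7); the weight formula follows the a, b, m, n, c parameters stated in the text, while the exponent α = 2 used in the example is an illustrative assumption, since only "a constant" is stated.

```python
import math

# MB-level quantizer adaptation from the FJND model (equations (5)-(7)).
# The distortion model D = w * Q**alpha with equal D across all MBs of a
# frame gives Q_i = (w_r / w_i)**(1/alpha) * Q_r.

def distortion_weight(s_mb, s_frame, a=0.7, b=0.6, m=0.0, n=1.0, c=4.0):
    """Noticeable distortion weight of a MB (equation (7));
    s_mb / s_frame are the average FJND of the MB / frame."""
    x = math.exp(-c * (s_mb - s_frame) / s_frame)
    return a + b * (1.0 + m * x) / (1.0 + n * x)

def mb_quantizer(q_ref, w_mb, w_ref=1.0, alpha=2.0):
    """Quantizer for a MB given the frame-level reference quantizer
    (equation (6)); alpha=2 is an illustrative assumption."""
    return (w_ref / w_mb) ** (1.0 / alpha) * q_ref

# A MB whose average FJND equals the frame average gets weight
# a + b/2 = 1.0, leaving the reference quantizer unchanged:
w = distortion_weight(s_mb=5.0, s_frame=5.0)
print(w)                      # approx. 1.0
print(mb_quantizer(26.0, w))  # approx. 26.0
```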
Performance Evaluation
• Test conditions
Subjective tests were performed in a typical laboratory viewing environment with normal lighting. The
display system was a 20 inch silicon graphics cathode ray tube (20’’ SGI CRT) display with a resolution of
800x600. The viewing distance was approximately 3 times the image width. The test sequences in CIF
format were coded at bitrates of 50 kbps, 300 kbps, 500 kbps, 300 kbps and 300 kbps, for Akiyo, Stefan,
Football, Bus, and Flower, respectively. The frame rate was 30 fps.
For the subjective experiments, the double stimulus continuous quality scale (DSCQS) protocol, widely
used for quality assessment, has been used; in this method, the pair of stimuli is composed of the
original video sequence and a processed version of the original video, in this case the decoded video
from one of the two video coding solutions under evaluation. Therefore, 10 presentations were
conducted, for the five test sequences coded with the two different coding algorithms, and presented
using a random order. The duration of the two videos was 3 seconds and each pair of videos was
displayed twice with the same interval order. The Mean Opinion Score (MOS) scales for the DSCQS
protocol ranged from 0 to 100 in association to qualities from bad to excellent. A Differential Mean
Opinion Score (DMOS) is computed as the difference between the MOS of the original video and the
MOS of the reconstructed video for each presentation.
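The DMOS computation described above can be sketched as follows; the scores used in the example are hypothetical.

```python
# DMOS for the DSCQS protocol: each presentation yields a MOS (0-100) for
# the original and for the reconstructed (decoded) video; the differential
# score is their difference, so a smaller DMOS means the decoded video was
# judged closer to the original.

def dmos(mos_original, mos_reconstructed):
    return mos_original - mos_reconstructed

print(dmos(82.0, 75.5))  # 6.5
```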
• Results and Analysis
The authors first conducted visual subjective tests to evaluate the performance of the FJND model. The
FJND based video coding method was compared to the H.264/AVC Joint Model (JM) reference software,
with a confidence interval (CI) of 95%. The JM reference software has been developed by the same
standardization group who developed the standard itself, the Joint Video Team (JVT), with the purpose of
testing it; this codec implementation is typically used as a benchmark for H.264/AVC related video coding
innovations.
The results of the FJND validation test presented in Table 2.1 indicate that no noticeable distortion is
perceived. Moreover, the subjective quality results measured by DMOS are presented in Figure 2.12. The
smaller DMOS indicate that the subjective quality of the reconstructed video is closer to the original video;
so the proposed model is better than the JM in all sequences, meaning that better subjective quality is
obtained for the same bitrate.
Table 2.1 – Results of the FJND validation tests [27]

Sequence    PSNRSTJND (dB)    PSNRFJND (dB)    ∆PSNR (dB)
Akiyo       37.16             35.55            -1.61
Stefan      35.43             33.88            -1.63
Football    36.17             35.01            -1.16
Bus         33.70             32.44            -1.34
Flower      34.78             33.14            -1.64

Figure 2.12 – DMOS comparisons for the H.264/AVC based coding solutions [27]

The quality of the decoded video using the FJND based coding method is better than the video quality
resulting from the H.264/AVC JM reference software as shown in Figure 2.13; for example, the textured
regions can tolerate higher distortion and, thus, may be more coarsely coded, see Figure 2.13 (c) and (g).
Non-textured regions should not be too coarsely coded since the distortion in smooth regions will be more
easily perceived; higher distortion in such regions is annoying and degrades the subjective quality, as
shown in Figure 2.13 (d).

Figure 2.13 – Portions of decoded frames for the test sequence Stefan: Stefan frame coded with (a) JM and
(e) FJND; fixation point coded with (b) JM and (f) FJND; texture region away from the fixation point coded
with (c) JM and (g) FJND; non-fixation point coded with (d) JM and (h) FJND [27]

Strengths and Weaknesses

The strength of this solution is the use of a FJND model achieving a better subjective quality for the same
bitrate. The weakness appears when the model is used at low rates, where the distortion is likely to be
above the visibility threshold, and the FJND value is used at each pixel as a weighting factor to build
weighted squared-error metrics, because the simple weighting of the MSE may not lead to the optimal
trade-off between visual quality and rate.
2.3.2. H.264/AVC Coding with JND Model based Coefficients Filtering
Basic Approach
In [30], Mak and Ngan propose a novel approach for incorporating a DCT domain JND model in a
H.264/AVC encoder to reduce the bitrate without visual quality loss. The basic idea is to remove the
transformed coefficients that have a magnitude lower than the JND threshold.
• Basic idea: remove the transformed coefficients with a magnitude lower than the JND threshold.
• Target: reduce the bitrate without loss of visual quality.
• HVS property explored: texture masking, intensity contrast masking, spatial frequency sensitivity and
preservation of object boundaries.
Architecture and Walkthrough
The proposed encoder changes regard the transformation, quantization and coder control modules of
the H.264/AVC architecture. The architecture of the proposed solution is presented in Figure 2.14.
Figure 2.14 – Architecture of the H.264/AVC Coding with JND Model based Coefficients Filtering solution
Forward/Encoding Path
1. MB Division: First, the input video frame is divided in MBs;
2. Coder Control:
a. Transformation: The input video signal 4×4 blocks are transformed by the ICT;
b. The spatial and temporal contrast sensitivity function (CSF) for each block is computed using
the following information: ICT coefficients of the original block; frame dimension; physical size of
the pixels in the display monitor; viewing distance; frame rate; and motion of the block.
c. Based on the spatial and temporal CSF, each 8x8 block in the frame is classified as a PLAIN,
EDGE or TEXTURE block;
d. Luminance adaptation refers to the fact that the distortion visibility threshold is approximately
linearly proportional to the surrounding intensity, except in very dark and very bright regions.
Contrast masking is the property of the human eyes of masking distortions when other signals
are present. The luminance adaptation
and contrast masking values are computed based on the ICT coefficients and the block type
(result of the previous step, 2.c.);
e. JND model - JND thresholds are computed for each coefficient in each 4×4 block using the
luminance adaptation and contrast masking values;
3. Motion Estimation: For each MB, the motion vectors describing the transformation from one image
to another, usually from adjacent frames in a video sequence, are determined. To choose the best
prediction MB, a distortion metric has to be used, in this case the JND thresholded SAD; this
means that, instead of finding the vector corresponding to the minimum SAD, the ME will find the
minimum thresholded SAD as defined next.
a. Thresholded SAD Computation:
i. Residual computation: The difference between the original and the prediction block is
computed;
ii. Transform: The residual above is transformed by the ICT, producing matrix E;
iii. JND Thresholding:
1. If the absolute value of E is larger than the corresponding JND threshold for a certain
frequency position, do nothing;
2. Otherwise, E is changed to zero;
iv. Inverse Transform: The distortion (thresholded SAD) is computed from the inverse transform
of the result of iii. (E thresholded matrix);
4. MB prediction: Each MB is encoded using INTRA or INTER prediction. If rate distortion (RD)
optimization is enabled, the prediction mode chosen for the block is the one that minimizes the RD
cost defined as in equation (13), where d is the distortion calculated in step 3.a.iv, λ is a Lagrangian
multiplier and L is the actual bit length of encoding the block with that prediction method.

C = d + λ · L (13)
a. If the MB is INTRA coded, its prediction PRED is formed using samples in the current slice that
have been previously encoded and decoded;
b. If the MB is INTER coded, its prediction PRED is formed by motion compensation prediction
from 1 or 2 frames selected from the available reference frames (set of list 0 and/or list 1);
5. Residue Computation: The prediction is subtracted from the current original block to produce a
residual block;
6. Transform: The residual block is transformed with the ICT; the output are the transformed
coefficients of the JND thresholded prediction residual;
7. JND Thresholding:
a. If the absolute value of a transformed coefficient of the prediction residual is larger than the
corresponding JND threshold, go to step 8;
b. Otherwise, the transformed coefficient of the prediction residual is set to zero;
8. Quantization: Quantize the transformed coefficients;
9. Entropy encoding: The same as used by the H.264/AVC High profile;
Decoding Path (also within the encoder): The decoding is the same as presented for the previous
perceptual coding solution.
Main Coding Tools
• JND model
This JND model takes into account the fact that humans are less sensitive to the distortion in texture
regions than in plain regions. So, the blocks are classified as PLAIN, EDGE, and TEXTURE blocks,
where the classification is computed based on the values of the luminance adaptation and contrast
masking. Then, the coefficient filtering is applied to each coefficient in the block as presented in
equation (14), where Y are the transformed coefficients, Jx the JND thresholds and Yf the filtered
transformed coefficients.

Yf(u, v) = Y(u, v), if |Y(u, v)| > Jx(u, v); 0 otherwise (14)
The JND threshold is used to eliminate the ICT coefficients that are not perceived by the human eye,
that is, those whose magnitude is smaller than the corresponding JND threshold, as presented in
equation (15), where E is the ICT transformed difference between the original and the reconstructed
block.

Ef(u, v) = E(u, v), if |E(u, v)| > Jx(u, v); 0 otherwise (15)
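The coefficient filtering of equations (14) and (15) can be sketched as follows; plain nested lists stand in for the 4x4 ICT coefficient blocks, and the block values and thresholds used in the example are hypothetical.

```python
# JND-based coefficient filtering (equations (14)/(15)): any transformed
# coefficient whose magnitude does not exceed the corresponding JND
# threshold is zeroed, since the human eye would not perceive it.

def jnd_filter(coeffs, jnd):
    """Zero every coefficient with |value| <= its JND threshold."""
    return [
        [c if abs(c) > t else 0 for c, t in zip(row, t_row)]
        for row, t_row in zip(coeffs, jnd)
    ]

block = [[12, -3], [4, -9]]
thresholds = [[5, 5], [5, 5]]
print(jnd_filter(block, thresholds))  # [[12, 0], [0, -9]]
```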
Performance Evaluation
• Test Conditions
The proposed JND thresholding scheme has been implemented in the H.264/AVC JM 14.0 reference
software and the High profile has been used. The group of pictures (GOP) structure is IBBPBBPBBP..,
with an Intra frame every 0.5 s; the RDO is enabled. The sequences used were City, Panslow, Shield,
Spin-calendar with resolution 1280x720p and Pedestrian Area, Toys Calendar and Bluesky with
resolution 1920x1080p.
• Results and Analysis
The Double Stimulus Impairment Scale (DSIS) test method has been used for the subjective evaluation.
The results show that all the JND thresholded sequences have a MOS of 5 or close to 5 (the scale goes
from 1, very annoying, to 5, imperceptible), with an average MOS of 4.93 for all sequences. Even when
the PSNR drops by 7 dB, the observers still cannot perceive the difference between the original and the
decoded video, very likely because the error is inserted in image areas where it is not easily perceivable.
On average, the bitrate is reduced by 23% for all sequences when using the proposed perceptual video
codec. While the average reduction for the Intra frames is only 1.74%, it is 20% and 36% for the P and B
frames, respectively. This difference exists because, in the RD cost computation, the Lagrangian multiplier
is larger in B frames than in P frames, which means it is more likely to choose modes with shorter bit
lengths in B frames. Figure 2.15 shows the average bitrate change, relative to the reference H.264/AVC coding,
at different QPs. The bitrate reduction obtained for the I, P, and B frames when the proposed JND-
threshold coding scheme is used is presented in Table 2.2. The bitrate reduction declines as the QP
increases, meaning the JND-thresholding bitrate reduction effect is less intense.
Figure 2.15 – Bitrate changes at different QP for I, P, and B frames [30]
Strengths and Weaknesses
The strength of this solution is the bitrate reduction achieved without any degradation of the visual quality.
The major weakness is the use, for each MB, of the average of the full set of JND thresholds determined
for the 4×4 ICT coefficients in order to maintain compatibility with the H.264/AVC standard, because this
average threshold has little meaning. Another weakness is related to the coding mode selection, which
may not be optimal in terms of bitrate; moreover, when QP is large, some sequences show a bitrate
increase which may be due to the differences in the RD cost function used.
Table 2.2 – Bitrate reduction for the JND-thresholded sequences and their MOS [30]
2.3.3. H.264/AVC Inter Coding based on Structural Similarity driven
Motion Estimation
Basic Approach
In [31], Mai, Yang, Kuang and Po propose a novel motion estimation method based on the Structural
Similarity (SSIM) metric for H.264/AVC inter prediction. The SSIM index is a method for measuring the
similarity between two images. The SSIM index is a full reference metric in the sense that it measures the
image quality based on an initial uncompressed or distortion-free image taken as reference. SSIM has
been designed to improve on traditional quality assessment methods like the peak signal-to-noise ratio
(PSNR) and mean squared error (MSE), which have proved to be inconsistent with human eye
perception.
Variable block-size motion compensation is used in H.264/AVC to improve the matching accuracy and
achieve higher compression efficiency. As the SSIM index expresses the structural similarity between two
images, a prediction block having a larger SSIM should be more similar to the original one and should
thus produce lower frequency residuals which can be more easily encoded. As the best
H.264/AVC P-slice prediction modes are determined after all the prediction residuals are transformed,
quantized and entropy coded, which costs a great deal of computational complexity, this solution is also able
to reduce the overall coding complexity. In this context, the solution presented in this section proposes a
motion estimation method based on the structural similarity (MEBSS) for inter prediction to reduce the
bitrate and encoding time while maintaining the same perceptual video quality. Summing up:
• Basic idea: use a perceptual metric like the SSIM metric in the ME process instead of the usual
SAD.
• Target: reduce the bitrate and encoding time while maintaining the same perceptual video quality.
• HVS property explored: considers luminance, contrast and structure.
Architecture and Walkthrough
This solution mainly changes the distortion metric used in the motion estimation module and the coder
control module. The MEBSS uses the SSIM rather than SAD as the distortion metric in the block
matching motion estimation process. According to the SSIM theory, the candidate block is perceptually
more similar to the original block when its SSIM index is greater, while the SAD behaves in the opposite
way since it is a distortion (not similarity) metric. The idea behind the SSIM index is to measure the
structural information degradation, which includes three comparison dimensions: luminance, contrast and
structure. The higher the value of SSIM(x, y) is, the more similar the images x and y are. The architecture
of the proposed coding solution is presented in Figure 2.16.
Figure 2.16 – Architecture of the H.264/AVC Inter Coding based on SSIM driven ME solution
Forward/Encoding Path
1. MB Division: First, the input video frame is divided in MBs;
2. Motion Estimation: For each MB, the motion vectors describing the transformation from one image
to another, usually from adjacent frames in a video sequence, are determined. To choose the best
MB, a distortion metric has to be used, in this case the SSIM. The major steps to select the best
matching block(s) and the best inter prediction mode for each MB are:
a. For each MB, find the best matching block from all the candidate blocks using equation (16)
where s is the original block, c is the candidate matching block, λMOTION is the Lagrange
multiplier for motion estimation, ∆MV is the difference between the prediction MV and the actual
MV and, finally, Bit(∆MV) is the number of bits representing the ∆MV.
b. Divide each MB into two 16x8 non-overlapped blocks. For each 16x8 block, find the best-
matching 16x8 block from all the reference frames using equation (16). Then, calculate the sum of the Cost values for these two 16x8 blocks.
c. Divide each MB into two 8x16 non-overlapped blocks; then, perform as in b.
d. Divide each MB into four 8x8 non-overlapped blocks. For each 8x8 block, find the best-
matching block from all the reference frames using equation (16). Then, calculate the total Cost for these four 8x8 blocks.
e. If further sub-partitions are allowed, find the best similarity matching blocks for the types 8x4,
4x8 and 4x4, respectively; otherwise, go to step f directly.
f. Find the prediction block using the P_SKIP mode. Its Cost is 1 − SSIM(s, c) since neither a
motion vector nor a reference index parameter is transmitted for this mode.
g. The prediction mode with the minimum Cost is chosen as the best inter prediction mode for
the MB. The residual for this best coding mode is transformed, quantized and entropy coded.
3. MB prediction: As presented in Section 2.3.1
Cost(s, c) = 1 − SSIM(s, c) + λMOTION ∙ Bit(∆MV)    (16)
4. Residual Computation: As presented in Section 2.3.1
5. Transformation: The residual block is transformed with the ICT;
6. Quantization: As presented in the previous Section 2.3.2;
7. Entropy encoding: As presented in the previous Section 2.3.2.
Decoding Path (also within the encoder): As presented in Section 2.3.2.
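To make the mode decision criterion of step 2 concrete, a minimal C sketch follows; the function names are illustrative and not part of the JM reference software, and the SSIM value, Lagrange multiplier and motion vector bits are assumed to be supplied by the encoder.

```c
/* Hypothetical sketch of the MEBSS Lagrangian cost of equation (16):
   Cost = 1 - SSIM(s, c) + lambda_MOTION * Bit(dMV). */
double mebss_cost(double ssim, double lambda_motion, int mv_bits)
{
    return 1.0 - ssim + lambda_motion * (double)mv_bits;
}

/* Pick the prediction mode with the minimum Cost among n candidates,
   mirroring steps a-g of the walkthrough; returns the winning index. */
int mebss_best_mode(const double *costs, int n)
{
    int best = 0;
    for (int i = 1; i < n; i++)
        if (costs[i] < costs[best])
            best = i;
    return best;
}
```

In this sketch, a higher SSIM (more perceptual similarity) and fewer motion vector bits both lower the cost, so the selected mode trades perceptual fidelity against motion information rate exactly as in the RDO loop described above.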
Main Coding Tools
• SSIM Index
This tool is used to measure the structural information degradation, based on three features:
luminance, contrast and structure. The SSIM index is defined in equation (17) where l(x,y) regards the
luminance comparison, c(x,y) regards the contrast comparison and s(x,y) regards the structure
comparison. These comparisons are defined in equations (18), (19) and (20), respectively, where x
and y are two nonnegative image signals to be compared, µx and µy are the mean intensity and σx and
σy are the standard deviation of image x and y, respectively, and σxy is the covariance of image x and
y. C1, C2 and C3 are small constants to avoid the denominator being zero.
SSIM(x, y) = l(x, y) ∙ c(x, y) ∙ s(x, y)    (17)

l(x, y) = (2µxµy + C1) / (µx² + µy² + C1)    (18)

c(x, y) = (2σxσy + C2) / (σx² + σy² + C2)    (19)

s(x, y) = (σxy + C3) / (σxσy + C3)    (20)
• Encoder Control
The encoder transmits the coded video together with some side information, notably for indicating
either Intra-slice or Inter-slice coding. Partitions with sizes of 16x16, 16x8, 8x16 and 8x8 for each MB
luma component are supported by the P-slice syntax in block matching motion estimation. The 8x8
partition can also be further subdivided into 8x4, 4x8 or 4x4 sub-block partitions according to the
syntax element.
The block matching motion estimation aims at finding the best matching block from the reference
frames within a certain search range, such as 16x16. The Lagrange cost (Cost) in equation (16) is
used as the selection criterion. The block(s) with the minimum Cost will be chosen as the best matching
block(s) for each prediction mode. For each prediction mode, an RD cost is generated after finding the
best matching block. The prediction mode with the minimum RD cost will be chosen as the best
prediction for that MB.
Due to the change in the distortion metric (SSIM and not SAD), the Lagrangian multiplier should be
modified correspondingly; consequently, the new cost function must be written as in equation (16).
Performance Evaluation
• Test Conditions
The adopted test video sequences (with 50 frames) were Carphone, Foreman, Grandma and News with
176x144 resolution, and Hall_monitor and Mobile with 352x288 resolution. All the experiments used the
JVT JM reference software, version JM 9.2. The tests were performed on a P4/2.4 GHz personal computer
with 256 MB RAM and Microsoft Windows XP as the operating system.
• Results and Analysis
The MEBSS coding solution can avoid some RDO coding steps, leading to a reduction of the encoder
computation load. However, as the SSIM computational load itself is larger than for the SAD, the reduction
in the overall coding computation load is not that large.
While maintaining almost the same mean SSIM (MSSIM), the proposed MEBSS solution may achieve a
20% average bitrate reduction with a 2.5% reduction in the encoding time; the maximum bitrate reduction
is more than 50% which is rather significant, see Table 2.3.
Table 2.3 – MEBSS results with QP=10 [31].
Strengths and Weaknesses
The strengths of this solution regard the bitrate reduction for a target quality and the encoding complexity
reduction for each prediction mode. The advantage in computation load is more obvious when the
MEBSS is used in H.264/AVC fast motion estimation with fewer searching blocks. The main weakness is
that the overall complexity reduction is not that large due to the usage of a more complex quality metric.
2.3.4. H.264/AVC Bitrate Control based on 4D Perceptual
Quantization Modeling
Basic Approach
In [32], Huang and Lin propose a novel 4D perceptual quantization model for H.264/AVC bitrate control
(PQrc). This solution includes two major encoding modules:
o Perceptual frame-level bit allocation using a 1D temporal pattern, depicted as the energy
transition table, which is used to predict the frame complexity and determine proper rate budgets;
o Macroblock-level quantizer decision using a 3D rate pattern, formed as the bit-complexity-
quantization (BCQ) model, in which the tangent slope of a BCQ curve is the unique information used to
find a proper MB quantizer.
In summary:
• Basic idea: to use a 4D temporal BCQ model, that is, a 1D temporal pattern to predict the
frame complexity and determine the proper budget bits, plus a 3D rate pattern, depicted as the BCQ
model.
• Target: to reduce the bitrate while improving the SNR quality and the control accuracy.
• HVS property explored: basically, the just noticeable distortion.
Architecture and Walkthrough
The whole codec architecture is presented in Figure 2.18; the major changes regard the coder control
and the quantization module. The proposed PQrc solution has major contributions with respect to the
frame complexity estimation and rate-quantization model modules, marked with double stars in Figure
2.17, where the main components of the H.264/AVC rate control module are depicted.
Figure 2.17 – Illustration of the H.264/AVC Bitrate Control based on 4D Perceptual Quantization Modeling rate control main components (major revisions are marked by double stars) [32]
Figure 2.18 – Architecture of the H.264/AVC Bitrate Control based on 4D Perceptual Quantization Modeling solution
Forward/Encoding Path
1. Input video signal: Each frame is divided into MBs;
2. Motion Estimation: For each MB, the motion vectors best describing the transformation from one
image to another, usually from adjacent frames in a video sequence, are determined. To choose
the best MB, a distortion metric, in this case the SAD, is used;
3. MB prediction: Each MB is encoded using INTRA or INTER prediction:
a. If INTRA coding is used, PRED is formed from samples in the current slice that have been
previously encoded, decoded and reconstructed;
b. If INTER coding is used, PRED is formed by motion compensated prediction from 1 or 2
pictures selected from the reference frames set in the so-called list 0 and list 1;
4. Distortion: The best prediction is subtracted from the current block (original) to produce a residual
block;
5. Transformation: As presented in the previous section;
6. Coder Control: The proposed H.264/AVC bitrate control using the 4D perceptual quantization
modeling, see Figure 2.19, works as follows:
a. Frame complexity estimation based on the available channel bandwidth, frame rate, current
target buffer level and actual buffer fullness:
i. Compute the (i-1)th actual mean absolute difference (MAD) through PQrc;
ii. Update the energy transition table which records the temporal variation of frame
complexity (energy) between adjacent frames;
b. Frame-level bit-allocation
i. Calculate the initial frame-level budget bit based on the predicted frame complexity using a
quadratic rate-quantization function;
ii. Slightly adjust the demanded frame bitrate, using the buffer fullness, to achieve buffer
stability.
c. Find the slope of the MB characteristic tangent, that is, in order to find the proper MB
quantizer, the current MB property is characterized by its tangent slope.
i. If the MB budget bits and the MB complexity are known
a. Calculate the expected tangent slope through PQrc;
Figure 2.19 – H.264/AVC bitrate control procedure using the 4D perceptual quantization model [32]
ii. MB bitrate estimation can allocate MB budget bits based on residual bits and previous
coding information.
iii. MB complexity is computed as a weighted MAD according to the variation of the luminance
within the MB.
d. MB QP decision
i. PQrc determines a proper MB QP, selected to minimize the difference between the expected tangent slope
and the tangent slope in the BCQ model.
ii. Update the BCQ model continuously, using the weighted least-square estimation, for newly incoming
video content.
e. MB encoding/finishing
i. Return to Step 3 to decide further MB quantizers until all MBs are encoded.
ii. The encoded frame is used to compute the next frame’s MAD in Step 1.
7. Quantization: As presented in the previous Section 2.3.3;
8. Entropy encoding: As presented in the previous Section 2.3.3;
Decoding Path (also within the encoder): As presented in the previous section.
Main Coding Tools
• Perceptual frame-level bit-allocation
Perceptual frame-level bit-allocation includes the following steps:
1. MAD is predicted by looking up the energy transition table;
2. Just-noticeable-distortion PSNR (PSNRJND) is computed through equation (21) using the JND MSE
(MSEJND) computed through equation (22), where the frame size is N multiplied by M, f is the
pixel luminance in the original frame, f̂ is the one in the reconstructed frame, THJND is the
empirical threshold and η is the tuning factor defined in equation (23).

PSNRJND = 10 log10(255² / MSEJND)    (21)

MSEJND = Σx Σy (|f(x, y) − f̂(x, y)| − THJND)² ∙ η(x, y) / (N ∙ M)    (22)

η(x, y) = 1, if |f(x, y) − f̂(x, y)| > THJND; η(x, y) = 0, if |f(x, y) − f̂(x, y)| ≤ THJND    (23)

PSNRJND is adopted to detect scene changes and to compensate the error in the prior MAD prediction;
several insignificant video signals (noise) will be filtered out to increase the accuracy of the
scene change detection.
3. Buffer fullness is considered to preserve buffer stability and to temporally enhance the video quality:
o If the buffer fullness is larger than a certain threshold, the rate controller can decrease the
budget bits by the overflowed bits (∆);
o Otherwise, the controller can increase the demanded frame bitrate by adding to it the
unused buffer capacity (∆), thus avoiding buffer overflow or underflow and enhancing the video
quality.
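The buffer-driven budget adjustment above can be sketched as follows; the function name and the single-threshold policy are assumptions for illustration, not the exact control law of [32].

```c
/* Sketch of the buffer-driven frame budget adjustment: when the fullness
   exceeds the threshold, drain the overflowed bits from the frame budget;
   otherwise, grant the unused capacity to the frame. */
long adjust_frame_budget(long budget_bits, long buffer_fullness,
                         long buffer_threshold)
{
    if (buffer_fullness > buffer_threshold)
        return budget_bits - (buffer_fullness - buffer_threshold);
    return budget_bits + (buffer_threshold - buffer_fullness);
}
```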
• MB-level quantizer decision using the 3-D BCQ model
The MB-level quantizer decision model using the BCQ model is the core technique for bitrate control
since it directly affects the video bitrate and distortion; its main steps are:
1. MB level quantizer decision control scheme (use the tangent slope of the BCQ curve to find the
proper MB quantizer)
a. Refine the BCQ model, that is, the PQrc uses quadratic functions to model the BCQ curves
for each quantizer;
b. Based on the refined BCQ model, compute the predicted MB bit-quota and complexity for
the kth MB in the ith frame, to determine its proper quantizer;
c. To find the proper MB quantizer, the current MB is characterized by its tangent slope;
d. If the budget bits are exhausted, the quantizer is reset to QPmax to prevent buffer overflow and
reduce the number of skipped frames;
e. The temporal behavior of the BCQ model is also considered to determine each initial
picture quantizer; the decision depends on the variation of the frame MAD in the IPPP
coding structure.
f. A decreasing frame MAD indicates that the next, motionless frame needs fewer budget bits
and, thus, the current QP(i) should be decreased by 1 to increase the encoded bits and
enhance the video quality. Otherwise, an increasing frame MAD indicates that the next,
intense frame needs more budget bits and, thus, the current QP(i) should be increased by 1 to
reduce the encoded bits.
2. BCQ model update using the weighted least-square estimation (use a weighted least-square
estimation to adapt related procedures of updating the 3D BCQ model to current MB properties)
The BCQ model is updated using a weighted least-square estimation based on the coded data sets (Rk, Ck, QP), corresponding to the MB bitrate, MB complexity and QP. When there are n data sets
for one specific QP, the BCQ curve function can be initialized by equation (24), where R is the matrix [R1 ⋯ Rn]ᵀ of the encoded bitrates, E is the matrix [E1 ⋯ En]ᵀ of the prediction errors, Θ is [θ1 θ2 θ3]ᵀ and A is the matrix presented in (26); equation (25) gives the corresponding least-square estimate of Θ.

R = A ∙ Θ + E    (24)

Θ = (AᵀA)⁻¹ ∙ Aᵀ ∙ R    (25)

A = [ C1² C1 1 ; C2² C2 1 ; ⋯ ; Cn² Cn 1 ]    (26)

In the least-square estimation, the indicator of the prediction error to be minimized is Eᵀ ∙ E.
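The BCQ curve fitting of equations (24)-(26) can be sketched as an ordinary least-squares fit of a quadratic in the MB complexity; the weighting of [32] is omitted here, the quadratic row layout follows the reconstruction of (26), and the normal equations are solved directly with Gaussian elimination.

```c
#include <math.h>

/* Sketch of the BCQ model update for one QP: fit R_k = t1*C_k^2 + t2*C_k
   + t3 to n coded (R_k, C_k) pairs by ordinary least squares, solving the
   3x3 normal equations A^T A * theta = A^T R by Gauss-Jordan elimination
   with partial pivoting. */
void bcq_fit(const double *R, const double *C, int n, double theta[3])
{
    double M[3][4] = {{0}};
    for (int k = 0; k < n; k++) {
        double row[3] = { C[k] * C[k], C[k], 1.0 };  /* k-th row of A */
        for (int i = 0; i < 3; i++) {
            for (int j = 0; j < 3; j++)
                M[i][j] += row[i] * row[j];          /* accumulate A^T A */
            M[i][3] += row[i] * R[k];                /* accumulate A^T R */
        }
    }
    for (int col = 0; col < 3; col++) {
        int piv = col;                               /* partial pivoting */
        for (int r = col + 1; r < 3; r++)
            if (fabs(M[r][col]) > fabs(M[piv][col])) piv = r;
        for (int j = 0; j < 4; j++) {
            double t = M[col][j]; M[col][j] = M[piv][j]; M[piv][j] = t;
        }
        for (int r = 0; r < 3; r++) {                /* eliminate column */
            if (r == col) continue;
            double fct = M[r][col] / M[col][col];
            for (int j = col; j < 4; j++) M[r][j] -= fct * M[col][j];
        }
    }
    for (int i = 0; i < 3; i++) theta[i] = M[i][3] / M[i][i];
}
```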
Performance Evaluation
• Test Conditions
The test sequences used were Akiyo, Foreman, News, Carphone and Suzie, with CIF and
QCIF resolutions. The highest channel rates are 1 Mbps/256 kbps for CIF and the lowest channel rates are 128 kbps/24 kbps
for QCIF. The GOP pattern was IPPP/IBBP and the target frame rate was 30 fps.
• Results and Analysis
The obtained results indicate that the MAD estimation precision rate is usually at least 98% in any
condition, comparing the MAD prediction with the actual MAD. The MAD prediction can also
effectively reduce the processing delay with respect to the two-pass rate control model, which is used to collect
coding information but requires a pre-analysis of the video characteristics.
In Table 2.4 and Table 2.5, results for two encoding conditions are presented: 1) CIF@high bitrate; and 2)
QCIF@low bitrate. The proposed PQrc solution can gain 0.3-0.7 dB on the average PSNR regarding the
H.264/AVC JM10.2 codec. The maximum PSNR improvement and degradation are about 1.1 dB and -0.1
dB, respectively. Figure 2.20 and Figure 2.21 show the comparison of image qualities for the cases
Carphone@24 kbps and Foreman@128 kbps, respectively. These images were encoded with the
proposed PQrc solution; it can be observed that there is a SNR improvement regarding the JM10.2 model
(the most significant differences are marked by black circles). Moreover, the smaller PSNR standard
deviation indicates less flickering and more consistent qualities; on average, the improvement is about 0.5
dB, especially for the lower bitrates. The stability of the buffer can be measured by the bitrate accuracy
value (χ), expressed in equation (27), and by the buffer fullness variation:

χ = 1 − |target_rate − output_rate| / target_rate    (27)

A value of χ approaching 1 indicates that the buffer status is stable, when the amount of encoded bits is
near the pre-defined buffer capacity.
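The bitrate accuracy of equation (27) is simple enough to be sketched directly in C; the function name is illustrative.

```c
#include <math.h>

/* Sketch of equation (27): bitrate accuracy between the target and the
   actually produced bitrate; a value of 1 means a perfect match. */
double bitrate_accuracy(double target_rate, double output_rate)
{
    return 1.0 - fabs(target_rate - output_rate) / target_rate;
}
```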
The analysis of the computational complexity considered two aspects:
1. The simpler frame complexity calculation and MB QP decision;
a. The PQrc uses simple arithmetic operations and look-up table techniques, without
complicated power calculations, to determine the frame complexity and the MB QP.
2. The slightly complicated scene-change detection and the BCQ model update.
a. To address the quality degradation problem, it is additionally proposed to detect scene
changes and refine the estimated frame complexity;
b. To continuously update the BCQ model, PQrc also requires slightly complicated matrix
operations to adjust the model coefficients.
The best, worst and average execution time gains are +3.13%, -2.97% and +0.87%, respectively.
Table 2.4 – Comparison of overall coding performance for the GOP IPPP pattern using PQrc and JM10.2 (average PSNR gain: 0.515 dB) [32]
Table 2.5 – Comparison of overall coding performance for the GOP IBBP pattern using PQrc and JM10.2 (average
PSNR gain: 0.35dB) [32]
Figure 2.20 – Comparison of the image quality for the Carphone sequence at 24 kbps (significant differences are
marked with black circles) [32]
Figure 2.21 – Comparison of the image quality for the Foreman sequence at 128 kbps (significant differences are
marked with black circles) [32]
Strengths and Weaknesses
The strengths of this solution are the better visual quality and buffer stability, avoiding the flickering effect,
and the PSNR improvement amounting to around 0.5 dB. The major weakness regards the MAD
prediction which fails at scene changes.
The goal of this chapter was to present a set of solutions that improve the perceptual quality of encoded
video and to use them as references for the work developed in this Thesis. The main references used were the
second solution presented, H.264/AVC Coding with JND Model based Coefficients Filtering, and the first
solution, H.264/AVC Perceptual Video Coding based on a Foveated JND Model. The idea is to implement
a solution that eliminates the transform coefficients which are below the JND threshold and adapts the
QP according to the JND thresholds.
Chapter 3
A JND Model based Coefficients Pruning Method for
H.264/AVC Video Coding
This chapter presents the first perceptually driven modification made to the H.264/AVC video codec with
the target to eliminate the transform coefficients which are perceptually irrelevant according to an adopted
JND model. After describing the coefficients pruning solution, the performance results obtained in the
context of the H.264/AVC JM 16.2 reference software are presented and analyzed.
3.1. The Perceptual Coefficients Pruning Method
3.1.1. Objective
The basic idea underpinning the adopted perceptual coefficients pruning method is to remove, by setting
them to zero, all the transform coefficients which have a magnitude lower than the corresponding JND
threshold determined using an appropriate JND model. The main target is thus the reduction of the total
bitrate while maintaining the perceptual video quality since the removed coefficients are perceptually
irrelevant. To achieve this objective, the HVS properties are adequately exploited through a JND model
considering the following HVS effects: frequency band masking (spatial frequency sensitivity), luminance
masking (intensity contrast masking), pattern masking and temporal masking [33].
3.1.2. Architecture
The improved codec architecture already including the perceptual coefficients pruning method related
tools is presented in Figure 3.1. The major changes associated to the novel tool regard the determination
of the pruning thresholds using the selected JND model and the transform coefficients pruning process
included before the coefficients quantization. It is important to note that the proposed codec modifications
only refer to the encoder and do not imply any change in the H.264/AVC syntax and semantics nor at the
decoder; this implies that fully compliant H.264/AVC bit streams are still created with the
perceptually driven video codec.
Figure 3.1 – Improved H.264/AVC codec architecture including the JND model and the coefficients pruning method.
3.1.3. Walkthrough
This section intends to present the walkthrough of the proposed improved video codec with special
emphasis on the novel modules related to the perceptual pruning of the integer DCT (ICT) coefficients,
which are listed in bold below.
Forward/Encoding Path
1. MB division: First, the input video frame is divided in MBs;
2. JND thresholds determination: JND thresholds are computed for each ICT coefficient
organized in 4x4 and 8x8 blocks using the JND model described in the following section,
designed by Naccari et al. [33];
3. Motion estimation: As presented in Section 2.3.4;
4. MB prediction: As presented in Section 2.3.4;
5. Residue computation: As presented in Section 2.3.2;
6. Transform: As presented in Section 2.3.2;
7. Coefficients pruning:
a. If the absolute value of a prediction residual ICT coefficient is larger than the
corresponding JND threshold, go to step 8;
b. Otherwise, the prediction residual ICT coefficient is set to zero meaning that the
corresponding coefficient is NOT perceptually relevant and, thus, may be pruned,
saving the associated rate; in this context, pruning means setting the value to zero;
8. Quantization: The transform coefficients which ‘survived’ the perceptual pruning method are now
quantized;
9. Entropy encoding: As presented in Section 2.3.1;
Decoding Path (also within the encoder): The decoder is the same as presented in Section 2.3.1 since
there are no changes in the decoder architecture, also implying that the H.264/AVC compliance is kept.
3.1.4. Novel Tools Description
This section describes the novel tools required to perform the perceptual pruning of the transform
coefficients, notably the JND thresholds determination and the perceptual coefficients pruning method.
The adopted pruning solution is based on the perceptual codec reviewed in Section 2.3, designed by Mak
and Ngan [30].
JND Thresholds Determination
Description
The adopted JND thresholds determination process relies on a spatial JND model which exploits three
human visual system masking aspects through appropriate sub-models:
i) Frequency band decomposition masking model which exploits the different sensitivity of the
human eye to the noise introduced at different spatial frequencies. To explore this masking effect,
the model is constituted by the default perceptual matrices adopted in the H.264/AVC reference
software;
ii) Luminance variations masking model which exploits the masking effect associated to luminance
variations in different image regions. To explore this masking effect, the model is based on the
Weber-Fechner law which states that the minimal brightness difference which may be perceived
increases with the background brightness;
iii) Pattern masking effects model which exploits the presence of some patterns in the image. To
explore this masking effect, the Foley-Boynton model was adopted.
The adoption of the spatial JND model has the main target to improve the final rate-distortion
performance through the exploitation of relevant HVS properties.
In the adopted JND model, the determination of the JND thresholds is performed through the following
steps:
• Luminance masking model is defined through equation (29), where Ī(k) denotes the average
luminance intensity in block k.

JNDLUM(k) = −0.048 ∙ Ī(k) + 4,    if Ī(k) ≤ 62
JNDLUM(k) = 1,    if 62 < Ī(k) < 115
JNDLUM(k) = 0.021 ∙ Ī(k) − 1.464,    if Ī(k) ≥ 115    (29)
• Frequency band decomposition masking model is applied to each MB coding mode as described by equation (30).

JNDband_Intra(i, j) = [ 6 13 20 28 ; 13 20 28 32 ; 20 28 32 37 ; 28 32 37 42 ]
JNDband_Inter(i, j) = [ 10 14 20 24 ; 14 20 24 27 ; 20 24 27 30 ; 24 27 30 34 ]    (30)
• Pattern masking model is defined through equation (31), where E(i, j, k) denotes the normalized
contrast energy and ε is 0.6.

JNDpat(i, j, k) = 1, if i, j = 0; max(1, E(i, j, k)^ε), otherwise    (31)
• The final JND threshold (JT) is computed through equation (32).

JT(i, j, k) = JNDband(i, j) ∙ JNDLUM(k) ∙ JNDpat(i, j, k)    (32)
Perceptual Coefficients Pruning Method
Description
The perceptual coefficients pruning method consists in setting to zero all the transform coefficients which
have an absolute magnitude lower than the corresponding JND threshold given by the adopted JND
model described above. This pruning process is important since it allows sending to the decoder, and
thus first to the following quantization module, only the perceptually relevant ICT coefficients, saving the
associated bitrate without any subjective quality penalty.
The decision to prune an ICT coefficient is based on equation (33), where YP represents the pruned
transform coefficients, Y stands for the transform coefficients and JT represents the relevant JND
threshold as defined by equation (32).
YP(i, j, k) = Y(i, j, k), if |Y(i, j, k)| > JT(i, j, k); 0, otherwise    (33)
To apply this pruning tool, it is important to know with which DCT, 4x4 or 8x8, the signal was transformed,
and apply the pruning process not only to the luminance coefficients but also to the chrominance
coefficients. The JND model computes different thresholds for the 4x4 and the 8x8 ICT blocks; so, when
a MB is coded with a 4x4 DCT, the JND thresholds used for the comparison in equation (33) are those
associated to the 4x4 blocks; naturally, the same happens for the 8x8 DCT blocks.
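The pruning rule of equation (33) can be sketched as follows; the block is assumed to be stored as a flat array in raster order with one JND threshold per coefficient, and the function name is illustrative, not the actual JM function of the implementation.

```c
#include <stdlib.h>

/* Sketch of equation (33): an ICT coefficient survives only when its
   magnitude exceeds the corresponding JND threshold; otherwise it is
   perceptually irrelevant and is pruned (set to zero). */
void prune_coefficients(int *coef, const double *jt, int n)
{
    for (int i = 0; i < n; i++)
        if (abs(coef[i]) <= jt[i])
            coef[i] = 0;
}
```

In the actual implementation this step sits between the transform and the quantization, so the pruned coefficients never reach the entropy coder and the associated rate is saved.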
Implementation
To implement the ICT perceptual coefficients pruning method, three processes had to be created: one for
the chrominances, and two for the luminance (one for the 4x4 blocks and another for the 8x8 blocks).
These processes correspond to the functions threshold_transform_chroma, threshold_transform4x4 and
threshold_transform8x8. Each of these functions is used after the transformation process and before the
quantization, so they are used in the functions dct_chroma, dct_4x4_perceptual, dct_16x16_perceptual
and dct_8x8_perceptual.
Definition of Assessment Metrics for the Pruning Method
Since the perceptual coefficients pruning method basically sets coefficients to zero, the most meaningful
metric to evaluate the impact of this tool is the number of zeroed coefficients due to the coefficients
pruning; however, other metrics may provide useful information regarding the pruning method. Taking this
into account, the metrics implemented were the following:
Average number of zeroed coefficients at MB level due to the perceptual coefficients pruning
method
o Definition
This metric measures the difference between the average number of zeroed coefficients with the
H.264/AVC perceptual codec with coefficients pruning after the quantization and the average number of
zeroed coefficients with the H.264/AVC High profile codec after the quantization; in this case, the
H.264/AVC High profile codec is taken as reference. With this definition, this metric effectively measures
the coefficients which were set to zero due to the additional perceptually driven coefficients pruning
method.
This metric is defined in equation (34), where Avg_coeff_MB, defined in equation (35), is the average
number of coefficients zeroed at MB level after the quantization in each frame, nr_coef_zero is the
number of coefficients zeroed in the frame and nr_MB_coded is the number of MBs in a frame with coded
coefficients. This metric is computed over all the frames in the sequence and, afterwards, the average over
time is computed as expressed in equation (36).
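The metric of equations (34)-(36) can be sketched as follows; the counts are assumed to be gathered by the encoder, and the function names are illustrative.

```c
/* Equation (35): average number of zeroed coefficients per coded MB
   in one frame. */
double avg_coeff_mb(long nr_coef_zero, long nr_mb_coded)
{
    return (double)nr_coef_zero / (double)nr_mb_coded;
}

/* Equation (34): coefficients zeroed specifically by the pruning tool,
   obtained as the difference between the perceptual codec and the
   High profile reference. */
double avg_zeroed_mb(double avg_perceptual_codec, double avg_high_profile)
{
    return avg_perceptual_codec - avg_high_profile;
}

/* Equation (36): temporal average of the per-frame metric over the
   whole sequence. */
double avg_zeroed_sequence(const double *per_frame, int nr_frames)
{
    double acc = 0.0;
    for (int i = 0; i < nr_frames; i++)
        acc += per_frame[i];
    return acc / nr_frames;
}
```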
o Implementation
In this subsection, the implementation of the metric above and the problems that appeared during this
implementation will be presented. This metric has been implemented by the following steps:
1. Count the number of zeroed coefficients due only to the perceptual coefficients pruning method
a. First approach: Count the number of coefficients set to zero while computing the perceptual
coefficients pruning method.
i. Problem: The coefficients set to zero by the quantization and the perceptual coefficients
pruning method alone overlap. So, the result of this first approach is not the expected since
the objective is to count the zeroed coefficients due exclusively to the perceptual coefficients
pruning method.
b. Second approach: Count the number of coefficients with zero value after the quantization
computation. Afterwards, the difference between this metric obtained for the H.264/AVC
perceptual codec and for the H.264/AVC High profile codec is computed to determine the
number of coefficients set to zero by the novel tool, without the coefficients zeroed by the
quantization.

Avg_zeroed_MB = Avg_coeff_MB|H.264/AVC perceptual codec − Avg_coeff_MB|H.264/AVC High profile codec    (34)

Avg_coeff_MB = nr_coef_zero / nr_MB_coded    (35)

Avg_zeroed_MB_sequence = (1/nr_frames) ∙ Σ frame=1..nr_frames Avg_zeroed_MB    (36)
2. Compute the number of MBs in a frame using equation (37).
nr_MB = (height × width) / (16 × 16)    (37)
3. Compute equation (35)
To obtain the number of coefficients which are set to zero exclusively due to the perceptual coefficients
pruning method, it is required to perform the steps above both for the H.264/AVC High profile codec and
also for the H.264/AVC perceptual codec with coefficients pruning.
4. Finally, equation (34) is computed.
However, the process above is not a perfect solution to determine the selected metric. Since the RDO is
enabled, the encoder tries all the coding modes to find the best one; so, the best coding mode
for the H.264/AVC High profile codec may not always be the best coding mode for the H.264/AVC
perceptual codec with coefficients pruning. Due to these variations, the difference computed in equation
(34) does not correspond to the effective number of coefficients set to zero exclusively due to the perceptual
coefficients pruning method, but it should provide a rather good approximation.
Average zigzag position of the zeroed coefficients exclusively due to the perceptual coefficients
pruning method (4×4 blocks)
o Definition
This metric computes the average zigzag position in the 4x4 blocks corresponding to the zeroed
coefficients due to the perceptual coefficients pruning method (position ∈ [1, 16]). This metric provides an
idea on the bandwidth zone where the coefficients are being zeroed. The metric is computed with
equation (38), where nr_coef_zeroi represents the number of zeroed coefficients in the position i of each
4x4 block for a frame; this metric is computed in the universe of the MBs coded with the 4x4 DCT.
o Implementation
This metric has been implemented using an array with 16 positions where each position is associated to
the zigzag position in the block. So, position zero in the array corresponds to the first ICT coefficient in the
zigzag scanning order of the 4x4 block. With this purpose, the two following steps are performed:
1. Count the number of zeroed coefficients for each zigzag position (scoring[16])
When performing the quantization process, the number of zeroed coefficients for each zigzag position is
counted. For each frame, only the 4x4 transformed blocks will be considered. For each zigzag position of
the 4x4 block which has its coefficient set to zero, the scoring value for that position is incremented by
one. In this way, at the end of each frame, the array will be filled at each position with the number of
zeroed coefficients for that position in the 4x4 transformed blocks.

Avg_zigzag_position_4x4 = Σi (nr_coef_zeroi × positioni) / Σi nr_coef_zeroi    (38)
2. Compute the average zigzag position using equation (38), taking into account that, for i ∈ [1, 16]:
• nr_coef_zeroi = scoring[i-1];
• positioni = i.
The result is the zigzag scanning position in a 4x4 block where, on average, the coefficients are set to
zero for each frame; to obtain the same metric over the video sequence, the frame metric values have to
be averaged over all the frames in the video sequence.
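The two steps above can be sketched as follows; the scoring array is assumed to have been filled during quantization as described, and the function name is illustrative.

```c
/* Sketch of equation (38): the scoring array holds the number of pruned
   coefficients per zigzag position (index 0 is the first coefficient in
   zigzag order); the metric is their weighted average position in [1, 16],
   or 0 when no coefficient was pruned in the frame. */
double avg_zigzag_position(const int scoring[16])
{
    long num = 0, den = 0;
    for (int i = 1; i <= 16; i++) {   /* position i, count scoring[i-1] */
        num += (long)scoring[i - 1] * i;
        den += scoring[i - 1];
    }
    return den ? (double)num / (double)den : 0.0;
}
```

A low average position indicates that pruning is removing low-frequency coefficients, while a high value shows that mostly high-frequency (perceptually less relevant) coefficients are being zeroed.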
3.2. Performance Evaluation
This section intends to assess the performance of the presented perceptual H.264/AVC codec, including
the proposed perceptual coefficients pruning method. With this purpose in mind, first the adopted test
conditions are presented, including the selected performance metrics and the benchmarks; afterwards, the
performance results are presented and analyzed.
3.2.1. Test Conditions
For the test experiments, the H.264/AVC reference software, version JM 16.2 (FRExt), has been used,
notably the High profile which is the best performing profile from the compression efficiency point of view.
Thus, the adopted JND model and the perceptual coefficients pruning method described in the previous
section were implemented in the context of this reference software. Further test conditions include [34]:
• GOP prediction structure: IBBPBBPBBP..., with a single Intra frame at the beginning.
• Rate control: RDO is in the high complexity mode.
• Test sequences and resolutions:
o Foreman and Mobile
� Spatial resolution: CIF
� Frame rate: 30 fps
� Total number of frames: 300 frames
o Panslow and Spincalendar
� Spatial resolution: 1280x720
� Frame rate: 60 fps
� Total number of frames: 150 frames
o Playing_cards and Toys_and_calendar
� Spatial resolution: 1920x1080
� Frame rate: 24 fps
� Total number of frames: 60 frames
The first frame of each sequence is presented in Figure 3.2 to give an idea on the type of content of each
sequence.
Figure 3.2 – First frame of each test sequence: (a) Foreman; (b) Mobile; (c) Panslow; (d) Spincalendar;
(e) Playing_cards; (f) Toys_and_calendar
• Quantization parameters: Several groups of QP values are used; to simplify, each group is
presented as Gx = (QPI, QPP, QPB), where Gx represents group x and QPy the quantization
parameter for frames of type y, with y being I, P or B:
o G1 = (12, 12, 12)
o G2 = (16, 16, 16)
o G3 = (22, 23, 24)
o G4 = (27, 28, 29)
o G5 = (32, 33, 34)
o G6 = (37, 38, 39)
The higher the x, the larger the quantization steps and, thus, the lower the rate and the quality.
• Coding benchmarks: The novel perceptual video codec is compared with the H.264/AVC High
profile codec without perceptual coefficients pruning; in the following, the H.264/AVC High profile
codec will be labeled as ‘HP’ while the H.264/AVC based perceptual with coefficients pruning
codec will be labeled as ‘JND_CP’.
• Performance metrics:
o PSNR: measures the quality of a reconstructed image based on the comparison of the
decoded and the original sequences through the MSE. PSNR is determined by equation (39),
where R is the maximum fluctuation in the input image data type, this means 255 for 8
bit/sample content, and MSE is defined by equation (40) where M and N are the number of
rows and columns in the input images, respectively, and I1 and I2 are the original image and
the reconstructed image, respectively.
$$\mathit{PSNR} = 10 \log_{10}\left(\frac{R^2}{\mathit{MSE}}\right) \qquad (39)$$

$$\mathit{MSE} = \frac{\sum_{m=1}^{M} \sum_{n=1}^{N} \left[I_1(m,n) - I_2(m,n)\right]^2}{M \cdot N} \qquad (40)$$
In this context, the PSNR for a video sequence is simply the temporal average of the PSNRs
for each frame.
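The PSNR computation in equations (39) and (40), together with the temporal averaging just described, can be sketched with NumPy as follows (a minimal sketch; the peak value R defaults to 255 for 8 bit/sample content):

```python
import numpy as np

def psnr(original, reconstructed, peak=255.0):
    """Equations (39)-(40): PSNR in dB from the MSE between two frames."""
    original = np.asarray(original, dtype=np.float64)
    reconstructed = np.asarray(reconstructed, dtype=np.float64)
    mse = np.mean((original - reconstructed) ** 2)
    if mse == 0:
        return float('inf')  # identical frames
    return 10.0 * np.log10(peak ** 2 / mse)

def sequence_psnr(frames_a, frames_b):
    """Temporal average of the per-frame PSNRs, as used in the text."""
    return sum(psnr(a, b) for a, b in zip(frames_a, frames_b)) / len(frames_a)
```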
o Multi-scale SSIM (MS-SSIM): measures the quality of the perceived image, taking into account
the image signal density, the distance between the image plane and the observer and the
perceptual capability of the observer's visual system. This metric is computed through equation
(41), where M is the finest scale obtained after M-1 scaling iterations; lj(x,y), cj(x,y) and sj(x,y)
are the luminance, contrast and structure components at the different scales and αj, βj and ϒj
are set according to the scale, as aforementioned, so they match the HVS contrast sensitivity
function [35].
$$\mathit{SSIM}(x,y) = \left[l_M(x,y)\right]^{\alpha_M} \cdot \prod_{j=1}^{M} \left[c_j(x,y)\right]^{\beta_j} \cdot \left[s_j(x,y)\right]^{\gamma_j} \qquad (41)$$
In this context, the MS-SSIM for a video sequence is simply the temporal average of the MS-
SSIMs for each frame. MS-SSIM is considered to be an objective quality metric with a better
correlation than PSNR with the user mean opinion scores.
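To illustrate the multi-scale structure of equation (41), the sketch below uses global (whole-image) statistics instead of the usual local sliding windows, and assumes the standard five scale weights from the MS-SSIM literature with βj = γj folded into a single exponent per scale; it is a simplified illustration under these assumptions, not the reference MS-SSIM implementation.

```python
import numpy as np

# Standard MS-SSIM scale weights from the literature (an assumption here,
# not a value taken from the thesis text).
WEIGHTS = [0.0448, 0.2856, 0.3001, 0.2363, 0.1333]

def _luminance_cs(x, y, peak=255.0):
    """Global luminance (l) and contrast-structure (cs) terms of SSIM."""
    c1, c2 = (0.01 * peak) ** 2, (0.03 * peak) ** 2
    mx, my = x.mean(), y.mean()
    cov = ((x - mx) * (y - my)).mean()
    l = (2 * mx * my + c1) / (mx ** 2 + my ** 2 + c1)
    cs = (2 * cov + c2) / (x.var() + y.var() + c2)
    return l, cs

def _downsample(img):
    """2x2 average pooling (low-pass filtering plus dyadic downsampling)."""
    h, w = img.shape[0] // 2 * 2, img.shape[1] // 2 * 2
    img = img[:h, :w]
    return (img[0::2, 0::2] + img[1::2, 0::2] +
            img[0::2, 1::2] + img[1::2, 1::2]) / 4.0

def ms_ssim(x, y, weights=WEIGHTS):
    """Equation (41), simplified: cs terms over all scales, luminance at scale M."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    result = 1.0
    for j, w in enumerate(weights):
        l, cs = _luminance_cs(x, y)
        if j == len(weights) - 1:
            result *= (l * cs) ** w  # luminance only enters at the last scale
        else:
            result *= cs ** w
            x, y = _downsample(x), _downsample(y)
    return result
```

For identical images the result is 1.0, and it decreases as the images diverge at any of the scales.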
o Average number of transform coefficients zeroed per MB due to the perceptual coefficients
pruning method: this metric provides an idea on the number of transform coefficients which
are set to zero due to the usage of the proposed pruning method and is computed using
equation (34) above.
o Average zigzag position of the zeroed coefficients in 4x4 coded blocks: this metric provides an
idea on the zigzag position of the coefficients which are set to zero due to the usage of the
proposed pruning method and is computed using equation (38) above.
3.2.2. Results and Analysis
To assess the RD performance, RD charts for the PSNR and MS-SSIM objective quality metrics versus
the bitrate were obtained for each test sequence and for each adopted RD point.
The variations of the various metrics between the proposed perceptual codec (H.264/AVC JND_CP) and
the H.264/AVC High profile (H.264/AVC HP) benchmark are computed using equations (42), (43) and
(44) for each RD point. These values should ideally show the improvements brought by the presented
H.264/AVC solution regarding the ‘conventional’ H.264/AVC HP codec as implemented in the JVT
reference software developed by the standardization group itself. The average values for these metrics
and respective variations are presented in Table A-1 of Annex A.
$$\Delta \mathit{bitrate}\,(\%) = \frac{\mathit{bitrate}_{JND} - \mathit{bitrate}_{HP}}{\mathit{bitrate}_{HP}} \times 100 \qquad (42)$$

$$\Delta \mathit{PSNR}\,(\%) = \frac{\mathit{PSNR}_{JND} - \mathit{PSNR}_{HP}}{\mathit{PSNR}_{HP}} \times 100 \qquad (43)$$

$$\Delta \mathit{MSSSIM}\,(\%) = \frac{\mathit{MSSSIM}_{JND} - \mathit{MSSSIM}_{HP}}{\mathit{MSSSIM}_{HP}} \times 100 \qquad (44)$$
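Equations (42), (43) and (44) share the same form; a small helper (with hypothetical naming) makes this explicit:

```python
def percent_variation(value_jnd_cp, value_hp):
    """Relative variation (in %) of a metric between the JND_CP codec and
    the HP benchmark, as in equations (42), (43) and (44)."""
    return (value_jnd_cp - value_hp) / value_hp * 100.0
```

A negative Δbitrate thus means the JND_CP codec spends fewer bits than the HP benchmark for that RD point, while a positive ΔPSNR or ΔMS-SSIM means a quality gain.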
• RD Performance: PSNR versus Bitrate
This subsection presents the RD charts for the PSNR versus the bitrate for each test sequence; the
results are afterwards analyzed.
Figure 3.3 – PSNR RD performance for the Foreman sequence
Figure 3.4 – PSNR RD performance for the Mobile sequence
Figure 3.5 – PSNR RD performance for the Panslow sequence
Figure 3.6 – PSNR RD performance for the Spincalendar sequence
Figure 3.7 – PSNR RD performance for the Playing_cards sequence
Figure 3.8 – PSNR RD performance for the Toys_and_calendar sequence
Analyzing the RD variation in Figure 3.3 to Figure 3.8, the following conclusions may be taken:
• As expected, the PSNR increases with the rate, at first quite quickly and afterwards with a rather
linear variation; in this case, the various rates were obtained from the various quantization
parameter combinations, corresponding to the Gx labels in the charts.
• For the low and medium resolution sequences, there are no significant variations in terms of RD
performance between the two H.264/AVC codecs (HP and JND_CP) under comparison.
• For the higher resolution sequences, there are evident RD performance gains obtained with the
H.264/AVC JND_CP solution, notably for the higher rates. The rate gains go up to about 7 Mbit/s,
i.e., 8% (for the same PSNR) for the last RD point or PSNR gains of about 0.5 dB, i.e., 1.1% (for
the same rate) for the last RD point.
• The RD gains are larger for the higher resolutions because, in high resolution sequences, each MB
corresponds to a tinier physical area; in this context, the redundancy is higher and, therefore, the
coefficients are lower and, consequently, the pruned coefficients have a lower impact on the video
quality.
• RD Performance: MS-SSIM versus Bitrate
This subsection presents the RD charts for the MS-SSIM versus the bitrate for each test sequence; the
results are afterwards analyzed.
Figure 3.9 – MS-SSIM RD performance for the Foreman sequence
Figure 3.10 – MS-SSIM RD performance for the Mobile sequence
Figure 3.11 – MS-SSIM RD performance for the Panslow sequence
Figure 3.12 – MS-SSIM RD performance for the Spincalendar sequence
Figure 3.13 – MS-SSIM RD performance for the Playing_cards sequence
Figure 3.14 – MS-SSIM RD performance for the Toys_and_calendar sequence
Analyzing the RD variation in Figure 3.9 to Figure 3.14, the following conclusions may be taken:
• As expected, the MS-SSIM increases with the rate, at first rather quickly and afterwards saturating
for the higher bitrates; this basically means that the subjective quality saturates when the
rate increases above a certain value, implying that, contrary to what the PSNR suggests, the
subjective quality does not continuously increase with the rate since non-perceptible details are
being sent at some stage. As before, the various rates are obtained from the various quantization
parameter combinations, corresponding to the Gx labels in the charts.
• For the low and medium resolution sequences, the H.264/AVC JND_CP solution shows rate gains
up to about 8% (for the same MS-SSIM) for the last RD point (G1).
• For the higher resolution sequences, there are more evident RD performance gains associated with
the H.264/AVC JND_CP solution, notably for the higher rates. The rate gains go up to about 11
Mbit/s (13%) (for the same MS-SSIM) for the last RD point (G1).
• From the two bullets above, it is clear that the RD performance gains are larger for the higher
resolution sequences. This happens because, in high resolution sequences, each MB corresponds
to a tinier physical area; in this context, the redundancy is higher, the coefficients are, therefore,
lower and, consequently, the pruned coefficients have a lower impact on the video quality.
• Comparing the MS-SSIM RD performance with the PSNR RD performance, the following
conclusions may be taken:
o For low resolution sequences, high quality (saturation zone) in terms of MS-SSIM is
achieved with a rate around 700 kbit/s and 1800 kbit/s for Foreman and Mobile sequences,
respectively, while high PSNR stable values are only achieved for higher rates, notably
higher than 3700 kbit/s and 6500 kbit/s for the Foreman and Mobile sequences, respectively.
o For medium resolution, the MS-SSIM high quality is achieved with a rate around 60 Mbit/s
and 10 Mbit/s for the Panslow and Spincalendar sequences, respectively, while high PSNR
stable quality is only achieved for rates higher than 100 Mbit/s.
o For high resolution, high quality MS-SSIM is achieved around 10 Mbit/s and 45 Mbit/s for the
Playing_cards and Toys_and_calendar sequences, respectively, while high PSNR stable
quality is only achieved for rates higher than 80 Mbit/s.
o The observations above highlight that PSNR improvements after a certain rate are non-
perceptible since the subjective quality saturates due to the HVS perception limitations. The
recognition of this effect may allow adopting lower coding rates while still achieving the same
high subjective quality.
• Average number of zeroed coefficients per MB due to the perceptual coefficients pruning
method
This metric measures the average number of coefficients in a MB which are zeroed after the quantization
exclusively due to the pruning method. As such, it corresponds to the difference between the average
number of zeroed coefficients for a MB when using the JND model and the average number of zeroed
coefficients for a MB when the sequence is coded with the H.264/AVC High profile (HP) reference
software. However, as the coding modes may not always be precisely the same, this difference may
sometimes be negative, although this happens rather rarely.
In this subsection, the evolution in time of the average number of zeroed coefficients for a MB, for each
sequence, will be presented and afterwards analyzed.
Figure 3.15 – Average number of zeroed coefficients for the Foreman sequence
Figure 3.16 – Average number of zeroed coefficients for the Mobile sequence
Figure 3.17 – Average number of zeroed coefficients for the Panslow sequence
Figure 3.18 – Average number of zeroed coefficients for the Spincalendar sequence
Figure 3.19 – Average number of zeroed coefficients for the Playing_cards sequence
Figure 3.20 – Average number of zeroed coefficients for the Toys_and_calendar sequence
From the analysis of Figure 3.15 to Figure 3.20, it may be concluded that:
• For lower Gs, this means higher rates and qualities, the average number of zeroed coefficients is
larger because, for the higher rates, the reference software uses lower thresholds (i.e., has fewer
coefficients to set to zero). Therefore, when the perceptual coefficients pruning method is applied,
there will be more coefficients under the JND threshold. The highest average number of zeroed
coefficients in a MB is around 10.2 coefficients for the lowest G in the Playing_cards sequence.
• For higher Gs, this means lower bitrates and qualities, the average number of zeroed coefficients per
MB is close to zero since most of the coefficients which are perceptually irrelevant are already set
to zero by the quantization process.
• The average number of zeroed coefficients for a MB is not constant along the sequence. This
variation is due to the used GOP prediction structures including I, P and B frames which have
rather different characteristics in terms of coefficients energy.
• The oscillations due to the GOP prediction structure do not have the same intensity for all
quantization parameter combinations Gx, i.e., the difference in the average number of zeroed
coefficients in a MB between a P frame and a B frame varies with the QP value. While the B
frames have coefficients with low values, the P frames have higher values for the coefficients. As
the QP value decreases, the number of coefficients which have a value below the JND threshold
decreases. Since the values of the coefficients in B frames are low, the number of zeroed
coefficients does not change much with different QPs. On the other hand, in P frames, the number
of zeroed coefficients decreases when the QP decreases. Consequently, the difference between
the number of zeroed coefficients in a MB between a P frame and a B frame rises when the QP
value decreases.
• Considering all the sequences, it becomes clear that the oscillations are not only due to the GOP
prediction structure, but there are also other reasons involved. For example, the panning in
Foreman around frame 200 and the zoom out in Mobile around frames 100 to 150 imply an
increase in the average number of zeroed coefficients per MB. This happens because the image
variation is exploited by the frequency decomposition masking model.
• The average number of zeroed coefficients per MB increases as the resolution increases. Looking
into G1, for each sequence, it is clear that for low resolutions the average is around 5 coefficients,
for medium resolution the average increases to around 6 coefficients and, for high resolution, the
average is around 7.5 coefficients. This happens because, in high resolution sequences, each MB
corresponds to a tinier physical area; in this context, the redundancy is higher and, therefore, the
coefficients are lower and, consequently, there are more coefficients pruned.
For each spatial resolution, a more detailed analysis of the results allows stating that:
- CIF resolution: the average number of zeroed coefficients per MB is higher for the Foreman
sequence because this sequence presents strong variations in time since it has three main parts:
first, a close up of a man talking; then, a view of open sky; and, afterwards, a view of a building
under construction. In this case, the frequency band decomposition masking model exploits the
type and quantity of variations in each image.
- 720p60 resolution: the Panslow sequence presents a higher average number of zeroed coefficients
per MB because there are pattern areas where the JND model, more precisely the pattern masking
59
model, exploits the contrast. As the JND threshold is larger for sequences with pattern objects,
more coefficients are pruned;
- 1080p24 resolution: the average number of zeroed coefficients per MB is higher for the
Playing_cards sequence because it has a lower luminance level than the Toys_and_calendar
sequence. As the luminance masking model used in the JND model provides a higher JND threshold
for sequences with a darker background (low level of luminance), the perceptual coefficients
pruning method is capable of setting more coefficients to zero.
• Average zigzag position of the zeroed coefficients exclusively due to the perceptual
coefficients pruning method
This metric represents the average zigzag position in a 4×4 block where the coefficients are more likely to
be set to zero. With this information, it is possible to determine where the pruning method has more
impact in terms of frequency. The position in the block is determined in terms of zigzag scanning order.
Figure 3.21 – Average zigzag position of the zeroed coefficients for the 4x4 blocks in the Foreman sequence
Figure 3.22 – Average zigzag position of the zeroed coefficients for the 4x4 blocks in the Mobile sequence
Figure 3.23 – Average zigzag position of the zeroed coefficients for the 4x4 blocks in the Panslow sequence
Figure 3.24 – Average zigzag position of the zeroed coefficients for the 4x4 blocks in the Spincalendar sequence
Figure 3.25 – Average zigzag position of the zeroed coefficients for the 4x4 blocks in the Playing_cards sequence
Figure 3.26 – Average zigzag position of the zeroed coefficients for the 4x4 blocks in the Toys_and_calendar
sequence
From the analysis of Figure 3.21 to Figure 3.26, it is possible to conclude:
• The average zigzag position of the zeroed coefficients in 4x4 blocks increases with the rate. This
is expected as, the higher the rate, the lower the quantization step and, thus, the more irrelevant
coefficients are coded, if not filtered by the pruning method.
• For the first Gs, this means higher rates and qualities, the average zigzag position is slightly higher,
going up to 11.4, on average, for G1 in the Mobile sequence. This happens since the reference
software has a lower threshold for the higher bitrates (i.e., the quantization only sets to zero
coefficients with very high frequency); consequently, when the perceptual coefficients pruning
method is applied, since the human eye is less sensitive to high frequencies, the coefficients set to
zero will be in a slightly higher frequency range compared to the other Gs.
• On average, the zigzag position where coefficients are more likely to be set to zero is
approximately 10, with a variance of approximately 2, for the low and medium resolutions and
approximately 9, with a variance of approximately 2, for the high resolution. These positions
correspond to the middle range frequencies. This happens since the frequency band
decomposition masking model applies a higher threshold for the higher frequencies; thus, in these
frequencies, more coefficients are set to zero. Therefore, it could be assumed that this method has
a bigger impact on the high frequencies; however, since the quantization already sets to zero
coefficients in the higher frequencies, this method ends up having a bigger impact in the middle
frequencies.
• As for the average number of zeroed coefficients per MB, the average zigzag position shows
oscillations due to two types of reasons: the GOP prediction structure and the particular
characteristics of the sequences, as aforementioned. The particular characteristics of the
sequences with impact on the variation of the average zigzag position of the zeroed coefficients are
presented next.
For each resolution, a more detailed analysis allows concluding:
- CIF resolution: the average zigzag position for 4x4 blocks is higher for the Mobile sequence
because this sequence has fewer image variations than the Foreman sequence; the frequency
decomposition masking model exploits the variations in the latter more strongly;
- 720p60 resolution: the Spincalendar sequence presents a higher average zigzag position for 4x4
blocks because it is a sequence with a spinning image, contrary to the Panslow sequence where a
camera panning is present, thus introducing some strong variations in the image; the frequency
decomposition masking model exploits these variations;
- 1080p24 resolution: the average zigzag position for 4x4 blocks is similar for the
Toys_and_calendar and Playing_cards sequences.
The average zigzag position of the zeroed coefficients for 4x4 blocks is not constant along the sequence
as aforementioned. These variations along time may be explained as follows:
• Foreman sequence: Figure 3.21 shows that, around frame 200, this means when the camera does
a panning over the sky, the average zigzag position for 4x4 blocks decreases. This corresponds to
a strong variation in the image, as there is first a close up of a man speaking, then an image with
some open sky and, finally, a view of a building under construction. The JND model, more precisely
the frequency band decomposition masking model, exploits the type and quantity of variations in
the image.
In Annex A, Table A-2 represents the evolution of the average zigzag position of the zeroed coefficients
for 4x4 blocks and the variance of this position along the RD points for all sequences.
3.2.3. Conclusion
The main conclusion of this chapter is that the adopted perceptual coefficients pruning method may have
some positive impact on the RD performance, notably for sequences with lower luminance levels (e.g.
Playing_cards vs Toys_and_calendar) and objects with patterns (e.g. Panslow vs Spincalendar),
especially for the low QP values and higher resolutions.
The H.264/AVC JND_CP codec can achieve PSNR RD gains that are only evident for the higher
resolutions and for the higher rates. The rate gain goes up to 8% for the last RD point and the PSNR gain
goes up to 0.5 dB for the last RD point. Also, the MS-SSIM RD performance can achieve rate gains up to
8% in rate for low and medium resolutions and 13% for high resolutions, both for the last RD point.
A good quality in terms of MS-SSIM can be achieved with a lower rate (around 700 kbit/s for the Foreman
sequence) than in terms of PSNR, where a good quality is only achieved with a higher rate (larger than
3.7 Mbit/s for the Foreman sequence).
Regarding the number of zeroed coefficients due to the perceptual coefficients pruning method, it can go
up to around 9.7 coefficients, on average, for the Playing_cards sequence in the lowest G. These zeroed
coefficients are more likely to be located between the 8th and 12th zigzag positions in a 4x4 block, as
shown in Figure 3.27. Therefore, the coefficients zeroed by the perceptual coefficients pruning method
are typically in the middle to high frequencies.
Figure 3.27 – 4x4 block with the average zeroed positions highlighted
1 2 6 7
3 5 8 13
4 9 12 14
10 11 15 16
Chapter 4
A JND Model based Adaptive Quantization Method for
H.264/AVC Video Coding
This chapter presents the second perceptually driven modification to the reference H.264/AVC video
codec with the objective of quantizing the integer DCT (ICT) coefficients based on an adopted JND
model. After describing the new ICT quantization solution, the performance results obtained in the context
of the H.264/AVC JM 16.2 reference software using several objective quality metrics are presented and
analyzed.
4.1. The JND Adaptive Quantization Method
4.1.1. Objective
The basic idea of the new perceptually driven quantization method for the ICT coefficients is to adapt the
QP values based on the JND thresholds computed based on a selected JND model. With this purpose, a
distortion weight will be determined for each MB to be applied to each initially computed QP value in
order to get a new JND adaptive QP value. The basic idea is to code each ICT coefficient with an
accuracy driven by the HVS properties, thus avoiding sending information that is not visually perceptible.
The final target is to reduce the bitrate necessary to reach a certain target subjective video quality.
4.1.2. Architecture
The improved encoder architecture already including the JND adaptive quantization related tools is
presented in Figure 4.1 while Figure 4.2 presents the improved encoder architecture including also the
ICT perceptual coefficients pruning solution presented in the previous chapter. The major architectural
changes regard the control of the QP value based on the JND thresholds. It is important to note that the
proposed codec modifications only refer to the encoder and they do not imply any change in the
H.264/AVC syntax and semantics, meaning that still fully compliant H.264/AVC bit streams are created.
Figure 4.1 – Encoder architecture with the JND related adaptive quantization modules
Figure 4.2 – Encoder architecture with the JND related adaptive quantization and ICT coefficients pruning modules
4.1.3. Walkthrough
This section presents a walkthrough of the new perceptual video codec with special emphasis on the
novel modules related to the QP perceptual control based on the JND model, which are listed in bold.
The 7th step is only included for the improved perceptual codec using both the
JND related methods presented in this Thesis, this means coefficients pruning and JND adaptive
quantization.
Forward/Encoding Path
1. MB Division: As presented in Section 2.3.1;
2. JND thresholds determination: As presented in Section 3.1.3;
3. Motion Estimation: As presented in Section 2.3.4;
4. MB prediction: As presented in Section 2.3.4;
5. Residue Computation: As presented in Section 2.3.2;
6. Transform: As presented in Section 2.3.2;
7. Coefficients Pruning: As presented in Section 3.1.3;
8. JND QP Adaptation: Adapt the QP value for each MB based on the JND thresholds as
defined in the next section;
9. Quantization: Quantize the transformed (and eventually pruned) coefficients with the JND adaptive
QP;
10. Entropy encoding: As presented in Section 2.3.1;
Decoding Path (also within the encoder): The decoder is the same as presented in Section 2.3.1 since
there are no changes in the decoder. This also reflects the fact that the proposed JND related tools do not
impact the H.264/AVC compliance, and thus the same (normative) decoder is used.
4.1.4. Description of New Tool
This section describes the novel tool required to perform the QP perceptual adaptation, notably the
computation of the new QP. This solution is based on the perceptual video coding solutions presented in
Section 2.3 developed by Chen and Guillemot [27] [28].
JND Adaptive Quantization
Description
The JND adaptive quantization method consists in adapting the QP value for each MB, taking into
account its perceptual relevance. The basic idea is to use the JND thresholds to determine a new QP
value: if the average value of the JND thresholds in a MB is higher than the average value of the JND
thresholds in the frame (using the thresholds for the relevant coding mode, e.g. 4x4 DCT or 8x8 DCT and
INTRA or INTER modes), the MB is perceptually less relevant regarding the average relevance of the
frame. Consequently, the new QP will be higher than the QP determined by the H.264/AVC JM reference
software, exploiting the HVS behavior to mask some additional quantization noise, thus saving some
bitrate.
This tool modifies the QP value as initially determined by the H.264/AVC JM reference software by using
a weighted distortion (dist) computed based on the JND thresholds determined using the JND model
presented in Section 3.1.4. The QP value is adapted as follows:
$$QP_{\mathit{JND}_i} = QP_i \cdot \mathit{dist}_i \qquad (45)$$
where QPi is the initially determined QP value and QPJNDi is the adapted QP value.
The weighted distortion dist is computed through equation (46) where avg_JND_MBi is the average value
of the JND thresholds for MBi, computed by equation (47), and avg_JND_frame is the average value of
the JND thresholds for the frame, computed by equation (48) where JT is the JND threshold for each
coefficient in the MB.
$$\mathit{avg\_JND\_MB}_i = \frac{\sum_{x=1}^{16} \sum_{y=1}^{16} JT(x,y)}{16 \times 16} \qquad (47)$$

$$\mathit{avg\_JND\_frame} = \frac{\sum_{x=1}^{\mathit{height}} \sum_{y=1}^{\mathit{width}} JT(x,y)}{\mathit{height} \times \mathit{width}} \qquad (48)$$
In summary, the average JND thresholds for the frame and for each MB are computed using the JND
model presented in Section 3.1.4. Afterwards, the weighted distortion for each MB is computed and the
new QP value is determined by this distortion as in (45).
To avoid a subjectively negative flickering effect due to the variation of the QP value between MBs, the
QP variations are limited: the QP value can only decrease by 1 and increase by up to 3, as defined by
equation (49). After the determination of QPJNDi with equation (45), the conditions in (49) are checked: if disti is
(49). After the determination of QPJND i with equation (45), the conditions in (49) are checked: if disti is
lower than 1 and QPJNDi is less than QPi-1 or if disti is higher than 1 and QPJNDi is higher than QPi+3,
QPJNDi will be further changed: in the first case, QPJNDi will be set to QPi-1 while in the second case QPJNDi
will be set to QPi+3.
$$\begin{cases} QP_{\mathit{JND}_i} = QP_i + 3 & \text{if } (\mathit{dist}_i > 1) \;\text{and}\; (QP_{\mathit{JND}_i} - QP_i) > 3 \\[4pt] QP_{\mathit{JND}_i} = QP_i - 1 & \text{if } (\mathit{dist}_i < 1) \;\text{and}\; (QP_i - QP_{\mathit{JND}_i}) > 1 \end{cases} \qquad (49)$$
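Equations (45), (46) and (49) can be combined into a small sketch. It assumes the reconstruction of equation (46) as a sigmoid weight in the range (0.7, 1.3), assumes `avg_jnd_mb` and `avg_jnd_frame` come from equations (47) and (48), and assumes the weighted QP is rounded to the nearest integer (the text does not state the rounding rule).

```python
import math

def distortion_weight(avg_jnd_mb, avg_jnd_frame):
    """Equation (46): sigmoid weight in (0.7, 1.3); equal to 1.0 when the
    MB average JND threshold matches the frame average."""
    z = -4.0 * (avg_jnd_mb - avg_jnd_frame) / avg_jnd_frame
    return 0.7 + 0.6 / (1.0 + math.exp(z))

def jnd_adaptive_qp(qp, avg_jnd_mb, avg_jnd_frame):
    """Equations (45) and (49): scale the QP by the distortion weight and
    clamp the variation to [-1, +3] to avoid flickering between MBs."""
    dist = distortion_weight(avg_jnd_mb, avg_jnd_frame)
    qp_jnd = round(qp * dist)  # integer QP assumed
    return max(qp - 1, min(qp + 3, qp_jnd))
```

A perceptually less relevant MB (JND average above the frame average) thus gets its QP raised by at most 3, while a more relevant MB gets it lowered by at most 1.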
Implementation
To compute the real JND threshold average for a frame using the coding mode selected for each MB, it
would be necessary to know the MB coding modes before the encoding process, implying that the frame
would have to be coded twice. To simplify the computation of the average JND threshold for the frame,
this average is computed as if the coding mode of the current MB is the coding mode of all MBs in the
frame (e.g. if the MB being coded uses a 4x4 DCT with INTRA mode, the average JND threshold in the
frame is the average JND thresholds in a frame where all MBs are coded with 4x4 DCT and INTRA
mode). This allows applying the algorithm above based only on the four frame-level average JND
thresholds identified below.
$$\mathit{dist}_i = 0.7 + \frac{0.6}{1 + \exp\!\left(-4\,\dfrac{\mathit{avg\_JND\_MB}_i - \mathit{avg\_JND\_frame}}{\mathit{avg\_JND\_frame}}\right)} \qquad (46)$$
To implement the JND adaptive quantization tool, the following steps are required.
For each frame
1. Compute the average JND threshold in the frame level using equation (48) for the 4x4 and 8x8
integer DCT for each INTRA and INTER coding mode, notably:
• 4x4 DCT with INTRA mode for luminance;
• 4x4 DCT with INTER mode for luminance;
• 8x8 DCT with INTRA mode for luminance;
• 8x8 DCT with INTER mode for luminance;
For each MB
2. Compute the average JND threshold using equation (47) for the current MB using the selected
coding mode;
3. Compute the new JND adaptive QP by:
a. Select the avg_JND_frame and avg_JND_MB corresponding to the coding mode of the current
MB (e.g. MB coded with 4x4 DCT with INTER mode);
b. Compute the distortion weight using equation (46)
c. Compute the new JND adaptive QP value using equation (45) and then check and apply the
conditions in (49).
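The per-MB steps above can be sketched as follows. Since equation (45), which maps the distortion weight to the new QP value, is not reproduced in this section, the `round(qp * dist)` mapping below is only a hypothetical stand-in; the distortion weight follows equation (46) and the clamping follows the conditions in (49).

```python
import math

def distortion_weight(avg_jnd_mb, avg_jnd_frame):
    """Distortion weight dist_i of equation (46): a sigmoid of the relative
    difference between the MB and frame-level average JND thresholds."""
    rel = (avg_jnd_mb - avg_jnd_frame) / avg_jnd_frame
    return 0.7 + 0.6 / (1.0 + math.exp(-4.0 * rel))

def jnd_adaptive_qp(qp, avg_jnd_mb, avg_jnd_frame):
    """New QP for one MB: scale the initial QP by the distortion weight
    (hypothetical stand-in for equation (45)), then clamp the variation
    to the [-1, +3] range of equation (49) to avoid flickering."""
    dist = distortion_weight(avg_jnd_mb, avg_jnd_frame)
    qp_jnd = round(qp * dist)  # stand-in mapping, not the thesis's (45)
    if dist > 1 and qp_jnd - qp > 3:
        qp_jnd = qp + 3        # limit the QP increase to +3
    elif dist < 1 and qp - qp_jnd > 1:
        qp_jnd = qp - 1        # limit the QP decrease to -1
    return qp_jnd
```

Note that when the MB average JND threshold equals the frame average, the weight is 1 and the QP is left unchanged, which matches the role of dist_i = 1 as the neutral point in the conditions of (49).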
As mentioned above, this process is not a perfect solution since the average JND threshold in a frame is
an approximation: instead of computing the average JND threshold for a frame using the JND thresholds
corresponding to the coding modes actually used in the frame, the JND thresholds are computed
considering that all MBs are coded with the same coding mode as the current MB. However, it should be
a rather good approximation.
4.2. Performance Evaluation
4.2.1. Test Conditions
For the tests, the H.264/AVC reference software, version JM 16.2 (FRExt), has been used, notably the
High profile for the reasons mentioned before. Thus, the adopted JND model, the perceptual coefficients
pruning method described in the previous section and the JND adaptive quantization method were
implemented in the context of this reference software. Further test conditions included [34]:
• GOP prediction structure: As presented in Section 3.2.1.
• Rate control: As presented in Section 3.2.1.
• Test sequences and resolutions: As presented in Section 3.2.1.
• Quantization parameters: As presented in Section 3.2.1.
• Coding benchmarks: The proposed perceptual video codec with the JND adaptive QP and the
proposed perceptual video codec including coefficients pruning and JND adaptive QP are
compared with the H.264/AVC High profile codec and with the H.264/AVC based perceptual codec
only with coefficients pruning. In the following, these codecs will be labeled as:
o HP - H.264/AVC High profile codec
o JND_CP - H.264/AVC based perceptual codec with coefficients pruning
o JND_QP - H.264/AVC based perceptual codec with JND adaptive QP
o JND_CP+QP - H.264/AVC based perceptual codec including coefficients pruning and JND
adaptive QP
• Performance metrics: Besides the objective quality metrics already adopted in the previous
chapter, another objective quality metric is adopted to perform a more complete RD performance
assessment:
o Video Quality Metric (VQM): VQM was developed to provide an objective measurement of the
perceived video quality. It measures the perceptual effects of video impairments, including
blurring, jerky/unnatural motion, global noise, block distortion and color distortion, and
combines them into a single metric [36]. The VQM procedure takes as input the original and
decoded video sequences in the YUV color space; then, the DCT transform is applied to each
video sequence and the DCT coefficients are converted into a local contrast (LC) metric
computed using equation (50), where DC is the DC component of each block.
$LC(i,j) = DCT(i,j) \cdot \dfrac{1}{DC} \cdot \left(\dfrac{DC}{1024}\right)^{0.65}$  (50)
Afterwards, LC is converted to just-noticeable distortions (JNDs) in order to compare only the
significant differences, that is, to compare just the relevant coefficients by ignoring the non-
perceivable ones. This conversion is made through a spatial CSF (SCSF) matrix: each DCT
coefficient is multiplied by the corresponding entry in the SCSF matrix and the results are the
reciprocals of the JND thresholds. Finally, the two video sequences are subtracted (diff) to
determine the average and maximum distances, which are finally weighted by equation (51)
[37].
$VQM = mean\_dist + 0.005 \cdot max\_dist$  (51)
with
$mean\_dist = 1000 \cdot mean\left(mean\left(abs(diff)\right)\right)$  (52)
$max\_dist = 1000 \cdot max\left(max\left(abs(diff)\right)\right)$  (53)
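A minimal sketch of the two core VQM steps described above, assuming the block DCT has already been computed and that `diff` already holds the JND-normalized differences between the two sequences (the SCSF matrix itself is not reproduced in this section):

```python
def local_contrast(dct_block):
    """Equation (50): convert a block of DCT coefficients into a local
    contrast measure, normalized by the block's DC component."""
    dc = dct_block[0][0]
    scale = (1.0 / dc) * (dc / 1024.0) ** 0.65
    return [[coeff * scale for coeff in row] for row in dct_block]

def vqm_score(diff):
    """Equations (51)-(53): combine the mean and the maximum of the
    absolute JND-normalized differences into a single VQM score."""
    flat = [abs(d) for row in diff for d in row]
    mean_dist = 1000.0 * sum(flat) / len(flat)  # equation (52)
    max_dist = 1000.0 * max(flat)               # equation (53)
    return mean_dist + 0.005 * max_dist         # equation (51)
```

The small 0.005 weight on the maximum term means the score is dominated by the average distortion, with the maximum acting as a penalty for isolated severe impairments.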
The VQM distortion values range from 0 to 1 with the 0 value representing a video with
excellent quality and 1 representing a very bad video quality.
To compute the VQM, the Command Line VQM (CVQM) tool which performs video calibration
and video quality estimation has been used; however, in this work only video quality
estimation has been performed. The CVQM software was developed by the Institute for
Telecommunication Sciences (ITS) and is described in [38]. CVQM performs the automatic
processing of a pair of corresponding video sequences, one with the original video sequence
and the other with the processed video sequence, e.g., after coding by the video codecs under
study.
o Resolving Power (RP) Compensated VQM: VQM generally correlates well with the
subjectively perceived quality; however, it is typically not powerful enough to compare two
codecs in terms of RD performance. For example, the perceived quality of the HP and
JND_QP codecs may be very similar while the associated VQM scores are quite different,
leading to the conclusion that the JND_QP codec has a worse quality than the HP codec. To
improve this type of comparison, a resolving power (RP) compensation is applied to the basic
VQM. The RP works as a threshold, that is, the RP sets the maximum difference in the
adopted metric, in this case the VQM, for which two results can still be considered to have the
same perceived quality. So, if, for example, the VQM difference between the JND_QP and HP
codecs is less than the RP, the quality is perceived as similar by a human observer [39]. The
RP used here corresponds to a 95% confidence interval and was determined using the
software provided by ITS. Using the RP compensated VQM should allow a more realistic
comparison of the subjective RD performance of the various video codecs.
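The RP compensation described above amounts to a simple thresholded comparison, which can be illustrated as follows; the default RP value below is a placeholder, as the actual 95%-confidence threshold was obtained with the ITS software:

```python
def compare_vqm(vqm_a, vqm_b, rp=0.07):
    """Compare two VQM scores under RP compensation: scores differing
    by less than the resolving power RP are perceptually equivalent;
    otherwise the lower VQM (less distortion) indicates the better codec.
    rp=0.07 is a placeholder value, not the thesis's measured threshold."""
    if abs(vqm_a - vqm_b) < rp:
        return "equivalent"
    return "a" if vqm_a < vqm_b else "b"
```

For instance, if the HP codec scores 0.10 and the JND_QP codec 0.13, the difference falls below the threshold and the two codecs are reported as perceptually equivalent despite the raw VQM gap.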
4.2.2. Results and Analysis
To evaluate the RD performance of the JND_QP and JND_CP+QP, four objective quality metrics are
used in this chapter: PSNR, MS-SSIM, VQM and RP compensated VQM. With this purpose, RD charts
for the PSNR, MS-SSIM, VQM and RP compensated VQM versus the rate for each RD point will be
presented for each test sequence.
The variations of the various quality metrics between the proposed perceptual codecs and the H.264/AVC
High profile benchmark are computed using equations (42), (43), (44), (54) and (55) for each RD point.
These values should ideally show the gains brought by the proposed perceptual H.264/AVC based
codecs regarding the standard H.264/AVC High profile codec as implemented in the JVT reference
software provided by the standardization group itself. The average values for these metrics and
respective variations are presented in Table B-1 to Table B-5 of Annex B.
$\Delta VQM\,(\%) = \dfrac{VQM_{JND} - VQM_{HP}}{VQM_{HP}} \times 100$  (54)
$\Delta VQM\_RP\,(\%) = \dfrac{VQM\_RP_{JND} - VQM\_RP_{HP}}{VQM\_RP_{HP}} \times 100$  (55)
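Equations (54) and (55) are plain percentage variations relative to the HP benchmark; as a one-line sketch:

```python
def delta_pct(metric_jnd, metric_hp):
    """Percentage variation of a quality metric for a perceptual codec
    relative to the HP benchmark, as in equations (54)-(55)."""
    return (metric_jnd - metric_hp) / metric_hp * 100.0
```

Note that, since lower VQM means better quality, a positive variation here corresponds to a quality loss for the perceptual codec.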
• RD Performance: PSNR versus Bitrate
This subsection presents the PSNR RD charts for each test sequence; the results are analyzed afterwards.
Figure 4.3 – PSNR RD performance for the Foreman sequence
Figure 4.4 – PSNR RD performance for the Mobile sequence
Figure 4.5 – PSNR RD performance for the Panslow sequence
Figure 4.6 – PSNR RD performance for the Spincalendar sequence
Figure 4.7 – PSNR RD performance for the in Playing_cards sequence
Figure 4.8 – PSNR RD performance for the Toys_and_calendar sequence
From the analysis of Figure 4.3 to Figure 4.8, it may be concluded that:
o For the H.264/AVC perceptual codec with JND adaptive QP (JND_QP)
o For the low resolution sequences, there are evident RD performance losses, notably for the
higher rates
� comparing with the H.264/AVC High profile codec (HP), the PSNR losses go up to about 2
dB (4.5%)/1.5 dB (3.5%) (for the same rate) for the last RD point, and the rate losses go up to
700 kbit/s (35%)/800 kbit/s (18.2%) (for the same PSNR) for the last RD point (G1) for the
Foreman and Mobile sequences, respectively.
� comparing with the H.264/AVC perceptual codec with coefficients pruning (JND_CP), the
PSNR losses go up to about 2 dB (4.5%)/1.5 dB (3.5%) (for the same rate) for the last RD
point, and the rate losses go up to 800 kbit/s (42.1%)/1 Mbit/s (23.8%) (for the same
PSNR) for the last RD point (G1) for the Foreman and Mobile sequences, respectively.
o For medium resolution sequences
� comparing with the HP codec, the PSNR losses go up to about 1 dB (2.3%)/0.5 dB (1.1%)
(for the same rate) for the last RD point, and rate losses go up to 14 Mbit/s (22%)/8 Mbit/s
(10%) (for same PSNR) for the last RD point for the Panslow and Spincalendar sequences,
respectively.
� comparing with the JND_CP codec, the PSNR losses go up to about 1 dB (2.4%)/0.5 dB
(1.1%) for the last RD point, and the rate losses go up to 14 Mbit/s (22%)/10 Mbit/s (13%)
(for the same PSNR) for the last RD point for the Panslow and Spincalendar sequences,
respectively.
o For high resolution sequences
� comparing with the HP codec, the PSNR losses go up to about 0.5 dB (1%) (for the same
rate) for last RD point and the rate losses go up to about 6 Mbit/s(10%) (for the same
PSNR) for the last RD point for the Playing_cards sequence.
� comparing with the JND_CP codec, the PSNR losses go up to about 0.75 dB (1.6%), or the
rate losses go up to about 12 Mbit/s (24%) (for the same PSNR) for the last RD point for
the Playing_cards sequence.
o For the H.264/AVC perceptual codec including coefficients pruning and JND adaptive QP
(JND_CP+QP)
o For the low resolution sequences, there are again clear RD performance losses, notably for the
higher rates
� comparing with the HP codec, the PSNR losses go up to about 2 dB (4.5%)/1.5 dB (3.5%)
(for the same rate) for the last RD point and the rate losses go up to 700 kbit/s (36.7%)/800
kbit/s (19.5%) (for the same PSNR) for the last RD point for the Foreman and Mobile
sequences, respectively.
� comparing with the JND_CP codec, the PSNR losses go up to about 2 dB (4.5%)/1.5 dB
(3.5%) (for the same rate) for the last RD point and the rate losses go up to 800 kbit/s
(42.1%)/1 Mbit/s (23.8%) (for the same PSNR) for the last RD point for the Foreman and
Mobile sequences, respectively.
o For medium resolution sequences
� comparing with the HP codec, the PSNR losses go up to about 1 dB (2.3%)/0.5 dB (1.2%)
(for the same rate) for the last RD point and the rate losses go up to 12/10
Mbit/s (19%/13.2%) (for the same PSNR) for the last RD point for the Panslow and
Spincalendar sequences, respectively.
� comparing with the JND_CP codec, the PSNR losses go up to about 1/0.5 dB (2.4/1.2%)
(for the same rate) for the last RD point and the rate losses go up to 14/12 Mbit/s
(22.2/16.2%) (for same PSNR) for the last RD point for the Panslow and Spincalendar
sequences, respectively.
o For high resolution sequences
� comparing with the HP codec, the PSNR losses go up to about 0.5/0.25 dB (1.1/0.6%) (for
the same rate) for the last RD point and the rate losses go up to about 8/6 Mbit/s
(16/10.9%) (for the same PSNR) for the last RD point for the Playing_cards and
Toys_and_calendar sequences, respectively.
� comparing with the JND_CP codec, the PSNR losses go up to about 1/0.75 dB (2.2/1.7%)
(for the same rate) for the last RD point and the rate losses go up to 12/14 Mbit/s
(26.1/27.5%) (for the same PSNR) for the last RD point for the Playing_cards and
Toys_and_calendar sequences, respectively.
o Both the JND_QP and the JND_CP+QP codecs present a worse RD performance than the HP
codec in terms of the PSNR quality metric: this does not come as a surprise, as the PSNR relies
on the mathematical difference between the luminances of the original and decoded sequences
and, thus, does not necessarily accurately model the subjective quality. In fact, the JND adaptive
QP method introduces additional quantization error for some MBs under the assumption that this
additional error is not perceptible, although the PSNR will still 'complain' about it, as the PSNR is
(mathematically) sensitive to this error; thus, it is simply normal that the PSNR RD performance of
the JND_QP and JND_CP+QP codecs is worse than that of the HP and JND_CP codecs.
o The RD performance losses are larger for the lower resolutions and smaller for the higher
resolutions because, in the high resolution sequences, each MB corresponds to a smaller physical
area; so, the redundancy is higher and the impact of the JND adaptive quantization method is
lower.
• RD Performance: MS-SSIM versus Bitrate
Because the PSNR does not correlate well with the subjective quality, this subsection presents the
MS-SSIM RD charts for each test sequence, trying to understand if there are quality gains obtained
with the proposed H.264/AVC based perceptual codecs that can be better 'detected' using another
quality metric; the results are analyzed afterwards.
Figure 4.9 – MS-SSIM RD performance for the Foreman sequence
Figure 4.10 – MS-SSIM RD performance for the Mobile sequence
Figure 4.11 – MS-SSIM RD performance for the Panslow sequence
Figure 4.12 – MS-SSIM RD performance for the Spincalendar sequence
Figure 4.13 – MS-SSIM RD performance for the Playing_cards sequence
Figure 4.14 – MS-SSIM RD performance for the Toys_and_calendar sequence
From the analysis of Figure 4.9 to Figure 4.14, it may be concluded that:
o For the H.264/AVC perceptual codec with JND adaptive QP (JND_QP)
o For the low resolution sequences:
� comparing with the HP codec, rate gains go up to 1300 kbit/s (32.5%) (for the same
MS-SSIM) for the last RD point for the Foreman sequence; for the Mobile sequence, the rate
gains go up to 1900 kbit/s (26.8%) (for the same MS-SSIM) for the last RD point.
� comparing with the JND_CP codec, rate gains go up to 1000 kbit/s (27%)/1300 kbit/s
(20%) (for the same MS-SSIM) for the last RD point for the Foreman and Mobile
sequences, respectively.
o For medium resolution sequences
� comparing with the HP codec, rate gains go up to 32 Mbit/s (29.6%)/30 Mbit/s (25.4%) (for
same MS-SSIM) for the last RD point for the Panslow and Spincalendar sequences,
respectively.
� comparing with the JND_CP codec, rate gains go up to 24 Mbit/s (23.5%) and 21 Mbit/s
(19.3%) (for the same MS-SSIM) for the last RD point for the Panslow and Spincalendar
sequences, respectively.
o For high resolution sequences
� comparing with the HP codec, rate gains go up to 21 Mbit/s (25.3%)/27 Mbit/s (29%) (for
same MS-SSIM) for the last RD point for the Playing_cards and Toys_and_calendar
sequences, respectively.
� comparing with the JND_CP codec, rate gains go up to 10 Mbit/s (13.9%) and 15
Mbit/s (18.5%) (for the same MS-SSIM) for the last RD point for the Playing_cards and
Toys_and_calendar sequences, respectively.
o For the H.264/AVC perceptual codec including coefficients pruning and JND adaptive QP
(JND_CP+QP)
o For the low resolution sequences, there are evident RD performance gains, notably for the
higher rates
� comparing with the HP codec, rate gains go up to 1.4 Mbit/s (35%)/2 Mbit/s (28.2%) (for
the same MS-SSIM) for the last RD point for the Foreman and Mobile sequences,
respectively.
� comparing with the JND_CP codec, rate gains go up to 1.1 Mbit/s (29.7%)/1.2 Mbit/s
(18.5%) (for the same MS-SSIM) for the last RD point for the Foreman and Mobile
sequences, respectively
o For medium resolution sequences
� comparing with the HP codec, rate gains go up to 33 Mbit/s (32.4%) (for same MS-SSIM)
for the last RD point for the Panslow sequence.
� comparing with the JND_CP codec, rate gains go up to 27 Mbit/s (26.5%) and 24 Mbit/s
(22%) (for the same MS-SSIM) for the last RD point for the Panslow and Spincalendar
sequences, respectively.
o For high resolution sequences
� comparing with the HP codec, rate gains go up to about 24 Mbit/s (28.9%)/30 Mbit/s
(32.3%) (for same MS-SSIM) for the last RD point for the Playing_cards and
Toys_and_calendar sequences, respectively.
� comparing with the JND_CP codec, rate gains go up to 13 Mbit/s (18.1%) and 18 Mbit/s
(22.2%) (for the same MS-SSIM) for the last RD point for the Playing_cards and
Toys_and_calendar sequences, respectively.
o Summarizing, the JND_QP and JND_CP+QP codecs present a maximum quality similar or inferior
to that of the reference HP codec because the MS-SSIM, as aforementioned, is based on
structure. Since both perceptual codecs modify the QP value for each MB, adjacent MBs may
have significantly different QP values (e.g., if the QP value initially assigned by the H.264/AVC JM
reference software is 33, the QP for a MB may vary between a minimum of 32 and a maximum of
36); this does not happen for the HP codec, since its RD performance was evaluated for constant
QP values. Consequently, the MS-SSIM interprets this quality difference as block artifacts and the
quality score is lower.
o However, there are significant RD gains for certain qualities, which are larger for the higher
resolutions because, in high resolution sequences, each MB corresponds to a smaller physical
area; consequently, the impact of the method is lower and the aforementioned blocking effect is
less noticeable.
• RD Performance: VQM versus Bitrate
Still with the purpose of better assessing the subjective quality impact of the proposed perceptually driven
coding tools, this subsection presents the VQM RD charts for each test sequence; the results are analyzed
afterwards.
Figure 4.15 – VQM RD performance for the Foreman sequence
Figure 4.16 – VQM RD performance for the Mobile sequence
Figure 4.17 – VQM RD performance for the Panslow sequence
Figure 4.18 – VQM RD performance for the Spincalendar sequence
Figure 4.19 – VQM RD performance for the Playing_cards sequence
Figure 4.20 – VQM RD performance for the Toys_and_calendar sequence
From the analysis of Figure 4.15 to Figure 4.20, it may be concluded that:
o The VQM decreases with the rate (thus the quality increases), first rather quickly, and afterwards
tends to a value slightly above zero for the higher bitrates; this basically means that there is a good
subjective quality when the rate increases above a certain value, as a VQM near zero means the
distortion is very low. The various rates are obtained from the various quantization parameter
combinations corresponding to the Gx labels in the charts.
o The JND_QP codec shows the worst video quality for the low and medium resolutions, where the
difference to the H.264/AVC HP VQM is the highest. This happens because the RD performance
of the HP codec has been measured with constant QP values for all MBs, while the perceptual
codecs change the QP value of each MB between -1 and +3 around the QP value initially
determined by the H.264/AVC JM reference software. The blocking effect which may be
associated to these different QPs leads the VQM to indicate a lower quality, since this metric is
sensitive to such effects.
o For the H.264/AVC perceptual codec with coefficients pruning (JND_CP) comparing with the HP
codec
o For the low resolution sequences, rate gains go up to 600 kbit/s (8.5%) (for the same VQM) for
the last RD point for the Mobile sequence.
o For the medium resolution sequences, rate gains go up to 9 Mbit/s (7.6%) (for the same VQM)
for the last RD point for the Spincalendar sequence.
o For the high resolution sequences, VQM gains go up to 0.02 (for the same rate) for the second
to last RD point (G2) for the Playing_cards sequence, and rate gains go up to 11 Mbit/s
(13%)/12 Mbit/s (12.9%) (for the same VQM) for the last RD point for the Playing_cards and
Toys_and_calendar sequences, respectively.
o For the H.264/AVC perceptual codec with JND adaptive QP (JND_QP)
o For the low resolution sequences:
� comparing with the HP codec, rate gains go up to 1300 kbit/s (32.5%) (for the same VQM)
for the last RD point for the Foreman sequence; for the Mobile sequence, rate gains go up
to 1900 kbit/s (26.8%) (for the same VQM) for the last RD point. However, there are also
VQM losses around 700 kbit/s, which go up to approximately 0.04 for the Foreman
sequence and approximately 0.06 for the Mobile sequence.
� comparing with the JND_CP codec, rate gains go up to 1000 kbit/s (27%)/1300 kbit/s
(20%) (for the same VQM) for the last RD point for the Foreman and Mobile sequences,
respectively. However, there are also VQM losses around 700 kbit/s, which go up to
approximately 0.03 for the Foreman sequence and approximately 0.05 for the Mobile
sequence.
o For medium resolution sequences
� comparing with the HP codec, rate gains go up to 32 Mbit/s (29.6%)/30 Mbit/s (25.4%) (for
the same VQM) for the last RD point for the Panslow and Spincalendar sequences,
respectively.
� comparing with the JND_CP codec, rate gains go up to 24 Mbit/s (23.5%) and 21 Mbit/s
(19.3%) (for the same VQM) for the last RD point for the Panslow and Spincalendar
sequences, respectively.
o For high resolution sequences
� comparing with the HP codec, VQM gains go up to 0.02 (for the same rate) for the second
to last RD point (G2) for the Playing_cards sequence and rate gains go up to 21 Mbit/s
(25.3%)/27 Mbit/s (29%) (for the same VQM) for the last RD point for the Playing_cards
and Toys_and_calendar sequences, respectively. VQM losses go up to 0.01 (for the same
rate) around 10 Mbit/s for the Toys_and_calendar sequence.
� comparing with the JND_CP codec, rate gains go up to 10 Mbit/s (13.9%) and 15 Mbit/s
(18.5%) (for the same VQM) for the last RD point for the Playing_cards and
Toys_and_calendar sequences, respectively, and VQM gains go up to 0.01 (for the same
rate) for the second to last RD point for the Playing_cards sequence; VQM losses go up to
0.01 (for the same rate) around 10 Mbit/s for the Toys_and_calendar sequence.
o For the H.264/AVC perceptual codec including coefficients pruning and JND adaptive QP
(JND_CP+QP)
o For the low resolution sequences, there are evident RD performance gains, notably for the
higher rates
� comparing with the HP codec, rate gains go up to 1.4 Mbit/s (35%)/2 Mbit/s (28.2%) (for
the same VQM) for the last RD point for the Foreman and Mobile sequences, respectively.
However, there are VQM losses around 700 kbit/s which go up to approximately 0.04 for
the Foreman sequence and up to approximately 0.06 for the Mobile sequence.
� comparing with the JND_CP codec, rate gains go up to 1.1 Mbit/s (29.7%)/1.2 Mbit/s
(18.5%) (for the same VQM) for the last RD point for the Foreman and Mobile sequences,
respectively. However, there are VQM losses around 700 kbit/s which go up to
approximately 0.03 for the Foreman sequence and go up to approximately 0.05 for the
Mobile sequence.
o For medium resolution sequences
� comparing with the HP codec, gains go up to 33 Mbit/s (32.4%) (for same VQM) for the last
RD point for the Panslow sequence.
� comparing with the JND_CP codec, rate gains go up to 27 Mbit/s (26.5%) and 24 Mbit/s
(22%) (for the same VQM) for the last RD point for the Panslow and Spincalendar
sequences, respectively.
o For high resolution sequences
� comparing with the HP codec, VQM gains go up to 0.03 (for the same rate) for the second
to last RD point (G2) for the Playing_cards sequence and rate gains go up to 24 Mbit/s
(28.9%)/30 Mbit/s (32.3%) (for same VQM) for the last RD point for the Playing_cards and
Toys_and_calendar sequences, respectively. VQM losses go up to 0.01 (for the same rate)
for the second to last RD point (G2) for the Toys_and_calendar sequence.
� comparing with the JND_CP codec, rate gains go up to 13 Mbit/s (18.1%) and 18 Mbit/s
(22.2%) (for the same VQM) for the last RD point for the Playing_cards and
Toys_and_calendar sequences, respectively and VQM gains go up to 0.02 (for the same
rate) for the second to last RD point for the Playing_cards sequence; there are also VQM
losses going up to 0.01 (for the same rate) for the second to last RD point for the
Toys_and_calendar sequence.
In summary, the JND_QP and JND_CP+QP codecs present rate gains for all resolutions, especially for
the highest rates; however, there are no VQM gains as the maximum subjective quality is the same for all
codecs under test. For low resolution sequences and quantization steps between G2 and G5, there are
some VQM losses; on the contrary, for high resolution sequences, there are some VQM gains.
The JND_QP codec can achieve VQM RD gains through rate gains up to 32.5%/27% for the last RD
point and VQM gains up to 0.06/0.05 for the second to last RD point relatively to the HP and the JND_CP
codecs, respectively. The JND_CP+QP codec can achieve VQM RD gains through rate gains up to
35%/29.7% for the last RD point or VQM gains up to 0.06/0.05 for the second to last RD point for the HP
and the JND_CP codecs, respectively.
• RD Performance: RP compensated VQM versus Bitrate
Still to better assess the RD performance, and trying to overcome the VQM limitations previously
presented, this subsection finally presents the RP compensated VQM RD charts for each test sequence;
the results are analyzed afterwards.
Figure 4.21 – RP compensated VQM RD performance for the Foreman sequence
Figure 4.22 – RP compensated VQM RD performance for the Mobile sequence
Figure 4.23 – RP compensated VQM RD performance for the Panslow sequence
Figure 4.24 – RP compensated VQM RD performance for the Spincalendar sequence
Figure 4.25 – RP compensated VQM RD performance for the Playing_cards sequence
Figure 4.26 – RP compensated VQM RD performance for the Toys_and_calendar sequence
From the analysis of Figure 4.21 to Figure 4.26, it may be concluded that:
o For the H.264/AVC perceptual codec with coefficients pruning (JND_CP) comparing with the HP
codec
o For the low resolution sequences, rate gains go up to 600 kbit/s (8.5%) (for the same VQM) for
the last RD point for the Mobile sequence.
o For medium resolution sequences, rate gains go up to 9 Mbit/s (7.6%) (for the same VQM) for
the last RD point for the Spincalendar sequence.
o For high resolution sequences, RP compensated VQM gains go up to 0.02 (for the same rate)
for the second to last RD point (G2) and rate gains go up to 11 Mbit/s (13%)/12 Mbit/s (12.9%)
(for the same RP compensated VQM) for the last RD point for the Playing_cards and
Toys_and_calendar sequences, respectively.
o For the H.264/AVC perceptual codec with JND adaptive QP (JND_QP)
o For the low resolution sequences
� comparing with the HP codec, rate gains go up to 1.3 Mbit/s (32.5%) and 1.9 Mbit/s
(26.8%) (for the same RP compensated VQM) for the last RD point for the Foreman and
Mobile sequences, respectively, and RP compensated VQM gains go up to 0.01 and 0.02
(for the same rate) around 1600 kbit/s and 1400 kbit/s for the Foreman and Mobile
sequences, respectively.
� comparing with the JND_CP codec, rate gains go up to 1 Mbit/s (27%) and 1.3 Mbit/s
(20%) (for the same RP compensated VQM) for the last RD point for the Foreman and
Mobile sequences, respectively, and RP compensated VQM gains go up to 0.01 and 0.02
(for the same rate) around 1600 kbit/s and 1400 kbit/s for the Foreman and Mobile
sequences, respectively.
o For the medium resolution sequences
� comparing with the HP codec, rate gains go up to 32 Mbit/s (29.6%) and 30 Mbit/s (25.4%)
for the same RP compensated VQM for the last RD point for the Panslow and
Spincalendar sequences, respectively, and RP compensated VQM gains go up to 0.01 for
the same rate for the second to last RD point for both sequences.
� comparing with the JND_CP codec, rate gains go up to 24 Mbit/s (23.5%) and 21 Mbit/s
(19.3%) for the same RP compensated VQM for the last RD point for the Panslow and
Spincalendar sequences, respectively, and RP compensated VQM gains go up to 0.01 for
the same rate for the second to last RD point, for both sequences above.
o For high resolution sequences
� comparing with the HP codec, rate gains go up to 21 Mbit/s (25.3%) and 27 Mbit/s (29%)
(for the same RP compensated VQM) for the last RD point, and the RP compensated VQM
gains go up to 0.03/0.02 (for the same rate) for rates around 38 Mbit/s and 34 Mbit/s for the
Playing_cards and Toys_and_calendar sequences, respectively.
� comparing with the JND_CP codec, rate gains go up to 10 Mbit/s (13.9%) and 15 Mbit/s
(18.5%) (for the same RP compensated VQM) for the last RD point, and the RP
compensated VQM gains go up to 0.02/0.01 (for the same rate) for rates around 38 Mbit/s
and 34 Mbit/s for the Playing_cards and Toys_and_calendar sequences, respectively.
o For the H.264/AVC perceptual codec including coefficients pruning and JND adaptive QP
(JND_CP+QP)
o For the low resolution sequences
� comparing with the HP codec, rate gains go up to 1.4 Mbit/s (35%) and 2 Mbit/s (28.2%)
(for the same RP compensated VQM) for the last RD point, and the RP compensated VQM
gains go up to 0.01/0.02 (for the same rate) for the second to last RD point for the Foreman
and Mobile sequences, respectively.
� comparing with the JND_CP codec, rate gains go up to 1.1 Mbit/s (29.7%) and 1.2 Mbit/s
(18.5%) for the same RP compensated VQM for the last RD point and the RP
compensated VQM gains go up to 0.03 (78.9%) and 0.02 (40%) for the same rate for the
second to last RD point for the Foreman and Mobile sequence, respectively
o For the medium resolution sequences
▪ comparing with the HP codec, rate gains go up to 33 Mbit/s (30.6%) and 33 Mbit/s (28%) for the same RP compensated VQM for the last RD point and the RP compensated VQM gains go up to 0.01 for the same rate for the second to last RD point, for the Panslow and Spincalendar sequences, respectively.
▪ comparing with the JND_CP codec, rate gains go up to 27 Mbit/s (26.5%) and 24 Mbit/s (22%) for the same RP compensated VQM for the last RD point and the RP compensated VQM gains go up to 0.01 for the same rate for the second to last RD point for the Panslow and Spincalendar sequences, respectively.
o For high resolution sequences
▪ comparing with the HP codec, rate gains go up to 24 Mbit/s (28.9%) and 30 Mbit/s (32.3%) for the same RP compensated VQM for the last RD point and the RP compensated VQM gains go up to 0.04/0.02 for the same rate for the second to last RD point for the Playing_cards and Toys_and_calendar sequences, respectively.
▪ comparing with the JND_CP codec, rate gains go up to 13 Mbit/s (18.1%) and 18 Mbit/s (22.2%) for the same RP compensated VQM for the last RD point and the RP compensated VQM gains go up to 0.02/0.01 for the same rate for the second to last RD point for the Playing_cards and Toys_and_calendar sequences, respectively.
In summary, for both the JND_QP and JND_CP+QP codecs and for all resolutions, there are rate gains
and RP compensated VQM gains. The JND_QP codec can achieve RP compensated VQM RD gains
through rate gains up to 32.5%/27% for the last RD point regarding the HP and JND_CP codecs,
respectively, and RP compensated VQM gains up to 0.03/0.02 for the second to last RD point relatively to
the HP and the JND_CP codecs, respectively. The JND_CP+QP codec can achieve RP compensated
VQM RD gains through rate gains up to 35%/29.7% for the last RD point or VQM gains up to 0.04/0.02
for the second to last RD point for the HP and JND_CP codecs, respectively.
4.2.3. Conclusion
The main conclusions of this chapter are that the PSNR and the MS-SSIM quality metrics are not able to
adequately express the subjective RD performance gains obtained with the proposed H.264/AVC
perceptual codec with JND adaptive QP (JND_QP) and the H.264/AVC perceptual codec including
coefficients pruning and JND adaptive QP (JND_CP+QP) because they are not designed to efficiently
measure the subjective quality and, thus, have a low correlation with subjective quality scores. However,
the effective assessment of the RD performance gains was possible with the VQM objective quality metric
and especially with its RP compensated version.
VQM RD gains for the JND_QP and JND_CP+QP codecs are mainly rate gains; however, there are also
some VQM gains, especially for the high resolution sequences, going up to 0.02 and 0.03 when
comparing with the HP codec and 0.01/0.02 when comparing with the JND_CP codec for the JND_QP
and JND_CP+QP codecs, respectively.
Regarding the RP compensated VQM RD gains for the JND_QP and JND_CP+QP codecs, there are
both rate and RP compensated VQM gains. The highest RD gains for the JND_CP+QP codec are RP
compensated VQM gains which go up to 0.02 and rate gains which go up to 35% for the low resolutions.
Still for the same codec, the RP compensated VQM gains go up to 0.04 and the rate gains go up to 31%
for the medium and high resolutions.
Chapter 5
Conclusions and Future Work
Chapter 5 concludes this report by presenting a brief summary of the solutions developed as well as the
main conclusions; finally, some suggestions for eventual future work are presented.
5.1. Summary and Conclusions
Chapter 1 introduced the problem addressed in this Thesis, mentioning the increasing importance of
video applications and the growing need for more compression, thus justifying the improvement of
existing video compression solutions to achieve the same quality with a lower bitrate.
Next, Chapter 2 reviewed the conceptual and technical background of this Thesis, notably the human
visual system and the most relevant perceptual video coding solutions in the literature.
The next two chapters report the main contributions and developments associated to this Thesis. Chapter
3 presented the first improved video codec, this means the H.264/AVC perceptual codec with
coefficients pruning. The additional perceptually driven method, called perceptual coefficients pruning,
builds on the basic idea of setting to zero all the transform coefficients which have a magnitude lower
than the corresponding JND threshold since these coefficients should be perceptually irrelevant.
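The pruning rule just described can be sketched as follows; this is a minimal illustration assuming the transform coefficients and the JND thresholds arrive as equally shaped arrays (e.g. one 4x4 transform block each), not the actual reference software implementation:

```python
import numpy as np

def prune_coefficients(coeffs, jnd_thresholds):
    """Zero every transform coefficient whose magnitude is below its
    JND threshold: such coefficients are taken as perceptually
    irrelevant and need not be transmitted."""
    pruned = coeffs.copy()
    pruned[np.abs(coeffs) < jnd_thresholds] = 0
    return pruned
```

Coefficients exactly at the threshold are kept here; whether the comparison is strict is a detail of the sketch, not of the thesis method.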
Consequently, some changes had to be made in the codec architecture, notably the inclusion of two new
modules: the JND model which determines the JND thresholds and the Coefficients Pruning module
which implements the pruning process. To evaluate the RD performance associated to the additional tool,
two objective quality metrics have been used: PSNR and MS-SSIM. Other relevant metrics assessed
were the average number of zeroed coefficients at MB level due to the perceptual coefficients pruning,
which intends to evaluate the impact of the method in the codec in terms of the number of zeroed
coefficients, and the average zigzag position of the zeroed coefficients exclusively due to the perceptual
coefficients pruning method, which intends to give an idea of where the zeroed coefficients are located in terms of
bandwidth. The main conclusion was that the adopted perceptual coefficients pruning method may have
some positive impact on the RD performance, notably for sequences with lower luminance levels and
object patterns, and especially for low QP values and higher resolutions. The H.264/AVC perceptual
codec with coefficients pruning can achieve evident PSNR RD gains only for the higher resolutions and
the higher rates. The rate gains go up to 14% for the last RD point and the PSNR gains go up to 0.4 dB
for rates around 8 Mbit/s. The MS-SSIM RD performance can achieve a rate gain up to 7% for low and
medium resolutions and 14% for high resolutions, both for the last RD point. The average number of
zeroed coefficients due to the perceptual coefficients pruning method can go up to around 9.7 coefficients
(60.6%) for the Playing_cards sequence in the lowest G. These zeroed coefficients are more likely to be
located between the 8th and 12th zigzag positions in a 4x4 block.
Afterwards, Chapter 4 presented the second perceptually driven tool and the corresponding improved codec as
well as an improved codec including both additional tools. The second tool is a JND adaptive quantization
which has the intent to adapt the QP value based on the computed JND thresholds considering the
human visual system is not sensitive to changes below the JND threshold values and thus no rate should
be used to provide a coefficient’s accuracy which cannot be ‘consumed’. Consequently, some changes in
the architecture have been implemented, notably again the JND model computation and also a JND
adaptive quantization module which computes the adaptive QP values for each MB in each frame based
on the QP initially determined by the H.264/AVC reference software. For the RD performance evaluation,
four objective quality metrics have been used, notably again the PSNR and MS-SSIM and two additional
metrics, the VQM and the RP compensated VQM. For the JND_QP codec, the VQM RD performance
shows some rate gains for the high resolution sequences. The RP compensated performance shows a
RP compensated VQM gain up to 0.03 and a rate gain up to 33%. The joint solution, this means the
H.264/AVC perceptual codec including the coefficients pruning and JND adaptive QP (JND_CP+QP),
shows a VQM RD performance with rate gains up to 35% and VQM gains up to 0.03, both for the last
RD point. The RP compensated VQM performance shows rate gains of 35% for the last RD
point and a RP compensated VQM gain up to 0.04 for high resolution sequences. To sum up, the main
conclusions of this chapter are that the PSNR and the MS-SSIM are not able to express the RD
performance gains obtained with the proposed JND_QP and JND_CP+QP codecs because they are not
designed to efficiently measure the subjective quality and, thus, have a low correlation with subjective
quality scores. However, the effective assessment of the RD performance gains was possible with the
VQM objective quality metric and its RP compensated version.
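The JND adaptive quantization idea summarized above can be sketched as follows. This is only an illustration: the linear scaling rule, the ±3 clamp and the function name are assumptions made for the sketch, not the adaptation law actually implemented in the thesis; the only facts taken from the source are that the per-macroblock QP is derived from the base QP and the JND thresholds, and that H.264/AVC QP values lie in 0..51.

```python
def jnd_adaptive_qp(base_qp, mb_jnd, frame_avg_jnd, max_delta=3):
    """Toy sketch of JND adaptive QP: raise the QP for macroblocks
    whose JND threshold is above the frame average (more distortion
    can be masked there) and lower it otherwise."""
    # Relative deviation of this MB's masking capability, scaled and clamped.
    delta = round(max_delta * (mb_jnd - frame_avg_jnd) / frame_avg_jnd)
    delta = max(-max_delta, min(max_delta, delta))
    # Keep the result inside the legal H.264/AVC QP range 0..51.
    return max(0, min(51, base_qp + delta))
```

A macroblock whose JND threshold matches the frame average keeps the base QP; highly masked macroblocks get quantized more coarsely, saving rate where the distortion cannot be perceived.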
5.2. Future Work
Despite the encouraging results achieved with the combination of the two perceptually driven tools
described in this Thesis, the developed video coding solutions still leave room for improvement.
An important module in the H.264/AVC standard that was not improved but is a good candidate for
perceptually related improvements is the motion estimation. The motion estimation module may be
improved using the same basic ideas that were used to improve the transform and quantization
processes, following the computation of the JND model. In this context, a method similar to the one
presented in [30] may be developed; the basic idea is to compute the distortion metric comparing the
original and the prediction blocks to determine the prediction error using only the perceptually relevant
residuals based on some filtering with the relevant JND thresholds. In fact, it is not a perfect solution to
apply the perceptual coefficients pruning method to the transformed coefficients, discarding some of the
perceptually irrelevant coefficients, and then perform the motion estimation process still considering the
coefficients corresponding to those previously discarded to determine the best MB prediction and define
the motion vector. A better solution seems to be, for example, the adoption of a JND thresholded SAD in
the motion estimation module, allowing a more perceptually coherent video codec to be obtained.
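A JND thresholded SAD along these lines could look as follows; treating residuals at or below the JND threshold as contributing zero distortion is an assumption inspired by the idea in [30], not a reproduction of its exact metric:

```python
import numpy as np

def jnd_thresholded_sad(original, prediction, jnd):
    """Sketch of a JND thresholded SAD: residual samples whose
    magnitude does not exceed the JND threshold are treated as
    perceptually irrelevant and add nothing to the distortion."""
    # Cast to int so the subtraction of 8-bit samples cannot wrap around.
    residual = np.abs(original.astype(int) - prediction.astype(int))
    return int(np.sum(np.where(residual > jnd, residual, 0)))
```

Used inside the block-matching loop in place of the plain SAD, this would steer the motion search towards predictions whose errors are perceptually invisible rather than merely numerically small.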
References
[1] G. Lin and S. Zheng, "Perceptual Importance Analysis for H.264/AVC Bit Allocation," Journal of
Zhejiang University SCIENCE A, vol. 9, no. 2, pp. 225 - 231, July 2007.
[2] M. Jacobs and J. Probell, "A Brief History Of Video Coding", Oct. 2009, [Online].
http://www.arc.com/upload/download/whitepapers/A_Brief_History_of_Video_Coding_wp.pdf
[3] ITU-T Rec. H.261, "Video Codec for Audio-Visual Services at 64-1920 kbit/s," 1993.
[4] D. Le Gall, "MPEG: A Video Compression Standard For Multimedia Applications," 1991.
[5] ISO/IEC 11172-2 MPEG-1, "Coding Of Moving Pictures and Associated Audio For Digital Storage
Media At Up To About 1.5 Mbps," Part 2: Video, 1991.
[6] ISO/IEC 13818-2 MPEG-2, "Generic Coding of Moving Pictures and Associated Audio: Video," same
as ITU-T Rec. H.262, 1995.
[7] B. G. Haskell, A. Puri, and A. N. Netravali, Digital Video: An Introduction to MPEG-2. New York, USA:
Chapman & Hall, 1997.
[8] K. Rijkse, "ITU-T Recommendation H.263: Video Coding for Low-Bit-Rate Communication," IEEE
Communications Magazine, pp. 42-45, December 1996.
[9] I.E.G. Richardson, H.264 And MPEG-4 Video Compression: Video Coding For Next-Generation
Multimedia. Chichester: John Wiley & Sons, 2003.
[10] JVT of ISO/IEC MPEG And ITU-T VCEG, "ITU-T Recommendation And Final Draft International
Standard Of Joint Video Specification (ITU-T Rec. H.264|ISO/IEC 14496-10 AVC)," JVT-G050, 2003.
[11] E. Manoel, "Codificação de Vídeo H.264 – Estudo De Codificação Mista De Macroblocos," in
Florianópolis, 2007.
[12] T. Wiegand, G.J. Sullivan, G. Bjontegaard, and A. Luthra, "Overview Of The H.264/AVC Video
Coding Standard," IEEE Transactions on Circuits and Systems for Video Technology, vol. 13, no. 7, pp. 560 - 576, July 2003.
[13] "Eye", Oct. 2009, [Online]. http://en.wikipedia.org/wiki/Eye
[14] J. Martim do Santo et al., Anatomia Geral Moreno, 3rd ed.: Egas Moniz Publicações, 2005.
[15] "Olho - Anatomia", Oct. 2009, [Online].
http://www.medipedia.pt/home/home.php?module=artigoEnc&id=505
[16] "The Anatomy of Vision - Page 2", Oct. 2009, [Online].
http://brainconnection.positscience.com/topics/?main=anat/vision-anat2
[17] "Vernier Tech Info Library TIL #1014", March 2009, [Online]. http://www.vernier.com/til/1014.html
[18] Y. Qiao, Q. Hu, G. Qian, S. Luo, and W. L. Nowinski, "Thresholding based on variance and intensity
contrast," Pattern Recognition, vol. 40, no. 2, pp. 596 - 608, July 2006.
[19] "The Human Visual System", Oct. 2009, [Online].
http://www.dip.ee.uct.ac.za/~nicolls/lectures/eee401f/hvs.pdf
[20] D. Taubman and M.W. Marcellin, "JPEG2000: Image Compression Fundamentals," in Standards and
Practice, Kluwer, Boston, 2002.
[21] K. Minoo and T.Q. Nguyen, "Perceptual Video Coding With H.264," in Conference Record of the
Thirty-Ninth Asilomar Conference, Pacific Grove, CA, USA, 2005, pp. 741-745.
[22] "Visual Perception", Oct. 2009, [Online]. http://en.wikipedia.org/wiki/Visual_perception
[23] V. Bruce, P.R. Green, and M.A. Georgeson, Visual Perception, 3rd ed.: Psychology Press, 1996.
[24] S.M.C. Nascimento, "Optica Fisiológica II", Oct. 2009, [Online].
http://www.arauto.uminho.pt/pessoas/smcn/OFII/optica%20fisiologica%20II%20cap1.pdf
[25] "Stevens' Power Law", Oct. 2009, [Online]. http://en.wikipedia.org/wiki/Stevens'_power_law
[26] R. Clarke, Digital Compression Of Still Images And Video. London, England: Academic Press, 1995.
[27] Z. Chen and C. Guillemot, "Perceptually-Friendly H264/AVC Video Coding," in 2009 IEEE
International Conference On Image Processing, Cairo, Egypt, 7-10 Nov. 2009.
[28] Z. Chen and C. Guillemot, "Perceptually-Friendly H264/AVC Video Coding Based On Foveated Just-
Noticeable-Distortion Model," INRIA/IRISA, France, 2009.
[29] "Focus", Jan. 2010, [Online]. http://mathworld.wolfram.com/Focus.html
[30] C. Mak and K.N. Ngan, "Enhancing Compression Rate By Just-Noticeable Distortion Model For
H.264/AVC," in ISCAS 2009, IEEE International Symposium on Circuits and Systems 2009, Taipei, 24-
27 May 2009, pp. 609 - 612.
[31] Z. Mai, C. Yang, K. Kuang, and L. Po, "A Novel Motion Estimation Method Based On Structural
Similarity For H.264 Inter Prediction," in ICASSP 2006 Proceedings, IEEE International Conference on
Acoustics, Speech and Signal Processing, vol. 2, Toulouse, France, May 2006, pp. 913-916.
[32] C. Huang and C. Lin, "A Novel 4-D Perceptual Quantization Modelling For H.264 Bit-Rate Control,"
IEEE Transactions on Multimedia, vol. 9, no. 6, pp. 1113 - 1124, Oct. 2007.
[33] M. Naccari and F. Pereira, "Comparing Spatial Masking Modelling In Just Noticeable Distortion
Controlled H.264/AVC Video Coding," in 11th WIAMIS, Workshop on Image Analysis for Multimedia
Interactive Services WIAMIS, vol. 1, Desenzano del Garda, Italy, April 2010, pp. 1-4.
[34] T. Tan, G. Sullivan, and T. Wedi, "Recommended Simulation Common Conditions For Efficiency
Experiments, revision 2," in VCEG-AH10r3, 34th Meeting, Antalya, Turkey, 12-12 Jan. 2008.
[35] Z. Wang, E. Simoncelli, and A. Bovik, "Multi-Scale Structural Similarity For Image Quality
Assessment," in Proceedings of the 37th IEEE Asilomar Conference on Signals, Systems and
Computers, Pacific Grove, CA, Nov. 2003, pp. 1398-1402.
[36] Y. Wang, "Survey of Objective Video Quality Measurements," EMC Corporation Hopkinton, USA,
2006.
[37] F. Xiao, "DCT-based Video Quality Evaluation", 2000, [Online].
http://compression.ru/video/quality_measure/vqm.
[38] M.H. Pinson and S. Wolf, "A New Standardized Method For Objectively Measuring Video Quality,"
IEEE Transactions on Broadcasting, vol. 50, no. 3, pp. 312-322, September 2004.
[39] M. Naccari and F. Pereira, "Advanced H.264/AVC Based Perceptual Video Coding: Architecture,
Tools and Assessment," 2010.
[40] F. Pereira, "Comunicações de Áudio e Vídeo".
[41] ITU-T Rec. H.263, "Video Codec For Low Bit Rate Communication," 1996.
Annex A
This annex, regarding the results of Chapter 3, includes two tables with the average rate, PSNR and MS-SSIM
for each RD point and the variation in percentage of each one between the H.264/AVC High profile
codec and the H.264/AVC based perceptual codec with coefficients pruning, as well as the overall average
and the variance of the average zigzag position of the zeroed coefficients exclusively due to the perceptual
coefficients pruning method (4x4 blocks).
Columns: Bitrate_HP [kbit/s] | Bitrate_JND_CP [kbit/s] | ∆Bitrate [%] | PSNR_HP [dB] | PSNR_JND_CP [dB] | ∆PSNR [%] | MS-SSIM_HP | MS-SSIM_JND_CP | ∆MS-SSIM [%]

Foreman
QP1 3986.190 3677.970 -7.732 46.149 45.698 -0.977 0.9989 0.9988 -0.010
QP2 2339.400 2185.620 -6.573 43.494 43.297 -0.453 0.9983 0.9982 -0.010
QP3 704.600 681.480 -3.281 38.859 38.848 -0.028 0.9958 0.9958 0.000
QP4 332.120 322.340 -2.945 35.994 35.995 0.003 0.9921 0.9921 0.000
QP5 169.280 164.100 -3.060 33.128 33.155 0.082 0.9841 0.9843 0.020
QP6 91.670 89.300 -2.585 30.378 30.401 0.076 0.9675 0.9678 0.031
Mobile
QP1 7080.900 6517.450 -7.957 45.285 44.651 -1.400 0.9997 0.9996 -0.010
QP2 4900.190 4565.440 -6.831 42.130 41.839 -0.691 0.9995 0.9995 0.000
QP3 1879.300 1794.300 -4.523 36.167 36.154 -0.036 0.9985 0.9985 0.000
QP4 843.590 813.040 -3.621 32.569 32.579 0.031 0.9967 0.9967 0.000
QP5 375.050 360.580 -3.858 29.293 29.317 0.082 0.9927 0.9927 0.000
QP6 172.660 164.510 -4.720 26.069 26.105 0.138 0.9825 0.9826 0.010
Panslow
QP1 108443.010 101542.820 -6.363 45.174 44.706 -1.036 0.9982 0.9981 -0.010
QP2 62850.690 59207.970 -5.796 42.003 41.848 -0.369 0.9964 0.9963 -0.010
QP3 6631.300 6405.530 -3.405 37.233 37.244 0.030 0.9897 0.9897 0.000
QP4 1810.460 1745.520 -3.587 35.624 35.633 0.025 0.9849 0.9850 0.010
QP5 774.200 749.330 -3.212 33.865 33.871 0.018 0.9800 0.9801 0.010
QP6 402.520 392.560 -2.474 31.711 31.738 0.085 0.9717 0.9718 0.010
Spincalendar
QP1 117966.860 109393.340 -7.268 45.338 44.891 -0.986 0.9991 0.9990 -0.010
QP2 71097.700 66182.000 -6.914 42.126 41.982 -0.342 0.9978 0.9977 -0.010
QP3 10567.280 10069.510 -4.710 37.174 37.186 0.032 0.9940 0.9940 0.000
QP4 3125.310 2998.330 -4.063 35.211 35.223 0.034 0.9911 0.9911 0.000
QP5 1336.660 1279.740 -4.258 32.736 32.743 0.021 0.9842 0.9842 0.000
QP6 716.030 687.720 -3.954 30.052 30.062 0.033 0.9683 0.9686 0.031
Playing_cards
QP1 83313.160 71955.020 -13.633 47.156 46.815 -0.723 0.9990 0.9990 0.000
QP2 50165.110 43041.640 -14.200 44.905 44.696 -0.465 0.9984 0.9982 -0.020
QP3 12351.070 10776.830 -12.746 41.159 41.193 0.083 0.9937 0.9938 0.010
QP4 4332.970 4063.390 -6.222 38.887 38.936 0.126 0.9883 0.9885 0.020
QP5 2054.630 1965.280 -4.349 36.421 36.474 0.146 0.9790 0.9792 0.020
QP6 1084.940 1043.630 -3.808 33.500 33.547 0.140 0.9589 0.9595 0.063
Toys_and_calendar
QP1 93142.660 80918.700 -13.124 45.442 45.151 -0.640 0.9993 0.9992 -0.010
QP2 50370.440 44278.430 -12.094 42.846 42.725 -0.282 0.9983 0.9982 -0.010
QP3 7108.890 6574.970 -7.511 39.423 39.447 0.061 0.9919 0.9920 0.010
QP4 2604.550 2466.810 -5.288 37.926 37.958 0.084 0.9880 0.9881 0.010
QP5 1357.070 1294.060 -4.643 36.055 36.102 0.130 0.9799 0.9802 0.031
QP6 814.220 778.110 -4.435 33.872 33.922 0.148 0.9673 0.9677 0.041
Average 19924.908 18087.427 -5.993 37.927 37.837 -0.190 0.9890 0.9891 0.006
Table A-1 – Average rate, PSNR and MS-SSIM for each sequence for various RD points, and their variation in percentage.
Columns: Avg_zigzag_position_4x4 | Var_avg_zigzag_position_4x4

Foreman
QP1 10.536 1.499
QP2 10.210 1.601
QP3 9.594 1.840
QP4 9.289 1.978
QP5 9.045 2.109
QP6 8.848 2.227
Mobile
QP1 11.040 1.432
QP2 10.822 1.492
QP3 10.294 1.633
QP4 9.915 1.741
QP5 9.538 1.875
QP6 9.189 2.031
Panslow
QP1 10.540 1.516
QP2 10.131 1.624
QP3 9.271 2.009
QP4 8.976 2.179
QP5 8.811 2.273
QP6 8.688 2.345
Spincalendar
QP1 10.669 1.515
QP2 10.265 1.629
QP3 9.565 1.888
QP4 9.285 2.002
QP5 9.056 2.111
QP6 8.853 2.226
Playing_cards
QP1 10.245 1.521
QP2 9.830 1.666
QP3 9.199 1.981
QP4 8.946 2.134
QP5 8.764 2.254
QP6 8.621 2.357
Toys_and_calendar
QP1 10.311 1.500
QP2 9.878 1.652
QP3 9.201 2.006
QP4 8.952 2.162
QP5 8.780 2.279
QP6 8.652 2.371
Average 9.550 1.907
Table A-2 – Overall average of the average zigzag position for a 4x4 block and the variance of the average zigzag position for a 4x4 block for each RD point.
Annex B
This annex, regarding the results of Chapter 4, includes five tables with the average rate, PSNR, MS-SSIM,
VQM and RP compensated VQM for the four codecs (H.264/AVC High profile codec, H.264/AVC based
perceptual codec with coefficients pruning, H.264/AVC based perceptual codec with JND adaptive QP
and H.264/AVC based perceptual codec including coefficients pruning and JND adaptive QP) and the
variation in percentage between the H.264/AVC High profile codec and each one of the remaining codecs.
Columns: Bitrate [kbit/s] (HP | JND_CP | JND_QP | JND_CP+QP), then ∆Bitrate [%] (JND_CP | JND_QP | JND_CP+QP)

Foreman
G1 3986.19 3677.97 2692.69 2603.98 -7.7322 -32.4495 -34.675
G2 2339.4 2185.62 1603.31 1566.94 -6.57348 -31.4649 -33.0196
G3 704.6 681.48 541.02 533.78 -3.28129 -23.216 -24.2435
G4 332.12 322.34 271.45 267.54 -2.94472 -18.2675 -19.4448
G5 169.28 164.1 146.07 143.36 -3.06002 -13.711 -15.3119
G6 91.67 89.3 82.26 81.4 -2.58536 -10.2651 -11.2032
Mobile
G1 7080.9 6517.45 5197.84 5046.02 -7.95732 -26.5935 -28.7376
G2 4900.19 4565.44 3537.38 3509.24 -6.83137 -27.8114 -28.3856
G3 1879.3 1794.3 1369.55 1367.81 -4.52296 -27.1245 -27.217
G4 843.59 813.04 627.5 627 -3.62143 -25.6155 -25.6748
G5 375.05 360.58 294.97 294.58 -3.85815 -21.3518 -21.4558
G6 172.66 164.51 143.96 143.54 -4.72026 -16.6223 -16.8655
Panslow
G1 108443.010 101542.820 77407.46 75164.7 -6.36296 -28.6192 -30.6874
G2 62850.690 59207.970 41795.19 40726.63 -5.79583 -33.5008 -35.201
G3 6631.300 6405.530 5098.64 4689.07 -3.40461 -23.1125 -29.2888
G4 1810.460 1745.520 1552.97 1500.74 -3.58693 -14.2224 -17.1073
G5 774.200 749.330 735.87 703.23 -3.21235 -4.95092 -9.16688
G6 402.520 392.560 399.85 380.42 -2.47441 -0.66332 -5.49041
Spincalendar
G1 117966.860 109393.340 87532.46 84896.49 -7.26774 -25.7991 -28.0336
G2 71097.700 66182.000 50307.36 49302.84 -6.91401 -29.2419 -30.6548
G3 10567.280 10069.510 9064.95 8147.54 -4.71048 -14.2168 -22.8984
G4 3125.310 2998.330 2823.85 2708.68 -4.06296 -9.64576 -13.3308
G5 1336.660 1279.740 1265.4 1217.8 -4.25838 -5.3312 -8.89231
G6 716.030 687.720 697.19 671.66 -3.95374 -2.63117 -6.19667
Playing_cards
G1 83313.160 71955.020 62093.94 58585.48 -13.6331 -25.4692 -29.6804
G2 50165.110 43041.640 37874.46 35454.55 -14.2 -24.5004 -29.3243
G3 12351.070 10776.830 10082.08 9016.3 -12.7458 -18.3708 -26.9998
G4 4332.970 4063.390 3996.48 3599.42 -6.2216 -7.7658 -16.9295
G5 2054.630 1965.280 1998.25 1785.35 -4.34871 -2.74405 -13.106
G6 1084.940 1043.630 1094.85 963.99 -3.80758 0.913415 -11.1481
Toys_and_calendar
G1 93142.660 80918.700 66494.8 63388.33 -13.1239 -28.6097 -31.9449
G2 50370.440 44278.430 34917.75 33764.27 -12.0944 -30.6781 -32.9681
G3 7108.890 6574.970 6080.46 5655.03 -7.5106 -14.4668 -20.4513
G4 2604.550 2466.810 2426.31 2257.89 -5.28844 -6.84341 -13.3098
G5 1357.070 1294.060 1319.04 1219.74 -4.64309 -2.80236 -10.1196
G6 814.220 778.110 806.56 739.86 -4.43492 -0.94078 -9.13267
Table B-1 – Average rate for each sequence for various RD points for different codecs and their variation in percentage.
Columns: PSNR [dB] (HP | JND_CP | JND_QP | JND_CP+QP), then ∆PSNR [%] (JND_CP | JND_QP | JND_CP+QP)

Foreman
G1 46.149 45.698 42.408 42.274 -0.97727 -8.10635 -8.39671
G2 43.494 43.297 40.046 40.032 -0.45294 -7.92753 -7.95972
G3 38.859 38.848 36.324 36.324 -0.02831 -6.52359 -6.52359
G4 35.994 35.995 33.953 33.953 0.002778 -5.67039 -5.67039
G5 33.128 33.155 31.618 31.625 0.081502 -4.55808 -4.53695
G6 30.378 30.401 29.43 29.534 0.075713 -3.12068 -2.77833
Mobile
G1 45.285 44.651 41.122 40.825 -1.40002 -9.19289 -9.84874
G2 42.13 41.839 38.146 38.117 -0.69072 -9.45644 -9.52528
G3 36.167 36.154 33.326 33.316 -0.03594 -7.85523 -7.88288
G4 32.569 32.579 30.452 30.433 0.030704 -6.50005 -6.55838
G5 29.293 29.317 27.717 27.726 0.081931 -5.38012 -5.3494
G6 26.069 26.105 24.969 24.978 0.138095 -4.21957 -4.18505
Panslow
G1 45.174 44.706 42.248 42.024 -1.03599 -6.47718 -6.97304
G2 42.003 41.848 39.796 39.784 -0.36902 -5.25439 -5.28296
G3 37.233 37.244 36.531 36.506 0.029544 -1.88542 -1.95257
G4 35.624 35.633 35.091 35.076 0.025264 -1.49618 -1.53829
G5 33.865 33.871 33.388 33.346 0.017717 -1.40853 -1.53256
G6 31.711 31.738 31.258 31.185 0.085144 -1.42853 -1.65873
Spincalendar
G1 45.338 44.891 42.725 42.48 -0.98593 -5.76338 -6.30376
G2 42.126 41.982 40.315 40.296 -0.34183 -4.29901 -4.34411
G3 37.174 37.186 36.69 36.579 0.032281 -1.30199 -1.60058
G4 35.211 35.223 34.727 34.646 0.03408 -1.37457 -1.60461
G5 32.736 32.743 32.306 32.205 0.021383 -1.31354 -1.62207
G6 30.052 30.062 29.724 29.62 0.033276 -1.09144 -1.43751
Playing_cards
G1 47.156 46.815 45.292 45.134 -0.72313 -3.95284 -4.2879
G2 44.905 44.696 43.447 43.391 -0.46543 -3.24685 -3.37156
G3 41.159 41.193 40.346 40.263 0.082606 -1.97527 -2.17692
G4 38.887 38.936 38.157 38.102 0.126006 -1.87723 -2.01867
G5 36.421 36.474 35.701 35.664 0.14552 -1.97688 -2.07847
G6 33.500 33.547 32.936 32.954 0.140299 -1.68358 -1.62985
Toys_and_calendar
G1 45.442 45.151 43.507 43.373 -0.64038 -4.25818 -4.55306
G2 42.846 42.725 41.604 41.593 -0.28241 -2.89875 -2.92443
G3 39.423 39.447 39.109 39.069 0.060878 -0.79649 -0.89795
G4 37.926 37.958 37.616 37.604 0.084375 -0.81738 -0.84902
G5 36.055 36.102 35.77 35.736 0.130356 -0.79046 -0.88476
G6 33.872 33.922 33.503 33.55 0.147615 -1.0894 -0.95064
Table B-2 – Average PSNR for each sequence for various RD points for different codecs and their variation in percentage.
Columns: MS-SSIM (HP | JND_CP | JND_QP | JND_CP+QP), then ∆MS-SSIM [%] (JND_CP | JND_QP | JND_CP+QP)

Foreman
G1 0.9989 0.9988 0.9981 0.9981 -0.01001 -0.08009 -0.08009
G2 0.9983 0.9982 0.9971 0.9971 -0.01002 -0.1202 -0.1202
G3 0.9958 0.9958 0.9932 0.9933 0 -0.2611 -0.25105
G4 0.9921 0.9921 0.9881 0.9882 0 -0.40319 -0.39311
G5 0.9841 0.9843 0.9787 0.979 0.020323 -0.54872 -0.51824
G6 0.9675 0.9678 0.9609 0.9617 0.031008 -0.68217 -0.59948
Mobile
G1 0.9997 0.9996 0.9993 0.9993 -0.01 -0.04001 -0.04001
G2 0.9995 0.9995 0.9989 0.9989 0 -0.06003 -0.06003
G3 0.9985 0.9985 0.997 0.997 0 -0.15023 -0.15023
G4 0.9967 0.9967 0.9942 0.9942 0 -0.25083 -0.25083
G5 0.9927 0.9927 0.9881 0.9882 0 -0.46338 -0.45331
G6 0.9825 0.9826 0.9741 0.9744 0.010178 -0.85496 -0.82443
Panslow
G1 0.9982 0.9981 0.9968 0.9968 -0.01002 -0.14025 -0.14025
G2 0.9964 0.9963 0.9942 0.9942 -0.01004 -0.22079 -0.22079
G3 0.9897 0.9897 0.9875 0.9874 0 -0.22229 -0.23239
G4 0.9849 0.9850 0.9838 0.9838 0.010153 -0.11169 -0.11169
G5 0.9800 0.9801 0.979 0.9791 0.010204 -0.10204 -0.09184
G6 0.9717 0.9718 0.9698 0.9699 0.010291 -0.19553 -0.18524
Spincalendar
G1 0.9991 0.9990 0.9982 0.9981 -0.01001 -0.09008 -0.10009
G2 0.9978 0.9977 0.9967 0.9966 -0.01002 -0.11024 -0.12026
G3 0.9940 0.9940 0.9933 0.9931 0 -0.07042 -0.09054
G4 0.9911 0.9911 0.9898 0.9896 0 -0.13117 -0.15135
G5 0.9842 0.9842 0.9813 0.981 0 -0.29466 -0.32514
G6 0.9683 0.9686 0.9621 0.9619 0.030982 -0.6403 -0.66095
Playing_cards
G1 0.9990 0.9990 0.9986 0.9985 0 -0.04004 -0.05005
G2 0.9984 0.9982 0.9975 0.9975 -0.02003 -0.09014 -0.09014
G3 0.9937 0.9938 0.993 0.9927 0.010063 -0.07044 -0.10063
G4 0.9883 0.9885 0.9868 0.9866 0.020237 -0.15178 -0.17201
G5 0.9790 0.9792 0.9757 0.9757 0.020429 -0.33708 -0.33708
G6 0.9589 0.9595 0.9531 0.9543 0.062572 -0.60486 -0.47972
Toys_and_calendar
G1 0.9993 0.9992 0.9984 0.9984 -0.01001 -0.09006 -0.09006
G2 0.9983 0.9982 0.9965 0.9966 -0.01002 -0.18031 -0.17029
G3 0.9919 0.9920 0.9911 0.9911 0.010082 -0.08065 -0.08065
G4 0.9880 0.9881 0.9868 0.987 0.010121 -0.12146 -0.10121
G5 0.9799 0.9802 0.9785 0.9787 0.030615 -0.14287 -0.12246
G6 0.9673 0.9677 0.9646 0.9653 0.041352 -0.27913 -0.20676
Table B-3 – Average MS-SSIM for each sequence for various RD points for different codecs and their variation in percentage.
Columns: VQM (HP | JND_CP | JND_QP | JND_CP+QP), then ∆VQM [%] (JND_CP | JND_QP | JND_CP+QP)

Foreman
G1 0.013307 0.014418 0.021743 0.022643 8.348989 63.39521 70.15856
G2 0.018807 0.020201 0.032642 0.033624 7.412134 73.56304 78.7845
G3 0.048995 0.054603 0.100631 0.099747 11.44607 105.3903 103.5861
G4 0.133673 0.136991 0.207254 0.201878 2.482177 55.04552 51.02377
G5 0.274849 0.282759 0.341177 0.34626 2.877944 24.13252 25.9819
G6 0.464149 0.468676 0.529442 0.515567 0.975333 14.06725 11.07791
Mobile
G1 0.009942 0.011072 0.016049 0.016901 11.36592 61.42627 69.99598
G2 0.013334 0.014516 0.024797 0.024895 8.864557 85.9682 86.70316
G3 0.029509 0.036435 0.06874 0.068473 23.47081 132.9459 132.0411
G4 0.069373 0.077158 0.149015 0.152428 11.22195 114.8026 119.7224
G5 0.169302 0.186417 0.280525 0.28324 10.10915 65.69503 67.29867
G6 0.338682 0.346291 0.433036 0.436007 2.24665 27.85917 28.7364
Panslow
G1 0.007178 0.007118 0.009085 0.009038 -0.83589 26.56729 25.91251
G2 0.009717 0.009983 0.013355 0.013359 2.73747 37.43954 37.4807
G3 0.029884 0.031958 0.038648 0.039723 6.940169 29.32673 32.92397
G4 0.079645 0.082996 0.103395 0.102728 4.20742 29.81983 28.98236
G5 0.188009 0.185446 0.213475 0.205818 -1.36323 13.5451 9.472419
G6 0.329202 0.330458 0.353762 0.354971 0.381529 7.460465 7.827717
Spincalendar
G1 0.009211 0.00937 0.012542 0.012594 1.726197 36.16328 36.72783
G2 0.012474 0.012688 0.01916 0.019318 1.715568 53.59949 54.86612
G3 0.034804 0.036452 0.055895 0.057071 4.735088 60.59936 63.97828
G4 0.103307 0.108802 0.152354 0.155144 5.319097 47.47694 50.17763
G5 0.259347 0.265101 0.334353 0.335999 2.218649 28.9211 29.55577
G6 0.462608 0.467863 0.558566 0.557332 1.135951 20.74283 20.47608
Playing_cards
G1 0.03466 0.03448 0.038848 0.038888 -0.51933 12.08309 12.1985
G2 0.042045 0.042075 0.047589 0.048618 0.071352 13.18587 15.63325
G3 0.132868 0.135069 0.150024 0.161212 1.656531 12.91206 21.33245
G4 0.25712 0.257557 0.278281 0.283637 0.16996 8.230009 10.31308
G5 0.379949 0.382697 0.412839 0.41351 0.723255 8.656425 8.833028
G6 0.530624 1.025121 0.55516 0.556294 93.1916 4.62399 4.837701
Toys_and_calendar
G1 0.015473 0.015938 0.020392 0.02041 3.005235 31.79086 31.90719
G2 0.021257 0.022064 0.029842 0.029812 3.796396 40.3867 40.24557
G3 0.063034 0.06434 0.078961 0.080065 2.071898 25.26732 27.01875
G4 0.136806 0.137234 0.166383 0.161159 0.312852 21.61967 17.80112
G5 0.257569 0.257839 0.301546 0.285672 0.104826 17.07387 10.91086
G6 0.414189 0.414705 0.46101 0.453069 0.124581 11.30426 9.387019
Table B-4 – Average VQM for each sequence for various RD points for different codecs and their variation in percentage.
Columns: RP compensated VQM (HP | JND_CP | JND_QP | JND_CP+QP), then ∆RP_VQM [%] (JND_CP | JND_QP | JND_CP+QP)

Foreman
G1 0.013307 0.013307 0.013307 0.013307 0 0 0
G2 0.018807 0.018807 0.018807 0.018807 0 0 0
G3 0.048995 0.048995 0.048995 0.048995 0 0 0
G4 0.133673 0.133673 0.133673 0.133673 0 0 0
G5 0.274849 0.274849 0.274849 0.274849 0 0 0
G6 0.464149 0.464149 0.464149 0.464149 0 0 0
Mobile
G1 0.009942 0.009942 0.009942 0.009942 0 0 0
G2 0.013334 0.013334 0.013334 0.013334 0 0 0
G3 0.029509 0.029509 0.029509 0.029509 0 0 0
G4 0.069373 0.069373 0.149015 0.152428 0 114.803 119.7224
G5 0.169302 0.169302 0.169302 0.169302 0 0 0
G6 0.338682 0.338682 0.338682 0.338682 0 0 0
Panslow
G1 0.007178 0.007178 0.007178 0.007178 0 0 0
G2 0.009717 0.009717 0.009717 0.009717 0 0 0
G3 0.029884 0.029884 0.029884 0.029884 0 0 0
G4 0.079645 0.079645 0.079645 0.079645 0 0 0
G5 0.188009 0.188009 0.188009 0.188009 0 0 0
G6 0.329202 0.329202 0.329202 0.329202 0 0 0
Spincalendar
G1 0.009211 0.009211 0.009211 0.009211 0 0 0
G2 0.012474 0.012474 0.012474 0.012474 0 0 0
G3 0.034804 0.034804 0.034804 0.034804 0 0 0
G4 0.103307 0.103307 0.103307 0.103307 0 0 0
G5 0.259347 0.259347 0.259347 0.259347 0 0 0
G6 0.462608 0.462608 0.462608 0.462608 0 0 0
Playing_cards
G1 0.03466 0.03466 0.03466 0.03466 0 0 0
G2 0.042045 0.042045 0.042045 0.042045 0 0 0
G3 0.132868 0.132868 0.132868 0.132868 0 0 0
G4 0.25712 0.25712 0.25712 0.25712 0 0 0
G5 0.379949 0.379949 0.379949 0.379949 0 0 0
G6 0.530624 0.530624 0.530624 0.530624 0 0 0
Toys_and_calendar
G1 0.015473 0.015473 0.015473 0.015473 0 0 0
G2 0.021257 0.021257 0.021257 0.021257 0 0 0
G3 0.063034 0.063034 0.063034 0.063034 0 0 0
G4 0.136806 0.136806 0.136806 0.136806 0 0 0
G5 0.257569 0.257569 0.257569 0.257569 0 0 0
G6 0.414189 0.414189 0.414189 0.414189 0 0 0
Table B-5 – Average RP compensated VQM for each sequence for various RD points for different codecs and their variation in percentage.