H.264/AVC Based Perceptual Video Coding
Design, Implementation and Evaluation
Ana Sofia Clemente Cabrita
Dissertation for obtaining the Master of Science Degree in
Electrical and Computer Engineering
Jury
President: Prof. José Bioucas Dias
Supervisor: Prof. Fernando Pereira
Co-Supervisor: Dr. Matteo Naccari
Member: Prof. Luís Ducla Soares
July 2010
Acknowledgments
First and foremost, I would like to thank my thesis supervisor, Prof. Fernando Pereira, for his valuable
guidance, advice and dedication to this work.
I would also like to thank the whole IT Image Group for their support, especially Matteo Naccari and
Catarina Brites, for their availability and help with technical issues when I needed it most.
Finally, an honorable mention goes to my family and friends for their understanding and support in
completing this project. Without the help of all those mentioned above, I would have faced many
difficulties in carrying out this work. A deep acknowledgement to all of them, as well as to everyone
close to me who, in their own way, contributed to the development of this Thesis.
To my parents
Abstract
Nowadays, the demand for higher video compression efficiency keeps growing, following the increasingly
wide-scale deployment of applications where video compression plays a key enabling role. Some of these
important video applications are part of our daily life, notably the storage of films and video games on
DVD and Blu-ray discs, Internet video streaming such as YouTube, digital television over various
broadcasting channels, mobile TV, and real-time applications such as videotelephony and
videoconferencing.
Current video compression standards are mainly optimized for objective quality metrics such as the
mean squared error, which do not take into account the features of the human visual system. The challenge
addressed in this Thesis regards the inclusion, in the state-of-the-art H.264/AVC video compression
standard, of additional coding tools able to increase its rate-distortion performance by exploiting the
perceptual features of the human visual system, in short, to provide the same subjective
quality at a lower bitrate.
This challenge has been addressed by introducing perceptually driven modifications into the basic
H.264/AVC architecture. First, a perceptual coefficients pruning method was implemented, which uses the
just noticeable distortion (JND) thresholds provided by a selected JND model to set some of the DCT
coefficients to zero. Afterwards, a JND adaptive quantization method was implemented, which adapts the
initial quantization parameter (QP) based on the same JND thresholds, with the target of avoiding the
coding of video data which is not perceptually relevant.
The main conclusion of the work reported in this Thesis is that it is worthwhile to adopt a perceptual
approach in the optimization of the H.264/AVC video codec, reaching bitrate reductions of up to
approximately 35% for the same perceptual quality.
Keywords: H.264/AVC; perceptual video coding; just noticeable distortion model; coefficients pruning;
adaptive quantization
Resumo
Nowadays, the needs in terms of higher video compression efficiency keep growing, following the
large-scale increase in the number of applications where video compression technologies play a
fundamental role. Some of the most important video applications are already part of daily life, notably
the recording of films and video games on DVD and Blu-ray discs, Internet video streaming such as
YouTube, digital television using various transmission media, mobile television and real-time
applications such as videotelephony and videoconferencing.
Current video compression standards were essentially optimized for objective quality metrics, such as
the mean squared error, which do not take into account the characteristics of the human visual system.
The main challenge considered in this Thesis consists in introducing, into the basic architecture of the
H.264/AVC video compression standard, additional perceptual tools aimed at increasing its quality versus
bitrate performance. With this objective, a perceptual coefficient selection method was first
implemented, which uses the thresholds provided by a selected just noticeable distortion model to
eliminate some of the DCT coefficients deemed not perceptually relevant. Next, an adaptive quantization
model for the DCT coefficients was implemented, which adapts the initial quantization step based on the
same thresholds with the objective of avoiding the coding of video information which is not perceptually
relevant, by controlling the precision of the transmitted coefficients.
Thus, the main conclusion of the work presented in this Thesis is that it is worthwhile to adopt a
perceptual approach to optimize the H.264/AVC video codec, reaching bitrate reductions of up to
approximately 35% for the same perceptual quality.
Keywords: H.264/AVC; perceptual video coding; just noticeable distortion model; coefficient selection;
adaptive quantization
Table of Contents
Chapter 1 ...................................................................................................................................................... 1
Introduction ................................................................................................................................................... 1
1.1. Context and Motivation ................................................................................................................. 1
1.2. Background ................................................................................................................................... 2
1.3. Objective ....................................................................................................................................... 4
1.4. Thesis Organization ...................................................................................................................... 4
Chapter 2 ...................................................................................................................................................... 5
Reviewing Perceptual Video Coding Concepts and Tools ........................................................................... 5
2.1. Brief Overview on the Human Visual System ............................................................................... 5
2.1.1. Human Visual System Features ........................................................................................... 5
2.1.2. Perceptual Models ................................................................................................................ 9
2.2. Brief Overview on Video Coding Standards................................................................................ 10
2.3. Reviewing Perceptual Video Coding Solutions ........................................................................... 18
2.3.1. H.264/AVC Perceptual Video Coding based on a Foveated JND Model ........................... 18
2.3.2. H.264/AVC Coding with JND Model based Coefficients Filtering ....................................... 24
2.3.3. H.264/AVC Inter Coding based on Structural Similarity driven Motion Estimation ............. 28
2.3.4. H.264/AVC Bitrate Control based on 4D Perceptual Quantization Modeling ..................... 33
Chapter 3 .................................................................................................................................................... 41
A JND Model based Coefficients Pruning Method for H.264/AVC Video Coding ...................................... 41
3.1. The Perceptual Coefficients Pruning Method ............................................................................. 41
3.1.1. Objective ............................................................................................................................. 41
3.1.2. Architecture ......................................................................................................................... 41
3.1.3. Walkthrough ........................................................................................................................ 42
3.1.4. Novel Tools Description ...................................................................................................... 43
3.2. Performance Evaluation .............................................................................................................. 47
3.2.1. Test Conditions ................................................................................................................... 47
3.2.2. Results and Analysis ........................................................................................................... 49
3.2.3. Conclusion........................................................................................................................... 62
Chapter 4 .................................................................................................................................................... 64
A JND Model based Adaptive Quantization Method for H.264/AVC Video Coding ................................... 64
4.1. The JND Adaptive Quantization Method..................................................................................... 64
4.1.1. Objective ............................................................................................................................. 64
4.1.2. Architecture ......................................................................................................................... 64
4.1.3. Walkthrough ........................................................................................................................ 66
4.1.4. Description of New Tool ...................................................................................................... 66
4.2. Performance Evaluation .............................................................................................................. 68
4.2.1. Test Conditions ................................................................................................................... 68
4.2.2. Results and Analysis ........................................................................................................... 70
4.2.3. Conclusion........................................................................................................................... 87
Chapter 5 .................................................................................................................................................... 88
Conclusions and Future Work ..................................................................................................................... 88
5.1. Summary and Conclusions ......................................................................................................... 88
5.2. Future Work ................................................................................................................................ 89
References .................................................................................................................................................. 91
Annex A ....................................................................................................................................................... 94
Annex B ....................................................................................................................................................... 97
Index of Figures
Figure 2.1 – Eye structure [3] ........................................................................................................................ 6
Figure 2.2 – Internal layer structure of the eye [6] ........................................................................................ 7
Figure 2.3 – For the same image: (a) The noise is added randomly, (b) The noise power is weighted on
the texture activity of each block [11] ............................................................................................................ 8
Figure 2.4 – For the same image: (a) The noise is added randomly, (b) The noise power is weighted
based on the intensity contrast of each MB [11] ........................................................................................... 8
Figure 2.5 – Variation of the contrast sensitivity function with the spatial frequency [9] .............................. 8
Figure 2.6 – Simultaneous contrast: The two smaller squares have equal luminance although the right
one appears brighter [9] ................................................................................................................................ 9
Figure 2.7 – Perceptual factors: (a) Proximity; (b) Similarity; (c) Closure; (d) Symmetry [12] ..................... 9
Figure 2.8 – Weber's law [9] ....................................................................................................................... 10
Figure 2.9 – Chronology of the video recommendations/standards developed by ITU-T VCEG and
ISO/IEC MPEG [16] .................................................................................................................................... 11
Figure 2.10 – Ellipse [25] ............................................................................................................................ 18
Figure 2.11 – Architecture of the H.264/AVC Perceptual Video Coding based on a Foveated JND Model
solution ........................................................................................................................................................ 19
Figure 2.12 – DMOS comparisons for the H.264/AVC based coding solutions [23]. ................................. 23
Figure 2.13 – Portions of decoded frames for the test sequence Stefan. Stefan frame coded with (a) JM
(e) FJND; Fixation point of Stefan frame coded with (b) JM (f) FJND; Texture region away from fixation
point coded with (c) JM (g) FJND; Non-fixation point coded with (d) JM (h) FJND [23] ............................. 23
Figure 2.14 – Architecture of the H.264/AVC Coding with JND Model based Coefficients Filtering solution
.................................................................................................................................................................... 24
Figure 2.15 – Bitrate changes at different QP for I, P, and B frames [26] .................................................. 27
Figure 2.16 – Architecture of the H.264/AVC Inter Coding based on SSIM driven ME solution ................ 29
Figure 2.17 – Illustration of the H.264/AVC Bitrate Control based on 4D Perceptual Quantization
Modeling rate control main components (major revisions are marked by double stars) [28] ..................... 33
Figure 2.18 – Architecture of the H.264/AVC Bitrate Control based on 4D Perceptual Quantization
Modeling solution ........................................................................................................................................ 34
Figure 2.19 – H.264/AVC bitrate control procedure using the 4D perceptual quantization model [28] ...... 35
Figure 2.20 – Comparison of the image quality for the Carphone sequence at 24 kbps (significant
differences are marked with black circles) [28] ........................................................................................... 40
Figure 2.21 – Comparison of the image quality for the Foreman sequence at 128 kbps (significant
differences are marked with black circles) [28] ........................................................................................... 40
Figure 3.1 – Improved H.264/AVC codec architecture including the JND model and the coefficients
pruning method. .......................................................................................................................................... 42
Figure 3.2 – First frame of each test sequence: (a) Foreman; (b) Mobile; (c) Panslow; (d) Spincalendar;
(e) Playing_cards; (f) Toys_and_calendar .................................................................................................. 48
Figure 3.3 – PSNR RD performance for the Foreman sequence .............................................................. 50
Figure 3.4 – PSNR RD performance for the Mobile sequence ................................................................... 50
Figure 3.5 – PSNR RD performance for the Panslow sequence ................................................................ 51
Figure 3.6 – PSNR RD performance for the Spincalendar sequence ........................................................ 51
Figure 3.7 – PSNR RD performance for the Playing_cards sequence ....................................................... 51
Figure 3.8 – PSNR RD performance for the Toys_and_calendar sequence.............................................. 52
Figure 3.9 – MS-SSIM RD performance for the Foreman sequence ......................................................... 53
Figure 3.10 – MS-SSIM RD performance for the Mobile sequence ........................................................... 53
Figure 3.11 – MS-SSIM RD performance for the Panslow sequence ........................................................ 53
Figure 3.12 – MS-SSIM RD performance for the Spincalendar sequence ................................................. 54
Figure 3.13 – MS-SSIM RD performance for the Playing_cards sequence ............................................... 54
Figure 3.14 – MS-SSIM RD performance for the Toys_and_calendar sequence ...................................... 54
Figure 3.15 – Average number of zeroed coefficients for the Foreman sequence .................................... 56
Figure 3.16 – Average number of zeroed coefficients for the Mobile sequence ........................................ 56
Figure 3.17 – Average number of zeroed coefficients for the Panslow sequence ..................................... 56
Figure 3.18 – Average number of zeroed coefficients for the Spincalendar sequence ............................. 57
Figure 3.19 – Average number of zeroed coefficients for the Playing_cards sequence ............................ 57
Figure 3.20 – Average number of zeroed coefficients for the Toys_and_calendar sequence ................... 57
Figure 3.21 – Average zigzag position of the zeroed coefficients for the 4x4 blocks in the Foreman
sequence ..................................................................................................................................................... 59
Figure 3.22 – Average zigzag position of the zeroed coefficients for the 4x4 blocks in the Mobile
sequence ..................................................................................................................................................... 59
Figure 3.23 – Average zigzag position of the zeroed coefficients for the 4x4 blocks in the Panslow
sequence ..................................................................................................................................................... 60
Figure 3.24 – Average zigzag position of the zeroed coefficients for the 4x4 blocks in the Spincalendar
sequence ..................................................................................................................................................... 60
Figure 3.25 – Average zigzag position of the zeroed coefficients for the 4x4 blocks in the Playing_cards
sequence ..................................................................................................................................................... 60
Figure 3.26 – Average zigzag position of the zeroed coefficients for the 4x4 blocks in the
Toys_and_calendar sequence .................................................................................................................... 61
Figure 3.27 – 4x4 block with the average zeroed positions highlighted ..................................................... 63
Figure 4.1 – Encoder architecture of the proposed solution with the JND related quantization modules .. 65
Figure 4.2 – Encoder architecture of the proposed solution including the JND related quantization and
DCT coefficients pruning modules .............................................................................................................. 65
Figure 4.3 – PSNR RD performance for the Foreman sequence ............................................................... 71
Figure 4.4 – PSNR RD performance for the Mobile sequence ................................................................... 71
Figure 4.5 – PSNR RD performance for the Panslow sequence ................................................................ 71
Figure 4.6 – PSNR RD performance for the Spincalendar sequence ........................................................ 72
Figure 4.7 – PSNR RD performance for the Playing_cards sequence ........................................................ 72
Figure 4.8 – PSNR RD performance for the Toys_and_calendar sequence.............................................. 72
Figure 4.9 – MS-SSIM RD performance for the Foreman sequence ......................................................... 75
Figure 4.10 – MS-SSIM RD performance for the Mobile sequence ........................................................... 75
Figure 4.11 – MS-SSIM RD performance for the Panslow sequence ........................................................ 75
Figure 4.12 – MS-SSIM RD performance for the Spincalendar sequence ................................................. 76
Figure 4.13 – MS-SSIM RD performance for the Playing_cards sequence ............................................... 76
Figure 4.14 – MS-SSIM RD performance for the Toys_and_calendar sequence ...................................... 76
Figure 4.15 – VQM RD performance for the Foreman sequence ............................................................... 78
Figure 4.16 – VQM RD performance for the Mobile sequence .................................................................. 79
Figure 4.17 – VQM RD performance for the Panslow sequence ............................................................... 79
Figure 4.18 – VQM RD performance for the Spincalendar sequence ........................................................ 79
Figure 4.19 – VQM RD performance for the Playing_cards sequence ...................................................... 80
Figure 4.20 – VQM RD performance for the Toys_and_calendar sequence ............................................. 80
Figure 4.21 – RP compensated VQM RD performance for the Foreman sequence .................................. 83
Figure 4.22 – RP compensated VQM RD performance for the Mobile sequence ...................................... 83
Figure 4.23 – RP compensated VQM RD performance for the Panslow sequence ................................... 84
Figure 4.24 – RP compensated VQM RD performance for the Spincalendar sequence ........................... 84
Figure 4.25 – RP compensated VQM RD performance for the Playing_cards sequence .......................... 84
Figure 4.26 – RP compensated VQM RD performance for the Toys_and_calendar sequence ................ 85
Index of Tables
Table 2.1 – Results of the FJND validation tests [23] ................................................................................. 23
Table 2.2 – Bitrate reduction for the JND-thresholded sequences and their MOS [26] ............................. 28
Table 2.3 – MEBSS results with QP=10 [27]. ............................................................................................. 32
Table 2.4 – Comparison of overall coding performance for the GOP IPPP pattern using PQrc and JM10.2
(average PSNR gain: 0.515 dB) [28] .......................................................................................................... 39
Table 2.5 – Comparison of overall coding performance for the GOP IBBP pattern using PQrc and JM10.2
(average PSNR gain: 0.35dB) [28] ............................................................................................................. 39
Table A-1 – Average rate, PSNR and MS-SSIM for each sequence for various RD points, and their
variation in percentage. ............................................................................................................................... 95
Table A-2 – Overall average of the average zigzag position for a 4x4 block and the variance of the
average zigzag position for a 4x4 block for each RD point......................................................................... 96
Table B-1 – Average rate for each sequence for various RD points for different codecs and their variation
in percentage. ............................................................................................................................................. 98
Table B-2 – Average PSNR for each sequence for various RD points for different codecs and their
variation in percentage. ............................................................................................................................... 99
Table B-3 – Average MS-SSIM for each sequence for various RD points for different codecs and their
variation in percentage. ............................................................................................................................. 100
Table B-4 – Average VQM for each sequence for various RD points for different codecs and their
variation in percentage. ............................................................................................................................. 101
Table B-5 – Average RP compensated VQM for each sequence for various RD points for different codecs
and their variation in percentage. .............................................................................................................. 102
List of Acronyms
AVC Advanced Video Coding
BCQ Bit Complexity Quantization
BPF Band-Pass Filter
CABAC Context-Adaptive Binary Arithmetic Coding
CAVLC Context-Adaptive Variable Length Coding
CD Compact Disc
CI Confidence Interval
CIF Common Intermediate Format
CRT Cathode Ray Tube
CSF Contrast Sensitivity Function
DCT Discrete Cosine Transform
DMOS Differential MOS
DP Data Partitioning
DSCQS Double Stimulus Continuous Quality Scale
DSIS Double Stimulus Impairment Scale
DVD Digital Video Disc
FJND Foveated JND
FMO Flexible Macroblock Ordering
GOP Group Of Pictures
GSTN General Switched Telephone Network
HVS Human Visual System
ICT Integer Discrete Cosine Transform
IEC International Electrotechnical Commission
ISDN Integrated Services Digital Network
ISO International Organization for Standardization
ITS Institute for Telecommunications Sciences
ITU International Telecommunication Union
ITU-T ITU Telecommunication standardization sector
JM Joint Model
JND Just Noticeable Distortion
JPEG Joint Photographic Experts Group
JVT Joint Video Team
MAD Mean Absolute Difference
MB MacroBlock
MC Motion Compensation
ME Motion Estimation
MEBSS Motion Estimation based on SSIM
MOS Mean Opinion Score
MPEG Moving Picture Experts Group
MSE Mean Squared Error
MSSIM Mean SSIM
MS-SSIM Multi-Scale SSIM
NAL Network Abstraction Layer
NTSC National Television System(s) Committee
PAL Phase Alternating Line
PQrc Perceptual Quantization Rate Control
PSNR Peak Signal to Noise Ratio
PSTN Public Switched Telephone Network
QCIF Quarter Common Intermediate Format
QP Quantization Parameter
RD Rate Distortion
RDO Rate Distortion Optimization
RP Resolving Power
SAD Sum of Absolute Differences
SCSF Spatial Contrast Sensitivity Function
SGI Silicon Graphics
SIF Source Input Format
SJND Spatial Just Noticeable Distortion
SNR Signal to Noise Ratio
SSIM Structural Similarity
TJND Temporal Just Noticeable Distortion
TV Television
VCEG Video Coding Experts Group
VHS Video Home System
VLC Variable Length Coding
VQM Video Quality Metric
Chapter 1
Introduction
This chapter presents the main objectives of this Thesis, after providing its context, motivation and
technical background. Finally, the structure of this report is presented.
1.1. Context and Motivation
Following the increasingly wide-scale deployment of daily life applications where video compression plays
a key enabling role, the demand for more video compression keeps growing. Some of these important video
applications are the storage of films and video games on DVD and Blu-ray discs, Internet video streaming
such as YouTube, digital television using various broadcasting channels, mobile TV, and real-time
applications such as videotelephony and videoconferencing.
Video compression, or video coding, technologies aim to efficiently represent digital video data in order
to ease its transmission and storage. This implies a complex balance between video quality, coding
bitrate, encoding and decoding complexity, robustness to data losses and errors, ease of editing, random
access, end-to-end delay, and a number of other relevant factors depending on the target application
scenario.
The main challenge in video coding is, thus, to reduce the compressed video data size for a target video
quality, or to maximize the video quality for a target coding rate; in other words, to find an efficient
balance between quality and bitrate. To compress the video data, compression algorithms exploit the
redundancy and irrelevance in the original (non-compressed) digital data through widely used tools such
as the discrete cosine transform (DCT), motion estimation and compensation, quantization and entropy
coding. For these tools to provide an efficient quality versus bitrate trade-off, advanced rate-distortion
(RD) mechanisms have to be used. These mechanisms also have to consider the relevant application
constraints in terms of available channel bandwidth and buffer sizes, which are limited in most
application scenarios. Hence the critical need to design rate control solutions that achieve the best
visual quality under the relevant constraints, which are deeply tied to the application [1].
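To make the role of the transform and quantization tools mentioned above concrete, the sketch below implements an orthonormal 4x4 2-D DCT with uniform scalar quantization and its inverse. This is a conceptual floating-point illustration only; H.264/AVC actually specifies an integer approximation of the DCT with its own scaling and QP-dependent step sizes.

```python
import math

N = 4  # block size, matching the H.264/AVC 4x4 transform

def _c(k):
    # Orthonormal scaling factors of the DCT-II
    return math.sqrt(1.0 / N) if k == 0 else math.sqrt(2.0 / N)

def dct2(block):
    """Orthonormal 2-D DCT-II of an NxN pixel block."""
    return [[_c(u) * _c(v) * sum(
                block[x][y]
                * math.cos((2 * x + 1) * u * math.pi / (2 * N))
                * math.cos((2 * y + 1) * v * math.pi / (2 * N))
                for x in range(N) for y in range(N))
             for v in range(N)] for u in range(N)]

def idct2(coefs):
    """Inverse of dct2: reconstructs the pixel block from its coefficients."""
    return [[sum(_c(u) * _c(v) * coefs[u][v]
                 * math.cos((2 * x + 1) * u * math.pi / (2 * N))
                 * math.cos((2 * y + 1) * v * math.pi / (2 * N))
                 for u in range(N) for v in range(N))
             for y in range(N)] for x in range(N)]

def quantize(coefs, step):
    """Uniform scalar quantization: the lossy step of the codec."""
    return [[round(c / step) for c in row] for row in coefs]

def dequantize(levels, step):
    return [[lvl * step for lvl in row] for row in levels]
```

For a flat 4x4 block of value 100, all the energy is compacted into the single DC coefficient (400 with this normalization) and the AC coefficients are zero, which is why the DCT is so effective on smooth image regions.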
Traditionally, video coding optimization has been based on simple objective quality metrics such as the
mean squared error (MSE) and the peak signal-to-noise ratio (PSNR); this has been the case for the
very popular MPEG and ITU-T video compression standards. These metrics do not accurately express
the perceived distortion/quality as assessed by human subjects through the Human Visual System (HVS),
and more appropriate quality metrics may be used to maximize the subjective quality for a certain target
bitrate. This maximization is especially interesting in the context of the H.264/AVC (Advanced Video
Coding) standard, which is nowadays widely used for all types of services and products, from mobile TV
to Blu-ray, and represents the state of the art in video coding.
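For reference, the two metrics just mentioned can be computed as follows; this is a minimal sketch for one-dimensional sequences of 8-bit samples.

```python
import math

def mse(ref, deg):
    """Mean squared error between two equally sized sample sequences."""
    assert len(ref) == len(deg)
    return sum((a - b) ** 2 for a, b in zip(ref, deg)) / len(ref)

def psnr(ref, deg, peak=255):
    """Peak signal-to-noise ratio in dB; higher means closer to the reference."""
    e = mse(ref, deg)
    return float('inf') if e == 0 else 10.0 * math.log10(peak * peak / e)
```

A uniform error of 10 grey levels gives MSE = 100 and a PSNR of roughly 28.1 dB regardless of the underlying image content, which is precisely why PSNR can disagree with perceived quality.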
1.2. Background
The most relevant background concepts and tools for this Thesis are the human visual system, with its
features and behaviors, together with the best available video compression solution, i.e. the
H.264/AVC standard.
The HVS includes the eye and also the part of the brain dedicated to processing visual information,
using memory and knowledge. The eye is structured into three layers, each with a specific function; the
most important elements are the photoreceptors, which transform luminous stimuli into nervous impulses.
Some HVS properties which have recently been exploited for video compression are texture masking,
intensity contrast masking, spatial frequency sensitivity and the preservation of object boundaries.
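As a minimal illustration of the luminance adaptation behavior underlying such masking effects, Weber's law states that the just noticeable luminance difference grows roughly in proportion to the background luminance. In the sketch below, the 2% Weber fraction is an illustrative assumption, not the JND model actually used later in this Thesis.

```python
def weber_jnd(background_luminance, weber_fraction=0.02):
    """Smallest luminance difference likely to be noticed on a uniform
    background, per Weber's law: delta_I / I is approximately constant.
    The 2% fraction is an illustrative value only."""
    return weber_fraction * background_luminance
```

Under this rule, a distortion of 2 grey levels is invisible on a background of luminance 100 but visible on a background of luminance 50, which is the basic intuition a JND model quantifies per DCT coefficient.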
Since the 90s, several video compression standards have been developed by the Video Coding Experts
Group (VCEG) from the International Telecommunications Union (ITU), more precisely the ITU
Telecommunication standardization sector (ITU-T), and the Motion Picture Experts Group (MPEG) from
the International Standards Organization/ International Electrotechnical Commission (ISO/IEC) [2]. It all
began with the ITU-T H.261 Recommendation which was designed to address bidirectional and
unidirectional visual communications at data rates multiples of 64 kbit/s, between 40 kbit/s and 2000
kbit/s, and had videotelephony and videoconference as main target applications [3]. Afterwards, the
MPEG-1 Video standard was designed to compress Video Home System (VHS) quality raw digital video,
that is, for a digital quality video equivalent to the VHS quality, down to around 1.2 Mbit/s without
excessive quality loss, so to allow digital movie storage in CD-ROMs [4] [5]. In 1994, after a joint effort
between the ITU-T and ISO/IEC, the H.262/MPEG-2 Video standard was developed targeting digital
television and, thus, interlaced video at higher rates and spatial resolutions than MPEG-1 Video, notably
up to High Definition (HD) [6] [7]. Soon after, the ITU-T H.263 Recommendation was developed to
optimize video quality for lower bitrates, especially targeting applications such as visual telephony in
copper telephone lines and mobile networks [8]. Meanwhile, ISO/IEC developed the MPEG-4 Visual
standard, providing increased compression efficiency but also adopting an object-based representation
framework with high flexibility and advanced interaction capabilities. The main MPEG-4 Visual standard
target applications ranged from digital television to mobile and Internet video streaming and games
[9].
More recently, the H.264/AVC video compression standard was developed with the intent of increasing by
50% the compression performance provided by all previously available video compression standards
while also providing a “network-friendly” video representation. The target applications included
conversational, e.g. videotelephony and videoconference, and non-conversational scenarios, e.g.
storage, broadcast and streaming, using bitrates from 50 kbps to 8 Mbps and more [10] [11] [12].
To achieve a superior video compression performance, the H.264/AVC standard specifies many
normative novel tools and also requires the additional implementation of some non-normative tools and
features [12]. Normative tools are those specified by the standard and whose implementation as in the
standard is essential for interoperability. On the contrary, the non-normative tools do not require a
normative specification since they are not essential for interoperability, implying that their precise
implementation is left to the implementer’s criterion. From the full H.264/AVC set of coding tools, a few
are especially important for the provided increased coding performance, notably:
• Variable block sizes for block prediction – block sizes from 4x4 to 16x16 are allowed to more
efficiently capture the properties of the video frame regions, notably in terms of motion.
• Smaller size (4x4) DCT transform – makes the transform coefficients more localized in space
since there is less spatial redundancy to exploit due to the better temporal prediction; as such,
these coefficients are able to better express the visual properties of a region in a frame.
• Quarter pixel Motion Estimation (ME) – provides a more accurate prediction of
translational motion, thus reducing the prediction error.
• De-Blocking In-Loop filter – helps to reduce the blocking artifacts typical of block-based video
codecs in a rather block adaptive way, especially at low bitrates.
• Flexible macroblock (MB) ordering – facilitates the grouping of MBs into slices which can be
used either for error resilience or more efficient video coding; this grouping may be performed
based on the perceptual importance of the MBs (for increased error resilience) or based on
similarity properties among the various MBs (for coding efficiency).
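To give a concrete flavor of the smaller 4x4 transform mentioned above, the following sketch applies the well-known H.264/AVC forward core transform matrix to a residual block. This is an illustrative sketch only: the per-coefficient scaling that H.264/AVC folds into the quantization stage is omitted, so the output is the unscaled integer transform, not a standard-compliant implementation; all function names are this sketch's own.

```python
# Sketch of the H.264/AVC 4x4 forward core transform Y = Cf * X * Cf^T.
# The per-coefficient scaling normally folded into quantization is omitted.

CF = [
    [1,  1,  1,  1],
    [2,  1, -1, -2],
    [1, -1, -1,  1],
    [1, -2,  2, -1],
]

def matmul(a, b):
    """Integer 4x4 matrix product."""
    return [[sum(a[i][k] * b[k][j] for k in range(4)) for j in range(4)]
            for i in range(4)]

def transpose(m):
    return [list(row) for row in zip(*m)]

def forward_transform(block):
    """Unscaled H.264/AVC 4x4 forward transform of a residual block."""
    return matmul(matmul(CF, block), transpose(CF))

# A flat residual block concentrates all its energy in the DC coefficient.
flat = [[3] * 4 for _ in range(4)]
coeffs = forward_transform(flat)
print(coeffs[0][0])  # DC term: 3 * 16 = 48; all other coefficients are 0
```

Since all integer arithmetic stays exact, the transform is losslessly invertible once the matching inverse transform and scaling are applied, which is what allows the standard to avoid the encoder/decoder drift of earlier floating-point DCT designs.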
Some of the above mentioned tools make the H.264/AVC coding standard a good candidate for
benefiting from the consideration of HVS perceptual aspects in the maximization of the achieved video
quality. In this Thesis, a perceptual video coder-decoder (codec) is any video coding solution somehow
exploiting the HVS characteristics to increase the codec performance. All the video coding standards
mentioned above will be further detailed in Section 2.2.
1.3. Objective
The main objective of this work is to design, implement and evaluate perceptually driven coding tools to
be integrated into a standard H.264/AVC video codec, targeting the improvement of the rate distortion
performance notably measured using advanced objective video quality metrics, ideally without
significantly increasing the encoding complexity. In this context, no formal subjective assessments are
foreseen considering the logistic complexity involved in performing this type of evaluation. The
improvement of the H.264/AVC video codec must be made through perceptually driven coding tools
based on a just noticeable distortion (JND) model, trying to only code video information which is
perceptually relevant. In this context, two perceptually driven tools will be developed, namely a DCT
coefficients pruning method and a JND adaptive quantization method. To evaluate the improvements,
notably in terms of RD performance, a set of objective quality metrics with different correlation with the
subjective assessment should be used, notably the PSNR, Multi-Scale Structural Similarity (MS-SSIM),
Video Quality Metric (VQM) and Resolving Power (RP) compensated VQM.
1.4. Thesis Organization
This Thesis reports in detail the design, implementation and assessment of perceptual video codecs
based on the H.264/AVC standard. The process is described in five chapters, including the current one,
which introduces the context, motivations, main background, objectives and structure of this Thesis.
After, Chapter 2 provides a detailed review on the perceptual video coding concepts and technologies in
the literature, starting with a review of the HVS properties, the available video coding standards and,
finally, the most relevant perceptual video coding solutions.
Next, Chapter 3 presents the first perceptually driven coding tool implemented in this work, notably a DCT
perceptual coefficients pruning mechanism targeting the elimination of non-perceptually relevant DCT
coefficients using a JND model. With this purpose, the architecture, walkthrough and metrics used are
first presented; after, the performance results are presented and analysed to derive the main conclusions
associated to the performance of the implemented method.
Chapter 4 presents another perceptually driven coding tool implemented in the context of a H.264/AVC
codec, notably a quantization mechanism introducing error according to a previously selected JND model.
As in the previous chapter, the architecture, walkthrough and relevant assessment metrics are presented
first and after the results are presented and analysed to derive the main conclusions.
Finally, Chapter 5 is reserved for the conclusions and the presentation of eventual further work.
Chapter 2
Reviewing Perceptual Video Coding Concepts and Tools
The first objective of this chapter is to briefly review the Human Visual System (HVS) structure,
associated features and properties. After, the evolution of video coding standards and recommendations
is reviewed, notably in terms of their goals and relative improvements. Finally, the most relevant
perceptual video coding solutions in the literature are described. As already mentioned in Chapter 1, in
the context of this Thesis, a perceptual video coder-decoder (codec) is any video coding solution
exploiting the HVS characteristics in some way to increase the codec performance; as it will be seen
later, there are several ways to design a perceptual video codec, depending on how the HVS features
and related tools impact and integrate in the video codec.
2.1. Brief Overview on the Human Visual System
The HVS includes not only the eye but also the part of the brain dedicated to process the visual
information, notably in terms of memory and knowledge. To understand how the human visual system
works, the anatomic features of the eye and, afterwards, the cognitive components responsible for the
perceptual vision, will be presented in the following.
2.1.1. Human Visual System Features
To better understand the anatomic features of the eye, its basic structure is presented in Figure 2.1. The
structure of the eye is formed by three layers:
• External layer, which includes two structures, notably the sclera and cornea with the functions of
eye movement and allowing the luminous rays to enter the eye, respectively;
• Middle layer, with three elements, notably the iris, ciliary muscle and choroid; they have the
function of nourishing the structures without their own irrigation, including the sensory elements of
the retina;
• Internal layer, named retina, which includes the photoreceptors (cones and rods) and the
sustentation cells, see Figure 2.2; they have the function of receiving the projected luminous rays.
Figure 2.1 – Eye structure [13]
Some of the main anatomic features of the eye are:
• Vision is constrained to the frequencies in the visible region, notably wavelengths λ ∈ [350, 780] nm;
• The photoreceptors are especially important since they transform the luminous stimuli into nervous
impulses.
o The cones are high-precision cells specialized to detect red, green, or blue light (day light
vision). They are generally located at the center of the retina in a region of high acuity, called
the fovea, see Figure 2.2;
o On the other hand, rods are very sensitive to changes in contrast, even in low light levels,
providing the black and white vision (night vision); consequently, they are imprecise in terms of
position detection, due to light scatter. Rods are generally located in the periphery of the retina,
see Figure 2.2 [14] [15] [16].
Figure 2.2 – Internal layer structure of the eye [16]
The HVS properties can be used to correct shortcomings of mathematical models used as distortion
metrics, such as the Mean Squared Error (MSE). The MSE expresses the mean squared difference
between the original and the decoded sequence. For every data point, the vertical distance
from the point to the corresponding y value on the curve fit (the error) is taken and squared [17]. In other
words, the MSE quantifies the difference between an estimator and the true value of the quantity being
estimated as presented in equation (1).
MSE(θ̂) = E[(θ̂ − θ)²] (1)
So, the MSE indicates a value of distortion; however, under certain conditions, the HVS can tolerate more
distortion than the MSE expresses. On the other hand, there are some types of distortion that the MSE
does not measure and express in the same way as they are perceived.
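To make the metric concrete, the following sketch (illustrative Python, with made-up sample values) computes the MSE between an original and a decoded signal, together with the PSNR derived from it for 8-bit samples; the function names are this sketch's own.

```python
import math

def mse(original, decoded):
    """Mean squared error between two equally sized sample sequences."""
    n = len(original)
    return sum((o - d) ** 2 for o, d in zip(original, decoded)) / n

def psnr(original, decoded, peak=255.0):
    """Peak signal-to-noise ratio in dB, assuming 8-bit samples by default."""
    m = mse(original, decoded)
    if m == 0:
        return float("inf")  # identical signals
    return 10.0 * math.log10(peak * peak / m)

# Illustrative values: a 4-sample "frame" and its distorted reconstruction.
orig = [100, 110, 120, 130]
dec = [101, 108, 120, 131]
print(mse(orig, dec))              # (1 + 4 + 0 + 1) / 4 = 1.5
print(round(psnr(orig, dec), 2))   # ≈ 46.37 dB
```

Note that the two errors of amplitude 1 and the single error of amplitude 2 contribute very differently to the sum, yet the metric is blind to where in the picture they occur, which is precisely the limitation the HVS-based metrics discussed in this Thesis try to address.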
Some of the most recently explored HVS properties are:
• Texture masking: HVS is less sensitive to details (distortion) in the image areas with intense
texture. This means that more noise can be tolerated in textured areas; see the examples in Figure
2.3(a) and (b), which show the same image, one with random noise added and the other with the noise
power weighted by the texture activity of each block.
• Intensity contrast masking: Intensity contrast regards the difference in the mean intensities of the
background and object, thus characterizing the intensity difference between the object and
background [18], in other words, for example the difference in brightness between the light and dark
areas of a picture. For lower/medium contrast areas, more noise can be hidden in the darker areas.
In the examples in Figure 2.4(a) and (b), the noise is first randomly added and after the noise power
is weighted based on the intensity contrast for each MB. It is important to notice that the visual
perception is sensitive to the luminance contrast but not to the absolute value of the luminance, as it
can be seen in Figure 2.6, where it is shown that the background luminance makes the brightness of
the same object appear differently [19].
• Spatial frequency sensitivity: The HVS acts as a band-pass filter (BPF) with a peak at around four
cycles per degree of visual angle and declining very fast; the spatial frequency regards the number
of bright plus dark bars per centimeter on the screen or per degree of visual angle. This HVS feature
allows hiding more noise in the areas with higher spatial frequencies. This is the main concept
behind the Contrast Sensitivity Function (CSF) used in Joint Photographic Experts Group (JPEG)
2000 [20]. Figure 2.5 shows the variation of the visual sensitivity as a function of the spatial
frequency.
• Preservation of object boundaries: The HVS is very sensitive to unpreserved object edges in a
scene. Usually, a bad selection of motion vector or an inappropriate selection of the coding mode is
the main cause for the edge-misalignment of a solid object in a scene. This type of distortion is more
likely to happen at very low bitrates [21].
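The texture masking effect illustrated in Figure 2.3 can be sketched in code: using the local variance of a block as a crude texture-activity measure, the noise power injected into each block is scaled by its relative activity, so busy blocks receive more noise than flat ones. This is an illustrative sketch only; the activity measure and all names are assumptions of this sketch, not a method from the literature reviewed here.

```python
import random

def block_variance(block):
    """Sample variance of a block, used here as a crude texture-activity measure."""
    n = len(block)
    mean = sum(block) / n
    return sum((x - mean) ** 2 for x in block) / n

def add_masked_noise(blocks, base_sigma=2.0, seed=0):
    """Scale the noise power of each block by its relative texture activity,
    so textured blocks receive more noise than flat ones (texture masking)."""
    rng = random.Random(seed)
    activities = [block_variance(b) for b in blocks]
    max_act = max(activities) or 1.0
    noisy = []
    for block, act in zip(blocks, activities):
        sigma = base_sigma * (act / max_act)  # flat block -> (almost) no noise
        noisy.append([x + rng.gauss(0.0, sigma) for x in block])
    return noisy

flat = [128.0] * 16       # no texture: variance 0, so no noise is added
busy = [0.0, 255.0] * 8   # strong texture: full noise power
out_flat, out_busy = add_masked_noise([flat, busy])
print(out_flat == flat)   # True: zero activity means zero noise power
```

A perceptual codec exploits the same principle in reverse: rather than adding noise, it allocates coarser quantization (i.e. larger coding error) to the blocks where that error is masked.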
(a)
(b)
Figure 2.3 – For the same image: (a) The noise is added randomly, (b) The noise power is weighted on the texture activity of each block [21]
(a)
(b)
Figure 2.4 – For the same image: (a) The noise is added randomly, (b) The noise power is weighted based on the intensity contrast of each MB [21]
Figure 2.5 – Variation of the contrast sensitivity function with the spatial frequency [19]
Simultaneous contrast consists in the fact that the perceived brightness depends not only on the object
intensity but also on the background intensity. An example is shown in Figure 2.6: although the small
squares have exactly the same intensity, they appear progressively darker as the background becomes lighter.
Figure 2.6 – Simultaneous contrast: The two smaller squares have equal luminance although the right one appears brighter [19]
2.1.2. Perceptual Models
A model is a representation of a system that allows for investigation of the properties and, in some cases,
prediction of future outcomes. Accordingly, a perceptual model is a model that represents the perceptual
features of the HVS.
First, it is important to know that there exist several types of visual perception, notably: the perception of
shapes (e.g. faces and associated emotions), spatial relations (e.g. depth, orientation and movement),
colours and luminance.
The study of visual perception began with Helmholtz, who defended the unconscious inference, i.e.,
vision is the result of making assumptions and drawing conclusions from incomplete data, based on previous
experiences [22]. Afterwards, the Gestalt theory focused on the understanding of visual components as a
collection or an organized whole. In this theory, there are six main factors which determine how humans
group things in agreement with the visual perception, notably proximity, similarity, closure, symmetry,
common fate and continuity [22]. Figure 2.7 (a) represents the proximity factor, with the distance between
objects defining the groups: objects closer to each other are perceived as groups, and independent from
objects further apart. Figure 2.7 (b) shows the similarity factor effect, since objects that are similar in
shape or size or colour are interpreted as a group. Figure 2.7 (c) represents the closure factor, that is, the
brain adds components that are missing to interpret a partial object as a whole. Finally, Figure 2.7 (d)
represents the symmetry factor, with the symmetric objects more easily grouped in collections than the
non-symmetric objects.
Figure 2.7 – Perceptual factors: (a) Proximity; (b) Similarity; (c) Closure; (d) Symmetry [22]
Visual perception can be studied through psychophysics which studies the relationship between physical
stimuli and their subjective perception, in other words, “the analysis of perceptual processes by studying
the effect on a subject’s experience or behaviour of systematically varying the properties of a stimulus
along one or more physical dimensions” [23]. An important concept is the so-called just noticeable
distortion (JND), which is a threshold defining the smallest detectable difference between a starting and a
secondary level of a particular sensory stimulus.
Visual perception is typically studied by means of psychophysical methods. Those methods have the
ambition of testing the subjects’ perception using stimulus detection and difference detection
experiments. The main psychophysical methods are the determination of thresholds of absolute and
relative detection (Weber’s law), the equalization of perceptions and the estimation of the amplitude
perception (Stevens’ law) [24].
Weber’s law relates the minimum detectable increment in the stimulus with the intensity of the stimulus,
and states that the change in a stimulus that will be just noticeable is a constant ratio of the original
stimulus. This law is represented in Figure 2.8 [19] and equation (2) with S being the intensity of the
stimulus, ∆S the minimum detectable increment and c a constant:

∆S / S = c (2)

Figure 2.8 – Weber's law [19]
Stevens’ law establishes the relation between the intensity of a stimulus and the intensity as perceived by
a human. This law, represented in equation (3), proposes a relation between the magnitude of the
stimulus intensity and the perceived intensity, where S represents the magnitude of the stimulus intensity,
P is the psychophysics function expressing the subjective magnitude invoked by the stimulus, n is an
exponent depending on the type of stimulus and K is a constant depending on the type of stimulus and
the units used [25]:

P = K·S^n (3)
This law is considered to supersede Weber’s law since it describes a wider range of sensations. In
addition, a distinction has been made between local psychophysics, where stimuli are discriminated only
with a certain probability, and global psychophysics, where stimuli would be discriminated correctly with a
near certainty. Weber’s law is generally applied in local psychophysics, whereas Stevens’ law is usually
applied in global psychophysics.
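The two laws can be contrasted numerically. The sketch below (illustrative Python; the Weber fraction c and the Stevens constants K and n are arbitrary values chosen for the example, since both depend on the type of stimulus) computes the just noticeable increment predicted by equation (2) and the perceived magnitude predicted by equation (3).

```python
def weber_jnd(stimulus, c=0.02):
    """Weber's law: the just noticeable increment ΔS grows linearly with the
    stimulus intensity S, since ΔS / S = c for a constant Weber fraction c."""
    return c * stimulus

def stevens_perceived(stimulus, k=1.0, n=0.33):
    """Stevens' power law P = K * S^n; the exponent n and constant K depend on
    the type of stimulus (the values used here are arbitrary illustrations)."""
    return k * stimulus ** n

# Weber: doubling the stimulus doubles the detectable increment.
print(weber_jnd(100.0))               # 2.0
print(weber_jnd(200.0))               # 4.0
# Stevens with n < 1: perceived intensity grows slower than the stimulus.
print(stevens_perceived(8.0, n=1/3))  # ≈ 2.0 (cube root of 8)
```

The JND-based coding tools developed in Chapters 3 and 4 rest on exactly this kind of threshold: distortion kept below the just noticeable increment should, in principle, remain invisible.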
2.2. Brief Overview on Video Coding Standards
Since the beginning of the digital video era, several video coding standards have been developed; many
of them are referenced in Figure 2.9 in chronological order. These standards come from two main
standardization institutions: the H.xxx recommendations are developed by the Video Coding Experts
Group (VCEG) from the International Telecommunications Union (ITU), more precisely the ITU
Telecommunication standardization sector (ITU-T) while the MPEG standards are developed by the
Motion Picture Experts Group (MPEG) from the International Standards Organisation/ International
Electrotechnical Commission (ISO/IEC).
Figure 2.9 – Chronology of the video recommendations/standards developed by ITU-T VCEG and ISO/IEC MPEG
[11]
In the following, a brief description of each video coding recommendation/standard is provided.
ITU-T H.261 Recommendation
• Objectives
Digital video compression started, in practice, with this ITU-T recommendation [3], issued in 1990. This
recommendation was a pioneering effort and used a hybrid video coding scheme [2]. The H.261 codec
was designed to be used for bidirectional or unidirectional visual communications at data rates multiples
of 64 kbit/s, in the range of [40, 2000] kbit/s, targeting synchronous Integrated Services Digital Network
(ISDN) channels.
• Main target applications
The main target applications are personal communications, notably videotelephony and
videoconference. This recommendation supports two video frame sizes: Common Intermediate Format
(CIF) (352x288 pixels) and Quarter Common Intermediate Format (QCIF) (176x144 pixels) using a
4:2:0 chrominance subsampling scheme (this means the chrominance is sampled with half the samples
in each direction) and non-interlaced pictures, occurring at approximately 29.97 frames per second.
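The 4:2:0 subsampling arithmetic just mentioned can be sketched as follows (illustrative Python; the function name is this sketch's own): chrominance is kept at half resolution in each direction, so each chroma plane holds a quarter of the luminance samples.

```python
def yuv420_sample_counts(width, height):
    """Return (luma, cb, cr) sample counts for a 4:2:0 picture: each chroma
    component is subsampled by 2 both horizontally and vertically."""
    luma = width * height
    chroma = (width // 2) * (height // 2)
    return luma, chroma, chroma

# CIF picture as used by H.261: 352x288 luminance samples.
y, cb, cr = yuv420_sample_counts(352, 288)
print(y)                          # 101376 luminance samples
print(cb, cr)                     # 25344 25344 chrominance samples each
print(y + cb + cr == y * 3 // 2)  # True: 1.5 samples per pixel overall
```

The overall rate of 1.5 samples per pixel (versus 3 for a full-resolution colour representation) is why 4:2:0 halves the raw data volume before any compression tool is even applied.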
• Main coding tools in addition to previous standards
Since this has been the first relevant video coding standard, it has established a reference in terms of
the video coding tools set, notably:
o Hybrid video coding algorithm based on translational motion compensated prediction to exploit
the temporal redundancy, DCT coding with quantization to exploit the spatial redundancy and
visual irrelevance and, finally, entropy coding to exploit the statistical redundancy;
o Huffman coding as a variable length entropy coding tool;
o Motion estimation (ME) and motion compensation (MC) for 16×16 luminance macroblocks (MB)
with full-pixel precision;
o DCT coefficients coded as a zig-zag scanned set of (run, level) pairs which are entropy coded
with 2D Huffman variable length coding (VLC) [26] [8].
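The run-level idea can be sketched as follows (illustrative Python; H.261 applies it to 8x8 blocks, while a 4x4 block is used here for brevity, and the helper names are this sketch's own): the quantized coefficients are scanned in zig-zag order, and each nonzero coefficient (level) is paired with the count of zero coefficients (run) preceding it.

```python
def zigzag_order(n):
    """Zig-zag scan order for an n x n block: traverse anti-diagonals,
    alternating direction, starting to the right of the DC position."""
    order = []
    for s in range(2 * n - 1):
        rng = range(s + 1)
        if s % 2 == 0:
            rng = reversed(rng)  # even diagonals run bottom-left to top-right
        for i in rng:
            j = s - i
            if i < n and j < n:
                order.append((i, j))
    return order

def run_level_pairs(block):
    """Encode a block as (run, level) pairs: each nonzero coefficient (level)
    is preceded by the count of zero coefficients (run) skipped before it."""
    pairs, run = [], 0
    for i, j in zigzag_order(len(block)):
        coeff = block[i][j]
        if coeff == 0:
            run += 1
        else:
            pairs.append((run, coeff))
            run = 0
    return pairs

block = [
    [9, 2, 0, 0],
    [1, 0, 0, 0],
    [0, 0, 0, 0],
    [0, 0, 0, 1],
]
print(run_level_pairs(block))  # [(0, 9), (0, 2), (0, 1), (12, 1)]
```

Since quantization concentrates the nonzero coefficients near the low-frequency (top-left) corner, the zig-zag scan produces long zero runs that the (run, level) pairs summarize compactly before entropy coding.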
MPEG-1 Video standard
• Objectives
First MPEG video coding standard which was designed to compress Video Home System (VHS) quality
raw digital video down to 1.5 Mbit/s (video with associated audio) without excessive quality loss; this
standard has a quantization matrix for Intra MBs, that is frequency dependent, thus exploiting
perceptual visual features. With this tool, information in certain frequencies and areas of the picture that
the human eye has limited ability to fully perceive is reduced or completely discarded [5].
• Main target applications
The main target application is compact disc (CD) storage with the following important functionalities:
o Random access, the possibility to access any part of the audiovisual data in a limited amount of
time;
o Fast forward/reverse search, faster play (with time compression) in the usual and opposite time
directions;
o Reverse playback, playing at regular speed against the usual temporal direction;
MPEG-1 videos are most commonly seen using Source Input Format (SIF) resolutions of 352x240,
352x288, or 320x240, but the standard also supports CIF and higher spatial resolutions. The bitrate is
typically less than 1.5 Mbit/s; normally, 1.25 Mbit/s are used for the video data.
• Main coding tools in addition to previous standards
o Quantization weighting matrix is a string of 64 values (0-255) telling the encoder the relative
importance of each spatial frequency, considering the human visual system; each value in the
matrix corresponds to a certain frequency component of each 8×8 block;
o ME with half pixel precision;
o Bidirectional ME, meaning that each MB may have one forward vector and/or one backward
vector with half pixel accuracy.
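The quantization weighting matrix mechanism described above can be sketched as follows. This is an illustrative Python sketch only: the weight values below are made up for the example (they are not the MPEG-1 default matrix), a 4x4 block is used instead of MPEG-1's 8x8, and the convention of treating a weight of 16 as neutral is an assumption of this sketch; the principle shown is that each DCT coefficient is divided by the quantizer scale times its frequency-dependent weight, so high-frequency coefficients are quantized more coarsely.

```python
def quantize(coeffs, weights, qscale):
    """Frequency-weighted quantization: divide each DCT coefficient by the
    quantizer scale times its weight, so larger weights (assigned to the
    frequencies the eye resolves poorly) discard more information."""
    n = len(coeffs)
    return [[round(coeffs[i][j] / (qscale * weights[i][j] / 16.0))
             for j in range(n)] for i in range(n)]

# Made-up 4x4 weighting matrix (for illustration only; MPEG-1 uses 8x8):
# weights grow towards the high-frequency (bottom-right) corner.
WEIGHTS = [
    [ 8, 16, 24, 32],
    [16, 24, 32, 40],
    [24, 32, 40, 48],
    [32, 40, 48, 56],
]

coeffs = [[400, 200, 100, 50],
          [200, 100,  50, 25],
          [100,  50,  25, 12],
          [ 50,  25,  12,  6]]
q = quantize(coeffs, WEIGHTS, qscale=8)
print(q[0][0])  # DC kept with fine granularity: round(400 / 4) = 100
print(q[3][3])  # high frequency coarsely quantized: round(6 / 28) = 0
```

The highest-frequency coefficient is rounded away entirely, which is exactly the perceptually motivated information discarding described in the text.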
H.262/MPEG-2 Video standard
• Objectives
In 1994, the ITU-T and ISO/IEC joined efforts to create the H.262/MPEG-2 Video standard with the goal
of defining a coding syntax suitable for interlaced video at higher resolutions and rates than MPEG-1
Video (up to 40 Mbit/s) [7].
• Main target applications
This standard is oriented towards digital television (TV) and allows resolutions such as SIF, CIF, QCIF,
National Television System(s) Committee (NTSC) - 720x480 and Phase Alternating Line (PAL) -
720x576 using a 4:2:0 chrominance subsampling scheme. The recommended bitrate is in the [3, 5]
Mbit/s interval for applications like broadcasting (to the users), and in the [8, 10] Mbit/s interval for
contribution, e.g. transmission between studios.
Profiles and levels are new concepts providing a trade-off between implementation complexity for a
certain class of applications and interoperability between applications while guaranteeing the necessary
compression efficiency capability required by the class of applications in question and limiting the codec
complexity and associated costs. A profile is a subset of coding tools able to provide a certain level of
compression efficiency for a certain complexity; on the other hand, a level establishes some relevant
parameters, notably the bitrate and the total amount of pixels/image. There are four levels: low level (up
to 360x288 for the luminance); main level (up to 720x576 for the luminance); high-1440 level (up to
1440x1152 for the luminance); high level (up to 1920x1152 for the luminance), and six profiles: simple
profile; main profile; 4:2:2 profile; Signal to Noise Ratio (SNR) profile; spatial profile; and high profile.
Several types of compatibility may be provided, which can be separated into two groups, notably:
1. Compatibility between different resolutions of encoders and decoders.
o Upward compatibility means the decoder can decode the pictures generated by a lower
resolution encoder;
o Downward compatibility implies that a decoder can decode the pictures generated by a higher
resolution encoder.
2. Compatibility between different codecs.
o Forward compatibility allows a MPEG-2 Video decoder to be able to decode a coded bitstream
compliant with a previous available standard, e.g. MPEG-1 Video;
o Backward compatibility allows a decoder compliant with a previously available standard to be
able to, totally or partially, decode in a useful way a bitstream compliant with MPEG-2 Video.
• Main coding tools in addition to previous standards
o Scalable coding, this concept refers to the possibility of obtaining a useful reproduction of a
compressed video signal, decoding just a part of the full compressed information (bitstream).
There are four main types of scalability in the MPEG-2 Video standard:
▪ Spatial scalability, corresponding to different spatial resolutions;
▪ Fidelity, SNR or quality scalability, corresponding to different video qualities (e.g. SNR);
▪ Temporal scalability, corresponding to different frame rates;
▪ Frequency scalability, corresponding to different quantized transform coefficients for each
block;
o Interlaced video coding can choose two image structures, notably frame and field coding. The
frame structure just divides the picture into MBs (frame-pictures). The field structure considers
two fields, top and bottom fields, which are interleaved. The top field consists in the odd lines,
while the bottom field consists in the even lines. So the field structure is composed of MBs just
from the top field and MBs just from the bottom field (field-pictures). Since the coded pictures
are classified as frame-picture or field-pictures, the main prediction modes are:
▪ Frame mode for frame-pictures;
▪ Field mode for field-pictures;
▪ Field mode for frame-pictures;
▪ 16x8 MC for field-pictures.
These modes will determine the MBs that may be used to predict the current MB. For example, in the
case of field mode for P-field-pictures, the prediction MBs may come from either of the two most recently
coded I- or P-fields: the top field may take its prediction MBs from either the top field or the bottom field
of the reference field picture, while the bottom field may take its prediction MBs from either the top field of
the P-field-picture being coded or the bottom field of the reference field picture.
ITU H.263 Recommendation
• Objectives
The main goal of this recommendation is the optimization of video quality for low bitrates, notably down
to 28.8 kbit/s to provide significantly better picture quality than the already existing ITU-T
recommendation H.261 [8].
• Main target applications
Although H.263 is, in principle, network-independent and can be used for a very large range of networks
and applications, its main target application is visual telephony in the Public Switched Telephone
Network (PSTN) and mobile networks; in fact, its target networks are low-bitrate networks, like the
General Switched Telephone Network (GSTN), ISDN, and wireless networks. This recommendation
supports the following resolutions: sub-QCIF, QCIF, CIF, 4CIF and 16CIF [8].
• Main coding tools in addition to previous standards
The main improvements are the motion compensation with half pixel precision and the four
optional/negotiable options: unrestricted motion vectors; syntax-based arithmetic coding; advance
prediction; and forward and backward frame prediction. Other tools added are:
o 3D VLC coding to improve the coding efficiency of the DCT coefficients;
o More efficient coding of MB and block signalling overhead such as the information of which
blocks are coded and the information on the quantization step size changes;
o Median prediction for motion vectors to have improved coding efficiency and error resilience;
o Six optional algorithmic coding modes (five of these six modes are not found in H.261):
� Allows sending multiple video streams within a single video channel;
� Extended range of motion vectors values for more efficient performance with high
resolutions and large amounts of motion;
� Arithmetic coding to provide higher coding efficiency;
� Variable block-size motion compensation and overlapped-block motion compensation for
higher coding efficiency and reduced blocking artefacts;
� Representation of pairs of pictures as a single unit for a low-overhead form of bidirectional
prediction .
15
MPEG-4 Visual standard
• Objectives
The MPEG-4 Visual standard (MPEG-4 Part 2), from 1998, had as main goal not only to provide
increased compression efficiency but also to specify an object-based representation framework with
high flexibility and interaction capabilities .
• Main target applications
Application areas range from digital television, streaming video, to mobile multimedia and games. This
video standard is designed for a large range of bitrates and spatial resolutions.
• Main coding tools in addition to previous standards
Highly flexible toolkit of coding techniques making it possible to deal with a wide range of visual data,
including rectangular frames, arbitrarily shaped video objects, still images and hybrids of natural and
synthetic visual information. Those tools are clustered in profiles, since it is unlikely that all applications
would require all the tools available in the MPEG-4 Visual coding framework, notably considering this
implies a substantial complexity. Some of the key features and tools are:
o Efficient compression of progressive and interlaced “rectangular” video sequences; the core
compression tools are based on the ITU-T H.263 recommendation and can outperform MPEG-1
and MPEG-2 Video compression.
o Coding of arbitrarily shaped video objects which are part of a video scene; this is a new concept for
standard-based video coding and enables the independent coding of foreground and
background objects in a video scene.
o Support for effective transmission over practical networks through error resilience tools which
may help a decoder to recover from errors and maintain a successful video connection in error-
prone network environments.
o Scalable coding tools that can help to support flexible transmission at a range of bitrates.
o Coding of still “textures”, this meaning image data; for example, still images can be coded and
transmitted within the same framework of moving video sequences; texture coding tools may
also be used in conjunction with animation-based rendering.
o Coding of animated visual objects such as 2D and 3D polygonal meshes, animated faces and
animated human bodies.
o Coding for specialist applications such as “studio” quality video; in this type of application, visual
quality is perhaps more important than high compression [9].
H.264/AVC (Advanced Video Coding) standard
• Objectives
This standard was jointly published as Part 10 of the MPEG-4 standard and H.264 recommendation by
ITU-T. The main goals of the H.264/AVC standardization effort have been enhanced compression
performance (around 50% bitrate savings for the same quality relative to previous standards) and the
provision of a “network-friendly” video representation [10].
• Main target applications
The target applications are conversational (videotelephony) and non-conversational (storage,
broadcast, or streaming). Bitrates used range from 50 kbps to 8 Mbps, depending on the application.
The resolution also depends on the application and may range from sub-QCIF(128x96), QCIF(176x144)
or CIF(352x288) to 4CIF(704x576), 640x480, 1280x720 and 1920x1080. For example, 4CIF is
appropriate for standard-definition television and DVD-video; CIF and QCIF are popular for
videoconferencing applications; QCIF and sub-QCIF are appropriate for mobile multimedia
applications where the display resolution and the bitrate are limited.
• Main coding tools in addition to previous standards
Input signal
o Flexible interlaced-scan video coding features;
o Support of monochrome, 4:2:0, 4:2:2 (this means that the chrominance is sampled with half
the columns and all the rows of the luminance), and 4:4:4 chroma subsampling (this means
that the chrominance is sampled with all the columns and all the rows of the luminance).
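The chroma plane dimensions implied by each supported sampling format can be sketched as follows; this is a minimal illustration of the column/row relations described above, not part of any standard API.

```python
# Chroma plane dimensions implied by each chroma sampling format,
# for a given luma resolution.

def chroma_dims(width, height, fmt):
    """Return (chroma_width, chroma_height) for a luma plane of width x height."""
    if fmt == "4:2:0":      # half the columns, half the rows of the luminance
        return width // 2, height // 2
    if fmt == "4:2:2":      # half the columns, all the rows of the luminance
        return width // 2, height
    if fmt == "4:4:4":      # all the columns, all the rows of the luminance
        return width, height
    if fmt == "monochrome": # no chroma planes at all
        return 0, 0
    raise ValueError(f"unknown format: {fmt}")

# For a CIF (352x288) luma plane:
print(chroma_dims(352, 288, "4:2:0"))  # (176, 144)
print(chroma_dims(352, 288, "4:2:2"))  # (176, 288)
```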
Prediction design improvement
o Variable block-size MC with small block sizes (the minimum block size is 4x4);
o Quarter-sample accurate MC, that is, quarter-sample motion vector accuracy;
o Motion vectors over picture boundaries;
o Multiple reference pictures motion compensation;
o Decoupling of referencing order from display order: the encoder is allowed to choose the
ordering of pictures for referencing and display purposes with a high degree of flexibility
constrained only by a total memory capacity bound imposed to ensure decoding ability;
o Decoupling of picture representation methods from picture referencing capability, that is, the
encoder is provided with more flexibility and, in many cases, an ability to use a picture for
referencing that is a closer approximation to the picture being encoded;
o Weighted prediction, the motion-compensated prediction signal is weighted and offset by
amounts specified by the encoder;
o Improved “skipped” and “direct” motion inference: in previous standards, the “skipped” coding
mode was only used for static MBs; now, if several MBs have the same motion vector, the first
is coded as previously, but the others are considered skipped with inference of motion from
the first;
o Directional spatial prediction for intra coding: this is a new technique of extrapolating the
edges of the previously-decoded parts of the current picture, applied in regions of pictures
that are coded as intra;
o In-the-loop deblocking filtering, which is included in the MC prediction loop;
Quantization design improvement
o Logarithmic step size control;
o Frequency-customized quantization scaling matrices selected by the encoder for perceptual-
based quantization optimization;
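The logarithmic step size control above means the quantization step size doubles for every increase of 6 in the quantization parameter (QP); a minimal sketch of this mapping, where the six base values are the standard H.264/AVC step sizes for QP 0 to 5:

```python
# Logarithmic QP-to-step-size mapping of H.264/AVC: the quantization step
# size doubles for every increase of 6 in the quantization parameter.
# The six base values are the standard Qstep values for QP 0..5.

QSTEP_BASE = [0.625, 0.6875, 0.8125, 0.875, 1.0, 1.125]

def qstep(qp):
    """Quantization step size for a QP in [0, 51]."""
    assert 0 <= qp <= 51
    return QSTEP_BASE[qp % 6] * (1 << (qp // 6))

print(qstep(4))   # 1.0
print(qstep(10))  # 2.0  (six QP steps above 4 -> step size doubled)
print(qstep(51))  # 224.0 (the largest step size)
```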
Coding efficiency improvement
o Small block-size transform, notably 4x4 instead of 8x8; this allows the encoder to represent
signals in a more locally-adaptive fashion;
o Hierarchical block transform;
o Short word-length transform;
o Exact-match transform;
o Arithmetic entropy coding;
o Context-adaptive entropy coding:
� Context-Adaptive Binary Arithmetic Coding (CABAC) for the quantized transform
coefficients;
� Context-Adaptive Variable-Length Coding (CAVLC);
� Exponential-Golomb coding;
Robustness to data errors/losses and flexibility for operation over a variety of network
environments
o Parameter set structure, that is, “global” parameters for a sequence such as picture
dimensions, video format and macroblock allocation map;
o Network Abstraction Layer (NAL) unit syntax structure, which allows greater customization of
the method of carrying the video content in a manner appropriate for each specific network;
o Flexible MB Ordering (FMO): allows the partition of the picture into regions called slice
groups, with each slice becoming an independently decodable subset of a slice group. A
slice group is a subset of the MBs in a coded picture and may contain one or more slices [9];
o Data Partitioning (DP): H.264/AVC allows the syntax of each slice to be separated into up to
three different partitions for transmission, depending on a categorization of syntax elements;
o Redundant pictures allow an encoder to send redundant representations of regions of
pictures, enabling an additional representation of regions of pictures for which the primary
representation has been lost during data transmission;
o Frame numbering allows the creation of “sub-sequences”, enabling temporal scalability by the
optional inclusion of extra pictures between other pictures, and the detection of missing
pictures, which can occur due to network packet losses or channel errors;
o Flexible slice size, allowing the encoder to trade error resilience against the coding efficiency
loss caused by increased header data and decreased prediction effectiveness;
o Arbitrary slice ordering;
o Instead of I-, P-, B-pictures, the H.264/AVC supports I-, P-, B-slices;
o SP/SI synchronization switching slices:
� Switching P (SP) slice allows efficient switching between different pre-coded pictures; in
other words, SP slice is an inter-coded slice used for switching between coded bitstreams;
� Switching I (SI) slice allows an exact match of a MB in an SP slice for random access and
error recovery purposes; SI slice is an intra-coded slice used for switching between coded
bitstreams [12] [9].
In the following section, perceptual video codecs using the H.264/AVC video coding standard as
background video codec to include HVS related novel coding tools will be presented. These tools may be
integrated still keeping the compatibility with the H.264/AVC standard (coded stream and decoder) or
eventually losing this compatibility, which may be a drawback in terms of market deployment.
2.3. Reviewing Perceptual Video Coding Solutions
This section intends to present a brief review on the most relevant perceptual video coding solutions in
the literature since this Thesis targets this type of video compression approach. The main selection
criteria for the coding solutions to be reviewed in the following are their compression performance as well
as the novel way they integrate HVS features and associated tools in the video codec. Reviewing these
solutions is fundamental to have a basic understanding of the state-of-the-art on perceptual video coding
and, thus, to better decide which coding path to follow after in this Thesis.
2.3.1. H.264/AVC Perceptual Video Coding based on a Foveated JND
Model
Basic Approach
In [27] and [28], Chen and Guillemot propose a H.264/AVC perceptual video coding solution based on a
Foveated Just Noticeable Distortion (FJND) model [27]. The foveated model is developed to further
exploit the perceptual redundancy, focusing on the fact that the visual acuity decreases with an increased
eccentricity; this means the visibility threshold increases with the pixel distance from the fixation point. The
eccentricity e, defined as the fraction of the distance along the semi-major axis at which the focus lies, is
illustrated in Figure 2.10 and expressed in equation (4), where c is the distance between the center of the
ellipse and the focus and a is the size of the semi-major ellipse axis; the focus corresponds to the fovea.

e = c / a (4)
Figure 2.10 – Ellipse [29]
In this solution, bit allocation and rate distortion optimization (RDO) algorithms are proposed based on the
foveated JND model with the target to achieve better visual quality for the same rate. In summary:
• Basic idea: use a foveation JND model to further exploit the perceptual redundancy by controlling
the H.264/AVC Quantization Parameter (QP).
• Target: achieve a better visual quality for the same rate.
• HVS property explored: the visual acuity decreases with the increment of the eccentricity.
Architecture and Walkthrough
The solution proposed in this section adopts an improved H.264/AVC coding architecture including the
proposed MB quantization adjustment and a RDO solution based on the adopted FJND model. The
architecture of the proposed solution is presented in Figure 2.11 and the modules changed are the coder
control, which now includes the FJND model, and the quantization.
In the quantization module, the proposed change refers to the determination of the QP for each MB,
which is optimized based on the FJND information. This information is given by the FJND model
integrated in the coder control module. The FJND model includes several JND components, notably the
spatial JND (SJND), the temporal JND (TJND) and the foveation model. The coder control module uses a
Lagrange multiplier for the rate distortion optimization which is adapted so the MB distortion is lower than
the noticeable distortion threshold for each MB. It is assumed that the noticeable distortion is equal for all
MBs in one video frame; it is computed as in equation (5), where D is the noticeable distortion for a MB, ω
denotes the noticeable distortion weight, Q is the quantizer and α is a constant.

D = ω · Q^α (5)
Figure 2.11 – Architecture of the H.264/AVC Perceptual Video Coding based on a Foveated JND Model solution
Forward/Encoding Path
1. MB Division: First, the input video frame is divided in MBs;
2. Motion Estimation: For each MB, the motion vectors best describing the motion/translation from
one image to another are determined, usually using the adjacent frames in the video sequence. To
choose the best motion vector for each MB, a distortion metric, in this case the Sum of Absolute
Differences (SAD), is typically used;
3. MB Prediction: Each MB is encoded using INTRA (spatial) or INTER (temporal) prediction:
a. If INTRA coding is used, a prediction (PRED) is formed using the already decoded samples in
the current slice, this means the samples that have already been encoded and decoded;
b. If INTER coding is used, a prediction (PRED) is formed by motion compensation from 1 or 2
pictures selected from the available reference frames (in the so-called list 0 and/or list 1);
4. Residual Computation: The prediction is subtracted from the current original MB to produce a
residual MB;
5. Transformation: The blocks in the residual MB are transformed by the H.264/AVC Integer DCT
(ICT);
6. FJND model coder control: The FJND thresholds are computed for each pixel, considering three
components:
a. SJND – refers to the luminance contrast and spatial masking effect;
b. TJND – refers to large temporal masking effects resulting from large inter-frame differences;
c. Foveation model – refers to the contrast sensitivity as a function of eccentricity which is
computed based on the foveated weighting model and the background luminance function;
7. Quantization: The ICT coefficients are quantized, controlled by the FJND model:
a. The MB quantization is determined taking into account the relation between the reference
noticeable distortion weight (ωr = 1), the noticeable distortion weight (ωi) and a reference
quantizer (Qr) determined by the frame-level rate control, as presented in equation (6);

Qi = (ωr / ωi)^(1/α) · Qr (6)

The noticeable distortion weight for MB i is calculated by equation (7), where a = 0.7, b = 0.6,
m = 0, n = 1, c = 4, si is the average FJND of MB i, and s* is the average FJND of the frame:

ωi = a + b · (1 + m · exp(−c (si − s*) / s*)) / (1 + n · exp(−c (si − s*) / s*)) (7)
8. Entropy coding: The quantized coefficients are entropy coded with CABAC.
Decoding Path (also within the encoder)
1. Scaling & Inv. Transform: The quantized ICT coefficients are scaled (Q^-1) and inverse transformed
(T^-1) to produce the residual block D'n;
2. Motion Compensation and Reconstruction: The motion compensated prediction block PRED is
added to D’n to create a reconstructed block;
3. Deblocking Filter: A deblocking filter is applied to the previously reconstructed blocks to reduce the
blocking effects and the decoded picture is created from a series of filtered blocks F’n.
Main Coding Tools
The FJND model is a combination of three elementary models, notably the SJND, TJND, and foveation
models.
• Spatial Just-Noticeable-Distortion Model
The perceptual redundancy in the spatial domain is mainly based on the HVS sensitivity to the
luminance contrast and spatial masking effect which is expressed by equation (8) where f1 and f2 are
functions to estimate the spatial masking and luminance contrast and bg and mg are the average
background luminance and the maximum weighted average of luminance differences, respectively.
SJND(x, y) = max{ f1(bg(x, y), mg(x, y)), f2(bg(x, y)) } (8)
• Temporal Just-Noticeable-Distortion Model
Usually, large inter-frame differences result in larger temporal masking effects. The TJND is defined
as in equation (9), where τ = 0.8 and ∆(x, y, t) denotes the average inter-frame luminance difference
between frame t and the previous frame t − 1.

TJND(x, y, t) = max( τ/2, 8 · exp(−0.15/256 · (∆(x, y, t) + 255)) + τ ), if ∆(x, y, t) ≤ 0
TJND(x, y, t) = max( τ/2, 3.2 · exp(−0.15/256 · (255 − ∆(x, y, t))) + τ ), if ∆(x, y, t) > 0 (9)
• Foveation Model
An analytical model is proposed to measure the contrast sensitivity as a function of the eccentricity,
where the contrast sensitivity function CS(f, e) is defined as the reciprocal of the contrast threshold
CT(f, e). The foveation JND model takes into account the foveated weighting model and the background
luminance function, as presented in equation (10).

F(x, y, v, e) = CT(fm(x, y), v, e) (10)
The foveated JND model is the combination of the three models presented.
• Foveated Just-Noticeable-Distortion Model
The foveated JND is defined as in equation (11).

FJND(x, y, t, v, e) = SJND(x, y) · TJND(x, y, t) · F(x, y, v, e) (11)
When there are multiple fixation points, and since a fixed v is assumed to calculate F(x, y, v, e) for
each pixel, F(x, y, v, e) can be calculated by only considering the closest fixation point, which results
in the smallest e and the minimum F(x, y, v, e). So, the FJND with multiple fixation points is defined as
in equation (12).

F(x, y, v, e) = min over i ∈ {1, …, K} of Fi(x, y, v, e) (12)
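The combination of the three models in equations (11) and (12) can be sketched as follows; the component models are stubbed out with hypothetical threshold values, so only the combination logic is shown.

```python
# Per-pixel FJND threshold as the product of the spatial JND, temporal JND
# and foveation weight (equation (11)); multiple fixation points are handled
# by keeping the minimum foveation weight (equation (12)).

def fjnd(sjnd, tjnd, foveation_weights):
    """foveation_weights: one F(x, y, v, e) value per fixation point."""
    f = min(foveation_weights)  # closest fixation point -> smallest e -> min F
    return sjnd * tjnd * f

# Hypothetical thresholds for one pixel with two fixation points:
print(fjnd(3.0, 1.2, [1.5, 1.1]))  # 3.0 * 1.2 * 1.1 = approx. 3.96
```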
The quantization step is based on the JND thresholds of the FJND model: the noticeable distortion
weights of the reference and of the current MB are determined and the relation between them is then
used to compute the new quantization step, as shown in equation (6).
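A minimal sketch of this MB-level quantizer adaptation around equations (5)–(7); the weight formula follows the a, b, m, n, c parameters stated in the text, while the exponent α = 2 used in the example is an illustrative assumption, since only "a constant" is stated.

```python
import math

# MB-level quantizer adaptation from the FJND model (equations (5)-(7)).
# The distortion model D = w * Q**alpha with equal D across all MBs of a
# frame gives Q_i = (w_r / w_i)**(1/alpha) * Q_r.

def distortion_weight(s_mb, s_frame, a=0.7, b=0.6, m=0.0, n=1.0, c=4.0):
    """Noticeable distortion weight of a MB (equation (7));
    s_mb / s_frame are the average FJND of the MB / frame."""
    x = math.exp(-c * (s_mb - s_frame) / s_frame)
    return a + b * (1.0 + m * x) / (1.0 + n * x)

def mb_quantizer(q_ref, w_mb, w_ref=1.0, alpha=2.0):
    """Quantizer for a MB given the frame-level reference quantizer
    (equation (6)); alpha=2 is an illustrative assumption."""
    return (w_ref / w_mb) ** (1.0 / alpha) * q_ref

# A MB whose average FJND equals the frame average gets weight
# a + b/2 = 1.0, leaving the reference quantizer unchanged:
w = distortion_weight(s_mb=5.0, s_frame=5.0)
print(w)                      # approx. 1.0
print(mb_quantizer(26.0, w))  # approx. 26.0
```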
Performance Evaluation
• Test conditions
Subjective tests were performed in a typical laboratory viewing environment with normal lighting. The
display system was a 20 inch silicon graphics cathode ray tube (20’’ SGI CRT) display with a resolution of
800x600. The viewing distance was approximately 3 times the image width. The test sequences in CIF
format were coded at bitrates of 50 kbps, 300 kbps, 500 kbps, 300 kbps and 300 kbps, for Akiyo, Stefan,
Football, Bus, and Flower, respectively. The frame rate was 30 fps.
For the subjective experiments, the double stimulus continuous quality scale (DSCQS) protocol, widely
used for quality assessment, has been used; in this method, the pair of stimuli is composed of the
original video sequence and a processed version of the original video, in this case the decoded video
from one of the two video coding solutions under evaluation. Therefore, 10 presentations were
conducted, for the five test sequences coded with the two different coding algorithms, and presented
using a random order. The duration of the two videos was 3 seconds and each pair of videos was
displayed twice with the same interval order. The Mean Opinion Score (MOS) scales for the DSCQS
protocol ranged from 0 to 100 in association to qualities from bad to excellent. A Differential Mean
Opinion Score (DMOS) is computed as the difference between the MOS of the original video and the
MOS of the reconstructed video for each presentation.
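The DMOS computation described above can be sketched as follows; the scores used in the example are hypothetical.

```python
# DMOS for the DSCQS protocol: each presentation yields a MOS (0-100) for
# the original and for the reconstructed (decoded) video; the differential
# score is their difference, so a smaller DMOS means the decoded video was
# judged closer to the original.

def dmos(mos_original, mos_reconstructed):
    return mos_original - mos_reconstructed

print(dmos(82.0, 75.5))  # 6.5
```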
• Results and Analysis
The authors first conducted visual subjective tests to evaluate the performance of the FJND model. The
FJND based video coding method was compared to the H.264/AVC Joint Model (JM) reference software,
with a confidence interval (CI) of 95%. The JM reference software has been developed by the same
standardization group who developed the standard itself, the Joint Video Team (JVT), with the purpose of
testing it; this codec implementation is typically used as a benchmark for H.264/AVC related video coding
innovations.
The results of the FJND validation test presented in Table 2.1 indicate that no noticeable distortion is
perceived. Moreover, the subjective quality results measured by DMOS are presented in Figure 2.12. The
smaller DMOS indicate that the subjective quality of the reconstructed video is closer to the original video;
so the proposed model is better than the JM in all sequences, meaning that better subjective quality is
obtained for the same bitrate.
Table 2.1 – Results of the FJND validation tests [27]

Sequence    PSNRSTJND (dB)    PSNRFJND (dB)    ∆PSNR (dB)
Akiyo       37.16             35.55            -1.61
Stefan      35.43             33.88            -1.63
Football    36.17             35.01            -1.16
Bus         33.70             32.44            -1.34
Flower      34.78             33.14            -1.64

Figure 2.12 – DMOS comparisons for the H.264/AVC based coding solutions [27]

The quality of the decoded video using the FJND based coding method is better than the video quality
resulting from the H.264/AVC JM reference software as shown in Figure 2.13; for example, the textured
regions can tolerate higher distortion and, thus, may be more coarsely coded, see Figure 2.13 (c) and (g).
Non-textured regions should not be too coarsely coded since the distortion in smooth regions will be more
easily perceived; higher distortion in such regions is annoying and degrades the subjective quality, as
shown in Figure 2.13 (d).

Figure 2.13 – Portions of decoded frames for the test sequence Stefan: Stefan frame coded with (a) JM and
(e) FJND; fixation point coded with (b) JM and (f) FJND; texture region away from the fixation point coded
with (c) JM and (g) FJND; non-fixation point coded with (d) JM and (h) FJND [27]

Strengths and Weaknesses

The strength of this solution is the use of a FJND model achieving a better subjective quality for the same
bitrate. The weakness appears when the model is used at low rates, where the distortion is likely to be
above the visibility threshold, and the FJND value is used at each pixel as a weighting factor to build
weighted squared-error metrics, because the simple weighting of the MSE may not lead to the optimal
trade-off between visual quality and rate.
2.3.2. H.264/AVC Coding with JND Model based Coefficients Filtering
Basic Approach
In [30], Mak and Ngan propose a novel approach for incorporating a DCT domain JND model in a
H.264/AVC encoder to reduce the bitrate without visual quality loss. The basic idea is to remove the
transformed coefficients that have a magnitude lower than the JND threshold.
• Basic idea: remove the transformed coefficients with a magnitude lower than the JND threshold.
• Target: reduce the bitrate without loss of visual quality.
• HVS property explored: texture masking, intensity contrast masking, spatial frequency sensitivity and
preservation of object boundaries.
Architecture and Walkthrough
The proposed encoder changes regard the transformation, quantization and coder control modules of
the H.264/AVC architecture. The architecture of the proposed solution is presented in Figure 2.14.
Figure 2.14 – Architecture of the H.264/AVC Coding with JND Model based Coefficients Filtering solution
Forward/Encoding Path
1. MB Division: First, the input video frame is divided in MBs;
2. Coder Control:
a. Transformation: The input video signal 4×4 blocks are transformed by the ICT;
b. The spatial and temporal contrast sensitivity function (CSF) for each block is computed using
the following information: ICT coefficients of the original block; frame dimension; physical size of
the pixels in the display monitor; viewing distance; frame rate; and motion of the block.
c. Based on the spatial and temporal CSF, each 8x8 block in the frame is classified as a PLAIN,
EDGE or TEXTURE block;
d. Luminance adaptation refers to the fact that the distortion visibility threshold is approximately
linearly proportional to the surrounding intensity, except in very dark and very bright regions.
Contrast masking is the property of the human eyes of masking distortions when other signals
are present. The luminance adaptation
and contrast masking values are computed based on the ICT coefficients and the block type
(result of the previous step, 2.c.);
e. JND model - JND thresholds are computed for each coefficient in each 4×4 block using the
luminance adaptation and contrast masking values;
3. Motion Estimation: For each MB, the motion vectors describing the transformation from one image
to another, usually from adjacent frames in a video sequence, are determined. To choose the best
prediction MB, a distortion metric has to be used, in this case the JND thresholded SAD; this
means that, instead of finding the vector corresponding to the minimum SAD, the ME will find the
minimum thresholded SAD as defined next.
a. Thresholded SAD Computation:
i. Residual computation: The difference between the original and the prediction block is
computed;
ii. Transform: The residual above is transformed by the ICT, producing matrix E;
iii. JND Thresholding:
1. If the absolute value of E is larger than the corresponding JND threshold for a certain
frequency position, do nothing;
2. Otherwise, E is changed to zero;
iv. Inverse Transform: The distortion (thresholded SAD) is computed from the inverse transform
of the result of iii. (E thresholded matrix);
4. MB prediction: Each MB is encoded using INTRA or INTER prediction. If rate distortion (RD)
optimization is enabled, the prediction mode chosen for the block is the one that minimizes the RD
cost defined as in equation (13), where d is the distortion calculated in step 3.a.iv, λ is a Lagrangian
multiplier and L is the actual bit length of encoding the block with that prediction method.

C = d + λ · L (13)
a. If the MB is INTRA coded, its prediction PRED is formed using samples in the current slice that
have been previously encoded and decoded;
b. If the MB is INTER coded, its prediction PRED is formed by motion compensation prediction
from 1 or 2 frames selected from the available reference frames (set of list 0 and/or list 1);
5. Residue Computation: The prediction is subtracted from the current original block to produce a
residual block;
6. Transform: The residual block is transformed with the ICT; the output are the transformed
coefficients of the JND thresholded prediction residual;
7. JND Thresholding:
a. If the absolute value of a transformed coefficient of the prediction residual is larger than the
corresponding JND threshold, go to step 8;
b. Otherwise, the transformed coefficient of the prediction residual is set to zero;
8. Quantization: Quantize the transformed coefficients;
9. Entropy encoding: The same as used by the H.264/AVC High profile;
Decoding Path (also within the encoder): The decoding is the same as presented for the previous
perceptual coding solution.
Main Coding Tools
• JND model
This JND model takes into account the fact that humans are less sensitive to the distortion in texture
regions than in plain regions. So, the blocks are classified as PLAIN, EDGE, and TEXTURE blocks,
where the classification is computed based on the values of the luminance adaptation and contrast
masking. Then, the coefficient filtering is applied to each coefficient in the block as presented in
equation (14), where Y are the transformed coefficients, Jx the JND thresholds and Yf the filtered
transformed coefficients.

Yf(u, v) = Y(u, v), if |Y(u, v)| > Jx(u, v); 0 otherwise (14)
The JND threshold is used to eliminate the ICT coefficients that are not perceived by the human eye,
that is, those whose magnitude is smaller than the corresponding JND threshold, as presented in
equation (15), where E is the ICT transformed difference between the original and the reconstructed
block.

Ef(u, v) = E(u, v), if |E(u, v)| > Jx(u, v); 0 otherwise (15)
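The coefficient filtering of equations (14) and (15) can be sketched as follows; plain nested lists stand in for the 4x4 ICT coefficient blocks, and the block values and thresholds used in the example are hypothetical.

```python
# JND-based coefficient filtering (equations (14)/(15)): any transformed
# coefficient whose magnitude does not exceed the corresponding JND
# threshold is zeroed, since the human eye would not perceive it.

def jnd_filter(coeffs, jnd):
    """Zero every coefficient with |value| <= its JND threshold."""
    return [
        [c if abs(c) > t else 0 for c, t in zip(row, t_row)]
        for row, t_row in zip(coeffs, jnd)
    ]

block = [[12, -3], [4, -9]]
thresholds = [[5, 5], [5, 5]]
print(jnd_filter(block, thresholds))  # [[12, 0], [0, -9]]
```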
Performance Evaluation
• Test Conditions
The proposed JND thresholding scheme has been implemented in the H.264/AVC JM 14.0 reference
software and the High profile has been used. The group of pictures (GOP) structure is IBBPBBPBBP..,
with an Intra frame every 0.5 s; the RDO is enabled. The sequences used were City, Panslow, Shield,
Spin-calendar with resolution 1280x720p and Pedestrian Area, Toys Calendar and Bluesky with
resolution 1920x1080p.
• Results and Analysis
The Double Stimulus Impairment Scale (DSIS) test method has been used for the subjective evaluation.
The results show that all the JND thresholded sequences have a MOS of 5 or close to 5 (the scale goes
from 1, very annoying, to 5, imperceptible), with an average MOS of 4.93 for all sequences. Even when
the PSNR drops by 7 dB, the observers still cannot perceive the difference between the original and the
decoded video, very likely because the error is inserted in image areas where it is not easily perceivable.
On average, the bitrate is reduced by 23% for all sequences when using the proposed perceptual video
codec. While the average reduction for the Intra frames is only 1.74%, it is 20% and 36% for the P and B
frames, respectively. This difference exists because, in the RD cost computation, the Lagrangian multiplier
is larger in B frames than in P frames, which means it is more likely to choose modes with shorter bit
lengths in B frames. Figure 2.15 shows the average bitrate change, relative to the reference H.264/AVC coding,
at different QPs. The bitrate reduction obtained for the I, P, and B frames when the proposed JND-
threshold coding scheme is used is presented in Table 2.2. The bitrate reduction declines as the QP
increases, meaning the JND-thresholding bitrate reduction effect is less intense.
Figure 2.15 – Bitrate changes at different QP for I, P, and B frames [30]
Strengths and Weaknesses
The strength of this solution is the bitrate reduction achieved without any degradation of the visual quality.
The major weakness is the use, for each MB, of the average of the full set of JND thresholds determined
for the 4×4 ICT coefficients in order to maintain compatibility with the H.264/AVC standard, because this
average threshold has little meaning. Another weakness is related to the coding mode selection, which
may not be optimal in terms of bitrate; moreover, when QP is large, some sequences show a bitrate
increase which may be due to the differences in the RD cost function used.
Table 2.2 – Bitrate reduction for the JND-thresholded sequences and their MOS [30]
2.3.3. H.264/AVC Inter Coding based on Structural Similarity driven
Motion Estimation
Basic Approach
In [31], Mai, Yang, Kuang and Po propose a novel motion estimation method based on the Structural
Similarity (SSIM) metric for H.264/AVC inter prediction. The SSIM index is a method for measuring the
similarity between two images. The SSIM index is a full reference metric in the sense that it measures the
image quality based on an initial uncompressed or distortion-free image taken as reference. SSIM has
been designed to improve on traditional quality assessment methods like the peak signal-to-noise ratio
(PSNR) and mean squared error (MSE), which have proved to be inconsistent with human eye
perception.
Variable block-size motion compensation is used in H.264/AVC to improve the matching accuracy and
achieve higher compression efficiency. As the SSIM index expresses the structural similarity between two
images, a prediction block having a larger SSIM should be more similar to the original one and should
thus produce lower frequency residuals which can be more easily encoded. As the best
H.264/AVC P-slice prediction modes are determined after all the prediction residuals are transformed,
quantized and entropy coded, which costs a great deal of computational complexity, this solution is also able
to reduce the overall coding complexity. In this context, the solution presented in this section proposes a
motion estimation method based on the structural similarity (MEBSS) for inter prediction to reduce the
bitrate and encoding time while maintaining the same perceptual video quality. Summing up:
• Basic idea: use a perceptual metric like the SSIM metric in the ME process instead of the usual
SAD.
• Target: reduce the bitrate and encoding time while maintaining the same perceptual video quality.
• HVS property explored: considers luminance, contrast and structure.
Architecture and Walkthrough
This solution mainly changes the distortion metric used in the motion estimation module and the coder
control module. The MEBSS uses the SSIM rather than SAD as the distortion metric in the block
matching motion estimation process. According to the SSIM theory, the candidate block is perceptually
more similar to the original block when its SSIM index is greater, while the SAD behaves in the opposite
way since it is a distortion (not similarity) metric. The idea behind the SSIM index is to measure the
structural information degradation, which includes three comparison dimensions: luminance, contrast and
structure. The higher the value of SSIM(x, y) is, the more similar the images x and y are. The architecture
of the proposed coding solution is presented in Figure 2.16.
Figure 2.16 – Architecture of the H.264/AVC Inter Coding based on SSIM driven ME solution
Forward/Encoding Path
1. MB Division: First, the input video frame is divided in MBs;
2. Motion Estimation: For each MB, the motion vectors describing the transformation from one image
to another, usually from adjacent frames in a video sequence, are determined. To choose the best
MB, a distortion metric has to be used, in this case the SSIM. The major steps to select the best
matching block(s) and the best inter prediction mode for each MB are:
a. For each MB, find the best matching block from all the candidate blocks using equation (16)
where s is the original block, c is the candidate matching block, λMOTION is the Lagrange
multiplier for motion estimation, ∆MV is the difference between the prediction MV and the actual
MV and, finally, Bit(∆MV) is the number of bits representing the ∆MV.
b. Divide each MB into two 16x8 non-overlapped blocks. For each 16x8 block, find the best-
matching 16x8 block from all the reference frames using equation (16). Then, calculate the sum of the Cost values for these two 16x8 blocks.
c. Divide each MB into two 8x16 non-overlapped blocks; then, perform as in b.
d. Divide each MB into four 8x8 non-overlapped blocks. For each 8x8 block, find the best-
matching block from all the reference frames using equation (16). Then, calculate the total Cost for these four 8x8 blocks.
e. If further sub-partitions are allowed, find the best similarity matching blocks for the types 8x4,
4x8 and 4x4, respectively; otherwise, go to step f directly.
f. Find the prediction block using the P_SKIP mode. Its Cost is 1 − SSIM(s, c) since neither a
motion vector nor a reference index parameter is transmitted for this mode.
g. The prediction mode with the minimum Cost is chosen as the best inter prediction mode for
the MB. The residual for this best coding mode is transformed, quantized and entropy coded.
3. MB prediction: As presented in Section 2.3.1
Cost(s, c) = 1 − SSIM(s, c) + λMOTION ∙ Bit(∆MV)    (16)
4. Residual Computation: As presented in Section 2.3.1
5. Transformation: The residual block is transformed with the ICT;
6. Quantization: As presented in the previous Section 2.3.2;
7. Entropy encoding: As presented in the previous Section 2.3.2.
Decoding Path (also within the encoder): As presented in Section 2.3.2.
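To make the mode decision criterion of step 2 concrete, a minimal C sketch follows; the function names are illustrative and not part of the JM reference software, and the SSIM value, Lagrange multiplier and motion vector bits are assumed to be supplied by the encoder.

```c
/* Hypothetical sketch of the MEBSS Lagrangian cost of equation (16):
   Cost = 1 - SSIM(s, c) + lambda_MOTION * Bit(dMV). */
double mebss_cost(double ssim, double lambda_motion, int mv_bits)
{
    return 1.0 - ssim + lambda_motion * (double)mv_bits;
}

/* Pick the prediction mode with the minimum Cost among n candidates,
   mirroring steps a-g of the walkthrough; returns the winning index. */
int mebss_best_mode(const double *costs, int n)
{
    int best = 0;
    for (int i = 1; i < n; i++)
        if (costs[i] < costs[best])
            best = i;
    return best;
}
```

In this sketch, a higher SSIM (more perceptual similarity) and fewer motion vector bits both lower the cost, so the selected mode trades perceptual fidelity against motion information rate exactly as in the RDO loop described above.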
Main Coding Tools
• SSIM Index
This tool is used to measure the structural information degradation, based on three features:
luminance, contrast and structure. The SSIM index is defined in equation (17) where l(x,y) regards the
luminance comparison, c(x,y) regards the contrast comparison and s(x,y) regards the structure
comparison. These comparisons are defined in equations (18), (19) and (20), respectively, where x
and y are two nonnegative image signals to be compared, µx and µy are the mean intensity and σx and
σy are the standard deviation of image x and y, respectively, and σxy is the covariance of image x and
y. C1, C2 and C3 are small constants to avoid the denominator being zero.
SSIM(x, y) = l(x, y) ∙ c(x, y) ∙ s(x, y)    (17)

l(x, y) = (2µxµy + C1) / (µx² + µy² + C1)    (18)

c(x, y) = (2σxσy + C2) / (σx² + σy² + C2)    (19)

s(x, y) = (σxy + C3) / (σxσy + C3)    (20)
• Encoder Control
The encoder transmits the coded video together with some side information, notably for indicating
either Intra-slice or Inter-slice coding. Partitions with sizes of 16x16, 16x8, 8x16 and 8x8 for each MB
luma component are supported by the P-slice syntax in block matching motion estimation. The 8x8
partition can also be further subdivided into 8x4, 4x8 or 4x4 sub-block partitions according to the
syntax element.
The block matching motion estimation aims at finding the best matching block from the reference
frames within a certain search range, such as 16x16. The Lagrange cost (Cost) in equation (16) is
used as the selection criterion. The block(s) with the minimum Cost will be chosen as the best matching
block(s) for each prediction mode. For each prediction mode, an RD cost is generated after finding the
best matching block. The prediction mode with the minimum RD cost will be chosen as the best
prediction for that MB.
Due to the change in the distortion metric (SSIM and not SAD), the Lagrangian multiplier should be
modified correspondingly; consequently, the new cost function must be written as in equation (16).
Performance Evaluation
• Test Conditions
The adopted test video sequences (with 50 frames) were Carphone, Foreman, Grandma and News with
176x144 resolution, and Hall_monitor and Mobile with 352x288 resolution. All the experiments used the
JVT JM reference software, version JM 9.2. The tests were performed on a P4/2.4 GHz personal computer
with 256 MB RAM and Microsoft Windows XP as the operating system.
• Results and Analysis
The MEBSS coding solution can avoid some RDO coding steps, leading to a reduction of the encoder
computation load. However, as the SSIM computational load itself is larger than for the SAD, the reduction
in the overall coding computation load is not that large.
While maintaining almost the same mean SSIM (MSSIM), the proposed MEBSS solution may achieve a
20% average bitrate reduction with a 2.5% reduction in the encoding time; the maximum bitrate reduction
is more than 50% which is rather significant, see Table 2.3.
Table 2.3 – MEBSS results with QP=10 [31].
Strengths and Weaknesses
The strengths of this solution regard the bitrate reduction for a target quality and the encoding complexity
reduction for each prediction mode. The advantage in computation load is more obvious when the
MEBSS is used in H.264/AVC fast motion estimation with fewer searching blocks. The main weakness is
that the overall complexity reduction is not that large due to the usage of a more complex quality metric.
2.3.4. H.264/AVC Bitrate Control based on 4D Perceptual
Quantization Modeling
Basic Approach
In [32], Huang and Lin propose a novel 4D perceptual quantization model for H.264/AVC bitrate control
(PQrc). This solution includes two major encoding modules:
o Perceptual frame-level bit allocation using a 1D temporal pattern, depicted as the energy
transition table, which is used to predict the frame complexity and determine proper rate budgets;
o Macroblock-level quantizer decision using a 3D rate pattern, formed as the bit-complexity-
quantization (BCQ) model, in which the tangent slope of a BCQ curve is the unique information used to
find a proper MB quantizer.
In summary:
• Basic idea: to use a 4D temporal BCQ model, that is, a 1D temporal pattern to predict the
frame complexity and determine the proper budget bits, plus a 3D rate pattern, depicted as the BCQ
model.
• Target: to reduce the bitrate while improving the SNR quality and the control accuracy.
• HVS property explored: basically, the just noticeable distortion.
Architecture and Walkthrough
The whole codec architecture is presented in Figure 2.18; the major changes regard the coder control
and the quantization module. The proposed PQrc solution has major contributions with respect to the
frame complexity estimation and rate-quantization model modules, marked with double stars in Figure
2.17, where the main components of the H.264/AVC rate control module are depicted.
Figure 2.17 – Illustration of the H.264/AVC Bitrate Control based on 4D Perceptual Quantization Modeling rate control main components (major revisions are marked by double stars) [32]
Figure 2.18 – Architecture of the H.264/AVC Bitrate Control based on 4D Perceptual Quantization Modeling solution
Forward/Encoding Path
1. Input video signal: Each frame is divided into MBs;
2. Motion Estimation: For each MB, the motion vectors best describing the transformation from one
image to another, usually from adjacent frames in a video sequence, are determined. To choose
the best MB, a distortion metric, in this case the SAD, is used;
3. MB prediction: Each MB is encoded using INTRA or INTER prediction:
a. If INTRA coding is used, PRED is formed from samples in the current slice that have been
previously encoded, decoded and reconstructed;
b. If INTER coding is used, PRED is formed by motion compensated prediction from 1 or 2
pictures selected from the reference frames set in the so-called list 0 and list 1;
4. Distortion: The best prediction is subtracted from the current block (original) to produce a residual
block;
5. Transformation: As presented in the previous section;
6. Coder Control: The proposed H.264/AVC bitrate control using the 4D perceptual quantization
modeling, see Figure 2.19, works as follows:
a. Frame complexity estimation based on the available channel bandwidth, frame rate, current
target buffer level and actual buffer fullness:
i. Compute the (i-1)th actual mean absolute difference (MAD) through PQrc;
ii. Update the energy transition table which records the temporal variation of frame
complexity (energy) between adjacent frames;
b. Frame-level bit-allocation
i. Calculate the initial frame-level budget bit based on the predicted frame complexity using a
quadratic rate-quantization function;
ii. Slightly adjust the demanded frame bitrate, using the buffer fullness, to achieve buffer
stability.
c. Find the slope of the MB characteristic tangent, that is, in order to find the proper MB
quantizer, the current MB property is characterized by its tangent slope.
i. If the MB budget bits and the MB complexity are known
a. Calculate the expected tangent slope through PQrc;
Figure 2.19 – H.264/AVC bitrate control procedure using the 4D perceptual quantization model [32]
ii. MB bitrate estimation can allocate MB budget bits based on residual bits and previous
coding information.
iii. MB complexity is computed as a weighted MAD according to the variation of the luminance
within the MB.
d. MB QP decision
i. PQrc determines a proper MB QP, selected to minimize the difference between the expected tangent slope
and the tangent slope in the BCQ model.
ii. Update the BCQ model continuously, using the weighted least-square estimation, for newly incoming
video content.
e. MB encoding/finishing
i. Return to Step 3 to decide further MB quantizers until all MBs are encoded.
ii. The encoded frame is used to compute the next frame’s MAD in Step 1.
7. Quantization: As presented in the previous Section 2.3.3;
8. Entropy encoding: As presented in the previous Section 2.3.3;
Decoding Path (also within the encoder): As presented in the previous section.
Main Coding Tools
• Perceptual frame-level bit-allocation
Perceptual frame-level bit-allocation includes the following steps:
1. MAD is predicted by looking up the energy transition table;
2. Just-noticeable-distortion PSNR (PSNRJND) is computed through equation (21) using the JND MSE
(MSEJND) computed through equation (22), where the frame size is N multiplied by M, f is the
pixel luminance in the original frame, f̂ is the one in the reconstructed frame, THJND is the
empirical threshold and η is the tuning factor defined in equation (23).

PSNRJND = 10 log10(255² / MSEJND)    (21)

MSEJND = Σx Σy (|f(x, y) − f̂(x, y)| − THJND)² ∙ η(x, y) / (N ∙ M)    (22)

η(x, y) = 1, if |f(x, y) − f̂(x, y)| > THJND; η(x, y) = 0, if |f(x, y) − f̂(x, y)| ≤ THJND    (23)

PSNRJND is adopted to detect scene changes and to compensate the error in the prior MAD prediction;
several insignificant video signals (noise) will be filtered out to increase the accuracy of the
scene change detection.
3. Buffer fullness is considered to preserve buffer stability and to temporally enhance the video quality:
o If the buffer fullness is larger than a certain threshold, the rate controller can decrease the
budget bits by the overflowed bits (∆);
o Otherwise, the controller can increase the demanded frame bitrate by adding to it the
unused buffer capacity (∆), thus avoiding buffer overflow or underflow and enhancing the video
quality.
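The buffer-driven budget adjustment above can be sketched as follows; the function name and the single-threshold policy are assumptions for illustration, not the exact control law of [32].

```c
/* Sketch of the buffer-driven frame budget adjustment: when the fullness
   exceeds the threshold, drain the overflowed bits from the frame budget;
   otherwise, grant the unused capacity to the frame. */
long adjust_frame_budget(long budget_bits, long buffer_fullness,
                         long buffer_threshold)
{
    if (buffer_fullness > buffer_threshold)
        return budget_bits - (buffer_fullness - buffer_threshold);
    return budget_bits + (buffer_threshold - buffer_fullness);
}
```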
• MB-level quantizer decision using the 3-D BCQ model
The MB-level quantizer decision model using the BCQ model is the core technique for bitrate control
since it directly affects the video bitrate and distortion; its main steps are:
1. MB level quantizer decision control scheme (use the tangent slope of the BCQ curve to find the
proper MB quantizer)
a. Refine the BCQ model, that is, the PQrc uses quadratic functions to model the BCQ curves
for each quantizer;
b. Based on the refined BCQ model, compute the predicted MB bit-quota and complexity for
the kth MB in the ith frame, to determine its proper quantizer;
c. To find the proper MB quantizer, the current MB is characterized by its tangent slope;
d. If the budget bits are exhausted, the quantizer is reset to QPmax to prevent buffer overflow and
reduce the number of skipped frames;
e. The temporal behavior of the BCQ model is also considered to determine each initial
picture quantizer; the decision depends on the variation of the frame MAD in the IPPP
coding structure.
f. A decreasing frame MAD indicates that the next, motionless frame needs fewer budget bits
and, thus, the current QP(i) should be decreased by 1 to increase the encoded bits and
enhance the video quality. Otherwise, an increasing frame MAD indicates that the next,
intense frame needs more budget bits and, thus, the current QP(i) should be increased by 1 to
reduce the encoded bits.
2. BCQ model update using the weighted least-square estimation (use a weighted least-square
estimation to adapt related procedures of updating the 3D BCQ model to current MB properties)
The BCQ model is updated using a weighted least-square estimation based on the coded data sets (Rk, Ck, QP), corresponding to the MB bitrate, MB complexity and QP. When there are n data sets
for one specific QP, the BCQ curve function can be initialized by equation (24), where R is the matrix [R1 ⋯ Rn]ᵀ of the encoded bitrates, E is the matrix [E1 ⋯ En]ᵀ of the prediction errors, Θ is [θ1 θ2 θ3]ᵀ and A is the matrix presented in (26); equation (25) gives the corresponding least-square estimate of Θ.

R = A ∙ Θ + E    (24)

Θ = (AᵀA)⁻¹ ∙ Aᵀ ∙ R    (25)

A = [ C1² C1 1 ; C2² C2 1 ; ⋯ ; Cn² Cn 1 ]    (26)

In the least-square estimation, the indicator of the prediction error to be minimized is Eᵀ ∙ E.
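The BCQ curve fitting of equations (24)-(26) can be sketched as an ordinary least-squares fit of a quadratic in the MB complexity; the weighting of [32] is omitted here, the quadratic row layout follows the reconstruction of (26), and the normal equations are solved directly with Gaussian elimination.

```c
#include <math.h>

/* Sketch of the BCQ model update for one QP: fit R_k = t1*C_k^2 + t2*C_k
   + t3 to n coded (R_k, C_k) pairs by ordinary least squares, solving the
   3x3 normal equations A^T A * theta = A^T R by Gauss-Jordan elimination
   with partial pivoting. */
void bcq_fit(const double *R, const double *C, int n, double theta[3])
{
    double M[3][4] = {{0}};
    for (int k = 0; k < n; k++) {
        double row[3] = { C[k] * C[k], C[k], 1.0 };  /* k-th row of A */
        for (int i = 0; i < 3; i++) {
            for (int j = 0; j < 3; j++)
                M[i][j] += row[i] * row[j];          /* accumulate A^T A */
            M[i][3] += row[i] * R[k];                /* accumulate A^T R */
        }
    }
    for (int col = 0; col < 3; col++) {
        int piv = col;                               /* partial pivoting */
        for (int r = col + 1; r < 3; r++)
            if (fabs(M[r][col]) > fabs(M[piv][col])) piv = r;
        for (int j = 0; j < 4; j++) {
            double t = M[col][j]; M[col][j] = M[piv][j]; M[piv][j] = t;
        }
        for (int r = 0; r < 3; r++) {                /* eliminate column */
            if (r == col) continue;
            double fct = M[r][col] / M[col][col];
            for (int j = col; j < 4; j++) M[r][j] -= fct * M[col][j];
        }
    }
    for (int i = 0; i < 3; i++) theta[i] = M[i][3] / M[i][i];
}
```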
Performance Evaluation
• Test Conditions
The test sequences used were Akiyo, Foreman, News, Carphone and Suzie, with CIF and
QCIF resolutions. The highest channel rates are 1 Mbps/256 kbps for CIF and the lowest channel rates are 128 kbps/24 kbps
for QCIF. The GOP pattern was IPPP/IBBP and the target frame rate was 30 fps.
• Results and Analysis
The obtained results indicate that the MAD estimation precision rate is usually at least 98% in any
condition, comparing the MAD prediction with the actual MAD. The MAD prediction can also
effectively reduce the processing delay with respect to the two-pass rate control model, which is used to collect
coding information but requires a pre-analysis of the video characteristics.
In Table 2.4 and Table 2.5, results for two encoding conditions are presented: 1) CIF@high bitrate; and 2)
QCIF@low bitrate. The proposed PQrc solution can gain 0.3-0.7 dB on the average PSNR regarding the
H.264/AVC JM10.2 codec. The maximum PSNR improvement and degradation are about 1.1 dB and -0.1
dB, respectively. Figure 2.20 and Figure 2.21 show the comparison of image qualities for the cases
Carphone@24 kbps and Foreman@128 kbps, respectively. These images were encoded with the
proposed PQrc solution; it can be observed that there is a SNR improvement regarding the JM10.2 model
(the most significant differences are marked by black circles). Moreover, the smaller PSNR standard
deviation indicates less flickering and more consistent qualities; on average, the improvement is about 0.5
dB, especially for the lower bitrates. The stability of the buffer can be measured by the bitrate accuracy
value (χ), expressed in equation (27), and by the buffer fullness variation:

χ = 1 − |target_rate − output_rate| / target_rate    (27)

A value of χ approaching 1 indicates that the buffer status is stable, when the amount of encoded bits is
near the pre-defined buffer capacity.
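The bitrate accuracy of equation (27) is simple enough to be sketched directly in C; the function name is illustrative.

```c
#include <math.h>

/* Sketch of equation (27): bitrate accuracy between the target and the
   actually produced bitrate; a value of 1 means a perfect match. */
double bitrate_accuracy(double target_rate, double output_rate)
{
    return 1.0 - fabs(target_rate - output_rate) / target_rate;
}
```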
The analysis of the computational complexity considered two aspects:
1. The simpler frame complexity calculation and MB QP decision;
a. The PQrc uses simple arithmetic operations and look-up table techniques, without
complicated power calculations, to determine the frame complexity and the MB QP.
2. The slightly complicated scene-change detection and the BCQ model update.
a. To address the quality degradation problem, it is additionally proposed to detect scene
changes and refine the estimated frame complexity;
b. To continuously update the BCQ model, PQrc also requires slightly complicated matrix
operations to adjust the model coefficients.
The best, worst and average execution time gains are +3.13%, -2.97% and +0.87%, respectively.
Table 2.4 – Comparison of overall coding performance for the GOP IPPP pattern using PQrc and JM10.2 (average PSNR gain: 0.515 dB) [32]
Table 2.5 – Comparison of overall coding performance for the GOP IBBP pattern using PQrc and JM10.2 (average
PSNR gain: 0.35dB) [32]
Figure 2.20 – Comparison of the image quality for the Carphone sequence at 24 kbps (significant differences are
marked with black circles) [32]
Figure 2.21 – Comparison of the image quality for the Foreman sequence at 128 kbps (significant differences are
marked with black circles) [32]
Strengths and Weaknesses
The strengths of this solution are the better visual quality and buffer stability, avoiding the flickering effect,
and the PSNR improvement amounting to around 0.5 dB. The major weakness regards the MAD
prediction which fails at scene changes.
The goal of this chapter was to present a set of solutions that improve the perceptual quality of encoded
video and to use them as references for the work developed in this Thesis. The main references used were the
second solution presented, H.264/AVC Coding with JND Model based Coefficients Filtering, and the first
solution, H.264/AVC Perceptual Video Coding based on a Foveated JND Model. The idea is to implement
a solution that eliminates the transform coefficients which are below the JND threshold and adapts the
QP according to the JND thresholds.
Chapter 3
A JND Model based Coefficients Pruning Method for
H.264/AVC Video Coding
This chapter presents the first perceptually driven modification made to the H.264/AVC video codec with
the target to eliminate the transform coefficients which are perceptually irrelevant according to an adopted
JND model. After describing the coefficients pruning solution, the performance results obtained in the
context of the H.264/AVC JM 16.2 reference software are presented and analyzed.
3.1. The Perceptual Coefficients Pruning Method
3.1.1. Objective
The basic idea underpinning the adopted perceptual coefficients pruning method is to remove, by setting
them to zero, all the transform coefficients which have a magnitude lower than the corresponding JND
threshold determined using an appropriate JND model. The main target is thus the reduction of the total
bitrate while maintaining the perceptual video quality since the removed coefficients are perceptually
irrelevant. To achieve this objective, the HVS properties are adequately exploited through a JND model
considering the following HVS effects: frequency band masking (spatial frequency sensitivity), luminance
masking (intensity contrast masking), pattern masking and temporal masking [33].
3.1.2. Architecture
The improved codec architecture already including the perceptual coefficients pruning method related
tools is presented in Figure 3.1. The major changes associated to the novel tool regard the determination
of the pruning thresholds using the selected JND model and the transform coefficients pruning process
included before the coefficients quantization. It is important to note that the proposed codec modifications
only refer to the encoder and do not imply any change in the H.264/AVC syntax and semantics nor at the
decoder; this implies that fully compliant H.264/AVC bit streams are still created with the
perceptually driven video codec.
Figure 3.1 – Improved H.264/AVC codec architecture including the JND model and the coefficients pruning method.
3.1.3. Walkthrough
This section intends to present the walkthrough of the proposed improved video codec with special
emphasis on the novel modules related to the perceptual pruning of the integer DCT (ICT) coefficients,
which are listed in bold below.
Forward/Encoding Path
1. MB division: First, the input video frame is divided in MBs;
2. JND thresholds determination: JND thresholds are computed for each ICT coefficient
organized in 4x4 and 8x8 blocks using the JND model described in the following section,
designed by Naccari et al. [33];
3. Motion estimation: As presented in Section 2.3.4;
4. MB prediction: As presented in Section 2.3.4;
5. Residue computation: As presented in Section 2.3.2;
6. Transform: As presented in Section 2.3.2;
7. Coefficients pruning:
a. If the absolute value of a prediction residual ICT coefficient is larger than the
corresponding JND threshold, go to step 8;
b. Otherwise, the prediction residual ICT coefficient is set to zero meaning that the
corresponding coefficient is NOT perceptually relevant and, thus, may be pruned,
saving the associated rate; in this context, pruning means setting the value to zero;
8. Quantization: The transform coefficients which ‘survived’ the perceptual pruning method are now
quantized;
9. Entropy encoding: As presented in Section 2.3.1;
Decoding Path (also within the encoder): The decoder is the same as presented in Section 2.3.1 since
there are no changes in the decoder architecture, also implying that the H.264/AVC compliance is kept.
3.1.4. Novel Tools Description
This section describes the novel tools required to perform the perceptual pruning of the transform
coefficients, notably the JND thresholds determination and the perceptual coefficients pruning method.
The adopted pruning solution is based on the perceptual codec reviewed in Section 2.3, designed by Mak
and Ngan [30].
JND Thresholds Determination
Description
The adopted JND thresholds determination process relies on a spatial JND model which exploits three
human visual system masking aspects through appropriate sub-models:
i) Frequency band decomposition masking model which exploits the different sensitivity of the
human eye to the noise introduced at different spatial frequencies. To explore this masking effect,
the model is constituted by the default perceptual matrices adopted in the H.264/AVC reference
software;
ii) Luminance variations masking model which exploits the masking effect associated to luminance
variations in different image regions. To explore this masking effect, the model is based on the
Weber-Fechner law which states that the minimal brightness difference which may be perceived
increases with the background brightness;
iii) Pattern masking effects model which exploits the presence of some patterns in the image. To
explore this masking effect, the Foley-Boynton model was adopted.
The adoption of the spatial JND model has the main target to improve the final rate-distortion
performance through the exploitation of relevant HVS properties.
In the adopted JND model, the determination of the JND thresholds is performed through the following
steps:
• Luminance masking model is defined through equation (29), where Ī(k) denotes the average
luminance intensity in block k.

JNDLUM(k) = −0.048 ∙ Ī(k) + 4,    if Ī(k) ≤ 62
JNDLUM(k) = 1,    if 62 < Ī(k) < 115
JNDLUM(k) = 0.021 ∙ Ī(k) − 1.464,    if Ī(k) ≥ 115    (29)
• Frequency band decomposition masking model is applied to each MB coding mode as described by equation (30).

JNDband_Intra(i, j) = [ 6 13 20 28 ; 13 20 28 32 ; 20 28 32 37 ; 28 32 37 42 ]
JNDband_Inter(i, j) = [ 10 14 20 24 ; 14 20 24 27 ; 20 24 27 30 ; 24 27 30 34 ]    (30)
• Pattern masking model is defined through equation (31), where E(i, j, k) denotes the normalized
contrast energy and ε is 0.6.

JNDpat(i, j, k) = 1, if i, j = 0; max(1, E(i, j, k)^ε), otherwise    (31)
• The final JND threshold (JT) is computed through equation (32).

JT(i, j, k) = JNDband(i, j) ∙ JNDLUM(k) ∙ JNDpat(i, j, k)    (32)
Perceptual Coefficients Pruning Method
Description
The perceptual coefficients pruning method consists in setting to zero all the transform coefficients which
have an absolute magnitude lower than the corresponding JND threshold given by the adopted JND
model described above. This pruning process is important since it allows sending to the decoder, and
thus first to the following quantization module, only the perceptually relevant ICT coefficients, saving the
associated bitrate without any subjective quality penalty.
The decision to prune an ICT coefficient is based on equation (33), where YP represents the pruned
transform coefficients, Y stands for the transform coefficients and JT represents the relevant JND
threshold as defined by equation (32).
YP(i, j, k) = Y(i, j, k), if |Y(i, j, k)| > JT(i, j, k); 0, otherwise    (33)
To apply this pruning tool, it is important to know with which DCT, 4x4 or 8x8, the signal was transformed,
and apply the pruning process not only to the luminance coefficients but also to the chrominance
coefficients. The JND model computes different thresholds for the 4x4 and the 8x8 ICT blocks; so, when
a MB is coded with a 4x4 DCT, the JND thresholds used for the comparison in equation (33) are those
associated to the 4x4 blocks; naturally, the same happens for the 8x8 DCT blocks.
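The pruning rule of equation (33) can be sketched as follows; the block is assumed to be stored as a flat array in raster order with one JND threshold per coefficient, and the function name is illustrative, not the actual JM function of the implementation.

```c
#include <stdlib.h>

/* Sketch of equation (33): an ICT coefficient survives only when its
   magnitude exceeds the corresponding JND threshold; otherwise it is
   perceptually irrelevant and is pruned (set to zero). */
void prune_coefficients(int *coef, const double *jt, int n)
{
    for (int i = 0; i < n; i++)
        if (abs(coef[i]) <= jt[i])
            coef[i] = 0;
}
```

In the actual implementation this step sits between the transform and the quantization, so the pruned coefficients never reach the entropy coder and the associated rate is saved.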
Implementation
To implement the ICT perceptual coefficients pruning method, three processes had to be created: one for
the chrominances, and two for the luminance (one for the 4x4 blocks and another for the 8x8 blocks).
These processes correspond to the functions threshold_transform_chroma, threshold_transform4x4 and
threshold_transform8x8. Each of these functions is used after the transformation process and before the
quantization, so they are used in the functions dct_chroma, dct_4x4_perceptual, dct_16x16_perceptual
and dct_8x8_perceptual.
Definition of Assessment Metrics for the Pruning Method
Since the perceptual coefficients pruning method basically sets coefficients to zero, the most meaningful
metric to evaluate the impact of this tool is the number of zeroed coefficients due to the coefficients
pruning; however, other metrics may provide useful information regarding the pruning method. Taking this
into account, the metrics implemented were the following:
Average number of zeroed coefficients at MB level due to the perceptual coefficients pruning
method
o Definition
This metric measures the difference between the average number of zeroed coefficients with the
H.264/AVC perceptual codec with coefficients pruning after the quantization and the average number of
zeroed coefficients with the H.264/AVC High profile codec after the quantization; in this case, the
H.264/AVC High profile codec is taken as reference. With this definition, this metric effectively measures
the coefficients which were set to zero due to the additional perceptually driven coefficients pruning
method.
This metric is defined in equation (34), where Avg_coeff_MB, defined in equation (35), is the average
number of coefficients zeroed at MB level after the quantization in each frame, nr_coef_zero is the
number of coefficients zeroed in the frame and nr_MB_coded is the number of MBs in a frame with coded
coefficients. This metric is computed over all the frames in the sequence and, afterwards, the average over
time is computed as expressed in equation (36).
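The metric of equations (34)-(36) can be sketched as follows; the counts are assumed to be gathered by the encoder, and the function names are illustrative.

```c
/* Equation (35): average number of zeroed coefficients per coded MB
   in one frame. */
double avg_coeff_mb(long nr_coef_zero, long nr_mb_coded)
{
    return (double)nr_coef_zero / (double)nr_mb_coded;
}

/* Equation (34): coefficients zeroed specifically by the pruning tool,
   obtained as the difference between the perceptual codec and the
   High profile reference. */
double avg_zeroed_mb(double avg_perceptual_codec, double avg_high_profile)
{
    return avg_perceptual_codec - avg_high_profile;
}

/* Equation (36): temporal average of the per-frame metric over the
   whole sequence. */
double avg_zeroed_sequence(const double *per_frame, int nr_frames)
{
    double acc = 0.0;
    for (int i = 0; i < nr_frames; i++)
        acc += per_frame[i];
    return acc / nr_frames;
}
```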
o Implementation
In this subsection, the implementation of the metric above and the problems that appeared during this
implementation will be presented. This metric has been implemented by the following steps:
1. Count the number of zeroed coefficients due only to the perceptual coefficients pruning method
a. First approach: Count the number of coefficients set to zero while computing the perceptual
coefficients pruning method.
i. Problem: The coefficients set to zero by the quantization and the perceptual coefficients
pruning method alone overlap. So, the result of this first approach is not the expected since
the objective is to count the zeroed coefficients due exclusively to the perceptual coefficients
pruning method.
b. Second approach: Count the number of coefficients with zero value after the quantization
computation. Afterwards, the difference between this metric obtained for the H.264/AVC
perceptual codec and for the H.264/AVC High profile codec is computed to determine the
number of coefficients set to zero by the novel tool, without the coefficients zeroed by the
quantization.

Avg_zeroed_MB = Avg_coeff_MB|H.264/AVC perceptual codec − Avg_coeff_MB|H.264/AVC High profile codec    (34)

Avg_coeff_MB = nr_coef_zero / nr_MB_coded    (35)

Avg_zeroed_MB_sequence = (1/nr_frames) ∙ Σ frame=1..nr_frames Avg_zeroed_MB    (36)
2. Compute the number of MBs in a frame using equation (37).
nr_MB = (height × width) / (16 × 16)    (37)
3. Compute equation (35)
To obtain the number of coefficients which are set to zero exclusively due to the perceptual coefficients
pruning method, it is required to perform the steps above both for the H.264/AVC High profile codec and
also for the H.264/AVC perceptual codec with coefficients pruning.
4. Finally, equation (34) is computed.
However, the process above is not a perfect solution to determine the selected metric. Since the RDO is
enabled, the encoder tries all the coding modes to find the best one; so, the best coding mode
for the H.264/AVC High profile codec may not always be the best coding mode for the H.264/AVC
perceptual codec with coefficients pruning. Due to these variations, the difference computed in equation
(34) does not correspond to the effective number of coefficients set to zero exclusively due to the perceptual
coefficients pruning method, but it should provide a rather good approximation.
Average zigzag position of the zeroed coefficients exclusively due to the perceptual coefficients
pruning method (4×4 blocks)
o Definition
This metric computes the average zigzag position in the 4x4 blocks corresponding to the zeroed
coefficients due to the perceptual coefficients pruning method (position ∈ [1, 16]). This metric provides an
idea on the bandwidth zone where the coefficients are being zeroed. The metric is computed with
equation (38), where nr_coef_zeroi represents the number of zeroed coefficients in the position i of each
4x4 block for a frame; this metric is computed in the universe of the MBs coded with the 4x4 DCT.
o Implementation
This metric has been implemented using an array with 16 positions where each position is associated to
the zigzag position in the block. So, position zero in the array corresponds to the first ICT coefficient in the
zigzag scanning order of the 4x4 block. With this purpose, the two following steps are performed:
1. Count the number of zeroed coefficients for each zigzag position (scoring[16])
When performing the quantization process, the number of zeroed coefficients for each zigzag position is
counted. For each frame, only the 4x4 transformed blocks will be considered. For each zigzag position of
the 4x4 block which has its coefficient set to zero, the scoring value for that position is incremented by
one. In this way, at the end of each frame, the array will be filled at each position with the number of
zeroed coefficients for that position in the 4x4 transformed blocks.

Avg_zigzag_position_4x4 = Σi (nr_coef_zeroi × positioni) / Σi nr_coef_zeroi    (38)
2. Compute the average zigzag position using equation (38), taking into account that, for i ∈ [1, 16]:
• nr_coef_zeroi = scoring[i-1];
• positioni = i.
The result is the zigzag scanning position in a 4x4 block where, on average, the coefficients are set to
zero for each frame; to obtain the same metric over the video sequence, the frame metric values have to
be averaged over all the frames in the video sequence.
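The two steps above can be sketched as follows; the scoring array is assumed to have been filled during quantization as described, and the function name is illustrative.

```c
/* Sketch of equation (38): the scoring array holds the number of pruned
   coefficients per zigzag position (index 0 is the first coefficient in
   zigzag order); the metric is their weighted average position in [1, 16],
   or 0 when no coefficient was pruned in the frame. */
double avg_zigzag_position(const int scoring[16])
{
    long num = 0, den = 0;
    for (int i = 1; i <= 16; i++) {   /* position i, count scoring[i-1] */
        num += (long)scoring[i - 1] * i;
        den += scoring[i - 1];
    }
    return den ? (double)num / (double)den : 0.0;
}
```

A low average position indicates that pruning is removing low-frequency coefficients, while a high value shows that mostly high-frequency (perceptually less relevant) coefficients are being zeroed.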
3.2. Performance Evaluation
This section intends to assess the performance of the presented perceptual H.264/AVC codec, including
the proposed perceptual coefficients pruning method. With this purpose in mind, first the adopted test
conditions are presented, including the selected performance metrics and the benchmarks; afterwards, the
performance results are presented and analyzed.
3.2.1. Test Conditions
For the test experiments, the H.264/AVC reference software, version JM 16.2 (FRExt), has been used,
notably the High profile which is the best performing profile from the compression efficiency point of view.
Thus, the adopted JND model and the perceptual coefficients pruning method described in the previous
section were implemented in the context of this reference software. Further test conditions include [34]:
• GOP prediction structure: IBBPBBPBBP..., with a single Intra frame at the beginning.
• Rate control: RDO is in the high complexity mode.
• Test sequences and resolutions:
o Foreman and Mobile
� Spatial resolution: CIF
� Frame rate: 30 fps
� Total number of frames: 300 frames
o Panslow and Spincalendar
� Spatial resolution: 1280x720
� Frame rate: 60 fps
� Total number of frames: 150 frames
o Playing_cards and Toys_and_calendar
� Spatial resolution: 1920x1080
� Frame rate: 24 fps
� Total number of frames: 60 frames
The first frame of each sequence is presented in Figure 3.2 to give an idea on the type of content of each
sequence.
Figure 3.2 – First frame of each test sequence: (a) Foreman; (b) Mobile; (c) Panslow; (d) Spincalendar;
(e) Playing_cards; (f) Toys_and_calendar
• Quantization parameters: Several groups of QP values are used; to simplify, each group is
presented as Gx = (QPI, QPP, QPB), where Gx represents group x and QPy the quantization
parameter for frames of type y, with y being I, P or B:
o G1 = (12, 12, 12)
o G2 = (16, 16, 16)
o G3 = (22, 23, 24)
o G4 = (27, 28, 29)
o G5 = (32, 33, 34)
o G6 = (37, 38, 39)
The higher the x, the larger the quantization steps and, thus, the lower the rate and the quality.
• Coding benchmarks: The novel perceptual video codec is compared with the H.264/AVC High
profile codec without perceptual coefficients pruning; in the following, the H.264/AVC High profile
codec will be labeled as ‘HP’ while the H.264/AVC based perceptual with coefficients pruning
codec will be labeled as ‘JND_CP’.
• Performance metrics:
o PSNR: measures the quality of a reconstructed image based on the comparison of the
decoded and the original sequences through the MSE. PSNR is determined by equation (39),
where R is the maximum fluctuation in the input image data type, this means 255 for 8
bit/sample content, and MSE is defined by equation (40) where M and N are the number of
rows and columns in the input images, respectively, and I1 and I2 are the original image and
the reconstructed image, respectively.
$$\mathit{PSNR} = 10 \log_{10}\left(\frac{R^2}{\mathit{MSE}}\right) \qquad (39)$$

$$\mathit{MSE} = \frac{\sum_{m=1}^{M} \sum_{n=1}^{N} \left[I_1(m,n) - I_2(m,n)\right]^2}{M \cdot N} \qquad (40)$$
In this context, the PSNR for a video sequence is simply the temporal average of the PSNRs
for each frame.
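The PSNR computation in equations (39) and (40), together with the temporal averaging just described, can be sketched with NumPy as follows (a minimal sketch; the peak value R defaults to 255 for 8 bit/sample content):

```python
import numpy as np

def psnr(original, reconstructed, peak=255.0):
    """Equations (39)-(40): PSNR in dB from the MSE between two frames."""
    original = np.asarray(original, dtype=np.float64)
    reconstructed = np.asarray(reconstructed, dtype=np.float64)
    mse = np.mean((original - reconstructed) ** 2)
    if mse == 0:
        return float('inf')  # identical frames
    return 10.0 * np.log10(peak ** 2 / mse)

def sequence_psnr(frames_a, frames_b):
    """Temporal average of the per-frame PSNRs, as used in the text."""
    return sum(psnr(a, b) for a, b in zip(frames_a, frames_b)) / len(frames_a)
```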
o Multi-scale SSIM (MS-SSIM): measures the quality of the perceived image, taking into account
the image signal density, the distance between the image plane and the observer and the
perceptual capability of the observer's visual system. This metric is computed through equation
(41), where M is the finest scale obtained after M-1 scaling iterations; lj(x,y), cj(x,y) and sj(x,y)
are the luminance, contrast and structure components at the different scales and αj, βj and ϒj
are set according to the scale, as aforementioned, so they match the HVS contrast sensitivity
function [35].
$$\mathit{SSIM}(x,y) = \left[l_M(x,y)\right]^{\alpha_M} \cdot \prod_{j=1}^{M} \left[c_j(x,y)\right]^{\beta_j} \cdot \left[s_j(x,y)\right]^{\gamma_j} \qquad (41)$$
In this context, the MS-SSIM for a video sequence is simply the temporal average of the MS-
SSIMs for each frame. MS-SSIM is considered to be an objective quality metric with a better
correlation than PSNR with the user mean opinion scores.
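To illustrate the multi-scale structure of equation (41), the sketch below uses global (whole-image) statistics instead of the usual local sliding windows, and assumes the standard five scale weights from the MS-SSIM literature with βj = γj folded into a single exponent per scale; it is a simplified illustration under these assumptions, not the reference MS-SSIM implementation.

```python
import numpy as np

# Standard MS-SSIM scale weights from the literature (an assumption here,
# not a value taken from the thesis text).
WEIGHTS = [0.0448, 0.2856, 0.3001, 0.2363, 0.1333]

def _luminance_cs(x, y, peak=255.0):
    """Global luminance (l) and contrast-structure (cs) terms of SSIM."""
    c1, c2 = (0.01 * peak) ** 2, (0.03 * peak) ** 2
    mx, my = x.mean(), y.mean()
    cov = ((x - mx) * (y - my)).mean()
    l = (2 * mx * my + c1) / (mx ** 2 + my ** 2 + c1)
    cs = (2 * cov + c2) / (x.var() + y.var() + c2)
    return l, cs

def _downsample(img):
    """2x2 average pooling (low-pass filtering plus dyadic downsampling)."""
    h, w = img.shape[0] // 2 * 2, img.shape[1] // 2 * 2
    img = img[:h, :w]
    return (img[0::2, 0::2] + img[1::2, 0::2] +
            img[0::2, 1::2] + img[1::2, 1::2]) / 4.0

def ms_ssim(x, y, weights=WEIGHTS):
    """Equation (41), simplified: cs terms over all scales, luminance at scale M."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    result = 1.0
    for j, w in enumerate(weights):
        l, cs = _luminance_cs(x, y)
        if j == len(weights) - 1:
            result *= (l * cs) ** w  # luminance only enters at the last scale
        else:
            result *= cs ** w
            x, y = _downsample(x), _downsample(y)
    return result
```

For identical images the result is 1.0, and it decreases as the images diverge at any of the scales.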
o Average number of transform coefficients zeroed per MB due to the perceptual coefficients
pruning method: this metric provides an idea on the number of transform coefficients which
are set to zero due to the usage of the proposed pruning method and is computed using
equation (34) above.
o Average zigzag position of the zeroed coefficients in 4x4 coded blocks: this metric provides an
idea on the zigzag position of the coefficients which are set to zero due to the usage of the
proposed pruning method and is computed using equation (38) above.
3.2.2. Results and Analysis
To assess the RD performance, RD charts for the PSNR and MS-SSIM objective quality metrics versus
the bitrate were obtained for each test sequence and for each adopted RD point.
The variations of the various metrics between the proposed perceptual codec (H.264/AVC JND_CP) and
the H.264/AVC High profile (H.264/AVC HP) benchmark are computed using equations (42), (43) and
(44) for each RD point. These values should ideally show the improvements brought by the presented
H.264/AVC solution regarding the ‘conventional’ H.264/AVC HP codec as implemented in the JVT
reference software developed by the standardization group itself. The average values for these metrics
and respective variations are presented in Table A-1 of Annex A.
$$\Delta \mathit{bitrate}\,(\%) = \frac{\mathit{bitrate}_{JND} - \mathit{bitrate}_{HP}}{\mathit{bitrate}_{HP}} \times 100 \qquad (42)$$

$$\Delta \mathit{PSNR}\,(\%) = \frac{\mathit{PSNR}_{JND} - \mathit{PSNR}_{HP}}{\mathit{PSNR}_{HP}} \times 100 \qquad (43)$$

$$\Delta \mathit{MSSSIM}\,(\%) = \frac{\mathit{MSSSIM}_{JND} - \mathit{MSSSIM}_{HP}}{\mathit{MSSSIM}_{HP}} \times 100 \qquad (44)$$
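Equations (42), (43) and (44) share the same form; a small helper (with hypothetical naming) makes this explicit:

```python
def percent_variation(value_jnd_cp, value_hp):
    """Relative variation (in %) of a metric between the JND_CP codec and
    the HP benchmark, as in equations (42), (43) and (44)."""
    return (value_jnd_cp - value_hp) / value_hp * 100.0
```

A negative Δbitrate thus means the JND_CP codec spends fewer bits than the HP benchmark for that RD point, while a positive ΔPSNR or ΔMS-SSIM means a quality gain.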
• RD Performance: PSNR versus Bitrate
This subsection presents the RD charts for the PSNR versus the bitrate for each test sequence; the
results are afterwards analyzed.
Figure 3.3 – PSNR RD performance for the Foreman sequence
Figure 3.4 – PSNR RD performance for the Mobile sequence
Figure 3.5 – PSNR RD performance for the Panslow sequence
Figure 3.6 – PSNR RD performance for the Spincalendar sequence
Figure 3.7 – PSNR RD performance for the Playing_cards sequence
Figure 3.8 – PSNR RD performance for the Toys_and_calendar sequence
Analyzing the RD variation in Figure 3.3 to Figure 3.8, the following conclusions may be taken:
• As expected, the PSNR increases with the rate, at first quite quickly and afterwards with a rather
linear variation; in this case, the various rates were obtained from the various quantization
parameter combinations, corresponding to the Gx labels in the charts.
• For the low and medium resolution sequences, there are no significant variations in terms of RD
performance between the two H.264/AVC codecs (HP and JND_CP) under comparison.
• For the higher resolution sequences, there are evident RD performance gains obtained with the
H.264/AVC JND_CP solution, notably for the higher rates. The rate gains go up to about 7 Mbit/s,
i.e., 8% (for the same PSNR) for the last RD point or PSNR gains of about 0.5 dB, i.e., 1.1% (for
the same rate) for the last RD point.
• The RD gains are larger for the higher resolutions because, in high resolution sequences, each MB
corresponds to a tinier physical area; in this context, the redundancy is higher and, therefore, the
coefficients are lower and, consequently, the pruned coefficients have a lower impact on the video
quality.
• RD Performance: MS-SSIM versus Bitrate
This subsection presents the RD charts for the MS-SSIM versus the bitrate for each test sequence; the
results are afterwards analyzed.
Figure 3.9 – MS-SSIM RD performance for the Foreman sequence
Figure 3.10 – MS-SSIM RD performance for the Mobile sequence
Figure 3.11 – MS-SSIM RD performance for the Panslow sequence
Figure 3.12 – MS-SSIM RD performance for the Spincalendar sequence
Figure 3.13 – MS-SSIM RD performance for the Playing_cards sequence
Figure 3.14 – MS-SSIM RD performance for the Toys_and_calendar sequence
Analyzing the RD variation in Figure 3.9 to Figure 3.14, the following conclusions may be taken:
• As expected, the MS-SSIM increases with the rate, at first rather quickly and afterwards saturating
for the higher bitrates; this basically means that the subjective quality saturates when the
rate increases above a certain value, implying that, contrary to what the PSNR suggests, the
subjective quality does not continuously increase with the rate since non-perceptible details are
being sent at some stage. As before, the various rates are obtained from the various quantization
parameter combinations, corresponding to the Gx labels in the charts.
• For the low and medium resolution sequences, the H.264/AVC JND_CP solution shows rate gains
up to about 8% (for the same MS-SSIM) for the last RD point (G1).
• For the higher resolution sequences, there are more evident RD performance gains associated with
the H.264/AVC JND_CP solution, notably for the higher rates. The rate gains go up to about 11
Mbit/s (13%) (for the same MS-SSIM) for the last RD point (G1).
• From the two bullets above, it is clear that the RD performance gains are larger for the higher
resolution sequences. This happens because, in high resolution sequences, each MB corresponds
to a tinier physical area; in this context, the redundancy is higher, the coefficients are, therefore,
lower and, consequently, the pruned coefficients have a lower impact on the video quality.
• Comparing the MS-SSIM RD performance with the PSNR RD performance, the following
conclusions may be taken:
o For low resolution sequences, high quality (saturation zone) in terms of MS-SSIM is
achieved with a rate around 700 kbit/s and 1800 kbit/s for Foreman and Mobile sequences,
respectively, while high PSNR stable values are only achieved for higher rates, notably
higher than 3700 kbit/s and 6500 kbit/s for the Foreman and Mobile sequences, respectively.
o For medium resolution, the MS-SSIM high quality is achieved with a rate around 60 Mbit/s
and 10 Mbit/s for the Panslow and Spincalendar sequences, respectively, while high PSNR
stable quality is only achieved for rates higher than 100 Mbit/s.
o For high resolution, high quality MS-SSIM is achieved around 10 Mbit/s and 45 Mbit/s for the
Playing_cards and Toys_and_calendar sequences, respectively, while high PSNR stable
quality is only achieved for rates higher than 80 Mbit/s.
o The observations above highlight that PSNR improvements after a certain rate are non-
perceptible since the subjective quality saturates due to the HVS perception limitations. The
recognition of this effect may allow adopting lower coding rates while still achieving the same
high subjective quality.
• Average number of zeroed coefficients per MB due to the perceptual coefficients pruning
method
This metric measures the average number of coefficients in a MB which are zeroed after the quantization
exclusively due to the pruning method. As such, it corresponds to the difference between the average
number of zeroed coefficients for a MB when using the JND model and the average number of zeroed
coefficients for a MB when the sequence is coded with the H.264/AVC High profile (HP) reference
software. However, as the coding modes may not always be precisely the same, this difference may
sometimes be negative, although this happens rather rarely.
In this subsection, the evolution in time of the average number of zeroed coefficients for a MB, for each
sequence, will be presented and afterwards analyzed.
Figure 3.15 – Average number of zeroed coefficients for the Foreman sequence
Figure 3.16 – Average number of zeroed coefficients for the Mobile sequence
Figure 3.17 – Average number of zeroed coefficients for the Panslow sequence
Figure 3.18 – Average number of zeroed coefficients for the Spincalendar sequence
Figure 3.19 – Average number of zeroed coefficients for the Playing_cards sequence
Figure 3.20 – Average number of zeroed coefficients for the Toys_and_calendar sequence
From the analysis of Figure 3.15 to Figure 3.20, it may be concluded that:
• For lower Gs, this means higher rates and qualities, the average number of zeroed coefficients is
larger because, for the higher rates, the reference software uses lower thresholds (i.e., has fewer
coefficients to set to zero). Therefore, when the perceptual coefficients pruning method is applied,
there will be more coefficients under the JND threshold. The highest average number of zeroed
coefficients in a MB is around 10.2 coefficients for the lowest G in the Playing_cards sequence.
• For higher Gs, this means lower bitrates and qualities, the average number of zeroed coefficients per
MB is close to zero since most of the coefficients which are perceptually irrelevant are already set
to zero by the quantization process.
• The average number of zeroed coefficients for a MB is not constant along the sequence. This
variation is due to the used GOP prediction structures including I, P and B frames which have
rather different characteristics in terms of coefficients energy.
• The oscillations due to the GOP prediction structure do not have the same intensity for all
quantization parameter combinations Gx, i.e., the difference in the average number of zeroed
coefficients in a MB between a P frame and a B frame varies with the QP value. While the B
frames have coefficients with low values, the P frames have higher values for the coefficients. As
the QP value decreases, the number of coefficients which have a value below the JND threshold
decreases. Since the values of the coefficients in B frames are low, the number of zeroed
coefficients does not change much with different QPs. On the other hand, in P frames, the number
of zeroed coefficients decreases when the QP decreases. Consequently, the difference between
the number of zeroed coefficients in a MB between a P frame and a B frame rises when the QP
value decreases.
• Considering all the sequences, it becomes clear that the oscillations are not only due to the GOP
prediction structure, but there are also other reasons involved. For example, the panning in
Foreman around frame 200 and the zoom out in Mobile around frames 100 to 150 imply an
increase in the average number of zeroed coefficients per MB. This happens because the image
variation is exploited by the frequency decomposition masking model.
• The average number of zeroed coefficients per MB increases as the resolution increases. Looking
into G1, for each sequence, it is clear that for low resolutions the average is around 5 coefficients,
for medium resolution the average increases to around 6 coefficients and, for high resolution, the
average is around 7.5 coefficients. This happens because, in high resolution sequences, each MB
corresponds to a tinier physical area; in this context, the redundancy is higher and, therefore, the
coefficients are lower and, consequently, there are more coefficients pruned.
For each spatial resolution, a more detailed analysis of the results allows stating that:
- CIF resolution: the average number of zeroed coefficients per MB is higher for the Foreman
sequence because this sequence presents strong variations in time since it has three main parts:
first, a close up of a man talking; then, a view of open sky; and, afterwards, a view of a building
under construction. In this case, the frequency band decomposition masking model exploits the
type and quantity of variations in each image.
- 720p60 resolution: the Panslow sequence presents a higher average number of zeroed coefficients
per MB because there are pattern areas where the JND model, more precisely the pattern masking
59
model, exploits the contrast. As the JND threshold is larger for sequences with pattern objects,
more coefficients are pruned;
- 1080p24 resolution: the average number of zeroed coefficients per MB is higher for the
Playing_cards sequence because it has a lower luminance level than the Toys_and_calendar
sequence. As the luminance masking model used in the JND model provides a higher JND threshold
for sequences with a darker background (low level of luminance), the perceptual coefficients
pruning method is capable of setting more coefficients to zero.
• Average zigzag position of the zeroed coefficients exclusively due to the perceptual
coefficients pruning method
This metric represents the average zigzag position in a 4×4 block where the coefficients are more likely to
be set to zero. With this information, it is possible to determine where the pruning method has more
impact in terms of frequency. The position in the block is determined in terms of zigzag scanning order.
Figure 3.21 – Average zigzag position of the zeroed coefficients for the 4x4 blocks in the Foreman sequence
Figure 3.22 – Average zigzag position of the zeroed coefficients for the 4x4 blocks in the Mobile sequence
Figure 3.23 – Average zigzag position of the zeroed coefficients for the 4x4 blocks in the Panslow sequence
Figure 3.24 – Average zigzag position of the zeroed coefficients for the 4x4 blocks in the Spincalendar sequence
Figure 3.25 – Average zigzag position of the zeroed coefficients for the 4x4 blocks in the Playing_cards sequence
Figure 3.26 – Average zigzag position of the zeroed coefficients for the 4x4 blocks in the Toys_and_calendar
sequence
From the analysis of Figure 3.21 to Figure 3.26, it is possible to conclude:
• The average zigzag position of the zeroed coefficients in 4x4 blocks increases with the rate. This
is expected as, the higher the rate, the lower the quantization step and, thus, the more irrelevant
coefficients are coded, if not filtered by the pruning method.
• For the first Gs, this means higher rates and qualities, the average zigzag position is slightly higher,
going up to 11.4, on average, for G1 in the Mobile sequence. This happens since the reference
software has a lower threshold for the higher bitrates (i.e., the quantization only sets to zero
coefficients with very high frequency); consequently, when the perceptual coefficients pruning
method is applied, since the human eye is less sensitive to high frequencies, the coefficients set to
zero will be in a slightly higher frequency range compared to the other Gs.
• On average, the zigzag position where coefficients are more likely to be set to zero is
approximately 10, with a variance of approximately 2, for the low and medium resolutions and
approximately 9, with a variance of approximately 2, for the high resolution. These positions
correspond to the middle range frequencies. This happens since the frequency band
decomposition masking model applies a higher threshold for the higher frequencies; thus, in these
frequencies, more coefficients are set to zero. Therefore, it could be assumed that this method has
a bigger impact on the high frequencies; however, since the quantization already sets to zero
coefficients in the higher frequencies, this method ends up having a bigger impact in the middle
frequencies.
• As for the average number of zeroed coefficients per MB, the average zigzag position shows
oscillations due to two types of reasons: the GOP prediction structure and the particular
characteristics of the sequences, as aforementioned. The particular characteristics of the
sequences with impact on the variation of the average zigzag position of the zeroed coefficients are
presented next.
For each resolution, a more detailed analysis allows concluding:
- CIF resolution: the average zigzag position for 4x4 blocks is higher for the Mobile sequence
because this sequence has fewer image variations than the Foreman sequence; the frequency
decomposition masking model exploits the variations in the latter more strongly;
- 720p60 resolution: the Spincalendar sequence presents a higher average zigzag position for 4x4
blocks because it is a sequence with a spinning image, contrary to the Panslow sequence where a
camera panning is present, thus introducing some strong variations in the image; the frequency
decomposition masking model exploits these variations;
- 1080p24 resolution: the average zigzag position for 4x4 blocks is similar for the
Toys_and_calendar and Playing_cards sequences.
The average zigzag position of the zeroed coefficients for 4x4 blocks is not constant along the sequence
as aforementioned. These variations along time may be explained as follows:
• Foreman sequence: Figure 3.21 shows that, around frame 200, this means when the camera does
a panning over the sky, the average zigzag position for 4x4 blocks decreases. This corresponds to
a strong variation in the image, as there is first a close up of a man speaking, then an image with
some open sky and, finally, a view of a building under construction. The JND model, more precisely
the frequency band decomposition masking model, exploits the type and quantity of variations in
the image.
In Annex A, Table A-2 represents the evolution of the average zigzag position of the zeroed coefficients
for 4x4 blocks and the variance of this position along the RD points for all sequences.
3.2.3. Conclusion
The main conclusion of this chapter is that the adopted perceptual coefficients pruning method may have
some positive impact on the RD performance, notably for sequences with lower luminance levels (e.g.
Playing_cards vs Toys_and_calendar) and objects with patterns (e.g. Panslow vs Spincalendar),
especially for the low QP values and higher resolutions.
The H.264/AVC JND_CP codec can achieve PSNR RD gains that are only evident for the higher
resolutions and for the higher rates. The rate gain goes up to 8% for the last RD point and the PSNR gain
goes up to 0.5 dB for the last RD point. Also, the MS-SSIM RD performance can achieve rate gains up to
8% in rate for low and medium resolutions and 13% for high resolutions, both for the last RD point.
A good quality in terms of MS-SSIM can be achieved with a lower rate (around 700 kbit/s for the Foreman
sequence) than in terms of PSNR, where a good quality is only achieved with a higher rate (larger than
3.7 Mbit/s for the Foreman sequence).
Regarding the number of zeroed coefficients due to the perceptual coefficients pruning method, it can go
up to around 9.7 coefficients, on average, for the Playing_cards sequence in the lowest G. These zeroed
coefficients are more likely to be located between the 8th and 12th zigzag positions in a 4x4 block, as
shown in Figure 3.27. Therefore, the coefficients zeroed by the perceptual coefficients pruning method
are typically in the middle to high frequencies.
Figure 3.27 – 4x4 block with the average zeroed positions highlighted
1 2 6 7
3 5 8 13
4 9 12 14
10 11 15 16
Chapter 4
A JND Model based Adaptive Quantization Method for
H.264/AVC Video Coding
This chapter presents the second perceptually driven modification to the reference H.264/AVC video
codec with the objective of quantizing the integer DCT (ICT) coefficients based on an adopted JND
model. After describing the new ICT quantization solution, the performance results obtained in the context
of the H.264/AVC JM 16.2 reference software using several objective quality metrics are presented and
analyzed.
4.1. The JND Adaptive Quantization Method
4.1.1. Objective
The basic idea of the new perceptually driven quantization method for the ICT coefficients is to adapt the
QP values based on the JND thresholds computed based on a selected JND model. With this purpose, a
distortion weight will be determined for each MB to be applied to each initially computed QP value in
order to get a new JND adaptive QP value. The basic idea is to code each ICT coefficient with an
accuracy driven by the HVS properties, thus avoiding sending information that is not visually perceptible.
The final target is to reduce the bitrate necessary to reach a certain target subjective video quality.
4.1.2. Architecture
The improved encoder architecture already including the JND adaptive quantization related tools is
presented in Figure 4.1 while Figure 4.2 presents the improved encoder architecture including also the
ICT perceptual coefficients pruning solution presented in the previous chapter. The major architectural
changes regard the control of the QP value based on the JND thresholds. It is important to note that the
proposed codec modifications only refer to the encoder and they do not imply any change in the
H.264/AVC syntax and semantics, meaning that still fully compliant H.264/AVC bit streams are created.
Figure 4.1 – Encoder architecture with the JND related adaptive quantization modules
Figure 4.2 – Encoder architecture with the JND related adaptive quantization and ICT coefficients pruning modules
4.1.3. Walkthrough
This section presents a walkthrough of the new perceptual video codec with special emphasis on the
novel modules related to the QP perceptual control based on the JND model, which are listed in bold.
The 7th step is only included for the improved perceptual codec using both the
JND related methods presented in this Thesis, this means coefficients pruning and JND adaptive
quantization.
Forward/Encoding Path
1. MB Division: As presented in Section 2.3.1;
2. JND thresholds determination: As presented in Section 3.1.3;
3. Motion Estimation: As presented in Section 2.3.4;
4. MB prediction: As presented in Section 2.3.4;
5. Residue Computation: As presented in Section 2.3.2;
6. Transform: As presented in Section 2.3.2;
7. Coefficients Pruning: As presented in Section 3.1.3;
8. JND QP Adaptation: Adapt the QP value for each MB based on the JND thresholds as
defined in the next section;
9. Quantization: Quantize the transformed (and eventually pruned) coefficients with the JND adaptive
QP;
10. Entropy encoding: As presented in Section 2.3.1;
Decoding Path (also within the encoder): The decoder is the same as presented in Section 2.3.1 since
there are no changes in the decoder. This also reflects the fact that the proposed JND related tools do not
impact the H.264/AVC compliance, and thus the same (normative) decoder is used.
4.1.4. Description of New Tool
This section describes the novel tool required to perform the QP perceptual adaptation, notably the
computation of the new QP. This solution is based on the perceptual video coding solutions presented in
Section 2.3 developed by Chen and Guillemot [27] [28].
JND Adaptive Quantization
Description
The JND adaptive quantization method consists in adapting the QP value for each MB, taking into
account its perceptual relevance. The basic idea is to use the JND thresholds to determine a new QP
value: if the average value of the JND thresholds in a MB is higher than the average value of the JND
thresholds in the frame (using the thresholds for the relevant coding mode, e.g. 4x4 DCT or 8x8 DCT and
INTRA or INTER modes), the MB is perceptually less relevant regarding the average relevance of the
frame. Consequently, the new QP will be higher than the QP determined by the H.264/AVC JM reference
software, exploiting the HVS behavior to mask some additional quantization noise, thus saving some
bitrate.
This tool modifies the QP value as initially determined by the H.264/AVC JM reference software by using
a weighted distortion (dist) computed based on the JND thresholds determined using the JND model
presented in Section 3.1.4. The QP value is adapted as follows:
$$QP_{\mathit{JND}_i} = QP_i \cdot \mathit{dist}_i \qquad (45)$$
where QPi is the initially determined QP value and QPJNDi is the adapted QP value.
The weighted distortion dist is computed through equation (46) where avg_JND_MBi is the average value
of the JND thresholds for MBi, computed by equation (47), and avg_JND_frame is the average value of
the JND thresholds for the frame, computed by equation (48) where JT is the JND threshold for each
coefficient in the MB.
$$\mathit{avg\_JND\_MB}_i = \frac{\sum_{x=1}^{16} \sum_{y=1}^{16} JT(x,y)}{16 \times 16} \qquad (47)$$

$$\mathit{avg\_JND\_frame} = \frac{\sum_{x=1}^{\mathit{height}} \sum_{y=1}^{\mathit{width}} JT(x,y)}{\mathit{height} \times \mathit{width}} \qquad (48)$$
In summary, the average JND thresholds for the frame and for each MB are computed using the JND
model presented in Section 3.1.4. Afterwards, the weighted distortion for each MB is computed and the
new QP value is determined by this distortion as in (45).
To avoid a subjectively negative flickering effect due to the variation of the QP value between MBs, the
QP variations are limited: the QP value can only decrease by 1 and increase by up to 3, as defined by
equation (49). After the determination of QPJNDi with equation (45), the conditions in (49) are checked: if disti is
(49). After the determination of QPJND i with equation (45), the conditions in (49) are checked: if disti is
lower than 1 and QPJNDi is less than QPi-1 or if disti is higher than 1 and QPJNDi is higher than QPi+3,
QPJNDi will be further changed: in the first case, QPJNDi will be set to QPi-1 while in the second case QPJNDi
will be set to QPi+3.
$$\begin{cases} QP_{\mathit{JND}_i} = QP_i + 3 & \text{if } (\mathit{dist}_i > 1) \;\text{and}\; (QP_{\mathit{JND}_i} - QP_i) > 3 \\[4pt] QP_{\mathit{JND}_i} = QP_i - 1 & \text{if } (\mathit{dist}_i < 1) \;\text{and}\; (QP_i - QP_{\mathit{JND}_i}) > 1 \end{cases} \qquad (49)$$
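Equations (45), (46) and (49) can be combined into a small sketch. It assumes the reconstruction of equation (46) as a sigmoid weight in the range (0.7, 1.3), assumes `avg_jnd_mb` and `avg_jnd_frame` come from equations (47) and (48), and assumes the weighted QP is rounded to the nearest integer (the text does not state the rounding rule).

```python
import math

def distortion_weight(avg_jnd_mb, avg_jnd_frame):
    """Equation (46): sigmoid weight in (0.7, 1.3); equal to 1.0 when the
    MB average JND threshold matches the frame average."""
    z = -4.0 * (avg_jnd_mb - avg_jnd_frame) / avg_jnd_frame
    return 0.7 + 0.6 / (1.0 + math.exp(z))

def jnd_adaptive_qp(qp, avg_jnd_mb, avg_jnd_frame):
    """Equations (45) and (49): scale the QP by the distortion weight and
    clamp the variation to [-1, +3] to avoid flickering between MBs."""
    dist = distortion_weight(avg_jnd_mb, avg_jnd_frame)
    qp_jnd = round(qp * dist)  # integer QP assumed
    return max(qp - 1, min(qp + 3, qp_jnd))
```

A perceptually less relevant MB (JND average above the frame average) thus gets its QP raised by at most 3, while a more relevant MB gets it lowered by at most 1.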
Implementation
To compute the real JND threshold average for a frame using the coding mode selected for each MB, it
would be necessary to know the MB coding modes before the encoding process, implying that the frame
would have to be coded twice. To simplify the computation of the average JND threshold for the frame,
this average is computed as if the coding mode of the current MB is the coding mode of all MBs in the
frame (e.g. if the MB being coded uses a 4x4 DCT with INTRA mode, the average JND threshold in the
frame is the average JND thresholds in a frame where all MBs are coded with 4x4 DCT and INTRA
mode). This allows applying the algorithm above based only on the four frame-level average JND
thresholds identified below.
$$\mathit{dist}_i = 0.7 + \frac{0.6}{1 + \exp\!\left(-4\,\dfrac{\mathit{avg\_JND\_MB}_i - \mathit{avg\_JND\_frame}}{\mathit{avg\_JND\_frame}}\right)} \qquad (46)$$
To implement the JND adaptive quantization tool, the following steps are required.
For each frame
1. Compute the average JND threshold in the frame level using equation (48) for the 4x4 and 8x8
integer DCT for each INTRA and INTER coding mode, notably:
• 4x4 DCT with INTRA mode for luminance;
• 4x4 DCT with INTER mode for luminance;
• 8x8 DCT with INTRA mode for luminance;
• 8x8 DCT with INTER mode for luminance;
For each MB
2. Compute the average JND threshold using equation (47) for the current MB using the selected
coding mode;
3. Compute the new JND adaptive QP by:
a. Select the avg_JND_frame and avg_JND_MB corresponding to the coding mode of the current
MB (e.g. MB coded with 4x4 DCT with INTER mode);
b. Compute the distortion weight using equation (46)
c. Compute the new JND adaptive QP value using equation (45) and then check and apply the
conditions in (49).
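The per-MB steps above can be sketched as follows. Since equation (45), which maps the distortion weight to the new QP value, is not reproduced in this section, the `round(qp * dist)` mapping below is only a hypothetical stand-in; the distortion weight follows equation (46) and the clamping follows the conditions in (49).

```python
import math

def distortion_weight(avg_jnd_mb, avg_jnd_frame):
    """Distortion weight dist_i of equation (46): a sigmoid of the relative
    difference between the MB and frame-level average JND thresholds."""
    rel = (avg_jnd_mb - avg_jnd_frame) / avg_jnd_frame
    return 0.7 + 0.6 / (1.0 + math.exp(-4.0 * rel))

def jnd_adaptive_qp(qp, avg_jnd_mb, avg_jnd_frame):
    """New QP for one MB: scale the initial QP by the distortion weight
    (hypothetical stand-in for equation (45)), then clamp the variation
    to the [-1, +3] range of equation (49) to avoid flickering."""
    dist = distortion_weight(avg_jnd_mb, avg_jnd_frame)
    qp_jnd = round(qp * dist)  # stand-in mapping, not the thesis's (45)
    if dist > 1 and qp_jnd - qp > 3:
        qp_jnd = qp + 3        # limit the QP increase to +3
    elif dist < 1 and qp - qp_jnd > 1:
        qp_jnd = qp - 1        # limit the QP decrease to -1
    return qp_jnd
```

Note that when the MB average JND threshold equals the frame average, the weight is 1 and the QP is left unchanged, which matches the role of dist_i = 1 as the neutral point in the conditions of (49).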
As mentioned above, this process is not a perfect solution since the average JND threshold in a frame is
an approximation: instead of computing the average JND threshold for a frame using the JND thresholds
corresponding to the coding modes actually used in the frame, the JND thresholds are computed
considering that all MBs are coded with the same coding mode as the current MB. However, it should be
a rather good approximation.
4.2. Performance Evaluation
4.2.1. Test Conditions
For the tests, the H.264/AVC reference software, version JM 16.2 (FRExt), has been used, notably the
High profile for the reasons mentioned before. Thus, the adopted JND model, the perceptual coefficients
pruning method described in the previous section and the JND adaptive quantization method were
implemented in the context of this reference software. Further test conditions included [34]:
• GOP prediction structure: As presented in Section 3.2.1.
• Rate control: As presented in Section 3.2.1.
• Test sequences and resolutions: As presented in Section 3.2.1.
• Quantization parameters: As presented in Section 3.2.1.
• Coding benchmarks: The proposed perceptual video codec with the JND adaptive QP and the
proposed perceptual video codec including coefficients pruning and JND adaptive QP are
compared with the H.264/AVC High profile codec and with the H.264/AVC based perceptual codec
only with coefficients pruning. In the following, these codecs will be labeled as:
o HP - H.264/AVC High profile codec
o JND_CP - H.264/AVC based perceptual codec with coefficients pruning
o JND_QP - H.264/AVC based perceptual codec with JND adaptive QP
o JND_CP+QP - H.264/AVC based perceptual codec including coefficients pruning and JND
adaptive QP
• Performance metrics: Besides the objective quality metrics already adopted in the previous
chapter, another objective quality metric is adopted to perform a more complete RD performance
assessment:
o Video Quality Metric (VQM): VQM was developed to provide an objective measurement of the
perceived video quality. It measures the perceptual effects of video impairments, including
blurring, jerky/unnatural motion, global noise, block distortion and color distortion, and
combines them into a single metric [36]. The VQM procedure takes as input the original and
decoded video sequences in the YUV color space; then, the DCT transform is applied to each
video sequence and the DCT coefficients are converted into a local contrast (LC) metric
computed using equation (50), where DC is the DC component of each block.
$LC(i,j) = DCT(i,j) \cdot \dfrac{1}{DC} \cdot \left(\dfrac{DC}{1024}\right)^{0.65}$  (50)
Afterwards, LC is converted to just-noticeable distortions (JNDs) in order to compare only the
significant differences, that is, to compare just the relevant coefficients by ignoring the non-
perceivable ones. This conversion is made through a spatial CSF (SCSF) matrix: each DCT
coefficient is multiplied by the corresponding entry in the SCSF matrix and the results are the
reciprocals of the JND thresholds. Finally, the two video sequences are subtracted (diff) to
determine the average and maximum distances, which are finally weighted by equation (51)
[37].
$VQM = mean\_dist + 0.005 \cdot max\_dist$  (51)
with
$mean\_dist = 1000 \cdot mean\left(mean\left(abs(diff)\right)\right)$  (52)
$max\_dist = 1000 \cdot max\left(max\left(abs(diff)\right)\right)$  (53)
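A minimal sketch of the two core VQM steps described above, assuming the block DCT has already been computed and that `diff` already holds the JND-normalized differences between the two sequences (the SCSF matrix itself is not reproduced in this section):

```python
def local_contrast(dct_block):
    """Equation (50): convert a block of DCT coefficients into a local
    contrast measure, normalized by the block's DC component."""
    dc = dct_block[0][0]
    scale = (1.0 / dc) * (dc / 1024.0) ** 0.65
    return [[coeff * scale for coeff in row] for row in dct_block]

def vqm_score(diff):
    """Equations (51)-(53): combine the mean and the maximum of the
    absolute JND-normalized differences into a single VQM score."""
    flat = [abs(d) for row in diff for d in row]
    mean_dist = 1000.0 * sum(flat) / len(flat)  # equation (52)
    max_dist = 1000.0 * max(flat)               # equation (53)
    return mean_dist + 0.005 * max_dist         # equation (51)
```

The small 0.005 weight on the maximum term means the score is dominated by the average distortion, with the maximum acting as a penalty for isolated severe impairments.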
The VQM distortion values range from 0 to 1 with the 0 value representing a video with
excellent quality and 1 representing a very bad video quality.
To compute the VQM, the Command Line VQM (CVQM) tool which performs video calibration
and video quality estimation has been used; however, in this work only video quality
estimation has been performed. The CVQM software was developed by the Institute for
Telecommunication Sciences (ITS) and is described in [38]. CVQM performs the automatic
processing of a pair of corresponding video sequences, one with the original video sequence
and the other with the processed video sequence, e.g., after coding by the video codecs under
study.
o Resolving Power (RP) Compensated VQM: VQM generally correlates well with the
subjectively perceived quality; however, it is typically not powerful enough to compare two
codecs in terms of RD performance. For example, the perceived quality of the HP and
JND_QP codecs may be very similar while the associated VQM scores are quite different,
leading to the conclusion that the JND_QP codec has a worse quality than the HP codec. To
improve this type of comparison, a resolving power (RP) compensation is applied to the basic
VQM. The RP works as a threshold, that is, the RP sets the maximum difference in the
adopted metric, in this case the VQM, for which two results can still be considered to have the
same perceived quality. So, if, for example, the VQM difference between the JND_QP and HP
codecs is less than the RP, the quality is perceived as similar by a human observer [39]. The
RP used here corresponds to a 95% confidence interval and was determined using the
software provided by ITS. Using the RP compensated VQM should allow a more realistic
comparison of the subjective RD performance of the various video codecs.
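The RP compensation described above amounts to a simple thresholded comparison, which can be illustrated as follows; the default RP value below is a placeholder, as the actual 95%-confidence threshold was obtained with the ITS software:

```python
def compare_vqm(vqm_a, vqm_b, rp=0.07):
    """Compare two VQM scores under RP compensation: scores differing
    by less than the resolving power RP are perceptually equivalent;
    otherwise the lower VQM (less distortion) indicates the better codec.
    rp=0.07 is a placeholder value, not the thesis's measured threshold."""
    if abs(vqm_a - vqm_b) < rp:
        return "equivalent"
    return "a" if vqm_a < vqm_b else "b"
```

For instance, if the HP codec scores 0.10 and the JND_QP codec 0.13, the difference falls below the threshold and the two codecs are reported as perceptually equivalent despite the raw VQM gap.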
4.2.2. Results and Analysis
To evaluate the RD performance of the JND_QP and JND_CP+QP, four objective quality metrics are
used in this chapter: PSNR, MS-SSIM, VQM and RP compensated VQM. With this purpose, RD charts
for the PSNR, MS-SSIM, VQM and RP compensated VQM versus the rate for each RD point will be
presented for each test sequence.
The variations of the various quality metrics between the proposed perceptual codecs and the H.264/AVC
High profile benchmark are computed using equations (42), (43), (44), (54) and (55) for each RD point.
These values should ideally show the gains brought by the proposed perceptual H.264/AVC based
codecs regarding the standard H.264/AVC High profile codec as implemented in the JVT reference
software provided by the standardization group itself. The average values for these metrics and
respective variations are presented in Table B-1 to Table B-5 of Annex B.
$\Delta VQM\,(\%) = \dfrac{VQM_{JND} - VQM_{HP}}{VQM_{HP}} \times 100$  (54)
$\Delta VQM\_RP\,(\%) = \dfrac{VQM\_RP_{JND} - VQM\_RP_{HP}}{VQM\_RP_{HP}} \times 100$  (55)
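Equations (54) and (55) are plain percentage variations relative to the HP benchmark; as a one-line sketch:

```python
def delta_pct(metric_jnd, metric_hp):
    """Percentage variation of a quality metric for a perceptual codec
    relative to the HP benchmark, as in equations (54)-(55)."""
    return (metric_jnd - metric_hp) / metric_hp * 100.0
```

Note that, since lower VQM means better quality, a positive variation here corresponds to a quality loss for the perceptual codec.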
• RD Performance: PSNR versus Bitrate
This subsection presents the PSNR RD charts for each test sequence; the results are analyzed afterwards.
Figure 4.3 – PSNR RD performance for the Foreman sequence
Figure 4.4 – PSNR RD performance for the Mobile sequence
Figure 4.5 – PSNR RD performance for the Panslow sequence
Figure 4.6 – PSNR RD performance for the Spincalendar sequence
Figure 4.7 – PSNR RD performance for the in Playing_cards sequence
Figure 4.8 – PSNR RD performance for the Toys_and_calendar sequence
From the analysis of Figure 4.3 to Figure 4.8, it may be concluded that:
o For the H.264/AVC perceptual codec with JND adaptive QP (JND_QP)
o For the low resolution sequences, there are evident RD performance losses, notably for the
higher rates
� comparing with the H.264/AVC High profile codec (HP), the PSNR losses go up to about 2
dB (4.5%)/1.5 dB (3.5%) (for the same rate) for the last RD point, and the rate losses go up to
700 kbit/s (35%)/800 kbit/s (18.2%) (for the same PSNR) for the last RD point (G1) for the
Foreman and Mobile sequences, respectively.
� comparing with the H.264/AVC perceptual codec with coefficients pruning (JND_CP), the
PSNR losses go up to about 2 dB (4.5%)/1.5 dB (3.5%) (for the same rate) for the last RD
point, and the rate losses go up to 800 kbit/s (42.1%)/1 Mbit/s (23.8%) (for the same
PSNR) for the last RD point (G1) for the Foreman and Mobile sequences, respectively.
o For medium resolution sequences
� comparing with the HP codec, the PSNR losses go up to about 1 dB (2.3%)/0.5 dB (1.1%)
(for the same rate) for the last RD point, and rate losses go up to 14 Mbit/s (22%)/8 Mbit/s
(10%) (for same PSNR) for the last RD point for the Panslow and Spincalendar sequences,
respectively.
� comparing with the JND_CP codec, the PSNR losses go up to about 1 dB (2.4%)/0.5 dB
(1.1%) for the last RD point, and the rate losses go up to 14 Mbit/s (22%)/10 Mbit/s (13%)
(for the same PSNR) for the last RD point for the Panslow and Spincalendar sequences,
respectively.
o For high resolution sequences
� comparing with the HP codec, the PSNR losses go up to about 0.5 dB (1%) (for the same
rate) for last RD point and the rate losses go up to about 6 Mbit/s(10%) (for the same
PSNR) for the last RD point for the Playing_cards sequence.
� comparing with the JND_CP codec, the PSNR losses go up to about 0.75 dB (1.6%), or the
rate losses go up to about 12 Mbit/s (24%) (for the same PSNR) for the last RD point for
the Playing_cards sequence.
o For the H.264/AVC perceptual codec including coefficients pruning and JND adaptive QP
(JND_CP+QP)
o For the low resolution sequences, there are again clear RD performance losses, notably for the
higher rates
� comparing with the HP codec, the PSNR losses go up to about 2 dB (4.5%)/1.5 dB (3.5%)
(for the same rate) for the last RD point and the rate losses go up to 700 kbit/s (36.7%)/800
kbit/s (19.5%) (for the same PSNR) for the last RD point for the Foreman and Mobile
sequences, respectively.
� comparing with the JND_CP codec, the PSNR losses go up to about 2 dB (4.5%)/1.5 dB
(3.5%) (for the same rate) for the last RD point and the rate losses go up to 800 kbit/s
(42.1%)/1 Mbit/s (23.8%) (for the same PSNR) for the last RD point for the Foreman and
Mobile sequences, respectively.
o For medium resolution sequences
� comparing with the HP codec, the PSNR losses go up to about 1 dB (2.3%)/0.5 dB (1.2%)
(for the same rate) for the last RD point and the rate losses go up to 12/10
Mbit/s (19%/13.2%) (for the same PSNR) for the last RD point for the Panslow and
Spincalendar sequences, respectively.
� comparing with the JND_CP codec, the PSNR losses go up to about 1/0.5 dB (2.4/1.2%)
(for the same rate) for the last RD point and the rate losses go up to 14/12 Mbit/s
(22.2/16.2%) (for same PSNR) for the last RD point for the Panslow and Spincalendar
sequences, respectively.
o For high resolution sequences
� comparing with the HP codec, the PSNR losses go up to about 0.5/0.25 dB (1.1/0.6%) (for
the same rate) for the last RD point and the rate losses go up to about 8/6 Mbit/s
(16/10.9%) (for the same PSNR) for the last RD point for the Playing_cards and
Toys_and_calendar sequences, respectively.
� comparing with the JND_CP codec, the PSNR losses go up to about 1/0.75 dB (2.2/1.7%)
(for the same rate) for the last RD point and the rate losses go up to 12/14 Mbit/s
(26.1/27.5%) (for the same PSNR) for the last RD point for the Playing_cards and
Toys_and_calendar sequences, respectively.
o Both the JND_QP and the JND_CP+QP codecs present a worse RD performance than the HP
codec in terms of the PSNR quality metric: this does not come as a surprise, as the PSNR relies
on the mathematical difference between the luminances of the original and decoded sequences
and, thus, does not necessarily accurately model the subjective quality. In fact, the JND adaptive
QP method introduces additional quantization error for some MBs under the assumption that this
additional error is not perceptible, although the PSNR will still 'complain' about it, as the PSNR is
(mathematically) sensitive to this error; thus, it is simply normal that the PSNR RD performance of
the JND_QP and JND_CP+QP codecs is worse than that of the HP and JND_CP codecs.
o The RD performance losses are larger for the lower resolutions and smaller for the higher
resolutions because, in the high resolution sequences, each MB corresponds to a smaller physical
area; so, the redundancy is higher and the impact of the JND adaptive quantization method is
lower.
• RD Performance: MS-SSIM versus Bitrate
Because the PSNR does not correlate well with the subjective quality, this subsection presents the
MS-SSIM RD charts for each test sequence, trying to understand if there are quality gains obtained
with the proposed H.264/AVC based perceptual codecs that can be better 'detected' using another
quality metric; the results are analyzed afterwards.
Figure 4.9 – MS-SSIM RD performance for the Foreman sequence
Figure 4.10 – MS-SSIM RD performance for the Mobile sequence
Figure 4.11 – MS-SSIM RD performance for the Panslow sequence
Figure 4.12 – MS-SSIM RD performance for the Spincalendar sequence
Figure 4.13 – MS-SSIM RD performance for the Playing_cards sequence
Figure 4.14 – MS-SSIM RD performance for the Toys_and_calendar sequence
From the analysis of Figure 4.9 to Figure 4.14, it may be concluded that:
o For the H.264/AVC perceptual codec with JND adaptive QP (JND_QP)
o For the low resolution sequences:
� comparing with the HP codec, rate gains go up to 1300 kbit/s (32.5%) (for the same
MS-SSIM) for the last RD point for the Foreman sequence; for the Mobile sequence, the rate
gains go up to 1900 kbit/s (26.8%) (for the same MS-SSIM) for the last RD point.
� comparing with the JND_CP codec, rate gains go up to 1000 kbit/s (27%)/1300 kbit/s
(20%) (for the same MS-SSIM) for the last RD point for the Foreman and Mobile
sequences, respectively.
o For medium resolution sequences
� comparing with the HP codec, rate gains go up to 32 Mbit/s (29.6%)/30 Mbit/s (25.4%) (for
same MS-SSIM) for the last RD point for the Panslow and Spincalendar sequences,
respectively.
� comparing with the JND_CP codec, rate gains go up to 24 Mbit/s (23.5%) and 21 Mbit/s
(19.3%) (for the same MS-SSIM) for the last RD point for the Panslow and Spincalendar
sequences, respectively.
o For high resolution sequences
� comparing with the HP codec, rate gains go up to 21 Mbit/s (25.3%)/27 Mbit/s (29%) (for
same MS-SSIM) for the last RD point for the Playing_cards and Toys_and_calendar
sequences, respectively.
� comparing with the JND_CP codec, rate gains go up to 10 Mbit/s (13.9%) and 15
Mbit/s (18.5%) (for the same MS-SSIM) for the last RD point for the Playing_cards and
Toys_and_calendar sequences, respectively.
o For the H.264/AVC perceptual codec including coefficients pruning and JND adaptive QP
(JND_CP+QP)
o For the low resolution sequences, there are evident RD performance gains, notably for the
higher rates
� comparing with the HP codec, rate gains go up to 1.4 Mbit/s (35%)/2 Mbit/s (28.2%) (for
the same MS-SSIM) for the last RD point for the Foreman and Mobile sequences,
respectively.
� comparing with the JND_CP codec, rate gains go up to 1.1 Mbit/s (29.7%)/1.2 Mbit/s
(18.5%) (for the same MS-SSIM) for the last RD point for the Foreman and Mobile
sequences, respectively
o For medium resolution sequences
� comparing with the HP codec, rate gains go up to 33 Mbit/s (32.4%) (for same MS-SSIM)
for the last RD point for the Panslow sequence.
� comparing with the JND_CP codec, rate gains go up to 27 Mbit/s (26.5%) and 24 Mbit/s
(22%) (for the same MS-SSIM) for the last RD point for the Panslow and Spincalendar
sequences, respectively.
o For high resolution sequences
� comparing with the HP codec, rate gains go up to about 24 Mbit/s (28.9%)/30 Mbit/s
(32.3%) (for same MS-SSIM) for the last RD point for the Playing_cards and
Toys_and_calendar sequences, respectively.
� comparing with the JND_CP codec, rate gains go up to 13 Mbit/s (18.1%) and 18 Mbit/s
(22.2%) (for the same MS-SSIM) for the last RD point for the Playing_cards and
Toys_and_calendar sequences, respectively.
o Summarizing, the JND_QP and JND_CP+QP codecs present a maximum quality similar or inferior
to that of the reference HP codec because the MS-SSIM, as aforementioned, is based on
structure. Since both perceptual codecs modify the QP value for each MB, adjacent MBs may
have significantly different QP values (e.g., if the QP value initially assigned by the H.264/AVC JM
reference software is 33, the QP for a MB may vary between a minimum of 32 and a maximum of
36); this does not happen for the HP codec, since its RD performance was evaluated for constant
QP values. Consequently, the MS-SSIM interprets this quality difference as block artifacts and the
quality score is lower.
o However, there are significant RD gains for certain qualities, which are larger for the higher
resolutions because, in high resolution sequences, each MB corresponds to a smaller physical
area; consequently, the impact of the method is lower and the aforementioned blocking effect is
less noticeable.
• RD Performance: VQM versus Bitrate
Still with the purpose of better assessing the subjective quality impact of the proposed perceptually driven
coding tools, this subsection presents the VQM RD charts for each test sequence; the results are analyzed
afterwards.
Figure 4.15 – VQM RD performance for the Foreman sequence
Figure 4.16 – VQM RD performance for the Mobile sequence
Figure 4.17 – VQM RD performance for the Panslow sequence
Figure 4.18 – VQM RD performance for the Spincalendar sequence
Figure 4.19 – VQM RD performance for the Playing_cards sequence
Figure 4.20 – VQM RD performance for the Toys_and_calendar sequence
From the analysis of Figure 4.15 to Figure 4.20, it may be concluded that:
o The VQM decreases with the rate (thus the quality increases), first rather quickly, and afterwards
tends to a value slightly above zero for the higher bitrates; this basically means that there is a good
subjective quality when the rate increases above a certain value, as a VQM near zero means the
distortion is very low. The various rates are obtained from the various quantization parameter
combinations corresponding to the Gx labels in the charts.
o The JND_QP codec shows the worst video quality for the low and medium resolutions, where the
difference to the H.264/AVC HP VQM is the highest. This happens because the RD performance
of the HP codec has been measured with constant QP values for all MBs, while the perceptual
codecs change the QP value of each MB between -1 and +3 around the QP value initially
determined by the H.264/AVC JM reference software. The blocking effect which may be
associated to these different QPs leads the VQM to indicate a lower quality, since this metric is
sensitive to such effects.
o For the H.264/AVC perceptual codec with coefficients pruning (JND_CP) comparing with the HP
codec
o For the low resolution sequences, rate gains go up to 600 kbit/s (8.5%) (for the same VQM) for
the last RD point for the Mobile sequence.
o For the medium resolution sequences, rate gains go up to 9 Mbit/s (7.6%) (for the same VQM)
for the last RD point for the Spincalendar sequence.
o For the high resolution sequences, VQM gains go up to 0.02 (for the same rate) for the second
to last RD point (G2) for the Playing_cards sequence, and rate gains go up to 11 Mbit/s
(13%)/12 Mbit/s (12.9%) (for the same VQM) for the last RD point for the Playing_cards and
Toys_and_calendar sequences, respectively.
o For the H.264/AVC perceptual codec with JND adaptive QP (JND_QP)
o For the low resolution sequences:
� comparing with the HP codec, rate gains go up to 1300 kbit/s (32.5%) (for the same VQM)
for the last RD point for the Foreman sequence; for the Mobile sequence, rate gains go up
to 1900 kbit/s (26.8%) (for the same VQM) for the last RD point. However, there are also
VQM losses around 700 kbit/s, which go up to approximately 0.04 for the Foreman
sequence and approximately 0.06 for the Mobile sequence.
� comparing with the JND_CP codec, rate gains go up to 1000 kbit/s (27%)/1300 kbit/s
(20%) (for the same VQM) for the last RD point for the Foreman and Mobile sequences,
respectively. However, there are also VQM losses around 700 kbit/s, which go up to
approximately 0.03 for the Foreman sequence and approximately 0.05 for the Mobile
sequence.
o For medium resolution sequences
� comparing with the HP codec, rate gains go up to 32 Mbit/s (29.6%)/30 Mbit/s (25.4%) (for
the same VQM) for the last RD point for the Panslow and Spincalendar sequences,
respectively.
� comparing with the JND_CP codec, rate gains go up to 24 Mbit/s (23.5%) and 21 Mbit/s
(19.3%) (for the same VQM) for the last RD point for the Panslow and Spincalendar
sequences, respectively.
o For high resolution sequences
� comparing with the HP codec, VQM gains go up to 0.02 (for the same rate) for the second
to last RD point (G2) for the Playing_cards sequence and rate gains go up to 21 Mbit/s
(25.3%)/27 Mbit/s (29%) (for the same VQM) for the last RD point for the Playing_cards
and Toys_and_calendar sequences, respectively. VQM losses go up to 0.01 (for the same
rate) around 10 Mbit/s for the Toys_and_calendar sequence.
� comparing with the JND_CP codec, rate gains go up to 10 Mbit/s (13.9%) and 15 Mbit/s
(18.5%) (for the same VQM) for the last RD point for the Playing_cards and
Toys_and_calendar sequences, respectively, and VQM gains go up to 0.01 (for the same
rate) for the second to last RD point for the Playing_cards sequence; VQM losses go up to
0.01 (for the same rate) around 10 Mbit/s for the Toys_and_calendar sequence.
o For the H.264/AVC perceptual codec including coefficients pruning and JND adaptive QP
(JND_CP+QP)
o For the low resolution sequences, there are evident RD performance gains, notably for the
higher rates
� comparing with the HP codec, rate gains go up to 1.4 Mbit/s (35%)/2 Mbit/s (28.2%) (for
the same VQM) for the last RD point for the Foreman and Mobile sequences, respectively.
However, there are VQM losses around 700 kbit/s which go up to approximately 0.04 for
the Foreman sequence and up to approximately 0.06 for the Mobile sequence.
� comparing with the JND_CP codec, rate gains go up to 1.1 Mbit/s (29.7%)/1.2 Mbit/s
(18.5%) (for the same VQM) for the last RD point for the Foreman and Mobile sequences,
respectively. However, there are VQM losses around 700 kbit/s which go up to
approximately 0.03 for the Foreman sequence and go up to approximately 0.05 for the
Mobile sequence.
o For medium resolution sequences
� comparing with the HP codec, gains go up to 33 Mbit/s (32.4%) (for same VQM) for the last
RD point for the Panslow sequence.
� comparing with the JND_CP codec, rate gains go up to 27 Mbit/s (26.5%) and 24 Mbit/s
(22%) (for the same VQM) for the last RD point for the Panslow and Spincalendar
sequences, respectively.
o For high resolution sequences
� comparing with the HP codec, VQM gains go up to 0.03 (for the same rate) for the second
to last RD point (G2) for the Playing_cards sequence and rate gains go up to 24 Mbit/s
(28.9%)/30 Mbit/s (32.3%) (for same VQM) for the last RD point for the Playing_cards and
Toys_and_calendar sequences, respectively. VQM losses go up to 0.01 (for the same rate)
for the second to last RD point (G2) for the Toys_and_calendar sequence.
� comparing with the JND_CP codec, rate gains go up to 13 Mbit/s (18.1%) and 18 Mbit/s
(22.2%) (for the same VQM) for the last RD point for the Playing_cards and
Toys_and_calendar sequences, respectively and VQM gains go up to 0.02 (for the same
rate) for the second to last RD point for the Playing_cards sequence; there are also VQM
losses going up to 0.01 (for the same rate) for the second to last RD point for the
Toys_and_calendar sequence.
In summary, the JND_QP and JND_CP+QP codecs present rate gains for all resolutions, especially for
the highest rates; however, there are no VQM gains as the maximum subjective quality is the same for all
codecs under test. For low resolution sequences and quantization steps between G2 and G5, there are
some VQM losses; on the contrary, for high resolution sequences, there are some VQM gains.
The JND_QP codec can achieve VQM RD gains through rate gains up to 32.5%/27% for the last RD
point and VQM gains up to 0.06/0.05 for the second to last RD point relatively to the HP and the JND_CP
codecs, respectively. The JND_CP+QP codec can achieve VQM RD gains through rate gains up to
35%/29.7% for the last RD point or VQM gains up to 0.06/0.05 for the second to last RD point for the HP
and the JND_CP codecs, respectively.
• RD Performance: RP compensated VQM versus Bitrate
Still to better assess the RD performance, and trying to overcome the VQM limitations previously
presented, this subsection finally presents the RP compensated VQM RD charts for each test sequence;
the results are analyzed afterwards.
Figure 4.21 – RP compensated VQM RD performance for the Foreman sequence
Figure 4.22 – RP compensated VQM RD performance for the Mobile sequence
Figure 4.23 – RP compensated VQM RD performance for the Panslow sequence
Figure 4.24 – RP compensated VQM RD performance for the Spincalendar sequence
Figure 4.25 – RP compensated VQM RD performance for the Playing_cards sequence
Figure 4.26 – RP compensated VQM RD performance for the Toys_and_calendar sequence
From the analysis of Figure 4.21 to Figure 4.26, it may be concluded that:
o For the H.264/AVC perceptual codec with coefficients pruning (JND_CP) comparing with the HP
codec
o For the low resolution sequences, rate gains go up to 600 kbit/s (8.5%) (for the same VQM) for
the last RD point for the Mobile sequence.
o For medium resolution sequences, rate gains go up to 9 Mbit/s (7.6%) (for the same VQM) for
the last RD point for the Spincalendar sequence.
o For high resolution sequences, RP compensated VQM gains go up to 0.02 (for the same rate)
for the second to last RD point (G2) and rate gains go up to 11 Mbit/s (13%)/12 Mbit/s (12.9%)
(for the same RP compensated VQM) for the last RD point for the Playing_cards and
Toys_and_calendar sequences, respectively.
o For the H.264/AVC perceptual codec with JND adaptive QP (JND_QP)
o For the low resolution sequences
� comparing with the HP codec, rate gains go up to 1.3 Mbit/s (32.5%) and 1.9 Mbit/s
(26.8%) (for the same RP compensated VQM) for the last RD point for the Foreman and
Mobile sequences, respectively, and RP compensated VQM gains go up to 0.01 and 0.02
(for the same rate) around 1600 kbit/s and 1400 kbit/s for the Foreman and Mobile
sequences, respectively.
� comparing with the JND_CP codec, rate gains go up to 1 Mbit/s (27%) and 1.3 Mbit/s
(20%) (for the same RP compensated VQM) for the last RD point for the Foreman and
Mobile sequences, respectively, and RP compensated VQM gains go up to 0.01 and 0.02
(for the same rate) around 1600 kbit/s and 1400 kbit/s for the Foreman and Mobile
sequences, respectively.
o For the medium resolution sequences
� comparing with the HP codec, rate gains go up to 32 Mbit/s (29.6%) and 30 Mbit/s (25.4%)
for the same RP compensated VQM for the last RD point for the Panslow and
Spincalendar sequences, respectively, and RP compensated VQM gains go up to 0.01 for
the same rate for the second to last RD point for both sequences.
� comparing with the JND_CP codec, rate gains go up to 24 Mbit/s (23.5%) and 21 Mbit/s
(19.3%) for the same RP compensated VQM for the last RD point for the Panslow and
Spincalendar sequences, respectively, and RP compensated VQM gains go up to 0.01 for
the same rate for the second to last RD point, for both sequences above.
o For high resolution sequences
� comparing with the HP codec, rate gains go up to 21 Mbit/s (25.3%) and 27 Mbit/s (29%)
(for the same RP compensated VQM) for the last RD point, and the RP compensated VQM
gains go up to 0.03/0.02 (for the same rate) for rates around 38 Mbit/s and 34 Mbit/s for the
Playing_cards and Toys_and_calendar sequences, respectively.
� comparing with the JND_CP codec, rate gains go up to 10 Mbit/s (13.9%) and 15 Mbit/s
(18.5%) (for the same RP compensated VQM) for the last RD point, and the RP
compensated VQM gains go up to 0.02/0.01 (for the same rate) for rates around 38 Mbit/s
and 34 Mbit/s for the Playing_cards and Toys_and_calendar sequences, respectively.
o For the H.264/AVC perceptual codec including coefficients pruning and JND adaptive QP
(JND_CP+QP)
o For the low resolution sequences
� comparing with the HP codec, rate gains go up to 1.4 Mbit/s (35%) and 2 Mbit/s (28.2%)
(for the same RP compensated VQM) for the last RD point, and the RP compensated VQM
gains go up to 0.01/0.02 (for the same rate) for the second to last RD point for the Foreman
and Mobile sequences, respectively.
� comparing with the JND_CP codec, rate gains go up to 1.1 Mbit/s (29.7%) and 1.2 Mbit/s
(18.5%) for the same RP compensated VQM for the last RD point and the RP
compensated VQM gains go up to 0.03 (78.9%) and 0.02 (40%) for the same rate for the
second to last RD point for the Foreman and Mobile sequence, respectively
o For the medium resolution sequences
▪ comparing with the HP codec, rate gains go up to 33 Mbit/s (30.6%) and 33 Mbit/s (28%) for the same RP compensated VQM for the last RD point and the RP compensated VQM gains go up to 0.01 for the same rate for the second to last RD point, for the Panslow and Spincalendar sequences, respectively.
▪ comparing with the JND_CP codec, rate gains go up to 27 Mbit/s (26.5%) and 24 Mbit/s (22%) for the same RP compensated VQM for the last RD point and the RP compensated VQM gains go up to 0.01 for the same rate for the second to last RD point for the Panslow and Spincalendar sequences, respectively.
o For high resolution sequences
▪ comparing with the HP codec, rate gains go up to 24 Mbit/s (28.9%) and 30 Mbit/s (32.3%) for the same RP compensated VQM for the last RD point and the RP compensated VQM gains go up to 0.04/0.02 for the same rate for the second to last RD point for the Playing_cards and Toys_and_calendar sequences, respectively.
▪ comparing with the JND_CP codec, rate gains go up to 13 Mbit/s (18.1%) and 18 Mbit/s (22.2%) for the same RP compensated VQM for the last RD point and the RP compensated VQM gains go up to 0.02/0.01 for the same rate for the second to last RD point for the Playing_cards and Toys_and_calendar sequences, respectively.
In summary, for both the JND_QP and JND_CP+QP codecs and for all resolutions, there are rate gains
and RP compensated VQM gains. The JND_QP codec can achieve RP compensated VQM RD gains
through rate gains up to 32.5%/27% for the last RD point regarding the HP and JND_CP codecs,
respectively, and RP compensated VQM gains up to 0.03/0.02 for the second to last RD point relatively to
the HP and the JND_CP codecs, respectively. The JND_CP+QP codec can achieve RP compensated
VQM RD gains through rate gains up to 35%/29.7% for the last RD point or VQM gains up to 0.04/0.02
for the second to last RD point for the HP and JND_CP codecs, respectively.
4.2.3. Conclusion
The main conclusions of this chapter are that the PSNR and the MS-SSIM quality metrics are not able to
adequately express the subjective RD performance gains obtained with the proposed H.264/AVC
perceptual codec with JND adaptive QP (JND_QP) and the H.264/AVC perceptual codec including
coefficients pruning and JND adaptive QP (JND_CP+QP) because they are not designed to efficiently
measure the subjective quality and, thus, have a low correlation with subjective quality scores. However,
the effective assessment of the RD performance gains was possible with the VQM objective quality metric
and especially with its RP compensated version.
VQM RD gains for the JND_QP and JND_CP+QP codecs are mainly rate gains; however, there are also
some VQM gains, especially for the high resolution sequences, going up to 0.02 and 0.03 when
comparing with the HP codec and 0.01/0.02 when comparing with the JND_CP codec for the JND_QP
and JND_CP+QP codecs, respectively.
Regarding the RP compensated VQM RD gains for the JND_QP and JND_CP+QP codecs, there are
both rate and RP compensated VQM gains. The highest RD gains for the JND_CP+QP codec are RP
compensated VQM gains which go up to 0.02 and rate gains which go up to 35% for the low resolutions.
Still for the same codec, the RP compensated VQM gains go up to 0.04 and the rate gains go up to 31%
for the medium and high resolutions.
Chapter 5
Conclusions and Future Work
Chapter 5 concludes this report by presenting a brief summary of the solutions developed as well as the
main conclusions; finally, some suggestions for eventual future work are presented.
5.1. Summary and Conclusions
Chapter 1 introduced the problem addressed in this Thesis, mentioning the increasing importance of
video applications and the growing need for more compression, thus justifying the improvement of
existing video compression solutions to achieve the same quality with a lower bitrate.
Next, Chapter 2 reviewed the conceptual and technical background of this Thesis, notably the human
visual system and the most relevant perceptual video coding solutions in the literature.
The next two chapters report the main contributions and developments associated to this Thesis. Chapter
3 presented the first improved video codec, this means the H.264/AVC perceptual codec with
coefficients pruning. The additional perceptually driven method, called perceptual coefficients pruning,
builds on the basic idea of setting to zero all the transform coefficients which have a magnitude lower
than the corresponding JND threshold since these coefficients should be perceptually irrelevant.
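The pruning rule just described can be sketched as follows; this is a minimal illustration assuming the transform coefficients and the JND thresholds arrive as equally shaped arrays (e.g. one 4x4 transform block each), not the actual reference software implementation:

```python
import numpy as np

def prune_coefficients(coeffs, jnd_thresholds):
    """Zero every transform coefficient whose magnitude is below its
    JND threshold: such coefficients are taken as perceptually
    irrelevant and need not be transmitted."""
    pruned = coeffs.copy()
    pruned[np.abs(coeffs) < jnd_thresholds] = 0
    return pruned
```

Coefficients exactly at the threshold are kept here; whether the comparison is strict is a detail of the sketch, not of the thesis method.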
Consequently, some changes had to be made in the codec architecture, notably the inclusion of two new
modules: the JND model which determines the JND thresholds and the Coefficients Pruning module
which implements the pruning process. To evaluate the RD performance associated to the additional tool,
two objective quality metrics have been used: PSNR and MS-SSIM. Other relevant metrics assessed
were the average number of zeroed coefficients at MB level due to the perceptual coefficients pruning,
which intends to evaluate the impact of the method in the codec in terms of the number of zeroed
coefficients, and the average zigzag position of the zeroed coefficients exclusively due to the perceptual
coefficients pruning method, which intends to give an idea of where the zeroed coefficients are located in terms of
bandwidth. The main conclusion was that the adopted perceptual coefficients pruning method may have
some positive impact on the RD performance, notably for sequences with lower luminance levels and
object patterns, and especially for low QP values and higher resolutions. The H.264/AVC perceptual
codec with coefficients pruning can achieve evident PSNR RD gains only for the higher resolutions and
the higher rates. The rate gains go up to 14% for the last RD point and the PSNR gains go up to 0.4 dB
for rates around 8 Mbit/s. The MS-SSIM RD performance can achieve a rate gain up to 7% for low and
medium resolutions and 14% for high resolutions, both for the last RD point. The average number of
zeroed coefficients due to the perceptual coefficients pruning method can go up to around 9.7 coefficients
(60.6%) for the Playing_cards sequence in the lowest G. These zeroed coefficients are more likely to be
located between the 8th and 12th zigzag positions in a 4x4 block.
Afterwards, Chapter 4 presented the second perceptually driven tool and the corresponding improved codec as
well as an improved codec including both additional tools. The second tool is a JND adaptive quantization
which has the intent to adapt the QP value based on the computed JND thresholds considering the
human visual system is not sensitive to changes below the JND threshold values and thus no rate should
be used to provide a coefficient’s accuracy which cannot be ‘consumed’. Consequently, some changes in
the architecture have been implemented, notably again the JND model computation and also a JND
adaptive quantization module which computes the adaptive QP values for each MB in each frame based
on the QP initially determined by the H.264/AVC reference software. For the RD performance evaluation,
four objective quality metrics have been used, notably again the PSNR and MS-SSIM and two additional
metrics, the VQM and the RP compensated VQM. For the JND_QP codec, the VQM RD performance
shows some rate gains for the high resolution sequences. The RP compensated performance shows a
RP compensated VQM gain up to 0.03 and a rate gain up to 33%. The joint solution, this means the
H.264/AVC perceptual codec including the coefficients pruning and JND adaptive QP (JND_CP+QP),
shows a VQM RD performance with rate gains up to 35% and VQM gains up to 0.03, both for the last
RD point. The RP compensated VQM performance shows rate gains of 35% for the last RD
point and a RP compensated VQM gain up to 0.04 for high resolution sequences. To sum up, the main
conclusions of this chapter are that the PSNR and the MS-SSIM are not able to express the RD
performance gains obtained with the proposed JND_QP and JND_CP+QP codecs because they are not
designed to efficiently measure the subjective quality and, thus, have a low correlation with subjective
quality scores. However, the effective assessment of the RD performance gains was possible with the
VQM objective quality metric and its RP compensated version.
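The JND adaptive quantization idea summarized above can be sketched as follows. This is only an illustration: the linear scaling rule, the ±3 clamp and the function name are assumptions made for the sketch, not the adaptation law actually implemented in the thesis; the only facts taken from the source are that the per-macroblock QP is derived from the base QP and the JND thresholds, and that H.264/AVC QP values lie in 0..51.

```python
def jnd_adaptive_qp(base_qp, mb_jnd, frame_avg_jnd, max_delta=3):
    """Toy sketch of JND adaptive QP: raise the QP for macroblocks
    whose JND threshold is above the frame average (more distortion
    can be masked there) and lower it otherwise."""
    # Relative deviation of this MB's masking capability, scaled and clamped.
    delta = round(max_delta * (mb_jnd - frame_avg_jnd) / frame_avg_jnd)
    delta = max(-max_delta, min(max_delta, delta))
    # Keep the result inside the legal H.264/AVC QP range 0..51.
    return max(0, min(51, base_qp + delta))
```

A macroblock whose JND threshold matches the frame average keeps the base QP; highly masked macroblocks get quantized more coarsely, saving rate where the distortion cannot be perceived.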
5.2. Future Work
Despite the encouraging results achieved with the combination of the two perceptually driven tools
described in this Thesis, the developed video coding solutions still leave room for improvement.
An important module in the H.264/AVC standard that was not improved but is a good candidate for
perceptually related improvements is the motion estimation. The motion estimation module may be
improved using the same basic ideas that were used to improve the transform and quantization
processes, following the computation of the JND model. In this context, a method similar to the one
presented in [30] may be developed; the basic idea is to compute the distortion metric comparing the
original and the prediction blocks to determine the prediction error using only the perceptually relevant
residuals based on some filtering with the relevant JND thresholds. In fact, it is not a perfect solution to
apply the perceptual coefficients pruning method to the transformed coefficients, discarding some of the
perceptually irrelevant coefficients, and then perform the motion estimation process still considering the
coefficients corresponding to those previously discarded to determine the best MB prediction and define
the motion vector. A better solution seems to be, for example, the adoption of a JND thresholded SAD in
the motion estimation module, allowing a more perceptually coherent video codec to be obtained.
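A JND thresholded SAD along these lines could look as follows; treating residuals at or below the JND threshold as contributing zero distortion is an assumption inspired by the idea in [30], not a reproduction of its exact metric:

```python
import numpy as np

def jnd_thresholded_sad(original, prediction, jnd):
    """Sketch of a JND thresholded SAD: residual samples whose
    magnitude does not exceed the JND threshold are treated as
    perceptually irrelevant and add nothing to the distortion."""
    # Cast to int so the subtraction of 8-bit samples cannot wrap around.
    residual = np.abs(original.astype(int) - prediction.astype(int))
    return int(np.sum(np.where(residual > jnd, residual, 0)))
```

Used inside the block-matching loop in place of the plain SAD, this would steer the motion search towards predictions whose errors are perceptually invisible rather than merely numerically small.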
References
[1] G. Lin and S. Zheng, "Perceptual Importance Analysis for H.264/AVC Bit Allocation," Journal of
Zhejiang University SCIENCE A, vol. 9, no. 2, pp. 225 - 231, July 2007.
[2] M. Jacobs and J. Probell, "A Brief History Of Video Coding", Oct. 2009, [Online].
http://www.arc.com/upload/download/whitepapers/A_Brief_History_of_Video_Coding_wp.pdf
[3] ITU-T Rec. H.261, "Video Codec for Audio-Visual Services at 64-1920 kbit/s," 1993.
[4] D. Le Gall, "MPEG: A Video Compression Standard For Multimedia Applications," 1991.
[5] ISO/IEC 11172-2 MPEG-1, "Coding Of Moving Pictures and Associated Audio For Digital Storage
Media At Up To About 1.5 Mbps," Part 2: Video, 1991.
[6] ISO/IEC 13818-2 MPEG-2, "Generic Coding of Moving Pictures and Associated Audio: Video," same
as ITU-T Rec. H.262, 1995.
[7] B. G. Haskell, A. Puri, and A. N. Netravali, Digital Video: An Introduction to MPEG-2. New York, USA:
Chapman & Hall, 1997.
[8] K. Rijkse, "ITU-T Recommendation H.263: Video Coding for Low-Bit-Rate Communication," IEEE
Communications Magazine, pp. 42-45, December 1996.
[9] I.E.G. Richardson, H.264 And MPEG-4 Video Compression: Video Coding For Next-Generation
Multimedia. Chichester: John Wiley & Sons, 2003.
[10] JVT of ISO/IEC MPEG And ITU-T VCEG, "ITU-T Recommendation And Final Draft International
Standard Of Joint Video Specification (ITU-T Rec. H.264|ISO/IEC 14496-10 AVC)," JVT-G050, 2003.
[11] E. Manoel, "Codificação de Vídeo H.264 – Estudo De Codificação Mista De Macroblocos," in
Florianópolis, 2007.
[12] T. Wiegand, G.J. Sullivan, G. Bjontegaard, and A. Luthra, "Overview Of The H.264/AVC Video
Coding Standard," IEEE Transactions on Circuits and Systems for Video Technology, vol. 13, no. 7, pp. 560 - 576, July 2003.
[13] "Eye", Oct. 2009, [Online]. http://en.wikipedia.org/wiki/Eye
[14] J. Martim do Santo et al., Anatomia Geral Moreno, 3rd ed.: Egas Moniz Publicações, 2005.
[15] "Olho - Anatomia", Oct. 2009, [Online].
http://www.medipedia.pt/home/home.php?module=artigoEnc&id=505
[16] "The Anatomy of Vision - Page 2", Oct. 2009, [Online].
http://brainconnection.positscience.com/topics/?main=anat/vision-anat2
[17] "Vernier Tech Info Library TIL #1014", March 2009, [Online]. http://www.vernier.com/til/1014.html
[18] Y. Qiao, Q. Hu, G. Qian, S. Luo, and W. L. Nowinski, "Thresholding based on variance and intensity
contrast," Pattern Recognition, vol. 40, no. 2, pp. 596 - 608, July 2006.
[19] "The Human Visual System", Oct. 2009, [Online].
http://www.dip.ee.uct.ac.za/~nicolls/lectures/eee401f/hvs.pdf
[20] D. Taubman and M.W. Marcellin, "JPEG2000: Image Compression Fundamentals," in Standards and
Practice, Kluwer, Boston, 2002.
[21] K. Minoo and T.Q. Nguyen, "Perceptual Video Coding With H.264," in Conference Record of the
Thirty-Ninth Asilomar Conference, Pacific Grove, CA, USA, 2005, pp. 741-745.
[22] "Visual Perception", Oct. 2009, [Online]. http://en.wikipedia.org/wiki/Visual_perception
[23] V. Bruce, P.R. Green, and M.A. Georgeson, Visual Perception, 3rd ed.: Psychology Press, 1996.
[24] S.M.C. Nascimento, "Optica Fisiológica II", Oct. 2009, [Online].
http://www.arauto.uminho.pt/pessoas/smcn/OFII/optica%20fisiologica%20II%20cap1.pdf
[25] "Stevens' Power Law", Oct. 2009, [Online]. http://en.wikipedia.org/wiki/Stevens'_power_law
[26] R. Clarke, Digital Compression Of Still Images And Video. London, England: Academic Press, 1995.
[27] Z. Chen and C. Guillemot, "Perceptually-Friendly H264/AVC Video Coding," in 2009 IEEE
International Conference On Image Processing, Cairo, Egypt, 7-10 Nov. 2009.
[28] Z. Chen and C. Guillemot, "Perceptually-Friendly H264/AVC Video Coding Based On Foveated Just-
Noticeable-Distortion Model," INRIA/IRISA, France, 2009.
[29] "Focus", Jan. 2010, [Online]. http://mathworld.wolfram.com/Focus.html
[30] C. Mak and K.N. Ngan, "Enhancing Compression Rate By Just-Noticeable Distortion Model For
H.264/AVC," in ISCAS 2009, IEEE International Symposium on Circuits and Systems 2009, Taipei, 24-
27 May 2009, pp. 609 - 612.
[31] Z. Mai, C. Yang, K. Kuang, and L. Po, "A Novel Motion Estimation Method Based On Structural
Similarity For H.264 Inter Prediction," in ICASSP 2006 Proceedings, IEEE International Conference on
Acoustics, Speech and Signal Processing, vol. 2, Toulouse, France, May 2006, pp. 913-916.
[32] C. Huang and C. Lin, "A Novel 4-D Perceptual Quantization Modelling For H.264 Bit-Rate Control,"
IEEE Transactions on Multimedia, vol. 9, no. 6, pp. 1113 - 1124, Oct. 2007.
[33] M. Naccari and F. Pereira, "Comparing Spatial Masking Modelling In Just Noticeable Distortion
Controlled H.264/AVC Video Coding," in 11th WIAMIS, Workshop on Image Analysis for Multimedia
Interactive Services WIAMIS, vol. 1, Desenzano del Garda, Italy, April 2010, pp. 1-4.
[34] T. Tan, G. Sullivan, and T. Wedi, "Recommended Simulation Common Conditions For Efficiency
Experiments, revision 2," in VCEG-AH10r3, 34th Meeting, Antalya, Turkey, 12-12 Jan. 2008.
[35] Z. Wang, E. Simoncelli, and A. Bovik, "Multi-Scale Structural Similarity For Image Quality
Assessment," in Proceedings of the 37th IEEE Asilomar Conference on Signals, Systems and
Computers, Pacific Grove, CA, Nov. 2003, pp. 1398-1402.
[36] Y. Wang, "Survey of Objective Video Quality Measurements," EMC Corporation Hopkinton, USA,
2006.
[37] F. Xiao, "DCT-based Video Quality Evaluation", 2000, [Online].
http://compression.ru/video/quality_measure/vqm.
[38] M.H. Pinson and S. Wolf, "A New Standardized Method For Objectively Measuring Video Quality,"
IEEE Transactions on Broadcasting, vol. 50, no. 3, pp. 312-322, September 2004.
[39] M. Naccari and F. Pereira, "Advanced H.264/AVC Based Perceptual Video Coding: Architecture,
Tools and Assessment," 2010.
[40] F. Pereira, "Comunicações de Áudio e Vídeo".
[41] ITU-T Rec. H.263, "Video Codec For Low Bit Rate Communication," 1996.
Annex A
This annex, regarding the results of Chapter 3, includes two tables with the average rate, PSNR and MS-SSIM
for each RD point and the variation in percentage of each one between the H.264/AVC High profile
codec and the H.264/AVC based perceptual codec with coefficients pruning, as well as the overall average
and the variance of the average zigzag position of the zeroed coefficients exclusively due to the perceptual
coefficients pruning method (4x4 blocks).
Columns: Bitrate_HP [kbit/s] | Bitrate_JND_CP [kbit/s] | ∆Bitrate [%] | PSNR_HP [dB] | PSNR_JND_CP [dB] | ∆PSNR [%] | MS-SSIM_HP | MS-SSIM_JND_CP | ∆MS-SSIM [%]

Foreman
QP1 3986.190 3677.970 -7.732 46.149 45.698 -0.977 0.9989 0.9988 -0.010
QP2 2339.400 2185.620 -6.573 43.494 43.297 -0.453 0.9983 0.9982 -0.010
QP3 704.600 681.480 -3.281 38.859 38.848 -0.028 0.9958 0.9958 0.000
QP4 332.120 322.340 -2.945 35.994 35.995 0.003 0.9921 0.9921 0.000
QP5 169.280 164.100 -3.060 33.128 33.155 0.082 0.9841 0.9843 0.020
QP6 91.670 89.300 -2.585 30.378 30.401 0.076 0.9675 0.9678 0.031
Mobile
QP1 7080.900 6517.450 -7.957 45.285 44.651 -1.400 0.9997 0.9996 -0.010
QP2 4900.190 4565.440 -6.831 42.130 41.839 -0.691 0.9995 0.9995 0.000
QP3 1879.300 1794.300 -4.523 36.167 36.154 -0.036 0.9985 0.9985 0.000
QP4 843.590 813.040 -3.621 32.569 32.579 0.031 0.9967 0.9967 0.000
QP5 375.050 360.580 -3.858 29.293 29.317 0.082 0.9927 0.9927 0.000
QP6 172.660 164.510 -4.720 26.069 26.105 0.138 0.9825 0.9826 0.010
Panslow
QP1 108443.010 101542.820 -6.363 45.174 44.706 -1.036 0.9982 0.9981 -0.010
QP2 62850.690 59207.970 -5.796 42.003 41.848 -0.369 0.9964 0.9963 -0.010
QP3 6631.300 6405.530 -3.405 37.233 37.244 0.030 0.9897 0.9897 0.000
QP4 1810.460 1745.520 -3.587 35.624 35.633 0.025 0.9849 0.9850 0.010
QP5 774.200 749.330 -3.212 33.865 33.871 0.018 0.9800 0.9801 0.010
QP6 402.520 392.560 -2.474 31.711 31.738 0.085 0.9717 0.9718 0.010
Spincalendar
QP1 117966.860 109393.340 -7.268 45.338 44.891 -0.986 0.9991 0.9990 -0.010
QP2 71097.700 66182.000 -6.914 42.126 41.982 -0.342 0.9978 0.9977 -0.010
QP3 10567.280 10069.510 -4.710 37.174 37.186 0.032 0.9940 0.9940 0.000
QP4 3125.310 2998.330 -4.063 35.211 35.223 0.034 0.9911 0.9911 0.000
QP5 1336.660 1279.740 -4.258 32.736 32.743 0.021 0.9842 0.9842 0.000
QP6 716.030 687.720 -3.954 30.052 30.062 0.033 0.9683 0.9686 0.031
Playing_cards
QP1 83313.160 71955.020 -13.633 47.156 46.815 -0.723 0.9990 0.9990 0.000
QP2 50165.110 43041.640 -14.200 44.905 44.696 -0.465 0.9984 0.9982 -0.020
QP3 12351.070 10776.830 -12.746 41.159 41.193 0.083 0.9937 0.9938 0.010
QP4 4332.970 4063.390 -6.222 38.887 38.936 0.126 0.9883 0.9885 0.020
QP5 2054.630 1965.280 -4.349 36.421 36.474 0.146 0.9790 0.9792 0.020
QP6 1084.940 1043.630 -3.808 33.500 33.547 0.140 0.9589 0.9595 0.063
Toys_and_calendar
QP1 93142.660 80918.700 -13.124 45.442 45.151 -0.640 0.9993 0.9992 -0.010
QP2 50370.440 44278.430 -12.094 42.846 42.725 -0.282 0.9983 0.9982 -0.010
QP3 7108.890 6574.970 -7.511 39.423 39.447 0.061 0.9919 0.9920 0.010
QP4 2604.550 2466.810 -5.288 37.926 37.958 0.084 0.9880 0.9881 0.010
QP5 1357.070 1294.060 -4.643 36.055 36.102 0.130 0.9799 0.9802 0.031
QP6 814.220 778.110 -4.435 33.872 33.922 0.148 0.9673 0.9677 0.041
Average 19924.908 18087.427 -5.993 37.927 37.837 -0.190 0.9890 0.9891 0.006
Table A-1 – Average rate, PSNR and MS-SSIM for each sequence for various RD points, and their variation in percentage.
Columns: Avg_zigzag_position_4x4 | Var_avg_zigzag_position_4x4

Foreman
QP1 10.536 1.499
QP2 10.210 1.601
QP3 9.594 1.840
QP4 9.289 1.978
QP5 9.045 2.109
QP6 8.848 2.227
Mobile
QP1 11.040 1.432
QP2 10.822 1.492
QP3 10.294 1.633
QP4 9.915 1.741
QP5 9.538 1.875
QP6 9.189 2.031
Panslow
QP1 10.540 1.516
QP2 10.131 1.624
QP3 9.271 2.009
QP4 8.976 2.179
QP5 8.811 2.273
QP6 8.688 2.345
Spincalendar
QP1 10.669 1.515
QP2 10.265 1.629
QP3 9.565 1.888
QP4 9.285 2.002
QP5 9.056 2.111
QP6 8.853 2.226
Playing_cards
QP1 10.245 1.521
QP2 9.830 1.666
QP3 9.199 1.981
QP4 8.946 2.134
QP5 8.764 2.254
QP6 8.621 2.357
Toys_and_calendar
QP1 10.311 1.500
QP2 9.878 1.652
QP3 9.201 2.006
QP4 8.952 2.162
QP5 8.780 2.279
QP6 8.652 2.371
Average 9.550 1.907
Table A-2 – Overall average of the average zigzag position for a 4x4 block and the variance of the average zigzag position for a 4x4 block for each RD point.
Annex B
This annex, regarding the results of Chapter 4, includes five tables with the average rate, PSNR, MS-SSIM,
VQM and RP compensated VQM for the four codecs (H.264/AVC High profile codec, H.264/AVC based
perceptual codec with coefficients pruning, H.264/AVC based perceptual codec with JND adaptive QP
and H.264/AVC based perceptual codec including coefficients pruning and JND adaptive QP) and the
variation in percentage between the H.264/AVC High profile codec and each one of the remaining codecs.
Columns: Bitrate [kbit/s] (HP | JND_CP | JND_QP | JND_CP+QP), then ∆Bitrate [%] (JND_CP | JND_QP | JND_CP+QP)

Foreman
G1 3986.19 3677.97 2692.69 2603.98 -7.7322 -32.4495 -34.675
G2 2339.4 2185.62 1603.31 1566.94 -6.57348 -31.4649 -33.0196
G3 704.6 681.48 541.02 533.78 -3.28129 -23.216 -24.2435
G4 332.12 322.34 271.45 267.54 -2.94472 -18.2675 -19.4448
G5 169.28 164.1 146.07 143.36 -3.06002 -13.711 -15.3119
G6 91.67 89.3 82.26 81.4 -2.58536 -10.2651 -11.2032
Mobile
G1 7080.9 6517.45 5197.84 5046.02 -7.95732 -26.5935 -28.7376
G2 4900.19 4565.44 3537.38 3509.24 -6.83137 -27.8114 -28.3856
G3 1879.3 1794.3 1369.55 1367.81 -4.52296 -27.1245 -27.217
G4 843.59 813.04 627.5 627 -3.62143 -25.6155 -25.6748
G5 375.05 360.58 294.97 294.58 -3.85815 -21.3518 -21.4558
G6 172.66 164.51 143.96 143.54 -4.72026 -16.6223 -16.8655
Panslow
G1 108443.010 101542.820 77407.46 75164.7 -6.36296 -28.6192 -30.6874
G2 62850.690 59207.970 41795.19 40726.63 -5.79583 -33.5008 -35.201
G3 6631.300 6405.530 5098.64 4689.07 -3.40461 -23.1125 -29.2888
G4 1810.460 1745.520 1552.97 1500.74 -3.58693 -14.2224 -17.1073
G5 774.200 749.330 735.87 703.23 -3.21235 -4.95092 -9.16688
G6 402.520 392.560 399.85 380.42 -2.47441 -0.66332 -5.49041
Spincalendar
G1 117966.860 109393.340 87532.46 84896.49 -7.26774 -25.7991 -28.0336
G2 71097.700 66182.000 50307.36 49302.84 -6.91401 -29.2419 -30.6548
G3 10567.280 10069.510 9064.95 8147.54 -4.71048 -14.2168 -22.8984
G4 3125.310 2998.330 2823.85 2708.68 -4.06296 -9.64576 -13.3308
G5 1336.660 1279.740 1265.4 1217.8 -4.25838 -5.3312 -8.89231
G6 716.030 687.720 697.19 671.66 -3.95374 -2.63117 -6.19667
Playing_cards
G1 83313.160 71955.020 62093.94 58585.48 -13.6331 -25.4692 -29.6804
G2 50165.110 43041.640 37874.46 35454.55 -14.2 -24.5004 -29.3243
G3 12351.070 10776.830 10082.08 9016.3 -12.7458 -18.3708 -26.9998
G4 4332.970 4063.390 3996.48 3599.42 -6.2216 -7.7658 -16.9295
G5 2054.630 1965.280 1998.25 1785.35 -4.34871 -2.74405 -13.106
G6 1084.940 1043.630 1094.85 963.99 -3.80758 0.913415 -11.1481
Toys_and_calendar
G1 93142.660 80918.700 66494.8 63388.33 -13.1239 -28.6097 -31.9449
G2 50370.440 44278.430 34917.75 33764.27 -12.0944 -30.6781 -32.9681
G3 7108.890 6574.970 6080.46 5655.03 -7.5106 -14.4668 -20.4513
G4 2604.550 2466.810 2426.31 2257.89 -5.28844 -6.84341 -13.3098
G5 1357.070 1294.060 1319.04 1219.74 -4.64309 -2.80236 -10.1196
G6 814.220 778.110 806.56 739.86 -4.43492 -0.94078 -9.13267
Table B-1 – Average rate for each sequence for various RD points for different codecs and their variation in percentage.
Columns: PSNR [dB] (HP | JND_CP | JND_QP | JND_CP+QP), then ∆PSNR [%] (JND_CP | JND_QP | JND_CP+QP)

Foreman
G1 46.149 45.698 42.408 42.274 -0.97727 -8.10635 -8.39671
G2 43.494 43.297 40.046 40.032 -0.45294 -7.92753 -7.95972
G3 38.859 38.848 36.324 36.324 -0.02831 -6.52359 -6.52359
G4 35.994 35.995 33.953 33.953 0.002778 -5.67039 -5.67039
G5 33.128 33.155 31.618 31.625 0.081502 -4.55808 -4.53695
G6 30.378 30.401 29.43 29.534 0.075713 -3.12068 -2.77833
Mobile
G1 45.285 44.651 41.122 40.825 -1.40002 -9.19289 -9.84874
G2 42.13 41.839 38.146 38.117 -0.69072 -9.45644 -9.52528
G3 36.167 36.154 33.326 33.316 -0.03594 -7.85523 -7.88288
G4 32.569 32.579 30.452 30.433 0.030704 -6.50005 -6.55838
G5 29.293 29.317 27.717 27.726 0.081931 -5.38012 -5.3494
G6 26.069 26.105 24.969 24.978 0.138095 -4.21957 -4.18505
Panslow
G1 45.174 44.706 42.248 42.024 -1.03599 -6.47718 -6.97304
G2 42.003 41.848 39.796 39.784 -0.36902 -5.25439 -5.28296
G3 37.233 37.244 36.531 36.506 0.029544 -1.88542 -1.95257
G4 35.624 35.633 35.091 35.076 0.025264 -1.49618 -1.53829
G5 33.865 33.871 33.388 33.346 0.017717 -1.40853 -1.53256
G6 31.711 31.738 31.258 31.185 0.085144 -1.42853 -1.65873
Spincalendar
G1 45.338 44.891 42.725 42.48 -0.98593 -5.76338 -6.30376
G2 42.126 41.982 40.315 40.296 -0.34183 -4.29901 -4.34411
G3 37.174 37.186 36.69 36.579 0.032281 -1.30199 -1.60058
G4 35.211 35.223 34.727 34.646 0.03408 -1.37457 -1.60461
G5 32.736 32.743 32.306 32.205 0.021383 -1.31354 -1.62207
G6 30.052 30.062 29.724 29.62 0.033276 -1.09144 -1.43751
Playing_cards
G1 47.156 46.815 45.292 45.134 -0.72313 -3.95284 -4.2879
G2 44.905 44.696 43.447 43.391 -0.46543 -3.24685 -3.37156
G3 41.159 41.193 40.346 40.263 0.082606 -1.97527 -2.17692
G4 38.887 38.936 38.157 38.102 0.126006 -1.87723 -2.01867
G5 36.421 36.474 35.701 35.664 0.14552 -1.97688 -2.07847
G6 33.500 33.547 32.936 32.954 0.140299 -1.68358 -1.62985
Toys_and_calendar
G1 45.442 45.151 43.507 43.373 -0.64038 -4.25818 -4.55306
G2 42.846 42.725 41.604 41.593 -0.28241 -2.89875 -2.92443
G3 39.423 39.447 39.109 39.069 0.060878 -0.79649 -0.89795
G4 37.926 37.958 37.616 37.604 0.084375 -0.81738 -0.84902
G5 36.055 36.102 35.77 35.736 0.130356 -0.79046 -0.88476
G6 33.872 33.922 33.503 33.55 0.147615 -1.0894 -0.95064
Table B-2 – Average PSNR for each sequence for various RD points for different codecs and their variation in percentage.
Columns: MS-SSIM (HP | JND_CP | JND_QP | JND_CP+QP), then ∆MS-SSIM [%] (JND_CP | JND_QP | JND_CP+QP)

Foreman
G1 0.9989 0.9988 0.9981 0.9981 -0.01001 -0.08009 -0.08009
G2 0.9983 0.9982 0.9971 0.9971 -0.01002 -0.1202 -0.1202
G3 0.9958 0.9958 0.9932 0.9933 0 -0.2611 -0.25105
G4 0.9921 0.9921 0.9881 0.9882 0 -0.40319 -0.39311
G5 0.9841 0.9843 0.9787 0.979 0.020323 -0.54872 -0.51824
G6 0.9675 0.9678 0.9609 0.9617 0.031008 -0.68217 -0.59948
Mobile
G1 0.9997 0.9996 0.9993 0.9993 -0.01 -0.04001 -0.04001
G2 0.9995 0.9995 0.9989 0.9989 0 -0.06003 -0.06003
G3 0.9985 0.9985 0.997 0.997 0 -0.15023 -0.15023
G4 0.9967 0.9967 0.9942 0.9942 0 -0.25083 -0.25083
G5 0.9927 0.9927 0.9881 0.9882 0 -0.46338 -0.45331
G6 0.9825 0.9826 0.9741 0.9744 0.010178 -0.85496 -0.82443
Panslow
G1 0.9982 0.9981 0.9968 0.9968 -0.01002 -0.14025 -0.14025
G2 0.9964 0.9963 0.9942 0.9942 -0.01004 -0.22079 -0.22079
G3 0.9897 0.9897 0.9875 0.9874 0 -0.22229 -0.23239
G4 0.9849 0.9850 0.9838 0.9838 0.010153 -0.11169 -0.11169
G5 0.9800 0.9801 0.979 0.9791 0.010204 -0.10204 -0.09184
G6 0.9717 0.9718 0.9698 0.9699 0.010291 -0.19553 -0.18524
Spincalendar
G1 0.9991 0.9990 0.9982 0.9981 -0.01001 -0.09008 -0.10009
G2 0.9978 0.9977 0.9967 0.9966 -0.01002 -0.11024 -0.12026
G3 0.9940 0.9940 0.9933 0.9931 0 -0.07042 -0.09054
G4 0.9911 0.9911 0.9898 0.9896 0 -0.13117 -0.15135
G5 0.9842 0.9842 0.9813 0.981 0 -0.29466 -0.32514
G6 0.9683 0.9686 0.9621 0.9619 0.030982 -0.6403 -0.66095
Playing_cards
G1 0.9990 0.9990 0.9986 0.9985 0 -0.04004 -0.05005
G2 0.9984 0.9982 0.9975 0.9975 -0.02003 -0.09014 -0.09014
G3 0.9937 0.9938 0.993 0.9927 0.010063 -0.07044 -0.10063
G4 0.9883 0.9885 0.9868 0.9866 0.020237 -0.15178 -0.17201
G5 0.9790 0.9792 0.9757 0.9757 0.020429 -0.33708 -0.33708
G6 0.9589 0.9595 0.9531 0.9543 0.062572 -0.60486 -0.47972
Toys_and_calendar
G1 0.9993 0.9992 0.9984 0.9984 -0.01001 -0.09006 -0.09006
G2 0.9983 0.9982 0.9965 0.9966 -0.01002 -0.18031 -0.17029
G3 0.9919 0.9920 0.9911 0.9911 0.010082 -0.08065 -0.08065
G4 0.9880 0.9881 0.9868 0.987 0.010121 -0.12146 -0.10121
G5 0.9799 0.9802 0.9785 0.9787 0.030615 -0.14287 -0.12246
G6 0.9673 0.9677 0.9646 0.9653 0.041352 -0.27913 -0.20676
Table B-3 – Average MS-SSIM for each sequence for various RD points for different codecs and their variation in percentage.
Columns: VQM (HP | JND_CP | JND_QP | JND_CP+QP), then ∆VQM [%] (JND_CP | JND_QP | JND_CP+QP)

Foreman
G1 0.013307 0.014418 0.021743 0.022643 8.348989 63.39521 70.15856
G2 0.018807 0.020201 0.032642 0.033624 7.412134 73.56304 78.7845
G3 0.048995 0.054603 0.100631 0.099747 11.44607 105.3903 103.5861
G4 0.133673 0.136991 0.207254 0.201878 2.482177 55.04552 51.02377
G5 0.274849 0.282759 0.341177 0.34626 2.877944 24.13252 25.9819
G6 0.464149 0.468676 0.529442 0.515567 0.975333 14.06725 11.07791
Mobile
G1 0.009942 0.011072 0.016049 0.016901 11.36592 61.42627 69.99598
G2 0.013334 0.014516 0.024797 0.024895 8.864557 85.9682 86.70316
G3 0.029509 0.036435 0.06874 0.068473 23.47081 132.9459 132.0411
G4 0.069373 0.077158 0.149015 0.152428 11.22195 114.8026 119.7224
G5 0.169302 0.186417 0.280525 0.28324 10.10915 65.69503 67.29867
G6 0.338682 0.346291 0.433036 0.436007 2.24665 27.85917 28.7364
Panslow
G1 0.007178 0.007118 0.009085 0.009038 -0.83589 26.56729 25.91251
G2 0.009717 0.009983 0.013355 0.013359 2.73747 37.43954 37.4807
G3 0.029884 0.031958 0.038648 0.039723 6.940169 29.32673 32.92397
G4 0.079645 0.082996 0.103395 0.102728 4.20742 29.81983 28.98236
G5 0.188009 0.185446 0.213475 0.205818 -1.36323 13.5451 9.472419
G6 0.329202 0.330458 0.353762 0.354971 0.381529 7.460465 7.827717
Spincalendar
G1 0.009211 0.00937 0.012542 0.012594 1.726197 36.16328 36.72783
G2 0.012474 0.012688 0.01916 0.019318 1.715568 53.59949 54.86612
G3 0.034804 0.036452 0.055895 0.057071 4.735088 60.59936 63.97828
G4 0.103307 0.108802 0.152354 0.155144 5.319097 47.47694 50.17763
G5 0.259347 0.265101 0.334353 0.335999 2.218649 28.9211 29.55577
G6 0.462608 0.467863 0.558566 0.557332 1.135951 20.74283 20.47608
Playing_cards
G1 0.03466 0.03448 0.038848 0.038888 -0.51933 12.08309 12.1985
G2 0.042045 0.042075 0.047589 0.048618 0.071352 13.18587 15.63325
G3 0.132868 0.135069 0.150024 0.161212 1.656531 12.91206 21.33245
G4 0.25712 0.257557 0.278281 0.283637 0.16996 8.230009 10.31308
G5 0.379949 0.382697 0.412839 0.41351 0.723255 8.656425 8.833028
G6 0.530624 1.025121 0.55516 0.556294 93.1916 4.62399 4.837701
Toys_and_calendar
G1 0.015473 0.015938 0.020392 0.02041 3.005235 31.79086 31.90719
G2 0.021257 0.022064 0.029842 0.029812 3.796396 40.3867 40.24557
G3 0.063034 0.06434 0.078961 0.080065 2.071898 25.26732 27.01875
G4 0.136806 0.137234 0.166383 0.161159 0.312852 21.61967 17.80112
G5 0.257569 0.257839 0.301546 0.285672 0.104826 17.07387 10.91086
G6 0.414189 0.414705 0.46101 0.453069 0.124581 11.30426 9.387019
Table B-4 – Average VQM for each sequence for various RD points for different codecs and their variation in percentage.
Columns: RP compensated VQM (HP | JND_CP | JND_QP | JND_CP+QP), then ∆RP_VQM [%] (JND_CP | JND_QP | JND_CP+QP)

Foreman
G1 0.013307 0.013307 0.013307 0.013307 0 0 0
G2 0.018807 0.018807 0.018807 0.018807 0 0 0
G3 0.048995 0.048995 0.048995 0.048995 0 0 0
G4 0.133673 0.133673 0.133673 0.133673 0 0 0
G5 0.274849 0.274849 0.274849 0.274849 0 0 0
G6 0.464149 0.464149 0.464149 0.464149 0 0 0
Mobile
G1 0.009942 0.009942 0.009942 0.009942 0 0 0
G2 0.013334 0.013334 0.013334 0.013334 0 0 0
G3 0.029509 0.029509 0.029509 0.029509 0 0 0
G4 0.069373 0.069373 0.149015 0.152428 0 114.803 119.7224
G5 0.169302 0.169302 0.169302 0.169302 0 0 0
G6 0.338682 0.338682 0.338682 0.338682 0 0 0
Panslow
G1 0.007178 0.007178 0.007178 0.007178 0 0 0
G2 0.009717 0.009717 0.009717 0.009717 0 0 0
G3 0.029884 0.029884 0.029884 0.029884 0 0 0
G4 0.079645 0.079645 0.079645 0.079645 0 0 0
G5 0.188009 0.188009 0.188009 0.188009 0 0 0
G6 0.329202 0.329202 0.329202 0.329202 0 0 0
Spincalendar
G1 0.009211 0.009211 0.009211 0.009211 0 0 0
G2 0.012474 0.012474 0.012474 0.012474 0 0 0
G3 0.034804 0.034804 0.034804 0.034804 0 0 0
G4 0.103307 0.103307 0.103307 0.103307 0 0 0
G5 0.259347 0.259347 0.259347 0.259347 0 0 0
G6 0.462608 0.462608 0.462608 0.462608 0 0 0
Playing_cards
G1 0.03466 0.03466 0.03466 0.03466 0 0 0
G2 0.042045 0.042045 0.042045 0.042045 0 0 0
G3 0.132868 0.132868 0.132868 0.132868 0 0 0
G4 0.25712 0.25712 0.25712 0.25712 0 0 0
G5 0.379949 0.379949 0.379949 0.379949 0 0 0
G6 0.530624 0.530624 0.530624 0.530624 0 0 0
Toys_and_calendar
G1 0.015473 0.015473 0.015473 0.015473 0 0 0
G2 0.021257 0.021257 0.021257 0.021257 0 0 0
G3 0.063034 0.063034 0.063034 0.063034 0 0 0
G4 0.136806 0.136806 0.136806 0.136806 0 0 0
G5 0.257569 0.257569 0.257569 0.257569 0 0 0
G6 0.414189 0.414189 0.414189 0.414189 0 0 0
Table B-5 – Average RP compensated VQM for each sequence for various RD points for different codecs and their variation in percentage.