

Vienna University of Technology

Faculty of Electrical Engineering and Information Technology

Institute of Communications and Radio-Frequency

Engineering

Master of Science Thesis

Subjective Audiovisual Quality

in Mobile Environment

by

Bruno Gardlo

Supervisor: Dr. Michal RIES

Professor: Prof. Markus RUPP

Vienna, 2009


Abstract

In today's world, mobile devices have become an important part of people's lives. We use our mobile phones, MP3 players, handheld devices, portable DVD players, laptops and cameras every day.

In recent years, the development of mobile phones has made a great step towards multifunctional devices. Nowadays, we use our mobile phones not just for telephony and messaging, but also for many other purposes, including watching, listening to and sharing multimedia content. There is great potential in television streaming (DVB-H) in the telecommunication market, and since mobile phones are turning into small pocket PCs, more and more customers demand video streaming services and video content on their mobile devices.

Another changing aspect of today's telecommunication market is the increasing number of customers using video call services instead of simple voice calls. All these multimedia services are critical for content and service providers in terms of perceived end-user quality. Users want high-quality multimedia content, while providers want to save as much bandwidth as possible. A trade-off between perceived quality and bandwidth has to be found, and for this purpose audiovisual quality tests are very important. It is inconvenient and expensive to measure end-user quality by subjective tests. Moreover, if the transmission scenario changes, the perceived quality changes as well, and the subjective test results no longer apply. Thus, there is great potential in exploring and developing new objective measurement methods for evaluating perceived audiovisual quality.

At the time of writing of this thesis, several objective metrics exist for measuring either the audio or the video perceived quality. Most of them are defined as reference metrics, so a reference signal is needed to evaluate the perceived quality. Another problem is that they deal with audio and video separately. However, research in this area shows that the audio and video components are closely connected, which is known as the mutual compensatory property of audiovisual content.

The goal of this work is to propose a non-reference objective audiovisual quality metric suitable for use in the mobile environment. Since the most used audio


codecs in the mobile environment are the Advanced Audio Codec (AAC) and the Adaptive Multi-Rate codec (AMR), we will explore the quality properties of these codecs. The explored video codec will be H.264/AVC, since at this time it is the most advanced video codec. New reference-free approaches for quality estimation are presented, based on motion characteristics of the video and a reference-free audio quality metric. Moreover, the proposed metric is compared with the most recent audiovisual metrics.


Contents

Abstract

1 Introduction
  1.1 Motivation

2 Audio quality
  2.1 Introduction
  2.2 Psychoacoustics
    2.2.1 Human Auditory System
    2.2.2 Psychoacoustic Principles
  2.3 Speech and Audio Coding Technologies
    2.3.1 Speech Coding standards
  2.4 Audio Content Estimation
    2.4.1 Audio parameters
    2.4.2 Speech detector
    2.4.3 LLR Test Based on κ and HZCRRM
    2.4.4 LLR Test Based on Mel-Frequency Cepstrum Coefficients
    2.4.5 Performance evaluation and comparison
  2.5 Audio Quality Estimation Algorithms
    2.5.1 Reference Audio Quality Metrics
    2.5.2 Non-reference Audio Quality Metrics

3 Video Quality
  3.1 Introduction
  3.2 Basic principles of video coding
    3.2.1 Video and Colour sampling
    3.2.2 New features in H.264
  3.3 Video Quality Estimation
    3.3.1 Quality estimation based on content sensitive parameters

4 Audiovisual quality
  4.1 Introduction
  4.2 Audiovisual quality assessment
    4.2.1 Test Methodology
    4.2.2 Encoder Settings
    4.2.3 Prior Art
  4.3 Feature extraction
    4.3.1 Video feature extraction
    4.3.2 Audio feature extraction
  4.4 Audiovisual quality estimation
  4.5 Performance evaluation

5 Conclusions

Bibliography

List of Symbols and Abbreviations

List of Figures

List of Tables


Chapter 1

Introduction

1.1 Motivation

Massive provisioning of mobile multimedia services and higher expectations of end-user quality bring new challenges for service and content providers. The most challenging part is improving the subjective quality of audio and audiovisual services. Due to the compression improvements of the video coding standard MPEG-4/AVC and the encoding efficiency of the AMR and AAC audio encoding standards, provisioning of audiovisual services is possible at low bit and frame rates while preserving perceptual quality. This is especially suitable for video applications in broadband wireless networks. The Universal Mobile Telecommunications System (UMTS) release 4 (implemented by the first UMTS network elements and terminals) provides a maximum data rate of 1920 kbit/s shared by all users in a cell; release 5 offers up to 14.4 Mbit/s in the downlink (DL) direction for High Speed Downlink Packet Access (HSDPA). For audio encoding, the following codecs are supported for UMTS video services [2]: the AMR speech codec, AAC Low Complexity (AAC-LC) and AAC Long Term Prediction (AAC-LTP). For video encoding, the following codecs are supported [2]: H.263, MPEG-4 and MPEG-4/AVC. The appropriate encoder settings for UMTS video services differ for different contents and streaming application settings (resolution, frame and bit rate) [3].

End-user quality is influenced by the following aspects: the mutual compensation effect between audio and video, the content, the encoding and network settings and, finally, the transmission conditions. Moreover, video and audio media not only interact, but there is even a synergy of the component media (audio and video) [4]. Therefore, the perceptual mutual compensation effect performs differently in videos with a dominant human voice than in other video contents [5]. Video contents with a dominant human voice are mainly news, interviews, talk shows, etc. Finally,


the audiovisual quality estimation models tuned for video contents with a dominant human voice perform better than a universal one [5]. Therefore, our focus within this work is on the design of speech detection algorithms for the mobile environment.

This thesis is organised as follows: Section 2 describes the audio properties of audiovisual content. The main goal of the audio part was the design of a speech detector. In recent years, speech detection has been extensively studied [6], [7], [8], [9]. The proposed algorithms for speech detection differ in computational complexity, environment of usage, accuracy and application. Our approach is to design a real-time speech detection algorithm suitable for the mobile environment. Therefore, the design was focused on an accurate, low-complexity method which is robust against audio compression artifacts. After the audio part, Section 3 describes some basics of video processing and gives a short overview of the video quality estimator based on the former work of my supervisor, M. Ries [3]. Section 4 describes the results of the subjective audiovisual survey. Finally, the new audiovisual metric for mobile streaming services is introduced.


Chapter 2

Audio quality

2.1 Introduction

As the main topic of my diploma thesis is audiovisual quality, it is important to give at least a brief overview of several audio properties. These sections explain the basics of psychoacoustic perception. Moreover, the state-of-the-art audio metrics are explained. Reference and reference-free audio metrics are differentiated and explained in separate sections. Finally, the new audio content estimator, developed for the purpose of this thesis, is described. The output of this estimator and the output of a non-reference audio quality metric will be further used in audiovisual quality estimation.

2.2 Psychoacoustics

2.2.1 Human Auditory System

This subsection gives an overview of sound signal processing in the human auditory system and the main psychoacoustic phenomena. Most modern lossy audio codecs and perceptual quality assessment methods are developed based on these psychoacoustic effects.

In the following section, the psychoacoustic mechanism of each component will be explained:

Pinna: The pinna pre-filters the incoming sound with a filter characteristic given by the Head Related Transfer Function (HRTF) [18].


Figure 2.1: Outer Ear.

Ear canal: The ear canal filters the sound further, with a resonance at around 5 kHz.

Cochlea: The cochlea is a fluid-filled coil within the ear and is partially protected by small bones.

Basilar membrane (BM): The basilar membrane semi-partitions the cochlea and acts as a spectrum analyser by spatially decomposing the signal into frequency components. Each point on the basilar membrane resonates at a different frequency (frequency-to-place transformation), and the frequency selectivity is given by the width of the filter at each of these points.

Outer hair cells: The outer hair cells are distributed along the length of the basilar membrane, and they change the resonant properties of the basilar membrane by reacting to feedback from the brainstem.

Inner hair cells:


Figure 2.2: Cochlea and basilar membrane.

Figure 2.3: Hair cells.

The inner hair cells transform the basilar membrane motion into neural firing, where stronger motions cause more impulses. The neuronal "firing" starts when the BM moves upwards, and this is the moment of the transformation from physical waves to physiological information, transducing the sound wave at each point into a


signal on the auditory nerve. Each cell needs a certain time to recover between firings, so the average response during a steady tone is lower than that at its onset. Thus, the inner hair cells act as an automatic gain control. The firing of any individual cell is pseudo-random, modulated by the movement of the BM [20].

In relation to audio signal processing and telecommunication, the human auditory system appears to encode an audio signal, which has a relatively wide bandwidth and a large dynamic range, for transmission along nerves which each offer a much narrower bandwidth and limited dynamic range. The critical point is that any information lost during the transduction process within the cochlea is not available to the brain: the cochlea is effectively a lossy coder.

2.2.2 Psychoacoustic Principles

Psychoacoustics deals with the relationship between physical sounds and the human brain's interpretation of them. The field of psychoacoustics has made significant progress toward characterising human auditory perception, and particularly the time-frequency analysis capabilities of the inner ear. Auditory models for assessing the perceived quality of coded audio signals simulate the functionality of the human ear and its characteristics, and have been used to predict audible and inaudible conditions in a variety of psychoacoustic listening tests.

Several psychoacoustic principles simulate the function of the human auditory system and are used to identify irrelevant information, which is not detectable even by well-trained listeners ("golden ears"). These psychoacoustic principles are:

• the absolute hearing thresholds

• the critical band frequency analysis

• the simultaneous masking and the spread of masking (along the basilar membrane)

• the temporal masking

The absolute threshold of hearing characterises the amount of energy needed in a pure tone such that it can be detected by a listener in a noiseless environment, and is expressed in terms of Sound Pressure Level (dB SPL). The quiet threshold is well approximated by the non-linear function [19]:

Tq(f) = 3.64 (f/1000)^(-0.8) - 6.5 e^(-0.6 (f/1000 - 3.3)^2) + 10^(-3) (f/1000)^4    (2.1)
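As a numerical sanity check, Eq. (2.1) can be evaluated directly. The following sketch (the function name tq_spl is ours, not from the thesis) reproduces the characteristic shape of the quiet-threshold curve: high thresholds at low frequencies, a minimum near 3-4 kHz, and a steep rise at high frequencies.

```python
import math

def tq_spl(f_hz):
    """Absolute threshold of hearing in dB SPL, Eq. (2.1); f_hz in Hertz."""
    f = f_hz / 1000.0  # the formula uses frequency in kHz
    return (3.64 * f ** -0.8
            - 6.5 * math.exp(-0.6 * (f - 3.3) ** 2)
            + 1e-3 * f ** 4)

# The curve is high at low frequencies, dips below 0 dB SPL near the
# ear's most sensitive region (around 3-4 kHz) and rises again above it.
for f in (100, 1000, 3300, 10000):
    print(f, round(tq_spl(f), 2))
```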


The curve is often referenced by audio codec designers by equating the lowest point, near 4 kHz, to the smallest possible output signal of their decoder, so that it is presented close to 0 dB SPL. The absolute hearing threshold curve is illustrated later in Figure 2.5 together with the effects of frequency masking.

The inner ear separates the frequencies and concentrates them at certain locations along the basilar membrane (frequency-to-place transformation), so it can be regarded as a complex system of a series of overlapping band-pass filters with asymmetrical, non-linear and level-dependent magnitude responses. The bandwidths of the cochlear band-pass filters are non-uniform and increase with increasing frequency. Where these bands should be centred, and how wide they should be, has been analysed through several psychoacoustic experiments. One of the psychoacoustic models for the centre frequencies of these band-pass filters is the critical-band rate scale, where frequencies are bundled into 25 critical bands with the unit name Bark. A distance of one critical band is commonly referred to as "one Bark", and the following equation is often used to convert from frequency in Hertz to the Bark scale [19]:

z(f) = 13 arctan(0.00076 f) + 3.5 arctan((f/7500)^2)    (2.2)
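Eq. (2.2) is straightforward to implement; the sketch below (the helper name hz_to_bark is ours) maps a frequency in Hertz to the Bark scale.

```python
import math

def hz_to_bark(f_hz):
    """Critical-band rate in Bark, Eq. (2.2); f_hz in Hertz."""
    return 13.0 * math.atan(0.00076 * f_hz) + 3.5 * math.atan((f_hz / 7500.0) ** 2)
```

For example, 1000 Hz maps to about 8.5 Bark, consistent with its position as the centre of band 9 (920-1080 Hz) in Table 2.1.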

The Bark scale is a nonlinear scale that describes the nonlinear, almost logarithmic processing in the ear. Table 2.1 shows the centre frequencies and bandwidths for each of the 25 Bark bands:

Band No.   Center Freq. (Hz)   Bandwidth (Hz)
   1              50                -100
   2             150             100-200
   3             250             200-300
   4             350             300-400
   5             450             400-510
   6             570             510-630
   7             700             630-770
   8             840             770-920
   9            1000             920-1080
  10            1175            1080-1270
  11            1370            1270-1480
  12            1600            1480-1720
  13            1850            1720-2000
  14            2150            2000-2320
  15            2500            2320-2700
  16            2900            2700-3150
  17            3400            3150-3700
  18            4000            3700-4400
  19            4800            4400-5300
  20            5800            5300-6400
  21            7000            6400-7700
  22            8500            7700-9500
  23           10500            9500-12000
  24           13500           12000-15500
  25           19500           15500-

Table 2.1: The center frequencies and bandwidth for each of the 25 Bark bands.

The "critical bandwidth" in Hertz is a function of centre frequency that quantifies the cochlear band-pass filter; it is conveniently approximated by [19]:

BWc(f) = 25 + 75 [1 + 1.4 (f/1000)^2]^0.69    (2.3)
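A small sketch (the function name is ours) shows that Eq. (2.3) reproduces the bandwidths listed in Table 2.1.

```python
def critical_bandwidth(f_hz):
    """Critical bandwidth in Hz around centre frequency f_hz, Eq. (2.3)."""
    return 25.0 + 75.0 * (1.0 + 1.4 * (f_hz / 1000.0) ** 2) ** 0.69

# At 1 kHz the formula gives roughly 160 Hz, matching the 920-1080 Hz
# band around the 1000 Hz centre frequency in Table 2.1; at low
# frequencies it approaches the roughly 100 Hz band widths.
print(round(critical_bandwidth(1000), 1))
print(round(critical_bandwidth(100), 1))
```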

Masking is the phenomenon whereby the perception of one sound is obscured by the perception of another; it can be explained in the frequency and time domains.

Simultaneous masking (strong and softer tones with the same frequency) and the spread of masking (strong and softer tones at nearby frequencies) refer to a frequency-domain phenomenon that can be observed whenever two or more


stimuli are simultaneously presented to the auditory system. The response of the auditory system is nonlinear, and the perception of a given tone is affected by the presence of other tones. The auditory channels for different tones interfere with each other, which leads to a complex auditory response: frequency masking. This means that a single tone (masker) is surrounded by its so-called masking threshold curve and masking bandwidth. Every single tone within this masking bandwidth whose sound pressure level (SPL) falls below the masking curve will be masked and thus be inaudible. The masking bandwidth depends on the frequency of the masking tone and increases with the SPL of the masking tone. The relation between sound pressure level and the masking threshold curve of a single 440 Hz masker is illustrated in Figure 2.4.

Figure 2.4: Simultaneous masking: relation between the masking threshold curve and the sound pressure level of a 440 Hz masker.

There is also a frequency relation to the width of the masking threshold curve: louder tones at higher frequencies will mask more neighbouring frequencies than softer tones at lower frequencies. So, ignoring the frequency components in the masking band whose levels fall below the masking curve does not cause any perceptual loss. In Figure 2.5, the masking phenomena of two maskers and the absolute hearing threshold curve are illustrated by a 1 kHz tone with a narrower masking threshold curve and a 4 kHz tone with a wider range under


the masking threshold curve. Both tones have a sound pressure level of 60 dB to demonstrate the frequency dependence of the width of the masking threshold curves. Figure 2.5 shows the superimposed individual masking threshold curves representing the complex auditory response to just two stimuli. To be audible, a third tone of 2 kHz must have a sound pressure level that lies above the superposition of the masking thresholds (labelled SP in Figure 2.5) at 2 kHz. The overall masking threshold curve obtained by superposition of the individual ones depends on the frequency distance between the maskers.

Figure 2.5: Absolute hearing threshold and frequency masking: a 2 kHz tone must have a sound pressure level over SP to be audible.

In the time domain, masking effects appear through temporal masking. Temporal masking describes the masking of a softer (test) tone by the presence of a stronger one (mask tone). The level of the masked signal depends on the time between the masker and the test tone. The stronger tone will mask softer tones with lower levels which appear a short time later (decay time). This temporal masking effect is based on the functionality of the human auditory system: the inner hair cells within the human ear need a "recovery time" after the strong mask tone until they are able to register the existence of the softer test tone. In the case of audio signals (e.g., the onset of a percussive musical instrument), abrupt signal transients create pre- and post-masking regions in time during which a listener will not perceive signals beneath the audibility thresholds produced by the masker. Pre- or "backward" masking is based on processing times in the ear and means that signals appearing just before the strong masker are masked [20]. Pre-masking occurs prior to masker onset and lasts only a few milliseconds, while post-masking may persist for more than 100 milliseconds after the masker is removed. Figure 2.6 shows an example of the masking curve, including pre- and post-masking:

Figure 2.6: Temporal pre- and post-masking.

2.3 Speech and Audio Coding Technologies

In speech and audio coding, digitised speech or audio signals are represented with as few bits as possible by removing redundancies and irrelevancies from the original signal.

The perceptual quality of such digitised speech or audio signals is a function of the available bit rate. Speech and audio codecs must take into account the encoding/decoding delay, the sound quality of the decoded signal and the transmission bandwidth. These requirements differ for speech and audio: in speech signals, for example, large signal changes are accepted, while for music such degradations are not. Most speech coding standards are developed to handle narrowband speech at a sampling frequency of 8 kHz (or wideband speech at a sampling frequency of 16 kHz), based on a model of speech production. For non-speech signals like music or background noise, however, such a source model does not work. Therefore, modern audio codecs employ psychoacoustic principles to model human auditory perception, with the goal of a transparent reproduction of the information that is relevant to human auditory perception.


AMR bit rate   GSM GMSK FR   GSM GMSK HR   GSM 8-PSK HR   WCDMA
4.75 kbps          Yes           Yes            Yes          Yes
5.15 kbps          Yes           Yes            Yes          Yes
5.90 kbps          Yes           Yes            Yes          Yes
6.70 kbps          Yes           Yes            Yes          Yes
7.40 kbps          Yes           Yes            Yes          Yes
7.95 kbps          Yes           Yes            Yes          Yes
10.2 kbps          Yes           -              Yes          Yes
12.2 kbps          Yes           -              Yes          Yes

Table 2.2: AMR modes in GSM and WCDMA.

2.3.1 Speech Coding standards

Adaptive Multi-Rate Codec

Adaptive Multi-Rate (AMR) was designed as a standard for improved voice quality in cellular services and greater capacity for the GSM system and UMTS technology. It was standardised for GSM Release 98 and 3GPP Release 99. Its great advantages over previous GSM speech codecs are the variable bit rate and its adaptive error concealment, in which the number of bits for error correction depends on the transmission conditions. The AMR speech codec adapts its error protection level to the local radio channel and traffic conditions and is a mandatory codec for 3G wireless networks. The narrowband AMR voice codec supports eight different speech coding modes with bit rates ranging from 4.75 kbps to 12.2 kbps at a sampling frequency of 8 kHz [17].

The wideband codec AMR-WB was developed as a multi-rate codec consisting of several codec modes, like the AMR-NB codec, and brings speech quality exceeding that of (narrowband) wireline quality to 3G and GSM/GERAN systems. Consequently, the wideband codec is referred to as the AMR Wideband (AMR-WB) codec. As in AMR-NB, the codec mode is chosen based on the operating conditions of the radio channel. Adapting the coding to the channel quality provides high robustness against transmission errors. The codec also includes a source-controlled rate operation mechanism, which allows it to encode speech at a lower average rate by taking speech inactivity into account.

Advanced Audio Codec

The Advanced Audio Codec (AAC) was specified and declared an international standard by MPEG in 1997 to increase the quality of audio coding of mono, stereo and multichannel signals. An international cooperation of the Fraunhofer Institute and companies like AT&T, Sony and Dolby developed this efficient


method for audio data compression. The driving force behind the development of AAC was the quest for an efficient coding method for surround signals, such as 5-channel signals. MPEG-2 AAC is the continuation of the coding method MPEG Audio Layer-3, and the sampling frequencies used range from 8 kHz to 96 kHz. It is not backward-compatible with the MPEG-1 standard, but its core supports newer coding standards such as MPEG-4. Compared to coding methods such as MPEG-2 Layer-2, it is possible to cut the required bit rate by a factor of two with no loss of subjective quality. Furthermore, the stereo width of difficult-to-encode signals at bit rates below 60 kbit/s is reduced. Like all perceptual coding schemes, MPEG-2 AAC makes use of the signal masking properties of the human ear in order to reduce the amount of data. In doing so, the quantisation noise is distributed to frequency bands in such a way that it is masked by the total signal and remains inaudible.

2.4 Audio Content Estimation

End-user quality is influenced by a number of factors, including mutual compensation effects between audio and video, content, encoding and network settings, as well as transmission conditions. Moreover, audio and video are not only mixed in the multimedia stream; there is even a synergy of the component media (audio and video) [4]. As previous work has shown, mutual compensation effects cause perceptual differences in video with a dominant voice in the audio track rather than in video with other types of audio [5]. Video contents with a dominant voice include news, interviews, talk shows, etc. Finally, audiovisual quality estimation models tuned for video content with a dominant human voice perform better than a universal model [5]. Therefore, our focus within this work is on the design of automatic speech detection algorithms for the mobile environment.

In recent years, speech detection has been extensively studied [6], [7], [8], [9]. The proposed algorithms for speech detection differ in computational complexity, application environment, and accuracy. Our approach is to design a speech detection algorithm suitable for real-time implementation in the mobile environment. Therefore, our work is focused on accurate, low-complexity methods which are robust against audio compression artifacts.

Our proposed low-complexity algorithm has a first stage based on kurtosis [10] and a second stage based on hypothesis testing using a Log-Likelihood Ratio (LLR). In the second stage, we use either the High Zero Crossing Rate Ratio (HZCRR) [11] or the Mel-Frequency Cepstral Coefficients (MFCCs) extracted from the audio signal. The HZCRR has a lower complexity than the MFCCs but also a lower accuracy. The proposed method shows a good balance between accuracy and computational complexity. Finally, the performance and complexity of these methods are compared.
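The two-stage structure can be illustrated with a minimal sketch. All function names, model parameters and thresholds below are illustrative assumptions, not the trained values used in this thesis; the LLR test here assumes simple univariate Gaussian likelihoods for the speech and non-speech classes.

```python
import math

def gaussian_llr(feature, mu_s, var_s, mu_n, var_n):
    """Log-likelihood ratio of 'speech' vs 'non-speech' for a scalar
    feature, assuming univariate Gaussian class models (illustrative)."""
    def loglik(x, mu, var):
        return -0.5 * (math.log(2 * math.pi * var) + (x - mu) ** 2 / var)
    return loglik(feature, mu_s, var_s) - loglik(feature, mu_n, var_n)

def detect_speech(kurt, hzcrr, kurt_threshold, models):
    """Two-stage decision: a cheap kurtosis gate (stage 1) followed by an
    LLR test on the HZCRR feature (stage 2). Parameters are hypothetical."""
    if kurt < kurt_threshold:           # stage 1: flat distribution -> non-speech
        return False
    mu_s, var_s, mu_n, var_n = models   # stage 2: LLR test on HZCRR
    return gaussian_llr(hzcrr, mu_s, var_s, mu_n, var_n) > 0.0
```

In practice, the class means and variances would be fitted on the labelled training corpus, and the stage-1 threshold chosen from the kurtosis distributions of the two classes.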

The proposed methods were submitted as a conference paper and accepted for the IWSSIP 2009 conference [1].

2.4.1 Audio parameters

Due to the low-complexity requirement of the algorithm, our investigation initially focused on time-domain methods. Initial inspection of various audio signals shows significantly different characteristics in speech and non-speech signals (see Figures 2.7 and 2.8). The wide dynamic range of the speech signal (compared to non-speech signals) is clearly visible.

Figure 2.7: Example of a speech signal (time-domain).

Both the kurtosis and the HZCRR features have been used in blind speech separation [12] and music information retrieval [11]. The kurtosis of a zero-mean random process x(n) is defined as the dimensionless, scale-invariant quantity¹

κ_x = [ (1/N) Σ_{n=1..N} (x(n) − x̄)^4 ] / [ (1/N) Σ_{n=1..N} (x(n) − x̄)^2 ]^2 ,    (2.4)

where, in our case, x(n) represents the n-th sample of an audio signal. A higher κ value corresponds to a more peaked distribution of samples, as is found in speech signals (see Figure 2.9), whereas a lower value implies a flatter distribution, as is found in other types of audio signals (see Figure 2.9). Therefore, kurtosis was selected as a basis for the detection of speech. However, accurate detection in short-time frames is not always possible by kurtosis alone.

¹ The reader is cautioned that some texts define kurtosis as κ_x = [ (1/N) Σ_{n=1..N} (x(n) − x̄)^4 ] / [ (1/N) Σ_{n=1..N} (x(n) − x̄)^2 ]^2 − 3. We shall, however, follow the definition in [10].
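Following the definition of Eq. (2.4) (without the −3 excess correction), the kurtosis can be computed in a few lines; the function name below is ours.

```python
import numpy as np

def kurtosis(x):
    """Kurtosis per Eq. (2.4): fourth central moment over squared variance,
    with no -3 excess correction (following [10])."""
    x = np.asarray(x, dtype=float)
    c = x - x.mean()                    # centre the samples
    return (c ** 4).mean() / (c ** 2).mean() ** 2
```

Under this definition a Gaussian signal has kurtosis 3, while the more peaked, heavy-tailed amplitude distribution of speech yields larger values.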


Figure 2.8: Example of a non-speech signal (time-domain).

Figure 2.9: Probability density functions of the speech and non-speech audio samples.

The second objective parameter under consideration is the HZCRR definedas the ratio of the number of frames whose Zero Crossing Rate (ZCR) is greater


than 1.5 times the average ZCR in the audio file [11]:

HZCRR_M = \frac{1}{2N}\sum_{n=0}^{N-1}\left[\operatorname{sgn}\left(ZCR(n,M) - 1.5\,\overline{ZCR}\right) + 1\right]   (2.5)

where ZCR(n,M) is the rate of the n-th, length-M frame (equation given below), N is the total number of frames, and \overline{ZCR} is the average ZCR over the audio file. The ZCR is given by

ZCR(n,M) = \frac{1}{M}\sum_{m=0}^{M-1} \mathbf{1}_{<0}\left[x(nM+m)\,x(nM+m+1)\right]   (2.6)

where m denotes the sample index within the frame and the indicator function is defined as

\mathbf{1}_{<0}(q) = \begin{cases} 1, & q < 0 \\ 0, & q \ge 0. \end{cases}

In the proposed algorithms, we use a frame length of 10 ms and the framing windows are overlapped by 50%. The 10 ms frame length² contains a sufficient number of audio samples for further statistical processing. Moreover, a longer framing window would increase the calculation complexity and the length of the investigated audio sequence necessary for speech detection.
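The two definitions above can be sketched as follows. This is a simplified NumPy illustration using non-overlapping frames rather than the 50% overlap of the actual algorithm:

```python
import numpy as np

def zcr_per_frame(x, M):
    """ZCR(n, M) per Eq. (2.6): fraction of adjacent-sample sign changes
    in each length-M frame."""
    x = np.asarray(x, dtype=float)
    n_frames = (len(x) - 1) // M
    rates = np.empty(n_frames)
    for n in range(n_frames):
        seg = x[n * M : n * M + M + 1]  # one extra sample for the last product
        rates[n] = np.mean(seg[:-1] * seg[1:] < 0)
    return rates

def hzcrr(x, M):
    """HZCRR per Eq. (2.5): the (sgn + 1)/2 construction turns the sum into
    the fraction of frames whose ZCR exceeds 1.5x the file average."""
    z = zcr_per_frame(x, M)
    return np.mean(np.sign(z - 1.5 * z.mean()) + 1) / 2
```

A steady (constant-sign) stretch gives per-frame ZCR near zero, while an alternating-sign stretch gives ZCR near one, so a signal mixing the two yields an HZCRR strictly between 0 and 1.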

Figure 2.10 shows the ZCR curves for both speech and non-speech signals. The ZCR of the non-speech signal has a small amplitude range and low variance. The ZCR of the speech signal, on the other hand, has a wider amplitude range, large variance, and a relatively low and stable baseline with occasional high peaks. However, as can be seen in Figure 2.10, many frames of the speech and non-speech signals have similar ZCRs, and thus accurate detection of speech in short-time frames is also not possible with the ZCR, and subsequently the HZCRR, alone.

Audio Corpus

The training and evaluation of our speech detector was performed on a large audio corpus. Our corpus consists of 3032 speech and non-speech audio files (see details in Tables 2.3 and 2.4). The speech part of the corpus is in the German language and comprises ten speakers. The non-speech part of the corpus consists mainly of music files of various genres (e.g. rock, pop, hip-hop, live music). All audio files were encoded using typical settings for the UMTS environment. Each audio file was encoded using three codec types at different sampling rates: AAC and AMR-WB at 16 kHz, and AMR-NB at 8 kHz. Due to limitations of mobile radio resources, bit rates were selected in the range 8–32 kbps. Encoded audio files with insufficient audio quality were excluded.

²e.g. for sample rate (SR) = 32 kHz, the framing window contains M = 320 samples


Figure 2.10: Plot of the ZCR of the speech signal.

Table 2.3: Speech audio corpus.

Codec     Encoding settings [BR@SR]   Number of audio files
AAC       16 kbps @ 16 kHz            1817
AMR-NB    7.9 kbps @ 8 kHz            1856
AMR-WB    12.65 kbps @ 16 kHz         1856

Table 2.4: Non-speech audio corpus.

Codec     Encoding settings [BR@SR]   Number of audio files
AAC       32 kbps @ 16 kHz            1169
AMR-NB    7.9 kbps @ 8 kHz            1172
AMR-WB    12.65 kbps @ 16 kHz         1176

For purposes of determining the speech and non-speech detection parameters, 2273 audio files without a dominant voice and 3194 audio files with a dominant voice were used in training. These files were selected from all codec and encoding combinations. The rest of the audio corpus was used for testing and performance evaluation.

Kurtosis and HZCRR_M measurements on the training files are given in Figures 2.11 and 2.12. It can be seen that kurtosis is a better speech indicator than HZCRR_M; however, HZCRR_M may be used as an additional indicator.


Figure 2.11: Kurtosis values of speech and non-speech signals.

Figure 2.12: HZCRR_M values of speech and non-speech signals.

2.4.2 Speech detector

In order to reduce complexity, we propose a two-stage voice detection algorithm, where the first stage is based on a threshold comparison of kurtosis and the second stage is based on an LLR test (see Figure 2.14). For the second stage (based on an LLR test), two solutions are proposed. The first, based on HZCRR_M, has significantly lower complexity than the second, based on MFCCs, but also lower accuracy.

In the first solution, non-speech audio frames are first detected by a simple decision based on whether the kurtosis is less than a pre-defined threshold, i.e. κ < c₀, where c₀ = 4.96 (see Figure 2.11). The first stage is capable of recognising 62.3% of the non-speech frames from our corpus with 97% accuracy.

In the second solution, we set the threshold in the first stage to c₀ = 4 using the Least Absolute Errors optimisation technique. All sequences with κ ≤ 4 are recognised as non-speech sequences. The first stage is capable of recognising 40% of the non-speech sequences from our corpus with 99.7% accuracy. In both solutions, if non-speech content is detected in the first stage, we do not carry out


Figure 2.13: Cumulative distribution function of κ.

the second stage, in order to reduce computational complexity. The rationale for these thresholds can also be seen in the CDF of the kurtosis for speech and non-speech audio content in Figure 2.13.

Figure 2.14: Two-stage speech detector.

2.4.3 LLR Test Based on κ and HZCRR_M

In the second stage of the first solution, we derive a more general decision rule based on a hypothesis test (LLR), and we use both the kurtosis and the HZCRR_M of the


frame as elements in a feature vector

X = \begin{bmatrix} \kappa \\ HZCRR_M \end{bmatrix}.

For speech signals, we denote the mean vector of the speech feature vectors as µ_s and the covariance matrix as Σ_s; for non-speech feature vectors, we denote the mean vector as µ_m and the covariance matrix as Σ_m. Furthermore, the LLR test is performed on the first 20 frames only, in order to reduce computational complexity. The log-likelihood ratio is calculated as follows:

\Delta = \frac{\sum_{i=1}^{20} \log\left\{\frac{1}{\sqrt{(2\pi)^{2}|\Sigma_s|}}\exp\left(-\frac{1}{2}(X_i-\mu_s)\Sigma_s^{-1}(X_i-\mu_s)^T\right)\right\}}{\sum_{i=1}^{20} \log\left\{\frac{1}{\sqrt{(2\pi)^{2}|\Sigma_m|}}\exp\left(-\frac{1}{2}(X_i-\mu_m)\Sigma_m^{-1}(X_i-\mu_m)^T\right)\right\}}   (2.7)

If the LLR is greater than the decision threshold, c = 2.2 (see Figure 2.14), we declare a non-speech frame; otherwise, we declare a speech frame. Note that the mean vectors and covariance matrices in (2.7) are estimated ahead of time in the training stage.
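The second-stage decision of Eq. (2.7) can be sketched as follows. The mean vectors and covariance matrices below are placeholders for illustration; in the actual detector they are the statistics estimated from the training corpus:

```python
import numpy as np

def gauss_logpdf(X, mu, Sigma):
    """Row-wise log-density of a bivariate Gaussian N(mu, Sigma)."""
    d = X - mu
    quad = np.einsum('ij,jk,ik->i', d, np.linalg.inv(Sigma), d)
    return -0.5 * (quad + np.log(np.linalg.det(Sigma)) + 2 * np.log(2 * np.pi))

def llr_stage2(X, mu_s, Sigma_s, mu_m, Sigma_m, c=2.2):
    """Eq. (2.7): ratio of summed Gaussian log-likelihoods over the first
    20 feature vectors; Delta > c is declared non-speech, else speech."""
    X = np.asarray(X, dtype=float)[:20]
    delta = gauss_logpdf(X, mu_s, Sigma_s).sum() / gauss_logpdf(X, mu_m, Sigma_m).sum()
    return 'non-speech' if delta > c else 'speech'

# Placeholder training statistics (hypothetical values, illustration only):
mu_s, mu_m = np.array([6.0, 0.10]), np.array([3.0, 0.05])
Sigma_s = Sigma_m = np.eye(2)
```

Feature vectors near µ_s make the numerator's log-likelihood sum small in magnitude relative to the denominator's, so Δ stays below the threshold and the content is declared speech.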

2.4.4 LLR Test Based on Mel-Frequency Cepstrum Coefficients

In the second stage of the second solution, we consider the use of MFCCs extracted from the frame as the feature vector. MFCCs are widely used as feature vectors in a variety of speech and audio applications. The algorithm in [13] is used for the calculation of the first 14 MFCCs. Thus, the covariance matrix is 14×14 and the mean vector is 14×1. The LLR test is again performed on the first 20 frames. The LLR is calculated as

\Delta = \frac{\sum_{i=1}^{20} \log\left\{\frac{1}{\sqrt{(2\pi)^{14}|\Sigma_m|}}\exp\left(-\frac{1}{2}(X_i-\mu_m)\Sigma_m^{-1}(X_i-\mu_m)^T\right)\right\}}{\sum_{i=1}^{20} \log\left\{\frac{1}{\sqrt{(2\pi)^{14}|\Sigma_s|}}\exp\left(-\frac{1}{2}(X_i-\mu_s)\Sigma_s^{-1}(X_i-\mu_s)^T\right)\right\}}   (2.8)

If the LLR is greater than the decision threshold, c = 1.04 (see Figure 2.14), we declare a speech frame; otherwise, we declare a non-speech frame.

2.4.5 Performance evaluation and comparison

We evaluate both two-stage algorithms: the LLR test based on kurtosis and HZCRR_M, and the LLR test based on MFCCs. The first algorithm is a relatively low-complexity solution based on the time-domain audio parameters κ and HZCRR_M. The second algorithm provides a more sophisticated solution based on MFCCs. The performance and complexity (measured in terms of computation time) of both methods were evaluated using 1770 speech files and 1181 non-speech files. The audio corpora for training and evaluation were approximately the same size. The overall accuracy of both proposed methods exceeds 92% (see Table 2.5) for speech and


non-speech content averaged over all codecs. The accuracy of the second algorithm, however, is higher than that of the first, but at increased computation cost.

Table 2.5: Accuracy results for detection of non-speech and speech from coded audio.

Content      Codec     κ & HZCRR_M   MFCC
Non-speech   AAC       92.70 %       98.27 %
             AMR-NB    99.06 %       100 %
             AMR-WB    85.71 %       96.85 %
Speech       AAC       89.27 %       98.51 %
             AMR-NB    94.94 %       100 %
             AMR-WB    90.30 %       98.21 %
Overall                92.78 %       98.21 %

In order to evaluate complexity, the computation time was measured using 6091 audio files (3759 speech files, 2332 non-speech files). The algorithms were executed in the MATLAB environment on a Core 2 Duo processor. In order to obtain accurate results, the test was repeated ten times. Table 2.6 gives the average computation times. The first algorithm is approximately 2× faster than the second algorithm. The efficiency reflects the number of processed files per second (see Table 2.6). The computing time and efficiency results show that both methods allow for fast detection of speech frames and are suitable for real-time implementation in mobile devices.

Table 2.6: Time needed for content estimation.

Method        Time [s]   Efficiency [files/s]
κ & HZCRR_M   106.46     57.20
MFCC          233.89     26.04

Conclusion

The goal of this part of the work was to design a speech detector for the mobile environment. The design was focused on accurate, low-complexity methods that are robust against audio compression artifacts. Both proposed algorithms show very good accuracy (above 92%) and relatively low complexity. However, the method based on kurtosis and HZCRR_M is approximately 2× faster (lower complexity).

2.5 Audio Quality Estimation Algorithms

In many applications it is important to know what the content quality on the user side looks like. For this purpose, several measurement techniques have been developed, both subjective and objective. Although the


subjective measurements are very precise, and with them the network operator knows exactly how the network setup impacts the perceived quality, they are also very complex, and the end-user perceived quality is often hard to obtain. Moreover, subjective tests are expensive. Therefore, in practice, only objective measurement techniques are used. For audio quality estimation there exist many different types of objective estimation algorithms. They can be divided into two main groups:

1. Reference audio quality metrics

2. Non-reference audio quality metrics

Generally, for the evaluation of perceived quality, there exists a standardised technique which describes the quality on a five-point scale, the Mean Opinion Score (MOS).

MOS   Quality     Impairment
5     Excellent   Imperceptible
4     Good        Perceptible but not annoying
3     Fair        Slightly annoying
2     Poor        Annoying
1     Bad         Very annoying

Table 2.7: Description of Mean Opinion Scores.

The aim of an objective metric is to obtain objective MOS values that correlate as well as possible with the subjective values.

2.5.1 Reference Audio Quality Metrics

The first research works in the area of audio quality measurement date back to the early eighties. The first algorithms were based on the work of Zwicker, Schroeder, and Brandenburg. The first algorithm used in real measurements was the Noise to Mask Ratio (NMR) in 1989 [14]. In terms of standardisation and adoption in the field, the most advanced objective perceptual quality assessment methods may be found in the areas of audio and speech. This is due to the observation that the psycho-acoustic effects known from masking experiments seem to differ significantly between the perception of speech and of music signals. For wideband audio signals, the Perceptual Evaluation of Audio Quality (PEAQ) method has been developed and recommended as ITU-R Rec. BS.1387 [15]. PEAQ was developed originally as an automated method to evaluate the perceptual quality of different wideband audio codecs.

Several objective perceptual quality assessment methods have been developed for speech signals. The main ones in use today include the Perceptual Analysis Measurement System (PAMS), Perceptual Speech Quality Measurement (PSQM),


and Perceptual Evaluation of Speech Quality (PESQ) [16]. Although they differ significantly in the way they try to model human perception, they also show a very high degree of similarity in their basic structure. When comparing all of these measurement algorithms, they can be broken down into a block diagram as shown in Figure 2.15.

Figure 2.15: The structure of the generic perceptual measurement algorithm.

The following subsections describe two types of reference metrics.

Perceptual Evaluation of Speech Quality (PESQ)

In modern mobile networks, just as in VoIP, the measurement algorithm has to deal with much higher distortions than with GSM codecs, and the most eminent factor is that the delay between the reference and the test signal is no longer constant (the delay for each time interval is significantly different from that of the previous time interval). These varying delays are handled in PESQ by a time alignment algorithm. PESQ was developed for use over a wider range of network conditions, including analogue connections, codecs, packet loss and variable delay. It gives accurate predictions of subjective quality in a very wide range of conditions, including those with background noise, analogue filtering, channel errors, coding distortions, or variable delay. PESQ addresses these effects with transfer function equalisation, time alignment, and an algorithm for averaging distortions over time. Furthermore, PESQ is suitable for many applications in assessing the speech quality of mobile networks or narrow-band speech codecs and for end-to-end measurements. While PESQ measures only the effects of one-way speech distortion


and noise on speech quality, the effects of loudness loss, delay, sidetone, echo, and other impairments related to two-way interaction are not reflected in the PESQ score.

Figure 2.16 presents the structure of the PESQ model [16], which compares a reference signal with a degraded signal that is the result of the original signal passing through a communication system. The output of PESQ is a prediction of the perceived quality that would be given to the degraded signal by subjects in a subjective listening test.

Figure 2.16: Structure of the PESQ algorithm.

In a first pre-processing step, level alignment brings the reference and degraded signals to a standard listening level. After that, they are filtered with an input filter to model a standard telephone handset. Next, the signals are aligned in time and equalised, and then processed through an auditory transform using a perceptual model. The key to this process is the transformation of both signals to an internal representation that is analogous to the psycho-physical representation of audio signals in the human auditory system, taking account of perceptual frequency (Bark) and loudness (Sone). The steps included are: time alignment, level alignment to a calibrated listening level, time-frequency mapping, frequency warping, and compressive loudness scaling. The transformation also involves equalising for linear filtering in the system and for gain variations which have little perceptual significance. From the difference between the transformed signals, the so-called "disturbance", two distortion parameters are extracted, aggregated in frequency and time, and mapped to a prediction of the subjective mean opinion score (MOS) to enable a direct comparison between the objective and subjective scores. The mapping of the objective PESQ score onto the subjective MOS score is done by using a linear, monotonic function to preserve all the information during the so-called regression process. This process uses a regression


mapping to remove any systematic offset between the objective scores and the subjective MOS, minimising the mean square of the residual errors:

e_i = x_i − y_i.   (2.9)

Various measures may be applied to the residual errors to give an alternative view of the closeness of the objective scores to the subjective MOS.
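As a toy illustration of this regression step (the score pairs below are made up for illustration, not measured data), a first-order monotonic mapping fitted by least squares removes the systematic offset and minimises the residuals of Eq. (2.9):

```python
import numpy as np

# Hypothetical paired scores: raw objective outputs vs. subjective MOS.
objective = np.array([1.2, 1.8, 2.5, 3.1, 3.9, 4.3])
subjective = np.array([1.5, 2.0, 2.6, 3.3, 4.0, 4.4])

# Linear, monotonic mapping y ~ a*x + b fitted by least squares.
a, b = np.polyfit(objective, subjective, 1)
mapped = a * objective + b

# Residual errors e_i = x_i - y_i after the mapping, and their RMS value.
residuals = mapped - subjective
rmse = np.sqrt(np.mean(residuals ** 2))
```

Because the identity mapping (a = 1, b = 0) is itself a member of the linear family, the fitted mapping can only lower the root-mean-square residual relative to comparing the raw scores directly.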

Perceptual Evaluation of Audio Quality (PEAQ)

Perceptual Evaluation of Audio Quality, according to ITU-R recommendation BS.1387 [15], is available as a basic and an advanced model. Figure 2.17 presents the structure of the PEAQ algorithm.

Figure 2.17: Structure of the PEAQ algorithm.

The proposed Method for Objective Measurement of Perceived Audio Quality consists of a peripheral ear model, several intermediate steps (referred to here as "pre-processing of excitation patterns"), the calculation of (mostly) psycho-acoustically based Model Output Variables (MOVs), and a mapping from a set of Model Output Variables to a single value representing the basic audio quality of the Signal Under Test. It includes two peripheral ear models, one based on an FFT and one based on a filter bank. Except for the calculation of the error signal (which is only used with the FFT-based part of the ear model), the general structure is the same for both peripheral ear models.


The inputs for the MOV calculation are:

1. The excitation patterns for both the test and the Reference Signal.

2. The spectrally adapted excitation patterns for both the test and the Reference Signal.

3. The specific loudness patterns for both the test and the Reference Signal.

4. The modulation patterns for both the test and the Reference Signal.

5. The error signal, calculated as the spectral difference between the test and the Reference Signal (only for the FFT-based ear model).

If not indicated differently, in the case of stereo signals all computations are performed independently and in the same manner for the left and right channels. The description defines two setups, one called the "Basic Version" and one called the "Advanced Version".

In all given equations, the index "Ref." stands for all patterns calculated from the Reference Signal, and the index "Test" stands for all patterns calculated from the Signal Under Test. The index "k" stands for the discrete frequency variable (i.e. the frequency band) and "n" stands for the discrete time variable (i.e. either the frame counter or the sample counter). If the values for k or n are not explicitly defined, the computations are to be carried out for all possible values of k and n. All other abbreviations are explained at the place they occur.

In the names of the Model Output Variables, the index "A" stands for all variables calculated using the filter bank-based part of the ear model, and the index "B" stands for all variables calculated using the FFT-based part of the ear model.

Basic Version

The Basic Version includes only MOVs that are calculated from the FFT-based ear model. The filter bank-based part of the model is not used. The Basic Version uses a total of 11 MOVs for the prediction of the perceived basic audio quality.

Advanced Version

The Advanced Version includes MOVs that are calculated from the filter bank-based ear model as well as MOVs that are calculated from the FFT-based ear model. The spectrally adapted excitation patterns and the modulation patterns are computed from the filter bank-based part of the model only. The Advanced Version uses a total of 5 MOVs for the prediction of the perceived basic audio quality.

2.5.2 Non-reference Audio Quality Metrics

For our work we decided not to use reference audio metrics, because in mobile scenarios it is difficult to work with the reference signal. For example, if a UMTS network operator wants to know the quality on the user's side, it is not possible to compare the signal at the output of the receiver with the reference signal at the transmitter. The main reason for this is that the transmission conditions in the mobile environment change in both the time and space domains. We therefore decided


to work with a non-intrusive audio metric, the Single Sided Speech Quality Metric (3SQM) [14].

Although it is designed primarily for the evaluation of speech quality, in our audiovisual metric setups it also works well for the evaluation of non-speech signals.

Non-intrusive measurement, compared to intrusive measurement, offers the possibility to measure at almost any point of the network with any real-world audio signal. Of course, there is a disadvantage: the accuracy of the measurement is somewhat lower compared to intrusive measurement techniques such as PESQ.

These types of metrics follow two fundamentally different principles.

The first principle works only at the signal-processing level: it takes into account the transmission conditions and the transmission path itself, but not the signal itself. If the transmission path changes its properties, this type of measurement will fail. A priori knowledge of the exact transmission path and of the equipment used is therefore important. The advantage of this type of measurement is that it is slightly less computationally complex than other methods.

The second principle of non-intrusive voice quality measurement is more universal. It analyses the voice stream, not the transmission path. Any kind of voice signal can be analysed, and this principle can be used at every point of the network; it is not dependent on the network elements. This principle is more complex than the first type, but is much more flexible and can be used in any scenario. In today's heterogeneous networks, this type of non-intrusive measurement is the only one that can be used with hardly any restrictions.

Single-sided Speech Quality Measurement (3SQM)

The 3SQM algorithm is an example of the second type of non-reference audio metric. As can be seen in Figure 2.18, it combines three different and fundamentally independent parts.

The input signal is first pre-processed. The signal is filtered through a filter which simulates the standard headset used in laboratories for subjective testing; after filtering, the signal is adjusted to speech level; and finally, the signal is divided into voiced and unvoiced parts.

After pre-processing, the signal is fed into the next stage, where the distortion and speech parameters are extracted. The parameters are divided into three main distortion classes:

1. Vocal tract analysis and unnaturalness of speech

2. Analysis of strong additional noise

3. Interruptions, mutes and time clipping


Figure 2.18: Block diagram of the 3SQM analysis algorithm.

After the division of the parameters, several parameters are clustered together, and they define a single isolated distortion class. This is done in analogy to subjective testing with real listeners: listeners focus on the foreground of the signal and would not judge the quality by a simple sum of all occurring distortions, but rather by the single dominant artifact in the signal. The dominant distortion classes used in the 3SQM metric are:

1. Low static SNR

2. Mutes

3. Low segmental SNR

4. Unnatural voice - Robotization

5. Basic speech quality

Finally, for each dominant distortion the final quality estimation is calculated, based on a selection of the MOVs. The result of the estimation is equivalent to the Mean Opinion Score - Objective Listening Quality (MOS-LQO), which is defined by ITU-T recommendation P.800.


Chapter 3

Video Quality

3.1 Introduction

This chapter gives a basic overview of the video encoding process and of basic steps that are common to several video codecs: video and colour sampling, frame structure, and the creation of motion vectors. The encoding properties of H.264/AVC are also described in more detail. At the end of this chapter, a video quality metric based on the motion characteristics of the video is presented. The objective parameters extracted from the motion vectors will be further used in the audiovisual quality metric; these objective parameters are also described in more detail.

3.2 Basic principles of video coding

Consider the simplest case, storing uncompressed digital video data: we would need an enormous amount of storage space. If we use 8-bit code words for each sample, then for the standard VGA (640x480 pixels) picture size and 25 frames per second (Frame Rate, FR), we need approximately 155 GB¹ of disc space for a two-hour movie. It is obvious that we must use compression methods, so that we need less storage capacity, or lower bit rates for streaming of the video data.
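The arithmetic behind this figure can be checked in a few lines (GB here meaning 2^30 bytes, which is the convention the quoted value implies):

```python
# Two hours of uncompressed VGA video: 3 colour components, 8 bits each,
# 640x480 pixels, 25 frames per second, 120 minutes.
bits = 3 * 8 * 640 * 480 * 25 * 60 * 120
gigabytes = bits / 8 / 2 ** 30
print(f"{bits:.4e} bits, about {gigabytes:.1f} GB")
```

This reproduces the 1.3271 × 10^12 bits (approximately 154.5 GB) stated in the footnote.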

This is a major problem in the mobile environment, since here we are coping with limited bandwidth, worse transmission properties, and service providers who want to use as little media capacity as possible. For video streaming services, the following video compression standards are used today: H.263 [23], standardised by the International Telecommunication Union (ITU), MPEG-4 Part 2 [21],

¹3 colours × 8 bit × 640 pixels × 480 pixels × 25 fps × 60 sec × 120 min = 1.3271 × 10^12 bits = 154.4952 GB


standardised by the International Organisation for Standardisation (ISO) Moving Picture Experts Group, and the newest H.264/AVC [22] (also known as MPEG-4 Part 10), standardised by the Joint Video Team (JVT) of experts from both ISO/IEC and ITU.

All of the mentioned codecs use similar basic principles for encoding the video, so a brief overview of these principles is given in the following sections. In this work, we focus only on the newest codec, H.264, since it is currently the most advanced technique and offers the best coding efficiency. It also has great potential for wide use in UMTS networks and in Digital Video Broadcasting (DVB).

3.2.1 Video and Colour sampling

Redundancy and irrelevance characterise the removable information in the video signal. A video signal from which this information has been eliminated will be perceived by a human in the same way as before. The difference between the redundant and the irrelevant part lies only in the subjective or objective point of view.

1. The redundant part of the video signal is the part of the information which can be removed such that the signal can be reconstructed again without any loss of information and without any distortion.

2. The irrelevant part of the video signal is the part of the signal whose absence will not be recognised by the human eye. This means that the signal without the irrelevant information will be perceived in the same way as the original signal.

These two types of information allow two types of compression, namely:

1. Loss-less compression

2. Lossy compression

Loss-less compression removes the redundant information from the video signal. Because the redundant information can be 100% reconstructed, we use the term loss-less compression. There are many types of compression algorithms, but the most commonly used in video processing are Run-Length Encoding (RLE), Lempel-Ziv-Welch (LZW), and Huffman coding. The compression ratio which can be obtained by this type of signal processing is relatively small compared to the lossy one.
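As a minimal example of the loss-less idea, Run-Length Encoding collapses runs of identical samples and is perfectly reversible. This is a toy sketch, not how RLE is actually embedded in a video codec:

```python
def rle_encode(data):
    """Collapse runs of repeated symbols into [symbol, count] pairs."""
    out = []
    for sym in data:
        if out and out[-1][0] == sym:
            out[-1][1] += 1
        else:
            out.append([sym, 1])
    return out

def rle_decode(pairs):
    """Expand [symbol, count] pairs back into the original sequence."""
    return [sym for sym, count in pairs for _ in range(count)]

pixels = [0, 0, 0, 0, 255, 255, 7]
encoded = rle_encode(pixels)          # [[0, 4], [255, 2], [7, 1]]
assert rle_decode(encoded) == pixels  # 100% reconstruction: loss-less
```

The compression gain is large only for long runs of identical values, which illustrates why purely loss-less schemes achieve relatively small ratios on natural video.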

Lossy compression lowers the data amount by removing the irrelevant information. The compression ratio varies with the encoder's ability to set the compression ratio and with which irrelevant information is removed. This changeable and


variable compression ratio is very suitable for multimedia applications, and it allows a high reduction of the bit rates and a reduction of the transmission bandwidth.

If we take a look at the video signal itself, we can define several levels or parts of the video stream.

The video signal consists of several pictures which follow continually one after another. The number of these pictures in one second defines the basic video parameter, the Frame Rate (FR), sometimes also known as Frames Per Second (FPS). We can easily reduce the bit rate by simply lowering the FR. Several standards define several values of the FR, but the most commonly used value is 25. Frame rates below 10 FPS are sometimes used for very low bit-rate video communications, but motion is jerky and unnatural at this rate. Frame rates between 10 and 20 frames per second are typical for low bit-rate video communications; sampling at 25-30 FPS is standard for television (with interlacing to improve the appearance of movement); and 50 or 60 FPS produces smooth apparent motion.

Processing the whole frame at once is very inconvenient and difficult, so for video processing smaller parts of the frames, pixels, are used. Several pixels form a row; the number of pixels in one row defines the width of the video frame. Several rows form slices, which are used in video processing for better control over the video stream and for better error concealment. The number of rows defines the height of the video frame. The height and width of the video frame form the resolution, or picture size. There are several picture sizes commonly used in the mobile environment. The most used are listed in Table 3.1.

Abbreviation   Size      Description
VGA            640x480   Video Graphics Array
QVGA           320x240   Quarter Video Graphics Array
CIF            352x288   Common Intermediate Format
QCIF           176x144   Quarter Common Intermediate Format

Table 3.1: Video resolutions used in mobile environment.

In our subjective measurements, only the VGA and QVGA picture resolutions were used.

We also worked with colour video, so some basics of the colour storing mechanisms and of colour space sub-sampling are described here.

The main principle of colour display devices is drawing each colour as a mixture of red, green and blue. Any colour can be created by combining these three basic colours. The colour of a pixel in the frame can be described as a 1-by-3 matrix, where the number in each position describes the relative proportion of Red, Green and Blue. A picture divided into each significant colour can be seen in Figure 3.1.

Since the Human Visual System (HVS) is more sensitive to luminance than to chrominance, there exists another way to reduce the size of an image


Figure 3.1: Red, green and blue components of an image.

or video sequence. The colour space which exploits this higher sensitivity to luminance is called YCbCr. Y is the luminance (luma) component and can be calculated as a weighted average of R, G and B:

Y = k_r·R + k_g·G + k_b·B    (3.1)

The other components of YCbCr can easily be computed from the components of the RGB colour space. The following equations are used for deriving YCbCr from RGB [25]:

Y  = k_r·R + (1 − k_b − k_r)·G + k_b·B

Cb = 0.5/(1 − k_b) · (B − Y)    (3.2)

Cr = 0.5/(1 − k_r) · (R − Y)

Here k_r and k_b are constants, and ITU-R Recommendation BT.601 defines them as k_b = 0.114 and k_r = 0.299, which leads to:

Y  = 0.299·R + 0.587·G + 0.114·B

Cb = 0.564·(B − Y)    (3.3)

Cr = 0.713·(R − Y)
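The BT.601 conversion in equations (3.3) can be checked with a few lines of code (a minimal pure-Python sketch for illustration; not part of the thesis toolchain):

```python
def rgb_to_ycbcr(r, g, b):
    """BT.601 RGB -> YCbCr conversion, following Eq. (3.3):
    k_r = 0.299, k_b = 0.114."""
    y = 0.299 * r + 0.587 * g + 0.114 * b
    cb = 0.564 * (b - y)
    cr = 0.713 * (r - y)
    return y, cb, cr
```

For white (R = G = B = 255) the luma weights sum to one, so Y = 255 and both chroma components vanish, as expected for a neutral colour.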

A picture divided into the separate components of the YCbCr colour space can be seen in Figure 3.2.

Obviously, a simple transformation into another colour space cannot by itself achieve lower transmission bitrates. But as mentioned above, the HVS is more sensitive to the luma component and less sensitive to the chrominance information of a picture or video signal. The Cb and Cr components can therefore be sampled at a lower sampling rate, which reduces the data needed for the digital picture representation.


Figure 3.2: Luminance and chrominance components of the picture in YCbCr colour space.

Several colour sub-sampling schemes exist; the main ones are listed below:

1. 4:4:4 sub-sampling means that all three components (Y, Cb, Cr) have the same resolution, and hence a sample of each component exists at every pixel position. The numbers indicate the relative sampling rate of each component in the horizontal direction.

2. 4:2:2 sub-sampling means that the chrominance components have the same vertical resolution as the luminance component but half the horizontal resolution.

To calculate the saved bandwidth, a simple rule helps: dividing the sum of the sampling-ratio coefficients by 12 gives the ratio between the bandwidth of the sub-sampled video and the bandwidth of the original video. Hence this sampling scheme requires two thirds of the bits needed by the 4:4:4 version, since (4+2+2)/12 = 2/3.

3. 4:2:0 sub-sampling, also called 'YV12', means that the chrominance components (Cb, Cr) each have half the horizontal and half the vertical resolution of the luma component Y. Hence, with this scheme, half the bandwidth of the original 4:4:4 video version is needed.
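The sum-of-coefficients rule described in the list above can be written as a one-line helper (the function name and interface are my own, for illustration):

```python
def bandwidth_ratio(j, a, b):
    """Fraction of the 4:4:4 bandwidth required by a J:a:b chroma
    sub-sampling scheme, using the rule from the text: divide the sum
    of the sampling-ratio coefficients by 12."""
    return (j + a + b) / 12
```

For example, `bandwidth_ratio(4, 2, 2)` gives 2/3 and `bandwidth_ratio(4, 2, 0)` gives 1/2, matching the savings stated for the 4:2:2 and 4:2:0 schemes.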

The sampled pixels form macroblocks, and further encoding is performed mainly at the macroblock level. A macroblock consists of 16x16 luma and 8x8 chroma pixels. The further processing differs for three types of frames: intra-coded frames (I frames), predicted frames (P frames) and bi-directionally predicted frames (B frames).

If an I frame is encoded, the frame is divided into 8x8 pixel blocks. These blocks are transformed using the Discrete Cosine Transform (DCT), but the transformed data as such offers no data saving yet. The saving is obtained by quantisation of the DCT coefficients, after which many of the coefficients, mainly the higher-frequency ones, are equal to zero. By zigzagging the frame matrix, the



Figure 3.3: 4:2:2 colour sampling pattern [24].

Figure 3.4: 4:2:0 colour sampling pattern [24].

picture data bit-stream is obtained. This data stream contains distinctive groups of zeros, which can be exploited for data saving by using run-length codes. Finally, Huffman coding is applied, and the data is thereby transformed into smaller groups of numbers.
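The zig-zag scan and run-length step described above can be sketched as follows (an illustrative pure-Python version; the exact scan table and symbol format of a real codec differ):

```python
def zigzag(block):
    """Read a square block of quantised DCT coefficients in zig-zag order:
    walk the anti-diagonals, alternating direction on odd diagonals."""
    n = len(block)
    coords = [(i, j) for i in range(n) for j in range(n)]
    coords.sort(key=lambda ij: (ij[0] + ij[1],
                                -ij[1] if (ij[0] + ij[1]) % 2 else ij[1]))
    return [block[i][j] for i, j in coords]

def run_length(coeffs):
    """Encode a scanned coefficient list as (zero_run, value) pairs;
    the trailing run of zeros collapses into an end-of-block symbol."""
    pairs, zeros = [], 0
    for c in coeffs:
        if c == 0:
            zeros += 1
        else:
            pairs.append((zeros, c))
            zeros = 0
    pairs.append("EOB")
    return pairs
```

Because quantisation drives most high-frequency coefficients to zero, the zig-zag order groups the zeros at the end of the list, which is exactly what makes the run-length step effective.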

If a P or B frame is encoded, a different process is used. First, the present frame is compared with a previous or future reference frame (or two reference frames in the case of a B frame). This comparison is done at the macroblock level: using the energy level, a similar or identical macroblock is searched for in the reference frame.


If found, the motion vector and the difference between the present and reference macroblock are calculated (this is also known as motion-compensated prediction). This residual is further encoded by the same process as the one used in I frame encoding.
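The block-matching search described above can be sketched as a full-search minimisation of the sum of absolute differences (SAD). This is an illustrative toy version; real encoders use fast search strategies and sub-pixel refinement:

```python
def sad(ref, cur, top, left):
    """Sum of absolute differences between a block and a reference region."""
    return sum(abs(ref[top + i][left + j] - cur[i][j])
               for i in range(len(cur)) for j in range(len(cur[0])))

def best_match(ref, cur, top, left, search=4):
    """Full-search block matching around (top, left): returns the motion
    vector (dy, dx) with minimal SAD, plus that SAD. The residual is then
    the block minus the matched reference region."""
    h, w = len(cur), len(cur[0])
    best, best_cost = (0, 0), float("inf")
    for dy in range(-search, search + 1):
        for dx in range(-search, search + 1):
            y, x = top + dy, left + dx
            if 0 <= y and 0 <= x and y + h <= len(ref) and x + w <= len(ref[0]):
                cost = sad(ref, cur, y, x)
                if cost < best_cost:
                    best_cost, best = cost, (dy, dx)
    return best, best_cost
```

When the match is exact, the residual is all zeros and compresses to almost nothing, which is where the large coding gain of P and B frames comes from.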

3.2.2 New features in H.264

The sections above covered some elementary encoding processes, but H.264 was developed to achieve better coding efficiency than its predecessors, so new features had to be introduced into the codec. The following is a list of new features present in the H.264/AVC codec [26].

• Variable block-size motion compensation with small block sizes:

Fixed-size motion compensation was inefficient, so this standard supports more flexibility in the selection of motion compensation block sizes and shapes. The luma motion compensation block can be as small as 4x4 pixels.

• Quarter-sample-accurate motion compensation:

Most prior standards enable at most half-sample motion vector accuracy. H.264 takes this to another level, enabling quarter-sample motion vector accuracy.

• Motion vectors over picture boundaries:

This feature was previously implemented in the H.263 standard and is now also included in H.264/AVC. The technique is known as boundary extrapolation.

• Multiple reference picture motion compensation:

While previous codecs used only one previously decoded picture to predict the values in an incoming picture, the new standard enables more than one previously decoded picture to be used for prediction. This is also possible for bi-directionally predicted frames (B frames).

• Decoupling of referencing order from display order:

In prior standards, there was a strict dependency between the ordering of pictures for motion compensation referencing purposes and the ordering of pictures for display purposes. In H.264/AVC, these restrictions are largely removed, allowing the encoder to choose the ordering of pictures for referencing and display purposes with a high degree of flexibility, constrained only by a total memory capacity bound imposed to ensure decoding ability. Removal of the restriction also removes the extra delay previously associated with bi-predictive coding.


• Decoupling of picture representation methods from picture refer-encing capability:

Now, bi-directionally predicted pictures can be used for prediction of other pictures in the video sequence.

• Weighted prediction:

A new innovation in H.264/AVC allows the motion-compensated prediction signal to be weighted and offset by amounts specified by the encoder. This can dramatically improve coding efficiency for scenes containing fades, and can be used flexibly for other purposes as well.

• Improved ”skipped” and ”direct” motion inference:

The H.264/AVC design infers motion in "skipped" areas and also includes an enhanced motion inference method known as "direct" motion compensation, which improves on previous "direct" prediction designs.

• Directional spatial prediction for intra coding:

Extrapolating the edges of previously decoded parts of the current picture helps to improve prediction quality in regions of pictures coded without reference to the content of reference pictures.

• In-the-loop deblocking filtering:

To improve video quality and prevent blocking artefacts, a new adaptive deblocking filter is used in H.264/AVC. The deblocking filter in the H.264/AVC design is brought within the motion-compensated prediction loop, so that this improvement in quality can be used in inter-picture prediction to improve the ability to predict other pictures as well.

• Small block-size transform:

While prior video coding standards used a transform block size of 8x8, H.264/AVC is based primarily on a 4x4 transform.

• Hierarchical block transform:

For signals that contain sufficient correlation, longer basis functions (longer than 4x4) can be used for the transformation. H.264/AVC also achieves better coding efficiency and higher picture quality by allowing the encoder to select a special coding type for intra coding. Transform matrices of size 4x4, 8x8 or 16x16 can be used.

• Short word-length transform:

While previous designs have generally required 32-bit processing for transform computation, the H.264/AVC design requires only 16-bit arithmetic.


• Exact-match inverse transform:

In previous video coding standards, the transform used for representing the video was generally specified only within an error tolerance bound, due to the impracticality of obtaining an exact match to the ideal specified inverse transform. As a result, each decoder design would produce slightly different decoded video, causing a "drift" between the encoder and decoder representations of the video and reducing effective video quality. Building on a path laid out as an optional feature in the H.263++ effort, H.264/AVC is the first standard to achieve exact equality of decoded video content from all decoders.

• Arithmetic entropy coding:

This new coding standard makes more effective use of arithmetic entropy coding than the H.263 codec did.

• Context-adaptive entropy coding:

The two entropy coding methods applied in H.264/AVC, termed CAVLC (context-adaptive variable-length coding) and CABAC (context-adaptive binary arithmetic coding), both use context-based adaptivity to improve performance relative to prior standard designs.

• Parameter set structure:

The parameter set design provides robust and efficient conveyance of header information. As the loss of a few key bits of information (such as sequence header or picture header information) could have a severe negative impact on the decoding process under prior standards, this key information is separated for handling in a more flexible and specialised manner in the H.264/AVC design.

• NAL unit syntax structure:

Each syntax structure in H.264/AVC is placed into a logical data packet called a NAL unit. The NAL unit syntax structure allows greater customisation of the method of carrying the video content in a manner appropriate for each specific network.

• Flexible slice size:

Slice sizes in H.264/AVC are highly flexible, as was the case earlier in MPEG-1.

• Flexible macroblock ordering (FMO):

The picture is partitioned into regions called slice groups, and each slice becomes an independently decodable subset of a slice group.


• Arbitrary slice ordering (ASO):

As a result of the previously mentioned feature, H.264/AVC enables sending and receiving the slices of a picture in any order relative to each other.

• Redundant pictures:

The encoder can send redundant representations of regions of pictures. This ability enables the decoder to reconstruct regions of pictures for which the primary representation has been lost during data transmission.

• Data Partitioning:

The syntax of each slice can be separated into up to three different partitions for transmission, depending on the categorisation of syntax elements. This categorisation reflects the fact that some coded information is more valuable than other for representing the video content.

• SP/SI synchronisation/switching pictures:

The H.264/AVC design includes a new feature consisting of picture types that allow exact synchronisation of the decoding process of some decoders with an ongoing video stream produced by other decoders, without penalising all decoders with the loss of efficiency that results from sending an I picture. This can enable switching a decoder between representations of the video content that use different data rates, recovery from data losses or errors, as well as trick modes such as fast-forward, fast-reverse, etc.

3.3 Video Quality Estimation

In recent years, great effort has been put into research on video quality estimation. Several video quality metrics have been designed, as well as several methods for measuring subjective and objective video quality. At the time of writing this thesis, there is still no widely accepted and commonly used video quality metric based on objective parameters with good correlation to subjective test results. So for the evaluation of video quality, subjective tests are still the usual way to obtain the quality perceived by various people.

But as the goal of this thesis was to propose an objective audiovisual quality metric, it was essential to also propose an independent video quality metric based on objective parameters of the video. This part of the work was based on former research published in [3].

Reference [3] describes several ways of measuring video quality. All of them work independently of the original video signal (no-reference metrics) and use various objective parameters.


The following paragraphs describe the objective parameters which are later used in video quality estimation and, in turn, in audiovisual quality estimation.

3.3.1 Quality estimation based on content sensitive parameters

For the estimation of perceived video quality, several content classes of video can be defined. Each of these content classes has specific features which can be used for content classification. In the subjective tests, which will be described later, we decided to use three very specific video content classes which differ in several ways. Taking a global look at them and focusing also on the differences in the audio domain, these three video content classes are very distinctive there as well.

For our work we chose the Video clip, the Video call and the Soccer clip. The difference in the audio domain is obvious: the Video clip contains mainly music, perhaps with speech in the background, the Video call contains mainly voiced data, and the Soccer clip contains speech, strongly disturbed by the noise of the audience in the soccer stadium.

But let us focus on the differences in the video domain.

The Video clip is characterised by fast picture changes, fast movements, highly varied colours and usually a person in the foreground. This type of video content contains both global and local movement.

On the other side, the Video call contains mainly a very slowly moving object in the foreground and a slowly moving or static background. There is very little local movement and at most slow global movement.

The Soccer content type contains many characteristic colours, mainly green. The audience can be regarded as background, and especially at very low bitrates this background appears static, or almost static, to a human viewer. In the foreground there are fast local movements.

These specific properties can be exploited very well in feature extraction. If we take the motion vector (MV) as one of the objective parameters of the video and calculate various statistical values of the MVs, we can easily obtain the particular properties of the movement in the video.

The list below specifies several statistical properties of the MVs and other video properties which will later be used in the complex audiovisual quality metric, but which can also be used for estimating video quality alone (without the presence of audio) [3].

• Zero MV ratio within one shot Z:

The percentage of zero MVs is the proportion of the frame that does not change at all (or changes only very slightly) between two consecutive frames, averaged over all frames in the shot. This feature detects the proportion of the still region. A high proportion of still region refers to a very static


Figure 3.5: Example of various video content types: Video clip, Video call, Soccer.

sequence with a small region of significant local movement. The viewer's attention is focused mainly on this small moving region. A low proportion of still region indicates uniform global movement and/or a lot of local movement.

• Mean MV size within one shot N:

This is the mean size of the non-zero MVs, normalised to the screen width and expressed as a percentage. This parameter determines the intensity of movement within a moving region. Low intensity indicates a static sequence; high intensity within a large moving region indicates a rapidly changing scene.

• Ratio of MV deviation within one shot S:

The ratio of the standard deviation of the MVs to the mean MV size within one shot, expressed as a percentage. A high deviation indicates a lot of local movement, and a low deviation indicates global movement.

• Uniformity of movement within one shot U:

The percentage of MVs pointing in the dominant direction (the most frequent direction of MVs) within one shot. For this purpose, the directional resolution is 10°. This feature expresses the proportion of uniform and local movement within one sequence.

• Average BR:

This parameter refers to the pure video payload. The parameter BR is calculated as an average over the whole stream. Furthermore, the parameter BR reflects the compression gain in the spatial and temporal domain. Moreover, the encoder performance depends on the motion characteristics. BR reduction causes a loss of spatial and temporal information, which is usually annoying for viewers.
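The four motion features Z, N, S and U can be computed from a pooled list of motion vectors roughly as follows. This is an illustrative sketch: the flat (dx, dy) layout and the exact normalisation used in [3] are assumptions here, not taken from the thesis.

```python
import math

def motion_features(mvs, screen_width):
    """Shot-level MV statistics (Z, N, S, U) as described in the text.
    mvs: list of (dx, dy) motion vectors pooled over one shot
    (hypothetical layout chosen for illustration)."""
    sizes = [math.hypot(dx, dy) for dx, dy in mvs]
    moving = [s for s in sizes if s > 0]
    Z = 100.0 * (len(sizes) - len(moving)) / len(sizes)   # zero-MV ratio [%]
    mean = sum(moving) / len(moving)
    N = 100.0 * mean / screen_width                       # mean non-zero MV size [%]
    var = sum((s - mean) ** 2 for s in moving) / len(moving)
    S = 100.0 * math.sqrt(var) / mean                     # MV deviation ratio [%]
    # Uniformity: share of moving MVs in the dominant 10-degree direction bin.
    bins = [0] * 36
    for dx, dy in mvs:
        if dx or dy:
            bins[int(math.degrees(math.atan2(dy, dx)) % 360 // 10)] += 1
    U = 100.0 * max(bins) / len(moving)
    return Z, N, S, U
```

A shot with one zero vector out of four gives Z = 25%, and three vectors of identical length give S = 0%, matching the intuition that S separates uniform global motion from scattered local motion.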


As mentioned in the section above, these parameters can be used for video quality estimation with no audio stream present. For the purposes of this thesis (based on the work in [3]), I decided to use ensemble based quality estimation, which is easy to implement in the MATLAB environment. For training the ensemble model, Entool 1.1 [27] was used, which is available for free download as a MATLAB toolbox. The aim was to train a defined ensemble of models with a set of four motion sensitive objective parameters (Z, N, S, U) and BR. The ensemble consists of different model classes to improve the performance in regression problems. A closer view of the ensemble modelling will be given later in the section about the audiovisual quality estimation metric.

The simplified scheme for the estimation of video quality based on content adaptive parameters is depicted in Figure 3.6.

Figure 3.6: Video quality estimation based on content adaptive parameters.

Quite good objective MOS scores can be achieved with this method. For validation of the ensemble based quality estimation, the Pearson correlation was used:

r = (x − x̄)^T (y − ȳ) / sqrt( ((x − x̄)^T (x − x̄)) · ((y − ȳ)^T (y − ȳ)) ),    (3.4)

where x is the vector of MOS values from the subjective tests (averaged over all subjective evaluations), x̄ is the average MOS value over x, y is the vector of objective MOS values obtained by the estimation metric, and ȳ is the average value over y.
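Equation (3.4) translates directly into code (a minimal pure-Python sketch for illustration):

```python
import math

def pearson(x, y):
    """Pearson correlation between subjective MOS x and objective MOS y,
    following Eq. (3.4): centred dot product over the product of norms."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    num = sum((a - mx) * (b - my) for a, b in zip(x, y))
    den = math.sqrt(sum((a - mx) ** 2 for a in x) *
                    sum((b - my) ** 2 for b in y))
    return num / den
```

A perfectly linear estimator gives r = 1, so the 85.85% reported below means the objective scores track the subjective MOS closely but not perfectly.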

The best result achieved with this metric, expressed by the Pearson correlation factor, was 85.85%.

A similar method will be described later in the section on audiovisual quality estimation.


Chapter 4

Audiovisual quality

4.1 Introduction

Provisioning of mobile video services is a difficult challenge since, in the mobile environment, bandwidth and processing resources are limited. Audiovisual content is present in most multimedia services; however, the user expectation of perceived audiovisual quality differs for speech and non-speech contents. One of the challenges is to improve the subjective quality of audio and audio-visual services. Due to advances in audio and video compression and the wide-spread use of standard codecs such as AMR and AAC (audio) and MPEG-4/AVC (video), provisioning of audio-visual services is possible at low bit rates while preserving perceptual quality. The Universal Mobile Telecommunications System (UMTS) release 4 (implemented by the first UMTS network elements and terminals) provides a maximum data rate of 1920 kbps shared by all users in a cell, and release 5 offers up to 14.4 Mbps in the downlink direction for High Speed Downlink Packet Access (HSDPA). The following audio and video codecs are supported for UMTS video services: for audio, the AMR speech codec, AAC Low Complexity (AAC-LC) and AAC Long Term Prediction (AAC-LTP) [2]; for video, H.263, MPEG-4 and MPEG-4/AVC [2]. The appropriate encoder settings for UMTS video services differ for various contents and streaming application settings (resolution, frame and bit rate) [3].

End-user quality is influenced by a number of factors including mutual compensation effects between audio and video, content, encoding and network settings, as well as transmission conditions. Moreover, audio and video are not only mixed in the multimedia stream; there is even a synergy of the component media (audio and video) [4]. As previous work has shown, mutual compensation effects cause perceptual differences in video with a dominant voice in the audio track rather than in video with other types of audio [5]. Video content with a dominant voice


includes news, interviews, talk shows, etc. Finally, audio-visual quality estimation models tuned for video content with a dominant human voice perform better than universal models [5]. Therefore, our focus within this work is on the design of an audiovisual metric based on audio and video content adaptive features. We are looking at measures that do not need the original (non-compressed) sequence for the estimation of quality, because this reduces the complexity and at the same time broadens the possibilities for deploying the quality prediction. Furthermore, we investigated novel ensemble based estimation systems. Prior work on ensemble based estimation shows that ensemble based systems are more beneficial than their single classifier counterparts [28].

4.2 Audiovisual quality assessment

4.2.1 Test Methodology

The proposed test methodology is based on ITU-T P.911 [29] and adapted to our specific purpose and limitations. For this particular application it was considered that the most suitable experimental method among those proposed in the ITU-T Recommendation is ACR, also called the Single Stimulus Method. The ACR method is a category judgement in which the test sequences are presented one at a time and are rated independently on a category scale. Only degraded sequences are displayed, and they are presented in arbitrary order. This method imitates the real world scenario, because the customers of mobile video services do not have access to the original videos (high quality versions). On the other hand, ACR introduces a higher variance in the results compared to other methods in which the original sequence is also presented and serves as a reference for the test subjects.

After each presentation the test subjects were asked to evaluate the overall quality of the sequence shown. In order to measure the perceived quality, a subjective scaling method is required. However, whatever the rating method, this measurement will only be meaningful if there actually exists a relation between the characteristics of the video sequence presented and the magnitude and nature of the sensation that it causes in the subject. The existence of this relation is assumed. Test subjects evaluated the video quality after each sequence in a prepared form using a five grade MOS scale: "5–Excellent", "4–Good", "3–Fair", "2–Poor", "1–Bad". Higher discriminative power was not required, because our test subjects were used to five grade MOS scales (from school). Furthermore, a five grade MOS scale offers the best trade-off between the evaluation interval and the reliability of the results. Higher discriminative power can introduce higher variations in the MOS results.

To emulate the real world conditions of the UMTS video service, all the audio


Figure 4.1: Snapshots of selected sequences for the audiovisual test: Video clip (left), Soccer (middle), Video call (right).

and video sequences were played on the UE (Vodafone VPA IV). In this single point the proposed methodology for audiovisual quality testing is not compliant with ITU-T P.911 [29]. Furthermore, since one of our intentions is to study the relation between audio quality and video quality, we decided to run all the tests with a standard stereo headset. During the training session of three sequences the subjects were allowed to adjust the volume level of the headset to a comfortable level. The viewing distance from the phone was not fixed and was selected by the test person, but we noticed that all subjects were comfortable holding the cell-phone at a distance of 20-30 cm.

4.2.2 Encoder Settings

Resolution   Audio Codec   Video BR [kbps]   Video FR [fps]   Audio BR [kbps]   Audio SR [kHz]
QVGA         AAC           190.28            12.5              16                16
VGA          AAC           231.40            15                32                16
QVGA         AAC           173.80            12.5              32                16
VGA          AAC           229.80            12.5              32                16
QVGA         AAC           75.87             12.5              16                16

Table 4.1: Encoding settings of the Video clip sequence.

All video sequences were encoded using typical settings for the UMTS environment. Due to the limitations of mobile radio resources, bit rates were selected in the range 64–260 kbps. Encoded audio files with insufficient audio quality were excluded. The test sequences were encoded with the H.264/AVC baseline profile 1b codec. The audio was encoded with the AAC or AMR codec. The encoding parameters were selected according to our former experience described in [3] and [5]. In total, 12 encoding combinations were tested (see Tables 4.1, 4.2, 4.3). To evaluate subjective perceptual audiovisual quality, a group of 15 people was chosen for the training set and a group of 16 people for the evaluation set. The chosen groups ranged over different ages (between 22 and 30), genders, education and experience. The sequences were presented in an arbitrary order, with the additional condition that the same sequence (even differently degraded) did not appear twice in succession. In the further processing of the results we rejected the sequences which were evaluated with an individual variance higher than one. In total, 6% of the obtained results were rejected. Two rounds of each test were taken. The duration of each test round was about 20 minutes.
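The variance-based rejection rule can be sketched as follows. This is one possible reading of the criterion, and the data layout (one list of subject votes per sequence) is an assumption made for illustration:

```python
def reject_high_variance(votes_per_sequence, max_var=1.0):
    """Keep only sequences whose variance of individual MOS votes does
    not exceed max_var (screening rule sketched from the text)."""
    kept = []
    for votes in votes_per_sequence:
        mean = sum(votes) / len(votes)
        var = sum((v - mean) ** 2 for v in votes) / len(votes)
        if var <= max_var:
            kept.append(votes)
    return kept
```

A sequence rated [5, 5, 4] has variance well below one and is kept, while a split vote such as [1, 5, 1] has variance above one and is discarded as unreliable.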

Resolution   Audio Codec   Video BR [kbps]   Video FR [fps]   Audio BR [kbps]   Audio SR [kHz]
QVGA         AAC           199.14            15                16                16
QVGA         AAC           92.30             15                16                16
QVGA         AAC           181.46            12.5              32                16
QVGA         AAC           196.98            12.5              16                16
QVGA         AAC           182.98            15                32                16

Table 4.2: Encoding settings of the Soccer sequence.

For the audiovisual quality tests, three different content types (Video clip, Soccer and Video call) with different perception of the video and audio media were selected. The video snapshots are depicted in Figure 4.1. The first two sequences, Video clip and Soccer, contain a lot of local and global movement. The main difference between them is in their audio part. In Soccer, the speaker's voice as well as loud support from the audience is present, and the speaker's voice is rather important. The results depicted in Figure 4.2 show the importance of video quality. Especially important are the small moving objects: players and the ball. Furthermore, it can be seen that a higher audio BR does not significantly improve the audiovisual quality. Finally, the video medium is more dominant for the soccer content. In Video clip, instrumental music with voice is present in the foreground. The results depicted in Figure 4.3 show the importance of audio quality. In Video call, a human voice is the most dominant.

Furthermore, the obtained results for Video call and Soccer show that a higher resolution has little or no impact on audiovisual quality. This was influenced by the granularity of the LCD on the test PDA.

Resolution   Audio Codec   Video BR [kbps]   Video FR [fps]   Audio BR [kbps]   Audio SR [kHz]
QVGA         AAC           202.84            12.5              16                16
QVGA         AMR           59.25             7                 5                 8

Table 4.3: Encoding settings of the Video call sequence.


Figure 4.2: Measured MOS results for Soccer video sequences.

Figure 4.3: Measured MOS results for Video clip sequences.

4.2.3 Prior Art

In former work [3], [5] we investigated audiovisual quality for different content classes, codecs and encoding settings. The obtained subjective video quality results clearly show the existence of a mutual influence between audio and video and the presence of the mutual compensation effect. Figures 4.4 and 4.5 (the colour code serves only for better visualisation of the results) show results of audiovisual quality assessment based on H.263 encoding. The mutual compensation effect was more dominant for the Cinema trailer and Video clip contents, as shown in Figure 4.5. In Video call, the audiovisual quality is more influenced by the audio quality than by the video quality, as shown in Figure 4.4. More details can be found in [3].


Figure 4.4: MOS results for the Video call content, codec combination H.263/AAC.

Further investigation within this work shows that it is beneficial to propose one audiovisual model with different coefficients for various video contents, depending on the presence (Video call) or absence (Video clip and Cinema trailer) of a dominant human voice. Therefore, within the new work presented in this contribution, an additional parameter was introduced for detecting speech and non-speech audio content (cf. Section 4.3.2).

4.3 Feature extraction

The proposed method is focused on reference-free audiovisual quality estimation. The character of the sequence is determined by content-dependent audio and video features in between two scene changes. Therefore, the investigation of the audio and video streams was focused on sequence motion features as well as on audio content and quality. The video content significantly influences the subjective video quality [3], [31], and the sequence motion features reflect the video content very well. The well-known ITU-T standard P.563 [14] was used for audio quality estimation. Furthermore, a speech/non-speech detector was introduced to eliminate the different influence of the mutual compensation effect between audio and video in speech and non-speech content. Finally, temporal segmentation was


Figure 4.5: MOS results for the Video clip content - codec combination H.263/AAC.

also used as a prerequisite in the process of video quality estimation. For this purpose, a scene change detector was designed with an adaptive threshold based on the video dynamics. The scene change detector design is described in detail in [3].

4.3.1 Video feature extraction

The focus of our investigation is on the motion features of the video sequences. The motion features can be used directly as an input into the estimation formulas or models; both possibilities were investigated in [41], [32] and [3], respectively. The investigated motion features concentrate on the motion vector statistics, including the size distribution and the directional features of the motion vectors (MV) within one sequence of frames between two cuts. Zero MVs allow estimating the size of the still regions in the video pictures. That, in turn, allows analysing the MV features of the regions with movement separately. This particular MV feature makes it possible to distinguish between rapid local movements and global movement. Moreover, the perceptual quality reduction in the spatial and temporal domain is very sensitive to the chosen motion features, making them very suitable for reference-free quality estimation, because a higher compression does not necessarily reduce the subjective video quality (e.g. in static sequences).


These particular MV features make it possible to detect rapid local movements or the character of global movements. The selection of the MV features is based on multivariate statistical analysis; the details can be found in [3], and the MV features used in this audiovisual estimation method are further described in Section 3.3.1, Quality estimation based on content sensitive parameters.
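To make the motion-feature idea concrete, the following sketch computes a few illustrative MV statistics for one shot. The function name, the feature names and the direction-spread measure are our own illustrations, not the exact features selected in [3]:

```python
import math

def mv_features(mvs, zero_thresh=0.0):
    """Illustrative motion-vector statistics for one shot (between two cuts).

    `mvs` is a list of per-macroblock motion vectors (dx, dy) pooled over the
    frames of the shot.  The feature names are hypothetical; the thesis
    selects its actual MV features by multivariate statistical analysis.
    """
    mags = [math.hypot(dx, dy) for dx, dy in mvs]
    # zero MVs mark still regions; their share approximates the still-region size
    zero_ratio = sum(1 for m in mags if m <= zero_thresh) / len(mvs)
    moving = [(dx, dy) for (dx, dy), m in zip(mvs, mags) if m > zero_thresh]
    if not moving:  # fully static shot
        return {"zero_ratio": zero_ratio, "mean_mag": 0.0, "dir_spread": 0.0}
    mmags = [math.hypot(dx, dy) for dx, dy in moving]
    angles = [math.atan2(dy, dx) for dx, dy in moving]
    c = sum(math.cos(a) for a in angles) / len(angles)
    s = sum(math.sin(a) for a in angles) / len(angles)
    # circular spread: near 0 for one coherent global movement,
    # near 1 for uncorrelated local movements
    dir_spread = 1.0 - math.hypot(c, s)
    return {"zero_ratio": zero_ratio,
            "mean_mag": sum(mmags) / len(mmags),
            "dir_spread": dir_spread}
```

A single coherent pan yields a `dir_spread` near zero, while chaotic local movements push it towards one, which is the distinction described in the text.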

4.3.2 Audio feature extraction

Many reliable estimators for audio quality were proposed recently, and some of them became standards [16], [15] and [14]. For our purpose, a reference-free estimation method called "Single ended method for objective speech quality assessment in narrow-band telephony applications" [14] turned out to be very suitable. The 3SQM [14] performs audio quality estimation in two stages: the first stage includes intermediate reference system filtering, signal normalisation and voice activity detection. In the second stage of operation, twelve parameters based on the processed input signal are calculated. These parameters take into account speech level, noise, delay, repeated frames, disruptions in the pitch period, and artificial components in the speech signal (beeps, clicks). These 12 parameters are then linearly combined to form the final audio quality prediction (on the MOS scale). As previous work has shown [4], mutual compensation effects cause differences in the perception of video content with a dominant voice in the audio track compared to video with other types of audio [5]. Video contents with a dominant voice include news, interviews, talk shows, and so on. Finally, audiovisual quality estimation models tuned for video content with a dominant human voice perform better than universal models [5]. Therefore, our further investigation was focused on the design of speech detection algorithms suitable for the mobile environment.

For this purpose, a new low-complexity method was proposed, which is further described in Chapter 2, Section 2.4, Audio Content Estimation.
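The second stage of 3SQM described above amounts to a linear combination of the twelve distortion parameters mapped to the MOS scale. A minimal sketch of such a mapping (the weights and bias are invented placeholders, not the actual P.563 coefficients):

```python
def linear_mos(params, weights, bias):
    """Linearly combine distortion parameters into a MOS-scale prediction.

    weights/bias are hypothetical placeholders; P.563 defines its own
    coefficients.  The result is clipped to the 1..5 MOS range.
    """
    score = bias + sum(w * p for w, p in zip(weights, params))
    return min(5.0, max(1.0, score))
```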

4.4 Audiovisual quality estimation

According to our former experience with metric design [3, 32], we propose an ensemble based estimation for the investigated scenario. Ensemble based estimators average the outputs of several estimators in order to reduce the risk of an unfortunate selection of a poorly performing estimator. The idea of using more than one classifier for estimation originally comes from the neural network community [33]. In the last decade, research in this field has expanded towards strategies [34] for generating individual classifiers and for combining the classifiers. The aim is to train a defined ensemble of models with a feature vector based on


audio and video content sensitive objective parameters:

X = [Z, N, S, U, BR, MOSa, CCa]^T.

The ensemble consists of different model classes to improve the performance in regression problems. The theoretical background [35] of this approach is that an ensemble of heterogeneous models usually leads to a reduction of the ensemble variance, because the cross terms in the variance contribution have a higher ambiguity. A data set with input values of the feature vector X and output value (MOS) y with a functional relationship is considered, where e is an estimation error:

y = f(X) + e. (4.1)

The weighted average f̄(X) of the ensemble of models is defined as follows:

f̄(X) = Σ_{k=1}^{K} wk fk(X),   (4.2)

where fk(X) denotes the k-th individual model and the weights wk sum to one (Σ_k wk = 1). The generalisation (squared) error q(X) of the ensemble is given by:

q(X) = (y(X) − f̄(X))².   (4.3)

According to [35], the error can be decomposed as follows:

q(X) = q̄(X) − ā(X).   (4.4)

This decomposition allows us to neglect the mixed terms of the following equations, where the average error q̄(X) of the individual models is:

q̄(X) = Σ_{k=1}^{K} wk (y(X) − fk(X))²,   (4.5)

and the average ambiguity ā(X) of the ensemble is:

ā(X) = Σ_{k=1}^{K} wk (fk(X) − f̄(X))².   (4.6)


• A consequence of (4.4) is that the ensemble generalisation error q(X) is always smaller than or equal to the average error q̄(X) of the individual models [35].

• The previous equations (4.1)–(4.6) imply that an ensemble should consist of well trained but diverse models in order to increase the ensemble ambiguity.
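The decomposition in (4.2)–(4.6) can be verified numerically at a single input point; the model outputs below are made up purely for illustration:

```python
def ensemble_decomposition(y, preds, weights):
    """Return (q, q_bar, a_bar) for one input point X.

    y:       true output value at X
    preds:   individual model outputs fk(X)
    weights: ensemble weights wk, summing to one
    """
    f_bar = sum(w * f for w, f in zip(weights, preds))                 # eq. (4.2)
    q = (y - f_bar) ** 2                                               # eq. (4.3)
    q_bar = sum(w * (y - f) ** 2 for w, f in zip(weights, preds))      # eq. (4.5)
    a_bar = sum(w * (f - f_bar) ** 2 for w, f in zip(weights, preds))  # eq. (4.6)
    return q, q_bar, a_bar
```

For any weights and predictions, q equals q_bar − a_bar exactly, and since a_bar is non-negative, the ensemble error never exceeds the weighted average individual error.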

This prerequisite was applied to an ensemble of universal models. In order to estimate the generalisation error and to select models for the final ensemble, a cross-validation scheme for model training [36] was used. This scheme increases the ambiguity and thus improves the generalisation of the trained models. Furthermore, an unbiased estimator of the ensemble generalisation error was obtained. The cross-validation works as follows:

• The data set is divided into two subsets and the models are trained on the first set.

• The models are evaluated on the second set; the model with the best performance becomes an ensemble member.

• The data set is divided into two new subsets, slightly overlapping with the previous ones, and the models are trained on the first set.

• The cross-validation continues until the ensemble has the desired size. The best trade-off between ensemble complexity and performance was achieved for an ensemble of six estimators.

The final step in the design of an ensemble based system is to find a suitable combination of models. Due to outliers and overlaps in the distribution of the data set, it is impossible to propose a single estimator with perfect generalisation performance. Therefore, an ensemble of many classifiers was designed and their outputs were combined such that the combination improves upon the performance of a single classifier. Moreover, classifiers with significantly different decision boundaries from the rest of the ensemble set were chosen. This property of an ensemble set is called diversity. The above mentioned cross-validation introduces model diversity, since training on slightly different data sets leads to different estimators (classifiers). Additionally, diversity was increased by using two independent models: the k-nearest neighbour rule and an artificial neural network. Furthermore, during cross-validation, classifiers with a correlation below 50% on the second set were automatically excluded.

1. As the first estimation model, we chose a simple nonparametric method, the k-Nearest Neighbour rule (kNN) with an adaptive metric [36]. This method is very flexible and does not require any preprocessing of the training data.


The kNN decision rule assigns to an unclassified sample point the classification of the nearest sample point from a set of previously classified points. Moreover, a locally adaptive form of the k-nearest neighbour rule was used for classification. The value of k is selected by cross-validation.

2. As the second model an Artificial Neural Network (ANN) was used. A network with three layers was proposed: an input layer taking the five objective parameters, one hidden layer, and an output layer delivering the estimated MOS. Each ANN has 90 neurons in the hidden layer. As the learning method, Improved Resilient Propagation (IRPROP+) with back propagation [37] was used. IRPROP+ is a fast and accurate learning method for solving estimation tasks on this data set.
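The kNN member in item 1 can be sketched as a plain nearest-neighbour regressor. For brevity this sketch uses the Euclidean distance, whereas the thesis uses a locally adaptive metric; k would be chosen by cross-validation:

```python
import math

def knn_predict(train_X, train_y, x, k=3):
    """Predict a MOS value as the mean over the k nearest training samples."""
    dists = sorted(
        (math.dist(xi, x), yi) for xi, yi in zip(train_X, train_y)
    )
    nearest = [yi for _, yi in dists[:k]]
    return sum(nearest) / len(nearest)
```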

The ensemble consists of six estimators, three based on kNN and three based on ANN. The three estimators of each type were trained on the different training and evaluation subsets defined above. Each of the six estimators results in a MOS value:

MOSi = fi(X),   i = 1, ..., 6.   (4.7)

Finally, the estimated MOS is the average of the individual MOSi values obtained from the proposed ensemble of estimators:

MOS = (1/6) Σ_{i=1}^{6} MOSi.   (4.8)
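The averaging step of (4.7) and (4.8) is then a one-liner; the six trained kNN/ANN members are stood in for by plain callables:

```python
def ensemble_mos(estimators, x):
    """Average the predictions of the ensemble members, eq. (4.8)."""
    preds = [f(x) for f in estimators]   # MOSi = fi(X), eq. (4.7)
    return sum(preds) / len(preds)
```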

4.5 Performance evaluation

To validate the performance of the proposed ensemble based estimator, the Pearson (linear) correlation factor [38] was applied:

r = (x − x̄)^T (y − ȳ) / sqrt( ((x − x̄)^T (x − x̄)) ((y − ȳ)^T (y − ȳ)) ).   (4.9)

Here, the vector x contains the average MOS values of the evaluation set (averaged over two runs of all obtained subjective evaluations for a particular test sequence and one encoding setting) for all tested encoded sequences, and x̄ denotes the mean of x. The vector y contains the predictions made by the proposed metric, and ȳ denotes the mean of y. The dimension of x and y equals the number of tested sequences. The obtained Pearson correlation factor of 91% reflects an excellent fit (see Figure 4.6) with the independent evaluation set for all content types together.
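Equation (4.9) translates directly into code. In this sketch, x would hold the averaged subjective MOS values and y the metric's predictions:

```python
import math

def pearson_r(x, y):
    """Pearson (linear) correlation factor of eq. (4.9)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    num = sum((a - mx) * (b - my) for a, b in zip(x, y))
    den = math.sqrt(sum((a - mx) ** 2 for a in x) *
                    sum((b - my) ** 2 for b in y))
    return num / den
```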

Furthermore, it was necessary to provide an objective comparison with state-of-the-art estimation methods. For this purpose we selected the audiovisual quality


Figure 4.6: Estimated MOS over subjective MOS results.

models proposed in [30]. These models relate the individual audio and video quality (MOSa and MOSv) to the subjective audiovisual quality:

MOS = a + b · MOSa + c · MOSv,   (4.10)

MOS = a + d · MOSa × MOSv,   (4.11)

MOS = a + b · MOSa + c · MOSv + d · MOSa × MOSv.   (4.12)

Several different forms of equations were analysed, including the cross products in (4.11) and (4.12). The MOSa was based on the Auditory Distance (AD) algorithm [39]; the corresponding subjective audio quality model was presented in [30]. AD is a measure of the difference between the original and the degraded audio signal; thus, a larger AD indicates a poorer quality of the degraded audio signal. AD is linearly fitted to the subjective audio quality:

MOSa = 4.388 − 0.638 · AD.   (4.13)

For video quality estimation, the well-known ANSI T1.801.03 metric [40] was applied. ANSI T1.801.03 is based on quality parameters that measure the perceptual effects of a wide range of impairments such as blurring, block distortion, unnatural motion, noise and error blocks. Each quality parameter is calculated from a quality feature, defined as a quantity of information associated with a spatial-temporal sub-region of a video stream. ANSI T1.801.03 is a full reference metric. In order to provide a fair comparison of our ensemble based estimator and the audiovisual quality models (4.10), (4.11) and (4.12), we trained the audiovisual quality models


(4.10), (4.11) and (4.12) on our training set. The model coefficients were obtained by linear regression (see Table 4.4). The correlation of each linear model with the subjective test results on the training set is denoted as r_train, and the correlation on the evaluation set as r_eval. The comparison was performed on the evaluation set.

Estimator                    a          b          c         d       r_train   r_eval
(4.10)                    -21.036     5.499      0.038       —        0.29      0.15
(4.11)                      2.630       —          —        0.029     0.20      0.11
(4.12)                    213.251   -48.026    -86.865     19.854     0.44      0.03
Ensemble based estimator      —         —          —          —       0.98      0.91

Table 4.4: Coefficients and correlations of audiovisual quality estimators.
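The linear-regression fit behind Tables 4.4–4.6 is ordinary least squares. A dependency-free sketch for the two-regressor model (4.10), solving the normal equations by Gaussian elimination (illustrative only, not the original fitting code):

```python
def fit_linear_model(mos_a, mos_v, mos):
    """Least-squares fit of MOS = a + b*MOSa + c*MOSv, i.e. model (4.10).

    Builds the 3x3 normal equations X^T X coef = X^T y and solves them
    with Gaussian elimination and partial pivoting.
    """
    rows = [(1.0, a, v) for a, v in zip(mos_a, mos_v)]
    A = [[sum(r[i] * r[j] for r in rows) for j in range(3)] for i in range(3)]
    b = [sum(r[i] * y for r, y in zip(rows, mos)) for i in range(3)]
    for col in range(3):
        piv = max(range(col, 3), key=lambda r: abs(A[r][col]))
        A[col], A[piv] = A[piv], A[col]
        b[col], b[piv] = b[piv], b[col]
        for r in range(col + 1, 3):
            f = A[r][col] / A[col][col]
            for c in range(col, 3):
                A[r][c] -= f * A[col][c]
            b[r] -= f * b[col]
    coeffs = [0.0, 0.0, 0.0]
    for r in (2, 1, 0):  # back substitution
        coeffs[r] = (b[r] - sum(A[r][c] * coeffs[c]
                                for c in range(r + 1, 3))) / A[r][r]
    return tuple(coeffs)  # (a, b, c)
```

Models (4.11) and (4.12) fit the same way after adding the MOSa × MOSv product as an extra regressor column.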

While the correlations on the training set are quite fair, the audiovisual quality models (4.10), (4.11) and (4.12) unfortunately have extremely poor correlations on our evaluation set (see Table 4.4). The ensemble based estimator significantly outperforms the simple quality models. This is mainly caused by the following factors:

• The ANSI T1.801.03 and (4.13) estimators do not consider the audio and video content, which significantly influences the subjective quality.

• The ANSI T1.801.03 and (4.13) estimators were not originally designed for the mobile environment.

Moreover, the results depicted in Table 4.4 show that it is essential to consider the audiovisual content and scenario in the estimator design.

Since these results show extremely poor correlations, we decided to build further audiovisual quality models for comparison with our ensemble based model. These audiovisual models are again based on the previous equations (4.10), (4.11) and (4.12), but use MOSa parameters obtained by different metrics. In both cases the video quality was estimated with the ANSI T1.801.03 metric.

Table 4.5 describes the model coefficients for the audiovisual models that use the PESQ audio quality metric to estimate the MOSa score. As for the previously mentioned models, the coefficients were obtained by linear regression.

Estimator                    a          b          c         d       r_train   r_eval
(4.10) PESQ                0.1526     0.8889     0.0405      —        0.39      0.36
(4.11) PESQ                2.5058       —          —        0.0529    0.28      0.25
(4.12) PESQ               19.3622    -5.2270    -6.1220     1.9489    0.60      0.31
Ensemble based estimator      —         —          —          —       0.98      0.91

Table 4.5: Coefficients and correlations of audiovisual quality estimators using the PESQ and ANSI T1.801.03 metrics.


Again, the correlations on the training set are fair, but as in the previous case, Figure 4.7 shows that this model also has a very low correlation with the subjective test results, and our ensemble based model significantly outperforms it.

Figure 4.7: Estimated MOS over subjective MOS results for the model with the MOSa score obtained by the PESQ metric.

The last audiovisual models use the 3SQM audio quality metric for the estimation of MOSa. Table 4.6 shows the coefficients for the models (4.10), (4.11) and (4.12) and their correlations on the evaluation set. The results can be seen in Figure 4.8.

Estimator                    a          b          c         d       r_train   r_eval
(4.10) 3SQM                3.5923     0.4662    -0.3928      —        0.67      0.39
(4.11) 3SQM                2.7657       —          —        0.0450    0.52      0.33
(4.12) 3SQM                1.8803     2.0568    -0.0370    -0.3363    0.68      0.37
Ensemble based estimator      —         —          —          —       0.98      0.91

Table 4.6: Coefficients and correlations of audiovisual quality estimators using the 3SQM and ANSI T1.801.03 metrics.

The linear audiovisual models based on the ANSI T1.801.03 video metric and the PESQ or 3SQM audio metric show a better correlation on the evaluation set than the one using the AD parameter, but the results still reach only about 30–40% correlation with the subjective tests.


Figure 4.8: Estimated MOS over subjective MOS results for the model with the MOSa score obtained by the 3SQM metric.


Chapter 5

Conclusions

The recent development in handheld devices and video encoding has brought a significant improvement in the processing power of handheld devices, thus allowing increasing screen resolutions. Multimedia applications are therefore of growing interest and make up a large part of the transmitted data in mobile communications. As service providers are requested to guarantee high quality but at the same time desire to limit the necessary resources for delivery, it is important to optimise the perceived quality of multimedia services. Here, mutual compensation effects in audiovisual transmissions offer interesting trade-offs. Although in poor reception conditions one part of the transmitted stream may not be updated, or only at a very low rate, the information is carried mostly by the other multimedia mode (for example audio), and thus the perceived quality is moderate to high although the required data rate is relatively low.

The aim of this thesis was to design a non-reference audiovisual quality metric. For this purpose, several steps had to be taken. First, subjective tests were performed to obtain a reference basis for the further development of an objective metric.

For the subjective tests we decided to use the ACR assessment method, since it is the most suitable when a user does not have access to the original sequences. Three significantly different content classes were put into the testing scenario and all sequences were encoded with encoding settings typically used in the mobile environment. The audio stream was encoded either with the AAC codec, mostly where non-speech content was present, or with the AMR codec for the videos containing human speech. For the video stream only one video codec was used, H.264/AVC, since it is nowadays the most advanced video codec; only various bitrate settings were explored for the video stream. The results from the subjective tests show that the MOS grades can differ by up to three degrees for the same audiovisual content encoded with different


encoding settings. The results also show that it is very important to encode the audio stream with higher bitrate settings, and thus to higher quality, because the audio quality strongly affects the perceived audiovisual quality. Even when the video stream was of poor quality, or encoded with a lower bitrate, the whole audiovisual stream was graded with a very high MOS grade if the audio stream had higher quality. With this setting, providers can save bandwidth, because higher audio quality places lower demands on the bandwidth than higher video quality.

The other important step in developing the objective audiovisual quality metric was choosing the audio objective parameters. An easy way to obtain a suitable audio objective parameter is to measure the audio quality. Since at the time of this research only one reference-free audio quality metric was available, we decided to use it. Thus, one of the two audio objective parameters was the result of the 3SQM (also known as the Single Sided Speech Quality Metric, based on the ITU-T P.563 recommendation) audio quality metric.

The second objective audio parameter was a binary value distinguishing between non-speech (valued as 0) and speech audio content (valued as 1). For this purpose we proposed a new low-complexity algorithm, which preserves high accuracy and still remains suitable for use in the mobile environment. The algorithm was proposed in two variants: one with lower accuracy and lower computational demands, and another with high accuracy and higher computational demands. This algorithm will also be available as a separate work [1].

Former works were the basis for the video quality estimation. Based on [3], several video objective parameters were extracted from the video stream. Four statistical properties of the MVs and the bitrate of the video stream were used as the input for the ensemble model. Finally, the non-reference objective audiovisual quality metric was proposed. The proposed metric shows very good correlation with the subjective tests and can in future be well adapted for different audio or video parameters. At the end of the thesis we compared our method with state-of-the-art audiovisual quality estimation methods.

Our contribution shows that clever estimation techniques extracting side information, such as the speech/non-speech character of the audio stream, can result in excellent quality estimation. Such a quality metric, in turn, is the basis for optimising transmission methods for specific contents.


Bibliography

[1] M. Ries, B. Gardlo, M. Rupp, P. De Leon, "Low-Complexity Voice Detector for Mobile Environments," International Conference on Systems, Signals & Image Processing, June 2009. [cited at p. 13, 60]

[2] 3GPP TS 26.234 V6.13.0: "Transparent end-to-end Packet-switched Streaming Service (PSS); Protocols and codecs," Mar. 2008. [cited at p. 1, 43]

[3] M. Ries, "Video Quality Estimation for Mobile Video Streaming," Doctoral thesis, INTHFT, Vienna University of Technology, Vienna, Austria, Oct. 2008. [cited at p. 1, 2, 38, 39, 41, 43, 45, 47, 48, 49, 50, 60]

[4] S. Tasaka, Y. Ishibashi, "Mutually Compensatory Property of Multimedia QoS," in Proc. of IEEE International Conference on Communications 2002, vol. 2, pp. 1105–1111, NY, USA, 2002. [cited at p. 1, 12, 43, 50]

[5] M. Ries, R. Puglia, T. Tebaldi, O. Nemethova, M. Rupp, "Audiovisual Quality Estimation for Mobile Streaming Services," in Proc. of 2nd Int. Symp. on Wireless Communications (ISWCS), pp. 173–177, Siena, Italy, Sep. 2005. [cited at p. 1, 2, 12, 43, 44, 45, 47, 50]

[6] D. Wu, M. Tanaka, R. Chen, L. Olorenshaw, M. Amador, X. Menendez-Pidal, "A robust speech detection algorithm for speech activated hands-free applications," in Proc. of ICASSP'99, pp. 2407–2410, Mar. 1999. [cited at p. 2, 12]

[7] J. Junqua, C. B. Mak, B. Reaves, "A Robust Algorithm for Word Boundary Detection in the Presence of Noise," IEEE Transactions on Speech and Audio Processing, vol. 2, no. 3, Jul. 1994. [cited at p. 2, 12]

[8] L. Mauuary, J. Monne, "Speech/non-Speech Detection for Speech Response Systems," in Proc. of Eurospeech 93, Berlin, pp. 1097–1100, September 1993. [cited at p. 2, 12]


[9] H. Hermansky, "Perceptual linear predictive (PLP) analysis of speech," J. Acoust. Soc. Am., vol. 87, no. 4, pp. 1738–1752, Apr. 1990. [cited at p. 2, 12]

[10] M. G. Bulmer, "Principles of Statistics," New York: Dover Publications, 1967. [cited at p. 12, 13]

[11] L. Lu, H. Jiang, H. J. Zhang, "Content analysis for audio classification and segmentation," IEEE Transactions on Speech and Audio Processing, vol. 10, no. 7, Oct. 2002. [cited at p. 12, 13, 15]

[12] P. De Leon, "Short-Time Kurtosis of Speech Signals with Application to Co-Channel Speech Separation," in Proc. IEEE Int. Conf. Multimedia and Expo, NY, USA, 2000. [cited at p. 13]

[13] D. P. W. Ellis, PLP and MFCC in Matlab, 2005, accessed on 10th December 2008. [Online]. Available: http://www.ee.columbia.edu/~dpwe/resources/matlab/rastamat/melfcc.m [cited at p. 19]

[14] Opticom GmbH, "3SQM Advanced Non-Intrusive Voice Quality Testing," Opticom GmbH, Erlangen, Germany, 2004. [cited at p. 21, 26, 48, 50]

[15] ITU-R Recommendation BS.1387-1, "Method for objective measurements of perceived audio quality," revised 11/01. [cited at p. 21, 24, 50]

[16] ITU-T Recommendation P.862, "PESQ, An objective method for end-to-end speech quality assessment of narrowband telephone networks and speech codecs," February 2001. [cited at p. 22, 23, 50]

[17] H. Holma, J. Melero, J. Vainio, T. Halonen, J. Mäkinen, "Performance of Adaptive Multirate (AMR) Voice in GSM and WCDMA," IEEE Vehicular Technology Conference, VTC 2003-Spring, vol. 4, 22–25 April 2003. [cited at p. 11]

[18] H. Møller, M. F. Sørensen, D. Hammershøi, C. B. Jensen, "Head-Related Transfer Functions of Human Subjects," J. Audio Eng. Soc., vol. 43, no. 5, pp. 300–321, 1995. [cited at p. 3]

[19] E. Zwicker, "Subdivision of the audible frequency range into critical bands," The Journal of the Acoustical Society of America, vol. 33, Feb. 1961. [cited at p. 6, 7]

[20] D. J. M. Robinson, M. J. Hawksford, "Time-Domain Auditory Model for the Assessment of High-Quality Coded Audio," preprint 5017, presented at the 107th Convention of the Audio Engineering Society, New York, Sept. 1999. [cited at p. 6, 10]


[21] ISO/IEC JTC1, "Coding of audio-visual objects – Part 2: Visual," ISO/IEC 14496-2 (MPEG-4 Visual version 1), April 1999; Amendment 1 (version 2), February 2000; Amendment 4 (streaming profile), January 2001. [cited at p. 29]

[22] ITU-T H.264, Series H: Audiovisual and multimedia systems, Infrastructure of audiovisual services – Coding of moving video, "Advanced video coding for generic audiovisual services," International Telecommunication Union, Mar. 2005. [cited at p. 30]

[23] ITU-T H.263, Series H: Audiovisual and multimedia systems, Infrastructure of audiovisual services – Coding of moving video, "Video coding for low bit rate communication," International Telecommunication Union, Jan. 2005. [cited at p. 29]

[24] [Online]. Available: http://www.hpx500.eu/technical advanced progressive.php?lang=en [cited at p. 34, 67]

[25] I. E. G. Richardson, "H.264 and MPEG-4 Video Compression: Video Coding for Next-Generation Multimedia," John Wiley & Sons Ltd., Mar. 2005. [cited at p. 32]

[26] T. Wiegand, G. J. Sullivan, G. Bjontegaard, A. Luthra, "Overview of the H.264/AVC video coding standard," IEEE Trans. on Circuits and Systems for Video Technology, vol. 13, no. 7, July 2003. [cited at p. 35]

[27] C. Merkwirth, J. Wichard, "Short introduction to ENTOOL," January 2004. [Online]. Available: http://www.j-wichard.de/entool/ [cited at p. 41]

[28] R. Polikar, "Ensemble based systems in decision making," IEEE Circuits and Systems Magazine, vol. 6, no. 3, pp. 21–45, Third Quarter 2006. [cited at p. 44]

[29] ITU-T Recommendation P.911, "Subjective audiovisual quality assessment methods for multimedia applications," International Telecommunication Union, 1998. [cited at p. 44, 45]

[30] C. Jones, D. J. Atkinson, "Development of Opinion-Based Audiovisual Quality Models for Desktop Video-Teleconferencing," 6th IEEE International Workshop on Quality of Service, Napa, CA, USA, May 1998. [cited at p. 54]

[31] G. Zhai, J. Cai, W. Lin, X. Yang, W. Zhang, M. Etoh, "Cross-dimensional Perceptual Quality Assessment for Low Bitrate Videos," IEEE Transactions on Multimedia, vol. 10, no. 7, pp. 1316–1324, Nov. 2008. [cited at p. 48]

[32] M. Ries, O. Nemethova, M. Rupp, "Performance evaluation of mobile video quality estimators," invited paper, in Proc. of 15th European Signal Processing Conference (EUSIPCO), Poznan, Poland, Sep. 2007. [cited at p. 49, 50]


[33] B. V. Dasarathy, B. V. Sheela, "Composite classifier system design: Concepts and methodology," in Proc. of the IEEE, vol. 67, no. 5, pp. 708–713, 1979. [cited at p. 50]

[34] L. I. Kuncheva, "Combining Pattern Classifiers: Methods and Algorithms," New York: Wiley Interscience, 2005. [cited at p. 50]

[35] A. Krogh, J. Vedelsby, "Neural Network Ensembles, Cross Validation and Active Learning," Advances in Neural Information Processing Systems 7, MIT Press, 1995. [cited at p. 51, 52]

[36] T. Hastie, R. Tibshirani, J. Friedman, "The Elements of Statistical Learning," Springer, 2001. [cited at p. 52]

[37] C. Igel, M. Hüsken, "Improving the Rprop learning algorithm," in Proc. of the 2nd Int. Symp. on Neural Computation, pp. 115–121, Berlin, ICSC Academic Press, 2000. [cited at p. 53]

[38] VQEG: "Final report from the Video Quality Experts Group on the validation of objective models of video quality assessment," 2000, available at http://www.vqeg.org/. [cited at p. 53]

[39] S. Voran, "Objective Estimation of Perceived Speech Quality. Part I: Development of the Measuring Normalizing Block Technique," IEEE Transactions on Speech and Audio Processing, vol. 7, no. 4, Jul. 1999. [cited at p. 54]

[40] ANSI T1.801.03, "American National Standard for Telecommunications – Digital transport of one-way video signals. Parameters for objective performance assessment," American National Standards Institute, 2003. [cited at p. 54]

[41] M. Ries, O. Nemethova, M. Rupp, "Motion Based Reference-Free Quality Estimation for H.264/AVC Video Streaming," in Proc. of IEEE Int. Symp. on Wireless Pervasive Computing (ISWPC), San Juan, Puerto Rico, US, Feb. 2007. [cited at p. 49]


List of Symbols and Abbreviations

Abbreviation Description Definition

3SQM Single Sided Speech Quality Metric page 26

AAC Advanced Audio Codec page 11AD Auditory Distance page 54AMR Adaptive Multi Rate page 11AMR-NB Adaptive Multi Rate Narrow Band page 11AMR-WB Adaptive Multi Rate Wide Band page 11ANN Artificial Neural Network page 53ASO Arbitrary Slice Ordering page 38

CIF Common Intermediate Format page 31

DCT Discrete Cosine Transform page 33

FMO Flexible Macroblock Ordering page 37FR Frame Rate page 29

GMSK Gaussian Minimum Shift Keying page 11
GSM Global System for Mobile Communications (formerly Groupe Spécial Mobile) page 11

HVS Human Visual System page 31
HZCRR High Zero Crossings Rate Ratio page 12

IRPROP+ Improved Resilient Propagation page 53

kNN k-Nearest Neighbour page 52

LLR Log Likelihood Ratio test page 12

LZW Lempel-Ziv-Welch Algorithm page 30

MFCC Mel-Frequency Cepstrum Coefficients page 19
MOS-LQO Mean Opinion Score - Objective Listening Quality page 27
MOV Model Output Variable page 24
MPEG Moving Picture Experts Group page 11
MV Motion Vector page 39

NAL Network Abstraction Layer page 37
NMR Noise to Mask Ratio page 21

PAMS Perceptual Analysis Measurement System page 21
PEAQ Perceptual Evaluation of Audio Quality page 21
PESQ Perceptual Evaluation of Speech Quality page 22
PSQM Perceptual Speech Quality Measurement page 21

QCIF Quarter Common Intermediate Format page 31
QVGA Quarter Video Graphics Array page 31

RLE Run Length Encoding page 30

SNR Signal to Noise Ratio page 27

UMTS Universal Mobile Telecommunications System page 1

VGA Video Graphics Array page 31

WCDMA Wideband Code Division Multiple Access page 11

List of Figures

2.1 Outer ear. page 4
2.2 Cochlea and basilar membrane. page 5
2.3 Hair cells. page 5
2.4 Simultaneous masking: relation between the masking threshold curve and the sound pressure level of a 440 Hz masker. page 8
2.5 Absolute hearing threshold and frequency masking: a 2 kHz tone must have a sound pressure level over SP to be audible. page 9
2.6 Temporal pre- and post-masking. page 10
2.7 Example of a speech signal (time domain). page 13
2.8 Example of a non-speech signal (time domain). page 14
2.9 Probability density function of the speech and non-speech audio samples. page 14
2.10 Plot of the ZCR of the speech signal. page 16
2.11 Kurtosis values of speech and non-speech signals. page 17
2.12 HZCRRM values of speech and non-speech signals. page 17
2.13 Cumulative distribution function of κ. page 18
2.14 Two-stage speech detector. page 18
2.15 Structure of the generic perceptual measurement algorithm. page 22
2.16 Structure of the PESQ algorithm. page 23
2.17 Structure of the PEAQ algorithm. page 24
2.18 Block diagram of the 3SQM analysis algorithm. page 27

3.1 Red, Green and Blue components of an image. page 32
3.2 Luminance and chrominance components of the picture in YCbCr colour space. page 33
3.3 4:2:2 colour sampling pattern [24]. page 34
3.4 4:2:0 colour sampling pattern [24]. page 34
3.5 Example of various video content types. page 40
3.6 Video quality estimation based on content-adaptive parameters. page 41

4.1 Snapshots of selected sequences for the audiovisual test: Video clip (left), Soccer (middle), Video call (right). page 45
4.2 Measured MOS results for Soccer video sequences. page 47
4.3 Measured MOS results for Video clip sequences. page 47
4.4 MOS results for the Video call content, codec combination H.263/AAC. page 48
4.5 MOS results for the Video clip content, codec combination H.263/AAC. page 49
4.6 Estimated MOS over subjective MOS results. page 54
4.7 Estimated MOS over subjective MOS results for the model with the MOSa score obtained by the PESQ metric. page 56
4.8 Estimated MOS over subjective MOS results for the model with the MOSa score obtained by the 3SQM metric. page 57

List of Tables

2.1 The center frequencies and bandwidths for each of the 25 Bark bands. page 7
2.2 AMR modes in GSM and WCDMA. page 11
2.3 Speech audio corpus. page 16
2.4 Non-speech audio corpus. page 16
2.5 Accuracy results for detection of non-speech and speech from coded audio. page 20
2.6 Time needed for content estimation. page 20
2.7 Description of Mean Opinion Scores. page 21

3.1 Video resolutions used in the mobile environment. page 31

4.1 Encoding settings of the Video clip sequence. page 45
4.2 Encoding settings of the Soccer sequence. page 46
4.3 Encoding settings of the Video call sequence. page 46
4.4 Coefficients and correlations of audiovisual quality estimators. page 55
4.5 Coefficients and correlations of the audiovisual quality estimator using the PESQ and ANSI T1.801.03 metrics. page 55
4.6 Coefficients and correlations of the audiovisual quality estimator using the 3SQM and ANSI T1.801.03 metrics. page 56
