

Smooth Quality Streaming of Live Internet Video

Dimitrios Miras and Graham Knight
Dept. of Computer Science, University College London

Gower St., London WC1E 6BT
Email: {d.miras, g.knight}@cs.ucl.ac.uk

February 2004

Abstract

A live video stream, when encoded and transmitted over a congestion-controlled IP flow, experiences variations in quality of service due to changes in video content activity and bandwidth availability. As a consequence, the perceived quality of the video can suffer frequent oscillations, which are particularly disturbing to the viewer. We therefore tackle the problem of accommodating the mismatch between the available transmission rate and the encoding rate required for stable perceived quality. By utilizing a reliable metric of perceived quality, we develop a technique for source rate control of real-time live video that maintains a more uniform quality. The method comprises an artificial neural network that generates predictions of the on-going quality and a fuzzy rate-quality controller that considers properties of human perception of quality in order to provide smooth streaming quality. Experimental results indicate that, in the presence of sufficient buffering, the proposed adaptation technique can improve quality stability while maintaining TCP-friendly transmission.

1 Introduction

Encoding and transmitting live video over the Internet is subject to significant variations in quality. This is attributed to the video content's inherently varying spatio-temporal complexity: video scenes with low spatial activity and motion are easier to encode with good quality, while complex visual content and motion increase the distortion introduced by the encoder. Furthermore, in order to avoid congesting network resources and to be fair to compliant flows, media streams need to employ congestion control to determine their fair share of bandwidth and adapt their transmission rate to match it [3]. Unfortunately, confining the source rate to the TCP-friendly rate of the stream results in frequent fluctuations in video quality. Such variability in quality is extremely annoying to the human viewer; users prefer video with medium but stable quality to video that oscillates between high and low quality [5]. Most work on smooth quality video in the literature concerns the streaming of stored media rather than live video material. In such cases, the system has future frames at its disposal to perform encoding optimizations like multiple-pass coding or efficient packet scheduling. Furthermore, measurement of video quality performance is usually limited to a few objective metrics like mean square error (MSE) and peak signal-to-noise ratio (PSNR) (e.g., [27]). Work in [15] proposes smoothness criteria for layered streams based on layer runs, defined as the number of consecutive frames in a layer. Notwithstanding the fact that frequent oscillations in the number of transmitted layers result in variations in quality, the assumption that layer smoothness coincides with quality smoothness cannot be substantiated. This problem is aggravated by the fact that the algorithm presented works on layered CBR streams; it is known that CBR video exhibits high quality variation. Kim and Ammar [13] extend this work to solve the problem of TCP-friendly streaming of layered FGS MPEG-4 video with minimum quality variation. Again, layer runs are


Figure 1: Definition of the spatio-temporal (S-T) region: a block of pixels (horizontal width by vertical width) spanning a temporal width of consecutive frames (Fn, ..., Fn+5).

used as an indication of smoothness, which preserves the aforementioned disadvantages. Furthermore, the experiments simulated the transmission of high-rate streams (4 Mbps); such rates can support high-quality video anyway and are not typical of what the majority of Internet users experience today.

This work presents a rate-quality adaptation method for live video that reduces fluctuations of quality. Our method differs from other adaptation techniques, which rely on inaccurate metrics to represent quality, in its use of a realistic quality metric. Internal representations of the on-going quality are obtained by an objective video quality metric [25], which has been shown to provide ratings highly correlated with human judgements of quality. The basic implementation concepts of this metric are introduced in the next section. In section 3 we introduce the proposed system architecture, describe related terminology and present the main challenges that arise. We first recognise that application of an objective quality metric places a computational burden on the streaming server. Therefore, we develop a method for on-line estimation of the ongoing quality, based on an artificial neural network (section 4), that yields accurate predictions of the objective quality. We then present (section 5) a rate-quality controller based on the principles of fuzzy logic that utilises predictions from the neural network and manipulates the encoding rate in order to provide a smooth evolution of quality. Experimental results of the approach are presented therein.

2 Description of the objective quality metric

An important issue that arises with video adaptation is to understand its impact on video quality and how to measure it. Pixel-error metrics (like MSE and PSNR) are widely used for this purpose; however, they suffer from a major drawback: they do not always correspond well with human judgements of quality [6], especially at low-to-modest bit-rates (up to a few hundred Kbps). The main issue with MSE and PSNR is that they cannot discriminate between impairments that humans can and cannot see, or between impairments that are more or less annoying. Until recently, the most trusted method of measuring the quality of digital video was subjective quality assessment [11]. This method, however, requires costly and complex setups and is therefore not suitable for on-line quality monitoring. The answer to this problem is the recent emergence of objective video quality metrics (VQM) (e.g., [8, 19, 25, 22, 24, 18]). These are computational models that measure video quality in a way that preserves high correlation with human ratings of quality, by accounting for the type and magnitude of perceived distortions in the video signal. Current research on objective quality metrics is at a considerably mature level and several models are under evaluation and approaching standardisation [20].

We implemented and used the ITS VQM [25], developed at the Institute for Telecommunication Sciences, to measure perceived quality. This metric attained significant performance during a recent


Figure 2: Components of the smooth quality adaptation framework (TCP-friendly congestion control, feature extraction module, ANN quality predictor, fuzzy rate-quality controller, encoder, and send/receive buffer monitor) and the interactions between the involved modules.

evaluation by the Video Quality Experts Group (VQEG) [21]. Its algorithm is based on the extraction and statistical summarization of scalar, spatio-temporal features from the original and degraded video frames, to obtain a single measure of perceived distortion. Summarization of these features occurs within spatio-temporal (S-T) regions, usually 8 x 8 pixels x 6 frames (Figure 1). For each S-T region, features from both the original and distorted frames are compared using functions that resemble human perception and visual masking, to obtain measures that quantify the level of perceptual distortions present (tiling, blurring, motion jerkiness, etc.). The calculated measures of perceptual impairments from each S-T region are then pooled over the temporal duration of the S-T region (6 frames) and summarised by averaging the worst 5% of measured distortions (this reflects the fact that subjective quality is primarily determined by the worst quality during the observation period). A single score of perceptual distortion for the whole video sequence (typically 8-10 sec long) is obtained by averaging the measured impairments of every 6-frame evaluation period (which we call the S-T period) over the duration of the clip. Score values are in the [0, 1] range, with zero corresponding to a sequence with imperceptible impairments and one to a heavily distorted video. If D(t) is the S-T period perceptual distortion value produced by the ITS VQM at time index t, then the value 1 - D(t) can be considered a reasonable representation of the instantaneous quality for the purposes of real-time quality monitoring. We call this value the S-T period quality. S-T period quality scores are scaled to the [0, 100] range (with 0 representing unacceptable quality and 100 perfect quality).
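For concreteness, the pooling just described can be sketched in a few lines of Python. The function and array names below are illustrative assumptions, not the ITS VQM implementation, which also includes the feature extraction and perceptual comparison stages omitted here.

```python
import numpy as np

def st_period_quality(region_distortions):
    """Collapse the per-region distortions of one 6-frame S-T period into a
    single quality score on the [0, 100] scale (100 = perfect quality).
    `region_distortions` is a 1-D array of perceptual distortion values in
    [0, 1], one per 8x8x6 S-T region."""
    d = np.sort(np.asarray(region_distortions))[::-1]   # worst distortions first
    n_worst = max(1, int(np.ceil(0.05 * d.size)))       # the worst 5% of regions
    D = d[:n_worst].mean()                              # S-T period distortion
    return 100.0 * (1.0 - D)                            # S-T period quality

def clip_quality(period_distortions):
    """Average the per-period distortions over the whole clip, mirroring the
    clip-level pooling described above."""
    return 100.0 * (1.0 - np.mean(period_distortions))
```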

3 Problem formulation and proposed architecture

Achieving real-time video streaming with consistent quality requires a method that manipulates the source rate so that more bits are allocated to scenes or frames with high spatial and temporal energy. In other words, the problem is how to control the encoding rate, denoted Renc, in the presence of a variable and unknown available transmission rate (the TCP-friendly rate of the stream, denoted Rtcpf), so that the resulting target quality, Qtarget, is a smoothed alternative of the quality that the encoder would have produced if the video rate was set to Rtcpf (denoted Qtcpf). To enable a quality-based rate adaptation,


a method that associates an encoding bit-rate with the resulting encoding quality in real time is required. Then, appropriate target quality values can be continuously chosen for successive S-T periods. A smoother quality means that at times Qtarget is higher than Qtcpf, while at other times it is lower. Consequently, a similar relationship holds between Renc and Rtcpf. Note that the sender always transmits at its TCP-friendly rate; therefore, mismatches between the two rates are accommodated using a sender and a receiver buffer. Hence, the system has to maintain buffer stability at the same time.

The ITS VQM can be used to obtain continuous S-T period quality scores as described in section 2. Doing so, however, requires encoding and decoding of the S-T period frames at several candidate bit-rates, and the subsequent application of the metric. This approach is prohibitive in terms of real-time performance, a strict requirement of live video coding. In order to bypass this time-consuming process, our system utilises an artificial neural network (ANN) to automatically generate accurate predictions of the continuous S-T period quality scores when presented with the content features of the input frames and a target encoding rate. S-T quality scores obtained from our implementation of the ITS VQM are used to train the ANN. The details of the ANN quality predictor are presented in section 4.

Figure 2 illustrates the architecture and the components of the proposed system. A companion congestion control module is periodically sampled to elicit the nominal transmission bit-rate of the stream (Rtcpf). Although the proposed system is not bound to a specific transmission control policy, we assume TCP-friendly congestion control [4]. The video encoder receives video frames from a live video source (camera, satellite feed, etc.) with the task of producing a compressed bitstream. Every S-T period t, summary content statistics of video features are extracted (cf. section 4) from a small number (six) of consecutive frames. Based on the content feature statistics, which reflect the complexity of the underlying visual content, and the current nominal transmission rate Rtcpf(t), the neural network generates a prediction of the resulting quality, Qtcpf(t). The sampling of the TCP-friendly rate and the estimation of the continuous quality scores are therefore carried out at a period equal to the duration of the S-T period (i.e., every 6 frames, or 200 ms for a 30 frames per second input video). This period is an efficient tradeoff between a suitable granularity of network adaptation (the network adaptation timescale is dictated by the frequency of incoming acknowledgements of the TCP-friendly protocol) and the duration of the quality evaluation period of the ITS VQM. It also minimises the additional delay and buffering requirement at the sender. Finally, a fuzzy rate-quality controller receives successive values of Qtcpf and an estimate of the sender and receiver buffer sizes to determine a value for Qtarget that achieves the desired encoding quality and maintains the stability of both send and receive buffers. The function of the controller is to locate, by further invocations of the ANN, the encoding bit-rate, Renc, that approximates Qtarget. We discuss why a controller based on the principles of fuzzy logic is a good choice for our system, and present how it determines the target quality Qtarget, in section 5.

4 Neural network quality predictor

An Artificial Neural Network (ANN) is a general, practical form of machine learning that provides a robust approach to approximating real, discrete or vector target functions, and learns to interpret complex real-world data. When suitably trained, ANNs can provide accurate estimation of the output(s) based on a selection of inputs, efficiently predict non-linear relationships among multidimensional data and support a general paradigm for dealing with complex mathematical functions. Extensive research in this area has resulted in a multitude of approaches to neural network computing; we limit our discussion to the very basic principles that govern an ANN and to the most popular type of ANN, the multi-layer perceptron with error back-propagation [7, 17], which is the one used in this work.

The basic building block of a neural network is an elementary neuron, or perceptron (Figure 3).



Figure 3: The structure of the basic ANN component (a neuron, or perceptron) and a feedforward neural network with n inputs, one hidden layer with m neurons, and one output layer. WL is the weights matrix for layer L.

Each input vector xT = [x1, x2, ..., xn] is weighted with appropriate weights, where wi defines the contribution of input xi to the perceptron's output. The sum of the weighted inputs, together with a bias b, is passed through a differentiable transfer function f to produce the output of the neuron, or activation:

$$\alpha = f\Big(\sum_{i=1}^{n} w_i x_i + b\Big)$$

Layers of several perceptrons can be combined to form a multi-layer feedforward network (Figure 3). Feedforward networks often have one or more hidden layers of non-linear neurons. Multiple layers of perceptrons with non-linear transfer functions, like the log-sigmoid, $\mathrm{logsig}(x) = 1/(1 + \beta e^{-x})$, or the tangent-sigmoid, $\mathrm{tansig}(x) = (e^{x} - e^{-x})/(e^{x} + e^{-x})$, allow the network to learn linear and non-linear relationships between inputs and output(s), without an a-priori assumption of a specific model form. The function of a neural network is to determine suitable values for a set of adjustable parameters, like the weights and biases at every layer and neuron, by performing an iterative procedure, called training or learning, on the set of training samples. These adjustable parameters are given random initial values, and the training process consists of two steps per iteration. For a set of training input vectors with a known response y, a forward pass calculates all the activations at every neuron to generate a predicted response $\tilde{y}$. Then, a backpropagation step is used to adjust all the weights of the neural network based on the magnitude of the error between the predicted and actual output:

$$E_k = \tfrac{1}{2}\,(y_k - \tilde{y}_k)^2 \qquad (1)$$

$$E = \sum_{k=1}^{K} E_k \qquad (2)$$



where Ek is the network error for training pattern xk and K is the total number of training patterns. The error cost measure in expression (1) is commonly used for its simplicity, and it represents the deviation of the network's output from the ideal. The task of training is to find the weights and biases that minimise E. This iterative procedure with newly optimised parameters is repeated until an acceptably low error is achieved. Several algorithms have been proposed to adjust the weights at every iteration of the training phase, and gradient descent is probably the most popular [2]. Essentially, this method performs iterative steps in the weight space, proportional to the negative gradient of the cost function E, to update the weights:

$$w_{ij} \leftarrow w_{ij} + \Delta w_{ij}, \qquad \Delta w_{ij} = -\,\eta\,\frac{\partial E}{\partial w_{ij}}, \qquad \frac{\partial E}{\partial w_{ij}} = \sum_{\forall k} \frac{\partial E_k}{\partial w_{ij}},$$

where η is the step size parameter, usually called the learning rate. An ANN is therefore an optimisation technique that attempts to locate the minimum of a multidimensional error surface, which usually includes several local minima. A neural network might not always find the absolute minimum, but an acceptable local minimum close to it. After the training phase, the ANN can be validated for its generalisation capability by comparing its output with the actual (expected) values, where the input data come from a set of samples unknown during the training phase, called the test set. A usual problem that occurs during the training process is over-fitting. The error on the training set may be reduced to a very small value, but when presented with new, unknown test patterns, the network performs poorly (large prediction error) because it has almost memorized the training samples. The tendency for over-fitting increases with the network size, but the best size for the network is difficult to know beforehand. Early stopping is a technique that is very often used to stop the training process before the network starts to over-fit. In this method, the available data for training are split into a training set and a monitoring set, and the error on the monitoring set is also inspected during training. While at the beginning both the training and monitoring errors decrease, when the network begins to over-fit the training data, the monitoring error starts to increase. If this increase continues for a specific number of iterations, the training process is stopped and the ANN parameters (weights and biases) that presented the minimum monitoring error are retained.
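As a concrete illustration of this training procedure, the following Python/numpy sketch trains a one-hidden-layer network with gradient-descent backpropagation and early stopping against a monitoring set. The learning rate, patience and initialisation are assumed values, not the settings used in our experiments.

```python
import numpy as np

def train_with_early_stopping(X_tr, y_tr, X_mon, y_mon,
                              n_hidden=18, lr=0.01, max_iter=5000, patience=50):
    """Sketch of backpropagation with early stopping for a feedforward net
    with one tanh hidden layer and a linear output (hyper-parameters are
    illustrative assumptions)."""
    rng = np.random.default_rng(0)
    n_in = X_tr.shape[1]
    W1 = rng.normal(scale=0.1, size=(n_in, n_hidden)); b1 = np.zeros(n_hidden)
    W2 = rng.normal(scale=0.1, size=(n_hidden, 1));    b2 = np.zeros(1)

    def forward(X):
        h = np.tanh(X @ W1 + b1)                  # hidden-layer activations
        return h, h @ W2 + b2                     # linear output layer

    best, best_err, bad = None, np.inf, 0
    n = X_tr.shape[0]
    for _ in range(max_iter):
        h, y_hat = forward(X_tr)
        err = y_hat - y_tr.reshape(-1, 1)         # gradient of the squared error
        gW2 = h.T @ err; gb2 = err.sum(0)         # backpropagate to output layer
        dh = (err @ W2.T) * (1.0 - h**2)          # tanh'(x) = 1 - tanh(x)^2
        gW1 = X_tr.T @ dh; gb1 = dh.sum(0)        # ... and to the hidden layer
        W1 -= lr * gW1 / n; b1 -= lr * gb1 / n
        W2 -= lr * gW2 / n; b2 -= lr * gb2 / n
        # Early stopping: keep the weights with the lowest monitoring error.
        mon_err = np.mean((forward(X_mon)[1].ravel() - y_mon) ** 2)
        if mon_err < best_err:
            best_err, bad = mon_err, 0
            best = (W1.copy(), b1.copy(), W2.copy(), b2.copy())
        else:
            bad += 1
            if bad >= patience:                   # monitoring error kept rising
                break
    return best, best_err
```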

The motivation behind the use of neural networks is that the encoding quality of video depends primarily on the source rate and the level of spatial activity and motion in the video scene (under the reasonable assumptions of a fixed picture size (resolution) and video codec). Therefore, the ANN model operates on visual content descriptors that are extracted from the input video frames during the encoding process, on a per-S-T-period basis, and directly yields objective quality scores that are highly correlated with the quality values that the ITS VQM would have produced. The function that maps content feature vectors into objective quality ratings is learned by training the neural network. For the (off-line) training process, continuous objective quality scores are obtained by directly using the ITS VQM. The ANN method does not rely on the availability of the distorted version of video frames during its real-time operation. Quality predictions are sought based only on features that are extracted from the original input frames. The main challenge of the ANN design is the extraction of appropriate features from the visual content. These features should (i) adequately represent most of the spatio-temporal activity of the video content and (ii), since real-time performance is important, be computable as part of the normal operation of the encoder, so that no significant overhead occurs.

Keeping in mind the requirement of real-time processing, a set of content features, summarised in Table 1, is extracted from every original frame within the S-T period. Four of these features measure


texture complexity: the pixel activity, PelAct, defined as the standard deviation of luminance pixels in each (8 x 8) block, averaged over the number of blocks in the frame, and the spatial spread of pixel activity, PelActSpread, defined as the deviation of block-level PelAct values over the frame. Similar features are calculated to measure the edge activity within a frame. Edges convey significant visual information, reveal texture, and are more susceptible to certain encoding impairments than flat regions of the image (e.g., blurring distorts the intensity of edges). From a human visual system point of view, spatial and texture masking are sensitive to the intensity of areas with edge activity. To determine the edge activity, we calculate the magnitude of pixel gradients in each block by applying a Sobel filter (gradient operator) at each pixel value:

$$\mathrm{magn}(\nabla p_{i,j}) = \big|p_{i-1,j-1} + 2p_{i-1,j} + p_{i-1,j+1} - p_{i+1,j-1} - 2p_{i+1,j} - p_{i+1,j+1}\big| + \big|p_{i-1,j-1} + 2p_{i,j-1} + p_{i+1,j-1} - p_{i-1,j+1} - 2p_{i,j+1} - p_{i+1,j+1}\big|,$$

where pi,j is the luminance value of the pixel at row i and column j of the frame's pixel grid. The edge activity, EdgeAct, is the standard deviation of magn(∇pi,j) values in every block, averaged over the number of frame blocks. The spread of edge activity, EdgeActSpread, is calculated similarly to PelActSpread. Motion-related features are also extracted with the aim of covering the range of motion attributes. The sum of absolute pixel differences, soad, is a measure of pixel change between the current (motion-estimated) frame and its reference frame. The average magnitude of the motion vectors (MV) over the whole frame, MVMagn, and the spatial variance of the MV magnitudes, MVMagnVar, are also calculated. To locate frames where strong motion in portions of the image may lead to localised impairments, the average magnitude of MVs is also measured for each of the four spatial quadrants of the frame, resulting in four additional features: MVMagnUL, MVMagnUR, MVMagnLL, and MVMagnLR. The ratio of motion-estimated macroblocks (MB) over the total number of MBs, MERatio, is also calculated as a representative measure of the coding efficiency of the motion estimation process. Motion complexity, MotCompl, is calculated as follows: motion vectors are classified according to the dominant axis of the vector (up, down, left, right, none), and the variance of this five-bin histogram is taken. A uniform histogram of the directional MVs reveals a more complex motion throughout the frame. MotDirChange represents changes in the motion direction, and is formed by subtracting the MVs of corresponding motion-estimated blocks in successive frames and averaging over the number of macroblocks in the frame:

$$\mathrm{MotDirChange} = \frac{1}{M}\sum_{i}\big\|\mathbf{mv}_F(i) - \mathbf{mv}_{F'}(i)\big\|,$$

where F′ is the reference frame of frame F used for motion estimation. MotAccel captures the change in the motion speed (acceleration), again averaged over the number of MBs:

$$\mathrm{MotAccel} = \frac{1}{M}\sum_{i}\big(\|\mathbf{mv}_F(i)\| - \|\mathbf{mv}_{F'}(i)\|\big).$$

Descriptive statistics of these features are then calculated over the 6-frame period to obtain content feature descriptors. These summary statistics are the mean, median, standard deviation, minimum and maximum values, and the 5, 25, 75 and 95-percentiles. In total, one hundred and thirty-five (135) content feature descriptors are gathered per S-T activity period. We modified an H.263+ video codec to perform the feature extraction process and the calculation of the descriptive statistics (this process can be applied, with minimal modifications, to any other hybrid DCT-based codec that employs motion estimation).
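To illustrate, the motion-related descriptors reduce to a few array operations once the encoder's motion vectors are available. The Python sketch below mirrors the MotCompl, MotDirChange and MotAccel definitions and the per-period summarisation; the array shapes and the dominant-axis binning convention are assumptions.

```python
import numpy as np

def motion_features(mv, mv_ref):
    """Motion descriptors for one frame, given per-macroblock motion vectors
    `mv` and `mv_ref` (shape (M, 2)) for the frame and its motion-estimation
    reference; names and conventions are illustrative."""
    mag = np.linalg.norm(mv, axis=1)
    mag_ref = np.linalg.norm(mv_ref, axis=1)

    # MotCompl: classify each MV by dominant axis (right/left/down/up/none)
    # and take the variance of the 5-bin histogram; a flat histogram means
    # complex, multi-directional motion across the frame.
    dx, dy = mv[:, 0], mv[:, 1]
    bins = np.where(mag == 0, 4,
           np.where(np.abs(dx) >= np.abs(dy),
                    np.where(dx >= 0, 0, 1),
                    np.where(dy >= 0, 2, 3)))
    mot_compl = np.var(np.bincount(bins, minlength=5))

    # MotDirChange: mean norm of the MV difference against the reference;
    # MotAccel: mean change of MV magnitude (acceleration).
    mot_dir_change = np.mean(np.linalg.norm(mv - mv_ref, axis=1))
    mot_accel = np.mean(mag - mag_ref)
    return mot_compl, mot_dir_change, mot_accel

def summarise(values):
    """Per-S-T-period summary statistics of a frame-level feature:
    9 statistics x 15 frame features give the 135 descriptors per period."""
    v = np.asarray(values)
    return dict(mean=v.mean(), median=np.median(v), std=v.std(),
                min=v.min(), max=v.max(),
                **{f"p{q}": np.percentile(v, q) for q in (5, 25, 75, 95)})
```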

4.1 Neural network architecture and prediction performance

The ANN architecture comprises a two-layer feedforward network with backpropagation, with one hidden layer of nh neurons and a non-linear (tangent-sigmoid) transfer function, and a linear output layer.


Table 1: Content features extracted from the original video frames

  PelAct: pixel activity averaged over all blocks
  PelActSpread: deviation of pixel activity over all blocks
  EdgeAct: edge activity averaged over all blocks
  EdgeActSpread: deviation of edge activity over all blocks
  soad: sum of absolute pixel differences between adjacent frames
  MVMagn: magnitude of motion vectors
  MVMagnVar: spatial variance of motion vector magnitudes
  MVMagnLL, MVMagnLR, MVMagnUL, MVMagnUR: magnitude of motion vectors per quadrant (lower left & right, upper left & right)
  MERatio: ratio of motion-estimated MBs in the frame
  MotCompl: motion complexity (variance of the directional motion vector histogram)
  MotDirChange: change of motion direction between adjacent frames
  MotAccel: acceleration of motion between adjacent frames

In order to reduce the size of the input vectors that train the neural network, remove redundancies present among the original 135 inputs and retain those variables that are relevant to the model (thus improving both training time and generalisation performance), a data dimensionality reduction process precedes the ANN training process. The first step applies Principal Component Analysis (PCA) to the input data matrix. Principal Component Analysis [12] is a data dimensionality reduction scheme that is very often used with neural networks. This data compression technique extracts characteristic features from the data whilst minimizing the information loss. The basic principle of PCA is the representation of the data by a reduced set of unit vectors (eigenvectors). The eigenvectors are positioned along the directions of greatest data variance, so that the projections from the data points onto the axis of each vector are minimized across the full data set. PCA is applied to the training input vectors (the calibration matrix). Therefore, if F (n x m) is the calibration matrix and P (m x m) is the principal component transformation matrix, the transformed calibration set of patterns is the (n x m) matrix F′ = F x P. Note that the same transformation has to be applied to the set of test patterns as well, using the same transformation matrix P derived from the calibration matrix. Usually, most of the data variance can be explained using the first few principal components (PCs) of F′. While it is difficult to assess the significance of the original input variables to the model, it becomes much easier to do so when the input data are preconditioned with PCA. Input features that are relevant to the model can be derived through a stepwise, trial-and-error method. For example, with stepwise addition, one may start with an initial small set of inputs (the first few PCs) and add one new variable at a time until a satisfactory monitoring or prediction error is achieved. This carries the risk that the method may stop with selected input variables PC1, ..., PCm, while some information important to the model is contained in input PCn, n > m. With stepwise elimination, a deliberately large subset of initial variables is chosen, and variables are subsequently removed until the monitoring or prediction error no longer improves. The selection process of the appropriate input variables can be improved if the relevance of each variable to the model, called its sensitivity, can be estimated. We use a two-variance-based approach for variable sensitivity determination, proposed in [1]. This method estimates the individual contribution of each input variable to the variance of the predicted response of the neural network (which can be derived by evaluating the trained network with all the input variables, except the one under consideration, set to zero). Once all sensitivities are estimated, the variable with the lowest sensitivity is tentatively removed and the ANN is retrained. If the monitoring error decreases, the variable is deemed irrelevant to the model and is removed; otherwise, it is restored and the process continues with the next variable. At the end of this process, the surviving subset of the initial input features forms the new input feature set.
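A minimal sketch of this preconditioning step follows (assuming numpy and an SVD of the centred calibration data); the essential point is that P is fitted on the calibration matrix only and re-used unchanged for the test patterns.

```python
import numpy as np

def pca_fit(F, n_components):
    """Fit PCA on the calibration (training) matrix F (n samples x m
    features); returns the feature means and the transformation matrix P."""
    mu = F.mean(axis=0)
    # Eigenvectors of the covariance matrix, via SVD of the centred data.
    _, _, Vt = np.linalg.svd(F - mu, full_matrices=False)
    P = Vt[:n_components].T               # (m x n_components)
    return mu, P

def pca_apply(F, mu, P):
    """Project patterns with the P derived from the calibration matrix;
    the same P must also be applied to the test patterns."""
    return (F - mu) @ P
```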



Figure 4: A series of neural network predictions together with the actual values obtained from the ITS VQM. The prediction residual is also shown.


Applied to the PCA-transformed inputs, this sensitivity-based stepwise elimination process retained a total of eighteen (18) input variables.
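In outline, the elimination loop looks as follows. This is a simplified Python sketch: it stops as soon as removing the least-sensitive variable no longer helps, and train_fn, monitor_fn and net.predict are assumed helpers standing in for ANN training, monitoring-error evaluation and prediction.

```python
import numpy as np

def stepwise_eliminate(train_fn, monitor_fn, X_tr, y_tr, X_mon, y_mon):
    """Tentatively drop the least-sensitive input, retrain, and commit the
    removal only if the monitoring error does not get worse."""
    keep = list(range(X_tr.shape[1]))
    net = train_fn(X_tr[:, keep], y_tr)
    best_err = monitor_fn(net, X_mon[:, keep], y_mon)
    improved = True
    while improved and len(keep) > 1:
        improved = False
        # Sensitivity of each input: variance of the trained network's
        # response when every other input is set to zero (after [1]).
        sens = []
        for j in range(len(keep)):
            Z = np.zeros_like(X_mon[:, keep])
            Z[:, j] = X_mon[:, keep][:, j]
            sens.append(np.var(net.predict(Z)))
        j_min = int(np.argmin(sens))                 # least relevant input
        trial = keep[:j_min] + keep[j_min + 1:]
        trial_net = train_fn(X_tr[:, trial], y_tr)   # retrain without it
        err = monitor_fn(trial_net, X_mon[:, trial], y_mon)
        if err <= best_err:                          # removal helps: commit
            keep, net, best_err, improved = trial, trial_net, err, True
    return keep, net
```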

A large collection of video scenes was selected to test the proposed neural network model, featuring a wide range of content, camera actions (static, panning, zooming, fades, etc.), and various levels of scene activity. Video frames were extracted from action movies (The Matrix, Terminator, X-Men), sports (extended football clips from the English Premiership) and also several short video clips from the VQEG web site [20]. In total, the test video library contained approximately 39,000 frames (6,500 S-T periods). From the set of 6,500 patterns, 80% were randomly chosen as the training set and the remaining 20% as the validation (test) set. One fourth of the training samples formed the monitoring set, and the rest was the actual training set. All sequences in the video library were encoded at several rates, ranging from 100 Kbps to 2 Mbps (with a step of 100 Kbps). Multiple neural networks were then trained, each corresponding to a distinct encoding rate in the selected range. We performed the sensitivity analysis and the stepwise elimination of input variables for various configurations of hidden-layer neurons. This analysis showed that the value of nh does not significantly affect the prediction performance of the ANN; nevertheless, for the specific data set, a network with nh = 18 produced the smallest monitoring error.

We investigated the ANN prediction ability when presented with unknown input patterns.



Figure 5: Actual quality scores vs. ANN prediction for the test set (400 Kbps).

To clearly visualise that the neural network predictions closely follow the actual S-T quality scores, we plot in Figure 4 a 100-sample subset of the ANN outputs together with the corresponding expected scores. The bottom line on the same graph corresponds to the absolute error between the actual score and the ANN prediction. Figure 5 shows, for the test set of features (approx. 1300 input patterns, encoding rate: 400 Kbps), the actual objective S-T quality scores obtained from the ITS VQM plotted against the corresponding outputs of the neural network. The neural network achieved significant prediction accuracy: the Pearson correlation between the predicted and expected responses was as high as 0.901, and the mean of the absolute residual error was 4.20 with a standard deviation of 3.54. Similar generalisation performance was obtained for the various encoding bit-rates in the 100-2000 Kbps range used in our experiments, as shown in Figure 6 (the prediction error of the ANN is even smaller at higher bit-rates, because quality values were usually at the high end of the scale, allowing the neural network to learn better).

4.1.1 Examination of additional overhead

The on-line quality predictor introduces two additional processing modules into the live-streaming system: the extraction and statistical manipulation of content features inside the video codec, and the invocation of the neural network quality predictor.

The overhead that the feature extraction process and the statistical manipulation of the data impose on the video encoder is not significant. Most chosen features, like pixel activity, soad and motion vectors, apart from the edge energy, are calculated as part of the encoding process, namely for motion estimation, so no additional delay occurs. Features like complexity of motion, acceleration and direction of motion are computed from the values of the motion vectors using simple statistics. Calculation of edge activity in the frame adds a slight overhead (the Sobel gradient involves twelve additions per pixel); moreover, this feature can be removed from the feature vector with only a small loss in the prediction accuracy of the ANN. The rest of the processing cost involves the statistical summarisation of both frame-level features (mean and standard deviation over the frame) and S-T period level content descriptors. In total, the additional overhead is on average less than 15 ms per 6-frame activity period for CIF-size frames on a 2.2 GHz processor, so it does not affect real-time performance. Similarly, the overhead of the neural network is also negligible: by nature, an ANN might require a significant amount of time to train, but the process of calculating a response involves only a small number of operations on the input variable vector.




Figure 6: ANN prediction error at different encoding bit-rates. Error bars extend to the 5 and 95 percentiles.


5 Estimation of encoding rate

We can achieve a more stable target quality Qtarget by smoothing out transient increases and, especially, drops of Qtcpf. At the same time, Qtarget has to be responsive to consistent changes in Qtcpf. As Qtarget deviates from Qtcpf, so does the encoding rate Renc in relation to the transmission rate Rtcpf. While mismatches between the source and channel rates can be alleviated by the sender and receiver buffers in the short term, Qtarget has to follow the trend of Qtcpf in the longer term. The basis of the approach is to calculate the target quality value as a moving average (MA) of Qtcpf:

$$Q_{target}(t) = \alpha \cdot Q_{target}(t-1) + (1 - \alpha) \cdot Q_{tcpf}(t), \qquad \alpha \in [0, 1]$$

for successive S-T periods t. MA predictors are quite simple, but the main design difficulty is the choice of the weight α. Given that in practice the variation of Qtcpf is unknown, setting α to a high value successfully eliminates large variations but lacks responsiveness and compromises the stability of the buffers, while a small value fails to decrease variations. The desired approach is to determine α on-line, according to changes in Qtcpf and the status of the two buffers. We introduce a fuzzy logic controller [14] to dynamically calculate appropriate values for α. Fuzzy logic was introduced by Zadeh [26] to describe vagueness in system behaviour, where variables or parameters do not exhibit exclusive set membership, but a gradual transition between states (or a grade of membership). In our case, a fuzzy controller is useful because, while it is difficult to describe the system's behaviour analytically (the output rate of the encoder can hardly be characterised accurately and the transmission rate is unknown beforehand), we know qualitatively what the behaviour of the system should be.


Figure 7: Membership functions of all fuzzy sets for the controller's inputs (error: neglarge, negsmall, zero, possmall, poslarge; buflev: low, medium, high) and output (parameter α: small, medium, large).

With respect to the sender and receiver buffer sizes, the difference between the encoding and transmission rates has the following effects: if Renc is higher than Rtcpf, the sender buffer size increases; if equal, it remains unchanged; otherwise, it decreases. Similarly, the receiver buffer fills, remains unchanged or empties when the transmission rate is higher than, equal to or lower than the video data playout rate. There are two inputs to the fuzzy controller, error and buflev, and one output, the EWMA parameter α. The input error is associated with the change in the value of Qtcpf (the quality error) between successive S-T periods: ∆Qtcpf = Qtcpf(t) − Qtcpf(t−1). This change represents the level of short-term variation in Qtcpf that we would like to curtail (error is scaled to the [-1, 1] range using error = ∆Qtcpf/20, since the majority of ∆Qtcpf values are confined to the [-20, 20] range, as established through quality experiments with numerous video clips). The input buflev ∈ [0, 1] reveals how buffered data is distributed between the send and receive buffers, and is defined as buflev = Br/(Br + Bs), where Bs and Br are the sender and receiver buffer sizes, respectively. The value of this variable is a convenient way to establish whether the sender or receiver buffer level is low. If the receive buffer runs low (Br → 0) then buflev → 0, while if the send buffer approaches underflow levels (Bs → 0), buflev → 1. Therefore, the system can determine at any time how video data is distributed between the two buffers, monitor whether any buffer runs at low levels and react accordingly. Note that, for low packet loss rates, the sender can quite accurately track the size of the receiver buffer at any time by continuously updating a buffer_sz variable:

$$\mathit{buffer\_sz}(t) = \mathit{buffer\_sz}(t-1) + \big(R_{tcpf}(t) - R_{enc}(t)\big)\cdot T$$

where T is the duration of an adaptation period. Packet loss can, however, contaminate the accuracy of this estimate. Alternatively, a method that lets the receiver feed this information back to the sender is preferable.
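In code, the two buffer-related quantities used by the controller amount to a pair of one-line updates; the sketch below assumes rates and buffer sizes in consistent units and, as noted, ignores the drift that packet loss would introduce.

```python
def buflev(Bs, Br):
    """Distribution of buffered data between the sender and receiver buffers:
    0 means the receiver buffer is empty, 1 means the sender buffer is."""
    return Br / (Br + Bs) if (Br + Bs) > 0 else 0.5

def track_receiver_buffer(buffer_sz, r_tcpf, r_enc, T):
    """Sender-side estimate of the receiver buffer size, mirroring the
    update equation above."""
    return max(0.0, buffer_sz + (r_tcpf - r_enc) * T)
```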

We define five gradations for the fuzzy input error (the linguistic values that error takes on): large negative (neglarge), small negative (negsmall), zero, positive small (possmall), and positive large (poslarge). The buflev variable takes on three linguistic values: low, medium and high. These gradations are enough to describe the different states of both buffers; adding further gradations does not present any obvious advantage and introduces unnecessary complexity. A fuzzy value 'low' means that the receiver buffer runs low, a 'high' value that the sender buffer's level is low, while a 'medium' value means that there is enough data distributed, more or less evenly, between the two buffers. Finally, the gradations of the controller's output α are represented by three linguistic values: small, medium and large. We use standard trapezoidal fuzzy sets, and the corresponding membership functions are shown in Figure 7.

The last step in the design of the fuzzy controller is the definition of the rules that govern its operation. The controller opts for a fuzzy large α in order to preserve a stable evolution of Qtarget. However, when buflev is fuzzy low and there is a fuzzy negative error, α needs to take a smaller value to avoid a


receiver buffer underflow. Qtarget then gets close to Qtcpf, thus avoiding an Renc value that is much higher than the transmission rate Rtcpf. Analogous rules are employed when error is fuzzy positive and buflev is high, to avert a sender buffer underflow. Using this approach to quality control, preference is given to preventing the target quality from dropping to low values. This is in accordance with subjective experiments which reveal that quality during a time interval is primarily determined by the worst impairment observed, and that drops in quality have a greater negative impact than the positive effect of an equal-sized quality increase [25, 9]. Using the above guidelines, the complete set of control rules of the fuzzy controller is defined as follows (a code sketch of the controller follows the rule list):

1. if error is neglarge and buflev is low then α is small

2. if error is neglarge and buflev is medium then α is large

3. if error is neglarge and buflev is high then α is large

4. if error is negsmall and buflev is low then α is medium

5. if error is negsmall and buflev is medium then α is large

6. if error is negsmall and buflev is high then α is large

7. if error is zero and buflev is low then α is large

8. if error is zero and buflev is medium then α is large

9. if error is zero and buflev is high then α is large

10. if error is possmall and buflev is low then α is large

11. if error is possmall and buflev is medium then α is large

12. if error is possmall and buflev is high then α is medium

13. if error is poslarge and buflev is low then α is large

14. if error is poslarge and buflev is medium then α is large

15. if error is poslarge and buflev is high then α is small
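A compact Python sketch of a controller of this type is given below: trapezoidal memberships, the fifteen rules above, and a firing-strength-weighted (Sugeno-style) defuzzification in place of a full Mamdani centroid. The membership breakpoints and the α centre values are assumptions read off Figure 7, not the exact values used in our implementation.

```python
import numpy as np

def trap(x, a, b, c, d):
    """Trapezoidal membership: rises on [a,b], flat on [b,c], falls on [c,d]."""
    if x <= a or x >= d: return 0.0
    if b <= x <= c:      return 1.0
    return (x - a) / (b - a) if x < b else (d - x) / (d - c)

ERROR_SETS = {   # breakpoints assumed from Figure 7
    'neglarge': lambda e: trap(e, -1.01, -1.0, -0.4, -0.2),
    'negsmall': lambda e: trap(e, -0.4, -0.2, -0.2, 0.0),
    'zero':     lambda e: trap(e, -0.2, 0.0, 0.0, 0.2),
    'possmall': lambda e: trap(e, 0.0, 0.2, 0.2, 0.4),
    'poslarge': lambda e: trap(e, 0.2, 0.4, 1.0, 1.01),
}
BUF_SETS = {
    'low':    lambda b: trap(b, -0.01, 0.0, 0.3, 0.5),
    'medium': lambda b: trap(b, 0.3, 0.5, 0.5, 0.7),
    'high':   lambda b: trap(b, 0.5, 0.7, 1.0, 1.01),
}
ALPHA_CENTRES = {'small': 0.3, 'medium': 0.7, 'large': 0.92}  # assumed

RULES = [  # (error set, buflev set, alpha set): the 15 rules above
    ('neglarge', 'low', 'small'),    ('neglarge', 'medium', 'large'),
    ('neglarge', 'high', 'large'),   ('negsmall', 'low', 'medium'),
    ('negsmall', 'medium', 'large'), ('negsmall', 'high', 'large'),
    ('zero', 'low', 'large'),        ('zero', 'medium', 'large'),
    ('zero', 'high', 'large'),       ('possmall', 'low', 'large'),
    ('possmall', 'medium', 'large'), ('possmall', 'high', 'medium'),
    ('poslarge', 'low', 'large'),    ('poslarge', 'medium', 'large'),
    ('poslarge', 'high', 'small'),
]

def fuzzy_alpha(d_qtcpf, buflev):
    """Fire all rules with min-AND and defuzzify as a firing-strength-
    weighted average of the output set centres."""
    e = np.clip(d_qtcpf / 20.0, -1.0, 1.0)      # scale error to [-1, 1]
    num = den = 0.0
    for e_set, b_set, a_set in RULES:
        w = min(ERROR_SETS[e_set](e), BUF_SETS[b_set](buflev))
        num += w * ALPHA_CENTRES[a_set]
        den += w
    return num / den if den > 0 else 0.92

def smooth_quality(q_target_prev, q_tcpf, d_qtcpf, buflev):
    """One controller step: the EWMA update of the target quality."""
    a = fuzzy_alpha(d_qtcpf, buflev)
    return a * q_target_prev + (1.0 - a) * q_tcpf
```

A call such as smooth_quality(q_prev, q_tcpf, q_tcpf - q_tcpf_prev, buflev(Bs, Br)) would then produce the target quality for the next S-T period.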

5.1 Performance of the quality controller

As discussed in section 4.1, the ANN is trained to provide predictions at discrete operating bit-rates R0, R1, ..., RN. By performing a small number of invocations of the ANN, the rate controller performs the simple task of finding i ∈ {0, ..., N − 1} such that QRi ≤ Qtarget ≤ QRi+1. The overhead of this process is insignificant. Assuming that QR is an increasing function of R, Renc is then found by interpolating between Ri and Ri+1. To avoid Renc getting much higher than Rtcpf, which would cause the receiver buffer to drain quickly, we allow it to increase relative to the instantaneous receiver buffer occupancy, i.e., up to ratio = 1 + buflev times Rtcpf during the same period (so ratio ∈ [1, 2]).
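The lookup can be sketched as follows, with predict_q standing in for an ANN invocation at a given operating rate; a real implementation could bisect over the operating points to keep the number of invocations small.

```python
import numpy as np

def encoding_rate(q_target, rates, predict_q, r_tcpf, buflev):
    """Map a target quality back to an encoding rate: bracket q_target
    between the quality predictions at adjacent operating rates,
    interpolate linearly, and cap at (1 + buflev) * r_tcpf."""
    qs = [predict_q(r) for r in rates]          # assumes Q(R) is increasing
    i = int(np.searchsorted(qs, q_target))      # first i with qs[i] >= q_target
    if i == 0:
        r_enc = rates[0]
    elif i == len(rates):
        r_enc = rates[-1]
    else:
        frac = (q_target - qs[i - 1]) / (qs[i] - qs[i - 1])
        r_enc = rates[i - 1] + frac * (rates[i] - rates[i - 1])
    return min(r_enc, (1.0 + buflev) * r_tcpf)  # ratio in [1, 2]
```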

We investigated the ability of the fuzzy controller to (i) provide a smooth encoding quality and (ii) avoid starvation of the sender and receiver buffers. In a simulated transmission scenario using ns-2 [16], 8000 video frames (≈ 260 sec) from an action movie (The Matrix) were transmitted using a TCP-friendly flow (TFRC) [4]. The test video sequence contained scenes with various levels of content activity. The simulation topology was a typical dumbbell network with the bottleneck bandwidth set to 10 Mbps and the delay to 20 ms. To create a realistic variation of bandwidth, a number of background ON/OFF CBR flows, with ON and OFF times drawn from a Pareto distribution [23], also traversed the bottleneck link.



Figure 8: Improvement of quality stability. The proposed system achieves a smoother ongoing quality.

The mean ON and OFF times were 1 sec and 2 sec, respectively. The number of CBR flows was chosen so that the bandwidth available to the TFRC sender was in the range of the encoding bit-rates R0 to RN. The parameters of the experiment were as follows: R0 = 100 Kbps, RN = 2000 Kbps, and the initial buffer build-up delay was 10 sec (i.e., the first 5 sec are used to fill the sender buffer and the remaining 5 seconds the receiver buffer). Note that the sender buffer delay is usually not visible to the receiver application, since a 'broadcast delay' is commonly used in this kind of service.

Figure 8 illustrates the continuous quality for the two approaches: (i) when the encoding rate for each 6-frame period is determined by the TCP-friendly bandwidth share of the flow (Qtcpf), and (ii) when the encoding rate is obtained from the proposed fuzzy quality controller (Qtarget). We observe that the target quality determined by the fuzzy controller follows the trend of Qtcpf, a necessary condition for transmission stability, but at the same time exhibits significantly less variation, as the controller avoids driving Qtarget to excessively low or high values.

We define the magnitude of quality change between successive S-T periods, ∆Q = |Q(t) − Q(t−1)|, as a metric of quality smoothness. High ∆Q values indicate significant quality variation, while low ∆Qs suggest a stable on-going quality. Figure 9 shows the distributions of |∆Qtcpf|, |∆Qtarget| and |∆Qactual|, where Qactual represents the quality that is finally achieved by the encoder. The side-by-side box-plots extend to the minimum and maximum values of the observations, with the horizontal lines indicating the 0.25, 0.5 and 0.75 quantiles. We observe that the proposed system attains a more stable quality while respecting its TCP-friendly transmission rate (readers are encouraged to verify these results by viewing the reconstructed video sequences, as well as further results with other test sequences, in [10]). Notice that the distributions of |∆Qactual| are slightly 'flatter' than those of |∆Qtarget|. Since the encoding rate is always restricted to at most ratio times the available transmission rate, the encoder cannot always achieve as high a quality as targeted by the controller, hence at times Qactual ≠ Qtarget.

Section 5 described how the fuzzy controller is designed to avoid buffer underflow situations. Sender buffer underflows are caused by the buffer fill rate (Renc) being consistently lower than the buffer drain rate (the transmission rate Rtcpf). Given that uncompressed frames are produced at a constant rate (e.g., 30 fps), the encoder cannot retrieve frames faster than this rate. On the other hand, receiver buffer underflow happens when the receive buffer fill rate (Rtcpf) is lower than the rate at which video data is consumed by



Figure 9: Distribution of |∆Qtcpf|, |∆Qtarget| and |∆Qactual| for different initial buffering times. Notice the difference in the range of the y-axes.

the decoder (this rate is Renc(t − P), where P is the playout delay). Figure 10 plots the evolution of the sender and receiver buffer sizes for different initial buffering times. These results indicate that the system is resilient to buffer underflows; under reasonable initial buffering delays there are no buffer under-runs. Occasional receiver buffer underflows occur when the initial buffering delay is very low (e.g., at around time 90 sec for 3 sec of initial receiver buffering). As expected, the smaller the initial delay, the weaker the capacity of both buffers to accommodate the mismatches between the encoding and transmission rates. As a result, the controller reacts by decreasing the control parameter α, reducing at the same time the smoothness of the encoding quality, as evidenced by the higher ∆Qactual values in Figure 9. However, the quality smoothing gain is significant even for low initial buffer sizes.

6 Conclusion

We presented a method for smooth quality source rate adaptation of streamed live video. The method uses a realistic metric of perceived quality and relies on the generalisation properties of an artificial neural network to obtain accurate quality scores based on a candidate bit-rate and the content features of the video scene. A fuzzy rate-quality controller is proposed that manipulates the encoding rate, based on the quality of the recent past and the state of the sender and playout buffers, to achieve stable streaming quality. Experimental results, as well as viewing of the produced video sequences [10], demonstrate that the proposed method reduces variation in quality while at the same time adhering to the constraints of the TCP-friendly rate. The proposed method tackles source rate-quality control of live video; error resilience techniques (FEC, re-transmission, etc.) can be incrementally built on top to provide protection from packet loss.


0 50 100 150 200 2500

100

200

300

400

500

600

time (sec)

KByt

es

sender buffer size (Bs)

6 sec8 sec10 sec12 sec

0 50 100 150 200 2500

100

200

300

400

500

600

700

800

900receiver buffer size (Br)

time (sec)

KByt

es

6 sec8 sec10 sec12 sec

Figure 10: Evolution of sender (top) and receiver (bottom) buffer sizes for different initial buffering times.

References

[1] F. Despagne and D.L. Massart, Variable selection for neural networks in multivariate calibration, Chemometrics & Intelligent Laboratory Systems 40 (1998), 145-163.

[2] R. Fletcher, Practical methods of optimization, Unconstrained Optimisation, vol. 1, John Wiley & Sons, New York, 1980.

[3] S. Floyd and K. Fall, Promoting the use of end-to-end congestion control in the Internet, IEEE/ACM Transactions on Networking 7 (1999), no. 4, 458-472.

[4] S. Floyd, M. Handley, J. Padhye, and J. Widmer, Equation-based congestion control for unicast applications, ACM SIGCOMM '00 (Stockholm, Sweden), August 2000, pp. 43-56.

[5] B. Girod, Psychovisual aspects of image communication, Signal Processing 28 (1992), no. 3, 239–251.


[6] B. Girod, What's wrong with mean-squared error?, Digital Images and Human Vision (A.B. Watson, ed.), MIT Press, Cambridge, MA, USA, 1993, pp. 207-220.

[7] M. T. Hagan, H. B. Demuth, and M. H. Beale, Neural network design, PWS Publishing Company, Boston, MA, 1996.

[8] T. Hamada, S. Miyaji, and S. Matsumoto, Picture quality assessment system by three-layered bottom-up noise weighting considering human visual perception, SMPTE Journal 108 (1999), no. 1, 20-26.

[9] D. Hands and S.E. Avons, Recency and duration neglect in subjective assessment of television picture quality, Applied Cognitive Psychology 15 (2001), 639-657.

[10] http://www.cs.ucl.ac.uk/staff/d.miras/QAVideo/, 2004.

[11] International Telecommunications Union, ITU-R Recommendation BT.500-11, Methodology for the subjective assessment of the quality of television pictures, 2002.

[12] I.T. Jolliffe, Principal component analysis, Springer-Verlag New York Inc., October 2002.

[13] T. Kim and M. H. Ammar, Optimal quality adaptation for MPEG-4 fine-grained scalable video, Proc. of IEEE Infocom 2003 (San Francisco, CA, USA), March 2003.

[14] J. M. Mendel, Fuzzy logic systems for engineering: A tutorial, Proceedings of the IEEE 83 (1995), no. 3, 345-377.

[15] S. Nelakuditi, R. R. Harinath, E. Kusmierek, and Z.-L. Zhang, Providing smoother quality layered video stream, 10th Intl. Workshop on Network and Operating System Support for Digital Audio and Video (NOSSDAV) (Chapel Hill, North Carolina, USA), June 2000.

[16] ns–2 Network Simulator, 1998, http://www-mash.cs.berkeley.edu/ns.

[17] D. E. Rumelhart, G. E. Hinton, and R. J. Williams, Learning internal representations by error propagation, Parallel Distributed Processing (D. Rumelhart and J. McClelland, eds.), vol. 1, MIT Press, Cambridge, MA, 1986, pp. 318-362.

[18] K. T. Tan and M. Ghanbari, A multi-metric objective picture-quality measurement model for MPEG video, IEEE Transactions on Circuits and Systems for Video Technology 10 (2000), no. 7, 1208-1213.

[19] The Alliance for Telecommunications Industry Solutions (ATIS), Objective perceptual quality measurement using a JND-based full reference technique, Tech. Report T1.TR.PP.75-2001, October 2001.

[20] The Video Quality Experts Group, VQEG, http://www.VQEG.org.

[21] The Video Quality Experts Group, Draft final report from the Video Quality Experts Group on the validation of objective models of video quality assessment, Phase II, July 2003, Version 4.

[22] C. J. van den Branden Lambrecht and O. Verscheure, Perceptual quality measure using a spatio-temporal model of the human visual system, Proceedings of Digital Video Compression: Algorithms and Technologies (San Jose, CA), 1996, pp. 450-461.

[23] W. Willinger, M. S. Taqqu, R. Sherman, and D. V. Wilson, Self-similarity through high-variability: statistical analysis of Ethernet LAN traffic at the source level, ACM SIGCOMM '95 (Cambridge, MA), August 1995.


[24] S. Winkler, A perceptual distortion metric for digital color video, Proceedings of SPIE Human Vision and Electronic Imaging (San Jose, CA), vol. 3644, 1999, pp. 175-184.

[25] S. Wolf and M. Pinson, Spatial-temporal distortion metrics for in-service quality monitoring of any digital video system, Proceedings of SPIE International Symposium on Voice, Video and Data Communications (Boston, USA), September 1999, pp. 175-184.

[26] L. A. Zadeh, Outline of a new approach to the analysis of complex systems and decision processes, IEEE Transactions on Systems, Man & Cybernetics 3 (1973), no. 1, 28-44.

[27] X. M. Zhang, A. Vetro, Y. Q. Shi, and H. Sun, Constant quality constrained rate allocation for FGS-coded video, IEEE Transactions on Circuits and Systems for Video Technology 13 (2003), no. 2, 121-130.

A Additional results

We present additional results that demonstrate the ability of the fuzzy controller, obtained using two further video sequences: an 8000-frame excerpt from the action movie Terminator and a 5400-frame sports clip with scenes from an English Premiership football match.


Figure 11: Distribution of |∆Qtcpf|, |∆Qtarget| and |∆Qactual| for different initial buffering times (sequence: Terminator).



Figure 12: Evolution of sender (left) and receiver (right) buffer sizes for different initial buffering times (sequence: Terminator).


Figure 13: Distribution of |∆Qtcpf|, |∆Qtarget| and |∆Qactual| for different initial buffering times (sequence: Football).



Figure 14: Evolution of sender (left) and receiver (right) buffer sizes for different initial buffering times (sequence: Football).
