Frame Rate Up-Conversion for Encoded Video
Jonatan Samuelsson
Master of Science Thesis Stockholm, Sweden 2007
Master’s Thesis in Computer Science (20 credits)
at the School of Computer Science and Engineering,
Royal Institute of Technology, year 2007
Supervisor at CSC was Lars Kjelldahl
Examiner was Lars Kjelldahl
TRITA-CSC-E 2007:021
ISRN-KTH/CSC/E--07/021--SE
ISSN-1653-5715
Royal Institute of Technology
School of Computer Science and Communication
KTH CSC
SE-100 44 Stockholm, Sweden
URL: www.csc.kth.se
Frame rate up-conversion for encoded video.

Abstract

If you watch a video sequence with a low frame rate, for example 10 Hz, you will notice an annoying jerkiness that does not occur if the frame rate is higher, for example 30 Hz. If there were a way to increase the frame rate from 10 Hz to 30 Hz by inserting two frames for each existing frame, the problem would be solved. Unfortunately, the information about the two missing frames cannot be found directly in the video sequence. This Master’s project presents a way to interpolate frames in between existing frames in a video sequence. The interpolation is done for video sequences encoded in the standard format H.264, and it uses information contained in the encoded motion vector field.
Frame rate up-conversion for encoded video.

Summary

If you watch a video sequence with a low frame rate, for example 10 Hz, you will notice an annoying jerkiness that does not appear at a higher frame rate, for example 30 Hz. If there were a way to increase the frame rate from 10 Hz to 30 Hz by inserting two intermediate frames, the problem would be solved. Unfortunately, there is no information about the two missing frames in the video sequence. This Master’s project presents a way to interpolate frames between existing frames in a video sequence. The interpolation is done for video sequences encoded in the standard format H.264, and it uses information contained in the encoded motion vector field.
Frame rate up‐conversion for encoded video.
Foreword

You are about to start reading a Master’s Thesis in the area of computer science. It has been performed at Ericsson Research, Visual Technology, in Kista and presented at the School of Computer Science and Communication at the Royal Institute of Technology (KTH) in Stockholm, Sweden.
I would like to thank my supervisor at Ericsson, Dr. Kenneth Andersson, and M.Sc. Clinton Priddle for their support and help. My supervisor at KTH, Lars Kjelldahl, also deserves thanks, as do the others at Ericsson Research, Visual Technology, especially for participating in the evaluations.
I would also like to extend special thanks to my beloved wife and son for supporting me in my work.
Contents
1 Introduction
  1.1 Background
    1.1.1 Visual perception
    1.1.2 Video Coding
  1.2 Problem specification
    1.2.1 A scenario
    1.2.2 Goal
    1.2.3 Previous work
2 Analysis
  2.1 Motivation of the Master’s project
    2.1.1 Current interest
    2.1.2 Problems
    2.1.3 Measurements of success
  2.2 Two naive approaches
    2.2.1 Frame repetition
    2.2.2 Linear Interpolation
  2.3 A sophisticated approach
    2.3.1 Fundamental concept
    2.3.2 Global motion calculation
    2.3.3 The evilness of a vector
    2.3.4 Algorithm description
    2.3.5 Complexity analysis
3 Results
  3.1 Some examples
  3.2 Objective measurements
  3.3 Subjective test
    3.3.1 The test sequences
    3.3.2 Environment
    3.3.3 Methodology
    3.3.4 Results
    3.3.5 Conclusions and circumstances
    3.3.6 Improvements after the test
4 Conclusions
5 Future work
Bibliography
Appendixes
  A. Test instructions
  B. Test evaluation
1 Introduction

In a time when the flow of information is constantly increasing and communication channels gain higher capacity from year to year, you might think that compression of data is losing its importance. Since this is a Master’s project in the area of video coding, it is appropriate to introduce you to the art of video compression and to motivate its value.
1.1 Background

Before giving you details on how video coding is performed, I would like to present some facts about humans’ ability to view and interpret images. The interested reader can find a deeper description in [1].
1.1.1 Visual perception

"Remember that it's all in your head." [2]. Everything that involves video, motion and image perception begins with the human eye and the human brain. A good start when dealing with a video coding problem on quality enhancement is of course to study the properties of the human visual system. As stated by [1], knowledge of the limitations and capabilities of human visual processing has led to important advances in perceptually driven rendering, realistic image display, high-fidelity visualization, and appearance-preserving geometric simplification. This section intends to give a short introduction to the human visual system and the properties that are of interest for this Master's project.
1.1.1.1 Visual anatomy and physiology
Figure 1.1 shows the anatomy of a human eye. When light enters through the lens, it hits a photoreceptive surface called the retina. Although the retina contains photoreceptors almost everywhere, they are not equally distributed. A small spot where the optic nerve exits from the eye contains no receptors and is usually referred to as the blind spot [3]. There are two different kinds of photoreceptors. The cones are used for daylight colour vision and the rods for murky environments, which are perceived in grey-scale. The cones are concentrated in the fovea, which is the only part of the retina that is exposed to direct light. On this spot, there are no rods, which is why humans have no night vision in the central field of vision [3]. When light of the right wavelength strikes a receptor, a signal is sent from a ganglion cell to the visual cortex. The ganglion cells are either parvo ganglion cells, which take their input from the fovea region and handle colour information, or magno ganglion cells, whose input comes from the rest of the retina and which register movements (but not colour). The visual cortex is positioned in the rear of the brain and processes signals from all the ganglion cells to create the image of what is seen.
Figure 1.1. Anatomy of the human eye [4].
Most visual perceptions are evaluated relatively. An example can be seen in figure 1.2, adapted from [5]. The grey bar in the middle is actually of uniform colour, but on a dark background it seems lighter than on a bright background. This phenomenon is called simultaneous contrast. Another example is presented by [6]. If you walk from normal daylight into a dark theatre, it seems so dark that you almost cannot see anything. However, after a couple of minutes, when the visual system has adapted, you see much better and can even distinguish small objects on the floor.
Figure 1.2. Simultaneous contrast [5].
Figure 1.3. The images from the motorcycle sequence [7].
1.1.1.2 Motion perception
Dealing with motion, the human visual system has limitations that can be used (and have been used) when it comes to displaying video sequences. Professor George Mather has created an interactive tutorial on the subject of motion perception which can be viewed at the web site [7]. It discusses a large number of properties of the human visual system and gives examples of how the visual system can be "fooled". One of the most impressive examples is the one with two images of a motorcycle heading forward on a road (see figure 1.3), which are combined with a blank screen into a looping video sequence. This produces a feeling of moving forward. Phenomena that occur when watching moving pictures are hard to represent in a written paper. However, there are effects that take place both in moving and still pictures. One of these is presented by [1] and is referred to as masking. Figure 1.4 gives an example. However, one of the most interesting properties when it comes to this Master's project is the human eyes' ability to perceive video sequences as a function of the rate at which images are displayed (frame rate). In [8] it is stated that if the frame rate is below 10 frames per second (10 Hz), motion will appear clearly jerky and unnatural. If the frame rate is between 10 and 20 Hz, slow movements will appear "OK", but it would require a frame rate of 20-30 Hz to get reasonably smooth movement.
1.1.2 Video Coding

When a motion picture is shown in a cinema, you see analogue pictures that are projected on a screen at such a high rate that you observe them as a fluid video sequence. It is the same situation when you look at a DVD video, although here the pictures are stored digitally as ones and zeros. The most obvious way to store a digital video sequence would be to store every image in the sequence separately, and for each of these images store every pixel’s colour value. This is called uncompressed format and requires a large amount of memory for storage. For example, a 2-hour movie would require the disc space of 42 DVDs [8]. In order to reduce the size of a digital video sequence there must be some kind of compression, which can be done using some properties of natural video sequences.
Figure 1.4. The top images have an 8-bit colour depth. The bottom images have a 4-bit colour depth. The difference in colour depth is more obvious on a smooth surface (the left images) [1].
1.1.2.1 Video compression
In a picture (think of a frame in a video sequence), neighbouring pixels are likely to have similar colour values. This is referred to as spatial correlation [8]. Also, in a video sequence, an area in one frame is likely to have properties in common with the same (or an adjacent) area in the previous frame. This is called temporal correlation [8]. These properties, together with the limitations of the human visual system described in section 1.1.1 and a quantization technique, can be used to compress the video with little or no visible difference and great space gain. These techniques have for example been adopted by the MPEG-2 standard, which is used in digital television and DVDs.
1.1.2.2 Motion compensation
Most video sequences contain some kind of motion, i.e. camera movement or object displacement. An area that consists of an object that is affected by a motion will have lower temporal correlation to the same area in the previous frame, but high correlation to another area in the previous frame, namely the area from where the object has moved. Instead of just looking at the same area in the previous frame, a search can be made to find the area that gives the best match. If the picture is divided into small blocks and a search for a motion vector is performed for each of these, a motion vector field is created. It is important to distinguish this calculated vector field from a vector field corresponding to the real motion in the scene captured on the video. Although these two vector fields will in many cases look similar, the former is created in order to give as good compression as possible, not in order to capture the actual motion. Figure 1.5 gives an example of a motion vector field that represents how the motion from frame 72 to frame 73 is encoded in the video sequence stefan.
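The block-based motion search described above can be sketched as follows. This is a minimal illustration, not an encoder's actual search: real encoders use larger blocks, sub-pixel precision and rate-distortion criteria, and the function names (`sad`, `motion_search`), the tiny search radius and the exhaustive search strategy are my own assumptions.

```python
import numpy as np

def sad(a, b):
    """Sum of absolute differences between two equally sized blocks."""
    return int(np.abs(a.astype(int) - b.astype(int)).sum())

def motion_search(prev, curr, by, bx, bsize=4, radius=2):
    """Exhaustive search: find the offset (dy, dx) into the previous
    frame that best matches the block at (by, bx) in the current frame."""
    block = curr[by:by + bsize, bx:bx + bsize]
    best = (0, 0)
    best_cost = sad(block, prev[by:by + bsize, bx:bx + bsize])
    h, w = prev.shape
    for dy in range(-radius, radius + 1):
        for dx in range(-radius, radius + 1):
            y, x = by + dy, bx + dx
            if 0 <= y and y + bsize <= h and 0 <= x and x + bsize <= w:
                cost = sad(block, prev[y:y + bsize, x:x + bsize])
                if cost < best_cost:
                    best_cost, best = cost, (dy, dx)
    return best, best_cost

# A bright 4x4 square moves one pixel to the right between frames.
prev = np.zeros((8, 8), dtype=np.uint8)
prev[2:6, 1:5] = 200
curr = np.zeros((8, 8), dtype=np.uint8)
curr[2:6, 2:6] = 200
vec, cost = motion_search(prev, curr, 2, 2)
```

The vector found here points back to where the square came from, which illustrates the point above: the search minimizes the matching cost, and only coincidentally tracks the "real" motion.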
1.1.2.3 The video codec H.264
In 1998, the development of a new video coding standard started with the aim to double the coding efficiency in comparison to existing standards [9]. The standard was completed in 2003 and was given the name H.264. As for (almost) all video coding standards, the scope of the H.264 standard is the decoding process, which means that all decoders that follow the standard shall produce the same output for a given encoded video sequence. The encoding process is not standardized, which gives developers the opportunity to focus on compression efficiency, low complexity or whatever they please, and has the effect that the same video sequence can be encoded in a variety of ways, producing different encoded files that all conform to the standard.
One of the properties of the human visual system that can be used in the H.264 standard is the higher sensitivity to brightness differences than to colour differences. The colour space of H.264 can be set to 4:2:0, which means that luma (brightness) has four times higher resolution than chroma (colour). An encoded H.264 picture is divided into macroblocks of size 16x16 pixels. Thus a QCIF picture, with 176x144 pixels, will have 11x9 macroblocks. These macroblocks can be divided further into blocks as small as 4x4 pixels. Each macroblock is coded in Inter mode (using information from previously encoded frames) or Intra mode (using information from previously encoded macroblocks in the current frame).
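The numbers above can be checked with a little arithmetic; this sketch only restates the relations in the text (16x16 macroblocks, 4:2:0 chroma subsampled by two in each dimension):

```python
# QCIF luma resolution and the 16x16 macroblock grid, as described above.
width, height = 176, 144
mb_w, mb_h = width // 16, height // 16          # 11 x 9 macroblocks

# In 4:2:0 sampling each chroma plane is subsampled by 2 horizontally
# and vertically, so luma carries four times as many samples as one
# chroma plane.
luma_samples = width * height
chroma_samples = (width // 2) * (height // 2)   # per chroma plane
```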
When a block is coded in Inter mode there is a corresponding motion vector, which points out a block in a previous frame. The motion vector can have quarter-pixel precision, in which case the pixel value is interpolated from the surrounding pixels [10].
Figure 1.5. The encoded motion vector field for one frame (the right) from the preceding frame (the left).
1.2 Problem specification

In some cases, the best way to describe a problem is to give an example. In this section, a realistic scenario is presented along with a specification of the goal of the Master’s project. I will also give you a glimpse of what has previously been done in this area of research.
1.2.1 A scenario

A 30 Hz video sequence is to be broadcast over a mobile network. The terminals (mobile phones) are connected to a server and receive the encoded video sequence. However, since only a low bit-rate video sequence can be sent, due to the bandwidth limitations, the encoder removes two out of every three frames, resulting in a sequence with a frame rate of 10 Hz. The terminal can display videos at 30 Hz, but since it receives a 10 Hz video, it repeats each frame three times. This is a reasonable solution, but when motions are involved, it produces an annoying jerkiness.
1.2.2 Goal

The aim of this Master's project has been to study, develop and evaluate methods for inserting predicted frames in between existing frames in a video sequence. The focus has been on low complexity solutions that use information encoded in the H.264 motion vector field, rather than high complexity solutions.
1.2.3 Previous work

There have been attempts to solve the problem of frame rate up-conversion both with applications similar to the one described above and with the aim to convert between PAL (25 Hz) and NTSC (30 Hz). Most suggestions that have been proposed either contain high complexity calculations, require adjustments of the encoding process, or have limitations of usage.
In [11] a scheme is proposed which calculates motion vectors in two directions in the decoder. This is done without using an encoded motion vector field and thus requires many calculations (comparable to the ones done in the encoding process), and it will not apply well in the above scenario. In [12], changes in the encoding process are made to give a more reliable motion vector field, which restricts the usage to a subset of encoded videos (those that are encoded with the proposed technique). Finally, [13] has focused on the video conferencing scenario: typical movements are pre-stored and then used to capture head and face movements. This technique is not applicable to general video sequences.
There exist some products that have the functionality of increasing frame rate. One of these is the Alchemist Platinum Ph.C [14], which is a professional tool for converting between different broadcasting standards. It uses a phase correlation technique, which gives high performance but is computationally heavy [15]. Another is the Philips television set with Natural Motion [16], a technique that increases the frame rate from 50 Hz to 100 Hz in broadcast television. These products do not operate on encoded video sequences, and there is a great difference between up-converting from 50 Hz to 100 Hz and up-converting from 10 Hz to 30 Hz. The jerkiness will not be as annoying at 50 Hz and there will be a greater correlation between two consecutive frames.
Some studies have been made of frame rate up-conversion in the case of an encoded video sequence. In [17] a low-complexity method is presented at a high level. The method consists of three steps. First, a motion vector classification is performed, then bi-directional motion estimation and finally post-processing and motion compensation. Some ideas from this paper have been used in my solution and a deeper description of these is found in section 2.3.3.
2 Analysis If you have read through the introduction, you will know the background of this Master’s project and you will have some knowledge of the area in which it is performed. This chapter contains a deeper motivation of this Master’s Project and a thorough description of my solution to the given problem.
2.1 Motivation of the Master’s project

Since this Master’s project position was announced by Ericsson, you can assume that it is of current interest, and since it was approved as a Master’s project by KTH, you can assume that it contains enough inquiry, analysis and interesting questions to fulfil the requirements of a Master’s thesis. Nevertheless, I would like to present some argumentation for its importance and I would like to introduce some problems that have to be dealt with. This section will be concluded with a description of how to measure the level of success.
2.1.1 Current interest

As seen in the first chapter, video coding and effective compression is as important today as it has ever been. Effective coding is essential for the existence of most digital video media. In order to meet broadcasting limitations and to economize on network usage, image quality and frame rate are reduced. Many mobile phones have the possibility to download and play video sequences from the internet, and tests have been performed with mobile TV in Sweden. However, as stated by the Swedish journal “Mobil” in October 2006 [18] (when evaluating Teracom’s dvb-h test), it is when watching sequences with many fast movements, for example a hockey game, that you notice the difference from ordinary TV. If the decoder could increase the frame rate by inserting frames consistent with the motions in the sequence, this problem would not occur. Thus, the problem of frame rate up-conversion is of great current interest.
2.1.2 Problems

Here I will present some of the problems that have to be dealt with when it comes to frame rate up-conversion for encoded video.
As mentioned in section 1.1.2, an encoded video frame (in H.264) contains motion vectors for some of the macroblocks in the frame. These are of course a good start when estimating the real motions in the scene, but two important questions must be answered:
• How can you know that an encoded motion vector corresponds to a real motion?
• What is the motion for an Intra coded macroblock, i.e. a macroblock that has no reference to a previously encoded frame?
Another large problem is the covering and uncovering of regions. This problem occurs both in the actual scene and in the encoded image. Figure 2.1 gives an example of how blocks can cover each other in an encoded image. In some way, a decision must be made on what to do with regions with multiple coverings and regions with no covering. Figure 2.2 shows how a region can be uncovered.
Figure 2.1. Two objects cover the same region.
Figure 2.2. The region behind the dog is uncovered.
2.1.3 Measurements of success

In order to prove that a solution to a problem is successful, there has to be some way of measuring the level of success. In this Master’s project, the aim is to increase the experienced quality of video sequences. A good measurement is people’s opinion of a modified video sequence compared to the unmodified video sequence. To get a reliable opinion value I decided, in agreement with my supervisors, to perform a subjective test after about three fourths of the time available for the Master’s project. At that time, I would have a working implementation of a solution to evaluate, but still time left to use the test results as guidance in final performance optimization. You can read more about the subjective test and its results in section 3.3.
Apart from the subjective test, I wanted some objective measurement of my solution’s performance. A widely used method for measuring quality of encoded video is Peak Signal to Noise Ratio (PSNR).
PSNR uses the difference between an impaired image and the original image. The Mean Squared Error (MSE) is calculated by squaring the difference between the images pixel by pixel and then taking the mean value. This is put in relation to the square of the highest possible difference, 255 for 8-bit images (2^n − 1 with n = 8), and measured on a logarithmic scale [8] according to the following equation:
PSNR_dB = 10 · log10( (2^n − 1)^2 / MSE )        (2.1)
There are problems with the existing objective quality measurement methods [8], including PSNR, because they do not always correlate with human opinions on quality. But in general, high image quality gives high PSNR and low image quality gives low PSNR. Since PSNR is defined on images and I am working with video, I gather PSNR values for all images in a video sequence and calculate their mean.
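As a concrete illustration, equation (2.1) and the per-sequence averaging can be written out as follows. The function names are my own; only the formula itself comes from the text.

```python
import numpy as np

def psnr(ref, impaired, bits=8):
    """PSNR in dB between a reference image and an impaired image,
    following equation (2.1)."""
    diff = ref.astype(float) - impaired.astype(float)
    mse = float(np.mean(diff ** 2))
    if mse == 0.0:
        return float('inf')                    # identical images
    peak = (2 ** bits - 1) ** 2                # (2^n - 1)^2, 65025 for n = 8
    return 10.0 * np.log10(peak / mse)

def mean_psnr(ref_frames, impaired_frames):
    """Sequence-level quality: the mean of the per-frame PSNR values."""
    return sum(psnr(r, i)
               for r, i in zip(ref_frames, impaired_frames)) / len(ref_frames)

ref = np.full((4, 4), 100, dtype=np.uint8)
bad = ref.copy()
bad[0, 0] = 110                                # one pixel off by 10
value = psnr(ref, bad)                         # roughly 40 dB
```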
2.2 Two naive approaches

When it comes to the problem of temporal upsampling, there are two methods that directly come to mind. I will describe them here together with their pros (there are actually pros) and cons.
2.2.1 Frame repetition

If a terminal receives a video sequence with a frame rate of 10 Hz and wants to increase the frame rate to 30 Hz, the easiest thing to do would be to display every frame three times. However, since frame no. 2 and frame no. 3 will look exactly the same as frame no. 1, the viewer will experience a frame rate of 10 Hz, and the problems that occur at low frame rates, i.e. motion jerkiness, will be preserved. The only advantage of this method is its low complexity, since no calculations are performed at all.
2.2.2 Linear Interpolation

If a smoother transition from one frame to another is wanted, one possibility is to let the in-between picture be a mix of the previous frame and the next frame. If the up-conversion rate is k, a frame I_{kn+j} can be reproduced from I_{kn} and I_{k(n+1)} using the equation:
I_{kn+j} = ( (k − j) · I_{kn} + j · I_{k(n+1)} ) / k        (2.2)
for every pixel in I_{kn+j}, where n is the frame number before upsampling and j is in the interval 0 < j < k. The disadvantage of this technique can be seen in figure 2.3. Sections that contain motion will be blurry and “ghost” objects will occur. In figure 2.3, you can see two blurry noses next to each other. The strength of this method is, as with frame repetition, its low complexity. The calculations in this method are very easy to perform, but the result is not acceptable. In most cases, the blurriness of the linearly interpolated frame is more annoying than the jerkiness in the video with the lower frame rate.
Figure 2.3. Above, the original sequence at 30 Hz. Below, the sequence encoded at 15 Hz, up‐converted to 30 Hz using Linear Interpolation.
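Equation (2.2) is straightforward to write out. The following sketch (function name assumed) interpolates one frame per call:

```python
import numpy as np

def linear_interpolate(prev, nxt, j, k):
    """Reproduce frame I_{kn+j} from I_{kn} and I_{k(n+1)} per (2.2)."""
    assert 0 < j < k
    out = ((k - j) * prev.astype(float) + j * nxt.astype(float)) / k
    return np.round(out).astype(np.uint8)

prev = np.full((2, 2), 30, dtype=np.uint8)
nxt = np.full((2, 2), 90, dtype=np.uint8)
mid1 = linear_interpolate(prev, nxt, 1, 3)   # one third of the way: 50
mid2 = linear_interpolate(prev, nxt, 2, 3)   # two thirds of the way: 70
```

Because every output pixel is a blend of the two frames at the same position, any displaced object contributes to two places at once, which is exactly the "ghost" effect visible in figure 2.3.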
2.3 A sophisticated approach

The problem of really increasing the experienced frame rate and at the same time improving the experienced quality calls for a sophisticated method. Before I describe my solution in detail, I will present some concepts and calculations that are used in my frame rate up-converter.
2.3.1 Fundamental concept

As previously stated, a crucial concept for this Master’s project is the usage of the encoded motion vector field in the construction of the inserted frames. In H.264, motion vectors are attached to blocks that can have sizes from 4x4 to 16x16 pixels. In my solution, the inserted frame is divided into blocks of size 4x4, and each of these is associated with a motion vector (read more in chapter 5 about the choice of block size). The motion vectors of an inserted frame do not point out just one reference block, but two blocks. The idea is that a vector shall point from one area in the previous frame to one area in the next frame, in such a way that it captures the motion that occurs between the two frames. Since only linear movements are considered, the directions of these two motion vectors are always opposite, and their lengths are proportional to the temporal difference between the inserted frame and the original frames. In figure 2.4, the blocks with broken lines are pointed out by the block in the inserted frame.
Figure 2.4. Each block in the inserted frame points out one block in the next encoded frame and one block in the previous encoded frame.
When the motion vector field of an inserted frame is to be created, the motion vector field of the next frame is used to create a first guess. In figure 2.4, the block in the inserted frame is initially given the motion vector
v_f = (j / k) · v_n        (2.3)

in forward direction, and

v_b = −((k − j) / k) · v_n        (2.4)
in backward direction, where k is the up-conversion rate, j is in the interval 0 < j < k, and v_n is the motion vector from the next encoded frame.
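Equations (2.3) and (2.4) amount to scaling and negating the encoded vector. A small sketch, with vectors as (x, y) tuples (my representation, not the thesis's):

```python
def initial_vectors(vn, j, k):
    """First guess for an inserted block's bidirectional motion,
    scaled from the next frame's encoded vector vn per (2.3)/(2.4)."""
    vx, vy = vn
    forward = (j * vx / k, j * vy / k)                  # toward the next frame
    backward = (-(k - j) * vx / k, -(k - j) * vy / k)   # toward the previous
    return forward, backward

# Up-conversion from 10 Hz to 30 Hz (k = 3), first inserted frame (j = 1),
# encoded vector (6, -3).
f, b = initial_vectors((6, -3), 1, 3)
```

Note that the two parts are opposite in direction and together span the full encoded motion: f − b = v_n.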
2.3.2 Global motion calculation

Most macroblocks in an H.264 coded image will be associated with a motion vector, if the image does not correspond to a scene change. These motion vectors will together build up a vector field, but since there are no restrictions on the motion vectors, this vector field can have a rather irregular appearance. In some cases, it can be of great value to know a little bit about the characteristics of the overall motion in the scene, the global motion. Think of a scene where the camera follows a ball rolling on a lawn. The global motion will then be the pan of the camera when it is following the ball, and the ball itself will be the only object whose motion differs from the global motion.
The global motion can be approximated using:
v = A·x + b        (2.5)

where:

A = [ a11  a12
      a21  a22 ]        (2.6)

and:

b = [ b1
      b2 ]        (2.7)
The x and the v are the position and motion respectively, both with x and y components. This arrangement makes it possible to calculate a macroblock’s motion vector directly from its position, given that the matrix A and the vector b are known.
To calculate good values for A and b, I use the method of least squares [17]. The values for a11, a12 and b1 can be calculated separately from the values for a21, a22 and b2 using:
v_x = a11·x + a12·y + b1        (2.8)

v_y = a21·x + a22·y + b2        (2.9)

Instead of using all the encoded motion vectors, I pick a random set in order to reduce complexity. For every vector, the position and x-magnitude are inserted in (2.8), and these equations together build up a system of equations that can be put in matrix form:

v_x = X · a_x        (2.10)

where each row of X is [x  y  1], a_x = [a11  a12  b1]^T, and v_x collects the x-magnitudes of the sampled vectors. Using the new vectors introduced in (2.10), the least squares solution [19] is given by:

a_x = (X^T X)^(−1) X^T v_x        (2.11)
At first sight, the matrix inversion might look scary, but the matrix that is to be inverted in (2.11) is only of size 3x3, no matter how many samples (motion vectors) are used in (2.10). The calculation of the variables that affect the y-magnitude is done in the same way. Experimental results showed that a set of 10 samples gives a good result and is computationally inexpensive.
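The fit in (2.8)-(2.11) is an ordinary linear least squares problem. The sketch below solves the 3x3 normal equations with `numpy.linalg.solve`, which is mathematically equivalent to the explicit inverse in (2.11); the function name and data layout are assumptions.

```python
import numpy as np

def fit_global_motion(positions, vectors):
    """Fit v = A x + b from sampled (position, motion vector) pairs,
    solving (2.10)-(2.11) once for the x and once for the y component."""
    X = np.array([[x, y, 1.0] for x, y in positions])   # one row per sample
    vx = np.array([v[0] for v in vectors], dtype=float)
    vy = np.array([v[1] for v in vectors], dtype=float)
    ax = np.linalg.solve(X.T @ X, X.T @ vx)             # a11, a12, b1
    ay = np.linalg.solve(X.T @ X, X.T @ vy)             # a21, a22, b2
    A = np.array([[ax[0], ax[1]], [ay[0], ay[1]]])
    b = np.array([ax[2], ay[2]])
    return A, b

# Synthetic pan: every block moves by (2, -1) regardless of position,
# so the fit should recover A = 0 and b = (2, -1).
positions = [(0, 0), (16, 0), (0, 16), (16, 16)]
vectors = [(2, -1)] * 4
A, b = fit_global_motion(positions, vectors)
```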
2.3.3 The evilness of a vector What characteristics can be assigned to a vector? I claim that a motion vector for instance could be both Sad, Bad and Evil. Let me explain this in a little more detail.
A motion vector in this case is a vector that points from a block in one frame to a block (of the same size) in another frame (the previous frame). These blocks both consist of pixels and they both have a surrounding. From the pixel values, the Sum of Absolute Differences (Sad) can be calculated by adding together the difference in pixel value between the blocks for every pixel. If a new block is to be inserted in a frame between the two frames, the new block will get a new surrounding. If the block is put in the right place, it is probably similar to its new surrounding. This idea, called Border Absolute Difference (Bad), is mentioned in [17] but not defined. I have defined it to be calculated by adding together the differences between the block’s border pixels and its new surrounding, according to figure 2.5, in desired directions.
Figure 2.5. The calculation of BAD.
If you are to compare two vectors that both compete for the role of the true motion vector, the above-mentioned measurements are of great interest. A low Sad value implies that there is only a small difference between the block in the next picture and the block in the previous picture. Thus, the vector is likely to correspond to a real motion. In addition, if the Bad value is low, the block fits well into its new environment and thus likely corresponds to a real motion.
The evilness of a vector is calculated from a weighted sum of Sad and Bad in addition to some properties of the vector (such as length) that I will present in more detail later. The evilness of a vector will be a good estimation of how well the vector corresponds to a real motion.
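A sketch of the three quantities might look as follows. The thesis does not publish its weights or the exact border directions used, so the `evilness` weights and the "above and left" border choice here are purely illustrative assumptions.

```python
import numpy as np

def sad(block_a, block_b):
    """Sum of Absolute Differences between the two reference blocks."""
    return int(np.abs(block_a.astype(int) - block_b.astype(int)).sum())

def bad(frame, y, x, block, bsize=4):
    """Border Absolute Difference: compare the candidate block's border
    pixels with the pixels just outside its target position in the
    inserted frame (only above and to the left here, as an illustration)."""
    total = 0
    if y > 0:                                   # row above vs the top row
        total += int(np.abs(frame[y - 1, x:x + bsize].astype(int)
                            - block[0, :].astype(int)).sum())
    if x > 0:                                   # left column vs left edge
        total += int(np.abs(frame[y:y + bsize, x - 1].astype(int)
                            - block[:, 0].astype(int)).sum())
    return total

def evilness(sad_value, bad_value, length, w_sad=1.0, w_bad=1.0, w_len=0.1):
    """Illustrative weighted sum; the weights are assumptions, not the
    thesis's tuned values."""
    return w_sad * sad_value + w_bad * bad_value + w_len * length

frame = np.full((6, 6), 50, dtype=np.uint8)     # partially built inserted frame
block = np.full((4, 4), 50, dtype=np.uint8)     # fits its surrounding: Bad = 0
block2 = np.full((4, 4), 60, dtype=np.uint8)    # mismatched candidate
```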
2.3.4 Algorithm description

The algorithm that I have come up with consists of a large number of steps. I will try to describe every part of my algorithm in detail in order to make it clear what is really happening, but my aim is never to become unnecessarily technical. You can see an overview of the algorithm in figure 2.6.
The algorithm.
1. Border Calculation.
2. Global motion calculation.
3. Passepartout preparation.
4. Special motion estimation.
5. Vector evilness classification.
6. Candidate gathering.
7. Candidate evaluation.
8. Vector field post processing.
9. Scene change detection.
10. Image creation.
11. Deblocking.
Figure 2.6. The algorithm.
2.3.4.1 Border calculation
Frame rate up-conversion for encoded video. Analysis.

When a video sequence is recorded and broadcast on TV, artefacts sometimes occur that are so well camouflaged that you most certainly will not notice them. I am thinking specifically of black lines along the border of the image. The reason why you do not notice such a line is that it is at the edge of the screen and that it is constantly there. When I insert an image in between two images that contain black lines, I must make sure that the black line in my image is consistent and unbroken. The first step is to search for black lines starting from the image borders. If a black line is found it is labelled as forbidden area, in the sense that the output for this area is forced to be black, and no other part of the image is allowed to have motion vectors originating inside this area. Figure 2.7 shows a close-up of what can happen if no restrictions are made concerning black lines. This can be compared to figure 2.8, where the black line has been labelled as forbidden.
Figure 2.7. The image in the middle is constructed without restrictions on the black line.

Figure 2.8. The image in the middle is constructed with restrictions on the black line.
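The black-line search can be sketched as follows. The scan below handles only the top border (the other three borders are analogous), and the blackness threshold of 16 is my assumption; the thesis does not state a value.

```python
# Sketch of the border-calculation step: scan inwards from the top image
# border and count rows that are entirely (near-)black. The threshold 16
# is an assumed value, not taken from the thesis.

BLACK_THRESHOLD = 16

def black_rows_from_top(image):
    """image is a list of rows of luma values; returns the number of
    consecutive black rows starting at the top border."""
    count = 0
    for row in image:
        if all(p < BLACK_THRESHOLD for p in row):
            count += 1
        else:
            break
    return count

img = [[0, 0, 0], [0, 0, 0], [120, 130, 125]]
print(black_rows_from_top(img))  # 2
```

The rows found this way would then be marked as forbidden: forced to black in the output, and excluded as an origin for motion vectors.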
2.3.4.2 Global motion calculation
How the global motion is calculated using a least squares method is described in section 2.3.2. In order to get an even more reliable result, the global motion calculation is performed multiple times with different (time dependent) seeds for the randomized selection of vectors. To see how consistent the motion vectors are, the difference between my estimation and the encoded value is calculated. The largest such difference for each calculation is recorded, and the calculation with the smallest largest difference is chosen. The result of the chosen calculation is used as the global motion estimation. In the next step of the algorithm, I will describe how this estimation is used.
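The run-selection logic (smallest largest difference) can be sketched as below. This is a toy one-dimensional version: the real estimator is the least squares fit from section 2.3.2, which is stubbed here as a simple mean over a random sample of vectors.

```python
import random

# Sketch of the seeded-run selection. estimate = mean over a random sample
# stands in for the least squares fit of section 2.3.2 (an assumption).

def max_residual(model, vectors):
    """Largest difference between the model's prediction and an encoded vector."""
    return max(abs(model - v) for v in vectors)

def best_global_motion(vectors, runs=5):
    best_model, best_err = None, float("inf")
    for seed in range(runs):
        rng = random.Random(seed)                 # time-dependent seed in the thesis
        sample = rng.sample(vectors, max(2, len(vectors) // 2))
        model = sum(sample) / len(sample)         # stand-in for the LS fit
        err = max_residual(model, vectors)
        if err < best_err:                        # smallest largest difference wins
            best_model, best_err = model, err
    return best_model
```

The point of the sketch is the selection criterion, not the fit itself: each run is scored by its worst disagreement with the encoded vector field, and the most consistent run is kept.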
2.3.4.3 Passepartout preparation
Sometimes you want to talk about a certain thing that is well defined only to you, and when you start to put words on it, confusion easily occurs. This section (and this part of the algorithm) concerns the areas in an image that are closest to the image borders. Since the words "border", "frame" and "edge" are all occupied for other purposes, I have chosen to refer to these areas as the passepartout of the image. Figure 2.9 shows the analogy to paintings. The white area is in some sense a part of the image, but it has different properties than the central area. This is the concept that I wanted to capture, and therefore I named it passepartout.
Figure 2.9. Two images that both have thick white passepartouts [20].
When a video is recorded using a moving camera, the background of the scenery will change from one image to another. In figure 2.10, you see two consecutive images that have different backgrounds due to camera panning. The part of the audience that you see to the left in the top picture is not included in the bottom picture. Thus, the macroblocks that are positioned along the left image border will most certainly be coded in Intra mode or with some false motion vector, since the area that corresponds to the real motion lies outside of the picture. It is possible for motion vectors to point outside of the image area, but since that area contains no image information, such vectors are highly unreliable.
During the development of my algorithm, I noticed that it was not only the uncovered regions that had false vectors. When large camera movements are involved, all regions close to the image borders have less reliable motion vectors than those closer to the middle. Thus, special treatment of the passepartout can give more reliable vectors and improve the result when inserting a predicted frame.
Figure 2.10. The camera panning uncovers and occludes regions along the image borders.
The width of the passepartout depends on the size of the global movement. The macroblocks that lie in the passepartout region are, at an early stage, given motion vectors according to the calculated global motion, regardless of their encoded motion vectors. Later on, during the last steps of the algorithm, they are changed to a motion vector calculated from the closest neighbouring macroblocks that lie outside of the passepartout. Figure 2.11 gives an example of how the special treatment of the passepartout can improve quality.
Figure 2.11. In the right picture, the passepartout was given special treatment.
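The early-stage passepartout assignment can be sketched as follows. The width rule (one 16-pixel block per started 16 pixels of global motion) and the grid-of-tuples field layout are my illustrative assumptions; the thesis only says the width depends on the size of the global movement.

```python
# Hedged sketch of the passepartout step. The width rule and the field
# layout (a 2-D grid of (dx, dy) tuples) are illustrative assumptions.

def passepartout_width(global_motion, block_size=16):
    gx, gy = global_motion
    return (max(abs(gx), abs(gy)) + block_size - 1) // block_size

def apply_passepartout(field, global_motion):
    """Blocks inside the passepartout are overwritten with the global
    motion, regardless of their encoded motion vectors."""
    w = passepartout_width(global_motion)
    rows, cols = len(field), len(field[0])
    for r in range(rows):
        for c in range(cols):
            if r < w or c < w or r >= rows - w or c >= cols - w:
                field[r][c] = global_motion
    return field
```

In the full algorithm these blocks are later replaced again, by vectors derived from the closest neighbours outside the passepartout.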
2.3.4.4 Special motion estimation
The macroblocks that are coded in Intra mode are not associated with motion vectors. In most cases, it is more efficient to code a macroblock using Inter mode, but in some special cases Intra mode can be a better alternative:

Scene change. When a scene change occurs, the objects in the new frame do not have a valid reference in the previous frame. This is discussed in section 2.3.4.9.

New object. If an object appears that could not be seen in the previous frame, it can be hard to find a reference in the previous frame. It is also possible that the object could be seen in the previous frame, but its motion differs too much from the motion of its surroundings and therefore it is not found by a motion estimator.

Coding efficiency. The choice between Inter and Intra coding always depends on which is the cheapest to encode. It is possible that the true motion of the current block does resemble the motion of the surroundings, but due to lighting changes or transformations in the scene, it is more efficient to encode with Intra coding.
In order to support both of the latter two cases, a comparison is made between a motion vector that is calculated from the surroundings and the zero motion vector. The one that gives the smallest SAD value is chosen as the starting point for a simple motion estimation. This motion estimation can be implemented in various ways. I chose a search where the x and y values are incremented and decremented by one pixel respectively, thus giving eight alternatives. Figure 2.12 gives an example of how a search for motion vectors for Intra coded blocks can improve quality.
Figure 2.12. The right picture is produced using special motion estimation.
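The search described above can be sketched as below. The sad_at callback is a hypothetical helper (vector to SAD value); in a real decoder it would read pixels from the reference frames.

```python
# Sketch of the special motion estimation for Intra coded blocks.
# sad_at is a hypothetical cost callback: vector -> SAD value.

def special_motion_estimation(surround_vector, sad_at):
    # Start from whichever of the neighbourhood-derived vector and the
    # zero vector gives the smaller SAD.
    best = min((surround_vector, (0, 0)), key=sad_at)
    bx, by = best
    # One-pixel steps in x and y give the eight alternatives.
    candidates = [(bx + dx, by + dy)
                  for dx in (-1, 0, 1) for dy in (-1, 0, 1)
                  if (dx, dy) != (0, 0)]
    return min([best] + candidates, key=sad_at)
```

With a cost function whose minimum lies one pixel away from the starting point, the search finds it in a single round.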
2.3.4.5 Vector evilness classification
In section 2.3.3 I described how vectors can be classified based on an estimation of how well they represent the true motion of a block. Of course, the different properties (Sad, Bad, length…) can be valued differently, and the result will depend on how these properties are weighted together. How the properties should be weighted depends to some extent on the application, but my aim has been to use a setting that performs well on as many sequences as possible.
Figure 2.13. The leftmost picture is constructed with no replacement of evil vectors; the rightmost picture is constructed after replacing the evil vectors with the least evil alternative, and the picture in the middle shows which blocks were labelled as evil.
For every block in the image, the evilness of its motion vector is calculated using:
e = 3s + 4b + 8n    (2.12)
where s is the Sad value, b is the Bad value (see section 2.3.3) and n is the neighbourhood match, which is calculated simply by adding together the differences between the current vector and the vector of each of the neighbouring blocks. If the evilness is too high, the block is labelled as evil, which essentially means that an attempt will be made to find a less evil vector. In figure 2.13, you can see which blocks are considered evil, and that a great improvement can be achieved by replacing their motion vectors.
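The classification step, with the weights 3, 4 and 8 of equation (2.12), can be sketched as follows. The threshold EVIL_LIMIT is an assumed value for illustration; the thesis does not state the actual limit.

```python
# Equation (2.12) as code. s, b, n (SAD, BAD and neighbourhood match) are
# assumed to be computed elsewhere; EVIL_LIMIT is an assumed threshold.

EVIL_LIMIT = 5000

def evilness(s, b, n):
    """Weighted sum from equation (2.12)."""
    return 3 * s + 4 * b + 8 * n

def is_evil(s, b, n):
    """A block whose vector scores above the limit is labelled evil."""
    return evilness(s, b, n) > EVIL_LIMIT
```

Blocks labelled evil are not discarded; they are merely queued for the candidate gathering and evaluation steps that follow.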
The construction of equation 2.12 and the forthcoming equation 2.13, as well as the settings of the weights, have been done manually by me. There is no easy way to automatically measure whether all evil motion vectors have been correctly labelled. If such a method existed, I would probably have used it for the evilness labelling process itself. Nor is it possible to use an objective measurement (such as PSNR) to decide the values of the weights, due to its poor correlation with the human experience of quality. Instead, my approach has been to estimate good weights using reasoning about their magnitudes and then perform fine-tuning based on experimental results.
2.3.4.6 Candidate gathering
After labelling some vectors as evil, the next step is to find an alternative that is less evil. Instead of making an exhaustive search for a better vector, the motion vectors of the neighbours are considered. The vectors of the eight adjacent blocks are put in a set together with the vectors of the nine adjacent blocks in the previous frame.
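The gathering step can be sketched as below; the grid-of-tuples field layout and the bounds handling are my illustrative assumptions.

```python
# Sketch of candidate gathering: vectors from the eight spatially adjacent
# blocks plus the nine co-located and adjacent blocks in the previous
# frame's vector field. Field layout is an illustrative assumption.

def gather_candidates(field, prev_field, r, c):
    cands = set()
    rows, cols = len(field), len(field[0])
    for dr in (-1, 0, 1):
        for dc in (-1, 0, 1):
            rr, cc = r + dr, c + dc
            if 0 <= rr < rows and 0 <= cc < cols:
                if (dr, dc) != (0, 0):
                    cands.add(field[rr][cc])    # 8 spatial neighbours
                cands.add(prev_field[rr][cc])   # 9 temporal neighbours
    return cands
```

Note that the block's own current (evil) vector is excluded spatially but its temporal counterpart is included, so up to seventeen distinct candidates can result.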
2.3.4.7 Candidate evaluation
When a set of candidate vectors has been gathered, a comparison must be performed in order to find out which vector is the least evil. For this purpose, experiments performed during implementation showed that a slightly different equation than (2.12) gives a better result:
e = 32s + 32b + 32n + 6v + 5m    (2.13)
Here s, b and n represent the same things as in (2.12). The formula has been extended with v, which is the vector's length squared, and m, which represents the global motion match. How the global motion match works demands a little more explanation, but first a few words on the length. The length is taken into account because in natural videos, long vectors occur less frequently than short vectors. In addition, if an incorrect vector is chosen, the human eye is better at masking an error of too small a movement than an error of too large a movement. Unfortunately, the addition of a length parameter in equation (2.13) does not just give images that are more pleasing to view; it also affects the PSNR value negatively. However, since it is the experienced quality that is most important, I did not hesitate to add this parameter.
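Reading the weights as 32, 32, 32, 6 and 5, the evaluation can be sketched as follows. The cost_terms callback is a hypothetical helper that would compute the five terms (s, b, n, v, m) for a given candidate vector.

```python
# Sketch of candidate evaluation. cost_terms is a hypothetical callback:
# vector -> (s, b, n, v, m), where v is the squared vector length and m
# the global motion match.

def candidate_evilness(s, b, n, v, m):
    """Weighted sum of equation (2.13)."""
    return 32 * s + 32 * b + 32 * n + 6 * v + 5 * m

def least_evil(candidates, cost_terms):
    """Pick the candidate vector with the lowest (2.13) score."""
    return min(candidates, key=lambda vec: candidate_evilness(*cost_terms(vec)))
```

The winning candidate then replaces the vector of the block that was labelled evil.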
In some cases, two vectors are both good candidates to represent the motion of a specific block, due to the fact that in a scene, different objects can move in different directions, and occlude and uncover each other. Figure 2.14 gives a simplified example. If you just see frame one and frame three, you cannot be sure of what frame two would look like.
Figure 2.14. What happens in frame two?
In figure 2.15, you can see three suggestions of what happened in frame two. The second one suggests that the ball disappears in frame two and reappears in frame three.
Figure 2.15. Three suggestions for the missing frame.
If there is no other information than that in frames one and three, the first suggestion seems unmotivated. However, if there were a sequence of frames where the ball moved along a parabolic path, perhaps the first suggestion would give the most natural impression. In most cases though, a linear approximation will give a good result even when the real motion follows another path. Searching for non-linear motion paths, which would require significantly more calculations, has not been an alternative for this Master's project.
At first glance, the second alternative can seem a little bit strange. If there is a ball in frame one and a ball in frame three, there most certainly has to be a ball in frame two. However, the area in the middle looks the same in frame three as in frame one. A local consideration yields that the area will look the same also in frame two. However, the motion vector that represents the motion of the ball will have to be considered as an alternative. If no extra constraints were added, the zero motion vector of the background would be chosen instead of the motion vector of the ball, since it is shorter. The solution that I have come up with lies in the global motion estimation.
A vector that points from a block in the previous frame to a block in the next frame, and that gives a good match with the global motion, is likely to belong to the background. A vector that points between two blocks that do not give a good match with the global motion is likely to belong to an object (foreground), and should be displayed in front of the background.

In the above example, the global motion would be zero, and the grass area in the middle matches the global motion well. The ball has a bad match with the global motion, since it moves between frame one and frame three. Thus, it will be chosen in favour of the grass area.
The global motion match is measured by calculating Sad for the two areas pointed out by the motion vector in the forward and backward directions, using the global motion as the motion vector.
Figure 2.16. Comparison of an image without vector field post processing (left), and with vector field post processing (right).
2.3.4.8 Vector field post processing
On rare occasions in the candidate evaluation process, a block is assigned a motion vector that neither corresponds to the real motion nor correlates well with the motion vectors of the surrounding blocks. An example of this can be seen in figure 2.16.
This problem can be solved through simple vector field post processing. In my solution, each block is compared to its neighbours. If its motion vector is larger or smaller than seven of the motion vectors of its eight neighbours, it is changed to the mean value of the four vectors with values in the middle.
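The outlier rule can be sketched as below. For simplicity the sketch works on a single scalar per block (e.g. one vector component); applying it per component is my assumption, not a detail stated in the thesis.

```python
# Hedged sketch of the post-processing rule, applied per component:
# if a block's value is larger or smaller than at least seven of its
# eight neighbours, replace it with the mean of the four middle values.

def postprocess(value, neighbours):
    """neighbours: the eight surrounding values (here plain scalars)."""
    larger = sum(1 for n in neighbours if value > n)
    smaller = sum(1 for n in neighbours if value < n)
    if larger >= 7 or smaller >= 7:
        middle = sorted(neighbours)[2:6]     # the four middle values
        return sum(middle) / len(middle)
    return value

print(postprocess(100, [1, 2, 3, 4, 5, 6, 7, 8]))  # outlier -> (3+4+5+6)/4 = 4.5
```

This behaves like a relaxed median filter: isolated outliers are smoothed away while consistent local motion is left untouched.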
2.3.4.9 Scene change detection
If a scene change occurs in a video sequence, there is little (or no) correspondence between one frame and the next. In this case, it would not be a good idea for a frame rate up-converter to search for motions between the two consecutive frames. The motion field found for that transition would most likely look very strange. In order to maintain a constant ratio between encoded frames and displayed frames, an in-between frame of some kind must still be produced. Using frame repetition could produce an annoying "freeze" just before (or after) the scene change. Instead, I have chosen to use linear interpolation for transitions between scenes. An alternative solution is to extrapolate the missing frames using only preceding frames or only forthcoming frames. This alternative has not been evaluated, since the linear interpolation gives a good result.
Figure 2.17 gives an example of a frame in a scene change, with and without special treatment (linear interpolation).
Figure 2.17. Scene change with no special treatment (above), and with linear interpolation (below).
In order for a transition to be considered a scene change, it must fulfil three conditions. First, it must contain a certain amount of Intra coded blocks (⅓ has shown to be a good threshold). Second, the average absolute difference between the pixels pointed out in the forward direction and those pointed out in the backward direction, call this d, must exceed a certain limit (15). Third, the average difference between pixels using linear interpolation must not be larger than twice the difference d.
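The three conditions translate directly into a predicate. The inputs (the Intra fraction, d, and the interpolation difference) are assumed to be computed elsewhere in the decoder.

```python
# The three scene-change conditions from the text, using the thresholds
# stated there (1/3 Intra blocks, limit 15 on d, and interp_diff <= 2d).

def is_scene_change(intra_fraction, d, interp_diff):
    return (intra_fraction >= 1 / 3      # enough Intra coded blocks
            and d > 15                   # forward/backward mismatch
            and interp_diff <= 2 * d)    # interpolation still reasonable

print(is_scene_change(0.5, 20.0, 30.0))  # True
```

When the predicate fires, the in-between frame is produced by linear interpolation instead of motion compensation.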
2.3.4.10 Image creation
All the above calculations are performed in order to create a motion vector field that represents the true motion in the scene well, or at least one that looks good. The only thing left is to create the image using the information from the calculated vector field. In order to produce a pleasant image and to capture small variations and brightness changes, the blocks' pixel values are interpolated from the two blocks pointed out in the forward and backward directions.
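For one block, the interpolation can be sketched as a per-pixel average. Equal forward/backward weights are my assumption for a frame inserted halfway between the two existing frames; other temporal positions would use weighted averages.

```python
# Sketch of the image-creation step for one block: the inserted block is
# the pixelwise average of the blocks pointed out in the forward and
# backward directions (equal weights assumed for a halfway frame).

def interpolate_block(forward_block, backward_block):
    return [(a + b) // 2 for a, b in zip(forward_block, backward_block)]

print(interpolate_block([100, 120], [110, 130]))  # [105, 125]
```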
Figure 2.18. A close-up of the block artefact that occurs without a deblocking filter.
2.3.4.11 Deblocking
Before displaying the image on the screen, a final filtering is performed. If there is a large enough difference between the motion vectors of two adjacent blocks, there is a risk that they will have different pixel values along their border. If that is the case, sharp edges are produced in places where there should be no edges, and block artefacts occur, as seen in figure 2.18.
The deblocking filter operates only at the edges of the blocks. If the absolute value of the difference between the motion vectors of two adjacent blocks is over a certain threshold, the new pixel values will be:
a_new = (2a_old + b_old) / 3    (2.14)

b_new = (a_old + 2b_old) / 3    (2.15)
This will have a smoothing effect on sharp edges.
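The edge filter of equations (2.14) and (2.15) can be sketched for one pixel pair across a block edge. The threshold on the motion vector difference is an assumed value; the thesis only says "a certain threshold".

```python
# Equations (2.14)/(2.15) as code: a and b are the pixels on either side
# of a block edge. MV_DIFF_THRESHOLD is an assumed value.

MV_DIFF_THRESHOLD = 4

def deblock_pair(a_old, b_old, mv_diff):
    if mv_diff <= MV_DIFF_THRESHOLD:
        return a_old, b_old            # edge left untouched
    a_new = (2 * a_old + b_old) / 3    # (2.14)
    b_new = (a_old + 2 * b_old) / 3    # (2.15)
    return a_new, b_new
```

Each pixel is pulled one third of the way towards its neighbour across the edge, which softens the artificial step without blurring the rest of the block.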
2.3.5 Complexity analysis

Although this work has no focus on optimization or complexity theory, an analysis of my solution's complexity is of great interest. If the algorithm described above is to be implemented in, for example, a mobile-TV decoder, there are limitations on processor speed and available decoding time.

Most important, I think, is a well-founded motivation of the complexity at a high level, since the implementation that I have made is to be considered a prototype rather than an optimized end-user product. I will, however, also present some running time statistics at the end of this section.
2.3.5.1 High level complexity reasoning
With the scenario from section 1.2.1 in mind, the overall idea in this Master's project has been to never use high complexity solutions in any step. It is of course possible to produce a reliable motion vector field if you perform your own motion estimation. However, even a fast motion estimation algorithm requires many calculations. All the steps described in section 2.3.4 are easy to perform and require few calculations.
A profiling tool shows that the most time consuming part is the subpixel creation for the SAD calculation.
2.3.5.2 Running time comparison
In the following table, three different methods are evaluated using a simple running time measurement. The same sequence has been decoded ten times without printing to file. The mean values of the runs can be seen in table 2.1.

The values listed under No FRUC are the running times for standard decoding with no frame rate up-conversion.
The Slow FRUC on the right is my frame rate up-converter with all the features described in section 2.3.4 tuned for best performance, with no restrictions on speed. No effort has been made to optimize the functions or to remove unnecessary calculations. As you can see, the running time is very high. The Fast FRUC is an attempt to reduce the running time.

There are four differences between Slow FRUC and Fast FRUC. First of all, Fast FRUC has a higher acceptance of evilness, so fewer blocks are labelled evil. Second, the calculation of evilness does not include Border Absolute Difference or Global Motion Match. Third, all calculations of pixel values are made at full-pel resolution instead of quarter-pel resolution. The last difference is that no special search is made for the Intra coded macroblocks (they are initialized with the zero motion vector).
Table 2.1. Running time.

Running time (s)                  No FRUC   Fast FRUC   Slow FRUC
Stefan, 128 kbps, 15 Hz, D64      0.85      1.41        21.5
Foreman, 110 kbps, 15 Hz, Helix   1.13      1.81        23.2
Of course, these differences do not only affect the running time, which is reduced to a great extent (see table 2.1), but also the quality level. Figures 2.19 and 2.20 illustrate the difference.

I am convinced that if a rigorous attempt were made to optimize the performance of the frame rate up-converter, it would be possible to achieve (almost) the performance of Slow FRUC at (almost) the speed of Fast FRUC.
Figure 2.19. The image to the left is produced using the encoded motion vector field directly, the image in the middle is produced with Fast FRUC and the image to the right is produced with Slow FRUC.
Figure 2.20. The image to the left is produced using the encoded motion vector field directly, the image in the middle is produced with Fast FRUC and the image to the right is produced with Slow FRUC.
Frame rate up‐conversion for encoded video. Results.
3 Results

Perhaps the most interesting part of a Master's project is looking at the results. After months of literature studies, approach analysis, algorithm construction and implementation, it is a great thing to realize that it actually works. In many cases, I have used my own opinion when deciding which setting yields the best quality, and many parameters have been set after I watched a large number of video sequences with different parameter values. Of course, one man's opinion is not a reliable measurement of how different users might experience the quality, and I will describe here how a fair evaluation has been performed.
3.1 Some examples

Before describing how I measured the results of my work and the outcome of these measurements, I will let you form your own opinion. It is not an easy thing to represent motion pictures on paper, but in some sense, a sequence of images printed on paper can reveal more of what is right and wrong than a displayed video sequence. This is due to the limitations of the eye in distinguishing small changes in a short period of time. In addition, the human visual system tends to mask small differences if they appear in a mixed environment, as discussed in chapter 1. The image sequences in figure 3.1 and figure 3.2 are examples of how well and how badly my solution performs compared to the method of linear interpolation discussed in section 2.2.2, and compared to a higher bit rate and higher frame rate.
Figure 3.1. Five frames from the Mallorca sequence using linear interpolation (above) and using my method (below).

Figure 3.2. Four frames from the Glorious sequence encoded at 30 Hz with 218 kbps (above), and encoded at 10 Hz with 128 kbps and up-converted with my method (below).
3.2 Objective measurements

Although it is people's opinion that matters most, there are some objective methods to calculate video quality. One of these methods (PSNR) is described in section 2.1.3, and even though it is not without faults, it can give an indication of how good (or bad) the quality is. Table 3.1 shows the PSNR value for my method compared to the method of frame repetition, and compared to encoding the same sequence at the same bit rate but with the original frame rate (in this case 30 Hz).
Table 3.1. PSNR.

PSNR (dB)                         Frame repetition   Original frame rate   My up-converter
Stefan, 128 kbps, 15 Hz, D64      21.17              26.43                 27.38
Foreman, 110 kbps, 15 Hz, Helix   27.67              33.21                 33.22
Foreman, 32 kbps, 10 Hz, D64      24.70              28.07                 28.82
Glorious, 64 kbps, 10 Hz, D64     18.23              28.56                 21.22
It is interesting to see that my method clearly outperforms the method of frame repetition for all sequences. This is not the case when it is compared to the method of keeping the original frame rate and reducing the quality. Perhaps sensational is the result for glorious, where my method scores over seven dB lower. This result does not at all correspond to the result from the subjective test, where my method was given a much better score. From this, we can conclude that PSNR is not reliable as an estimation of the subjective quality experience.
3.3 Subjective test

In order to confirm that the algorithm gave good results, I arranged a subjective test. Its purpose was to ensure that a larger set of people (than my supervisors and me) would say that the overall experience of the video quality was increased when my frame interpolation algorithm was used compared to when simple frame repetition was used. A sequence was encoded in H.264 at some quantization level and a certain amount of frames was dropped. Then it was decoded both using frame repetition and using my method, restoring the original frame rate. When these two decoded sequences were displayed to the test person, she could make a comparison.

I also wanted to see how well a sequence that had been encoded and decoded using my method could compete with two other encoded and decoded sequences. The first of those is one where the original frame rate is kept and the same quantization level is used, thus giving a higher bit rate (typically 50 % higher or more). This is a somewhat unfair comparison, and more interesting is perhaps the third test, where the original frame rate once again is kept but the quantization level is raised so that the file size is the same. A higher quantization level gives lower quality, and besides other factors, such as faster encoding and less H.264 decoding, it would be encouraging if my method also outperformed this method in quality of experience.
3.3.1 The test sequences

Six different video sequences were used during the test. These sequences were chosen to represent the variation of typical video sequences. A short description of the sequences and their nature follows.
3.3.1.1 Stefan
In stefan, a tennis player (Stefan Edberg) is running, turning and hitting the ball on a tennis court. This sequence contains camera panning with acceleration and fast movements, especially around Stefan’s legs. The audience and Stefan’s head have very similar appearance, which complicates calculations. Figure 3.3 shows one frame from the stefan sequence.
Figure 3.3. A frame from the stefan sequence.
3.3.1.2 Foreman
The foreman sequence consists of two parts. First, you see the head and shoulders of a man who is talking and moving his head, as can be seen in figure 3.4. The background contains a repetitive pattern of diagonal lines. Later, there is a camera panning, passing by the blue sky, some trees and a building area.
Figure 3.4. A frame from the foreman sequence.
3.3.1.3 Glorious
From the music video for Andreas Johnson's song "Glorious", the test sequence glorious was captured. It contains many scene changes, lighting changes and clips with rotation and panning. Figure 3.5 shows one frame from the glorious sequence.
Figure 3.5. A frame from the glorious sequence.
3.3.1.4 Mallorca
A football game with three scenes is captured in the mallorca sequence. Two scenes are close‐ups of two different football players and one contains a large part of the football ground with a large number of players. Panning in both horizontal and vertical direction is included. One of the close‐up scenes can be seen in figure 3.6.
Figure 3.6 A frame from the mallorca sequence.
3.3.1.5 Bluebird
The bluebird sequence is a small part of the music video to Staffan Hellstrand’s “Lilla fågel blå”. It con‐tains fast panning, rotating objects, a moving sky, scene changes and grey‐scale parts. Figure 3.7 shows one frame from the bluebird sequence.
Figure 3.7. A frame from the bluebird sequence.
3.3.1.6 News
The news sequence is a clip from a news report. It contains a talking woman, of whom you see the head and shoulders, and after a scene change, it contains text and numbers. Figure 3.8 shows a frame from the first part of the news sequence.
Figure 3.8 A frame from the news sequence.
3.3.2 Environment

The test session was held at Ericsson in Kista, and test persons were invited from the Division of Video Technology and the Division of Audio Technology. There were two strong reasons for using this test group. The first is that I wanted an audience that was familiar with video coding and could distinguish even relatively small quality differences. The second is that it was economical and practical to use this group. Of course, one can argue that it would be more interesting to use ordinary people, not initiated in video coding, as test persons. However, such a test would require a larger effort and more time, and development and implementation had higher priority than performing such a test.
I wanted the subjective test to be as objective as possible, in the sense that sequences should be evaluated based on the experience of quality, not on other factors such as emotions, video coding convictions or careless judgement. Thus, the nature of the test was kept secret from the test persons, and the only information they received was that they were to give their judgement on different video sequences.
The test person was put alone in the test room in front of the computer on which the test was running. She received the test instructions that can be seen in appendix A (in Swedish) and had half an hour to carry out the test.
3.3.3 Methodology

The test was divided into three parts, of which the first two were judgements on video sequences, and the third part had the nature of an evaluation.
3.3.3.1 Part one
In the first part, two versions of the same sequence were shown after each other, and the test person was to give her opinion on their relative quality. Both versions were encoded and decoded using one of the four methods described above. Information about how the sequences were encoded and decoded was not given to the test person. Order and position were altered throughout the whole test in order to prevent suspicion. Six different sequences were used with the following parameter settings:

• Image size: QCIF (176x144)
• Bit rate: 32 kbps, 64 kbps and 128 kbps
• Encoded frame rate: 10 and 15 Hz
• Displayed frame rate: 30 Hz
• Quantization level: 29 to 41
Each sequence was involved in four comparisons. My method was compared to each of the three other methods. There was also an extra "reference" comparison of one of the following kinds:

Repetition. The same pair was compared twice, to see if the answers were consistent.

Mutual comparison. Two different methods, other than mine, were compared.

Forged variation. The same method was used for both sequences in the pair.

The purpose of these extra tests was to measure reliability and consistency in the answers. Of the repeated pairs, only the first occurrence was used when calculating the average score.

The sequences were displayed to the test person once, and after both sequences, she was to give her judgement on the quality of the second one compared to the first one (called the reference).
3.3.3.2 Part two
In the second part of the test, the test person saw videos that contained two versions of the same sequence next to each other. She had the possibility to look at them as many times as she wanted. Then she was to give her opinion on a scale from "the one on the left was of much better quality than the one on the right" to "the one on the right was of much better quality than the one on the left".

In this test, every sequence (out of the six) was used only once, and my method was compared twice against each of the other three methods.
3.3.3.3 Part three
After the two parts in the test lab, I wanted the test person to answer a few questions about the quality of the sequences and about her thoughts on the test. The form is attached in appendix B.
3.3.4 Results

Since the test was divided into three parts, I will present the results in three parts and collect the conclusions in the next section. The results are presented in diagrams where the values correspond to the scores of the three other methods compared to mine. A positive score means that the other method was given a higher score, and a negative value means that my method was given a higher score. The sequence on which I performed my method was encoded at a certain bit rate and with a lower frame rate than the original (½ or ⅓). This was compared to the following methods:
A. The sequence was encoded with the same quantization level but with the original frame rate. This is the best that any of the other three can ever get, and it is therefore to be considered a reference.

C. The sequence was encoded with the original frame rate but at the same bit rate.

D. The sequence was encoded the same way as with my method but decoded using frame repetition.
3.3.4.1 Part one
The average score from part one can be seen in figure 3.9. It is beyond doubt that using a higher bit rate gives the best result. It is also clear that my method is considered to give better quality than method C. When it comes to my method versus frame repetition, it is (surprisingly) a close call. Read more about outer circumstances and conclusions in section 3.3.5. Figure 3.10 gives more detail on the score for each sequence.
For the sequence stefan, my method performs very modestly, and part of the explanation lies in that stefan is an extraordinarily tough sequence, with many fast movements, fast camera panning and a small difference between the foreground object (Stefan himself) and the background (the audience).
Figure 3.9. The average score from part one (approximate values: A 0.6, C −0.4, D 0.0, on a scale from −2 to 2).
The performance of my method has, in my opinion, been improved since the test. This can be seen, with an example from stefan, in section 3.3.6.
Figure 3.10. The test results for each sequence (foreman, stefan, glorious, mallorca, bluebird, news) in part one.
3.3.4.2 Part two
In part two, my method was compared to method A for the sequences mallorca and news, to method C for bluebird and mallorca, and to method D for foreman and stefan. The mean has been calculated from the two scores in each pair and can be seen in figure 3.11. It is interesting to see that my method gained a much better score in this second part of the test than in the first part. Section 3.3.5 contains some explanations and thoughts on why this may be.
Figure 3.11. The average score from part two (scale from −2 to 2).
3.3.4.3 Part three
One of the most interesting questions on the form in part three is perhaps the first one, where the test person was asked what she based her judgement of the quality on. The four most dominant answers were (in decreasing order): blockiness, jerkiness, blurriness and strange movements.
The artefact of blockiness can occur with my frame rate up-converter, but I think it is more frequent and noticeable with method C, where a higher quantization level is being used. The artefact of jerkiness is obviously a problem inherent to low frame rate, and I therefore assign those judgements to method D. As I see it, it is the latter two artefacts that can be seen when my method has been used. See section 3.3.6 for a description of how the algorithm was improved based on this judgement.
On the question of whether it was easy to tell which method had been used, only one person claimed that it was easy.
In the ranking of which sequence it was easiest to see the difference between the two versions, the following list is constructed from the mean result:
1. Mallorca
2. Foreman
3. Stefan
4. Glorious
5. Bluebird
6. News
This ranking list corresponds well to the magnitude of the score in part one.
When it comes to the question of which of part one and part two gave the best possibility for a well-founded answer, about an equal number of people voted for each part (5 vs. 4 respectively).
The additional comments concerned the test environment (such as the brightness level in the room) as well as the quality of the video sequences. A common opinion was that bluebird was of unacceptably low quality.
3.3.5 Conclusions and circumstances

In this section, I will try to draw some conclusions about the test, with focus on the differences between the results of the two parts. A conclusion of the whole Master's project and the performance of my implementation is given in chapter 4.

As you can see, there are great differences between the results of part one and part two. Perhaps most remarkable is the fact that for both stefan and foreman, method D scores higher than my method in part one, while my method scores higher in part two. I will bring forward a few circumstances that may have affected the results of the test.
3.3.5.1 The repetition possibility
In part one, the test person could only see the sequences once: first one, then the other. This can greatly reduce the possibility of noticing the differences and increase the risk of a judgement based on faulty impressions. An indication of this is foreman coded with method C compared to itself. Nine out of the twelve test persons gave an opinion in one direction or the other, four of them with a magnitude of at least one sixth of the maximum.
3.3.5.2 Side by side comparison
In part two, the sequences are displayed next to each other, which gives the possibility to look back and forth to compare things that occur in the video over a period of time. One important point connects back to section 1.1.1.1. If the test person is concentrating on one sequence, she will still see the other sequence, but since the light that enters the eye from that sequence does not hit the retina at the fovea, it will be handled by rods and magno ganglion cells. Thus, the colour and the details will not be perceived, only the movement. This indicates that a difference in frame rate will be easy to detect when two sequences are displayed next to each other.
3.3.5.3 Order of display
In part one, the order in which the sequences are displayed can affect the judgement. A possibility is that the first sequence will unintentionally be thought of as the original and therefore be given a higher score. This suggestion is supported by the result when the same comparison was displayed twice, but in reversed order. When my method was displayed first, it scored much better than method D (0.48). When method D was displayed first, it gained a slightly better score than mine (0.07).
3.3.5.4 Displayed frame rate difference
The differences mentioned so far are things that I could not control, and perhaps their nature is even positive, since the two parts of the test complement each other with different strengths. This fourth difference is one that could, and preferably should, have been avoided, had it been discovered before the test started.
The frame rate at which the sequences were displayed was not the same in the two parts. The original frame rate (for most sequences) was 30 Hz, and that was the display rate in part two. However, for some unfortunate
and unknown reason, the display rate in part one was 25 Hz. The effect is that the videos were played slightly in slow motion, which gave the test person a little longer time to view each frame. Thus, strange things would appear more noticeable, since the temporal masking is reduced.
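The size of this effect is easy to quantify. As a small illustrative computation (the frame count below is an example, not a sequence from the test):

```python
def playback_duration_s(n_frames, display_hz):
    # Wall-clock time needed to show n_frames at a given display rate.
    return n_frames / display_hz

# A 300-frame clip authored for 30 Hz lasts 10 s, but shown at 25 Hz
# it lasts 12 s: a 20 % slow-down that lengthens each frame's
# on-screen time and thereby weakens temporal masking.
```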
3.3.5.5 Question formulation difference
It is a well-known fact that the formulation of test questions can be of decisive importance for the result of the test. The purpose of my test was to get a picture of how people experienced the quality of videos decoded using my frame rate up-conversion algorithm.
The description and question formulation were not identical for the two parts. One of the test participants brought an opinion on this matter to my attention. This person understood the first part to be an evaluation of which sequence he would prefer to watch, and the second part as an evaluation of which he thought had the (objectively) best quality. It is possible that other test persons shared this view, and it might have affected the outcome of the test.
3.3.6 Improvements after the test

The test was performed in the latter part of the Master's project, at a stage where it was still possible to use the test results to improve the algorithm and its performance.

Two major changes were made after the test. They correspond to opinions that were brought to me by persons who participated in the test: people had noticed an annoying blurriness and strange movements.
To deal with the blurriness artefact, I introduced the calculation of sub-pixel positions in the same way as defined by the H.264 standard. It uses one six-tap filter and one two-tap filter to create quarter-pixel resolution. This gives much more accurate motion and a sharper picture, but it comes at the price of increased computation time.
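In one dimension, the H.264 luma interpolation uses the six-tap filter (1, −5, 20, 20, −5, 1) for half-pel positions and a rounding two-tap average for quarter-pel positions. The sketch below is a simplified one-dimensional illustration of that scheme; the function names are mine, and a full implementation would also handle the two-dimensional cases and frame borders:

```python
def clip8(v):
    # Clip to the 8-bit sample range used by H.264.
    return max(0, min(255, v))

def half_pel(row, x):
    # Half-pel sample between integer positions x and x+1, using the
    # six-tap filter (1, -5, 20, 20, -5, 1) with rounding and a shift
    # by 5 (division by 32), as in the H.264 luma filter.
    taps = (1, -5, 20, 20, -5, 1)
    acc = sum(t * row[x - 2 + i] for i, t in enumerate(taps))
    return clip8((acc + 16) >> 5)

def quarter_pel(row, x):
    # Quarter-pel sample: two-tap rounding average of the integer
    # sample at x and the neighbouring half-pel sample.
    return (row[x] + half_pel(row, x) + 1) >> 1
```

On a flat area the filter reproduces the original value, while across an edge it produces a smooth intermediate value; this is what gives the sharper interpolated frames at the cost of six multiplications per half-pel sample.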
To reduce the occurrences of strange movements, which in many cases correspond to Intra-coded macroblocks, I decided to improve the special motion estimation. At that stage, the motion of an Intra-coded block was always initialized to the mean of the surrounding motions. To improve on this, I added a SAD (sum of absolute differences) comparison between the surrounding motion and the zero motion. After the best of these is chosen, a small search is performed with that motion as the initial motion vector.
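The candidate selection can be sketched as a straightforward SAD comparison. This is an illustrative sketch of the general technique, not code from my implementation, and the helper names are mine:

```python
def sad(block_a, block_b):
    # Sum of absolute differences between two equal-sized pixel blocks.
    return sum(abs(a - b) for a, b in zip(block_a, block_b))

def choose_initial_mv(candidates, cur_block, ref_block_at):
    # Pick the candidate motion vector (e.g. the mean of the
    # neighbouring motions and the zero vector) whose referenced block
    # in the reference frame matches the current block best. A small
    # refining search would then start from the winner.
    return min(candidates, key=lambda mv: sad(cur_block, ref_block_at(mv)))
```

The point of the comparison is cheap robustness: when the neighbouring motions are unreliable (as they often are around Intra-coded blocks), the zero vector wins and prevents a wildly wrong initialization.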
Finally, the weights in equation 2.11 were changed slightly, in order to increase the performance.
The difference between the performance of my algorithm before and after the test can be seen in figure 3.12.
Figure 3.12. The sequence above was used during the test. The sequence below was constructed after the improvements that were made after the test.
4 Conclusions

To sum up what I achieved during this Master's project and what conclusions can be drawn from it, the initial goals and specifications have, in some sense, to be compared with the results.

First of all, it is interesting to see whether the problem of frame rate up-conversion is at all solvable. You can read about existing, working implementations of frame rate up-converters in section 1.2.3. However, the focus of this Master's project was not to study frame rate up-conversion in general, but rather the specific case of a low-complexity solution for encoded video sequences. These constraints brought both difficulty, since the complexity must not be too high, and assistance, since helpful information is found in encoded video sequences.
One of the major parts of this Master's project has been implementation. To meet the goal of low complexity, I decided to aim for a solution where functions could optionally be removed to increase execution speed.
The subjective test that was performed (section 3.3) gives an indication of how hard the problem of frame rate up-conversion is, and shows that the result to a great extent depends on the nature of the video sequence. Although it was not the final version of the implementation that was used during the subjective test, the overall result showed that my method outperformed the method of reducing quality in order to retain the frame rate.
As a final conclusion, I would like to claim that it is definitely possible to create a frame rate up-converter that makes use of the encoded motion vector field to increase the quality of a video sequence with a low frame rate. Moreover, I believe that is what I have created during this Master's project.
5 Future work

As you have read in this report, I have produced a prototype of a frame rate up-converter. Since the result has been successful, a natural next step would be to construct a product based on my algorithm. This will be my goal for the near future, but I believe some interesting aspects must first be given attention.

As you could read in the section about running time (2.3.4.2), the time consumed by my algorithm is too high. Therefore, some kind of optimization has to be performed before a consumer product can be constructed.
The first step in future work could be to try to remove those parts of the algorithm that do not give a noticeably large quality improvement. The general idea during the implementation has been to add functions that I believed increase the performance. It is possible that some of these functions became obsolete as others were added, and that redundant work is performed.
Second, for the most time-consuming parts that are essential for the performance of the algorithm, an implementation optimization could be done. Of course, the level of optimization could be increased using knowledge about the application in which the frame rate up-converter is to be used. It may be constrained to a certain bit rate, frame rate, image size or a certain type of video sequences. All these constraints can be used to remove unnecessary flexibility.
The last thing that I find interesting for future work is variations of the algorithm. For example, the block size in the inserted frame was chosen to be 4×4 at an early stage, because it gave better performance than 8×8 and has the ability to capture object borders more accurately. It would be interesting to compare different block sizes in terms of performance and running time.
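The cost side of that trade-off is easy to see: halving the block side quadruples the number of motion vectors that must be estimated per inserted frame. A small illustrative computation (the frame size below is the standard QCIF size, used only as an example):

```python
def mv_count(width, height, block_size):
    # Number of motion vectors needed to cover a frame when it is
    # partitioned into square blocks of the given side length.
    return (width // block_size) * (height // block_size)

# For a QCIF frame (176 x 144), 4x4 blocks need four times as many
# motion vectors as 8x8 blocks, which roughly scales the per-frame
# motion estimation work by the same factor.
```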
Bibliography
[1] FERWERDA, J.A., 2001, Elements of Early Vision for Computer Graphics, IEEE Computer Graphics and Applications, Volume 21, Issue 5, pp. 22-33, September 2001.
[2] GORILLAZ, 2001, Quotation from the lyrics of the song Clint Eastwood appearing on the compact disc "Gorillaz".
[3] GRANLUND, G.H., & KNUTSSON, H., 1995, Signal Processing for Computer Vision, Kluwer Academic Publishers, ISBN 0-7923-9530-1.
[4] Image: http://www.gene.com/gene/products/information/images/eye-anatomy.jpg
[5] PINGSTONE, A., Simultaneous Contrast Illusion, http://upload.wikimedia.org/wikipedia/commons/9/9a/Gradient.illusion.arp.jpg, Drawn using Photoshop in March 2005.
[6] PATTANAIK, S., TUMBLIN, J., YEE, H., & GREENBERG, D., 2000, Time-dependent visual adaptation for fast realistic display, SIGGRAPH '00, pp. 47-54, July 2000.
[7] MATHER, G., 2006, Introduction to Motion Perception, http://www.lifesci.sussex.ac.uk/home/George_Mather/Motion/index.html, Visited October 2006.
[8] RICHARDSON, I.E.G., 2002, Video codec design, John Wiley & Sons Ltd, ISBN 0-471-48553-5.
[9] WIEGAND, T., SULLIVAN, G.J., BJØNTEGAARD, G., & LUTHRA, A., 2003, Overview of the H.264/AVC video coding standard, IEEE Transactions on Circuits and Systems for Video Technology, Volume 13, Issue 7, pp. 560-576, July 2003.
[10] RICHARDSON, I.E.G., 2003, H.264 and MPEG-4, John Wiley & Sons Ltd, ISBN 0-470-84837-5.
[11] CHOI, B.T., LEE, S.H., & KO, S.J., 2000, New frame rate up-conversion using bi-directional motion estimation, IEEE Trans. Consum. Electron., Volume 46, Number 3, pp. 603-609.
[12] CHEN, Y.-K., VETRO, A., SUN, H., & KUNG, S.-Y., 1998, Frame Rate Up-Conversion Using Transmitted True Motion, in Proc. of 1998 Workshop on Multimedia Signal Processing, December 1998.
[13] SHIRATORI, T., MATSUSHITA, Y., KANG, S.B., & TANG, X., 2006, Video Completion by Motion Field Transfer, in Proc. IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR 2006), June 2006.
[14] SNELL & WILCOX, 2006, Alchemist Platinum Ph.C Motion Compensated Standards Converter with Ph.C, http://www.snellwilcox.com/products/data/conversion_restoration/alcpdata.pdf, Visited December 2006.
[15] MARSH, D., 2001, Temporal Rate Conversion, http://www.microsoft.com/whdc/archive/TempRate.mspx, Visited December 2006.
[16] DE HAAN, G., 2001, Natural Motion, http://www.research.philips.com/newscenter/dossier/naturalmotion/index.html, Visited December 2006.
[17] ZHAI, J., YU, K., LI, J., & LI, S., 2005, A Low Complexity Motion Compensated Frame Interpolation Method, The 2005 IEEE International Symposium on Circuits and Systems (ISCAS 2005), Kobe, Japan, 23-26 May 2005.
[18] THORESSON, H., 2006, Vi har testat Teracoms mobil-tv, Mobil, http://mobil.mkf.se/ArticlePages/200610/09/20061009181649_MKF_MOB_Administratorer707/20061009181649_MKF_MOB_Administratorer707.dbp.asp, 10 October 2006, Visited December 2006.
[19] ANTON, H., & RORRES, C., 2000, Elementary Linear Algebra, 8th ed., John Wiley & Sons Ltd, ISBN 0-471-17052-6.
[20] Image: http://www.yle.fi/matochfritid/bilder/verkstad/tavelram.jpg
Appendixes
Appendix A. Test instructions
Test instructions
In both parts of this test, you will make a comparative judgement between two sequences. The difference between the sequences can often seem small, so try to stay attentive and concentrated throughout the test. You are to make a personal judgement of the quality on a continuous scale. A mark at the far left means that you would much rather watch video with the quality of the first sequence. A mark at the far right means that you would much rather watch video with the quality of the second sequence. When deciding which one you prefer, you may for example keep the following artefacts in mind: blurriness, blockiness, jerky movements, poor sharpness in the images, flicker, problems along the edges of the image, movements that look strange, ghosts (double/semi-transparent objects), block losses, other coding artefacts, etc.
I (Jonatan) will be in the coffee room during the whole test. If you have any questions, just go to the coffee room and talk to me.
Part 1. Fill in your first and last name and start the test. It begins with a practice sequence. If you have no questions after the practice sequence, just continue.
Judgement.
The sequences are shown in pairs, first one and then the other. You are to state what you think of the quality of the second sequence in relation to the first one (which is called the reference sequence). The question of whether you would accept the video quality refers to the second sequence.
When you have finished part 1, you can close that test, turn the page, and continue with part 2.
Part 2.
When you have finished the test in part 1, you will see 7 icons on the desktop, numbered from 1 to 7. Each of them contains a video clip with two sequences next to each other. You are to make a comparative judgement of the quality similar to the one made in part 1. Here, however, you have the possibility to look at the sequences next to each other and to play them several times. Open the clips in Windows Media Player. It is not allowed to pause, fast-forward or rewind (or skip any parts), or to change the size. You may also not change the playback speed or play the sequences in any other program. Unfortunately, I could not arrange a test environment with these restrictions built in, but I hope you can follow these rules.
You have a total of 5 minutes to judge all 7 sequences and are yourself responsible for keeping the time. Fill in your judgement below:
1.
How did you experience the quality of the right-hand sequence in relation to the left-hand one? (mark with an X)
Much worse    Equally good    Much better
2.
How did you experience the quality of the right-hand sequence in relation to the left-hand one? (mark with an X)
Much worse    Equally good    Much better
3.
How did you experience the quality of the right-hand sequence in relation to the left-hand one? (mark with an X)
Much worse    Equally good    Much better
4.
How did you experience the quality of the right-hand sequence in relation to the left-hand one? (mark with an X)
Much worse    Equally good    Much better
5.
How did you experience the quality of the right-hand sequence in relation to the left-hand one? (mark with an X)
Much worse    Equally good    Much better
6.
How did you experience the quality of the right-hand sequence in relation to the left-hand one? (mark with an X)
Much worse    Equally good    Much better
7.
How did you experience the quality of the right-hand sequence in relation to the left-hand one? (mark with an X)
Much worse    Equally good    Much better
Appendix B. Test evaluation
Further questions
1. Mark with an X the artefacts that determined your judgement:
□ Blurriness
□ Blockiness
□ Jerky movements
□ Poor sharpness in the images
□ Flicker
□ Problems along the edges of the image
□ Movements look strange
□ Ghosts (double/semi-transparent objects)
□ Block losses
□ Something else:________________________
□ In addition:__________________________
2. Each sequence was encoded and decoded in one of four different ways. Was it easy to tell in which way a sequence had been encoded/decoded?
□ Yes □ No □ Don't know
3. Which sequence did you consider to have the best overall quality (disregard clip 7 in part 2)? Rank the following from 1 to 6, where 1 is best and 6 is worst:
• Foreman
• Stefan
• Glorious
• Mallorca
• Bluebird
• News
4. Which of the two parts do you think gave the best possibility for a well-founded judgement?
□ Part 1 □ Part 2 □ Equally good
5. What do you think of the test as a whole?
______________________________________________________________________________________________
______________________________________________________________________________________________
6. Other comments.
______________________________________________________________________________________________
______________________________________________________________________________________________
______________________________________________________________________________________________
______________________________________________________________________________________________
Thank you for your participation.
TRITA-CSC-E 2007:021 ISRN-KTH/CSC/E--07/021--SE
ISSN-1653-5715
www.kth.se