VIDEO CODEC
Vinayagam M
Director, eSILICON LABS, INDIA
Next Generation Broadcasting Technology
Agenda
HVS
Images / Video
Video / Image Compression
Image Coding
Video Coding
Video Coder Architecture
Video Codec Standards
HEVC
HVS
HVS
• HVS properties influence the design/tradeoffs of imaging/video systems
• Basic properties of HVS “front-end”
– 4 types of photo-receptors in the retina
– Rods, 3 types of cones
• Rods
– Achromatic (no concept of color)
– Used for scotopic vision (low light levels)
– Concentrated in periphery
• Cones
– 3 types: S - Short, M- Medium, L - Long
– Red, Green, and Blue peaks
– Used for Photopic Vision (daylight levels)
– Concentrated in fovea (center of the retina)
HVS…
• Eyes, optic nerve, parts of the brain
• Transforms electromagnetic energy
• Image Formation
– Cornea, Sclera, Pupil, Iris, Lens, Retina, Fovea
• Transduction
– Retina, Rods, and Cones
• Processing
– Optic Nerve, Brain
• Retina and Fovea
– The retina has photosensitive receptors at the back of the eye
– The fovea is a small, dense region of receptors
Only cones (no rods)
Gives visual acuity
– Outside the fovea
Fewer receptors overall
Larger proportion of rods
HVS…
• Transduction (Retina)
– Transform light to neural impulses
– Receptors signal bipolar cells
– Bipolar cells signal ganglion cells
– Axons of the ganglion cells form the optic nerve
• Image Formation in the Human Eye
HVS…
• HVS Properties
– Tradeoff in resolution between space and time
Low resolution for high spatial AND high temporal frequencies
However, eye tracking can convert fast-moving object into low retinal frequency
– Achromatic versus chromatic channels
Achromatic channel has highest spatial resolution
Yellow/Blue has lower spatial resolution than Red/Green channel
– Color refers to how we perceive a narrow band of electromagnetic energy
Source, Object, Observer
HVS…
• Visual System
– Visual system transforms light energy into sensory experience of sight
HVS…
• Color Perception (Color Theory)
– Hue
Distinguishes named colors, e.g., RGB
Dominant wavelength of the light
– Saturation
Perceived intensity of a specific color
How far color is from a gray of equal intensity
– Brightness (lightness)
Perceived intensity
[Figure: hue scale, with saturation and lightness variations of an original color]
HVS…
• Visual Perception
– Resolution and Brightness
– Spatial Resolution depends on
Image Size
Viewing Distance
– Brightness
The eye is more sensitive to brightness than to color
Different perception of the primary colors
Relative brightness: green : red : blue = 59% : 30% : 11%
– B/W vs. Color
HVS…
• Visual Perception
– Temporal Resolution
Effects caused by the inertia of the human eye
Perception of 16 frames/second as a continuous sequence
Special effect: flicker
Flicker
Perceived if the frame rate or refresh rate of the screen is too low (<50 Hz)
Especially in large, bright areas
A higher refresh rate requires
Higher scanning frequency
Higher bandwidth
HVS…
• Visual Perception Influence
– Viewing distance
– Display ratio (width/height – 4/3 for conventional TV)
– Number of details still visible
– Intensity (luminance)
HVS…
• Imaging / Visual System designed
based on HVS principles
• Example
– Image Sensor
– Television
– Image / Video Display
• Image Sensor
– CCD (charge coupled device):
Arrays of photo diodes
Linearity
Less light needed
Electronic shuttering
– CMOS
Cheaper
Easy manufacturing
• Television
– NTSC (National Television System
Committee):
60 Hz, 30 fps, 525 scan lines
North America, Japan, Korea ….
– PAL (Phase Alteration by Line):
50 Hz, 25 fps, 625 scan lines
Europe …
• Image / Video Display
– CRT Monitor
– LCD TV/Display Monitor
IMAGE / VIDEO
IMAGE / VIDEO
• Images
– A view observed by the HVS at a time instant
– A multidimensional array of numbers (such as an intensity image) or vectors (such as a color image)
Each component in the image is called a pixel and is associated with a pixel value (a single number in the case of intensity images or a vector in the case of color images)
[Figure: grid of numeric pixel values]
IMAGE / VIDEO…
• Video
– Series of Frames (or Images)
IMAGE / VIDEO…
• Images / Video Frame
– A multidimensional function of spatial coordinates
– Spatial Coordinate
(x,y) for 2D case such as photograph,
(x,y,z) for 3D case such as CT scan images
(x,y,t) for movies
– The function f may represent intensity (for monochrome images) or color
(for color images) or other associated values
[Figure: image “After snow storm” as f(x,y), with the origin and the x, y axes]
IMAGE / VIDEO…
• Images / Video Frame
– An image that has been discretized both in Spatial coordinates and
associated value
Consist of 2 sets:(1) a point set and (2) a value set
Can be represented in the form
– I = {(x, a(x)) : x ∈ X, a(x) ∈ F}
where X and F are the point set and value set, respectively
An element of the image, (x,a(x)) is called a pixel
where
x is called the pixel location and
a(x) is the pixel value at the location x
– Conventional Coordinate for Image Representation
IMAGE / VIDEO…
• Images / Video Frame Representation
– Basic Unit : Pixel
– Dimensions
Height
Width
– Frame rate determines how long the pixel
exists, i.e. how it moves
– Color Depth of the pixel
How many bits are used to represent the color of
each pixel?
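These parameters determine raw storage needs. A short Python sketch (the frame dimensions, frame rate, and duration are illustrative assumptions) makes the arithmetic concrete:

```python
def raw_video_size(width, height, bits_per_pixel, fps, seconds):
    """Uncompressed size in bytes for a video of the given dimensions."""
    bits_per_frame = width * height * bits_per_pixel
    total_bits = bits_per_frame * fps * seconds
    return total_bits // 8

# Example: 1 minute of 1080p video, 24-bit RGB, 30 fps
size = raw_video_size(1920, 1080, 24, 30, 60)
print(size / 1e9, "GB")  # about 11.2 GB of raw data for one minute
```

The result, roughly 11 GB for a single minute, is why the compression techniques in the following sections matter.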
IMAGE / VIDEO…
• Image Type
– Binary Image
– Intensity Image
– Color Image
– Index image
IMAGE / VIDEO…
• Binary Image
– Binary image or black and white image
– Each pixel contains one bit
1 represent white
0 represents black
1111
1111
0000
0000
Binary Data
IMAGE / VIDEO…
• Intensity Image
– Intensity / Monochrome/ Gray Scale Image
– Each pixel corresponds to light intensity normally represented in gray
scale (gray level)
39871532
22132515
372669
28161010
Gray Scale Values
IMAGE / VIDEO…
• Color Image
– Each pixel contains a vector representing red, green and blue components
39871532
22132515
372669
28161010
39656554
42475421
67965432
43567065
99876532
92438585
67969060
78567099
RGB Components
IMAGE / VIDEO…
• Index Image
– Each pixel contains index number pointing to a color in a color table
[Figure: grid of index values pointing into a color table]
Color Table:
Index No.  Red component  Green component  Blue component
1          0.1            0.5              0.3
2          1.0            0.0              0.0
3          0.0            1.0              0.0
4          0.5            0.5              0.5
5          0.2            0.8              0.9
…          …              …                …
IMAGE / VIDEO…
• Colourspace Representations
– RGB (Red, Green, Blue) – Basic analog components (from camera/to TV)
– YPbPr (Y, B-Y, R-Y) – ANALOG colourspace (derived from RGB)
Y = Luminance; Pb and Pr are the scaled Blue and Red colour-difference signals
– YUV – Colour-difference signals scaled to be modulated on a composite carrier
– YIQ – Used in NTSC. I = In-phase, Q = Quadrature (the IQ plane is a 33-degree rotation of the UV plane)
– YCbCr/YCC – DIGITAL representation of the YPbPr colourspace (8-bit, 2's complement)
IMAGE / VIDEO…
• RGB Color
– All color can be composed by adding specific amounts of R, G, & B
– 8 bits (2^8 = 256 levels) specify the amount of each color
– This is the scheme used by most electronic displays to generate color;
e.g. we often call our computer monitors, "RGB displays"
8-bits Red
8-bits Green
8-bits Blue
IMAGE / VIDEO…
• Color Reduction
– Human eye is not as sensitive to color as it is to Luminance
– To this end, to save costs the various standards decided to
Maintain luminance information in our images, but Reduce color information
Using RGB, though, how do we easily reduce color information without removing luminance?
For this, and other technical reasons, a separate color space was chosen by
most video standards …
IMAGE / VIDEO…
• Colour Image: RGB
• YCbCr
– Even though most displays actually
use RGB to create the image, YCbCr
is used most often in consumer
electronics for transmission of the
image
– Historically, B/W televisions
transmitted only luminance (Y)
– The color signals were added later
IMAGE / VIDEO…
• YCbCr Generated by Subsampling
– YUV 4:4:4 = 8 bits per Y, U, V channel (no downsampling of the chroma channels)
– YUV 4:2:2 = 4 Y pixels sampled for every 2 U and 2 V (2:1 horizontal downsampling, no vertical downsampling)
– YUV 4:2:0 = 2:1 horizontal downsampling, 2:1 vertical downsampling
– YUV 4:1:1 = 4 Y pixels sampled for every 1 U and 1 V (4:1 horizontal downsampling, no vertical downsampling)
• YUV 4:4:4
Y Y Y Y
Y Y Y Y
4:4:4 Format (3 bytes/pixel):
Cb Cr Cb Cr Cb Cr Cb Cr
Cb Cr Cb Cr Cb Cr Cb Cr
IMAGE / VIDEO…
• YUV 4:2:2
• YUV 4:2:0
Y Y Y Y
Y Y Y Y
4:2:2 Format (2 bytes/pixel):
Cb Cr
Cb Cr
Cb Cr
Cb Cr
Y Y Y Y
Y Y Y Y
Cb Cr Cb Cr
4:2:0 Format (1.5 bytes/pixel):
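The bytes-per-pixel figures above follow directly from the sampling ratios; a small Python sketch makes the arithmetic explicit:

```python
def bytes_per_pixel(h_sub, v_sub, bits=8):
    """Average storage per pixel for one Y plane plus two chroma planes,
    each chroma plane downsampled h_sub:1 horizontally and v_sub:1 vertically."""
    luma = bits / 8                             # 1 byte of Y per pixel
    chroma = 2 * (bits / 8) / (h_sub * v_sub)   # Cb and Cr, shared across pixels
    return luma + chroma

print(bytes_per_pixel(1, 1))  # 4:4:4 -> 3.0 bytes/pixel
print(bytes_per_pixel(2, 1))  # 4:2:2 -> 2.0 bytes/pixel
print(bytes_per_pixel(2, 2))  # 4:2:0 -> 1.5 bytes/pixel
```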
IMAGE / VIDEO…
• Upsampling
• Downsampling
[Figure: upsampling, where the input signal F(nT) is expanded and passed through an interpolating low-pass filter to produce F(nT/2); and downsampling, where the input signal F(nT) is passed through a decimating low-pass filter, which prevents aliasing at the lower rate, to produce F(2nT)]
IMAGE / VIDEO…
• RGB to YCbCr
• RGB to YUV Conversion
– Y = 0.299R + 0.587G + 0.114B
– U = (B - Y) * 0.565
– V = (R - Y) * 0.713
[Figure: U-V plane at Y = 0.5]
Clamp the output: Y = [16, 235]; U, V = [16, 240]
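The conversion formulas above translate directly into code; a minimal Python sketch (without the output clamping):

```python
def rgb_to_yuv(r, g, b):
    """Convert 8-bit R, G, B values to Y, U, V using the coefficients above."""
    y = 0.299 * r + 0.587 * g + 0.114 * b
    u = (b - y) * 0.565
    v = (r - y) * 0.713
    return y, u, v

# Pure white carries full luminance and zero colour difference:
print(rgb_to_yuv(255, 255, 255))
```

Note that for a grey input (R = G = B) the weights sum to 1.0, so Y equals the input level and U = V = 0, which is exactly why B/W televisions could transmit Y alone.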
VIDEO / IMAGE COMPRESSION
VIDEO/IMAGE COMPRESSION
• How can we use fewer bits?
• To understand how image/audio/video signals are compressed to save storage and increase transmission efficiency
• Reduces signal size by taking advantage of correlation
– Spatial
– Temporal
– Spectral
VIDEO/IMAGE COMPRESSION…
• Compression Methods
• Need to take advantage of redundancy
– Images: space, frequency
– Video: space, frequency, time
[Figure: taxonomy of compression methods. Lossless (Spatial/Time-Domain): Statistical (Huffman, Arithmetic) and Universal (Lempel-Ziv). Lossy (Waveform-Based): Frequency-Domain, either Transform-Based (Fourier, DCT) or Filter-Based (Subband, Wavelet); and Model-Based (Linear Predictive, AutoRegressive, Polynomial Fitting)]
VIDEO/IMAGE COMPRESSION…
• Need to take advantage of redundancy
[Figure: coding pipeline. RGB is converted to YCbCr and divided into blocks and macroblocks; I, B, and P frames with motion compensation remove temporal redundancy; transform, quantization, and coding remove spatial redundancy, producing the bitstream 01100010101…]
VIDEO/IMAGE COMPRESSION…
• Spatial Redundancy
– Take advantage of similarity among most neighboring pixels
• RGB to YUV
– Less information required for YUV (humans less sensitive to chrominance)
• Macro Blocks
– Take groups of pixels (16x16)
• Discrete Cosine Transformation (DCT)
– Based on Fourier analysis, representing a signal as a sum of sines and cosines
– Concentrates the energy into a small fraction of (mostly low-frequency) coefficients
– Represents the pixels in a block with fewer numbers
• Quantization
– Reduce data required for coefficients
• Entropy coding
– Compress
VIDEO/IMAGE COMPRESSION…
• Spatial Redundancy Reduction
– Zig-zag scan, run-length coding
– Quantization
Major reduction
Controls ‘quality’
– The result is an “intra-frame encoded” picture
VIDEO/IMAGE COMPRESSION…
• When may spatial redundancy elimination be ineffective?
– High-resolution images and displays
– May appear ‘coarse’
• What kinds of images/movies?
– A varied image or ‘busy’ scene
– Many colors, few adjacent
[Figure: Original (63 kB), Low (7 kB), Very Low (4 kB); quality degrades due to loss of resolution]
Solution? Temporal redundancy reduction
VIDEO/IMAGE COMPRESSION…
• Temporal Redundancy Reduction
– Take advantage of similarity between successive frames
[Figure: consecutive frames 950, 951, and 952]
VIDEO/IMAGE COMPRESSION…
• When may temporal redundancy reduction be ineffective?
– Many scene changes (vs. few scene changes)
– Sometimes high motion
IMAGE CODING
IMAGE CODING
• Lossless Compression
• Lossy Compression
• Transform Coding
IMAGE CODING…
• Image compression system is composed of three key building blocks
– Representation
Concentrates important information into a few parameters
– Quantization
Discretizes parameters
– Binary encoding
Exploits non-uniform statistics of quantized parameters
Creates bitstream for transmission
IMAGE CODING…
• Generally, the only operation that is lossy is the quantization stage
• The fact that all the loss (distortion) is localized to a single operation
greatly simplifies system design
• Can design loss to exploit human visual system (HVS) properties
• Source decoder performs the inverse of each of the three operations
IMAGE CODING…
• Representations - Transform and Subband Filtering Methods
– Goal
Transform signal into another domain where most of the information (energy) is
concentrated into only a small fraction of the coefficients
– Enables perceptual processing
Exploiting HVS response to different frequency components
IMAGE CODING…
• Representations - Transform and Subband Filtering Methods
– Examples of “traditional” transforms
KLT, DFT, DCT
– Examples of “traditional” Subband filtering methods
Perfect reconstruction filter banks, wavelets
– Transform and Subband interpretations
All of the above are linear representations and can be interpreted from either a
transform or a Subband filtering viewpoint
– Transform viewpoint
Express signal as a linear combination of basis vectors
Stresses linear expansion (linear algebra) perspective
– Subband filtering viewpoint
Pass signal through a set of filters and examine the frequencies passed by
each filter (Subband)
Stresses filtering (signal processing) perspective
IMAGE CODING…
• Representations – Transform Image Coding
– A good transform provides
Most of the image energy is concentrated into a small fraction of the
coefficients
Coding only these small fraction of the coefficients and discarding the rest can
often lead to excellent reconstructed quality
The more energy compaction the better
– Orthogonal transforms are particularly useful
Energy in discarded coefficients is equal to energy in reconstruction error
IMAGE CODING…
• Representations – Transform Image Coding
– Karhunen-Loeve Transform (KLT)
Optimal energy compaction
Requires knowledge of signal covariance
In general, no simple computational algorithm
– Discrete Fourier Transform (DFT)
Fast algorithms
Good energy compaction, but not as good as DCT
– Discrete Cosine Transform (DCT)
Fast algorithms
Good energy compaction
All real coefficients
Overall good performance and widely used for image and video coding
IMAGE CODING…
• Discrete Cosine Transform (DCT)
– 1-D Discrete Cosine Transform (N-point)
– 1-D DCT basis vectors
– 2-D DCT: Separable transform of 1-D DCT
– 2-D DCT basis vectors?
Basis pictures!
– 2-D basis vectors for 2-D DCT are basis pictures!
– 64 basis pictures for 8x8-pixel 2-D DCT
– Image coding with the 2-D DCT is equivalent to approximating the image
as a linear combination of these basis pictures!
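The separability of the 2-D DCT can be sketched in a few lines of Python (NumPy assumed; this is the orthonormal DCT-II, built directly from its cosine definition rather than a library routine):

```python
import numpy as np

def dct_1d(x):
    """Orthonormal 1-D DCT-II of a length-N vector."""
    N = len(x)
    n = np.arange(N)
    # C[k, n] = cos(pi * (2n + 1) * k / (2N)), with the k = 0 row scaled
    C = np.cos(np.pi * (2 * n[None, :] + 1) * n[:, None] / (2 * N))
    C[0] *= 1 / np.sqrt(2)
    return np.sqrt(2 / N) * (C @ x)

def dct_2d(block):
    """2-D DCT as a separable transform: 1-D DCT on rows, then on columns."""
    return np.apply_along_axis(dct_1d, 0, np.apply_along_axis(dct_1d, 1, block))

# A flat 8x8 block compacts all of its energy into the single DC coefficient:
flat = np.full((8, 8), 100.0)
coeffs = dct_2d(flat)
print(round(coeffs[0, 0]))  # 800; every other coefficient is ~0
```

Because the transform is orthonormal, the energy in discarded coefficients equals the energy of the reconstruction error, as noted earlier.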
IMAGE CODING…
• Representations – Coding Transform Coefficients
– Selecting the basis pictures to approximate an image is equivalent to
selecting the DCT coefficients to code
– General methods of coding/discarding coefficients
Zonal Coding
▫ Code all coefficients in a zone and discard others
▫ Example zone: Spatial low frequencies
▫ Only need to code coefficient amplitudes
Threshold Coding
▫ Keep coefficients with magnitude above a threshold
▫ Coefficient amplitudes and locations must be coded
▫ Provides best performance
IMAGE CODING…
• Video / Image Coding is Block-based Coding
– Frames are divided into sub-blocks which are then coded
• Macroblock (MB) and Block Layer
– Process the data in blocks of 8x8 samples
– Convert Red-Green-Blue into Luminance (greyscale) and Chrominance (Blue color difference and Red color difference)
– Use half resolution for Chrominance (because the eye is more sensitive to greyscale than to color)
IMAGE CODING…
• Macroblock (MB) and Block Layer
– Macroblock
Consists of
One 16x16 luminance block
8x8 chrominance blocks (Cb and Cr)
Basic unit for motion estimation
– Block
8 pixels by 8 lines
Basic unit for DCT
IMAGE CODING…
• Lossless Compression
– General-Purpose Compression: Entropy Encoding
– Removes statistical redundancy from the data
– i.e., encode common values with short codes, uncommon values with longer codes
• Lossless Compression
– Huffman Coding
– Example: ABCCDEAAB
After compression: 1011000000001010111011
– Compression ratio
Depends on the probability with which the characters appear in the uncompressed data
Symbol frequencies: A:45 B:16 C:12 D:13 E:9 F:5
[Figure: Huffman tree built by repeatedly merging the two least-frequent nodes: 5+9=14, 12+13=25, 14+16=30, 25+30=55, 45+55=100]
Resulting codes: C=000, D=001, F=0100, E=0101, B=011, A=1
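As a cross-check, here is a minimal heap-based Huffman coder in Python. The 0/1 labelling of branches is an implementation detail, so the exact codewords may differ from the slide, but the code lengths match:

```python
import heapq

def huffman_codes(freqs):
    """Build a Huffman code table from a {symbol: frequency} dict."""
    # Heap entries: (frequency, unique tie-breaker, {symbol: partial code})
    heap = [(f, i, {s: ""}) for i, (s, f) in enumerate(sorted(freqs.items()))]
    heapq.heapify(heap)
    count = len(heap)
    while len(heap) > 1:
        f1, _, c1 = heapq.heappop(heap)   # two least-frequent subtrees
        f2, _, c2 = heapq.heappop(heap)
        merged = {s: "0" + c for s, c in c1.items()}
        merged.update({s: "1" + c for s, c in c2.items()})
        heapq.heappush(heap, (f1 + f2, count, merged))
        count += 1
    return heap[0][2]

codes = huffman_codes({"A": 45, "B": 16, "C": 12, "D": 13, "E": 9, "F": 5})
print(sorted((s, len(c)) for s, c in codes.items()))
# A gets a 1-bit code, B/C/D get 3-bit codes, E/F get 4-bit codes
```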
IMAGE CODING…
• Lossless Compression
– Run-Length Coding
Reduce the number of samples to code
Implementation is simple
Input Sequence
0,0,-3,5,1,0,-2,0,0,0,0,2,-4,3,-2,0,0,0,1,0,0,-2,EOB
Run-Length Sequence
(2,-3)(0,5)(0,1)(1,-2)(4,2)(0,-4)(0,3)(0,-2)(3,1)(2,-2)EOB
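The (run, level) pairs above can be reproduced with a short Python sketch:

```python
def run_length_encode(coeffs):
    """Encode a coefficient sequence as (zero-run, level) pairs plus EOB."""
    pairs, run = [], 0
    for c in coeffs:
        if c == 0:
            run += 1          # count zeros preceding the next non-zero level
        else:
            pairs.append((run, c))
            run = 0
    pairs.append("EOB")       # trailing zeros are implied by end-of-block
    return pairs

seq = [0, 0, -3, 5, 1, 0, -2, 0, 0, 0, 0, 2, -4, 3, -2, 0, 0, 0, 1, 0, 0, -2]
print(run_length_encode(seq))
# [(2, -3), (0, 5), (0, 1), (1, -2), (4, 2), (0, -4), (0, 3), (0, -2), (3, 1), (2, -2), 'EOB']
```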
IMAGE CODING…
• Lossless Compression
– Transform Coding
Basis vectors: { (1,0), (0,1) }
New basis vectors: { (1,1), (-1,1) }
Example: (0.4, 1.4) = 0.4·(1,0) + 1.4·(0,1) = 0.9·(1,1) + 0.5·(-1,1)
New coordinates in the new basis: (0.9, 0.5)
IMAGE CODING…
• Lossless Compression
– Transform Coding : DCT
Transform blocks of images to frequency domain, code only the significant transform coefficients
2D DCT
– Transform Coding : DCT
8x8 DCT Basis Function
IMAGE CODING…
• Lossless Compression
– Transform Coding : DCT
2D DCT Coefficients
IMAGE CODING…
• Lossy Compression
– Lossy Predictive Coding
IMAGE CODING…
• Lossy Compression
– Quantization
Many-to-one mapping
Quantization is the most important means of irrelevancy reduction
– Implementation
Lookup table
Divide by quantization step-size (round/truncate)
IMAGE CODING…
• Lossy Compression
– Divide by quantization step-size
Input signal: 0 1 2 3 4 5 6 7 (3 bits)
Step size: 2
Quantized: 0 0 1 1 2 2 3 3 (2 bits)
Inverse quantization: 0 0 2 2 4 4 6 6
Quantization errors: 0 1 0 1 0 1 0 1
– Lookup Table
Divide each DCT coefficient by an integer, discard the remainder
Result: loss of precision
Typically, only a few non-zero coefficients are left
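The step-size example above, as a Python sketch:

```python
def quantize(x, step):
    return x // step              # forward: divide and truncate

def dequantize(q, step):
    return q * step               # inverse: multiply back

signal = [0, 1, 2, 3, 4, 5, 6, 7]                 # 3-bit input
quantized = [quantize(x, 2) for x in signal]
recovered = [dequantize(q, 2) for q in quantized]
errors = [x - r for x, r in zip(signal, recovered)]
print(quantized)   # [0, 0, 1, 1, 2, 2, 3, 3]  (2 bits)
print(recovered)   # [0, 0, 2, 2, 4, 4, 6, 6]
print(errors)      # [0, 1, 0, 1, 0, 1, 0, 1]
```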
IMAGE CODING…
• Lossy Compression
– Zigzag Scan
Efficient encoding of the positions of non-zero transform coefficients
“Scan” the quantized coefficients in a zig-zag order
Non-zero coefficients tend to be grouped together
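A zig-zag scan order for an N x N block can be generated by sorting positions by anti-diagonal; a minimal Python sketch:

```python
def zigzag_order(n):
    """Return (row, col) positions of an n x n block in zig-zag scan order."""
    # Group positions by anti-diagonal (r + c), alternating the direction
    # of traversal on even and odd diagonals.
    return sorted(((r, c) for r in range(n) for c in range(n)),
                  key=lambda rc: (rc[0] + rc[1],
                                  rc[1] if (rc[0] + rc[1]) % 2 == 0 else rc[0]))

print(zigzag_order(3))
# [(0, 0), (0, 1), (1, 0), (2, 0), (1, 1), (0, 2), (1, 2), (2, 1), (2, 2)]
```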
IMAGE CODING…
• DCT + Quantization + Run-Level-Coding
VIDEO CODING
VIDEO CODING
• Lossless Compression
• Lossy Compression
• Transform Coding
• Motion Coding
VIDEO CODING…
• Video
– Sequence of frames (images) that are related
• Moving images contain significant temporal redundancy
– Successive frames are very similar
– Related along the temporal dimension - Temporal redundancy exists
VIDEO CODING…
• Video Coding
– The objective of video coding is to compress moving images
– Main addition over image compression: temporal redundancy
The video coder must exploit the temporal redundancy
– MPEG (Moving Picture Experts Group) and H.26x are the major standards for video coding
• Video coding algorithms usually contain two coding schemes:
– Intraframe coding
Intraframe coding does not exploit the correlation among adjacent frames
Intraframe coding is therefore similar to still image coding
– Interframe coding
Interframe coding includes a motion estimation/compensation process to remove temporal redundancy
• Basic Concept
– Use interframe correlation to attain better rate-distortion performance
VIDEO CODING…
• Usually high frame rate: significant temporal redundancy
• Possible representations along the temporal dimension
– Transform/Subband Methods
Good for the textbook case of constant-velocity uniform global motion
Inefficient for nonuniform motion, i.e., real-world motion
Requires a large number of frame stores
Leads to delay (memory cost may also be an issue)
– Predictive Methods
Good performance using only 2 frame stores
However, simple frame differencing is not enough
VIDEO CODING…
• Main addition over image compression
– Exploit the temporal redundancy
• Predict current frame based on previously coded frames
• Types of coded frames
– I-frame
Intra-coded frame, coded independently of all other frames
– P-frame
Predictively coded frame, coded based on previously coded frame
– B-frame
Bi-directionally predicted frame, coded based on both previous and future coded frames
VIDEO CODING…
• Motion-Compensated Prediction
– Simple frame differencing fails when there is motion
– Must account for motion
Motion-compensated (MC) prediction
– MC-prediction generally provides significant improvements
– Questions
How can we estimate motion?
How can we form MC-prediction?
• Motion Estimation
– Ideal Situation
Partition video into moving objects
Describe object motion
Generally very difficult
– Practical approach: Block-Matching Motion Estimation
Partition each frame into blocks
Describe the motion of each block
No object identification required
Good, robust performance
VIDEO CODING…
• Block-Matching Motion Estimation
– Assumptions
Translational motion within block
All pixels within each block have the same motion
– ME Algorithm
Divide current frame into non-overlapping N1xN2 blocks
For each block, find the best matching block in reference frame
– MC-Prediction Algorithm
Use best matching blocks of reference frame as prediction of blocks in current frame
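The ME algorithm above can be sketched in Python (NumPy assumed; SAD is used as the matching metric here, and the block size and search range are illustrative):

```python
import numpy as np

def best_match(cur_block, ref, top, left, search=4):
    """Full-search block matching: return the motion vector (dy, dx) whose
    reference block minimises the sum of absolute differences (SAD)."""
    n1, n2 = cur_block.shape
    best_mv, best_sad = (0, 0), float("inf")
    for dy in range(-search, search + 1):
        for dx in range(-search, search + 1):
            y, x = top + dy, left + dx
            # Skip candidates that would fall outside the reference frame
            if 0 <= y <= ref.shape[0] - n1 and 0 <= x <= ref.shape[1] - n2:
                sad = np.abs(ref[y:y+n1, x:x+n2] - cur_block).sum()
                if sad < best_sad:
                    best_mv, best_sad = (dy, dx), sad
    return best_mv, best_sad

# A block that moved by (1, 2) between frames is found exactly:
ref = np.arange(256, dtype=float).reshape(16, 16)
cur_block = ref[5:9, 6:10]          # the "current" block actually sits at (5, 6)
mv, sad = best_match(cur_block, ref, top=4, left=4)
print(mv, sad)  # (1, 2) 0.0
```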
VIDEO CODING…
• Block-Matching - Determining the Best Matching Block
– For each block in the current frame, search for the best matching block in the reference frame
Metrics for determining “best match”
Candidate blocks: all blocks in, e.g., a (±32, ±32) pixel area
Strategies for searching the candidate blocks for the best match
Full search: examine all candidate blocks
Partial (fast) search: examine a carefully selected subset
– Estimate of motion for the best matching block: “motion vector”
• Motion Vectors and Motion Vector Field
– Motion Vector
Expresses the relative horizontal and vertical offsets (mv1, mv2), or motion, of a given block from one frame to another
Each block has its own motion vector
– Motion Vector Field
Collection of motion vectors for all the blocks in a frame
VIDEO CODING…
• Example of Fast Search: 3-Step (Log) Search
– Goal: Reduce number of search points
Example:(± 7,±7) search area
Dots represent search points
Search performed in 3 steps (coarse-to-fine)
– Step 1: (± 4 pixels )
– Step 2: (± 2 pixels )
– Step 3: (± 1 pixels )
– Best match is found at each step
– Next step: Search is centered around the best match of prior step
– Speedup increases for larger search areas
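The coarse-to-fine refinement above can be sketched in Python (NumPy assumed; SAD as the matching metric, a (±7, ±7) search area, and blocks kept inside the frame are assumptions of this sketch):

```python
import numpy as np

def sad(cur, ref, top, left, dy, dx):
    """SAD between the current block and the reference block at offset (dy, dx)."""
    n1, n2 = cur.shape
    return np.abs(ref[top+dy:top+dy+n1, left+dx:left+dx+n2] - cur).sum()

def three_step_search(cur, ref, top, left):
    """3-step (log) search: refine the motion vector with steps 4, 2, then 1."""
    mv = (0, 0)
    for step in (4, 2, 1):
        candidates = [(mv[0] + sy * step, mv[1] + sx * step)
                      for sy in (-1, 0, 1) for sx in (-1, 0, 1)]
        # Keep only candidates whose reference block lies inside the frame
        valid = [(dy, dx) for dy, dx in candidates
                 if 0 <= top + dy <= ref.shape[0] - cur.shape[0]
                 and 0 <= left + dx <= ref.shape[1] - cur.shape[1]]
        mv = min(valid, key=lambda d: sad(cur, ref, top, left, *d))
    return mv

ref = np.arange(1024, dtype=float).reshape(32, 32)
cur = ref[12:20, 9:17]              # block actually located at offset (4, 1)
print(three_step_search(cur, ref, top=8, left=8))  # (4, 1)
```

Note the speedup: 8 + 8 + 9 = 25 SAD evaluations instead of the 225 a full (±7, ±7) search would need.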
VIDEO CODING…
• Motion Vector Precision
– Motivation
Motion is not limited to integer-pixel offsets
However, video only known at discrete pixel locations
To estimate sub-pixel motion, frames must be spatially interpolated
– Fractional MVs are used to represent the sub-pixel motion
– Improved performance (extra complexity is worthwhile)
– Half-pixel ME used in most standards: MPEG-1/2/4
– Why are half-pixel motion vectors better?
Can capture half-pixel motion
Averaging effect (from spatial interpolation) reduces prediction error -> Improved prediction
For noisy sequences, averaging effect reduces noise -> Improved compression
VIDEO CODING…
• Practical Half-Pixel Motion Estimation Algorithm
– Half-Pixel ME (coarse-fine) Algorithm
Coarse step: perform integer motion estimation on blocks; find the best integer-pixel MV
Fine step: refine the estimate to find the best half-pixel MV
Spatially interpolate the selected region in the reference frame
Compare the current block to the interpolated reference frame block
Choose the integer or half-pixel offset that provides the best match
Typically, bilinear interpolation is used for the spatial interpolation
• Example: MC-Prediction for Two Consecutive Frames
VIDEO CODING…
• Bi-Directional MC-Prediction
– Bi-Directional MC-Prediction is used to estimate a block in the current frame from a block in
Previous frame
Future frame
Average of a block from the previous frame and a block from the future frame
– Motion compensated prediction
Predict the current frame based on reference frame(s) while compensating for the motion
– Examples of block-based motion-compensated prediction (P-frame) and bi-directional prediction (B-frame)
VIDEO CODING…
• Motion Estimation and Compensation
– The amount of data to be coded can be reduced significantly if the previous frame is subtracted from the current frame
VIDEO CODING…
• Motion Estimation and Compensation
– Uses Block-Matching
The MPEG and H.26X standards use block-matching technique for motion estimation /compensation
In the block-matching technique, each current frame is divided into equal-size blocks, called source blocks
Each source block is associated with a search region in the reference frame
The objective of block-matching is to find a candidate block in the search region best matched to the source block
The relative distances between a source block and its candidate blocks are called motion vectors
[Figure: block-matching over a video sequence. X: source block in the current frame; Bx: search area associated with X in the reconstructed reference frame; MV: motion vector]
VIDEO CODING…
• Motion Estimation and Compensation
– Uses Block-Matching
VIDEO CODING…
• Motion Estimation and Compensation
[Figure: the reconstructed previous frame and the current frame; results of block-matching; the predicted current frame]
VIDEO CODING…
• Motion Estimation and Compensation
– Search Range
The size of the search range = (N1 + 2·dx,max)(N2 + 2·dy,max)
The number of candidate blocks = (2·dx,max + 1)(2·dy,max + 1)
VIDEO CODING…
• Motion Estimation and Compensation
– Motion Vector and Search Area
For an n x n block with maximum displacement p, the search area is (n + 2p) x (n + 2p)
Motion vector: (u, v)
VIDEO CODING…
• Motion Estimation and Compensation
– Matching Function
Mean square error (MSE)
Mean absolute difference (MAD)
Number of threshold differences (NTD)
Normalized cross-correlation function (NCF)
MSE(d1, d2) = (1 / (N1·N2)) · Σ(n1=1..N1) Σ(n2=1..N2) [f(n1, n2, t) - f(n1 - d1, n2 - d2, t - 1)]^2
MAD(d1, d2) = (1 / (N1·N2)) · Σ(n1=1..N1) Σ(n2=1..N2) |f(n1, n2, t) - f(n1 - d1, n2 - d2, t - 1)|
VIDEO CODING…
• Motion Estimation and Compensation
– Algorithm
Full search block matching (FSB)
Fast algorithm
▫ 2D Logarithmic Search (TDL)
▫ Three Step Search (TSS)
▫ Cross-Search Algorithm (CSA)
▫ …
– Full Search Algorithm
If p = 7, then there are (2p+1)(2p+1) = 225 candidate blocks
[Figure: search area with a candidate block at offset (u, v)]
VIDEO CODING…
• Motion Estimation and Compensation
– Full Search Algorithm
Intensive computation
Need for fast Motion Estimation !
VIDEO CODING…
• Motion Estimation and Compensation
– 2D Logarithmic Search
Diamond-shaped search area
Matching function: MSE
[Figure: (±7, ±7) search area with the search converging to the MV]
VIDEO CODING…
• Motion Estimation and Compensation
– Three-Step Search
The first step involves block-matching at 4-pel resolution at nine locations
The second step involves block-matching at 2-pel resolution around the location determined by the first step
The third step repeats the process of the second step (but with 1-pel resolution)
[Figure: (±7, ±7) search area showing the step-1, step-2, and step-3 search points]
VIDEO CODING…
• Motion Estimation and Compensation
– Motion Vector Prediction
predMVx = Median(MV1x, MV2x, MV3x)
predMVy = Median(MV1y, MV2y, MV3y)
MVx' = MVx - predMVx
MVy' = MVy - predMVy
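The median predictor above, as a Python sketch (MV1..MV3 stand for the neighbouring blocks' motion vectors):

```python
def median3(a, b, c):
    return sorted((a, b, c))[1]

def predict_mv(mv1, mv2, mv3):
    """Component-wise median of three neighbouring motion vectors."""
    return (median3(mv1[0], mv2[0], mv3[0]),
            median3(mv1[1], mv2[1], mv3[1]))

def mv_residual(mv, neighbours):
    """Difference between the actual MV and its median prediction;
    this small residual is what gets entropy-coded."""
    pred = predict_mv(*neighbours)
    return (mv[0] - pred[0], mv[1] - pred[1])

print(predict_mv((2, -1), (3, 0), (-5, 4)))             # (2, 0)
print(mv_residual((3, 1), [(2, -1), (3, 0), (-5, 4)]))  # (1, 1)
```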
VIDEO CODER ARCHITECTURE
VIDEO CODER ARCHITECTURE
• Image / Video Coding Based on Block-Matching
– Assume frame f-1 has been encoded and reconstructed, and frame f is the current frame to be encoded
• Exploiting the redundancies
– Temporal
MC-Prediction (P and B frames)
– Spatial
Block DCT
– Color
Color Space Conversion
• Scalar quantization of DCT coefficients
• Zigzag scanning, runlength and Huffman coding of the nonzero quantized DCT coefficients
VIDEO CODER ARCHITECTURE…
• Video Encoder
– Divide frame f into equal-size blocks
– For each source block,
Find its motion vector using the block-matching algorithm based on the reconstructed frame f-1
Compute the DFD (displaced frame difference) of the block
– Transmit the motion vector of each block to decoder
– Compress DFD’s of each block
– Transmit the encoded DFD’s to decoder
VIDEO CODER ARCHITECTURE…
• Video Encoder
VIDEO CODER ARCHITECTURE…
• Video Decoder
– Receive the motion vector of each block from the encoder
– Based on the motion vector, find the best-matching block from the reference frame
i.e., find the predicted current frame from the reference frame
– Receive the encoded DFD of each block from the encoder
– Decode the DFD
– Each reconstructed block in the current frame = its decompressed DFD + the best-matching block
VIDEO CODER ARCHITECTURE…
• Video Decoder
VIDEO CODEC STANDARDS
VIDEO CODEC STANDARDS
• Goal of Standards
– Ensuring Interoperability
Enabling communication between devices made by different manufacturers
– Promoting a technology or industry
– Reducing costs
VIDEO CODEC STANDARDS…
What do the Standards Specify?
• Not the encoder
• Not the decoder
• Just the bitstream syntax and the decoding process (e.g., use the IDCT, but not how to implement the IDCT)
– Enables improved encoding and decoding strategies to be employed in a standard-compatible manner
VIDEO CODEC STANDARDS…
• The Scope of Picture and Video Coding Standardization
– Only the Syntax and Decoder are standardized:
Permits optimization beyond the obvious
Permits complexity reduction for implementability
Provides no guarantees of Quality
[Figure: Source → Pre-Processing → Encoding → Decoding → Post-Processing & Error Recovery → Destination; the scope of the standard covers only the bitstream syntax and the decoding process]
VIDEO CODEC STANDARDS…
VIDEO CODEC STANDARDS…
• Based on the same fundamental building blocks
– Motion-compensated prediction (I, P, and B frames)
– 2-D Discrete Cosine Transform (DCT)
– Color space conversion
– Scalar quantization, runlengths, Huffman coding
• Additional tools added for different applications:
– Progressive or interlaced video
– Improved compression, error resilience, scalability, etc.
• MPEG-1/2/4, H.261/3/4
– Frame-based coding
• MPEG-4
– Object-based coding and Synthetic video
VIDEO CODEC STANDARDS…
• The video standards use all three types of frames, as shown below
Encoding order: I0, P3, B1, B2, P6, B4, B5, I9, B7, B8
Playback order: I0, B1, B2, P3, B4, B5, P6, B7, B8, I9
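The reordering can be computed mechanically: a B frame must come after both of its reference frames in the bitstream. A Python sketch under that assumption:

```python
def encoding_order(display):
    """Reorder display-order frames so every B frame follows the anchor
    (nearest non-B) frames it depends on."""
    out, pending_b = [], []
    for frame in display:
        if frame.startswith("B"):
            pending_b.append(frame)      # hold until the next anchor is sent
        else:
            out.append(frame)            # I or P frame: both references now sent
            out.extend(pending_b)        # the held B frames are now decodable
            pending_b = []
    return out + pending_b

display = ["I0", "B1", "B2", "P3", "B4", "B5", "P6", "B7", "B8", "I9"]
print(encoding_order(display))
# ['I0', 'P3', 'B1', 'B2', 'P6', 'B4', 'B5', 'I9', 'B7', 'B8']
```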
VIDEO CODEC STANDARDS…
• Video Structure
– Video standards code video sequences in hierarchy of layers
– There are usually 5 Layers
GOP (Group of Pictures)
Picture
Slice
Macroblock
Block
VIDEO CODEC STANDARDS…
• Video Structure
– A GOP usually starts with an I frame, followed by a sequence of P and B frames
– A Picture is a frame in the video sequence
– A Slice is a portion of a picture
Some standards do not have slices
Some view a slice as a row
A slice in H.264 need not be a row
It can be any shape containing an integral number of macroblocks
– A Macroblock is a 16×16 block
Many standards use macroblocks as the basic unit for block-matching operations
– A Block is an 8×8 block
Many standards use blocks as the basic unit for the DCT
VIDEO CODEC STANDARDS…
• Scalable Video Coding
– Three classes of scalable video coding techniques
Temporal Scalability
Spatial Scalability
SNR Scalability
– Uses B frames for attaining temporal scalability
B frames depend on other frames
No other frames depend on B frames
Discard B frames without affecting other frames
VIDEO CODEC STANDARDS…
• Scalable Video Coding – Spatial Scalability
– Basically resolution scalability
Here the base layer is the low-resolution version of the video sequence
– The base layer uses a coarser quantizer for DFD coding
– The residuals in the base layer are refined in the enhancement layer
VIDEO CODEC STANDARDS…
HEVC
HEVC
• Video Coding Standards Overview
Next Generation Broadcasting
HEVC…
• MPEG-H
– High Efficiency Coding and Media Delivery in Heterogeneous Environments: a new suite of standards providing technical solutions for emerging challenges in multimedia industries
– Part 1: Systems, MPEG Media Transport (MMT)
Integrated services with multiple components in a hybrid delivery environment, providing support for seamless and efficient use of heterogeneous network environments, including broadcast, multicast, storage media and mobile networks
– Part 2: Video, High Efficiency Video Coding (HEVC)
Highly immersive visual experiences, with ultra-high-definition displays that give no perceptible pixel structure even if viewed from such a short distance that they subtend a large viewing angle (up to 55 degrees horizontally for 4Kx2K resolution displays, up to 100 degrees for 8Kx4K)
– Part 3: Audio, 3D Audio
Highly immersive audio experiences in which the decoding device renders a 3D audio scene. This may use 10.2 or 22.2 channel configurations, or much more limited speaker configurations or headphones, such as those found in a personal tablet or smartphone
HEVC…
• Transport/System Layer Integration
– Ongoing definitions (MPEG, IETF, …, DVB): benefit from H.264/AVC
– MPEG Media Transport (MMT)?
117
HEVC…
• HEVC = High Efficiency Video Coding
• Joint project between ISO/IEC/MPEG and ITU-T/VCEG
– ISO/IEC: MPEG-H Part 2 (23008-2)
– ITU-T: H.265
• JCT-VC committee
– Joint Collaborative Team on Video Coding
– Co-chairs: Dr. Gary Sullivan (Microsoft, USA) and Dr. Jens-Rainer Ohm (RWTH Aachen, Germany)
• Target
– Roughly half the bit-rate at the same subjective quality compared to H.264/AVC (a 50% bit-rate reduction)
– At most ~10× encoder complexity and ~2–3× decoder complexity relative to H.264/AVC
• Requirements
– Progressive required for all profiles and levels
Interlaced support using field SEI message
– Video resolution: sub-QVGA to 8K×4K, with more focus on higher-resolution video content (1080p and up)
– Color space and chroma sampling: YUV420, YUV422, YUV444, RGB444
– Bit-depth: 8-14 bits
– Parallel Processing Architecture
118
HEVC…
• H.264 Vs H.265
119
HEVC…
• Potential applications
– Existing applications and usage scenarios
IPTV over DSL: large shift in IPTV eligibility
Facilitated deployment of OTT and multi-screen services
More customers on the same infrastructure: most IP traffic is video
More archiving facilities
– New applications and usage scenarios
1080p60/50 with bitrates comparable to 1080i
Immersive viewing experience: Ultra-HD (4K, 8K)
Premium services (sports, live music, live events, …): home theater, bars/venues, mobile
HD 3DTV: full frame per view at today’s HD delivery rates
What becomes possible with 50% video rate reduction?
120
HEVC…
• Tentative Timeline
121
HEVC…
• History
122
HEVC…
• H.264 Vs H.265
123
HEVC…
• H.264 Vs H.265
124
HEVC…
• HEVC Encoder
125
HEVC…
• HEVC Decoder
126
HEVC…
• Video Coding Techniques : Block-based hybrid video coding
– Interpicture prediction
Temporal statistical dependences
– Intrapicture prediction
Spatial statistical dependences
– Transform coding
Spatial statistical dependences
• Uses YCbCr color space with 4:2:0 subsampling
– Y component
Luminance (luma)
Represents brightness (gray level)
– Cb and Cr components
Chrominance (chroma).
Color difference from gray toward blue and red
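The luma/chroma split can be made concrete with the common BT.601 full-range conversion and 2×2 chroma averaging. This is a sketch: HEVC consumes YCbCr but does not itself mandate these particular coefficients (BT.709 or limited-range variants are equally common).

```python
def rgb_to_ycbcr(r, g, b):
    """Full-range BT.601 RGB -> YCbCr for one pixel (values 0..255)."""
    y  =  0.299 * r + 0.587 * g + 0.114 * b
    cb = -0.1687 * r - 0.3313 * g + 0.5 * b + 128   # blue difference
    cr =  0.5 * r - 0.4187 * g - 0.0813 * b + 128   # red difference
    return y, cb, cr

def subsample_420(plane):
    """4:2:0 subsampling: average each 2x2 block of a chroma plane."""
    h, w = len(plane), len(plane[0])
    return [[(plane[y][x] + plane[y][x + 1] +
              plane[y + 1][x] + plane[y + 1][x + 1]) / 4
             for x in range(0, w, 2)]
            for y in range(0, h, 2)]

# A pure-gray pixel has both chroma components at the midpoint (128):
print(rgb_to_ycbcr(128, 128, 128))   # ≈ (128.0, 128.0, 128.0)
```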
127
HEVC…
• Video Coding Techniques : Block-based hybrid video coding
– Motion compensation
Quarter-sample precision is used for the MVs
7-tap or 8-tap filters are used for interpolation of fractional-sample positions
– Intrapicture prediction
33 directional modes, planar (surface fitting), DC (flat)
Modes are encoded by deriving most probable modes (MPMs) based on those of previously decoded neighboring PBs
– Quantization control
Uniform reconstruction quantization (URQ)
– Entropy coding
Context adaptive binary arithmetic coding (CABAC)
– In-Loop deblocking filtering
Similar to the one in H.264, but more friendly to parallel processing
– Sample adaptive offset (SAO)
Nonlinear amplitude mapping
For better reconstruction of amplitude by histogram analysis
128
HEVC…
• Coding Tree Unit (CTU) - A picture is partitioned into CTUs
– The CTU is the basic processing unit instead of Macro Blocks (MB)
– Contains luma CTBs and chroma CTBs
A luma CTB covers L × L samples
Two chroma CTBs each cover L/2 × L/2 samples
– HEVC supports variable-size CTBs
The value of L may be equal to 16, 32, or 64
Selected according to the needs of encoders, in terms of memory and computational requirements
A large CTB is beneficial when encoding high-resolution video content
– CTBs can be used as CBs or can be partitioned into multiple CBs using a quadtree structure
– The quadtree splitting process can be iterated until the size of a luma CB reaches the minimum allowed luma CB size (8 × 8 or larger)
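The recursive quadtree partitioning can be sketched with a toy split criterion. Here sample variance stands in for the rate-distortion cost a real encoder would use, so the threshold and function name are illustrative only:

```python
def split_ctb(block, x, y, size, min_size=8, thresh=100.0):
    """Recursively split an L x L CTB into leaf CBs (quadtree).

    Returns a list of (x, y, size) coding blocks. The split decision
    is a toy variance threshold, not a real RD-cost comparison.
    """
    samples = [block[y + j][x + i] for j in range(size) for i in range(size)]
    mean = sum(samples) / len(samples)
    var = sum((s - mean) ** 2 for s in samples) / len(samples)
    if size <= min_size or var <= thresh:
        return [(x, y, size)]          # leaf CB: flat enough, or minimum size
    half = size // 2                   # otherwise split into 4 quadrants
    return (split_ctb(block, x,        y,        half, min_size, thresh) +
            split_ctb(block, x + half, y,        half, min_size, thresh) +
            split_ctb(block, x,        y + half, half, min_size, thresh) +
            split_ctb(block, x + half, y + half, half, min_size, thresh))

flat = [[10] * 16 for _ in range(16)]   # uniform area: kept as one CB
print(split_ctb(flat, 0, 0, 16))        # [(0, 0, 16)]
```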
129
HEVC…
• Block Structure
– Coding Tree Units (CTU)
Corresponds to macroblocks in earlier coding standards (H.264, MPEG-2, etc.)
Luma and chroma Coding Tree Blocks (CTB)
Quadtree structure to split into Coding Units (CUs)
16x16, 32x32, or 64x64, signaled in SPS
130
HEVC…
• A new framework composed of three new concepts
– Coding Units (CU)
– Prediction Units (PU)
– Transform Units (TU)
• The decision whether to code a picture area using inter or intra prediction is made at the CU level
Goal: to be as flexible as possible and to adapt the compression/prediction to image peculiarities
131
HEVC…
• Block Structure
– Coding Units (CU)
Luma and chroma Coding Blocks (CB)
Rooted in CTU
Intra or inter coding mode
Split into Prediction Units (PUs) and Transform Units (TUs)
132
HEVC…
• Block Structure
– Prediction Units (PU)
Luma and chroma Prediction Blocks (PB)
Rooted in CU
Partition and motion info
133
HEVC…
• Block Structure
– Transform Units (TU)
Rooted in CU
4x4, 8x8, 16x16, 32x32 DCT, and 4x4 DST
134
HEVC…
• Relationship of CU, PU and TU
135
HEVC…
• Intra Prediction
– 35 intra modes: 33 directional modes + DC + planar
– For chroma, 5 intra modes: DC, planar, vertical, horizontal, and luma-derived
– Planar prediction (Intra_Planar)
Amplitude surface with a horizontal and vertical slope derived from the boundaries
– DC prediction (Intra_DC)
Flat surface with a value matching the mean value of the boundary samples
– Directional prediction (Intra_Angular)
33 different prediction directions are defined for square TB sizes from 4×4 up to 32×32
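The DC mode, for instance, just fills the block with the mean of the reconstructed boundary samples. A minimal sketch (ignoring HEVC's exact rounding, availability checks, and boundary smoothing):

```python
def intra_dc_predict(left, top, size):
    """Intra_DC: flat prediction from the mean of boundary samples.

    left, top: reconstructed neighboring column/row (size samples each).
    Returns a size x size predicted block filled with their mean.
    """
    dc = (sum(left) + sum(top)) // (2 * size)   # integer mean of the boundary
    return [[dc] * size for _ in range(size)]

pred = intra_dc_predict(left=[100] * 4, top=[104] * 4, size=4)
print(pred[0])   # [102, 102, 102, 102]
```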
136
HEVC…
• Intra Prediction
– Adaptive reference sample filtering
3-tap filter: [1 2 1]/4
Not performed for 4x4 blocks
For larger than 4x4 blocks, adaptively performed for a subset of modes
Modes except vertical/near-vertical, horizontal/near-horizontal, and DC
– Mode dependent adaptive scanning
4x4 and 8x8 intra blocks only
All other blocks use only diagonal upright scan (left-most scan pattern)
137
HEVC…
• Intra Prediction
– Boundary smoothing
Applied to DC, vertical, and horizontal modes, luma only
Reduces boundary discontinuity
– For DC mode, 1st column and row of samples in predicted block are filtered
– For Hor/Ver mode, first column/row of pixels in predicted block are filtered
138
HEVC…
• Inter Prediction
– Fractional sample interpolation
¼ pixel precision for luma
– DCT based interpolation filters
8-/7- tap for luma
4-tap for chroma
Supports 16-bit implementation with non-normative shift
– High precision interpolation and biprediction
– DCT-IF design
Forward DCT, followed by inverse DCT
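The half-sample luma position, for example, uses the symmetric 8-tap filter with coefficients commonly cited for HEVC, [-1, 4, -11, 40, 40, -11, 4, -1] with a gain of 64. The rounding below is simplified relative to the spec's intermediate-precision pipeline:

```python
# 8-tap luma interpolation filter for the half-sample position (gain 64).
HALF_PEL = [-1, 4, -11, 40, 40, -11, 4, -1]

def interp_half(samples, i):
    """Half-pel value between samples[i] and samples[i+1].

    Needs 3 samples of context on the left and 4 on the right.
    The +32 and >>6 round and normalize by the filter gain of 64.
    """
    acc = sum(c * samples[i - 3 + k] for k, c in enumerate(HALF_PEL))
    return (acc + 32) >> 6

row = [80] * 8                 # flat signal: interpolation reproduces it
print(interp_half(row, 3))     # 80
```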
139
HEVC…
• Inter Prediction
– Asymmetric Motion Partition (AMP) for Inter PU
– Merge
Derive motion (MV and ref pic) from spatial and temporal neighbors
Which spatial/temporal neighbor is used is identified by merge_idx
Number of merge candidates (≤ 5) signaled in slice header
Skip mode = merge mode + no residual
– Advanced Motion Vector Prediction (AMVP)
Use spatial/temporal PUs to predict current MV
140
HEVC…
• Transforms
– Core transforms: DCT based
4x4, 8x8, 16x16, and 32x32
Square transforms only
Support partial factorization
Near-orthogonal
Nested transforms
– Alternative 4x4 DST
4x4 intra blocks, luma only
– Transform skipping mode
By-pass the transform stage
Most effective on “screen content”
4x4 TBs only
141
HEVC…
• Scaling and Quantization
– HEVC uses a uniform reconstruction quantization (URQ) scheme controlled by a quantization parameter (QP).
– The range of the QP values is defined from 0 to 51
142
HEVC…
• Entropy Coding
– One entropy coder, CABAC
Reuse H.264 CABAC core algorithm
More friendly to software and hardware implementations
Easier to parallelize, reduced HW area, increased throughput
– Context modeling
Reduced # of contexts
Increased use of by-pass bins
Reduced data dependency
– Coefficient coding
Adaptive coefficient scanning for intra 4x4 and 8x8
▫ Diagonal upright, horizontal, vertical
Processed in 4x4 blocks for all TU sizes
Sign data hiding:
▫ Sign of the first non-zero coefficient conditionally hidden in the parity of the sum of the non-zero coefficient magnitudes
▫ Conditions: 2 or more non-zero coefficients, and “distance” between first and last coefficient > 3
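The decoder infers the hidden sign from that parity, so the encoder must adjust one coefficient if the parity does not match. A decoder-side sketch of the inference (the condition checks above are omitted):

```python
def infer_hidden_sign(magnitudes):
    """Sign data hiding, decoder side (simplified).

    The sign of the first non-zero coefficient is not transmitted;
    it is inferred from the parity of the sum of the non-zero
    coefficient magnitudes: even -> positive, odd -> negative.
    """
    total = sum(m for m in magnitudes if m != 0)
    return 1 if total % 2 == 0 else -1

print(infer_hidden_sign([3, 0, 1, 2]))   # 1  (sum 6 is even)
print(infer_hidden_sign([3, 0, 2, 2]))   # -1 (sum 7 is odd)
```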
143
HEVC…
• Entropy Coding – CABAC
– Binarization: CABAC uses binary arithmetic coding, which means that only binary decisions (1 or 0) are encoded. A non-binary-valued symbol (e.g. a transform coefficient or motion vector) is “binarized”, or converted into a binary code, prior to arithmetic coding. This process is similar to converting a data symbol into a variable-length code, but the binary code is further encoded (by the arithmetic coder) prior to transmission.
– The following stages are repeated for each bit (or “bin”) of the binarized symbol.
– Context model selection: A “context model” is a probability model for one or more bins of the binarized symbol. This model may be chosen from a selection of available models depending on the statistics of recently coded data symbols. The context model stores the probability of each bin being “1” or “0”.
– Arithmetic encoding: An arithmetic coder encodes each bin according to the selected probability model. Note that there are just two sub-ranges for each bin (corresponding to “0” and “1”).
– Probability update: The selected context model is updated based on the actual coded value (e.g. if the bin value was “1”, the frequency count of “1”s is increased)
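The context-model update step can be sketched as an adaptive probability estimate. Real CABAC uses a finite-state machine over 64 probability states; the count-based model below is a deliberately simplified stand-in:

```python
class ContextModel:
    """Toy adaptive binary probability model.

    Count-based with Laplace smoothing, NOT the 64-state transition
    table actual CABAC uses, but it shows the adaptation idea.
    """

    def __init__(self):
        self.ones = 1     # smoothed count of '1' bins seen
        self.total = 2    # smoothed total bins seen

    def p_one(self):
        """Current estimated probability that the next bin is '1'."""
        return self.ones / self.total

    def update(self, bin_value):
        """Probability update after coding one bin (0 or 1)."""
        self.ones += bin_value
        self.total += 1

ctx = ContextModel()
for b in [1, 1, 1, 0, 1]:       # mostly-1 bins shift the model toward 1
    ctx.update(b)
print(round(ctx.p_one(), 2))    # 0.71
```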
144
HEVC…
• Parallel Processing Tools
– Slices
– Tiles
– Wavefront parallel processing (WPP)
– Dependent Slices
• Slices
– Slices are a sequence of CTUs that are processed in the order of a raster scan. Slices are self-contained and independent
– Each slice is encapsulated in a separate packet
145
HEVC…
• Tile
– Self-contained and independently decodable rectangular regions
– Tiles provide parallelism at a coarse level of granularity
Using more tiles than cores is not efficient; tile boundaries break dependencies
146
HEVC…
• WPP
– A slice is divided into rows of CTUs, enabling parallel processing of rows
– The decoding of each row can begin as soon as a few decisions have been made in the preceding row for the adaptation of the entropy coder
– Better compression than tiles; parallel processing at a fine level of granularity
WPP cannot be combined with tiles!
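Under WPP the CTU at (row, col) can start once its left neighbor and the upper-right neighbor at (row−1, col+1) are done, which carries the adapted entropy-coder state down and produces the characteristic 2-CTU stagger between rows. A sketch of the earliest start times (one time step per CTU, one core per row, an idealized assumption):

```python
def wpp_schedule(rows, cols):
    """Earliest time step each CTU can be processed under WPP.

    CTU (r, c) waits for (r, c-1) in its own row and for (r-1, c+1)
    in the row above. Each CTU takes one time step on its own core.
    """
    t = [[0] * cols for _ in range(rows)]
    for r in range(rows):
        for c in range(cols):
            left = t[r][c - 1] if c > 0 else -1
            up_right = t[r - 1][c + 1] if r > 0 and c + 1 < cols else -1
            t[r][c] = max(left, up_right) + 1
    return t

for row in wpp_schedule(3, 5):
    print(row)
# [0, 1, 2, 3, 4]
# [2, 3, 4, 5, 6]   <- each row starts 2 CTUs behind the previous one
# [4, 5, 6, 7, 8]
```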
147
HEVC…
• Dependent Slices
– Carried in separate NAL units but dependent (can only be decoded after part of the previous slice)
– Dependent slices are mainly useful for ultra-low-delay applications, e.g. remote surgery
– Error resiliency gets worse
– Low delay
– Good efficiency; goes well with WPP
148
HEVC…
• Slice Vs Tile
– Tiles are a kind of zero-overhead slice
A slice header is sent for every slice, but tile information is sent once per sequence
Slices have packet headers too
Each tile can contain a number of slices and vice versa
– Slices are for :
Controlling packet sizes
Error resiliency
– Tiles are for:
Controlling parallelism (multiple core architecture)
Defining ROI regions
149
HEVC…
• Tile Vs WPP
– WPP
Better compression than tiles
Parallel processing at a fine level of granularity
But …
Needs frequent communication between processing units
With a high number of cores, full utilization cannot be achieved
– Good for when
Relatively small number of nodes
Good inter core communication
No need to match to MTU size
Big enough shared cache
150
HEVC…
• In-Loop Filters
– Two processing steps, a deblocking filter (DBF) followed by a sample adaptive offset (SAO) filter, are applied to the reconstructed samples
The DBF is intended to reduce the blocking artifacts due to block-based coding
The DBF is only applied to the samples located at block boundaries
The SAO filter is applied adaptively to all samples satisfying certain conditions, e.g. based on gradient
151
HEVC…
• Loop Filters: Deblocking
– Applied to all samples adjacent to a PU or TU boundary
Except when the boundary is also a picture boundary, or when deblocking is disabled across slice or tile boundaries
– HEVC only applies the deblocking filter to edges that are aligned on an 8×8 sample grid
This restriction reduces the worst-case computational complexity without noticeable degradation of the visual quality
It also improves parallel-processing operation
– The processing order of the deblocking filter is defined as horizontal filtering of vertical edges for the entire picture first, followed by vertical filtering of horizontal edges
152
HEVC…
• Loop Filters: Deblocking
– Simpler deblocking filter in HEVC (vs. H.264)
– Deblocking filter boundary strength is set according to:
Block coding mode
Existence of non-zero coefficients
Motion vector difference
Reference picture difference
153
HEVC…
• Loop Filters: SAO
– A process that modifies the decoded samples by conditionally adding an offset value to each sample after the application of the deblocking filter, based on values in look-up tables transmitted by the encoder
– SAO: Sample Adaptive Offsets
New loop filter in HEVC
Non-linear filter
– For each CTB, signal the SAO type and parameters
– The encoder decides the SAO type and estimates the SAO parameters (rate-distortion optimization)
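Band-offset SAO, for example, splits the amplitude range into 32 equal bands and adds a transmitted offset to samples falling in four consecutive bands. A simplified sketch (CTB-level signaling and the edge-offset mode are omitted):

```python
def sao_band_offset(samples, start_band, offsets, bit_depth=8):
    """SAO band offset (simplified).

    The 0..2^bit_depth - 1 range is split into 32 equal bands; samples
    whose band lies in [start_band, start_band + 4) get the matching
    entry of offsets added, with clipping to the valid sample range.
    """
    shift = bit_depth - 5                  # 8-bit: band = sample >> 3
    max_val = (1 << bit_depth) - 1
    out = []
    for s in samples:
        band = s >> shift
        if start_band <= band < start_band + 4:
            s = min(max_val, max(0, s + offsets[band - start_band]))
        out.append(s)
    return out

# Samples in bands 8..11 (values 64..95) are offset; others pass through.
print(sao_band_offset([60, 70, 90, 130], 8, [2, 2, 3, 3]))
# [60, 72, 93, 130]
```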
154
HEVC…
• Special Coding
– I_PCM mode
The prediction, transform, quantization and entropy coding are bypassed
The samples are directly represented by a pre-defined number of bits
Main purpose is to avoid excessive consumption of bits when the signal characteristics are extremely unusual and cannot be properly handled by hybrid coding
– Lossless mode
The transform, quantization, and other processing that affects the decoded picture are bypassed
The residual signal from inter- or intra-picture prediction is directly fed into the entropy coder
It allows mathematically lossless reconstruction
SAO and deblocking filtering are not applied to these regions
– Transform skipping mode
Only the transform is bypassed
Improves compression for certain types of video content, such as computer-generated images or graphics mixed with camera-view content
Can be applied to TBs of 4×4 size only
155
HEVC…
• High Level Parallelism
– Slices
Independently decodable packets
Sequence of CTUs in raster scan
Error resilience
Parallelization
– Tiles
Independently decodable (re-entry)
Rectangular region of CTUs
Parallelization (esp. encoder)
One slice can contain multiple tiles, or one tile multiple slices
– WPP
Rows of CTUs
Decoding of each row can be parallelized
A shaded CTU can start when the gray CTUs in the row above are finished
– The Main profile does not allow the tiles + WPP combination
156
HEVC…
• Profiles, Levels and Tiers
– Historically, a Profile defines a collection of coding tools, whereas a Level constrains decoder processing load and memory requirements
– The first version of HEVC defined 3 profiles
Main Profile: 8-bit video in YUV 4:2:0 format
Main 10 Profile: same as Main, up to 10-bit
Main Still Picture Profile: same as Main, one picture only
– Levels and Tiers
Levels: max sample rate, max picture size, max bit rate, DPB and CPB size, etc.
Tiers: “Main tier” and “High tier” within one level
157
HEVC…
• Complexity Analysis
– Software-based HEVC decoder capabilities (published by NTT Docomo)
Single-threaded: 1080p@30 on ARMv7 (1.3 GHz), 1080p@60 on i5 (2.53 GHz)
Multi-threaded: 4K×2K@60 on i7 (2.7 GHz), 12 Mbps, decoding speed up to 100 fps
– Other independent software-based HEVC real-time decoder implementations published by Samsung and Qualcomm during HEVC development
– Decoder complexity not substantially higher
More complex modules: MC, Transform, Intra Pred, SAO
Simpler modules: CABAC and deblocking
158
HEVC…
• Quality Performance
159
THANK YOU