ARPS Architecture

ELECTRONICS AND ELECTRICAL COMMUNICATION ENGINEERING, INDIAN INSTITUTE OF TECHNOLOGY KHARAGPUR-721302

Project Presentation on

VLSI ARCHITECTURE FOR BLOCK MATCHING MOTION ESTIMATION USING ARPS ALGORITHM

Under the guidance of

Dr. Indrajit Chakrabarti

Presented By:

HARSH KAUSHIK (12EC62R10)


OUTLINE

INTRODUCTION

MOTIVATION

OBJECTIVE

ARPS METHOD

PROPOSED ARCHITECTURE

SIMULATION RESULTS

CONCLUSIONS

FUTURE SCOPE

REFERENCES

INTRODUCTION

‘Raw’ video is composed of a huge amount of data. [1]

Video compression exploits the redundancies present in the video. [1]

Applications such as digital video storage, video conferencing over wired or wireless media, online gaming and broadcast services over satellite channels all rely on digital video.

International video coding standards such as H.264 and MPEG-4 use modern methods of video compression.

The compression efficiency of video encoders has to be improved.

INTRODUCTION (contd.)

Video compression involves two main blocks: Encoder and Decoder.

The encoder converts the data coming from the source into a compressed bit stream.

Motion Estimation is one of the most important blocks in a video encoder.

It provides the displacement between each block in the current frame and its closest match in the reference frame.

Types of Motion Estimation:

1. Pixel Based Motion Estimation.

2. Block Based Motion Estimation.

3. Mesh Based Motion Estimation.

MOTIVATION

Motion Estimation is one of the most consequential as well as computationally intensive blocks of a video encoder.

The block-matching algorithm (BMA) is used for motion estimation (ME) in various video coding standards.

Full Search (FS) is computationally expensive, so fast BMA techniques are used instead.

Adaptive Rood Pattern Search (ARPS) is the most efficient of the compared algorithms in terms of computational speed and achieves good PSNR.

OBJECTIVE

To design an efficient VLSI architecture for block matching motion estimation using the ARPS algorithm, which will help to enhance the performance of a video encoder.

Adaptive Rood Pattern Search method

The SIZE of the search pattern and the MAGNITUDE of the target MV determine the speed and accuracy of pattern-based search algorithms.

Two main issues are:

1) How to pre-determine the motion behavior of the current block?

2) What size and shape of search pattern to use?

For the first issue, the current block’s motion behavior can be predicted from its neighboring blocks’ MVs.

For the second issue, two types of search patterns are used:

1. Adaptive rood pattern (ARP)

2. Small search pattern, the unit rood pattern (URP)

Prediction of the target motion vector is achieved with the help of the ROS (region of support).

Figure 1: Region of support for the current macro block. (Labels: the current macro block; the neighboring macro block, which acts as the ROS and whose motion vector has already been computed.)

Figure 2: Algorithmic flow chart of ARPS. (Flow: START → ARP → calculate the minimum-SAD block → set the MME point of the ARP as the center point for the URP → URP → is the center the MME point? If NO, repeat the URP; if YES, output the MOTION VECTOR → STOP.)

Adaptive Rood Pattern

ARP’s size:

L = Max {|MVpredicted(x)|, |MVpredicted(y)|}

Leftmost blocks in each frame have a fixed arm length of 2 pixels.

In addition to the four-armed rood pattern, the predicted MV itself is added to the ARP, since it is likely to be similar to the target MV.

The predicted MV mostly overlaps with one of the rood search points.

Therefore, the ARP has 5 search points when the predicted motion vector is not zero; otherwise only one search point is involved.

ARP is applied only once.

Figure 3: Adaptive Rood Pattern [2][3].
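As an illustration of how the ARP candidates follow from the predicted MV, here is a minimal Python sketch. The function name `arp_points`, the `leftmost` flag and the choice to list the center point as a candidate are assumptions made for this example, not part of the presented design. Overlapping points are merged, so a predicted MV that lies on an arm gives five distinct points and a zero predicted MV gives one.

```python
def arp_points(pred_mv, leftmost=False):
    """Return the distinct ARP search offsets (dx, dy) around the block position.

    pred_mv  -- predicted motion vector (x, y) taken from the neighboring block
    leftmost -- True for leftmost blocks, which use a fixed arm length of 2
    """
    px, py = pred_mv
    # Arm length: fixed at 2 pixels for leftmost blocks, otherwise
    # L = max(|MVpredicted(x)|, |MVpredicted(y)|).
    arm = 2 if leftmost else max(abs(px), abs(py))

    candidates = [(0, 0),                 # center point
                  (arm, 0), (-arm, 0),    # right and left arms
                  (0, arm), (0, -arm)]    # bottom and top arms
    if not leftmost:
        candidates.append((px, py))       # predicted MV itself

    # Preserve order while dropping overlapping points.
    unique = []
    for point in candidates:
        if point not in unique:
            unique.append(point)
    return unique
```

With these assumptions, `arp_points((2, 0))` gives five points because the predicted MV coincides with the right arm, and `arp_points((0, 0))` gives only the center point.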

Figure 4: Initial search using ARP. (Labels: block shifted right by 2 pixels; block shifted to the bottom by 2 pixels; block shifted left by 2 pixels; block shifted to the top by 2 pixels.)

Fixed Pattern – for refined search

The initial search leads to a new search center.

The small search pattern (URP) is then used for a local refined search, unrestrictedly and repeatedly.

The URP search continues until the center becomes the MME point.

When the center becomes the minimum point, the process stops, the corresponding motion vector is obtained, and the search starts for the next macro block.

Figure 5: Unit Rood Pattern [2][3].
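The two stages described so far (one ARP step, then repeated URP steps) can be sketched in software as follows. This is a simplified model under stated assumptions, not the presented hardware: frames are plain lists of pixel rows, the names `sad` and `arps_search` are hypothetical, no search-range limit is applied, and the fixed arm length for leftmost blocks is left to the caller.

```python
def sad(cur, ref, bx, by, dx, dy, n=16):
    """Sum of absolute differences between the current block at (bx, by)
    and the reference block displaced by (dx, dy)."""
    total = 0
    for r in range(n):
        for c in range(n):
            total += abs(cur[by + r][bx + c] - ref[by + dy + r][bx + dx + c])
    return total


def arps_search(cur, ref, bx, by, pred_mv, n=16):
    """Return the motion vector (dx, dy) found by ARP followed by repeated URP."""
    height, width = len(ref), len(ref[0])

    def inside(dx, dy):
        # Keep the displaced block fully inside the reference frame.
        return 0 <= bx + dx <= width - n and 0 <= by + dy <= height - n

    def best_of(points):
        # min() returns the first minimum, so ties favor earlier points.
        valid = [p for p in points if inside(*p)]
        return min(valid, key=lambda p: sad(cur, ref, bx, by, p[0], p[1], n))

    # Initial search: adaptive rood pattern, applied only once.
    px, py = pred_mv
    arm = max(abs(px), abs(py))
    center = best_of([(0, 0), (arm, 0), (-arm, 0), (0, arm), (0, -arm), (px, py)])

    # Refined search: unit rood pattern, repeated until the center is the MME point.
    while True:
        cx, cy = center
        best = best_of([(cx, cy), (cx + 1, cy), (cx - 1, cy),
                        (cx, cy + 1), (cx, cy - 1)])
        if best == center:
            return center
        center = best
```

Because the center is listed first in each URP step, a tie keeps the current center, so every move strictly lowers the SAD and the loop terminates.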

Figure 6: Refined search using Unit Rood Pattern. (Labels: block shifted right by 1 pixel; block shifted to the bottom by 1 pixel; block shifted left by 1 pixel; block shifted to the top by 1 pixel.)

PROPOSED ARCHITECTURE

Figure 7: Block Diagram of the Proposed Architecture. (Blocks shown: Address and Control Unit; Block RAM for current frame; Block RAM for reference frame; ARP module; URP module; Comparator for deciding the URP center point; Motion Vector Unit; Memory Unit; ARP flag; SSP flag.)

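MEMORY UNIT

For proper addressing of the pixel intensity values we use zero padding.

[Figure: CIF frame (352 × 288) made up of (16 × 16) macro blocks, placed in the zero-padded memory. Labels: frame width 352 and height 288; addresses 0 to 511 along the first memory row, with 512 starting the next row.]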
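One way to read the zero-padded layout is that each 352-pixel frame row occupies a 512-location memory row, so a pixel address is simply the row number followed by the column number. The sketch below assumes exactly that; the 512 stride is inferred from the 0, 511 and 512 labels and from the ±512 and ±1024 offsets used by the reference address generation block, and the function names are hypothetical.

```python
FRAME_WIDTH = 352    # CIF frame width in pixels
FRAME_HEIGHT = 288   # CIF frame height in pixels
ROW_STRIDE = 512     # padded row length (assumed from the 0..511 / 512 labels)


def pixel_address(row, col):
    """Linear memory address of pixel (row, col) in the zero-padded layout."""
    if not (0 <= row < FRAME_HEIGHT and 0 <= col < ROW_STRIDE):
        raise ValueError("pixel outside the padded frame")
    # With a power-of-two stride the multiply is just a 9-bit left shift,
    # so the address is the row bits concatenated with the column bits.
    return (row << 9) | col


def is_padding(col):
    """True for the zero-padded columns 352..511 of each row."""
    return col >= FRAME_WIDTH
```

Under this assumption the largest pixel address, `pixel_address(287, 351)`, stays well below 2^19, which is consistent with the 19-bit address buses shown later.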

PROPOSED ARCHITECTURE

The ARP and URP modules are each made up of three main sub-blocks: the address generation block, the comparison block and the SAD block.

Figure 8: Internal Constitution of ARP and URP Modules. (Blocks shown: Current Frame Memory; Reference Frame Memory; Address Generation Block; SAD Block; Comparison Block; Motion Vector.)

Address Generation Block

The current frame and the reference frame are stored in two block RAMs of (2K × 9).

The address generation block generates the addresses used to refer to the pixel intensity values stored in the memory.

The address generation block has three main sub-blocks:

1. Base address generation block.

2. Reference address generation block.

3. Block Pattern address generation block.

Figure 9: Address Generation Block. (Signals shown: Base Address, Reference Address and Block Pattern Address, with 19-bit bus labels.)

Base Address Generation Block

This block is used to generate the address of the current macro block.

The block is designed using two counters, each made up of 5 D flip-flop modules.

The CIF format video test sequence has a resolution of 352 × 288. Each frame is divided into 16 × 16 blocks, which gives 18 rows and 22 columns of blocks.

Figure 10: Base Address Generation Block. (Blocks shown: a MOD-18 counter and a MOD-22 counter, each with TC and Reset signals, driven by the global clock.)
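The behavior of the two counters can be modeled with a short generator. This is a sketch under assumptions: the MOD-22 counter is taken to step through block columns and its terminal count to advance the MOD-18 row counter (raster order), and the 512-location row stride comes from the zero-padded memory layout. The name `base_addresses` is hypothetical.

```python
def base_addresses(block_rows=18, block_cols=22, block=16, row_stride=512):
    """Yield (block_row, block_col, base_address) for every macro block.

    The column counter (MOD-22) advances on every step; its terminal count
    advances the row counter (MOD-18), mimicking two cascaded counters.
    """
    row_count = 0
    col_count = 0
    for _ in range(block_rows * block_cols):
        yield row_count, col_count, row_count * block * row_stride + col_count * block
        col_count += 1
        if col_count == block_cols:       # terminal count of the MOD-22 counter
            col_count = 0
            row_count = (row_count + 1) % block_rows
```

Iterating over `base_addresses()` visits all 18 × 22 = 396 macro blocks of a CIF frame, starting at address 0 and moving 16 locations per block along a row.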

Reference Address Generation Block

It generates the addresses of the blocks which are to be searched in the reference frame.

Five search points, including the center point, are to be searched.

A MOD-5 counter built from 3 D flip-flop modules is used to generate the five positions.

The output of this counter is passed to a decoder.

Figure 11: Reference Address Generation Block. (Blocks shown: a MOD-5 counter, clocked by CLK, feeding a decoder with a 19-bit output.)

For the ARP module (for leftmost blocks the ARP size is two pixels):

Output of MOD-5 Counter    Address in Decimal Form    Address in Binary 2’s Complement Form
000 (0)                    0                          0000000000000000000
001 (1)                    2                          0000000000000000010
010 (2)                    1024                       0000000010000000000
011 (3)                    -2                         1111111111111111110
100 (4)                    -1024                      1111111110000000000

For the URP module:

Output of MOD-5 Counter    Address in Decimal Form    Address in Binary 2’s Complement Form
000 (0)                    0                          0000000000000000000
001 (1)                    1                          0000000000000000001
010 (2)                    512                        0000000001000000000
011 (3)                    -1                         1111111111111111111
100 (4)                    -512                       1111111111000000000
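The decoder outputs in the two tables can be reproduced with a few lines of Python. The sketch assumes that a horizontal step changes the address by 1 and a vertical step by 512 (the padded row length); the helper names `to_twos_complement` and `rood_offsets` are hypothetical.

```python
ADDRESS_BITS = 19
ROW_STRIDE = 512  # address distance between vertically adjacent pixels (assumed)


def to_twos_complement(value, bits=ADDRESS_BITS):
    """Format a signed integer as a fixed-width two's complement bit string."""
    if not -(1 << (bits - 1)) <= value < (1 << (bits - 1)):
        raise ValueError("value does not fit in the given width")
    return format(value & ((1 << bits) - 1), "0{}b".format(bits))


def rood_offsets(arm):
    """Address offsets for counter states 0..4: center, +x, +y, -x, -y."""
    return [0, arm, arm * ROW_STRIDE, -arm, -arm * ROW_STRIDE]


if __name__ == "__main__":
    for arm in (2, 1):  # ARP with arm length 2, then URP
        for state, offset in enumerate(rood_offsets(arm)):
            print(format(state, "03b"), offset, to_twos_complement(offset))
```

Storing the offsets in two's complement form means the negative (left and top) displacements can be applied with the same adder as the positive ones.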

Block Pattern Address Generation Block

This block is used to generate the address of each pixel present in the current macro block of (16 × 16) size.

It is designed using two counters made up of 4 D flip-flops each.

Figure 12: Block Pattern Address Generation Block. (Blocks shown: two MOD-16 counters, each with TC and Reset signals, driven by the global clock.)
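A possible software model of the two MOD-16 counters, and of how the three address components might be combined, is sketched below. The assumption that the final pixel address is the sum of base, reference and block pattern addresses, taken modulo 2^19, is ours; the slides only show the three 19-bit addresses. Both function names are hypothetical.

```python
def block_pattern_offsets(block=16, row_stride=512):
    """Yield the 256 pixel offsets inside one macro block in raster order.

    The inner counter steps through the 16 columns; its terminal count
    steps the outer counter through the 16 rows.
    """
    for row in range(block):          # outer MOD-16 counter
        for col in range(block):      # inner MOD-16 counter
            yield row * row_stride + col


def reference_pixel_addresses(base, reference_offset, bits=19):
    """Addresses of all pixels of one candidate block in the reference frame."""
    mask = (1 << bits) - 1
    # Adding modulo 2**bits lets negative two's complement offsets work
    # with a plain unsigned adder.
    return [(base + reference_offset + p) & mask for p in block_pattern_offsets()]
```

For example, with a base address of 8208 and a reference offset of -2 (or equivalently its 19-bit pattern 2^19 - 2), the first pixel address comes out as 8206.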

Sum of Absolute Difference Block

It computes the absolute difference between the pixel intensity values of the current frame macro block and the reference frame macro block.

Each difference is added to the running sum of the previous pixel differences until all the pixel intensity values of the macro block have been processed.

Figure 13: Sum of Absolute Difference Block. (Signals shown: Current Pixel, Reference Pixel and SAD register; bus-width labels 8, 8, 8, 14 and 14.)
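The accumulation performed by the SAD block can be written as a short loop. This is a behavioral sketch only: the function name `sad_block` is hypothetical, the pixels are passed as flat sequences, and the register width of the hardware is not modeled (for reference, the largest possible SAD of a 16 × 16 block of 8-bit pixels is 256 × 255 = 65280).

```python
def sad_block(current_pixels, reference_pixels):
    """Accumulate |current - reference| over one macro block, pixel by pixel."""
    if len(current_pixels) != len(reference_pixels):
        raise ValueError("blocks must have the same number of pixels")
    sad_register = 0                           # cleared at the start of a block
    for cur, ref in zip(current_pixels, reference_pixels):
        sad_register += abs(cur - ref)         # absolute difference, then accumulate
    return sad_register
```

One pixel pair is consumed per loop iteration, which mirrors a datapath that processes one current pixel and one reference pixel per clock cycle.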

Comparison Block

This block calculates the minimum SAD value.

The SAD value obtained from the SAD block and the value of the register storing the minimum SAD are fed as inputs to the subtractor.

At the start of comparison the register value is preset.

The borrow output of the subtractor acts as the selection line of the multiplexer.

Figure 14: Comparison Block. (Blocks shown: the register holding the minimum SAD value, the SAD value from the SAD block, a MUX and the sign bit; 14-bit bus labels.)
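The subtract-and-select behavior can be modeled as a small class. It is a sketch under assumptions: the preset value is left as a parameter because the slides do not state it, and the `best_index` tag (recording which candidate produced the minimum) is a hypothetical addition for illustration.

```python
class MinSadRegister:
    """Running-minimum register updated by a subtract-and-select step."""

    def __init__(self, preset):
        self.min_sad = preset        # register is preset before comparison starts
        self.best_index = None       # hypothetical tag of the best candidate so far

    def update(self, sad_value, index):
        # Borrow of (sad_value - min_sad): set when the new SAD is smaller.
        borrow = sad_value - self.min_sad < 0
        if borrow:                   # multiplexer selects the new SAD value
            self.min_sad = sad_value
            self.best_index = index
        return borrow
```

Because the borrow is set only for a strictly smaller SAD, an equal SAD from a later candidate leaves the stored minimum unchanged.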

Motion Vector Computing Unit

The borrow output from the subtractor of the Comparison Block is used to store the address of the reference block having the minimum matching error.

The motion vector is obtained from this stored reference address.
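How the motion vector is derived from the stored address is not detailed on the slide. One plausible reading, sketched below, is that the signed address difference between the best reference block and the current block is split into a vertical and a horizontal part. The function name, the 512 row stride and the restriction to horizontal displacements between -256 and 255 are all assumptions of this example.

```python
def offset_to_motion_vector(offset, row_stride=512):
    """Split a signed address offset into (dx, dy).

    Valid when the horizontal displacement dx lies in
    [-row_stride / 2, row_stride / 2 - 1].
    """
    # Round to the nearest row so that small negative dx values are recovered.
    dy = (offset + row_stride // 2) // row_stride
    dx = offset - dy * row_stride
    return dx, dy
```

For instance, an offset of -2 maps to a motion vector of (-2, 0) and an offset of 1024 maps to (0, 2), matching the left and bottom arms of the ARP with arm length 2.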


SIMULATION RESULTS


MATLAB SIMULATION RESULTS

AVERAGE PSNR (dB) PERFORMANCE OF FS, TSS, DS, ARPS AND FIXED MESH ARPS (30 fps)

Video (Kbps)                  FS        TSS       DS        ARPS      FIXED MESH ARPS
AKIYO (1024)                  43.7737   43.6324   43.7575   43.7228   40.6678
BRIDGE-CLOSE (1024)           35.0512   35.0512   35.0512   35.0512   34.2305
CONTAINER (1024)              38.3828   38.3823   38.3822   38.3820   37.6384
ELEPHANT DREAMS (1024)        38.4459   38.0282   38.2063   38.0708   38.1541
FOREMAN (1024)                33.8942   33.1140   33.6211   33.4685   28.4741
FOOTBALL (1024)               23.5653   22.8817   22.7465   22.8567   22.4860
HALL (1024)                   35.5239   35.4433   35.4129   35.3432   30.6391
MOTHER AND DAUGHTER (1024)    42.2745   42.1989   42.2362   42.2024   39.2444
NEWS (1024)                   38.4927   38.4050   38.4446   38.4142   34.5488
PARIS (1024)                  31.3799   31.0338   31.2952   31.2478   28.2278
SILENT (1024)                 37.1517   36.9080   36.8728   36.8263   36.9954
STEFAN (1024)                 25.9698   24.7376   24.3586   25.5327   22.4696
TABLE (1024)                  31.3935   30.7814   30.6616   30.4916   27.0253
TEMPETE (1024)                26.7330   26.5292   26.5532   26.5184   25.1976
WATERFALL (1024)              35.3052   35.3051   35.3053   35.3047   32.5124
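For readers who want to reproduce a PSNR figure of this kind, the usual definition for 8-bit video is 10·log10(255² / MSE) between the original and the motion-compensated frame. The sketch below uses that standard definition; the slides do not state exactly how the values above were computed, and the function name `psnr` and the list-of-rows frame format are assumptions.

```python
import math


def psnr(original, reconstructed, peak=255):
    """PSNR in dB between two equally sized frames given as lists of rows."""
    squared_error = 0
    count = 0
    for row_o, row_r in zip(original, reconstructed):
        for a, b in zip(row_o, row_r):
            squared_error += (a - b) ** 2
            count += 1
    if squared_error == 0:
        return math.inf              # identical frames
    mse = squared_error / count
    return 10 * math.log10(peak ** 2 / mse)
```

A per-sequence average such as those in the table would then be the mean of `psnr` over all frames of the sequence.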

AVERAGE NUMBER OF SEARCH POINTS PER MV GENERATION

Video (Kbps)                  FS         TSS       DS        ARPS
AKIYO (1024)                  262.1717   23.2121   12.2596   5.0378
BRIDGE-CLOSE (1024)           262.1717   23.2432   14.2559   6.0570
CONTAINER (1024)              262.1717   23.2256   12.3772   5.1253
ELEPHANT DREAMS (1024)        262.1717   23.2935   16.343    8.1380
FOREMAN (1024)                262.1717   23.2571   16.043    8.4591
FOOTBALL (1024)               262.1717   23.3654   20.805    11.7219
HALL (1024)                   262.1717   23.2510   12.9231   5.8659
MOTHER AND DAUGHTER (1024)    262.1717   23.2874   13.375    6.2562
NEWS (1024)                   262.1717   23.2128   12.5235   5.3822
PARIS (1024)                  262.1717   23.2217   12.827    5.7681
SILENT (1024)                 262.1717   23.2143   12.9878   5.8658
STEFAN (1024)                 262.1717   23.3177   17.2021   8.1941
TABLE (1024)                  262.1717   23.2853   13.7468   6.5654
TEMPETE (1024)                262.1717   23.2356   12.867    5.9502
WATERFALL (1024)              262.1717   23.2121   12.2942   5.2849

Figure: Plot of frame-by-frame PSNR values of FS, DS, TSS and ARPS for the Table Tennis and Football sequences. (Two panels, FOOTBALL SEQUENCE and TABLE TENNIS SEQUENCE; x-axis: frame number, 1 to 29; y-axis: PSNR (dB).)

BLOCK-MATCHING ALGORITHM USING ARPS

[Figures: reference frame, current frame and reconstructed frame for the AKIYO sequence, the MOTHER AND DAUGHTER sequence and the FOOTBALL sequence.]

XILINX TIMING DIAGRAMS

Figure 15: Timing Simulation of Combined Base Address and Reference Address Generation Block.

Figure 16: Timing Simulation of Block Pattern Address Generation Block.

Figure 17: Timing Simulation of Address Generation Block.

Figure 18: Timing Simulation of SAD Block.

Figure 19: Timing Simulation of Comparison Block.

Figure 20: Timing Simulation of ARP Module.

Figure 21: Timing Simulation of URP Module.

Figure 22: Timing Simulation of Final Architecture Block.

Figure 23: Schematic of Final Architecture by Design Vision.

The overall device utilization summary generated using XILINX 14.2 is given in the table below. (Virtex4, Device: XC4VFX140)

Logic Utilization              Used    Available   Utilization
Number of Slices               1702    63168       2%
Number of Slice Flip Flops     374     126336      1%
Number of 4 input LUTs         3077    126336      2%
Number of bonded IOBs          252     768         32%
Number of FIFO 16/RAMB16s      512     552         92%
Number of GCLKs                5       32          15%

The overall timing summary generated using XILINX 14.2 is given in the table below. (Speed Grade: -11)

Minimum Period:                             9.702 ns
Maximum Frequency:                          103.027 MHz
Minimum input arrival time before clock:    2.805 ns
Maximum output required time after clock:   13.580 ns


Figure 24: XILINX Power Report.

Results obtained with the Synopsys Design Vision tool (0.18 µm) at 10 ns (frequency = 100 MHz).

Sr. No.   Module Name                Operating Conditions   Area Analysis (µm²)   Power Analysis
1         Address Generation Block   SS                     12117.9744            281.9040 µW
                                     TT                     11727.3744            352.3345 µW
                                     FF                     11727.3744            437.9998 µW
2         SAD Block                  SS                     6771.4416             323.18.18 µW
                                     TT                     6496.4592             408.7913 µW
                                     FF                     6496.4592             505.7709 µW
3         Comparison Block           SS                     10927.4257            545.6620 µW
                                     TT                     10952.4241            675.6204 µW
                                     FF                     10977.4225            847.0907 µW
4         ARP Module                 SS                     34150.9394            495.8019 µW
                                     TT                     32654.1602            603.2296 µW
                                     FF                     32647.9106            745.2020 µW
5         URP Module                 SS                     38488.1618            522.8998 µW
                                     TT                     37022.6305            634.4330 µW
                                     FF                     37016.3809            793.7040 µW
6         Final Architecture         SS                     73126.5700            974.1342 µW
                                     TT                     70151.7602            1.1877 mW
                                     FF                     70139.2611            1.4743 mW

CONCLUSIONS

The main focus of this project has been on improving the performance of the motion estimation block of the video encoder.

The Adaptive Rood Pattern Search algorithm has been used for this purpose. Various BMA techniques such as FS, TSS, DS and ARPS have been compared with each other to determine the most efficient search algorithm. ARPS was found to have the least time complexity, requiring the fewest search points per motion vector.

A hardware architecture was designed for the implementation of the ARPS algorithm.

Individual blocks such as the address generator, the comparator and the SAD block, which are themselves made up of several smaller blocks, were designed and tested. These blocks were then integrated to complete the ARP module and the URP module. Finally, the ARP module and the URP module were combined to develop the ARPS architecture.

FUTURE SCOPE

Optimization of the proposed architecture in terms of speed.

Enhancement of the architecture for mesh based motion estimation using ARPS, to remove the disadvantages of block based motion estimation techniques.

This will help in handling spatial deformations such as zoom, shear and rotation, which are present in video.

REFERENCES

[1] I.E.G. Richardson, “H.264 and MPEG-4 Video Compression: Video Coding for Next-generation Multimedia”, 2003.

[2] Yao Nie and Kai-Kuang Ma, “Adaptive Rood Pattern Search for Fast Block-Matching Motion Estimation”, IEEE Transactions on Image Processing, Vol. 11, No. 12, pp. 1442-1448, December 2002.

[3] Kai-Kuang Ma and Gang Qiu, “An Adaptive Rood Pattern Search for Fast Block-Matching Motion Estimation in JVT/H.26L”, IEEE Conference on Circuits and Systems, Vol. 2, pp. II-708-II-711, 2003.

[4] A. Barjatya, “Block Matching Algorithms for Motion Estimation”, IEEE Conference on Digital Image Processing, April 2004.

[5] D. Vijendra Babu, P. Subramanian, C. Karthikeyan, “Performance Analysis of Block Matching Algorithms for Highly Scalable Video Compression”, International Symposium on Ad Hoc and Ubiquitous Computing, December 2006.

[6] Renxiang Li, Bing Zeng and Ming L. Liou, “A new Three-Step Search Algorithm for Block Motion Estimation”, IEEE Transactions on Circuits and Systems for Video Technology, Vol. 4, No. 4, pp. 438-442, August 1994.

[7] Shan Zhu and Kai-Kuang Ma, “A new Diamond Search Algorithm for Fast Block-Matching Motion Estimation”, IEEE Transactions on Image Processing, Vol. 9, No. 2, pp. 287-290, February 2000.

[8] Yifeng Qiu and W. Badawy, “The Hardware Architecture of a Novel Motion Estimator with Adaptive Crossed Quarter Polar Search Patterns for H.264 Encoding”, IEEE Conference, pp. 819-822, 2009.

[9] B.G. Kim, S.T. Kim, S.K. Song and P.S. Mah, “Fast-adaptive rood pattern search for block motion estimation”, Electronics Letters, Vol. 41, No. 16, August 2005.

[10] Chunchun Chen, Junming Shan and Eryan Yang, “An improved Adaptive Rood Pattern Search Algorithm Based on Temporal and Spatial Correlation”, International Conference on Audio, Language and Image Processing, pp. 1088-1092, 2012.

[11] D. Dia, M. Atri and R. Tourki, “A Improved Fast Motion Block Matching for Wavelet Video Coding”, International Symposium on Signal Processing and Information Technology, pp. 801-806, 2007.

[12] U.Y. Oktiawati and V.V. Yap, “Evaluating the Effects of the Dual Tree Complex Wavelet Transform and the Adaptive Rood Pattern Search Algorithm on a Video Codec”, IEEE Conference on Industrial Electronics and Applications, pp. 2544-2547, 2008.

[13] M. Sayed and W. Badawy, “An Affine-Based Algorithm and SIMD Architecture for Video Compression with Low Bit-Rate Applications”, IEEE Transactions on Circuits and Systems for Video Technology, Vol. 16, No. 4, April 2006.

[14] W. Badawy, G. Zhang and M. Bayoumi, “VLSI Architecture for Hierarchical Mesh Based Motion Estimation”, IEEE Conference on Signal Processing Systems, pp. 110-119, October 1999.

[15] S. Palnitkar, “Verilog HDL: A Guide to Digital Design and Synthesis”, second edition, Prentice Hall Inc., February 2003.

[16] Video Lectures on Multimedia Processing by Prof. Somnath Sengupta, “http://nptel.iitm.ac.in/courses/Web course-contents/IIT Kharagpur/Multimedia Processing/New_index1.html”.

[17] MATLAB User Manual Version R 2012b.

[18] XILINX - ISE User Manual Version 14.2.


THANK YOU
