Reconfigurable Data Flow Engine for HEVC Motion Estimation
D’huys Thomas
Dissertação para obtenção do Grau de Mestre em Engenharia Electrotécnica e de Computadores
Júri
Presidente: Doutor Nuno Horta
Orientador: Doutor Leonel Augusto Pires Seabra de Sousa
Co-orientador: Doutor Frederico Correia Pinto Pratas
Vogal: Doutor Horácio Cláudio de Campos Neto
January 2014
Acknowledgments
First and foremost, I would like to express my sincerest gratitude to my supervisor, Leonel
Sousa, for his huge help and incredible, never seen before, support during this whole period.
Next I would like to gratefully thank Frederico Pratas for his guidance and support during the
many hours that we worked together. It was a great pleasure. I would also like to thank Svetislav
Momcilovic for helping me understand video encoding, making the GPU implementation used in
this thesis and his friendship. A thank you to all the people who made me feel at home at INESC-
ID, and especially to Aleksandar, Sveta, Hector and Diogo. Furthermore, I would like to thank IST
and the KU Leuven for this Erasmus exchange opportunity, which really enriched my life. I am very
thankful to my family for continuously supporting and motivating me.
To all the wonderful people from all over the world that I was able to meet during my Erasmus
in Lisbon: Muito obrigado, Thank you, Dank u
Abstract
High Efficiency Video Coding (HEVC) is an emerging video coding standard that achieves re-
duced rate distortion at the cost of a high computational load. In this thesis, a reconfigurable design
for HEVC motion estimation is proposed. Full-Search Block-Matching (FSBM) is used for high-
quality video encoding. The design is implemented on a data flow engine with the Maxeler frame-
work. The reconfigurability allows the Coding Units (CUs) to have any set of sizes ranging from
8x8 to 64x64 pixels, also taking non-square shapes into account. The search area width is config-
urable to 32, 64, 128 and 256 pixels. Furthermore, the adopted approach and the implementation
provide a fine-grained trade-off between maximum performance and minimum resource usage.
Experimental results show that 720p video can be processed at 56.9 frames per second (fps).
The hardware resource usage, Look-Up Tables (LUTs) and Flip-Flops (FFs), can be decreased by
41% and 13%, respectively, at the cost of a factor-of-two decrease in performance.
Keywords
Motion Estimation, Full-Search Block-Matching, Variable Block-Size, HEVC, FPGA, Maxeler
Platform, Scalable Design
Resumo
A codificação de vídeo de elevada eficiência (HEVC) é uma norma de codificação de vídeo
emergente que atinge uma relação distorção débito-binário melhorada, mas impõe um elevado
custo computacional. Nesta tese é proposto um acelerador de processamento reconfigurável
para estimação de movimento segundo a norma HEVC. É utilizada pesquisa exaustiva por em-
parelhamento de blocos (FSBM) para codificação de vídeo de alta qualidade. O projeto, baseado
numa arquitetura de fluxo de dados, foi implementado numa plataforma Maxeler. A reconfiguração
permite que as unidades de codificação (UCs) possam assumir um conjunto de tamanhos que
variam de 8x8 a 64x64 pixels, suportando também blocos com geometria não-quadrada. A
largura da área de pesquisa é configurável para 32, 64, 128 e 256 pixels. Além disso, a abor-
dagem adotada e a implementação realizada permitem equilibrar desempenho e recursos de
hardware, com uma granularidade fina. Resultados experimentais mostram que vídeos com 720p
podem ser processados a 56,9 imagens por segundo (fps). Os recursos de hardware da FPGA
usados, tabelas lógicas (LUT) e básculas (FF), podem ser reduzidos 41% e 13%, respetivamente,
tendo como contrapartida um fator de diminuição do desempenho de dois.
Palavras-Chave
Estimação de Movimento, Pesquisa Exaustiva por Emparelhamento de Blocos, Blocos de
Dimensão Variável, HEVC, FPGA, Plataforma Maxeler, Projeto Escalável
Contents
List of Acronyms xvi
1 Introduction 1
1.1 Motivation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2
1.2 Related works . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
1.3 Objectives . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
1.4 Main contributions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
1.5 Dissertation outline . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
2 Background: Video Coding on FPGA 7
2.1 Video Coding . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
2.1.1 High Efficiency Video Coding . . . . . . . . . . . . . . . . . . . . . . . . . . 9
2.1.2 Advanced Motion Estimation . . . . . . . . . . . . . . . . . . . . . . . . . . 12
2.2 FPGA . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
2.2.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
2.2.2 Designing with FPGAs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
2.2.3 Dataflow programming with the Maxeler platform . . . . . . . . . . . . . . . 18
3 Video Coding Architecture 21
3.1 Streaming model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22
3.2 Hierarchical SAD computation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22
3.3 Streaming model with hierarchical SAD computation . . . . . . . . . . . . . . . . . 23
3.4 Streaming Pattern . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26
3.5 SadGenerator Architecture . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30
3.6 Scalable SadGenerator Architecture . . . . . . . . . . . . . . . . . . . . . . . . . . 37
3.7 SadComparator Architecture . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39
3.8 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41
4 Hardware Accelerator for High Efficiency Video Coding 43
4.1 Platform features and restrictions . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44
4.2 Design decisions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45
4.3 SadComparator implementation . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46
4.4 SadGenerator implementation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 50
4.4.1 The RF-stream . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 50
4.4.2 The OB-stream . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 52
4.4.3 The ALU implementation . . . . . . . . . . . . . . . . . . . . . . . . . . . . 53
4.5 Host Code . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 55
4.6 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 56
5 Results 59
5.1 Implementation Specific Optimizations . . . . . . . . . . . . . . . . . . . . . . . . . 60
5.1.1 SadGenerator Output Width . . . . . . . . . . . . . . . . . . . . . . . . . . . 60
5.1.2 Custom Accum . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 61
5.2 Experimental Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 62
5.2.1 Framework . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 62
5.2.2 Reference Implementation . . . . . . . . . . . . . . . . . . . . . . . . . . . 62
5.2.3 Design Comparison . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 64
5.3 Comparison . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 74
5.3.1 About the GPU implementation . . . . . . . . . . . . . . . . . . . . . . . . . 74
5.3.2 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 75
5.4 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 77
6 Conclusions and Future work 79
6.1 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 80
6.2 Future work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 81
A Appendix A 85
B Appendix B 91
List of Figures
2.1 A basic video coder . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
2.2 H.265/HEVC encoder with intra/inter selection. . . . . . . . . . . . . . . . . . . . . 9
2.3 (a) Grouping of CTUs in slices and tiles, (b) Subdivision of a CTB into CBs . . . . 10
2.4 Modes for splitting a CB into PBs in case of inter prediction . . . . . . . . . . . . . 12
2.5 The basic structure of an FPGA . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
2.6 A simplified example of a configurable logic block . . . . . . . . . . . . . . . . . . . 15
2.7 An overview of the main features of the Virtex-5 LX330T FPGA from Xilinx that is
used in this thesis [20]. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16
2.8 An overview of a standard software application versus a dataflow application [17] . 19
2.9 The architecture of a DFE and its connections [16] . . . . . . . . . . . . . . . . . . 20
3.1 The streaming model of the FSME accelerator . . . . . . . . . . . . . . . . . . . . 22
3.2 Hierarchical SAD computation inside a Z-OB . . . . . . . . . . . . . . . . . . . . . 23
3.3 Detailed model of the accelerator . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24
3.4 The memory is split into an even and odd SAD-buffer . . . . . . . . . . . . . . . . 24
3.5 An overview of the execution sequence of the SadGenerator's and the SadCom-
parator's Z-iterations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25
3.6 The streaming pattern of the Z-OBs in the original frame . . . . . . . . . . . . . . . 26
3.7 The RF-chunk with multiple reference frames that is streamed alongside each Z-
OB-chunk . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27
3.8 The SAD-chunks are streamed to and from the memory. Each SAD-chunk contains
all the A-SADs of one Z-OB. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27
3.9 The streaming of the MV-Chunks . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28
3.10 An overview of the processing of a frame with a total of 4 Z-OBs . . . . . . . . . . 29
3.11 An RF-line is used by A×L SADs of one A-OB . . . . . . . . . . . . . . . . . . . . 30
3.12 A×L ALUs are grouped in a grid . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30
3.13 An ALU has as input A reference pixels and A original pixels, and as output an
accumulated value . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31
3.14 An overview of which ALU lines calculate which SADs by using which lines from the SA 31
3.15 The output pattern of the SadGenerator when calculating the SADs of a single
A-OB, named OB 0, at a time . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32
3.16 By extending the size of the RF-line, a row of A-OBs uses the RF-line to calculate
all its SADs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33
3.17 The output pattern of the SadGenerator when calculating the SADs of one row of
A-OBs at a time . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34
3.18 By extending the number of RF-lines, multiple rows of A-OBs use the RF-lines to
calculate all their SADs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35
3.19 The output pattern of the SadGenerator when calculating the SADs of all Z-OB’s
A-OBs at a time, with A=8 and Z=32 . . . . . . . . . . . . . . . . . . . . . . . . . . 36
3.20 During different OB-iterations, another sub-RF-line, part of the RF-line, is sent to
the ALU-grid . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37
3.21 An overview of the SAD calculation sequence . . . . . . . . . . . . . . . . . . . . . 37
3.22 An overview of the SadGenerator . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38
3.23 Four different Horizontal parallelizations for the ALU grid . . . . . . . . . . . . . . . 38
3.24 From the streamed RF-line, H+A-1 pixels are sent to the ALU grid per horizontal
iteration . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39
3.25 Two different Processing power configurations of the ALUs . . . . . . . . . . . . . 39
3.26 The calculation sequence of an H=L/4 P=A/8 configuration . . . . . . . . . . . . . 40
3.27 An overview of the SadComparator in the situation of Figure 3.2 . . . . . . . . . . 40
4.1 An overview of the architecture on the FPGA . . . . . . . . . . . . . . . . . . . . . 45
4.2 Detailed ALU of a H(L)P(8) configuration . . . . . . . . . . . . . . . . . . . . . . . . 47
4.3 SadComparator with Transposer . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48
4.4 The Transposer consists of two memory banks that allow for simultaneous read and
write capability . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48
4.5 The transposer . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48
4.6 An overview of the iterations when processing a Z-OB . . . . . . . . . . . . . . . . 50
4.7 Reference frame in the DRAM . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51
4.8 The first RF-lines of adjacent Z-OBs in the DRAM for Z and L equal to 64 . . . . . 51
4.9 The data of multiple Z-OBs inside an RF-line for Z and L equal to 64 . . . . . . . . 52
4.10 The access pattern of the RF-line inside the SadGenerator for different Z-OBs, with
Z,L equal to 64 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 52
4.11 The 8 Original Blocks are stored into the B-RAM and distributed to the ALU grid per
row . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 53
4.12 Detailed ALU implementation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 54
4.13 The SadGenerator ALU’s registers with library accumulators . . . . . . . . . . . . 54
4.14 The SadGenerator ALU’s registers with custom accumulators . . . . . . . . . . . . 55
5.1 The resource usage of the SadGenerator’s kernel for different accum implementa-
tions relative to the custom accum implementation, for H1A8 and L=64 . . . . . . . 61
5.2 An in-depth overview of the resource usage of the H64P2 reference implementation 64
5.3 The effect of four parameters on the architecture’s performance . . . . . . . . . . . 64
5.4 The resource usage for different designs . . . . . . . . . . . . . . . . . . . . . . . . 65
5.5 The number of cycles needed to process a 720p frame. The red line indicates
the maximum number of cycles allowed for real-time (25 fps) processing of a 720p
frame. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 66
5.6 The read and write rates to the DRAM for each design . . . . . . . . . . . . . . . . 66
5.7 The total power usage per design . . . . . . . . . . . . . . . . . . . . . . . . . . . . 66
5.8 The number of cycles needed for each unit to process independently one Z-OB . . 68
5.9 The total number of cycles needed to process one 720p frame by both units working
in parallel . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 69
5.10 The amount of data used by the SadGenerator when processing one Z-OB for
different sizes of Z . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 69
5.11 The total amount of data used by the SadGenerator when processing one 720p
frame for different sizes of Z . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 70
5.12 The read and write rates of the DRAM for different sizes of Z . . . . . . . . . . . . 70
5.13 The number of cycles needed for each unit to process independently one Z-OB . 71
5.14 The total number of cycles needed to process one 720p frame by both units working
in parallel . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 72
5.15 The amount of data used by the SadGenerator when processing one Z-OB for
different sizes of L . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 72
5.16 The read and write rates of the DRAM for different sizes of L . . . . . . . . . . . . . 73
5.17 The number of cycles needed to process one 720p frame for different numbers of
reference frames . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 74
5.18 The amount of data used by the SadGenerator when processing one Z-OB for
different sizes of RF . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 74
5.19 A comparison of the GPU and proposed implementation, for different frame sizes . 76
5.20 A comparison of the GPU and proposed implementation, for different numbers of
reference frames . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 76
5.21 A comparison of the GPU and proposed implementation, for search area sizes . . 77
A.1 The first RF-lines of adjacent Z-OBs in the DRAM for Z equal to 32 . . . . . . . . . 87
A.2 The data of multiple Z-OBs inside an RF-line for Z equal to 32 . . . . . . . . . . . . 88
A.3 The first RF-lines of adjacent Z-OBs in the DRAM for Z equal to 64 . . . . . . . . . 88
A.4 The data of multiple Z-OBs inside an RF-line for Z equal to 64 . . . . . . . . . . . . 89
A.5 The first RF-lines of adjacent Z-OBs in the DRAM for Z equal to 96 . . . . . . . . . 89
A.6 The data of a single Z-OB inside an RF-line for Z equal to 96 . . . . . . . . . . . . 90
B.1 The Xilinx Power Estimator’s result for the H64P2 reference implementation . . . . 92
List of Tables
5.1 Resource usage relative to the design with 1 output cycle . . . . . . . . . . . . . . 61
5.2 Parameters . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 63
5.3 Experimental results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 63
5.4 Major resource changes when increasing Z, relative to Z=32 . . . . . . . . . . . . 67
5.5 The difference in increase of RF data and SAD data compared to the increase of
the Search Area L×L, relative to L=32 . . . . . . . . . . . . . . . . . . . . . . . . . 73
A.1 The RF-line data efficiency for different Z and L values . . . . . . . . . . . . . . . . 86
List of Acronyms
ALU Arithmetic Logic Unit
A-OB AxA Original Block
BRAM Block random-access memory
CU Coding Unit
DFE Data Flow Engine
DRAM Dynamic random-access memory
FPGA Field Programmable Gate Array
FSME Full Search Motion Estimation
HEVC High Efficiency Video Coding
MAD Mean Absolute Difference
ME Motion Estimation
MSE Mean Square Error
MV Motion Vector
OB Original Block
PB Prediction Block
PCIe Peripheral Component Interconnect Express
RB Reference Block
RF Reference Frame
SA Search Area
SAD Sum of Absolute Differences
SSE Sum of Squared Errors
SM Streaming Multi-processor
VBSFSME Variable Block Size Full Search Motion Estimation
Z-OB ZxZ Original Block
1 Introduction
Contents
1.1 Motivation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2
1.2 Related works . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
1.3 Objectives . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
1.4 Main contributions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
1.5 Dissertation outline . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
1.1 Motivation
Video coding continues to play an increasingly important role in our everyday life. It is em-
bedded in a wide range of applications that have become indispensable, including digital TV,
video conferencing, surveillance and Blu-ray. Video applications are becoming ever more de-
manding, with higher quality and resolutions such as 8K UHD (Ultra High Definition). New appli-
cations, such as multiview capture and display, also force video coding to continuously improve
and innovate.
The newest video coding standard, HEVC/H.265, was approved by the International Telecom-
munication Union (ITU-T) in April 2013 to meet these new demands [15]. It is the successor of
the widely used H.264/MPEG-4 AVC (Advanced Video Coding) standard and was jointly devel-
oped by the ISO/IEC Moving Picture Experts Group (MPEG) and the ITU-T Video Coding Experts
Group (VCEG). The standard exploits statistical correlation in the encoder to reduce the data rate
of the video while assuring high quality. Between successive video frames (inter mode) temporal
redundancies are further exploited. Within a video frame (intra mode) spatial redundancies are
exploited. The encoding process is completed by transforming the signal, followed by quantisation
and entropy encoding in order to exploit redundancies. HEVC introduces many improvements
over previous coding standards. Besides higher quality and support for ultra-high resolutions, the
most important improvement is a significantly higher coding efficiency. Compared to H.264/MPEG-4
AVC, the data compression ratio is typically doubled for the same video quality and resolution,
but this comes at the cost of dramatically increased computational requirements, which poses a
new challenge, namely for achieving real-time processing.
It is important to reduce the computational requirements and design specialised processing
engines to make real-time video encoding and decoding with the newest HEVC standard acces-
sible for mobile and embedded devices that often lack high computational power. Furthermore,
lowering the computational requirements will greatly reduce the energy consumption. To this end,
the effort should be concentrated on the most computationally demanding part of video coding,
which is the Motion Estimation (ME) that takes place on the encoder side. ME accounts for up to
90% of the computational requirements of the encoding process. In this step, the temporal redundan-
cies are exploited at the level of the rectangular blocks into which each frame is divided. For each
rectangular block, the search area of multiple reference frames is searched for the reference
block that results in a minimum residual signal. The result of the ME is a motion vector for each
rectangular block in the frame, which points to the block in a reference frame that leads to this
minimum residual. The search for the motion vector can be performed in different ways. Exhaus-
tive search compares each rectangular block with all the possible candidate blocks in the search
area. Other methods decrease the number of comparisons by reducing the search space, but
yield suboptimal results, which decreases the quality of the image.
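The exhaustive search described above can be sketched in software. The following Python fragment is a conceptual illustration only; the thesis realises this in hardware, and the function name, frame layout and toy search range are assumptions made for the example. For one original block, it evaluates every candidate displacement in a reference frame and keeps the motion vector with the lowest Sum of Absolute Differences (SAD).

```python
import numpy as np

def full_search(orig_block, ref_frame, top, left, search_range):
    """Exhaustive (full-search) block matching: compare one original block
    against every candidate whose displacement from (top, left) lies within
    +/- search_range, and return the best motion vector and its SAD."""
    n = orig_block.shape[0]            # square n x n block for simplicity
    height, width = ref_frame.shape
    best_sad, best_mv = None, (0, 0)
    for dy in range(-search_range, search_range + 1):
        for dx in range(-search_range, search_range + 1):
            y, x = top + dy, left + dx
            if y < 0 or x < 0 or y + n > height or x + n > width:
                continue               # candidate falls outside the frame
            candidate = ref_frame[y:y + n, x:x + n]
            # Sum of Absolute Differences: the matching cost minimised here
            sad = int(np.abs(orig_block.astype(np.int64)
                             - candidate.astype(np.int64)).sum())
            if best_sad is None or sad < best_sad:
                best_sad, best_mv = sad, (dy, dx)
    return best_mv, best_sad
```

Note the O(search_range²) candidate count per block; this is exactly the data-intensive workload that motivates the hardware accelerator.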
Video encoding can be performed on a wide range of platforms, ranging from multicore general-
purpose central processing units (CPUs) to application-specific integrated circuits (ASICs). Field
Programmable Gate Arrays (FPGAs) are becoming an increasingly popular choice for several
reasons. An FPGA is fully configurable for the needs of the application, can be remotely repro-
grammed with new bitstreams, and avoids the high non-recurring engineering (NRE) expenses
associated with an ASIC design.
Synthesis tools are used to configure the operation of the FPGA. They model a design and
can simulate its behavior to verify its correct functionality. The hardware description language
(HDL) that comes with the synthesis tool allows a designer to model a design. Logic synthesis is
performed at the register transfer level (RTL) to generate a bitstream that is used to configure the
FPGA. Verilog and the VHSIC Hardware Description Language (VHDL) are popular HDLs for
which synthesis tools exist. They allow for a fine-grained RTL design with a great deal of control.
With the increasing complexity of designs, there is a need for high-level synthesis tools that raise
the level of abstraction of the HDL. This higher level of abstraction simplifies the design and
drastically decreases the design time, while giving up some of the fine-grained control.
MaxCompiler is a programming tool suite that describes hardware at such a high abstraction
level. It provides a data flow model where data is streamed from the memory and processed by
several computation units without being written to the off-chip memory until the chain of process-
ing is complete. This method is especially favourable for data-intensive applications, since the
expensive write-back to memory after each computation is avoided. Motion estimation is a good
example of a data-intensive application, considering that each rectangular block in a frame has to
be compared with many other rectangular blocks in multiple reference frames.
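The contrast with the conventional write-back model can be illustrated, purely conceptually, with a chain of Python generators; MaxCompiler itself generates hardware pipelines rather than software, and the stage names below are invented for the example. Each pixel flows through all processing stages in turn, and no intermediate array is ever materialised in memory.

```python
def stream_pixels(frame):
    # Stage 1: stream pixels one at a time from "memory" (a nested list here)
    for row in frame:
        for pixel in row:
            yield pixel

def absolute_difference(stream_a, stream_b):
    # Stage 2: element-wise |a - b|, computed as the data flows past
    for a, b in zip(stream_a, stream_b):
        yield abs(a - b)

def accumulate(stream):
    # Stage 3: fold the incoming differences into a single SAD value
    total = 0
    for value in stream:
        total += value
    return total

# The stages are chained: each pixel traverses the whole pipeline before
# the next one is read, so no intermediate result is written back.
original  = [[1, 2], [3, 4]]
reference = [[2, 2], [1, 4]]
sad = accumulate(absolute_difference(stream_pixels(original),
                                     stream_pixels(reference)))
```

For these two 2x2 blocks the pipeline yields a SAD of 3, without ever storing the intermediate difference values.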
1.2 Related works
Motion Estimation is the most computationally intensive part of video encoding. It is used in
many previous standards as well as in the newly introduced HEVC. Especially for Full Search
Motion Estimation, most research focuses on effectively reusing the huge amounts of data that
are needed. Four main levels of data reuse are described in [2]. The more levels of data reuse
are supported, the higher the degree of data reuse, which lowers the bandwidth and the energy
consumption. A high degree of data reuse can be accomplished by utilising caches, as in [10],
and by exploiting a high degree of parallelization, as in [7] and [6]. The design proposed herein
combines a high degree of data reuse with a smart parallelization approach.
The HEVC standard extends the use of variable block sizes. Most Variable Block Size FSME
(VBSFSME) architectures, such as [3], [Yu-Wen Huang and Chen] and [8], use an N×N paralleliza-
tion grid, with N being the block width. This parallelization limits the data reuse and processing
time, especially when large search areas are used. When supporting the new large block sizes
introduced by HEVC, these designs may not be feasible for implementation due to the high re-
source usage of the N×N parallelization. The design presented in this thesis uses an L×N paral-
lelization, allowing it to scale not only with the search area width L, but also with the block size N.
When large search areas are used, fast processing is achieved due to the L parallelization. The
same holds for large block sizes due to the N parallelization.
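The property that makes variable block sizes cheap to support is that the SAD is additive: for a fixed motion vector, the SADs of four adjacent NxN sub-blocks sum to the SAD of the enclosing 2Nx2N block. The sketch below illustrates this bottom-up merging only; it is not the proposed architecture, and `sad` and `merge_sads` are hypothetical helpers introduced for the example.

```python
import numpy as np

def sad(a, b):
    """Sum of Absolute Differences between two equally sized blocks."""
    return int(np.abs(a.astype(np.int64) - b.astype(np.int64)).sum())

def merge_sads(sad_grid):
    """Given an s x s grid of SADs for N x N sub-blocks, all computed for the
    same motion vector, return the (s/2) x (s/2) grid of SADs for the
    enclosing 2N x 2N blocks by summing each 2 x 2 group of neighbours."""
    s = sad_grid.shape[0]
    return sad_grid.reshape(s // 2, 2, s // 2, 2).sum(axis=(1, 3))
```

Starting from an 8x8 grid of 8x8-block SADs for a 64x64 region, three successive merges yield the SADs for the 16x16, 32x32 and 64x64 block sizes; non-square shapes such as 32x8 correspond to summing other groupings of the same 8x8 SADs.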
According to the target platform, the available resources and the required performance, the
architecture herein proposed can be scaled. Both the search area width L and the block size N
parallelization can be scaled down in order to use fewer resources, at the cost of more processing
time. This is a unique feature that, at the time of writing, is not present in any known work.
1.3 Objectives
Given the immense computational requirements of the new HEVC video coding standard, the
main goal of this work is to develop a hardware accelerator that tackles the most computationally
intensive part of HEVC video coding, namely the motion estimation on the encoder side. The
motion estimation is performed with full search to achieve the optimal video quality. In order to
process the huge amounts of data involved in the full search effectively, data access regularity is
exploited to design a hardware accelerator based on the data flow approach.
The motion estimation accelerator was designed with the following main objectives in mind:
• a scalable architecture that can balance precisely between resource usage and speed per-
formance
• an adaptable architecture supporting different search area sizes, multiple reference frames
and different rectangular block sizes and shapes
• an implementation of the architecture on a Virtex 5 FPGA.
1.4 Main contributions
This thesis presents a high-definition motion estimation accelerator with full search for the
newest HEVC standard. The accelerator is a full streaming solution implemented on an FPGA
with the Maxeler framework. It can balance between real-time 1080p motion estimation and low
resource usage. A search area width of 32, 64, 128 or 256 pixels is supported, and block sizes of
8×8, 16×16, 32×32 and 64×64 are considered. The design can easily be extended with custom
block shapes, for example 32x8, as long as the granularity remains 8x8. The accelerator is highly
parallelised, with a focus on reducing the data bandwidth and increasing the speed performance.
Furthermore, a trade-off between performance and resource usage is possible due to the adoption
of a scalable ALU grid.
1.5 Dissertation outline
This thesis is organised in the following way:
• Chapter 2: An overview of the HEVC video coding, in particular the motion estimation
component, is given in this chapter followed by an introduction to FPGAs and dataflow com-
puting.
• Chapter 3: This chapter describes the proposed motion estimation architecture.
• Chapter 4: The implementation of the proposed architecture on a Virtex 5 FPGA is dis-
cussed in this chapter.
• Chapter 5: The results of the implementation and a comparison with a reference GPU
implementation are given in this chapter.
• Chapter 6: Finally, a global conclusion and possible directions for future work are presented
herein.
2 Background: Video Coding on FPGA
Contents
2.1 Video Coding . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
2.2 FPGA . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
This chapter provides the theoretical background for the rest of the thesis, considering both
video coding and the FPGA-based hardware platforms used. Section 2.1 outlines the video coding
procedure and the most recent H.265/High Efficiency Video Coding (HEVC) standard. In particu-
lar, it focuses on video encoding, and more specifically on the motion estimation module. Sec-
tion 2.2 is mainly related to the FPGA technology that is used to implement the motion estimation
architecture proposed in this thesis.
2.1 Video Coding
The video coding procedure aims at providing an efficient compression mechanism, and the
reverse decompression, of raw digital video in order to decrease both video storage and video
communication requirements. The source of a digital video signal is originally a camera or a
video synthesis tool such as animation software. This digital video signal can be optionally pre-
processed to enhance the performance of the encoding step. During the encoding the video is
converted to a bit stream. This bit stream is used to store the video on a medium or to transmit
it over a communication channel. When displaying the video, the bitstream is first decoded and
then post-processed. A basic architecture of a video coder is presented in Figure 2.1.
Figure 2.1: A basic video coder
The video coding standard specifies the structure and the syntax of a bitstream. In such a way,
the standard provides the constraints on the bitstream that must be respected by the encoder, as
well as the procedure to interpret the bitstream by a decoder. This ensures that the decoding
of a given bitstream with any decoder implementation (where both conform to the standard)
results in the same output. Developers are free to implement alternative decoders as long as
they are functionally equivalent to the method in the standard. Moreover, the limited scope of
the standard grants the freedom to optimise the implementations to a specific application when
balancing between compression quality, implementation cost, time to market, etc. The video
coding technique analysed herein is the newly approved HEVC standard [15].
2.1.1 High Efficiency Video Coding
H.265/HEVC is the newest international video coding standard, approved by the International
Telecommunications Union (ITU-T) and the International Organization for Standardization/International
Electrotechnical Commission (ISO/IEC). Like its predecessor H.264/AVC, it uses a hybrid video
coding scheme, where statistical correlation is exploited in the encoder either between successive
video frames (inter prediction mode) or within a frame (intra prediction mode).
Figure 2.2: H.265/HEVC encoder with intra/inter selection.
Each frame of a video is split into blocks in which these temporal (former) and spatial (latter)
redundancies are exploited (see Figure 2.2). When inter-mode is selected, the blocks are predicted
from a decoded picture buffer, while in the case of intra-mode, the blocks are predicted using the
samples from the adjacent decoded blocks. The residual signal is transformed, scaled and quan-
tized before it is entropy-encoded and transmitted. The decoding process is also implemented in
the feedback loop of the encoder in order to reconstruct the decoded picture buffer. A deblocking
filter is applied to improve the visual aspect of the reconstructed pictures, and smooth the artifi-
cial edges caused by the blocking nature of the encoder. By relying on new advanced encoding
techniques, the new standard provides an increased compression efficiency of up to 50% when
compared to the previous standard.
A – Picture representation A picture consists of multiple samples. Each sample represents
either a brightness value or, for colour pictures, one of the chroma values. HEVC supports
several colour spaces, among which the most typically used is YCbCr with 4:2:0 sampling [13].
The Y component represents the brightness information and is called luma. The two chroma
components are represented by Cb and Cr and store the blue and red colour information, respectively.
Considering the fact that the human visual system is less sensitive to the chrominance than to the
luminance, the 4:2:0 sampling structure is typically used, in which each chroma component is
subsampled by a factor of 2 in both the horizontal and vertical direction. A rectangular picture with
width W and height H in luma samples is accompanied by two chroma components of
dimension W/2×H/2. Each sample of a component is represented with 8 or 10 bits of precision.
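The memory footprint implied by 4:2:0 subsampling can be illustrated with a small calculation. The following C sketch (the function name and the even-dimension assumption are ours, not part of the standard) counts the samples of one picture:

```c
#include <assert.h>

/* Total number of samples in one YCbCr 4:2:0 picture whose luma plane is
 * w x h samples.  The two chroma planes are each (w/2) x (h/2), so 4:2:0
 * halves the chroma data in both directions; even dimensions are assumed. */
unsigned long yuv420_samples(unsigned long w, unsigned long h)
{
    unsigned long luma   = w * h;               /* Y plane           */
    unsigned long chroma = (w / 2) * (h / 2);   /* one of Cb or Cr   */
    return luma + 2 * chroma;                   /* Y + Cb + Cr       */
}
```

With 8 bits per sample, a 1920×1080 picture therefore occupies 3,110,400 bytes, half of the 6,220,800 bytes that full 4:4:4 sampling would require.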
B – CTUs, Slices and Tiles According to the HEVC standard, a picture is divided into several
coding tree units (CTUs) to ease its processing. Each CTU comprises one luma coding tree block
(CTB) of L×L samples and two chroma CTBs of L/2×L/2 samples. The CTBs represent the
basic processing units in the HEVC standard, which supports several different CTB sizes, where
the value of L can be 16, 32 or 64. The CTBs can be further split into coding blocks (CBs) by
iteratively applying the quadtree structure (see Figure 2.3(b)), which is limited by the minimum
allowed CB size of 8×8 luma samples (with the exception of the first reference frame, where 8×4
and 4×8 shapes are also allowed). A coding unit (CU) is the collection of one luma CB and two
chroma CBs that span the same area of a picture. The ability to encode larger CTUs than in
previous standards is one of the reasons for the increased compression efficiency, especially with
high-resolution video content.
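The recursive quadtree splitting of a CTB can be sketched as follows. The function below is our own illustration, not encoder code: it simply counts the CBs obtained when every node is split all the way down to the minimum CB size.

```c
#include <assert.h>

/* Number of CBs in a size x size CTB when the quadtree is fully expanded
 * down to the minimum CB size min_cb (8x8 luma samples in HEVC).  Each
 * split replaces one block by four half-sized quadrants. */
int fully_split_cb_count(int size, int min_cb)
{
    if (size <= min_cb)
        return 1;                                   /* quadtree leaf */
    return 4 * fully_split_cb_count(size / 2, min_cb);
}
```

For a 64×64 CTB this yields 64 CBs of 8×8 samples; a real encoder of course decides per node whether splitting pays off in rate-distortion terms.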
Slices and tiles are sequences of CTUs into which a picture can be divided (see Figure 2.3(a)).
A slice represents a set of CTUs in raster scan order, which can be correctly decoded without
the use of any data from other slices in the same picture. The main purpose of slices
is the resynchronisation of the encoding process in order to prevent the propagation of errors
in the prediction procedure. Tiles, on the other hand, are self-contained and independently-decodable
structures whose main purpose is to enable the use of parallel processing architectures
for video encoding and decoding. In contrast to slices, tiles are always rectangular
regions of the picture, with approximately an equal number of CTUs in each tile.
Figure 2.3: (a) Grouping of CTUs in slices and tiles, (b) Subdivision of a CTB into CBs
C – Prediction At the level of the CUs, it is decided whether intra-picture (spatial) prediction or
inter-picture (temporal) prediction is used. When intra prediction is chosen, the CBs are predicted
by using the information of adjacent CBs that are already encoded. Directional prediction with
33 different orientations is used at the level of prediction blocks (PBs). PBs are sub-blocks
of a CB, and one luma PB and two chroma PBs are combined into a prediction unit (PU). When inter
prediction is chosen, the CBs are predicted by searching for the best-matching predictor within
already encoded reference frames (RFs). This process, known as motion estimation (ME), is
also performed at the level of PBs. It represents the most computationally demanding module in
the encoding process, typically requiring 80% of the total encoding time [1]. The ME module
is explained in more detail in Section 2.1.2.
D – Transformation, scaling and quantisation After the prediction step, in either intra or
inter mode, the predicted CBs are subtracted from the originals. The result is the residual data.
The residual data of each CB is integer transformed to compact its energy. This transformation
is performed at the level of transform blocks (TBs). TBs are grouped in transform units (TUs),
which can have the sizes of 4x4, 8x8, 16x16 or 32x32 samples. In contrast to previous standards,
the HEVC design allows a TB to span across multiple PBs for inter-predicted CUs in order to
maximise the potential coding efficiency benefits of the quad tree structured TB partitioning [15].
The transform matrix is composed of values that are derived from scaled discrete cosine transform
(DCT) basis functions. For simplicity one 32x32 matrix is specified. For smaller transform units,
a smaller transform matrix is composed by using a sub-sampled version of the specified 32x32
matrix. The scaling operation that is used in the previous standard is omitted since the scaling
is incorporated in the transformation matrix [15]. The result of the transformation is a set of
coefficients in which the energy of the residual is concentrated. These coefficients are then
quantised using uniform-reconstruction quantisation
(URQ).
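The principle of uniform quantisation can be sketched as below. This is an idealised illustration with a hypothetical quantisation step `qstep`; the actual HEVC URQ uses QP-derived scaling factors and integer shifts defined in the standard.

```c
#include <assert.h>

/* Idealised uniform quantisation: map a transform coefficient to the
 * nearest multiple of qstep.  The sign is handled separately so that
 * rounding is symmetric around zero. */
int quantise(int coeff, int qstep)
{
    int mag = coeff < 0 ? -coeff : coeff;
    int lvl = (mag + qstep / 2) / qstep;   /* round to nearest level */
    return coeff < 0 ? -lvl : lvl;
}

/* Uniform reconstruction: every level maps back to level * qstep,
 * hence the name uniform-reconstruction quantisation. */
int dequantise(int level, int qstep)
{
    return level * qstep;
}
```

The quantisation is where the (lossy) compression happens: small coefficients collapse to zero and large ones lose precision, at a granularity controlled by `qstep`.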
E – Deblocking filter A deblocking filter (DF) is applied to the reconstructed samples to remove
blocking artefacts. Blocking artefacts are sharp edges that can appear in an image because the
image is split into blocks during the encoding process. The DF smooths out these sharp edges
by operating on an 8×8 sample grid. The filter automatically determines the blockiness strength
in each part of the frame, and the DF strength is set dynamically, taking the trade-off between
sharpness and smoothness into consideration.
F – Entropy coding At the end of the encoding process, the quantised transform coefficients
are reordered and entropy coded, in order to exploit the statistical redundancy in the encoded
bitstream. In contrast to H.264/AVC, the HEVC standard specifies only one entropy coding
method, namely, context adaptive binary arithmetic coding (CABAC). CABAC is a lossless
compression method which encodes syntax elements as binary symbols that are arithmetically
coded according to the local statistics of recently-coded data, using several probability models.
Figure 2.4: Modes for splitting a CB into PBs in case of inter prediction
2.1.2 Advanced Motion Estimation
Motion estimation is a module applied in the inter-prediction mode in order to exploit the tem-
poral redundancy in successive video frames. During the motion estimation, the current frame
is partitioned into blocks of pixels. For each block in the current frame a best match is located
inside one of the previously encoded frames, which are called the reference frames (RFs). This
best match is the predictor of the block in the current frame. It is used to calculate the residue
block which is the pixel-wise difference between the block and its predictor. The energy of this
residue block is significantly smaller than the energy of the original block, which makes it possible
to quantise it with fewer bits and/or greater precision. The result of the motion estimation is a
motion vector (MV) associated with a current block, that defines in which RF the best match was
found and what the displacement between the original block and its best match is. The MV of
each block of the current frame is given to the next module in the encoding process, such that it
can encode the residue block and its associated motion vector instead of the original block, which
leads to a significantly better coding efficiency. The better the match between the original block
and its predictor, the fewer bits need to be used to encode the current block, and thus the better
the coding efficiency.
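The residue computation described above amounts to a pixel-wise subtraction; a minimal C sketch follows (the function name, row-major layout and data types are our assumptions):

```c
#include <assert.h>

/* Pixel-wise residue between an n x n original block and its predictor,
 * both stored row-major as 8-bit samples.  The residue can be negative,
 * hence the 16-bit signed output. */
void compute_residue(const unsigned char *block, const unsigned char *pred,
                     int n, short *res)
{
    for (int i = 0; i < n * n; i++)
        res[i] = (short)block[i] - (short)pred[i];
}
```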
In the HEVC standard, the way the current frame can be partitioned into blocks is standardised.
As explained in Section 2.1.1 and presented in Figure 2.3, a frame is divided into CTUs
which are formed by CUs. Each CU consists of one luma CB and two chroma CBs. During
inter-picture prediction a CB can be split into prediction blocks (PBs). The motion estimation in
the HEVC standard is performed at the level of these PBs. For each PB of the current frame, a MV
is found that points to its predictor. A CB can be split into one, two or four PBs. Figure 2.4 lists
the supported modes of splitting a CB into PBs in case of inter prediction. From here on a PB
in the current frame is called an original block (OB) and a PB in the reference frame is called a
reference block (RB).
It is very important to find the best possible predictor during the ME, because a better
predictor means a better coding efficiency. The HEVC standard supports searching for
the best predictor in up to 16 RFs. The more RFs that are used, the better the predictor will
be, but the more computationally intensive the ME is. More important is how the search for the
best matching predictor in each RF is performed. The algorithm or strategy is not defined in the
standard and can be freely chosen by the designer. Due to the high computational load of ME, a lot
of research has been performed on developing strategies that try to find a predictor as good as
possible with a limited amount of computation. As a result, there are many strategies
and it is important to choose the right one for each application and its goal.
There are several evaluation metrics to measure how well an RB in the reference frame predicts
an OB in the current frame. Sum of squared errors (SSE), mean square error (MSE) and
mean absolute difference (MAD) are possible metrics. The most used metric is the sum of
absolute differences (SAD). Its value is calculated by computing the absolute difference between
each pixel in the OB and the corresponding pixel in the RB; these absolute values are summed
to form the SAD value. The SAD is used to calculate the distortion. As shown in Equation 2.1,
the distortion is calculated by adding an extra term to the SAD. This extra term is equal to a
constant value λ multiplied by the number of bits that are needed to represent the motion vector.
The RB that corresponds to the lowest distortion value is the best predictor for the OB. The SAD
is one of the simplest metrics that takes every pixel in both blocks into account. This simplicity
makes it easy to implement many SAD-computation components with a limited amount of
resources.
Distortion = SAD + λ×#bits(MV ) (2.1)
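The SAD metric and the distortion of Equation 2.1 map directly to a few lines of C. In this sketch, the λ value and the bit-cost of the motion vector are taken as plain integer parameters; how an encoder derives them is outside this illustration.

```c
#include <assert.h>
#include <stdlib.h>

/* Sum of absolute differences between an n x n original block (OB) and a
 * candidate reference block (RB), both stored row-major. */
unsigned block_sad(const unsigned char *ob, const unsigned char *rb, int n)
{
    unsigned s = 0;
    for (int i = 0; i < n * n; i++)
        s += (unsigned)abs((int)ob[i] - (int)rb[i]);
    return s;
}

/* Distortion = SAD + lambda * #bits(MV)   (Equation 2.1) */
unsigned distortion(unsigned sad, unsigned lambda, unsigned mv_bits)
{
    return sad + lambda * mv_bits;
}
```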
Besides the different evaluation metrics, there is also a wide range of search algorithms. The choice
of search algorithm decides which blocks in the reference frames will actually be compared with
the OB and which will not be taken into consideration.
Full Search Motion Estimation (FSME) is the most straightforward and thorough search
algorithm. It calculates and compares all SADs from all possible candidates in a square L×L
search area around the OB’s location in the RF. L/2 is called the search range. A motion estimation
algorithm that uses FSME with a search range of L/2 has to calculate and compare L×L SADs in
each reference frame, for each OB. Increasing the search range will thus quadratically increase
the number of SADs. It is an algorithm with superior performance in finding the best match, but in
return has the highest computational load.
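FSME over an L×L search area can be sketched directly from this description. The sketch below assumes the caller guarantees that the whole search area lies inside the reference frame; a real implementation must clip the candidates at the frame borders.

```c
#include <assert.h>
#include <stdlib.h>
#include <limits.h>

/* SAD between an n x n OB and the RB at position (rx, ry) in a reference
 * frame of width ref_w (row-major, no bounds checking). */
static unsigned sad_at(const unsigned char *ob, int n,
                       const unsigned char *ref, int ref_w, int rx, int ry)
{
    unsigned s = 0;
    for (int y = 0; y < n; y++)
        for (int x = 0; x < n; x++)
            s += (unsigned)abs((int)ob[y * n + x] -
                               (int)ref[(ry + y) * ref_w + rx + x]);
    return s;
}

/* Full search: evaluate every displacement (dx, dy) in [-L/2, L/2) around
 * the OB position (x0, y0) -- L x L candidates in total -- and keep the
 * motion vector of the minimum SAD. */
unsigned full_search(const unsigned char *ob, int n,
                     const unsigned char *ref, int ref_w,
                     int x0, int y0, int L, int *mvx, int *mvy)
{
    unsigned best = UINT_MAX;
    for (int dy = -L / 2; dy < L / 2; dy++)
        for (int dx = -L / 2; dx < L / 2; dx++) {
            unsigned s = sad_at(ob, n, ref, ref_w, x0 + dx, y0 + dy);
            if (s < best) {
                best = s;
                *mvx = dx;
                *mvy = dy;
            }
        }
    return best;
}
```

The two displacement loops make the quadratic cost explicit: doubling the search range quadruples the number of candidate SADs, as stated above.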
Three Step Search (TSS) [4] is a search algorithm that drastically reduces the number of
blocks that are compared, thus reducing the complexity. Compared to FSME, it has
near-optimal performance. A more significant drawback is that in the first step TSS uses a
uniformly allocated checking-point pattern, which is inefficient for smaller search ranges [4].
Diamond search (DS) [24] is another popular and widely used search algorithm. It greatly
outperforms TSS in accuracy and computational load, and is incorporated in many enhanced
search algorithms. A large diamond search pattern (LDSP) and a small diamond search pattern
(SDSP), with 9 and 5 points respectively, are used to locate the best match.
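The two checking-point patterns can be written down explicitly. The offset tables below follow the common description of DS (9-point LDSP, 5-point SDSP); the array names are ours.

```c
#include <assert.h>

/* Checking-point offsets (dx, dy) relative to the current search centre.
 * LDSP: the centre plus 8 points at city-block distance 2 (large diamond).
 * SDSP: the centre plus 4 points at distance 1 (small diamond).
 * DS repeats the LDSP, re-centring on the minimum, until the minimum
 * falls on the centre, and then refines once with the SDSP. */
static const int LDSP[9][2] = {
    {0, 0}, {0, -2}, {1, -1}, {2, 0}, {1, 1},
    {0, 2}, {-1, 1}, {-2, 0}, {-1, -1}
};
static const int SDSP[5][2] = {
    {0, 0}, {0, -1}, {1, 0}, {0, 1}, {-1, 0}
};
```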
Enhanced Predictive Zonal Search (EPZS) [18] is a state-of-the-art search algorithm. It
has near-FSME performance, but is difficult to implement efficiently in hardware due to its
irregular data-access pattern.
After the initial motion estimation, a sub-pixel motion estimation is performed to increase the
coding efficiency. In sub-pixel motion estimation, the region around the best-match prediction
block in the reference frame is interpolated, and this interpolated region is then searched for an
even better match. Sub-pixel motion estimation is outside the scope of this thesis.
2.2 FPGA
2.2.1 Introduction
FPGA stands for "Field-Programmable Gate Array". An FPGA contains a matrix of reconfigurable
gate array logic circuitry and is used to implement custom hardware functionality. When
configured, the internal circuitry is connected in a way that creates the hardware implementation
required by the application. Unlike hard-wired printed circuit board (PCB)
or application-specific integrated circuit (ASIC) designs, which have fixed hardware resources,
FPGA-based systems can literally rewire their internal circuitry to allow reconfiguration. Digital
computing tasks are described in software and compiled by synthesis tools into a bitstream that
contains the information on how the components should be wired together.
Figure 2.5: The basic structure of an FPGA
The structure of an FPGA consists of three major components (see Figure 2.5):
• Configurable logic blocks (CLBs) can be used to implement different functions with com-
binational and sequential logic. They are programmable to provide functionality as simple
Figure 2.6: A simplified example of a configurable logic block
as that of a transistor or as complex as that of a microprocessor. Common components
used in logic blocks are flip-flops (FF), look-up-tables (LUT), block random access memory
(BRAM) and multiplexers. Figure 2.6 illustrates a simplified example of a configurable logic
block.
• Interconnects consist of wire segments of varying lengths which can be interconnected
via electrically programmable switches. They provide routing paths to connect the inputs
and outputs of the CLBs and IOBs onto the appropriate networks.
• Input/Output blocks (IOBs) are the interface between the package pins and the internal
signals.
Currently there are three main technologies by which FPGAs can be programmed. Each technology
has its own advantages, which are discussed briefly below. Antifuse FPGAs are configured
by burning a set of fuses. Once the chip is configured, it cannot be altered anymore. Bug fixes
and updates are possible for new PCBs, but are very difficult or impossible for already manufactured
boards. They are used as ASIC replacements for small volumes. Flash FPGAs may be
re-programmed several thousand times and are non-volatile, which means that they keep their
configuration after power-off. The disadvantages are the use of the more expensive flash memory
and a re-configuration time of several seconds. SRAM FPGAs are currently the dominant technology.
The memory is volatile but can be re-programmed an unlimited number of times. Additional circuitry
is required to load the configuration into the FPGA after power-on, but re-configuration is very fast,
and some devices even allow partial re-configuration during operation. This enables new approaches
and applications that make use of run-time reconfiguration (RTR), where the circuitry is dynamically
loaded during the execution of applications.
Over the past decades FPGAs have evolved enormously. The first modern-era FPGA was
introduced by Xilinx in 1984 [19]. It contained 64 configurable logic blocks and 58 inputs and out-
puts [19]. Thanks to the semiconductor industry, the number of transistors on integrated chips has
increased greatly. This allows a modern state-of-the-art FPGA to contain approximately 1,950K
equivalent logic blocks and around 1200 inputs and outputs [22]. There is still an ongoing trend
of adding specialised blocks to FPGAs. Block RAM is added for fast on-chip data buffering. To
allow fast and area-efficient implementations of logical operations, shifting, addition and multiply-add
arithmetic, digital signal processing (DSP) blocks are integrated in the chips. Both embedded hard
(e.g. PowerPC) and soft (e.g. NIOS II) processors for FPGAs are also available [Leong]. Figure
2.7 shows an overview of the features of the FPGA that is used in this thesis. It provides a total
of 207360 look-up-tables (LUTs), 207360 flip-flops (FFs), 324 block random access memories
(BRAMs) and 192 digital signal processors (DSPs).
Figure 2.7: An overview of the main features of the Virtex-5 LX330T FPGA from Xilinx that is usedin this thesis [20].
The technology used in FPGAs results in costly interconnects and lower operating frequencies.
The associated high latency is compensated by deep pipelining and additional parallelism.
Unlike processors, FPGAs are truly parallel in nature. Different processing
operations do not have to compete for the same resources, if enough resources are provided by
the FPGA. Each independent processing task is assigned to a dedicated section of the chip, and
can function autonomously without any influence from other logic blocks. As a result, the performance
of one part of the application is not affected when more processing is added. This not only
affects the raw calculation performance but also the I/O throughput. As long as there are enough
I/O blocks provided by the FPGA and the bandwidth of the peripherals is not exceeded, adding
more inputs and/or outputs will not decrease the I/O performance of the already used logic on the
FPGA. The deep pipelining and highly parallel nature fit well with streaming/dataflow applications.
The decoupling of communication from computation in these applications, combined with
these techniques, allows for a very high throughput. Taking the parallel nature of FPGAs into
account, it is easy to see that they operate best on problems that can be easily and efficiently
divided into many parallel, often repetitive, computational tasks. The exceptionally high throughput
that can be achieved when combined with a dataflow model makes the FPGA an excellent platform
to solve data-intensive problems. Many applications fall into this class of problems, including image
processing and, in particular, full-search motion estimation, which is the target application of this
thesis.
FPGAs are particularly well suited to fixed-point calculations. These types of calculations require
only a small amount of logic to implement, which gives an FPGA an extremely high calculation
density. In motion estimation, the samples that are worked with are integer values and thus allow
for a high calculation density when implemented on an FPGA.
FPGA devices deliver the performance and reliability of ASICs, without the high non-recurring
cost (NRC) of their complex design flow. The less time-consuming FPGA design flow also allows
for a faster time to market. Nevertheless, it takes specialised skills to develop code for an FPGA.
FPGA development often takes much longer than an equivalent development task for a microprocessor
using a high-level language like C or C++. This is partly due to the tedious, iterative
nature of FPGA code development and the associated long synthesis/simulation/execution design
cycle. In the last few years, specialised FPGA development tools have dramatically decreased the
development time with the use of high-level synthesis tools.
2.2.2 Designing with FPGAs
The designer facing a design problem must go through a series of five phases between the initial
ideas and the final hardware. This series of phases is commonly referred to as the design flow. Most
projects start with a need for something. The first phase of the flow is specifying the requirements.
For example, in the particular case of this thesis the requirements can be summarised as: a motion
estimation accelerator with constraints on the amount of resources used and on the execution
time. The tools supplied by the different FPGA vendors to target their chips do not help the
designer in this phase. For the other four phases, which are discussed briefly below, there is a wide
variety of tools available.
• Design entry consists in transforming the design ideas into a computerised representation.
This is most commonly accomplished using Hardware Description Languages (HDLs). As
the name states, this language is used to describe the hardware. Most HDLs also have the
ability to simulate the behaviour of the described hardware to verify its correct functionality.
An HDL differs from conventional software in the sense that the statements are not sequentially
executed. Instead, the code is executed in parallel, since it describes hardware that
operates concurrently. The two most popular HDLs are Verilog and Very High Speed In-
tegrated Circuit HDL (VHDL). These languages describe the circuit at the register transfer
level (RTL), a design abstraction which models the flow of data between logic and registers.
• Synthesis The synthesis tool receives as input the HDL and the target FPGA model. The
input is used to generate a netlist which is a model at the level of logic gates. It satisfies the
logic behaviour specified in the HDL files and uses the primitives of the specified FPGA type.
The synthesis goes through many steps such as logic optimization, register load balancing,
and other techniques to enhance timing performance [Serrano].
• Mapping and Place-and-route The next step maps the netlist onto the FPGA. For each
component in the netlist, a component on the target FPGA is selected. During the routing
process, these components are connected with each other. This routing process has to take
many constraints into consideration. The most important constraint is the timing: the delay
between connected components. This delay is limited to a threshold in order to meet the
targeted frequency.
• Bit stream generation The configuration of the FPGA as specified in the place and routing
step is stored as a bitstream. This bitstream can be uploaded to the targeted FPGA to
configure it.
Design entry is the only step that requires human labour. The other steps are performed by
tools provided by the FPGA vendors. Describing the design at the register transfer level can be a
difficult task, especially for large designs, because RTL is a low-level design abstraction
that requires the logic to be described in great detail. This detailed description allows for thorough
control of the FPGA configuration but, on the other hand, consumes a lot of design time.
In recent years several high-level HDLs were developed to decrease the time spent in the
design phase. These languages describe the hardware at a higher level, and accompanying tools
are used to compile the description to the RTL level. An example is Simulink, a block diagram
environment that describes the hardware with functional blocks [11]. Another example is the Maxeler
platform, which uses custom Java libraries to describe the hardware at a high level. The next
section discusses the Maxeler platform that is used to implement the architecture proposed in
this thesis.
2.2.3 Dataflow programming with the Maxeler platform
The Maxeler platform offers an efficient way to design dataflow computing solutions. As illustrated
in Figure 2.8, a dataflow application differs fundamentally from a standard software application.
In a standard software application, the source code is transformed into a list of instructions.
These instructions are loaded into the memory together with the data of the application. During
the execution of the application, the instructions and data are loaded from the memory into the
processor. After each execution of an instruction, the result is written back into the memory before
a new instruction is executed. This sequential execution of instructions is limited by the latency of
data movement in this loop. Figure 2.8(a) shows an overview of a standard software application
with its memory loop.
In contrast, the source file of a dataflow application is a dataflow engine (DFE) configuration
file. This file describes the inner structure of the DFE. During the execution of the application,
no instructions are needed. The data is loaded from the memory and streamed to the dataflow
engine. In the dataflow engine, the data flows from core to core and is processed until
the final result is obtained. There is no need to write the result back to memory after each
operation on it. Only the final result that is obtained at the end of the dataflow engine is written
back to the memory, as described in Figure 2.8(b).
The design of the DFE is decoupled into two parts. On one side there are the kernels that
18
2.2 FPGA
(a) Standard software application (b) Dataflow application
Figure 2.8: An overview of a standard software application versus a dataflow application [17]
describe the data processing structure that is needed by the application. Arithmetic units and
data control elements such as multiplexers and fast on-chip memory are included. On the other side
there is one manager that describes the flow of the data stream. The manager interconnects the
dataflow between kernels and with the DRAM memory and the host application through PCIe. It
also handles the buffering of data and automates the conversion between different data path widths
for the streaming between entities with different I/O widths.
Not only are the design of computation and communication decoupled through the use of kernels
and a manager; their implementation on the FPGA is also decoupled, which is highly beneficial for
both communication and computation. The decoupling allows for deeply pipelined kernels
without synchronisation problems and for concurrent computation and dataflow,
which reduces the impact of the high latency that is characteristic of FPGAs.
Both the kernels and the manager are implemented with Maxeler’s custom Java libraries. The
MaxCompiler, that is part of the Maxeler platform, compiles the kernels and manager to VHDL
which is further used to generate a DFE configuration file. This configuration file, .max file, is used
to configure the FPGA and link it with the host application. The custom Java libraries and the
MaxCompiler allow the designer to describe complex hardware structures such as BRAMs, accumulators,
and PCIe- and DRAM-communication modules without the need for time-consuming
RTL design, using Java functions and the built-in automation of the MaxCompiler. The use of these
Java functions and the built-in automation comes at the cost of losing some control over the hardware
generated by the tool [5].
The Simple Live CPU Interface (SLiC) is an API used to control and communicate live with the
DFE from a host application. It allows the user to easily run different profiles on a DFE and send
and receive data from it without the need of developing specialised communication software. The
host application runs on a CPU and is, in this thesis, implemented in C. The various C functions
that are provided by the SLiC are created by the MaxCompiler during the compilation of the kernel
and manager, together with the .max configuration file. Figure 2.9 illustrates an overview of a DFE
and its connections.
Figure 2.9: The architecture of a DFE and its connections [16]
3 Video Coding Architecture
Contents
3.1 Streaming model
3.2 Hierarchical SAD computation
3.3 Streaming model with hierarchical SAD computation
3.4 Streaming Pattern
3.5 SadGenerator Architecture
3.6 Scalable SadGenerator Architecture
3.7 SadComparator Architecture
3.8 Summary
3. Video Coding Architecture
This chapter presents the architecture of a full search motion estimation (FSME) accelerator.
We have chosen to accelerate this component of the video encoder because, as mentioned before,
it is the most data- and computation-intensive one.
3.1 Streaming model
The FSME algorithm is very data and computation intensive. A streaming model is used to
process these huge amounts of data. In a streaming model, data is streamed through a highly
pipelined architecture. As shown in Figure 3.1, the accelerator has two input streams and one
output stream.
Figure 3.1: The streaming model of the FSME accelerator
The first input stream contains the pixels of the reference frame. The second input stream
brings the pixels of the original frame. The accelerator calculates the SADs of the original blocks
and their candidates in the search area of the reference frame. The accelerator’s output stream
contains the resultant motion vectors with the minimum SAD value for each original block. The
selection of the block size to be used for encoding the frame is performed by other components of
the encoder and is out of the scope of this work.
3.2 Hierarchical SAD computation
In hierarchical SAD computation, the prediction blocks can be of different sizes. From here on,
the prediction blocks are called original blocks (OBs). Let us assume the smallest size is A×A (A-OB)
and the biggest size is Z×Z (Z-OB). Depending on the Z/A ratio, there are a number of intermediate
block sizes. In Figure 3.2 the Z/A ratio is 4: Z is four times larger than A and there is one
intermediate block size B.
The SADs of blocks with different sizes can be hierarchically calculated by using the SADs
of the smaller blocks. For example, based on Figure 3.2, the SAD of a Z-OB (Z-SAD) can be
calculated from the SADs of B-OBs (B-SADs), as shown in Equation (3.2), and each B-SAD can
be calculated by adding the A-SADs, as shown in Equation (3.1).
Figure 3.2: Hierarchical SAD computation inside a Z-OB
B-SAD_0 = Σ_{i=0}^{3} A-SAD_i        B-SAD_1 = Σ_{i=4}^{7} A-SAD_i        (3.1)
B-SAD_2 = Σ_{i=8}^{11} A-SAD_i       B-SAD_3 = Σ_{i=12}^{15} A-SAD_i

Z-SAD = Σ_{i=0}^{3} B-SAD_i        (3.2)
In this example there is only one intermediate block size. With a higher Z/A ratio, many other
intermediate block sizes are possible. It is also possible to hierarchically calculate non-square
SADs by adding different groups of SADs. As long as the granularity is A×A all shapes are
possible.
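For the Z/A = 4 example above, the hierarchical combination of Equations (3.1) and (3.2) can be sketched in plain C. The A-SAD indexing follows the equations; in hardware this computation is realised as an adder tree rather than a loop, and the function name is ours.

```c
#include <assert.h>

/* Combine the 16 A-SADs of one Z-OB into 4 B-SADs (Equation 3.1) and the
 * Z-SAD (Equation 3.2).  b_sad[] receives the intermediate results. */
unsigned z_sad(const unsigned a_sad[16], unsigned b_sad[4])
{
    unsigned z = 0;
    for (int b = 0; b < 4; b++) {
        b_sad[b] = 0;
        for (int i = 4 * b; i < 4 * b + 4; i++)
            b_sad[b] += a_sad[i];    /* B-SAD_b = sum of A-SAD_{4b..4b+3} */
        z += b_sad[b];               /* Z-SAD   = sum of the four B-SADs  */
    }
    return z;
}
```

Non-square shapes follow the same principle: any rectangular group of A-SADs (or B-SADs) can be summed, as long as the granularity remains A×A.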
3.3 Streaming model with hierarchical SAD computation
In order to use the hierarchical SAD computation model, following the example in Figure 3.2, first
the SADs of all the A-OBs inside a Z-OB must be calculated. Once all the A-SADs of the Z-OB
are available, all the intermediate SADs up to the final Z-SAD can be hierarchically calculated.
Thus all the A-SADs of one Z-OB need to be stored before the intermediate SADs can be calculated
and the minimum SAD of each shape found. To accommodate the storage of the A-SADs, the
architecture can be further detailed as shown in Figure 3.3.
The accelerator consists of 3 blocks:
• SadGenerator: computes all the A-SADs of each Z-OB.
• Memory: stores the A-SADs.
• SadComparator: generates the Z-SAD and its intermediate SADs and compares them against
each other to find the MV of the minimum SAD for each OB.
3. Video Coding Architecture
Figure 3.3: Detailed model of the accelerator
The original frame is split into Z-OBs that are processed sequentially in a streaming manner.
In detail, this process consists of the following steps. The SadGenerator has as input a Z-OB from
the original frame and its search area from the reference frame. The A-SADs of the Z-OB are
calculated and streamed to the memory. When all the A-SADs of the Z-OB have been streamed to
the memory, the next Z-OB and its search area are processed. The iteration in which all the A-SADs
of a Z-OB are calculated is called a SadGenerator Z-iteration. In the meantime, the SadComparator
reads the A-SADs from the memory and hierarchically computes the intermediate and
Z-SADs. The output stream of the SadComparator contains the motion vectors of the minimum
SADs. The iteration in which all the A-SADs of a Z-OB are processed is called a SadComparator
Z-iteration. The memory stores two SAD-buffers: one into which the SadGenerator is
writing and one from which the SadComparator is reading. This is illustrated in Figure 3.4. When
the SadGenerator is writing into the even SAD-buffer, the SadComparator is reading from the odd
SAD-buffer, and vice versa.
Figure 3.4: The memory is split into an even and an odd SAD-buffer
The SadGenerator and the SadComparator run concurrently. Only during the generation
of the A-SADs of the first Z-OB is the SadComparator idle (since there are no A-SADs
in the memory yet), and during the processing of the A-SADs of the last Z-OB the SadGenerator
is idle (since there are no Z-OBs left to process).
Figure 3.5(a) shows the execution sequence of the SadGenerator's and the SadComparator's
Z-iterations in the case of a frame with 8 Z-OBs and when both iterations take exactly the same
amount of time. The overall performance of the accelerator is determined by the slower unit.
In Figure 3.5(b) the slower unit is the SadGenerator. After each iteration, the SadComparator
has to wait for the SadGenerator to finish calculating the A-SADs. In Figure 3.5(c) the slower
unit is the SadComparator. After each iteration, the SadGenerator has to wait for the SadComparator
to finish reading the A-SADs from the memory before it can write new A-SADs into that
same SAD-buffer. The total execution time for processing one frame is given by Equation (3.3),
where N_{Z-OBs} is the number of Z-OBs being processed, Δ_{slowZiteration} is the execution time
needed to process a single Z-OB by the slower unit (either the SadGenerator or the SadComparator),
and Δ_{fastZiteration} is the execution time needed by the faster unit. When
optimising the execution time of the system, the main goal is to bring the execution times of the
two units as close as possible, i.e., it is important to optimise the slower unit.
TotalExecutionTime = N_{Z-OBs} × Δ_{slowZiteration} + Δ_{fastZiteration} \quad (3.3)
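Equation (3.3) can be checked with a toy Python model (the time values below are illustrative, not measured):

```python
def total_execution_time(n_z_obs, slow_z_iteration, fast_z_iteration):
    # Equation (3.3): the two units overlap, so only the slower unit's
    # iteration time accumulates over all Z-OBs; the faster unit adds
    # one final iteration at the start or the end of the frame.
    return n_z_obs * slow_z_iteration + fast_z_iteration

# 8 Z-OBs; the SadGenerator takes 10 time units per Z-OB, the SadComparator 7:
print(total_execution_time(8, 10, 7))  # 87
```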
(a) Same speed  (b) Slower SadGenerator  (c) Slower SadComparator
Figure 3.5: An overview of the execution sequence of the SadGenerator's and the SadComparator's Z-iterations
3.4 Streaming Pattern
A – Original frame stream  Each original frame is divided into Z×Z blocks (Z-OBs). Z is
the dimension of the biggest prediction block in the motion estimation. During each SadGenerator
Z-iteration, one Z-OB is streamed to the SadGenerator. A Z-OB that is being
streamed to the SadGenerator is called an OB-chunk. Figure 3.6 shows the streaming pattern of
the original frame stream for a frame that is divided into 8×4 Z-OBs. It takes 8×4 SadGenerator
Z-iterations to stream and process all of these Z-OBs.
Figure 3.6: The streaming pattern of the Z-OBs in the original frame
B – Reference frame stream  Alongside each OB-chunk from the original frame, its reference-frame-chunk
(RF-chunk) is streamed to the SadGenerator. This RF-chunk contains the search
area of the Z-OB. The search area is the part of the reference frame (RF) in which a candidate
predictor is looked for. It is a square area with its centre positioned at the Z-OB's upper-left corner
coordinate in the original frame, as illustrated in Figure 3.7. With L the width of the search
area, equal to twice the so-called search range, L×L SADs are calculated for each A-OB
inside the Z-OB. The L×L search area inside the RF-chunk is extended by Z−1 pixels towards
the right and the bottom, such that the candidates that are on the border of the search area
can be evaluated as well. It is possible to look for a candidate in multiple reference frames. In this
case, the RF-chunk contains multiple search areas in order to calculate L×L SADs per reference
frame. Figure 3.7 shows the dimensions of the search area with 3 reference frames and their
positioning relative to the Z-OB in the original frame.
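The number of reference pixels streamed per Z-OB follows directly from this description (a sketch, assuming each of the reference frames contributes one full extended search area; the function name is ours):

```python
def rf_chunk_pixels(L, Z, n_ref_frames=1):
    """Pixels in one RF-chunk: the L x L search area extended by Z-1 pixels
    in each dimension so that border candidates fit, once per reference frame."""
    side = L + Z - 1
    return n_ref_frames * side * side

print(rf_chunk_pixels(L=64, Z=64))                  # 16129 pixels (127 x 127)
print(rf_chunk_pixels(L=64, Z=64, n_ref_frames=3))  # 48387
```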
C – SAD streams  The SAD stream is a continuous stream from the SadGenerator to the
memory. The stream can be divided into SAD-chunks. Each SAD-chunk contains all the A-SADs
of one Z-OB and is generated during one SadGenerator Z-iteration. Figure 3.8 shows the
Figure 3.7: The RF-chunk with multiple reference frames that is streamed alongside each Z-OB-chunk
streaming of the SAD-chunks into and out of the memory. The SAD-chunk numbers correspond to
the numbers of the Z-OBs: SAD-chunk 0 contains all the A-SADs of the first Z-OB. The SAD-chunks
are also streamed from the memory to the SadComparator. Except during the processing
of the first or the last Z-OB, there is always one SAD-chunk being streamed to the memory and
one SAD-chunk being streamed from the memory. In Figure 3.8, SAD-chunk 2 is being read from the
even SAD-buffer in the memory and SAD-chunk 3 is being written into the odd SAD-buffer in the
memory.
Figure 3.8: The SAD-chunks are streamed to and from the memory. Each SAD-chunk contains all the A-SADs of one Z-OB.
D – The MV stream  During one SadComparator Z-iteration, one MV-chunk is streamed from
the SadComparator. An MV-chunk contains the motion vector (MV) of the predictor of each OB
inside a Z-OB. In the case of Figure 3.2, there are 16 A-OBs, 4 B-OBs and 1 Z-OB, so
an MV-chunk contains 21 MVs, as shown in Figure 3.9.
Figure 3.9: The streaming of the MV-Chunks
An overview of the streaming is given in Figure 3.10. In this simplified example, the frame
is divided into 4 Z-OBs. Four OB-chunks are streamed to the SadGenerator alongside their 4
RF-chunks. Four SAD-chunks are streamed from the SadGenerator to the memory and then re-streamed
from the memory to the SadComparator. The SadComparator streams out four MV-chunks
with the motion vectors of all the blocks of the frame.
(a) The first SAD-chunk is generated and streamed to the even SAD-buffer in the memory
(b) The second SAD-chunk is generated while the first SAD-chunk is streamed to the SadComparator
(c) The third SAD-chunk is generated while the second SAD-chunk is streamed to the SadComparator
(d) The fourth SAD-chunk is generated while the third SAD-chunk is streamed to the SadComparator
(e) The final SAD-chunk is streamed through the SadComparator and the last MV-chunk is generated
Figure 3.10: An overview of the processing of a frame with a total of 4 Z-OBs
3.5 SadGenerator Architecture
As stated before, the FSME algorithm is very memory intensive, leading to huge memory
bandwidth requirements. In order to exploit as much as possible the data parallelism inherent to the
algorithm, while maintaining low memory bandwidth requirements, the proposed SadGenerator
architecture makes full use of a highly optimized structure that supports maximum reutilization of
the loaded data.

The core of the SadGenerator is an ALU-grid that computes the SADs of one
A-OB in parallel. With a search area of L×L, each A-OB has L×L SADs. The data needed to
calculate these SADs is the A-OB from the original frame and reference frame data (RF-data):
(L + A − 1) RF-lines with a length of (L + A − 1) pixels. A SAD is identified by its coordinates (x, y),
with x, y ∈ {0, ..., L−1}. The SADs are calculated while the RF-lines are streamed one by one.
One RF-line is used to calculate a maximum of A×L SADs for one A-OB: L because an A-OB only
has L SADs in the x-direction, and A because one line can be used by SADs with at most A different
y-coordinates. The utilisation of an RF-line in both dimensions is illustrated in Figure 3.11.
(a) The vertical utilisation of an RF-line is A. The topmost and bottommost prediction candidates using the RF-line are marked in yellow.
(b) The horizontal utilisation of an RF-line is L. The leftmost and rightmost prediction candidates using the RF-line are marked in yellow.
Figure 3.11: An RF-line is used by A×L SADs of one A-OB
The ALU-grid is a structure with A×L ALUs, each of which calculates a SAD value (see
Figure 3.12). SAD(x, y) is calculated by ALU(x, y mod A).
Figure 3.12: A×L ALUs are grouped in a grid
Each ALU in the ALU-grid is identical (see Figure 3.13). It has two inputs: A pixels of an A-OB
and A pixels of the RF-line. It calculates the absolute difference of the two vectors pixel-wise
and adds those values to an accumulator. After A accumulation steps, the accumulator
contains a complete SAD.
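The per-ALU behaviour can be modelled as below (a minimal sketch; `step` is fed one A-pixel slice of the OB and RF data per accumulation step, and the class name is ours):

```python
class Alu:
    """Model of one grid ALU: accumulates pixel-wise absolute differences."""

    def __init__(self):
        self.acc = 0

    def step(self, ob_pixels, rf_pixels):
        # Both inputs are vectors of A pixels; after A such steps the
        # accumulator holds a complete A x A SAD.
        self.acc += sum(abs(o - r) for o, r in zip(ob_pixels, rf_pixels))
        return self.acc

alu = Alu()
A = 4
for _ in range(A):            # A accumulation steps -> one SAD
    sad = alu.step([10] * A,  # toy original-block row
                   [12] * A)  # toy reference row
print(sad)  # 4 rows x 4 pixels x |10 - 12| = 32
```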
Figure 3.13: An ALU has as input A reference pixels and A original pixels, and as output an accumulated value
Processing one A-OB:  Figure 3.14(a) illustrates an A×L ALU-grid with A equal to four for
the sake of simplicity. Each line of ALUs in the grid has one colour. Figure 3.14(b) illustrates the
L×L SADs calculated by the ALU-grid. The SAD rows' colours in Figure 3.14(b) correspond to
the colours of the ALU rows computing them. Figure 3.14(c) illustrates the RF-data that is needed
to calculate the SADs of one A-OB. The coloured columns span the rows that are needed to
calculate the SADs of the row in the corresponding colour. Because the ALU-grid calculates
the SADs of only one A-OB at a time, a square of (L+A−1)×(L+A−1) pixels from the reference
frame is sufficient to calculate the L×L SADs.
(a) ALU rows calculating SADs  (b) SADs calculated with the use of RF-lines  (c) RF-lines used to calculate SADs with ALUs
Figure 3.14: An overview of which ALU rows calculate which SADs by using which lines from the search area
The first row of ALUs (3.14(a), blue) starts calculating the first row of SADs (3.14(b), blue)
when the first RF-line is streamed (3.14(c), row spanned by the blue column). The second row of
ALUs (yellow) starts calculating the second row of SADs (yellow) when the second RF-line is
streamed (row spanned by the yellow column). The second row of RF-data is also spanned by the
blue column: this RF-line is also used by the first row of ALUs to calculate the first row of SADs.
When the fourth RF-line is streamed (row spanned by the red column), the fourth ALU row (red)
starts calculating the SADs of the fourth row (red). At the same time, the first ALU row (blue) is
finishing the SADs of the first row (blue). When the fifth RF-line is streamed, the first
row of ALUs starts calculating the fifth row of SADs (blue). In this way, from the streaming of
the fourth RF-line on, each ALU is continuously calculating SADs and each RF-line is used by all
four ALU rows.
The ALU-grid computes A×L SADs of an A-OB in parallel. This way of calculating the SADs
greatly reuses the RF-lines' data, resulting in a reduced bandwidth of reference frame data. When
calculating the SADs in the traditional way, each SAD is calculated separately, requiring A×A pixels
from the reference frame per SAD. This results in a total of (A×A)×(L×L) pixels for the calculation
of the L×L SADs of one A-OB. The proposed architecture only requires the streaming of
(L+A−1)×(L+A−1) pixels for calculating the same amount of SADs. With A equal to 8 and L equal to 64,
this results in a bandwidth decrease ratio of 52× ((8×8)×(64×64)/(71×71) = 262144/5041)
compared to the traditional way.
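This reduction ratio can be reproduced numerically (a sketch; the function name is ours):

```python
def reuse_ratio_single_ob(A, L):
    """Bandwidth reduction from reusing RF-lines across the candidates
    of one A-OB, versus fetching A x A pixels per SAD."""
    traditional = (A * A) * (L * L)  # pixels fetched without any reuse
    streamed = (L + A - 1) ** 2      # pixels streamed with full reuse
    return traditional / streamed

print(round(reuse_ratio_single_ob(A=8, L=64)))  # 52  (262144 / 5041)
```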
Figure 3.15: The output pattern of the SadGenerator when calculating the SADs of a single A-OB, named OB 0, at a time
The output pattern of the SADs can be derived from Figure 3.15. This figure illustrates which
SADs the ALU-grid starts to calculate during the streaming of the RF-lines. When the
first RF-line is streamed, the ALU-grid starts calculating L SADs with y-coordinate (SADy) equal to 0;
their x-coordinates (SADx) range from 0 to L−1. When the second RF-line is streamed, the ALU-grid
starts calculating L SADs with y-coordinate equal to 1. Since the pattern of the beginning of the SAD
calculation is the same as the pattern of its ending, the pattern in Figure 3.15
is also the output pattern of the SadGenerator.
Processing one row of A-OBs:  An even more efficient use of the streamed RF-lines is
achieved when they are used to calculate the SADs not of 1 A-OB, but of several A-OBs
at a time. The reason is illustrated in Figure 3.16: the RF-data of nearby A-OBs have a huge
overlapping area. An RF-line of L+A−1 pixels is overlapped by L−1 pixels of the next A-OB's
RF-line.
(a) Z-OB with four highlighted A-OBs  (b) The RF-data of four consecutive A-OBs overlap each other
Figure 3.16: By extending the size of the RF-line, a row of A-OBs uses the RF-line to calculate all their SADs
When a larger RF-line of L+Z−1 pixels is streamed instead of L+A−1 pixels, the RF-line
can be used to calculate the SADs of the (Z/A) A-OBs that are on the same horizontal line inside
the Z-OB, as shown in Figure 3.16(a). For each RF-line that is streamed, the ALU-grid goes
through (Z/A) so-called OB-iterations. During an OB-iteration, each ALU calculates the SAD of
one A-OB and stores the result in an accumulator. During the next OB-iteration, the ALUs
calculate the SADs with the same (x, y) coordinates, but for a different A-OB. Each ALU in the
ALU-grid therefore has to store an accumulator value for each of the (Z/A) A-OBs in the row.

When streaming RF-lines for each A-OB separately, calculating the SADs of Z/A A-OBs
would require streaming (Z/A) times (L+A−1)×(L+A−1) pixels. When streaming a larger
RF-line and allowing the ALUs to store an accumulator value for each A-OB, only (L+Z−1)×
(L+A−1) pixels need to be streamed. With A equal to 8, Z equal to 64 and L equal to 64, this
results in a bandwidth decrease ratio of 4,47× ((71×71×8)/(71×127) = 40328/9017) compared
to processing one A-OB at a time. Compared to the traditional way, the bandwidth is
decreased 232,6×.
When calculating the SADs of one row of A-OBs at a time, the output pattern of the SADs is
different from when calculating the SADs of one single A-OB at a time. The output pattern
can be observed in Figure 3.17. In this example, the Z/A ratio is 4, which means the SADs of four
A-OBs are calculated after streaming each RF-line. When the first RF-line is streamed, the ALU-grid
starts calculating L SADs with y-coordinate (SADy) equal to 0. It does this first for A-OB 0, then
for A-OBs 1, 2 and finally 3. After these four OB-iterations, a new RF-line is streamed,
which is processed in the next 4 OB-iterations.
Figure 3.17: The output pattern of the SadGenerator when calculating the SADs of one row of A-OBs at a time
Processing multiple rows of A-OBs:  An even more efficient use of the streamed RF-lines is
achieved when they are used to calculate the SADs not of 1 A-OB row, but of all the A-OBs
inside a Z-OB at a time. The reason is illustrated in Figure 3.18. As above, the RF-data of nearby
A-OBs have a huge overlapping area: of the L+A−1 RF-lines used by a single A-OB row,
L−1 are also used by the adjacent A-OB row. Each ALU in the ALU-grid has to
store an accumulator value for each of the (Z/A)² A-OBs.

When streaming RF-lines for each A-OB row separately, calculating the SADs of the (Z/A)² A-OBs
would require streaming (Z/A) times (L+Z−1)×(L+A−1) pixels. When processing all A-OBs
at the same time by streaming more RF-lines and allowing each ALU to store an accumulator
value for each A-OB, only (L+Z−1)×(L+Z−1) pixels need to be streamed. With A equal
to 8, Z equal to 64 and L equal to 64, this results in a bandwidth decrease ratio of again 4,47×
((71×127×8)/(127×127) = 72136/16129) compared to processing one A-OB row at a time.
Compared to the traditional way, the bandwidth is decreased 1040×.
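The cumulative effect of the three streaming strategies can be reproduced from the pixel counts alone (a sketch; exact ratios differ slightly from products of rounded intermediate factors, and the function name is ours):

```python
def bandwidth_ratios(A, Z, L):
    """Reference-pixel reduction ratios per Z-OB for the three strategies."""
    n_obs = (Z // A) ** 2                            # A-OBs inside a Z-OB
    traditional = n_obs * (A * A) * (L * L)          # no reuse at all
    per_ob = n_obs * (L + A - 1) ** 2                # one A-OB at a time
    per_row = (Z // A) * (L + Z - 1) * (L + A - 1)   # one A-OB row at a time
    all_rows = (L + Z - 1) ** 2                      # whole Z-OB at a time
    return [round(traditional / p, 1) for p in (per_ob, per_row, all_rows)]

print(bandwidth_ratios(A=8, Z=64, L=64))  # [52.0, 232.6, 1040.2]
```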
When calculating the SADs of multiple A-OB rows at a time, the output pattern of the SADs is
different from when calculating the SADs of a single A-OB row at a time. The output pattern
can be observed in Figure 3.19. In this example, Z equals 32 and A equals 8, resulting in a
Z/A ratio of 4. This means the SADs of at least 4 A-OBs are calculated after streaming an RF-line.
When the first RF-line is streamed, the ALU-grid starts calculating L SADs with y-coordinate
(SADy) equal to 0. It does this first for A-OB 0, then for A-OBs 1, 2 and finally 3. After
these four OB-iterations, a new RF-line is streamed, which is again processed in 4 OB-iterations.
From RF-line 0 until A−1, the ALU-grid starts calculating only the SADs of the first row
of A-OBs. This is because the first A streamed RF-lines belong only to the search area of the first
row of A-OBs.

In Figure 3.19, A equals 8. When RF-line 8 is streamed, this line is used to start calculating
the SADs of both the first and the second row of A-OBs. During the first 4 OB-iterations, the ALU-
(a) Z-OB with four highlighted A-OB rows  (b) The RF-data of four consecutive A-OB rows overlap each other
Figure 3.18: By extending the number of RF-lines, multiple rows of A-OBs use the RF-lines to calculate all their SADs
grid uses the RF-line to start calculating the SADs with y-coordinate 8 for the first A-OB row.
During the other 4 OB-iterations, the ALU-grid uses the RF-line to start calculating the SADs with
y-coordinate 0 for the second A-OB row. These SADs have y-coordinate 0 because the RF-line
is the first line (y-coordinate 0) of the search area of the second row of A-OBs. From RF-line 8 to
15, each RF-line is used to start calculating the SADs of 8 A-OBs. From RF-line 16 to 23, each
RF-line is used to start calculating the SADs of 12 A-OBs.

As seen in Figure 3.17, RF-line L−1 is the last line for which the ALU-grid starts calculating
SADs of the first row of A-OBs. This is why from RF-line L to L+7, the RF-line is no longer used
to start calculating new SADs for the first A-OB row, but only for the last 3 A-OB rows.
After every further 8 streamed RF-lines, the RF-lines are used by one A-OB row fewer. This complex
streaming pattern can be a drawback when the SADs need to be further processed in a specific
order.
During the different OB-iterations, different parts of the RF-line are sent to the ALU-grid to
process the SADs of one A-OB, as shown in Figure 3.20. Each sub-RF-line is shifted A pixels to
the right relative to the previous one. This is because the A-OBs that are in the same A-OB row differ
by A pixels in the horizontal dimension in the original frame, and thus so do their search areas in the
reference frame.
The processing sequence when streaming the RF-lines is illustrated in Figure 3.21. Streaming
an RF-line that is used to calculate not only the SADs of one A-OB, but of several A-OBs greatly
Figure 3.19: The output pattern of the SadGenerator when calculating the SADs of all of a Z-OB's A-OBs at a time, with A=8 and Z=32
decreases the bandwidth of the RF-data.
Figure 3.22 gives an overview of the SadGenerator. The RF-data is streamed RF-line by
RF-line. Each RF-line is processed by the selector, which sends a sub-RF-line to the ALU-grid.
The ALU-grid computes the SADs of the number of A-OBs that is streamed to it:
1 when processing a single A-OB, (Z/A) when processing a row of A-OBs, or
(Z/A)² when processing all A-OB rows. During each OB-iteration, the Output-select selects the L
accumulated values of the ALU-grid row that has SADs available and sends them to the output.
The output pattern of the SADs depends greatly on how many A-OBs are being processed at the same
time.

The architecture that computes the SADs for multiple A-OBs, depicted in Figure 3.22, greatly
reduces the bandwidth of the RF-data in three steps.
The combination of these three steps achieves a total bandwidth reduction ratio of
Figure 3.20: During different OB-iterations, another sub-RF-line, part of the RF-line, is sent to the ALU-grid
Figure 3.21: An overview of the SAD calculation sequence
1040× (≈ 52 × 4,47 × 4,47).
3.6 Scalable SadGenerator Architecture
A scalable architecture allows the designer to find a trade-off between performance and resource
footprint. It allows the architecture to be implemented both in high-performance systems
and in systems where a small resource footprint is the priority. Fine-grained scalability allows the system
to adapt to the resource usage or the speed of the other components of the video encoder, or of
the whole system that is possibly implemented on the same chip. The SadGenerator architecture has two
mechanisms that allow it to scale; both are discussed in this section.

The A×L ALU-grid can become too large when the resources are limited or when L is very
large due to a large search area. The first mechanism is horizontal parallelization: H, the width of the ALU-grid,
can be reduced to any divisor of L: L/2, L/4, ..., down to 1 (see Figure 3.23).
When the width of the ALU-grid is L/2, each ALU in the grid calculates 2 SADs instead of 1,
to make up for the second half of the ALU-grid that is no longer present. ALU(0,...) will
calculate SAD(0,...) during the first horizontal iteration (H-iteration) and SAD(L/2,...) during the
second H-iteration. In order to be able to calculate this extra SAD, each ALU needs to double
its number of accumulator values; the arithmetic part of the ALU remains unchanged. Overall,
SAD(x, y) will be calculated by ALU(x mod H, y mod A). Each ALU will be slightly bigger due to
the extra accumulators, but the number of ALUs will be significantly reduced, resulting in less
resource usage. Moreover, for each halving of H, the processing time is doubled because each
ALU needs L/H H-iterations.
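The coordinate-to-ALU mapping under horizontal parallelization can be sketched as follows (the function name is ours):

```python
def alu_for_sad(x, y, H, A):
    """Which ALU computes SAD(x, y), and during which H-iteration.

    SAD(x, y) is handled by ALU(x mod H, y mod A); the H-iteration index
    x // H selects which of the L/H accumulator sets the ALU uses.
    """
    return (x % H, y % A), x // H

# With L = 64 halved to H = 32 and A = 8:
print(alu_for_sad(0, 0, H=32, A=8))   # ((0, 0), 0)  first H-iteration
print(alu_for_sad(32, 0, H=32, A=8))  # ((0, 0), 1)  same ALU, second H-iteration
```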
The ALU-grid now needs a sub-RF-line of H+A−1 pixels instead of L+A−1 pixels. During
Figure 3.22: An overview of the SadGenerator
(a) The maximum horizontal parallelization: H=L  (b) Horizontal parallelization equal to L/2  (c) Horizontal parallelization equal to L/4  (d) The minimum horizontal parallelization: H=1
Figure 3.23: Four different horizontal parallelizations of the ALU-grid
each H-iteration, the RF-selector selects H+A−1 pixels from the RF-line and sends them to
the ALU-grid. Figure 3.24 illustrates the selection of the sub-RF-line of length H+A−1. The first
selection, OB-select, selects the L+A−1 pixels that are used by the OB that is currently being
processed; this selection depends on the OB-iteration. The next selection, Horizontal-select,
selects H+A−1 pixels; this selection depends on the H-iteration.
Another approach for reducing the resource usage is to reduce the footprint of each individual
ALU. In the standard configuration, an ALU has two vectors of size A as input that are processed at
once: one A-vector originates from the A-OB and the other from the RF-line. It is
possible to scale the ALU down by calculating the absolute values and adding them together in
several iterations. Instead of an input size of A, any divisor of A can be used: A/2, A/4, ..., down to
1. This number is called P and corresponds to the processing power of the ALU. Halving P
reduces the footprint significantly, but also doubles the processing time that is needed.
Figure 3.25(a) shows an ALU with P equal to A. Only one P-iteration is needed to process the
Figure 3.24: From the streamed RF-line, H+A−1 pixels are sent to the ALU-grid per H-iteration
two A-vectors. When P is equal to A/2 as in Figure 3.25(b), the two A-vectors are split into four
A/2-vectors. During two P-iterations, these four A/2-vectors are processed, two at a time.
Figure 3.25: Two different processing power configurations of the ALUs
With the scalability of the horizontal parallelization of the ALU-grid and of the processing power of
each individual ALU, the SadGenerator can be optimised for a smaller footprint at the cost of lower
performance due to the larger number of iterations required. With a horizontal parallelization H
equal to L and a processing power P equal to A, each RF-line is processed in (Z/A) OB-iterations
with no extra iterations. The situation in which H is equal to L/4 and P is equal to A/8 is illustrated
in Figure 3.26. Each OB-iteration is split into 4 H-iterations and each H-iteration is in turn split
into 8 P-iterations. This makes the design 8×4 = 32 times slower than the H=L, P=A design.
3.7 SadComparator Architecture
The SadComparator receives all the A-SADs of a Z-OB in a SAD-chunk. It calculates the
minimum distortion for each A-OB and outputs the corresponding motion vector. The distortion is
based on the SAD value (see Equation 2.1). Besides the A-OBs, it also calculates the minimum
distortion of the other OBs. These OBs are composed of A-OBs and can have any shape, as
long as the granularity is A×A. In order to calculate the minimum distortion of the other OBs,
their SADs need to be calculated by adding the correct A-SADs. In the situation of Figure 3.2,
Figure 3.26: The calculation sequence of an H=L/4, P=A/8 configuration
there are 16 A-OBs in one Z-OB with 4 intermediate B-OBs. Four A-SADs are added to form
one B-SAD, and four B-SADs are added to form one Z-SAD. For each A-OB, B-OB and Z-OB,
the SadComparator calculates a motion vector that corresponds to its minimum distortion.
Figure 3.27 gives an overview of the SadComparator architecture in this case.
Figure 3.27: An overview of the SadComparator in the situation of Figure 3.2
The SAD-chunk contains L×L A-SADs for each of the 16 A-OBs. The A-SADs are grouped
per 16 SADs; these 16 SADs have the same motion vector but originate from different A-OBs.
The first group of 16 A-SADs contains the SADs with coordinate (0, 0). This group is processed
by the "A-Minima-Comparator", a block that holds, for each of the 16 A-OBs, the minimum distortion value
and its motion vector. The group is also processed by a cascade of two 4-to-1 adders, each of which
adds 4 SADs into the single SAD value of a bigger OB. After the first 4-to-1 adder,
the 16 A-SADs are transformed into 4 B-SADs; after the second 4-to-1 adder, the 4 B-SADs are
transformed into 1 Z-SAD. The group of four B-SADs is processed by the "B-Minima-Comparator",
which holds, for each of the 4 B-OBs, the minimum distortion value and its motion vector. The Z-SAD
is processed by the "Z-Minima-Comparator", which holds the minimum distortion value and
its motion vector. After streaming L×L groups of 16 A-SADs, the Minima-Comparators' MVs are
streamed out as an MV-chunk.
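This flow can be sketched as a small Python model (using the raw SAD as the distortion for simplicity; the thesis's distortion of Equation 2.1 also weighs the MV cost, and the function name is ours):

```python
import math

def sad_comparator(sad_groups):
    """Sketch of the SadComparator for the Figure 3.2 case (16 A-OBs,
    4 B-OBs, 1 Z-OB).

    sad_groups yields (mv, a_sads): the 16 A-SADs that share candidate
    motion vector mv, one group per search position."""
    best = {"A": [(math.inf, None)] * 16,
            "B": [(math.inf, None)] * 4,
            "Z": [(math.inf, None)]}
    for mv, a_sads in sad_groups:
        b_sads = [sum(a_sads[4 * j:4 * j + 4]) for j in range(4)]  # 1st 4-to-1 adder
        z_sads = [sum(b_sads)]                                     # 2nd 4-to-1 adder
        for level, sads in (("A", a_sads), ("B", b_sads), ("Z", z_sads)):
            for i, sad in enumerate(sads):
                if sad < best[level][i][0]:  # Minima-Comparator update
                    best[level][i] = (sad, mv)
    return best

# Two candidate MVs; the second one wins at every level:
groups = [((0, 0), [5] * 16), ((1, 2), [4] * 16)]
print(sad_comparator(groups)["Z"])  # [(64, (1, 2))]
```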
The designer is free to choose which OB sizes are taken into consideration. It is, for example,
possible to not take the A-OBs into consideration and only look for the MVs of the B-OBs and Z-OBs.
In this case, the SadComparator does not need the "A-Minima-Comparator". However, in the
most likely case that all the square OBs are taken into consideration, each MV-chunk has (Z/A)×(Z/A)
A-MVs, and the number of MVs is four times smaller for each step up the OB hierarchy.
3.8 Summary
This chapter describes the architecture of an FSME accelerator. It is the most data and com-
putational intensive component of the video encoder. A streaming model is used to process the
huge amounts of data that is needed for the FSME. Reference frame and original frame data are
streamed into the accelerator and motion vectors are streamed out of the accelerator. The motion
vectors are processed by other components of the encoder that are out of the scope of this thesis.
The core of the accelerator consists of two units: the SadGenerator and the SadComparator. The
SadGenerator calculates the SADs of A×A prediction blocks. The SadComparator uses these
SADs to hierarchically calculate the SADs of larger prediction blocks, up to the Z×Z block.
The SadGenerator uses a 2D ALU-grid to calculate the A×A SADs in an extremely parallel
way. The ALU-grid has two parameters: horizontal parallelization H and processing power P.
Both parameters control the ALU-grid’s level of parallelization. Furthermore, three designs of the
SadGenerator are proposed. The first design reuses the reference data for adjacent candidate
blocks within the search area, achieving a bandwidth reduction of 52×. The second design reuses
the search area’s reference data for horizontally adjacent original blocks, increasing the bandwidth
reduction to 232.44×. Finally, the third design reuses the search area’s reference data for
vertically adjacent original blocks, achieving a total bandwidth reduction of 1101×. In order to
achieve such gains, there is a trade-off between the amount of data reuse that can be performed
within the accelerator and the complexity of the SadGenerator’s output stream of A×A SADs.
The proposed architecture is highly reconfigurable: on top of the ALU-grid’s H and P parameters,
the A×A and Z×Z sizes of, respectively, the smallest and largest prediction blocks can be chosen.
Furthermore, the L×L size of the search area and the number of reference frames are also con-
figurable. This flexibility allows the designer to easily optimize the architecture to given design
constraints according to the trade-offs mentioned above.
3. Video Coding Architecture
4 Hardware Accelerator for High Efficiency Video Coding
Contents
4.1 Platform features and restrictions
4.2 Design decisions
4.3 SadComparator implementation
4.4 SadGenerator implementation
4.5 Host Code
4.6 Summary
In order to evaluate the methodology proposed in the previous chapter, a hardware environment
was selected. This chapter first describes the details of the hardware platform. Next, the
design decisions are discussed, followed by details about the SadComparator’s and SadGenerator’s
implementations. Finally, the host code that controls the hardware implementation is discussed.
4.1 Platform features and restrictions
The architecture is implemented on a MAX2 board with the Maxeler framework. The MAX2
board features a Virtex-5 LX330T FPGA and 12 GB of RAM (6× 2 GB SODIMMs). Section 2.2.3 gives an
overview of the Maxeler framework. The Virtex-5 LX330T was introduced by Xilinx in 2009 and provides
high-performance logic with advanced serial connectivity. It has 11,664 Kb of Block-RAM, 51,840
Virtex-5 slices and a 240×108 array of CLBs with 3,420 Kb of Distributed-RAM. More information
about the Virtex-5 family and the LX330T model can be found in [20].
When implementing the architecture on the MAX2 board, three major restrictions have to be
taken into consideration:
• The streaming widths of the kernels’ inputs and outputs have to be a multiple or integer
divisor of 48 bytes, because the internal stream width of the system is 48 bytes. Using a
non-conforming stream width results in sub-optimal performance and resource usage.
• The LMem (the off-chip memory) can only be accessed starting at burst-aligned byte
addresses. The burst size is 96 bytes. Thus, when the bytes at addresses 95 and 96 need to
be accessed, two bursts need to be streamed, from byte 0 up to byte 191, in order to access
them.
• The LMem can only be accessed with two streaming patterns: a Linear streaming pattern
and a Strided streaming pattern. When another pattern is needed, this restriction can be
bypassed in some inefficient ways. One way is using a combination of these two patterns
over multiple actions. Splitting the design into multiple actions has the disadvantage of
introducing overhead and of losing the internal state of the kernel when switching from
one action to the other. The internal state of the kernel can be saved into the on-chip
FMem, but this requires load and store cycles and resources. Another way of achieving a
different streaming pattern is to stream the data with one of the two allowed patterns to
the kernel and modify the stream inside the kernel itself. This requires storing a significant
amount of data in the kernel’s FMem and reading the data out in such a way that the
desired pattern is achieved.
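The burst-alignment restriction can be modelled with a small helper that, for a requested byte range, returns the burst-aligned range that actually has to be streamed (a sketch, assuming the 96-byte burst size of the MAX2 board; the function name is illustrative):

```python
BURST = 96  # LMem burst size on the MAX2 board, in bytes

def aligned_read(first_byte, last_byte):
    """Return the burst-aligned (start, length) covering bytes first_byte..last_byte."""
    start = (first_byte // BURST) * BURST     # round down to a burst boundary
    end = ((last_byte // BURST) + 1) * BURST  # round up to the next boundary
    return start, end - start
```

For the example above, accessing bytes 95 and 96 yields `aligned_read(95, 96) == (0, 192)`: two full bursts, bytes 0 up to 191.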
4.2 Design decisions
The architecture proposed in the previous chapter has a smallest block size of A×A pixels and a
largest block size of Z×Z pixels. To comply with the HEVC standard, Z is chosen to be 64, the
largest allowed size, and A is chosen to be 8. With these design choices, each Z-OB contains 64
A-OBs. Every pixel is represented by one byte.
The processing flow is as follows: i) The original blocks are streamed from the Host
to the SadGenerator. The Host is the CPU that controls the accelerator over a PCIe link. ii)
The Reference Frames are streamed from the Off-Chip Memory to the SadGenerator. iii) The
A-SADs that are calculated by the SadGenerator are streamed to the Off-Chip Memory. iv) The
SadComparator reads the stream of A-SADs from the Off-Chip Memory and streams the MVs to
the Host.
The SadGenerator and SadComparator are implemented as two separate Kernels. They do
not communicate directly with each other, but only via the Off-Chip memory.
Figure 4.1: An overview of the architecture on the FPGA (the Host exchanges OBs and MVs over the PCIe link; the Reference Frames and A-SADs pass through the Off-Chip Memory)
For the sake of simplicity, and because there is a direct dependency between how the SadGenerator
stores the data into the off-chip memory and the pattern in which the data is consumed by the
SadComparator, the following sections start by presenting the architecture implementation of the
SadComparator and only then that of the SadGenerator.
4.3 SadComparator implementation
The SadComparator architecture, which is discussed in Section 3.7, needs as input (Z/A)²
SADs from (Z/A)² different A-OBs. These SADs have the same (x, y, r) coordinates, with x, y the
coordinates of the prediction candidate in the search area, and r the index of the search area’s
reference frame. Section 3.5 describes three possible SadGenerator architectures: an architecture
processing one A-OB at a time, an architecture processing (Z/A) A-OBs at a time, and an
architecture processing all (Z/A)² A-OBs inside the Z-OB at a time. The SadGenerator architecture
that processes all A-OBs inside the Z-OB at once during the streaming of the reference frame has
a very complex streaming pattern of SADs, as shown in Figure 3.19. The (Z/A)² SADs with the
same motion vector are spread throughout the stream and cannot be accessed at the same time by
an efficient implementation.
To ensure that the SadComparator is able to access the SADs in a more efficient way, the
SadGenerator has to process the A-OBs row by row. Instead of processing (Z/A)² A-OBs at a
time, (Z/A) A-OBs are processed at a time. Besides enabling a more efficient way of accessing the
SADs, this also simplifies the SadGenerator’s architecture. The disadvantage is a bandwidth
increase by a factor of 4.46, as explained in Section 3.5. Figure 3.17 illustrates the streaming
pattern of the SADs when processing (Z/A) A-OBs at a time. Figure 4.2(a) shows exactly the same
streaming pattern, but for A=8 and Z=64, resulting in 64 A-OBs inside one Z-OB. The figure also
shows the complete streaming pattern, of all SADs, with two reference frames. First, all the SADs
from the first A-OB row are calculated. The SADs from the first reference frame are illustrated in
dark blue and the SADs from the second reference frame are illustrated in light blue. Thereafter,
the SADs from the second A-OB row are streamed, illustrated in yellow. The same pattern
continues until the SADs of the last A-OB row are streamed, illustrated in green. The (Z/A)² = 64
SADs that the SadComparator needs to access in parallel are still scattered around inside the
SAD stream. This is illustrated by the black dots in Figure 4.2(a), which represent the SADs with
motion vector (0, 0, 0). The SadComparator needs to access these SADs, and all the other groups
of SADs with the same MV, in parallel.
It is possible to bring these SADs closer together by not storing the stream linearly in
the DRAM, but with a strided pattern. The result of writing the stream with a strided pattern into
the memory is illustrated in Figure 4.2(b). With this organization, all the SADs with the same
motion vector are stored on the same line, but they are still separated by 63 other SADs. One line
of SADs now contains L groups of SADs from 64 A-OBs.
In order to access the 64 SADs in parallel, a Transposer is added to the SadComparator,
as illustrated in Figure 4.3. The goal of the Transposer is to read out the SADs linearly from
the memory and output them in groups of (Z/A)² SADs with the same motion vector, such that
they can be processed by the rest of the architecture. The Transposer consists of two memory
banks and four barrel shifters, as illustrated in Figure 4.4. The memory banks are local memories,
implemented inside the SadComparator kernel with BRAMs.

Figure 4.2: (a) The SadGenerator’s SAD output stream pattern; (b) the storing pattern of the SADs in the memory
The Transposer reads L SADs from one OB and stores them into a memory bank. A memory
bank consists of L memory stripes, each with a depth of (Z/A)² SADs. Each SAD from the OB is
stored into a different memory stripe. The first L SADs are stored at level one: each SAD is stored
into a memory stripe at address 0. The second L SADs are barrel-shifted by one to the right and
then stored at level two: into a memory stripe at address 1. The third L SADs are barrel-shifted
by two to the right and stored at level three: into a memory stripe at address 2. This goes on until
the (Z/A)²-th group of L SADs is streamed, as illustrated in Figure 4.5(a). It is now possible to read
all (Z/A)² SADs from different A-OBs, but with the same motion vector. Those values are aligned
diagonally through the memory bank and have the same number in the figure. They can be loaded
by accessing the correct address in each memory stripe. The loaded result is barrel-shifted to the
left such that the SAD from OB0 is always the first SAD in the row. It is important that the SADs
are always organised from OB0 to OB63, because the SADs are later added by the 4-to-1 adders.
These 4-to-1 adders add the A-SADs, which form an intermediate SAD after the addition.
There are (Z/A)² cycles needed to store the values into a memory bank. The values are
read out in L cycles. During the readout of the values, the next SADs are stored into the second
memory bank. In this way, after the first store, there is a continuous flow of SADs in the correct
format. In the case of a search area width L = 64, A = 8 and Z = 64, L equals (Z/A)² and there are
as many write cycles as read cycles. When the search area width L = 128, there are more reads
than writes in a memory bank. This situation is illustrated in Figure 4.5(b).

Figure 4.3: SadComparator with Transposer (the Transposer feeds two 4-to-1 adder stages and the A-, B- and Z-Minima-Comparators, which output the 16 A-MVs, 4 B-MVs and 1 Z-MV of the MV-chunk)

Figure 4.4: The Transposer consists of two memory banks that allow for simultaneous read and write capability

Figure 4.5: The transposer: (a) when L equals (Z/A)²; (b) when L is bigger than (Z/A)²
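The store-shift-read behaviour of one memory bank can be modelled in software as follows. This is a behavioural sketch, not the MaxJ implementation: `sad_stream` holds one row of L SADs per A-OB, the store phase applies the right barrel shift per level, and the read phase walks the diagonals and reorders them so OB0 comes first.

```python
def transpose_sads(sad_stream, L, n_ob):
    """Group n_ob rows of L SADs (one row per A-OB) by motion-vector index."""
    # Store phase: row `depth` is barrel-shifted right by `depth`,
    # then written at address `depth` of every memory stripe.
    bank = [[None] * n_ob for _ in range(L)]       # bank[stripe][depth]
    for depth, row in enumerate(sad_stream):
        for stripe in range(L):
            bank[stripe][depth] = row[(stripe - depth) % L]
    # Read phase: for each motion-vector index, gather the diagonal and
    # barrel-shift left so the SAD from OB0 is always first.
    groups = []
    for mv in range(L):
        groups.append([bank[(mv + depth) % L][depth] for depth in range(n_ob)])
    return groups
```

The same model covers both cases of Figure 4.5: L equal to (Z/A)² and L bigger than (Z/A)² (fewer rows than stripes).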
4.4 SadGenerator implementation
As stated in Section 3.5, the SadGenerator is the architectural structure responsible for
calculating the A-SADs. The SadGenerator processes one Z-OB at a time. During a Z-iteration,
it calculates the SADs of every A-OB inside the Z-OB. Because the SadComparator needs the SADs
in a specific pattern, the SadGenerator calculates the SADs on a row-by-row basis. During a
so-called Row-iteration, the SadGenerator calculates L×L SADs per reference frame for every A-OB
in the A-OB-row. When there is more than one reference frame, a Row-iteration is split into
RF-iterations, one for every reference frame. During each RF-iteration, L×L SADs of one reference
frame are calculated for each A-OB. With Z=64, A=8 and Z/A=8, there are 8 A-OBs in an A-OB-row.
This situation, with two reference frames, is illustrated in Figure 4.6.
Figure 4.6: An overview of the iterations when processing a Z-OB (one Z-iteration consists of 8 Row-iterations, each split into RF-iterations for RF 1 and RF 2)
4.4.1 The RF-stream
During each RF-iteration, the RF-data is streamed to the SadGenerator. The RF-data are the
pixels from the reference frame that are needed to calculate the SADs of an A-OB-row. The
reference frames in the Off-Chip Memory are extended frames, in order to take into account the
search areas of the original blocks on the border of the frame. With a search area of L×L, the upper
and left borders are extended with L/2 pixels, the lower and right borders with L/2 − 1 pixels. The
grey border in Figure 4.7(a) serves this purpose. The Off-Chip Memory on the MAX2 board is
accessed with a burst length of 96 bytes. Every read is a multiple of 96 bytes and the starting point
of the read must be aligned with a multiple of 96 bytes in the memory. In order to be able to start
reading the second line of the frame, burst alignment padding is added on the right side of the
frame. This padding makes sure that each line of the frame starts at an address that is a multiple
of 96 bytes, such that it is burst aligned.
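The extended-frame layout can be summarised by a small helper that computes the burst-aligned line pitch (a sketch; the function name and the example frame size are illustrative):

```python
BURST = 96  # LMem burst size in bytes

def extended_frame(width, height, L):
    """Return (line pitch in bytes, extended height) for a reference frame."""
    # Search-area border: L/2 pixels on the upper/left side,
    # L/2 - 1 pixels on the lower/right side (grey in Figure 4.7(a)).
    ext_w = width + L // 2 + (L // 2 - 1)
    ext_h = height + L // 2 + (L // 2 - 1)
    # Burst alignment padding (green): round each line up to a burst multiple.
    pitch = -(-ext_w // BURST) * BURST
    return pitch, ext_h
```

For a 1920×1080 frame with L = 64, each stored line occupies 2016 bytes (1983 extended pixels rounded up to the next multiple of 96).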
The RF-data consists of L+A−1 RF-lines with a length of L+Z−1 pixels and is accessed
with a strided pattern. The RF-data of the first A-OB-row of the first Z-OB starts at the
burst-aligned address 0 in the DRAM, as shown in Figure 4.7(b). The RF-data of the same A-OB-row,
but in a different reference frame, is also burst aligned. This is because the second frame is stored
immediately behind the first frame and every line in the frame is a multiple of the burst size due to
the burst alignment padding. The RF-data of the next A-OB-row in the same Z-OB starts A
frame-lines lower. Since every frame-line is a multiple of the burst size, the RF-data’s start address is
burst aligned as well.

Figure 4.7: Reference frame in the DRAM: (a) each frame is stored into the DRAM with a search area border (grey) and burst alignment padding (green); (b) the RF-data spans an area in the reference frame
The RF-data of the first A-OB-row of the next Z-OB has a start address that is not aligned.
Suppose Z and L are equal to 64. The RF-line then has a width of 127 bytes. As explained before,
the first RF-line of the first Z-OB is burst aligned: it starts at address 0 and ends at 126. The first
RF-line of the second Z-OB, to the right of the first Z-OB, is shifted by Z pixels. It starts at address
64 and ends at 190, as highlighted in yellow in Figure 4.8, and is not burst aligned. The first RF-line
of the 3rd Z-OB is also not burst aligned: it starts at address 128 and ends at 254. In turn, the first
RF-line of the 4th Z-OB is again burst aligned: it starts at address 192 and ends at 318. The blue,
yellow, green pattern repeats itself.
Figure 4.8: The first RF-lines of adjacent Z-OBs in the DRAM for Z and L equal to 64
It is possible to access the blue, yellow and green data with a single read of an RF-line with a
length of 192 bytes. The output of the DRAM is shown in Figure 4.9 for each of the three different
cases. To access the blue data from Figure 4.8, the RF-line of 192 bytes with starting address 0 is
streamed and the first 127 bytes are selected. If we look at the memory output, the data is aligned
to the left of the 192-byte RF-line. To access the yellow data, the same RF-line is streamed,
but in this case the data is aligned to the right side of the RF-line, and the 127 bytes are now
selected from byte 64 onwards. To access the green data, the RF-line is read starting
from address 96 and the selected 127 bytes of data appear from byte 32 onwards. When accessing the
second blue data, the pattern repeats itself, but with a 192-byte address offset in the DRAM.
Figure 4.9: The data of multiple Z-OBs inside an RF-line for Z and L equal to 64
In the case of Z and L equal to 64, an RF-line of 192 bytes is streamed, of which 127 bytes
are used. Thus, due to the burst alignment restriction, for every RF byte needed, 1.5 RF bytes
need to be streamed, as depicted in Figure 4.10. The efficiency percentage is 66.66%. The
architecture’s RF-data reuse factor of 232.44× that is achieved by processing a row of A-OBs
is hereby reduced to 154.96 (= 232.44 × 0.6666). Appendix A illustrates that the blue, green,
yellow, blue pattern also holds for Z=32. With Z equal to 96, the pattern is all blue, since every
RF-line is burst aligned. The length L does not influence the pattern, but does influence the length
of the RF-line. The length of the RF-line corresponds to the bandwidth of the SadGenerator’s
RF-stream. Table A.1 shows the RF-data usage efficiency for every (Z, L) pair.
Figure 4.10: The access pattern of the RF-line inside the SadGenerator for different Z-OBs, with Z and L equal to 64 (Z Select, OB Select and Horizontal Select windows within the RF-line)
Figure 4.10 illustrates the SadGenerator’s sub-RF-line Selector. The Z Select, shown in blue,
selects the bytes depending on the horizontal position of the Z-OB. The OB Select, shown in
green, selects the bytes depending on the OB-iteration, i.e., which A-OB inside the A-OB-row is
currently being processed. The Horizontal Select, shown in red, selects the bytes depending on
the horizontal iteration of the ALU-grid.
4.4.2 The OB-stream
The A-OBs streamed to the SadGenerator are stored in on-chip Block-RAM, or FMem in
Maxeler jargon. During the streaming of the first line of the first A-OB, the ALU-grid is not working,
and thus no RF-lines are streamed. During each processing cycle, all 8 pixel lines, each 8 pixels
long, of a single A-OB are sent to the ALU-grid. Each ALU-row in the grid uses a different pixel
line. In order to have 8 read ports in the BRAM, the data is automatically duplicated by the
MaxCompiler. During the next OB-iteration, the pixels of the next A-OB are accessed by the ALU-grid.
Figure 4.11: The 8 Original Blocks are stored into the B-RAM and distributed to the ALU-grid per row
4.4.3 The ALU implementation
An individual ALU consists of an ABS-unit followed by a reduction tree (ADD-Tree). The
ABS-unit is responsible for calculating the absolute difference of two vectors dimension-wise. The
result is a vector of the same length, whose values are added by the ADD-Tree, resulting in a single
value that is added to one of the accumulators. In the standard configuration, with the horizontal
parallelization equal to L, each ALU accumulates the SAD with a single x-coordinate for an A-OB-row.
With Z equal to 64 and A equal to 8, there are 8 accumulated values, each one corresponding to a
SAD value for an A-OB. In the case when the horizontal parallelization is L/c, the ALU calculates
the SADs for c x-coordinates. For example, when c is four, ALU(0,0) calculates the SADs (0,0),
(L/4,0), (L/2,0) and (3L/4,0). In the case of c equal to four, there are 4×8 accumulator values in
each ALU. In the standard configuration, the processing power P is equal to A. This means that
all the A pixels of the vector are processed at once. For more information about the processing
power P, the reader is referred to Section 3.6. An example of an ALU with P equal to A is given
in Figure 4.12(a). When the processing power is less than A, for example A/2, only A/2 pixels are
processed per cycle. Thus, two cycles are needed to process the two A-pixel vectors. During these
two cycles, the ADD output value is added to the same accumulator value. An example of an ALU
with P equal to A/2 is given in Figure 4.12(b).
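The ABS-unit and ADD-Tree can be sketched as follows (a software model of one ALU evaluation for P = A; the pairwise reduction mirrors the ADD-Tree and assumes A is a power of two):

```python
def alu_sad(ob_pixels, ref_pixels):
    """One ALU evaluation: SAD of two A-pixel vectors."""
    # ABS-unit: dimension-wise absolute differences of the two vectors
    values = [abs(o - r) for o, r in zip(ob_pixels, ref_pixels)]
    # ADD-Tree: pairwise reduction of the A values down to a single SAD
    while len(values) > 1:
        values = [values[k] + values[k + 1] for k in range(0, len(values), 2)]
    return values[0]
```

In hardware, this single value is then added to one of the ALU's accumulation values rather than returned.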
A possible implementation is the use of (Z/A × L/H) accumulators. The accumulator is a
predefined Maxeler library accumulator. The logical view of each accumulator is shown in Figure 4.13.
The number of accumulation values, the accumulation depth, is only four in the figure to simplify
the model. With Z equal to 64 and A equal to 8, the accumulation depth is always a multiple of
8 (8 × L/H).

Figure 4.12: Detailed ALU implementation: (a) H(L)P(A) ALU configuration; (b) H(L/4)P(A/2) ALU configuration
Figure 4.13: The SadGenerator ALU’s registers with library accumulators
The array of accumulation values can be implemented more efficiently, since only one value
is accumulated or reset at a time. There is no need for Z/A × L/H accumulators: a single
accumulator and Z/A × L/H − 1 delay/hold pairs are enough to store the accumulated values.
Figure 4.14 shows the implementation with a custom shift register. Only one adder is used, in
combination with Z/A × L/H − 1 delay/hold pairs¹. The accumulated values are conveyed by the
signals before and after the delay/hold pairs, denoted in the figure as ACC 0 to ACC 3. The
signal shift E can hold all the accumulated values, except for the first one. While the shift registers
are on hold, the delayed ACC 0 value is always accumulated with the value at the output of the
ADD unit. The circuit is in this state when multiple iterations are needed to accumulate the two
A-pixel vector inputs of the ALU. This is the case when the processing power P of the ALU is less
than A, as explained above.

¹A delay/hold pair is able to store data for a given number of cycles.
Figure 4.14: The SadGenerator ALU’s registers with custom accumulators
When two new A-pixel vectors are processed, the add value has to be accumulated with a new
accumulation value. The accumulation values are shifted and the adder adds the add value
to the ACC 3 value, which results in the new ACC 0 value. When ALU(0,0) has calculated the
SADs(0,0), it starts the calculation of the SADs(8,0). It does so by resetting the accumulation value
instead of using ACC 3.
It is clear that when the ALU’s processing power P is equal to A, the accumulation values are
shifted every cycle and the adder never uses the delayed version. In this case, a simplified custom
implementation is possible, without the hold blocks and without the mux that chooses between
ACC 0 and ACC 3. This simplified custom implementation cannot be used when P is not equal to A,
since the values of the shift registers must be held.
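The custom shift-register accumulator can be modelled as follows. This is a behavioural sketch with Python's `deque` standing in for the delay/hold chain; `shift=False` models the hold state used when P < A, and `reset=True` models starting a new SAD instead of reusing ACC 3.

```python
from collections import deque

class ShiftAccumulator:
    """Shift-register accumulator holding Z/A * L/H accumulation values."""

    def __init__(self, depth):
        self.regs = deque([0] * depth)   # ACC 0 (left) .. ACC depth-1 (right)

    def step(self, add_value, shift=True, reset=False):
        if shift:
            oldest = self.regs.pop()               # ACC 3-style oldest value
            base = 0 if reset else oldest          # reset starts a new SAD
            self.regs.appendleft(base + add_value) # becomes the new ACC 0
        else:
            # hold state (P < A): keep accumulating into the delayed ACC 0
            self.regs[0] += add_value
        return self.regs[0]
```

Each shift rotates the chain by one position, so after `depth` shifts every accumulation value has been updated exactly once, matching the one-value-at-a-time behaviour described above.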
4.5 Host Code
The data-flow-engine (DFE) is controlled by the host code. The host code controls the DFE
by sending commands to it. These commands are called actions. An action is a data structure
that contains all the information that a DFE needs to run. There are three things that need to be
configured by the action:
• Ticks: For each kernel, the number of cycles that it needs to run is set.
• Streams: Streams from/to the memory need to have a start address, length and streaming
pattern. Streams from/to the host need to have a length and a source/destination.
• Scalars: Kernels can have scalars whose values can be set before each run.
A frame is processed Z-OB per Z-OB. Each Z-OB is processed by multiple RF-iterations as
shown in Figure 4.6. One action sets the commands that are necessary for one RF-iteration.
The SadComparator runs in parallel with the SadGenerator. An action thus contains configuration
data for both kernels.
An action sets the following parameters for the SadGenerator:
• Ticks: The number of cycles the SadGenerator has to run.
• RF-stream: Strided read pattern of RF-data from the DRAM to the SadGenerator
• SAD-stream: Strided write pattern of SADs from the SadGenerator to the DRAM
• Scalars: Sets the state of the mux that chooses Z-Select in the sub-RF-Selector
An action sets the following parameters for the SadComparator:
• Ticks: The number of cycles the SadComparator has to run.
• SAD-stream: Strided read pattern of SADs from the DRAM to the SadComparator
• MV-stream: Write stream from the SadComparator to the PCIe
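The number of actions the host issues per frame follows directly from this organisation: one action per RF-iteration, Z/A Row-iterations per Z-OB, and one RF-iteration per reference frame (Figure 4.6). This can be sketched as follows (assuming, for simplicity, frame dimensions that are multiples of Z; the function name is illustrative):

```python
def actions_per_frame(width, height, Z=64, A=8, n_rf=2):
    """Count the host actions needed to process one frame."""
    # A frame is processed Z-OB per Z-OB; each Z-OB takes Z/A Row-iterations,
    # and every Row-iteration is split into one RF-iteration per reference frame.
    n_zobs = (width // Z) * (height // Z)
    return n_zobs * (Z // A) * n_rf
```

For a 1920×1088 frame with Z=64, A=8 and two reference frames, this amounts to 8160 actions per frame, each configuring the ticks, streams and scalars of both kernels for one RF-iteration.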
4.6 Summary
This chapter shows a possible embodiment of the architecture proposed in Chapter 3. Namely,
the implementation targets a Virtex-5 LX330T FPGA. For the HEVC motion estimation, the
smallest prediction block is chosen to be 8×8 (A=8) and the largest prediction block 64×64 (Z=64).
The best prediction block is searched for in a search area of 64×64 pixels (L=64) and in two
reference frames (RF=2). The implementation is performed with the Maxeler framework, which uses
a high-level hardware description language. The SadGenerator and SadComparator are
implemented as two separate kernels. To communicate with the outside world, the two kernels have
streams connected to the off-chip memory and the PCIe host link.
The previous chapter proposed three possible architecture designs for the SadGenerator:
processing one A-OB, processing one row of A-OBs, and processing multiple rows of A-OBs in
parallel. In this implementation we consider the architecture that processes one row of A-OBs: it has
a high reference frame data reuse and a simple SAD output pattern. For reading the reference
frame data, the SadGenerator implements a larger RF-line, thus circumventing the off-chip
memory’s strided access pattern limitation. Moreover, a custom accumulator design for the ALU-grid
is also proposed.
In order to be able to process the SADs hierarchically, the SadComparator needs to stream the
SADs from the off-chip memory with a specific pattern. To support the efficient streaming
of the SAD values, we propose a new technique that combines the way the SadGenerator
stores the data with a very efficient HW module in the SadComparator to transpose the SAD
values.
5 Results
Contents
5.1 Implementation Specific Optimizations
5.2 Experimental Results
5.3 Comparison
5.4 Summary
This chapter describes the evaluation of the proposed architecture. Section 5.1 starts by pro-
viding an evaluation of two implementation specific optimizations performed in each of the kernels.
Section 5.2 presents the results obtained for a real implementation on a MAX2 data-flow accel-
eration card with a Xilinx Virtex 5 FPGA. The results obtained show how different performance
metrics vary with different design parameters. Finally, Section 5.3 presents a comparison be-
tween the FPGA results obtained for two parameter configurations and a state-of-the-art GPU
implementation.
5.1 Implementation Specific Optimizations
5.1.1 SadGenerator Output Width
After streaming the first A-1 RF-lines, every L/H×A/P cycles, a total of L SADs will be ready.
These values will be streamed as the output of the kernel. As defined in Section 3.6, H is the ALU-
grid’s Horizontal Parallelization and P is the Processing Power of an individual ALU. It is possible
to have the stream width as large as L to output all SADs at once. This is necessary for designs
with H(L)P(A), where every cycle, L accumulation values will be ready to output. The output of the
kernel must be a multiple or natural divisor of 48 bytes because this is the internal stream width
of the system. With L = 64 and each SAD value represented by 2 bytes, the amount of SAD data
is 128 bytes. A padding of 16 bytes is added such that the stream width of the kernel’s output is
3 × 48 = 144 bytes.
When H is smaller than L and/or P is smaller than A, there are some cycles between each
output cycle where nothing is streamed to the output. When there are 3 cycles available for
output, the stream width can be 24 SAD values (48 bytes) and the 64 SADs will be streamed to
the output in 3 cycles. The last cycle will carry 32 bytes of SAD values and 16 bytes of padding. An
example of a system where 3 output cycles are available is H(L/2)P(A/2) (2 × 2 = 4 output cycles
are available in total). When more cycles are available, it is possible to stream the SADs in 6, 12,
24, ... cycles.
A design with L=64, A=8, Z=64, H(L), P(A) was made to test the resource usage of the
SadGenerator for the possible numbers of output cycles. As seen in Table 5.1, it is clear that the kernel
output width of 48 bytes, which is achieved with 3 output cycles, is the optimal case in terms of
resource usage. This can be explained because, since the data is already aligned to the internal
format, the hardware necessary to rescale and buffer the data is relatively simple. The
only marginal negative effect is that, to output the last SADs, the system needs to run for 2
extra cycles in order to have 3 output cycles. Therefore, for configurations where the throughput
is lower, i.e., where the design has 3 or more output cycles available, the design uses 3 output
cycles. Otherwise, 1 output cycle is used.
Table 5.1: Resource usage relative to the design with 1 output cycle

                   output width (bytes)   LUTs (%)   FFs (%)   BRAMs (%)
1 output cycle            144               1.00      1.00       1.00
3 output cycles            48               0.999     1.005      0.848
6 output cycles            24               1.002     1.004      0.876
12 output cycles           12               1.000     1.004      0.857
24 output cycles            6               1.002     1.003      0.848
5.1.2 Custom Accum
Each ALU in the SadGenerator has a number of accumulation values. The total number of
accumulation values is equal to the number of A-OBs that are processed during one Z-iteration
(Z/A) multiplied by the number of horizontal iterations (L/H).
Figure 5.1 shows the resource usage of the SadGenerator kernel for 3 different designs. The
first design, Library Accum, uses the standard accumulators provided by the Maxeler framework,
where each accumulation value is stored in a full accumulator, as in Figure 4.13. The second de-
sign, Custom Accum, stores the accumulation values in a custom shift register, as in Figure 4.14.
Finally, and as described in Section 4.4.3, the third design, Simplified Custom Accum, is an op-
timization of the second one, with no hold-block and only one reset mux; it can be applied
when P = A. The design targets motion estimation with a search area width L = 64,
horizontal parallelization H = 1 and ALU processing power P = A. The Z and A block widths are
respectively 64 and 8. Each ALU calculates the SAD value for Z/A = 8 OBs and for L/H = 64
x-SADs. This results in a total of 512 accumulation values that each ALU has to store.
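The 512 figure follows directly from the formula above; a one-line sanity check (the function name is illustrative):

```python
def accum_values(Z, A, L, H):
    """Accumulation values per ALU: (Z/A) A-OBs per Z-iteration
    times (L/H) horizontal iterations (x-SADs)."""
    return (Z // A) * (L // H)

# Z=64, A=8 gives 8 OBs; L=64, H=1 gives 64 x-SADs: 8 * 64 = 512
print(accum_values(Z=64, A=8, L=64, H=1))  # -> 512
```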
Figure 5.1: The resource usage of the SadGenerator's kernel for different accum implementations, relative to the custom accum implementation, for H1A8 and L=64
The resources of the different designs are expressed relative to the resource usage of the
Custom Accum design. The Library Accum design uses significantly more LUTs and FFs, 6× and
3.5× respectively. The Custom Accum design reduces resource usage by using shift registers
instead of full accumulators: each full accumulator requires its own adder, while only one is
active at a time. The Simplified Custom Accum design uses slightly fewer LUTs (2.5%)
but more BRAMs (2%).
It is clear that the custom accum implementations are the better choice, because they require
significantly fewer LUTs and FFs. When comparing the Custom Accum with the Simplified
Custom Accum, there is a trade-off between BRAMs and LUTs. Since BRAMs are the scarcer
resource, the Custom Accum implementation is preferred.
5.2 Experimental Results
5.2.1 Framework
The architecture is implemented on a MAX2 board with the Maxeler framework. The following
list describes some of the terminology used throughout this section and how it should be
interpreted:
• Resources The resource usage as reported by the Xilinx tool after the mapping and place-
and-routing design-flow phases.
• Cycles The number of cycles is the total number of "ticks" that the Data Flow Engine (DFE)
is set to run. This number of ticks is set via the software run-time API actions, which are
command streams. Execution times derived from this number of cycles do not
take into account the initialisation time that is needed for each action.
• Memory The read and write rates are calculated for the case in which both the SadGener-
ator and SadComparator are running at the same time. In practice, this is always the case,
except during the processing of the first and last Z-OB. In these latter cases, either the Sad-
Generator or the SadComparator is not running, which is why the read and write rates to the
DRAM will be significantly lower.
• Power The power is calculated with the XPower Estimator tool from Xilinx [21]. This tool
automatically estimates the power consumption from the data provided in the MAP report.
5.2.2 Reference Implementation
The architecture is successfully implemented on a MAX2 board. The accelerator is synthe-
sized for the Virtex 5 FPGA at 125MHz, with the off-chip memory controller operating at 150MHz
(with a word width of 96 bytes), providing a maximum bandwidth of 14.4 GBytes/s. The parameters
of the reference implementations are summarized in Table 5.2. There are three reference imple-
mentations: designs H64P2, H64P4, and H64P8. Their results can be found in Table 5.3. A detailed
overview of the power estimation, obtained with the Xilinx XPower Estimator tool for the
worst-case scenario, i.e., the largest design H64P8, can be found in Appendix B.
The H64P8 design extracts four times more parallelism than the H64P2 design. The required
DRAM read and write bandwidth is quadrupled in order to support the performance
increase. Both the H64P4 and H64P8 designs are able to process 720p frames in real time. Namely,
the H64P8 design can process 720p, 1080p and 2160p frames in real time with frame rates of up
to 56.9, 26.8 and 6.7 frames per second, respectively.
Table 5.2: Parameters

          H    P    Z    L    RF
H64P2    64    2   64   64    2
H64P4    64    4   64   64    2
H64P8    64    8   64   64    2
Of the three designs, only the H64P8 design was not able to run on the FPGA. Although we
were not able to fit the H64P8 design on the MAX2 board, the preliminary projections obtained
show an occupancy only slightly above the limits, namely 101% of LUTs, 96% of FFs and 84%
of BRAMs. Therefore, we believe that the analysis of the results obtained from these projections
is still valuable and provides interesting conclusions. In particular, since the MAX2 card has
Maxeler's smallest form-factor FPGA, other more recent products could easily fit the design, such
as the MAX3 card, which has a Virtex 6 LX240T with about 3× the capacity. Moreover, regarding the
memory bandwidth requirements, the MAX2 card is also not able to provide enough bandwidth
for the H64P8 design, which would limit its overall performance. On the other hand, the MAX3 would
support 4× more bandwidth. An intermediate point is the H64P4 design: it can support a frame
rate of up to 28.6 fps for 720p while having lower hardware requirements and a reasonable
memory bandwidth requirement for the target platform.
Table 5.3: Experimental results

         Power [W]        Resources [#]          Speed [fps]      DRAM [bytes/cycle]
                      LUT       FF     BRAM   720p  1080p  2160p    Read    Write
H64P2      13.6      84398   140747    297    14.3   6.7    1.6     38.4    32.4
H64P4      N.A.     124516*  173655*   273*   28.6  13.5    3.4     76.8    64.8
H64P8      N.A.     210158*  199682*   273*   56.9  26.8    6.7    153.3   129.4
* Results from preliminary resource report.
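As a rough sanity check on the bandwidth discussion above, the per-cycle DRAM rates in Table 5.3 can be converted to absolute bandwidth at the 125 MHz accelerator clock and compared against the memory controller's peak (96-byte words at 150 MHz). Treating the read and write streams independently is an assumption of this sketch:

```python
CLOCK_HZ = 125e6                 # accelerator clock
PEAK = 96 * 150e6                # 96-byte memory words at 150 MHz (14.4 GBytes/s)

designs = {"H64P2": (38.4, 32.4),   # (read, write) in bytes/cycle, from Table 5.3
           "H64P4": (76.8, 64.8),
           "H64P8": (153.3, 129.4)}
for name, (read, write) in designs.items():
    worst = max(read, write) * CLOCK_HZ      # busiest stream, in bytes/s
    verdict = "fits" if worst <= PEAK else "exceeds the peak"
    print(f"{name}: {worst / 1e9:.2f} GB/s ({verdict})")
# H64P2 and H64P4 fit; H64P8 needs ~19.2 GB/s and exceeds the 14.4 GB/s peak
```

This is consistent with the observation that the MAX2 card cannot provide enough bandwidth for the H64P8 design, while the H64P4 design remains within the platform's reach.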
Finally, regarding the H64P2 design, its main limitation is its much lower frame rate.
An overview of the resource usage of the H64P2 reference implementation can be found in Fig-
ure 5.2. Figure 5.2(a) shows that the kernels, which implement the SadGenerator and
SadComparator, use the bulk of the LUTs and FFs. The manager and kernels both use a large
share of BRAMs, 41.67% and 49.68% of the total available BRAMs respectively. In more
detail, Figure 5.2(b) shows that the SadGenerator uses most of the LUTs and FFs. The SadCom-
parator uses far fewer LUTs and FFs, but many BRAMs, which are used for the implementation
of the SadComparator's Transposer. The manager uses BRAMs for the implementation of the
FIFOs and controllers.
Besides the design choice, there are three more parameters that have a major impact on the
accelerator's performance: Z, L and RF. The effect of these four parameters on the experimental
results is discussed in Section 5.2.3.
(a) Overview (b) Breakdown
Figure 5.2: An in-depth overview of the resource usage of the H64P2 reference implementation
5.2.3 Design Comparison
This section illustrates the effect of the architecture's parameters on the accelerator's perfor-
mance. H P is a two-dimensional parameter: H specifies the ALU-grid's horizontal parallelization,
and P specifies an individual ALU's processing power. Z is the width of the largest prediction block
for which SADs are calculated and compared. L defines the width of the search area in which a
prediction candidate is looked for in each reference frame. RF is the number of reference frames
in which a prediction candidate is looked for. H P is a design parameter that does not influence
the encoding efficiency, whereas Z, L and RF do: the larger they are, the higher the encoding
efficiency.
The next paragraphs give an in-depth description of each parameter and discuss their effect
on the resource usage, performance, memory usage and power consumption.
Figure 5.3: The effect of four parameters on the architecture's performance
A – parameter: H P The design parameter H P is a two-dimensional parameter that changes
the configuration of the SadGenerator. The horizontal parallelization H specifies how many ALU
rows the SadGenerator's ALU-grid has. H can range from 1 up to L, the width of the search area.
The parameter P specifies the processing power of each ALU in the SadGenerator. P can range
from 1 up to A, the width of an A-OB.
The comparison between different designs is performed with the other parameters constant:
L=64, Z=64 and RF=2.
A –.1 H P: Resources The resource usage for each design can be found in Figure 5.4.
Changing the design does not significantly vary the BRAM usage. Increasing the horizontal
parallelization H slightly increases the usage of LUTs and FFs, and so does increasing the
processing power P. An exception occurs when P is equal to A, 8 in this case: each ALU then
no longer needs a mux, as illustrated in Figure 4.12. The results for designs H64P8* and H64P4*
are based on preliminary reports, which explains the drop in BRAM usage. Changing the design
from H64P8 to H64P4 decreases the LUT usage by 41% and the FF usage by 13%.
Figure 5.4: The resource usage for different designs
A –.2 H P: Cycles The design H1P1 has the slowest performance. Doubling H or P halves
the number of cycles needed. H64P8 is the fastest design. Figure 5.5 gives an overview of the
speed performance of each design.
A –.3 H P: DRAM The total amount of data read from or written to the DRAM is not affected
by the design parameter, but the required read and write bandwidth is. Since doubling H or P
halves the number of cycles, the read and write rates are doubled. An overview of the read and
write rates for the 12 fastest designs is given in Figure 5.6.
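The halving behaviour follows from the batch rate given in Section 5.1.1 (one batch of L SADs every (L/H)×(A/P) cycles). A small model of the relative cycle count (the function is illustrative, not the exact cycle formula):

```python
def relative_cycles(H, P, L=64, A=8):
    """Cycle count relative to the fastest design H(L)P(A):
    a batch of L SADs is produced every (L/H) * (A/P) cycles."""
    return (L // H) * (A // P)

assert relative_cycles(H=64, P=8) == 1        # fastest design
assert relative_cycles(H=1, P=1) == 512       # slowest design, H1P1
# doubling H (or P) halves the number of cycles
assert relative_cycles(H=32, P=2) == relative_cycles(H=16, P=2) // 2
```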
Figure 5.5: The number of cycles needed to process a 720p frame. The red line indicates the maximum number of cycles allowed for real-time (25 fps) processing of a 720p frame.
Figure 5.6: The read and write rates to the DRAM for each design
A –.4 H P: Power Increasing H slightly increases the power usage. Choosing P equal to A
uses less energy than other choices of P. An overview of the power usage for each design is given
in Figure 5.7.
Figure 5.7: The total power usage per design
B – parameter: Z The parameter Z represents the width of the largest prediction block that
can be evaluated. Most of the time, Z is imposed by the video coding standard. The HEVC
standard introduces the possibility to use prediction blocks with a size of up to 64×64. When
the architecture is used to perform motion estimation with prediction blocks with a size of up to
64×64, Z has to be equal to 64. However, HEVC motion estimation can also be performed with
prediction blocks with a size of up to 32×32. In this case, Z can be chosen as 32 or 96. When Z
equals 96, each frame is processed with blocks of 96×96, but only prediction blocks up to 32×32
will be evaluated in the SadComparator. Changing Z has a major impact on the architecture and
its implementation. Depending on the requirements, a suitable Z can be chosen.
B –.1 Z: Resources
• Increasing Z causes the SadComparator to calculate and compare more SADs in parallel.
With Z equal to 32, the SadComparator has to calculate only one SAD of a 32-OB. With Z
equal to 64, the SadComparator has to calculate 4 32-OB SADs in parallel. With Z equal
to 96, the SadComparator has to calculate 9 32-OB SADs in parallel. The higher the Z, the
more add trees and thus the more resources the SadComparator will need.
• Because of the pattern of the SADs in the DRAM, a transposer (as described in Figure 4.4)
is needed in order to internally stream the SADs in another pattern. The size of the BRAM
used by the transposer is 2 × (Z/A)² × L. Choosing a bigger Z-OB quadratically expands
the BRAM usage of the SadComparator.
• Increasing Z causes each ALU in the SadGenerator to store more accumulation values. The
number of accumulation values stored inside each ALU is equal to (Z/A)× (L/H).
• Increasing Z increases the number of A-OBs that are processed during the processing of a
Z-OB and thus also the number of SADs that are generated. This results in storing more
SADs during a Z-OB iteration, requiring a larger SAD-chunk in the DRAM.
An overview of the major resource changes is given in table 5.4.
Table 5.4: Major resource changes when increasing Z, relative to Z=32

                                    Z=32   Z=64   Z=96
SadGenerator: Accum Values           1x     2x     3x
SadComparator: Transposer BRAM       1x     4x     9x
SadComparator: Add Trees             1x     4x     9x
SAD chunk: DRAM                      1x     4x     9x
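The factors in Table 5.4 follow from the formulas given in the bullets above: the accumulation values scale linearly with Z/A, while the transposer BRAM, the add trees and the SAD chunk scale with (Z/A)². A quick check with A = 8, relative to Z = 32:

```python
A = 8
base = 32 // A                           # reference point: Z = 32
for Z in (32, 64, 96):
    linear = (Z // A) // base            # accum values: 1x, 2x, 3x
    quadratic = ((Z // A) // base) ** 2  # transposer BRAM, add trees, SAD chunk: 1x, 4x, 9x
    print(f"Z={Z}: {linear}x linear, {quadratic}x quadratic")
```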
B –.2 Z: Cycles The number of cycles used by the SadGenerator to process one Z-OB is
proportional to Z². The number of cycles used by the SadComparator does not depend on the
size of Z and stays constant. Figure 5.8 illustrates the number of cycles needed to process one
Z-OB.
The number of Z-OBs that need to be processed in order to process one frame is inversely
proportional to Z². As long as the SadGenerator is the slower unit, changing the size of Z will not
Figure 5.8: The number of cycles needed for each unit to independently process one Z-OB
significantly change the number of cycles needed to process a frame. This is because the pro-
cessing time of the SadGenerator is proportional to Z² while the processing time of the SadCom-
parator stays constant when changing Z. When Z equals 32, the SadGenerator is no longer
the slowest unit; the number of cycles needed by the SadComparator then determines the total
number of cycles needed. The total number of cycles needed to process one 720p frame is
illustrated in Figure 5.9. As stated before, as long as the SadGenerator is the slower unit,
changing the size of Z will not significantly change the number of cycles needed to process a
frame. With Z equal to 64 and 96, the SadGenerator is the slower unit, resulting in an
approximately constant time to process a frame. With Z equal to 32, the SadComparator is the
slower unit, resulting in a bottleneck that significantly increases the time to process a frame.
Choosing Z=96 over Z=32 improves the processing time by a factor of 3.2.
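The bottleneck behaviour can be captured in a toy model. The cycle constants below are hypothetical, chosen only to reproduce the qualitative shape described above (SadGenerator cycles per Z-OB grow with Z², SadComparator cycles stay constant, and the Z-OB count per frame shrinks with Z²), not the measured values:

```python
def frame_cycles(Z, gen_k=1.0, comp_c=2000.0, frame_pixels=1280 * 720):
    gen = gen_k * Z * Z              # SadGenerator cycles per Z-OB: proportional to Z^2
    comp = comp_c                    # SadComparator cycles per Z-OB: constant in Z
    zobs = frame_pixels / (Z * Z)    # number of Z-OBs per frame
    return max(gen, comp) * zobs     # units run in parallel; the slower one dominates

# Z = 64 and Z = 96 give the same frame time (SadGenerator-bound);
# Z = 32 is SadComparator-bound and takes longer
print(frame_cycles(32), frame_cycles(64), frame_cycles(96))
```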
B –.3 Z: DRAM Figure 5.10 shows the data used by the SadGenerator when processing
a single Z-OB. Input:RF is the number of bytes that is read from the DRAM and streamed to
the SadGenerator. Output:SAD is the number of bytes that is output by the SadGenerator
and written to the DRAM. The amount of data output by the SadGenerator equals the amount
of data read by the SadComparator. The total amount of data read from the DRAM is thus the
RF data read by the SadGenerator plus the SAD data read by the SadComparator, i.e.,
Input:RF + Output:SAD. The total amount of data written to the DRAM is the amount of SADs
output by the SadGenerator. The size of Output:SAD when processing one Z-OB also equals
the size of one SAD-chunk. The DRAM space that needs to be allocated for the SADs is twice
the size of Output:SAD.
Figure 5.9: The total number of cycles needed to process one 720p frame by both units working in parallel
Figure 5.10: The amount of data used by the SadGenerator when processing one Z-OB for different sizes of Z
As shown in Figure 5.10, when increasing Z, the RF data and SAD data used to process one
Z-OB increase significantly. Doubling Z quadruples the RF and SAD data used when processing
one Z-OB. This is expected, since larger Z-OBs have larger search areas and contain more A-OBs,
which results in more SADs.
Figure 5.11 shows the data needed when processing a full 720p frame. When increasing Z,
fewer Z-OBs need to be processed per frame. The SAD data is constant, and
the RF data decreases as Z increases. Choosing a larger Z decreases the total RF data that
has to be streamed, for two reasons. First, with a larger Z, each streamed RF-line is used
to calculate more A-OB SADs than with a smaller Z, since there are (Z/A) A-OBs in an A-OB row.
This results in less overlap of adjacent RF-lines and thus a decrease in the total RF data that has
to be streamed. Second, the streaming of the RF-lines is more efficient when Z is
larger, which is explained in Appendix A. Choosing Z=96 over Z=32 improves the RF-data reuse
by a factor of 3.
Figure 5.11: The total amount of data used by the SadGenerator when processing one 720p frame for different sizes of Z
Figure 5.12 shows the data rates for different Z values. The data rate to and from the DRAM
is significantly lower with Z equal to 32, since more cycles are needed, as illustrated in Figure 5.9.
When comparing Z equal to 64 and 96, the number of cycles is constant. The SAD data is
constant as well, which explains why the write rate does not vary. The read rate for Z equal to 96
is slightly lower, since less RF data is read from the DRAM.
Figure 5.12: The read and write rates of the DRAM for different sizes of Z
B –.4 Z: Power A larger Z value causes the implementation to use more resources. An
increase in power usage is therefore expected.
C – parameter: L L is the width of the search area, which consists of L × L pixels. The larger
the search area, the more prediction candidates are checked. This increases the encoding
efficiency of the encoder, but also the computation cost. It also means that, for each
OB, more data from the reference frame is needed and more SADs have to be calculated.
The comparison between different values of L is performed with Z=64, RF=2 and design
H(L)P8.
C –.1 L: Resources Increasing L greatly increases the size of the SadComparator's trans-
poser, which is proportional to (Z/A)² × L. Thus the number of BRAMs used by the SadComparator
is proportional to L. Increasing L does not necessarily increase the horizontal parallelization H of
the SadGenerator, but when aiming for maximum speed, H has to scale with L. The increase in
horizontal parallelization causes the SadGenerator to use significantly more LUTs and FFs.
An implementation with L=128 or L=256 is not possible on the MAX2 board, since it does not
have enough BRAMs to support it.
C –.2 L: Cycles The number of cycles needed by the SadGenerator to process one Z-OB
is proportional to L² if the horizontal parallelization is equal to 1. With a horizontal parallelization
of L, the number of cycles is proportional to L. The number of cycles needed by the SadComparator
to process one Z-OB is proportional to L². Figure 5.13 illustrates the number of cycles needed
to process one Z-OB.
Figure 5.13: The number of cycles needed for each unit to independently process one Z-OB
The total number of cycles needed to process a Z-OB with the SadGenerator and SadCom-
parator working in parallel depends on the slower unit. The total number of cycles to process
one 720p frame for different choices of L is illustrated in Figure 5.14. When going from L=32 to
L=64, the number of cycles is doubled, since the SadGenerator is the slower unit. From L=64
to L=128 and L=256, the total number of cycles needed is quadrupled for each doubling of L. This
is because the SadComparator is then the slower unit and its cycle count scales with L².
Figure 5.14: The total number of cycles needed to process one 720p frame by both units working in parallel
C –.3 L: DRAM Increasing L greatly increases the RF data streamed from the DRAM to
the SadGenerator. The number of calculated SADs increases quadratically with L. Thus the total
amount of SADs streamed from the SadGenerator to the DRAM, and the stream from the DRAM
to the SadComparator, increase quadratically as well. Figure 5.15 presents the total amount of
data used during the processing of one Z-OB by the SadGenerator.
Figure 5.15: The amount of data used by the SadGenerator when processing one Z-OB for different sizes of L
Doubling L quadruples the number of pixels from the reference frame that have to be streamed
and the number of SADs that have to be calculated. This quadrupling does not happen for the
actual transfers of RF data and SAD data. Table 5.5 shows the expected increase in the second
row, Search Area L×L; the third and fourth rows show the actual increase of the data streamed for
Input: RF and Output: SAD. The difference is mainly because the streaming of the data
carries padding overhead, which is relatively smaller when L is larger. This padding is necessary
to meet the stream pattern restrictions regarding the size of the burst length.
Increasing L drastically increases the amount of data, but also the number of cycles. Fig-
ure 5.16 shows the data rate for each L. From L=32 to L=128, the data increases more than the
Table 5.5: The difference in increase of RF data and SAD data compared to the increase of the Search Area L×L, relative to L=32

                   L=32   L=64   L=128   L=256
Search Area L×L      1      4      16      64
Input: RF            1    1.82    5.19   13.49
Output: SAD          1      3      12      44
number of cycles, resulting in increased data rates. From L=128 to L=256, the number of cycles
increases more than the amount of data, resulting in decreased data rates.
Figure 5.16: The read and write rates of the DRAM for different sizes of L
C –.4 L: Power A larger L value causes the design to use more resources. Also, the amount
of data streamed to and from the off-chip DRAM increases significantly from L=32 to L=128, as
shown in Figure 5.16. An increase in power usage is therefore expected.
D – parameter: RF RF is the number of reference frames that are used to look for a candidate.
The comparison between different choices of RF is performed with L=64, Z=64 and design
H64P8.
D –.1 RF: Resources Changing RF does not change the resource usage, since the design
stays the same. There is only an increase in the number of actions that are performed.
D –.2 RF: Cycles Doubling the number of reference frames doubles the number of actions
that are performed, which doubles the number of cycles. This is illustrated in Figure 5.17.
D –.3 RF: DRAM Doubling the number of reference frames doubles the number of SADs
and the amount of RF data that is needed. This is illustrated in Figure 5.18.
Since both the number of cycles and the amount of data double when the number of reference
frames is doubled, the data rate is not affected.
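The rate argument can be stated in one line (a trivial sketch; the function name and numbers are illustrative):

```python
def dram_rate(rf, bytes_per_rf, cycles_per_rf):
    """Bytes per cycle: both the data and the cycles scale linearly with RF."""
    return (rf * bytes_per_rf) / (rf * cycles_per_rf)

# doubling (or quadrupling) RF leaves the rate unchanged
assert dram_rate(1, 1000.0, 50.0) == dram_rate(2, 1000.0, 50.0) == dram_rate(4, 1000.0, 50.0)
```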
Figure 5.17: The number of cycles needed to process one 720p frame for different numbers of reference frames
Figure 5.18: The amount of data used by the SadGenerator when processing one Z-OB for different numbers of reference frames
D –.4 RF: Power The design is not changed, so there is no change in power when changing
the number of reference frames.
5.3 Comparison
The results discussed in the previous section are now compared with an NVIDIA Fermi-
based GPU implementation. First the GPU implementation is described, and then the results are
discussed.
5.3.1 About the GPU implementation
The GPU implementation was developed by Svetislav Momcilovic [12]. It is executed on an NVIDIA
Fermi-based GPU architecture with 512 processing cores, clocked at 1.5GHz and with a maxi-
mum TDP of 244W. It focuses on exploiting fine-grained data parallelism at the level of Coding
Tree Units (CTUs), which are simultaneously processed across hundreds of GPU cores. In general,
CTUs are mapped to different GPU Streaming Multiprocessors (SMs), which consist of several
GPU cores. The cores themselves simultaneously process different matching candidates, while
keeping in the SMs' registers the minimum distortion values found among them, as well as the
related Motion Vectors.
In order to achieve higher processing performance, the reference samples are cached in the
SMs' local storage. In general, the entire Search Area cannot fit in the available cache memory,
and caching must be performed in several steps, while maximally reusing the samples already
available in the local storage. The size of the SA partition cached in a single step is determined
according to the available cache memory on the particular GPU device. This approach not only
provides fast access to the reference samples, but also makes the Motion Estimation algorithm
scalable over the SA size.
For the different CTU partitioning modes, we applied hierarchical SAD computing, where the SADs
of the smaller partitions are reused to compute the SAD values of the larger ones. This approach
requires saving not only the reference samples, MVs and distortion values for the best processed
candidates, but also their SAD values, and leads to even larger memory requirements. In order to
deal with such high memory requirements and the limited local storage available per SM, we divided
the ME algorithm into two kernels: the first one performs the search algorithm for the
partitioning modes as large as 32x32, while the larger partitions (up to 64x64) are processed
in the second kernel. The SAD values obtained for the 32x32 partitions are saved in
global GPU memory and later reused for the larger partitions. Even though this leads to additional
caching overhead, the approach allows a more efficient use of the available resources and the
simultaneous employment of a larger number of GPU cores. The best candidates found on different
GPU cores are finally compared with each other in a reduction procedure in order to find the final
set of best matching candidates for all partitioning modes of the processed CTU.
5.3.2 Results
The proposed architecture’s implementation is now compared with a GPU implementation. It
performs motion estimation for blocks with a size of 64×64. Just like the proposed architecture
with Z=64, it calculates 85 motion vectors for each 64×64 block: 1 MV for the 64×64 OB, 4 MVs
for the 32×32 OBs, 16 MVs for the 16×16 OBs and 64 MVs for the 8×8 OBs.
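The count of 85 motion vectors per 64×64 block is simply the sum of the number of square OBs of each supported size:

```python
Z = 64
sizes = (64, 32, 16, 8)                  # supported square OB widths
mvs = sum((Z // s) ** 2 for s in sizes)  # 1 + 4 + 16 + 64
print(mvs)  # -> 85
```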
Figure 5.19 compares the processing time of the two implementations for different frame
sizes. The motion estimation uses a single reference frame and a search area of 64×64. For both
the reference GPU implementation and the proposed architecture's implementations, increasing
the size of the frame increases the processing time by the same factor. The H64P8 design is
faster than the GPU implementation by a near-constant factor of 2.5; the H64P4 design by a
near-constant factor of 1.24.
Figure 5.20 compares the processing time for different numbers of reference frames between
Figure 5.19: A comparison of the GPU and proposed implementation, for different frame sizes
Figure 5.20: A comparison of the GPU and proposed implementation, for different numbers of reference frames
both implementations. The motion estimation is performed on a 1080p frame and uses a search
area of 64×64. For both the reference GPU implementation and the proposed architecture's
implementation, increasing the number of reference frames increases the processing time by the
same factor. The H64P8 design is faster than the GPU implementation by a near-constant factor
of 2.4; the H64P4 design by a near-constant factor of 1.22.
Figure 5.21 compares the processing time of both implementations for different L×L search
area sizes. The proposed architecture's horizontal parallelization H scales here with L. The
motion estimation is performed on a 1080p frame and uses a single reference frame. For L equal
to 64 and 128, the H(L)P8 design is respectively 2.45 and 2.53 times faster. When L is equal
to 32, the H(L)P8 design is still faster, but only by a factor of 1.32. This is because, as shown in
Figure 5.13, with L=32 the SadGenerator is the slower unit and thus becomes a bottleneck. For L
equal to 128, the H(L)P4 design is only slightly slower than the H(L)P8 design. This is because,
as shown in Figure 5.13, with L equal to 128, the SadGenerator is almost a factor of 2 faster than
Figure 5.21: A comparison of the GPU and proposed implementation, for different search area sizes
the SadComparator. Slowing down the SadGenerator by a factor of two by choosing P=4 thus
only slightly slows down the overall performance of the proposed accelerator.
5.4 Summary
The reference implementation is successfully implemented on the FPGA with the following
parameters: A=8, Z=64, L=64 and RF=2. With the ALU-grid parameters H=64 and P=8, the
implementation achieves motion estimation rates of 6.7 fps for 2160p, 26.8 fps for 1080p and
56.9 fps for 720p. The H64P8 design uses too many resources to fit on the targeted Virtex 5 FPGA.
Thanks to the ALU-grid's parallelization parameters, the resource usage can be scaled down.
Changing the design from H64P8 to H64P4 decreases the LUT usage by 41% and the FF usage
by 13%. The read and write bandwidths to the DRAM, respectively 76.8 and 64.8 bytes/cycle,
are halved compared to H64P8. As a trade-off, the performance is a factor of two lower.
The H64P2 design has even lower resource usage and bandwidths and fits on the targeted FPGA,
but has a 2× performance decrease compared to the H64P4 design. The effects of several
implementation choices are studied. Changing the SadGenerator's output width decreases its
BRAM usage by 15%. The use of custom accumulators in the ALU-grid achieves a 6× LUT and
3.5× FF usage reduction.
Furthermore, the effects of the parameters Z, L, RF and H P on the architecture's implemen-
tation are studied. Increasing H P drastically improves the performance at the cost of increased
data rates and resource usage. Increasing L increases the encoding efficiency at the cost of a
decrease in performance and an increase in resource usage. Doubling RF increases the encoding
efficiency at the cost of a performance decrease.
When the ME is performed with 64×64 prediction blocks, Z has to be 64. In case the
maximum prediction block size is 32×32, Z can be chosen as 32 or 96. Choosing Z=96 over Z=32
increases the performance by a factor of 3.2 and the RF-data reuse by a factor of 3. The cost
is a significant increase in resource usage: the number of accumulation values in the ALU-grid
increases by 3×. Several other components increase by a factor of 9: the number of BRAMs
used by the SadComparator's Transposer, the number of add trees in the SadComparator and
the SAD-chunk's size in the off-chip memory.
Furthermore, the results of the implemented architecture are compared with an NVIDIA Fermi
GPU implementation. The implemented architecture achieves a 2.5× performance increase over
the GPU for any frame resolution, number of reference frames and search area. Only for a small
search area of 32×32 pixels does the implementation achieve a lower, but still significant, 1.32×
performance increase. Although the power studies shown herein are not enough to detail the
exact gains achieved in terms of power, we believe it is reasonable to consider that the power
consumption of all the FPGA-based implementations is about one order of magnitude lower than
that of the GPU.
6. Conclusions and Future work
Contents
6.1 Conclusion
6.2 Future work
6.1 Conclusion
A motion estimation accelerator, based on the data-flow approach, is designed in this thesis
for the new High Efficiency Video Coding standard. Motion estimation is the most computationally
intensive part of video encoding, taking up to 80% of the total computation time. HEVC's ME
supports different sizes of prediction blocks, up to 64×64, in combination with multiple reference
frames and search area sizes ranging from 32×32 to 256×256 to look for the best matching
prediction block. These numerous configurations pose a true challenge for a ME architecture.
The proposed architecture implements a Full Search ME algorithm, evaluating every prediction
candidate in the search area. This requires a huge amount of bandwidth from the memory that
stores the reference frames. In order to save computation time and bandwidth, hierarchical SAD
computing is used to obtain the larger prediction blocks' SADs. The architecture is
highly parallelised and focuses on performance and data reuse. This reduces the bandwidth
from several GBytes per frame to several MBytes, achieving a reduction factor of up to
1037. The architecture is composed of two units. The SadGenerator calculates the fine-grained
SADs. Using these SADs and hierarchical SAD computing, the SadComparator calculates the
motion vector for every prediction block. The architecture works with any size of the smallest
prediction block A×A and largest prediction block Z×Z.
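The hierarchical SAD computation described above can be illustrated in a few lines: fine-grained A×A SADs are computed once per candidate position, and the SADs of all larger block sizes are obtained by summing groups of four. This is only an illustrative NumPy sketch (square, power-of-two blocks assumed), not the hardware data path:

```python
import numpy as np

def fine_sads(cur, ref, a=8):
    """SAD of every A×A sub-block of `cur` against the co-located sub-block
    of `ref`, for a single search candidate position."""
    h, w = cur.shape
    diff = np.abs(cur.astype(np.int32) - ref.astype(np.int32))
    return diff.reshape(h // a, a, w // a, a).sum(axis=(1, 3))

def hierarchical_sads(sad_grid):
    """Reuse the fine-grained SADs: four neighbouring SADs sum to the SAD
    of the next (doubled) block size, up to the largest block."""
    levels = [sad_grid]
    while levels[-1].shape[0] > 1:
        g = levels[-1]
        n = g.shape[0] // 2
        levels.append(g.reshape(n, 2, n, 2).sum(axis=(1, 3)))
    return levels  # levels[0]: A×A SADs, ..., levels[-1]: the Z×Z SAD
```

For a 64×64 Z-OB this yields the 8×8, 16×16, 32×32 and 64×64 SADs of one candidate from a single pass over the pixel differences.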
Using the Maxeler framework, the architecture is implemented with a high-level HDL in a
highly pipelined manner on a Virtex-5 FPGA. The implemented accelerator has the potential to
process 720p and 1080p frames at 55,6 fps and 27 fps, respectively. These speeds outperform
the reference GPU implementation by a factor of 2,5. The core of the SadGenerator is a 2D
ALU-grid that is highly scalable. With its horizontal parallelisation parameter H and processing
power parameter P, the ALU-grid's parallelisation can easily be scaled down to save resource
usage, at the cost of a performance decrease.
Changing the ALU-grid's processing power P from 8 to 4 decreases the LUT usage by 41%
and the FF usage by 13%, at the cost of doubling the processing time. Choosing the H parameter
not to scale with the search area width also allows the architecture to process large search areas
with a high level of parallelism without using huge amounts of resources.
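The scaling behaviour of H and P can be summarised by a first-order model in which throughput is proportional to the ALU-grid parallelism H×P. This is an assumption consistent with the measurements above (halving P doubles the processing time), not a cycle-accurate model:

```python
def relative_processing_time(h, p, h_ref=64, p_ref=8):
    """Processing time relative to the H64P8 reference configuration,
    assuming throughput scales linearly with the parallelism H*P
    (a first-order model, not a cycle-accurate one)."""
    return (h_ref * p_ref) / (h * p)
```

Under this model, `relative_processing_time(64, 4)` yields 2.0, matching the doubled processing time observed for H64P4.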
The reference implementation is designed to perform ME with prediction blocks of 8×8 up to
64×64, 2 reference frames and a search area of 64×64. It achieves a reference frame bandwidth
of 35,3 MBytes per processed 720p frame. With a H64P8 ALU-grid, the implementation
outperforms the GPU implementation by a factor of 2,5. With a H64P4 ALU-grid, the GPU
implementation is outperformed by a factor of 1,25.
6.2 Future work
• After having significantly reduced the bandwidth usage caused by the reference data, the
buffering of the SADs in memory is now responsible for most of the required bandwidth.
Implementing a direct streaming of the SADs from the SadGenerator to the SadComparator
can eliminate the need for buffering.
• To refine the motion vectors, a sub-pixel motion estimation is usually performed. The calcu-
lations and data patterns needed for sub-pixel motion estimation are very similar to the ones
used in this architecture and can be integrated in a future implementation.
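As a hint of how similar the sub-pixel data patterns are, half-pel samples can be produced by averaging neighbouring reference pixels and then searched with the same SAD machinery. Note that HEVC specifies longer (7/8-tap) interpolation filters, so this bilinear version is only an illustrative sketch:

```python
import numpy as np

def half_pel_planes(ref):
    """Bilinear half-pel interpolation (illustrative only; HEVC uses
    7/8-tap filters). Returns the horizontal, vertical and diagonal
    half-sample planes of a reference area."""
    r = ref.astype(np.int32)
    h = (r[:, :-1] + r[:, 1:] + 1) // 2                         # x + 1/2
    v = (r[:-1, :] + r[1:, :] + 1) // 2                         # y + 1/2
    d = (r[:-1, :-1] + r[:-1, 1:] + r[1:, :-1] + r[1:, 1:] + 2) // 4
    return h, v, d
```

Each interpolated plane is streamed and matched exactly like an integer-pel reference area, which is why the existing data path could be reused.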
Appendix A
Section 4.4.1 describes how the pixels from the reference frame are streamed from the off-
chip memory to the SadGenerator for Z equal to 64 and L (search area width) equal to 64. Due
to the stride access pattern limitation of 96 bytes, the streamed RF-line contains data of several
Z-OBs. Depending on which Z-OB is being processed, a mux selects the corresponding bytes
that are needed to process that Z-OB; the remaining data is discarded. Figure A.1 shows the
first RF-lines of adjacent Z-OBs in the DRAM for Z equal to 32 and for different values of L. The
corresponding RF-lines and their Z-OB data are shown in Figure A.2. Figure A.3 and Figure A.4
show the same for Z equal to 64. Figure A.5 and Figure A.6 show the same for Z equal to 96.
When comparing Figure A.2 with Figure A.4 and Figure A.6, it is clear that for the same value
of L, the length of the RF-line stays constant, regardless of Z. With L equal to 32 or 64, the RF-line
always has a length of 192 bytes. With L equal to 128, the RF-line always has a length of 288
bytes. With L equal to 256, the RF-line always has a length of 384 bytes. What does change with
the value of Z are the bytes that are used. With Z and L equal to 32, only 64 bytes of the total
192 bytes are used, resulting in an efficiency of 33%. Table A.1 shows the efficiency for every Z,L
pair.
Table A.1: The RF-line data efficiency for different Z and L values

        Z=32   Z=64   Z=96
L=32     33%    50%    66%
L=64     50%    66%    83%
L=128    55%    66%    77%
L=256    75%    83%    92%
In general, a higher L or Z value results in a more efficient use of the streamed RF data, so
less data has to be streamed to the SadGenerator.
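The efficiency values in Table A.1 can be reproduced from two observations: an RF-line must cover L plus the widest supported Z-OB, rounded up to the 96-byte stride, while only Z+L bytes of it are consumed for a given Z-OB. The sketch below encodes that reading; the stride granularity and the Zmax=96 constant are inferred from the figures, not stated as a formula in the text:

```python
import math

STRIDE = 96   # DRAM stride access granularity in bytes (from Section 4.4.1)
Z_MAX = 96    # widest supported Z-OB, inferred from the figures

def rf_line_length(l):
    """RF-line length in bytes for a search area width `l`."""
    return STRIDE * math.ceil((l + Z_MAX) / STRIDE)

def rf_line_efficiency(z, l):
    """Fraction of the streamed RF-line actually used for one Z-OB."""
    return (z + l) / rf_line_length(l)

# Reproduce Table A.1 row by row (values printed as raw fractions).
for l in (32, 64, 128, 256):
    print(f"L={l}:", [round(rf_line_efficiency(z, l), 3) for z in (32, 64, 96)])
```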
Figure A.1: The first RF-lines of adjacent Z-OBs in the DRAM for Z equal to 32 (panels (a)–(d) for L = 32, 64, 128 and 256)
Figure A.2: The data of multiple Z-OBs inside an RF-line for Z equal to 32 (panels (a)–(d) for L = 32, 64, 128 and 256)
Figure A.3: The first RF-lines of adjacent Z-OBs in the DRAM for Z equal to 64 (panels (a)–(d) for L = 32, 64, 128 and 256)
Figure A.4: The data of multiple Z-OBs inside an RF-line for Z equal to 64 (panels (a)–(d) for L = 32, 64, 128 and 256)
Figure A.5: The first RF-lines of adjacent Z-OBs in the DRAM for Z equal to 96 (panels (a)–(d) for L = 32, 64, 128 and 256)
Figure A.6: The data of a single Z-OB inside an RF-line for Z equal to 96 (panels (a)–(d) for L = 32, 64, 128 and 256)
Appendix B
Figure B.1 shows an overview of the Xilinx Power Estimator's result for the proposed architec-
ture's reference implementation with H64P2, which is described in Section 5.2.2. The estimation
is based on the MAP report generated by Xilinx during the compilation of the implementation with
the Maxeler platform. The total power consumption is 13,603 W. Most power is consumed by
the I/O and the device static power, with respectively 58% and 30% of the total power consumption.
The remaining power is consumed by the transceivers and the core dynamic power, with
respectively 7% and 6% of the total power consumption.
Figure B.1: The Xilinx Power Estimator’s result for the H64P2 reference implementation