Reconfigurable Data Flow Engine for HEVC Motion Estimation
D’huys Thomas
Dissertação para obtenção do Grau de Mestre em Engenharia Electrotécnica e de Computadores
Júri
Presidente: Doutor Nuno Horta
Orientador: Doutor Leonel Augusto Pires Seabra de Sousa
Co-orientador: Doutor Frederico Correia Pinto Pratas
Vogal: Doutor Horácio Cláudio de Campos Neto
January 2014
Acknowledgments
First and foremost, I would like to express my sincerest gratitude to my supervisor, Leonel
Sousa, for his huge help and incredible, never seen before, support during this whole period.
Next I would like to gratefully thank Frederico Pratas for his guidance and support during the
many hours that we worked together. It was a great pleasure. I would also like to thank Svetislav
Momcilovic for helping me understand video encoding, making the GPU implementation used in
this thesis and his friendship. A thank you to all the people who made me feel at home at INESC-
ID, and especially to Aleksandar, Sveta, Hector and Diogo. Furthermore, I would like to thank IST
and the KU Leuven for this Erasmus exchange opportunity, which really enriched my life. I am very
thankful to my family for continuously supporting and motivating me.
To all the wonderful people from all over the world that I was able to meet during my Erasmus
in Lisbon: Muito obrigado, Thank you, Dank u
Abstract
High Efficiency Video Coding (HEVC) is an emerging video coding standard that achieves re-
duced rate distortion at the cost of a high computational load. In this thesis, a reconfigurable design
for HEVC motion estimation is proposed. Full-Search Block-Matching (FSBM) is used for high-
quality video encoding. The design is implemented on a data flow engine with the Maxeler frame-
work. The reconfigurability allows the Coding Units (CUs) to have any set of sizes ranging from
8x8 to 64x64 pixels, also taking non-square shapes into account. The search area width is config-
urable to 32, 64, 128 and 256 pixels. Furthermore, the adopted approach and the implementation
provide a fine-grained trade-off between maximum performance and minimum resource usage.
Experimental results show that 720p video can be processed at 56.9 frames per second (fps).
The hardware resource usage, Look-Up Tables (LUTs) and Flip-Flops (FFs), can be decreased by
41% and 13%, respectively, at the cost of a factor-of-two decrease in performance.
Keywords
Motion Estimation, Full-Search Block-Matching, Variable Block-Size, HEVC, FPGA, Maxeler
Platform, Scalable Design
Resumo
A codificação de vídeo de elevada eficiência (HEVC) é uma norma de codificação de vídeo
emergente que atinge uma relação distorção débito-binário melhorada, mas impõe um elevado
custo computacional. Nesta tese é proposto um acelerador de processamento reconfigurável
para estimação de movimento segundo a norma HEVC. É utilizada pesquisa exaustiva por em-
parelhamento de blocos (FSBM) para codificação de vídeo de alta qualidade. O projeto, baseado
numa arquitetura de fluxo de dados, foi implementado numa plataforma Maxeler. A reconfiguração
permite que as unidades de codificação (UCs) possam assumir um conjunto de tamanhos que
variam de 8x8 a 64x64 pixels, suportando também blocos com geometria não-quadrada. A
largura da área de pesquisa é configurável para 32, 64, 128 e 256 pixels. Além disso, a abor-
dagem adotada e a implementação realizada permitem equilibrar desempenho e recursos de
hardware, com uma granularidade fina. Resultados experimentais mostram que vídeos com 720p
podem ser processados a 56,9 imagens por segundo (fps). Os recursos de hardware da FPGA
usados, tabelas lógicas (LUT) e básculas (FF), podem ser reduzidos 41% e 13%, respetivamente,
tendo como contrapartida um fator de diminuição do desempenho de dois.
Palavras-Chave
Estimação de Movimento, Pesquisa Exaustiva por Emparelhamento de Blocos, Blocos de
Dimensão Variável, HEVC, FPGA, Plataforma Maxeler, Projeto Escalável
Contents
List of Acronyms xvi
1 Introduction 1
1.1 Motivation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2
1.2 Related works . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
1.3 Objectives . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
1.4 Main contributions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
1.5 Dissertation outline . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
2 Background: Video Coding on FPGA 7
2.1 Video Coding . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
2.1.1 High Efficiency Video Coding . . . . . . . . . . . . . . . . . . . . . . . . . . 9
2.1.2 Advanced Motion Estimation . . . . . . . . . . . . . . . . . . . . . . . . . . 12
2.2 FPGA . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
2.2.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
2.2.2 Designing with FPGAs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
2.2.3 Dataflow programming with the Maxeler platform . . . . . . . . . . . . . . . 18
3 Video Coding Architecture 21
3.1 Streaming model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22
3.2 Hierarchical SAD computation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22
3.3 Streaming model with hierarchical SAD computation . . . . . . . . . . . . . . . . . 23
3.4 Streaming Pattern . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26
3.5 SadGenerator Architecture . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30
3.6 Scalable SadGenerator Architecture . . . . . . . . . . . . . . . . . . . . . . . . . . 37
3.7 SadComparator Architecture . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39
3.8 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41
4 Hardware Accelerator for High Efficiency Video Coding 43
4.1 Platform features and restrictions . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44
4.2 Design decisions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45
4.3 SadComparator implementation . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46
4.4 SadGenerator implementation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 50
4.4.1 The RF-stream . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 50
4.4.2 The OB-stream . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 52
4.4.3 The ALU implementation . . . . . . . . . . . . . . . . . . . . . . . . . . . . 53
4.5 Host Code . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 55
4.6 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 56
5 Results 59
5.1 Implementation Specific Optimizations . . . . . . . . . . . . . . . . . . . . . . . . . 60
5.1.1 SadGenerator Output Width . . . . . . . . . . . . . . . . . . . . . . . . . . . 60
5.1.2 Custom Accum . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 61
5.2 Experimental Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 62
5.2.1 Framework . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 62
5.2.2 Reference Implementation . . . . . . . . . . . . . . . . . . . . . . . . . . . 62
5.2.3 Design Comparison . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 64
5.3 Comparison . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 74
5.3.1 About the GPU implementation . . . . . . . . . . . . . . . . . . . . . . . . . 74
5.3.2 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 75
5.4 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 77
6 Conclusions and Future work 79
6.1 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 80
6.2 Future work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 81
A Appendix A 85
B Appendix B 91
List of Figures
2.1 A basic video coder . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
2.2 H.265/HEVC encoder with intra/inter selection. . . . . . . . . . . . . . . . . . . . . 9
2.3 (a) Grouping of CTUs in slices and tiles, (b) Subdivision of a CTB into CBs . . . . 10
2.4 Modes for splitting a CB into PBs in case of inter prediction . . . . . . . . . . . . . 12
2.5 The basic structure of an FPGA . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
2.6 A simplified example of a configurable logic block . . . . . . . . . . . . . . . . . . . 15
2.7 An overview of the main features of the Virtex-5 LX330T FPGA from Xilinx that is
used in this thesis [20]. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16
2.8 An overview of a standard software application versus a dataflow application [17] . 19
2.9 The architecture of a DFE and its connections [16] . . . . . . . . . . . . . . . . . . 20
3.1 The streaming model of the FSME accelerator . . . . . . . . . . . . . . . . . . . . 22
3.2 Hierarchical SAD computation inside a Z-OB . . . . . . . . . . . . . . . . . . . . . 23
3.3 Detailed model of the accelerator . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24
3.4 The memory is split into an even and odd SAD-buffer . . . . . . . . . . . . . . . . 24
3.5 An overview of the execution sequence of the SadGenerator's and the SadCom-
parator's Z-iterations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25
3.6 The streaming pattern of the Z-OBs in the original frame . . . . . . . . . . . . . . . 26
3.7 The RF-chunk with multiple reference frames that is streamed alongside each Z-
OB-chunk . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27
3.8 The SAD-chunks are streamed to and from the memory. Each SAD-chunk contains
all the A-SADs of one Z-OB. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27
3.9 The streaming of the MV-Chunks . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28
3.10 An overview of the processing of a frame with a total of 4 Z-OBs . . . . . . . . . . 29
3.11 An RF-line is used by A×L SADs of one A-OB . . . . . . . . . . . . . . . . . . . . 30
3.12 A×L ALUs are grouped in a grid . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30
3.13 An ALU has as input A reference pixels and A original pixels, and as output an
accumulated value . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31
3.14 An overview of which ALU lines calculate which SADs by using which lines from the SA 31
3.15 The output pattern of the SadGenerator when calculating the SADs of a single
A-OB, named OB 0, at a time . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32
3.16 By extending the size of the RF-line, a row of A-OBs uses the RF-line to calculate
all its SADs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33
3.17 The output pattern of the SadGenerator when calculating the SADs of one row of
A-OBs at a time . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34
3.18 By extending the number of RF-lines, multiple rows of A-OBs use the RF-lines to
calculate all their SADs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35
3.19 The output pattern of the SadGenerator when calculating the SADs of all Z-OB’s
A-OBs at a time, with A=8 and Z=32 . . . . . . . . . . . . . . . . . . . . . . . . . . 36
3.20 During different OB-iterations, another sub-RF-line, part of the RF-line, is sent to
the ALU-grid . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37
3.21 An overview of the SAD calculation sequence . . . . . . . . . . . . . . . . . . . . . 37
3.22 An overview of the SadGenerator . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38
3.23 Four different Horizontal parallelizations for the ALU grid . . . . . . . . . . . . . . . 38
3.24 From the streamed RF-line, H+A-1 pixels are sent to the ALU grid per horizontal
iteration . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39
3.25 Two different Processing power configurations of the ALUs . . . . . . . . . . . . . 39
3.26 The calculation sequence of an H=L/4 P=A/8 configuration . . . . . . . . . . . . . 40
3.27 An overview of the SadComparator in the situation of Figure 3.2 . . . . . . . . . . 40
4.1 An overview of the architecture on the FPGA . . . . . . . . . . . . . . . . . . . . . 45
4.2 Detailed ALU of a H(L)P(8) configuration . . . . . . . . . . . . . . . . . . . . . . . . 47
4.3 SadComparator with Transposer . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48
4.4 The Transposer consists of two memory banks that allow for simultaneous read and
write capability . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48
4.5 The transposer . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48
4.6 An overview of the iterations when processing a Z-OB . . . . . . . . . . . . . . . . 50
4.7 Reference frame in the DRAM . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51
4.8 The first RF-lines of adjacent Z-OBs in the DRAM for Z and L equal to 64 . . . . . 51
4.9 The data of multiple Z-OBs inside an RF-line for Z and L equal to 64 . . . . . . . . 52
4.10 The access pattern of the RF-line inside the SadGenerator for different Z-OBs, with
Z,L equal to 64 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 52
4.11 The 8 Original Blocks are stored into the B-RAM and distributed to the ALU grid per
row . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 53
4.12 Detailed ALU implementation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 54
4.13 The SadGenerator ALU’s registers with library accumulators . . . . . . . . . . . . 54
4.14 The SadGenerator ALU’s registers with custom accumulators . . . . . . . . . . . . 55
5.1 The resource usage of the SadGenerator’s kernel for different accum implementa-
tions relative to the custom accum implementation, for H1A8 and L=64 . . . . . . . 61
5.2 An in-depth overview of the resource usage of the H64P2 reference implementation 64
5.3 The effect of four parameters on the architecture’s performance . . . . . . . . . . . 64
5.4 The resource usage for different designs . . . . . . . . . . . . . . . . . . . . . . . . 65
5.5 The number of cycles needed to process a 720p frame. The red line indicates
the maximum number of cycles allowed for real-time (25 fps) processing of a 720p
frame. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 66
5.6 The read and write rates to the DRAM for each design . . . . . . . . . . . . . . . . 66
5.7 The total power usage per design . . . . . . . . . . . . . . . . . . . . . . . . . . . . 66
5.8 The number of cycles needed for each unit to process independently one Z-OB . . 68
5.9 The total number of cycles needed to process one 720p frame by both units working
in parallel . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 69
5.10 The amount of data used by the SadGenerator when processing one Z-OB for
different sizes of Z . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 69
5.11 The total amount of data used by the SadGenerator when processing one 720p
frame for different sizes of Z . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 70
5.12 The read and write rates of the DRAM for different sizes of Z . . . . . . . . . . . . 70
5.13 The number of cycles needed for each unit to process independently one Z-OB . 71
5.14 The total number of cycles needed to process one 720p frame by both units working
in parallel . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 72
5.15 The amount of data used by the SadGenerator when processing one Z-OB for
different sizes of L . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 72
5.16 The read and write rates of the DRAM for different sizes of L . . . . . . . . . . . . . 73
5.17 The number of cycles needed to process one 720p frame for different numbers of
reference frames . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 74
5.18 The amount of data used by the SadGenerator when processing one Z-OB for
different sizes of RF . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 74
5.19 A comparison of the GPU and proposed implementation, for different frame sizes . 76
5.20 A comparison of the GPU and proposed implementation, for different numbers of
reference frames . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 76
5.21 A comparison of the GPU and proposed implementation, for search area sizes . . 77
A.1 The first RF-lines of adjacent Z-OBs in the DRAM for Z equal to 32 . . . . . . . . . 87
A.2 The data of multiple Z-OBs inside an RF-line for Z equal to 32 . . . . . . . . . . . . 88
A.3 The first RF-lines of adjacent Z-OBs in the DRAM for Z equal to 64 . . . . . . . . . 88
A.4 The data of multiple Z-OBs inside an RF-line for Z equal to 64 . . . . . . . . . . . . 89
A.5 The first RF-lines of adjacent Z-OBs in the DRAM for Z equal to 96 . . . . . . . . . 89
A.6 The data of a single Z-OB inside an RF-line for Z equal to 96 . . . . . . . . . . . . 90
B.1 The Xilinx Power Estimator’s result for the H64P2 reference implementation . . . . 92
List of Tables
5.1 Resource usage relative to the design with 1 output cycle . . . . . . . . . . . . . . 61
5.2 Parameters . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 63
5.3 Experimental results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 63
5.4 Major resource changes when increasing Z, relative to Z=32 . . . . . . . . . . . . 67
5.5 The difference in increase of RF data and SAD data compared to the increase of
the Search Area L×L, relative to L=32 . . . . . . . . . . . . . . . . . . . . . . . . . 73
A.1 The RF-line data efficiency for different Z and L values . . . . . . . . . . . . . . . . 86
List of Acronyms
ALU Arithmetic Logic Unit
A-OB AxA Original Block
BRAM Block random-access memory
CU Coding Unit
DFE Data Flow Engine
DRAM Dynamic random-access memory
FPGA Field Programmable Gate Array
FSME Full Search Motion Estimation
HEVC High Efficiency Video Coding
MAD Mean Absolute Difference
ME Motion Estimation
MSE Mean Square Error
MV Motion Vector
OB Original Block
PB Prediction Block
PCIe Peripheral Component Interconnect Express
RB Reference Block
RF Reference Frame
SA Search Area
SAD Sum of Absolute Differences
SSE Sum of Squared Errors
SM Streaming Multi-processor
VBSFSME Variable Block Size Full Search Motion Estimation
Z-OB ZxZ Original Block
1 Introduction
Contents
1.1 Motivation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2
1.2 Related works . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
1.3 Objectives . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
1.4 Main contributions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
1.5 Dissertation outline . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
1.1 Motivation
Video coding continues to play an increasingly important role in our everyday life. It is em-
bedded in a wide range of applications that have become indispensable, including digital TV,
video conferencing, surveillance and Blu-ray. Video applications are becoming ever more de-
manding, with higher quality and resolutions such as 8K UHD (Ultra High Definition). New appli-
cations, such as multiview capture and display, also force video coding to continuously improve
and innovate.
The newest video coding standard, HEVC/H.265, was approved by the International Telecom-
munication Union (ITU-T) in April 2013 to meet these new demands [15]. It is the successor of
the widely used H.264/MPEG-4 AVC (Advanced Video Coding) standard and was jointly devel-
oped by the ISO/IEC Moving Picture Experts Group (MPEG) and the ITU-T Video Coding Experts
Group (VCEG). The standard exploits statistical correlation in the encoder to reduce the data rate
of the video while assuring high quality. Between successive video frames (inter mode) temporal
redundancies are further exploited. Within a video frame (intra mode) spatial redundancies are
exploited. The encoding process is completed by transforming the signal, followed by quantisation
and entropy encoding in order to exploit redundancies. HEVC introduces many improvements
over previous coding standards. Besides higher quality and support for ultra-high resolutions, the
most important improvement is a significantly higher coding efficiency. Compared to H.264/MPEG-4
AVC, the data compression ratio is typically doubled for the same video quality and resolution,
but this comes at the cost of dramatically increased computational requirements, which poses a
new challenge, namely for achieving real-time processing.
It is important to reduce the computational requirements and design specialised processing
engines to make real-time video encoding and decoding with the newest HEVC standard acces-
sible for mobile and embedded devices that often lack high computational power. Furthermore,
lowering the computational requirements will greatly reduce the energy consumption. To this end,
the effort should be concentrated on the most computationally demanding part of video coding,
which is the Motion Estimation (ME) that takes place on the encoder side. ME accounts for up to
90% of the computational requirements of the encoding process. In this step, the temporal redundan-
cies are exploited at the level of the rectangular blocks into which each frame is divided. For each
rectangular block, the search area of multiple reference frames is searched for the reference
block that results in a minimum residual signal. The result of the ME is a motion vector for each
rectangular block in the frame, which points to the block in a reference frame that leads to this
minimum residual. The search for the motion vector can be performed in different ways. Exhaus-
tive search compares each rectangular block with all the possible candidate blocks in the search
area. Other methods decrease the number of comparisons by reducing the search space, but
yield suboptimal results, which decreases the quality of the image.
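The exhaustive search described above can be sketched in software. The following Python fragment is a conceptual illustration only; the thesis realises this in hardware, and the function name, frame layout and toy search range are assumptions made for the example. For one original block, it evaluates every candidate displacement in a reference frame and keeps the motion vector with the lowest Sum of Absolute Differences (SAD).

```python
import numpy as np

def full_search(orig_block, ref_frame, top, left, search_range):
    """Exhaustive (full-search) block matching: compare one original block
    against every candidate whose displacement from (top, left) lies within
    +/- search_range, and return the best motion vector and its SAD."""
    n = orig_block.shape[0]            # square n x n block for simplicity
    height, width = ref_frame.shape
    best_sad, best_mv = None, (0, 0)
    for dy in range(-search_range, search_range + 1):
        for dx in range(-search_range, search_range + 1):
            y, x = top + dy, left + dx
            if y < 0 or x < 0 or y + n > height or x + n > width:
                continue               # candidate falls outside the frame
            candidate = ref_frame[y:y + n, x:x + n]
            # Sum of Absolute Differences: the matching cost minimised here
            sad = int(np.abs(orig_block.astype(np.int64)
                             - candidate.astype(np.int64)).sum())
            if best_sad is None or sad < best_sad:
                best_sad, best_mv = sad, (dy, dx)
    return best_mv, best_sad
```

Note the O(search_range²) candidate count per block; this is exactly the data-intensive workload that motivates the hardware accelerator.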
Video encoding can be performed on a wide range of platforms, ranging from multicore general-
purpose central processing units (CPUs) to application-specific integrated circuits (ASICs). Field
Programmable Gate Arrays (FPGAs) are becoming an increasingly popular choice for several
reasons. An FPGA is fully configurable for the needs of the application, can be remotely repro-
grammed with new bitstreams, and avoids the high non-recurring engineering (NRE) expenses
associated with an ASIC design.
Synthesis tools are used to configure the operation of the FPGA. They model a design and
can simulate its behavior to verify its correct functionality. The hardware description language
(HDL) that comes with the synthesis tool allows a designer to model a design. Logic synthesis is
performed at the register transfer level (RTL) to generate a bitstream that is used to configure the
FPGA. Verilog and the VHSIC Hardware Description Language (VHDL) are popular HDLs for
which synthesis tools exist. They allow for a fine-grained RTL design with a great deal of control.
With the increasing complexity of designs, there is a need for high-level synthesis tools that raise
the level of abstraction of the HDL. This higher level of abstraction simplifies the design and
drastically decreases the design time, while giving up some of the fine-grained control.
MaxCompiler is a programming tool suite that describes hardware at such a high abstraction
level. It provides a data flow model where data is streamed from the memory and processed by
several computation units without being written to the off-chip memory until the chain of process-
ing is complete. This method is especially favourable for data-intensive applications, since the
expensive write-back to memory after each computation is avoided. Motion estimation is a good
example of a data-intensive application, considering that each rectangular block in a frame has to
be compared with many other rectangular blocks in multiple reference frames.
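The contrast with the conventional write-back model can be illustrated, purely conceptually, with a chain of Python generators; MaxCompiler itself generates hardware pipelines rather than software, and the stage names below are invented for the example. Each pixel flows through all processing stages in turn, and no intermediate array is ever materialised in memory.

```python
def stream_pixels(frame):
    # Stage 1: stream pixels one at a time from "memory" (a nested list here)
    for row in frame:
        for pixel in row:
            yield pixel

def absolute_difference(stream_a, stream_b):
    # Stage 2: element-wise |a - b|, computed as the data flows past
    for a, b in zip(stream_a, stream_b):
        yield abs(a - b)

def accumulate(stream):
    # Stage 3: fold the incoming differences into a single SAD value
    total = 0
    for value in stream:
        total += value
    return total

# The stages are chained: each pixel traverses the whole pipeline before
# the next one is read, so no intermediate result is written back.
original  = [[1, 2], [3, 4]]
reference = [[2, 2], [1, 4]]
sad = accumulate(absolute_difference(stream_pixels(original),
                                     stream_pixels(reference)))
```

For these two 2x2 blocks the pipeline yields a SAD of 3, without ever storing the intermediate difference values.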
1.2 Related works
Motion Estimation is the most computationally intensive part of video encoding. It is used in
many previous standards as well as in the newly introduced HEVC. Especially for Full Search
Motion Estimation, most research focuses on effectively reusing the huge amounts of data that
are needed. Four main levels of data reuse are described in [2]. The more levels of data reuse
are supported, the higher the degree of data reuse, which lowers the bandwidth and the energy
consumption. A high degree of data reuse can be accomplished by utilising caches, as in [10],
and by exploiting a high degree of parallelization, as in [7] and [6]. The design proposed herein
combines a high degree of data reuse with a smart parallelization approach.
The HEVC standard extends the use of variable block sizes. Most Variable Block Size FSME
(VBSFSME) architectures, such as [3], [Yu-Wen Huang and Chen] and [8], use an N×N paralleliza-
tion grid, with N being the block width. This parallelization limits the data reuse and processing
time, especially when large search areas are used. When supporting the new large block sizes
introduced by HEVC, these designs may not be feasible for implementation due to the high re-
source usage of the N×N parallelization. The design presented in this thesis uses an L×N paral-
lelization, allowing it to scale not only with the search area width L, but also with the block size N.
When large search areas are used, fast processing is achieved due to the L parallelization. The
same holds for large block sizes due to the N parallelization.
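The property that makes variable block sizes cheap to support is that the SAD is additive: for a fixed motion vector, the SADs of four adjacent NxN sub-blocks sum to the SAD of the enclosing 2Nx2N block. The sketch below illustrates this bottom-up merging only; it is not the proposed architecture, and `sad` and `merge_sads` are hypothetical helpers introduced for the example.

```python
import numpy as np

def sad(a, b):
    """Sum of Absolute Differences between two equally sized blocks."""
    return int(np.abs(a.astype(np.int64) - b.astype(np.int64)).sum())

def merge_sads(sad_grid):
    """Given an s x s grid of SADs for N x N sub-blocks, all computed for the
    same motion vector, return the (s/2) x (s/2) grid of SADs for the
    enclosing 2N x 2N blocks by summing each 2 x 2 group of neighbours."""
    s = sad_grid.shape[0]
    return sad_grid.reshape(s // 2, 2, s // 2, 2).sum(axis=(1, 3))
```

Starting from an 8x8 grid of 8x8-block SADs for a 64x64 region, three successive merges yield the SADs for the 16x16, 32x32 and 64x64 block sizes; non-square shapes such as 32x8 correspond to summing other groupings of the same 8x8 SADs.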
According to the target platform, the available resources and the required performance, the
architecture herein proposed can be scaled. Both the search area width L and the block size N
parallelization can be scaled down in order to use fewer resources, at the cost of more processing
time. This is a unique feature that, at the time of writing, is not present in any known work.
1.3 Objectives
Given the immense computational requirements of the new HEVC video coding standard, the
main goal of this work is to develop a hardware accelerator that tackles the most computationally
intensive part of HEVC video coding, namely the motion estimation on the encoder side. The
motion estimation is performed with full search to achieve the optimal video quality. In order to
process the huge amounts of data involved in the full search effectively, data access regularity is
exploited to design a hardware accelerator based on the data flow approach.
The motion estimation accelerator was designed with the following main objectives in mind:
• a scalable architecture that can balance precisely between resource usage and speed per-
formance
• an adaptable architecture supporting different search area sizes, multiple reference frames
and different rectangular block sizes and shapes
• an implementation of the architecture on a Virtex 5 FPGA.
1.4 Main contributions
This thesis presents a high-definition motion estimation accelerator with full search for the
newest HEVC standard. The accelerator is a full streaming solution implemented on an FPGA
with the Maxeler framework. It can balance between real-time 1080p motion estimation and low
resource usage. A search area width of 32, 64, 128 or 256 pixels is supported, and block sizes of
8×8, 16×16, 32×32 and 64×64 are considered. The design can easily be extended with custom
block shapes, for example 32x8, as long as the granularity remains 8x8. The accelerator is highly
parallelised, with a focus on reducing the data bandwidth and increasing the speed performance.
Furthermore, a trade-off between performance and resource usage is possible due to the adoption
of a scalable ALU grid.
1.5 Dissertation outline
This thesis is organised in the following way:
• Chapter 2: An overview of the HEVC video coding, in particular the motion estimation
component, is given in this chapter followed by an introduction to FPGAs and dataflow com-
puting.
• Chapter 3: This chapter describes the proposed motion estimation architecture.
• Chapter 4: The implementation of the proposed architecture on a Virtex 5 FPGA is dis-
cussed in this chapter.
• Chapter 5: The results of the implementation and a comparison with a reference GPU
implementation are given in this chapter.
• Chapter 6: Finally, a global conclusion and possible directions for future work are presented
herein.
2 Background: Video Coding on FPGA
Contents
2.1 Video Coding . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
2.2 FPGA . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
This chapter provides the theoretical background for the rest of the thesis, considering both
video coding and the FPGA-based hardware platforms used. Section 2.1 outlines the video coding
procedure and the most recent H.265/High Efficiency Video Coding (HEVC) standard. In particu-
lar, it focuses on video encoding, and more specifically on the motion estimation module. Sec-
tion 2.2 is mainly related to the FPGA technology that is used to implement the motion estimation
architecture proposed in this thesis.
2.1 Video Coding
The video coding procedure aims at providing an efficient compression mechanism, and the
reverse decompression, of raw digital video in order to decrease both video storage and video
communication requirements. The source of a digital video signal is originally a camera or a
video synthesis tool such as animation software. This digital video signal can be optionally pre-
processed to enhance the performance of the encoding step. During the encoding the video is
converted to a bit stream. This bit stream is used to store the video on a medium or to transmit
it over a communication channel. When displaying the video, the bitstream is first decoded and
then post-processed. A basic architecture of a video coder is presented in Figure 2.1.
Figure 2.1: A basic video coder
The video coding standard specifies the structure and the syntax of a bitstream. In such a way,
the standard provides the constraints on the bitstream that must be respected by the encoder, as
well as the procedure to interpret the bitstream by a decoder. This ensures that the decoding
of a given bitstream with any decoder implementation (where both conform to the standard)
results in the same output. Developers are free to implement alternative decoders as long as
they are functionally equivalent to the method in the standard. Moreover, the limited scope of
the standard grants the freedom to optimise the implementations to a specific application when
balancing between compression quality, implementation cost, time to market, etc. The video
coding technique analysed herein is the newly approved HEVC standard [15].
2.1.1 High Efficiency Video Coding
H.265/HEVC is the newest international video coding standard, approved by the International
Telecommunications Union (ITU-T) and the International Organization for Standardization/International
Electrotechnical Commission (ISO/IEC). Like its predecessor H.264/AVC, it uses a hybrid video
coding scheme, where statistical correlation is exploited in the encoder either between successive
video frames (inter prediction mode) or within a frame (intra prediction mode).
Figure 2.2: H.265/HEVC encoder with intra/inter selection.
Each frame of a video is split into blocks in which these temporal (former) and spatial (latter)
redundancies are exploited (see Figure 2.2). When inter-mode is selected, the blocks are predicted
from a decoded picture buffer, while in the case of intra-mode, the blocks are predicted using the
samples from the adjacent decoded blocks. The residual signal is transformed, scaled and quan-
tized before it is entropy-encoded and transmitted. The decoding process is also implemented in
the feedback loop of the encoder in order to reconstruct the decoded picture buffer. A deblocking
filter is applied to improve the visual aspect of the reconstructed pictures, and smooth the artifi-
cial edges caused by the blocking nature of the encoder. By relying on new advanced encoding
techniques, the new standard provides an increased compression efficiency of up to 50% when
compared to the previous standard.
A – Picture representation A picture consists of multiple samples. Each sample represents
either a brightness value or, for colour pictures, one of the chroma values. HEVC supports
several colour spaces, among which the most typically used is YCbCr with 4:2:0 sampling [13].
The Y component represents the brightness information and is called luma. The two chroma
components are represented by Cb and Cr and store the blue and red colour information, respectively.
Considering the fact that the human visual system is less sensitive to the chrominance than to the
luminance, the 4:2:0 sampling structure is typically used, in which each chroma component is
subsampled by a factor of 2 in both the horizontal and vertical direction. A rectangular picture with
width W and height H in luma samples is accompanied by two chroma components of
dimension W/2×H/2. Each sample of a component is represented with 8 or 10 bits of precision.
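The memory footprint implied by 4:2:0 subsampling can be illustrated with a small calculation. The following C sketch (the function name and the even-dimension assumption are ours, not part of the standard) counts the samples of one picture:

```c
#include <assert.h>

/* Total number of samples in one YCbCr 4:2:0 picture whose luma plane is
 * w x h samples.  The two chroma planes are each (w/2) x (h/2), so 4:2:0
 * halves the chroma data in both directions; even dimensions are assumed. */
unsigned long yuv420_samples(unsigned long w, unsigned long h)
{
    unsigned long luma   = w * h;               /* Y plane           */
    unsigned long chroma = (w / 2) * (h / 2);   /* one of Cb or Cr   */
    return luma + 2 * chroma;                   /* Y + Cb + Cr       */
}
```

With 8 bits per sample, a 1920×1080 picture therefore occupies 3,110,400 bytes, half of the 6,220,800 bytes that full 4:4:4 sampling would require.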
B – CTUs, Slices and Tiles According to the HEVC standard, a picture is divided into several
coding tree units (CTUs) to ease its processing. Each CTU comprises one luma coding tree block
(CTB) of L×L samples and two chroma CTBs of L/2×L/2 samples. The CTBs represent the
basic processing units in the HEVC standard, which supports several different CTB sizes, where
the value of L can be 16, 32 or 64. The CTBs can be further split into coding blocks (CBs) by
iteratively applying the quadtree structure (see Figure 2.3(b)), which is limited by the minimum
allowed CB size of 8×8 luma samples (with the exception of the first reference frame, where 8×4
and 4×8 shapes are also allowed). A coding unit (CU) is the collection of one luma CB and two
chroma CBs that span the same area of a picture. The ability to encode larger CTUs than in
previous standards is one of the reasons for the increased compression efficiency, especially with
high-resolution video content.
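The recursive quadtree splitting of a CTB can be sketched as follows. The function below is our own illustration, not encoder code: it simply counts the CBs obtained when every node is split all the way down to the minimum CB size.

```c
#include <assert.h>

/* Number of CBs in a size x size CTB when the quadtree is fully expanded
 * down to the minimum CB size min_cb (8x8 luma samples in HEVC).  Each
 * split replaces one block by four half-sized quadrants. */
int fully_split_cb_count(int size, int min_cb)
{
    if (size <= min_cb)
        return 1;                                   /* quadtree leaf */
    return 4 * fully_split_cb_count(size / 2, min_cb);
}
```

For a 64×64 CTB this yields 64 CBs of 8×8 samples; a real encoder of course decides per node whether splitting pays off in rate-distortion terms.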
Slices and tiles are sequences of CTUs into which a picture can be divided (see Figure 2.3(a)).
A slice represents a set of CTUs in raster scan order, which can be correctly decoded without
the use of any data from other slices in the same picture. The main purpose of slices
is the resynchronisation of the encoding process in order to prevent the propagation of errors
in the prediction procedure. Tiles, on the other hand, are self-contained and independently-decodable
structures whose main purpose is to enable the use of parallel processing architectures
for video encoding and decoding. In contrast to slices, tiles are always rectangular
regions of the picture, with approximately an equal number of CTUs in each tile.
Figure 2.3: (a) Grouping of CTUs in slices and tiles, (b) Subdivision of a CTB into CBs
C – Prediction At the level of the CUs, it is decided whether intra-picture (spatial) prediction or
inter-picture (temporal) prediction is used. When intra prediction is chosen, the CBs are predicted
by using the information of adjacent CBs that are already encoded. Directional prediction with
33 different orientations is used at the level of prediction blocks (PBs). PBs are sub-blocks
of a CB, and one luma PB and two chroma PBs are combined into a prediction unit (PU). When inter
prediction is chosen, the CBs are predicted by searching for the best-matching predictor within
already encoded reference frames (RFs). This process, known as motion estimation (ME), is
also performed at the level of PBs. It represents the most computationally demanding module in
the encoding process, typically requiring 80% of the total encoding time [1]. The ME module
is explained in more detail in Section 2.1.2.
D – Transformation, scaling and quantisation After the prediction step, in either intra or
inter mode, the predicted CBs are subtracted from the originals. The result is the residual data.
The residual data of each CB is integer transformed to compact its energy. This transformation
is performed at the level of transform blocks (TBs). TBs are grouped in transform units (TUs),
which can have the sizes of 4x4, 8x8, 16x16 or 32x32 samples. In contrast to previous standards,
the HEVC design allows a TB to span across multiple PBs for inter-predicted CUs in order to
maximise the potential coding efficiency benefits of the quad tree structured TB partitioning [15].
The transform matrix is composed of values that are derived from scaled discrete cosine transform
(DCT) basis functions. For simplicity one 32x32 matrix is specified. For smaller transform units,
a smaller transform matrix is composed by using a sub-sampled version of the specified 32x32
matrix. The scaling operation that is used in the previous standard is omitted since the scaling
is incorporated in the transformation matrix [15]. The result of the transformation is a set of
coefficients in which the energy of the residual is concentrated. These coefficients are then
quantised using uniform-reconstruction quantisation
(URQ).
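The principle of uniform quantisation can be sketched as below. This is an idealised illustration with a hypothetical quantisation step `qstep`; the actual HEVC URQ uses QP-derived scaling factors and integer shifts defined in the standard.

```c
#include <assert.h>

/* Idealised uniform quantisation: map a transform coefficient to the
 * nearest multiple of qstep.  The sign is handled separately so that
 * rounding is symmetric around zero. */
int quantise(int coeff, int qstep)
{
    int mag = coeff < 0 ? -coeff : coeff;
    int lvl = (mag + qstep / 2) / qstep;   /* round to nearest level */
    return coeff < 0 ? -lvl : lvl;
}

/* Uniform reconstruction: every level maps back to level * qstep,
 * hence the name uniform-reconstruction quantisation. */
int dequantise(int level, int qstep)
{
    return level * qstep;
}
```

The quantisation is where the (lossy) compression happens: small coefficients collapse to zero and large ones lose precision, at a granularity controlled by `qstep`.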
E – Deblocking filter A deblocking filter (DF) is applied to the reconstructed samples to remove
blocking artefacts. Blocking artefacts are sharp edges that can appear in an image because the
image is split into blocks during the encoding process. The DF smooths out these sharp edges
by operating on an 8×8 sample grid. The filter automatically determines the blockiness strength
in each part of the frame, and the DF strength is set dynamically, taking the trade-off between
sharpness and smoothness into consideration.
F – Entropy coding At the end of the encoding process, the quantised transform coefficients
are reordered and entropy coded, in order to exploit the statistical redundancy in the encoded
bitstream. In contrast to H.264/AVC, the HEVC standard specifies only one entropy coding
method, namely, context adaptive binary arithmetic coding (CABAC). CABAC is a lossless
compression method which encodes syntax elements as binary symbols that are arithmetically
coded according to the local statistics of recently-coded data, using several probability models.
Figure 2.4: Modes for splitting a CB into PBs in case of inter prediction
2.1.2 Advanced Motion Estimation
Motion estimation is a module applied in the inter-prediction mode in order to exploit the tem-
poral redundancy in successive video frames. During the motion estimation, the current frame
is partitioned into blocks of pixels. For each block in the current frame a best match is located
inside one of the previously encoded frames, which are called the reference frames (RFs). This
best match is the predictor of the block in the current frame. It is used to calculate the residue
block which is the pixel-wise difference between the block and its predictor. The energy of this
residue block is significantly smaller than the energy of the original block, which makes it possible
to quantise it with fewer bits and/or greater precision. The result of the motion estimation is a
motion vector (MV) associated with a current block, that defines in which RF the best match was
found and what the displacement between the original block and its best match is. The MV of
each block of the current frame is given to the next module in the encoding process, such that it
can encode the residue block and its associated motion vector instead of the original block, which
leads to a significantly better coding efficiency. The better the match between the original block
and its predictor, the fewer bits need to be used to encode the current block, and thus the better
the coding efficiency.
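The residue computation described above amounts to a pixel-wise subtraction; a minimal C sketch follows (the function name, row-major layout and data types are our assumptions):

```c
#include <assert.h>

/* Pixel-wise residue between an n x n original block and its predictor,
 * both stored row-major as 8-bit samples.  The residue can be negative,
 * hence the 16-bit signed output. */
void compute_residue(const unsigned char *block, const unsigned char *pred,
                     int n, short *res)
{
    for (int i = 0; i < n * n; i++)
        res[i] = (short)block[i] - (short)pred[i];
}
```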
In the HEVC standard, the way the current frame can be partitioned into blocks is standardised.
As explained in Section 2.1.1 and presented in Figure 2.3, a frame is divided into CTUs
which are formed by CUs. Each CU consists of one luma CB and two chroma CBs. During
inter-picture prediction a CB can be split into prediction blocks (PBs). The motion estimation in
the HEVC standard is performed at the level of these PBs. For each PB of the current frame, a MV
is found that points to its predictor. A CB can be split into one, two or four PBs. Figure 2.4 lists
the supported modes of splitting a CB into PBs in case of inter prediction. From here on a PB
in the current frame is called an original block (OB) and a PB in the reference frame is called a
reference block (RB).
It is very important to find the best possible predictor during the ME, because a better
predictor means a better coding efficiency. The HEVC standard supports searching for
the best predictor in up to 16 RFs. The more RFs that are used, the better the predictor will
be, but the more computationally intensive the ME is. More important is how the search for the
best matching predictor in each RF is performed. The algorithm or strategy is not defined in the
standard and can be freely chosen by the designer. Due to the high computational load of ME, a lot
of research has been performed on developing strategies that try to find a predictor as good as
possible with a limited amount of computation. As a result, there are many strategies
and it is important to choose the right one for each application and its goal.
There are several evaluation metrics to measure how well an RB in the reference frame predicts
an OB in the current frame. Sum of squared errors (SSE), mean square error (MSE) and
mean absolute difference (MAD) are possible metrics. The most used metric is the sum of
absolute differences (SAD). Its value is calculated by computing the absolute difference between
each pixel in the OB and the corresponding pixel in the RB; these absolute values are summed
to form the SAD value. The SAD is used to calculate the distortion. As shown in Equation 2.1,
the distortion is calculated by adding an extra term to the SAD. This extra term is equal to a
constant value λ multiplied by the number of bits that are needed to represent the motion vector.
The RB that corresponds to the lowest distortion value is the best predictor for the OB. The SAD
is one of the simplest metrics that takes every pixel in both blocks into account. This simplicity
makes it easy to implement many SAD-computation components with a limited amount of
resources.
Distortion = SAD + λ×#bits(MV ) (2.1)
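The SAD metric and the distortion of Equation 2.1 map directly to a few lines of C. In this sketch, the λ value and the bit-cost of the motion vector are taken as plain integer parameters; how an encoder derives them is outside this illustration.

```c
#include <assert.h>
#include <stdlib.h>

/* Sum of absolute differences between an n x n original block (OB) and a
 * candidate reference block (RB), both stored row-major. */
unsigned block_sad(const unsigned char *ob, const unsigned char *rb, int n)
{
    unsigned s = 0;
    for (int i = 0; i < n * n; i++)
        s += (unsigned)abs((int)ob[i] - (int)rb[i]);
    return s;
}

/* Distortion = SAD + lambda * #bits(MV)   (Equation 2.1) */
unsigned distortion(unsigned sad, unsigned lambda, unsigned mv_bits)
{
    return sad + lambda * mv_bits;
}
```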
Besides the different evaluation metrics, there is also a wide range of search algorithms. The choice
of search algorithm decides which blocks in the reference frames will actually be compared with
the OB and which will not be taken into consideration.
Full Search Motion Estimation (FSME) is the most straightforward and thorough search
algorithm. It calculates and compares all SADs from all possible candidates in a square L×L
search area around the OB’s location in the RF. L/2 is called the search range. A motion estimation
algorithm that uses FSME with a search range of L/2 has to calculate and compare L×L SADs in
each reference frame, for each OB. Increasing the search range will thus quadratically increase
the number of SADs. It is an algorithm with superior performance in finding the best match, but in
return has the highest computational load.
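FSME over an L×L search area can be sketched directly from this description. The sketch below assumes the caller guarantees that the whole search area lies inside the reference frame; a real implementation must clip the candidates at the frame borders.

```c
#include <assert.h>
#include <stdlib.h>
#include <limits.h>

/* SAD between an n x n OB and the RB at position (rx, ry) in a reference
 * frame of width ref_w (row-major, no bounds checking). */
static unsigned sad_at(const unsigned char *ob, int n,
                       const unsigned char *ref, int ref_w, int rx, int ry)
{
    unsigned s = 0;
    for (int y = 0; y < n; y++)
        for (int x = 0; x < n; x++)
            s += (unsigned)abs((int)ob[y * n + x] -
                               (int)ref[(ry + y) * ref_w + rx + x]);
    return s;
}

/* Full search: evaluate every displacement (dx, dy) in [-L/2, L/2) around
 * the OB position (x0, y0) -- L x L candidates in total -- and keep the
 * motion vector of the minimum SAD. */
unsigned full_search(const unsigned char *ob, int n,
                     const unsigned char *ref, int ref_w,
                     int x0, int y0, int L, int *mvx, int *mvy)
{
    unsigned best = UINT_MAX;
    for (int dy = -L / 2; dy < L / 2; dy++)
        for (int dx = -L / 2; dx < L / 2; dx++) {
            unsigned s = sad_at(ob, n, ref, ref_w, x0 + dx, y0 + dy);
            if (s < best) {
                best = s;
                *mvx = dx;
                *mvy = dy;
            }
        }
    return best;
}
```

The two displacement loops make the quadratic cost explicit: doubling the search range quadruples the number of candidate SADs, as stated above.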
Three Step Search (TSS) [4] is a search algorithm that drastically reduces the number of
blocks that are compared, thus reducing the complexity. Compared to FSME, it has
near-optimal performance. A more significant drawback is that in the first step TSS uses a
uniformly allocated checking-point pattern, which is inefficient for smaller search ranges [4].
Diamond search (DS) [24] is another popular and widely used search algorithm. It greatly
outperforms TSS in accuracy and computational load, and is incorporated in many enhanced
search algorithms. A large diamond search pattern (LDSP) and a small diamond search pattern
(SDSP), with 9 and 5 points respectively, are used to locate the best match.
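The two checking-point patterns can be written down explicitly. The offset tables below follow the common description of DS (9-point LDSP, 5-point SDSP); the array names are ours.

```c
#include <assert.h>

/* Checking-point offsets (dx, dy) relative to the current search centre.
 * LDSP: the centre plus 8 points at city-block distance 2 (large diamond).
 * SDSP: the centre plus 4 points at distance 1 (small diamond).
 * DS repeats the LDSP, re-centring on the minimum, until the minimum
 * falls on the centre, and then refines once with the SDSP. */
static const int LDSP[9][2] = {
    {0, 0}, {0, -2}, {1, -1}, {2, 0}, {1, 1},
    {0, 2}, {-1, 1}, {-2, 0}, {-1, -1}
};
static const int SDSP[5][2] = {
    {0, 0}, {0, -1}, {1, 0}, {0, 1}, {-1, 0}
};
```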
Enhanced Predictive Zonal Search (EPZS) [18] is a state-of-the-art search algorithm. It
has near-FSME performance, but is difficult to implement efficiently in hardware due to its
irregular data-access pattern.
After the initial motion estimation, a sub-pixel motion estimation is performed to increase the
coding efficiency. In sub-pixel motion estimation, the region around the best-match prediction
block in the reference frame is interpolated, and this interpolated region is then searched for an
even better match. Sub-pixel motion estimation is outside the scope of this thesis.
2.2 FPGA
2.2.1 Introduction
FPGA stands for "Field-Programmable Gate Array". An FPGA contains a matrix of reconfigurable
gate array logic circuitry and is used to implement custom hardware functionality. When
configured, the internal circuitry is connected in a way that creates the hardware implementation
required by the application. Unlike hard-wired printed circuit board (PCB)
or application-specific integrated circuit (ASIC) designs, which have fixed hardware resources,
FPGA-based systems can literally rewire their internal circuitry to allow reconfiguration. Digital
computing tasks are described in software and compiled by synthesis tools into a bitstream that
contains the information on how the components should be wired together.
Figure 2.5: The basic structure of an FPGA
The structure of an FPGA consists of three major components (see Figure 2.5):
• Configurable logic blocks (CLBs) can be used to implement different functions with com-
binational and sequential logic. They are programmable to provide functionality as simple
Figure 2.6: A simplified example of a configurable logic block
as that of a transistor or as complex as that of a microprocessor. Common components
used in logic blocks are flip-flops (FF), look-up-tables (LUT), block random access memory
(BRAM) and multiplexers. Figure 2.6 illustrates a simplified example of a configurable logic
block.
• Interconnects consist of wire segments of varying lengths which can be interconnected
via electrically programmable switches. They provide routing paths to connect the inputs
and outputs of the CLBs and IOBs onto the appropriate networks.
• Input/Output blocks (IOBs) are the interface between the package pins and the internal
signals.
Currently there are three main technologies by which FPGAs can be programmed. Each technology
has its own advantages, which are discussed briefly below. Antifuse FPGAs are configured
by burning a set of fuses. Once the chip is configured, it cannot be altered anymore. Bug fixes
and updates are possible for new PCBs, but are very difficult or impossible for already manufactured
boards. They are used as ASIC replacements for small volumes. Flash FPGAs may be
re-programmed several thousand times and are non-volatile, which means that they keep their
configuration after power-off. The disadvantages are the use of the more expensive flash memory
and a re-configuration time of several seconds. SRAM FPGAs are currently the dominant technology.
The memory is volatile but can be re-programmed an unlimited number of times. Additional circuitry
is required to load the configuration into the FPGA after power-on, but re-configuration is very fast,
and some devices even allow partial re-configuration during operation. This enables new approaches
and applications that make use of run-time reconfiguration (RTR), where the circuitry is dynamically
loaded during the execution of applications.
Over the past decades FPGAs have evolved enormously. The first modern-era FPGA was
introduced by Xilinx in 1984 [19]. It contained 64 configurable logic blocks and 58 inputs and out-
puts [19]. Thanks to the semiconductor industry, the number of transistors on integrated chips has
increased greatly. This allows a modern state-of-the-art FPGA to contain approximately 1,950K
equivalent logic blocks and around 1200 inputs and outputs [22]. There is still an ongoing trend
of adding specialised blocks to FPGAs. Block RAM is added for fast on-chip data buffering. To
allow fast and area-efficient implementations of logical operations, shifting, addition and multiply-add
arithmetic, digital signal processing (DSP) blocks are integrated in the chips. Both embedded hard
(e.g. PowerPC) and soft (e.g. NIOS II) processors for FPGAs are also available [Leong]. Figure
2.7 shows an overview of the features of the FPGA that is used in this thesis. It provides a total
of 207360 look-up-tables (LUTs), 207360 flip-flops (FFs), 324 block random access memories
(BRAMs) and 192 digital signal processors (DSPs).
Figure 2.7: An overview of the main features of the Virtex-5 LX330T FPGA from Xilinx that is usedin this thesis [20].
The technology used in FPGAs results in costly interconnects and lower operating frequencies.
The associated high latency is compensated by deep pipelining and additional parallelism.
Unlike processors, FPGAs are truly parallel in nature. Different processing
operations do not have to compete for the same resources, if enough resources are provided by
the FPGA. Each independent processing task is assigned to a dedicated section of the chip, and
can function autonomously without any influence from other logic blocks. As a result, the performance
of one part of the application is not affected when more processing is added. This not only
affects the raw calculation performance but also the I/O throughput. As long as there are enough
I/O blocks provided by the FPGA and the bandwidth of the peripherals is not exceeded, adding
more inputs and/or outputs will not decrease the I/O performance of the already used logic on the
FPGA. The deep pipelining and highly parallel nature fit well with streaming/dataflow applications.
The decoupling of communication from computation in these applications, combined with
these techniques, allows for a very high throughput. Taking the parallel nature of FPGAs into
account, it is easy to see that they operate best on problems that can be easily and efficiently
divided into many parallel, often repetitive, computational tasks. The exceptionally high throughput
that can be achieved when combined with a dataflow model makes the FPGA an excellent platform
to solve data-intensive problems. Many applications fall into this class of problems, including image
processing and, in particular, full-search motion estimation, which is the target application of this
thesis.
FPGAs are particularly well suited to fixed-point calculations. These types of calculations require
only a small amount of logic to implement, which gives an FPGA an extremely high calculation
density. In motion estimation, the samples that are worked with are integer values and thus allow
for a high calculation density when implemented on an FPGA.
FPGA devices deliver the performance and reliability of ASICs, without the high non-recurring
cost (NRC) of their complex design flow. The less time-consuming FPGA design flow also allows
for a faster time to market. Nevertheless, it takes specialised skills to develop code for an FPGA.
FPGA development often takes much longer than an equivalent development task for a microprocessor
using a high-level language like C or C++. This is partly due to the tedious, iterative
nature of FPGA code development and the associated long synthesis/simulation/execution design
cycle. In the last few years, specialised FPGA development tools have dramatically decreased the
development time with the use of high-level synthesis tools.
2.2.2 Designing with FPGAs
The designer facing a design problem must go through a series of five phases between the initial
ideas and the final hardware. This series of phases is commonly referred to as the design flow. Most
projects start with a need for something. The first phase of the flow is specifying the requirements.
For example, in the particular case of this thesis the requirements can be summarised as: a motion
estimation accelerator with constraints on the amount of resources used and on the execution
time. The tools supplied by the different FPGA vendors to target their chips do not help the
designer in this phase. For the other four phases, which are discussed briefly below, there is a wide
variety of tools available.
• Design entry consists in transforming the design ideas into a computerised representation.
This is most commonly accomplished using Hardware Description Languages (HDLs). As
the name states, this language is used to describe the hardware. Most HDLs also have the
ability to simulate the behaviour of the described hardware to verify its correct functionality.
An HDL differs from conventional software in the sense that the statements are not sequentially
executed. Instead, the code is executed in parallel, since it describes hardware that
operates concurrently. The two most popular HDLs are Verilog and Very High Speed In-
tegrated Circuit HDL (VHDL). These languages describe the circuit at the register transfer
level (RTL), a design abstraction which models the flow of data between logic and registers.
• Synthesis The synthesis tool receives as input the HDL and the target FPGA model. The
input is used to generate a netlist which is a model at the level of logic gates. It satisfies the
logic behaviour specified in the HDL files and uses the primitives of the specified FPGA type.
The synthesis goes through many steps such as logic optimization, register load balancing,
and other techniques to enhance timing performance [Serrano].
• Mapping and Place-and-route The next step maps the netlist onto the FPGA. For each
component in the netlist, a component on the target FPGA is selected. During the routing
process, these components are connected with each other. This routing process has to take
many constraints into consideration. The most important constraint is the timing: the delay
between connected components. This delay is limited to a threshold in order to meet the
targeted frequency.
• Bit stream generation The configuration of the FPGA as specified in the place and routing
step is stored as a bitstream. This bitstream can be uploaded to the targeted FPGA to
configure it.
Design entry is the only step that requires human labour. The other steps are performed by
tools provided by the FPGA vendors. Describing the design at the register transfer level can be a
difficult task, especially for large designs, because RTL is a low-level design abstraction
that requires the logic to be described in great detail. This detailed description allows for thorough
control of the FPGA configuration but, on the other hand, consumes a lot of design time.
In recent years several high-level HDLs were developed to decrease the time spent in the
design phase. These languages describe the hardware at a higher level, and accompanying tools
are used to compile the description to the RTL level. An example is Simulink, a block diagram
environment that describes the hardware with functional blocks [11]. Another example is the Maxeler
platform, which uses custom Java libraries to describe the hardware at a high level. The next
section discusses the Maxeler platform that is used to implement the architecture proposed in
this thesis.
2.2.3 Dataflow programming with the Maxeler platform
The Maxeler platform offers an efficient way to design dataflow computing solutions. As illustrated
in Figure 2.8, a dataflow application differs fundamentally from a standard software application.
In a standard software application, the source code is transformed into a list of instructions.
These instructions are loaded into the memory together with the data of the application. During
the execution of the application, the instructions and data are loaded from the memory into the
processor. After each execution of an instruction, the result is written back into the memory before
a new instruction is executed. This sequential execution of instructions is limited by the latency of
data movement in this loop. Figure 2.8(a) shows an overview of a standard software application
with its memory loop.
In contrast, the source file of a dataflow application is a dataflow engine (DFE) configuration
file. This file describes the inner structure of the DFE. During the execution of the application,
no instructions are needed. The data is loaded from the memory and streamed to the dataflow
engine. In the dataflow engine, the data flows from core to core and is processed until
the final result is obtained. There is no need to write the result back to memory after each
operation on it. Only the final result that is obtained at the end of the dataflow engine is written
back to the memory, as described in Figure 2.8(b).
The design of the DFE is decoupled into two parts. On one side there are the kernels that
18
2.2 FPGA
(a) Standard software application (b) Dataflow application
Figure 2.8: An overview of a standard software application versus a dataflow application [17]
describe the data processing structure that is needed by the application. Arithmetic units and
data control elements such as multiplexers and fast on-chip memory are included. On the other side
there is one manager that describes the flow of the data stream. The manager interconnects the
dataflow between kernels and with the DRAM memory and the host application through PCIe. It
also handles the buffering of data and automates the conversion between different data path widths
for the streaming between entities with different I/O widths.
Not only are the design of computation and communication decoupled through the use of kernels
and a manager; their implementation on the FPGA is also decoupled, which is highly beneficial for
both communication and computation. The decoupling allows for deeply pipelined kernels
without synchronisation problems and for concurrent computation and dataflow,
which reduces the impact of the high latency that is characteristic of FPGAs.
Both the kernels and the manager are implemented with Maxeler’s custom Java libraries. The
MaxCompiler, that is part of the Maxeler platform, compiles the kernels and manager to VHDL
which is further used to generate a DFE configuration file. This configuration file, .max file, is used
to configure the FPGA and link it with the host application. The custom Java libraries and the
MaxCompiler allow the designer to describe complex hardware structures such as BRAMs, accumulators,
and PCIe- and DRAM-communication modules without the need for time-consuming
RTL design, using Java functions and the built-in automation of the MaxCompiler. The use of these
Java functions and the built-in automation comes at the cost of losing some control over the hardware
generated by the tool [5].
The Simple Live CPU Interface (SLiC) is an API used to control and communicate live with the
DFE from a host application. It allows the user to easily run different profiles on a DFE and send
and receive data from it without the need of developing specialised communication software. The
host application runs on a CPU and is, in this thesis, implemented in C. The various C functions
that are provided by the SLiC are created by the MaxCompiler during the compilation of the kernel
and manager, together with the .max configuration file. Figure 2.9 illustrates an overview of a DFE
and its connections.
Figure 2.9: The architecture of a DFE and its connections [16]
3 Video Coding Architecture
Contents
3.1 Streaming model
3.2 Hierarchical SAD computation
3.3 Streaming model with hierarchical SAD computation
3.4 Streaming Pattern
3.5 SadGenerator Architecture
3.6 Scalable SadGenerator Architecture
3.7 SadComparator Architecture
3.8 Summary
3. Video Coding Architecture
This chapter presents the architecture of a full search motion estimation (FSME) accelerator.
We have chosen to accelerate this component of the video encoder because, as mentioned before,
it is the most data- and computation-intensive one.
3.1 Streaming model
The FSME algorithm is very data and computation intensive. A streaming model is used to
process these huge amounts of data. In a streaming model, data is streamed through a highly
pipelined architecture. As shown in Figure 3.1, the accelerator has two input streams and one
output stream.
Figure 3.1: The streaming model of the FSME accelerator
The first input stream contains the pixels of the reference frame. The second input stream
brings the pixels of the original frame. The accelerator calculates the SADs of the original blocks
and their candidates in the search area of the reference frame. The accelerator’s output stream
contains the resultant motion vectors with the minimum SAD value for each original block. The
selection of the block size to be used for encoding the frame is performed by other components of
the encoder and is out of the scope of this work.
3.2 Hierarchical SAD computation
In hierarchical SAD computation, the prediction blocks can be of different sizes. From here on,
the prediction blocks are called original blocks (OBs). Let us assume the smallest size is A×A (A-OB)
and the biggest size is Z×Z (Z-OB). Depending on the Z/A ratio, there are a number of intermediate
block sizes. In Figure 3.2 the Z/A ratio is 4: Z is four times larger than A and there is one
intermediate block size B.
The SADs of blocks with different sizes can be hierarchically calculated by using the SADs
of the smaller blocks. For example, based on Figure 3.2, the SAD of a Z-OB (Z-SAD) can be
calculated from the SADs of B-OBs (B-SADs), as shown in Equation (3.2), and each B-SAD can
be calculated by adding the A-SADs, as shown in Equation (3.1).
Figure 3.2: Hierarchical SAD computation inside a Z-OB
B-SAD_0 = Σ_{i=0}^{3} A-SAD_i        B-SAD_1 = Σ_{i=4}^{7} A-SAD_i        (3.1)
B-SAD_2 = Σ_{i=8}^{11} A-SAD_i       B-SAD_3 = Σ_{i=12}^{15} A-SAD_i

Z-SAD = Σ_{i=0}^{3} B-SAD_i        (3.2)
In this example there is only one intermediate block size. With a higher Z/A ratio, many other
intermediate block sizes are possible. It is also possible to hierarchically calculate non-square
SADs by adding different groups of SADs. As long as the granularity is A×A all shapes are
possible.
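For the Z/A = 4 example above, the hierarchical combination of Equations (3.1) and (3.2) can be sketched in plain C. The A-SAD indexing follows the equations; in hardware this computation is realised as an adder tree rather than a loop, and the function name is ours.

```c
#include <assert.h>

/* Combine the 16 A-SADs of one Z-OB into 4 B-SADs (Equation 3.1) and the
 * Z-SAD (Equation 3.2).  b_sad[] receives the intermediate results. */
unsigned z_sad(const unsigned a_sad[16], unsigned b_sad[4])
{
    unsigned z = 0;
    for (int b = 0; b < 4; b++) {
        b_sad[b] = 0;
        for (int i = 4 * b; i < 4 * b + 4; i++)
            b_sad[b] += a_sad[i];    /* B-SAD_b = sum of A-SAD_{4b..4b+3} */
        z += b_sad[b];               /* Z-SAD   = sum of the four B-SADs  */
    }
    return z;
}
```

Non-square shapes follow the same principle: any rectangular group of A-SADs (or B-SADs) can be summed, as long as the granularity remains A×A.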
3.3 Streaming model with hierarchical SAD computation
In order to use the hierarchical SAD computation model, following the example in Figure 3.2, first
the SADs of all the A-OBs inside a Z-OB must be calculated. Once all the A-SADs of the Z-OB
are available, all the intermediate SADs up to the final Z-SAD can be hierarchically calculated.
Thus all the A-SADs of one Z-OB need to be stored before the intermediate SADs can be calculated
and the minimum SAD of each shape found. To accommodate the storage of the A-SADs, the
architecture can be further detailed as shown in Figure 3.3.
The accelerator consists of 3 blocks:
• SadGenerator: computes all the A-SADs of each Z-OB.
• Memory: stores the A-SADs.
• SadComparator: generates the Z-SAD and its intermediate SADs and compares them against
each other to find the MV of the minimum SAD for each OB.
3. Video Coding Architecture
Figure 3.3: Detailed model of the accelerator
The original frame is split into Z-OBs that are processed sequentially in a streaming manner.
In detail, this process consists of the following steps. The SadGenerator has as input a Z-OB from
the original frame and its search area from the reference frame. The A-SADs of the Z-OB are
calculated and streamed to the memory. When all the A-SADs of the Z-OB have been streamed to
the memory, the next Z-OB and its search area are processed. The iteration in which all the A-SADs
of a Z-OB are calculated is called a SadGenerator Z-iteration. In the meantime, the SadComparator
reads the A-SADs from the memory and hierarchically computes the intermediate and
Z-SADs. The output stream of the SadComparator contains the motion vectors of the minimum
SADs. The iteration in which all the A-SADs of a Z-OB are processed is called a SadComparator
Z-iteration. The memory stores two SAD-buffers: one into which the SadGenerator is
writing and one from which the SadComparator is reading. This is illustrated in Figure 3.4. When
the SadGenerator is writing into the even SAD-buffer, the SadComparator is reading from the odd
SAD-buffer, and vice versa.
Figure 3.4: The memory is split into an even and an odd SAD-buffer
The SadGenerator and the SadComparator run concurrently. Only during the generation
of the A-SADs of the first Z-OB is the SadComparator idle (since there are no A-SADs
in the memory yet), and during the processing of the A-SADs of the last Z-OB the SadGenerator
is idle (since there are no Z-OBs left to process).
Figure 3.5(a) shows the execution sequence of the SadGenerator's and the SadComparator's
Z-iterations in the case of a frame with 8 Z-OBs and when both iterations take exactly the same
amount of time. The overall performance of the accelerator is determined by the slower unit.
In Figure 3.5(b) the slower unit is the SadGenerator. After each iteration, the SadComparator
has to wait for the SadGenerator to finish calculating the A-SADs. In Figure 3.5(c) the slower
unit is the SadComparator. After each iteration, the SadGenerator has to wait for the SadComparator
to finish reading the A-SADs from the memory before it can write new A-SADs into that
same SAD-buffer. The total execution time for processing one frame is given by Equation (3.3),
where N_{Z-OBs} is the number of Z-OBs being processed, Δ_{slowZiteration} is the execution time
needed to process a single Z-OB by the slower unit (either the SadGenerator or the SadComparator),
and Δ_{fastZiteration} is the execution time needed by the faster unit. When
optimising the execution time of the system, the main goal is to bring the execution times of the
two units as close as possible, i.e., it is important to optimise the slower unit.
TotalExecutionTime = N_{Z-OBs} × Δ_{slowZiteration} + Δ_{fastZiteration} \quad (3.3)
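Equation (3.3) can be checked with a toy Python model (the time values below are illustrative, not measured):

```python
def total_execution_time(n_z_obs, slow_z_iteration, fast_z_iteration):
    # Equation (3.3): the two units overlap, so only the slower unit's
    # iteration time accumulates over all Z-OBs; the faster unit adds
    # one final iteration at the start or the end of the frame.
    return n_z_obs * slow_z_iteration + fast_z_iteration

# 8 Z-OBs; the SadGenerator takes 10 time units per Z-OB, the SadComparator 7:
print(total_execution_time(8, 10, 7))  # 87
```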
(a) Same speed  (b) Slower SadGenerator  (c) Slower SadComparator
Figure 3.5: An overview of the execution sequence of the SadGenerator's and the SadComparator's Z-iterations
3.4 Streaming Pattern
A – Original frame stream  Each original frame is divided into Z×Z blocks (Z-OBs). Z is
the dimension of the biggest prediction block in the motion estimation. During each SadGenerator
Z-iteration, one Z-OB is streamed to the SadGenerator. A Z-OB that is being
streamed to the SadGenerator is called an OB-chunk. Figure 3.6 shows the streaming pattern of
the original frame stream for a frame that is divided into 8×4 Z-OBs. It takes 8×4 SadGenerator
Z-iterations to stream and process all of these Z-OBs.
Figure 3.6: The streaming pattern of the Z-OBs in the original frame
B – Reference frame stream  Alongside each OB-chunk from the original frame, its reference-frame-chunk
(RF-chunk) is streamed to the SadGenerator. This RF-chunk contains the search
area of the Z-OB. The search area is the part of the reference frame (RF) in which a candidate
predictor is looked for. It is a square area with its centre positioned at the Z-OB's upper-left corner
coordinate in the original frame, as illustrated in Figure 3.7. With L the width of the search
area, equal to twice the so-called search range, L×L SADs are calculated for each A-OB
inside the Z-OB. The L×L search area inside the RF-chunk is extended by Z−1 pixels towards
the right and the bottom, such that the candidates that are on the border of the search area
can be evaluated as well. It is possible to look for a candidate in multiple reference frames. In this
case, the RF-chunk contains multiple search areas in order to calculate L×L SADs per reference
frame. Figure 3.7 shows the dimensions of the search area with 3 reference frames and their
positioning relative to the Z-OB in the original frame.
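The number of reference pixels streamed per Z-OB follows directly from this description (a sketch, assuming each of the reference frames contributes one full extended search area; the function name is ours):

```python
def rf_chunk_pixels(L, Z, n_ref_frames=1):
    """Pixels in one RF-chunk: the L x L search area extended by Z-1 pixels
    in each dimension so that border candidates fit, once per reference frame."""
    side = L + Z - 1
    return n_ref_frames * side * side

print(rf_chunk_pixels(L=64, Z=64))                  # 16129 pixels (127 x 127)
print(rf_chunk_pixels(L=64, Z=64, n_ref_frames=3))  # 48387
```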
C – SAD streams  The SAD stream is a continuous stream from the SadGenerator to the
memory. The stream can be divided into SAD-chunks. Each SAD-chunk contains all the A-SADs
of one Z-OB and is generated during one SadGenerator Z-iteration. Figure 3.8 shows the
Figure 3.7: The RF-chunk with multiple reference frames that is streamed alongside each Z-OB-chunk
streaming of the SAD-chunks into and out of the memory. The SAD-chunk numbers correspond to
the numbers of the Z-OBs: SAD-chunk 0 contains all the A-SADs of the first Z-OB. The SAD-chunks
are also streamed from the memory to the SadComparator. Except during the processing
of the first or the last Z-OB, there is always one SAD-chunk being streamed to the memory and
one SAD-chunk being streamed from the memory. In Figure 3.8, SAD-chunk 2 is being read from the
even SAD-buffer in the memory and SAD-chunk 3 is being written into the odd SAD-buffer in the
memory.
Figure 3.8: The SAD-chunks are streamed to and from the memory. Each SAD-chunk contains all the A-SADs of one Z-OB.
D – The MV stream  During one SadComparator Z-iteration, one MV-chunk is streamed from
the SadComparator. An MV-chunk contains the motion vector (MV) of the predictor of each OB
inside a Z-OB. In the case of Figure 3.2, there are 16 A-OBs, 4 B-OBs and 1 Z-OB, so
an MV-chunk contains 21 MVs, as shown in Figure 3.9.
Figure 3.9: The streaming of the MV-Chunks
An overview of the streaming is given in Figure 3.10. In this simplified example, the frame
is divided into 4 Z-OBs. Four OB-chunks are streamed to the SadGenerator alongside their 4
RF-chunks. Four SAD-chunks are streamed from the SadGenerator to the memory and then re-streamed
from the memory to the SadComparator. The SadComparator streams out four MV-chunks
with the motion vectors of all the blocks of the frame.
(a) The first SAD-chunk is generated and streamed to the even SAD-buffer in the memory
(b) The second SAD-chunk is generated while the first SAD-chunk is streamed to the SadComparator
(c) The third SAD-chunk is generated while the second SAD-chunk is streamed to the SadComparator
(d) The fourth SAD-chunk is generated while the third SAD-chunk is streamed to the SadComparator
(e) The final SAD-chunk is streamed through the SadComparator and the last MV-chunk is generated
Figure 3.10: An overview of the processing of a frame with a total of 4 Z-OBs
3.5 SadGenerator Architecture
As stated before, the FSME algorithm is very memory intensive, leading to huge memory
bandwidth requirements. In order to exploit as much as possible the data parallelism inherent to the
algorithm, while maintaining low memory bandwidth requirements, the proposed SadGenerator
architecture makes full use of a highly optimized structure that supports maximum reutilization of
the loaded data.

The core of the SadGenerator is an ALU-grid that computes the SADs of one
A-OB in parallel. With a search area of L×L, each A-OB has L×L SADs. The data needed to
calculate these SADs is the A-OB from the original frame and reference frame data (RF-data):
(L + A − 1) RF-lines with a length of (L + A − 1) pixels. A SAD is identified by its coordinates (x, y),
with x, y ∈ {0, ..., L−1}. The SADs are calculated while the RF-lines are streamed one by one.
One RF-line is used to calculate a maximum of A×L SADs for one A-OB: L because an A-OB only
has L SADs in the x-direction, and A because one line can be used by SADs with at most A different
y-coordinates. The utilisation of an RF-line in both dimensions is illustrated in Figure 3.11.
(a) The vertical utilisation of an RF-line is A. The topmost and bottommost prediction candidates using the RF-line are marked in yellow.
(b) The horizontal utilisation of an RF-line is L. The leftmost and rightmost prediction candidates using the RF-line are marked in yellow.
Figure 3.11: An RF-line is used by A×L SADs of one A-OB
The ALU-grid is a structure with A×L ALUs, each of which calculates a SAD value (see
Figure 3.12). SAD(x, y) is calculated by ALU(x, y mod A).
Figure 3.12: A×L ALUs are grouped in a grid
Each ALU in the ALU-grid is identical (see Figure 3.13). It has two inputs: A pixels of an A-OB
and A pixels of the RF-line. It calculates the absolute difference of the two vectors pixel-wise
and adds those values to an accumulator. After A accumulation steps, the accumulator
contains a complete SAD.
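The per-ALU behaviour can be modelled as below (a minimal sketch; `step` is fed one A-pixel slice of the OB and RF data per accumulation step, and the class name is ours):

```python
class Alu:
    """Model of one grid ALU: accumulates pixel-wise absolute differences."""

    def __init__(self):
        self.acc = 0

    def step(self, ob_pixels, rf_pixels):
        # Both inputs are vectors of A pixels; after A such steps the
        # accumulator holds a complete A x A SAD.
        self.acc += sum(abs(o - r) for o, r in zip(ob_pixels, rf_pixels))
        return self.acc

alu = Alu()
A = 4
for _ in range(A):            # A accumulation steps -> one SAD
    sad = alu.step([10] * A,  # toy original-block row
                   [12] * A)  # toy reference row
print(sad)  # 4 rows x 4 pixels x |10 - 12| = 32
```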
Figure 3.13: An ALU has as input A reference pixels and A original pixels, and as output an accumulated value
Processing one A-OB:  Figure 3.14(a) illustrates an A×L ALU-grid with A equal to four for
the sake of simplicity. Each line of ALUs in the grid has one colour. Figure 3.14(b) illustrates the
L×L SADs calculated by the ALU-grid. The SAD rows' colours in Figure 3.14(b) correspond to
the colours of the ALU rows computing them. Figure 3.14(c) illustrates the RF-data that is needed
to calculate the SADs of one A-OB. The coloured columns span the rows that are needed to
calculate the SADs of the row in the corresponding colour. Because the ALU-grid calculates
the SADs of only one A-OB at a time, a square of (L+A−1)×(L+A−1) pixels from the reference
frame is sufficient to calculate the L×L SADs.
(a) ALU rows calculating SADs  (b) SADs calculated with the use of RF-lines  (c) RF-lines used to calculate SADs with ALUs
Figure 3.14: An overview of which ALU rows calculate which SADs by using which lines from the search area
The first row of ALUs (3.14(a), blue) starts calculating the first row of SADs (3.14(b), blue)
when the first RF-line is streamed (3.14(c), row spanned by the blue column). The second row of
ALUs (yellow) starts calculating the second row of SADs (yellow) when the second RF-line is
streamed (row spanned by the yellow column). The second row of RF-data is also spanned by the
blue column: this RF-line is also used by the first row of ALUs to calculate the first row of SADs.
When the fourth RF-line is streamed (row spanned by the red column), the fourth ALU row (red)
starts calculating the SADs of the fourth row (red). At the same time, the first ALU row (blue) is
finishing the SADs of the first row (blue). When the fifth RF-line is streamed, the first
row of ALUs starts calculating the fifth row of SADs (blue). In this way, from the streaming of
the fourth RF-line on, each ALU is continuously calculating SADs and each RF-line is used by all
four ALU rows.
The ALU-grid computes A×L SADs of an A-OB in parallel. This way of calculating the SADs
greatly reuses the RF-lines' data, resulting in a reduced bandwidth of reference frame data. When
calculating the SADs in the traditional way, each SAD is calculated separately, requiring A×A pixels
from the reference frame per SAD. This results in a total of (A×A)×(L×L) pixels for the calculation
of the L×L SADs of one A-OB. The proposed architecture only requires the streaming of
(L+A−1)×(L+A−1) pixels for calculating the same amount of SADs. With A equal to 8 and L equal to 64,
this results in a bandwidth decrease ratio of 52× ((8×8)×(64×64)/(71×71) = 262144/5041)
compared to the traditional way.
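This reduction ratio can be reproduced numerically (a sketch; the function name is ours):

```python
def reuse_ratio_single_ob(A, L):
    """Bandwidth reduction from reusing RF-lines across the candidates
    of one A-OB, versus fetching A x A pixels per SAD."""
    traditional = (A * A) * (L * L)  # pixels fetched without any reuse
    streamed = (L + A - 1) ** 2      # pixels streamed with full reuse
    return traditional / streamed

print(round(reuse_ratio_single_ob(A=8, L=64)))  # 52  (262144 / 5041)
```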
Figure 3.15: The output pattern of the SadGenerator when calculating the SADs of a single A-OB, named OB 0, at a time
The output pattern of the SADs can be derived from Figure 3.15. This figure illustrates which
SADs the ALU-grid starts to calculate during the streaming of the RF-lines. When the
first RF-line is streamed, the ALU-grid starts calculating L SADs with y-coordinate (SADy) equal to 0;
their x-coordinates (SADx) range from 0 to L−1. When the second RF-line is streamed, the ALU-grid
starts calculating L SADs with y-coordinate equal to 1. Since the pattern of the beginning of the SAD
calculation is the same as the pattern of its ending, the pattern in Figure 3.15
is also the output pattern of the SadGenerator.
Processing one row of A-OBs:  An even more efficient use of the streamed RF-lines is
achieved when they are used to calculate the SADs not of 1 A-OB, but of several A-OBs
at a time. The reason is illustrated in Figure 3.16: the RF-data of nearby A-OBs have a huge
overlapping area. An RF-line of L+A−1 pixels is overlapped by L−1 pixels of the next A-OB's
RF-line.
(a) Z-OB with four highlighted A-OBs  (b) The RF-data of four consecutive A-OBs overlap each other
Figure 3.16: By extending the size of the RF-line, a row of A-OBs uses the RF-line to calculate all their SADs
When a larger RF-line of L+Z−1 pixels is streamed instead of L+A−1 pixels, the RF-line
can be used to calculate the SADs of the (Z/A) A-OBs that are on the same horizontal line inside
the Z-OB, as shown in Figure 3.16(a). For each RF-line that is streamed, the ALU-grid goes
through (Z/A) so-called OB-iterations. During an OB-iteration, each ALU calculates the SAD of
one A-OB and stores the result in an accumulator. During the next OB-iteration, the ALUs
calculate the SADs with the same (x, y) coordinates, but for a different A-OB. Each ALU in the
ALU-grid therefore has to store an accumulator value for each of the (Z/A) A-OBs in the row.

When streaming RF-lines for each A-OB separately, calculating the SADs of Z/A A-OBs
would require streaming (Z/A) times (L+A−1)×(L+A−1) pixels. When streaming a larger
RF-line and allowing the ALUs to store an accumulator value for each A-OB, only (L+Z−1)×
(L+A−1) pixels need to be streamed. With A equal to 8, Z equal to 64 and L equal to 64, this
results in a bandwidth decrease ratio of 4,47× ((71×71×8)/(71×127) = 40328/9017) compared
to processing one A-OB at a time. Compared to the traditional way, the bandwidth is
decreased 232,6×.
When calculating the SADs of one row of A-OBs at a time, the output pattern of the SADs is
different from when calculating the SADs of one single A-OB at a time. The output pattern
can be observed in Figure 3.17. In this example, the Z/A ratio is 4, which means the SADs of four
A-OBs are calculated after streaming each RF-line. When the first RF-line is streamed, the ALU-grid
starts calculating L SADs with y-coordinate (SADy) equal to 0. It does this first for A-OB 0, then
for A-OBs 1, 2 and finally 3. After these four OB-iterations, a new RF-line is streamed,
which is processed in the next 4 OB-iterations.
Figure 3.17: The output pattern of the SadGenerator when calculating the SADs of one row of A-OBs at a time
Processing multiple rows of A-OBs:  An even more efficient use of the streamed RF-lines is
achieved when they are used to calculate the SADs not of 1 A-OB row, but of all the A-OBs
inside a Z-OB at a time. The reason is illustrated in Figure 3.18. As above, the RF-data of nearby
A-OBs have a huge overlapping area: of the L+A−1 RF-lines used by a single A-OB row,
L−1 are also used by the adjacent A-OB row. Each ALU in the ALU-grid has to
store an accumulator value for each of the (Z/A)² A-OBs.

When streaming RF-lines for each A-OB row separately, calculating the SADs of the (Z/A)² A-OBs
would require streaming (Z/A) times (L+Z−1)×(L+A−1) pixels. When processing all A-OBs
at the same time by streaming more RF-lines and allowing each ALU to store an accumulator
value for each A-OB, only (L+Z−1)×(L+Z−1) pixels need to be streamed. With A equal
to 8, Z equal to 64 and L equal to 64, this results in a bandwidth decrease ratio of again 4,47×
((71×127×8)/(127×127) = 72136/16129) compared to processing one A-OB row at a time.
Compared to the traditional way, the bandwidth is decreased 1040×.
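The cumulative effect of the three streaming strategies can be reproduced from the pixel counts alone (a sketch; exact ratios differ slightly from products of rounded intermediate factors, and the function name is ours):

```python
def bandwidth_ratios(A, Z, L):
    """Reference-pixel reduction ratios per Z-OB for the three strategies."""
    n_obs = (Z // A) ** 2                            # A-OBs inside a Z-OB
    traditional = n_obs * (A * A) * (L * L)          # no reuse at all
    per_ob = n_obs * (L + A - 1) ** 2                # one A-OB at a time
    per_row = (Z // A) * (L + Z - 1) * (L + A - 1)   # one A-OB row at a time
    all_rows = (L + Z - 1) ** 2                      # whole Z-OB at a time
    return [round(traditional / p, 1) for p in (per_ob, per_row, all_rows)]

print(bandwidth_ratios(A=8, Z=64, L=64))  # [52.0, 232.6, 1040.2]
```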
When calculating the SADs of multiple A-OB rows at a time, the output pattern of the SADs is
different from when calculating the SADs of a single A-OB row at a time. The output pattern
can be observed in Figure 3.19. In this example, Z equals 32 and A equals 8, resulting in a
Z/A ratio of 4. This means the SADs of at least 4 A-OBs are calculated after streaming an RF-line.
When the first RF-line is streamed, the ALU-grid starts calculating L SADs with y-coordinate
(SADy) equal to 0. It does this first for A-OB 0, then for A-OBs 1, 2 and finally 3. After
these four OB-iterations, a new RF-line is streamed, which is again processed in 4 OB-iterations.
From RF-line 0 until A−1, the ALU-grid starts calculating only the SADs of the first row
of A-OBs. This is because the first A streamed RF-lines belong only to the search area of the first
row of A-OBs.

In Figure 3.19, A equals 8. When RF-line 8 is streamed, this line is used to start calculating
the SADs of both the first and the second row of A-OBs. During the first 4 OB-iterations, the ALU-
(a) Z-OB with four highlighted A-OB rows  (b) The RF-data of four consecutive A-OB rows overlap each other
Figure 3.18: By extending the number of RF-lines, multiple rows of A-OBs use the RF-lines to calculate all their SADs
grid uses the RF-line to start calculating the SADs with y-coordinate 8 for the first A-OB row.
During the other 4 OB-iterations, the ALU-grid uses the RF-line to start calculating the SADs with
y-coordinate 0 for the second A-OB row. These SADs have y-coordinate 0 because the RF-line
is the first line (y-coordinate 0) of the search area of the second row of A-OBs. From RF-line 8 to
15, each RF-line is used to start calculating the SADs of 8 A-OBs. From RF-line 16 to 23, each
RF-line is used to start calculating the SADs of 12 A-OBs.

As seen in Figure 3.17, RF-line L−1 is the last line for which the ALU-grid starts calculating
SADs of the first row of A-OBs. This is why from RF-line L to L+7, the RF-line is no longer used
to start calculating new SADs for the first A-OB row, but only for the last 3 A-OB rows.
After every further 8 streamed RF-lines, the RF-lines are used by one A-OB row fewer. This complex
streaming pattern can be a drawback when the SADs need to be further processed in a specific
order.
During the different OB-iterations, different parts of the RF-line are sent to the ALU-grid to
process the SADs of one A-OB, as shown in Figure 3.20. Each sub-RF-line is shifted A pixels to
the right relative to the previous one. This is because the A-OBs that are in the same A-OB row differ
by A pixels in the horizontal dimension in the original frame, and thus so do their search areas in the
reference frame.
The processing sequence when streaming the RF-lines is illustrated in Figure 3.21. Streaming
an RF-line that is used to calculate not only the SADs of one A-OB, but of several A-OBs greatly
Figure 3.19: The output pattern of the SadGenerator when calculating the SADs of all of a Z-OB's A-OBs at a time, with A=8 and Z=32
decreases the bandwidth of the RF-data.
Figure 3.22 gives an overview of the SadGenerator. The RF-data is streamed RF-line by
RF-line. Each RF-line is processed by the selector, which sends a sub-RF-line to the ALU-grid.
The ALU-grid computes the SADs of the number of A-OBs that is streamed to it:
1 when processing a single A-OB, (Z/A) when processing a row of A-OBs, or
(Z/A)² when processing all A-OB rows. During each OB-iteration, the Output-select selects the L
accumulated values of the ALU-grid row that has SADs available and sends them to the output.
The output pattern of the SADs depends greatly on how many A-OBs are being processed at the same
time.

The architecture that computes the SADs for multiple A-OBs, depicted in Figure 3.22, greatly
reduces the bandwidth of the RF-data in three steps.
The combination of these three steps achieves a total bandwidth reduction ratio of
Figure 3.20: During different OB-iterations, another sub-RF-line, part of the RF-line, is sent to the ALU-grid
Figure 3.21: An overview of the SAD calculation sequence
1040× (≈ 52 × 4,47 × 4,47).
3.6 Scalable SadGenerator Architecture
A scalable architecture allows the designer to find a trade-off between performance and resource
footprint. It allows the architecture to be implemented both in high-performance systems
and in systems where a small resource footprint is the priority. Fine-grained scalability allows the system
to adapt to the resource usage or the speed of the other components of the video encoder, or of
the whole system that is possibly implemented on the same chip. The SadGenerator architecture has two
mechanisms that allow it to scale; both are discussed in this section.

The A×L ALU-grid can become too large when the resources are limited or when L is very
large due to a large search area. The first mechanism is horizontal parallelization: H, the width of the ALU-grid,
can be reduced to any divisor of L: L/2, L/4, ..., down to 1 (see Figure 3.23).
When the width of the ALU-grid is L/2, each ALU in the grid calculates 2 SADs instead of 1,
to make up for the second half of the ALU-grid that is no longer present. ALU(0,...) will
calculate SAD(0,...) during the first horizontal iteration (H-iteration) and SAD(L/2,...) during the
second H-iteration. In order to be able to calculate this extra SAD, each ALU needs to double
its number of accumulator values; the arithmetic part of the ALU remains unchanged. Overall,
SAD(x, y) will be calculated by ALU(x mod H, y mod A). Each ALU will be slightly bigger due to
the extra accumulators, but the number of ALUs will be significantly reduced, resulting in less
resource usage. Moreover, for each halving of H, the processing time is doubled because each
ALU needs L/H H-iterations.
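The coordinate-to-ALU mapping under horizontal parallelization can be sketched as follows (the function name is ours):

```python
def alu_for_sad(x, y, H, A):
    """Which ALU computes SAD(x, y), and during which H-iteration.

    SAD(x, y) is handled by ALU(x mod H, y mod A); the H-iteration index
    x // H selects which of the L/H accumulator sets the ALU uses.
    """
    return (x % H, y % A), x // H

# With L = 64 halved to H = 32 and A = 8:
print(alu_for_sad(0, 0, H=32, A=8))   # ((0, 0), 0)  first H-iteration
print(alu_for_sad(32, 0, H=32, A=8))  # ((0, 0), 1)  same ALU, second H-iteration
```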
The ALU-grid now needs a sub-RF-line of H+A−1 pixels instead of L+A−1 pixels. During
Figure 3.22: An overview of the SadGenerator
(a) The maximum horizontal parallelization: H=L  (b) Horizontal parallelization equal to L/2  (c) Horizontal parallelization equal to L/4  (d) The minimum horizontal parallelization: H=1
Figure 3.23: Four different horizontal parallelizations of the ALU-grid
each H-iteration, the RF-selector selects H+A−1 pixels from the RF-line and sends them to
the ALU-grid. Figure 3.24 illustrates the selection of the sub-RF-line of length H+A−1. The first
selection, OB-select, selects the L+A−1 pixels that are used by the OB that is currently being
processed; this selection depends on the OB-iteration. The next selection, Horizontal-select,
selects H+A−1 pixels; this selection depends on the H-iteration.
Another approach for reducing the resource usage is to reduce the footprint of each individual
ALU. In the standard configuration, an ALU has two vectors of size A as input that are processed at
once: one A-vector originates from the A-OB and the other from the RF-line. It is
possible to scale the ALU down by calculating the absolute values and adding them together in
several iterations. Instead of an input size of A, any divisor of A can be used: A/2, A/4, ..., down to
1. This number is called P and corresponds to the processing power of the ALU. Halving P
reduces the footprint significantly, but also doubles the processing time that is needed.
Figure 3.25(a) shows an ALU with P equal to A. Only one P-iteration is needed to process the
Figure 3.24: From the streamed RF-line, H+A−1 pixels are sent to the ALU-grid per H-iteration
two A-vectors. When P is equal to A/2 as in Figure 3.25(b), the two A-vectors are split into four
A/2-vectors. During two P-iterations, these four A/2-vectors are processed, two at a time.
Figure 3.25: Two different processing power configurations of the ALUs
With the scalability of the horizontal parallelization of the ALU-grid and of the processing power of
each individual ALU, the SadGenerator can be optimised for a smaller footprint at the cost of lower
performance due to the larger number of iterations required. With a horizontal parallelization H
equal to L and a processing power P equal to A, each RF-line is processed in (Z/A) OB-iterations
with no extra iterations. The situation in which H is equal to L/4 and P is equal to A/8 is illustrated
in Figure 3.26. Each OB-iteration is split into 4 H-iterations and each H-iteration is in turn split
into 8 P-iterations. This makes the design 8×4 = 32 times slower than the H=L, P=A design.
3.7 SadComparator Architecture
The SadComparator receives all the A-SADs of a Z-OB in a SAD-chunk. It calculates the
minimum distortion for each A-OB and outputs the corresponding motion vector. The distortion is
based on the SAD value (see Equation 2.1). Besides the A-OBs, it also calculates the minimum
distortion of the other OBs. These OBs are composed of A-OBs and can have any shape, as
long as the granularity is A×A. In order to calculate the minimum distortion of the other OBs,
their SADs need to be calculated by adding the correct A-SADs. In the situation of Figure 3.2,
Figure 3.26: The calculation sequence of an H=L/4, P=A/8 configuration
there are 16 A-OBs in one Z-OB with 4 intermediate B-OBs. Four A-SADs are added to form
one B-SAD, and four B-SADs are added to form one Z-SAD. For each A-OB, B-OB and Z-OB,
the SadComparator calculates a motion vector that corresponds to its minimum distortion.
Figure 3.27 gives an overview of the SadComparator architecture in this case.
Figure 3.27: An overview of the SadComparator in the situation of Figure 3.2
The SAD-chunk contains L×L A-SADs for each of the 16 A-OBs. The A-SADs are grouped
per 16 SADs; these 16 SADs have the same motion vector but originate from different A-OBs.
The first group of 16 A-SADs contains the SADs with coordinate (0, 0). This group is processed
by the "A-Minima-Comparator", a block that holds, for each of the 16 A-OBs, the minimum distortion value
and its motion vector. The group is also processed by a cascade of two 4-to-1 adders, each of which
adds 4 SADs into the single SAD value of a bigger OB. After the first 4-to-1 adder,
the 16 A-SADs are transformed into 4 B-SADs; after the second 4-to-1 adder, the 4 B-SADs are
transformed into 1 Z-SAD. The group of four B-SADs is processed by the "B-Minima-Comparator",
which holds, for each of the 4 B-OBs, the minimum distortion value and its motion vector. The Z-SAD
is processed by the "Z-Minima-Comparator", which holds the minimum distortion value and
its motion vector. After streaming L×L groups of 16 A-SADs, the Minima-Comparators' MVs are
streamed out as an MV-chunk.
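This flow can be sketched as a small Python model (using the raw SAD as the distortion for simplicity; the thesis's distortion of Equation 2.1 also weighs the MV cost, and the function name is ours):

```python
import math

def sad_comparator(sad_groups):
    """Sketch of the SadComparator for the Figure 3.2 case (16 A-OBs,
    4 B-OBs, 1 Z-OB).

    sad_groups yields (mv, a_sads): the 16 A-SADs that share candidate
    motion vector mv, one group per search position."""
    best = {"A": [(math.inf, None)] * 16,
            "B": [(math.inf, None)] * 4,
            "Z": [(math.inf, None)]}
    for mv, a_sads in sad_groups:
        b_sads = [sum(a_sads[4 * j:4 * j + 4]) for j in range(4)]  # 1st 4-to-1 adder
        z_sads = [sum(b_sads)]                                     # 2nd 4-to-1 adder
        for level, sads in (("A", a_sads), ("B", b_sads), ("Z", z_sads)):
            for i, sad in enumerate(sads):
                if sad < best[level][i][0]:  # Minima-Comparator update
                    best[level][i] = (sad, mv)
    return best

# Two candidate MVs; the second one wins at every level:
groups = [((0, 0), [5] * 16), ((1, 2), [4] * 16)]
print(sad_comparator(groups)["Z"])  # [(64, (1, 2))]
```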
The designer is free to choose which OB sizes are taken into consideration. It is, for example,
possible to not take the A-OBs into consideration and only look for the MVs of the B-OBs and Z-OBs.
In this case, the SadComparator does not need the "A-Minima-Comparator". However, in the
most likely case that all the square OBs are taken into consideration, each MV-chunk has (Z/A)×(Z/A)
A-MVs, and the number of MVs is four times smaller for each step up the OB hierarchy.
3.8 Summary
This chapter describes the architecture of an FSME accelerator. It is the most data and com-
putational intensive component of the video encoder. A streaming model is used to process the
huge amounts of data that is needed for the FSME. Reference frame and original frame data are
streamed into the accelerator and motion vectors are streamed out of the accelerator. The motion
vectors are processed by other components of the encoder that are out of the scope of this thesis.
The core of the accelerator consists of two units: the SadGenerator and the SadComparator. The
SadGenerator calculates the SADs of A×A prediction blocks. The SadComparator uses these
SADs to hierarchically calculate the SADs of larger prediction blocks, up to the Z×Z block.
The SadGenerator uses a 2D ALU-grid to calculate the A×A SADs in an extremely parallel
way. The ALU-grid has two parameters: horizontal parallelization H and processing power P.
Both parameters control the ALU-grid’s level of parallelization. Furthermore, three designs of the
SadGenerator are proposed. The first design reuses the reference data for adjacent candidate
blocks within the search area, achieving a bandwidth reduction of 52×. The second design reuses
the search area’s reference data for horizontally adjacent original blocks, increasing the bandwidth
reduction to 232.44×. Finally, the third design reuses the search area’s reference data for
vertically adjacent original blocks, achieving a total bandwidth reduction of 1101×. In order to
achieve such gains, there is a trade-off between the amount of data reuse that can be performed
within the accelerator and the complexity of the SadGenerator’s output stream of A×A SADs.
The proposed architecture is highly reconfigurable: on top of the ALU-grid’s H and P parameters,
the A×A and Z×Z sizes of, respectively, the smallest and largest prediction blocks can be chosen.
Furthermore, the L×L size of the search area and the number of reference frames are also con-
figurable. This flexibility allows the designer to easily optimize the architecture to given design
constraints according to the trade-offs mentioned above.
3. Video Coding Architecture
4 Hardware Accelerator for High Efficiency Video Coding
Contents
4.1 Platform features and restrictions
4.2 Design decisions
4.3 SadComparator implementation
4.4 SadGenerator implementation
4.5 Host Code
4.6 Summary
In order to evaluate the methodology proposed in the previous chapter, a hardware environment
was selected. This chapter first describes the details of the hardware platform. Next, the
design decisions are discussed, followed by details about the SadComparator’s and SadGenerator’s
implementations. Finally, the host code that controls the hardware implementation is discussed.
4.1 Platform features and restrictions
The architecture is implemented on a MAX2 board with the Maxeler framework. The MAX2
board features a Virtex-5 LX330T FPGA and 12 GB of RAM (6× 2 GB SODIMMs). Section 2.2.3 gives an
overview of the Maxeler framework. The Virtex-5 LX330T was introduced by Xilinx in 2009 and provides
high-performance logic with advanced serial connectivity. It has 11,664 Kb of Block-RAM, 51,840
Virtex-5 slices and a 240×108 array of CLBs with 3,420 Kb of Distributed-RAM. More information
about the Virtex-5 family and the LX330T model can be found in [20].
When implementing the architecture on the MAX2 board, three major restrictions have to be
taken into consideration:
• The streaming widths of the kernels’ inputs and outputs have to be a multiple or integer
divisor of 48 bytes, because the internal stream width of the system is 48 bytes. Using a
non-conforming stream width results in sub-optimal performance and resource usage.
• The LMem (the off-chip memory) can only be accessed starting at burst-aligned byte
addresses. The burst size is 96 bytes. Thus, when the bytes at addresses 95 and 96 need to
be accessed, two bursts need to be streamed, from byte 0 up to byte 191, in order to access
them.
• The LMem can only be accessed with two streaming patterns: a Linear streaming pattern
and a Strided streaming pattern. When another pattern is needed, this restriction can be
bypassed in some inefficient ways. One way is using a combination of these two patterns
over multiple actions. Splitting the design into multiple actions has the disadvantage of
introducing overhead and of losing the internal state of the kernel when switching from
one action to the other. The internal state of the kernel can be saved into the on-chip
FMem, but this requires load and store cycles and resources. Another way of achieving a
different streaming pattern is to stream the data with one of the two allowed patterns to
the kernel and modify the stream inside the kernel itself. This requires storing a significant
amount of data in the kernel’s FMem and reading the data out in such a way that the
desired pattern is achieved.
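The burst-alignment restriction can be modelled with a small helper that, for a requested byte range, returns the burst-aligned range that actually has to be streamed (a sketch, assuming the 96-byte burst size of the MAX2 board; the function name is illustrative):

```python
BURST = 96  # LMem burst size on the MAX2 board, in bytes

def aligned_read(first_byte, last_byte):
    """Return the burst-aligned (start, length) covering bytes first_byte..last_byte."""
    start = (first_byte // BURST) * BURST     # round down to a burst boundary
    end = ((last_byte // BURST) + 1) * BURST  # round up to the next boundary
    return start, end - start
```

For the example above, accessing bytes 95 and 96 yields `aligned_read(95, 96) == (0, 192)`: two full bursts, bytes 0 up to 191.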
4.2 Design decisions
The architecture proposed in the previous chapter has a smallest block size of A×A pixels and a
largest block size of Z×Z pixels. To comply with the HEVC standard, Z is chosen to be 64, the
largest allowed size, and A is chosen to be 8. With these design choices, each Z-OB contains 64
A-OBs. Every pixel is represented by one byte.
The processing flow is as follows: i) The original blocks are streamed from the Host
to the SadGenerator. The Host is the CPU that controls the accelerator over a PCIe link. ii)
The Reference Frames are streamed from the Off-Chip Memory to the SadGenerator. iii) The
A-SADs that are calculated by the SadGenerator are streamed to the Off-Chip Memory. iv) The
SadComparator reads the stream of A-SADs from the Off-Chip Memory and streams the MVs to
the Host.
The SadGenerator and SadComparator are implemented as two separate Kernels. They do
not communicate directly with each other, but only via the Off-Chip memory.
Figure 4.1: An overview of the architecture on the FPGA (the Host exchanges OBs and MVs over the PCIe link; the Reference Frames and A-SADs pass through the Off-Chip Memory)
For the sake of simplicity, and because there is a direct dependency between how the SadGenerator
stores the data into the off-chip memory and the pattern in which the data is consumed by the
SadComparator, the following sections start by presenting the architecture implementation of the
SadComparator and only then that of the SadGenerator.
4.3 SadComparator implementation
The SadComparator architecture, which is discussed in Section 3.7, needs as input (Z/A)²
SADs from (Z/A)² different A-OBs. These SADs have the same (x, y, r) coordinates, with x, y the
coordinates of the prediction candidate in the search area, and r the index of the search area’s
reference frame. Section 3.5 describes three possible SadGenerator architectures: an architecture
processing one A-OB at a time, an architecture processing (Z/A) A-OBs at a time, and an
architecture processing all (Z/A)² A-OBs inside the Z-OB at a time. The SadGenerator architecture
that processes all A-OBs inside the Z-OB at once during the streaming of the reference frame has
a very complex streaming pattern of SADs, as shown in Figure 3.19. The (Z/A)² SADs with the
same motion vector are spread throughout the stream and cannot be accessed at the same time by
an efficient implementation.
To ensure that the SadComparator is able to access the SADs in a more efficient way, the
SadGenerator has to process the A-OBs row by row. Instead of processing (Z/A)² A-OBs at a
time, (Z/A) A-OBs are processed at a time. Besides enabling a more efficient way of accessing the
SADs, this also simplifies the SadGenerator’s architecture. The disadvantage is a bandwidth
increase by a factor of 4.46, as explained in Section 3.5. Figure 3.17 illustrates the streaming
pattern of the SADs when processing (Z/A) A-OBs at a time. Figure 4.2(a) shows exactly the same
streaming pattern, but for A=8 and Z=64, resulting in 64 A-OBs inside one Z-OB. The figure also
shows the complete streaming pattern, of all SADs, with two reference frames. First, all the SADs
from the first A-OB row are calculated. The SADs from the first reference frame are illustrated in
dark blue and the SADs from the second reference frame are illustrated in light blue. Thereafter,
the SADs from the second A-OB row are streamed, illustrated in yellow. The same pattern
continues until the SADs of the last A-OB row are streamed, illustrated in green. The (Z/A)² = 64
SADs that the SadComparator needs to access in parallel are still scattered around inside the
SAD stream. This is illustrated by the black dots in Figure 4.2(a), which represent the SADs with
motion vector (0, 0, 0). The SadComparator needs to access these SADs, and all the other groups
of SADs with the same MV, in parallel.
It is possible to bring these SADs closer together by not storing the stream linearly in
the DRAM, but with a strided pattern. The result of writing the stream with a strided pattern into
the memory is illustrated in Figure 4.2(b). With this organization, all the SADs with the same
motion vector are stored on the same line, but they are still separated by 63 other SADs. One line
of SADs now contains L groups of SADs from 64 A-OBs.
In order to access the 64 SADs in parallel, a Transposer is added to the SadComparator,
as illustrated in Figure 4.3. The goal of the Transposer is to read out the SADs linearly from
the memory and output them in groups of (Z/A)² SADs with the same motion vector, such that
they can be processed by the rest of the architecture. The Transposer consists of two memory
banks and four barrel shifters, as illustrated in Figure 4.4. The memory banks are local memories,
implemented inside the SadComparator kernel with BRAMs.

Figure 4.2: (a) The SadGenerator’s SAD output stream pattern; (b) the storing pattern of the SADs in the memory
The Transposer reads L SADs from one OB and stores them into a memory bank. A memory
bank consists of L memory stripes, each with a depth of (Z/A)² SADs. Each SAD from the OB is
stored into a different memory stripe. The first L SADs are stored at level one: each SAD is stored
into a memory stripe at address 0. The second L SADs are barrel-shifted by one to the right and
then stored at level two: into a memory stripe at address 1. The third L SADs are barrel-shifted
by two to the right and stored at level three: into a memory stripe at address 2. This goes on until
the (Z/A)²-th group of L SADs is streamed, as illustrated in Figure 4.5(a). It is now possible to read
all (Z/A)² SADs from different A-OBs, but with the same motion vector. Those values are aligned
diagonally through the memory bank and have the same number in the figure. They can be loaded
by accessing the correct address in each memory stripe. The loaded result is barrel-shifted to the
left such that the SAD from OB0 is always the first SAD in the row. It is important that the SADs
are always organised from OB0 to OB63, because the SADs are later added by the 4-to-1 adders.
These 4-to-1 adders add the A-SADs, which form an intermediate SAD after the addition.
There are (Z/A)² cycles needed to store the values into a memory bank. The values are
read out in L cycles. During the readout of the values, the next SADs are stored into the second
memory bank. In this way, after the first store, there is a continuous flow of SADs in the correct
format. In the case of a search area width L = 64, A = 8 and Z = 64, L equals (Z/A)² and there are
as many write cycles as read cycles. When the search area width L = 128, there are more reads
than writes in a memory bank. This situation is illustrated in Figure 4.5(b).

Figure 4.3: SadComparator with Transposer (the Transposer feeds two 4-to-1 adder stages and the A-, B- and Z-Minima-Comparators, which output the 16 A-MVs, 4 B-MVs and 1 Z-MV of the MV-chunk)

Figure 4.4: The Transposer consists of two memory banks that allow for simultaneous read and write capability

Figure 4.5: The transposer: (a) when L equals (Z/A)²; (b) when L is bigger than (Z/A)²
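The store-shift-read behaviour of one memory bank can be modelled in software as follows. This is a behavioural sketch, not the MaxJ implementation: `sad_stream` holds one row of L SADs per A-OB, the store phase applies the right barrel shift per level, and the read phase walks the diagonals and reorders them so OB0 comes first.

```python
def transpose_sads(sad_stream, L, n_ob):
    """Group n_ob rows of L SADs (one row per A-OB) by motion-vector index."""
    # Store phase: row `depth` is barrel-shifted right by `depth`,
    # then written at address `depth` of every memory stripe.
    bank = [[None] * n_ob for _ in range(L)]       # bank[stripe][depth]
    for depth, row in enumerate(sad_stream):
        for stripe in range(L):
            bank[stripe][depth] = row[(stripe - depth) % L]
    # Read phase: for each motion-vector index, gather the diagonal and
    # barrel-shift left so the SAD from OB0 is always first.
    groups = []
    for mv in range(L):
        groups.append([bank[(mv + depth) % L][depth] for depth in range(n_ob)])
    return groups
```

The same model covers both cases of Figure 4.5: L equal to (Z/A)² and L bigger than (Z/A)² (fewer rows than stripes).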
4.4 SadGenerator implementation
As stated in Section 3.5, the SadGenerator is the architectural structure responsible for
calculating the A-SADs. The SadGenerator processes one Z-OB at a time. During a Z-iteration,
it calculates the SADs of every A-OB inside the Z-OB. Because the SadComparator needs the SADs
in a specific pattern, the SadGenerator calculates the SADs on a row-by-row basis. During a
so-called Row-iteration, the SadGenerator calculates L×L SADs per reference frame for every A-OB
in the A-OB-row. When there is more than one reference frame, a Row-iteration is split into
RF-iterations, one for every reference frame. During each RF-iteration, L×L SADs of one reference
frame are calculated for each A-OB. With Z=64, A=8 and Z/A=8, there are 8 A-OBs in an A-OB-row.
This situation, with two reference frames, is illustrated in Figure 4.6.
Figure 4.6: An overview of the iterations when processing a Z-OB (one Z-iteration consists of 8 Row-iterations, each split into RF-iterations for RF 1 and RF 2)
4.4.1 The RF-stream
During each RF-iteration, the RF-data is streamed to the SadGenerator. The RF-data are the
pixels from the reference frame that are needed to calculate the SADs of an A-OB-row. The
reference frames in the Off-Chip Memory are extended frames, in order to take into account the
search areas of the original blocks on the border of the frame. With a search area of L×L, the upper
and left borders are extended with L/2 pixels, the lower and right borders with L/2 − 1 pixels. The
grey border in Figure 4.7(a) serves this purpose. The Off-Chip Memory on the MAX2 board is
accessed with a burst length of 96 bytes. Every read is a multiple of 96 bytes and the starting point
of the read must be aligned with a multiple of 96 bytes in the memory. In order to be able to start
reading the second line of the frame, burst alignment padding is added on the right side of the
frame. This padding makes sure that each line of the frame starts at an address that is a multiple
of 96 bytes, such that it is burst aligned.
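The extended-frame layout can be summarised by a small helper that computes the burst-aligned line pitch (a sketch; the function name and the example frame size are illustrative):

```python
BURST = 96  # LMem burst size in bytes

def extended_frame(width, height, L):
    """Return (line pitch in bytes, extended height) for a reference frame."""
    # Search-area border: L/2 pixels on the upper/left side,
    # L/2 - 1 pixels on the lower/right side (grey in Figure 4.7(a)).
    ext_w = width + L // 2 + (L // 2 - 1)
    ext_h = height + L // 2 + (L // 2 - 1)
    # Burst alignment padding (green): round each line up to a burst multiple.
    pitch = -(-ext_w // BURST) * BURST
    return pitch, ext_h
```

For a 1920×1080 frame with L = 64, each stored line occupies 2016 bytes (1983 extended pixels rounded up to the next multiple of 96).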
The RF-data consists of L+A−1 RF-lines with a length of L+Z−1 pixels and is accessed
with a strided pattern. The RF-data of the first A-OB-row of the first Z-OB starts at the
burst-aligned address 0 in the DRAM, as shown in Figure 4.7(b). The RF-data of the same A-OB-row,
but in a different reference frame, is also burst aligned. This is because the second frame is stored
immediately behind the first frame and every line in the frame is a multiple of the burst size due to
the burst alignment padding. The RF-data of the next A-OB-row in the same Z-OB starts A
frame-lines lower. Since every frame-line is a multiple of the burst size, the RF-data’s start address is
burst aligned as well.

Figure 4.7: Reference frame in the DRAM: (a) each frame is stored into the DRAM with a search area border (grey) and burst alignment padding (green); (b) the RF-data spans an area in the reference frame
The RF-data of the first A-OB-row of the next Z-OB has a start address that is not aligned.
Suppose Z and L are equal to 64. The RF-line then has a width of 127 bytes. As explained before,
the first RF-line of the first Z-OB is burst aligned: it starts at address 0 and ends at 126. The first
RF-line of the second Z-OB, to the right of the first Z-OB, is shifted by Z pixels. It starts at address
64 and ends at 190, as highlighted in yellow in Figure 4.8, and is not burst aligned. The first RF-line
of the 3rd Z-OB is also not burst aligned: it starts at address 128 and ends at 254. In turn, the first
RF-line of the 4th Z-OB is again burst aligned: it starts at address 192 and ends at 318. The blue,
yellow, green pattern repeats itself.
Figure 4.8: The first RF-lines of adjacent Z-OBs in the DRAM for Z and L equal to 64
It is possible to access the blue, yellow and green data with a single read of an RF-line with a
length of 192 bytes. The output of the DRAM is shown in Figure 4.9 for each of the three different
cases. To access the blue data from Figure 4.8, the RF-line of 192 bytes with starting address 0 is
streamed and the first 127 bytes are selected. If we look at the memory output, the data is aligned
to the left of the 192-byte RF-line. To access the yellow data, the same RF-line is streamed,
but in this case the data is aligned to the right side of the RF-line, and the 127 bytes are now
selected from byte 64 onwards. To access the green data, the RF-line is read starting
from address 96 and the selected 127 bytes of data appear from byte 32 onwards. When accessing the
second blue data, the pattern repeats itself, but with a 192-byte address offset in the DRAM.
Figure 4.9: The data of multiple Z-OBs inside an RF-line for Z and L equal to 64
In the case of Z and L equal to 64, an RF-line of 192 bytes is streamed, of which 127 bytes
are used. Thus, due to the burst alignment restriction, for every RF byte needed, 1.5 RF bytes
need to be streamed, as depicted in Figure 4.10. The efficiency percentage is 66.66%. The
architecture’s RF-data reuse factor of 232.44× that is achieved by processing a row of A-OBs
is hereby reduced to 154.96 (= 232.44 × 0.6666). Appendix A illustrates that the blue, green,
yellow, blue pattern also holds for Z=32. With Z equal to 96, the pattern is all blue, since every
RF-line is burst aligned. The length L does not influence the pattern, but does influence the length
of the RF-line. The length of the RF-line corresponds to the bandwidth of the SadGenerator’s
RF-stream. Table A.1 shows the RF-data usage efficiency for every (Z, L) pair.
Figure 4.10: The access pattern of the RF-line inside the SadGenerator for different Z-OBs, with Z and L equal to 64 (Z Select, OB Select and Horizontal Select windows within the RF-line)
Figure 4.10 illustrates the SadGenerator’s sub-RF-line Selector. The Z Select, shown in blue,
selects the bytes depending on the horizontal position of the Z-OB. The OB Select, shown in
green, selects the bytes depending on the OB-iteration, i.e., which A-OB inside the A-OB-row is
currently being processed. The Horizontal Select, shown in red, selects the bytes depending on
the horizontal iteration of the ALU-grid.
4.4.2 The OB-stream
The A-OBs streamed to the SadGenerator are stored in on-chip Block-RAM, or FMem in
Maxeler jargon. During the streaming of the first line of the first A-OB, the ALU-grid is not working,
and thus no RF-lines are streamed. During each processing cycle, all 8 pixel lines, each 8 pixels
long, of a single A-OB are sent to the ALU-grid. Each ALU-row in the grid uses a different pixel
line. In order to have 8 read ports in the BRAM, the data is automatically duplicated by the
MaxCompiler. During the next OB-iteration, the pixels of the next A-OB are accessed by the ALU-grid.
Figure 4.11: The 8 Original Blocks are stored into the B-RAM and distributed to the ALU-grid per row
4.4.3 The ALU implementation
An individual ALU consists of an ABS-unit followed by a reduction tree (ADD-Tree). The
ABS-unit is responsible for calculating the absolute difference of two vectors dimension-wise. The
result is a vector of the same length, whose values are added by the ADD-Tree, resulting in a single
value that is added to one of the accumulators. In the standard configuration, with the horizontal
parallelization equal to L, each ALU accumulates the SAD with a single x-coordinate for an A-OB-row.
With Z equal to 64 and A equal to 8, there are 8 accumulated values, each one corresponding to a
SAD value for an A-OB. In the case when the horizontal parallelization is L/c, the ALU calculates
the SADs for c x-coordinates. For example, when c is four, ALU(0,0) calculates the SADs (0,0),
(L/4,0), (L/2,0) and (3L/4,0). In the case of c equal to four, there are 4×8 accumulator values in
each ALU. In the standard configuration, the processing power P is equal to A. This means that
all the A pixels of the vector are processed at once. For more information about the processing
power P, the reader is referred to Section 3.6. An example of an ALU with P equal to A is given
in Figure 4.12(a). When the processing power is less than A, for example A/2, only A/2 pixels are
processed per cycle. Thus, two cycles are needed to process the two A-pixel vectors. During these
two cycles, the ADD output value is added to the same accumulator value. An example of an ALU
with P equal to A/2 is given in Figure 4.12(b).
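The ABS-unit and ADD-Tree can be sketched as follows (a software model of one ALU evaluation for P = A; the pairwise reduction mirrors the ADD-Tree and assumes A is a power of two):

```python
def alu_sad(ob_pixels, ref_pixels):
    """One ALU evaluation: SAD of two A-pixel vectors."""
    # ABS-unit: dimension-wise absolute differences of the two vectors
    values = [abs(o - r) for o, r in zip(ob_pixels, ref_pixels)]
    # ADD-Tree: pairwise reduction of the A values down to a single SAD
    while len(values) > 1:
        values = [values[k] + values[k + 1] for k in range(0, len(values), 2)]
    return values[0]
```

In hardware, this single value is then added to one of the ALU's accumulation values rather than returned.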
A possible implementation is the use of (Z/A × L/H) accumulators. The accumulator is a
predefined Maxeler library accumulator. The logical view of each accumulator is shown in Figure 4.13.
The number of accumulation values, the accumulation depth, is only four in the figure to simplify
the model. With Z equal to 64 and A equal to 8, the accumulation depth is always a multiple of
8 (8 × L/H).

Figure 4.12: Detailed ALU implementation: (a) H(L)P(A) ALU configuration; (b) H(L/4)P(A/2) ALU configuration
Figure 4.13: The SadGenerator ALU’s registers with library accumulators
The array of accumulation values can be implemented more efficiently, since only one value
is accumulated or reset at a time. There is no need for Z/A × L/H accumulators: a single
accumulator and Z/A × L/H − 1 delay/hold pairs are enough to store the accumulated values.
Figure 4.14 shows the implementation with a custom shift register. Only one adder is used, in
combination with Z/A × L/H − 1 delay/hold pairs¹. The accumulated values are conveyed by the
signals before and after the delay/hold pairs, denoted in the figure as ACC 0 to ACC 3. The
signal shift E can hold all the accumulated values, except for the first one. While the shift registers
are on hold, the delayed ACC 0 value is always accumulated with the value at the output of the
ADD unit. The circuit is in this state when multiple iterations are needed to accumulate the two
A-pixel vector inputs of the ALU. This is the case when the processing power P of the ALU is less
than A, as explained above.

¹A delay/hold pair is able to store data for a given number of cycles.
Figure 4.14: The SadGenerator ALU’s registers with custom accumulators
When two new A-pixel vectors are processed, the add value has to be accumulated with a new
accumulation value. The accumulation values are shifted and the adder adds the add value
to the ACC 3 value, which results in the new ACC 0 value. When ALU(0,0) has calculated the
SADs(0,0), it starts the calculation of the SADs(8,0). It does so by resetting the accumulation value
instead of using ACC 3.
It is clear that when the ALU’s processing power P is equal to A, the accumulation values are
shifted every cycle and the adder never uses the delayed version. In this case, a simplified custom
implementation is possible, without the hold blocks and without the mux that chooses between
ACC 0 and ACC 3. This simplified custom implementation cannot be used when P is not equal to A,
since the values of the shift registers must be held.
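The custom shift-register accumulator can be modelled as follows. This is a behavioural sketch with Python's `deque` standing in for the delay/hold chain; `shift=False` models the hold state used when P < A, and `reset=True` models starting a new SAD instead of reusing ACC 3.

```python
from collections import deque

class ShiftAccumulator:
    """Shift-register accumulator holding Z/A * L/H accumulation values."""

    def __init__(self, depth):
        self.regs = deque([0] * depth)   # ACC 0 (left) .. ACC depth-1 (right)

    def step(self, add_value, shift=True, reset=False):
        if shift:
            oldest = self.regs.pop()               # ACC 3-style oldest value
            base = 0 if reset else oldest          # reset starts a new SAD
            self.regs.appendleft(base + add_value) # becomes the new ACC 0
        else:
            # hold state (P < A): keep accumulating into the delayed ACC 0
            self.regs[0] += add_value
        return self.regs[0]
```

Each shift rotates the chain by one position, so after `depth` shifts every accumulation value has been updated exactly once, matching the one-value-at-a-time behaviour described above.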
4.5 Host Code
The data-flow-engine (DFE) is controlled by the host code. The host code controls the DFE
by sending commands to it. These commands are called actions. An action is a data structure
that contains all the information that a DFE needs to run. There are three things that need to be
configured by the action:
• Ticks: For each kernel, the number of cycles that it needs to run is set.
• Streams: Streams from/to the memory need to have a start address, length and streaming
pattern. Streams from/to the host need to have a length and a source/destination.
• Scalars: Kernels can have scalars whose values can be set before each run.
A frame is processed Z-OB per Z-OB. Each Z-OB is processed by multiple RF-iterations as
shown in Figure 4.6. One action sets the commands that are necessary for one RF-iteration.
The SadComparator runs in parallel with the SadGenerator. An action thus contains configuration
data for both kernels.
An action sets the following parameters for the SadGenerator:
• Ticks: The number of cycles the SadGenerator has to run.
• RF-stream: Strided read pattern of RF-data from the DRAM to the SadGenerator
• SAD-stream: Strided write pattern of SADs from the SadGenerator to the DRAM
• Scalars: Sets the state of the mux that chooses Z-Select in the sub-RF-Selector
An action sets the following parameters for the SadComparator:
• Ticks: The number of cycles the SadComparator has to run.
• SAD-stream: Strided read pattern of SADs from the DRAM to the SadComparator
• MV-stream: Write stream from the SadComparator to the PCIe
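The number of actions the host issues per frame follows directly from this organisation: one action per RF-iteration, Z/A Row-iterations per Z-OB, and one RF-iteration per reference frame (Figure 4.6). This can be sketched as follows (assuming, for simplicity, frame dimensions that are multiples of Z; the function name is illustrative):

```python
def actions_per_frame(width, height, Z=64, A=8, n_rf=2):
    """Count the host actions needed to process one frame."""
    # A frame is processed Z-OB per Z-OB; each Z-OB takes Z/A Row-iterations,
    # and every Row-iteration is split into one RF-iteration per reference frame.
    n_zobs = (width // Z) * (height // Z)
    return n_zobs * (Z // A) * n_rf
```

For a 1920×1088 frame with Z=64, A=8 and two reference frames, this amounts to 8160 actions per frame, each configuring the ticks, streams and scalars of both kernels for one RF-iteration.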
4.6 Summary
This chapter shows a possible embodiment of the architecture proposed in Chapter 3. Namely,
the implementation targets a Virtex-5 LX330T FPGA. For the HEVC motion estimation, the
smallest prediction block is chosen to be 8×8 (A=8) and the largest prediction block 64×64 (Z=64).
The best prediction block is searched for in a search area of 64×64 pixels (L=64) and in two
reference frames (RF=2). The implementation is performed with the Maxeler framework, which uses
a high-level hardware description language. The SadGenerator and SadComparator are
implemented as two separate kernels. To communicate with the outside world, the two kernels have
streams connected to the off-chip memory and the PCIe host link.
The previous chapter proposed three possible architecture designs for the SadGenerator:
processing one A-OB, processing one row of A-OBs, and processing multiple rows of A-OBs in
parallel. In this implementation we consider the architecture that processes one row of A-OBs: it has
a high reference frame data reuse and a simple SAD output pattern. For reading the reference
frame data, the SadGenerator implements a larger RF-line, thus circumventing the off-chip
memory’s strided access pattern limitation. Moreover, a custom accumulator design for the ALU-grid
is also proposed.
In order to be able to process the SADs hierarchically, the SadComparator needs to stream the
SADs from the off-chip memory with a specific pattern. To support the efficient streaming
of the SAD values, we propose a new technique that combines the way the SadGenerator
stores the data with a very efficient HW module in the SadComparator to transpose the SAD
values.
5 Results
Contents
5.1 Implementation Specific Optimizations
5.2 Experimental Results
5.3 Comparison
5.4 Summary
This chapter describes the evaluation of the proposed architecture. Section 5.1 starts by pro-
viding an evaluation of two implementation specific optimizations performed in each of the kernels.
Section 5.2 presents the results obtained for a real implementation on a MAX2 data-flow accel-
eration card with a Xilinx Virtex 5 FPGA. The results obtained show how different performance
metrics vary with different design parameters. Finally, Section 5.3 presents a comparison be-
tween the FPGA results obtained for two parameter configurations and a state-of-the-art GPU
implementation.
5.1 Implementation Specific Optimizations
5.1.1 SadGenerator Output Width
After streaming the first A-1 RF-lines, every L/H×A/P cycles, a total of L SADs will be ready.
These values will be streamed as the output of the kernel. As defined in Section 3.6, H is the ALU-
grid’s Horizontal Parallelization and P is the Processing Power of an individual ALU. It is possible
to have the stream width as large as L to output all SADs at once. This is necessary for designs
with H(L)P(A), where every cycle, L accumulation values will be ready to output. The output of the
kernel must be a multiple or natural divisor of 48 bytes because this is the internal stream width
of the system. With L = 64 and each SAD value represented by 2 bytes, the amount of SAD data
is 128 bytes. A padding of 16 bytes is added such that the stream width of the kernel’s output is
3 × 48 = 144 bytes.
When H is smaller than L and/or P is smaller than A, there are some cycles between each
output cycle where nothing is streamed to the output. When there are 3 cycles available for
output, the stream width can be 24 SAD values (48 bytes) and the 64 SADs will be streamed to
the output in 3 cycles. The last cycle will carry 32 bytes of SAD values and 16 bytes of padding. An
example of a system where 3 output cycles are available is H(L/2)P(A/2) (2 × 2 = 4 output cycles
are available in total). When more cycles are available, it is possible to stream the SADs in 6, 12,
24, ... cycles.
A design with L=64, A=8, Z=64, H(L), P(A) was made to test the resource usage of the
SadGenerator for the possible numbers of output cycles. As seen in Table 5.1, it is clear that the kernel
output width of 48 bytes, which is achieved with 3 output cycles, is the optimal case in terms of
resource usage. This can be explained because, since the data is already aligned to the internal
format, the hardware necessary to rescale and buffer the data is relatively simple. The
only marginal negative effect is that, to output the last SADs, the system needs to run for 2
extra cycles in order to have 3 output cycles. Therefore, for configurations where the throughput
is lower, i.e., where the design has 3 or more output cycles available, the design uses 3 output
cycles. Otherwise, 1 output cycle is used.
Table 5.1: Resource usage relative to the design with 1 output cycle

                   output width (bytes)   LUTs (%)   FFs (%)   BRAMs (%)
1 output cycle            144               1.00      1.00       1.00
3 output cycles            48               0.999     1.005      0.848
6 output cycles            24               1.002     1.004      0.876
12 output cycles           12               1.000     1.004      0.857
24 output cycles            6               1.002     1.003      0.848
5.1.2 Custom Accum
Each ALU in the SadGenerator has a number of accumulation values. The total number of
accumulation values is equal to the number of A-OBs that are processed during one Z-iteration
(Z/A) multiplied by the number of horizontal iterations (L/H).
Figure 5.1 shows the resource usage of the SadGenerator kernel for 3 different designs. The
first design, Library Accum, uses the standard accumulators provided by the Maxeler framework,
where each accumulation value is stored in a full accumulator, as in Figure 4.13. The second de-
sign, Custom Accum, stores the accumulation values in a custom shift register, as in Figure 4.14.
Finally, and as described in Section 4.4.3, the third design, Simplified Custom Accum, is an op-
timization of the second one, with no hold-block and only one reset mux; it can be applied
when P = A. The design targets motion estimation with a search area width L = 64,
horizontal parallelization H = 1 and ALU processing power P = A. The Z and A block widths are
respectively 64 and 8. Each ALU calculates the SAD value for Z/A = 8 OBs and for L/H = 64
x-SADs. This results in a total of 512 accumulation values that each ALU has to store.
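The 512 figure follows directly from the formula above; a one-line sanity check (the function name is illustrative):

```python
def accum_values(Z, A, L, H):
    """Accumulation values per ALU: (Z/A) A-OBs per Z-iteration
    times (L/H) horizontal iterations (x-SADs)."""
    return (Z // A) * (L // H)

# Z=64, A=8 gives 8 OBs; L=64, H=1 gives 64 x-SADs: 8 * 64 = 512
print(accum_values(Z=64, A=8, L=64, H=1))  # -> 512
```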
Figure 5.1: The resource usage of the SadGenerator's kernel for different accum implementations, relative to the custom accum implementation, for H1A8 and L=64
The resources of the different designs are expressed relative to the resource usage of the
Custom Accum design. The Library Accum design uses significantly more LUTs and FFs, 6× and
3.5× respectively. The Custom Accum design reduces resource usage by using shift registers
instead of full accumulators: each full accumulator requires its own adder, while only one is
active at a time. The Simplified Custom Accum design uses slightly fewer LUTs (2.5%)
but more BRAMs (2%).
It is clear that the custom accum implementations are the better choice, because they require
significantly fewer LUTs and FFs. When comparing the Custom Accum with the Simplified
Custom Accum, there is a trade-off between BRAMs and LUTs. Since BRAMs are the scarcer
resource, the Custom Accum implementation is preferred.
5.2 Experimental Results
5.2.1 Framework
The architecture is implemented on a MAX2 board with the Maxeler framework. The following
list describes some of the terminology used throughout this section and how it should be
interpreted:
• Resources The resource usage as reported by the Xilinx tool after the mapping and place-
and-routing design-flow phases.
• Cycles The number of cycles is the total number of "ticks" that the Data Flow Engine (DFE)
is set to run. This number of ticks is set via the software run-time API actions, which are
command streams. Execution times derived from this number of cycles do not
take into account the initialisation time that is needed for each action.
• Memory The read and write rates are calculated for the case in which both the SadGener-
ator and SadComparator are running at the same time. In practice, this is always the case,
except during the processing of the first and last Z-OB. In these latter cases, either the Sad-
Generator or the SadComparator is not running, which is why the read and write rates to the
DRAM will be significantly lower.
• Power The power is calculated with the XPower Estimator tool from Xilinx [21]. This tool
automatically estimates the power consumption from the data provided in the MAP report.
5.2.2 Reference Implementation
The architecture is successfully implemented on a MAX2 board. The accelerator is synthe-
sized for the Virtex 5 FPGA at 125MHz, with the off-chip memory controller operating at 150MHz
(with a word width of 96 bytes), providing a maximum bandwidth of 14.4 GBytes/s. The parameters
of the reference implementations are summarized in Table 5.2. There are three reference imple-
mentations: designs H64P2, H64P4, and H64P8. Their results can be found in Table 5.3. A detailed
overview of the power estimation, obtained with the Xilinx XPower Estimator tool for the
worst-case scenario, i.e., the largest design H64P8, can be found in Appendix B.
The H64P8 design extracts four times more parallelism than the H64P2 design. The required
DRAM read and write bandwidth is quadrupled in order to support the performance
increase. Both the H64P4 and H64P8 designs are able to process 720p frames in real time. Namely,
the H64P8 design can process 720p, 1080p and 2160p frames in real time with frame rates of up
to 56.9, 26.8 and 6.7 frames per second, respectively.
Table 5.2: Parameters

          H    P    Z    L    RF
H64P2    64    2   64   64    2
H64P4    64    4   64   64    2
H64P8    64    8   64   64    2
Of the three designs, only the H64P8 design was not able to run on the FPGA. Although we
were not able to fit the H64P8 design on the MAX2 board, the preliminary projections obtained
show an occupancy only slightly above the limits, namely 101% of LUTs, 96% of FFs and 84%
of BRAMs. Therefore, we believe that the analysis of the results obtained from these projections
is still valuable and provides interesting conclusions. In particular, since the MAX2 card has
Maxeler's smallest form-factor FPGA, other more recent products could easily fit the design, such
as the MAX3 card, which has a Virtex 6 LX240T with about 3× the capacity. Moreover, regarding the
memory bandwidth requirements, the MAX2 card is also not able to provide enough bandwidth
for the H64P8 design, which would limit its overall performance. On the other hand, the MAX3 would
support 4× more bandwidth. An intermediate point is the H64P4 design: it can support a frame
rate of up to 28.6 fps for 720p while having lower hardware requirements and a reasonable
memory bandwidth requirement for the target platform.
Table 5.3: Experimental results

         Power [W]        Resources [#]          Speed [fps]      DRAM [bytes/cycle]
                      LUT       FF     BRAM   720p  1080p  2160p    Read    Write
H64P2      13.6      84398   140747    297    14.3   6.7    1.6     38.4    32.4
H64P4      N.A.     124516*  173655*   273*   28.6  13.5    3.4     76.8    64.8
H64P8      N.A.     210158*  199682*   273*   56.9  26.8    6.7    153.3   129.4
* Results from preliminary resource report.
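As a rough sanity check on the bandwidth discussion above, the per-cycle DRAM rates in Table 5.3 can be converted to absolute bandwidth at the 125 MHz accelerator clock and compared against the memory controller's peak (96-byte words at 150 MHz). Treating the read and write streams independently is an assumption of this sketch:

```python
CLOCK_HZ = 125e6                 # accelerator clock
PEAK = 96 * 150e6                # 96-byte memory words at 150 MHz (14.4 GBytes/s)

designs = {"H64P2": (38.4, 32.4),   # (read, write) in bytes/cycle, from Table 5.3
           "H64P4": (76.8, 64.8),
           "H64P8": (153.3, 129.4)}
for name, (read, write) in designs.items():
    worst = max(read, write) * CLOCK_HZ      # busiest stream, in bytes/s
    verdict = "fits" if worst <= PEAK else "exceeds the peak"
    print(f"{name}: {worst / 1e9:.2f} GB/s ({verdict})")
# H64P2 and H64P4 fit; H64P8 needs ~19.2 GB/s and exceeds the 14.4 GB/s peak
```

This is consistent with the observation that the MAX2 card cannot provide enough bandwidth for the H64P8 design, while the H64P4 design remains within the platform's reach.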
Finally, regarding the H64P2 design, its main limitation is its much lower frame rate.
An overview of the resource usage of the H64P2 reference implementation can be found in Fig-
ure 5.2. Figure 5.2(a) shows that the kernels, which implement the SadGenerator and
SadComparator, use the bulk of the LUTs and FFs. The manager and kernels both use a large
share of BRAMs, 41.67% and 49.68% of the total available BRAMs respectively. In more
detail, Figure 5.2(b) shows that the SadGenerator uses most of the LUTs and FFs. The SadCom-
parator uses far fewer LUTs and FFs, but many BRAMs, which are used for the implementation
of the SadComparator's Transposer. The manager uses BRAMs for the implementation of the
FIFOs and controllers.
Besides the design choice, there are three more parameters that have a major impact on the
accelerator's performance: Z, L and RF. The effect of these four parameters on the experimental
results is discussed in Section 5.2.3.
(a) Overview (b) Breakdown
Figure 5.2: An in-depth overview of the resource usage of the H64P2 reference implementation
5.2.3 Design Comparison
This section illustrates the effect of the architecture's parameters on the accelerator's perfor-
mance. H P is a two-dimensional parameter: H specifies the ALU-grid's horizontal parallelization,
and P specifies an individual ALU's processing power. Z is the width of the largest prediction block
for which SADs are calculated and compared. L defines the width of the search area in which a
prediction candidate is looked for in each reference frame. RF is the number of reference frames
in which a prediction candidate is looked for. H P is a design parameter that does not influence
the encoding efficiency, whereas Z, L and RF do: the larger they are, the higher the encoding
efficiency.
The next paragraphs give an in-depth description of each parameter and discuss their effect
on the resource usage, performance, memory usage and power consumption.
Figure 5.3: The effect of four parameters on the architecture's performance
A – parameter: H P The design parameter H P is a two-dimensional parameter that changes
the configuration of the SadGenerator. The horizontal parallelization H specifies how many ALU
rows the SadGenerator's ALU-grid has. H can range from 1 up to L, the width of the search area.
The parameter P specifies the processing power of each ALU in the SadGenerator. P can range
from 1 up to A, the width of an A-OB.
The comparison between different designs is performed with the other parameters constant:
L=64, Z=64 and RF=2.
A –.1 H P: Resources The resource usage for each design can be found in Figure 5.4.
Changing the design does not significantly vary the BRAM usage. Increasing the horizontal
parallelization H slightly increases the usage of LUTs and FFs, and so does increasing the
processing power P. An exception occurs when P is equal to A, 8 in this case: each ALU then
no longer needs a mux, as illustrated in Figure 4.12. The results for designs H64P8* and H64P4*
are based on preliminary reports, which explains the drop in BRAM usage. Changing the design
from H64P8 to H64P4 decreases the LUT usage by 41% and the FF usage by 13%.
Figure 5.4: The resource usage for different designs
A –.2 H P: Cycles The design H1P1 has the slowest performance. Doubling H or P halves
the number of cycles needed. H64P8 is the fastest design. Figure 5.5 gives an overview of the
speed performance of each design.
A –.3 H P: DRAM The total amount of data read from or written to the DRAM is not affected
by the design parameter, but the required read and write bandwidth is. Since doubling H or P
halves the number of cycles, the read and write rates are doubled. An overview of the read and
write rates for the 12 fastest designs is given in Figure 5.6.
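The halving behaviour follows from the batch rate given in Section 5.1.1 (one batch of L SADs every (L/H)×(A/P) cycles). A small model of the relative cycle count (the function is illustrative, not the exact cycle formula):

```python
def relative_cycles(H, P, L=64, A=8):
    """Cycle count relative to the fastest design H(L)P(A):
    a batch of L SADs is produced every (L/H) * (A/P) cycles."""
    return (L // H) * (A // P)

assert relative_cycles(H=64, P=8) == 1        # fastest design
assert relative_cycles(H=1, P=1) == 512       # slowest design, H1P1
# doubling H (or P) halves the number of cycles
assert relative_cycles(H=32, P=2) == relative_cycles(H=16, P=2) // 2
```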
Figure 5.5: The number of cycles needed to process a 720p frame. The red line indicates the maximum number of cycles allowed for real-time (25 fps) processing of a 720p frame.
Figure 5.6: The read and write rates to the DRAM for each design
A –.4 H P: Power Increasing H slightly increases the power usage. Choosing P equal to A
uses less energy than other choices of P. An overview of the power usage for each design is given
in Figure 5.7.
Figure 5.7: The total power usage per design
B – parameter: Z The parameter Z represents the width of the largest prediction block that
can be evaluated. Most of the time, Z is imposed by the video coding standard. The HEVC
standard introduces the possibility to use prediction blocks with a size of up to 64×64. When
the architecture is used to perform motion estimation with prediction blocks with a size of up to
64×64, Z has to be equal to 64. However, HEVC motion estimation can also be performed with
prediction blocks with a size of up to 32×32. In this case, Z can be chosen as 32 or 96. When Z
equals 96, each frame is processed with blocks of 96×96, but only prediction blocks up to 32×32
will be evaluated in the SadComparator. Changing Z has a major impact on the architecture and
its implementation. Depending on the requirements, a suitable Z can be chosen.
B –.1 Z: Resources
• Increasing Z causes the SadComparator to calculate and compare more SADs in parallel.
With Z equal to 32, the SadComparator has to calculate only one SAD of a 32-OB. With Z
equal to 64, the SadComparator has to calculate 4 32-OB SADs in parallel. With Z equal
to 96, the SadComparator has to calculate 9 32-OB SADs in parallel. The higher the Z, the
more add trees and thus the more resources the SadComparator will need.
• Because of the pattern of the SADs in the DRAM, a transposer (as described in Figure 4.4)
is needed in order to internally stream the SADs in another pattern. The size of the BRAM
used by the transposer is 2 × (Z/A)² × L. Choosing a bigger Z-OB quadratically expands
the BRAM usage of the SadComparator.
• Increasing Z causes each ALU in the SadGenerator to store more accumulation values. The
number of accumulation values stored inside each ALU is equal to (Z/A)× (L/H).
• Increasing Z increases the number of A-OBs that are processed during the processing of a
Z-OB and thus also the number of SADs that are generated. This results in storing more
SADs during a Z-OB iteration, requiring a larger SAD-chunk in the DRAM.
An overview of the major resource changes is given in table 5.4.
Table 5.4: Major resource changes when increasing Z, relative to Z=32

                                    Z=32   Z=64   Z=96
SadGenerator: Accum Values           1x     2x     3x
SadComparator: Transposer BRAM       1x     4x     9x
SadComparator: Add Trees             1x     4x     9x
SAD chunk: DRAM                      1x     4x     9x
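The factors in Table 5.4 follow from the formulas given in the bullets above: the accumulation values scale linearly with Z/A, while the transposer BRAM, the add trees and the SAD chunk scale with (Z/A)². A quick check with A = 8, relative to Z = 32:

```python
A = 8
base = 32 // A                           # reference point: Z = 32
for Z in (32, 64, 96):
    linear = (Z // A) // base            # accum values: 1x, 2x, 3x
    quadratic = ((Z // A) // base) ** 2  # transposer BRAM, add trees, SAD chunk: 1x, 4x, 9x
    print(f"Z={Z}: {linear}x linear, {quadratic}x quadratic")
```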
B –.2 Z: Cycles The number of cycles used by the SadGenerator to process one Z-OB is
proportional to Z². The number of cycles used by the SadComparator does not depend on the
size of Z and stays constant. Figure 5.8 illustrates the number of cycles needed to process one
Z-OB.
The number of Z-OBs that need to be processed in order to process one frame is inversely
proportional to Z². As long as the SadGenerator is the slower unit, changing the size of Z will not
Figure 5.8: The number of cycles needed for each unit to independently process one Z-OB
significantly change the number of cycles needed to process a frame. This is because the pro-
cessing time of the SadGenerator is proportional to Z² while the processing time of the SadCom-
parator stays constant when changing Z. When Z equals 32, the SadGenerator is no longer
the slowest unit; the number of cycles needed by the SadComparator then determines the total
number of cycles needed. The total number of cycles needed to process one 720p frame is
illustrated in Figure 5.9. As stated before, as long as the SadGenerator is the slower unit,
changing the size of Z will not significantly change the number of cycles needed to process a
frame. With Z equal to 64 and 96, the SadGenerator is the slower unit, resulting in an
approximately constant time to process a frame. With Z equal to 32, the SadComparator is the
slower unit, resulting in a bottleneck that significantly increases the time to process a frame.
Choosing Z=96 over Z=32 improves the processing time by a factor of 3.2.
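The bottleneck behaviour can be captured in a toy model. The cycle constants below are hypothetical, chosen only to reproduce the qualitative shape described above (SadGenerator cycles per Z-OB grow with Z², SadComparator cycles stay constant, and the Z-OB count per frame shrinks with Z²), not the measured values:

```python
def frame_cycles(Z, gen_k=1.0, comp_c=2000.0, frame_pixels=1280 * 720):
    gen = gen_k * Z * Z              # SadGenerator cycles per Z-OB: proportional to Z^2
    comp = comp_c                    # SadComparator cycles per Z-OB: constant in Z
    zobs = frame_pixels / (Z * Z)    # number of Z-OBs per frame
    return max(gen, comp) * zobs     # units run in parallel; the slower one dominates

# Z = 64 and Z = 96 give the same frame time (SadGenerator-bound);
# Z = 32 is SadComparator-bound and takes longer
print(frame_cycles(32), frame_cycles(64), frame_cycles(96))
```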
B –.3 Z: DRAM Figure 5.10 shows the data used by the SadGenerator when processing
a single Z-OB. Input:RF is the number of bytes that is read from the DRAM and streamed to
the SadGenerator. Output:SAD is the number of bytes that is output by the SadGenerator
and written to the DRAM. The amount of data output by the SadGenerator equals the amount
of data read by the SadComparator. The total amount of data read from the DRAM is thus the
RF data read by the SadGenerator plus the SAD data read by the SadComparator, i.e.,
Input:RF + Output:SAD. The total amount of data written to the DRAM is the amount of SADs
output by the SadGenerator. The size of Output:SAD when processing one Z-OB also equals
the size of one SAD-chunk. The DRAM space that needs to be allocated for the SADs is twice
the size of Output:SAD.
Figure 5.9: The total number of cycles needed to process one 720p frame by both units working in parallel
Figure 5.10: The amount of data used by the SadGenerator when processing one Z-OB for different sizes of Z
As shown in Figure 5.10, when increasing Z, the RF data and SAD data used to process one
Z-OB increase significantly. Doubling Z quadruples the RF and SAD data used when processing
one Z-OB. This is expected, since larger Z-OBs have larger search areas and contain more A-OBs,
which results in more SADs.
Figure 5.11 shows the data needed when processing a full 720p frame. When increasing Z,
fewer Z-OBs need to be processed per frame. The SAD data is constant, and
the RF data decreases as Z increases. Choosing a larger Z decreases the total RF data that
has to be streamed, for two reasons. First, with a larger Z, each streamed RF-line is used
to calculate more A-OB SADs than with a smaller Z, since there are (Z/A) A-OBs in an A-OB row.
This results in less overlap of adjacent RF-lines and thus a decrease in the total RF data that has
to be streamed. Second, the streaming of the RF-lines is more efficient when Z is
larger, which is explained in Appendix A. Choosing Z=96 over Z=32 improves the RF-data reuse
by a factor of 3.
Figure 5.11: The total amount of data used by the SadGenerator when processing one 720p frame for different sizes of Z
Figure 5.12 shows the data rates for different Z values. The data rate to and from the DRAM
is significantly lower with Z equal to 32, since more cycles are needed, as illustrated in Figure 5.9.
When comparing Z equal to 64 and 96, the number of cycles is constant. The SAD data is
constant as well, which explains why the write rate does not vary. The read rate for Z equal to 96
is slightly lower, since less RF data is read from the DRAM.
Figure 5.12: The read and write rates of the DRAM for different sizes of Z
B –.4 Z: Power A larger Z value causes the implementation to use more resources. An
increase in power usage is therefore expected.
C – parameter: L L is the width of the search area, which consists of L × L pixels. The larger
the search area, the more prediction candidates are checked. This increases the encoding
efficiency of the encoder, but also the computation cost. It also means that, for each
OB, more data from the reference frame is needed and more SADs have to be calculated.
The comparison between different values of L is performed with Z=64, RF=2 and design
H(L)P8.
C –.1 L: Resources Increasing L greatly increases the size of the SadComparator's trans-
poser, which is proportional to (Z/A)² × L. Thus the number of BRAMs used by the SadComparator
is proportional to L. Increasing L does not necessarily increase the horizontal parallelization H of
the SadGenerator, but when aiming for maximum speed, H has to scale with L. The increase in
horizontal parallelization causes the SadGenerator to use significantly more LUTs and FFs.
An implementation with L=128 or L=256 is not possible on the MAX2 board, since it does not
have enough BRAMs to support it.
C –.2 L: Cycles The number of cycles needed by the SadGenerator to process one Z-OB
is proportional to L² if the horizontal parallelization is equal to 1. With a horizontal parallelization
of L, the number of cycles is proportional to L. The number of cycles needed by the SadComparator
to process one Z-OB is proportional to L². Figure 5.13 illustrates the number of cycles needed
to process one Z-OB.
Figure 5.13: The number of cycles needed for each unit to independently process one Z-OB
The total number of cycles needed to process a Z-OB with the SadGenerator and SadCom-
parator working in parallel depends on the slower unit. The total number of cycles to process
one 720p frame for different choices of L is illustrated in Figure 5.14. When going from L=32 to
L=64, the number of cycles is doubled, since the SadGenerator is the slower unit. From L=64
to L=128 and L=256, the total number of cycles needed is quadrupled for each doubling of L. This
is because the SadComparator is then the slower unit and its cycle count scales with L².
Figure 5.14: The total number of cycles needed to process one 720p frame by both units working in parallel
C –.3 L: DRAM Increasing L greatly increases the RF data streamed from the DRAM to
the SadGenerator. The number of calculated SADs increases quadratically with L. Thus the total
amount of SADs streamed from the SadGenerator to the DRAM, and the stream from the DRAM
to the SadComparator, increase quadratically as well. Figure 5.15 presents the total amount of
data used during the processing of one Z-OB by the SadGenerator.
Figure 5.15: The amount of data used by the SadGenerator when processing one Z-OB for different sizes of L
Doubling L quadruples the number of pixels from the reference frame that have to be streamed
and the number of SADs that have to be calculated. This quadrupling does not happen for the
actual transfers of RF data and SAD data. Table 5.5 shows the expected increase in the second
row, Search Area L×L; the third and fourth rows show the actual increase of the data streamed for
Input: RF and Output: SAD. The difference is mainly because the streaming of the data
carries padding overhead, which is relatively smaller when L is larger. This padding is necessary
to meet the stream pattern restrictions regarding the size of the burst length.
Increasing L drastically increases the amount of data, but also the number of cycles. Fig-
ure 5.16 shows the data rate for each L. From L=32 to L=128, the data increases more than the
Table 5.5: The difference in increase of RF data and SAD data compared to the increase of the Search Area L×L, relative to L=32

                   L=32   L=64   L=128   L=256
Search Area L×L      1      4      16      64
Input: RF            1    1.82    5.19   13.49
Output: SAD          1      3      12      44
number of cycles, resulting in increased data rates. From L=128 to L=256, the number of cycles
increases more than the amount of data, resulting in decreased data rates.
Figure 5.16: The read and write rates of the DRAM for different sizes of L
C –.4 L: Power A larger L value causes the design to use more resources. Also, the amount
of data streamed to and from the off-chip DRAM increases significantly from L=32 to L=128, as
shown in Figure 5.16. An increase in power usage is therefore expected.
D – parameter: RF RF is the number of reference frames that are used to look for a candidate.
The comparison between different choices of RF is performed with L=64, Z=64 and design
H64P8.
D –.1 RF: Resources Changing RF does not change the resource usage, since the design
stays the same. There is only an increase in the number of actions that are performed.
D –.2 RF: Cycles Doubling the number of reference frames doubles the number of actions
that are performed, which doubles the number of cycles. This is illustrated in Figure 5.17.
D –.3 RF: DRAM Doubling the number of reference frames doubles the number of SADs
and the amount of RF data that is needed. This is illustrated in Figure 5.18.
Since both the number of cycles and the amount of data double when the number of reference
frames is doubled, the data rate is not affected.
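The rate argument can be stated in one line (a trivial sketch; the function name and numbers are illustrative):

```python
def dram_rate(rf, bytes_per_rf, cycles_per_rf):
    """Bytes per cycle: both the data and the cycles scale linearly with RF."""
    return (rf * bytes_per_rf) / (rf * cycles_per_rf)

# doubling (or quadrupling) RF leaves the rate unchanged
assert dram_rate(1, 1000.0, 50.0) == dram_rate(2, 1000.0, 50.0) == dram_rate(4, 1000.0, 50.0)
```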
Figure 5.17: The number of cycles needed to process one 720p frame for different numbers of reference frames
Figure 5.18: The amount of data used by the SadGenerator when processing one Z-OB for different numbers of reference frames
D –.4 RF: Power The design is not changed, so there is no change in power when changing
the number of reference frames.
5.3 Comparison
The results discussed in the previous section are now compared with an NVIDIA Fermi-
based GPU implementation. First the GPU implementation is described, and then the results are
discussed.
5.3.1 About the GPU implementation
The GPU implementation was developed by Svetislav Momcilovic [12]. It is executed on an NVIDIA
Fermi-based GPU architecture with 512 processing cores, clocked at 1.5GHz and with a maxi-
mum TDP of 244W. It focuses on exploiting fine-grained data parallelism at the level of Coding
Tree Units (CTUs), which are simultaneously processed across hundreds of GPU cores. In general,
CTUs are mapped to different GPU Streaming Multiprocessors (SMs), which consist of several
GPU cores. The cores themselves simultaneously process different matching candidates, while
keeping in the SMs' registers the minimum distortion values found among them, as well as the
related Motion Vectors.
In order to achieve higher processing performance, the reference samples are cached in the
SMs' local storage. In general, the entire Search Area cannot fit in the available cache memory,
and caching must be performed in several steps, while maximally reusing the samples already
available in the local storage. The size of the SA partition cached in a single step is determined
according to the available cache memory on the particular GPU device. This approach not only
provides fast access to the reference samples, but also makes the Motion Estimation algorithm
scalable over the SA size.
For the different CTU partitioning modes, we applied hierarchical SAD computing, where the SADs
of the smaller partitions are reused to compute the SAD values of the larger ones. This approach
requires saving not only the reference samples, MVs and distortion values for the best processed
candidates, but also their SAD values, and leads to even larger memory requirements. In order to
deal with such high memory requirements and the limited local storage available per SM, we divided
the ME algorithm into two kernels: the first one performs the search algorithm for the
partitioning modes as large as 32x32, while the larger partitions (up to 64x64) are processed
in the second kernel. The SAD values obtained for the 32x32 partitions are saved in
global GPU memory and later reused for the larger partitions. Even though this leads to additional
caching overhead, the approach allows a more efficient use of the available resources and the
simultaneous employment of a larger number of GPU cores. The best candidates found on different
GPU cores are finally compared with each other in a reduction procedure in order to find the final
set of best matching candidates for all partitioning modes of the processed CTU.
5.3.2 Results
The proposed architecture’s implementation is now compared with a GPU implementation. It
performs motion estimation for blocks with a size of 64×64. Just like the proposed architecture
with Z=64, it calculates 85 motion vectors for each 64×64 block: 1 MV for the 64×64 OB, 4 MVs
for the 32×32 OBs, 16 MVs for the 16×16 OBs and 64 MVs for the 8×8 OBs.
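The count of 85 motion vectors per 64×64 block is simply the sum of the number of square OBs of each supported size:

```python
Z = 64
sizes = (64, 32, 16, 8)                  # supported square OB widths
mvs = sum((Z // s) ** 2 for s in sizes)  # 1 + 4 + 16 + 64
print(mvs)  # -> 85
```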
Figure 5.19 compares the processing time of the two implementations for different frame
sizes. The motion estimation uses a single reference frame and a search area of 64×64. For both
the reference GPU implementation and the proposed architecture's implementations, increasing
the size of the frame increases the processing time by the same factor. The H64P8 design is
faster than the GPU implementation by a near-constant factor of 2.5; the H64P4 design by a
near-constant factor of 1.24.
Figure 5.20 compares the processing time for different numbers of reference frames between
Figure 5.19: A comparison of the GPU and proposed implementation, for different frame sizes
Figure 5.20: A comparison of the GPU and proposed implementation, for different numbers of reference frames
both implementations. The motion estimation is performed on a 1080p frame and uses a search
area of 64×64. For both the reference GPU implementation and the proposed architecture's
implementation, increasing the number of reference frames increases the processing time by the
same factor. The H64P8 design is faster than the GPU implementation by a near-constant factor
of 2.4; the H64P4 design by a near-constant factor of 1.22.
Figure 5.21 compares the processing time of both implementations for different L×L search
area sizes. The proposed architecture's horizontal parallelization H scales here with L. The
motion estimation is performed on a 1080p frame and uses a single reference frame. For L equal
to 64 and 128, the H(L)P8 design is respectively 2.45 and 2.53 times faster. When L is equal
to 32, the H(L)P8 design is still faster, but only by a factor of 1.32. This is because, as shown in
Figure 5.13, with L=32 the SadGenerator is the slower unit and thus becomes a bottleneck. For L
equal to 128, the H(L)P4 design is only slightly slower than the H(L)P8 design. This is because,
as shown in Figure 5.13, with L equal to 128, the SadGenerator is almost a factor of 2 faster than
Figure 5.21: A comparison of the GPU and proposed implementation, for different search area sizes
the SadComparator. Slowing down the SadGenerator by a factor of two by choosing P=4 thus
only slightly slows down the overall performance of the proposed accelerator.
5.4 Summary
The reference implementation is successfully implemented on the FPGA with the following
parameters: A=8, Z=64, L=64 and RF=2. With the ALU-grid parameters H=64 and P=8, the
implementation achieves motion estimation rates of 6.7 fps for 2160p, 26.8 fps for 1080p and
56.9 fps for 720p. The H64P8 design uses too many resources to fit on the targeted Virtex 5 FPGA.
Thanks to the ALU-grid's parallelization parameters, the resource usage can be scaled down.
Changing the design from H64P8 to H64P4 decreases the LUT usage by 41% and the FF usage
by 13%. The read and write bandwidths to the DRAM, respectively 76.8 and 64.8 bytes/cycle,
are halved compared to H64P8. As a trade-off, the performance is a factor of two lower.
The H64P2 design has even lower resource usage and bandwidths and fits on the targeted FPGA,
but has a 2× performance decrease compared to the H64P4 design. The effects of several
implementation choices are studied. Changing the SadGenerator's output width decreases its
BRAM usage by 15%. The use of custom accumulators in the ALU-grid achieves a 6× LUT and
3.5× FF usage reduction.
Furthermore, the effects of the parameters Z, L, RF and H P on the architecture's implemen-
tation are studied. Increasing H P drastically improves the performance at the cost of increased
data rates and resource usage. Increasing L increases the encoding efficiency at the cost of a
decrease in performance and an increase in resource usage. Doubling RF increases the encoding
efficiency at the cost of a performance decrease.
When the ME is performed with 64×64 prediction blocks, Z has to be 64. In case the
maximum prediction block size is 32×32, Z can be chosen as 32 or 96. Choosing Z=96 over Z=32
increases the performance by a factor of 3.2 and the RF-data reuse by a factor of 3. The cost
is a significant increase in resource usage: the number of accumulation values in the ALU-grid
increases by 3×. Several other components increase by a factor of 9: the number of BRAMs
used by the SadComparator's Transposer, the number of add trees in the SadComparator and
the SAD-chunk's size in the off-chip memory.
Furthermore, the results of the implemented architecture are compared with an NVIDIA Fermi
GPU implementation. The implemented architecture achieves a 2.5× performance increase over
the GPU for any frame resolution, number of reference frames and search area. Only for a small
search area of 32×32 pixels does the implementation achieve a lower, but still significant, 1.32×
performance increase. Although the power studies shown herein are not enough to detail the
exact gains achieved in terms of power, we believe it is reasonable to consider that the power
consumption of all the FPGA-based implementations is about one order of magnitude lower than
that of the GPU.
6. Conclusions and Future work
Contents
6.1 Conclusion
6.2 Future work
6.1 Conclusion
A motion estimation accelerator, based on the data-flow approach, is designed in this thesis
for the new High Efficiency Video Coding standard. Motion estimation is the most computationally
intensive part of video encoding, taking up to 80% of the total computation time. HEVC's ME
supports different sizes of prediction blocks, up to 64×64, in combination with multiple reference
frames and search area sizes ranging from 32×32 to 256×256 to look for the best matching
prediction block. These numerous configurations pose a true challenge for a ME architecture.
The proposed architecture implements a Full Search ME algorithm, evaluating every prediction
candidate in the search area. This requires a huge amount of bandwidth from the memory that
stores the reference frames. In order to save computation time and bandwidth, hierarchical SAD
computing is used to obtain the larger prediction blocks' SADs. The architecture is
highly parallelised and focuses on performance and data reuse. This reduces the bandwidth
from several GBytes per frame to several MBytes, achieving a reduction factor of up to
1037. The architecture is composed of two units. The SadGenerator calculates the fine-grained
SADs. Using these SADs and hierarchical SAD computing, the SadComparator calculates the
motion vector for every prediction block. The architecture works with any size of the smallest
prediction block A×A and largest prediction block Z×Z.
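The hierarchical SAD computation described above can be illustrated in a few lines: fine-grained A×A SADs are computed once per candidate position, and the SADs of all larger block sizes are obtained by summing groups of four. This is only an illustrative NumPy sketch (square, power-of-two blocks assumed), not the hardware data path:

```python
import numpy as np

def fine_sads(cur, ref, a=8):
    """SAD of every A×A sub-block of `cur` against the co-located sub-block
    of `ref`, for a single search candidate position."""
    h, w = cur.shape
    diff = np.abs(cur.astype(np.int32) - ref.astype(np.int32))
    return diff.reshape(h // a, a, w // a, a).sum(axis=(1, 3))

def hierarchical_sads(sad_grid):
    """Reuse the fine-grained SADs: four neighbouring SADs sum to the SAD
    of the next (doubled) block size, up to the largest block."""
    levels = [sad_grid]
    while levels[-1].shape[0] > 1:
        g = levels[-1]
        n = g.shape[0] // 2
        levels.append(g.reshape(n, 2, n, 2).sum(axis=(1, 3)))
    return levels  # levels[0]: A×A SADs, ..., levels[-1]: the Z×Z SAD
```

For a 64×64 Z-OB this yields the 8×8, 16×16, 32×32 and 64×64 SADs of one candidate from a single pass over the pixel differences.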
Using the Maxeler framework, the architecture is implemented with a high-level HDL in a
highly pipelined manner on a Virtex-5 FPGA. The implemented accelerator has the potential to
process 720p and 1080p frames at 55,6 fps and 27 fps, respectively. These speeds outperform
the reference GPU implementation by a factor of 2,5. The core of the SadGenerator is a 2D
ALU-grid that is highly scalable. With its horizontal parallelisation parameter H and processing
power parameter P, the ALU-grid's parallelisation can easily be scaled down to save resource
usage, at the cost of a performance decrease.
Changing the ALU-grid's processing power P from 8 to 4 decreases the LUT usage by 41%
and the FF usage by 13%, at the cost of doubling the processing time. Choosing the H parameter
not to scale with the search area width also allows the architecture to process large search areas
with a high level of parallelism without using huge amounts of resources.
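The scaling behaviour of H and P can be summarised by a first-order model in which throughput is proportional to the ALU-grid parallelism H×P. This is an assumption consistent with the measurements above (halving P doubles the processing time), not a cycle-accurate model:

```python
def relative_processing_time(h, p, h_ref=64, p_ref=8):
    """Processing time relative to the H64P8 reference configuration,
    assuming throughput scales linearly with the parallelism H*P
    (a first-order model, not a cycle-accurate one)."""
    return (h_ref * p_ref) / (h * p)
```

Under this model, `relative_processing_time(64, 4)` yields 2.0, matching the doubled processing time observed for H64P4.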
The reference implementation is designed to perform ME with prediction blocks of 8×8 up to
64×64, 2 reference frames and a search area of 64×64. It achieves a reference frame bandwidth
of 35,3 MBytes per processed 720p frame. With a H64P8 ALU-grid, the implementation
outperforms the GPU implementation by a factor of 2,5. With a H64P4 ALU-grid, the GPU
implementation is outperformed by a factor of 1,25.
6.2 Future work
• After having significantly reduced the bandwidth usage caused by the reference data, the
buffering of the SADs in memory is now responsible for most of the required bandwidth.
Implementing a direct streaming of the SADs from the SadGenerator to the SadComparator
can eliminate the need for buffering.
• To refine the motion vectors, a sub-pixel motion estimation is usually performed. The calcu-
lations and data patterns needed for sub-pixel motion estimation are very similar to the ones
used in this architecture and can be integrated in a future implementation.
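As a hint of how similar the sub-pixel data patterns are, half-pel samples can be produced by averaging neighbouring reference pixels and then searched with the same SAD machinery. Note that HEVC specifies longer (7/8-tap) interpolation filters, so this bilinear version is only an illustrative sketch:

```python
import numpy as np

def half_pel_planes(ref):
    """Bilinear half-pel interpolation (illustrative only; HEVC uses
    7/8-tap filters). Returns the horizontal, vertical and diagonal
    half-sample planes of a reference area."""
    r = ref.astype(np.int32)
    h = (r[:, :-1] + r[:, 1:] + 1) // 2                         # x + 1/2
    v = (r[:-1, :] + r[1:, :] + 1) // 2                         # y + 1/2
    d = (r[:-1, :-1] + r[:-1, 1:] + r[1:, :-1] + r[1:, 1:] + 2) // 4
    return h, v, d
```

Each interpolated plane is streamed and matched exactly like an integer-pel reference area, which is why the existing data path could be reused.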
Appendix A
Section 4.4.1 describes how the pixels from the reference frame are streamed from the off-
chip memory to the SadGenerator for Z equal to 64 and L (search area width) equal to 64. Due
to the stride access pattern limitation of 96 bytes, the streamed RF-line contains data of several
Z-OBs. Depending on which Z-OB is being processed, a mux selects the corresponding bytes
that are needed to process that Z-OB; the remaining data is discarded. Figure A.1 shows the
first RF-lines of adjacent Z-OBs in the DRAM for Z equal to 32 and for different values of L. The
corresponding RF-lines and their Z-OB data are shown in Figure A.2. Figure A.3 and Figure A.4
show the same for Z equal to 64. Figure A.5 and Figure A.6 show the same for Z equal to 96.
When comparing Figure A.2 with Figure A.4 and Figure A.6, it is clear that for the same value
of L, the length of the RF-line stays constant, regardless of Z. With L equal to 32 or 64, the RF-line
always has a length of 192 bytes. With L equal to 128, the RF-line always has a length of 288
bytes. With L equal to 256, the RF-line always has a length of 384 bytes. What does change with
the value of Z are the bytes that are used. With Z and L equal to 32, only 64 bytes of the total
192 bytes are used, resulting in an efficiency of 33%. Table A.1 shows the efficiency for every Z,L
pair.
Table A.1: The RF-line data efficiency for different Z and L values

        Z=32   Z=64   Z=96
L=32     33%    50%    66%
L=64     50%    66%    83%
L=128    55%    66%    77%
L=256    75%    83%    92%
In general, a higher L or Z value results in a more efficient use of the streamed RF data, so
less data has to be streamed to the SadGenerator.
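The efficiency values in Table A.1 can be reproduced from two observations: an RF-line must cover L plus the widest supported Z-OB, rounded up to the 96-byte stride, while only Z+L bytes of it are consumed for a given Z-OB. The sketch below encodes that reading; the stride granularity and the Zmax=96 constant are inferred from the figures, not stated as a formula in the text:

```python
import math

STRIDE = 96   # DRAM stride access granularity in bytes (from Section 4.4.1)
Z_MAX = 96    # widest supported Z-OB, inferred from the figures

def rf_line_length(l):
    """RF-line length in bytes for a search area width `l`."""
    return STRIDE * math.ceil((l + Z_MAX) / STRIDE)

def rf_line_efficiency(z, l):
    """Fraction of the streamed RF-line actually used for one Z-OB."""
    return (z + l) / rf_line_length(l)

# Reproduce Table A.1 row by row (values printed as raw fractions).
for l in (32, 64, 128, 256):
    print(f"L={l}:", [round(rf_line_efficiency(z, l), 3) for z in (32, 64, 96)])
```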
Figure A.1: The first RF-lines of adjacent Z-OBs in the DRAM for Z equal to 32 (panels (a)–(d) for L = 32, 64, 128 and 256)
Figure A.2: The data of multiple Z-OBs inside an RF-line for Z equal to 32 (panels (a)–(d) for L = 32, 64, 128 and 256)
Figure A.3: The first RF-lines of adjacent Z-OBs in the DRAM for Z equal to 64 (panels (a)–(d) for L = 32, 64, 128 and 256)
Figure A.4: The data of multiple Z-OBs inside an RF-line for Z equal to 64 (panels (a)–(d) for L = 32, 64, 128 and 256)
Figure A.5: The first RF-lines of adjacent Z-OBs in the DRAM for Z equal to 96 (panels (a)–(d) for L = 32, 64, 128 and 256)
Figure A.6: The data of a single Z-OB inside an RF-line for Z equal to 96 (panels (a)–(d) for L = 32, 64, 128 and 256)
Appendix B
Figure B.1 shows an overview of the Xilinx Power Estimator's result for the proposed architec-
ture's reference implementation with H64P2, which is described in Section 5.2.2. The estimation
is based on the MAP report generated by Xilinx during the compilation of the implementation with
the Maxeler platform. The total power consumption is 13,603 W. Most power is consumed by
the I/O and the device static power, with respectively 58% and 30% of the total power consumption.
The remaining power is consumed by the transceivers and the core dynamic power, with
respectively 7% and 6% of the total power consumption.
Figure B.1: The Xilinx Power Estimator’s result for the H64P2 reference implementation