Upload
others
View
1
Download
0
Embed Size (px)
Citation preview
Abstract — This work presents a programmable and configurable motion estimation processor capable of performing motion
estimation across several state-of-the-art video codecs that include multiple tools to improve the accuracy of the calculated motion
vectors. The core can be programmed using a C-style syntax optimized to implement arbitrary block matching algorithms and
configured with different execution units depending on the selected codec, the available inter-coding options and required
performance. This flexibility means that the core can support the latest video codecs such as H.264, VC-1 and AVS at high-
definition resolutions and frame rates. The configuration and programming phases are supported by an integrated development
environment that includes a compiler and profiling tools enabling a designer without specific hardware knowledge to optimize the
microarchitecture for the selected codec standard and motion search technique leading to a highly efficient implementation.
Index Terms — Video coding, motion estimation, reconfigurable processor, H.264, VC1, AVS, FPGA.
I.
TINTRODUCTION
he emergence of new advanced coding standards such as H.264 [1], VC-1 [2] and AVS [3] with their multiple coding tools
have introduced new complexity challenges during the motion estimation (ME) process used in inter-frame prediction. While
previous standards such as MPEG-2 could only vary a few options, H.264, VC-1 and AVS add the freedom of using quarter
pixel resolutions, multiple reference frames, multiple partition sizes and rate-distortion optimization as tools to optimize the
inter-prediction process. The potential complexity introduced by these tools operating on large reference area sizes makes
the full-search approach, which exhaustively tries each possible combination, less attractive from a performance and power
points of view. A flexible, reconfigurable and programmable motion estimation processor such as the one proposed in this
work is well poised to address these challenges by fitting the core microarchitecture to the inter-frame prediction tools and
algorithm for the selected encoding configuration. The concept was briefly introduced in [4] and it is further developed in
this work. The paper is organized as follows. Section II reviews significant work in the field of hardware architectures for
motion estimation. Section III establishes the need for architectures able to support fast motion estimation algorithms in
order to deliver high quality results in a power/area/time-constrained video processing platform. Section IV describes the
processor microarchitecture details while Section V presents the tools developed to program the core and explore the
software/hardware design space for advanced motion estimation. Section VI presents the multistandard hardware extensions
targeting VC-1 and AVS codecs. Finally section VII analyses and compares the complexity/performance of the proposed
solution and section VIII concludes the paper.
Multi-standard Reconfigurable Motion Estimation Processor for Hybrid Video Codecs
Jose L. Nunez-Yanez, Trevor Spiteri, George Vafiadis
II.MOTION ESTIMATION HARDWARE REVIEW
A. Full Search ME Hardware
There are numerous examples of full-search motion hardware in the literature, and in this section we will review a few
relevant examples. Full-search algorithms have been the preferred option for hardware implementations due to their regular
dataflow, which makes them well suited to architectures using one-dimensional (1-D) or two-dimensional (2-D) systolic
array principles with simple control and high hardware utilization. This approach generally avoids global routing, resulting
in high clock frequencies. Practical implementations, however, need to consider the interfacing with the memories in which
the frame data resides. This can result in large memory data widths and port counts or a large number of registers needed to
buffer the pixel data. Data broadcasting techniques can be used to reduce the need for large memory bit widths, although this
can reduce the achievable clock frequency. These ideas are developed in the work presented in [5] which uses a broadcasting
technique to propagate partial sum of absolute differences (SAD) values and merges the partial results to obtain different
block sizes in parallel.
An improvement on this concept is shown in [6] which develops a 2-D SAD Tree architecture which operates on one
reference location at a time using a four-row array of registers for reference data, thereby removing the need for broadcasting
and also allowing a snake-scan search order which further increases data-reuse. A different approach to using an adder tree is
proposed in [7], which adds variable block size support to a 1-D systolic array by using each processing element (PE) to
accumulate the SAD for each of the 41 motion vectors required, with a shuffling technique. The authors report a latency of
4496 clock cycles to complete the full search on a 16×16 search area with 16 PEs working in parallel. The throughput in this
case is 13 frames per second (fps) in QCIF video resolution. A similar approach based on a 2-D architecture is presented in
[8]. The architecture includes a total of 256 PEs grouped into 16 4×4 arrays that can complete the matching of a candidate
macroblock in every clock cycle. The implementation reports results based on a search area of 32×32 pixels. The whole
computation takes around 1100 clock cycles to complete with a complexity of 23K logic elements (LUTs) implemented in
an Altera Excalibur EPXA10. An FPGA working frequency of 12.5 MHz is reported although the device works at 285 MHz
when targeting a TSMC 0.13 μm standard cell library.
Importantly, with these SAD reuse strategies, it can be seen that full-search implementations have a clear advantage in that
extending them to support variable block sizes requires only a small increase in gate count over their conventional fixed-
block counterparts, and has little or no bearing on its throughput, critical path, or memory bandwidth. On the other hand, full
search invariably implies a large number of SAD operations, and even for reduced search areas, a number of optimizations
have been developed to make it more computationally tractable. For example, the work presented in [9] uses a most-
significant-bit first bit-serial architecture instead of a systolic implementation. This enables early termination when the SAD
of a particular motion vector candidate becomes larger than the current winner during the SAD computation. One of the
challenges facing the full-search approach in hardware is that throughput is determined by the search range which is
generally kept small at 16×16 pixels to avoid large increases in hardware complexity. Intuitively it is reasonable to expect
that the search ranges should be larger for high-definition video formats (the same object moves more pixels in high
resolution than in a lower resolution screen) and the increase in search range will have a large impact in complexity or
throughput since all the pixels must be processed, limiting the scalability of full search hardware. For example, the work
presented in [10] considers a relatively large search range of 63×48 pixels in their integer full-search architecture, which can
vary the number of pixel processing units. A configuration using 16 pixel processing units can support 62 fps of 1920×1080
video resolution clocking at 200 MHz, but it needs around 154K LUTs implemented in a Virtex5 XCV5LX330 device.
From this short review it can be concluded that full-search hardware architectures focus on integer-pel search while
fractional-pel search is not investigated since it is considered to take only a fraction of the time of the integer search. Also,
rate distortion optimization (RDO) using Lagrangian multipliers that add the cost of the motion vector to the SAD cost are
generally ignored although they can have a large impact on coding efficiency of around 10% as shown in later sections. One
of the difficulties of adding RDO to the previous architectures is that all the motion vector costs need to be calculated in
parallel to avoid becoming a bottleneck, and the additional hardware needed to support these parallel computations with
multipliers involved will increase the complexity considerably.
B. Fast Search ME Hardware
Architectures for fast ME algorithms have also been proposed [11]. The challenges the designer faces in this case include
unpredictable data flow, irregular memory access, low hardware utilization and sequential processing. Fast ME approaches
use a number of techniques to reduce the number of search positions, and this inevitably affects the regularity of the data
flow, eliminating one of the key advantages that systolic arrays have: their inherent ability to exploit data locality for reuse.
This is evident in the work done in [12] that compares a number of fast-motion algorithms mapped onto a systolic array and
discovers that the memory bandwidth required does not scale at anywhere near the same rate as the gate count.
A number of architectures have been proposed which follow the programmable approach by offering the flexibility of not
having to define the algorithm at design time. The application specific instruction-set processor (ASIP) presented in [13]
uses a specialized data path and a minimum instruction set similar to our own work. The instruction set consists of only eight
instructions operating on a RISC-like, register-register architecture designed for low-power devices. There is the flexibility
to execute any arbitrary block matching algorithms and the basic SAD16 instruction computes the difference between two
sets of 16 pixels and in the proposed microarchitecture takes 16 clock cycles to complete using a single eight-bit SAD unit.
The implementation using a standard cell 0.13 μm ASIC technology shows that this processor enables real-time motion
estimation for QCIF, operating at just 12.5 MHz to achieve low power consumption. An FPGA implementation using a
Virtex-II Pro device is also presented with a complexity of 2052 slices and a clock of 67 MHz. In this work, scaling can be
achieved by varying the width of the SADU (ALU equivalent for calculating SADs), but due to its design, the maximum
parallelism that can be achieved would be if the SAD for the entire row could be calculated in the minimum one clock cycle,
in a 128-bit SIMD manner.
The programmable concept is taken a step further in [14]. This ME core is also oriented to fast motion estimation
implementation and supports sub-pixel interpolation and variable block sizes. The interpolation is done on-demand using a
simplified 4-tap non-standard filter for the half-pel interpolation, which could cause a mismatch between the coder output
and a standard-compliant decoder. The core uses a technique to terminate the calculation of the macroblock SAD when this
value is larger than some previously calculated SAD, but it does not include a Lagrangian-based RDO technique to
effectively add the cost of coding the motion vector to the selection process. Scalability is limited since only one functional
unit is available, although a number of configuration options are available to match the architecture to the motion algorithm
such as algorithm-specific instructions. The SAD instruction, comparable to our own pattern instruction, operates on a 16-
pixel pair simultaneously and 16 instructions are needed to complete a macroblock search point, taking up to 20 clock
cycles. The processor uses 2127 slices in an implementation targeting a Virtex-II device with a maximum clock rate of 50
MHz. This implementation can sustain processing of 1024×750 frames at 30 fps.
There are also examples of ME processors in industry as reviewed in [11]. Xilinx has recently developed a processor
capable of supporting high definition 720p at 50 fps, operating at 225 Mhz [15] in a Virtex-4 device with a throughput of
200,000 macroblocks per second. This Virtex-4 implementation uses a total of around 3000 LUTs, 30 DSP48s embedded
blocks and 19 block-RAMs. The algorithm is fixed and based on a full search of a 4×3 region around 10 user-supplied initial
predictors for a total of 120 candidate positions, chosen from a search area of 112×128 pixels. The core contains a total of 32
SAD engines which continuously compute the cost of the 12 search positions that surround a given motion vector candidate.
III. THE CASE FOR FAST MOTION ESTIMATION HARDWARE ENGINES IN HIGH-DEFINITION VIDEO CODING
It has been shown in the literature [16] that the motion estimation process in advanced video coding standards can
represent up to 90% of the total complexity, especially when considering features such as multiple reference frames, multiple
partition sizes, large search ranges, multiple vector candidates and fractional-pel resolutions. A thorough evaluation of how
these options affect video quality is available at [16] although limited to QCIF (176×144) and CIF (352×288) formats. These
low resolutions formats are of limited application in current communications and multimedia products which aim at
delivering high quality video; even mobile phones already target resolutions of 800×480 pixels. Some of the conclusions
reached in [16] indicate that the impact of large search sizes on coding efficiency is limited, that most of the coding
efficiency is obtained from the first four block sizes (16×16, 16×8, 8×16, 8×8) and that the gains obtained thanks to RD-
Lagragian optimization are substantial. Our research has confirmed that RD-Lagrangian is of vital importance typically
reducing bit rates around 10% for the same video quality. Disabling this optimization reduces performance especially for the
exhaustive search. The reason is that the selection of the winner based only on the SAD can introduce motion vectors that do
not follow the real motion with excessive ‘noise’ that hurts the bit rate. The conclusion is that RD-Lagrangian should be
present in any high quality motion estimation hardware processor or algorithm. In subsections A and B we re-examine the
effects of search range/sub-partitions using three high-definition 1920×1080 video sequences: tractor (complex chaotic
motion), pedestrian area (fast simple motion) and sunflower (simple slow motion) obtained from [17] using an open-source
implementation of H.264 [18]. In all these experiments we have kept the option of using predicted motion vector candidates
to initialize the search range enabled and RD-Lagrangian optimization enabled since they consistently improve the quality of
the motion vectors.
A. Search range evaluation
For the search range evaluation we consider the well-known hexagonal search algorithm and the full search algorithm, both
as implemented in [18]. We consider the following search ranges: 8×8, 16×16, 32×32, 64×64, 128×128 and 256×256. The
results for the hexagonal case are shown in Figs. 1, 2 and 3. It is clear that the optimal search range varies depending on the
sequence increasing from 32×32 for the sunflower sequence to 128×128 for the pedestrian sequence. The analysis with the
full-search algorithm indicates similar results so it is not included. Overall, a 16×16 search range as used in many full-search
hardware implementations is too restrictive. In our hardware we have increased the range to 96×112 (search window of
112×128) as a tradeoff between hardware efficiency and coding quality, as explained in the following sections.
B. Sub-partition analysis
Fig. 4 shows the results of enabling sub-partitions for the pedestrian area sequence for the hexagonal search algorithm.
The results show that the sub-partitions optimization fails to improve performance for the hexagonal search algorithms and
can actually degrade it. Figs. 5 and 6 correspond to the other two sequences and show equivalent results. Similar experiments
conducted with other search algorithms including full-search also confirm this behavior. It is apparent that the cost of coding
the additional motion vectors required by the sub-partitions can make the gains obtained during residual encoding negligible.
Additionally, if Lagrangian optimization is disabled then the effect of sub-partition coding is much more negative since the
costs of these additional motion vectors are being ignored. Sub-partitions are more effective with smaller resolutions (the
same object will move fewer pixels in CIF compared with HD, for example) and it is possible that different Lagrangian
techniques from the one used in x264 could yield a better performance, but careful analysis is needed when these techniques
are being used in high definition video coding. Otherwise they could result in an important computational overhead with no
apparent benefit. This means that one of the main advantages of full-search hardware which is its ability to calculate in
parallel all the sub-partitions costs by combining SAD results of the smaller sub-blocks could be of limited benefit in high-
resolutions video formats. On the other hand the large search areas required by high-definition as seen in sub-section A could
be problematic to implement in real-time with full-search hardware resulting in a large hardware complexity and energy
consumption.
IV. PROCESSOR MICROARCHITECTURE
Section III has shown that for high-definition sequences large search ranges are required so complex exhaustive search
algorithms will need very high-throughputs to maintain real-time performance. On the other hand, fast motion estimation
algorithms can offer high quality of results by being able to track real motion better. A configurable/programmable
processor such as the one presented in this work can use different configurations depending on the video resolutions being
processed. Other configuration options that the hardware should support include the number of reference frames, fractional-
pel support and sub-partition sizes.
A programmable/configurable hardware solution can be designed to enable the trade-off among complexity and throughput
depending on the number and type of functional or execution units, so for example if a particular application does not benefit
from rate distortion optimization, this can be disabled, reducing the complexity and power consumption of the core.
According to this criterion, a configurable and programmable motion estimation processor has been designed which can
support H.264 motion estimation over a large range of search options and video resolutions, including high definition, using
different programs and hardware configurations. The microarchitecture of a sample hardware configuration using four
integer-pel execution units, one fractional-pel execution unit and one interpolation execution unit is illustrated in Fig. 7. The
main integer-pel pipeline must always be present to generate a valid processor configuration, but the other units are optional,
and are configured at compile time.
A. Integer-pel execution units (IPEU).
The main IPEU is shown in the middle of Fig.7 and has as principal components: (1) The physical address calculator that
transforms the motion vector offsets and vector candidates into addresses to the reference memory, (2) The vector alignment
unit that aligns the two 64-bit words obtained from the reference memory into a single 64-bit word of valid pixels ready to
operate with the data from the current macroblock memory. (3) The SAD logic which is formed by the SAD operator and
the adder tree to obtain and accumulate the SAD values,. (4) the Motion Vector Decision unit which decides which SAD and
motion vector candidate are kept as current winners for the next interaction. Each functional unit uses a 64-bit wide word
and a deep pipeline to achieve a high throughput. All the accesses to reference and macroblock memory are done through
64-bit wide data buses and the SAD engine also operates on 64-bit data in parallel. The memory is organized in 64-bit words
and typically all accesses are unaligned, since they refer to macroblocks that start in any position inside this word. By
performing 64-bit read accesses in parallel to two memory blocks, the desired 64 bits inside the two words can be selected
inside the vector alignment unit. The number of integer-pel execution units is configurable from a minimum of one to a
maximum limited only by the available resources in the technology selected for implementation. Each execution unit has its
own copy of the point memory and processes 64 bits of data in parallel with the rest of the execution units. The point
memories are 256×16 bits in size and contain the x and y offsets of the search patterns. For example a typical diamond search
pattern with a radius of 1 will use four positions in the point memory with values [−1,0], [0, −1], [1,0], [0, 1]. Any pattern
can be specified in this way, and multiple instructions specifying the same pattern can point to the same position in the point
memory saving memory resources. Each integer-pel execution unit receives an incremented address for the point memory so
each of them can compute the SAD for a different search point corresponding to the same pattern. This means that the
optimal number of integer-pel execution units for a diamond search pattern is four, and for the hexagon pattern six. Further
optimization to avoid searching duplicated patterns can halve the number of search points for many regular patterns. In
algorithms which combine different search patterns, such as UMH, a compromise can be found to optimize the hardware and
software components. This illustrates the idea that the hardware configuration and the software motion estimation algorithm
can be optimized together to generate different processors depending on the software algorithm to be deployed.
B. Fractional-pel Execution Unit (FPEU) and Interpolation Execution Unit (IEU).
The engine supports half- and quarter-pel motion estimation thanks to a fractional-pel execution unit (FPEU) and an
interpolation execution units (IEU) that calculate the quarter- and half-pel values using interpolation filters as defined in the
standard. The FPEU is conceptually similar to IPEU although in this case two vector alignment units are required since the
quarter-pel interpolation is done on-demand by averaging the corresponding half-pel reference data obtained from the
original, diagonal, horizontal or vertical memories. A memory block storing the original pixels without interpolation is
included since it is needed for quarter-pel operations and in this way no accesses to the main reference memory are required
so that the integer-pel execution unit can operate in parallel. The number of IEUs is limited to one, but the number of FPEUs
can be configured at compile time. The IEU interpolates the 20×20 pixel area that contains the 16×16 macroblock
corresponding to the winning integer motion vector. The interpolation hardware is cycled three times to calculate first the
horizontal pixels, then the vertical pixels, and finally the diagonal pixels. The IEU calculates the half pels through a six-tap
Wiener filter as defined in the standard. The IEU is formed by a total of eight systolic 1-D interpolation processors with six
processing elements each. The objective is to balance the internal memory bandwidth with the processing power so that in
each cycle, a total of eight valid pixels are presented to one interpolator. The interpolator starts processing these eight pixels
producing one new half-pel sample after each clock cycle. In parallel with the completion of 1-D interpolation of the first
eight-pixel vector, the memory has already been read another seven times, and its output assigned to the other seven
interpolators. The data read during the ninth memory cycle can then be assigned back to the first interpolator, obtaining high
hardware utilization. The horizontally-interpolated area contains enough pixels for the diagonal interpolation to also
complete successfully. A total of 24 rows with 24 bytes each are read. Each interpolator is enabled nine times so that a total
of 72 eight-byte vectors are processed. Due to the effects of filling and emptying the systolic pipeline before the half-pel
samples are available, a total of 141 clock cycles are needed to complete half-pel horizontal interpolation. During this time,
the integer pipeline is stalled, since the memory ports for the reference memory are in use. Once horizontal interpolation
completes, and in parallel with the calculation of the vertical and diagonal half-pel pixels and the fractional-pel motion
estimation, the processing of the next macroblock or partition can start in the integer-pel execution units. Completion of the
vertical and diagonal pixel interpolation takes a further 170 clock cycles, after which the motion estimation using the
fractional pels can start. As mention previously quarter-pel interpolation is done on-the-fly as required simply by reading the
data from two of the four memories containing the half- and full-pel positions, and averaging according to the H.264
standard. The quarter-pel processing unit uses two vector alignment units to keep up with the stream of pixels data being
read from the half-pel memories. Two aligned 64-bit vectors are generated in parallel and averaging takes place in this data
in a single clock cycle. This maintains the same level of performance as for the integer pipeline. The processor is designed to
execute fractional and integer-pel motion estimation in parallel over different macroblocks so that while fractional-pel
refinement is being performed in macrobock n, the integer-pel parts are already calculating macroblock n+1. In general for
most algorithms, the number of points searched for the fractional-pel refinement is generally lower than for the integer-pel
part, so it completes faster even if the additional overhead of calculating the half-pel pixels is taken into account. In any case,
the cycles needed by each part depend on the type of search strategies deployed at the integer-pel and fractional-pel levels
and will be dependent on the algorithm. An optimal solution will balance the two parts to avoid having one part idle while
the other one is busy.
V. DEVELOPMENT AND ANALYSIS TOOLS
To facilitate the access to the hardware without needing specific knowledge of the microarchitecture, a high level C-like
language called EstimoC and a compiler have been developed. The EstimoC language is aimed at designing a broad range of
block-matching algorithms. The code can be developed and compiled in an integrated development environment (IDE) for
motion estimation. The language contains a preprocessor for macro facilities that include conditional (if) and loop (for,
while, do) statements. The language also has facilities directly related to the motion estimation processor instruction set
(ISA) such as checking the SAD of a pattern consisting of a set of points, and conditional branching depending on which
point from the last pattern check command had the best SAD. The target ISA for the compiler is simple and it is formed by a
total of 8 instructions as illustrated in Fig.8. There are two arithmetic instructions (opcodes 0,1), three jump instructions
(opcodes 2,3,4) and three compare instructions (opcodes 5,6 ,7).
The compiler converts the program to assembly and then to a program memory file containing instructions and a point
memory file containing patterns that can then be written directly to the firmware memories of the hardware core. Fig. 8
shows an example block-matching algorithm written in EstimoC and excerpts from the target files.
The algorithm is a diamond search pattern executed for up to five times for a radius of 8, 4, 2 and 1 pixels, followed by a
small full search at fractional pixel level. The first set of check() and update() commands create the first search pattern,
which consists of five points. Each check() command adds a point to the search pattern being constructed and the update()
command completes the pattern. This pattern is compiled into the instruction at program address 00, which uses the five
points available in the point memory at addresses 00–04. The preprocessor goes through the do while loop three times, with s
taking the values 4, 2 and 1. Each time, a four-point pattern is checked for up to five times. The # if (WINID == 0) #break;
syntax ensures that if a pattern search does not improve the motion vector estimate, it is not repeated. The final lines create a
25-point fractional pattern search. Fig. 10 provides two further examples of how the how the EstimoC language can be used
to implement arbitrary block-matching motion estimation algorithms. The full-search example in Fig.10 corresponds to the
classical exhaustive search algorithm and it is included as a reference. It is not recommended due to its high computing
requirements. Notice that the algorithm has an initial check() command followed by the full-search loop itself. This is done
to be able to test all the motion vector candidates first and then conduct the full-search only around the winner. The hardware
will execute the first instruction once per motion vector candidate available. The compiled machine code shown below the
source code contains only two instructions: one for the first check() and one for the loop with a total of 225 search points
(15x15). Notice the parallelism is expressed without the need for any specific constructs such as PAR normally used in
hardware extensions of general purpose programming languages. The compiler recognizes all the search points that can be
computed in parallel and combines them in a single instruction with the help of the update command that tells the compiler
to stop looking for further checks to combine in a single instruction. The hardware will execute each instruction in a
different number of interactions depending on the available number of execution units since each execution unit can process
a check point in parallel.
The UMH example in Fig.10 contains sections from the source code corresponding to the sophisticated UMH (Uneven
Multihexagon Search) recommended in the H.264 standard. The source code starts with the definition of the search patterns
so that they can be called directly using the check command simplifying the program. The UMH algorithm performs
different pattern checks and uses a cost function (SAD or Lagrangian optimized SAD) to decide the type of patterns to
follow. For example, after the first check either a large or small cross can be used depending on the cost function. The COST
keyword is used by the compiler to generate the appropriate compare instructions followed by conditional jump instructions.
Finally, after the integer-pel search completes, a fractional refinement is conducted around the winner using two half-pel
diamond interactions and a final quarter-pel square pattern. The binary for UMH contains a total of 27 instructions and has
not been included for simplicity reasons. These examples illustrate the multiple programming options are available for
block-matching motion estimation algorithms and the difficulties of programming these algorithms without appropriate
compiler and tool support. On the other hand, the simple programming model and ISA of the proposed processor mean that
no overheads are introduced by using the compiler instead of writing assembly or machine code directly.
Once the program has been compiled into machine code, it can be complicated and time consuming to use the actual
motion estimation hardware for performing the analysis and configuration. The tasks required include synthesizing and
implementing a processor with some specific configuration in the FPGA, which will take around one hour per configuration
due to place&route running times. A cycle-accurate simulator can speed up the development cycle significantly by reducing
the number of tasks required to perform the analysis of a particular configuration. Additionally, the designer does not need
access to the actual hardware when using the simulator. A cycle-accurate model of the processor was developed as part of
the toolset. x264, a free software library for encoding H.264, was modified to use the cycle-accurate model when searching
for the motion vectors during motion estimation. The cycle-accurate simulator can be used directly from the developed IDE.
The designer can design a motion estimation algorithm and test it using different processor configuration parameters. The
simulator takes several parameters as inputs. The inputs which determine the processor configuration are: the program and
point memories generated by the Estimo compiler, the number of integer and fractional execution units, the minimum size
for block partitioning, whether to use motion vector cost optimization, and whether to use multiple motion vector candidates.
The simulator takes other options which do not affect the processor configuration itself, which are: the video file to process
and its resolution, the maximum number of frames to process, the quantization parameter (QP), and the number of reference
frames to use. The simulator will then process the video file using the selected search algorithm and processor configuration.
The output of the simulator includes: the bit rate of the compressed video, the peak signal-to-noise ration (PSNR), the
number of frames that can be processed per second, the number of clock cycles required per macroblock, and the energy
requirements per macroblock. The designer can also generate plots of the results until he or she is satisfied with a particular
configuration and algorithm. At this point, the tools can be used to generate a VHDL configuration file which can be used
together with the rest of the hardware library to implement the motion estimation processor using standard FPGA synthesis
and place&route tools. Fig.11 shows the overall design flow from algorithm conception to processor implementation.
VI. MULTI-STANDARD HARDWARE EXTENSIONS
The previous sections have concentrated on the hardware and software support for motion estimation in H.264, arguably the
most popular advanced video coding standard. The two additional codecs which have been considered in this work include
VC-1, the standard based on WMV-9 developed by Microsoft, and AVS the Chinese video coding Standard. Together with
H.264 these two codecs are state-of-the-art standards being deployed across a wide range of applications such as high
definition broadcast television, optical disc storage (Blu-ray) and broadband video streaming over the internet. They are all
transform and block-based codecs and share a lot of similarities. They process pixel groups in blocks of varying sizes in two
main modes: intra-mode in which a frame is compressed independently of other frames, and inter-mode in which temporal
redundancies among frames are exploited using motion estimation and compensation techniques. The transform used is
DCT-based and it is applied either to the pixels themselves during intra coding or to the residuals resulting from the motion
estimation phase during inter coding. The final stage is the entropy coding stage, which is generally based on variable-length
codes (VLC) although H.264 can use arithmetic coding in the form of CABAC. The motion estimation is critical from the
performance and complexity point of view for all of them. Since these standards do not specify the motion estimation
technique deployed, the video engineer is free to develop new algorithms trading speed and quality as long as the generated
output bitstream can be handled by the standard decoders. These features are well-suited to a configurable and
programmable processor such as the one presented in the previous sections, which can exploit the similarities between the
different standards to support the standards with few hardware changes.
A. Motion estimation differences in the three standards.
All the three codecs can operate at a maximum resolution of quarter-pel and support block sizes from 16×16 down to 4×4 for
AVS (4×4 available in AVS-M for mobile multimedia applications) and H.264 standards and down to 8×8 for the VC-1
standard. From the motion estimation point of view, the most significant differences exist in how the half-pel and quarter pel
pixels are calculated. H.264 uses a six-tap filter for the half-pel pixels and simple averaging for the quarter-pel pixels, as
seen in the previous sections. On the other hand VC-1 and AVS use four-tap filters for half- and quarter-pel interpolation
with different coefficients. This is summarized in Table 1 that also shows the dividing factors used to scale the result of the
filter operation. The table also shows that VC-1 uses a special case to handle ¾ interpolation, which is a special case for
quarter-pel interpolation and corresponds to pixels that are located as shown in Fig. 12. In this case, although the coefficient
values are equivalent, they are applied to opposite input pixels. Also from the table it can be seen that all the coefficient
multiplications can be obtained with simple shift and adds, except for the multiplication by 53 that will need a multiplier
structure. The complexity cost of this multiplier will be lower than a full multiplier since the multiplicand are fixed so it can
be implemented with logic. VC-1 also includes an optional lower complexity mode that uses a simple two-tap bilinear filter
for the ¼ positions which has not been included in the table.
To add support for these two new standards in our processor, the half-pel and quarter-pel processing functions need to be
modified to accommodate these new coefficients. Half-pel interpolation is done exhaustively over the 20×20 surrounding the
winning integer-pel motion vector. In H.264 this is done for performance reasons to be able to efficiently process the
calculation of the quarter-pel pixels that need diagonal half-pel pixels. Notice that to obtain diagonal pixels, we need to
obtain horizontal half-pel pixels first. Performance will be negatively affected since the processor will need first to obtain the
half-pel horizontal pixels, then the half-pel diagonal pixels and finally the quarter-pel pixels, interrupting the pipeline while
this extra processing is taking place. This will be even more challenging for VC-1 and AVS that need not just two, but four
half-pel pixels to obtain the correct value of the quarter-pel pixels. In the H.264 solution, all half-pel pixels have been
already calculated and can be read in parallel to obtain the quarter-pel pixels, maintaining the pipeline fully utilized. The
following subsections describe how the half-pel and quarter-pel processing functions have been extended to support VC-1
and AVS.
B. Half-pel pixel processing extensions.
Half-pels are calculated in eight systolic processors that contain six PEs each. To support VC-1 and AVS is straight-forward
since the number of coefficients involved is only four, so two PEs are set to zero to disable them when they are not needed.
Data still flows through the disabled PEs, so the control logic that reads integer pixels from memory and inputs them into the
processing elements does not need to be modified. Fig.13 shows the simplified diagram of the systolic processor. We have
opted for a simply systolic structure with a global adder since it obtains the required throughput and does not become a
bottleneck in the processor. An output register temporarily buffers eight output pixels before they are written to the half-pel
memories. The performance of the half-pel processing unit is equivalent in all three possible modes. The complexity of the
systolic processor in Fig.13 rises from 383 logic cells to 450 logic cells when it is configured in multi-standard mode, which
is a modest increase.
C. Quarter-pel pixel processing extensions.
Compared with the half-pel extensions, the quarter-pel processing unit needs more attention. The first challenge is that both
VC-1 and AVS use a similar four-tap filter instead of the two-tap averaging filter used in H.264. This means that four half or
integer pixels are needed to obtain each quarter-pel pixel. In H.264 mode the data is obtained from accessing two of the
half-pel memories, aligning as required by the motion vector, and averaging. To guarantee that eight valid pixels are
available after the alignment, two 64-bit memory locations are read from each memory using dual-port memories (two reads
and one write). The alignment units then select the right eight bytes out of the 16 bytes read. In VC-1 and AVS, the number
of ports needed to get the four eight-byte vectors is doubled. This means that the number of BRAMs used to store the half-
pel pixels must be doubled and the number of vector alignment units must also be increased from two to four. This will
double the logic needed to support fractional-pel searches in VC-1 and AVS. Instead of doubling the logic, we opt for
halving the performance of the fractional-pel pipelines when operating in VC-1 or AVS modes. The memories maintain the
number of ports as in H.264 but are read twice to obtain two eight-byte aligned vectors each time. To perform the filtering
itself we have designed a quarter-pel interpolator as shown in Fig.14. Similarities exist among the shift and add operations
needed to obtain the interpolated pixels for the three standards. These similarities can be exploited to reduce the complexity
of this unit as sown in Fig.14. The multiplexers are controlled by signals defined by the active standard. Pipelines a and b in
Fig.14 process two eight-pixel vectors. Operator sharing among the standards means that the additional logic needed to
support VC-1 and AVS is reduced. The limitation in fractional-pel performance in VC-1 and AVS modes is only a problem
if the fractional-pel searches take longer than the integer-pel searches, which is not generally the case. If this is the case, the
core can be configured with more fractional-pel execution units. The complexity of the quarter-pel pixel processing block in
H.264 mode is approximately 82 logic cells, since simple pixel averaging is involved. This value increases to 508 logic cells
in multi-standard mode since more complex operations and a more complex state machine are needed to cycle the logic
twice and interpolate four half- and full-pel pixels.
VI.HARDWARE PERFORMANCE EVALUATION AND IMPLEMENTATION
For the implementation, we have selected the Virtex-4 SX35 device included in the ML402 development platform. This
device offers a medium level of density inside the Virtex-4 family and can be considered main stream being fabricated using
90 nm CMOS technology. The results of implementing the processor with different numbers and types of execution units are
illustrated in Table 2. The basic configuration is relatively small using 7.4% of the available logic resources and 21 memory
blocks. The size of the block RAMS in Virtex-4 devices enables two reference search areas of size 112×128 to fit in this
memory as previously described. Each new execution unit adds around 1600 logic cells and 17 embedded memory blocks to
the complexity. The fractional and integer execution units have been carefully pipelined and all the configurations can
achieve a clock rate of 200 MHz. To obtain a performance value in terms of macroblocks per second performance is not as
straight-forward as in the full search case that always computes the same number of SADs for each macroblock. In this case,
the amount of motion in the video sequence, the type of algorithm and the hardware implementation affect the number of
macroblocks per second that the engine can process. Equation (1) can be used to estimate the cycle count needed to process
a single 16×16 partition and a single reference frame. The variables in the equation are ppp (search points per pattern), eu
(execution units available), ppm (average search points per macroblock).
(1)
The equation takes into account that the SAD calculations for each macroblock take 32 cycles (32 SADs of eight bytes
each) plus one cycle of memory read/write overheads and motion vector decision. There is also an 11-cycle overhead
representing the time needed to empty the integer pipeline before the best motion vector can be found in each pattern
iteration and the next pattern started from the current winning position. Also, the current microarchitecture always uses 33
cycles per search point and it does not try to terminate the SAD calculations earlier if the current cost becomes larger than
the cost obtained during a previous calculation. There are two reasons why this optimization based on monitoring the SAD
during the search point calculation has not been used: firstly, since the core uses multiple execution units it is very important
that all the execution units are maintained in synchrony so that a single control unit can issue the same control signals to all
the execution units. Execution units terminating at different clock cycles will invalidate this requirement. Secondly, the
detection of the cost has to be done at the bottom of the pipeline after the addition of the motion vector cost to the
accumulated SAD value. This means that all the SAD calculations already in flight in the pipeline must be invalidated,
introducing a bubble in the pipeline with a cost of 11 clock cycles. Unless the early detection saves more than 11 clock
cycles (which is not the case in our experiments, since the cost of nearby points tend to be close in value) there will be no
performance gains obtained from this technique but a considerable overhead in terms of logic. In order to use equation (1)
we need to estimate the average numbers of points searched per macroblock. These values have been measured using the
x264 software implementation configured with the diamond, hexagon and UMH algorithms, using the previous high
definition sequences with varying degrees of motion complexity. The values measured indicate that 15 points are tested for
the diamond search algorithm, 22 for the hexagon search algorithm, and 70 for the UMH search algorithm, for the 16×16
partition case, and these values increase by a factor of 1.5 if the 8×8 partitions are considered as well. The hardware can
overlap the fractional-pel search with the integer-pel search working on different macroblocks in parallel, so as long as the
fractional-pel completes faster than the integer-part, it will not affect the performance of the core. In order to validate this
statement we have further analyze the number of search points evaluated in x.264 algorithm. From the three integer-pel
algorithms available in x.264 it is obvious that the simple diamond search is the one that with a higher probability of
completing before the fractional search part. We must also note that x.264 only offers the diamond search for the fractional
refinement consisting of up to 2 half-pel followed by up to 2 quarter-pel interactions. Table 3 shows the effects in search
complexity and bit rate reduction of adding the fractional-pel refinement to the integer diamond search for the three high-
definition videos tested. Table 3 shows that the number of search points tested when both half-pel and quarter-pel is roughly
equivalent to the full-search part for the Sunflower and Tractor video sequences and smaller for the Pedestrian area
sequence. It also shows that the reduction in bit rate thanks to the fractional-pel search is very important. With this data we
can reasonably assume that the fractional-pel search will complete in approximately the same amount of time as the integer-
pel search and that it can be performed in parallel without increasing the total number of clock cycles. Table 4 shows the
performance figures obtained for the three motion estimation algorithms analyzed as a function of the number of integer-pel
execution units. The table shows the macroblocks per second for different configurations and also the largest video format
that each configuration can support. Table 4 only shows the optimal hardware configuration for each algorithm, taking into
account that, for example, a diamond search pattern does not benefit from more than four IPEUs and that a configuration
with three IPEUs will need the same number of cycles as for the two IPEUs case. The reason for this is that while the first
iteration will use all three IPEUs, a second iteration will still be required to complete the pattern instruction, when only one
IPEU will be used. Finally, Table 5 compares the performance and complexity figures of the base configuration of our
processor against the ASIP cores proposed in [13] and [14] in terms of performance complexity. The figures measured in the
general purpose P4 processor with all assembly optimizations enabled are also presented as a reference although the power
consumption and cost of this general purpose processor are not suitable for the embedded applications this works targets.
These types of comparisons are difficult since the features of each implementation vary. For example, our base configuration
does not support fractional pel searches and the addition of the interpolator and fractional pel execution unit in parallel with
the integer pel execution unit increases complexity by a factor of three. The core presented in [14] does support fractional pel
searches, although with a non-standard interpolator, and both searches must run sequentially. Overall, Table 5 shows that our
core offers a similar level of integer performance in terms of clock cycles for the diamond search algorithm to the ASIP
developed in [14] with one execution unit, and performance almost doubles if the configuration instantiates two execution
units as shown in the last row. For these experiments, our core was retargeted to a Virtex-II device to obtain a fair
comparison, since this is the technology used in [13] and [14]. The pipeline of the proposed solution can clock at double the
frequency as shown in the table, and this helps to justify why our solution with a single execution unit can support 1080p HD
formats while the solution presented in [13] is limited to 720p HD formats. The measurements of cycles per macroblock
were obtained processing the same CIF sequences as used in [14].
The diamond search used in this experiment corresponds to the implementation available in x264 that includes up to 8
diamond interactions followed by a square refinement using a single reference frame and a single macroblock size (16×16).
It is also noticeable that 287 cycles per macroblock will generate a throughput of almost 2000 CIF frames per second,
enabling the addition of extra reference frames and sub-partitions while still operating in real-time.
VII. CONCLUSIONS
The proposed processor exploits both reconfigurability and programmability in an innovative way to support motion
estimation in three state-of-the-art video coding standards, seamlessly trading complexity, throughput and quality of results,
and matching the architecture to the workload requirements. The processor has been named LiquidMotion to reflect its
adaptability and the toolset, which enables the designer to create new algorithms and hardware processors without specific
knowledge of the microarchitecture, is available for download at http://sharpeye.borelspace.com/. Support for older
standards such as MPEG2 can easily be added, since in this case, sub-pixel resolution is limited to half-pel using an
averaging filter. The advanced features available in these codecs are supported, including a parallel Lagrangian optimization
hardware unit which yields 10% lower bit rates for the same quality without affecting throughput. The research also shows
the importance of using relatively large search windows at 128×128 pixels for high definition video samples although further
increases of the search window do not bring noticeable benefits. The multi-standard hardware extensions targeting VC-1 and
AVS increase the flexibility of the processor with a relatively small hardware cost. Future work involves adding the core as
part of a dynamically reconfigurable multiFPGA array targeting multimedia applications.
[1] Ostermann, J., Bormans, J., List, P., Marpe, D., Narroschke, M., Pereira, F., Stockhammer, T. and Wedi, T., “Video coding with H.264/AVC: tools,
performance and complexity”. IEEE Circuits Syst. Mag. v4. pp. 7-28.
[2] Sridhar Srinivasan, et.all., “Windows Media Video 9: overview and applications”, EURASIP Signal Proc: Image Communication, 19 (2004) pp. 851-
875.
[3] L. Yu et al. “An Overview of AVS-Video: tools, performance and complexity”, Visual Communications and Image Processing 2005, Proc. of SPIE,
vol. 5960, pp.596021, July 31, 2006.
[4] Nunez-Yanez, J.L.; Hung, E.; Chouliaras, V., 'A configurable and programmable motion estimation processor for the H.264 video codec,' FPL 2008.
International Conference on , vol., no., pp.149-154, 8-10 Sept. 2008
[5] Huang, Y.-W., Wang, T.-C., Hsieh, B.-Y., Chen L.-G. “Hardware Architecture Design for Variable Block Size Motion Estimation in MPEG-4
AVC/JVT/ITU-T H.264”. ISCAS. May 2003.
[6] Ching-Yeh Chen; Shao-Yi Chien; Yu-Wen Huang; Tung-Chien Chen; Tu-Chih Wang; Liang-Gee Chen, "Analysis and architecture design of variable
block-size motion estimation for H.264/AVC", IEEE TCSVT, vol.53, no.3, pp.578-593, March 2006
[7] Yap, S.Y.; Mccanny, J.V., ‘A VLSI architecture for advanced video coding motion estimation’, ASAP, pp. 293-301, 24-26 June 2003
[8] Chao-Yung Kao and Youn-Long Lin, “An AMBA-Compliant Motion Estimator For H.264 Advanced Video Coding” IEEE International SOC
Conference (ISOCC), Seoul, Korea, October 2004
[9] Brian M. Li , Philip H. Leong, “Serial and Parallel FPGA-based Variable Block Size Motion Estimation Processors”, Journal of Signal Processing
Systems, Vol. 51 , No. 1, pp. 77-98 April 2008
[10] Theepan Moorthy, Andy Ye: A scalable computing and memory architecture for variable block size motion estimation on Field-Programmable Gate
Arrays. FPL 2008: 83-88
[11] Yu-Wen Huang, Ching-Yeh Chen, Chen-Han Tsai, Chun-Fu Shen, Liang-Gee Chen, “Survey on Block Matching Motion Estimation Algorithms and
Architectures with New Results”, The Journal of VLSI Signal Processing, Vol. 42, No. 3. (March 2006), pp. 297-320.
[12] Sheu-Chih Cheng; Hsueh-Min Hang, "A comparison of block-matching algorithms mapped to systolic-array implementation," IEEE TCSVT, IEEE
Transactions on , vol.7, no.5, pp.741-757, Oct 1997
[13] T. Dias , S. Momcilovic , N. Roma , L. Sousa, “Adaptive motion estimation processor for autonomous video devices”, EURASIP Journal on
Embedded Systems, v.2007 n.1, pp.41-41, January 2007
[14] Babionitakis, Konstantinos1, et al., “A real-time motion estimation FPGA architecture”, Journal of Real-Time Image Processing, Volume 3, Numbers
1-2, March 2008 , pp. 3-20(18)
[15] Information available at http://www.xilinx.com/ products/ipcenter/DO-DI-H264-ME.htm
[16] S. Saponara, K. Denolf, G. Lafruit, C. Blanch, and J. Bormans, “Performance and complexity co-evaluation of the advanced video coding standard for
cost-effective multimedia communications,” EURASIP J. Appl. Signal. Process., no. 2, Feb. 2004, pp. 220-235.
[17] 1080p HD sequences obtained from http://nsl.cs.sfu.ca/wiki/index.php/Video_Library_and_Tools#HD_Sequences_from_CBC
[18] Information available at http://www.videolan.org/developers/x264.html
Fig. 1. Search range analysis for sunflower sequence
Fig. 2. Search range analysis for pedestrian area sequence
Fig. 3. Search range analysis for tractor sequence
Fig. 4. Sub-partitions analysis for pedestrian area sequence
Fig. 5. Sub-partitions analysis for sunflower sequence
Fig. 6. Sub-partitions analysis for crowdrun sequence
Pointmemory
PhysicalAddress
Calculator
Programmemory
Instruction fetch,decode and issue Line offset
Next rmaddress
FP pipelineControl
+1 +1 +1
referencememory
Vectoralignment
referencememory
VectoralignmentCurrent
macroblockmemory
Pointmemory
Pointmemory
Pointmemory
PhysicalAddress
Calculator
PhysicalAddress
Calculator
PhysicalAddress
Calculator
SAD
Adder tree
SAD
Adder tree
FP pipelineControl
FP pipelineControl
referencememory
Vectoralignment
Adder tree
referencememory
Vectoralignment
Adder tree
Motion vector decision
+ + + +
GP Register File (motionvector candidates and
results)
Motion vector candidate
Best motion vectorBest SAD
HP in
terp
olato
r Sys
tolic
Pro
cess
or
HP in
terp
olato
r Sys
tolic
Pro
cess
or
HP in
terp
olato
r Sys
tolic
Pro
cess
or
HP in
terp
olato
r Sys
tolic
Pro
cess
or
HP in
terp
olato
r Sys
tolic
Pro
cess
or
HP in
terp
olato
r Sys
tolic
Pro
cess
or
HP in
terp
olato
r Sys
tolic
Pro
cess
or
HP in
terp
olato
r Sys
tolic
Pro
cess
or
HorizontalHP pixels
Vertical HPpixels
DiagonalHP pixels
Original FPpixels
QP pipelinecontrol
Currentmacroblock
memory
Adder tree
Vectoralignment
Quarter pel interpolation
Best motion vectorBest COST
16 16 16 16
16
126Next rmaddressLine offset 126 Next rm
address12Line offset
6 Next rmaddress12
Line offset6
16161616
rmaddress 9 rm
address 9
6464646464
64
646464
64
6464
16
1616
16
16 16
rmaddress
9
64
64
64
64 64 64
88888888
64 64 64 64 64 64
64 64
64
6464
16
16
32
8
8 24
1616
Unalignedpixel data
Unalignedpixel data
Unalignedpixel data
Alignedpixel data
Alignedpixel data
Alignedpixel data
Current SAD
Current SAD
Current cost
Current SAD
Unalignedpixel data
Alignedpixel data
Alignedpixel data
HP interpolationdata
HP data
QP data
Current SAD
Control signals
Control signals
Control signals
Controlsignals
Instructionaddress
Instructions
HP Pixel processing
Fractional pelpipeline
Main integer pelpipeline
Auxiliary integerpel pipelines
64
SADSAD
SAD
Adder
MVCOST
Quantization
MVPMV
Adder Adder Adder
Quantization
MVCOST
Adder
Quantization
MVPMV
6464
64 64
Vectoralignment
64
QP pixelprocessing
Fig. 7. Processor microarchitecture with a total of six execution units
24
Fig. 8. Instruction set architecture ISA
Integer pattern instruction
0010 Winner field immediate8
Unconditional jump to label
0011 immediate8
15 8 7
unused
Conditional jump to label (if winner field = winner id the jump to inmediate8)
winner id = 0 no winner in pattern else ids the winner execution unit
0100 unused immediate8
Compare (if less than set condition bit)
Compare (if greater than set condition bit)
Compare (if equal set condition bit)
0000 Pattern address Number of points
16
0101 reg immediate14
0110 reg
0111 reg
Op code Field A Field B
Fractional pattern instruction
0001 Pattern address Number of points
716 15 8
Conditional jump to label (if condition bit set jump to label)
immediate14
immediate14
1315
25
S = 8; // Initial step size
check(0, 0);check(0, S);check(0, -S);check(S, 0);check(-S, 0);update;
do{ S = S / 2; for(i = 0 to 4 step 1) { check(0, S); check(0, -S); check(S, 0); check(-S, 0); update; #if( WINID == 0 ) #break; }} while( S > 1);
for(i = -0.5 to 0.5 step 0.25) for(j = -0.5 to 0.5 step 0.25) check(i, j);update;
0 0 05 00 chk NumPoints: 5 startAddr: 01 0 04 05 chk NumPoints: 4 startAddr: 52 2 00 0B chkjmp WIN: 0 goto: 113 0 04 05 chk NumPoints: 4 startAddr: 5 ……………….11 0 04 0A chk NumPoints: 4 startAddr: 912 2 00 15 chkjmp WIN: 0 goto: 21 ………………..21 0 04 0D chk NumPoints: 4 startAddr: 1322 2 00 1F chkjmp WIN: 0 goto: 31 ……………….31 1 19 11 chkfr NumPoints: 25 startAddr: 17
Integer check pattern instruction
Conditional jump instruction
Fractional check pattern instruction
Fig. 9. Programming example
26
/* UMH algorithm
*/
Pattern(diamondhp){ check(0,0.5) check(0,-0.5) check(0.5,0) check(-0.5,0)}Pattern(diamondqp){ check(0,0.25) check(0,-0.25) check(0.25,0) check(-0.25,0)}
/*other pattern definitions*/
check(0,0);update;check(diamond);
#if (COST > 2000){ //large cross // horizontal cross for(i = -17 to 17 step 2) { check(i,0); } // vertical cross for(i = -7 to 7 step 2) { check(0,i); } update;}
//hp refinement
for(loop = 1 to 2 step 1){ check(diamondhp); #if( WINID == 0 ) #break;}
//qp refinement
check(squareqp);
//End
#else{ //small cross // horizontal cross for(i = -5 to 5 step 2) { check(i,0); } // vertical cross for(i = -3 to 3 step 2) { check(0,i); } update;}
/* large hexagon, small full search, hexagon and final square refinement as in UMH*/
Fig. 10. Programming example for the full-search and UMH algorithms.
//full search example
//first pointcheck(0,0);update;
//full searchfor(i = -7 to 7 step 1) for(j = -7 to 7 step 1) check(i, j);update;
0 0 01 00 chk NumPoints: 1 startAddr: 0 1 0 E1 01 chk NumPoints: 225 startAddr: 1
27
R TLC om ponent Lib rary
H igh level M Ealgorithm code SharpEye
C om pile r
Assem bly codeSharpEyeAssem bler/L inker
P rogram B inary Point B inary
Cyc le A ccurateS im ula tor/
C onfiguratorConstra in tsenergy/
throughput/quality/area
C onstra in tsm et?
Num ber and type offunctional un its
(In teger and fractionalpel, Lagrangian,
m otion vecto rcandida tes, e tc)
P rocessorb its tream
N oYes
G eneratecon figura tion R TL
file
StandardSynthes is/
P lace&routeFPG A too ls
N ew H ardwarecon figuration
N ew M E algorithm
Fig. 11. Processor programming and configuration workflow.
28
Table 1. Fractional-pel interpolation filters coefficients
29
Fig. 12. Fractional pixels locations
PE1 PE2 PE3 PE4 PE5 PE6
Pixel IN
Mode
Interpolated Pixel OUT
Add and shift
13 13 13 13 13
8
8 8 8 8 8
13
Fig. 13. Half-pel systolic processor unit
30
Virtex - 4 SX35
Configuration LUTs used/LUTs available
Memory blocks used/Memoryblocks available/Minimummemory bits
Criticalpath (ns)Logiclevels
1 IPEU/ 0FPEU
2259/30720 (7.4%) 21/192 (10%)/95 Kbits 4.976/8
2 IPEU/ 0FPEU
3805 /30270 (12.6%) 38/192 (19%)/179 Kbits 5.040/8
3 IPEU/ 0FPEU
5571/30270 (18.4 %) 55/192 (28%)/263 Kbits 5.032/7
1 IPEU/ 1FPEU
9143/30270 (30.2%) 31/192 (16%)/95+42 Kbits 4.986/6
2 IPEU/ 1FPEU
10985/30270 (36.2%) 48/192(39%)/179+84 Kbits 4.996/9
Table 2. Processor complexity in H.264 mode
31
HP pixels a HP pixels b
1(a) 1(b)
-4 (a) 53 (b)
18(a) -3 (b)
-3(b) 18(a)
53(b) -4(a)
1 (a) 3 (b)
3 (b) 1 (a)
<< 4X
+
+
QP pixels
6464
H.264
VC1
AVS
53<< 1
+x -1
<< 1
+
AVS or H.264 (x1)
VC-1VC-1
AVS
64
64
VC-1
VC1 (x18)
96
72
104
104
104
64
VC-1 (x53)
112
72
80 AVS (x3)
104x-1
VC1 (x-3)
H.264 (x1)
112
112 80
112
113
113
>>
113
VC-1 (x-4)
coefficient (pipeline)standard
Fig. 14. Quarter-pel processor unit
32
1080p video sequence
Type of fractional- pel search done ( average search points per macroblock
/ bit rate reduction %) None Half - pel only Half - pel and
Quarter-pel FP
Search Points
Bit Rate Reduction
HP Search Points
Bit Rate Reduction
HP&QP Search Points
Bit Rate Reduction
Pedestrian area
15.7 0% 5.4 6% 9.3 12%
Sunflower 11.3 0% 5.0 5% 9.1 11.5%
Tractor 11.1 0% 6.2 16% 10.2 25%
Table 3. Evaluation of fractional-pel search
33
Nu Number of Integer Execution units implemented
Throughput in macroblocks per second (16x16, diamond, 200 MHz, 4 ppp)
Throughput in Macroblocks per second (16x16, 8x8, diamond, 200 MHz,4ppp)
Throughput in macroblocks per second (16x16, hexagon, 200 MHz, 6 ppp)
Throughput in Macroblocks per second (16x16, 8x8, hexagon, 200 MHz, 6ppp)
Throughput in Macroblocks per second (16x16 UMH 200 MHz, 16 ppp)
Throughput in Macroblocks per second (16x16, 8x8 UMH 200 MHz, 16 ppp)
1 372,960 (1080p@30)
233,918 (720p@50)
260,983 (1080p@30)
173,988 (720p@30)
84,813
56,542
2 692,640 (1080p@50)
461,760 (1080p@50)
495,867 (1080p@50)
330,578 (1080p@30)
166,233 (720p@30)
110,822
3
708,382 (1080p@50)
472,255 (1080p@50)
4 1,212,121 (1080p@50)
808,080 (1080p@50)
319,680 (1080p@30)
213,120 (720p@50)
6
1,239,669 (1080p@50)
826,446 (1080p@50)
8
593,692 (1080p@50)
395,794 (720p@30)
16
1,038,961 (1080p@50)
692,640 (1080p@50)
Table 4. Performance analysis of the configurable processor
34
FPGA clock Memory
(MHz, Virtex-II) (BRAMS)
Intel P4 assembly ~3,000 N/A N/A N/A
Dias et al. [13] 4,532 2,052 67 4(external reference area)
Babionitakis et al. [14] 660 2,127 50 11 (1 reference area of 48x48 pixels)
Proposed with one integer-pel execution unit
510 1,231 125 21 (2 reference areas of 112x128 pixels)
Proposed with two integer-pel execution units
287 2,051 125 38(2 reference areas of 112x128 pixels)
Processor implementation
Cycles per MB (Diamond search)
FPGA Complexity (Slices)
Table 5. Performance/complexity comparison