

White Paper: Xilinx System Generator

WP283 (v1.0) January 17, 2008

Using System Generator for Systematic HDL Design, Verification, and Validation

By: Justin Delva, Adrian Chirila-Rus, Ben Chan, Shay Seng

www.xilinx.com

Xilinx System Generator [Ref 1] is a MATLAB® Simulink® blockset that facilitates the design and targeting of Xilinx FPGAs. Within the MATLAB environment familiar to DSP designers, System Generator provides the ability to functionally simulate a design and use the MATLAB environment to verify the bit/cycle-true model against the golden reference results produced either externally or inside the MATLAB environment.

Within MATLAB, designers can both target a Xilinx FPGA hardware platform and verify the hardware output, making it easier for an algorithm developer to make the leap into hardware and a firmware developer to better grasp the algorithm. However, despite the appreciable design cycle reduction advantages, some design philosophies built around pure HDL are slow to benefit, primarily due to legacy HDL design methodologies, designers' reluctance to stray from their comfort zone, and a lack of familiarity with Simulink. The benefit being overlooked is that System Generator complements the HDL design task by providing an easy-to-configure test bench platform for both functional simulation and hardware verification.

© 2008 Xilinx, Inc. All rights reserved. XILINX, the Xilinx logo, and other designated brands included herein are trademarks of Xilinx, Inc. All other trademarks are the property of their respective owners.


Overview

About System Generator

The ability to model, simulate, and generate a netlist for a design from System Generator allows algorithm designers to make an easier transition to using hardware, and the model-based design methodology allows firmware developers to more easily grasp a new algorithm.

In addition, System Generator includes several features to satisfy users most familiar with a traditional HDL flow. Custom HDL code can be imported into a System Generator design and simulated within MATLAB using the System Generator interface to HDL simulators like Mentor® ModelSim® and the Xilinx ISE™ simulator.

A System Generator test bench is easily generated and can be used to test the custom HDL code running in real hardware without modification. The hardware co-simulation features use pre-supported FPGA platforms or user-provided platforms for performing either Simulink-controlled stepped-clock hardware-runs, or real-time data-burst runs.

Example Test Bench

The example presented in this discussion is a test bench methodology used in the design of the Xilinx H.264 Context Adaptive Binary Arithmetic Coder (CABAC) rev1.0 video encoder [Ref 2]. The purpose of the test bench is to provide an exhaustive and robust test environment for the CABAC rev1.0 video encoder core in three ways:

• Functional HDL simulation: MATLAB verification input and output test vectors are used to feed HDL simulators, for example, ModelSim.

• Functional hardware verification: An intermediate step to debug errors not discovered during the functional HDL simulations. This stage uses System Generator controlled single-step clocking as well as the same input and output test vectors of the functional HDL simulations.

• Real-time hardware verification: Using the input test vectors of the functional HDL simulations, the design is tested in hardware at the targeted input rate and clocking frequency. The output of the hardware is captured into MATLAB and compared with the output test vectors.

Before the test bench methodology is explored further, the sections that follow examine the design methodology and the framework of the CABAC HDL.


Systematic Design Flow

This article presents a systematic approach to using System Generator in HDL design, with the CABAC portion of the H.264 standard serving as the demonstration design.

H.264/AVC Encoder Standard

The H.264/AVC [Ref 2] encoder is the product of years of collaborative efforts resulting in a standard with good video quality at substantially lower bit rates than previous standards. Although most of the techniques used by the H.264 standard (for example, integer transform, de-blocking filter, and motion estimation) are well detailed in literature, the concept of starting the design work from scratch seems daunting.

Fortunately, a reference C source code (what most developers seek to put into hardware for real-time applications) named H.264/AVC Joint Model (JM) is available to developers. However, in the general context of modern multimedia and communication applications, the direct translation from a high-level functional model, such as C or C++ to Register Transfer Level (RTL) is almost impossible, mainly due to vulnerability to simple translation errors during the subsequent transformation phases and the lack of a modular design flow.

Design Flow Process

A systematic approach using System Generator in HDL design is presented in this article. Refining the design process through the different abstraction levels defined by the design flow helps the designer focus on the issues related to each design step and gradually evolve toward an efficient implementation.

Additionally, this design approach shortens design time by favoring design reuse and allowing structured verification and fast prototyping. The design steps include the following:

• Identify and perform a functional partitioning of the design.

• For each functional partition of step 1, locate and capture (within the reference model) the input data to the partition and use it as input stimulus for the HDL and hardware target. In addition, capture the output of each functional block for comparison with the design's output.

• Define how each functional block communicates with other blocks and select appropriate communication primitives [Ref 3].

• For a given functional partition, establish a single test bed from which the HDL and its subsequent hardware porting are verified against the functional reference model.

• Perform the HDL and hardware implementation of each module resulting from the partitioning in step 1 and verify against its functional model.

• Finally, all blocks are integrated and the final implementation is verified against the model.

Identifying Functional Modules

Starting from a reference specification provided by a standards body (like the H.264/AVC) or by a research group of the company, memory optimizations and algorithmic tuning are performed first to counter data bottlenecks and FPGA real estate constraints. The handling and identification of bottlenecks (or any other implementation-limiting issues) require one of the following: the designer must efficiently use available FPGA resources, or have the capability to propose a reasonable sub-optimal solution.

In the latter case, a reference model must still be generated (either using C/C++, JAVA, MATLAB, etc.) to aid the designer in the verification process and to convince the algorithm developer of the validity of the new hardware-inspired solution.

The availability of a reference model does not relieve the designer of the need to understand the algorithm being implemented. It is simply a tool the designer relies on to validate the engineering and algorithmic decisions made during the process. With the golden reference model, a suitable architecture is defined and translated into an HDL description.

The Xilinx H.264 CABAC [Ref 2] encoder uses this approach to both map the design into hardware and to exhaustively verify the results. Initially, as in any complex design approach, the major blocks must be identified to separate the design into smaller manageable parts.

This separation must be performed so that each module has a consistent functional purpose in the overall system and can also be separately designed and verified. In addition, an individual design module may be separated into sub-components, as long as the sub-components total the overall function of the design module. On completion, the sub-components are assembled and then further tested. Figure 1 illustrates the three major design modules of the CABAC core.

• Main State Machine (SM). The input interface of the core; identifies and produces certain Syntax Elements (SE) to encode and performs the run-level computation on the integer transform coefficients.

• SE PreProcessor. Converts the output of the Main SM to binary form and performs several types of evaluation and preparation of SEs prior to encoding.

• Arithmetic Encoder (AC) Engine. Takes the binary SEs and encodes each bit sequentially. To achieve real-time encoding, FPGA resources are effectively used to achieve a single clock cycle encoding for each bit.

Communicating Between Functional Modules

One of the challenges associated with segmenting designs into different functional modules is the ability to easily re-integrate them. Various teams and individuals have different ideas about how their modules communicate externally, but one assumption/observation can be made to generalize the approach for designing complex video processing and communications cores: the vast majority of today's complex video processing and communication algorithms work on both a sample level as well as on a group of samples (for example, block, macro-block, slice, frame, packets, and so forth).

Fundamentally, this assumption allows us to represent any system as a network of multiple processing elements inter-linked by different communication elements. The underlying theoretical cyclo-static dataflow modeling is described in Rapid System Prototyping (RSP), pp. 246-248 (June 2005) [Ref 3]. Because the data structures on which each processing element operates are clearly defined, a set of communication methods can be determined and fixed.

Figure 1: Functional Modules of the CABAC Core (Main State Machine (SM), Syntax Elements (SE) PreProcessor, Arithmetic Encoder (AC) Engine)

The communication methods have a practical (HDL) implementation named communication primitives, and although the types of communication primitives are limited, they are sufficient to represent any communication in the system. Detailed information about the communication primitives is provided in [Ref 3].

Communication Primitives

Based on general design practices, the set of communication primitives can be classified into two main categories:

• synchronizing elements

• non-synchronizing elements

The synchronizing elements pass not only blocks or objects of data between the processing elements, but also information about the status of data availability (for example, empty or full flags). This category includes the FIFO, used for scalar communication (for example, block parameters), and the Block FIFO, used for block data communication (for example, pixels of the macroblocks). The Block FIFO (or ObjectFIFO) is an implementation of a FIFO (First In First Out) queue of objects. Figure 2 illustrates the (a) Scalar FIFO and (b) Block FIFO.

Figure 2: Synchronizing Communications Primitives. (a) Scalar FIFO: write side with full, we, and data in; read side with empty, re, and data out; storage of Word 0 to Word n. (b) Block FIFO: each side has op_mode, addr, data in, and data out, with full on the write side and empty on the read side; storage of Block 0 to Block n, each holding Word 0 to Word K; op_mode is one of NOP, read, write, commit.
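The two synchronizing primitives can be sketched as small behavioural models. The following Python sketch is illustrative only: the class and method names are hypothetical, and the commit semantics (a producer commit hands a filled block to the consumer, a consumer commit releases the oldest block) are an assumption inferred from the op_mode values listed in Figure 2, not a description of the actual HDL.

```python
from collections import deque


class ScalarFifo:
    """Word FIFO exposing the full/empty status flags of a scalar FIFO."""

    def __init__(self, depth):
        self.depth = depth
        self._words = deque()

    @property
    def full(self):
        return len(self._words) >= self.depth

    @property
    def empty(self):
        return not self._words

    def write(self, word):
        """Model of asserting we; the write is refused when full."""
        if self.full:
            return False
        self._words.append(word)
        return True

    def read(self):
        """Model of asserting re; returns None when empty."""
        return None if self.empty else self._words.popleft()


class BlockFifo:
    """Queue of fixed-size blocks with addressed word access inside a block."""

    def __init__(self, depth, words_per_block):
        self.depth = depth
        self.words_per_block = words_per_block
        self._committed = deque()
        self._write_block = [0] * words_per_block

    @property
    def full(self):
        return len(self._committed) >= self.depth

    @property
    def empty(self):
        return not self._committed

    def write(self, addr, word):
        """op_mode = write: store one word of the block being filled."""
        if self.full:
            return False
        self._write_block[addr] = word
        return True

    def commit_write(self):
        """op_mode = commit (producer): pass the filled block on (assumed)."""
        if self.full:
            return False
        self._committed.append(self._write_block)
        self._write_block = [0] * self.words_per_block
        return True

    def read(self, addr):
        """op_mode = read: read one word of the oldest committed block."""
        return None if self.empty else self._committed[0][addr]

    def commit_read(self):
        """op_mode = commit (consumer): release the oldest block (assumed)."""
        if self.empty:
            return False
        self._committed.popleft()
        return True
```

In this model the consumer sees a block only after the producer commits it, which is what lets a processing element read the words of a block in any order without tracking how far the producer has progressed.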


The non-synchronizing communication primitives are communication elements accessible to two or more processing elements at any time, and used exclusively for transferring data. The non-synchronizing communication primitives include shared memories and global configuration registers.

Shared memory can be divided into local shared memories, which store the small amounts of data required for processing, and global shared memories, which store large amounts of data. Because the data should be accessible from multiple points in the processing algorithm, a global shared memory requires a memory controller implemented in the FPGA that includes a port arbitration mechanism. Lastly, the configuration registers are used to share global parameters that do not change often, such as frame parameters (frame size). The configuration parameters are implemented as shadow registers.
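A shadow register lets a new parameter value be written at any time while the processing elements keep seeing the old value until an explicit update. The sketch below is a minimal behavioural model in Python. The signal names DataIn, Valid, Update, and DataOut come from Figure 3; their exact semantics here (Valid loads the shadow copy, Update transfers it to the active copy) are an assumption, and the class name is hypothetical.

```python
class ShadowedConfigRegister:
    """Two-stage configuration register: a shadow copy and an active copy."""

    def __init__(self, reset_value=0):
        self.shadow = reset_value
        self.active = reset_value

    def clock(self, data_in=0, valid=False, update=False):
        """Advance one clock edge and return the active value (DataOut).

        Both stages are evaluated from their values before the edge, as two
        registers sharing a clock would be, so an update that coincides with
        a write transfers the previous shadow value.
        """
        next_active = self.shadow if update else self.active
        next_shadow = data_in if valid else self.shadow
        self.shadow, self.active = next_shadow, next_active
        return self.active
```

For example, a new frame width written with `valid=True` in the middle of a frame stays invisible to the datapath until `update=True` is pulsed, typically at a frame or slice boundary.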

Capturing Test Vectors from the Reference Source Code

With a limited set of communication primitives, the development and testing of functional blocks can be somewhat standardized. We start by identifying locations in the reference source code that will be used to intercept input stimuli for the block being designed, and then perform the same process for capturing the output of the same block.

The input stimuli, as well as the output used for comparison, may have to be gathered from different regions within the reference code. One method is to rewrite some or all parts of the reference code to better funnel input stimuli to one location and centralize the capture of the resulting output [Ref 8]. The resulting code reflects the eventual hardware architecture. This process amounts to dissecting and structuring the entire system into its functional modules to better capture the communication traffic between modules. Ideally, a reference code structured in this way is extremely convenient for the HDL designer.

For complicated reference models (like the JM), rewriting the reference code is a time-consuming effort added to the designer's main objective: the hardware implementation. The restructured code needs to be verified against the original code to make sure that the designer's version is correct. If changes are made in the original code, the designer's version must be revised or even reworked. The structured version of the reference code significantly delays the start of real hardware work and is unnecessary because it does not reduce the number of design cycles.

Figure 3: Non-synchronizing Communications Primitives. (a) Configuration Register: DataIn, Valid, and Update inputs; Shadow and Active registers; DataOut. (b) Local Shared Memory: a memory with two ports, each with r/w, add, data in, and data. (c) Global Shared Memory: Read and Write ports, Memory Controller, External Memory Controller, PHY, and External Memory.


A simpler way is for the designer to tap into the reference code using non-intrusive instructions that capture all the input/output traffic of the block being designed. For the C reference JM source code, instructions enclosed by preprocessor directives (such as #define and #ifdef) are used and strategically placed to capture input stimuli and corresponding outputs.

Figure 4 illustrates an example where DC chroma integer transform coefficients(1) are captured in globally defined arrays. This non-intrusive way of tapping into the reference code allows for the code additions of the hardware designer to co-exist (transparently) with the algorithmic work, whether the work is completed or ongoing. For reference code written in MATLAB, global and persistent variable [Ref 5] declarations can be used.

1. For YUV420 input sequences, the DC chroma integer transform values are captured inside the JM code in (Run, Level) format, in two pairs of arrays (as illustrated in Figure 4). Prior to transmission to the CABAC core, the (Run, Level) format is converted back to the original single array of DC coefficients.
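The conversion mentioned in the footnote can be illustrated with a short Python sketch. It assumes the common run-level convention in which each run counts the zero coefficients that precede the corresponding non-zero level; the function names are hypothetical and the exact array layout used inside the JM code is not reproduced here.

```python
def run_level_encode(coeffs):
    """Split a coefficient array into parallel (run, level) arrays.

    Each run is the number of zeros preceding the matching non-zero level.
    """
    runs, levels = [], []
    run = 0
    for c in coeffs:
        if c == 0:
            run += 1
        else:
            runs.append(run)
            levels.append(c)
            run = 0
    return runs, levels


def run_level_decode(runs, levels, length):
    """Rebuild the single coefficient array from (run, level) pairs."""
    coeffs = [0] * length
    pos = 0
    for run, level in zip(runs, levels):
        pos += run          # skip the zeros before this level
        coeffs[pos] = level
        pos += 1
    return coeffs
```

For a 4:2:0 sequence each chroma component carries four DC coefficients, so `run_level_decode(runs, levels, 4)` would rebuild one component's array; trailing zeros are restored implicitly because the output is pre-filled with zeros.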

Figure 4: Non-intrusive Input Stimuli Captured inside the JM Reference Source Code (DC Coeff Capture)


Defining Appropriate Communication Primitives for a Functional Block

After determining the best locations to capture input and output test vectors from the reference code, the designer must decide which communications primitive types will define the input and output ports of the functional block.

• Scalar FIFO primitives consist of word-aligned data, which can be easily assembled from within the reference C or MATLAB source code.

• Object FIFO primitives are vector-based data assembled as data structure types.

Shared memories (internal or external) are usually used as buffers for large data sets and are used between multiple functional blocks. Registers are much simpler primitives and better suited for communicating status or configuration parameters.

To select the primitive type, the generally accepted process is to assign I/O ports such that each port reflects the specific purpose of the data being funneled in or being sent out. Additionally, the method in which the I/O ports are defined must be consistent with specific design requirements such as input/output data rate and parallelism in the availability of data.

For the CABAC core, multiple types of data must be encoded by the single encoding engine displayed in Figure 1. The data types are grouped into Syntax Elements (SEs) and captured from several locations inside the reference code.

At first glance it is tempting to select multiple scalar FIFOs as the communication primitive of choice for each of the SE data to encode. Figure 5 illustrates the resulting input interface of this decision.

From the designer's perspective, this is probably the easiest interface to use. The designer's internal state machine simply monitors the availability of data in all the FIFOs and then services each FIFO based on some scheduling. However, the user of the functional block will notice a cluttered interface and naturally wonder how to properly write additional HDL code to satisfy all of these input channels without causing the block to stall from key ports being starved.

A better interface should take into account the availability of data and the schedule on which they are processed. If processing can be abstracted on a block basis, an Object FIFO interface should be used. Figure 6 illustrates such a simplified interface built on a single ObjectFIFO. Simple conversion logic inside the design converts the user's FIFO interface to that of an ObjectFIFO. As illustrated, the user is expected to follow the designer's specified input format for an Object. Whenever block processing is permissible and multiple input sources are to be considered, the designer's defined data structure in an Object FIFO helps eliminate ordering and scheduling uncertainties for the user.

Certain applications may require that the functional blocks share Local and Global memories as a means to communicate common data sets or simply because of a need for more storage space.

Local memories are built from current Xilinx FPGA select RAMs or block RAMs (see [Ref 3]) and are often smaller than global memories, which use external memory modules. Figure 7 shows an example of two functional blocks sharing a common external memory through an internal memory controller. The memory controller performs the arbitration (for example, round robin) between access requests from functional modules.
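Round-robin arbitration, given above as an example policy for the memory controller, can be sketched in a few lines of Python. This is a generic behavioural model under the assumption of one request flag per functional module; the class name is hypothetical and it does not describe the actual controller implementation.

```python
class RoundRobinArbiter:
    """Grant one requesting port per call, rotating priority after each grant."""

    def __init__(self, num_ports):
        self.num_ports = num_ports
        self._last = num_ports - 1  # so that port 0 has priority first

    def grant(self, requests):
        """requests: one bool per port. Returns the granted port or None."""
        for offset in range(1, self.num_ports + 1):
            port = (self._last + offset) % self.num_ports
            if requests[port]:
                self._last = port
                return port
        return None
```

Because the search always starts one position after the most recently granted port, a module that requests continuously cannot starve the others, while a lone requester is still served on every call.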

To prevent unwanted data overwrite, each functional module is pre-allocated its own memory space to operate on. The only provision is that the memory controller is capable of handling the total Read/Write bandwidth required by the combined functional modules. Having functional blocks access a common allocated memory space adds an extra layer of access control beyond the tasks of the memory controller. A typical scenario for common memory areas is when a functional block saves data that is later consumed by another block. The data being shared is typically too large to be stored internally. If the design requires shared memory areas between functional blocks, these blocks must communicate their read/write status to each other. Designing such functional blocks requires properly capturing test vectors so that each block can still be worked on and verified independently. For a functional block that consumes data from a common memory area, the memory contents are input test vectors that are extracted from the reference code and must be updated in time for their consumption.

Figure 5: CABAC Macroblock (MB) Input Interface with Multiple FIFO. The CABAC MB I/F exposes six scalar FIFOs (FIFO 0 to FIFO 5), each with a 32-bit data port, a write enable (_we), and a full flag (_full) on the user side: MBinfo[31:0], Intramodes[31:0], List_0_1[31:0], MVD[31:0], LumaCoeff[31:0], and ChromaCoeff[31:0]. On the core side, each FIFO presents empty, data out, and re, and stores Word 0 to Word n.


Figure 6: Simplified CABAC Macroblock (MB) Interface using one ObjectFifo. User-side ports of the CABAC MB I/F: MB_IN_data[31:0], MB_IN_re, MB_IN_empty. A FIFO2ObjectFIFO Converter feeds the ObjectFIFO (full, op_mode, addr, data in, data out, empty), which stores Object 0 to Object n, each holding Word 0 to Word k. Macroblock Object layout: MBinfo Group: 5 words; Intramodes (prediction) Group: 8 words; List_0_1 Group: 9 words; MVD Group: 32 words; LUMA BLOCK (LUMA_AC_4x4 OR LUMA_AC_8x8 OR (LUMA_DC_16x16 and/or LUMA_AC_16x16)): 128 words; CHROMA BLOCK (ChromaDC and ChromaAC): 66 words.
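The fixed Macroblock Object format can be modelled as a table of group sizes from which word offsets follow. The Python sketch below uses the group sizes shown in Figure 6; the pairing of each size with its group is read from the figure and should be treated as an assumption, as should the zero padding of short groups and the function names, which are hypothetical.

```python
# (group name, size in 32-bit words), in the order the object is filled.
MB_OBJECT_LAYOUT = [
    ("MBinfo", 5),
    ("Intramodes", 8),
    ("List_0_1", 9),
    ("MVD", 32),
    ("Luma", 128),
    ("Chroma", 66),
]


def group_offsets(layout=MB_OBJECT_LAYOUT):
    """Return ({name: (base word address, size)}, total words per object)."""
    offsets, base = {}, 0
    for name, size in layout:
        offsets[name] = (base, size)
        base += size
    return offsets, base


def pack_macroblock(groups, layout=MB_OBJECT_LAYOUT):
    """Build one object from {name: [words]}; short groups are zero padded."""
    obj = []
    for name, size in layout:
        words = list(groups.get(name, []))
        if len(words) > size:
            raise ValueError(f"group {name} holds at most {size} words")
        obj.extend(w & 0xFFFFFFFF for w in words)   # keep 32-bit words
        obj.extend([0] * (size - len(words)))
    return obj
```

With a fixed layout, the user writes one stream of words in a known order and the core addresses any group directly inside the object, which is what removes the scheduling question raised by the six separate FIFOs.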


Establishing a Common Test Bench for HDL Source Code and Hardware Verification

Test benches are typically HDL code designed to provide stimuli to the designer's own HDL code. Although some HDL design tools provide command line or scripted methods for driving signals of the designer's code, these are no substitute for a test bench. For robust design verification and to shorten design time, a high-quality test bench is essential.

Test bench methodologies are generally ad hoc and based on the perceived observations of the designer. As the complexity of the design increases, writing an adequate test bench becomes more challenging. Most test benches are only geared toward the verification of the HDL work, and then the designer determines another procedure to verify the design running in hardware. Setting up the apparatus for hardware verification can be difficult and is also dependent on the targeted hardware platform being fully debugged and working. Using System Generator and its pre-supported hardware platforms, a consistent and structured test bench methodology is proposed.

Figure 7: Functional Blocks with a Shared External Memory (Block A, Block B, FIFOs, Memory Controller, PHY, External Memory)


With the limited set of communication primitives and System Generator-supported FPGA platforms, it is feasible to verify both the HDL and the resulting hardware implementation in a single framework.

Figure 8 illustrates two test bench structures that can be used in System Generator for the verification of a functional block: one generates stimuli within MATLAB, and the other generates stimuli from external source code. Both methods are based on using the System Generator HDL co-simulation (black box) to interact with the user's HDL.

• Example A: An Embedded MATLAB script models parts of the system generating input stimuli to the functional block being tested. In System Generator, the Embedded MATLAB script can be written as a MATLAB function enclosed in a triggered sub-system. When triggered by the HDL primitives, the entire Embedded MATLAB script is executed in one clock cycle. There are two disadvantages in this example: the Embedded MATLAB model must be available to the designer, and writing a model in Embedded MATLAB becomes more difficult as complexity increases. Additionally, computations performed inside the Embedded MATLAB sub-system may slow System Generator significantly.

• Example B: In this structure, better suited for larger designs and for running days or even weeks of test vectors, the test bench utilizes stimuli captured from within the reference source code to exercise the functional block under test. In lieu of establishing a link between the external model and System Generator (doing so may slow down simulations), the captured test vectors are stored in files. These files are then read one vector at a time by a simple Embedded MATLAB script enclosed in a triggered sub-system. More information about this structure is provided in the sections that follow.
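The file-based structure of Example B can be sketched outside of Simulink as follows. This Python sketch is only an analogy for the triggered Embedded MATLAB file readers: the one-hexadecimal-word-per-line file format, the function and class names, and the `dut` callable standing in for the functional block are all assumptions made for illustration.

```python
def write_vectors(path, vectors):
    """Store captured vectors as one 32-bit hexadecimal word per line."""
    with open(path, "w") as f:
        for v in vectors:
            f.write(f"{v & 0xFFFFFFFF:08X}\n")


class TriggeredVectorReader:
    """Return one vector from the file each time it is triggered."""

    def __init__(self, path):
        self._file = open(path)

    def trigger(self):
        line = self._file.readline()
        return int(line, 16) if line.strip() else None  # None at end of file

    def close(self):
        self._file.close()


def replay(stimuli_path, expected_path, dut):
    """Feed stimuli to dut one vector at a time and collect mismatches.

    Returns a list of (vector index, expected, actual) tuples.
    """
    stimuli = TriggeredVectorReader(stimuli_path)
    expected = TriggeredVectorReader(expected_path)
    mismatches = []
    index = 0
    try:
        while True:
            s, e = stimuli.trigger(), expected.trigger()
            if s is None or e is None:
                break
            actual = dut(s)
            if actual != e:
                mismatches.append((index, e, actual))
            index += 1
    finally:
        stimuli.close()
        expected.close()
    return mismatches
```

Reading one vector per trigger keeps memory use flat regardless of how long the captured sequence is, which matters for the days or weeks of test vectors mentioned above.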

Using the Hardware co-simulation feature of System Generator, both structures in Figure 8 can be used to verify the HDL in hardware using a supported platform such as the Xilinx ML506 board [Ref 6].


Figure 8: Two Test Bench Structures used in System Generator for Verification of a Functional Block. Example A (stimuli generated inside MATLAB): an Embedded MATLAB vector generator and functional block model drive the VHDL functional block through communication primitives; the model output, delayed by z^-n, is compared with the VHDL output. Example B (stimuli generated from external source code): stimuli capture and expected output capture from the external source code feed Embedded MATLAB file read blocks; the VHDL functional block is driven through communication primitives and its output is compared with the expected output and written out by an Embedded MATLAB file write block.


System Generator for Testing/Verification of the CABAC Design

The Xilinx H.264 encoder core [Ref 2] is used in this example to illustrate the use of System Generator for verification of an HDL design. Three examples of verification are provided:

• Functional HDL simulation/verification

• Hardware single stepped clock verification

• Real-time hardware verification in burst mode

Functional HDL Simulation/Verification in System Generator

As described in [Ref 3], the CABAC core is an entropy encoder used in the H.264/AVC [Ref 4] video encoding standard. The design effectively uses the available resources found in Xilinx FPGAs to overcome throughput limitations due to the bit-wise processing nature of CABAC, its complicated data dependency, and the varying iteration times for each binary symbol.

For this reason, real-time encoding of HD sequences is possible as described in [Ref 3]. Such a complex design must be rigorously tested, and test cases must cover all internal states of the design. The predominant and straightforward approach is to use a large pool of input test sequences that cover the range of supported frame sizes (for example, QCIF, 4CIF, 720p or 1080i/p) and encoding options (for example, transform mode 4x4 or 8x8, quantization parameter, slice type).
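One way to organize such a pool is to enumerate every combination of the options listed above. The Python sketch below is a hypothetical illustration: the option names follow the text, the frame dimensions are the standard ones for those formats, and the three sampled QP values are an arbitrary choice from the H.264 range of 0 to 51, not the set used for the CABAC core.

```python
from itertools import product

# Luma frame sizes in pixels for the formats named in the text.
FRAME_SIZES = {
    "QCIF": (176, 144),
    "4CIF": (704, 576),
    "720p": (1280, 720),
    "1080p": (1920, 1080),
}
TRANSFORM_MODES = ("4x4", "8x8")
SLICE_TYPES = ("I", "P", "B")
QP_VALUES = (0, 26, 51)  # hypothetical sample of the QP range


def macroblocks(width, height):
    """Number of 16x16 macroblocks, rounding each dimension up."""
    return ((width + 15) // 16) * ((height + 15) // 16)


def build_test_matrix():
    """Yield one test configuration per combination of the options above."""
    for (name, (w, h)), tmode, stype, qp in product(
        FRAME_SIZES.items(), TRANSFORM_MODES, SLICE_TYPES, QP_VALUES
    ):
        yield {
            "frame": name,
            "mbs": macroblocks(w, h),
            "transform": tmode,
            "slice_type": stype,
            "qp": qp,
        }
```

Each configuration would correspond to one run of the reference encoder and therefore one set of captured input and output vector files.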

Generating Input and Output Test Vectors for CABAC Design

From inside the JM reference source code [Ref 4], the input and output test vectors for the CABAC rev1.0 core are generated. The core requires two types of input test vectors:

• Slice encoding parameters

• MacroBlock (MB) data to encode

The slice and MB data are sent to the core via FIFOs as shown in Figure 9, where the CABAC core is configured as a hardware accelerator. For each new Slice to encode, the encoding parameter settings must be passed to the core before subsequent MBs can be encoded. Figure 10 shows the code snippet used to capture the Slice encoding parameters inside the JM source code. The Slice parameters are lumped in the SliceInfo array and, for the entire test sequence, the collected parameters for all the slices of the sequence are dumped in one file. Details about the code are provided in [Ref 3]. The same collection procedure is used for the MB data input vectors. Figure 11 shows the code snippet added inside the JM source code to capture MB data. All of the MB structures are then dumped into a second file. The output test vectors are also generated from within the JM code and stored in a third file. Having generated the input and output test vectors, the next step is to build the test bench.


Figure 9: CABAC Core in Hardware Accelerator Mode

[Block diagram. Block labels: FPGA, bus, Bus Controller, FIFO (two instances), H.264 CABAC Encoder, Output Stream Logic + Buffering, Statistics Logic + Buffering, constant '0'. Encoder port labels: MB_IN_data[31:0], MB_IN_re, MB_IN_empty, Slice_IN_data[15:0], Slice_IN_re, Slice_IN_empty, SCLR, CLK, Ecodestrm_full, Ecodestrm_we, Ecodestrm_data[7:0], Ecodestrm_len_valid, Ecodestrm_len[29:0], Ecodestrm_MB_done, Ecodestrm_LastByte.]

Figure 10: Code Snippet Capturing Slice Input Parameters

/* This function generates the Slice input parameters of the CABAC core within the JM source code.
   It should be called from the end of the start_slice() function of the JM’s slice.c source file. */
void Gen_SliceInfo()
{
    SliceInfo[0] = (((img->structure != FRAME) & 0x1) << 12) |        /* img_structure_notFrame */
                   (((input->BRefPictures == 2) & 0x1) << 11) |       /* img_NumBRefPicturesIs2 */
                   (((img->yuv_format) & 0x3) << 9) |                 /* img_YUV_format */
                   (((img->model_number) & 0x3) << 5) |               /* img_model */
                   (((img->type) & 0x7) << 2) |                       /* img_type */
                   (((input->Transform8x8Mode != 0) & 0x1) << 1) |    /* img_Transform8x8Mode */
                   0;

    SliceInfo[1] = ((img->PicWidthInMbs & 0xFF) << 8) |               /* img_PicWidthinMbs */
                   ((img->qp) & 0x3F);                                /* img_QP */

    SliceInfo[2] = 0x00;                                              /* Not used; Reserved */

    SliceInfo[3] = (input->slice_mode == 1 ? input->slice_argument
                                           : img->FrameSizeInMbs);    /* img_FramsizeInMbs */
}
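The bit layout used for the first two slice words in Figure 10 can be cross-checked with a small pack/unpack round trip. The following is a minimal sketch, not part of the JM code or the CABAC core: the SliceFields structure and the function names pack_slice and unpack_slice are hypothetical, and the choice of 16-bit words is an assumption suggested by the Slice_IN_data[15:0] port of Figure 9.

```c
#include <stdint.h>

/* Hypothetical container for the fields packed into SliceInfo[0] and
   SliceInfo[1]; bit positions follow the shifts and masks in Figure 10. */
typedef struct {
    unsigned structure_not_frame; /* word 0, bit 12     */
    unsigned num_bref_is_2;       /* word 0, bit 11     */
    unsigned yuv_format;          /* word 0, bits 10:9  */
    unsigned model;               /* word 0, bits 6:5   */
    unsigned type;                /* word 0, bits 4:2   */
    unsigned transform8x8;        /* word 0, bit 1      */
    unsigned pic_width_in_mbs;    /* word 1, bits 15:8  */
    unsigned qp;                  /* word 1, bits 5:0   */
} SliceFields;

/* Pack the fields into two 16-bit words (word width is an assumption). */
void pack_slice(const SliceFields *f, uint16_t w[2])
{
    w[0] = (uint16_t)(((f->structure_not_frame & 0x1u) << 12) |
                      ((f->num_bref_is_2 & 0x1u) << 11) |
                      ((f->yuv_format & 0x3u) << 9) |
                      ((f->model & 0x3u) << 5) |
                      ((f->type & 0x7u) << 2) |
                      ((f->transform8x8 & 0x1u) << 1));
    w[1] = (uint16_t)(((f->pic_width_in_mbs & 0xFFu) << 8) |
                      (f->qp & 0x3Fu));
}

/* Recover the fields from the two packed words. */
SliceFields unpack_slice(const uint16_t w[2])
{
    SliceFields f;
    f.structure_not_frame = (w[0] >> 12) & 0x1u;
    f.num_bref_is_2       = (w[0] >> 11) & 0x1u;
    f.yuv_format          = (w[0] >> 9) & 0x3u;
    f.model               = (w[0] >> 5) & 0x3u;
    f.type                = (w[0] >> 2) & 0x7u;
    f.transform8x8        = (w[0] >> 1) & 0x1u;
    f.pic_width_in_mbs    = (w[1] >> 8) & 0xFFu;
    f.qp                  = w[1] & 0x3Fu;
    return f;
}
```

For example, a field-coded slice with yuv_format 1, model 2, type 5, the 8x8 transform enabled, a picture width of 120 MBs, and a QP of 28 packs to the words 0x1256 and 0x781C under this layout. A check of this kind on the captured file can catch field overlaps before any HDL simulation is run.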


Test Bench Setup for CABAC Design

To simulate HDL in System Generator, a black box of the top VHDL entity of the design must be created. Figure 12 illustrates a sample of a Simulink model created using System Generator.

The slice and MB input ports of the CABAC black box are each connected to input FIFOs that the core controls. Blocks labeled Convert X_ are used to handle unknown ‘X’ states that occur during simulation (Xs are replaced by zeros in HDL simulation mode).


Figure 11: Code Snippet Capturing MB Input Parameters

/* This function generates the Macroblock input parameters of the CABAC core within the JM source code.
   It should be called from the end of the writeMBLayer(int, int *) function of the JM’s macroblock.c source file. */
void Gen_MBparams()
{
    Gen_MBinfo();                        /* write MBinfo Group */
    Gen_INTRAMODES();                    /* write INTRAMODES Group */
    if (IS_INTERMV(currMB))
    {
        Gen_LIST_0_1();                  /* write LIST_0_1 Group */
        Gen_MVD();                       /* write MVD Group */
    }
    if ((currMB->mb_type != 0) || (img->type == B_SLICE && currMB->cbp != 0))
    {
        if (!IS_NEWINTRA(currMB) && currMB->cbp != 0)
        {
            if (!currMB->luma_transform_size_8x8_flag)
                Gen_LUMA_AC_4x4();       /* write LUMA_AC_4x4 Group */
            else
                Gen_LUMA_AC_8x8();       /* write LUMA_AC_8x8 Group */
        }
        else if (IS_NEWINTRA(currMB))
        {
            Gen_LUMA_DC_16x16();         /* write LUMA_DC_16x16 Group */
            if (currMB->cbp & 15)
                Gen_LUMA_AC_16x16();     /* write LUMA_AC_16x16 Group */
        }
        if (currMB->cbp > 15)
            Gen_ChromaDC();              /* write ChromaDC Group */
        if (currMB->cbp >> 4 == 2)
            Gen_ChromaAC();              /* write ChromaAC Group */
    }
}


Because System Generator requires a known state (0 or 1) for signals, these blocks are needed to accommodate less stringent HDL code. Figure 13 shows the black box of Figure 12, together with its input/output interface, wrapped into a new CABAC subsystem. This subsystem connects to other triggered subsystems that provide inputs to the core and verify the core output.
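The effect of forcing unknown states to a known value can be illustrated outside Simulink. The sketch below is an assumption-laden C model, not the implementation of the Convert X_ blocks: it represents a signal vector as a string of std_logic-style characters, and the function name convert_x_to_zero is hypothetical. Only 'X' states are replaced, as described above; other characters are left unchanged.

```c
#include <stddef.h>

/* Replace every unknown 'X' state in a NUL-terminated bit string with '0'.
   Returns the number of positions that were replaced, which can be logged
   to spot signals that are never driven by the HDL. */
size_t convert_x_to_zero(char *bits)
{
    size_t replaced = 0;
    for (size_t i = 0; bits[i] != '\0'; ++i) {
        if (bits[i] == 'X' || bits[i] == 'x') {
            bits[i] = '0';
            ++replaced;
        }
    }
    return replaced;
}
```

Returning the replacement count is a design choice of this sketch: a nonzero count after reset usually deserves a look at the HDL, even though the simulation itself can proceed.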

The Slice_input subsystem, which provides the Slice input parameters, uses MATLAB functions to read from a text file. As shown in Figure 13, this subsystem, when triggered by the CABAC subsystem, runs all of its Embedded MATLAB instructions within one simulation clock cycle.


Figure 12: Black Box CABAC HDL


The Embedded MATLAB function, illustrated in Figure 14, is called on each simulation clock cycle while the subsystem is triggered. In a single enabled clock period, this function reads and outputs one line of the slice test vector file, together with a write enable (we) signal, for storage into the slice input FIFO of Figure 12.
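The read-one-line-per-clock behavior can be modeled in C as a step function that is called once per enabled clock. This is a sketch under stated assumptions rather than a transcription of Figure 14: the file format (one hexadecimal word per line), the 16-bit data width, and the names SliceSample, parse_vector_line, and slice_input_step are all hypothetical.

```c
#include <stdio.h>
#include <stdint.h>

/* One output sample of the modeled subsystem: a data word and its
   write enable. we is 0 when no valid line was available. */
typedef struct {
    uint16_t data;
    int we;
} SliceSample;

/* Parse one text line holding a hexadecimal word (assumed file format). */
SliceSample parse_vector_line(const char *line)
{
    SliceSample out = {0, 0};
    unsigned value;
    if (sscanf(line, "%x", &value) == 1) {
        out.data = (uint16_t)(value & 0xFFFFu);
        out.we = 1;
    }
    return out;
}

/* Called once per enabled clock: consumes exactly one line of the test
   vector file and returns the word with we asserted, or we = 0 at end
   of file. */
SliceSample slice_input_step(FILE *fp)
{
    char line[64];
    SliceSample idle = {0, 0};
    if (fp == NULL || fgets(line, sizeof line, fp) == NULL)
        return idle;
    return parse_vector_line(line);
}
```

Deasserting we at end of file keeps the model well defined when the core keeps triggering after the last vector has been consumed.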

Unsupported MATLAB functions executed inside Embedded MATLAB for Simulink must be declared as extrinsic functions [Ref 6]. The file pointer, fp0, of Figure 14 is initialized in the Simulink Model Properties dialog box and saved in the workspace. To verify the output, the encoded bit stream obtained from the JM source code is compared with the simulation output inside the Output_compare subsystem of Figure 14. This subsystem, shown in Figure 15, also uses an Embedded MATLAB function that reads one line of the JM output test vector and compares it with the HDL-encoded bit stream.

While the CABAC test bench uses simple Embedded MATLAB scripts to read test vector files, more complicated testing structures can be built using Embedded MATLAB function modeling. Being able to call MATLAB functions as well as user-defined functions inside Simulink provides a highly flexible testing/verification environment.


Figure 13: Test Bench Set Up for HDL Simulation of CABAC Design


Figure 14: Slice_Input Subsystem and Embedded MATLAB for Reading a Test Vector Input File


Functional Hardware Simulation in System Generator

HDL functional simulations on large designs with large sets of test vectors are very slow; for example, running the CABAC core on a 1080p sequence of 300 frames takes several hours. To accelerate the process, running the design in hardware through functional hardware simulation is suggested.

In functional hardware simulation, the HDL design being tested is simulated directly in hardware using a clocking scheme controlled by System Generator. For the CABAC design, the same test bench of Figure 14 is used for the functional hardware simulation. The System Generator hardware co-simulation feature is used to target a Xilinx ML506 board and communicate through its 1 Gbps Ethernet port.

In functional hardware simulation, the controlled clocking provides the ability to determine the precise location of an error. The designer can trace the subset of input and corresponding output test vectors that results in a failure. Using the JTAG port of the ML506, a ChipScope™ [Ref 7] interface can be added to provide in-depth debugging capabilities (Figure 16) while the HDL is running in hardware. In this design, the ChipScope interface was manually added into the imported HDL design by the user; an alternative is to use System Generator's ChipScope block, which allows the debug interface to be integrated into the same System Generator model.


Figure 15: Output Bit Stream Test Vector Comparison Subsystem


Real-time Hardware Verification in System Generator

To counter interface limitations between the user's PC and hardware, System Generator uses vectored or frame-based communication. The idea is to assemble as many input data samples as possible into memory. In a single transaction, the contents of the memory are sent to the hardware. For the duration of the burst period of the transaction, the hardware runs at the targeted clock speed.
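The assemble-then-burst idea can be sketched in C as a buffer that collects samples and hands them to the link in one call. This is an illustrative model only, not System Generator's implementation: the names BurstBuffer, burst_push, burst_flush, and count_words, the buffer capacity, and the callback-based link are all assumptions. The count_words function is a stand-in for the PC-to-board transfer and simply tallies how many words it was given.

```c
#include <stddef.h>
#include <stdint.h>

#define BURST_CAPACITY 1024 /* hypothetical buffer depth, in 32-bit words */

/* Callback that moves one assembled burst across the link. */
typedef void (*burst_send_fn)(const uint32_t *words, size_t count, void *ctx);

typedef struct {
    uint32_t words[BURST_CAPACITY];
    size_t count;        /* samples currently assembled */
    size_t transactions; /* bursts issued so far */
} BurstBuffer;

void burst_init(BurstBuffer *b)
{
    b->count = 0;
    b->transactions = 0;
}

/* Send everything assembled so far in a single transaction. */
void burst_flush(BurstBuffer *b, burst_send_fn send, void *ctx)
{
    if (b->count == 0)
        return;
    send(b->words, b->count, ctx);
    b->transactions++;
    b->count = 0;
}

/* Add one sample; a burst is issued only when the buffer is full. */
void burst_push(BurstBuffer *b, uint32_t word, burst_send_fn send, void *ctx)
{
    if (b->count == BURST_CAPACITY)
        burst_flush(b, send, ctx);
    b->words[b->count++] = word;
}

/* Stand-in link for demonstration: accumulates the word count in *ctx. */
void count_words(const uint32_t *words, size_t count, void *ctx)
{
    (void)words;
    *(size_t *)ctx += count;
}
```

With this capacity, pushing 2500 words followed by a final flush produces three transactions instead of 2500 single-sample transfers, which is the source of the speedup: the per-transaction overhead of the link is paid once per burst.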

To take advantage of this System Generator feature for the CABAC design, the FIFOs in Figure 12 must be replaced with shared FIFOs (From FIFOs) [Ref 9]. Mirroring shared FIFOs (To FIFOs) must also be placed in the test bench of Figure 13 to complete the communication link. Figure 17 shows the resulting System Generator model after replacing the input FIFOs of Figure 12 with shared memory FIFOs (an additional encoded output stream FIFO and a ChipScope block have also been added). The test bench illustrated in Figure 13 is now replaced by the test bench illustrated in Figure 18.


Figure 16: Point-to-Point Ethernet HW Cosim with ChipScope Debug Inside HDL

[Diagram labels: Ethernet Port (HW-cosim); JTAG Port (ChipScope)]


However, for the CABAC design, this test bench presents one issue: because the CABAC core is asynchronous with respect to the Simulink environment, the setup fails. To resolve this issue, we must go outside Simulink and use the System Generator MATLAB-based hardware co-simulation interface (M-HWcosim). M-HWcosim is an API that allows a MATLAB script M-file to transfer data to the hardware.

It resolves the asynchronous problem of the CABAC design inside System Generator by using its instruction sets to asynchronously send and retrieve data from the CABAC HWcosim version of Figure 17. The M-file, with the M-HWcosim instructions, used for stimulating the HWcosim block of Figure 17 is shown in Figure 19. Details about M-HWcosim will become available in a future version of the product.

Figure 18 illustrates the new test bench with Shared FIFO interfacing to the model illustrated in Figure 17. Because the CABAC design is asynchronous in Simulink, this model does not work.


Figure 17: Black Boxing of CABAC HDL: Shared FIFOs for Real Time Burst Transfers


Figures 19 through 22 illustrate portions of code that accomplish the following:

• Figure 19 shows the M-HWcosim M-file for real-time verification of the CABAC design, which initializes M-HWcosim as well as variables.

• Figure 20 shows how the data to be sent to the core is first assembled in the SliceArray and MBdata arrays.

• Figure 21 shows how the contents of SliceArray and MBdata are dumped to the core.

• Figure 22 shows how the output stream FIFO is read into the array dataStream, which is then compared with the golden values obtained from the reference code.
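The final comparison step in the list above amounts to locating the first position at which the hardware output departs from the golden reference. The C sketch below shows that check in isolation; it does not use or imitate the M-HWcosim API, and the function name first_mismatch and the byte-wide element type (chosen to match the Ecodestrm_data[7:0] port) are assumptions.

```c
#include <stddef.h>
#include <stdint.h>

/* Compare the stream read back from hardware with the golden stream.
   Returns the index of the first differing byte, or n when the first
   n bytes of both streams are identical. */
size_t first_mismatch(const uint8_t *data_stream,
                      const uint8_t *golden,
                      size_t n)
{
    for (size_t i = 0; i < n; ++i) {
        if (data_stream[i] != golden[i])
            return i;
    }
    return n;
}
```

Reporting the index rather than a simple pass/fail flag lets the failing byte be mapped back to the slice and MB vectors that produced it, so that only that subset needs to be rerun in HDL simulation.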


Figure 18: Test Bench with Shared FIFO


Figure 19: M-HWcosim M-file for Real Time Verification of CABAC Design


Figure 20: Data to be Sent to Core First Assembled in SliceArray and MBdata Arrays


Figure 21: Code Section Showing How the Contents of SliceArray and MBdata Are Dumped to Core


Figure 22: Code Section showing Read of Output Stream FIFO into Array Data Stream


Conclusion

In complete system design, verification is often as much work as the actual design. The design of the CABAC block in H.264 leverages the JM source-code model for generating test vectors from a high-level language. In this case, the HDL design verification is integrated with System Generator and MATLAB. Additionally, the integration with complete boards that can run the CABAC block at speed is a major improvement over ad hoc environments, substantially reducing the time required for building a verification environment, subsequently allowing the designer to focus on the actual block.

References

1. www.xilinx.com/ise/optional_prod/system_generator.htm

2. H.264 CABAC Encoder (DS603) Xilinx IP Core

3. Adrian Chirila-Rus, Kristof Denolf, Bart Vanhoof, Paul Schumacher, and Kees Vissers. Communication Primitives Driven Hardware Design and Test Methodology Applied on Complex Video Applications, pp246-248. In Rapid System Prototyping (RSP) June 2005.

4. iphome.hhi.de/suehring/tml/

5. www.mathworks.com

6. www.xilinx.com/products/boards/ml506/docs.htm

7. www.xilinx.com/literature/literature-chipscope.htm

8. Kristof Denolf, Adrian Chirila-Rus, Paul Schumacher, Robert Turney, Kees Vissers, Diederik Verkest, and Henk Corporaal. A Systematic Approach to Design Low-power Video Codec Cores. Accepted for EURASIP Journal on Embedded Systems, Special Issue on Embedded Systems for Portable and Mobile Video Platforms, 2007.

9. www.xilinx.com/support/sw_manuals/sysgen_bklist.pdf

Revision History

The following table shows the revision history for this document:

Date         Version   Description of Revisions
1/17/2008    1.0       Initial Xilinx release.

Notice of Disclaimer

The information disclosed to you hereunder (the “Information”) is provided “AS-IS” with no warranty of any kind, express or implied. Xilinx does not assume any liability arising from your use of the Information. You are responsible for obtaining any rights you may require for your use of this Information. Xilinx reserves the right to make changes, at any time, to the Information without notice and at its sole discretion. Xilinx assumes no obligation to correct any errors contained in the Information or to advise you of any corrections or updates. Xilinx expressly disclaims any liability in connection with technical support or assistance that may be provided to you in connection with the Information. XILINX MAKES NO OTHER WARRANTIES, WHETHER EXPRESS, IMPLIED, OR STATUTORY, REGARDING THE INFORMATION, INCLUDING ANY WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE, OR NONINFRINGEMENT OF THIRD-PARTY RIGHTS.