
Enclustra GmbH

Räffelstrasse 28 – CH-8045 Zürich – Switzerland

Phone +41 43 343 39 43 – www.enclustra.com


Application Examples

Avnet Silica & Enclustra Seminar “Getting started with Xilinx Zynq SoC”

Fribourg, April 26, 2017





Introduction

The next few slides give a brief overview of what will be discussed in this presentation

and they contain some general points that apply to both examples covered.

The presentation not only highlights the implementation of the projects

discussed but also points out what know-how is required to successfully realize

an offloading engine using FPGA fabric within an SoC.





Projects Discussed

Two very different projects are discussed in this presentation.

The first project is a statistical image processing engine whose only purpose is to

accelerate an algorithm already implemented in software. Since the customer is

doing statistical image processing regularly, the engine has to be designed in a

reusable way. The algorithms to speed up were known when the project started, so

the operations to implement were specified exactly and Enclustra was only

responsible for converting the operations into a form that can be implemented in

SoC fabric easily (fixed-point quantization, approximations, etc.).

The second project is a Bluetooth transceiver. It allows receiving and transmitting

packets. All link-layer operations (data encoding, header generation, etc.) and

physical-layer operations (modulation, demodulation, filtering, etc.) are executed

within the offloading engine. The algorithms were developed by Enclustra since the

customer has a lot of knowledge in protocol handling but less knowledge in digital

signal processing. Because the requirements are very project specific, the transceiver

is not optimized for reuse.

There is one point both example projects have in common: They are both fixed-point

math intensive.





Enclustra Fixed-Point Math Development Flow

To understand the examples discussed in this presentation, a basic knowledge about the

development flow used for both projects is required. Therefore this flow is described briefly.

Almost every engineer has experienced that written language (or even worse: spoken

language) is prone to misunderstandings. Moreover language does not enforce the

specification of all corner cases, which leads to unclear requirements. Nevertheless, written

language is used for most specifications. This introduces significant risk for both sides, the

customer (delay) and Enclustra as service company (rework effort).

In some cases, customers are already aware of these problems and provide a MATLAB,

Simulink or C implementation of the algorithm to be implemented in an FPGA or SoC. Even

though this solves the problem of misunderstandings regarding the specification, the

algorithms provided are often using double precision floating point numbers which are not

implementable efficiently in FPGA fabric. As a result the algorithm must be changed to work

with fixed-point operations and other implementation-optimized concepts such as Taylor

approximations for functions. This conversion leads to slight changes in the behavior and

therefore again to the risk that the algorithm does not perform as expected and rework is

required.

To reduce the risk described above, Enclustra always implements a fixed-point model of the

algorithm in such a way that it can be implemented bit-true in the FPGA fabric. The customer

can then analyze the performance of the algorithm and identify changes required to meet all

requirements before the implementation is started. After approval by the customer, the bit-

true model of the algorithm is used as specification for the implementation.

This approach significantly reduces risk for both sides and improves work efficiency since a

100% clear and implementable specification in a programming language (usually MATLAB)

exists when the implementation of the offloading engine is started.





Bit-true Co-Simulations

The central point of the development flow described on the last slide is to ensure that the FPGA fabric implementation is bit-true to the MATLAB model approved by the customer. Since this is crucial to all algorithm related projects Enclustra works on, some libraries were developed to increase the efficiency in doing so.

One library contains bit-true implementations of common functions for MATLAB, VHDL and C. It includes basic operations such as addition or multiplication as well as more complex building blocks such as Taylor approximations or CORDIC. If this library is used for all operations in the MATLAB model of the algorithm, the model can implicitly be implemented bit-true in VHDL.
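
To illustrate what a bit-true building block looks like, the following Python sketch quantizes real numbers to a fixed-point format and multiplies two fixed-point values using integer arithmetic only. The names FixFormat, from_real, to_real and fix_mult, as well as the rounding (round half up) and saturation behavior, are assumptions made for this example and do not describe the actual Enclustra library.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class FixFormat:
    """Hypothetical fixed-point format descriptor (not the actual library API)."""
    signed: bool
    int_bits: int
    frac_bits: int

    @property
    def max_raw(self) -> int:
        # Largest stored integer value of this format.
        return (1 << (self.int_bits + self.frac_bits)) - 1

    @property
    def min_raw(self) -> int:
        # Smallest stored integer value of this format.
        return -(1 << (self.int_bits + self.frac_bits)) if self.signed else 0


def saturate(raw: int, fmt: FixFormat) -> int:
    """Clip a stored integer to the range of the format."""
    return max(fmt.min_raw, min(fmt.max_raw, raw))


def from_real(value: float, fmt: FixFormat) -> int:
    """Quantize a real number (assumed: round half up, then saturate)."""
    return saturate(math.floor(value * (1 << fmt.frac_bits) + 0.5), fmt)


def to_real(raw: int, fmt: FixFormat) -> float:
    """Convert a stored integer back to its real value."""
    return raw / (1 << fmt.frac_bits)


def fix_mult(a: int, a_fmt: FixFormat, b: int, b_fmt: FixFormat,
             r_fmt: FixFormat) -> int:
    """Multiply two stored integers and requantize to r_fmt, integers only."""
    product = a * b  # exact, with a_fmt.frac_bits + b_fmt.frac_bits fractional bits
    shift = a_fmt.frac_bits + b_fmt.frac_bits - r_fmt.frac_bits
    if shift > 0:
        # Add half an LSB of the result format, then drop the surplus bits.
        product = (product + (1 << (shift - 1))) >> shift
    else:
        product <<= -shift
    return saturate(product, r_fmt)
```

Because every step is defined on integers, the same sequence of operations can be reproduced exactly in another language, which is the property the co-simulation relies on.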

The second important library contains functions to read and write files containing stimuli and responses from MATLAB and VHDL. This library is used to communicate between the MATLAB part and the VHDL part of a co-simulation.

Co-simulations are implemented for each VHDL entity and always consist of a MATLAB part and a VHDL part. The MATLAB part of the co-simulation generates stimuli and feeds them into the bit-true model of the entity under test. The stimuli as well as the response of the model are written to files using the library described above. The VHDL simulation reads the stimuli from file, applies them to the VHDL implementation of the entity under test and automatically checks whether the response matches the expectation. If the responses do not match exactly (bit by bit), errors are written into a report file.

Additionally, the actual response of the VHDL implementation is written into a file. This file can be read from MATLAB and analyzed, which is very helpful for finding the root cause of any mismatches.
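
The file-based exchange described above can be sketched as follows. This is a minimal Python illustration, not the actual flow: the file format (one integer per line), the file names and the stand-in model are assumptions, and the response of the HDL simulation is faked with one deliberate error so that the report is not empty.

```python
import tempfile
from pathlib import Path


def write_vector(path, values):
    """Write one integer sample per line (assumed file format)."""
    path.write_text("".join(f"{v}\n" for v in values))


def read_vector(path):
    """Read a file written by write_vector back into a list of integers."""
    return [int(token) for token in path.read_text().split()]


def model_under_test(stimuli):
    """Stand-in for a bit-true model: y = (3 * x) >> 1 on integers."""
    return [(3 * x) >> 1 for x in stimuli]


def compare(expected, actual):
    """Return one report line per sample that does not match exactly."""
    report = []
    if len(expected) != len(actual):
        report.append(f"length mismatch: expected {len(expected)}, got {len(actual)}")
    for index, (exp, act) in enumerate(zip(expected, actual)):
        if exp != act:
            report.append(f"sample {index}: expected {exp}, got {act}")
    return report


def run_demo():
    with tempfile.TemporaryDirectory() as tmp:
        tmp_dir = Path(tmp)
        stimuli = [0, 1, -5, 1000]
        write_vector(tmp_dir / "stimuli.txt", stimuli)
        write_vector(tmp_dir / "expected.txt", model_under_test(stimuli))
        # A real flow would run the HDL simulation here. This sketch fakes its
        # response file and injects one wrong sample on purpose.
        actual = model_under_test(read_vector(tmp_dir / "stimuli.txt"))
        actual[2] += 1
        write_vector(tmp_dir / "actual.txt", actual)
        return compare(read_vector(tmp_dir / "expected.txt"),
                       read_vector(tmp_dir / "actual.txt"))
```

Calling run_demo() returns a single report line for the injected error at sample 2, which is the kind of information that helps to locate a mismatch quickly.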





Statistical Image Processing Engine

The next few slides describe the implementation of a statistical image processing

acceleration engine. The goal is not to exactly describe the engine discussed but to

pick out some interesting points.





Target Algorithm

Even though the image processing engine discussed was implemented in a reusable

way, one target algorithm and therefore an exact set of operations was already

known when the project started.

Only a few of the operations are straightforward to implement in FPGA fabric:

• Addition

• Subtraction

• Multiplication

• Per image operations

The other operations placed high demands on the design to achieve an efficient

implementation in FPGA fabric:

• Division and square root are implemented using Taylor approximations and shift operations

• Complex operations are implemented using CORDIC

• Multi-image operations require reading and buffering up to 32 images in parallel even though all other operations require only 2 input images

The region of interest (ROI) for the images to be processed must be configurable at

runtime.





Goals and Achievements

The algorithm to be accelerated was already implemented on the Cortex-A9

processor available in Zynq SoCs running at 600 MHz. The main goal of the customer

was to reduce the execution time for the algorithm from around 60 seconds to

around 5-10 seconds. This corresponds to a 6-12x speedup. CPU load should be

reduced to allow other tasks such as communication to run smoothly in the

background.

Thanks to offloading of all actual operations on images, the CPU load dropped

dramatically. The image processing engine runs at 100 MHz and can process one

pixel every clock cycle. This results in a full image operation execution time of around

1 ms and reduces the complete algorithm execution time to around 3 sec (including

multi-image operations, control overhead and more complex operations still

implemented in software), which corresponds to a speedup of 20x and therefore

already exceeds the initial goals of the customer.

Because of the nature of the target application, faster execution is beneficial even

beyond the initial goals. Therefore the customer decided to use four accelerators in

parallel (one for each of four parallel image streams) to further increase the speedup.

This results in a breathtaking speedup of 80x compared to the existing

implementation in software.
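
The speedup figures quoted above follow directly from the execution times. The short Python check below reproduces them; the only assumption is that four accelerators on four independent image streams scale linearly, which is what the 80x figure implies.

```python
def speedup(time_before_s, time_after_s):
    """Speedup factor between two execution times."""
    return time_before_s / time_after_s


SOFTWARE_TIME_S = 60.0                       # original software-only runtime

goal_min = speedup(SOFTWARE_TIME_S, 10.0)    # customer goal, lower bound
goal_max = speedup(SOFTWARE_TIME_S, 5.0)     # customer goal, upper bound
single_engine = speedup(SOFTWARE_TIME_S, 3.0)
four_engines = 4 * single_engine             # assumes ideal linear scaling
```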

As a nice side effect, the power consumption of the system dropped. However, power

consumption was not a main concern in this project.





General Architecture

The image processing engine discussed consists of five main parts:

• Control logic

    • Contains a register bank holding all settings (e.g. selection of operation, addresses of input and output images)

    • Regularly issues read and write operations to make sure that no overflows or underruns occur in the input/output buffers

    • Notifies the CPU via IRQ when the operation is completed

• AXI4-master interface

    • Reads input data from and writes output data to DDR memory directly

• Input buffer

    • This is basically a multi-channel FIFO which buffers the input images

    • Required because of the bursting nature of DDR memory accesses

• Output buffer

    • The output image also needs to be buffered

    • Required because of the bursting nature of DDR memory accesses

• Processing unit

    • Executes the operation selected





Multi Channel Input Buffer

The requirement for the input buffer is to contain enough data to keep the

processing unit active while a DDR memory read access is issued but not yet

completed. Unfortunately the response time of the DDR memory is strongly affected

by jitter because of the nature of DDR memories in general (e.g. refresh cycles) and

because other components such as the CPU and other processing units are accessing

the same DDR memory in parallel.

It was found that the buffer is required to contain enough data to keep the

processing unit running for about 40 µs which corresponds to 4096 pixels (8 kB) per

input for operations with one or two images. This results in a total of 8 BlockRAMs (2

kB each) required for the input buffer.

This sounds reasonable at first glance, but there are also the multi-image operations

with up to 32 input images. If a 4096 pixel buffer was implemented for all 32 input

images, a total of 256 kB buffer space would be required, which translates into 128

BlockRAMs. This is not reasonable to implement since the device targeted only

contains 240 BlockRAMs and more than one engine needs to be implemented.

Fortunately, all of the up to 32 images are processed time-interleaved (one pixel of every

image one after the other, then the next pixel, etc.). As a result, the processing engine

can be kept active for 40 µs not with 4096 pixels per image but with 4096 pixels in

total. This results in the initially estimated 8 BlockRAMs being sufficient but now

every image just gets less buffer space (128 pixels for 32 images, 256 pixels for 16

images, etc.).
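
The buffer sizing above can be checked with a few lines of Python. All numbers are taken from the text; the value of 2 bytes per pixel is an inference from "4096 pixels (8 kB)", and rounding 4000 pixels up to the next power of two is an assumption about how the 4096 figure was obtained.

```python
CLOCK_MHZ = 100          # engine clock, one pixel per clock cycle
HOLD_TIME_US = 40        # time the buffer has to bridge
BYTES_PER_PIXEL = 2      # inferred from "4096 pixels (8 kB)"
BRAM_BYTES = 2 * 1024    # "2 kB each"

pixels_needed = CLOCK_MHZ * HOLD_TIME_US                   # 4000 pixels in 40 us
buffer_pixels = 1 << (pixels_needed - 1).bit_length()      # next power of two


def brams_for(inputs, pixels_per_input):
    """Number of BlockRAMs needed if every input gets its own buffer."""
    total_bytes = inputs * pixels_per_input * BYTES_PER_PIXEL
    return -(-total_bytes // BRAM_BYTES)                   # ceiling division


def pixels_per_image(images):
    """Buffer share per image when all images are processed time-interleaved."""
    return buffer_pixels // images


two_inputs = brams_for(2, buffer_pixels)       # normal one- or two-image operation
naive_32 = brams_for(32, buffer_pixels)        # full buffer for each of 32 images
```

With time-interleaved processing, pixels_per_image(32) and pixels_per_image(16) give the per-image shares quoted above.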





Division Implementation

The first problem with divisions is that, in contrast to addition and subtraction as well as (thanks to embedded multipliers) multiplication, they cannot be implemented resource-efficiently in FPGA fabric. This problem was solved by using a Taylor approximation of the 1/x function together with a multiplication instead of implementing a binary divider.

The second problem is that the 1/x function doubles the number of bits required to represent the whole result range in appropriate precision. In our case this would lead to a Taylor approximation with 32-bit output, which is not reasonably implementable. To solve this problem, the operation was mathematically transformed into a form which only requires the Taylor approximation to be valid in the range between 0.5 and 1. This comes at the cost of two shift operators, which are easily implementable in FPGA fabric.

In other words: operations that are not technology friendly (division, full-range 1/x) were replaced by more technology-friendly operations (multiplication, Taylor approximation, shifts). This clearly shows that the efficient implementation of algorithms in FPGA fabric requires a lot of know-how about technology-friendly implementation approaches, which will never be replaceable by tools.

The implementation of the division also shows why the verification of a bit-true model of each operation by the customer is required: Even though the division of two 16-bit numbers with 32-bit result is mathematically 100% defined, the results of the implementation chosen may differ by a few LSBs. It is up to the customer to decide if this performance is sufficient or if a more precise implementation is required at the cost of more resources.
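
The transformation described on this slide can be sketched in Python as follows. This is a floating-point illustration of the idea, not the bit-true implementation: the expansion point 0.75, the polynomial order and the function names are assumptions chosen for the example. The divisor is normalized by a shift into the range 0.5 to 1, the reciprocal is approximated there by a truncated Taylor series, and a second shift undoes the normalization after the multiplication.

```python
def normalize(b):
    """Return (x, s) with x in [0.5, 1) and b == x * 2**s for an integer b > 0."""
    s = b.bit_length()            # first shift: align the MSB of the divisor
    return b / (1 << s), s


def reciprocal_taylor(x, order=8, x0=0.75):
    """Truncated Taylor series of 1/x around x0, intended for x in [0.5, 1).

    With d = (x0 - x) / x0, 1/x equals (1/x0) * (1 + d + d**2 + ...),
    and |d| is at most 1/3 in the range of interest.
    """
    d = (x0 - x) / x0
    acc, term = 0.0, 1.0
    for _ in range(order + 1):
        acc += term
        term *= d
    return acc / x0


def divide(a, b):
    """Approximate a / b using one multiplication and two shifts."""
    x, s = normalize(b)
    return a * reciprocal_taylor(x) / (1 << s)   # second shift: undo normalization
```

With the assumed order of 8, the relative error stays below about 5.1e-5 (the worst case is a divisor that is a power of two). A more precise result costs more terms and therefore more resources, which is exactly the trade-off the customer has to judge on the bit-true model.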





Bluetooth Transceiver Engine

The next few slides describe the implementation of a Bluetooth transceiver engine.

Again the goal is not to explain each and every detail but to select some interesting

points and discuss them.





Requirements

Let’s first answer the most obvious question: Why is anybody using SoC technology

for doing Bluetooth even though very cheap Bluetooth chips are available off-the-

shelf? The answer is that the target application is a Bluetooth qualification setup. This

means that not only communication via Bluetooth is required but also very

controlled signal generation including exact frequency offsets and other TX

(transmit) defects. On the RX (receive) side some additional measurements are

required too.

Software defined radio (SDR) allows controlling all parameters of the signal

processing exactly and changing them easily on the fly. SDR systems are not affected

by temperature effects and aging since the signal processing is defined by

algorithms and digitally stored parameters. As a result SDR is the ideal technology to

fulfill the high requirements of this project – and SoCs are the ideal platform to

implement SDR thanks to the power of parallel processing.

In parallel to the transceiver engine discussed, the qualification process involves RF

measurements. This is another reason why using a standard Bluetooth chip is not an

option. Note that the RF measurements are not included in the Bluetooth transceiver

engine. The received signal is recorded in parallel to the transceiver engine

responsible for communicating using the Bluetooth protocol.





Bluetooth Protocol Basics

To understand the following slides, it is important to know the basics of the

Bluetooth low-level protocol. Therefore the protocol is presented in a simplified

form.

The most basic packet type is a basic rate (BR) packet. It is modulated using GFSK,

which is a type of frequency modulation. One bit is transferred with every symbol

and the symbol rate is 1 MSPS.

To improve the data rate, enhanced data rate (EDR) packets come into play. The

meta information is encoded equally to the BR packets but the data is encoded using

DPSK, which is a type of phase modulation. Two (EDR2, 4-DPSK) or three (EDR3, 8-

DPSK) bits are transferred with every symbol but the symbol rate is unchanged at 1

MSPS.

To reduce power consumption, an additional packet type is defined: The low energy

(BLE) packet. It uses the same modulation type as the BR packet but has a different

packet structure and slightly different modulation parameters.

For improved data rate at low power consumption, the BLE 2 Mbps packet type is

used. It is identical to the BLE 1 Mbps packet except that the symbol rate is doubled to 2 MSPS.

The packet types mentioned lead to the requirement for implementing three

different modulation/demodulation schemes and two different symbol rates for

GFSK. This is important to understand the general structure of the engine.
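
The packet types described above can be summarized in a small Python table. The dictionary keys are labels chosen for this example, the modulation listed is the one used for the payload, and the computed rate is the raw payload bit rate (bits per symbol times symbol rate) without any protocol overhead.

```python
# label: (payload modulation, bits per symbol, symbol rate in Msymbols/s)
PACKET_TYPES = {
    "BR": ("GFSK", 1, 1),
    "EDR2": ("4-DPSK", 2, 1),
    "EDR3": ("8-DPSK", 3, 1),
    "BLE 1 Mbps": ("GFSK", 1, 1),
    "BLE 2 Mbps": ("GFSK", 1, 2),
}


def raw_bit_rate_mbps(packet_type):
    """Raw payload bit rate in Mbit/s for the given packet type."""
    _, bits_per_symbol, symbol_rate = PACKET_TYPES[packet_type]
    return bits_per_symbol * symbol_rate


# The requirements derived in the text:
modulations = {mod for mod, _, _ in PACKET_TYPES.values()}
gfsk_symbol_rates = {rate for mod, _, rate in PACKET_TYPES.values() if mod == "GFSK"}
```

The two derived sets contain three modulation schemes and two GFSK symbol rates, matching the requirement stated above.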





General Architecture

To transmit a packet, the information about the packet including the payload data is

written into the packet generator via the AXI4-slave interface of the offloading

engine. The packet generator then assembles the packet according to the protocol,

calculates CRC checksums, does forward error correction coding, etc.

The binary data is then modulated using the appropriate modulation scheme. A

controlled symbol rate error can be introduced within the modulator, which requires

a high precision resampling.

For EDR packets, the first part of the packet is GFSK modulated while the payload is

DPSK modulated. The fader is responsible for softly switching between the

modulation schemes to avoid transmitting wideband noise due to hard transitions.

The signal conditioning unit is used to add a well defined frequency offset and set a

signal gain.
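
What the signal conditioning unit does to a complex baseband signal can be sketched in a few lines of Python. This is a floating-point illustration under assumptions: the function name and parameters are made up for the example, and a fabric implementation would work in fixed-point, for instance with a phase accumulator and a rotation stage.

```python
import cmath
import math


def condition(samples, f_offset_hz, f_sample_hz, gain):
    """Apply a frequency offset and a gain to complex baseband samples.

    Each sample is rotated by a phase that advances linearly with the
    sample index, which shifts the spectrum by f_offset_hz.
    """
    phase_step = 2.0 * math.pi * f_offset_hz / f_sample_hz
    return [gain * s * cmath.exp(1j * phase_step * n)
            for n, s in enumerate(samples)]
```

For example, a constant input with an offset of 1 MHz at a sample rate of 8 MHz advances by an eighth of a turn per sample, so the third sample is rotated by a quarter turn.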

The resampling and filtering unit changes the sampling rate from the internally used

8/16 MHz (16 MHz for BLE 2 Mbps, 8 MHz for all other packet types) to 2 MHz used

by the RF frontend.

If a packet is received, the operations are inverted. First the signal is converted to the

internally used sample rate of 8/16 MHz. The demodulators then extract the binary

data and the packet receiver detects packets and decodes them.

The CPU is notified whenever a packet is received and can read the packet data via

the AXI4-slave interface.





Loopback Testing in Simulations and on Hardware

To achieve a good test coverage and find all issues before delivering the offloading

engine, a thorough testing concept was implemented in addition to the normal

regression testing concept in use at Enclustra (self-checking regression tests).

In simulations the whole processing chain was simulated in a loopback configuration

(blue) for only a handful of packets due to long simulation runtimes.

Additionally several loopback paths are implemented and can be activated on

hardware. This allows testing many thousands of packets within a short time. Thanks

to multiple loopback paths (red), any problems can be roughly located very quickly

which saves a lot of debugging time. Before delivering the Bluetooth transceiver

engine to the customer, loopback tests for over a million packets were run including

the RF frontend and the antenna.

Thanks to the consideration of loopback tests early in the concept phase, care was

taken to design the system symmetrically (e.g. same sample rates and number

formats on RX and TX side) to ease the implementation of loopback tests. This

greatly reduced the testing effort.





Efficient FIR Filter Implementation using FPGA Fabric

FIR filters are one of the most cited examples for the parallel processing power of

SoCs and FPGAs. They fit the strengths of FPGA fabric very well because of its

multiply-accumulate power provided by DSP slices.

For one given filter within the receive path of the Bluetooth transceiver engine, a 64-

tap FIR filter with a sample rate of 16 MSPS is required. In this case a fully parallel

implementation of the FIR filter is not efficient, since it would require many resources

(64 DSP slices) but run at a clock speed way below what FPGA fabric is capable of

(16 MHz). A fully serial implementation using only one DSP slice is not possible either

since this would require a clock speed of over 1 GHz which is far beyond the

possibilities of FPGA fabric.

The implementation chosen contains 8 DSP slices doing multiply accumulate

operations. It therefore requires 8 clock cycles to execute all 64 multiplications. An

additional DSP slice is used to sum up the results of all 8 clock cycles. This partially

parallel architecture allows using a small number of DSP slices (8 pieces) at a

reasonable clock speed (128 MHz) to achieve the performance required.

Considering that even the smallest Xilinx SoC device contains 80 DSP slices, this

example shows that with 10% of the DSP resources of the smallest SoC device, a filter

performance of over 1 GMAC/s is easily achievable and that this is not just a

theoretical number but proven in a real-world project.
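
The numbers for the partially parallel filter can be reproduced with the Python sketch below, which also models how one output sample is computed. The mapping of taps to slices and clock cycles is an assumption made for this illustration; the text only states that 8 multiply-accumulate slices work for 8 clock cycles and that one more slice sums up the results.

```python
TAPS = 64
SAMPLE_RATE_MHZ = 16
MAC_SLICES = 8

serial_clock_mhz = TAPS * SAMPLE_RATE_MHZ          # one slice only: 1024 MHz
cycles_per_sample = TAPS // MAC_SLICES             # 8 clock cycles per output
clock_mhz = SAMPLE_RATE_MHZ * cycles_per_sample    # 128 MHz
mac_rate_gmacs = MAC_SLICES * clock_mhz / 1000     # multiply-accumulates per second


def fir_output(coeffs, window):
    """Compute one output sample; window[k] is the input paired with coeffs[k]."""
    acc = 0
    for cycle in range(cycles_per_sample):
        base = cycle * MAC_SLICES
        # In each clock cycle, the 8 slices process 8 taps in parallel.
        partial = sum(coeffs[base + s] * window[base + s]
                      for s in range(MAC_SLICES))
        # The additional slice accumulates the partial results of all cycles.
        acc += partial
    return acc
```

The result is identical to a direct dot product over all 64 taps; only the schedule of the multiplications differs.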





Modeling Signal Processing Paths – The tv-Concept

In the project discussed, Enclustra created bit-true models of all signal processing

components and the customer could verify the performance of the algorithm

developed before implementation started.

Unfortunately many signal processing elements such as filters introduce delay and/or

change the sample rate. It is therefore difficult to track a given signal across the

whole processing chain and compare the same parts of the signal (in terms of «the

signal related to the same symbols») at various stages of the processing.

To improve this situation, the tv-concept was used. This means a signal always

consists of a value vector (v) and a time vector (t). The time vector is changed along

with the processing. A processing delay leads to a shift on the time axis, an up- or

down-sampling leads to an interpolation or decimation of the time vector. Using this

concept, data can always be plotted against the time vector and is aligned correctly

for analysis.
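
A minimal Python sketch of the tv-concept is shown below. The class and function names are assumptions for this example, and the sign convention of the time shift is a modeling choice that the text does not specify.

```python
from dataclasses import dataclass


@dataclass
class Signal:
    t: list   # time of each sample, e.g. in microseconds
    v: list   # value of each sample


def shift_time(sig, shift_us):
    """Model a processing delay: values are unchanged, the time axis is shifted."""
    return Signal([ti + shift_us for ti in sig.t], list(sig.v))


def decimate(sig, factor):
    """Keep every factor-th sample; the time vector is decimated along with it."""
    return Signal(sig.t[::factor], sig.v[::factor])
```

Because every processing step returns the matching time vector, signals taken from different stages can be plotted on one common time axis without manually compensating for delays or rate changes.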

The tv-concept may seem like a purely MATLAB specific issue but it is important in

general for the development of signal processing offloading engines since it makes

models created by the «offloading engine designer» easily understandable to the

«application engineer» approving them. The communication between these two

parties is crucial for efficiency and success, exactly as communication is in general

in engineering projects that multiple parties work on.





Example for tv-Concept Benefits: Constellation Plots

Thanks to the tv-concept, the state of the output signal for every sampling instant

of a symbol can easily be displayed even if the signal contains symbol rate errors or

other defects required by the project discussed.

This is a good example for the benefits of the tv-concept since the customer can

easily check if the results are within the specification based on these plots. Easier

means less error-prone. At this point we again see that good, communication-friendly

design practices can significantly reduce risk.





Project Achievements

Thanks to the design methodology used and a thorough testing concept, there was

no need to touch the algorithm after the implementation. There were also no

significant bugs discovered after delivery. As a result of these points, the project was

delivered on schedule and without any additional effort required.

The computing power of the offloading engine is quite high with 4.5 GMAC/sec and

the CPU is 100% available for the protocol handling and qualification flow control.





SoC Benefits

The two examples discussed illustrate the power of SoCs. Properly designed

offloading engines can boost the system performance and at the same time reduce

the CPU load. In the statistical image processing example this led to significantly

reduced runtimes. The realization of the Bluetooth transceiver example would not

even have been possible with a CPU approach since performance requirements could

not be fulfilled.

The statistical image processing engine is a very good example for a reusable

offloading engine. The customer paid for development once but can benefit from

the results in multiple product generations.

In the case of the Bluetooth transceiver, the abstraction level of the software written

by the customer could be raised significantly. The customer needs to take care

neither of the signal processing nor of the low-level protocol encoding and

decoding. All these things are fully implemented by Enclustra in the offloading

engine.

One common point of both projects discussed is that they show that successful

outsourcing of the offloading engine development is possible.





No Gain without Brain

An FPGA is not a CPU and it is not programmable the same way as a CPU.

Programming an FPGA means designing a chip and requires a significantly different

skillset and experience.

A good example for this is the fact that algorithms often need to be converted into a

form that is efficiently implementable in FPGA fabric. To do so, deep technology

specific know-how is required. As a result the decision to either strongly invest in

building up this know-how or to outsource it must be made. Anything between these

two options is likely to lead to suboptimal results.

There are many EDA tools such as SDx from Xilinx, which help speed up the

implementation process and significantly increase productivity. It is important to

understand that these tools really only help with the implementation

and do not replace the technology specific know-how required to find an efficient

way to implement a given task in FPGA fabric.

High-level design entry tools for FPGA fabric are comparable to compilers in the

software world. They can certainly help you raise the abstraction level and move

forward faster but most of the performance and memory footprint of any application

are given by architectural decisions. No compiler or programming language will ever

choose a good software architecture. It just implements the architecture chosen by

the engineer in an efficient way.





Communication Matters

During the development of the offloading engines discussed in this presentation, no

significant communication problems occurred. This did not happen by chance but required a

lot of attention and the availability of the tools required (e.g. bit-true libraries) and

experience from earlier projects.

The problem of communication between different engineering disciplines is often

underestimated. Efficient communication requires each engineer to have a basic

knowledge of the area the others are working in.

With SoCs, the requirement for communication has significantly increased since tasks

can be moved from fabric to the CPU and vice versa virtually seamlessly. To benefit

from this, each side needs to be able to estimate the feasibility of solving a problem

in the other part of the SoC. As a result of this, the ideal SoC engineer should have

experience in software and FPGA development.

SoCs also increase the requirement for thorough testing. Even though each

engineering discipline (software, FPGA) is good at debugging the things it created,

much time is lost if the part created by the other discipline does not work. The only

way to avoid delays caused by this problem is to implement thorough testing

concepts on both sides before integrating the whole system.

The importance of communication is not affected by the general project setup. It

applies if software and offloading engine are developed by different companies (as

in the examples) as well as if they are developed by different teams of the same

company or even by different engineers in one team.




