
Project Documentation


1. INTRODUCTION

1.1. INTRODUCTION:

A filter is a component that passes a certain band of frequencies and attenuates the other
frequency components. Filters are a basic building block of Digital Signal Processing (DSP)
applications. Digital filters fall into two classes: Finite Impulse Response (FIR) filters and
Infinite Impulse Response (IIR) filters.

An FIR filter is a digital filter whose impulse response has a finite number of samples: the
response settles to zero after the final sample interval. The impulse response of an IIR filter,
in contrast, can continue over an infinite number of samples.

In this project we designed an FIR filter that uses fewer resources and has less delay by
applying the Distributed Arithmetic (DA) algorithm. A direct implementation based on Multiply
and Accumulate (MAC) operations consumes considerable area (resources) and is expensive to
implement on an FPGA. DA was introduced to overcome this drawback: it is a multiplier-less
architecture and a very efficient solution, especially suited to FPGA architectures based on
look-up tables (LUTs).

The main problem of DA is that the LUT size will increase exponentially with the order

of the filter. To overcome this problem a hardware-efficient DA architecture is used which

reduces the LUT size by modifying the architecture of the filter to achieve high performance.

CVR College of Engineering (VLSI) Page 1


1.2. FIR Filter:

The transfer function of an FIR filter has a single coefficient polynomial, B(z). An FIR filter
needs a much higher-order polynomial than an IIR filter to obtain an equivalent response, which
results in a longer delay.

$$H(z) = \frac{B(z)}{z^{N}}$$

$$y[n] = b_0 x[n] + b_1 x[n-1] + b_2 x[n-2] + \cdots + b_N x[n-N]$$

N is the filter order; an Nth-order filter has (N + 1) terms on the right-hand side, which are
commonly referred to as taps.

This equation can also be expressed as a convolution of the coefficient sequence b_i with the
input signal:

$$y[n] = \sum_{i=0}^{N} b_i x[n-i]$$

That is, the filter output is a weighted sum of the current and a finite number of previous
values of the input.
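The weighted sum can be sketched in a few lines of Python. This is only an illustration of the difference equation, not the hardware design of this project; the function name `fir_filter`, the coefficients and the input values are hypothetical, and samples before the start of the input are assumed to be zero.

```python
def fir_filter(b, x):
    """Return y[n] = sum_i b[i] * x[n - i], taking x[n] = 0 for n < 0."""
    y = []
    for n in range(len(x)):
        acc = 0
        for i, coeff in enumerate(b):
            if n - i >= 0:              # skip samples before the start of x
                acc += coeff * x[n - i]
        y.append(acc)
    return y


# Hypothetical 3rd-order filter (N = 3, so N + 1 = 4 taps) and input signal.
b = [1, 2, 3, 4]
x = [1, 1, 1, 1, 1]
print(fir_filter(b, x))
```

Each output value uses the current input and the three previous inputs, one per tap.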


1.3. Block Diagram of FIR filter:

Fig 1.1. Block diagram of FIR filter

1.4. Spartan-3E features:

The Spartan-3E family reduces system cost by offering the lowest cost-per-logic of any FPGA family, supporting the lowest-cost configuration solutions including commodity serial (SPI) and parallel flash memories, and efficiently integrating the functions of many chips into a single FPGA.

Advanced, Low-Cost Features

Five devices with 100K to 1.6M system gates

From 66 to 376 I/Os with package and density migration

Up to 648K bits of block RAM and up to 231K bits of distributed RAM

Up to 36 embedded 18x18 multipliers for high-performance DSP applications

Up to eight Digital Clock Managers


Cost-Saving System Interfaces and Solutions

Support for Xilinx Platform Flash as well as commodity serial (SPI) and byte-wide flash memory for configuration

Easy-to-implement interfaces to DDR memory

Support for 18 common I/O standards, including PCI-X, mini-LVDS, and RSDS

Industry-Leading Design Tools and IP

ISE design tools to shorten design and verification time

Hundreds of pre-verified, pre-optimized Intellectual Property (IP) cores and reference designs

ChipScope Pro™ system-debugging environment

Easy-to-Use, Low-Cost FPGA Development Systems

Complete Spartan-3E Starter Kit available for only $149 USD

Includes XC3S500E FPGA, SPI Flash, 32Mb DDR memory, and support for USB 2.0

1.5. CONCLUSION:

In this chapter we discussed the FIR filter and its block diagram, and described the features
of the Spartan-3E FPGA family.


2. LITERATURE SURVEY

2.1 INTRODUCTION:

A signal carries information from a source to a destination, and there are different types of
signals. Filters play an essential role in Digital Signal Processing (DSP): a filter is a system
that passes certain frequency components and rejects others. Filters are designed to meet
specifications of the desired properties of the system. An FPGA is a programmable device that is
commonly used to prototype and implement such algorithms.

2.2. Signal:

In the fields of communications and signal processing, and in electrical engineering more
generally, a signal is any time-varying or spatially varying quantity.

In the physical world, any quantity measurable through time or over space can be taken as a
signal. Within a complex society, any set of human information or machine data can also be taken
as a signal. Such information or machine data must all be part of systems existing in the
physical world, either living or non-living.

Despite the complexity of such systems, their outputs and inputs can often be represented as
simple quantities measurable through time or across space. In the latter half of the 20th
century, electrical engineering itself separated into several disciplines, specializing in the
design and analysis of physical signals and systems on the one hand, and in the functional
behavior and conceptual structure of complex human and machine systems on the other. These
engineering disciplines have led the way in the design, study, and implementation of systems
that take advantage of signals as simple measurable quantities in order to facilitate the
transmission, storage and manipulation of information.

2.2.1. Definition of the signal:

In information theory, a signal is a codified message, that is, the sequence of states in a

communication channel that encodes a message.

In the context of signal processing, arbitrary binary data streams are not considered as signals,

but only analog and digital signals that are representations of analog physical quantities.

In a communication system, a transmitter encodes a message into a signal, which is

carried to a receiver by the communications channel. For example, the words "Mary had a little

lamb" might be the message spoken into a telephone. The telephone transmitter converts the

sounds into an electrical voltage signal. The signal is transmitted to the receiving telephone by

wires; and at the receiver it is reconverted into sounds.

In telephone networks, signaling, for example common channel signaling, refers to phone

number and other digital control information rather than the actual voice signal.

Signals can be categorized in various ways. The most common distinction is between

discrete and continuous spaces that the functions are defined over, for example discrete and

continuous time domains. Discrete-time signals are often referred to as time series in other

fields. Continuous-time signals are often referred to as continuous signals even when the signal

functions are not continuous; an example is a square-wave signal.

A second important distinction is between discrete-valued and continuous-valued. Digital

signals are sometimes defined as discrete-valued sequences of quantified values that may or may

not be derived from an underlying continuous-valued physical process. In other contexts, digital

signals are defined as the continuous-time waveform signals in a digital system, representing a

bit-stream. In the first case, a signal that is generated by means of a digital modulation method is

considered as converted to an analog signal, while it is considered as a digital signal in the

second case.


2.2.2. Types of Signals:

2.2.2.1. Discrete-time and continuous time signal:

If the quantities of a signal are defined only on a discrete set of times, we call it a
discrete-time signal. In other words, a discrete-time real (or complex) signal can be seen as a
function from the set of integers to the set of real (or complex) numbers. Discrete-time signals
can be analyzed in the frequency domain; the Z-transform is usually used to obtain their
frequency response. A discrete-time signal is denoted by u(k), where k = ..., -1, 0, 1, 2, 3, ...

A continuous-time real (or complex) signal is any real-valued (or complex-valued) function
which is defined for all time t in an interval, most commonly an infinite interval.
Continuous-time signals have a continuous frequency spectrum; the Fourier Transform (FT) is used
to obtain their frequency response. A continuous-time signal is denoted by u(t), where t is
continuous.

2.2.2.2. Analog and Digital signal:

There are mainly two types of signals encountered in practice, analog and digital. In

short, the difference between them is that digital signals are discrete and quantized, as defined

below, while analog signals possess neither property.

DISCRETIZATION:

One of the fundamental distinctions between different types of signals is

between continuous and discrete time. In the mathematical abstraction, the domain of a

continuous-time (CT) signal is the set of real numbers (or some interval thereof), whereas the


domain of a discrete-time (DT) signal is the set of integers (or some interval). What these

integers represent depends on the nature of the signal.

DT signals often arise via sampling of CT signals. An audio signal, for example, consists of a
continually fluctuating voltage on a line that can be digitized by an analog-to-digital
converter (ADC) circuit, which reads the voltage level on the line, say, every 50 µs. The
resulting stream of numbers is stored as digital data in the form of a discrete-time signal.
Computers and other digital devices are restricted to discrete time.

QUANTIZATION:

If a signal is to be represented as a sequence of numbers, it is impossible to maintain

arbitrarily high precision - each number in the sequence must have a finite number of digits. As a

result, the values of such a signal are restricted to belong to a finite set; in other words, it

is quantized.
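Discretization and quantization can be illustrated together with a short Python sketch. The 50 µs sampling period comes from the text above; everything else is an assumption made for the example: the 1 kHz test tone, the 3-bit word length, the full-scale range of -1 to +1, and the helper names `sample` and `quantize`.

```python
import math

SAMPLE_PERIOD_S = 50e-6  # 50 microseconds between readings (20 kHz rate)


def sample(signal, period_s, count):
    """Discretization: read a continuous-time signal at t = k * period_s."""
    return [signal(k * period_s) for k in range(count)]


def quantize(value, bits, full_scale=1.0):
    """Quantization: round to the nearest of 2**bits uniformly spaced levels."""
    levels = 2 ** bits
    step = 2.0 * full_scale / levels
    code = round(value / step)
    # Clamp to the representable two's-complement style code range.
    code = max(-(levels // 2), min(levels // 2 - 1, code))
    return code * step


def tone(t):
    """Hypothetical 1 kHz sine wave standing in for the analog voltage."""
    return math.sin(2.0 * math.pi * 1_000.0 * t)


samples = sample(tone, SAMPLE_PERIOD_S, 20)          # discrete in time
quantized = [quantize(s, bits=3) for s in samples]   # discrete in value
print(sorted(set(quantized)))
```

With 3 bits the quantized sequence can take at most eight distinct values, however finely the original voltage varies.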

2.3 Filters in signal processing:

In signal processing, a filter is a device or process that removes some unwanted component or
feature from a signal. In general, it takes an input that is a function of time and produces an
output that is a function of time (usually delayed from the input).

Filtering is a class of signal processing, the defining feature of filters being the complete

or partial suppression of some aspect of the signal. Most often, this means removing

some frequencies and not others in order to suppress interfering signals and reduce

background noise. However, filters do not exclusively act in the frequency domain; especially in

the field of image processing many other targets for filtering exist.

There are many different bases for classifying filters, and these overlap in many different
ways; there is no simple hierarchical classification. Filters may be:

analog or digital

discrete-time (sampled) or continuous-time

linear or non-linear

passive or active (types of continuous-time filter)

infinite impulse response (IIR) or finite impulse response (FIR) (types of discrete-time or digital filter).

2.3.1. Analog Filter:

Analog filters are a basic building block of signal processing much used in electronics.

Amongst their many applications are the separation of an audio signal before application

to bass, mid-range and tweeter loudspeakers; the combining and later separation of multiple

telephone conversations onto a single channel; the selection of a chosen radio station in a radio

receiver and rejection of others.

Passive linear electronic analog filters are those filters which can be described with linear
differential equations (linear); they are composed of capacitors, inductors and, sometimes,
resistors (passive) and are designed to operate on continuously varying (analog) signals. There
are many linear filters which are not analog in implementation (digital filters), and there are
many electronic filters which may not have a passive topology; both may have the same transfer
function as the filters described in this section. Analog filters are most often used in wave
filtering applications, that is, where it is required to pass particular frequency components
and to reject others from analog (continuous-time) signals.

2.3.2. Digital Filters:

In electronics, computer science and mathematics, a digital filter is a system that

performs mathematical operations on a sampled, discrete-time signal to reduce or enhance

certain aspects of that signal. This is in contrast to the other major type of electronic filter,

the analog filter, which is an electronic circuit operating on continuous-time analog signals. An


analog signal may be processed by a digital filter by first being digitized and represented as a

sequence of numbers, then manipulated mathematically, and then reconstructed as a new analog

signal. In an analog filter, the input signal is "directly" manipulated by the circuit.

A digital filter system usually consists of an analog-to-digital converter (to sample the

input signal), a microprocessor (often a specialized digital signal processor), and a digital-to-

analog converter. Software running on the microprocessor can implement the digital filter by

performing the necessary mathematical operations on the numbers received from the ADC. In

some high performance applications, an FPGA or ASIC is used instead of a general purpose

microprocessor.

Digital filters may be more expensive than an equivalent analog filter due to their

increased complexity, but they make practical many designs that are impractical or impossible as

analog filters. Since digital filters use a sampling process and discrete-time processing, they

experience latency (the difference in time between the input and the response), which is almost

irrelevant in analog filters.

Digital filters are commonplace and an essential element of everyday electronics such

as radios, cell phones, and stereo receivers.

2.3.3. Passive filter:

Passive implementations of linear filters are based on combinations

of resistors (R), inductors (L) and capacitors (C). These types are collectively known as passive

filters, because they do not depend upon an external power supply and/or they do not contain

active components such as transistors.

Inductors block high-frequency signals and conduct low-frequency signals,

while capacitors do the reverse. A filter in which the signal passes through an inductor, or in

which a capacitor provides a path to ground, presents less attenuation to low-frequency signals

than high-frequency signals and is a low-pass filter. If the signal passes through a capacitor, or

has a path to ground through an inductor, then the filter presents less attenuation to high-

frequency signals than low-frequency signals and is a high-pass filter. Resistors on their own


have no frequency-selective properties, but are added to inductors and capacitors to determine

the time-constants of the circuit, and therefore the frequencies to which it responds.
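As a rough numerical illustration of how the time constant sets the frequencies a passive filter responds to, the sketch below evaluates the standard first-order RC magnitude responses. The component values (1 kΩ, 100 nF) and the function names are hypothetical and are not taken from this project.

```python
import math


def rc_cutoff_hz(r_ohm, c_farad):
    """Cutoff (-3 dB) frequency of a first-order RC section: 1 / (2*pi*R*C)."""
    return 1.0 / (2.0 * math.pi * r_ohm * c_farad)


def lowpass_gain(freq_hz, r_ohm, c_farad):
    """Gain magnitude when the capacitor provides the path to ground."""
    w_rc = 2.0 * math.pi * freq_hz * r_ohm * c_farad
    return 1.0 / math.sqrt(1.0 + w_rc ** 2)


def highpass_gain(freq_hz, r_ohm, c_farad):
    """Gain magnitude when the signal passes through the capacitor."""
    w_rc = 2.0 * math.pi * freq_hz * r_ohm * c_farad
    return w_rc / math.sqrt(1.0 + w_rc ** 2)


R, C = 1_000.0, 100e-9          # hypothetical values: 1 kOhm and 100 nF
fc = rc_cutoff_hz(R, C)
for f in (fc / 10, fc, fc * 10):
    print(f"{f:10.1f} Hz  low-pass {lowpass_gain(f, R, C):.3f}"
          f"  high-pass {highpass_gain(f, R, C):.3f}")
```

Changing R or C moves the cutoff frequency, which is the sense in which the resistor determines the frequencies the circuit responds to.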

The inductors and capacitors are the reactive elements of the filter. The number of

elements determines the order of the filter. In this context, an LC tuned circuit being used in a

band-pass or band-stop filter is considered a single element even though it consists of two

components.

At high frequencies (above about 100 megahertz), sometimes the inductors consist of

single loops or strips of sheet metal, and the capacitors consist of adjacent strips of metal. These

inductive or capacitive pieces of metal are called stubs.

2.3.4. Active Filter:

Active filters are implemented using a combination of passive and active (amplifying)

components, and require an outside power source. Operational amplifiers are frequently used in

active filter designs. These can have high Q, and can achieve resonance without the use of

inductors. However, their upper frequency limit is limited by the bandwidth of the amplifiers

used.

2.3.5. Linear- Continuous time filter:

Linear continuous-time circuit is perhaps the most common meaning for filter in the

signal processing world, and simply "filter" is often taken to be synonymous. These are filters

that are designed to remove certain frequencies and allow others to pass. Such a filter is, of

necessity, a linear filter. Any non-linearity will result in the output signal containing components

of frequency which were not present in the input signal.

The modern design methodology for linear continuous-time filters is called network
synthesis. Some important filter families designed in this way are:

Chebyshev filter: has the best approximation to the ideal response of any filter for a specified order and ripple.

Butterworth filter: has a maximally flat frequency response.

Bessel filter: has a maximally flat phase delay.

Elliptic filter: has the steepest cutoff of any filter for a specified order and ripple.

The difference between these filter families is that they all use a different polynomial

function to approximate to the ideal filter response. This results in each having a

different transfer function.

An older methodology, largely obsolete but still encountered now and again, is the image
parameter method. Filters designed by this methodology are archaically called "wave filters".
Some important filters designed by this method are:

Constant k filter: the original and simplest form of wave filter.

M-derived filter: a modification of the constant k filter with improved cutoff steepness and impedance matching.

2.3.6. Terminology to classify linear filter:

Some terms used to describe and classify linear filters:

The frequency response can be classified into a number of different band

forms describing which frequencies the filter passes (the pass band) and which it rejects

(the stop band);

Low-pass filter  – low frequencies are passed, high frequencies are attenuated.

High-pass filter – high frequencies are passed, low frequencies are attenuated.

Band-pass filters  – only frequencies in a frequency band are passed.

Band-stop filter  or band-reject filters – only frequencies in a frequency band are

attenuated.


Notch filter  – rejects just one specific frequency - an extreme band-stop filter.

Comb filter  – has multiple regularly spaced narrow pass bands giving the band form the

appearance of a comb.

All-pass filter  – all frequencies are passed, but the phase of the output is modified.

Cutoff frequency  is the frequency beyond which the filter will not pass signals. It is

usually measured at a specific attenuation such as 3dB.

Roll-off  is the rate at which attenuation increases beyond the cut-off frequency.

Transition band , the (usually narrow) band of frequencies between a pass band and stop

band.

Ripple is the variation of the filter's insertion loss in the pass band.

The order of a filter is the degree of the approximating polynomial and in passive filters

corresponds to the number of elements required to build it. Increasing order increases

roll-off and brings the filter closer to the ideal response.

2.3.7. FIR Filter:

A Finite Impulse Response (FIR) filter is a type of a digital filter. The impulse response,

the filter's response to a Kronecker delta input, is finite because it settles to zero in a finite

number of sample intervals. This is in contrast to Infinite Impulse Response (IIR) filters, which

have internal feedback and may continue to respond indefinitely. The impulse response of an

Nth-order FIR filter lasts for N + 1 samples and then settles to zero.

The difference equation that defines the output of an FIR filter in terms of its input is:

$$y[n] = b_0 x[n] + b_1 x[n-1] + b_2 x[n-2] + \cdots + b_N x[n-N]$$

where:

x[n] is the input signal,

y[n] is the output signal,

b_i are the filter coefficients, and

N is the filter order; an Nth-order filter has (N + 1) terms on the right-hand side, which are commonly referred to as taps.

This equation can also be expressed as a convolution of the coefficient sequence b_i with
the input signal:

$$y[n] = \sum_{i=0}^{N} b_i x[n-i]$$

That is, the filter output is a weighted sum of the current and a finite number of previous
values of the input.
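The finite length of the impulse response can be seen with a small streaming model of a tapped delay line, which is closer to how the filter is built in hardware than a whole-array calculation. This is a sketch under assumptions: the class name `TappedDelayLine` and the coefficient values are hypothetical.

```python
from collections import deque


class TappedDelayLine:
    """Streaming FIR filter: one input sample in, one output sample out."""

    def __init__(self, coeffs):
        self.coeffs = list(coeffs)
        # One register per tap, all cleared to zero at start-up.
        self.delay = deque([0] * len(self.coeffs), maxlen=len(self.coeffs))

    def step(self, sample):
        # Newest sample enters at index 0; the oldest one drops off the end.
        self.delay.appendleft(sample)
        return sum(c * s for c, s in zip(self.coeffs, self.delay))


# Hypothetical 2nd-order filter (N = 2, three taps) driven by a Kronecker delta.
fir = TappedDelayLine([3, -1, 2])
impulse = [1, 0, 0, 0, 0, 0]
response = [fir.step(s) for s in impulse]
print(response)
```

The response reproduces the N + 1 coefficients in order and is zero afterwards, because the single non-zero sample has been shifted out of the delay line.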

2.3.8. IIR filters:

Infinite Impulse Response (IIR) is a property of signal processing systems. Systems with

this property are known as IIR systems or, when dealing with filter systems, as IIR filters. IIR

systems have an impulse response function that is non-zero over an infinite length of time. This

is in contrast to FIR filters, which have fixed-duration impulse responses. The simplest analog IIR filter

is an RC filter made up of a single resistor (R) feeding into a node shared with a

single capacitor (C). This filter has an exponential impulse response characterized by an RC time

constant.

IIR filters may be implemented as either analog or digital filters. In digital IIR filters, the

output feedback is immediately apparent in the equations defining the output. Note that unlike

with FIR filters, in designing IIR filters it is necessary to carefully consider "time zero" case in

which the outputs of the filter have not yet been clearly defined.

Design of digital IIR filters is heavily dependent on that of their analog counterparts

because there are plenty of resources, works and straightforward design methods concerning

analog feedback filter design while there are hardly any for digital IIR filters. As a result,


usually, when a digital IIR filter is going to be implemented, an analog filter (e.g. Chebyshev

filter, Butterworth filter, Elliptic filter) is first designed and then is converted to a digital filter by

applying discretization techniques such as Bilinear transform or Impulse invariance.

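Digital filters are often described and implemented in terms of the difference
equation that defines how the output signal is related to the input signal:

$$y[n] = \frac{1}{a_0}\left(b_0 x[n] + b_1 x[n-1] + \cdots + b_P x[n-P] - a_1 y[n-1] - a_2 y[n-2] - \cdots - a_Q y[n-Q]\right)$$

where:

P is the feed-forward filter order

b_i are the feed-forward filter coefficients

Q is the feedback filter order

a_i are the feedback filter coefficients

x[n] is the input signal

y[n] is the output signal.

A more condensed form of the difference equation is:

$$y[n] = \frac{1}{a_0}\left(\sum_{i=0}^{P} b_i x[n-i] - \sum_{j=1}^{Q} a_j y[n-j]\right)$$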
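A recursive difference equation of this kind can be sketched as follows, assuming the common form y[n] = (1/a_0)(Σ b_i x[n-i] - Σ a_j y[n-j]) with feed-forward coefficients `b` and feedback coefficients `a`. The function name, the coefficient values and the choice of zero initial conditions (one way of handling the "time zero" case) are assumptions made for the example.

```python
def iir_filter(b, a, x):
    """Evaluate y[n] = (sum_i b[i]*x[n-i] - sum_{j>=1} a[j]*y[n-j]) / a[0].

    Inputs and outputs before n = 0 are taken to be zero (initial rest).
    """
    y = []
    for n in range(len(x)):
        acc = 0.0
        for i, bi in enumerate(b):          # feed-forward part
            if n - i >= 0:
                acc += bi * x[n - i]
        for j in range(1, len(a)):          # feedback part
            if n - j >= 0:
                acc -= a[j] * y[n - j]
        y.append(acc / a[0])
    return y


# Hypothetical first-order example: y[n] = x[n] + 0.5 * y[n-1].
impulse = [1, 0, 0, 0, 0, 0, 0, 0]
print(iir_filter([1.0], [1.0, -0.5], impulse))
```

Because each output is fed back, the response to a single impulse halves at every step but never becomes exactly zero, in contrast to the FIR case.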


2.4. FPGA:

FPGAs offer an opportunity to accelerate your digital signal processing application up to

1000 times over a traditional DSP microprocessor.

Microprocessors are slow:

Digital signal processing has traditionally been done using enhanced microprocessors. While the

high volume of generic product provides a low cost solution, the performance falls seriously

short for many applications. Until recently, the only alternatives were to develop custom

hardware (typically board level or ASIC designs), buy expensive fixed function processors (e.g.

an FFT chip), or use an array of microprocessors.

FPGAs accelerate DSP:

Recent increases in Field Programmable Gate Array performance and size offer a new

hardware acceleration opportunity. FPGAs are an array of programmable logic cells

interconnected by a matrix of wires and programmable switches. Each cell performs a simple

logic function defined by a user's program. An FPGA has a large number (64 to over 20,000) of

these cells available to use as building blocks in complex digital circuits. Custom hardware has

never been so easy to develop.

Performance up to 1000x:

The ability to manipulate the logic at the gate level means you can construct a custom

processor to efficiently implement the desired function. By simultaneously performing all of the

algorithm’s sub functions, the FPGA can outperform a DSP by as much as 1000:1.


Fig 2.1. Comparison of DSP and FPGA.

DSP performance is limited by the serial instruction stream. FPGAs are a better solution

in the region above the curve.

FPGA DSPs are flexible:

Like microprocessors, many FPGAs can be infinitely reprogrammed in-circuit in only a

fraction of a second. Design revisions, even for a fielded product, can be implemented quickly

and painlessly. Hardware can also be reduced by taking advantage of reconfiguration.

Highly integrated:

The programmable logic in an FPGA can absorb much of the interface and ‘glue’ logic

associated with microprocessors. The tighter integration can make a product smaller, lighter,

cheaper and lower power.


Competitively priced:

FPGAs are a generic product customized at the point of use. They enjoy the cost advantages of
high production volumes. There are also none of the NRE charges or fabrication delays associated
with ASIC development, which helps get a product to market on time.

The FPGA’s flexibility eliminates the long design cycle associated with ASICs. With

FPGAs there are no delays for prototypes or early production volume. Design revisions are

easily implemented, often taking less than a day. The devices are fully tested by the

manufacturer, eliminating production test development.

2.5. CONCLUSION:

In this chapter we discussed signals and their types, filters and their types, and the use of
FPGAs in Digital Signal Processing.


3. DESIGN METHODOLOGY

3.1. INTRODUCTION:

A Finite Impulse Response (FIR) filter is a type of digital filter. The direct implementation
of an FIR filter requires a large number of resources. Distributed Arithmetic (DA) reduces the
resource count by replacing multiplications with additions and shifts. The proposed DA
algorithm uses multiplexers to reduce the ROM size; this LUT-less algorithm uses multiplexers to
remove the need for ROM.

3.2. DIRECT IMPLEMENTATION OF FIR FILTER:

An FIR filter is generally designed using the Multiply and Accumulate (MAC) principle, in which
the input samples are multiplied by the filter coefficients and the products are added. The MAC
principle is common in Digital Signal Processing algorithms.

The following expression explains the MAC operation.

$$y = \sum_{k=0}^{K-1} h_k x_k$$

Note a few points:

h = [h_0, h_1, h_2, ..., h_{K-1}] is a matrix of "constant" values

x = [x_0, x_1, x_2, ..., x_{K-1}] is a matrix of input "variable" values

Each h_k is of M bits

Each x_k is of N bits

y should be large enough to accommodate the result

A numerical example:
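As an illustration, the sketch below works through a MAC operation with four hypothetical coefficient and input values, and estimates how wide y must be. The width formula M + N + ceil(log2 K) is a common worst-case bound stated here as an assumption, and the names `mac` and `accumulator_width` are invented for the example.

```python
def mac(h, x):
    """Multiply each coefficient by its input and accumulate the products."""
    acc = 0
    for hk, xk in zip(h, x):
        acc += hk * xk
    return acc


def accumulator_width(m_bits, n_bits, k_terms):
    """Worst-case width of y: M + N bits per product plus ceil(log2 K) growth."""
    return m_bits + n_bits + (k_terms - 1).bit_length()


# Hypothetical values for K = 4 terms.
h = [2, 3, 1, 4]
x = [5, 1, 7, 2]
print(mac(h, x))                    # 2*5 + 3*1 + 1*7 + 4*2
print(accumulator_width(8, 8, 4))   # 8-bit coefficients, 8-bit inputs
```

Here the four products are 10, 3, 7 and 8, so the accumulated result is 28; with 8-bit operands and four terms the bound gives an 18-bit result.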

Fig 3.1. Block diagram of 1-tap filter using direct implementation.


Fig 3.2. Block diagram of 4-tap FIR filter using direct implementation.

In direct implementation we follow the Multiply and Accumulate (MAC) operation: each filter
coefficient is multiplied directly by the corresponding variable and the products are added to
get the final result. In a 1-tap filter, the filter coefficient h0 is multiplied by the variable
x0 and the result is assigned to the output. In a 4-tap filter, the filter coefficients are
multiplied by the corresponding variables, and the outputs of the four multipliers are added and
assigned to the result. This method requires four multipliers, which consume many resources. To
reduce resource utilization and improve speed we use the Distributed Arithmetic (DA) algorithm,
which is a multiplier-less architecture.


3.3. IMPLEMENTING FIR FILTER USING DISTRIBUTED ARITHMETIC:

Distributed arithmetic is a bit level rearrangement of a multiply accumulate to avoid the

multiplications.  It is a powerful technique for reducing the size of a parallel hardware multiply-

accumulate that is well suited to FPGA designs. It can also be extended to other sum functions

such as complex multiplies, Fourier transforms and so on.

In most of the multiply accumulate applications in signal processing, one of the

multiplicands for each product is a constant. Usually each multiplication uses a different

constant.

Using our most compact multiplier, the scaling accumulator, we can construct a multiple

product term parallel multiply-accumulate function in a relatively small space if we are willing to

accept a serial input. In this case, we feed four parallel scaling accumulators with unique

serialized data. Each multiplies that data by a possibly unique constant, and the resulting

products are summed in an adder tree as shown below

Fig 3.3. 4-tap FIR filter using DA algorithm.


If we stop to consider that the scaling accumulator multiplier is really just a sum of

vectors, then it becomes obvious that we can rearrange the circuit.

Here, the adder tree combines the 1 bit partial products before they are accumulated by

the scaling accumulator. All we have done is rearranged the order in which the 1xN partial

products are summed. Now instead of individually accumulating each partial product and then

summing the results, we postpone the accumulate function until after we’ve summed all the 1xN

partials at a particular bit time. This simple rearrangement of the order of the adds has effectively

replaced N multiplies followed by an N input add with a series of N input adds followed by a

multiply. This arithmetic manipulation directly eliminates N-1 Adders in an N product term

multiply-accumulate function. For larger numbers of product terms, the savings becomes

significant.
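The rearrangement can be modelled in software as follows. This is a behavioural sketch, not the circuit itself: it assumes unsigned inputs, processes the least significant bit first, and applies the bit weight with a left shift where the hardware scaling accumulator would shift its feedback. The function name and the values are hypothetical.

```python
def da_bit_serial(coeffs, samples, n_bits):
    """Sum-of-products without multipliers, for unsigned n_bits-wide samples."""
    acc = 0
    for bit in range(n_bits):                      # one "bit time" per pass
        # Adder tree: each 1xN partial product is either 0 or the coefficient.
        partial = sum(c for c, s in zip(coeffs, samples) if (s >> bit) & 1)
        # Scaling accumulator: weight the bit sum by 2**bit and accumulate.
        acc += partial << bit
    return acc


# Hypothetical 4-term example with 3-bit unsigned inputs.
coeffs = [2, 3, 1, 4]
samples = [5, 1, 7, 2]
print(da_bit_serial(coeffs, samples, n_bits=3))
print(sum(c * s for c, s in zip(coeffs, samples)))   # direct MAC for comparison
```

Both lines print the same value, since only the order of the additions has changed.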

Fig 3.4. Block diagram of 4-tap filter using LUT-less algorithm.


Further hardware savings are available when the coefficients Cn are constants. If that is

true, then the adder tree shown above becomes a Boolean logic function of the 4 serial inputs. 

The combined 1xN products and adder tree is reduced to a four input look up table. The sixteen

entries in the table are sums of the constant coefficients for all the possible serial input

combinations. The table is made wide enough to accommodate the largest sum without overflow.

Negative table values are sign extended to the width of the table, and the input to the scaling

accumulator should be sign extended to maintain negative sums.
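A behavioural sketch of the look-up-table form is given below for four constant coefficients. The coefficient and input values are hypothetical, as are the function names. Two assumptions go beyond the text: inputs are taken to be two's-complement numbers, so the bit sum for the sign bit is subtracted rather than added, and Python's unbounded integers stand in for the sign extension that fixed-width hardware needs.

```python
def build_da_lut(coeffs):
    """Entry `addr` holds the sum of coeffs[k] for every k with address bit k set."""
    return [sum(c for k, c in enumerate(coeffs) if (addr >> k) & 1)
            for addr in range(1 << len(coeffs))]


def da_lut_dot(lut, samples, n_bits):
    """Sum of products for two's-complement samples that fit in n_bits."""
    acc = 0
    for bit in range(n_bits):
        # Form the LUT address from bit `bit` of every serial input.
        addr = 0
        for k, s in enumerate(samples):
            addr |= ((s >> bit) & 1) << k
        term = lut[addr] << bit
        if bit == n_bits - 1:
            acc -= term        # sign bit carries weight -2**(n_bits - 1)
        else:
            acc += term
    return acc


# Hypothetical constants and 4-bit signed inputs.
coeffs = [2, -3, 1, 4]
samples = [5, -1, -7, 2]
lut = build_da_lut(coeffs)
print(len(lut))                                      # sixteen entries
print(da_lut_dot(lut, samples, n_bits=4))
print(sum(c * s for c, s in zip(coeffs, samples)))   # direct MAC for comparison
```

The table is built once from the constants, so at run time each bit position costs one table read and one addition.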

Fig 3.5. block diagram which explains MUX operations.
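As a rough illustration of the table described above, the following Python sketch builds the sixteen entries for four constant coefficients and finds a table width that holds the largest sum without overflow. The coefficient values, the bit-to-coefficient mapping of the address and the helper names are assumptions made for this example only.

```python
def build_da_lut(coeffs):
    """Entry `addr` holds the sum of the coefficients whose address bit is 1."""
    return [sum(c for k, c in enumerate(coeffs) if (addr >> k) & 1)
            for addr in range(1 << len(coeffs))]


def lut_width(lut):
    """Smallest two's complement width that can hold every table entry."""
    width = 1
    while not all(-(1 << (width - 1)) <= v < (1 << (width - 1)) for v in lut):
        width += 1
    return width


coeffs = [3, -5, 7, 2]        # example constants C0..C3 (assumed)
lut = build_da_lut(coeffs)
print(len(lut))               # 16
print(lut[0b0101])            # 10, i.e. C0 + C2
print(min(lut), max(lut))     # -5 12
print(lut_width(lut))         # 5
```

Negative entries such as -5 are the ones that would need sign extension to the full table width in hardware.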

Obviously the serial inputs limit the performance of such a circuit.  As with most hardware

applications, we can obtain more performance by using more hardware.  In this case, more than

one bit sum can be computed at a time by duplicating the LUT and adder tree as shown here. The

second bit computed will have a different weight than the first, so some shifting is required

before the bit sums are combined. In this 2 bit at a time implementation, the odd bits are fed to

one LUT and adder tree, while the even bits are simultaneously fed to an identical tree. The odd

bit partials are left shifted to properly weight the result and added to the even partials before

accumulating the aggregate. Since two bits are taken at a time, the scaling accumulator has to

shift the feedback by 2 places.


Fig 3.6. block diagram which explains MUX operations for more number of inputs
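The two-bits-at-a-time scheme can be sketched in Python as follows. This is a behavioural model under stated assumptions, not the circuit of the report: inputs are unsigned with an even bit width, bit pairs are processed MSB first, and all names and values are hypothetical.

```python
def da_two_bits_per_clock(coeffs, samples, width):
    """Process an odd/even bit pair per clock using two identical LUTs."""
    assert width % 2 == 0, "this sketch assumes an even input width"
    lut = [sum(c for k, c in enumerate(coeffs) if (addr >> k) & 1)
           for addr in range(1 << len(coeffs))]

    def address(bit):
        # Collect the same bit position from every input to form the LUT address
        return sum(((x >> bit) & 1) << k for k, x in enumerate(samples))

    acc = 0
    for bit in range(width - 2, -1, -2):
        even = lut[address(bit)]       # even bit partial
        odd = lut[address(bit + 1)]    # odd bit partial, one place heavier
        # Feedback is shifted by 2 places because two bits are consumed per clock
        acc = (acc << 2) + (odd << 1) + even
    return acc


coeffs = [3, -5, 7, 2]    # example constants (assumed)
samples = [9, 4, 15, 1]   # example 4-bit inputs (assumed)
print(da_two_bits_per_clock(coeffs, samples, 4))  # 114
```

With 4-bit inputs the result is available after two iterations instead of four.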

This paralleling scheme can be extended to compute more than two bits at a time.  In the

extreme case, all input bits can be computed in parallel and then combined in a shifting adder

tree.  No scaling accumulator is needed in this case, since the output from the adder tree is the

entire sum of products.   This fully parallel implementation has a data rate that matches the serial

clock, which can be greater than 100 MS/S in today's FPGAs.


Fig 3.7. Block diagram which explains shifting and addition operations.

Most often, we have more than 4 product terms to accumulate. Increasing the size of the

LUT might look attractive until you consider that the LUT size grows exponentially. Considering

the construction of the logic we stuffed into the LUT, it becomes obvious that we can combine

the results from the LUTs in an adder tree. The area of the circuit grows by roughly 2n-1 using

adder trees to expand it rather than the 2^n growth experienced by increasing LUT size. For

FPGAs, the most efficient use of the logic occurs when we use the natural LUT size (usually a 4-

LUT, although an 8-LUT would make sense if we were using an 8 input block RAM) for the

LUTs and then add the outputs of the LUTs together in an adder tree, as shown below:

Fig 3.8. Block diagram of 8-tap FIR filter.
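The idea of splitting the product terms over several 4-input LUTs and adding their outputs can be sketched in Python. The model below assumes unsigned inputs processed MSB first; the function names and the example values are hypothetical and only serve to compare the table sizes.

```python
def build_lut(coeffs):
    """All sums of the given coefficients, addressed by one bit per coefficient."""
    return [sum(c for k, c in enumerate(coeffs) if (addr >> k) & 1)
            for addr in range(1 << len(coeffs))]


def da_partitioned(coeffs, samples, width, lut_inputs=4):
    """Split the taps into groups of `lut_inputs`, one LUT per group, then add."""
    luts, groups = [], []
    for start in range(0, len(coeffs), lut_inputs):
        luts.append(build_lut(coeffs[start:start + lut_inputs]))
        groups.append(samples[start:start + lut_inputs])
    acc = 0
    for bit in range(width - 1, -1, -1):
        bit_sum = 0
        for lut, xs in zip(luts, groups):
            addr = sum(((x >> bit) & 1) << k for k, x in enumerate(xs))
            bit_sum += lut[addr]           # adder tree over the LUT outputs
        acc = (acc << 1) + bit_sum         # scaling accumulator
    return acc, sum(len(lut) for lut in luts)


coeffs = [1, -2, 3, -4, 5, -6, 7, -8]   # example 8-tap constants (assumed)
samples = [15, 0, 7, 8, 1, 2, 3, 4]     # example 4-bit inputs (assumed)
result, table_words = da_partitioned(coeffs, samples, 4)
print(result)        # -14
print(table_words)   # 32 words in two 4-LUT tables, versus 256 for one 8-input table
```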


3.4. MATHEMATICAL ANALYSIS OF DISTRIBUTED ARITHMETIC:

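General equation of FIR filter is

$$y = \sum_{k=1}^{K} h_k x_k \qquad (1)$$

where $h_k$ are the filter coefficients and $x_k$ are the input samples.

Let $x_k$ be an N-bit scaled two's complement number, i.e.

$$|x_k| < 1$$

$$x_k : \{b_{k0}, b_{k1}, b_{k2}, \ldots, b_{k(N-1)}\}$$

where $b_{k0}$ is the sign bit.

We can express $x_k$ as

$$x_k = -b_{k0} + \sum_{n=1}^{N-1} b_{kn} 2^{-n} \qquad (2)$$

Now by substituting (2) in (1), we get

$$y = \sum_{k=1}^{K} h_k \left[ -b_{k0} + \sum_{n=1}^{N-1} b_{kn} 2^{-n} \right] \qquad (3)$$

By expanding the term we get

$$y = -\sum_{k=1}^{K} h_k b_{k0} + \sum_{k=1}^{K} h_k \sum_{n=1}^{N-1} b_{kn} 2^{-n}$$

Now by expanding the sigma term we get the following equation

$$y = -\sum_{k=1}^{K} h_k b_{k0} + h_1 \left[ b_{11} 2^{-1} + \cdots + b_{1(N-1)} 2^{-(N-1)} \right] + \cdots + h_K \left[ b_{K1} 2^{-1} + \cdots + b_{K(N-1)} 2^{-(N-1)} \right]$$

By taking common multiples into consideration we can rearrange the equation in the following fashion.

$$y = -\sum_{k=1}^{K} h_k b_{k0} + \left[ h_1 b_{11} + h_2 b_{21} + \cdots + h_K b_{K1} \right] 2^{-1} + \cdots + \left[ h_1 b_{1(N-1)} + h_2 b_{2(N-1)} + \cdots + h_K b_{K(N-1)} \right] 2^{-(N-1)}$$

Finally the equation is reduced in the following way.

$$y = -\sum_{k=1}^{K} h_k b_{k0} + \sum_{n=1}^{N-1} \left[ \sum_{k=1}^{K} h_k b_{kn} \right] 2^{-n} \qquad (4)$$

Equation 4 is the final formula of the distributed arithmetic.

For ROM construction, equation 4 is reduced in the following fashion. The bracketed term

$$\sum_{k=1}^{K} h_k b_{kn} \qquad (5)$$

has only $2^K$ possible values, i.e. (5) can be pre-calculated for all possible values of $b_{1n} b_{2n} \ldots b_{Kn}$.

We can store these in a look-up table of $2^K$ words addressed by K bits, i.e. $b_{1n} b_{2n} \ldots b_{Kn}$.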
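The representation used in this section (sign bit with weight -1, remaining bits with weights 2^-n) and the table of pre-calculated sums can be checked numerically. The Python sketch below is an illustration only; the coefficient values, the bit patterns and the function names are assumptions, and exact fractions are used so that no rounding is involved.

```python
from fractions import Fraction
from itertools import product


def value(bits):
    """Scaled two's complement value: x = -b0 + sum over n >= 1 of b_n * 2^-n."""
    return -bits[0] + sum(Fraction(b, 1 << n) for n, b in enumerate(bits) if n >= 1)


def da_output(coeffs, words):
    """Sum of products evaluated bit plane by bit plane from a table of 2^K sums."""
    n_bits = len(words[0])
    # Pre-calculate the sum of coefficients for every possible K-bit address
    table = {addr: sum(c for c, a in zip(coeffs, addr) if a)
             for addr in product((0, 1), repeat=len(coeffs))}
    y = -table[tuple(w[0] for w in words)]            # sign-bit plane, weight -1
    for n in range(1, n_bits):
        y += table[tuple(w[n] for w in words)] * Fraction(1, 1 << n)
    return y


coeffs = [3, -5, 7, 2]                      # example coefficients (assumed)
words = [[0, 1, 0, 1], [1, 0, 1, 0],        # example 4-bit inputs, sign bit first (assumed)
         [1, 1, 1, 1], [0, 0, 0, 1]]
direct = sum(c * value(w) for c, w in zip(coeffs, words))
print(direct, da_output(coeffs, words))     # 5 5
```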

3.5. Block Diagram of FIR filter using DA algorithm:

Here in our project we are designing 4-tap FIR filter. The original LUT based DA

implementation of FIR filter is shown in the following figure.


Fig 3.9. Block diagram of 4-tap FIR filter using DA based Algorithm

The block diagram of the LUT based DA implementation of the FIR filter consists of three units: the shift register unit, the DA-LUT unit and the adder/shifter unit.

The four input signals, each four bits wide, are given to parallel-in/serial-out shift registers. The output of each parallel-in/serial-out register is a single bit. The pre-calculated sums of the filter coefficients are stored in the Look-Up Table, and the four serial output bits together select one value from it. The output of the Look-Up Table is given to the adder and shifter unit, which adds this value to the left-shifted previous output and passes the result to the output. This process is repeated for four clock cycles, after which we get the required output.
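The interaction of the three units over four clock cycles can be traced with a short behavioural model in Python. It is a sketch under assumptions, not the VHDL of the project: inputs are unsigned 4-bit values shifted out MSB first, and the coefficient values and function names are hypothetical.

```python
def piso_msb_first(value, width=4):
    """Parallel-in/serial-out register: yields one bit per clock, MSB first."""
    for bit in range(width - 1, -1, -1):
        yield (value >> bit) & 1


def da_fir_trace(coeffs, samples, width=4):
    """Return the adder/shifter output after each clock cycle."""
    # DA-LUT unit: every possible sum of the coefficients
    lut = [sum(c for k, c in enumerate(coeffs) if (addr >> k) & 1)
           for addr in range(1 << len(coeffs))]
    serial = [piso_msb_first(x, width) for x in samples]   # shift register unit
    acc, trace = 0, []
    for _ in range(width):
        bits = [next(s) for s in serial]
        addr = sum(b << k for k, b in enumerate(bits))
        acc = (acc << 1) + lut[addr]    # left-shifted previous output plus LUT value
        trace.append(acc)
    return trace


coeffs = [3, -5, 7, 2]    # example coefficients (assumed)
samples = [9, 4, 15, 1]   # example 4-bit inputs (assumed)
print(da_fir_trace(coeffs, samples))  # [10, 22, 51, 114]
```

The last entry of the trace equals the direct sum of products 3*9 - 5*4 + 7*15 + 2*1 = 114.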

3.5.1. Shift Register unit:

A serial-in/parallel-out shift register is similar to the serial-in/serial-out

shift register in that it shifts data into internal storage elements and shifts data out at the serial-

out, data-out, pin. It is different in that it makes all the internal stages available as outputs.

Therefore, a serial-in/parallel-out shift register converts data from serial format

to parallel format. If four data bits are shifted in by four clock pulses via a single wire at data-in,

below, the data becomes available simultaneously on the four outputs QA to QD after the fourth

clock pulse.

Fig 3.10. Serial in parallel out shift register with 4- stages.

The practical application of the serial-in/parallel-out shift register is to convert data

from serial format on a single wire to parallel format on multiple wires. Perhaps, we will

illuminate four LEDs (Light Emitting Diodes) with the four outputs (QA QB QC QD ).


Fig 3.11. serial in parallel out shift register in detail.

The above details of the serial-in/parallel-out shift register are fairly simple. It looks like

a serial-in/ serial-out shift register with taps added to each stage output. Serial data

shifts in at SI (Serial Input). After a number of clocks equal to the number of stages, the first data

bit in appears at SO (QD) in the above figure. In general, there is no SO pin. The last stage

(QD above) serves as SO and is cascaded to the next package if it exists.

Note that serial-in/serial-out shift registers come in greater than 8-bit lengths of 18 to 64 bits.

It is not practical to offer a 64-bit serial-in/parallel-out shift register requiring that many output

pins. See waveforms below for above shift register.


Fig 3.12. Serial in parallel out register waveforms.

The shift register has been cleared prior to any data by CLR', an active low signal, which clears

all type D Flip-Flops within the shift register. Note the serial data 1011 pattern presented at

the SI input. This data is synchronized with the clock CLK. This would be the case if it is being

shifted in from something like another shift register, for example, a parallel-in/ serial-

out shift register (not shown here). On the first clock at t1, the data 1 at SI is shifted

from D to Q of the first shift register stage. After t2 this first data bit is at QB. After t3 it is at QC.

After t4 it is at QD. Four clock pulses have shifted the first data bit all the way to the last

stage QD. The second data bit a 0 is at QC after the 4th clock. The third data bit a 1 is at QB. The

fourth data bit another 1 is at QA. Thus, the serial data input pattern 1011 is

contained in (QD QC QB QA). It is now available on the four outputs.


It will be available on the four outputs from just after clock t4 to just before t5. This parallel data

must be used or stored between these two times, or it will be lost due to shifting out the QD stage

on following clocks t5 to t8 as shown above.
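The waveform description above can be replayed with a few lines of Python. This is a minimal behavioural sketch; the function name and the tuple layout (QA, QB, QC, QD) are choices made for this example.

```python
def sipo_shift(state, serial_in):
    """One clock pulse: the serial bit enters QA and every stage moves toward QD."""
    qa, qb, qc, qd = state
    return (serial_in, qa, qb, qc)


state = (0, 0, 0, 0)          # register cleared by CLR' before any data
history = []
for bit in (1, 0, 1, 1):      # serial pattern 1011, first bit first
    state = sipo_shift(state, bit)
    history.append(state)

qa, qb, qc, qd = state
print(history)                # [(1, 0, 0, 0), (0, 1, 0, 0), (1, 0, 1, 0), (1, 1, 0, 1)]
print(qd, qc, qb, qa)         # 1 0 1 1
```

After the fourth clock the pattern 1011 is held in (QD QC QB QA); a fifth clock would shift the first bit out of QD.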

3.5.2. Look Up Table unit:

The binary data is stored in the solid-state devices. Those storage "cells" within solid-state

memory devices are easily addressed by driving the "address" lines of the device with the proper

binary value(s). Suppose we had a ROM memory circuit written, or programmed, with certain

data, such that the address lines of the ROM served as inputs and the data lines of the ROM

served as outputs, generating the characteristic response of a particular logic function.

Theoretically, we could program this ROM chip to emulate whatever logic function we wanted

without having to alter any wire connections or gates.

Consider the following example of a 4 x 2 bit ROM memory (a very small memory!)

programmed with the functionality of a half adder:

Fig 3.13. Functionality of Half Adder.


If this ROM has been written with the above data (representing a half-adder's truth table),

driving the A and B address inputs will cause the respective memory cells in the ROM chip to be

enabled, thus outputting the corresponding data as the Σ (Sum) and Cout bits. Unlike the half-

adder circuit built of gates or relays, this device can be set up to perform any logic function at all

with two inputs and two outputs, not just the half-adder function. To change the logic function,

all we would need to do is write a different table of data to another ROM chip. We could even

use an EPROM chip which could be re-written at will, giving the ultimate flexibility in function.

It is vitally important to recognize the significance of this principle as applied to digital

circuitry. Whereas the half-adder built from gates or relays processes the input bits to arrive at a

specific output, the ROM simply remembers what the outputs should be for any given

combination of inputs. This is not much different from the "times tables" memorized in grade

school: rather than having to calculate the product of 5 times 6 (5 + 5 + 5 + 5 + 5 + 5 = 30),

school-children are taught to remember that 5 x 6 = 30, and then expected to recall this product

from memory as needed. Likewise, rather than the logic function depending on the functional

arrangement of hard-wired gates or relays (hardware), it depends solely on the data written into

the memory (software).

Such a simple application, with definite outputs for every input, is called a look-up table, because

the memory device simply "looks up" what the output(s) should be for any given combination

of input states.
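The half-adder ROM example can be written out as a small Python sketch. The address ordering (A as the upper address bit) and the data word layout (Sum in the upper bit, Cout in the lower bit) are assumptions made for this illustration.

```python
# 4 x 2 bit ROM: address = (A << 1) | B, data word = (Sum, Cout) with Sum as the upper bit
ROM = [0b00, 0b10, 0b10, 0b01]


def rom_half_adder(a, b):
    """Look the result up instead of computing it."""
    word = ROM[(a << 1) | b]
    return (word >> 1) & 1, word & 1


def gate_half_adder(a, b):
    """Reference built from gates: Sum = A XOR B, Cout = A AND B."""
    return a ^ b, a & b


for a in (0, 1):
    for b in (0, 1):
        print(a, b, rom_half_adder(a, b))
# 0 0 (0, 0)
# 0 1 (1, 0)
# 1 0 (1, 0)
# 1 1 (0, 1)
```

Writing different words into `ROM` changes the logic function without touching the lookup code, which is the point made in the text.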

3.5.3. Adder and Shifter unit:

The adder and shifter unit consists mainly of two blocks: a shifter and an accumulator. The input to the adder and shifter unit is the output of the LUT. This input is added to the left-shifted previous output, and the sum is assigned to the output. Here we use a 16-bit adder. This process is repeated k times to obtain the final output, where k is the number of bits in each input. Here we designed a 4-tap filter where each input is 4 bits wide, so it requires four clock cycles to get the required output. The adder and shifter unit is the one which eliminates the multiplication process by using a shift and accumulate process.


3.6. Proposed Distributed Arithmetic:

In the original LUT based DA implementation of the FIR filter, each entry in the lower half of the LUT is the sum of the corresponding entry in the upper half and h[3]. The lower half consists of the locations where b3=1 and the upper half of the locations where b3=0. To avoid this wastage of memory we use the proposed DA, where the LUT size is reduced by half at the cost of an additional 2*1 multiplexer and a full adder, as shown in the following figure.

The fourth serial input, i.e. b3, is given to the select line of the multiplexer. If b3 is one then h[3] will be the output of the multiplexer, and if b3 is zero then zero will be the output of the multiplexer. The output of the multiplexer is added to the output of the LUT and then given to the adder/shifter unit.


Fig 3.14. Block diagram of 4-tap FIR filter using proposed DA algorithm.
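The LUT halving can be illustrated with a behavioural Python sketch. It assumes unsigned 4-bit inputs processed MSB first; the coefficient values and function names are hypothetical and the code is not the project's VHDL.

```python
def build_lut(coeffs):
    """All sums of the given coefficients, one address bit per coefficient."""
    return [sum(c for k, c in enumerate(coeffs) if (addr >> k) & 1)
            for addr in range(1 << len(coeffs))]


def proposed_da(coeffs, samples, width=4):
    """8-word LUT for h[0]..h[2] plus a 2:1 multiplexer and adder for h[3]."""
    lut8 = build_lut(coeffs[:3])
    h3 = coeffs[3]
    acc = 0
    for bit in range(width - 1, -1, -1):
        b = [(x >> bit) & 1 for x in samples]
        addr = b[0] | (b[1] << 1) | (b[2] << 2)
        mux_out = h3 if b[3] else 0           # multiplexer selected by b3
        acc = (acc << 1) + lut8[addr] + mux_out
    return acc


coeffs = [3, -5, 7, 2]    # example h[0]..h[3] (assumed)
samples = [9, 4, 15, 1]   # example 4-bit inputs (assumed)
full = build_lut(coeffs)  # original 16-word table, b3 as the top address bit
half = build_lut(coeffs[:3])
print(all(full[a + 8] == half[a] + coeffs[3] for a in range(8)))  # True
print(proposed_da(coeffs, samples))                               # 114
```

The first check shows why the b3=1 half of the original table is redundant: it is the b3=0 half with h[3] added to every entry.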

3.7. LUT-less Distributed Arithmetic:

The LUT reduction procedure discussed above will be further developed to obtain LUT-less

DA architecture. The LUT-less DA architecture is as shown below:

Fig 3.15. Block diagram of 4-tap FIR filter using LUT less algorithm.

In this procedure all LUTs are replaced by multiplexers and full adders, so that no memory is used. The output of each parallel-in/serial-out register is given to a multiplexer: if the bit is one, the respective constant value is obtained; otherwise zero is obtained. The outputs of the multiplexers are added and given to the adder/shifter unit of the filter.

3.8. VLSI implementation methods:

At the engineering level, digital VLSI chips are classified by the approach used to implement the circuit. Several design styles can be considered for chip implementation of specified algorithms or logic functions. Each design style has its own merits and demerits, and thus a proper choice has to be made by designers in order to provide the functionality at low cost.

3.8.1. PLD (PROGRAMMABLE LOGIC DEVICE):

PLD’s are standard ICs that are available in standard configurations from a catalog of parts

and are sold in very high volume to many different customers. PLD’s may be configured or

programmed to create a part customized to a specified application, and so they also belong to the

family of ASIC’s. PLD’s use different technologies to allow programming of the device.

There are four types of PLD’s

1. Programmable Logic Array (PLA)

2. Programmable Array Logic (PAL)

3. Complex Programmable Logic Device (CPLD)

4. Field Programmable Gate Array (FPGA)

1. Programmable Logic Array (PLA):

A Programmable Logic Array is a small PLD that contains two levels of logic, an

AND-plane and an OR-plane, where both levels are programmable.


2. Programmable Array Logic (PAL):

A Programmable Array Logic is a small PLD that has programmable AND plane

followed by a fixed OR plane.

3. Complex Programmable Logic Device (CPLD):

A Complex Programmable Logic Device is a PLD that consists of an arrangement

of multiple PLA/PAL like blocks on a single chip.

4. Field Programmable Gate Array (FPGA):

A Field Programmable Gate Array is a PLD that offers a much higher logic capacity

than a CPLD.

3.8.2. Features of PLD:

1. No customized mask layers or logic cells.

2. Fast design turnaround.

3. Single large blocks of programmable interconnect.

3.8.3. FPGA (Field Programmable Gate Array):

3.8.3.1. Introduction:

Field Programmable Gate Arrays are specific integrated circuits that can be user-programmed easily. The FPGA contains versatile functions, configurable interconnects and an input/output interface to adapt to the user specification. FPGAs allow rapid prototyping using custom logic structures, and are very popular for limited production products. Modern FPGAs are extremely dense, with a complexity of several million gates, which enables the emulation of very complex hardware such as parallel microprocessors or a mixture of processor and signal processing. One key advantage of FPGAs is their ability to be reprogrammed, in order to create completely different hardware by modifying the logic gate array. FPGAs not only exist as simple components, but also as CPU ram-blocks in system-on-chip designs. An FPGA consists of Slices, where each Slice consists of 2 look up tables and 2 D flip-flops.

3.8.3.2. Look Up Table (LUT):

Look Up Table (LUT) is a one-bit wide memory array, where the address lines for the

memory are inputs of the logic block and the one-bit output from the memory is the input for the

next block. A LUT with n inputs corresponds to a (2^n) x 1 bit memory and can realize any logic

function of its n inputs by programming the logic function's truth table directly into the memory.

Fig 3.16 LUT’s in FPGA.
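A LUT of this kind can be modelled in Python as a list of 2^n single bits. The sketch below is illustrative; the helper names and the choice of a 4-input majority function are assumptions made for the example.

```python
def program_lut(func, n):
    """Write the truth table of an n-input function into a (2^n) x 1 bit memory."""
    return [func(*[(addr >> i) & 1 for i in range(n)]) & 1
            for addr in range(1 << n)]


def read_lut(lut, inputs):
    """The logic inputs act as address lines; the stored bit is the output."""
    return lut[sum(bit << i for i, bit in enumerate(inputs))]


def majority(a, b, c, d):
    """Example logic function: 1 when at least three inputs are 1."""
    return int(a + b + c + d >= 3)


lut4 = program_lut(majority, 4)
print(len(lut4))                      # 16
print(read_lut(lut4, (1, 1, 1, 0)))   # 1
print(read_lut(lut4, (1, 0, 1, 0)))   # 0
```

Programming a different function only changes the sixteen stored bits, not the lookup mechanism.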


3.8.3.3. Classification of FPGA:

FPGAs are classified based on Switching Technology.

1. SRAM based FPGAs

2. ANTIFUSE based FPGAs

XILINX and ALTERA are the leading manufacturers in SRAM based FPGA.

ACTEL, QUICKLOGIC, CYPRESS are the leading manufacturers in

ANTIFUSE based FPGA.

3.8.3.4. FPGA Design flow:

The first step involved in the implementation of a design on an FPGA is System Specifications. Specifications refer to the kind of inputs, the kind of outputs and the range of values that the kit can accept. Based on these system specifications we move on to the next step, i.e. Architecture, which describes the interconnections between all the blocks involved in our design.

Each and every block in the Architecture along with their interconnections is modeled in

either VHDL or Verilog depending on our ease. All these blocks are then simulated and the

outputs are verified for correct functioning.

From this simulation step we head towards the next step i.e. Synthesis. This is a very

important step in knowing whether our design can be implemented on a FPGA kit or not.

Synthesis converts our VHDL code into its functional components which are vendor specific.

After performing synthesis we can have a look of RTL schematic and Technology Schematic.

We can also see the timing delays that will be present in the FPGA if the design is implemented

on it.

Place & Route is the next step in which the tool places all the components on a FPGA die

for optimum performance both in terms of area and speed. We also see the interconnections,

which will be made, in this part of the implementation flow. In post place and route simulation

step the actual delays, which will be involved on the FPGA kit, are considered by the tool and

simulation is performed taking into consideration these delays, which will be present in the

implementation on the kit. Delays here mean electrical loading effect, wiring delays, stray

capacitances.


Fig 3.17. FPGA implementation design flow

[Flow chart: System specifications (initials), Architecture (block diagram), VHDL module (coding), Simulation (functional verification), Synthesis (generating a net list), Place and route (placing on FPGA die and interconnections), Post place and route simulation (timing verification), Generating bit map file, Download on to FPGA (configuring FPGA)]

After post place and route comes generating the bit-map file, which means converting the VHDL code into a bit stream that is used to configure the FPGA kit. A bit file is generated after we perform this step.

After this comes the final step of downloading the bit map file on to the FPGA board, which is done by connecting the computer to the FPGA board with the help of a JTAG (Joint Test Action Group) cable; JTAG is an IEEE standard. The bit map file contains the whole design, which is placed on the FPGA die; the outputs can now be observed on the FPGA LEDs or multiplexed seven segment displays. This step completes the whole process of implementing our design on an FPGA.

3.8.3.5. Characteristics of FPGA:

None of the mask layers are customized.

A method for programming the basic logic cells and the interconnect.

The core is a regular array of programmable basic logic cells that can implement

combinational as well as sequential logic (flip-flops).

A matrix of programmable interconnect surrounds the basic logic cells.

Programmable I/O cells surround the core.

Design turnaround is a few hours.

3.9. Applications of FPGA:

1. Device controllers

2. Random logic

3. Emulation of hardware

4. Integrating multiple SPLD’s

3.10. Conclusion:

This chapter discussed design methodologies, the implementation of the FIR filter using different methods, and the FPGA design flow.

4. DESIGN ANALYSIS


4.1. INTRODUCTION:

After implementation of the design, the next step is to analyze it. This chapter discusses the analysis of the design after implementing the FIR filter using all four algorithms. The flow chart of the algorithm is presented, and the synthesis and simulation reports are discussed in view of the performance of the design.

4.2. Flow chart:

Fig.4.1 flow chart

4.3. Simulation:


4.3.1. Direct implementation of FIR filter:

Fig 4.2. Simulation report of Direct Implementation of FIR filter.

In direct implementation we use the Multiply and Accumulate (MAC) operation. Here we take the filter co-efficients as constants and the variables as inputs, and multiply the variables with the filter co-efficients based on the FIR equation. We get the output in one clock cycle, but the delay is larger. The time period of the clock is 20 ns. For the first 20 ns reset is one, so the output is zero.

4.3.2. One tap filter:


Fig 4.3 Simulation report of one tap filter

We apply the input to the Parallel in Serial Out (PISO) register. After a clock event we check reset: if reset is one, all intermediate variables, signals, the counter and the output port are set to zero; if reset is zero, in the first clock cycle we get the MSB of the input from the PISO register. Based on the output of the PISO we select a value from the Look Up Table (LUT). This value is added to the left shifted previous output, and the result is given to the output.

Here for the one tap filter we have one filter co-efficient and one 4-bit input. We store the pre-calculated values (in this case h0 and “0000”) in the ROM and give the input to the PISO register. At first reset is one, so the output is zero. After reset becomes zero, on a clk event we load the input port with the input. During the first clock cycle we get the MSB of the input from the PISO register. Based on this bit we decide whether to select h0 or “0000”: if the bit from the PISO is zero we select “0000”, and if it is one we select h0. The output of the LUT is added to the left shifted previous output. During the 1st clock cycle the previous output is zero, so shifting it still gives zero, and this is added to the output of the LUT. During the second clock cycle we get the second MSB from the PISO; based on this we select a value from the LUT, which is added to the left shifted previous output. After four clock cycles we get the required output.


4.3.3. Implementation of FIR filter using DA algorithm:

Fig 4.4. Simulation report of DA algorithm based FIR filter.

We apply the inputs to the Parallel in Serial Out (PISO) registers. After a clock event we check reset: if reset is one, all intermediate variables, signals, the counter and the output port are set to zero; if reset is zero, in the first clock cycle we get the MSB of each input from the PISO registers. Based on the outputs of the PISO registers we select a value from the Look Up Table (LUT). This value is added to the left shifted previous output, and the result is given to the output.

Here for the 4-tap filter we have four filter co-efficients of 16 bits each and four 4-bit inputs. We store the pre-calculated values in the ROM and give the inputs to the PISO registers. At first reset is one, so the output is zero. After reset becomes zero, on a clk event we load the input ports with the inputs. During the first clock cycle we get the MSBs of all four inputs from the PISO registers. Based on the outputs of the PISO registers we select a value from the LUT. The output of the LUT is added to the left shifted previous output. During the 1st clock cycle the previous output is zero, so shifting it still gives zero, and this is added to the output of the LUT. During the second clock cycle we get the second MSBs from the PISO registers; based on these we select a value from the LUT, which is added to the left shifted previous output. After four clock cycles we get the required output.

4.3.4. Implementation of FIR filter using Proposed DA algorithm:


Fig 4.5. Simulation report of proposed DA based FIR filter.

We apply the inputs to the Parallel in Serial Out (PISO) registers. After a clock event we check reset: if reset is one, all intermediate variables, signals, the counter and the output port are set to zero; if reset is zero, in the first clock cycle we get the MSB of each input from the PISO registers. The output of one PISO register is given to the MUX and the remaining three outputs from the other PISO registers are given to the 3-tap LUT. Based on the outputs of these three PISO registers we select a value from the Look Up Table (LUT). This value is added to the output of the MUX and to the left shifted previous output, and the result is given to the output.

Here for the 4-tap filter we have four filter co-efficients of 16 bits each and four 4-bit inputs. We give the inputs to the PISO registers. At first reset is one, so the output is zero. After reset becomes zero, on a clk event we load the input ports with the inputs. During the first clock cycle we get the MSB of each input from the PISO registers. Three of these bits are given to the LUT and the remaining one is given to the MUX. The outputs of the MUX and the LUT are added, and this result is added to the left shifted previous output. During the 1st clock cycle the previous output is zero, so shifting it still gives zero, and this is added to the sum of the LUT and MUX outputs. During the second clock cycle we get the second MSBs from the PISO registers; based on these we select values from the LUT and the MUX, and their sum is added to the left shifted previous output. After four clock cycles we get the required output.


4.3.5. Implementation of FIR filter using LUT-Less algorithm:

Fig 4.6. Simulation report of LUT-Less based FIR filter.

We apply the inputs to the Parallel in Serial Out (PISO) registers. After a clock event we check reset: if reset is one, all intermediate variables, signals, the counter and the output port are set to zero; if reset is zero, in the first clock cycle we get the MSB of each input from the PISO registers. The output of each PISO register is fed to its own MUX, so four different MUXes are used. The outputs of all the MUXes are added, the sum is added to the left shifted previous output, and the result is given to the output.

Here for the 4-tap filter we have four filter co-efficients of 16 bits each and four 4-bit inputs. We give one input to each of the four PISO registers. At first reset is one, so the output is zero. After reset becomes zero, on a clk event we load the input ports with the inputs. During the first clock cycle we get the MSB of each input from the PISO registers. Each PISO output is given to one of the four MUXes. In each MUX we decide, based on this bit, whether to select the filter co-efficient or “0000”: if the bit from the PISO is zero we select “0000”, and if it is one we select the corresponding filter co-efficient. The outputs of the MUXes are added, and the result is added to the left shifted previous output. During the 1st clock cycle the previous output is zero, so shifting it still gives zero, and this is added to the sum of the MUX outputs. During the second clock cycle we get the second MSBs from the PISO registers; based on these we get values from the MUXes, which are added, and the result is added to the left shifted previous output. After four clock cycles we get the required output.


4.4. Synthesis report:

4.4.1. Direct implementation:

=========================================================================

* Advanced HDL Synthesis *

=========================================================================

Loading device for application Rf_Device from file '3s400.nph' in environment C:\Xilinx92i.

=========================================================================

Advanced HDL Synthesis Report

Macro Statistics

# Multipliers : 7

4x16-bit multiplier : 7

# Adders/Subtractors : 6

20-bit adder : 6

# Registers : 80

Flip-Flops : 80

=========================================================================

Timing Summary

=========================================================================

Speed Grade: -5

Minimum period: No path found

Minimum input arrival time before clock: 11.542ns

Maximum output required time after clock: 6.216ns

Maximum combinational path delay: No path found

Timing Detail:

--------------

All values displayed in nanoseconds (ns)


=========================================================================

Timing constraint: Default OFFSET IN BEFORE for Clock 'clk'

Total number of paths / destination ports: 77620 / 160

4.4.2. DA algorithm based FIR filter:

=========================================================================

* Advanced HDL Synthesis *

=========================================================================

Loading device for application Rf_Device from file '3s400.nph' in environment C:\Xilinx92i.

=========================================================================

Advanced HDL Synthesis Report

Macro Statistics

# ROMs : 1

16x16-bit ROM : 1

# Adders/Subtractors : 2

16-bit adder : 1

3-bit adder : 1

# Registers : 53

Flip-Flops : 53

# Comparators : 4

4-bit comparator not equal : 4

=========================================================================

Timing Summary

Speed Grade: -5

Minimum period: 9.602ns (Maximum Frequency: 104.140MHz)


Minimum input arrival time before clock: 10.130ns

Maximum output required time after clock: 6.280ns

Maximum combinational path delay: No path found

Timing Detail:

--------------

All values displayed in nanoseconds (ns)

=========================================================================

Timing constraint: Default period analysis for Clock 'clk'

Clock period: 9.602ns (frequency: 104.140MHz)

Total number of paths / destination ports: 7762 / 68

4.4.3. Proposed DA algorithm:

=========================================================================

* Advanced HDL Synthesis *

=========================================================================

Loading device for application Rf_Device from file '3s400.nph' in environment C:\Xilinx92i.

=========================================================================

Advanced HDL Synthesis Report

Macro Statistics

# ROMs : 1

8x16-bit ROM : 1

# Adders/Subtractors : 3

16-bit adder : 2

3-bit adder : 1

# Registers : 53

Flip-Flops : 53


# Comparators : 4

4-bit comparator not equal : 4

Timing Summary

=========================================================================

Speed Grade: -5

Minimum period: 10.188ns (Maximum Frequency: 98.154MHz)

Minimum input arrival time before clock: 10.672ns

Maximum output required time after clock: 6.280ns

Maximum combinational path delay: No path found

Timing Detail:

--------------

All values displayed in nanoseconds (ns)

=========================================================================

Timing constraint: Default period analysis for Clock 'clk'

Clock period: 10.188ns (frequency: 98.154MHz)

Total number of paths / destination ports: 8655 / 68

4.4.4. LUT-Less algorithm:

=========================================================================

* Advanced HDL Synthesis *

=========================================================================

Loading device for application Rf_Device from file '3s400.nph' in environment C:\Xilinx92i.

=========================================================================

Advanced HDL Synthesis Report

Macro Statistics

# Adders/Subtractors : 5

16-bit adder : 4

3-bit adder : 1

# Registers : 53

Flip-Flops : 53

# Comparators : 4

4-bit comparator not equal : 4

=========================================================================

Timing Summary

Speed Grade: -5

Minimum period: 12.615ns (Maximum Frequency: 79.268MHz)

Minimum input arrival time before clock: 13.055ns

Maximum output required time after clock: 6.280ns

Maximum combinational path delay: No path found

Timing Detail:

--------------

All values displayed in nanoseconds (ns)

=========================================================================

Timing constraint: Default period analysis for Clock 'clk'

Clock period: 12.615ns (frequency: 79.268MHz)

Total number of paths / destination ports: 64845 / 68

4.5. Conclusion:

In this chapter, the simulation reports and synthesis reports of the four implementations were observed.

5. RESULT

This project comprises several chapters devoted to the design of an FIR filter using four different algorithms. Here, comparison tables are presented to give an overview of the resource usage and the timing of each. A brief overview of the project, including its scope, motivation and objectives, has been discussed. Finally, an FIR filter was designed and its output observed on the Spartan-3 FPGA kit.

Resource comparison table:

Resource          Specification      Direct          Distributed   Proposed DA  LUT-less
                                     implementation  arithmetic    algorithm    algorithm
ROM               8x16-bit           --              -             1            -
ROM               16x16-bit          --              1             -            -
Adder/subtractor  16-bit adder       --              1             2            4
Adder/subtractor  3-bit adder        --              1             1            1
Registers         Flip-flop          --              53            53           53
Comparator        4-bit comparator   --              4             4            4

Delay comparison:

Technique               Time delay (ns)   Frequency (MHz)
Direct implementation   --                --
Distributed Arithmetic  9.602             104.140
Proposed DA algorithm   10.188            98.154
LUT-less algorithm      12.615            79.268
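
The frequencies in the delay comparison are the reciprocals of the clock periods reported by synthesis. The short Python sketch below recomputes them from the reported periods and expresses the slowdown of each variant relative to the original DA design. The function names are illustrative only and are not part of the project. The recomputed values agree with the reported ones to within about 0.01 MHz; the small residue is presumably due to rounding of the displayed period.

```python
# Reported (period in ns, frequency in MHz) pairs taken from the synthesis reports.
REPORTED = {
    "Distributed Arithmetic": (9.602, 104.140),
    "Proposed DA algorithm": (10.188, 98.154),
    "LUT-less algorithm": (12.615, 79.268),
}


def period_ns_to_mhz(period_ns):
    """Convert a clock period in nanoseconds to a frequency in MHz."""
    return 1000.0 / period_ns


def slowdown_percent(base_ns, other_ns):
    """Relative increase of the clock period with respect to a baseline."""
    return 100.0 * (other_ns - base_ns) / base_ns


if __name__ == "__main__":
    base_ns = REPORTED["Distributed Arithmetic"][0]
    for name, (period_ns, reported_mhz) in REPORTED.items():
        print(
            f"{name}: {period_ns_to_mhz(period_ns):.3f} MHz computed, "
            f"{reported_mhz:.3f} MHz reported, "
            f"{slowdown_percent(base_ns, period_ns):.1f}% slower than DA"
        )
```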

6. CONCLUSION

This project presents the proposed DA architectures for the FIR filter. The architectures reduce the memory usage by half at every iteration of LUT reduction, at the cost of a limited decrease in the system frequency. We also divide high-order filters into several groups of small filters, which further reduces the LUT size. The proposed DA algorithm is adopted to obtain a fast implementation of the FIR filter.
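
The halving of the LUT can be illustrated with a small behavioral model. The Python sketch below is not the project's HDL; it assumes 8-bit two's-complement samples, integer coefficients chosen only for illustration, and one plausible way of halving the table (taking one tap out of the LUT address and handling it with a 2:1 selection and an extra adder). For a 4-tap filter the full table has 16 entries and the reduced table has 8, which matches the 16x16-bit and 8x16-bit ROMs in the synthesis reports.

```python
def build_da_lut(coeffs):
    """Entry `addr` holds the sum of the coefficients whose address bit is 1."""
    return [
        sum(c for k, c in enumerate(coeffs) if (addr >> k) & 1)
        for addr in range(1 << len(coeffs))
    ]


def da_output(samples, coeffs, bits=8):
    """Bit-serial DA inner product using the full 2**N-entry LUT."""
    lut = build_da_lut(coeffs)
    acc = 0
    for b in range(bits):
        # Bit b of every sample forms the LUT address.
        addr = 0
        for k, x in enumerate(samples):
            addr |= ((x >> b) & 1) << k
        term = lut[addr] << b
        # The sign bit of a two's-complement sample carries negative weight.
        acc += -term if b == bits - 1 else term
    return acc


def da_output_half_lut(samples, coeffs, bits=8):
    """Same result with a LUT of half the size plus a 2:1 selection and an adder."""
    lut = build_da_lut(coeffs[:-1])  # last tap removed from the address
    acc = 0
    for b in range(bits):
        addr = 0
        for k, x in enumerate(samples[:-1]):
            addr |= ((x >> b) & 1) << k
        # 2:1 multiplexer: select the last coefficient or zero.
        extra = coeffs[-1] if (samples[-1] >> b) & 1 else 0
        term = (lut[addr] + extra) << b
        acc += -term if b == bits - 1 else term
    return acc


# Illustrative values only (not the project's coefficients).
COEFFS = [3, -5, 7, 2]
SAMPLES = [100, -128, 17, -1]
MAC_RESULT = sum(h * x for h, x in zip(COEFFS, SAMPLES))

if __name__ == "__main__":
    print(MAC_RESULT, da_output(SAMPLES, COEFFS), da_output_half_lut(SAMPLES, COEFFS))
```

Applying the same step again would leave a 4-entry table, and removing every tap from the address leaves only adders, which is consistent with the LUT-less report showing no ROM and four 16-bit adders.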

We have successfully implemented a highly efficient 4-tap FIR filter, using both the original DA architecture and the proposed DA architecture, on the Spartan-3 FPGA kit. The results show that the proposed DA architecture is hardware-efficient for FPGA implementation.

7. FUTURE SCOPE

The speed of the filter can be further increased by pipelining, which allows parallel processing to be implemented. With pipelining, the filter can also be extended to higher orders. A 70-tap filter can be implemented using a symmetrical structure, which reduces it to 35 taps. The 35-tap filter can then be divided into 7 smaller 5-tap filters, each with a DA-LUT unit that could be implemented by a 4-input LUT with an additional 2:1 multiplexer and a full adder. Thus our 4-tap filter can be extended to 70 taps and beyond.
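
The reduction from 70 taps to 35 relies on coefficient symmetry: samples that share a coefficient are added first, so only half of the coefficients are needed. The Python sketch below is a behavioral illustration under assumed values (the coefficients and samples are made up, and the function names are hypothetical); it folds a symmetric 70-tap filter to 35 taps and then splits it into 7 groups of 5 taps whose partial sums are added.

```python
def fir_direct(window, coeffs):
    """Direct multiply-accumulate over one window of input samples."""
    return sum(h * x for h, x in zip(coeffs, window))


def fold_symmetric(window, coeffs):
    """Pre-add mirrored samples so that only half of the coefficients are used."""
    n = len(coeffs)
    if n % 2 or coeffs != coeffs[::-1]:
        raise ValueError("expects an even-length symmetric coefficient list")
    half = n // 2
    presums = [window[k] + window[n - 1 - k] for k in range(half)]
    return presums, coeffs[:half]


def grouped_output(presums, half_coeffs, group_size=5):
    """Split the folded filter into small groups and add their partial sums."""
    partials = [
        sum(
            h * s
            for h, s in zip(half_coeffs[i:i + group_size], presums[i:i + group_size])
        )
        for i in range(0, len(half_coeffs), group_size)
    ]
    return sum(partials), len(partials)


# Illustrative symmetric 70-tap coefficient set and input window.
HALF = list(range(1, 36))
COEFFS = HALF + HALF[::-1]
WINDOW = [(3 * i) % 17 - 8 for i in range(70)]

PRESUMS, HALF_COEFFS = fold_symmetric(WINDOW, COEFFS)
Y_GROUPED, NUM_GROUPS = grouped_output(PRESUMS, HALF_COEFFS)
Y_DIRECT = fir_direct(WINDOW, COEFFS)

if __name__ == "__main__":
    print(Y_DIRECT, Y_GROUPED, NUM_GROUPS)
```

Note that each pre-added sample is one bit wider than the original samples, so a hardware version would need to account for the extra bit.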

APPENDIX-A

Xilinx FPGA

Field Programmable Gate Arrays (FPGAs) are integrated circuits that can be easily programmed by the user. An FPGA contains versatile logic functions, configurable interconnects and an input/output interface that adapt to the user's specifications. FPGAs allow rapid prototyping using custom logic structures and are very popular for limited-production products. Modern FPGAs are extremely dense, with a complexity of several million gates, which enables the emulation of very complex hardware such as parallel microprocessors or mixtures of processors and signal-processing blocks. One key advantage of FPGAs is that they can be reprogrammed to create completely different hardware by modifying the logic gate array.

Advantages of FPGA:

1. Short turnaround time

2. Design independent

3. Flexibility

Classification of FPGAs:

The FPGAs are classified based on switching technology.

1. SRAM based FPGAs

2. ANTIFUSE based FPGAs

XILINX and ALTERA are the leading manufacturers of SRAM-based FPGAs.

ACTEL, QUICKLOGIC and CYPRESS are the leading manufacturers of ANTIFUSE-based FPGAs.

XILINX FPGA

Xilinx is a developer of FPGA and CPLD devices that are used in numerous applications in telecommunications, consumer, defense and other fields. Xilinx offers device families for glue logic (CoolRunner, CoolRunner-II), low-cost (Spartan) and high-end (Virtex) applications. Xilinx also provides application-optimized FPGA series: LX (for logic), SX (for signal processing) and FX (for full-featured).

Xilinx develops IP (intellectual property) cores, designed in HDL, which allow designers to minimize time to market. These IP cores range from simple functions (such as BCD encoders and counters) to complex systems (such as multi-gigabit networking cores and custom embedded microcontrollers like the fully featured MicroBlaze soft microprocessor and the compact PicoBlaze microcontroller). In addition, Xilinx Design Services (XDS) can create custom cores.

Xilinx offers Electronic Design Automation (EDA) tools for use with its devices. Chief among these

is ISE, which offers a complete EDA flow. Domain specific tools include Xilinx's Embedded Developer's

Kit (EDK), which is aimed primarily at designers wishing to use the embedded PowerPC 405 core in the

Virtex-II Pro and Virtex-4, or Xilinx's own soft microprocessor/microcontroller in their designs. Other

domain-specific tools include Xilinx's System Generator for DSP, which provides seamless simulation

and implementation of high-performance DSP designs on Xilinx's FPGAs. A design is implemented in a Xilinx FPGA using the device's basic logic blocks: the desired functionality is built from the components available in each logic block.

Basic Logic Block

In a Xilinx FPGA, the basic blocks that can be used for design are Configurable Logic Blocks (CLBs). Each CLB consists of slices, and each slice consists of two LUTs and two D flip-flops.

Look-up Table (LUT):

The Look-up Table (LUT) is a one-bit-wide memory array in which the address lines of the memory are the inputs of the logic block and the one-bit output of the memory is the LUT output. A LUT with n inputs corresponds to a 2^n x 1-bit memory and can realize any logic function of its n inputs by programming the function's truth table directly into the memory. A LUT can therefore implement any single-output combinational circuit of up to n inputs.
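
As a small illustration of this idea, the Python sketch below models an n-input LUT as a 2^n x 1-bit memory that is "programmed" by storing a function's truth table and then read by using the inputs as the address. It is a software model only; the class and function names are hypothetical and do not correspond to any Xilinx primitive.

```python
class LookupTable:
    """Software model of an n-input LUT: a 2**n x 1-bit memory."""

    def __init__(self, num_inputs, function):
        self.num_inputs = num_inputs
        self.memory = []
        # Program the memory with the truth table of `function`.
        for addr in range(1 << num_inputs):
            bits = [(addr >> i) & 1 for i in range(num_inputs)]
            self.memory.append(function(*bits) & 1)

    def read(self, *inputs):
        """Use the input bits as the address and return the stored bit."""
        if len(inputs) != self.num_inputs:
            raise ValueError("wrong number of inputs")
        addr = sum((bit & 1) << i for i, bit in enumerate(inputs))
        return self.memory[addr]


def parity4(a, b, c, d):
    """4-input XOR, one possible single-output function."""
    return a ^ b ^ c ^ d


def majority3(a, b, c):
    """Carry-out of a full adder."""
    return (a & b) | (a & c) | (b & c)


PARITY_LUT = LookupTable(4, parity4)
MAJORITY_LUT = LookupTable(3, majority3)

if __name__ == "__main__":
    print(PARITY_LUT.memory)
    print(PARITY_LUT.read(1, 0, 1, 1), MAJORITY_LUT.read(1, 1, 0))
```

The same table can hold any other 4-input function simply by storing a different truth table, which is what makes the LUT a general-purpose logic element.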

XILINX FPGA DESIGN FLOW:

The Xilinx FPGA design flow is shown in Figure A-1. The first step in implementing a design on an FPGA is the specification. The specification consists of the number of inputs, the number of outputs and the range of values that the kit can accept. The architecture is designed from these specifications and describes the interconnections between all the blocks in the design.

Every block in the architecture, along with its interconnections, is modeled in either VHDL or Verilog, depending on the requirement. All these blocks are then simulated and their outputs verified for correct functionality. Simulation can be done at various levels of abstraction; besides functional simulation, there are post-synthesis simulation, post-place-and-route simulation and so on. Simulation needs a set of test vectors for checking the functionality.

Once the functional simulation is correct, the next step is synthesis. Synthesis converts the HDL description into a netlist, which describes the functional hardware elements. The synthesis step gives two views of the design: one technology-dependent and the other technology-independent. The technology-independent view gives the design information in terms of gates and other components that do not depend on any technology. The technology-dependent view gives the hardware information in terms of LUTs and other components that are specific to Xilinx technology.

Place & Route is the next step, in which the tool places all the components on the FPGA die for optimum performance in terms of both area and speed. After the components are placed, the interconnections between them are routed.

In the post-place-and-route simulation step, the tool takes into account the actual delays in the design and performs the simulation with these delays. The delays arise from electrical loading effects, wiring delays and stray capacitances.

After post-place-and-route simulation comes generation of the bitmap file, which converts the implemented VHDL/Verilog design into a bitstream used to configure the FPGA kit. A .bit file is generated in this step.

The final step is downloading the bitmap file onto the FPGA board, which is done by connecting the computer to the FPGA board with a JTAG (Joint Test Action Group) cable; JTAG is an IEEE standard. The bitmap file contains the whole design that is to be used on the FPGA board.

APPENDIX-B

Spartan-3 Starter Kit – Introduction:

The Xilinx Spartan-3 Starter Kit provides a low-cost, easy-to-use development and

evaluation platform for Spartan-3 FPGA designs.

Figure B-1 shows the Spartan-3 Starter Kit board, which includes the following

components and features:

400,000-gate Xilinx Spartan-3 XC3S400 FPGA in a 256-ball thin Ball Grid Array package (XC3S400FT256) [1]

Table B-1 shows the device information of the Xilinx Spartan-3 FPGA.

Family Name    Xilinx Spartan-3
Device Name    XC3S400
Capacity       400,000 gates
Package        Ball Grid Array
Speed Grade    -4/-5

Table B-1: Device Information of Spartan-3

The components and capacity of the Xilinx Spartan-3 FPGA are shown below.

8,064 logic cell equivalents

Sixteen 18K-bit block RAMs (288K bits)

Sixteen 18x18 hardware multipliers

Four Digital Clock Managers (DCMs)

Up to 173 user-defined I/O signals

2Mbit Xilinx XCF02S Platform Flash, in-system programmable configuration PROM [2]

1Mbit non-volatile data or application code storage available after FPGA configuration

Jumper options allow FPGA application to read PROM data or FPGA configuration from

other sources[3]

1M-byte of Fast Asynchronous SRAM (bottom side of board) [4]

Two 256Kx16 ISSI IS61LV25616AL-10T 10 ns SRAMs

Configurable memory architecture

- Single 256Kx32 SRAM array, ideal for MicroBlaze code images

- Two independent 256Kx16 SRAM arrays

Individual chip select per device

Individual byte enables

3-bit, 8-color VGA display port[5]

9-pin RS-232 Serial Port [6]

DB9 9-pin female connector (DCE connector)

Maxim MAX3232 RS-232 transceiver/translator[7]

Uses straight-through serial cable to connect to computer or workstation serial port

Second RS-232 transmit and receive channel available on board test points[8]

PS/2-style mouse/keyboard port[9]

Four-character, seven-segment LED display[10]

Eight slide switches[11]

Eight individual LED outputs[12]

Four momentary-contact push button switches[13]

50 MHz crystal oscillator clock source (as in Figure B-2) [14]

Socket for an auxiliary crystal oscillator clock source[15]

FPGA configuration mode selected via jumper settings[16]

Push button switch to force FPGA reconfiguration (FPGA configuration happens automatically at power-on) [17]

LED indicates when FPGA is successfully configured[18]

Three 40-pin expansion connection ports to extend and enhance the Spartan-3 Starter Kit

Board[19][20][21]

See www.xilinx.com/s3board for compatible expansion cards

Compatible with Digilent, Inc. peripheral boards

https://digilent.us/Sales/boards.cfm#Peripheral

FPGA serial configuration interface signals available on the A2 and B1 connectors -

PROG_B, DONE, INIT_B, CCLK, DONE

JTAG port [22] for low-cost download cable[23]

Digilent JTAG download/debugging cable connects to PC parallel port [23].

JTAG download/debug port compatible with the Xilinx Parallel Cable IV and MultiPRO

Desktop Tool [24].

AC power adapter input for included international unregulated +5V power supply[25].

Power-on indicator LED [26].

On-board 3.3V [27], 2.5V [28], and 1.2V [29] regulators

Component Locations:

Figure B-2 indicates the component locations on the top side and bottom side of the board.

Figure B-2: XILINX Spartan-3 Starter Kit
