
High Speed VLSI Architecture for General Linear Feedback Shift Register (LFSR) Structures

ABSTRACT

A linear feedback shift register (LFSR) is a shift register whose input bit is a linear

function of its previous state. The only linear function of single bits is xor, thus it is a shift

register whose input bit is driven by the exclusive-or (xor) of some bits of the overall shift

register value. The initial value of the LFSR is called the seed, and because the operation

of the register is deterministic, the stream of values produced by the register is completely

determined by its current (or previous) state. Likewise, because the register has a finite

number of possible states, it must eventually enter a repeating cycle.

Linear Feedback Shift Register (LFSR) structures are widely used in digital signal processing and communication systems, for example in BCH and CRC codes. Many common functions such as scrambling, convolutional coding, CRC and even CORDIC or the Fast Fourier Transform can be derived as Linear Feedback Shift Registers (LFSRs). In high-rate digital systems such as optical communication systems, throughputs of 1 Gbps or more are usually desired. The serial input/output operation of the LFSR structure is a bottleneck in such systems, and a parallel LFSR architecture is thus required.

This work presents a three-step high-speed VLSI architecture for LFSR structures. The proposed three-step architecture offers both higher hardware efficiency and higher speed, and can be applied to any LFSR structure for high-speed parallel implementation.
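The serial behaviour described above can be sketched in a few lines of Python. This is an illustrative Fibonacci-style LFSR, not the architecture proposed in this work; the 4-bit register width and the feedback polynomial x^4 + x^3 + 1 are arbitrary maximal-length choices of my own.

```python
def lfsr_step(state, nbits=4, taps=(3, 2)):
    """One clock of a Fibonacci LFSR: shift left, feed back the XOR of tap bits.

    taps=(3, 2) corresponds to the feedback polynomial x^4 + x^3 + 1,
    a maximal-length (primitive) choice for a 4-bit register.
    """
    fb = 0
    for t in taps:
        fb ^= (state >> t) & 1
    return ((state << 1) | fb) & ((1 << nbits) - 1)

def period(seed):
    """Clock the register until the seed state recurs."""
    state, n = lfsr_step(seed), 1
    while state != seed:
        state = lfsr_step(state)
        n += 1
    return n
```

Every non-zero seed cycles through all 2^4 - 1 = 15 non-zero states before repeating, illustrating both the determinism and the finite repeating cycle noted above.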


1. Introduction

1.1 Error Coding

In recent years there has been an increasing demand for digital transmission and

storage systems. This demand has been accelerated by the rapid development and

availability of VLSI technology and digital processing. It is frequently the case that a digital system must be fully reliable, as a single error may shut down the whole system or cause unacceptable corruption of data, e.g. in a bank account. In situations such as this, error control must be employed so that an error may be detected and afterwards corrected.

The simplest way of detecting a single error is a parity checksum, which can be

implemented using only exclusive-or gates. But in some applications this method is

insufficient and a more sophisticated error control strategy must be implemented.
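As an illustration of the point above, a single parity bit computed with exclusive-or gates detects any odd number of bit errors but misses even-sized ones. A minimal sketch:

```python
def parity_bit(data):
    # Even parity: XOR of all data bits
    p = 0
    for bit in data:
        p ^= bit
    return p

def passes_check(codeword):
    # A received word is accepted when its overall parity is zero
    return parity_bit(codeword) == 0
```

Flipping one bit of a valid codeword fails the check, but flipping two bits passes it undetected, which is exactly the insufficiency noted above.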

If a transmission system can transfer data in both directions, an error control

strategy may be determined by detecting an error and then, if an error has occurred,

retransmitting the corrupted data. These systems are called automatic repeat request

(ARQ). If transmission takes place in only one direction, e.g. information recorded on a

compact disk, the only way to accomplish error control is with forward error correction

(FEC). In FEC systems some redundant data is concatenated with the information data in

order to allow for the detection and correction of the corrupted data without having to

retransmit it. One of the most important classes of FEC codes is linear block codes. In

block codes, data is transmitted and corrected within one block (codeword). That is, the

data preceding or following a transmitted codeword does not influence the current

codeword. Linear block codes are described by the integer n, the total number of symbols

in the associated codeword. Block codes are also described by the number k of information

symbols within a codeword, and the number of redundant (check) symbols n-k.

In error control, it is crucial to understand the sources of errors. Each transmitted

bit has probability p > 0 of being received incorrectly. On memoryless channels every


transmitted symbol may be considered independently, so only random errors occur.

Unfortunately, most channels have memory and usually several successive symbols are

corrupted. These kinds of errors are called burst errors [29]. Burst errors can be most

efficiently corrected through use of burst error correcting codes, e.g. Reed Solomon (RS)

codes [44]. Because the structure of burst error correcting codes is usually complicated,

multiple random error correcting codes are often employed. In order to improve burst error

correction, the transmitted codewords are also rearranged by interleaving. The resulting

code is called an interleaved code. In this way the burst errors scatter into several

codewords and look like random errors. Other operations on block codes are also available

to improve the error correcting ability or to adapt a code to a specified requirement. For

example codes may be shortened, extended, concatenated or interleaved [2,5].
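A minimal sketch of the block interleaving described above (my own illustration, with an arbitrary depth of 3): symbols are written row-wise into an array and read column-wise, so a burst in the channel is scattered across several codewords after de-interleaving.

```python
def interleave(symbols, depth):
    # Write row-wise into a (depth x width) array, read column-wise
    width = len(symbols) // depth
    return [symbols[r * width + c] for c in range(width) for r in range(depth)]

def deinterleave(symbols, depth):
    # Invert the column-wise read order of interleave()
    width = len(symbols) // depth
    out = [None] * len(symbols)
    for i, s in enumerate(symbols):
        c, r = divmod(i, depth)
        out[r * width + c] = s
    return out
```

A burst corrupting three consecutive interleaved symbols lands as a single error in each of the three original codewords, where a random-error-correcting code can handle it.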

The simplest block codes are Hamming codes. They are capable of correcting only

one random error and therefore are not practically useful, unless a simple error control

circuit is required. More sophisticated error correcting codes are the Bose, Chaudhuri and

Hocquenghem (BCH) codes that are a generalisation of the Hamming codes for multiple-

error correction. In this thesis the subclass of binary, random error correcting BCH codes

is considered, hereafter called BCH codes. BCH codes operate over finite fields or Galois

fields. The mathematical background concerning finite fields is well specified and in

recent years the hardware implementation of finite fields has been extensively studied.

Furthermore, any BCH code can be defined by only two fundamental parameters and these

parameters can be selected by the designer. These parameters are crucial to the design and

the question arises if it is possible to develop a tool that will automatically generate any

BCH codec description, just by providing the code size n and the number of errors to be

corrected t. This design automation would considerably reduce BCH codec design cost and

time and increase the ease with which BCH codecs with different design parameters are

generated. This is an important motivation since the architectures of BCH codecs with

different parameters can vary remarkably.


1.2 Hardware solutions

BCH codes employ sophisticated algorithms and their implementation is rather

burdensome. The safe solution, both in terms of cost and time, is a software solution. But as BCH codes operate over finite fields, standard microprocessor arithmetic is not suitable, and a software solution is therefore rather slow. Another kind of solution is to

employ a specialist digital signal processing (DSP) unit, but this option requires rather

expensive and sophisticated hardware and can be adopted only when a small number of

devices is to be produced. Overall, software solutions are therefore slower, consume more

power and are less reliable than hardware implementations.

In recent years the Programmable Logic Device (PLD) has been developed and the

PLD subclass of Field Programmable Gate Arrays (FPGAs) has been introduced. This has

revolutionised hardware design and its implementation. The advantages of an FPGA

solution are as follows:

1. The FPGA is fully reprogrammable.

2. A design can be automatically converted from the gate level into the layout structure by the place and route software. Therefore design changes can be made almost as easily as software ones.

3. Simulation at the layout level, where the design is tied to the internal FPGA structure, is also possible (back annotation). This enables not only the logical functionality but also the timing characteristics of the design to be simulated.

4. Xilinx Inc. offers a wide range of components [55]. For example the XC3000 family offers 1,300 to 9,000 gate complexity and 256-1320 flip-flops, so even a relatively complex design can be implemented. (A range of other manufacturers also market FPGA devices, including Actel and Altera.)

In conclusion, a hardware solution can be easily implemented, and the differences

between hardware and software solutions have become blurred. Unfortunately, although

FPGA solutions are easy to introduce and verify, they are rather expensive and therefore

not economical for mass-production. In this case, a full or semi-custom Application

Specific Integrated Circuit (ASIC) might be more appropriate. An ASIC solution is more

complex and its implementation takes much longer than an FPGA. On the other hand,


although an ASIC is characterised as having high starting costs it will allow for a lower

cost per chip in mass-production. However an ASIC solution cannot be modified easily or

cheaply, due to the high cost of layout masks and the long time required for their

development.

1.3 Verilog HDL and synthesis

The development of VLSI and PLDs has stimulated a demand for a hardware

description language (HDL) with a well-defined syntax and semantics. This requirement led to the development of hardware description languages such as VERILOG [1,25,26]. The VERILOG language describes a digital circuit using the module construct. A module contains an input/output port interface and a description of the circuit's behaviour or structure. The language supports different data objects, namely parameters, nets and variables, and there are also different data formats available, for instance bits, integers and real numbers. VERILOG also supports numerous operators, such as addition, multiplication, exponentiation and modulo reduction, on these data types [26].

VERILOG offers the opportunity for different levels of design. This is a crucial

feature of the language as it enables design partitioning and simulation at different levels,

thus the design can be hierarchical. In addition, VERILOG allows a design to be described

in different domains [25,34]. There are three different domains for describing digital

systems. The behavioural domain describes the system without stating how the specified

functionality is to be implemented. The structural domain describes a network of

components. The physical domain describes how a system is actually to be built.

VERILOG models of digital systems can be written at each of these three levels. These

models can then be simulated using Electronic Computer Aided Design (ECAD) tools.

VERILOG has subsequently become a standard [26] and has been widely adopted

throughout the electronics industry.

ECAD tools have long since been available which convert gate level descriptions

of circuits into descriptions which can be accepted by ASIC manufacturers. One of the key

recent developments has been the design of automatic synthesis tools which convert higher

level textual descriptions of digital circuits into lower level or gate level


descriptions. These synthesis tools therefore allow high level descriptions of circuits to be translated into hardware much more quickly and cheaply than was previously the case. By

virtue of being a standard, there are numerous proprietary VERILOG synthesis tools

available.

Synthesis may be considered as either high level, logic level or layout level

synthesis depending on the level of abstraction involved. The highest level of design

abstraction is the system level, where the design specification and performance are defined

and a system is described, for example, as a set of processors, memories, controllers and

buses. Below this is the algorithmic level where the focus is on data structures and the

computations performed by individual processors. Next comes the register transfer level

(RTL) where the system is viewed as a set of interconnected storage elements and

functional blocks. Below this is the logic level where the system is described as a network

of gates and flip-flops. The lowest level of abstraction is the circuit level, which views the system in terms of the individual transistors or other elements of which it is composed.

High level synthesis [36] takes place on the algorithmic level and on the RTL.

Usually there are different structures that can be used to realise a given behaviour, and one

of the tasks of high level synthesis is to find the structure that best meets the given

behaviour. There are a number of reasons why high-level synthesis should be considered.

For example high level synthesis reduces design times and allows for the possibility of

searching the design space for different trade-offs between cost, speed and power.

Unfortunately in practice, high-level synthesis tools are rather difficult to develop.

Furthermore, a hand-crafted design is often more hardware efficient. As a result, the design

is usually synthesised at the lower level of abstraction.


Logic level synthesis is much simpler because the digital blocks have already been

determined, therefore one of the most important aspects of this process is optimisation.

Logic synthesis is often associated with a target technology because the final logic form

for different technologies is different. The intention at this level may also be to minimise

the delay through the circuit [9] and/or to minimise the hardware requirements [11]. This

task may be even more complicated when some signals must be optimised with respect to time delay whilst others must be optimised for reduced hardware. Layout

level synthesis has been carried out for many years now [18] and is well understood. For

example the place and route software associated with XILINX FPGA devices can be

considered to carry out layout level synthesis.

One of the most significant problems for a synthesis tool is that the number of

possible solutions increases rapidly with an increase in logic complexity. Usually synthesis problems are NP-complete, that is, the time required to find an exact solution is believed to grow exponentially with the size of the problem. Therefore the time required to find the best solution is usually considerable. Consequently algorithms producing inexact but close-to-optimum solutions are employed - so-called heuristics [13].

Design synthesis is a very powerful tool that, in theory, saves a considerable amount of design time, as the design need not be developed at the gate level but instead at a

higher level. In addition, the synthesis tool optimises the final design according to the

specified technology and predefined criteria such as minimum area and speed.

Unfortunately, synthesis tools are very complex and difficult to develop.

Various commercial synthesis tools are available, usually operating at the RTL

level, but seldom higher. The problem for a BCH codec designer therefore is that they must have a high level of understanding of BCH codes before they can write these RTL

descriptions in the first place. It is therefore the aim of this project to develop a high level

synthesis tool for the design of BCH codecs. This tool will accept the parameters n and t of

a BCH code and then generate the VERILOG description of the resulting BCH encoder

and decoder. These VERILOG descriptions will be written at the RTL/logic level to

facilitate their synthesis to gate level using a standard synthesis tool.


1.4 Overview of thesis

The structure of this thesis is as follows. Chapter 2 presents finite fields and their

arithmetic. It considers how to construct bit-serial and bit-parallel finite field multipliers

for the dual, normal and polynomial bases. In addition, finite field inversion and

exponentiation are considered and a new approach for raising field elements to the third

power is presented. This chapter further presents a new hardware-efficient architecture

generating the sum of products and a new dual-polynomial basis multiplier.

Chapter 3 introduces BCH codes and algorithms for encoding and decoding BCH

codes are presented. Chapter 4 describes the BCH codec synthesis system.


2. Finite Fields and Field Operators

2.1 Introduction

In this chapter finite fields and finite field arithmetic operators are introduced. The

definitions and main results underlying finite field theory are presented and it is shown

how to derive extension fields. The various finite field arithmetic operators are reviewed.

In addition, new circuits are presented carrying out frequently used arithmetic operations

in decoders. These operators are shown to have faster operating speeds and lower

hardware requirements than their equivalents and consequently have been used extensively

throughout this project.

Finite fields

Error control codes rely to a large extent on powerful and elegant algebraic

structures called finite fields. A field is essentially a set of elements in which it is possible

to add, subtract, multiply and divide field elements and always obtain another element

within the set. A finite field is a field containing a finite number of elements. A well-known example of a field is the infinite field of real numbers.

2.2 Field definitions and basic features

The concept of a field is now more formally introduced. A field F is a non-empty

set of elements with two operators usually called addition and multiplication, denoted ‘+’

and ‘*’ respectively. For F to be a field a number of conditions must hold:


1. Closure: For every a, b in F

c = a + b; d = a * b; (2.1)

where c, d ∈ F.

2. Associative: For every a, b, c in F

a + (b + c) = (a + b) + c and a * (b * c) = (a * b) * c. (2.2)

3. Identity: There exists an identity element ‘0’ for addition and ‘1’ for multiplication that

satisfy

0 + a = a + 0 = a and a * 1 = 1 * a = a (2.3)

for every a in F.

4. Inverse: If a is in F, there exist elements b and c in F such that

a + b = 0 and a * c = 1. (2.4)

Element b is called the additive inverse, b = (-a); element c is called the multiplicative inverse, c = a^-1 (a ≠ 0).

5. Commutative: For every a, b in F

a + b = b + a and a * b = b * a. (2.5)

6. Distributive: For every a, b, c in F

(a + b) * c = a * c + b * c. (2.6)

The existence of a multiplicative inverse a^-1 enables the use of division. This is because for a, b, c ∈ F, c = b/a is defined as c = b * a^-1. Similarly the existence of an additive inverse (-a) enables the use of subtraction. In this case for a, b, c ∈ F, c = b - a is defined as c = b + (-a).

It can be shown that the set of integers {0, 1, 2, ... , p-1} where p is a prime,

together with modulo p addition and multiplication forms a field [30]. Such a field is

called the finite field of order p, or GF(p), in honour of Évariste Galois [48]. In this thesis

only binary arithmetic is considered, where p is constrained to equal 2. This is because, as

shall be seen, by starting with GF(2), the representation of finite field elements maps

conveniently into the digital domain. Arithmetic in GF(2) is therefore defined modulo 2. It

is from the base field GF(2) that the extension field GF(2m) is generated.
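The modulo-p construction can be checked directly in software. A minimal sketch for GF(7) (p = 7 is an arbitrary prime of my choosing; the thesis itself fixes p = 2):

```python
P = 7  # any prime yields the finite field GF(P)

def add(a, b):
    return (a + b) % P

def mul(a, b):
    return (a * b) % P

def inv(a):
    # Multiplicative inverse via Fermat's little theorem: a^(P-2) mod P
    assert a % P != 0
    return pow(a, P - 2, P)
```

Closure and the existence of inverses, i.e. the field axioms above, hold for every non-zero element. With P = 2 the same construction gives GF(2), where addition is XOR and multiplication is AND.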


2.2.1 The extension field GF(2^m)

Before introducing GF(2^m), some definitions are required. A polynomial p(x) of degree m over GF(2) is a polynomial of the form

p(x) = p0 + p1*x + p2*x^2 + ... + pm*x^m (2.7)

where the coefficients pi are elements of GF(2) = {0,1}. Polynomials over GF(2) can be added, subtracted, multiplied and divided in the usual way [29]. A useful property of polynomials over GF(2) is that ([29], p. 29)

p(x)^2 = (p0 + p1*x + ... + pn*x^n)^2 = p0 + p1*x^2 + ... + pn*x^(2n) = p(x^2). (2.8)

The notion of an irreducible polynomial is now introduced.

Definition 2.1. A polynomial p(x) over GF(2) of degree m is irreducible if p(x) is not

divisible by any polynomial over GF(2) of degree less than m and greater than zero.

To generate the extension field GF(2^m), an irreducible, monic polynomial of degree m over GF(2) is chosen, p(x) say. Then the set of 2^m polynomials of degree less than m over GF(2) is formed and denoted F. It can then be proven that when addition and multiplication of these polynomials is taken modulo p(x), the set F forms a field of 2^m elements, denoted GF(2^m) [30]. Note that GF(2^m) is extended from GF(2) in an analogous way to that in which the complex numbers C are formed from the real numbers R, where in that case p(x) = x^2 + 1.

To represent these 2^m field elements, the important concept of a basis is now introduced.

2.2.2 The polynomial basis and primitive elements

Definition 2.2. A set of m linearly independent elements B = {β0, β1, ..., βm-1} of GF(2^m) is called a basis for GF(2^m).

A basis for GF(2^m) is important because any element a ∈ GF(2^m) can be represented uniquely as the weighted sum of these basis elements over GF(2). That is

a = a0*β0 + a1*β1 + ... + am-1*βm-1, ai ∈ GF(2). (2.9)


Hence the field element a can be denoted by the vector (a0, a1, ..., am-1). This is why the restriction p = 2 has been made, since the above representation maps immediately into the binary field.

There are a large number of possible bases for any GF(2^m) [30]. One of the more important bases is now introduced.

Definition 2.3. Let p(x) be the defining irreducible polynomial for GF(2^m). Take α as a root of p(x); then A = {1, α, α^2, ..., α^(m-1)} is the polynomial basis for GF(2^m).

For example consider GF(2^4) with p(x) = x^4 + x + 1. Take α as a root of p(x); then A = {1, α, α^2, α^3} forms the polynomial basis for this field and all 16 elements can be represented as

a = a0 + a1*α + a2*α^2 + a3*α^3 (2.10)

where the ai ∈ GF(2). These basis coefficients can be stored in a basis table of the kind shown in Appendix B.
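Representing each element by its coefficient vector (a0, a1, a2, a3), arithmetic modulo p(x) = x^4 + x + 1 can be sketched as follows. The encoding of polynomials as integer bitmasks is my own illustrative convention, with bit i holding the coefficient of x^i:

```python
M, POLY = 4, 0b10011  # p(x) = x^4 + x + 1

def gf_add(a, b):
    # Polynomial addition over GF(2) is coefficient-wise XOR
    return a ^ b

def gf_mul(a, b):
    # Shift-and-add polynomial multiplication, reduced modulo p(x)
    r = 0
    while b:
        if b & 1:
            r ^= a
        b >>= 1
        a <<= 1
        if a & (1 << M):
            a ^= POLY  # reduce using alpha^4 = alpha + 1
    return r
```

With α encoded as 0b0010, gf_mul(0b0010, 0b1000) returns 0b0011, i.e. α * α^3 = α + 1, and every non-zero element has a multiplicative inverse, confirming the field structure.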

Definition 2.4. An irreducible polynomial p(x) of degree m is a primitive polynomial if the smallest positive integer n for which p(x) divides x^n + 1 is n = 2^m - 1.

If α is a root of p(x), where this polynomial is not only irreducible but also primitive, then GF(2^m) can be represented alternatively by the set of elements GF(2^m) = {0, 1, α, α^2, ..., α^(n-1)}, (n = 2^m - 1). In this case α is called a primitive element and α^n = 1. The relationship between powers of the primitive element and the polynomial basis representation of GF(2^4) is also shown in Appendix B.

The choice as to whether to represent field elements over a basis or as powers of a primitive element usually depends on whether a hardware or a software implementation is being adopted. This is because α^i * α^j = α^(i+j), where this index addition is modulo 2^m - 1 and so can easily be carried out on a general purpose computer. Multiplication of field elements using the primitive element representation is therefore simple to implement in software, but addition is much more difficult. For implementation in hardware however a


basis representation of field elements makes addition relatively straightforward to implement. This is because

a = b + c = (b0 + b1*α + ... + bm-1*α^(m-1)) + (c0 + c1*α + ... + cm-1*α^(m-1)) = (b0 + c0) + (b1 + c1)*α + ... + (bm-1 + cm-1)*α^(m-1) (2.11)

and so addition is performed component-wise modulo 2. Hence a GF(2^m) adder circuit comprises 1 or m XOR gates, depending on whether the basis coefficients are represented in series or in parallel. This is an important feature of GF(2^m) and one of the main reasons why finite fields of this form are so extensively used.
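The contrast drawn above can be demonstrated directly. In the self-contained sketch below (my own illustration for GF(2^4)), the hardware-style operations are a single XOR for addition and a shift-and-add reference multiplier, while the software-style multiplication uses log/antilog tables over the primitive element α:

```python
M, POLY = 4, 0b10011  # p(x) = x^4 + x + 1 (primitive)

def gf_mul(a, b):
    # Hardware-style polynomial-basis multiplication modulo p(x)
    r = 0
    while b:
        if b & 1:
            r ^= a
        b >>= 1
        a <<= 1
        if a & (1 << M):
            a ^= POLY
    return r

gf_add = lambda a, b: a ^ b  # component-wise modulo-2 addition, eq. (2.11)

# Software-style: log/antilog tables over the primitive element alpha (= 0b0010)
exp, log = [0] * 15, [0] * 16
e = 1
for i in range(15):
    exp[i], log[e] = e, i
    e = gf_mul(e, 0b0010)

def gf_mul_log(a, b):
    if a == 0 or b == 0:
        return 0
    return exp[(log[a] + log[b]) % 15]  # index addition modulo 2^m - 1
```

The table-based multiplier agrees with the shift-and-add one for every pair of elements, while addition in the basis representation remains a one-gate operation.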

2.2.3 The Dual Basis

The dual basis is an important concept in finite field theory and was originally

exploited to allow for the design of hardware efficient RS encoders [3]. However

subsequent research has allowed dual basis multipliers to be adopted throughout the encoding and decoding processes.

Definition 2.5. [15] Let {αi} and {βi} be bases for GF(2^m), let f be a non-zero linear function from GF(2^m) to GF(2), and let γ ∈ GF(2^m), γ ≠ 0. Then {αi} and {βi} are dual to each other with respect to f and γ if

f(γ * αi * βj) = 1 if i = j, and 0 if i ≠ j. (2.12)

In this case, {αi} is the standard basis and {βi} is the dual basis.

Theorem 2.1. [15]. Every basis has a dual basis with respect to any non-zero linear function f: GF(2^m) → GF(2) and any non-zero γ ∈ GF(2^m).

For example consider GF(2^4) with p(x) = x^4 + x + 1 and take α as a root of p(x). Then {1, α, α^2, α^3} is the polynomial basis for the field. Now taking γ = 1 and f to be the least significant polynomial basis coefficient, {1, α^3, α^2, α} forms the dual basis to the polynomial basis. In fact by varying γ there are 2^m - 1 dual bases to any given basis, and the dual basis with the most attractive characteristics can be taken. This is usually taken to mean the dual basis that can be obtained from the polynomial basis with the simplest linear transformation [38].
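The duality condition (2.12) for this example can be verified exhaustively. A self-contained sketch (the bitmask encoding, with α as 0b0010, is my own convention):

```python
M, POLY = 4, 0b10011  # p(x) = x^4 + x + 1

def gf_mul(a, b):
    # Shift-and-add multiplication modulo p(x)
    r = 0
    while b:
        if b & 1:
            r ^= a
        b >>= 1
        a <<= 1
        if a & (1 << M):
            a ^= POLY
    return r

f = lambda x: x & 1  # least significant polynomial basis coefficient

std = [0b0001, 0b0010, 0b0100, 0b1000]   # {1, alpha, alpha^2, alpha^3}
dual = [0b0001, 0b1000, 0b0100, 0b0010]  # {1, alpha^3, alpha^2, alpha}

def is_dual(basis_a, basis_b):
    # gamma = 1 here, so the condition is f(a_i * b_j) = 1 iff i = j
    return all(f(gf_mul(basis_a[i], basis_b[j])) == (1 if i == j else 0)
               for i in range(M) for j in range(M))
```

The polynomial basis is not dual to itself under this f, which is why the reordered basis is needed.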


2.2.4 Normal basis

A normal basis for GF(2m) is a basis of the form Bm

{ , , , } 2 2 1

where

GF(2m). For every finite field there always exists at least one normal basis [30]. Normal

basis representations of field elements are especially attractive in situations where squaring

is required, since if (a0, a1, ... ,am-1) is the normal basis representation of a GF(2m) then

(am-1, a0, a1, ... , am-2) is the normal basis representation of a2 [31]. This property is

important in its own right but also because it allows for hardware efficient Massey-Omura

multipliers to be designed. The normal basis representation of GF(28) is given in

Appendix B.
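The cyclic-shift property can be demonstrated concretely in GF(2^4). This sketch is my own: rather than asserting which β generates a normal basis, it searches for one by testing the linear independence of {β, β^2, β^4, β^8} with a greedy XOR-basis reduction:

```python
from itertools import product

M, POLY = 4, 0b10011  # p(x) = x^4 + x + 1

def gf_mul(a, b):
    # Shift-and-add multiplication modulo p(x)
    r = 0
    while b:
        if b & 1:
            r ^= a
        b >>= 1
        a <<= 1
        if a & (1 << M):
            a ^= POLY
    return r

def independent(vecs):
    # Greedy XOR-basis (linear independence over GF(2)); vectors are 4-bit ints
    basis = []
    for v in vecs:
        for b in basis:
            v = min(v, v ^ b)
        if v == 0:
            return False
        basis.append(v)
    return True

def conjugates(b):
    # {beta, beta^2, beta^4, beta^8}
    out = [b]
    for _ in range(M - 1):
        out.append(gf_mul(out[-1], out[-1]))
    return out

beta = next(b for b in range(1, 16) if independent(conjugates(b)))
nb = conjugates(beta)  # a normal basis for GF(2^4)

def from_coords(coords):
    a = 0
    for ai, base in zip(coords, nb):
        if ai:
            a ^= base
    return a
```

For every coordinate vector (a0, a1, a2, a3), squaring the element is exactly a rotation of the vector to (a3, a0, a1, a2), as stated above.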

2.3 Multiplication by a constant α^j

It is frequently required to carry out multiplication by a constant value in encoders

and decoders. This can be accomplished using two-variable input multipliers of the type

described later. Alternatively it is often beneficial to employ a multiplier designed

specifically for this task ([29], p. 162) ([30], p.89).

Let a = a0 + a1*α + ... + am-1*α^(m-1) be an element in GF(2^m), where α is a root of the primitive polynomial p(x) = x^m + Σ(j=0..m-1) pj*x^j. Thus

a * α = a0*α + a1*α^2 + ... + am-1*α^m (2.13)

but since p(α) = 0

a * α = a0*α + a1*α^2 + ... + am-2*α^(m-1) + am-1*(p0 + p1*α + p2*α^2 + ... + pm-1*α^(m-1)) (2.14)

which is equivalent to a*α mod p(α).

For example consider multiplication by α in GF(2^4), where p(x) = x^4 + x + 1. Then

a * α = a3 + (a3 + a0)*α + a1*α^2 + a2*α^3 (2.15)

and this multiplication can be carried out with the following circuit.


Figure 2.1. Circuit for computing a ← a * α in GF(2^4).

If the above register is initialised with Ai = ai (i = 0, 1, 2, 3) then by clocking the register once, the value of a * α is generated. This algorithm may be readily extended for multiplication by α^j, where j is any integer, and for any GF(2^m).
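In bit-level terms, one clock of the circuit above amounts to a left shift followed by a conditional XOR with p(x). A sketch assuming the usual bitmask encoding (bit i holds the coefficient of α^i); repeated application gives multiplication by α^j:

```python
M, POLY = 4, 0b10011  # p(x) = x^4 + x + 1

def times_alpha(a):
    # One clock of the Figure 2.1 register: shift every ai up one place,
    # then fold a3 back through p(alpha) = 0, i.e. alpha^4 = alpha + 1
    a <<= 1
    if a & (1 << M):
        a ^= POLY
    return a

def times_alpha_j(a, j):
    # Multiplication by alpha^j = j successive clocks of the register
    for _ in range(j):
        a = times_alpha(a)
    return a
```

The result matches eq. (2.15): bit 0 of the product is a3, bit 1 is a3 + a0, bit 2 is a1 and bit 3 is a2.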

2.4 Bit-serial multiplication

The most commonly implemented finite field operations are multiplication and

addition. Multiplication is considered to be an order of magnitude more complicated than

addition and a large body of research has been carried out attempting to reduce the

hardware and time complexities of multiplication. Finite field adders and multipliers can

be classified according to whether they are bit-serial or bit-parallel, that is whether the m

bits representing field elements are processed in series or in parallel. Whereas bit-serial

multipliers generally require less hardware than bit-parallel multipliers, they also usually

require m clock cycles to generate a product rather than one. Hence in time critical

applications bit-parallel multipliers must be implemented, in spite of the increased

hardware overheads.

2.4.1 Berlekamp multipliers

The Berlekamp multiplier [3] uses two basis representations, the polynomial basis

for the multiplier and the dual basis for the multiplicand and the product. Because it is

normal practice to input all data in the same basis, this means some basis transformation

circuits will be required. Fortunately for m = (3, 4, 5, 6, 7, 9, 10) the basis conversion

from the dual to the polynomial basis - and vice versa - is merely a reordering of the basis

coefficients [38]. For the important case m = 8 - for example the error-correcting systems used in CDs, DAT and many other applications operate over GF(2^8) - this basis conversion

requires a reordering and two additions of the basis coefficients (Appendix C). In practice

therefore, two additional XOR gates are required. Even including the extra hardware for

basis conversions, the Berlekamp multiplier is known to have the lowest hardware

requirements of all available bit-serial multipliers [28].


Now let a, b, c ∈ GF(2^m) such that c = a * b and represent b over the polynomial basis as b = b0 + b1α + ... + bm-1α^(m-1). Further, and following Definition 2.5, let {β0, β1, ..., βm-1} be the dual basis to the polynomial basis for some f and γ. Hence a = a0β0 + a1β1 + ... + am-1βm-1 and c = c0β0 + c1β1 + ... + cm-1βm-1, where these values of ai and ci are given by the following.

Lemma 2.1 [15]. Let {β0, β1, ..., βm-1} be the dual basis to the polynomial basis for GF(2^m) for some f and γ, and let a = a0β0 + a1β1 + ... + am-1βm-1 be the dual basis representation of a ∈ GF(2^m). Then ai = f(γα^i·a) for (i = 0, 1, ..., m-1).

The multiplication c = a*b can therefore be represented in the matrix form [15]

    | a0     a1     ...   am-1   |   | b0   |   | c0   |
    | a1     a2     ...   am     |   | b1   |   | c1   |
    | ...    ...    ...   ...    | * | ...  | = | ...  |
    | am-1   am     ...   a2m-2  |   | bm-1 |   | cm-1 |     (2.16)

where ai = f(γα^i·a) and ci = f(γα^i·c) (i = 0, 1, ..., m-1) are the dual basis coefficients of a and c respectively, and ai = f(γα^i·a) for (i = m, m+1, ..., 2m-2). It can be shown [15] that

    am+k = f(γα^(m+k)·a) = Σ_{j=0}^{m-1} pj·aj+k     (k = 0, 1, .....)     (2.17)

where the pj are the coefficients of p(x). These values of am+k can therefore be obtained

from an m-stage linear feedback shift register (LFSR) where the feedback terms

correspond to the pj coefficients and the LFSR is initialised with the dual basis coefficients

of a. On clocking the LFSR am is generated, then on the next clock cycle am+1 is produced,

and so on. The m vector multiplications listed in equ(2.16) are then carried out by a

structure comprising m 2-input AND gates and (m-1) 2-input XOR gates. As an example, a

Berlekamp multiplier for GF(2^4) is shown in Fig. 2.2 where p(x) = x^4 + x + 1.



Figure 2.2 Bit-serial Berlekamp multiplier for GF(2^4).

The registers in Fig. 2.2 are initialised by Ai = ai and Bi = bi for (i= 0,1,2,3). At this

point the first product bit c0 is available on the output line. The remaining values of c1, c2

and c3 are obtained by clocking the register a further three times.
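This serial scheme can be cross-checked in software. Below is an illustrative Python simulation (a sketch, not the circuit of Fig. 2.2) for GF(2^4) with p(x) = x^4 + x + 1; it uses the fact that for this trinomial the dual basis is simply a permutation of the polynomial basis, and compares the Berlekamp result against direct polynomial-basis multiplication.

```python
M, P = 4, [1, 1, 0, 0]        # GF(2^4); p0..p3 of p(x) = x^4 + x + 1 (x^4 implicit)

def poly_mult(a, b):
    """Reference product of two coefficient lists (LSB first) modulo p(x)."""
    acc, av = 0, sum(bit << i for i, bit in enumerate(a))
    for i, bit in enumerate(b):
        if bit:
            acc ^= av << i
    for i in range(2 * M - 2, M - 1, -1):   # reduce modulo x^4 + x + 1
        if (acc >> i) & 1:
            acc ^= 0b10011 << (i - M)
    return [(acc >> i) & 1 for i in range(M)]

def to_dual(z):
    """Dual-basis coordinates; for this trinomial, just a reordering."""
    return [z[0], z[3], z[2], z[1]]

def berlekamp_mult(a, b):
    ad = to_dual(a)                         # LFSR initialised with dual coords of a
    for k in range(M - 1):                  # extend via equ(2.17)
        ad.append(sum(P[j] & ad[j + k] for j in range(M)) % 2)
    cd = [sum(ad[k + j] & b[j] for j in range(M)) % 2 for k in range(M)]
    return to_dual(cd)                      # the permutation is its own inverse
```

Exhaustively comparing berlekamp_mult against poly_mult over all 256 input pairs confirms agreement.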

With the above scheme at least one basis conversion is required if both inputs and

the output are to be represented over the same basis. This basis transformation is a linear

transformation of the basis coefficients and can be implemented within the multiplier

structure itself. However with GF(2^4) the dual basis is a permutation of the polynomial

basis coefficients and so this conversion can be implemented by a simple reordering of the

inputs.

2.4.2 Massey-Omura Multiplier

The Massey-Omura multiplier [31,54] operates entirely over the normal basis and

so no basis converters are required. The idea behind the Massey-Omura multiplier is that if

the Boolean function generating the first product bit has the inputs cyclically shifted, then

this same function will also generate the second product bit. Furthermore with each

subsequent cyclic shift a further product bit is generated. Hence instead of m Boolean

functions, one Boolean function is required to generate all m product bits but with the

inputs to this function shifted each clock cycle.


As an example, consider a Massey-Omura bit-serial multiplier for GF(2^4). Let α be a root of p(x) = x^4 + x + 1 and let a normal basis for the field be {α^3, α^6, α^12, α^9}. Further, let a, b, c ∈ GF(2^4) such that c = a*b and represent these elements over the normal basis. Then

c = c0α^3 + c1α^6 + c2α^12 + c3α^9
  = (a0α^3 + a1α^6 + a2α^12 + a3α^9) * (b0α^3 + b1α^6 + b2α^12 + b3α^9)

where c0 = a0b2 + a1b2 + a1b3 + a2b0 + a2b1 + a3b1 + a3b3
      c1 = a1b3 + a2b3 + a2b0 + a3b1 + a3b2 + a0b2 + a0b0
      c2 = a2b0 + a3b0 + a3b1 + a0b2 + a0b3 + a1b3 + a1b1
      c3 = a3b1 + a0b1 + a0b2 + a1b3 + a1b0 + a2b0 + a2b2.     (2.18)

From equ(2.18) only one Boolean function is required to generate c0; the remaining values of c1, c2 and c3 are obtained by adding one (modulo 4) to all of the indices. This amounts to a cyclic

shift of the inputs to this Boolean function. A circuit diagram for this multiplier is given in

Fig 2.3. The registers in Fig. 2.3 are initialised by Ai = ai and Bi = bi for (i=0,1,2,3). At

this point the first product bit c0 will be available on the output line. The remaining values

of c1, c2 and c3 are obtained by cyclically shifting the registers a further three times.


Figure 2.3. Bit-serial Massey-Omura multiplier for GF(2^4).
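The shift-and-reuse idea can be sketched as follows (illustrative Python, not the circuit of Fig. 2.3), for the GF(2^4) example above with normal basis {α^3, α^6, α^12, α^9}: the single function below is the c0 line of equ(2.18), and cyclic shifts of both registers yield the remaining product bits.

```python
def mo_c0(a, b):
    """The defining Boolean function generating c0 (first line of equ(2.18))."""
    return (a[0] & b[2]) ^ (a[1] & b[2]) ^ (a[1] & b[3]) ^ (a[2] & b[0]) \
         ^ (a[2] & b[1]) ^ (a[3] & b[1]) ^ (a[3] & b[3])

def mo_multiply(a, b):
    """Massey-Omura product over the normal basis {a^3, a^6, a^12, a^9}.

    One product bit is produced per clock cycle; cyclically shifting both
    registers between clocks adds one to every index of the defining
    function, so it generates c0, c1, c2 and c3 in turn.
    """
    c = []
    for _ in range(4):
        c.append(mo_c0(a, b))
        a = a[1:] + a[:1]           # cyclic shift = indices + 1
        b = b[1:] + b[:1]
    return c
```

Comparing this against a reference polynomial-basis multiply (via the normal-to-polynomial basis change) over all 256 input pairs confirms the shift property.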

In the case of a Massey-Omura multiplier for GF(2^4), from equ(2.18) seven 2-input AND gates and six 2-input XOR gates are required to implement the required

Boolean equation. In general there is a result that states the defining Massey-Omura


function for a GF(2^m) multiplier requires at least (2m-1) 2-input AND gates and at least (2m-2) 2-input XOR gates [39]. In the case of the above example, it can be seen that the GF(2^4) Massey-Omura multiplier has achieved this lower bound.

2.4.3 Polynomial basis multipliers

Polynomial basis multipliers operate entirely over the polynomial basis and require

no basis converters. These multipliers are easily implemented, reasonably hardware

efficient and the time taken to produce the result is the same as for Berlekamp or Massey-

Omura multipliers. Strictly speaking, however, bit-serial polynomial basis multipliers are serial-in

parallel-out multipliers. In some applications this results in an additional register being

required and adds an extra m clock cycles to the computation time. This is the main reason

why polynomial basis multipliers are frequently overlooked for use in codec design.

However as will be shown in Sections 2.4.5 and 2.4.6, this feature can be actually

beneficial.

There are two different methods of operation for polynomial basis multipliers, least

significant bit (LSB) first or most significant bit (MSB) first. Either of these approaches

may be chosen and both modes are described below.

2.4.3.1 Option L - LSB first

In this option the LSB appears first on the multiplier input. This multiplier is therefore denoted the Bit-Serial Polynomial Basis Multiplier option L (SPBML). It is

described in detail in the literature [4], ([29], pp.163 -164), ([30], pp. 90-91) and

summarised here.

Let a, b, c ∈ GF(2^m) and represent these elements over the polynomial basis as

a = a0 + a1α + ... + am-1α^(m-1)
b = b0 + b1α + ... + bm-1α^(m-1)
c = c0 + c1α + ... + cm-1α^(m-1)     (2.19)

The multiplication c = a * b can be expressed as

c = a * b = (a0 + a1α + ... + am-1α^(m-1)) * b
c = (...(((a0b) + a1(αb)) + a2(α^2·b)) + ...) + am-1(α^(m-1)·b)     (2.20)

A circuit implementing equ(2.20) therefore requires an LFSR to carry out multiplication by α. This LFSR is initialised with b and on clocking


the register the value of αb is generated. The values a0, a1, ..., am-1 are fed in series into the

multiplier to generate each of the values ai(α^i·b) (i = 0, 1, ..., m-1), which are accumulated in a register to form the product bits c0, c1, ..., cm-1. As an example, a circuit diagram for such a

multiplier for GF(2^3) using the primitive polynomial p(x) = x^3 + x + 1 is given in Fig. 2.4.

The operation of this circuit is as follows. The registers are initialised by Bi = bi

and Ci = 0 (i=0,1,2). The values a0 ,a1, a2 are fed in series into the multiplier and after 3

clock cycles the result c is available in the Ci register.


Figure 2.4. Circuit for multiplying two elements in GF(2^3).
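The circuit's behaviour can be sketched as follows (an illustrative Python model, with register names matching Fig. 2.4):

```python
def spbml(a, b):
    """LSB-first bit-serial polynomial-basis multiply in GF(2^3), p(x) = x^3 + x + 1.

    a, b are coefficient lists [x0, x1, x2] (LSB first).  Each clock cycle
    accumulates ai * (alpha^i * b) into C, then steps the B register to the
    next power of alpha times b.
    """
    B, C = list(b), [0, 0, 0]
    for ai in a:                            # a0 enters first
        if ai:
            C = [c ^ x for c, x in zip(C, B)]
        B = [B[2], B[0] ^ B[2], B[1]]       # B := alpha * B (alpha^3 = alpha + 1)
    return C
```

After 3 clock cycles C holds the product; for example α * α^2 = α^3 = α + 1.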

2.4.3.2 Option M - MSB first

In this option the MSB appears first on the multiplier input. The Bit-Serial

Polynomial Basis Multiplier option M (SPBMM) has been known for many years [28]

([29], p.163) and was more recently modified by Scott et al. [45].

The multiplication c = a * b (where a, b, c are as given in equ(2.19)) can be

expressed

c = a * b = (a0 + a1α + ... + am-1α^(m-1)) * b
c = (...(((am-1·b)α + am-2·b)α + am-3·b)α + ...)α + a0·b.     (2.21)

A circuit implementing equation 2.21 for GF(2^3) is shown in Fig. 2.5. Initially the Ci register is set to zero and the Bi register is initialised by Bi = bi (i = 0, 1, 2). a2 is then fed into the circuit and a2b loaded into the top register. Then a1 enters the circuit and the top registers are clocked so that they then contain (a2b·α + a1b). Finally, the top registers are clocked to generate (a2b·α + a1b)·α and this value is added to a0b to form the required product. In general therefore the result is obtained in the Ci register after m clock cycles.



Figure 2.5. Circuit for multiplying two elements in GF(2^3).
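For comparison, the MSB-first mode can be modelled like this (again an illustrative sketch keyed to Fig. 2.5):

```python
def spbmm(a, b):
    """MSB-first bit-serial polynomial-basis multiply in GF(2^3), p(x) = x^3 + x + 1.

    Implements equ(2.21): the running value in C is multiplied by alpha
    each clock cycle and ai * b is added in, with a2 entering first.
    """
    C = [0, 0, 0]
    for ai in reversed(a):                  # a2 enters first
        C = [C[2], C[0] ^ C[2], C[1]]       # C := alpha * C
        if ai:
            C = [c ^ x for c, x in zip(C, b)]
    return C
```

Note that the multiply-by-α now acts on the accumulator C rather than on B, which is why the B register in Fig. 2.5 can remain static.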

2.4.4 Comparison of bit-serial multipliers

The Massey-Omura multiplier operates entirely over a normal basis and so no

additional basis conversions are required. The normal basis representation is especially

effective in performing operations such as squaring. Unfortunately, the multiplier circuit is

relatively hardware inefficient (compared to the Berlekamp multiplier for example, [28,

33]) and cannot be hardwired to carry out reduced complexity constant multiplication.

Furthermore, a Massey-Omura multiplier designed for a particular value of m cannot be easily extended to other values of m.

The Berlekamp multiplier is known to have very low hardware requirements [28].

The Berlekamp multiplier can also be hardwired to allow for particularly efficient constant

multiplication [3]. The disadvantage of this multiplier is that it operates over both the dual

and the polynomial basis, and so basis converters may be required. In most cases the basis

conversion is only a permutation of the basis coefficients, and hence no additional

hardware is required (see Appendix C). For these reasons, the Berlekamp multiplier is widely used in codec design.

The bit-serial polynomial basis multipliers do not require basis converters, and are

almost as hardware efficient as the Berlekamp multiplier. They do however have a

different interface to the Berlekamp multiplier, being serial-in parallel-out. Hence the

choice between a Berlekamp and a polynomial basis multiplier often depends on the circuit

in which the multiplier is to be implemented. For example if the result is required to be


represented in parallel then an SPBMM may be used; otherwise a Berlekamp multiplier may be the better choice.

In comparing all four multipliers directly, it is noted that they each take m clock

cycles to generate a solution. Similarly they each require 2m flip-flops. In order to

compare the hardware requirements of these four multipliers some notation is introduced.

Let Na equal the number of 2-input AND gates required by a multiplier and let Nx equal

the number of 2-input XOR gates required by a multiplier. Furthermore, let Da and Dx be

the delays through a 2-input AND gate and XOR gate respectively. Let H(pp) be the

Hamming weight of the primitive polynomial chosen for GF(2^m). (These choices of p(x) are listed in Appendix A.) The hardware requirements and delays of three of these multipliers are given below.

Berlekamp multiplier
    Na = m;  Nx = m + H(pp) - 3
    Delay = Da + log2(m-1) * Dx.     (2.22)

Standard basis multiplier option L
    Na = m;  Nx = m + H(pp) - 2
    Delay = Da + Dx.     (2.23)

Standard basis multiplier option M
    Na = m;  Nx = m + H(pp) - 2
    Delay = Da + 2Dx.     (2.24)

For Massey-Omura multipliers the number of gates cannot be explicitly specified.

As a comparison, values of Na and Nx for all three types of multiplier are given in Table 2.1.


m     Massey-Omura [33]     Berlekamp      SPBML/SPBMM
      Na     Nx             Na     Nx      Na     Nx
3     5      4              3      3       3      4
4     7      6              4      4       4      5
5     9      8              5      5       5      6
6     11     10             6      6       6      7
7     19     18             7      7       7      8
8     21     20             8      10      8      11
9     17     16             9      9       9      10
10    19     18             10     10      10     11

Table 2.1. Gate counts for bit-serial Massey-Omura, Berlekamp and standard basis multipliers.

It should be noted that in some applications the most important feature of a multiplier is its input/output interface. Beth et al. [4] presented the different interfaces of polynomial, dual and normal basis multipliers. In conclusion, polynomial basis multipliers can only be serial-in parallel-out, whereas dual and normal basis multipliers can be either parallel-in serial-out or serial-in parallel-out.

2.4.5 Generating the sum of products

Often in BCH and RS decoders one product is not required in isolation; instead a sum of products must be calculated. For example, an equation of the form

    c = Σ_{j=1}^{t} aj * bj     (2.25)

is required to be evaluated in the Berlekamp-Massey algorithm circuits described in the next chapter.

If bit-serial Berlekamp or Massey-Omura multipliers are being used, the sum of t

products is obtained by the modulo-2 addition of the outputs of the t independent

multipliers and so (t-1) additional XOR gates are required. With polynomial basis

multipliers where the outputs are represented in parallel, m*(t-1) XOR gates are required.


However if SPBMMs are used to generate these products, large hardware savings can be made, as follows. An SPBMM implements equ(2.21) (rewritten below)

    c = (...(((am-1·b)α + am-2·b)α + am-3·b)α + ...)α + a0·b

by generating the values Pn = Pn-1·α + am-n·b (n = 1, 2, ..., m) where P0 = 0 and c = Pm. If now aj = aj,0 + aj,1α + ... + aj,m-1α^(m-1) and bj = bj,0 + bj,1α + ... + bj,m-1α^(m-1) (j = 1, 2, ..., t), then instead generate

    Pn = Pn-1·α + Σ_{i=1}^{t} ai,m-n·bi     (2.26)

where P0 = 0 and so Pm = Σ_{j=1}^{t} aj·bj.

Equ(2.26) can be implemented by a circuit comprising two parts. Part A generates Pn-1·α in the same manner that it is generated in the top register in Figure 2.5. Part B comprises t registers and m 2-input AND gates generating the values ai,m-n·bi (n = 1, 2, ..., m) for (i = 1, 2, ..., t). The additions required in equ(2.26) can be carried out by m*(t-1) XOR gates included in the design of the Part A circuit. A circuit for GF(2^3) with t = 2 is shown in Figure 2.6.

Using this approach to evaluate equ(2.25), (t+1)*m flip-flops are required. If t distinct SPBMMs are used, however, 2t*m flip-flops are needed, and so the above method allows a saving of (t-1)*m flip-flops to be made. Given that Berlekamp multipliers are the most hardware efficient bit-serial multipliers, it would seem appropriate to use these multipliers when implementing equ(2.25). In this case however the presented approach would again save (t-1)*m flip-flops, since t distinct multipliers would be required. In addition, (H(pp)-2)*(t-1) - 1 XOR gates are saved, where H(pp) is the Hamming weight of the irreducible polynomial for the field. Hence the presented approach is the most hardware efficient method of implementing equ(2.25) currently available.



Figure 2.6. Circuit generating c = a1b1 + a2b2 in GF(2^3).
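The shared-accumulator structure of Fig. 2.6 can be sketched behaviourally (illustrative Python for GF(2^3) and any number of product terms t):

```python
def sum_of_products(pairs):
    """Evaluate c = sum_j aj * bj in GF(2^3) via equ(2.26), p(x) = x^3 + x + 1.

    pairs is a list of (a, b) coefficient-list pairs.  A single Part A
    accumulator is stepped (multiplied by alpha) once per clock cycle while
    each Part B block contributes its term a_{i,m-n} * b_i.
    """
    m, C = 3, [0, 0, 0]
    for n in range(1, m + 1):
        C = [C[2], C[0] ^ C[2], C[1]]       # P_{n-1} * alpha  (Part A)
        for a, b in pairs:                  # Part B terms
            if a[m - n]:
                C = [c ^ x for c, x in zip(C, b)]
    return C
```

With t = 1 this degenerates to a single MSB-first SPBMM; by linearity, the t-term version equals the XOR of the t individual products.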

Initially the presented approach appears to have an unattractive input/output format

since the aj values enter in series, the bj values enter in parallel and the output is also

generated in parallel. However when utilised in a Berlekamp-Massey algorithm circuit,

this input/output format can be very convenient. This is because the incoming syndromes

are frequently represented in series (and so can take the role of the aj values) and the error

location values generated by the circuit are represented in parallel (and so can take the role

of the bj values). The other multipliers required in the circuit must then also be bit-parallel

multipliers, thereby increasing the throughput of the overall circuit. Furthermore, in the next section a new Dual-Polynomial Basis Multiplier is presented, and a combination of these two architectures offers a new hardware and time efficient architecture for the BMA (presented in Section 3.4.2).

It should be noted here that the sum-of-products architecture may also be extended to dual and normal basis multipliers if their architecture is serial-in parallel-out. It is therefore possible to construct a sum-of-products multiplier for MSB-first dual basis multipliers (Fig. 7 [4]) and for MSB-first and LSB-first normal basis multipliers (Fig. 10, 11 [4]).


2.4.6 Dual-Polynomial Basis Multipliers

In real time applications, the time taken by a multiplier to generate a solution is one

of its most important characteristics. Therefore a designer has to choose between hardware

efficient but slow bit-serial multipliers, and quick but rather complex bit-parallel

multipliers. In some applications it is required to calculate

y = a * b * c. (2.27)

In the standard approach to generate equ(2.27), two multiplications are carried out

independently, i.e. first the multiplication z = a * b is implemented and the result stored in

the auxiliary register Z, and then the multiplication y = z * c is carried out. The total

calculation time is the sum of two independent multiplication times. In some applications

this time is unacceptably long and a parallel multiplier must be employed, and so a more

complex architecture is required.

To overcome this problem, a new approach has been developed. Using the two

proposed Dual-Polynomial Basis Multipliers (DPBMs), the time required to implement

equ(2.27) is almost the same as the time required to carry out a single multiplication.

Furthermore a DPBM is almost as hardware efficient as the standard bit-serial approach.

The DPBM can also be modified to carry out more complex operations such as y = (a * b

+ c) * d. (This operation is required in the Berlekamp-Massey algorithm.)

2.4.7 Option A dual polynomial basis multipliers

The Berlekamp multiplier can be described as a parallel-in serial-out multiplier. On

the other hand, bit-serial polynomial basis multipliers are serial-in parallel-out. Therefore,

there is the option of connecting these two types of multiplier together to form one

multiplier generating y = a * b * c. In this arrangement, the Berlekamp multiplier’s output

is connected directly to the polynomial basis multiplier’s serial input. Thus the

multiplication y = a * b * c is carried out in the same time span that a single bit-serial

Berlekamp multiplier takes to yield one product. A problem occurs however because the

polynomial basis multiplier operates on the polynomial basis whilst the Berlekamp

multiplier produces a result in the dual basis, and so an additional basis conversion is

required.


The complexity of this basis conversion depends on the irreducible polynomial selected, and so two cases have been considered: those in which the irreducible polynomial for the field is

a trinomial of the form p(x) = x^m + x^p + 1, or
a pentanomial of the form p(x) = x^m + x^(p+2) + x^(p+1) + x^p + 1.

2.4.7.1 Irreducible trinomials

When the irreducible polynomial defining GF(2^m) is a trinomial, the dual basis is a

permutation of the polynomial basis (see Appendix C). Therefore it is possible to

rearrange the order of the output from the Berlekamp multiplier so that it is compatible

with the polynomial basis multiplier.

As an example, consider GF(2^4) with p(x) = x^4 + x + 1. An element z ∈ GF(2^4) is represented in the polynomial basis as

    z = z0 + z1α + z2α^2 + z3α^3,   zi ∈ GF(2)     (2.28)

so a Berlekamp multiplier would generate this value in the dual basis as

    z0, z3, z2, z1.

The SPBMM requires the serial input in the order

    z3, z2, z1, z0.

A circuit that rearranges the dual basis coefficients into this order can be easily developed,

thus allowing the DPBM to be designed. The general scheme for a multiplier generating

y = a * b * c is shown in Figure 2.7.


Figure 2.7. Dual-Polynomial Basis Multiplier option A.

Assume for instance that the multiplier shown in Figure 2.7 is a Dual-Polynomial Basis Multiplier option A (DPBMA) for GF(2^4). The hardware required in addition to the


SPBMM and the Berlekamp multiplier is a 2:1 multiplexer and a flip-flop. On the first

clock cycle the values of a and b are parallel loaded into the Berlekamp multiplier. Once

these values have been stored, the first product bit z0 is available on the output. This result

is then clocked into the Z flip-flop. On clocking the Berlekamp multiplier a further three

times the values of z3, z2, z1 are produced. These coefficients pass through the multiplexer

and feed the serial input of the SPBMM. On the 5th clock cycle the multiplexer feeds the

SPBMM input with z0, so that the SPBMM has been fed the input sequence z3, z2, z1, z0, as

required. This circuit therefore has a total computation time of (m+1) clock cycles.

Note also that no extra m-bit register Z is required to store the value of z as required in the

standard approach to generating equ(2.27).
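The whole pipeline can be modelled end-to-end. The sketch below (illustrative Python, for the GF(2^4) example with p(x) = x^4 + x + 1, using the permutation form of the basis conversion) chains the serial dual-basis output of a Berlekamp stage, reordered as described above, into an MSB-first SPBMM stage.

```python
def dpbma(a, b, c):
    """y = a * b * c in GF(2^4), p(x) = x^4 + x + 1, in the style of Figure 2.7.

    a, b, c are polynomial-basis coefficient lists [x0, x1, x2, x3].
    """
    ad = [a[0], a[3], a[2], a[1]]           # dual-basis coords (a permutation here)
    for k in range(3):                      # LFSR extension, equ(2.17): p0 = p1 = 1
        ad.append(ad[k] ^ ad[k + 1])
    zser = []                               # serial dual-basis output: z0, z3, z2, z1
    for k in range(4):
        bit = 0
        for j in range(4):
            bit ^= ad[k + j] & b[j]
        zser.append(bit)
    # Flip-flop + multiplexer: hold z0 back so the SPBMM sees z3, z2, z1, z0
    order = [zser[1], zser[2], zser[3], zser[0]]
    Y = [0, 0, 0, 0]
    for zbit in order:                      # MSB-first SPBMM stage
        Y = [Y[3], Y[0] ^ Y[3], Y[1], Y[2]]   # Y := alpha * Y
        if zbit:
            Y = [y ^ x for y, x in zip(Y, c)]
    return Y
```

Checking this against two successive reference multiplications over all 4096 input triples confirms the reordered hand-off is correct.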

This approach may be easily extended to GF(2^m) where the irreducible polynomial for GF(2^m) is of the form p(x) = x^m + x^p + 1. In this case if (z0, z1, ..., zm-1) is the polynomial basis representation of z ∈ GF(2^m), the output in the dual basis from a Berlekamp multiplier is (see Appendix C)

    zp-1, zp-2, ..., z0, zm-1, zm-2, ..., zp.     (2.29)

In this case, a multiplier structure similar to that shown in Figure 2.7 is derived. In

addition, p extra flip-flops and one (p + 1):1 multiplexer are required, and the total

calculation time is now m + p clock cycles.

2.4.7.2 Irreducible pentanomial

When the irreducible polynomial for GF(2^m) is a pentanomial of the form

    p(x) = x^m + x^(p+2) + x^(p+1) + x^p + 1

the dual to polynomial basis conversion involves a reordering and two GF(2) additions,

and so two extra XOR gates are required to implement this conversion. In this case the

DPBMA is more difficult to implement, but is still worthy of consideration.

As an example, and because GF(2^8) is the most useful field for which an appropriate pentanomial exists, consider GF(2^8) with p(x) = x^8 + x^4 + x^3 + x^2 + 1. Let z ∈ GF(2^8) be represented in the dual basis as

    z = z0β0 + z1β1 + z2β2 + z3β3 + z4β4 + z5β5 + z6β6 + z7β7,   zi ∈ GF(2)

and in the polynomial basis as

    z = s0 + s1α + s2α^2 + s3α^3 + s4α^4 + s5α^5 + s6α^6 + s7α^7,   si ∈ GF(2).


Then the dual to polynomial basis conversion is given by

    s7 = z3,        s6 = z4,        s5 = z5,   s4 = z6
    s3 = z3 + z7,   s2 = z0 + z2,   s1 = z1,   s0 = z2.

The DPBMA for GF(2^8) is shown in Figure 2.8.


Figure 2.8. DPBMA generating y = a * b * c in GF(2^8).

The operation of the DPBMA shown in Figure 2.8 is as follows. On the first clock

cycle a and b are parallel loaded into the Berlekamp multiplier and at this point the first

product bit is available on the output. The remaining 7 product bits are obtained by

clocking the Berlekamp multiplier a further 7 times. The first 4 values generated by the

Berlekamp multiplier are clocked into the Zi register so that after 4 clock cycles Zi = zi

(i=0,1,2,3). This fourth value of z3 is also the first input to the SPBMM (i.e. s7). The next

three outputs from the Berlekamp multiplier are fed into the SPBMM and then the

multiplexer selects inputs 1,4,3,2 on the next four clock cycles to generate the required

input for the SPBMM. The overall DPBMA will generate a solution on the 11th clock

cycle.

In general, a DPBMA takes m+p+1 clock cycles to generate a product when the irreducible polynomial is of the form p(x) = x^m + x^(p+2) + x^(p+1) + x^p + 1. In addition to the required Berlekamp multiplier and the SPBMM, an additional p+2 flip-flops, two 2-input XOR gates and one (3+p):1 multiplexer are required.


2.4.8 Dual polynomial basis multipliers option B

The DPBM may also be developed in a different form. With this option, a

multiplier implementing equ(2.27) has the same calculation time as a single, bit-serial

multiplier. Instead of rearranging the order of the output from a Berlekamp multiplier, it is

possible to add an additional circuit to the input of a ‘Berlekamp-like’ multiplier denoted

bit-serial dual basis multiplier (SDBM). With this scheme, the SDBM produces a product

in the polynomial basis, and so no extra circuit between the SDBM and the SPBMM is

required. In order to develop the DPBM option B (DPBMB), the function Rd(x) is

introduced.

Definition 2.6. Let the irreducible polynomial for GF(2^m) be p(x) = p0 + p1x + p2x^2 + ... + x^m and let a, b ∈ GF(2^m) be represented in the dual basis as

    b = b0β0 + b1β1 + ... + bm-1βm-1
    a = a0β0 + a1β1 + ... + am-1βm-1.

Then define the function Rd : GF(2^m) → GF(2^m) such that b = Rd(a), where b satisfies

    bj = aj+1                         for j = 0, 1, ..., m-2
    bm-1 = Σ_{i=0}^{m-1} pi·ai.     (2.30)

The value b = Rd(a) = a·α, where α is a root of p(x), and so the function Rd has the same effect on the coefficients as an LFSR which is initialised with the dual basis representation of a. Let Rd^2(a) be defined as Rd^2(a) = Rd(Rd(a)) - the state of the LFSR after 2 clock cycles - and so on.
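Definition 2.6 amounts to one clock of a dual-basis LFSR, which can be sketched in a few lines (a hypothetical helper; coefficient lists are LSB first):

```python
def Rd(a, p):
    """One application of Rd (equ 2.30): shift the dual-basis coordinates
    down one place and append the feedback bit, the sum of p_i * a_i.

    a holds dual-basis coordinates [a0, ..., am-1]; p holds p0..pm-1 of the
    irreducible polynomial (the x^m term is implicit).
    """
    fb = 0
    for pi, ai in zip(p, a):
        fb ^= pi & ai
    return a[1:] + [fb]
```

For GF(2^4) with p(x) = x^4 + x + 1, applying Rd to the dual coordinates of a yields the dual coordinates of a·α, and Rd^j is obtained simply by iterating the call j times.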

2.4.8.1 Irreducible trinomials

To introduce the DPBMB, assume first that the defining irreducible polynomial p(x) is a trinomial of the form p(x) = x^m + x^p + 1. Now consider a Berlekamp multiplier without the LFSR but with a set of m input lines Ai instead, denoted the SDBM.

Let a, b, z ∈ GF(2^m) such that z = a * b. Further, let b and z be represented in the polynomial basis and a be represented in the dual basis as a = a0β0 + a1β1 + ... + am-1βm-1. If the SDBM is fed with the inputs Ai = ai (i = 0, 1, ..., m-1), the first coefficient of the dual basis representation of z is obtained, or equivalently from equ(2.29), the p-th polynomial basis coefficient of z, namely zp-1. So if instead the multiplier is fed with the dual basis representation of Rd^p(a), the (p+1)-th coefficient of the dual basis representation of z is


obtained, or equivalently, the last polynomial basis coefficient zm-1. Similarly, if on the next clock cycle the multiplier is fed with the dual basis representation of Rd^(p+1)(a), the (p+2)-th coefficient of the dual basis representation of z is obtained, or equivalently, the next to last polynomial basis coefficient zm-2. This analysis may continue and so overall, if the proposed multiplier is fed with the input sequence

    Rd^p(a), Rd^(p+1)(a), Rd^(p+2)(a), ..., Rd^(m-1)(a), a, Rd(a), ..., Rd^(p-1)(a)     (2.31)

the multiplier will generate the values zm-1, zm-2, ..., z0, which is the correct format for the SPBMM.

As previously mentioned the proposed technique is flexible in that it can be

modified to carry out operations of the form y = (a * b + c) * d. For example, consider

Figure 2.9, where a circuit for GF(2^4) is presented implementing the operation y = (a * b + c) * d. Consider first the lower half of the circuit, which implements z = a * b using an SDBM. Taking p(x) = x^4 + x + 1 (so p = 1), in order that the SDBM produces the result in the required sequence, from equ(2.31) the 'a' inputs to the multiplier must be in the order

    Rd(a), Rd^2(a), Rd^3(a), a.

To achieve this, four flip-flops, four 3:1 multiplexers, an Rd^p(a) circuit and an Rd(a) circuit are additionally required. (An Rd^p(a) circuit is a combinational circuit that, given the dual basis representation of a ∈ GF(2^m), generates Rd^p(a). This circuit therefore implements a linear transformation over GF(2) and comprises p XOR gates. In this case, it can be seen that only one additional XOR gate is required.)

On the first clock cycle, the multiplexers select input 0, thereby loading Rd^p(a) into the ari register. On the 2nd and 3rd clock cycles the multiplexers select input 2, thereby loading Rd^2(a) and Rd^3(a) respectively into the ari register. Finally, on the 4th clock cycle, the multiplexers select input 1 to load the dual basis representation of a into the ari register. In doing this the output sequence z3, z2, z1, z0 is generated, as required by the SPBMM.

If the Ci register was previously initialised with the polynomial basis coefficients of c and is now clocked, the polynomial basis representation of (a*b + c) is generated. This value is then fed into an SPBMM as normal to generate the required result y = (a*b + c)*d; this equation is required in the BMA.


The DPBMB can be easily extended to different GF(2^m) if the irreducible polynomial is a trinomial of the form p(x) = x^m + x^p + 1. In general, the multiplexers should select the following signals.

Clock cycle      Origin of signal    Actual values on these signals
1                Rd^p(a) circuit     Rd^p(a)
2 to m-p         Rd(a) circuit       Rd^(p+1)(a) to Rd^(m-1)(a)
m-p+1            Ai register         a
m-p+2 to m       Rd(a) circuit       Rd(a) to Rd^(p-1)(a)



Figure 2.9 Circuit generating y = (a * b + c) * d in GF(28).


In comparison with a standard approach, a DPBMB circuit requires an additional m 3:1 multiplexers, m flip-flops, one XOR gate to form the Rd(x) circuit and p XOR gates to generate the Rd^p(x) circuit. In order to reduce the complexity of this Rd^p(x) circuit, a value of p as low as possible should be chosen. Hence the optimal irreducible polynomial to choose in this instance is p(x) = x^m + x + 1. Such polynomials exist for m = 2, 3, 4, 6, 7, etc.

2.4.8.2 Primitive pentanomials

When the irreducible polynomial p(x) is of the form p(x) = x^m + x^{p+2} + x^{p+1} + x^p + 1, a DPBMB can be designed similarly to the trinomial case. However, because the basis conversion is not just a permutation of basis coefficients but also involves two GF(2) additions, the circuit rearranging the input to an SDBM is rather more complicated.

Using the same analysis as in the trinomial case, it can be shown that when p(x) = x^m + x^{p+2} + x^{p+1} + x^p + 1, the required input sequence for an SDBM multiplier is

Rd^{p+1}(a), Rd^{p+2}(a), ..., Rd^{m-2}(a), Rd^{p+1}(a) + Rd^{m-1}(a), a + Rd^p(a), Rd(a), ..., Rd^p(a).

So for example consider GF(2^8) and p(x) = x^8 + x^4 + x^3 + x^2 + 1. The required input sequence is therefore

Rd^3(a), Rd^4(a), Rd^5(a), Rd^6(a), Rd^3(a) + Rd^7(a), a + Rd^2(a), Rd(a), Rd^2(a).

This sequence is generated by a circuit of the form shown in Figure 2.10. The key section

of this circuit is the multiplexer determining the ordering of the above input sequence. The

input selection lines are as follows:

Clock      Mux line    Signal selected
1          4           Rd^3(a)
2 to 4     3           Rd(ar), i.e. Rd^4(a), Rd^5(a), Rd^6(a)
5          2           Rd(ar) + Rd^3(a), i.e. Rd^7(a) + Rd^3(a)
6          1           Rd^2(a) + a
7          0           Rd(a)
8          3           Rd(ar), i.e. Rd^2(a)

In general, the DPBMB requires an additional (p+2) Rd(x) circuits (3 XOR gates each), 2m XOR gates for the summation circuits, m 5:1 multiplexers and m flip-flops.


[Figure: Rd(x) circuits and a 5:1 multiplexer feeding the 8-stage ar register with a, Rd(a), Rd^2(a), Rd^3(a); an SDBM forms z = a * b and an SPBMM forms y = z * c.]

Figure 2.10. DPBMB generating y = a * b * c in GF(2^8).

2.4.8.3 Summary of DPBM

The DPBM is particularly useful if the time taken to generate a product is critical. The DPBM offers a half-way solution between a bit-serial and a bit-parallel multiplier. Furthermore, both DPBMs are hardware efficient, and in some situations the DPBM offers a reduction in hardware since the intermediate value z does not have to be stored. The structure of the DPBM depends on the irreducible polynomial for GF(2^m). The optimal irreducible polynomial is a trinomial of the form p(x) = x^m + x^p + 1 where p = 1; for p > 1, more hardware is required in the multiplier. For some values of m (e.g. m = 8) there do not exist irreducible trinomials, and so an irreducible pentanomial must be used, resulting in the addition of extra hardware. Although the structure of the DPBM depends on the selected irreducible polynomial for GF(2^m), it has been shown that the architecture can be easily specified for two important classes of irreducible polynomials.

The DPBMs require only one input to be represented in the dual basis; the other input and the output are represented in the polynomial basis. Two different options have been presented. With the DPBMA, the dual basis output is converted into the polynomial


basis. This multiplier is particularly suited to generating products of the form y = a * b * c

or y = (a * b + c) * d if it is acceptable to take (m+p) clock cycles to generate this product.

With the DPBMB the basis rearranging takes place on the input. This approach takes more

hardware than the DPBMA circuit, but generating the product only takes m clock cycles.

The DPBMB is of particular use when evaluating expressions of the form

y = Σ_{i=1}^{t} (a * b_i + c_i) * d_i    (2.32)

where a, b_i, c_i, d_i ∈ GF(2^m). This is because only one relatively expensive basis rearranging circuit is required. Expressions of the type of equ(2.32) are generated in the implementation of the Berlekamp-Massey algorithm.
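As a concrete illustration of the arithmetic that equ(2.32) evaluates (the value computed, not the DPBMB hardware itself), the following Python sketch forms y = Σ (a*b_i + c_i)*d_i over GF(2^4) with p(x) = x^4 + x + 1; the function names are illustrative only.

```python
def gf_mul(a, b, poly=0b10011, m=4):
    """Multiply in GF(2^m): carry-less product followed by reduction
    modulo the irreducible polynomial (here x^4 + x + 1 = 0b10011)."""
    prod = 0
    for i in range(m):
        if (b >> i) & 1:
            prod ^= a << i
    for i in range(2 * m - 2, m - 1, -1):   # clear coefficients of degree >= m
        if (prod >> i) & 1:
            prod ^= poly << (i - m)
    return prod

def sum_of_products(a, bs, cs, ds):
    """y = sum over i of (a*b_i + c_i)*d_i; addition in GF(2^m) is XOR."""
    y = 0
    for b_i, c_i, d_i in zip(bs, cs, ds):
        y ^= gf_mul(gf_mul(a, b_i) ^ c_i, d_i)
    return y
```

Note that the operand a is common to every term, which is why a single basis rearranging circuit suffices in the DPBMB realisation.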

Note that SPBMLs have not been used in conjunction with DPBMs because the

basis reordering circuits are more complicated than the ones needed when using SPBMMs.

Beth et al. [4] presented normal basis multipliers with an LSB-first serial-in

parallel-out interface. It is therefore also possible to construct a multiplier that carries out the multiplication d = a * b * c over the normal basis in only m clock cycles. This multiplier consists of a parallel-in serial-out Massey-Omura multiplier of the form presented in Section 1.4.2 and the above multiplier ([4] Fig. 11). This double-multiplier

does not require basis rearranging as the DPBM does, but normal basis multiplication is relatively hardware inefficient (see Section 2.4.4), and constructing a normal basis multiplier for different choices of m is quite complex. In addition, a normal basis multiplication requires the arguments of the multiplication to be rotated, and therefore an additional control system is required. Summing up, the DPBM is adopted in this thesis, although in some instances it is not obviously the most appropriate architecture.

A similar architecture using only dual basis multipliers cannot be constructed, because parallel-in serial-out multipliers produce the result in the dual basis, and serial-in parallel-out dual basis multipliers can be constructed only for serial input in the polynomial basis [4].


2.5 Bit-Parallel Multiplication

In some applications, bit-parallel architectures rather than bit-serial ones must be adopted to achieve the required performance. So far, only bit-serial multipliers have been considered because of their hardware advantages over bit-parallel multipliers. Unfortunately, in the time-critical parts of BCH codecs, bit-serial architectures are too slow and more complex bit-parallel architectures must be adopted.

2.5.1 Dual Basis Multipliers

The bit-parallel dual basis multiplier (PDBM) was presented in [15]. Let a, c ∈ GF(2^m) be represented in the dual basis {β_0, β_1, ..., β_{m-1}} by a = a_0β_0 + a_1β_1 + ... + a_{m-1}β_{m-1} and c = c_0β_0 + c_1β_1 + ... + c_{m-1}β_{m-1}. Let b ∈ GF(2^m) be represented in the polynomial basis as b = b_0 + b_1α + ... + b_{m-1}α^{m-1}. The multiplication c = a * b is therefore represented by equations (2.16) and (2.17). Using these equations and the bit-serial Berlekamp multiplier properties, the PDBM can be easily derived [15] as a circuit implementing the equations

c_j = a_j b_0 + a_{j+1} b_1 + a_{j+2} b_2 + ... + a_{j+m-1} b_{m-1}    (j = 0, 1, ..., m-1)

a_{m+k} = Σ_{j=0}^{m-1} p_j a_{j+k}    (k = 0, 1, ..., m-1)    (2.33)

where the p_j are the coefficients of the primitive polynomial for the field, p(x) = p_0 + p_1x + ... + x^m.

In general, therefore, a PDBM for GF(2^m) comprises one type A module that generates a_{m+i} (i = 0, 1, ..., m-1) from equ(2.33) and m type B modules, each generating the inner product of two m-length vectors over GF(2). As an example, the PDBM for GF(2^3) using p(x) = x^3 + x + 1 is given below.

[Figure: XOR network generating a_3 and a_4 from a_0, a_1, a_2.]

Figure 2.11. Type A module for a bit-parallel dual basis multiplier for GF(2^3).


[Figure: AND/XOR network forming c_j from a_j, a_{j+1}, a_{j+2} and b_0, b_1, b_2.]

Figure 2.12. Type B module for a bit-parallel dual basis multiplier for GF(2^3).

[Figure: module A extends a_0, a_1, a_2 with a_3, a_4; three type B modules generate c_0, c_1, c_2 from b.]

Figure 2.13. Bit-parallel dual basis multiplier for GF(2^3).
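The GF(2^3) example above can be checked behaviourally. The sketch below implements equ(2.33) directly; for this trinomial the dual basis of {1, α, α^2} works out to {1, α^2, α}, so the dual coordinates are a swap of the polynomial coordinates. That swap is a derivation for this particular field (an assumption of the test harness), not part of the PDBM structure itself.

```python
M, POLY = 3, 0b1011        # GF(2^3) with p(x) = x^3 + x + 1
P = [1, 1, 0]              # p_0, p_1, p_2 in equ(2.33)

def gf_mul(a, b):
    """Reference polynomial basis multiply in GF(2^3)."""
    prod = 0
    for i in range(M):
        if (b >> i) & 1:
            prod ^= a << i
    for i in range(2 * M - 2, M - 1, -1):
        if (prod >> i) & 1:
            prod ^= POLY << (i - M)
    return prod

def pdbm(a_dual, b_poly_bits):
    """Equ(2.33): the type A module extends a's dual coordinates by
    a_{m+k} = sum_j p_j a_{j+k}; each type B module forms an inner product."""
    a = list(a_dual)
    for k in range(M):
        bit = 0
        for j in range(M):
            bit ^= P[j] & a[j + k]
        a.append(bit)
    return [sum(a[j + i] & b_poly_bits[i] for i in range(M)) % 2 for j in range(M)]

def bits(x):               # integer -> list of polynomial basis coefficients
    return [(x >> i) & 1 for i in range(M)]

def to_dual(coords):       # dual basis is {1, alpha^2, alpha}: swap coords 1 and 2
    return [coords[0], coords[2], coords[1]]
```

Converting both dual-basis operands of a product through `to_dual` and comparing against `gf_mul` exhaustively confirms the structure for all 64 operand pairs.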

2.5.2 Normal basis multipliers

A bit-parallel normal basis multiplier was also presented by Massey and Omura [31]. This multiplier comprises m identical Boolean functions, where the inputs to these functions are cyclically shifted by one position each time. A bit-parallel Massey-Omura multiplier (PMOM) requires at least m(2m-1) 2-input AND gates and at least m(2m-2) 2-input XOR gates [31,54]. The complexity of this multiplier is therefore dependent upon the complexity of the defining multiplication function. Accordingly, this multiplier is more hardware intensive than the PDBM and is not used in this thesis.

2.5.3 Polynomial Basis Multipliers

The bit-parallel polynomial basis multiplier (PPBM) was presented by Laws et al. [28]. This multiplier performs the same sequence of computations as the bit-serial polynomial basis multiplier option M (SPBMM), and so it is denoted here the parallel polynomial basis multiplier option M (PPBMM).

Let a, b, c ∈ GF(2^m) and

a = a_0 + a_1α + ... + a_{m-1}α^{m-1}
b = b_0 + b_1α + ... + b_{m-1}α^{m-1}
c = c_0 + c_1α + ... + c_{m-1}α^{m-1}.    (2.34)

To generate c = a * b, the representation

c = (...(((a_{m-1}b)α + a_{m-2}b)α + a_{m-3}b)α + ...)α + a_0b    (2.35)

is again used. The PPBMM therefore consists of (m-1) blocks that carry out the operations

y_{m-1} = a_{m-1}b
y_j = a_j b + α y_{j+1} mod p(α)    for m-1 > j ≥ 0

where the result c = y_0, and p(x) is the irreducible polynomial for GF(2^m).
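The (m-1)-block calculation sequence above can be sketched behaviourally in Python for GF(2^4) with p(x) = x^4 + x + 1 (the gate-level structure of the blocks is not modelled; names are illustrative):

```python
def ppbmm(a_bits, b, poly=0b10011, m=4):
    """y_{m-1} = a_{m-1} b, then y_j = a_j b + alpha*y_{j+1} mod p(alpha)
    for j = m-2 down to 0; the result is y_0."""
    def times_alpha(y):                 # one multiply-by-alpha block
        y <<= 1
        if (y >> m) & 1:
            y ^= poly
        return y
    y = b if a_bits[m - 1] else 0
    for j in range(m - 2, -1, -1):
        y = times_alpha(y) ^ (b if a_bits[j] else 0)
    return y
```

For example, feeding a = α (bits [0, 1, 0, 0]) and b = α^3 returns α^4 = α + 1.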

Mastrovito has presented a different type of polynomial basis multiplier [33]. This multiplier generates

c = a * b mod p(x),    c = Σ_{j=0}^{m-1} c_j α^j,

by employing the product matrix M:

[c_0, c_1, ..., c_{m-1}]^T = M [a_0, a_1, ..., a_{m-1}]^T = M A^T    (2.36)

where M is an m × m matrix over GF(2) whose entries f_{i,j} are each a sum of coefficients b_k for some values of k.

The most burdensome part of the Mastrovito algorithm is finding the product matrix M. The algorithm for finding M has been omitted as it is rather complicated; it can be found in [33]. In conclusion, the Mastrovito bit-parallel polynomial basis multiplier (MPPBM) is rather difficult to represent algorithmically. However, the advantage of the MPPBM is that it has a smaller time delay than the PPBMM.

Laws et al. [28] presented a parallel multiplier using the same calculation sequence as the SPBMM. The question arises: is it possible to construct a modular and regular parallel multiplier employing the same calculation sequence as the SPBML? The research carried out here concludes that it is, as shown below.


Express the multiplication c = a * b as in equ(2.20)

c = a_0b + a_1(bα) + a_2(bα^2) + ... + a_{m-1}(bα^{m-1}).

Now represent b * α^j as

b * α^j = b_{j,0} + b_{j,1}α + b_{j,2}α^2 + ... + b_{j,m-1}α^{m-1}.    (2.37)

Therefore, using (2.20) and (2.37),

c_j = a_0b_{0,j} + a_1b_{1,j} + a_2b_{2,j} + ... + a_{m-1}b_{m-1,j}.    (2.38)

Equation (2.38) may also be derived if the SPBML is considered. In the SPBML, the value c_j is obtained by sequentially summing the binary multiplications of b_{j,i} (the state of register b_i after j clock cycles) and a_i.

Using equations (2.37) and (2.38) it is possible to construct a modular and regular bit-parallel polynomial basis multiplier, option L (PPBML). A PPBML for GF(2^4) is presented below.

[Figure: four type B modules fed with b, b*α, b*α^2, b*α^3 (coefficients b_{j,0} ... b_{j,3}) and with a, producing c_0 ... c_3.]

Figure 2.14. PPBML for GF(2^4).

[Figure: inner product of a_0 ... a_3 with b_{0,j} ... b_{3,j} producing c_j.]

Figure 2.15. Module B of the PPBML - an inner product generator identical to that required in the Berlekamp multiplier.


[Figure: network mapping b_{j,0} ... b_{j,3} to b_{j+1,0} ... b_{j+1,3}.]

Figure 2.16. Circuit for multiplying by α in GF(2^4).

In general, a PPBML for GF(2^m) comprises m type B inner product modules and (m-1) type C modules that generate α^j * b, where α is a root of p(x) and b ∈ GF(2^m). A type C module essentially carries out a linear transformation of basis coefficients over GF(2) and therefore consists of a number of XOR gates.
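As a cross-check of equs (2.37) and (2.38), a behavioural Python sketch of the PPBML for GF(2^4) with p(x) = x^4 + x + 1 follows (function names are illustrative; the gate-level structure is not modelled):

```python
def ppbml(a_bits, b, poly=0b10011, m=4):
    """(m-1) type C modules generate b, b*alpha, ..., b*alpha^{m-1};
    m type B modules then form c_j = sum_i a_i b_{i,j} over GF(2)."""
    rows = [b]
    for _ in range(m - 1):              # type C: multiply the previous row by alpha
        t = rows[-1] << 1
        if (t >> m) & 1:
            t ^= poly
        rows.append(t)
    c = 0
    for j in range(m):                  # type B: inner product for coefficient c_j
        bit = 0
        for i in range(m):
            bit ^= a_bits[i] & (rows[i] >> j) & 1
        c |= bit << j
    return c
```

The type C rows here correspond to the b_{i,j} coefficients of equ(2.37); each iteration of the outer loop is one type B inner product module.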

2.5.4 Comparison of parallel multipliers

In this section the PDBM [15], the PMOM [31,54] and three polynomial basis multipliers - the MPPBM [33], the PPBMM [28] and the PPBML - have been considered. A comparison of the number of XOR and AND gates required by these multipliers and the maximum delay times for a range of values of m is presented below. (In fact, the delay through each of these multipliers is Da plus the values cited below, since a single row of AND gates is also required by each of the multipliers.)


          PMOM            MPPBM           PDBM/PPBMM/PPBML   PDBM   PPBMM   PPBML
 m    NA   NX   DX    NA   NX   DX    NA    NX               DX     DX      DX
 3    15   12   2     9    8    3     9     8                3      3       3
 4    28   2    3     16   15   3     16    15               3      4       3
 5    45   40   3     25   25   5     25    2                5      6       5
 6    66   60   4     36   33   4     36    35               4      6       4
 7    133  126  5     49   48   4     49    48               4      7       4
 8    168  160  5     64   90   5     64    77               7      11      7
 9    153  144  4     81   80   5     81    80               6      11      6
 10   190  180  5     100  101  6     100   99               6      12      6

Table 2.2. Comparison of bit-parallel finite field multipliers.

The number of gates for the PMOM is taken from [33]; for the MPPBM and the PDBM it is taken from [15]. The primitive polynomials used to design these multipliers (excluding the PMOM) are listed in Appendix A.

As a general rule, the number of gates and the multiplier delay can be obtained from the following:

PDBM:   NA = m*m   NX = (m-1)*(m + H(pp) - 2)   DX = ⌈log2(m)⌉ + t*⌈log2(H(pp) - 1)⌉
PPBMM:  NA = m*m   NX = (m-1)*(m + H(pp) - 2)   (DX = m - 1 + t if H(pp) = 3)
PPBML:  NA = m*m   NX = (m-1)*(m + H(pp) - 2)   (DX = ⌈log2(m)⌉ + t if H(pp) = 3)

where H(pp) is the weight of the primitive polynomial, t = ⌈(m-1)/(m-p)⌉ and p(x) = x^m + x^p + Σ_{i=0}^{p-1} p_i x^i.
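These closed forms can be checked against Table 2.2; the ceilings used below are an assumption inferred from the table entries (they reproduce, for example, the m = 3, m = 4 and m = 10 trinomial rows).

```python
import math

def ppbml_cost(m, h_pp, p):
    """NA, NX and (for H(pp) = 3) DX of a PPBML, from the formulas above;
    h_pp is the weight of the primitive polynomial, t = ceil((m-1)/(m-p))."""
    t = math.ceil((m - 1) / (m - p))
    na = m * m
    nx = (m - 1) * (m + h_pp - 2)
    dx = math.ceil(math.log2(m)) + t    # applies when h_pp == 3
    return na, nx, dx
```

For m = 10 with the trinomial x^10 + x^3 + 1 this gives NA = 100, NX = 99 and DX = 6, matching the table.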

In conclusion from Table 2.2, the PPBML has the same parameters as the PDBM for the considered choices of m. The PPBML needs no basis conversions, and so the design of a PPBML is simpler and more hardware efficient than the PDBM, especially if a primitive trinomial for GF(2^m) does not exist. On the other hand, the PDBM is slightly easier to design (disregarding the basis conversions), and some additional design optimisation can be carried out; e.g. for m = 8 the number of XOR gates can be reduced to 72 [15]. In conclusion, the choice between the PDBM and the PPBML is related to the individual design specification, as the differences in design complexity and hardware requirements are small.

The PPBMM has the same hardware requirements as the PPBML but a longer delay time; accordingly, the PPBMM is not used in this thesis. Similarly, the PMOM is not used, given the high hardware requirements and long delay path of the multiplier. The PPBML is much easier to design than the MPPBM, and in most cases the final circuits are similar, e.g. for m = 4. Therefore in this thesis only the PDBM and the PPBML have been considered.

It should be mentioned that a number of bit-parallel multipliers have been proposed for circumstances in which p(x) is of the form p(x) = x^m + x^{m-1} + ... + x + 1, that is, when p(x) is an all-one polynomial [22]. However, all-one polynomials are relatively rare and so do not help in finding general solutions of the kind required here.

2.6 Finite field exponentiation

2.6.1 Squaring

In some applications squaring in a finite field is required. Squaring can be performed using a standard multiplier, but this approach is rather hardware inefficient. Instead, a different algorithm is employed, as described for example in [16].

Let a ∈ GF(2^m) be represented in the polynomial basis as

a = a_0 + a_1α + a_2α^2 + ... + a_{m-1}α^{m-1}.

Now let b ∈ GF(2^m) such that b = a^2. From equ(2.8), f(x)^2 = f(x^2) over GF(2) and so

b = a^2 = a_0 + a_1α^2 + a_2α^4 + a_3α^6 + ... + a_{m-1}α^{2m-2} mod p(α).    (2.39)

In other words, the coefficients of b can be obtained from a linear transformation of the coefficients of a over GF(2). This linear transformation requires a number of XOR gates to implement, and these numbers are listed for a range of m in Table 2.3.
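Behaviourally, equ(2.39) amounts to interleaving a zero between successive bits of a and then reducing modulo p(x); a Python sketch for GF(2^4) with p(x) = x^4 + x + 1:

```python
def gf_square(a, poly=0b10011, m=4):
    """Square in GF(2^m): spread the bits of a to even positions
    (f(x)^2 = f(x^2) over GF(2)), then reduce modulo p(x)."""
    s = 0
    for i in range(m):
        if (a >> i) & 1:
            s ^= 1 << (2 * i)
    for i in range(2 * m - 2, m - 1, -1):   # reduce coefficients of degree >= m
        if (s >> i) & 1:
            s ^= poly << (i - m)
    return s
```

The reduction step is what determines the XOR-gate counts of Table 2.3; the bit-spreading itself is just wiring.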


2.6.2 Raising field elements to the third power

The standard approach to carrying out exponentiation to the power three is to use a standard multiplier and squaring circuit and calculate a^3 = a^2 * a [46]. If a PPBML is used together with the squaring approach described above, the hardware requirements for this circuit are as given in Table 2.3.

An alternative method of raising elements to the power three is now described. Let a, b ∈ GF(2^m) such that b = a^3 and represent both these elements over the polynomial basis in the usual way. From the equation

(x + y)^3 = x^3 + 3x^2y + 3xy^2 + y^3,

where over GF(2) the coefficients 3 reduce to 1, the expression

b = a^3 = (a_0 + a_1α + a_2α^2 + ... + a_{m-1}α^{m-1})^3 mod p(α)
  = Σ_{i=0}^{m-1} a_i α^{3i} + Σ_{i=0}^{m-2} Σ_{j=i+1}^{m-1} a_i a_j (α^{2i+j} + α^{i+2j}) mod p(α)    (2.40)

is derived. A circuit implementing equ(2.40) can be designed directly and consists of (m-1) + (m-2) + ... + 1 = m*(m-1)/2 AND gates and at most m*(m^2-1)/2 XOR gates. However, in practice these requirements are much lower. The number of gates for this cubing circuit is given in Table 2.3. In comparison with the standard approach, this method offers hardware savings, especially if design optimisation is employed. For example, for m = 8 the number of XOR gates with optimisation is almost halved.

      squaring    cubing              a^3 = a^2 * a
 m    NXOR        NXOR        NAND    NXOR    NAND
 4    2           16 (13)     6       17      16
 5    3           29 (21)     10      27      25
 6    3           47 (33)     15      38      36
 7    3           66 (46)     21      51      49
 8    12 (10)     135 (70)    28      87      64
 9    6           133 (83)    36      86      81
 10   6           159 (105)   45      105     100

Table 2.3. Hardware requirements for exponentiation in GF(2^m). ( ) = with design optimisation.


2.7 Finite field inversion

BCH decoders are required to implement the finite field division c = a/b. This division can be implemented using a division algorithm, e.g. [15, 17, 21]. Unfortunately, BCH decoders require the result of a division to be available faster than these algorithms allow. Often, however, b is available earlier than a, and so it can be beneficial first to employ inversion to generate b^{-1} and then to use a fast bit-parallel multiplier.

Throughout this thesis the Fermat inverter is used. Fermat inverters operating over the normal and dual bases have been presented [16, 54]. The dual basis inverter is hardware efficient and, conveniently, the result of the inversion is represented in the dual basis and so can be utilised in dual basis multipliers, for example. Hence, the dual basis inverter has been employed in this project.

A Fermat inverter implements the equation

a^{-1} = a^{2^m - 2} = a^2 * a^4 * a^8 * ... * a^{2^{m-1}}    (2.41)

and so is based on repeated multiplication and squaring. The dual basis inverter uses a PDBM as presented in Section 2.5.1 and carries out squaring in the polynomial basis as described in Section 2.6.1. The overall inversion circuit requires (m-1) clock cycles to generate a result. To then calculate c = a/b = a*b^{-1}, one extra clock cycle is required to carry out the multiplication.
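Equ(2.41) as a square-and-multiply chain in Python for GF(2^4) with p(x) = x^4 + x + 1 (the hardware time-shares one multiplier and one squarer over the m-1 clock cycles; this sketch only checks the arithmetic):

```python
def gf_mul(a, b, poly=0b10011, m=4):
    """Carry-less multiply then reduce modulo the irreducible polynomial."""
    prod = 0
    for i in range(m):
        if (b >> i) & 1:
            prod ^= a << i
    for i in range(2 * m - 2, m - 1, -1):
        if (prod >> i) & 1:
            prod ^= poly << (i - m)
    return prod

def gf_inv(a, poly=0b10011, m=4):
    """a^{-1} = a^{2^m - 2} = a^2 * a^4 * ... * a^{2^{m-1}}  (equ 2.41)."""
    sq, inv = a, 1
    for _ in range(m - 1):              # one squaring and one multiply per cycle
        sq = gf_mul(sq, sq, poly, m)    # a^2, a^4, ..., a^{2^{m-1}}
        inv = gf_mul(inv, sq, poly, m)
    return inv
```

Every non-zero a then satisfies a * a^{-1} = 1, which is the property the divider relies on.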

2.8 Conclusions

In this chapter the main definitions and results underpinning finite field theory have been introduced. It has been shown how to generate GF(2^m) from the base field GF(2), and the most important basis representations have been described. The most useful bit-serial and bit-parallel finite field multipliers have been reviewed for adoption in BCH codecs. Circuits for carrying out inversion, division and exponentiation in GF(2^m) have also been described. Finally, some important new circuits have been presented. A hardware efficient method of generating a sum of products using a previously overlooked multiplier has been described. This circuit operates entirely over the polynomial basis and has an attractive input/output format for use in circuits implementing


the Berlekamp-Massey algorithm. Two multiplier circuits generating products of the form y = a*b*c have also been presented. These circuits are based around Berlekamp multipliers and SPBMMs, and hardware/time trade-offs determine which of the two options to adopt. Both multipliers employ novel methods of implementing the required basis conversions, so allowing Berlekamp multipliers and SPBMMs to be used in tandem. Finally, a new bit-parallel multiplier - the PPBML - has been presented. This multiplier is a hardware efficient equivalent of a previously presented bit-serial multiplier. In addition, a new algorithm for exponentiation to the power three has been presented. The algorithm is especially hardware efficient if design optimisation is employed.

3. BCH codes

In this chapter Bose-Chaudhuri-Hocquenghem (BCH) codes are introduced and various BCH encoding and decoding algorithms are presented. Three different decoding strategies are presented according to the error correcting capability of the code. Generally, decoding is broken down into three processes: syndrome calculation, the Berlekamp-Massey algorithm (BMA) and the Chien search. In addition, the BMA can be developed with or without inversion, and both methods are described here. At the end of this chapter, comparisons between BCH codes and RS codes are presented.

3.1 Background

The first class of linear codes derived for error correction was the Hamming codes [20]. These codes are capable of correcting only a single error, but because of their simplicity, Hamming codes and their variations have been widely used in error control systems, e.g. the 16/32-bit parallel error detection and correction circuits SN54ALS616/SN54ALS632 [50]. The generalisation of the binary Hamming codes to multiple errors was discovered by Hocquenghem in 1959 [23], and independently by Bose and Chaudhuri in 1960 [6]. Subsequently, non-binary error-correcting codes were derived by Gorenstein and Zierler [19]. Almost at the same time, and independently of the work of Bose, Chaudhuri and Hocquenghem, the important subclass of non-binary BCH codes - RS codes - was introduced by Reed and Solomon [44].

This project is concerned with BCH codes, however, and these codes are described in more detail below.


3.2 BCH codes

The class of BCH codes is a large class of error correction codes that occupies a prominent place in the theory and practice of error correction. This prominence is a result of the relatively simple encoding and decoding techniques. Furthermore, provided the block length is not excessive, there are good codes in this class ([30] Chapter 9). In this thesis only the subclass of binary BCH codes is considered, as these codes can be simply and efficiently implemented in digital hardware.

Before considering BCH codes, some additional theory needs to be introduced.

Theorem 3.1. ([30], p.10) The minimum distance of a linear code is the minimum Hamming weight of any non-zero codeword.

Theorem 3.2. ([30], p.10) A code with minimum distance d can correct ⌊(d-1)/2⌋ errors.

Definition 3.2. A linear code C is cyclic if whenever (c_0, c_1, ..., c_{n-1}) ∈ C then so is (c_{n-1}, c_0, c_1, ..., c_{n-2}).

A codeword (c_0, c_1, ..., c_{n-1}) of a cyclic code can be represented as the polynomial c(x) = c_0 + c_1x + ... + c_{n-1}x^{n-1}. This correspondence is very helpful, as the mathematical background of polynomials is well developed, and so this representation is used here.

It is frequently convenient to define an error-correcting code in terms of its generator polynomial g(x) [29]. The generator polynomial of a t-error-correcting BCH code is defined to be the least common multiple (LCM) of f_1, f_3, ..., f_{2t-1}, that is,

g(x) = LCM{ f_1, f_3, f_5, ..., f_{2t-1} }    (3.1)

where f_j is the minimal polynomial of α^j (0 < j < 2t + 1), considered below.


Let f_j (0 < j < 2t + 1) be the minimal polynomial of α^j; then f_j is obtained by (Theorem 2.14, [29]):

f_j(x) = Π_{i=0}^{e-1} (x + β^{2^i})    where β = α^j and β^{2^e} = β.    (3.2)

According to the same theorem, e ≤ m.

To generate a codeword of an (n, k) t-error-correcting BCH code, the k information symbols are formed into the information polynomial i(x) = i_0 + i_1x + ... + i_{k-1}x^{k-1}, where i_j ∈ GF(2). Then the codeword polynomial c(x) = c_0 + c_1x + ... + c_{n-1}x^{n-1} is formed as

c(x) = i(x)*g(x).    (3.3)

Since the degree of f_j(x) is less than or equal to m (e ≤ m in equ(3.2); [29] p. 38), from equ(3.1) the degree of g(x) (and consequently the number of parity bits n-k) is at most m*t. For small values of t, the number of parity check bits is usually equal to m*t ([29] p. 142).

For any positive integer m ≥ 3 there exist binary BCH codes (n, k) with the following parameters:

n = 2^m - 1       length of codeword in bits
t                 the maximum number of error bits that can be corrected
k ≥ n - m*t       number of information bits in a codeword
dmin ≥ 2*t + 1    the minimum distance.

A list of BCH code parameters for m ≤ 10 is given in Appendix D. Note that for t = 1, this construction of BCH codes generates Hamming codes. The number of parity bits equals m, and so (2^m - 1, 2^m - m - 1) codes are obtained. In this case the generator polynomial g(x) satisfies

g(x) = f_1(x) = p(x)

where p(x) is the irreducible polynomial for GF(2^m) as given, for example, in Appendix A. In this thesis only primitive BCH codes are considered. Binary non-primitive BCH codes can be constructed in a similar manner to primitive codes ([29] p.151). Non-primitive BCH codes have a generator polynomial g(x) with

β^l, β^{l+1}, β^{l+2}, ..., β^{l+d-2}


as roots, where β is an element of GF(2^m) and l is a non-negative integer. Non-primitive BCH codes obtained in this way have a minimum distance of at least d. When l = 1, d = 2t + 1 and β = α, where α is a primitive element of GF(2^m), primitive BCH codes are obtained.

3.3 Encoding BCH codes

If BCH codewords are encoded as in equ(3.3), the data bits do not appear explicitly in the codeword. To overcome this, let

c(x) = x^{n-k} * i(x) + b(x)    (3.4)

where c(x) = c_0 + c_1x + ... + c_{n-1}x^{n-1}, i(x) = i_0 + i_1x + ... + i_{k-1}x^{k-1}, b(x) = b_0 + b_1x + ... + b_{n-k-1}x^{n-k-1} and c_j, i_j, b_j ∈ GF(2). Then if b(x) is taken to be the polynomial such that

x^{n-k} * i(x) = q(x) * g(x) - b(x)    (3.5)

the k data bits will be present in the codeword. (By implementing equ(3.4) instead of equ(3.3), systematic ([29] p. 54) codewords are generated.)

BCH codes are implemented as cyclic codes [42]; that is, the digital logic implementing the encoding and decoding algorithms is organised into shift-register circuits that mimic the cyclic shifts and polynomial arithmetic required in the description of cyclic codes. Using the properties of cyclic codes [29, 30], the remainder b(x) can be obtained in a linear (n-k)-stage shift register with feedback connections corresponding to the coefficients of the generator polynomial g(x) = 1 + g_1x + g_2x^2 + ... + g_{n-k-1}x^{n-k-1} + x^{n-k}. Such a circuit is shown in Figure 3.1.

[Figure: (n-k)-stage LFSR with registers b_0 ... b_{n-k-1}, feedback taps g_1 ... g_{n-k-1}, feedback switch S1, and output switch S2 selecting between x^{n-k} i(x) (position 2) and the parity bits (position 1).]


Figure 3.1. Encoding circuit for an (n, k) BCH code.

The encoder shown in Figure 3.1 operates as follows:

- for clock cycles 1 to k, the information bits are transmitted in unchanged form (switch S2 in position 2) and the parity bits are calculated in the Linear Feedback Shift Register (LFSR) (switch S1 on);

- for clock cycles k+1 to n, the parity bits in the LFSR are transmitted (switch S2 in position 1) and the feedback in the LFSR is switched off (S1 off).
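The parity calculation of Figure 3.1 can be modelled in Python. The register update below is the standard CRC-style division by g(x) (a behavioural sketch, with illustrative names); the test uses the (15, 5) generator polynomial derived in the following example.

```python
def bch_encode_parity(info_bits, g_bits):
    """Parity b(x) = x^{n-k} i(x) mod g(x), computed bit-serially in an
    (n-k)-stage LFSR whose taps are the coefficients g_1 ... g_{n-k-1}."""
    r = len(g_bits) - 1                 # n - k parity stages
    reg = [0] * r
    for bit in reversed(info_bits):     # highest-order information bit first
        fb = bit ^ reg[-1]              # feedback = input + register output
        reg = [fb] + [reg[i - 1] ^ (fb & g_bits[i]) for i in range(1, r)]
    return reg                          # b_0 ... b_{n-k-1}
```

For a systematic codeword, the n-k parity bits are followed by the k information bits, and the result is divisible by g(x).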

As an example, the (15, 5) 3-error-correcting BCH code is considered. The generator polynomial with α, α^2, α^3, ..., α^6 as roots is obtained by multiplying the following minimal polynomials:

roots                  minimal polynomial
α, α^2, α^4, α^8       f_1(x) = (x+α)(x+α^2)(x+α^4)(x+α^8) = 1 + x + x^4
α^3, α^6, α^12, α^9    f_3(x) = (x+α^3)(x+α^6)(x+α^12)(x+α^9) = 1 + x + x^2 + x^3 + x^4
α^5, α^10              f_5(x) = (x+α^5)(x+α^10) = 1 + x + x^2

Thus the generator polynomial g(x) is given by

g(x) = f_1(x) * f_3(x) * f_5(x) = 1 + x + x^2 + x^4 + x^5 + x^8 + x^10.
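The product above can be checked with a few lines of GF(2) polynomial arithmetic (coefficients held as integer bitmasks, bit i = coefficient of x^i):

```python
def polymul_gf2(f, g):
    """Multiply two GF(2) polynomials given as integer bitmasks."""
    r = 0
    while g:
        if g & 1:
            r ^= f
        f <<= 1
        g >>= 1
    return r

f1 = 0b10011    # 1 + x + x^4
f3 = 0b11111    # 1 + x + x^2 + x^3 + x^4
f5 = 0b111      # 1 + x + x^2
g = polymul_gf2(polymul_gf2(f1, f3), f5)
```

Here g comes out as 0b10100110111, i.e. 1 + x + x^2 + x^4 + x^5 + x^8 + x^10, confirming the expansion.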

3.4 Decoding BCH codes

The decoding process is far more complicated than the encoding process. As a general rule, decoding can be broken down into three separate steps:

1. calculating the syndromes;
2. solving the key equation;
3. finding the error locations.

Fortunately, for some BCH codes step 2 can be omitted. To decode BCH codes in this thesis, three different strategies have been employed, for single error correcting (SEC), double error correcting (DEC), and triple and more error correcting (TMEC) BCH codes.


Regarding step 1, the calculation of the syndromes is identical for all BCH codes. For SEC codes step 2 - solving the key equation - can be omitted, as a syndrome directly gives the error location polynomial coefficient. For DEC codes step 2 can also be omitted, but the error location algorithm is rather more complicated. Finally, when implementing the TMEC decoding algorithm all three steps must be carried out, and step 2 - the solution of the key equation - is the most complicated.

3.4.1 Calculation of the syndromes

Let

c(x) = c_0 + c_1x + c_2x^2 + ... + c_{n-1}x^{n-1}
r(x) = r_0 + r_1x + r_2x^2 + ... + r_{n-1}x^{n-1}
e(x) = e_0 + e_1x + e_2x^2 + ... + e_{n-1}x^{n-1}    (3.6)

be the transmitted polynomial, the received polynomial and the error polynomial respectively, so that

r(x) = c(x) + e(x).    (3.7)

The first step of the decoding process is to store the received polynomial r(x) in a buffer register and to calculate the syndromes S_j (1 ≤ j ≤ 2t). The most important feature of the syndromes is that they do not depend on the transmitted information but only on the error locations, as shown below.

Define the syndromes Sj as

S_j = Σ_{i=0}^{n-1} r_i α^{ij}    for (1 ≤ j ≤ 2t). (3.8)

Since r_j = c_j + e_j (j = 0, 1, ..., n-1),

S_j = Σ_{i=0}^{n-1} (c_i + e_i) α^{ij} = Σ_{i=0}^{n-1} c_i α^{ij} + Σ_{i=0}^{n-1} e_i α^{ij}. (3.9)

By the definition of BCH codes,

Σ_{i=0}^{n-1} c_i α^{ij} = 0    for (1 ≤ j ≤ 2t), (3.10)

thus


S_j = Σ_{i=0}^{n-1} e_i α^{ij}. (3.11)

It is therefore observed that the syndromes Sj depend only on the error polynomial e(x),

and so if no errors occur, the syndromes will all be zero.

To generate the syndromes, express equ(3.8) as

S_j = (...((r_{n-1} α^j + r_{n-2}) α^j + r_{n-3}) α^j + ...) α^j + r_0. (3.12)

Thus a circuit calculating the syndrome Sj carries out (n-1) multiplications by the constant value α^j and (n-1) single-bit summations. Note that because r_j ∈ GF(2), the equation S_{2i} = S_i^2 holds [29].

For example, a circuit calculating S3 for m = 4 and p(x) = x^4 + x + 1 is presented in Figure 3.2. Initially the registers s_i (0 ≤ i ≤ 3) are set to zero. The register s0-s3 is then shifted 15 times while the received bits r_i (0 ≤ i ≤ 14) are clocked into the syndrome calculation circuit, after which S3 is obtained in the s0-s3 register.

Figure 3.2. Circuit computing S3 for m = 4.
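The Horner-style evaluation of equ(3.12) can be modelled in software; this is an illustrative sketch (not the circuit itself), assuming GF(2^4) generated by p(x) = x^4 + x + 1 with field elements stored as 4-bit integers:

```python
def gf16_mul(a, b):
    """Multiply two elements of GF(2^4) generated by p(x) = x^4 + x + 1."""
    r = 0
    while b:
        if b & 1:
            r ^= a
        a <<= 1
        if a & 0x10:
            a ^= 0b10011      # reduce modulo x^4 + x + 1
        b >>= 1
    return r

def gf16_pow(a, e):
    r = 1
    for _ in range(e % 15):   # alpha^15 = 1 in GF(2^4)
        r = gf16_mul(r, a)
    return r

def syndrome(rbits, j, alpha=0b0010):
    """Horner evaluation of equ(3.12): S_j = r(alpha^j), msb first."""
    aj = gf16_pow(alpha, j)
    s = 0
    for bit in reversed(rbits):   # r_{n-1} enters first, r_0 last
        s = gf16_mul(s, aj) ^ bit
    return s
```

The running register `s` plays the role of the s0-s3 register: one constant multiplication by α^j and one single-bit addition per received bit.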

Syndromes can also be calculated in a second way ([29] p. 152, 165), ([30] p.

271). Employing this approach, Sj is obtained from the remainder of the division of the received polynomial by the minimal polynomial fj(x), that is,

r(x) = a_j(x) f_j(x) + b_j(x) (3.13)

where

S_j = b_j(α^j). (3.14)

It should be mentioned that the minimal polynomials for α, α^2, α^4, ... are the same and so only one register is required to calculate the syndromes S1, S2, S4, ... . The rule can be extended to S3, S6, ..., and so on.


For example, the circuit calculating S3 for m = 4 is shown in Figure 3.3. The minimal polynomial of α^3 is

f3(x) = 1 + x + x^2 + x^3 + x^4

and let b(x) = b0 + b1x + b2x^2 + b3x^3 be the remainder on dividing r(x) by f3(x). Then

S3 = b(α^3) = b0 + b1α^3 + b2α^6 + b3α^9 = b0 + b3α + b2α^2 + (b1 + b2 + b3)α^3.

The circuit in Figure 3.3 therefore operates by first dividing r(x) by f3(x) to generate b(x) and then calculating b(α^3). The result is obtained after the register b0-b3 has been clocked 15 times.

Figure 3.3. Second method of computing S3 for m = 4.
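The second method can be checked against the first with a hypothetical software model: divide r(x) by f3(x) over GF(2) and evaluate the remainder at α^3, which must agree with the direct syndrome r(α^3) since f3(α^3) = 0:

```python
def gf16_mul(a, b):
    """Multiply two elements of GF(2^4) generated by p(x) = x^4 + x + 1."""
    r = 0
    while b:
        if b & 1:
            r ^= a
        a <<= 1
        if a & 0x10:
            a ^= 0b10011
        b >>= 1
    return r

def gf16_pow(a, e):
    r = 1
    for _ in range(e % 15):
        r = gf16_mul(r, a)
    return r

def poly_mod(rbits, fbits):
    """Remainder of r(x) / f(x) over GF(2); index i = coefficient of x^i."""
    r = list(rbits)
    df = len(fbits) - 1
    for i in reversed(range(df, len(r))):
        if r[i]:
            for j, fb in enumerate(fbits):
                r[i - df + j] ^= fb
    return r[:df]

rbits = [1, 0, 1, 1, 0, 0, 1, 0, 1, 0, 0, 1, 1, 0, 1]   # arbitrary r(x)
b = poly_mod(rbits, [1, 1, 1, 1, 1])                    # b(x) = r(x) mod f3(x)
alpha3 = gf16_pow(2, 3)
s3 = 0
for i, bit in enumerate(b):                             # S3 = b(alpha^3)
    if bit:
        s3 ^= gf16_pow(alpha3, i)
```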

3.4.2 Solving the key equation

The second stage of the decoding process is finding the coefficients of the error location polynomial Λ(x) = Λ0 + Λ1x + ... + Λt x^t using the syndromes Sj (1 ≤ j < 2t). The relationship between the syndromes and the coefficients Λj is given by ([5], p.168)

Σ_{j=0}^{t} Λ_j S_{t+i-j} = 0    (i = 1, ..., t) (3.15)

and the roots of Λ(x) give the error positions. The coefficients of Λ(x) can be calculated by methods such as the Peterson-Gorenstein-Zierler algorithm [5, 43] or Euclid's algorithm [49]. In this thesis the Berlekamp-Massey Algorithm (BMA) [2, 32] has been used as it has the reputation of being the most efficient method in practice [5].

In the BMA, the error location polynomial Λ(x) is found by t-1 recursive iterations. During each iteration r, the degree of Λ(x) is usually incremented by one. Through this method the final degree of Λ(x) is exactly the number of corrupted bits, as the roots of Λ(x) are associated with the transmission errors. The BMA is based on the property that for a


number of iterations r greater than or equal to the number of errors ta that have actually occurred (r ≥ ta), the discrepancy dr = 0 in equ(3.16) below, where

d_r = Σ_{j=0}^{t} Λ_j S_{2r+1-j}. (3.16)

On the other hand, if r < ta, the discrepancy dr calculated in equ(3.16) is usually non-zero and is used to modify the degree and coefficients of Λ(x). What the BMA essentially does, therefore, is compute the lowest-degree Λ(x) such that equ(3.15) holds.

The BMA with inversion is given below.

Initial values:

d_1 = S_3 + S_1^3,  Λ^(1)(x) = 1 + S_1 x,  B^(1)(x) = x^2,  l_1 = 1,  d_p = S_1    if S_1 ≠ 0

d_1 = S_3,  Λ^(1)(x) = 1,  B^(1)(x) = x^3,  l_1 = 0,  d_p = 1    if S_1 = 0. (3.17)

The error location polynomial Λ(x) is then generated by the following set of recursive equations:

d_r = Σ_{j=0}^{t} Λ_j^(r) S_{2r+1-j}

Λ^(r+1)(x) = Λ^(r)(x) + d_r d_p^{-1} B^(r)(x)

bsel = 0 if (d_r = 0 or 2l_r > 2r);  bsel = 1 if (d_r ≠ 0 and 2l_r ≤ 2r)

B^(r+1)(x) = x^2 B^(r)(x) if bsel = 0;  B^(r+1)(x) = x^2 Λ^(r)(x) if bsel = 1

l_{r+1} = l_r if bsel = 0;  l_{r+1} = 2r + 1 - l_r if bsel = 1

d_p = d_p if bsel = 0;  d_p = d_r if bsel = 1. (3.18)

These calculations are carried out for r = 1, ..., t-1.

Note that the above algorithm is slightly modified in comparison with the

previously presented BMA [2, 32]. Due to more complicated initial states, the number of

iterations is decreased by one. In practice, this causes only a slight increase in the hardware

requirements but the BMA calculation time is significantly reduced.

A circuit implementing the BMA is given in Figure 3.4. The error location polynomial Λ(x) is obtained in the C registers after t-1 iterations.


Figure 3.4. Berlekamp-Massey Algorithm with inversion.
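For reference, the textbook form of the Massey recursion (the standard formulation, not the modified initialisation above) can be sketched and checked with a Chien-style root search; the GF(2^4) field, the BCH(15,5) parameters and the error positions 3 and 4 below are illustrative assumptions:

```python
def gf_mul(a, b):
    """Multiply in GF(2^4) generated by p(x) = x^4 + x + 1."""
    r = 0
    while b:
        if b & 1:
            r ^= a
        a <<= 1
        if a & 0x10:
            a ^= 0b10011
        b >>= 1
    return r

def gf_pow(a, e):
    r = 1
    for _ in range(e % 15):
        r = gf_mul(r, a)
    return r

def gf_inv(a):
    return gf_pow(a, 14)          # a^-1 = a^14, since a^15 = 1

def gf_poly_eval(p, x):
    v = 0
    for k, c in enumerate(p):
        v ^= gf_mul(c, gf_pow(x, k))
    return v

def berlekamp_massey(S, t):
    """Textbook Massey recursion over GF(2^m); S = [S_1, ..., S_2t]."""
    C = [1] + [0] * (2 * t)       # connection polynomial Lambda(x)
    B = [1] + [0] * (2 * t)       # copy of C at the last length change
    L, m, b = 0, 1, 1
    for n in range(2 * t):
        d = S[n]                  # discrepancy
        for i in range(1, L + 1):
            d ^= gf_mul(C[i], S[n - i])
        if d == 0:
            m += 1
            continue
        T = C[:]
        coef = gf_mul(d, gf_inv(b))
        for i in range(len(C) - m):
            C[i + m] ^= gf_mul(coef, B[i])
        if 2 * L <= n:            # length change needed
            L, B, b, m = n + 1 - L, T, d, 1
        else:
            m += 1
    return C[:L + 1]

# Errors assumed at positions 3 and 4 of a BCH(15,5), t = 3 code.
alpha, n, t = 2, 15, 3
S = [gf_pow(alpha, 3 * j) ^ gf_pow(alpha, 4 * j) for j in range(1, 2 * t + 1)]
lam = berlekamp_massey(S, t)
# Chien-style check: position i is in error when Lambda(alpha^-i) = 0.
errs = [i for i in range(n) if gf_poly_eval(lam, gf_pow(alpha, n - i)) == 0]
```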

In some applications it may be beneficial to implement the BMA without inversion. A version of the BMA achieving this was presented in [8, 56]. For the inversionless BMA the initial conditions are the same as for the BMA with inversion given in equ(3.17). The error location polynomial is then generated by the following recursive equations:

d_r = Σ_{j=0}^{t} Λ_j^(r) S_{2r+1-j}

Λ^(r+1)(x) = d_p Λ^(r)(x) + d_r B^(r)(x)

bsel = 0 if (d_r = 0 or 2l_r > 2r);  bsel = 1 if (d_r ≠ 0 and 2l_r ≤ 2r)

B^(r+1)(x) = x^2 B^(r)(x) if bsel = 0;  B^(r+1)(x) = x^2 Λ^(r)(x) if bsel = 1

l_{r+1} = l_r if bsel = 0;  l_{r+1} = 2r + 1 - l_r if bsel = 1

d_p = d_p if bsel = 0;  d_p = d_r if bsel = 1. (3.19)

In conclusion, the inversionless BMA is more complicated and requires a greater number of multiplications than the BMA with inversion. On the other hand, inversion can take (m-1) clock cycles (see Section 2.7), and therefore even if parallel multiplication is used this constraint will slow down the algorithm. The inversionless algorithm therefore has to be implemented for some BCH codes.

For SEC and DEC BCH codes the coefficients of Λ(x) can be obtained directly without using the BMA. This is because for SEC BCH codes

Λ(x) = 1 + S1x

and for DEC BCH codes

Λ(x) = 1 + Λ1x + Λ2x^2 = 1 + S1x + (S1^2 + S3 S1^{-1}) x^2

[2], ([30] p. 321). This approach for generating Λ(x) directly in terms of the syndromes can theoretically be extended to TMEC codes but quickly becomes too complex to implement in hardware, and so the BMA must be used.

3.4.3 Finding the error locations

3.4.3.1 General case

The last step in decoding BCH codes is to find the error location numbers. These values are the reciprocals of the roots of Λ(x) and may be found simply by substituting 1, α, α^2, ..., α^{n-1} into Λ(x). A method of achieving this using sequential substitution has been presented by Chien [10]. In the Chien search the sum

Λ0 + Λ1 α^j + Λ2 α^{2j} + ... + Λt α^{tj}    (j = 0, 1, ..., k-1) (3.20)

is evaluated every clock cycle. It can be noticed that if Λ(α^j) = 0, the received bit r_{n-1-j} is corrupted. Therefore, if for clock cycle j the sum equals zero, the received bit r_{n-j-1} should be corrected.

A circuit implementing the Chien search is shown in Figure 3.5. The operation of this circuit is as follows. The registers c0, c1, ..., ct are initialised with the coefficients of the error location polynomial Λ0, Λ1, ..., Λt. Then the sum Σ_{j=0}^{t} c_j is calculated and if this equals zero, an error has been found and, after being delayed in a buffer, the faulty received bit is corrected using an XOR gate. On the next clock cycle each value in the c_i register is multiplied by α^i (using a constant multiplier), and the sum Σ_{j=0}^{t} c_j is calculated again. The above operations are carried out for every transmitted information bit (that is, k times).


Figure 3.5. Chien's search circuit.
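A software model of this register arrangement (illustrative only; GF(2^4) with p(x) = x^4 + x + 1 and a single-error locator Λ(x) = 1 + α^5 x are assumed) shows the constant-multiplier update at work:

```python
def gf16_mul(a, b):
    """Multiply in GF(2^4) generated by p(x) = x^4 + x + 1."""
    r = 0
    while b:
        if b & 1:
            r ^= a
        a <<= 1
        if a & 0x10:
            a ^= 0b10011
        b >>= 1
    return r

def gf16_pow(a, e):
    r = 1
    for _ in range(e % 15):
        r = gf16_mul(r, a)
    return r

alpha = 0b0010
lam = [1, gf16_pow(alpha, 5)]   # Lambda(x) = 1 + alpha^5 x (assumed)
c = lam[:]                      # registers c_i initialised with Lambda_i
flagged = []
for j in range(15):
    if c[0] ^ c[1] == 0:        # sum of the c registers = Lambda(alpha^j)
        flagged.append(j)
    for i in range(len(c)):     # constant multipliers: c_i <- c_i * alpha^i
        c[i] = gf16_mul(c[i], gf16_pow(alpha, i))
```

The zero is flagged at clock j = 10, consistent with the fact that α^10 = α^(-5) is the root of 1 + α^5 x.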

3.4.3.2 Finding an error location for t = 2

In the case of DEC BCH codes, two different algorithms may be adopted. Firstly, one may adopt the general procedure, namely finding the syndromes, implementing the relatively burdensome BMA and then applying the Chien search. In this thesis, however, another approach has been adopted [53]. This algorithm does not require the error location polynomial Λ(x) to be generated; instead a more sophisticated error location procedure is adopted. This algorithm is summarised below.

Suppose the received vector has at most two errors; then the error location polynomial Λ(x) is given by [2] ([30] p. 321):

Λ(x) = 1 + Λ1x + Λ2x^2 = 1 + S1x + (S1^2 + S3 S1^{-1}) x^2. (3.21)

Therefore if there is no error, Λ1 = 0 and Λ2 = 0, thus

S1 = 0 and S3 = 0. (3.22)

If only one error has occurred, Λ1 ≠ 0 and Λ2 = 0, thus

S1 ≠ 0 and S3 = S1^3. (3.23)

If there are two errors, Λ1 ≠ 0 and Λ2 ≠ 0, thus

S1 ≠ 0 and S3 ≠ S1^3. (3.24)

If S1 = 0 and S3 ≠ 0, more than two errors have occurred and so the error pattern cannot be corrected.

This step-by-step decoding algorithm is based around the assumption that an error has occurred at the present location. Accordingly the values in the s1 and s3 registers are changed. These changes are easily implemented because if the received bit r_{n-1} is corrupted, only the first bits s_{i,0} of the registers s1 and s3 need be negated, where s_i = s_{i,0} + s_{i,1}x + ... + s_{i,m-1}x^{m-1} (for i = 1, 3) and the s1 and s3 registers hold the values of S1 and S3 respectively. Similarly, assuming that the received bit r_{n-1-j} is corrupted, the syndrome registers are clocked j times implementing the function s_i ← s_i · α^i (with a circuit similar to the syndrome calculation circuit) and so again only the first bits s_{1,0} and s_{3,0} are negated [53].
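The error-count classification of equ(3.22)-(3.24) is easy to check in software; an illustrative sketch (GF(2^4) assumed, arbitrary error positions):

```python
def gf16_mul(a, b):
    """Multiply in GF(2^4) generated by p(x) = x^4 + x + 1."""
    r = 0
    while b:
        if b & 1:
            r ^= a
        a <<= 1
        if a & 0x10:
            a ^= 0b10011
        b >>= 1
    return r

def gf16_pow(a, e):
    r = 1
    for _ in range(e % 15):
        r = gf16_mul(r, a)
    return r

def classify(s1, s3):
    """Apparent error count implied by S1 and S3, per equ(3.22)-(3.24)."""
    s1_cubed = gf16_mul(s1, gf16_mul(s1, s1))
    if s1 == 0 and s3 == 0:
        return 0          # no error
    if s1 != 0 and s3 == s1_cubed:
        return 1          # single error
    if s1 != 0:
        return 2          # two errors
    return None           # S1 = 0, S3 != 0: uncorrectable

alpha = 0b0010
# syndromes of e(x) = x^4 (one error) and e(x) = x^2 + x^8 (two errors)
one = (gf16_pow(alpha, 4), gf16_pow(alpha, 12))
two = (gf16_pow(alpha, 2) ^ gf16_pow(alpha, 8),
       gf16_pow(alpha, 6) ^ gf16_pow(alpha, 24))
```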

A circuit employing this algorithm is given in Figure 3.6. At first the registers s1 and s3 are initialised with S1 and S3 respectively and, using equ(3.22)-(3.24), the number of errors present is stored by clocking the values h1, h3 into the flip-flops p1, p3. It is then assumed that an error has occurred in the first position. The registers s1 and s3 are updated and, again using equ(3.22)-(3.24), the new number of errors present is determined. If the new number of errors has decreased, the assumption has proven to be correct and an error has been found. That is, the received bit r_{n-1} is corrected and the error assumption changes are introduced permanently into the s1 and s3 registers. In addition, the p flip-flops are clocked with the new h signals. Alternatively, if the number of errors has not decreased, the assumption is wrong, the correct bit has been received and the changes are cancelled. The above operations are repeated for every received information bit r_{n-1-j} (0 ≤ j < k), after the s registers have been shifted (s_i ← s_i · α^i, i = 1, 3) j times.

Figure 3.6. Error location circuit for t = 2.


3.5 Reed-Solomon codes

In this thesis only binary BCH codes have been considered and only their hardware

representation developed. Therefore a comparison between binary BCH codes and non-binary BCH codes, in particular their sub-class of RS codes [44], is considered here. RS codes are the most efficient error correcting codes theoretically possible and a wide body of knowledge

concerning them exists [5,30]. In addition, RS codes are especially attractive as they can

correct not only random but burst errors as well. In many situations the information

channel has memory, and so random binary BCH codes are not appropriate. Fortunately,

binary BCH codes can correct burst errors when an interleaved code with large t is

adopted. But as will be shown below, this architecture is not recommended and instead,

RS codes should be used.

RS codes operate on symbols consisting of m bits which are elements of GF(2^m). Each codeword consists of n = 2^m - 1 such symbols, of which k = 2^m - 1 - 2t carry information, where t is the maximum number of symbols that may be corrected.

Now the decoding of RS codes will briefly be presented in comparison with BCH codes. The encoding process is omitted here as it is relatively simple and therefore only slightly influences a codec's complexity. There are two different ways of decoding RS codes [5,16]: in the time domain or in the frequency domain. Here the frequency domain decoding process will be considered.

Decoding may be separated into four main steps:

1. Calculation of the syndromes using the equation

S_i = Σ_{j=0}^{n-1} r_j α^{ij},    0 ≤ i < 2t, (3.25)

where the r_j ∈ GF(2^m) are the received symbols (see also equ(3.8)). Note that the calculation of the syndromes for BCH codes is simpler (r_i ∈ GF(2)) than for RS codes, because here r_i ∈ GF(2^m) and so the equation S_{2i} = S_i^2 does not hold for RS codes.

2. The Berlekamp-Massey Algorithm. The BMA is similar to that for BCH codes but requires twice as many iterations (for the same value of t).

3. Recursive extension, computing the equation

E_i = Σ_{l=1}^{t} Λ_l E_{i-l},    2t ≤ i ≤ n-1, (3.26)

where E_i = S_i for 0 ≤ i ≤ 2t-1.

Youzhi [56] has shown that this step can be implemented with a BMA circuit by adding

only additional control signals.

4. Obtaining the error magnitudes e_i by computing the inverse transform

e_i = (1/n) Σ_{j=0}^{n-1} E_j α^{-ij},    0 ≤ i ≤ n-1. (3.27)

This step is not required for BCH codes and, in comparison to the Chien search, it is rather more complicated.
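Steps 3 and 4 can be illustrated for a single assumed error (position 3, GF(2^4) generated by p(x) = x^4 + x + 1, t = 1); this is a toy sketch, not an optimised decoder. Over GF(2^m) with odd n, 1/n = 1:

```python
def gf_mul(a, b):
    """Multiply in GF(2^4) generated by p(x) = x^4 + x + 1."""
    r = 0
    while b:
        if b & 1:
            r ^= a
        a <<= 1
        if a & 0x10:
            a ^= 0b10011
        b >>= 1
    return r

def gf_pow(a, e):
    r = 1
    for _ in range(e % 15):
        r = gf_mul(r, a)
    return r

n, alpha = 15, 0b0010
pos = 3                                          # assumed error position
lam = [1, gf_pow(alpha, pos)]                    # Lambda(x) = 1 + alpha^pos x
E = [gf_pow(alpha, pos * j) for j in range(2)]   # E_i = S_i for 0 <= i <= 2t-1
while len(E) < n:                                # recursive extension, equ(3.26)
    i = len(E)
    E.append(gf_mul(lam[1], E[i - 1]))
# inverse transform, equ(3.27); 1/n = 1 here
e = [0] * n
for i in range(n):
    for j in range(n):
        e[i] ^= gf_mul(E[j], gf_pow(alpha, (-i * j) % 15))
```

The recovered error vector `e` is 1 at position 3 and 0 elsewhere, as expected.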

If binary BCH codes and non-binary RS codes are compared, at first it may seem that BCH codes are much simpler to implement. This is because RS codes operate on symbols and require additional steps to be computed: not only do the error locations have to be calculated (as with BCH codes) but also the error magnitudes. But after closer consideration it may be seen that, for example, a (15, 11) RS code can correct up to two corrupted 4-bit symbols, e.g. at least one 5-bit burst error. This code consists of 4*15 = 60 codeword bits and 4*11 = 44 information bits. Conversely, consider a similar (63, 36) 5-bit random error correcting BCH code. It should be noted here that this comparison of BCH and RS codes is not based on any practical experiments and in practice different codes might be considered. This BCH code not only has a lower information rate (k/n) but also needs more hardware for the decoder. Admittedly, calculation of the syndromes is simpler than for the RS code, but a greater number of syndromes must be computed. Furthermore, the BMA is much more complex for the BCH code as the number of errors is greater. RS codes also require the inverse transform calculation to be implemented, and this is more complicated than the equivalent Chien search circuit. But taken overall, the hardware requirements of the RS codec are much simpler. In addition, the BCH code requires all the operations to be carried out over GF(2^6), whereas the RS code operates over GF(2^4), and so more complex arithmetic circuits are required for BCH codes.

In conclusion, RS codecs generally have more attractive properties and should be preferred when burst errors have to be corrected.


3.6 Conclusions

In this chapter BCH codes have been introduced. Encoding and decoding

algorithms for BCH codes with different error-correcting abilities have been considered.

Decoders have more complex structure than encoders and so the decoding process has

been broken down into three separate steps. The first step is the syndrome calculation

process, an identical process whatever the error-correcting ability of the code. The next

step is to find the error location polynomial Λ(x). This stage is the most complicated of the three and for DEC BCH codes, an alternative decoding algorithm is used which by-passes the need to generate this polynomial entirely. For TMEC BCH codes Λ(x) is calculated using the relatively complex BMA, whereas for SEC BCH codes, Λ(x) can be expressed

immediately in terms of the syndromes. The last stage of decoding is to find and correct

any errors present. Two different approaches have been employed to achieve this, one for

the general case and one for DEC BCH codes.

CHAPTER 4

4.1 High speed architectures for General Linear Feedback Shift Registers

A CRC check generation circuit can be implemented with the use of a linear feedback circuit. The following figure shows the LFSR representation of a CRC with generator polynomial 1 + y + y^3 + y^5.

4.2 Architectures for polynomial G(y) = 1 + y + y^3 + y^5

Fig. 4.1. Serial structure


CRC codes have been used for years to detect data errors on interfaces, and their operation

and capabilities are well understood.

4.3 Motivation for the parallel implementation:

Cyclic redundancy check (CRC) is widely used to detect errors in data

communication and storage devices. When high-speed data transmission is required, the

general serial implementation cannot meet the speed requirement. Since parallel

processing is a very efficient way to increase the throughput rate, parallel CRC

implementations have been discussed extensively in the past decade.

Although parallel processing increases the number of message bits that can be processed in

one clock cycle, it can also lead to a long critical path (CP); thus, the increase of

throughput rate that is achieved by parallel processing will be reduced by the decrease of

circuit speed. Another issue is the increase of hardware cost caused by parallel processing,

which needs to be controlled. This chapter addresses these two issues of parallel CRC implementation.

4.4 Literature Survey and Existing Systems:

In the past, recursive formulas have been developed for parallel CRC hardware computation based on mathematical deduction, and they have identical CPs. The parallel CRC algorithm in [2] processes an m-bit message in (m+k)/L clock cycles, where k is the order of the generator polynomial and L is the level of parallelism. However, in [1], m message bits can be processed in m/L clock cycles.

High-speed architectures for parallel long Bose-Chaudhuri-Hocquenghem (BCH) encoders in [3] and [4], which are based on multiplication and division computations on the generator polynomial, are efficient in terms of speeding up parallel linear feedback shift register (LFSR) structures. They can also be used generally for the LFSR of any generator polynomial. However, their hardware cost is high.

4.5 LFSR (Linear Feedback Shift Register)


A Linear Feedback Shift Register (LFSR) is a shift register whose input bit is a

linear function of its previous state. The only linear functions of single bits are xor and

inverse-xor; thus it is a shift register whose input bit is driven by the exclusive-or (xor) of

some bits of the overall shift register value.

The initial value of the LFSR is called the seed, and because the operation of the

register is deterministic, the sequence of values produced by the register is completely

determined by its current (or previous) state. Likewise, because the register has a finite

number of possible states, it must eventually enter a repeating cycle. However, an LFSR

with a well-chosen feedback function can produce a sequence of bits which appears

random and which has a very long cycle.

4.6 Serial input hardware realization

Fig 4.2. Basic LFSR architecture


Fig 4.3. Linear Feedback Shift Register implementation of CRC-32

4.7 DESIGN OF ARCHITECTURES USING DSP TECHNIQUES

4.7.1 Unfolding

Unfolding is a transformation technique that can be applied to a DSP program to create a new program describing more than one iteration of the original program.

Unfolding a DSP program by an unfolding factor J creates a new program that

describes J consecutive iterations of the original program. It increases the sampling rate by

replicating hardware so that several inputs can be processed in parallel and several outputs

can be produced at the same time.
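As a sketch of the idea (illustrative Python, not the thesis hardware), unfolding the serial LFSR update for g(y) = 1 + y + y^3 + y^5 by a factor J = 3 consumes three message bits per loop iteration while producing the same final state as the serial version:

```python
def lfsr_step(state, bit):
    """One serial clock of the division LFSR for g(y) = 1 + y + y^3 + y^5.

    state = [s0, s1, s2, s3, s4]; the feedback is added back at the
    taps corresponding to y^0, y^1 and y^3.
    """
    s0, s1, s2, s3, s4 = state
    fb = s4 ^ bit                 # feedback from the top of the register
    return [fb, s0 ^ fb, s1, s2 ^ fb, s3]

def serial(msg):
    st = [0] * 5
    for b in msg:
        st = lfsr_step(st, b)
    return st

def unfolded3(msg):
    """J = 3 unfolding: three message bits consumed per iteration."""
    assert len(msg) % 3 == 0
    st = [0] * 5
    for i in range(0, len(msg), 3):
        # the three serial updates of one unfolded iteration; in hardware
        # these would be flattened into one combinational stage
        st = lfsr_step(lfsr_step(lfsr_step(st, msg[i]), msg[i + 1]), msg[i + 2])
    return st
```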

4.7.2 Pipelining

Pipelining reduces the effective critical path by introducing pipelining latches along the critical data path, either to increase the clock frequency or sample rate, or to reduce power consumption at the same speed. Here it is done using a look-ahead pipelining algorithm to reduce the iteration bound of the CRC architecture.

4.7.3 Retiming

Retiming is a technique used to change the locations of delay elements in a circuit without affecting the input/output characteristics of the circuit. It moves around existing delays and:

• Does not alter the latency of the system

• Reduces the critical path of the system

Retiming has many applications in synchronous circuit design. These applications

include reducing the clock period of the circuit, reducing the number of registers in the

circuit, reducing the power consumption of the circuit and logic synthesis. It can be used to

increase the clock rate of a circuit by reducing the computation time of the critical path.

64

Page 66: Abstract - codelooker · Web viewThis can be accomplished using two-variable input multipliers of the type described later. Alternatively it is often beneficial to employ a multiplier

4.7.4 Critical path:

The critical path is the path with the longest computation time among all paths that contain zero delays; its computation time is a lower bound on the clock period of the circuit.

4.7.5 Iteration bound:

The iteration bound is defined as the maximum of all the loop bounds. The loop bound is defined as t/w, where t is the computation time of the loop and w is the number of delay elements in the loop.
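Computing the iteration bound from (t, w) pairs is a one-liner; the three loops below are those quoted in Section 4.7.6 for the serial LFSR of g(y) = 1 + y + y^3 + y^5, with t normalised to units of T_XOR:

```python
# (computation time t in units of T_XOR, number of delays w) for each loop
loops = [(1, 1), (3, 4), (3, 5)]

loop_bounds = [t / w for t, w in loops]
iteration_bound = max(loop_bounds)   # minimum achievable clock period
```

Here the iteration bound is T_XOR, set by the right-most feedback loop.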

Linear Feedback Shift Register (LFSR) structures are widely used in digital signal processing and communication systems, for example in BCH and CRC coding. In high-rate digital systems such as optical communication systems, a throughput of 1 Gbps is usually desired. The serial input/output operation of the LFSR structure is a bottleneck in such systems and a parallel LFSR architecture is thus required. High-speed parallel CRC implementations based on unfolding, pipelining and retiming have been proposed. These parallel structures are not always efficient for general LFSR structures. Furthermore, the large fanout problem arising for long LFSR structures is not addressed in these papers.

A pipelining technique is needed to reduce the achievable minimum clock period before parallel implementation is applied. A three-step LFSR structure is presented in [4]. The message input m(x) is first multiplied by a factor polynomial p(x); the generator polynomial g(x) is then modified as g'(x) = p(x)g(x); the remainder of m(x)p(x)/g'(x) is finally divided by p(x) and the quotient is the expected output. The second step of this algorithm inserts as many delay elements as possible into the right-most feedback loop, which otherwise causes large fanout and long latency when g(x) is long [4]. This three-step scheme can eliminate the effect of large fanout. However, the feedback structure of p(x) in the third step can still limit the achievable clock frequency of the final parallel LFSR structure. Three approaches are proposed in [5] to eliminate the feedback loops in the third step. Since the speed bottleneck of the three-step algorithm in [4] is usually located in the third step, the new approaches in [5] can efficiently speed up the final parallel LFSR structures. However, the hardware cost of the approaches in [5] is high. Furthermore, since the goal of the second step in [4] is to insert delay elements into the right-most feedback loop, and the achievable clock frequency is not necessarily determined by this feedback loop, the feedback structure obtained from the second step of [4] is not optimum for the achievable clock frequency. This work uses different structures to solve the bottleneck in the third step of [4] with lower hardware cost. When we construct p(x), we guarantee that p(x) can be decomposed into several short-length polynomials. Since the quotient of dividing the output of the second step by p(x) can alternatively be obtained by dividing the output of the second step by a chain of the factor polynomials of p(x), and the feedback loop of a small-length polynomial can easily be handled by look-ahead pipelining techniques [3], the iteration bound bottleneck can be solved with a smaller number of XOR gates than needed in [5]. A search algorithm for reducing the achievable clock period of the second step, and thus of the overall parallel LFSR, is then presented after the large fanout problem has been solved.
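The three-step flow can be modelled in software to confirm that it reproduces the ordinary remainder; this sketch packs GF(2) polynomials into Python integers (bit i = coefficient of x^i) and uses an arbitrary illustrative factor polynomial p(x) = 1 + x:

```python
def gf2_mul(a, b):
    """Carry-free product of two GF(2) polynomials packed into ints."""
    r = 0
    while b:
        if b & 1:
            r ^= a
        a <<= 1
        b >>= 1
    return r

def gf2_divmod(a, b):
    """Quotient and remainder of a(x) / b(x) over GF(2)."""
    q, db = 0, b.bit_length() - 1
    while a.bit_length() - 1 >= db:
        shift = a.bit_length() - 1 - db
        q |= 1 << shift
        a ^= b << shift
    return q, a

g = 0b101011          # g(x) = 1 + x + x^3 + x^5, so n - k = 5
p = 0b11              # p(x) = 1 + x (illustrative factor polynomial)
m = 0b110100111       # arbitrary message m(x)

# direct remainder of m(x) x^(n-k) by g(x)
direct = gf2_divmod(gf2_mul(m, 1 << 5), g)[1]

# step 1: multiply the shifted message by p(x)
step1 = gf2_mul(gf2_mul(m, 1 << 5), p)
# step 2: remainder modulo g'(x) = g(x) p(x)
step2 = gf2_divmod(step1, gf2_mul(g, p))[1]
# step 3: divide by p(x); the quotient is the CRC and the remainder is 0
quot, rem = gf2_divmod(step2, p)
```

Since m(x)x^(n-k)p(x) = q(x)g'(x) + r(x)p(x), the quotient of the third step equals the direct remainder.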

4.7.6 IMPROVED ALGORITHM FOR ELIMINATING THE FANOUT BOTTLENECK FOR LFSR STRUCTURES

For a generator polynomial g(x) of degree (n-k) and a message sequence m(x) of degree (k-1), systematic encoding provides the codeword of degree (n-1) as

c(x) = m(x)x^(n-k) + Rem_g(x)[m(x)x^(n-k)],

where Rem_g(x)[m(x)x^(n-k)] is the remainder of dividing m(x)x^(n-k) by g(x). For example, if g(y) = 1 + y + y^3 + y^5, the corresponding LFSR structure for computing Rem_g(x)[m(x)x^(n-k)] is shown in Fig. 4.4, where D denotes a delay element. The message sequence m(x) is injected from the right side with the most significant bit first. After k clock cycles, Rem_g(x)[m(x)x^(n-k)] will be available in the delay elements. From Fig. 4.4 we can see that the structure has 3 feedback loops, with loop bounds of TXOR, (3/4)TXOR and (3/5)TXOR for loops 1, 2 and 3, respectively. The iteration bound is thus TXOR, corresponding to the right-most feedback loop, where TXOR is the computation time of an XOR gate. Note that the iteration bound is defined as the maximum of all the loop bounds; the loop bound is defined as t/w, where t is the computation time of the loop and w is the number of delay elements in the loop [6]. The iteration bound is the minimum achievable clock period of a digital system.

Fig 4.4. LFSR structure for g(y) = 1 + y + y^3 + y^5: (a) serial structure

Fig 4.5. LFSR structure for g(y) = 1 + y + y^3 + y^5: (b) 3-parallel structure

The 3-parallel implementation of the LFSR structure in Fig. 4.4 is shown in Fig. 4.5. After observing Fig. 4.5, we can find two issues. One is that the iteration bound has increased to 3TXOR, which means that although the throughput rate has been increased by a factor of 3, the achievable clock frequency has decreased by a factor of 3; the achievable processing speed is thus the same as for the serial LFSR structure in Fig. 4.4. Therefore, reducing the iteration bound of the original serial LFSR structure is important before we apply parallel implementation [3].

Another issue indicated in Fig. 4.5 is that each of the three right-most XOR gates drives many other XOR gates; the fanout becomes large when the generator polynomial is long and thus causes a large fanout delay [4]. Inserting delay elements between the right-most XOR gates and their subsequent XOR gates can solve this issue. Two of the right-most XOR gates in Fig. 4.5 can be separated from their subsequent XOR gates by first pipelining the inputs y(3k) and y(3k+1) and then applying retiming. However, this scheme cannot be applied to the lowest right-most XOR gate. This is caused by the fact that the number of delay elements in the right-most feedback loop in Fig. 4.4 is only 2, less than the desired parallelism level of 3. Therefore, inserting enough delay elements into the right-most feedback loop of the original serial LFSR structure is the key to solving the large fanout issue, which is the contribution of the three-step LFSR architecture in [4].

From m(x)x^(n-k)p(x) = q(x)g'(x) + r'(x), we can see that if we multiply both m(x)x^(n-k) and g(x) by p(x), the remainder of dividing m(x)x^(n-k)p(x) by g'(x) is r'(x) = r(x)p(x), and r(x) is the quotient of dividing r'(x) by p(x). This is the basic idea of the three-step LFSR architecture. Since g'(x) is constructed as g(x)p(x), it is very important to choose a proper p(x) for addressing the two issues discussed above. In [4], clustered look-ahead computation is applied for finding p(x). This scheme is efficient and can insert delay elements into the right-most feedback loop with the minimum increase of polynomial degree over g(x), and thus can control the increase in XOR gates. However, the iteration bound bottleneck is transferred to the LFSR in the third stage. Although three approaches have been proposed in [5] to address the iteration bound issue in the third step, the hardware cost is high and the iteration bound bottleneck will again be transferred to the second step.

4.7.7 Improved Algorithm for eliminating the Fanout bottleneck

We start with the same example used in [4].

Example 1: Given a BCH(255,233) code using generatorpolynomial

g(x)=1+x2+x3+x4+x5+x6+x7+x9+x14+x16+x17+x19

+x20+x22+x25+x26+x27+x29+x30+x31+x32

=[1 0 1 1 1 1 1 1 1 0 1 0 0 0 0 1 0 1 1 0 1 1 0 1 0 0 1 1 10 1 1 1 1].

Assume our targeted parallelism level is also 8. With p(x) = (1+x)(1+x^4)(1+x^5), we obtain:

g'(x) = g(x)p(x) = 1 + x + x^2 + x^4 + x^7 + x^8 + x^11 + x^12 + x^14 + x^16 + x^17 + x^18 + x^25 + x^27 + x^28 + x^31 + x^33 + x^42.

This is illustrated in Fig. 4.6.
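The product can be double-checked with integer-encoded GF(2) polynomials (a Python sketch; bit i of an int holds the coefficient of x^i, and the exponent list for g(x) is taken from the text):

```python
# Sketch: recompute g'(x) = g(x)(1+x)(1+x^4)(1+x^5) and check the gap
# between its two highest-degree terms.

def pmul(a, b):                # carry-less (GF(2)) polynomial product
    r = 0
    while b:
        if b & 1:
            r ^= a
        a <<= 1
        b >>= 1
    return r

g = sum(1 << e for e in
        [0, 2, 3, 4, 5, 6, 7, 9, 14, 16, 17, 19, 20, 22,
         25, 26, 27, 29, 30, 31, 32])
p = pmul(pmul(0b11, 0b10001), 0b100001)     # (1+x)(1+x^4)(1+x^5)

gp = pmul(g, p)
exps = [i for i in range(gp.bit_length()) if gp >> i & 1]
# the two highest-degree terms are x^42 and x^33: a gap of 9 >= 8,
# so 8 delay elements fit in the right-most feedback loop
assert exps[-1] == 42 and exps[-2] == 33
```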


Fig 4.6 Three-step implementation of BCH(255,223) encoding: (a) first step, (b) second step, (c) third step and (d)-(e) improved third step

The improved third step in Fig. 4.6(d) is based on the look-ahead pipelining scheme discussed as follows. The operation of 1/(1+x^k) is shown in Fig. 4.7(a), where we implement (4.1).


b(n+k) = a(n) + b(n) (4.1)

If we apply k steps of look-ahead to (4.1), we obtain (4.2):

b(n+2k) = a(n+k) + b(n+k) = a(n+k) + a(n) + b(n). (4.2)

The corresponding hardware implementation is shown in Fig. 4.7(b). An additional 2k steps of look-ahead applied to (4.2) can be represented as (4.3):

b(n+4k) = a(n+3k) + a(n+2k) + b(n+2k)
= a(n+3k) + a(n+2k) + a(n+k) + a(n) + b(n). (4.3)

(4.3) can be implemented with the structure shown in Fig. 4.7(c).
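The look-ahead identities can be sanity-checked by simulating the recurrence (a Python sketch with an arbitrary random input stream and k = 2, chosen only for illustration):

```python
# Sketch: numerically check the look-ahead identities (4.2) and (4.3) for
# the GF(2) recurrence b(n+k) = a(n) + b(n).

import random

random.seed(0)
k, N = 2, 64
a = [random.randint(0, 1) for _ in range(N)]
b = [0] * (N + 4 * k)                 # zero initial state
for n in range(N):
    b[n + k] = a[n] ^ b[n]            # original recurrence (4.1)

for n in range(N - 4 * k):
    # (4.2): 2k-step look-ahead
    assert b[n + 2*k] == a[n + k] ^ a[n] ^ b[n]
    # (4.3): 4k-step look-ahead
    assert b[n + 4*k] == a[n + 3*k] ^ a[n + 2*k] ^ a[n + k] ^ a[n] ^ b[n]
```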

Fig 4.7 Look-ahead pipelining for 1/(1+x^k): (a) original hardware implementation for 1/(1+x^k), (b) 2k-level look-ahead pipelining and (c) 4k-level look-ahead pipelining

Replacing the 1/(1+x) in Fig. 4.6(c) with the structure of Fig. 4.7(c) with k=1, Fig. 4.6(c) can be pipelined as shown in Fig. 4.6(d). Note that in Fig. 4.6(d) we have two 1/(1+x^4) blocks in a row, and they can be simplified as 1/(1+x^8). This process is shown in Fig. 4.6(e).

From Example 1, we can see that the difference between the proposed scheme and the previous three-step LFSR structures [4][5] lies in how p(x) is constructed. In this paper, p(x) is constructed from short polynomials, which are easy to handle with the proposed look-ahead pipelining algorithm. In [4][5], however, p(x) is obtained and handled as a single long polynomial, which usually leads to a large iteration bound and is difficult to handle.

When p(x) is decomposed into short polynomials, the operational characteristics of 1/p(x) are slightly different from when it is implemented as a single long polynomial. We show this difference with the example in Fig. 4.8; a strict proof is given in Appendix A. From Fig. 4.8, we can see that the operation of m(x)/p(x) yields both the correct remainder and the correct quotient when we use the structure for p(x) = 1+x+x^2+x^3, while the structure for p(x) = (1+x)(1+x^2) provides only the correct quotient. However, using p(x) = (1+x)(1+x^2) is sufficient, because we only need the quotient in the third step. Furthermore, in the structure for p(x) = (1+x)(1+x^2) shown in Fig. 4.8(b), the quotient can be obtained without any latency. This is an important advantage, because the structure for p(x) = 1+x+x^2+x^3 in Fig. 4.8(a) has a latency of 3 clock cycles, equal to the degree of p(x).
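The quotient-preservation property is easy to check numerically (a Python sketch using the m(x) and p(x) of Fig. 4.8; polynomials are ints with bit i holding the coefficient of x^i):

```python
# Sketch: dividing by p(x) = (1+x)(1+x^2) as a cascade of short divisions
# gives the same quotient as dividing by p(x) = 1+x+x^2+x^3 directly
# (only the remainder differs).

def pdivmod(a, b):          # GF(2) polynomial division
    q, db = 0, b.bit_length()
    while a.bit_length() >= db:
        s = a.bit_length() - db
        q |= 1 << s
        a ^= b << s
    return q, a

m = 0b1111111               # m(x) = 1 + x + x^2 + x^3 + x^4 + x^5 + x^6
q_direct, r_direct = pdivmod(m, 0b1111)      # p(x) = 1+x+x^2+x^3
q1, _ = pdivmod(m, 0b11)                     # divide by 1+x ...
q_cascade, _ = pdivmod(q1, 0b101)            # ... then by 1+x^2
assert q_cascade == q_direct == 0b1000       # quotient is x^3 either way
```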

The iteration bound of the third step can thus easily be reduced to any small value, so the iteration bound of the overall LFSR structure is determined by the second step, which is not handled in [4][5]. We discuss this issue in Section 4.8.

Fig 4.8 Operation of m(x)/p(x) for m(x) = 1+x+x^2+x^3+x^4+x^5+x^6: (a) p(x) = 1+x+x^2+x^3 and (b) p(x) = (1+x)(1+x^2)

4.8 PROPOSED ALGORITHM FOR REDUCING THE ITERATION BOUND OF THE LFSR STRUCTURE


As we can see from TABLE I, we can keep multiplying g(x) by short polynomials such as 1+x^k to insert as many delay elements into the right-most feedback loop of g(x) as we need in each iteration, eliminating the fanout bottleneck of the LFSR structure. After this fanout bottleneck is eliminated, our goal is to reduce the iteration bound of the entire LFSR structure, which is now located in the second step. We have seen the advantage of multiplying g(x) by a short polynomial such as 1+x^k: the third step does not create an iteration bound bottleneck, and its latency is minimal. In this section we show its advantage for reducing the iteration bound of the second step, and thus of the whole LFSR structure. We start with the third iteration shown in TABLE I.

After the third iteration, g'(x) no longer has a fanout bottleneck for an unfolding factor of J=8. Its iteration bound is

T_XOR · max{2/9, 3/11, 4/14, 5/15, 6/17, 8/25, 9/26, 10/28, 11/30, 12/31, 13/34, 14/35, 15/38, 16/40, 17/41, 17/42} = 0.4146 T_XOR,

which is located in the 16th feedback loop. If we keep multiplying g'(x) by 1+x^k to reduce its iteration bound, there are 17 possibilities for k, corresponding to the 17 feedback loops in g'(x). After trying all 17 possibilities, we conclude that 1+x^17 reduces the iteration bound of the second step the most, from 0.4146 T_XOR to 0.3621 T_XOR. g''(x) = (1+x^17)g'(x) is given in TABLE II. Based on the discussion of the third step in Section II, 1/(1+x^17) is far from causing an iteration bound bottleneck in the entire LFSR structure, which has been reduced to 0.3621 T_XOR. We can keep multiplying g''(x) by 1+x^k to lower the iteration bound even further. For example, after the 6th iteration shown in TABLE II, the iteration bound of the LFSR structure has been reduced to 0.3369 T_XOR. Note that these optimized iteration bounds are not achieved without extra cost: the number of required XOR gates increases from 2 to 76. Although multiplying g'(x) changes the LFSR structure to obtain a lower iteration bound, the elimination of the fanout bottleneck is maintained, because that elimination is provided by the right-most feedback loop of the second step of the LFSR structure. Multiplying g'(x) by 1+x^k maintains the structure on the right side of the feedback loop corresponding to 1+x^k. This property is illustrated in TABLE II.
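The quoted iteration bound is simply the maximum of the listed loop ratios, which can be recomputed directly (a Python sketch; the (XOR count, delay count) pairs are taken from the list above):

```python
# Sketch: the iteration bound is the largest ratio of loop computation time
# (in XOR delays) to loop register count, taken over all feedback loops.

from fractions import Fraction

loops = [(2, 9), (3, 11), (4, 14), (5, 15), (6, 17), (8, 25), (9, 26),
         (10, 28), (11, 30), (12, 31), (13, 34), (14, 35), (15, 38),
         (16, 40), (17, 41), (17, 42)]
bound = max(Fraction(x, d) for x, d in loops)
assert round(float(bound), 4) == 0.4146     # achieved by the 17/41 loop
```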

From TABLE II, we have p(x) = (1+x)(1+x^4)(1+x^5) and p'(x) = p(x)(1+x^17)(1+x^43)(1+x^86) = (1+x^8)(1+x^5)(1+x^17)(1+x^43)(1+x^86)/[(1+x)(1+x^2)].

Then the improved three-step implementation of BCH(255,223) encoding is shown in Fig. 4.9.

From the discussion so far, we can summarize the proposed high speed VLSI algorithm for general LFSR structures as follows:

1) Iteratively multiply g(x) by short polynomials to insert as many delay elements into the right-most feedback loop of g(x) as needed in each iteration, eliminating the fanout bottleneck of the LFSR structure. The iteration exits when the number of delay elements in g'(x) is not less than the targeted unfolding factor J. The simplest short polynomials we can use are 1+x^k, where k is the degree difference of the two highest-degree terms in g'(x).

Another way to find short polynomials is to partially borrow Algorithm A in [4, Section III]. Instead of obtaining one long polynomial p(x) by applying Algorithm A once, we can apply it multiple times and limit the length of each obtained p(x), until the number of delay elements in g'(x) is not less than the targeted unfolding factor J. Although the first method is guaranteed to find a suitable p(x), the second method may be more hardware efficient, because it can lead to a g'(x) of lower degree. Note that eliminating the fanout bottleneck is not needed when g(x) already provides enough delay elements in its right-most feedback loop.

2) Iteratively multiply g'(x) by the 1+x^k that leads to the smallest iteration bound for the resulting g''(x). For each iteration, the number of possible values of k is the same as the number of feedback loops in the current g''(x). The iteration exits when the desired iteration bound, or the best iteration bound for a given hardware cost requirement, is reached.
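Step 1 can be sketched in a few lines of Python (a toy illustration, not the paper's procedure verbatim; the example g(x) = x^3 + x + 1 and J = 3 are arbitrary choices):

```python
# Sketch of step 1: repeatedly multiply g(x) by 1+x^k, where k is the degree
# gap between the two highest terms of the current g'(x), until the gap is
# at least the unfolding factor J.

def pmul(a, b):                         # GF(2) polynomial product
    r = 0
    while b:
        if b & 1:
            r ^= a
        a <<= 1
        b >>= 1
    return r

def top_gap(g):                         # gap between two highest exponents
    e = [i for i in range(g.bit_length()) if g >> i & 1]
    return e[-1] - e[-2]

def insert_delays(g, J):
    p = 1
    while top_gap(g) < J:
        k = top_gap(g)
        p = pmul(p, (1 << k) | 1)       # accumulate p(x) *= 1 + x^k
        g = pmul(g, (1 << k) | 1)       # g'(x) = g(x)p(x)
    return g, p

g_new, p = insert_delays(0b1011, 3)     # toy g(x) = x^3 + x + 1, J = 3
assert top_gap(g_new) >= 3 and g_new == 0b100111   # x^5 + x^2 + x + 1
```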

Fig 4.9 Improved three-step implementation of BCH(255,223) encoding: (a) first step, (b) second step, (c) third step

4.9 BCH ENCODER ARCHITECTURE

An (n, k) binary BCH code encodes a k-bit message into an n-bit code word. A k-bit message (m_{k-1}, m_{k-2}, ..., m_0) can be considered as the coefficients of a degree k-1 polynomial m(x) = m_{k-1}x^{k-1} + m_{k-2}x^{k-2} + ... + m_0, where m_{k-1}, m_{k-2}, ..., m_0 ∈ GF(2). Meanwhile, the corresponding n-bit code word (c_{n-1}, c_{n-2}, ..., c_0) can be considered as the coefficients of a degree n-1 polynomial c(x) = c_{n-1}x^{n-1} + c_{n-2}x^{n-2} + ... + c_0, where c_{n-1}, c_{n-2}, ..., c_0 ∈ GF(2). The encoding of BCH codes can be simply expressed by c(x) = m(x)g(x), where the degree n-k polynomial g(x) = g_{n-k}x^{n-k} + g_{n-k-1}x^{n-k-1} + ... + g_0 (g_{n-k}, g_{n-k-1}, ..., g_0 ∈ GF(2)) is the generator polynomial of the BCH code. Usually, g_{n-k} = g_0 = 1. However, systematic encoding is generally desired, since the message bits are then directly part of the code word. The systematic encoding can be implemented by:

c(x) = m(x)·x^{n-k} + Rem(m(x)·x^{n-k})_{g(x)}, (1)

where Rem(f(x))_{g(x)} denotes the remainder polynomial of dividing f(x) by g(x). The architecture of a systematic BCH encoder is shown in Fig. 1. During the first k clock cycles, the two switches are connected to the 'a' port, and

the k-bit message is input to the LFSR serially with most significant bit (MSB) first.

Meanwhile, the message bits are also sent to the output to form the systematic part of

the code word. After k clock cycles, the switches are moved to the ’b’ port. At this

point, the n − k registers contain the coefficients of Rem(m(x) ・ xn−k)g(x). The

remainder bits

are then shifted out of the registers to the code word output bit by bit to form the

remaining systematic code word bits. For binary BCH, the multipliers in Fig. 1 can be

replaced by connection or no connection when gi (0 ≤ i < n−k) is ’1’ or ’0’,

respectively. The critical path of this architecture consists of two XOR gates, and the

output of the right-most XOR gate is input to all the other XOR gates. In the case of

long BCH codes, this architecture may suffer from the long delay of the right-most

XOR gate caused by the large fanout. Although the serial architecture of BCH encoder

is quite straight forward, in the case when it can not run as fast as the application

requirements, parallel architectures must be employed. Fanout bottleneck will also

exist in parallel architectures.
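The serial LFSR division described above can be simulated bit by bit (a Python sketch for a toy g(x) = x^3 + x + 1 with g_0 = 1 assumed, not the paper's long BCH code):

```python
# Sketch: bit-serial simulation of the Fig. 1 LFSR computing
# Rem(m(x) x^(n-k))_g(x); the message MSB enters first.

def serial_remainder(msg_bits, g_taps):
    # g_taps[i] is the coefficient g_i for i = 0..n-k-1; g_0 = 1 is assumed,
    # so the feedback bit is fed directly into the left-most register.
    reg = [0] * len(g_taps)                # reg[-1] is the right-most register
    for m in msg_bits:                     # MSB first
        fb = m ^ reg[-1]                   # output of the right-most XOR gate
        reg = [fb] + [reg[i - 1] ^ (fb & g_taps[i])
                      for i in range(1, len(reg))]
    return reg                             # remainder coefficients r_0..r_{n-k-1}

msg = [1, 1, 0, 1]                         # m(x) = x^3 + x^2 + 1, MSB first
rem = serial_remainder(msg, [1, 1, 0])     # g(x) = x^3 + x + 1
# matches direct division: (x^3 + x^2 + 1) x^3 mod (x^3 + x + 1) = 1
assert rem == [1, 0, 0]
```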

4.10 PARALLEL BCH ENCODER WITH ELIMINATED FANOUT BOTTLENECK

In the serial BCH encoder in Fig. 4.10, the effect of large fanout can always be eliminated by retiming [7]. To keep the notation simple, we refer to the input of the right-most XOR gate that is the delayed output of the second XOR gate from the right as the horizontal input (Hinput). In Fig. 1, there is at least one register at the Hinput of the right-most XOR gate. Meanwhile, registers can be added to the message input. Therefore, as shown in Fig. 2, retiming can always be performed along the dotted cutset by removing one register from each input of the right-most XOR gate


FIG 4.10

and adding one to the output. For clarity, switches are removed from the LFSR in Fig. 2 and from the other figures in the remainder of the paper. However, if unfolding is applied directly to Fig. 1, retiming cannot be applied in an obvious way to eliminate the large fanout. The original architecture can be expressed as a data flow graph (DFG): nodes connected by paths with delays. Each XOR gate in the LFSR is a node in the corresponding DFG. In the J-unfolded architecture, there are J copies of each node with the same function as in the original architecture (see Chapter 5 of [4]). However, the total number of delay elements does not change.

4.11 Retimed LFSR

Assuming there is a path from node U to node V in the original architecture with W delay elements, in the J-unfolded architecture node U_i is connected to node V_{(i+W)%J} with ⌊(i+W)/J⌋ delay elements, where U_i, V_j (0 ≤ i, j < J) are copies of nodes U and V, respectively. Therefore, if the unfolding factor J is greater than W, there will be W paths with one delay element and J-W paths without any delay element in the unfolded architecture. For example, Fig. 3(a) shows an LFSR with generator polynomial g(x) = x^3 + x + 1. In this example, there are two registers in the path connecting the output of the left XOR gate and the input of the right XOR gate. In the 3-unfolded architecture illustrated in Fig. 3(b), there are W=2 paths from the outputs of the copies of the left XOR gate to the inputs of the copies of the right XOR gate with one delay, and J-W=1 path without any delay. The unfolded LFSR in Fig. 3(b) cannot be retimed to eliminate the fanout problem for each copy of the right XOR gate.
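The delay-distribution rule can be checked directly (a Python sketch for the J = 3, W = 2 example above):

```python
# Sketch: after J-unfolding, a path U -> V with W delays becomes J paths
# U_i -> V_{(i+W) % J} carrying floor((i+W)/J) delays each, so W paths keep
# one delay and J-W paths have none (for W < J).

J, W = 3, 2                              # the g(x) = x^3 + x + 1 example
edges = [(i, (i + W) % J, (i + W) // J) for i in range(J)]
assert sum(d for _, _, d in edges) == W              # total delays preserved
assert sorted(d for _, _, d in edges) == [0, 1, 1]   # 2 one-delay paths, 1 delay-free
```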

If the generator polynomial can be expressed as:

g(x) = x^{t0} + x^{t1} + ... + x^{t_{l-2}} + 1, (2)

where t0, t1, ..., t_{l-2} are positive integers with t0 > t1 > ... > t_{l-2} and l is the total number of non-zero terms of g(x), then there are t0 - t1 consecutive registers at


the Hinput of the right-most XOR gate in Fig. 1. If a J-unfolded BCH encoder is

desired, t0 − t1 ≥ J needs to be satisfied to ensure that there is at least one register at

the Hinput of each of the J copies of the right-most XOR gate, so that retiming can be

applied to move one register to the output. Meanwhile, J registers need to be added to

the message input to enable retiming.

Fig 4.11 (a) An LFSR example, (b) 3-unfolded version of the LFSR in (a)

In the case of t0 - t1 < J, the generator polynomial needs to be modified to enable retiming of the right-most XOR gate in the J-unfolded architecture. Assuming the original (n, k) BCH code uses a generator polynomial g(x) of degree n-k, the message input m(x) multiplied by x^{n-k} can be written as:

m(x)x^{n-k} = q(x)g(x) + r(x), (3)

where q(x) and r(x) represent the quotient and remainder polynomials of dividing m(x)x^{n-k} by g(x), respectively. Multiplying both sides of (3) by p(x), we get:

m(x)p(x)x^{n-k} = q(x)(g(x)p(x)) + r(x)p(x). (4)

Let g'(x) = p(x)g(x) be expressed as:


CHAPTER-5 BCH CODES

Given a BCH(255,223) code using generator polynomial

g(x) = x^32 + x^31 + x^30 + x^29 + x^27 + x^26 + x^25 + x^22 + x^20 + x^19 + x^17 + x^16 + x^14 + x^9 + x^7 + x^6 + x^5 + x^4 + x^3 + x^2 + 1,

we want to find p(x) such that t0' - t1' ≥ 8 in g'(x).

In this example, E should be set to 8, and a = 32, b = 31 at the beginning of Algorithm

A. The intermediate values after each iteration in Algorithm A are given below:

After iteration I:

˜p(x) = 1 + x^{-1}.
˜g(x) = x^32 + x^28 + x^27 + x^24 + x^22 + x^21 + x^20 + x^18 + x^17 + x^15 + x^14 + x^13 + x^9 + x^8 + x^7 + x + 1 + x^{-1}.
num = 1; b = 28; a - b = 4 < 8; continue.

After iteration II:

˜p(x) = 1 + x^{-1} + x^{-4}.
˜g(x) = x^32 + x^26 + x^25 + x^24 + x^23 + x^20 + x^17 + x^16 + x^14 + x^12 + x^10 + x^9 + x^8 + x^7 + x^5 + x^3 + x^2 + x^{-2} + x^{-4}.
num = 4; b = 26; a - b = 6 < 8; continue.

After iteration III:

˜p(x) = 1 + x^{-1} + x^{-4} + x^{-6}.
˜g(x) = x^32 + x^21 + x^19 + x^17 + x^13 + x^12 + x^11 + x^9 + x^7 + x^5 + x^2 + x + 1 + x^{-1} + x^{-3} + x^{-6}.
num = 6; b = 21; a - b = 11 > 8; stop.

Final step:

p(x) = ˜p(x)x^6 = x^6 + x^5 + x^2 + 1
g'(x) = ˜g(x)x^6 = x^38 + x^27 + x^25 + x^23 + x^19 + x^18 + x^17 + x^15 + x^13 + x^11 + x^8 + x^7 + x^6 + x^5 + x^3 + 1.
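The outcome of Algorithm A can be verified by multiplying out p(x)g(x) (a Python sketch; the exponent lists are taken from the example above):

```python
# Sketch: check that g'(x) = p(x)g(x) with p(x) = x^6 + x^5 + x^2 + 1 has
# its two highest terms at x^38 and x^27, i.e. t0' - t1' = 11 >= 8.

def pmul(a, b):                # carry-less (GF(2)) polynomial product
    r = 0
    while b:
        if b & 1:
            r ^= a
        a <<= 1
        b >>= 1
    return r

g = sum(1 << e for e in
        [0, 2, 3, 4, 5, 6, 7, 9, 14, 16, 17, 19, 20, 22,
         25, 26, 27, 29, 30, 31, 32])
p = sum(1 << e for e in [0, 2, 5, 6])       # p(x) from Algorithm A
gp = pmul(g, p)
exps = [i for i in range(gp.bit_length()) if gp >> i & 1]
assert exps == [0, 3, 5, 6, 7, 8, 11, 13, 15, 17, 18, 19, 23, 25, 27, 38]
assert exps[-1] - exps[-2] == 11            # t0' - t1' = 38 - 27
```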


According to (4), the modified method of finding Rem(m(x)x^{n-k})_{g(x)} in the BCH encoding can be implemented by the steps illustrated in Fig. 4. Each step is explained in the remainder of this section using the g(x), p(x) and g'(x) derived in Example I.

FIG 5.1 Block diagram of the modified BCH encoding

FIG 5.2 Step 1 of the modified BCH encoding

The first step in Fig. 4 is to multiply the message input polynomial by p(x). This can be implemented by adding delayed message inputs according to the coefficients of p(x). For example, using the p(x) derived in Example I, this step can be implemented by the diagram in Fig. 5. The four taps correspond to 1, x^2, x^5 and x^6, respectively, as shown in Fig. 5. After m(x)p(x) is computed, it is fed into the second block to compute Rem(m(x)p(x)x^{n-k})_{g'(x)} using a similar LFSR architecture to that in Fig. 1.

However, since deg(g'(x)) > n-k, the product of p(x) and m(x) should be added to the output of the (n-k)th register from the left, instead of being added to the output of the right-most register. The addition of m(x)p(x) can break the a - b consecutive registers at the Hinput of the right-most XOR gate in the LFSR. The implementation of the second step using the BCH code in Example 1 is illustrated in Fig. 6. As can be observed in Fig. 6, there are 38-27=11 consecutive registers at the Hinput of the right-most XOR gate according to g'(x). However, after adding m(x)p(x) to the output of the 32nd register, only 6 consecutive registers are left. Therefore, at most 6-unfolding can be applied to Fig. 6 without suffering from the large fanout problem. At the end of Algorithm A, deg(g'(x)) = num + deg(g(x)) = n - k + num. Hence, only num consecutive registers are left after adding m(x)p(x), where num ≤ E - 1 at the end of Algorithm A.


FIG 5.3 Step 2 of the modified BCH encoding

FIG 5.4 Step 3 of the modified BCH encoding

Therefore, E is usually set larger than the desired unfolding factor J to ensure num ≥ J at the end of Algorithm A. Alternatively, at the expense of a slight increase in critical path and latency, the delays at the input of the last XOR gate can be retimed and moved to the output of this XOR gate. For example, the 5 delays at the Hinput of the last XOR gate in Fig. 6 can be retimed and moved to the output of this XOR gate. This requires first adding 5 delays to the m(x)p(x) input. The penalty is an increase of the critical path to 2 XOR gates in the serial encoder.

In the third step, Rem(m(x)p(x)x^{n-k})_{g'(x)} needs to be divided by p(x) to get the final result. A similar architecture to that in Fig. 1 can also be used, except that the input data is added at the input of the left-most register, since the input polynomial does not need to be multiplied by any power of x. For example, the third step of the modified BCH encoding using the p(x) derived in Example I is illustrated in Fig. 7.

Unfolding the modified BCH encoder in Fig. 4 by a factor J, a parallel architecture capable of processing J message bits at a time is derived. In the J-unfolded block computing p(x)m(x), no feedback loop exists, so it can be pipelined to achieve the desired clock frequency. In the second block, since the LFSR of the modified generator polynomial g'(x) has at least J registers at the Hinput of the right-most XOR gate, retiming can be applied to the J-unfolded architecture to eliminate the effect of large fanout after adding J registers to the output of m(x)p(x). Although the fanout problem does not exist in the third block in Fig. 4, it can exist in the unfolded architecture. Since the polynomial p(x) enables g'(x) = p(x)g(x) to have consecutive


zero coefficients after the highest power term, the difference of the two highest powers of p(x) is equal to t0 - t1 < J. After J-unfolding is applied, some copies of the right-most XOR gate are connected to l_p - 2 XOR gates, where l_p is the number of non-zero terms in p(x). In the worst case, l_p is at most E + 1 - (t0 - t1), and E is set as small as possible, slightly larger than the unfolding factor J. Usually, the desired unfolding factor is far less than the length of g(x) in long BCH codes. Hence, the delay caused by the fanout of dividing by p(x) is far less than that of dividing by g(x) in the original BCH encoder.

5.1 BCH DECODER ARCHITECTURE

In this section, a parallel BCH decoder is presented. The syndrome-based BCH decoding consists of three major steps [3], as depicted in Fig. 2, where R is the hard decision of the received information from the noisy channel and D is the decoded codeword. S and Λ represent the syndromes of the received polynomial and the error locator polynomial, respectively.

5.2 Syndrome Generator

For t-error-correcting BCH codes, 2t syndromes of the received polynomial are evaluated as follows:

S_j = R(α^j) = Σ_{i=0}^{n-1} R_i (α^j)^i (1)

for 1 ≤ j ≤ 2t. If the 2t conventional syndrome generator units shown in Fig. 3(a) are used at the same time independently, n clock cycles are necessary to compute all 2t syndromes. However, if each syndrome generator unit in Fig. 3(a) is replaced by a parallel syndrome generator unit with parallel factor p, depicted in Fig. 3(b), which can process p bits per clock cycle, only ⌈n/p⌉ clock cycles are sufficient. It is worth noting that for binary BCH codes, even-indexed syndromes are the squares of earlier-indexed syndromes, i.e., S_{2j} = S_j^2. Based on this property, only t parallel syndrome generator units are required to compute the odd-indexed syndromes, followed by a much simpler field squaring circuit to generate the even-indexed syndromes.
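The squaring identity follows from R(x)^2 = R(x^2) over GF(2), so evaluating at α^j gives S_{2j} = S_j^2. It can be checked in a small field (a Python sketch in GF(2^4) with primitive polynomial x^4 + x + 1 and a random received word, chosen only for illustration):

```python
# Sketch: verify S_2j = S_j^2 for a binary received word over GF(2^4).

import random

def gf16_mul(a, b):                       # GF(2^4) multiply, polynomial basis
    r = 0
    while b:
        if b & 1:
            r ^= a
        a <<= 1
        if a & 0b10000:
            a ^= 0b10011                  # reduce by x^4 + x + 1
        b >>= 1
    return r

def gf16_pow(a, n):
    r = 1
    for _ in range(n):
        r = gf16_mul(r, a)
    return r

random.seed(1)
R = [random.randint(0, 1) for _ in range(15)]     # received bits R_0..R_14
alpha = 0b10                                      # alpha = x is primitive

def syndrome(j):                                  # S_j = sum_i R_i (alpha^j)^i
    s = 0
    for i, bit in enumerate(R):
        if bit:
            s ^= gf16_pow(gf16_pow(alpha, j), i)
    return s

for j in (1, 2, 3):
    assert syndrome(2 * j) == gf16_mul(syndrome(j), syndrome(j))
```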

5.3. Key Equation Solver

Either Peterson's or the Berlekamp-Massey (BM) algorithm [3] can be employed to solve the key equation for Λ(x). The inversion-free BM algorithm and its efficient implementations can easily be found in the literature [2][4] and are not considered in this paper.

5.4 . Chien Search

Once Λ(x) is found, the decoder searches for the error locations by checking whether Λ(α^i) = 0 for 0 ≤ i ≤ (n-1).

Fig: Syndrome generator unit: (a) conventional architecture, (b) parallel architecture with parallel factor p

CHAPTER-6

APPENDIX-A


VERILOG HDL

Implementation of the high speed LFSR is done using Verilog HDL.

In the semiconductor and electronic design industry, Verilog is a hardware description language (HDL) used to model electronic systems. Verilog HDL, not to be confused with VHDL, is most commonly used in the design, verification, and implementation of digital logic chips at the register-transfer level (RTL) of abstraction. It is also used in the verification of analog and mixed-signal circuits.

6.1 Overview

Hardware description languages, such as Verilog, differ from software programming

languages because they include ways of describing the propagation of time and signal

dependencies (sensitivity). There are two assignment operators, a blocking assignment (=),

and a non-blocking (<=) assignment. The non-blocking assignment allows designers to

describe a state-machine update without needing to declare and use temporary storage

variables. Since these concepts are part of Verilog's language semantics, designers can quickly write descriptions of large circuits in a relatively compact and concise form.

At the time of Verilog's introduction (1984), Verilog represented a tremendous

productivity improvement for circuit designers who were already using graphical

schematic-capture, and specially-written software programs to document and simulate

electronic circuits.

The designers of Verilog wanted a language with syntax similar to the C programming

language, which was already widely used in engineering software development. Verilog is

case-sensitive, has a basic preprocessor (though less sophisticated than that of ANSI C/C++), equivalent control flow keywords (if/else, for, while, case, etc.), and compatible operator precedence. Syntactic differences include variable declaration (Verilog requires

bit-widths on net/reg types), demarcation of procedural-blocks (begin/end instead of curly

braces {}), and many other minor differences.

A Verilog design consists of a hierarchy of modules. Modules encapsulate design

hierarchy, and communicate with other modules through a set of declared input, output,

and bidirectional ports. Internally, a module can contain any combination of the following:

net/variable declarations (wire, reg, integer, etc.), concurrent and sequential statement


blocks and instances of other modules (sub-hierarchies). Sequential statements are placed

inside a begin/end block and executed in sequential order within the block. But the blocks

themselves are executed concurrently, qualifying Verilog as a Dataflow language.

Verilog's concept of 'wire' consists of both signal values (4-state: "1, 0, floating,

undefined"), and strengths (strong, weak, etc.) This system allows abstract modeling of

shared signal-lines, where multiple sources drive a common net. When a wire has multiple

drivers, the wire's (readable) value is resolved by a function of the source drivers and their

strengths.

A subset of statements in the Verilog language is synthesizable. Verilog modules that conform to a synthesizable coding style, known as RTL (register-transfer level), can be

physically realized by synthesis software. Synthesis-software algorithmically transforms

the (abstract) Verilog source into a netlist, a logically-equivalent description consisting

only of elementary logic primitives (AND, OR, NOT, flipflops, etc.) that are available in a

specific VLSI technology. Further manipulations to the netlist ultimately lead to a circuit

fabrication blueprint (such as a photo mask-set for an ASIC, or a bitstream-file for an

FPGA).

6.2 History, Beginning

Verilog was invented by Phil Moorby and Prabhu Goel during the winter of 1983/1984 at

Automated Integrated Design Systems (later renamed to Gateway Design Automation in

1985) as a hardware modeling language. Gateway Design Automation was later purchased

by Cadence Design Systems in 1990. Cadence now has full proprietary rights to Gateway's Verilog and the Verilog-XL logic simulator.

6.3 Verilog-95

With the increasing success of VHDL at the time, Cadence decided to make the language

available for open standardization. Cadence transferred Verilog into the public domain

under the Open Verilog International (OVI) (now known as Accellera) organization.

Verilog was later submitted to IEEE and became IEEE Standard 1364-1995, commonly

referred to as Verilog-95.


In the same time frame, Cadence initiated the creation of Verilog-A to put standards support behind its analog simulator Spectre. Verilog-A was never intended to be a standalone language and is a subset of Verilog-AMS, which encompasses Verilog-95.

6.4 Verilog 2001

Extensions to Verilog-95 were submitted back to the IEEE to cover the deficiencies users had found in the original Verilog standard. These extensions became IEEE Standard 1364-2001, known as Verilog-2001.

Verilog-2001 is a significant upgrade from Verilog-95. First, it adds explicit support for

(2's complement) signed nets and variables. Previously, code authors had to perform

signed-operations using awkward bit-level manipulations (for example, the carry-out bit of

a simple 8-bit addition required an explicit description of the boolean-algebra to determine

its correct value.) The same function under Verilog-2001 can be more succinctly described

by one of the built-in operators: +, -, /, *, >>>. A generate/endgenerate construct (similar

to VHDL's generate/endgenerate) allows Verilog-2001 to control instance and statement

instantiation through normal decision-operators (case/if/else). Using generate/endgenerate,

Verilog-2001 can instantiate an array of instances, with control over the connectivity of

the individual instances. File I/O has been improved by several new system tasks. Finally, a few syntax additions were introduced to improve code readability (e.g. always @*, named parameter override, C-style function/task/module header declaration).

Verilog-2001 is the dominant flavor of Verilog supported by the majority of commercial

EDA software packages.

6.5 Verilog 2005

Not to be confused with SystemVerilog, Verilog 2005 (IEEE Standard 1364-2005)

consists of minor corrections, spec clarifications, and a few new language features (such as

the uwire keyword.)

A separate part of the Verilog standard, Verilog-AMS, attempts to integrate analog and

mixed signal modelling with traditional Verilog.

6.6 Design Styles


Verilog, like any other hardware description language, permits a design in either a bottom-up or a top-down methodology.

  Bottom-Up Design

The traditional method of electronic design is bottom-up. Each design is performed at the

gate-level using the standard gates (refer to the Digital Section for more details). With the

increasing complexity of new designs this approach is nearly impossible to maintain. New

systems consist of ASIC or microprocessors with a complexity of thousands of transistors.

These traditional bottom-up designs have to give way to new structural, hierarchical

design methods. Without these new practices it would be impossible to handle the new

complexity.

   

  Top-Down Design

The desired design-style of all designers is the top-down one. A real top-down design

allows early testing, easy change of different technologies, a structured system design and

offers many other advantages. But it is very difficult to follow a pure top-down design.

Due to this fact most designs are a mix of both methods, implementing some key elements

of both design styles.    

 


   

The figure shows a top-down design approach.

6.7  Verilog Abstraction Levels

Verilog supports designing at many different levels of abstraction. Three of them are very

important:

Behavioral level

Register-Transfer Level

Gate Level

 Behavioral level

This level describes a system by concurrent algorithms (behavioral). Each algorithm is

sequential, meaning it consists of a set of instructions that are executed one after the

other. Functions, tasks, and always blocks are the main elements. There is no regard for

the structural realization of the design.


 Register-Transfer Level

Designs using the Register-Transfer Level specify the characteristics of a circuit by

operations and the transfer of data between the registers. An explicit clock is used. RTL

design contains exact timing bounds: operations are scheduled to occur at certain times.

The modern definition of RTL code is: "Any code that is synthesizable is called RTL code."

   

Gate Level

At the gate level, the characteristics of a system are described by logical links and

their timing properties. All signals are discrete and can take only definite logical

values ('0', '1', 'X', 'Z'). The usable operations are predefined logic primitives (AND, OR,

NOT, etc.). Writing gate-level models by hand is rarely a good idea at any level of logic

design; gate-level code is generated by tools such as synthesis tools, and this netlist is

used for gate-level simulation and for the backend.

6.8 Introduction

ModelSim is a verification and simulation tool for VHDL, Verilog, SystemVerilog, and

mixed language designs.

6.8.1 Basic Simulation Flow

The following diagram shows the basic steps for simulating a design in ModelSim.

Figure 6.1. Basic Simulation Flow - Overview Lab


In ModelSim, all designs are compiled into a library. You typically start a new simulation

in ModelSim by creating a working library called "work". "Work" is the library name used

by the compiler as the default destination for compiled design units.

Compiling Your Design

After creating the working library, you compile your design units into it. The ModelSim

library format is compatible across all supported platforms. You can simulate your design

on any platform without having to recompile your design.

Loading the Simulator with Your Design and Running the Simulation

With the design

compiled, you load the simulator with your design by invoking the simulator on a top-

level module (Verilog) or a configuration or entity/architecture pair (VHDL). Assuming

the design loads successfully, the simulation time is set to zero, and you enter

a run command to begin simulation.

Debugging Your Results

If you don’t get the results you expect, you can use ModelSim’s robust debugging

environment to track down the cause of the problem.

6.8.2 Project Flow

A project is a collection mechanism for an HDL design under specification or test. Even

though you don’t have to use projects in ModelSim, they may ease interaction with the tool

and are useful for organizing files and specifying simulation settings.

The following diagram shows the basic steps for simulating a design within a ModelSim

project.


Figure 6.2. Project Flow

As you can see, the flow is similar to the basic simulation flow. However, there are two

important differences:

You do not have to create a working library in the project flow; it is done for you

automatically.

Projects are persistent. In other words, they will open every time you invoke ModelSim

unless you specifically close them.

6.8.3 Multiple Library Flow

ModelSim uses libraries in two ways: 1) as a local working library that contains the

compiled version of your design; 2) as a resource library. The contents of your working

library will change as you update your design and recompile. A resource library is

typically static and serves as a parts source for your design. You can create your own

resource libraries, or they may be supplied by another design team or a third party

(e.g., a silicon vendor).

You specify which resource libraries will be used when the design is compiled, and there

are rules to specify in which order they are searched. A common example of using both a working


library and a resource library is one where your gate-level design and testbench are

compiled into the working library, and the design references gate-level models in a

separate resource library.

The diagram below shows the basic steps for simulating with multiple libraries.

Figure 6.3. Multiple Library Flow

6.9 Debugging Tools

ModelSim offers numerous tools for debugging and analyzing your design. Several of

these tools are covered in subsequent lessons, including:

Using projects

Working with multiple libraries

Setting breakpoints and stepping through the source code

Viewing waveforms and measuring time

Viewing and initializing memories

Creating stimulus with the Waveform Editor

Automating simulation


6.10 Basic Simulation

Figure 3-1. Basic Simulation Flow - Simulation Lab

Design Files for this Lesson

The sample design for this lesson is a simple 8-bit, binary up-counter with an associated

testbench. The pathnames are as follows:

Verilog – <install_dir>/examples/tutorials/verilog/basicSimulation/counter.v and

tcounter.v

VHDL – <install_dir>/examples/tutorials/vhdl/basicSimulation/counter.vhd and

tcounter.vhd

This lesson uses the Verilog files counter.v and tcounter.v. If you have a VHDL license,

use counter.vhd and tcounter.vhd instead. Or, if you have a mixed license, feel free to

use the Verilog testbench with the VHDL counter or vice versa.

6.10.1 Create the Working Design Library

Before you can simulate a design, you must first create a library and compile the source

code into that library.

1. Create a new directory and copy the design files for this lesson into it.


Start by creating a new directory for this exercise (in case other users will be working with

these lessons).

Verilog: Copy counter.v and tcounter.v files from

/<install_dir>/examples/tutorials/verilog/basicSimulation to the new directory.

VHDL: Copy counter.vhd and tcounter.vhd files from

/<install_dir>/examples/tutorials/vhdl/basicSimulation to the new directory.

2. Start ModelSim if necessary.

a. Type vsim at a UNIX shell prompt or use the ModelSim icon in Windows. Upon

opening ModelSim for the first time, you will see the Welcome to ModelSim dialog. Click

Close.

b. Select File > Change Directory and change to the directory you created in step 1.

3. Create the working library.

a. Select File > New > Library.

This opens a dialog where you specify physical and logical names for the library (Figure

3-2). You can create a new library or map to an existing library. We’ll be doing the

former.

Figure 6.4. The Create a New Library Dialog


b. Type work in the Library Name field (if it isn’t already entered automatically).

c. Click OK.

ModelSim creates a directory called work and writes a specially-formatted file named

_info into that directory. The _info file must remain in the directory to distinguish it as a

ModelSim library. Do not edit the folder contents from your operating system; all changes

should be made from within ModelSim. ModelSim also adds the library to the list in the

Workspace (Figure 3-3) and records the library mapping for future reference in the

ModelSim initialization file (modelsim.ini).

Figure 6.5. work Library in the Workspace

When you pressed OK in step 3c above, the following was printed to the Transcript:

vlib work

vmap work work

These two lines are the command-line equivalents of the menu selections you made. Many

command-line equivalents will echo their menu-driven functions in this fashion.

6.11 Compile the Design

With the working library created, you are ready to compile your source files.

You can compile by using the menus and dialogs of the graphic interface, as in the Verilog

example below, or by entering a command at the ModelSim> prompt.

1. Compile counter.v and tcounter.v.

a. Select Compile > Compile. This opens the Compile Source Files dialog (Figure 3-4).


If the Compile menu option is not available, you probably have a project open. If so, close

the project by making the Workspace pane active and selecting File > Close from the

menus.

b. Select both counter.v and tcounter.v modules from the Compile Source Files dialog and

click Compile. The files are compiled into the work library.

c. When compile is finished, click Done.

Figure 6.5. Compile Source Files Dialog

2. View the compiled design units.

a. On the Library tab, click the ’+’ icon next to the work library and you will see two

design units (Figure 3-5). You can also see their types (Modules, Entities, etc.) and the

path to the underlying source files (scroll to the right if necessary).

b. Double-click test_counter to load the design.

You can also load the design by selecting Simulate > Start Simulation in the menu bar.

This opens the Start Simulation dialog. With the Design tab selected, click the ’+’ sign

next to the work library to see the counter and test_counter modules. Select

the test_counter module and click OK (Figure 3-6).

Figure 6.6 . Loading Design with Start Simulation Dialog


When the design is loaded, you will see a new tab in the Workspace named sim that

displays the hierarchical structure of the design (Figure 3-7). You can navigate within the

hierarchy by clicking on any line with a ’+’ (expand) or ’-’ (contract) icon. You will also

see a tab named Files that displays all files included in the design.

Figure 6.7. VHDL Modules Compiled into work Library

6.12 Load the Design

1. Load the test_counter module into the simulator.

a. In the Workspace, click the ‘+’ sign next to the work library to show the files contained

there.

Figure 6.8. Workspace sim Tab Displays Design Hierarchy


2. View design objects in the Objects pane.

a. Open the View menu and select Objects. The command line equivalent is: view objects

The Objects pane (Figure 3-8) shows the names and current values of data objects in the

current region (selected in the Workspace). Data objects include signals, nets, registers,

constants, variables not declared in a process, generics, and parameters.

Figure 6.9 Object Pane Displays Design Objects

You may open other windows and panes with the View menu or with the view command.

See Navigating the Interface.

6.13 Run the Simulation

Now you will open the Wave window, add signals to it, then run the simulation.

1. Open the Wave debugging window.

a. Enter view wave at the command line.

You can also use the View > Wave menu selection to open a Wave window.

The Wave window is one of several windows available for debugging. To see a list

of the other debugging windows, select the View menu. You may need to move or

resize the windows to your liking. Window panes within the Main window can be

zoomed to occupy the entire Main window or undocked to stand alone. For details,


see Navigating the Interface.

2. Add signals to the Wave window.

a. In the Workspace pane, select the sim tab.

b. Right-click test_counter to open a popup context menu.

c. Select Add > To Wave > All items in region (Figure 3-9).

All signals in the design are added to the Wave window.

Figure 6.10. Using the Popup Menu to Add Signals to Wave Window

3. Run the simulation.

a. Click the Run icon in the Main or Wave window toolbar.

The simulation runs for 100 ns (the default simulation length) and waves are

drawn in the Wave window.

b. Enter run 500 at the VSIM> prompt in the Main window.


The simulation advances another 500 ns for a total of 600 ns (Figure 3-10).

Figure 6.11. Waves Drawn in Wave Window

c. Click the Run -All icon on the Main or Wave window toolbar.

The simulation continues running until you execute a break command or it

hits a statement in your code (e.g., a Verilog $stop statement) that halts the

simulation.

d. Click the Break icon. The simulation stops running.

6.14 Xilinx Design Flow

The first step involved in implementing a design on an FPGA is the system

specification. Specifications refer to the kinds of inputs and outputs and the range of

values that the kit can accept. After system specification, the next step is the

architecture. The architecture describes the interconnections between all the blocks

involved in the design. Each block in the architecture, along with its interconnections,

is modeled in either VHDL or Verilog, depending on ease. All these blocks are then

simulated and the outputs are verified for correct functioning.


Figure 6.12 Xilinx Implementation Design Flow-Chart.

After the simulation step, the next step is synthesis. This is a very important step

in knowing whether our design can be implemented on an FPGA kit or not. Synthesis

converts our VHDL code into its functional components, which are vendor specific. After

synthesis, the RTL schematic and technology schematic are generated, along with the

timing delays. These timing delays will be present in the FPGA if the design is

implemented on it. Place & Route is the next step, in which the tool places all the

components on the FPGA die for optimum performance in terms of both area and speed.

We also see the interconnections which will be made in this part of the implementation

flow.

In the post-place-and-route simulation step, the delays that will be involved on the

FPGA kit are considered by the tool, and simulation is performed taking these delays

into consideration. Delays here mean electrical loading effects, wiring delays, and stray

capacitances.

After post place and route comes generation of the bit-map file, which means

converting the VHDL code into a bit stream used to configure the FPGA kit. A bit file

is generated when this step is performed. After this comes the final step of downloading

the bit-map file onto the FPGA board, which is done by connecting the computer to the

FPGA board with a JTAG (Joint Test Action Group) cable, an IEEE standard. The bit-map

file contains the whole design as placed on the FPGA die, and the outputs can now be

observed on the FPGA LEDs. This step completes the whole process of implementing our

design on an FPGA.


6.15 Xilinx ISE 10.1 Software

6.15.1 Introduction

Xilinx ISE (Integrated Software Environment) 10.1 software is from the Xilinx

company and is used to design any digital circuit and implement it onto a Spartan-3E

FPGA device. Xilinx ISE 10.1 software is used to design the application, verify its

functionality, and finally download the design onto a Spartan-3E FPGA device.

6.15.2 Xilinx ISE 10.1 Software Tools

SIMULATION: ISE (Integrated Software Environment) Simulator

SYNTHESIS, PLACE & ROUTE: XST (Xilinx Synthesis Technology) Synthesizer

6.15.3 Design Steps Using Xilinx ISE 10.1

1. Create an ISE project for the particular embedded system application.

2. Write the assembly code in Notepad or WordPad and generate the Verilog or VHDL

module by making use of the assembler.

3. Check the syntax of the design.

4. Create a Verilog test fixture for the design.

5. Simulate the test bench waveform (behavioral simulation) for functional

verification of the design using the ISE simulator.

6. Synthesize and implement the top-level module using the XST synthesizer.


CHAPTER-7

SIMULATION RESULTS

Serial Implementation: 1 + y + y^3 + y^5
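The serial design updates one bit per clock cycle. As a behavioral sketch (a Python model for illustration, not the synthesized Verilog; the tap positions chosen for 1 + y + y^3 + y^5 and the left-shift direction are conventions assumed here), one clock can be modeled as:

```python
def lfsr_serial_step(state, n=5, taps=(0, 1, 3)):
    """One clock of a 5-stage Fibonacci LFSR for p(y) = 1 + y + y^3 + y^5.

    `state` is an n-bit integer; bit i models the register holding y^i.
    The taps (0, 1, 3) mirror the nonzero coefficients of p(y) below y^5
    -- an assumed convention, not taken from the report's HDL.
    """
    fb = 0
    for t in taps:
        fb ^= (state >> t) & 1                   # XOR together the tapped bits
    return ((state << 1) | fb) & ((1 << n) - 1)  # shift left, feed back one bit
```

Starting from seed 1, successive calls give 3, 6, 13, …: exactly one new bit enters per clock, which is the serial throughput bottleneck that the parallel architecture removes.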


Synthesis Results:

Fig 1: HDL Synthesis Report

Macro Statistics

# Registers : 5

1-bit register : 5

# Xors : 3

1-bit xor2 : 3

Design Statistics

# IOs : 8

Cell Usage :

# BELS : 3

# LUT2 : 1

# LUT3 : 2

# FlipFlops/Latches : 5

# FDC : 5

# Clock Buffers : 1


# BUFGP : 1

# IO Buffers : 7

# IBUF : 2

# OBUF : 5

Device utilization summary:

---------------------------

Selected Device : XC3S500E

Number of Slices: 3 out of 960 0%

Number of Slice Flip Flops: 5 out of 1920 0%

Number of 4 input LUTs: 3 out of 1920 0%

Number of IOs: 8

Number of bonded IOBs: 8 out of 108 7%

Number of GCLKs: 1 out of 2 4%

Timing Summary:

Minimum period: 2.269ns (Maximum Frequency: 440.723MHz)

Minimum input arrival time before clock: 2.936ns

Maximum output required time after clock: 4.450ns

Parallel Implementation: 1 + y + y^3 + y^5
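The parallel design computes several serial iterations per clock. The sketch below (again an illustrative Python model with the same assumed tap convention, not the synthesized HDL) unrolls p serial steps into one combinational update; in hardware the loop body collapses into a single layer of XOR logic clocked once per p bits:

```python
def lfsr_parallel_step(state, p, n=5, taps=(0, 1, 3)):
    """Advance the LFSR of p(y) = 1 + y + y^3 + y^5 by p bits at once.

    The p unrolled iterations below become one block of XOR gates in
    hardware, so the registers are clocked once per p bits instead of
    once per bit -- the look-ahead idea behind parallel LFSRs.
    """
    for _ in range(p):                           # unrolled serial recursion
        fb = 0
        for t in taps:
            fb ^= (state >> t) & 1
        state = ((state << 1) | fb) & ((1 << n) - 1)
    return state
```

With p = 3 and seed 1, a single parallel step reaches the same state (13) that three serial clocks produce, while the clock now advances three bit positions at a time.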


HDL Synthesis Report

Macro Statistics

# Registers : 5

1-bit register : 5

# Xors : 9

1-bit xor2 : 9

Design Statistics

# IOs : 11

Cell Usage :

# BELS : 6

# LUT3 : 3

# LUT4 : 3

# FlipFlops/Latches : 5

# FDC : 5

# Clock Buffers : 1

# BUFGP : 1

# IO Buffers : 9

# IBUF : 4

# OBUF : 5


Device utilization summary:

---------------------------

Selected Device: XC3S500E

Number of Slices: 3 out of 960 0%

Number of Slice Flip Flops: 5 out of 1920 0%

Number of 4 input LUTs: 6 out of 1920 0%

Number of IOs: 11

Number of bonded IOBs: 10 out of 108 9%

Number of GCLKs: 1 out of 2 4%

Timing Summary:

---------------

Minimum period: 2.269ns (Maximum Frequency: 440.723MHz)

Minimum input arrival time before clock: 4.235ns

Maximum output required time after clock: 4.450ns

ADVANTAGES:

Reduced power dissipation

Higher throughput rate

Higher processing speed


Fast computation

An LFSR can rapidly transmit a sequence that indicates high-precision relative time offsets.

APPLICATIONS:

Pattern generators

Built-In Self-Test (BIST)

Encryption

LFSRs can be used for generating pseudo-random numbers, pseudo-noise

sequences, fast digital counters, and whitening sequences.

Pseudo-random bit sequences (PRBS)
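The pseudo-random applications rest on the maximal-length property: with a primitive feedback polynomial, an n-stage LFSR cycles through all 2^n − 1 nonzero states before repeating. A small sketch (using the primitive polynomial x^5 + x^2 + 1, chosen purely for illustration — the report's 1 + y + y^3 + y^5 is divisible by 1 + y and therefore not maximal length):

```python
def lfsr_cycle(seed=1, n=5, taps=(4, 2)):
    """Return the list of states a 5-stage Fibonacci LFSR visits before
    the seed recurs.  Taps (4, 2) with a left shift realize the primitive
    polynomial x^5 + x^2 + 1 (an illustrative choice, not the report's
    feedback polynomial), giving the maximal period 2^5 - 1 = 31.
    """
    state, visited = seed, []
    while True:
        visited.append(state)
        fb = ((state >> taps[0]) ^ (state >> taps[1])) & 1
        state = ((state << 1) | fb) & ((1 << n) - 1)
        if state == seed:                        # cycle closed
            return visited
```

Every nonzero 5-bit state appears exactly once in the 31-state cycle, which is what makes the output bit stream useful as a pseudo-random, pseudo-noise, or whitening sequence.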

CONCLUSION

Efficient high-speed parallel LFSR structures must address two important issues: the large

fan-out bottleneck and the iteration-bound bottleneck. Three-step LFSR architectures

provide the flexibility to handle these two issues. The key point is the

construction of p(x) and p'(x), which are used to address the fan-out bottleneck

and the iteration-bound bottleneck, respectively, as shown in this paper. These two issues

can be better solved by choosing p(x) and p'(x) that can be further decomposed into


small-length polynomials, because small-length polynomials can be easily handled by

the proposed look-ahead pipelining algorithms. Higher processing speed and hardware

efficiency can be achieved by using this approach.
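The look-ahead principle behind these architectures can be stated in linear-algebra terms: the serial update is a fixed matrix A over GF(2), and a p-parallel design precomputes A^p so each clock performs one p-step jump. The sketch below models only this principle (with the tap convention for 1 + y + y^3 + y^5 assumed as before), not the paper's pipelined construction of p(x) and p'(x):

```python
def companion(n=5, taps=(0, 1, 3)):
    """Rows of the GF(2) state-update matrix A, each row stored as an
    n-bit mask, so that s(t+1) = A * s(t).  Row 0 XORs the tapped bits
    (the feedback); row i copies bit i-1 (the shift)."""
    return [sum(1 << t for t in taps)] + [1 << (i - 1) for i in range(1, n)]

def matvec(rows, s):
    """GF(2) matrix-vector product: output bit i = parity(row_i AND s)."""
    return sum((bin(r & s).count("1") & 1) << i for i, r in enumerate(rows))

def matpow(rows, p, n=5):
    """Rows of A^p, built column by column.  In hardware A^p is a fixed
    matrix, so each of its rows becomes one constant XOR tree."""
    cols = []
    for j in range(n):                           # image of each basis vector
        v = 1 << j
        for _ in range(p):
            v = matvec(rows, v)
        cols.append(v)
    # transpose the columns back into row masks
    return [sum(((cols[j] >> i) & 1) << j for j in range(n)) for i in range(n)]
```

With A = companion(), evaluating matvec(matpow(A, 3), s) advances the register three bits in one step; the per-clock cost is the depth of the widest XOR tree in A^p, which is what the decomposable choices of p(x) and p'(x) keep small.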

REFERENCES

[1] T.-B. Pei and C. Zukowski, "High-Speed Parallel CRC Circuits in VLSI," IEEE Transactions on Communications, vol. 40, no. 4, pp. 653-657, Apr. 1992.

[2] G. Campobello, G. Patané, and M. Russo, "Parallel CRC Realization," IEEE Transactions on Computers, vol. 52, no. 10, Oct. 2003.

[3] C. Cheng and K. K. Parhi, "High-Speed Parallel CRC Implementation Based on Unfolding, Pipelining, and Retiming," IEEE Trans. Circuits Syst. II, Exp. Briefs, vol. 53, no. 10, pp. 1017-1021, Oct. 2006.


[4] K. K. Parhi, "Eliminating the Fan-Out Bottleneck in Parallel Long BCH Encoders," IEEE Trans. Circuits Syst. I, Reg. Papers, vol. 51, no. 3, pp. 512-516, Mar. 2004.

[5] X. Zhang and K. K. Parhi, "High-Speed Architectures for Parallel Long BCH Encoders," in Proc. ACM Great Lakes Symp. VLSI, Boston, MA, pp. 1-6, Apr. 2004.

[6] K. K. Parhi, VLSI Digital Signal Processing Systems: Design and Implementation. Hoboken, NJ: Wiley, 1999.

[7] T. V. Ramabadran and S. S. Gaitonde, "A Tutorial on CRC Computations," IEEE Micro, vol. 8, no. 4, pp. 62-75, Aug. 1988.
