
VHDL Implementation and Performance Analysis of two Division Algorithms

by

Salman Khan

B.S., Sir Syed University of Engineering and Technology, 2010

A Thesis Submitted in Partial Fulfillment of the

Requirements for the Degree of

Master of Applied Science

in the Department of Electrical and Computer Engineering

© Salman Khan, 2015

University of Victoria

All rights reserved. This thesis may not be reproduced in whole or in part, by

photocopying or other means, without the permission of the author.


Supervisory Committee

Dr. Fayez Gebali, Supervisor

(Department of Electrical and Computer Engineering)

Dr. Atef Ibrahim, Member

(Department of Electrical and Computer Engineering)


ABSTRACT

Division is one of the most fundamental arithmetic operations and is used extensively in engineering, scientific, mathematical and cryptographic applications. Implementing an arithmetic operation such as division is complex and expensive in hardware. Unlike addition and subtraction, division requires several iterative computational steps on the given operands to produce the result. In the past, division has often been perceived as an infrequently used operation and has received less attention, yet it is one of the most difficult operations in computer arithmetic. The technique used to implement such an iterative computation in hardware affects the speed, area and power of the digital circuit. For this reason, we consider two division algorithms that differ in their shift step size. Algorithm 1 operates on a fixed shift step size and has a fixed number of iterations, while Algorithm 2 operates on a variable shift step size and requires considerably fewer iterations. This thesis provides a technique to save power and speed up the overall computation. It also looks at different design goal strategies and presents a comparative study to assess how each of the two designs performs in terms of area, delay and power consumption.

Contents

Supervisory Committee
Abstract
Table of Contents
List of Tables
List of Figures
Acknowledgements
Dedication

1 Introduction
  1.1 Overview
  1.2 Motivation for this work
  1.3 Contributions
  1.4 Thesis Organization

2 Division Background
  2.1 Division Fundamentals
  2.2 Division Algorithms Classes
    2.2.1 Digit Recurrence Algorithms
    2.2.2 Functional Iteration Algorithms
    2.2.3 Very High Radix Algorithms
    2.2.4 Look-up Tables
    2.2.5 Variable Latency Algorithms
  2.3 Related work in the area
  2.4 Chapter Summary

3 Considered Division Algorithms
  3.1 Division Approach
    3.1.1 Reasons For Considerations
    3.1.2 Overview of Operation
  3.2 Division Algorithm 1 : Fixed Shift Algorithm
    3.2.1 Mode 1 : Range reduction of Y
    3.2.2 Mode 2 : Post processing of Y and Z
  3.3 Division Algorithm 2 : Adaptive Shift Algorithm
    3.3.1 Mode 1 : Range reduction of Y
    3.3.2 Mode 2 : Post processing of Y and Z
  3.4 Chapter Summary

4 Design and Implementation
  4.1 Hardware entities for Algorithm 1
    4.1.1 X, Y and Z Registers
    4.1.2 Data Multiplexer
    4.1.3 Comparator for Y
    4.1.4 The Look-up table
    4.1.5 The ALU unit
    4.1.6 Counter
    4.1.7 Finite State Machine
    4.1.8 FSM : State transition diagram
  4.2 Hardware entities for Algorithm 2
    4.2.1 Delta Address Generator
    4.2.2 DAG Implementation
    4.2.3 Finite State Machine
    4.2.4 FSM : State transition diagram
  4.3 Circuit Implementations
    4.3.1 Algorithm 1 : Fixed Shift division algorithm
    4.3.2 DAG overall layout
    4.3.3 Algorithm 2: Adaptive Shift division algorithm
  4.4 Chapter summary

5 Results and Evaluation
  5.1 Numerical Simulation using MATLAB
    5.1.1 Numerical Simulation of Algorithm 1
    5.1.2 Numerical Simulation of Algorithm 2
  5.2 Hardware Simulation
    5.2.1 VHDL Simulation of Algorithm 1
    5.2.2 VHDL Simulation of Algorithm 2
  5.3 Performance Evaluation
    5.3.1 Device Utilization
    5.3.2 Timing Analysis
    5.3.3 Power Consumption
    5.3.4 Power-Delay Product
    5.3.5 Area-Delay Product
  5.4 Comparison of Work in Related Area
  5.5 Chapter Summary

6 Conclusion, Contributions and Future Work
  6.1 Conclusion
  6.2 Contributions
  6.3 Future Work

Bibliography

7 Additional Information
  7.1 Interpretation of signals

8 Used Terms and Acronyms

List of Tables

Table 4.1 Truth Table when Y is positive
Table 4.2 Truth Table when Y is negative
Table 5.1 Iterations for Algorithm 1
Table 5.2 Iterations for Algorithm 2
Table 5.3 On-chip device utilization of Algorithm 1
Table 5.4 On-chip device utilization of Algorithm 2
Table 5.5 Timing Summary of Algorithm 1
Table 5.6 Timing Summary of Algorithm 2
Table 5.7 On-chip power consumptions
Table 5.8 Power-delay product for Algorithm 1 and 2
Table 5.9 Area-delay product for Algorithm 1 and 2
Table 5.10 Summary of related work in the area

List of Figures

Figure 2.1 Nonzero bits of X and Y at the start of division
Figure 2.2 Nonzero bits of X and Y at the end of division
Figure 4.1 Algorithm 1 system level
Figure 4.2 Registers X, Y and Z in the bank
Figure 4.3 Data multiplexer for register bank
Figure 4.4 Comparator for Y
Figure 4.5 LUT block
Figure 4.6 ALU block
Figure 4.7 Logical operation of ALU during ith iteration
Figure 4.8 Counter block diagram
Figure 4.9 Finite State Machine block
Figure 4.10 State transition diagram for Algorithm 1
Figure 4.11 Algorithm 2 system level
Figure 4.12 Delta (δ) Address Generator
Figure 4.13 DAG system level
Figure 4.14 Position finder unit block
Figure 4.15 Multiplexer for flag input
Figure 4.16 Multiplexer for data input
Figure 4.17 The Px Register
Figure 4.18 The number subtractor block in DAG
Figure 4.19 2-bits scan unit 0 in level 1
Figure 4.20 Hierarchical approach between level 1 and 2
Figure 4.21 Hierarchical arrangement of position finder unit
Figure 4.22 Finite State Machine block
Figure 4.23 State transition diagram for Algorithm 2
Figure 4.24 Top level block of fixed shift division algorithm
Figure 4.25 Fixed shift division algorithm RTL schematic
Figure 4.26 Delta address generator RTL schematic
Figure 4.27 Top level block of adaptive shift division algorithm
Figure 4.28 Adaptive shift division algorithm RTL schematic
Figure 5.1 All iterations for Algorithm 1
Figure 5.2 Iterations 0 to 2 for Algorithm 1
Figure 5.3 Iterations 3 to 7 for Algorithm 1
Figure 5.4 Iterations 8 to 11 for Algorithm 1
Figure 5.5 Iterations 12 to 14 for Algorithm 1
Figure 5.6 All iterations for Algorithm 2


ACKNOWLEDGMENTS

In the name of Allah, the Most Gracious and the Most Merciful.

All praise belongs to Allah the Merciful for His guidance and blessings, which enabled me to complete this thesis. I would like to thank:

My parents, for their prayers, love, patience, emotional support and constant motivation, and for their assurance in difficult and frustrating moments. Despite financial constraints, they were always ready to support me financially.

My Supervisor, Dr. Fayez Gebali, for all the mentoring and support that enabled me to achieve my academic and research objectives, for helping me cope with problems outside school and settle in as an international student, and for sharing his ideas, concepts and experiences. It would not have been possible to complete my research without his invaluable guidance.

My Committee, Dr. Atef Ibrahim, for devoting precious time and providing valuable suggestions to improve the quality of the thesis.

My Manager at BC Hydro, Djordje Atanackovic, for his encouragement and support, which helped me focus on completing my thesis.

UVIC ECE Dept admin and lab staff, Kevin Jones, Janice Closson, Paul Fedrigo and Brent Sirna, for assisting me during the course of my degree.


DEDICATION

To my father, Muhammad Khalid Zahid, and my mother, Imtiaz Khalid, for their lifelong dream of seeing me achieve my graduate qualification at a world-class foreign institution. In difficult times, it proved to be a key motivating factor and enabled me to maintain focus.

To my grandmother, Rasool Fatima, for her countless prayers and for believing in me.

To my Supervisor, Dr. Fayez Gebali, one of the most knowledgeable, kind and helpful people I have known. I wish him the best of health.


Chapter 1

Introduction

1.1 Overview

Implementing mathematical algorithms, such as those required by a random number generator (RNG), requires complex and expensive arithmetic operations like division and multiplication, as well as iterative computations on the given inputs to obtain the required output. The techniques used to implement these operations and iterations in hardware significantly impact the speed, area and power of the hardware. The division of two integers, the dividend and the divisor, results in an integer quotient and an integer remainder. Integer division is one of the most fundamental arithmetic operations and is heavily used in engineering, scientific, mathematical and statistical computations. Implementing and performing division in hardware is complex and expensive, and it consumes more power than addition and subtraction. According to [1], division is the most difficult operation in computer arithmetic, and it is commonly perceived as an infrequently used operation whose implementation does not receive much attention. Division in modern microprocessors takes many clock cycles; furthermore, the number of clock cycles required for integer division depends on the operand values [2], so larger integer operands require more clock cycles. The more clock cycles or iterations the divider needs, the more power it consumes and the slower it operates. Ignoring the implementation of division has been shown to result in significant system performance degradation [3]. In applications that employ division, an efficient hardware divider can significantly improve the overall performance of the system, so it is imperative to find the best method of implementing the division algorithm in hardware. A divider with lower heat dissipation is also desirable in terms of performance and security.

1.2 Motivation for this work

This work is part of ongoing research on the design, development and implementation of a low-power pseudo random number generator (PRNG). It focuses on the implementation and performance analysis of division algorithms to be incorporated in the PRNG. These division algorithms are implemented as co-processor designs that the PRNG will later use to implement the mathematical algorithm that generates the random numbers. Although the implementation of the overall PRNG is beyond the scope of this work, the targeted PRNG is based on the Park-Miller algorithm, a fairly popular choice for random number generation. The algorithm requires an initial seed value, a special prime number, a quotient and a remainder to generate a random number [4]. Two hardware divider designs are considered and implemented to generate the quotient and remainder that the Park-Miller algorithm needs, so that the PRNG can generate the random number.
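As an illustration of where a quotient and a remainder enter the Park-Miller generator, the following Python sketch uses the widely published "minimal standard" constants (multiplier 16807, modulus 2^31 − 1) together with Schrage's decomposition. This section of the thesis does not list these constants or this formulation, so the names M, A, Q, R and park_miller_next are assumptions made for illustration only.

```python
# Assumed "minimal standard" Park-Miller parameters (not stated in the thesis text).
M = 2**31 - 1      # prime modulus, 2147483647
A = 16807          # multiplier
Q = M // A         # quotient of the modulus by the multiplier
R = M % A          # remainder of the modulus by the multiplier


def park_miller_next(seed):
    """Return (A * seed) mod M using Schrage's decomposition.

    The divmod call is the integer division that a hardware divider
    would have to supply: it yields a quotient and a remainder.
    """
    if not 0 < seed < M:
        raise ValueError("seed must satisfy 0 < seed < M")
    hi, lo = divmod(seed, Q)
    t = A * lo - R * hi          # stays within (-M, M), so 32-bit friendly
    return t if t > 0 else t + M


if __name__ == "__main__":
    s = 1
    for _ in range(5):
        s = park_miller_next(s)
        print(s)
```

The decomposition avoids forming the full product A * seed, which is why the generator needs a division step at all; whether the targeted PRNG uses exactly this form is an assumption.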

The hardware for 32-bit integer division is based on the digit-recurrence, non-restoring division algorithm. The divider designs are later analyzed for their performance and for their impact on the parameters that matter when choosing a divider for an application. There has been a good deal of work on hardware dividers, particularly on higher-radix and floating-point implementations. Most researchers compare the overall dividers in terms of speed and area, while the implementation methodology, and how changes in implementation affect performance (especially power consumption) in fixed-point integer division, has not been explained clearly. This motivated us to study the two dividers and determine which is best suited to low-power, high-speed or low-cost implementation. Another motivation was to arrive at a simplified design approach that allows new designers and researchers to understand and re-implement integer division in hardware. From an academic and learning point of view, this work builds an understanding of iterative algorithms, their design and implementation, and state machine synchronization, which are useful skills for anyone learning practical hardware design.

1.3 Contributions

Two division algorithms based on digit-recurrence, non-restoring division are considered and implemented. The first is called the “fixed shift division algorithm” and the second the “adaptive shift algorithm”. The second algorithm improves on the first in terms of performance. The contributions of this work are as follows:

1. Designed and implemented two signed integer division algorithms for performing the division operation in hardware.

2. Verified the hardware design by developing MATLAB code to confirm the correctness and accuracy of the hardware implemented in VHDL.

3. Compared the performance of the two division algorithms in terms of device utilization (area), power consumption and timing analysis (delay).

4. Adapted the high-radix technique proposed in [5] for floating-point arithmetic to integer arithmetic.

Our work will help designers choose a division implementation for an application-specific purpose. If the application demands high-speed or low-power computation, as in RNGs and cryptographic processors, the adaptive shift algorithm is the preferred choice, whereas in applications with area and cost constraints, such as smart cards, the fixed shift algorithm is better suited.

1.4 Thesis Organization

This section outlines the organization of the thesis and gives the reader a brief summary of the main focus of each chapter.

Chapter 1 introduces the reader to the subject and scope of the research. It discusses the motivation for the research and its contributions, which were the fundamental objectives of the thesis.

Chapter 2 describes the background and fundamentals of division in hardware. A brief classification of division algorithms is provided to help the reader understand the related previous work in the area.

Chapter 3 describes our approach to the division operation. The two considered division algorithms, the fixed shift division algorithm and the adaptive shift division algorithm, are presented, and the methodology they use to reach the correct result of the division is explained.

Chapter 4 describes the hardware design and implementation. It explains the hardware entities that are common to both algorithms as well as those specific to each, and presents the circuit implementations of both algorithms.

Chapter 5 contains the results and evaluation of the two algorithms. Numerical simulation results verify that the algorithms work, and hardware simulation results (in VHDL) confirm that the two algorithms have been implemented correctly. The performance analysis of the two algorithms is also conducted in this chapter.

Chapter 6 contains the concluding statements and a short description of the work and what it achieved.


Chapter 2

Division Background

2.1 Division Fundamentals

Number division has been treated by many authors, for example in [6][7][8]. The fundamental principle is that the division of a dividend by a divisor can be realized in cycles of shifting and adding (in practice, subtracting), with the loop under hardware or software control, so that the hardware divider converges iteratively on the correct result.

In this thesis, we refer to Y as the dividend and X as the divisor. We wish to divide the integer Y by a positive integer X. The result of this division should be two integers, the quotient and the remainder, denoted by q and r respectively, such that the following equation is satisfied:

Y = qX + r (2.1)

q and r can be expressed as:

q = ⌊Y/X⌋ (2.2)

0 ≤ r < X (2.3)

The floor in (2.2) rounds Y/X down to the nearest integer, and this whole number is the quotient q. The fractional part that the floor discards, multiplied by X, is the remainder r. Using this concept we can rewrite:

r = Y − qX (2.4)
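A small worked example may help fix equations (2.1) to (2.4). The Python sketch below is illustrative only, and the function name floor_divide is hypothetical. It relies on Python's // operator, which rounds toward negative infinity and therefore matches the floor in (2.2) for negative dividends as well.

```python
def floor_divide(y, x):
    """Return (q, r) such that y == q*x + r and 0 <= r < x, for x > 0."""
    if x <= 0:
        raise ValueError("the divisor X must be a positive integer")
    q = y // x        # equation (2.2): floor of Y/X
    r = y - q * x     # equation (2.4)
    return q, r


if __name__ == "__main__":
    for y in (17, -17):
        q, r = floor_divide(y, 5)
        # Equation (2.1) and condition (2.3) hold in both cases.
        assert y == q * 5 + r and 0 <= r < 5
        print(f"Y = {y}, X = 5 -> q = {q}, r = {r}")
```

For Y = 17 this gives q = 3 and r = 2; for Y = −17 it gives q = −4 and r = 3, so the remainder stays non-negative even when the dividend is negative.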

The above equation states that the remainder r can be obtained by subtracting X from Y a total of q times, until the condition in (2.3) is satisfied; at this point the value of Y is the desired remainder r. Most hardware dividers operate in the same manner, much like long division by hand, with the hardware divider updating the value of Y according to the equation:

Y ← Y − δX (2.5)

Here δ is the partial quotient and the updated value of Y is the partial remainder. As in long division by hand, the hardware divider keeps track of the partial quotients by accumulating them in a register Z:

Z ← Z + δ (2.6)

From (2.5) and (2.6) we see that each step subtracts δX from Y and adds δ to Z.

The value of δ at each step can be chosen freely and still lead to the correct result, provided that the following two conditions are met:

1. The updated value of Y in (2.5) should converge to the range 0 ≤ Y < X, so that Y holds the desired remainder. If Y is positive, the term δX is subtracted from Y; if Y is negative, the term δX is added to Y.

2. The updated value of Z in (2.6) should be incremented or decremented by δ accordingly, so that Z holds the desired quotient.
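The freedom in choosing δ can be checked numerically. The sketch below is a hedged illustration, not the thesis's implementation: the function names step and iterate and the δ schedules are hypothetical. It shows that the paired updates (2.5) and (2.6) keep the quantity Y + Z·X equal to the original dividend for any δ, while only a suitable schedule brings Y close to the range 0 ≤ Y < X.

```python
def step(y, z, x, delta):
    """Apply (2.5) and (2.6) once; the sign of Y selects subtract or add."""
    if y >= 0:
        return y - delta * x, z + delta
    return y + delta * x, z - delta


def iterate(y0, x, deltas):
    """Run the paired updates for a given sequence of delta values."""
    y, z = y0, 0
    for delta in deltas:
        y, z = step(y, z, x, delta)
        # Invariant: the original dividend is always recoverable.
        assert y + z * x == y0
    return y, z


if __name__ == "__main__":
    print(iterate(100, 7, [8, 4, 2, 1]))   # halving schedule
    print(iterate(100, 7, [3, 5, 1]))      # arbitrary schedule
```

With Y = 100, X = 7 and δ = 8, 4, 2, 1 the loop ends at Y = −5 and Z = 15, which satisfies 100 = 15·7 − 5 but leaves Y just outside the target range; one further addition of X gives the remainder 2 and quotient 14. The arbitrary schedule 3, 5, 1 also preserves the invariant but stops at Y = 37, far from the range, which is why condition 1 matters.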

We represent the dividend Y as an n-bit 2's complement number, so that the range of Y is:

−2^(n−1) ≤ Y < 2^(n−1) (2.7)

Our divisor X is assumed to require only m bits for its representation, where m ≤ n. Figure 2.1 shows the nonzero bits in Y and X at the start of the division operation. Our goal is to iteratively reduce the nonzero bits of Y to m bits so that Y falls in the range:

0 ≤ Y < X (2.8)

Figure 2.2 shows the nonzero bits of Y at the end of the division operation, where Y stores the value of the remainder in the range 0 ≤ r < X.

The choice of δ at each iteration of (2.5) and (2.6) is what differentiates the division algorithms implemented in this work, as the following chapters demonstrate.

Figure 2.1: Nonzero bits of X and Y at the start of division

Figure 2.2: Nonzero bits of X and Y at the end of division


2.2 Division Algorithms Classes

Oberman and Flynn presented a taxonomy of division algorithms in [3] that classifies the algorithms by their hardware implementation into five classes: digit recurrence, functional iteration, very high radix, table look-up and variable latency. Many practical division algorithms are hybrids that combine several of these classes.

2.2.1 Digit Recurrence Algorithms

Digit recurrence is the simplest and most widely implemented class of division algorithms. It uses subtractive methods to determine the quotient digits, retiring a fixed number of quotient bits in every iteration; that is, the step size in bits is the same for each iteration. Digit recurrence algorithms are comparatively low in complexity and area.
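To make the idea of retiring a fixed number of bits per iteration concrete, the following Python sketch shows the textbook radix-2 restoring division for unsigned operands, which retires exactly one quotient bit per iteration. It is a generic illustration of the class, not one of the two algorithms studied in this thesis; the function name restoring_divide and the word length n are assumptions.

```python
def restoring_divide(dividend, divisor, n=8):
    """Radix-2 restoring division of an n-bit unsigned dividend.

    Each loop pass brings down one dividend bit, tries one subtraction,
    and decides one quotient bit.
    """
    if divisor <= 0:
        raise ValueError("divisor must be positive")
    if not 0 <= dividend < (1 << n):
        raise ValueError("dividend must fit in n unsigned bits")
    remainder, quotient = 0, 0
    for i in range(n - 1, -1, -1):
        # Shift the partial remainder left and bring down bit i.
        remainder = (remainder << 1) | ((dividend >> i) & 1)
        trial = remainder - divisor
        if trial >= 0:
            remainder = trial        # subtraction succeeded: quotient bit 1
            quotient |= 1 << i
        # otherwise the old remainder is kept ("restored"): quotient bit 0
    return quotient, remainder


if __name__ == "__main__":
    print(restoring_divide(200, 7))
```

The loop always runs n times regardless of the operand values, which is the fixed-latency behaviour this class is known for.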

2.2.2 Functional Iteration Algorithms

Functional iteration uses multiplication as the basis of the division operation. It takes advantage of a high-speed multiplier to converge to the result quadratically, unlike subtractive division, which converges linearly; this reduces the number of iterations and hence the latency. Instead of retiring a fixed number of bits per iteration, this class of algorithms retires an increasing number of bits at each iteration.
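Quadratic convergence can be seen in a short numerical experiment. The sketch below uses the Newton-Raphson reciprocal iteration x ← x(2 − d·x), a standard member of the functional iteration class; it is offered as an illustration under stated assumptions and is not taken from the thesis. The function name newton_reciprocal, the normalization of d to [0.5, 1] and the linear starting estimate are assumptions.

```python
def newton_reciprocal(d, iterations=4):
    """Approximate 1/d for 0.5 <= d <= 1 using only multiply and subtract.

    Returns the estimate and the relative error |1 - d*x| after each pass.
    """
    if not 0.5 <= d <= 1.0:
        raise ValueError("d is assumed to be normalized to [0.5, 1]")
    x = 48.0 / 17.0 - (32.0 / 17.0) * d   # common linear first estimate
    errors = []
    for _ in range(iterations):
        x = x * (2.0 - d * x)             # two multiplications per pass
        errors.append(abs(1.0 - d * x))
    return x, errors


if __name__ == "__main__":
    estimate, errs = newton_reciprocal(0.75)
    print(estimate)
    for k, e in enumerate(errs, start=1):
        print(f"after pass {k}: relative error {e:.3e}")
```

Because the relative error is squared on each pass, the number of correct bits roughly doubles per iteration, in contrast with the one-bit-per-iteration behaviour of radix-2 digit recurrence. A quotient Y/d would then be formed with one more multiplication by Y.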

2.2.3 Very High Radix Algorithms

Digit recurrence algorithms are suited to low-radix division. As the radix increases, the hardware and the generation of divisor multiples become more complicated and consume more area and computation time. A variant is the very high radix algorithm, which avoids the constraints posed by the higher radix; the term “very high radix” applies to dividers that retire more than 10 bits in each iteration.


2.2.4 Look-up Tables

When a low-precision quotient is required, it may be feasible to perform division with a look-up table alone, without an iterative algorithm. This implementation uses direct and linear approximation methods to compute the quotient bits. The table can be implemented as a ROM. Its advantage is fast processing, since no arithmetic calculation is needed; its drawback is that the size of the look-up table grows exponentially with each added bit of accuracy.
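The growth in table size is easy to quantify for the most direct form of the idea, a table that stores the answer for every operand pair. The sketch below is a simplified software model under that assumption (a real ROM-based design would typically store reciprocal approximations instead); the function name build_quotient_table is hypothetical.

```python
def build_quotient_table(k):
    """Precompute (quotient, remainder) for all k-bit unsigned operand pairs.

    Division by zero is excluded, so the table has 2**k * (2**k - 1) entries.
    """
    size = 1 << k
    return {(y, x): (y // x, y % x)
            for y in range(size)
            for x in range(1, size)}


if __name__ == "__main__":
    for k in range(2, 7):
        table = build_quotient_table(k)
        print(f"{k}-bit operands: {len(table)} entries")
    # A division is then a single read, with no arithmetic at run time.
    print(build_quotient_table(4)[(13, 3)])
```

Each additional operand bit roughly quadruples the number of entries in this direct form, which illustrates why table look-up is attractive only for low precision.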

2.2.5 Variable Latency Algorithms

The digit recurrence and very high radix algorithms retire a fixed number of bits in every iteration, while functional iteration algorithms retire an increasing number of bits in every iteration, but all three complete the operation in a fixed number of cycles. Dividers based on variable latency algorithms perform division in a variable amount of time.

2.3 Related work in the area

The main classes of hardware division algorithms were highlighted in the previous section, and each has its own applications and benefits. Digit recurrence is the most commonly used approach for hardware division, and it includes procedures such as restoring, non-restoring and SRT (Sweeney, Robertson and Tocher) division, approximation algorithms, the CORDIC algorithm, the multiplicative algorithm and the continued product algorithm [9]. According to Sutter and Deschamps [10], binary non-restoring digit recurrence algorithms are the preferred procedure for FPGA-based dividers. The authors of [9] implemented high-speed non-restoring division using a high-speed adder/subtractor to speed up the division operation. Sutter and Deschamps implemented high-speed fixed-point dividers in [10] that exploit FPGA characteristics such as adder/subtractors or conditional adders having the same delay as simple adders, dedicated fast carry generation and propagation logic, and multiplexers in addition to the general-purpose LUTs, in sequential, combinational and pipelined circuits. Higher speed is desirable in hardware implementations, but some applications also require power efficiency. Nannarelli and Lang proposed a low-power divider [11] and discussed power-saving techniques such as re-timing the recurrence, changing the redundant representation to reduce the number of flip-flops, using gates with lower drive capability, equalizing the paths of the input signals of the blocks to reduce glitches, and switching off inactive blocks.

We focused our implementation on the non-restoring division methodology. We designed a fixed iteration division algorithm and then used Dr. Gebali's HCORDIC technique [5], an adaptive methodology based on a hierarchical design, to reduce the number of iterations in the adaptive shift algorithm. Dr. Gebali implemented this technique for floating-point arithmetic, and we adapted it to integer arithmetic.

2.4 Chapter Summary

This chapter highlighted the basics of division in hardware, which will enable the reader to understand the algorithms presented in Chapter 3. An overview of some of the known classes of division algorithms was presented to show the high-level differences between implementations. Related work in the area of division was also discussed to give the reader additional context for the work that follows.


Chapter 3

Considered Division Algorithms

The non-restoring division algorithm is based on retiring a fixed number of quotient bits in each iteration. Our algorithms are built on the shift, or δ, introduced in the previous chapter. The size of δ distinguishes our two algorithms: one uses a fixed δ and the other an adaptive δ, and we refer to them as the fixed shift algorithm and the adaptive shift algorithm respectively.

3.1 Division Approach

3.1.1 Reasons For Considerations

We chose these two division algorithms for the following reasons:

1. They are popular for implementation of division in integer arithmetic.

2. No multiplier is needed (reduced power and area).

3. No device-specific adder, multiplier or look-up table is utilized, so the algorithms can be implemented in non-Xilinx programmable logic devices and are not device specific.

4. Simplicity of the algorithms.

3.1.2 Overview of Operation

The two algorithms essentially operate in two modes:


1. Range reduction mode of Y: in this mode, the algorithm takes multiple steps (iterations) to reduce the dividend and converge on the result.

2. Post processing mode of Y and Z: this is a single step that corrects the remainder and quotient when the result of mode 1 does not fall in the desired range.

To begin the operation in mode 1, the sign of the current value of the dividend Y is checked. If the value is negative, the product of δ and the divisor X is added to Y to obtain the next value of Y. If the value of Y is positive, the product δ*X is subtracted from the current value of Y to obtain the next value of Y. These steps yield the value of the remainder.

The quotient is produced simultaneously: δ is added to or subtracted from the current value of the quotient Z, using the opposite operation to the one performed on Y. At each of these steps, the range of Y is also checked; if at the end of the iterations the value of Y is in the desired range, that value of Y is the remainder and the corresponding value of Z is the quotient.

If the value is not in the range at the end of the range reduction mode, the algorithm

will jump to mode 2, which will be a single step to adjust the range so that we have the

correct quotient and remainder at the next step. This methodology is mathematically

explained in the next section.

3.2 Division Algorithm 1 : Fixed Shift Algorithm

This algorithm performs a fixed, minimal number of iterative steps to give the quotient and the remainder when we divide Y by X. In our work, Y is a 32 bit signed integer, so n, the number of bits in the dividend, is 32. X is m bits long, with m = 17, since this is the minimum value needed by Dr. Gebali for the initial quotient to implement the random number generator. The sign of X is arbitrary, and X is therefore assumed to be positive.

The fixed shift division algorithm has the following properties:

1. The required number of iterations is equal to n − m + 1.


2. The sign of the current value of Y determines if the operation needed on the

next iteration is addition or subtraction.

3. The value of Z will converge on to the quotient with the opposite operation to

the operation of Y in property number 2.

4. The δ at every iteration is determined by the equation (3.6) below.

3.2.1 Mode 1 : Range reduction of Y

The step size of δ is given by the iteration index and not by the intermediate values

of Y. The property number 1 is applied on Y and Z per the following equations:

$$Y^{(i+1)} = Y^{(i)} - \mu_i \delta_i X, \qquad 0 \le i \le n-m \tag{3.1}$$

$$Z^{(i+1)} = Z^{(i)} + \mu_i \delta_i \tag{3.2}$$

where the initial values of Y and Z are:

$$Y^{(0)} = Y \tag{3.3}$$

$$Z^{(0)} = 0 \tag{3.4}$$

The $\mu_i$ in equations (3.1) and (3.2) denotes the addition or subtraction operation at a given iteration index i, and $\delta_i$ is the step size. They are given by the following equations:

$$\mu_i = \begin{cases} 1 & \text{when } Y^{(i)} \ge 0 \\ -1 & \text{when } Y^{(i)} < 0 \end{cases} \tag{3.5}$$

$$\delta_i = 2^{(n-m-i)}, \qquad 0 \le i \le n-m \tag{3.6}$$

Once again, it is important to remember that the iteration step size depends on δ and not on the intermediate data of the partial quotient and remainder. This step size is governed by the binary shift and is used by the ALU of the divider to compute the result.


3.2.2 Mode 2 : Post processing of Y and Z

On the completion of Mode 1, the value of the remainder, $Y^{(n-m+1)}$, needs to fall in the range:

$$-2^{m-1} \le Y^{(n-m+1)} \le 2^{m-1} - 1 \tag{3.7}$$

This range may not be satisfied due to the following:

1. The value of $Y^{(n-m+1)}$ is negative.

2. The value of $Y^{(n-m+1)}$ is positive but greater than or equal to X.

In either outcome, the post processing mode is applied so that the inequality below is satisfied in order to achieve the correct remainder:

$$0 \le Y^{(n-m+1)} < X \tag{3.8}$$

The value of the quotient, $Z^{(n-m+1)}$, also needs to be updated whenever Y is changed.

In order to bring the result $Y^{(n-m+1)}$ into the desired range, the following process needs to be applied:

$$Y^{(n-m+1)} = Y^{(n-m+1)} - \mu X \tag{3.9}$$

$$Z^{(n-m+1)} = Z^{(n-m+1)} + \mu \tag{3.10}$$

where µ works in the same way as in range reduction mode to determine the addition or subtraction operation in equations (3.9) and (3.10), based on the following condition:

$$\mu = \begin{cases} 1 & \text{when } Y^{(n-m+1)} \ge X \\ -1 & \text{when } Y^{(n-m+1)} < 0 \end{cases} \tag{3.11}$$

To satisfy (3.8), this process is only needed once. The total number of iterations needed in algorithm 1 is n − m + 1 if the result of division is achieved in mode 1. If the result is not achieved in mode 1, a total of n − m + 2 iterations will be required.
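The two modes of the fixed shift algorithm can be sketched in software as follows. This is a minimal Python illustration of equations (3.1) to (3.11), not the thesis's VHDL or MATLAB code; the function name `fixed_shift_divide` and the iteration counter are hypothetical, and the sketch assumes |Y| < 2^(n−m+1)·X so that a single correction step is enough.

```python
def fixed_shift_divide(y, x, n=32, m=17):
    """Divide y by a positive x with the fixed shift recurrence.

    Returns (quotient, remainder, iterations).
    Assumption: |y| < 2**(n - m + 1) * x, so one correction step suffices.
    """
    if x <= 0:
        raise ValueError("divisor X is assumed to be positive")
    z = 0
    iterations = 0
    # Mode 1: range reduction, equations (3.1), (3.2), (3.5) and (3.6).
    for i in range(n - m + 1):
        mu = 1 if y >= 0 else -1
        delta = 1 << (n - m - i)  # delta_i = 2**(n - m - i)
        y -= mu * delta * x
        z += mu * delta
        iterations += 1
    # Mode 2: single post-processing step, equations (3.9) to (3.11).
    if y >= x:
        y, z = y - x, z + 1
        iterations += 1
    elif y < 0:
        y, z = y + x, z - 1
        iterations += 1
    return z, y, iterations


print(fixed_shift_divide(100, 7))   # (14, 2, 17)
print(fixed_shift_divide(-100, 7))  # (-15, 5, 16)
```

With n = 32 and m = 17, the first call needs the post-processing step (n − m + 2 = 17 iterations), while the second finishes in mode 1 (n − m + 1 = 16 iterations).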


3.3 Division Algorithm 2 : Adaptive Shift Algorithm

This algorithm does not perform a fixed number of iterative steps to compute the quotient and the remainder; instead, at each iteration it determines the step size δ from the magnitude of the input data. Since the step size of the shift is not fixed, we call it an adaptive shift. This algorithm requires fewer iterations than the fixed shift algorithm. As in the fixed shift algorithm, we consider the divisor X to have m bits and the dividend Y to have n bits, inclusive of the sign bit.

The adaptive shift division algorithm has the following properties:

1. The required number of iterations is determined by the input data.

2. The sign of the current value of Y determines if the operation needed on the

next iteration is addition or subtraction.

3. The value of Z will converge on to the quotient with the opposite operation to

the operation of Y in property number 2.

4. The location of the most significant bit value of Y and X determines the value

of δ at every iteration by the equation (3.17) below.

3.3.1 Mode 1 : Range reduction of Y

The step size of δ in the adaptive shift algorithm is obtained by the magnitude of

the input data and not by the iteration index, as it was obtained in the fixed shift

algorithm. The iterations on Y and Z occur as per the following equations:

$$Y^{(i+1)} = Y^{(i)} - \mu_i \delta_i X, \qquad 0 \le i \le n-m \tag{3.12}$$

$$Z^{(i+1)} = Z^{(i)} + \mu_i \delta_i \tag{3.13}$$

where the initial values of Y and Z are:

$$Y^{(0)} = Y \tag{3.14}$$

$$Z^{(0)} = 0 \tag{3.15}$$

The $\mu_i$ in equations (3.12) and (3.13) denotes the addition or subtraction operation at a given iteration index i, and $\delta_i$ is the step size. They are given respectively by the following equations:

$$\mu_i = \begin{cases} 1 & \text{when } Y^{(i)} \ge 0 \\ -1 & \text{when } Y^{(i)} < 0 \end{cases} \tag{3.16}$$

$$\delta_i = 2^{(P_y - P_x)}, \qquad |y| \ge x \tag{3.17}$$

where $P_x$ is the position of the most significant set bit of X, since the sign of X is arbitrary and our notation assumes it to be positive, while $P_y$ is defined as:

$$P_y = \begin{cases} \text{position of most significant 1} & \text{when } Y > 0 \\ 0 & \text{when } Y = 0 \\ \text{position of most significant 0} & \text{when } Y < 0 \end{cases} \tag{3.18}$$

When $P_y \le P_x$, the iterations of the range reduction mode are stopped.
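The range reduction mode of the adaptive shift algorithm can be sketched in the same way. The following Python illustration follows equations (3.12) to (3.18) under two assumptions that the text does not state: bit positions are counted from 0 at the least significant bit, and a negative Y with no 0 bit (Y = −1) is given position 0. The helper names `msb_position`, `position_y` and `adaptive_range_reduce` are hypothetical, and post processing (mode 2) is deliberately left out.

```python
def msb_position(value):
    """0-indexed position of the most significant 1 of a non-negative int (0 if none)."""
    return max(value.bit_length() - 1, 0)


def position_y(y):
    """P_y as in equation (3.18), with 0-indexed bit positions (an assumption)."""
    if y > 0:
        return msb_position(y)
    if y == 0:
        return 0
    # For a negative two's complement value, the most significant 0 of y
    # is the most significant 1 of its bitwise complement.
    return msb_position(~y)


def adaptive_range_reduce(y, x):
    """Mode 1 only: returns (Y, Z, iterations) once P_y <= P_x."""
    if x <= 0:
        raise ValueError("divisor X is assumed to be positive")
    px = msb_position(x)
    z = 0
    iterations = 0
    while True:
        py = position_y(y)
        if py <= px:  # stop condition of the range reduction mode
            break
        mu = 1 if y >= 0 else -1
        delta = 1 << (py - px)  # delta_i = 2**(P_y - P_x)
        y -= mu * delta * x
        z += mu * delta
        iterations += 1
    return y, z, iterations


print(adaptive_range_reduce(100, 7))  # (2, 14, 2)
```

For Y = 100 and X = 7 the sketch stops after two iterations (δ = 16, then δ = 2), compared with the n − m + 1 = 16 iterations the fixed shift recurrence always performs.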

3.3.2 Mode 2 : Post processing of Y and Z

On the completion of Mode 1, the value of $Y^{(n-m+1)}$ needs to fall in the range:

$$-2^{m-1} \le Y^{(n-m+1)} \le 2^{m-1} - 1 \tag{3.19}$$

Just as in the post processing of the fixed shift algorithm, the range may not be satisfied because the value of $Y^{(n-m+1)}$ is either negative or positive but greater than or equal to X. This value therefore needs to be processed so that it satisfies the range:

$$0 \le Y^{(n-m+1)} < X \tag{3.20}$$

The value of the quotient, $Z^{(n-m+1)}$, also needs to be updated whenever Y is changed.

In order to bring the result $Y^{(n-m+1)}$ into the desired range, the following process needs to be applied:

$$Y^{(n-m+1)} = Y^{(n-m+1)} - \mu X \tag{3.21}$$

$$Z^{(n-m+1)} = Z^{(n-m+1)} + \mu \tag{3.22}$$

where µ works in the same way as in range reduction mode to determine the addition or subtraction operation in equations (3.21) and (3.22), based on the following condition:

$$\mu = \begin{cases} 1 & \text{when } Y^{(n-m+1)} \ge X \\ -1 & \text{when } Y^{(n-m+1)} < 0 \end{cases} \tag{3.23}$$

This process is needed so that the range of equation (3.20) is satisfied. The total number of iterations needed in algorithm 2 is n − m if the result of division is achieved in mode 1. If the result is not achieved in mode 1, one more iteration is needed in mode 2.

3.4 Chapter Summary

In this chapter, we considered the two division algorithms: the fixed shift algorithm and the adaptive shift algorithm. The equations and conditions required by the algorithms were explained and represented mathematically. The difference between the two algorithms lies primarily in the step size δ. In the fixed shift algorithm, δ is determined by the iteration index, while in the adaptive shift algorithm, δ is governed by the input data, namely the difference between the position of the most significant “1” or “0” in Y (depending on the sign of Y) and the position of the most significant “1” in X, since X is assumed to be positive. In both algorithms, the idea is to reduce Y in steps determined by δ until it is non-negative and less than X. When Y fails to fall in the correct range, a post processing step is required to obtain the correct values of Y and Z.


Chapter 4

Design and Implementation

The hardware realization of the division algorithms requires the identification and design of the individual system blocks and their interconnection within the divider designs. This chapter describes the design methodology.

4.1 Hardware entities for Algorithm 1

The division methodology, equations, conditions and operations explained in chapter 3 are used to determine the hardware entities required for each of the division algorithms. In this section we look at the hardware entities that are required for the implementation of Algorithm 1. In every iteration the hardware needs to implement:

• One shift.

• One addition and one subtraction (two operations performed by the ALU)

To implement this, Algorithm 1 needs the following entities:

• X, Y and Z registers

• Data multiplexer

• Comparator for Y

• Look-up table

• ALU

• Counter


• Finite state machine

The system block-level diagram of Algorithm 1 is shown in fig. 4.1.

Figure 4.1: Algorithm 1 system level

4.1.1 X, Y and Z Registers

Division involves four operands in total: the divisor, the dividend, the quotient and the remainder. In our implementation, only three operands are needed, since the dividend is reduced in place until it yields the remainder. Therefore we need to store only three values in the registers: the remainder Y, the quotient Z and the divisor X. The word width of Y is 32 bits, so we set the registers of X and Z to a 32 bit word width too. Having a uniform word width for the three registers simplifies the arithmetic operations on these operands.

Moreover, the registers are required to hold values from the following:

• The initial values from the external data lines

• The intermediate values of Y and Z from the data feedback from the ALU

during each iteration.

• The final values of Y and Z once the iterations are complete and division result

is obtained.

To meet the above requirements, we need control signals for the register bank that enable reading and writing of the register contents, and we also need the ability to select between the external data and the internal feedback data. The block level view of our register bank is shown in fig. 4.2 below.

Figure 4.2: Registers X, Y and Z in the bank

4.1.2 Data Multiplexer

The multiplexer receives a control signal from the controller to select between the external data lines and the feedback data lines from the ALU; the output data lines of the multiplexer feed the data into the registers. The block level view of the multiplexer is shown in fig. 4.3.


Figure 4.3: Data multiplexer for register bank

4.1.3 Comparator for Y

The comparator that scans Y is an important part of the hardware since it determines whether addition or subtraction is needed to compute the next values of Y and Z. The operands X and Y are fed into the comparator, which sets its flags as follows:

• The flag f_ypos goes low when the value of Y is negative (f_ypos = 0)

• The flag f_ygtex goes high when the value of Y is positive and greater than or equal to X (f_ygtex = 1)

The block level view of the comparator is shown in fig. 4.4 below.
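The behaviour of the two comparator flags can be summarized in a few lines. This is a behavioural Python sketch, assuming the signal names are f_ypos and f_ygtex (with underscores) and that f_ygtex means Y ≥ X, as the state machine description in section 4.1.8 suggests; the function name `comparator_flags` is hypothetical.

```python
def comparator_flags(y, x):
    """Return (f_ypos, f_ygtex) for the current Y and the positive divisor X."""
    f_ypos = 1 if y >= 0 else 0   # 0 signals a negative Y
    f_ygtex = 1 if y >= x else 0  # 1 signals Y greater than or equal to X
    return f_ypos, f_ygtex


print(comparator_flags(-5, 7))  # (0, 0): Y negative
print(comparator_flags(2, 7))   # (1, 0): Y already in the range 0 <= Y < X
print(comparator_flags(9, 7))   # (1, 1): Y positive but not less than X
```

Only the combination (1, 0) corresponds to a remainder that already satisfies 0 ≤ Y < X.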

Figure 4.4: Comparator for Y

4.1.4 The Look-up table

The look-up table (LUT) is implemented as a ROM in the system whose contents are the weights of the binary shifts. The value of δ determined in the two algorithms corresponds to an address in the LUT, whose contents are picked up by the ALU during the computation in the iteration. The LUT block is shown in fig. 4.5.


Figure 4.5: LUT block

4.1.5 The ALU unit

The ALU unit computes equations (3.1), (3.2), (3.9), (3.10), (3.12), (3.13), (3.21) and (3.22) and is comprised of three ALUs that perform the following:

• Multiply $\delta_i$ and X. Since $\delta_i$ is a power of two, this is a binary shift.

• Add the product $\delta_i X$ to, or subtract it from, $Y_i$ (based on the sign bit of the current Y) to obtain $Y_{i+1}$

• Add $\delta_i$ to, or subtract it from, $Z_i$ (based on the sign bit of the current Y) to obtain $Z_{i+1}$

The ALU requires a control signal, derived from the comparator flags, that selects the addition or subtraction operation. The ALU block is shown in fig. 4.6 and the logical operation during an iteration is shown in fig. 4.7.
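One iteration of the LUT and ALU datapath described above can be sketched as follows. This Python illustration rests on several assumptions not stated in the text: LUT address k holds the weight 2^k, the control signal is named add_sub_y, and add_sub_y = 1 selects addition on Y. The names `SHIFT_LUT` and `alu_step` are hypothetical.

```python
# Assumed ROM contents: address k holds the shift weight 2**k.
SHIFT_LUT = tuple(1 << k for k in range(32))


def alu_step(y, z, x, delta_address, add_sub_y):
    """Compute (Y_next, Z_next) for one iteration.

    add_sub_y = 1 is assumed to mean "add to Y" (used when Y is negative);
    Z always receives the opposite operation.
    """
    delta = SHIFT_LUT[delta_address]
    shift = delta.bit_length() - 1
    product = x << shift  # delta * X realized as a left shift
    if add_sub_y:
        return y + product, z - delta
    return y - product, z + delta


print(alu_step(100, 0, 7, 4, 0))   # (-12, 16)
print(alu_step(-12, 16, 7, 1, 1))  # (2, 14)
```

The two calls reproduce two successive iterations for Y = 100 and X = 7: a subtraction with δ = 16 followed by an addition with δ = 2.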

Figure 4.6: ALU block


Figure 4.7: Logical operation of ALU during ith iteration

4.1.6 Counter

To perform the shift we need a counter. Recall from section 3.2.1 that the step size δ is given by the iteration index and not by the intermediate values of Y. The counter is employed in algorithm 1 to produce the iteration index at each iteration, which selects the corresponding value from the LUT for the ALU. When the iterations are complete, a flag is raised and its status is provided to the controlling unit. The counter block is shown below in fig. 4.8.

Figure 4.8: Counter block diagram

4.1.7 Finite State Machine

The finite state machine (FSM) is the controlling unit of the system; it sends and receives control signals to and from the other hardware entities in the system. The FSM block is shown in fig. 4.9. The FSM of algorithm 1 is fairly simple and has only four states: initial, iterate, adjust and final.


Figure 4.9: Finite State Machine block

4.1.8 FSM : State transition diagram

In the initial state the FSM is idle and scans for an external “start” input control signal. The initial state is used for system initialization, which occurs upon reset: the counter is cleared and “sel” (select) is set high so that the external data inputs are selected and those values are ready to be loaded into the registers X, Y and Z. The enable_x and enable_yz signals are set high, which enables writing to the registers, while the “done” signal is set to “0” and add_sub_y is a don’t care.

Once “start” is received, the FSM goes into the iterate state, which implements the “range reduction of Y” mode. For this the counter is enabled and the “sel” control is set to “0” so that the internal feedback data lines from the ALU are selected for the next iteration. The flags f_ypos = 0 and f_ygtex = 1 indicate that Y is negative, or positive and greater than or equal to X, respectively, and add_sub_y is controlled accordingly. If the value of Y is negative, addition is performed; if it is positive and greater than or equal to X, subtraction is performed. When the counter has reached the pre-determined count, the f_i flag is raised to “1”, which signals to the FSM that the iterations of mode 1 are complete.

The FSM checks the status of the flags. If f_ypos = 0 or f_ygtex = 1, the FSM goes into the adjust state to “post process Y and Z”. Otherwise (f_ypos = 1 and f_ygtex = 0), Y is in the correct range and the FSM goes directly into the final state. In the final state, writing to registers Y and Z is disabled through enable_yz and the “done” signal is set to “1”, which indicates that the division operation is complete. The state transition diagram is shown in fig. 4.10.

Figure 4.10: State transition diagram for Algorithm 1
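The state transitions described in this section can be written as a small next-state function. This is a Python sketch of the four states only; it assumes that the adjust state is entered when either f_ypos = 0 or f_ygtex = 1, that adjust always moves to final after one step, and that final holds until reset. The function name `next_state` and the string state labels are hypothetical.

```python
def next_state(state, start, f_i, f_ypos, f_ygtex):
    """Next state of the Algorithm 1 controller (behavioural sketch)."""
    if state == "initial":
        return "iterate" if start else "initial"
    if state == "iterate":
        if not f_i:  # counter has not reached its final count yet
            return "iterate"
        needs_adjust = (f_ypos == 0) or (f_ygtex == 1)
        return "adjust" if needs_adjust else "final"
    if state == "adjust":  # single post-processing step
        return "final"
    return "final"  # assumed to hold until reset


# Walk through a division that needs the post-processing step.
state = "initial"
for inputs in [(1, 0, 1, 0), (0, 0, 0, 0), (0, 1, 0, 0), (0, 1, 1, 0)]:
    state = next_state(state, *inputs)
    print(state)
```

The walk-through visits iterate twice, then adjust (because Y is negative when f_i rises), then final.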

4.2 Hardware entities for Algorithm 2

We know from section 3.3.1 that the step size δ in the adaptive shift algorithm is obtained from the magnitude of the input data and not from the iteration index; therefore we do not use the counter in the implementation of this algorithm. Instead, we need a special hardware unit that finds the most significant 1 or 0 in the operand, depending on whether the number is positive or negative respectively. In our design, we call this unit the delta address generator (DAG). In every iteration the hardware needs to implement the following operations:

• Determine the location of most significant 1 or 0 for Y i.

• One shift.


• One addition and one subtraction (two operations performed by the ALU)

To implement this, Algorithm 2 needs the following entities:

• X, Y and Z registers

• Data multiplexer

• Comparator for Y

• Look-up table

• ALU

• Delta (δ) address generator

• Finite state machine

The system block-level diagram of Algorithm 2 is shown in fig. 4.11. We only discuss the DAG and the finite state machine for Algorithm 2 because they are specific to the adaptive shift algorithm, while the rest of the entities are implemented in exactly the same way as in Algorithm 1. One key difference between the two designs is that the counter used in Algorithm 1 is not used in Algorithm 2; instead, the DAG generates the shifts in δ.


Figure 4.11: Algorithm 2 system level

4.2.1 Delta Address Generator

This unit determines the locations of the most significant 1 or 0, that is, the positions Py and Px, and generates an address from the difference of the two positions. This address is used to obtain the corresponding value of the shift δ from the LUT ROM, which is then used by the ALU for the computation in the iteration step. The DAG block level diagram is shown in fig. 4.12 and the overall system block-level diagram is given in fig. 4.13.


Figure 4.12: Delta (δ) Address Generator

Figure 4.13: DAG system level


The DAG is composed of the following hardware entities:

• position finder unit.

• multiplexer for flag.

• multiplexer for data lines.

• Px register.

• number subtractor.

Position finder unit

The purpose of this unit is to find Py and Px from Y and X respectively, based on the input flag f_ypos. If f_ypos = 1, the position finder unit detects the most significant 1 bit in Y, and if f_ypos = 0 it detects the most significant 0 bit. Since X is assumed to be positive, the unit always looks for the most significant 1 in X. In fig. 4.14 below, the number at the input is either X or Y, depending on the data multiplexer input. Similarly, f_mux, the flag forwarded by the flag multiplexer, indicates the sign of the operand at the input of the position finder unit. For Y, “f_mux” carries f_ypos from the comparator; for X, “f_mux” sends a “1” to the position finder unit, which tells the unit to look for the most significant “1” in X. The output “position” carries the value of Py or Px, from Y or X respectively. The “flag_out” signal is a by-product of the hierarchical implementation of the position finder and is not used in the computation of Py and Px or in the division operation.

Figure 4.14: Position finder unit block


Multiplexer for flag

This is a simple multiplexer that enables re-use of the same position finder unit for Px and Py. It reads the status of the flag f_ypos to decide whether the unit needs to look for 1’s or 0’s in Y. For Px, we feed a “1” from the multiplexer input so that the unit always looks for the most significant “1” in X, since X is always positive. Figure 4.15 below highlights this: the “sel_x” input comes from the FSM, and when it is high the multiplexer sends “1” to the output; otherwise, when it is low (“0”), it sends “f_ypos” to the output as “f_mux”.

Figure 4.15: Multiplexer for flag input

Multiplexer for data lines

This works in exactly the same way as the multiplexer for flag and shares the same control input “sel_x”. Since we re-use the position finder unit for both Px and Py, this multiplexer controls which data lines are selected as input for the position finder unit, as shown in fig. 4.16 below.

Figure 4.16: Multiplexer for data input


Px register

To re-use the position finder unit, we need a register that stores Px for the number subtractor. Since this register is only used for Px, it functions only when “sel_x = 1”, and is therefore controlled by the signal “enable_reg_Px”. Figure 4.17 illustrates this block.

Figure 4.17: The Px Register

Number subtractor

This hardware entity performs the subtraction Py − Px, whose result is used as an address for the LUT. It also raises the flag “f_i” when the result of the subtraction is less than or equal to “0”, which indicates to the FSM that the “range reduction of Y” mode is complete. The delta address carries the value of Py − Px, while the “position_x” and “position” signals represent Px and Py respectively. Figure 4.18 illustrates this block.

Figure 4.18: The number subtractor block in DAG


4.2.2 DAG Implementation

The DAG is the most important hardware unit of Algorithm 2, since it computes the adaptive shift δ for this algorithm. Recall that in algorithm 1 we employed the counter to compute the fixed shifts, which were based on the iteration index i. In the adaptive shift division technique, we instead scan the words Y and X for the bit position of the most significant 1 or 0 and then use the difference between the bit positions to obtain the value of δ.

The DAG is implemented in a hierarchical arrangement of five levels. Since we have a 32 bit operand, the number of levels x is given by the relation:

$$2^x = 32 \tag{4.1}$$

therefore, x = 5.

“Level 1” is comprised of 16 2-bits scan units, each of which scans two bits at a time, covering the entire word width of Y from bit locations Y0Y1 up to Y30Y31. Depending on the sign of Y, each unit checks for the presence of a 1 or a 0 in the more significant bit of its pair; otherwise it checks the less significant bit, and it sends the resulting flag and position to the next hierarchical level. Each unit also accepts a starting base value n, from which the position value passed on to the next level is obtained. Figure 4.19 below shows 2 of these units to help illustrate the concept.

Tables 4.1 and 4.2 show how the 2-bits scan unit works when Y is positive or negative.

Figure 4.19: 2-bits scan unit 0 in level 1

“Level 2” is comprised of 8 scan blocks, each of which effectively scans 4 bits by taking the two numbers and the two flags from a pair of 2-bits scan units in level 1, starting from scan block 0 up to scan block 7. If the flag “f1” of 2-bits scan unit 1 is a “1”, then the number on the output of scan block 0 is “n1”; if the flag “f0” is a “1” and “f1” is a “0”, then the number on the output of scan block 0 is “n0”. We demonstrate this relation between scan block 0 in level 2 and the 2-bits scan units 0 and 1 from level 1 in fig. 4.20.

Y1   Y0   n0    f0
0    0    0     0
0    1    n     1
1    0    n+1   1
1    1    n+1   1

Table 4.1: Truth Table when Y is positive

Y1   Y0   n0    f0
0    0    n+1   1
0    1    n+1   1
1    0    n     1
1    1    0     0

Table 4.2: Truth Table when Y is negative

The approach of level 2 carries over in the same manner through level 3 and level 4 down to level 5. The scan block shown in fig. 4.20 is exactly the same in the remaining levels and works on the same principle, accepting two numbers and two flags from the previous level and updating the number output depending on the status of the flags. With each additional level, the number of scan blocks needed is reduced by a factor of 2; hence we have four scan blocks in level 3, two in level 4 and one in level 5.

Figure 4.20: Hierarchical approach between level 1 and 2

The hierarchical arrangement of all 5 levels is shown in fig. 4.21. The number “n0_L5” obtained at the output of level 5 is the position of the most significant 1 or 0, depending on the sign of the operand.

The n at the top of each 2-bits scan unit, referred to in the figure as “u”, is the base value for that unit. Notice that the whole word width of 32 bits is covered by 16 2-bits scan units; each scan unit forwards its bit position output (n0...n15) and flag output (f0...f15) to the scan blocks in level 2.

Although the scan blocks operate on the same principle as the 2-bits scan units, different notation is used for their number and flag outputs to highlight the difference. Each number is the base value plus the position of the most significant 1 or 0 within that unit, and the flags then determine which scan block holds the most significant 1 or 0. In other words, if the flag from a higher-order scan block is high, the number output of that scan block is sent to the output.
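The hierarchical position finder can be modelled behaviourally as a five-level tree. The Python sketch below implements the 2-bits scan unit from Tables 4.1 and 4.2 and the scan block selection rule described above. It assumes that the base value n of each scan unit is the index of its lower bit (0, 2, ..., 30) and that a scan block's output flag is the OR of its two input flags; the names `scan_unit`, `scan_block` and `find_position` are hypothetical.

```python
def scan_unit(bit_hi, bit_lo, base, look_for):
    """2-bits scan unit: returns (number, flag) as in Tables 4.1 and 4.2."""
    if bit_hi == look_for:
        return base + 1, 1
    if bit_lo == look_for:
        return base, 1
    return 0, 0


def scan_block(low, high):
    """Scan block: forward the higher-order number if its flag is set."""
    n_low, f_low = low
    n_high, f_high = high
    if f_high:
        return n_high, 1
    if f_low:
        return n_low, 1
    return 0, 0


def find_position(word, f_ypos, width=32):
    """Return (position, flag_out) of the most significant 1 (f_ypos=1) or 0 (f_ypos=0)."""
    look_for = 1 if f_ypos else 0
    bits = word & ((1 << width) - 1)  # two's complement view of the word
    # Level 1: sixteen 2-bits scan units, assumed base values 0, 2, ..., 30.
    level = [
        scan_unit((bits >> (k + 1)) & 1, (bits >> k) & 1, k, look_for)
        for k in range(0, width, 2)
    ]
    # Levels 2 to 5: halve the number of blocks at each level.
    while len(level) > 1:
        level = [scan_block(level[j], level[j + 1]) for j in range(0, len(level), 2)]
    return level[0]


print(find_position(100, 1))  # (6, 1)
print(find_position(-12, 0))  # (3, 1)
print(find_position(7, 1))    # (2, 1)
```

For a 32 bit word the loop runs four times after level 1, giving the five levels of equation (4.1).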


Figure 4.21: Hierarchical arrangement of position finder unit


4.2.3 Finite State Machine

The finite state machine (FSM) is the controlling unit of the system; it sends and receives control signals to and from the other hardware entities in the system. The FSM block is shown below in fig. 4.22. The FSM of algorithm 2 has one more state than that of algorithm 1, for a total of five states: initial, load X (initialize X), iterate, adjust and final.

Figure 4.22: Finite State Machine block

4.2.4 FSM : State transition diagram

In the initial state the FSM is idle and scans for an external “start” input control signal. Once “start” is received, the FSM goes into the load X state. The load X state is an additional initialization state, alongside the initial state, that loads the value of Px into the Px register so that the iterations are synchronized with Py when the iterate state is reached. The initial state is used for system initialization, which occurs upon reset; “sel” (select) is set high so that the external data inputs are selected and those values are ready to be loaded into the registers X, Y and Z. The enable_x and enable_yz signals are set high, which enables writing to the registers, while the “done” signal is set to “0” and add_sub_y is a don’t care. We have two additional control signals, “sel_x” (select x) and “enable_reg_x” (enable register x), which are associated with obtaining the value of Px, the position of X. In the load X state, “sel_x” and “enable_reg_x” are disabled so that the DAG will fetch values of Y in order to obtain the value of Py.

The value of δ can be obtained once the DAG performs the operation Py − Px. On the next clock cycle the FSM goes into the iterate state, which implements the “range reduction of Y” mode. The flags f_ypos = 0 and f_ygtex = 1 indicate that Y is negative, or positive and greater than or equal to X, respectively, and add_sub_y is controlled accordingly. If the value of Y is negative, addition is performed; if it is positive and greater than or equal to X, subtraction is performed. When Py − Px ≤ 0, the f_i flag is raised to “1”, which signals to the FSM that the iterations of mode 1 are complete. The state transition diagram is shown in fig. 4.23.

Figure 4.23: State transition diagram for Algorithm 2


The FSM checks the status of the flags. If f_ypos = 0 or f_ygtex = 1, the FSM goes into the adjust state to “post process Y and Z”. Otherwise (f_ypos = 1 and f_ygtex = 0), Y is in the correct range and the FSM goes directly into the final state. In the final state, writing to registers Y and Z is disabled through enable_yz and the “done” signal is set to “1”, which indicates that the division operation is complete.

4.3 Circuit Implementations

The general description of the system and its blocks has been covered in the previous sections. In this section we look at the Register Transfer Level (RTL) view of the top level block and the overall RTL schematic of the division and allied hardware implementation. In the schematics, the signal paths are shown in red and the data paths are shown in black.

4.3.1 Algorithm 1 : Fixed Shift division algorithm

The top level block and the RTL schematic for the fixed shift division algorithm are shown in figs. 4.24 and 4.25.

Figure 4.24: Top level block of fixed shift division algorithm


Figure 4.25: Fixed shift division algorithm RTL schematic


4.3.2 DAG overall layout

The overall RTL schematic for the DAG used in algorithm 2 is shown in fig. 4.26.

Figure 4.26: Delta address generator RTL schematic

4.3.3 Algorithm 2: Adaptive Shift division algorithm

The schematics for the adaptive shift division algorithm are shown in figs. 4.27 and 4.28.

Figure 4.27: Top level block of adaptive shift division algorithm


Figure 4.28: Adaptive shift division algorithm RTL schematic


4.4 Chapter summary

In this chapter, the design overview and methodology were explained for each of the two division algorithms: algorithm 1, the fixed shift division algorithm, and algorithm 2, the adaptive shift division algorithm. The difference in operation and implementation between the two algorithms was explained with reference to the step size, δ. In algorithm 1, the iterations are predetermined and are sequenced by a counter, while in algorithm 2 the shifts in δ are produced by dedicated hardware called the DAG. The DAG is a hierarchical implementation of scan units and scan blocks whose purpose is to calculate the difference Py − Px. This difference corresponds to an address in the LUT that holds the shifted binary value of δ.
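The address computation summarized above can be illustrated in software. The following Python sketch is only an illustration under stated assumptions: the 4-bit scan unit width, the function names and the list-based LUT are hypothetical choices made for this example and are not taken from the VHDL design.

```python
def scan_unit(nibble):
    """Return (found, position) of the most significant 1 in a 4-bit value.

    The 4-bit width is an assumption made for this sketch.
    """
    for pos in (3, 2, 1, 0):
        if (nibble >> pos) & 1:
            return True, pos
    return False, 0


def msb_position(value, width=32):
    """Scan 4-bit groups from the top; the first group reporting a 1 wins."""
    for group in range(width // 4 - 1, -1, -1):
        found, pos = scan_unit((value >> (4 * group)) & 0xF)
        if found:
            return 4 * group + pos
    return None  # no 1 anywhere in the word


# LUT indexed by the address Py - Px; entry k holds the shifted value 2**k.
DELTA_LUT = [1 << k for k in range(32)]

p_y = msb_position(1_176_349)  # dividend used in chapter 5
p_x = msb_position(127_773)    # divisor used in chapter 5
delta = DELTA_LUT[p_y - p_x]   # step size selected by the address Py - Px
```

For the operands used in chapter 5, the sketch gives Py = 20, Px = 16 and δ = 2^4.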

Chapter 5

Results and Evaluation

The aim of this chapter is to demonstrate that the two division algorithm designs presented in the previous chapter work as specified by the algorithms discussed in chapter 3. The implementation phase proved to be very challenging and required a considerable amount of testing, debugging and design revision to ensure proper functionality of the intended hardware. This chapter documents the tests and simulation results used to analyze the functionality and the performance of the two algorithms. Division algorithm 1, based on fixed shift, was constructed first to achieve a working division algorithm, and then algorithm 2, based on the adaptive shift technique, was constructed to produce the same division result. A comparative analysis was conducted between the two algorithms for their power consumption, device utilization, timing, area-delay product and power-delay product under three design goals: balanced, timing performance and power optimization. Some of the related work in the area is also compared in this chapter.

5.1 Numerical Simulation using MATLAB

The two algorithms were first implemented in software using MATLAB in order to verify that they yield the correct quotient and remainder when the dividend is divided by the divisor. The numerical simulation also provides a reference benchmark of the values in each iteration, against which the hardware implementation can be compared. These simulation numbers were important not only for verification but also for debugging the hardware description.

5.1.1 Numerical Simulation of Algorithm 1

Table 5.1 shows the numerical simulation in each iteration when Y = 1,176,349 is

divided by X = 127,773.

Range Reduction Mode of Y, Algorithm 1

i            Y(i+1)         Z(i+1)    δi = 2^(n−m−i)    µi × δi
Initialize   1,176,349      0         -                 -
0            -2092256483    16384     2^14              16384
1            -1045540067    8192      2^13              -8192
2            -522181859     4096      2^12              -4096
3            -260502755     2048      2^11              -2048
4            -129663203     1024      2^10              -1024
5            -64243427      512       2^9               -512
6            -31533539      256       2^8               -256
7            -15178595      128       2^7               -128
8            -7001123       64        2^6               -64
9            -2912387       32        2^5               -32
10           -868019        16        2^4               -16
11           154165         8         2^3               -8
12           -356927        12        2^2               4
13           -101381        10        2^1               -2
14           26392          9         2^0               -1

Post Processing Mode of Y and Z, Algorithm 1

Not required, results are obtained

Table 5.1: Iterations for Algorithm 1

The above table lists the values of Y(i+1) and Z(i+1) for each iteration i. Notice that the shift in δi decreases by 1 bit in every iteration; for this reason we call this algorithm the fixed shift division algorithm. In chapter 3, we discussed that in our work Y is 32 bits wide and X needs to be at least 17 bits wide, denoted by n and m respectively. The difference n − m = 15 bits gives the number of iterations required for the division operation, therefore we perform a total of 15 iterations. Since the result of the division for the chosen values of the operands Y and X satisfies equation (3.8), the “post processing mode of Y and Z” is not needed in the fixed shift division algorithm.
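The iterations of Table 5.1 can be reproduced with a few lines of software. The following Python sketch is not the MATLAB code used in this work; it assumes the update rule Y(i+1) = Y(i) − µi δi X and Z(i+1) = Z(i) + µi δi with µi = +1 when Y(i) ≥ 0 and µi = −1 otherwise, which is inferred from the tabulated values. The starting exponent of 14 is read from the δ column of the table, and the function name is hypothetical.

```python
def fixed_shift_divide(y, x, top_exponent=14):
    """Fixed shift range reduction; returns (Z, Y, rows of the iteration table).

    Assumed update rule (inferred from Table 5.1):
        mu = +1 if Y >= 0 else -1
        Y <- Y - mu * delta * X
        Z <- Z + mu * delta
    with delta halving (one bit shift) in every iteration.
    """
    z = 0
    rows = []
    for i in range(top_exponent + 1):
        delta = 1 << (top_exponent - i)
        mu = 1 if y >= 0 else -1
        y -= mu * delta * x
        z += mu * delta
        rows.append((i, y, z, mu * delta))
    return z, y, rows


quotient, remainder, rows = fixed_shift_divide(1_176_349, 127_773)
```

For Y = 1,176,349 and X = 127,773 the sketch performs 15 iterations and ends with Z = 9 and Y = 26392, matching the last row of the table. The post processing check of equation (3.8) is not modelled here.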

5.1.2 Numerical Simulation of Algorithm 2

Table 5.2 shows the numerical simulation in each iteration when Y = 1,176,349 is

divided by X = 127,773.

Range Reduction Mode of Y, Algorithm 2

i            Y(i+1)       Z(i+1)    δi = 2^(Py−Px)    µi × δi
Initialize   1,176,349    0         -                 -
0            -868019      16        2^4               16
1            154165       8         2^3               -8
2            -101381      10        2^1               2

Post Processing Mode of Y and Z, Algorithm 2

             26392        9         2^0               -1

Table 5.2: Iterations for Algorithm 2

The table lists the values of Y(i+1) and Z(i+1) for each iteration i. Notice that the shifts in δi do not decrease by a fixed 1 bit as in algorithm 1; for this reason we call this algorithm the adaptive shift division algorithm. As discussed in chapter 3, the number of iterations for the adaptive shift division algorithm is given by Py − Px. The most significant 1 in Y is at bit position 20 and the most significant 1 in X is at bit position 16, counting from bit position 0, the least significant bit. The difference between the two bit positions is 20 − 16 = 4, therefore the algorithm takes a total of 4 iterations to produce the result: three in the range reduction mode and one in the post processing mode. The iterations in the “range reduction mode of Y” end when Py ≤ Px; at this point the value of Y did not satisfy equation (3.20), therefore the algorithm goes into the “post processing mode of Y and Z” to obtain the correct result. Table 5.2 shows that far fewer iterations are required to achieve the division result than in Table 5.1.
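The adaptive shift iterations of Table 5.2 can be reproduced in the same way. The Python sketch below is an illustration under assumptions, not the MATLAB code of this work: the position of Y is taken as the leading 1 for a non-negative value and the leading 0 of the two's complement form for a negative value, and the post processing step is written as a simple correction loop that brings Y into the range 0 ≤ Y < X, since equation (3.20) is not restated in this chapter. All function names are hypothetical.

```python
def lead_position(value):
    """Position of the leading 1 (value >= 0) or leading 0 (value < 0)."""
    magnitude = value if value >= 0 else ~value
    return magnitude.bit_length() - 1


def adaptive_shift_divide(y, x):
    """Adaptive shift division for x > 0; returns (Z, Y, range reduction rows)."""
    z = 0
    rows = []
    p_x = lead_position(x)
    # Range reduction mode: runs while Py > Px.
    while lead_position(y) > p_x:
        delta = 1 << (lead_position(y) - p_x)
        mu = 1 if y >= 0 else -1
        y -= mu * delta * x
        z += mu * delta
        rows.append((y, z, mu * delta))
    # Post processing mode (assumed form): bring Y into 0 <= Y < X.
    while y < 0:
        y += x
        z -= 1
    while y >= x:
        y -= x
        z += 1
    return z, y, rows


quotient, remainder, rows = adaptive_shift_divide(1_176_349, 127_773)
```

For the operands of Table 5.2 the sketch performs three range reduction steps with δ = 2^4, 2^3 and 2^1, followed by one correction step, and returns Z = 9 and Y = 26392.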

5.2 Hardware Simulation

The two algorithms were designed, synthesized and implemented in VHDL using Xilinx ISE Project Navigator 13.4. The top level and overall RTL schematics of both division algorithms and the allied hardware modules were presented in chapter 4. VHDL test benches were created and simulated to verify that the hardware performs division correctly.

5.2.1 VHDL Simulation of Algorithm 1

Figure 5.1 shows a screen shot of the test bench output.

Figure 5.1: All iterations for Algorithm 1

Observing Y and Z in fig. 5.1, once “start” becomes “1”, an iteration is performed on every rising clock edge until the quotient and remainder of the division are obtained. For clarity, we break the figure down and examine zoomed views of the iterations in fig. 5.2 to fig. 5.5 in order to verify the iteration data against the numerical simulation data.

Figure 5.2: Iterations 0 to 2 for Algorithm 1

Figure 5.3: Iterations 3 to 7 for Algorithm 1

Figure 5.4: Iterations 8 to 11 for Algorithm 1

Figure 5.5: Iterations 12 to 14 for Algorithm 1

From the test bench screen shots above, it can be seen that the iteration data from the VHDL simulation is consistent with the numerical simulation data obtained in section 5.1.

5.2.2 VHDL Simulation of Algorithm 2

We now assess the functionality of division algorithm 2, the adaptive shift division algorithm. As in section 5.1, the adaptive shift technique requires considerably fewer iterations than the fixed shift technique; this is verified by observing Y and Z in fig. 5.6.

Figure 5.6: All iterations for Algorithm 2

5.3 Performance Evaluation

The hardware device chosen for the implementation is the Xilinx Spartan-3E xc3s1200e-4fg320. This Spartan-3E FPGA contains 1,200,000 system gates, 19,512 equivalent logic cells and 8,672 slices [12], of which the available logic consists of 17,344 flip flops, 17,344 4-input LUTs and 250 bonded IOBs. Apart from their use in logic, slices are also used for routing signals within the device. The study analyzes and compares the two division algorithms for their power consumption, device utilization and timing using the Xilinx ISE tool with respect to three design goals:

• Balanced.

• Timing performance.

• Power optimization.

These design goal profiles are predefined in the ISE Navigator tool and can be selected in the synthesis properties. In the balanced profile, the optimization goal is “speed” and the optimization effort is “normal”, while in the timing performance profile the optimization goal is “speed” with a “high” optimization effort. In the power optimization profile, the optimization goal is “area” and the optimization effort is “high”.

5.3.1 Device Utilization

Once the two division algorithms were successfully compiled, they were synthesized to assess the device utilization and performance. The device utilization results for the fixed shift division algorithm and the adaptive shift division algorithm are shown in Table 5.3 and Table 5.4 respectively.

The device utilization summary can be obtained as follows:

In the ISE Navigator Design pane > select Implementation (view) > select the design as “top module”.

In the Process pane > select Synthesize - XST > view the Design Summary (Synthesized) window.

Device Utilization Summary : Algorithm 1

Design goal         Balanced    Timing Performance    Power Optimization
Number of Blocks    457         462                   426
Flip Flops          70          86                    70
4-Input LUTs        253         242                   222
Occupied Slices     129         159                   134

Table 5.3: On-chip device utilization of Algorithm 1

The on-chip logic utilization summary shows that Algorithm 1 uses a total of 457 blocks for the balanced profile, 462 blocks for timing performance and 426 blocks for power optimization. In addition to the flip flops and 4-input LUTs listed in Table 5.3, these totals include 132 IOBs, 1 BUFGMUX and 1 MULT18X18SIO for each of the three profiles. All three profiles have a device utilization of 1% of our target device.

Device Utilization Summary : Algorithm 2

Design goal         Balanced    Timing Performance    Power Optimization
Number of Blocks    607         582                   585
Flip Flops          105         104                   104
4-Input LUTs        368         344                   347
Occupied Slices     194         182                   192

Table 5.4: On-chip device utilization of Algorithm 2

Algorithm 2 uses more area than Algorithm 1. In addition to the flip flops and 4-input LUTs listed in Table 5.4, the total number of blocks also includes 132 IOBs, 1 BUFGMUX and 1 MULT18X18SIO for each of the three profiles. All three profiles have a device utilization of 2% of our target device.

5.3.2 Timing Analysis

In the timing analysis we look at the clock frequency, the critical path delay and the overall completion time of the division operation for the two division algorithms. Table 5.5 and Table 5.6 show the timing summary of division algorithms 1 and 2 respectively.

Timing Analysis : Algorithm 1

Design goal                 Balanced    Timing Performance    Power Optimization
Clock Frequency [MHz]       81.374      83.008                77.48
Critical Path Delay [ns]    12.289      12.047                12.907

Division operation completion time = 155 ns

Table 5.5: Timing Summary of Algorithm 1

Timing Analysis : Algorithm 2

Design goal                 Balanced    Timing Performance    Power Optimization
Clock Frequency [MHz]       39.022      41.722                37.198
Critical Path Delay [ns]    25.626      23.968                26.883

Division operation completion time = 70 ns

Table 5.6: Timing Summary of Algorithm 2

The critical path delay determines the clock frequency and can be obtained by running the Synthesize - XST option from the Process pane in the ISE tool. Once the synthesis is complete, the timing report can be viewed by right clicking Synthesize - XST. This report also reveals the source and destination of the critical path in each of the profiles.

The data paths that cause the critical path delay for Algorithm 1 are the following:

• Balanced.
  source : counter_instance/temp_count_3 (FF)
  destination : regbank_instance/y_reg_alu_31 (FF).

• Timing performance.
  source : regbank_instance/y_reg_alu_1_1 (FF)
  destination : regbank_instance/z_reg_alu_31 (FF).

• Power optimization.
  source : regbank_instance/y_reg_alu_0 (FF)
  destination : regbank_instance/z_reg_alu_31 (FF).

The data paths that cause the critical path delay for Algorithm 2 are the following:

• Balanced.
  source : fsm_instance/pstate_internal_FSM_FFd3 (FF)
  destination : regbank_instance/y_reg_alu_31 (FF).

• Timing performance.
  source : fsm_instance/pstate_internal_FSM_FFd2 (FF)
  destination : regbank_instance/y_reg_alu_31 (FF).

• Power optimization.
  source : fsm_instance/pstate_internal_FSM_FFd2 (FF)
  destination : regbank_instance/y_reg_alu_31 (FF).

The overall division operation completion time was obtained by running the ISim simulation from the “Simulation” view of the ISE tool and double clicking “Simulate Behavioral Model”, which shows the test bench output. Vertical markers were then used to measure the time difference between the rising clock edge at which the “start” signal becomes high and the rising edge at which the “done” signal is set high.

With respect to the timing analysis data, it was observed that algorithm 2 has a clock frequency more than 50% lower than that of algorithm 1, which corresponds to roughly double the clock period, or delay. The addition of the DAG hardware increases the delay per clock cycle and thereby reduces the clock frequency, which results in lower circuit power consumption and increased reliability [13]. This is because the dynamic power consumption is related to the clock frequency: the higher the switching activity in the circuit, or the higher the clock frequency, the higher the dynamic power consumption. The DAG hardware also results in a shorter job (division operation) completion time, 70 ns compared with 155 ns, which verifies that Algorithm 2 completes the division in less than half the time taken by Algorithm 1.
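The relation between the critical path delay and the clock frequency, and the two percentages quoted above, can be checked with a short calculation. The Python sketch below uses only the numbers reported in the two timing summary tables; the variable and function names are chosen for this illustration.

```python
# Critical path delays in ns, as reported in the timing summaries.
delays_ns = {
    "Algorithm 1": {"balanced": 12.289, "timing": 12.047, "power": 12.907},
    "Algorithm 2": {"balanced": 25.626, "timing": 23.968, "power": 26.883},
}
# Division operation completion times in ns.
completion_ns = {"Algorithm 1": 155, "Algorithm 2": 70}


def clock_mhz(delay_ns):
    """Maximum clock frequency in MHz for a given critical path delay in ns."""
    return 1000.0 / delay_ns


freq = {
    alg: {goal: clock_mhz(d) for goal, d in goals.items()}
    for alg, goals in delays_ns.items()
}

# Relative drop in clock frequency of Algorithm 2 (balanced profile).
freq_drop = 1 - freq["Algorithm 2"]["balanced"] / freq["Algorithm 1"]["balanced"]

# Relative saving in completion time of Algorithm 2.
time_saving = 1 - completion_ns["Algorithm 2"] / completion_ns["Algorithm 1"]
```

With these inputs the clock frequency of Algorithm 2 is about 52% lower in the balanced profile, while its completion time is about 55% shorter.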

5.3.3 Power Consumption

The total on-chip power is the sum of the static power and the dynamic power. The static power results mainly from the leakage current of the transistors within the device and exists even when a transistor is logically “OFF”. The dynamic power depends on the switching activity defined in [14]. Because the static power depends on the device, the total power consumption of the same design changes if a different target device is used; therefore the data presented in Table 5.7 refers to dynamic power.

The ISE XPower tool can be used as follows:

In the ISE Navigator Process pane > select Implement Design > Place & Route > Analyze Power Distribution (XPower Analyzer) > go to the XPA tool window and set the clock frequency in the tree drop down list > go to Tools in the menu bar > Update Power Analysis.

Design Power Consumption [mW]

Design goal           Algorithm 1    Algorithm 2    % difference
Balanced              87             31             64.37%
Timing Performance    93             38             59.14%
Power Optimization    68             31             54.41%

Table 5.7: On-chip power consumptions.

The last column in the table, the % difference between the power consumption of the two designs, clearly indicates that the adaptive shift division algorithm is the more power efficient of the two. For all three design goals, the adaptive shift division algorithm consumes over 50% less power than the fixed shift division algorithm. It is worth mentioning that an increase in the area of a design generally increases its power consumption. Algorithm 2 occupies more area than Algorithm 1, yet this was not found to be the case for Algorithm 2. The reason for this, to the best of our knowledge, is that the Xilinx implementation of the division through Algorithm 2 results in fewer gates switching, which causes less switching activity and hence less dynamic power consumption.
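The % difference column of Table 5.7 is the reduction in power relative to Algorithm 1. As a quick check, the following Python sketch recomputes it from the tabulated power figures; the names used are chosen for this illustration.

```python
# Power consumption in mW per design goal: (Algorithm 1, Algorithm 2).
power_mw = {
    "Balanced": (87, 31),
    "Timing Performance": (93, 38),
    "Power Optimization": (68, 31),
}


def percent_reduction(reference, value):
    """Reduction of `value` relative to `reference`, in percent."""
    return 100.0 * (reference - value) / reference


reduction = {
    goal: round(percent_reduction(alg1, alg2), 2)
    for goal, (alg1, alg2) in power_mw.items()
}
```

The computed reductions are 64.37%, 59.14% and 54.41%, in agreement with the table.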

5.3.4 Power-Delay Product

The power-delay product and area-delay product are two figures of merit considered by many designers in digital electronics, and hence we include these metrics to enable prospective designers to determine the trade off between the designs. The power-delay product can also be referred to as the “switching energy” of the digital circuit and is given by the power consumption over a switching event. Table 5.8 shows the power-delay product for the two algorithms. This figure of merit is an energy and is measured in joules [J]; the tabulated values are the product of the power in mW and the critical path delay in ns, which gives picojoules [pJ].

Since the target device we have chosen is very large compared to the overall area utilization of the designs (1% for Algorithm 1 and 2% for Algorithm 2), the leakage power is enormous; therefore we chose the dynamic power of the circuit to calculate this figure of merit.

Power-Delay Product [pJ]

Design goal           Algorithm 1    Algorithm 2
Balanced              1069.143       794.406
Timing Performance    1120.371       910.784
Power Optimization    877.676        833.373

Table 5.8: Power-delay product for Algorithm 1 and 2.
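The entries of Table 5.8 follow from multiplying the power in mW by the critical path delay in ns; since 1 mW × 1 ns = 1 pJ, the products are in picojoules. The short Python sketch below repeats the calculation from the reported power and delay figures; the names are chosen for this illustration.

```python
# (power in mW, critical path delay in ns) per design goal.
measurements = {
    "Algorithm 1": {
        "Balanced": (87, 12.289),
        "Timing Performance": (93, 12.047),
        "Power Optimization": (68, 12.907),
    },
    "Algorithm 2": {
        "Balanced": (31, 25.626),
        "Timing Performance": (38, 23.968),
        "Power Optimization": (31, 26.883),
    },
}


def power_delay_pj(power_mw, delay_ns):
    """Power-delay product in picojoules (1 mW * 1 ns = 1 pJ)."""
    return power_mw * delay_ns


pdp = {
    alg: {goal: power_delay_pj(p, d) for goal, (p, d) in goals.items()}
    for alg, goals in measurements.items()
}
```

For example, the balanced profile of Algorithm 1 gives 87 × 12.289 = 1069.143 pJ.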

5.3.5 Area-Delay Product

In simple terms, the area-delay product can be defined as the number of LUTs × the clock period of the design [15]. For our calculation we also consider the flip flops utilized. Table 5.9 shows the area-delay product for the two algorithms; this figure of merit is calculated as (Flip Flops + 4-input LUTs) × clock period of the design, with the clock period in ns.

Area-Delay Product [(FF+4LUT)·ns]

Design goal           Algorithm 1    Algorithm 2
Balanced              3969.347       12121.098
Timing Performance    3951.416       10737.664
Power Optimization    3768.844       12124.233

Table 5.9: Area-delay product for Algorithm 1 and 2.
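The entries of Table 5.9 can be recomputed from the device utilization and timing results. The Python sketch below applies (Flip Flops + 4-input LUTs) × clock period, with the clock period taken as the critical path delay in ns; the names are chosen for this illustration.

```python
# (flip flops, 4-input LUTs, clock period in ns) per design goal.
designs = {
    "Algorithm 1": {
        "Balanced": (70, 253, 12.289),
        "Timing Performance": (86, 242, 12.047),
        "Power Optimization": (70, 222, 12.907),
    },
    "Algorithm 2": {
        "Balanced": (105, 368, 25.626),
        "Timing Performance": (104, 344, 23.968),
        "Power Optimization": (104, 347, 26.883),
    },
}


def area_delay(flip_flops, luts, period_ns):
    """Area-delay product in (FF + 4LUT) * ns."""
    return (flip_flops + luts) * period_ns


adp = {
    alg: {goal: area_delay(*values) for goal, values in goals.items()}
    for alg, goals in designs.items()
}
```

For example, the balanced profile of Algorithm 2 gives (105 + 368) × 25.626 = 12121.098.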

5.4 Comparison of Work in Related Area

Firstly, as discussed in section 2.3 and to the best of our knowledge from reviewing related work in the area, most work on division targets high speed dividers, particularly in floating point arithmetic, using techniques such as high speed adders/subtracters and multiplication, precomputed values and look-up tables to speed up division. Several authors have implemented division in higher radix or with different operand sizes, while our work focuses on a radix 2 implementation with 32 bit operands. Our implementation of division is carried out in 32 bit signed fixed point arithmetic without any of the above mentioned techniques; instead we vary δ, the step size of the bit shift, to perform division and speed up the overall operation. Finding exactly comparable work was therefore difficult.

The authors of [16] stress that work on the power consumption of dividers in FPGAs is fairly limited, which is what we encountered as well while searching for comparable work on the power consumption analysis of integer division in FPGAs.

Lastly, some of the other work on division is implemented in ASIC and custom CMOS technologies; given the differences in design implementation, area and power consumption among these technologies, we do not compare that work with ours. In other instances, such as [17], the authors implemented a 16 bit fixed point complex divider using pipelined CORDIC to divide complex numbers in polar coordinates; because this and similar work focuses on a different coordinate system, we do not compare it to ours. Moreover, there has not been much work on FPGA based fixed point integer division in recent years and, to the best of our knowledge, the work in [9][10][18][19] is among the closest recent work that is somewhat comparable with our results.

The authors of [9] implemented division using a high speed adder/subtractor with a 10 bit divisor and a 9 bit dividend, resulting in a 9 bit quotient and a 10 bit remainder. Although it is not clearly stated, judging from the operands used in the example, this work focuses on fixed point integer division; details about power, delay, frequency and latency are not provided. The authors of [10] implemented high speed fixed point dividers based on an adder/subtracter or conditional adder and ripple carry adders; details about power are not reported. The authors of [18] provided results for a fixed point divider in an FPGA that uses precomputed values: the input is scaled so that the denominator has a value between 0.5 and 1, and the inverse of this denominator is then multiplied by the numerator. Details about power consumption and area utilization are not provided. The authors of [19] present a power consumption analysis of 6 ÷ 3 bit integer division in FPGAs, based on an ancient Vedic mathematics technique. Table 5.10 summarizes the results from these authors along with our designs.

Fixed Point Division : Results Summary

Scheme                Bits        # of slices   LUT     Power [mW]   Delay [ns]   Frequency [MHz]   Latency [ns]
Algorithm 1*          32          129           253     87           12.29        81.37             155
Algorithm 1**         32          159           242     93           12.05        83.00             155
Algorithm 1***        32          134           222     68           12.91        77.48             155
Algorithm 2*          32          194           368     31           25.63        39.02             70
Algorithm 2**         32          182           344     38           23.97        41.72             70
Algorithm 2***        32          192           347     31           26.88        37.198            70
[9]                   10×9        103           176     -            -            -                 -
[10] Combinational    32          687           1152    -            85.0         -                 -
[10] Pipelined        32          1273          -       -            3.8          263               129.2
[10] Sequential       32          201           -       -            3.8          263               125.4
[18]                  32/16 I/O   -             647†    -            -            -                 0.03 [µs]
[19]                  6÷3         -             -       93           41§          250               -

* Balanced design goal
** Timing performance design goal
*** Power optimization design goal
† TLE - Total logic elements
§ Propagation delay

Table 5.10: Summary of related work in the area

5.5 Chapter Summary

In this chapter, the two division algorithms were numerically simulated to verify that they perform division correctly, and were then designed in hardware using VHDL, which showed that the algorithms can be implemented in hardware with the same results as the numerical simulation. We then analyzed and evaluated the performance of the digital designs of the two division algorithms.

The fixed shift division algorithm (algorithm 1) demonstrated lower device logic utilization than the adaptive shift division algorithm (algorithm 2). In the power consumption analysis, the adaptive shift division algorithm consumed significantly less power than the fixed shift algorithm. In the timing analysis, we studied the critical path delay and the overall division operation completion time for both designs. We then provided the power-delay product and the area-delay product of the two designs. Lastly, we compared our fixed point dividers with some of the work published in the related area in recent years. These results are very useful to our study since they allow us to assess, from the design point of view, which algorithm is needed in a given application. In applications where power efficiency and higher speed are required, the adaptive shift divider algorithm is recommended; in cost sensitive applications, the fixed shift divider algorithm can be implemented.

Chapter 6

Conclusion, Contributions and Future Work

6.1 Conclusion

This thesis considered the implementation of the division operation in hardware, which in the past has not received as much attention as multipliers and adders. Moreover, the implementation of a power efficient division algorithm was one of the major objectives of the thesis and research work. The division operation is critical in cryptographic and security processors, and a divider is often employed as a co-processor in cryptographic and encryption processors. A power efficient divider not only saves power, it also helps reduce the heat signature of the digital device, which is an important aspect since most side-channel analyses and attacks monitor the heat dissipation profile. The two division algorithms considered are based on the digit recurrence non-restoring division technique, using the simple long division concept performed by hand. The algorithms were numerically simulated in MATLAB to confirm that they work and yield the correct quotient and remainder for a given dividend and divisor. The hardware implementation was verified through VHDL simulation and yielded results consistent with the numerical simulation.

Three different analyses were made on the two algorithms: device utilization, power consumption and timing analysis. The three analyses were obtained for three different design goals based on the optimization profiles preset in the synthesis tool. The fixed shift division algorithm works on predetermined fixed shifts in δ: the shift changes by one bit position with every iteration, corresponding to a shifted value of the form (2^i). The adaptive shift algorithm does not have fixed one-bit shifts in the value of δ, hence the name adaptive. The adaptive shifts are generated by special hardware that compares the position of the most significant “1” or “0” in the given dividend with the position of the most significant “1” in the divisor. The difference between the two positions gives the shift for the algorithm. The adaptive shift algorithm performed the same division operation and generated the results in far fewer iterative steps. In doing so, it utilized more area on the device and has increased routing and logic delays. The adaptive shift division algorithm also demonstrated lower power consumption than the fixed shift algorithm. The two designs were designed and implemented using a simple yet efficient methodology.

It is also necessary to mention that this work is part of Dr. Gebali’s work on a Pseudo Random Number Generator (PRNG) based on the Park Miller Algorithm. The Park Miller Algorithm requires a division operation and generates random numbers using the quotient and remainder. The results of this thesis can serve as a point of comparison for implementations of integer division in any future work.
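To illustrate why the Park Miller generator needs a quotient and a remainder, the following Python sketch shows one common way of computing a step of the generator, using the widely published “minimal standard” parameters and Schrage's decomposition. It is given only as an illustration and is not taken from the PRNG design referred to above; the function name is hypothetical. For these parameters the divisor 127,773 coincides with the value of X used in the simulations of chapter 5.

```python
A = 16807            # multiplier of the "minimal standard" generator
M = 2**31 - 1        # modulus, a Mersenne prime
Q, R = divmod(M, A)  # Q = 127773, R = 2836


def park_miller_next(seed):
    """One step seed -> (A * seed) mod M using Schrage's decomposition.

    Only the quotient and remainder of seed / Q are needed, so no
    intermediate value exceeds 32 bits for 0 < seed < M.
    """
    hi, lo = divmod(seed, Q)  # quotient and remainder of the division
    t = A * lo - R * hi
    return t if t > 0 else t + M


first = park_miller_next(1)
second = park_miller_next(first)
```

Starting from a seed of 1, the first two values produced are 16807 and 282475249.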

6.2 Contributions

The following contributions are made in this thesis:

1. Designed and implemented two signed integer division algorithms for performing the division operation in hardware.

2. Verified the hardware design by developing MATLAB code to confirm the correctness and accuracy of the hardware implemented in VHDL.

3. Compared the performance of the two division algorithms from the viewpoint of device utilization (area), power consumption and timing analysis (delay).

4. Adapted the high-radix technique proposed in [5] for floating point arithmetic to integer arithmetic.

6.3 Future Work

This work provided hardware implementations for 32-bit machines. The work can be

extended in two directions:

1. Random number generation for 64- or even 128-bit machines.

2. Random number generation for elliptic curve cryptography where the integers

are represented by 200 to 600 bits.

Random number generation in mobile devices typically uses 32 bits, which is what we use in this work. Most current systems use a library RAND function to generate random numbers for use in applications. When a higher number of bits is needed, say 64 bits or more, the RAND function could have limitations in generating high quality random numbers for the desired application; to cater for this, our work needs to be extended and scaled accordingly.

The number of bits required in elliptic curve cryptography typically varies between 200 and 600 bits; for such scenarios our work will need to be adapted accordingly. Two ways to do this are:

1. Increasing the size of the components to the required bit size instead of 32 bits.

2. Using the 32-bit design and concatenating the resulting numbers to represent an arbitrary size integer.

To improve system speed, high-radix implementations could be considered for the

calculation of the terms δi. Pipelining the operations might not lead to improved

latency since increased clock speed is accompanied by increased number of pipeline

stages.

Bibliography

[1] N.M. Nayeem, M.A. Hossain, M. Haque, L. Jamal, and H. Babu. Novel reversible division hardware. In Circuits and Systems, 2009. MWSCAS ’09. 52nd IEEE International Midwest Symposium on, pages 1134–1138, Aug 2009.

[2] N. Takagi, S. Kadowaki, and K. Takagi. A hardware algorithm for integer division. In Computer Arithmetic, 2005. ARITH-17 2005. 17th IEEE Symposium on, pages 140–146, June 2005.

[3] S.F. Oberman and M. Flynn. Design issues in division and other floating-point operations. Computers, IEEE Transactions on, 46(2):154–161, Feb 1997.

[4] S. K. Park and K. W. Miller. Random number generators: Good ones are hard to find. Commun. ACM, 31(10):1192–1201, Oct 1988.

[5] Fayez Elguibaly and A. Rayhan. HCORDIC: a high-radix adaptive CORDIC algorithm. Canadian Journal of Electrical and Computer Engineering, 25(4):149, 2000.

[6] Shlomo Waser and Michael J. Flynn. Introduction to arithmetic for digital systems designers. Holt, Rinehart and Winston, New York, 1982.

[7] R.P. Brent and P. Zimmermann. Modern Computer Arithmetic. Cambridge Monographs on Applied and Computational Mathematics. Cambridge University Press, 2010.

[8] B. Parhami. Computer Arithmetic: Algorithms and Hardware Designs. Oxford series in electrical and computer engineering. Oxford University Press, 2010.

[9] Sukhmeet Kaur and Rajeev Agarwal. VHDL implementation of non restoring division algorithm using high speed adder/subtractor. International Journal of Advanced Research in Electrical, Electronics and Instrumentation Engineering, 2(7), 2013.

[10] G. Sutter and J. Deschamps. High speed fixed point dividers for FPGAs. In Field Programmable Logic and Applications, 2009. FPL 2009. International Conference on, pages 448–452, Aug 2009.

[11] A. Nannarelli and T. Lang. Low-power divider. Computers, IEEE Transactions on, 48(1):2–14, Jan 1999.

[12] Xilinx. Xilinx Power Tools Spartan-6 and Virtex-6 FPGAs. http://www.xilinx.com/support/documentation/sw_manuals/xilinx11/ug733.pdf, March 2010. Accessed: April 2015.

[13] Anju S Pillai and TB Isha. Factors causing power consumption in an embedded processor - a study. International Journal of Application or Innovation in Engineering & Management (IJAIEM), 2(7), 2013.

[14] Xilinx. Spartan-3E FPGA Family Data Sheet. http://www.xilinx.com/support/documentation/data_sheets/ds312.pdf, July 2013. Accessed: April 2015.

[15] Xilinx User Community. Area delay product of FPGA designs. http://forums.xilinx.com/xlnx/board/crawl_message?board.id=IMPBD&message.id=5111, March 2012. Accessed: June 2015.

[16] Ruzica Jevtic, Bojan Jovanovic, and Carlos Carreras. Power estimation of dividers implemented in FPGAs. In Proceedings of the 21st Edition of the Great Lakes Symposium on VLSI, GLSVLSI ’11, pages 313–318, New York, NY, USA, 2011. ACM.

[17] Dong Wang, Pengju Ren, and Leibo Liu. A high-throughput fixed-point complex divider for FPGAs. IEICE Electronics Express, 10(4):20120879–20120879, 2013.

[18] Muhammad Firmansyah Kasim, Trio Adiono, Muhammad Fahreza, and Muhammad Fadhli Zakiy. FPGA implementation of fixed-point divider using pre-computed values. Procedia Technology, 11(0):206–211, 2013. 4th International Conference on Electrical Engineering and Informatics, ICEEI 2013.

[19] D. Kumar, A. Sharma, and P. Saha. Integer division technique for signal processing applications. In Proceedings of the 9th International Conference on Ubiquitous Information Management and Communication, IMCOM ’15, pages 52:1–52:4, New York, NY, USA, 2015. ACM.

Chapter 7

Additional Information

7.1 Interpretation of signals

Signal            Bit(s)  Description
x_in              32      initial value of X, the divisor
y_in              32      initial value of Y, the dividend
mux_to_y          32      data path from data multiplexer to register Y
mux_to_z          32      data path from data multiplexer to register Z
enable_x          1       control signal from FSM to register X
enable_yz         1       control signal from FSM to registers Y and Z
clk               1       system clock signal
x_reg_alu         32      data path from register X to ALU
y_reg_alu         32      data path from register Y to ALU
z_reg_alu         32      data path from register Z to ALU
y_alu_reg         32      feedback data path from ALU carrying the value of Y(i+1)
z_alu_reg         32      feedback data path from ALU carrying the value of Z(i+1)
f_ypos            1       flag indicating whether Y is positive or negative
f_ygtex           1       flag indicating that Y is greater than or equal to X
address           5       LUT address input from the counter in Algorithm 1
delta_alu         32      data path carrying the value of δ from the LUT to the ALU
y_add_sub         1       FSM control signal telling the ALU to perform addition or subtraction
reset             1       resets the system and initializes the divider
start             1       starts the division operation
counter_enable    1       control signal input to the counter from the FSM
f_i               1       flag indicating completion of iteration
count             5       counter output value to the LUT in Algorithm 1
count_enable      1       FSM control signal output to the counter
sel               1       selects external or feedback data values
done              1       indicates completion of the division operation
sel_x             1       selects data from X in the DAG in Algorithm 2
enable_reg_px     1       enables writing data to register Px in the DAG
flag_out          1       flag output from the DAG (don't care)
delta_address     5       LUT address input from the DAG in Algorithm 2
flag_out          1       flag output from the DAG and position finder unit
number            32      input data value (operand) in the position finder unit
f_mux             1       position finder input selecting the X or Y flag
position (of Y)   5       input value of Py for the number subtractor
position_x        5       input value of Px for the number subtractor
enable_reg_x      1       write enable for register Px inside the DAG


Chapter 8

Used Terms and Acronyms

• RNG, Random Number Generator

• PRNG, Pseudo Random Number Generator

• VHDL, VHSIC Hardware Description Language

• ROM, Read Only Memory

• CORDIC, COordinate Rotation DIgital Computer

• FPGA, Field Programmable Gate Array

• LUT, Look Up Table

• H-CORDIC, High Performance Adaptive CORDIC

• ALU, Arithmetic and Logical Unit

• FSM, Finite State Machine

• DAG, Delta Address Generator

• MSB, Most Significant Bit

• LSB, Least Significant Bit

• RTL, Register Transfer Level

• MATLAB, Matrix Laboratory (MathWorks, Inc. computation tool)

• IOB, Input/Output Block


• ISE, Integrated Synthesis Environment (Xilinx HDL tool)

• BUFGMUXs, multiplexed global clock buffers that can select between two input clocks

• MULT18X18SIOs, dedicated multipliers in the (Xilinx) target device

• XST, Xilinx Synthesis Technology

• ISIM, ISE Simulator

• ASIC, Application-Specific Integrated Circuit

• CMOS, Complementary Metal-Oxide-Semiconductor