An Efficient Implementation of Floating Point Multiplier
ABSTRACT
This paper describes an efficient implementation of an IEEE 754 single precision
floating point multiplier targeted for a Xilinx Virtex-5 FPGA. VHDL is used to implement a
technology-independent pipelined design. The multiplier implementation handles the overflow
and underflow cases. Rounding is not implemented, to give more precision when using the
multiplier in a Multiply and Accumulate (MAC) unit. With a latency of three clock cycles the
design achieves 301 MFLOPs. The multiplier was verified against the Xilinx floating point
multiplier core.
1. INTRODUCTION
Floating point numbers are one way of representing real numbers in binary
format; the IEEE 754 standard defines two floating point formats, the binary interchange
format and the decimal interchange format. Multiplying floating point numbers is a critical
requirement for DSP applications involving large dynamic range. This paper focuses only on the
single precision normalized binary interchange format. Fig. 1 shows the IEEE 754 single
precision binary format representation; it consists of a one-bit sign (S), an eight-bit exponent (E),
and a twenty-three-bit fraction (M, or mantissa). An extra bit is added to the fraction to form what
is called the significand. If the exponent is greater than 0 and smaller than 255, and there is a 1 in
the MSB of the significand, then the number is said to be a normalized number.
where M = m22*2^-1 + m21*2^-2 + ... + m0*2^-23 and Bias = 127. The value represented is
Z = (-1)^S * 2^(E - Bias) * (1.M).
SIGNIFICAND:
The significand is the mantissa with an extra MSB bit (the hidden '1').
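As an illustrative aside (ours, not part of the paper's hardware design), the field layout above can be checked in software; the helper below is a hypothetical Python sketch that unpacks a single-precision value into its S, E and significand fields:

```python
import struct

def decode_single(x: float):
    """Unpack an IEEE 754 single-precision value into (S, E, significand)."""
    bits = struct.unpack(">I", struct.pack(">f", x))[0]
    s = bits >> 31                  # 1-bit sign
    e = (bits >> 23) & 0xFF         # 8-bit biased exponent
    m = bits & 0x7FFFFF             # 23-bit fraction (mantissa)
    # For a normalized number (0 < E < 255) the significand is the mantissa
    # with the hidden '1' prepended as the extra MSB: the integer (1 << 23) | m.
    significand = ((1 << 23) | m) if 0 < e < 255 else m
    return s, e, significand

s, e, sig = decode_single(-7.5)
# -7.5 = (-1)^1 * 2^(129-127) * 1.875, so s = 1 and e = 129
```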
OVERVIEW OF INDUSTRY:
Very-large-scale integration (VLSI) is the process of creating integrated circuits by
combining thousands of transistors into a single chip.
1.1 Developments:
The first semiconductor chips held two transistors each. Subsequent advances added more
and more transistors, and, as a consequence, more individual functions or systems were integrated
over time. The first integrated circuits held only a few devices, perhaps as many as ten diodes,
transistors, resistors and capacitors, making it possible to fabricate one or more logic gates on a
single device. Now known retrospectively as small-scale integration (SSI), improvements in technique
led to devices with hundreds of logic gates, known as medium-scale integration (MSI). Further
improvements led to large-scale integration (LSI), i.e. systems with at least a thousand logic gates.
Current technology has moved far past this mark, and today's microprocessors have many millions of
gates and billions of individual transistors.
At one time, there was an effort to name and calibrate various levels of large-scale integration
above VLSI. Terms like ultra-large-scale integration (ULSI) were used. But the huge number of
gates and transistors available on common devices has rendered such fine distinctions moot. Terms
suggesting greater than VLSI levels of integration are no longer in widespread use.
As of early 2008, billion-transistor processors are commercially available. This became more
commonplace as semiconductor fabrication advanced from the then-current generation of 65 nm
processes. A notable example is Nvidia's 280 series GPU: almost
all of its 1.4 billion transistors are used for logic, in contrast to the Itanium, whose large transistor
count is largely due to its 24 MB L3 cache. Current designs, unlike the earliest devices, use extensive
design automation and automated logic synthesis to lay out the transistors, enabling higher levels of
complexity in the resulting logic functionality. Certain high performance logic blocks like SRAM
cell, however, are still designed by hand to ensure the highest efficiency. VLSI technology may be
moving toward further radical miniaturization with introduction of NEMS technology.
Structured VLSI design is a modular methodology originated by Carver Mead and Lynn
Conway for saving microchip area by minimizing the interconnect fabric's area. This is obtained by
repetitive arrangement of rectangular macro blocks which can be interconnected by wiring by
abutment: an example is partitioning the layout of an adder into a row of equal bit-slice cells. In
complex designs this structuring may be achieved by hierarchical nesting.
Structured VLSI design was popular in the early 1980s, but lost its popularity later
with the advent of placement and routing tools, which waste a great deal of area on routing; the
waste is tolerated because of the progress of Moore's Law. When introducing the hardware description
language KARL in the mid-1970s, Reiner Hartenstein coined the term "structured VLSI design"
(originally "structured LSI design"), echoing Edsger Dijkstra's structured programming
approach of procedure nesting to avoid chaotic, spaghetti-structured programs.
1.2 Challenges:
As microprocessors become more complex due to technology scaling, microprocessor
designers have encountered several challenges which force them to think beyond the design
plane, and look ahead to post-silicon:
Power usage/ Heat dissipation - As threshold voltages have ceased to scale with
advancing process technology, dynamic power dissipation has not scaled
proportionally. Maintaining logic complexity when scaling the design down only means
that the power dissipation per area will go up. This has given rise to techniques such as
dynamic voltage and frequency scaling (DVFS) to minimize overall power.
Process variation - As photolithography techniques tend closer to the fundamental laws
of optics, achieving high accuracy in doping concentrations and etched wires is
becoming more difficult and prone to errors due to variation. Designers now must
simulate across multiple fabrication process corners before a chip is certified ready for
production.
Stricter design rules - Due to lithography and etch issues with scaling, design rules for
layout have become increasingly stringent. Designers must keep ever more of these
rules in mind while laying out custom circuits. The overhead for custom design is now
reaching a tipping point, with many design houses opting to switch to electronic design
automation (EDA) tools to automate their design process.
Timing/design closure - As clock frequencies tend to scale up, designers are finding it
more difficult to distribute and maintain low clock skew between these high frequency
clocks across the entire chip. This has led to a rising interest in multi core
and multiprocessor architectures, since an overall speedup can be obtained by lowering
the clock frequency and distributing processing.
First-pass success - As die sizes shrink (due to scaling), and wafer sizes go up (to lower
manufacturing costs), the number of dies per wafer increases, and the complexity of
making suitable photo masks goes up rapidly. A mask set for a modern technology can
cost several million dollars. This non-recurring expense deters the old iterative
philosophy involving several "spin-cycles" to find errors in silicon, and encourages
first-pass silicon success. Several design philosophies have been developed to aid this
new design flow, including design for manufacturing (DFM), design for test (DFT), and
Design for X.
1.3 Applications of VLSI:
Electronic systems now perform a wide variety of tasks in daily life. Electronic
systems in some cases have replaced mechanisms that operated mechanically, hydraulically, or
by other means; electronics are usually smaller, more flexible, and easier to service. In other
cases electronic systems have created totally new applications. Electronic systems perform a
variety of tasks, some of them visible, some more hidden:
Personal entertainment systems such as portable MP3 players and DVD
players perform sophisticated algorithms with remarkably little energy.
Electronic systems in cars operate stereo systems and displays; they also
control fuel injection systems, adjust suspensions to varying terrain, and
perform the control functions required for anti-lock braking (ABS) systems.
Digital electronics compress and decompress video, even at high-definition
data rates, on-the-fly in consumer electronics.
Low-cost terminals for Web browsing still require sophisticated electronics,
despite their dedicated function.
Personal computers and workstations provide word-processing, financial
analysis, and games. Computers include both central processing units (CPUs)
and special-purpose hardware for disk access, faster screen display, etc.
Medical electronic systems measure bodily functions and perform complex
processing algorithms to warn about unusual conditions. The availability of
these complex systems, far from overwhelming consumers, only creates
demand for even more complex systems.
The growing sophistication of applications continually pushes the design and manufacturing of
integrated circuits and electronic systems to new levels of complexity. And perhaps the most
amazing characteristic of this collection of systems is its variety: as systems become more
complex, we build not a few general-purpose computers but an ever wider range of special-
purpose systems. Our ability to do so is a testament to our growing mastery of both integrated
circuit manufacturing and design, but the increasing demands of customers continue to test the
limits of design and manufacturing.
2. VERILOG HDL
Verilog HDL is a hardware description language that can be used to model a digital
system at many levels of abstraction ranging from the algorithmic-level to the gate-level to the
switch-level. The complexity of the digital system being modeled could vary from that of a
simple gate to a complete electronic digital system, or anything in between. The digital system
can be described hierarchically and timing can be explicitly modeled within the same
description.
The Verilog HDL language includes capabilities to describe the behavioral nature of a
design, the dataflow nature of a design, a design's structural composition, delays and a waveform
generation mechanism including aspects of response monitoring and verification, all modeled
using one single language. In addition, the language provides a programming language interface
through which the internals of a design can be accessed during simulation including the control
of a simulation run.
The language not only defines the syntax but also defines very clear simulation
semantics for each language construct. Therefore, models written in this language can be
verified using a Verilog simulator. The language inherits many of its operator symbols and
constructs from the C programming language. Verilog HDL provides an extensive range of
modeling capabilities, some of which are quite difficult to comprehend initially. However, a core
subset of the language is quite easy to learn and use. This is sufficient to model most
applications.
2.1 History:
Verilog HDL was first developed by Gateway Design Automation in 1983
as a hardware modeling language for their simulator product. At that time, it was a proprietary
language. Because of the popularity of the simulator product, Verilog HDL gained acceptance as
a usable and practical language by a number of designers. In an effort to increase the popularity
of the language, it was placed in the public domain in 1990. Open Verilog
International (OVI) was formed to promote Verilog. In 1992 OVI decided to pursue
standardization of Verilog HDL as an IEEE standard. This effort was successful and the language
became an IEEE standard in 1995. The complete standard is described in the Verilog hardware
description language reference manual. The standard is called IEEE Std 1364-1995.
2.2 Major Capabilities:
Listed below are the major capabilities of the verilog hardware description:
Primitive logic gates, such as and, or and nand, are built into the language.
Flexibility of creating a user-defined primitive (UDP). Such a primitive could either be a
combinational logic primitive or a sequential logic primitive.
Switch-level modeling primitives, such as pmos and nmos, are also built into the
language.
Explicit language constructs are provided for specifying pin-to-pin delays, path delays
and timing checks of a design.
A design can be modeled in three different styles or in a mixed style. These styles are:
behavioral style - modeled using procedural constructs; dataflow style - modeled using
continuous assignments; and structural style - modeled using gate and module
instantiations.
There are two data types in Verilog HDL; the net data type and the register data type. The
net type represents a physical connection between structural elements while a register
type represents an abstract data storage element.
Verilog HDL also has built-in logic functions such as & (bitwise-and) and | (bitwise-or).
High-level programming language constructs such as conditions, case statements, and
loops are available in the language.
Notion of concurrency and time can be explicitly modeled.
Powerful file read and write capabilities are provided.
The language is non-deterministic under certain situations, that is, a model may produce
different results on different simulators; for example, the ordering of events on an event
queue is not defined by the standard.
2.3 Synthesis:
Synthesis is the process of constructing a gate level netlist from a register-transfer level
model of a circuit described in Verilog HDL. Figure 2-2 shows such a process. A synthesis
system may, as an intermediate step, generate a netlist that comprises register-transfer level
blocks such as flip-flops, arithmetic-logic units, and multiplexers, interconnected by wires. In
such a case, a second program called the RTL module builder is necessary. The purpose of this
builder is to build, or acquire from a library of predefined components, each of the required RTL
blocks in the user-specified target technology.
Having produced a gate level netlist, a logic optimizer reads in the netlist and optimizes
the circuit for the user-specified area and timing constraints. These area and timing constraints
may also be used by the module builder for appropriate selection or generation of RTL blocks. In
this book, we assume that the target netlist is at the gate level. The logic gates used in the
synthesized netlists are described in Appendix B. The module building and logic optimization
phases are not described in this book.
The above figure shows the basic elements of Verilog HDL and the elements used in
hardware. A mapping mechanism or a construction mechanism has to be provided that translates
the Verilog HDL elements into their corresponding hardware elements as shown in figure.2-3
Figure 2-2: Synthesis process
2.4 Advantages of Verilog HDL:
1. It is not possible to describe the functionality of digital circuits using higher-level languages
such as FORTRAN or C, because these are sequential in nature. Hardware Description
Languages (HDLs) came into existence for this purpose.
2. Designs can be described and implemented at a very abstract (high) level.
3. Technology-independent implementation. Functional verification of the design can be done
early in the design cycle.
4. One can optimize and modify the design description until it meets the desired functionality as
well as required specifications. Most design bugs are eliminated before going to implementation
(chip).
5. Designing with HDLs is analogous to computer programming; a textual description is an
easier way to develop and debug circuits.
6. Design reusability, short development time and easy modification of the design.
7. Verilog HDL is non-proprietary and is an IEEE standard.
8. Switch-level modeling primitives, such as pmos and nmos, are also built into the
language.
9. It is human and machine readable. Thus it can be used as an exchange language between tools
and designers.
10. Verilog HDL can be used to perform response monitoring of the design under test; that is, the
values of a design under test can be monitored and displayed. These values can be compared
with expected values, and in case of a mismatch, a report message can be printed.
3. FPGA DESIGN FLOW
This chapter deals with the implementation flow, specifying the significance
of various properties, the reports obtained, and the simulation waveforms of the architectures
developed.
3.1 FPGA Design flow:
The various steps involved in the design flow are as follows:
1) Design entry.
2) Functional simulation.
3) Synthesizing and optimizing (translation) the design.
4) Placing and routing the design
5) Timing simulation of the design after post PAR.
6) Static timing analysis.
7) Configuring the device by bit generation.
3.1.1 Design entry:
The first step in implementing the design is to create the HDL code based on design
criteria. To support these instantiations we need to include UNISIM library and compile all
design libraries before performing the functional simulation. The constraints (timing and area
constraints) can also be included during the design entry. Xilinx accepts the constraints in the
form of user constraint (UCF) file.
3.1.2 Functional Simulation:
This step deals with the verification of the functionality of the written source code. ISE
provides its own ISE simulator and also allows for the integration with other tools such as
Modelsim. This project uses Modelsim for the functional verification by selecting the option
during project creation. Functional simulation determines if the logic in the design is correct
before implementing it in a device. Functional simulation can take place at the earliest stages of
the design flow. Because timing information for the implemented design is not available at this
stage, the simulator tests the logic in the design using unit delays.
3.1.3 Synthesizing and Optimizing:
In this stage behavioral information in the HDL file is translated into a structural netlist,
and the design is optimized for a Xilinx device. To perform synthesis this project uses the Xilinx
XST tool [17]. From the original design, a netlist is created, then synthesized and translated into
a native generic object (NGO) file. This file is fed into the Xilinx software program called
NGDBuild, which produces a logical native generic database (NGD) file.
3.1.4 Design implementation:
In this stage, the MAP program maps a logical design to a Xilinx FPGA. The input to
MAP is an NGD file, which is generated using the NGDBuild program. The NGD file contains a
logical description of the design that includes both the hierarchical components used to develop
the design and the lower level Xilinx primitives. The NGD file also contains any number of
NMC (macro library) files, each of which contains the definition of a physical macro. MAP first
performs a logical DRC (Design Rule Check) on the design in the NGD file. MAP then maps the
design logic to the components (logic cells, I/O cells, and other components) in the target Xilinx
FPGA.
The output from MAP is an NCD (Native Circuit Description) file, and PCF (Physical constraint
file).
• NCD (Native Circuit Description) file—a physical description of the design in terms of the
components in the target Xilinx device.
• PCF (Physical Constraints File)—an ASCII text file that contains constraints specified during
design entry expressed in terms of physical elements. The physical constraints in the PCF are
expressed in Xilinx’s constraint language.
After the creation of Native Circuit Description (NCD) file with the MAP program, place
and route that design file using PAR. PAR accepts a mapped NCD file as input, places and
routes the design, and outputs an NCD file to be used by the bit stream generator (Bit
Generation).
The PAR placer executes multiple phases of the placer. PAR writes the NCD after all the
placer phases are complete. During placement, PAR places components into sites based on
factors such as constraints specified in the PCF file, the length of connections, and the available
routing resources.
After placing the design, PAR executes multiple phases of the router. The router
performs a converging procedure for a solution that routes the design to completion and meets
timing constraints. Once the design is fully routed, PAR writes an NCD file, which can be
analyzed against timing. PAR writes a new NCD as the routing improves throughout the router
phases.
3.1.5 Timing simulation after post PAR:
Timing simulation at this stage verifies that the design runs at the desired speed for the
device under worst-case conditions. This process is performed after the design is mapped,
placed, and routed for FPGAs. At this time, all design delays are known. Timing simulation is
valuable because it can verify timing relationships and determine the critical paths for the design
under worst-case conditions. It can also determine whether or not the design contains set-up or
hold violations. In most of the designs the same test bench can be used to simulate at this stage.
3.1.6 Static timing analysis:
Static timing analysis is best for quick timing checks of a design after it is placed and
routed. It also allows you to determine path delays in your design. Following are the two major
goals of static timing analysis:
• Timing verification
This is verifying that the design meets your timing constraints.
• Reporting
This is enumerating input constraint violations and placing them into an accessible file.
ISE provides the Timing Reporter and Circuit Evaluator (TRACE) tool to perform STA. The
input files to TRACE are the .ncd file and the .pcf file from PAR, and the output file is a .twr file.
3.2 Processes and properties:
Processes and properties enable the interaction of our design with the functionality available in
the ISE™ suite of tools.
3.2.1 Processes:
Processes are the functions listed hierarchically in the Processes window. They perform
functions from the start to the end of the design flow.
3.2.2 Properties:
Process properties are accessible from the right-click menu for select processes. They
enable us to customize the parameters used by the process.
Process properties are set at synthesis and implementation phase.
3.3 Synthesize options:
The following properties apply to the Synthesize properties using the Xilinx Synthesis
Technology (XST) synthesis tool.
Optimization Goal.
Specifies the global optimization goal for area or speed.
Select an option from the drop-down list.
Speed.
Optimizes the design for speed by reducing the levels of logic.
Area
Optimizes the design for area by reducing the total amount of logic used for
design implementation.
By default, this property is set to Speed.
3.3.1 Optimization Effort:
Specifies the synthesis optimization effort level.
Select an option from the drop-down list.
Normal
Optimizes the design using minimization and algebraic factoring algorithms.
High
Performs additional optimizations that are tuned to the selected device architecture. "High"
takes more CPU time than "Normal" because multiple optimization algorithms are tried to get
the best result for the target architecture.
By default, this property is set to Normal.
This project aims at timing performance, so the High effort level was selected.
3.3.2 Power Reduction:
When set to Yes (checkbox is checked), XST optimizes the design to consume as little
power as possible.
By default, this property is set to No (checkbox is blank).
3.3.3 Use Synthesis Constraints File:
Specifies whether or not to use the constraints file entered in the previous property. By
default, this constraints file is used (property checkbox is checked).
3.3.4 Keep Hierarchy:
Specifies whether the corresponding design unit should be preserved or merged with
the rest of the design. You can specify Yes, No or Soft. Soft is used when you wish to maintain
the hierarchy through synthesis but do not wish to pass the keep_hierarchy attribute to
place and route.
By default, this property is set to No.
Changing this property from No to Yes almost doubled the speed of this design.
4. FLOATING POINT MULTIPLICATION ALGORITHM
4.1 FLOATING POINT MULTIPLICATION:
Multiplying two numbers in floating point format is done by:
1. Adding the exponents of the two numbers, then subtracting the bias from their sum.
2. Multiplying the significands of the two numbers.
3. Calculating the sign by XORing the signs of the two numbers. In order to represent the
multiplication result as a normalized number, there should be a 1 in the MSB of the result.
4.2 FLOATING POINT MULTIPLICATION ALGORITHM:
As stated in the introduction, normalized floating point numbers have the form
Z = (-1)^S * 2^(E - Bias) * (1.M).
To multiply two floating point numbers the following is done:
1. Multiplying the significands; i.e. (1.M1 * 1.M2).
2. Placing the decimal point in the result.
3. Adding the exponents; i.e. (E1 + E2 - Bias).
4. Obtaining the sign; i.e. S1 XOR S2.
5. Normalizing the result; i.e. obtaining a 1 at the MSB of the result's significand.
6. Rounding the result to fit in the available bits.
7. Checking for underflow and overflow occurrence.
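The steps above (excluding rounding, which this multiplier omits) can be sketched as a behavioral model. The Python below is an illustrative sketch of ours, not the paper's HDL; it operates on unpacked fields and truncates the mantissa as the paper's design does:

```python
def fp_mul(s1, e1, m1, s2, e2, m2, mant_bits=23, bias=127):
    """Behavioral model of the multiplication steps on unpacked fields.
    m1/m2 are the fraction bits as integers; the hidden '1' is prepended here."""
    sign = s1 ^ s2                           # step 4: XOR the sign bits
    sig1 = (1 << mant_bits) | m1             # significand 1.M1
    sig2 = (1 << mant_bits) | m2             # significand 1.M2
    prod = sig1 * sig2                       # step 1: significand product
    exp = e1 + e2 - bias                     # step 3: add exponents, remove one bias
    # step 5: normalize -- the leading 1 is at bit 2*mant_bits or 2*mant_bits+1;
    # in the latter case shift right once and increment the exponent.
    if prod >> (2 * mant_bits + 1):
        prod >>= 1
        exp += 1
    mant = (prod >> mant_bits) & ((1 << mant_bits) - 1)  # truncate (no rounding)
    # step 7: check the normalized-range exponent
    overflow = exp >= 255
    underflow = exp <= 0
    return sign, exp, mant, overflow, underflow

# Example fields for 40 * -7.5, widened to 23 mantissa bits:
# 40 = (-1)^0 * 2^(132-127) * 1.25,  -7.5 = (-1)^1 * 2^(129-127) * 1.875
```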
Consider a floating point representation similar to the IEEE 754 single precision floating
point format, but with a reduced number of mantissa bits (only 4) while still retaining the hidden
'1' bit for normalized numbers:
A = 0 10000100 0100 = 40,  B = 1 10000001 1110 = -7.5.
To multiply A and B
1. Multiply the significands:
       1.0100
     x 1.1110
     --------
        00000
       10100
      10100
     10100
    10100
   ----------
   1001011000
2. Place the decimal point: 10.01011000
3. Add exponents: 10000100
+ 10000001
100000101
The exponents representing the two numbers are already shifted/biased by the bias value (127)
and are not the true exponents; i.e. EA = EA-true + bias and EB = EB-true + bias, so
EA + EB = EA-true + EB-true + 2*bias.
We should therefore subtract the bias from the resultant exponent; otherwise the bias would be
added twice.
  100000101
-  01111111
  ---------
   10000110
4. Obtain the sign bit and put the result together:
1 10000110 10.01011000
5. Normalize the result so that there is a 1 just before the radix point (decimal point). Moving the
radix point one place to the left increments the exponent by 1; moving it one place to the right
decrements the exponent by 1.
1 10000110 10.01011000 (before normalizing)
1 10000111 1.001011000 (normalized)
The result is (without the hidden bit):
1 10000111 001011000
6. The mantissa has more bits than the 4 available mantissa bits, so rounding is needed. If we
apply the truncation rounding mode then the stored value is:
1 10000111 0010
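The worked example can be cross-checked numerically; this is an illustrative recomputation of ours (variable names are not from the paper) in the reduced 4-bit-mantissa format:

```python
BIAS = 127

def value(s, e, m4):
    """Decode a number in the reduced format: 1 sign, 8 exponent, 4 mantissa bits."""
    return (-1) ** s * 2 ** (e - BIAS) * (1 + m4 / 16)

A = value(0, 0b10000100, 0b0100)          # 40.0
B = value(1, 0b10000001, 0b1110)          # -7.5
prod = 0b10100 * 0b11110                  # significand product: 1001011000
exp = 0b10000100 + 0b10000001 - BIAS      # 10000110 after bias subtraction
exp += 1                                  # normalization shifts right once
# sign = 0 XOR 1 = 1, so the final value is negative; the 10-bit product
# has its leading 1 at bit 9, so the significand is prod / 2**9.
result = -(prod / 2 ** 9) * 2 ** (exp - BIAS)   # -300.0 = 40 * -7.5
```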
In this paper we present a floating point multiplier in which rounding support isn't implemented.
Rounding support can be added as a separate unit that can be accessed by the multiplier or by a
floating point adder, thus allowing more precision if the multiplier is connected
directly to an adder in a MAC unit. Fig. 2 shows the multiplier structure; exponent addition,
significand multiplication, and the result's sign calculation are independent and are done in parallel.
The significand multiplication is done on two 24-bit numbers and results in a 48-bit product,
which we will call the intermediate product (IP). The IP is represented as (47 down to 0) and the
decimal point is located between bits 46 and 45 of the IP. The following sections detail each
block of the floating point multiplier.
The figure below shows each block of the floating point multiplier.
FLOATING POINT MULTIPLIER BLOCK DIAGRAM:
5. HARDWARE OF FLOATING POINT MULTIPLIER
The practical implementation of this multiplier is divided into four hardware modules.
MODULE 1: includes sign bit calculation and exponent addition.
MODULE 2: includes mantissa multiplication using a carry save multiplier.
MODULE 3: includes the normalizer.
MODULE 4: contains overflow and underflow detection.
MODULE 1:
Includes sign bit calculation and exponent addition.
Concepts used:
1. Operation of XOR gate.
2. Unsigned ripple carry adder.
3. Zero subtractor and one subtractor.
Sign bit calculation:
Multiplying two numbers results in a negative number if exactly one of the multiplied numbers
is negative. With the aid of a truth table we can find that this can be obtained by XORing the
signs of the two inputs.
Table 1: XOR-TRUTH TABLE.
Exponent addition:
This unsigned adder is responsible for adding the exponent of the first input to the
exponent of the second input and subtracting the bias (127) from the addition result (i.e.
A_exponent + B_exponent - Bias). The result of this stage is called the intermediate exponent. The
add operation is done on 8 bits, and there is no need for a quick result because most of the
calculation time is spent in the significand multiplication process (multiplying 24 bits by 24 bits);
thus we need only a moderate exponent adder but a fast significand multiplier.
An 8-bit ripple carry adder is used to add the two input exponents. As shown in Fig. 3, a
ripple carry adder is a chain of cascaded full adders and one half adder; each full adder has three
inputs (A, B, Ci) and two outputs (S, Co). The carry out (Co) of each adder is fed to the next full
adder (i.e. each carry bit "ripples" to the next full adder).
The addition process produces an 8-bit sum (S7-S0) and a carry bit (Co). These bits are
concatenated to form a 9-bit addition result (S8-S0) from which the bias is subtracted.
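As an illustrative bit-level model (ours, not the design's HDL; the first cell is written here as a full adder with carry-in 0, which behaves identically to the half adder):

```python
def full_adder(a, b, ci):
    """One full-adder cell: sum and carry-out of bits a, b and carry-in ci."""
    return a ^ b ^ ci, (a & b) | (a & ci) | (b & ci)

def to_bits(n, width):
    """Integer to LSB-first bit list."""
    return [(n >> i) & 1 for i in range(width)]

def ripple_carry_add(a_bits, b_bits):
    """Cascade of adder cells; each carry 'ripples' to the next cell.
    Returns the sum bits with the final carry concatenated as the MSB (S8)."""
    out, carry = [], 0
    for a, b in zip(a_bits, b_bits):
        s, carry = full_adder(a, b, carry)
        out.append(s)
    out.append(carry)    # the 9th bit of the addition result
    return out
```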
Bias subtraction:
The bias is subtracted using an array of ripple borrow subtractors. A normal subtractor
has three inputs (minuend (S), subtrahend (T), borrow in (Bi)) and two outputs (difference (R),
borrow out (Bo)). The subtractor logic can be optimized if one of its inputs is a constant value,
which is our case, where the bias is constant (127 decimal = 001111111 binary). Table I shows
the truth table for a 1-bit subtractor with the input T equal to 1, which we will call a "one
subtractor (OS)".
One subtractor:
Here one input (T) is always 1.
The Boolean equations that represent the subtractor are:
R = NOT(S XOR Bi), Bo = NOT(S) OR Bi.
Truth table:
S Bi | R Bo
0 0  | 1 1
0 1  | 0 1
1 0  | 0 0
1 1  | 1 1
Zero subtractor:
Here one input (T) is always 0.
The Boolean equations that represent this subtractor are:
R = S XOR Bi, Bo = NOT(S) AND Bi.
Truth table:
S Bi | R Bo
0 0  | 0 0
0 1  | 1 1
1 0  | 1 0
1 1  | 0 0
The below figure shows the Bias subtractor which is a chain of 7 one subtractors (OS)
followed by 2 zero subtractors (ZS); the borrow output of each subtractor is fed to the next
subtractor. If an underflow occurs then Eresult < 0 and the number is out of the IEEE 754 single
precision normalized numbers range; in this case the output is signaled to 0 and an underflow
flag is asserted.
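The subtractor chain can be modeled directly from the OS/ZS behavior above; this Python sketch (names ours) subtracts the constant bias from a 9-bit exponent sum:

```python
def one_subtractor(s, bi):
    """OS cell (subtrahend fixed at 1): R = NOT(S XOR Bi), Bo = NOT(S) OR Bi."""
    return 1 ^ (s ^ bi), (1 ^ s) | bi

def zero_subtractor(s, bi):
    """ZS cell (subtrahend fixed at 0): R = S XOR Bi, Bo = NOT(S) AND Bi."""
    return s ^ bi, (1 ^ s) & bi

def subtract_bias(e_bits):
    """Subtract 127 (001111111) from a 9-bit LSB-first exponent sum using
    7 one-subtractors followed by 2 zero-subtractors, borrow rippling through."""
    out, borrow = [], 0
    for i, s in enumerate(e_bits):
        cell = one_subtractor if i < 7 else zero_subtractor
        r, borrow = cell(s, borrow)
        out.append(r)
    return out, borrow   # a final borrow of 1 signals Eresult < 0 (underflow)
```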
Ripple borrow subtractor:
MODULE 2:
Includes mantissa multiplication using a carry save multiplier.
Concepts used:
1. Carry save multiplication.
2. Half adders and full adders.
Unsigned multiplier (for significand multiplication):
This unit is responsible for multiplying the unsigned significands and placing the decimal
point in the multiplication product. The result of the significand multiplication will be called the
intermediate product (IP). The unsigned significand multiplication is done on 24 bits. Multiplier
performance should be taken into consideration so as not to affect the whole multiplier's
performance. A 24x24-bit carry save multiplier architecture is used, as it has a moderate speed
with a simple architecture. In the carry save multiplier, the carry bits are passed diagonally
downwards (i.e. the carry bit is propagated to the next stage). Partial products are made by
ANDing the inputs together and passing them to the appropriate adder.
Carry save multiplier has three main stages:
1. The first stage is an array of half adders.
2. The middle stages are arrays of full adders.
3. The number of middle stages is equal to the significand size minus two.
4. The last stage is an array of ripple carry adders. This stage is called the vector merging stage.
The number of adders (Half adders and Full adders) in each stage is equal to the
significand size minus one. For example, a 4x4 carry save multiplier is shown in Fig. Below and
it has the following stages:
1. The first stage consists of three half adders.
2. Two middle stages; each consists of three full adders.
3. The vector merging stage consists of one half adder and two full adders.
The decimal point is between bits 45 and 46 in the significand multiplier result. The
multiplication time taken by the carry save multiplier is determined by its critical path. The
critical path starts at the AND gate of the first partial products (i.e. a1b0 and a0b1), passes
through the carry logic of the first half adder and the carry logic of the first full adder of the
middle stages, then passes through all the vector merging adders.
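The staged reduction described above can be sketched behaviorally. The following Python model (an algorithmic sketch, not the structural VHDL array) forms each partial product row by ANDing, compresses it into running sum and carry vectors so that carries are saved rather than propagated, and finally merges the two vectors with a ripple of full adders, mirroring the vector merging stage:

```python
def full_adder(a, b, c):
    """One-bit full adder: (sum, carry-out)."""
    return a ^ b ^ c, (a & b) | (a & c) | (b & c)

def carry_save_multiply(a, b, n=24):
    """Behavioral model of an n x n carry save multiplier.
    Each stage 3:2-compresses a partial product row with the saved sum
    and carry vectors; the shifted carry vector models the carries
    passed diagonally downwards to the next stage."""
    sum_v, carry_v = 0, 0
    for j in range(n):
        # Row j of partial products: a AND'ed with bit j of b, shifted.
        row = (a << j) if (b >> j) & 1 else 0
        sum_v, carry_v = (
            sum_v ^ row ^ carry_v,
            ((sum_v & row) | (sum_v & carry_v) | (row & carry_v)) << 1,
        )
    # Vector merging stage: ripple carry addition of sum and carry vectors.
    product, carry = 0, 0
    for i in range(2 * n):
        s, carry = full_adder((sum_v >> i) & 1, (carry_v >> i) & 1, carry)
        product |= s << i
    return product
```

With n = 24 this produces the 48-bit intermediate product; e.g. `carry_save_multiply(0xC00000, 0xC00000)` equals `0xC00000 ** 2`, the product of two 1.5 significands.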
Partial product: AiBi = Ai AND Bi
(HA: half adder, FA: full adder)
MODULE 3:
Includes the normalizer.
Concepts used:
1. Normalization
Normalizer:
The result of the significand multiplication (intermediate product) must be normalized to
have a leading '1' just to the left of the decimal point (i.e. at bit 46 in the intermediate
product). Since the inputs are normalized numbers, the intermediate product has its leading
one at bit 46 or 47.
If the leading one is at bit 46 (i.e. to the left of the decimal point) then the intermediate
product is already a normalized number and no shift is needed.
If the leading one is at bit 47 then the intermediate product is shifted to the right and the
exponent is incremented by 1.
The shift operation is done using combinational shift logic made of multiplexers. Fig. 8 shows a
simplified logic of a normalizer that has an 8 bit intermediate product input and a 6 bit
intermediate exponent input.
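The two normalization cases above reduce to a small piece of logic. As a behavioral Python sketch (the hardware uses multiplexer-based combinational shifting):

```python
def normalize(ip, exp):
    """Normalize a 48-bit intermediate product so that its leading 1
    sits just left of the decimal point (bit 46). Since both inputs
    are normalized, the leading one is at bit 46 or bit 47."""
    if (ip >> 47) & 1:               # leading one at bit 47:
        return ip >> 1, exp + 1      # shift right, increment exponent
    return ip, exp                   # already normalized, no shift
```

For example, the product of two 1.5 significands (9 << 44, with its leading one at bit 47) is shifted right one place and the exponent incremented.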
MODULE 4:
Includes overflow and underflow detection
Overflow/underflow means that the result's exponent is too large or too small to be
represented in the exponent field. The exponent of the result must be 8 bits in size and must lie
between 1 and 254; otherwise the value is not a normalized value.
An overflow may occur while adding the two exponents or during normalization.
Overflow due to exponent addition may be compensated during subtraction of the bias resulting
in a normal output value (normal operation). An underflow may occur while subtracting the bias
to form the intermediate exponent. If the intermediate exponent is less than zero then it is an
underflow that can never be compensated; if the intermediate exponent equals zero then it is an
underflow that may be compensated during normalization by adding 1 to it.
When an overflow occurs, an overflow flag signal goes high and the result turns to ±
infinity (sign determined according to the signs of the floating point multiplier inputs). When an
underflow occurs, an underflow flag signal goes high and the result turns to ± zero (sign
determined the same way). Denormalized numbers
are signaled to zero with the appropriate sign calculated from the inputs, and an underflow flag is raised.
Assume that E1 and E2 are the exponents of the two numbers A and B respectively; the result's
exponent is calculated by (6):
Eresult = E1 + E2 – 127 (6)
E1 and E2 can take values from 1 to 254, so Eresult ranges from −125 (2 − 127) to 381
(508 − 127); but for normalized numbers, Eresult can only take values from 1 to 254. Table III
summarizes the different values of Eresult and the effect of normalization on each.
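The exponent handling of this module can be condensed into a few lines. The following Python sketch (an assumed condensation of equation (6) and Table III, not the RTL) computes Eresult, including the optional +1 from the normalizer, and raises the corresponding flag:

```python
def exponent_flags(e1, e2, norm_carry=0):
    """Evaluate Eresult = E1 + E2 - 127 (+1 if the normalizer shifted
    right) and classify it. Returns (exponent field, status):
    underflow forces a signed-zero encoding (exponent 0), overflow a
    signed-infinity encoding (exponent 255)."""
    eres = e1 + e2 - 127 + norm_carry
    if eres < 1:
        return 0, 'underflow'     # cannot be represented as normalized
    if eres > 254:
        return 255, 'overflow'    # result turns to signed infinity
    return eres, 'normal'
```

Note how an intermediate exponent of exactly zero can still be rescued: `exponent_flags(63, 64, norm_carry=1)` returns a normal result because the normalization shift adds 1.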
Table III: overflow and underflow.
6. PIPELINING THE MULTIPLIER
Pipelining increases the CPU instruction throughput - the number of instructions
completed per unit of time. But it does not reduce the execution time of an individual instruction.
In fact, it usually slightly increases the execution time of each instruction due to overhead in the
pipeline control. The increase in instruction throughput means that a program runs faster and has
lower total execution time.
In order to enhance the performance of the multiplier, three pipelining stages are used to divide
the critical path thus increasing the maximum operating frequency of the multiplier.
The pipelining stages are embedded at the following locations:
1. In the middle of the significand multiplier and in the middle of the exponent adder
(before the bias subtraction).
2. After the significand multiplier and after the exponent adder.
3. At the floating point multiplier outputs (sign, exponent and mantissa bits).
Fig. 9 shows the pipelining stages as dotted lines.
Three pipelining stages mean that the output has a latency of three clock cycles. The synthesis
tool's "retiming" option was used so that the synthesizer applies its optimization logic to better
place the pipelining registers across the critical path.
7. SIMULATION RESULTS FOR INDIVIDUAL MODULES
8. IMPLEMENTATION AND TESTING
The whole multiplier (top unit) was tested against the Xilinx floating point multiplier
core generated by the Xilinx CORE Generator (Coregen). The Xilinx core was customized to have
two flags indicating overflow and underflow and a maximum latency of three cycles; it implements
the "round to nearest" rounding mode.
A testbench is used to generate stimulus, apply it to both the implemented floating
point multiplier and the Xilinx core, and compare the results. The floating point multiplier
code was also checked using DesignChecker, a linting tool which helps in
filtering design issues like clocks, unused/undriven logic, and combinational loops. The design
was synthesized using the Precision synthesis tool targeting Xilinx Virtex-5 with a timing constraint
of 300 MHz. Post-synthesis and place-and-route simulations were run to verify the design's
functionality after synthesis and place and route.
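A software golden model is a common complement to such a testbench. The sketch below is a hypothetical Python reference (the paper's testbench actually compares against the Xilinx core) for the unrounded, truncating single precision multiply described in this design; it assumes normalized inputs only, matching the multiplier's supported range:

```python
def fp_mul_reference(a_bits, b_bits):
    """Golden model of the unrounded single precision multiply.
    Inputs/output are 32-bit IEEE 754 patterns; the 48-bit significand
    product is truncated (not rounded) to 23 fraction bits. Normalized
    inputs are assumed; special values are not handled."""
    def unpack(x):
        return (x >> 31) & 1, (x >> 23) & 0xFF, x & 0x7FFFFF
    sa, ea, ma = unpack(a_bits)
    sb, eb, mb = unpack(b_bits)
    sign = sa ^ sb                              # XOR of input signs
    sig = (0x800000 | ma) * (0x800000 | mb)     # 48-bit intermediate product
    exp = ea + eb - 127                         # Eresult = E1 + E2 - 127
    if sig >> 47:                               # normalize: leading 1 at bit 47
        sig >>= 1
        exp += 1
    if exp < 1:
        return sign << 31                       # underflow -> signed zero
    if exp > 254:
        return (sign << 31) | (0xFF << 23)      # overflow -> signed infinity
    frac = (sig >> 23) & 0x7FFFFF               # truncate, drop hidden bit
    return (sign << 31) | (exp << 23) | frac
```

For instance, `fp_mul_reference(0x3FC00000, 0x3FC00000)` (1.5 × 1.5) returns `0x40100000` (2.25), and multiplying the smallest normalized number by itself raises the underflow path and returns a signed zero.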
The area of the Xilinx core is less than that of the implemented floating point multiplier because
the latter doesn't truncate or round the 48-bit result of the mantissa multiplier, which is reflected
in the number of function generators and registers used to operate on the extra bits; the speed of
the Xilinx core is also affected by the fact that it implements the round to nearest rounding
mode.
9. CONCLUSION AND FUTURE WORK
This paper presents an implementation of a floating point multiplier that supports the IEEE 754-
2008 binary interchange format. The multiplier doesn't implement rounding and presents the
significand multiplication result as is (48 bits); this gives better precision if the whole 48 bits are
utilized in another unit, e.g. a floating point adder to form a MAC unit. The design has three
pipelining stages, and after implementation on a Xilinx Virtex-5 FPGA it achieves 301 MFLOPs.