An Efficient Implementation of Floating Point Multiplier
ABSTRACT
This paper describes an efficient implementation of an IEEE 754 single precision
floating point multiplier targeted for a Xilinx Virtex-5 FPGA. VHDL is used to implement a
technology-independent pipelined design. The multiplier implementation handles the overflow
and underflow cases. Rounding is not implemented, to give more precision when using the
multiplier in a Multiply and Accumulate (MAC) unit. With a latency of three clock cycles the
design achieves 301 MFLOPs. The multiplier was verified against the Xilinx floating point
multiplier core.
1. INTRODUCTION
Floating point numbers are one way of representing real numbers in binary
format; the IEEE 754 standard defines two floating point formats, the binary interchange
format and the decimal interchange format. Multiplying floating point numbers is a critical
requirement for DSP applications involving large dynamic range. This paper focuses only on the
single precision normalized binary interchange format. Fig. 1 shows the IEEE 754 single
precision binary format representation; it consists of a one-bit sign (S), an eight-bit exponent (E),
and a twenty-three-bit fraction (M, or mantissa). An extra bit is added to the fraction to form what
is called the significand. If the exponent is greater than 0 and smaller than 255, and there is a 1 in
the MSB of the significand, then the number is said to be a normalized number.
where M = m22*2^-1 + m21*2^-2 + ... + m0*2^-23 and Bias = 127. The value represented is
Z = (-1)^S * 2^(E - Bias) * (1.M).
SIGNIFICAND:
The significand is the mantissa with an extra MSB bit (the hidden '1').
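As an illustrative aside (ours, not part of the paper's hardware design), the field layout above can be checked in software; the helper below is a hypothetical Python sketch that unpacks a single-precision value into its S, E and significand fields:

```python
import struct

def decode_single(x: float):
    """Unpack an IEEE 754 single-precision value into (S, E, significand)."""
    bits = struct.unpack(">I", struct.pack(">f", x))[0]
    s = bits >> 31                  # 1-bit sign
    e = (bits >> 23) & 0xFF         # 8-bit biased exponent
    m = bits & 0x7FFFFF             # 23-bit fraction (mantissa)
    # For a normalized number (0 < E < 255) the significand is the mantissa
    # with the hidden '1' prepended as the extra MSB: the integer (1 << 23) | m.
    significand = ((1 << 23) | m) if 0 < e < 255 else m
    return s, e, significand

s, e, sig = decode_single(-7.5)
# -7.5 = (-1)^1 * 2^(129-127) * 1.875, so s = 1 and e = 129
```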
OVERVIEW OF INDUSTRY:
Very-large-scale integration (VLSI) is the process of creating integrated circuits by
combining thousands of transistors into a single chip.
1.1 Developments:
The first semiconductor chips held two transistors each. Subsequent advances added more
and more transistors, and, as a consequence, more individual functions or systems were integrated
over time. The first integrated circuits held only a few devices, perhaps as many as ten diodes,
transistors, resistors and capacitors, making it possible to fabricate one or more logic gates on a
single device. Now known retrospectively as small-scale integration (SSI), improvements in technique
led to devices with hundreds of logic gates, known as medium-scale integration (MSI). Further
improvements led to large-scale integration (LSI), i.e. systems with at least a thousand logic gates.
Current technology has moved far past this mark, and today's microprocessors have many millions of
gates and billions of individual transistors.
At one time, there was an effort to name and calibrate various levels of large-scale integration
above VLSI. Terms like ultra-large-scale integration (ULSI) were used. But the huge number of
gates and transistors available on common devices has rendered such fine distinctions moot. Terms
suggesting greater than VLSI levels of integration are no longer in widespread use.
As of early 2008, billion-transistor processors are commercially available. This became more
commonplace as semiconductor fabrication advanced from the then-current generation of 65 nm
processes. A notable example is Nvidia's 280 series GPU: almost
all of its 1.4 billion transistors are used for logic, in contrast to the Itanium, whose large transistor
count is largely due to its 24 MB L3 cache. Current designs, unlike the earliest devices, use extensive
design automation and automated logic synthesis to lay out the transistors, enabling higher levels of
complexity in the resulting logic functionality. Certain high performance logic blocks like SRAM
cell, however, are still designed by hand to ensure the highest efficiency. VLSI technology may be
moving toward further radical miniaturization with introduction of NEMS technology.
Structured VLSI design is a modular methodology originated by Carver Mead and Lynn
Conway for saving microchip area by minimizing the interconnect fabric's area. This is obtained by
repetitive arrangement of rectangular macro blocks which can be interconnected by wiring by
abutment: an example is partitioning the layout of an adder into a row of equal bit-slice cells. In
complex designs this structuring may be achieved by hierarchical nesting.
Structured VLSI design was popular in the early 1980s, but lost its popularity later
with the advent of placement and routing tools, which waste a great deal of area on routing; the
waste is tolerated because of the progress of Moore's Law. When introducing the hardware description
language KARL in the mid-1970s, Reiner Hartenstein coined the term "structured VLSI design"
(originally "structured LSI design"), echoing Edsger Dijkstra's structured programming
approach of procedure nesting to avoid chaotic, spaghetti-structured programs.
1.2 Challenges:
As microprocessors become more complex due to technology scaling, microprocessor
designers have encountered several challenges which force them to think beyond the design
plane, and look ahead to post-silicon:
Power usage/ Heat dissipation - As threshold voltages have ceased to scale with
advancing process technology, dynamic power dissipation has not scaled
proportionally. Maintaining logic complexity when scaling the design down only means
that the power dissipation per area will go up. This has given rise to techniques such as
dynamic voltage and frequency scaling (DVFS) to minimize overall power.
Process variation - As photolithography techniques tend closer to the fundamental laws
of optics, achieving high accuracy in doping concentrations and etched wires is
becoming more difficult and prone to errors due to variation. Designers now must
simulate across multiple fabrication process corners before a chip is certified ready for
production.
Stricter design rules - Due to lithography and etch issues with scaling, design rules for
layout have become increasingly stringent. Designers must keep ever more of these
rules in mind while laying out custom circuits. The overhead for custom design is now
reaching a tipping point, with many design houses opting to switch to electronic design
automation (EDA) tools to automate their design process.
Timing/design closure - As clock frequencies tend to scale up, designers are finding it
more difficult to distribute and maintain low clock skew between these high frequency
clocks across the entire chip. This has led to a rising interest in multi core
and multiprocessor architectures, since an overall speedup can be obtained by lowering
the clock frequency and distributing processing.
First-pass success - As die sizes shrink (due to scaling), and wafer sizes go up (to lower
manufacturing costs), the number of dies per wafer increases, and the complexity of
making suitable photo masks goes up rapidly. A mask set for a modern technology can
cost several million dollars. This non-recurring expense deters the old iterative
philosophy involving several "spin-cycles" to find errors in silicon, and encourages
first-pass silicon success. Several design philosophies have been developed to aid this
new design flow, including design for manufacturing (DFM), design for test (DFT), and
Design for X.
1.3 Applications of VLSI:
Electronic systems now perform a wide variety of tasks in daily life. Electronic
systems in some cases have replaced mechanisms that operated mechanically, hydraulically, or
by other means; electronics are usually smaller, more flexible, and easier to service. In other
cases electronic systems have created totally new applications. Electronic systems perform a
variety of tasks, some of them visible, some more hidden:
Personal entertainment systems such as portable MP3 players and DVD
players perform sophisticated algorithms with remarkably little energy.
Electronic systems in cars operate stereo systems and displays; they also
control fuel injection systems, adjust suspensions to varying terrain, and
perform the control functions required for anti-lock braking (ABS) systems.
Digital electronics compress and decompress video, even at high-definition
data rates, on-the-fly in consumer electronics.
Low-cost terminals for Web browsing still require sophisticated electronics,
despite their dedicated function.
Personal computers and workstations provide word-processing, financial
analysis, and games. Computers include both central processing units (CPUs)
and special-purpose hardware for disk access, faster screen display, etc.
Medical electronic systems measure bodily functions and perform complex
processing algorithms to warn about unusual conditions. The availability of
these complex systems, far from overwhelming consumers, only creates
demand for even more complex systems.
The growing sophistication of applications continually pushes the design and manufacturing of
integrated circuits and electronic systems to new levels of complexity. And perhaps the most
amazing characteristic of this collection of systems is its variety: as systems become more
complex, we build not a few general-purpose computers but an ever wider range of special-
purpose systems. Our ability to do so is a testament to our growing mastery of both integrated
circuit manufacturing and design, but the increasing demands of customers continue to test the
limits of design and manufacturing.
2. VERILOG HDL
Verilog HDL is a hardware description language that can be used to model a digital
system at many levels of abstraction ranging from the algorithmic-level to the gate-level to the
switch-level. The complexity of the digital system being modeled could vary from that of a
simple gate to a complete electronic digital system, or anything in between. The digital system
can be described hierarchically and timing can be explicitly modeled within the same
description.
The Verilog HDL language includes capabilities to describe the behavioral nature of a
design, the dataflow nature of a design, a design's structural composition, delays and a waveform
generation mechanism including aspects of response monitoring and verification, all modeled
using one single language. In addition, the language provides a programming language interface
through which the internals of a design can be accessed during simulation including the control
of a simulation run.
The language not only defines the syntax but also defines very clear simulation
semantics for each language construct. Therefore, models written in this language can be
verified using a Verilog simulator. The language inherits many of its operator symbols and
constructs from the C programming language. Verilog HDL provides an extensive range of
modeling capabilities, some of which are quite difficult to comprehend initially. However, a core
subset of the language is quite easy to learn and use. This is sufficient to model most
applications.
2.1 History:
Verilog HDL was first developed by Gateway Design Automation in 1983
as a hardware modeling language for their simulator product. At that time, it was a proprietary
language. Because of the popularity of the simulator product, Verilog HDL gained acceptance as
a usable and practical language by a number of designers. In an effort to increase the popularity
of the language, it was placed in the public domain in 1990. Open Verilog
International (OVI) was formed to promote Verilog. In 1992 OVI decided to pursue
standardization of Verilog HDL as an IEEE standard. This effort was successful and the language
became an IEEE standard in 1995. The complete standard is described in the Verilog hardware
description language reference manual. The standard is called IEEE Std 1364-1995.
2.2 Major Capabilities:
Listed below are the major capabilities of the verilog hardware description:
Primitive logic gates, such as and, or and nand, are built into the language.
Flexibility of creating a user-defined primitive (UDP). Such a primitive could either be a
combinational logic primitive or a sequential logic primitive.
Switch-level modeling primitives, such as pmos and nmos, are also built into the
language.
Explicit language constructs are provided for specifying pin-to-pin delays, path delays
and timing checks of a design.
A design can be modeled in three different styles or in a mixed style. These styles are:
behavioral style - modeled using procedural constructs; dataflow style - modeled using
continuous assignments; and structural style - modeled using gate and module
instantiations.
There are two data types in Verilog HDL; the net data type and the register data type. The
net type represents a physical connection between structural elements while a register
type represents an abstract data storage element.
Verilog HDL also has built-in logic functions such as & (bitwise-and) and | (bitwise-or).
High-level programming language constructs such as conditions, case statements, and
loops are available in the language.
Notion of concurrency and time can be explicitly modeled.
Powerful file read and write capabilities are provided.
The language is non-deterministic under certain situations, that is, a model may produce
different results on different simulators; for example, the ordering of events on an event
queue is not defined by the standard.
2.3 Synthesis:
Synthesis is the process of constructing a gate level netlist from a register-transfer level
model of a circuit described in Verilog HDL. Figure 2-2 shows such a process. A synthesis
system may, as an intermediate step, generate a netlist that comprises register-transfer level
blocks such as flip-flops, arithmetic-logic units, and multiplexers, interconnected by wires. In
such a case, a second program called the RTL module builder is necessary. The purpose of this
builder is to build, or acquire from a library of predefined components, each of the required RTL
blocks in the user-specified target technology.
Having produced a gate level netlist, a logic optimizer reads in the netlist and optimizes
the circuit for the user-specified area and timing constraints. These area and timing constraints
may also be used by the module builder for appropriate selection or generation of RTL blocks. In
this book, we assume that the target netlist is at the gate level. The logic gates used in the
synthesized netlists are described in Appendix B. The module building and logic optimization
phases are not described in this book.
The above figure shows the basic elements of Verilog HDL and the elements used in
hardware. A mapping mechanism or a construction mechanism has to be provided that translates
the Verilog HDL elements into their corresponding hardware elements as shown in figure.2-3
Figure 2-2: Synthesis process
2.4 Advantages of Verilog HDL:
1. It is not possible to describe the functionality of digital circuits using higher-level languages
such as FORTRAN or C, because these are sequential in nature. Hardware Description
Languages (HDLs) came into existence for this purpose.
2. Designs can be described and implemented at a very abstract (high) level.
3. Technology-independent implementation. Functional verification of the design can be done
early in the design cycle.
4. One can optimize and modify the design description until it meets the desired functionality as
well as required specifications. Most design bugs are eliminated before going to implementation
(chip).
5. Designing with HDLs is analogous to computer programming; a textual description is an
easier way to develop and debug circuits.
6. Design reusability, short development time and easy modification of the design.
7. Verilog HDL is non-proprietary and is an IEEE standard.
8. Switch-level modeling primitives, such as pmos and nmos, are also built into the
language.
9. It is human and machine readable. Thus it can be used as an exchange language between tools
and designers.
10. Verilog HDL can be used to perform response monitoring of the design under test; that is, the
values of a design under test can be monitored and displayed. These values can be compared
with expected values, and in case of a mismatch, a report message can be printed.
3. FPGA DESIGN FLOW
This chapter deals with the implementation flow, specifying the significance
of various properties, the reports obtained, and the simulation waveforms of the architectures
developed.
3.1 FPGA Design flow:
The various steps involved in the design flow are as follows:
1) Design entry.
2) Functional simulation.
3) Synthesizing and optimizing (translation) the design.
4) Placing and routing the design
5) Timing simulation of the design after post PAR.
6) Static timing analysis.
7) Configuring the device by bit generation.
3.1.1 Design entry:
The first step in implementing the design is to create the HDL code based on design
criteria. To support these instantiations we need to include UNISIM library and compile all
design libraries before performing the functional simulation. The constraints (timing and area
constraints) can also be included during the design entry. Xilinx accepts the constraints in the
form of user constraint (UCF) file.
3.1.2 Functional Simulation:
This step deals with the verification of the functionality of the written source code. ISE
provides its own ISE simulator and also allows for the integration with other tools such as
Modelsim. This project uses Modelsim for the functional verification by selecting the option
during project creation. Functional simulation determines if the logic in the design is correct
before implementing it in a device. Functional simulation can take place at the earliest stages of
the design flow. Because timing information for the implemented design is not available at this
stage, the simulator tests the logic in the design using unit delays.
3.1.3 Synthesizing and Optimizing:
In this stage behavioral information in the HDL file is translated into a structural netlist,
and the design is optimized for a Xilinx device. To perform synthesis this project uses the Xilinx
XST tool [17]. From the original design, a netlist is created, then synthesized and translated into
a native generic object (NGO) file. This file is fed into the Xilinx software program called
NGDBuild, which produces a logical native generic database (NGD) file.
3.1.4 Design implementation:
In this stage, the MAP program maps a logical design to a Xilinx FPGA. The input to
MAP is an NGD file, which is generated using the NGDBuild program. The NGD file contains a
logical description of the design that includes both the hierarchical components used to develop
the design and the lower level Xilinx primitives. The NGD file also contains any number of
NMC (macro library) files, each of which contains the definition of a physical macro. MAP first
performs a logical DRC (Design Rule Check) on the design in the NGD file. MAP then maps the
design logic to the components (logic cells, I/O cells, and other components) in the target Xilinx
FPGA.
The output from MAP is an NCD (Native Circuit Description) file, and PCF (Physical constraint
file).
• NCD (Native Circuit Description) file—a physical description of the design in terms of the
components in the target Xilinx device.
• PCF (Physical Constraints File)—an ASCII text file that contains constraints specified during
design entry expressed in terms of physical elements. The physical constraints in the PCF are
expressed in Xilinx’s constraint language.
After the creation of Native Circuit Description (NCD) file with the MAP program, place
and route that design file using PAR. PAR accepts a mapped NCD file as input, places and
routes the design, and outputs an NCD file to be used by the bit stream generator (Bit
Generation).
The PAR placer executes multiple phases of the placer. PAR writes the NCD after all the
placer phases are complete. During placement, PAR places components into sites based on
factors such as constraints specified in the PCF file, the length of connections, and the available
routing resources.
After placing the design, PAR executes multiple phases of the router. The router
performs a converging procedure for a solution that routes the design to completion and meets
timing constraints. Once the design is fully routed, PAR writes an NCD file, which can be
analyzed against timing. PAR writes a new NCD as the routing improves throughout the router
phases.
3.1.5 Timing simulation after post PAR:
Timing simulation at this stage verifies that the design runs at the desired speed for the
device under worst-case conditions. This process is performed after the design is mapped,
placed, and routed for FPGAs. At this time, all design delays are known. Timing simulation is
valuable because it can verify timing relationships and determine the critical paths for the design
under worst-case conditions. It can also determine whether or not the design contains set-up or
hold violations. In most of the designs the same test bench can be used to simulate at this stage.
3.1.6 Static timing analysis:
Static timing analysis is best for quick timing checks of a design after it is placed and
routed. It also allows you to determine path delays in your design. Following are the two major
goals of static timing analysis:
• Timing verification
This is verifying that the design meets your timing constraints.
• Reporting
This is enumerating input constraint violations and placing them into an accessible file.
ISE provides the Timing Reporter and Circuit Evaluator (TRACE) tool to perform STA. The
input files to TRACE are the .ncd file and the .pcf file from PAR, and the output file is a .twr file.
3.2 Processes and properties:
Processes and properties enable the interaction of our design with the functionality available in
the ISE™ suite of tools.
3.2.1 Processes:
Processes are the functions listed hierarchically in the Processes window. They perform
functions from the start to the end of the design flow.
3.2.2 Properties:
Process properties are accessible from the right-click menu for select processes. They
enable us to customize the parameters used by the process.
Process properties are set at synthesis and implementation phase.
3.3 Synthesize options:
The following properties apply to the Synthesize properties using the Xilinx Synthesis
Technology (XST) synthesis tool.
Optimization Goal.
Specifies the global optimization goal for area or speed.
Select an option from the drop-down list.
Speed.
Optimizes the design for speed by reducing the levels of logic.
Area
Optimizes the design for area by reducing the total amount of logic used for
design implementation.
By default, this property is set to Speed.
3.3.1 Optimization Effort:
Specifies the synthesis optimization effort level.
Select an option from the drop-down list.
Normal
Optimizes the design using minimization and algebraic factoring algorithms.
High
Performs additional optimizations that are tuned to the selected device architecture. "High"
takes more CPU time than "Normal" because multiple optimization algorithms are tried to get
the best result for the target architecture.
By default, this property is set to Normal.
This project aims at timing performance, so the High effort level was selected.
3.3.2 Power Reduction:
When set to Yes (checkbox is checked), XST optimizes the design to consume as little
power as possible.
By default, this property is set to No (checkbox is blank).
3.3.3 Use Synthesis Constraints File:
Specifies whether or not to use the constraints file entered in the previous property. By
default, this constraints file is used (property checkbox is checked).
3.3.4 Keep Hierarchy:
Specifies whether the corresponding design unit should be preserved or merged with
the rest of the design. You can specify Yes, No or Soft. Soft is used when you wish to maintain
the hierarchy through synthesis but do not wish to pass the keep_hierarchy attribute to
place and route.
By default, this property is set to No.
Changing this property from No to Yes almost doubled the speed of this design.
4. FLOATING POINT MULTIPLICATION ALGORITHM
4.1 FLOATING POINT MULTIPLICATION:
Multiplying two numbers in floating point format is done by:
1. Adding the exponents of the two numbers, then subtracting the bias from their sum.
2. Multiplying the significands of the two numbers.
3. Calculating the sign by XORing the signs of the two numbers. In order to represent the
multiplication result as a normalized number, there should be a 1 in the MSB of the result.
4.2 FLOATING POINT MULTIPLICATION ALGORITHM:
As stated in the introduction, normalized floating point numbers have the form
Z = (-1)^S * 2^(E - Bias) * (1.M).
To multiply two floating point numbers the following is done:
1. Multiplying the significands; i.e. (1.M1 * 1.M2).
2. Placing the decimal point in the result.
3. Adding the exponents; i.e. (E1 + E2 - Bias).
4. Obtaining the sign; i.e. S1 XOR S2.
5. Normalizing the result; i.e. obtaining a 1 at the MSB of the result's significand.
6. Rounding the result to fit in the available bits.
7. Checking for underflow and overflow occurrence.
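The steps above (excluding rounding, which this multiplier omits) can be sketched as a behavioral model. The Python below is an illustrative sketch of ours, not the paper's HDL; it operates on unpacked fields and truncates the mantissa as the paper's design does:

```python
def fp_mul(s1, e1, m1, s2, e2, m2, mant_bits=23, bias=127):
    """Behavioral model of the multiplication steps on unpacked fields.
    m1/m2 are the fraction bits as integers; the hidden '1' is prepended here."""
    sign = s1 ^ s2                           # step 4: XOR the sign bits
    sig1 = (1 << mant_bits) | m1             # significand 1.M1
    sig2 = (1 << mant_bits) | m2             # significand 1.M2
    prod = sig1 * sig2                       # step 1: significand product
    exp = e1 + e2 - bias                     # step 3: add exponents, remove one bias
    # step 5: normalize -- the leading 1 is at bit 2*mant_bits or 2*mant_bits+1;
    # in the latter case shift right once and increment the exponent.
    if prod >> (2 * mant_bits + 1):
        prod >>= 1
        exp += 1
    mant = (prod >> mant_bits) & ((1 << mant_bits) - 1)  # truncate (no rounding)
    # step 7: check the normalized-range exponent
    overflow = exp >= 255
    underflow = exp <= 0
    return sign, exp, mant, overflow, underflow

# Example fields for 40 * -7.5, widened to 23 mantissa bits:
# 40 = (-1)^0 * 2^(132-127) * 1.25,  -7.5 = (-1)^1 * 2^(129-127) * 1.875
```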
Consider a floating point representation similar to the IEEE 754 single precision floating
point format, but with a reduced number of mantissa bits (only 4) while still retaining the hidden
'1' bit for normalized numbers:
A = 0 10000100 0100 = 40,  B = 1 10000001 1110 = -7.5.
To multiply A and B
1. Multiply the significands:
       1.0100
     x 1.1110
     --------
        00000
       10100
      10100
     10100
    10100
   ----------
   1001011000
2. Place the decimal point: 10.01011000
3. Add exponents: 10000100
+ 10000001
100000101
The exponents representing the two numbers are already shifted/biased by the bias value (127)
and are not the true exponents; i.e. EA = EA-true + bias and EB = EB-true + bias, so
EA + EB = EA-true + EB-true + 2*bias.
We should therefore subtract the bias from the resultant exponent; otherwise the bias would be
added twice.
  100000101
-  01111111
  ---------
   10000110
4. Obtain the sign bit and put the result together:
1 10000110 10.01011000
5. Normalize the result so that there is a 1 just before the radix point (decimal point). Moving the
radix point one place to the left increments the exponent by 1; moving it one place to the right
decrements the exponent by 1.
1 10000110 10.01011000 (before normalizing)
1 10000111 1.001011000 (normalized)
The result is (without the hidden bit):
1 10000111 001011000
6. The mantissa has more bits than the 4 available mantissa bits, so rounding is needed. If we
apply the truncation rounding mode then the stored value is:
1 10000111 0010
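The worked example can be cross-checked numerically; this is an illustrative recomputation of ours (variable names are not from the paper) in the reduced 4-bit-mantissa format:

```python
BIAS = 127

def value(s, e, m4):
    """Decode a number in the reduced format: 1 sign, 8 exponent, 4 mantissa bits."""
    return (-1) ** s * 2 ** (e - BIAS) * (1 + m4 / 16)

A = value(0, 0b10000100, 0b0100)          # 40.0
B = value(1, 0b10000001, 0b1110)          # -7.5
prod = 0b10100 * 0b11110                  # significand product: 1001011000
exp = 0b10000100 + 0b10000001 - BIAS      # 10000110 after bias subtraction
exp += 1                                  # normalization shifts right once
# sign = 0 XOR 1 = 1, so the final value is negative; the 10-bit product
# has its leading 1 at bit 9, so the significand is prod / 2**9.
result = -(prod / 2 ** 9) * 2 ** (exp - BIAS)   # -300.0 = 40 * -7.5
```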
In this paper we present a floating point multiplier in which rounding support isn't implemented.
Rounding support can be added as a separate unit that can be accessed by the multiplier or by a
floating point adder, thus allowing more precision if the multiplier is connected
directly to an adder in a MAC unit. Fig. 2 shows the multiplier structure; exponent addition,
significand multiplication, and the result's sign calculation are independent and are done in parallel.
The significand multiplication is done on two 24-bit numbers and results in a 48-bit product,
which we will call the intermediate product (IP). The IP is represented as (47 down to 0) and the
decimal point is located between bits 46 and 45 of the IP. The following sections detail each
block of the floating point multiplier.
The figure below shows each block of the floating point multiplier.
FLOATING POINT MULTIPLIER BLOCK DIAGRAM:
5. HARDWARE OF FLOATING POINT MULTIPLIER
The practical implementation of this multiplier is divided into four hardware modules.
MODULE 1: includes sign bit calculation and exponent addition.
MODULE 2: includes mantissa multiplication using a carry save multiplier.
MODULE 3: includes the normalizer.
MODULE 4: contains overflow and underflow detection.
MODULE 1:
Includes sign bit calculation and exponent addition.
Concepts used:
1. Operation of XOR gate.
2. Unsigned ripple carry adder.
3. Zero subtractor and one subtractor.
Sign bit calculation:
Multiplying two numbers results in a negative number if exactly one of the multiplied numbers
is negative. With the aid of a truth table we can find that this can be obtained by XORing the
signs of the two inputs.
Table 1: XOR-TRUTH TABLE.
Exponent addition:
This unsigned adder is responsible for adding the exponent of the first input to the
exponent of the second input and subtracting the bias (127) from the addition result (i.e.
A_exponent + B_exponent - Bias). The result of this stage is called the intermediate exponent. The
add operation is done on 8 bits, and there is no need for a quick result because most of the
calculation time is spent in the significand multiplication process (multiplying 24 bits by 24 bits);
thus we need only a moderate exponent adder but a fast significand multiplier.
An 8-bit ripple carry adder is used to add the two input exponents. As shown in Fig. 3, a
ripple carry adder is a chain of cascaded full adders and one half adder; each full adder has three
inputs (A, B, Ci) and two outputs (S, Co). The carry out (Co) of each adder is fed to the next full
adder (i.e. each carry bit "ripples" to the next full adder).
The addition process produces an 8-bit sum (S7-S0) and a carry bit (Co). These bits are
concatenated to form a 9-bit addition result (S8-S0) from which the bias is subtracted.
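As an illustrative bit-level model (ours, not the design's HDL; the first cell is written here as a full adder with carry-in 0, which behaves identically to the half adder):

```python
def full_adder(a, b, ci):
    """One full-adder cell: sum and carry-out of bits a, b and carry-in ci."""
    return a ^ b ^ ci, (a & b) | (a & ci) | (b & ci)

def to_bits(n, width):
    """Integer to LSB-first bit list."""
    return [(n >> i) & 1 for i in range(width)]

def ripple_carry_add(a_bits, b_bits):
    """Cascade of adder cells; each carry 'ripples' to the next cell.
    Returns the sum bits with the final carry concatenated as the MSB (S8)."""
    out, carry = [], 0
    for a, b in zip(a_bits, b_bits):
        s, carry = full_adder(a, b, carry)
        out.append(s)
    out.append(carry)    # the 9th bit of the addition result
    return out
```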
Bias subtraction:
The bias is subtracted using an array of ripple borrow subtractors. A normal subtractor
has three inputs (minuend (S), subtrahend (T), borrow in (Bi)) and two outputs (difference (R),
borrow out (Bo)). The subtractor logic can be optimized if one of its inputs is a constant value,
which is our case, where the bias is constant (127 decimal = 001111111 binary). Table I shows
the truth table for a 1-bit subtractor with the input T equal to 1, which we will call a "one
subtractor (OS)".
One subtractor:
Here one input (T) is always 1.
The Boolean equations that represent the subtractor are:
R = NOT(S XOR Bi), Bo = NOT(S) OR Bi.
Truth table:
S Bi | R Bo
0 0  | 1 1
0 1  | 0 1
1 0  | 0 0
1 1  | 1 1
Zero subtractor:
Here one input (T) is always 0.
The Boolean equations that represent this subtractor are:
R = S XOR Bi, Bo = NOT(S) AND Bi.
Truth table:
S Bi | R Bo
0 0  | 0 0
0 1  | 1 1
1 0  | 1 0
1 1  | 0 0
The below figure shows the Bias subtractor which is a chain of 7 one subtractors (OS)
followed by 2 zero subtractors (ZS); the borrow output of each subtractor is fed to the next
subtractor. If an underflow occurs then Eresult < 0 and the number is out of the IEEE 754 single
precision normalized numbers range; in this case the output is signaled to 0 and an underflow
flag is asserted.
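The subtractor chain can be modeled directly from the OS/ZS behavior above; this Python sketch (names ours) subtracts the constant bias from a 9-bit exponent sum:

```python
def one_subtractor(s, bi):
    """OS cell (subtrahend fixed at 1): R = NOT(S XOR Bi), Bo = NOT(S) OR Bi."""
    return 1 ^ (s ^ bi), (1 ^ s) | bi

def zero_subtractor(s, bi):
    """ZS cell (subtrahend fixed at 0): R = S XOR Bi, Bo = NOT(S) AND Bi."""
    return s ^ bi, (1 ^ s) & bi

def subtract_bias(e_bits):
    """Subtract 127 (001111111) from a 9-bit LSB-first exponent sum using
    7 one-subtractors followed by 2 zero-subtractors, borrow rippling through."""
    out, borrow = [], 0
    for i, s in enumerate(e_bits):
        cell = one_subtractor if i < 7 else zero_subtractor
        r, borrow = cell(s, borrow)
        out.append(r)
    return out, borrow   # a final borrow of 1 signals Eresult < 0 (underflow)
```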
Ripple borrow subtractor:
MODULE 2:
Includes mantissa multiplication using a carry save multiplier.
Concepts used:
1. Carry save multiplication.
2. Half adders and full adders.
Unsigned multiplier (for significand multiplication):
This unit is responsible for multiplying the unsigned significands and placing the decimal
point in the multiplication product. The result of the significand multiplication will be called the
intermediate product (IP). The unsigned significand multiplication is done on 24 bits. Multiplier
performance should be taken into consideration so as not to affect the whole multiplier's
performance. A 24x24-bit carry save multiplier architecture is used, as it has a moderate speed
with a simple architecture. In the carry save multiplier, the carry bits are passed diagonally
downwards (i.e. the carry bit is propagated to the next stage). Partial products are made by
ANDing the inputs together and passing them to the appropriate adder.
Carry save multiplier has three main stages:
1. The first stage is an array of half adders.
2. The middle stages are arrays of full adders.
3. The number of middle stages is equal to the significand size minus two.
4. The last stage is an array of ripple carry adders. This stage is called the vector merging stage.
The number of adders (Half adders and Full adders) in each stage is equal to the
significand size minus one. For example, a 4x4 carry save multiplier is shown in Fig. Below and
it has the following stages:
1. The first stage consists of three half adders.
2. Two middle stages; each consists of three full adders.
3. The vector merging stage consists of one half adder and two full adders.
The decimal point is between bits 45 and 46 in the significand multiplier result. The
multiplication time taken by the carry save multiplier is determined by its critical path. The
critical path starts at the AND gate of the first partial products (i.e. a1b0 and a0b1), passes
through the carry logic of the first half adder and the carry logic of the first full adder of the
middle stages, then passes through all the vector merging adders.
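The staged reduction described above can be sketched behaviorally. The following Python model (an algorithmic sketch, not the structural VHDL array) forms each partial product row by ANDing, compresses it into running sum and carry vectors so that carries are saved rather than propagated, and finally merges the two vectors with a ripple of full adders, mirroring the vector merging stage:

```python
def full_adder(a, b, c):
    """One-bit full adder: (sum, carry-out)."""
    return a ^ b ^ c, (a & b) | (a & c) | (b & c)

def carry_save_multiply(a, b, n=24):
    """Behavioral model of an n x n carry save multiplier.
    Each stage 3:2-compresses a partial product row with the saved sum
    and carry vectors; the shifted carry vector models the carries
    passed diagonally downwards to the next stage."""
    sum_v, carry_v = 0, 0
    for j in range(n):
        # Row j of partial products: a AND'ed with bit j of b, shifted.
        row = (a << j) if (b >> j) & 1 else 0
        sum_v, carry_v = (
            sum_v ^ row ^ carry_v,
            ((sum_v & row) | (sum_v & carry_v) | (row & carry_v)) << 1,
        )
    # Vector merging stage: ripple carry addition of sum and carry vectors.
    product, carry = 0, 0
    for i in range(2 * n):
        s, carry = full_adder((sum_v >> i) & 1, (carry_v >> i) & 1, carry)
        product |= s << i
    return product
```

With n = 24 this produces the 48-bit intermediate product; e.g. `carry_save_multiply(0xC00000, 0xC00000)` equals `0xC00000 ** 2`, the product of two 1.5 significands.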
Partial product: AiBi = Ai AND Bi
(HA: half adder, FA: full adder)
MODULE 3:
Includes the normalizer.
Concepts used:
1. Normalization
Normalizer:
The result of the significand multiplication (intermediate product) must be normalized to
have a leading '1' just to the left of the decimal point (i.e. at bit 46 in the intermediate
product). Since the inputs are normalized numbers, the intermediate product has its leading
one at bit 46 or 47.
If the leading one is at bit 46 (i.e. to the left of the decimal point) then the intermediate
product is already a normalized number and no shift is needed.
If the leading one is at bit 47 then the intermediate product is shifted to the right and the
exponent is incremented by 1.
The shift operation is done using combinational shift logic made of multiplexers. Fig. 8 shows a
simplified logic of a normalizer that has an 8 bit intermediate product input and a 6 bit
intermediate exponent input.
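The two normalization cases above reduce to a small piece of logic. As a behavioral Python sketch (the hardware uses multiplexer-based combinational shifting):

```python
def normalize(ip, exp):
    """Normalize a 48-bit intermediate product so that its leading 1
    sits just left of the decimal point (bit 46). Since both inputs
    are normalized, the leading one is at bit 46 or bit 47."""
    if (ip >> 47) & 1:               # leading one at bit 47:
        return ip >> 1, exp + 1      # shift right, increment exponent
    return ip, exp                   # already normalized, no shift
```

For example, the product of two 1.5 significands (9 << 44, with its leading one at bit 47) is shifted right one place and the exponent incremented.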
MODULE 4:
Includes overflow and underflow detection
Overflow/underflow means that the result's exponent is too large or too small to be
represented in the exponent field. The exponent of the result must be 8 bits in size and must lie
between 1 and 254; otherwise the value is not a normalized value.
An overflow may occur while adding the two exponents or during normalization.
Overflow due to exponent addition may be compensated during subtraction of the bias resulting
in a normal output value (normal operation). An underflow may occur while subtracting the bias
to form the intermediate exponent. If the intermediate exponent is less than zero then it is an
underflow that can never be compensated; if the intermediate exponent equals zero then it is an
underflow that may be compensated during normalization by adding 1 to it.
When an overflow occurs, an overflow flag signal goes high and the result turns to ±
infinity (sign determined according to the signs of the floating point multiplier inputs). When an
underflow occurs, an underflow flag signal goes high and the result turns to ± zero (sign
determined the same way). Denormalized numbers
are signaled to zero with the appropriate sign calculated from the inputs, and an underflow flag is raised.
Assume that E1 and E2 are the exponents of the two numbers A and B respectively; the result's
exponent is calculated by (6):
Eresult = E1 + E2 – 127 (6)
E1 and E2 can take values from 1 to 254, so Eresult ranges from −125 (2 − 127) to 381
(508 − 127); but for normalized numbers, Eresult can only take values from 1 to 254. Table III
summarizes the different values of Eresult and the effect of normalization on each.
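The exponent handling of this module can be condensed into a few lines. The following Python sketch (an assumed condensation of equation (6) and Table III, not the RTL) computes Eresult, including the optional +1 from the normalizer, and raises the corresponding flag:

```python
def exponent_flags(e1, e2, norm_carry=0):
    """Evaluate Eresult = E1 + E2 - 127 (+1 if the normalizer shifted
    right) and classify it. Returns (exponent field, status):
    underflow forces a signed-zero encoding (exponent 0), overflow a
    signed-infinity encoding (exponent 255)."""
    eres = e1 + e2 - 127 + norm_carry
    if eres < 1:
        return 0, 'underflow'     # cannot be represented as normalized
    if eres > 254:
        return 255, 'overflow'    # result turns to signed infinity
    return eres, 'normal'
```

Note how an intermediate exponent of exactly zero can still be rescued: `exponent_flags(63, 64, norm_carry=1)` returns a normal result because the normalization shift adds 1.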
Table III: overflow and underflow.
6. PIPELINING THE MULTIPLIER
Pipelining increases the CPU instruction throughput - the number of instructions
completed per unit of time. But it does not reduce the execution time of an individual instruction.
In fact, it usually slightly increases the execution time of each instruction due to overhead in the
pipeline control. The increase in instruction throughput means that a program runs faster and has
lower total execution time.
In order to enhance the performance of the multiplier, three pipelining stages are used to divide
the critical path thus increasing the maximum operating frequency of the multiplier.
The pipelining stages are embedded at the following locations:
1. In the middle of the significand multiplier and in the middle of the exponent adder
(before the bias subtraction).
2. After the significand multiplier and after the exponent adder.
3. At the floating point multiplier outputs (sign, exponent and mantissa bits).
Fig. 9 shows the pipelining stages as dotted lines.
Three pipelining stages mean that the output has a latency of three clock cycles. The synthesis
tool's "retiming" option was used so that the synthesizer applies its optimization logic to better
place the pipelining registers across the critical path.
7. SIMULATION RESULTS FOR INDIVIDUAL MODULES
8. IMPLEMENTATION AND TESTING
The whole multiplier (top unit) was tested against the Xilinx floating point multiplier
core generated by the Xilinx CORE Generator (Coregen). The Xilinx core was customized to have
two flags indicating overflow and underflow and a maximum latency of three cycles; it implements
the "round to nearest" rounding mode.
A testbench is used to generate stimulus, apply it to both the implemented floating
point multiplier and the Xilinx core, and compare the results. The floating point multiplier
code was also checked using DesignChecker, a linting tool which helps in
filtering design issues like clocks, unused/undriven logic, and combinational loops. The design
was synthesized using the Precision synthesis tool targeting Xilinx Virtex-5 with a timing constraint
of 300 MHz. Post-synthesis and place-and-route simulations were run to verify the design's
functionality after synthesis and place and route.
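A software golden model is a common complement to such a testbench. The sketch below is a hypothetical Python reference (the paper's testbench actually compares against the Xilinx core) for the unrounded, truncating single precision multiply described in this design; it assumes normalized inputs only, matching the multiplier's supported range:

```python
def fp_mul_reference(a_bits, b_bits):
    """Golden model of the unrounded single precision multiply.
    Inputs/output are 32-bit IEEE 754 patterns; the 48-bit significand
    product is truncated (not rounded) to 23 fraction bits. Normalized
    inputs are assumed; special values are not handled."""
    def unpack(x):
        return (x >> 31) & 1, (x >> 23) & 0xFF, x & 0x7FFFFF
    sa, ea, ma = unpack(a_bits)
    sb, eb, mb = unpack(b_bits)
    sign = sa ^ sb                              # XOR of input signs
    sig = (0x800000 | ma) * (0x800000 | mb)     # 48-bit intermediate product
    exp = ea + eb - 127                         # Eresult = E1 + E2 - 127
    if sig >> 47:                               # normalize: leading 1 at bit 47
        sig >>= 1
        exp += 1
    if exp < 1:
        return sign << 31                       # underflow -> signed zero
    if exp > 254:
        return (sign << 31) | (0xFF << 23)      # overflow -> signed infinity
    frac = (sig >> 23) & 0x7FFFFF               # truncate, drop hidden bit
    return (sign << 31) | (exp << 23) | frac
```

For instance, `fp_mul_reference(0x3FC00000, 0x3FC00000)` (1.5 × 1.5) returns `0x40100000` (2.25), and multiplying the smallest normalized number by itself raises the underflow path and returns a signed zero.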
The area of the Xilinx core is less than that of the implemented floating point multiplier because
the latter doesn't truncate or round the 48-bit result of the mantissa multiplier, which is reflected
in the number of function generators and registers used to operate on the extra bits; the speed of
the Xilinx core is also affected by the fact that it implements the round to nearest rounding
mode.
9. CONCLUSION AND FUTURE WORK
This paper presents an implementation of a floating point multiplier that supports the IEEE 754-
2008 binary interchange format. The multiplier doesn't implement rounding and presents the
significand multiplication result as is (48 bits); this gives better precision if the whole 48 bits are
utilized in another unit, e.g. a floating point adder to form a MAC unit. The design has three
pipelining stages, and after implementation on a Xilinx Virtex-5 FPGA it achieves 301 MFLOPs.