
Northeastern University
Department of Electrical and Computer Engineering
Electrical and Computer Engineering Master's Theses
January 01, 2011

Analysis and design of high performance 128-bit parallel prefix end-around-carry adder
Ogun Turkyilmaz, Northeastern University

This work is available open access, hosted by Northeastern University.

Recommended Citation
Turkyilmaz, Ogun, "Analysis and design of high performance 128-bit parallel prefix end-around-carry adder" (2011). Electrical and Computer Engineering Master's Theses. Paper 59. http://hdl.handle.net/2047/d20001096

Page 2: Analysis and Design of High Performance 128-Bit Parallel Prefix e

Analysis and Design of High Performance 128-bit Parallel Prefix End-Around-Carry Adder

A Thesis Presented by

Ogun Turkyilmaz

to

The Department of Electrical and Computer Engineering

in partial fulfillment of the requirements for the degree of

Master of Science

in

Electrical and Computer Engineering

Northeastern University
Boston, Massachusetts

August 2011

Page 3: Analysis and Design of High Performance 128-Bit Parallel Prefix e

Abstract

Addition is a timing-critical operation in today's floating point units. To enable faster processing, an end-around-carry (EAC) adder was proposed as part of a fused multiply-add unit, which performs a multiplication followed by an addition [5]. The proposed EAC adder was also investigated alongside other prefix adders in FPGA technology as a complete adder [6]. In this thesis, we propose a 128-bit standalone adder with parallel prefix end-around-carry logic and conditional sum blocks to improve the critical path delay and provide flexibility to design with different adder architectures. In previous works, CLA logic was used for the EAC logic. Using a modified structure of a parallel prefix 2^n − 1 adder provides flexibility to the design and decreases the

length of the carry path. After the architecture is tested and verified, the critical path is analyzed using the FreePDK 45nm library. Full custom design techniques are applied carefully during critical path optimization. Critical path analysis provides a fast comparison of the total delay among different architectures without designing the whole circuit, and a simpler approach to sizing the transistors for the lowest delay possible. As a final step, the datapath is designed as a repeating bit slice for fast layout entry. The results show that the proposed adder achieves a 142 ps delay, 2.42 mW average power dissipation, and 3,132 square micron area, assuming there is not much routing area overhead beyond the estimated area.

Page 4: Analysis and Design of High Performance 128-Bit Parallel Prefix e

Acknowledgements

I would like to express my foremost appreciation to my advisor, Prof. Yong-Bin Kim for

giving me the opportunity to conduct research at Northeastern University. I am grateful for

his technical guidance and constant support in my graduate career. Without his valuable suggestions and assistance, this thesis would not have been accomplished.

I would like to thank the committee members, Prof. Fabrizio Lombardi and Prof. Gunar Schirner, for reading my thesis and offering valuable suggestions and contributions. I am,

especially, grateful to Prof. Schirner for the long discussions about graduate study. He has

been a mentor and a teacher to me, who generously shares his knowledge and experiences with

tremendous enthusiasm and never-ending encouragement.

It has been an honor and a great pleasure to study as a Fulbright Scholar. I would like

to express my appreciation to the Fulbright Commission for giving me the chance to pursue

further academic study and connect with many accomplished scholars.

I would also like to thank Faith Crisley, Graduate Coordinator at the ECE Department, for

her support and valuable suggestions. She has always been helpful with her comforting manner

even in the most stressful moments.

Last but not least, I would like to express my sincere appreciation to my beloved parents,

Nuket and Nevzat Turkyilmaz, and my sister, Pinar Turkyilmaz, who encouraged me continu-

ously in every step I take, supported me constantly through every hardship I faced and loved

me without boundaries. Without them, I could not even imagine being where I am today.

Ogun Turkyilmaz
August 2011

Page 5: Analysis and Design of High Performance 128-Bit Parallel Prefix e

Contents

1 Introduction
  1.1 Fused Multiply-Add Operation
  1.2 Adders
    1.2.1 Ripple Carry Adders
    1.2.2 Carry lookahead adders
    1.2.3 Parallel Prefix Adders
  1.3 Tree Adders
  1.4 Recurrence Algorithms
    1.4.1 Weinberger Recurrence
    1.4.2 Ling Recurrence
  1.5 Conclusion

2 Modulo Adders
  2.1 Introduction
  2.2 Addition in Modulo 2^n − 1 Adder
  2.3 Analysis of Previous End-Around-Carry Adders
  2.4 Carry-lookahead EAC Logic Unit
  2.5 Conclusion

3 Modified Parallel Prefix EAC Adder
  3.1 Introduction
  3.2 Proposed Adder
    3.2.1 The 16bit blocks in EAC adder
    3.2.2 Parallel Prefix 2^n − 1 EAC Block
  3.3 Implementation and Validation
  3.4 Conclusion

4 Critical Path Analysis
  4.1 Path Identification
  4.2 Path Design
  4.3 Transistor Level Design and Sizing
    4.3.1 Logic Level Minimization
    4.3.2 Late arriving signal exploitation
    4.3.3 Logical Effort
    4.3.4 Design with Helpers
  4.4 Transistor Sizing
  4.5 Simulation Results
  4.6 Conclusion

5 Datapath Library
  5.1 Introduction
  5.2 Concepts in Full Custom Design
  5.3 Datapath Design
  5.4 Layout Design
  5.5 Results
  5.6 Conclusion

6 Conclusion and Future Works

A Verilog Code of the Proposed Adder

B HSPICE Simulation Files
  B.1 Cells
  B.2 Simulation Code
  B.3 Condition of Transistors

Page 7: Analysis and Design of High Performance 128-Bit Parallel Prefix e

List of Figures

1.1 Ripple Carry adder [1].
1.2 Carry lookahead adder [2].
1.3 Group PG cells [3].
1.4 Taxonomy of prefix networks [3].
1.5 Kogge-Stone adder [3].
1.6 Sklansky adder [3].
1.7 Brent-Kung adder [3].
1.8 Han-Carlson adder [3].
1.9 Knowles [2,1,1,1] adder [3].
1.10 Ladner Fischer adder [3].

2.1 Prefix graph with fast end-around carry [4].
2.2 General block diagram of a modulo 2^n − 1 adder [4].
2.3 Block diagram of the 128-bit binary adder [5].
2.4 Architecture of the EAC adder [6].
2.5 Architecture of the EAC adder [6].

3.1 Architecture of the modified EAC Adder.
3.2 16bit conditional sum blocks.
3.3 cin merge with fast carry link [7].
3.4 cin merge with extra bit [7].
3.5 Modified 8bit Kogge-Stone EAC block.

4.1 Critical path of the modified EAC adder.
4.2 Gate level design of critical path.
4.3 Reduced first stage in Weinberger recursion adder [8].
4.4 AOI without late arriving exploitation.
4.5 AOI with late arriving exploitation.
4.6 EAC logic with helper.
4.7 Spreadsheet for Logical Effort Calculation.
4.8 Transistor level schematic.
4.9 Transistor level schematic.
4.10 Delay vs. Vdd at 25 °C.
4.11 Power vs. Vdd at 25 °C.
4.12 Delay vs. Vdd at 100 °C.
4.13 Power vs. Vdd at 100 °C.

5.1 Global floorplan of a datapath [9].
5.2 Regularity placement and routing datapath circuit [10].
5.3 Schematic layout of datapath and detailed view of bitcell [11].
5.4 Representation of a datapath cell [9].
5.5 Designed basic cells.
5.6 Bit slices of the blocks in the adder.
5.7 Bit slice of 16bit Kogge-Stone adder.
5.8 Layout of the blocks.
5.9 Wide layout.
5.10 Stacked compact layout.

Page 9: Analysis and Design of High Performance 128-Bit Parallel Prefix e

List of Tables

1.1 Trade-off between different adder topologies.
4.1 Delay and power dissipation values in correspondence to VDD and temperature.
5.1 Results comparison of proposed adder with the previous work.


Chapter 1

Introduction

The fused multiply-add unit plays an important role in modern microprocessors. It performs floating

point multiplication followed by an addition of the product with a third floating point operand.

In 2007, a seven cycle fused multiply add pipeline unit was proposed as a part of the floating

point unit in IBM’s POWER6 microprocessor [5]. In this fused multiply add data flow, the

product should be aligned before it is added to the addend. Because the magnitude of the product is unknown in the early stages, prior to the combination with the addend it is difficult to determine a priori which operand is bigger. Even if it were determined early that the product is bigger, conditionally complementing two intermediate operands, the carry and sum outputs of the counter tree, would be problematic. Thus, an adder needs to be designed that always outputs

a positive magnitude result and preferably only needs to complement one operand. In [6], the adder in POWER6 was taken as a reference for design space exploration in FPGA technology. The authors designed a complete adder independent of the FMA block and showed that Kogge-Stone does not provide the best performance in FPGA technology. Zhang et al. [12] recently proposed a 108-bit adder for an FMA unit. All these adders took the adder in POWER6 as a reference.

We believe using a CLA block in this adder limits the possibility to totally exploit the benefits

of parallel prefix adders. We designed an adder with a parallel prefix 2^n − 1 block. Although the carry increment topology is still employed, the number of carry merge terms is decreased, as is the length of the end-around-carry path. Another area of improvement lies in the design of the first level addition blocks. Using a carry-select scheme provides the benefit of choosing the sum


at the end of the computation. However, the carry path should not include the calculation of the group propagate and generate (PG) terms according to the input carry. In our conditional sum blocks, the carry path only includes PG terms with cin = 0, and the sum is calculated in the non-critical path

for both conditions: cin = 0 and cin = 1. At the final stage, the real sum is selected according

to the output of the EAC block.

The thesis is organized as follows: in Chapter 1, general information about adders is provided to show the importance of the design space. In Chapter 2, a detailed analysis of modulo 2^n − 1 adders and EAC adders is given. The modified adder is described in Chapter 3 and

compared with the previous architectures. Critical path analysis methodology is analyzed in

detail in Chapter 4, and simulation results are provided. Finally, in Chapter 5, the datapath library design methodology is described.

1.1 Fused Multiply-Add Operation

A fused multiply-add (FMA) unit performs the multiplication A × B followed immediately by an addition of the product and a third operand C, so that the result T is calculated as in Eqn. 1.1 in a single indivisible step [2]. Such a unit is capable of performing a multiply only by setting C = 0 and an add (or subtract) only by setting, for example, B = 1.

T = A×B + C = M + C (1.1)

An advantage of a fused multiply-add unit, compared to separate multiply and add units, arises when executing floating-point operations, since rounding is performed only once for the result of T = A×B+C rather than twice (once for the multiply and then once for the add) [13]. Since rounding may introduce computation errors, reducing the number of rounding operations positively affects the overall error.

The input operands are calculated at the CSA (carry-save adder) multiplier tree, and the magnitude of the operands is not known prior to addition, so it cannot be determined in advance which operand has the greater value. Since floating point is a sign-magnitude operation, the result of the adder should be in two's complement form [14]. Therefore, an adder is needed to produce two separate results


for the following cases:

Case 1: If M > C, then $|M - C| = M - C = M + \bar{C} + 1$.

Case 2: If M < C, then $|M - C| = C - M = -(M - C) = -(M + \bar{C} + 1) = -(M + \bar{C}) - 1 = \overline{M + \bar{C}}$.

During the subtraction M − C, the final carry-out Cout is 1 when M > C, and 0 when C > M. Consequently, an end-around-carry adder produces two different results, and Cout determines whether case 1 or case 2 applies [12].
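To make the two cases concrete, the following is a minimal behavioral Verilog sketch (illustrative only, not the thesis adder; widths and names are assumptions) that selects the positive magnitude |M − C| from the two candidate results using the carry-out, following the case analysis above:

// Behavioral sketch of the two-case selection: the adder computes M + ~C + 1, and its
// carry-out picks either that sum (M > C) or the one's complement of M + ~C (C > M),
// so the output is always the positive magnitude.
module eac_magnitude #(parameter N = 8) (
  input  [N-1:0] m, c,
  output [N-1:0] diff                      // |M - C|
);
  wire [N-1:0] c_bar = ~c;                 // bitwise complement of C
  wire [N:0]   case1 = m + c_bar + 1'b1;   // M - C, with the carry-out in bit N
  wire         cout  = case1[N];
  wire [N-1:0] case2 = ~(m + c_bar);       // C - M as the complement of (M + ~C)
  assign diff = cout ? case1[N-1:0] : case2;
endmodule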

1.2 Adders

1.2.1 Ripple Carry Adders

The addition of two operands is the most frequent operation in almost any arithmetic unit.

A two-operand adder is used not only when performing additions and subtractions, but is also often employed when executing more complex operations like multiplication and division. Con-

sequently, a fast two-operand adder is essential [3].

si = ai ⊕ bi ⊕ ci (1.2)

ci+1 = ai · bi + (ai + bi) · ci (1.3)

Figure 1.1: Ripple Carry adder [1].


The most straightforward implementation of a parallel adder for two operands x and y is through the use of n basic units called full adders. A full adder (FA) is a logic circuit that accepts two operand bits, say xi and yi, and an incoming carry bit ci, and produces a sum bit si and an outgoing carry bit ci+1. As the notation suggests, the outgoing carry of one FA is also the incoming carry of the subsequent FA. The FA is a combinational digital circuit implementing the binary addition of three bits through the Boolean equations 1.2 and 1.3. The ripple carry adder is shown on Figure 1.1.
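As an illustration of Equations 1.2 and 1.3, a full adder and a ripple carry adder built from it can be written in Verilog as follows (a sketch only, not the thesis code in Appendix A; module and signal names are assumed):

// Full adder implementing Eqns. 1.2 and 1.3
module full_adder (input a, b, cin, output s, cout);
  assign s    = a ^ b ^ cin;                 // s_i = a_i xor b_i xor c_i
  assign cout = (a & b) | ((a | b) & cin);   // c_{i+1} = a_i*b_i + (a_i + b_i)*c_i
endmodule

// N-bit ripple carry adder: the carry ripples from the least to the most significant FA
module ripple_carry_adder #(parameter N = 8) (
  input  [N-1:0] a, b,
  input          cin,
  output [N-1:0] s,
  output         cout
);
  wire [N:0] c;
  assign c[0] = cin;
  genvar i;
  generate
    for (i = 0; i < N; i = i + 1) begin : fa_chain
      full_adder fa (.a(a[i]), .b(b[i]), .cin(c[i]), .s(s[i]), .cout(c[i+1]));
    end
  endgenerate
  assign cout = c[N];
endmodule

The worst-case carry path of this structure grows linearly with N, which motivates the lookahead and prefix schemes discussed next.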

1.2.2 Carry lookahead adders

The most commonly used scheme for accelerating carry propagation is the carry lookahead

scheme [2], shown on Figure 1.2. The equations in Eqn. 1.4 show the realization of the carry-lookahead generator. The main idea behind carry lookahead addition is an attempt to generate all incoming carries in parallel (for all the n − 1 high order FAs) and avoid the need to wait until the correct carry propagates from the stage (FA) of the adder where it has been generated. This can be accomplished in principle, since the carries generated and the way they propagate depend only on the digits of the original operands x and y. These digits are available simultaneously to all stages of the adder, and consequently each stage can, in principle, calculate the value of the incoming carry and compute the sum bit accordingly. This, however, would require an inordinately large number of inputs to each stage of the adder, rendering this approach impractical [1].

Figure 1.2: Carry lookahead adder [2].


c4 = G*0 + c0·P*0

c8 = G*1 + G*0·P*1 + c0·P*0·P*1

c12 = G*2 + G*1·P*2 + G*0·P*1·P*2 + c0·P*0·P*1·P*2

(1.4)
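As a hedged Verilog sketch of the group carry computation in Eqn. 1.4 (illustrative names; the gstar/pstar inputs stand for the group generate G* and propagate P* signals of the four 4-bit groups):

// Carry-lookahead generator for four groups, following the structure of Eqn. 1.4
module cla_generator (
  input  [3:0] gstar, pstar,  // group generate G*_i and propagate P*_i
  input        c0,            // carry into the least significant group
  output       c4, c8, c12    // carries into groups 1, 2 and 3
);
  assign c4  = gstar[0] | (pstar[0] & c0);
  assign c8  = gstar[1] | (pstar[1] & gstar[0]) | (pstar[1] & pstar[0] & c0);
  assign c12 = gstar[2] | (pstar[2] & gstar[1]) | (pstar[2] & pstar[1] & gstar[0])
             | (pstar[2] & pstar[1] & pstar[0] & c0);
endmodule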

1.2.3 Parallel Prefix Adders

A parallel prefix circuit is a combinational circuit with n inputs x1, x2, ..., xn producing the outputs x1, x2 ◦ x1, ..., xn ◦ xn−1 ◦ ... ◦ x1, where ◦ is an associative binary operation. The first stage of the adder generates the individual P and G signals. The remaining stages constitute the parallel prefix circuit, with the fundamental carry operation serving as the associative binary operation ◦. This

part of the adder can be designed in many different ways.

gi = ai · bi

pi = ai ⊕ bi

(1.5)

Gi:k = Gi:j + Pi:j ·Gj−1:k

Pi:k = Pi:j · Pj−1:k

(1.6)

Although computing carry-propagate addition can use generate and propagate signals, its

implementation in VLSI can be quite inefficient due to the number of wires that have to be

connected together. Parallel-prefix adders solve this problem by making the wires shorter with

simple gate structures to aid in the passing of groups of carries to the next weight [3] [15]. The

proof of the parallel prefix adder formulation can be found in [16].

Parallel-prefix adders can be broken down into three stages:

• Pre-computation: the single bit carry generate/propagate signals are obtained with Equation 1.5 and a temporary sum is generated. This stage can be simplified by applying the rules defined in Section 1.4.1.

• Parallel-prefix tree: the carry at each bit is computed with the group carry generate/propagate signals, where Equation 1.6 is applied. It is possible to simplify the first stage of the tree using the rules defined in Section 1.4.2.


• Post-computation: Sum and carry-out are derived with Equation 1.2.

Basic cell definitions of prefix adders are shown in Figure 1.3.

Figure 1.3: Group PG cells [3].
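The cells of Figure 1.3 map directly to small logic blocks. Below is a hedged Verilog sketch of the pre-computation of Equation 1.5 and the black/gray carry-merge cells of Equation 1.6 (module names are illustrative, not the thesis cell library):

// Bitwise generate/propagate (Eqn. 1.5)
module pg_bit (input a, b, output g, p);
  assign g = a & b;
  assign p = a ^ b;
endmodule

// Black cell: fundamental carry operator on two (G, P) pairs (Eqn. 1.6)
module black_cell (input g_hi, p_hi, g_lo, p_lo, output g_out, p_out);
  assign g_out = g_hi | (p_hi & g_lo);
  assign p_out = p_hi & p_lo;
endmodule

// Gray cell: same carry merge, used where the group propagate is no longer needed
module gray_cell (input g_hi, p_hi, g_lo, output g_out);
  assign g_out = g_hi | (p_hi & g_lo);
endmodule

A prefix network such as Kogge-Stone is then simply a particular wiring of these cells across log2 N stages.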

1.3 Tree Adders

Tree structures have been used for graphically representing the various parallel prefix algorithms.

Many state-of-the-art adder circuits use parallel prefix schemes to achieve high performance

[17–19].

For wide adders, the delay of carry-lookahead (or carry-skip or carry-select) adders becomes

dominated by the delay of passing the carry through the lookahead stages. This delay can

be reduced by looking ahead across the lookahead blocks [20]. In general, it is possible to construct a multilevel tree of lookahead structures to achieve a delay that grows with log N. Such

adders are variously referred to as tree adders, logarithmic adders, multilevel-lookahead adders,

parallel-prefix adders, or simply lookahead adders.

There are many ways to build the lookahead tree that offer tradeoffs among the number of


Table 1.1: Trade-off between different adder topologies.

Topology           Logic Levels   Fanout   Wiring Tracks
Kogge-Stone [22]   Low            Low      High
Sklansky [23]      Low            High     Low
Brent-Kung [24]    High           Low      Low

stages of logic, the number of logic gates, the maximum fanout on each gate, and the amount

of wiring between stages. Figure 1.4 shows a three dimensional taxonomy of prefix adders [21].

The adders at the far edges are Kogge-Stone [22], Sklansky [23], and Brent-Kung [24], with the properties of low logic level and high wiring track count; low logic level and high fanout; and high logic level and low fanout, respectively, as shown on Table 1.1.

Figure 1.4: Taxonomy of prefix networks [3].

The following parallel prefix adders can be found in the literature:

Kogge-Stone The Kogge-Stone tree [22], shown in Figure 1.5, achieves both log2 N stages and a fanout

of 2 at each stage. This comes at the cost of long wires that must be routed between

stages. The tree also contains more PG cells; while this may not impact the area if the

adder layout is on a regular grid, it will increase power consumption. Despite these costs, the


Kogge-Stone adder is generally used for wide adders because it shows the lowest delay among these structures.

Figure 1.5: Kogge-Stone adder [3].

Sklansky The Sklansky or divide-and-conquer tree [23] on Figure 1.6 reduces the delay to

log2N stages by computing intermediate prefixes along with the large group prefixes.

This comes at the expense of fan-outs that double at each level: The gates fanout to [8, 4,

2, 1] other columns. These high fan-outs cause poor performance on wide adders unless

the gates are appropriately sized or the critical signals are buffered before being used

for the intermediate prefixes. Transistor sizing can cut into the regularity of the layout

because multiple sizes of each cell are required, although the larger gates can spread into

adjacent columns. With appropriate buffering, the fan-outs can be reduced to [8,1,1,1].


Figure 1.6: Sklansky adder [3].

Brent-Kung The Brent-Kung tree [24] as shown on Figure 1.7 computes prefixes for 2-bit

groups. These are used to find prefixes for 4-bit groups, which in turn are used to find

prefixes for 8-bit groups, and so forth. The prefixes then fan back down to compute the

carries-in to each bit. The tree requires 2(log2N) − 1 stages. The fanout is limited to 2

at each stage. The diagram shows buffers used to minimize the fanout and loading on the

gates, but in practice, the buffers are generally omitted.

Figure 1.7: Brent-Kung adder [3].


Other than those major adders, trade-offs can be achieved with the following adders.

The Han-Carlson trees [25] are a family of networks between Kogge-Stone and Brent-Kung.

Figure 1.8 shows such a tree that performs Kogge-Stone on the odd-numbered bits, and then

uses one more stage to ripple into the even positions.

The Knowles trees [26] are a family of networks between Kogge-Stone and Sklansky. All

of these trees have log2N stages, but differ in the fanout and number of wires. If we say that

16-bit Kogge-Stone and Sklansky adders drive fanouts of [1, 1, 1, 1] and [8, 4, 2, 1] other columns,

respectively, the Knowles networks lie between these extremes. For example, Figure 1.9 shows

a [2, 1, 1, 1] Knowles tree that halves the number of wires in the final track at the expense of

doubling the load on those wires.

The Ladner-Fischer trees [27] are a family of networks between Sklansky and Brent-Kung.

Figure 1.10 is similar to Sklansky, but computes prefixes for the odd-numbered bits and again

uses one more stage to ripple into the even positions. Cells at high-fanout nodes must still be

sized or merged appropriately to achieve good speed.

Figure 1.8: Han-Carlson adder [3].


Figure 1.9: Knowles [2,1,1,1] adder [3].

Figure 1.10: Ladner Fischer adder [3].

1.4 Recurrence Algorithms

Recurrence algorithms have been a research area for a long time [8] [28]. Weinberger presented

the most widely known carry recurrence for VLSI addition in 1958 [20]. Over the years, several

addition algorithms have been developed. These algorithms manipulate the carry and sum

equations in an attempt to improve the speed of addition. The equations for sum and carry are

defined in Equations 1.2 and 1.3.


Ling modified the algorithm to reduce the complexity of the carry computation at the

cost of increased complexity in the sum computation. An analysis was later performed by

Doran [29] to determine the set of recurrences which have recurrence properties that are similar

to Weinberger’s and Ling’s.

1.4.1 Weinberger Recurrence

Weinberger [20] demonstrated that addition speed could be improved by parallelizing the compu-

tation of carry. Although widely credited with only the Carry Look-Ahead Adder, Weinberger’s

recurrence was not limited in group size or number of levels for carry computation [20]. The

fundamental advancement of his work was the introduction of generate and propagate as shown

on Eqn. 1.7. Weinberger defined the terms: bitwise generate (g), bitwise propagate (p), group

generate (G), and group propagate (P ). These terms allow for carry computation to be per-

formed in parallel, yielding a significant improvement in performance compared to ripple-carry

addition. For a group of 4 bits, Weinberger recurrence has ten terms for the generation of G5:2

from the inputs and four terms for the generation of P5:2. The maximum transistor stack height

is 5.

Weinberger demonstrated that the generate and propagate terms could be used to create blocks of arbitrary size and

parallelized to form multiple levels of recurrence [20]. Thus, the majority of parallel prefix adders

proposed for high-performance addition employ the realizations of Weinberger’s recurrence, e.g.,

Kogge-Stone [22], Brent-Kung [24], Han-Carlson [25], Ladner-Fischer [27], and those described

by Knowles [26].

gi = ai · bi

pi = ai + bi

(1.7)

1.4.2 Ling Recurrence

IBM ECL technology limitations on fan-in (limited to 4) and wired-OR (limited to 8) motivated

Ling to develop a transformation that reduced the fan-in of Weinberger’s recurrence [30] [8].

For clarity, a simple derivation of Ling’s transformation will be shown. This derivation provides


the physical meaning of the signals used in Ling’s transformation and identifies the favorable

characteristics of Ling for implementation in modern CMOS technology. In the derivation, the

bitwise generate signal is defined as: gi = ai · bi and the bitwise propagate signal is defined as:

ti = ai+bi. Note that the propagate signal ti is the same as Weinberger’s pi (when implemented

using an OR). To maintain consistency with Ling’s original paper, ti will be used for propagate.

Ling’s transformation reduces the complexity of Weinberger’s recurrence by factoring ti from

ci+1 to create a pseudo-carry (hi) on which the recurrence is performed. The transformation is

shown below on c1 to form h0. The carry-out signal, c1 , of the first bit position is

c1 = g0 + t0 · c0 (1.8)

Ling’s transformation uses the property ti · gi = gi to form

c1 = t0 · g0 + t0 · c0 = t0 · (g0 + c0) (1.9)

where g0 + c0 = h0, which leads to

c1 = t0 · (g0 + c0) = t0 · h0 (1.10)

The general transformation of ci is defined as

ci = ti−1 · hi−1,   if i > 0
ci = c0,            if i = 0

(1.11)

where the pseudo-carry, hi , is defined as

hi = gi + ci (1.12)

The physical meaning of the pseudo-carry signal h can be described as follows. By factoring

ti out of the carry expression and propagating hi instead of ci+1, all cases where carry is

generated and/or propagated from the stage preceding stage i are included in hi. This includes

the case where a carry-in to the ith stage can be assimilated (which should not result in a

carry-out). The assimilate condition is handled when forming ci+1 by ANDing hi with ti to

produce ci+1. If the carry-assimilate (carry-kill) condition exists then ti = 0 , which results in

ci+1 = 0. A recurrence for hi can be defined, as has been done previously for Weinberger's ci. The


group pseudo-carry and transmit which allow for parallel prefix computation can be defined

over the group of bits (capital letters are used to refer to the group):

Ti:j = ti · ti−1 · · · tj (1.13)

Hi:j = gi + gi−1 + ti−1 · gi−2 + ti−1 · ti−2gi−3 + · · ·+ ti−1 · ti−2 · · · tj+1 · gj (1.14)

The recurrence can be expressed using the "•" operator as

(Hi:j, Ti−1:j−1) • (Hj−1:k, Tj−2:k−1) = (Hi:j + Ti−1:j−1 · Hj−1:k, Ti−1:j−1 · Tj−2:k−1)

(1.15)

The transformation from Weinberger’s recurrence to Ling’s recurrence for a group of 4 bits

is shown in the example in Fig. 2. This figure should dispel any difficulties associated with

understanding Ling's original derivation.

The advantage of using pseudo-carry instead of carry is offset by the increased complexity

of sum computation, which requires the real carry to form individual sum signals. In CMOS

technology, the sum can be efficiently calculated conditionally, thus avoiding the AND operation on the

critical carry path:

si = ai ⊕ bi,          if hi−1 = 0
si = ai ⊕ bi ⊕ ti−1,   if hi−1 = 1

(1.16)
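A minimal Verilog sketch of the conditional sum selection of Eqn. 1.16, assuming the transmit t and pseudo-carry h of the preceding position are available (signal names are assumptions, not the thesis code):

// Conditional sum per Eqn. 1.16: the pseudo-carry h_{i-1} selects between the two
// precomputed sums, keeping the AND of h and t off the critical carry path.
module ling_sum_bit (
  input a, b,       // operand bits at position i
  input t_prev,     // transmit t_{i-1} = a_{i-1} | b_{i-1}
  input h_prev,     // Ling pseudo-carry h_{i-1}
  output s          // sum bit s_i
);
  wire sum0 = a ^ b;            // sum if h_{i-1} = 0
  wire sum1 = a ^ b ^ t_prev;   // sum if h_{i-1} = 1 (the real carry is t_{i-1}·h_{i-1})
  assign s = h_prev ? sum1 : sum0;
endmodule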

1.5 Conclusion

In this chapter, general information about binary adder realization from a VLSI perspective is described. Parallel prefix adders provide the fastest carry propagation on the critical path for wide adders, and their regularity allows fast layout design. The most important advantage

is the design space with many trade-offs in delay, power dissipation and area. Also, recurrence

algorithms are provided to show it is possible to shorten the carry path with different propagate

and generate terms.

In the next chapter, a detailed analysis about the modulo adders is given. End-around-carry

adders are analyzed in detail as a specialized realization of modulo adders. Analysis of previous


work is also included.


Chapter 2

Modulo Adders

2.1 Introduction

Modular arithmetic has been of interest to researchers in a wide range of areas, since its operations are the basis for systems that use Residue Number Systems (RNS) [31]. Modulo

addition/subtraction and multiplication can also be applied to digital filters [32], cryptography

[33], error detection and correction [34], as well as checksum computation in high-speed networks

[35]. More importantly, modulo 2n−1 addition is a common operation that can be implemented

in hardware, because of its circuit efficiency and simple implementation [36].

In end-around-carry adders, the carry-in depends on the carry-out [4]; i.e., the carry-out cout is fed through some logic back to the carry-in cin. In particular, this is used for modulo 2^n + 1 [37] and 2^n − 1 [38] addition, which rely on a decrement and an increment, respectively, of the addition result depending on cout. Since prefix algorithms actually rely on incrementer structures, considering parallel-prefix schemes for this kind of adder is very promising.

In order to obtain fast end-around-carry adders, both conditions of fast carry-out generation and fast carry-in processing have to be met. This implies that there should be no combinational path existing between cin and cout. A fast end-around-carry adder can be built using the prefix structure. Here the last prefix stage is used as an incrementer, which is controlled by the cout of the previous prefix stages.


2.2 Addition in Modulo 2^n − 1 Adder

Addition modulo 2^n − 1, or one's complement addition, can be formulated by the following equation:

A + B (mod 2^n − 1) = A + B − (2^n − 1) = A + B + 1 (mod 2^n),   if A + B ≥ 2^n − 1
A + B (mod 2^n − 1) = A + B,                                      otherwise

(2.1)

However, the condition A + B ≥ 2^n − 1 is not trivial to compute. It can be rewritten as A + B ≥ 2^n with a carry input of 1.

A + B (mod 2^n − 1) = A + B − (2^n − 1) = A + B + 1 (mod 2^n),   if A + B ≥ 2^n
A + B (mod 2^n − 1) = A + B,                                      otherwise

(2.2)

Now the carry-out cout from the addition A + B can be used to determine whether the increment has to be performed, or, even simpler, cout can be added to the sum of A + B. This equation, however, results in a double representation of zero (i.e., 0 = 00...0 = 11...1).

Figure 2.1: Prefix graph with fast end-around carry [4].

The standard approach for the implementation of a modulo 2^n − 1 adder is to use a conventional carry propagate adder (CPA) and have the carry-out fed back into the carry-in of the adder. This creates the necessary end-around carry needed for the modulo 2^n − 1 addition to operate correctly, as shown on Figure 2.2.


Figure 2.2: General block diagram of a modulo 2^n − 1 adder [4].
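As a behavioral illustration of this feedback scheme (a sketch following the structure of Figure 2.2 and Eqn. 2.2, not the thesis implementation; names are assumptions), the carry-out of A + B can simply be added back into the sum; note that this keeps the double representation of zero discussed above:

// Behavioral modulo 2^n - 1 adder: the carry-out of A + B is added back into the sum.
module modulo_adder #(parameter N = 8) (
  input  [N-1:0] a, b,
  output [N-1:0] s
);
  wire [N:0] first_sum = a + b;        // N+1 bits: sum and carry-out
  wire       cout      = first_sum[N];
  assign s = first_sum[N-1:0] + cout;  // end-around increment when cout = 1
endmodule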

2.3 Analysis of Previous End-Around-Carry Adders

The previously defined end-around action can be obtained using different rules. Although the EAC

adder has been used [39] and implemented on several microprocessors, very few details exist

on their formulations and arithmetic algorithms in today’s literature. Schwarz [40] provided

explanations about some aspects of the EAC adder’s algorithm as a part of fused-multiply-

add(FMA) unit. Shedletsky [41] defined the indeterminate behavior of EAC adders and Liu et

al. [42] presented a formal analysis of EAC adders.


Figure 2.3: Block diagram of the 128-bit binary adder [5].

Liu et al. [6] [42] extended the algorithm to make the adder independent, without being part of an FMA unit. The design mainly follows the algorithms of the EAC adder implemented in the IBM POWER6 microprocessor [5]. The additional logic units of the proposed adder ensure that the whole adder works independently. Another advantage is that

it is easier to implement and test the adder in FPGA technology, which enables design space

exploration. Figure 2.4 shows the architecture of the adder.


Figure 2.4: Architecture of the EAC adder [6].

EAC means that when subtracting two signed numbers that are in signed magnitude format,

the subtraction is implemented by the addition of the first operand with the Boolean complement

of the second operand. For this addition, instead of setting a carry into the least significant

digit, the carry-out of the most significant digit is taken as the carry-in. This ensures that the result of the addition is always a positive magnitude result, and preferably only one operand needs to be conditionally complemented. Thus, an EAC adder performs addition like other regular adders, and performs subtraction using the end-around carry to ensure the result is positive.

The adder shown in Figure 2.4 should satisfy the following conditions: 1) when x.s = y.s, the adder should perform addition; the sign is s.s = x.s and the magnitude is x.x + y.y. 2) when x.s ≠ y.s, the adder should perform subtraction. If x.x ≥ y.y, then s.s = x.s and the magnitude is x.x − y.y; if x.x < y.y, then s.s = y.s and the magnitude is y.y − x.x.

The subtraction operation can be described as follows:

1) Determining which operand is bigger. After a subtraction operation, if the result is positive, operand x is bigger; otherwise y is bigger. Since $x.x - y.y = x.x + \overline{y.y} + 1 = x.x + 2^n - y.y$, the carry-out of $x.x + \overline{y.y} + 1$ will be 1. Therefore, cout results as 1 if x is bigger and 0 if y is bigger. Hence the sum equation can be written as $x.x + \overline{y.y} + c_{out}$.

2) When y is bigger, cout = 0, and the subtraction can be written as $y.y - x.x = -(x.x - y.y) = -(x.x + \overline{y.y} + 1) = \overline{(x.x + \overline{y.y} + 0)} + 1 - 1 = \overline{(x.x + \overline{y.y})}$.

3) The cout is used to select the correct result:

$s = x.x + \overline{y.y} + c_{out}$,             if $c_{out} = 1$
$s = \overline{x.x + \overline{y.y} + c_{out}}$,  if $c_{out} = 0$

In order to implement addition and subtraction in one adder, y should be complemented conditionally. The effective operation can be defined as:

Os = x.s⊕ y.s (2.3)

yt = y.y,                   if Os = 0
yt = $\overline{y.y}$,      if Os = 1

The sign of the result is determined by the sign logic:

s.s = x.s,   if cout = 1
s.s = y.s,   if cout = 0

Generally, the implementation of an adder/subtracter is achieved using two different adders, one for addition and one for subtraction. After the results of both operations are calculated, the final result is selected with a multiplexer, as given on Figure 2.5.


Figure 2.5: Architecture of the EAC adder [6].

2.4 Carry-lookahead EAC Logic Unit

The use of an EAC unit helps implement an adder/subtracter using only one adder. In [40], the algorithm for an EAC unit with four carry bits can be found. The most significant bit is labeled as 0. The group carries for a CLA adder are defined as:

C0 = G0 + P0G1 + P0P1G2 + P0P1P2G3 + P0P1P2P3Cin

C1 = G1 + P1G2 + P1P2G3 + P1P2P3Cin

C2 = G2 + P2G3 + P2P3Cin

C3 = G3 + P3Cin

(2.4)

If the carry-out C0 is fed back to the carry-in, the EAC operation is achieved as in Equation 2.5.

C0 = G0 + P0G1 + P0P1G2 + P0P1P2G3 + P0P1P2P3

C1 = G1 + P1G2 + P1P2G3 + P1P2P3G0 + P0P1P2P3

C2 = G2 + P2G3 + P2P3G0 + P2P3P0G1 + P0P1P2P3

C3 = G3 + P3G0 + P3P0G1 + P3P0P1G2 + P0P1P2P3

(2.5)

Combining the carries in this way results in a carry chain for every group that is the length of the width of the adder. This wrapping of the carries is correct for subtraction but is not correct for addition. To make the adder selectable for addition and subtraction, the P3 term needs to be modified. An extra bit is combined with the least significant bit of the adder to assert the carry propagation when the effective operation is subtraction. This bit can be integrated


into P3 to make P3 = 0 for an effective operation of addition as defined in Eqn 2.6.

Pt3 = P3,   if Os = 1
Pt3 = 0,    if Os = 0

(2.6)
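A hedged Verilog sketch of this carry-lookahead EAC logic unit, combining Equations 2.5 and 2.6; group 0 is the most significant group as in the text, op_sub stands for the effective operation Os, and all names are assumptions rather than the thesis design:

// Carry-lookahead EAC logic for four group (G, P) pairs. P3 is gated by Os (Eqn. 2.6)
// so that the wrapped carry terms of Eqn. 2.5 are active only for subtraction.
module cla_eac_unit (
  input  [3:0] G, P,     // group generate/propagate, index 0 = most significant group
  input        op_sub,   // effective operation Os (1 = subtraction)
  output [3:0] C         // group carries C0..C3
);
  wire P3t = P[3] & op_sub;

  assign C[0] = G[0] | (P[0]&G[1]) | (P[0]&P[1]&G[2]) | (P[0]&P[1]&P[2]&G[3])
              | (P[0]&P[1]&P[2]&P3t);
  assign C[1] = G[1] | (P[1]&G[2]) | (P[1]&P[2]&G[3]) | (P[1]&P[2]&P3t&G[0])
              | (P[0]&P[1]&P[2]&P3t);
  assign C[2] = G[2] | (P[2]&G[3]) | (P[2]&P3t&G[0]) | (P[2]&P3t&P[0]&G[1])
              | (P[0]&P[1]&P[2]&P3t);
  assign C[3] = G[3] | (P3t&G[0]) | (P3t&P[0]&G[1]) | (P3t&P[0]&P[1]&G[2])
              | (P[0]&P[1]&P[2]&P3t);
endmodule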

2.5 Conclusion

In this chapter, detailed information about binary modulo adders and previous work on EAC adders is provided. It is shown that feeding cout back into a carry-increment stage provides the EAC operation, as defined for 2^n − 1 adders. The designs in [5] and [12] employ the adder as part of an FMA flow. However, by adding extra logic, Liu [6] proposes a standalone adder.

In the next chapter, a detailed analysis of the proposed parallel prefix EAC adder will be given. The design presents an alternative to the previous work while providing a wider design space.


Chapter 3

Modified Parallel Prefix EAC Adder

3.1 Introduction

Previously proposed adders were analyzed in Section 2.3. Since the adder designed in [5] did not include the implementation details about how the blocks were internally built, [6] made a detailed analysis of how the first stage adder and EAC blocks are defined. They also extended the work to a complete adder without an FMA unit. Their implementation was directed to FPGA technology. Although the analysis that they provided answered many of the questions about the previously designed work, a number of parts of the adder still need to be analyzed. In this chapter, we propose a modified adder which uses a modified parallel prefix 2^n − 1 adder block as the

EAC logic with the conditional sum blocks for flexibility among different adder architectures

and lower total propagation delay.

3.2 Proposed Adder

Figure 3.1 shows the architecture of the proposed adder. The first level includes eight 16-bit blocks of Kogge-Stone prefix-2 adders for the 128-bit inputs, and the second level includes a modified 2^n − 1 parallel prefix Kogge-Stone adder. In order to design a standalone adder, the input

complement, add/sub and sign blocks are included as suggested in [6].


Figure 3.1: Architecture of the modified EAC Adder.

3.2.1 The 16bit blocks in EAC adder

Figure 3.2 shows the 16bit conditional sum blocks. The black and gray blocks are the same as

Fig. 1.3. The dashed lines correspond to the halfsum blocks, hi = ai ⊕ bi. GG and GP refer to

the group generate and propagate signals.

As explained in [6], when the carry into the adder block is assumed to be 0, it is possible to reduce the complexity of the adder. When the carry-in is intended to propagate, the generate term in Equation 1.6 can be extended as Gi:k = Gi:j + Pi:j · cin. Thus, for cin = 0 it reduces to Gi:k = Gi:j. It can be seen that the generated group carry for the corresponding bit position need not be merged with the previous carry. Thus, it is possible to reduce the number of black cells and use gray cells instead. Namely, if there is no carry merge operation in the next level, the corresponding level can be terminated with a gray cell. Figure 1.5 shows the 16-bit Kogge-Stone tree with gray


and black cells. However, since we included a second stage in our design, we need both the

generate and propagate terms. This necessity increases the use of black cells instead of gray

cells.

Figure 3.2: 16bit conditional sum blocks.

The adder in [6] is not clear about how the input carry is merged in the first level of adders.

A discussion of the most efficient approaches for the traditional carries can be found in [43].

The carry-in bit can be included either by adding a fast carry increment stage or by treating

cin as an extra bit of the preprocessing stage of the adder. The first case is shown in Figure 3.3. The second case can be derived by setting g−1 = cin according to Equation 1.6. In both schemes, the complexity increases to solve the carry incorporation problem.

Additionally, [6] and [5] do not explain how the carry is propagated after the 8-bit blocks. Although a similar structure is used in [12], the adder architecture is different, and only one set of generate-propagate signals is calculated and propagated in the first level. Since [6] uses two different adders, one for each condition cin = 0 and cin = 1, two sets of carries need to be selected before


leaving the first level. This operation makes the calculation more complicated. Thus, we use

the architecture in Figure 3.2 to calculate one set of generate-propagate as group terms for the

condition cin = 0. In the next stage, the conditional carry for cin = 1 is calculated with the simple equation Gi:k = Gi:j + Pi:j, which is simply an OR gate.

Figure 3.3: cin merge with fast carry link [7].

Figure 3.4: cin merge with extra bit [7].

As a final step the sum for each carry condition, for cin = 0 and cin = 1, is calculated to be


selected according to the result of the second level of carry calculation.
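A hedged Verilog sketch of the per-block conditional sum idea described above: one set of group generate/propagate terms is produced for cin = 0, both sums are precomputed, and the carry returned by the second-level EAC block selects the final sum. The behavioral adders stand in for the 16-bit Kogge-Stone tree, and all names and widths are illustrative, not the thesis code.

// Conditional-sum block sketch (illustrative names, not the thesis code).
// Group generate/propagate are computed assuming cin = 0; both sums are precomputed,
// and the second-level EAC block later picks one through carry_sel.
module cond_sum_block #(parameter N = 16) (
  input  [N-1:0] a, b,
  input          carry_sel,           // carry returned by the second-level EAC block
  output [N-1:0] sum,
  output         GG, GP               // group generate / propagate sent to the EAC block
);
  wire [N:0]   sum0 = a + b;          // behavioral stand-in for the cin = 0 prefix tree
  wire [N-1:0] sum1 = a + b + 1'b1;   // sum assuming cin = 1
  assign GG  = sum0[N];               // group generate: carry-out with cin = 0
  assign GP  = &(a ^ b);              // group propagate: every bit position propagates
  assign sum = carry_sel ? sum1 : sum0[N-1:0];
endmodule

The carry-out of the block for the cin = 1 condition is then GG | GP, which is the single OR gate mentioned above.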

3.2.2 Parallel Prefix 2^n − 1 EAC Block

As explained in Chapter 2, 2^n − 1 adders can be used for the EAC calculation. In this section, we extend the use of 2^n − 1 adders to an adder/subtractor. Figure 3.5 shows the modified 8-bit Kogge-Stone EAC block.

Figure 3.5: Modified 8bit Kogge-Stone EAC block.

The block first takes the group generate and propagate terms that are calculated in the first

level. Since there are eight 16 bit adders in the first level, the carry for the whole adder must

be calculated first. The 8 carries are merged in the Kogge-Stone adder. In order to achieve the subtraction operation, the final carry at the most significant bit position must be merged with the lower significant positions. This step is also called the carry increment stage. The effective operation signal Os is defined as in Eqn. 2.3. When the operation is subtraction, Os = 1, the AND gate propagates the most significant carry; when the operation is addition, Os = 0, the most significant carry is blocked and the unit works as a regular adder. ct8 = Os · G127:0 is defined as

the end-around-carry in Eqn. 3.1.

ct8 = G127:0,   if Os = 1
ct8 = 0,        if Os = 0

(3.1)
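A hedged Verilog sketch of this second-level EAC block: the eight group (G, P) pairs from the first level are merged, the end-around carry of Eqn. 3.1 is formed with the Os-controlled AND gate, and a carry-increment stage folds it back into the group carries returned to the 16-bit blocks. A behavioral ripple merge stands in for the modified 8-bit Kogge-Stone tree, and all names are assumptions.

// Second-level EAC block sketch (illustrative, not the thesis design).
module eac_block (
  input  [7:0] G, P,       // group generate/propagate from the eight 16-bit blocks (0 = LSB group)
  input        op_sub,     // effective operation Os (1 = subtraction)
  output [7:0] carry_sel   // carries returned to select the conditional sums
);
  wire [8:0] c;            // c[i] = carry into group i before the end-around carry
  genvar i;
  assign c[0] = 1'b0;
  generate
    for (i = 0; i < 8; i = i + 1) begin : merge
      assign c[i+1] = G[i] | (P[i] & c[i]);
    end
  endgenerate

  wire ct8 = op_sub & c[8];          // end-around carry, ct8 = Os * G127:0 (Eqn. 3.1)

  assign carry_sel[0] = ct8;         // carry increment stage: fold ct8 back into each group carry
  generate
    for (i = 1; i < 8; i = i + 1) begin : incr
      assign carry_sel[i] = c[i] | ((&P[i-1:0]) & ct8);
    end
  endgenerate
endmodule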

In the literature [37] [4] [44] [45] [15], the carry-increment stage has found many uses for the 2^n − 1 operation. Especially for wide adders, N ≥ 64, the final carry needs to travel a very long path to arrive at the least significant position, actually twice the adder length, since the carry can propagate to the most significant position and then wrap around toward the least significant position again. It is shown in [40] that the carry needs to travel only the total adder length in an EAC operation. That result motivated the use of CLA logic as the EAC logic. In our adder, we use a second stage for the EAC calculation instead of one long parallel prefix propagation and carry increment stage. Namely, for a 128-bit EAC adder, the shortest stage count is n = 7 (since 2^n = 128), plus one stage of 127 carry merge terms. In our adder, we have 4 stages for the first level, 3 stages for the second level, and 1 stage of 8 carry merge terms, which makes the total stage count the same, but with a much lower count of carry merge terms. Decreasing the number of carry merge terms decreases the length of the carry path as well as the delay.

3.3 Implementation and Validation

After the adder architecture is finalized, both the modified and previously proposed [6] adders

are designed in Verilog. A test fixture is created to validate the adder with corner cases, such as the 0-to-1 crossover and carry propagation from the 16th bit to the next bit. A check procedure is defined to signal if the output of the adder and the calculated result are different. The tests show that both adders behave the same way, and thus the modified EAC adder is verified.
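A minimal sketch of such a check procedure in a Verilog testbench, assuming a 128-bit adder module named adder128 with the ports shown (all names and the stimulus values are assumptions, not the thesis test fixture in Appendix A):

// Illustrative self-checking test fixture: applies a stimulus and flags any mismatch
// between the DUT output and the expected result computed in the testbench.
module adder_tb;
  reg  [127:0] a, b;
  reg          op_sub;          // 1 = subtract, 0 = add
  wire [127:0] sum;

  adder128 dut (.a(a), .b(b), .op_sub(op_sub), .sum(sum));   // device under test

  task check;                   // compare DUT output against a behavioral reference
    reg [127:0] expected;
    begin
      expected = op_sub ? (a >= b ? a - b : b - a) : a + b;
      #1;                       // allow the combinational logic to settle
      if (sum !== expected)
        $display("MISMATCH: a=%h b=%h op=%b got=%h exp=%h", a, b, op_sub, sum, expected);
    end
  endtask

  initial begin
    // corner case: carry propagation across the 16-bit block boundary
    a = 128'h0000_0000_0000_0000_0000_0000_0000_FFFF; b = 128'h1; op_sub = 0; check;
    // corner case: 0-to-1 crossover on subtraction of nearly equal operands
    a = 128'h1; b = 128'h2; op_sub = 1; check;
    $finish;
  end
endmodule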

3.4 Conclusion

In this chapter the proposed Parallel Prefix EAC adder is analyzed. It is shown that using

conditional sum blocks solves the carry incorporation problem existing in the previous works.

Moreover, the EAC logic in [6] is redesigned with a modified 2^n − 1 adder to provide a wider design space and a shorter carry path. As a final step, the adder is implemented in Verilog and

validated through simulation.

In the next chapter, a simple method for transistor level realization of the critical path is

discussed.


Chapter 4

Critical Path Analysis

4.1 Path Identification

The critical path is identified as the path from the sign logic to the sum at the output, as shown

in Figure 4.1 for the proposed adder. In order to conditionally complement the second input,

the effective operation Os is calculated from sign inputs sA and sB. After Bt is calculated, the

propagate and generate terms are calculated using Eqn. 1.7. The half-sum block, hi = ai ⊕ bi, is not on the critical path because the half-sum is needed only for the sum calculation, not for carry propagation. After the 4 stages of the 16-bit adder, the carry is merged with the other group terms

in the 8bit EAC block. The EAC block consists of 3 carry merge stages and 1 carry increment

stage. We only take into account the carry merge operation Gi:k = Gi:j + Pi:j · Gj−1:k, which

is simply an AND-OR-INVERT(AOI) gate, because the group propagation term is not on the

critical path. The most significant carry should either be blocked or propagated. Thus, the

Os controlled AND gate is on the critical path. Since the end-around carry term, the most

significant bit in EAC block, is calculated before the carry increment stage, c8 is not on the

critical path. If we choose one of the carries in between the last and first bits, it gives a good estimate of the delay of the operation. At the next stage, one of the sums needs to be selected according to the carry from the EAC block as the output of each 16-bit adder. Therefore, one

mux is on the critical path. As a final stage, the output needs to be conditionally complemented

according to the end-around-carry and operation. The calculation of the condition is not on


the critical path because it can be calculated during the first MUX stage. Thus, we end up with one XOR stage for complementing.

Figure 4.1: Critical path of the modified EAC adder.

4.2 Path Design

After the critical path is identified, the gate level model is created as in Figure 4.2. All the

inverters on the critical path are removed to decrease the number of logic levels by applying

the inverting property of CMOS to the consecutive levels. It can be observed that each stage

of carry merge operation shows an alternating design of AOI and OAI. The output load is one

minimum size inverter.


Figure 4.2: Gate level design of critical path.

The gate level design is just a logical expression of the circuit. Therefore, in order to run accurate simulations, the gates should be mapped to transistor level models. In this phase, we created a library of custom-designed gates instead of using a standard cell library, for higher performance.

4.3 Transistor Level Design and Sizing

Full custom design always outperforms standard cell design in terms of power consumption, area, and propagation delay. However, it is hard to automate full custom design, and the design process takes much longer than design with standard cells. A number of techniques exist when designing at the transistor level.

4.3.1 Logic Level Minimization

Before designing any logic circuit, it is beneficial to exploit logic level minimization. In our design, we tried to combine the pg generation and the first level of AOI as suggested in [8]. This approach allows a single stage to be removed from the Weinberger adder realization. However, this method increases the stack height, and since our blocks were designed with close to minimum width transistors, we did not observe any improvement in the delay. Figure 4.3 shows the

reduced first stage.


Figure 4.3: Reduced first stage in Weinberger recursion adder [8].

As another step, the Ling recursion is applied to the first stage. However, as suggested in [15], Ling carries must be converted to real carries by ANDing the most significant carry with the corresponding propagate term, ci+1 = di · pi, before leaving the adder block. This calculation is on the critical path. Therefore, the Ling recursion is not applied.

As a final step, we combined the Os controlled AND gate with the previous OAI gate.

However, we observed that increasing the complexity of the gate did not provide much delay

improvement because the gates are already sized close to minimum.


4.3.2 Late arriving signal exploitation

One of the most important considerations in transistor level design is to connect the late arriving signals close to the output. For the AOI gate shown in Figure 4.4, assuming A is the late arriving signal, the gate in Figure 4.5 provides better delay.

Figure 4.4: AOI without late arriving exploitation.


Figure 4.5: AOI with late arriving exploitation.

4.3.3 Logical Effort

Sutherland et al. described the very useful concept of logical effort [46]. The method of logical

effort is founded on a simple model of the delay through a single MOS logic gate. The model

describes delays caused by the capacitive load that the logic gate drives and by the topology

of the logic gate. Clearly as the load increases, the delay increases, but delay also depends on

the logic function of the gate. Inverters, the simplest logic gates, drive loads best and are often

used as amplifiers to drive large capacitances. Logic gates that compute other functions require

more transistors, some of which are connected in series, making them poorer than inverters at

driving current. A NAND gate has more delay than an inverter with similar transistor sizes

that drives the same load. The method of logical effort quantifies these effects to simplify delay

analysis for individual logic gates and multistage logic networks.

The logical effort of a logic gate tells how much worse it is at producing output current

than is an inverter, given that each of its inputs may present the same input capacitance as the


inverter. Reduced output current means slower operation, and thus the logical effort number for

a logic gate tells how much more slowly it will drive a load than would an inverter. Equivalently,

logical effort is how much more input capacitance a gate must present in order to deliver the

same output current as an inverter.

Logical Effort for Multistage Networks

The method of logical effort reveals the best number of stages in a multistage network and how

to obtain the least overall delay by balancing the delay among the stages. The notions of logical

and electrical effort generalize easily from individual gates to multistage paths.

The logical effort along a path compounds by multiplying the logical efforts of all the logic

gates along the path. The symbol G denotes the path logical effort, so that it is distinguished

from g, the logical effort of a single gate in the path. The subscript i indexes the logic stages along the path.

$G = \prod_i g_i$ (4.1)

The electrical effort along a path through a network is simply the ratio of the capacitance

that loads the last logic gate in the path to the input capacitance of the first gate in the path.

The uppercase symbol H indicates the electrical effort along a path. In this case, Cin and Cout

refer to the input and output capacitances.

$H = C_{out}/C_{in}$ (4.2)

Branching effort b is used to account for fanout within a network. When fanout occurs

within a logic network, some of the available drive current is directed along the analyzed path

and some is directed along the off-path. The branching effort b is defined at the output of a logic gate to be

$b = (C_{\text{on-path}} + C_{\text{off-path}})/C_{\text{on-path}}$ (4.3)

Note that if the path does not branch, the branching effort is one. The branching effort

along an entire path B is the product of the branching effort at each of the stages along the


path.

$B = \prod_i b_i$ (4.4)

Utilizing the definitions of logical, electrical and branching effort along a path, path effort

is defined as F. Note that the path branching and electrical efforts are related to the electrical

effort of each stage.

$F = G \cdot B \cdot H$ (4.5)

Although it is not a direct measure of delay along the path, the path effort holds the key

to minimizing the delay. Observe that the path effort depends only on the circuit topology and

loading and not upon the sizes of the transistors used in the logic gates embedded within the

network. The path effort is related to the minimum achievable delay along the path; only a little additional work is needed to find the best number of stages and the proper transistor sizes that realize this minimum delay.

Optimizing the design of an N-stage logic network proceeds from a very simple principle that

the path delay is least when each stage in the path bears the same stage effort. This minimum

delay is achieved when the stage effort is:

$f = g_i \cdot h_i = F^{1/N}$ (4.6)

To equalize the effort borne by each stage on a path, and therefore achieve the minimum

delay along the path, appropriate transistor sizes for each stage of logic along the path must be

chosen. Each logic stage should be designed with electrical effort:

$h_i = F^{1/N}/g_i$ (4.7)

From this relationship, it is straightforward to determine the transistor sizes of the gates along a path, starting at the end of the path and working backward to apply the capacitance transformation:

$C_{in,i} = g_i \cdot C_{out,i}/f$ (4.8)


The equation determines the input capacitance of each gate, which can then be distributed

appropriately among the transistors connected to the input.
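As a brief worked illustration of Equations (4.1)-(4.8) (the numbers are hypothetical and not taken from the adder itself), consider a four-stage path built from an inverter, a NAND2 ($g = 4/3$), a NOR2 ($g = 5/3$), and an inverter, with one two-way branch and an electrical effort of 9:

$$G = 1 \cdot \tfrac{4}{3} \cdot \tfrac{5}{3} \cdot 1 = \tfrac{20}{9}, \qquad B = 2, \qquad H = 9, \qquad F = G \cdot B \cdot H = 40.$$

The optimal stage effort is $f = F^{1/4} = 40^{1/4} \approx 2.5$; each stage is then assigned $h_i = f/g_i$, and the input capacitance of every gate follows from $C_{in,i} = g_i \cdot C_{out,i}/f$, working backward from the load.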

4.3.4 Design with Helpers

As described in [47], in most adders the stage effort is roughly constant if wire capacitance is neglected. This means uniform gate sizes may be used throughout with little loss in performance, and a very regular layout is possible in this case. However, adders such as Sklansky exhibit an exponential increase in fanout. The stage effort then becomes high for those cells, and as a result a bigger driver is needed. Therefore, Harris et al. [47] propose the concept of helpers: when the stage effort increases, it is good practice to duplicate driving cells in parallel to maintain a lower delay.

Figure 4.6: EAC logic with helper.

Using the concept described in [47], we duplicate the operation-controlled NAND gate. This decreases the fanout that each copy presents to the carry increment stage and, consequently, the required size of the NAND gate.
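A minimal netlist sketch of the idea, in the style of the critical-path deck in Appendix B (the instance names, node names, and the coefficient fd are illustrative, not the exact ones used in the design): the driving gate is simply instantiated twice with identical inputs, and the downstream OAI cells are split between the two copies.

* helper: one logical NAND realized as two physical copies with a split fanout
X_DRV  a b y1 nand2 length=L width='fd*W'   $ original driver, now sees only part of the load
X_DRVH a b y2 nand2 length=L width='fd*W'   $ helper copy driven by the same inputs
* y1 and y2 each drive only a subset of the downstream OAI cells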

4.4 Transistor Sizing

Logical effort has found wide use in transistor sizing [48–51]. In our proposed adder, we used the concept to size the transistors in the critical path. After the sizes are determined, it is possible to reuse these widths in the remaining blocks, because blocks that are not on the critical path see the same loads and fan-outs within the same logic level.

In order to equalize the rising and falling edges of the output signal, a wp/wn = 2.5/1 ratio is used between PMOS and NMOS transistors. In the transistor-level library, late-arriving signal exploitation, helpers, and logical effort are used to achieve the highest performance. In addition, the Os-controlled AND gate needs to drive 7 OAI cells. If we use two helpers, as described in Section 4.3.4, instead of one, each gate drives at most 4 OAI cells, which makes it possible to decrease both the gate size and the delay. In order to apply logical effort, a spreadsheet is created as in Figure 4.7. After the logical effort parameters are calculated, the width is distributed to each stage according to the stage coefficients. Since the gates are designed according to the 2.5/1 ratio, the stage coefficients indicate how large each gate is compared to a unit inverter.

Figure 4.7: Spreadsheet for Logical Effort Calculation.
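For reference, the stage coefficients obtained from the spreadsheet appear directly as width multipliers in the critical-path deck of Appendix B; for example, the line

X_U3 1 vdd 2 nand2 length=L width='f3*W'

scales the unit width W by the stage coefficient f3, and the deck sets the coefficients with .PARAM f1=2 f2=2 f3=2.5 and so on.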

After the appropriate transistor sizes are found, we performed simulations using HSPICE [52] and the FreePDK45 library [53], [54]. Figures 4.8 and 4.9 show the transistor-level schematic.


Figure 4.8: Transistor level schematic.

Figure 4.9: Transistor level schematic.


4.5 Simulation Results

During the simulation, a square wave signal with 150 ps rising and falling edges is applied, and the delay between input and output is measured. The power dissipation of the critical path is measured to estimate the power of one bit of the adder. Temperature and supply voltage are swept over 25–100 ◦C and 0.8–1.2 V to observe the performance trade-offs.

It can be observed from Figures 4.10–4.13 that the critical path shows 141.8 ps delay and 18.9 µW power dissipation per bit when a 10% switching activity is assumed [3], which amounts to 2.42 mW for 128 bits.
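The 128-bit figure follows directly from the per-bit estimate: $128 \times 18.9\,\mu\mathrm{W} \approx 2.42\,\mathrm{mW}$.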

Figure 4.10: Delay vs. Vdd at 25 ◦C.


Figure 4.11: Power vs. Vdd at 25 ◦C.

Figure 4.12: Delay vs. Vdd at 100 ◦C.


Table 4.1: Delay and power dissipation as a function of VDD and temperature.

Vdd(V)   Delay(ps) 25 ◦C   Delay(ps) 100 ◦C   Power(µW) 25 ◦C   Power(µW) 100 ◦C
0.8      178.18            269.08              10.33             11.55
1.0      141.83            214.94              18.90             20.59
1.2      125.45            189.31              33.79             36.19

Figure 4.13: Power vs. Vdd at 100 ◦C.

Table 4.1 shows the trade-off between temperature and supply voltage for the proposed adder. When Vdd is increased from 1.0 V to 1.2 V, delay decreases by 12% while power increases by 79%; when Vdd is decreased from 1.0 V to 0.8 V, delay increases by 26% and power decreases by 45%. When the temperature changes from 25 ◦C to 100 ◦C, delay increases by 51% and power increases by 9.2%. It can be observed that, for low-power applications, lowering the supply voltage saves considerably more power than it costs in delay, whereas high-speed applications must spend more power. Since higher temperature adversely affects both power and delay, the circuit should be kept at a lower temperature for higher performance.
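For example, the temperature sensitivity quoted above follows from the 1.0 V column of Table 4.1, $(214.94 - 141.83)/141.83 \approx 51.5\%$ for delay, and the supply scaling from the 25 ◦C row, $(33.79 - 18.90)/18.90 \approx 79\%$ for power.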


4.6 Conclusion

In this chapter, the methodology for transistor-level critical path optimization is presented. First, the critical path of the adder is identified taking into account the logic levels. Transistors are sized using the logical effort concept [46]. Architecture-level and transistor-level optimizations are applied according to full custom design rules in the FreePDK45 library [54]. As a final step, transient simulations are carried out to calculate the delay and power dissipation. It is observed that the method enables fast comparison of different adder architectures in terms of critical-path delay and power dissipation.

In the next chapter, a methodology for fast area estimation and layout entry is given. A datapath library is generated that exploits the regularity of the proposed adder.


Chapter 5

Datapath Library

5.1 Introduction

Datapaths have been a research topic in recent years as an approach to make layout entry faster [9], [10], [55], [56]. Informally, datapaths are circuits where the same or similar logic is applied to several bits [57]. A datapath stack [58] is made up of many custom word lines, such as registers, ALUs, adders, shifters, multiplexers, and buffers, which form the data flow of the functional units.

Datapaths are characterized by a highly regular layout structure. A typical datapath floorplan consists of an array of horizontally oriented words of identical bit cells, called datapath cells, and vertically oriented bit slices, as shown in Figure 5.1. Since each bit slice is replicated a number of times (determined by the datapath width) with very little or no modification, layout generation of such regular structures reduces to a careful design, often by means of handcrafting, of individual datapath cells. Figure 5.2 shows the regular placement and routing inside a datapath slice. Figure 5.3 shows the schematic layout of a datapath and the corresponding bit slice [55].


Figure 5.1: Global floorplan of a datapath [9].

Figure 5.2: Regularity placement and routing datapath circuit [10].


Figure 5.3: Schematic layout of datapath and detailed view of bitcell [11].

Datapath circuits are typically organized in horizontal rows of words representing the same

functional block and vertical bit slices, delimited by vertically running power and ground rails.

The layout of the datapath cell of bit slice i is identical to that of bit slice (i+1), but mirrored

along the vertical axis so that adjacent bit slices can share a common power or ground rail.


Figure 5.4: Representation of a datapath cell [9].

The width of the bit slice, also known as the pitch, is fixed; it determines the width of all the datapath cells, as outlined in Figure 5.4. Power and ground (VDD/VSS) supply rails generally delimit the pitch. Signal nets are connected to the datapath cell components by means of bristles. Vertical bristles, or data lines, provide wiring between different cells within the same bit slice; they run in parallel to the power rails. Horizontal bristles, or control lines, provide wiring between datapath cells of different bit slices; control lines span the width of the datapath and run perpendicular to the power rails. Since adders have very regular structures, they can be designed as datapaths.

5.2 Concepts in Full Custom Design

Transistor chaining and device merging Transistor chaining is a widely used technique to

improve both area and performance of datapath cells. Several transistors can be chained

together by combining their diffusion areas in order to reduce the diffusion capacitance.

Diffusion sharing applied to simple logic gates in the same datapath cell is known as device

merging.

Transistor folding Transistor folding is another popular technique aimed at minimizing area


and improving performance of custom designs. The folding changes the aspect ratio of

the component while maintaining the required device size (W/L ratio). By performing

folding with a different number of fingers (poly gates), different component instances can be

created for the placement phase.

Intracell sharing Two component areas (diffusion regions or poly gates) belonging to components from adjacent bit slices can be merged if they share the same global net, such as a

power line, control line or clock signal. In a typical organization of a datapath, adjacent

bit slices are identical copies of each other, reflected with respect to the vertical boundary

line. In this case the components can be pushed under the boundary line (ground or power

rail) to create a more compact layout.
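As a small sketch of transistor folding in the netlist style of Appendix B (the use of the BSIM4 finger parameter NF here is an assumption; the actual cells were folded by hand in layout): a wide device can be drawn either as a single finger or split into two fingers of the same total width, which roughly halves the diffusion height and changes the aspect ratio of the cell.

* single-finger PMOS, total W = 0.45u
m1  out in vdd vdd PMOS_VTL L=0.05u W=0.45u
* the same device folded into two fingers (same total W/L, about half the height)
m1f out in vdd vdd PMOS_VTL L=0.05u W=0.45u NF=2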

5.3 Datapath Design

Using the concepts defined in Sections 5.2 and 4.3, the logic gates are designed. Figures 5.5(a) and 5.5(b) show the design of the basic AOI and MUX2 cells according to these rules. The AOI cell occupies a 0.77 µm × 1.5 µm area and the MUX2 cell 0.97 µm × 1.5 µm.

(a) AOI Layout (b) MUX2 Layout

Figure 5.5: Designed basic cells.

The longest cell, according to the values from Figure 4.7, has a height of 1.5 µm. Thus, the other cells are designed to match the longest cell in order to maintain regularity and to exploit the datapath structure. Figure 5.7 shows the bit slice of the 16-bit Kogge-Stone adder. Using small cells allowed us to reduce the height of the bit slice. The bit slice includes the sign logic and the conditional sum calculation. It can be seen from the datapath that the carry-merge and propagate cells alternate from one stage to the next due to the intrinsic inversion of static CMOS gates.

(a) EAC bitslice (b) EAC last bit

Figure 5.6: Bit slices of the blocks in the adder.


Figure 5.7: Bit slice of 16bit Kogge-Stone adder.


Using the same approach as in Figure 5.7, bit slices for the regular and last bits of the parallel prefix EAC block can be designed. Since the last bit has an irregular layout, it is designed separately. Figures 5.6(a) and 5.6(b) show the designed bit slices for the EAC block.

5.4 Layout Design

Using the datapath designs, it is straightforward to design the whole adder. Replicating the bit slice designed in Figure 5.7 gives the layout of the 16-bit Kogge-Stone adder, shown in Figure 5.8(a). The 16-bit adder and the EAC block measure 13.42 µm × 21.98 µm and 5.18 µm × 11.06 µm, respectively. As a final step, using the designed blocks, it is possible to estimate the total area of the 128-bit EAC adder. Two different schemes are considered. Figure 5.9 shows a thinner but wider layout intended to be used as part of a larger datapath. Figure 5.10 shows a thicker but more compact layout in which half of the adder is stacked on top of the other half. The first and the second are sized 20.8 µm × 175 µm and 35.8 µm × 87.5 µm, respectively.

Figure 5.9: Wide layout.

Figure 5.10: Stacked compact layout.
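The compact scheme accounts for the area figure quoted in the results: $35.8\,\mu\mathrm{m} \times 87.5\,\mu\mathrm{m} \approx 3{,}132\,\mu\mathrm{m}^2$ for the stacked layout of Figure 5.10, versus $20.8\,\mu\mathrm{m} \times 175\,\mu\mathrm{m} = 3{,}640\,\mu\mathrm{m}^2$ for the wide layout of Figure 5.9.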


Table 5.1: Results comparison of the proposed adder with previous work.

           Delay(ps)    Power(mW)   Area(µm2)       Technology
[5]        200 (+29%)   -           -               65nm
[12]       270 (+47%)   20 (+88%)   17,237 (+82%)   65nm
Proposed   142          2.42        3,132           45nm

5.5 Results

The results of previous work and of the proposed adder can be observed in Table 5.1. In previous work, a pipelined 128-bit 5 GHz+ binary floating-point adder is proposed in [5], and a 108-bit EAC adder is proposed in [12]. A direct comparison with these adders is difficult because they implement the adders as part of an FMA unit and are designed in 65 nm IBM SOI technology, which is not publicly available. The adder designed in [5] is a pipelined adder, which does not fall into the scope of this thesis. The adder in [6] is implemented in FPGA technology, whereas our adder is a full custom design. Therefore, to the best of our knowledge, this is the first adder to use a small parallel prefix 2^n − 1 EAC block with a full custom design methodology. It can be seen in Table 5.1 that the adder operates with a delay of 142 ps and 2.42 mW power dissipation in a 3,132 µm2 area at 25 ◦C with a 1 V supply. Compared to the previous works, the proposed adder shows up to 47% improvement in delay, 81% improvement in area, and 88% improvement in power dissipation, assuming routing does not contribute substantially to delay and power.

5.6 Conclusion

In this final chapter, a detailed treatment of layout entry through datapath design is provided, together with a comparison between the proposed adder and previous work. The datapath library provides building blocks for early area estimation before the whole adder is designed. Further, routing is simplified with the datapath approach, since wires can be placed regularly in the bit slice. The cells are designed using full custom design rules to achieve a compact, area-efficient layout. Two different layouts are generated by repeating the bit slices. We aimed to keep the longest carry path under 100 µm, assuming the routing does not contribute substantially to delay. The final result


shows that, in the stacked layout, the longest path remains under 45 µm, and the proposed adder achieves up to 47% improvement in delay, 81% improvement in area, and 88% improvement in power dissipation in 45 nm technology at 25 ◦C and 1.0 V supply.


(a) Layout of 16bit Kogge-Stone adder (b) Layout of parallel prefix EAC block

Figure 5.8: Layout of the blocks.


Chapter 6

Conclusion and Future Works

Adders are functional blocks that are generally designed for fast operation; however, power dissipation can no longer be given lower priority. Using parallel prefix adders is good design practice for trading off speed, power dissipation, and area. It is observed in the literature that large performance gains from circuit design alone are becoming harder to obtain. However, the design space is vast, and there are always possibilities for improvement. In this thesis, we designed a parallel prefix 2^n − 1 based adder to show that it is possible to shorten the critical path and reduce power dissipation. After the adder was implemented in Verilog, we performed a critical path analysis. In this top-down design flow, we preferred full custom design over standard cell design to achieve the best performance. It is well known that analysis and design in a full custom methodology require a large amount of time; thus, isolating the critical path provides fast analysis without designing the whole circuit. In order to make such an analysis, we assumed that wire delay does not contribute the majority of the total propagation delay. This analysis shows that if the critical path is optimized, the total performance is optimized. As a next step, we created a datapath library using the results from the critical path analysis. It is shown that datapath design reduces the complexity of the adder design process because it exploits the regularity of the bit slice. As a general outcome, critical path analysis and datapath design provide fast analysis for comparing different adder architectures in terms of speed, power dissipation, and area. Moreover, once the bit slice is designed optimally, it can be used as a building block for fast layout entry. After the datapath


design, it is observed in the final layout that the length of the end-around carry path is reduced to less than 45 µm using a stacked layout, which supports our assumption about routing delay. Under this assumption, the adder shows 142 ps delay, 2.42 mW power dissipation, and less than 3,200 sq. micron area.

Our analysis is based on static CMOS design. Recent works show that dynamic adders provide good performance results; although power dissipation increases in dynamic design, a trade-off can be achieved between speed and power. Moreover, pipelined circuits have attracted interest in adder design. The operation of this adder can therefore be further analyzed in the dynamic design space. In addition, the use of EAC adders in floating-point units, especially decimal floating-point adders, can be further investigated.


Bibliography

[1] J. Rabaey, A. Chandrakasan, and B. Nikolic, Digital Integrated Circuits, 2/e. PearsonEducation, 2003.

[2] I. Koren, Computer Arithmetic Algorithms, ser. Ak Peters Series. A K Peters, 2002.

[3] N. Weste and D. Harris, CMOS VLSI design: a circuits and systems perspective.Pearson/Addison-Wesley, 2005.

[4] R. Zimmermann, “Efficient vlsi implementation of modulo (2^n ± 1) addition and multiplication,” in Computer Arithmetic, 1999. Proceedings. 14th IEEE Symposium on, 1999, pp. 158–167.

[5] X. Y. Yu, Y. hing Chan, M. Kelly, and S. B. Curran, “A 5ghz+ 128-bit binary floating-pointadder for the power6,” in Power6 Processor, Proc. of ESSCIRC, 2006, pp. 166–169.

[6] F. Liu, Q. Tan, G. Chen, X. Song, O. Ait Mohamed, and M. Gu, “Field programmablegate array prototyping of end-around carry parallel prefix tree architectures,” ComputersDigital Techniques, IET, vol. 4, no. 4, pp. 306 –316, July 2010.

[7] G. Dimitrakopoulos and D. Nikolos, “High-speed parallel-prefix vlsi ling adders,” Comput-ers, IEEE Transactions on, vol. 54, no. 2, pp. 225 – 231, Feb. 2005.

[8] B. Zeydel, D. Baran, and V. Oklobdzija, “Energy-efficient design methodologies: High-performance vlsi adders,” Solid-State Circuits, IEEE Journal of, vol. 45, no. 6, pp. 1220–1233, June 2010.

[9] M. Ciesielski, S. Askar, and S. Levitin, “Analytical approach to layout generation of datap-ath cells,” Computer-Aided Design of Integrated Circuits and Systems, IEEE Transactionson, vol. 21, no. 12, pp. 1480 – 1488, Dec 2002.

[10] T. Tao Ye and G. De Micheli, “Data path placement with regularity,” in Computer AidedDesign, 2000. ICCAD-2000. IEEE/ACM International Conference on, 2000, pp. 264 –270.

[11] J.-S. Yim and C.-M. Kyung, “Datapath layout optimisation using genetic algorithm andsimulated annealing,” Computers and Digital Techniques, IEE Proceedings -, vol. 145, no. 2,pp. 135 –141, Mar 1998.

[12] X. Y. Zhang, Y.-H. Chan, R. K. Montoye, L. J. Sigal, E. M. Schwarz, and M. Kelly, “A270ps 20mw 108-bit end-around carry adder for multiply-add fused floating point unit,”Journal of Signal Processing Systems, vol. 58, pp. 139–144, 2010.


[13] J. Bruguera and T. Lang, “Floating-point fused multiply-add: reduced latency for floating-point addition,” in Computer Arithmetic, 2005. ARITH-17 2005. 17th IEEE Symposiumon, June 2005, pp. 42 – 51.

[14] D. Harris and S. Harris, Digital design and computer architecture, ser. Morgan Kaufmann.Morgan Kaufmann Publishers, 2007.

[15] J. Chen and J. Stine, “Parallel prefix ling structures for modulo 2n − 1 addition,” inApplication-specific Systems, Architectures and Processors, 2009. ASAP 2009. 20th IEEEInternational Conference on, July 2009, pp. 16 –23.

[16] G. Chen and F. Liu, “Proofs of correctness and properties of integer adder circuits,” Com-puters, IEEE Transactions on, vol. 59, no. 1, pp. 134 –136, Jan. 2010.

[17] J. Park, H. Ngo, J. Silberman, and S. Dhong, “470 ps 64-bit parallel binary adder [for cpuchip],” in VLSI Circuits, 2000. Digest of Technical Papers. 2000 Symposium on, 2000, pp.192 –193.

[18] D. Patil, O. Azizi, M. Horowitz, R. Ho, and R. Ananthraman, “Robust energy-efficientadder topologies,” in Computer Arithmetic, 2007. ARITH ’07. 18th IEEE Symposium on,June 2007, pp. 16 –28.

[19] G. Dimitrakopoulos, P. Kolovos, P. Kalogerakis, and D. Nikolos, “Design of high-speedlow-power parallel-prefix vlsi adders.” in PATMOS’04, 2004, pp. 248–257.

[20] A. Weinberger and J. Smith, “A logic for high-speed addition,” National Bureau of Stan-dards, vol. no. Circulation 591, pp. 3 – 12, 1958.

[21] D. Harris, “A taxonomy of parallel prefix networks,” in Signals, Systems and Computers,2003. Conference Record of the Thirty-Seventh Asilomar Conference on, vol. 2, Nov. 2003,pp. 2213 – 2217 Vol.2.

[22] P. M. Kogge and H. S. Stone, “A parallel algorithm for the efficient solution of a generalclass of recurrence equations,” Computers, IEEE Transactions on, vol. C-22, no. 8, pp. 786–793, Aug. 1973.

[23] J. Sklansky, “Conditional-sum addition logic,” Electronic Computers, IRE Transactionson, vol. EC-9, no. 2, pp. 226 –231, June 1960.

[24] R. Brent and H. Kung, “A regular layout for parallel adders,” Computers, IEEE Transac-tions on, vol. C-31, no. 3, pp. 260 –264, March 1982.

[25] T. Han, D. A. Carlson, and T. don Han, “Fast area-efficient vlsi adders,” in IEEE Sympo-sium on Computer Arithmetic, 1987.

[26] S. Knowles, “A family of adders,” in Computer Arithmetic, 2001. Proceedings. 15th IEEESymposium on, 2001, pp. 277 –281.

[27] R. E. Ladner and M. J. Fischer, “Parallel prefix computation,” Journal of The ACM,vol. 27, pp. 831–838, 1980.


[28] B. Zeydel, T. Kluter, and V. Oklobdzija, “Efficient mapping of addition recurrence algo-rithms in cmos,” in Computer Arithmetic, 2005. ARITH-17 2005. 17th IEEE Symposiumon, June 2005, pp. 107 – 113.

[29] R. Doran, “Variants of an improved carry look-ahead adder,” Computers, IEEE Transac-tions on, vol. 37, no. 9, pp. 1110 –1113, Sep 1988.

[30] H. Ling, “High-speed binary adder,” IBM Journal of Research and Development, vol. 25,no. 3, pp. 156 –166, March 1981.

[31] R. I. Tanaka, “Residue arithmetic and its applications to computer technology,” 1967.

[32] W. Jenkins and B. Leon, “The use of residue number systems in the design of finite impulseresponse digital filters,” Circuits and Systems, IEEE Transactions on, vol. 24, no. 4, pp.191 – 201, Apr 1977.

[33] X. Lai and J. L. Massey, “A proposal for a new block encryption standard.” Springer-Verlag, 1991, pp. 389–404.

[34] S.-S. Yau and Y.-C. Liu, “Error correction in redundant residue number systems,” Com-puters, IEEE Transactions on, vol. C-22, no. 1, pp. 5 – 11, Jan. 1973.

[35] F. Halsall, Data communications, computer networks and open systems (4th ed.). RedwoodCity, CA, USA: Addison Wesley Longman Publishing Co., Inc., 1995.

[36] V. Paliouras and T. Stouraitis, “Novel high-radix residue number system multipliers andadders,” in Circuits and Systems, 1999. ISCAS ’99. Proceedings of the 1999 IEEE Inter-national Symposium on, vol. 1, Jul 1999, pp. 451 –454 vol.1.

[37] C. Efstathiou, H. Vergos, and D. Nikolos, “Modulo 2n ± 1 adder design using select-prefixblocks,” Computers, IEEE Transactions on, vol. 52, no. 11, pp. 1399 – 1406, Nov. 2003.

[38] L. Kalampoukas, D. Nikolos, C. Efstathiou, H. Vergos, and J. Kalamatianos, “High-speedparallel-prefix module 2n-1 adders,” Computers, IEEE Transactions on, vol. 49, no. 7, pp.673 –680, Jul 2000.

[39] A. Beaumont-Smith and C.-C. Lim, “Parallel prefix adder design,” in Computer Arithmetic,2001. Proceedings. 15th IEEE Symposium on, 2001, pp. 218 –225.

[40] E. M. Schwarz, High-performance energy-efficient microprocessor design, ser. Series onintegrated circuits and systems. Springer, 2006, ch. Binary floating-point unit design.

[41] J. Shedletsky, “Comment on the sequential and indeterminate behavior of an end-around-carry adder,” Computers, IEEE Transactions on, vol. C-26, no. 3, pp. 271 –272, March1977.

[42] F. Liu, X. Song, Q. Tan, and G. Chen, “Formal analysis of end-around-carry adder infloating-point unit,” Computer-Aided Design of Integrated Circuits and Systems, IEEETransactions on, vol. 29, no. 10, pp. 1655 –1659, Oct. 2010.

[43] A. Goldovsky, R. Kolagotla, C. Nicol, and M. Besz, “A 1.0-nsec 32-bit prefix tree adder in 0.25-µm static cmos,” in Circuits and Systems, 1999. 42nd Midwest Symposium on, vol. 2, 1999, pp. 608–612 vol. 2.


[44] R. Muralidharan and C.-H. Chang, “Hard multiple generator for higher radix modulo 2n-1multiplication,” in Integrated Circuits, ISIC ’09. Proceedings of the 2009 12th InternationalSymposium on, Dec. 2009, pp. 546 –549.

[45] L. Kalampoukas, D. Nikolos, C. Efstathiou, H. Vergos, and J. Kalamatianos, “High-speedparallel-prefix module 2n-1 adders,” Computers, IEEE Transactions on, vol. 49, no. 7, pp.673 –680, Jul 2000.

[46] I. Sutherland, R. Sproull, and D. Harris, Logical effort: designing fast CMOS circuits, ser.The Morgan Kaufmann Series in Computer Architecture and Design. Morgan KaufmannPublishers, 1999.

[47] D. Harris and I. Sutherland, “Logical effort of carry propagate adders,” in Signals, Systemsand Computers, 2003. Conference Record of the Thirty-Seventh Asilomar Conference on,vol. 1, Nov. 2003, pp. 873 – 878 Vol.1.

[48] A. Kabbani, D. Al-Khalili, and A. Al-Khalili, “Logical path delay distribution and tran-sistor sizing,” in IEEE-NEWCAS Conference, 2005. The 3rd International, June 2005, pp.391 – 394.

[49] V. Oklobdzija, B. Zeydel, H. Dao, S. Mathew, and R. Krishnamurthy, “Energy-delay es-timation technique for high-performance microprocessor vlsi adders,” in Computer Arith-metic, 2003. Proceedings. 16th IEEE Symposium on, June 2003, pp. 272 – 279.

[50] F. Frustaci, M. Lanuzza, P. Zicari, S. Perri, and P. Corsonello, “Designing high-speedadders in power-constrained environments,” Circuits and Systems II: Express Briefs, IEEETransactions on, vol. 56, no. 2, pp. 172 –176, Feb. 2009.

[51] R. Zlatanovici, S. Kao, and B. Nikolic, “Energy-delay optimization of 64-bit carry-lookahead adders with a 240 ps 90 nm cmos design example,” Solid-State Circuits, IEEEJournal of, vol. 44, no. 2, pp. 569 –583, Feb. 2009.

[52] HSPICE, “The gold standard for accurate circuit simulation,” http://www.synopsys.com/Tools/Verification/AMSVerification/CircuitSimulation/HSPICE/Pages/default.aspx.

[53] J. Stine, I. Castellanos, M. Wood, J. Henson, F. Love, W. Davis, P. Franzon, M. Bucher,S. Basavarajaiah, J. Oh, and R. Jenkal, “Freepdk: An open-source variation-aware designkit,” in Microelectronic Systems Education, 2007. MSE ’07. IEEE International Conferenceon, June 2007, pp. 173 –174.

[54] NCSU, “45nm variant of the FreePDK process design kit,” http://www.eda.ncsu.edu/wiki/FreePDK45:Contents.

[55] T. Jing, X.-L. Hong, Y.-C. Cai, J.-Y. Xu, C.-Q. Yang, Y.-Q. Zhang, Q. Zhou, and W. Wu,“Data-path layout design inside soc,” in Communications, Circuits and Systems and WestSino Expositions, IEEE 2002 International Conference on, vol. 2, June-1 July 2002, pp.1406 – 1410 vol.2.

[56] W. Daily and A. Chang, “The role of custom design in asic chips,” in Design AutomationConference, 2000. Proceedings 2000. 37th, 2000, pp. 643 –647.


[57] N. H. E. Weste and K. Eshraghian, “Principles of cmos vlsi design: a systems perspective,”1993.

[58] W. Luk and A. Dean, “Multistack optimization for data-path chip layout,” Computer-Aided Design of Integrated Circuits and Systems, IEEE Transactions on, vol. 10, no. 1, pp.116 –129, Jan 1991.


Appendix A

Verilog Code of the Proposed Adder

module Mod_2n_1_128b_temp(

input [127:0] A,

input sA,

input [127:0] B,

input sB,

output [127:0] S,

output sS

);

wire [127:0] iB, iS;

wire Os;

wire [7:0] c,p,g;
wire [7:0] p1; // declared so that the .p1 connection below elaborates; p1 is unused inside the EAC block

xor X1 (Os, sA, sB);

assign iB = Os ? ~B : B;

// 16b front and 8b EAC

Mod2n_1_16b_wCI_CS_KS i0 (.A(A[15:0]), .B(iB[15:0]), .cin(c[7]), .pg({p[0],g[0]}), .S(iS[15:0]));

Mod2n_1_16b_wCI_CS_KS i1 (.A(A[31:16]), .B(iB[31:16]), .cin(c[0]), .pg({p[1],g[1]}), .S(iS[31:16]));

Mod2n_1_16b_wCI_CS_KS i2 (.A(A[47:32]), .B(iB[47:32]), .cin(c[1]), .pg({p[2],g[2]}), .S(iS[47:32]));

Mod2n_1_16b_wCI_CS_KS i3 (.A(A[63:48]), .B(iB[63:48]), .cin(c[2]), .pg({p[3],g[3]}), .S(iS[63:48]));

Mod2n_1_16b_wCI_CS_KS i4 (.A(A[79:64]), .B(iB[79:64]), .cin(c[3]), .pg({p[4],g[4]}), .S(iS[79:64]));

Mod2n_1_16b_wCI_CS_KS i5 (.A(A[95:80]), .B(iB[95:80]), .cin(c[4]), .pg({p[5],g[5]}), .S(iS[95:80]));

Mod2n_1_16b_wCI_CS_KS i6 (.A(A[111:96]), .B(iB[111:96]), .cin(c[5]), .pg({p[6],g[6]}), .S(iS[111:96]));

Mod2n_1_16b_wCI_CS_KS i7 (.A(A[127:112]), .B(iB[127:112]), .cin(c[6]), .pg({p[7],g[7]}), .S(iS[127:112]));

// 8b EAC logic

Mod2n_1_8b_wo_pg_KS EAC (.p(p[7:0]), .p1(p1[7:0]), .g(g[7:0]), .c(c[7:0]), .Os(Os));

assign S = Os&(c[7]) ? iS : ~iS;

assign sS = c[7] ? sA : sB;

endmodule

////////////////////////////////////////////////////////////////////////////////////////////////////////////

module Mod2n_1_16b_wCI_CS_KS(

input [15:0] A,

input [15:0] B,

input cin,

output p1,

output [1:0] pg,

output [15:0] S

);

wire [1:0] r1c7, r1c6, r1c5, r1c4, r1c3, r1c2, r1c1, r1c0;

wire [1:0] r1c15, r1c14, r1c13, r1c12, r1c11, r1c10, r1c9, r1c8;


pg16 ipg16(.A(A), .B(B),

.pg15(r1c15),.pg14(r1c14),.pg13(r1c13),.pg12(r1c12),

.pg11(r1c11),.pg10(r1c10),.pg9(r1c9),.pg8(r1c8),

.pg7(r1c7),.pg6(r1c6),.pg5(r1c5),.pg4(r1c4),

.pg3(r1c3),.pg2(r1c2),.pg1(r1c1),.pg0(r1c0));

wire [1:0] r2c15, r2c13, r2c11, r2c9, r2c7, r2c5, r2c3, r2c1;

wire [1:0] r2c14, r2c12, r2c10, r2c8, r2c6, r2c4, r2c2;

black ir1c15(.pg(r1c15), .pg0(r1c14), .pgo(r2c15));

black ir1c14(.pg(r1c14), .pg0(r1c13), .pgo(r2c14));

black ir1c13(.pg(r1c13), .pg0(r1c12), .pgo(r2c13));

black ir1c12(.pg(r1c12), .pg0(r1c11), .pgo(r2c12));

black ir1c11(.pg(r1c11), .pg0(r1c10), .pgo(r2c11));

black ir1c10(.pg(r1c10), .pg0(r1c9), .pgo(r2c10));

black ir1c9(.pg(r1c9), .pg0(r1c8), .pgo(r2c9));

black ir1c8(.pg(r1c8), .pg0(r1c7), .pgo(r2c8));

black ir1c7(.pg(r1c7), .pg0(r1c6), .pgo(r2c7));

black ir1c6(.pg(r1c6), .pg0(r1c5), .pgo(r2c6));

black ir1c5(.pg(r1c5), .pg0(r1c4), .pgo(r2c5));

black ir1c4(.pg(r1c4), .pg0(r1c3), .pgo(r2c4));

black ir1c3(.pg(r1c3), .pg0(r1c2), .pgo(r2c3));

black ir1c2(.pg(r1c2), .pg0(r1c1), .pgo(r2c2));

black ir1c1(.pg(r1c1), .pg0(r1c0), .pgo(r2c1));

wire [1:0] r3c15, r3c14, r3c11, r3c10, r3c7, r3c6, r3c3, r3c2;

wire [1:0] r3c13, r3c12, r3c9, r3c8, r3c5, r3c4;

black ir2c15(.pg(r2c15), .pg0(r2c13), .pgo(r3c15));

black ir2c14(.pg(r2c14), .pg0(r2c12), .pgo(r3c14));

black ir2c13(.pg(r2c13), .pg0(r2c11), .pgo(r3c13));

black ir2c12(.pg(r2c12), .pg0(r2c10), .pgo(r3c12));

black ir2c11(.pg(r2c11), .pg0(r2c9), .pgo(r3c11));

black ir2c10(.pg(r2c10), .pg0(r2c8), .pgo(r3c10));

black ir2c9(.pg(r2c9), .pg0(r2c7), .pgo(r3c9));

black ir2c8(.pg(r2c8), .pg0(r2c6), .pgo(r3c8));

black ir2c7(.pg(r2c7), .pg0(r2c5), .pgo(r3c7));

black ir2c6(.pg(r2c6), .pg0(r2c4), .pgo(r3c6));

black ir2c5(.pg(r2c5), .pg0(r2c3), .pgo(r3c5));

black ir2c4(.pg(r2c4), .pg0(r2c2), .pgo(r3c4));

black ir2c3(.pg(r2c3), .pg0(r2c1), .pgo(r3c3));

black ir2c2(.pg(r2c2), .pg0(r1c0), .pgo(r3c2));

wire [1:0] r4c15, r4c14, r4c13, r4c12, r4c11, r4c10, r4c9, r4c8;

wire [1:0] r4c7, r4c6, r4c5, r4c4;

black ir3c15(.pg(r3c15), .pg0(r3c11), .pgo(r4c15));

black ir3c14(.pg(r3c14), .pg0(r3c10), .pgo(r4c14));

black ir3c13(.pg(r3c13), .pg0(r3c9), .pgo(r4c13));

black ir3c12(.pg(r3c12), .pg0(r3c8), .pgo(r4c12));

black ir3c11(.pg(r3c11), .pg0(r3c7), .pgo(r4c11));

black ir3c10(.pg(r3c10), .pg0(r3c6), .pgo(r4c10));

black ir3c9(.pg(r3c9), .pg0(r3c5), .pgo(r4c9));

black ir3c8(.pg(r3c8), .pg0(r3c4), .pgo(r4c8));

black ir3c7(.pg(r3c7), .pg0(r3c3), .pgo(r4c7));

black ir3c6(.pg(r3c6), .pg0(r3c2), .pgo(r4c6));


black ir3c5(.pg(r3c5), .pg0(r2c1), .pgo(r4c5));

black ir3c4(.pg(r3c4), .pg0(r1c0), .pgo(r4c4));

wire [1:0] r5c15, r5c14, r5c13, r5c12, r5c11, r5c10, r5c9, r5c8;

black ir4c15(.pg(r4c15), .pg0(r4c7), .pgo(r5c15));

black ir4c14(.pg(r4c14), .pg0(r4c6), .pgo(r5c14));

black ir4c13(.pg(r4c13), .pg0(r4c5), .pgo(r5c13));

black ir4c12(.pg(r4c12), .pg0(r4c4), .pgo(r5c12));

black ir4c7(.pg(r4c11), .pg0(r3c3), .pgo(r5c11));

black ir4c6(.pg(r4c10), .pg0(r3c2), .pgo(r5c10));

black ir4c5(.pg(r4c9), .pg0(r2c1), .pgo(r5c9));

black ir4c4(.pg(r4c8), .pg0(r1c0), .pgo(r5c8));

assign pg = r5c15;

wire [15:0] r6c0, r6c1;

Carry_Inc CIA0(.cin(1'b0),.c7(r4c7),.c6(r4c6),.c5(r4c5),.c4(r4c4),

.c3(r3c3),.c2(r3c2),.c1(r2c1),.c0(r1c0),.r1c(r6c0[7:0]));

Carry_Inc CIA1(.cin(1'b1),.c7(r4c7),.c6(r4c6),.c5(r4c5),.c4(r4c4),

.c3(r3c3),.c2(r3c2),.c1(r2c1),.c0(r1c0),.r1c(r6c1[7:0]));

Carry_Inc CIB0(.cin(1'b0),.c7(r5c15),.c6(r5c14),.c5(r5c13),.c4(r5c12),

.c3(r5c11),.c2(r5c10),.c1(r5c9),.c0(r5c8),.r1c(r6c0[15:8]));

Carry_Inc CIB1(.cin(1'b1),.c7(r5c15),.c6(r5c14),.c5(r5c13),.c4(r5c12),

.c3(r5c11),.c2(r5c10),.c1(r5c9),.c0(r5c8),.r1c(r6c1[15:8]));

wire [15:0] S1,S0;

assign S0= {r6c0[14:0],1'b0} ^

{r1c15[1],r1c14[1],r1c13[1],r1c12[1],r1c11[1],r1c10[1],r1c9[1],r1c8[1],

r1c7[1],r1c6[1],r1c5[1],r1c4[1],r1c3[1],r1c2[1],r1c1[1],r1c0[1]};

assign S1= {r6c1[14:0],1'b1} ^

{r1c15[1],r1c14[1],r1c13[1],r1c12[1],r1c11[1],r1c10[1],r1c9[1],r1c8[1],

r1c7[1],r1c6[1],r1c5[1],r1c4[1],r1c3[1],r1c2[1],r1c1[1],r1c0[1]};

assign S = cin ? S1 : S0 ;

endmodule

////////////////////////////////////////////////////////////////////////////////////////////////////////////

module pg16 (A, B, pg15, pg14, pg13, pg12, pg11, pg10, pg9, pg8, pg7, pg6, pg5, pg4, pg3, pg2, pg1, pg0);

input [15:0] A, B;

output [1:0] pg15, pg14, pg13, pg12, pg11, pg10, pg9, pg8, pg7, pg6, pg5, pg4, pg3, pg2, pg1, pg0;

assign pg15 = {(A[15] ^ B[15]), (A[15] & B[15])};

assign pg14 = {(A[14] ^ B[14]), (A[14] & B[14])};

assign pg13 = {(A[13] ^ B[13]), (A[13] & B[13])};

assign pg12 = {(A[12] ^ B[12]), (A[12] & B[12])};

assign pg11 = {(A[11] ^ B[11]), (A[11] & B[11])};

assign pg10 = {(A[10] ^ B[10]), (A[10] & B[10])};

assign pg9 = {(A[9] ^ B[9]), (A[9] & B[9])};

assign pg8 = {(A[8] ^ B[8]), (A[8] & B[8])};

assign pg7 = {(A[7] ^ B[7]), (A[7] & B[7])};

assign pg6 = {(A[6] ^ B[6]), (A[6] & B[6])};

assign pg5 = {(A[5] ^ B[5]), (A[5] & B[5])};

assign pg4 = {(A[4] ^ B[4]), (A[4] & B[4])};

assign pg3 = {(A[3] ^ B[3]), (A[3] & B[3])};

assign pg2 = {(A[2] ^ B[2]), (A[2] & B[2])};

assign pg1 = {(A[1] ^ B[1]), (A[1] & B[1])};

assign pg0 = {(A[0] ^ B[0]), (A[0] & B[0])};


endmodule

////////////////////////////////////////////////////////////////////////////////////////////////////////////

module black (pg, pg0, pgo);

input [1:0] pg, pg0;

output [1:0] pgo;

assign pgo[1] = pg[1] & pg0[1];

assign pgo[0] = pg[0] | (pg0[0] & pg[1]) ;

endmodule

////////////////////////////////////////////////////////////////////////////////////////////////////////////

module Carry_Inc(

input cin,

input [1:0] c7, c6, c5, c4, c3, c2, c1, c0,

output [7:0] r1c

);

// Carry Increment Stage

gray ic7(.pg(c7), .pg0(cin), .pgo(r1c[7]));

gray ic6(.pg(c6), .pg0(cin), .pgo(r1c[6]));

gray ic5(.pg(c5), .pg0(cin), .pgo(r1c[5]));

gray ic4(.pg(c4), .pg0(cin), .pgo(r1c[4]));

gray ic3(.pg(c3), .pg0(cin), .pgo(r1c[3]));

gray ic2(.pg(c2), .pg0(cin), .pgo(r1c[2]));

gray ic1(.pg(c1), .pg0(cin), .pgo(r1c[1]));

gray ic0(.pg(c0), .pg0(cin), .pgo(r1c[0]));

endmodule

////////////////////////////////////////////////////////////////////////////////////////////////////////////

module gray (pg, pg0, pgo);

input [1:0] pg;

input pg0;

output pgo;

assign pgo = (pg0 & pg[1]) | pg[0];

endmodule

////////////////////////////////////////////////////////////////////////////////////////////////////////////

module Mod2n_1_8b_wo_pg_KS(

input [7:0] p,

input [7:0] g,

input Os,

input [7:0] p1,

output [7:0] c

);

wire [1:0] r1c7, r1c6, r1c5, r1c4, r1c3, r1c2, r1c1, r1c0;

assign r1c7 = {p[7],g[7]};

assign r1c6 = {p[6],g[6]};

assign r1c5 = {p[5],g[5]};

assign r1c4 = {p[4],g[4]};

assign r1c3 = {p[3],g[3]};

assign r1c2 = {p[2],g[2]};

assign r1c1 = {p[1],g[1]};

assign r1c0 = {p[0],g[0]};

wire [1:0] r2c7, r2c6, r2c5, r2c4, r2c3, r2c2, r2c1;


black ir1c7(.pg(r1c7), .pg0(r1c6), .pgo(r2c7));

black ir1c6(.pg(r1c6), .pg0(r1c5), .pgo(r2c6));

black ir1c5(.pg(r1c5), .pg0(r1c4), .pgo(r2c5));

black ir1c4(.pg(r1c4), .pg0(r1c3), .pgo(r2c4));

black ir1c3(.pg(r1c3), .pg0(r1c2), .pgo(r2c3));

black ir1c2(.pg(r1c2), .pg0(r1c1), .pgo(r2c2));

black ir1c1(.pg(r1c1), .pg0(r1c0), .pgo(r2c1));

wire [1:0] r3c7, r3c6, r3c5, r3c4, r3c3, r3c2;

black ir2c7(.pg(r2c7), .pg0(r2c5), .pgo(r3c7));

black ir2c6(.pg(r2c6), .pg0(r2c4), .pgo(r3c6));

black ir2c5(.pg(r2c5), .pg0(r2c3), .pgo(r3c5));

black ir2c4(.pg(r2c4), .pg0(r2c2), .pgo(r3c4));

black ir2c3(.pg(r2c3), .pg0(r2c1), .pgo(r3c3));

black ir2c2(.pg(r2c2), .pg0(r1c0), .pgo(r3c2));

wire [1:0] r4c7, r4c6, r4c5, r4c4;

black ir3c7(.pg(r3c7), .pg0(r3c3), .pgo(r4c7));

black ir3c6(.pg(r3c6), .pg0(r3c2), .pgo(r4c6));

black ir3c5(.pg(r3c5), .pg0(r2c1), .pgo(r4c5));

black ir3c4(.pg(r3c4), .pg0(r1c0), .pgo(r4c4));

wire r5c7, r5c6, r5c5, r5c4, r5c3, r5c2, r5c1, r5c0;

// End Around Carry Stage

assign c0 = r4c7[0] &Os ;

gray ir4c6(.pg(r4c6), .pg0(c0), .pgo(r5c6));

gray ir4c5(.pg(r4c5), .pg0(c0), .pgo(r5c5));

gray ir4c4(.pg(r4c4), .pg0(c0), .pgo(r5c4));

gray ir4c3(.pg(r3c3), .pg0(c0), .pgo(r5c3));

gray ir4c2(.pg(r3c2), .pg0(c0), .pgo(r5c2));

gray ir4c1(.pg(r2c1), .pg0(c0), .pgo(r5c1));

gray ir4c0(.pg(r1c0), .pg0(c0), .pgo(r5c0));

assign c = {r4c7[0],r5c6,r5c5,r5c4,r5c3,r5c2,r5c1,r5c0};

endmodule


Appendix B

HSPICE Simulation Files

B.1 Cells

Cell library for simulation.

* Cells

.subckt inv in out length=0.05u width=0.09u

m1 out in vdd vdd PMOS_VTL l=length w=’2.5*width’

m2 out in gnd gnd NMOS_VTL l=length w=width

.ends inv

.subckt nand2 in1 in2 out length=0.05u width=0.09u

m1 out in2 vdd vdd PMOS_VTL L=length W=’2.5*width’

m2 out in1 vdd vdd PMOS_VTL L=length W=’2.5*width’

m3 out in1 1 1 NMOS_VTL L=length W=’2*width’

m4 1 in2 gnd gnd NMOS_VTL L=length W=’2*width’

.ends nand2

.subckt nor2 in1 in2 out length=0.05u width=0.09u

m1 out in2 1 1 PMOS_VTL L=length W=’5*width’

m2 1 in1 vdd vdd PMOS_VTL L=length W=’5*width’

m3 out in1 gnd gnd NMOS_VTL L=length W=’1*width’

m4 out in2 gnd gnd NMOS_VTL L=length W=’1*width’

.ends nor2

.subckt tg in pctrl nctrl out length=0.05u width=0.09u

m0 in pctrl out vdd PMOS_VTL L=length W=’1*width’

m1 in nctrl out gnd NMOS_VTL L=length W=’1*width’

.ends tg

.subckt xor2 in1 in2 out length=0.05u width=0.09u

x01 in1 out1 inv L=length W=’1*width’

x02 in2 out2 inv L=length W=’1*width’

x03 in1 in2 out2 out tg L=length W=’1*width’

x04 out1 out2 in2 out tg L=length W=’1*width’

.ends xor2

.subckt xnor2 in1 in2 out length=0.05u width=0.09u

x01 in2 out2 inv L=length W=’2*width’

x02 in1 out2 in2 out tg L=length W=’1*width’

m1 out in1 out2 vdd PMOS_VTL L=length W=’2*2.5*1*width’

m2 out in1 in2 gnd NMOS_VTL L=length W=’2*width’


.ends xnor2

.subckt AOI A B C F length=0.05u width=0.09u

m01 F A 2 2 PMOS_VTL L=length W=’5*width’

m02 F B 2 2 PMOS_VTL L=length W=’5*width’

m03 2 C vdd vdd PMOS_VTL L=length W=’5*width’

m04 F A 1 1 NMOS_VTL L=length W=’2*width’

m05 1 B gnd gnd NMOS_VTL L=length W=’2*width’

m06 F C gnd gnd NMOS_VTL L=length W=’1*width’

.ends AOI

.subckt mux2 in1 in2 sel out length=0.05u width=0.09u

x01 sel nsel inv L=length W=’1*width’

x02 in1 sel nsel out tg L=length W=’1*width’

x03 in2 nsel sel out tg L=length W=’1*width’

.ends mux2

.subckt OAI A B C F length=0.05u width=0.09u

m01 F A 2 2 PMOS_VTL L=length W=’5*width’

m02 2 B vdd vdd PMOS_VTL L=length W=’5*width’

m03 F C vdd vdd PMOS_VTL L=length W=’2.5*width’

m04 F A 1 1 NMOS_VTL L=length W=’2*width’

m05 F B 1 1 NMOS_VTL L=length W=’2*width’

m06 1 C gnd gnd NMOS_VTL L=length W=’2*width’

.ends OAI

B.2 Simulation Code

* source CRITICAL PATH

* 16x8 EAC KS adder

.include ’NMOS_VTL_.inc’

.include ’PMOS_VTL_.inc’

.include ’cells.cir’

.global vdd gnd

.connect gnd 0

.PARAM W=’0.05u’ L=0.05u N=8 M=4

.PARAM vdd = 1.0V

vdd vdd 0 vdd

V_V1 in0 0

+PULSE 0 1 0 150p 150p 0.5n 3n

X_UI1 in0 in1 inv length=L width=’2*W’

X_UI2 in1 in inv length=L width=’2*W’

.connect in sA

X_U1_1 sA n1 inv length=L width=’f1*W’

X_U1_2 gnd n2 inv length=L width=’f1*W’

X_U1_3 sA gnd n2 Os tg length=L width=’4*W’

X_U1_4 n1 n2 gnd Os tg length=L width=’4*W’

X_U2_1 Os n3 inv length=L width=’f2*W’


X_U2_2 vdd Os n3 1 tg length=L width=’4*W’

X_U2_3 gnd n3 Os 1 tg length=L width=’4*W’

X_U3 1 vdd 2 nand2 length=L width=’f3*W’

X_U4 2 gnd vdd 3 OAI length=L width=’f4*W’

X_U4_1 2 gnd vdd 51 OAI length=L width=’f4*W’

X_U5 3 vdd gnd 4 AOI length=L width=’f5*W’

X_U5_1 3 vdd gnd 52 AOI length=L width=’f5*W’

X_U6 4 gnd vdd 5 OAI length=L width=’f6*W’

X_U6_1 4 gnd vdd 53 OAI length=L width=’f6*W’

X_U7 5 vdd gnd 6 AOI length=L width=’f7*W’

X_U8 6 gnd vdd 7 OAI length=L width=’f8*W’

X_U8_1 6 gnd vdd 54 OAI length=L width=’f8*W’

X_U9 7 vdd gnd 8 AOI length=L width=’f9*W’

X_U9_1 7 vdd gnd 55 AOI length=L width=’f9*W’

X_U10 8 gnd vdd 9 OAI length=L width=’f10*W’

X_U11 9 vdd 10 nand2 length=L width=’f11*W’

X_U11_1 9 vdd 50 nand2 length=L width=’f11*W’

X_U12 10 gnd vdd 11 OAI length=L width=’f12*W’

X_U12_1 10 gnd vdd 35 OAI length=L width=’f12*W’

X_U12_2 10 gnd vdd 36 OAI length=L width=’f12*W’

X_U12_3 10 gnd vdd 37 OAI length=L width=’f12*W’

X_U13_1 11 n4 inv length=L width=’f13*W’

X_U13_2 gnd 11 n4 12 tg length=L width=’f13*W’

X_U13_3 vdd n4 11 12 tg length=L width=’f13*W’

X_U14_1 12 n6 inv length=L width=’f14*W’

X_U14_4 vdd n5 inv length=L width=’f14*W’

X_U14_2 12 vdd n5 s75 tg length=L width=’f14*W’

X_U14_3 n6 n5 vdd s75 tg length=L width=’f14*W’

X_UO s75 out inv length=L width=’2*W’

.PARAM f1=2 f2=2 f3=2.5 f4=2 f5=1.6 f6=1.3 f7=2 f8=1.6 f9=1.3 f10=2 f11=2.5 f12=0.8 f13=2 f14=2

.tran 0.1p 100n

.option post=2 nomod LIST

.meas tran tplh_inr trig v(in) td=70n val=’vdd/2’ cross=1 targ v(s75) td=70n val=’vdd/2’ cross=1

.meas tran tplh_inf trig v(in) td=71n val=’vdd/2’ cross=1 targ v(s75) td=71n val=’vdd/2’ cross=1

************************************************************************************************

.meas tran tplh_inr_Os_in trig v(in) td=70n val=’vdd/2’ cross=1 targ v(Os) td=70n val=’vdd/2’ cross=1

.meas tran tplh_inf_Os_in trig v(in) td=71n val=’vdd/2’ cross=1 targ v(Os) td=71n val=’vdd/2’ cross=1

.meas tran tplh_inr_1_Os trig v(Os) td=70n val=’vdd/2’ cross=1 targ v(1) td=70n val=’vdd/2’ cross=1

.meas tran tplh_inf_1_Os trig v(Os) td=71n val=’vdd/2’ cross=1 targ v(1) td=71n val=’vdd/2’ cross=1

.meas tran tplh_inr_2_1 trig v(1) td=70n val=’vdd/2’ cross=1 targ v(2) td=70n val=’vdd/2’ cross=1


.meas tran tplh_inf_2_1 trig v(1) td=71n val=’vdd/2’ cross=1 targ v(2) td=71n val=’vdd/2’ cross=1

.meas tran tplh_inr_3_2 trig v(2) td=70n val=’vdd/2’ cross=1 targ v(3) td=70n val=’vdd/2’ cross=1

.meas tran tplh_inf_3_2 trig v(2) td=71n val=’vdd/2’ cross=1 targ v(3) td=71n val=’vdd/2’ cross=1

.meas tran tplh_inr_4_3 trig v(3) td=70n val=’vdd/2’ cross=1 targ v(4) td=70n val=’vdd/2’ cross=1

.meas tran tplh_inf_4_3 trig v(3) td=71n val=’vdd/2’ cross=1 targ v(4) td=71n val=’vdd/2’ cross=1

.meas tran tplh_inr_5_4 trig v(4) td=70n val=’vdd/2’ cross=1 targ v(5) td=70n val=’vdd/2’ cross=1

.meas tran tplh_inf_5_4 trig v(4) td=71n val=’vdd/2’ cross=1 targ v(5) td=71n val=’vdd/2’ cross=1

.meas tran tplh_inr_6_5 trig v(5) td=70n val=’vdd/2’ cross=1 targ v(6) td=70n val=’vdd/2’ cross=1

.meas tran tplh_inf_6_5 trig v(5) td=71n val=’vdd/2’ cross=1 targ v(6) td=71n val=’vdd/2’ cross=1

.meas tran tplh_inr_7_6 trig v(6) td=70n val=’vdd/2’ cross=1 targ v(7) td=70n val=’vdd/2’ cross=1

.meas tran tplh_inf_7_6 trig v(6) td=71n val=’vdd/2’ cross=1 targ v(7) td=71n val=’vdd/2’ cross=1

.meas tran tplh_inr_8_7 trig v(7) td=70n val=’vdd/2’ cross=1 targ v(8) td=70n val=’vdd/2’ cross=1

.meas tran tplh_inf_8_7 trig v(7) td=71n val=’vdd/2’ cross=1 targ v(8) td=71n val=’vdd/2’ cross=1

.meas tran tplh_inr_9_8 trig v(8) td=70n val=’vdd/2’ cross=1 targ v(9) td=70n val=’vdd/2’ cross=1

.meas tran tplh_inf_9_8 trig v(8) td=71n val=’vdd/2’ cross=1 targ v(9) td=71n val=’vdd/2’ cross=1

.meas tran tplh_inr_10_9 trig v(9) td=70n val=’vdd/2’ cross=1 targ v(10) td=70n val=’vdd/2’ cross=1

.meas tran tplh_inf_10_9 trig v(9) td=71n val=’vdd/2’ cross=1 targ v(10) td=71n val=’vdd/2’ cross=1

.meas tran tplh_inr_11_10 trig v(10) td=70n val=’vdd/2’ cross=1 targ v(11) td=70n val=’vdd/2’ cross=1

.meas tran tplh_inf_11_10 trig v(10) td=71n val=’vdd/2’ cross=1 targ v(11) td=71n val=’vdd/2’ cross=1

.meas tran tplh_inr_12_11 trig v(11) td=70n val=’vdd/2’ cross=1 targ v(12) td=70n val=’vdd/2’ cross=1

.meas tran tplh_inf_12_11 trig v(11) td=71n val=’vdd/2’ cross=1 targ v(12) td=71n val=’vdd/2’ cross=1

.meas tran tplh_inr_s75_12 trig v(12) td=70n val=’vdd/2’ cross=1 targ v(s75) td=70n val=’vdd/2’ cross=1

.meas tran tplh_inf_s75_12 trig v(12) td=71n val=’vdd/2’ cross=1 targ v(s75) td=71n val=’vdd/2’ cross=1

*************************************************************************************************

.PRINT POWER

.MEASURE TRAN avg_power AVG POWER from 0ns to 100ns

******** alterations

.alter case 2:

.TEMP 100

.alter case 3:

.TEMP 25

.param vdd=1.2V

.alter case 4:

.TEMP 100

.param vdd=1.2V

.alter case 5:

.TEMP 25

.param vdd=0.8V

.alter case 6:

.TEMP 100

.param vdd=0.8V

.end

B.3 Condition of Transistors

Transistor conditions under 25◦C and 1.0V supply:

element name 1:m1 1:m2 2:m1 2:m2

drain 0:in1 0:in1 0:in 0:in

gate 0:in0 0:in0 0:in1 0:in1

source 0:vdd 0:0 0:vdd 0:0


bulk 0:vdd 0:0 0:vdd 0:0

model 0:pmos_vtl 0:nmos_vtl 0:pmos_vtl 0:nmos_vtl

w eff 240.0000n 90.0000n 240.0000n 90.0000n

l eff 22.5000n 22.5000n 22.5000n 22.5000n

rd eff 0. 0. 0. 0.

rs eff 0. 0. 0. 0.

cdsat 10.0000f 10.0000f 10.0000f 10.0000f

cssat 19.7699a 7.9079a 19.7699a 7.9079a

capbd 124.7500a 49.9000a 124.7500a 49.9000a

capbs 199.6000a 79.8400a 199.6000a 79.8400a

temp 25.0000 25.0000 25.0000 25.0000

aic

nf 1.0000 1.0000 1.0000 1.0000

min 0. 0. 0. 0.

rbdb 15.0000 15.0000 15.0000 15.0000

rbsb 15.0000 15.0000 15.0000 15.0000

rbpb 5.0000 5.0000 5.0000 5.0000

rbps 15.0000 15.0000 15.0000 15.0000

rbpd 15.0000 15.0000 15.0000 15.0000

trnqsmod 0. 0. 0. 0.

acnqsmod 0. 0. 0. 0.

rbodymod 1.0000 1.0000 1.0000 1.0000

rgatemod 1.0000 1.0000 1.0000 1.0000

geomod 1.0000 1.0000 1.0000 1.0000

rgeomod 0. 0. 0. 0.

delvto 0. 0. 0. 0.

mulu0 1.0000 1.0000 1.0000 1.0000

delk1 0. 0. 0. 0.

delnfct 0. 0. 0. 0.

deltox 0. 0. 0. 0.

sa 0. 0. 0. 0.

sb 0. 0. 0. 0.

sd 0. 0. 0. 0.

saeff 0. 0. 0. 0.

sbeff 0. 0. 0. 0.

element name 3:m1 3:m2 4:m1 4:m2

drain 0:n1 0:n1 0:n2 0:n2

gate 0:in 0:in 0:0 0:0

source 0:vdd 0:0 0:vdd 0:0

bulk 0:vdd 0:0 0:vdd 0:0

model 0:pmos_vtl 0:nmos_vtl 0:pmos_vtl 0:nmos_vtl

w eff 240.0000n 90.0000n 240.0000n 90.0000n

l eff 22.5000n 22.5000n 22.5000n 22.5000n

rd eff 0. 0. 0. 0.

rs eff 0. 0. 0. 0.

cdsat 10.0000f 10.0000f 10.0000f 10.0000f

cssat 19.7699a 7.9079a 19.7699a 7.9079a

capbd 124.7500a 49.9000a 124.7500a 49.9000a

capbs 199.6000a 79.8400a 199.6000a 79.8400a

temp 25.0000 25.0000 25.0000 25.0000

aic

nf 1.0000 1.0000 1.0000 1.0000

min 0. 0. 0. 0.

rbdb 15.0000 15.0000 15.0000 15.0000

rbsb 15.0000 15.0000 15.0000 15.0000

rbpb 5.0000 5.0000 5.0000 5.0000


rbps 15.0000 15.0000 15.0000 15.0000

rbpd 15.0000 15.0000 15.0000 15.0000

trnqsmod 0. 0. 0. 0.

acnqsmod 0. 0. 0. 0.

rbodymod 1.0000 1.0000 1.0000 1.0000

rgatemod 1.0000 1.0000 1.0000 1.0000

geomod 1.0000 1.0000 1.0000 1.0000

rgeomod 0. 0. 0. 0.

delvto 0. 0. 0. 0.

mulu0 1.0000 1.0000 1.0000 1.0000

delk1 0. 0. 0. 0.

delnfct 0. 0. 0. 0.

deltox 0. 0. 0. 0.

sa 0. 0. 0. 0.

sb 0. 0. 0. 0.

sd 0. 0. 0. 0.

saeff 0. 0. 0. 0.

sbeff 0. 0. 0. 0.

element name 5:m0 5:m1 6:m0 6:m1

drain 0:in 0:in 0:n1 0:n1

gate 0:0 0:n2 0:n2 0:0

source 0:os 0:os 0:os 0:os

bulk 0:vdd 0:0 0:vdd 0:0

model 0:pmos_vtl 0:nmos_vtl 0:pmos_vtl 0:nmos_vtl

w eff 190.0000n 190.0000n 190.0000n 190.0000n

l eff 22.5000n 22.5000n 22.5000n 22.5000n

rd eff 0. 0. 0. 0.

rs eff 0. 0. 0. 0.

cdsat 10.0000f 10.0000f 10.0000f 10.0000f

cssat 15.8159a 15.8159a 15.8159a 15.8159a

capbd 99.8000a 99.8000a 99.8000a 99.8000a

capbs 159.6800a 159.6800a 159.6800a 159.6800a

temp 25.0000 25.0000 25.0000 25.0000

aic

nf 1.0000 1.0000 1.0000 1.0000

min 0. 0. 0. 0.

rbdb 15.0000 15.0000 15.0000 15.0000

rbsb 15.0000 15.0000 15.0000 15.0000

rbpb 5.0000 5.0000 5.0000 5.0000

rbps 15.0000 15.0000 15.0000 15.0000

rbpd 15.0000 15.0000 15.0000 15.0000

trnqsmod 0. 0. 0. 0.

acnqsmod 0. 0. 0. 0.

rbodymod 1.0000 1.0000 1.0000 1.0000

rgatemod 1.0000 1.0000 1.0000 1.0000

geomod 1.0000 1.0000 1.0000 1.0000

rgeomod 0. 0. 0. 0.

delvto 0. 0. 0. 0.

mulu0 1.0000 1.0000 1.0000 1.0000

delk1 0. 0. 0. 0.

delnfct 0. 0. 0. 0.

deltox 0. 0. 0. 0.

sa 0. 0. 0. 0.

sb 0. 0. 0. 0.

sd 0. 0. 0. 0.

saeff 0. 0. 0. 0.


sbeff 0. 0. 0. 0.

element name 7:m1 7:m2 8:m0 8:m1

drain 0:n3 0:n3 0:vdd 0:vdd

gate 0:os 0:os 0:os 0:n3

source 0:vdd 0:0 0:1 0:1

bulk 0:vdd 0:0 0:vdd 0:0

model 0:pmos_vtl 0:nmos_vtl 0:pmos_vtl 0:nmos_vtl

w eff 240.0000n 90.0000n 190.0000n 190.0000n

l eff 22.5000n 22.5000n 22.5000n 22.5000n

rd eff 0. 0. 0. 0.

rs eff 0. 0. 0. 0.

cdsat 10.0000f 10.0000f 10.0000f 10.0000f

cssat 19.7699a 7.9079a 15.8159a 15.8159a

capbd 124.7500a 49.9000a 99.8000a 99.8000a

capbs 199.6000a 79.8400a 159.6800a 159.6800a

temp 25.0000 25.0000 25.0000 25.0000

aic

nf 1.0000 1.0000 1.0000 1.0000

min 0. 0. 0. 0.

rbdb 15.0000 15.0000 15.0000 15.0000

rbsb 15.0000 15.0000 15.0000 15.0000

rbpb 5.0000 5.0000 5.0000 5.0000

rbps 15.0000 15.0000 15.0000 15.0000

rbpd 15.0000 15.0000 15.0000 15.0000

trnqsmod 0. 0. 0. 0.

acnqsmod 0. 0. 0. 0.

rbodymod 1.0000 1.0000 1.0000 1.0000

rgatemod 1.0000 1.0000 1.0000 1.0000

geomod 1.0000 1.0000 1.0000 1.0000

rgeomod 0. 0. 0. 0.

delvto 0. 0. 0. 0.

mulu0 1.0000 1.0000 1.0000 1.0000

delk1 0. 0. 0. 0.

delnfct 0. 0. 0. 0.

deltox 0. 0. 0. 0.

sa 0. 0. 0. 0.

sb 0. 0. 0. 0.

sd 0. 0. 0. 0.

saeff 0. 0. 0. 0.

sbeff 0. 0. 0. 0.

element name 9:m0 9:m1 10:m1 10:m2

drain 0:0 0:0 0:2 0:2

gate 0:n3 0:os 0:vdd 0:1

source 0:1 0:1 0:vdd 0:vdd

bulk 0:vdd 0:0 0:vdd 0:vdd

model 0:pmos_vtl 0:nmos_vtl 0:pmos_vtl 0:pmos_vtl

w eff 190.0000n 190.0000n 302.5000n 302.5000n

l eff 22.5000n 22.5000n 22.5000n 22.5000n

rd eff 0. 0. 0. 0.

rs eff 0. 0. 0. 0.

cdsat 10.0000f 10.0000f 10.0000f 10.0000f

cssat 15.8159a 15.8159a 24.7123a 24.7123a

capbd 99.8000a 99.8000a 155.9375a 155.9375a

capbs 159.6800a 159.6800a 249.5000a 249.5000a

temp 25.0000 25.0000 25.0000 25.0000

aic

nf 1.0000 1.0000 1.0000 1.0000

min 0. 0. 0. 0.

rbdb 15.0000 15.0000 15.0000 15.0000

rbsb 15.0000 15.0000 15.0000 15.0000

rbpb 5.0000 5.0000 5.0000 5.0000

rbps 15.0000 15.0000 15.0000 15.0000

rbpd 15.0000 15.0000 15.0000 15.0000

trnqsmod 0. 0. 0. 0.

acnqsmod 0. 0. 0. 0.

rbodymod 1.0000 1.0000 1.0000 1.0000

rgatemod 1.0000 1.0000 1.0000 1.0000

geomod 1.0000 1.0000 1.0000 1.0000

rgeomod 0. 0. 0. 0.

delvto 0. 0. 0. 0.

mulu0 1.0000 1.0000 1.0000 1.0000

delk1 0. 0. 0. 0.

delnfct 0. 0. 0. 0.

deltox 0. 0. 0. 0.

sa 0. 0. 0. 0.

sb 0. 0. 0. 0.

sd 0. 0. 0. 0.

saeff 0. 0. 0. 0.

sbeff 0. 0. 0. 0.

element name 10:m3 10:m4 11:m01 11:m02

drain 0:2 10:1 0:3 11:2

gate 0:1 0:vdd 0:2 0:0

source 10:1 0:0 11:2 0:vdd

bulk 10:1 0:0 11:2 0:vdd

model 0:nmos_vtl 0:nmos_vtl 0:pmos_vtl 0:pmos_vtl

w eff 240.0000n 240.0000n 490.0000n 490.0000n

l eff 22.5000n 22.5000n 22.5000n 22.5000n

rd eff 0. 0. 0. 0.

rs eff 0. 0. 0. 0.

cdsat 10.0000f 10.0000f 10.0000f 10.0000f

cssat 19.7699a 19.7699a 39.5397a 39.5397a

capbd 124.7500a 124.7500a 249.5000a 249.5000a

capbs 199.6000a 199.6000a 399.2000a 399.2000a

temp 25.0000 25.0000 25.0000 25.0000

aic

nf 1.0000 1.0000 1.0000 1.0000

min 0. 0. 0. 0.

rbdb 15.0000 15.0000 15.0000 15.0000

rbsb 15.0000 15.0000 15.0000 15.0000

rbpb 5.0000 5.0000 5.0000 5.0000

rbps 15.0000 15.0000 15.0000 15.0000

rbpd 15.0000 15.0000 15.0000 15.0000

trnqsmod 0. 0. 0. 0.

acnqsmod 0. 0. 0. 0.

rbodymod 1.0000 1.0000 1.0000 1.0000

rgatemod 1.0000 1.0000 1.0000 1.0000

geomod 1.0000 1.0000 1.0000 1.0000

rgeomod 0. 0. 0. 0.

delvto 0. 0. 0. 0.

mulu0 1.0000 1.0000 1.0000 1.0000

delk1 0. 0. 0. 0.

delnfct 0. 0. 0. 0.

deltox 0. 0. 0. 0.

sa 0. 0. 0. 0.

sb 0. 0. 0. 0.

sd 0. 0. 0. 0.

saeff 0. 0. 0. 0.

sbeff 0. 0. 0. 0.

element name 11:m03 11:m04 11:m05 11:m06

drain 0:3 0:3 11:f1 11:1

gate 0:vdd 0:2 0:0 0:vdd

source 0:vdd 11:1 11:1 0:0

bulk 0:vdd 11:1 11:1 0:0

model 0:pmos_vtl 0:nmos_vtl 0:nmos_vtl 0:nmos_vtl

w eff 240.0000n 190.0000n 190.0000n 190.0000n

l eff 22.5000n 22.5000n 22.5000n 22.5000n

rd eff 0. 0. 0. 0.

rs eff 0. 0. 0. 0.

cdsat 10.0000f 10.0000f 10.0000f 10.0000f

cssat 19.7699a 15.8159a 15.8159a 15.8159a

capbd 124.7500a 99.8000a 99.8000a 99.8000a

capbs 199.6000a 159.6800a 159.6800a 159.6800a

temp 25.0000 25.0000 25.0000 25.0000

aic

nf 1.0000 1.0000 1.0000 1.0000

min 0. 0. 0. 0.

rbdb 15.0000 15.0000 15.0000 15.0000

rbsb 15.0000 15.0000 15.0000 15.0000

rbpb 5.0000 5.0000 5.0000 5.0000

rbps 15.0000 15.0000 15.0000 15.0000

rbpd 15.0000 15.0000 15.0000 15.0000

trnqsmod 0. 0. 0. 0.

acnqsmod 0. 0. 0. 0.

rbodymod 1.0000 1.0000 1.0000 1.0000

rgatemod 1.0000 1.0000 1.0000 1.0000

geomod 1.0000 1.0000 1.0000 1.0000

rgeomod 0. 0. 0. 0.

delvto 0. 0. 0. 0.

mulu0 1.0000 1.0000 1.0000 1.0000

delk1 0. 0. 0. 0.

delnfct 0. 0. 0. 0.

deltox 0. 0. 0. 0.

sa 0. 0. 0. 0.

sb 0. 0. 0. 0.

sd 0. 0. 0. 0.

saeff 0. 0. 0. 0.

sbeff 0. 0. 0. 0.

element name 12:m01 12:m02 12:m03 12:m04

drain 0:51 12:2 0:51 0:51

gate 0:2 0:0 0:vdd 0:2

source 12:2 0:vdd 0:vdd 12:1

bulk 12:2 0:vdd 0:vdd 12:1

model 0:pmos_vtl 0:pmos_vtl 0:pmos_vtl 0:nmos_vtl

w eff 490.0000n 490.0000n 240.0000n 190.0000n

l eff 22.5000n 22.5000n 22.5000n 22.5000n

rd eff 0. 0. 0. 0.

rs eff 0. 0. 0. 0.

cdsat 10.0000f 10.0000f 10.0000f 10.0000f

cssat 39.5397a 39.5397a 19.7699a 15.8159a

capbd 249.5000a 249.5000a 124.7500a 99.8000a

capbs 399.2000a 399.2000a 199.6000a 159.6800a

temp 25.0000 25.0000 25.0000 25.0000

aic

nf 1.0000 1.0000 1.0000 1.0000

min 0. 0. 0. 0.

rbdb 15.0000 15.0000 15.0000 15.0000

rbsb 15.0000 15.0000 15.0000 15.0000

rbpb 5.0000 5.0000 5.0000 5.0000

rbps 15.0000 15.0000 15.0000 15.0000

rbpd 15.0000 15.0000 15.0000 15.0000

trnqsmod 0. 0. 0. 0.

acnqsmod 0. 0. 0. 0.

rbodymod 1.0000 1.0000 1.0000 1.0000

rgatemod 1.0000 1.0000 1.0000 1.0000

geomod 1.0000 1.0000 1.0000 1.0000

rgeomod 0. 0. 0. 0.

delvto 0. 0. 0. 0.

mulu0 1.0000 1.0000 1.0000 1.0000

delk1 0. 0. 0. 0.

delnfct 0. 0. 0. 0.

deltox 0. 0. 0. 0.

sa 0. 0. 0. 0.

sb 0. 0. 0. 0.

sd 0. 0. 0. 0.

saeff 0. 0. 0. 0.

sbeff 0. 0. 0. 0.

element name 12:m05 12:m06 13:m01 13:m02

drain 12:f1 12:1 0:4 0:4

gate 0:0 0:vdd 0:3 0:vdd

source 12:1 0:0 13:2 13:2

bulk 12:1 0:0 13:2 13:2

model 0:nmos_vtl 0:nmos_vtl 0:pmos_vtl 0:pmos_vtl

w eff 190.0000n 190.0000n 390.0000n 390.0000n

l eff 22.5000n 22.5000n 22.5000n 22.5000n

rd eff 0. 0. 0. 0.

rs eff 0. 0. 0. 0.

cdsat 10.0000f 10.0000f 10.0000f 10.0000f

cssat 15.8159a 15.8159a 31.6318a 31.6318a

capbd 99.8000a 99.8000a 199.6000a 199.6000a

capbs 159.6800a 159.6800a 319.3600a 319.3600a

temp 25.0000 25.0000 25.0000 25.0000

aic

nf 1.0000 1.0000 1.0000 1.0000

min 0. 0. 0. 0.

rbdb 15.0000 15.0000 15.0000 15.0000

rbsb 15.0000 15.0000 15.0000 15.0000

rbpb 5.0000 5.0000 5.0000 5.0000

rbps 15.0000 15.0000 15.0000 15.0000

rbpd 15.0000 15.0000 15.0000 15.0000

trnqsmod 0. 0. 0. 0.

acnqsmod 0. 0. 0. 0.

rbodymod 1.0000 1.0000 1.0000 1.0000

rgatemod 1.0000 1.0000 1.0000 1.0000

geomod 1.0000 1.0000 1.0000 1.0000

rgeomod 0. 0. 0. 0.

delvto 0. 0. 0. 0.

mulu0 1.0000 1.0000 1.0000 1.0000

delk1 0. 0. 0. 0.

delnfct 0. 0. 0. 0.

deltox 0. 0. 0. 0.

sa 0. 0. 0. 0.

sb 0. 0. 0. 0.

sd 0. 0. 0. 0.

saeff 0. 0. 0. 0.

sbeff 0. 0. 0. 0.

element name 13:m03 13:m04 13:m05 13:m06

drain 13:2 0:4 13:1 0:4

gate 0:0 0:3 0:vdd 0:0

source 0:vdd 13:1 0:0 0:0

bulk 0:vdd 13:1 0:0 0:0

model 0:pmos_vtl 0:nmos_vtl 0:nmos_vtl 0:nmos_vtl

w eff 390.0000n 150.0000n 150.0000n 70.0000n

l eff 22.5000n 22.5000n 22.5000n 22.5000n

rd eff 0. 0. 0. 0.

rs eff 0. 0. 0. 0.

cdsat 10.0000f 10.0000f 10.0000f 10.0000f

cssat 31.6318a 12.6527a 12.6527a 6.3264a

capbd 199.6000a 79.8400a 79.8400a 39.9200a

capbs 319.3600a 127.7440a 127.7440a 63.8720a

temp 25.0000 25.0000 25.0000 25.0000

aic

nf 1.0000 1.0000 1.0000 1.0000

min 0. 0. 0. 0.

rbdb 15.0000 15.0000 15.0000 15.0000

rbsb 15.0000 15.0000 15.0000 15.0000

rbpb 5.0000 5.0000 5.0000 5.0000

rbps 15.0000 15.0000 15.0000 15.0000

rbpd 15.0000 15.0000 15.0000 15.0000

trnqsmod 0. 0. 0. 0.

acnqsmod 0. 0. 0. 0.

rbodymod 1.0000 1.0000 1.0000 1.0000

rgatemod 1.0000 1.0000 1.0000 1.0000

geomod 1.0000 1.0000 1.0000 1.0000

rgeomod 0. 0. 0. 0.

delvto 0. 0. 0. 0.

mulu0 1.0000 1.0000 1.0000 1.0000

delk1 0. 0. 0. 0.

delnfct 0. 0. 0. 0.

deltox 0. 0. 0. 0.

sa 0. 0. 0. 0.

sb 0. 0. 0. 0.

sd 0. 0. 0. 0.

saeff 0. 0. 0. 0.

sbeff 0. 0. 0. 0.

element name 14:m01 14:m02 14:m03 14:m04

drain 0:52 0:52 14:2 0:52

gate 0:3 0:vdd 0:0 0:3

source 14:2 14:2 0:vdd 14:1

bulk 14:2 14:2 0:vdd 14:1

model 0:pmos_vtl 0:pmos_vtl 0:pmos_vtl 0:nmos_vtl

w eff 390.0000n 390.0000n 390.0000n 150.0000n

l eff 22.5000n 22.5000n 22.5000n 22.5000n

rd eff 0. 0. 0. 0.

rs eff 0. 0. 0. 0.

cdsat 10.0000f 10.0000f 10.0000f 10.0000f

cssat 31.6318a 31.6318a 31.6318a 12.6527a

capbd 199.6000a 199.6000a 199.6000a 79.8400a

capbs 319.3600a 319.3600a 319.3600a 127.7440a

temp 25.0000 25.0000 25.0000 25.0000

aic

nf 1.0000 1.0000 1.0000 1.0000

min 0. 0. 0. 0.

rbdb 15.0000 15.0000 15.0000 15.0000

rbsb 15.0000 15.0000 15.0000 15.0000

rbpb 5.0000 5.0000 5.0000 5.0000

rbps 15.0000 15.0000 15.0000 15.0000

rbpd 15.0000 15.0000 15.0000 15.0000

trnqsmod 0. 0. 0. 0.

acnqsmod 0. 0. 0. 0.

rbodymod 1.0000 1.0000 1.0000 1.0000

rgatemod 1.0000 1.0000 1.0000 1.0000

geomod 1.0000 1.0000 1.0000 1.0000

rgeomod 0. 0. 0. 0.

delvto 0. 0. 0. 0.

mulu0 1.0000 1.0000 1.0000 1.0000

delk1 0. 0. 0. 0.

delnfct 0. 0. 0. 0.

deltox 0. 0. 0. 0.

sa 0. 0. 0. 0.

sb 0. 0. 0. 0.

sd 0. 0. 0. 0.

saeff 0. 0. 0. 0.

sbeff 0. 0. 0. 0.

element name 14:m05 14:m06 15:m01 15:m02

drain 14:1 0:52 0:5 15:2

gate 0:vdd 0:0 0:4 0:0

source 0:0 0:0 15:2 0:vdd

bulk 0:0 0:0 15:2 0:vdd

model 0:nmos_vtl 0:nmos_vtl 0:pmos_vtl 0:pmos_vtl

w eff 150.0000n 70.0000n 315.0000n 315.0000n

l eff 22.5000n 22.5000n 22.5000n 22.5000n

rd eff 0. 0. 0. 0.

rs eff 0. 0. 0. 0.

cdsat 10.0000f 10.0000f 10.0000f 10.0000f

cssat 12.6527a 6.3264a 25.7008a 25.7008a

capbd 79.8400a 39.9200a 162.1750a 162.1750a

capbs 127.7440a 63.8720a 259.4800a 259.4800a

temp 25.0000 25.0000 25.0000 25.0000

aic

nf 1.0000 1.0000 1.0000 1.0000

min 0. 0. 0. 0.

rbdb 15.0000 15.0000 15.0000 15.0000

rbsb 15.0000 15.0000 15.0000 15.0000

rbpb 5.0000 5.0000 5.0000 5.0000

rbps 15.0000 15.0000 15.0000 15.0000

rbpd 15.0000 15.0000 15.0000 15.0000

trnqsmod 0. 0. 0. 0.

acnqsmod 0. 0. 0. 0.

rbodymod 1.0000 1.0000 1.0000 1.0000

rgatemod 1.0000 1.0000 1.0000 1.0000

geomod 1.0000 1.0000 1.0000 1.0000

rgeomod 0. 0. 0. 0.

delvto 0. 0. 0. 0.

mulu0 1.0000 1.0000 1.0000 1.0000

delk1 0. 0. 0. 0.

delnfct 0. 0. 0. 0.

deltox 0. 0. 0. 0.

sa 0. 0. 0. 0.

sb 0. 0. 0. 0.

sd 0. 0. 0. 0.

saeff 0. 0. 0. 0.

sbeff 0. 0. 0. 0.

element name 15:m03 15:m04 15:m05 15:m06

drain 0:5 0:5 15:f1 15:1

gate 0:vdd 0:4 0:0 0:vdd

source 0:vdd 15:1 15:1 0:0

bulk 0:vdd 15:1 15:1 0:0

model 0:pmos_vtl 0:nmos_vtl 0:nmos_vtl 0:nmos_vtl

w eff 152.5000n 120.0000n 120.0000n 120.0000n

l eff 22.5000n 22.5000n 22.5000n 22.5000n

rd eff 0. 0. 0. 0.

rs eff 0. 0. 0. 0.

cdsat 10.0000f 10.0000f 10.0000f 10.0000f

cssat 12.8504a 10.2803a 10.2803a 10.2803a

capbd 81.0875a 64.8700a 64.8700a 64.8700a

capbs 129.7400a 103.7920a 103.7920a 103.7920a

temp 25.0000 25.0000 25.0000 25.0000

aic

nf 1.0000 1.0000 1.0000 1.0000

min 0. 0. 0. 0.

rbdb 15.0000 15.0000 15.0000 15.0000

rbsb 15.0000 15.0000 15.0000 15.0000

rbpb 5.0000 5.0000 5.0000 5.0000

rbps 15.0000 15.0000 15.0000 15.0000

rbpd 15.0000 15.0000 15.0000 15.0000

trnqsmod 0. 0. 0. 0.

acnqsmod 0. 0. 0. 0.

rbodymod 1.0000 1.0000 1.0000 1.0000

rgatemod 1.0000 1.0000 1.0000 1.0000

geomod 1.0000 1.0000 1.0000 1.0000

rgeomod 0. 0. 0. 0.

delvto 0. 0. 0. 0.

mulu0 1.0000 1.0000 1.0000 1.0000

delk1 0. 0. 0. 0.

delnfct 0. 0. 0. 0.

deltox 0. 0. 0. 0.

sa 0. 0. 0. 0.

sb 0. 0. 0. 0.

sd 0. 0. 0. 0.

saeff 0. 0. 0. 0.

sbeff 0. 0. 0. 0.

element name 16:m01 16:m02 16:m03 16:m04

drain 0:53 16:2 0:53 0:53

gate 0:4 0:0 0:vdd 0:4

source 16:2 0:vdd 0:vdd 16:1

bulk 16:2 0:vdd 0:vdd 16:1

model 0:pmos_vtl 0:pmos_vtl 0:pmos_vtl 0:nmos_vtl

w eff 315.0000n 315.0000n 152.5000n 120.0000n

l eff 22.5000n 22.5000n 22.5000n 22.5000n

rd eff 0. 0. 0. 0.

rs eff 0. 0. 0. 0.

cdsat 10.0000f 10.0000f 10.0000f 10.0000f

cssat 25.7008a 25.7008a 12.8504a 10.2803a

capbd 162.1750a 162.1750a 81.0875a 64.8700a

capbs 259.4800a 259.4800a 129.7400a 103.7920a

temp 25.0000 25.0000 25.0000 25.0000

aic

nf 1.0000 1.0000 1.0000 1.0000

min 0. 0. 0. 0.

rbdb 15.0000 15.0000 15.0000 15.0000

rbsb 15.0000 15.0000 15.0000 15.0000

rbpb 5.0000 5.0000 5.0000 5.0000

rbps 15.0000 15.0000 15.0000 15.0000

rbpd 15.0000 15.0000 15.0000 15.0000

trnqsmod 0. 0. 0. 0.

acnqsmod 0. 0. 0. 0.

rbodymod 1.0000 1.0000 1.0000 1.0000

rgatemod 1.0000 1.0000 1.0000 1.0000

geomod 1.0000 1.0000 1.0000 1.0000

rgeomod 0. 0. 0. 0.

delvto 0. 0. 0. 0.

mulu0 1.0000 1.0000 1.0000 1.0000

delk1 0. 0. 0. 0.

delnfct 0. 0. 0. 0.

deltox 0. 0. 0. 0.

sa 0. 0. 0. 0.

sb 0. 0. 0. 0.

sd 0. 0. 0. 0.

saeff 0. 0. 0. 0.

sbeff 0. 0. 0. 0.

element name 16:m05 16:m06 17:m01 17:m02

drain 16:f1 16:1 0:6 0:6

gate 0:0 0:vdd 0:5 0:vdd

source 16:1 0:0 17:2 17:2

bulk 16:1 0:0 17:2 17:2

model 0:nmos_vtl 0:nmos_vtl 0:pmos_vtl 0:pmos_vtl

w eff 120.0000n 120.0000n 490.0000n 490.0000n

l eff 22.5000n 22.5000n 22.5000n 22.5000n

rd eff 0. 0. 0. 0.

rs eff 0. 0. 0. 0.

cdsat 10.0000f 10.0000f 10.0000f 10.0000f

cssat 10.2803a 10.2803a 39.5397a 39.5397a

capbd 64.8700a 64.8700a 249.5000a 249.5000a

capbs 103.7920a 103.7920a 399.2000a 399.2000a

temp 25.0000 25.0000 25.0000 25.0000

aic

nf 1.0000 1.0000 1.0000 1.0000

min 0. 0. 0. 0.

rbdb 15.0000 15.0000 15.0000 15.0000

rbsb 15.0000 15.0000 15.0000 15.0000

rbpb 5.0000 5.0000 5.0000 5.0000

rbps 15.0000 15.0000 15.0000 15.0000

rbpd 15.0000 15.0000 15.0000 15.0000

trnqsmod 0. 0. 0. 0.

acnqsmod 0. 0. 0. 0.

rbodymod 1.0000 1.0000 1.0000 1.0000

rgatemod 1.0000 1.0000 1.0000 1.0000

geomod 1.0000 1.0000 1.0000 1.0000

rgeomod 0. 0. 0. 0.

delvto 0. 0. 0. 0.

mulu0 1.0000 1.0000 1.0000 1.0000

delk1 0. 0. 0. 0.

delnfct 0. 0. 0. 0.

deltox 0. 0. 0. 0.

sa 0. 0. 0. 0.

sb 0. 0. 0. 0.

sd 0. 0. 0. 0.

saeff 0. 0. 0. 0.

sbeff 0. 0. 0. 0.

element name 17:m03 17:m04 17:m05 17:m06

drain 17:2 0:6 17:1 0:6

gate 0:0 0:5 0:vdd 0:0

source 0:vdd 17:1 0:0 0:0

bulk 0:vdd 17:1 0:0 0:0

model 0:pmos_vtl 0:nmos_vtl 0:nmos_vtl 0:nmos_vtl

w eff 490.0000n 190.0000n 190.0000n 90.0000n

l eff 22.5000n 22.5000n 22.5000n 22.5000n

rd eff 0. 0. 0. 0.

rs eff 0. 0. 0. 0.

cdsat 10.0000f 10.0000f 10.0000f 10.0000f

cssat 39.5397a 15.8159a 15.8159a 7.9079a

capbd 249.5000a 99.8000a 99.8000a 49.9000a

capbs 399.2000a 159.6800a 159.6800a 79.8400a

temp 25.0000 25.0000 25.0000 25.0000

aic

nf 1.0000 1.0000 1.0000 1.0000

min 0. 0. 0. 0.

rbdb 15.0000 15.0000 15.0000 15.0000

rbsb 15.0000 15.0000 15.0000 15.0000

rbpb 5.0000 5.0000 5.0000 5.0000

rbps 15.0000 15.0000 15.0000 15.0000

rbpd 15.0000 15.0000 15.0000 15.0000

trnqsmod 0. 0. 0. 0.

acnqsmod 0. 0. 0. 0.

rbodymod 1.0000 1.0000 1.0000 1.0000

rgatemod 1.0000 1.0000 1.0000 1.0000

geomod 1.0000 1.0000 1.0000 1.0000

rgeomod 0. 0. 0. 0.

delvto 0. 0. 0. 0.

mulu0 1.0000 1.0000 1.0000 1.0000

delk1 0. 0. 0. 0.

delnfct 0. 0. 0. 0.

deltox 0. 0. 0. 0.

sa 0. 0. 0. 0.

sb 0. 0. 0. 0.

sd 0. 0. 0. 0.

saeff 0. 0. 0. 0.

sbeff 0. 0. 0. 0.

element name 18:m01 18:m02 18:m03 18:m04

drain 0:7 18:2 0:7 0:7

gate 0:6 0:0 0:vdd 0:6

source 18:2 0:vdd 0:vdd 18:1

bulk 18:2 0:vdd 0:vdd 18:1

model 0:pmos_vtl 0:pmos_vtl 0:pmos_vtl 0:nmos_vtl

w eff 390.0000n 390.0000n 190.0000n 150.0000n

l eff 22.5000n 22.5000n 22.5000n 22.5000n

rd eff 0. 0. 0. 0.

rs eff 0. 0. 0. 0.

cdsat 10.0000f 10.0000f 10.0000f 10.0000f

cssat 31.6318a 31.6318a 15.8159a 12.6527a

capbd 199.6000a 199.6000a 99.8000a 79.8400a

capbs 319.3600a 319.3600a 159.6800a 127.7440a

temp 25.0000 25.0000 25.0000 25.0000

aic

nf 1.0000 1.0000 1.0000 1.0000

min 0. 0. 0. 0.

rbdb 15.0000 15.0000 15.0000 15.0000

rbsb 15.0000 15.0000 15.0000 15.0000

rbpb 5.0000 5.0000 5.0000 5.0000

rbps 15.0000 15.0000 15.0000 15.0000

rbpd 15.0000 15.0000 15.0000 15.0000

trnqsmod 0. 0. 0. 0.

acnqsmod 0. 0. 0. 0.

rbodymod 1.0000 1.0000 1.0000 1.0000

rgatemod 1.0000 1.0000 1.0000 1.0000

geomod 1.0000 1.0000 1.0000 1.0000

rgeomod 0. 0. 0. 0.

delvto 0. 0. 0. 0.

mulu0 1.0000 1.0000 1.0000 1.0000

delk1 0. 0. 0. 0.

delnfct 0. 0. 0. 0.

deltox 0. 0. 0. 0.

sa 0. 0. 0. 0.

sb 0. 0. 0. 0.

sd 0. 0. 0. 0.

saeff 0. 0. 0. 0.

sbeff 0. 0. 0. 0.

element name 18:m05 18:m06 19:m01 19:m02

drain 18:f1 18:1 0:54 19:2

gate 0:0 0:vdd 0:6 0:0

source 18:1 0:0 19:2 0:vdd

bulk 18:1 0:0 19:2 0:vdd

model 0:nmos_vtl 0:nmos_vtl 0:pmos_vtl 0:pmos_vtl

w eff 150.0000n 150.0000n 390.0000n 390.0000n

l eff 22.5000n 22.5000n 22.5000n 22.5000n

rd eff 0. 0. 0. 0.

rs eff 0. 0. 0. 0.

cdsat 10.0000f 10.0000f 10.0000f 10.0000f

cssat 12.6527a 12.6527a 31.6318a 31.6318a

capbd 79.8400a 79.8400a 199.6000a 199.6000a

capbs 127.7440a 127.7440a 319.3600a 319.3600a

temp 25.0000 25.0000 25.0000 25.0000

aic

nf 1.0000 1.0000 1.0000 1.0000

min 0. 0. 0. 0.

rbdb 15.0000 15.0000 15.0000 15.0000

rbsb 15.0000 15.0000 15.0000 15.0000

rbpb 5.0000 5.0000 5.0000 5.0000

rbps 15.0000 15.0000 15.0000 15.0000

rbpd 15.0000 15.0000 15.0000 15.0000

trnqsmod 0. 0. 0. 0.

acnqsmod 0. 0. 0. 0.

rbodymod 1.0000 1.0000 1.0000 1.0000

rgatemod 1.0000 1.0000 1.0000 1.0000

geomod 1.0000 1.0000 1.0000 1.0000

rgeomod 0. 0. 0. 0.

delvto 0. 0. 0. 0.

mulu0 1.0000 1.0000 1.0000 1.0000

delk1 0. 0. 0. 0.

delnfct 0. 0. 0. 0.

deltox 0. 0. 0. 0.

sa 0. 0. 0. 0.

sb 0. 0. 0. 0.

sd 0. 0. 0. 0.

saeff 0. 0. 0. 0.

sbeff 0. 0. 0. 0.

element name 19:m03 19:m04 19:m05 19:m06

drain 0:54 0:54 19:f1 19:1

gate 0:vdd 0:6 0:0 0:vdd

source 0:vdd 19:1 19:1 0:0

bulk 0:vdd 19:1 19:1 0:0

model 0:pmos_vtl 0:nmos_vtl 0:nmos_vtl 0:nmos_vtl

w eff 190.0000n 150.0000n 150.0000n 150.0000n

l eff 22.5000n 22.5000n 22.5000n 22.5000n

rd eff 0. 0. 0. 0.

rs eff 0. 0. 0. 0.

cdsat 10.0000f 10.0000f 10.0000f 10.0000f

cssat 15.8159a 12.6527a 12.6527a 12.6527a

capbd 99.8000a 79.8400a 79.8400a 79.8400a

capbs 159.6800a 127.7440a 127.7440a 127.7440a

temp 25.0000 25.0000 25.0000 25.0000

aic

nf 1.0000 1.0000 1.0000 1.0000

min 0. 0. 0. 0.

rbdb 15.0000 15.0000 15.0000 15.0000

rbsb 15.0000 15.0000 15.0000 15.0000

rbpb 5.0000 5.0000 5.0000 5.0000

rbps 15.0000 15.0000 15.0000 15.0000

rbpd 15.0000 15.0000 15.0000 15.0000

trnqsmod 0. 0. 0. 0.

acnqsmod 0. 0. 0. 0.

rbodymod 1.0000 1.0000 1.0000 1.0000

rgatemod 1.0000 1.0000 1.0000 1.0000

geomod 1.0000 1.0000 1.0000 1.0000

rgeomod 0. 0. 0. 0.

delvto 0. 0. 0. 0.

mulu0 1.0000 1.0000 1.0000 1.0000

delk1 0. 0. 0. 0.

delnfct 0. 0. 0. 0.

deltox 0. 0. 0. 0.

sa 0. 0. 0. 0.

sb 0. 0. 0. 0.

sd 0. 0. 0. 0.

saeff 0. 0. 0. 0.

sbeff 0. 0. 0. 0.

element name 20:m01 20:m02 20:m03 20:m04

drain 0:8 0:8 20:2 0:8

gate 0:7 0:vdd 0:0 0:7

source 20:2 20:2 0:vdd 20:1

bulk 20:2 20:2 0:vdd 20:1

model 0:pmos_vtl 0:pmos_vtl 0:pmos_vtl 0:nmos_vtl

w eff 315.0000n 315.0000n 315.0000n 120.0000n

l eff 22.5000n 22.5000n 22.5000n 22.5000n

rd eff 0. 0. 0. 0.

rs eff 0. 0. 0. 0.

cdsat 10.0000f 10.0000f 10.0000f 10.0000f

cssat 25.7008a 25.7008a 25.7008a 10.2803a

capbd 162.1750a 162.1750a 162.1750a 64.8700a

capbs 259.4800a 259.4800a 259.4800a 103.7920a

temp 25.0000 25.0000 25.0000 25.0000

aic

nf 1.0000 1.0000 1.0000 1.0000

min 0. 0. 0. 0.

rbdb 15.0000 15.0000 15.0000 15.0000

rbsb 15.0000 15.0000 15.0000 15.0000

rbpb 5.0000 5.0000 5.0000 5.0000

rbps 15.0000 15.0000 15.0000 15.0000

rbpd 15.0000 15.0000 15.0000 15.0000

trnqsmod 0. 0. 0. 0.

acnqsmod 0. 0. 0. 0.

rbodymod 1.0000 1.0000 1.0000 1.0000

rgatemod 1.0000 1.0000 1.0000 1.0000

geomod 1.0000 1.0000 1.0000 1.0000

rgeomod 0. 0. 0. 0.

delvto 0. 0. 0. 0.

mulu0 1.0000 1.0000 1.0000 1.0000

delk1 0. 0. 0. 0.

delnfct 0. 0. 0. 0.

deltox 0. 0. 0. 0.

sa 0. 0. 0. 0.

sb 0. 0. 0. 0.

sd 0. 0. 0. 0.

saeff 0. 0. 0. 0.

sbeff 0. 0. 0. 0.

element name 20:m05 20:m06 21:m01 21:m02

drain 20:1 0:8 0:55 0:55

gate 0:vdd 0:0 0:7 0:vdd

source 0:0 0:0 21:2 21:2

bulk 0:0 0:0 21:2 21:2

model 0:nmos_vtl 0:nmos_vtl 0:pmos_vtl 0:pmos_vtl

w eff 120.0000n 55.0000n 315.0000n 315.0000n

l eff 22.5000n 22.5000n 22.5000n 22.5000n

rd eff 0. 0. 0. 0.

rs eff 0. 0. 0. 0.

cdsat 10.0000f 10.0000f 10.0000f 10.0000f

cssat 10.2803a 5.1402a 25.7008a 25.7008a

capbd 64.8700a 32.4350a 162.1750a 162.1750a

capbs 103.7920a 51.8960a 259.4800a 259.4800a

temp 25.0000 25.0000 25.0000 25.0000

aic

nf 1.0000 1.0000 1.0000 1.0000

min 0. 0. 0. 0.

rbdb 15.0000 15.0000 15.0000 15.0000

rbsb 15.0000 15.0000 15.0000 15.0000

rbpb 5.0000 5.0000 5.0000 5.0000

rbps 15.0000 15.0000 15.0000 15.0000

rbpd 15.0000 15.0000 15.0000 15.0000

trnqsmod 0. 0. 0. 0.

acnqsmod 0. 0. 0. 0.

rbodymod 1.0000 1.0000 1.0000 1.0000

rgatemod 1.0000 1.0000 1.0000 1.0000

geomod 1.0000 1.0000 1.0000 1.0000

rgeomod 0. 0. 0. 0.

delvto 0. 0. 0. 0.

mulu0 1.0000 1.0000 1.0000 1.0000

delk1 0. 0. 0. 0.

delnfct 0. 0. 0. 0.

deltox 0. 0. 0. 0.

sa 0. 0. 0. 0.

sb 0. 0. 0. 0.

sd 0. 0. 0. 0.

saeff 0. 0. 0. 0.

sbeff 0. 0. 0. 0.

element name 21:m03 21:m04 21:m05 21:m06

drain 21:2 0:55 21:1 0:55

gate 0:0 0:7 0:vdd 0:0

source 0:vdd 21:1 0:0 0:0

bulk 0:vdd 21:1 0:0 0:0

model 0:pmos_vtl 0:nmos_vtl 0:nmos_vtl 0:nmos_vtl

w eff 315.0000n 120.0000n 120.0000n 55.0000n

l eff 22.5000n 22.5000n 22.5000n 22.5000n

rd eff 0. 0. 0. 0.

rs eff 0. 0. 0. 0.

cdsat 10.0000f 10.0000f 10.0000f 10.0000f

cssat 25.7008a 10.2803a 10.2803a 5.1402a

capbd 162.1750a 64.8700a 64.8700a 32.4350a

capbs 259.4800a 103.7920a 103.7920a 51.8960a

temp 25.0000 25.0000 25.0000 25.0000

aic

nf 1.0000 1.0000 1.0000 1.0000

min 0. 0. 0. 0.

rbdb 15.0000 15.0000 15.0000 15.0000

rbsb 15.0000 15.0000 15.0000 15.0000

rbpb 5.0000 5.0000 5.0000 5.0000

rbps 15.0000 15.0000 15.0000 15.0000

rbpd 15.0000 15.0000 15.0000 15.0000

trnqsmod 0. 0. 0. 0.

acnqsmod 0. 0. 0. 0.

rbodymod 1.0000 1.0000 1.0000 1.0000

rgatemod 1.0000 1.0000 1.0000 1.0000

geomod 1.0000 1.0000 1.0000 1.0000

rgeomod 0. 0. 0. 0.

delvto 0. 0. 0. 0.

mulu0 1.0000 1.0000 1.0000 1.0000

delk1 0. 0. 0. 0.

delnfct 0. 0. 0. 0.

deltox 0. 0. 0. 0.

sa 0. 0. 0. 0.

sb 0. 0. 0. 0.

sd 0. 0. 0. 0.

saeff 0. 0. 0. 0.

sbeff 0. 0. 0. 0.

element name 22:m01 22:m02 22:m03 22:m04

drain 0:9 22:2 0:9 0:9

gate 0:8 0:0 0:vdd 0:8

source 22:2 0:vdd 0:vdd 22:1

bulk 22:2 0:vdd 0:vdd 22:1

model 0:pmos_vtl 0:pmos_vtl 0:pmos_vtl 0:nmos_vtl

w eff 490.0000n 490.0000n 240.0000n 190.0000n

l eff 22.5000n 22.5000n 22.5000n 22.5000n

rd eff 0. 0. 0. 0.

rs eff 0. 0. 0. 0.

cdsat 10.0000f 10.0000f 10.0000f 10.0000f

cssat 39.5397a 39.5397a 19.7699a 15.8159a

capbd 249.5000a 249.5000a 124.7500a 99.8000a

capbs 399.2000a 399.2000a 199.6000a 159.6800a

temp 25.0000 25.0000 25.0000 25.0000

aic

nf 1.0000 1.0000 1.0000 1.0000

min 0. 0. 0. 0.

rbdb 15.0000 15.0000 15.0000 15.0000

rbsb 15.0000 15.0000 15.0000 15.0000

rbpb 5.0000 5.0000 5.0000 5.0000

rbps 15.0000 15.0000 15.0000 15.0000

rbpd 15.0000 15.0000 15.0000 15.0000

trnqsmod 0. 0. 0. 0.

acnqsmod 0. 0. 0. 0.

rbodymod 1.0000 1.0000 1.0000 1.0000

rgatemod 1.0000 1.0000 1.0000 1.0000

geomod 1.0000 1.0000 1.0000 1.0000

rgeomod 0. 0. 0. 0.

delvto 0. 0. 0. 0.

mulu0 1.0000 1.0000 1.0000 1.0000

delk1 0. 0. 0. 0.

delnfct 0. 0. 0. 0.

deltox 0. 0. 0. 0.

sa 0. 0. 0. 0.

sb 0. 0. 0. 0.

sd 0. 0. 0. 0.

saeff 0. 0. 0. 0.

sbeff 0. 0. 0. 0.

element name 22:m05 22:m06 23:m1 23:m2

drain 22:f1 22:1 0:10 0:10

gate 0:0 0:vdd 0:vdd 0:9

source 22:1 0:0 0:vdd 0:vdd

bulk 22:1 0:0 0:vdd 0:vdd

model 0:nmos_vtl 0:nmos_vtl 0:pmos_vtl 0:pmos_vtl

w eff 190.0000n 190.0000n 302.5000n 302.5000n

l eff 22.5000n 22.5000n 22.5000n 22.5000n

rd eff 0. 0. 0. 0.

rs eff 0. 0. 0. 0.

cdsat 10.0000f 10.0000f 10.0000f 10.0000f

cssat 15.8159a 15.8159a 24.7123a 24.7123a

capbd 99.8000a 99.8000a 155.9375a 155.9375a

capbs 159.6800a 159.6800a 249.5000a 249.5000a

temp 25.0000 25.0000 25.0000 25.0000

aic

nf 1.0000 1.0000 1.0000 1.0000

min 0. 0. 0. 0.

rbdb 15.0000 15.0000 15.0000 15.0000

rbsb 15.0000 15.0000 15.0000 15.0000

rbpb 5.0000 5.0000 5.0000 5.0000

rbps 15.0000 15.0000 15.0000 15.0000

rbpd 15.0000 15.0000 15.0000 15.0000

trnqsmod 0. 0. 0. 0.

acnqsmod 0. 0. 0. 0.

rbodymod 1.0000 1.0000 1.0000 1.0000

rgatemod 1.0000 1.0000 1.0000 1.0000

geomod 1.0000 1.0000 1.0000 1.0000

rgeomod 0. 0. 0. 0.

delvto 0. 0. 0. 0.

mulu0 1.0000 1.0000 1.0000 1.0000

delk1 0. 0. 0. 0.

delnfct 0. 0. 0. 0.

deltox 0. 0. 0. 0.

sa 0. 0. 0. 0.

sb 0. 0. 0. 0.

sd 0. 0. 0. 0.

saeff 0. 0. 0. 0.

sbeff 0. 0. 0. 0.

element name 23:m3 23:m4 24:m1 24:m2

drain 0:10 23:1 0:50 0:50

gate 0:9 0:vdd 0:vdd 0:9

source 23:1 0:0 0:vdd 0:vdd

bulk 23:1 0:0 0:vdd 0:vdd

model 0:nmos_vtl 0:nmos_vtl 0:pmos_vtl 0:pmos_vtl

w eff 240.0000n 240.0000n 302.5000n 302.5000n

l eff 22.5000n 22.5000n 22.5000n 22.5000n

rd eff 0. 0. 0. 0.

rs eff 0. 0. 0. 0.

cdsat 10.0000f 10.0000f 10.0000f 10.0000f

cssat 19.7699a 19.7699a 24.7123a 24.7123a

capbd 124.7500a 124.7500a 155.9375a 155.9375a

capbs 199.6000a 199.6000a 249.5000a 249.5000a

temp 25.0000 25.0000 25.0000 25.0000

aic

nf 1.0000 1.0000 1.0000 1.0000

min 0. 0. 0. 0.

rbdb 15.0000 15.0000 15.0000 15.0000

rbsb 15.0000 15.0000 15.0000 15.0000

rbpb 5.0000 5.0000 5.0000 5.0000

rbps 15.0000 15.0000 15.0000 15.0000

rbpd 15.0000 15.0000 15.0000 15.0000

trnqsmod 0. 0. 0. 0.

acnqsmod 0. 0. 0. 0.

rbodymod 1.0000 1.0000 1.0000 1.0000

rgatemod 1.0000 1.0000 1.0000 1.0000

geomod 1.0000 1.0000 1.0000 1.0000

rgeomod 0. 0. 0. 0.

delvto 0. 0. 0. 0.

mulu0 1.0000 1.0000 1.0000 1.0000

delk1 0. 0. 0. 0.

delnfct 0. 0. 0. 0.

deltox 0. 0. 0. 0.

sa 0. 0. 0. 0.

sb 0. 0. 0. 0.

sd 0. 0. 0. 0.

saeff 0. 0. 0. 0.

sbeff 0. 0. 0. 0.

element name 24:m3 24:m4 25:m01 25:m02

drain 0:50 24:1 0:11 25:2

gate 0:9 0:vdd 0:10 0:0

source 24:1 0:0 25:2 0:vdd

bulk 24:1 0:0 25:2 0:vdd

model 0:nmos_vtl 0:nmos_vtl 0:pmos_vtl 0:pmos_vtl

w eff 240.0000n 240.0000n 190.0000n 190.0000n

l eff 22.5000n 22.5000n 22.5000n 22.5000n

rd eff 0. 0. 0. 0.

rs eff 0. 0. 0. 0.

cdsat 10.0000f 10.0000f 10.0000f 10.0000f

cssat 19.7699a 19.7699a 15.8159a 15.8159a

capbd 124.7500a 124.7500a 99.8000a 99.8000a

capbs 199.6000a 199.6000a 159.6800a 159.6800a

temp 25.0000 25.0000 25.0000 25.0000

aic

nf 1.0000 1.0000 1.0000 1.0000

min 0. 0. 0. 0.

rbdb 15.0000 15.0000 15.0000 15.0000

rbsb 15.0000 15.0000 15.0000 15.0000

rbpb 5.0000 5.0000 5.0000 5.0000

rbps 15.0000 15.0000 15.0000 15.0000

rbpd 15.0000 15.0000 15.0000 15.0000

trnqsmod 0. 0. 0. 0.

acnqsmod 0. 0. 0. 0.

rbodymod 1.0000 1.0000 1.0000 1.0000

rgatemod 1.0000 1.0000 1.0000 1.0000

geomod 1.0000 1.0000 1.0000 1.0000

rgeomod 0. 0. 0. 0.

delvto 0. 0. 0. 0.

mulu0 1.0000 1.0000 1.0000 1.0000

delk1 0. 0. 0. 0.

delnfct 0. 0. 0. 0.

deltox 0. 0. 0. 0.

sa 0. 0. 0. 0.

sb 0. 0. 0. 0.

sd 0. 0. 0. 0.

saeff 0. 0. 0. 0.

sbeff 0. 0. 0. 0.

element name 25:m03 25:m04 25:m05 25:m06

drain 0:11 0:11 25:f1 25:1

gate 0:vdd 0:10 0:0 0:vdd

source 0:vdd 25:1 25:1 0:0

bulk 0:vdd 25:1 25:1 0:0

model 0:pmos_vtl 0:nmos_vtl 0:nmos_vtl 0:nmos_vtl

w eff 90.0000n 70.0000n 70.0000n 70.0000n

l eff 22.5000n 22.5000n 22.5000n 22.5000n

rd eff 0. 0. 0. 0.

rs eff 0. 0. 0. 0.

cdsat 10.0000f 10.0000f 10.0000f 10.0000f

cssat 7.9079a 6.3264a 6.3264a 6.3264a

capbd 49.9000a 39.9200a 39.9200a 39.9200a

capbs 79.8400a 63.8720a 63.8720a 63.8720a

temp 25.0000 25.0000 25.0000 25.0000

aic

nf 1.0000 1.0000 1.0000 1.0000

min 0. 0. 0. 0.

rbdb 15.0000 15.0000 15.0000 15.0000

rbsb 15.0000 15.0000 15.0000 15.0000

rbpb 5.0000 5.0000 5.0000 5.0000

rbps 15.0000 15.0000 15.0000 15.0000

rbpd 15.0000 15.0000 15.0000 15.0000

trnqsmod 0. 0. 0. 0.

acnqsmod 0. 0. 0. 0.

rbodymod 1.0000 1.0000 1.0000 1.0000

rgatemod 1.0000 1.0000 1.0000 1.0000

geomod 1.0000 1.0000 1.0000 1.0000

rgeomod 0. 0. 0. 0.

delvto 0. 0. 0. 0.

mulu0 1.0000 1.0000 1.0000 1.0000

delk1 0. 0. 0. 0.

delnfct 0. 0. 0. 0.

deltox 0. 0. 0. 0.

sa 0. 0. 0. 0.

sb 0. 0. 0. 0.

sd 0. 0. 0. 0.

saeff 0. 0. 0. 0.

sbeff 0. 0. 0. 0.

element name 26:m01 26:m02 26:m03 26:m04

drain 0:35 26:2 0:35 0:35

gate 0:10 0:0 0:vdd 0:10

source 26:2 0:vdd 0:vdd 26:1

bulk 26:2 0:vdd 0:vdd 26:1

model 0:pmos_vtl 0:pmos_vtl 0:pmos_vtl 0:nmos_vtl

w eff 190.0000n 190.0000n 90.0000n 70.0000n

l eff 22.5000n 22.5000n 22.5000n 22.5000n

rd eff 0. 0. 0. 0.

rs eff 0. 0. 0. 0.

cdsat 10.0000f 10.0000f 10.0000f 10.0000f

cssat 15.8159a 15.8159a 7.9079a 6.3264a

capbd 99.8000a 99.8000a 49.9000a 39.9200a

capbs 159.6800a 159.6800a 79.8400a 63.8720a

temp 25.0000 25.0000 25.0000 25.0000

aic

nf 1.0000 1.0000 1.0000 1.0000

min 0. 0. 0. 0.

rbdb 15.0000 15.0000 15.0000 15.0000

rbsb 15.0000 15.0000 15.0000 15.0000

rbpb 5.0000 5.0000 5.0000 5.0000

rbps 15.0000 15.0000 15.0000 15.0000

rbpd 15.0000 15.0000 15.0000 15.0000

trnqsmod 0. 0. 0. 0.

acnqsmod 0. 0. 0. 0.

rbodymod 1.0000 1.0000 1.0000 1.0000

rgatemod 1.0000 1.0000 1.0000 1.0000

geomod 1.0000 1.0000 1.0000 1.0000

rgeomod 0. 0. 0. 0.

delvto 0. 0. 0. 0.

mulu0 1.0000 1.0000 1.0000 1.0000

delk1 0. 0. 0. 0.

delnfct 0. 0. 0. 0.

deltox 0. 0. 0. 0.

sa 0. 0. 0. 0.

sb 0. 0. 0. 0.

sd 0. 0. 0. 0.

saeff 0. 0. 0. 0.

sbeff 0. 0. 0. 0.

element name 26:m05 26:m06 27:m01 27:m02

drain 26:f1 26:1 0:36 27:2

gate 0:0 0:vdd 0:10 0:0

source 26:1 0:0 27:2 0:vdd

bulk 26:1 0:0 27:2 0:vdd

model 0:nmos_vtl 0:nmos_vtl 0:pmos_vtl 0:pmos_vtl

w eff 70.0000n 70.0000n 190.0000n 190.0000n

l eff 22.5000n 22.5000n 22.5000n 22.5000n

rd eff 0. 0. 0. 0.

rs eff 0. 0. 0. 0.

cdsat 10.0000f 10.0000f 10.0000f 10.0000f

cssat 6.3264a 6.3264a 15.8159a 15.8159a

capbd 39.9200a 39.9200a 99.8000a 99.8000a

capbs 63.8720a 63.8720a 159.6800a 159.6800a

temp 25.0000 25.0000 25.0000 25.0000

aic

nf 1.0000 1.0000 1.0000 1.0000

min 0. 0. 0. 0.

rbdb 15.0000 15.0000 15.0000 15.0000

rbsb 15.0000 15.0000 15.0000 15.0000

rbpb 5.0000 5.0000 5.0000 5.0000

rbps 15.0000 15.0000 15.0000 15.0000

rbpd 15.0000 15.0000 15.0000 15.0000

trnqsmod 0. 0. 0. 0.

acnqsmod 0. 0. 0. 0.

rbodymod 1.0000 1.0000 1.0000 1.0000

rgatemod 1.0000 1.0000 1.0000 1.0000

geomod 1.0000 1.0000 1.0000 1.0000

rgeomod 0. 0. 0. 0.

delvto 0. 0. 0. 0.

mulu0 1.0000 1.0000 1.0000 1.0000

delk1 0. 0. 0. 0.

delnfct 0. 0. 0. 0.

deltox 0. 0. 0. 0.

sa 0. 0. 0. 0.

sb 0. 0. 0. 0.

sd 0. 0. 0. 0.

saeff 0. 0. 0. 0.

sbeff 0. 0. 0. 0.

element name 27:m03 27:m04 27:m05 27:m06

drain 0:36 0:36 27:f1 27:1

gate 0:vdd 0:10 0:0 0:vdd

source 0:vdd 27:1 27:1 0:0

bulk 0:vdd 27:1 27:1 0:0

model 0:pmos_vtl 0:nmos_vtl 0:nmos_vtl 0:nmos_vtl

w eff 90.0000n 70.0000n 70.0000n 70.0000n

l eff 22.5000n 22.5000n 22.5000n 22.5000n

rd eff 0. 0. 0. 0.

rs eff 0. 0. 0. 0.

cdsat 10.0000f 10.0000f 10.0000f 10.0000f

cssat 7.9079a 6.3264a 6.3264a 6.3264a

capbd 49.9000a 39.9200a 39.9200a 39.9200a

capbs 79.8400a 63.8720a 63.8720a 63.8720a

temp 25.0000 25.0000 25.0000 25.0000

aic

nf 1.0000 1.0000 1.0000 1.0000

min 0. 0. 0. 0.

rbdb 15.0000 15.0000 15.0000 15.0000

rbsb 15.0000 15.0000 15.0000 15.0000

rbpb 5.0000 5.0000 5.0000 5.0000

rbps 15.0000 15.0000 15.0000 15.0000

rbpd 15.0000 15.0000 15.0000 15.0000

trnqsmod 0. 0. 0. 0.

acnqsmod 0. 0. 0. 0.

rbodymod 1.0000 1.0000 1.0000 1.0000

rgatemod 1.0000 1.0000 1.0000 1.0000

geomod 1.0000 1.0000 1.0000 1.0000

rgeomod 0. 0. 0. 0.

delvto 0. 0. 0. 0.

mulu0 1.0000 1.0000 1.0000 1.0000

delk1 0. 0. 0. 0.

delnfct 0. 0. 0. 0.

deltox 0. 0. 0. 0.

sa 0. 0. 0. 0.

sb 0. 0. 0. 0.

sd 0. 0. 0. 0.

saeff 0. 0. 0. 0.

sbeff 0. 0. 0. 0.

element name 28:m01 28:m02 28:m03 28:m04

drain 0:37 28:2 0:37 0:37

gate 0:10 0:0 0:vdd 0:10

source 28:2 0:vdd 0:vdd 28:1

bulk 28:2 0:vdd 0:vdd 28:1

model 0:pmos_vtl 0:pmos_vtl 0:pmos_vtl 0:nmos_vtl

w eff 190.0000n 190.0000n 90.0000n 70.0000n

l eff 22.5000n 22.5000n 22.5000n 22.5000n

rd eff 0. 0. 0. 0.

rs eff 0. 0. 0. 0.

cdsat 10.0000f 10.0000f 10.0000f 10.0000f

cssat 15.8159a 15.8159a 7.9079a 6.3264a

capbd 99.8000a 99.8000a 49.9000a 39.9200a

capbs 159.6800a 159.6800a 79.8400a 63.8720a

temp 25.0000 25.0000 25.0000 25.0000

aic

nf 1.0000 1.0000 1.0000 1.0000

min 0. 0. 0. 0.

rbdb 15.0000 15.0000 15.0000 15.0000

rbsb 15.0000 15.0000 15.0000 15.0000

rbpb 5.0000 5.0000 5.0000 5.0000

rbps 15.0000 15.0000 15.0000 15.0000

rbpd 15.0000 15.0000 15.0000 15.0000

trnqsmod 0. 0. 0. 0.

acnqsmod 0. 0. 0. 0.

rbodymod 1.0000 1.0000 1.0000 1.0000

rgatemod 1.0000 1.0000 1.0000 1.0000

geomod 1.0000 1.0000 1.0000 1.0000

rgeomod 0. 0. 0. 0.

delvto 0. 0. 0. 0.

mulu0 1.0000 1.0000 1.0000 1.0000

delk1 0. 0. 0. 0.

delnfct 0. 0. 0. 0.

deltox 0. 0. 0. 0.

sa 0. 0. 0. 0.

sb 0. 0. 0. 0.

sd 0. 0. 0. 0.

saeff 0. 0. 0. 0.

sbeff 0. 0. 0. 0.

element name 28:m05 28:m06 29:m1 29:m2

drain 28:f1 28:1 0:n4 0:n4

gate 0:0 0:vdd 0:11 0:11

source 28:1 0:0 0:vdd 0:0

bulk 28:1 0:0 0:vdd 0:0

model 0:nmos_vtl 0:nmos_vtl 0:pmos_vtl 0:nmos_vtl

w eff 70.0000n 70.0000n 240.0000n 90.0000n

l eff 22.5000n 22.5000n 22.5000n 22.5000n

rd eff 0. 0. 0. 0.

rs eff 0. 0. 0. 0.

cdsat 10.0000f 10.0000f 10.0000f 10.0000f

cssat 6.3264a 6.3264a 19.7699a 7.9079a

capbd 39.9200a 39.9200a 124.7500a 49.9000a

capbs 63.8720a 63.8720a 199.6000a 79.8400a

temp 25.0000 25.0000 25.0000 25.0000

aic

nf 1.0000 1.0000 1.0000 1.0000

min 0. 0. 0. 0.

rbdb 15.0000 15.0000 15.0000 15.0000

rbsb 15.0000 15.0000 15.0000 15.0000

rbpb 5.0000 5.0000 5.0000 5.0000

rbps 15.0000 15.0000 15.0000 15.0000

rbpd 15.0000 15.0000 15.0000 15.0000

trnqsmod 0. 0. 0. 0.

acnqsmod 0. 0. 0. 0.

rbodymod 1.0000 1.0000 1.0000 1.0000

rgatemod 1.0000 1.0000 1.0000 1.0000

geomod 1.0000 1.0000 1.0000 1.0000

rgeomod 0. 0. 0. 0.

delvto 0. 0. 0. 0.

mulu0 1.0000 1.0000 1.0000 1.0000

delk1 0. 0. 0. 0.

delnfct 0. 0. 0. 0.

deltox 0. 0. 0. 0.

sa 0. 0. 0. 0.

sb 0. 0. 0. 0.

sd 0. 0. 0. 0.

saeff 0. 0. 0. 0.

sbeff 0. 0. 0. 0.

element name 30:m0 30:m1 31:m0 31:m1

drain 0:0 0:0 0:vdd 0:vdd

gate 0:11 0:n4 0:n4 0:11

source 0:12 0:12 0:12 0:12

bulk 0:vdd 0:0 0:vdd 0:0

model 0:pmos_vtl 0:nmos_vtl 0:pmos_vtl 0:nmos_vtl

w eff 90.0000n 90.0000n 90.0000n 90.0000n

l eff 22.5000n 22.5000n 22.5000n 22.5000n

rd eff 0. 0. 0. 0.

rs eff 0. 0. 0. 0.

cdsat 10.0000f 10.0000f 10.0000f 10.0000f

cssat 7.9079a 7.9079a 7.9079a 7.9079a

capbd 49.9000a 49.9000a 49.9000a 49.9000a

capbs 79.8400a 79.8400a 79.8400a 79.8400a

temp 25.0000 25.0000 25.0000 25.0000

aic

nf 1.0000 1.0000 1.0000 1.0000

min 0. 0. 0. 0.

rbdb 15.0000 15.0000 15.0000 15.0000

rbsb 15.0000 15.0000 15.0000 15.0000

rbpb 5.0000 5.0000 5.0000 5.0000

rbps 15.0000 15.0000 15.0000 15.0000

rbpd 15.0000 15.0000 15.0000 15.0000

trnqsmod 0. 0. 0. 0.

acnqsmod 0. 0. 0. 0.

rbodymod 1.0000 1.0000 1.0000 1.0000

rgatemod 1.0000 1.0000 1.0000 1.0000

geomod 1.0000 1.0000 1.0000 1.0000

rgeomod 0. 0. 0. 0.

delvto 0. 0. 0. 0.

mulu0 1.0000 1.0000 1.0000 1.0000

delk1 0. 0. 0. 0.

delnfct 0. 0. 0. 0.

deltox 0. 0. 0. 0.

sa 0. 0. 0. 0.

sb 0. 0. 0. 0.

sd 0. 0. 0. 0.

saeff 0. 0. 0. 0.

sbeff 0. 0. 0. 0.

element name 32:m1 32:m2 33:m1 33:m2

drain 0:n6 0:n6 0:n5 0:n5

gate 0:12 0:12 0:vdd 0:vdd

source 0:vdd 0:0 0:vdd 0:0

bulk 0:vdd 0:0 0:vdd 0:0

model 0:pmos_vtl 0:nmos_vtl 0:pmos_vtl 0:nmos_vtl

w eff 240.0000n 90.0000n 240.0000n 90.0000n

l eff 22.5000n 22.5000n 22.5000n 22.5000n

rd eff 0. 0. 0. 0.

rs eff 0. 0. 0. 0.

cdsat 10.0000f 10.0000f 10.0000f 10.0000f

cssat 19.7699a 7.9079a 19.7699a 7.9079a

capbd 124.7500a 49.9000a 124.7500a 49.9000a

capbs 199.6000a 79.8400a 199.6000a 79.8400a

temp 25.0000 25.0000 25.0000 25.0000

aic

nf 1.0000 1.0000 1.0000 1.0000

min 0. 0. 0. 0.

rbdb 15.0000 15.0000 15.0000 15.0000

rbsb 15.0000 15.0000 15.0000 15.0000

rbpb 5.0000 5.0000 5.0000 5.0000

rbps 15.0000 15.0000 15.0000 15.0000

rbpd 15.0000 15.0000 15.0000 15.0000

trnqsmod 0. 0. 0. 0.

acnqsmod 0. 0. 0. 0.

rbodymod 1.0000 1.0000 1.0000 1.0000

rgatemod 1.0000 1.0000 1.0000 1.0000

geomod 1.0000 1.0000 1.0000 1.0000

rgeomod 0. 0. 0. 0.

delvto 0. 0. 0. 0.

mulu0 1.0000 1.0000 1.0000 1.0000

delk1 0. 0. 0. 0.

delnfct 0. 0. 0. 0.

deltox 0. 0. 0. 0.

sa 0. 0. 0. 0.

sb 0. 0. 0. 0.

sd 0. 0. 0. 0.

saeff 0. 0. 0. 0.

sbeff 0. 0. 0. 0.

element name 34:m0 34:m1 35:m0 35:m1

drain 0:12 0:12 0:n6 0:n6

gate 0:vdd 0:n5 0:n5 0:vdd

source 0:s75 0:s75 0:s75 0:s75

bulk 0:vdd 0:0 0:vdd 0:0

model 0:pmos_vtl 0:nmos_vtl 0:pmos_vtl 0:nmos_vtl

w eff 90.0000n 90.0000n 90.0000n 90.0000n

l eff 22.5000n 22.5000n 22.5000n 22.5000n

rd eff 0. 0. 0. 0.

rs eff 0. 0. 0. 0.

cdsat 10.0000f 10.0000f 10.0000f 10.0000f

cssat 7.9079a 7.9079a 7.9079a 7.9079a

capbd 49.9000a 49.9000a 49.9000a 49.9000a

capbs 79.8400a 79.8400a 79.8400a 79.8400a

temp 25.0000 25.0000 25.0000 25.0000

aic

nf 1.0000 1.0000 1.0000 1.0000

min 0. 0. 0. 0.

rbdb 15.0000 15.0000 15.0000 15.0000

rbsb 15.0000 15.0000 15.0000 15.0000

rbpb 5.0000 5.0000 5.0000 5.0000

rbps 15.0000 15.0000 15.0000 15.0000

rbpd 15.0000 15.0000 15.0000 15.0000

trnqsmod 0. 0. 0. 0.

acnqsmod 0. 0. 0. 0.

rbodymod 1.0000 1.0000 1.0000 1.0000

rgatemod 1.0000 1.0000 1.0000 1.0000

geomod 1.0000 1.0000 1.0000 1.0000

rgeomod 0. 0. 0. 0.

delvto 0. 0. 0. 0.

mulu0 1.0000 1.0000 1.0000 1.0000

delk1 0. 0. 0. 0.

delnfct 0. 0. 0. 0.

deltox 0. 0. 0. 0.

sa 0. 0. 0. 0.

sb 0. 0. 0. 0.

sd 0. 0. 0. 0.

saeff 0. 0. 0. 0.

sbeff 0. 0. 0. 0.

element name 36:m1 36:m2

drain 0:out 0:out

gate 0:s75 0:s75

source 0:vdd 0:0

bulk 0:vdd 0:0

model 0:pmos_vtl 0:nmos_vtl

w eff 240.0000n 90.0000n

l eff 22.5000n 22.5000n

rd eff 0. 0.

rs eff 0. 0.

cdsat 10.0000f 10.0000f

cssat 19.7699a 7.9079a

capbd 124.7500a 49.9000a

capbs 199.6000a 79.8400a

temp 25.0000 25.0000

aic

nf 1.0000 1.0000

min 0. 0.

rbdb 15.0000 15.0000

rbsb 15.0000 15.0000

rbpb 5.0000 5.0000

rbps 15.0000 15.0000

rbpd 15.0000 15.0000

trnqsmod 0. 0.

acnqsmod 0. 0.

rbodymod 1.0000 1.0000

rgatemod 1.0000 1.0000

geomod 1.0000 1.0000

rgeomod 0. 0.

delvto 0. 0.

mulu0 1.0000 1.0000

delk1 0. 0.

delnfct 0. 0.

deltox 0. 0.

sa 0. 0.

sb 0. 0.

sd 0. 0.

saeff 0. 0.

sbeff 0. 0.