Characterization, Clock Tree Synthesis and Power Grid Dimensioning in SiLago Framework
Stockholm, Sweden 2018
ROHIT PRASAD
KTH ROYAL INSTITUTE OF TECHNOLOGY SCHOOL OF INFORMATION AND COMMUNICATION TECHNOLOGY
Abstract
A hardware design methodology or platform is complete only if it can successfully
implement a clock tree, predict power consumption for cases such as the best and
worst Parasitic Interconnect Corners (RC Corners), supply power to every standard cell,
and so on.
This thesis addresses three unsolved engineering problems in SiLago design.
The first is power characterization of a flat design built with the SiLago
methodology. The second is designing a hierarchical clock tree and hardening it inside
the SiLago logic. The third is dimensioning hierarchical power grids. Of these, the clock
tree shows some interesting characteristics, as it is programmable and predictable.
The tools used for digital design are Cadence Innovus, Synopsys Design Vision, and
Mentor Graphics Questasim. These are sophisticated tools that are widely accepted in
industry as well as in academia.
The work done in this thesis has brought the SiLago platform one step closer to
fruition.
Keywords: hardware design, physical design
Sammanfattning (Swedish abstract)
A hardware design methodology or platform is complete if it is capable of
successfully implementing the clock tree, predicting the power consumption for the best
and worst cases of Parasitic Interconnect Corners (RC Corners), supplying power to every
standard cell, etc.
This thesis has attempted to solve the three unsolved technical problems in the SiLago
design. The first is power characterization of the design that was created using the
SiLago method. The second problem is to design a hierarchical clock tree and harden it
inside the SiLago logic. The third problem is to dimension hierarchical power grids. Of
these, the clock tree illustrates some interesting properties, since it is
programmable and predictable.
The tools used for digital design are Cadence Innovus, Synopsys Design Vision
and Mentor Graphics Questasim. These tools are very sophisticated and widely
accepted in industry as well as in academia.
The work in this thesis has made it possible for the SiLago platform to take a step
toward being realized.
Keywords: digital hardware design, physical design
Acknowledgement
I would like to thank my examiner Prof. Ahmed Hemani at School of ICT, KTH, for
the guidance and this opportunity. I would also like to thank my supervisors Syed
Mohammad Asad Hassan Jafri (now at Ericsson, Sweden) and Dimitrios Stathis (pursuing
a Ph.D. at KTH); without them this thesis would have lacked quality results.
Finally, I would like to thank my family for their love and support; without them
this day would not have been possible.
Rohit Prasad
February 2018
List of Figures
1.1 Heads showing how growth rate gap is linked to Computer Architecture.
2.1 DRRA cells connected through interconnects.
2.2 Register File
2.5 Searching of design space in SiLago design methodology.
2.6 Proposed Clock tree scheme.
2.7 Proposed scheme for hierarchical power grids.
3.1 (a) Circuit to demonstrate glitch noise; (b) Simplified circuit
3.2 Power tree building C++ code snippet.
3.3 Total Power Vs. Iterations.
3.4 (a) Total Power activity for Mode 1; (b) Hopping of signals
3.5 Switching Power Vs. Iterations.
3.6 (a) Switching Power activity for Mode 1; (b) Hopping of signals
3.7 Power Distribution for CNN
3.8 Power Distribution for CNN (breakdown)
3.9 Power Distribution for DCT2D
3.10 Power Distribution for DCT2D (breakdown)
3.11 Power Distribution for FR
3.12 Power Distribution for FR (breakdown)
4.1 Three levels of clock trees in SiLago Design.
4.2 5x2 SiLago Wrapper cells abutted.
4.3 Design Vision generated MRB schematic.
4.4 5x2 SiLago Wrapper cells showing clock tree mesh and composed by abutment.
4.5 Clock tree flow of 5x2 Fabric with no space.
4.6 Clock tree flow of 5x2 Fabric with space in-between the blocks.
4.7 Regional Clock tree of 5x2 Fabric with no space.
4.8 Regional Clock tree of 5x2 Fabric with space in between the blocks.
4.9 Clock tree flow of 5x2 SiLago-fied Fabric with no space.
4.10 CAD tool’s CTS information.
4.11 SiLago-fied clock tree information.
5.1 (a) Mesh structure, (b) Interleaved structure, (c) Local tree-based structure.
5.2 Proposed power and ground distribution scheme.
List of Tables
4.3 Comparison of design times.
Acronyms
DiMArch Distributed Memory Architecture
PHY Physical
Language
RTL Register-Transfer Level
HLS High-Level Synthesis
FR Face Recognition
CAD Computer-Aided Design
nm Nanometer
pF Picofarad
Contents
1.3 Organization
2.1 DRRA: Dynamically Reconfigurable Resource Array
2.2 DiMArch: Distributed Memory Architecture
2.3 SiLago Design Methodology
3 Power Characterization
4.1 Introduction
4.2 Process
5.1 Introduction
5.2 Process
6 Challenges
7 Conclusion
Introduction
The electronic systems department at Kungliga Tekniska Högskolan (KTH) has designed
an Application-Specific Integrated Circuit (ASIC) design methodology that is
fundamentally different from traditional standard cell based design. The methodology
allows chips to be designed by abutting multiple micro-architectural components. By
doing so, it promises and provides two orders of magnitude better design productivity.
However, in its present state, it remains to be proven that the operations are only
minimally coupled and that this coupling can be accurately modelled, and the
methodology still needs a simplified clock tree and a way to manage power.
ASIC design is considered an expensive process because of the repeated iterations
between RTL design, logic synthesis and physical synthesis. These iterations occur
because the designer has to refine the ASIC design at a high level, where information
such as location, wire routing and arrangement of hardware cells is lacking. Many
recent works have been proposed to deal with these issues; the SiLago design
methodology is one of them, and it raises the level of abstraction of physical design.
Work has been done to make SiLago accurately predict the costs for area and timing,
but somewhat less work has been done on predicting power. This defines the first task
of this thesis, i.e., power characterization of the SiLago blocks.
Designing a clock tree that is capable of composition by abutment is not simple:
when SiLago blocks equipped with this clock tree are placed side by side, they must
produce synchoros [1] large grain VLSI design objects. The clock tree must be properly
structured and predictable, so that its cost metrics are known beforehand. This task is
critical because the clock tree and the clock tree synthesis scheme play a very
important role both in raising the abstraction level of the SiLago design methodology
and in enabling SiLago blocks to be composed by abutment.
The power grid must also be designed; it must be modular and must satisfy the
requirements of SiLago (described in the next section).
1.0.1 Efficacy Gaps
There are three trends that describe the growth rate gap linked to the complexity of
applications, design technology, VLSI technology and battery technology [1][29][30][31].
Figure 1.1 shows these heads.
Figure 1.1: Heads showing how growth rate gap is linked to Computer Architecture.
1. Architecture Efficacy Gap : This gap arises from inefficient placement of modules
and of the connectivity between them, or of the clock tree. This thesis shows how
the clock tree can be designed efficiently, as an attempt to close this gap using
the SiLago design methodology.
2. Design Productivity Gap : Two main factors affect the time-to-market of a
computer system.
a) Design Time : This can be reduced by minimizing the complexity of the design.
b) Manufacturing Time : Reusing a design or including regularity in the design
will eventually reduce manufacturing time.
These two factors also contribute to the cost of a computer system, i.e., the sum of
the manufacturing and design costs.
3. Battery Capacity Gap : Good computational efficiency in an architecture will
help reduce this gap. A very common practice is to match the granularity; this
can be instruction granularity, bit-width granularity or silicon granularity.
1.0.2 Computer Architectures
In this section, the widely accepted computer architectures currently available are
discussed. These architectures are attempts to bridge the gaps discussed in the
previous section.
1. ASIC : ASICs have few to no granularity mismatches and thus yield high
performance on a low power budget. They are designed to match the instruction and
bit-width granularity of the target domain application while keeping the
silicon granularity mismatch to a minimum. Once the chip is fabricated it cannot
be modified, so the parallelism of the target domain algorithm is exploited
at design time. ASICs therefore lack flexibility and their usage is limited to their
target domain application. This also limits their sustainability, i.e., they cannot
be reused for any application algorithm other than the one they were initially
designed for. ASICs exhibit a low architecture efficacy gap and battery capacity gap
but a high design productivity gap due to high manufacturing and design costs.
2. FPGA : FPGAs are programmable at gate level and thus have very fine instruction
granularity. Their bit-width granularity is 1 bit. Interconnections and basic blocks
in an FPGA can be reconfigured, so instruction granularity or bit-width
granularity can be fine-tuned to the requirements of the target application.
This results in a large reconfiguration overhead, i.e., configuration memory, wires
and switches. For this reason FPGAs perform worse than ASICs. On the other hand,
FPGAs are flexible, so they can be reused, and their basic blocks can work
autonomously, so parallelism can be exploited at either bit-width level or
instruction level. FPGAs have a higher architecture efficacy gap and battery
capacity gap than ASICs, but a smaller design productivity gap.
3. GPP : GPPs are very flexible and can run any application. The data-path of a GPP
is sized for basic logical and arithmetic operations, which results in high
flexibility. Because algorithms are split into basic operations, GPPs exhibit
granularity mismatch and a high number of memory and interconnect operations. This
also results in higher power consumption and lower performance than
ASICs. Owing to the one-time design cost and the lower manufacturing cost of mass
production, GPPs have the smallest design productivity gap and the largest
architecture efficacy gap and battery capacity gap compared with ASICs and FPGAs.
4. CGRA : CGRAs lie somewhere between ASICs and FPGAs with respect to
granularity mismatch. They have an advantage over FPGAs because their
coarse grain data-path results in silicon granularity matching. This also
reduces the number of cells in the design, and hence the wire and routing area
overhead, compared with FPGAs. CGRAs have a lower architecture efficacy gap and
battery capacity gap than FPGAs and a design productivity gap equivalent to
that of FPGAs.
Chapter 2 discusses why an alternative architecture to ASICs and FPGAs was needed.
1.1 Problems
This project attempted to solve the following unsolved engineering problems in
Silicon Large Grain Object (SiLago) design:
1. How to characterize the operations hosted by the SiLago blocks, including coupling
between physically close blocks. In essence, given a space time trace of operations
performed by the SiLago blocks, the characterization model should be able to
predict the average energy consumed within 1-2% accuracy.
2. Design a hierarchical clock tree scheme where the region wide clock is distributed
manually in a structured manner and the clock fed to each SiLago block in the
region is controlled by a programmable delay buffer to keep the skew within the
margin for which the SiLago logic is hardened. Regions will communicate on GALS
basis.
3. Develop a method to dimension the power grids that will feed the SiLago blocks.
This power distribution will once again be hierarchical. The global power nets will
feed the power rings of the blocks and the power rings of the regions will feed the
power rails of the SiLago blocks. Finally, the power rings of the regions of the
SiLago blocks will feed the power rails of the standard cells. Dimensioning them
and absorbing them in the SiLago blocks so that they compose by abutment is the
challenge.
1.2 Goal and Method
This project was a step forward in realization of SiLago design flow. It made the platform
more complete. The main goal of the project was to characterize and build power models
of the SiLago platform, design a hierarchical clock tree scheme and hierarchical power
grids.
Instances designed on the SiLago platform achieve the efficiency of an ASIC with much
less effort, thus reducing the manufacturing cost [1] [2]. The framework is proposed as
an alternative to the general processor/software centered and accelerator prolific
platform based SoCs, because the latter are dominated by infrastructural hardware while
SiLago has functional hardware.
Compared to standard cell based design flows, SiLago adopts two policies to achieve the
above-mentioned non-incremental advances in efficiency and quality of the design:
1. Abstraction of physical design to the micro-architectural / register-transfer level
(RTL). This reduces the design space exponentially and thus lowers the resource
demands on the synthesis tools used at system level.
2. To enable composition by abutment, SiLago adopts the synchoros design style,
which enables quick generation of large scale designs [3].
A hybrid library learning based characterization was used since it is the most efficient
characterization technique known.
1.3 Organization
The rest of this thesis is organized as follows. Chapter 2 lays the background for the
thesis by introducing the SiLago Design Methodology. Chapter 3 starts with an
introduction to power characterization, followed by the steps taken for power
characterization, and then discusses the experiments and results. Chapter 4 begins with
an introduction to the clock tree and expands the problem statement for the clock tree
design scheme; the design process is then discussed, and the chapter ends with a
detailed discussion of the experimental setups and results. Power grid dimensioning is
discussed in chapter 5, which begins with a brief introduction, followed by the design
process, and ends with a brief discussion of the experiments and results. Chapter 6
explains the challenges and problems, and chapter 7 draws the conclusions of this
project. The scripts used for each task are given in the appendix.
SiLago Design Methodology
An alternative to ASICs and FPGAs is needed because of the high design cost
of ASICs and the high area overhead and low computational efficiency of FPGAs. CGRAs
come into play because of their high computational efficiency and lower design cost
and time. CGRAs are well suited to take the place of FPGAs and ASICs for domain
specific applications.
In [4], a survey showed that both industrial and scientific research focus on
multiprocessor and array systems. Many classes of architecture remain unresearched,
and these open up scope for research into new architectures and the development of
their compilation tools. ASICs force researchers to look for an alternative because of
their inflexibility and high design costs, and FPGAs have large reconfiguration
overheads and fine granularity. DRRA has been proposed as an effort to overcome these
shortcomings.
2.1 DRRA: Dynamically Reconfigurable Resource Array
DRRA targets the PHY layer of the OSI model for communication and can be realized as
part of a wireless system or as independent macros in an SoC [4]. DRRA supports all
three kinds of granularity matching discussed above, i.e., instruction granularity
matching, bit-width granularity matching and silicon granularity matching. DRRA cells
are connected to each other by interconnects and employ a three-hop sliding window
communication strategy.
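The sliding window can be pictured with a minimal Python sketch. It assumes that "three-hop" means a cell can reach cells at most three columns away on either side; that reading, the column indexing, and the function names `reachable` and `window` are illustrative assumptions rather than details given in this chapter.

```python
def reachable(src_col: int, dst_col: int, hops: int = 3) -> bool:
    """True if dst_col lies inside the sliding window centred on src_col."""
    return abs(src_col - dst_col) <= hops


def window(src_col: int, n_cols: int, hops: int = 3) -> range:
    """Columns a cell in src_col can reach on a fabric with n_cols columns.

    The window is clipped at the fabric edges (assumed behaviour).
    """
    first = max(0, src_col - hops)
    last = min(n_cols - 1, src_col + hops)
    return range(first, last + 1)


if __name__ == "__main__":
    # A cell in column 1 of an 8-column fabric sees columns 0..4.
    print(list(window(1, 8)))
```

Under this assumption the window moves with the source cell, so every cell sees the same local connectivity regardless of its position, which is what makes the interconnect regular.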
Figure 2.1 shows a basic schematic of DRRA cells connected through interconnects.
A DRRA cell consists of four modules:
1. Register File (RFile) : It provides high bandwidth for parallel data transfer
to the DPU. All data received by the DPU and all data computed in the
DPU are stored in the RFile. This movement of data takes one clock cycle. Figure 2.2
shows a block diagram of the RFile.
Figure 2.1: DRRA cells connected through interconnects.
2. Data Path Unit (DPU) : It includes all of the logical and computational
resources of DRRA. The DPU is divided into four partitions:
a) Pre-processing Unit executes operations like absolute and negation.
b) Logical Unit executes logical operations like OR, AND, shifting, etc.
c) Arithmetical Unit executes operations like signal processing, etc. This unit
supports fixed point and integer operations but not floating point operations.
d) Post-processing Unit executes operations like truncation, etc.
Because the DPU is pipelined, an arithmetic operation takes one clock cycle, except
for multiply and MAC, which take two clock cycles. A local sequencer controls the DPU.
Figure 2.3 shows a diagram of the DPU.
3. Sequencer : It is basically a state machine which controls all DRRA resources.
Each DRRA cell is allocated its own sequencer; because of this allocation the
configuration of DRRA is dynamic in nature. Sequencers can communicate with each
other to synchronize with other resources of DRRA.
Figure 2.2: Register File
4. Switchbox (SWB) : It is placed at the intersection of input and output buses
in the DRRA interconnect network. SWBs are connected to a configuration memory
which determines which output lane connects to which input lane. The SWB uses
tri-state logic to disconnect undriven lanes from the circuit.
2.2 DiMArch: Distributed Memory Architecture
DRRA has a distributed, circuit-switched memory network called DiMArch [5]. DiMArch
needs a single instruction to program a source-destination path [6]. A sequencer
(shown in Figure 2.1) acts as a link between DRRA and DiMArch. The DiMArch
interconnect scheme can be separated into two groups:
1. Data network (dNoC) : dNoC transports data between RFile (in DRRA) and
memory banks (in DiMArch). Both read and write of data between RFile and
dNoC can be performed simultaneously, hence it has full-duplex interconnects.
2. Instruction Network (iNoC) : iNoC is a packet-switched network for transfer of
instructions. AGUs are programmed through this network.
Figure 2.3: Data Path Unit
Both dNoC and iNoC are implemented within Tiles in DiMArch. Tiles in DiMArch are of
two types:
1. SRAM Tile (STile) : It is a block of SRAM memory cells which receives
instructions from the Configuration Tile through iNoC. An STile comprises an
Instruction Switch, a Partition Handler, a Data Switch, SRAM Address Generator Units,
and SRAM.
2. Configuration Tile (ConTile) : ConTiles form a layer of tiles between the STiles
and the DRRA cells. Each ConTile can connect to its horizontally placed
neighbour ConTiles.
Figure 2.4: DiMArch and DRRA
Figure 2.4 shows how the STiles and ConTiles are identified and arranged when DRRA
and DiMArch are placed together.
2.3 SiLago Design Methodology
The SiLago design methodology raises the level of abstraction of physical design from
standard cells (Boolean level) to the micro-architecture level (register transfer level).
This enables the synthesis of hardware from higher levels of abstraction [7]. Cost
metrics are predicted with higher accuracy because the methodology reduces the
abstraction gap and hence improves the ability of synthesis tools used at higher
levels of abstraction (i.e., higher than RTL) to predict cost metrics. It also reduces
synthesis time by reducing the design space to be searched, as illustrated in
Figure 2.5.
Figure 2.5: Searching of design space in SiLago design methodology.
The SiLago design flow eliminates manual fine-tuning (in HLS tools, for example, the
user has to manually define the budget for constraints at algorithm level) and
replaces it with a machine translation, which guarantees correctness by construction.
Thus, functional verification is eliminated.
Accurate prediction of the cost metrics is enabled by accurate characterization of both
the micro-architectural level operations and the interconnects between them.
Thus, constraints verification at system level is eliminated [8].
By virtue of SiLago-fication the abstraction gap is reduced, which can be very
helpful in automating synthesis at SoC level.
To make the SiLago platform more complete, a few additions were immediately needed.
The first is to power characterize the flat design of the fabric. Power
characterization of the block design was already done in [3]; the aim here is to show
that the predictability of the SiLago design methodology still holds when a flat design
is chosen instead of a block design, so that there is coupling between two closely
placed SiLago blocks. In practice, this coupling produces noise in the circuit whenever
a signal crosses these blocks, and the motivation was to predict this noise and the
behavior of the circuit under such circumstances. It therefore became necessary to
record the operations hosted by SiLago blocks, including the coupling between
physically close blocks. The outcome of this experiment enables a designer to predict
the average energy consumption to within 1-2%.
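As an illustration of how such a characterization could be used, the following Python sketch predicts the energy of a trace of operations from a lookup table and compares it with a measured value. The per-operation energies, the coupling term, the neighbour rule (blocks `b` and `b + 1` are adjacent) and all names are hypothetical placeholders, not values or interfaces from this thesis.

```python
# Hypothetical per-operation energies in pJ; real values would come from
# the characterization reports.
OP_ENERGY_PJ = {"add": 1.0, "mac": 2.5, "rfile_read": 0.8}
# Assumed extra energy when two adjacent blocks are active in the same cycle.
COUPLING_PJ = 0.1


def predict_energy(trace):
    """Sum the energy of a space-time trace.

    trace: list of cycles; each cycle maps a block index to an operation name.
    """
    total = 0.0
    for cycle in trace:
        for block, op in cycle.items():
            total += OP_ENERGY_PJ[op]
            # Count each adjacent active pair (block, block + 1) once.
            if block + 1 in cycle:
                total += COUPLING_PJ
    return total


def relative_error(predicted, measured):
    """Relative deviation of the prediction from a measured reference."""
    return abs(predicted - measured) / measured


if __name__ == "__main__":
    trace = [{0: "add", 1: "mac"}, {0: "rfile_read"}]
    predicted = predict_energy(trace)
    print(predicted, relative_error(predicted, 4.45))
```

A prediction meets the stated goal when `relative_error` stays below roughly 0.02 against the sign-off power report for the same trace.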
The second is to design a hierarchical clock tree scheme. Until now, the SiLago fabric
has employed a clock tree designed with the predefined algorithms of the commercial
CAD tools. These algorithms try to reduce the clock skew and slew rate by adding a
number of clock buffers (available in the technology library) to the clock path, while
satisfying the setup and hold times in each block. This results in an irregular number
of clock buffers being added, which defies the whole SiLago concept, as the SiLago
blocks are no longer regular. The clock tree cannot be predicted until it
has been synthesised by Cadence Innovus’ Clock Concurrent Optimization Technology
(ccopt) engine [9]. This problem raises the need for a predictable clock tree
scheme, for which it became necessary to trick the available CAD tool so that it always
produces a predictable clock tree. The workaround is needed because the CAD tool’s
ccopt engine is a black box and the tool owner does not disclose every detail of how
the engine works. The immediate task was to study this engine by running numerous
experiments, recording every minute change in the generated clock tree,
and inferring how the engine works. Using the recorded information, a
clock tree is then designed that forces the tool to produce a predictable clock tree
with a regular number of clock buffers, so that the resulting SiLago blocks are regular
and hence predictable. The proposed task is to design a hierarchical clock tree
in which clock buffers are distributed manually such that the delay is programmable
and costs only a one-time engineering effort at design time. Figure
2.6 shows the proposed scheme for the clock tree.
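The idea of a programmable delay per block can be sketched in Python as follows. The sketch assumes that each buffer adds a delay of `code * step_ps` and that codes are chosen to align every block with the latest-arriving clock; the buffer model, the units and the function names are assumptions for illustration, not the scheme implemented in Chapter 4.

```python
def delay_codes(arrival_ps, step_ps, max_code):
    """Pick one delay code per block so all clocks line up with the latest one.

    arrival_ps: clock arrival time at each block before its programmable buffer.
    step_ps:    delay added per code step (assumed uniform).
    max_code:   largest code the buffer supports.
    """
    target = max(arrival_ps)
    codes = []
    for t in arrival_ps:
        code = round((target - t) / step_ps)  # nearest step to the missing delay
        codes.append(min(code, max_code))     # saturate at the buffer limit
    return codes


def residual_skew(arrival_ps, codes, step_ps):
    """Skew that remains after the programmed delays are applied."""
    tuned = [t + c * step_ps for t, c in zip(arrival_ps, codes)]
    return max(tuned) - min(tuned)


if __name__ == "__main__":
    arrivals = [100, 130, 118]
    codes = delay_codes(arrivals, step_ps=10, max_code=7)
    print(codes, residual_skew(arrivals, codes, step_ps=10))
```

With a uniform step the residual skew is bounded by the step size as long as no code saturates, which is why the granularity of the buffer has to be matched to the skew margin for which the SiLago logic is hardened.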
Figure 2.6: Proposed Clock tree scheme.
The third is to develop a method to dimension the power grids for the SiLago platform.
To make the SiLago blocks regular in this aspect as well, such that the behavior of
each block is predictable (to satisfy the SiLago design statute), the power grids should
be hierarchical and there should not be any significant drop in the supply voltage. To
achieve such an organization of power grids, local power grids should take their input
from global power nets at regular intervals on the die. These global power nets are
placed so that they surround the fabric from the outside, and at every fixed
distance there are horizontal and vertical power rails that feed the local power rings
of the blocks. With such an organization, the power grids are regular and
hence predictable. Figure 2.7 shows the proposed scheme for power grid dimensioning.
Figure 2.7: Proposed scheme for hierarchical power grids.
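A rough feel for the dimensioning problem can be obtained from the Python sketch below. It places stripes at a fixed pitch and estimates the worst-case voltage drop on a rail with the textbook approximation for a uniformly distributed load (I·R/2 when fed from one end, I·R/8 when fed from both ends). The pitch, current and resistance values and the function names are assumptions for illustration; sign-off IR-drop analysis is done in the CAD tool.

```python
def stripe_positions(die_width_um, pitch_um):
    """x-coordinates of vertical power stripes placed at a fixed pitch."""
    n = int(die_width_um // pitch_um)
    return [i * pitch_um for i in range(n + 1)]


def worst_ir_drop(total_current_a, rail_resistance_ohm, fed_from_both_ends=True):
    """Worst-case drop on a rail carrying a uniformly distributed load.

    Textbook approximation: I*R/2 at the far end when fed from one end,
    I*R/8 at the midpoint when fed from both ends.
    """
    factor = 8.0 if fed_from_both_ends else 2.0
    return total_current_a * rail_resistance_ohm / factor


if __name__ == "__main__":
    print(stripe_positions(100, 25))
    # 40 mA drawn along a 2 ohm rail that is fed from both of its ends.
    print(worst_ir_drop(0.04, 2.0))
```

Feeding a rail from both ends reduces the worst-case drop by a factor of four in this model, which is one motivation for surrounding the fabric with global nets and tapping them at regular intervals.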
The most important test of whether a block designed with the SiLago design
methodology actually satisfies the SiLago statute is whether composition by
abutment is possible with each SiLago block. Each task in this thesis strictly follows
this rule and the above-mentioned rules as well. In addition, this project required a
deep understanding of Very-Large-Scale Integration (VLSI) concepts. Cadence SoC
Encounter (now called Innovus) was used for physical design. Synopsys Design Vision
was used for logic synthesis. QuestaSim (ModelSim), NCsim, Virtuoso, MATLAB,
VHDL, C++, TCL, BASH and SystemVerilog were used for scripting, design, and
analysis of the results.
Power Characterization
3.1 Introduction
At a high level, hardware cost prediction is challenging because information
such as placement, wiring and location of hardware blocks is unavailable. Several
works [7], [10], [11] have proposed that raising the abstraction level from
standard cells to coarse grain components increases the accuracy of cost prediction.
Power estimation is more complicated than estimating timing and area because
power varies with the distance a signal travels, the coupling along its path or with
adjacent operations, and the data. In [3], a new framework, CoG, was
proposed to estimate power; it used the block design and achieved
15 times better estimation than state-of-the-art tools. The work in this thesis uses
the same technique as [3] for power estimation, but with a flat design instead of a
block design. This was done to estimate the power when there is coupling between
closely placed SiLago blocks. Below, a simple circuit is used to show
that glitch noise is sufficient to cause functional failures and hence lead to
abnormal power estimates.
Noise in a digital circuit arises while the circuit is operating, for example when
noise propagates from other parts of the circuit or when nearby signals switch.
This affects the behavior and timing of the digital circuit, and this is where the
need for characterization arises.
Using power characterization information, a designer can predict the abnormal
behavior of the circuit under the influence of noise. There are three main noise
effects in a digital design:
1. Functional failure due to a wrong value on the signal.
2. Setup-timing violation due to late arrival of a signal, forcing the chip to run
at a lower frequency than intended.
3. Hold-timing violation due to early arrival of a signal, resulting in fatal failure
of the chip, because in this case the chip cannot run even at a low frequency.
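The difference between the two timing effects can be made concrete with the standard register-to-register slack expressions, sketched here in Python. The parameter names and the numbers in the example are illustrative assumptions; the point is that the setup slack contains the clock period while the hold slack does not.

```python
def setup_slack(t_clk, t_cq, t_logic_max, t_setup, skew=0.0):
    """Setup slack; positive when data arrives early enough.

    skew is the capture clock arrival minus the launch clock arrival.
    """
    return t_clk + skew - (t_cq + t_logic_max + t_setup)


def hold_slack(t_cq, t_logic_min, t_hold, skew=0.0):
    """Hold slack; positive when data stays stable long enough.

    Note that the clock period does not appear in this expression.
    """
    return t_cq + t_logic_min - (t_hold + skew)


if __name__ == "__main__":
    # All times in ns (assumed values).
    print(setup_slack(1.0, 0.1, 0.9, 0.05))  # negative: fails at 1.0 ns
    print(setup_slack(1.2, 0.1, 0.9, 0.05))  # positive: passes at 1.2 ns
    print(hold_slack(0.1, 0.02, 0.15))       # negative at any clock period
```

A negative setup slack can be recovered by lengthening `t_clk`, i.e. by running slower, whereas a negative hold slack is unaffected by the clock period, which is why a hold violation is fatal.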
The most common noise is coupling or cross-talk noise [12], which is also the main
reason for power characterization of the SiLago block. Coupling or cross-talk noise
arises when a rising or falling transition on a noise-causing net (aggressor) is
coupled onto a signal net (victim) via a coupling capacitance [13] [14]. Usually, when
a quiet victim is affected by coupling noise, the noise is observed as a spike or
glitch (glitch noise).
To show that glitch noise is sufficient to cause a functional failure [15] [16] [17],
consider a simple circuit where the aggressor is a rising buffer and the victim is an
inverter, coupled through the coupling capacitor (Ccp). Figure 3.1(a) shows the
circuit.
Figure 3.1: (a) Circuit to demonstrate glitch noise; (b) Simplified circuit
To obtain a quantitative view of the problem, consider the following three assumptions,
which simplify the circuit (Figure 3.1 (b)):
1. Treat the circuit as lumped capacitances and ignore the victim and aggressor
resistance;
;
3. The aggressor is modelled as an ideal voltage source with a saturated-ramp
waveform a(t).
To analyze the voltage response at the victim node, consider the differential
equation obtained using Kirchhoff's current law:
$$C_t \frac{dv}{dt} + \frac{v}{r} = C_{cp} \frac{da}{dt} \tag{3.1}$$
$C_t = C_g + C_{cp}$ is the total capacitance and $v(t)$ is the response on the victim.
With the initial condition $v(0) = 0$, the solution is Eq3.2:
$$v(t) = \frac{C_{cp}}{C_t} \int_0^t e^{-(t-x)/\tau}\, a'(x)\, dx \tag{3.2}$$
$\tau = r C_t$ is the victim time constant.
From Eq3.2 we can derive that if $a(t)$ (aggressor transition) is constant before and
after the transition, $v(t)$ (response) will be a glitch, i.e., it is zero before the
transition and returns to zero after it.
When $T \gg \tau$, where $T$ is the transition time of the aggressor,
$$v(t) = r C_{cp}\, a'(t) + O\!\left(\frac{\tau}{T}\right) \tag{3.3}$$
$v(t)$ is directly proportional to $C_{cp}$, $r$ and $a'(t)$. Since $C_{cp} < C_t$, the
magnitude of the glitch is limited by $V_{dd}\,\tau/T$, which means the glitch will be
small when the transition of the aggressor is slow.
When $T \ll \tau$,
$$v(t) = \frac{C_{cp}}{C_t}\, a(t) + O\!\left(\frac{T}{\tau}\right) \tag{3.4}$$
$v(t)$ is $a(t)$ scaled by $C_{cp}/C_t$ (attenuation factor), and the magnitude of the
glitch is limited by $V_{dd} C_{cp}/C_t$; this makes the initial shape of $v(t)$
(victim response) follow the shape of $a(t)$ (aggressor transition).
Peak or maximum vpeak, is the most important characteristic of glitch noise. Let us
look into the cases where we derive vpeak using parameters of the circuit.
When a(t) (aggressor transition) is a rising saturated linear ramp,
a(t) =
Vdd , t ≥ T
v(t) =
vpeake (−t τ ) , t ≥ T
(3.6)
At t = T , vpeak is maximum i.e., glitch is maximum and the equation obtained is
vpeak = ( CcpVdd Ct
)f( T
τ ) (3.7)
CcpVdd/Ct is the electrical property of the circuit and f(Tτ ) is the nonlinear function of
times.
vpeak is directly proportional to r, Vdd and Ccp/Ct, and inversely proportional to T
(the aggressor's transition time). Nets with low drive strength, i.e., high driver holding
resistance (r), and high Ccp (coupling capacitance) are the ones most vulnerable to
crosstalk. It also follows that if the aggressor switches fast, i.e., has a small
transition time, the glitch noise will be worse.
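As a rough numerical illustration of Eq. 3.7, the sketch below assumes the closed form f(x) = (1 - e^(-x))/x with x = T/τ. This is what Eq. 3.1 gives for a linear ramp a(t) = Vdd·t/T with v(0) = 0, but both the closed form and all component values are assumptions made for illustration; they are not values from this thesis.

```python
import math

def glitch_peak(c_cp, c_g, r, vdd, t_rise):
    """Peak crosstalk glitch on the victim for a linear-ramp aggressor.

    Assumes f(x) = (1 - exp(-x)) / x with x = T / tau (an assumption,
    obtained by solving Eq. 3.1 with v(0) = 0).
    """
    c_t = c_g + c_cp            # total victim capacitance, Ct = Cg + Ccp
    tau = r * c_t               # victim time constant
    x = t_rise / tau
    f = (1.0 - math.exp(-x)) / x
    return (c_cp * vdd / c_t) * f

# Hypothetical values: 20 fF coupling, 80 fF to ground, 1 kOhm holding resistance.
fast = glitch_peak(20e-15, 80e-15, 1e3, 1.1, 10e-12)   # fast aggressor edge
slow = glitch_peak(20e-15, 80e-15, 1e3, 1.1, 1e-9)     # slow aggressor edge
print(f"fast edge: {fast:.3f} V, slow edge: {slow:.3f} V")
```

With these made-up numbers the fast edge produces a glitch close to the Vdd·Ccp/Ct bound, while the slow edge produces a much smaller one, in line with the discussion above.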
Eq. 3.7 and Eq. 3.8 are used in programs for crosstalk analysis. With the help of these
two equations one can eliminate the low-risk nets early and save time and resources.
In the later sections, the power estimation process is briefly discussed, an attempt is
made to explain the obtained results, and it is also shown how a designer can benefit
from these power estimations.
3.2 Process
This section describes the steps taken for power characterization. It followed
the same steps as CoG [3]; the only alteration was in the SiLago design, i.e., instead
of a block design, a flat design was used.
The process of power characterization can be summed up in four simple steps.
1. Generate random test cases (.m files) using any programming language capable of
file handling (Python or C++ preferred);
2. Feed those test cases (.m files) into Vesyla (the new version is Algosil) to obtain the
testbenches (.vhdl (Very High Speed Integrated Circuit Hardware Description
Language) files) along with assembly code files;
3. Simulate the SiLago fabric in either Questasim or NCSim using those testbenches and
generate .vcd (Value Change Dump) files;
4. Using the generated .vcd files, generate power reports from Innovus.
Using the testbenches, assembly codes and power reports, one can determine the power
distribution for simple to complex algorithms run on the SiLago fabric without
actually running the experiments, thus saving time and resources.
In the next section, experiment setup and results are discussed.
3.3 Experiments and Results
In this section, the experimental setup and the results achieved with the power
characterization in the SiLago framework are discussed.
Standard sign-off quality post-layout data was used for power characterization. Both
inter-cell and intra-cell characterization were done, as was done for CoG [3].
Intra-cell characterization provides a complete characterization of a single cell, as energy
values for all modes and connections are calculated during this process. Inter-cell
characterization provides information about the cross-coupling between DPUs, which is
one of the main motivations for performing power characterization in this thesis.
Figure 3.2: Power tree building C++ code snippet.
Figure 3.2 shows a C++ code snippet that covers every input for each SiLago cell.
Its last line re-randomizes each combination for the next 100 iterations, so that no
possible combination is left out. This code also builds a power tree for inter-cell
characterization, which in turn resulted in a massive demand for data storage space.
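The idea described above can be sketched in Python as follows. This is not the C++ code of Figure 3.2; the function name, the tuple layout (mode, input 1 source, input 2 source, output destination) and the small fabric size are assumptions chosen only to show exhaustive enumeration followed by repeated reshuffling.

```python
import itertools
import random

MODES = ("add", "mul", "mac")   # the three DPU modes used in this section

def build_cases(num_dpus, rounds, seed=0):
    """Enumerate every (mode, in1, in2, out) combination, reshuffled per round."""
    rng = random.Random(seed)
    combos = list(itertools.product(MODES, range(num_dpus),
                                    range(num_dpus), range(num_dpus)))
    cases = []
    for _ in range(rounds):
        shuffled = combos[:]        # every round contains all combinations
        rng.shuffle(shuffled)       # but in a different random order
        cases.extend(shuffled)
    return cases

# Hypothetical small fabric: 4 DPUs, 3 rounds (the text uses 100 rounds).
cases = build_cases(num_dpus=4, rounds=3)
print(len(cases))
```

Because every round repeats the full set of combinations, the number of test cases grows linearly with the number of rounds and with the cube of the number of DPUs, which is consistent with the storage demand mentioned above.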
Figure 3.3 shows a graph of total power (mW) vs. iterations.
Mode 1 : Addition
Mode 2 : Multiplication
Mode 3 : MAC (Accumulator is initialized with an additional signal)
Figure 3.3 is somewhat irregular, but it shows that each mode followed a comparable
pattern, which indicates that the modes and connections are independent in terms
of energy consumption.
Figure 3.3: Total Power Vs. Iterations.
Figure 3.4 (a) shows the Total Power activity for Mode 1 and Figure 3.4 (b) shows
the hops, i.e., which DPU the inputs and outputs originate from. In Figure 3.4 (b),
Magenta is Output, Green is Input 1 and Red is Input 2.
Figure 3.5 shows the switching activity vs. iterations. It can be observed that the
switching activity for Mode 1, i.e., addition, is low compared to Mode 2 and Mode 3,
which both involve multiplication.
Figure 3.6 (a) shows the Switching Power activity for Mode 1 and Figure 3.6 (b) shows
the hops, i.e., which DPU the inputs and outputs originate from. In Figure 3.6 (b),
Magenta is Output, Green is Input 1 and Red is Input 2.
From Figure 3.5 we also conclude that adding a constant to the curve of one mode makes
it line up with the curves of the other modes. The equation [3] that can be deduced is
Power (modeX) = Power (modeY) + constant
As the power consumption is additive in nature, analyzing any single mode is sufficient.
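A minimal sketch of how the additive relation could be used: estimate the constant as the mean difference between two mode traces and then predict one mode from the other. The traces below are made-up numbers, not measured data from the experiments, and the function names are hypothetical.

```python
from statistics import mean

def mode_offset(power_x, power_y):
    """Estimate the constant in Power(modeX) = Power(modeY) + constant."""
    return mean(x - y for x, y in zip(power_x, power_y))

def predict_mode(power_y, constant):
    """Predict the trace of modeX from the trace of modeY."""
    return [y + constant for y in power_y]

# Made-up power traces in mW, one sample per iteration.
mode1 = [5.0, 5.4, 5.1, 6.0]
mode2 = [6.2, 6.6, 6.3, 7.2]

constant = mode_offset(mode2, mode1)
estimate = predict_mode(mode1, constant)
print(round(constant, 3), [round(p, 3) for p in estimate])
```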
It is assumed that a register file provides all the inputs. Figure 3.4 reveals three
patterns:
1. High power consumption is seen when either or both of the two inputs are connected
to a register of another cell and the output goes to the same register.
2. Low energy consumption is seen when either of the inputs and the output are connected
to the source cell (where the DPU is operating).
3. Constant power consumption is seen when the inputs and the output are neither
connected to the same register nor connected to the source.
Figure 3.4: (a) Total Power activity for Mode 1 ; (b) Hopping of signals
Spikes (especially those seen in Figure 3.3 and Figure 3.6) are due to the following two
reasons:
1. Fast switching in the circuit, i.e., a small transition time.
2. a(t) is constant before and after the transition, so the response will be a glitch.
As the response is directly proportional to the coupling capacitance and inversely
proportional to the transition time, the glitch will be small when the transition is slow.
It can also be observed that between iterations 500 and 600 there is an abnormally large
spike. This abnormal behavior is due to the fact that, in the design, there was a high
concentration of functional wires around DPU 5, which resulted in high coupling
capacitance for the circuit around DPU 5. Input1 and Output are initialized within DPU 5,
but Input2 has to travel several blocks before it reaches DPU 5, so timing is tight in
order to avoid setup violations, which creates voltage deviations. It can be observed
from Figure 3.6 (a) that between iterations 500 and 600 there is large switching activity,
and the response to that is observed in Figure 3.3 in the form of an abnormally large spike.
* However, the argument above needs to be verified by running other experiments before
drawing any firm conclusion.
Figure 3.5: Switching Power Vs. Iterations.
Figure 3.6: (a) Switching Power activity for Mode 1 ; (b) Hopping of signals
To demonstrate the usefulness of the power characterization, the power consumption
was estimated for three applications: Convolutional Neural Network (CNN),
Discrete Cosine Transform - Two Dimensional (DCT2D) and Face Recognition (FR).
It can be observed that the results give nearly correct estimations (this conclusion is
based on the experiments done in [3]) without actually performing the experiments.
Figures 3.7 to 3.12 show the results for power distribution.
Figure 3.8: Power Distribution for CNN (breakdown)
Figure 3.10: Power Distribution for DCT2D (breakdown)
Figure 3.12: Power Distribution for FR (breakdown)
4.1 Introduction
Clock design is one of the most challenging tasks in digital design, in which a designer
has to distribute clock signals throughout a chip. The designer also has to be aware of
the resources when minimizing factors like power, skew, variation and jitter [18] [19].
The clock period is the duration of one repetition of the low-and-high pattern of the
clock signal. Circuit frequency is inversely proportional to the clock period. The time
taken by the clock signal to propagate through the (clock) tree to the sinks is the
(insertion) delay. The sinks, where the clock signal is received, are either clock pins
(sequential) or clock buffer inputs (hierarchical). Clock signal delays are mostly
measured at 50% of the supply voltage. In very few cases, which occur in edge-triggered
systems, the delay is determined by the inverter's switching threshold [20].
The motivation behind designing this clock tree was to achieve a fixed-skew clock network
[20] [21] [22] [23] with predictable buffer delay insertion. It was observed that the
CAD generated clock tree introduced random clock buffers in the clock path to produce
buffered nets with the smallest delay. Almost all buffering techniques use the van
Ginneken dynamic programming algorithm for buffer insertion and sizing [24], and the
delay model used is the Elmore delay model [25] [26] [27]. The three main steps of this
algorithm are:
1. Buffer addition in O(n) time;
2. Wire addition in O(n) time;
3. In O(n1+n2) time, two branches are merged, where the number of buffer positions
in two branches are represented by n1 and n2.
Thus, this algorithm has a time complexity of O(n^2) [28].
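The linear-time merge in step 3 can be sketched as below. This is an illustrative reading of the merge step, not code from any CAD tool: each branch is assumed to provide a list of (capacitance, required time) candidates in which both values increase, and the function name is hypothetical.

```python
def merge_branches(a, b):
    """Merge the candidate lists of two branches in O(n1 + n2).

    Each list holds (capacitance, required_time) pairs sorted so that both
    values increase, i.e. no candidate dominates another.
    """
    merged = []
    i = j = 0
    while i < len(a) and j < len(b):
        cap = a[i][0] + b[j][0]          # loads of both branches add up
        req = min(a[i][1], b[j][1])      # the tighter branch sets the timing
        merged.append((cap, req))
        # Advance the branch that limits the required time: pairing it with
        # larger candidates of the other branch only adds load.
        if a[i][1] < b[j][1]:
            i += 1
        elif b[j][1] < a[i][1]:
            j += 1
        else:
            i += 1
            j += 1
    return merged

left = [(1.0, 1.0), (2.0, 3.0)]
right = [(1.0, 2.0), (3.0, 4.0)]
print(merge_branches(left, right))
```

Each loop iteration advances at least one of the two indices, so the loop runs at most n1 + n2 times, which is where the O(n1 + n2) bound comes from.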
The proposed regional clock tree synthesis scheme generates the clock tree by the abutment
of identical and synchronous [1] SiLago blocks. This clock tree is not ad hoc but
structurally parameterized, so that the cost metrics can be predicted with certainty.
The principles of the SiLago method have been presented previously in [1], but how this
abutment results in a valid clock tree was not worked out there; it is elaborated in this
part of the thesis.
For the sake of SiLago-fication, the presented clock tree synthesis scheme conforms to
raising the level of abstraction of physical design to the micro-architecture level. The
clock tree has been designed to insert minimum delay. For a designer, the time complexity
is O(1), because the designer has to take a single design-time decision, i.e., how many
of the already placed buffers in the programmable buffer delay block should be included
in the buffer chain, and this always takes constant time. This reduces the design time
and engineering effort.
Figure 4.1: Three levels of clock trees in SiLago Design.
The requirements imposed by the SiLago methodology on clock tree synthesis are explained
with the help of Figure 4.1, which visualizes three levels of hierarchy. In the figure, a
SiLago SoC is shown with different region instances in different color codes. Instances
of SiLago blocks, which are specific to each region, act as leaf nodes in the hierarchy.
These instances are automatically synthesized in the SiLago design flow such that the type,
number, relative position and composition of region instances are optimally matched to
their constraints and functional requirements.
Chapter 4 Clock Tree Synthesis
Besides the three levels of design hierarchy, Figure 4.1 also shows three levels of clock
trees, i.e., local, regional and global. The local clock tree is auto-generated with the
help of a commercial EDA tool, and the global clock tree is derived from the PLL/CGU. This
part of the thesis focuses on the regional clock tree.
To adopt composition by abutment, the regional clock tree must satisfy these two
requirements imposed by SiLago:
1. The cost metrics, i.e., latency and energy, of the SiLago blocks must be uniform and
identical, and should not be affected by the position in the SiLago design instance.
This property is required to keep the engineering effort scalable, i.e., a one-time
engineering effort. This property also keeps the design regular from both a physical
and an architectural point of view.
2. It should be possible to create valid VLSI designs of SiLago design instances of
arbitrary size by composition by abutment with valid neighbours. This means that no
further engineering effort should be needed for implementing the SiLago blocks
beyond what has been done already. This also means that the design should be
timing clean, should have signal integrity, no IR drop violations, etc. In other
words, the aggregation of the parts of the clock in the SiLago blocks should appear as
a valid regional clock that does not violate timing, because the clock tree will itself
balance the skew and maintain the edge.
An immediate reason to propose a new clock tree scheme was that the clock tree
synthesis of the commercial EDA tool violates the two requirements mentioned above.
The results have been verified with Static Timing Analysis (STA), and a comparison
against a functionally equivalent clock tree has also been done. As the synthesized
clock tree design is correct by construction, no further verification is required.
To verify the delay incurred by clock nets, the RC delay [29] can be calculated using
information from the .lib (Liberty) file and the .lef (Library Exchange Format) file.
The .lib file contains information about the rise and fall times and transitions for a
particular standard cell in the library. It also contains information about power,
resistance and capacitance for that particular standard cell.
The .lef file contains information about the metal layers.
RPERSQ is the resistance of a wire, in ohms per square.
The resistance of a length of wire is
RPERSQ ∗ length_wire / width_wire (4.1)
CPERSQDIST is the capacitance for each square unit, in pF per square micron (wire
to ground capacitance).
EDGECAPACITANCE specifies a floating-point value of peripheral capacitance, in
pF per micron. The place-and-route tool uses this value in two situations:
1. To estimate capacitance before routing.
2. To calculate segment capacitance after routing.
For the second case, the tool uses the value only if the layer thickness or height is set
to zero. The formula used in this case to calculate segment capacitance is
C = (CPERSQDIST ∗ w ∗ l) + (EDGECAPACITANCE ∗ 2 ∗ (w + l)) (4.2)
where w and l are the width and length of the wire segment.
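Equations 4.1 and 4.2 translate directly into two small helper functions. The sketch below is only illustrative: the function names are hypothetical and the layer values are invented for the example, not read from the LEF file of any real technology.

```python
def wire_resistance(rpersq, length, width):
    """Eq. 4.1: RPERSQ * length / width, in ohms."""
    return rpersq * length / width

def segment_capacitance(cpersqdist, edgecap, width, length):
    """Eq. 4.2: area term plus edge term, in pF (dimensions in microns)."""
    area_term = cpersqdist * width * length
    edge_term = edgecap * 2 * (width + length)
    return area_term + edge_term

# Invented layer values for a 100 micron long, 0.1 micron wide wire.
r = wire_resistance(rpersq=0.1, length=100.0, width=0.1)
c = segment_capacitance(cpersqdist=2e-5, edgecap=4e-5, width=0.1, length=100.0)
print(f"R = {r:.1f} ohm, C = {c:.6f} pF")
```

For a long, narrow wire like this one the edge term dominates the area term, which is why the peripheral capacitance cannot be neglected.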
4.2 Process
There are two constraints for synthesizing the regional clock tree. First, to maximize the
percentage of the clock period usable by combinatorial logic, the clock skew must be
minimal. This condition arises when clock and data propagate in the same direction.
Second, the drive strength must be maintained such that the slew rate design rule of the
technology is not violated.
To design the proposed clock tree, a pre-placed and pre-routed SiLago block (with
local clock tree) was taken as the starting point for further modifications. Then, a
programmable buffer delay block called the MRB (Mux-Register-Buffer) block was created
separately. The MRB block includes 16 clock buffers (the largest clock buffer in the TCBN
40nm library; area 11.6424 square microns, I pin capacitance 0.003259 pF, Z pin maximum
capacitance 0.6043 pF, positive unateness), one MUX and 4 registers for selecting the
number of buffers. This MRB block is placed on top of the SiLago block, and a new
top module was created, called the SiLago Wrapper block. The SiLago Wrapper
block has two clock inputs and two clock outputs. The motivation behind this was that
when such SiLago Wrapper blocks are abutted, the clock signal adds as few functional
wires as possible to the design, apart from what has been implemented and hardened
already. The focus was therefore on the regional clock network.
For simplicity, the current buffer chain was designed to include 10 buffers before the
clock signal is fed into the SiLago block. This arrangement can be modified by the
designer at design time according to the design requirements. However, for the sake of
better rise and fall times, and since each SiLago Wrapper regional clock network could
only see the very next SiLago Wrapper as a load and not the whole design, a restriction
was imposed that the clock network should include at least one clock buffer in the MRB
block before propagating to the next SiLago Wrapper or SiLago block. Figure 4.2 shows a
schematic of such SiLago Wrapper blocks abutted.
The SiLago Wrapper was designed with a pre-placed and pre-routed SiLago block, which has
its own local clock network, such that the entire block is fed with the clock signal from
a single clock input pin. For the MRB block, the largest clock buffer in the TCBN 40nm
library was chosen, for reasons of simplicity, least overhead and better skew.
Figure 4.3 shows the schematic of the MRB block. The circuit was designed in such
a way that a designer (using an Elmore delay model) can include as many of the available
buffers as needed just by changing the hex value of the RegIn signal, which is set at
design time.
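A behavioural sketch of the selection described above is given below. The encoding is an assumption made for illustration (RegIn = n selects n + 1 buffers); the actual mapping between the RegIn hex value and the buffer count is not given in the text. The 0.02 ns per-buffer delay is the value reported later in this chapter, and wire delay is ignored.

```python
BUFFERS_IN_MRB = 16   # clock buffers available in one MRB block

def buffers_selected(reg_in):
    """Map a 4-bit RegIn value to the number of buffers in the clock path.

    Assumed encoding: RegIn = n selects n + 1 buffers, so at least one
    buffer is always in the path, as the restriction above requires.
    """
    if not 0 <= reg_in < BUFFERS_IN_MRB:
        raise ValueError("RegIn must fit in 4 bits")
    return reg_in + 1

def mrb_delay_ns(reg_in, buffer_delay_ns=0.02):
    """Delay added by the selected buffer chain, ignoring wire delay."""
    return buffers_selected(reg_in) * buffer_delay_ns

# Under the assumed encoding, RegIn = 0x9 gives the 10-buffer chain used here.
print(buffers_selected(0x9), round(mrb_delay_ns(0x9), 3))
```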
Logical synthesis of the MRB block was done to obtain the netlist; the physically
synthesized netlist and the lef file with metal layer information were then generated
from it. To create the SiLago Wrapper, wires were manually connected between the MRB
block and the SiLago block by editing the logically synthesized netlist. A single clock
signal was passed from MRB to SiLago. Using the SiLago Wrapper's netlist, the lef
information of the MRB and SiLago blocks, and the TCBN 40nm libraries, a new hardened
SiLago Wrapper block was created.
Figure 4.2: 5x2 SiLago Wrapper cells abutted.
Figure 4.3: Design Vision generated MRB schematic.
To verify that the new block was capable of composition by abutment, a new fabric was
designed which arranged SiLago Wrappers in two rows and five columns. To prevent the
CAD tools from performing any further optimization at any stage of synthesis, hardened
SiLago Wrapper blocks were used in the design. After the design was placed by the CAD
tool, the SiLago Wrapper's netlist was loaded into the black boxes. Figure 4.4 shows the
clock network of the fabric and the successful composition by abutment of the blocks.
Figure 4.4: 5x2 SiLago Wrapper cells showing clock tree mesh and composed by
abutment.
To put this in a more straightforward way, the synthesis scripts are described below
(see Appendix A.2 for a minimal script for physical synthesis), with a focus
on the important steps to be followed carefully while designing the clock tree in the
SiLago Framework.
The files generated after logical synthesis are not sufficient on their own to begin the
clock tree design in the SiLago Framework. A few changes must be made, some of which are
listed below:
1. Edit the sdc (Synopsys Design Constraints) file to allow the physical synthesis tool
to propagate the clock throughout the chip. Otherwise, the tool will treat only a very
small portion of the clock wire (the wire between the clock pin on the fabric and the
first sinks at the clock pins on the SiLago blocks) as clock and treat the other
connected clock wires as signal. To fix this issue, add
set_propagated_clock [all_clocks] to the sdc file.
2. False paths can also be declared by editing the sdc file, using set_false_path -through
[get_nets h_bus*] ; set_false_path -from [get_ports rst_n]
The first command declares all paths through the nets matching h_bus* as false, and the
latter declares all paths starting from the rst_n port as false.
MMMC (Multi-Mode Multi-Corner) must be defined at the beginning of physical
synthesis. Three cases are often defined for MMMC analysis: typical, worst and best.
Specifically, the worst case is used to check the maximum delay (setup violations) and
the best case is used to check the minimum delay (hold violations).
Before running any processes, the design was declared as unique (in the SiLago Framework)
with set init_design_uniquify 1, which allowed placing clone blocks along with a master
block on the floor-plan. Declaring the design process mode was another important step;
if it is not specified, the tool uses the default design process mode of 90nm. The 40
nm design process mode was defined with setDesignMode -process 40. The design was
initialized with init_design and floor planning was done. Floor planning provided
early feedback on whether the initialized design would fit on the die, and it also
provided an estimate of the congestion and delay caused by functional wires.
After floor-planning, pin placement was done and then the black boxes were placed, which
provided information on whether there were issues with routing, heat distribution,
performance or power consumption. After the design was placed, assembleDesign loaded
the netlist along with timing and metal layer information. As the clock tree was
connected already, the composition by abutment can be observed at this step, where every
signal is propagated properly.
Finally, post-route timing analysis was run to check for violations.
This section describes the experiments and the results achieved with the clock tree
designed in the SiLago framework. Out of several experiments, only the two most important
ones are described here. These two experiments are sufficient to support the reasoning
behind SiLago-fication of the clock tree and why the clock tree needs to be included
inside the SiLago blocks.
The two experiments are as follows:
1. Designing the fabric with 5x2 SiLago blocks with clock tree generated by CAD
tool (Cadence Innovus) with no space between the SiLago blocks.
2. Designing the fabric with 5x2 SiLago blocks with clock tree generated by CAD
tool (Cadence Innovus) with 2 microns of horizontal and vertical space introduced
between the SiLago blocks.
Table 4.1 shows the comparison of capacitance between the CAD tool generated clock
net and the SiLago Framework generated clock net. It was observed that the total net
capacitance was lower in the SiLago Framework due to its regularity, as no extra wires
or logic were introduced while abutting the SiLago blocks. The clock net capacitance was
observed to be slightly higher because every block contains 16 clock buffers. At design
time, however, the designer will activate only the few buffers that are required, and the
clock net capacitance will decrease accordingly.
With this theory in mind, a short experiment was also done to check how many clock
buffers would be required to place 400 such SiLago blocks, with minimum delay kept
between the blocks and with an accuracy of 2% or below.
1st cell arrival time = 0.055 ns
400th cell arrival time = 1.103 ns
Each buffer in delay line (as seen in Innovus delay line) = 0.02 ns
If 52 buffers are placed = 0.055 + (0.02 * 52) = 1.095 ns
% accuracy = 100 * (1.103 - 1.095) / 5 = 0.16%
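The arithmetic above can be reproduced with a few lines of Python. The helper names are hypothetical; the divisor 5 is taken to be the clock period in ns, which is an interpretation of the formula above rather than something stated next to it.

```python
def buffers_needed(first_arrival_ns, target_arrival_ns, buffer_delay_ns):
    """Largest buffer count whose accumulated delay does not exceed the target."""
    ratio = (target_arrival_ns - first_arrival_ns) / buffer_delay_ns
    return int(ratio + 1e-9)    # small guard against floating-point round-off

def deviation_percent(target_ns, predicted_ns, period_ns):
    """Gap between target and predicted arrival, relative to the clock period."""
    return 100.0 * (target_ns - predicted_ns) / period_ns

n = buffers_needed(0.055, 1.103, 0.02)
predicted = 0.055 + 0.02 * n
deviation = deviation_percent(1.103, predicted, 5.0)
print(n, round(predicted, 3), round(deviation, 2))
```

Since the buffer delay is fixed, this calculation is a single division, which illustrates why the design-time decision takes constant time.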
                      Total net capacitance   Clock net capacitance
CAD tool generated    1.35066e-09 F           2.53E-11 F
SiLago Framework      1.24843e-09 F           2.58E-11 F
Table 4.1: Comparison of capacitance
Table 4.2 shows the comparison of the internal, switching, leakage and total power
between the CAD tool generated clock net and the SiLago Framework generated clock net.
From Table 4.2 it was observed that the switching power of the clock net designed in the
SiLago framework has decreased significantly and, as a result, the total power is also
reduced. This indicates that the SiLago Framework will be a good alternative for designers.
                      Internal   Switching   Leakage     Total
SiLago Framework      1.806      3.476       0.0006122   5.283
Table 4.2: Comparison of power
To support this statement, an experiment was run to see how much time is consumed in
placeDesign and Clock Tree Synthesis (CTS) by the CAD tool and by the SiLago Framework.
Table 4.3 shows the observed times. Using such results a designer can extrapolate the
values and estimate the time required to design a fabric with 100 or more blocks.
                      5x2 blocks    10x2 blocks
SiLago Framework      0:00:59 hrs   0:02:45 hrs
Table 4.3: Comparison of design times.
Below is a series of figures which illustrates the irregularities in the design when the
CAD tool is used for the same fabric with very small variations.
Figure 4.5 shows the clock tree flow of the SiLago Wrapper block. This clock tree has
regular load capacitance and predictable behaviour. Figure 4.7 shows the physical layout
of the hardened blocks and Figure 4.9 shows the clock tree flow, in which the vertical
clock buffer chains are the programmable clock buffers and the horizontal clock buffer
chains are the buffers from the local clock tree. Each branch represents one SiLago
block. Figure 4.6, in contrast, shows the irregularities introduced by the CAD tool, and
Figure 4.8 is the physical layout used.
Figure 4.5: Clock tree flow of 5x2 Fabric with no space.
Figure 4.6: Clock tree flow of 5x2 Fabric with space in-between the blocks.
Figure 4.7: Regional Clock tree of 5x2 Fabric with no space.
Figure 4.8: Regional Clock tree of 5x2 Fabric with space in between the blocks.
Figures 4.10 and 4.11 contain the clock tree information of the CAD tool generated clock
tree and the SiLago-fied clock tree, respectively (both tables were generated using the
CAD tool's CTS engine). These two tables contain information about the clock tree's time
increment, arrival time, transition time, capacitance, and distance. It was observed that
the CAD tool generated clock tree includes random (in size and count) clock buffers in its
path to reach the local clock tree of each SiLago block, whereas the SiLago-fied clock
tree includes regular (in size) and pre-calculated numbers of clock buffers. This helps in
pre-calculating the capacitance, and hence the time increment, arrival time, and
transition time of the SiLago-fied clock tree, using equations 4.1 and 4.2, forming an
RC-π model of the clock tree, and calculating with the Elmore delay model.
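The Elmore delay calculation on an RC-π model mentioned above can be sketched as follows for a single driver and a chain of wire segments. This is a generic textbook formulation, not the Matlab code used in the thesis, and the numbers in the example are invented.

```python
def elmore_delay(r_driver, segments, c_load):
    """Elmore delay of a driver followed by a chain of RC-pi wire segments.

    segments: list of (R, C) per wire segment; in the pi model each
    segment's capacitance is split in half onto its two end nodes.
    """
    # Lump capacitance onto the nodes; node 0 is the driver output.
    node_caps = [0.0] * (len(segments) + 1)
    for k, (_, c) in enumerate(segments):
        node_caps[k] += c / 2.0
        node_caps[k + 1] += c / 2.0
    node_caps[-1] += c_load

    # Sum over nodes: capacitance times the resistance from source to node.
    delay = 0.0
    upstream_r = r_driver
    for k, cap in enumerate(node_caps):
        if k > 0:
            upstream_r += segments[k - 1][0]
        delay += upstream_r * cap
    return delay

# Invented values: 100 ohm driver, two 50 ohm / 2 fF segments, 3 fF load.
d = elmore_delay(100.0, [(50.0, 2e-15), (50.0, 2e-15)], 3e-15)
print(f"{d * 1e12:.2f} ps")
```

With a regular, pre-calculated number of identical buffers, the segment list is the same for every SiLago block, so the delay can be computed once and reused.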
An experiment was conducted to predict the above-mentioned parameters and compare
them against the table generated by the CAD tool. One of the main reasons behind this
experiment was to find out how much clock skew there was at each sink to the SiLago
block, so that a designer making a design choice for the SiLago-fied clock tree can
easily choose the correct number of clock buffers. Including the correct number of clock
buffers may optimize the drive strength of the clock signal, so that the clock tree can
drive the maximum number of SiLago blocks.
A small calculation of this kind, showing the usefulness of our prediction scheme, is as
follows (clock tree information was extracted from the technology files and Matlab was
used for the calculations):
It was known that M3 (Metal Layer 3) was used for clock routing, and the other required
information was available from the technology files to calculate the clock arrival at
each SiLago block. The wire delay (RC) induced by 100 µm of wire is calculated as
follows:
R = (RPERSQ * Length) / Width = 397.142857 x 10^3 Ω
C = (CPERSQDIST * Length * Width) + (EDGECAPACITANCE * 2 * (Length +
Width)) = 0.01461024408 x 10^-18 F
RC = 5.802 x 10^-15 s (or 5.802 fs)
The wire delay obtained from the CAD tool was found to be about 7 fs.
The difference in the RC values arises because the calculation assumed ideal conditions.
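The product of the R and C values quoted above can be checked quickly, together with the relative gap to the CAD tool value. The constant names are hypothetical; the numbers are the ones given in the text.

```python
R_OHMS = 397.142857e3           # wire resistance quoted above
C_FARADS = 0.01461024408e-18    # wire capacitance quoted above
CAD_DELAY_S = 7e-15             # approximate CAD tool value quoted above

rc_seconds = R_OHMS * C_FARADS
gap_percent = 100.0 * (CAD_DELAY_S - rc_seconds) / CAD_DELAY_S
print(f"RC = {rc_seconds * 1e15:.3f} fs, {gap_percent:.1f}% below the CAD value")
```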
Using a clock frequency of 200 MHz (T = 5 ns), the clock arrival at the 1st SiLago block
was 0.055 ns, the clock arrival at the 400th SiLago block was found to be 1.103 ns, and
each clock buffer added a delay of 0.02 ns. To drive the 400th SiLago block, 52 clock
buffers are needed as per the calculations (0.055 + (0.02 * 52) = 1.095 ns), and with
this experiment it was found that the prediction deviates by 0.16% of the clock period
(100 * (1.103 - 1.095) / 5).
From the above, one can observe that clock tree synthesis becomes a relatively easy task
when done in the SiLago Platform, as the clock tree has predictable behavior
in SiLago-fied blocks.
To summarize the experiments and results obtained, it was concluded that the proposed
clock tree design solution was able to match, and in a few aspects surpass, the
hierarchical clock tree synthesized by the EDA tool. This evidence is enough to replace
the prevalent design with the proposed design. The proposed scheme comes with added
benefits, as it requires only a one-time engineering effort, which makes the VLSI design
process fast, predictable, easy to implement and correct by construction; these
properties are believed to be essential for automating synthesis at a higher abstraction
level (see Figure 2.5).
Figure 4.9: Clock tree flow of 5x2 SiLago-fied Fabric with no space.
Figure 4.10: CAD tool’s CTS information.
Figure 4.11: SiLago-fied clock tree information.
5.1 Introduction
This section describes the power and ground network design and IR drop analysis.
VDD and VSS pads are connected to the concentric rings inside the design [30].
Typically, the ring width is made large to reduce electromigration and noise.
In terms of performance and of minimizing the current and voltage variations in the power
networks, mesh structures are found to be better than interleaved and local tree-based
power distribution techniques [31] [32].
Figure 5.1 (a), (b) and (c) show the mesh, interleaved and local tree-based power
distribution schemes, respectively.
In the SiLago design framework, the power grid has to be modular and must satisfy two
requirements: first, it must be space invariant and, second, composition by abutment
should be possible with the proposed design. Very little has been discussed on this topic,
as will become clear in the later sections.
5.2 Process
This section describes the process for power and ground grid dimensioning.
Figure 5.2 shows the mesh structure. Orthogonal wires are spread in the form of
rectangular grids. The two topmost layers were chosen for adding the stripes because they
have the least resistance, which makes them suitable for power and ground stripes.
Vertical stripes were laid on metal layer 11 (AP) and horizontal stripes were laid on
metal layer 10 (M10).
Chapter 5 Power Grid Dimensioning
Figure 5.1: (a) Mesh structure, (b) Interleaved structure, (c) Local tree-based structure.
Figure 5.2: Proposed power and ground distribution scheme.
In Figure 5.2, different colors represent different objects in the design. Blue
represents the power rings, green represents horizontal stripes and yellow
represents vertical stripes. For the power and ground stripes, the width is set to 2
microns and the spacing between two adjacent VDD and VSS stripes is 2 microns. For the
rings, the width was 5 microns, the spacing was 5 microns and the offset was 2 microns.
Vias are allowed to span from the M1 to the AP metal layers to connect the cells to the
power and ground network.
This section describes the experiments and results achieved from the IR Drop
Analysis.
The purpose of the experiment was to introduce high and low activities in the power
and ground network and analyze the dynamic behavior (which is the preliminary step
for performing IR Drop Analysis).
Below is the dynamic power report for high activity in the circuit.
*Power in mW and Voltage in V .
Below is the dynamic power report for low activity in the circuit.
*Power in mW and Voltage in V .
It was observed that the leakage power in both reports is exactly the same, i.e.,
0.004563 mW, which indicates that the proposed power grid is suitable for the SiLago
design framework, as it shows predictable behavior at both high and low activity in the
circuit.
Due to the unavailability of certain technology files (extraction tech file and
.layermap), a complete IR Drop Analysis was infeasible and was left for future work.
Challenges
Below are the main challenges and problems faced during this project:
1. While placing the SiLago Wrapper blocks on the fabric, unused clock input pins
were automatically assigned logic 0. As some other unused pins (non-clock nets)
were also being assigned logic 0, the CAD tool treated those pins and nets as part of
the clock net. The solution was to manually delete unused clock pins from the logically
synthesized netlist. In future, this task will be done automatically by our compiler.
2. The logical synthesis tool was generating errors while compiling the fabric. The
error was due to multiple drivers on a constant net. This was solved by manually
connecting the wire outside the generate block.
3. To eliminate any further optimization by the CAD tool, hardened SiLago Wrapper
blocks were used.
4. The SiLago Wrapper's liberty file was not read at the design initialization step, due
to which timing information was missing when working with the hardened blocks.
When running timeDesign in On-Chip Variation (OCV) mode, the RC information
of the metal wires inside the hardened blocks was missing. The solution was to read the
liberty file during design initialization.
5. There was a delay in characterization due to the late arrival of storage devices and a
bandwidth bottleneck, because the Hard Disk Drives (HDDs) were accessible only
through a switch over KTH's intranet. To resolve this, a 4TB HDD was
installed in the local machine on which the required software was already installed.
6. An error in the testbenches caused false power estimation in all 30,000 reports, so
the experiment was repeated, this time with 3,000 testbenches.
7. Due to unavailability of certain technology files, a complete IR Drop Analysis was
infeasible and was left for the future works.
Conclusion
This project started with the constraint analysis of SiLago, with a focus on solving the
problem of coupling. Then, based on a previous Ph.D. thesis, power distribution was
carried out, which focused on IR Discharge and development of the power grids at chip
level supply, region level supply, and SiLago level supply. The project then moved to the
timing domain, in which clocking was done, in parallel with the characterization
experiments, using Global Routing Cells (GRCs) by providing each SiLago block with a
programmable delay line. Finally, hierarchical power grids were designed. Due to the
unavailability of certain technology files, a complete IR Drop Analysis was infeasible,
but dynamic power reports were generated at low and high activities in the circuit. The
key characteristics of this thesis are listed below:
1. Power Characterization : A power model of the SiLago platform was built by power
characterization. To achieve this goal, a hybrid library learning based characterization
methodology was adopted. Using an example circuit of an aggressor and a victim,
it was shown that coupling or cross-talk noise was sufficient to cause a functional
failure. Later, using multiple graphs obtained by power characterization, it was also
shown how even abnormal behavior of the circuit can be reasoned about using the results
from various graphs showing experimental values for power consumption (of SiLago
blocks) for CNN, DCT2D and FR.
2. Clock Tree Synthesis : A fixed-skew clock network was achieved by inserting
predictable and programmable clock buffers into the SiLago blocks. The time
complexity of designing this clock tree was found to be O(1) because the designer
has to take a single design-time decision for the inclusion of buffers into the clock
tree network. Experiments were performed by comparing capacitance, internal
power, switching power, leakage power, total power and design times of the CAD
generated clock tree and the SiLago framework generated clock tree. These experiments
show that the clock tree designed in the SiLago framework was straightforward and,
in almost all cases, surpassed the CAD generated clock tree.
3. Power Grid Dimensioning : A method was developed to dimension the power
grids of the SiLago blocks. This power distribution was hierarchical and it was
designed in such a way that composition by abutment was possible for SiLago
blocks. Later, using the available resources for IR Drop analysis, experiments
were performed. Dynamic behavior of the circuit was analyzed by introducing
high and low activities in the power and ground network of the SiLago blocks.
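The dimensioning idea in item 3 can be illustrated with a first-order IR-drop estimate. The Python sketch below is not the method developed in this thesis. It assumes a single stripe modelled as a uniform resistor, R = R_sheet * L / W, and the current, sheet resistance, stripe length and drop budget are hypothetical values chosen only for illustration, not taken from any technology file.

```python
def stripe_resistance(sheet_res_ohm_sq, length_um, width_um):
    """Resistance of a rectangular metal stripe: R = R_sheet * (L / W)."""
    if width_um <= 0:
        raise ValueError("width must be positive")
    return sheet_res_ohm_sq * length_um / width_um


def ir_drop(current_a, sheet_res_ohm_sq, length_um, width_um):
    """First-order voltage drop V = I * R along one stripe."""
    return current_a * stripe_resistance(sheet_res_ohm_sq, length_um, width_um)


def min_stripe_width(current_a, sheet_res_ohm_sq, length_um, max_drop_v):
    """Smallest width that keeps the drop within budget: W >= I * R_sheet * L / V_max."""
    if max_drop_v <= 0:
        raise ValueError("drop budget must be positive")
    return current_a * sheet_res_ohm_sq * length_um / max_drop_v


# Hypothetical numbers (assumptions, not values from the thesis).
I_BLOCK = 0.010   # current drawn through the stripe, in amperes
R_SHEET = 0.05    # sheet resistance, in ohms per square
LENGTH = 500.0    # stripe length, in microns
BUDGET = 0.025    # allowed voltage drop, in volts

width = min_stripe_width(I_BLOCK, R_SHEET, LENGTH, BUDGET)
print(f"minimum stripe width: {width:.2f} um")
print(f"drop at that width:   {ir_drop(I_BLOCK, R_SHEET, LENGTH, width) * 1e3:.2f} mV")
```

A real grid is a mesh with many parallel paths and vias, so this single-stripe estimate is only a rough upper-level check, not a substitute for the IR Drop Analysis left as future work.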
48
Bibliography
[1] Ahmed Hemani, Syed Mohammed Asad Hassan Jafri, and Shayesteh Masoumian.
“Synchoricity and NOCs Could Make Billion Gate Custom Hardware Centric
SOCs Affordable.” In: Proceedings of the Eleventh IEEE/ACM International Sym-
posium on Networks-on-Chip. NOCS ’17. Seoul, Republic of Korea: ACM, 2017,
8:1–8:10. isbn: 978-1-4503-4984-0. doi: 10.1145/3130218.3132339. url: http:
//doi.acm.org/10.1145/3130218.3132339.
[2] Ahmed Hemani et al. “The SiLago Solution: Architecture and Design Methods for
a Heterogeneous Dark Silicon Aware Coarse Grain Reconfigurable Fabric.” In: The
Dark Side of Silicon: Energy Efficient Computing in the Dark Silicon Era. Ed. by
Amir M. Rahmani et al. Cham: Springer International Publishing, 2017, pp. 47–94.
doi: 10.1007/978-3-319-31596-6_3. url: https://doi.org/10.1007/978-3-
319-31596-6_3.
[3] S. M. A. H. Jafri, N. Farahini, and A. Hemani. “SiLago-CoG: Coarse-Grained Grid-
Based Design for Near Tape-Out Power Estimation Accuracy at High Level.” In:
2017 IEEE Computer Society Annual Symposium on VLSI (ISVLSI). July 2017,
pp. 25–31. doi: 10.1109/ISVLSI.2017.15.
[4] Muhammad Ali Shami. “Dynamically Reconfigurable Resource Array.” QC 20120917.
PhD thesis. KTH, Electronic Systems, 2012, pp. xix, 196. isbn: 978-91-7501-473-9.
[5] M. Adeel Tajammul et al. “A NoC based distributed memory architecture with
programmable and partitionable capabilities.” In: NORCHIP 2010. Nov. 2010,
pp. 1–6. doi: 10.1109/NORCHIP.2010.5669440.
[6] M. A. Tajammul, M. A. Shami, and A. Hemani. “Segmented Bus Based Path Setup
Scheme for a Distributed Memory Architecture.” In: 2012 IEEE 6th International
Symposium on Embedded Multicore SoCs. Sept. 2012, pp. 67–74. doi: 10.1109/
MCSoC.2012.34.
[7] N. Farahini et al. “Physical design aware system level synthesis of hardware.” In:
2015 International Conference on Embedded Computer Systems: Architectures,
Modeling, and Simulation (SAMOS). July 2015, pp. 141–148. doi: 10 . 1109 /
SAMOS.2015.7363669.
[8] Nasim Farahini et al. SiLago : A Structured Layout Scheme to Enable Efficient
High Level and System Level Synthesis. Tech. rep. 2016:13. QC 20160429. KTH,
Electronics and Embedded Systems, 2016.
Clock Tree Synthesis. url: https://www.cadence.com/content/cadence-
www/global/en_US/home/training/all-courses/86198.html.
[10] W. J. Dally, C. Malachowsky, and S. W. Keckler. “21st century digital design
tools.” In: 2013 50th ACM/EDAC/IEEE Design Automation Conference (DAC).
May 2013, pp. 1–6. doi: 10.1145/2463209.2488850.
[11] S. Borkar. “Design perspectives on 22nm CMOS and beyond.” In: 2009 46th
ACM/IEEE Design Automation Conference. July 2009, pp. 93–94. doi: 10.1145/
1629911.1629940.
[12] Hai Zhou, N. Shenoy, and W. Nicholls. “Timing analysis with crosstalk as fixpoints
on complete lattice.” In: Proceedings of the 38th Design Automation Conference
(IEEE Cat. No.01CH37232). 2001, pp. 714–719. doi: 10.1109/DAC.2001.156230.
[13] Florentin Dartu and Lawrence T. Pileggi. “Calculating Worst-case Gate Delays
Due to Dominant Capacitance Coupling.” In: Proceedings of the 34th Annual De-
sign Automation Conference. DAC ’97. Anaheim, California, USA: ACM, 1997,
pp. 46–51. isbn: 0-89791-920-3. doi: 10.1145/266021.266033. url: http://
doi.acm.org/10.1145/266021.266033.
[14] A. K. Palit et al. “Analysis of crosstalk coupling effects between aggressor and
victim interconnect using two-port network model.” In: Proceedings. 8th IEEE
Workshop on Signal Propagation on Interconnects. May 2004, pp. 81–84. doi:
10.1109/SPI.2004.1409011.
[15] L. Lavagno et al. “EDA for IC implementation, circuit design, and process tech-
nology.” In: US: CRC Press, 2016, pp. 610–613. isbn: 9781482254617.
[16] Igor Keller, King Ho Tam, and Vinod Kariat. “Challenges in Gate Level Modeling
for Delay and SI at 65Nm and Below.” In: Proceedings of the 45th Annual Design
Automation Conference. DAC ’08. Anaheim, California: ACM, 2008, pp. 468–473.
isbn: 978-1-60558-115-6. doi: 10.1145/1391469.1391590. url: http://doi.
acm.org/10.1145/1391469.1391590.
[17] J. M. Wang, Pinhong Chen, and O. Hafiz. “A new continuous switching window
computation with crosstalk noise.” In: 16th Symposium on Integrated Circuits and
Systems Design, 2003. SBCCI 2003. Proceedings. Sept. 2003, pp. 261–266. doi:
10.1109/SBCCI.2003.1232839.
[18] E. G. Friedman. “Clock distribution networks in synchronous digital integrated
circuits.” In: Proceedings of the IEEE 89.5 (May 2001), pp. 665–692. issn: 0018-
9219. doi: 10.1109/5.929649.
[19] Matthew R. Guthaus, Gustavo Wilke, and Ricardo Reis. “Revisiting Automated
Physical Synthesis of High-performance Clock Networks.” In: ACM Trans. Des.
Autom. Electron. Syst. 18.2 (Apr. 2013), 31:1–31:27. issn: 1084-4309. doi: 10.
1145 / 2442087 . 2442102. url: http : / / doi . acm . org / 10 . 1145 / 2442087 .
4020-8022-0_9. url: https://doi.org/10.1007/1-4020-8022-0_9.
[21] Ting-Hai Chao et al. “Zero skew clock routing with minimum wirelength.” In:
IEEE Transactions on Circuits and Systems II: Analog and Digital Signal Process-
ing 39.11 (Nov. 1992), pp. 799–814. issn: 1057-7130. doi: 10.1109/82.204128.
[22] T. H. Chao, Y. C. Hsu, and J. M. Ho. “Zero skew clock net routing.” In: [1992]
Proceedings 29th ACM/IEEE Design Automation Conference. June 1992, pp. 518–
523. doi: 10.1109/DAC.1992.227749.
[23] J. Cong and Cheng-Kok Koh. “Minimum-cost bounded-skew clock routing.” In:
Circuits and Systems, 1995. ISCAS ’95., 1995 IEEE International Symposium on.
Vol. 1. Apr. 1995, 215–218 vol.1. doi: 10.1109/ISCAS.1995.521489.
[24] M. R. Guthaus, D. Sylvester, and R. B. Brown. “Clock buffer and wire sizing
using sequential programming.” In: 2006 43rd ACM/IEEE Design Automation
Conference. July 2006, pp. 1041–1046. doi: 10.1145/1146909.1147171.
[25] L. Lavagno et al. “EDA for IC implementation, circuit design, and process tech-
nology.” In: US: CRC Press, 2016, pp. 272–273. isbn: 9781482254617.
[26] J. Cong et al. “Bounded-skew clock and Steiner routing under Elmore delay.” In:
Proceedings of IEEE International Conference on Computer Aided Design (IC-
CAD). Nov. 1995, pp. 66–71. doi: 10.1109/ICCAD.1995.479993.
[27] “The Elmore Delay as a Bound for RC Trees with Generalized Input Signals.”
In: 32nd Design Automation Conference. 1995, pp. 364–369. doi: 10.1109/DAC.
1995.249974.
[28] Weiping Shi and Zhuo Li. “A fast algorithm for optimal buffer insertion.” In: IEEE
Transactions on Computer-Aided Design of Integrated Circuits and Systems 24.6
(June 2005), pp. 879–891. issn: 0278-0070. doi: 10.1109/TCAD.2005.847942.
[29] P. K. Chan and K. Karplus. “Computing Signal Delay in General RC Networks
by Tree/Link Partitioning.” In: 26th ACM/IEEE Design Automation Conference.
June 1989, pp. 485–490. doi: 10.1109/DAC.1989.203445.
[30] H. H. Chen and D. D. Ling. “Power Supply Noise Analysis Methodology For
Deep-submicron Vlsi Chip Design.” In: Proceedings of the 34th Design Automation
Conference. June 1997, pp. 638–643. doi: 10.1109/DAC.1997.597223.
[31] “Electronic Design Automation: Synthesis, Verification, and Test.” In: ed. by
Laung-Terng Wang, Yao-Wen Chang, and Kwang-Ting (Tim) Cheng. San Fran-
cisco, CA, USA: Morgan Kaufmann Publishers Inc., 2009, pp. 751–850. isbn:
9780080922003.
[32] Shen Lin and N. Chang. “Challenges in power-ground integrity.” In: IEEE/ACM
International Conference on Computer Aided Design. ICCAD 2001. IEEE/ACM
Digest of Technical Papers (Cat. No.01CH37281). Nov. 2001, pp. 651–654. doi:
Appendix A Scripts
vcom -vopt -work work cp_file_name
vcom -vopt -work work tb_file_name
vsim +nowarnTFMPC -t ns -novopt tb_work_name
run 165 ns
vcd file tbwork.vcd
run 100 ns
quit -sim
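Because the characterization experiments used thousands of testbenches, a command sequence like the one above would normally be generated per testbench rather than written by hand. The Python sketch below shows one way this could be done. The function name, the file names and the parameterized run times are assumptions made for illustration; only the command sequence itself comes from the script above.

```python
def make_do_script(cp_file, tb_file, tb_name, vcd_file,
                   warmup_ns=165, record_ns=100):
    """Return the simulator command sequence for one testbench as text.

    The commands mirror the script above; every argument value is a
    placeholder supplied by the caller.
    """
    lines = [
        f"vcom -vopt -work work {cp_file}",
        f"vcom -vopt -work work {tb_file}",
        f"vsim +nowarnTFMPC -t ns -novopt {tb_name}",
        f"run {warmup_ns} ns",
        f"vcd file {vcd_file}",
        f"run {record_ns} ns",
        "quit -sim",
    ]
    return "\n".join(lines) + "\n"


# Hypothetical file and testbench names, used only to show the call.
for i in range(3):
    script = make_do_script(
        cp_file="cp.vhd",
        tb_file=f"tb_{i}.vhd",
        tb_name=f"tb_{i}",
        vcd_file=f"tb_{i}.vcd",
    )
    print(script)
```

Each returned string can be written to its own file and passed to the simulator in batch mode, which keeps the 165 ns and 100 ns run times identical across all testbenches.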
The following script generates the power report using Innovus. It takes the .vcd file as
input, along with the .dat file for restoring the previously stored place-and-route
design.
set_power_analysis_mode -reset
set_power_analysis_mode -method static -corner max -create_binary_db true \
    -write_static_currents true -honor_negative_energy true -ignore_control_signals true
set_power_output_dir -reset
set_power_output_dir /reports
set_default_switching_activity -reset
set_default_switching_activity -input_activity 0.0 -period 10.0
read_activity_file -reset
set_power -reset
set_dynamic_power_simulation -reset
report_power -outfile /PowerTotal.txt -clock_network all -hierarchy all -cell_type all \
    -power_domain all -pg_net all -net -sort total
The following script should be added after the point where the power report is generated.
It generates separate power reports for the different instances of the SiLago block, in
addition to PowerTotal.txt.
# Separator used when composing instance names (assumed; not defined in this listing)
set us "_"
set c 0
set r 0
for {set c 0} {$c < 5} {incr c} {
    for {set r 0} {$r < 2} {incr r} {
        set Sc "SILEGO_cell"
        set MTRF "MTRF_cell"
        # Instance name of the form <prefix>_<column>_<row>
        set silego $Sc$us$c$us$r
    }
}
A.2 Clock Tree Synthesis
Below is a minimal script for physical synthesis, in which the chip dimension is 1000x500
microns with 20 microns of extra space for power stripes. Only the first block and the
clock pin are placed here, because the remaining blocks and pins follow the same
placement pattern.
#Cadence Innovus commands
set init_design_uniquify 1
setDesignMode -process 40
set init_gnd_net {VSS}
set init_pwr_net {VDD}
set init_lef_file {library.lef}
set init_mmmc_file {mmmc.tcl}
set init_top_cell {top_module}
set init_verilog {fabric.v}
init_design
floorPlan -site core -s 1000 500 20 20 20 20
relativeFPlan --relativePlace {SILEGO_block_0_0} TR Bottom Core Boundary TL 40 40
editPin -use CLOCK -fixedPin 1 -fixOverlap 1 -unit MICRON -spreadDirection clockwise \
    -side Top -layer 2 -spreadType start -spacing 0.14 -start 100.0 220.0 -pin clk
placeDesign
assignPtnPin
clonePlace
extractRC
setAnalysisMode -useOutputPinCap true -sequentialConstProp false -timingSelfLoopsNoSkew false \
    -enableMultipleDriveNet true -clkSrcPath true -warn true -usefulSkew true \
    -analysisType onChipVariation -log true
timeDesign -postRoute -pathReports -drvReports -slackReports -numPaths 50 \
    -prefix fabric_postRoute -outDir timingReports
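As a quick check of the floorplan numbers, the Python sketch below computes the outer dimensions implied by the floorPlan arguments. It assumes the six values after -s are read as core width, core height, and the left, bottom, right and top margins; the helper name is hypothetical.

```python
def die_size(core_w, core_h, left, bottom, right, top):
    """Outer width and height in microns: core size plus the margin on each side."""
    return core_w + left + right, core_h + bottom + top


# Values from: floorPlan -site core -s 1000 500 20 20 20 20
width, height = die_size(1000, 500, 20, 20, 20, 20)
print(width, height)  # 1040 540
```

Under this reading, the 20 micron margin on every side is the space reserved for the power stripes around the 1000x500 micron core.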
