Characterization, Clock Tree Synthesis and Power Grid Dimensioning in SiLago Framework
ROHIT PRASAD
Degree Project in Electrical Engineering, Second Cycle, 30 Credits
KTH Royal Institute of Technology, School of Information and Communication Technology
Stockholm, Sweden 2018
Abstract
A hardware design methodology or platform is complete only if it can, among other things, successfully implement a clock tree, predict power consumption for cases such as the best and worst Parasitic Interconnect Corners (RC Corners), and supply power to every standard cell.
This thesis addresses three unsolved engineering problems in SiLago design. First, power characterization of a flat design built with the SiLago methodology. Second, designing a hierarchical clock tree and hardening it inside the SiLago logic. Third, dimensioning hierarchical power grids. Of these, the clock tree shows some interesting characteristics, as it is programmable and predictable.
The tools used for digital design are Cadence Innovus, Synopsys Design Vision, and Mentor Graphics Questasim. These sophisticated tools are widely accepted in industry as well as in academia.
The work done in this thesis has brought the SiLago platform one step closer to fruition.
Keywords: hardware design, physical design
A hardware design methodology or platform is complete if it is capable of successfully implementing the clock tree, predicting the power consumption for the best and worst cases of Parasitic Interconnect Corners (RC Corners), supplying power to every standard cell, etc.
This thesis has attempted to solve the three unsolved technical problems in the SiLago design. The first is power characterization of the design that was built using the SiLago method. The second problem is to design a hierarchical clock tree and harden it inside the SiLago logic. The third problem is to dimension hierarchical power grids. Of these, the clock tree illustrates some interesting properties because it is programmable and predictable.
The tools used for digital design are Cadence Innovus, Synopsys Design Vision and Mentor Graphics Questasim. These tools are very sophisticated and widely accepted in industry as well as in academia.
The work in this thesis has made it possible for the SiLago platform to take one step toward being realized.
Keywords: digital hardware design, physical design
Acknowledgement
I would like to thank my examiner, Prof. Ahmed Hemani at the School of ICT, KTH, for the guidance and this opportunity. I would also like to thank my supervisors, Syed Mohammad Asad Hassan Jafri (now at Ericsson, Sweden) and Dimitrios Stathis (pursuing a Ph.D. at KTH); without them this thesis would have lacked quality results.
Finally, I would like to thank my family for their love and support; without them this day would not have been possible.
Rohit Prasad
February 2018
List of Figures
1.1 Heads showing how growth rate gap is linked to Computer Architecture.
2.1 DRRA cells connected through interconnects.
2.2 Register File
2.5 Searching of design space in SiLago design methodology.
2.6 Proposed Clock tree scheme.
2.7 Proposed scheme for hierarchical power grids.
3.1 (a) Circuit to demonstrate glitch noise; (b) Simplified circuit
3.2 Power tree building C++ code snippet.
3.3 Total Power Vs. Iterations.
3.4 (a) Total Power activity for Mode 1; (b) Hopping of signals
3.5 Switching Power Vs. Iterations.
3.6 (a) Switching Power activity for Mode 1; (b) Hopping of signals
3.7 Power Distribution for CNN
3.8 Power Distribution for CNN (breakdown)
3.9 Power Distribution for DCT2D
3.10 Power Distribution for DCT2D (breakdown)
3.11 Power Distribution for FR
3.12 Power Distribution for FR (breakdown)
4.1 Three levels of clock trees in SiLago Design.
4.2 5x2 SiLago Wrapper cells abutted.
4.3 Design Vision generated MRB schematic.
4.4 5x2 SiLago Wrapper cells showing clock tree mesh and composed by abutment.
4.5 Clock tree flow of 5x2 Fabric with no space.
4.6 Clock tree flow of 5x2 Fabric with space in-between the blocks.
4.7 Regional Clock tree of 5x2 Fabric with no space.
4.8 Regional Clock tree of 5x2 Fabric with space in between the blocks.
4.9 Clock tree flow of 5x2 SiLago-fied Fabric with no space.
4.10 CAD tool's CTS information.
4.11 SiLago-fied clock tree information.
5.1 (a) Mesh structure, (b) Interleaved structure, (c) Local tree-based structure.
5.2 Proposed power and ground distribution scheme.
List of Tables
4.3 Comparison of design times.
Acronyms
DiMArch Distributed Memory Architecture
PHY Physical
Language
RTL Register-Transfer Level
HLS High-Level Synthesis
FR Face Recognition
CAD Computer-Aided Design
nm Nanometer
pF Picofarad
Contents
1.3 Organization
2.1 DRRA: Dynamically Reconfigurable Resource Array
2.2 DiMArch: Distributed Memory Architecture
2.3 SiLago Design Methodology
3 Power Characterization
4.1 Introduction
4.2 Process
5.1 Introduction
5.2 Process
6 Challenges
7 Conclusion
Introduction
The electronic systems department at Kungliga Tekniska Högskolan (KTH) has designed an Application-Specific Integrated Circuit (ASIC) design methodology that is fundamentally different from traditional standard-cell-based design. The methodology allows chips to be designed by abutting multiple micro-architectural components. By doing so, it promises and provides two orders of magnitude better design productivity. However, in its present state, three things remain to be done: prove that the operations couple only minimally and that this coupling can be accurately modelled, design a simplified clock tree, and manage the power.
ASIC design is considered an expensive process because of the repeated iterations between RTL design, logic synthesis and physical synthesis. These iterations arise because a designer has to refine the ASIC design at a high level, where information such as location, wire routing and arrangement of hardware cells is not available. Many recent works have been proposed to deal with these issues, and the SiLago design methodology, which raises the level of abstraction of physical design, is one of them. Work has been done to make SiLago accurately predict the costs for area and timing, but somewhat less work has been done on predicting power. This defines the first task of this thesis, i.e., power characterization of the SiLago blocks.
Designing a clock tree that is capable of composition by abutment is not simple, because when SiLago blocks equipped with this clock tree are placed side by side, they must produce synchoros [1] large-grain VLSI design objects. This clock tree must be properly structured and predictable, so that its cost metrics are known beforehand. This task is critical: the clock tree and the clock tree synthesis scheme play a very important role both in raising the abstraction level of the SiLago design methodology and in enabling SiLago blocks to be composed by abutment.
The power grid must also be designed so that it is modular and satisfies the requirements of SiLago (described in the next section).
Chapter 1 Introduction
1.0.1 Efficacy Gaps
Three trends describe the growth rate gaps linked to application complexity, design technology, VLSI technology and battery technology [1][29][30][31]. Figure 1.1 shows these heads.
Figure 1.1: Heads showing how growth rate gap is linked to Computer Architecture.
1. Architecture Efficacy Gap : This gap arises from inefficient placement of modules and of the connectivity between them, or of the clock tree. This thesis shows how a clock tree can be designed efficiently, as an attempt to close this gap using the SiLago design methodology.
2. Design Productivity Gap : Two main factors affect the time-to-market of a computer system.
a) Design Time : This can be reduced by minimizing the complexity of the design.
b) Manufacturing Time : Reusing a design, or including regularity in the design, will eventually reduce manufacturing time.
These two factors also contribute to the cost of a computer system, i.e., the sum of manufacturing and design costs.
3. Battery Capacity Gap : Good computational efficiency in an architecture helps reduce this gap. A very common practice is to match the granularity; this can be instruction granularity, bit-width granularity or silicon granularity.
1.0.2 Computer Architectures
This section discusses the widely accepted computer architectures that are currently available. These architectures are attempts to bridge the gaps discussed in the previous section.
1. ASIC : ASICs have little to no granularity mismatch and thus yield high performance on a low power budget. They are designed to match the instruction and bit-width granularity of the target domain application while keeping the silicon granularity mismatch to a minimum. Once the chip is fabricated it cannot be modified, so the parallelism of the target domain algorithm is exploited at design time. ASICs therefore lack flexibility, and their usage is limited to their target domain application. This also limits their sustainability, i.e., they cannot be reused for any application algorithm other than the one they were designed for. ASICs exhibit a low architecture efficacy gap and battery capacity gap but a high design productivity gap, due to high manufacturing and design costs.
2. FPGA : FPGAs are programmable at gate level and thus have very fine instruction granularity. Their bit-width granularity is 1 bit. Interconnections and basic blocks in an FPGA can be reconfigured, so instruction granularity or bit-width granularity can be fine-tuned to the requirements of the target application. This results in a large reconfiguration overhead, i.e., configuration memory, wires and switches. For this reason FPGAs perform worse than ASICs. On the other hand, FPGAs are flexible, so they can be reused, and their basic blocks can work autonomously, so parallelism can be exploited at either bit-width level or instruction level. FPGAs have a higher architecture efficacy gap and battery capacity gap than ASICs, but a smaller design productivity gap.
3. GPP : GPPs are very flexible and can run any application. The data-path of a GPP is sized for basic logical and arithmetic operations, which results in high flexibility. Because algorithms are split into basic operations, GPPs exhibit granularity mismatch and a high number of memory and interconnect operations. This also results in higher power consumption and lower performance than ASICs. Owing to the one-time design cost and the lower manufacturing cost of mass production, GPPs have the smallest design productivity gap, and the largest architecture efficacy gap and battery capacity gap, compared with ASICs and FPGAs.
4. CGRA : CGRAs lie somewhere between ASICs and FPGAs with respect to granularity mismatch. CGRAs have an advantage over FPGAs because their coarse-grain data-path results in silicon granularity matching. This also reduces the number of cells in the design, and thus the wire and routing area overhead, with respect to FPGAs. CGRAs have a lower architecture efficacy gap and battery capacity gap than FPGAs, and a design productivity gap equivalent to that of FPGAs.
Chapter 2 discusses why an alternative architecture to ASICs and FPGAs was needed.
1.1 Problems
This project attempted to solve the following unsolved engineering problems in Silicon Large Grain Object (SiLago) design:
1. How to characterize the operations hosted by the SiLago blocks,
including coupling
between physically close blocks. In essence, given a space time
trace of operations
performed by the SiLago blocks, the characterization model should
be able to
predict the average energy consumed within 1-2% accuracy.
2. Design a hierarchical clock tree scheme where the region wide
clock is distributed
manually in a structured manner and the clock fed to each SiLago
block in the
region is controlled by a programmable delay buffer to keep the
skew within the
margin for which the SiLago logic is hardened. Regions will
communicate on GALS
basis.
3. Develop a method to dimension the power grids that will feed the
SiLago blocks.
This power distribution will once again be hierarchical. The global
power nets will
feed the power rings of the blocks and the power rings of the
regions will feed the
power rails of the SiLago blocks. Finally, the power rings of the
regions of the
SiLago blocks will feed the power rails of the standard cells.
Dimensioning them
and absorbing them in the SiLago blocks so that they compose by
abutment is the
challenge.
1.2 Goal and Method
This project was a step forward in realization of SiLago design
flow. It made the platform
more complete. The main goal of the project was to characterize and
build power models
of the SiLago platform, design a hierarchical clock tree scheme and
hierarchical power
grids.
Instances designed on the SiLago platform achieve the efficiency of an ASIC with far less effort, thus reducing the manufacturing cost [1] [2]. The framework is proposed as an alternative to general processor/software-centric and accelerator-rich platform-based SoCs, because the latter are dominated by infrastructural hardware while SiLago has functional hardware.
Compared to standard-cell-based design flows, SiLago adopts two policies to achieve the above-mentioned non-incremental advances in efficiency and design quality:
1. Abstraction of physical design at the micro-architectural / register-transfer level (RTL). Doing so reduces the design space exponentially, thus lowering the resource demands on the synthesis tools used at system level.
2. To enable composition by abutment, SiLago adopts the synchoros design style, which enables quick generation of large-scale designs [3].
A hybrid library-learning-based characterization was used, since it is the most efficient characterization technique known.
1.3 Organization
The rest of this thesis is organized as follows. Chapter 2 lays the background for this thesis by introducing the SiLago design methodology. Chapter 3 starts with an introduction to power characterization, followed by the steps taken for power characterization, and then discusses the experiments and results. Chapter 4 begins with an introduction to the clock tree and expands the problem statement for the clock tree design scheme; it then discusses the design process and ends with a detailed discussion of the experimental setups and results. Power grid dimensioning is discussed in Chapter 5, which begins with a brief introduction, followed by the design process, and ends with a brief discussion of the experiments and results. Chapter 6 explains the challenges and problems, and Chapter 7 draws the conclusions of this project. The scripts used for each task are given in the appendix.
SiLago Design Methodology
There is a need for an alternative to ASICs and FPGAs, due to the higher design cost of ASICs and the high area overhead and low computational efficiency of FPGAs. As a consequence, CGRAs come into play because of their high computational efficiency and lower design cost and time. CGRAs are well suited to replace FPGAs and ASICs for domain-specific applications.
In [4], a survey showed that both industrial and scientific research focus on multiprocessor and array systems. Many classes of architecture remain unresearched, which opens up scope for research into new architectures and the development of their compilation tools. ASICs clearly force researchers to look for an alternative because of their inflexibility and high design costs, while FPGAs have large reconfiguration overheads and fine granularity. DRRA has therefore been proposed as an effort to overcome these shortcomings.
2.1 DRRA: Dynamically Reconfigurable Resource Array
DRRA targets the PHY layer of OSI model for communication and can
be realized as a
part of wireless system or as independent macros in SoC [4]. DRRA
supports all three
levels of granularity discussed above i.e., Instruction granularity
matching, bit-width
granularity matching and silicon granularity matching. DRRA cells
are connected to
each other by interconnects and employ a three-hop sliding window
communication
strategy.
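As a rough illustration of the three-hop sliding window, the sketch below checks which cell columns can reach each other. It assumes that the window is symmetric and measured in cell columns; the function names and the 0-based column indexing are hypothetical and not taken from the DRRA implementation.

```python
def can_connect(src_col: int, dst_col: int, max_hops: int = 3) -> bool:
    """Return True if a cell in src_col may talk to a cell in dst_col.

    Assumption: the sliding window is symmetric and counted in columns.
    """
    return abs(src_col - dst_col) <= max_hops


def window(col: int, n_cols: int, max_hops: int = 3) -> range:
    """Columns reachable from `col` in a fabric with n_cols columns (0-based)."""
    lo = max(0, col - max_hops)            # clip at the left edge of the fabric
    hi = min(n_cols - 1, col + max_hops)   # clip at the right edge of the fabric
    return range(lo, hi + 1)
```

Under these assumptions, a cell in the middle of a wide fabric sees seven columns (itself plus three on each side), while a cell at the edge sees only four.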
Figure 2.1 shows a basic schematic of DRRA cells connected through interconnects.
Figure 2.1: DRRA cells connected through interconnects.
A DRRA cell consists of four modules:
1. Register File (RFile) : It provides high bandwidth for parallel data transfer to the DPU. All data received by the DPU and all data computed in the DPU are stored in the RFile. This movement of data takes one clock cycle. Figure 2.2 shows the block diagram of the RFile.
2. Data Path Unit (DPU) : It includes all of the logical and computational resources of DRRA. The DPU is divided into four partitions:
a) Pre-processing Unit executes operations like absolute and negation.
b) Logical Unit executes logical operations like OR, AND, shifting, etc.
c) Arithmetical Unit executes operations like signal processing, etc. This unit supports fixed-point and integer operations but not floating-point operations.
d) Post-processing Unit executes operations like truncation, etc.
As the DPU is pipelined, an arithmetic operation takes one clock cycle, except multiply or MAC, which take two clock cycles. A local sequencer controls the DPU. Figure 2.3 shows the diagram of the DPU.
3. Sequencer : It is basically a state machine that controls all DRRA resources. Each DRRA cell is allocated a sequencer, and because of this allocation the configuration of DRRA is dynamic in nature. Sequencers can communicate with each other to synchronize with other DRRA resources.
4. Switchbox (SWB) : It is placed at the intersection of input and output buses in the DRRA interconnect network. SWBs are connected to a configuration memory which determines which output lane connects to which input lane. The SWB uses tri-state logic to disconnect undriven lanes from the circuit.
Figure 2.2: Register File
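The switchbox behavior described above can be pictured with a small behavioral sketch. The data layout is an assumption made for illustration only: the configuration memory is modelled as a list holding, for each input lane, the index of the output lane that drives it, with `None` standing for a tri-stated (undriven) lane.

```python
from typing import List, Optional, Sequence


def switchbox(output_lanes: Sequence[int],
              config: Sequence[Optional[int]]) -> List[Optional[int]]:
    """Route values from output lanes to input lanes.

    config[i] is the index of the output lane driving input lane i,
    or None if input lane i is disconnected (tri-stated).
    """
    return [None if sel is None else output_lanes[sel] for sel in config]
```

For example, with three output lanes carrying 10, 20 and 30 and a configuration of `[2, None, 0]`, input lane 0 receives 30, input lane 1 is left undriven, and input lane 2 receives 10.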
2.2 DiMArch: Distributed Memory Architecture
DRRA has a distributed, circuit-switched memory network called DiMArch [5]. DiMArch needs a single instruction to program a source-destination path [6]. A sequencer (shown in Figure 2.1) acts as a link between DRRA and DiMArch. The DiMArch interconnect scheme can be separated into two groups:
1. Data network (dNoC) : dNoC transports data between RFile (in
DRRA) and
memory banks (in DiMArch). Both read and write of data between
RFile and
dNoC can be performed simultaneously, hence it has full-duplex
interconnects.
2. Instruction Network (iNoC) : iNoC is a packet-switched network
for transfer of
instructions. AGUs are programmed through this network.
Figure 2.3: Data Path Unit
Both dNoC and iNoC are implemented within Tiles in DiMArch. Tiles in DiMArch are of two types:
1. SRAM Tile (STile) : It is a block of SRAM memory cells which receives instructions from the Configuration Tile through the iNoC. An STile comprises an Instruction Switch, Partition Handler, Data Switch, SRAM Address Generator Units, and SRAM.
2. Configuration Tile (ConTile) : ConTiles form a layer of tiles between the STiles and the DRRA cells. Each ConTile can connect to its horizontally placed neighbour ConTiles.
Figure 2.4: DiMArch and DRRA
Figure 2.4 shows how the STiles and ConTiles are identified and arranged when DRRA and DiMArch are placed together.
2.3 SiLago Design Methodology
The SiLago design methodology raises the level of abstraction of physical design from standard cells (Boolean level) to the micro-architecture level (register-transfer level). This enables the synthesis of hardware from higher levels of abstraction [7]. Cost metrics are predicted with higher accuracy because this methodology reduces the abstraction gap and hence improves the ability of synthesis tools (used at abstraction levels higher than RTL) to predict cost metrics. It also reduces the synthesis time by reducing the search of the design space, as illustrated in Figure 2.5.
Figure 2.5: Searching of design space in SiLago design
methodology.
The SiLago design flow eliminates manual fine-tuning (in HLS tools, for example, the user has to manually define the budget for constraints at algorithm level) and guarantees correctness by construction by replacing that fine-tuning with a machine translation. Thus, functional verification is eliminated.
Accurate prediction of the cost metrics is enabled by accurate characterization of both the micro-architectural level operations themselves and the interconnects between them. Thus, constraint verification at system level is eliminated [8].
By virtue of SiLago-fication, the abstraction gap is reduced, which can be very helpful in automating synthesis at SoC level.
To make the SiLago platform more complete, a few additions were immediately needed.
The first is to power characterize the flat design of the fabric. Power characterization of the block design was already done in [3]; the aim here is to show that the predictive behavior of the SiLago design methodology still holds when a flat design is chosen instead of a block design, i.e., when there is coupling between two closely placed SiLago blocks. In practice, this coupling produces noise in the circuit whenever a signal crosses these blocks, and the motivation was to predict this noise and the behavior of the circuit under such circumstances. It therefore became necessary to record the operations hosted by the SiLago blocks, including the coupling between physically close blocks. The outcome of this experiment enables a designer to predict the average energy consumption to within 1-2%.
The second is to design a hierarchical clock tree scheme. Until now, the SiLago fabric has used a clock tree designed by the predefined algorithms in commercial CAD tools. These algorithms try to reduce the clock skew and slew rate by adding clock buffers (available in the technology library) to the clock path, while satisfying the setup and hold times in each block. This adds an irregular number of clock buffers, which defies the whole SiLago concept because the SiLago blocks are no longer regular. The clock tree cannot be predicted until it has been synthesised by Cadence Innovus' Clock Concurrent Optimization Technology (ccopt) engine [9]. This problem raises the need for a predictable clock tree scheme, in which the available CAD tool has to be tricked into always producing a predictable clock tree. This workaround is needed because the CAD tool's ccopt engine is a black box and the tool vendor does not disclose every detail of how it works. The immediate task was to study this engine by running numerous experiments, recording every minute change in the generated clock tree, and inferring how the engine works. Using the recorded information, a clock tree is then designed that forces the tool to produce a predictable clock tree with a regular number of clock buffers; the resulting SiLago blocks are regular and hence predictable. The proposed task is to design a hierarchical clock tree in which the clock buffers are distributed manually such that the delay is programmable, at the cost of a one-time engineering effort at design time. Figure 2.6 shows the proposed clock tree scheme.
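To make the idea of a programmable delay concrete, the following sketch picks a delay setting per block so that all block clocks line up with the latest arrival. It is only an illustration under stated assumptions: each programmable buffer is assumed to add a whole number of equal delay steps, and the names `arrival_ps`, `step_ps` and `max_taps` are hypothetical, not taken from the thesis scripts.

```python
def delay_settings(arrival_ps, step_ps, max_taps):
    """Choose a tap count per block that aligns every clock to the latest one."""
    latest = max(arrival_ps)
    taps = []
    for t in arrival_ps:
        n = round((latest - t) / step_ps)   # nearest whole number of delay steps
        if n > max_taps:
            raise ValueError("delay range of the programmable buffer is too small")
        taps.append(n)
    return taps


def residual_skew(arrival_ps, taps, step_ps):
    """Skew that remains after the chosen delays are applied."""
    adjusted = [t + n * step_ps for t, n in zip(arrival_ps, taps)]
    return max(adjusted) - min(adjusted)
```

Because each block is rounded to the nearest step, the remaining skew in this model is at most one delay step, so the step size would have to be chosen smaller than the skew margin for which the SiLago logic is hardened.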
Figure 2.6: Proposed Clock tree scheme.
The third is to develop a method to dimension the power grids for the SiLago platform. To make the SiLago blocks regular in this respect as well, so that the behavior of each block is predictable (satisfying the SiLago design statute), the power grids should be hierarchical and there should be no significant drop in the power supply voltage. To achieve such an organization, the local power grids should take their input from the global power nets at regular intervals across the die. The global power nets are placed so that they surround the fabric from the outside, and at every fixed distance horizontal and vertical power rails feed the local power rings of the blocks. With this organization the power grids are regular and hence predictable. Figure 2.7 shows the proposed scheme for power grid dimensioning.
Figure 2.7: Proposed scheme for hierarchical power grids.
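Why regular tap spacing makes the supply drop predictable can be seen from a first-order estimate. The sketch below is not the dimensioning method of this thesis; it assumes a single rail fed from one end, equal segment resistance between taps, and the same current drawn at every tap. All parameter names are hypothetical.

```python
def rail_ir_drop(n_taps, i_tap_a, r_segment_ohm):
    """Voltage drop (in volts) at each tap of a rail fed from one end.

    Segment k (1-based, counted from the feed point) carries the current
    of taps k..n, so it drops (n - k + 1) * I * R.
    """
    drops = []
    v = 0.0
    for k in range(1, n_taps + 1):
        v += (n_taps - k + 1) * i_tap_a * r_segment_ohm
        drops.append(v)
    return drops
```

In this model the worst-case drop, at the tap farthest from the feed, is I * R * n(n+1)/2, so it grows roughly with the square of the number of taps between feed points. That is one reason for feeding the local grids from the global nets at a fixed, regular distance.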
The most important test of whether a block designed with the SiLago design methodology actually satisfies the SiLago statute is whether composition by abutment is possible with each SiLago block. Each task in this thesis strictly follows this rule and the above-mentioned rules as well. In addition, this project required a deep understanding of Very-Large-Scale Integration (VLSI) concepts. Cadence SoC Encounter (now called Innovus) was used for physical design. Synopsys Design Vision was used for logic synthesis. QuestaSim (ModelSim), NCsim, Virtuoso, MATLAB, VHDL, C++, TCL, BASH and SystemVerilog were used for scripting, design, and analysis of the results.
Power Characterization
3.1 Introduction
At a high level of abstraction, hardware cost prediction is challenging because information such as placement, wiring and location of hardware blocks is unavailable. Several works [7], [10], [11] have proposed that raising the abstraction level from standard cells to coarse-grain components increases the accuracy of cost prediction. Power estimation is more complicated than estimating timing and area because power varies with the distance a signal travels, with coupling in the path or with adjacent operations, and with the data. In [3], a new framework, CoG, was proposed to estimate power; it used a block design and obtained 15 times better estimates than state-of-the-art tools. The work in this thesis uses the same technique as [3] for power estimation, but with a flat design instead of a block design. This was done to estimate the power when there is coupling between closely placed SiLago blocks. Below, a simple circuit is used to demonstrate that glitch noise is sufficient to cause functional failures and hence lead to abnormalities in power estimation.
Noise in a digital circuit arises while the circuit is operating, for example when noise propagates from other parts of the circuit or when other nearby signals switch. This affects the behavior and timing of the digital circuit, which is where the need for characterization arises.
Using power characterization information, a designer can predict the abnormal behavior of the circuit under the influence of noise. There are three main noise effects in a digital design:
1. Functional failure due to a wrong value on the signal.
2. Setup-timing violation due to late arrival of a signal, which forces the chip to run at a lower frequency than intended.
3. Hold-timing violation due to early arrival of a signal, which results in fatal failure of the chip because in this case the chip cannot run even at a low frequency.
The most common noise is coupling or cross-talk noise [12], which is also the main reason for power characterization of the SiLago block. Coupling or cross-talk noise arises when a signal net (victim) is coupled via a coupling capacitance to a noise-causing net (aggressor) on which a rise or fall transition occurs [13] [14]. Usually, when a quiet victim is affected by coupling noise, the noise is observed as a spike or glitch (glitch noise).
To show that glitch noise is sufficient to cause a functional failure [15] [16] [17], consider a simple circuit in which the aggressor is a rising buffer and the victim is an inverter, coupled through the coupling capacitor (Ccp). Figure 3.1(a) shows the circuit.
Figure 3.1: (a) Circuit to demonstrate glitch noise; (b) Simplified
circuit
To obtain a quantitative view of the problem, consider the following three assumptions, which simplify the circuit (Figure 3.1(b)):
1. Consider the circuit as lumped capacitances and ignore the victim and aggressor resistance;
3. The aggressor is modelled as an ideal voltage source with a saturated-ramp waveform a(t).
To analyze the voltage response at the victim node, consider the differential equation obtained using Kirchhoff's law:
\[ C_t \frac{dv}{dt} + \frac{v}{r} = C_{cp} \frac{da}{dt} \tag{3.1} \]
$C_t = C_g + C_{cp}$ is the total capacitance and $v(t)$ is the response on the victim. With the initial condition $v(0) = 0$, the solution is Eq. 3.2:
\[ v(t) = \frac{C_{cp}}{C_t} \int_0^t e^{-(t-x)/\tau}\, a'(x)\, dx \tag{3.2} \]
$\tau = r C_t$ is the victim time constant.
From Eq. 3.2 it follows that if $a(t)$ (the aggressor transition) is constant before and after the transition, $v(t)$ (the response) is a glitch, i.e., it is zero before the transition and returns to zero after it.
When $T \gg \tau$, where $T$ is the aggressor transition time,
\[ v(t) = r C_{cp}\, a'(t) + o\!\left(\frac{\tau}{T}\right) \tag{3.3} \]
$v(t)$ is directly proportional to $C_{cp}$, $r$ and $a'(t)$. Since $C_{cp} < C_t$, the magnitude of the glitch is limited by the factor $\tau/T$, which means the glitch is small when the transition of the aggressor is slow.
When $T \ll \tau$,
\[ v(t) = \frac{C_{cp}}{C_t}\, a(t) + o\!\left(\frac{T}{\tau}\right) \tag{3.4} \]
$C_{cp}/C_t$ is the attenuation factor and the magnitude of the glitch is limited by $V_{dd}\, C_{cp}/C_t$; the initial shape of $v(t)$ (victim response) follows the shape of $a(t)$ (aggressor transition).
Chapter 3 Power Characterization
Peak or maximum vpeak, is the most important characteristic of
glitch noise. Let us
look into the cases where we derive vpeak using parameters of the
circuit.
When a(t) (aggressor transition) is a rising saturated linear
ramp,
a(t) =
Vdd , t ≥ T
v(t) =
vpeake (−t τ ) , t ≥ T
(3.6)
The glitch reaches its maximum, vpeak, at t = T, and the equation obtained is
\[ v_{peak} = \frac{C_{cp} V_{dd}}{C_t} \, f\left(\frac{T}{\tau}\right) \]  (3.7)
Ccp·Vdd/Ct captures the electrical properties of the circuit and f(T/τ) is a nonlinear
function of the two times.
vpeak is directly proportional to r, Vdd and Ccp/Ct and inversely proportional to T (the
aggressor's transition time). Nets with low drive strength, i.e., high driver holding
resistance (r), and high Ccp (coupling capacitance) are the ones most vulnerable to
crosstalk. It also follows that if the aggressor switches fast, i.e., has a small
transition time, the glitch noise will be worse.
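As a cross-check of the trends above, the following minimal Python sketch integrates Eq. 3.1 numerically for a saturated ramp and compares the resulting peak with a closed form. It assumes f(x) = (1 - e^(-x))/x, which is what solving Eq. 3.1 for a linear ramp with v(0) = 0 gives; this form of f and all component values are illustrative assumptions and are not taken from the SiLago design.

```python
import math


def glitch_peak_closed_form(c_cp, c_g, r, vdd, t_rise):
    """Peak glitch for a saturated ramp, assuming f(x) = (1 - exp(-x)) / x."""
    c_t = c_g + c_cp              # total victim capacitance
    tau = r * c_t                 # victim time constant
    x = t_rise / tau
    return (c_cp * vdd / c_t) * (1.0 - math.exp(-x)) / x


def glitch_peak_numeric(c_cp, c_g, r, vdd, t_rise, steps=100_000):
    """Forward-Euler integration of C_t dv/dt + v/r = C_cp da/dt over 0..2T."""
    c_t = c_g + c_cp
    dt = 2.0 * t_rise / steps
    v = 0.0
    peak = 0.0
    for k in range(steps):
        t = k * dt
        da_dt = vdd / t_rise if t < t_rise else 0.0   # slope of the saturated ramp
        v += dt * (c_cp * da_dt - v / r) / c_t
        peak = max(peak, v)
    return peak


# Hypothetical values: 2 fF coupling, 8 fF ground capacitance, 1 kOhm holding
# resistance, 1.1 V supply and a 20 ps aggressor transition.
params = dict(c_cp=2e-15, c_g=8e-15, r=1e3, vdd=1.1)
slow = glitch_peak_closed_form(t_rise=20e-12, **params)
fast = glitch_peak_closed_form(t_rise=5e-12, **params)
print(f"peak at T=20ps: {slow:.4f} V, peak at T=5ps: {fast:.4f} V")
```

With these hypothetical values the faster aggressor produces the larger glitch, in line with the statement that a small transition time makes the glitch noise worse.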
Eq. 3.7 and Eq. 3.8 are used in programs for crosstalk analysis; with the help of these
two equations one can eliminate the nets with low risk and save time and resources.
The following sections briefly discuss the power estimation process, attempt to explain
the obtained results, and show how a designer can benefit from these power estimations.
3.2 Process
This section describes the steps taken for power characterization. It follows
the same steps as CoG [3]; the only alteration concerns the SiLago design, i.e., a flat
design was used instead of a block design.
The process of power characterization can be summed up in four simple steps.
1. Generate random test cases (.m files) using any programming language capable of
file handling (Python or C++ preferred);
2. Feed those test cases (.m files) into Vesyla (the new version is Algosil) to obtain the
testbenches (.vhdl (Very High Speed Integrated Circuit Hardware Description
Language) files) along with assembly code files;
3. Simulate the SiLago fabric in either Questasim or NCSim using those testbenches and
generate .vcd (Value Change Dump) files;
4. Generate power reports from Innovus using the generated .vcd files.
Using the testbenches, assembly codes and power reports, one can determine the power
distribution of any algorithm, simple or complex, when run on the SiLago fabric without
actually running the experiments, thus saving time and resources.
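Step 1 of the process can be sketched in Python as follows. This is only an illustration of exhaustive, repeatedly shuffled test-case enumeration: the mode names, the tuple layout (mode, source of input 1, source of input 2, destination of output) and the function name are assumptions made for this sketch, and the actual .m file format expected by Vesyla is not reproduced here.

```python
import itertools
import random

# Hypothetical mode labels for the DPU operations exercised by the test cases.
MODES = ("add", "mul", "mac")


def generate_cases(num_dpus, repeats, seed=0):
    """Enumerate every (mode, in1, in2, out) combination, reshuffled per repeat.

    num_dpus: number of DPUs that can source an input or sink the output
    repeats:  how many times the full set of combinations is re-randomized
    """
    rng = random.Random(seed)   # fixed seed keeps the generated set reproducible
    combos = list(itertools.product(MODES, range(num_dpus),
                                    range(num_dpus), range(num_dpus)))
    cases = []
    for _ in range(repeats):
        rng.shuffle(combos)     # new random order, same exhaustive coverage
        cases.extend(combos)
    return cases


cases = generate_cases(num_dpus=3, repeats=2)
print(len(cases), "test cases, first one:", cases[0])
```

Each tuple would then be written out as one test case file and passed to the later steps; the storage demand grows with the product of the number of combinations and the number of repeats.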
In the next section, experiment setup and results are
discussed.
3.3 Experiments and Results
This section discusses the experimental setup and the results achieved with power
characterization in the SiLago framework.
Standard sign-off quality post-layout data was used for power characterization. Both
inter-cell and intra-cell characterization were performed, as was done for CoG [3].
Intra-cell characterization provides a complete characterization of a single cell, as energy
values for all modes and connections are calculated during this process. Inter-cell
characterization provides information about the cross-coupling between DPUs, which is
one of the main motivations for performing power characterization in this thesis.
Figure 3.2: Power tree building C++ code snippet.
Figure 3.2 shows a C++ code snippet which covers every input for each SiLago cell.
Its last line re-randomizes each combination for the next 100 iterations, so that no
possible combination is left out. This code also builds a power tree for inter-cell
characterization. As a consequence, the experiment demanded a massive amount of data
storage space.
Figure 3.3 plots total power (mW) against iterations for the following modes:
Mode 1 : Addition
Mode 2 : Multiplication
Mode 3 : MAC (Accumulator is initialized with an additional signal)
Figure 3.3 is somewhat irregular, but it shows that each mode follows a comparable
pattern, which indicates that the modes and connections are independent in terms
of energy consumption.
Figure 3.3: Total Power Vs. Iterations.
Figure 3.4 (a) shows the total power activity for Mode 1 and Figure 3.4 (b) shows
the hops, i.e., which DPU the inputs and outputs originate from. In Figure 3.4 (b),
magenta is Output, green is Input 1 and red is Input 2.
Figure 3.5 shows switching activity against iterations. It can be observed that the
switching activity for Mode 1, i.e., addition, is low compared to Mode 2 and Mode 3,
both of which involve multiplication.
Figure 3.6 (a) shows the switching power activity for Mode 1 and Figure 3.6 (b) shows
the hops, i.e., which DPU the inputs and outputs originate from. In Figure 3.6 (b),
magenta is Output, green is Input 1 and red is Input 2.
From Figure 3.5, we also conclude that adding a constant to the curve of one mode makes
it line up with the curves of the other modes. So, the equation [3] that can be
deduced is
Power(modeX) = Power(modeY) + constant
As the power consumption is additive in nature, analyzing any single mode is sufficient.
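The additive relation between modes can be illustrated with a short Python sketch that estimates the constant from two power traces and then predicts one mode from the other. The numbers below are synthetic placeholders chosen only to demonstrate the arithmetic; they are not measurements from the SiLago fabric, and the function names are hypothetical.

```python
from statistics import mean


def mode_offset(power_x, power_y):
    """Estimate the constant in Power(modeX) = Power(modeY) + constant."""
    if len(power_x) != len(power_y):
        raise ValueError("traces must cover the same iterations")
    return mean(x - y for x, y in zip(power_x, power_y))


def predict_from_reference(power_y, offset):
    """Predict the trace of another mode from a reference trace plus the offset."""
    return [p + offset for p in power_y]


# Synthetic traces in mW (placeholders, not measured data).
reference_mode = [1.0, 1.4, 1.2, 1.6]
other_mode = [1.5, 1.9, 1.7, 2.1]

offset = mode_offset(other_mode, reference_mode)
predicted = predict_from_reference(reference_mode, offset)
print(f"estimated constant: {offset:.3f} mW")
```

If the additive model holds, the predicted trace matches the measured trace of the other mode up to noise, which is why characterizing a single mode is sufficient.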
An assumption is made that a register file provides all the inputs. Figure 3.4 reveals
three patterns:
1. High power consumption is seen when either or both of the two inputs are connected to
a register of another cell and the output goes to the same register.
2. Low energy consumption is seen when either of the inputs and the output are connected
to the source cell (where the DPU is operating).
3. Constant power consumption is seen when the inputs and the output are neither
connected to the same register nor connected to the source.
Figure 3.4: (a) Total Power activity for Mode 1 ; (b) Hopping of signals
Spikes (seen especially in Figure 3.3 and Figure 3.6) have the following two causes:
1. Fast switching in the circuit, i.e., small transition times.
2. a(t) is constant before and after the transition, so the response is a glitch.
As the response is directly proportional to the coupling capacitance and inversely
proportional to the transition time, the glitch is small when the transition is slow.
It can also be observed that there is an abnormally large spike between iterations 500
and 600. This behavior arises because the design has a high concentration of functional
wires around DPU 5, which results in high coupling capacitance for the circuit around
DPU 5. Input1 and Output are initialized within DPU 5, but Input2 has to travel several
blocks before it reaches DPU 5, so timing is tight in order to avoid setup violations,
which creates voltage deviations. Figure 3.6 (a) shows large switching activity between
iterations 500 and 600, and the response to it appears in Figure 3.3 as the abnormally
large spike.
* However, the argument above needs to be verified by running further experiments before
drawing any concrete conclusion.
Figure 3.5: Switching Power Vs. Iterations.
Figure 3.6: (a) Switching Power activity for Mode 1 ; (b) Hopping
of signals
To demonstrate the usefulness of power characterization, power consumption estimates
are reported below for three applications: Convolutional Neural Network (CNN),
Discrete Cosine Transform - Two Dimensional (DCT2D) and Face Recognition (FR).
The results show nearly correct estimations (a conclusion based on the experiments
done in [3]) without actually performing the experiments. Figures 3.7 to 3.12 show the
results for power distribution.
Figure 3.8: Power Distribution for CNN (breakdown)
Figure 3.10: Power Distribution for DCT2D (breakdown)
Figure 3.12: Power Distribution for FR (breakdown)
4.1 Introduction
Clock design is one of the most challenging tasks in digital design, as the designer has
to distribute clock signals throughout the chip. The designer also has to be aware of the
available resources while minimizing factors such as power, skew, variation and jitter
[18] [19].
The clock period is the duration of one recurrence of the low and high pattern of the
clock signal. Circuit frequency is inversely proportional to the clock period. The time
taken by the clock signal to propagate through the clock tree to the sinks is the
insertion delay. The sinks, where the clock signal is received, are either clock pins
(sequential) or clock buffer inputs (hierarchical). Clock signal delays are mostly
measured at 50% of the supply voltage. In a few cases, which occur in edge-triggered
systems, the delay is determined by the inverter's switching threshold [20].
The motivation behind designing this clock tree was to achieve a fixed-skew clock network
[20] [21] [22] [23] with predictable buffer delay insertion. It was observed that the
CAD-generated clock tree introduced random clock buffers in the clock path to produce
buffered nets with the smallest delay. Almost all buffering techniques use the van
Ginneken dynamic programming algorithm for buffer insertion and sizing [24], with the
Elmore delay model [25] [26] [27] as the delay model. The three main steps of this
algorithm are:
1. Buffer addition in O(n) time;
2. Wire addition in O(n) time;
3. Merging of two branches in O(n1 + n2) time, where n1 and n2 are the numbers of
buffer positions in the two branches.
Thus, this algorithm has a time complexity of O(n^2) [28].
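The linear-time merge in step 3 can be sketched as follows. This is a minimal Python illustration of the merge as it is commonly described for van Ginneken style buffering, not code from the CAD tool: each branch is assumed to carry a list of non-dominated candidates (load capacitance, required arrival time) sorted by increasing capacitance, and the function name is hypothetical.

```python
def merge_branches(left, right):
    """Merge candidate lists of two branches in O(n1 + n2) time.

    Each list holds (capacitance, required_time) pairs that are non-dominated
    and sorted so that both values increase along the list. At a branch point
    the capacitances add and the required time is the smaller of the two.
    """
    merged = []
    i = j = 0
    while i < len(left) and j < len(right):
        c1, q1 = left[i]
        c2, q2 = right[j]
        merged.append((c1 + c2, min(q1, q2)))
        # Advance the list that currently limits the required time; keeping
        # it would only add capacitance without improving the result.
        if q1 < q2:
            i += 1
        elif q2 < q1:
            j += 1
        else:
            i += 1
            j += 1
    return merged


left = [(1.0, 5.0), (2.0, 8.0)]
right = [(1.0, 6.0), (3.0, 9.0)]
print(merge_branches(left, right))
```

Because each loop iteration advances at least one of the two indices, the number of merged candidates is at most n1 + n2, which is where the linear bound for the merge step comes from.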
The proposed regional clock tree synthesis scheme generates the clock tree by the abutment
of identical and synchronous [1] SiLago blocks. This clock tree is not improvised; it is
structurally parameterized so that the cost metrics can be predicted with certainty. The
principles of the SiLago method were presented in [1], but how abutment results in a
valid clock tree was not worked out there; it is elaborated in this part of the thesis.
For the sake of SiLago-fication, the presented clock tree synthesis scheme raises the
level of abstraction of physical design to the micro-architecture level. The clock tree
has been designed to insert minimum delay, and for the designer the time complexity is
O(1): the designer takes a single design-time decision, namely how many of the buffers
already placed in the programmable buffer delay block should be included in the buffer
chain, and this always takes constant time. This reduces the design time and the
engineering effort.
Figure 4.1: Three levels of clock trees in SiLago Design.
The requirements imposed by the SiLago methodology on clock tree synthesis are explained
with the help of Figure 4.1, which visualizes three levels of hierarchy. The figure shows
a SiLago SoC with different region instances in different colors. Instances of SiLago
blocks, which are specific to each region, act as leaf nodes in the hierarchy. These
instances are automatically synthesized in the SiLago design flow such that the type,
number, relative position and composition of region instances are optimally matched to
their constraints and functional requirements.
Corresponding to the three levels of design hierarchy, Figure 4.1 also shows three levels
of clock trees: local, regional and global. The local clock tree is generated
automatically by a commercial EDA tool and the global clock tree is derived from the
PLL/CGU. This part of the thesis focuses on the regional clock tree.
To adopt composition by abutment, the regional clock tree must satisfy two requirements
imposed by SiLago:
1. The cost metrics of the SiLago blocks, i.e., latency and energy, must be uniform and
identical, and must not be affected by the position in the SiLago design instance.
This property is required to keep the engineering effort scalable, i.e., a one-time
engineering effort. It also keeps the design regular from both the physical
and the architectural point of view.
2. It must be possible to create valid VLSI designs of arbitrary size by composing SiLago
design instances by abutment with valid neighbours. This means that no engineering
effort should be needed to implement the SiLago blocks beyond what has been done
already. It also means that the design should be timing clean, have signal integrity,
have no IR drop violations, etc. In other words, the aggregation of the parts of the
clock in the SiLago blocks should appear as a valid regional clock that does not
violate timing, because the clock tree itself balances the skew and maintains the edge.
The immediate reason for proposing a new clock tree scheme is that the clock tree
synthesis of commercial EDA tools violates the two requirements mentioned above.
The results have been verified with Static Timing Analysis (STA), and a comparison
against a functionally equivalent clock tree has also been made. As the synthesized
clock tree design is correct by construction, no further verification is required.
To verify the delay incurred by clock nets, the RC delay [29] can be calculated using
information from the .lib (Liberty) file and the .lef (Library Exchange Format) file.
The .lib file contains information about the rise and fall times and transitions for a
particular standard cell in the library. It also contains information about power,
resistance and capacitance for that particular standard cell.
The .lef file contains information about the metal layers.
RPERSQ is the resistance of a wire, in ohms per square.
The resistance of a length of wire is
R = RPERSQ * length_wire / width_wire    (4.1)
CPERSQDIST is the wire-to-ground capacitance for each square unit, in pF per square
micron.
EDGECAPACITANCE specifies a floating-point value of peripheral capacitance, in
pF per micron. The place-and-route tool uses this value in two situations:
1. To estimate capacitance before routing.
2. To calculate segment capacitance after routing.
In the second case, the tool uses the value only if the layer thickness or height is set
to zero. The formula used in this case to calculate the segment capacitance is
C = (CPERSQDIST * w * l) + (EDGECAPACITANCE * 2 * (w + l))    (4.2)
where w is the width and l is the length of the wire segment.
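Equations 4.1 and 4.2 translate directly into code. The Python sketch below is a minimal illustration; the function names are hypothetical and the numeric parameters are placeholders chosen for easy arithmetic, not values from the TCBN 40nm technology files.

```python
def wire_resistance(rpersq, length_um, width_um):
    """Eq. 4.1: resistance in ohms from the sheet resistance (ohms per square)."""
    return rpersq * length_um / width_um


def segment_capacitance(cpersqdist, edgecap, length_um, width_um):
    """Eq. 4.2: area term plus peripheral (edge) term, in pF.

    cpersqdist: wire-to-ground capacitance in pF per square micron
    edgecap:    peripheral capacitance in pF per micron
    """
    area_term = cpersqdist * width_um * length_um
    edge_term = edgecap * 2.0 * (width_um + length_um)
    return area_term + edge_term


# Placeholder technology values (assumptions, not from a real .lef file).
r = wire_resistance(rpersq=0.1, length_um=100.0, width_um=0.1)
c = segment_capacitance(cpersqdist=1e-4, edgecap=1e-4,
                        length_um=100.0, width_um=0.1)
print(f"R = {r:.1f} ohm, C = {c:.5f} pF, RC = {r * c * 1e-12:.3e} s")
```

For a long, narrow wire the edge term dominates the capacitance, while the resistance grows with the number of squares (length divided by width).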
4.2 Process
There are two constraints for synthesizing the regional clock tree. First, clock skew
must be minimal in order to maximize the percentage of the clock period usable by
combinatorial logic. This condition arises when clock and data propagate in the same
direction. Second, the drive strength must be maintained such that the slew-rate design
rule of the technology is not violated.
To design the proposed clock tree, a pre-placed and pre-routed SiLago block (with its
local clock tree) was taken as the starting point. Then, a programmable buffer delay
block called the MRB (Mux-Register-Buffer) block was created separately. The MRB block
includes 16 clock buffers (the largest clock buffer in the TCBN 40nm library; area
11.6424 square microns, I pin capacitance 0.003259 pF, Z pin maximum capacitance
0.6043 pF, positive unateness), one MUX and 4 registers for selecting the number of
buffers. The MRB block is placed on top of the SiLago block, and a new top module called
the SiLago Wrapper block was created. The SiLago Wrapper block has two clock inputs and
two clock outputs. The motivation is that when such SiLago Wrapper blocks are abutted,
the clock signal adds the fewest possible functional wires to the design, apart from
what has been implemented and hardened already. The focus was therefore placed on the
regional clock network.
For simplicity, the current buffer chain was designed to include 10 buffers before the
clock signal is fed into the SiLago block. The designer can modify this arrangement at
design time according to the design requirements. However, for the sake of better rise
and fall times, and since each SiLago Wrapper regional clock network sees only the
very next SiLago Wrapper as a load and not the whole design, a restriction was imposed
that the clock network must include at least one clock buffer in the MRB block before
propagating to the next SiLago Wrapper or SiLago block. Figure 4.2 shows a schematic of
such SiLago Wrapper blocks abutted.
The SiLago Wrapper was designed around a pre-placed and pre-routed SiLago block, which
has its own local clock network, such that the entire block is fed with the clock signal
from a single clock input pin. For the MRB block, the largest clock buffer in the TCBN
40nm library was chosen for reasons of simplicity, least overhead and better skew.
Figure 4.3 shows the schematic of the MRB block. The circuit was designed in such
a way that a designer (using an Elmore delay model) can include as many of the available
buffers as needed just by changing the hex value of the RegIn signal, which is set at
design time.
Logical synthesis of the MRB block was done to obtain the netlist; from this netlist,
the physically synthesized netlist and the lef file with metal layer information were
generated.
To create the SiLago Wrapper, wires were manually connected between the MRB block and
the SiLago block by editing the logically synthesized netlist. A single clock signal was
passed from the MRB to the SiLago block. Using the SiLago Wrapper's netlist, the lef
information of the MRB and SiLago blocks, and the TCBN 40nm libraries, a new hardened
SiLago Wrapper block was created.
Figure 4.2: 5x2 SiLago Wrapper cells abutted.
Figure 4.3: Design Vision generated MRB schematic.
To verify that the new block was capable of composition by abutment, a new fabric was
designed which arranged SiLago Wrappers in two rows and five columns. To prevent the
CAD tools from performing any further optimization at any stage of synthesis, hardened
SiLago Wrapper blocks were used in the design. After the design was placed by the CAD
tool, the SiLago Wrapper's netlist was loaded into the black boxes. Figure 4.4 shows the
clock network of the fabric and the successful composition by abutment of the blocks.
Figure 4.4: 5x2 SiLago Wrapper cells showing clock tree mesh and
composed by
abutment.
To put this more plainly, the synthesis scripts are described below (see Appendix A.2
for a minimal script for physical synthesis), with a focus on the steps that must be
followed carefully while designing the clock tree in the SiLago Framework.
The files generated by logical synthesis are not sufficient on their own to begin the
clock tree design in the SiLago Framework. A few changes must be made, some of which are
listed below:
1. Edit the sdc (Synopsys Design Constraints) file to allow the physical synthesis tool
to propagate the clock throughout the chip. Otherwise, the tool will treat only a very
small portion of the clock wire (the wire between the clock pin on the fabric and the
first sinks at the clock pins on the SiLago blocks) as clock and treat the other
connected clock wires as signal. To fix this issue, add
set_propagated_clock [all_clocks] to the sdc file.
2. False paths can also be declared by editing the sdc file, using
set_false_path -through [get_nets h_bus*] ; set_false_path -from [get_ports rst_n]
The first command declares all paths through h_bus* as false and the latter
declares all paths starting from the rst_n port as false.
MMMC (Multi-Mode Multi-Corner) must be defined at the beginning of physical synthesis.
Three cases are often defined for MMMC analysis: typical, worst and best.
Specifically, the worst case is used to check the maximum delay (setup violations) and
the best case is used to check the minimum delay (hold violations).
Before running any process, the design was declared as unique (in the SiLago Framework)
with set init_design_uniquify 1, which allowed clone blocks to be placed along with a
master block on the floorplan. Declaring the design process mode was another important
step; if it is not specified, the tool uses 90nm as the default design process mode. The
40 nm design process mode was defined with setDesignMode -process 40. The design was
initialized with init_design and floorplanning was done. Floorplanning gave early
feedback on whether the initialized design would fit on the die, and it also provided an
estimate of the congestion and delay caused by functional wires.
After floorplanning, pin placement was done and the black boxes were placed, which
revealed whether there were issues with routing, heat distribution, performance
or power consumption. After the design was placed, assembleDesign loaded the
netlist along with the timing and metal layer information. As the clock tree was already
connected, the composition by abutment can be observed at this step, where every signal
is propagated properly. Finally, post-route timing analysis was run to check for
violations.
4.3 Experiments and Results
This section describes the experiments and the results achieved with the clock tree
designed in the SiLago framework. Out of several experiments, only the two most
important ones are described here. These two experiments are sufficient to support the
theory behind SiLago-fication of the clock tree and to show why the clock tree needs to
be included inside the SiLago blocks.
Following are the two experiments:
1. Designing the fabric with 5x2 SiLago blocks with clock tree
generated by CAD
tool (Cadence Innovus) with no space between the SiLago
blocks.
2. Designing the fabric with 5x2 SiLago blocks with clock tree
generated by CAD
tool (Cadence Innovus) with 2 microns of horizontal and vertical
space introduced
between the SiLago blocks.
Table 4.1 compares the capacitance of the CAD tool generated clock net and the SiLago
Framework generated clock net. The total net capacitance was lower in the SiLago
Framework because of its regularity: no extra wires or logic were introduced while
abutting the SiLago blocks. The clock net capacitance was slightly higher because every
block contains 16 clock buffers; at design time, however, the designer will activate
only a few buffers as required, and the clock net capacitance will then decrease.
With this in mind, a short experiment was also done to check how many clock buffers
would be required to reach the 400th of 400 such SiLago blocks, keeping the delay
between the blocks minimal, with an accuracy of 2% or better.
1st cell arrival time = 0.055 ns
400th cell arrival time = 1.103 ns
Delay of each buffer in the delay line (as seen in Innovus) = 0.02 ns
Arrival time with 52 buffers = 0.055 + (0.02 * 52) = 1.095 ns
Error relative to the 5 ns clock period = 100 * (1.103 - 1.095) / 5 = 0.16%
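The arithmetic above can be reproduced with a few lines of Python. The sketch assumes that the buffer count is obtained by rounding the required extra delay to the nearest whole buffer and that the error is expressed relative to the 5 ns clock period; the function name is hypothetical.

```python
def buffers_needed(first_arrival_ns, target_arrival_ns, buffer_delay_ns):
    """Number of delay-line buffers whose total delay best matches the target."""
    return round((target_arrival_ns - first_arrival_ns) / buffer_delay_ns)


FIRST_ARRIVAL_NS = 0.055     # clock arrival at the 1st SiLago block
TARGET_ARRIVAL_NS = 1.103    # clock arrival at the 400th SiLago block
BUFFER_DELAY_NS = 0.02       # delay added by each clock buffer
CLOCK_PERIOD_NS = 5.0        # clock period used in the experiment

n_buffers = buffers_needed(FIRST_ARRIVAL_NS, TARGET_ARRIVAL_NS, BUFFER_DELAY_NS)
predicted_ns = FIRST_ARRIVAL_NS + BUFFER_DELAY_NS * n_buffers
error_pct = 100.0 * (TARGET_ARRIVAL_NS - predicted_ns) / CLOCK_PERIOD_NS

print(f"{n_buffers} buffers, predicted arrival {predicted_ns:.3f} ns, "
      f"error {error_pct:.2f}% of the clock period")
```

Evaluating the expression directly gives 52 buffers, a predicted arrival of 1.095 ns and an error of 0.16% of the clock period, which is within the 2% target.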
Table 4.2 compares the internal, switching, leakage and total power of the CAD tool
generated clock net and the SiLago Framework generated clock net. From Table 4.2 it was
observed that the switching power of the clock net designed in the SiLago framework has
decreased significantly, and as a result the total power is also reduced. This indicates
that the SiLago Framework is a good alternative for designers. To support this statement,
an experiment was run to compare the time consumed in placeDesign and Clock Tree
Synthesis (CTS) by the CAD tool and by the SiLago Framework. Table 4.3 shows the observed
times. Using such results, a designer can extrapolate the values to estimate the time
required to design a fabric with 100 or more blocks.

                     Total net capacitance   Clock net capacitance
CAD tool generated   1.35066e-09 F           2.53E-11 F
SiLago Framework     1.24843e-09 F           2.58E-11 F
Table 4.1: Comparison of capacitance

                     Internal   Switching   Leakage     Total
SiLago Framework     1.806      3.476       0.0006122   5.283
Table 4.2: Comparison of power

                     5x2 blocks    10x2 blocks
SiLago Framework     0:0:59 hrs    0:2:45 hrs
Table 4.3: Comparison of design times.

Below is a series of figures which illustrate the irregularities that the CAD tool
introduces in the design of the same fabric under very small variations.
Figure 4.5 shows the clock tree flow of the SiLago Wrapper block. This clock tree has
regular load capacitance and predictable behaviour. Figure 4.7 shows the physical layout
of the hardened blocks and Figure 4.9 shows the clock tree flow, in which the vertical
clock buffer chains are the programmable clock buffers and the horizontal clock buffer
chains are the buffers of the local clock tree. Each branch represents one SiLago block.
Figure 4.6, in contrast, shows the irregularities introduced by the CAD tool, and
Figure 4.8 is the physical layout used.
Figure 4.5: Clock tree flow of 5x2 Fabric with no space.
Figure 4.6: Clock tree flow of 5x2 Fabric with space in-between the
blocks.
Figures 4.10 and 4.11 contain the clock tree information of the CAD tool generated clock
tree and the SiLago-fied clock tree respectively (both tables were generated using the
CAD tool's CTS engine). These two tables contain the clock tree's time increment,
arrival time, transition time, capacitance, and distance. It was observed that the CAD
tool's clock tree includes clock buffers that are random in size and count on its path
to the local clock tree of each SiLago block, whereas the SiLago-fied clock tree includes
a pre-calculated number of clock buffers of regular size. This helps in pre-calculating
the capacitance, and hence the time increment, arrival time, and transition time of the
SiLago-fied clock tree, by using equations 4.1 and 4.2, forming an RC-π model of
the clock tree, and applying the Elmore delay model.
Figure 4.7: Regional Clock tree of 5x2 Fabric with no space.
Figure 4.8: Regional Clock tree of 5x2 Fabric with space in between the blocks.
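The Elmore delay calculation mentioned above can be sketched for the simplest case of an unbranched RC ladder, such as a wire split into segments. This is a generic Python illustration with a hypothetical function name and placeholder values; it is not the Matlab code used in the thesis and it does not model branching or buffer stages.

```python
def elmore_delay(resistances, capacitances):
    """Elmore delay of an unbranched RC ladder.

    resistances[i] is the series resistance of segment i (ohms) and
    capacitances[i] is the capacitance at the node after segment i (farads).
    Each capacitor is charged through all the resistance upstream of it.
    """
    if len(resistances) != len(capacitances):
        raise ValueError("need one node capacitance per series resistance")
    delay = 0.0
    upstream_r = 0.0
    for r, c in zip(resistances, capacitances):
        upstream_r += r            # total resistance from the driver to this node
        delay += upstream_r * c    # contribution of this node's capacitance
    return delay


# Placeholder ladder: three equal wire segments driving a sink capacitance.
segment_r = [100.0, 100.0, 100.0]           # ohms per segment
node_c = [2e-15, 2e-15, 2e-15 + 3e-15]      # wire capacitance, plus sink at the end
print(f"Elmore delay: {elmore_delay(segment_r, node_c):.3e} s")
```

Because every SiLago-fied block contributes the same segments and buffers, the same ladder can be reused for each block, which is what makes the arrival times predictable ahead of synthesis.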
An experiment was conducted to predict the above-mentioned parameters and compare
them against the table generated by the CAD tool. One of the main reasons for this
experiment was to find out how much clock skew there is at each sink of the SiLago
block, so that a designer can easily choose the correct number of clock buffers when
making a design choice for the SiLago-fied clock tree. Including the correct number of
clock buffers may optimize the drive strength of the clock signal, so that the clock
tree can drive the maximum number of SiLago blocks.
A small calculation of this kind, which shows the usefulness of the prediction scheme, is
as follows (clock tree information was extracted from the technology files and Matlab
was used for the calculations):
It was known that M3 (metal layer 3) was used for clock routing, and the other required
information was available from the technology files to calculate the clock arrival at
each SiLago block. The wire delay (RC) induced by 100 µm of wire is calculated as
follows:
R = (RPERSQ * Length) / Width = 397.142857 x 10^3 Ω
C = (CPERSQDIST * Length * Width) + (EDGECAPACITANCE * 2 * (Length + Width))
  = 0.01461024408 x 10^-18 F
RC = 5.802 x 10^-15 s (or 5.802 fs)
The wire delay obtained from the CAD tool was about 7 fs. The difference in RC values
arises because the calculation assumes ideal conditions.
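The product of the reported R and C values, and its gap to the tool-reported delay, can be checked with a short Python sketch. The R, C and tool values are the ones quoted above; the variable names are illustrative.

```python
R_OHMS = 397.142857e3          # reported wire resistance for 100 um of M3
C_FARADS = 0.01461024408e-18   # reported wire capacitance for the same segment
TOOL_DELAY_S = 7e-15           # approximate wire delay reported by the CAD tool

rc_seconds = R_OHMS * C_FARADS
gap_pct = 100.0 * (TOOL_DELAY_S - rc_seconds) / TOOL_DELAY_S

print(f"RC = {rc_seconds * 1e15:.3f} fs, "
      f"{gap_pct:.1f}% below the tool-reported delay")
```

The hand calculation lands roughly 17% below the tool value, which gives a feel for how much the ideal-condition assumption underestimates the extracted delay.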
Using a clock frequency of 200 MHz (T = 5 ns), the clock arrival at the 1st SiLago block
was 0.055 ns, the clock arrival at the 400th SiLago block was 1.103 ns, and each clock
buffer added a delay of 0.02 ns. To drive the 400th SiLago block, 52 clock buffers are
needed as per the calculation (0.055 + (0.02 * 52) = 1.095 ns), and with this experiment
the error of the prediction scheme was found to be 0.16% of the clock period.
From these observations about clock trees, one can conclude that clock tree synthesis
becomes a relatively easy task in the SiLago Platform, as the clock tree has predictable
behavior in SiLago-fied blocks.
To summarize the experiments and results, the proposed clock tree design solution was
able to match, and in a few aspects surpass, the hierarchical clock tree design
synthesized by the EDA tool. This evidence is enough to justify replacing the prevalent
design with the proposed design. The proposed scheme comes with added benefits: it
requires only a one-time engineering effort, which makes the VLSI design process fast,
predictable, easy to implement and correct by construction, properties that are believed
to be essential for automating synthesis at a higher abstraction level (see Figure 2.5).
Figure 4.9: Clock tree flow of 5x2 SiLago-fied Fabric with no space.
Figure 4.10: CAD tool’s CTS information.
Figure 4.11: SiLago-fied clock tree information.
5.1 Introduction
This section describes the power and ground network design and IR drop analysis.
VDD and VSS pads are connected to the concentric rings inside the design [30].
Typically, the ring width is made large to reduce electromigration and noise.
In terms of performance and of minimizing the current and voltage variations in the
power networks, mesh structures are found to be better than interleaved and local
tree-based power distribution techniques [31] [32].
Figure 5.1 (a), (b) and (c) show the mesh, interleaved and local tree-based power
distribution schemes respectively.
In the SiLago design framework, the power grid has to be modular and must satisfy two
requirements: first, it must be space invariant, and second, composition by abutment
must be possible with the proposed design. Very little has been discussed on this topic,
as will become clear in the later sections.
5.2 Process
This section describes the process for power and ground grid dimensioning.
Figure 5.2 shows the mesh structure. Orthogonal wires are spread in the form of
rectangular grids. The two top metal layers were chosen for the stripes because they
have the least resistance, which makes them suitable for power and ground stripes.
Vertical stripes were laid on metal layer 11 (AP) and horizontal stripes were laid on
metal layer 10 (M10).
Figure 5.1: (a) Mesh structure, (b) Interleaved structure, (c)
Local tree-based structure.
Figure 5.2: Proposed power and ground distribution scheme.
In Figure 5.2, different colors represent different objects in the design: blue
represents the power rings, green the horizontal stripes and yellow the vertical
stripes. For the power and ground stripes, the width is set to 2 microns and the spacing
between two adjacent VDD and VSS stripes is 2 microns. For the rings, the width was
5 microns, the spacing 5 microns and the offset 2 microns. Vias are allowed to span
from the M1 to the AP metal layer to connect the cells to the power and ground network.
5.3 Experiments and Results
This section describes the experiments and the results of the IR drop analysis.
The purpose of the experiment was to introduce high and low activity in the power
and ground network and analyze the dynamic behavior (which is the preliminary step
for performing IR drop analysis).
Below is the dynamic power report for high activity in the
circuit.
*Power in mW and Voltage in V .
Below is the dynamic power report for low activity in the
circuit.
*Power in mW and Voltage in V .
It was observed that the leakage power in both reports is exactly the same, i.e.,
0.004563 mW, which indicates that the proposed power grid is suitable for the SiLago
design framework, as it shows predictable behavior at both high and low activity in the
circuit.
Due to the unavailability of certain technology files (the extraction tech file and
.layermap), a complete IR drop analysis was infeasible and is left for future work.
Challenges
Below are the few main challenges and problems faced during this
project:
1. While placing the SiLago Wrapper blocks on the fabric, unused
clock inputs pins
were automatically assigned logic 0. As some other unused pins
(non-clock nets)
were also being assigned logic 0, CAD tool assumed those pins and
nets as the
clock net. The solution was to manually delete unused clock pins
from the logically
synthesized netlist. In future, this task will done automatically
by our compiler.
2. The logical synthesis tool was generating errors while compiling
the fabric. The
error was due to multiple drive of a constant net. This was solved
by manually
connecting wire outside the generate block.
3. To eliminate any further optimization by CAD tool, hardened
SiLago Wrapper
blocks were used.
4. SiLago Wrapper’s liberty file was not read at the design
initialization step, due
to which timing information was missing when working with the
hardened blocks.
When running the timeDesign in On-Chip Variation (OCV) mode, RC
information
of metal wires
inside hardened blocks were missing. Solution was to read the
liberty file during
design initialization.
5. Characterization was delayed by the late arrival of storage
devices and by a bandwidth bottleneck, because the Hard Disk Drives
(HDDs) were accessible only through a switch over KTH's intranet.
To resolve this, a 4 TB HDD was installed in the local machine on
which the required software was already installed.
6. An error in the testbenches caused incorrect power estimates in
all 30,000 reports, so the experiment was repeated with 3,000
testbenches.
7. Because certain technology files were unavailable, a complete IR
drop analysis was infeasible and is left for future work.
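The manual netlist clean-up described in challenge 1 could be automated along the following lines. This is only a sketch under stated assumptions: it treats the synthesized netlist as plain Verilog text with named port connections, and the module name `SiLago_wrapper`, the pin name `clk_unused`, and the function `drop_tied_off_clock_pins` are hypothetical. It is not the compiler pass planned by the authors.

```python
import re

def drop_tied_off_clock_pins(netlist: str, clock_pins: list[str]) -> str:
    """Remove named port connections such as `.clk_unused(1'b0)` for the given pins."""
    if not clock_pins:
        return netlist
    names = "|".join(re.escape(p) for p in clock_pins)
    # One tied-off connection for any of the listed (hypothetical) clock pins.
    tied = rf"\.(?:{names})\s*\(\s*1'b0\s*\)"
    # Remove the connection together with one neighbouring comma,
    # so that the remaining port list stays syntactically valid.
    netlist = re.sub(rf"{tied}\s*,\s*", "", netlist)  # pin followed by a comma
    netlist = re.sub(rf",\s*{tied}", "", netlist)     # pin that ends the list
    return re.sub(tied, "", netlist)                  # pin that is the only connection

# Hypothetical instance: one unused clock pin and one unused non-clock pin.
example = "SiLago_wrapper u0 (.clk(clk), .clk_unused(1'b0), .rst_n(1'b0));"
cleaned = drop_tied_off_clock_pins(example, ["clk_unused"])
```

Only the pins named in `clock_pins` are removed, so non-clock pins that are legitimately tied to logic 0 (here `rst_n`) are left untouched.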
Chapter 7 Conclusion
This project started with the constraint analysis of SiLago, with a
focus on solving the coupling problem. Then, based on a previous
Ph.D. thesis, power distribution was carried out, focusing on IR
Discharge and on the development of power grids at the chip-level,
region-level, and SiLago-level supply. The project then moved to
the timing domain, in which clocking was done, in parallel with the
characterization experiments, using Global Routing Cells (GRCs) by
providing each SiLago block with a programmable delay line.
Finally, hierarchical power grids were designed. Because certain
technology files were unavailable, a complete IR drop analysis was
infeasible, but dynamic power reports were generated at low and
high circuit activity. The key contributions of this thesis are
listed below:
1. Power Characterization: The power model of the SiLago platform
was built by power characterization. To achieve this goal, a hybrid
library learning based characterization methodology was adopted.
Using an example circuit with an aggressor and a victim, it was
shown that coupling (cross-talk) noise was sufficient to cause a
functional failure. Later, using the graphs of experimental power
consumption of the SiLago blocks for CNN, DCT2D and FR obtained by
power characterization, it was shown that even abnormal behavior of
the circuit can be explained.
2. Clock Tree Synthesis: A fixed-skew clock network was achieved by
inserting predictable and programmable clock buffers into the
SiLago blocks. The time complexity of designing this clock tree was
found to be O(1), because the designer has to take only a single
design-time decision about the inclusion of buffers in the clock
tree network. Experiments compared the capacitance, internal power,
switching power, leakage power, total power and design time of the
CAD-generated clock tree and the clock tree generated by the SiLago
framework. These experiments show that designing the clock tree in
the SiLago framework was straightforward and that, in almost all
cases, it outperformed the CAD-generated clock tree.
3. Power Grid Dimensioning: A method was developed to dimension the
power grids of the SiLago blocks. The power distribution was
hierarchical and was designed so that SiLago blocks can be composed
by abutment. Later, experiments were performed using the resources
available for IR drop analysis. The dynamic behavior of the circuit
was analyzed by introducing high and low activity in the power and
ground network of the SiLago blocks.
48
Bibliography
[1] Ahmed Hemani, Syed Mohammed Asad Hassan Jafri, and Shayesteh
Masoumian.
“Synchoricity and NOCs Could Make Billion Gate Custom Hardware
Centric
SOCs Affordable.” In: Proceedings of the Eleventh IEEE/ACM
International Sym-
posium on Networks-on-Chip. NOCS ’17. Seoul, Republic of Korea:
ACM, 2017,
8:1–8:10. isbn: 978-1-4503-4984-0. doi: 10.1145/3130218.3132339.
url: http:
//doi.acm.org/10.1145/3130218.3132339.
[2] Ahmed Hemani et al. “The SiLago Solution: Architecture and
Design Methods for
a Heterogeneous Dark Silicon Aware Coarse Grain Reconfigurable
Fabric.” In: The
Dark Side of Silicon: Energy Efficient Computing in the Dark
Silicon Era. Ed. by
Amir M. Rahmani et al. Cham: Springer International Publishing,
2017, pp. 47–94.
doi: 10.1007/978-3-319-31596-6_3. url:
https://doi.org/10.1007/978-3-
319-31596-6_3.
[3] S. M. A. H. Jafri, N. Farahini, and A. Hemani. “SiLago-CoG:
Coarse-Grained Grid-
Based Design for Near Tape-Out Power Estimation Accuracy at High
Level.” In:
2017 IEEE Computer Society Annual Symposium on VLSI (ISVLSI). July
2017,
pp. 25–31. doi: 10.1109/ISVLSI.2017.15.
[4] Muhammad Ali Shami. “Dynamically Reconfigurable Resource
Array.” QC 20120917.
PhD thesis. KTH, Electronic Systems, 2012, pp. xix, 196. isbn:
978-91-7501-473-9.
[5] M. Adeel Tajammul et al. “A NoC based distributed memory
architecture with
programmable and partitionable capabilities.” In: NORCHIP 2010.
Nov. 2010,
pp. 1–6. doi: 10.1109/NORCHIP.2010.5669440.
[6] M. A. Tajammul, M. A. Shami, and A. Hemani. “Segmented Bus
Based Path Setup
Scheme for a Distributed Memory Architecture.” In: 2012 IEEE 6th
International
Symposium on Embedded Multicore SoCs. Sept. 2012, pp. 67–74. doi:
10.1109/
MCSoC.2012.34.
[7] N. Farahini et al. “Physical design aware system level
synthesis of hardware.” In:
2015 International Conference on Embedded Computer Systems:
Architectures,
Modeling, and Simulation (SAMOS). July 2015, pp. 141–148. doi: 10 .
1109 /
SAMOS.2015.7363669.
[8] Nasim Farahini et al. SiLago : A Structured Layout Scheme to
Enable Efficient
High Level and System Level Synthesis. Tech. rep. 2016:13. QC
20160429. KTH,
Electronics and Embedded Systems, 2016.
Clock Tree Synthesis. url:
https://www.cadence.com/content/cadence-
www/global/en_US/home/training/all-courses/86198.html.
[10] W. J. Dally, C. Malachowsky, and S. W. Keckler. “21st century
digital design
tools.” In: 2013 50th ACM/EDAC/IEEE Design Automation Conference
(DAC).
May 2013, pp. 1–6. doi: 10.1145/2463209.2488850.
[11] S. Borkar. “Design perspectives on 22nm CMOS and beyond.” In:
2009 46th
ACM/IEEE Design Automation Conference. July 2009, pp. 93–94. doi:
10.1145/
1629911.1629940.
[12] Hai Zhou, N. Shenoy, and W. Nicholls. “Timing analysis with
crosstalk as fixpoints
on complete lattice.” In: Proceedings of the 38th Design Automation
Conference
(IEEE Cat. No.01CH37232). 2001, pp. 714–719. doi:
10.1109/DAC.2001.156230.
[13] Florentin Dartu and Lawrence T. Pileggi. “Calculating
Worst-case Gate Delays
Due to Dominant Capacitance Coupling.” In: Proceedings of the 34th
Annual De-
sign Automation Conference. DAC ’97. Anaheim, California, USA: ACM,
1997,
pp. 46–51. isbn: 0-89791-920-3. doi: 10.1145/266021.266033. url:
http://
doi.acm.org/10.1145/266021.266033.
[14] A. K. Palit et al. “Analysis of crosstalk coupling effects
between aggressor and
victim interconnect using two-port network model.” In: Proceedings.
8th IEEE
Workshop on Signal Propagation on Interconnects. May 2004, pp.
81–84. doi:
10.1109/SPI.2004.1409011.
[15] L. Lavagno et al. “EDA for IC implementation, circuit design,
and process tech-
nology.” In: US: CRC Press, 2016, pp. 610–613. isbn:
9781482254617.
[16] Igor Keller, King Ho Tam, and Vinod Kariat. “Challenges in
Gate Level Modeling
for Delay and SI at 65Nm and Below.” In: Proceedings of the 45th
Annual Design
Automation Conference. DAC ’08. Anaheim, California: ACM, 2008, pp.
468–473.
isbn: 978-1-60558-115-6. doi: 10.1145/1391469.1391590. url:
http://doi.
acm.org/10.1145/1391469.1391590.
[17] J. M. Wang, Pinhong Chen, and O. Hafiz. “A new continuous
switching window
computation with crosstalk noise.” In: 16th Symposium on Integrated
Circuits and
Systems Design, 2003. SBCCI 2003. Proceedings. Sept. 2003, pp.
261–266. doi:
10.1109/SBCCI.2003.1232839.
[18] E. G. Friedman. “Clock distribution networks in synchronous
digital integrated
circuits.” In: Proceedings of the IEEE 89.5 (May 2001), pp.
665–692. issn: 0018-
9219. doi: 10.1109/5.929649.
[19] Matthew R. Guthaus, Gustavo Wilke, and Ricardo Reis.
“Revisiting Automated
Physical Synthesis of High-performance Clock Networks.” In: ACM
Trans. Des.
Autom. Electron. Syst. 18.2 (Apr. 2013), 31:1–31:27. issn:
1084-4309. doi: 10.
1145 / 2442087 . 2442102. url: http : / / doi . acm . org / 10 .
1145 / 2442087 .
4020-8022-0_9. url: https://doi.org/10.1007/1-4020-8022-0_9.
[21] Ting-Hai Chao et al. “Zero skew clock routing with minimum
wirelength.” In:
IEEE Transactions on Circuits and Systems II: Analog and Digital
Signal Process-
ing 39.11 (Nov. 1992), pp. 799–814. issn: 1057-7130. doi:
10.1109/82.204128.
[22] T. H. Chao, Y. C. Hsu, and J. M. Ho. “Zero skew clock net
routing.” In: [1992]
Proceedings 29th ACM/IEEE Design Automation Conference. June 1992,
pp. 518–
523. doi: 10.1109/DAC.1992.227749.
[23] J. Cong and Cheng-Kok Koh. “Minimum-cost bounded-skew clock
routing.” In:
Circuits and Systems, 1995. ISCAS ’95., 1995 IEEE International
Symposium on.
Vol. 1. Apr. 1995, 215–218 vol.1. doi:
10.1109/ISCAS.1995.521489.
[24] M. R. Guthaus, D. Sylvester, and R. B. Brown. “Clock buffer
and wire sizing
using sequential programming.” In: 2006 43rd ACM/IEEE Design
Automation
Conference. July 2006, pp. 1041–1046. doi:
10.1145/1146909.1147171.
[25] L. Lavagno et al. “EDA for IC implementation, circuit design,
and process tech-
nology.” In: US: CRC Press, 2016, pp. 272–273. isbn:
9781482254617.
[26] J. Cong et al. “Bounded-skew clock and Steiner routing under
Elmore delay.” In:
Proceedings of IEEE International Conference on Computer Aided
Design (IC-
CAD). Nov. 1995, pp. 66–71. doi: 10.1109/ICCAD.1995.479993.
[27] “The Elmore Delay as a Bound for RC Trees with Generalized
Input Signals.”
In: 32nd Design Automation Conference. 1995, pp. 364–369. doi:
10.1109/DAC.
1995.249974.
[28] Weiping Shi and Zhuo Li. “A fast algorithm for optimal buffer
insertion.” In: IEEE
Transactions on Computer-Aided Design of Integrated Circuits and
Systems 24.6
(June 2005), pp. 879–891. issn: 0278-0070. doi:
10.1109/TCAD.2005.847942.
[29] P. K. Chan and K. Karplus. “Computing Signal Delay in General
RC Networks
by Tree/Link Partitioning.” In: 26th ACM/IEEE Design Automation
Conference.
June 1989, pp. 485–490. doi: 10.1109/DAC.1989.203445.
[30] H. H. Chen and D. D. Ling. “Power Supply Noise Analysis
Methodology For
Deep-submicron Vlsi Chip Design.” In: Proceedings of the 34th
Design Automation
Conference. June 1997, pp. 638–643. doi:
10.1109/DAC.1997.597223.
[31] “Electronic Design Automation: Synthesis, Verification, and
Test.” In: ed. by
Laung-Terng Wang, Yao-Wen Chang, and Kwang-Ting (Tim) Cheng. San
Fran-
cisco, CA, USA: Morgan Kaufmann Publishers Inc., 2009, pp. 751–850.
isbn:
9780080922003.
[32] Shen Lin and N. Chang. “Challenges in power-ground integrity.”
In: IEEE/ACM
International Conference on Computer Aided Design. ICCAD 2001.
IEEE/ACM
Digest of Technical Papers (Cat. No.01CH37281). Nov. 2001, pp.
651–654. doi:
Appendix A Scripts
# Mentor Graphics Questasim commands
vcom -vopt -work work cp_file_name
vcom -vopt -work work tb_file_name
vsim +nowarnTFMPC -t ns -novopt tb_work_name
run 165 ns
vcd file tbwork.vcd
run 100 ns
quit -sim
The following script generates the power report in Innovus. It
takes the .vcd file as input, along with the .dat file used to
restore the previously saved place-and-route design.
# Cadence Innovus commands
set_power_analysis_mode -reset
set_power_analysis_mode -method static -corner max -create_binary_db true -write_static_currents true -honor_negative_energy true -ignore_control_signals true
set_power_output_dir -reset
set_power_output_dir /reports
set_default_switching_activity -reset
set_default_switching_activity -input_activity 0.0 -period 10.0
read_activity_file -reset
set_power -reset
set_dynamic_power_simulation -reset
report_power -outfile /PowerTotal.txt -clock_network all -hierarchy all -cell_type all -power_domain all -pg_net all -net -sort total
The following script should be added after the point where the
power report is generated. It generates separate power reports for
the individual instances of the SiLago block, in addition to
PowerTotal.txt.
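# Assumed separator: $us is used below but not defined in the original listing
set us "_"
set c 0
set r 0
# Iterate over the 5 columns and 2 rows of SiLago blocks
for {set c 0} {$c < 5} {incr c} {
    for {set r 0} {$r < 2} {incr r} {
        set Sc "SILEGO_cell"
        set MTRF "MTRF_cell"
        # Instance name of the block at column $c, row $r
        set silego $Sc$us$c$us$r
    }
}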
A.2 Clock Tree Synthesis
Below is a minimal script for physical synthesis, in which the chip
dimension is 1000x500 microns with 20 microns of extra space for
power stripes. Only the first block and the clock pin are placed,
because the remaining blocks and pins follow the same placement
pattern.
# Cadence Innovus commands
set init_design_uniquify 1
setDesignMode -process 40
set init_gnd_net {VSS}
set init_pwr_net {VDD}
set init_lef_file {library.lef}
set init_mmmc_file {mmmc.tcl}
set init_top_cell {top_module}
set init_verilog {fabric.v}
init_design
floorPlan -site core -s 1000 500 20 20 20 20
relativeFPlan --relativePlace {SILEGO block 0 0} TR Bottom Core Boundary TL 40 40
editPin -use CLOCK -fixedPin 1 -fixOverlap 1 -unit MICRON -spreadDirection clockwise -side Top -layer 2 -spreadType start -spacing 0.14 -start 100.0 220.0 -pin clk
placeDesign
assignPtnPin
clonePlace
extractRC
# The command name is missing in the original listing; the options
# below match setAnalysisMode, which is assumed here.
setAnalysisMode -useOutputPinCap true -sequentialConstProp false -timingSelfLoopsNoSkew false -enableMultipleDriveNet true -clkSrcPath true -warn true -usefulSkew true -analysisType onChipVariation -log true
timeDesign -postRoute -pathReports -drvReports -slackReports -numPaths 50 -prefix fabric_postRoute -outDir timingReports
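The floorplan arithmetic implied by the script can be checked with a short Python sketch. It assumes that the six numbers given to `floorPlan -s` are the core width and height followed by the left, bottom, right and top margins between the core and the die edge, all in microns. The helper name `die_size_um` is hypothetical.

```python
def die_size_um(core_w, core_h, left, bottom, right, top):
    """Die width and height for a core box plus per-side margins (microns)."""
    return core_w + left + right, core_h + bottom + top

# Values from the floorPlan command: 1000x500 core, 20 micron margin per side.
die_w, die_h = die_size_um(1000, 500, 20, 20, 20, 20)
```

Under this reading, the 20 micron margins reserved for power stripes enlarge the 1000x500 micron core to a 1040x540 micron die.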