This document is downloaded from DR‑NTU (https://dr.ntu.edu.sg), Nanyang Technological University, Singapore.
Design of minimum energy driven ultra‑low voltage SRAMs and D flip‑flop
Wang, Bo
2015
Wang, B. (2015). Design of minimum energy driven ultra‑low voltage SRAMs and D flip‑flop. Doctoral thesis, Nanyang Technological University, Singapore.
https://hdl.handle.net/10356/65355
https://doi.org/10.32657/10356/65355
Downloaded on 04 Oct 2021 00:15:36 SGT
Design of Minimum Energy Driven
Ultra-low Voltage
SRAMs and D Flip-Flop
WANG BO
SCHOOL OF ELECTRICAL AND ELECTRONIC
ENGINEERING
2015
A thesis submitted to the Nanyang Technological University
in partial fulfillment of the requirement for the degree of
Doctor of Philosophy
2015
Acknowledgement
My first and sincere gratitude goes to my supervisor, Prof. Tony Tae-Hyoung Kim,
for his continuous guidance and significant support throughout my PhD life. His
well-planned research perspective helped me greatly to establish my research goals
and develop my research path. His enthusiasm for training students gave me the
opportunity to discuss with him face to face whenever I needed to and inspired me
enormously. His generous financial support gave me many chances to attend
international conferences and broadened my views. Without his enlightenment and
encouragement, I would never have been able to make any progress in
research.
I would like to thank my co-supervisor, Dr Zhou Jun, for his dedication to my PhD
research. He offered me the chance to work with him and learn from him. He is very
open and ready to help with either idea sharing or career planning, which has
benefited me immensely. I would also like to express my appreciation to Prof. Yeo
Kiat Seng, Prof. See Kye Yak, Prof. Siek Liter, Prof. Goh Wang Ling, Prof. Zhang
Yue Ping, Prof. Boon Chirn Chye, Prof. Chan Pak Kwong, Prof. Lam Ying Hung,
Prof. Kong Zhi Hui, Prof. Zheng Yuanjing, Dr Tan Khen Sang and all the technical
staff in the VIRTUS lab, the VLSI lab and the IC Design lab.
I want to give my thanks to my colleagues both in NTU and in the Institute of
Microelectronics, A*STAR for their enormous help. They are Prof. Je Minkyu, Dr
Liu Xin, Dr Wang Chao, Dr Hylas Lam, Dr Do Ahn Tuan, Dr Mohammed Sultan
Mohiuddin Siddiqui, Liu Lizhuang, Chang Kah Hyong, Lan Jingjing, Lim Ching
Yun, Lee Zhao Chuan, Seyed Mohammad Ali Zeinolabedin, Aung Myat Thu Linn,
Le Ba Ngoc, Karim Hany Mohamed Rawy, Abhik Das, Neelakantan Narasimman, J.
Karthik Gopal, Kavitha Velayudhan, Achiranshu Garg, Qi Li, Truc Quynh Nguyen
and Yeo Yuan Lin. I am very pleased to work with them and be friends with them.
My thanks also go to Dr Chong Sau Siong, Dr He Xiaofeng, Dr Li Sizhen, Dr Lu
Zhenghao, Dr Chris Yeung, Dr Huang Xiwei, Dr Fei Wei, Dr Jeremy Low Yung
Shern, Dr Joshua Low Yung Lih, Dr Yu Jun, Dr Liu Chang, Dr Li Yan, Ms Yang
Wanlan, Howard Tang, Zou Qiong, Ye Wanxin, Zhu Yao, Yang Yan, Chen Yi, Cai
Deyun, Chen Zihao, Wang Yanmei, Xu Shanshan, Meng Fanyi, Lu Lu, Zhang Ying,
Sun Junyi, Feng Guangyin, Huang Nan, Feng Xiaohua, Yao Enyi, Qiu Lei, Tay
Thian Fatt, Zhang Le, Wang Yong, Deng Tianwei, Wu Chundong, Yang Yongkui,
Han Beibei, Yi Xiang, Zhang Xiangyu, Qian Xinyuan, Yu Hang, Chen Yi, Tan
Xiaoliang, and Zhao Jianming for their sharing and help.
Last but not least, I give my heartiest thanks to my parents who always support me
with love. They create a world for me and guide me to explore a bigger one. They
are the greatest mom and dad.
Table of Contents
Acknowledgement
List of Figures
List of Tables
Chapter 1 Introduction
1.1 Motivation
1.2 Research Objectives and Contributions
1.3 Organizations
Chapter 2 Background and Literature Review
2.1 Conventional 6T Single-port SRAMs
2.1.1 6T Single-port SRAM Operation
2.1.2 Challenges of 6T SRAMs for Ultra-low Voltage Operation
2.1.3 Design Techniques of 6T SRAMs to Improve Minimum Voltage
2.2 Conventional 8T Dual-port SRAMs
2.2.1 8T Dual-port SRAM Operation
2.2.2 Challenges of 8T SRAMs for Ultra-low Voltage Operation
2.2.3 Design Techniques of 8T SRAMs with Ultra-low Supply Voltage
2.3 Conventional D Flip-Flop Circuits
2.3.1 Mainstream DFF Circuits and Timing Properties
2.3.2 Design Challenges of DFFs for Energy Efficient Applications
2.3.3 Design Techniques for Energy Efficient DFFs
Chapter 3 SRAM Device and Circuits Optimization toward Energy Efficiency in Multi-Vth CMOS
3.1 Background
3.2 Analysis of SRAM Energy
3.2.1 SRAM Energy Modeling
3.2.2 Effects of Supply Voltage Scaling and Threshold Voltage on Energy Efficiency
3.2.3 Effects of Multi-Vth Devices on SRAM Energy
3.3 Minimum Energy-Driven SRAM Design Utilizing Multi-Vth Devices
3.3.1 Analysis of SRAM Energy without Multi-Vth Devices
3.3.2 Analysis of SRAM Energy with Multi-Vth Devices
3.4 Design Techniques for SRAM Energy Efficiency Improvement Utilizing Multi-Vth Devices
3.4.1 Effect of Power Reduction Techniques on SRAM Energy
3.4.2 Effect of Performance Boosting Techniques on SRAM Energy
3.4.3 Combination Effect of Power Reduction and Performance Boosting Techniques
3.5 Summary
Chapter 4 Design of an Ultra-low Voltage 9T SRAM with Equalized Bitline Leakage and CAM-assisted Energy Efficiency
4.1 Background
4.2 Proposed SRAM Design Techniques for Ultra-low Voltage Operation
4.2.1 A Novel 9T SRAM Cell
4.2.2 Analysis of Static Noise Margin and Write Margin
4.2.3 Bitline Leakage Equalization with the Worst Case of Leakage
4.3 Proposed Energy Efficient Improvement Technique
4.3.1 Limitation of MTCMOS on SRAM Energy Efficiency
4.3.2 Proposed CAM-assisted Write Performance Boosting Technique
4.4 Test Chip Implementation and Measurement
4.5 Summary
Chapter 5 Design of an Ultra-low Voltage Disturb-suppressed Dual-port SRAM
5.1 Background
5.2 Proposed 12T DP SRAM Cell
5.2.1 12T SRAM Cell Design
5.2.2 Implementation of Virtual Ground for Bitline Leakage Reduction
5.3 Disturb Suppression of 12T DP SRAM in Common-Row-Access Mode
5.3.1 Analysis of Disturb Occurrence Probability
5.3.2 Analysis of Read Disturb
5.3.3 Analysis of Write Disturb
5.4 Measurement Results
5.5 Summary
Chapter 6 Design of an Ultra-low Voltage, Energy-Delay Efficient Charge-Pumped DFF
6.1 Background
6.2 A Novel Sub-threshold DFF
6.2.1 DFF Circuit Design and Near-/Sub-threshold Operation
6.2.2 Inverse-Narrow-Width-Effect-Aware Sizing Strategy
6.3 Analysis of CPDFF with TGFF and ACFF
6.3.1 C-Q Delay Investigation
6.3.2 Comparison of Setup Time and Hold Time
6.3.3 Analysis of Energy-Delay Product
6.4 Test Chip Implementation and Measurement
6.5 Summary
Chapter 7 Conclusions and Future Works
7.1 Conclusions
7.2 Future Works
Publications
Bibliography
List of Figures
Fig. 1.1 Process feature size trend [1].
Fig. 1.2 Power dissipation as a function of VDD for the 16 kb 9T SRAM [6].
Fig. 1.3 Energy dissipation as a function of VDD for the 16b 1024-point FFT [7].
Fig. 2.1 Schematic of conventional 6T SRAM cell.
Fig. 2.2(a) Concept of read operation in a 6T SRAM bitline. (b) Concept of write operation in a 6T SRAM bitline.
Fig. 2.3 Cell stability degradation of 6T SRAM cell due to read disturb.
Fig. 2.4 Illustration of SNM of 6T SRAM cell.
Fig. 2.5(a) 6T SRAM read SNM by VDD sweeping. (b) 6T SRAM write margin by VDD sweeping.
Fig. 2.6(a) Schematic of 8T SRAM cell [13]. (b) Schematic of 10T SRAM cell [17].
Fig. 2.7(a) Schematic of conventional 8T dual-port SRAM. (b) Parallel memory access of 8T dual-port SRAM [30].
Fig. 2.8(a) Illustration of different-row-different-column access. (b) Illustration of different-row-same-column access. (c) Illustration of common-row-different-column access. (d) Illustration of common-row-common-column access [30].
Fig. 2.9 Comparison of read SNMs in different-row access and common-row access situations [30].
Fig. 2.10 Concept of access circumvention scheme for dual-port 8T SRAM [30].
Fig. 2.11 Concept of active bitline equalizing technique for dual-port 8T SRAM [32].
Fig. 2.12(a) Write-disturb detector for 8T dual-port SRAM. (b) Coordinately-activated write drivers for dual-port SRAM [32].
Fig. 2.13 Schematic of transmission-gate FF (TGFF).
Fig. 2.14 Schematic of True-Single-Phase-Clocked (TSPC) FF.
Fig. 2.15 Illustrating DFF setup time and hold time [34].
Fig. 2.16 Schematic of adaptive-coupling FF (ACFF) [37].
Fig. 2.17 Schematic of Static-Single-Phase Contention-Free FF (S2CFF) and its operation [38].
Fig. 3.1 Simplified SRAM array diagram for energy analysis.
Fig. 3.2 Schematic of an 8T decoupled SRAM cell with multi-Vth devices.
Fig. 3.3 Normalized energy of three SRAMs designed by three different device types (i.e. HVT, SVT and LVT). All transistors in one SRAM have the same Vth.
Fig. 3.4 Impact of device selection on normalized energy of three SRAMs. Note that HVT devices are employed for read port in all SRAMs. The remaining transistors in each SRAM cell adopt one device type.
Fig. 3.5 Normalized delay values of SRAM read and write operations designed with HVT devices.
Fig. 3.6 Comparison of read delay (LVT) with write delay implemented with multi-Vth devices (SVT and HVT).
Fig. 3.7 Normalized energy of SRAMs utilizing three different device types (i.e. HVT, SVT and LVT) for data storage and write paths. Note that LVT devices are used in read port.
Fig. 3.8 Comparison of leakage current over various device combinations.
Fig. 3.9 Summary of normalized minimum energy consumption over various device combinations.
Fig. 3.10 Summary of normalized leakage current over various device combinations.
Fig. 3.11 8T decoupled SRAM cells with leakage reduction techniques: (a) column-interleaved and (b) read buffer foot control.
Fig. 3.12 Effect of column-interleaved scheme on SRAM energy. The reference design is using SVT devices in the write paths and LVT devices in the read path, which is also shown in Fig. 3.7.
Fig. 3.13 Simplified 8T SRAM schematic adopting boosted wordline scheme.
Fig. 3.14 Improvement of energy efficiency by boosting write performance. Additional energy overhead induced by the boosting voltage generation is not considered in this simulation.
Fig. 3.15 Comparison of normalized minimum energy consumption with write performance techniques.
Fig. 3.16 Improvement of minimum energy after adopting the column-interleaved scheme (Fig. 3.11(a)) and the boosted voltage scheme (Fig. 3.13). Multiplex ratio of 32 is assumed.
Fig. 3.17 Comparison of normalized SRAM minimum energy consumption.
Fig. 4.1(a) Proposed 9T SRAM cell implemented with HVT devices in write paths and LVT devices in read port. (b) Layout of the 9T SRAM cell based on 65 nm logic rules.
Fig. 4.2(a) 9T read SNMs compared with 6T and 10T SNMs with different voltages. (b) Distribution of 9T read SNMs at 0.4 V.
Fig. 4.3 Comparison of write margins of HVT and SVT devices.
Fig. 4.4(a) Conventional bitline sensing in the 8T SRAM [13]. (b) Concept of proposed 9T bitline sensing improvement by bitline leakage equalization technique.
Fig. 4.5 Improved RBL swing and sensing window of 9T bitline at 0.2 V and fCLK = 50 kHz with the worst case of leakage.
Fig. 4.6 Histogram of RBL swings of 9T SRAM at 0.2 V with the worst case of leakage from 10k-point Monte Carlo runs.
Fig. 4.7 Improved RBL swing with different numbers of cells and temperature. Typical corner is used in the simulation.
Fig. 4.8 Definition of data flipping delay and data full development delay. Difference of the data flipping delay and the full development delay substantially increases with scaling VDD.
Fig. 4.9 Read failure due to data non-full development in SRAM cell nodes.
Fig. 4.10(a) Read and write delays against scaling VDD at TT corner. (b) Read and write delays against scaling VDD at FNSP corner.
Fig. 4.11 Data paths of write and read operations in the CAM-assisted SRAM circuit.
Fig. 4.12 Delay of four different operations of SRAM and CAM circuits.
Fig. 4.13 Circuit diagram of CAM array, search logics and miniature SRAM array.
Fig. 4.14 Timing diagram of SRAM array and CAM circuit during succession of write and read operations.
Fig. 4.15 Faster write completion in CAM array than SRAM array at different corners.
Fig. 4.16 Measured (a) leakage current of the test chip and (b) write, read and average power at maximum operating frequency.
Fig. 4.17 Measured (a) read access time and (b) improved operating frequency of the CAM-assisted SRAM.
Fig. 4.18 Measured energy of SRAM only and the CAM-assisted SRAM.
Fig. 4.19(a) Readout waveforms captured at 0.26 V. (b) Die micro-photograph.
Fig. 5.1(a) Schematic of proposed 12T dual-port SRAM cell. (b) Layout of the 12T dual-port cell.
Fig. 5.2(a) Leakage problem in conventional 2T read port. (b) Read bitline leakage suppression by implementation of virtual ground technique.
Fig. 5.3 Implementation of virtual ground technique.
Fig. 5.4(a) Cell stability issue in common-row-different-column access. (b) Cell stability issue in common-row-common-column access.
Fig. 5.5 Comparison of worst SNM scenarios in conventional DP SRAM cell and proposed DP SRAM cell.
Fig. 5.6(a) Illustration of read disturb in the 8T DP SRAM cell. (b) Read disturb suppression in the 12T DP SRAM cell.
Fig. 5.7 Simulated waveforms of read disturb for the 8T DP cell and the 12T DP cell at VDD = 0.4 V, FNSP corner. Note that the data in the 8T cell flips due to the read disturb whereas the data in the 12T cell is maintained.
Fig. 5.8 Comparison of read SNMs of the 8T DP SRAM and the 12T DP SRAM.
Fig. 5.9(a) Write disturb illustration of the conventional 8T DP SRAM. (b) Write disturb suppression from the 12T DP SRAM.
Fig. 5.10 Circuit of hierarchical write bitline.
Fig. 5.11 Architecture of the 65 nm test chip.
Fig. 5.12(a) Measured leakage current. (b) Measured read access time.
Fig. 5.13(a) Measured power consumption of read and write. (b) Measured energy per operation.
Fig. 5.14(a) Micro-photograph of the test chip. (b) Captured waveforms of the RBL at VDD = 0.4 V.
Fig. 6.1 Schematic of proposed charge-pumped DFF.
Fig. 6.2 Simulated output waveforms of the two charge pumps at 0.2 V, 1 kHz.
Fig. 6.3 NMOS threshold voltage vs. transistor width at different supply voltages with a 90 nm CMOS technology [61].
Fig. 6.4 Simulated output waveforms of CPDFF, TGFF and ACFF at 0.4 V.
Fig. 6.5 Monte Carlo simulation results of C-Q delay: (a) data ‘0’ and (b) data ‘1’. The proposed CPDFF shows less variability.
Fig. 6.6(a) Setup time of CPDFF, TGFF and ACFF at different process corners and (b) Hold time of CPDFF, TGFF and ACFF at different process corners.
Fig. 6.7 Simulated Energy-Delay product against data activity at VDD = 0.4 V.
Fig. 6.8 Architecture of the FIFO circuit.
Fig. 6.9 Measured C-Q delay against VDD.
Fig. 6.10(a) Measured power against VDD. (b) Measured power against frequency.
Fig. 6.11 Measured energy-delay product.
Fig. 6.12 Measured power of the 2 FIFOs at 0.3 V with 10% data activity.
Fig. 6.13(a) Screen capture of CPDFF output waveforms at 0.18 V and (b) die micro-photograph.
List of Tables
Table 3.1 Parameter summary on energy analysis simulation
Table 4.1 Design metric comparison with various ultra-low voltage SRAMs.
Table 6.1 Performance improvement from INWE-aware sizing strategy.
Summary
The aggressive CMOS technology shrinking driven by cost reduction, performance
improvement and power minimization enables integration of billions of transistors
onto a single chip. State-of-the-art System-on-Chips (SoCs) incorporate more cores,
larger capacity caches and more application-specific hardware accelerators,
resulting in significant increase of power density. To reduce power and improve
energy efficiency, ultra-low voltage operation is widely employed. By lowering the
supply voltage from nominal level to near or beneath transistor’s threshold voltage
(known as near-/sub-threshold operation), the power is substantially suppressed and
the energy efficiency is optimized. However, various challenging issues, including
high process-voltage-temperature (PVT) variation sensitivity and the lack of a
systematic design methodology, limit the utility of ultra-low voltage circuits. New
design methodology with minimum energy consideration to enhance performance,
combat variability and suppress leakage merits extensive and in-depth exploration.
In the thesis, the characteristics of transistors at near-/sub-threshold region are
studied and their impact on energy consumption is investigated. Based on that,
ultra-low voltage circuits with improved performance, enhanced variation-resilience
and high energy/energy-delay efficiency are developed.
The main goal of the research is to explore and demonstrate optimal solutions of
Static Random-Access Memory (SRAM) and D Flip-Flop (DFF) circuits in energy
or energy-delay space and overcome the limitations imposed by ultra-low supply
voltage. Specifically, the outcomes are demonstrated through an ultra-low voltage
9-transistor (9T) single-port SRAM, a near-threshold 12-transistor (12T) dual-port
SRAM and an ultra-low voltage, energy-delay-efficient 16-transistor (16T) DFF:
1) As preliminary work, energy efficiency analysis of single-port SRAM
utilizing multi-threshold CMOS (MTCMOS) technology is presented. The
work investigates various device combinations and reveals the optimum
device selection for the best energy efficiency from an MTCMOS perspective.
2) A 9T SRAM macro is developed with MTCMOS technology to enhance
read performance and at the same time minimize leakage. In the 9T SRAM
cell, a 3T-based novel read port is proposed to equalize read bitline (RBL)
leakage and improve RBL sensing margin. To optimize energy efficiency, a
miniature Content-Addressable-Memory-assisted (CAM-assisted) circuit is
integrated to conceal the slow data development after data flipping in write
operation and therefore enhance the operating frequency. A 16 kb SRAM
test chip is fabricated in 65 nm CMOS technology. The operating voltage of
the test chip is scalable down to 0.26 V. Minimum energy of 2.07 pJ is
achieved at 0.4 V with 40.3% improvement. Energy efficiency is enhanced
by 29.4% between 0.38 V and 0.6 V.
3) A 12T dual-port SRAM is proposed to suppress disturb in the
common-row-access mode and improve read-ability, write-ability and cell
stability. The novel dual-port SRAM cell significantly reduces the probability
of disturb, increases the resilience against disturb and extends the operating
voltage to the near-threshold region. In addition, hierarchical bitlines and
virtual ground schemes are employed to further improve the performance and
leakage of the SRAM circuit. A fabricated 16 kb 12T dual-port SRAM circuit
shows successful dual-port operations down to 0.4 V in the common-row-access
mode.
4) A 16T DFF with a low energy-delay product for sub-threshold applications
is presented. The device count of the proposed DFF is minimized by
eliminating the clock buffer and replacing transmission gates with pass gates.
To reduce the Clock-to-Q delay and improve variation resilience, two charge
pumps and an inverse-narrow-width-effect-aware sizing strategy are utilized,
improving the performance by 23%. The fabricated DFF is fully functional
down to 0.18 V and shows an energy-delay product of 13.1 pJ·ns at 100%
data activity, achieving an improvement of 51.8% compared to the
transmission-gate FF.
Chapter 1 Introduction
1.1 Motivation
Semiconductor process technology continues to scale by 0.7× every 2 years, as Fig.
1.1 depicts [1]. Aggressive CMOS technology scaling, driven by cost
reduction, performance improvement and power minimization, enables integration
of billions of transistors into a single chip with only hundreds of mm² of area.
State-of-the-art System-on-Chips (SoCs) incorporate more cores, larger-capacity
caches, more radio frequency components and more application-specific hardware
accelerators than ever, resulting in a significant increase in power density [1]-[4].
The power consumption trend of server processors from 1990 to 2010 is reported in [5].
The total power increases roughly 20 times over 10 years, with an increasing
contribution from leakage power. The prevalence of big data and Gbps applications
intensifies this trend and makes power minimization imperative.
Fig. 1.1 Process feature size trend [1].
The most straightforward method to reduce power consumption is voltage scaling.
By lowering the supply voltage from the nominal level to near or below the
transistor’s threshold voltage (known as near-/sub-threshold operation), the power
is reduced by several orders of magnitude. As Fig. 1.2 shows, the
active power of the 16 kb 9-transistor (9T) Static Random-Access Memory (SRAM)
decreases from 1.3 mW to 1.2 µW when the supply voltage is lowered from 1.2 V to
0.24 V, a power saving of more than 1000 times [6]. Because of this attractive
prospect, substantial research on ultra-low voltage circuits has been performed
[7]-[10]. However, various challenges, including fragile sub-threshold
operation, high process-voltage-temperature (PVT) variation sensitivity and the lack
of a systematic design methodology, limit the utility of ultra-low voltage circuits.
New design methodologies with novel circuit techniques to enhance performance,
combat variability and suppress leakage merit extensive and in-depth
exploration.
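The power figures quoted above can be checked with a few lines of arithmetic. The sketch below is illustrative only: the four constants come from the text, while the decomposition into a V² term and a residual term assumes the textbook dynamic-power relation P ≈ C·V²·f, which is an assumption introduced here and not a model taken from [6].

```python
import math

# Figures quoted in the text for the 16 kb 9T SRAM [6].
P_NOMINAL_W = 1.3e-3  # active power at 1.2 V
P_SCALED_W = 1.2e-6   # active power at 0.24 V
V_NOMINAL = 1.2       # volts
V_SCALED = 0.24       # volts

# Overall power saving and its size in orders of magnitude.
saving = P_NOMINAL_W / P_SCALED_W
orders_of_magnitude = math.log10(saving)

# Assumption (not from the source): dynamic power scales as C * V^2 * f.
# The V^2 term alone then accounts for only a 25x reduction.
v_squared_factor = (V_NOMINAL / V_SCALED) ** 2

# Whatever remains must come from other terms, presumably the lower
# operating frequency at low supply voltage.
residual_factor = saving / v_squared_factor

print(f"power saving: {saving:.0f}x ({orders_of_magnitude:.2f} orders of magnitude)")
print(f"V^2 contribution: {v_squared_factor:.0f}x, residual: {residual_factor:.1f}x")
```

Under that assumption, most of the saving is not explained by the voltage term alone, which is consistent with the performance penalty of near-/sub-threshold operation discussed in this chapter.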
[Figure: active power (µW, logarithmic scale) versus supply voltage (V, 0.2 to 1.2); curves for write power and read power; Temp. = 27 ºC]
Fig. 1.2 Power dissipation as a function of VDD for the 16 kb 9T SRAM [6].
Ultra-low voltage operation improves not only power dissipation but energy
efficiency as well. The emergence of energy-constrained applications, such as
handheld devices, wearable electronics, wireless sensor nodes, and implantable
biomedical instruments demand energy efficient circuit solutions to prolong battery
life. Although energy harvesting circuits are widely used in energy autonomous
systems, the energy it derives from the external sources is a very small amount and
cannot satisfy the energy specification of the whole SoC. Now it is known that
energy has a correlation with supply voltage. As Fig. 1.3 presents, energy per
operation of the 16b 1024-point Fast Fourier Transformation (FFT) decreases with
voltage scaling. However, the energy as a function of supply voltage reaches an
optimal point and increases again even if the voltage is lowered further. Thereby,
how to find the optimal point in the energy space to ensure energy efficient circuit is
of high importance.
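The existence of such a minimum can be illustrated with a simplified sub-threshold energy model. The sketch below is not the model used in the thesis or in [7]; it assumes that dynamic energy scales with VDD squared and that leakage energy is the off-current times VDD times a cycle time that grows exponentially as VDD drops. The slope factor, activity factor and delay constant are hypothetical values chosen only for illustration.

```python
import math

# All numeric values are illustrative assumptions, not data from the thesis or [7].
N_FACTOR = 1.5          # assumed sub-threshold slope factor n
V_THERMAL = 0.025852    # thermal voltage kT/q near 300 K, in volts
ACTIVITY = 0.05         # assumed switching activity factor
DEPTH_TIMES_K = 100.0   # assumed logic depth times delay fitting constant


def normalized_energy(vdd):
    """Energy per operation normalized to the switched capacitance (units of V^2)."""
    dynamic = ACTIVITY * vdd ** 2
    # Leakage energy = I_off * VDD * cycle time, with cycle time ~ VDD / I_on and
    # I_off / I_on = exp(-VDD / (n * vT)) in the sub-threshold region.
    leakage = DEPTH_TIMES_K * vdd ** 2 * math.exp(-vdd / (N_FACTOR * V_THERMAL))
    return dynamic + leakage


def minimum_energy_point(v_start=0.15, v_stop=0.60, step=0.005):
    """Sweep VDD on a grid and return (vdd, energy) at the lowest-energy point."""
    count = int(round((v_stop - v_start) / step)) + 1
    sweep = [v_start + i * step for i in range(count)]
    best = min(sweep, key=normalized_energy)
    return best, normalized_energy(best)


v_opt, e_opt = minimum_energy_point()
print(f"minimum-energy VDD ~ {v_opt:.3f} V, normalized energy {e_opt:.4f}")
```

With these assumed parameters the minimum falls inside the sweep rather than at either end: below it the exponentially growing cycle time lets leakage energy dominate, and above it the quadratic dynamic term dominates.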
While energy is paramount for circuits such as Random Access Memory (RAM),
performance is equally critical for other circuits, such as standard-cell logic,
under the traditional design philosophy. For these circuits, e.g.
flip-flops (FFs), delay is as crucial as energy [11][12]. To account for both
parameters, an extended specification space has to be adopted to evaluate
performance-sensitive circuits. Hence, the energy-delay product is used as a key
metric to evaluate them.
Fig. 1.3 Energy dissipation as a function of VDD for the 16b 1024-point FFT [7].
1.2 Research Objectives and Contributions
Numerous studies have advanced the field of ultra-low voltage circuits, but the
attendant challenges, such as variability and leakage, become more and more
severe as transistor length shrinks below 100 nm. In this thesis, we study the
characteristics of transistors in the near-/sub-threshold region and investigate their
impact on energy and energy-delay product. Based on that, we aim to develop ultra-low
voltage circuits with improved performance, enhanced variation resilience and high
energy/energy-delay efficiency.
The research outcomes are demonstrated through ultra-low voltage SRAMs, a D
flip-flop (DFF) and First-In-First-Out (FIFO) circuits. Specifically, the contributions
of our research work are as follows:
1. Minimum-energy-driven SRAM design is highly sought after in numerous
emerging applications. As preliminary research, the thesis presents an SRAM
energy analysis utilizing multi-threshold voltage (multi-Vth) devices and
various circuit techniques for power reduction and performance
improvement, and suggests optimal device combinations for energy
efficiency improvement. In general, higher-Vth devices are preferred in the
cross-coupled latches and the write access transistors to reduce leakage
current, while lower-Vth devices are desired in the read port to achieve
higher performance. However, an excessively raised Vth in the
write path, i.e. the cross-coupled latches and the write access transistors,
makes write slower than read, which quickly nullifies the improvement in
energy efficiency. In this work, an energy efficiency improvement of 6.24×
is achieved solely through an optimal device combination in a commercial 65
nm CMOS technology. Employing power reduction and performance
boosting techniques together with the optimal device combination enhances
the energy efficiency by up to 33×.
2. Conventional 6-transistor (6T) SRAMs suffer from severe cell stability issues
during read at ultra-low voltage. Decoupled SRAM cells, such as the 8T SRAM
cell [13], are widely adopted to ameliorate this issue by decoupling the read port
from the data storage latch. Based on this preliminary research, a
9-transistor (9T) SRAM cell is developed using multi-threshold CMOS
(MTCMOS) technology to enhance read current and speed while minimizing
the leakage current. Another issue of 6T SRAMs at ultra-low voltage is
the degraded read sensing ability due to data-dependent leakage. In the 9T
SRAM cell, a novel 3T-based read port is proposed to equalize read bitline
(RBL) leakage and to improve the RBL sensing margin by eliminating the
data dependence of the bitline leakage current. To optimize energy efficiency, a
miniature content-addressable memory (CAM)-assisted circuit is integrated to
conceal the slow data development after data flipping in write operation and
thereby raise the operating frequency. A 16 kb SRAM test chip has been fabricated
in 65 nm CMOS technology. The operating voltage of the test chip is scalable from
1.2 V down to 0.26 V, with the read access time ranging from 6 ns to 0.85 µs.
A minimum energy of 2.07 pJ is achieved at 0.4 V, a 40.3% improvement
compared to the SRAM without the aid of the CAM. Energy efficiency is
enhanced by 29.4% between 0.38 V and 0.6 V by the proposed CAM-assisted
circuit.
3. Dual-port SRAMs can execute two operations simultaneously in one clock
cycle. Current 8T dual-port SRAMs are implemented in a similar way to
conventional 6T SRAMs. Apart from the weaknesses inherited from 6T
SRAMs, 8T dual-port SRAMs are challenged by common-row-access
disturb, which severely limits their operation at low voltage. In the thesis,
a 12-transistor (12T) dual-port SRAM is proposed to suppress the disturb in
the common-row-access mode and improve the worst-case read-ability,
write-ability and cell stability. The work significantly reduces the probability
of disturb, increases the resilience against disturb and extends the
operating voltage to the near-threshold region. In addition, a virtual ground
technique is employed to further lower the power and energy by reducing
bitline leakage. A 16 kb 12T dual-port SRAM has been fabricated in a 65
nm CMOS process technology and shows successful dual-port SRAM
operation down to 0.4 V in the common-row-access mode.
4. Analysis of the energy-delay domain reveals that a performance boosting
technique is essential to achieve an optimal energy-delay product. This is
exactly what the research on the ultra-low voltage DFF achieves. A 16-transistor
DFF featuring a low energy-delay product for near-/sub-threshold
applications is implemented. The device count of the proposed DFF is lower
than that of mainstream DFFs, such as the transmission-gate FF (TGFF). This is
made possible by eliminating the clock buffer and employing pass gates instead
of transmission gates. To reduce the Clock-to-Q (CQ) delay and improve its
variation resilience, two charge pumps and an
inverse-narrow-width-effect-aware strategy are utilized, improving the performance
by 23%. The novel DFF fabricated in 180 nm CMOS technology is fully functional
down to 0.18 V and shows an energy-delay product of 13.1 pJ·ns at 100% data
activity, a 51.8% improvement compared to the conventional TGFF.
When VDD = 0.5 V, the energy-delay product is enhanced by 50.8% on
average. Two 256-bit FIFOs are implemented using the
proposed DFF and the TGFF. The FIFO utilizing the charge-pumped DFF
exhibits a 31.2% total power reduction in the sub-threshold regime.
In summary, the main contribution of the thesis is the exploration and demonstration
of optimal solutions for SRAM and DFF circuits that overcome the limitations imposed
by ultra-low voltage while satisfying the requirements of energy-constrained
applications.
1.3 Organization
The rest of the thesis is organized as follows. Chapter 2 introduces the background
of ultra-low voltage, low-power SRAM and flip-flop circuits. Common design
techniques that assist near-/sub-threshold operation and mitigate PVT variation are
reviewed, and the fundamentals of energy- and energy-delay-efficient design
methodology are provided. Chapter 3 models the energy consumption of an 8T
SRAM and comprehensively investigates the impact of multi-threshold CMOS
(MTCMOS) devices on energy efficiency. Subsequently, various assist-circuit
techniques for power reduction and performance improvement are examined, and
optimal device combinations for energy efficiency improvement are suggested.
Chapter 4 presents an in-depth analysis of the impact of MTCMOS technology
on energy efficiency. Based on these observations, a methodology for designing an
energy-efficient MTCMOS SRAM with improved read sensing margin and
enhanced write performance is discussed. Specifically, the RBL leakage is analyzed
and leakage equalization is used to obtain a higher read sensing margin. Read and
write delays are investigated, and a miniature Content-Addressable-Memory (CAM)
circuit is explored to boost the write performance
and ultimately the energy efficiency. Chapter 5 analyzes the common-row-access
behavior of the conventional 8T dual-port SRAM. A novel 12T dual-port SRAM is
explored as a solution to suppress disturb in the common-row-access mode.
Other techniques, such as virtual ground and hierarchical bitline, are also evaluated
in the chapter. The DFF work is presented in Chapter 6. In this chapter, the
Clock-to-Q delay is analyzed and optimized through the sizing methodology and
the use of charge-pump circuits. Silicon measurement results are included to validate
the effectiveness of the techniques in improving the energy-delay product. Chapter 7
summarizes the research work in the thesis and looks ahead to possible
future work.
Chapter 2 Background and Literature Review
2.1 Conventional 6T Single-port SRAMs
Single-port Static Random-Access Memory (SRAM) is widely used in
CPUs and processor cores as on-chip memory for intermediate
data access. Compared to Dynamic Random Access Memory (DRAM), SRAM
enables static on-chip data storage. This feature makes SRAM less complicated than
DRAM, which has to be refreshed periodically in order to retain data. Without the
additional circuitry and timing required for refresh, SRAM is generally faster
and less power hungry than DRAM. In CPUs, embedded SRAMs usually serve
as cache memories due to their speed, density and energy characteristics.
2.1.1 6T Single-port SRAM Operation
The conventional 6T single-port SRAM cell is depicted in Fig. 2.1. Transistors M1 and
M2 are switches for data access. Transistors M3~M6 form a cross-coupled latch in
the middle and serve as the data storage element. Nodes Q and QB hold the data and
its complement. Column-wise, hundreds of SRAM cells are assembled and share
the data input and output paths, which are known as bitline (BL) and bitline bar (BLB).
The pair of bitlines is connected to the drain terminals of all the access transistors in
the column. Row-wise, dozens of SRAM cells are connected to each other by a
shared wordline (WL), which is used to activate operations.
Fig. 2.1 Schematic of conventional 6T SRAM cell.
Fig. 2.2(a) and (b) illustrate the read and write operations of the 6T SRAM cell,
respectively. During the low phase of the clock cycle, the bitlines (BL and BLB) are
precharged to VDD or a moderately high voltage. In a read operation, WL is asserted to
turn on the access transistors while the previously precharged bitline pair is left
floating for read evaluation. BL and BLB behave differently depending on the stored
data. When Q holds logic ‘1’, M3 and M6 are turned on while M4 and M5 are
cut off. Hence, BLB discharges mainly through the path formed by M2 and M6. If,
on the contrary, Q holds logic ‘0’, M3 and M6 are switched off whereas M4 and M5
are turned on. The conductive path then forms through M1 and M5, which lowers the
voltage of BL. At the end of the bitline pair, a sense amplifier responds to the
voltage difference between BL and BLB and outputs the value of the cell, provided
that the cell has developed a sufficient voltage difference within the cycle.
Fig. 2.2(a) Concept of read operation in a 6T SRAM bitline. (b) Concept
of write operation in a 6T SRAM bitline.
In a write operation, the data and its complement are loaded onto BL and BLB,
respectively. Specifically, one bitline is precharged to a high voltage and the other
is connected to ground. The access transistors M1 and M2 are simultaneously
switched on by raising the WL voltage to VDD. Since the NMOS
pass-gate is designed to be stronger than the PMOS pull-up device, the internal node
holding logic ‘1’ is pulled down by the adjacent grounded bitline. The positive
feedback of the cross-coupled structure helps flip the original value and
maintain the new data.
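The read and write behavior described above can be summarized in a logic-level sketch. This is a hypothetical behavioral model written for illustration only: it tracks just the stored bit, uses the transistor names of Fig. 2.1, and ignores every analog effect (sizing, noise margins, bitline capacitance), so a write is assumed to always succeed.

```python
class SixTCell:
    """Logic-level 6T cell: only the stored bit Q (with QB = not Q) is modeled."""

    def __init__(self, q=0):
        self.q = q

    def read(self):
        """Return the precharged bitline that discharges and the sensed value."""
        # Q = 1: M6 is on, so BLB discharges through M2 and M6.
        # Q = 0: M5 is on, so BL discharges through M1 and M5.
        discharged = "BLB" if self.q == 1 else "BL"
        sensed = 1 if discharged == "BLB" else 0
        return discharged, sensed

    def write(self, data):
        """Drive BL = data and BLB = not data with WL asserted."""
        # The grounded bitline pulls down the node holding '1' and the latch
        # feedback completes the flip (assumed to always succeed in this model).
        self.q = data


cell = SixTCell(q=1)
print(cell.read())   # BLB discharges when Q holds '1'
cell.write(0)
print(cell.read())   # BL discharges when Q holds '0'
```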
2.1.2 Challenges of 6T SRAMs for Ultra-low Voltage Operation
Although the 6T SRAM circuit is mature in industry and used in most commercial
chips, it scales very poorly with voltage. The reasons fall into three
categories. Firstly, the read disturb and write-ability issues impede voltage scaling.
Secondly, the large variation at ultra-low voltage can cause severe reliability
problems. Lastly, the degraded Ion-to-Ioff ratio makes sub-threshold operation very
difficult.
The read operation of 6T SRAMs is fast but potentially destructive, that is, the cell
nodes are disturbed during a read and the data can be overwritten by the disturb
current. Fig. 2.3 depicts the read stability issue. The data storage nodes Q and QB are
directly connected to the bitline pair through the access devices. Therefore, due to
the voltage division effect between the cross-coupled latch and the bitline capacitance,
the value in the SRAM cell is vulnerable to flipping during read. Specifically, the
voltage at node QB rises by a small amount ∆V above ground when M2 discharges BLB.
If ∆V is larger than the trip point of the inverters, the cell value is flipped by
the positive feedback loop. To prevent a destructive read, the access transistors M1
and M2 must be downsized or the pull-down transistors M5 and M6
must be upsized. In other words, the cell ratio, which is defined as the ratio of
the drain current of the pull-down device to the drain current of the access device, has
to be increased to suppress ∆V. To evaluate the data stability of the 6T
SRAM cell, the read Static Noise Margin (SNM) is adopted as a key functionality
metric for read operation. It is defined as the minimum amount of DC noise
required to flip the state of the cell. Fig. 2.4 illustrates the SNM in the DC transfer
function curves of the latch. Normally, the larger the SNM, the better the cell
withstands read disturb. However, the read SNM deteriorates significantly with
voltage scaling.
Fig. 2.3 Cell stability degradation of 6T SRAM cell due to read disturb.
On the other hand, the bitlines have to overpower the cell with the new data for a
write operation to succeed. During the write cycle, the grounded bitline provides a
discharging path for the internal node holding logic ‘1’. The relative strength of the
access transistor and the pull-up PMOS device determines the write-ability. To ease
data flipping, the pull-up strength should be smaller than the access strength so as to
weaken the data retention capability. Usually, the write margin is employed as a key
metric to evaluate the write-ability of SRAM cells. For 6T SRAMs, it is interpreted as
the voltage headroom at WL for a successful write operation.
Fig. 2.4 Illustration of SNM of 6T SRAM cell. (Axes: Q (V) versus QB (V).)
As [14],[15] show, the minimum operating voltage is bounded by read stability
and write margin. However, the substantially degraded read SNM and write margin of
6T SRAMs prevent aggressive voltage scaling and make sub-threshold operation
extremely challenging without the aid of assist circuits. Fig. 2.5(a) and (b)
show the trends of read SNM and write margin with scaled VDD, respectively.
The SNM is far less than half of VDD at all supply voltages, while the write margin
degrades approximately 9× when the voltage scales from 1.2 V to 0.2 V.
The degradation of the two key metrics is accelerated by large Vth variation at
ultra-low voltage. Conventionally, a transistor is considered to be cut off when the
overdrive voltage (VGS–Vth) falls to zero. In fact, the transistor still conducts when
VGS is lower than the threshold voltage Vth, and the current follows the relation
given by Equation (2.1):
$$I_{sub} = I_0 \frac{W}{L}\, e^{q (V_{GS} - V_{th}) / (n k T)} \qquad (2.1)$$
where I0 depends strongly on the technology, W/L is the transistor width-to-length
ratio, n is the sub-threshold slope factor, q is the electronic charge, T is the
temperature and k is the Boltzmann constant. For sub-threshold computing, since
the current depends exponentially on (VGS–Vth) and temperature, very small
drifts in VDD, Vth and temperature can cause large current variation.
Therefore, the PVT (process, voltage and temperature) variation, which is mainly
attributed to voltage scaling, random dopant fluctuation and temperature variation,
worsens various figures of merit and creates reliability issues, such as an unacceptable
bit error rate. To combat variability, the supply voltage of 6T SRAM circuits cannot be
scaled very low.
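The exponential sensitivity expressed by Equation (2.1) can be made concrete with a short numerical sketch. The slope factor n, the prefactor I0, the W/L ratio, the bias point and the 30 mV threshold shift below are assumed example values, not parameters taken from the thesis or from any particular process.

```python
import math

BOLTZMANN = 1.380649e-23            # Boltzmann constant k, in J/K
ELECTRON_CHARGE = 1.602176634e-19   # electronic charge q, in C


def subthreshold_current(vgs, vth, temp_k=300.0, n=1.5, i0=1e-7, w_over_l=1.0):
    """Evaluate Equation (2.1); n, i0 and w_over_l are assumed example values."""
    thermal_voltage = BOLTZMANN * temp_k / ELECTRON_CHARGE   # kT/q
    return i0 * w_over_l * math.exp((vgs - vth) / (n * thermal_voltage))


nominal = subthreshold_current(vgs=0.30, vth=0.40)
shifted = subthreshold_current(vgs=0.30, vth=0.43)   # assumed +30 mV Vth shift
print(f"current ratio nominal/shifted: {nominal / shifted:.2f}")
```

Under these assumptions a 30 mV threshold shift already changes the current by a factor of a little over two, which is why small Vth, VDD or temperature drifts translate into large delay and leakage spreads in this region.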
Leakage is another bottleneck to overcome for ultra-low voltage operation. When
VDD is high, the on-state drain current of a transistor is tens of thousands of times
larger than the leakage current. The high Ion-to-Ioff ratio in the strong inversion
region makes the impact of leakage current on key design metrics negligible. However,
if circuits work in the near- or sub-threshold region, the transistor channel is only
moderately or weakly inverted, resulting in a small Ion-to-Ioff ratio. This is worsened
by CMOS technology scaling, which makes gate leakage more and more difficult to control.
When the leakage current is comparable to the drain current, characteristics of 6T
SRAM circuits such as operating frequency, read sensing ability and energy efficiency
change accordingly. In addition, bitline leakage that depends on the data pattern of
each column in 6T SRAMs is detrimental to read sensing. If the data-dependent
bitline leakage is large enough, the bitline level of data ‘1’ could
be lower than that of data ‘0’ [16]. Consequently, this limits the number of cells per
bitline and the minimum operating voltage.
Fig. 2.5(a) 6T SRAM read SNM by VDD sweeping. (b) 6T SRAM write
margin by VDD sweeping. (Axes: read SNM and write margin in a.u. versus VDD in V;
panel (b) uses a log scale and is annotated with the ~9× degradation.)
2.1.3 Design Techniques of 6T SRAMs to Improve Minimum Voltage
Diverse design techniques have been proposed to cope with these challenges and
improve the minimum operating voltage of 6T SRAMs. Basically, most of these
techniques aim to improve read SNM, enhance write-ability and reduce leakage.
Decoupled SRAM cells with dedicated read ports, such as 8T and 10T SRAM cells
[13],[17],[18], are common solutions to suppress read disturb and improve cell
stability. The read port, which is separated from the data storage element, enables
single-ended read sensing without accessing the internal nodes. When read is
asserted, the read current flows through
the transistors in the read port without any interference with Q and QB. As the read
disturb is removed, smaller transistors can be adopted in the SRAM cell to
compensate for the area overhead. In [13], a two-transistor read stack is added to the
standard 6T cell and enabled by the read wordline (RWL), as Fig. 2.6(a) shows. When
RWL is asserted, M7 is on and the dedicated read bitline (RBL), which is
precharged beforehand, is left floating for data evaluation. When Q stores logic ‘1’, M8
is cut off due to its negative overdrive voltage. Therefore no conductive path forms in
the read port, and RBL remains at a high voltage. When Q stores logic ‘0’, M8
is switched on and the read current flows from RBL through M7 and M8 to ground.
Consequently, the RBL voltage is quickly pulled down and amplified for sensing data
‘0’. The 8T SRAM, combined with an optimized write pass-gate and MTCMOS
technology, achieves 295 MHz operation at 0.41 V. Alternative techniques, such as
wordline underdrive, are also effective in minimizing read disturb by driving WL lower
than VDD. Compared to decoupled SRAM cells, wordline underdrive incurs less area
penalty, which is beneficial for high-density SRAMs [19].
Write margin is another challenge to overcome. The sizes of the pull-up PMOS
transistors and the NMOS access transistors have to be carefully designed to ensure
that the write current is strong enough to flip the original data. Moreover, the strength
of the write access transistors is susceptible to PVT variations, which are severe in
advanced technologies. If the access devices are weakened by PVT variations,
the write margin can be degraded. To assist write and improve write margin,
various design techniques are utilized to manipulate the wordline, bitline and cell
supply voltage. The boosted wordline technique pumps the voltage of WL higher than
VDD to enhance VGS and ease data flipping during write. The collapsed cell supply
voltage method intentionally decreases the supply voltage of the cross-coupled latch
during write [17] to weaken the hold SNM. The negative bitline technique strengthens
the access transistor by pulling down the voltage of the bitline carrying data ‘0’ to a
negative voltage, which ultimately improves the write margin [19],[20],[21]. In [21], a
negative bitline is implemented by using a negative bias to represent data ‘0’. This
write-assist technique enhances the write ability by increasing the strength of the access
transistor in the SRAM cell. However, it requires a tracking replica bitline, a pulse
generator and a negative charge pump, which increase control complexity
and add circuitry.
Fig. 2.6(a) Schematic of 8T SRAM cell [13]. (b) Schematic of 10T
SRAM cell [17].
For a bitline with a large number of cells, leakage current becomes more significant
and makes bitline sensing more difficult with voltage scaling. The 10T SRAM cell
[17] depicted in Fig. 2.6(b) features a read port designed for leakage reduction. The
sources of M3 and M4 are connected to a cell supply VVDD for write. The read port is
implemented by M7 through M10. When Q holds ‘0’ and QB holds ‘1’, M10 adds an
off device in series with the leakage path through M8 and the path through M9. When
Q holds ‘1’ and QB holds ‘0’, M10 reduces leakage through M7 by the stack effect.
The reduction in sub-threshold leakage through M8 reduces the impact of leakage
from unaccessed cells and allows more cells on a bitline. The 256 kb 65 nm CMOS
test chip using the 10T cell and a boosted wordline scheme functions without error at
380 mV. At 27°C, the test chip dissipates approximately 2 µW of leakage
power at a supply voltage of 0.3 V. The 10T memory saves over 60× in leakage
power when VDD scales from 1.2 V to 0.3 V [17].
Several circuit techniques have been proposed to address the data-dependent bitline
leakage challenge. In [22], the pull-down bitline leakage in conventional
6T SRAMs is compensated by injecting additional pull-up current. However, the
analog detection and injection circuit in this design is highly sensitive to process
variations and coupling noise [23]. The sensing calibration technique proposed in [24]
solves the data-dependency problem by injecting the same voltage offset on BL and
BLB using a crossing-structure circuit. Its main drawback is the calibration
scheme, which requires bitline loading capacitors and hence consumes more
power. Although the P-P-N based 10T SRAM cell presented in [25] achieves
high immunity to data-dependent bitline leakage, it has to triple the size of
four transistors to improve its write ability and thus occupies substantial area.
2.2 Conventional 8T Dual-port SRAMs
Dual-port SRAMs can read and write cells at different addresses
simultaneously. This increases bandwidth by approximately 2× compared to
single-port SRAMs, which access only one cell at a time. As design complexity grows,
greater demands are placed upon high-bandwidth memories to boost throughput.
For example, high-speed communication and multi-media processing [26]-[29]
need dual-port SRAMs to improve the total chip performance through parallel operation.
In addition, dual-port SRAMs can be implemented as register files in CPUs. There are
two types of dual-port SRAMs: synchronous dual-port SRAMs and
asynchronous dual-port SRAMs. In our research work, only synchronous dual-port
SRAMs are investigated.
Fig. 2.7(a) Schematic of conventional 8T dual-port SRAM. (b)
Parallel memory access of 8T dual-port SRAM [30].
2.2.1 8T Dual-port SRAM Operation
Fig. 2.7(a) shows the schematic of the conventional 8T dual-port SRAM cell. Port A
and Port B are each accessed with their own address and operation instruction. Each
port has its own wordline (WL) and pair of bitlines (BL and /BL).
Combined with the cross-coupled latch, each port operates exactly like
a 6T single-port SRAM cell, which has been analyzed in detail in Section 2.1.1.
The widths of the two NMOS drive transistors are enlarged to maintain cell
stability under common-row access. Parallel memory access by the dual-port
SRAM block is portrayed in Fig. 2.7(b), where both unit-A and unit-B can access a
dual-port SRAM cell simultaneously within a cycle. Accordingly, there are four
possible simultaneous access operations at the same address: read-write,
write-write, read-read, and write-read. A simultaneous read-read operation does not
affect the cell, but the other operations need proper measures to ensure that no data
collision occurs. Fig. 2.8 presents the access situations of
dual-port SRAMs when both ports are activated within a clock cycle. Fig. 2.8(a) shows
the case in which the two ports access cells in different rows and different columns,
each designated independently by its own address. Fig. 2.8(b) describes the case of
different rows but a common column. Neither case has an access conflict
because each accessed SRAM cell is enabled by only one port, exactly as in a
single-port SRAM operation. Fig. 2.8(c) and (d) present the situations that incur access
conflicts, which are common-row-different-column access and
common-row-common-column access, respectively. In these common-row-access cases,
port A and port B of the selected row are simultaneously enabled, which substantially
degrades SNM and write margin, as elaborated in Section 2.2.2.
Fig. 2.8(a) Illustration of different-row-different-column access. (b)
Illustration of different-row-same-column access. (c) Illustration of
common-row-different-column access. (d) Illustration of
common-row-common-column access [30].
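The four access situations of Fig. 2.8 can be told apart by comparing the row and column addresses of the two ports. The helper below is a hypothetical illustration of that classification; the function name, the argument names and the returned labels are assumptions made for this sketch and do not correspond to any circuit block in [30].

```python
def classify_access(row_a, col_a, row_b, col_b):
    """Return (case label, conflict flag) for simultaneous port A / port B accesses."""
    same_row = row_a == row_b
    same_col = col_a == col_b
    row_part = "common-row" if same_row else "different-row"
    col_part = "common-column" if same_col else "different-column"
    # Both wordlines of one row are asserted only in the common-row cases,
    # so those are the cases flagged as conflicting.
    return f"{row_part}-{col_part}", same_row


print(classify_access(0, 1, 2, 3))   # Fig. 2.8(a) situation
print(classify_access(0, 1, 2, 1))   # Fig. 2.8(b) situation
print(classify_access(5, 1, 5, 3))   # Fig. 2.8(c) situation
print(classify_access(5, 1, 5, 1))   # Fig. 2.8(d) situation
```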
2.2.2 Challenges of 8T SRAMs for Ultra-low Voltage Operation
The conventional 8T dual-port SRAM cell is derived from the standard 6T single-port
SRAM cell. It inherits the weaknesses of the 6T SRAM cell at ultra-low voltage, such as
poor cell stability and reduced read-ability and write-ability, which impede voltage
scaling. These issues are exacerbated because the 8T dual-port cell has two
more access transistors exposed to disturbance, especially in the common-row-access
mode. As a result, its minimum operating voltage is generally not as
low as that of the 6T single-port SRAM circuit.
In the common-row-access cases, the cell stability has to be treated as the worst
case because the two enabled WLs impose more disturbance and degrade the SNM of
all cells along the selected row. Fig. 2.9 shows the SNM in common-row
access and different-row access [30]. The butterfly curve of the 8T dual-port SRAM
cell is obtained by overlapping the voltage transfer curve of one inverter with its
inverse. The SNM is visualized by the diagonal length of the largest square that
can be embedded between the two lobes of the butterfly curve. In common-row
access mode, the read SNM is reduced by approximately 1/3, which directly results
from the degradation of the electrical β ratio. The β ratio deteriorates from βD1/βA1
(different-row access) to βD1/(βA1+βA2) (common-row access) because of the
simultaneous activation of WLA and WLB, where βD1, βA1 and βA2 denote the
source-drain current coefficients of the pull-down NMOS transistor, the access
transistor for port A and the access transistor for port B, respectively.
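The β ratio degradation can be illustrated numerically. The relative strengths below are assumed example values, not device parameters from [30]; with equally sized access transistors, activating both wordlines halves the β ratio.

```python
def beta_ratio(beta_drive, active_access_betas):
    """Beta ratio: drive strength over the summed strength of the active access devices."""
    return beta_drive / sum(active_access_betas)


BETA_D1 = 2.0   # assumed relative strength of the pull-down NMOS transistor
BETA_A1 = 1.0   # assumed relative strength of the port A access transistor
BETA_A2 = 1.0   # assumed relative strength of the port B access transistor

different_row = beta_ratio(BETA_D1, [BETA_A1])           # only WLA asserted
common_row = beta_ratio(BETA_D1, [BETA_A1, BETA_A2])     # WLA and WLB asserted
print(different_row, common_row)
```

Restoring the original ratio in the common-row case would require roughly doubling the drive strength, which is consistent with the oversized drive transistors of the conventional 8T dual-port cell.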
In an 8T dual-port SRAM, write-ability must also be evaluated under the
worst case. When port A writes, the cell is disturbed by the activation of the
other port, whose current interferes with the storage node and hinders data flipping.
The read and write disturb situations are analyzed in Chapter 5.
As discussed, the situation in Fig. 2.8(c) generates the worst case of read and write
for a selected cell and the worst case of data retention for unselected cells.
Same-address access for write operation, as shown in Fig. 2.8(d), is prohibited because
the destructive write causes abnormal leakage current in the SRAM cell if the data
written from the two ports are opposite [30]. Still, simultaneous write-read, read-write
and read-read operations are allowed and frequently required by the system. As
the conventional dual-port SRAM must satisfy various worst-case situations, the
size of the memory cell has to be increased accordingly to improve cell stability as
well as read and write abilities. Normally, the drive NMOS transistor in the 8T
dual-port cell is more than 2× the size of the corresponding transistor in the 6T
single-port cell.
In summary, the conventional 8T dual-port SRAM scales very poorly with
voltage because of its intrinsic cell stability weakness and the difficulty of
performing read and write operations at ultra-low voltage.
Fig. 2.9 Comparison of read SNMs in different-row access and common-
row access situations [30]. (Axes: Q (V) versus QB (V); insets show the cell with one
wordline on and with both wordlines on.)
2.2.3 Design Techniques of 8T SRAMs with Ultra-low Supply Voltage
Most contemporary design techniques for single-port 6T SRAMs are
applicable to SRAM circuits in general. Hence many of them have been carried over to
dual-port SRAMs to enhance their figures of merit. Recent state-of-the-art dual-port 8T
SRAMs have been improved in various ways, such as low-leakage and conflict-free 8T
solutions. Although those circuit techniques help lower the supply voltage, very
few 8T dual-port SRAMs have been demonstrated at ultra-low
voltage.
A 20 nm high-density dual-port SRAM with a wordline-voltage-adjustment technique
has been demonstrated in [31] to achieve better read-ability and write-ability against
local variation. This scheme lowers the wordline voltage for read assist and
raises it for write assist. A temperature-monitoring device is
embedded to sense temperature variation for more accurate control. An
assist controller is connected to the fuse, which triggers the on-chip regulator to
generate the corresponding wordline voltage according to different process variations
and temperature thresholds. Measurements reveal a 0.1 V improvement in minimum
voltage.
Fig. 2.10 Concept of access circumvention scheme for dual-port 8T
SRAM [30].
The minimum operating voltage of 8T dual-port SRAMs, as discussed in Section
2.2.2, is constrained by the situation where the same row address is asserted by both
ports simultaneously. The read disturb and write disturb together with the cell
stability challenge under the circumstance substantially impede voltage scaling.
Therefore, how to eliminate the conflict in common-row access becomes the key to
lowering supply voltage further and draw great attentions for researchers. A
circumvention technique has been proposed to avoid simultaneous common-row
access for 8T dual-port SRAM [30]. It adopts a priority row decoder and a bitline
shifter to prevent access conflict. The fundamental concept of the scheme is
illustrated in Fig. 2.10. In the scheme, port A is defined as primary whereas port B is
considered as secondary. A row address comparator and a bitline shifter are
introduced in the secondary port. When both ports are asserted at the same row, the
comparator outputs logic ‘0’ to disable the secondary row decoder for port B.
Consequently, only the wordline for port A is accessible to the SRAM cell.
Meanwhile, the logic ‘0’ from the comparator changes the connection of the
secondary port from the pair of BLB to that of BLA to foster a possible read
operation. However, this technique incurs area penalty because of the complexity of
the comparator and the bitline shifter circuits. More importantly, the circumvention
Fig. 2.11 Concept of active bitline equalizing technique for dual-port 8T
SRAM [32].
scheme delays the operation through the secondary port in common-row access
mode, which degrades the performance of the dual-port SRAM. The measurement
results show that the 32 kB macro fabricated in 65 nm technology can work down to
0.8 V.
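The decision logic of the circumvention scheme can be summarized in a small behavioral model. This is only a sketch of the description above, not the circuit in [30]: the names `AccessPlan` and `plan_access` are hypothetical, and timing effects such as the delayed secondary-port operation are not modeled.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class AccessPlan:
    wordline_a: int             # row whose port-A wordline is asserted
    wordline_b: Optional[int]   # row whose port-B wordline is asserted, or None
    port_b_bitlines: str        # bitline pair serving port B: "BLA" or "BLB"


def plan_access(row_a: int, row_b: int) -> AccessPlan:
    """Behavioral model of common-row access circumvention (illustrative)."""
    # The row address comparator outputs logic 0 when both ports hit the same row.
    comparator_out = 0 if row_a == row_b else 1
    if comparator_out == 0:
        # The secondary (port B) row decoder is disabled, and the bitline
        # shifter connects port B to the port-A bitline pair instead.
        return AccessPlan(wordline_a=row_a, wordline_b=None, port_b_bitlines="BLA")
    # Different rows: both ports operate independently.
    return AccessPlan(wordline_a=row_a, wordline_b=row_b, port_b_bitlines="BLB")
```

For example, `plan_access(3, 3)` leaves only the port-A wordline active and routes port B through BLA, whereas `plan_access(3, 5)` keeps both wordlines active.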
An active bitline equalizing circuitry has been proposed to improve the write
margin by removing write disturb when the same row address is activated [32]. Fig.
2.11 presents the concept of the equalizing technique. In common-row-access mode,
Fig. 2.12 (a) Write-disturb detector for 8T dual-port SRAM. (b) Coordinately-activated
write drivers for dual-port SRAM [32].
the pair of bitlines in port B (BLB) is disconnected from the storage node and
equalized to the pair of bitlines in port A (BLA), thus the write disturb from BLB is
circumvented. To realize the concept, a write-disturb detector and a coordinately-activated
write driver are proposed and implemented in the selected column (Fig.
2.12(a), (b)). By utilizing this circuitry, the minimum voltage of the 28 nm dual-port
SRAM macro is improved by 120 mV to 0.66 V for a slow clock cycle at the expense of
5.8 % extra area.
In summary, although various assist circuits have been proposed to reinforce
low voltage operation, the minimum voltage of 8T dual-port SRAMs is still
relatively high. Lower operating voltage for dual-port SRAM should be
continuously pursued.
2.3 Conventional D Flip-Flop Circuits
D Flip-Flops (DFFs) are fundamental components in digital CMOS integrated
circuit (IC) design. They can account for a substantial share of random-logic power
and area. Most DFFs consist of pairs of latches which are transparent on
different phases of a clock cycle. In general, the latches are built upon the regenerative
storage type, which is “static” because data are constantly restored by the positive
feedback in the storage loop. This method prevents the stored data from being
corrupted by parasitic leakage current, unlike the capacitor storage type [33].
2.3.1 Mainstream DFF Circuits and Timing Properties
Edge-triggered FFs are very popular in globally clocked systems mainly due to the
simplicity of referencing all events and timing parameters to a single toggling of the
clock [33].
One of the most widely used edge-triggered FFs is the transmission-gate FF (TGFF),
which is composed of two level-sensitive latches that operate on opposite clock
phases, each controlled by its corresponding transmission gate. Fig. 2.13 shows the circuit of
TGFF. At the low phase of the clock cycle, the inverted input D is sampled by the
first latch, which is known as the master stage, with a connected path linked by the
first transmission gate. The new data overpowers the original data stored in the
master latch with cutoff of the feedback loop as the transmission gate in the loop
turns off simultaneously. At the high phase of the clock cycle, the second latch
which is the slave stage samples the data held in the master stage and outputs it with
Fig. 2.13 Schematic of transmission-gate FF (TGFF).
inversion. It is noticeable that the clock fed to each transmission gate is buffered
locally to accommodate the high load of the two clock phases.
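The master-slave behavior described above can be illustrated with a purely logical model. This is a sketch under simplifying assumptions (ideal latches, no delays, no clock buffer); the class name `MasterSlaveFF` and its `step` method are hypothetical and not part of the TGFF circuit itself.

```python
class MasterSlaveFF:
    """Logical model of a positive-edge-triggered master-slave flip-flop."""

    def __init__(self) -> None:
        self.master = 0  # master latch node, holds the inverted input
        self.q = 0       # slave output

    def step(self, clk: int, d: int) -> int:
        if clk == 0:
            # Clock low: the master latch is transparent and samples inverted D.
            self.master = 1 - d
        else:
            # Clock high: the master holds, the slave is transparent and
            # outputs the master data with inversion, so Q equals the sampled D.
            self.q = 1 - self.master
        return self.q
```

Calling `step(0, 1)` followed by `step(1, 0)` returns 1 on the second call: the value present while the clock was low is captured, and a change of D after the rising edge is ignored.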
True Single-Phase-Clocked (TSPC) FF is a common dynamic DFF variety which
cascades the negative and positive dynamic latches of Yuan and Svensson. Fig. 2.14
depicts the schematic of TSPC. Transistors M1 to M3 form the negative dynamic
latch, which is transparent when the clock voltage is low. Similarly, transistors M7
to M9 form the positive dynamic latch, which is transparent when the voltage
of the clock is high. At the low phase of the clock signal, net1 samples the inverted
input while M4 isolates the second latch from the first one. As the clock signal
toggles, M4 shuts off and M6 is switched on. The voltage of net2 varies on the basis
of the voltage of net1. If net1 stores logic ‘1’, the voltage of net2 discharges through
the conductive path of M5 and M6 and triggers M7 to raise the voltage of QB. If, on
the contrary, net1 stores logic ‘0’, the voltage of net2 will remain high to turn on M9.
The voltage of QB discharges to ground and an output of logic ‘1’ forms at the Q
terminal.
Timing properties are important for synchronous sequential logic circuits. Any
sequential logic has to comply with certain timing specifications to ensure a
successful operation. For edge-triggered FFs, set-up time, hold time and Clock-to-Q
(C-Q) delay are the most important timing metrics. Fig. 2.15 illustrates the three
parameters and their concepts. The set-up time is defined as the delay from the
data’s becoming valid to the rising edge of the clock. Likewise, the hold time is the
delay from the clock to the data’s becoming invalid [34]. The C-Q delay is the delay
Fig. 2.14 Schematic of True-Single-Phase-Clocked (TSPC) FF.
from the rising edge of the clock to the output’s becoming valid. Violating the set-up
time or hold time can increase the C-Q delay or even flip the output. A more
practical way to measure set-up time/hold time is to capture the set-up/hold skew
at which the nominal delay is degraded by 10% [35]. The energy-delay DFF circuit
proposed in this thesis adopts this method to measure both parameters.
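The 10% degradation criterion can be turned into a simple search over the data-to-clock skew. The sketch below assumes that the delay is available as a function of skew (in practice it would come from circuit simulation) and that it falls monotonically as the skew grows. The function names and the exponential `toy_cq_delay` model are illustrative assumptions, not measured behavior.

```python
import math


def toy_cq_delay(skew_ps: float, nominal_ps: float = 100.0, tau_ps: float = 10.0) -> float:
    """Hypothetical delay model: the delay rises as data arrives closer to the clock edge."""
    return nominal_ps * (1.0 + math.exp(-skew_ps / tau_ps))


def setup_skew_at_degradation(cq_delay, nominal_delay, lo, hi,
                              degradation=0.10, tol=1e-6):
    """Bisect for the smallest skew whose delay stays within the degradation limit."""
    target = nominal_delay * (1.0 + degradation)
    if not (cq_delay(lo) > target >= cq_delay(hi)):
        raise ValueError("search interval does not bracket the degradation point")
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if cq_delay(mid) > target:
            lo = mid   # still degraded by more than the limit
        else:
            hi = mid   # within the limit, try a smaller skew
    return hi


setup_ps = setup_skew_at_degradation(toy_cq_delay, nominal_delay=100.0, lo=0.0, hi=100.0)
```

With the toy model, the result equals `tau_ps * ln(10)`, because the exponential term must fall to 0.1 for a 10% degradation. The same search applies to the hold skew with the sign of the skew reversed.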
Fig. 2.15 Illustrating DFF setup time and hold time [34].
2.3.2 Design Challenges of DFFs for Energy Efficient
Applications
The conventional TGFF utilizes two opposite level-sensitive latches to enable the
edge trigger operation. It employs two inverters to build a clock buffer locally so as
to increase loading capability and reduce switching current. However, the clock
buffer consumes power as long as the clock toggles, even at low data activity when
input D rarely changes. This persistent clock activity wastes power and
energy, which is a disadvantage for energy-constrained applications when a large
number of TGFFs are integrated. In addition, the TGFF is prone to
timing problems when it works in the near- or sub-threshold regime. As the clock
is buffered and the data is inverted, the mismatch between the clock path delay and
the data path delay can cause problems. Specifically, the NMOS device in the first
transmission gate can turn off earlier than its PMOS. Likewise, the PMOS in the
first feedback loop turns on before its NMOS counterpart. This can cause hold time
violation when the input data changes from logic ‘1’ to logic ‘0’ just after the clock
edge [36]. Moreover, the hold time degrades at ultra-low voltage, where the impact of PVT
variations is accentuated.
The TSPC eliminates the clock buffer by utilizing one clock phase. However,
dynamic operation of the circuit degrades its robustness especially at ultra-low
voltage because net1 and net2 are extremely subject to leakage and noise when they
are not being driven. For example, when the input is logic ‘0’ and the clock signal
toggles to high level, the voltage of net1 is neither pulled-up to VDD nor pulled-
down to ground. Likewise, the voltage of net2 does not necessarily remain high
when M4 is shut off because M5 may not be fully switched off. To make things
worse, once the voltage of net2 drops to ground while the clock is high, the node
remains low because there is nothing to pull it up, resulting in a functional
failure.
2.3.3 Design Techniques for Energy Efficient DFFs
Although the conventional TGFF circuit is robust at ultra-low voltage, it is not
energy efficient due to the persistent power spent on the clock buffer in every clock
cycle. In addition, the TGFF uses 24 devices, which can incur a large
random-logic area penalty as well as leakage energy. To solve the problem, an emerging
DFF utilizing adaptive-coupling configuration to reduce transistor count and power
for energy saving has been proposed recently [37]. Fig. 2.16 illustrates its schematic
and configuration. To remove the clock buffer, a differential master-slave topology
is proposed in the adaptive-coupling FF (ACFF). The original transmission gates in
the propagation path are replaced by PMOS and NMOS pass gates, respectively.
However, the circuit is subject to process variations, because the PMOS pass gates
are too weak to provide a strong source-drain current at low voltage to overpower
the strong coupling from the latch during a transition. To ameliorate it, the adaptive-
coupling method is introduced such that the strong coupling in the feedback loop is
weakened when the input is opposite to the internal storage. Specifically, an
adaptive-coupling element which is comprised of a PMOS transistor and an NMOS
transistor is configured in parallel to control the cross-coupled loop. If the level of
node BN is high, the PMOS device is cutoff and the NMOS device is switched on,
weakening the feedback path of G-F and enabling easier discharging of node F to
Fig. 2.16 Schematic of adaptive-coupling FF (ACFF) [37].
node B. As node FN is charged to high state, the level of node G accordingly
becomes low and enforces node F to discharge completely by the NMOS device in
the adaptive-coupling element. Consequently, the C-Q delay is improved due to the
easier data transition and the energy is optimized by the elimination of the power-hungry
clock buffer and the reduction of device count. Silicon experiments in 40
nm CMOS technology validate that this scheme achieves a lower mean C-Q delay with a
smaller standard deviation and a reduced hold time compared to TGFF. The energy
per cycle is improved by 60.8% at 10% data activity with a supply voltage of 1.1 V.
Up to 77% energy reduction is obtained at 0% data activity mainly due to the
elimination of the clock buffer. Despite the improvements, the ACFF circuit is
reported to work typically at super-threshold region (VDD > 0.75 V), which is not
fully qualified for ultra-low voltage operation.
The dynamic TSPC circuit, as analyzed in Section 2.3.2, is highly susceptible to
noise and PVT variations, thus it is very fragile for ultra-low voltage operation. An
improved DFF on the basis of TSPC has been proposed to foster a static operation
with single-phase clocking and contention-free transitions [38]. The so-called Static
Single-Phase Contention-Free FF (S2CFF) (Fig. 2.17) has the same transistor count
as a TGFF. When D is ‘0’, net1 stores the opposite value of input data and net2 is
precharged at the low phase of the clock. As the clock toggles, net2 discharges to
ground through M9 and M10 and updates QN by switching on M13. When D is ‘1’,
net1 discharges to ground and net2 is pulled up to VDD. At the high phase of the
clock, QN is updated by the pull-down operation through M14 ~ M16. In the S2CFF
circuit, net1 and net2 become static nodes, which is different from the TSPC design.
This is accomplished by the clocked devices and the positive feedback loop
between the two nets. As such, the operation of the DFF is fully static. On the other
hand, the sub-circuit consisting of M11, M12 and M15 prevents possible glitches, which
enables contention-free operation. The S2CFF has been implemented in a 45 nm
SOI technology and showed a clock power reduction of 41% and a total sequential
power reduction of 39% at 1V/1GHz compared to TGFF. Active energy is also
improved by 32% and 34% at VDD = 1 V and 0.4 V, respectively. The reported
minimum supply voltage of the S2CFF is 0.4 V.
In summary, the ACFF and the S2CFF exhibit substantial improvement on power
and energy efficiency compared to TGFF. However, the minimum supply voltage of
each DFF is only moderate. Further exploration of ultra-low voltage and minimum
energy-driven techniques that are effective for DFFs is needed. More aggressive
voltage scaling for DFFs is expected to enable computing in ultra-low voltage
circuits, such as [7], [39].
Fig. 2.17 Schematic of Static-Single-Phase Contention-Free FF (S2CFF) and its
operation [38].
Chapter 3 SRAM Device and Circuits
Optimization toward Energy Efficiency in
Multi-Vth CMOS
3.1 Background
Recently emerging micro-watt applications such as micro-sensor networks, handset
electronics and implantable biomedical devices, etc., have placed their primary
criterion on minimum energy consumption or high energy efficiency to prolong
battery life time. To improve the energy efficiency, operating voltage (VDD) in
these applications is positioned near or below the threshold voltage (Vth), known as
the near- or sub-threshold region. Achieving this ultra-low energy goal therefore requires
digital and memory circuits designed for ultra-low voltage operation.
Particularly, the design of ultra-low voltage SRAMs remains significantly
challenging due to the additional constraints such as high sensitivity to process-
voltage-temperature (PVT) variations, smaller cell stability, smaller voltage margin,
and prevailing leakage current.
SRAMs can dissipate significant power and consume high energy in numerous
applications, such as DSPs and MCUs. Consequently, energy efficiency is a
primary design parameter for SRAMs embedded in micro-watt systems. While numerous
research works have been conducted for minimizing SRAM energy consumption,
research on the utilization of multi-Vth devices for minimum energy-driven SRAMs
has rarely been explored. The main challenge in the design of ultra-low voltage
SRAMs with multi-Vth is to reduce leakage without degrading performance. In
decoupled SRAM cells, higher-Vth devices are preferred in write paths and data
storage to reduce leakage current, and lower-Vth devices are used in read paths to
achieve better performance. However, this can generate excessively slower write
operation than read operation if Vth of the devices in the write paths is too high
compared to that of the devices in the read ports. This chapter thereby examines the
approaches to improve the energy efficiency of SRAMs with multi-Vth devices.
Optimal device combinations will be analyzed for maximizing energy efficiency.
We will also present the effects of various SRAM design techniques on enhancing
the energy efficiency with multi-Vth devices. The rest of the chapter is organized as
follows. In Section 3.2, we will analyze the energy consumption of SRAMs. The
optimal Vth for maximum energy efficiency will be discussed in Section 3.3.
Section 3.4 explains design techniques that can enhance the energy efficiency.
Finally, we will make a summary in Section 3.5.
3.2 Analysis of SRAM Energy
Energy efficiency is a paramount design criterion in emerging ultra-low power
applications. Supply voltage scaling has been the most widely accepted method for
energy efficiency improvement. SRAMs, however, require additional considerations
such as array structures, active-switching, and leakage energy. Although dual-Vth
and multi-Vth schemes have been utilized for power reduction [40], [41], minimum
energy-driven device selections have been rarely visited. In this section, we will
analyze SRAM energy minimization considering the option of multi-Vth devices.
The functionality of all SRAMs is guaranteed by simulation, even at the condition
of the lowest supply voltage.
3.2.1 SRAM Energy Modeling
The occurrence of a minimum energy operating point is determined by the
correlation of power and performance. Energy consumption of an SRAM can be
separated into two components: switching energy, also known as dynamic energy,
Fig. 3.1 Simplified SRAM array diagram for energy analysis (MC: 8T decoupled SRAM cell).
and leakage energy known as static energy. Fig. 3.1 shows a simplified SRAM array
with highlighted critical parameters relevant to the energy analysis. The
conventional 8T decoupled SRAM cell [13] is employed due to its popularity in
ultra-low voltage SRAM design. The effect of an optimal device selection on the
energy of SRAM peripheral circuits is insignificant compared to that on SRAM
arrays. Therefore, peripheral circuits such as decoding blocks, column multiplexers
for read and write operations, sense amplifiers and write drivers are excluded in this
energy analysis.
The total energy (Etotal) of the SRAM array can be expressed by
$$E_{total} = E_{switching} + E_{leakage} \tag{3.1}$$
where Eswitching represents the dynamic energy consumed by switching activities.
Eleakage is the static energy consumption coming from the leakage current in the
SRAM cells. Eswitching is the summation of the switching energies during read
operation and write operation, which can be expressed as below.
$$E_{switching} = P_{Read}\left(C_{RWL}V_{DD}^{2} + k\,P_{Low}\,C_{RBL}V_{DD}^{2}\right) + P_{Write}\left(C_{WWL}V_{DD}^{2} + \frac{k}{m}\,C_{WBL}V_{DD}^{2}\right) \tag{3.2}$$
where PRead is the probability of read operation, PLow is the probability of reading
data ‘0’ during read operation, and PWrite is the probability of write operation. Note
that no switching activity occurs in the read bitlines when the read data is ‘1’. As
shown in Fig. 3.1, the read energy is associated with the wordline capacitance (CRWL)
and the bitline capacitance (CRBL). Note that multiple read bitlines (k) will be
discharged during read operation due to the shared read wordline. Read data also
affects the switching energy since the read bitlines are only discharged with the read
data of ‘0’. Similarly, the write energy is primarily determined by the write wordline
capacitance (CWWL) and the write bitline capacitance (CWBL). The switching write
bitline capacitance is determined by the number of columns (k) and the multiplexing
ratio (m). One write bitline in a pair switches regardless of write data. Therefore, the
write energy is independent of the write data.
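The switching-energy expression can be evaluated numerically. The sketch below assumes the form E_switching = P_Read (C_RWL VDD² + k P_Low C_RBL VDD²) + P_Write (C_WWL VDD² + (k/m) C_WBL VDD²), following the term descriptions in the text. The function name and argument names are illustrative, and the capacitance values used in any call are placeholders rather than extracted data.

```python
def switching_energy(vdd, p_read, p_low, p_write,
                     c_rwl, c_rbl, c_wwl, c_wbl, k, m):
    """Switching energy per access of the SRAM array (peripherals excluded)."""
    # Read: one read wordline toggles; k read bitlines discharge only for data '0'.
    e_read = p_read * (c_rwl * vdd**2 + k * p_low * c_rbl * vdd**2)
    # Write: one write wordline toggles; k/m write bitlines switch,
    # one per selected pair, regardless of the write data.
    e_write = p_write * (c_wwl * vdd**2 + (k / m) * c_wbl * vdd**2)
    return e_read + e_write
```

Because every term scales with VDD², halving the supply voltage reduces the switching energy to one quarter under this model.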
The static energy (Eleakage) of the SRAM array is given by
$$E_{leakage} = V_{DD}\, I_{Leakage}\, T = V_{DD}\, N \left( I_{SN}\, e^{\frac{V_{GS}-V_{thn}}{nV_{T}}} + I_{SP}\, e^{\frac{V_{GS}-V_{thp}}{nV_{T}}} \right) \left( 1 - e^{-\frac{V_{DS}}{V_{T}}} \right) T \tag{3.3}$$
where, ILeakage is the total leakage current, N is the number of SRAM cells, ISN and
ISP are technology scaling parameters for the NMOS and PMOS devices, VGS is the
gate-to-source voltage, VDS is the drain-to-source voltage, Vthn and Vthp are the
device threshold voltage of the NMOS and PMOS transistors, n is related to the
sub-threshold slope, VT is the thermal voltage, and T is the time to finish a
computation. For simplicity, we assume that the sub-threshold current only consists
of the drain current in the sub-threshold region.
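The leakage-energy model can be sketched in the same way. The code assumes the standard sub-threshold form, with an exponential dependence on (VGS − Vth)/(nVT) and a drain factor (1 − exp(−VDS/VT)), and treats the PMOS quantities as magnitudes. The function and argument names are illustrative assumptions.

```python
import math


def leakage_energy(vdd, n_cells, i_sn, i_sp, v_gs, v_ds,
                   v_thn, v_thp, n, v_t, t_op):
    """Static energy of the array over one operation time t_op."""
    # Sub-threshold current per cell: NMOS and PMOS contributions.
    i_sub = (i_sn * math.exp((v_gs - v_thn) / (n * v_t))
             + i_sp * math.exp((v_gs - v_thp) / (n * v_t)))
    # Drain-voltage factor, close to 1 once v_ds is a few times v_t.
    drain_factor = 1.0 - math.exp(-v_ds / v_t)
    i_leakage = n_cells * i_sub * drain_factor
    return vdd * i_leakage * t_op
```

Raising a threshold voltage by n·VT·ln(10) lowers the corresponding current term tenfold, but the energy only falls if the operation time `t_op` does not grow by the same factor.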
3.2.2 Effects of Supply Voltage Scaling and Threshold
Voltage on Energy Efficiency
As shown in Equations 3.1 to 3.3, the energy consumption is highly
sensitive to VDD and device threshold voltage. From a designer's point of view,
the simplest method of improving energy efficiency is to scale supply voltage.
Lowering VDD improves energy efficiency when the dynamic energy is dominant
over the static energy. However, the static energy becomes significant when the
supply voltage becomes near or below the device threshold voltage level. In this
region, even though the leakage current still decreases by lowering VDD, the
exponentially degraded performance quickly increases the overall static energy. As
a result, the combination of the dynamic energy and the static energy generates an
operating point that minimizes the total energy consumption. This point is generally
found in the region where VDD is below the device threshold voltage.
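The existence of a minimum energy point can be reproduced with a toy supply-voltage sweep. This is a sketch under loud assumptions and not the simulation setup of this chapter: the drain current uses a smooth EKV-style interpolation, the delay is taken as proportional to VDD/I_on, and the parameters `v_th`, `c_sw` and `w_leak` are arbitrary normalized values chosen only to make the trade-off visible.

```python
import math


def drain_current(v_gs, v_th, n=1.5, v_t=0.026, i_0=1.0):
    """Assumed EKV-style current, smooth from sub- to super-threshold."""
    return i_0 * math.log1p(math.exp((v_gs - v_th) / (2.0 * n * v_t))) ** 2


def total_energy(vdd, v_th=0.4, c_sw=1.0, w_leak=1000.0):
    """Normalized dynamic plus leakage energy per operation."""
    e_dyn = c_sw * vdd**2
    delay = vdd / drain_current(vdd, v_th)          # T ~ C * VDD / I_on
    e_leak = w_leak * vdd * drain_current(0.0, v_th) * delay
    return e_dyn + e_leak


def minimum_energy_point(v_lo=0.10, v_hi=1.20, step=0.01):
    """Return the swept supply voltage with the lowest total energy."""
    n_steps = int(round((v_hi - v_lo) / step))
    sweep = [v_lo + i * step for i in range(n_steps + 1)]
    return min(sweep, key=total_energy)


v_min = minimum_energy_point()
```

With these placeholder parameters the minimum falls below the assumed threshold voltage of 0.4 V: above it the VDD² term dominates, and below it the exponentially growing delay inflates the leakage energy.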
Higher-Vth devices have been utilized in the design of ultra-low power SRAMs due
to the exponentially decreased leakage current. The ultra-low power is obtained at
the cost of degraded performance. However, compared to the effect of the supply
voltage scaling on the energy efficiency, the effect of the threshold voltage on the
energy efficiency is not straightforward. Increasing the threshold voltage decreases
the amount of leakage current exponentially. However, increased threshold voltage
degrades performance exponentially too. Consequently, the impact of the threshold
voltage alteration on the static energy is determined by the ratio of the reduced
leakage current to the increased operating delay. If the gain in the leakage reduction
is larger than the loss in the performance, the overall energy efficiency improves by
replacing with higher-Vth transistors. Contrarily, if the impact of delay degradation
exceeds the gain in leakage suppression, the energy efficiency improves when
lower-Vth devices are adopted.
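The comparison above can be condensed into a single ratio. The subscripts 1 and 2 (before and after a threshold-voltage change) and the symbol R are introduced here only for illustration, and VDD is assumed to be unchanged:

$$R = \frac{E_{leakage,2}}{E_{leakage,1}} = \frac{I_{Leakage,2}}{I_{Leakage,1}} \cdot \frac{T_{2}}{T_{1}}$$

When Vth is raised, the first factor is smaller than 1 and the second is larger than 1. The higher-Vth devices improve the static energy if R < 1, while R > 1 means that the delay penalty outweighs the leakage saving and lower-Vth devices are the better choice.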
3.2.3 Effects of Multi-Vth Devices on SRAM Energy
Circuit design utilizing multi-Vth devices has been widely used in digital circuits.
Critical paths are preferred to be designed using lower-Vth devices while higher-Vth
devices are favored in non-critical paths. The higher-Vth devices in non-critical
paths reduce the leakage current and the lower-Vth devices maintain the required
performance. However, this cannot be easily employed in conventional 6T SRAMs
since the SRAM performance is directly related to the amount of leakage current.
Instead, multi-Vth devices have been usually adopted to achieve balanced design
parameters such as cell stability, performance, and write margin. Unlike the
conventional 6T SRAMs, decoupled SRAM cells with separated read and write
ports can accomplish the energy efficiency improvement by employing higher-Vth
devices in the data storage and low-Vth devices in the performance limiting read
port.
Fig. 3.2 illustrates a sample 8T dual-port SRAM cell designed with higher-Vth
devices in the cross-coupled latch and the write access transistors, and lower-Vth
devices in the read port. This is straightforward when considering that read
operation is slower than write operation. In this case, the energy model described in
Section 3.2.1 has to be modified. The energy equation for the switching energy
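remains the same while the leakage energy equation is written as
$$E_{leakage} = I_{Leakage}\, V_{DD}\, T = \left( I_{SN\text{-}HV}\, e^{\frac{V_{GS}-V_{thn\text{-}HV}}{nV_{T}}} + I_{SP\text{-}HV}\, e^{\frac{V_{GS}-V_{thp\text{-}HV}}{nV_{T}}} + I_{SN\text{-}LV}\, e^{\frac{V_{GS}-V_{thn\text{-}LV}}{nV_{T}}} \right) \left( 1 - e^{-\frac{V_{DS}}{V_{T}}} \right) N\, V_{DD}\, T \tag{3.4}$$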
where ISN-HV, ISP-HV, and ISN-LV are technology scaling parameters for the higher-Vth
NMOS, higher-Vth PMOS and lower-Vth NMOS, and Vthn-HV, Vthp-HV and Vthn-LV are the
device threshold voltages of the higher-Vth NMOS, higher-Vth PMOS and lower-Vth
NMOS. Compared to the previous leakage energy equation, three different types of
devices (two types in NMOS and one type in PMOS) determine the cell leakage
current. In addition to the leakage current, the time to finish a computation has to be
rewritten as
$$T = \max(t_{read},\, t_{write})$$
where tread is the time to finish a read operation and twrite is the time to complete a
write operation. If the write operation with higher-Vth devices takes longer than
the read operation with lower-Vth devices, twrite has to be used in the energy
estimation. This indicates that increasing the threshold voltage of the higher-Vth
devices beyond an optimal point quickly loses the energy efficiency improvement. In
the following section, we will discuss the optimal SRAM cell design toward energy
minimization using multi-Vth devices.
Fig. 3.2 Schematic of an 8T decoupled SRAM cell with multi-Vth devices (higher-Vth devices in the latch and write access transistors, lower-Vth devices in the read port).
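The multi-Vth leakage model and the operation time T = max(tread, twrite) can be combined in a short sketch. It assumes the same sub-threshold current form as the single-Vth model, with three device terms (higher-Vth NMOS, higher-Vth PMOS, lower-Vth NMOS) and threshold voltages given as magnitudes. All function and argument names are illustrative.

```python
import math


def operation_time(t_read, t_write):
    """The slower of the read and write operations sets the cycle time."""
    return max(t_read, t_write)


def multi_vth_leakage_energy(vdd, n_cells, i_sn_hv, i_sp_hv, i_sn_lv,
                             v_gs, v_ds, v_thn_hv, v_thp_hv, v_thn_lv,
                             n, v_t, t_read, t_write):
    """Leakage energy of an array whose cells mix higher- and lower-Vth devices."""
    i_sub = (i_sn_hv * math.exp((v_gs - v_thn_hv) / (n * v_t))     # storage/write NMOS
             + i_sp_hv * math.exp((v_gs - v_thp_hv) / (n * v_t))   # storage PMOS
             + i_sn_lv * math.exp((v_gs - v_thn_lv) / (n * v_t)))  # read-port NMOS
    drain_factor = 1.0 - math.exp(-v_ds / v_t)
    return i_sub * drain_factor * n_cells * vdd * operation_time(t_read, t_write)
```

If the higher-Vth write path becomes slower than the lower-Vth read path, `t_write` sets the operation time, and the leakage current of every device is integrated over this longer period.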
3.3 Minimum Energy-Driven SRAM Design
Utilizing Multi-Vth Devices
SRAM energy consumption is determined not only by supply voltage selection but
also by device selection. It has been demonstrated that the minimum energy of an
SRAM is found in the sub-threshold region. In this section, we will investigate the
impact of device selection on minimum energy consumption. Table 3.1 summarizes
the relevant design parameters used in the analysis. An SRAM array in commercial
65 nm CMOS technology is simulated over various combinations in device
selection. We use three types of devices which are low-Vth device (LVT), standard-
Vth device (SVT) and high-Vth device (HVT) available in the selected CMOS
technology. Read delay is measured from the clock to the point where RBL crosses 0.5 × VDD.
Write delay is measured from the clock to the data flipping point. Instant
read-after-write operation is not considered here and will be analyzed in Chapter 4.
Process and temperature variations affect device characteristics. However, they are
not included and this work will primarily focus on the effect of multi-Vth devices on
SRAM energy minimization.
Table 3.1 Parameter summary on energy analysis simulation
Items             Remarks
Technology        Commercial 65 nm CMOS
Array structure   256 rows × 128 columns
SRAM cell         8T decoupled SRAM cell
Devices           LVT, SVT, HVT
Vth               LVT: 0.28 V/-0.2 V  SVT: 0.37 V/-0.31 V  HVT: 0.61 V/-0.59 V
Read delay        From clock to RBL at 0.5 × VDD
Write delay       From clock to data flipping point
SRAM operation    Read probability = 0.5, write probability = 0.5
3.3.1 Analysis of SRAM Energy without Multi-Vth Devices
To minimize the leakage power consumption, SRAM cells have employed higher-
Vth devices at the cost of performance degradation. However, the degraded
performance caused by the selection of higher-Vth devices also affects energy
consumption. Therefore, careful device selection has to be considered for improving
energy efficiency using multi-Vth devices. Fig. 3.3 shows the energy
consumption of SRAMs built with different device types as the supply voltage is swept. When the
supply voltage is in the super-threshold region, dynamic energy is dominant
compared to leakage energy. Therefore, lowering supply voltage reduces overall
energy consumption. As expected, the minimum point of each device selection is
formed at a point where the supply voltage is around the threshold voltage of the
devices. However, the minimum energy level using HVT (0.16) or SVT (0.12) is
higher than that of using LVT (0.08), which explains that selecting SVT or HVT for
improving power dissipation is not the best choice in terms of energy efficiency.
This result can be explained as follows. Compared to LVT, SVT and HVT decrease
leakage current and increase read delay. However, since the increase in the read
delay is more significant than the decrease in the leakage, the SRAM arrays using
SVT and HVT consume more energy overall.
Fig. 3.3 Normalized energy of three SRAMs designed by three different device
types (i.e. HVT, SVT and LVT). All transistors in one SRAM have the same Vth.
[Plot: normalized energy (a.u.) versus supply voltage (V). Minimum energy: 0.08 using LVT, 0.12 using SVT, 0.16 using HVT.]
3.3.2 Analysis of SRAM Energy with Multi-Vth Devices
Transistors with different threshold voltages are offered in recent CMOS
technologies. This provides circuit designers with more opportunities to optimize
circuits in performance, power, and energy. While higher energy efficiency can be
achieved through a proper device selection, an undesirable device selection will
Fig. 3.4 Impact of device selection on normalized energy of three SRAMs. Note
that HVT devices are employed for read port in all SRAMs. The remaining transistors in
each SRAM cell adopt one device type.
[Plot: normalized energy (a.u.) versus supply voltage (V), HVT for read port. Minimum energy: 0.31 using LVT, 0.23 using SVT, 0.16 using HVT.]
Fig. 3.5 Normalized delay values of SRAM read and write operations designed
with HVT devices.
[Plot: normalized delay (a.u.) versus supply voltage (V); curves: read delay and write delay.]
produce lower energy efficiency. Fig. 3.4 shows the impact of undesirable device
selections on SRAM energy. HVT devices are employed in the read port, so the read path limits the
overall performance. In this case, using SVT and LVT devices does not improve the
energy efficiency because SVT and LVT devices in write paths dissipate more
power without improving the overall performance.
In general, higher-Vth devices are employed in non-critical paths for reducing power
while lower-Vth devices are adopted in critical paths for achieving high performance.
Conventionally, read paths are considered as critical paths, limiting overall
performance as shown in Fig. 3.5. Write paths are non-critical due to the faster
operation speed than read paths. Therefore, lower-Vth devices have to be
incorporated in read operation, and higher-Vth devices can be employed in write
paths. However, as supply voltage decreases, the write speed with higher-Vth
devices degrades faster than the read speed with lower-Vth, eventually making the
write paths critical. In this case, overall energy consumption needs to be estimated
carefully since the degraded critical path delay from write operation becomes more
substantial. Fig. 3.6 explains the impacts of device selection on the critical path
delay. Using the result of Fig. 3.3, LVT devices are used in the read port for
enhancing the SRAM performance. When SVT devices are used in the write paths,
the delay of the write paths is still smaller than that of the read paths using LVT
Fig. 3.6 Comparison of read delay (LVT) with write delay implemented with
multi-Vth devices (SVT and HVT).
[Plot: normalized delay (a.u.) versus supply voltage (V); curves: read delay (LVT), write delay (SVT), write delay (HVT); the marked point is where write delay (HVT) equals read delay (LVT).]
devices. As a result, using SVT in the write paths will decrease leakage current
while maintaining the same performance, consequently reducing energy
consumption. However, when HVT devices are adopted in the write paths, the delay
of the write paths will be larger than that of the read paths at lower supply voltages.
This occurs because the write delay increases exponentially from a higher supply
level while the read delay starts to increase exponentially only at lower VDD.
Specifically, read delay is larger than write delay in a single Vth SRAM cell. For this
Fig. 3.7 Normalized energy of SRAMs utilizing three different device types
(i.e. HVT, SVT and LVT) for data storage and write paths. Note that LVT
devices are used in read port.
[Plot: normalized energy (a.u.) versus supply voltage (V), LVT for read port. Minimum energy: 0.10 using LVT, 0.07 using SVT, 0.13 using HVT.]
Fig. 3.8 Comparison of leakage current over various device combinations.
[Plot: normalized leakage (a.u.) versus supply voltage (V); curves: LVT(W)-LVT(R), SVT(W)-LVT(R), HVT(W)-LVT(R), SVT(W)-SVT(R), HVT(W)-SVT(R), HVT(W)-HVT(R).]
MTCMOS cell, read delay is larger initially. However, HVT devices have higher Vth
than LVT devices. Thus current from the write paths degrades sharply when VDD is
near the Vth of HVT devices (~ 0.6 V) whereas it is still super-threshold for LVT
devices. The significantly degraded write performance will lose the benefit of
utilizing higher-Vth for enhancing energy efficiency. Fig. 3.7 demonstrates
simulated SRAM energy of various device combinations. As expected from Fig. 3.6,
Fig. 3.9 Summary of normalized minimum energy consumption over various
device combinations.
[Bar chart: normalized energy (a.u.) for the device combinations LVT(W)-HVT(R), SVT(W)-HVT(R), HVT(W)-HVT(R), LVT(W)-SVT(R), SVT(W)-SVT(R), HVT(W)-LVT(R), HVT(W)-SVT(R), LVT(W)-LVT(R) and SVT(W)-LVT(R). Variation in energy caused by device selection: 6.24x.]
Fig. 3.10 Summary of normalized leakage current over various device
combinations.
[Bar chart: normalized leakage (a.u.) for the same device combinations, split into leakage in write paths and leakage in read ports. SVT(W)-LVT(R), the combination with the best energy efficiency, is not the best for leakage.]
the minimum energy of an SRAM array using SVT in the write paths shows better
efficiency due to the reduced leakage current. However, when HVT is used in the
write paths, the minimum energy point forms at a supply voltage of 0.4 V and
the energy increases dramatically. Fig. 3.8 shows the leakage current of the
SRAM arrays under different device combinations. Although the selection of SVT
in the write paths and LVT in the read paths has the lowest minimum energy level,
it consumes the second largest leakage current.
Fig. 3.9 summarizes the normalized energy consumption of various device
combinations. The energy varies by up to 6.24× across the combinations, which
emphasizes the importance of careful device selection. Apart from energy efficiency,
leakage reduction by device selection is equally important. SRAMs with lower leakage
power dissipation are in greater demand in battery-powered applications, especially in
sleep mode. The corresponding leakage currents of the device combinations described in
Fig. 3.9 are shown in Fig. 3.10. Note that the device combination with the highest
energy efficiency (SVT(W)-LVT(R)) is not the best in terms of leakage. In addition,
the device combination with the highest leakage (LVT(W)-LVT(R)) has the second
highest energy efficiency. Therefore, the device selection has to be made carefully
depending upon the system requirements. If an SRAM stays in an idle or sleep
mode for the majority of its lifetime, the leakage current becomes more significant
than the energy efficiency during computational operations. However, the energy
efficiency becomes more significant if the SRAM workload is substantial. One
design option is to implement the write paths with different device types. By
implementing the write access transistors and the latch with devices of different Vth,
both write delay and leakage current can be improved. Similarly, the MTCMOS
methodology is also applicable to the 6T SRAM cell, but the transistor sizes have to be
carefully selected to maintain cell stability.
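The workload dependence described above can be illustrated with a small sketch. The two candidate designs, their parameter names and all numbers below are hypothetical, normalized values chosen only to show the trade-off; they are not taken from the simulations in this chapter.

```python
# Hypothetical, normalized figures chosen only for illustration (not thesis data).
CANDIDATES = {
    "energy-optimized": {"e_access": 1.0, "p_standby": 1.0},
    "leakage-optimized": {"e_access": 3.0, "p_standby": 0.1},
}

def lifetime_energy(e_access, p_standby, n_access, t_idle):
    """Energy spent on accesses plus standby leakage accumulated while idle."""
    return e_access * n_access + p_standby * t_idle

def best_candidate(n_access, t_idle):
    """Name of the candidate with the lowest lifetime energy for this workload."""
    return min(
        CANDIDATES,
        key=lambda name: lifetime_energy(
            n_access=n_access, t_idle=t_idle, **CANDIDATES[name]
        ),
    )

print(best_candidate(n_access=1000, t_idle=10))   # busy workload
print(best_candidate(n_access=10, t_idle=1000))   # mostly idle workload
```

Under these assumed numbers the preferred design changes with the ratio of accesses to idle time, which is the qualitative point made in the text.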
3.4 Design Techniques for SRAM Energy Efficiency
Improvement Utilizing Multi-Vth Devices
As discussed earlier, SRAM energy is determined by multiple parameters such as
[Figure 3.11 schematics: (a) 8T cell (6T core plus read port) with RWL, WWL, WBL, WBLB, RBL and CSL; (b) 8T cell with RWL, WWL, WBL, WBLB, RBL and FOOT.]
Fig. 3.11 8T decoupled SRAM cells with leakage reduction techniques: (a)
column-interleaved and (b) read buffer foot control
[Figure 3.12 plot: normalized energy (a.u., 0.01 to 1) versus supply voltage (0 to 1.4 V) for the reference design and for 8-to-1, 16-to-1 and 32-to-1 multiplexing, with minimum energy points marked. Annotated values: 0.072 and 0.014.]
Fig. 3.12 Effect of column-interleaved scheme on SRAM energy. The
reference design is using SVT devices in the write paths and LVT devices in
the read path, which is also shown in Fig. 3.7.
leakage current, dynamic current and critical path delay. Various SRAM design
techniques have been proposed to improve these parameters. In this section, we
explore the effects of such design techniques on energy efficiency
when multi-Vth devices are used. A column-interleaved scheme
and a read buffer foot control scheme for leakage reduction, and boosting schemes
for performance improvement, are considered. The energy overhead of utilizing
the two techniques is negligible compared to the improvement they make. Other
write performance boosting techniques such as data retention voltage collapsing and
the negative bitline scheme are also effective and applicable.
3.4.1 Effect of Power Reduction Techniques on SRAM
Energy
Fig. 3.11 illustrates 8T decoupled SRAMs employing the column-interleaved
scheme [42] and the read buffer foot control scheme [43]. In Fig. 3.11(a), the
Column-Selected Line (CSL) is shared by the SRAM cells in each column. During
non-read operation, CSL is held at VDD to eliminate the read bitline leakage from the
pre-charged RBL to CSL. During read operation, CSL in the selected columns is pulled
down to GND and RBL is conditionally discharged based upon the stored cell data.
In the unselected columns, however, CSL remains at VDD, which eliminates not only
the bitline leakage in the read port but also the unwanted RBL discharging. The read
buffer foot technique was proposed to reduce bitline leakage and enhance the read
bitline sensing margin. As Fig. 3.11(b) depicts, FOOT is shared by the SRAM cells in
each row. It can be either pulled up to VDD to eliminate leakage current flowing to
the read bitline or statically connected to GND to form a discharging path from the read
bitline to ground. During non-read operation, FOOT is connected to VDD to
eliminate the leakage through the read port. During read operation, only FOOT
in the selected row is pulled down to GND, and all RBLs are conditionally
discharged based upon the data in the selected row. Compared to the
column-interleaved scheme, the key advantage of the read buffer foot control scheme is
the enhanced RBL sensing margin at low supply voltage. However, the
column-interleaved scheme performs better in terms of power
reduction since it eliminates the unwanted dynamic discharging as well as the RBL
leakage. Therefore, in this analysis, we estimate the effect of the
column-interleaved scheme on the overall SRAM energy. The combination of SVT in the write
paths and LVT in the read paths, as shown in Fig. 3.7, is also assumed in the analysis.
While the RBL leakage is avoided in both the column-interleaved scheme (Fig.
3.11(a)) and the read buffer foot control scheme (Fig. 3.11(b)), the dynamic energy
reduction is more significant in the column-interleaved scheme and is primarily
determined by the multiplex ratio. Although the selected column dissipates more
power due to the discharging of CSL and the internal node in the read port, the
elimination of RBL discharging in the unselected columns improves the overall SRAM
energy. Fig. 3.12 demonstrates the effectiveness of the column-interleaved scheme
on energy efficiency. Simulation shows that the energy reduction grows with the
multiplex ratio. A multiplex ratio of 32 improves the energy efficiency by ~5×
compared to the reference design, whose device combination has the highest energy
efficiency (Fig. 3.7). In addition, raising the multiplex ratio moves the minimum
energy points to higher supply voltages, which is more desirable when considering
the larger device variations at lower supply voltages.
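A rough way to see why the multiplex ratio drives the dynamic energy is to count how many read bitlines can discharge in one access. The sketch below is only a counting model under stated assumptions: the function name, the array width of 256 columns and the 50% discharge probability are hypothetical, and the final ratio assumes that the values 0.072 and 0.014 annotated in Fig. 3.12 are the minimum energies of the reference and 32-to-1 designs.

```python
def expected_rbl_discharges(n_columns, mux_ratio=1, p_discharge=0.5):
    """Expected number of read bitlines discharged in one read access.

    mux_ratio=1 models an array without column interleaving, where every
    column is exposed to conditional discharge.
    """
    if n_columns % mux_ratio != 0:
        raise ValueError("n_columns must be a multiple of mux_ratio")
    selected_columns = n_columns // mux_ratio
    return selected_columns * p_discharge

for ratio in (1, 8, 16, 32):
    print(ratio, expected_rbl_discharges(256, ratio))

# Ratio of the two values annotated in Fig. 3.12 (assumed to be the minimum
# energies of the reference design and the 32-to-1 design).
print(round(0.072 / 0.014, 2))
```

The bitline discharge count falls in proportion to the multiplex ratio, while the overall gain read from the figure is smaller, which is consistent with the extra CSL and internal-node discharging in the selected columns noted above.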
3.4.2 Effect of Performance Boosting Techniques on SRAM
Energy
For a given SRAM array architecture, device selection has to be made to
maximize performance and minimize leakage in order to achieve better energy
efficiency. In Fig. 3.7, the highest energy efficiency is achieved by the combination
of SVT in the write paths and LVT in the read paths. Although HVT in the write
[Figure 3.13 schematic: 8T cell (6T core plus read port) with RWL, WWL, WBL, WBLB and RBL; the legend distinguishes boosted signals from normal signals.]
Fig. 3.13 Simplified 8T SRAM schematic adopting boosted wordline scheme.
[Figure 3.14 plot: normalized energy (a.u.) versus supply voltage (0 to 1.4 V) for write HVT, read LVT, before and after boosting write speed, with the energy reduction marked. Annotated values: 0.47, 0.05 and 0.13.]
Fig. 3.14 Improvement of energy efficiency by boosting write performance.
Additional energy overhead induced by the boosting voltage generation is not
considered in this simulation.
paths can reduce the leakage more substantially, the exponentially degraded
write performance deteriorates the overall energy efficiency much
faster. Boosted voltage schemes can be employed to enhance write performance
relative to read performance (Fig. 3.13). In this scheme, WWL is boosted
to a voltage higher than VDD. Consequently, the VGS of the write access transistors
increases and the write speed is enhanced accordingly. Fig. 3.14 shows the
change in the SRAM array energy after utilizing a boosted voltage scheme. As expected,
the significant boost in write performance eliminates the increase in the SRAM
energy below the previous minimum energy point and improves the energy
efficiency continuously even at lower supply voltages. The gain in energy
reduction grows as the supply voltage decreases. For example, a 9.4× improvement
is achieved at a supply voltage of 0.2 V. Fig. 3.15 summarizes the effectiveness
of the boosted voltage scheme on various device combinations. The boosted
voltage scheme is only useful in HVT(W)-LVT(R) and HVT(W)-SVT(R), whose
write operation is slower than the read operation at lower supply voltages. It is worth
noting that the largest energy reduction is realized in HVT(W)-LVT(R) because the
leakage in the write paths is the smallest and the performance is the highest.
Compared to SVT(W)-LVT(R), whose energy efficiency is the highest before
[Figure 3.15 bar chart: normalized minimum energy (a.u., 0 to 3.5) per device combination. Labels in order: HVT(W)-LVT(R) 2.13 (0.89 after write speed boosting), HVT(W)-SVT(R) 1.86 (1.44 after write speed boosting), SVT(W)-LVT(R) 1.0, HVT(W)-HVT(R) 3.23, SVT(W)-SVT(R) 2.30, LVT(W)-LVT(R) 1.59.]
Fig. 3.15 Comparison of normalized minimum energy consumption with write
performance techniques.
performance boosting, HVT(W)-LVT(R) consumes 11% less energy. The improvement is
relatively small because, although a significant amount of leakage is
removed from the array by using HVT, the RBL leakage caused by the LVT devices
dominates the overall leakage current. This limits the overall improvement in
energy efficiency.
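The two gains quoted in this subsection can be checked against the figure annotations. This is only a reading of the plots under two assumptions: that 0.47 and 0.05 in Fig. 3.14 are the normalized energies at 0.2 V before and after boosting, and that 1.0 and 0.89 in Fig. 3.15 belong to SVT(W)-LVT(R) and boosted HVT(W)-LVT(R), respectively.

```latex
\frac{E_{\text{before}}(0.2\,\mathrm{V})}{E_{\text{after}}(0.2\,\mathrm{V})}
  = \frac{0.47}{0.05} = 9.4,
\qquad
1 - \frac{E_{\text{HVT(W)-LVT(R), boosted}}}{E_{\text{SVT(W)-LVT(R)}}}
  = 1 - \frac{0.89}{1.0} = 0.11 = 11\%
```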
3.4.3 Combination Effect of Power Reduction and
Performance Boosting Techniques
To maximize the energy efficiency of an SRAM, both leakage reduction and
[Figure 3.16 plot: normalized energy (a.u., 0.001 to 1) versus supply voltage (0 to 1.4 V) for write HVT, read LVT. Curves: original, (a) performance boosting, (b) power reduction, and (a)+(b). Annotated minimum energies: 0.131, 0.055, 0.014 and 0.006.]
Fig. 3.16 Improvement of minimum energy after adopting the column-
interleaved scheme (Fig. 3.11(a)) and the boosted voltage scheme (Fig. 3.13).
Multiplex ratio of 32 is assumed.
[Figure 3.17 bar chart: normalized minimum energy (a.u., 0 to 1.2) per device combination. Labels in order: HVT(W)-HVT(R) 1.00, SVT(W)-SVT(R) 0.71, LVT(W)-LVT(R) 0.49, HVT(W)-SVT(R) 0.58, HVT(W)-LVT(R) 0.66, SVT(W)-LVT(R) 0.31, HVT(W)-SVT(R) after write performance boosting 0.45, HVT(W)-LVT(R) after write performance boosting 0.27, HVT(W)-LVT(R) after power reduction 0.07, HVT(W)-LVT(R) with both 0.03. Annotation: energy efficiency improvement = 33x.]
Fig. 3.17 Comparison of normalized SRAM minimum energy consumption.
performance improvement need to be achieved at the same time. In this work, the
maximum energy efficiency is obtained from HVT(W)-LVT(R) after adopting
the column-interleaved scheme (Fig. 3.11(a)) for power reduction and the boosted
voltage scheme (Fig. 3.13) for performance improvement. Fig. 3.16 illustrates the
SRAM energy of HVT(W)-LVT(R) after employing the aforementioned design
techniques. Note that the two design techniques together reduce the normalized minimum
energy from 0.131 to 0.006 (~22×). Finally, the benefits of multi-Vth devices
combined with the power reduction and performance boosting techniques are
summarized in Fig. 3.17. When additional circuit techniques are not employed, the
SRAM with the optimal device selection (SVT(W)-LVT(R)) consumes 31% of the
energy of an SRAM designed solely with HVT devices. However, the optimal device
selection moves from SVT(W)-LVT(R) to HVT(W)-LVT(R) after adopting the
power reduction and performance boosting techniques. Consequently, an energy
efficiency improvement of 33× is achieved, which is larger than the energy saving
from voltage scaling as shown in Fig. 3.7.
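As a quick check of the two factors quoted above, using the minimum energies 0.131 and 0.006 from the text and assuming that the smallest bar in Fig. 3.17 (0.03, normalized to the all-HVT design at 1.00) is HVT(W)-LVT(R) with both techniques applied:

```latex
\frac{0.131}{0.006} \approx 21.8 \approx 22\times,
\qquad
\frac{1.00}{0.03} \approx 33\times
```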
3.5 Summary
This chapter presents a comprehensive energy analysis of SRAMs using multi-Vth
devices. Although higher-Vth devices are preferred in the write paths for
reducing power and energy consumption, the device types have to be selected
carefully to maximize the benefit of utilizing multi-Vth devices. Using higher-Vth
devices in the SRAM write paths improves energy efficiency when VDD is in the
strong inversion region, where the write speed with higher-Vth devices is still higher
than the read speed with lower-Vth devices. However, lowering the supply voltage
degrades the write speed faster than the read speed, eventually leading to a slower
write operation and losing the benefit of using higher-Vth devices in the write paths
for power and energy reduction. Therefore, there is a limit on the Vth difference
between the devices used in the write paths and those in the read paths. In this analysis,
using the devices (HVT, SVT, and LVT) available in a commercial CMOS
technology, the best device combination for energy minimization is to use SVT
devices in the write paths and LVT devices in the read ports. We also explored the
effects of several power reduction and performance boosting techniques on SRAM
energy efficiency. After employing these techniques, the optimal device
combination moves to HVT devices in the write paths and LVT devices in the read
paths. This optimal combination improves energy efficiency by 33× compared to
the combination of HVT devices in both the write and read paths.
Chapter 4 Design of an Ultra-low Voltage
9T SRAM with Equalized Bitline Leakage
and CAM-assisted Energy Efficiency
4.1 Background
State-of-the-art DSP cores and advanced healthcare SoCs [44],[45] benefit from
the availability of on-chip SRAMs with substantially reduced power dissipation and
improved energy efficiency. Integrated SRAMs play a crucial role in meeting the
density, performance, power, and energy requirements of applications. By
aggressively scaling the supply voltage near or below the transistor threshold voltage,
the power and energy efficiency of SRAMs can be greatly improved at the expense of
performance. However, the vulnerability of SRAMs to PVT fluctuations makes
reliable near- and sub-threshold operation extremely challenging in deep sub-micron
CMOS technologies. At the same time, other design metrics such as stability,
read/write margin, and leakage need to be carefully revisited for reliable
operation.
SRAMs have achieved ultra-low power/energy through supply scaling
[16],[46],[47]. However, they suffer from various design issues mainly caused by
the reduced Ion-to-Ioff ratio combined with large variations. Under a severely scaled
supply voltage, the cell stability and bitline sensing margin of 6T SRAMs degrade
dramatically due to the significant impact of disturbing current and bitline leakage.
To address this, an 8T differential SRAM cell [48] has been proposed that injects
identical leakage current into the differential bitlines, eliminating the differential
offset voltage caused by the leakage. In general, however, decoupled SRAM cells
[16],[47] are preferable in the weak-inversion regime because they make the read SNM
identical to the hold SNM. Moreover, the dedicated read port enables a faster read
operation with no disturbing current to the cell nodes.
Energy efficiency is a vital design metric for ultra-low voltage SRAMs. Although
voltage scaling decreases the switching energy quadratically, it degrades the
operating frequency by several orders of magnitude. Accordingly, the leakage energy
accumulated over slow clock cycles dominates the total energy in the deep sub-threshold
region, causing the energy curve to rise sharply [46]. To reduce the static
energy, leakage current minimization techniques are desirable. In general logic
circuits, adopting HVT devices in non-critical paths is favorable for suppressing
leakage. Another effective method to improve energy efficiency is to suppress
leakage energy by eliminating idle gates or modules in the system, as adopted
in [49]. Leakage suppression is also attainable at the algorithm level [50]. Among
all the strategies for energy saving, leakage energy reduction is the first concern for
improving energy efficiency.
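The formation of a minimum energy point can be reproduced with a first-order model. The sketch below is illustrative only: the smooth EKV-style current expression, the constant leakage current and every parameter value (threshold voltage, slope factor, leakage level) are assumptions chosen for the example, not values from this work.

```python
import math

def on_current(vdd, vth=0.4, n=1.5, v_thermal=0.026, i0=1.0):
    """Smooth EKV-style drive current covering sub- and super-threshold."""
    x = (vdd - vth) / (2.0 * n * v_thermal)
    return i0 * math.log1p(math.exp(x)) ** 2

def energy_per_cycle(vdd, c_switching=1.0, i_leakage=1e-2, k_delay=1.0):
    """Switching energy plus leakage energy accumulated over one cycle."""
    cycle_time = k_delay * vdd / on_current(vdd)   # delay ~ C*V / I_on
    return c_switching * vdd ** 2 + i_leakage * vdd * cycle_time

voltages = [0.10 + 0.01 * k for k in range(111)]   # 0.10 V to 1.20 V
v_min = min(voltages, key=energy_per_cycle)
print(f"minimum energy point near {v_min:.2f} V")
```

In this model the switching term falls quadratically with VDD while the cycle time grows exponentially below the threshold voltage, so the leakage term eventually takes over and the total energy rises again at very low voltages.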
In this work, we present several design techniques to realize an energy-efficient
SRAM over a wide range of supply voltages, with the following features: 1) a
decoupled 9T SRAM cell with an improved SNM compared to the 6T cell; 2) a 3T
read port for equalizing RBL leakage and augmenting bitline swing; 3) MTCMOS
technology for minimizing leakage in the 6T write port and maximizing
SRAM performance in the read port; 4) a CAM-assisted circuit technique for
improving the energy efficiency by boosting the write speed. The proposed circuit
techniques are demonstrated by a 16 kb SRAM test macro (including the CAM)
fabricated in a 65 nm CMOS technology.
4.2 Proposed SRAM Design Techniques for Ultra-
low Voltage Operation
4.2.1 A Novel 9T SRAM Cell
Fig. 4.1 depicts the proposed 9T SRAM cell and its layout. The cell consists of a 6T
SRAM part (the write-access transistors with a cross-coupled latch) and a dedicated
read port. The read port comprises three NMOS transistors (M7, M8 and M9) for
realizing equalized bitline leakage and improving bitline sensing margin in a single-
ended read bitline (RBL). The write access paths and the data storage latch are
implemented with HVT devices for leakage reduction while the read port employs
LVT devices for performance. The layout of the 9T cell occupies an area of 2.63×
[Figure 4.1: (a) schematic of the 9T cell with WWL, WBL, WBLB, storage nodes Q and QB, transistors M1 to M6 (higher Vth) and a read port of M7, M8 and M9 (lower Vth) connected to RWL and RBL; (b) cell layout, 2.63 µm × 0.72 µm.]
Fig. 4.1(a) Proposed 9T SRAM cell implemented with HVT devices in write
paths and LVT devices in read port. (b) Layout of the 9T SRAM cell based on
65 nm logic rules.
0.72 µm² based on logic design rules. A write operation is enabled by activating a
write wordline (WWL) and completed when the data loaded at WBL and WBLB is
written into Q and QB. A read operation starts by enabling a read wordline (RWL)
and is followed by conditional RBL discharging. If Q holds logic ‘0’, M7 is turned
on and discharges RBL to GND. If, on the contrary, Q stores logic ‘1’, M8 is
activated and provides pull-up current from RWL ( = VDD) to RBL, slowing down
the discharging speed of RBL.
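The read-port behavior described above can be summarized as a small truth table. The sketch below is a behavioral illustration only: the function name and return format are hypothetical, the gate connections (M7 driven by QB, M8 driven by Q) are inferred from the description in this chapter, and the unselected-cell rows follow the leakage paths discussed in Section 4.2.3.

```python
def read_port_state(q, rwl_high):
    """Describe which read-port device conducts and where the RBL current goes.

    q        : stored data at node Q (0 or 1); QB is its complement.
    rwl_high : True when the cell's read wordline is asserted (RWL = VDD).
    """
    if q not in (0, 1):
        raise ValueError("q must be 0 or 1")
    # Inferred from the text: M7 is gated by QB, M8 is gated by Q.
    device = "M7" if q == 0 else "M8"
    if rwl_high:
        path = "RBL discharges to GND" if q == 0 else "pull-up current from RWL to RBL"
    else:
        path = "leakage to GND" if q == 0 else "leakage to RWL (= GND)"
    return {"on_device": device, "rbl_path": path, "selected": rwl_high}

for q in (0, 1):
    for selected in (True, False):
        print(q, selected, read_port_state(q, selected))
```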
4.2.2 Analysis of Static Noise Margin and Write Margin
Decoupled SRAM cells, such as the 8T SRAM cell in [13] and the 10T SRAM cell
in [17], have been widely accepted for SNM improvement. Eliminating the
interference from read bitlines into cell nodes, as in the 8T cell and the 10T cell,
makes the read-mode SNM equivalent to the hold-mode SNM. The read-mode
SNM of the proposed 9T multi-Vth SRAM cell is compared to those of the
conventional 6T cell and the 10T cell in Fig. 4.2(a). To investigate the impact of
different Vth on SNM, the 6T SRAM cell is implemented with two device types:
one version uses SVT devices and the other uses HVT
devices. Both pull-down NMOS transistors are over-sized by 1.67×. SVT devices
with the same geometry as the 9T SRAM cell are utilized for the 10T cell, on the
assumption that no multi-threshold voltage option is adopted in [17]. The SNM
values over the operating supply range are illustrated in Fig. 4.2(a). For the SVT
cells, the SNM increases significantly with VDD and its growth then slows slightly in the
super-threshold regime. The SNMs of the 6T and the 9T HVT cells, in contrast,
exhibit a more linear dependence on supply voltage and are far from saturation as
VDD increases. This is partially caused by the higher channel implant of the HVT layer in
this multi-threshold technology. The 9T cell shows an SNM of 52 mV at 0.2 V,
improving the margin by 85.7% compared to the 6T HVT cell. At the nominal
[Figure 4.2: (a) read SNM (mV, 0 to 600) versus VDD (0 to 1.4 V) for 6T HVT, 9T HVT, 10T SVT and 6T SVT cells; (b) histogram of occurrence versus read SNM (80 to 188 mV) at VDD = 0.4 V, with μ = 145 mV and σ = 17 mV.]
Fig. 4.2(a) 9T read SNMs compared with 6T and 10T SNMs with different
voltages. (b) Distribution of 9T read SNMs at 0.4 V.
supply voltage, the SNM of the 9T SRAM cell is 1.13× larger than that of the 10T SRAM
cell, whereas the difference between the two SNMs decreases at lower supply voltages.
SNM Monte Carlo simulations for 3σ mismatch on top of the TT corner are
conducted and the results are illustrated in Fig. 4.2(b). The 10k-point Monte Carlo
simulations at VDD = 0.4 V reveal that the proposed SRAM cell generates a mean
SNM of 145 mV with a standard deviation of 17 mV. Compared with the 10T SRAM cell
composed entirely of standard-Vth (SVT) transistors, it provides a higher mean
value with comparable variation.
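As a rough reading of these statistics, and assuming the simulated SNM distribution is approximately Gaussian (an assumption made here for illustration, not a result stated by the simulation), the 3σ lower bound of the read SNM at 0.4 V is:

```latex
\mu - 3\sigma = 145\,\mathrm{mV} - 3 \times 17\,\mathrm{mV} = 94\,\mathrm{mV}
```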
For an SRAM cell, write margin is interpreted as the voltage headroom at the write
wordline for a successful write operation. Generally, it is determined by the drive
strength ratio of the write-access transistors to the pull-up transistors. The simulated
write margin of the 9T SRAM cell is plotted in Fig. 4.3. As the supply
voltage is swept from 0.2 V to 1.2 V, the write margin increases from 34 mV to 320 mV,
a 9.4× improvement. Utilizing SVT devices in the write paths can generate a larger
write margin due to their stronger writeability compared to HVT devices. Although
the HVT devices in the 9T cell are relatively weak, they are employed in the entire
write paths since a compact cell layout for high-density integration and lower leakage
are more important. Write failures in the 9T SRAM cell can be compensated by a
CAM-assisted write performance boosting technique whose details will be
explained in Section 4.3.2.
[Figure 4.3 plot: write margin (mV, 0 to 500) versus VDD (0 to 1.4 V) for HVT and SVT devices.]
Fig. 4.3 Comparison of write margins of HVT and SVT devices.
4.2.3 Bitline Leakage Equalization with the Worst Case of
Leakage
During read operation, the voltage level of RBL is a function of VDD, device
threshold voltage, leakage current, and other factors. For a given VDD and bitline length,
the RBL level is highly affected by the amount of leakage current. Maximum bitline
leakage occurs when the data in the unselected cells is all logic ‘0’. Similarly,
minimum leakage current appears if the data pattern in the un-accessed cells is all
logic ‘1’. Conventionally, a successful bitline sensing requires RBL for data ‘0’ to
[Figure 4.4 diagrams: two columns, RBL<i> reading ‘0’ and RBL<j> reading ‘1’, each with one selected cell (RWL = ‘1’), 255 unselected cells (RWL = ‘0’) and a sense amplifier (SA<i>, SA<j>). (a) Conventional bitline sensing: I1×255 = Ileak_min and I2×255 = Ileak_max; successful sensing needs Icell + Ileak_min >> Ileak_max. (b) Proposed bitline equalization: I1×255 = I2×255 = Ileak; Icell0 + Ileak >> Ileak - Icell1 is guaranteed.]
Fig. 4.4(a) Conventional bitline sensing in the 8T SRAM [13]. (b) Concept of
proposed 9T bitline sensing improvement by bitline leakage equalization
technique.
discharge much faster than that for data ‘1’. However, when variation in bitline
leakage becomes comparable to cell read current, reliable detection of data ‘0’ and
‘1’ is difficult due to the small margin in the current to be sensed. In [16], it is
shown that, in the worst case, the bitline level of data ‘0’ could be even higher than
that of data ‘1’ due to the significant data-dependent bitline leakage particularly at
ultra-low voltage.
The conventional bitline sensing problem caused by leakage at ultra-low voltage is
illustrated in Fig. 4.4(a). In read operation, RWL is enabled and the RBL voltage
develops depending on the accessed data. The pull-down strength for sensing ‘0’
should be far higher than that for sensing ‘1’. After that, a simple sense amplifier
(SA) consisting of two inverter stages senses the voltage of each RBL without
trigger timing. As illustrated at the bottom of Fig. 4.4(a), this requires the sum of
the cell current and the minimum leakage current (Icell + Ileak_min) to be far larger
than the maximum leakage (Ileak_max) for successful sensing. As the amount of
leakage is comparable to the cell current and the leakage current varies from
column to column due to different data patterns, this condition cannot always be
met.
To address this problem, we propose a bitline leakage equalization technique for
single-ended read bitlines. Fig. 4.4(b) depicts the concept of the proposed bitline
equalization technique utilizing the proposed 9T cell. In unselected cells, leakage
current I1 flows to GND through the device controlled by node QB when
the stored data is logic ‘0’. Likewise, when the data is logic ‘1’, leakage current I2
flows to RWL ( = GND) through the device controlled by Q. Accordingly, one of the
two devices connected to Q and QB (M7 and M8 in Fig. 4.1) is always turned on
while the read access device (M9 in Fig. 4.1) is off. Consequently, the two leakage paths
have the same strength regardless of the stored data, and a constant bitline leakage
Ileak is formed. In Fig. 4.5, the RBLs are indicated as RBL for ‘1’ with maximum
leakage and RBL for ‘0’ with minimum leakage. The pull-down current for sensing
‘0’ (Icell0 + Ileak) is always larger than that for sensing ‘1’ (Ileak - Icell1). This ensures
that the RBL level for data ‘0’ is always lower than that for data ‘1’, irrespective
of the magnitude of Ileak. Thus, a positive sensing margin can always be provided.
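The difference between the two sensing conditions can be made concrete with a short sketch. The function names and the current values below are hypothetical and use arbitrary units; the expressions follow the pull-down currents stated above for the conventional and the equalized bitlines.

```python
def conventional_margin(i_cell, i_leak_min, i_leak_max):
    """Pull-down current of a column reading '0' (minimum leakage) minus that
    of a column reading '1' (maximum leakage) in conventional sensing."""
    return (i_cell + i_leak_min) - i_leak_max

def equalized_margin(i_cell0, i_cell1, i_leak):
    """Same difference with equalized leakage: (Icell0 + Ileak) - (Ileak - Icell1)."""
    return (i_cell0 + i_leak) - (i_leak - i_cell1)

# Hypothetical currents in arbitrary units, with leakage comparable to cell current.
print(conventional_margin(i_cell=1.0, i_leak_min=0.2, i_leak_max=1.5))   # negative
print(equalized_margin(i_cell0=1.0, i_cell1=0.3, i_leak=1.5))            # positive
print(equalized_margin(i_cell0=1.0, i_cell1=0.3, i_leak=100.0))          # still positive
```

In the equalized case the leakage term cancels, so the margin reduces to Icell0 + Icell1 and stays positive however large Ileak becomes, whereas the conventional margin changes sign once the leakage spread exceeds the cell current.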
Sample simulated RBL waveforms (Fig. 4.5) show a drastically improved RBL
swing in the 9T SRAM at VDD = 0.2 V, whereas the conventional 8T column
(HVT(W)-LVT(R)) generates a negligible RBL swing. The proposed scheme
improves the RBL swing by 4.6× at 0.2 V, 27°C, and 256 cells per bitline.
It also provides a wider sensing timing window, which is denoted
by a double-sided arrow. Note that the sensing timing window is defined as the time
difference between the RBL of ‘0’ and that of ‘1’ measured when they cross VDD/2.
Since the trip point of our sense amplifier is VDD/2, we used it as the reference level.
At a frequency of 50 kHz, a sensing timing window of 1.5 µs is achieved by the
leakage equalization technique, whereas nearly no sensing timing window is
obtained in the 8T bitline. The RBL behavior of the 10T SRAM [17] is also
captured in Fig. 4.5. The 10T RBLs cannot fully discharge at this frequency
and they are too close to differentiate for sensing.
Variations of cell current and leakage current cause the RBL swing to vary and can
lead to sensing problems. Fig. 4.6 depicts the distribution of RBL swing of the proposed 9T
SRAM with 3σ local variation at the minimum operating voltage. With a mean
[Figure 4.5 waveforms: RWL and the 9T, 8T and 10T RBL voltages (0 to 0.2 V) versus time (45 to 65 μs) with 256 cells per RBL, for ‘1’ with maximum leakage and ‘0’ with minimum leakage. Annotation: 9T sensing window of 1.5 μs.]
Fig. 4.5 Improved RBL swing and sensing window of 9T bitline at 0.2 V and
fCLK = 50 kHz with the worst case of leakage.
value of 53 mV, the RBL swing distribution from 10k-point Monte Carlo runs
exhibits a longer right tail. Fig. 4.7 presents the simulated swing-to-VDD ratio of the
proposed 9T SRAM and the 8T SRAM at different temperatures and maximum
numbers of cells per RBL (RBL lengths). In order to compare different bitcell
topologies in terms of RBL length, we assume nominal process parameter values. In
reality, accounting for within-die parametric variations, the effective number of
cells per RBL degrades. The proposed 9T SRAM bitline can attach more cells due
to the larger RBL swing as verified in Fig. 4.7. In the 8T SRAM bitline, only 512
[Figure 4.6 histogram: occurrence (0 to 3000) versus RBL swing (0 to 165 mV); mean = 53 mV, sd = 22 mV.]
Fig. 4.6 Histogram of RBL swings of 9T SRAM at 0.2 V with the worst case
of leakage from 10k-point Monte Carlo runs.
[Figure 4.7 plot: RBL swing/VDD (%) versus number of cells per RBL (128, 256, 512, 1024) for the 9T and 8T cells at 27°C and 80°C, VDD = 0.3 V at TT corner; the maximum length of the 8T RBL is marked.]
Fig. 4.7 Improved RBL swing with different numbers of cells and
temperature. Typical corner is used in the simulation.
cells can be attached for a sensible RBL swing. Note that a sensible swing should be
at least a positive value. In the proposed SRAM, up to 1024 cells can be attached to
the 9T bitline at 0.3 V and 80°C. The 8T bitline with 1024 cells generates a negative
bitline swing at 80°C.
4.3 Proposed Energy Efficient Improvement
Technique
4.3.1 Limitation of MTCMOS on SRAM Energy Efficiency
For a given SRAM structure, the energy efficiency can be optimized by minimizing
leakage and maximizing performance. To this end, the 9T SRAM cell consists of
HVT devices in the 6T part and LVT devices in the read port. However, as
explained in [51], this is not the best option in terms of energy efficiency,
primarily because of the write performance degradation. Assuming a 50% duty cycle,
the SRAM energy (Etotal) can be written as
\[
E_{total} = E_{switching} + E_{leakage}
          = C_{switching} V_{DD}^{2} + 2\, I_{leakage} V_{DD} T,
\quad \text{where } T = \max(t_{read}, t_{write})
\tag{4.1}
\]
When T is determined by tread, using HVT in the 6T part reduces Ileakage and
improves the energy. In the other case, when T is determined by twrite, the
reduction in Ileakage has to be weighed carefully against the increase in T.
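The two cases can be illustrated numerically by evaluating Eq. (4.1), read here as Etotal = Cswitching·VDD² + 2·Ileakage·VDD·T with T = max(tread, twrite). The function name and all numbers below are hypothetical, normalized values chosen for illustration and are not simulation results.

```python
def total_energy(c_switching, vdd, i_leakage, t_read, t_write):
    """Eq. (4.1): switching energy plus leakage energy for a 50% duty cycle."""
    t = max(t_read, t_write)   # active duration set by the slower operation
    return c_switching * vdd ** 2 + 2.0 * i_leakage * vdd * t

# Hypothetical normalized values (not thesis data).
# Baseline: leaky write path, read-limited cycle.
baseline = total_energy(1.0, 0.4, i_leakage=1.0, t_read=1.0, t_write=0.5)
# HVT write path, still read-limited: leakage drops 10x, T unchanged.
hvt_read_limited = total_energy(1.0, 0.4, i_leakage=0.1, t_read=1.0, t_write=0.8)
# HVT write path at lower VDD margin: write becomes 20x slower than read.
hvt_write_limited = total_energy(1.0, 0.4, i_leakage=0.1, t_read=1.0, t_write=20.0)

print(baseline, hvt_read_limited, hvt_write_limited)
```

With these assumed values the HVT write path lowers the energy while the cycle remains read-limited, but raises it once the write delay sets T, because the growth in T outweighs the leakage reduction.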
Fig. 4.8 illustrates a write operation with data flipping. The write operation is
[Figure 4.8: timing diagram of WWL, Q and QB marking the flipping delay and the full development delay, together with a plot of full development delay minus flipping delay (ns, log scale) versus VDD (0 to 1.4 V).]
Fig. 4.8 Definition of data flipping delay and data full development delay.
Difference of the data flipping delay and the full development delay
substantially increases with scaling VDD.
divided into two stages, data flipping and data full development. After the data
flipping, additional delay is required for the internal nodes, e.g. QB, to be fully
developed. In skewed conditions (e.g. MTCMOS cell, skewed process corners), QB
could rise to the high voltage very slowly. In this work, the delay until the nodes cross is
defined as the data flipping delay. The delay until data full development (i.e. 90% of
VDD) is defined as the data full development delay. The latter is a more appropriate
measure of the real completion of a write operation. Fig. 4.8 plots the difference
between the data full development delay and the data flipping delay. It clearly
shows that the delay difference between data flipping and full development
expands sharply at ultra-low voltage operation.
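The two delay definitions can be sketched as measurements on sampled waveforms. The code below is a minimal illustration: the helper name, the exponential waveform shapes and the time constants are hypothetical, and time is measured from WWL activation at t = 0 in arbitrary units.

```python
import math

def rising_cross_time(times, values, level):
    """Linearly interpolated time at which `values` first rises through `level`."""
    for i in range(1, len(times)):
        v0, v1 = values[i - 1], values[i]
        if v0 < level <= v1:
            return times[i - 1] + (level - v0) * (times[i] - times[i - 1]) / (v1 - v0)
    return None

# Synthetic waveforms (hypothetical time constants, arbitrary time units).
vdd = 0.3
times = [0.01 * k for k in range(2001)]
q = [vdd * math.exp(-t / 1.0) for t in times]            # Q falls quickly
qb = [vdd * (1.0 - math.exp(-t / 5.0)) for t in times]   # QB rises slowly

# Flipping delay: QB - Q changes sign, i.e. the two nodes cross.
flipping_delay = rising_cross_time(times, [b - a for a, b in zip(q, qb)], 0.0)
# Full development delay: QB reaches 90% of VDD.
full_development_delay = rising_cross_time(times, qb, 0.9 * vdd)
print(flipping_delay, full_development_delay)
```

A slowly rising QB, as in the skewed conditions mentioned above, leaves the flipping delay almost unchanged but stretches the full development delay, which is why the gap between the two widens.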
In an SRAM circuit, the active clock duration is decided by the larger of
the write delay and the read delay. As the supply voltage decreases, the write
delay with HVT devices degrades faster than the read delay with LVT devices,
eventually exceeding the read delay. In this scenario, the overall performance is
limited by the slower write operation. To improve this, the flipping delay, instead of
the full development delay, can be adopted as the write delay when no read-after-write
operation is assumed. Fig. 4.9 shows the issue with the read-after-write operation.
After the data flipping at Q and QB, QB rises slowly. When RWL is enabled, Q and
QB have not been fully developed yet and the read operation could fail. The written
data can be accessed only after additional clock cycles for full development.
[Figure 4.9 timing diagram: CLK, WWL, Q/QB, RWL and RBL over a write cycle followed by a read cycle. Annotation: RBL should discharge to GND.]
Fig. 4.9 Read failure due to data non-full development in SRAM cell
nodes.
Consequently, the excessively degraded full development delay nullifies the energy
efficiency gain by prolonging T, even if significant leakage reduction is achieved
with HVT devices in the 6T part. Fig. 4.10 depicts the read and write delays of the
9T SRAM array at the TT and FNSP corners, respectively. When the supply voltage is
lowered below 0.6 V, the full development delay and the flipping delay deteriorate
faster than the read delay (Fig. 4.10(a)). The full development delay is 6.12× the
flipping delay at the FNSP corner and VDD = 0.46 V, as shown in Fig. 4.10(b). The
read delay is larger than the flipping delay when VDD is between 0.46 V and 0.64 V.
In this simulated voltage range, data flipping is guaranteed to occur within the
read delay. Therefore, energy can be saved if the read-after-write issue is
eliminated and a shorter delay (e.g. the read delay or the data flipping delay) is
used as T. To address the above issue and enhance energy efficiency, we propose a
Content-Addressable-Memory-assisted (CAM-assisted) circuit that boosts write
performance and compensates for write failure.
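The effect of the write-delay definition on the clock period can be illustrated with a minimal sketch. It assumes that T is simply the larger of the read delay and the chosen write delay, as stated above. The function name and the delay values are hypothetical; only the 6.12× ratio comes from the text.

```python
def clock_period(read_delay_ns, flip_delay_ns, full_dev_delay_ns, hide_full_development):
    """Return T as the larger of the read delay and the write delay.

    hide_full_development=False: the write delay is the full development delay.
    hide_full_development=True: read-after-write is handled elsewhere, so the
    write delay is only the data flipping delay.
    """
    write_delay_ns = flip_delay_ns if hide_full_development else full_dev_delay_ns
    return max(read_delay_ns, write_delay_ns)


# Hypothetical delays in ns; only the 6.12x ratio is taken from the text.
flip = 5.0
full = 6.12 * flip
read = 8.0

t_conventional = clock_period(read, flip, full, hide_full_development=False)
t_relaxed = clock_period(read, flip, full, hide_full_development=True)
print(t_conventional, t_relaxed)
```

With these illustrative numbers the period is set by the full development delay in the first case and by the read delay in the second, which is the situation the proposed circuit aims for.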
Fig. 4.10 (a) Read and write delays against scaling VDD at TT corner. (b) Read and
write delays against scaling VDD at FNSP corner. (Axes: delay (ns) versus VDD (V);
curves: read delay, data flipping delay and full development delay.)
4.3.2 Proposed CAM-assisted Write Performance Boosting Technique
The slow-write-fast-read problem can be addressed at the architecture level [8]. A
completion signal is asserted to alert the CPU when the write operation is
finished; otherwise the CPU stalls for 2 to 3 cycles during a write. A traditional
bypass circuit implemented in the SRAM, which uses registers to cache the input
data, can also boost performance in the read-after-write case. However, it has two
drawbacks. First, the register cell can easily cost more than 16 transistors if a
mainstream DFF style is adopted for ultra-low voltage operation. Second, a large
number of dependent MUXs and comparators are needed and cannot be multiplexed.
Therefore, it is beneficial to implement the storage circuit, MUXs and comparators
with fewer transistors and in an area-efficient, array-based way. In this section,
we explain a circuit technique that enhances write performance with this advantage.
Fig. 4.11 illustrates the operation of the proposed CAM-assisted technique. The
SRAM comprises two main paths, an SRAM path and a CAM path. The SRAM path consists
of a 16 kb 9T SRAM array (main SRAM array), decoders and data IOs. The CAM path is
composed of a tiny 48 b CAM array for storing addresses, a ring counter as an
address pointer, an encoder, and a miniature SRAM array for storing write data. The
CAM array (Addr.) and the SRAM array (Data) are implemented with LVT devices for
faster read, write, and parallel search to conceal the slow full data development
in the main SRAM array. The primary role of the CAM is to store the most recent
write addresses and data for possible subsequent read access until the data written
into the main SRAM array is fully developed.
Fig. 4.11 Data paths of write and read operations in the CAM-assisted SRAM circuit.
(Left: write phase; right: read phase; MC: 9T cell.)
During a write operation (Fig. 4.11 left), data is written into the main SRAM array
(through the SRAM write path) and the miniature SRAM array (through the CAM write
path). The write address is stored in the CAM array. The write address and the data
in the CAM can be accessed in the succeeding cycles since the proposed CAM is
implemented with LVT devices. During a read operation (Fig. 4.11 right), the main
SRAM array is accessed for normal read operation, and the CAM array is
simultaneously searched using the read address as search data. If the read address
is not found in the CAM array, the read does not target any cell written in the
preceding cycles. Thus, the selection signal from the encoder (Match = 0) selects
the read data from the main SRAM array as the final data through the MUX. If an
address match occurs, i.e. in a read-after-write operation, the encoder enables the
wordline corresponding to the matched address. The wordline activates the read from
the miniature SRAM array, and the data is then sent to the MUX. Finally, using the
selection signal from the encoder (Match = 1), the MUX selects the data from the
proposed CAM as the final data. In this case, the read data from the main SRAM
array cannot be used as the final data since the data written in the previous cycle
has not been fully developed due to the slow development speed of the latches using
HVT devices. Through this, the write performance is determined by the read
operation or the data flipping delay, not by the slower full development delay. As
a result, an immediate read-after-write operation to the same address can be
executed without slowing down the clock to allow full data development in the main
SRAM array.
Fig. 4.12 Delay of four different operations of SRAM and CAM circuits. (CLK to
operation completion, in a.u., for SRAM read, CAM read, SRAM write and CAM write;
hiding the data development reduces the clock period by 47.5%.)
Fig. 4.12 compares the delays of four different operations (i.e. SRAM read, SRAM
write, CAM read and CAM write) to demonstrate the performance advantage of the
proposed scheme. The delay of SRAM write is taken as the full development time. As
shown in Fig. 4.12, the delay of SRAM write is the largest whereas that of CAM
write is the smallest. Since the CAM-assisted technique hides the slow SRAM write,
the delay that limits the overall performance changes from the SRAM write to the
SRAM read. A performance improvement of 47.5% is obtained in simulation.
Fig. 4.13 Circuit diagram of CAM array, search logics and miniature SRAM array.
(CC: CAM cell; SC: 9T SRAM cell.)
The schematic and the search operation of the 10T CAM cell employed in this work
are adopted from [52]. The CAM cell comprises a 6T SRAM part and search logic
circuits. Before a search operation, the match line (ML) is precharged to VDD. A
search operation starts by loading the search data onto the search lines. If the
search data differs from the stored data, one of the search logic circuits
discharges ML to GND. Conversely, if the search data is identical to the stored
data, ML remains at a high voltage. The circuit diagram of the CAM-assisted circuit
is shown in Fig. 4.13. Conventionally, the input of a CAM is data and the output is
a hit address. In this work, the input is a read address and the output is data.
The CAM array comprises 4 rows (i.e. it stores the 4 most recent write addresses)
and 12 columns (i.e. a 12-bit address). The number of rows is mainly determined by
the ratio of the data full development delay to the flipping delay. A ring counter
acts as a pointer for the CAM array. When a write operation is asserted, the
pointer enables one row, writing the input address into the CAM array and the data
into the miniature SRAM array. When an SRAM read operation is enabled, the address
is loaded onto the search lines (SL<i> and SLB<i>) of the CAM array. If the address
is found in the CAM array, the corresponding ML(s) will be enabled (i.e. remain
high). Otherwise, no ML is enabled and the search operation finishes. If multiple
MLs are enabled, the encoder activates only one read wordline (CAM_RWL<i>), the one
corresponding to the most recent write operation. The activated wordline enables
reading data through the read bitlines (CAM_RBL<i>) and sending the read data to
the MUX (Fig. 4.11). Assuming a 50% clock duty cycle, the number of rows in the CAM
array can be estimated by the following equation:
\[
N = \left\lceil \frac{\text{Data Full Development Delay}}{2 \times \text{Data Flipping Delay}} \right\rceil - 1 \tag{4.2}
\]
If M in Fig. 4.14 is greater than 2, a read in the cycle following a write is
likely to fail (50% duty cycle), which is addressed by the proposed CAM-assisted
technique. To cover the case at the FNSP corner (Fig. 4.10(b)), N should be at
least ⌈6.12/2⌉ − 1, which is 3. In this work, we implemented 4 rows to provide
redundancy in N for real applications.
Fig. 4.14 Timing diagram of SRAM array and CAM circuit during succession of write
and read operations.
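As a quick check of the row-count estimate in Eq. (4.2), the following sketch evaluates it for the FNSP-corner ratio quoted above. The function name is hypothetical, and the inputs may be in any common unit because only their ratio matters.

```python
import math


def cam_rows_needed(full_dev_delay, flip_delay):
    """Estimate N = ceil(full / (2 * flip)) - 1 for a 50% duty-cycle clock."""
    return math.ceil(full_dev_delay / (2.0 * flip_delay)) - 1


# FNSP corner at VDD = 0.46 V: full development delay = 6.12 x flipping delay.
n_fnsp = cam_rows_needed(6.12, 1.0)
print(n_fnsp)
```

The result is 3, so the 4 implemented rows leave one row of margin.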
Fig. 4.15 Faster write completion in CAM array than SRAM array at different
corners. (Axes: full development delay (ns) versus VDD (V); curves: SRAM TT, CAM
TT, CAM FF, CAM SS and the read delay of the SRAM.)
The timing diagram of the proposed CAM-assisted SRAM is illustrated in Fig. 4.14.
The data in the tiny SRAM (CAM_Q/QB) develops much faster than that in the main
SRAM array (SRAM_Q/QB). In the subsequent read operation, the input address on the
search lines (SLs) keeps the corresponding ML high, and CAM_RWL and CAM_RBL are
accordingly generated quickly thanks to the LVT devices and the small load. The
other read path through RWL generates SRAM_RBL with a larger delay and, in the
worst case, produces a read failure. Fig. 4.15 shows that the full development
delay of the CAM is always smaller than that of the main SRAM array at all corners.
Moreover, the full development delay of the CAM is also shorter than the read delay
of the main SRAM array, which makes the read paths critical.
4.4 Test Chip Implementation and Measurement
The main SRAM array is organized as 256 words × 4 bits × 4, and occupies an area of
169 µm × 195 µm (including power rails in rows and columns). It is divided into 4
sub-blocks, and each sub-block is composed of 16 columns sharing one IO. The CAM
array is configured with 4 rows, and each row has 12 CAM cells for storing an
address and 4 SRAM cells (LVT) for storing write data. The proposed CAM circuit
occupies 1061 µm² (not including interconnections), which is at least 60% smaller
than a DFF-based design by our estimation. It causes an area overhead of
approximately 3% of the SRAM array area. The overhead will be less at a higher
SRAM array density since the number of rows is mostly determined by a single cell.
The energy dissipated by the proposed CAM circuit is a very small portion of the
overall consumption. Simulation shows that the CAM energy per read with search
operation is 59 fJ at 0.4 V and a frequency of 1 MHz. For flexibility in testing,
data from CAM_RBL and SRAM_RBL can bypass the MUX for separate measurement.
Fig. 4.16 Measured (a) leakage current of the test chip and (b) write, read and
average power at maximum operating frequency.
Fig. 4.17 Measured (a) read access time and (b) improved operating frequency of the
CAM-assisted SRAM.
A 16 kb SRAM test chip was fabricated in a commercial 65 nm CMOS technology with a
nominal VDD of 1.2 V. Fig. 4.16(a) shows the measured leakage current. At 27°C, the
leakage current of the test chip decreases from 139.3 µA at 1.2 V to 1.4 µA at
0.1 V. At 100°C, it becomes 305 µA and 4.5 µA, respectively. The power of read and
write operations is measured at the maximum operating frequency (Fig. 4.16(b)). The
read power is larger than the write power due to the precharging and discharging
current in the read bitlines. The average power is measured in the supply range of
interest, assuming equal probability of read and write operations. It decreases
from 146 µW at 0.6 V to 4.12 µW at 0.32 V. Fig. 4.17(a) verifies that the CAM
circuit shortens the read access time by 25.8% at 0.6 V and 7% at 0.38 V compared
to the SRAM without the CAM. Below 0.38 V, a read from the CAM takes more time due
to the slow search operation at ultra-low voltage. The operating frequency of the
CAM-assisted SRAM is depicted in Fig. 4.17(b). Around the critical voltage of
0.6 V, the CAM circuit speeds up the clock frequency of the main SRAM to 40 MHz.
The maximum operating frequency at VDD = 0.4 V is boosted to 5 MHz. The SRAM
performance is therefore improved by 42.6% and 66.7% at 0.6 V and 0.4 V,
respectively. The energy per operation is plotted in Fig. 4.18. The SRAM alone
consumes a minimum energy per operation of 3.47 pJ at VDD = 0.42 V. Thanks to the
CAM-assisted circuit, a minimum energy per operation (Emin) of 2.07 pJ is achieved
at 0.4 V, and the energy efficiency is consequently improved by 40.3%. On average,
the energy efficiency in the supply range of 0.38 V to 0.6 V is enhanced by 29.4%.
The test cells are fully functional down to 0.26 V with a maximum operating
frequency of 100 kHz (27°C). The read access time of the SRAM is measured as
0.85 µs at 0.26 V. The CAM circuit achieves an average improvement of 18.7% in the
read access time between 0.38 V and 0.6 V. It further lowers the minimum read
voltage from 0.26 V to 0.23 V. The test chip micro-photograph and a waveform
capture are shown in Fig. 4.19. Table 4.1 compares the test chip with various
ultra-low voltage SRAM circuits. Among these SRAMs, this work achieves the lowest
minimum energy when it is normalized with respect to density.
Fig. 4.18 Measured energy of SRAM only and the CAM-assisted SRAM.
Fig. 4.19 (a) Readout waveforms captured at 0.26 V (core VDD = 0.26 V, Fmax =
100 kHz, access time = 0.85 µs). (b) Die micro-photograph.
Table 4.1 Design metric comparison with various ultra-low voltage SRAMs.
                    A-SSCC 2012 [53]  JSSC 2013 [54]    JSSC 2013 [55]    This Work
Technology          65 nm             65 nm             65 nm             65 nm
Density             128 kb            2 kb              32 kb             16 kb
Transistor count    8T                9T                7T                9T
Cell size           N.A.              1.24 × 2.31 µm²   N.A.              2.63 × 0.72 µm²
VDDmin              0.37 V            0.28 V            0.26 V            0.26 V
Access time         N.A.              4.55 µs (0.3 V)   0.55 µs (0.26 V)  0.85 µs (0.26 V)
Leakage current     N.A.              0.05 µA (0.4 V)   N.A.              1.4 µA (0.1 V)
Normalized Emin     162 aJ/b          278 aJ/b          171 aJ/b          126 aJ/b
Min. energy (Emin)  21.2 pJ           0.57 pJ           5.6 pJ            2.07 pJ
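The normalized energy figures in Table 4.1 and the 40.3% improvement can be reproduced with simple arithmetic. The sketch below assumes that 1 kb means 1024 bits and that the normalization is Emin divided by the array density; both assumptions are consistent with the quoted values but are not stated explicitly in the text. The function names are hypothetical.

```python
KB = 1024  # bits per kb, assuming binary kilobits


def energy_per_bit_aj(emin_pj, density_kb):
    """Normalize a minimum energy (pJ) by array density; result in aJ/bit."""
    return emin_pj * 1e6 / (density_kb * KB)


def improvement_pct(before, after):
    """Relative reduction from `before` to `after`, in percent."""
    return (before - after) / before * 100.0


# (Emin in pJ, density in kb) as listed in Table 4.1.
designs = {
    "[53]": (21.2, 128),
    "[54]": (0.57, 2),
    "[55]": (5.6, 32),
    "This work": (2.07, 16),
}
for name, (emin_pj, density_kb) in designs.items():
    print(name, round(energy_per_bit_aj(emin_pj, density_kb)), "aJ/b")

# Emin of the SRAM alone (3.47 pJ) versus the CAM-assisted SRAM (2.07 pJ).
print(round(improvement_pct(3.47, 2.07), 1), "%")
```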
4.5 Summary
Leakage and energy efficiency are primary concerns for ultra-low voltage SRAM
design. This chapter presents several circuit techniques to implement an
energy-efficient SRAM with reliable read operation at ultra-low voltage. The
proposed 9T SRAM cell with equalized bitline leakage facilitates SRAM read
operation at ultra-low voltage, achieving read access times of 79 ns and 0.85 µs at
0.4 V and 0.26 V, respectively. To further reduce the static energy, MTCMOS
technology is utilized to reduce the leakage in the SRAM array. While HVT devices
in the 6T part reduce leakage, they degrade write performance significantly at low
voltage. This nullifies the energy efficiency improvement in the near- or
sub-threshold region. To tackle this issue, we proposed a CAM-assisted write
performance boosting circuit to speed up the clock frequency. The test chip shows
an average energy efficiency improvement of 29.4% with the aid of the proposed
circuit technique. At the minimum energy point, the energy efficiency is improved
by 40.3%, with a minimum energy per operation of 2.07 pJ at 0.4 V. The measurement
results demonstrate that the proposed techniques are effective circuit solutions
for ultra-low voltage and energy-efficient applications.
Chapter 5 Design of an Ultra-low Voltage Disturb-suppressed Dual-port SRAM
5.1 Background
Contemporary computing platforms with big data enable unprecedented interaction
between humans and computational resources. Ubiquitous computing necessitates high
computing power with multi-core processing units and multi-port SRAMs. Dual-port
SRAMs are accordingly in high demand, even in energy-constrained applications,
whose low energy consumption is mainly attributed to ultra-low voltage operation.
Conventional 8T dual-port (DP) SRAM cells are derived from standard 6T single-port
(SP) SRAM cells. Consequently, they inherit the weaknesses of the 6T SRAM cell at
low voltage, such as poor cell stability and reduced read-ability and
write-ability, which impede voltage scaling. These issues are exacerbated because
the 8T DP SRAM cell has two more access transistors, leading to larger disturbance,
especially in the common-row-access mode.
As Section 2.2.2 analyzes, the worst cases of read-ability, write-ability and cell
stability occur in the common-row-access mode, where two SRAM cells sharing the
same wordlines are accessed via the two designated ports in one clock cycle. Under
this circumstance, both wordlines for port access are enabled, which exposes all
four access transistors to noise. For a cell being read, the storage nodes suffer a
disturb current from the other port, which impedes the discharging by the read
current and results in read-ability degradation or read failure. For a cell being
written, the data flipping is interfered with by a disturb current from the other
bitlines, which degrades the write-ability. For unselected cells along the
wordlines, disturb current is injected into the storage nodes from two directions,
so the cells are extremely susceptible to noise, which is known as cell stability
degradation.
These effects limit the minimum operating voltage (Vmin), which is generally higher
than that of the 6T SP SRAM cell. Various circuits have been proposed to either
improve the read-/write-ability of the 8T DP cell or reduce read-write disturb
under the common-row-access circumstance. As Section 2.2.3 discusses, a priority
row decoder with a bitline shifter has been proposed to circumvent the common
access mode and enhance cell stability [30]. A wordline-voltage-adjustment system
has been utilized to improve read and write against PVT variation [31]. Evidently,
the 8T DP SRAM cell has to employ assist circuits to cope with the challenges of
aggressive voltage and technology scaling.
In this chapter, we propose a 12T DP SRAM cell with 2 decoupled read ports for
better read-ability, write-ability and cell stability without any assist circuit.
In addition, a virtual-ground scheme and hierarchical bitlines are deployed to
suppress the leakage current from the read bitlines and to improve performance,
which further improves Vmin.
5.2 Proposed 12T DP SRAM Cell
Dual-port SRAMs boost computation performance and throughput by doubling the
number of simultaneous memory accesses. For the conventional 8T DP SRAM cell, ports
A and B are each accessed with their own address and operation instruction. Each
port has its own wordline (WL) and a pair of bitlines (BL and /BL). In the
common-row-access mode, the selected cell is inevitably disturbed through the
second activated WL. Thus, the width of the 2 NMOS drive transistors has to be
further increased (e.g. × 2.7) to maintain the cell stability. However, the upsized
8T DP cell still has limitations for ultra-low voltage operation, which will be
analyzed in Section 5.3. Decoupled DP SRAM cells are promising solutions to this
problem.
Fig. 5.1 (a) Schematic of proposed 12T dual-port SRAM cell. (b) Layout of the 12T
dual-port cell.
5.2.1 12T SRAM Cell Design
Fig. 5.1(a) portrays the proposed 12T DP SRAM cell. The proposed cell decouples the
read paths from the write paths by implementing a dedicated read port A (M1 ~ M2)
and read port B (M3 ~ M4). The read wordlines (RWLA and RWLB) control access to the
two read ports by switching on the read access transistor M1 or M3. RBLA and RBLB
are precharged to VDD in advance. During a read cycle, they are left floating for
data evaluation, which is based on the competition between the read current in the
selected cell and the leakage current from the unselected cells along the bitline.
The conditional discharging of the read bitlines (RBLA and RBLB) is controlled by
VGND, which is employed to suppress leakage. The corresponding VGND terminal is
pulled down to ground for the selected column to enable a read operation, whereas
it is precharged to VDD for the unselected columns to reduce leakage. During read,
the voltage level of RBLB represents the opposite value of node Q; hence it is
connected to a global bitline via a PMOS transistor for data inversion. A write
operation in the 12T SRAM cell is activated by the write wordlines (WWLA and WWLB)
and followed by data flipping at the storage nodes Q and QB, which is exactly the
same as in the 8T single-port SRAM cell. By separating the read paths from the
write paths, read-write disturb is significantly relaxed and various design metrics
such as stability and read-write margins are improved. This will be further
discussed in Section 5.3. The proposed cell eliminates the need to oversize the
pull-down devices while improving the key design metrics.
The layout of the proposed DP SRAM cell is illustrated in Fig. 5.1(b). Both the
WBLs and the RBLs run vertically in the second metal layer. RWLA and RWLB use the
third metal layer while WWLA and WWLB use the fifth metal layer. The power lines
(VDD and VSS) and the virtual ground terminals also run in the second metal layer,
resulting in eleven vertical tracks in total. The area of the 12T SRAM cell in the
65 nm technology is 2.75 µm². To align with the row decoder circuit, the height of
the cell is kept at 0.72 µm. In our design, as the electrical β ratio is reduced
from 2.7 to 1, the area overhead caused by the read ports is partly compensated.
5.2.2 Implementation of Virtual Ground for Bitline Leakage Reduction
Leakage current is detrimental in energy-constrained applications as it consumes
energy all the time, irrespective of data activity and event triggers. To make
things worse, the aggregate leakage at ultra-low voltage can offset the energy
efficiency gained from voltage scaling and shift the energy per operation away from
the optimum point. This problem becomes even more severe in the proposed 12T DP
SRAM, where the number of leakage paths may be double that of the 8T SP SRAM.
During non-read cycles, the read bitlines are conventionally precharged to a high
voltage while the source terminals of M2 and M4 are normally grounded. However,
this creates leakage current paths from the read bitlines to ground through the
unselected cells. Fig. 5.2(a) illustrates the leakage current injection problem.
Assume the very first cell is accessed through read port B. Although the other
cells sharing the same read bitlines are switched off for read access, leakage
paths form as long as a voltage difference exists. Accordingly, every unselected
cell suffers leakage current in two directions, one from RBLA to ground and the
other from RBLB to ground. In addition, leakage current in standby mode is
undesirable in terms of power and energy consumption.
Fig. 5.2 (a) Leakage problem in conventional 2T read port. (b) Read bitline leakage
suppression by implementation of virtual ground technique.
To minimize the leakage from the read ports, our design leverages a virtual ground
(VGND) technique [43], which controls the source voltage of M2 and M4 to suppress
the leakage current. Unlike the row-wise implementation, this design adopts
column-based virtual ground control to prevent bitline discharging during read for
unselected columns. Fig. 5.2(b) illustrates the technique. Only during a read
operation are the VGNDs of the selected columns grounded. Otherwise, they are
pulled up to VDD to eliminate the leakage injection. Fig. 5.3 shows the control
circuit that implements the virtual ground scheme. When a port of the column is
selected for read, COL_SEL and RD are enabled simultaneously. NOP is deactivated to
indicate that the circuit is not in standby mode. This control pattern causes
SELECT to rise, which turns off the precharge PMOS device and switches on the
transmission gate. Thus, VGND follows the inverted clock (CLKB), which means VGND
discharges to ground during the high phase of the clock. When a port of the column
is not selected for read, the PMOS device provides current to pull up VGND.
Fig. 5.3 Implementation of virtual ground technique. (Signals: RD, COL_SEL, NOP,
SELECT, CLKB, VDD and VGND.)
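The control behaviour described for Fig. 5.3 can be captured as a logic-level truth function. This is only a sketch. The exact gates of the control circuit are not given in the text, so the expression for SELECT (RD AND COL_SEL AND NOT NOP) is an assumption inferred from the description, and the function names are hypothetical.

```python
VDD, GND = 1, 0  # logic levels


def select(rd, col_sel, nop):
    """SELECT rises only when the column is read and the array is not in standby."""
    return int(bool(rd) and bool(col_sel) and not nop)


def vgnd_level(rd, col_sel, nop, clk):
    """Logic level of one column's VGND for the given control inputs."""
    if select(rd, col_sel, nop):
        clkb = 1 - clk   # the transmission gate passes the inverted clock
        return clkb
    return VDD           # the precharge PMOS holds VGND at VDD


# Selected column during the high phase of the clock: VGND is grounded.
print(vgnd_level(rd=1, col_sel=1, nop=0, clk=1))
# Unselected column: VGND stays at VDD, so no read bitline leakage path forms.
print(vgnd_level(rd=1, col_sel=0, nop=0, clk=1))
```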
5.3 Disturb Suppression of 12T DP SRAM in Common-Row-Access Mode
Dual-port (DP) SRAMs have various access modes depending on row and column
selection. The worst-case disturb occurs when two selected SRAM cells are in the
same row, which is called the common-row-access mode. Fig. 5.4 illustrates the
half-selected circumstance in the common-row-access mode for the conventional 8T DP
SRAM. In Fig. 5.4(a), two DP cells are accessed from the two designated ports,
respectively. The cells sharing the same wordlines are all half-selected by the
enabled wordlines. In Fig. 5.4(b), one DP cell is accessed by the two ports
simultaneously. In both cases, the selected cells are disturbed by the current from
the bitlines to the cell nodes through four activated access transistors.
Simultaneously, the half-selected cells also suffer from disturb, as in
conventional SP SRAMs. Therefore, every 8T DP cell in the selected row is exposed
to noise and subject to cell stability issues. This section discusses the challenge
of disturb in the common-row-access mode.
Fig. 5.4 (a) Cell stability issue in common-row-different-column access. (b) Cell
stability issue in common-row-common-column access.
5.3.1 Analysis of Disturb Occurrence Probability
Regardless of whether the cell is an 8T DP cell or a 12T DP cell, the worst-case
cell stability occurs when the data storage nodes are exposed to disturb current
induced by both ports. Fig. 5.5 summarizes the SNMs in different situations with
respect to the operation in each port, including read (R), write (W), half-selected
by read (HR) and half-selected by write (HW). For the conventional 8T DP cell, SNM
degrades most whenever two wordlines are enabled simultaneously, except for write
operation, where SNM is meant to be destroyed. Accordingly, 2Read,
1Read1Half-Selection and 2Half-Selection in the common-row-access mode are all
worst SNM cases. As indicated in Fig. 5.5(a), the probability of worst SNM for the
8T DP SRAM is 9/16.
The proposed 12T SRAM cell substantially decreases the probability of stability
degradation thanks to the decoupling of the read and write ports. In the 12T cell,
a read operation does not affect SNM as the read port is isolated from the storage
nodes. Therefore, only the situation where the cell is half-selected by write
through both ports simultaneously imposes the worst-case disturb (Fig. 5.5(b)).
This reduces the disturb occurrence probability from 9/16 to 1/16, an improvement
of 88.9%.
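The 9/16 and 1/16 figures can be checked by enumerating the sixteen combinations of port operations. The sketch below assumes, as the counting in Fig. 5.5 implies, that all sixteen combinations are weighted equally. The predicate names are hypothetical.

```python
from fractions import Fraction
from itertools import product

OPS = ("R", "W", "HR", "HW")  # read, write, half-selected by read/write


def worst_8t(a, b):
    # Storage nodes are disturbed through both ports unless one port writes.
    return a != "W" and b != "W"


def worst_12t(a, b):
    # Read ports are decoupled, so only double half-selection by write remains.
    return a == "HW" and b == "HW"


def worst_probability(is_worst):
    pairs = list(product(OPS, repeat=2))  # (port A, port B) operations
    return Fraction(sum(is_worst(a, b) for a, b in pairs), len(pairs))


p8 = worst_probability(worst_8t)
p12 = worst_probability(worst_12t)
print(p8, p12, round(float((p8 - p12) / p8) * 100, 1))
```

The relative reduction is 8/9, which corresponds to the 88.9% quoted above.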
Although the patterns of the worst-case SNM are complicated to analyze, most of
them can be categorized into two classes, read disturb and write disturb. Read
disturb is disturb that occurs during a read operation. Similarly, write disturb is
disturb that occurs during a write operation. Note that the other half-selection
cases can be considered dummy read operations, which are treated as read disturb in
the analysis. The next two sections investigate these two circumstances.
Fig. 5.5 Comparison of worst SNM scenarios in conventional DP SRAM cell and
proposed DP SRAM cell. (Rows and columns: operations of port A and port B; O: worst
SNM; X: non-worst SNM; R: read; W: write; HR: half-selected by read; HW:
half-selected by write.)
5.3.2 Analysis of Read Disturb
The read disturb of the conventional 8T DP SRAM cell is described in Fig. 5.6(a).
Suppose port A is selected for read and port B is either selected or half-selected.
The precharged bitlines for port A are left floating for data evaluation. As the
access NMOS device is strong at passing ‘0’, the read operation is dominated by the
discharging of the precharged bitline through the storage node holding ‘0’. As port
B is simultaneously selected or half-selected with continuously precharged
bitlines, it acts as a dummy read operation. The corresponding discharging current,
depicted in red in Fig. 5.6(a), is injected into the same data storage node.
Therefore, it impedes the discharging of the bitline in port A by raising the
voltage of the storage node and increasing the load of the pull-down NMOS in the
latch. This degrades read speed and cell stability, and can eventually result in
read failure or data flipping [57].
Fig. 5.6 (a) Illustration of read disturb in the 8T DP SRAM cell. (b) Read disturb
suppression in the 12T DP SRAM cell.
The read disturb of the proposed 12T DP SRAM cell is depicted in Fig. 5.6(b). The
read current, unlike in the 8T DP cell, flows from the read bitline to ground
through the read port without interfering with the data storage nodes. Although the
read disturb current still exists, the voltage of the node is not raised as much as
in the 8T DP SRAM cell because the number of destructive read operations is reduced
from two ports to one. Hence, the cell is much less likely to suffer data flipping.
Fig. 5.7 Simulated waveforms of read disturb for the 8T DP cell and the 12T DP cell
at VDD = 0.4 V, FNSP corner. Note that the data in the 8T cell flips due to the
read disturb whereas the data in the 12T cell is retained.
Fig. 5.7 presents simulated waveforms of the data storage nodes for both cells at
the FNSP corner in a 65 nm technology. Note that the NMOS drive devices are upsized
by 2.7× in the 8T cell. At VDD = 0.4 V, the data in the 8T DP SRAM cell flips due
to the read disturb from port B, whereas the 12T DP SRAM cell retains the original
data in the presence of the read disturb. The SNMs of the 12T DP SRAM and the 8T DP
SRAM in the common-row-access mode are compared in Fig. 5.8. An SNM of 58 mV is
observed in the 12T DP SRAM cell at 0.4 V, a 26% improvement over the 8T DP cell.
The SNM of the proposed DP SRAM is greater than that of the conventional DP SRAM
when the supply voltage is below 1 V. The data confirms that the 12T DP SRAM cell
has stronger immunity to read disturb than the 8T DP cell in the near- or
sub-threshold region, which is beneficial for ultra-low voltage operation.
Fig. 5.8 Comparison of read SNMs of the 8T DP SRAM and the 12T DP SRAM. (Axes: SNM
(V) versus VDD (V); temperature = 80°C.)
5.3.3 Analysis of Write Disturb
The write disturb issue of the conventional 8T DP SRAM cell is described in Fig.
5.9(a). Suppose port A is selected for write and port B is either selected or
half-selected. As NMOS transistors are strong at passing ‘0’, the write operation is
driven by discharging the storage node that holds ‘1’, as Fig. 5.9 indicates. As
Section 5.3.2 explains, a dummy read operation takes place on port B, where the
disturb current originates from the constantly precharged bitline. This current is
injected into the storage node and impedes its discharge. In addition, as a single
bitline links to a large number of SRAM cells, ranging from hundreds to thousands,
the associated large capacitive load slows down the discharge at ultra-low voltage.
Thus, the relatively long-lasting write disturb can result in a write failure [57].
[Figure 5.9 schematic: panels (a) and (b) mark the disturb current Idist. and the
write current Iwrite during a Port A write; the bitline load is labelled "Large Cap."
in (a) and "Small Cap." in (b)]
Fig. 5.9 (a) Write disturb illustration of the conventional 8T DP SRAM. (b) Write
disturb suppression from the 12T DP SRAM.
The proposed DP SRAM utilizes a hierarchical write bitline scheme to reduce the
bitline capacitance and ease the discharge. Although the write disturb pattern is
very similar to that of the 8T DP cell, the reduced bitline capacitance allows the
disturb current to be discharged faster, as depicted in Fig. 5.9(b). Fig. 5.10
presents the hierarchical bitline circuit. The 256 SRAM cells of a column are linked
to the global write bitlines, whereas every 64 cells share a pair of local write
bitlines. Each local bitline pair has its own precharge devices. The global bitlines
access the local ones via transmission gates dedicated to each sub-block. Hence, the
effective bitline capacitance is mainly determined by 64 cells instead of 256 cells.
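The capacitance argument can be illustrated with a first-order estimate. The sketch
below assumes a lumped model in which the bitline capacitance is proportional to the
number of attached cells, and it ignores the global bitline wire and the transmission
gate parasitics. The function name is hypothetical and the result is an idealized
ratio, not a measured one.

```python
def bitline_cap_ratio(cells_per_local: int, cells_per_column: int) -> float:
    """Ratio of local to flat bitline capacitance under a per-cell lumped model."""
    if cells_per_column % cells_per_local != 0:
        raise ValueError("column must divide evenly into sub-blocks")
    return cells_per_local / cells_per_column


# 256 cells per column, organized as local bitlines of 64 cells each.
sub_blocks = 256 // 64
ratio = bitline_cap_ratio(64, 256)
print(f"{sub_blocks} sub-blocks, local bitline load = {ratio:.0%} of a flat bitline")
```

Since the discharge time of a node scales with C·ΔV/I, the same idealized factor
applies to the time needed to discharge the bitline at a given current.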
[Figure 5.10 schematic: four sub-blocks of 64 cells, each with a pair of local write
bitlines connected to the global write bitlines (Global WBL, Global WBLB), which are
driven by Data and /Data]
Fig. 5.10 Circuit of hierarchical write bitline.
5.4 Measurement Results
The proposed 12T dual-port SRAM is fabricated in a 65 nm CMOS technology. Fig. 5.11
presents the architecture of the SRAM test chip. The 16 kb SRAM array is configured
as 256 words × 32 bits × 2. Each column is divided into 4 sub-blocks to implement the
hierarchical write bitline. As Fig. 5.11 depicts, the dual-port SRAM has two access
interfaces with dedicated peripheral circuits such as control logic, decoders,
read-out circuits and I/Os. The layout of the virtual ground (VGND) circuit is
implemented in the vertical direction to align with the bitlines. The layout of the
peripheral circuits is kept as symmetric as possible. The test chip occupies an area
of 398 × 385 µm², while the dual-port SRAM cell measures 3.82 µm in width and 0.72 µm
in height.
[Figure 5.11 block diagram: 16 kb SRAM array (256 rows × 32 columns × 2); for each of
Port A and Port B: IOs, 16-to-1 multiplexers, control logic, read-out and VGND
circuit, column decoder and write drivers, row decoder and WL drivers]
Fig. 5.11 Architecture of the 65 nm test chip.
[Figure 5.12 plots: (a) Leakage Current (μA), 0 to 70, versus VDD (V), annotated
63 μA @ 1.2 V and 7.6 μA @ 0.4 V; (b) Read Access Time (ns), logarithmic axis from
1E+0 to 1E+3, versus VDD (V), annotated 580 ns @ 0.4 V and 6 ns @ 1.2 V]
Fig. 5.12 (a) Measured leakage current. (b) Measured read access time.
The chip measurement has been conducted in the common-row-access mode, and all data
except the leakage current are collected under this condition. The test chip is
functional from 1.2 V down to 0.4 V. Fig. 5.12(a) presents the measured leakage. The
leakage current decreases with the supply voltage and reaches 7.6 µA at the minimum
operating voltage, while it is 63 µA at the nominal voltage. Simultaneous read
operations are observed on the test chip. The read current is recorded at the maximum
frequency with an equal probability of read ‘0’ and read ‘1’. The total power is
obtained by multiplying the current and the voltage. Energy per operation is the
energy consumed in one operation cycle, i.e. the power multiplied by the cycle time.
Fig. 5.12(b) shows the measured read access time (including I/O delay), which is
taken as the larger of the two single-port delays. The read access time varies from
6 ns at 1.2 V to 580 ns at 0.4 V and increases exponentially as VDD is scaled down.
Fig. 5.13(a) depicts the read and write power consumption with the maximum frequency
clamped at 50 MHz due to the limitation of the equipment. The slope of the read power
becomes steeper below 0.7 V because the maximum frequencies at these voltages fall
below the clamp value. At 0.4 V, the test chip dissipates 5.8 µW at a frequency of
350 kHz for successful read operations through the two ports in a common row. The
write power shows a similar trend to the read power curve. The average write current
is calculated over different write patterns: write ‘0’ with write ‘1’, both write
‘0’, and both write ‘1’. The 16 kb SRAM consumes a write power of 4 µW at the same
clock frequency. Fig. 5.13(b) presents the energy per operation of the proposed
dual-port SRAM macro. The energy decreases with voltage until it reaches a minimum
point. It rises again with further voltage scaling because the performance degrades
severely and the leakage component dominates the total energy. The minimum energy of
8.7 pJ is measured at 0.48 V. Fig. 5.14(a) shows the micro-photograph of the test
chip and Fig. 5.14(b) captures the waveforms when the chip is working in the
common-row-access mode at the minimum supply voltage.
[Figure 5.13 plots: (a) Power (mW), logarithmic axis from 1E-3 to 1E+1, versus
VDD (V); curves: read power and write power; (b) Energy/Operation (pJ), 0 to 50,
versus VDD (V), annotated 8.7 pJ @ 0.48 V]
Fig. 5.13 (a) Measured power consumption of read and write. (b) Measured energy
per operation.
[Figure 5.14 content: (a) die photo, 398 μm × 385 μm, with Array, Ctrl. A, Ctrl. B,
Decoder A, Decoder B, SA. & Drivers A and SA. & Drivers B marked; (b) traces of CLK,
Port A and Port B at 0.4 V with a 580 ns marker]
Fig. 5.14 (a) Micro-photograph of the test chip. (b) Captured waveforms of the
RBL at VDD = 0.4 V.
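The relations used in this section (power as current times voltage, energy as power
per cycle) can be applied to the quoted 0.4 V operating point. The sketch below treats
one clock cycle as one operation, which is an assumption about how the energy in Fig.
5.13(b) is counted; the function names are illustrative, and the printed energies are
derived values rather than measurements reported in the text.

```python
def energy_per_cycle_pj(power_w: float, freq_hz: float) -> float:
    """Energy spent in one clock cycle, in picojoules (E = P / f)."""
    return power_w / freq_hz * 1e12


def static_power_uw(leak_current_a: float, vdd_v: float) -> float:
    """Leakage power in microwatts (P = I * V)."""
    return leak_current_a * vdd_v * 1e6


# Quoted measurements: 5.8 uW read and 4 uW write at 0.4 V and 350 kHz.
read_pj = energy_per_cycle_pj(5.8e-6, 350e3)
write_pj = energy_per_cycle_pj(4.0e-6, 350e3)

# Quoted leakage currents: 7.6 uA at 0.4 V and 63 uA at 1.2 V.
leak_uw_min_vdd = static_power_uw(7.6e-6, 0.4)
leak_uw_nominal = static_power_uw(63e-6, 1.2)

print(f"Read energy per cycle at 0.4 V:  {read_pj:.2f} pJ")
print(f"Write energy per cycle at 0.4 V: {write_pj:.2f} pJ")
print(f"Leakage power: {leak_uw_min_vdd:.2f} uW at 0.4 V, {leak_uw_nominal:.1f} uW at 1.2 V")
```

Under these assumptions the per-cycle energy at 0.4 V lies above the 8.7 pJ minimum
measured at 0.48 V, which agrees with the reported rise in energy below the
minimum-energy point.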
5.5 Summary
This chapter presents a near-threshold dual-port SRAM circuit with suppressed read
and write disturb in the common-row-access mode. To mitigate the disturb that occurs
in the conventional design, a novel decoupled SRAM cell is proposed to reduce the
probability of disturb occurrences and eliminate the impact of the read disturb. A
hierarchical write bitline scheme is implemented to boost the write operation and
combat the write disturb. To minimize the bitline leakage, the proposed SRAM circuit
utilizes a virtual ground technique in the read ports. The 12T dual-port SRAM is
fabricated in a 65 nm CMOS technology and achieves a minimum operating voltage of
0.4 V in the common-row-access mode. To the best of the author's knowledge, this is
the lowest voltage among all reported designs.
Chapter 6 Design of an Ultra-low Voltage, Energy-Delay Efficient Charge-Pumped DFF
6.1 Background
The D flip-flop (DFF), as a fundamental unit in integrated circuits, can account for
a substantial share of random-logic power and area [37]. It is widely adopted in
various applications such as polar code decoders, wireless channel equalizers, and
seizure classification processors [58]-[60]. In emerging energy-constrained systems,
minimum power/energy consumption has been persistently pursued. To attain it, the
supply voltage is usually positioned near or below the threshold voltage of the
transistors for minimum energy expenditure. However, such energy-efficient systems
typically suffer from drastic performance degradation and variation problems.
Therefore, an energy-delay-efficient DFF circuit is highly desirable for
near-/sub-threshold applications.
[Figure 2.13 schematic: TGFF between D and Q, clocked by CLK through the internal
clock phases CLKB and CLKBB]
Fig. 2.13 Schematic of Transmission-Gate FF (TGFF).
[Figure 2.16 schematic: ACFF between D and Q, clocked by CLK, with internal nodes B,
BN, F, FN, G, GN, H, HN and the adaptive coupling element marked]
Fig. 2.16 Schematic of Adaptive-Coupling FF (ACFF) [37].
To achieve reliable operation with energy-delay efficiency over a wide range of
supply voltages, a DFF must satisfy the following requirements: 1) fully static
operation, as static circuits are more tolerant of PVT variations, especially at
ultra-low voltage; 2) single-phase clocking, since the toggling of an internal clock
inverter incurs large power consumption; 3) few setup time and hold time violations;
4) a minimal device count compared to traditional DFFs, for less area and less
leakage. Section 2.3.2 and Section 2.3.3 analyze two conventional and two emerging
DFF circuits. To recap, the pros and cons of the transmission-gate FF (Fig. 2.13),
the adaptive-coupling FF (Fig. 2.16) and the S2CFF (Fig. 2.17) are restated here. The
related schematics are reproduced in this chapter for convenience. The mainstream
transmission-gate FF (TGFF) is suitable for near-/sub-threshold operation due to its
robustness at low voltage. However, the main challenge of the TGFF in energy-saving
applications is its large power dissipation and low energy efficiency. Specifically,
the required local clock buffer increases its power consumption and area overhead. To
eliminate the clock buffer, a 22-transistor single-phase-clocking adaptive-coupling
FF (ACFF) circuit has been proposed [37]. By deploying a differential structure with
an adaptive coupling scheme, it eases data transitions, saves power and exhibits
better energy efficiency than the TGFF. Despite its superior energy efficiency, the
ACFF typically works in the super-threshold region (> 0.75 V), so it is not fully
qualified for ultra-low voltage operation [37]. Recently, a static
single-phase-clocked 24-transistor S2CFF [38] has been proposed for low-power
applications. As Section 2.3.3 analyzes, this DFF eliminates the clock buffer to
improve energy efficiency and enhances the robustness of the circuit by utilizing
keepers and a glitch prevention technique. However, the transistor count is
relatively large, which can result in area and leakage inefficiency.
Fig. 2.17 Schematic of Static-Single-Phase Contention-Free FF (S2CFF).
In this chapter, we present an ultra-low voltage and energy-delay efficient DFF for
near-/sub-threshold applications. The transistor count of the proposed DFF is further
decreased from 22 to 16 to save power and area, which, to the best knowledge of the
authors, is the minimum among all existing static DFF circuits.
6.2 A Novel Sub-threshold DFF
In order to achieve energy-delay efficient operation across a wide range of supply
voltages, a DFF should have the following features: 1) static operation to withstand
PVT variations; 2) single-phase clocking to suppress the power consumed by the
internal clock buffer; 3) few or no setup time and hold time violations; 4) a minimal
area penalty compared to conventional DFFs. Our proposed DFF circuit meets the above
requirements.
6.2.1 DFF Circuit Design and Near-/Sub-threshold Operation
Fig. 6.1 depicts the schematic of the 16-transistor charge-pumped DFF (CPDFF), which
adopts the master-slave structure. The CPDFF is controlled by three timing signals:
an external clock signal (CLK) and two internal clock signals (CLKH and CLKL). CLKH
and CLKL are generated by two embedded charge pumps, one positive and one negative.
CLKH reaches a voltage above VDD, while CLKL reaches a voltage below GND, as Fig. 6.1
illustrates. To minimize the area penalty, the two charge pumps are shared by 8 DFFs.
This adds the equivalent of 1.75 devices to each DFF. Even so, the CPDFF still has
the smallest transistor count among static CMOS DFFs.
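The amortized device count can be worked through as follows. This sketch assumes that
the 16-transistor figure excludes the shared charge pumps, so that the 1.75-device
share is added on top of it; the total pump device count is back-calculated from the
quoted share and is not a number stated in the text. The ACFF and S2CFF counts are
the ones quoted in Section 6.1.

```python
def amortized_devices(core_devices: int, share_per_dff: float) -> float:
    """Effective device count per DFF once shared circuitry is apportioned."""
    return core_devices + share_per_dff


sharers = 8               # DFFs sharing the two charge pumps
share_per_dff = 1.75      # quoted device-count increase per DFF
implied_pump_devices = share_per_dff * sharers  # derived, not stated in the text

cpdff_effective = amortized_devices(16, share_per_dff)
acff_devices, s2cff_devices = 22, 24

print(f"Implied devices in both charge pumps: {implied_pump_devices:.0f}")
print(f"Effective CPDFF count: {cpdff_effective} vs ACFF {acff_devices}, S2CFF {s2cff_devices}")
```

Under this assumption the effective count stays below that of the ACFF and the S2CFF
even with the charge pump share included.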
[Figure 6.1 schematic labels: D, Q, CLK, CLKL, CLKH, master stage, slave stage, pass
gates PG1 and PG2, inverters Inv1 to Inv4, capacitors C1 and C2, VDD level, GND
level, charge pump circuits shared by 8 DFFs]
Fig. 6.1 Schematic of proposed charge-pumped DFF.
The proposed DFF, like the conventional TGFF, samples and latches the data in two
clock phases. When CLK is low, the master stage samples the input (D) with CLKL.
Since CLKL is below GND in this state, it enhances |Vgs| of the first pass gate (PG1)
to ease its state transition, which is particularly beneficial for ultra-low voltage
operation. Similarly, when CLK is high, the positively boosted CLKH improves the
sampling ability of the slave stage by increasing the overdrive voltage of PG2. This
also results in an increased output swing in the sub-threshold regime.
The transistor sizing is optimized to improve the performance of the DFF. Four
inverters with different sizes (Inv1, Inv2, Inv3 and Inv4 in Fig. 6.1) are applied in
the data propagation path and the positive feedback paths. Inv1 is maximally sized to
sharpen the data transition and reduce the short-circuit current of PG1 during
switching. Inv3, with the minimum device width, is deployed in the feedback loops to
reduce the capacitive load of Inv2 and weaken the overpowering effect of the
previously stored value. The device sizes of Inv2 and Inv4 are tuned for the best C-Q
delay.
The two charge pumps generate a voltage lower than GND and a voltage higher than VDD,
respectively. The capacitor C1 charges its negative terminal to the most negative
voltage (GND_low), while C2 charges its positive terminal to the most positive
voltage (VDD_high). The pumped voltages are fed to PG1 and PG2. Both capacitors are
implemented in metal layers laid above the transistors in the layout to minimize the
area penalty. Each charge pump uses a diode-connected NMOS transistor to clamp the
pumped voltage. Fig. 6.2 shows the boosted outputs of the two charge pumps at 0.2 V
with a clock frequency of 1 kHz.
[Figure 6.2 plots: negative charge pump output (CLK_low, GND_low), Voltage (V) from
-0.20 to 0.25; positive charge pump output (CLK_high, VDD_high), Voltage (V) from 0
to 0.40]
Fig. 6.2 Simulated output waveforms of the two charge pumps at 0.2 V, 1 kHz.
6.2.2 Inverse-Narrow-Width-Effect-Aware Sizing Strategy
Research in [61] has revealed that the Inverse-Narrow-Width Effect (INWE)
significantly influences the threshold voltage and the corresponding drain current in
the near- and sub-threshold regions. The INWE is caused by the parasitic transistor
at the sharp corner formed in the shallow-trench isolation (STI) process. The
parasitic transistor switches on at a lower voltage than the main channel due to the
geometry of the STI corner. As the transistor width shrinks, the parasitic transistor
dominates the behavior of the whole transistor, so narrower transistors have a lower
threshold voltage. Fig. 6.3 shows the impact of INWE on the threshold voltage at
different supply voltages for a 90 nm CMOS technology. As the transistor width drops
below 0.5 µm, the threshold voltage decreases quickly. The threshold voltage varies
by around 130 mV as the width increases from the minimum value to 0.5 µm [61].
Fig. 6.3 NMOS threshold voltage vs. transistor width at different supply voltages
with a 90 nm CMOS technology [61].
In the sub-threshold region, the drain current depends exponentially on the threshold
voltage. As the threshold voltage is lowered, the drain current increases following
Equation (2.1). This can be leveraged to enhance the performance of a narrow-width
transistor, especially at ultra-low voltage. To utilize it, the total width of a
transistor is implemented as a combination of multiple minimum-width fingers [61].
Consequently, each minimum-width finger has the lowest threshold voltage, which
maximizes the drain current. In addition, the drain current of the transistor remains
proportional to its total width. The C-Q delay is consequently improved by the sizing
strategy, which is extremely advantageous for sub-threshold operation. The detailed
analysis of the C-Q delay improvement is presented in Section 6.3.1.
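The sensitivity of the sub-threshold current to the threshold voltage can be
illustrated numerically. The sketch below assumes the commonly used exponential
sub-threshold model, I ∝ exp((Vgs − Vth)/(n·kT/q)); the slope factor n = 1.5, the
temperature of 300 K and the finger widths are illustrative assumptions rather than
values taken from the thesis or from [61]. Only the 130 mV threshold variation comes
from the text.

```python
import math

BOLTZMANN_J_PER_K = 1.380649e-23
ELEMENTARY_CHARGE_C = 1.602176634e-19


def thermal_voltage(temp_k: float = 300.0) -> float:
    """Thermal voltage kT/q in volts."""
    return BOLTZMANN_J_PER_K * temp_k / ELEMENTARY_CHARGE_C


def current_gain_from_vth_drop(delta_vth_v: float, slope_factor: float = 1.5,
                               temp_k: float = 300.0) -> float:
    """Factor by which sub-threshold current grows when Vth drops by delta_vth_v."""
    return math.exp(delta_vth_v / (slope_factor * thermal_voltage(temp_k)))


def finger_count(total_width_nm: int, min_width_nm: int) -> int:
    """Number of minimum-width fingers needed to reach at least total_width_nm."""
    return -(-total_width_nm // min_width_nm)  # ceiling division on integers


gain = current_gain_from_vth_drop(0.130)   # 130 mV variation quoted from [61]
fingers = finger_count(600, 120)           # hypothetical widths in nanometres

print(f"Current gain for a 130 mV Vth drop: about {gain:.0f}x")
print(f"Fingers for a 600 nm device built from 120 nm fingers: {fingers}")
```

The gain depends strongly on the assumed slope factor, so the printed factor should
be read as an order-of-magnitude indication of why minimum-width fingers are
attractive in this regime.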
6.3 Analysis of CPDFF with TGFF and ACFF
The proposed CPDFF is compared with the TGFF and the ACFF through simulations of C-Q
delay, setup time, hold time and energy-delay product. The schematics of the TGFF and
the ACFF are shown in Fig. 2.13 and Fig. 2.16, respectively. For a fair comparison,
the transistor sizes of the two DFFs are optimized for the best C-Q delay through
extensive simulations. The data input and the clock signal fed to 8 CPDFFs, a TGFF
and an ACFF come from two-stage buffers. To mimic practical scenarios, an FO4 buffer
loaded with a 3 pF capacitor is connected to each DFF output.
6.3.1 C-Q Delay Investigation
The INWE-aware sizing strategy can shorten the C-Q delay. The simulated delay savings
of the proposed CPDFF are listed in Table 6.1. When the output Q is logic ‘1’, the
C-Q delay is more than 14% faster at VDD = 0.4 V. When the input data is logic ‘0’,
the CPDFF speeds up the C-Q delay by 4%. A larger improvement can be attained if the
PG2 size is increased. The performance boost is also observed at lower supply
voltage.
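The percentages in Table 6.1 follow directly from the listed Δt and T values, which
the following sketch recomputes. The pairing of values into rows is reconstructed
here from the table entries and checked against the Δt/T percentages; the function
name is illustrative.

```python
def relative_saving_pct(delta_t_ns: float, t_ns: float) -> float:
    """Delay saving as a percentage of the T value listed in the table."""
    return 100.0 * delta_t_ns / t_ns


# (VDD in V, T in ns, delta-t for data '1' in ns, delta-t for data '0' in ns)
table_6_1 = [
    (0.4, 1000.0, 149.3, 39.8),
    (0.5, 300.0, 34.0, 4.7),
    (0.6, 40.0, 10.4, 0.8),
    (0.7, 20.0, 2.2, 0.2),
]

for vdd, t_ns, dt_one, dt_zero in table_6_1:
    pct_one = relative_saving_pct(dt_one, t_ns)
    pct_zero = relative_saving_pct(dt_zero, t_ns)
    print(f"VDD = {vdd} V: data '1' {pct_one:.1f}%, data '0' {pct_zero:.1f}%")
```

At 0.4 V this reproduces the savings quoted in the text, namely more than 14% for
data ‘1’ and about 4% for data ‘0’.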
In nanometer technologies, DFF performance variation due to transistor mismatch is a
challenge. To investigate it, a Monte Carlo simulation with 3σ mismatch is conducted.
Fig. 6.4 captures the output waveforms of the three DFFs at VDD = 0.4 V. The TGFF and
the CPDFF exhibit better mismatch tolerance than the ACFF, whose output can vary by
0.75 µs when T = 2 µs, as Fig. 6.4 shows.
Table 6.1 Performance improvement from INWE-aware sizing strategy.
VDD (V)   T (ns)   Δt of data ‘1’ (ns)   Δt/T     Δt of data ‘0’ (ns)   Δt/T
0.4       1000     149.3                 14.9%    39.8                  4%
0.5       300      34                    11.3%    4.7                   1.6%
0.6       40       10.4                  26%      0.8                   2%
0.7       20       2.2                   11%      0.2                   1%
Occasionally, the ACFF fails functionally when the CLK triggers. This is due to the
severely reduced output swings of the pass gates at ultra-low voltage, despite the
aid of the adaptive coupling transistors.
Further analysis discloses that the proposed DFF outperforms the TGFF. This is
because the |Vgs| increased by the charge pumps not only eases the state transition
of a pass gate but also reduces its variability. A 1000-point Monte Carlo simulation
(Fig. 6.5) verifies that the CPDFF provides a smaller mean C-Q delay with less
variation than the TGFF.
[Figure 6.4 plot: waveforms of D, CLK, Q_CPDFF, Q_TGFF and Q_ACFF, each between 0 and
0.4 V, versus Time (μs) from 0 to 10; the Q_ACFF trace is annotated "Variation" and
"Failure"]
Fig. 6.4 Simulated output waveforms of CPDFF, TGFF and ACFF at 0.4 V.
[Figure 6.5 histograms: Occurrence, 0 to 300, versus C-Q delay (µs), 0.2 to 0.5, for
TGFF and CPDFF; annotations in (a): μ = 278.4 ns, σ = 15.6 ns and μ = 341.2 ns,
σ = 21.7 ns; annotations in (b): μ = 401.9 ns, σ = 11.9 ns and μ = 423.5 ns,
σ = 15.9 ns]
Fig. 6.5 Monte Carlo simulation results of C-Q delay: (a) data ‘0’ and (b) data
‘1’. The proposed CPDFF shows less variability.
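The Monte Carlo statistics annotated in Fig. 6.5 can be compared on a normalized
basis. The sketch below assumes that, in each panel, the distribution with the
smaller mean belongs to the CPDFF, in line with the statement that the CPDFF has the
smaller mean C-Q delay; the dictionary layout and function names are illustrative.

```python
def coefficient_of_variation(mean_ns: float, sigma_ns: float) -> float:
    """Standard deviation normalized by the mean."""
    return sigma_ns / mean_ns


def mean_reduction_pct(baseline_ns: float, improved_ns: float) -> float:
    """Reduction of the mean delay relative to the baseline, in percent."""
    return 100.0 * (baseline_ns - improved_ns) / baseline_ns


# (mean, sigma) in ns as annotated in Fig. 6.5; DFF assignment is an assumption.
stats_ns = {
    "data0": {"CPDFF": (278.4, 15.6), "TGFF": (341.2, 21.7)},
    "data1": {"CPDFF": (401.9, 11.9), "TGFF": (423.5, 15.9)},
}

for case, dffs in stats_ns.items():
    cv_cpdff = coefficient_of_variation(*dffs["CPDFF"])
    cv_tgff = coefficient_of_variation(*dffs["TGFF"])
    gain = mean_reduction_pct(dffs["TGFF"][0], dffs["CPDFF"][0])
    print(f"{case}: sigma/mu CPDFF {cv_cpdff:.1%}, TGFF {cv_tgff:.1%}, "
          f"mean delay reduced by {gain:.1f}%")
```

With this assignment, both the absolute spread and the spread relative to the mean
are smaller for the CPDFF in both data cases.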
6.3.2 Comparison of Setup Time and Hold Time
As discussed in Section 2.3.1, Tsetup and Thold are key figures of merit for DFF
circuits. A setup time (Tsetup) violation can cause an input data sampling failure
and, consequently, a functional failure of the DFF. It can be mitigated by relaxing
the clock frequency of the system. A hold time (Thold) violation also causes severe
functional problems. However, it cannot be compensated by manipulating the clock
frequency. Therefore, a hold time violation is more severe than a setup time
violation because it lacks a rectification method. Master-slave DFFs, such as the
TGFF and the proposed CPDFF, usually have a positive setup time and a negative hold
time. The negative hold time originates from the fact that the data has already been
latched by the master stage. It relaxes the requirement that the data remain
unchanged after the clock edge. The negative hold time together with the positive
setup time makes master-slave DFFs less prone to data races [11].
[Figure 6.6 bar charts: (a) Setup Time (ns), 100 to 700, and (b) Hold Time (ns), -700
to 0, of CPDFF, TGFF and ACFF at the TT, FF and SS process corners; bar annotations
in (a): 50.2, 35, 121, 6.3, 6.8, 17.9, 593.5, 202.5, failure; bar annotations in (b):
-6, -49.6, -68.5, -1, -8.3, -12.8, -6.9, -405.9, failure]
Fig. 6.6 (a) Setup time of CPDFF, TGFF and ACFF at different process corners and
(b) Hold time of CPDFF, TGFF and ACFF at different process corners.
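The asymmetry between the two violations can be seen from the standard
register-to-register timing constraints. The sketch below uses the textbook
single-cycle model without clock skew; all numbers are hypothetical values in
nanoseconds chosen only for illustration, and the function names are not taken from
the thesis.

```python
def setup_slack_ns(t_clk: int, t_cq: int, t_logic_max: int, t_setup: int) -> int:
    """Setup constraint: t_clk >= t_cq + t_logic_max + t_setup (slack >= 0)."""
    return t_clk - (t_cq + t_logic_max + t_setup)


def hold_slack_ns(t_cq: int, t_logic_min: int, t_hold: int) -> int:
    """Hold constraint: t_cq + t_logic_min >= t_hold (slack >= 0)."""
    return t_cq + t_logic_min - t_hold


# Hypothetical path: a setup violation at a 500 ns clock period...
slack_fast_clock = setup_slack_ns(t_clk=500, t_cq=300, t_logic_max=200, t_setup=50)
# ...disappears once the clock period is relaxed to 600 ns.
slack_slow_clock = setup_slack_ns(t_clk=600, t_cq=300, t_logic_max=200, t_setup=50)

# The hold check does not contain t_clk, so slowing the clock cannot repair it.
# A negative hold time (here -6 ns) only adds margin.
slack_hold = hold_slack_ns(t_cq=300, t_logic_min=0, t_hold=-6)

print(slack_fast_clock, slack_slow_clock, slack_hold)
```

Because the clock period does not appear in the hold check, the only remedies for a
hold violation are changes to the data path or to the flip-flop itself, which is why
a negative hold time is valuable.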
Tsetup and Thold at different process corners are illustrated in Fig. 6.6(a) and (b),
respectively. At the TT corner, the CPDFF has a moderate setup time, whereas the ACFF
needs the longest setup time. At the SS corner and VDD = 0.4 V, the ACFF fails due to
the extremely degraded output swing of its pass gates. Therefore, its setup time
cannot be obtained under this condition. Similarly, at the SS corner, the hold time
of the ACFF is not attainable (Fig. 6.6(b)). This reveals that the ACFF circuit is
prone to fail at ultra-low voltage under skewed process conditions. A negative hold
time with a large absolute value is preferred. The CPDFF, as Fig. 6.6(b) indicates,
provides a small negative hold time but exhibits less variability than the TGFF.
6.3.3 Analysis of Energy-Delay Product
Energy-delay space analysis is an effective means to compare the utility of various
DFFs [11],[62]. As ultra-low voltage/power applications emerge, an in-depth
understanding of the energy-delay (ED) tradeoff is crucial to fairly evaluate both
energy and performance. A wide range of ED tradeoffs can be explored by varying the
exponents i and j in the figure of merit EⁱDʲ. The investigation of minimum ED² to
ED⁵ is a high-performance-emphasis approach, while the exploration of minimum E²D and
E³D is more biased toward low power. However, the basic energy-delay product is
adequate to weight energy and delay equally when examining the features in the
ultra-low voltage domain.
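The effect of the exponents in the energy-delay figure of merit can be shown with a
small example. The two design points below are hypothetical, in arbitrary units, and
chosen only to show how the preferred design changes with the weighting; the function
name is illustrative.

```python
def ed_metric(energy: float, delay: float, i: int = 1, j: int = 1) -> float:
    """Figure of merit E^i * D^j; lower is better."""
    return energy ** i * delay ** j


# Hypothetical designs in arbitrary units: A is frugal but slow, B is fast but hungry.
design_a = {"energy": 1, "delay": 4}
design_b = {"energy": 3, "delay": 2}

for label, (i, j) in {"ED": (1, 1), "ED^2": (1, 2), "E^2D": (2, 1)}.items():
    a = ed_metric(design_a["energy"], design_a["delay"], i, j)
    b = ed_metric(design_b["energy"], design_b["delay"], i, j)
    better = "A" if a < b else "B"
    print(f"{label}: A = {a}, B = {b}, better design: {better}")
```

The delay-weighted metric favors the faster design, whereas the plain product and
the energy-weighted metric favor the more frugal one, which is the sense in which the
plain product weights energy and delay equally.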
Fig. 6.7 plots the energy-delay products of the CPDFF, the TGFF and the ACFF with
respect to the data activity α. When α is 0%, the ACFF is still the most efficient
DFF style. The CPDFF is not the best if α < 40%, as Fig. 6.7 shows, because the
charge pumps still consume active power when α is low or even 0%, which degrades the
energy-delay efficiency. However, as α increases, the advantage of the CPDFF over the
ACFF becomes more obvious. At 100% data activity, the CPDFF exhibits more than 30%
improvement in energy-delay compared to the ACFF. In addition, the CPDFF is always
more efficient than the TGFF regardless of data activity. The power follows a similar
trend, and the experimental data will be analyzed in Section 6.4.
[Figure 6.7 plot: Energy-Delay (a.u.), 0 to 6, versus Data Activity, 0% to 100%;
curves: TGFF, ACFF and CPDFF]
Fig. 6.7 Simulated Energy-Delay product against data activity at VDD = 0.4 V.
6.4 Test Chip Implementation and Measurement
A test chip comprising 9 DFFs, 2 charge pumps and 2 FIFOs (First-In-First-Out) is
fabricated in a 180 nm technology with a nominal voltage of 1.8 V. One TGFF and eight
CPDFF circuits sharing the two charge pumps are implemented to test timing
parameters, power consumption and energy-delay product. The outputs of the DFFs are
buffered with 2 stages of FO4 inverters, whose power is not included in the DFF power
calculation. For the two charge pumps, the plate capacitors utilize the top two metal
layers. By laying them over the lower layers, the area overhead of the capacitors is
reduced by 50%. Based on the preliminary simulations, the ACFF circuit is not able to
operate in the sub-threshold region due to its functionality issue. Therefore, this
circuit is not fabricated. Two 256-bit FIFO circuits are implemented (Fig. 6.8): one
deploys the CPDFF and the other utilizes the TGFF. The FIFOs are synthesized with the
same control logic and IO circuits.
Fig. 6.9 shows the measured C-Q delay against VDD. The CPDFF is fully functional down
to 0.18 V with a maximum frequency of 1 kHz. At the minimum voltage, a 24.5% delay
reduction is observed for the proposed DFF. From 0.3 V to 0.18 V, the CPDFF provides
23% faster C-Q delays than the TGFF on average. Fig. 6.10(a) shows the measured power
against VDD at a frequency of 1 kHz, while Fig. 6.10(b) shows the measured power
against frequency at VDD = 0.5 V. The power curve in Fig. 6.10(a) shows that the
CPDFF consumes less power than the TGFF over the whole VDD range, with a maximum
reduction of 42.3% at 0.5 V and a reduction of 15.6% at 0.18 V. In Fig. 6.10(b), when
the frequency is swept from 1 kHz to 1 MHz, the power consumption of the CPDFF rises
to 1.4× and that of the TGFF to 1.2×. The power reduction achieved by the CPDFF at
each operating frequency is depicted with the hollow-dotted curve in the same figure.
The CPDFF is more power efficient at low frequencies, such as 1 kHz and 10 kHz, and
it still dissipates 34.1% less at 1 MHz.
Fig. 6.11 shows the energy-delay products of the two DFF designs with respect to the
data activity α. At VDD = 0.5 V, the CPDFF achieves an energy-delay of 11 pJ·ns at 0%
data activity, which is 50.9% less than that of the TGFF. As the data activity
increases, the energy-delay curve of the CPDFF rises slightly and attains 13.1 pJ·ns
at α = 100%, whereas that of the TGFF is more than twice as large. On average, the
CPDFF reduces the energy-delay product by 50.8% compared to the TGFF in the
near-/sub-threshold region. The power reduction of the FIFO achieved with the CPDFF
is illustrated in Fig. 6.12. A 45% power saving (not including IO and control logic
power) is achieved at 0.3 V and 10% data activity thanks to the CPDFF, and the total
power is reduced by 31.2%. Fig. 6.13(a) captures the output waveforms of the CPDFF at
the minimum operating voltage. The die micro-photograph is presented in Fig. 6.13(b).
[Figure 6.8 block diagram: synchronous FIFO built from D flip-flops with ports wdata,
rdata, waddr, raddr and clk; 16 bits per word]
Fig. 6.8 Architecture of the FIFO circuit.
[Figure 6.9 plot: C-Q Delay (µs), 0 to 80, versus VDD (V), 0 to 0.6; curves: TGFF and
CPDFF; annotations: 24.5% reduction @ 0.18 V, 23% average improvement from 0.18 V to
0.3 V]
Fig. 6.9 Measured C-Q delay against VDD.
[Figure 6.10 plots: (a) Power (µW), 0 to 0.1, versus VDD (V), 0 to 0.6, annotated
42.3% reduction @ 0.5 V and 15.6% improvement @ 0.18 V; (b) Power (µW), 0 to 0.12,
and power reduction, 0% to 45%, versus Frequency (kHz), 0.1 to 10000, annotated
42.3% @ 1 kHz, 34.1% @ 1 MHz and × 1.4; curves: CPDFF and TGFF]
Fig. 6.10 (a) Measured power against VDD. (b) Measured power against frequency.
[Figure 6.11 plot: Energy-Delay (pJ·ns), 5 to 30, versus Data Activity, 0% to 100%;
TGFF: 22.4 pJ·ns @ 0% and 27.2 pJ·ns @ 100%; CPDFF: 11 pJ·ns @ 0% and 13.1 pJ·ns @
100%; annotation: 50.8% improvement on average]
Fig. 6.11 Measured energy-delay product.
Fig. 6.12 Measured power of the 2 FIFOs at 0.3 V with 10% data activity.
[Figure 6.13 content: (a) traces of CLK and Q_CPDFF with 180 mV full swing and a
52.5 μs C-Q delay; (b) die photo with the FIFO and DFFs marked]
Fig. 6.13 (a) Screen capture of CPDFF output waveforms at 0.18 V and (b) die
micro-photograph.
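The quoted energy-delay improvements at the two ends of the activity sweep can be
recomputed from the values annotated in Fig. 6.11. The sketch below uses only those
four end points; the 50.8% average is taken over the whole sweep and cannot be
reproduced from the end points alone. The function name is illustrative.

```python
def relative_reduction_pct(baseline: float, improved: float) -> float:
    """Reduction of a metric relative to its baseline, in percent."""
    return 100.0 * (baseline - improved) / baseline


# Energy-delay products in pJ*ns at VDD = 0.5 V, keyed by data activity in percent.
ed_tgff = {0: 22.4, 100: 27.2}
ed_cpdff = {0: 11.0, 100: 13.1}

for activity in (0, 100):
    reduction = relative_reduction_pct(ed_tgff[activity], ed_cpdff[activity])
    ratio = ed_tgff[activity] / ed_cpdff[activity]
    print(f"alpha = {activity}%: {reduction:.1f}% lower, TGFF/CPDFF ratio {ratio:.2f}")
```

The 0% point reproduces the 50.9% figure in the text, and the 100% point confirms
that the TGFF value is slightly more than twice that of the CPDFF.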
6.5 Summary
This chapter presents a 0.18 V energy-delay efficient 16-transistor CPDFF targeting
near-/sub-threshold operation. With the aid of charge pumps and the INWE-aware sizing
strategy, a 23% faster C-Q delay is observed from 0.18 V to 0.3 V. The delay
variability is minimized by the charge-pumped overdrive voltages. The CPDFF proves to
have a 50.8% lower energy-delay product than the TGFF. The utility of the CPDFF is
verified with a 256-bit FIFO, which achieves a 31.2% power reduction at 0.3 V. The
experimental results validate that the proposed CPDFF is suitable for
near-/sub-threshold applications.
Chapter 7 Conclusions and Future Works
7.1 Conclusions
The research work begins with an investigation of the impact of MTCMOS devices on
SRAM energy efficiency. Comprehensive simulations reveal that the device combinations
cause large variations in energy efficiency. Combined with assist techniques, such as
the column-interleaved scheme and the boosted wordline, the energy efficiency of the
MTCMOS SRAM can be enhanced by as much as 33×.
Leakage and energy efficiency are primary concerns for ultra-low voltage SRAM
design. The thesis presents several circuit techniques to implement an energy
efficient single-port SRAM with reliable read operation at ultra-low voltage. The
proposed 9T SRAM cell with equalized bitline leakage fosters read operation at
sub-threshold regime. To further reduce the static energy, MTCMOS technology is
utilized to reduce the leakage in the SRAM array. The corresponding degraded
energy efficiency is compensated by a CAM-assisted write performance boosting
circuit which speeds up the clock frequency.
Disturb due to common-row access is a paramount challenge for dual-port SRAM
circuits. The research work explores design techniques to tackle the issue by
proposing a 12T dual-port SRAM cell with hierarchical bitline and virtual ground
schemes. The novel SRAM cell decreases the probability of suffering disturb and
suppresses read disturb under the common-row access condition. The hierarchical write
bitlines boost the write operation and improve the write ability, which is originally
degraded by write disturb. A test chip has validated successful common-row access at
0.4 V.
Finally, a 0.18 V energy-delay efficient charge-pumped DFF targeting
near-/sub-threshold operation is presented in the research work. The C-Q delay is
improved with the aid of charge pumps and the inverse-narrow-width-effect-aware
sizing strategy. The correspondingly enhanced overdrive voltage minimizes the delay
variability at ultra-low voltage. The proposed DFF proves to have a much lower
energy-delay product and to be able to work with a supply voltage of 0.18 V.
7.2 Future Works
The ultimate goal of the research work is to develop sub-threshold circuit design
techniques for microwatt applications with robustness and high energy efficiency. The
research consists of the following major tasks, ranging from design methodology to
the most essential circuit blocks for microwatt systems: 1) design and optimization
techniques for ultra-low voltage digital and memory circuits; 2) design of energy
efficient near-/sub-threshold memories; 3) design of energy-delay efficient
sub-threshold digital logic; and 4) design of a sub-threshold biomedical signal
processor utilizing the circuits proposed in 2) and 3). The thesis has presented
energy efficient sub-threshold memories and logic design. The remaining work, the
biomedical signal processor, will be presented in the future.
The target application of the biomedical signal processor is a wireless neural SoC
platform which has multiple channels with various digital and mixed circuits. Most
of the state-of-the-art works use nominal supply voltage in memory domain and
logic control domain, which is not efficient with respect to energy consumption. As
energy is a topmost design constraint in wireless systems, circuit design techniques
for high energy efficiency have to be continuously pursued and explored in the
future.
In addition, traditional design methods using HDL coding are not suitable for
sub-threshold operation due to the lack of information in standard cell libraries and
the huge variations. As an ultra-low voltage signal processor with reliable
sub-threshold operation and high energy efficiency is in high demand, novel
architectures and circuit techniques for ultra-low power consumption and robustness
will be investigated. A standard cell library that is exclusively optimized for
sub-threshold operation will be created based on the work from 1) and 3). Together
with the proposed energy efficient SRAMs, it can greatly improve the performance and
power consumption of the processor in the future.
Publications
Journal
[1] B. Wang, T. Q. Nguyen, A. Do, J. Zhou, M. Je, and T. Kim, “Design of an
ultra-low voltage 9T SRAM with equalized bitline leakage and CAM-assisted
energy efficiency improvement,” IEEE Transactions on Circuits and
Systems-I (TCAS-I), vol. 62, no. 2, pp. 441-448, 2015.
[2] B. Wang, J. Zhou, and T. Kim, “SRAM devices and circuits optimization
toward energy efficiency in multi-Vth CMOS,” Elsevier Microelectronics
Journal (MEJ), vol. 46, no. 3, pp. 265-272, 2015.
[3] X. Liu, J. Zhou, Y. Yang, B. Wang, J. Lan, C. Wang, J. Luo, W. Goh, T.
Kim, and M. Je, “A 457-nW near-threshold cognitive multi-functional ECG
processor CMOS for long-term cardiac monitoring,” IEEE Journal of Solid-
State Circuits (JSSC), vol. 49, no. 11, pp. 2422-2434, 2014.
Conference
[4] B. Wang, J. Zhou, and T. Kim, “Ultra-low Power 12T Dual Port SRAM for
Hardware Accelerators,” IEEE International SoC Design
Conference (ISOCC), pp. 274-275, Nov. 2014.
[5] A. Do, Z. Lee, B. Wang, I. Chang, and T. Kim, “0.2V 8T SRAM with
Improved Bitline Sensing Using Column-based Data Randomization,” IEEE
Asian Solid-State Circuits Conference (A-SSCC), pp. 141-144, Nov. 2014.
[6] J. Zhou, X. Liu, C. Wang, K. Chang, J. Luo, J. Lan, L. Liao, Y. Lam, Y.
Yang, B. Wang, X. Zhang, W. Goh, T. Kim, and M. Je, “A 0.5 V 29
pJ/Cycle Sensor Node Processor for Intelligent Sensing Applications,” IEEE
International SoC Design Conference (ISOCC), pp. 70-71, Nov. 2014.
[7] B. Wang, J. Zhou, K. H. Chang, M. Je, and T. Kim, “A 0.18V charge-
pumped DFF with 50.8% energy-delay reduction for near-/sub-threshold
circuits,” IEEE Asian Solid-State Circuits Conference (A-SSCC), pp. 121-
124, Nov. 2013.
[8] X. Liu, J. Zhou, Y. Yang, B. Wang, J. Lan, C. Wang, J. Luo, W. Goh, T.
Kim, and M. Je, “A 457-nW Cognitive Multi-Functional ECG
Processor,” IEEE Asian Solid-State Circuits Conference (A-SSCC), pp. 141-
144, Nov. 2013.
[9] Y. Yeoh, B. Wang, X. Yu, and T. Kim, “A 0.4 V 7T SRAM with Write
Through Virtual Ground and Ultra-fine Grain Power Gating
Switches,” IEEE International Symposium on Circuits and Systems (ISCAS),
pp. 3030-3033, May 2013.
[10] B. Wang, T. Q. Nguyen, A. Do, J. Zhou, M. Je, and T. Kim, “A 0.2V 16Kb
9T SRAM with bitline leakage equalization and CAM-assisted write
performance boosting for improving energy efficiency,” IEEE Asian Solid-
State Circuits Conference (A-SSCC), pp. 73-76, Nov. 2012.
[11] T. Kim, B. Wang, and A. Do, “High Energy Efficient Ultra-low Voltage
SRAM Design: Device, Circuit, and Architecture,” International SoC Design
Conference (ISOCC), pp. 367-370, Nov. 2012.
[12] Q. Li, B. Wang, and T. Kim, “A 5.61 pJ, 16 kb 9T SRAM with Single-
ended Equalized Bitlines and Fast Local Write-back for Cell Stability
Improvement,” IEEE European Solid-State Device Research Conference
(ESSDERC), pp. 201-204, Sep. 2012.
[13] B. Wang, J. Zhou, and T. Kim, “Maximization of SRAM Energy Efficiency
Utilizing MTCMOS Technology,” Asia Symposium on Quality Electronic
Design (ASQED), pp. 35-40, Jul. 2012.
Bibliography
[1] M. Bohr, “The new era of scaling in an SoC world,” ISSCC Dig. Tech.
Papers, pp. 23-28, 2009.
[2] S. Damaraju, V. George, S. Jahagirdar, T. Khondker, R. Milstrey, S. Sarkar,
S. Scott, I. Stolero, and A. Subbiah, “A 22nm IA multi-CPU and GPU
System-on-Chip, ” ISSCC Dig. Tech. Papers, pp. 56-57, 2012.
[3] R. Islam, A. Sabbavarapu, and R. Patel, “Power Reduction Schemes in Next
Generation Intel ATOM Processor based SoC for Handheld Applications,”
IEEE Symposium on VLSI Circuits, pp. 173-174, 2010.
[4] H. Lakdawala, M. Schaecher, C. Fu, R. Limaye, J. Duster, Y. Tan, A.
Balankutty, E. Alpman, C. Lee, S. Suzuki, B. Carlton, H. Kim, M. Verhelst,
S. Pellerano, T. Kim, D. Srivastava, S. Venkatesan, H. Lee, P. Vandervoorn,
J. Rizk, C. Jan, K. Soumyanath, and S. Ramamurthy, “32nm x86 OS-
Compliant PC On-Chip with Dual-Core Atom Processor and RF WiFi
Transceiver,” ISSCC Dig. Tech. Papers, pp. 62-64, 2012.
[5] S. Rusu, “Microprocessor Design in the Nanoscale Era,” http://www.ieee-
jp.org/section/kansai/chapter/sscs/20120719/uP_Design_July_2012.pdf
[6] Q. Li, B. Wang, and T. Kim, “A 5.61 pJ, 16 kb 9T SRAM with Single-
ended Equalized Bitlines and Fast Local Write-back for Cell Stability
Improvement,” Proc. of the European Solid-State Device Research
Conference, pp. 201-204, 2012.
[7] A. Wang and A. Chandrakasan, “A 180-mV sub-threshold FFT processor
using a minimum energy design methodology,” IEEE J. Solid-State Circuits,
vol. 40, no. 1, pp. 310–319, 2005.
[8] S. Hanson et al., “A Low-Voltage Processor for Sensing Applications with
Picowatt Standby Mode,” IEEE J. Solid-State Circuits, vol. 44, no. 4, pp.
1145-1155, 2009.
[9] G. Chen et al., “Millimeter-Scale Nearly Perpetual Sensor System with
Stacked Battery and Solar Cells,” ISSCC Dig. Tech. Papers, pp. 288-289,
2010.
124
[10] K. Lee and N. Verma, “A 1.2-0.55V General-purpose Biomedical Processor
with Configurable Machine-learning Accelerators for High-order Patient-
adaptive Monitoring,” Proc. Eur. Solid-State Circuits Conf. (ESSCIRC), pp.
285-288, 2012.
[11] M. Alioto, E. Consoli, and G. Palumbo, “Analysis and Comparison in the
Energy-Delay-Area Domain of Nanometer CMOS Flip-Flops: Part I —
Methodology and Design Strategies,” IEEE. Trans. on VLSI Systems, pp.
725-736, 2011.
[12] M. Alioto, E. Consoli, and G. Palumbo, “Analysis and Comparison in the
Energy-Delay-Area Domain of Nanometer CMOS Flip-Flops: Part II —
Results and Figures of Merit,” IEEE Trans. on VLSI Systems, pp. 737-750,
2011.
[13] L. Chang, Y. Nakamura, R. Montoye, J. Sawada, A. Martin, K. Kinoshita, F.
Gebara, K. Agarwal, D. Acharyya, W. Haensch, K. Hosokawa, and D.
Jamsek, “A 5.3GHz 8T-SRAM with Operation Down to 0.41 V in 65 nm
CMOS,” IEEE Symposium on VLSI Circuits, pp. 250-253, 2007.
[14] Y. Wang, U. Bhattacharya, F. Hamzaoglu, P. Kolar, Y. Ng, L. Wei, Y.
Zhang, K. Zhang, and M. Bohr, “A 4.0 GHz 291 Mb Voltage-scalable
SRAM Design in a 32 nm High-k+ Metal-gate CMOS Technology With
Integrated Power Management,” IEEE J. Solid-State Circuits, vol. 45, no. 1,
pp. 103-110, 2010.
[15] P. Kolar, E. Karl, U. Bhattacharya, F. Hamzaoglu, H. Nho, Y. Ng, Y. Wang,
and K. Zhang, “A 32 nm High-k Metal Gate SRAM With Adaptive
Dynamic Stability Enhancement for Low-Voltage,” IEEE J. Solid-State
Circuits, vol. 46, no. 1, pp. 76-84, 2011.
[16] T. Kim, J. Liu, and C. Kim, “A Voltage Scalable 0.26 V, 64 kb 8T SRAM
with Vmin Lowering Techniques and Deep Sleep Mode,” IEEE J. Solid-
State Circuits, vol. 44, no. 6, pp. 1785-1795, 2009.
[17] B. Calhoun and A. Chandrakasan, “A 256-kb 65-nm sub-threshold SRAM
Design for Ultra-Low-Voltage Operation,” IEEE J. Solid-State Circuits, vol.
42, no. 3, pp. 680-688, 2007.
125
[18] T. Kim, J. Liu, J. Keane, and C. Kim, “A 0.2 V, 480 kb Subthreshold SRAM
With 1k Cells Per Bitline for Ultra-Low-Voltage Computing,” IEEE J.
Solid-State Circuits, vol. 43, no. 2, pp. 518-529, 2008.
[19] T. Song, W. Rim, J. Jung, G. Yang, J. Park, S. Park, K. Baek, S. Baek, S.
Oh, J. Jung, S. Kim, G. Kim, J. Kim, Y. Lee, K. Kim, S. Sim, J. Yoon, and
K. Choi, “A 14nm FinFET 128Mb 6T SRAM with VMIN-Enhancement
Techniques for Low-Power Applications,” ISSCC Dig. Tech. Papers, pp.
232-232, 2014.
[20] Y. Chen, W. Chan, W. Wu, H. Liao, K. Pan, J. Liaw, T. Chung, Q. Li, G.
Chang, C. Lin, M. Chiang, S. Wu, S. Natarajan, and J. Chang, “A 16nm
128Mb SRAM in High-k Metal-Gate FinFET Technology with Write-Assist
Circuitry for Low-VMIN Applications,” ISSCC Dig. Tech. Papers, pp. 238-
239, 2014.
[21] J. Chang, Y. Chen, H. Cheng, W. Chan, H. Liao, Q. Li, S. Chang, S.
Natarajan, R. Lee, P. Wang, S. Lin, C. Wu, K. Cheng, M. Cao, and G.
Chang, “A 20nm 112Mb SRAM in High-k Metal-Gate with Assist Circuitry
for Low-Leakage and Low-VMIN Applications,” ISSCC Dig. Tech. Papers,
pp. 316-318, 2013.
[22] K. Agawa, H. Hara, T. Takayanagi, and T. Kuroda, “A Bitline Leakage
Compensation Scheme for Low-Voltage SRAMs,” IEEE J. Solid-State
Circuits, vol. 36, no. 5, pp. 726-734, 2001.
[23] A. Calimera, A. Macii, E. Macii, and M. Poncino, “Design Techniques and
Architectures for Low-Leakage SRAMs,” IEEE Trans. Circuits and Systems
— I, vol. 59, no. 9, pp. 1992-2007, 2012.
[24] Y. Lai and S. Huang, “X-Calibration: A Technique for Combating
Excessive Bitline Leakage Current in Nanometer SRAM Designs,” IEEE J.
Solid-State Circuits, vol. 43, no. 9, pp. 1964-1971, 2008.
[25] C. Lo and S. Huang, “P-P-N Based 10T SRAM Cell for Low-Leakage and
Resilient Subthreshold Operation,” IEEE J. Solid-State Circuits, vol. 46, no.
3, pp. 695-704, 2011.
126
[26] J. Kim, Y. Choi, J. Jeong, S. Lee, and S. Kim, “The v2.0+ EDR Blue-tooth
SoC Architecture for Multimedia,” IEEE Trans. Consum. Electron., vol. 52,
no. 2, pp. 436-444, 2006.
[27] T. Shiota, K. Kawasaki, Y. Kawabe, W. Shibamoto, A. Sato, T. Hashimoto,
F. Hayakawa, S. Tago, H. Okano, Y. Nakamura, H. Miyake, A. Suga, and H.
takahashi, “A 51.2 GOPS 1.0 GB/s-DMA Single-chip Multi-processor
Integrating Quadruple 8-way VLIW processors,” ISSCC Dig. Tech. Papers,
pp. 194-195, 2005.
[28] M. Makajima, T. Yamamoto, M. Yamasaki, K. Kaneko, and T. Hosoki,
“Homogeneous Dual-processor Core with Shared L1 Cache for Mobile
Multimedia SoC,” IEEE Symposium on VLSI Circuits, pp. 216-217, 2007.
[29] M. Miyama, J. Miyakoshi, Y. Kuroda, K. Imamura, H. Hashimoto, and M.
Yoshimoto, “A sub-mW MPEG-4 Motion Estimation Processor Core for
Mobile Video Application,” IEEE J. Solid-State Circuits, vol. 39, no. 9, pp.
1562-1570, 2004.
[30] K. Nii, Y. Tsukamoto, M. Yabuuchi, Y. Masuda, S. Imaoka, K. Usui, S.
Ohbyashi, H. Makino, and H. Shinohara, “Synchronous Ultra-High-Density
2RW Dual-Port 8T-SRAM with Circumvention of Simultaneous Common-
Row-Access,” IEEE J. Solid-State Circuits, vol. 44, no. 3, pp. 977-986,
2009.
[31] M. Yabuuchi, Y. Tsukamoto, M. Morimoto, M. Tanaka, and K. Nii, “20nm
High-Density Single-Port and Dual-Port SRAMs with Wordline-Voltage-
Adjustment System for Read/Write Assists,” ISSCC Dig. Tech. Papers, pp.
234-235, 2014.
[32] Y. Ishii, H. Fujiwara, K. Nii, H. Chigasaki, O. Kuromiya, T. Saiki, A.
Miyanishi, and Y. Kihara, “A 28-nm Dual-Port SRAM Macro with Active
Bitline Equalizing Circuitry against Write Disturb Issue,” IEEE Symposium
on VLSI Circuits, pp. 99-100, 2010.
[33] W. Dally and J. Poulton, “Digital Systems Engineering,” Cambridge
University Press, pp. 574, 1998.
[34] R. Baker, “CMOS Circuit Design, Layout, and Simulation,” IEEE Press,
Wiley, pp. 388-389, 2010.
127
[35] C. Chang and P. Gupta, “Calibration of Setup and Hold Time for Latches
and Flip-Flops,” nanocad.ee.ucla.edu/pub/Main/SnippetTutorial/calI.pdf
[36] C. Chen, K. Bowman, C. Augustine, Z. Zhang, and J. Tschanz, “Minimum
Supply Voltage for Sequential Logic Circuits in a 22nm Technology,” IEEE
International Symp. Low-Power Electronics and Design, pp. 181-186, 2013.
[37] C. Teh, T. Fujita, H. Hara, and M. Hamada, “A 77% Energy-Saving 22-
Transistor Single-Phase-Clocking D-Flip-Flop with Adaptive-Coupling
Configuration in 40nm CMOS,” ISSCC Dig. Tech. Papers, pp. 338-340,
2011.
[38] Y. Kim, W. Jung, I. Lee, Q. Dong, M. Henry, D. Sylvester, and D. Blaauw,
“A Static Contention-Free Single-Phase-Clocked 24T Flip-Flop in 45nm for
Low-Power Applications,” ISSCC Dig. Tech. Papers, pp. 466-467, 2014.
[39] S. Jain, S. Khare, S. Yada, V. Ambili, P. Salihundam, S. Ramani, S.
Muthukumar, M. Srinivasan, A. Kumar, S. Gb, R. Ramanarayanan, V.
Erraguntla, J. Howard, S. Vangal, S. Dighe, G. Ruhl, P. Aseron, H. Wilson,
N. Borkar, V. De, and S. Borkar, “A 280mV-to-1.2V Wide-Operating-
Range IA-32 Processor in 32nm CMOS,” ISSCC Dig. Tech. Papers, pp. 66-
68, 2012.
[40] P. Liu, J. Wang, M. Phan, M. Garg, R. Zhang, A. Cassier, L. Chua-Eoan, B.
Andreev, S. Weyland, S. Ekbote, M. Han, J. Fischer, G. Yeap, P. Wang, Q.
Li, C. Hou, S. Lee, Y. Wang, S. Lin, M. Cao, and Y. Mii, “A dual core
oxide 8T SRAM cell with low Vccmin and dual voltage supplies in 45nm
triple gate oxide and multi Vt CMOS for very high performance yet low
leakage mobile SoC applications,” IEEE Symposium on VLSI Technology,
pp. 135-136, 2010.
[41] C. Diaz, K. Young, J. Hsu, J. Lin, C. Hou, C. Lin, J. Liaw, C. Wu, C. Su, C.
Wang, J. Ting, S. Yang, K. Lee, S. Wu, C. Tsai, H. Tao, S. Jang, S. Shue, H.
Hsieh, Y. Wang, C. Chen, S. Yang, S. Fu, S. Chang, T. Lo, J. Wu, J. Shy, C.
Liu, S. Chen, B. Lin, B. Liew, T. Yen, C. Yu, Y. Chao, M. Liang, C. Wang,
and J. Sun, “A 0.18 μm CMOS Logic Technology with Dual Gate Oxide
and Low-k Interconnect for High-Performance and Low-Power
Applications,” IEEE Symposium on VLSI Technology, pp. 11–12, 1999.
128
[42] M. E. Sinangil, N. Verma, and A. Chandrakasan, “A 54 nm 0.5V 8T
column-interleaved SRAM with on-chip reference selection loop for sense
amplifier,” IEEE Asian Solid-State Circuits Conf. (A-SSCC), pp. 225-228,
2009.
[43] N. Verma, and A. Chandrakasan, “A 256 kb 65 nm 8T Subthreshold SRAM
Employing Sense-Amplifier Redundancy,” IEEE J. Solid-State Circuits, vol.
43, no. 1, pp. 141-149, 2008.
[44] H. Kim, Y. Kim, J. Oh, and L. Kim, “A Reconfigurable SIMT Processor for
Mobile Ray Tracing with Contention Reduction in Shared Memory,” IEEE
Trans. on Circuits and System — I, vol. 60, no. 4, pp. 938-950, 2013.
[45] M. Ghaed, G. Chen, R. Haque, M. Wieckowski, Y. Kim, G. Kim, Y. Lee, I.
Lee, D. Fick, D. Kim, M. Seok, K. Wise, D. Blaauw, and D. Sylvester,
“Circuits for a Cubic-Millimeter Energy-Autonomous Wireless Intraocular
Pressure Monitor,” IEEE Trans. on Circuits and System — I, vol. 60, no. 12,
pp. 3152-3162, 2013.
[46] M. Tu, J. Lin, M. Tsai, C. Lu, Y. Lin, M. Wang, H. Huang, K. Lee, W. Shih,
S. Jou, and C. Chuang, “A Single-Ended Disturb-Free 9T Subthreshold
SRAM With Cross-Point Data-Aware Write Word-Line Structure, Negative
Bit-Line, and Adaptive Read Operation Timing Tracing,” IEEE J. Solid-
State Circuits, vol. 47, no. 6, pp. 1469-1482, 2012.
[47] A. Teman, L. Pergament, O. Cohen, and A. Fish, “A 250 mV 8 kb 40 nm
Ultra-Low Power 9T Supply Feedback SRAM (SF-SRAM),” IEEE J. Solid-
State Circuits, vol. 46, no. 11, pp. 2713-2726, 2011.
[48] A. Alvandpour, D. Somasekhar, R. Krishnamurthy, V. De, S. Borkar, and C.
Svensson, “Bitline Leakage Equalization for Sub-100nm Caches,” Eur.
Solid-State Circuits Conf. (ESSCIRC), pp. 401-404, 2003.
[49] D. Jeon, M. Seok, C. Chakrabarti, D. Blaauw, and D. Sylvester, “A Super-
Pipelined Energy Efficiency Subthreshold 240 MS/s FFT Core in 65 nm
CMOS,” IEEE J. Solid-State Circuits, vol. 47, no. 1, pp. 23-34, 2012.
[50] R. Abdallah and N. Shanbhag, “A 14.5 fJ/cycle/k-Gate, 0.33 V ECG
Processor in 45 nm CMOS Using Statistical Error Compensation,” IEEE
Custom Integr. Circuits Conf. (CICC), 2012, pp. 1-4, 2012.
129
[51] B. Wang, J. Zhou, and T. Kim, “Maximization of SRAM Energy Efficiency
Utilizing MTCMOS Technology,” Asia Symp. Quality Electronic Design
(ASQED), pp. 35-40, 2011.
[52] K. Pagiamtzis and A. Sheikholeslami, “Content-Addressable Memory
(CAM) Circuits and Architectures: A Tutorial and Survey,” IEEE J. Solid-
State Circuits, vol. 41, no. 3, pp. 712-7272006.
[53] Y. Sinangil, and A. Chandrakasan, “An Embedded Energy Monitoring
Circuit for a 128kbit SRAM with Body-biased Sense-Amplifiers,” Asian
Solid-State Circuits Conf. (A-SSCC), pp. 69–72, 2012.
[54] S. Lütkemeier, T. Jungeblut, H. Berge, S. Aunet, M. Porrmann, and U.
Ruckert, “A 65 nm 32 b Subthreshold Processor With 9T Multi-Vt SRAM
and Adaptive Supply Voltage Control,” IEEE J. Solid-State Circuits, vol. 48,
no. 1, pp. 8–192013.
[55] M. Chang, M. Chen, L. Chen, S. Yang, Y. Kuo, J. Wu, H. Su, Y. Chu, W.
Wu, T. Yang, and H. Yamauchi, “A Sub-0.3 V Area-Efficient L-Shaped 7T
SRAM With Read Bitline Swing Expansion Schemes Based on Boosted
Read-Bitline, Asymmetric-VTH Read-Port, and Offset Cell VDD Biasing
Techniques,” IEEE J. Solid-State Circuits, vol. 48, no. 10, pp. 2558–2569,
2013.
[56] M. Yabuuchi, Y. Tsukamoto, M. Morimoto, M. Tanaka, and K. Nii, “20nm
High-Density Single-Port and Dual-Port SRAMs with Wordline-Voltage-
Adjustment System for Read/Write Assists,” ISSCC Dig. Tech. Papers, pp.
234-236, 2014.
[57] Y. Ishii, H. Fujiwara, S. Tanaka, T. Doguchi, O. Kuromiya, H. Chigasaki, Y.
Tsukamoto, and K. Nii, “A 28 nm Dual-Port SRAM Macro with Screening
Circuitry Against Write-Read Disturb Failure Issues,” Asian Solid-State
Circuits Conf. (A-SSCC), pp. 1-4, 2010.
[58] A. Mishra, A. Raymond, L. Amaru, G. Sarkis, C. Leroux, P. Meinerzhagen,
A. Burg, and W. Gross, “A Successive Cancellation Decoder ASIC for a
1024-bit Polar Code in 180nm CMOS,” IEEE Asian Solid-State Circuits
Conf. (A-SSCC), pp. 205-208, 2012.
130
[59] F. Hsiao, A. Tang, D. Yang, M. Pham, and M. Chang, “A 7Gb/s SC-
FDE/OFDM MMSE Equalizer for 60GHz Wireless Communications,”
IEEE Asian Solid-State Circuits Conf. (A-SSCC), pp. 293-296, 2011.
[60] M. Altaf, J. Tillak, Y. Kifle, and J. Yoo, “A 1.83µJ/Classification Nonlinear
Support-Vector-Machine-Based Patient-Specific Seizure Classification SoC,”
ISSCC Dig. Tech. Papers, pp. 100-102, 2013.
[61] J. Zhou, S. Jayapal, B. Busze, L. Huang, and J. Stuyt, “A 40 nm Inverse-
Narrow-Effect-Aware Sub-Threshold Standard Cell Library,” IEEE Trans.
on Circuits and System — I, vol. 59, pp. 2569-2577, 2012.
[62] E. Consoli, M. Alioto, G. Palumbo, and J. Rabaey, “Conditional Push-Pull
Pulsed Latches with 726fJ∙ps Energy-Delay Product in 65nm CMOS,”
ISSCC Dig. Tech. Papers, pp. 482-484, 2012.