A 100MHz to 1GHz, 0.35V to 1.5V Supply 256 x 64 SRAM Block

A 100MHz to 1GHz, 0.35V to 1.5V Supply 256 x 64 SRAM Block using Symmetrized 9T SRAM Cell with Controlled Read

Satish Anand Verkila VLSI Circuits and Systems

Lab, E.C.E, IISc - Bangalore-560012 [email protected]

Siva Kumar Bondada Nano Scale Device

Research Lab,CEDT, IISc - Bangalore-560012

[email protected]

Bharadwaj S. Amrutur VLSI Circuits and Systems

Lab, E.C.E, IISc - Bangalore-560012. [email protected]

Abstract

In this paper, we present Dynamic Voltage and

Frequency Managed 256 x 64 SRAM block in 65nm technology, for frequency ranging from 100MHz to 1GHz. The total energy is minimized for any operating frequency in the above range and leakage energy is minimized during standby mode. Since noise margin of SRAM cell deteriorates at low voltages, we propose Static Noise Margin improvement circuitry, which symmetrizes the SRAM cell by controlling the body bias of pull down NMOS transistor. We used a 9T SRAM cell that isolates Read and Hold Noise Margin and has less leakage. We have implemented an efficient technique of pushing address decoder into zigzag-super-cut-off in stand-by mode without affecting its performance in active mode of operation. The Read Bit Line (RBL) voltage drop is controlled and pre-charge of bit lines is done only when needed for reducing power wastage. 1. Introduction

Dynamic voltage and frequency scaling has recently emerged as a very effective technique to trade off power and performance of a chip dynamically [8]. The supply voltage and substrate bias are adjusted to achieve the minimum power at any operating frequency [9, 10]. Most of the work in this area though has primarily focused on logic circuits. Since SRAMs consume a significant fraction of the total chip power, in this paper we look at the problem of designing SRAMs with a large frequency and voltage range of operation.

Fig.1 9 Transistor SRAM cell obtained by eliminating the PMOS of the 10T cell [1].

The main barrier to achieving a low-voltage SRAM

is the increase in process variation with process scaling. Lower supply and higher variation in advanced processes makes the SRAM operation difficult, and some techniques must be used to expand its operating margin.

A standard 6 Transistor SRAM cell has significant problems working at low voltages due to its poor read and write noise margin. The read noise margin is typically only half of hold noise margin and it is difficult to robustly write into the cell at supplies less than 0.6V. The 6T cell also has large leakages which prevents having too many cells sharing a single bitline column. This implies a large partitioning overhead to accommodate buffers and muxes for each local bitline column.

The authors in [1, 2, 7] explore the use of cells with more than 6 transistors to reduce leakage and improve noise margins. Even though a single cell is larger in area compared to a 6 transistor cell, due to smaller

21st International Conference on VLSI Design

1063-9667/08 $25.00 © 2008 IEEEDOI 10.1109/VLSI.2008.89

564


1063-9667/08 $25.00 © 2008 IEEEDOI 10.1109/VLSI.2008.89

560


1063-9667/08 $25.00 © 2008 IEEEDOI 10.1109/VLSI.2008.89

560


1063-9667/08 $25.00 © 2008 IEEEDOI 10.1109/VLSI.2008.89

560


1063-9667/08 $25.00 © 2008 IEEEDOI 10.1109/VLSI.2008.89

560


1063-9667/08 $25.00 © 2008 IEEEDOI 10.1109/VLSI.2008.89

560


1063-9667/08 $25.00 © 2008 IEEEDOI 10.1109/VLSI.2008.89

560

partitioning overheads, larger arrays with these cells end up having smaller overall area.

The authors in [7] propose an 8 Transistor cell which is able to separate both read and write noise margins, but it too suffers from the bitline column height limitation due to large leakages.

The authors in [2] propose a 7 Transistor cell which has no read noise margin limitations. But during a read operation, a cell storage node is in dynamic state, which can lead to problems because of process variations, especially with long read times.

The authors in [1] use some of the ideas above to create a 10T SRAM cell shown in Fig. 1, mainly for sub-threshold operation. They separate the read and write ports and provide a read buffer consisting of 4 transistors (M7, M8, M9, Mp). This allows separation of read and hold noise margins. The read buffer has an extra series NMOS transistor, M9, which reduces bitline leakage drastically. They also solve the write noise margin problem by using the floating VDD technique [4].

We extend the ideas proposed in the above papers to design an SRAM with a large supply and frequency range of operation. We propose a 9T cell with threshold control for all the NMOS transistors to enable a very low supply operation. The threshold control enables symmetrizing the cell DC transfer curve and gives better noise margins at low supply voltages. The low leakage property of the cell also enables large bitline columns and hence less overall array area. Also we get good reduction in cell area compared to 10T in which we need some room in the layout to place PMOS Mp. Additionally we apply the Zigzag Super Cutoff CMOS scheme [11] for the decoder to achieve low standby power at very minimal performance loss.

In section 2 we discuss the 9T cell and reasons for going to 9T. Section 3 discusses how to symmetrize the cell. In section 4 we describe an efficient scheme of pushing the decoder into sleep mode without delay degradation. Section 5 gives results of our DVFM implementation. We conclude the paper in section 6. 2. 9T SRAM cell

The 9T cell is obtained by removing the PMOS Mp from the 10T cell (Fig.1). The main effect of this is that the node between M8 and M9 will be floating during the hold mode (RWL=0). We don’t expect this floating node to cause any problems during hold as it will be disconnected from the large bitline capacitance. Even a 10T cell will have this node to be floating when QB=1. However during a read, if

Fig. 2. Layout of a) 8T [2] and b) 9T SRAM cells

QB=0, the 9T cell connects a high impedance node to the bitline during the read (unlike the 10T cell). However if the bitline capacitance is large and it is precharged to Vdd, even under the worst case conditions, the bitline signal doesn’t degrade by more than 10mV. The 9T cell shares the same leakage reduction properties of the 10T cell due to the series stacking effect of the NMOS M7-9. Thus in a SRAM block having 128 cells per column using VDD = 0.35 (which is the minimum required supply at 100MHz operation) at 100C temperature, the ratio of Iread_cell to Itotal_leakage is 1.96 for a 9T design and is 1.06 for 8T design. Hence we can put double the number of 9T cells per column than in 8T design. This enables smaller bitline partitioning and reduced area overhead. A 9T SRAM cell has less area than 10T cell (due to missing PMOS) and is only a bit more area compared to the 8T SRAM cell (Fig. 2). Mosis scalable rules have been used to layout both the cells in the figure. The 6T SRAM cell area is 26X45 = 1170 and 8T SRAM cell area is 72X18 = 1296. So our 9T SRAM cell area is 72X19 = 1368. 8T has 10% more area compared to 6T and our 9T has 17% more area compared to 6T. 9T cell is having 16.5% less area than 10T cell reported in [1]. The area overhead is not so high, because in 8T and 9T the pull-down NMOS transistor size is not constrained by any condition unlike in 6T. We used large width for bottom two transistors in read buffer and small width for upper transistor, so that RBL capacitance is under control and circuit speed won’t deteriorate much.

3. Symmetric Cell for Noise Margin Improvement

In a conventional SRAM cell, the PMOS is usually made very weak compared to the access transistor in the cell in order to satisfy the requirements of good

565561561561561561561

(a) (b)

Fig. 3 Symmetrizing circuit using replica cell inverters with a) diode connected ON transistors; (b) OFF transistors

00.050.1

0.150.2

0.250.3

0.350.4

0.45

0 0.5 1 1.5Vdd (V)

RN

M (V

)

6T

9T with diodeconnectedtransistor9T

9T with offtransistor

Fig.4 Comparison of read noise margin across

supply range of 0.3V to 1.5V for 6T and 9T cells.

00.05

0.10.15

0.20.25

0.30.35

0.40.45

0 0.5 1 1.5Vdd (V)

RN

M (V

)

TTLLHHLT HTTHTL

Fig.5 Change in RNM with process variations

using symmetrizing circuit using diode connected transistor

0

0.1

0.2

0.3

0.4

0.5

0 0.5 1 1.5Vdd (V)

RN

M (V

)

TTLLHHLTHTTLTH

Fig.6 Change in RNM with process variations

when using symmetrizing circuit using off transistor

write margin. Simultaneously, the access transistor is made weaker than the driver transistor to create good read margins. Thus the sizes of all the transistors are coupled and the DC transfer curve of the cell inverter is quite skewed as the PMOS becomes much weaker than the driver NMOS. However in the 9T cell, the write margins are improved by using the floating Vdd

0.85

0.9

0.95

1

1.05

0 0.5 1 1.5Vdd (V)

Inve

rter

trip

po

int

norm

alis

ed to

Vd

d/2

diodeconnected

off transistor

Fig.7 Comparison of adjusted inverter trip

point using two symmetrizing circuits

0

0.1

0.2

0.3

0.4

-0.15 -0.1 -0.05 0 0.05 0.1 0.15Vth difference (V)

RN

M (V

)

Fig.8(a) Degradation of read noise margin with

variation effects for different •Vth between replica cell inverter and SRAM cell

00.05

0.10.15

0.20.25

0.30.35

0.40.45

0 0.2 0.4 0.6 0.8 1 1.2 1.4Vdd (V)

RN

M (V

)

TT

LT

HT

TL

TH

LH

HL

Fig.8(b) Degradation of read noise margin with

supply for a •Vth=25mV between replica cell inverter and SRAM cell for different corners

technique [4]. Also the read buffer decouples the read and hold noise margins. This in turn allows the potential to have a symmetric DC transfer for the cell inverter to improve the cell noise margin.

Since the PMOS will be weaker than NMOS, their drive strengths can be equalized by raising the NMOS threshold voltage. This can be done for a triple well process with all the NMOS transistors sharing the same well. We have explored two possibilities for balancing the NMOS and PMOS drive strengths. In one, the ON currents are matched in a replica cell inverter by adjusting the body bias of the NMOS via a feedback amplifier (Fig 3a). In the other, the OFF currents are balanced in a similar way (Fig.3b).In either case, the body bias of the pull down NMOS is adjusted such that node A stays at Vdd/2. This body bias is applied to all SRAM cell NMOS’s.

All our simulations are done using the PTM SPICE models [12] for the 65nm technology node. We emulate the process corners with fast (L), nominal (T) and slow (H) devices by creating transistor models with -0.1V and +0.1V threshold shift for the fast and slow cases compared to the nominal case.

566562562562562562562

Fig. 9 Pushing decoder into low leakage stand

by state

The NMOS is stronger for the process corners TT, LL and HH which are the most dominant process corners (the first letter applies to NMOS and the second to PMOS threshold). But for process corners such TL, LT, TH, HT we can adjust NMOS body bias from 0.3V to more negative voltage so as to make cell symmetric. But for cases like LH, HL, PMOS body bias has to be adjusted so as to make cell symmetric, which we don’t do here.

Fig.4 shows read noise margins for the 6T, 9T, symmetrized 9T SRAM using Diode connected transistor (Fig. 3a) and 9T SRAM cell symmetrized using off transistor (Fig. 3b). 9T SRAM cell has 100% more SNM than 6T SRAM cell. 9T SRAM cell symmetrized using diode connected transistor has at least 15% more noise margin at lower Vdds and 40% more noise margin at higher Vdds compared to 9T SRAM cell. 9T SRAM cell symmetrized using OFF transistor has 6% to 20% more Noise Margin than 9T SRAM cell. This enables lower voltage operation with the symmetrized cell than is possible with the regular 9T SRAM cell.

Fig.5 reveals that the noise margin for the symmetrized 9T cell with diode connected transistor is high enough even across process corners. Fig.6 shows that noise margin for symmetrized 9T cell with OFF transistor feedback is less immune to process variations than that with diode connected transistor. This is merely a reflection of the fact that matching OFF currents doesn’t imply a good match of the ON currents. Fig.7 gives the comparison of adjusted inverter trip point using diode connected symmetrizing circuit and off transistor symmetrizing circuit.

In reality, the replica cell inverter circuits in Fig. 3 will have a variation with respect to any cell in the array due to both systematic and random variations. We study the impact of this mismatch in figure 8 across supply voltages as well as the mismatch amount. Fig. 8(a) shows the degradation of the read noise margin as a function of the mismatch between

the replica cell inverter and the SRAM cell at 1V TT corner. Fig. 8(b) shows the change in noise margin across the Vdd for different corners for a 25mV Vth difference between the replica cell inverter and a SRAM cell.

4. Low Leakage High Speed Decoder

Decoders have very low activity since only one of 2n output lines will be activated at a time. Additionally the gates in the decoder are sequentially activated from input to the output. Both of these properties can be used in combination with zigzag super cutoff technique [11] to reduce leakage in stand by mode without any delay degradation for active mode.

As an example consider the 2-to-4 decoder shown in Fig. 9. In stand by mode, by pushing output nodes x1, x2, x3, x4 to zero, all the NANDs will have off-off stacking and hence low leakage. The output of all NANDs will be ‘1’ and output of all subsequent NOT gates will be ‘0’. Hence using PMOS sleep transistors in series for the inverters shown, even these will be in low leakage mode. This will in turn cause all the subsequent NAND gates of the next stage to get off-off stacked. Thus we can achieve low leakage without too much speed degradation. In each decode stage, only one NOT gate will be active at any time. Also the NOT gates across stages will be active at different times. Hence the PMOS sleep transistors for the NOT gates can all be merged into a single PMOS of the same size. So by using single PMOS sleep transistor and single NMOS sleep transistor (needed for buffer chain), the entire decoder can be pushed into a low leakage stand by mode. Sizing of the sleep transistors is done so as to not degrade speed by more than 2.5%(wire delay of decoder is also considered), yet achieve a leakage reduction of 80%. 5. Controlled Read and DVFM

We use a replica based control technique to monitor

delay of our SRAM block and control the Read Bit Line voltage drop [6]. For the delay mimic we have used an extra column of SRAM cells.

Bit line sensing is done using a skewed inverter and this single ended reading is slower because one single buffer in SRAM cell has to discharge entire Read Bit Line. Hence it takes more time to read than to write into cell. In our SRAM block, reading is slower than writing (uses floating VDD technique). write operation is achieved by floating the cell’s VDD prior to write [4]. In Fig.11 first cycle is write operation. As we can see that VVDD line is dropped down to 0.64V for

567563563563563563563

1.15V operation, enabling easier writing of SRAM cell. Afterwards the QB signal recovered to 1.15V.

For DVFM implementation, we tap the replica bitline signal and delay it through buffers to provide margins. The output of each buffer is captured in flip flops [5]. The DVFM control loop can use this in its feedback to adjust the supply voltage so that required delay is met with some margin. We have used a flip flop chain to monitor the delay as shown in Fig. 10 and adjust for minimum VDD for that frequency of operation. In Fig.11 2nd and 3rd cycles are read operations. In Fig.11 we see the RBL line discharge is stopped after sensed_replica_output came (as internal word line (WL) is cut-off). Thus we can save major fraction of SRAM block power.

Fig. 12 shows the results of performing DVFM adjustment by using the replica bitline shown in Fig. 10 as the delay synthesizer. We have given a margin of 7.5% in terms of buffer delay after the replica bitline. The minimum supply voltages possible over clock frequency ranging from 100MHz to 1GHz are shown in the Fig.12. One problem of using symmetrizing technique for SRAM is we require separate well for read buffer because we should not apply body bias to it. So as to overcome this problem, what we do is use same P-well for all NMOSs as done routinely, but apply body bias for symmetrizing only at lower voltages, where noise margins are low. From fig.4 we can infer that, we need to improve RNM only for supply voltages less than 0.6V. So we use the following algorithm- Step1: Get the minimum VDD required for the current frequency of operation. Step2: If VDD < 0.6V apply body bias to symmetrize the SRAM cell (of course read buffer may become slow, as body bias is applied to it also). Step3: Now take what ever VDD that is given by DVFM circuit.

In fig.12 we have also shown the increase in VDD after applying this algorithm. Supply voltage needed is maximum in HH corner (both PMOS and NMOS are weak) and minimum in LL corner (both PMOS and NMOS are strong). The maximum Vdd for 1GHz operation is 1.5V and minimum is 0.899V. The maximum Vdd for 100MHz operation is 0.751V and minimum is 0.35V. Thus having a replica bitline based delay column, local to each SRAM block, allows good tracking across process variations, as long as these are limited within a SRAM block.

While the replica bitline based delay mimic works well for tracking out correlated variations across the

Fig.10 Model for implementing Read control

and DVFM

Fig.11 One cycle of write followed by two

cycles of read

entire memory bank it will not track out cell to cell variations which are becoming significant as we scale the technology. In order to study this, we plot the delay difference between the replica path and the main bitline path in Fig. 13 as a function of the threshold voltage difference between the replica cell and the memory cell. We observe that for threshold voltage difference of 55mV and more the replica block is faster than SRAM block, thus turning off SRAM word line, leading to read failure. In the opposite direction of the mismatch with the replica cell having more Vth than the SRAM cells, power is wasted due to the delay in turning off the word line. A 200ps of delay margin for SRAM block is maintained at zero mismatch between replica and SRAM block. This extends allowable process variations.

568564564564564564564

00.20.40.60.8

11.21.41.6

50 250 450 650 850 1050Frequency (MHz)

Vdd

(V)

TT

HH

HL

LH

LL

TT withbodybiasLL withbodybias

Fig.12 Minimum supply voltage for different frequencies at different process corners.

Fig.13 Delay and power changes with process

variations

In stand by mode the decoder is completely cut off as described earlier and SRAM block leakage is reduced by reducing its supply to 0.3V where its hold noise margin is 0.117V. Overall, the 256X64 block in active mode for write operation consumes 2.624mW and the stand-by mode power is 6uW.

6. Conclusions

We have explored the design of a low-power high speed Dynamic Voltage and Frequency Managed 256 x 64 SRAM block in this paper. We have presented a 9T SRAM cell with improved the static noise margin and low leakage. The reduced leakage enables to have a larger number of cells in a bitline column, thus reducing overall memory area. The noise margins at very low supplies are further improved by symmetrizing the cell inverter’s DC characteristics, via adjustment of the NMOS thresholds. A replica cell inverter is used in the feedback for this adjustment. This enables operation even into sub threshold region with a supply of about 0.3V. We reduce the stand-by power of the decoder by using the zigzag cut-off technique, with very little degradation of the decoder

speed. Replica bitline is used to provide both the delay mimic for DVFM control, as well as used to reduce the bitline power. Low voltage writes are achieved by adopting the floating-Vdd technique. 7. References

[1] B.Calhoun and A.Chandrakasan,”A 256-kb 65-nm Sub-threshold SRAM Design for Ultra- Low-Voltage Operation”. IEEE JSSC , VOL. 42, NO. 3, MAR. 2007 [2] K.Takeda et. al., “A Read-Static-Noise-Margin-Free SRAM Cell for Low- VDD and High-Speed Applications”, IEEE JSSC , VOL. 41, NO. 1, JAN. 2006 [3] M.Nomura et. al “Delay and Power Monitoring Schemes for Minimizing Power Consumption by Means of Supply and Threshold Voltage Control in Active and Standby Modes” IEEE JSSC, VOL. 41, NO. 4, APRIL 2006 [4] M.Yamaoka, et. al. “90-nm Process-Variation Adaptive Embedded SRAM Modules With Power-Line-Floating Write Technique”, IEEE JSSC, VOL. 41, NO. 3, MAR. 2006 [5] M.Nakai et. al., “Dynamic voltage and frequency management for a low-power embedded microprocessor,” IEEE JSSC, vol. 40, no. 1, pp. 28–35, JAN. 2005. [6] Bharadwaj S.Amrutur, and Mark A. Horowitz, “A Replica Technique for Wordline and Sense Control in Low-Power SRAM’s” IEEE JSSC, VOL. 33, NO. 8, AUG. 1998 [7] L. Chang et al., “Stable SRAM cell design for the 32 nm node and beyond,” in Symp. VLSI Tech. Dig., JUNE. 2005, pp. 128–129. [8] M. Nakai et. al., “Dynamic voltage and frequency management for a low-power embedded microprocessor,” IEEE JSSC, vol. 40, no. 1, JAN. 2005, pp. 28–35. [9] K. Nose and T. Sakurai, “Optimization of VDD and VTH for lowpower and high-speed applications,” Proc. ASP-DAC, Jan. 2000, pp. 469–474. [10] Nomura et.al., “Delay and Power Monitoring Schemes for Minimizing Power Consumption by Supply and Threshold Voltage Control in Active and Standby Modes”, IEEE JSSC , April 2006. [11] Min et. al., “ZigZag Super Cut-off CMOS (ZSSCMOS) Block Activation with self adaptive voltage level controller: An alternative to clock gating scheme in leakage dominant era”, ISSCC 2003. [12] PTM Spice models

569565565565565565565

Documents

A 100MHz to 1GHz, 0.35V to 1.5V Supply 256 x 64 SRAM Block