ARRAY ARCHITECTURE FOR A NONVOLATILE
3-DIMENSIONAL CROSS-POINT MEMORY
A DISSERTATION
SUBMITTED TO THE DEPARTMENT OF
ELECTRICAL ENGINEERING
AND THE COMMITTEE ON GRADUATE STUDIES OF
STANFORD UNIVERSITY
IN PARTIAL FULFILLMENT OF THE REQUIREMENTS
FOR THE DEGREE OF
DOCTOR OF PHILOSOPHY
Elaine Ou
March 2010
This dissertation is online at: http://purl.stanford.edu/dg048ny4241
© 2010 by Elaine Ou. All Rights Reserved.
Re-distributed by Stanford University under license with the author.
I certify that I have read this dissertation and that, in my opinion, it is fully adequate in scope and quality as a dissertation for the degree of Doctor of Philosophy.
S Wong, Primary Adviser
I certify that I have read this dissertation and that, in my opinion, it is fully adequate in scope and quality as a dissertation for the degree of Doctor of Philosophy.
Thomas Lee
I certify that I have read this dissertation and that, in my opinion, it is fully adequate in scope and quality as a dissertation for the degree of Doctor of Philosophy.
Yoshio Nishi
Approved for the Stanford University Committee on Graduate Studies.
Patricia J. Gumport, Vice Provost Graduate Education
This signature page was generated electronically upon submission of this dissertation in electronic format. An original signed hard copy of the signature page is on file in University Archives.
Abstract
THIS WORK explores the design and capabilities of a three-dimensional cross-point
array structure suitable for use with resistance-change non-volatile memory. The
resistance-change cell serves as both the access element and the memory element, eliminat-
ing the need for individual selection transistors or diodes. This enables the memory to be
fabricated in arrays with a line spacing of F, the minimum feature size for a given process
technology. By stacking the cross-point arrays in n layers, we achieve an effective cell size
of 4F²/n.
Previous works describing transistor-free memory arrays have been limited by exces-
sive leakage current across unselected bitlines and wordlines during memory access. This
work presents novel architecture and circuit techniques that minimize leakage current ef-
fects while maintaining a high effective bit density. A test chip fabricated in 0.18 µm
CMOS technology allows us to verify the architecture and circuit functionality.
The performance of an 8 Gb memory chip built in 65 nm technology has been simulated.
A random access time of 104 ns is achieved with a power dissipation of 61.2 µW. This
makes the 3-D cross-point memory competitive with NOR flash in terms of read time, and
competitive with NAND flash in terms of area efficiency.
Acknowledgments
THE WORK presented in this dissertation is the result of a long and difficult journey
made possible only with the support and guidance of many people.
First and foremost, I must thank my advisor, Professor Simon Wong. He oversaw my
research efforts and taught me to guide my own work. His insight and experience have been
invaluable, and his generosity with his time and mentorship was second to none. It has
been an honor for me to work with Professor Wong.
I would also like to thank my associate advisor, Professor Yoshio Nishi. He has been
a valuable source of guidance and input during my time at Stanford. Despite his busy
schedule, he always took the time to check on my progress. Next, I would like to thank
Professor Thomas Lee for serving on my reading committee. As the cofounder of Matrix
Semiconductor, he was able to provide insightful industry perspectives to my research. I
am also most indebted to Professor Ada Poon for serving as my committee chair on very
little notice.
I would next like to thank June Wang and Natasha Newson, who always made the
administrative details run smoothly.
This research was made possible by funding from the American Society for Engineer-
ing Education and the Nonvolatile Memory Technology Research Initiative (NMTRI). The
NMTRI group provided the support and opportunity for me to interact and discuss my work
with many key industry members. Furthermore, I would like to thank the 3D Technology
Group at SanDisk for allowing me to spend a summer working alongside nonvolatile mem-
ory experts including Roy Scheuerlein and Luca Fasoli.
My experience at Stanford University would not have been the same without the mem-
bers of Professor Wong’s research group, who were not just great colleagues but also be-
came my close friends. I would like to thank Dr. Haitao Gan, Dr. Yun Bai, Dr. Paul
Park, Dr. Henry Nho, Dr. Andrew Poon, Dr. Aaron Gibby, Dr. Wei Wang, Jeongha Park,
Kasra Omid-Zahoor, Sung Il Park, Wanki Kim, Young Yang-Liauw, Zhiping Zhang, and
Saihua Lin for their collaboration, assistance, and input. I would especially like to express
my gratitude to Stanley Yeh, who provided me with extensive discussions regarding my
research, and who will be carrying out a continuation of this work.
Additionally, I am grateful to Jenny Hu and Warren Mar, who suffered with me through
the trials and tribulations of life as a graduate student, and who eventually became my best
friends at Stanford University.
Finally, I would like to dedicate this work to my family: To my mother, for her un-
conditional encouragement and support; To my father, for challenging me to reach the best
of my potential; To my brother, for being my lifelong friend and competitor; and to the
memory of my grandmother, who always believed in me.
Contents
Abstract
Acknowledgments
List of Tables
List of Figures
1 Introduction
1.1 Organization
2 Motivation and Prior Work
2.1 Flash Memory Overview
2.2 Emerging Memories
2.3 3D Integration Technologies
2.4 Motivation
3 Cross-point Memory Design Methodology
3.1 Cross-point Memory Array Goals
3.2 Memory Write Operation
3.3 Memory Read Operation
3.4 Memory Cell Characterization
3.5 Design Process
4 Circuit Design Tradeoffs
4.1 A Resistive Network
4.2 MOSFET Current Amplifier
4.3 Bipolar Transistor Current Amplifier
4.4 Lateral Bipolar Junction Transistors in Standard CMOS Process
4.5 Sense Amplifier Area Considerations
5 Three-Dimensional Memory Architecture
5.1 Memory Array Architecture
5.2 Circuit Techniques for Vertical Architecture
5.3 Layout Techniques and Array Efficiency
5.4 Stacking Memory Layers
5.5 3D Memory Area Comparisons
6 Performance Analysis
6.1 8 Gb Memory Architecture
6.2 Memory and Array Models
6.3 Waveform Analysis and Timing Diagrams
6.4 Performance Comparisons with Commercial Products
7 Test Chip Design
7.1 Memory Array Implementation
7.2 Device Measurement Results
7.3 Sense Amplifier Verification
7.4 Voltage Scaling Results
8 Conclusion
8.1 Summary
8.2 Other Considerations
8.3 Recommendations for future work
References
List of Tables
3.1 Characteristics of an HfO2-based RRAM device
4.1 Sense amplifier designs and impact on layout area when considering a 4 kb row
5.1 NMOS saturation current for different process technologies
5.2 Metal layers and functions in multi-layer cross-point arrays
5.3 NAND flash memory area efficiency comparisons
6.1 HfO2-based memory cell model parameters
6.2 Nonvolatile memory performance comparisons
6.3 Read path timing signal comparisons
List of Figures
2.1 Flash memory cell programming
2.2 Comparison of NAND and NOR flash architectures
2.3 NAND GB shipments and price/GB
2.4 Flash process technology scaling
2.5 SEM photographs of 3D stacked NAND cell string
2.6 SEM photograph of SanDisk 8-layer cross-point memory
3.1 RESET and SET operation on a cross-point array
3.2 Read operation on a cross-point array
3.3 Op-amp current-to-voltage converter
3.4 Exploded view of an LM107 operational amplifier
3.5 Single gain stage current-sensing amplifier
3.6 HfO2-based RRAM device
3.7 HfO2 bipolar switching characteristics
3.8 The memory array represented as a resistive divider network
4.1 Sense amplifier design tradeoffs
4.2 Memory array design variables
4.3 Normalized current between the bitline and MOSFET sense amplifier input
4.4 Current-sensing amplifier replacing NMOS transistors with NPN bipolar transistors
4.5 Normalized current between the bitline and BJT sense amplifier input
4.6 Cross-sectional diagram of a vertical NPN transistor
4.7 Cross-sectional diagram of a lateral NPN BJT
4.8 Parasitic bipolar devices
4.9 Lateral NPN device based on standard MOSFET layout
4.10 Lateral NPN device with octagonal emitter shape
4.11 Layout area of the small MOSFET sense amplifiers
5.1 Hybrid NAND/NOR architecture of cross-point memory
5.2 Architecture of a half page
5.3 Layout of bitline-select transistors
5.4 Layout of wordline-select transistors
5.5 Layout of bitline-select transistors under a cross-point memory array
5.6 Cross-sectional view of bitline-selection transistors under a cross-point memory array
5.7 Layout of wordline drivers for a cross-point memory array
5.8 Schematic of bitline-select transistors under a cross-point memory array
5.9 Schematic of wordline drivers under a cross-point memory array
5.10 Cross-sectional view of bitline-selection transistors between adjacent memory arrays
5.11 Two-layer cross-point memory
5.12 Connecting two layers of bitlines to the substrate
5.13 Four-layer memory
5.14 Four-layer memory with buffer connections
5.15 Schematic of wordline drivers under a four-layer cross-point memory array
5.16 Array efficiency vs. number of vertical memory layers
6.1 Four-layer cross-point memory implemented in an 8 Gb architecture
6.2 32 x 32 cross-point memory array
6.3 Multi-layer memory array represented as an RC network
6.4 Simulation waveforms for memory read access
6.5 Read operation timing diagram
7.1 Test chip architecture
7.2 Effective resistance of memory cell vs. gate voltage
7.3 I-V characteristics of the standard MOSFET-style BJT
7.4 I-V characteristics of the octagonal-style BJT
7.5 Array configuration during a test read operation
7.6 RON and ROFF values in a 32-wordline array with MOSFET design (1)
7.7 RON and ROFF values in a 64-wordline array with MOSFET design (1)
7.8 RON and ROFF values in a 32-wordline array with MOSFET design (2)
7.9 RON and ROFF values in a 64-wordline array with MOSFET design (2)
7.10 Inconsistencies in the lateral NPN transistor characteristics
7.11 RON and ROFF values in a 32-wordline array with BJT design (MOSFET-style)
7.12 RON and ROFF values in a 32-wordline array with BJT design (octagonal-style)
7.13 Voltage scaling results for a 32-wordline array with MOSFET design (1)
7.14 Voltage scaling results for a 32-wordline array with MOSFET design (2)
7.15 Voltage scaling results for a 32-wordline array with BJT design (MOSFET-style)
7.16 Voltage scaling results for a 32-wordline array with BJT design (octagonal-style)
Chapter 1
Introduction
CMOS TECHNOLOGY scaling has enabled the integration of increasingly dense and
numerous functionalities into a single chip at a rate described by Moore’s law [1].
Historically, the number of transistors per chip has doubled every two to three years.
Process technology scaling increases bit density, but as features get smaller, reliability and
accuracy suffer. Memory suppliers are seeking an alternative to traditional NAND flash
memory that allows scaling to continue without sacrificing reliability or increasing cost.
With the development of resistance-change materials, it is possible to substitute resistance-
change memory devices where transistors were once used. In this work we demonstrate the
capability of a cross-point memory array structure that utilizes a resistance-change cell to
serve as both the access element and the memory element. The cell can be manufactured
entirely with existing silicon-based complementary metal-oxide-semiconductor (CMOS)
fabrication tools and materials. With no access transistor, a single memory cell has a size
of 4F², where F is the minimum feature size available for a given process technology. We
have modeled a memory architecture that supports multiple layers of cross-point arrays,
resulting in an effective cell size of 4F²/n, where n is the number of memory layers.
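As a small worked illustration of the 4F²/n relation, the following Python sketch computes the effective area per cell for a given feature size and layer count. The function name and the choice of 65 nm as the example feature size are ours, used only for illustration; the sketch ignores peripheral circuitry overhead.

```python
def effective_cell_area_nm2(feature_size_nm: float, layers: int = 1) -> float:
    """Effective area per memory cell in nm^2, using the 4F^2/n relation.

    feature_size_nm is F, the minimum feature size; layers is n, the
    number of stacked cross-point layers. Peripheral overhead is ignored.
    """
    if feature_size_nm <= 0 or layers < 1:
        raise ValueError("feature size must be positive and layers >= 1")
    return 4.0 * feature_size_nm ** 2 / layers


if __name__ == "__main__":
    # Illustrative sweep: F = 65 nm with 1, 2, 4, and 8 memory layers.
    for n in (1, 2, 4, 8):
        area = effective_cell_area_nm2(65.0, n)
        print(f"n = {n}: effective cell area = {area:.1f} nm^2")
```

Doubling the number of layers halves the effective area per cell, which is why stacking is attractive even when F itself cannot be reduced.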
1.1 Organization
The organization of this thesis is as follows: Chapter 2 explores existing resistance-change
memory technologies and other alternatives used to improve the scalability of nonvolatile
memory. An overview of 3-dimensional integration technologies is also presented.
Chapter 3 describes the program and read strategies employed in a 2-dimensional cross-
point memory array without using access transistors. This chapter also identifies the charac-
teristics of the resistance-change material used for the memory elements. Then, the design
procedure for the memory architecture is described.
In Chapter 4, we further explore the read operation at a circuit level, and present the
novel design and layout techniques that minimize leakage current effects while maintaining
a high effective bit density.
Chapter 5 builds upon the 2-dimensional cross-point memory array and shows how it
can be built into a 3-dimensional architecture. Here, we also give consideration to the
peripheral circuitry overhead and requirements for a vertical design. The design for an 8 Gb,
four-layer memory architecture is described. Chapter 6 presents a performance
analysis of the read operation on the 8 Gb 3-dimensional memory and compares the results
to those of other nonvolatile memory products fabricated in similar process technologies.
To demonstrate the feasibility of the cross-point memory architecture, we designed
and fabricated a prototype that emulates a resistance-change cross-point memory from a
functional standpoint. Chapter 7 describes the test chip implementation and design choices
as well as the measurement results.
Finally, we conclude with a summary of the research and discuss possible directions
for future work.
Chapter 2
Motivation and Prior Work
FLASH MEMORY suppliers aim to continue increasing the bit density (bits per mm²)
of memory chips, but have to do so without sacrificing reliability and cost. This
chapter explores flash memory technologies such as multi-level cells that have effectively
increased the bit density in past years, and explains why the future of flash memory may
be limited by technology scaling despite recent advancements. We also look at trends in
emerging nonvolatile memories that could serve as potential replacements for NAND and
NOR flash. Finally, we present prior methods of 3D memory integration and give a brief
overview of current 3D integration technologies.
2.1 Flash Memory Overview
Flash memory cells consist of a single floating-gate transistor and retain a ‘1’ or a ‘0’ bit by
storing a certain amount of charge on the floating gate, as shown in Figure 2.1. The amount
of charge stored on the floating gate determines the threshold voltage of the device [2]. The
floating gate is insulated from the substrate by a layer of tunnel oxide, and from the control
gate by the inter-poly dielectric material.
Figure 2.1: Flash memory cell programming.
There are two primary architectures currently used for flash products. NAND flash is a
high-density memory architecture with bitlines consisting of 16 or 32 cells in series (hence
the name “NAND”). It has a denser layout than NOR flash and is used for most removable
solid-state memory applications today. NOR flash has a separate metal contact for each cell
and is arranged with transistors in parallel (similar to a “NOR” gate). A NOR flash cell has
a size of 10F², while a NAND flash cell has a size of 4F². Because a NOR flash
cell can be accessed without charging all the memory cells on the bitline in series, it has
a much faster read time than NAND flash. Hence, it is more suitable for execute-in-place
memory applications, in which stored instructions are executed directly without first
buffering the data in random-access memory. Figure 2.2 compares the key
differences between NOR and NAND flash memory architectures.
NAND and NOR flash also have distinct programming mechanisms. In NOR flash,
charge is stored by applying a high bias voltage to the control gate and the drain to induce
hot-electron injection from the transistor channel to the floating gate. In NAND flash, a bias
voltage is applied only to the control gate, causing a Fowler-Nordheim electron tunneling
current through the tunnel oxide from the source-drain channel to the floating gate. Both
NAND and NOR flash use the Fowler-Nordheim tunneling effect to remove charge from
the floating gate during the erase operation [3].
Figure 2.2: Comparison of NAND and NOR flash architectures [2].
2.1.1 Flash Scalability
Because flash memory has such a uniform layout, it is well-suited for leading-edge processing.
The flash memory market has a history of fast growth and rapid evolution. As
can be seen in Figure 2.3, the price of NAND flash has fallen continuously in recent years.
Manufacturing costs have not decreased at the same pace [4], so profits are low for
NAND flash suppliers.
Figure 2.3: NAND GB shipments and price per GB [4].
Generally speaking, the best way to reduce manufacturing costs for memory is to reduce
the total die area while maintaining the same capacity. As flash memory technology
has been scaled to finer lithography processes, nonvolatile memory has become
cheap enough to serve a wider range of capacity-demanding applications. Mainstream
NAND flash memories are currently manufactured on 4x-nm processes, with major
NAND flash vendors migrating to 3x-nm this year. In the race to reduce costs, NAND
flash manufacturers are in the process of developing 2x-nm technology. However, with
performance and reliability characteristics severely degraded relative to the 4x-nm generation,
2x-nm floating-gate NAND flash is anticipated to be the last process technology
generation [4].
Multi-level cells have been used in flash memory to increase the number of bits stored
per cell [5]. This increases the memory capacity on a die without scaling the technology
process. Up to eight threshold-voltage levels have been implemented in NAND flash mem-
ory cells, encoding three bits per cell [6]. While this significantly increases the bit-density
of flash memory, it is accompanied by a higher bit-error ratio. Software complexity in the
form of error-correction algorithms is commonly increased to compensate for the bit-error
ratio. While multi-level flash may provide an increase in memory capacity at a certain
technology process, it is even less scalable for future technology nodes.
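The relationship between threshold-voltage levels and stored bits, and the reason more levels raise the bit-error ratio, can be sketched in a few lines of Python. The function names are ours, and the evenly spaced levels and the 3.0 V threshold window are hypothetical values chosen only to illustrate the trend; they are not taken from any flash product.

```python
def bits_per_cell(levels: int) -> int:
    """Number of bits encoded by a cell with the given number of levels.

    The level count must be a power of two (2, 4, 8, ...).
    """
    if levels < 2 or levels & (levels - 1) != 0:
        raise ValueError("levels must be a power of two and at least 2")
    return levels.bit_length() - 1


def level_spacing(window_v: float, levels: int) -> float:
    """Spacing between adjacent level centers, assuming evenly spaced levels
    across a fixed threshold-voltage window (an idealization)."""
    return window_v / (levels - 1)


if __name__ == "__main__":
    WINDOW_V = 3.0  # hypothetical threshold-voltage window, for illustration
    for levels in (2, 4, 8):
        print(
            f"{levels} levels: {bits_per_cell(levels)} bit(s) per cell, "
            f"spacing = {level_spacing(WINDOW_V, levels):.3f} V"
        )
```

Each added bit doubles the number of levels that must fit in the same window, so the margin between adjacent levels shrinks rapidly while the stored charge per level shrinks with it.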
As shown in Figure 2.4, in sub-50-nanometer process technologies, only tens of electrons
are stored and detected on the floating gate of a single NAND flash memory cell.
This raises several reliability concerns. Flash memory cells commonly suffer from
charge leakage due to oxide tunneling, and these effects are expected to increase as the
oxide thickness decreases with technology scaling. Also, flash charge retention time decreases
with wear, because defects can be introduced in the tunnel oxide by repeated write
and erase cycles. Currently, the typical lifetime of a NAND flash cell is 10⁶ program/erase
cycles, while it is only 10⁵ cycles for a NOR flash cell. NOR flash uses higher programming
voltages for channel hot-electron injection, which drives a larger number of electrons
onto the floating gate than Fowler-Nordheim tunneling but reduces the endurance of the
memory devices [7]. Thus, flash reliability and accuracy suffer as process technologies scale
below 50 nm, and multi-level cells become difficult to implement. A more scalable
memory technology is crucial for advanced lithography processes.
Figure 2.4: The number of electrons present on the floating gate scales
with process technology [8].
2.2 Emerging Memories
Memory producers are also trying to develop alternative technologies that may be scalable
beyond 20 nm lithography. There is a wide range of emerging memory technologies under
development to someday replace flash memory, and the most notable candidates include
magnetoresistive random access memory (MRAM) [9], phase-change memory [10] [11],
and resistance-change memory [12].
MRAM utilizes two ferromagnetic plates separated by a thin insulating layer for a mem-
ory cell. The lower plate is set to a fixed polarity, while the polarity of the upper plate can
be switched during a write operation. The electrical resistance of the memory cell changes
depending on whether or not the polarity is aligned between the two plates. MRAM was
originally thought to be able to replace SRAM, DRAM, and flash memories with a poten-
tially transistor-free architecture, but its lack of reliability in high-density arrays due to the
magnetic fields produced by the switching mechanism limits its potential applications [13].
Both phase-change and resistance-change memory cells are designed for detectable
changes in conductivity. Phase-change memory is currently designed around a random-access
architecture, similar to that of NOR flash [14]. It operates by using a resistive
heater to melt a chalcogenide memory cell and resolidify it in either a crystalline or an
amorphous state. It is unlikely that phase-change memory could replace NAND flash as a
high-density nonvolatile memory because of the high programming current density (in excess of
10⁷ A/cm²) required to melt the memory cell.
Traditionally, phase-change memory has received more attention as a future replace-
ment for flash because of its reliable switching characteristics, but as technology improves,
resistance-change metal oxides have also become increasingly popular candidates for non-
volatile memory. Resistance-change metal-oxide materials have been shown to possess
favorable characteristics that make them particularly suitable for a 3D memory architec-
ture. They are highly compatible with modern CMOS processes and have demonstrated
high-speed and low-power switching abilities, with decade-long retention times and better
endurance than flash memory. Metal-oxide-based resistance-change technologies gener-
ally exhibit lower programming current requirements than phase-change memory [15] and
thus should have better scalability. The mechanism behind resistance-change is not yet
completely understood, but it is believed that oxygen motion and filament formation are
responsible for resistive changes in oxide-based resistance-change materials [46] [17].
2.3 3D Integration Technologies
Because chip area largely dictates manufacturing cost, building memory elements verti-
cally above a silicon surface is a highly effective way to reduce cost-per-bit. Current 3D
integration methods range from package-level stacking, in which dies are vertically con-
nected using bond wires or solder bumps, to monolithic integration, in which silicon is
recrystallized between each transistor layer to allow the fabrication of multiple layers of
active devices on a single wafer. From a circuit-design standpoint, the most significant
differentiator between the technologies is the density and parasitics of the inter-layer vias.
Fabrication costs can vary because, while the monolithic 3D design requires less silicon, it
involves a more complicated fabrication process as well as a possible degradation in yield.
Chip-stacking is the most primitive 3D-IC fabrication method and connects pre-fabricated
chips via bond wires or solder bumps. The long bond wires required to connect the lay-
ers of chips can add significant resistance and capacitance. Wafer-stacking avoids some
of the performance limitations of chip-stacking by directly bonding fabricated wafers after
they have been thinned down by chemical-mechanical polishing. Each layer is then only a
few tens of microns thick, but this still adds significant parasitics relative to a single-layer
design. Also, processing complexity is increased because of the challenges involved with
aligning multiple layers.
Monolithic 3D integration connects multiple layers using inter-layer vias similar to the
typical inter-metal vias used in a standard CMOS fabrication process. Because each layer is
less than one micron in thickness [18], performance is similar to that of a circuit fabricated
on a 2D substrate. In this manner, Samsung has developed a 3-dimensionally stacked
NAND flash memory by implementing a single-crystal Si layer stacking technology, where
epitaxial silicon is deposited in multiple layers above the substrate [19]. While this method
does improve the bit-density on a die, it is still subject to the limitations of flash scaling.
Figure 2.5: Vertical SEM photographs of Samsung 3D stacked NAND
cell string [19]
One-time programmable multi-layer cross-point memories have been developed by
both Samsung [20] and Matrix Semiconductor (currently part of SanDisk) [21]. Diodes
are used as access devices for each memory element to provide the necessary current den-
sity and prevent leakage across unselected cells, but this causes an increase in the bias
voltage required for access. Memory cell access diodes will also suffer from the effects
of discrete dopants in the p-n junction when process technologies scale below 40 nm [22].
The one-time programmable aspect of the memory limits its potential applications as well.
2.4 Motivation
For true scalability beyond 20 nm technology nodes, it is necessary to design a cross-point
memory array that does not require diodes as access elements. The cross-point
memory architecture described in this work is designed so that it can easily be fabricated
in multiple layers to form a stacked 3-dimensional memory. The memory array does not
use active devices for cell selection, so the stacking of multiple memory layers is relatively
straightforward, unlike the 3D stacked NAND flash.
Figure 2.6: Vertical SEM photograph of SanDisk 8-layer cross-point
memory [21]
Rewriteable resistance-change materials are used as memory elements. By keeping the
memory cells free of access devices such as diodes or transistors, the memory architecture
becomes far more scalable than traditional flash memory.
Chapter 3
Cross-point Memory Design Methodology
THE ARCHITECTURE of each layer of the cross-point memory array resembles that
of NAND flash [23]. Because the cross-point memory design is optimized for area,
the access times are not expected to be competitive for execute-in-place memory applications.
This chapter begins with an overview of the goals for a cross-point memory array,
then discusses the strategies employed for reading and writing the memory. We present the
limitations that we face when removing the access transistors, and what target memory cell
characteristics are desirable for a cross-point memory architecture. Finally, this chapter
summarizes the design process and how it can be adapted to suit other resistance-change
memory devices and technology processes.
3.1 Cross-point Memory Array Goals
Using a transistor-free cross-point memory architecture allows us to fabricate wordlines
and bitlines in the minimum metal pitch allowed by the technology process, as we are no
longer limited by individual selection transistors. However, layout design becomes more
difficult because the peripheral circuitry must be able to access the bitlines and wordlines
in such a tight pitch.
The purpose of cross-point memory is to attain minimal area while maintaining scal-
ability. As with NAND flash memory, read operations are performed in parallel; that is,
all of the bits on a single wordline can be read at once. To read many bits in parallel, the
complexity and power of the sense amplifiers must be managed.
3.1.1 Leakage Current Considerations
When implementing an access-transistor-free cross-point memory array, the greatest con-
cern is the leakage current that arises as a result of unwanted memory cells being biased
during a read or write operation. Because there are no transistors or diodes at the memory-
cell cross-points, charge is free to flow throughout the memory array. For the read oper-
ation, this adds noise to the signal being sensed. For the write operation, this can create
disturb conditions on unselected cells, or incomplete programming of selected cells.
The worst-case leakage condition occurs when the majority of the memory array is in
the low-resistance state and a selected cell is in the high-resistance state. Because there
are no access transistors, the memory cells must also serve as rectifying devices, and are
less effective at this when at a low resistance. The peripheral access circuitry must be de-
signed to overcome this leakage current problem, and this will be discussed in the following
sections.
3.2 Memory Write Operation
Programming resistance-change memory cells is accomplished by applying either a SET
voltage (VSET) or a RESET voltage (VRESET). "SET" is defined as the transition of cells
from a high-resistance state to a low-resistance state, while "RESET" brings the cells
back to a high-resistance state from a low-resistance state. It has been shown that some
resistance-change cells are more reliable and demonstrate a faster switching speed when
operating in the bipolar mode [15]. In this mode, VRESET will be a negative bias while
VSET is a positive bias. The key to overcoming the leakage current problems during the
write operations is to ensure that unselected cells are biased at intermediate voltages that
are not high enough to accidentally program them.
Figure 3.1: A RESET (left) and a SET (right) operation performed on a
cross-point array.
Figure 3.1 shows a representation of the RESET and SET operations. In general, all
the cells in a memory array are first RESET to the high-resistance state, and then only se-
lected cells are SET to the low-resistance state. The entire memory array is RESET bitline-
by-bitline before the SET voltage is applied to those cells that need to be programmed.
This enables us to predict the current requirements for the SET operation to avoid over-
programming (bringing memory cells to too low a resistance). Over-programming is not
a concern during RESET because the higher resistance will automatically limit the current
flow.
In the RESET operation, VRESET is applied sequentially to one bitline at a time. All
wordlines in the array are pulled to ground (0 V), and every cell on the selected bitline is
RESET. The unselected bitlines are terminated with high impedances, or are left floating,
so that the only current path is from the selected bitline to the wordlines. It is expected that
the unselected bitlines will drift to some voltage slightly higher than 0 V. Leakage current
is not a significant concern in the RESET operation because every cell in the array will
eventually be brought to the high-resistance state.
In the SET operation, 0 V is applied sequentially to one bitline at a time, while VSET is
applied only to selected wordlines. The unselected wordlines are terminated with a high
impedance and are expected to drift to some voltage between ground and VSET/2. Unselected
bitlines are biased at VSET/2. The unselected cells see a bias of at most VSET/2. While this
allows some leakage current in the array, it is necessary to prevent unselected bitlines from
drifting to too low a voltage and causing an inadvertent SET.
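The SET biasing scheme can be checked with a few lines of arithmetic. The sketch below is illustrative only: the function name and the cell-class labels are our own, VSET = 1 V is taken from Table 3.1, and the floating-wordline level is treated as a free parameter because the text only bounds it between ground and VSET/2.

```python
def set_cell_biases(v_set, v_float):
    """Voltage (wordline minus bitline) across each class of cell during SET.

    v_float is the level a floating, unselected wordline settles to.
    It is an assumed input, bounded between 0 V and v_set / 2.
    """
    if not 0.0 <= v_float <= v_set / 2:
        raise ValueError("floating wordline expected between 0 V and VSET/2")
    v_selected_bl = 0.0          # selected bitline is grounded
    v_unselected_bl = v_set / 2  # unselected bitlines are held at VSET/2
    return {
        "selected WL / selected BL": v_set - v_selected_bl,
        "selected WL / unselected BL": v_set - v_unselected_bl,
        "floating WL / selected BL": v_float - v_selected_bl,
        "floating WL / unselected BL": v_float - v_unselected_bl,
    }


if __name__ == "__main__":
    for name, bias in set_cell_biases(v_set=1.0, v_float=0.25).items():
        print(f"{name}: {bias:+.2f} V")
```

Only the cell at the selected wordline and selected bitline sees the full VSET. Every other cell sees a magnitude of at most VSET/2 for any floating-wordline level in the stated range.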
3.3 Memory Read Operation
A memory array is accessed wordline-by-wordline. During a read operation, the selected
wordline is raised to VREAD and a read current is driven in parallel through the bitlines. The
unselected wordlines are terminated with high-impedances while each bitline is connected
to an individual sense amplifier at one end. In this manner, the sense amplifiers should
provide the only current path to ground.
Figure 3.2: A read operation performed on a cross-point array.
Figure 3.2 depicts the current leakage paths through the unselected wordlines. As men-
tioned in Section 3.1, the leakage current is worst when the majority of the memory array is
in the low-resistance state, as this allows more leakage current to traverse between bitlines
and cause read errors.
3.3.1 Sense amplifier design
Our strategy for overcoming the leakage current problem is to implement current-sensing
amplifiers that can maintain the bitlines at as near a constant voltage as possible. If the
voltage difference between bitlines is minimized, then so is the leakage current. Current-
sensing amplifiers provide significant reductions in bit-line voltage swing and sensing de-
lays [24].
Figure 3.3: Op-amp current-to-voltage converter.
This can potentially be done using an operational amplifier implemented as a current-to-voltage
converter. Figure 3.3 shows the basic implementation of an op-amp current-to-voltage
converter. The inverting input is maintained at virtual ground and the entire
input current is directed across resistor R. The benefit of this design is that every bitline
can be maintained at the same virtual ground, and the current differential between a high-
resistance and a low-resistance state is seen as the voltage drop across the resistor. With
this ideal sense amplifier design, the memory array will see no leakage current whatsoever.
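As a rough illustration of the ideal converter, the output voltage is simply the input current times the feedback resistor, with a sign inversion. In the sketch below the 1 V read voltage and the 50 kΩ feedback resistor are assumed values chosen for readability; RON and ROFF are the Table 3.1 values.

```python
def transimpedance_output(i_in, r_feedback):
    """Ideal inverting current-to-voltage converter: Vout = -Iin * R."""
    return -i_in * r_feedback


V_READ = 1.0        # assumed read voltage (not specified in the text)
R_FEEDBACK = 50e3   # assumed feedback resistor
R_ON, R_OFF = 100e3, 1e6  # cell resistances from Table 3.1

# With the bitline held at virtual ground, the cell current is VREAD / Rcell.
v_out_on = transimpedance_output(V_READ / R_ON, R_FEEDBACK)
v_out_off = transimpedance_output(V_READ / R_OFF, R_FEEDBACK)

if __name__ == "__main__":
    print(f"ON cell:  {v_out_on:.3f} V")
    print(f"OFF cell: {v_out_off:.3f} V")
```

Because every bitline sits at the same virtual ground in this idealization, the tenfold resistance ratio appears directly as a tenfold ratio in output voltage.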
Figure 3.4 shows the transistor-level view of an operational amplifier that might be used
for current sensing. Unfortunately, in practice, the complexity level of an effective op-amp
circuit renders it impossible for use in accessing bitlines laid out in the minimum metal
pitch.
A much simpler alternative is needed if the sense amplifier is to fit in a single-bitline
pitch. Flash memory typically uses a differential current-sensing scheme [25]. Even this
Figure 3.4: Exploded transistor-level view of an LM107 operational
amplifier [26].
structure would be too complex to accommodate a cross-point array. To address the chal-
lenge of an area-efficient sensing design, a diode-connected NMOS transistor followed by
a current mirror is used to detect the read current from a bitline, as shown in Figure 3.5.
The mirrored current is compared to a bias current set by the gate voltage Vbias on the
PMOS transistor in the reference branch, and this controls the voltage signal Vout, which
is buffered and latched. This simple sense amplifier is also better suited for future voltage
scaling.
A diode-connected MOSFET is useful for clamping the bitline to a certain voltage level,
reinforcing the original goal of maintaining a constant bitline potential to minimize leakage
Figure 3.5: Single gain stage current-sensing amplifier.
current. The diode-connected NMOS must have a sufficiently low input resistance relative
to the resistances of the memory cells to ensure that it provides enough conductance for
sensing current. The NMOS transistor in the current-mirror branch must be sufficiently
sized to detect the small amount of voltage swing. Transistor sizing is a critical part of the
peripheral circuitry design, and these considerations will be further discussed in the next
chapter.
3.4 Memory Cell Characterization
The memory array design depends strongly on the resistance-change memory cell charac-
teristics. There are a number of metal oxide materials that offer reversible, voltage-induced,
resistance changes. When choosing a suitable memory cell candidate, a number of factors
need to be considered. The material needs to serve as both the memory element and a
Figure 3.6: HfO2-based RRAM device
rectifying device, so the resistance of the chosen material must be high enough to mini-
mize leakage current in both the low-resistance and high-resistance states. However, the
resistance-change material should not have high write-energy requirements, as high current
or voltage requirements for programming would have extra peripheral circuitry require-
ments, and this would also limit the future scalability of the cross-point memory.
Two distinct conduction states are observed in certain polycrystalline metal oxides such
as TiO2 [46]. The high-resistance state can switch to a low-resistance state by voltage
and/or current stress [12]. The low-to-high resistance transition is also induced by voltage
and/or current stress.
In this work, the target device is a HfO2-based resistance-change material with an in-
terfacial layer between one electrode and the HfO2 layer. The interfacial layer is a reactive
buffer layer that allows a large number of oxygen atoms to diffuse from the HfO2 layer to
the interfacial layer, improving the reliability of the resistance-switching mechanism [15].
Table 3.1 summarizes the characteristics of the target memory device used in this work.
The I-V characteristics of the resistance-change device are shown in Figure 3.7.
Parameter   Value
RON         100 kΩ
ROFF        1 MΩ
VSET        1 V
ISET        5 µA
VRESET      -1.5 V
IRESET      40 µA
Table 3.1: Characteristics of an HfO2-based RRAM device
Figure 3.7: HfO2 bipolar switching characteristics.
3.5 Design Process
The remainder of this work will focus on the process of designing a memory architecture
to accommodate the read operation. The design methodology presented can be applied in
the same manner when later optimizing the memory architecture for the write operation.
Figure 3.8: The memory array represented as a resistive network
consisting of word line drivers, memory cells, and sense
amplifiers.
The key variables to be determined in designing a circuit architecture to support a mem-
ory array are: array dimensions, wordline driver size, and sense amplifier input impedance.
These are the factors that directly interact with a memory cell under bias during the read
operation, and scaling them controls read accuracy.
As shown in Figure 3.8, the memory array can be seen as a resistive network made up
of the key design variables identified above. The sneak current path depicted by the arrow
between two bitlines is undesirable, and can be moderated if the input impedance of the
sense amplifiers is sufficiently low. Alternately, the number of wordlines could be reduced,
giving the read current fewer leakage paths. Finally, the wordline drivers could be made
larger, resulting in a larger absolute current differential between the bitlines.
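The resistive-network view can be explored numerically. The following is a minimal sketch and not the HSPICE model used in this work: it treats every memory cell, the wordline driver, and each sense amplifier input as a linear resistor and solves the node equations directly. All function names are our own, and the 1 V read voltage, 1 kΩ driver resistance, and 5 kΩ sense amplifier input resistance are assumed for illustration. Only RON = 100 kΩ and ROFF = 1 MΩ come from Table 3.1.

```python
def solve_linear(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        acc = m[r][n] - sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = acc / m[r][r]
    return x


def bitline_currents(cells, v_read, r_driver, r_sense):
    """Current into each sense amplifier when wordline 0 is read.

    cells[i][j] is the resistance at wordline i, bitline j. Wordline 0 is
    driven from v_read through r_driver; all other wordlines float; every
    bitline returns to ground through r_sense.
    """
    n_wl, n_bl = len(cells), len(cells[0])
    n = n_wl + n_bl                      # unknowns: wordline then bitline nodes
    g = [[0.0] * n for _ in range(n)]    # nodal conductance matrix
    rhs = [0.0] * n
    for i in range(n_wl):
        for j in range(n_bl):
            c = 1.0 / cells[i][j]
            w, b = i, n_wl + j
            g[w][w] += c
            g[b][b] += c
            g[w][b] -= c
            g[b][w] -= c
    g[0][0] += 1.0 / r_driver            # driver on the selected wordline
    rhs[0] = v_read / r_driver
    for j in range(n_bl):
        g[n_wl + j][n_wl + j] += 1.0 / r_sense
    v = solve_linear(g, rhs)
    return [v[n_wl + j] / r_sense for j in range(n_bl)]


def worst_case_margin(n_wl, n_bl=32, r_on=100e3, r_off=1e6,
                      v_read=1.0, r_driver=1e3, r_sense=5e3):
    """Relative ON/OFF current gap with all cells ON except one selected cell."""
    cells = [[r_on] * n_bl for _ in range(n_wl)]
    cells[0][0] = r_off                  # the one high-resistance cell being read
    i = bitline_currents(cells, v_read, r_driver, r_sense)
    return (i[1] - i[0]) / i[1]


if __name__ == "__main__":
    for n_wl in (16, 32, 64):
        print(n_wl, "wordlines:", round(worst_case_margin(n_wl), 3))
```

Under these assumptions the relative gap between the ON and OFF bitline currents shrinks as wordlines are added and widens again when r_sense is lowered, which matches the qualitative trend described above. The absolute numbers depend entirely on the assumed resistances and should not be read as design values.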
The next chapter will analyze these design variable tradeoffs and find the optimal array
design.
Chapter 4
Circuit Design Tradeoffs
THE THREE PRIMARY design variables affecting the memory array during the read
operation were identified in the previous chapter. In this chapter, we seek to deter-
mine how these variables can be optimized to yield the smallest layout area while main-
taining reasonable read accuracy, as that is the primary goal of cross-point memory design.
4.1 A Resistive Network
A 32-bitline cross-point memory array using a resistor model of the HfO2-based RRAM
device was created in HSPICE using a 0.18 µm process technology with a nominal voltage
of 1.8 V. The array was simulated for accuracy while varying the number of wordlines,
the sense amplifier NMOS sizes, and the wordline driver transistor sizes. Read accuracy
was determined by accessing the leftmost wordline in a memory array and measuring the
amount of current reaching the sense amplifier input. In order to properly sense a signal,
there should be a sufficient difference in the current between the high- and low-resistance
states being accessed on the active wordline when the remainder of the array
is in the low-resistance state, as this array configuration allows the most leakage current
across unselected cells. It is determined that a 15% current differential is sufficient for a
minimum-size buffer to latch the sense amplifier output following a 4x current gain.
Figure 4.1: Sense amplifier design tradeoffs. The optimum sense
amplifier size for a 16-wordline array is marked (1), and the
optimum sense amplifier size for a 32-wordline array is
marked (2).
The results in Figure 4.1 show the minimum transistor sizes that can be used during
a worst-case read operation. Read operations were performed by activating the farthest
wordline from the sense amplifiers. The transistor sizes on the chart are normalized to the
minimum transistor size for the process technology used for simulation.
The rationale behind the sizing tradeoffs can be understood from Figure 4.2, which
is a simplified representation of the resistive network that occurs during a read operation
in the worst-case array state. In order for any sense amplifiers to be able to differentiate
Figure 4.2: Memory array design variables.
between the input currents ION and IOFF in this scenario, the leakage current from the low-
resistance bitline (RON) to the high-resistance bitline (ROFF) must be less than the original
bias current across the memory cell. The amount of current traversing from the RON bitline
to the ROFF bitline is equal to the difference between the bitline voltages (∆VBL) divided
by the equivalent resistance of all the resistors along the bitline in parallel (RON/#WL). Then,

\[ \frac{\Delta V_{BL} \cdot \#WL}{R_{ON}} < \frac{V_{READ}}{R_{ON}} \]

\[ \frac{V_{READ}}{\Delta V_{BL}} > \#WL \]
This gives the absolute minimum requirements for a functional memory array.
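The requirement that VREAD/∆VBL exceed the number of wordlines can be turned around to give the largest bitline voltage difference an array can tolerate. The helper below is a sketch of that arithmetic. The function name is our own and the 1 V read voltage is an assumed value.

```python
def max_bitline_delta(v_read, n_wordlines):
    """Upper bound on the bitline voltage difference: dV_BL < V_READ / #WL."""
    return v_read / n_wordlines


if __name__ == "__main__":
    for n_wl in (16, 32, 64):
        limit_mv = 1e3 * max_bitline_delta(v_read=1.0, n_wordlines=n_wl)
        print(f"{n_wl} wordlines: dV_BL must stay below {limit_mv:.3f} mV")
```

Each doubling of the wordline count halves the allowed ∆VBL for a fixed VREAD, so the sense amplifier input must hold the bitlines twice as tightly.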
For arrays with up to 32 wordlines, VREAD can be increased and ∆VBL decreased by
increasing the width of the wordline drivers and sense amplifier transistors respectively. In-
creasing the wordline transistor size drives more current through the selected cells in order
to account for the greater number of leakage paths. Increasing the size of the sense am-
plifier NMOS transistors provides a lower-impedance path relative to all of the unselected
memory cells in parallel. This generates the L-shaped curve seen in Figure 4.1.
When the array size is increased to 64 wordlines, the wordline drivers and sense ampli-
fier widths must be significantly increased in size to a point where it is no longer practical
for implementation. As shown in the above equation, for a constant VREAD, ∆VBL must
be halved to accommodate the doubling of the number of wordlines. While a valid con-
figuration for a 64-wordline array was found through simulation, in practice, this would
not work, because it would be impossible to lay out such large wordline driver and sense
amplifier transistors within the narrow pitch of the wordlines and bitlines.
Two acceptable array configurations are marked (1) and (2) in Figure 4.1 and will be
further discussed. Design (1) is optimized for a 16-wordline array. Design (2) is optimized
for a 32-wordline array and has a sense amplifier with NMOS transistor widths sized at
four times those of design (1).
4.2 MOSFET Current Amplifier
The sense amplifiers selected in the previous section were further evaluated by simulating
the current output from the bitline to the sense amplifier when the number of wordlines
is increased. Figure 4.3 shows the simulation results under different array configurations.
Once again, read operations were performed by activating the wordline farthest from the
sense amplifiers. The effect of the sneak paths in the array is clearly evidenced by the
narrowing current window between reading an ON state and an OFF state as the number
of wordlines is increased. Sense amplifier (1) only has a sufficient current sense margin
(approximately 15% IOFF) at 16 wordlines, while sense amplifier (2) has a sufficient sense
margin up to 32 wordlines. Both sense amplifiers have too narrow a margin by 64 word-
lines. This is consistent with the results presented in Figure 4.1.
Figure 4.3: Normalized current between the bitline and MOSFET sense
amplifier input under various array states.
4.3 Bipolar Transistor Current Amplifier
As shown in Figure 4.1, the NMOS transistors in the sense amplifier need to be sized to over
20 times the minimum transistor size in order to accommodate 16 or 32 wordlines. These
are rather large transistor areas, and it is possible that bipolar transistors may actually be
able to offer a more area-efficient and accurate design.
The input impedance of a BJT is lower than that of a MOSFET. The current-voltage relation
of the base-emitter junction of a BJT is equivalent to the exponential current-voltage
curve of the p-n junction of a diode. Thus, the collector current of a BJT increases
exponentially with base-emitter voltage, and its transconductance gm increases in direct
proportion to the collector current. At a given bias current, this results in a higher
transconductance than that of a MOSFET, which has a quadratic current-voltage relation
between the gate voltage and drain current. This sensitivity can make a bipolar-based
current mirror a better transimpedance
amplifier than the MOSFET-based circuit. In Figure 4.4, the NMOS transistors in the sense
amplifier have been replaced with NPN bipolar transistors. The PMOS device remains, as
it only serves to provide a reference current.
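The transconductance comparison can be made concrete with first-order textbook models: gm = IC/VT for a BJT and gm = 2·ID/VOV for a square-law MOSFET in strong inversion. The sketch below uses these standard expressions rather than anything extracted from the simulations in this work. The 10 µA bias current and 0.2 V overdrive voltage are assumed values.

```python
BOLTZMANN = 1.380649e-23           # J/K
ELECTRON_CHARGE = 1.602176634e-19  # C


def thermal_voltage(temp_kelvin=300.0):
    """kT/q, about 25.9 mV at 300 K."""
    return BOLTZMANN * temp_kelvin / ELECTRON_CHARGE


def gm_bjt(i_collector, temp_kelvin=300.0):
    """Small-signal transconductance of a BJT: gm = IC / VT."""
    return i_collector / thermal_voltage(temp_kelvin)


def gm_mosfet(i_drain, v_overdrive):
    """Square-law MOSFET in strong inversion: gm = 2 * ID / (VGS - VT)."""
    return 2.0 * i_drain / v_overdrive


if __name__ == "__main__":
    i_bias = 10e-6   # assumed bias current
    v_ov = 0.2       # assumed MOSFET overdrive voltage
    for name, gm in (("BJT", gm_bjt(i_bias)), ("MOSFET", gm_mosfet(i_bias, v_ov))):
        # A diode-connected device presents roughly 1/gm at its input.
        print(f"{name}: gm = {gm * 1e6:.0f} uS, 1/gm = {1 / gm / 1e3:.1f} kOhm")
```

At equal bias current the BJT's higher gm translates into a lower input resistance for the diode-connected device, which is the property that helps clamp the bitline voltage.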
Figure 4.4: Single gain stage current-sensing amplifier replacing
NMOS transistors with NPN bipolar transistors.
An evaluation of a memory array using bipolar sense amplifiers yields the results shown
in Figure 4.5. The NPN transistors in these sense amplifiers have emitter areas of 2 µm by
2 µm, the smallest vertical NPN transistor available in 0.18 µm CMOS technology. These
sense amplifiers maintain a sufficient margin of differentiation between the high-resistance
and low-resistance states for up to 32 wordlines. While the sense margin does become
insufficient at 64 wordlines, as was the case with the MOSFET-based sense amplifiers, the
current window is still greater than that achieved using the original sense amplifier designs.
This means that the bipolar sense amplifier could be more tolerant of resistance variations
in the memory cells.
Figure 4.5: Normalized current between the bitline and BJT sense
amplifier input under various array states.
4.4 Lateral Bipolar Junction Transistors in Standard CMOS
Process
A vertical NPN transistor is fabricated with a heavily-doped emitter in a p-doped base layer,
in a lightly-doped n-well which serves as the collector. This is shown in Figure 4.6. Ver-
tical bipolar junction transistors are often not available for integration in standard CMOS
processes. When they are, each device can occupy tens of microns in height and width.
The area requirement renders it impractical for use in the cross-point array.
Figure 4.6: Cross-sectional diagram of a vertical NPN transistor.
Bipolar junction devices also occur as parasitics in MOSFETs and have been utilized
as active devices with some success [27] [28]. The parasitic NPN junction is formed be-
tween the drain, substrate, and source of an NMOS transistor, as seen in Figure 4.7. The
base width is determined by the gate length of the MOSFET in the fabrication process
technology.
Figure 4.7: Cross-sectional diagram of a lateral NPN BJT as a parasitic
in an NMOS transistor.
In an NPN transistor, collector current is generated when a voltage is applied to forward-
bias the base-emitter junction and electrons are injected into the base region. These elec-
trons diffuse through the base towards the collector and are swept into the collector by the
electric field in the depletion region of the collector-base junction. The gain, β, depends
on the base width being much shorter than the diffusion length of the electrons in advanced
CMOS technologies [29].
Figure 4.8 shows the parasitic bipolar devices present in a cross-sectional diagram of
an NMOS transistor.
The electrons injected from the emitter to the base are collected mainly vertically in the
vertical BJT and laterally with the lateral device. The generated emitter current depends
on the area of the base-emitter junction closest to the collector region. The area of the
base-emitter junction in the vertical device is determined by the overall emitter area, which
is 4 µm2 in the transistor we want to emulate. The lateral BJT needs to be able to match
this base-emitter junction area in order to achieve the same gain as the 4 µm2 vertical
device. As identified in Figure 4.7, the diffusion depth for the active MOSFET regions
is approximately 0.10-0.25 µm in 0.18 µm CMOS technology. This means that a lateral
Figure 4.8: Parasitic bipolar devices identified in cross section of
NMOS transistor.
bipolar device should have an emitter perimeter of at least 16 µm.
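The perimeter figure follows from equating the lateral junction area (emitter perimeter times diffusion depth) with the 4 µm2 junction area of the vertical device. The sketch below reproduces that arithmetic. Approximating the sidewall area as perimeter times depth is our reading of the argument, and the function name is our own.

```python
def required_emitter_perimeter(junction_area_um2, diffusion_depth_um):
    """Perimeter (um) whose sidewall area equals the target junction area."""
    return junction_area_um2 / diffusion_depth_um


if __name__ == "__main__":
    for depth_um in (0.25, 0.10):
        perimeter = required_emitter_perimeter(4.0, depth_um)
        print(f"depth {depth_um:.2f} um -> perimeter {perimeter:.0f} um")
```

The 16 µm minimum corresponds to the deepest diffusion in the quoted range. At the shallow end of the range the required perimeter grows to about 40 µm.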
Two layout designs for lateral BJTs were considered for fabrication. The first is based
on a standard MOSFET layout, with drain and source active regions interleaved between
the multiple polysilicon fingers that serve as the gate. The drain and source regions serve
as the collector and emitter, and the substrate below the gate serves as the base region. The
device is surrounded by a ring of contacts to the substrate, or base region. It is fabricated
in a deep n-well and surrounded by n-well regions for isolation. This design is shown
in Figure 4.9, with the collector region labeled (C), emitter region labeled (E), and the
base/substrate region labeled (B).
While this layout design is compact, it may not achieve a very high gain because the
unsurrounded perimeter directly above and below the emitter regions enables electrons
to recombine in the base without reaching the collector. This increases the base current
without any increase in the collector current, effectively reducing the current gain β.
Figure 4.9: Lateral NPN bipolar device based on standard MOSFET
layout.
An alternate design, shown in Figure 4.10, places the emitter in an active region completely
surrounded by the collector. This should result in most of the electrons injected from
the emitter to the base being collected laterally. The smaller emitter area should also result
in a lower base current because it reduces the number of electrons that recombine in the
substrate below the emitter. Unfortunately, the unique shape of the polysilicon gate in this
layout requires a slightly longer gate in most design processes (for example, the gate length
becomes 0.21 µm in the TSMC 0.18 µm process). The wider base region will allow more
electrons from the emitter to recombine in the base before entering the collector, reducing
the gain.
Figure 4.10: Lateral NPN bipolar device with octagonal emitter shape.
4.5 Sense Amplifier Area Considerations
Finally, as the cross-point memory should be optimized for minimal area, it is important to
consider the layout overhead involved with each of the sense amplifier designs presented.
We do this by evaluating the total area components of a 4 kb row.
A 16x32-bit cross-point array has dimensions of 64λ by 128λ, where λ is one-half the
minimum feature size in any process technology. Then, a 32-by-32-bit block has dimen-
sions of 128λ by 128λ. The minimally-sized block-selection transistors occupy a width
of 44λ per bitline. The wordline-selection transistors are not considered for this estimate.
Table 4.1 shows a comparison of the different sense amplifier designs presented in this
chapter and their respective overhead impact on layout area when considering a single 4 kb
row. All circuits are designed for a minimum bitline-pitch, and staggered laterally if too
large.
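The block dimensions quoted above follow from a line pitch of 2F, or 4λ. The sketch below recomputes them and also estimates what fraction of each block-plus-selection span is memory. The function names are our own, and stacking the 44λ selection width directly onto the wordline-count dimension is an assumption about the layout. The sketch does not attempt to reproduce the totals in Table 4.1.

```python
LAMBDA_PER_PITCH = 4   # line width F plus spacing F = 2F = 4 lambda
SELECT_WIDTH = 44      # block-selection transistor width per bitline, in lambda


def array_dimensions_lambda(n_wordlines, n_bitlines):
    """Array extent in lambda along the wordline-count and bitline-count axes."""
    return n_wordlines * LAMBDA_PER_PITCH, n_bitlines * LAMBDA_PER_PITCH


def array_fraction(n_wordlines):
    """Share of a block-plus-selection span that is occupied by memory cells."""
    span = n_wordlines * LAMBDA_PER_PITCH
    return span / (span + SELECT_WIDTH)


if __name__ == "__main__":
    print(array_dimensions_lambda(16, 32), array_dimensions_lambda(32, 32))
    print(round(array_fraction(16), 3), round(array_fraction(32), 3))
```

With a fixed selection overhead per block, the larger array spends a greater share of its span on memory cells.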
Sense amplifier   Number of wordlines   SA width   Total width of 4 kb row
MOSFET (1)        16                    924 λ      28572 λ
4x MOSFET (2)     32                    1844 λ     12860 λ
BJT (flat)        32                    2514 λ     13530 λ
BJT (octagon)     32                    2807 λ     13823 λ
Table 4.1: Sense amplifier designs and impact on layout area when
considering a 4 kb row.
Figure 4.11 shows the layout area of the small MOSFET sense amplifiers compared
to multiple blocks of memory. With hundreds of memory arrays sharing a single set of
sense amplifiers, the greatest overhead will ultimately be that of the selection transistors.
Even though the 16-wordline array could utilize a much smaller sense amplifier design, the
32-wordline array provides better efficiency when all peripheral circuitry is considered.
Figure 4.11: Layout area of the small MOSFET sense amplifiers shared
by multiple memory arrays.
Chapter 5
Three-Dimensional Memory
Architecture
BUILDING INTEGRATED circuits vertically allows for a reduced chip footprint when
compared to a traditional two-dimensional (2D) design, by an approximate factor
of the number of layers used. This offers significant advantages in terms of reduced in-
terconnect delay when routing to blocks that otherwise would have been placed laterally.
Traditionally, a 3D integrated circuit (3D-IC) has used more than one active device layer.
While the resistance-change memory cells are not active devices, they function as rectifying
devices in our design.
This chapter describes the process by which the 2D cross-point array can be built into
a multi-layer 3D architecture. We also explore layout techniques that seek to maximize
the amount of peripheral circuitry that is folded underneath the memory arrays. While
support circuitry could also be built above the substrate using wafer bonding or monolithic
3D integration, the process complexity associated with these methods is currently too high
for economical manufacturing; thus, this work will only explore 2D peripheral circuitry for
supporting 3D memory arrays.
5.1 Memory Array Architecture
As determined in the previous chapter, a 32-wordline array provides the most area-efficient
solution for a cross-point array given the target memory cell characteristics presented in
Section 3.4.
The number of bitlines does not affect the sense accuracy, but is limited by the write
operation, as that is done by individually biasing the bitlines. Based on a programming
current of about 40 µA per cell, 32 bitlines will require a total programming current of
approximately 1.3 mA, which is manageable. We will estimate that the number of bitlines
will also be limited to 32. Thus, the memory array size is 32 bitlines by 32 wordlines. This
array size also ensures that a constant bias will be seen by all memory cells being accessed,
as the voltage drop across a wordline the length of 32 bitlines is negligible. The metal
resistance is about 0.08 Ω per square for the copper layer in most process technologies.
Even in the worst case, when all memory cells being accessed are in the low-resistance
state, the voltage drop across the 32-bit word line is about 2 mV.
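The programming-current and wordline-drop estimates in this paragraph can be reproduced with simple arithmetic. The sketch below is a rough check under stated assumptions: the wire is taken to be one feature wide on a 2F pitch (two squares per cell), and each low-resistance cell is assumed to draw 10 µA during a read. Neither value is given in this paragraph, and the function names are our own.

```python
def total_programming_current(n_bitlines=32, i_cell=40e-6):
    """Sum of per-cell programming currents across the bitlines."""
    return n_bitlines * i_cell


def wordline_drop(n_bitlines=32, sheet_res=0.08, squares_per_cell=2.0, i_cell=10e-6):
    """Return (lumped upper bound, distributed estimate) of the drop in volts.

    squares_per_cell and i_cell are assumptions, see the lead-in text.
    """
    r_segment = sheet_res * squares_per_cell
    # Upper bound: all of the read current flows through the full wire length.
    lumped = (n_bitlines * r_segment) * (n_bitlines * i_cell)
    # Distributed: segment k carries the current of the cells beyond it.
    distributed = sum(r_segment * i_cell * (n_bitlines - k) for k in range(n_bitlines))
    return lumped, distributed


if __name__ == "__main__":
    print(f"programming current: {total_programming_current() * 1e3:.2f} mA")
    lumped, distributed = wordline_drop()
    print(f"wordline drop: {lumped * 1e3:.2f} mV lumped, {distributed * 1e3:.2f} mV distributed")
```

Both estimates come out at or below the roughly 2 mV figure quoted above, which supports treating the wordline as an equipotential for an array of this size.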
A single page architecture is shown in Figure 5.1. The cross-point memory utilizes a
hybrid of NAND- and NOR-type architecture. While we want to read many bits in parallel,
it is necessary to break up the memory into smaller, electrically isolated arrays, in order to
minimize leakage current across the array. We choose to have 128 blocks in a row because
this results in a 573 Ω resistance along the global bitlines between the farthest memory
cells and the sense amplifier inputs. This keeps the voltage drop across the global bitline
sufficiently small as to not disrupt the sense margin.
There are also 128 blocks in a column. The limitation to the number of vertical blocks is
the delay along the polysilicon line that controls the block-selection transistors. The column
Figure 5.1: Hybrid NAND/NOR architecture of cross-point memory.
This shows a 32 Mb page.
decoders are located at the top of the page and control the bitline-selection transistors down
a column of blocks. The wordline decoders are located on the left side of the page and
propagate control signals for individual wordlines horizontally across the entire page, as
shown in Figure 5.2. The horizontal routing is necessary to ensure that control signal lines
do not interfere with global and local wordlines when multiple layers of memory are used.
A popular approach in current two-dimensional memory designs was described in 1984
by Mohsen et al. [30], and involves the arrangement of support circuitry between adjacent
memory arrays. A page is divided into two halves, with 128 by 128 blocks on each half of
the page. A single set of sense amplifiers and latches is shared between the two halves of
Figure 5.2: Architecture of a half page showing wordline decoder
routing.
the page. A read operation is performed in two cycles. In the first stage, the odd-numbered
blocks in a single column are accessed on the left half of the page. The even-numbered
blocks in the corresponding column are accessed on the right half of the page. In the
second stage, the even-numbered blocks are accessed in the same column on the left half
of the page, and the odd-numbered blocks in the same column are accessed on the right
half of the page. In these two cycles, an entire column of bits is read on both sides of the
page. The column decoder provides an identical output for each half of the page. Because
alternating blocks are accessed for each read cycle, it is possible to share wordline-driver
transistors between vertically adjacent blocks.
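The page organization and the two-cycle read can be summarized in a short sketch. It assumes blocks in a column are numbered from 1 and models only which blocks are accessed in each cycle, not the electrical behavior. The function names and the dictionary layout are our own.

```python
def page_bits(wl_per_block=32, bl_per_block=32,
              blocks_per_row=128, blocks_per_column=128, halves=2):
    """Total number of cells in one page of the hybrid architecture."""
    return wl_per_block * bl_per_block * blocks_per_row * blocks_per_column * halves


def read_schedule(n_blocks=128):
    """Blocks of one column accessed on each half of the page, per cycle."""
    odd = [b for b in range(1, n_blocks + 1) if b % 2 == 1]
    even = [b for b in range(1, n_blocks + 1) if b % 2 == 0]
    return [
        {"left": odd, "right": even},   # first cycle
        {"left": even, "right": odd},   # second cycle
    ]


if __name__ == "__main__":
    print(page_bits() // 2**20, "Mb per page")
    first, second = read_schedule(n_blocks=8)
    print("cycle 1:", first)
    print("cycle 2:", second)
```

Because the left and right halves always access blocks of opposite parity, each half covers its whole column after two cycles, and vertically adjacent blocks are never active in the same cycle.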
The memory array size has been limited to account for the effects of leakage current,
and thus selection transistors are required for every bitline and wordline of each block.
The bitline-selection transistors connect the array bitlines to global bitlines and serve to
isolate the block and prevent additional leakage current from external blocks during a read
operation. These selection transistors are unique to each memory block and cannot be
shared by multiple bitlines. Because of this, the selection transistors need to fit in the
minimum metal pitch. This technique is described in the following section.
The wordline-selection transistors are used to drive the read voltage to the selected
wordline and provide other biases during the SET and RESET operations as described in
Chapter 3. In order to minimize area overhead, the wordline-selection transistors are shared
between vertically adjacent arrays. This is enabled by the fact that only alternating rows
of blocks are accessed at a time. The gates of the bitline access transistors in a column of
arrays are controlled by two global selection lines for selecting odd and even rows. The
selection transistors are staggered so that two vertical wires can select alternating rows.
These transistors also need to be laid out to accommodate the minimum metal pitch.
5.2 Circuit Techniques for Vertical Architecture
In order to understand how the memory array peripheral circuitry can be folded underneath
the cross-point memory cells, we first explore the layout techniques used for a single-layer
memory.
The area occupied by the selection transistor layout depends on the amount of current
that a given transistor width can accommodate. The saturation current of a MOSFET is
constantly improving with more advanced technology processes, but for the purpose of this
design, we will use the typical transistor characteristics given in Table 5.1 [31].
Both wordline and bitline selection transistors are laid out as NMOS devices because
NMOS transistors have a much higher saturation current capacity than PMOS transistors.
For instance, in 0.18 µm CMOS technology, the saturation current capacity of an NMOS
device is about 750 µA/µm, while it is only about 280 µA/µm for a PMOS device.
Exclusively using NMOS transistors for bitline and wordline access also avoids the p-
well spacing constraints involved when both complementary devices are used. The body
effect is unavoidable, as the substrate must remain grounded to avoid any unwanted par-
asitics. This effect occurs because the threshold voltage of a MOSFET is affected by the
substrate voltage, as it changes the width of the depletion region. This requires that a
slightly wider transistor than the one indicated in the chart must be used for driving a posi-
tive voltage into the wordline or bitline [32]. Furthermore, the NMOS drivers will require
charge pumps to generate a sufficiently high VGS , as the source will be connected to VDD.
These charge pumps will be reused for the programming circuitry.
Process tech.    Max VDS    IDS (µA/µm)
180 nm           1.8 V      750
130 nm           1.2 V      530
90 nm            1.0 V      640
65 nm            0.8 V      750
Table 5.1: NMOS saturation current for different process technologies.
Given the memory cell characteristics described in Section 3.4, a bitline may need to
accommodate up to 10 µA for a read operation. As such, a minimum-size selection tran-
sistor may be used when fabricating the memory array in any of the process technologies
currently available. When reduced to layout, the transistor will occupy a horizontal width
of approximately 6λ. With staggered placement and routing, these select transistors can
fit within a single bitline pitch. The resulting pass-circuitry width is 32λ and spans 8
wordline-widths after taking substrate contacts and diffusion spacing into account. Fig-
ure 5.3 demonstrates how this can be done.
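The claim that a minimum-size device suffices can be checked with a short calculation. The sketch below divides the 10 µA read current by the per-width saturation currents of Table 5.1. The function and variable names are illustrative, and comparing the result against the drawn feature size is a simplifying assumption rather than a design rule taken from any process kit.

```python
# Saturation current per unit width from Table 5.1, in µA/µm,
# keyed by process node in nm.
IDS_SAT_UA_PER_UM = {180: 750, 130: 530, 90: 640, 65: 750}

READ_CURRENT_UA = 10.0  # worst-case bitline read current quoted in the text


def required_width_um(current_ua: float, idsat_ua_per_um: float) -> float:
    """Channel width (µm) needed to carry current_ua at saturation."""
    return current_ua / idsat_ua_per_um


def widths_by_node(current_ua: float = READ_CURRENT_UA) -> dict:
    """Required width in µm for every process node listed in Table 5.1."""
    return {node: required_width_um(current_ua, ids)
            for node, ids in IDS_SAT_UA_PER_UM.items()}


if __name__ == "__main__":
    for node, width_um in widths_by_node().items():
        # Assumption: the minimum drawn width is roughly the feature size.
        print(f"{node} nm node: {width_um * 1000:.1f} nm of width required")
```

In every listed node the required width is well below the feature size, which is consistent with using a minimum-size selection transistor.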
The bitline-selection circuitry utilizes metal layers 1 through 3 for local interconnect,
and assumes that the local bitlines are built using metal layer 6.
One set of global wordlines can connect all of the memory blocks in a column. The
bitline-selection transistors dictate which arrays are selected for access. As shown in
Figure 5.4, the wordline drivers can be staggered to fit into the minimum metal-width pitch.
The wordline drivers for a 32-bitline block occupy a height of 24 bitlines. The gate signals
that control the wordline transistors are routed horizontally, as the wordline decoders lie to
the sides of each page.
The wordline driver devices are also laid out using metal layers 1 through 3, and we
assume that the local wordlines are built using metal 7. The VREAD voltage source runs on a
wide bus that will be routed on metal 5. Metal layer 4 is reserved for routing the wordline
driver gate signals from the decoder. Including one metal layer for the global bitlines, the
cross-point memory array overlaid on the access circuitry requires at least 8 metal layers.
These metal layers are summarized in Table 5.2.
5.3 Layout Techniques and Array Efficiency
In our goal to minimize layout area, we design our peripheral circuitry with the intention
of placing transistors underneath the memory arrays. Because the area of the selection
Figure 5.3: Layout and schematic of four bitline-select transistors made
to accommodate a minimum-width metal pitch.
circuitry and wordline drivers is predetermined by the characteristics of the resistance-
change material, the bitline and wordline selection devices become the limiting factor when
trying to reduce the total memory footprint.
As described in the previous section, the bitline-selection transistors span a width of
Figure 5.4: Layout and schematic of wordline-select transistors made to
accommodate a minimum-width metal pitch.
8 metal lines when laid out to accommodate the minimum metal spacing. The wordline
drivers span a height of 24 metal lines when laid out to accommodate the minimum pitch.
Metal layer   1-layer               2-layer               4-layer
1             local IC, BL decode   local IC, BL decode   local IC, BL decode
2             local IC              local IC              local IC
3             local IC              local IC              local IC
4             local IC              local IC              local IC
5             WL decode             WL decode             local IC
6             VREAD bus             VREAD bus             WL decode
7             local BL              local BL (1)          VREAD bus
8             local WL              local WL              local WL (1)
9             global BL             local BL (2)          local BL (1)
10            -                     global BL             local WL (2)
11            -                     -                     local BL (2)
12            -                     -                     local WL (3)
13            -                     -                     global BL
Table 5.2: Metal layers and functions in multi-layer cross-point arrays.
Then, the total selection-transistor overhead will occupy the same area as a 32 wordline by
32 bitline array when packed as tightly as possible. Array efficiency is defined as the total
area occupied by the memory cells divided by the total combined area of the memory and
the peripheral circuitry. Thus, a completely two-dimensional array block of 32 wordlines
by 32 bitlines would have an array efficiency of just under 50%, because some additional
allowance is needed for edge contacts and spacing.
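The definition of array efficiency can be written out directly. The following is a minimal sketch in which areas are expressed in units of one line pitch squared; the edge allowance parameter is hypothetical and only illustrates why the two-dimensional figure falls slightly below 50%.

```python
def array_efficiency(cell_area: float, peripheral_area: float) -> float:
    """Memory-cell area divided by total (cell + peripheral) area."""
    return cell_area / (cell_area + peripheral_area)


# Areas in units of (line pitch)^2: a 32 x 32 block occupies 32 * 32 units.
BLOCK_AREA = 32 * 32
# Per the text, the selection transistors occupy the area of one 32 x 32 array.
SELECT_OVERHEAD = 32 * 32


def efficiency_with_edge_allowance(edge_units: float) -> float:
    """Efficiency when a hypothetical extra area is lost to edge contacts."""
    return array_efficiency(BLOCK_AREA, SELECT_OVERHEAD + edge_units)


if __name__ == "__main__":
    print(f"ideal 2D block: {array_efficiency(BLOCK_AREA, SELECT_OVERHEAD):.1%}")
    # 64 units of edge area is an arbitrary illustrative value.
    print(f"with edge allowance: {efficiency_with_edge_allowance(64):.1%}")
```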
For a single layer of cross-point memory, it is possible to fold the access circuitry
entirely underneath the memory array. As demonstrated above, the circuitry itself can be
staggered to match the minimum metal pitch. Figure 5.5 shows the layout of bitlines and
their access transistors. Figure 5.6 gives a three-dimensional perspective of the vertical vias
and local interconnect involved with routing bitlines to the substrate. Two metal layers are
needed for routing bitlines to the selection transistors in addition to the local and global
bitline metal layers.
Figure 5.5: Layout of bitline-select transistors made to fold under a
cross-point memory array.
The same technique can be employed for the wordlines with an additional metal layer
to bypass the bitline transistors, along with separate metal layers for the local and global
wordlines. To enable wordline-driver sharing, we have the contacts and associated selection
circuitry lie on alternating bitlines and wordlines. As the selection transistors need to be
Figure 5.6: Cross-sectional view of bitline-selection transistors made to
fold under a cross-point memory array.
staggered anyway to be accessed by metal lines in the minimum pitch, this arrangement
does not involve any extra overhead. This layout is shown with an abbreviated memory
array in Figure 5.7.
Folding the wordline drivers underneath the memory array becomes tricky, as the cor-
ners are already occupied by the bitline transistors. The solution to this is to fold the
wordline drivers under the adjacent memory arrays, and to reserve the remaining space un-
der the memory array for the bitline-selection transistors of neighboring blocks. Figure 5.8
and Figure 5.9 show how the bitline and wordline transistors can be folded under adjacent
memory blocks in a checkerboard pattern. However, in order for bitlines from adjacent
memory blocks to access the selection transistors, an additional metal layer needs to be
routed between the memory array and the substrate, bringing the total number of metal lay-
ers to 9. This is shown in Figure 5.10. No additional metal layers are required for routing
Figure 5.7: Layout of wordline drivers sized to fit under a cross-point
memory array.
the wordline drivers, because a single set of wordline drivers is shared between arrays.
If we consider an example page consisting of four 32 by 32 memory blocks with the
selection circuitry folded underneath the arrays as described above, we achieve an array
efficiency of 91.3% (without taking into account the decoders or sense circuitry), as the
only peripheral circuitry that cannot lie beneath the memory are the contacts connecting to
the global wordlines and bitlines.
5.4 Stacking Memory Layers
We have seen that, for a single layer of cross-point memory, the bitline-selection transis-
tors and wordline drivers can easily be placed under the memory arrays so as to present
very little overhead. For a second memory array stacked on top of the first, the same can
be achieved. Currently, contacts to the global wordlines occupy two peripheral edges of
Figure 5.8: Schematic of bitline-selection transistors made to fold
under a cross-point memory array.
each memory array, and contacts to the global bitlines occupy one peripheral edge of each
memory array. The one remaining peripheral edge can be used to route the second layer of
bitlines to the substrate devices, and the substrate devices to the global bitlines. Wordlines
will be shared between two layers of bitlines, as shown in Figure 5.11. Only one layer
of bitlines should be active at a time, so the wordline drivers will not need to drive any
more current in a single read operation than with the single-layer design. Furthermore, this
avoids using another metal layer and the vertical complexity of connecting a second set of
Figure 5.9: Schematic of wordline drivers made to fold under a
cross-point memory array.
wordlines. The additional bitlines should have no significant effect on the leakage current
during the read or write operations as they are essentially “floating” when unselected, with
no path to ground.
If two second-layer arrays needed to connect to the substrate devices at the same shared
edge, this would require yet another metal layer for routing between the memory array
Figure 5.10: Cross-sectional view of bitline-selection transistors shared
between two adjacent memory arrays.
Figure 5.11: Shared wordlines in a two-layer cross-point memory.
and substrate devices. This is undesirable. An alternative solution is to abut substrate
device contacts from the first layer with substrate device contacts from the second layer.
This is demonstrated in Figure 5.12. This method does not require any additional metal
interconnect layers besides the metal layer for the second set of bitlines. The only additional
Figure 5.12: Connecting two layers of bitlines to the substrate.
area overhead is the space at the array edge where a set of bitline contacts must be inserted.
With vias surrounding every edge of the memory array and the selection circuitry on the
substrate below, the effective array efficiency is now 88.5% for two layers of memory cells
in a 32 by 32 block. The total number of metal layers required for fabrication is now ten.
5.4.1 Four memory layers
One of the advantages of a transistor-free cross-point memory array lies in the fact that
the memory can be stacked indefinitely above the substrate. With some additional decoder
complexity, it is possible to build up to four memory layers without significant area over-
head at the array level.
At the current array pitch, it would not be possible to route additional bitline layers to
their selection circuitry in the substrate. Thus, to build additional memory layers, we must
increase the number of wordline layers.
Figure 5.13: Four-layer memory.
Figure 5.14: Four-layer memory with buffer connections.
We can increase the number of vertical layers to four by sharing wordlines, as shown
in Figure 5.13. The top and bottom wordline layers (1) and (3) can share a set of selection
transistors, as they deliver current to different bitline layers. The middle wordline layer (2)
is connected to driver transistors on the opposite side of the array, as shown in Figure 5.14.
All wordline driver transistors are connected to a global bus that provides the appropriate
bias for the read or write operation.
The additional metal layers required are the two layers for the wordline layers (1) and
(3), and one more layer for routing. At this point, there is limited floorspace underneath
the memory array. The wordline-selection circuitry for the two-layer design occupies a
height of 24 bitlines, as shown in Figure 5.4. After accounting for spacing and metal via
requirements, an additional height of 10 bitlines will be necessary between all rows to allow
for the wordline circuitry that does not fit beneath the memory array. This decreases the
array efficiency to 71.7%.
Now that all four edges of each memory array are surrounded by vias, it would be dif-
ficult to increase the number of memory layers beyond four without a significant decrease
in array efficiency.
Table 5.2 summarizes the number of metal layers required for distributing all of the
necessary interconnect for multiple layers of memory. It is critical that the fabrication
technology be able to maintain minimum feature-size spacing even at the highest metal
layers.
5.5 3D Memory Area Comparisons
A typical die size in 65 nm CMOS technology is about 150 mm². Our 32-by-32-bit block
(1 kbit) has an area of 17.3 µm². If we assume a conservative estimate of 50% memory
area efficiency, then we should be able to fit approximately 4.3 Gbits on a 150 mm² die
with the single-layer cross-point memory design. This gives us an overall bit-efficiency
rating of 28.9 Mbits/mm².
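The density estimate can be reproduced approximately from the cell geometry. The sketch below assumes a 4F² cell at F = 65 nm and decimal unit prefixes; the names are illustrative. It lands within a few percent of the figures quoted above, and the remaining gap depends on rounding and on the unit convention, which the text does not spell out.

```python
F_UM = 0.065             # minimum feature size (65 nm) in µm
CELLS_PER_BLOCK = 32 * 32
DIE_AREA_MM2 = 150.0
DIE_EFFICIENCY = 0.5     # conservative estimate used in the text
UM2_PER_MM2 = 1e6


def block_area_um2(f_um: float = F_UM, cells: int = CELLS_PER_BLOCK) -> float:
    """Area of one block, assuming each cell occupies 4F^2."""
    return 4 * f_um ** 2 * cells


def bits_per_mm2(efficiency: float = DIE_EFFICIENCY) -> float:
    """Bits per mm^2 of die area for a single memory layer."""
    raw_density = CELLS_PER_BLOCK / block_area_um2() * UM2_PER_MM2
    return raw_density * efficiency


def die_capacity_bits(area_mm2: float = DIE_AREA_MM2) -> float:
    """Total bits on a die of the given area."""
    return bits_per_mm2() * area_mm2


if __name__ == "__main__":
    print(f"block area: {block_area_um2():.1f} µm^2")
    print(f"density: {bits_per_mm2() / 1e6:.1f} Mbit/mm^2 (decimal prefix)")
    print(f"capacity: {die_capacity_bits() / 1e9:.2f} Gbit (decimal prefix)")
```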
Figure 5.15: Schematic of wordline drivers made to fold under a
four-layer cross-point memory array.
If we assume that the peripheral circuitry (decoders, page buffers, sense amplifiers, etc.)
occupies the same area for a multi-layer memory as for a single-layer memory, and that the
only difference in total layout area is the array efficiency shown in Figure 5.16, then a
two-layer cross-point memory would have a die efficiency of 48.5% and a four-layer cross-
point memory would have a die efficiency of 39.3% based on the scaling of their respective
array efficiencies. These efficiency estimates are very conservative.
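The scaling used for these die efficiencies can be sketched as follows, using the array efficiencies reported earlier in this chapter (91.3%, 88.5%, and 71.7%) and the 50% single-layer die efficiency. The function name is illustrative.

```python
# Array efficiencies for 32 x 32 blocks, by number of memory layers.
ARRAY_EFFICIENCY = {1: 0.913, 2: 0.885, 4: 0.717}

SINGLE_LAYER_DIE_EFFICIENCY = 0.50  # conservative estimate from the text


def die_efficiency(layers: int) -> float:
    """Scale the single-layer die efficiency by the ratio of array efficiencies."""
    ratio = ARRAY_EFFICIENCY[layers] / ARRAY_EFFICIENCY[1]
    return SINGLE_LAYER_DIE_EFFICIENCY * ratio


if __name__ == "__main__":
    for n in sorted(ARRAY_EFFICIENCY):
        print(f"{n}-layer: {die_efficiency(n):.1%}")
```

With these inputs the two-layer and four-layer cases evaluate to 48.5% and 39.3%, matching the values stated above.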
A two-layer cross-point memory should allow for over 8 Gbits to fit on the same
Figure 5.16: Array efficiency vs. number of vertical memory layers,
assuming 32-by-32-bit arrays.
150 mm² die. A four-layer cross-point memory built for an 8 Gbit capacity would
occupy a total die area of only 133 mm².
Table 5.3 lists some production NAND flash memory devices fabricated in similar pro-
cess technologies and their area efficiencies to provide a basis of comparison to the multi-
layer cross-point memories presented in this chapter [33] [34] [35] [36]. Even with a con-
servative efficiency estimate, the four-layer cross-point memory still achieves a far better
bit-efficiency than even Samsung’s multi-level NAND flash that stores two bits per cell.
One of the major challenges in maximizing area efficiency is the consideration that
memory access times could be limited by the additional wire length needed to reach the
read circuitry as well as the parasitic capacitance from long wires spaced closely together.
An evaluation of the performance of a 3-dimensional cross-point design will be presented
in the next chapter.
Device                         Memory size   Die size     Bit-efficiency   Die efficiency
                                                          (Mbits/mm²)
Samsung SLC (65 nm)            4 Gb          131 mm²      31.3             54%
Toshiba SLC (65 nm)            4 Gb          137 mm²      29.2             60.4%
Hynix/STMicro (70 nm)          4 Gb          144 mm²      28.4             65.4%
Micron SLC (50 nm)             8 Gb          169.5 mm²    47.2             65%
Samsung MLC (63 nm)            8 Gb          133 mm²      61.6             70%
1-layer Cross-point (65 nm)    4 Gb          139 mm²      29.5             50%
2-layer Cross-point (65 nm)    8 Gb          144 mm²      56.9             48.5%
4-layer Cross-point (65 nm)    8 Gb          88.7 mm²     92.4             39.3%
Table 5.3: NAND flash memory area efficiency comparisons.
Chapter 6
Performance Analysis
EVEN THOUGH the cross-point memory array is expected to have relatively slow
access times due to the fact that the design is optimized for area, performance esti-
mates are necessary to perform a comprehensive evaluation of the 3-dimensional memory
model.
An 8 Gb cross-point memory array built in four layers was modeled and simulated in
HSPICE using circuit parameters for the TSMC 65 nm CMOS process with a nominal
operating voltage of 1.2 V.
This chapter presents the performance results from these circuit simulations as well
as a comparison of the cross-point memory performance requirements with those of other
nonvolatile memories in production today.
6.1 8 Gb Memory Architecture
As described in Section 5.1, 32-bit by 32-bit arrays are tiled into a page, with 128 rows
and 256 columns per page. The column decoders control the bitline-selection devices and
are located above the page. The wordline decoders control the wordline-selection devices
Figure 6.1: Four-layer cross-point memory implemented in an 8 Gb
architecture.
and are located to the sides of the page. The sense amplifiers used in this model are the
4x MOSFET sense amplifiers (2) described in Section 4.2. Unfortunately, a memory archi-
tecture implementing the lateral bipolar sense amplifiers could not be simulated for perfor-
mance evaluations because we did not yet have an accurate SPICE model for lateral bipolar
devices.
A complete read operation reads two columns of bits, one from each half of a page.
This takes two cycles, as a read cycle accesses alternating rows from each half. Each read
cycle senses 4 kb of data.
A page stores 32 Mb per layer, so a four-layer page has a memory capacity of 128 Mb.
We want to simulate a total memory size of 8 Gb as this is a commonly-used capacity for
NAND flash memories currently in production in 65 nm process technologies. We choose
to model a four-layer memory architecture as this offers the highest bit density available
for our design, as described in Chapter 5. Modeling an entire read path from address-latch
to buffered output will give us a good basis for comparison of the cross-point memory with
other memory architectures.
The memory utilizes a single-core architecture with 64 four-layer pages. The pages can
be arranged in a number of ways, but a design with two vertical columns of 32 pages each was
chosen for simulation. This layout most closely resembles typical NAND flash architectures.
In the future, charge pumps and programming circuitry could be placed between the two
columns. A general block diagram for the 8 Gb memory is shown in Figure 6.1.
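The capacity figures in this section follow from the tiling parameters. The sketch below recomputes them, assuming binary prefixes (1 Mb = 2^20 bits, 1 Gb = 2^30 bits); the constant and function names are illustrative.

```python
BITS_PER_BLOCK = 32 * 32   # one 32-bit by 32-bit array
BLOCK_ROWS = 128           # rows of blocks per page
BLOCK_COLS = 256           # columns of blocks per page
LAYERS = 4
PAGES = 64

MBIT = 2 ** 20             # assumption: binary prefixes
GBIT = 2 ** 30


def page_bits_per_layer() -> int:
    """Bits stored in one page on a single memory layer."""
    return BITS_PER_BLOCK * BLOCK_ROWS * BLOCK_COLS


def total_bits() -> int:
    """Bits stored in the full memory: all layers of all pages."""
    return page_bits_per_layer() * LAYERS * PAGES


if __name__ == "__main__":
    print(f"page, one layer: {page_bits_per_layer() // MBIT} Mb")
    print(f"page, four layers: {page_bits_per_layer() * LAYERS // MBIT} Mb")
    print(f"total: {total_bits() // GBIT} Gb")
```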
6.1.1 Decoder architecture
A 21-bit address is required to access a column of 4096 bits in a single read cycle. Every
page receives a 5-bit wordline address to select one of 32 wordlines. The wordline-selection
signal propagates across the page. Each page also receives an 8-bit column address and a
2-bit layer selection code. The column decoder output controls the bitline-selection tran-
sistors. The wordline address is decoded using a two-stage decoder and the column address
enters a three-stage decoder. The page outputs ultimately pass through a mux controlled by
a 6-bit, two-stage page decoder.
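The address fields described above sum to 21 bits (6 + 8 + 2 + 5). A minimal sketch of splitting an address into these fields is given below. The ordering of the fields within the address word is an assumption made for illustration, since the text does not specify it, and the type and function names are hypothetical.

```python
from typing import NamedTuple


class ReadAddress(NamedTuple):
    page: int       # 6 bits, selects one of 64 pages
    column: int     # 8 bits, selects one of 256 block columns
    layer: int      # 2 bits, layer selection code
    wordline: int   # 5 bits, selects one of 32 wordlines


# Assumed field order, most significant field first.
FIELD_WIDTHS = (("page", 6), ("column", 8), ("layer", 2), ("wordline", 5))
ADDRESS_BITS = sum(width for _, width in FIELD_WIDTHS)


def split_address(addr: int) -> ReadAddress:
    """Split a 21-bit address into its page, column, layer, and wordline fields."""
    if not 0 <= addr < (1 << ADDRESS_BITS):
        raise ValueError("address out of range")
    fields = {}
    shift = ADDRESS_BITS
    for name, width in FIELD_WIDTHS:
        shift -= width
        fields[name] = (addr >> shift) & ((1 << width) - 1)
    return ReadAddress(**fields)
```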
6.1.2 Critical path
The slowest-case read operation occurs when trying to access one of the bottom-most pages
in the two columns shown in Figure 6.1. In this case, the address bits must be propagated
from the address buffer all the way across the die. The critical path follows the address
bits to the wordline decoder, which must drive the capacitance of a metal line the width
of a page as well as 256 wordline drivers. The local wordlines and bitlines charge within
picoseconds, as they are only 32 bits in length. The remainder of the critical path involves
charging the global bitlines, sensing and latching the bitline signal, and propagating the
data to the output buffer.
6.1.3 Assumptions
We assume that the output buffer must drive a 10 pF load. As the memory array has been
optimized for area, it is assumed that all decoders and buffers are built using minimum-size
transistors, with the exception of the wordline drivers and the output buffer. Also, all data
buses are assumed to be laid out in the minimum metal pitch. In order to achieve a fast
burst speed (the speed at which buffered data is piped to the I/O), the output buffer will be
optimized for a fan-out of 4 (FO4). That is, CLOAD/CIN = 4. The typical FO4 delay in a
65 nm process technology is 18 ps. An eight-stage buffer is needed to drive a 10 pF output
load.
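The buffer sizing can be illustrated with the standard tapered-buffer relations. The sketch below is an idealized estimate: it takes the fan-out of 4, the 18 ps FO4 delay, and the 10 pF load from the text, and treats the first-stage input capacitance as a free parameter, since no value is given here. It ignores wire loading, so the chain delay should be read as a rough lower bound rather than a simulated result.

```python
import math

FANOUT = 4
FO4_DELAY_PS = 18.0   # typical FO4 delay in 65 nm, from the text
LOAD_F = 10e-12       # 10 pF output load


def implied_input_capacitance(load_f: float, stages: int,
                              fanout: int = FANOUT) -> float:
    """Input capacitance of the first stage if each stage drives fanout times its input."""
    return load_f / fanout ** stages


def stages_needed(load_f: float, input_f: float, fanout: int = FANOUT) -> int:
    """Number of stages so that no stage exceeds the target fan-out."""
    return math.ceil(math.log(load_f / input_f) / math.log(fanout))


def ideal_chain_delay_ps(stages: int, fo4_delay_ps: float = FO4_DELAY_PS) -> float:
    """Idealized delay: one FO4 delay per stage, no wire parasitics."""
    return stages * fo4_delay_ps


if __name__ == "__main__":
    c_in = implied_input_capacitance(LOAD_F, 8)
    print(f"implied first-stage input capacitance: {c_in * 1e15:.3f} fF")
    print(f"ideal 8-stage delay: {ideal_chain_delay_ps(8):.0f} ps")
```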
The interface for the 8 Gb memory will have 8 I/O pads. The 21 address bits will be
stored in an address buffer in three address cycles. As described in the previous chapter,
each read operation consists of two read cycles, accessing alternating even/odd rows of
blocks in a single page column. For these simulations, each single read cycle is considered
independently addressed. Ultimately, it will be up to the memory file system to control the
order of read cycles. That is, a single read access of even/odd rows in a page column may
not necessarily be followed by a second read access of odd/even rows in the same page
column. Each read cycle stores 4 kb of data in the page buffer, and this data will also
be shifted out 8 bits at a time.
Other assumptions used for the 8 Gb memory model are that the decoders implemented
are fully static and that all wordlines and bitlines are fabricated in copper and have a thick-
ness of 200 nm, regardless of metal layer. The memory layers are fabricated above the
first four metal layers, with the local block selection transistors underneath the memory
arrays. All other peripheral circuitry, including decoders, sense amplifiers, and buffers, lie
in the substrate surrounding the memory arrays. It is assumed that the local interconnect
within the circuitry has minimal effect on capacitance, but delay contributions due to the
capacitance from address and data buses as well as bitlines and wordlines are accounted
for.
6.2 Memory and Array Models
6.2.1 Simulation methodology
Parasitic interconnect delay is one of the greatest contributors to read latency in the cross-
point memory array. Long bitlines and wordlines are fabricated in the minimum metal pitch
available for the technology process, leading to high parasitic resistance and capacitance. In
a typical memory architecture, the parasitic interconnect capacitances are among the most
difficult parameters to estimate accurately. In order to accurately model parasitic capaci-
tance, each wire must be treated as a three-dimensional structure in metal or polysilicon,
interacting with all of the surrounding wires and the ground plane.
Parasitic capacitances in the multi-layer cross-point memory arrays were modeled and
extracted using Ansoft Q3D Extractor [37], a parasitic extraction technology that uses the
Finite Element Method to compute 3D capacitance and resistance parameters of a structure
and automatically generates an equivalent SPICE sub-circuit. This sub-circuit can then be
simulated in HSPICE, allowing for a more accurate estimation of read latency.
It is also necessary to consider resistive loads when measuring delay times. This is a
more straightforward calculation. We use standard wire resistances of 0.07 Ω per square
for metal layers 1-3, and 0.03 Ω per square for all higher metal layers.
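Wire resistance follows from the sheet resistance multiplied by the number of squares (length divided by width). The sketch below applies the two sheet-resistance values given above. The example line geometry (32 cells at a 130 nm pitch, 65 nm wide) is an assumption chosen for illustration, and the function names are hypothetical.

```python
def sheet_resistance(metal_layer: int) -> float:
    """Sheet resistance in ohms per square for the given metal layer."""
    return 0.07 if metal_layer <= 3 else 0.03


def wire_resistance(metal_layer: int, length_nm: float, width_nm: float) -> float:
    """Resistance in ohms of a uniform wire: R_sheet * (length / width)."""
    squares = length_nm / width_nm
    return sheet_resistance(metal_layer) * squares


if __name__ == "__main__":
    # Assumed geometry: a local line spanning 32 cells at a 130 nm pitch
    # (65 nm line plus 65 nm space), drawn 65 nm wide on metal 8.
    r = wire_resistance(8, length_nm=32 * 130, width_nm=65)
    print(f"local line resistance: {r:.2f} ohm")
```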
6.2.2 Capacitance modeling
The memory cell model was built according to the characteristics described in Section 3.4.
TiN electrodes sandwich a thin HfO2 resistance-change memory layer. A thin TiO2 in-
terfacial layer lies between the HfO2 layer and the electrode closer to the bitline. The
material thicknesses and dielectric constants are listed in Table 6.1. The total capacitance
across the memory cell as extracted by Ansoft Q3D is 0.0679 fF. We assume that both the
high-resistance state and the low-resistance state have the same capacitance.
Material    Thickness    Dielectric constant κ
TiN         20 nm        769
HfO2        20 nm        19
TiO2        5 nm         106
Table 6.1: HfO2-based memory cell model parameters.
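As a rough sanity check on the extracted cell capacitance, the two dielectric layers can be treated as parallel-plate capacitors in series. This sketch is not the field-solver method used above: it ignores fringing fields, treats the TiN electrodes as ideal conductors, and assumes a 65 nm by 65 nm cell area. It yields a value of the same order of magnitude as the 0.0679 fF extracted by Q3D.

```python
EPS0 = 8.8541878128e-12  # vacuum permittivity, F/m

# (thickness in m, relative permittivity) for the dielectric layers in Table 6.1
HFO2 = (20e-9, 19.0)
TIO2 = (5e-9, 106.0)
CELL_SIDE_M = 65e-9      # assumed square cell, 65 nm on a side


def stack_capacitance_f(layers, area_m2: float) -> float:
    """Series combination of parallel-plate layers: C = eps0 * A / sum(t / k)."""
    series_term = sum(thickness / kappa for thickness, kappa in layers)
    return EPS0 * area_m2 / series_term


def cell_capacitance_ff() -> float:
    """Parallel-plate estimate of the cell capacitance in fF."""
    return stack_capacitance_f([HFO2, TIO2], CELL_SIDE_M ** 2) * 1e15


if __name__ == "__main__":
    print(f"parallel-plate estimate: {cell_capacitance_ff():.4f} fF")
```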
A 32 x 32 four-layer array was also modeled in order to extract the parasitic capac-
itances of the wordlines and bitlines. The ground plane lies 1.6 µm below the memory
array, allowing for four layers of interconnect. All wordlines and bitlines are built with a
width and pitch of the minimum feature size, 65 nm. The memory cells lie at the intersections
between the wordlines and bitlines and have an area of 65 nm by 65 nm. We assume a SiO2
dielectric material with a dielectric constant κ of 4.0.
Figure 6.2: 32 x 32 cross-point memory array modeled in Ansoft Q3D.
Figure 6.3: Multi-layer memory array represented as an RC network for
simulation in HSPICE.
The lateral parasitic capacitance between 32-bit metal lines is 0.615 fF, and the verti-
cal parasitic capacitance between 32-bit metal lines is 0.0892 fF. This is consistent with
documentation for 65 nm process technologies. Parasitic capacitances for address and data
buses were extrapolated from these numbers.
6.3 Waveform Analysis and Timing Diagrams
Figure 6.4 shows the modeled waveforms of a single access performed on the 8 Gb four-
layer memory architecture described above, operating at the 1.2 V nominal voltage for
the 65 nm process technology. A worst-case access was simulated, reading a bit from the
leftmost column of the bottom-most page in the memory structure.
As seen in Figure 6.4, the address signal takes over 100 ns to fully propagate to the
decoder input of the farthest page from the address buffer. However, the wordline selection
transistors can latch the address by 30 ns. The global wordline is sufficiently charged by
35 ns, and the local wordline can drive sufficient current for sensing to the local bitlines by
45 ns. The global bitline is charged at 50 ns and the sense amplifier has a sufficient output
swing for latching by 66 ns. The data from the page buffer can be propagated through the
I/O driving a 10 pF load by 70 ns, but the page buffer output lines are not stabilized until
104 ns. After a stable signal is latched by the page buffer at 104 ns, the next read cycle can
commence. The next page buffer output should not be shifted out until after 104 ns even if
the output data may be valid by 70 ns.
A precharge condition where the global wordlines and bitlines are charged to a voltage
close to the read bias prior to an initial page access could decrease the read latency by about
25 ns. The read operation could also be sped up by introducing intermediate buffers for the
address and data buses, as the propagation of address and data bits across the die accounts
for the greatest delay component in a read operation. However, these buffers will need to
be fitted into the memory array and will likely increase the area overhead.
Most of the delay components seen in Figure 6.4 are independent of the number of
Figure 6.4: Simulation waveforms for 8 Gb memory read access.
layers of memory. The charging of the local wordlines and bitlines is the element most
affected by increasing the number of memory layers, as this adds vertical parasitic capac-
itance between the wordlines and bitlines. However, these parasitic effects are only about
0.0892 fF per local bitline or wordline and will not significantly affect the access time.
6.3.1 Burst read
Only 8 bits are available for the I/O, while a single read cycle accesses 4096 bits. Thus, it
takes 512 cycles to fully output the 4096 bits. An eight-stage output buffer optimized for
a fan-out of 4 can shift out each set of data in 173 ps, but it could be clocked as slowly as
201 ps so that the data is shifted out within a read cycle of 104 ns.
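The burst-read arithmetic can be laid out explicitly. The sketch below uses the figures from this section; the names are illustrative. A plain division of the 104 ns read cycle over 512 shift cycles gives roughly 203 ps per shift, so the 201 ps quoted above sits just inside that bound.

```python
BITS_PER_READ_CYCLE = 4096
IO_WIDTH = 8
READ_CYCLE_NS = 104.0
MIN_SHIFT_PS = 173.0   # fastest shift period of the output buffer, from the text


def shift_cycles() -> int:
    """Number of I/O cycles needed to output one read cycle of data."""
    return BITS_PER_READ_CYCLE // IO_WIDTH


def fastest_burst_ns() -> float:
    """Time to shift out all data at the fastest shift period."""
    return shift_cycles() * MIN_SHIFT_PS / 1000.0


def slowest_shift_period_ps() -> float:
    """Longest shift period that still fits within one read cycle."""
    return READ_CYCLE_NS * 1000.0 / shift_cycles()


if __name__ == "__main__":
    print(f"shift cycles: {shift_cycles()}")
    print(f"fastest full burst: {fastest_burst_ns():.1f} ns")
    print(f"slowest usable shift period: {slowest_shift_period_ps():.1f} ps")
```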
Figure 6.5: Read operation timing diagram.
6.4 Performance Comparisons with Commercial Products
Table 6.2 shows a comparison of the performance numbers from various types of non-
volatile memory currently in production, from NOR flash [35] to SanDisk’s 3D one-time-
programmable memory [38] [39]. It is clear that the 3D cross-point memory has an access
time (tR access) almost as fast as that of a lower-capacity NOR flash, and a burst read
time (tR sequential) that is much faster than that of NAND flash. Of course, the 8 Gb
cross-point memory architecture described in this chapter does not implement any sort of
error-correction or redundancy, which add significant overhead to the
access times of flash memory.
Device                     tR access   tR sequential   Iavg/Imax       Pavg/Pmax
Numonyx 1Gb NOR Flash      100 ns      25 ns           21 mA/24 mA     35.7 mW/40.8 mW
Micron 8Gb NAND Flash      25 µs       30 ns           15 mA/30 mA     40.5 mW/81 mW
SanDisk 1Gb 3-D OTP        140 µs      100 ns          20 mA/30 mA     54 mW/81 mW
4-layer 8Gb Cross-point    104 ns      201 ps          27 mA/51 mA     32.4 mW/61.2 mW
Table 6.2: Nonvolatile memory performance comparisons.
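The current and power columns of Table 6.2 are related by P = I × V, so dividing power by current recovers the supply voltage each row implies. The sketch below performs that check; the dictionary layout and function name are illustrative. For the cross-point row the result should match the 1.2 V nominal operating voltage used in the simulations.

```python
# (I_avg mA, I_max mA, P_avg mW, P_max mW), transcribed from Table 6.2.
TABLE_6_2 = {
    "Numonyx 1Gb NOR Flash": (21.0, 24.0, 35.7, 40.8),
    "Micron 8Gb NAND Flash": (15.0, 30.0, 40.5, 81.0),
    "SanDisk 1Gb 3-D OTP": (20.0, 30.0, 54.0, 81.0),
    "4-layer 8Gb Cross-point": (27.0, 51.0, 32.4, 61.2),
}


def implied_supply_v(name: str) -> tuple:
    """Supply voltage implied by the average and maximum rows: V = P / I."""
    i_avg, i_max, p_avg, p_max = TABLE_6_2[name]
    return p_avg / i_avg, p_max / i_max


if __name__ == "__main__":
    for device in TABLE_6_2:
        v_avg, v_max = implied_supply_v(device)
        print(f"{device}: {v_avg:.2f} V (avg), {v_max:.2f} V (max)")
```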
The power requirements for the read operation were calculated only based upon the
components used in the read path. We have to assume perfect voltage gating for all other
on-chip circuitry simply because it was not modeled or simulated. In reality, other
peripheral circuitry such as that used for the write operation would contribute some amount
of power dissipation due to leakage current during the read operation.
The cross-point memory current draw is higher than that of the other memory archi-
tectures presented primarily because the cross-point memory has the widest datapath. We
read 4 kb of data per read cycle, compared to 2 kb per read cycle for NAND flash. The
wide datapath ensures that the cross-point memory achieves the fastest bit-rate.
6.4.1 NAND flash latency
The 3D cross-point memory and NAND flash are both optimized for area, but NAND flash
access times are more than two orders of magnitude slower. There are a number of reasons why NAND
flash latency is so much higher. NAND flash memory typically has 16 or 32 floating-gate
transistors in series to form a bitline [40]. So many MOSFETs in series need increasingly
higher gate voltages to ensure that VGS is greater than the threshold voltage. This requires
charge pumps to generate the required read voltage, typically around 5 V.
Not only does it take tens of microseconds for a charge pump to produce the required
read voltage from a 2.7 V operating voltage [41], but it would take hundreds of nanoseconds
to raise the long polysilicon wordlines to such a high voltage.
As higher voltages are required for both the read and write operations, longer-channel
devices are required for the peripheral circuitry in order to maintain reliability [42]. These
long-channel devices have slower switching speed than standard devices. The clock speed
needs to be slowed to accommodate the high-voltage devices, and NAND flash can only
pipe data out in sequential (burst) access operations as fast as the clock speed.
Although the write circuitry for cross-point memory is not discussed in this work, it is
anticipated that high-voltage devices will not be required for SET or RESET operations.
Another consideration is error-correction. NAND flash typically has built-in error-
correction capabilities that take another 5-10 µs to decode, as the error-correcting circuits
also use the same global clock that has been slowed to accommodate the high-voltage cir-
cuitry. Error-correction is not something that has been accounted for in the cross-point
memory design, and will likely be necessary for actual products.
Table 6.3 summarizes the key latency components for the 3D cross-point memory,
NAND flash, and NOR flash [43]. The 8 Gb cross-point memory has a similar wordline
charge time (which includes the address propagation and decode) to NOR flash. The sense
time is nearly the same between the cross-point memory, NOR flash and NAND flash. All
three memories also share a similar delay time for propagating data to the output buffer.
NOR flash does not typically implement on-chip error-correction and hence has no further
delay.
Component                       3D Cross-point   NAND flash   NOR flash
Wordline charge time            38 ns            20 µs        36 ns
Sense amp latency               16 ns            17 ns        18 ns
Page buffer datapath latency    50 ns            60 ns        42 ns
Error-correction                0                5 µs         0
Total                           104 ns           25 µs        100 ns
Table 6.3: Read path timing signal comparisons.
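The totals in Table 6.3 are sums of the listed components. A short sketch of that bookkeeping follows, with all values converted to nanoseconds; the dictionary keys and function names are illustrative.

```python
# Latency components in ns, transcribed from Table 6.3.
LATENCY_NS = {
    "3D Cross-point": {"wordline_charge": 38, "sense_amp": 16,
                       "page_buffer": 50, "error_correction": 0},
    "NAND flash": {"wordline_charge": 20_000, "sense_amp": 17,
                   "page_buffer": 60, "error_correction": 5_000},
    "NOR flash": {"wordline_charge": 36, "sense_amp": 18,
                  "page_buffer": 42, "error_correction": 0},
}


def total_ns(memory: str) -> int:
    """Sum of all read-path latency components for one memory type."""
    return sum(LATENCY_NS[memory].values())


def dominant_component(memory: str) -> str:
    """Name of the largest latency component for one memory type."""
    components = LATENCY_NS[memory]
    return max(components, key=components.get)


if __name__ == "__main__":
    for memory in LATENCY_NS:
        print(f"{memory}: {total_ns(memory)} ns, "
              f"dominated by {dominant_component(memory)}")
```

The cross-point components sum exactly to the 104 ns total. The NAND components sum to about 25.08 µs and the NOR components to 96 ns, so the totals listed for those two columns appear to be rounded.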
Chapter 7
Test Chip Design
A TEST CHIP was designed and fabricated to serve as proof-of-concept for the mem-
ory array architecture during the read operation described in Chapter 3. The test
chip also includes test structures for evaluating the performance of the lateral bipolar junc-
tion transistors described in Chapter 4. Other goals served by the test chip are to verify the
functionality of the sense amplifiers for different array sizes and determine the minimum
detectable RON and ROFF resistance values for these array sizes. Finally, voltage scaling
capabilities are also studied.
7.1 Memory Array Implementation
The test chip was fabricated using TSMC 0.18 µm CMOS technology. This is a triple-well
process, enabling the fabrication of lateral bipolar devices. Because the resistance-change
HfO2 devices were not yet available for deposition, the memory cells were modeled us-
ing PMOS transistors at the cross-point junctions. The gate voltage of these transistors
was externally controlled to emulate different values for RON and ROFF . Even though
the memory architecture thus far has been designed to accommodate the resistance-change
characteristics described in Chapter 3, testing controlled variations in ON and OFF resis-
tances allows us to estimate the range of memory-cell resistance variations that our design
can tolerate.
Figure 7.1: Test chip architecture.
Figure 7.1 shows the test chip architecture. There are four rows of memory arrays,
each consisting of four 16-wordline by 256-bitline blocks. The bitlines of each block are
separated by selection transistors so that the effective block size can be increased from 16
to 64 wordlines during testing. Each row of blocks is connected to an array of 256 sense
amplifiers that buffer into a shift register. Four sense amplifier designs were laid out, one
for each row. The first row uses the MOSFET sense amplifier (1) design optimized for
16 wordlines, as described in Section 4.2. The second row uses the 4x MOSFET sense
amplifier (2) optimized for 32 wordlines. The third row uses the lateral bipolar NPN sense
amplifier laid out in the typical MOSFET structure (Section 4.3). The fourth row uses the
lateral bipolar NPN sense amplifier laid out in an octagonal structure.
PMOS transistor test structures were fabricated and microprobed to characterize the
I-V curves and approximate the memory cell resistance value corresponding to each gate
voltage. This is shown in Figure 7.2. The resistance range from 1 kΩ to 100 kΩ is difficult
to control at a fine resolution because of the steep dependence of resistance on Vg in that
range. This resistance model also assumes that the voltage between the drain and the source
is constant at -0.8 V. The source voltage of the PMOS transistors actually varies depending
on the bitline voltage of the array being accessed, which depends on the sense amplifier
design used for the array. The bitline voltages were microprobed during read access to
verify the VDS across the memory cell. Thus, each memory array tested had memory cells
with slightly different resistance characteristics.
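The mapping from gate voltage to emulated cell resistance can be sketched as a simple ratio of the drain-source voltage to the measured drain current. The snippet below is a minimal illustration under that assumption; the function name and the sample readings are hypothetical and are not measurements from the test chip.

```python
def effective_resistance(v_ds, i_d):
    """Return the large-signal resistance |V_DS| / |I_D| in ohms."""
    if i_d == 0:
        raise ValueError("drain current must be nonzero")
    return abs(v_ds) / abs(i_d)


# Drain-source bias assumed by the resistance model in the text.
V_DS = -0.8

# Hypothetical (gate voltage in V, drain current in A) readings, for illustration only.
samples = [(0.0, -800e-6), (0.4, -80e-6), (0.8, -8e-6)]

r_by_vg = {vg: effective_resistance(V_DS, i_d) for vg, i_d in samples}

for vg, r in r_by_vg.items():
    print(f"Vg = {vg:.1f} V -> R = {r:.0f} ohm")
```

Because the actual VDS depends on the bitline voltage set by each sense amplifier design, the same gate voltage would map to a somewhat different resistance in each array, which is why the bitline voltages were probed separately.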
7.2 Device Measurement Results
Test structures for both of the lateral NPN transistor designs described in Chapter 4 were
fabricated and microprobed. The I-V characteristics are shown in Figure 7.3 and Figure 7.4.
Unfortunately, the gain (β) of the lateral BJTs was highly inconsistent. For the MOSFET-
style lateral BJT, β varied from 13.6 to 21.3 on test structures from eight die samples. For
the octagonal-style lateral BJT, β varied from 7.78 to 22.1 on test structures from the same
sample set. The expected β for a vertical BJT fabricated with the same emitter perimeter
(4 µm) is 21.4.
Figure 7.2: Effective resistance of PMOS memory cell vs. gate voltage.
Resistance is shown in ohms. The source and drain are
biased at 0.8 V and 0 V, respectively.
Figure 7.3: I-V characteristics of the standard MOSFET-style BJT.
Figure 7.4: I-V characteristics of the octagonal-style BJT.
It is apparent from the I-V plots that the lateral devices suffer from significant Early
effect. The Early voltage was extracted for each lateral NPN device by fitting a regression
line to the I-V curves shown in Figure 7.3 and Figure 7.4. The MOSFET-style device
had an Early voltage of 5.7 V while the octagonal NPN device had an Early voltage of
4.8 V. Typical Early voltages for vertical NPN transistors of the same dimensions should
be closer to 50 V [44]. The Early effect is caused by a narrowing of the base width as a
reverse bias across the collector-base junction increases the collector-base depletion width.
The collector region should have a more lightly doped layer to allow the formation of a
wide depletion region extending into the collector, rather than into the base. Because the
collector has the same doping concentration as the emitter in a lateral device, the Early
voltage is significantly decreased.
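The regression-line extraction can be sketched as follows. In the forward-active region the collector current is approximately I_C = I_C0 (1 + V_CE / V_A), so a straight line fitted to the I-V curve crosses zero current at V_CE = -V_A, and the Early voltage is the fitted intercept divided by the fitted slope. This is a minimal sketch of that standard procedure, not the exact fitting script used for Figure 7.3 and Figure 7.4; the fitting range, the function names, and the synthetic data points are assumptions.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def early_voltage(v_ce, i_c):
    """Early voltage from the x-intercept of the fitted line (V_A = intercept / slope)."""
    slope, intercept = fit_line(v_ce, i_c)
    return intercept / slope


# Synthetic forward-active data generated with an assumed V_A of 5.7 V.
V_A_TRUE = 5.7
I_C0 = 10e-6
v_ce = [0.5, 1.0, 1.5, 2.0]
i_c = [I_C0 * (1 + v / V_A_TRUE) for v in v_ce]

v_a_est = early_voltage(v_ce, i_c)
print(f"Extracted Early voltage: {v_a_est:.2f} V")
```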
7.2.1 Discussion of lateral NPN transistor characteristics
There are a number of factors that could have contributed to the wide variations in β.
First, the lateral BJT device has a base width that is determined by the gate length of the
MOSFET, which was 0.18 µm in this case. A vertical BJT can have a base width of only
tens of nanometers. The base current does not depend on the gate length, but the collector
current and hence the gain are inversely proportional to the gate length [28]. Gate-length
variations are frequently present in CMOS technology, resulting in variations in β. This
was probably exacerbated by the implementation of octagonal-shaped gate designs.
When designing the lateral NPN devices, the focus was primarily on maximizing the
emitter perimeter, because the electrons injected from the emitter to the base are mostly
collected laterally. However, the substrate that forms the base is also underneath the emitter.
This provides a region for carriers from the base to establish base current via injection
across the base-emitter junction. Thus, the base current depends strongly on the emitter
area. The MOSFET-style lateral BJT has an emitter area approximately three times that
of the octagonal lateral BJT, and hence should have a base current that is approximately
three times that of the octagonal device. While the collector currents may be the same, β is
calculated as a function of the base current, and this may explain why the MOSFET-style
lateral BJT generally has a lower gain than the octagonal-shaped device.
The lack of uniformity in the lateral bipolar transistors can also be explained by the fact
that we are trying to exploit the parasitic devices that fabrication processes usually seek
to minimize. However, as the lateral devices were fabricated with the same base-emitter
junction area as a 2 µm x 2 µm vertical device, it was expected that the distribution of
measured β values should be centered around 21.4. There are a number of factors that
could have reduced the expected gain, as explained in Section 4.4. Finally, one factor that
had not been taken into account during design was the vertical parasitic bipolar junction
between the source, substrate, and deep n-well region, shown in Figure 4.8. This vertical
device would not have much effect on the collector current of the lateral BJT, but the deep
n-well can control how many minority carriers injected from the emitter recombine in the
substrate region. When the deep n-well has a positive bias, it collects the carriers
before they can recombine with holes in the base region, decreasing the base current and
effectively increasing the gain. Ideally, the deep n-well should be connected to the collector
of the lateral NPN device. Unfortunately, this parasitic condition could not be mitigated
in our test chip. The deep n-well region was specifically used for device isolation in the
substrate and hence was subjected to a significant amount of noise.
A technology process optimized for lateral bipolar devices would be better suited to the
memory array design that utilizes BJT sense amplifiers. The 0.18 µm gate length is too
long to serve as a narrow base, and the substrate contacts could not be placed close enough
to the active area to minimize base resistance. The lack of gate-length uniformity was
probably the biggest contributor to variations in β. More advanced technology processes
should solve both of these design issues in the future. However, the symmetric doping
profile of the lateral bipolar devices causes a decrease in the Early voltage and limits the
emitter efficiency, and this characteristic will probably not change with more advanced
lithography.
7.3 Sense Amplifier Verification
Read operations were performed for each row of memory shown in Figure 7.1. The test
procedure involved setting a single 16x256 memory array to a fully low-resistance config-
uration by biasing the PMOS memory cell gate voltages to VRon. The leftmost wordline
would alternate between high- and low-resistance states (Figure 7.5) by setting gate volt-
ages to VRoff and VRon, respectively. This wordline would then be selected, biased, and
Figure 7.5: Array configuration during a test read operation.
the sense amplifier outputs would be latched and shifted out through the I/O pins. For each
array test, VRoff begins at a low enough voltage that ROFF is almost as low as RON. If the
outputs could not be successfully differentiated at the I/O pins, then VRoff was increased
by 10 mV and the wordline was accessed again. If the outputs could not be distinguished even
at the highest value of VRoff, then RON was increased and the process was repeated.
In this manner, we iteratively determined the lowest RON and ROFF combinations
that could still produce a distinguishable output through the latch.
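The sweep described above amounts to a nested search: the inner loop steps VRoff upward, and the outer loop raises RON only after every VRoff setting has failed. The following Python sketch illustrates that control flow. The `read_ok` callback stands in for the actual bench read and compare step, and the pass criterion and value ranges in the example are invented purely so the sketch runs; they do not model the test chip.

```python
def find_min_resistances(read_ok, r_on_values, v_roff_mv_values):
    """Return the first (r_on, v_roff_mv) pair that reads correctly, or None.

    Both sequences are scanned from their lowest value upward.
    """
    for r_on in r_on_values:               # raise RON only if all VRoff steps fail
        for v_roff_mv in v_roff_mv_values:  # step VRoff upward in fixed increments
            if read_ok(r_on, v_roff_mv):
                return r_on, v_roff_mv
    return None


def toy_read_ok(r_on_ohm, v_roff_mv):
    """Hypothetical pass criterion used only to exercise the search."""
    return r_on_ohm >= 1500 and v_roff_mv >= 650


r_on_values = [1000, 1500, 2000]               # hypothetical RON settings in ohms
v_roff_mv_values = list(range(600, 701, 10))   # 10 mV steps, in millivolts

result = find_min_resistances(toy_read_ok, r_on_values, v_roff_mv_values)
print(result)
```

Expressing VRoff in integer millivolts avoids accumulating floating-point error across repeated 10 mV increments.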
Failures (in the form of high-resistance cells being detected as low-resistance cells, or
vice versa) generally first occurred at the very bottom of the memory array, while reading
bit number 256 on the wordline, and would propagate upwards to the bit closest to the
wordline driver. This is probably due to the IR drop along the wordline providing a lower
read voltage to the farther cells. Because our memory array architecture requires 32 bitlines
in an array, we chose to designate a failure mode as the point when bit number 32 along the
wordline first fails.
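The IR-drop explanation can be illustrated with a simple resistive-ladder estimate. The sketch below assumes that every cell on the selected wordline sinks the same constant current and that adjacent bits are separated by an identical wire resistance; both are simplifications, and the numerical values are hypothetical rather than extracted from the test chip layout.

```python
def wordline_voltages(v_drive, r_segment, i_cell, n_bits):
    """Voltage seen at each bit position, starting nearest the wordline driver."""
    voltages = []
    v = v_drive
    for k in range(n_bits):
        # The segment feeding bit k carries the current of bits k..n_bits-1.
        v -= r_segment * i_cell * (n_bits - k)
        voltages.append(v)
    return voltages


# Hypothetical values chosen only for illustration.
v = wordline_voltages(v_drive=1.8, r_segment=1.0, i_cell=1e-6, n_bits=256)
drop_at_bit_32 = 1.8 - v[31]
drop_at_bit_256 = 1.8 - v[255]

print(f"Drop at bit 32:  {drop_at_bit_32 * 1e3:.3f} mV")
print(f"Drop at bit 256: {drop_at_bit_256 * 1e3:.3f} mV")
```

Under these assumptions the read voltage falls monotonically with distance from the driver, which is consistent with bit 256 failing first and bit 32 failing later.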
The results of the read operation tests are plotted in Figure 7.6 through Figure 7.9. Fig-
ure 7.6 and Figure 7.7 show the shmoo results from MOSFET sense amplifier design (1), as
Figure 7.6: Minimum detectable RON and ROFF values in a
32-wordline array with MOSFET sense amplifier design
(1).
described in Section 4.2. The shaded areas indicate the RON and ROFF combinations that
failed, while the white areas indicate resistance combinations that yielded a differentiable
output. As seen in Figure 7.6, with a memory array of only 32 wordlines, RON can be as
low as 1746 Ω while ROFF can be as low as 2400 Ω. When the number of wordlines is
increased to 64, RON can only be decreased to 2290 Ω while ROFF can only be as low as
4016 Ω.
Figure 7.7: Minimum detectable RON and ROFF values in a
64-wordline array with MOSFET sense amplifier design
(1).
Figure 7.8: Minimum detectable RON and ROFF values in a
32-wordline array with MOSFET sense amplifier design
(2).
A much larger white region can be seen from the test results of the 4x-sized MOSFET
sense amplifier (2), in Figure 7.8 and Figure 7.9. Here, RON can be as low as 1065 Ω with
ROFF as low as 2182 Ω. When the number of wordlines is increased to 64, RON must be at
least 1553 Ω and ROFF must be at least 2600 Ω.
Figure 7.9: Minimum detectable RON and ROFF values in a
64-wordline array with MOSFET sense amplifier design
(2).
Figure 7.10: The inconsistencies in the lateral NPN transistor
characteristics render the sense amplifier outputs
indistinguishable.
Unfortunately, the same test procedure could not be applied to the rows implemented
with the lateral BJT sense amplifiers. Because of the device inconsistencies described in
the previous section, it was impossible to read an entire column of bits. This problem is
visualized in Figure 7.10. Each NPN sense amplifier would have a different output profile
depending on the gain of the individual devices. Because the buffers following the sense
amplifiers were tuned to latch a certain output range, a memory cell could be read as either
a high-resistance state or a low-resistance state depending on the sense amplifier device
characteristics.
Figure 7.11: Minimum detectable RON and ROFF values in a
32-wordline array with BJT sense amplifier design
(MOSFET-style).
Instead, one functional sense amplifier was chosen from each of the two bipolar tran-
sistor arrays and tested with 32 wordlines. The results are shown in Figure 7.11 and Fig-
ure 7.12. The MOSFET-style lateral bipolar sense amplifier can detect a low-resistance
state of 306 Ω and a high-resistance state of 1476 Ω. The octagonal lateral bipolar sense
amplifier can detect a low-resistance state of 400 Ω and a high-resistance state of 1377 Ω.
Figure 7.12: Minimum detectable RON and ROFF values in a
32-wordline array with BJT sense amplifier design
(octagonal-style).
Of course, the data presented for the NPN transistor sense amplifier designs are not
reproducible because of the variations in the lateral BJT characteristics. However, they do
demonstrate the potential of the NPN sense amplifier designs.
7.4 Voltage Scaling Results
Voltage-scaling experiments were also performed to examine how supply voltage (VDD)
variation affects the read reliability. The nominal voltage of the 0.18 µm CMOS technology
used is 1.8 V. These tests scaled VDD from 1.4 V to 2.0 V. The reference current for sensing
was adjusted accordingly.
Figure 7.13 shows the results of scaling the voltage while reading a 32-wordline array
using the MOSFET sense amplifier design (1). The area above each curve indicates the
resistance combinations that were successfully detected, while the area below each curve
indicates the failed resistance combinations for that voltage. Both RON and ROFF values
needed to be slightly raised when VDD was reduced to 1.6 V, and the memory array failed
to function at 1.4 V. There was no apparent benefit to increasing VDD to 2.0 V.
With MOSFET sense amplifier design (2), VDD could be scaled down to 1.4 V with no
visible effect on the resistance requirements. This is shown in Figure 7.14.
Figure 7.13: Voltage scaling results for a 32-wordline array with
MOSFET sense amplifier design (1).
Figure 7.14: Voltage scaling results for a 32-wordline array with
MOSFET sense amplifier design (2).
Looking back at the memory array design tradeoffs described in Chapter 4, it is clear
that the value of RON must increase with a lower bias voltage, because the amount of cur-
rent generated from a biased memory cell will no longer be sufficient to withstand the ef-
fects of leakage across unselected cells. Ultimately, the memory architecture fails because
the sense amplifier latches are sized for a fixed range of sense amplifier output voltage.
When VDD becomes sufficiently low, then the output voltage falls outside the range that
can be latched. The 4x MOSFET sense amplifier design (2) does not suffer from this prob-
lem because the original sense amplifier output range begins at a lower voltage due to the
wider NMOS transistors and their lower effective resistance.
The same individual memory cells were selected for testing using the NPN transistor
sense amplifier designs. The results are shown in Figure 7.15 and Figure 7.16. There does
not appear to be any significant effect on the detection limits of the resistance values from
voltage scaling, probably also because the NPN devices have lower resistances and see a
lower voltage drop to begin with.
Figure 7.15: Voltage scaling results for a 32-wordline array with BJT
sense amplifier design (MOSFET-style).
Figure 7.16: Voltage scaling results for a 32-wordline array with BJT
sense amplifier design (octagonal-style).
Chapter 8
Conclusion
A NOVEL MEMORY architecture derived from resistance-change memory has been
designed and simulated in this work. The cross-point memory array uses a sin-
gle resistive element to combine the functions of both data storage and addressing. Our
simulation results show that an 8 Gb memory architecture can be accessed with reasonable
power and latency requirements that are competitive with those of NOR Flash. Due to the
absence of individual access transistors, the cross-point memory architecture presented in
this work can be integrated into a 3-dimensional stacked structure simply by layering the
arrays. This can be done in a standard CMOS process without forming epitaxial silicon
layers, as is commonly done in 3D monolithic integration. The resulting multi-layer memory
array has an expected bit-density that far exceeds that of single- and multi-level NAND
flash.
8.1 Summary
We have developed a design strategy for building a 3D cross-point array into an architec-
ture that can effectively manage leakage current based on the parameters of its peripheral
circuitry. Unlike previous designs, no diodes are needed within the memory array. We
have demonstrated the functionality of the read operation in a test chip. The method of
optimizing the peripheral circuitry for a HfO2-based memory cell described in this work
can be applied to future designs involving other resistance-change materials with known
resistivities.
Lateral bipolar junction transistors were designed and fabricated in an effort to create
a high-transconductance sense amplifier that would be available in any triple-well CMOS
process technology. Test results showed that these lateral transistors had inconsistent device
matching characteristics, but the fully functional devices have much potential to serve as
elements in versatile current sense amplifiers.
8.2 Other Considerations
The 3D cross-point memory architecture was designed to someday serve as a replacement
for NAND flash memory. We have demonstrated that it can be superior to NAND flash
in terms of speed and area, but the reliability, data retention, and endurance still need to be
explored.
8.2.1 Scalability
The greatest benefit of a metal-oxide-based cross-point memory over NAND flash is its
scalability. A four-layer cross-point memory can have a significantly greater bit-density
than NAND flash memories fabricated in the same technology node. This provides a great
advantage in the cost-per-bit scaling for future nonvolatile memory. From a lithography
standpoint, the 3D cross-point memory explored in this work should be able to scale well
beyond 20 nm process technologies, unlike flash memory. The limitations to scaling will
come from the resistance-change memory materials, and it is believed that, if filament
formation is the cause of resistive switching in metal oxides, then resistance values will not
change with decreasing cell size [17].
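The bit-density argument follows from the effective cell size of 4F²/n stated in the abstract, where F is the minimum feature size and n is the number of stacked layers. The sketch below evaluates that expression for the array area only; it ignores peripheral circuitry, and the function names and the 20 nm, four-layer example are illustrative assumptions rather than a design point from this work.

```python
NM2_PER_MM2 = 1e12  # square nanometers in one square millimeter


def effective_cell_area_nm2(feature_nm, layers):
    """Effective area per bit, 4 * F^2 / n, in square nanometers."""
    if layers < 1:
        raise ValueError("layers must be at least 1")
    return 4 * feature_nm ** 2 / layers


def array_bits_per_mm2(feature_nm, layers):
    """Array-only bit density, ignoring peripheral circuitry overhead."""
    return NM2_PER_MM2 / effective_cell_area_nm2(feature_nm, layers)


area = effective_cell_area_nm2(20, 4)
density = array_bits_per_mm2(20, 4)
print(f"Effective cell area: {area:.0f} nm^2, density: {density:.2e} bits/mm^2")
```

With four layers the effective cell area equals F², one quarter of the single-layer 4F² cell, so each added layer increases the array density in direct proportion.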
Metal oxides have lower current and voltage requirements for programming than flash
or phase-change memory. Not only does this lead to faster read and write times, but it
also benefits memory reliability. The low power requirements of metal-oxide resistance
switching make it highly suitable for low-power applications.
In the future, resistance-change metal oxides could also potentially be programmed as
multi-level cells with some additional sense amplifier complexity.
8.2.2 Retention and endurance
Endurance tests performed on HfO2-based resistance-change materials have demonstrated
that more than 10^6 SET/RESET cycles can be performed on a memory cell without a decrease
in the sense margin [15]. This easily matches the endurance of current NAND flash
memories. Resistance-change materials are not subject to the tunnel-oxide charge-trapping
problems that flash memory cells suffer from, but have been observed to be sensitive to
switching cycles at high temperature because of the semiconductor-like high-resistance
state at temperatures greater than 200 °C.
Data retention tests have been promising as well, showing that a lifetime of at least 10
years can be expected [15]. This is also competitive with flash memories on the market
today.
8.2.3 Reliability
One issue preventing the commercial implementation of resistance-change metal oxides
as memory cells in the past was the loose distribution of resistance values for RON and
ROFF . This has been improved in recent studies [15]. One promising aspect of resistance-
change metal-oxides is that any resistance variations tend to be towards higher resistance
values. As our experimental data have shown, higher values of RON are preferable to a
large high/low resistance window.
8.3 Recommendations for future work
Post-fabrication processing and the deposition of resistance-change materials were not
available for the test chip presented in this work. A more robust verification of the read
operation should be conducted using actual resistance-change memory cells, as this would
allow the fabrication of actual cross-point arrays. The simulated performance parameters
could then be verified. Noise effects should also be evaluated, as a cross-point array of
resistors may be a considerable source of electrical and thermal noise.
Further characterization of the resistance-change material is also necessary in order
to guarantee that the 3D cross-point memory will be practical for data storage. HfO2-
based memory cells have been demonstrated to have favorable characteristics, but larger-
scale studies will need to be performed to study their potential as a commercial product.
Factors such as yield and susceptibility to single-event upsets should be evaluated. Also,
the scalability of metal-oxide resistance change materials beyond 20 nm technology nodes
still needs to be studied.
Error-correction will likely be necessary for cross-point memory to serve as a high-
density storage device in most applications. The bit-error rate should be determined and
appropriate error-correction encodings explored. This will add latency and area overhead
to the current design. Depending on the failure modes of the memory cells, row and
column redundancy may also need to be implemented.
Now that a method for optimizing the read circuitry for a cross-point memory array has
been demonstrated, it is possible to apply the same design strategy to the write operation.
The programming operation is expected to be competitive with both NAND and NOR flash
in terms of speed because of the relatively low voltage requirements of resistance-change
materials. If the peripheral circuitry for accommodating the write operation can be made
sufficiently compact, then the 3D cross-point memory will indeed be a viable replacement
for NAND and NOR flash in future process generations.
Bibliography
[1] G. Moore. “Cramming more components onto integrated circuits”, Electronics, 38, 8,
1965.
[2] Micron Technologies. “NAND Flash 101: An Introduction to NAND Flash and How
to Design It In Your Next Product”, Micron Technical Note TN-29-19, 2006.
[3] R. Bez, et al. “Introduction to Flash Memory”, Proceedings of the IEEE, 91, 4, 2003.
[4] L. Mason. “Memory Market Outlook”, MemCon San Jose, 2009.
[5] M. Chi and A. Bergemont. “Multi-level flash/EPROM memories: new self-convergent
programming methods for low-voltage applications”, Technical Digest of Interna-
tional Electron Devices Meeting, 1995.
[6] T.-K. Kim, S. Chang, and J.-H. Choi. “Floating gate technology for high performance
8-level 3-bit NAND flash memory”, Solid-State Electronics, 53, 7, 2009.
[7] D. Ielmini, A. Spinelli, A. Lacaita. “Recent developments on Flash memory reliabil-
ity”, Microelectronic Engineering, 80, 17, 2005.
[8] E. Doller. “Making Sense of It All: The Ever-changing Role and Challenges of Non-
volatile Memory Today and Tomorrow”, MemCon San Jose, 2008.
[9] R. Scheuerlein. “Magneto-resistive IC memory limitations and architecture implica-
tions”, Seventh Biennial Nonvolatile Memory Technology Conference, June 1998.
[10] S. Lai, T. Lowrey. “OUM - A 180 nm Non-Volatile Memory Cell Element Technology
For Stand Alone and Embedded Applications”, Proc. International Electron Devices
Meeting, 2001.
[11] M. Gill, T. Lowrey, J. Park. “Ovonic unified memory - a high performance non-
volatile memory technology for stand-alone memory and embedded applications”,
Proc. ISSCC, 2002.
[12] A. Beck, J.G. Bednorz, C. Gerber, C. Rossel, D. Widmer. “Reproducible switching
effect in thin oxide films for memory applications”, Applied Physics Letters,
139–141, 77, 1, 2000.
[13] J. Li, H. Liu, S. Salahuddin, K. Roy. “Variation-tolerant Spin-Torque Transfer (STT)
MRAM array for yield enhancement”, IEEE Custom Integrated Circuits Conference,
2008.
[14] “PCM Becomes a Reality”, Objective Analysis Semiconductor Market Research, Au-
gust 2009.
[15] H. Y. Lee, et al. “Low power and high speed bipolar switching with a thin reactive
Ti buffer layer in robust HfO2 based RRAM”, Proc. IEEE International Electron
Devices Meeting, 2008.
[16] B.J. Choi, et al. “Resistive switching mechanism of TiO2 thin films grown by atomic-
layer deposition”, Journal of Appl. Phys., 98, 033715, 2005.
[17] I. Baek, et al. “Highly Scalable Non-volatile Resistive Memory using Simple Binary
Oxide Driven by Asymmetric Unipolar Voltage Pulses”, Proc. International Electron
Devices Meeting, 2004.
[18] S. Wong, et al. “Monolithic 3D Integrated Circuits”, International Symposium on
VLSI Technology Systems and Applications, 2007.
[19] S.-M. Jung, et al. “Three Dimensionally Stacked NAND Flash Memory Technology
Using Stacking Single Crystal Si Layers on ILD TANOS Structure for Beyond 30nm
Node”, Proc. International Electron Devices Meeting, 2006.
[20] K. Kim, et al. “Multilevel Programmable Oxide Diode for Cross-Point Memory by
Electrical-Pulse-Induced Resistance Change”, IEEE Electron Device Letters, 30, 10,
2009.
[21] M. Johnson, et al. “512-Mb PROM with a three-dimensional array of diode/antifuse
memory cells”, IEEE Journal of Solid-State Circuits, 38, 11, 2003.
[22] H.-S. Wong, Y. Taur, and D. Frank. “Discrete random dopant distribution effects in
nanometer-scale MOSFETs”, Microelectronics and Reliability, 38, 9, 1998.
[23] J. Brewer and M. Gill. Nonvolatile Memory Technologies with Emphasis on Flash.
IEEE, Wiley & Sons, Inc., Hoboken, NJ 2008.
[24] T. Blalock, R. Jaeger. “A High-Speed Clamped Bit-Line Current-Mode Sense Ampli-
fier”, IEEE Journal of Solid-State Circuits, 542–548, 26, 4, 1991.
[25] T. Tanzawa, et al. “Design of a sense circuit for low-voltage Flash memories”, IEEE
Journal of Solid-State Circuits, 35, 10, 2000.
[26] Linear Technology. “LM101A/LM301A/LM107/LM307 Operational Amplifiers”,
Linear Technology Datasheet, 1994.
[27] E. A. Vittoz. “MOS transistors operated in the lateral bipolar mode and their applica-
tion in CMOS technology”, IEEE Journal of Solid-State Circuits, 18, 3, 1983.
[28] Z. Feng, et al. “Gate Controlled Vertical-Lateral NPN Bipolar Transistor in 90nm RF
CMOS Process”, IEEE Bipolar/BiCMOS Circuits and Technology Meeting, 2008.
[29] M. Daibo, T. Kikuchi, and M. Yoshizawa. “Minority Carrier Diffusion Length Mea-
surement of Semiconductors Using a Multiwavelength Laser SQUID Microscope”,
IEEE Transactions on Applied Superconductivity, 13, 2, 2003.
[30] A. Mohsen, et al. “The Design and Performance of CMOS 256K Bit DRAM Devices”,
IEEE Journal of Solid-State Circuits, 19, 5, 1984.
[31] http://www.mosis.com
[32] V. Quenette, et al. “Electrical Characterization and Compact Modeling of MOSFET
body effect”, 9th International Conference on Ultimate Integration of Silicon, 2008.
[33] Samsung Electronics. “K9XXG08UXA: 512M x 8 Bit/1G x 8 Bit NAND Flash Mem-
ory”, Samsung Electronics Datasheet, 2006.
[34] “4-Gbit NAND built at 65 nm”, EE Times, July 17, 2006.
[35] D. Nobunaga, et al. “A 50nm 8Gb NAND Flash Memory with 100MB/s Program
Throughput and 200MB/s DDR Interface”, International Solid-State Circuits Confer-
ence, 2008.
[36] D.-S. Byeon, et al. “An 8Gb Multi-Level NAND Flash Memory with 63nm STI
CMOS Process Technology”, International Solid-State Circuits Conference, 2005.
[37] http://www.ansoft.com
[38] Numonyx. “Numonyx Axcell M29EW”, Numonyx Datasheet, 2009.
[39] SanDisk. “SanDisk 3-D OTP Memory”, SanDisk Datasheet, Document Number
DS034, 2006.
[40] NAND Flash Memories Application Note. “NAND Flash Memories and Program-
ming NAND Flash Memories Using ELNEC Device Programmers”, ELNEC NAND
Flash, 2008.
[41] W. Dong, P. Liyang, D. Zhigang, and Z. Jun. “Charge pump system sharing the
coupling capacitors for NOR flash memory”, Proc. 5th International Conference on
ASIC, 1, 2003.
[42] M. Combe, et al. “Design of high-speed 128-bit embedded flash memories allowing
in place execution of code”, Solid-State Electronics, 49, 1867-1874, 2005.
[43] R. Micheloni, L. Crippa, M. Sangalli, G. Campardo. “The flash memory read path:
building blocks and critical aspects”, Proceedings of the IEEE, 91, 4, 2003.
[44] E. Conrad, et al. “Early Voltage and Saturation Voltage Improvement in Deep Sub-
Micron Technologies Using Associations of Transistors”, Proc. of 21st annual sym-
posium on integrated circuits and system design, 2008.
[45] Y. Chen, et al. “An Access-Transistor-Free (0T/1R) Non-Volatile Resistance Ran-
dom Access Memory (RRAM) Using a Novel Threshold Switching, Self-Rectifying
Chalcogenide Device”, Proc. International Electron Devices Meeting, 2003.
[46] B.J. Choi, et al. “Resistive switching mechanism of TiO2 thin films grown by atomic-
layer deposition”, Journal of Appl. Phys., 98, 033715, 2005.
[47] R. Cobley, C.D. Wright. “Parameterized SPICE Model for a Phase-Change RAM
Device”, IEEE Transactions on Electron Devices, 53, 1, 2006.
[48] D. MacSweeney, et al. “A SPICE Compatible Subcircuit Model for Lateral Bipolar
Transistors in a CMOS Process”, IEEE Transactions on Electron Devices, 45, 9, 1998.
[49] S.R. Ovshinsky. “Reversible electrical switching phenomena in disordered struc-
tures”, Physical Review Letters, 21, 1968.
[50] H. Sim, et al. “Resistance-Switching Characteristics of Polycrystalline Nb2O5 for
Nonvolatile Memory Application”, IEEE Electron Device Letters, 26, 5, 2005.