
ARRAY ARCHITECTURE FOR A NONVOLATILE

3-DIMENSIONAL CROSS-POINT MEMORY

A DISSERTATION

SUBMITTED TO THE DEPARTMENT OF

ELECTRICAL ENGINEERING

AND THE COMMITTEE ON GRADUATE STUDIES OF

STANFORD UNIVERSITY

IN PARTIAL FULFILLMENT OF THE REQUIREMENTS

FOR THE DEGREE OF

DOCTOR OF PHILOSOPHY

Elaine Ou

March 2010

This dissertation is online at: http://purl.stanford.edu/dg048ny4241

© 2010 by Elaine Ou. All Rights Reserved.

Re-distributed by Stanford University under license with the author.


I certify that I have read this dissertation and that, in my opinion, it is fully adequate in scope and quality as a dissertation for the degree of Doctor of Philosophy.

S Wong, Primary Adviser


Thomas Lee


Yoshio Nishi

Approved for the Stanford University Committee on Graduate Studies.

Patricia J. Gumport, Vice Provost Graduate Education

This signature page was generated electronically upon submission of this dissertation in electronic format. An original signed hard copy of the signature page is on file in University Archives.


Abstract

THIS WORK explores the design and capabilities of a three-dimensional cross-point

array structure suitable for use with resistance-change non-volatile memory. The

resistance-change cell serves as both the access element and the memory element, eliminat-

ing the need for individual selection transistors or diodes. This enables the memory to be

fabricated in arrays with a line spacing of F, the minimum feature size for a given process

technology. By stacking the cross-point arrays in n layers, we achieve an effective cell size

of 4F²/n.

Previous works describing transistor-free memory arrays have been limited by exces-

sive leakage current across unselected bitlines and wordlines during memory access. This

work presents novel architecture and circuit techniques that minimize leakage current ef-

fects while maintaining a high effective bit density. A test chip fabricated in 0.18 µm

CMOS technology allows us to verify the architecture and circuit functionality.

The performance of an 8 Gb memory chip built in 65 nm technology has been simulated.

A random access time of 104 ns is achieved with a power dissipation of 61.2 µW. This

makes the 3-D cross-point memory competitive with NOR flash in terms of read time, and

competitive with NAND flash in terms of area efficiency.


Acknowledgments

THE WORK presented in this dissertation is the result of a long and difficult journey

made possible only with the support and guidance of many people.

First and foremost, I must thank my advisor, Professor Simon Wong. He oversaw my

research efforts and taught me to guide my own work. His insight and experience have been

invaluable, and his generosity with his time and mentorship was second to none. It has

been an honor for me to work with Professor Wong.

I would also like to thank my associate advisor, Professor Yoshio Nishi. He has been

a valuable source of guidance and input during my time at Stanford. Despite his busy

schedule, he always took the time to check on my progress. Next, I would like to thank

Professor Thomas Lee for serving on my reading committee. As the cofounder of Matrix

Semiconductor, he was able to provide insightful industry perspectives to my research. I

am also most indebted to Professor Ada Poon for serving as my committee chair on very

little notice.

I would next like to thank June Wang and Natasha Newson, who always made the

administrative details run smoothly.

This research was made possible by funding from the American Society for Engineer-

ing Education and the Nonvolatile Memory Technology Research Initiative (NMTRI). The

NMTRI group provided the support and opportunity for me to interact and discuss my work

with many key industry members. Furthermore, I would like to thank the 3D Technology


Group at SanDisk for allowing me to spend a summer working alongside nonvolatile mem-

ory experts including Roy Scheuerlein and Luca Fasoli.

My experience at Stanford University would not have been the same without the mem-

bers of Professor Wong’s research group, who were not just great colleagues but also be-

came my close friends. I would like to thank Dr. Haitao Gan, Dr. Yun Bai, Dr. Paul

Park, Dr. Henry Nho, Dr. Andrew Poon, Dr. Aaron Gibby, Dr. Wei Wang, Jeongha Park,

Kasra Omid-Zahoor, Sung Il Park, Wanki Kim, Young Yang-Liauw, Zhiping Zhang, and

Saihua Lin for their collaboration, assistance, and input. I would especially like to express

my gratitude to Stanley Yeh, who provided me with extensive discussions regarding my

research, and who will be carrying out a continuation of this work.

Additionally, I am grateful to Jenny Hu and Warren Mar, who suffered with me through

the trials and tribulations of life as a graduate student, and who eventually became my best

friends at Stanford University.

Finally, I would like to dedicate this work to my family: To my mother, for her un-

conditional encouragement and support; To my father, for challenging me to reach the best

of my potential; To my brother, for being my lifelong friend and competitor; and to the

memory of my grandmother, who always believed in me.


Contents

Abstract
Acknowledgments
List of Tables
List of Figures

1 Introduction
  1.1 Organization

2 Motivation and Prior Work
  2.1 Flash Memory Overview
  2.2 Emerging Memories
  2.3 3D Integration Technologies
  2.4 Motivation

3 Cross-point Memory Design Methodology
  3.1 Cross-point Memory Array Goals
  3.2 Memory Write Operation
  3.3 Memory Read Operation
  3.4 Memory Cell Characterization
  3.5 Design Process

4 Circuit Design Tradeoffs
  4.1 A Resistive Network
  4.2 MOSFET Current Amplifier
  4.3 Bipolar Transistor Current Amplifier
  4.4 Lateral Bipolar Junction Transistors in Standard CMOS Process
  4.5 Sense Amplifier Area Considerations

5 Three-Dimensional Memory Architecture
  5.1 Memory Array Architecture
  5.2 Circuit Techniques for Vertical Architecture
  5.3 Layout Techniques and Array Efficiency
  5.4 Stacking Memory Layers
  5.5 3D Memory Area Comparisons

6 Performance Analysis
  6.1 8 Gb Memory Architecture
  6.2 Memory and Array Models
  6.3 Waveform Analysis and Timing Diagrams
  6.4 Performance Comparisons with Commercial Products

7 Test Chip Design
  7.1 Memory Array Implementation
  7.2 Device Measurement Results
  7.3 Sense Amplifier Verification
  7.4 Voltage Scaling Results

8 Conclusion
  8.1 Summary
  8.2 Other Considerations
  8.3 Recommendations for future work

References


List of Tables

3.1 Characteristics of an HfO2-based RRAM device
4.1 Sense amplifier designs and impact on layout area when considering a 4 kb row
5.1 NMOS saturation current for different process technologies
5.2 Metal layers and functions in multi-layer cross-point arrays
5.3 NAND flash memory area efficiency comparisons
6.1 HfO2-based memory cell model parameters
6.2 Nonvolatile memory performance comparisons
6.3 Read path timing signal comparisons


List of Figures

2.1 Flash memory cell programming
2.2 Comparison of NAND and NOR flash architectures
2.3 NAND GB shipments and price/GB
2.4 Flash process technology scaling
2.5 SEM photographs of 3D stacked NAND cell string
2.6 SEM photograph of SanDisk 8-layer cross-point memory

3.1 RESET and SET operation on a cross-point array
3.2 Read operation on a cross-point array
3.3 Op-amp current-to-voltage converter
3.4 Exploded view of an LM107 operational amplifier
3.5 Single gain stage current-sensing amplifier
3.6 HfO2-based RRAM device
3.7 HfO2 bipolar switching characteristics
3.8 The memory array represented as a resistive divider network

4.1 Sense amplifier design tradeoffs
4.2 Memory array design variables
4.3 Normalized current between the bitline and MOSFET sense amplifier input
4.4 Current-sensing amplifier replacing NMOS transistors with NPN bipolar transistors
4.5 Normalized current between the bitline and BJT sense amplifier input
4.6 Cross-sectional diagram of a vertical NPN transistor
4.7 Cross-sectional diagram of a lateral NPN BJT
4.8 Parasitic bipolar devices
4.9 Lateral NPN device based on standard MOSFET layout
4.10 Lateral NPN device with octagonal emitter shape
4.11 Layout area of the small MOSFET sense amplifiers

5.1 Hybrid NAND/NOR architecture of cross-point memory
5.2 Architecture of a half page
5.3 Layout of bitline-select transistors
5.4 Layout of wordline-select transistors
5.5 Layout of bitline-select transistors under a cross-point memory array
5.6 Cross-sectional view of bitline-selection transistors under a cross-point memory array
5.7 Layout of wordline drivers for a cross-point memory array
5.8 Schematic of bitline-select transistors under a cross-point memory array
5.9 Schematic of wordline drivers under a cross-point memory array
5.10 Cross-sectional view of bitline-selection transistors between adjacent memory arrays
5.11 Two-layer cross-point memory
5.12 Connecting two layers of bitlines to the substrate
5.13 Four-layer memory
5.14 Four-layer memory with buffer connections
5.15 Schematic of wordline drivers under a four-layer cross-point memory array
5.16 Array efficiency vs. number of vertical memory layers

6.1 Four-layer cross-point memory implemented in an 8 Gb architecture
6.2 32 x 32 cross-point memory array
6.3 Multi-layer memory array represented as an RC network
6.4 Simulation waveforms for memory read access
6.5 Read operation timing diagram

7.1 Test chip architecture
7.2 Effective resistance of memory cell vs. gate voltage
7.3 I-V characteristics of the standard MOSFET-style BJT
7.4 I-V characteristics of the octagonal-style BJT
7.5 Array configuration during a test read operation
7.6 RON and ROFF values in a 32-wordline array with MOSFET design (1)

7.10 Inconsistencies in the lateral NPN transistor characteristics
7.11 RON and ROFF values in a 32-wordline array with BJT design (MOSFET-style)
7.12 RON and ROFF values in a 32-wordline array with BJT design (octagonal-style)
7.13 Voltage scaling results for a 32-wordline array with MOSFET design (1)
7.14 Voltage scaling results for a 32-wordline array with MOSFET design (2)
7.15 Voltage scaling results for a 32-wordline array with BJT design (MOSFET-style)
7.16 Voltage scaling results for a 32-wordline array with BJT design (octagonal-style)


Chapter 1

Introduction

CMOS TECHNOLOGY scaling has enabled the integration of increasingly dense and

numerous functionalities into a single chip at a rate described by Moore’s law [1].

Historically, the number of transistors per chip has doubled every two-to-three years. Tech-

nology process scaling increases bit density, but as features get smaller, reliability and

accuracy suffer. Memory suppliers are seeking an alternative to traditional NAND flash

memory to enable the continuation of scaling without sacrificing reliability and cost.

With the development of resistance-change materials, it is possible to substitute resistance-

change memory devices where transistors were once used. In this work we demonstrate the

capability of a cross-point memory array structure that utilizes a resistance-change cell to

serve as both the access element and the memory element. The cell can be manufactured

entirely with existing silicon-based complementary metal-oxide-semiconductor (CMOS)

fabrication tools and materials. With no access transistor, a single memory cell has a size

of 4F², where F is the minimum feature size available for a given process technology. We

have modeled a memory architecture that supports multiple layers of cross-point arrays,

resulting in an effective cell size of 4F²/n, where n is the number of memory layers.
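The effective cell size relation can be illustrated with a short calculation. The following is a minimal sketch, not part of the original design flow: the function names are hypothetical, the 65 nm feature size is used only as an example value, and the density figure ignores all peripheral circuitry overhead.

```python
def effective_cell_area_nm2(feature_nm: float, layers: int = 1) -> float:
    """Effective area per bit, 4F^2 / n, in square nanometres."""
    if layers < 1:
        raise ValueError("layers must be at least 1")
    return 4.0 * feature_nm ** 2 / layers


def bits_per_mm2(feature_nm: float, layers: int = 1) -> float:
    """Ideal array bit density; peripheral circuitry overhead is ignored."""
    nm2_per_mm2 = 1e12  # 1 mm = 1e6 nm, so 1 mm^2 = 1e12 nm^2
    return nm2_per_mm2 / effective_cell_area_nm2(feature_nm, layers)


if __name__ == "__main__":
    # Example only: F = 65 nm with 1, 2, 4 and 8 stacked layers.
    for n in (1, 2, 4, 8):
        area = effective_cell_area_nm2(65.0, n)
        density = bits_per_mm2(65.0, n)
        print(f"n={n}: {area:.0f} nm^2 per bit, {density:.3e} bits/mm^2")
```

Doubling the number of layers halves the effective area per bit without changing F, which is why stacking is attractive once lithographic scaling becomes difficult.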


1.1 Organization

The organization of this thesis is as follows: Chapter 2 explores existing resistance-change

memory technologies and other alternatives used to improve the scalability of nonvolatile

memory. An overview of 3-dimensional integration technologies is also presented.

Chapter 3 describes the program and read strategies employed in a 2-dimensional cross-

point memory array without using access transistors. This chapter also identifies the charac-

teristics of the resistance-change material used for the memory elements. Then, the design

procedure for the memory architecture is described.

In Chapter 4, we further explore the read operation at a circuit level, and present the

novel design and layout techniques that minimize leakage current effects while maintaining

a high effective bit density.

Chapter 5 builds upon the 2-dimensional cross-point memory array and shows how it

can be built into a 3-dimensional architecture. Here, we also give consideration to the pe-

ripheral circuitry overhead and requirements for a vertical design. The design for an 8 Gb,

four-layer memory architecture is described. The following chapter presents a performance

analysis of the read operation on the 8 Gb 3-dimensional memory and compares the results

to those of other nonvolatile memory products fabricated in similar process technologies.

To demonstrate the feasibility of the cross-point memory architecture, we designed

and fabricated a prototype that emulates a resistance-change cross-point memory from a

functional standpoint. Chapter 7 describes the test chip implementation and design choices

as well as the measurement results.

Finally, we conclude with a summary of the research and discuss possible directions

for future work.


Chapter 2

Motivation and Prior Work

FLASH MEMORY suppliers aim to continue increasing the bit density (bits per mm²)

of memory chips, but have to do so without sacrificing reliability and cost. This

chapter explores flash memory technologies such as multi-level cells that have effectively

increased the bit density in past years, and explains why the future of flash memory may

be limited by technology scaling despite recent advancements. We also look at trends in

emerging nonvolatile memories that could serve as potential replacements for NAND and

NOR flash. Finally, we present prior methods of 3D memory integration and give a brief

overview of current 3D integration technologies.

2.1 Flash Memory Overview

Flash memory cells consist of a single floating-gate transistor and retain a ‘1’ or a ‘0’ bit by

storing a certain amount of charge on the floating gate, as shown in Figure 2.1. The amount

of charge stored on the floating gate determines the threshold voltage of the device [2]. The

floating gate is insulated from the substrate by a layer of tunnel oxide, and from the control

gate by the inter-poly dielectric material.


Figure 2.1: Flash memory cell programming.

There are two primary architectures currently used for flash products. NAND flash is a

high-density memory architecture with bitlines consisting of 16 or 32 cells in series (hence

the name “NAND”). It has a denser layout than NOR flash and is used for most removable

solid-state memory applications today. NOR flash has a separate metal contact for each cell

and is arranged with transistors in parallel (similar to a “NOR” gate). A NOR flash cell has

a cell size of 10F² while a NAND flash cell has a cell size of 4F². Because a NOR flash

cell can be accessed without charging all the memory cells on the bitline in series, it has

a much faster read time than NAND flash. Hence, it is more suitable for execute-in-place

memory applications, meaning that stored instructions can be executed directly without the

need to buffer the data in random-access memory. Figure 2.2 gives a comparison of the key

differences between NOR and NAND flash memory architectures.

NAND and NOR flash also have distinct programming mechanisms. In NOR flash,

charge is stored by applying a high bias voltage to the control gate and the drain to induce

hot-electron injection from the transistor channel to the floating gate. In NAND flash, a bias

voltage is applied only to the control gate, causing a Fowler-Nordheim electron tunneling

current through the tunnel oxide from the source-drain channel to the floating gate. Both

NAND and NOR flash use the Fowler-Nordheim tunneling effect to remove charge from

the floating gate during the erase operation [3].

Figure 2.2: Comparison of NAND and NOR flash architectures [2].


2.1.1 Flash Scalability

Because flash memory has such a uniform layout, it is well-suited for leading-edge pro-

cessing. Flash memory is a market that has a history of fast growth and rapid evolution. As

can be seen in Figure 2.3, the price of NAND flash has continuously fallen in recent years.

Manufacturing costs have not been decreasing at the same pace [4], thus profits are low for

NAND flash suppliers.

Figure 2.3: NAND GB shipments and price per GB [4].

Generally speaking, the best way to reduce manufacturing costs for memory is to reduce

the total die area while maintaining the same capacity. As flash memory technology

has scaled to ever finer lithography processes, nonvolatile memory has become

progressively cheaper and now serves a wider range of capacity-demanding applications.

Mainstream NAND flash memories are currently manufactured on 4x-nm processes with major


NAND flash vendors migrating to 3x-nm this year. In the race to reduce costs, NAND

flash manufacturers are in the process of developing 2x-nm technology. However, with

performance and reliability characteristics severely degraded relative to the 4x-nm gen-

eration, 2x-nm floating gate NAND flash is anticipated to be the last process technology

generation [4].

Multi-level cells have been used in flash memory to increase the number of bits stored

per cell [5]. This increases the memory capacity on a die without scaling the technology

process. Up to eight threshold-voltage levels have been implemented in NAND flash mem-

ory cells, encoding three bits per cell [6]. While this significantly increases the bit-density

of flash memory, it is accompanied by a higher bit-error ratio. Software complexity in the

form of error-correction algorithms is commonly increased to compensate for the bit-error

ratio. While multi-level flash may provide an increase in memory capacity at a certain

technology process, it is even less scalable for future technology nodes.
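The trade-off between levels and margin can be sketched numerically. This is an illustrative sketch under stated assumptions: the function names are hypothetical, the threshold-voltage window value is a placeholder rather than a measured figure, and the levels are assumed to be spread evenly across the window.

```python
def bits_per_cell(levels: int) -> int:
    """Number of bits encoded by a cell with the given number of levels."""
    if levels < 2 or levels & (levels - 1):
        raise ValueError("levels must be a power of two, at least 2")
    return levels.bit_length() - 1  # log2 of a power of two


def level_spacing(window_v: float, levels: int) -> float:
    """Spacing between adjacent levels spread evenly across a voltage window."""
    return window_v / (levels - 1)


if __name__ == "__main__":
    WINDOW_V = 6.0  # hypothetical threshold-voltage window, in volts
    for levels in (2, 4, 8):
        bits = bits_per_cell(levels)
        spacing = level_spacing(WINDOW_V, levels)
        print(f"{levels} levels: {bits} bit(s) per cell, {spacing:.2f} V between levels")
```

Each additional bit doubles the number of levels that must fit in the same window, so the spacing between adjacent levels shrinks quickly. This is consistent with the higher bit-error ratio noted above.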

As shown in Figure 2.4, in sub-50-nanometer process technologies, only tens of elec-

trons are being stored and detected on the floating gate of a single NAND flash memory cell.

This brings to light several reliability concerns. Flash memory cells commonly suffer from

charge leakage due to oxide-tunneling, and these effects are expected to increase when the

oxide thickness decreases with technology scaling. Also, flash charge retention time de-

creases with wear, because defects in the tunnel oxide can be introduced with multiple write

and erase cycles. Currently, the typical lifetime of a NAND flash cell is 10⁶ program/erase

cycles, while it is only 10⁵ cycles for a NOR flash cell. NOR flash uses higher programming

voltages for channel hot electron injection, which drives a larger number of electrons

onto the floating gate than Fowler-Nordheim tunneling, but reduces the endurance of the

memory devices [7]. Thus, flash reliability and accuracy suffer as process technologies scale

below 50 nm, and multi-level cells become difficult to implement. A more scalable

memory technology is crucial for advanced lithography processes.


Figure 2.4: The number of electrons present on the floating gate scales

with process technology [8].

2.2 Emerging Memories

Memory producers are also trying to develop alternative technologies that may be scalable

beyond 20 nm lithography. There is a wide range of emerging memory technologies under

development to someday replace flash memory, and the most notable candidates include

magnetoresistive random access memory (MRAM) [9], phase-change memory [10] [11],

and resistance-change memory [12].

MRAM utilizes two ferromagnetic plates separated by a thin insulating layer for a mem-

ory cell. The lower plate is set to a fixed polarity, while the polarity of the upper plate can

be switched during a write operation. The electrical resistance of the memory cell changes

depending on whether or not the polarity is aligned between the two plates. MRAM was


originally thought to be able to replace SRAM, DRAM, and flash memories with a poten-

tially transistor-free architecture, but its lack of reliability in high-density arrays due to the

magnetic fields produced by the switching mechanism limits its potential applications [13].

Both phase-change and resistance-change memory cells are designed for detectable

changes in conductivity. Phase-change memory is currently designed around a random-

access architecture, similar to that of NOR flash [14]. It operates by using a resistive

heater to melt and recrystallize a chalcogenide memory cell in either a crystalline or amor-

phous state. It is unlikely that phase-change memory could replace NAND flash as a high-

density nonvolatile memory because of the high programming current density (in excess of

10⁷ A/cm²) required to melt the memory cell.
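To give a sense of scale, the quoted current density can be converted into a per-cell programming current. This is a rough sketch only: the function name is hypothetical, and it assumes a square contact whose side equals the feature size, which is an illustrative simplification rather than a description of any particular phase-change device.

```python
def programming_current_a(current_density_a_per_cm2: float, contact_side_nm: float) -> float:
    """Current through a square contact of the given side at the given density."""
    side_cm = contact_side_nm * 1e-7  # 1 nm = 1e-7 cm
    return current_density_a_per_cm2 * side_cm ** 2


if __name__ == "__main__":
    # Assumed example: 1e7 A/cm^2 through a 65 nm x 65 nm contact.
    current = programming_current_a(1e7, 65.0)
    print(f"programming current = {current * 1e3:.3f} mA per cell")
```

Under these assumptions each cell draws a fraction of a milliampere during programming, a current that every access device and programming driver in a dense array would have to supply.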

Traditionally, phase-change memory has received more attention as a future replace-

ment for flash because of its reliable switching characteristics, but as technology improves,

resistance-change metal oxides have also become increasingly popular candidates for non-

volatile memory. Resistance-change metal-oxide materials have been shown to possess

favorable characteristics that make them particularly suitable for a 3D memory architec-

ture. They are highly compatible with modern CMOS processes and have demonstrated

high-speed and low-power switching abilities, with decade-long retention times and better

endurance than flash memory. Metal-oxide-based resistance-change technologies gener-

ally exhibit lower programming current requirements than phase-change memory [15] and

thus should have better scalability. The mechanism behind resistance-change is not yet

completely understood, but it is believed that oxygen motion and filament formation are

responsible for resistive changes in oxide-based resistance-change materials [46] [17].


2.3 3D Integration Technologies

Because chip area largely dictates manufacturing cost, building memory elements verti-

cally above a silicon surface is a highly effective way to reduce cost-per-bit. Current 3D

integration methods range from package-level stacking, in which dies are vertically con-

nected using bond wires or solder bumps, to monolithic integration, in which silicon is

recrystallized between each transistor layer to allow the fabrication of multiple layers of

active devices on a single wafer. From a circuit-design standpoint, the most significant

differentiator between the technologies is the density and parasitics of the inter-layer vias.

Fabrication costs can vary because, while the monolithic 3D design requires less silicon, it

involves a more complicated fabrication process as well as a possible degradation in yield.

Chip-stacking is the most primitive 3D-IC fabrication method and connects pre-fabricated

chips via bond wires or solder bumps. The long bond wires required to connect the lay-

ers of chips can add significant resistance and capacitance. Wafer-stacking avoids some

of the performance limitations of chip-stacking by directly bonding fabricated wafers after

they have been thinned down by chemical-mechanical polishing. Each layer is then only a

few tens of microns thick, but this still adds significant parasitics relative to a single-layer

design. Also, processing complexity is increased because of the challenges involved with

aligning multiple layers.

Monolithic 3D integration connects multiple layers using inter-layer vias similar to the

typical inter-metal vias used in a standard CMOS fabrication process. Because each layer is

less than one micron in thickness [18], performance is similar to that of a circuit fabricated

on a 2D substrate. In this manner, Samsung has developed a 3-dimensionally stacked

NAND flash memory by implementing a single-crystal Si layer stacking technology, where

epitaxial silicon is deposited in multiple layers above the substrate [19]. While this method

does improve the bit-density on a die, it is still subject to the limitations of flash scaling.


Figure 2.5: Vertical SEM photographs of Samsung 3D stacked NAND

cell string [19]

One-time programmable multi-layer cross-point memories have been developed by

both Samsung [20] and Matrix Semiconductor (currently part of SanDisk) [21]. Diodes

are used as access devices for each memory element to provide the necessary current den-

sity and prevent leakage across unselected cells, but this causes an increase in the bias

voltage required for access. Memory cell access diodes will also suffer from the effects

of discrete dopants in the p-n junction when process technologies scale below 40 nm [22].

The one-time programmable aspect of the memory limits its potential applications as well.

2.4 Motivation

For true scalability beyond 20 nm technology nodes, it is necessary to design a cross-

point memory array which does not require diodes for access elements. The cross-point

memory architecture described in this work is designed so that it can easily be fabricated

in multiple layers to form a stacked 3-dimensional memory. The memory array does not

use active devices for cell selection, so the stacking of multiple memory layers is relatively

straightforward, unlike the 3D stacked NAND flash.

Figure 2.6: Vertical SEM photograph of SanDisk 8-layer cross-point memory [21]

Rewriteable resistance-change materials are used as memory elements. By keeping the

memory cells free of access devices such as diodes or transistors, the memory architecture

is considerably more scalable than traditional flash memory.


Chapter 3

Cross-point Memory Design

Methodology

THE ARCHITECTURE of each layer of the cross-point memory array resembles that

of NAND flash [23]. Because the cross-point memory design is optimized for area,

the access times are not expected to be competitive for execute-in-place memory applications.

This chapter begins with an overview of the goals for a cross-point memory array,

then discusses the strategies employed for reading and writing the memory. We present the

limitations that we face when removing the access transistors, and what target memory cell

characteristics are desirable for a cross-point memory architecture. Finally, this chapter

summarizes the design process and how it can be adapted to suit other resistance-change

memory devices and technology processes.

3.1 Cross-point Memory Array Goals

Using a transistor-free cross-point memory architecture allows us to fabricate wordlines

and bitlines in the minimum metal pitch allowed by the technology process, as we are no


longer limited by individual selection transistors. However, layout design becomes more

difficult because the peripheral circuitry must be able to access the bitlines and wordlines

in such a tight pitch.

The purpose of cross-point memory is to attain minimal area while maintaining scal-

ability. As with NAND flash memory, read operations are performed in parallel; that is,

all of the bits on a single wordline can be read at once. To read many bits in parallel, the

complexity and power of the sense amplifiers must be managed.

3.1.1 Leakage Current Considerations

When implementing an access-transistor-free cross-point memory array, the greatest con-

cern is the leakage current that arises as a result of unwanted memory cells being biased

during a read or write operation. Because there are no transistors or diodes at the memory-

cell cross-points, charge is free to flow throughout the memory array. For the read oper-

ation, this adds noise to the signal being sensed. For the write operation, this can create

disturb conditions on unselected cells, or incomplete programming of selected cells.

The worst-case leakage condition occurs when the majority of the memory array is in

the low-resistance state and a selected cell is in the high-resistance state. Because there

are no access transistors, the memory cells must also serve as rectifying devices, and are

less effective at this when at a low resistance. The peripheral access circuitry must be de-

signed to overcome this leakage current problem, and this will be discussed in the following

sections.

3.2 Memory Write Operation

Programming resistance-change memory cells is accomplished by applying either a SET

voltage (VSET) or a RESET voltage (VRESET). "SET" is defined as the transition of cells
from a high-resistance state to a low-resistance state, while "RESET" brings the cells

back to a high-resistance state from a low-resistance state. It has been shown that some

resistance-change cells are more reliable and demonstrate a faster switching speed when

operating in the bipolar mode [15]. In this mode, VRESET will be a negative bias while

VSET is a positive bias. The key to overcoming the leakage current problems during the

write operations is to ensure that unselected cells are biased at intermediate voltages that

are not high enough to accidentally program them.

Figure 3.1: A RESET (left) and a SET (right) operation performed on a

cross-point array.

Figure 3.1 shows a representation of the RESET and SET operations. In general, all

the cells in a memory array are first RESET to the high-resistance state, and then only se-

lected cells are SET to the low-resistance state. The entire memory array is RESET bitline-

by-bitline before the SET voltage is applied to those cells that need to be programmed.

This enables us to predict the current requirements for the SET operation to avoid over-

programming (bringing memory cells to too low a resistance). Over-programming is not

a concern during RESET because the higher resistance will automatically limit the current

flow.

In the RESET operation, VRESET is applied sequentially to one bitline at a time. All

wordlines in the array are pulled to ground (0 V), and every cell in this row of bits becomes

RESET. The unselected bitlines are terminated with high-impedances, or are left floating,

so that the only current path is from the selected bitline to the wordlines. It is expected that

the unselected bitlines will drift to some voltage slightly higher than 0 V. Leakage current

is not a significant concern in the RESET operation because every cell in the array will

eventually be brought to the high-resistance state.
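
The erase-then-program ordering described above can be sketched as a small behavioral model. This is only an illustration: the names `program_block`, `HRS`, and `LRS` are hypothetical, and the model ignores voltages, currents, and timing. It is not the control logic used in this work.

```python
HRS, LRS = 0, 1  # hypothetical encodings: high- and low-resistance states


def program_block(cells, target):
    """Behavioral sketch: cells[wl][bl] holds the state of each cell.

    Step 1 RESETs the whole block one bitline at a time.
    Step 2 SETs only the cells whose target state is LRS.
    """
    n_wl, n_bl = len(cells), len(cells[0])
    # RESET pass: VRESET on one bitline with all wordlines grounded,
    # so every cell on that bitline goes to the high-resistance state.
    for bl in range(n_bl):
        for wl in range(n_wl):
            cells[wl][bl] = HRS
    # SET pass: one bitline at a time, VSET only on the selected wordlines.
    for bl in range(n_bl):
        for wl in range(n_wl):
            if target[wl][bl] == LRS:
                cells[wl][bl] = LRS
    return cells
```

Because every cell starts the SET pass in the high-resistance state, the current drawn by each SET step is predictable, which is the motivation given above for this ordering.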

In the SET operation, 0 V is applied sequentially to one bitline at a time, while VSET is applied only to selected wordlines. The unselected wordlines are terminated with a high impedance and are expected to drift to some voltage between ground and VSET/2. Unselected bitlines are biased at VSET/2. The unselected cells see a bias of at most VSET/2. While this allows some leakage current in the array, it is necessary to prevent unselected bitlines from drifting to too low a voltage and causing an inadvertent SET.
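
The SET biasing scheme can be checked with a short sketch that computes the bias across every cell from the line voltages. The function names and the floating-wordline voltage `v_float` are assumptions made for illustration; the text only states that floating wordlines settle somewhere between ground and VSET/2.

```python
def set_cell_biases(v_set, sel_bl, sel_wls, n_wl, n_bl, v_float):
    """Return bias[wl][bl] = V(wordline) - V(bitline) during one SET step.

    v_float is an assumed voltage for the floating unselected wordlines.
    """
    v_wl = [v_set if wl in sel_wls else v_float for wl in range(n_wl)]
    v_bl = [0.0 if bl == sel_bl else v_set / 2 for bl in range(n_bl)]
    return [[v_wl[wl] - v_bl[bl] for bl in range(n_bl)] for wl in range(n_wl)]


def max_unselected_bias(bias, sel_bl, sel_wls):
    """Largest bias magnitude over all cells that are not being SET."""
    return max(
        abs(b)
        for wl, row in enumerate(bias)
        for bl, b in enumerate(row)
        if not (bl == sel_bl and wl in sel_wls)
    )


V_SET = 1.0  # volts, from Table 3.1
bias = set_cell_biases(V_SET, sel_bl=0, sel_wls={1, 3}, n_wl=4, n_bl=4, v_float=0.3)
print(max_unselected_bias(bias, sel_bl=0, sel_wls={1, 3}))
```

Only the cells at the intersection of the selected bitline and a selected wordline see the full VSET; every other cell stays at or below VSET/2 in magnitude as long as the floating wordlines remain within the stated range.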

3.3 Memory Read Operation

A memory array is accessed wordline-by-wordline. During a read operation, the selected

wordline is raised to VREAD and a read current is driven in parallel through the bitlines. The

unselected wordlines are terminated with high-impedances while each bitline is connected

to an individual sense amplifier at one end. In this manner, the sense amplifiers should

provide the only current path to ground.

Figure 3.2: A read operation performed on a cross-point array.

Figure 3.2 depicts the current leakage paths through the unselected wordlines. As men-

tioned in Section 3.1, the leakage current is worst when the majority of the memory array is

in the low-resistance state, as this allows more leakage current to traverse between bitlines

and cause read errors.

3.3.1 Sense amplifier design

Our strategy for overcoming the leakage current problem is to implement current-sensing

amplifiers that can maintain the bitlines at as near a constant voltage as possible. If the

voltage difference between bitlines is minimized, then so is the leakage current. Current-

sensing amplifiers provide significant reductions in bit-line voltage swing and sensing de-

lays [24].

Figure 3.3: Op-amp current-to-voltage converter.

This can potentially be done using an operational-amplifier implemented as a current-

to-voltage converter. Figure 3.3 shows the basic implementation of an op-amp current-

to-voltage converter. The inverting input is maintained at virtual ground and the entire

input current is directed across resistor R. The benefit of this design is that every bitline

can be maintained at the same virtual ground, and the current differential between a high-

resistance and a low-resistance state is seen as the voltage drop across the resistor. With

this ideal sense amplifier design, the memory array will see no leakage current whatsoever.
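
As a rough numerical illustration of the ideal current-to-voltage converter, the sketch below evaluates the output voltage for the two cell states. The read voltage `V_READ` and feedback resistor `R_F` are assumed values chosen for the example; only the cell resistances come from Table 3.1.

```python
R_ON, R_OFF = 100e3, 1e6   # ohms, from Table 3.1
V_READ = 0.5               # volts, assumed read voltage
R_F = 100e3                # ohms, assumed feedback resistor R


def tia_output(v_read, r_cell, r_f):
    """Ideal inverting current-to-voltage converter.

    The bitline sits at virtual ground, so the cell current is
    v_read / r_cell and the output voltage is -I * r_f.
    """
    i_cell = v_read / r_cell
    return -i_cell * r_f


v_on = tia_output(V_READ, R_ON, R_F)
v_off = tia_output(V_READ, R_OFF, R_F)
print(v_on, v_off)
```

With the bitline held at virtual ground, the output ratio equals the full ROFF/RON ratio of the cell, since no voltage difference exists between bitlines to drive leakage.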

Figure 3.4 shows the transistor-level view of an operational amplifier that might be used

for current sensing. Unfortunately, in practice, the complexity level of an effective op-amp

circuit renders it impossible for use in accessing bitlines laid out in the minimum metal

pitch.

A much simpler alternative is needed if the sense amplifier is to fit in a single-bitline

pitch. Flash memory typically uses a differential current-sensing scheme [25]. Even this

Figure 3.4: Exploded transistor-level view of an LM107 operational

amplifier [26].

structure would be too complex to accommodate a cross-point array. To address the chal-

lenge of an area-efficient sensing design, a diode-connected NMOS transistor followed by

a current mirror is used to detect the read current from a bitline, as shown in Figure 3.5.

The mirrored current is compared to a bias current set by the gate voltage Vbias on the

PMOS transistor in the reference branch, and this controls the voltage signal Vout, which

is buffered and latched. This simple sense amplifier is also better suited for future voltage

scaling.

A diode-connected MOSFET is useful for clamping the bitline to a certain voltage level,

reinforcing the original goal of maintaining a constant bitline potential to minimize leakage

Figure 3.5: Single gain stage current-sensing amplifier.

current. The diode-connected NMOS must have a sufficiently low input resistance relative

to the resistances of the memory cells to ensure that it provides enough conductance for

sensing current. The NMOS transistor in the current-mirror branch must be sufficiently

sized to detect the small amount of voltage swing. Transistor sizing is a critical part of the

peripheral circuitry design, and these considerations will be further discussed in the next

chapter.

3.4 Memory Cell Characterization

The memory array design depends strongly on the resistance-change memory cell charac-

teristics. There are a number of metal oxide materials that offer reversible, voltage-induced,

resistance changes. When choosing a suitable memory cell candidate, a number of factors

need to be considered. The material needs to serve as both the memory element and a

Figure 3.6: HfO2-based RRAM device

rectifying device, so the resistance of the chosen material must be high enough to mini-

mize leakage current in both the low-resistance and high-resistance states. However, the

resistance-change material should not have high write-energy requirements, as high programming currents or voltages would demand extra peripheral circuitry and would also limit the future scalability of the cross-point memory.

Two distinct conduction states are observed in certain polycrystalline metal oxides such

as TiO2 [46]. The high-resistance state can switch to a low-resistance state by voltage

and/or current stress [12]. The low-to-high resistance transition is also induced by voltage

and/or current stress.

In this work, the target device is a HfO2-based resistance-change material with an in-

terfacial layer between one electrode and the HfO2 layer. The interfacial layer is a reactive

buffer layer that allows a large number of oxygen atoms to diffuse from the HfO2 layer to

the interfacial layer, improving the reliability of the resistance-switching mechanism [15].

Table 3.1 summarizes the characteristics of the target memory device used in this work.

The I-V characteristics of the resistance-change device are shown in Figure 3.7.

Parameter   Value
RON         100 kΩ
ROFF        1 MΩ
VSET        1 V
ISET        5 µA
VRESET      -1.5 V
IRESET      40 µA

Table 3.1: Characteristics of an HfO2-based RRAM device
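
A few derived quantities follow directly from the tabulated values. The sketch below computes the resistance window and the peak programming power per cell; the variable names are illustrative, and no pulse duration is given here, so energy per write is deliberately not estimated.

```python
R_ON, R_OFF = 100e3, 1e6          # ohms
V_SET, I_SET = 1.0, 5e-6          # volts, amperes
V_RESET, I_RESET = -1.5, 40e-6    # volts, amperes

resistance_ratio = R_OFF / R_ON        # ON/OFF resistance window
p_set = abs(V_SET) * I_SET             # peak SET power per cell, watts
p_reset = abs(V_RESET) * I_RESET       # peak RESET power per cell, watts
print(resistance_ratio, p_set, p_reset)
```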

Figure 3.7: HfO2 bipolar switching characteristics.

3.5 Design Process

The remainder of this work will focus on the process of designing a memory architecture

to accommodate the read operation. The design methodology presented can be applied in

the same manner when later optimizing the memory architecture for the write operation.

Figure 3.8: The memory array represented as a resistive network

consisting of word line drivers, memory cells, and sense

amplifiers.

The key variables to be determined in designing a circuit architecture to support a mem-

ory array are: array dimensions, wordline driver size, and sense amplifier input impedance.

These are the factors that directly interact with a memory cell under bias during the read

operation, and scaling them controls read accuracy.

As shown in Figure 3.8, the memory array can be seen as a resistive network made up

of the key design variables identified above. The sneak current path depicted by the arrow

between two bitlines is undesirable, and can be moderated if the input impedance of the

sense amplifiers is sufficiently low. Alternately, the number of wordlines could be reduced,

giving the read current fewer leakage paths. Finally, the wordline drivers could be made

larger, resulting in a larger absolute current differential between the bitlines.
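
The resistive-network view can be explored with a small linear nodal analysis. The sketch below is a simplification, not the HSPICE model used in this work: the selected wordline is treated as an ideal voltage source, each sense amplifier is reduced to a resistance `r_sa` to ground, and the function names, the 5 kΩ input resistance, and the 0.5 V read voltage are all assumptions.

```python
def solve_linear(a, b):
    """Solve a*x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        s = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - s) / m[r][r]
    return x


def sense_currents(cell_r, v_read, r_sa, selected=0):
    """Current into each sense amplifier for a read on wordline `selected`.

    cell_r[i][j] is the resistance of the cell at wordline i, bitline j.
    Unselected wordlines float; every bitline sees r_sa to ground.
    """
    n_wl, n_bl = len(cell_r), len(cell_r[0])
    floating = [i for i in range(n_wl) if i != selected]
    index = {wl: k for k, wl in enumerate(floating)}
    n_f = len(floating)
    size = n_f + n_bl                    # unknowns: floating WLs, then bitlines
    a = [[0.0] * size for _ in range(size)]
    b = [0.0] * size
    for j in range(n_bl):
        row = n_f + j
        a[row][row] += 1.0 / r_sa        # sense amplifier path to ground
        for i in range(n_wl):
            g = 1.0 / cell_r[i][j]
            a[row][row] += g             # KCL at bitline j
            if i == selected:
                b[row] += g * v_read     # known source voltage moves to the RHS
            else:
                k = index[i]
                a[row][k] -= g
                a[k][k] += g             # KCL at floating wordline i
                a[k][row] -= g
    v = solve_linear(a, b)
    return [v[n_f + j] / r_sa for j in range(n_bl)]


def worst_case_margin(n_wl, n_bl, r_on, r_off, v_read, r_sa):
    """(I_ON - I_OFF) / I_ON with every cell ON except one selected OFF cell."""
    cells = [[r_on] * n_bl for _ in range(n_wl)]
    cells[0][0] = r_off                  # selected OFF cell on bitline 0
    i = sense_currents(cells, v_read, r_sa, selected=0)
    return (i[1] - i[0]) / i[1]


for n in (16, 32, 64):
    print(n, worst_case_margin(n, 32, 100e3, 1e6, 0.5, 5e3))
```

In this simplified model the margin shrinks as wordlines are added and recovers as `r_sa` is lowered, which mirrors the two levers described above. The wordline driver strength is not modeled because the driver is idealized.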

The next chapter will analyze these design variable tradeoffs and find the optimal array

design.

Chapter 4

Circuit Design Tradeoffs

THE THREE PRIMARY design variables affecting the memory array during the read

operation were identified in the previous chapter. In this chapter, we seek to deter-

mine how these variables can be optimized to yield the smallest layout area while main-

taining reasonable read accuracy, as that is the primary goal of cross-point memory design.

4.1 A Resistive Network

A 32-bitline cross-point memory array using a resistor model of the HfO2-based RRAM

device was created in HSPICE using a 0.18 µm process technology with a nominal voltage

of 1.8 V. The array was simulated for accuracy while varying the number of wordlines,

the sense amplifier NMOS sizes, and the wordline driver transistor sizes. Read accuracy

was determined by accessing the leftmost wordline in a memory array and measuring the

amount of current reaching the sense amplifier input. In order to properly sense a signal, there should be a sufficient difference in the current between the high- and low-resistance states being accessed on the active wordline when the remainder of the array

is in the low-resistance state, as this array configuration allows the most leakage current

across unselected cells. It is determined that a 15% current differential is sufficient for a

minimum-size buffer to latch the sense amplifier output following a 4x current gain.

Figure 4.1: Sense amplifier design tradeoffs. The optimum sense

amplifier size for a 16-wordline array is marked (1), and the

optimum sense amplifier size for a 32-wordline array is

marked (2).

The results in Figure 4.1 show the minimum transistor sizes that can be used during

a worst-case read operation. Read operations were performed by activating the farthest

wordline from the sense amplifiers. The transistor sizes on the chart are normalized to the

minimum transistor size for the process technology used for simulation.

The rationale behind the sizing tradeoffs can be understood from Figure 4.2, which

is a simplified representation of the resistive network that occurs during a read operation

in the worst-case array state. In order for any sense amplifiers to be able to differentiate

Figure 4.2: Memory array design variables.

between the input currents ION and IOFF in this scenario, the leakage current from the low-resistance bitline (RON) to the high-resistance bitline (ROFF) must be less than the original bias current across the memory cell. The amount of current traversing from the RON bitline to the ROFF bitline is equal to the difference between the bitline voltages (∆VBL) divided by the parallel combination of all the resistors along the bitline (RON/#WL). Then,

\[ \frac{\Delta V_{BL} \cdot \#WL}{R_{ON}} < \frac{V_{READ}}{R_{ON}} \]

which simplifies to

\[ \frac{V_{READ}}{\Delta V_{BL}} > \#WL \]

This gives the absolute minimum requirements for a functional memory array.
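
The bound relating the read voltage, the bitline voltage mismatch, and the number of wordlines can be expressed as a quick feasibility check. The function names and the example voltages are hypothetical; the condition is necessary rather than sufficient, since a practical design also needs the sense margin discussed earlier.

```python
def array_is_readable(v_read, delta_v_bl, n_wordlines):
    """Necessary condition from the text: V_READ / dV_BL > #WL."""
    return v_read / delta_v_bl > n_wordlines


def max_bitline_mismatch(v_read, n_wordlines):
    """Strict upper bound on dV_BL for a given number of wordlines."""
    return v_read / n_wordlines


# Example with an assumed 0.5 V read voltage and 10 mV bitline mismatch.
print(array_is_readable(0.5, 0.01, 32), array_is_readable(0.5, 0.01, 64))
print(max_bitline_mismatch(0.5, 32), max_bitline_mismatch(0.5, 64))
```

Doubling the number of wordlines halves the tolerable bitline mismatch at a fixed read voltage, which is why larger arrays demand disproportionately larger drivers and sense transistors.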

For arrays with up to 32 wordlines, VREAD can be increased and ∆VBL decreased by

increasing the width of the wordline drivers and sense amplifier transistors respectively. In-

creasing the wordline transistor size drives more current through the selected cells in order

to account for the greater number of leakage paths. Increasing the size of the sense am-

plifier NMOS transistors provides a lower-impedance path relative to all of the unselected

memory cells in parallel. This generates the L-shaped curve seen in Figure 4.1.

When the array size is increased to 64 wordlines, the wordline drivers and sense ampli-

fier widths must be significantly increased in size to a point where it is no longer practical

for implementation. As shown in the above equation, for a constant VREAD, ∆VBL must

be halved to accommodate the doubling of the number of wordlines. While a valid con-

figuration for a 64-wordline array was found through simulation, in practice, this would

not work, because it would be impossible to lay out such large wordline driver and sense

amplifier transistors within the narrow pitch of the wordlines and bitlines.

Two acceptable array configurations are marked (1) and (2) in Figure 4.1 and will be

further discussed. Design (1) is optimized for a 16-wordline array. Design (2) is optimized

for a 32-wordline array and has a sense amplifier with NMOS transistor widths sized at

four times those of design (1).

4.2 MOSFET Current Amplifier

The sense amplifiers selected in the previous section were further evaluated by simulating

the current output from the bitline to the sense amplifier when the number of wordlines

is increased. Figure 4.3 shows the simulation results under different array configurations.

Once again, read operations were performed by activating the wordline farthest from the

sense amplifiers. The effect of the sneak paths in the array is clearly evidenced by the

narrowing current window between reading an ON state and an OFF state as the number

of wordlines is increased. Sense amplifier (1) only has a sufficient current sense margin

(approximately 15% IOFF ) at 16 wordlines, while sense amplifier (2) has a sufficient sense

margin up to 32 wordlines. Both sense amplifiers have too narrow a margin by 64 word-

lines. This is consistent with the results presented in Figure 4.1.

Figure 4.3: Normalized current between the bitline and MOSFET sense

amplifier input under various array states.

4.3 Bipolar Transistor Current Amplifier

As shown in Figure 4.1, the NMOS transistors in the sense amplifier need to be sized to over

20 times the minimum transistor size in order to accommodate 16 or 32 wordlines. These

are rather large transistor areas, and it is possible that bipolar transistors may actually be

able to offer a more area-efficient and accurate design.

The input impedance of a BJT is lower than that of a MOSFET. The current-voltage re-

lation of the base-emitter junction of a BJT is equivalent to the exponential current-voltage

curve of the p-n junction of a diode. Thus, the collector current of a BJT increases exponentially with base-emitter voltage, and its transconductance gm is proportional to the collector current. This results in a higher transconductance than that of a

MOSFET, which has a quadratic current-voltage relation between the gate voltage and drain

current. This sensitivity can make a bipolar-based current mirror a better transimpedance

amplifier than the MOSFET-based circuit. In Figure 4.4, the NMOS transistors in the sense

amplifier have been replaced with NPN bipolar transistors. The PMOS device remains, as

it only serves to provide a reference current.
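
The transconductance comparison can be made concrete with the standard first-order device equations, gm = IC/VT for a BJT and gm = 2·ID/VOV for a square-law MOSFET in saturation. The bias current and MOSFET overdrive voltage below are assumed values for illustration, and the input resistance of a diode-connected device is approximated as 1/gm, ignoring base current and output resistance.

```python
K_B = 1.380649e-23      # Boltzmann constant, J/K
Q = 1.602176634e-19     # elementary charge, C


def gm_bjt(i_c, temp_k=300.0):
    """Small-signal transconductance of a BJT: gm = I_C / V_T."""
    v_t = K_B * temp_k / Q
    return i_c / v_t


def gm_mosfet(i_d, v_ov):
    """Square-law MOSFET in saturation: gm = 2 * I_D / V_ov."""
    return 2.0 * i_d / v_ov


I_BIAS = 10e-6   # amperes, assumed bias (read) current
V_OV = 0.2       # volts, assumed MOSFET overdrive voltage

r_in_bjt = 1.0 / gm_bjt(I_BIAS)            # approximate input resistance, ohms
r_in_mos = 1.0 / gm_mosfet(I_BIAS, V_OV)   # approximate input resistance, ohms
print(r_in_bjt, r_in_mos)
```

At the same bias current the BJT presents a lower input resistance whenever the MOSFET overdrive exceeds twice the thermal voltage, which is the usual case in strong inversion.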

Figure 4.4: Single gain stage current-sensing amplifier replacing

NMOS transistors with NPN bipolar transistors.

An evaluation of a memory array using bipolar sense amplifiers yields the results shown

in Figure 4.5. The NPN transistors in these sense amplifiers have emitter areas of 2 µm by

2 µm, the smallest vertical NPN transistor available in 0.18 µm CMOS technology. These

sense amplifiers maintain a sufficient margin of differentiation between the high-resistance

and low-resistance states for up to 32 wordlines. While the sense margin does become

insufficient at 64 wordlines, as was the case with the MOSFET-based sense amplifiers, the

current window is still greater than that achieved using the original sense amplifier designs.

This means that the bipolar sense amplifier could be more tolerant of resistance variations

in the memory cells.

Figure 4.5: Normalized current between the bitline and BJT sense

amplifier input under various array states.

4.4 Lateral Bipolar Junction Transistors in Standard CMOS Process

A vertical NPN transistor is fabricated with a heavily-doped emitter in a p-doped base layer,

in a lightly-doped n-well which serves as the collector. This is shown in Figure 4.6. Ver-

tical bipolar junction transistors are often not available for integration in standard CMOS

processes. When they are, each device can occupy tens of microns in height and width.

The area requirement renders it impractical for use in the cross-point array.

Figure 4.6: Cross-sectional diagram of a vertical NPN transistor.

Bipolar junction devices also occur as parasitics in MOSFETs and have been utilized

as active devices with some success [27] [28]. The parasitic NPN junction is formed be-

tween the drain, substrate, and source of an NMOS transistor, as seen in Figure 4.7. The

base width is determined by the gate length of the MOSFET in the fabrication process

technology.

Figure 4.7: Cross-sectional diagram of a lateral NPN BJT as a parasitic

in an NMOS transistor.

In an NPN transistor, collector current is generated when a voltage is applied to forward-

bias the base-emitter junction and electrons are injected into the base region. These elec-

trons diffuse through the base towards the collector and are swept into the collector by the

electric field in the depletion region of the collector-base junction. The gain, β, is dependent

on the base width being much shorter than the diffusion length of the electrons in advanced

CMOS technologies [29].

Figure 4.8 shows the parasitic bipolar devices present in a cross-sectional diagram of

an NMOS transistor.

The electrons injected from the emitter to the base are collected mainly vertically in the

vertical BJT and laterally with the lateral device. The generated emitter current depends

on the area of the base-emitter junction closest to the collector region. The area of the

base-emitter junction in the vertical device is determined by the overall emitter area, which

is 4 µm2 in the transistor we want to emulate. The lateral BJT needs to be able to match

this base-emitter junction area in order to achieve the same gain as the 4 µm2 vertical

device. As identified in Figure 4.7, the diffusion depth for the active MOSFET regions

is approximately 0.10-0.25 µm in 0.18 µm CMOS technology. This means that a lateral

Figure 4.8: Parasitic bipolar devices identified in cross section of

NMOS transistor.

bipolar device should have an emitter perimeter of at least 16 µm.
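
The perimeter figure follows from equating the lateral sidewall area (perimeter times diffusion depth) to the emitter area of the vertical device. The sketch below reproduces that arithmetic under this sidewall-area assumption; the function name is illustrative.

```python
def required_emitter_perimeter(target_area_um2, junction_depth_um):
    """Perimeter whose sidewall area (perimeter * depth) equals the target area."""
    return target_area_um2 / junction_depth_um


VERTICAL_EMITTER_AREA = 2.0 * 2.0   # um^2, the 2 um x 2 um vertical NPN
for depth in (0.25, 0.10):          # um, the quoted range of diffusion depths
    print(depth, required_emitter_perimeter(VERTICAL_EMITTER_AREA, depth))
```

The 16 µm minimum corresponds to the deepest quoted diffusion (0.25 µm); a shallower junction would require a proportionally longer perimeter.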

Two layout designs for lateral BJTs were considered for fabrication. The first is based

on a standard MOSFET layout, with drain and source active regions interleaved between

the multiple polysilicon fingers that serve as the gate. The drain and source regions serve

as the collector and emitter, and the substrate below the gate serves as the base region. The

device is surrounded by a ring of contacts to the substrate, or base region. It is fabricated

in a deep n-well and surrounded by n-well regions for isolation. This design is shown

in Figure 4.9, with the collector region labeled (C), emitter region labeled (E), and the

base/substrate region labeled (B).

While this layout design is compact, it may not achieve a very high gain because the

unsurrounded perimeter directly above and below the emitter regions enables electrons

to recombine in the base without reaching the collector. This increases the base current

without any increase in the collector current, effectively reducing the current gain β.

Figure 4.9: Lateral NPN bipolar device based on standard MOSFET

layout.

An alternate design, shown in Figure 4.10, places the emitter in an active region completely surrounded by the collector. This should result in most of the electrons injected from

the emitter to the base being collected laterally. The smaller emitter area should also result

in a lower base current because it reduces the number of electrons that recombine in the

substrate below the emitter. Unfortunately, the unique shape of the polysilicon gate in this

layout requires a slightly longer gate in most design processes (for example, the gate length

becomes 0.21 µm in the TSMC 0.18 µm process). The wider base region will allow more

electrons from the emitter to recombine in the base before entering the collector, reducing

the gain.

Figure 4.10: Lateral NPN bipolar device with octagonal emitter shape.

4.5 Sense Amplifier Area Considerations

Finally, as the cross-point memory should be optimized for minimal area, it is important to

consider the layout overhead involved with each of the sense amplifier designs presented.

We do this by evaluating the total area components of a 4 kb row.

A 16x32-bit cross-point array has dimensions of 64λ by 128λ, where λ is one-half the

minimum feature size in any process technology. Then, a 32-by-32-bit block has dimen-

sions of 128λ by 128λ. The minimally-sized block-selection transistors occupy a width

of 44λ per bitline. The wordline-selection transistors are not considered for this estimate.
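
The array dimensions quoted above follow from a line pitch of 2F, or 4λ. The sketch below reproduces them and estimates how much of each block's length is taken by the block-selection transistors. Treating the 44λ selection width as lying in line with the wordline dimension of the block is an interpretation of the text, not a stated fact, and the function names are hypothetical.

```python
PITCH_LAMBDA = 4          # line pitch of 2F, with F = 2 * lambda
BLOCK_SELECT_LAMBDA = 44  # block-selection transistor width quoted in the text


def array_dimensions(n_wordlines, n_bitlines):
    """Return (extent across wordlines, extent across bitlines) in lambda."""
    return n_wordlines * PITCH_LAMBDA, n_bitlines * PITCH_LAMBDA


def select_overhead_fraction(n_wordlines, n_bitlines=32):
    """Share of (array + block-select) length taken by the select transistors."""
    array_len, _ = array_dimensions(n_wordlines, n_bitlines)
    return BLOCK_SELECT_LAMBDA / (array_len + BLOCK_SELECT_LAMBDA)


print(array_dimensions(16, 32), array_dimensions(32, 32))
print(select_overhead_fraction(16), select_overhead_fraction(32))
```

Under this reading, the fixed selection overhead is amortized over more cells in the taller array, so the 32-wordline block spends a smaller fraction of its length on selection transistors.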

Table 4.1 shows a comparison of the different sense amplifier designs presented in this

chapter and their respective overhead impact on layout area when considering a single 4 kb

row. All circuits are designed for a minimum bitline-pitch, and staggered laterally if too

large.

Sense amplifier   Number of wordlines   SA width   Total width of 4 kb row
MOSFET (1)        16                    924 λ      28572 λ
4x MOSFET (2)     32                    1844 λ     12860 λ
BJT (flat)        32                    2514 λ     13530 λ
BJT (octagon)     32                    2807 λ     13823 λ

Table 4.1: Sense amplifier designs and impact on layout area when

considering a 4 kb row.

Figure 4.11 shows the layout area of the small MOSFET sense amplifiers compared

to multiple blocks of memory. With hundreds of memory arrays sharing a single set of

sense amplifiers, the greatest overhead will ultimately be that of the selection transistors.

Even though the 16-wordline array could utilize a much smaller sense amplifier design, the

32-wordline array provides better efficiency when all peripheral circuitry is considered.

Figure 4.11: Layout area of the small MOSFET sense amplifiers shared

by multiple memory arrays.

Chapter 5

Three-Dimensional Memory Architecture

BUILDING INTEGRATED circuits vertically allows for a reduced chip footprint when

compared to a traditional two-dimensional (2D) design, by an approximate factor

of the number of layers used. This offers significant advantages in terms of reduced in-

terconnect delay when routing to blocks that otherwise would have been placed laterally.

Traditionally, a 3D integrated circuit (3D-IC) has used more than one active device layer.

While the resistance-change memory cells are not active devices, they function as rectifying

devices in our design.

This chapter describes the process by which the 2D cross-point array can be built into

a multi-layer 3D architecture. We also explore layout techniques that seek to maximize

the amount of peripheral circuitry that is folded underneath the memory arrays. While

support circuitry could also be built above the substrate using wafer bonding or monolithic

3D integration, the process complexity associated with these methods is currently too high

for economical manufacturing; thus, this work will only explore 2D peripheral circuitry for

supporting 3D memory arrays.

5.1 Memory Array Architecture

As determined in the previous chapter, a 32-wordline array provides the most area-efficient

solution for a cross-point array given the target memory cell characteristics presented in

Section 3.4.

The number of bitlines does not affect the sense accuracy, but is limited by the write

operation, as that is done by individually biasing the bitlines. Based on a programming

current of about 40 µA per cell, 32 bitlines will require a total programming current of

approximately 1.3 mA, which is manageable. We will estimate that the number of bitlines

will also be limited to 32. Thus, the memory array size is 32 bitlines by 32 wordlines. This

array size also ensures that a constant bias will be seen by all memory cells being accessed,

as the voltage drop across a wordline the length of 32 bitlines is negligible. The metal

resistance is about 0.08 Ω per square for the copper layer in most process technologies.

Even in the worst case, when all memory cells being accessed are in the low-resistance

state, the voltage drop across the 32-bit word line is about 2 mV.
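
The current and voltage-drop estimates in this section can be reproduced approximately with the sketch below. Several inputs are assumptions, since the text does not state them: a 1 V read voltage, a wire width of F with a cell pitch of 2F (two squares of metal per cell), and the simplification that the full read current flows through the whole wordline, which gives an upper bound.

```python
I_PROG = 40e-6        # A per cell, RESET current from Table 3.1
R_ON = 100e3          # ohms, low-resistance state
R_SHEET = 0.08        # ohms per square, copper layer
N_BITLINES = 32
V_READ = 1.0          # volts, assumed
SQUARES_PER_CELL = 2  # assumed: wire width F, cell pitch 2F

total_prog_current = N_BITLINES * I_PROG                          # amperes
wordline_resistance = N_BITLINES * SQUARES_PER_CELL * R_SHEET     # ohms
worst_read_current = N_BITLINES * V_READ / R_ON                   # A, all cells ON
# Upper bound: treat the full current as flowing through the whole line.
wordline_drop = worst_read_current * wordline_resistance          # volts
print(total_prog_current, wordline_resistance, wordline_drop)
```

Under these assumptions the wordline drop stays below 2 mV, which is small compared with the read voltage and consistent with the estimate quoted above.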

A single page architecture is shown in Figure 5.1. The cross-point memory utilizes a

hybrid of NAND- and NOR-type architecture. While we want to read many bits in parallel,

it is necessary to break up the memory into smaller, electrically isolated arrays, in order to

minimize leakage current across the array. We choose to have 128 blocks in a row because

this results in a 573 Ω resistance along the global bitlines between the farthest memory

cells and the sense amplifier inputs. This keeps the voltage drop across the global bitline

sufficiently small as to not disrupt the sense margin.

There are also 128 blocks in a column. The limitation to the number of vertical blocks is

the delay along the polysilicon line that controls the block-selection transistors. The column

Figure 5.1: Hybrid NAND/NOR architecture of cross-point memory.

This shows a 32Mb page.

decoders are located at the top of the page and control the bitline-selection transistors down

a column of blocks. The wordline decoders are located on the left side of the page and

propagate control signals for individual wordlines horizontally across the entire page, as

shown in Figure 5.2. The horizontal routing is necessary to ensure that control signal lines

do not interfere with global and local wordlines when multiple layers of memory are used.

A popular approach in current two-dimensional memory designs was described in 1984

by Mohsen et al. [30], and involves the arrangement of support circuitry between adjacent

memory arrays. A page is divided into two halves, with 128 by 128 blocks on each half of

the page. A single set of sense amplifiers and latches is shared between the two halves of

Figure 5.2: Architecture of a half page showing wordline decoder

routing.

the page. A read operation is performed in two cycles. In the first stage, the odd-numbered

blocks in a single column are accessed on the left half of the page. The even-numbered

blocks in the corresponding column are accessed on the right half of the page. In the

second stage, the even-numbered blocks are accessed in the same column on the left half

of the page, and the odd-numbered blocks in the same column are accessed on the right

half of the page. In these two cycles, an entire column of bits is read on both sides of the

page. The column decoder provides an identical output for each half of the page. Because

alternating blocks are accessed for each read cycle, it is possible to share wordline-driver

transistors between vertically adjacent blocks.
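
The page organization and the two-cycle read can be summarized in a short sketch. The function names are hypothetical, and the schedule only records which block parity is accessed on each half of the page in each cycle; it does not model decoding or timing.

```python
BLOCK_WL, BLOCK_BL = 32, 32          # cells per block
BLOCKS_PER_ROW = BLOCKS_PER_COL = 128
HALVES = 2


def page_capacity_bits():
    """Total cells in a page made of two 128-by-128-block halves."""
    return HALVES * BLOCKS_PER_ROW * BLOCKS_PER_COL * BLOCK_WL * BLOCK_BL


def read_schedule():
    """Block parity accessed in each half of the page on each read cycle."""
    return [
        {"left": "odd", "right": "even"},   # first cycle
        {"left": "even", "right": "odd"},   # second cycle
    ]


def covers_all_blocks(schedule):
    """True if each half reads both odd and even blocks over the schedule."""
    return all(
        {cycle[side] for cycle in schedule} == {"odd", "even"}
        for side in ("left", "right")
    )


print(page_capacity_bits() // 2**20, "Mb")
print(covers_all_blocks(read_schedule()))
```

Because the two halves always access opposite parities, vertically adjacent blocks are never active in the same cycle, which is what allows them to share wordline-driver transistors.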

The memory array size has been limited to account for the effects of leakage current,

and thus selection transistors are required for every bitline and wordline of each block.

The bitline-selection transistors connect the array bitlines to global bitlines and serve to

isolate the block and prevent additional leakage current from external blocks during a read

operation. These selection transistors are unique to each memory block and cannot be

shared by multiple bitlines. Because of this, the selection transistors need to fit in the

minimum metal pitch. This technique is described in the following section.

The wordline-selection transistors are used to drive the read voltage to the selected

wordline and provide other biases during the SET and RESET operations as described in

Chapter 3. In order to minimize area overhead, the wordline-selection transistors are shared

between vertically adjacent arrays. This is enabled by the fact that only alternating rows

of blocks are accessed at a time. The gates of the bitline access transistors in a column of

arrays are controlled by two global selection lines for selecting odd and even rows. The

selection transistors are staggered so that two vertical wires can select alternating rows.

These transistors also need to be laid out to accommodate the minimum metal pitch.

5.2 Circuit Techniques for Vertical Architecture

In order to understand how the memory array peripheral circuitry can be folded underneath

the cross-point memory cells, we first explore the layout techniques used for a single-layer

memory.

The area occupied by the selection transistor layout depends on the amount of current

that a given transistor width can accommodate. The saturation current of a MOSFET is

constantly improving with more advanced technology processes, but for the purpose of this

design, we will use the typical transistor characteristics given in Table 5.1 [31].

Both wordline and bitline selection transistors are laid out as NMOS devices because

NMOS transistors have a much higher saturation current capacity than PMOS transistors.

For instance, in 0.18 µm CMOS technology, the saturation current capacity of an NMOS

device is about 750 µA/µm, while it is only about 280 µA/µm for a PMOS device.

Exclusively using NMOS transistors for bitline and wordline access also avoids the p-

well spacing constraints involved when both complementary devices are used. The body

effect is unavoidable, as the substrate must remain grounded to avoid any unwanted par-

asitics. This effect occurs because the threshold voltage of a MOSFET is affected by the

substrate voltage, as it changes the width of the depletion region. As a result, a transistor

slightly wider than Table 5.1 would suggest must be used for driving a positive

voltage into the wordline or bitline [32]. Furthermore, the NMOS drivers will require

charge pumps to generate a sufficiently high VGS, as the source will be connected to VDD.

These charge pumps will be reused for the programming circuitry.

Process tech.    max VDS    IDS (µA/µm)
180 nm           1.8 V      750
130 nm           1.2 V      530
90 nm            1.0 V      640
65 nm            0.8 V      750

Table 5.1: NMOS saturation current for different process technologies.

Given the memory cell characteristics described in Section 3.4, a bitline may need to

accommodate up to 10 µA for a read operation. As such, a minimum-size selection tran-

sistor may be used when fabricating the memory array in any of the process technologies

currently available. When reduced to layout, the transistor will occupy a horizontal width

of approximately 6λ. With staggered placement and routing, these select transistors can

be fit into a single-bitline pitch. The resulting pass-circuitry width is 32λ, and spans 8

wordline-widths after taking substrate contacts and diffusion spacing into account. Fig-

ure 5.3 demonstrates how this can be done.
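
The sizing argument above can be checked with a short calculation. The sketch below divides the 10 µA worst-case read current by the saturation currents of Table 5.1 to estimate the gate width each process would need. It is a first-order estimate only: the variable names are ours, and body effect and design margin are deliberately ignored.

```python
# Saturation current per micron of NMOS gate width, from Table 5.1.
NMOS_IDSAT_UA_PER_UM = {"180 nm": 750, "130 nm": 530, "90 nm": 640, "65 nm": 750}

# Minimum feature size of each node in microns (used only for comparison).
FEATURE_SIZE_UM = {"180 nm": 0.18, "130 nm": 0.13, "90 nm": 0.09, "65 nm": 0.065}

# Worst-case bitline current for a read operation, as stated in the text.
READ_CURRENT_UA = 10.0


def required_width_um(current_ua, idsat_ua_per_um):
    """Gate width needed to carry current_ua (ignores body effect and margin)."""
    return current_ua / idsat_ua_per_um


widths = {node: required_width_um(READ_CURRENT_UA, idsat)
          for node, idsat in NMOS_IDSAT_UA_PER_UM.items()}

for node, width in widths.items():
    print(f"{node}: {width * 1000:.1f} nm of gate width "
          f"(feature size {FEATURE_SIZE_UM[node] * 1000:.0f} nm)")
```

In every process listed, the estimated width is well below the minimum feature size, which is consistent with the use of a minimum-size selection transistor.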

The bitline-selection circuitry utilizes metal layers 1 through 3 for local interconnect,

and assumes that the local bitlines are built using metal layer 6.

One set of global wordlines can connect all of the memory blocks in a column. The

bitline-selection transistors dictate which arrays are selected for access. As shown in

Figure 5.4, the wordline drivers can be staggered to fit into the minimum metal-width pitch.

The wordline drivers for a 32-bitline block occupy a height of 24 bitlines. The gate signals

that control the wordline transistors are routed horizontally, as the wordline decoders lie to

the sides of each page.

The wordline driver devices are also laid out using metal layers 1 through 3, and we

assume that the local wordlines are built using metal layer 7. The VREAD voltage source runs on a

wide bus that will be routed on metal 5. Metal layer 4 is reserved for routing the wordline

driver gate signals from the decoder. Including one metal layer for the global bitlines, the

cross-point memory array overlaid on the access circuitry requires at least 8 metal layers.

These metal layers are summarized in Table 5.2.

5.3 Layout Techniques and Array Efficiency

In our goal to minimize layout area, we design our peripheral circuitry with the intention

of placing transistors underneath the memory arrays. Because the area of the selection

Figure 5.3: Layout and schematic of four bitline-select transistors made

to accommodate a minimum-width metal pitch.

circuitry and wordline drivers is predetermined by the characteristics of the resistance-

change material, the bitline and wordline selection devices become the limiting factor when

trying to reduce the total memory footprint.

As described in the previous section, the bitline-selection transistors span a width of

Figure 5.4: Layout and schematic of wordline-select transistors made to

accommodate a minimum-width metal pitch.

8 metal lines when laid out to accommodate the minimum metal spacing. The wordline

drivers span a height of 24 metal lines when laid out to accommodate the minimum pitch.

Metal layer   Function (1-layer)      Function (2-layer)      Function (4-layer)
1             local IC, BL decode     local IC, BL decode     local IC, BL decode
2             local IC                local IC                local IC

5             WL decode               WL decode               local IC
6             VREAD bus               VREAD bus               WL decode
7             local BL                local BL (1)            VREAD bus
8             local WL                local WL                local WL (1)
9             global BL               local BL (2)            local BL (1)
10                                    global BL               local WL (2)
11                                                            local BL (2)
12                                                            local WL (3)
13                                                            global BL

Table 5.2: Metal layers and functions in multi-layer cross-point arrays.

Then, the total selection-transistor overhead will occupy the same area as a 32 wordline by

32 bitline array when packed as tightly as possible. Array efficiency is defined as the total

area occupied by the memory cells divided by the total combined area of the memory and

the peripheral circuitry. Thus, a completely two-dimensional array block of 32 wordlines

by 32 bitlines would have an array efficiency of just under 50%. There will need to be some

allowance for edge contacts and spacing.
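
The 50% figure follows directly from the spans quoted above. As a minimal sketch, with all areas counted in units of one metal pitch squared and with constant names of our own choosing:

```python
BLOCK_WORDLINES = 32
BLOCK_BITLINES = 32
BITLINE_SELECT_SPAN = 8    # metal lines spanned by the bitline-selection transistors
WORDLINE_DRIVER_SPAN = 24  # metal lines spanned by the wordline drivers


def array_efficiency(cell_area, overhead_area):
    """Memory-cell area divided by the combined cell and peripheral area."""
    return cell_area / (cell_area + overhead_area)


# Areas in units of (metal pitch)^2.
cell_area = BLOCK_WORDLINES * BLOCK_BITLINES
overhead_area = (BITLINE_SELECT_SPAN * BLOCK_BITLINES
                 + WORDLINE_DRIVER_SPAN * BLOCK_WORDLINES)

efficiency_2d = array_efficiency(cell_area, overhead_area)
print(f"overhead = {overhead_area} pitch^2, efficiency = {efficiency_2d:.1%}")
```

The overhead equals the area of the 32-by-32 array itself, so the ideal efficiency is exactly 50%. Edge contacts and spacing, which this sketch omits, push the real value slightly lower.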

For a single layer of cross-point memory, it is possible to fold the access circuitry

entirely underneath the memory array. As demonstrated above, the circuitry itself can be

staggered to match the minimum metal pitch. Figure 5.5 shows the layout of bitlines and

their access transistors. Figure 5.6 gives a three-dimensional perspective of the vertical vias

and local interconnect involved with routing bitlines to the substrate. Two metal layers are

needed for routing bitlines to the selection transistors in addition to the local and global

bitline metal layers.

Figure 5.5: Layout of bitline-select transistors made to fold under a

cross-point memory array.

The same technique can be employed for the wordlines with an additional metal layer

to bypass the bitline transistors, along with separate metal layers for the local and global

wordlines. To enable wordline-driver sharing, we have the contacts and associated selection

circuitry lie on alternating bitlines and wordlines. As the selection transistors need to be

Figure 5.6: Cross-sectional view of bitline-selection transistors made to

fold under a cross-point memory array.

staggered anyway to be accessed by metal lines in the minimum pitch, this arrangement

does not involve any extra overhead. This layout is shown with an abbreviated memory

array in Figure 5.7.

Folding the wordline drivers underneath the memory array becomes tricky, as the cor-

ners are already occupied by the bitline transistors. The solution to this is to fold the

wordline drivers under the adjacent memory arrays, and to reserve the remaining space un-

der the memory array for the bitline-selection transistors of neighboring blocks. Figure 5.8

and Figure 5.9 show how the bitline and wordline transistors can be folded under adjacent

memory blocks in a checkerboard pattern. However, in order for bitlines from adjacent

memory blocks to access the selection transistors, an additional metal layer needs to be

routed between the memory array and the substrate, bringing the total number of metal lay-

ers to 9. This is shown in Figure 5.10. No additional metal layers are required for routing

Figure 5.7: Layout of wordline drivers sized to fit under a cross-point

memory array.

the wordline drivers, because a single set of wordline drivers is shared between arrays.

If we consider an example page consisting of four 32 by 32 memory blocks with the

selection circuitry folded underneath the arrays as described above, we achieve an array

efficiency of 91.3% (without taking into account the decoders or sense circuitry), as the

only peripheral circuitry that cannot lie beneath the memory is the set of contacts connecting to

the global wordlines and bitlines.

5.4 Stacking Memory Layers

We have seen that, for a single layer of cross-point memory, the bitline-selection transis-

tors and wordline drivers can easily be placed under the memory arrays so as to present

very little overhead. For a second memory array stacked on top of the first, the same can

be achieved. Currently, contacts to the global wordlines occupy two peripheral edges of

Figure 5.8: Schematic of bitline-selection transistors made to fold

under a cross-point memory array.

each memory array, and contacts to the global bitlines occupy one peripheral edge of each

memory array. The one remaining peripheral edge can be used to route the second layer of

bitlines to the substrate devices, and the substrate devices to the global bitlines. Wordlines

will be shared between two layers of bitlines, as shown in Figure 5.11. Only one layer

of bitlines should be active at a time, so the wordline drivers will not need to drive any

more current in a single read operation than with the single-layer design. Furthermore, this

avoids using another metal layer and the vertical complexity of connecting a second set of

Figure 5.9: Schematic of wordline drivers made to fold under a

cross-point memory array.

wordlines. The additional bitlines should have no significant effect on the leakage current

during the read or write operations, as they are essentially "floating" when unselected, with

no path to ground.

If two second-layer arrays needed to connect to the substrate devices at the same shared

edge, this would require yet another metal layer for routing between the memory array

Figure 5.10: Cross-sectional view of bitline-selection transistors shared

between two adjacent memory arrays.

Figure 5.11: Shared wordlines in a two-layer cross-point memory.

and substrate devices. This is undesirable. An alternative solution is to abut substrate

device contacts from the first layer with substrate device contacts from the second layer.

This is demonstrated in Figure 5.12. This method does not require any additional metal

interconnect layers besides the metal layer for the second set of bitlines. The only additional

Figure 5.12: Connecting two layers of bitlines to the substrate.

area overhead is the space at the array edge where a set of bitline contacts must be inserted.

With vias surrounding every edge of the memory array and the selection circuitry on the

substrate below, the effective array efficiency is now 88.5% for two layers of memory cells

in a 32 by 32 block. The total number of metal layers required for fabrication is now ten.

5.4.1 Four memory layers

One of the advantages of a transistor-free cross-point memory array lies in the fact that

the memory can be stacked indefinitely above the substrate. With some additional decoder

complexity, it is possible to build up to four memory layers without significant area over-

head at the array level.

At the current array pitch, it would not be possible to route additional bitline layers to

their selection circuitry in the substrate. Thus, to build additional memory layers, we must

increase the number of wordline layers.

Figure 5.13: Four-layer memory.

Figure 5.14: Four-layer memory with buffer connections.

We can increase the number of vertical layers to four by sharing wordlines, as shown

in Figure 5.13. The top and bottom wordline layers (1) and (3) can share a set of selection

transistors, as they deliver current to different bitline layers. The middle wordline layer (2)

is connected to driver transistors on the opposite side of the array, as shown in Figure 5.14.

All wordline driver transistors are connected to a global bus that provides the appropriate

bias for the read or write operation.

The additional metal layers required are the two layers for the wordline layers (1) and

(3), and one more layer for routing. At this point, there is limited floorspace underneath

the memory array. The wordline-selection circuitry for the two-layer design occupies a

height of 24 bitlines, as shown in Figure 5.4. After accounting for spacing and metal via

requirements, an additional height of 10 bitlines will be necessary between all rows to allow

for the wordline circuitry that does not fit beneath the memory array. This decreases the

array efficiency to 71.7%.

Now that all four edges of each memory array are surrounded by vias, it would be dif-

ficult to increase the number of memory layers beyond four without a significant decrease

in array efficiency.

Table 5.2 summarizes the number of metal layers required for distributing all of the

necessary interconnect for multiple layers of memory. It is critical that the fabrication

technology be able to maintain minimum feature-size spacing even at the highest metal

layers.
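
To see how the falling array efficiency trades off against the added layers, the sketch below combines the ideal cell size of 4F^2/n with the array efficiencies quoted in this chapter. Dividing the ideal cell size by the array efficiency is our own simplification for illustration, not a formula taken from the design itself.

```python
# Array efficiencies quoted in the text for 32-by-32 blocks.
ARRAY_EFFICIENCY = {1: 0.913, 2: 0.885, 4: 0.717}


def effective_cell_area_f2(layers, efficiency):
    """Footprint per bit in units of F^2: ideal 4F^2/n scaled by array efficiency."""
    return 4.0 / (layers * efficiency)


areas_f2 = {n: effective_cell_area_f2(n, eta) for n, eta in ARRAY_EFFICIENCY.items()}

for n, area in areas_f2.items():
    print(f"{n}-layer: {area:.2f} F^2 per bit")
```

Under this assumption the footprint per bit still shrinks with every added layer, even though the four-layer array is noticeably less efficient than the one- and two-layer arrays.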

5.5 3D Memory Area Comparisons

A typical die size in 65 nm CMOS technology is about 150 mm². Our 32-by-32-bit block

(1 kbit) has an area of 17.3 µm². If we assume a conservative estimate of 50% memory

area efficiency, then we should be able to fit approximately 4.3 Gbits on a 150 mm² die

with the single-layer cross-point memory design. This gives us an overall bit-efficiency

rating of 28.9 Mbits/mm².
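
These figures can be reproduced with a few lines of arithmetic. The sketch below assumes a metal pitch of 2F = 130 nm and reads "1 kbit" as 1000 bits; both are our assumptions, chosen because they reproduce the quoted numbers. Counting 1024 bits per block instead gives roughly 4.4 Gbits.

```python
FEATURE_SIZE_UM = 0.065
PITCH_UM = 2 * FEATURE_SIZE_UM  # assumption: line width F plus line spacing F
LINES_PER_BLOCK = 32

DIE_AREA_MM2 = 150.0
BLOCK_AREA_UM2 = 17.3           # block area quoted in the text
BITS_PER_BLOCK = 1000           # assumption: "1 kbit" read as 1000 bits
MEMORY_AREA_FRACTION = 0.50     # conservative die efficiency from the text
UM2_PER_MM2 = 1e6

# Cross-check of the quoted block area from the assumed pitch.
block_area_check_um2 = (LINES_PER_BLOCK * PITCH_UM) ** 2

memory_area_um2 = DIE_AREA_MM2 * UM2_PER_MM2 * MEMORY_AREA_FRACTION
total_bits = memory_area_um2 / BLOCK_AREA_UM2 * BITS_PER_BLOCK
gbits = total_bits / 1e9
mbits_per_mm2 = total_bits / 1e6 / DIE_AREA_MM2

print(f"block area ~ {block_area_check_um2:.1f} um^2")
print(f"capacity ~ {gbits:.1f} Gbit, density ~ {mbits_per_mm2:.1f} Mbit/mm^2")
```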

Figure 5.15: Schematic of wordline drivers made to fold under a

four-layer cross-point memory array.

If we assume that the peripheral circuitry (decoders, page buffers, sense amplifiers, etc.)

occupies the same area for a multi-layer memory as for a single-layer memory, and that the

only difference in total layout area is the array efficiency shown in Figure 5.16, then a

two-layer cross-point memory would have a die efficiency of 48.5% and a four-layer cross-

point memory would have a die efficiency of 39.3% based on the scaling of their respective

array efficiencies. These efficiency estimates are very conservative.
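
The scaling used here is a simple ratio of array efficiencies. A minimal sketch, assuming the 50% single-layer die efficiency and the array efficiencies quoted earlier in this chapter (the function name is ours):

```python
SINGLE_LAYER_DIE_EFFICIENCY = 0.50
ARRAY_EFFICIENCY = {1: 0.913, 2: 0.885, 4: 0.717}


def scaled_die_efficiency(layers):
    """Scale the single-layer die efficiency by the ratio of array efficiencies."""
    ratio = ARRAY_EFFICIENCY[layers] / ARRAY_EFFICIENCY[1]
    return SINGLE_LAYER_DIE_EFFICIENCY * ratio


die_efficiency = {n: scaled_die_efficiency(n) for n in ARRAY_EFFICIENCY}

for n, eta in die_efficiency.items():
    print(f"{n}-layer die efficiency: {eta:.1%}")
```

This reproduces the 48.5% and 39.3% values for the two-layer and four-layer designs.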

A two-layer cross-point memory should allow for over 8 Gbits to fit on the same

Figure 5.16: Array efficiency vs. number of vertical memory layers,

assuming 32-by-32-bit arrays.

150 mm² die. A four-layer cross-point memory built for an 8 Gbit capacity would occupy

a total die area of only 88.7 mm², as listed in Table 5.3.

Table 5.3 lists some production NAND flash memory devices fabricated in similar pro-

cess technologies and their area efficiencies to provide a basis of comparison to the multi-

layer cross-point memories presented in this chapter [33] [34] [35] [36]. Even with a con-

servative efficiency estimate, the four-layer cross-point memory still achieves a far better

bit-efficiency than even Samsung’s multi-level NAND flash that stores two bits per cell.

One of the major challenges in maximizing area efficiency is the consideration that

memory access times could be limited by the additional wire length needed to reach the

read circuitry as well as the parasitic capacitance from long wires spaced closely together.

An evaluation of the performance of a 3-dimensional cross-point design will be presented

in the next chapter.

                               Memory size   Die size     Bit-efficiency   Die efficiency
                                                          (Mbits/mm²)
Samsung SLC (65 nm)            4 Gb          131 mm²      31.3             54%
Toshiba SLC (65 nm)            4 Gb          137 mm²      29.2             60.4%
Hynix/STMicro (70 nm)          4 Gb          144 mm²      28.4             65.4%
Micron SLC (50 nm)             8 Gb          169.5 mm²    47.2             65%
Samsung MLC (63 nm)            8 Gb          133 mm²      61.6             70%
1-layer Cross-point (65 nm)    4 Gb          139 mm²      29.5             50%
2-layer Cross-point (65 nm)    8 Gb          144 mm²      56.9             48.5%
4-layer Cross-point (65 nm)    8 Gb          88.7 mm²     92.4             39.3%

Table 5.3: NAND flash memory area efficiency comparisons.

Chapter 6

Performance Analysis

EVEN THOUGH the cross-point memory array is expected to have relatively slow

access times due to the fact that the design is optimized for area, performance esti-

mates are necessary to perform a comprehensive evaluation of the 3-dimensional memory

model.

An 8 Gb cross-point memory array built in four layers was modeled and simulated in

HSPICE using circuit parameters for the TSMC 65 nm CMOS process with a nominal

operating voltage of 1.2 V.

This chapter presents the performance results from these circuit simulations as well

as a comparison of the cross-point memory performance requirements with those of other

nonvolatile memories in production today.

6.1 8 Gb Memory Architecture

As described in Section 5.1, 32-bit by 32-bit arrays are tiled into a page, with 128 rows

and 256 columns per page. The column decoders control the bitline-selection devices and

are located above the page. The wordline decoders control the wordline-selection devices

Figure 6.1: Four-layer cross-point memory implemented in an 8 Gb

architecture.

and are located to the sides of the page. The sense amplifiers used in this model are the

4x MOSFET sense amplifiers (2) described in Section 4.2. Unfortunately, a memory archi-

tecture implementing the lateral bipolar sense amplifiers could not be simulated for perfor-

mance evaluations because we did not yet have an accurate SPICE model for lateral bipolar

devices.

A complete read operation reads two columns of bits, one from each half of a page.

This takes two cycles, as a read cycle accesses alternating rows from each half. Each read

cycle senses 4 kb of data.

A page stores 32 Mb per layer, so a four-layer page has a memory capacity of 128 Mb.

We want to simulate a total memory size of 8 Gb as this is a commonly-used capacity for

NAND flash memories currently in production in 65 nm process technologies. We choose

to model a four-layer memory architecture as this offers the highest bit density available

for our design, as described in Chapter 5. Modeling an entire read path from address-latch

to buffered output will give us a good basis for comparison of the cross-point memory with

other memory architectures.

The memory utilizes a single-core architecture with 64 four-layer pages. The pages can

be arranged in a number of ways, but a design with 32 pages in two vertical columns was

chosen for simulation. This layout most closely resembles most NAND flash architectures.

In the future, charge pumps and programming circuitry could be placed between the two

columns. A general block diagram for the 8 Gb memory is shown in Figure 6.1.
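
The capacity hierarchy described in this section can be tallied as follows. This is only a bookkeeping sketch with constant names of our own; the block, page, and layer counts are those given in the text.

```python
BLOCK_BITS = 32 * 32          # one 32-bit by 32-bit array
BLOCK_ROWS_PER_PAGE = 128
BLOCK_COLS_PER_PAGE = 256
LAYERS = 4
PAGES = 64

MBIT = 2 ** 20
GBIT = 2 ** 30

bits_per_page_layer = BLOCK_BITS * BLOCK_ROWS_PER_PAGE * BLOCK_COLS_PER_PAGE
bits_per_page = bits_per_page_layer * LAYERS
total_bits = bits_per_page * PAGES

print(f"per page and layer: {bits_per_page_layer // MBIT} Mb")
print(f"per four-layer page: {bits_per_page // MBIT} Mb")
print(f"total: {total_bits // GBIT} Gb")
```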

6.1.1 Decoder architecture

A 21-bit address is required to access a column of 4096 bits in a single read cycle. Every

page receives a 5-bit wordline address to select one of 32 wordlines. The wordline-selection

signal propagates across the page. Each page also receives an 8-bit column address and a

2-bit layer selection code. The column decoder output controls the bitline-selection tran-

sistors. The wordline address is decoded using a two-stage decoder and the column address

enters a three-stage decoder. The page outputs ultimately pass through a mux controlled by

a 6-bit, two-stage page decoder.
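
The four address fields add up to the 21 bits required. The sketch below splits a 21-bit address into those fields. The field widths come from the text, but the ordering of the fields within the address word is hypothetical and chosen only for illustration.

```python
# Field widths from the text; the MSB-to-LSB ordering is a hypothetical choice.
FIELDS = (("page", 6), ("layer", 2), ("column", 8), ("wordline", 5))

ADDRESS_BITS = sum(width for _, width in FIELDS)
BITS_PER_READ_CYCLE = 4096


def split_address(address):
    """Split a read-cycle address into its decoder fields."""
    if not 0 <= address < (1 << ADDRESS_BITS):
        raise ValueError("address does not fit in the address word")
    fields = {}
    shift = ADDRESS_BITS
    for name, width in FIELDS:
        shift -= width
        fields[name] = (address >> shift) & ((1 << width) - 1)
    return fields


# Page 63, layer 3, column 1, wordline 3 under the assumed ordering.
example = split_address(0b111111_11_00000001_00011)
print(ADDRESS_BITS, example)
```

With 4096 bits returned per read cycle, 2^21 distinct addresses cover exactly 8 Gb.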

6.1.2 Critical path

The slowest-case read operation occurs when trying to access one of the bottom-most pages

in the two columns shown in Figure 6.1. In this case, the address bits must be propagated

from the address buffer all the way across the die. The critical path follows the address

bits to the wordline decoder, which must drive the capacitance of a metal line the width

of a page as well as 256 wordline drivers. The local wordlines and bitlines charge within

picoseconds, as they are only 32 bits in length. The remainder of the critical path involves

charging the global bitlines, sensing and latching the bitline signal, and propagating the

data to the output buffer.

6.1.3 Assumptions

We assume that the output buffer must drive a 10 pF load. As the memory array has been

optimized for area, it is assumed that all decoders and buffers are built using minimum-size

transistors, with the exception of the wordline drivers and the output buffer. Also, all data

buses are assumed to be laid out in the minimum metal pitch. In order to achieve a fast

burst speed (the speed at which buffered data is piped to the I/O), the output buffer will be

optimized for a fan-out of 4 (FO4). That is, CLOAD/CIN = 4. The typical FO4 delay in a

65 nm process technology is 18 ps. An eight-stage buffer is needed to drive a 10 pF output

load.
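
As a rough check of the buffer sizing, the sketch below works backwards from the eight FO4 stages and the 10 pF load to the input capacitance they imply, and estimates the ideal delay as eight FO4 delays. This is a first-order estimate under our own simplifications: it ignores wire and pad parasitics, so a simulated delay can be expected to be somewhat larger.

```python
FANOUT = 4
STAGES = 8
FO4_DELAY_PS = 18.0
LOAD_PF = 10.0

# Each stage drives a load four times its own input capacitance.
total_gain = FANOUT ** STAGES
implied_input_cap_ff = LOAD_PF * 1000.0 / total_gain

# Ideal delay if every stage contributes exactly one FO4 delay.
ideal_delay_ps = STAGES * FO4_DELAY_PS

print(f"total capacitance gain: {total_gain}")
print(f"implied first-stage input capacitance: {implied_input_cap_ff:.3f} fF")
print(f"ideal buffer delay: {ideal_delay_ps:.0f} ps")
```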

The interface for the 8 Gb memory will have 8 I/O pads. The 21 address bits will be

stored in an address buffer in three address cycles. As described in the previous chapter,

each read operation consists of two read cycles, accessing alternating even/odd rows of

blocks in a single page column. For these simulations, each single read cycle is considered

independently addressed. Ultimately, it will be up to the memory file system to control the

order of read cycles. That is, a single read access of even/odd rows in a page column may

not necessarily be followed by a second read access of odd/even rows in the same page

column. Because each read cycle stores 4 kb of data in the page buffer, this data also will

be shifted out 8 bits at a time.

Other assumptions used for the 8 Gb memory model are that the decoders implemented

are fully static and that all wordlines and bitlines are fabricated in copper and have a thick-

ness of 200 nm, regardless of metal layer. The memory layers are fabricated above the

first four metal layers, with the local block selection transistors underneath the memory

arrays. All other peripheral circuitry, including decoders, sense amplifiers, and buffers lie

in the substrate surrounding the memory arrays. It is assumed that the local interconnect

within the circuitry has minimal effect on capacitance, but delay contributions due to the

capacitance from address and data buses as well as bitlines and wordlines are accounted

for.

6.2 Memory and Array Models

6.2.1 Simulation methodology

Parasitic interconnect delay is one of the greatest contributors to read latency in the cross-

point memory array. Long bitlines and wordlines are fabricated in the minimum metal pitch

available for the technology process, leading to high parasitic resistance and capacitance. In

a typical memory architecture, the parasitic interconnect capacitances are among the most

difficult parameters to estimate accurately. In order to accurately model parasitic capaci-

tance, each wire must be treated as a three-dimensional structure in metal or polysilicon,

interacting with all of the surrounding wires and the ground plane.

Parasitic capacitances in the multi-layer cross-point memory arrays were modeled and

extracted using Ansoft Q3D Extractor [37], a parasitic extraction technology that uses the

Finite Element Method to compute 3D capacitance and resistance parameters of a structure

and automatically generates an equivalent SPICE sub-circuit. This sub-circuit can then be

simulated in HSPICE, allowing for a more accurate estimation of read latency.

It is also necessary to consider resistive loads when measuring delay times. This is a

more straightforward calculation. We use standard wire resistances of 0.07 Ω per square

for metal layers 1-3, and 0.03 Ω per square for all higher metal layers.
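
Wire resistance follows from the sheet resistance and the number of squares along the wire. The example below is a sketch: the wire dimensions and the choice of metal layer 8 are hypothetical, picked to represent a local line that crosses 32 cells at an assumed 130 nm pitch.

```python
SHEET_RES_LOWER_OHM_SQ = 0.07  # metal layers 1-3
SHEET_RES_UPPER_OHM_SQ = 0.03  # all higher metal layers


def sheet_resistance(metal_layer):
    """Sheet resistance in ohms per square for a given metal layer."""
    return SHEET_RES_LOWER_OHM_SQ if metal_layer <= 3 else SHEET_RES_UPPER_OHM_SQ


def wire_resistance(length_nm, width_nm, metal_layer):
    """Resistance of a wire as (number of squares) x (sheet resistance)."""
    squares = length_nm / width_nm
    return squares * sheet_resistance(metal_layer)


# Hypothetical local line: 32 cells at 130 nm pitch, 65 nm wide, on metal 8.
local_line_ohms = wire_resistance(32 * 130.0, 65.0, 8)
print(f"local line resistance ~ {local_line_ohms:.2f} ohm")
```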

6.2.2 Capacitance modeling

The memory cell model was built according to the characteristics described in Section 3.4.

TiN electrodes sandwich a thin HfO2 resistance-change memory layer. A thin TiO2 in-

terfacial layer lies between the HfO2 layer and the electrode closer to the bitline. The

material thicknesses and dielectric constants are listed in Table 6.1. The total capacitance

across the memory cell as extracted by Ansoft Q3D is 0.0679 fF. We assume that both the

high-resistance state and the low-resistance state have the same capacitance.

Material   Thickness   Dielectric constant κ
TiN        20 nm       769
HfO2       20 nm       19
TiO2       5 nm        106

Table 6.1: HfO2-based memory cell model parameters.
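
For orientation, the cell capacitance can be estimated by hand with the parallel-plate formula, treating the HfO2 and TiO2 layers as two capacitors in series over a 65 nm by 65 nm cell. This is our own back-of-the-envelope sketch, not the extraction method used in this work. It treats the TiN electrodes as ideal conductors and ignores fringing fields, so it should come out below the value extracted by the field solver.

```python
EPS0_F_PER_M = 8.854e-12  # vacuum permittivity


def plate_capacitance(kappa, area_m2, thickness_m):
    """Ideal parallel-plate capacitance, with no fringing fields."""
    return EPS0_F_PER_M * kappa * area_m2 / thickness_m


CELL_AREA_M2 = (65e-9) ** 2

c_hfo2 = plate_capacitance(19, CELL_AREA_M2, 20e-9)
c_tio2 = plate_capacitance(106, CELL_AREA_M2, 5e-9)

# The two dielectric layers are stacked, so their capacitances add in series.
c_series = c_hfo2 * c_tio2 / (c_hfo2 + c_tio2)

print(f"HfO2: {c_hfo2 * 1e15:.4f} fF, TiO2: {c_tio2 * 1e15:.4f} fF")
print(f"series estimate: {c_series * 1e15:.4f} fF")
```

The estimate is of the same order of magnitude as the extracted 0.0679 fF but smaller, as expected for a model that omits fringing fields.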

A 32 x 32 four-layer array was also modeled in order to extract the parasitic capac-

itances of the wordlines and bitlines. The ground plane lies 1.6 µm below the memory

array, allowing for four layers of interconnect. All wordlines and bitlines are built with a

width and spacing equal to the minimum feature size, 65 nm. The memory cells lie at the intersections

between the wordlines and bitlines and have an area of 65 nm by 65 nm. We assume a SiO2

dielectric material with a dielectric constant κ of 4.0.

Figure 6.2: 32 x 32 cross-point memory array modeled in Ansoft Q3D.

Figure 6.3: Multi-layer memory array represented as an RC network for

simulation in HSPICE.

The lateral parasitic capacitance between 32-bit metal lines is 0.615 fF, and the verti-

cal parasitic capacitance between 32-bit metal lines is 0.0892 fF. This is consistent with

documentation for 65 nm process technologies. Parasitic capacitances for address and data

buses were extrapolated from these numbers.

6.3 Waveform Analysis and Timing Diagrams

Figure 6.4 shows the modeled waveforms of a single access performed on the 8 Gb four-

layer memory architecture described above, operating at the 1.2 V nominal voltage for

the 65 nm process technology. A worst-case access was simulated, reading a bit from the

leftmost column of the bottom-most page in the memory structure.

As seen in Figure 6.4, the address signal takes over 100 ns to fully propagate to the

decoder input of the farthest page from the address buffer. However, the wordline selection

transistors can latch the address by 30 ns. The global wordline is sufficiently charged by

35 ns, and the local wordline can drive sufficient current for sensing to the local bitlines by

45 ns. The global bitline is charged at 50 ns and the sense amplifier has a sufficient output

swing for latching by 66 ns. The data from the page buffer can be propagated through the

I/O driving a 10 pF load by 70 ns, but the page buffer output lines are not stabilized until

104 ns. After a stable signal is latched by the page buffer at 104 ns, the next read cycle can

commence. The next page buffer output should not be shifted out until after 104 ns even if

the output data may be valid by 70 ns.
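
The milestones quoted above can be turned into the time elapsed between successive events. The sketch below does only that bookkeeping; the labels are our paraphrases of the text, and the intervals are not a substitute for the simulated waveforms.

```python
# Cumulative times (ns) at which each event has occurred, as quoted in the text.
MILESTONES_NS = [
    ("address latched by wordline selection transistors", 30),
    ("global wordline charged", 35),
    ("local wordline driving sense current", 45),
    ("global bitline charged", 50),
    ("sense amplifier output ready to latch", 66),
    ("data valid at I/O", 70),
    ("page buffer outputs stable", 104),
]


def elapsed_between(milestones):
    """Return (label, time elapsed since the previous milestone) pairs."""
    intervals = []
    previous = 0
    for label, time_ns in milestones:
        intervals.append((label, time_ns - previous))
        previous = time_ns
    return intervals


intervals = elapsed_between(MILESTONES_NS)
for label, delta in intervals:
    print(f"{delta:3d} ns  {label}")
```

The two largest intervals are the initial address propagation and the final settling of the page buffer outputs, which matches the observation that bus propagation dominates the read latency.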

A precharge condition where the global wordlines and bitlines are charged to a voltage

close to the read bias prior to an initial page access could decrease the read latency by about

25 ns. The read operation could also be sped up by introducing intermediate buffers for the

address and data buses, as the propagation of address and data bits across the die accounts

for the greatest delay component in a read operation. However, these buffers will need to

be fitted into the memory array and will likely increase the area overhead.

Most of the delay components seen in Figure 6.4 are independent of the number of

Figure 6.4: Simulation waveforms for 8 Gb memory read access.

layers of memory. The charging of the local wordlines and bitlines is the step most

affected by increasing the number of memory layers, as this adds vertical parasitic capac-

itance between the wordlines and bitlines. However, these parasitic effects are only about

0.0892 fF per local bitline or wordline and will not significantly affect the access time.

6.3.1 Burst read

Only 8 bits are available for the I/O, while a single read cycle accesses 4096 bits. Thus, it

takes 512 cycles to fully output the 4096 bits. An eight-stage output buffer optimized for

a fan-out of 4 can shift out each set of data in 173 ps, but it could be clocked as slowly as

201 ps so that the data is shifted out within a read cycle of 104 ns.
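
The burst timing can be verified with a short calculation. The numbers are those quoted above; the variable names are ours.

```python
BITS_PER_READ_CYCLE = 4096
IO_WIDTH = 8
READ_CYCLE_NS = 104.0

shift_cycles = BITS_PER_READ_CYCLE // IO_WIDTH


def burst_time_ns(period_ps):
    """Time to shift out one read cycle of data at the given clock period."""
    return shift_cycles * period_ps / 1000.0


fastest_ns = burst_time_ns(173)   # output buffer running at its fastest
relaxed_ns = burst_time_ns(201)   # slower clock that still fits in a read cycle

# Longest clock period that still empties the buffer within one read cycle.
max_period_ps = READ_CYCLE_NS * 1000.0 / shift_cycles

print(f"{shift_cycles} shift cycles")
print(f"173 ps clock: {fastest_ns:.1f} ns, 201 ps clock: {relaxed_ns:.1f} ns")
print(f"longest usable period: {max_period_ps:.1f} ps")
```

At a 201 ps period the full 4096 bits are shifted out in about 102.9 ns, just inside the 104 ns read cycle.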

Figure 6.5: Read operation timing diagram.

6.4 Performance Comparisons with Commercial Products

Table 6.2 shows a comparison of the performance numbers from various types of non-

volatile memory currently in production, from NOR flash [35] to SanDisk’s 3D one-time-

programmable memory [38] [39]. It is clear that the 3D cross-point memory has an access

time (tR access) almost as fast as that of a lower-capacity NOR flash, and a burst read

time (tR sequential) that is much faster than that of NAND flash. Of course, the 8 Gb

cross-point memory architecture described in this chapter does not implement any sort of

error-correction or redundancy, which adds significant overhead to the

access times of flash memory.

                           tR access   tR sequential   Iavg/Imax       Pavg/Pmax
Numonyx 1Gb NOR Flash      100 ns      25 ns           21 mA/24 mA     35.7 mW/40.8 mW
Micron 8Gb NAND Flash      25 µs       30 ns           15 mA/30 mA     40.5 mW/81 mW
SanDisk 1Gb 3-D OTP        140 µs      100 ns          20 mA/30 mA     54 mW/81 mW
4-layer 8Gb Cross-point    104 ns      201 ps          27 mA/51 mA     32.4 mW/61.2 mW

Table 6.2: Nonvolatile memory performance comparisons.

The power requirements for the read operation were calculated only based upon the

components used in the read path. We have to assume perfect voltage gating for all other

on-chip circuitry simply because it was not modeled or simulated. In reality, other

peripheral circuitry such as that used for the write operation would contribute some amount

of power dissipation due to leakage current during the read operation.
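
The power and current columns of Table 6.2 are related by P = I x V, so dividing one by the other recovers the supply voltage each entry implies. The sketch below performs that division. It is our own consistency check on the table and assumes nothing beyond the tabulated values.

```python
# (Iavg mA, Imax mA), (Pavg mW, Pmax mW) from Table 6.2.
TABLE_6_2 = {
    "Numonyx 1Gb NOR Flash": ((21, 24), (35.7, 40.8)),
    "Micron 8Gb NAND Flash": ((15, 30), (40.5, 81)),
    "SanDisk 1Gb 3-D OTP": ((20, 30), (54, 81)),
    "4-layer 8Gb Cross-point": ((27, 51), (32.4, 61.2)),
}


def implied_supply_v(current_ma, power_mw):
    """Supply voltage implied by P = I x V (mW / mA = V)."""
    return power_mw / current_ma


implied = {
    name: tuple(implied_supply_v(i, p) for i, p in zip(currents, powers))
    for name, (currents, powers) in TABLE_6_2.items()
}

for name, (v_avg, v_max) in implied.items():
    print(f"{name}: {v_avg:.2f} V (avg), {v_max:.2f} V (max)")
```

For the cross-point memory both entries give 1.2 V, the nominal operating voltage used in the simulations.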

The cross-point memory current draw is higher than that of the other memory archi-

tectures presented primarily because the cross-point memory has the widest datapath. We

read 4 kb of data per read cycle, compared to 2 kb per read cycle for NAND flash. The

wide datapath ensures that the cross-point memory achieves the fastest bit-rate.

6.4.1 NAND flash latency

The 3D cross-point memory and NAND flash are both optimized for area, but NAND flash

access times are more than two orders of magnitude slower. There are several reasons why NAND

flash latency is so much higher. NAND flash memory typically has 16 or 32 floating-gate

transistors in series to form a bitline [40]. So many MOSFETs in series need increasingly

higher gate voltages to ensure that VGS is greater than the threshold voltage. This requires

charge pumps to generate the required read voltage, typically around 5 V.

Not only does it take tens of microseconds for a charge pump to produce the required

read voltage from a 2.7 V operating voltage [41], but it would take hundreds of nanoseconds

to raise the long polysilicon wordlines to such a high voltage.

As higher voltages are required for both the read and write operations, longer-channel

devices are required for the peripheral circuitry in order to maintain reliability [42]. These

long-channel devices have slower switching speed than standard devices. The clock speed

needs to be slowed to accommodate the high-voltage devices, and NAND flash can only

pipe data out in sequential (burst) access operations as fast as the clock speed.

Although the write circuitry for cross-point memory is not discussed in this work, it is

anticipated that high-voltage devices will not be required for SET or RESET operations.

Another consideration is error-correction. NAND flash typically has built-in error-

correction capabilities that take another 5-10 µs to decode, as the error-correcting circuits

also use the same global clock that has been slowed to accommodate the high-voltage cir-

cuitry. Error-correction is not something that has been accounted for in the cross-point

memory design, and will likely be necessary for actual products.

Table 6.3 summarizes the key latency components for the 3D cross-point memory,

NAND flash, and NOR flash [43]. The 8 Gb cross-point memory has a similar wordline

charge time (which includes the address propagation and decode) to NOR flash. The sense

time is nearly the same between the cross-point memory, NOR flash and NAND flash. All

three memories also share a similar delay time for propagating data to the output buffer.

NOR flash does not typically implement on-chip error-correction and hence has no further

delay.

                                3D Cross-point   NAND flash   NOR flash
Wordline charge time            38 ns            20 µs        36 ns
Sense amp latency               16 ns            17 ns        18 ns
Page buffer datapath latency    50 ns            60 ns        42 ns
Error-correction                0                5 µs         0
Total                           104 ns           25 µs        100 ns

Table 6.3: Read path timing signal comparisons.

Chapter 7

Test Chip Design

A TEST CHIP was designed and fabricated to serve as proof-of-concept for the memory

array architecture during the read operation described in Chapter 3. The test

chip also includes test structures for evaluating the performance of the lateral bipolar junc-

tion transistors described in Chapter 4. Other goals served by the test chip are to verify the

functionality of the sense amplifiers for different array sizes and determine the minimum

detectable RON and ROFF resistance values for these array sizes. Finally, voltage scaling

capabilities are also studied.

7.1 Memory Array Implementation

The test chip was fabricated using TSMC 0.18 µm CMOS technology. This is a triple-well

process, enabling the fabrication of lateral bipolar devices. Because the resistance-change

HfO2 devices were not yet available for deposition, the memory cells were modeled us-

ing PMOS transistors at the cross-point junctions. The gate voltage of these transistors

was externally controlled to emulate different values for RON and ROFF . Even though

the memory architecture thus far has been designed to accommodate the resistance-change

characteristics described in Chapter 3, testing controlled variations in ON and OFF resis-

tances allows us to estimate the range of memory-cell resistance variations that our design

can tolerate.

Figure 7.1: Test chip architecture.

Figure 7.1 shows the test chip architecture. There are four rows of memory arrays,

each consisting of four 16-wordline by 256-bitline blocks. The bitlines of each block are

separated by selection transistors so that the effective block size can be increased from 16

to 64 wordlines during testing. Each row of blocks is connected to an array of 256 sense

amplifiers that buffer into a shift register. Four sense amplifier designs were laid out, one

for each row. The first row uses the MOSFET sense amplifier (1) design optimized for

16 wordlines, as described in Section 4.2. The second row uses the 4x MOSFET sense

amplifier (2) optimized for 32 wordlines. The third row uses the lateral bipolar NPN sense

amplifier laid out in the typical MOSFET structure (Section 4.3). The fourth row uses the

lateral bipolar NPN sense amplifier laid out in an octagonal structure.

PMOS transistor test structures were fabricated and microprobed to characterize the

I-V curves and approximate the memory cell resistance value corresponding to each gate

voltage. This is shown in Figure 7.2. The resistance range from 1 kΩ to 100 kΩ is difficult

to control at a fine resolution because of the steep dependence of resistance on Vg in that

range. This resistance model also assumes that the voltage between the drain and the source

is constant at -0.8 V. The source voltage of the PMOS transistors actually varies depending

on the bitline voltage of the array being accessed, which depends on the sense amplifier

design used for the array. The bitline voltages were microprobed during read access to

verify the VDS across the memory cell. Thus, each memory array tested had memory cells

with slightly different resistance characteristics.
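
A minimal sketch of how an effective cell resistance can be derived from a measured bias point, assuming the resistance is simply |VDS / ID| at a fixed VDS. The function name and the sample currents are hypothetical placeholders, not measurements from the test chip.

```python
def effective_resistance(v_ds, i_d):
    """Return |V_DS / I_D| in ohms for one measured bias point."""
    if i_d == 0:
        raise ValueError("drain current is zero; resistance is undefined")
    return abs(v_ds / i_d)


# Hypothetical (gate voltage in V, drain current in A) pairs at V_DS = -0.8 V.
# These currents are placeholders chosen for illustration only.
V_DS = -0.8
SAMPLE_POINTS = [(0.0, -8.0e-4), (0.4, -8.0e-5), (0.8, -8.0e-6)]

RESISTANCE_BY_VG = {
    vg: effective_resistance(V_DS, i_d) for vg, i_d in SAMPLE_POINTS
}
```

Because the actual VDS depends on the bitline voltage of the array being read, the same gate voltage would map to a slightly different resistance in each row, which is why the bitline voltages were probed.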

7.2 Device Measurement Results

Test structures for both of the lateral NPN transistor designs described in Chapter 4 were

fabricated and microprobed. The I-V characteristics are shown in Figure 7.3 and Figure 7.4.

Unfortunately, the gain (β) of the lateral BJTs was highly inconsistent. For the MOSFET-

style lateral BJT, β varied from 13.6 to 21.3 on test structures from eight die samples. For

the octagonal-style lateral BJT, β varied from 7.78 to 22.1 on test structures from the same

sample set. The expected β for a vertical BJT fabricated with the same emitter perimeter

(4 µm) is 21.4.

Figure 7.2: Effective resistance of PMOS memory cell vs. gate voltage.

Resistance is shown in ohms. The source and drain are

biased at 0.8 V and 0 V, respectively.

Figure 7.3: I-V characteristics of the standard MOSFET-style BJT.

Figure 7.4: I-V characteristics of the octagonal-style BJT.

It is apparent from the I-V plots that the lateral devices suffer from a significant Early

effect. The Early voltage was extracted for each lateral NPN device by fitting a regression

line to the I-V curves shown in Figure 7.3 and Figure 7.4. The MOSFET-style device

had an Early voltage of 5.7 V while the octagonal NPN device had an Early voltage of

4.8 V. Typical Early voltages for vertical NPN transistors of the same dimensions should

be closer to 50 V [44]. The Early effect is caused by a narrowing of the base width as a

reverse bias across the collector-base junction increases the collector-base depletion width.

The collector region should have a more lightly doped layer to allow the formation of a

wide depletion region extending into the collector, rather than into the base. Because the

collector has the same doping concentration as the emitter in a lateral device, the Early

voltage is significantly decreased.
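
The regression-line extraction described above can be sketched as follows, assuming the forward-active collector current follows I_C = I_C0 (1 + V_CE / V_A), so that the fitted line crosses zero current at V_CE = -V_A. The function names and the synthetic data are assumptions for illustration; no measured data from the test chip are reproduced here.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit y = slope * x + intercept."""
    n = len(xs)
    if n < 2 or n != len(ys):
        raise ValueError("need at least two matched points")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("all x values are identical")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def early_voltage(v_ce, i_c):
    """Extrapolate the fitted line to I_C = 0; V_A is minus that V_CE intercept."""
    slope, intercept = fit_line(v_ce, i_c)
    if slope <= 0:
        raise ValueError("collector current does not rise with V_CE")
    return intercept / slope


# Synthetic forward-active data generated with V_A = 5.7 V (illustrative only).
V_A_TRUE = 5.7
I_C0 = 1.0e-4
V_CE = [0.5, 1.0, 1.5, 2.0]
I_C = [I_C0 * (1 + v / V_A_TRUE) for v in V_CE]

V_A_FIT = early_voltage(V_CE, I_C)
```

In practice only points well inside the forward-active region should be passed to the fit, since the saturation knee would bias the slope.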

7.2.1 Discussion of lateral NPN transistor characteristics

There are a number of factors that could have contributed to the wide variations in β.

First, the lateral BJT device has a base width that is determined by the gate length of the

MOSFET, which was 0.18 µm in this case. A vertical BJT can have a base width of only

tens of nanometers. The base current does not depend on the gate length, but the collector

current and hence the gain are inversely proportional to the gate length [28]. Gate-length

variations are frequently present in CMOS technology, resulting in variations in β. This

was probably exacerbated by the implementation of octagonal-shaped gate designs.
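
To illustrate how gate-length variation translates into gain variation, the sketch below applies the first-order inverse proportionality stated above. The ±10% gate-length spread is a hypothetical figure chosen for illustration, and the function name is an assumption.

```python
def scaled_beta(beta_nominal, l_nominal_um, l_actual_um):
    """First-order gain estimate, assuming beta is inversely proportional to gate length."""
    if l_actual_um <= 0:
        raise ValueError("gate length must be positive")
    return beta_nominal * l_nominal_um / l_actual_um


BETA_NOMINAL = 21.4   # expected gain quoted in the text
L_NOMINAL_UM = 0.18   # drawn gate length

# Hypothetical +/-10% gate-length variation (not a measured process spread).
BETA_SHORT = scaled_beta(BETA_NOMINAL, L_NOMINAL_UM, 0.9 * L_NOMINAL_UM)
BETA_LONG = scaled_beta(BETA_NOMINAL, L_NOMINAL_UM, 1.1 * L_NOMINAL_UM)
```

Under this assumption a shorter gate raises the gain and a longer gate lowers it, so any die-to-die gate-length spread appears directly as a spread in β.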

When designing the lateral NPN devices, the focus was primarily on maximizing the

emitter perimeter, because the electrons injected from the emitter to the base are mostly

collected laterally. However, the substrate that forms the base is also underneath the emitter.

This provides a region for carriers from the base to establish base current via injection

across the base-emitter junction. Thus, the base current depends strongly on the emitter

area. The MOSFET-style lateral BJT has an emitter area approximately three times that

of the octagonal lateral BJT, and hence should have a base current that is approximately

three times that of the octagonal device. While the collector currents may be the same, β is

calculated as a function of the base current, and this may explain why the MOSFET-style

lateral BJT generally has a lower gain than the octagonal-shaped device.

The lack of uniformity in the lateral bipolar transistors can also be explained by the fact

that we are trying to exploit the parasitic devices that fabrication processes usually seek

to minimize. However, as the lateral devices were fabricated with the same base-emitter

junction area as a 2 µm x 2 µm vertical device, it was expected that the distribution of

measured β values should be centered around 21.4. There are a number of factors that

could have reduced the expected gain, as explained in Section 4.4. Finally, one factor that

had not been taken into account during design was the vertical parasitic bipolar junction

between the source, substrate, and deep n-well region, shown in Figure 4.8. This vertical

device would not have much effect on the collector current of the lateral BJT, but the deep

n-well can control how many minority carriers injected from the emitter recombine in the

substrate region. When the deep n-well has a positive bias, it would collect the carriers

before they can recombine with holes in the base region, decreasing the base current and

effectively increasing the gain. Ideally, the deep n-well should be connected to the collector

of the lateral NPN device. Unfortunately, this parasitic condition could not be mitigated

in our test chip. The deep n-well region was specifically used for device isolation in the

substrate and hence was subjected to a significant amount of noise.

A technology process optimized for lateral bipolar devices would be better suited to the

memory array design that utilizes BJT sense amplifiers. The 0.18 µm gate length is too

long to serve as a narrow base, and the substrate contacts could not be placed close enough

to the active area to minimize base resistance. The lack of gate-length uniformity was

probably the biggest contributor to variations in β. More advanced technology processes

should solve both of these design issues in the future. However, the symmetric doping

profile of the lateral bipolar devices causes a decrease in the Early voltage and limits the

emitter efficiency, and this characteristic will probably not change with more advanced

lithography.

7.3 Sense Amplifier Verification

Read operations were performed for each row of memory shown in Figure 7.1. The test

procedure involved setting a single 16x256 memory array to a fully low-resistance config-

uration by biasing the PMOS memory cell gate voltages to VRon. The leftmost wordline

would alternate between high- and low-resistance states (Figure 7.5) by setting gate volt-

ages to VRoff and VRon, respectively. This wordline would then be selected, biased, and

Figure 7.5: Array configuration during a test read operation.

the sense amplifier outputs would be latched and shifted out through the I/O pins. For each

array test, VRoff begins at a low enough voltage that ROFF is almost as low as RON . If the

outputs could not be successfully differentiated at the I/O pins, then VRoff was increased

by 10 mV and the wordline was accessed again. If, even at the highest value of VRoff , the

outputs could not be detected, then RON was increased and the process was repeated.

In this manner, we iteratively determined the lowest possible RON and ROFF combinations

that could still produce a distinguishable output through the latch.
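
The iterative search described above can be sketched as a nested sweep. This is a simplified illustration, not the test software used for the chip: `read_ok` is a hypothetical callback standing in for a full array access and output comparison, gate voltages are handled in integer millivolts, and the VRoff sweep restarts from a fixed value for every RON setting.

```python
def find_min_detectable(read_ok, v_ron_steps_mv, v_roff_start_mv,
                        v_roff_max_mv, step_mv=10):
    """Return the first (v_ron_mv, v_roff_mv) pair that reads correctly, or None.

    v_ron_steps_mv is ordered from the lowest RON setting upward.
    """
    for v_ron_mv in v_ron_steps_mv:
        v_roff_mv = v_roff_start_mv
        while v_roff_mv <= v_roff_max_mv:
            if read_ok(v_ron_mv, v_roff_mv):
                return v_ron_mv, v_roff_mv
            v_roff_mv += step_mv  # raise VRoff by one 10 mV step and retry
    return None


def toy_read_ok(v_ron_mv, v_roff_mv):
    """Toy pass/fail rule used only to exercise the sweep (not chip behavior)."""
    return v_ron_mv >= 500 and v_roff_mv - v_ron_mv >= 200


RESULT = find_min_detectable(toy_read_ok, [400, 500, 600], 400, 900)
```

Repeating the sweep from each RON setting, rather than stopping at the first passing pair, would trace out the pass/fail boundary shown in the shmoo plots.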

Failures (in the form of high-resistance cells being detected as low-resistance cells, or

vice versa) generally first occurred at the very bottom of the memory array, while reading

bit number 256 on the wordline, and would propagate upwards to the bit closest to the

wordline driver. This is probably due to the IR drop along the wordline providing a lower

read voltage to the farther cells. Because our memory array architecture requires 32 bitlines

in an array, we chose to designate a failure mode as the point when bit number 32 along the

wordline first fails.
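
A first-order sketch of the IR drop along a wordline, assuming every cell draws the same current and every wordline segment between adjacent bitlines has the same resistance. Both are simplifications, and the function name and parameter values are hypothetical.

```python
def wordline_voltages(v_drive, r_segment_ohm, i_cell_a, n_cells):
    """Voltage seen at each cell, ordered from the driver outward.

    Segment k (1-based) carries the combined current of cells k..n_cells,
    so the drop per segment is largest next to the driver.
    """
    voltages = []
    v = v_drive
    for k in range(1, n_cells + 1):
        v -= r_segment_ohm * i_cell_a * (n_cells - k + 1)
        voltages.append(v)
    return voltages


# Hypothetical values: 1.8 V drive, 0.1 ohm per segment, 10 uA per cell.
PROFILE = wordline_voltages(1.8, 0.1, 10e-6, 256)
V_AT_BIT_32 = PROFILE[31]
V_AT_BIT_256 = PROFILE[255]
```

Under these assumptions the read voltage falls monotonically with distance from the driver, which is consistent with failures appearing first at bit number 256.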

The results of the read operation tests are plotted in Figure 7.6 through Figure 7.9. Fig-

ure 7.6 and Figure 7.7 show the shmoo results from MOSFET sense amplifier design (1), as

Figure 7.6: Minimum detectable RON and ROFF values in a

32-wordline array with MOSFET sense amplifier design

(1).

described in Section 4.2. The shaded areas indicate the RON and ROFF combinations that

failed, while the white areas indicate resistance combinations that yielded a differentiable

output. As seen in Figure 7.6, with a memory array of only 32 wordlines, RON can be as

low as 1746 Ω while ROFF can be as low as 2400 Ω. When the number of wordlines is

increased to 64, RON can only be decreased to 2290 Ω while ROFF can only be as low as

4016 Ω.

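Figure 7.7: Minimum detectable RON and ROFF values in a

64-wordline array with MOSFET sense amplifier design

(1).

Figure 7.8: Minimum detectable RON and ROFF values in a

32-wordline array with MOSFET sense amplifier design

(2).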

A much larger white region can be seen from the test results of the 4x-sized MOSFET

sense amplifier (2), in Figure 7.8 and Figure 7.9. Here, RON can be as low as 1065 Ω with

ROFF as low as 2182 Ω. By increasing the number of wordlines to 64, RON needs to be at

least 1553 Ω and ROFF must be at least 2600 Ω.

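Figure 7.9: Minimum detectable RON and ROFF values in a

64-wordline array with MOSFET sense amplifier design

(2).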

Figure 7.10: The inconsistencies in the lateral NPN transistor

characteristics render the sense amplifier outputs

indistinguishable.

Unfortunately, the same test procedure could not be applied to the rows implemented

with the lateral BJT sense amplifiers. Because of the device inconsistencies described in

the previous section, it was impossible to read an entire column of bits. This problem is

visualized in Figure 7.10. Each NPN sense amplifier would have a different output profile

depending on the gain of the individual devices. Because the buffers following the sense

amplifiers were tuned to latch a certain output range, a memory cell could be read as either

a high-resistance state or a low-resistance state depending on the sense amplifier device

characteristics.

Figure 7.11: Minimum detectable RON and ROFF values in a

32-wordline array with BJT sense amplifier design

(MOSFET-style).

Instead, one functional sense amplifier was chosen from each of the two bipolar tran-

sistor arrays and tested with 32 wordlines. The results are shown in Figure 7.11 and Fig-

ure 7.12. The MOSFET-style lateral bipolar sense amplifier can detect a low-resistance

state of 306 Ω and a high-resistance state of 1476 Ω. The octagonal lateral bipolar sense

amplifier can detect a low-resistance state of 400 Ω and a high-resistance state of 1377 Ω.

Figure 7.12: Minimum detectable RON and ROFF values in a

32-wordline array with BJT sense amplifier design

(octagonal-style).

Of course, the data presented for the NPN transistor sense amplifier designs are not

reproducible due to the variations in the lateral BJT characteristics. However, they do

demonstrate the potential of the NPN sense amplifier designs.

7.4 Voltage Scaling Results

Voltage-scaling experiments were also performed to examine how supply voltage (VDD)

variation affects the read reliability. The nominal voltage of the 0.18 µm CMOS technology

used is 1.8 V. These tests scaled VDD from 1.4 V to 2.0 V. The reference current for sensing

was adjusted accordingly.

Figure 7.13 shows the results of scaling the voltage while reading a 32-wordline array

using the MOSFET sense amplifier design (1). The area above each curve indicates the

resistance combinations that were successfully detected, while the area below each curve

indicates the failed resistance combinations for that voltage. Both RON and ROFF values

needed to be slightly raised when VDD was reduced to 1.6 V, and the memory array failed

to function at 1.4 V. There was no apparent benefit to increasing VDD to 2.0 V.

With MOSFET sense amplifier design (2), VDD could be scaled down to 1.4 V with no

visible effect on the resistance requirements. This is shown in Figure 7.14.

Figure 7.13: Voltage scaling results for a 32-wordline array with

MOSFET sense amplifier design (1).

Figure 7.14: Voltage scaling results for a 32-wordline array with

MOSFET sense amplifier design (2).

Looking back at the memory array design tradeoffs described in Chapter 4, it is clear

that the value of RON must increase with a lower bias voltage, because the amount of cur-

rent generated from a biased memory cell will no longer be sufficient to withstand the ef-

fects of leakage across unselected cells. Ultimately, the memory architecture fails because

the sense amplifier latches are sized for a fixed range of sense amplifier output voltage.

When VDD becomes sufficiently low, then the output voltage falls outside the range that

can be latched. The 4x MOSFET sense amplifier design (2) does not suffer from this prob-

lem because the original sense amplifier output range begins at a lower voltage due to the

wider NMOS transistors and their lower effective resistance.

The same individual memory cells were selected for testing using the NPN transistor

sense amplifier designs. The results are shown in Figure 7.15 and Figure 7.16. There does

not appear to be any significant effect on the detection limits of the resistance values from

voltage scaling, probably also because the NPN devices have lower resistances and see a

lower voltage drop to begin with.

Figure 7.15: Voltage scaling results for a 32-wordline array with BJT

sense amplifier design (MOSFET-style).

Figure 7.16: Voltage scaling results for a 32-wordline array with BJT

sense amplifier design (octagonal-style).

Chapter 8

Conclusion

A NOVEL MEMORY architecture derived from resistance-change memory has been

designed and simulated in this work. The cross-point memory array uses a sin-

gle resistive element to combine the functions of both data storage and addressing. Our

simulation results show that an 8 Gb memory architecture can be accessed with reasonable

power and latency requirements that are competitive with those of NOR Flash. Due to the

absence of individual access transistors, the cross-point memory architecture presented in

this work can be integrated into a 3-dimensional stacked structure simply by layering the

arrays. This can be done in a standard CMOS process without forming epitaxial silicon

layers as is commonly used in 3D monolithic integration. The resulting multi-layer mem-

ory array has an expected bit-density that far exceeds that of single- and multi-level NAND

flash.

8.1 Summary

We have developed a design strategy for building a 3D cross-point array into an architec-

ture that can effectively manage leakage current based on the parameters of its peripheral

circuitry. Unlike previous designs, no diodes are needed within the memory array. We

have demonstrated the functionality of the read operation in a test chip. The method of

optimizing the peripheral circuitry for a HfO2-based memory cell described in this work

can be applied to future designs involving other resistance-change materials with known

resistivities.

Lateral bipolar junction transistors were designed and fabricated in an effort to create

a high-transconductance sense amplifier that would be available in any triple-well CMOS

process technology. Test results showed that these lateral transistors had inconsistent device

matching characteristics, but the fully functional devices have much potential to serve as

elements in versatile current sense amplifiers.

8.2 Other Considerations

The 3D cross-point memory architecture was designed to someday serve as a replacement

for NAND flash memory. We have demonstrated that it can be superior to NAND flash

in terms of speed and area, but the reliability, data retention, and endurance still need to be

explored.

8.2.1 Scalability

The greatest benefit of a metal-oxide-based cross-point memory over NAND flash is its

scalability. A four-layer cross-point memory can have a significantly greater bit-density

than NAND flash memories fabricated in the same technology node. This provides a great

advantage in the cost-per-bit scaling for future nonvolatile memory. From a lithography

standpoint, the 3D cross-point memory explored in this work should be able to scale well

beyond 20 nm process technologies, unlike flash memory. The limitations to scaling will

come from the resistance-change memory materials, and it is believed that, if filament

formation is the cause of resistive switching in metal oxides, then resistance values will not

change with decreasing cell size [17].
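
The abstract gives the effective cell size of an n-layer stack as 4F^2/n. The sketch below turns that expression into a raw bit density, assuming an ideal array with no area spent on peripheral circuitry; the function names are illustrative, and the 20 nm, four-layer case is used only as an example.

```python
NM2_PER_MM2 = 1e12  # 1 mm^2 expressed in nm^2


def effective_cell_area_nm2(feature_nm, layers):
    """Effective area per bit, 4 * F^2 / n, for n stacked cross-point layers."""
    if layers < 1:
        raise ValueError("at least one layer is required")
    return 4 * feature_nm ** 2 / layers


def bits_per_mm2(feature_nm, layers):
    """Raw array density, ignoring decoders, sense amplifiers, and routing."""
    return NM2_PER_MM2 / effective_cell_area_nm2(feature_nm, layers)


AREA_4_LAYER = effective_cell_area_nm2(20, 4)
DENSITY_4_LAYER = bits_per_mm2(20, 4)
DENSITY_1_LAYER = bits_per_mm2(20, 1)
```

In this idealized model the density scales linearly with the number of layers, so four layers give four times the raw density of a single 4F^2 layer at the same feature size.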

Metal oxides have lower current and voltage requirements for programming than flash

or phase-change memory. Not only does this lead to faster read and write times, but it

also benefits memory reliability. The low power requirements of metal-oxide resistance

switching make it highly suitable for low-power applications.

In the future, resistance-change metal oxides could also potentially be programmed as

multi-level cells with some additional sense amplifier complexity.

8.2.2 Retention and endurance

Endurance tests performed on HfO2-based resistance-change materials have demonstrated

that more than 10^6 SET/RESET cycles can be performed on a memory cell without a

decrease in the sense margin [15]. This easily matches the endurance of current NAND flash

memories. Resistance-change materials are not subject to the tunnel-oxide charge-trapping

problems that flash memory cells suffer from, but have been observed to be sensitive to

switching cycles at high temperature because of the semiconductor-like high-resistance

state at temperatures greater than 200 °C.

Data retention tests have been promising as well, showing that a lifetime of at least 10

years can be expected [15]. This is also competitive with flash memories on the market

today.

8.2.3 Reliability

One issue preventing the commercial implementation of resistance-change metal oxides

as memory cells in the past was the loose distribution of resistance values for RON and

ROFF . This has been improved in recent studies [15]. One promising aspect of resistance-

change metal-oxides is that any resistance variations tend to be towards higher resistance

values. As our experimental data have shown, higher values of RON are preferable to a

large high/low resistance window.

8.3 Recommendations for future work

Post-fabrication processing and the deposition of resistance-change materials were not

available for the test chip presented in this work. A more robust verification of the read

operation should be conducted using actual resistance-change memory cells, as this would

allow the fabrication of actual cross-point arrays. The simulated performance parameters

could then be verified. Noise effects should also be evaluated, as a cross-point array of

resistors may be a considerable source of electrical and thermal noise.

Further characterization of the resistance-change material is also necessary in order

to guarantee that the 3D cross-point memory will be practical for data storage. HfO2-

based memory cells have been demonstrated to have favorable characteristics, but larger-

scale studies will need to be performed to study their potential as a commercial product.

Factors such as yield and susceptibility to single-event upsets should be evaluated. Also,

the scalability of metal-oxide resistance change materials beyond 20 nm technology nodes

still needs to be studied.

Error-correction will likely be necessary for cross-point memory to serve as a high-

density storage device in most applications. The bit-error rate should be determined and

appropriate error-correction encodings explored. This will add latency and area overhead

to the current design. Also, depending on the failure modes of the memory cells, row and

column redundancy may also need to be implemented.
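
To give a sense of the area overhead mentioned above, the sketch below counts the check bits required by a single-error-correcting Hamming code for a given data word width. The Hamming code is used purely as a familiar example; this work does not select an error-correction scheme, and the word widths shown are hypothetical.

```python
def sec_parity_bits(data_bits):
    """Smallest r with 2**r >= data_bits + r + 1 (single-error-correcting Hamming code)."""
    if data_bits < 1:
        raise ValueError("data word must contain at least one bit")
    r = 0
    while 2 ** r < data_bits + r + 1:
        r += 1
    return r


def storage_overhead(data_bits):
    """Fraction of extra cells needed for the check bits."""
    return sec_parity_bits(data_bits) / data_bits


# Hypothetical word widths, chosen only to show the trend.
OVERHEAD_BY_WIDTH = {k: storage_overhead(k) for k in (64, 512, 4096)}
```

The relative overhead shrinks as the protected word grows, although wider words generally imply more decoding logic and therefore more latency.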

Now that a method for optimizing the read circuitry for a cross-point memory array has

been demonstrated, it is possible to apply the same design strategy to the write operation.

The programming operation is expected to be competitive with both NAND and NOR flash

in terms of speed because of the relatively low voltage requirements of resistance-change

materials. If the peripheral circuitry for accommodating the write operation can be made

sufficiently compact, then the 3D cross-point memory will indeed be a viable replacement

for NAND and NOR flash in future process generations.

Bibliography

[1] G. Moore. ”Cramming more components onto integrated circuits”, Electronics, 38, 8,

1965.

[2] Micron Technologies. ”NAND Flash 101: An Introduction to NAND Flash and How

to Design It In Your Next Product”, Micron Technical Note TN-29-19, 2006.

[3] R. Bez, et al. ”Introduction to Flash Memory”, Proceedings of the IEEE, 91, 4, 2003.

[4] L. Mason. ”Memory Market Outlook”, MemCon San Jose, 2009.

[5] M. Chi and A. Bergemont. ”Multi-level flash/EPROM memories: new self-convergent

programming methods for low-voltage applications”, Technical Digest of Interna-

tional Electron Devices Meeting, 1995.

[6] T.-K. Kim, S. Chang, and J.-H. Choi. ”Floating gate technology for high performance

8-level 3-bit NAND flash memory”, Solid-State Electronics, 53, 7, 2009.

[7] D. Ielmini, A. Spinelli, A. Lacaita. ”Recent developments on Flash memory reliabil-

ity”, Microelectronic Engineering, 80, 17, 2005.

[8] E. Doller. ”Making Sense of It All: The Ever-changing Role and Challenges of Non-

volatile Memory Today and Tomorrow”, MemCon San Jose, 2008.

[9] R. Scheuerlein. ”Magneto-resistive IC memory limitations and architecture implica-

tions”, Seventh Biennial Nonvolatile Memory Technology Conference, June 1998.

[10] S. Lai, T. Lowrey. ”OUM - A 180 nm Non-Volatile Memory Cell Element Technology

For Stand Alone and Embedded Applications”, Proc. International Electron Devices

Meeting, 2001.

[11] M. Gill, T. Lowrey, J. Park. ”Ovonic unified memory - a high performance non-

volatile memory technology for stand-alone memory and embedded applications”,

Proc. ISSCC, 2002.

[12] A. Beck, J.G. Bednorz, C. Gerber, C. Rossel, D. Widmer. ”Reproducible switching

effect in thin oxide films for memory applications”, Applied Physics Letters,

139–141, 77, 1, 2000.

[13] J. Li, H. Liu, S. Salahuddin, K. Roy. ”Variation-tolerant Spin-Torque Transfer (STT)

MRAM array for yield enhancement,” IEEE Custom Integrated Circuits Conference,

2008.

[14] ”PCM Becomes a Reality,” Objective Analysis Semiconductor Market Research, Au-

gust 2009.

[15] H. Y. Lee, et al. ”Low power and high speed bipolar switching with a thin reactive

Ti buffer layer in robust HfO2 based RRAM”, Proc. IEEE International Electron

Devices Meeting, 2008.

[16] B.J. Choi, et al. ”Resistive switching mechanism of TiO2 thin films grown by atomic-

layer deposition”, Journal of Appl. Phys., 98, 033715, 2005.

[17] I. Baek, et al. ”Highly Scalable Non-volatile Resistive Memory using Simple Binary

Oxide Driven by Asymmetric Unipolar Voltage Pulses”, Proc. International Electron

Devices Meeting, 2004.

[18] S. Wong, et al. ”Monolithic 3D Integrated Circuits”, International Symposium on

VLSI Technology Systems and Applications, 2007.

[19] S.-M. Jung, et al. ”Three Dimensionally Stacked NAND Flash Memory Technology

Using Stacking Single Crystal Si Layers on ILD TANOS Structure for Beyond 30nm

Node”, Proc. International Electron Devices Meeting, 2006.

[20] K. Kim, et al. ”Multilevel Programmable Oxide Diode for Cross-Point Memory by

Electrical-Pulse-Induced Resistance Change”, IEEE Electron Device Letters, 30, 10,

2009.

[21] M. Johnson, et al. ”512-Mb PROM with a three-dimensional array of diode/antifuse

memory cells”, IEEE Journal of Solid-State Circuits, 38, 11, 2003.

[22] H.-S. Wong, Y. Taur, and D. Frank. ”Discrete random dopant distribution effects in

nanometer-scale MOSFETs”, Microelectronics and Reliability, 38, 9, 1998.

[23] J. Brewer and M. Gill. Nonvolatile Memory Technologies with Emphasis on Flash.

IEEE, Wiley & Sons, Inc., Hoboken, NJ 2008.

[24] T. Blalock, R. Jaeger. ”A High-Speed Clamped Bit-Line Current-Mode Sense Ampli-

fier”, IEEE Journal of Solid-State Circuits, 542–548, 26, 4, 1991.

[25] T. Tanzawa, et al. ”Design of a sense circuit for low-voltage Flash memories”, IEEE

Journal of Solid-State Circuits, 35, 10, 2000.

[26] Linear Technology. ”LM101A/LM301A/LM107/LM307 Operational Amplifiers”,

Linear Technology Datasheet, 1994.

[27] E. A. Vittoz. ”MOS transistors operated in the lateral bipolar mode and their applica-

tion in CMOS technology”, IEEE Journal of Solid State Circuits, 18, 3, 1983.

[28] Z. Feng, et al. ”Gate Controlled Vertical-Lateral NPN Bipolar Transistor in 90nm RF

CMOS Process”, IEEE Bipolar/BiCMOS Circuits and Technology Meeting, 2008.

[29] M. Daibo, T. Kikuchi, and M. Yoshizawa. ”Minority Carrier Diffusion Length Mea-

surement of Semiconductors Using a Multiwavelength Laser SQUID Microscope”,

IEEE Transactions on Applied Superconductivity, 13, 2, 2003.

[30] A. Mohsen, et al. ”The Design and Performance of CMOS 256K Bit DRAM Devices”,

IEEE Journal of Solid-State Circuits, 19, 5, 1984.

[31] http://www.mosis.com

[32] V. Quenette, et al. ”Electrical Characterization and Compact Modeling of MOSFET

body effect”, 9th International Conference on Ultimate Integration of Silicon, 2008.

[33] Samsung Electronics. ”K9XXG08UXA: 512M x 8 Bit/1G x 8 Bit NAND Flash Mem-

ory”, Samsung Electronics Datasheet, 2006.

[34] ”4-Gbit NAND built at 65 nm”, EE Times, July 17, 2006.

[35] D. Nobunaga, et al. ”A 50nm 8Gb NAND Flash Memory with 100MB/s Program

Throughput and 200MB/s DDR Interface”, International Solid-State Circuits Confer-

ence, 2008.

[36] D.-S. Byeon, et al. ”An 8Gb Multi-Level NAND Flash Memory with 63nm STI

CMOS Process Technology”, International Solid-State Circuits Conference, 2005.

[37] http://www.ansoft.com

[38] Numonyx. ”Numonyx Axcell M29EW”, Numonyx Datasheet, 2009.

[39] SanDisk. ”SanDisk 3-D OTP Memory”, SanDisk Datasheet, Document Number

DS034, 2006.

[40] NAND Flash Memories Application Note. ”NAND Flash Memories and Program-

ming NAND Flash Memories Using ELNEC Device Programmers”, ELNEC NAND

Flash, 2008.

[41] W. Dong, P. Liyang, D. Zhigang, and Z. Jun. ”Charge pump system sharing the

coupling capacitors for NOR flash memory”, Proc. 5th International Conference on

ASIC, 1, 2003.

[42] M. Combe, et al. ”Design of high-speed 128-bit embedded flash memories allowing

in place execution of code”, Solid-State Electronics, 49, 1867-1874, 2005.

[43] R. Micheloni, L. Crippa, M. Sangalli, G. Campardo. ”The flash memory read path:

building blocks and critical aspects”, Proceedings of the IEEE, 91, 4, 2003.

[44] E. Conrad, et al. ”Early Voltage and Saturation Voltage Improvement in Deep Sub-

Micron Technologies Using Associations of Transistors”, Proc. of 21st annual sym-

posium on integrated circuits and system design, 2008.

[45] Y. Chen, et al. ”An Access-Transistor-Free (0T/1R) Non-Volatile Resistance Ran-

dom Access Memory (RRAM) Using a Novel Threshold Switching, Self-Rectifying

Chalcogenide Device”, Proc. International Electron Devices Meeting, 2003.

[46] B.J. Choi, et al. ”Resistive switching mechanism of TiO2 thin films grown by atomic-

layer deposition”, Journal of Appl. Phys., 98, 033715, 2005.

[47] R. Cobley, C.D. Wright. ”Parameterized SPICE Model for a Phase-Change RAM

Device”, IEEE Transactions on Electron Devices, 53, 1, 2006.

[48] D. MacSweeney, et al. ”A SPICE Compatible Subcircuit Model for Lateral Bipolar

Transistors in a CMOS Process”, IEEE Transactions on Electron Devices, 45, 9, 1998.

[49] S.R. Ovshinsky. ”Reversible electrical switching phenomena in disordered struc-

tures”, Physical Review Letters, 21, 1968.

[50] H. Sim, et al. ”Resistance-Switching Characteristics of Polycrystalline Nb2O5 for

Nonvolatile Memory Application”, IEEE Electron Device Letters, 26, 5, 2005.
