View
3
Download
0
Category
Preview:
Citation preview
A 3D NAND Flash Ready 8-Bit Convolutional Neural Network Core Demonstrated in a
Standard Logic Process
M. Kim1, M. Liu1, L. Everson1, G. Park1, Y. Jeon2,
S. Kim2, S. Lee2, S. Song2 and C. H. Kim1
1Dept. of ECE, University of Minnesota2Anaflash Inc.
Outline
• 3D NAND based Neural Network
• Prototyping in a Standard Logic Process
‒ Architecture, synapse cell, memory array design
• 65nm Test Chip Results
‒ Programming results, MNIST demonstration, retention
• Conclusions2
Deep Neural Network Complexity
3
• State-of-the-art deep neural networks: ~150 layers, ~150M
parameters, ~100MB memory per image
An Analysis of Deep Neural Network Models for Practical Applications, Arxiv, 2017
In-Memory Computing and Flash-based Design
4
• Ultra-high density, reduced data traffic, low cost, mature technology
• Low program/erase speed (but fast read speed), limited endurance cycles (but fine for neural network applications)
Flash SSD
(e.g. 2TB)Data Data
DRAM
(e.g. 64GB)
SRAM
(e.g. 22MB)
3D NAND Cell and Array
Source: Toshiba
W0
Wi
X0
Xi
ISum
W1
X1
Analog multiply and accumulate (MAC)
In-memory Computing (e.g. SRAM, RRAM)
5
H. Maejima, Toshiba, ISSCC 2018
WL<0>
SGS
WL<95>
CSL
SGD<0> SGD<1> SGD<2> SGD<3>
Block Y
XZ
X
Y
16KB BLs
SGD<3>
SGD<0>
State-of-the-Art 3D NAND Flash Memory
• (x,y,z) = (BL, SGD, WL)‒ BL shared across multiple blocks ‒ WL is a shared plane (not line)‒ SGD can be individually controlled
Capacity
Technology 96-WL-Layer
512Gb (3bit/cell)
Bit Density 5.95Gb/mm2
Organization(1822 + EXT)
Blocks / Plane
ThroughputRead(tR) : 58µs
(ABL : 16KB)
Cost 4 Cent/Gb
6
Analog MAC in 3D NAND Array
• Σ Xi × Wi‒Xi : Binary input applied to individual
SGD lines (no analog voltages)‒Wi: Multi-level weight (MLC, TLC, QLC)‒Σ (Accumulate) : Bitline currents of
different blocks summed up‒High resolution MAC can be realized
using multiple weight cells, bit serial operation, and partial product post-processing
WL<0>
SGS
WL<95>
CSL
X0 X1 X2 X3
Block
7
Analog MAC in 3D NAND Array
• Σ Xi × Wi‒Xi : Binary input applied to individual
SGD lines (no analog voltages)‒Wi: Multi-level weight (MLC, TLC, QLC)‒Σ (Accumulate) : Bitline currents of
different blocks summed up‒High resolution MAC can be realized
using multiple weight cells, bit serial operation, and partial product post-processing
Vpass
SGS
Vread
CSL
X0 X1 X2 X3
Block
Vpass
w00 w01 w02 w03
8
Analog MAC in 3D NAND Array
• Σ Xi × Wi‒Xi : Binary input applied to individual
SGD lines (no analog voltages)‒Wi: Multi-level weight (MLC, TLC, QLC)‒Σ (Accumulate) : Bitline currents of
different blocks summed up‒High resolution MAC can be realized
using multiple weight cells, bit serial operation, and partial product post-processing
Vpass
SGS
Vread
CSL
X0 X1 X2 X3
Block
Vpass
w00 w01 w02 w03
9
Analog MAC in 3D NAND Array
• Σ Xi × Wi‒Xi : Binary input applied to individual
SGD lines (no analog voltages)‒Wi: Multi-level weight (MLC, TLC, QLC)‒Σ (Accumulate) : Bitline currents of
different blocks summed up‒High resolution MAC can be realized
using multiple weight cells, bit serial operation, and partial product post-processing
Vread
SGS
Vpass
CSL
X0 X1 X2 X3
Block
Vpass
w00 w01 w02 w03
Y0 Y1 Y2 Y3
Z0 Z1 Z2 Z3
Bit Serial Operation for 8bit x 8bit MAC
10
8bit Input
Σ ×
×
LSB
8bit Weight (Pos. 7bit + Neg. 7bit) Analog MAC Operation
MSB
11000101
LSB MSB
1 0 10 01 1
10111000
0 1 11 11 0
1
3 4
Cycle 1
Σ ×
×
8bit Weight Analog MAC
Operation LSB MSB
1 0 10 01 1
0 1 11 11 0
0
3
Cycle 2
3
8bit Input
Σ ×
×
8bit Weight Analog MAC
Operation LSB MSB
1 0 10 01 1
0 1 11 11 0
3
0
Cycle 31 LSB
3
8bit Input
Σ ×
×
8bit Weight Analog MAC
Operation LSB MSB
1 0 10 01 1
0 1 11 11 0
1
0
Cycle 32 LSB
1
MSB MSB
8bit Input LSB MSB
11000101
10111000
11000101
10111000
11000101
10111000
+ 4 × 20
Pos. weights
+4 × 20
3 × 22 +
+
+4 × 20
3 × 22 +
+3 × 2(7+4)
× 20 × 22
× 2(7+4)
+
× 2(7+6)
P. Judd, ICAL 2017
Logic Compatible 3T eFlash Based NAND String
11
• Cell current proportional to X∙W (=0µA, 3µA, 6µA, or 9µA) • BL voltage pinned at 0.8V during read and verify operation
VPASS
X
SGS : 2.5VCSL : 0V
VBL=0.8V
VFG: Floating Gate Voltage
* Selected WL : VREAD(1.1V)* Unselected WL : VPASS (2.6V)
X W X∙W ICELL
0 0 0 0µA
0 1 0 0µA
0 2 0 0µA
0 3 0 0µA
1 0 0 0µA
1 1 1 3µA
1 2 2 6µA
1 3 3 9µA
ICELL
VPASS
VPASS
VREAD
VREAD
VREAD
0.0 0.2 0.4 0.6 0.8 1.0
0
3
6
9
12
VFG -0.6V
VFG -0.290V
VFG -0.066V
VFG 0.150V
BL Voltage (V)
65nm, VDD 1.2V, 25°C, X(SGD) =2.5V
Cell C
urren
t (µA
)X=1, W=0 or X=0, W=0,1,2,3
X=1, W=1
X=1, W=2
X=1, W=3
12
Prototype Design in a Standard Logic Process
• Flatten to 2D while preserving 3D NAND array architecture• Unit block size: 4 SGD x 16 WL x 40 BL
WL<0>
SGS
WL<15>
CSL
40BLs
SGD<0> SGD<1> SGD<2> SGD<3>
Block
WL<0>
SGS
WL<15>
CSL
Block
BL0
BL39
SGD<0>
BL0 BL39
SGD<1>SGD<2>
SGD<3>
Positive and Negative Weight Storage
13
X W0 W1 X∙W0 X∙W1 ICELL0 ICELL1 ∆I
1 0 3 0 3 0µA 9µA -9µA
1 0 2 0 2 0µA 6µA -6µA
1 0 1 0 1 0µA 3µA -3µA
1 0 0 0 0 0µA 0µA 0µA
1 1 0 1 0 3µA 0µA 3µA
1 2 0 2 0 6µA 0µA 6µA
1 3 0 3 0 9µA 0µA 9µA
Negative2bit
Weights
Positive2bit
Weights
Pos. W
X
SGS : 2.5VCSL : 0V
Even BL=0.8V
* Selected WL : VREAD(1.1V)* Unselected WL : VPASS (2.6V)
Odd BL=0.8V
W0
Neg. W
W1
VPASS
ICELL0
VPASS
VPASS
VREAD
ICELL1
VPASS
2bit
2bit
2bit
1bit
7bit 7bit
8bitpair
Erase and Program Operations in NAND String
14
• FN tunneling utilized for erase and program• Program inhibition of unselected cells via self-boosting
PWL<15>(0V)
WWL<15>(HV)
SGD<3:0>(0V)
WL<0>(0V)
SGS(0V)CSL(0V)
Erase Operation (WL15)
WM1=
8xWM2,3
M1
M2
e-
OFF
BL(0V)
~0V
OFF
PWL<15>(HV)
WWL<15>(HV)
SGD<0>(VDD)
SGS(0V)CSL(VDD)
PGM Operation (WL15)
e-
ON
BLA(0V)
Ce
ll A
<0
>
e-
OFF
Ce
ll A
<3
:1>
SGD<3:1>(0V)
e-
OFF
Ce
ll B
<0
>
BLB(VDD)
Programmed Program-Inhibited
via self-boosting
WL<1>(0V)
OFF OFF OFF
WL<0>(VPASS)
WL<1>(VPASS)
64 Row x 320 Column Core Architecture
15
• 16 stack eNAND array, high voltage switches, BL sensing circuits• Input data loaded on to 25 SGD lines, 3 SGD lines for bias
BL0
Analog MUX
BL39
BL Current Verify/Sensing Circuit
(Core Devices)
SGD0<3:0>
Block0(16 StackeNANDArray)
SGD4<3:0>
Block4
Pix
el-
IN,W
L a
nd
Blo
ck
SC
AN
Block3Block7(RepairBlock)
VCO
OutputFrequency Divider
HVS WWL015PWL015
HVS WWL00PWL00
SGS0
SGD3<3:0>
SGS3
HVS
HVS
HVS
HVS
WWL315PWL315
WWL30PWL30
WWL415PWL415
WWL40PWL40
WWL715PWL715
WWL70PWL70
HVS
HVS
Pix
el-
IN,W
L a
nd
Blo
ck
SC
AN
VP
P,V
PA
SS
SGS4
SGD7<3:0>
SGS7
BL SCAN & DRIVER C
on
tro
l1
Co
ntr
ol2
High Voltage Switch
BL
WRITE BL SCAN
VREG –
+
Frequency Divider
BLSGD
PWL
WWL
S1
M1
M2
M3
FG
CSL
PWL
WWL
SGS
M1
M2
M3
S2
FG
16Stack eNANDCell String
S2
PWL<15>
WL<0>
SGD<0>
SGD<3>
SGSCSL
Output
BL
voltage
regulator
VCO
M1
M2
M3
S1
FG
WL<1>
WL<15>
WL<14>
WWL<15>
65nm Die Photo and Feature Summary
16
16 Stack eNANDBased Synapse Array
Technology
Core Size 1100 X 600 μm2
65nm Logic
VDD (Core / IO)
1.2V / 2.5V
# of 8bit Weights
2,560
# of Synapses
20K(=64x320)
Throughputw/o VCO
4.95 µW (per bitline)
Power
0.5G pixels/s per core
(tREAD : 50ns)
Program Characteristics vs. NAND String Location
17
• Different Vgs and Vdsdepending on the stack location
• Requires different Vth for the same cell current
• Longer programming time for cells closer to the bottom
• Cell current variation less than 0.6µA after program-verify operations
10 20 30 40
3
6
9
12WL0-3
WL12-15
Ce
ll C
urr
en
t (µ
A) Weight2
Weight2
Weight1
Weight1
25°C, VPASS 2.6V, VREAD 1.1V
Weight3
Weight3
Fine Program Pulse Count
SGS(2.5V)
CSL(0V)
SGD (2.5V)
WL0
WL1
WL2
WL3
WL15
WL14
WL13
WL12
BL (0.8V)
Cell Current
3
6
9
12
*Selected WL : VREAD, Unselected WL : VPASS
0
LeNet-5 CNN Demonstration
•5 x 5 8bit convolution performed on chip•ADC, pooling, bit serial operation performed off chip
18
INPUT 28X28 (8 bit)
55
5×5 Convolutions
22
Max Pooling
55
5×5 Convolutions2
2
Max Pooling
Feature maps6@24X24
OUTPUT10
Layer120
Full connectionFull connection
Gaussian connections
Feature maps6@12X12
Feature maps16@8X8 Feature maps
16@4X4 Layer84
eNAND 5×5 6filters (12BLs) operation
eNAND 5×5 96(6×16)filters (192BLs) operation
* eNAND 1 filter : 2BLs (Pos. BL + Neg. BL)
Final results
0 1 2 3 4 5 6 7 8 9
On-chip Off-chip On-chip Off-chip
MNIST Digit Recognition Accuracy Results
•Recognition accuracy is 98.5% which is close to the software model’s 99.0% accuracy
19
Output Recognized Index
Digit 0
Digit 1
Digit 2
Digit 3
Digit 4
Digit 5
Digit 6
Digit 7
Digit 8
Digit 9
Coun
t #
0
1
1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9
10
100
1
10
100
1
10
100
1
10
100
1
10
100
Retention Test Results
20
• 1K erase and program cycles before baking test
• Excellent retention characteristics additional weight levels possible (e.g. 3 bit per cell)
1K Pre-Cycles , Tbake=150°C
100101 102 103
Bake Time (hrs)
0
3
6
Weight0
9
Weight1
Weight2
Weight3
Before baking
Ce
ll C
urr
en
t (µ
A)
Comparison Table
[1] X. Guo, et al., IEDM, 2017.[2] W. Chen, et al., ISSCC, 2018.
[3] M. Kim, et al., IEDM, 2018.[4] C. Xue, et al., ISSCC, 2019.
[5] X. Si, et al., ISSCC, 2019.
21
ISSCC 19 [4]
Technology 55nm
IEDM 18 [3]
65nm
ISSCC 19 [5]
55nm
Input
Resolution
Non volatile? YES (ReRAM) NO (SRAM) YES (eFlash)
Voltage 1.0V 1.0V1.0V
Logic
Compatible?
ISSCC 18 [2]
Weight
Resolution
YES (ReRAM)
YES YES NONO
3 Bits 3 Bits 5 Bits 2.3 Bits
Program-verify? NO NO YES NO
65nm
1.0V
YES (eFlash)
YES
65nm
This work
YES (eFlash)
1.2V
YES
8 Bits
YES
IEDM 17 [1]
Cell Type NAND NOR NOR NOR NOR NOR
1 Bit
2 Bits
NO
2.7V
180nm
# of Currents
Summed Up8 Cells 68 Cells 14 Cells28 Cells 32 Cells
3 Bits 2 Bits 2 Bits 1 Bit8 Bits
Neural Net
ArchitectureCNN MLP CNNCNN MLPCNN
4 Cells
Conclusions
• 3D NAND Flash ready 8bit x 8bit convolutional neural network core demonstrated in a 65nm standard logic process‒16 stack NAND string‒2 bit per cell weight storage‒Bit serial operation‒Back-pattern tolerant program verify (details in paper)
• LeNet5 2-layer CNN test results show 98.5% digit recognition accuracy
22
Recommended