12
© 2018 Marvell Semiconductor, Inc. All rights reserved. Error Recovery Flows in NAND Flash SSDs Viet-Dzung Nguyen Marvell Semiconductor, Inc. Flash Memory Summit 2018 Santa Clara, CA 1

Error Recovery Flows in NAND Flash SSDs · 09/08/2018 · Marvell Semiconductor, Inc. Flash Memory Summit 2018 Santa Clara, CA 1 © 2018 Marvell Semiconductor, Inc. All rights reserved

  • Upload
    lamlien

  • View
    239

  • Download
    0

Embed Size (px)

Citation preview

Page 1: Error Recovery Flows in NAND Flash SSDs · 09/08/2018 · Marvell Semiconductor, Inc. Flash Memory Summit 2018 Santa Clara, CA 1 © 2018 Marvell Semiconductor, Inc. All rights reserved

© 2018 Marvell Semiconductor, Inc. All rights reserved.

Error Recovery Flows in NAND Flash SSDs

Viet-Dzung Nguyen

Marvell Semiconductor, Inc.

Flash Memory Summit 2018

Santa Clara, CA 1

Page 2: Error Recovery Flows in NAND Flash SSDs · 09/08/2018 · Marvell Semiconductor, Inc. Flash Memory Summit 2018 Santa Clara, CA 1 © 2018 Marvell Semiconductor, Inc. All rights reserved

© 2018 Marvell Semiconductor, Inc. All rights reserved.

Outline

2

• Data Reliability in NAND Flash Memories

• Concept of an Error Recovery Flow (ERF)

• Components of an ERF

• Optimization of an ERF in an LDPC-Based Controller

• Comparison with a BCH-Based ERF

Page 3: Error Recovery Flows in NAND Flash SSDs · 09/08/2018 · Marvell Semiconductor, Inc. Flash Memory Summit 2018 Santa Clara, CA 1 © 2018 Marvell Semiconductor, Inc. All rights reserved

© 2018 Marvell Semiconductor, Inc. All rights reserved.

Data Reliability in NAND Flash

3

• NAND flash stores information in different

charge levels of NAND flash cells

• In a fresh NAND device, the distributions of

threshold voltages around their mean values

are very sharp

- Hence, the number of bits detected erroneously

is very small

• In a NAND device stressed by program/erase

cycle, retention or read disturb, the

distributions become wider

- Hence, the number of bits detected erroneously

is larger

1 0

1 0

Fresh NAND

Stressed NAND

Charge level

Charge level

Vth

Vth

Page 4: Error Recovery Flows in NAND Flash SSDs · 09/08/2018 · Marvell Semiconductor, Inc. Flash Memory Summit 2018 Santa Clara, CA 1 © 2018 Marvell Semiconductor, Inc. All rights reserved

© 2018 Marvell Semiconductor, Inc. All rights reserved.

Error Correction

4

• Error correction codes are employed to correct errors introduced by NAND flash

• Decoder classification:

- Hard decoders use the hard decision generated by one read from NAND

- Soft decoders use soft information generated by many reads from NAND

ECC

encoder

User

dataECC

codeword

User

data

Noisy ECC

codeword

NANDNANDNANDNANDHOST

ECC

encoder

Page 5: Error Recovery Flows in NAND Flash SSDs · 09/08/2018 · Marvell Semiconductor, Inc. Flash Memory Summit 2018 Santa Clara, CA 1 © 2018 Marvell Semiconductor, Inc. All rights reserved

© 2018 Marvell Semiconductor, Inc. All rights reserved.

Error Recovery Flow

5

Issue a read to

NAND with a Vth

setting, decode with

a hard decoder

Is decoding

successful?

Data sent to

host

Yes

Is decoding

successful?

Exit ERF

with failure

Any more

ERPs left?

Set i:=i+1

Exit ERF

with success

Run error recovery

procedure i

Enter ERF

Set i=0

No

On-the-fly Decode Error Recovery Flow

Page 6: Error Recovery Flows in NAND Flash SSDs · 09/08/2018 · Marvell Semiconductor, Inc. Flash Memory Summit 2018 Santa Clara, CA 1 © 2018 Marvell Semiconductor, Inc. All rights reserved

© 2018 Marvell Semiconductor, Inc. All rights reserved.

Components of an ERF

6

Hard Read Retry• Retrying hard input decoder by performing a read with new Vth

• Useful when sub-optimal Vth were used for original read. The performance is not sensitive to quality of soft information

Soft Read Retry

• Running soft input decoder using soft information collected from multiple reads

• Useful to recover pages with many error bits. Has higher average latency but very good performance

Vth Calibration• Inferring optimal Vth using histogram of charge levels

• Useful when sub-optimal Vth settings are reason for failure and a good setting could not be found easily

LLR Calibration

• Adjusting LLRs assignment based on collected charge level histogram

• Useful when soft information is collected with sub-optimal Vth and LLRs were assigned assuming that the Vth were optimal

Page 7: Error Recovery Flows in NAND Flash SSDs · 09/08/2018 · Marvell Semiconductor, Inc. Flash Memory Summit 2018 Santa Clara, CA 1 © 2018 Marvell Semiconductor, Inc. All rights reserved

© 2018 Marvell Semiconductor, Inc. All rights reserved.

Components of an ERF (cont.)

7

Inter-Cell Interference Cancellation

• Cancelling the interference caused by adjacent cells by assigning LLRs based on states of adjacent cells

• Useful to recover pages with very bad quality

Hard Error Mitigation• Adjusting LLRs taking hard errors into account, detecting bad cells

and bad bit-lines

• Useful to recover pages with disproportionate number of hard errors

RAID

• Recovering one failed component in a RAID stripe by XOR-ing the remaining successful components in the stripe

• Useful to recover from die failures or specific failure mechanisms of 3D NANDs

Page 8: Error Recovery Flows in NAND Flash SSDs · 09/08/2018 · Marvell Semiconductor, Inc. Flash Memory Summit 2018 Santa Clara, CA 1 © 2018 Marvell Semiconductor, Inc. All rights reserved

© 2018 Marvell Semiconductor, Inc. All rights reserved.

ERF Optimization Goal

8

• Recover data with highest

probability of success and lowest

latency:

- Some system has specific

requirements on host time out

- A good ERF must produce a good

shape of the latency distribution

- A good ERF must not exhaust

hardware and NAND resources

that are needed to sustain other

traffic

0 200 400 600 800 1000 12000

0.4

0.8

1.3

1.7

2.1

Error Recovery Flow Latency (s)

Perc

enta

ge

Page 9: Error Recovery Flows in NAND Flash SSDs · 09/08/2018 · Marvell Semiconductor, Inc. Flash Memory Summit 2018 Santa Clara, CA 1 © 2018 Marvell Semiconductor, Inc. All rights reserved

© 2018 Marvell Semiconductor, Inc. All rights reserved.

ERF Classification

9

Simple/Typical/Static Complex/Customized/Adaptive

Description 1. Hard decode with Vth settings from the

NAND vendor’s read retry table

2. Soft decode with a few hard-coded Vth

settings

3. RAID

1. Start with a hard decode that uses a customized Vth setting

(derived by the firmware’s media management algorithm)

2. The subsequent steps can be either a hard decode step or soft

decode step. Each read in the subsequent steps use a Vth setting

that is adaptively derived.

3. Early detection of defective NANDs and trigger RAID as soon as

possible if RAID is deemed necessary

Vth Settings Pre-determined Derived on-the-fly

Reliability Typically meets low-end requirements if

stay within or below NAND vendor’s NAND

stressing boundary

Meet high-end requirements even if NAND is stressed beyond

NAND vendor’s stressing boundary

Latency Acceptable average latency, very bad

worst case latency

Optimal latency distribution

Optimization

complexity

Simple, requires little NAND

characterization

Complex, requires extensive NAND characterization

Page 10: Error Recovery Flows in NAND Flash SSDs · 09/08/2018 · Marvell Semiconductor, Inc. Flash Memory Summit 2018 Santa Clara, CA 1 © 2018 Marvell Semiconductor, Inc. All rights reserved

© 2018 Marvell Semiconductor, Inc. All rights reserved.

ERF Optimization Strategy

10

• NAND characterization

NAND

0 10 20 30 40 50 60 7010

-7

10-6

10-5

10-4

10-3

10-2

10-1

100

Default read, RD0, RT0, read repeat0

Number of Error Bits / AU

Fre

quency

PE =6

PE =106

PE =206

PE =1506

PE =3006

PE =3506

0 500 1000 150010

-7

10-6

10-5

10-4

10-3

10-2

10-1

Default read, RD0, RT0, read repeat0

Number of Error Bits / AU

Fre

quency

PE =3

PE =103

PE =203

PE =1503

PE =3003

PE =3503

0 20 40 60 80 100 120 140 160 180 20010

-7

10-6

10-5

10-4

10-3

10-2

10-1

Normal Mode, best read, RD0, RT0, read repeat0

Number of Error Bits / AU

Fre

quency

PE =3

PE =103

PE =203

PE =1503

PE =3003

PE =3503

• ERF Optimization

HARDWARE SIMULATOR

(MARVELL INTERNAL,

ALSO AVAILABLE FOR

SELECTED CUSTOMERS)

NAND error bit distributions

at various NAND stressing

condition and Vth settings

NAND error bit distributions

at various NAND stressing

condition and Vth settings

DSP Techniques

Statistical Analysis Tools

Error Recovery Flow

Refine

Performance estimations:

- Retry rate at different drive conditions

- Throughput at different drive conditions

Page 11: Error Recovery Flows in NAND Flash SSDs · 09/08/2018 · Marvell Semiconductor, Inc. Flash Memory Summit 2018 Santa Clara, CA 1 © 2018 Marvell Semiconductor, Inc. All rights reserved

© 2018 Marvell Semiconductor, Inc. All rights reserved.

Comparison with BCH-Based ERF

11

• In BCH-based controllers, the aim of ERF is to find a Vth setting in which the

number of raw bit errors is less than the correction capability of the BCH

code

• Hence, ERF in BCH controllers primarily involves read retries in which the

failed page is retried by reading with a different Vth setting for next read retry- If a particular read retry results in a failure, it does not provide any feedback which can help

in choosing the Vth for next read retry

- This means that the long tail in the distribution of retry latency cannot be cut short

• Don’t use an LDPC-based ERF that looks like a BCH-based ERF

• Advanced tools exists (at least in Marvell’s controllers): We collaborate with

customers on optimizing ERFs as well!

Page 12: Error Recovery Flows in NAND Flash SSDs · 09/08/2018 · Marvell Semiconductor, Inc. Flash Memory Summit 2018 Santa Clara, CA 1 © 2018 Marvell Semiconductor, Inc. All rights reserved

© 2018 Marvell Semiconductor, Inc. All rights reserved.

Questions?

12