
IEE5008 – Autumn 2012 Memory Systems

Solid State Disks

Anwar,Hossameldin

Department of Electronics Engineering

National Chiao Tung University

[email protected]

Anwar,Hossameldin 2012

Outline

Introduction

Flash memory

Flash Translation Layer (FTL)

Solid-State Disks (SSDs)

Hydra Solid-State Disk Architecture

Hydra SSD Architecture Key Features and Techniques

Performance Evaluation (PCMark05)

Related Issues

Conclusion

References


Introduction

Flash-memory-enabled devices such as MP3 players, mobile phones, and digital cameras are widely used today.

Flash memory features:

1.Low power consumption.

2.Nonvolatility.

3.High random access performance.

4.High mobility.

Solid-State Disks (SSDs) are composed of multiple NAND flash memory chips and are replacing HDDs in the mass storage market.

Solid-State Disk (SSD) features:

1.Low power consumption.

2.Faster random access.

3.Greater shock resistance.

SSD performance keeps increasing through the exploitation of parallel I/O architectures.


An SSD interacts with the host computer via a standard interface such as PATA or SATA and behaves much like a standard hard drive.

Operating systems use storage devices to provide file systems and virtual memory.

The flash translation layer (FTL) is a software layer used to emulate the functionality of HDDs.

It hides the peculiarities of flash memory and gives the illusion of an HDD.

SSD challenges:

1.Reliability for large scale flash storage.

2.The balance between cost, performance and lifetime.

3.Cost per bit of NAND flash memory is still high.

The Hydra SSD architecture is a proposed solution that enhances storage system performance by exploiting the parallelism of multiple NAND flash memory chips.

It uses several techniques to achieve this goal:

1. Bus-level and chip-level interleaving for flash memory buses.

• Problem: The mismatch between the slow flash memory bus (<40 MB/s) and the host interface (>150 MB/s).

• Result: The collective bandwidth of the flash memory buses meets or exceeds that of the host interface, and the flash memory latency is hidden.


2. Multiple high-level flash memory controllers.

• Each controller executes sequences of high-level flash memory operations without intervention by the FTL.

• The controllers are divided into two kinds of units (a foreground unit and background units).

• The foreground unit has priority over the background units and is used to expedite the processing of host read requests.

3. Write buffering.

• It expedites the processing of host write requests.

• It allows the parallelism in multiple flash memory chips to be exploited by multiple background units.

• Buffered data is therefore materialized to flash memory in parallel on different interleaved units.


Flash Memory

A NAND flash memory chip consists of a set of blocks.

Each block consists of a set of pages.

Each page has two parts:

• A data part that stores the user data.

• A spare part that stores metadata associated with the user data.

The size of the data part is a multiple of the sector size(512 bytes).

The size of the spare part is 16 bytes for each sector in the data part.

Currently, the most popular block size is 128 KB, consisting of 64 pages, each of 2 KB (a data

part of four sectors and a spare part of 64 bytes).
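The sizes quoted above follow from simple arithmetic. A minimal sketch, using only the numbers on this slide (the constant and function names are illustrative, not from the source):

```python
# Geometry of the "most popular" NAND block described on the slide.
SECTOR_BYTES = 512            # sector size
SPARE_BYTES_PER_SECTOR = 16   # spare bytes kept for each sector
SECTORS_PER_PAGE = 4          # data part of four sectors
PAGES_PER_BLOCK = 64

def page_data_bytes():
    """Size of the data part of one page."""
    return SECTORS_PER_PAGE * SECTOR_BYTES

def page_spare_bytes():
    """Size of the spare part of one page."""
    return SECTORS_PER_PAGE * SPARE_BYTES_PER_SECTOR

def block_data_bytes():
    """User-data capacity of one block (spare area excluded)."""
    return PAGES_PER_BLOCK * page_data_bytes()

print(page_data_bytes(), page_spare_bytes(), block_data_bytes() // 1024)
```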

Flash memory has several features that distinguish it from the traditional magnetic hard disk.

1. "Erase before write" architecture

• A block must first be cleaned by the erase command before data can be written into it.

2. The unit sizes of the erase and write operations are asymmetric.

• The write operation is performed in units of a page, whereas the erase operation is performed in units of a block.

• Flash translation layer (FTL) software is required to map the logical page address from the host system to the physical page address in the flash memory devices.


A typical low-level flash memory controller provides four major operations:

Erase Block Operation

Program Page Operation

Copy-Back Page Operation

Read Page Operation

The erase block operation

• It sets all the bits in a block to 1 and takes about 2 ms.

• It is initiated by an erase command that includes the address of the block to be erased.

• After the erase completes, a status check command is issued to detect any errors during the operation.

The program page operation

• It writes the supplied data to a previously erased page.

• It consists of three phases:

1. Data transfer phase: the data to be written to the target page is transferred at 40 MB/s over the flash memory bus to the internal page buffer in the NAND flash memory chip.

2. Program phase: the program command is issued along with the address of the target page (takes about 200 us).

3. Status check phase: after the program operation is complete, a status check command is issued to check for errors.


The read page operation

• It reads a page from flash memory in three steps:

1. The read command is issued.

2. This loads a page into the chip's internal page buffer (takes about 20 us).

3. The data in the internal page buffer is read out at 40 MB/s over the flash memory bus.
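The timings quoted for the program and read operations can be combined into a rough per-page latency estimate. The sketch below assumes a 2 KB page and ignores command issue and status check time, which the slides do not quantify; all names are illustrative.

```python
# Rough per-operation latency model from the figures on the slides.
BUS_BYTES_PER_US = 40       # 40 MB/s is 40 bytes per microsecond
PAGE_BYTES = 2048           # assumed 2 KB page (four 512-byte sectors)
READ_LATENCY_US = 20        # cell array -> internal page buffer
PROGRAM_LATENCY_US = 200    # internal page buffer -> cell array
ERASE_LATENCY_US = 2000     # about 2 ms per block

def transfer_time_us(n_bytes):
    """Time to move n_bytes over one flash memory bus."""
    return n_bytes / BUS_BYTES_PER_US

def read_page_time_us():
    """Read latency followed by the transfer out of the chip."""
    return READ_LATENCY_US + transfer_time_us(PAGE_BYTES)

def program_page_time_us():
    """Transfer into the chip followed by the program latency."""
    return transfer_time_us(PAGE_BYTES) + PROGRAM_LATENCY_US

print(transfer_time_us(PAGE_BYTES), read_page_time_us(), program_page_time_us())
```

Under these assumptions the bus transfer (about 51 us) dominates a page read, while the program latency dominates a page write.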

The copy-back page operation

• It transfers data from one page into another inside the chip.

• It is more efficient than moving the data out of the chip and back in again.

• It uses a read page operation followed by a program page operation:

1. It is initiated by a read command that moves the data to the internal page buffer.

2. The data corresponding to the portion of the page to be modified is transferred to the internal page buffer, and the program command is issued.

3. Finally, the usual status check is performed.


Flash Translation Layer (FTL)

It hides the peculiarities of flash memory and emulates the functionality of an HDD.

It maintains a mapping between the logical sector address used by the host system and the physical flash memory address.

The address mapping schemes of the FTL can be divided into three classes:

Page-level mapping

Hybrid mapping

Block-level mapping


Page-Level Mapping

• A logical page can be mapped to any physical page in flash memory.

• When the host sends a read request with a logical page number (LPN), the FTL finds the physical page number (PPN) in the mapping table.

• Since the mapping table is generally maintained in volatile SRAM, each physical page also stores its LPN in the spare field to guard against sudden power failure.

• If an update request is sent for data that has already been written to flash memory, the page-level mapping technique writes the new data to an empty page, invalidates the old data, and changes the mapping information for the logical page number.

• This is because a flash memory page cannot be overwritten.

• The invalidation of the old data is marked in its spare field.

• The disadvantage of this technique is that it requires a large SRAM for the mapping table.
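The update path described above can be sketched as a toy model. This is an illustration only: the class and attribute names are hypothetical, flash pages are plain Python lists, and garbage collection is not modeled.

```python
class PageLevelFTL:
    """Toy page-level mapping: any LPN may map to any PPN."""

    def __init__(self, num_physical_pages):
        self.mapping = {}                               # LPN -> PPN (the SRAM table)
        self.data = [None] * num_physical_pages         # data part of each page
        self.spare_lpn = [None] * num_physical_pages    # LPN kept in the spare field
        self.valid = [False] * num_physical_pages       # validity mark in the spare field
        self.free_pages = list(range(num_physical_pages))

    def write(self, lpn, data):
        if not self.free_pages:
            raise RuntimeError("no free page (garbage collection is not modeled)")
        ppn = self.free_pages.pop(0)      # pages cannot be overwritten: use an empty one
        old = self.mapping.get(lpn)
        if old is not None:
            self.valid[old] = False       # invalidate the old copy
        self.data[ppn] = data
        self.spare_lpn[ppn] = lpn
        self.valid[ppn] = True
        self.mapping[lpn] = ppn           # update the mapping information
        return ppn

    def read(self, lpn):
        return self.data[self.mapping[lpn]]

    def rebuild_mapping(self):
        """Recover the table from the spare fields after a power failure."""
        return {self.spare_lpn[p]: p for p in range(len(self.valid)) if self.valid[p]}

ftl = PageLevelFTL(4)
ftl.write(7, "old")
ftl.write(7, "new")
print(ftl.read(7), ftl.mapping, ftl.rebuild_mapping())
```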


Block-Level Mapping

• Only the mapping between the logical block number (LBN) and the physical block number (PBN) is maintained.

• A page must have the same page offset within both the logical block and the physical block.

• It requires only a small mapping table.

• However, when a logical page is updated, the page must be written to a new clean flash block.

• All the non-updated pages of the old block must therefore be copied into the new flash block.

• This incurs large page migration costs.
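The fixed page offset and the migration cost of an update can be illustrated with a small sketch. It assumes 64 pages per block; the function names and the bookkeeping of written offsets are hypothetical simplifications.

```python
PAGES_PER_BLOCK = 64  # assumed block geometry

def translate(lpn, block_map):
    """Map a logical page number to (physical block, page offset)."""
    lbn, offset = divmod(lpn, PAGES_PER_BLOCK)
    return block_map[lbn], offset          # the offset is unchanged by the mapping

def update_page(lpn, block_map, free_blocks, written_offsets):
    """Update one page; return how many other pages had to be copied."""
    lbn, offset = divmod(lpn, PAGES_PER_BLOCK)
    new_pbn = free_blocks.pop(0)                   # take a clean block
    copies = len(written_offsets - {offset})       # all non-updated pages migrate
    old_pbn = block_map[lbn]
    block_map[lbn] = new_pbn
    free_blocks.append(old_pbn)                    # old block is reusable once erased
    return copies

block_map = {0: 5, 1: 9}
print(translate(70, block_map))
print(update_page(70, block_map, [12], set(range(PAGES_PER_BLOCK))))
```

Updating a single page of a fully written block costs 63 page copies in this model, which is the migration overhead the slide refers to.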


Hybrid Mapping

• It is a compromise between page-level mapping and block-level mapping.

• A small portion of the physical blocks is reserved as a log buffer.

• The log blocks in the log buffer use the page-level mapping scheme, while the normal data blocks are handled by block-level mapping.

• When a write request is sent to the FTL, the data is written to a log block, and the corresponding old data in the data block is invalidated.

• When there is no empty space in the log buffer, one of the log blocks is selected as a victim, and all of the valid pages in that log block are moved into the data blocks to make space for ongoing write requests.

• Hybrid mapping requires only a small mapping table, since only the log blocks are handled by page-level mapping.
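To make the table-size argument concrete, the three schemes can be compared by counting mapping entries. This is a rough count under assumed parameters (1024 blocks, 64 pages per block, 8 log blocks, all hypothetical), not a measurement from the source.

```python
def mapping_table_entries(num_blocks, pages_per_block=64, log_blocks=8):
    """Approximate number of mapping entries for each scheme."""
    page_level = num_blocks * pages_per_block       # one entry per page
    block_level = num_blocks                        # one entry per block
    # Hybrid: block entries for data blocks, page entries only for log blocks.
    hybrid = (num_blocks - log_blocks) + log_blocks * pages_per_block
    return {"page": page_level, "block": block_level, "hybrid": hybrid}

print(mapping_table_entries(1024))
```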

Solid-State Disks (SSDs)

A typical SSD is composed of host interface control logic, an array of NAND flash memory, RAM, and an SSD controller.

The host interface control logic transfers commands and data from/to the host via a protocol such as USB or PATA.

The SSD controller translates read/write requests into flash memory operations.

The controller uses RAM to temporarily buffer write requests or accessed data while handling read/write requests.

All operations are governed by firmware (the FTL) run by the SSD controller.


To increase the read/write bandwidth of an SSD, many SSDs use an interleaving technique that exploits the parallelism of accessing multiple NAND chips simultaneously.

If there are multiple independent channels, the read/write bandwidth can be increased further by exploiting interchannel and intrachannel parallelism.

Example: a write (program) operation has two steps:

1. Loading data into the internal page register of a NAND chip.

2. Programming the loaded data into the appropriate NAND flash cells.

Because the data programming time is longer than the data loading time, data can be loaded into another NAND chip while the first chip is still programming.
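The benefit of overlapping loading and programming can be estimated with a small timing simulation. It is a simplified model: one shared channel, pages assigned to chips round-robin, and hypothetical default timings of 50 us per load and 200 us per program.

```python
def write_time_us(n_pages, n_chips, load_us=50, program_us=200):
    """Total time to write n_pages over one channel shared by n_chips."""
    bus_free = 0.0                     # time at which the channel is next idle
    chip_free = [0.0] * n_chips        # time at which each chip finishes programming
    for i in range(n_pages):
        chip = i % n_chips             # round-robin chip selection
        start = max(bus_free, chip_free[chip])   # need both the bus and the chip
        bus_free = start + load_us               # loading occupies the channel
        chip_free[chip] = bus_free + program_us  # programming runs inside the chip
    return max(chip_free)

print(write_time_us(4, 1), write_time_us(4, 4))
```

With these numbers, four pages take 1000 us on one chip but 400 us when interleaved over four chips, because the programming phases overlap.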

Hydra Solid-State Disk Architecture


The Hydra SSD architecture is based on block-level mapping.

The SSD is connected to the host system through a device-side storage system protocol such as Serial ATA or Serial Attached SCSI.

The embedded processor, SRAM, and code storage provide the execution environment for the FTL.

The host system requires:

1. High bandwidth (meeting or exceeding the maximum speed of its interface).

2. Fast response time for both reads and writes.

NAND flash memory is characterized by a slow (<40 MB/s) bus, so:

• Hydra resolves this mismatch by using multiple flash memory buses.

• The collective bandwidth then meets or exceeds the maximum bandwidth of the host interface.

Hydra uses chip-level interleaving to hide the flash read latency.

Both chip-level and bus-level interleaving are implemented by the MUX/DEMUX unit.


Definitions

• A superchip is a set of flash memory chips that are related to each other by bus-level and chip-level interleaving.

• A physical superblock in a superchip is a set of physical blocks, one from each chip in the superchip.

• Note that the term superblock is also used elsewhere to refer to metadata describing a file system.

• The size of a superblock is that of a physical block (128 KB) multiplied by the combined degree of bus-level and chip-level interleaving.

• The size of a superpage is determined in the same way.
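The size rule can be written out directly. The sketch assumes the 128 KB block and 2 KB page quoted earlier; the interleaving degrees in the example (four buses, two chips per bus) are illustrative values only.

```python
BLOCK_BYTES = 128 * 1024   # physical block size from the slides
PAGE_BYTES = 2 * 1024      # physical page size from the slides
SECTOR_BYTES = 512

def superblock_bytes(bus_ways, chip_ways):
    """Physical block size times the combined interleaving degree."""
    return BLOCK_BYTES * bus_ways * chip_ways

def superpage_bytes(bus_ways, chip_ways):
    """Physical page size times the combined interleaving degree."""
    return PAGE_BYTES * bus_ways * chip_ways

# Bus-level interleaving only (4 buses): a superpage holds 16 sectors.
print(superpage_bytes(4, 1) // SECTOR_BYTES)
# 4 buses x 2 chips: 1 MB superblocks and 16 KB superpages.
print(superblock_bytes(4, 2) // 1024, superpage_bytes(4, 2) // 1024)
```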


To meet the maximum bandwidth requirement of host write requests:

• Hydra employs volatile write buffering using bulk RAM (high-bandwidth DRAM).

• Write buffering reduces the response time of host write requests.

• A write request from the host is received by the inbound feeder, so:

1. It can be acknowledged as soon as all the data from the host has been buffered in the bulk RAM.

2. The data is later materialized to flash memory by the multiple background units.

3. Each of these units performs high-level flash memory operations on superchips without any intervention by the FTL.

To achieve the fast response time required for host read requests:

• Hydra uses the foreground unit, which has higher priority than the background units, so that latency-critical read requests are not delayed by non-time-critical materialization tasks performed by the background units.

• The prioritized forwarder unit is used to prioritize access to the bus.


To reduce the response time for read requests:

• Hydra uses a foreground request synthesizer unit.

• It contains hardware that automatically generates requests to the foreground unit in response to a read request from the host.

• It uses a mapping stored in a table in the bulk RAM.

• Hydra uses the outbound feeder unit to read write-buffered data from the bulk RAM.

• Hydra also uses the selective replacer unit to make the required replacements to the sectors read from flash memory.

The CRC/ECC encoder and decoder units are used to detect and correct possible bit-flip errors in the flash memory.

The splitter unit is used to forward data from flash memory either to the host interface or to the bulk RAM.


Hydra SSD Architecture Features and Techniques


Bus-level and Chip-level interleaving

Write Buffering

Multiple High-level Flash memory controllers

Wear-Leveling

Prioritized Handling of Foreground and Background Requests

Bus-level and Chip-level interleaving

• Hydra uses interleaving over multiple flash memory buses to overcome the bandwidth limitation of a single flash memory bus.

Bus-Level Interleaving

Sectors are fetched from the flash memory buses in a round-robin manner.

For example: sectors 0, 4, 8, 12 are fetched from flash memory bus 0, sectors 1, 5, 9, 13 from flash memory bus 1, and so on.
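The round-robin assignment in this example corresponds to taking the sector number modulo the number of buses. A minimal sketch with four buses (the function names are illustrative):

```python
def bus_for_sector(sector, n_buses=4):
    """Round-robin assignment of a sector to a flash memory bus."""
    return sector % n_buses

def sectors_on_bus(bus, n_sectors=16, n_buses=4):
    """All sectors of a 16-sector superpage served by the given bus."""
    return [s for s in range(n_sectors) if bus_for_sector(s, n_buses) == bus]

print(sectors_on_bus(0), sectors_on_bus(1))
```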

The speed-matching buffer between the MUX/DEMUX unit and each flash memory bus allows data to be transferred over that flash memory bus while the MUX/DEMUX unit is fetching from the speed-matching buffer associated with another flash memory bus.


Flash read commands are initially issued to all chips belonging to the same superchip.

After the flash read latency, data transfers are made over the flash memory buses.

The dependencies between the flash memory operations are indicated by arrows.

For example: an arrow starts at the 1st read command addressed to chip 0 on bus 0 and goes to the 1st data transfer from the same chip.

• It indicates that this data transfer is only possible after the flash read latency of 20 us.


Advantage

• It achieves an effective bandwidth of 160 MB/s (16 sectors in about 50 us) within a superpage by using four 40 MB/s flash memory buses.

Disadvantage

• There is an unavoidable time interval between accesses to different superpages during which the flash memory buses are idle, because the flash read latency is not fully hidden by the data transfer time.

Chip-Level Interleaving

It is used to eliminate this idle time.

Chip-level interleaving of degree two is sufficient to hide the flash read latency.

If the latency is longer, a higher degree of chip-level interleaving is required to hide it.
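Both the 160 MB/s figure and the "degree two is sufficient" claim can be checked with a back-of-the-envelope model. The sketch assumes 512-byte sectors (so a 16-sector superpage transfers in 51.2 us, which the slide rounds to 50 us) and assumes that a chip's read latency is hidden once the other chips on the bus can fill it with their own transfers; the function names are illustrative.

```python
import math

def superpage_transfer_us(sectors=16, n_buses=4, bus_mb_s=40, sector_bytes=512):
    """Time for all buses to move their share of one superpage in parallel."""
    bytes_per_bus = sectors * sector_bytes / n_buses
    return bytes_per_bus / bus_mb_s          # MB/s equals bytes per microsecond

def effective_bandwidth_mb_s(sectors=16, n_buses=4, bus_mb_s=40, sector_bytes=512):
    """Bandwidth seen while a superpage is being transferred."""
    total_bytes = sectors * sector_bytes
    return total_bytes / superpage_transfer_us(sectors, n_buses, bus_mb_s, sector_bytes)

def required_chip_interleaving(read_latency_us, transfer_us):
    """Smallest degree d such that (d - 1) transfers cover one read latency."""
    return 1 + math.ceil(read_latency_us / transfer_us)

t = superpage_transfer_us()
print(t, effective_bandwidth_mb_s(), required_chip_interleaving(20, t))
```

In this model a 20 us read latency needs degree two, while a hypothetical 60 us latency would need degree three.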


The logical address from the host is divided into a logical superblock address and a sector

address within that superblock.

The logical superblock is mapped to a physical superblock.

The required mapping is provided by the block mapping table, which is stored in both bulk RAM

and flash memory.


What is the restriction on the mapping between logical and physical superblocks?

• A given logical superblock can only be remapped to a physical superblock in the same superchip.

How is this restriction enforced?

• The lower i bits of a logical superblock address are used to index the superchip, where i = log2(n) and n is the number of superchips.

• The logical superblock is therefore always remapped to a physical superblock in the same superchip.

Advantage

• This allows Hydra to use the copy-back page operation, which is allowed only between pages in the same flash memory chip.
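The indexing rule amounts to masking off the low-order bits of the logical superblock address. A minimal sketch, assuming the number of superchips is a power of two (the function name is hypothetical):

```python
def superchip_index(logical_superblock, n_superchips):
    """Select the superchip from the lower i = log2(n) address bits."""
    if n_superchips <= 0 or n_superchips & (n_superchips - 1):
        raise ValueError("number of superchips must be a power of two")
    i = n_superchips.bit_length() - 1          # i = log2(n)
    return logical_superblock & ((1 << i) - 1)

# With 4 superchips, logical superblocks 1, 5, 9, 13 all live on superchip 1.
print([superchip_index(sb, 4) for sb in (1, 5, 9, 13)])
```

Because the index depends only on the logical address, remapping never moves a logical superblock to a different superchip.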


Write Buffering

• It is used to decouple the materialization to flash memory from the processing of host write requests.

• It requires a large portion of the bulk RAM to be reserved as a sector buffer.

• The sector buffer operates as a circular buffer.

• Information about the sectors in the sector buffer is stored in the bulk RAM.

• It is maintained for each logical superblock by the FTL.

• This information is used when the sectors are later materialized to flash memory.

• This materialization is performed by the background units.

• The background materialization process is invoked in three cases:

1. When the free space in the sector buffer falls below a given threshold (the flush high-watermark).

2. When the host sends a flush cache request, which requires sectors write-buffered in volatile storage to be materialized in non-volatile storage.

3. When the number of logical superblocks with sectors in the sector buffer rises above a given threshold.
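The three trigger conditions can be expressed as a single predicate. This is only a sketch of the decision logic; the parameter names and the example threshold values are hypothetical.

```python
def should_materialize(free_sectors, free_threshold,
                       flush_requested,
                       buffered_superblocks, superblock_threshold):
    """Return True if any of the three materialization triggers fires."""
    low_on_space = free_sectors < free_threshold                    # case 1
    too_many_superblocks = buffered_superblocks > superblock_threshold  # case 3
    return low_on_space or flush_requested or too_many_superblocks  # case 2 in the middle

print(should_materialize(100, 256, False, 3, 8))    # low free space
print(should_materialize(1000, 256, False, 3, 8))   # nothing to do
```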


Multiple high-level flash memory controllers

• They consist of one foreground unit and several background units.

• Each high-level controller is capable of executing high-level flash memory operations (given as a linked list of operation descriptors).

• Since a high-level flash memory operation is directed to a superchip, we refer to it as a superchip operation.

• A linked list containing a sequence of superchip operations is prepared by the FTL.

• The address of the 1st descriptor is given to the high-level flash memory controller through a command register.

• A superchip operation operates on a physical superblock and can span multiple superpages.

• For each superchip operation, the high-level flash memory controller generates a sequence of low-level flash memory operations.

• This performs the requested task without any intervention by the FTL.


1. High-level flash memory operations

Erase superchip

• It erases all the physical blocks in a physical superblock.

• It accepts the superchip number and the physical block address to be erased.

• It uses a sequence of low-level erase block operations to perform the erasure.

• If there is an error during one of the low-level erase block operations, an interrupt is raised to the FTL, which is also supplied with the identity of the physical block that caused the error.

• This bad block is mapped out and replaced by one of the spare physical blocks.

• The FTL then retries the erase superchip operation.

Program superchip

• It programs an arbitrary number of sectors in a physical superblock.

• It takes five arguments: the superchip number, the physical block address of the physical superblock, the number of sectors to be programmed, the number of the start sector within the physical superblock, and the start sector buffer index from which the sectors to be programmed are sequentially read.

• If there is an error, the valid sectors in the remapped superblock are copied to the new superblock before the program superchip operation is retried.


Read superchip

• It reads an arbitrary number of sectors from a physical superblock.

• It requires five arguments: the superchip number, the physical block addresses of the physical superblock, the number of the start sector within the physical superblock, the number of sectors to be read, and the start sector buffer index at which the requested sectors are sequentially stored.

• The data returned from flash memory can either be stored in the sector buffer or forwarded to the host interface.

• A special sector buffer index is therefore reserved to designate the host interface.

• The splitter unit handles the required routing.

Copy-back superchip

• It performs a merge operation between the physical superblock currently mapped to a logical superblock and the set of sectors that are write-buffered in the sector buffer.

• It has four arguments: the superchip number of the source and destination physical superblocks, the physical block addresses of the source physical superblock, the address of the data structure that contains information about the sectors write-buffered in the sector buffer for the logical superblock, and the physical block addresses of the destination physical superblock.


Notes:

• The superchip number is the same for both the source and the destination physical superblocks.

• After a merge operation for a logical superblock, the block mapping table in the bulk RAM is updated for that logical superblock.

• A log is also created in a reserved area of flash memory.

• Periodically, the block mapping table in the bulk RAM is flushed to another reserved area of flash memory.

• During power-on recovery, the block mapping table in flash memory is loaded into the bulk RAM.

• The logs are then replayed to reconstruct the up-to-date block mapping table.
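The recovery procedure is a standard checkpoint-plus-log replay. A minimal sketch, assuming the flushed table is a dictionary from logical to physical superblock and each log entry is a (logical, physical) pair in the order it was written; the actual on-flash formats are not given in the slides.

```python
def recover_block_map(flushed_table, log_entries):
    """Rebuild the up-to-date block mapping table after power-on."""
    table = dict(flushed_table)                 # load the last flushed copy
    for logical_sb, physical_sb in log_entries:
        table[logical_sb] = physical_sb         # replay remappings in order
    return table

flushed = {0: 10, 1: 11}
logs = [(1, 20), (0, 30), (1, 21)]   # merges performed after the last flush
print(recover_block_map(flushed, logs))
```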


2. Multiple background units

• Each is a high-level flash memory controller.

• They allow more than one superchip operation involving different logical superblocks to be performed in parallel.

• With coarse-grained interleaving, the flash memory buses are reserved for the low-level erase operation issued by background unit 1, so operations from background units 2 to 4 cannot be performed.

• Each E, P, R, and S command denotes multiple low-level flash memory commands of the same type directed to the constituent chips in the superchip.

• T indicates parallel data transfer to/from the chips in the superchip over multiple flash memory buses.


How can the problem resulting from coarse-grained interleaving be rectified?

• Operations from the background units are interleaved more finely.

• If a background unit issues a long-latency command (such as an erase, program, or read command), that unit is suspended and another background unit is resumed.

• Background unit 1 is suspended after issuing an erase command, allowing other background units to issue commands directed to other superchips.

• After the erase command is complete, background unit 1 resumes and issues the required status check command.

• The dispatcher unit is responsible for scheduling between the multiple background units.


Prioritized handling of foreground and background requests

• One of the high-level flash memory controllers is designated as the foreground unit to reduce the response time of host read requests.

• The prioritized forwarder unit gives this foreground unit priority over the background units.

• When a new foreground request arrives, the prioritized forwarder unit preempts the background processing in progress at the earliest possible time.

• When the foreground request targets a superchip that is busy with a background operation, the preemption occurs at the end of the current primitive operation.


The foreground unit waits until the current low-level flash operation from background unit 4 is complete.

This is because both units need to access the same superchip.

This delayed preemption is necessary because an operation in a NAND flash memory chip cannot be suspended by another operation addressed to the same chip.


Wear-leveling

• Hydra uses two simple techniques borrowed from wear-leveling in page-mapping FTLs:

1. Implicit

2. Explicit

• In implicit wear-leveling, the free physical superblock with the smallest erase count is used as the destination of the copy-back superchip operation.

• In explicit wear-leveling, when the SSD is idle, the physical superblock with the smallest erase count is swapped with the free physical superblock with the largest erase count.

• To facilitate both implicit and explicit wear-leveling, Hydra keeps the erase counts of all the physical superblocks in the bulk RAM and flushes this information to flash memory when the block mapping table is flushed.
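The two selection rules can be sketched as follows. The sketch assumes erase counts are kept in a dictionary keyed by physical superblock number and treats the "physical superblock with the smallest erase count" as the least-erased superblock currently in use; both are simplifying assumptions, and the function names are hypothetical.

```python
def pick_copy_back_destination(free_superblocks, erase_count):
    """Implicit wear-leveling: least-erased free superblock becomes the destination."""
    return min(free_superblocks, key=lambda sb: erase_count[sb])

def pick_explicit_swap(used_superblocks, free_superblocks, erase_count):
    """Explicit wear-leveling: pair the least-erased used superblock
    with the most-erased free superblock."""
    coldest_used = min(used_superblocks, key=lambda sb: erase_count[sb])
    hottest_free = max(free_superblocks, key=lambda sb: erase_count[sb])
    return coldest_used, hottest_free

erase_count = {0: 5, 1: 2, 2: 9, 3: 7}
print(pick_copy_back_destination([1, 2], erase_count))
print(pick_explicit_swap([0, 3], [1, 2], erase_count))
```

Moving rarely updated data onto a heavily erased superblock frees the lightly erased one for future writes, which is what evens out the erase counts over time.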


Performance Evaluation (PCMark05)

PCMark05 HDD benchmark program (build 1.2.0)

• It emulates the workload of a typical PC environment.

• It has five components: XP Startup, Application Loading, General Usage, Virus Scan, and File Write.

Comparison of Hydra with HDDs and other SSDs

• Hydra performs 80% better than the best of the other disks (Samsung 2.5-inch SSD).


Sensitivity Analysis

1. Effect of bus-level and chip-level interleaving

• All the component benchmarks improve as the degree of either type of interleaving is increased, because both host read and write requests and background materialization to flash memory are serviced faster.

2. Effect of multiple background units

• The performance of Hydra improves gradually as more background units are used for writing to flash.


3. Effect of write buffering

• Performance is severely limited when there is no write buffering, for all the component benchmarks except Virus Scan.

• With the buffer size fixed at 15 MB, the flush high-watermark is varied. On all the benchmarks except File Write, the result is similar to that of changing the sector buffer size, because reducing the flush high-watermark has the indirect effect of limiting the size of the sector buffer available for write buffering.


4. Effect of prioritized handling of foreground and background requests

5. Effect of mapping table replication

• This added protection can be obtained without seriously affecting the performance (<50%).


Related Issues

Energy Consumption

• Energy consumption (in microjoules per sector read or written) of the Hydra SSD prototype and the other five storage systems.


Wear-Leveling

•Fig. 22a shows the maximum difference in erase counts among the physical superblocks over

time for different threshold values used in Hydra’s explicit wear-leveling technique.

•Fig. 22b shows a smaller threshold value gives a tighter upper bound and more effective wear-

leveling, but this is at the expense of an increase in the number of extra erase operations.


Conclusion

Hydra SSD architecture uses both bus-level and chip-level interleaving to overcome the

bandwidth limitation of the flash memory bus/chip.

It has a prioritized structure of memory controllers, consisting of a single high-priority

foreground unit and multiple low-priority background units, all capable of executing

sequences of high-level flash memory operations without any software intervention. The

foreground unit is dedicated to the processing of read requests from the host to minimize

the response time.

It employs an aggressive write buffering scheme, which ensures that the background

units are utilized effectively, and also reduces the response time of write requests.


References

Jaehong Kim; Sangwon Seo; Dawoon Jung; Jin-Soo Kim; Jaehyuk, "Parameter-Aware I/O Management for Solid State Disks (SSDs)", IEEE Transactions on Computers, May 2012.

Seon-yeong Park; Euiseong Seo; Ji-Yong Shin; Seungryoul Maeng; Joonwon Lee, "Exploiting Internal Parallelism of Flash-based SSDs", Computer Architecture Letters, Jan. 2010.

Jin Hyuk Yoon; Eyee Hyun Nam; Yoon Jae Seong; Kim, H.; Kim, B.S.; Sang Lyul Min; Yookun Cho, "Chameleon: A High Performance Flash/FRAM Hybrid Solid State Disk Architecture", Computer Architecture Letters, Jan. 2008.

Hsin-Hung Chen; Wei-Jian Su; Da-Wei Chang, "Cross-Layer Optimizations in Solid-State Drives", Embedded Systems Letters, IEEE, Dec. 2011.

Dong Kim; Kwanhu Bang; Seung-Hwan Ha; Sungroh Yoon; Eui-Young Chung, "Architecture Exploration of High-Performance PCs with a Solid-State Disk", IEEE Transactions on Computers, July 2010.

Soojun Im; Dongkun Shin, "Flash-Aware RAID Techniques for Dependable and High-Performance Flash Memory SSD", IEEE Transactions on Computers, Jan. 2011.

Li-Pin Chang, "A Hybrid Approach to NAND-Flash-Based Solid-State Disks", IEEE Transactions on Computers, Oct. 2010.

Yoon Jae Seong; Eyee Hyun Nam; Jin Hyuk Yoon; Hongseok Kim; Jin-Yong Choi; Sookwan Lee; Young Hyun Bae; Jaejin Lee; Yookun Cho; Sang Lyul Min, "Hydra: A Block-Mapped Parallel Flash Memory Solid-State Disk Architecture", IEEE Transactions on Computers, July 2010.
