FRA: A Flash-aware Redundancy Array of Flash Storage Devices

Yangsup Lee
Samsung Electronics Co., Ltd.

Korea

[email protected]

Sanghyuk Jung, Yong Ho Song
Department of Electronics and Computer Engineering

Hanyang University, Korea

{shjung, yhsong}@enc.hanyang.ac.kr

ABSTRACT
Flash memory has many attractive characteristics such as high performance, non-volatility, low power consumption and shock resistance, so it has been widely used as a storage medium in embedded and computer systems. In terms of reliability, however, flash memory has notable shortcomings: potentially high I/O latency due to erase-before-write and poor durability due to limited erase cycles. To overcome these problems, a RAID technique borrowed from hard-disk storage technology can be employed. With RAID, multi-bit burst failures in a page, block or device are easily detected and corrected, so reliability can be significantly enhanced. However, the existing RAID-5 scheme for flash-based storage suffers from increased response time caused by parity updates. To overcome this problem, we propose a novel RAID technique for flash storage, called Flash-aware Redundancy Array. In this approach, parity updates are postponed so that they are removed from the critical path of read and write operations; instead, they are scheduled for when the device becomes idle. The proposed scheme shows, for example, a 19% improvement in average write response time compared to other approaches.

Categories and Subject Descriptors B.3.2 [Memory Structures]: Mass Storage; D.4.2 [Operating Systems]: Garbage collection

General Terms Management, Measurement, Performance, Reliability

Keywords Storage Systems, Flash Memory, Flash Translation Layer, RAID

1. INTRODUCTION
Flash memory has many attractive characteristics such as high performance, non-volatility, low power consumption and shock resistance, and it has been widely used as a storage medium in a variety of embedded and computer systems. Recently, flash-based storage systems such as SSDs (Solid State Drives) have been replacing traditional hard disks in server systems as their performance continues to improve. However, flash memory still has shortcomings that limit the rapid proliferation of flash-based storage: potentially high I/O latency due to erase-before-write and poor durability due to limited erase cycles. Flash memory may contain bad blocks from the time of manufacture; if data are written into such blocks, the reliability of the data is not guaranteed. Moreover, when a block has been erased beyond its allowed cycle count, it becomes unusable. Reliability is therefore an important issue that must be properly addressed in building flash-based storage systems.

To overcome this problem, the Flash Translation Layer (FTL) of flash-based storage systems implements several protective measures. One of them is to store an Error Correction Code (ECC) in the spare area of each flash page, the extra space per page used to maintain a variety of bookkeeping information. When a page is retrieved from the flash, it is accompanied by the ECC stored in its spare area. A new ECC is calculated from the page data and compared against the retrieved one to detect and correct bit errors before the page is forwarded to the host. However, as silicon technology evolves, recent flash memory tends to yield more bit errors than such an ECC mechanism can handle.
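To make the read-path check concrete, the following sketch shows where the code stored in the spare area is compared against one recomputed from the page data. Real NAND controllers use Hamming or BCH codes that can also correct errors; the single XOR checksum below is only a stand-in that detects mismatches, and all names are hypothetical.

```c
/* Sketch of the read-path ECC check. A real controller uses Hamming/BCH codes
 * that can correct bit errors; the XOR checksum here is only a stand-in to show
 * where the stored code (spare area) is compared with a recomputed one. */
#include <stdint.h>
#include <stddef.h>
#include <stdbool.h>

#define PAGE_DATA_SIZE 2048   /* assumed data-area size of one page */

struct flash_page {
    uint8_t data[PAGE_DATA_SIZE];
    uint8_t spare_ecc;        /* stored when the page was programmed */
};

static uint8_t compute_ecc(const uint8_t *data, size_t len)
{
    uint8_t ecc = 0;
    for (size_t i = 0; i < len; i++)
        ecc ^= data[i];
    return ecc;
}

/* Returns true if the page can be forwarded to the host as-is. */
bool page_read_ok(const struct flash_page *p)
{
    return compute_ecc(p->data, PAGE_DATA_SIZE) == p->spare_ecc;
}
```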

Another approach is to employ the RAID technique borrowed from hard-disk storage technology. The rationale behind this approach is related to the organization of flash-based storage: many flash memory chips are packed into a single device to achieve high capacity and high performance, which in turn increases the probability of failure. Ideally, when RAID technology is used in flash storage, a multi-bit burst failure in a page, block or device can be detected and corrected, so reliability is significantly enhanced.

However, implementing RAID technology in flash storage is not straightforward because of the unique characteristics of flash memory, such as erase-before-write and asymmetric read/write performance. If not properly implemented, the RAID technique can significantly deteriorate storage performance. In particular, whenever a page is updated, the other pages in its stripe must be read to calculate a new parity page, and the new parity must be stored in the flash memory; this overhead should be avoided whenever possible.

In this paper, we propose a novel approach to using the RAID technique in flash storage, called Flash-aware Redundancy Array


(FRA, in short). This approach allows page, block or device failures to be recovered using the RAID-5 scheme, while incurring less timing overhead on read and write operations in the flash memory. In our approach, parity updates are postponed so that they are not included in the critical path of read and write operations; instead, they are scheduled for when the device becomes idle. Simulation results show that the proposed scheme improves RAID-based flash storage performance by up to 44% compared to other approaches. One reason for this benefit is that multiple delayed parity updates to the same page can be coalesced into one before they are presented to the flash memory.

The rest of this paper is organized as follows. Section 2 presents the preliminaries of flash-memory-based storage technology, including FTL and RAID schemes. Section 3 reviews related work, and Section 4 describes the motivation of this paper. The organization and operation of FRA are described in Section 5, and its performance is compared against other schemes in Section 6. Finally, Section 8 concludes the paper.

2. BACKGROUND
2.1 Flash Memory
Flash storage is often compared with hard disks in many important respects. First, flash memory uses the page as its read/write unit; a page used to be equal to a sector (512 bytes) but is now 2 Kbytes or 4 Kbytes in most flash devices. This discrepancy in the read/write unit, together with the lack of support for in-place updates, led to the introduction of the FTL in flash storage. This software layer is responsible for mapping a logical sector address from the host to a physical location in the flash memory. Depending on the granularity of the mapping, such algorithms are classified into page-level mapping, block-level mapping, and hybrid mapping [6][15] (see Section 2.2 for details).

Second, unlike hard disks, each page of flash memory consists of a data area that stores user data and a spare area that holds metadata associated with the user data, such as the ECC and the logical sector addresses corresponding to the user data.

Third, flash memory does not support in-place updates. When a page is to be updated, it must first be erased before it can be re-written with the new data. Furthermore, the unit of the erase operation is a block, which consists of multiple pages, typically 64 or 128 as shown in Table 1. Therefore, when a block is erased, the valid pages in the block must be copied elsewhere so that they remain accessible.

Table 1. Characteristics comparison of SLC and MLC

Type   Erase cycles   Pages per block   Program time
SLC    100,000        64                200 us
MLC    10,000         128               800 us

The attempt to increase the capacity of flash devices has introduced the technique of storing multiple bits in a cell; such a device is called an MLC (Multi-Level Cell) flash. However, the capacity increase comes at the cost of the durability and performance of the flash device. As summarized in Table 1, a traditional SLC (Single-Level Cell) flash holds only one bit per cell but offers 10 times more erase cycles and 4 times faster programming than MLC. Due to their low cost and high density, MLC devices are often preferred by storage vendors, and their decreased reliability necessitates the use of error detection and correction techniques.

2.2 FTL (Flash Translation Layer)
The FTL is a software layer that provides a hard-disk interface to upper-layer software (e.g., file systems) by dynamically associating logical sector addresses with physical locations in the flash device. There are three common features in an FTL: address mapping, wear-leveling and garbage collection.

Address mapping translates a given logical sector address from the host into a corresponding physical page address of the flash memory. In the case of a write, the physically addressed page should be clean, i.e., erased. There are three types of mapping scheme: block-level mapping, page-level mapping and hybrid mapping. Among these, hybrid mapping employs a logging technique to temporarily hold updates to user data in the flash memory, which contributes to good write performance and a small mapping table; it is therefore especially useful in embedded systems. Examples of hybrid mapping include BAST (Block-Associative Sector Translation) [9], FAST (Fully-Associative Sector Translation) [5], SAST (Set-Associative Sector Translation) [6], Superblock FTL [10] and MC_SPLIT [7].

Wear-leveling increases the lifetime of flash memory by distributing erase operations across blocks as evenly as possible [11]. However, strict enforcement of wear-leveling may result in an unproductive increase in erase count.

Garbage collection reclaims space by erasing obsolete blocks. It chooses a victim block to reclaim, copies any valid pages to another block, and performs an erase operation on the victim. This prepares a clean block for further writes so that a write can complete without having to perform an expensive erase beforehand. Garbage collection is therefore closely tied to the write performance of flash storage.
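As an illustration of the address-mapping feature, the following sketch shows a page-level mapping table handling a write out of place, assuming a pool of pre-erased pages maintained by garbage collection. The helper names and sizes are hypothetical; a real FTL also persists the table and performs wear-leveling.

```c
/* Minimal sketch of page-level address mapping on the write path, assuming a
 * flat logical-to-physical table and a supply of clean (erased) pages kept
 * ready by garbage collection. Names and sizes are illustrative only. */
#include <stdint.h>

#define NUM_LOGICAL_PAGES 4096
#define INVALID_PPN 0xFFFFFFFFu

static uint32_t l2p[NUM_LOGICAL_PAGES];  /* logical -> physical page; set to INVALID_PPN at format time */
static uint32_t next_clean_ppn;          /* next pre-erased page, refilled by garbage collection */

/* Hypothetical driver hooks, reduced to stubs for this sketch. */
static void nand_program(uint32_t ppn, const void *buf) { (void)ppn; (void)buf; }
static void mark_invalid(uint32_t ppn)                  { (void)ppn; }

/* Out-of-place update: new data never overwrites its old physical page. */
void ftl_write(uint32_t lpn, const void *buf)
{
    uint32_t old = l2p[lpn];
    uint32_t ppn = next_clean_ppn++;     /* take the next clean (erased) page */

    nand_program(ppn, buf);              /* write user data */
    if (old != INVALID_PPN)
        mark_invalid(old);               /* old copy becomes garbage; reclaimed by GC later */
    l2p[lpn] = ppn;                      /* redirect the logical page to its new location */
}
```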

2.3 RAID
RAID (Redundant Arrays of Inexpensive Disks), built on the magnetic disk technology developed for personal computers, offers an attractive alternative to a SLED (Single Large Expensive Disk), promising improvements of an order of magnitude in performance, reliability, power consumption, and scalability [3]. There are several RAID levels, each with a different trade-off between cost, reliability and performance.

2.3.1 RAID-0
RAID-0 was developed to build a large storage capacity from an array of inexpensive disks. In this approach, user data are interleaved over the disks so that both capacity and data transfer rate increase. Recently, RAID-0 has mainly been used for achieving high performance rather than large capacity. However, this approach provides no redundancy: when one of the disks malfunctions, it is impossible to retrieve the user data from the storage. Therefore, it is used only when access performance is the sole goal.


2.3.2 RAID-4
To overcome the reliability challenge, RAID-4 makes use of an extra disk that holds the redundant information needed to recover user data when a disk fails [3]. RAID-4 stripes data at the block level across several drives, with parity stored on one dedicated drive. This approach provides high read performance, as in RAID-0, due to parallel reads across multiple disks. However, write performance is poor because of the overhead of writing the additional parity. A further drawback of RAID-4 is that the parity disk carries excessive traffic, since parity must be written for every write.

(a) RAID Level 4

(b) RAID Level 5

2.3.3 RAID-5
While RAID-4 achieves high read performance due to parallel accesses to the disks, its write performance suffers from the write accesses concentrated on the parity disk. To remedy this problem, RAID-5 distributes user data and parity across the disks, as shown in Figure 1. For example, when the data in D1 and D4 are updated, Disks 0, 1, 3 and 4 may become busy in parallel or in an overlapped fashion, so no single disk becomes a performance bottleneck.
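For illustration, the following sketch maps a logical stripe unit to a data disk and to the rotating parity disk of its stripe, assuming a five-disk array and one common rotation pattern; the exact rotation shown in Figure 1(b) may differ.

```c
/* Sketch of RAID-5 data/parity placement with rotating parity, for a
 * five-disk array. The rotation rule here is one common choice, not
 * necessarily the one drawn in Figure 1(b). */
#include <stdio.h>

#define NUM_DISKS 5

/* For a logical stripe unit, compute which disk holds it and which disk holds
 * the stripe's parity. There are NUM_DISKS-1 data units per stripe. */
static void raid5_map(unsigned lun, unsigned *data_disk, unsigned *parity_disk)
{
    unsigned stripe = lun / (NUM_DISKS - 1);
    unsigned idx    = lun % (NUM_DISKS - 1);

    *parity_disk = (NUM_DISKS - 1) - (stripe % NUM_DISKS);
    /* skip over the parity disk when laying out the data units */
    *data_disk = (idx >= *parity_disk) ? idx + 1 : idx;
}

int main(void)
{
    for (unsigned lun = 0; lun < 8; lun++) {
        unsigned d, p;
        raid5_map(lun, &d, &p);
        printf("unit %u -> data disk %u, parity disk %u (stripe %u)\n",
               lun, d, p, lun / (NUM_DISKS - 1));
    }
    return 0;
}
```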

However, in flash storage, if RAID-5 is used in the same way as in hard disks, it results in poor write performance because even a small random write may incur expensive write and erase operations.

3. RELATED WORK
Many approaches have aimed to improve the performance of flash storage when RAID-5 is used. One such effort is REM (Reliability Enhancing Mechanism), which uses RAID-5 across several MLC flash memories to increase the reliability of the storage [1]. Because this approach uses a native RAID-5 scheme, it inherits all the strengths and weaknesses of RAID-5. In general, a RAID-5 write request is handled using the following sequence.

1. Read the full stripe except the portion where the modification will take place.

2. Generate new parity information from the new and the unmodified data.

3. Write the new user data to storage.
4. Write the new parity data to storage.

As indicated in the above sequence, in order to write new user data, RAID-5 must read the unmodified area of the stripe and generate a new parity; finally, both the new user data and the parity must be stored. Due to the overhead of writing the additional parity, write performance can be penalized by over 50% compared with the case of having no redundancy.
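The sequence above corresponds to the reconstruct-write path sketched below: the unmodified units of the stripe are read, XORed with the new data to form the new parity, and then the data and parity are written. The buffer sizes and I/O helpers are placeholders, not a particular implementation.

```c
/* Sketch of the RAID-5 reconstruct-write sequence listed above. The I/O
 * helpers are stubs; unit size and stripe width are assumptions. */
#include <stdint.h>
#include <string.h>

#define UNIT_SIZE  4096
#define DATA_UNITS 4          /* data units per stripe (parity is the 5th device) */

static void read_unit(unsigned disk, unsigned stripe, uint8_t *buf)        { (void)disk; (void)stripe; memset(buf, 0, UNIT_SIZE); } /* stub */
static void write_unit(unsigned disk, unsigned stripe, const uint8_t *buf) { (void)disk; (void)stripe; (void)buf; }                 /* stub */

void raid5_small_write(unsigned stripe, unsigned target, const uint8_t *new_data,
                       unsigned parity_disk)
{
    uint8_t parity[UNIT_SIZE];
    uint8_t tmp[UNIT_SIZE];

    memcpy(parity, new_data, UNIT_SIZE);           /* start the parity from the new data */
    for (unsigned d = 0; d < DATA_UNITS; d++) {    /* 1. read the rest of the stripe */
        if (d == target)
            continue;
        read_unit(d, stripe, tmp);
        for (unsigned i = 0; i < UNIT_SIZE; i++)
            parity[i] ^= tmp[i];                   /* 2. accumulate the new parity */
    }

    write_unit(target, stripe, new_data);          /* 3. write new user data */
    write_unit(parity_disk, stripe, parity);       /* 4. write new parity data */
}
```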

Another approach is SBSS (Self-Balancing Striping Scheme), a striping scheme for flash storage designed to reduce read latency [2]. Because of the conflict between the localities of reads and writes, the benefit of parallelism is severely limited; to address this, SBSS presents a redundancy allocation policy and a request-scheduling algorithm. The redundancy allocation regularly disperses a piece of data over two other chips, as shown in Figure 2, so the data is interleaved across three different banks to allow load balancing. Thus, if some chips have many pending flash operations, a read request can be redirected to another chip that holds the associated parity data, reducing read latency. However, this approach must update two parities for every write to user data, so write performance can drop by over 70% compared with the case where no redundancy is used. Furthermore, the redundancy space is as large as the data area, which reduces space efficiency.

There has also been an attempt to use RAID-4 to enhance fault tolerance [12]. This mechanism uses NVRAM to amortize the poor write performance caused by frequently updated parity: most parity updates for write requests are absorbed in NVRAM, and the parity is flushed to the flash memory once it is complete. As a result, it avoids both the frequent parity updates concentrated on a single device and the numerous parity writes caused by partial write requests. However, it requires additional NVRAM, and it may raise reliability concerns for the NVRAM itself, whose write cycles are consumed considerably by the frequently updated parity.

Figure 2. Data and code organization of SBSS

Figure 1. RAID architectures


4. IDEA IN A NUTSHELL
As explained above, RAID-5 has the problem of updating parity data even for small random writes, and such parity updates increase the latency of write operations. The idea here is quite straightforward. (1) Parity updates are performed after a write operation finishes. If there is sufficient idle time between flash accesses, this period can be utilized to store parity to flash, and thus to increase write performance. (2) Delayed parity updates can absorb frequent updates to hot data in the flash, which in turn reduces the number of write operations. In this case, the scheme contributes not only to performance but also to reliability. The details of the delayed update scheme are discussed in Section 5.2.

It has been observed that the idle time between disk accesses is much longer than the disk access time in the PC environment. Figure 3 illustrates (a) the idle time per sector and, in (c) and (d), its distribution, measured with DiskMon [4] while monitoring the disk of a PC running Windows XP. Figures 3 (c) and (d) indicate that the idle time is longer than 10 ms on average. Therefore, we can utilize idle time to update parity and improve write performance. However, we also noticed that idle time is insufficient under benchmark tools or large file-copy operations.

(a) Idle time per sector for each workload (microseconds)
(c) Idle time distribution of MS Office Installation
(d) Idle time distribution of Internet Exploration

5. FRA (Flash-aware Redundancy Array)
The RAID-based approaches discussed in Section 3 have good characteristics such as high reliability and high read performance, but they trade write performance for these benefits. The degradation of write performance is caused by the overhead of writing parity to flash for better reliability; some of the techniques lose 50% of write performance due to such overheads. Minimizing this overhead is therefore the key requirement for improving write performance. The latency of a write request consists of the flash access time for the data plus the time to write its parity to the flash memory. Write performance in FRA is expected to be almost the same as when no redundancy scheme is adopted, because the parity update is delayed and executed in a later idle period.

(a) Existing algorithm behavior

(b) FRA behavior

5.1 Overall Architecture
FRA is based on the RAID-5 architecture. In RAID-5, user data is striped across four different disks as in Figure 1(b), and parity data is stored on another disk, so user data and parity data can be stored at the same time. Recent SSDs use multiple channels, such as 4, 8 or 16, and the flash memories on the same channel share its I/O bus. In FRA, the logical address layout is organized as shown in Figure 5. To reduce latency, user data and its parity data are located on different chips, as in the RAID-5 architecture.


Figure 3. Idle and disk access time

Figure 4. Idle time usage

Figure 5. Overall organization of FRA


The FRA allocates 20% of the total blocks to the redundancy area. In this case, four pages, one from each device at the same offset, form a stripe, and their parity is stored on a different device. If reliability needs to be supported for only a fraction of the data area, the parity area can be reduced accordingly, leaving more blocks for user data.

5.1.1 LP, LPG and Parity
An LPG (Logical Page Group) is a group composed of LPs (Logical Pages). The number of LPs in an LPG equals the number of channels, to maximize parallelism, and the LPs are ordered sequentially. A parity page is generated for each LPG. If the flash memory has problems such as read errors, the original user data can be recovered from its parity because the parity resides on another chip; this is why every LPG and its parity are located on different flash memories. Figure 6 shows the overall layout of LPs and LPGs with eight flash memories and 4 channels.
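A minimal sketch of such a layout follows, assuming 4 channels and 8 chips (two per channel): each LPG takes one LP per channel, and its parity page is placed on a chip that holds none of the group's data, rotating with the LPG number. The rotation rule is our assumption; only the property that data and parity never share a chip is taken from the text.

```c
/* Sketch of an LP/LPG layout in the spirit of Figure 6, assuming 4 channels
 * and 8 chips (chips 0..3 hold data, chips 4..7 take rotating parity). */
#include <stdio.h>

#define NUM_CHANNELS 4
#define NUM_CHIPS    8

static unsigned lpg_of(unsigned lpn)         { return lpn / NUM_CHANNELS; }
static unsigned channel_of(unsigned lpn)     { return lpn % NUM_CHANNELS; }
static unsigned data_chip_of(unsigned lpn)   { return channel_of(lpn); }                      /* first chip on the channel */
static unsigned parity_chip_of(unsigned lpg) { return NUM_CHANNELS + (lpg % NUM_CHANNELS); }  /* second chip on a rotating channel */

int main(void)
{
    for (unsigned lpn = 0; lpn < 12; lpn++)
        printf("LP %2u -> LPG %u, channel %u, data chip %u, parity chip %u\n",
               lpn, lpg_of(lpn), channel_of(lpn), data_chip_of(lpn),
               parity_chip_of(lpg_of(lpn)));
    return 0;
}
```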

5.2 Delayed Write Scheme
FRA performs three operations for a write request from the host. First, the user data is written into a log block in the flash memory. Second, the previously generated parity data associated with the LPG becomes invalid because of the newly arrived user data, so the FRA deletes the corresponding mapping entry of the log block, as in Figure 7(a). Finally, the FRA inserts a parity-update job into the PG (Parity Generation) queue so that new parity data can be generated later during idle time. When an idle period appears, the FRA pops the first entry of the PG queue, generates the new parity data and writes it into the flash memory, which completes the delayed write. By using the delayed write scheme, the FRA can reduce the parity write frequency for multiple write requests to the same area. In other words, it reduces the parity update count, with an effect similar to using NVRAM as in Building Reliable NAND Flash Memory Storage Systems [12]. Furthermore, the overall erase count can be reduced, which helps extend the lifetime of the flash storage.
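The following sketch captures this delayed write path under stated assumptions: the foreground write stores user data and invalidates the old parity, a parity-generation job is queued, and an idle handler later reads the LPG and writes the new parity. The PG-queue sizing, the per-LPG pending flag used to coalesce duplicate jobs, and the driver hooks are all illustrative.

```c
/* Sketch of the delayed write scheme: foreground writes defer parity work to a
 * PG queue that is served when the device is idle. Duplicate jobs for the same
 * LPG collapse into one. Sizes and helpers are assumptions. */
#include <stdint.h>
#include <string.h>
#include <stdbool.h>

#define NUM_CHANNELS 4
#define PAGE_SIZE    4096
#define PG_QUEUE_LEN 64
#define MAX_LPGS     1024                 /* this sketch supports up to 1024 LPGs */

static unsigned pg_queue[PG_QUEUE_LEN];
static bool     pg_pending[MAX_LPGS];     /* one flag per LPG: coalesces duplicate jobs */
static unsigned pg_head, pg_tail;

/* Hypothetical driver hooks, reduced to stubs. */
static void flash_write_data(unsigned lpn, const void *buf)   { (void)lpn; (void)buf; }
static void flash_read_lp(unsigned lpg, unsigned ch, void *b) { (void)lpg; (void)ch; memset(b, 0, PAGE_SIZE); }
static void flash_write_parity(unsigned lpg, const void *buf) { (void)lpg; (void)buf; }
static void invalidate_parity(unsigned lpg)                   { (void)lpg; }

void fra_write(unsigned lpn, const void *buf)
{
    unsigned lpg = lpn / NUM_CHANNELS;

    flash_write_data(lpn, buf);            /* 1. user data goes to a log block      */
    invalidate_parity(lpg);                /* 2. old parity no longer matches       */
    if (!pg_pending[lpg]) {                /* 3. defer parity generation            */
        pg_pending[lpg] = true;
        pg_queue[pg_tail++ % PG_QUEUE_LEN] = lpg;
    }
}

/* Called when the device becomes idle. */
void fra_on_idle(void)
{
    if (pg_head == pg_tail)
        return;
    unsigned lpg = pg_queue[pg_head++ % PG_QUEUE_LEN];
    pg_pending[lpg] = false;

    uint8_t parity[PAGE_SIZE] = {0}, page[PAGE_SIZE];
    for (unsigned ch = 0; ch < NUM_CHANNELS; ch++) {   /* read the LPG's pages */
        flash_read_lp(lpg, ch, page);
        for (unsigned i = 0; i < PAGE_SIZE; i++)
            parity[i] ^= page[i];
    }
    flash_write_parity(lpg, parity);       /* 4. new parity written during idle time */
}
```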

Table 2. Parity update time

Device Type   Page Size   Parity Update Time
SLC           4KB         420 us
SLC           2KB         320 us
MLC           4KB         450 us

Assume that the PG queue holds a number of parity update entries caused by write requests. The FRA takes a parity update entry from the PG queue, generates new parity data by reading the user data of the LPG from the flash memory, and writes the parity into the flash memory. Thus, a parity update consists of reading the data pages from the 4 channels and programming the new parity data into the flash memory. Table 2 shows the time required to generate parity data for each type of flash memory.
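A rough model of this cost, written with symbols of our own choosing rather than the paper's, is:

```latex
T_{\mathrm{parity}} \approx \max_{1 \le i \le N}\left(T_{\mathrm{read},i} + T_{\mathrm{xfer},i}\right) + T_{\mathrm{xfer,P}} + T_{\mathrm{prog}}
```

where N is the number of channels, the data-page reads and transfers proceed in parallel across channels, T_xfer,P is the transfer of the computed parity page to its target chip, and T_prog is the program time from Table 1 (200 us for SLC, 800 us for MLC).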

(a) Parity delete example

(b) Delayed write example

From Figure 3, we can see that about 40 parity pages can be written to SLC flash with 4KB pages, because the arithmetic mean of the idle time per sector is 2,100 us; this is sufficient for writing parity to the flash memory during idle time.

Figure 7. Delete and write operation example

Figure 6. LP to LPG mapping


5.3 Dual Page Mapping Table in Log Block
The delayed write scheme has two problems. First, the PG queue can become full if there is no idle period between successive write operations. Second, the FRA cannot recover the original user data after a read error if the old parity data has already been deleted. To overcome these problems, we introduce a dual page mapping table in the log block. Each mapping entry of a log block records a logical page number (LPN), a physical page number (PPN), and another physical page number (the original PPN), which is the address of the old data of the LPN for which parity has already been generated. With the dual page mapping table, the PG queue is no longer required, because the log block mapping table itself identifies the newly written logical addresses; during an idle period, new parity data can be generated by looking up the log block mapping table. Furthermore, the FRA limits the number of newly updated pages in a log block, which means the maximum queue depth equals the maximum number of valid pages in the log blocks.

During idle time, if the PPN is not equal to the original PPN when the FRA looks up a log block mapping table, the generation of new parity for that page has not yet completed, so a new parity must be generated. If the FTL needs to perform a merge operation to prepare free pages for a write request, it chooses a victim block among the log blocks and copies the valid pages from the log block and the data block to a free block. During this merge operation, the FRA generates valid new parity data, and the original PPN entries in the log block are updated to match the PPNs.

Assume a 32GB SSD using SLC flash memory with a 4KB page size. If ten log blocks are allocated per chip, a single page mapping table needs 10KB of RAM, and the FRA needs 20KB of RAM for the dual page mapping table of the log blocks. The dual page mapping scheme doubles the mapping table size compared to a single page mapping table, but it solves both problems of the delayed write scheme.
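A sketch of the dual page mapping entry and of the idle-time check it enables is given below. The field widths, per-log-block layout and helper function are assumptions; the key point is that an entry whose PPN differs from its original PPN marks an LPG whose parity is still stale.

```c
/* Sketch of the dual page mapping table kept per log block: each entry records
 * the current PPN of a logical page and the "original" PPN whose contents the
 * stored parity still reflects. When the two differ, the parity is stale and is
 * regenerated during idle time (or during a merge). */
#include <stdint.h>

#define INVALID_PPN 0xFFFFu
#define LOG_BLOCK_PAGES 64

struct dual_map_entry {
    uint16_t lpn;
    uint16_t ppn;           /* where the newest copy of the LP lives    */
    uint16_t original_ppn;  /* copy the current parity was computed on  */
};

struct log_block {
    struct dual_map_entry map[LOG_BLOCK_PAGES];
    unsigned used;
};

static void regenerate_parity_for(uint16_t lpn) { (void)lpn; }   /* stub: reads the LPG, writes new parity */

/* Idle-time scan: any entry whose ppn differs from original_ppn still needs a
 * parity update; afterwards the two fields are brought back in sync. */
void idle_parity_pass(struct log_block *lb)
{
    for (unsigned i = 0; i < lb->used; i++) {
        struct dual_map_entry *e = &lb->map[i];
        if (e->ppn != INVALID_PPN && e->ppn != e->original_ppn) {
            regenerate_parity_for(e->lpn);
            e->original_ppn = e->ppn;
        }
    }
}
```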

(Figure 8 shows the dual page mapping tables of two log blocks, LBN(2)/PBN(10) and LBN(5)/PBN(9), after the request sequence: (1) write LPN 21, (2) idle, (3) write LPN 21, (4) write LPN 8, (5) write LPN 10.)

Figure 8 describes the dual page mapping table operation in log blocks. A write request arrived for LPN(21) and its parity data was generated during idle time; then further write requests to LPN(21), LPN(8) and LPN(10) arrived. For the first write request, the data of LPN(21) was written into PPN(36), while the original PPN of LPN(21) was 33. Parity data for LPN(21) was generated and written into the flash memory during the idle period, so both the PPN and the original PPN became 36. After that, another write request to LPN(21) arrived, so its mapping entry changed to PPN(37). One log block must be chosen among the log blocks for generating new parity information during an idle period. In general, LRU (Least Recently Used) is an efficient mechanism in cache algorithms [13] and in log-block management [14], because the oldest page is unlikely to be updated again. Therefore, FRA chooses the least recently used log block, generates new parity for the pages whose parity is stale, writes the new parity into the flash memory and updates the dual page mapping table.
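A minimal sketch of this victim selection is shown below, assuming recency is tracked with a per-log-block timestamp; how FRA actually records recency is not specified in the text.

```c
/* Sketch of choosing the least recently used log block that still needs parity
 * generation. The timestamp field is an assumption about how recency is kept. */
#include <stdint.h>

#define NUM_LOG_BLOCKS 10

struct log_block_state {
    uint32_t last_write_time;   /* updated on every write into this log block */
    int      has_stale_parity;  /* any entry with ppn != original_ppn          */
};

/* Returns the index of the LRU log block needing parity generation, or -1. */
int pick_lru_log_block(const struct log_block_state lb[NUM_LOG_BLOCKS])
{
    int victim = -1;
    uint32_t oldest = UINT32_MAX;

    for (int i = 0; i < NUM_LOG_BLOCKS; i++) {
        if (lb[i].has_stale_parity && lb[i].last_write_time < oldest) {
            oldest = lb[i].last_write_time;
            victim = i;
        }
    }
    return victim;
}
```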

5.4 Power-off Recovery
Flash memory is used in various mobile devices, so an FTL must consider sudden power failures. The power-off recovery algorithm has two steps. First, the FTL finds the latest meta-data and loads it into RAM. Then, it recovers the mapping table of the log blocks by reading the spare areas, which contain the logical address information.

The FRA uses a dual page mapping table in the log block, but its overall design relative to a log-block-based FTL is otherwise unchanged, so it can use a log-block-based power-off recovery scheme such as PORCE (Power Off Recovery sChEme) [8]. It must, however, distinguish the PPN from the original PPN while reconstructing the log block mapping table.
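The following sketch outlines how the log block mapping could be rebuilt after power loss, assuming (as in a PORCE-style scheme) that each programmed page records its LPN in the spare area and that pages are programmed in order. Marking the original PPN as unknown to force a conservative parity rebuild is our assumption, not something stated in [8].

```c
/* Sketch of rebuilding one log block's mapping after power loss by scanning
 * the spare areas in program order. Helpers and layout are assumptions. */
#include <stdint.h>
#include <stdbool.h>

#define LOG_BLOCK_PAGES 64
#define INVALID_PPN 0xFFFFu

struct spare { uint16_t lpn; bool programmed; };
struct dual_map_entry { uint16_t lpn, ppn, original_ppn; };

static bool read_spare(uint16_t ppn, struct spare *s) { (void)ppn; s->programmed = false; return true; } /* stub */

unsigned rebuild_log_block_map(uint16_t first_ppn, struct dual_map_entry *map)
{
    unsigned n = 0;
    for (uint16_t off = 0; off < LOG_BLOCK_PAGES; off++) {
        struct spare s;
        if (!read_spare(first_ppn + off, &s) || !s.programmed)
            break;                          /* pages are programmed in order */
        map[n].lpn = s.lpn;
        map[n].ppn = first_ppn + off;
        map[n].original_ppn = INVALID_PPN;  /* unknown after crash: parity will be regenerated */
        n++;
    }
    return n;                               /* number of recovered mapping entries */
}
```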

6. PERFORMANCE EVALUATION

6.1 Experiment Setup
We implemented a trace-driven simulator to evaluate performance. The simulator emulates a number of chips over multiple channels. We assumed large-block SLC flash with 2KB pages and 64 pages per block, and BAST was used as the default FTL.

Table 3. Workload descriptions

Benchmark                Description
Sandra File System       Sequential, random and update pattern
ATTO                     Sequential write and read for each size from 0.5KB to 1MB
MP3 File Copy            Copy MP3 files
Movie File Copy          Copy a movie file
MS Office Installation   MS Office 2003 installation
Internet Exploration     Small-size read and write pattern

For the simulation, we gathered various workloads using DiskMon in an MS Windows XP environment: Sandra File System, ATTO, MP3 File Copy, Movie File Copy, MS Office Installation and Internet Exploration. We also used a separate secondary storage device to eliminate the effects of the operating system's internal operations. Table 3 describes the workloads used in the simulation.

Figure 8. Dual mapping table example


To evaluate write performance when idle time is insufficient, we used the Sandra File System, ATTO, MP3 File Copy and Movie File Copy workloads. To evaluate write performance under random-write-dominant workloads, we used the Sandra File System and Internet Exploration workloads, which have a highly random write pattern. Figure 9 illustrates the disk access characteristics of the Sandra File System and Internet Exploration workloads.

The FRA implements the dual page mapping table in the log block on top of BAST, and allocates 20% of the total space to the redundancy area to guarantee reliability for all user data. To compare the write performance of FRA with other algorithms, we implemented a RAID-5 algorithm following REM; this RAID-5 implementation has a buffering scheme to reduce parity write overhead in sequential workloads and is also based on BAST. SBSS was not considered because its write performance is lower than that of RAID-5 and each piece of user data is associated with two parity copies.

In this experiment, we used the same FTL algorithm throughout to observe how write performance is affected by the different redundancy algorithms, and all experiments were performed with the same number of flash memories and channels. The FTL code execution time and the FTL meta-data management overhead were not considered because we focused on the redundancy policy.

(a) Internet Exploration Workload

(b) Sandra File System Workload

6.2 Experimental Results
6.2.1 Performance Comparison
Figure 10 (a) shows the total execution time of BAST, RAID-5 and FRA for the workloads in Table 3. The y-axis is the total operation time for each workload, in seconds.

BAST performs best because it provides no redundancy mechanism against read failures. However, FRA outperforms RAID-5 for all workloads, including ATTO, MP3 File Copy, Sandra File System and Movie File Copy, which have little idle time. The performance difference between FRA and RAID-5 is related to the number of parity updates and the amount of idle time. In RAID-5, every write operation to the log block incurs its own parity write. FRA, in contrast, can write its parity data into the flash memory during idle time or during a merge operation, which means that if a workload exhibits locality, the parity write frequency can be reduced even further.

Figure 10 (b) shows the performance of BAST, FRA and RAID-5, normalized to BAST. On average, RAID-5 reached about 56% of BAST's performance and FRA about 69%. FRA achieved 74% of BAST's performance for the Internet Exploration and MS Office Installation workloads, which have enough idle time to update parity data.

Figure 11 (a) shows the proportion of invalid user-data pages in log blocks for each workload. Most workloads produce invalid pages in log blocks by overwriting the same locations. In particular, the ATTO workload has 10% overwrites because the ATTO benchmark writes user data in small transfer sizes (512 bytes to 1MB) with mis-aligned sector numbers. Under such characteristics, FRA reduces the number of parity write requests and the frequent parity overwrites for each workload through the delayed write scheme, as shown in Figure 11 (b) and Figure 11 (c). In other words, FRA reduces the number of parity writes for mis-aligned or locality-heavy workloads. Therefore, ATTO, which has a high proportion of overwritten parity data, shows a greater performance improvement than MP3 File Copy, Sandra File System and Movie File Copy. Furthermore, FRA shows a higher performance improvement for MS Office Installation and Internet Exploration, where idle time is plentiful; this indicates that idle time is an important factor for improving performance.

Table 4. Normalized performance improvement

Benchmark                RAID-5   FRA
Internet Exploration     1        1.55
MS Office Installation   1        1.41
ATTO                     1        1.17
MP3 File Copy            1        1.11
Sandra File System       1        1.14
Movie File Copy          1        1.11

Table 4 shows the performance improvement of FRA relative to RAID-5. FRA improves performance by about 20% on average across all workloads. The improvement is smaller for MP3 File Copy and Movie File Copy, which are sequential workloads; in other words, the delayed write effect of FRA is reduced under sequential workloads.

Figure 9. Workload characteristics



(a) Total execution time (b) Execution time comparison with BAST


6.2.2 Erase Count Comparison
Figure 12 compares the block erase counts in the flash memory for each workload. FRA delays parity write operations until its log block undergoes a merge operation, so the erase count can be reduced by lowering the parity write frequency. In particular, for the ATTO workload, which has many overwrites of user and parity data as shown in Figure 11 (a) and (b), FRA not only increases write performance but also reduces the erase count, as shown in Figure 12.

6.2.3 Merge Comparison
Figure 13 (a) shows the number of merge operations and their types. There are three types of merge operations: swap merge, partial merge and full merge. Since the swap merge has the smallest cost, a storage system with a larger proportion of swap merges achieves better write performance. FRA decreases the number of full merges by reducing the parity write frequency and by using the delayed write approach.

Figure 10. Performance result

Figure 11. Results related to parity generation: (a) proportion of invalid pages in the log block for user data; (b) proportion of invalid pages in the log block for parity data; (c) total number of parity writes; (d) total number of reads required to generate new parity

Figure 12. Erase count comparison


We can see that the overall merge count decreases while the proportion of swap merges increases. However, log block utilization is degraded because of the reduced parity write frequency; Figure 13 (b) shows the log block utilization.


(a) Number of merge counts and merge type


(b) Log block utilization

7. ACKNOWLEDGEMENTS
This work was developed within the scope of the Human Resource Development Project for IT SoC Architecture, and this research has been supported by the Nano IP/SoC Promotion Group of the Seoul R&BD Program in 2009.

8. CONCLUSIONS
The RAID-5 technique is a well-known approach for increasing the reliability of traditional disks. However, as discussed above, the existing RAID-5 scheme for flash-based storage suffers from increased response time caused by parity updates. To overcome this problem, we have proposed a flash-aware redundancy array technique called FRA. In FRA, parity updates are postponed so that they are not included in the critical path of read/write operations; instead, they are scheduled for when the device becomes idle. Simulation results have shown that the proposed scheme improves RAID-based flash storage by up to 19% compared with other approaches.

9. REFERENCES
[1] Soraya Zertal, "A Reliability Enhancing Mechanism for a Large Flash Embedded Satellite Storage System", Third International Conference on Systems (ICONS 2008), 2008.

[2] Yu-Bin Chang, Li-Pin Chang, “A Self-Balancing Striping Scheme for NAND-Flash Storage Systems”, Proceedings of the 2008 ACM symposium on Applied computing, 2008.

[3] D. A. Patterson, G. Gibson, and R. H. Katz, "A Case for Redundant Arrays of Inexpensive Disks (RAID)", ACM SIGMOD Record, 1988.

[4] Disk Monitor for Windows v2.01, http://technet.microsoft.com/en-us/sysinternals/bb896646.aspx.

[5] S.W. Lee, D.J. Park, T.S. Park, D.H. Lee, S.W. Park, and H.J. Song, “A Log Buffer-Based Flash Translation Layer Using Fully-Associative Sector Translation”, ACM Transactions on Embedded Computing Systems, vol. 6, no. 3, 2007.

[6] C.I. Park, W.M. Cheon, Y.S. Lee, M.S. Jung, W.H. Cho, and H.B. Yoon, “A Re-configurable FTL(Flash Translation Layer) Architecture for NAND Flash based Applications”, 18th IEEE/IFIP International Workshop on Rapid System Prototyping, IEEE, 2007.

[7] J.H. Kim, S. Jung, and Y.H. Song, “Cost and Performance Analysis of NAND Mapping Algorithms in Shared-bus Multi-chip Configuration”, IWSSPS’08, 2008

[8] Tae-Sun Chung, Myungho Lee, Yeonseung Ryu, and Kangsun Lee, “PORCE: An efficient power off recovery scheme for flash memory”, Journal of Systems Architecture, vol. 54, no. 10, 2008.

[9] Jesung Kim, JongMin Kim, Sam H. Noh, Sang Lyul Min, and Yookun Cho, “A Space-Efficient Flash Translation Layer for CompactFlash Systems”, IEEE Transactions on Consumer Electronics, vol. 48, no. 2, 2002.

[10] Jeong-Uk Kang, Heeseung Jo, Jin-Soo Kim, and Joonwon Lee, “A Superblock-based Flash Translation Layer for NAND Flash Memory”, EMSOFT’06, ACM, 2006.

[11] Dawoon Jung, Yoon-Hee Chae, Heeseung Jo, Jin-Soo Kim, and Joonwon Lee, “A Group-Based Wear-Leveling Algorithm for Large-Capacity Flash Memory Storage Systems”, CASES’07, ACM, 2007.

[12] Kevin M. Greenan, Ethan L. Miller, and Darrell D.E. Long, “Building Reliable NAND Flash Memory Storage Systems”, International Workshop on Large-Scale NVRAM Technology, 2008.

[13] H Jo, J Kang, S Park, J Kim, and J Lee, “FAB: Flash-Aware Buffer Management Policy for Portable Media Players”, IEEE Transactions on Consumer Electronics, 2006.

[14] C Park, P Talawar, D Won, MJ Jung, JB Im, S Kim, and Y Choi, “A High Performance Controller for NAND Flash-based Solid State Disk (NSSD)”, IEEE Non-Volatile Semiconductor Memory Workshop(NVSMW), 2006.

[15] S. Jung, J. Kim, and Y. Song, “Hierarchical Architecture of Flash-based Storage Systems for High Performance and Durability”, Proceedings of the 46th annual Design Automation Conference (DAC), 2009

Figure 13. Merge and log block utilization
