
Algorithms and Data Structures for Flash Memories

ERAN GAL AND SIVAN TOLEDO

Tel-Aviv University

Flash memory is a type of electrically-erasable programmable read-only memory (EEPROM). Because flash memories are nonvolatile and relatively dense, they are now used to store files and other persistent objects in handheld computers, mobile phones, digital cameras, portable music players, and many other computer systems in which magnetic disks are inappropriate. Flash, like earlier EEPROM devices, suffers from two limitations. First, bits can only be cleared by erasing a large block of memory. Second, each block can only sustain a limited number of erasures, after which it can no longer reliably store data. Due to these limitations, sophisticated data structures and algorithms are required to effectively use flash memories. These algorithms and data structures support efficient not-in-place updates of data, reduce the number of erasures, and level the wear of the blocks in the device. This survey presents these algorithms and data structures, many of which have only been described in patents until now.

Categories and Subject Descriptors: D.4.2 [Operating Systems]: Storage Management—Allocation/deallocation strategies; garbage collection; secondary storage; D.4.3 [Operating Systems]: File Systems Management—Access methods; file organization; E.1 [Data Structures]—Arrays; lists, stacks, and queues; trees; E.2 [Data Storage Representations]—Linked representations; E.5 [Files]—Organization/structure

General Terms: Algorithms, Performance, Reliability

Additional Key Words and Phrases: Flash memory, EEPROM memory, wear leveling

1. INTRODUCTION

Flash memory is a type of electrically-erasable programmable read-only memory (EEPROM). Flash memory is nonvolatile (retains its content without power), so it is used to store files and other persistent objects in workstations and servers (for the BIOS), handheld computers and mobile phones, digital cameras, and portable music players.

The read/write/erase behavior of flash memory is radically different from that of other programmable memories, such as volatile RAM and magnetic disks. Perhaps more importantly, memory cells in a flash device (as well as in other types of EEPROM memory) can be written to only a limited number of times, between 10,000 and 1,000,000, after which they wear out and become unreliable.

Authors' address: School of Computer Science, Tel-Aviv University, Tel-Aviv 69978, Israel; email: [email protected].
Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or direct commercial advantage and that copies show this notice on the first page or initial screen of a display along with the full citation. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, to republish, to post on servers, to redistribute to lists, or to use any component of this work in other works requires prior specific permission and/or a fee. Permissions may be requested from Publications Dept., ACM, Inc., 1515 Broadway, New York, NY 10036 USA, fax: +1 (212) 869-0481, or [email protected].
© 2005 ACM 0360-0300/05/0600-0138 $5.00

In fact, flash memories come in two flavors, NOR and NAND, that are also quite different from each other. In both types, write operations can only clear bits (change their value from 1 to 0). The only way to set bits (change their value from 0 to 1) is to erase an entire region of memory. These regions have a fixed size in a given device, typically ranging from several kilobytes to hundreds of kilobytes, and are called erase units. NOR flash, the older type, is a random-access device that is directly addressable by the processor. Each bit in a NOR flash can be individually cleared once per erase cycle of the erase unit containing it. NOR devices suffer from high erase times. NAND flash, the newer type, enjoys much faster erase times, but it is not directly addressable (it is accessed by issuing commands to a controller), access is by page (a fraction of an erase unit, typically 512 bytes), not by bit or byte, and each page can be modified only a small number of times in each erase cycle. That is, after a few writes to a page, subsequent writes cannot reliably clear additional bits in the page; the entire erase unit must be erased before further modifications of the page are possible [Woodhouse 2001].

Because of these peculiarities, storage-management techniques that were designed for other types of memory devices, such as magnetic disks, are not always appropriate for flash. To address these issues, flash-specific storage techniques have been developed with the widespread introduction of flash memories in the early 1990s. Some of these techniques were invented specifically for flash memories, but many have been adapted from techniques that were originally invented for other storage devices. This article surveys the data structures and algorithms that have been developed for the management of flash storage.

The article covers techniques that have been described in the open literature, including patents. We only cover US patents, mostly because we assume that US patents are a superset of those of other countries. To cope with the large number of flash-related patents, we used the following strategy to find relevant patents. We examined all the patents whose titles contain the words flash, file (or filing), and system, as well as all the patents whose titles contain the words wear and leveling. In addition, we also examined other patents assigned to two companies that specialize in flash-management products, M-Systems and SanDisk (formerly SunDisk). Finally, we also examined patents that were cited or mentioned in other relevant materials, both patents and Web sites.

We believe that this methodology led us to most of the relevant materials. But this does not imply, of course, that the article covers all the techniques that have been invented. The techniques that are used in some flash-management products have remained trade secrets; some are alluded to in corporate literature but are not fully described. Obviously, this article does not cover such techniques.

The rest of this survey consists of three sections, each of which describes the mapping of one category of abstract data structures onto flash memories. The next section discusses flash data structures that store an array of fixed- or variable-length blocks. Such data structures typically emulate magnetic disks, where each block in the array represents one disk sector. Even these simple data structures pose many flash-specific challenges, such as wear leveling and efficient reclamation. These challenges and techniques to address them are discussed in detail in Section 2, and in less detail in later sections. The section that follows, Section 3, describes flash-specific file systems. A file system is a data structure that represents a collection of mutable random-access files in a hierarchical name space. Section 4 describes three additional classes of flash data structures: application-specific data structures (mainly search trees), data structures for storing machine code, and a mechanism to use flash as a main-memory replacement. Section 5 summarizes the survey.

2. BLOCK-MAPPING TECHNIQUES

One approach to using flash memory is to treat it as a block device that allows fixed-size data blocks to be read and written, much like disk sectors. This allows standard file systems designed for magnetic disks, such as FAT, to utilize flash devices.


In this setup, the file system code calls a device driver, requesting block read or write operations. The device driver stores and retrieves blocks from the flash device. (Some removable flash devices, like CompactFlash, even incorporate a complete ATA disk interface so they can actually be used through the standard disk driver.)

However, mapping the blocks onto flash addresses in a simple linear fashion presents two problems. First, some data blocks may be written to much more often than others. This presents no problem for magnetic disks, so conventional file systems do not attempt to avoid such situations. But when the file system is mapped onto a flash device, frequently used erase units wear out quickly, slowing down access times and eventually burning out. This problem can be addressed by using a more sophisticated block-to-flash mapping scheme and by moving blocks around. Techniques that implement such strategies are called wear-leveling techniques.

The second problem that the identity mapping poses is the inability to write data blocks smaller than a flash erase unit. Suppose that the data blocks that the file system uses are 4KB each and that flash erase units are 128KB each. If 4KB blocks are mapped to flash addresses using the identity mapping, writing a 4KB block requires copying a 128KB flash erase unit to RAM, overwriting the appropriate 4KB region, erasing the flash erase unit, and rewriting it from RAM. Furthermore, if power is lost before the entire flash erase unit is rewritten to the device, 128KB of data are lost; on a magnetic disk, only the 4KB data block would be lost. It turns out that the wear-leveling techniques automatically address this issue as well.
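To make the cost of the identity mapping concrete, the following minimal C sketch simulates the read-modify-erase-write cycle on a single RAM-backed erase unit. The device model and all names are illustrative; they are not taken from any real driver API.

    #include <stdint.h>
    #include <string.h>

    #define ERASE_UNIT (128 * 1024)
    #define BLOCK      (4 * 1024)

    static uint8_t flash[ERASE_UNIT];   /* one simulated erase unit */

    static void flash_erase_unit(void) { memset(flash, 0xFF, ERASE_UNIT); }

    /* Programming can only clear bits (1 -> 0), hence the AND. */
    static void flash_program(uint32_t addr, const uint8_t *buf, uint32_t len)
    {
        for (uint32_t i = 0; i < len; i++)
            flash[addr + i] &= buf[i];
    }

    /* Rewriting one 4KB block costs a 128KB read, a unit erase, and a
       128KB reprogram; a power loss after the erase loses all 128KB. */
    void naive_write_block(uint32_t offset, const uint8_t *data)
    {
        static uint8_t ram_copy[ERASE_UNIT];
        memcpy(ram_copy, flash, ERASE_UNIT);      /* copy unit to RAM  */
        memcpy(ram_copy + offset, data, BLOCK);   /* overwrite 4KB     */
        flash_erase_unit();                       /* all bits back to 1 */
        flash_program(0, ram_copy, ERASE_UNIT);   /* rewrite from RAM  */
    }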

2.1. The Block-Mapping Idea

The basic idea behind all the wear-leveling techniques is to map the block number presented by the host, called a virtual block number, to a physical flash address, called a sector. (Some authors and vendors use a slightly different terminology.) When a virtual block needs to be rewritten, the new data does not overwrite the sector where the block is currently stored. Instead, the new data is written to another sector and the virtual-block-to-sector map is updated.

Typically, sectors have a fixed size and occupy a fraction of an erase unit. In NAND devices, sectors usually occupy one flash page. But in NOR devices, it is also possible to use variable-length sectors.

This mapping serves several purposes:

—First, writing frequently modified blocks to a different sector in every modification evens out the wear of different erase units.

—Second, the mapping allows writing a single block to flash without erasing and rewriting an entire erase unit [Assar et al. 1995a, 1995b, 1996].

—Third, the mapping allows block writes to be implemented atomically so that, if power is lost during a write operation, the block reverts to its prewrite state when flash is used again.

Atomicity is achieved using the following technique. Each sector is associated with a small header, which may be adjacent to the sector or elsewhere in the erase unit. When a block is to be written, the software searches for a free and erased sector. In this state, all the bits in both the sector and its header are 1. Then a free/used bit in the header of the sector is cleared to mark that the sector is no longer free. Then the virtual block number is written to its header, and the new data is written to the chosen sector. Next, the prevalid/valid bit in the header is cleared to mark that the sector is ready for reading. Finally, the valid/obsolete bit in the header of the old sector is cleared to mark that it no longer contains the most recent copy of the virtual block.
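The following C sketch traces the five steps just described. The header layout and field names are our own illustration; real systems pack and order these bits differently. Plain assignments stand in for flash program operations, each of which only clears bits.

    #include <stdint.h>
    #include <string.h>

    struct sector_header {
        uint32_t virtual_block;   /* all 1s (0xFFFFFFFF) while the sector is free */
        uint8_t  free_used;       /* 1 = free, 0 = used                           */
        uint8_t  prevalid_valid;  /* 1 = still being written, 0 = readable        */
        uint8_t  valid_obsolete;  /* 1 = current copy, 0 = obsolete               */
    };

    /* Each step only clears bits, so it is a legal flash program
       operation and leaves a recoverable state if power is lost. */
    void write_block(struct sector_header *hdr, uint8_t *sector_data,
                     uint32_t vblock, const uint8_t *data, uint32_t len,
                     struct sector_header *old_hdr)
    {
        hdr->free_used = 0;              /* 1. claim the free sector        */
        hdr->virtual_block = vblock;     /* 2. record the block number      */
        memcpy(sector_data, data, len);  /* 3. program the new data         */
        hdr->prevalid_valid = 0;         /* 4. new copy is now readable     */
        if (old_hdr != NULL)
            old_hdr->valid_obsolete = 0; /* 5. retire the old copy          */
    }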

In some cases, it is possible to optimize this procedure, for example, by combining the free/used bit with the virtual block number: if the virtual block number is all 1s, then the sector is still free; otherwise, it is in use.

If power is lost during a write operation, the flash may be in two possible states with respect to the modified block. If power was lost before the new sector was marked valid, its contents are ignored when the flash is next used, and its valid/obsolete bit can be set to mark it ready for erasure. If power was lost after the new sector was marked valid but before the old one was marked obsolete, both copies are legitimate (indicating two possible serializations of the failure and write events), and the system can choose either one and mark the other obsolete. If choosing the most recent version is important, a two-bit version number can indicate which one is more recent. Since there can be at most two valid versions with consecutive version numbers modulo 4, 1 is newer than 0, 2 than 1, 3 than 2, and 0 is newer than 3 [Aleph One 2002].
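For concreteness, the two-bit version comparison can be written as follows (a sketch; only the modulo-4 rule from the text is assumed):

    /* Returns nonzero if version a is newer than b. With at most two
       valid copies carrying consecutive versions modulo 4, "a newer
       than b" is exactly a == b + 1 (mod 4). */
    unsigned is_newer(unsigned a, unsigned b)
    {
        return ((a - b) & 3u) == 1u;
    }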

2.2. Data Structures for Mapping

How does the system find the sector that contains a given block? Fundamentally, there are two kinds of data structures that represent such mappings. Direct maps are essentially arrays that store in the ith location the index of the sector that currently contains block i. Inverse maps store in the ith location the identity of the block stored in the ith sector. In other words, direct maps allow efficient mapping of blocks to sectors, and inverse maps allow efficient mapping of sectors to blocks. In some cases, direct maps are not simple arrays but more complex data structures. But a direct map, whether implemented as an array or not, always allows efficient mapping of blocks to sectors. Inverse maps are almost always arrays, although they may not be contiguous in physical memory.

Inverse maps are stored on the flash device itself. When a block is written to a sector, the identity of the block is also written. The block's identity is always written in the same erase unit as the block itself so that they are erased together. The block's identity may be stored in a header immediately preceding the data, or it may be written to some other area in the unit, often a sector of block numbers. The main use of the inverse map is to reconstruct a direct map during device initialization (when the flash device is inserted into a system or when the system boots).

Direct maps are stored at least partially in RAM, which is volatile. The reason that direct maps are stored in RAM is that, by definition, they support fast lookups. This implies that when a block is rewritten and moved from one sector to another, a fixed lookup location must be updated. Flash does not support this kind of in-place modification.

To summarize, the inverse map on the flash device itself ensures that sectors can always be associated with the blocks that they contain. The direct map, which is stored in RAM, allows the system to quickly find the sector that contains a given block. These block-mapping data structures are illustrated in Figure 1.
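The following C sketch shows how a direct map might be rebuilt in RAM from the on-flash inverse map during initialization. The sizes, the header layout, and the all-1s free marker are illustrative assumptions.

    #include <stdint.h>

    #define NUM_SECTORS 1024
    #define NUM_BLOCKS   896
    #define FREE_MARK  0xFFFFFFFFu  /* virtual_block value of a free sector */
    #define UNMAPPED   0xFFFFFFFFu

    struct sector_header {
        uint32_t virtual_block;   /* the on-flash inverse map entry */
        uint8_t  valid;           /* 1 = holds the current copy     */
    };

    static struct sector_header headers[NUM_SECTORS]; /* as read from flash */
    static uint32_t direct_map[NUM_BLOCKS];           /* rebuilt in RAM     */

    void rebuild_direct_map(void)
    {
        for (uint32_t b = 0; b < NUM_BLOCKS; b++)
            direct_map[b] = UNMAPPED;
        for (uint32_t s = 0; s < NUM_SECTORS; s++) {
            uint32_t b = headers[s].virtual_block;
            if (b != FREE_MARK && b < NUM_BLOCKS && headers[s].valid)
                direct_map[b] = s;  /* sector s holds virtual block b */
        }
    }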

A direct map is not absolutely necessary. The system can search sequentially through the inverse map to find a valid sector containing a requested block. This is slow but efficient in terms of RAM usage. By only allowing each block to be stored in a small number of sectors, searching can be performed much faster (perhaps through the use of hardware comparators, as patented in Assar et al. [1995a, 1995b]). This technique, which is similar to set-associative caches, reduces the amount of RAM or hardware comparators required for the searches but reduces the flexibility of the mapping. The reduced flexibility can lead to more frequent erases and to accelerated wear.

The Flash Translation Layer (FTL) is a technique to store some of the direct map within the flash device itself while trying to reduce the cost of updating the map on the flash device. This technique was originally patented by Ban [1995] and was later adopted as a PCMCIA standard [Intel Corporation 1998b].1

1Intel writes that “M-Systems does grant a royalty-free, non-exclusive license for the design and development of FTL-compatible drivers, file systems, and utilities using the data formats with PCMCIA PC Cards as described in the FTL Specifications” [Intel Corporation 1998b].


Fig. 1. Block mapping in a flash device. The gray array on the left is the virtual-block-to-physical-sector direct map, residing in RAM. Each physical sector contains a header and data. The header contains the index of the virtual block stored in the sector, an erase counter, valid and obsolete bits, and perhaps an error-correction code and a version number. The virtual block numbers in the headers of populated sectors constitute the inverse map, from which a direct map can be constructed. A version number allows the system to determine which of two valid sectors containing the same virtual block is more recent.

The FTL uses a combination of mechanisms, illustrated in Figure 2, to perform the block-to-sector mapping.

(1) Block numbers are first mapped to logical block numbers, which consist of a logical erase unit number (specified by the most significant bits of the logical block number) and a sector index within the erase unit. This mechanism allows the valid sectors of an erase unit to be copied to a newly-erased erase unit without changing the block-to-logical-block map, since each sector is copied to the same location in the new erase unit.

(2) This block-to-logical-block map can be stored partially in RAM and partially within the flash itself. The mapping of the first blocks, which in FAT-formatted devices change frequently, can be stored in RAM, while the rest is stored in the flash device. The transition point can be configured when the flash is formatted and is stored in a header at the beginning of the flash device.

(3) The flash portion of the block-to-logical-block map is not stored contiguously in the flash but is scattered throughout the device along with an inverse map. A direct map in RAM, which is reconstructed during initialization, points to the sectors of the map. To look up the logical number of a block, the system first finds the sector containing the mapping in the top-level RAM map and then retrieves the mapping itself. In short, the map is stored in a two-level hierarchical structure.

(4) When a block is rewritten and moved to a new sector, its mapping must be changed. To allow this to happen at least some of the time without rewriting the relevant mapping block, a backup map is used. If the relevant entry in the backup map, which is also stored on flash, is available (all 1s), the original entry in the main map is cleared, and the new location is written to the backup map. Otherwise, the mapping sector must be rewritten. During lookup, if the mapping entry is all 0s, the system looks up the mapping in the backup map. This mechanism favors sequential modification of blocks since, in such cases, multiple mappings are moved from the main map to the backup map before a new mapping sector must be written. The backup map can be sparse; not every mapping sector must have a backup sector. (A code sketch of this mechanism appears after this list.)

(5) Finally, logical erase units are mapped to physical erase units using a small direct map in RAM. Because it is small (one entry per erase unit, not per sector), the RAM overhead is small. It is constructed during initialization from an inverse map; each physical erase unit stores its logical number. This direct map is updated whenever an erase unit is reclaimed.

Fig. 2. An example of the FTL mapping structures. The system in the figure maps two logical erase units onto three physical units. Each erase unit contains four sectors. Sectors that contain page maps contain four mappings each. Pointers represented as gray rectangles are stored in RAM. The virtual-to-logical page maps, shown on the top right, are not contiguous, so a map in RAM maps their sectors. Normally, the first sectors in the primary map reside in RAM as well. The replacement map contains only one sector; not every primary map sector must have a replacement. The illustration of the entire device on the bottom also shows the page-map sectors. In the mapping of virtual block 5, the replacement map entry is used, because the primary entry is not free (not all 1s).
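As promised in item (4), here is a C sketch of the main/replacement map lookup and update. The entry encodings follow the text (all 1s means free; all 0s in the main map means consult the replacement map); widths and names are illustrative, and a real implementation would reserve the two special patterns so that no sector is actually numbered 0x0000 or 0xFFFF.

    #include <stdint.h>

    #define NUM_BLOCKS  256
    #define FREE_ENTRY  0xFFFFu   /* erased, never written             */
    #define MOVED_ENTRY 0x0000u   /* cleared: consult replacement map  */

    static uint16_t main_map[NUM_BLOCKS];         /* stored on flash */
    static uint16_t replacement_map[NUM_BLOCKS];  /* stored on flash */

    /* Look up the logical sector of a virtual block. */
    uint16_t ftl_lookup(uint32_t vblock)
    {
        uint16_t e = main_map[vblock];
        return (e == MOVED_ENTRY) ? replacement_map[vblock] : e;
    }

    /* Remap a block without rewriting the mapping sector, if the
       replacement entry is still free. Returns 0 on success, -1 when
       the mapping sector must be rewritten instead. */
    int ftl_remap(uint32_t vblock, uint16_t new_sector)
    {
        if (replacement_map[vblock] != FREE_ENTRY)
            return -1;                        /* backup already consumed  */
        replacement_map[vblock] = new_sector; /* program the backup entry */
        main_map[vblock] = MOVED_ENTRY;       /* clear main entry to 0s   */
        return 0;
    }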

Ban later patented a translation layer for NAND devices, called NFTL [Ban 1999]. It is simpler than the FTL and comes in two types: one for devices with spare storage for each sector (sometimes called out-of-band data), and one for devices without such storage. The type for devices without spare data is less efficient but simpler, so we'll start with it. The virtual block number is broken up into a logical erase-unit number and a sector number within the erase unit. A data structure in RAM maps each logical erase unit to a chain of physical units. To locate a block, say block 5 in logical unit 7, the system searches the appropriate chain. The units in the chain are examined sequentially. As soon as one of them contains a valid sector in position 5, it is returned. The 5th sectors in earlier units in the chain are obsolete, and the 5th sectors in later units are still free. To update block 5, the new data is written to sector 5 in the first unit in the chain where it is still free. If sector 5 is used in all the units in the chain, the system adds another unit to the chain. To reclaim space, the system folds all the valid sectors in the chain to the last unit in the chain. That unit becomes the first unit in the new chain, and all the other units in the old chain are erased. The length of chains is one or longer.

If spare data is available in every sector, the chains are always of length one or two. The first unit in the chain is the primary unit, and blocks are stored in it in their nominal sectors (sector 5 in our example). When a valid sector in the primary unit is updated, the new data are written to an arbitrary sector in the second unit in the chain, the replacement unit. The replacement unit can contain many copies of the same virtual block, but only one of them is valid. To reclaim space, or when the replacement unit becomes full, the valid sectors in the chain are copied to a new unit, and the two units in the old chain are erased.
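A C sketch of the NFTL variant without spare areas might look as follows. The chain representation and the explicit per-sector state markers are our own illustration; the patent does not prescribe these structures.

    #include <stddef.h>

    #define SECTORS_PER_UNIT 32

    enum sector_state { SECTOR_FREE, SECTOR_VALID, SECTOR_OBSOLETE };

    struct phys_unit {
        enum sector_state state[SECTORS_PER_UNIT];
        struct phys_unit *next;    /* next physical unit in the chain */
    };

    /* Read: walk the chain until a unit holds a valid copy of
       position k; earlier units hold only obsolete copies. */
    struct phys_unit *nftl_find(struct phys_unit *chain, int k)
    {
        for (struct phys_unit *u = chain; u != NULL; u = u->next)
            if (u->state[k] == SECTOR_VALID)
                return u;
        return NULL;               /* block was never written */
    }

    /* Update: write to position k in the first unit where it is still
       free and retire the previous copy; the caller must append a
       fresh unit to the chain if NULL is returned. */
    struct phys_unit *nftl_update_slot(struct phys_unit *chain, int k)
    {
        struct phys_unit *prev = nftl_find(chain, k);
        for (struct phys_unit *u = chain; u != NULL; u = u->next)
            if (u->state[k] == SECTOR_FREE) {
                u->state[k] = SECTOR_VALID;  /* program new data here */
                if (prev != NULL)
                    prev->state[k] = SECTOR_OBSOLETE;
                return u;
            }
        return NULL;
    }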


Fig. 3. Block mapping with variable-length sectors. Fixed-size headers are added to one end of the erase unit, and variable-length sectors are added to the other side. The figure shows a unit with four valid sectors, two obsolete ones, and some free space (including a free header, the third one).

It is also possible to map variable-length logical blocks onto flash memory, as shown in Figure 3. Wells et al. [1998] patented such a technique, and a similar technique was used by the Microsoft Flash File System [Torelli 1995]. The motivation for the Wells et al. [1998] patent was compressed storage of standard disk sectors. For example, if the last 200 bytes of a 512-byte sector are all zeros, the zeros can be represented implicitly rather than explicitly, thereby saving storage. The main idea in such techniques is to fill an erase unit with variable-length data blocks from one end of the unit, say the low end, while filling fixed-size headers from the other end. Each header contains a pointer to the variable-length data block that it represents. The fixed-size headers allow constant-time access to data (i.e., to the first word of the data). The fixed-size headers offer another potential advantage to systems that reference data blocks by logical erase-unit number and a block index within the unit. The Microsoft Flash File System [Torelli 1995] is one such system. In this system, a unit can be reclaimed and defragmented without any need to update references to the blocks that were relocated. We describe this mechanism in more detail in the following.

Smith and Garvin [1999] patented a similar system but at a coarser granularity. Their system divides each erase unit into a header, an allocation map, and several fixed-size sectors. The system allocates storage in blocks comprised of one or more contiguous sectors. Such blocks are usually called extents. Each allocated extent is described by an entry in the allocation map. The entry specifies the location and length of the extent and the virtual block number of the first sector in the extent (the other sectors in the extent store consecutive virtual blocks). When a virtual block within an extent is updated, the extent is broken into two or three new extents, one of which contains the now-obsolete block. The original entry for the extent in the allocation map is marked as invalid, and one or two new entries are added at the end of the map.

2.3. Erase-Unit Reclamation

Over time, the flash device accumulates obsolete sectors, and the number of free sectors decreases. To make space for new blocks and for updated blocks, obsolete sectors must be reclaimed. Since the only way to reclaim a sector is to erase an entire unit, reclamation (sometimes called garbage collection) operates on entire erase units.

Reclamation can take place either in the background (when the CPU is idle) or on demand when the amount of free space drops below a predetermined threshold. The system reclaims space in several stages.

—One or more erase units are selected for reclamation.

—The valid sectors of these units are copied to newly allocated free space elsewhere in the device. Copying the valid data prior to erasing the reclaimed units ensures persistence even if a fault occurs during reclamation.

—The data structures that map logical blocks to sectors are updated, if necessary, to reflect the relocation.

—Finally, the reclaimed erase units are erased and their sectors are added to the free-sector reserve. This stage might also include writing an erase-unit header on each newly-erased unit.
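The stages above can be summarized in a C sketch like the following. The victim-selection policy shown is a simple greedy one, standing in for the policies discussed in Sections 2.3.1 and 2.3.2, and the copy and map-update steps are elided.

    #include <stdint.h>
    #include <string.h>

    #define UNITS   8
    #define SECTORS 16
    enum { FREE = 0xFF, VALID = 0x01, OBSOLETE = 0x00 };

    static uint8_t state[UNITS][SECTORS];  /* per-sector state, per unit */
    static int erasures[UNITS];

    /* Stage 1: pick the unit with the most obsolete sectors (greedy). */
    static int select_victim(void)
    {
        int best = 0, best_obsolete = -1;
        for (int u = 0; u < UNITS; u++) {
            int n = 0;
            for (int s = 0; s < SECTORS; s++)
                n += (state[u][s] == OBSOLETE);
            if (n > best_obsolete) { best_obsolete = n; best = u; }
        }
        return best;
    }

    void reclaim_one_unit(void)
    {
        int u = select_victim();
        for (int s = 0; s < SECTORS; s++)
            if (state[u][s] == VALID) {
                /* Stages 2-3: copy the sector to free space elsewhere
                   and update the block map (elided in this sketch). */
            }
        /* Stage 4: erase only after all valid data is safely copied. */
        memset(state[u], FREE, SECTORS);
        erasures[u]++;
    }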

Immediately following the reclamation of a unit, the number of free sectors in the device is at least one unit's worth. Therefore, the maximum amount of useful data that a device can contain is smaller by one erase unit than its physical size. In many cases, the system keeps at least one or two free and erased units at all times to allow all the valid data in a unit that is being reclaimed to be relocated to a single erase unit. This scheme is absolutely necessary when data is stored on the unit in variable-size blocks since, in this case, fragmentation may prevent reclamation altogether.

The reclamation mechanism is governed by two policies: which units to reclaim, and where to relocate valid sectors to. These policies are related to another policy which governs sector allocation during block updates. These three interrelated policies affect the system in three ways. They affect the effectiveness of the reclamation process, which is measured by the number of obsolete sectors in reclaimed units; they affect wear leveling; and they affect the mapping data structures (some relocations require simple map updates and some require complex updates).

The goals of wear leveling and efficient reclamation are often contradictory. Suppose that an erase unit contains only so-called static data, data that is never or rarely updated. Efficiency considerations suggest that this unit should not be reclaimed, since reclaiming it would not free up any storage—its data will simply be copied to another erase unit, which will immediately become full. But although reclaiming the unit is inefficient, this reclamation can reduce the wear on other units and thereby level the wear. Our supposition that the data is static implies that it will not change soon (or ever). Therefore, by copying the contents of the unit to another unit which has undergone many erasures, we can reduce future wear on the other unit.

Erase-unit reclamation involves a fourth policy, but it is irrelevant to our discussion. The fourth policy triggers reclamation events. Clearly, reclamation must take place when the system needs to update a block but no free sector is available. This is called on-demand reclamation. But some systems can also reclaim erase units in the background when the flash device, or the system as a whole, is idle. The ability to reclaim in the background is largely determined by the overall structure of the system. Some systems can identify and utilize idle periods, while others cannot. The characteristics of the flash device are also important. When erases are slow, avoiding on-demand erases improves the response time of block updates; when erases are fast, the impact of an on-demand erase is less severe. Also, when erases can be interrupted and resumed later, a background erase operation has little or no impact on the response time of a block update, but when erases are uninterruptible, a background erase operation can delay a block update. However, all of these issues are largely orthogonal to the algorithms and data structures that are used in mapping and reclamation, so we do not discuss them any further.

2.3.1. Wear-Centric Reclamation Policies and Wear-Measuring Techniques. In this section, we describe reclamation techniques that are primarily designed to reduce wear. These techniques are used in systems that separate efficient reclamation and wear leveling. Typically, the system uses an efficiency-centric reclamation policy most of the time but switches to a wear-centric technique that ignores efficiency once in a while. Sometimes uneven wear triggers the switch, and sometimes it happens periodically whether or not wear is even.


Many techniques attempt to level the wear by measuring it; therefore, this section will also describe wear-measurement mechanisms.

Lofgren et al. [2000, 2003] patented a simple wear-leveling technique that is triggered by erase-unit reclamation. In this technique, the header of each erase unit includes an erase counter. Also, the system sets aside one erase unit as a spare. When one of the most worn-out units is reclaimed, its counter is compared to that of the least worn-out unit. If the difference exceeds a threshold, say 15,000, a wear-leveling relocation is used instead of a simple reclamation. The contents of the least worn-out unit (or one of the least worn-out units) are relocated to the spare unit, and the contents of the most worn-out unit, the one whose reclamation triggered the event, are copied to the just-erased least worn-out unit. The most worn-out unit that was reclaimed becomes the new spare. This technique attempts to identify worn-out sectors and static blocks and to relocate static blocks to worn-out sectors. In the next wear-leveling event, the new spare will be used to store the blocks from the least worn-out unit. Presumably, the data in the least worn-out unit are relatively static; storing them on a worn-out unit reduces the chances that the worn-out unit will soon undergo further rapid wear. Also, by removing the static data from the least worn-out unit, we usually bring its future erase rate close to the average of the other units.

Clearly, any technique that relies on erase counters in the erase-unit headers is susceptible to loss of an erase counter if power is lost after a unit is erased but before the new counter is written to the header. The risk is higher when erase operations are slow.

One way to address this risk is to store the erase counter of unit i on another unit j ≠ i. One such technique was patented by Marshall and Manning [1998], as part of a flash file system. Their system stores an erase counter in the header of each unit. Prior to the reclamation of unit i, the counter is copied to a specially-marked area in an arbitrary unit j ≠ i. Should power be lost during the reclamation, the erase count of i will be recovered from unit j after power is restored. Assar et al. [1996] patented a simpler but less efficient solution. They proposed a bounded unary 8-bit erase counter, which is stored on another erase unit. The counter of unit i, which is stored on another unit j ≠ i, starts at all ones, and a bit is cleared every time i is erased. Because the counter can be updated, it does not need to be erased every time unit i is erased. On the other hand, the number of updates to the erase counter is bounded. When it reaches the maximum (8 in their patent), further erases of unit i will cause loss of accuracy in the counter. In their system, such counters were coupled with periodic global restarts of the wear-leveling mechanism, in which all the counters are rolled back to the erased state.
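A C sketch of such a bounded unary counter, assuming the low-order bit is cleared first (the patent does not mandate a bit order):

    #include <stdint.h>

    /* Count the erasures recorded in a unary counter byte: the number
       of cleared low-order bits, at most 8. */
    int unary_count(uint8_t c)
    {
        int n = 0;
        while (n < 8 && ((c >> n) & 1) == 0)
            n++;
        return n;
    }

    /* Record one more erasure by clearing the next bit (0xFF -> 0xFE
       -> 0xFC -> ... -> 0x00). Each step only clears bits, so the
       counter can be updated in place without erasing its unit. Past
       8 erasures the counter saturates and accuracy is lost, as the
       text notes. */
    uint8_t unary_bump(uint8_t c)
    {
        return (c == 0x00) ? c : (uint8_t)(c << 1);
    }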

Jou and Jeppesen III [1996] patented a technique that maintains an upper bound on the wear (number of erasures). The bound is always correct but not necessarily tight. Their system uses an erase-before-write strategy: the valid contents of an erase unit chosen for reclamation are copied to another unit, but the unit is not erased immediately. Instead, it is marked in the flash device as an erasure candidate and added to a priority queue of candidates in RAM. The queue is sorted by wear; the unit with the least wear in the queue (actually the least wear bound) is erased when the system needs a free unit. If power is lost during an erasure, the new bound for the erased unit is set to the minimum wear among the other erase candidates plus 1. Since the pre-erasure bound on the unit was less than or equal to that of all the other ones in the queue, the new bound may be too high, but it is correct. (The patent does not increase the bound by 1 over that of the minimum in the queue; this yields a wear estimate that may be just as useful in practice but not a bound.) This technique levels the wear to some extent by delaying reuse of worn-out units. The evenness of the wear in this technique depends on the number of surplus units: if the queue of candidates is short, reuse of worn-out units cannot be delayed much.


Another solution to the same problem, patented by Han [2000], relies on wear estimation using erase latencies. On some flash devices, the erase latency increases with wear. Han's technique compares erase times in order to rank erase units by wear. This avoids altogether the need to store erase counters. The wear rankings can be used in a wear-leveling relocation or allocation policy. Without explicit erase counters, the system can only estimate the wear of a unit after it is erased in a session. Therefore, this technique is probably not applicable in its pure form (without counters) when sessions are short and only a few units are erased.

Another approach to wear leveling is to rely on randomness rather than on estimates of actual wear. Woodhouse [2001] proposed a simple randomized wear-leveling technique. Every 100th reclamation, the system selects for reclamation a unit containing only valid data, at random. This has the effect of moving static data from units with little wear to units with more wear. If this technique is used in a system that otherwise always favors reclamation efficiency over wear leveling, extreme wear imbalance can still occur. If units are selected for reclamation based solely upon the amount of invalid data they contain, a little-worn unit with a small amount of invalid data may never be reclaimed.

At about the same time, Ban [2004] patented a more robust technique. His technique, like that of Lofgren et al. [2000, 2003], relies on a spare unit. Every certain number of reclamations, an erase unit is selected at random, its contents relocated to the spare unit, and it is marked as the new spare. The trigger for this wear-leveling event can be deterministic, say the 1000th erase since the last event, or random. Using a random trigger ensures that wear leveling is triggered even if every session is short and encompasses only a few erase operations. The aim of this technique is to have every unit undergo a fairly large number of random swaps, say 100, during the lifetime of the flash device. The large number of swaps is supposed to diminish the likelihood that an erase unit stores static data for much of the device's lifetime. In addition, the total overhead of wear leveling in this technique is predictable and evenly spread in time.
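Both randomized schemes reduce to a trigger of roughly the following form (a C sketch; the period, the unit count, and the helper bodies are placeholders, not from either source):

    #include <stdlib.h>

    #define UNITS 64
    #define WEAR_PERIOD 100   /* e.g., every 100th reclamation */

    /* Placeholder hooks; a real driver would copy sectors and erase. */
    static void relocate_to_spare(int unit)       { (void)unit; }
    static int  select_victim_efficiently(void)   { return 0; }
    static void reclaim(int unit)                 { (void)unit; }

    void reclaim_with_random_leveling(void)
    {
        static int counter = 0;
        if (++counter == WEAR_PERIOD) {
            counter = 0;
            /* A random victim, even if it contains only valid data:
               this is what moves static data off little-worn units. */
            relocate_to_spare(rand() % UNITS);
        } else {
            reclaim(select_victim_efficiently());
        }
    }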

It appears that the idea behind this technique was used in earlier software. M-Systems developed and marketed software called TrueFFS, a block-mapping device driver that implements the FTL. The M-Systems literature [Dan and Williams 1997] states that TrueFFS uses a wear-leveling technique that combines randomness with erase counts. Their literature claimed that the use of randomness eliminates the need to protect the exact erase counts stored in each erase unit. The details of the wear-leveling algorithm of TrueFFS are not described in the open literature or in patents.

2.3.2. Combining Wear-Leveling with Efficient Reclamation. We now describe policies that attempt to address both wear leveling and efficient reclamation.

Kawaguchi et al. [1995] describe one such policy. They implemented a block device driver for a flash device. The driver was intended for use with a log-structured Unix file system, which we describe in the following. This file system operates much like a block-mapping mechanism: it relocates blocks on update and reclaims erase units to free space. Kawaguchi et al. [1995] describe two reclamation policies. The first policy selects the next unit for reclamation based on a weighted benefit/cost ratio. The benefit of a unit reclamation is the amount of invalid space in the unit, and the cost is incurred by the need to read the valid data and write it back elsewhere. This is weighted by the age of the block, the time since the last invalidation. A large weight is assumed to indicate that the remaining valid data in the unit is relatively static. This implies that the valid occupancy of the unit is unlikely to decrease soon, so there is no point in waiting until the benefit increases. In other words, the method tries to avoid reclaiming units whose benefit/cost ratio is likely to increase soon. This policy is not explicitly designed to level wear, but it does result in some wear leveling, since it ensures that blocks with some invalid data are eventually reclaimed, even if the amount of invalid data in them is small.
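A plausible reading of this benefit/cost ratio, in C. The exact formula below is an assumption modeled on log-structured cleaning (reclaiming a unit with valid fraction u frees 1 − u of a unit at a copying cost of about 2u) and is not quoted from Kawaguchi et al. [1995]:

    #include <math.h>

    /* u:   fraction of valid data in the unit (0 <= u < 1)
       age: time since the last invalidation in the unit   */
    double benefit_cost(double u, double age)
    {
        if (u <= 0.0)
            return HUGE_VAL;   /* fully invalid: reclaiming is free */
        return age * (1.0 - u) / (2.0 * u);   /* weighted benefit / cost */
    }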

Kawaguchi et al. [1995] found that this policy still leads to inefficient reclamation in some cases and proposed a second policy, which tends to improve efficiency at the expense of worse wear leveling. They proposed to write data to two units. One unit is used for sectors relocated during the reclamation of so-called cold units, ones that were not modified recently. The other unit is used for sectors relocated from hot units and for updating blocks not undergoing reclamation. This policy tends to cluster static data in some units and dynamic data in others. This, in turn, tends to increase the efficiency of reclamation, because units with dynamic data tend to be almost empty upon reclamation, and static units do not need to be reclaimed at all because they often contain no invalid data. Clearly, units with static data can remain unreclaimed for long periods, which leads to uneven wear unless a separate explicit wear-leveling mechanism is used.

A more elaborate policy is described by Wu and Zwaenepoel [1994]. Their system partitions the erase units into fixed-size partitions. Lower-numbered partitions are supposed to store hot virtual blocks, while higher-numbered partitions are supposed to store cold virtual blocks. Each partition has one active erase unit that is used to store updated blocks. When a virtual block is updated, the new data is written to a sector in the active unit in the same partition that the block currently resides in. When the active unit in a partition fills up, the system finds the unit with the fewest valid sectors in the same partition and reclaims it. During reclamation, valid data in the unit that is being reclaimed is copied to the beginning of an empty unit. Thus, blocks that are not updated frequently tend to slide toward the beginning of erase units. Blocks that were updated after the unit they are on became active, and are hence hot, tend to reside toward the end of the unit. This allows Wu and Zwaenepoel's [1994] system to classify blocks as hot or cold.

The system tries to achieve a roughly constant reclamation frequency in all the partitions. Therefore, hot partitions should contain fewer blocks than cold partitions, because hot data tends to be invalidated more quickly. At every reclamation, the system compares the reclamation frequency of the current partition to the average frequency. If the partition's reclamation frequency is higher than average, some of its blocks are moved to neighboring partitions. Blocks from the beginning of the unit being reclaimed are moved to a colder partition, and blocks from the end are moved to a hotter partition.

Wu and Zwaenepoel [1994] employ a simple form of wear leveling. When the erase count of the most worn-out unit is higher by 100 than that of the least worn-out unit, the data on them is swapped. This probably works well when the least worn-out unit contains static or nearly static data. If this is the case, then swapping the extreme units allows the most worn-out unit some time to rest.

We shall discuss Wu and Zwaenepoel's [1994] system again later in the survey. The system was intended as a main-memory replacement, not as a disk replacement, so it has some additional interesting features that we describe in Section 4.

Wells [1994] patented a reclamation policy that relies on a weighted combination of efficiency and wear leveling. The system selects the next unit to be reclaimed based on a score. The score of a unit j is defined to be

score(j) = 0.8 × obsolete(j) + 0.2 × (max_i{erasures(i)} − erasures(j)),

where obsolete(j) is the amount of invalid data in unit j, and erasures(j) is the number of erasures that unit j has undergone. The unit with the maximal score is reclaimed next. Since obsolete(j) and erasures(j) are measured in different units, the precise weights of the two terms, 0.8 and 0.2 in the system described in the patent, should depend on the space-measurement metric. The principle, however, is to weigh efficiency heavily and wear differences lightly. After enough units have been reclaimed using this policy, the system checks whether more aggressive wear leveling is necessary. If the difference between the most and the least worn-out units is 500 or higher, the system selects additional units for reclamation, using a wear-heavy policy that maximizes a different score,

score′(j) = 0.2 × obsolete(j) + 0.8 × (max_i{erasures(i)} − erasures(j)).

The combined policy is designed to first ensure a good supply of free sectors, and once that goal is achieved, to ensure that the wear imbalance is not too extreme. In the wear-leveling phase, efficiency is not important, since that phase only starts after there are enough free sectors. Also, there is no point in activating the wear-leveling phase when wear is even, since this would just increase the overall wear.
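A C sketch of the two-phase selection. The weights and the 500-erasure threshold come from the text; the data structures are illustrative, and the sketch simplifies by checking the imbalance on every selection rather than only after the free-space goal is met:

    #define UNITS 64

    static int obsolete[UNITS];  /* amount of invalid data per unit */
    static int erasures[UNITS];  /* erase count per unit            */

    static int max_erasures(void)
    {
        int m = erasures[0];
        for (int i = 1; i < UNITS; i++)
            if (erasures[i] > m) m = erasures[i];
        return m;
    }

    static int min_erasures(void)
    {
        int m = erasures[0];
        for (int i = 1; i < UNITS; i++)
            if (erasures[i] < m) m = erasures[i];
        return m;
    }

    static int best_unit(double w_obs, double w_wear)
    {
        int best = 0, m = max_erasures();
        double best_score = -1.0e300;
        for (int j = 0; j < UNITS; j++) {
            double s = w_obs * obsolete[j] + w_wear * (m - erasures[j]);
            if (s > best_score) { best_score = s; best = j; }
        }
        return best;
    }

    int select_for_reclamation(void)
    {
        if (max_erasures() - min_erasures() >= 500)
            return best_unit(0.2, 0.8);  /* wear-heavy score'       */
        return best_unit(0.8, 0.2);      /* efficiency-heavy score  */
    }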

Chiang et al. [1999] proposed a block clustering scheme that they call CAT. Chiang and Chang [1999] later proposed an improved technique, called DAC. At the heart of these methods lies a concept called temperature. The temperature of a block is an estimate of the likelihood that it will be updated soon. The system maintains a temperature estimate for every block using two simple rules: (1) when a block is updated, its temperature rises, and (2) blocks cool down over time. The CAT policy classifies blocks into three categories: read-only (absolutely static), cold, and hot. The category that a block belongs to does not necessarily match its temperature because, in CAT, blocks are only reclassified during reclamation. Each erase unit stores blocks from one category. When an erase unit is reclaimed, its valid sectors are reclassified. CAT selects units for reclamation by maximizing the score function

cat-score(j) = (obsolete(j) × age(j)) / (valid(j) × erasures(j)),

where valid(j) is the amount of valid data in unit j, and age(j) is a discrete monotone function of the time since the last erasure of unit j. This score function leads the system to prefer units with a lot of obsolete data and little valid data, units that have not been reclaimed recently, and units that have not been erased many times. This combines efficiency with some degree of wear leveling. The CAT policy also includes an additional wear-leveling mechanism: when an erase unit nears the end of its life, it is exchanged with the least worn-out unit.

The DAC policy is more sophisticated. First, blocks may be classified into more than three categories. More importantly, blocks are reclassified on every update, so a cold block that heats up will be relocated to a hotter erase unit, even if the units that store it never get reclaimed while it is valid.

TrueFFS selects units for reclamation based on the amount of invalid data, the number of erasures, and identification of static areas [Dan and Williams 1997]. The details are not described in the open literature. TrueFFS also tries to cluster related blocks so that multiple blocks in the same unit are likely to become invalid together. This is done by trying to map contiguous logical blocks onto a single unit, under the assumption that higher-level software (i.e., a file system) attempts to cluster related data at the logical-block level.

Kim and Lee [2002] proposed an adaptive scheme for combining wear leveling and efficiency in selecting units for reclamation. Their scheme is designed for a file system, not for a block-mapping mechanism, but since it is applicable to block-mapping mechanisms, we describe it here. Their scheme, which is called CICL, selects an erase unit (actually a group of erase units called a segment) for reclamation by minimizing the following score:

cicl-score(j) = (1 − λ) × (valid(j) / (valid(j) + obsolete(j))) + λ × (erasures(j) / (1 + max_i{erasures(i)})).


In this expression, λ is not a constant but a monotonic function that depends on the discrepancy between the most and the least worn-out units:

0 < λ(max_i{erasures(i)} − min_i{erasures(i)}) < 1.

When λ is small, units are selected for reclamation mostly on efficiency considerations. When λ is high, unit selection is driven mostly by wear, with preference given to young units. Letting λ grow when wear becomes uneven and shrink when it evens out leads to more emphasis on wear when wear is imbalanced, and more emphasis on efficiency when wear is roughly even. Kim and Lee [2002] augment this policy with two other techniques, for clustering data and for unit allocation, but these are file-system specific, so we describe them later in the survey.

2.3.3. Real-Time Reclamation. Reclaiming an erase unit to make space for new or updated data can take a considerable amount of time. A slow and unpredictable reclamation can cause a real-time system to miss a deadline. Chang et al. [2004] proposed a guaranteed reclamation policy for real-time systems with periodic tasks. They assume that tasks are periodic and that each task provides the system with its period and with per-period upper bounds on CPU time and the number of sector updates.

Chang et al.'s [2004] system relies on two principles. First, it uses a greedy reclamation policy that reclaims the unit with the least amount of valid data, and it only reclaims units when the number of free sectors falls below a threshold. (This policy is used only under deadline pressure; for non-real-time reclamation, a different policy is used.) This policy ensures that every reclamation generates a certain number of free sectors. Consider, for example, a 64MB flash device that is used to store 32MB worth of virtual blocks, in which reclamation is triggered when the amount of free space falls below 16MB. When reclamation is triggered, the device must contain more than 16MB of obsolete data. Therefore, on average, a quarter of the sectors on a unit are obsolete. There must be at least one unit with that much obsolete data, so by reclaiming it, the system is guaranteed to generate a quarter of a unit's worth of free sectors.

To ensure that the system meets the deadlines of all the tasks that it admits, every task is associated with a reclamation task. These tasks reclaim at most one unit every time one of them is invoked. The period of a reclamation task is chosen so that it is invoked once every time the task it is associated with writes α blocks, where α is the number of sectors that are guaranteed to be freed in every reclamation. This ensures that a task uses up free sectors at the rate that its associated reclamation task frees sectors. (If a reclamation task does not reclaim a unit because there are many free pages, the system is guaranteed to have enough sectors for the associated task.) The system admits a task only if it can meet the deadlines of both it and its reclamation task.
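The worked example above can be turned into a small computation of α (a C sketch; the sector and unit sizes are assumed, not from the paper):

    #include <stdio.h>

    int main(void)
    {
        double device_mb    = 64.0;  /* total flash                  */
        double data_mb      = 32.0;  /* live virtual blocks          */
        double threshold_mb = 16.0;  /* reclaim when free < this     */

        /* At the trigger point, obsolete = device - valid - free
           > 64 - 32 - 16 = 16MB, i.e., a quarter of the device, so
           the greedy victim is at least a quarter obsolete. */
        double obsolete_frac = (device_mb - data_mb - threshold_mb)
                               / device_mb;

        int sectors_per_unit = 256;  /* e.g., 128KB units, 512B sectors */
        int alpha = (int)(obsolete_frac * sectors_per_unit);

        printf("each reclamation frees at least alpha = %d sectors\n",
               alpha);
        /* A reclamation task is then scheduled to run once per alpha
           sector writes by the task it serves. */
        return 0;
    }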

Chang et al. [2004] also proposed a slightly more efficient reclamation policy that avoids unnecessary reclamation, and an auxiliary wear-leveling policy that the system applies when it is not under deadline pressure.

3. FLASH-SPECIFIC FILE SYSTEMS

Block-mapping techniques present the flash device to higher-level software, in particular file systems, as a rewritable block device. The block device driver (or a hardware equivalent) performs the block-to-sector mapping, erase-unit reclamation, wear leveling, and perhaps even recovery of the block device to a designated state following a crash. Another approach is to expose the hardware characteristics of the flash device to the file-system layer and let it manage erase units and wear. The argument is that an end-to-end solution can be more efficient than stacking a file system designed for the characteristics of magnetic hard disks on top of a device driver designed to emulate disks using flash.


The block-mapping approach does have several advantages over flash-specific file systems. First, the block-mapping approach allows developers to utilize existing file system implementations, thereby reducing development and testing costs. Second, removable flash devices, such as CompactFlash and SmartMedia, must use storage formats that are understood by all the platforms in which users need to use them. These platforms currently include Windows, Mac, and Linux/Unix operating systems, as well as hand-held devices such as digital cameras, music players, PDAs, and phones. The only rewritable file system format supported by all major operating systems is FAT, so removable devices typically use it. If the removable device exposes the flash memory device directly, then the host platforms must also understand the headers that describe the content of erase units and sectors. The standardization of the FTL allows these headers to be processed on multiple platforms; this is also the approach taken by the SmartMedia standard. Another approach, invented by SunDisk (now SanDisk) and typified by the CompactFlash format, is to hide the flash memory device behind a disk interface implemented in hardware as part of the removable device, so the host is not required to understand the flash header format. Typically, using a CompactFlash on a personal computer does not even require a special device driver, because the existing disk device driver can access the device.

Even on a removable device that must use a given file system structure, say a FAT file system, combining the file system with the block-mapping mechanism yields benefits. Consider the deletion of a file, for example. If the system uses a standard file system implementation on top of a block-mapping device driver, the deleted file’s data sectors will be marked as free in a bitmap or file-allocation table, but they will normally not be overwritten or otherwise be marked as obsolete. Indeed, when the file system is stored on a magnetic disk, there is no reason to move the read-write head to these now free sectors. But if the file system is stored on a flash device, the deleted file’s blocks, which are not marked as obsolete, are copied from one unit to another whenever the unit that stores them is reclaimed. By combining the block device driver with the file system software, the system can mark the deleted file’s sectors as obsolete, which prevents them from ever being copied. This appears to be the approach of FLite, a FAT file system/device driver combination software from M-Systems [Dan and Williams 1997]. Similar combination software is also sold by HCC Embedded. Their file systems are described in the following, but it is not clear whether they exploit such opportunities or not.

But when the flash memory device is not removable, a flash-specific file system is a reasonable solution, and over the years, several such file systems have been developed. In the mid-1990s, Microsoft tried to standardize flash-specific file systems for removable memory devices, in particular, a file system called FFS2, but this effort did not succeed. Douglis et al. [1994] report very poor write performance for this system, which is probably the main reason it failed. We comment further on this file system later.

Most of the flash-specific file systems use the same overall principle, that of a log-structured file system. This principle was invented for file systems that utilize magnetic disks, where it is not currently used much. It turns out, however, to be appropriate for flash file systems. Therefore, we next describe how log-structured file systems work and then describe individual flash file systems.

3.1. Background: Log-Structured File Systems

Conventional file systems modify information in place. When a block of data, whether containing part of a file or file system metadata, must be modified, the new contents overwrite old data in exactly the same physical locations on the disk. Overwriting makes finding data easy: the same data structure that allowed the system to find the old data will now find the new data. For example, if the modified block contained part of a file, the pointer in the inode or in the indirect block that pointed to the old data is still valid. In other words, in conventional file systems, data items (including both metadata and user data) are mutable, so pointers remain valid after modification.

In-place modification creates two kinds of problems, however. The first kind involves movement of the read-write head and disk rotation. Modifying existing data forces the system to write in specific physical disk locations, so these writes may suffer long seek and rotational delays. The effect of these mechanical delays is especially severe when writing cannot be delayed, for example, when a file is closed. The second problem that in-place modification creates is the risk of metadata inconsistency after a crash. Even if writes are atomic, high-level operations consisting of several writes, such as creating or deleting a file, are not atomic, so the file system data structure may be in an inconsistent state following a crash. There are several techniques to address this issue. In lazy-write file systems, a fixing-up process corrects the inconsistencies following a crash, prior to any use of the file system. The fixing-up can delay the recovery of the computer system from the crash. In careful-write systems, metadata changes are performed immediately according to a strict ordering so that, after a crash, the data structure is always in a consistent state except perhaps for loss of disk blocks, which are recovered later by a garbage collector. This approach reduces the recovery time but amplifies the effect of mechanical disk delays. Soft updates is a generalization of careful writing, where the file system is allowed to cache modified metadata blocks; they are later modified according to a partial ordering that guarantees that the only inconsistencies following a crash are loss of disk blocks. Experiments with this approach have shown that it delivers performance similar to that of journaling file systems [Rosenblum and Ousterhout 1992; Seltzer et al. 1993], which we describe next, but currently only Sun uses this approach. In journaling file systems, each metadata modification is written to a journal (or a log) before the block is modified in place. Following a crash, the fixing-up process examines the tail of the log and either completes or rolls back each metadata operation that may have been interrupted. Except for Sun Solaris systems, which use soft updates, almost all the commercial and free operating systems today use journaling file systems such as NTFS on Windows, JFS on AIX and Linux, XFS on IRIX and Linux, ext3 and reiserfs on Linux, and the new journaling HFS+ on MacOS.

Log-structured file systems take the journaling approach to the limit: the journal is the file system. The disk is organized as a log, a continuous medium that is, in principle, infinite. In reality, the log consists of fixed-sized segments of contiguous areas of the disk, chained together into a linked list. Data and metadata are always written to the end of the log; they never overwrite old data. This raises two problems: how do you find the newly-written data, and how do you reclaim disk space occupied by obsolete data? To allow newly-written data to be found, the pointers to the data must change, so the blocks containing these pointers must also be written to the log, and so on. To limit the possible snowball effect of these rewrites, files are identified by a logical inode index. The logical inodes are mapped to physical locations in the log by a table that is kept in memory and is periodically flushed to the log. Following a crash, the table can be reconstructed by finding the most recent copy of the table in the log and then scanning the log from that position to the end in order to find files whose position has changed after that copy was written. To reclaim disk space, a cleaner process finds segments that contain a large amount of obsolete data, copies the valid data to the end of the log (as if they were changed), and then erases the segment and adds it to the end of the log.
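
The following Python sketch is a minimal model of this mechanism (our own illustration, not the code of any particular system): writes append to the log, an in-memory table maps each inode to its newest copy, and a cleaner re-appends live data before a segment is reused:

    # Minimal model of a log-structured store.
    class LogStructuredStore:
        def __init__(self):
            self.log = []        # append-only list of (inode, data) records
            self.inode_map = {}  # inode -> log position of the newest copy

        def write(self, inode, data):
            self.log.append((inode, data))        # never overwrite in place
            self.inode_map[inode] = len(self.log) - 1

        def read(self, inode):
            return self.log[self.inode_map[inode]][1]

        def clean(self, start, end):
            # Re-append still-valid records (as if they were modified)
            # so that log positions [start, end) can be erased and reused.
            for pos in range(start, end):
                inode, data = self.log[pos]
                if self.inode_map.get(inode) == pos:  # newest copy: live
                    self.write(inode, data)
                self.log[pos] = None                  # now reclaimable

    store = LogStructuredStore()
    store.write("a", "v1"); store.write("a", "v2"); store.write("b", "v1")
    store.clean(0, 2)       # position 0 is obsolete, position 1 is live
    print(store.read("a"))  # -> "v2", located through the updated map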

The rationale behind log-structured file systems is that they enjoy good write performance since writing is only done to the end of the log, so it does not incur seek and rotational delays (except when switching to a new segment, which occurs only rarely). On the other hand, read performance may be slow because the blocks of a file might be scattered around the disk if each block was modified at a different time, and the cleaner process might further slow down the system if it happens to clean segments that still contain a significant amount of valid data. As a consequence, log-structured file systems are not in widespread use on disks; not many have been developed and they are rarely deployed.

On flash devices, however, log-structured file systems make perfect sense. On flash, old data cannot be overwritten. The modified copy must be written elsewhere. Furthermore, log-structuring the file system on flash does not influence read performance since flash allows uniform-access-time random access. Therefore, most of the flash-specific file systems use this design principle.

Kawaguchi et al. [1995] were probably the first to identify log-structured file systems as appropriate for flash memories. Although they used the log-structured approach to design a block-mapping device driver, their paper pointed out that this approach is suitable for flash memory management in general.

Kim and Lee [2002] proposed cleaning, allocation, and clustering policies for log-structured file systems that reside on flash devices. We have already described their reclamation policy, which is not file-system-specific. Their other policies are specific to file systems. Their system tries to identify files that have not been updated recently and whose data is fragmented over many erase units. Once in a while, a clustering process tries to collect the blocks of such files into a single erase unit. The rationale behind this policy is that if an infrequently modified file is fragmented, it does not contribute much valid data to the units that store it, so they may become good candidates for reclamation. But whenever such a unit is reclaimed, the file’s data is simply copied to a new unit, thereby reducing the effectiveness of reclamation. By collecting the file’s data onto a single unit, the unit is likely to contain a significant amount of valid data for a long period until the file is deleted or updated again. Therefore, the unit is unlikely to be reclaimed soon, and the file’s data is less likely to be copied over and over again. Kim and Lee [2002] propose policies for finding infrequently modified files, for estimating their fragmentation, and for selecting the collection period and scope (size).

Kim and Lee [2002] also propose to exploit clustering to improve wear leveling. They propose to allocate the most worn-out free unit for storage of cold data during such collections and to allocate the least worn-out free unit for normal updates of data. The rationale here is that not-much-used data that is collected is likely to remain valid for a long period, thereby allowing a worn-out unit to rest. On the other hand, recently updated data is likely to be updated again soon, so a unit that is used for normal log operations is likely to become a candidate for reclamation soon, so it is better to use a little-worn unit.
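
A sketch of this allocation rule in Python (illustrative names; the erase counts and the interface are our assumptions, not details from Kim and Lee [2002]):

    # Sketch: cold (collected) data goes to the most worn-out free unit,
    # normal (hot) updates go to the least worn-out free unit.
    def allocate_unit(free_units, for_cold_data):
        # free_units: list of (unit_id, erase_count) pairs.
        if for_cold_data:
            # Cold data stays valid long, letting a worn unit rest.
            return max(free_units, key=lambda u: u[1])[0]
        # Hot data becomes obsolete soon; its unit is erased again soon.
        return min(free_units, key=lambda u: u[1])[0]

    free = [("u0", 120), ("u1", 15), ("u2", 64)]
    print(allocate_unit(free, for_cold_data=True))   # -> "u0"
    print(allocate_unit(free, for_cold_data=False))  # -> "u1"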

3.2. The Research-In-Motion File System

Research In Motion, a company making handheld text messaging devices and smart phones,2 patented a log-structured file system for flash memories [Parker 2003]. The file system is designed to store contiguous variable-length records of data, each having a unique identifier.

The flash is partitioned into an area for programs and an area for the file system (this is fairly common and is designed to allow programs to be executed in place directly from NOR flash memory). The file system area is organized as a perfectly circular log containing a sequence of records. Each record starts with a header containing the record’s size, identity, invalidation flags, and a pointer to the next record in the same file, if any.

Because the log is circular, cleaning may not be effective: the oldest erase unit may not contain much obsolete data, but it is cleaned anyway. To address this issue, the patent proposes to partition the flash further into a log for so-called hot (frequently modified) data and a log for cold data. No suggestion is made as to how to classify records.

2 http://www.rim.net.

Keeping the records contiguous allows the file system to return pointers directly into the NOR flash in read operations. Presumably, this is enabled by an API that only allows access to a single record at a time, not to multiple records or other arbitrary parts of files.

The patent suggests keeping a direct map in RAM, mapping each logical record number to its current location on the flash device. The patent also mentions the possibility of keeping this data structure, or parts thereof, in flash but without providing any details.

3.3. The Journaling Flash Filing System

The Journaling Flash File System (JFFS) was originally developed by Axis Communications AB for embedded Linux [Axis Communications 2004]. It was later enhanced, in a version called JFFS2, by David Woodhouse of Red Hat [Woodhouse 2001]. Both versions are freely available under the GNU Public License (GPL). Both versions focus mainly on NOR devices and may not work reliably on NAND devices. Our description here focuses on the more recent JFFS2.

JFFS2 is a POSIX-compliant file system. Files are represented by an inode number. Inode numbers are never reused, and each version of an inode structure on flash carries a version number. Version numbers are also not reused. The version numbers allow the host to reconstruct a direct inode map from the inverse map stored on flash.

In JFFS2, the log consists of a linked list of variable-length nodes. Most nodes contain parts of files. Such nodes contain a copy of the inode (the file’s metadata), along with a range of data from the file, possibly empty. There are also special directory-entry nodes, which contain a file name and an associated inode number.

At mount time, the system scans all the nodes in the log and builds two data structures. One is a direct map from each inode number to the most recent version of it on flash. This map is kept in a hash table. The other is a collection of structures that represent each valid node on the flash. Each structure participates in two linked lists, one chaining all the nodes according to physical address, to assist in garbage collection, and the other containing all the nodes of a file, in order. The list of nodes belonging to a file forms a direct map of file positions to flash addresses. Because both the inode-to-flash-address and file-position-to-flash-address maps are only kept in RAM, the data structure on the flash can be very simple. In particular, when a file is extended or modified, only the new data and the inode are written to the flash, but no other mapping information. The obvious consequence of this design choice is high RAM usage.
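
The version-number mechanism can be illustrated with a small Python sketch of the mount-time scan (a simplified model of our own, not JFFS2’s actual code or on-flash layout):

    # Sketch: rebuild the direct inode map from the inverse map on flash
    # by keeping, for each inode, only the highest-version node.
    def build_inode_map(flash_nodes):
        # flash_nodes: iterable of (inode, version, flash_address) tuples
        # in arbitrary on-flash order.
        newest = {}
        for inode, version, addr in flash_nodes:
            if inode not in newest or version > newest[inode][0]:
                newest[inode] = (version, addr)
        return {inode: addr for inode, (_, addr) in newest.items()}

    nodes = [(7, 1, 0x0100), (7, 3, 0x0480), (9, 1, 0x0200)]
    print(build_inode_map(nodes))  # -> {7: 1152, 9: 512}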

JFFS2 uses a simple wear-leveling technique. Most of the time, the cleaner selects for cleaning an erase unit that contains at least some obsolete data (the article describing JFFS2 does not specify the exact policy, but it is probably based on the amount of obsolete data in each unit). But on every 100th cleaning operation, the cleaner selects a unit with only valid data in an attempt to move static data around the device.

The cleaner can merge small blocks of data belonging to a file into a new large chunk. This is supposed to improve performance, but it can actually degrade performance. If, later, only part of that large chunk is modified, a new copy of the entire large chunk must be written to flash because writes are not allowed to modify parts of valid chunks.

3.4. YAFFS: Yet Another Flash Filing System

YAFFS was written by Aleph One [2002] as a NAND file system for embedded devices. It has been released under the GPL and has been used in products running both Linux and Windows CE. It was written because the authors evaluated JFFS and JFFS2 and concluded that they are not suitable for NAND devices.

In YAFFS, files are stored in fixed-sized chunks, which can be 512 bytes, 1KB, or 2KB in size. The file system relies on being able to associate a header with each chunk. The header is 16 bytes for 512-byte chunks, 30 bytes for 1KB, and 42 bytes for 2KB. Each file (including directories) includes one header chunk, containing the file’s name and permissions, and zero or more data chunks. As in JFFS2, the only mapping information on flash is the content of each chunk, stored as part of the header. This implies that at mount time all the headers must be read from flash to construct the file ID and file-contents maps, and that, as in JFFS, the maps of all the files are stored in RAM at all times.

To save RAM relative to JFFS, YAFFS uses a much more efficient map structure to map file locations to physical flash addresses. The mapping uses a tree structure with 32-byte nodes. Internal nodes contain 8 pointers to other nodes, and leaf nodes contain 16 2-byte pointers to physical addresses. For large flash memories, 16-bit words cannot point to an arbitrary chunk. Therefore, YAFFS uses a scheme that we call approximate pointers. Each pointer value represents a contiguous range of chunks. For example, if the flash contains 2^18 chunks, each 16-bit value represents 4 chunks. To find the chunk that contains the required data block, the system searches the headers of these 4 chunks to find the right one. In essence, approximate pointers work because the data they point to are self-describing. To avoid file-header modification on every append operation, each chunk’s header contains the amount of data it carries; upon append, a new tail chunk is written containing the new size of the tail which, together with the number of full blocks, gives the size of the file.
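
The lookup through an approximate pointer might look as follows (a Python sketch under an assumed header layout, with invented names; YAFFS’s actual code differs):

    # Sketch: a 16-bit leaf value names a contiguous range of chunks;
    # the self-describing chunk headers disambiguate within the range.
    CHUNKS_PER_POINTER = 4  # e.g., 2^18 chunks addressed by 16-bit values

    def find_chunk(pointer_value, file_id, block_no, read_header):
        first = pointer_value * CHUNKS_PER_POINTER
        for chunk in range(first, first + CHUNKS_PER_POINTER):
            # read_header returns the (file_id, block_no) a chunk claims.
            if read_header(chunk) == (file_id, block_no):
                return chunk
        raise KeyError("no matching chunk in range")

    # Toy flash: chunk 9 holds block 2 of file 5.
    headers = {8: (5, 1), 9: (5, 2), 10: (7, 0), 11: None}
    print(find_chunk(2, 5, 2, headers.get))  # searches chunks 8-11 -> 9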

The first version of YAFFS uses 512-byte chunks and invalidates chunks by clearing a byte in the header. To ensure that random bit errors, which are common in NAND devices, do not cause a deleted chunk to reappear as valid (or vice versa), invalidity is signaled by 4 or more zero bits in the byte (that is, by a majority vote of the bits in the byte). This requires writing to each page twice before the erase unit is reclaimed.

YAFFS2 uses a slightly more complex arrangement to invalidate chunks of 1 or 2KB. The aim of this modification is to achieve a strictly sequential writing order within erase units so that erased pages are written one after the other and never rewritten. This is achieved using two mechanisms. First, each chunk’s header contains not only the file ID and the position within the file but also a sequence number. The sequence number determines which, among all the chunks representing a single block of a file, is the valid chunk. The rest can be reclaimed. Second, files and directories are deleted by moving them to a trash directory, which implicitly marks all their chunks for garbage collection. When the last chunk of a deleted file is erased, the file itself can be deleted from the trash directory. (Application code cannot rescue files from this trash directory.) The RAM file-contents maps are recreated at mount time by reading the headers of the chunks by sequence number to ensure that only the valid chunk of each file block is referred to.
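
A sketch of how the sequence numbers resolve validity (an illustrative Python model of our own, not YAFFS2 code):

    # Sketch: among all chunks claiming the same (file, block), the one
    # with the highest sequence number is valid; the rest are reclaimable.
    def rebuild_file_maps(chunk_headers):
        # chunk_headers: list of (addr, file_id, block_no, seq) tuples.
        newest = {}
        for addr, file_id, block_no, seq in chunk_headers:
            key = (file_id, block_no)
            if key not in newest or seq > newest[key][0]:
                newest[key] = (seq, addr)
        maps = {}  # {file_id: {block_no: addr}} built from valid chunks
        for (file_id, block_no), (_, addr) in newest.items():
            maps.setdefault(file_id, {})[block_no] = addr
        return maps

    headers = [(0, 5, 0, 11), (1, 5, 0, 19), (2, 5, 1, 12)]
    print(rebuild_file_maps(headers))  # -> {5: {0: 1, 1: 2}}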

Wear leveling is achieved mostly by infrequent random selection of an erase unit to reclaim. But as in JFFS, most of the time an attempt is made to erase an erase unit that contains no valid data. The authors argue that wear leveling is somewhat less important for NAND devices than for NOR devices. NAND devices are often shipped with bad pages, marked as such in their headers, to improve yield. They also suffer from bit flipping during normal operation, which requires using ECC/EDC codes in the headers. Therefore, the authors of YAFFS argue, file systems and block-mapping techniques must be able to cope with errors and bad pages anyway, so an erase unit that is defective due to excessive wear is not particularly exceptional. In other words, uneven wear will lead to loss of storage capacity, but it should not have any other impact on the file system.

3.5. The Trimble File System

The Trimble file system was a NOR file system implemented by Marshall and Manning [1998] for Trimble Navigation (a maker of GPS equipment). Manning is also one of the authors of the more recent YAFFS.


The overall structure of the file system is fairly similar to that of YAFFS. Files are broken into 252-byte chunks, and each chunk is stored with a 4-byte header in a 256-byte flash sector. The 4-byte header includes the file number and chunk number within the file. Each file also includes a header sector, containing the file number, a valid/invalid word, a file name, and up to 14 file records, only one of which, the last one, is valid. Each record contains the size of the file, a checksum, and the last modification time. Whenever a file is modified, a new record is used, and when they are all used up, a new header sector is allocated and written.
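
The record-rotation scheme can be sketched as follows (Python, with names of our own choosing; this models only the bookkeeping described above, not the actual on-flash format):

    # Sketch: each modification burns one of the 14 record slots in the
    # header sector; when all are used, a new header sector is needed.
    RECORDS_PER_HEADER = 14

    class HeaderSector:
        def __init__(self, file_number, name):
            self.file_number, self.name = file_number, name
            self.records = []  # (size, checksum, mtime); last one is valid

        def note_modification(self, size, checksum, mtime):
            if len(self.records) == RECORDS_PER_HEADER:
                return False   # caller must allocate a new header sector
            self.records.append((size, checksum, mtime))
            return True

        def current_record(self):
            return self.records[-1]  # only the last record is valid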

As in YAFFS, all the mapping information is stored in RAM during normal operation since the flash contains no mapping structures other than the inverse maps.

Erase units are chosen for reclamation based solely on the number of valid sectors that they still contain and only if they contain no free sectors. Ties are broken by erase counts to provide some wear leveling. To avoid losing the erase counts during a crash that occurs after a unit is erased but before its header is written, the erase count is written prior to erasure into a special sector containing block erasure information.

Sectors are allocated sequentially within erase units, and the next erase unit to be used is selected from among the available ones based on erase counts, again providing some wear-leveling capability.

3.6. The Microsoft Flash File System

In the mid-1990s, Microsoft tried to promote a standardized file system for removable flash memories, which was called FFS2 (we did not find documentation of an earlier version, but we assume one existed). Douglis et al. [1994] report very poor write performance for this system, which is probably the main reason it failed. By 1998, Intel [1998a] listed this solution as obsolete.

Current Microsoft documentation (as of Spring 2004) does not mention FFS2 (nor FFS). Microsoft obtained a number of patents on flash file systems, and we assume that the systems described by the patents are similar to FFS2.

The earliest patent [Barrett et al. 1995] describes a file system for NOR flash devices that contain one large erase unit. Therefore, the device is treated as a write-once device, except that bits that were not cleared when an area was first written can be cleared later. The system uses linked lists to represent the files in a directory and the data blocks of a file. When a file is extended, a new record is appended to the end of its block list. This is done by clearing bits in the next field of the last record in the current list; that field was originally left all set, indicating that it was not yet valid (the all-1s bit pattern is considered an invalid pointer).

Updating a range of bytes in a file is more difficult. A file is updated by patching the linked list, as shown in Figure 4. The first record that points to now-invalid data is marked invalid by setting a replacement field to the address of a replacement record (the replacement field is actually called secondary ptr in the patent). This again uses a field that is initially left at the erased state, all 1s. The next field of the replacement record can point back to the original linked list, or it can point to additional new records. But unless the update reaches to the end of the file, the new part of the linked list eventually points back to old records; hence, the list is patched. A record does not contain file data, only a pointer to a run of data. The runs of data are raw and not marked by any special header. This allows a run to be broken into three when data in its middle are updated. Three new records will point to the still-valid prefix of the run, to a new replacement run, and to the still-valid suffix.
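
A small Python model of the patching mechanism (our illustration of the idea shown in Figure 4; the field names are approximations, not those of the patent):

    # Sketch: records are write-once; an update "invalidates" a record by
    # programming its replacement field, which was left erased until then.
    class Record:
        def __init__(self, data_run, next_rec=None):
            self.data_run = data_run    # (offset, length) into raw data
            self.next_rec = next_rec    # next record in the file's list
            self.replacement = None     # erased (all 1s) until used once

    def patch(record, new_chain_head):
        # Allowed only once per record: the field can only be cleared.
        assert record.replacement is None, "field already programmed"
        record.replacement = new_chain_head

    def resolve(record):
        # Repeated updates build chains that must be traversed on reads.
        while record.replacement is not None:
            record = record.replacement
        return record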

The main defect in this scheme is that repeated updates to a file lead to longer and longer linked lists that must be traversed to access the data. For example, if the first 10 bytes of a file are updated t times, then a chain of t invalid records must be traversed before we reach the record that points to the most recent data.


Fig. 4. The data structures of the Microsoft File System. The data structure near the top shows a linked-list element pointing to a block of 20 raw bytes in a file. The bottom figure shows how the data structure is modified to accommodate an update of 5 of these 20 bytes. The data and next-in-list pointers of the original node are invalidated. The replacement pointer, which was originally free (all 1’s; marked in gray in the figure), is set to point to a chain of 3 new nodes, two of which point to still-valid data within the existing block, and one of which points to a new block of raw data. The last node in the new chain points back to the tail of the original list.

The cause of this defect is the attempt to keep objects (files and directories) at static addresses. For example, the header of the device, which is written once and for all, contains a pointer to the root directory, as it does in conventional file systems. This arrangement makes it easy to find things but requires traversing long chains of invalid data to find current data. The log-structured approach, where objects are moved when they are updated, makes it more difficult to find things, but once an object is found, no invalid data needs to be accessed.


Later versions of FFS allowed reclamation of erase units [Torelli 1995]. This is done by associating each data block (a data extent or a linked-list record) with a small descriptor. Memory is allocated contiguously from the low end of the unit, and fixed-size descriptors are allocated from the top end, toward the low end. Each descriptor describes the offset of a data block within the unit, the size of the block, and whether it is valid or obsolete. Pointers to data blocks within the file system are not physical pointers but concatenations of a logical erase-unit number and a block number within the unit. The block number is the index of the descriptor of the block. This allows the actual blocks to be moved and compacted when the unit is reclaimed; only the valid descriptors need to retain their position in the new block. This does not cause much fragmentation because descriptors are small and uniform in size (6 bytes). This system is described in three patents [Krueger and Rajagopalan 1999a, 1999b, 2001].
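
A sketch of this descriptor indirection (Python; the class and field names are ours and model the description above loosely, not the patents’ exact structures):

    # Sketch: a block pointer is (logical unit, descriptor index); data
    # may move during compaction as long as a valid descriptor keeps its
    # index, so pointers throughout the file system stay valid.
    class EraseUnit:
        def __init__(self):
            self.data = bytearray()  # data grows from the unit's low end
            self.descriptors = []    # index -> [offset, size, valid]

        def append_block(self, block):
            self.descriptors.append([len(self.data), len(block), True])
            self.data += block
            return len(self.descriptors) - 1  # the block's stable index

        def read_block(self, index):
            offset, size, valid = self.descriptors[index]
            assert valid, "block is obsolete"
            return bytes(self.data[offset:offset + size])

    unit = EraseUnit()
    i = unit.append_block(b"raw data")
    print(unit.read_block(i))  # pointer (unit, i) survives compaction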

It is possible to reclaim linked-list records in this system, but it is not trivial. From Torelli’s article [1995], it seems that at least some implementations did do that, thereby not only reclaiming space but also shortening the lists. Suppose that a valid record a points to record b and that in b, the replacement pointer is used and points to c. This means that c replaced b. When a is moved to another unit during reclamation, it can be modified to point to c, and b can be marked obsolete. Obviously, it is also possible to skip multiple replaced records.

This design is highly NOR-specific, both because of the late assignment to the replacement pointers and the way that units are filled from both ends, with data from one end and with descriptors from the other.

3.7. Norris Flash File System

Norris Communications Corporation patented a flash file system based on linked lists [Daberko 1998], much like the Microsoft Flash File System. The patent is due to Daberko, who apparently designed and implemented the file system for use in a handheld audio recorder.

3.8. Other Commercial Embedded File Systems

Several other companies offer embedded flash file systems but provide only a few details on their design and implementation.

TargetFFS. Blunk Microsystems offers TargetFFS, an embedded flash file system, in both NAND and NOR varieties.3 It works under their own operating system but is designed to be portable to other operating systems and to products without an operating system. The file system uses a POSIX-like API.

Blunk claims that the file system guarantees integrity across unexpected shutdowns, that it levels the wear of the erase units, that the NAND version uses ECC/EDC, and that it is optimized for fast mounts, typically a second for a 64MB file system. The code footprint of the file system is around 60KB plus about 64KB of RAM.

The company’s Web site contains one performance graph showing that the write performance degrades as the volume fills up. No explanation is given, but the likely reason is the drop in the effectiveness of erase-unit reclamation.

smxFFS. This file system from Micro Digital only supports nonremovable NAND devices.4 The file system consists of a block-mapping device driver and a simple FAT-like file system with a flat directory structure. The block-mapping driver assumes that every flash page is associated with a small spare area (16 bytes for 512-byte pages) and that pages can be updated in place three times before erasure. The system performs wear leveling by relocating a fixed number of static blocks whenever the difference between the most and the least worn-out page exceeds a threshold. The default software configuration for a 16MB flash device requires about 168KB of RAM.

3 http://www.blunkmicro.com.
4 http://www.smxinfo.com/rtos/fileio/smxffs.htm.


EFFS. HCC Embedded5 offers several flash file systems. EFFS-STD is a full file system. The company makes claims similar to those of Blunk, including claims of integrity, wear leveling, and NOR and NAND capability.

EFFS-FAT and EFFS-THIN are FAT implementations for removable flash memories. The two versions offer similar functionality, except that EFFS-THIN is optimized for 8-bit processors.

FLite. FLite combines a standard FAT file system with an FTL-compatible block-mapping device driver [Dan and Williams 1997]. It is unclear whether this is a current product.

4. BEYOND FILE SYSTEMS

Allowing high-level software to view a flash device as a simple rewritable block device or as an abstract file store is not the only possibility. This section describes additional services that flash-management software can provide. The first class of services consists of abstractions higher-level than files, the second of programs that execute directly from flash, and the third of a mechanism to use flash devices as a persistent main memory.

4.1. Flash-Aware Application-Specific Data Structures

Some applications maintain sophisticated data structures in nonvolatile storage. Search trees, which allow database management systems and other applications to respond quickly to queries, are the most common of these data structures. Normally, such a data structure is stored in a file. In more demanding applications, the data structure might be stored directly on a block device, such as a disk partition, without a file system. Some authors argue that by implementing flash-aware application-specific data structures, performance and endurance can be improved beyond implementation over a file system or even over a block device.

5 http://www.hcc-embedded.com.

Wu et al. [2003a, 2003b] proposed flash-aware implementations of B-trees and R-trees. Their implementations represent a tree node as an ordered set of small items. Items in a set represent individual insertions, deletions, and updates to a node. The items are ordered by time, so the system can construct the current state of a tree node by traversing its set of items. To conserve space, item sets can be compacted when they grow too much. In order to avoid frequent write operations of small amounts of data, new items are collected in RAM and flushed to disk when the RAM buffer fills. To find the items that constitute a node, the system maintains in RAM a linked list of the items of each node in the tree. Hence, it appears that the RAM consumption of these trees can be quite high.

The main problem with flash-aware application-specific data structures is that they require that the flash device be partitioned. One partition holds the application-specific data structure, another holds other data, usually files. Partitioning the device at the physical level (physical addresses) adversely affects wear leveling because worn-out units in one partition cannot be swapped with relatively fresh units in the other. The Wu et al. [2003a, 2003b] implementations partition the virtual block address space, so both the tree blocks and file-system blocks are managed by the same block-mapping mechanism. In other words, their data structures are flash-aware, but they operate at the virtual block level, not at the physical sector level. Another problem with partitioning is the potential for wasting space if the partitions cannot be dynamically resized.

4.2. Execute-in-Place

Code stored in a NOR device, which is directly addressable by the processor, can be executed from the device itself without being copied into RAM. This is known as execute-in-place, or XIP. Unfortunately, execute-in-place raises some difficult problems.

Unless the system uses a virtual-memory mechanism, which requires a hardware memory-management unit, code must be contiguous in flash, and it must not move. This implies that erase units containing code cannot participate in a wear-leveling block-mapping scheme. This also precludes storage of code in files within a read-write file system since such file systems do not store files contiguously. Therefore, implementing XIP in a system without virtual memory requires partitioning the flash memory at the physical address level into a code partition and a data partition. As explained earlier, this accelerates wear.

Even if the system does use virtual memory, implementing XIP is tricky. First, this requires that the block-mapping device driver be able to return the physical addresses that correspond to a given virtual block number. Second, the block-mapping device driver must notify the virtual memory subsystem whenever memory-mapped sectors are relocated.

In addition, using XIP requires that code be stored uncompressed. In some processor architectures, machine code can be effectively compressed. If this is possible, then XIP saves RAM but wastes flash storage.

These reasons have led some to suggest that, at least in systems with large RAM, XIP should not be used [Woodhouse 2001]. Systems with small RAM, where XIP is more important, often do not have a memory management unit, so their only option is to partition the physical flash address space.

Hardware limitations also contribute to the difficulty of implementing XIP. Many flash devices cannot read from one erase unit while another is being written to or erased. If this is the case, code that might need to run during a write or erase operation cannot be executed in place. To address this issue, some flash devices allow a write or erase to be suspended, and other devices allow reads while writing (RWW).

4.3. Flash-Based Main Memories

Wu and Zwaenepoel [1994] describe eNVy, a flash-based nonvolatile main memory. The system was designed to reside on the memory bus of a computer and to service single-word read and write requests. Because this memory is both fast for single-word access and nonvolatile, it can replace both the magnetic disk and the DRAM in a computer. The memory system itself contained a large NOR flash memory that was connected by a very wide bus to a battery-backed static RAM device. Because static RAM is much more expensive than flash, the system combined a large flash memory with a small static RAM. The system also contained a processor with a memory management unit (MMU) that was able to access both the flash and the internal RAM.

The system partitions the physical address space of the external memory bus into 256-byte pages that are normally mapped by the internal MMU to flash pages. Read requests are serviced directly from this memory-mapped flash. Write requests are serviced by copying a page from the flash to the internal RAM, modifying the internal MMU’s state so that the external physical page is now mapped to the RAM, and then performing the word-size write onto the RAM. As long as the physical page is mapped into the RAM, further read and write requests are performed directly on the RAM. When the RAM fills, the oldest page in the RAM is written back to the flash, again modifying the MMU’s state to reflect the change.
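
The write path can be summarized with a small Python model (a schematic of the mechanism just described, with invented names and a deliberately tiny buffer; eNVy itself implemented this in hardware with an MMU):

    # Sketch: writes copy a page from flash into battery-backed RAM and
    # remap it; the oldest buffered page is flushed when the RAM fills.
    from collections import OrderedDict

    PAGE, RAM_PAGES = 256, 2
    flash = {n: bytes(PAGE) for n in range(8)}  # page number -> contents
    ram = OrderedDict()                         # pages currently remapped

    def write_byte(page, offset, value):
        if page not in ram:
            if len(ram) == RAM_PAGES:           # RAM full: evict oldest
                old, data = ram.popitem(last=False)
                flash[old] = bytes(data)        # one flash write covers
            ram[page] = bytearray(flash[page])  # many buffered updates
        ram[page][offset] = value               # the write itself hits RAM

    def read_byte(page, offset):
        return ram[page][offset] if page in ram else flash[page][offset]

    write_byte(3, 0, 0x41); write_byte(3, 1, 0x42)  # absorbed in RAM
    write_byte(4, 0, 0x43); write_byte(5, 0, 0x44)  # evicts page 3
    print(read_byte(3, 0))  # -> 65, now read back from flash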

The system works well thanks to the wide bus between the flash and the internal RAM and thanks to the ability to buffer pages in the RAM for a while. The wide bus allows pages stored on flash to be transferred to RAM in one cycle, which is critical for processing write requests quickly. By buffering pages in the RAM for a while before they are written back to flash, many updates to a single page can be performed with a single RAM-to-flash page transfer. The reduction in the number of flash writes reduces unit erasures, thereby improving performance and extending the system’s lifetime.

Using a wide bus has a significant drawback, however. To build a wide bus, Wu and Zwaenepoel [1994] used many flash chips in parallel. This made the effective size of erase units much larger. Large erase units are harder to manage and, as a result, are prone to accelerated wear.

5. SUMMARY

Flash memories have been an enabling technology for the introduction of computers into numerous handheld devices. A decade ago, flash memories were used mostly in boot loaders (BIOS chips) and as disk replacements for ruggedized computers. Today, flash memories are also used in mobile phones and PDAs, portable music players and audio recorders, digital cameras, USB memory devices, remote controls, and more. Flash memories provide these devices with fast and reliable storage capabilities thanks to the sophisticated data structures and algorithms that this article surveys.

In general, the challenges posed by newer flash devices are greater than those posed by older devices—the devices are becoming harder to use. This happens because flash hardware technology is driven mostly by the desire for increased capacity and performance, often at the expense of ease of use. This trend requires development of new software techniques and new system architectures for new types of devices.

Unfortunately, many of these techniques are only described in patents, not in technical articles. Although the purpose of patents is to reveal inventions, they suffer from three disadvantages relative to technical articles in journals and conference proceedings. First, a patent almost never contains a realistic assessment of the technique that it presents. A patent usually contains no quantitative comparison to alternative techniques and no theoretical analysis. Although patents often do contain a qualitative comparison to alternatives, the comparison is almost always one-sided, describing the advantages of the new invention but not its potential disadvantages. Second, patents often describe a prototype, not how the invention is used in actual products. This again reduces the reader’s ability to assess the effectiveness of specific techniques. Third, patents are sometimes harder to read than articles in the technical and scientific literature.

Our aim has been to survey flash-management techniques in order to provide both practitioners and researchers with a broad overview of existing techniques. In particular, by surveying technical articles, patents, and corporate literature, we ensure a thorough coverage of all the relevant techniques. We hope that our article will encourage researchers to analyze these techniques both theoretically and experimentally. We also hope that this article will facilitate the development of new and improved flash-management techniques.

REFERENCES

ALEPH ONE. 2002. YAFFS: Yet another flash filing system. Cambridge, UK. Available at http://www.aleph1.co.uk/yaffs/index.html.

ASSAR, M., NEMAZIE, S., AND ESTAKHRI, P. 1995a. Flash memory mass storage architecture. US patent 5,388,083. Filed March 26, 1993; Issued February 7, 1995; Assigned to Cirrus Logic.

ASSAR, M., NEMAZIE, S., AND ESTAKHRI, P. 1995b. Flash memory mass storage architecture incorporation wear leveling technique. US patent 5,479,638. Filed March 26, 1993; Issued December 26, 1995; Assigned to Cirrus Logic.

ASSAR, M., NEMAZIE, S., AND ESTAKHRI, P. 1996. Flash memory mass storage architecture incorporation wear leveling technique without using CAM cells. US patent 5,485,595. Filed October 4, 1993; Issued January 16, 1996; Assigned to Cirrus Logic.

AXIS COMMUNICATIONS. 2004. JFFS home page. Lund, Sweden. Available at http://developer.axis.com/software/jffs/.

BAN, A. 1995. Flash file system. US patent 5,404,485. Filed March 8, 1993; Issued April 4, 1995; Assigned to M-Systems.

BAN, A. 1999. Flash file system optimized for page-mode flash technologies. US patent 5,937,425. Filed October 16, 1997; Issued August 10, 1999; Assigned to M-Systems.

BAN, A. 2004. Wear leveling of static areas in flash memory. US patent 6,732,221. Filed June 1, 2001; Issued May 4, 2004; Assigned to M-Systems.

BARRETT, P. L., QUINN, S. D., AND LIPE, R. A. 1995. System for updating data stored on a flash-erasable, programmable, read-only memory (FEPROM) based upon predetermined bit value of indicating pointers. US patent 5,392,427. Filed May 18, 1993; Issued February 21, 1995; Assigned to Microsoft.


CHANG, L.-P., KUO, T.-W., AND LO, S.-W. 2004. Real-time garbage collection for flash-memory storage systems of real-time embedded systems. ACM Trans. Embed. Comput. Syst. 3, 4, 837–863.

CHIANG, M.-L. AND CHANG, R.-C. 1999. Cleaning policies in mobile computers using flash memory. J. Syst. Softw. 48, 3, 213–231.

CHIANG, M.-L., LEE, P. C., AND CHANG, R.-C. 1999. Using data clustering to improve cleaning performance for flash memory. Softw. Pract. Exp. 29, 3, 267–290.

DABERKO, N. 1998. Operating system including improved file management for use in devices utilizing flash memory as main memory. US patent 5,787,445. Filed March 7, 1996; Issued July 28, 1998; Assigned to Norris Communications.

DAN, R. AND WILLIAMS, J. 1997. A TrueFFS and FLite technical overview of M-Systems’ flash file systems. Tech. rep. 80-SR-002-00-6L Rev. 1.30, M-Systems.

DOUGLIS, F., CACERES, R., KAASHOEK, M. F., LI, K., MARSH, B., AND TAUBER, J. A. 1994. Storage alternatives for mobile computers. In Proceedings of the 1st USENIX Symposium on Operating Systems Design and Implementation (OSDI). Monterey, CA. 25–37.

HAN, S.-W. 2000. Flash memory wear leveling system and method. US patent 6,016,275. Filed November 4, 1998; Issued January 18, 2000; Assigned to LG Semiconductors.

INTEL CORPORATION. 1998a. Flash file system selection guide. Application Note 686, Intel Corporation.

INTEL CORPORATION. 1998b. Understanding the flash translation layer (FTL) specification. Application Note 648, Intel Corporation.

JOU, E. AND JEPPESEN III, J. H. 1996. Flash memory wear leveling system providing immediate direct access to microprocessor. US patent 5,568,423. Filed April 14, 1995; Issued October 22, 1996; Assigned to Unisys.

KAWAGUCHI, A., NISHIOKA, S., AND MOTODA, H. 1995. A flash-memory based file system. In Proceedings of the USENIX Technical Conference. New Orleans, LA. 155–164.

KIM, H.-J. AND LEE, S.-G. 2002. An effective flash memory manager for reliable flash memory space management. IEICE Trans. Inform. Syst. E85-D, 6, 950–964.

KRUEGER, W. J. AND RAJAGOPALAN, S. 1999a. Method and system for file system management using a flash-erasable, programmable, read-only memory. US patent 5,634,050. Filed June 7, 1995; Issued May 27, 1999; Assigned to Microsoft.

KRUEGER, W. J. AND RAJAGOPALAN, S. 1999b. Method and system for file system management using a flash-erasable, programmable, read-only memory. US patent 5,898,868. Filed June 7, 1995; Issued April 27, 1999; Assigned to Microsoft.

KRUEGER, W. J. AND RAJAGOPALAN, S. 2001. Method and system for file system management using a flash-erasable, programmable, read-only memory. US patent 6,256,642. Filed January 29, 1992; Issued July 3, 2001; Assigned to Microsoft.

LOFGREN, K. M., NORMAN, R. D., THELIN, G. B., AND GUPTA, A. 2003. Wear leveling techniques for flash EEPROM systems. US patent 6,594,183. Filed June 30, 1998; Issued July 15, 2003; Assigned to Sandisk and Western Digital.

LOFGREN, K. M. J., NORMAN, R. D., THELIN, G. B., AND GUPTA, A. 2000. Wear leveling techniques for flash EEPROM systems. US patent 6,081,447. Filed March 25, 1999; Issued June 27, 2000; Assigned to Western Digital and Sandisk.

MARSHALL, J. M. AND MANNING, C. D. H. 1998. Flash file management system. US patent 5,832,493. Filed April 24, 1997; Issued November 3, 1998; Assigned to Trimble Navigation.

PARKER, K. W. 2003. Portable electronic device having a log-structured file system in flash memory. US patent 6,535,949. Filed April 19, 1999; Issued March 18, 2003; Assigned to Research In Motion.

ROSENBLUM, M. AND OUSTERHOUT, J. K. 1992. The design and implementation of a log-structured file system. ACM Trans. Comput. Syst. 10, 1, 26–52.

SELTZER, M. I., BOSTIC, K., MCKUSICK, M. K., AND STAELIN, C. 1993. An implementation of a log-structured file system for UNIX. In USENIX Winter Conference Proceedings. San Diego, CA. 307–326.

SMITH, K. B. AND GARVIN, P. K. 1999. Method and apparatus for allocating storage in flash memory. US patent 5,860,082. Filed March 28, 1996; Issued January 12, 1999; Assigned to Datalight.

TORELLI, P. 1995. The Microsoft flash file system. Dr. Dobb’s Journal 20, 62–72.

WELLS, S., HASBUN, R. N., AND ROBINSON, K. 1998. Sector-based storage device emulator having variable-sized sector. US patent 5,822,781. Filed October 30, 1992; Issued October 13, 1998; Assigned to Intel.

WELLS, S. E. 1994. Method for wear leveling in a flash EEPROM memory. US patent 5,341,339. Filed November 1, 1993; Issued August 23, 1994; Assigned to Intel.

WOODHOUSE, D. 2001. JFFS: The journaling flash file system. Ottawa Linux Symposium (July). Available at http://sources.redhat.com/jffs2/jffs2.pdf.

WU, C.-H., CHANG, L.-P., AND KUO, T.-W. 2003a. An efficient B-tree layer for flash-memory storage systems. In Proceedings of the 9th International Conference on Real-Time and Embedded Computing Systems and Applications (RTCSA). Tainan, Taiwan. Lecture Notes in Computer Science, vol. 2968. Springer, 409–430.


WU, C.-H., CHANG, L.-P., AND KUO, T.-W. 2003b. An efficient R-tree implementation over flash-memory storage systems. In Proceedings of the 11th ACM International Symposium on Advances in Geographic Information Systems. New Orleans, LA. 17–24.

WU, M. AND ZWAENEPOEL, W. 1994. eNVy: A nonvolatile, main memory storage system. In Proceedings of the 6th International Conference on Architectural Support for Programming Languages and Operating Systems. San Jose, CA. 86–97.

Received July 2004; accepted August 2005
