
44

Flash-based SSDs

After decades of hard-disk drive dominance, a new form of persistent storage device has recently gained significance in the world. Generically referred to as solid-state storage, such devices have no mechanical or moving parts like hard drives; rather, they are simply built out of transistors, much like memory and processors. However, unlike typical random-access memory (e.g., DRAM), such a solid-state storage device (a.k.a., an SSD) retains information despite power loss, and thus is an ideal candidate for use in persistent storage of data.

The technology we'll focus on is known as flash (more specifically, NAND-based flash), which was created by Fujio Masuoka in the 1980s [M+14]. Flash, as we'll see, has some unique properties. For example, to write to a given chunk of it (i.e., a flash page), you first have to erase a bigger chunk (i.e., a flash block), which can be quite expensive. In addition, writing too often to a page will cause it to wear out. These two properties make construction of a flash-based SSD an interesting challenge:

CRUX: HOW TO BUILD A FLASH-BASED SSD
How can we build a flash-based SSD? How can we handle the expensive nature of erasing? How can we build a device that lasts a long time, given that repeated overwrite will wear the device out? Will the march of progress in technology ever cease? Or cease to amaze?

44.1 Storing a Single Bit

Flash chips are designed to store one or more bits in a single transistor; the level of charge trapped within the transistor is mapped to a binary value. In a single-level cell (SLC) flash, only a single bit is stored within a transistor (i.e., 1 or 0); with a multi-level cell (MLC) flash, two bits are encoded into different levels of charge, e.g., 00, 01, 10, and 11 are represented by low, somewhat low, somewhat high, and high levels. There is even triple-level cell (TLC) flash, which encodes 3 bits per cell. Overall, SLC chips achieve higher performance and are more expensive.


TIP: BE CAREFUL WITH TERMINOLOGY

You may have noticed that some terms we have used many times before (blocks, pages) are being used within the context of a flash, but in slightly different ways than before. New terms are not created to make your life harder (although they may be doing just that), but arise because there is no central authority where terminology decisions are made. What is a block to you may be a page to someone else, and vice versa, depending on the context. Your job is simple: to know the appropriate terms within each domain, and use them such that people well-versed in the discipline can understand what you are talking about. It's one of those times where the only solution is simple but sometimes painful: use your memory.

Of course, there are many details as to exactly how such bit-level storage operates, down at the level of device physics. While beyond the scope of this book, you can read more about it on your own [J10].

44.2 From Bits to Banks/Planes

As they say in ancient Greece, storing a single bit (or a few) does not a storage system make. Hence, flash chips are organized into banks or planes which consist of a large number of cells.

A bank is accessed in two different sized units: blocks (sometimes called erase blocks), which are typically of size 128 KB or 256 KB, and pages, which are a few KB in size (e.g., 4KB). Within each bank there are a large number of blocks; within each block, there are a large number of pages. When thinking about flash, you must remember this new terminology, which is different than the blocks we refer to in disks and RAIDs and the pages we refer to in virtual memory.

Figure 44.1 shows an example of a flash plane with blocks and pages; there are three blocks, each containing four pages, in this simple example. We'll see below why we distinguish between blocks and pages; it turns out this distinction is critical for flash operations such as reading and writing, and even more so for the overall performance of the device. The most important (and weird) thing you will learn is that to write to a page within a block, you first have to erase the entire block; this tricky detail makes building a flash-based SSD an interesting and worthwhile challenge, and the subject of the second half of the chapter.

Block:       0             1             2
Page:      00 01 02 03  04 05 06 07  08 09 10 11
Content:

Figure 44.1: A Simple Flash Chip: Pages Within Blocks


44.3 Basic Flash Operations

Given this flash organization, there are three low-level operations that a flash chip supports. The read command is used to read a page from the flash; erase and program are used in tandem to write. The details:

• Read (a page): A client of the flash chip can read any page (e.g., 2KB or 4KB), simply by specifying the read command and appropriate page number to the device. This operation is typically quite fast, 10s of microseconds or so, regardless of location on the device, and (more or less) regardless of the location of the previous request (quite unlike a disk). Being able to access any location uniformly quickly means the device is a random access device.

• Erase (a block): Before writing to a page within a flash, the nature of the device requires that you first erase the entire block the page lies within. Erase, importantly, destroys the contents of the block (by setting each bit to the value 1); therefore, you must be sure that any data you care about in the block has been copied elsewhere (to memory, or perhaps to another flash block) before executing the erase. The erase command is quite expensive, taking a few milliseconds to complete. Once finished, the entire block is reset and each page is ready to be programmed.

• Program (a page): Once a block has been erased, the program command can be used to change some of the 1's within a page to 0's, and write the desired contents of a page to the flash. Programming a page is less expensive than erasing a block, but more costly than reading a page, usually taking around 100s of microseconds on modern flash chips.

One way to think about flash chips is that each page has a state associated with it. Pages start in an INVALID state. By erasing the block that a page resides within, you set the state of the page (and all pages within that block) to ERASED, which resets the content of each page in the block but also (importantly) makes them programmable. When you program a page, its state changes to VALID, meaning its contents have been set and can be read. Reads do not affect these states (although you should only read from pages that have been programmed). Once a page has been programmed, the only way to change its contents is to erase the entire block within which the page resides. Here is an example of state transitions after various erase and program operations within a 4-page block:

               iiii    Initial: pages in block are invalid (i)
Erase()      → EEEE    State of pages in block set to erased (E)
Program(0)   → VEEE    Program page 0; state set to valid (V)
Program(0)   → error   Cannot re-program page after programming
Program(1)   → VVEE    Program page 1
Erase()      → EEEE    Contents erased; all pages programmable
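To make this state machine concrete, here is a minimal sketch in Python (a toy model for this chapter's purposes, not the interface of any real flash chip) of a single block's pages moving through the INVALID, ERASED, and VALID states:

    class FlashBlock:
        def __init__(self, pages_per_block=4):
            self.state = ['i'] * pages_per_block   # i=INVALID, E=ERASED, V=VALID
            self.data  = [None] * pages_per_block

        def erase(self):
            # erase resets the whole block; every page becomes programmable
            self.state = ['E'] * len(self.state)
            self.data  = [None] * len(self.data)

        def program(self, page, contents):
            if self.state[page] != 'E':
                raise RuntimeError('error: page must be erased before programming')
            self.data[page]  = contents
            self.state[page] = 'V'

        def read(self, page):
            assert self.state[page] == 'V', 'should only read programmed pages'
            return self.data[page]

    b = FlashBlock()
    b.erase()              # iiii -> EEEE
    b.program(0, 'a1')     # EEEE -> VEEE
    # b.program(0, 'x')    # would raise: cannot re-program without an erase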


A Detailed Example

Because the process of writing (i.e., erasing and programming) is so unusual, let's go through a detailed example to make sure it makes sense. In this example, imagine we have the following four 8-bit pages, within a 4-page block (both unrealistically small sizes, but useful within this example); each page is VALID as each has been previously programmed.

Page 0     Page 1     Page 2     Page 3
00011000   11001110   00000001   00111111
VALID      VALID      VALID      VALID

Now say we wish to write to page 0, filling it with new contents. To write any page, we must first erase the entire block. Let's assume we do so, thus leaving the block in this state:

Page 0     Page 1     Page 2     Page 3
11111111   11111111   11111111   11111111
ERASED     ERASED     ERASED     ERASED

Good news! We could now go ahead and program page 0, for example with the contents 00000011, overwriting the old page 0 (contents 00011000) as desired. After doing so, our block looks like this:

Page 0     Page 1     Page 2     Page 3
00000011   11111111   11111111   11111111
VALID      ERASED     ERASED     ERASED

And now the bad news: the previous contents of pages 1, 2, and 3 are all gone! Thus, before overwriting any page within a block, we must first move any data we care about to another location (e.g., memory, or elsewhere on the flash). The nature of erase will have a strong impact on how we design flash-based SSDs, as we'll soon learn about.
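For the curious, here is a toy sketch of the bit-level rule at work in this example: erase sets every bit in the block to 1, and program can only clear bits from 1 to 0 (modeled here with a bitwise AND), which is exactly why arbitrary new contents require an erase first:

    def erase_block(block):
        # erase sets every bit of every page in the block to 1
        return [0b11111111 for _ in block]

    def program_page(page_value, new_contents):
        # programming can only clear bits (1 -> 0), never set them back to 1
        assert (page_value & new_contents) == new_contents, \
            'must erase first: programming cannot turn a 0 back into a 1'
        return page_value & new_contents

    block = [0b00011000, 0b11001110, 0b00000001, 0b00111111]
    block = erase_block(block)                     # [0xff, 0xff, 0xff, 0xff]
    block[0] = program_page(block[0], 0b00000011)  # page 0 now holds 00000011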

Summary

To summarize, reading a page is easy: just read the page. Flash chips do this quite well, and quickly; in terms of performance, they offer the potential to greatly exceed the random read performance of modern disk drives, which are slow due to mechanical seek and rotation costs.

Writing a page is trickier; the entire block must first be erased (taking care to first move any data we care about to another location), and then the desired page programmed. Not only is this expensive, but frequent repetitions of this program/erase cycle can lead to the biggest reliability problem flash chips have: wear out. When designing a storage system with flash, the performance and reliability of writing is a central focus. We'll soon learn more about how modern SSDs attack these issues, delivering excellent performance and reliability despite these limitations.


            Read     Program      Erase
Device      (µs)     (µs)         (µs)
SLC          25      200-300      1500-2000
MLC          50      600-900      ~3000
TLC         ~75      ~900-1350    ~4500

Figure 44.2: Raw Flash Performance Characteristics

44.4 Flash Performance And Reliability

Because we're interested in building a storage device out of raw flash chips, it is worthwhile to understand their basic performance characteristics. Figure 44.2 presents a rough summary of some numbers found in the popular press [V12]. Therein, the author presents the basic operation latency of reads, programs, and erases across SLC, MLC, and TLC flash, which store 1, 2, and 3 bits of information per cell, respectively.

As we can see from the table, read latencies are quite good, taking just 10s of microseconds to complete. Program latency is higher and more variable, as low as 200 microseconds for SLC, but higher as you pack more bits into each cell; to get good write performance, you will have to make use of multiple flash chips in parallel. Finally, erases are quite expensive, taking a few milliseconds typically. Dealing with this cost is central to modern flash storage design.

Let's now consider reliability of flash chips. Unlike mechanical disks, which can fail for a wide variety of reasons (including the gruesome and quite physical head crash, where the drive head actually makes contact with the recording surface), flash chips are pure silicon and in that sense have fewer reliability issues to worry about. The primary concern is wear out; when a flash block is erased and programmed, it slowly accrues a little bit of extra charge. Over time, as that extra charge builds up, it becomes increasingly difficult to differentiate between a 0 and a 1. At the point where it becomes impossible, the block becomes unusable.

The typical lifetime of a block is currently not well known. Manufacturers rate MLC-based blocks as having a 10,000 P/E (Program/Erase) cycle lifetime; that is, each block can be erased and programmed 10,000 times before failing. SLC-based chips, because they store only a single bit per transistor, are rated with a longer lifetime, usually 100,000 P/E cycles. However, recent research has shown that lifetimes are much longer than expected [BD10].

One other reliability problem within flash chips is known as disturbance. When accessing a particular page within a flash, it is possible that some bits get flipped in neighboring pages; such bit flips are known as read disturbs or program disturbs, depending on whether the page is being read or programmed, respectively.


TIP: THE IMPORTANCE OF BACKWARDS COMPATIBILITY

Backwards compatibility is always a concern in layered systems. By defining a stable interface between two systems, one enables innovation on each side of the interface while ensuring continued interoperability. Such an approach has been quite successful in many domains: operating systems have relatively stable APIs for applications, disks provide the same block-based interface to file systems, and each layer in the IP networking stack provides a fixed unchanging interface to the layer above.

Not surprisingly, there can be a downside to such rigidity, as interfaces defined in one generation may not be appropriate in the next. In some cases, it may be useful to think about redesigning the entire system. An excellent example is found in the Sun ZFS file system [B07]; by reconsidering the interaction of file systems and RAID, the creators of ZFS envisioned (and then realized) a more effective integrated whole.

44.5 From Raw Flash to Flash-Based SSDs

Given our basic understanding of flash chips, we now face our next task: how to turn a basic set of flash chips into something that looks like a typical storage device. The standard storage interface is a simple block-based one, where blocks (sectors) of size 512 bytes (or larger) can be read or written, given a block address. The task of the flash-based SSD is to provide that standard block interface atop the raw flash chips inside it.

Internally, an SSD consists of some number of flash chips (for persistent storage). An SSD also contains some amount of volatile (i.e., non-persistent) memory (e.g., SRAM); such memory is useful for caching and buffering of data as well as for mapping tables, which we'll learn about below. Finally, an SSD contains control logic to orchestrate device operation. See Agrawal et al. for details [A+08]; a simplified block diagram is seen in Figure 44.3.

One of the essential functions of this control logic is to satisfy client reads and writes, turning them into internal flash operations as need be. The flash translation layer, or FTL, provides exactly this functionality. The FTL takes read and write requests on logical blocks (that comprise the device interface) and turns them into low-level read, erase, and program commands on the underlying physical blocks and physical pages (that comprise the actual flash device). The FTL should accomplish this task with the goal of delivering excellent performance and high reliability.

Excellent performance, as we'll see, can be realized through a combination of techniques. One key will be to utilize multiple flash chips in parallel; although we won't discuss this technique much further, suffice it to say that all modern SSDs use multiple chips internally to obtain higher performance. Another performance goal will be to reduce write amplification, which is defined as the total write traffic (in bytes) issued to the flash chips by the FTL divided by the total write traffic (in bytes) issued by the client to the SSD. As we'll see below, naive approaches to FTL construction will lead to high write amplification and low performance.

[Figure 44.3: A Flash-based SSD: Logical Diagram (interface logic and a flash controller sitting in front of memory and multiple flash chips)]

High reliability will be achieved through the combination of a few different approaches. One main concern, as discussed above, is wear out. If a single block is erased and programmed too often, it will become unusable; as a result, the FTL should try to spread writes across the blocks of the flash as evenly as possible, ensuring that all of the blocks of the device wear out at roughly the same time; doing so is called wear leveling and is an essential part of any modern FTL.

Another reliability concern is program disturbance. To minimize such disturbance, FTLs will commonly program pages within an erased block in order, from low page to high page. This sequential-programming approach minimizes disturbance and is widely utilized.

44.6 FTL Organization: A Bad Approach

The simplest organization of an FTL would be something we call direct mapped. In this approach, a read to logical page N is mapped directly to a read of physical page N. A write to logical page N is more complicated; the FTL first has to read in the entire block that page N is contained within; it then has to erase the block; finally, the FTL programs the old pages as well as the new one.

As you can probably guess, the direct-mapped FTL has many problems, both in terms of performance as well as reliability. The performance problems come on each write: the device has to read in the entire block (costly), erase it (quite costly), and then program it (costly). The end result is severe write amplification (proportional to the number of pages in a block) and as a result, terrible write performance, even slower than typical hard drives with their mechanical seeks and rotational delays.
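A quick back-of-the-envelope sketch of that write amplification (assuming 256-KB blocks and 4-KB pages, sizes the chapter mentions elsewhere) shows how bad this can get:

    block_size = 256 * 1024   # bytes per flash block (assumed)
    page_size  = 4 * 1024     # bytes per flash page (assumed)

    # the client writes one page, but the direct-mapped FTL must
    # re-program the entire block that contains it
    write_amplification = block_size / page_size
    print(write_amplification)   # 64.0: each 4-KB write costs 256 KB of programming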

Even worse is the reliability of this approach. If file system metadata or user file data is repeatedly overwritten, the same block is erased and programmed, over and over, rapidly wearing it out and potentially losing data. The direct-mapped approach simply gives too much control over wear out to the client workload; if the workload does not spread write load evenly across its logical blocks, the underlying physical blocks containing popular data will quickly wear out. For both reliability and performance reasons, a direct-mapped FTL is a bad idea.


44.7 A Log-Structured FTL

For these reasons, most FTLs today are log structured, an idea useful in both storage devices (as we'll see now) and file systems above them (as we'll see in the chapter on log-structured file systems). Upon a write to logical block N, the device appends the write to the next free spot in the currently-being-written-to block; we call this style of writing logging. To allow for subsequent reads of block N, the device keeps a mapping table (in its memory, and persistent, in some form, on the device); this table stores the physical address of each logical block in the system.

Let's go through an example to make sure we understand how the basic log-based approach works. To the client, the device looks like a typical disk, in which it can read and write 512-byte sectors (or groups of sectors). For simplicity, assume that the client is reading or writing 4-KB sized chunks. Let us further assume that the SSD contains some large number of 16-KB sized blocks, each divided into four 4-KB pages; these parameters are unrealistic (flash blocks usually consist of more pages) but will serve our didactic purposes quite well.

Assume the client issues the following sequence of operations:

• Write(100) with contents a1
• Write(101) with contents a2
• Write(2000) with contents b1
• Write(2001) with contents b2

These logical block addresses (e.g., 100) are used by the client of the SSD (e.g., a file system) to remember where information is located.

Internally, the device must transform these block writes into the erase and program operations supported by the raw hardware, and somehow record, for each logical block address, which physical page of the SSD stores its data. Assume that all blocks of the SSD are currently not valid, and must be erased before any page can be programmed. Here we show the initial state of our SSD, with all pages marked INVALID (i):

Block:       0             1             2
Page:      00 01 02 03  04 05 06 07  08 09 10 11
Content:
State:      i  i  i  i   i  i  i  i   i  i  i  i

When the first write is received by the SSD (to logical block 100), the FTL decides to write it to physical block 0, which contains four physical pages: 0, 1, 2, and 3. Because the block is not erased, we cannot write to it yet; the device must first issue an erase command to block 0. Doing so leads to the following state:

Block:       0             1             2
Page:      00 01 02 03  04 05 06 07  08 09 10 11
Content:
State:      E  E  E  E   i  i  i  i   i  i  i  i


Block 0 is now ready to be programmed. Most SSDs will write pages in order (i.e., low to high), reducing reliability problems related to program disturbance. The SSD then directs the write of logical block 100 into physical page 0:

Block:       0             1             2
Page:      00 01 02 03  04 05 06 07  08 09 10 11
Content:   a1
State:      V  E  E  E   i  i  i  i   i  i  i  i

But what if the client wants to read logical block 100? How can it find where it is? The SSD must transform a read issued to logical block 100 into a read of physical page 0. To accommodate such functionality, when the FTL writes logical block 100 to physical page 0, it records this fact in an in-memory mapping table. We will track the state of this mapping table in the diagrams as well:

Table (in memory):   100 → 0

Flash chip:
Block:       0             1             2
Page:      00 01 02 03  04 05 06 07  08 09 10 11
Content:   a1
State:      V  E  E  E   i  i  i  i   i  i  i  i

Now you can see what happens when the client writes to the SSD. The SSD finds a location for the write, usually just picking the next free page; it then programs that page with the block's contents, and records the logical-to-physical mapping in its mapping table. Subsequent reads simply use the table to translate the logical block address presented by the client into the physical page number required to read the data.
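Here is a minimal sketch of this write and read path (in the spirit of, though much simpler than, the chapter's ssd.py simulator; it assumes three 4-page blocks and omits garbage collection):

    PAGES_PER_BLOCK, NUM_BLOCKS = 4, 3

    class LogFTL:
        def __init__(self):
            npages = PAGES_PER_BLOCK * NUM_BLOCKS
            self.flash = [None] * npages     # physical page contents
            self.state = ['i'] * npages      # i=INVALID, E=ERASED, V=VALID
            self.mapping = {}                # logical block -> physical page
            self.log_head = 0                # next physical page to program

        def write(self, logical, contents):
            page = self.log_head
            if self.state[page] == 'i':      # must erase a block before use
                start = (page // PAGES_PER_BLOCK) * PAGES_PER_BLOCK
                for p in range(start, start + PAGES_PER_BLOCK):
                    self.state[p] = 'E'
            self.flash[page] = contents      # program the next free page
            self.state[page] = 'V'
            self.mapping[logical] = page     # remember where it went
            self.log_head += 1

        def read(self, logical):
            return self.flash[self.mapping[logical]]

    ftl = LogFTL()
    for lba, data in [(100, 'a1'), (101, 'a2'), (2000, 'b1'), (2001, 'b2')]:
        ftl.write(lba, data)
    print(ftl.read(100), ftl.mapping)   # a1 {100: 0, 101: 1, 2000: 2, 2001: 3}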

Let's now examine the rest of the writes in our example write stream: 101, 2000, and 2001. After writing these blocks, the state of the device is:

Table (in memory):   100 → 0   101 → 1   2000 → 2   2001 → 3

Flash chip:
Block:       0             1             2
Page:      00 01 02 03  04 05 06 07  08 09 10 11
Content:   a1 a2 b1 b2
State:      V  V  V  V   i  i  i  i   i  i  i  i

The log-based approach by its nature improves performance (erases only being required once in a while, and the costly read-modify-write of the direct-mapped approach avoided altogether), and greatly enhances reliability. The FTL can now spread writes across all pages, performing what is called wear leveling and increasing the lifetime of the device; we'll discuss wear leveling further below.


ASIDE: FTL MAPPING INFORMATION PERSISTENCE

You might be wondering: what happens if the device loses power? Does the in-memory mapping table disappear? Clearly, such information cannot truly be lost, because otherwise the device would not function as a persistent storage device. An SSD must have some means of recovering mapping information.

The simplest thing to do is to record some mapping information with each page, in what is called an out-of-band (OOB) area. When the device loses power and is restarted, it must reconstruct its mapping table by scanning the OOB areas and reconstructing the mapping table in memory. This basic approach has its problems; scanning a large SSD to find all necessary mapping information is slow. To overcome this limitation, some higher-end devices use more complex logging and checkpointing techniques to speed up recovery; learn more about logging by reading chapters on crash consistency and log-structured file systems [AD14].
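A small sketch of such a recovery scan, under the assumption that each page's OOB area records the logical block stored there (scanning in log order, so later entries overwrite stale ones):

    def rebuild_mapping(oob):
        # oob[p] is the logical block stored at physical page p, or None
        mapping = {}
        for phys_page, logical in enumerate(oob):
            if logical is not None:
                mapping[logical] = phys_page   # newest write wins
        return mapping

    # physical pages 0..5 hold logical 100, 101, 2000, 2001, then 100 again
    print(rebuild_mapping([100, 101, 2000, 2001, 100, None]))
    # {100: 4, 101: 1, 2000: 2, 2001: 3}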

Unfortunately, this basic approach to log structuring has some downsides. The first is that overwrites of logical blocks lead to something we call garbage, i.e., old versions of data around the drive and taking up space. The device has to periodically perform garbage collection (GC) to find said blocks and free space for future writes; excessive garbage collection drives up write amplification and lowers performance. The second is the high cost of in-memory mapping tables; the larger the device, the more memory such tables need. We now discuss each in turn.

44.8 Garbage Collection

The first cost of any log-structured approach such as this one is that garbage is created, and therefore garbage collection (i.e., dead-block reclamation) must be performed. Let's use our continued example to make sense of this. Recall that logical blocks 100, 101, 2000, and 2001 have been written to the device.

Now, let's assume that blocks 100 and 101 are written to again, with contents c1 and c2. The writes are written to the next free pages (in this case, physical pages 4 and 5), and the mapping table is updated accordingly. Note that the device must have first erased block 1 to make such programming possible:

Table (in memory):   100 → 4   101 → 5   2000 → 2   2001 → 3

Flash chip:
Block:       0             1             2
Page:      00 01 02 03  04 05 06 07  08 09 10 11
Content:   a1 a2 b1 b2  c1 c2
State:      V  V  V  V   V  V  E  E   i  i  i  i


The problem we have now should be obvious: physical pages 0 and 1, although marked VALID, have garbage in them, i.e., the old versions of blocks 100 and 101. Because of the log-structured nature of the device, overwrites create garbage blocks, which the device must reclaim to provide free space for new writes to take place.

The process of finding garbage blocks (also called dead blocks) and reclaiming them for future use is called garbage collection, and it is an important component of any modern SSD. The basic process is simple: find a block that contains one or more garbage pages, read in the live (non-garbage) pages from that block, write out those live pages to the log, and (finally) reclaim the entire block for use in writing.

Let's now illustrate with an example. The device decides it wants to reclaim any dead pages within block 0 above. Block 0 has two dead pages (pages 0 and 1) and two live pages (pages 2 and 3, which contain blocks 2000 and 2001, respectively). To do so, the device will:

• Read live data (pages 2 and 3) from block 0
• Write live data to end of the log
• Erase block 0 (freeing it for later usage)

For the garbage collector to function, there must be enough information within each block to enable the SSD to determine whether each page is live or dead. One natural way to achieve this end is to store, at some location within each block, information about which logical blocks are stored within each page. The device can then use the mapping table to determine whether each page within the block holds live data or not.

From our example above (before the garbage collection has taken place), block 0 held logical blocks 100, 101, 2000, 2001. By checking the mapping table (which, before garbage collection, contained 100→4, 101→5, 2000→2, 2001→3), the device can readily determine whether each of the pages within the SSD block holds live information. For example, 2000 and 2001 clearly are still pointed to by the map; 100 and 101 are not and therefore are candidates for garbage collection.
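Putting these pieces together, here is a sketch of the collection step itself (helper names like append_to_log are illustrative, not from any real FTL); run on the example state above, it migrates logical blocks 2000 and 2001 to pages 6 and 7 and frees block 0:

    PAGES_PER_BLOCK = 4

    def collect(block, flash, mapping, append_to_log):
        start = block * PAGES_PER_BLOCK
        for page in range(start, start + PAGES_PER_BLOCK):
            # a page is live iff some logical block still maps to it
            live = [lba for lba, phys in mapping.items() if phys == page]
            if live:
                mapping[live[0]] = append_to_log(flash[page])  # migrate it
        for page in range(start, start + PAGES_PER_BLOCK):
            flash[page] = None       # erase the victim block

    # state from the running example: block 0 holds stale 100/101, live 2000/2001
    flash = ['a1', 'a2', 'b1', 'b2', 'c1', 'c2'] + [None] * 6
    mapping = {100: 4, 101: 5, 2000: 2, 2001: 3}
    log_head = [6]

    def append_to_log(contents):
        flash[log_head[0]] = contents
        log_head[0] += 1
        return log_head[0] - 1

    collect(0, flash, mapping, append_to_log)
    print(mapping)   # {100: 4, 101: 5, 2000: 6, 2001: 7}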

When this garbage collection process is complete in our example, the state of the device is:

Table (in memory):   100 → 4   101 → 5   2000 → 6   2001 → 7

Flash chip:
Block:       0             1             2
Page:      00 01 02 03  04 05 06 07  08 09 10 11
Content:                c1 c2 b1 b2
State:      E  E  E  E   V  V  V  V   i  i  i  i

As you can see, garbage collection can be expensive, requiring reading and rewriting of live data. The ideal candidate for reclamation is a block that consists of only dead pages; in this case, the block can immediately be erased and used for new data, without expensive data migration.


ASIDE: A NEW STORAGE API KNOWN AS TRIM

When we think of hard drives, we usually just think of the most basic interface to read and write them: read and write (there is also usually some kind of cache flush command, ensuring that writes have actually been persisted, but sometimes we omit that for simplicity). With log-structured SSDs, and indeed, any device that keeps a flexible and changing mapping of logical-to-physical blocks, a new interface is useful, known as the trim operation.

The trim operation takes an address (and possibly a length) and simply informs the device that the block(s) specified by the address (and length) have been deleted; the device thus no longer has to track any information about the given address range. For a standard hard drive, trim isn't particularly useful, because the drive has a static mapping of block addresses to specific platter, track, and sector(s). For a log-structured SSD, however, it is highly useful to know that a block is no longer needed, as the SSD can then remove this information from the FTL and later reclaim the physical space during garbage collection.

Although we sometimes think of interface and implementation as separate entities, in this case, we see that the implementation shapes the interface. With complex mappings, knowledge of which blocks are no longer needed makes for a more effective implementation.

To reduce GC costs, some SSDs overprovision the device [A+08]; by adding extra flash capacity, cleaning can be delayed and pushed to the background, perhaps done at a time when the device is less busy. Adding more capacity also increases internal bandwidth, which can be used for cleaning and thus not harm perceived bandwidth to the client. Many modern drives overprovision in this manner, one key to achieving excellent overall performance.

44.9 Mapping Table Size

The second cost of log-structuring is the potential for extremely large mapping tables, with one entry for each 4-KB page of the device. With a large 1-TB SSD, for example, a single 4-byte entry per 4-KB page results in 1 GB of memory needed by the device, just for these mappings! Thus, this page-level FTL scheme is impractical.

Block-Based Mapping

One approach to reduce the costs of mapping is to keep only a pointer per block of the device, instead of per page, reducing the amount of mapping information by a factor of Size_block / Size_page. This block-level FTL is akin to having bigger page sizes in a virtual memory system; in that case, you use fewer bits for the VPN and have a larger offset in each virtual address.

Unfortunately, using a block-based mapping inside a log-based FTL does not work very well for performance reasons. The biggest problem arises when a "small write" occurs (i.e., one that is less than the size of a physical block). In this case, the FTL must read a large amount of live data from the old block and copy it into a new one (along with the data from the small write). This data copying increases write amplification greatly and thus decreases performance.

To make this issue more clear, let's look at an example. Assume the client previously wrote out logical blocks 2000, 2001, 2002, and 2003 (with contents a, b, c, d), and that they are located within physical block 1 at physical pages 4, 5, 6, and 7. With per-page mappings, the translation table would have to record four mappings for these logical blocks: 2000→4, 2001→5, 2002→6, 2003→7.

If, instead, we use block-level mapping, the FTL only needs to record a single address translation for all of this data. The address mapping, however, is slightly different than our previous examples. Specifically, we think of the logical address space of the device as being chopped into chunks that are the size of the physical blocks within the flash. Thus, the logical block address consists of two portions: a chunk number and an offset. Because we are assuming four logical blocks fit within each physical block, the offset portion of the logical addresses requires 2 bits; the remaining (most significant) bits form the chunk number.

Logical blocks 2000, 2001, 2002, and 2003 all have the same chunk number (500), and have different offsets (0, 1, 2, and 3, respectively). Thus, with a block-level mapping, the FTL records that chunk 500 maps to block 1 (starting at physical page 4), as shown in this diagram:

Table (in memory):   500 → 4

Flash chip:
Block:       0             1             2
Page:      00 01 02 03  04 05 06 07  08 09 10 11
Content:                 a  b  c  d
State:      i  i  i  i   V  V  V  V   i  i  i  i

In a block-based FTL, reading is easy. First, the FTL extracts the chunk number from the logical block address presented by the client, by taking the topmost bits out of the address. Then, the FTL looks up the chunk-number-to-physical-page mapping in the table. Finally, the FTL computes the address of the desired flash page by adding the offset from the logical address to the physical address of the block.

For example, if the client issues a read to logical address 2002, the device extracts the logical chunk number (500), looks up the translation in the mapping table (finding 4), and adds the offset from the logical address (2) to the translation (4). The resulting physical-page address (6) is where the data is located; the FTL can then issue the read to that physical address and obtain the desired data (c).
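This translation is simple enough to sketch directly (assuming this chapter's geometry of four pages per block, so the offset is the low 2 bits of the logical address):

    OFFSET_BITS = 2   # log2(pages per block)

    data_table = {500: 4}   # chunk number -> first physical page of the block

    def translate(logical_block):
        chunk  = logical_block >> OFFSET_BITS            # topmost bits
        offset = logical_block & ((1 << OFFSET_BITS) - 1)
        return data_table[chunk] + offset                # physical page address

    print(translate(2002))   # chunk 500, offset 2 -> physical page 6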

But what if the client writes to logical block 2002 (with contents c')? In this case, the FTL must read in 2000, 2001, and 2003, and then write out all four logical blocks in a new location, updating the mapping table accordingly. Block 1 (where the data used to reside) can then be erased and reused, as shown here.

Table (in memory):   500 → 8

Flash chip:
Block:       0             1             2
Page:      00 01 02 03  04 05 06 07  08 09 10 11
Content:                               a  b  c' d
State:      i  i  i  i   E  E  E  E   V  V  V  V

As you can see from this example, while block-level mappings greatly reduce the amount of memory needed for translations, they cause significant performance problems when writes are smaller than the physical block size of the device; as real physical blocks can be 256KB or larger, such writes are likely to happen quite often. Thus, a better solution is needed. Can you sense that this is the part of the chapter where we tell you what that solution is? Better yet, can you figure it out yourself, before reading on?

Hybrid Mapping

To enable flexible writing but also reduce mapping costs, many modern FTLs employ a hybrid mapping technique. With this approach, the FTL keeps a few blocks erased and directs all writes to them; these are called log blocks. Because the FTL wants to be able to write any page to any location within the log block without all the copying required by a pure block-based mapping, it keeps per-page mappings for these log blocks.

The FTL thus logically has two types of mapping table in its memory: a small set of per-page mappings in what we'll call the log table, and a larger set of per-block mappings in the data table. When looking for a particular logical block, the FTL will first consult the log table; if the logical block's location is not found there, the FTL will then consult the data table to find its location and then access the requested data.

The key to the hybrid mapping strategy is keeping the number of log blocks small. To keep the number of log blocks small, the FTL has to periodically examine log blocks (which have a pointer per page) and switch them into blocks that can be pointed to by only a single block pointer. This switch is accomplished by one of three main techniques, based on the contents of the block [KK+02].
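A sketch of this two-level lookup, using the table contents from the example that follows (log-table hits are page-mapped; everything else falls through to the block-mapped data table):

    OFFSET_BITS = 2   # four pages per block, as in this chapter's examples

    log_table  = {1000: 0, 1001: 1}   # logical block -> physical page
    data_table = {250: 8}             # chunk number  -> first physical page

    def lookup(logical_block):
        if logical_block in log_table:            # recent writes: page-mapped
            return log_table[logical_block]
        chunk  = logical_block >> OFFSET_BITS     # otherwise: block-mapped
        offset = logical_block & ((1 << OFFSET_BITS) - 1)
        return data_table[chunk] + offset

    print(lookup(1000))   # 0, via the log table
    print(lookup(1002))   # 10, via the data table (chunk 250, offset 2)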

For example, let's say the FTL had previously written out logical pages 1000, 1001, 1002, and 1003, and placed them in physical block 2 (physical pages 8, 9, 10, and 11); assume the contents of the writes to 1000, 1001, 1002, and 1003 are a, b, c, and d, respectively.

Log Table (in memory):    (empty)
Data Table (in memory):   250 → 8

Flash chip:
Block:       0             1             2
Page:      00 01 02 03  04 05 06 07  08 09 10 11
Content:                               a  b  c  d
State:      i  i  i  i   i  i  i  i   V  V  V  V

Now assume that the client overwrites each of these blocks (with data a', b', c', and d'), in the exact same order, in one of the currently available log blocks, say physical block 0 (physical pages 0, 1, 2, and 3). In this case, the FTL will have the following state:

Log Table (in memory):    1000 → 0   1001 → 1   1002 → 2   1003 → 3
Data Table (in memory):   250 → 8

Flash chip:
Block:       0             1             2
Page:      00 01 02 03  04 05 06 07  08 09 10 11
Content:   a' b' c' d'                 a  b  c  d
State:      V  V  V  V   i  i  i  i   V  V  V  V

Because these blocks have been written exactly in the same manner as before, the FTL can perform what is known as a switch merge. In this case, the log block (0) now becomes the storage location for logical blocks 1000, 1001, 1002, and 1003, and is pointed to by a single block pointer; the old block (2) is now erased and used as a log block. In this best case, all the per-page pointers are replaced by a single block pointer.

Log Table (in memory):    (empty)
Data Table (in memory):   250 → 0

Flash chip:
Block:       0             1             2
Page:      00 01 02 03  04 05 06 07  08 09 10 11
Content:   a' b' c' d'
State:      V  V  V  V   i  i  i  i   i  i  i  i

This switch merge is the best case for a hybrid FTL. Unfortunately, sometimes the FTL is not so lucky. Imagine the case where we have the same initial conditions (logical blocks 1000 ... 1003 stored in physical block 2) but then the client overwrites logical blocks 1000 and 1001.

What do you think happens in this case? Why is it more challenging to handle? (Think before looking at the result below.)


Log Table (in memory):    1000 → 0   1001 → 1
Data Table (in memory):   250 → 8

Flash chip:
Block:       0             1             2
Page:      00 01 02 03  04 05 06 07  08 09 10 11
Content:   a' b'                       a  b  c  d
State:      V  V  i  i   i  i  i  i   V  V  V  V

To reunite the other pages of this physical block, and thus be able to refer to them by only a single block pointer, the FTL performs what is called a partial merge. In this operation, logical blocks 1002 and 1003 are read from physical block 2, and then appended to the log. The resulting state of the SSD is the same as the switch merge above; however, in this case, the FTL had to perform extra I/O to achieve its goals, thus increasing write amplification.

The final case encountered by the FTL is known as a full merge, and requires even more work. In this case, the FTL must pull together pages from many other blocks to perform cleaning. For example, imagine that logical blocks 0, 4, 8, and 12 are written to log block A. To switch this log block into a block-mapped page, the FTL must first create a data block containing logical blocks 0, 1, 2, and 3, and thus the FTL must read 1, 2, and 3 from elsewhere and then write out 0, 1, 2, and 3 together. Next, the merge must do the same for logical block 4, finding 5, 6, and 7 and reconciling them into a single physical block. The same must be done for logical blocks 8 and 12, and then (finally) the log block A can be freed. Frequent full merges, as is not surprising, can seriously harm performance and thus should be avoided when at all possible [GY+09].

Page Mapping Plus Caching

Given the complexity of the hybrid approach above, others have suggested simpler ways to reduce the memory load of page-mapped FTLs. Probably the simplest is just to cache only the active parts of the FTL in memory, thus reducing the amount of memory needed [GY+09].

This approach can work well. For example, if a given workload only accesses a small set of pages, the translations of those pages will be stored in the in-memory FTL, and performance will be excellent without high memory cost. Of course, the approach can also perform poorly. If memory cannot contain the working set of necessary translations, each access will minimally require an extra flash read to first bring in the missing mapping before being able to access the data itself. Even worse, to make room for the new mapping, the FTL might have to evict an old mapping, and if that mapping is dirty (i.e., not yet written to the flash persistently), an extra write will also be incurred. However, in many cases, the workload will display locality, and this caching approach will both reduce memory overheads and keep performance high.
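Here is a rough sketch of such a translation cache (loosely inspired by DFTL [GY+09], but greatly simplified; the two flash-access callbacks are hypothetical placeholders for reading and persisting mapping entries):

    from collections import OrderedDict

    class CachedPageMap:
        def __init__(self, capacity, read_entry_from_flash, write_entry_to_flash):
            self.cache = OrderedDict()        # logical block -> (phys, dirty)
            self.capacity = capacity
            self.read_flash = read_entry_from_flash
            self.write_flash = write_entry_to_flash

        def lookup(self, logical):
            if logical not in self.cache:     # miss: costs an extra flash read
                self.cache[logical] = (self.read_flash(logical), False)
                self._evict_if_needed()
            self.cache.move_to_end(logical)   # LRU bookkeeping
            return self.cache[logical][0]

        def update(self, logical, phys):
            self.cache[logical] = (phys, True)   # dirty: not yet persisted
            self.cache.move_to_end(logical)
            self._evict_if_needed()

        def _evict_if_needed(self):
            while len(self.cache) > self.capacity:
                victim, (phys, dirty) = self.cache.popitem(last=False)
                if dirty:                     # dirty eviction: extra flash write
                    self.write_flash(victim, phys)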


44.10 Wear Leveling

Finally, a related background activity that modern FTLs must implement is wear leveling, as introduced above. The basic idea is simple: because multiple erase/program cycles will wear out a flash block, the FTL should try its best to spread that work across all the blocks of the device evenly. In this manner, all blocks will wear out at roughly the same time, instead of a few "popular" blocks quickly becoming unusable.

The basic log-structuring approach does a good initial job of spreading out write load, and garbage collection helps as well. However, sometimes a block will be filled with long-lived data that does not get over-written; in this case, garbage collection will never reclaim the block, and thus it does not receive its fair share of the write load.

To remedy this problem, the FTL must periodically read all the live data out of such blocks and re-write it elsewhere, thus making the block available for writing again. This process of wear leveling increases the write amplification of the SSD, and thus decreases performance as extra I/O is required to ensure that all blocks wear at roughly the same rate. Many different algorithms exist in the literature [A+08, M+14]; read more if you are interested.
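One plausible policy sketch (purely illustrative; real FTLs use many variants, as surveyed in [A+08, M+14]): track per-block erase counts, and pick an under-worn block holding cold data as the next candidate for forced migration:

    def pick_wear_leveling_victim(erase_counts, threshold=0.5):
        # erase_counts: per-block program/erase counts; a block far below the
        # average is likely full of cold data that GC will never touch
        avg = sum(erase_counts) / len(erase_counts)
        candidates = [b for b, n in enumerate(erase_counts) if n < threshold * avg]
        return min(candidates, key=lambda b: erase_counts[b]) if candidates else None

    print(pick_wear_leveling_victim([100, 120, 3, 95]))   # block 2 is under-worn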

44.11 SSD Performance And Cost

Before closing, let's examine the performance and cost of modern SSDs, to better understand how they will likely be used in persistent storage systems. In both cases, we'll compare to classic hard-disk drives (HDDs), and highlight the biggest differences between the two.

Performance

Unlike hard disk drives, flash-based SSDs have no mechanical components, and in fact are in many ways more similar to DRAM, in that they are "random access" devices. The biggest difference in performance, as compared to disk drives, is realized when performing random reads and writes; while a typical disk drive can only perform a few hundred random I/Os per second, SSDs can do much better. Here, we use some data from modern SSDs to see just how much better SSDs perform; we're particularly interested in how well the FTLs hide the performance issues of the raw chips.

Figure 44.4 shows some performance data for three different SSDs and one top-of-the-line hard drive; the data was taken from a few different online sources [S13, T15]. The left two columns show random I/O performance, and the right two columns sequential; the first three rows show data for three different SSDs (from Samsung, Seagate, and Intel), and the last row shows performance for a hard disk drive (or HDD), in this case a Seagate high-end drive.

                              Random            Sequential
                           Reads   Writes    Reads   Writes
Device                     (MB/s)  (MB/s)    (MB/s)  (MB/s)
Samsung 840 Pro SSD          103     287       421     384
Seagate 600 SSD               84     252       424     374
Intel SSD 335 SSD             39     222       344     354
Seagate Savvio 15K.3 HDD       2       2       223     223

Figure 44.4: SSDs And Hard Drives: Performance Comparison

We can learn a few interesting facts from the table. First, and most dramatic, is the difference in random I/O performance between the SSDs and the lone hard drive. While the SSDs obtain tens or even hundreds of MB/s in random I/Os, this "high performance" hard drive has a peak of just a couple MB/s (in fact, we rounded up to get to 2 MB/s). Second, you can see that in terms of sequential performance, there is much less of a difference; while the SSDs perform better, a hard drive is still a good choice if sequential performance is all you need. Third, you can see that SSD random read performance is not as good as SSD random write performance. The reason for such unexpectedly good random-write performance is due to the log-structured design of many SSDs, which transforms random writes into sequential ones and improves performance. Finally, because SSDs exhibit some performance difference between sequential and random I/Os, many of the techniques we will learn in subsequent chapters about how to build file systems for hard drives are still applicable to SSDs; although the magnitude of difference between sequential and random I/Os is smaller, there is enough of a gap to carefully consider how to design file systems to reduce random I/Os.

Cost

As we saw above, the performance of SSDs greatly outstrips modern hard drives, even when performing sequential I/O. So why haven't SSDs completely replaced hard drives as the storage medium of choice? The answer is simple: cost, or more specifically, cost per unit of capacity. Currently [A15], an SSD costs something like $150 for a 250-GB drive; such an SSD costs 60 cents per GB. A typical hard drive costs roughly $50 for 1 TB of storage, which means it costs 5 cents per GB. There is still more than a 10x difference in cost between these two storage media.

These performance and cost differences dictate how large-scale storage systems are built. If performance is the main concern, SSDs are a terrific choice, particularly if random read performance is important. If, on the other hand, you are assembling a large data center and wish to store massive amounts of information, the large cost difference will drive you towards hard drives. Of course, a hybrid approach can make sense: some storage systems are being assembled with both SSDs and hard drives, using a smaller number of SSDs for more popular "hot" data and delivering high performance, while storing the rest of the "colder" (less used) data on hard drives to save on cost. As long as the price gap exists, hard drives are here to stay.


44.12 Summary

Flash-based SSDs are becoming a common presence in laptops, desktops, and servers inside the datacenters that power the world's economy. Thus, you should probably know something about them, right?

Here's the bad news: this chapter (like many in this book) is just the first step in understanding the state of the art. Some places to get some more information about the raw technology include research on actual device performance (such as that by Chen et al. [CK+09] and Grupp et al. [GC+09]), issues in FTL design (including works by Agrawal et al. [A+08], Gupta et al. [GY+09], Huang et al. [H+14], Kim et al. [KK+02], Lee et al. [L+07], and Zhang et al. [Z+12]), and even distributed systems comprised of flash (including Gordon [CG+09] and CORFU [B+12]). And, if we may say so, a really good overview of all the things you need to do to extract high performance from an SSD can be found in a paper on the "unwritten contract" [HK+17].

Don't just read academic papers; also read about recent advances in the popular press (e.g., [V12]). Therein you'll learn more practical (but still useful) information, such as Samsung's use of both TLC and SLC cells within the same SSD to maximize performance (SLC can buffer writes quickly) as well as capacity (TLC can store more bits per cell). And this is, as they say, just the tip of the iceberg. Dive in and learn more about this "iceberg" of research on your own, perhaps starting with Ma et al.'s excellent (and recent) survey [M+14]. Be careful though; icebergs can sink even the mightiest of ships [W15].


ASIDE: KEY SSD TERMS

• A flash chip consists of many banks, each of which is organized into erase blocks (sometimes just called blocks). Each block is further subdivided into some number of pages.

• Blocks are large (128KB–2MB) and contain many pages, which are relatively small (1KB–8KB).

• To read from flash, issue a read command with an address and length; this allows a client to read one or more pages.

• Writing flash is more complex. First, the client must erase the entire block (which deletes all information within the block). Then, the client can program each page exactly once, thus completing the write.

• A new trim operation is useful to tell the device when a particular block (or range of blocks) is no longer needed.

• Flash reliability is mostly determined by wear out; if a block is erased and programmed too often, it will become unusable.

• A flash-based solid-state storage device (SSD) behaves as if it were a normal block-based read/write disk; by using a flash translation layer (FTL), it transforms reads and writes from a client into reads, erases, and programs to underlying flash chips.

• Most FTLs are log-structured, which reduces the cost of writing by minimizing erase/program cycles. An in-memory translation layer tracks where logical writes were located within the physical medium.

• One key problem with log-structured FTLs is the cost of garbage collection, which leads to write amplification.

• Another problem is the size of the mapping table, which can become quite large. Using a hybrid mapping or just caching hot pieces of the FTL are possible remedies.

• One last problem is wear leveling; the FTL must occasionally migrate data from blocks that are mostly read in order to ensure said blocks also receive their share of the erase/program load.


References

[A+08] "Design Tradeoffs for SSD Performance" by N. Agrawal, V. Prabhakaran, T. Wobber, J. D. Davis, M. Manasse, R. Panigrahy. USENIX '08, San Diego, California, June 2008. An excellent overview of what goes into SSD design.

[AD14] "Operating Systems: Three Easy Pieces" by Remzi Arpaci-Dusseau and Andrea Arpaci-Dusseau. Chapters: Crash Consistency: FSCK and Journaling, and Log-Structured File Systems. A lot more detail here about how logging can be used in file systems; some of the same ideas can be applied inside devices too as need be.

[A15] "Amazon Pricing Study" by Remzi Arpaci-Dusseau. February, 2015. This is not an actual paper, but rather one of the authors going to Amazon and looking at current prices of hard drives and SSDs. You too can repeat this study, and see what the costs are today. Do it!

[B+12] "CORFU: A Shared Log Design for Flash Clusters" by M. Balakrishnan, D. Malkhi, V. Prabhakaran, T. Wobber, M. Wei, J. D. Davis. NSDI '12, San Jose, California, April 2012. A new way to think about designing a high-performance replicated log for clusters using Flash.

[BD10] "Write Endurance in Flash Drives: Measurements and Analysis" by Simona Boboila, Peter Desnoyers. FAST '10, San Jose, California, February 2010. A cool paper that reverse engineers flash-device lifetimes. Endurance sometimes far exceeds manufacturer predictions, by up to 100x.

[B07] "ZFS: The Last Word in File Systems" by Jeff Bonwick and Bill Moore. Available here: http://www.ostep.org/Citations/zfs_last.pdf. Was this the last word in file systems? No, but maybe it's close.

[CG+09] "Gordon: Using Flash Memory to Build Fast, Power-efficient Clusters for Data-intensive Applications" by Adrian M. Caulfield, Laura M. Grupp, Steven Swanson. ASPLOS '09, Washington, D.C., March 2009. Early research on assembling flash into larger-scale clusters; definitely worth a read.

[CK+09] "Understanding Intrinsic Characteristics and System Implications of Flash Memory based Solid State Drives" by Feng Chen, David A. Koufaty, and Xiaodong Zhang. SIGMETRICS/Performance '09, Seattle, Washington, June 2009. An excellent overview of SSD performance problems circa 2009 (though now a little dated).

[G14] "The SSD Endurance Experiment" by Geoff Gasior. The Tech Report, September 19, 2014. Available: http://techreport.com/review/27062. A nice set of simple experiments measuring performance of SSDs over time. There are many other similar studies; use google to find more.

[GC+09] "Characterizing Flash Memory: Anomalies, Observations, and Applications" by L. M. Grupp, A. M. Caulfield, J. Coburn, S. Swanson, E. Yaakobi, P. H. Siegel, J. K. Wolf. IEEE MICRO '09, New York, New York, December 2009. Another excellent characterization of flash performance.

[GY+09] "DFTL: A Flash Translation Layer Employing Demand-Based Selective Caching of Page-Level Address Mappings" by Aayush Gupta, Youngjae Kim, Bhuvan Urgaonkar. ASPLOS '09, Washington, D.C., March 2009. This paper gives an excellent overview of different strategies for cleaning within hybrid SSDs as well as a new scheme which saves mapping table space and improves performance under many workloads.

[HK+17] "The Unwritten Contract of Solid State Drives" by Jun He, Sudarsun Kannan, Andrea C. Arpaci-Dusseau, Remzi H. Arpaci-Dusseau. EuroSys '17, Belgrade, Serbia, April 2017. Our own paper which lays out five rules clients should follow in order to get the best performance out of modern SSDs. The rules are request scale, locality, aligned sequentiality, grouping by death time, and uniform lifetime. Read the paper for details!

[H+14] "An Aggressive Worn-out Flash Block Management Scheme To Alleviate SSD Performance Degradation" by Ping Huang, Guanying Wu, Xubin He, Weijun Xiao. EuroSys '14, 2014. Recent work showing how to really get the most out of worn-out flash blocks; neat!

[J10] "Failure Mechanisms and Models for Semiconductor Devices" by Unknown author. Report JEP122F, November 2010. Available on the internet at this exciting so-called web site: http://www.jedec.org/sites/default/files/docs/JEP122F.pdf. A highly detailed discussion of what is going on at the device level and how such devices fail. Only for those not faint of heart. Or physicists. Or both.

[KK+02] "A Space-Efficient Flash Translation Layer For Compact Flash Systems" by Jesung Kim, Jong Min Kim, Sam H. Noh, Sang Lyul Min, Yookun Cho. IEEE Transactions on Consumer Electronics, Volume 48, Number 2, May 2002. One of the earliest proposals to suggest hybrid mappings.

[L+07] "A Log Buffer-Based Flash Translation Layer Using Fully-Associative Sector Translation" by Sang-won Lee, Tae-Sun Chung, Dong-Ho Lee, Sangwon Park, Ha-Joo Song. ACM Transactions on Embedded Computing Systems, Volume 6, Number 3, July 2007. A terrific paper about how to build hybrid log/block mappings.

[M+14] "A Survey of Address Translation Technologies for Flash Memories" by Dongzhe Ma, Jianhua Feng, Guoliang Li. ACM Computing Surveys, Volume 46, Number 3, January 2014. Probably the best recent survey of flash and related technologies.

[S13] "The Seagate 600 and 600 Pro SSD Review" by Anand Lal Shimpi. AnandTech, May 7, 2013. Available: http://www.anandtech.com/show/6935/seagate-600-ssd-review. One of many SSD performance measurements available on the internet. Haven't heard of the internet? No problem. Just go to your web browser and type "internet" into the search tool. You'll be amazed at what you can learn.

[T15] "Performance Charts Hard Drives" by Tom's Hardware. January 2015. Available here: http://www.tomshardware.com/charts/enterprise-hdd-charts. Yet another site with performance data, this time focusing on hard drives.

[V12] "Understanding TLC Flash" by Kristian Vatto. AnandTech, September, 2012. Available: http://www.anandtech.com/show/5067/understanding-tlc-nand. A short description about TLC flash and its characteristics.

[W15] "List of Ships Sunk by Icebergs" by Many authors. Available at this location on the "web": http://en.wikipedia.org/wiki/List_of_ships_sunk_by_icebergs. Yes, there is a wikipedia page about ships sunk by icebergs. It is a really boring page and basically everyone knows the only ship the iceberg-sinking-mafia cares about is the Titanic.

[Z+12] "De-indirection for Flash-based SSDs with Nameless Writes" by Yiying Zhang, Leo Prasath Arulraj, Andrea C. Arpaci-Dusseau, Remzi H. Arpaci-Dusseau. FAST '13, San Jose, California, February 2013. Our research on a new idea to reduce mapping table space; the key is to re-use the pointers in the file system above to store locations of blocks, instead of adding another level of indirection.


Homework (Simulation)

This section introduces ssd.py, a simple SSD simulator you can use to understand better how SSDs work. Read the README for details on how to run the simulator. It is a long README, so boil a cup of tea (caffeinated likely necessary), put on your reading glasses, let the cat curl up on your lap[1], and get to work.

[1] Now you might complain, "But I'm a dog person!" To this, we say, too bad! Get a cat, put it on your lap, and do the homework! How else will you learn, if you can't even follow the most basic of instructions?

Questions

1. The homework will mostly focus on the log-structured SSD, which is simulated with the "-T log" flag. We'll use the other types of SSDs for comparison. First, run with flags -T log -s 1 -n 10 -q. Can you figure out which operations took place? Use -c to check your answers (or just use -C instead of -q -c). Use different values of -s to generate different random workloads.

2. Now just show the commands and see if you can figure out the intermediate states of the Flash. Run with flags -T log -s 2 -n 10 -C to show each command. Now, determine the state of the Flash between each command; use -F to show the states and see if you were right. Use different random seeds to test your burgeoning expertise.

3. Let's make this problem ever so slightly more interesting by adding the -r 20 flag. What differences does this cause in the commands? Use -c again to check your answers.

4. Performance is determined by the number of erases, programs, and reads (we assume here that trims are free). Run the same workload again as above, but without showing any intermediate states (e.g., -T log -s 1 -n 10). Can you estimate how long this workload will take to complete? (default erase time is 1000 microseconds, program time is 40, and read time is 10) Use the -S flag to check your answer. You can also change the erase, program, and read times with the -E, -W, -R flags.

5. Now, compare performance of the log-structured approach and the (very bad) direct approach (-T direct instead of -T log). First, estimate how you think the direct approach will perform, then check your answer with the -S flag. In general, how much better will the log-structured approach perform than the direct one?

6. Let us next explore the behavior of the garbage collector. To do so, we have to set the high (-G) and low (-g) watermarks appropriately. First, let's observe what happens when you run a larger workload to the log-structured SSD but without any garbage collection. To do this, run with flags -T log -n 1000 (the high watermark default is 10, so the GC won't run in this configuration). What do you think will happen? Use -C and perhaps -F to see.

7. To turn on the garbage collector, use lower values. The high watermark (-G N) tells the system to start collecting once N blocks have been used; the low watermark (-g M) tells the system to stop collecting once there are only M blocks in use. What watermark values do you think will make for a working system? Use -C and -F to show the commands and intermediate device states and see.

8. One other useful flag is -J, which shows what the collector is doing when it runs. Run with flags -T log -n 1000 -C -J to see both the commands and the GC behavior. What do you notice about the GC? The final effect of GC, of course, is performance. Use -S to look at final statistics; how many extra reads and writes occur due to garbage collection? Compare this to the ideal SSD (-T ideal); how much extra reading, writing, and erasing is there due to the nature of Flash? Compare it also to the direct approach; in what way (erases, reads, programs) is the log-structured approach superior?

9. One last aspect to explore is workload skew. Adding skew to the workload changes writes such that more writes occur to some smaller fraction of the logical block space. For example, running with -K 80/20 makes 80% of the writes go to 20% of the blocks. Pick some different skews and perform many randomly-chosen operations (e.g., -n 1000), using first -T direct to understand the skew, and then -T log to see the impact on a log-structured device. What do you expect will happen? One other small skew control to explore is -k 100; by adding this flag to a skewed workload, the first 100 writes are not skewed. The idea is to first create a lot of data, but then only update some of it. What impact might that have upon a garbage collector?
