Understanding the Robustness of SSDs under Power Fault
Mai Zheng*, Joseph Tucek**, Feng Qin*, Mark Lillibridge**
The Ohio State University*, HP Labs**
Daeyeon Son, Embedded System Lab., 2015.6.9
Slide 2
Introduction
Power failures occur in many environments where computers are used. A computer, including its disks, runs on electric power. In an SSD, data is stored as charge (electrons) in the floating gate of each flash cell, and that charge determines the cell's threshold voltage. If a blackout occurs while data is being transferred to a flash cell, the reliability of the data is not guaranteed. The paper investigates what state the data is left in after a power failure on a variety of commodity SSDs.
Slide 3
Power fault: About the power fault
Power faults have many different causes. In a blackout, if the datacenter has not prepared an uninterruptible power supply, its computers shut down immediately. Data on disk can become unstable or be lost during the blackout.
Ref. Southern California Edison
Slide 4
Power fault: Architecture of a solid-state drive
A solid-state drive runs firmware that operates the device. The SSD controller sits at the center of the SSD and handles data transfer: it stores data from the host in flash memory. The FTL (Flash Translation Layer) is implemented by the firmware in the SSD. A user-level program cannot trace the translation layer, because it works at the device level.
Slide 5
Power fault: Steps of data transfer in SSDs
A power fault can occur at any step of data processing in an SSD. However, the exact point of failure inside the device cannot be known without seeing its data flow design. The diagram is only an abstraction of the data flow in an SSD.
Figure labels (timeline): Time; Map table updates in DRAM (Updates); Map table sync from DRAM to NAND (Sync); Next Sync; Power Loss; Roll back up to this point; Aggressive roll back point.
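The timeline can be read as "map updates made in DRAM after the last sync are lost when power fails". The following is a toy sketch of that reading, not the design of any real FTL. The class name `ToyFTL`, the `sync_every` parameter, and the choice to sync after a fixed number of updates are all assumptions made for illustration.

```python
class ToyFTL:
    """Toy model: the logical-to-physical map is updated in DRAM (volatile)
    and copied to NAND (persistent) every `sync_every` updates."""

    def __init__(self, sync_every=4):
        self.dram_map = {}            # lost on power failure
        self.nand_map = {}            # survives power failure
        self.sync_every = sync_every  # hypothetical sync policy
        self.updates_since_sync = 0

    def write(self, lbn, ppn):
        # Record the new mapping in DRAM only.
        self.dram_map[lbn] = ppn
        self.updates_since_sync += 1
        if self.updates_since_sync >= self.sync_every:
            self.sync()

    def sync(self):
        # Map table sync from DRAM to NAND.
        self.nand_map = dict(self.dram_map)
        self.updates_since_sync = 0

    def power_loss(self):
        # Everything not yet synced is rolled back to the last sync point.
        lost = {lbn: ppn for lbn, ppn in self.dram_map.items()
                if self.nand_map.get(lbn) != ppn}
        self.dram_map = dict(self.nand_map)
        self.updates_since_sync = 0
        return lost


ftl = ToyFTL(sync_every=4)
for lbn in range(6):
    ftl.write(lbn, ppn=100 + lbn)
lost = ftl.power_loss()
print(sorted(lost))  # -> [4, 5]
```

In this toy run the first four updates reach NAND at the sync, and the two updates made afterwards are the ones rolled back.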
Slide 6
Power fault
SSDs are not Evangelion. However, some of them have a similar battery (a supercapacitor). When the power supply to the computer is suddenly cut, the SSD uses this stored energy to flush data from its cache to the flash cells. The emergency transfer time is very short, and it is only used to flush the data so that it is reflected in the FTL. But if, unfortunately, the data is not written to the SSD, the administrator of the computer will cry over the desk.
SSDs can transfer data for some time in a power fault situation.
Evangelion can move for 5 minutes in a power fault situation.
Slide 7
Symptom of power fault in SSDs
In the paper, the authors hypothesized which types of error a power fault can cause in SSDs. If the data in the device is not protected, the result is a synchronization problem.
1. The data is not written at all because of the power loss.
2. The metadata is corrupted, for example the mapping table.
3. The charge in the floating gate does not reach the target value.
4. The write is left incomplete because the device "works like a sponge".
5. The power supply fails because of a hardware problem.
How can the power fault situation be observed in real time?
Slide 8
Symptom of power fault in SSDs: Detail Explanation (1)
Slide 9
Symptom of power fault in SSDs: Detail Explanation (2)
Slide 10
Symptom of power fault in SSDs: Unserializability
Identifying the unserializability type requires comparing multiple records. An unserialized write is a violation of the synchronous ordering constraint on write requests issued by one thread or between multiple threads.
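A minimal sketch of the single-thread case follows. It assumes each worker issues synchronous writes numbered 0, 1, 2, ... and that every write goes to a distinct block, so a missing earlier write next to a surviving later write from the same worker indicates a violation. The function name and the `(worker_id, op_count)` tuple format are hypothetical, and the cross-thread case is not covered.

```python
def find_unserializable(recovered):
    """recovered: (worker_id, op_count) pairs found on the device after the fault.

    Returns (worker_id, missing_op, newest_op) for every earlier write that is
    missing although a later write by the same worker survived.
    """
    by_worker = {}
    for worker_id, op_count in recovered:
        by_worker.setdefault(worker_id, set()).add(op_count)

    violations = []
    for worker_id, ops in sorted(by_worker.items()):
        newest = max(ops)
        for op in range(newest):
            if op not in ops:
                violations.append((worker_id, op, newest))
    return violations


recovered = [(0, 0), (0, 1), (0, 3), (1, 0), (1, 1)]
print(find_unserializable(recovered))  # -> [(0, 2, 3)]
```

Here worker 0's write number 3 is on the device while its write number 2 is not, which breaks the order in which the synchronous writes were issued.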
Slide 11
Test Workload: Workload Format
The workload header is 512 bytes long, and its fields are laid out contiguously in every record. (512 bytes is the programming unit in some SSDs.)
Seed: the input to the random number generator that produces the raw block number.
Raw block number: the original 64-bit random number.
Block number: the LBN (Logical Block Number), which can be checked at user level in the system architecture. (Workload size = 512 bytes * N)
Worker ID: the tid of the thread, used for multi-threaded I/O.
Operation Count: the number of writes issued by that worker.
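As an illustration of a fixed 512-byte header, the sketch below packs the fields listed on this slide with Python's `struct` module. The field order, the field widths, the little-endian layout, and the zero padding are assumptions; the slide does not give the byte layout.

```python
import struct

HEADER_SIZE = 512

# Hypothetical layout: seed, raw block number, block number (64-bit each),
# then worker ID and operation count (32-bit each), little-endian, unpadded.
FIELDS = struct.Struct("<QQQII")


def pack_header(seed, raw_block, block, worker_id, op_count):
    """Pack the fields and pad with zero bytes up to 512 bytes."""
    body = FIELDS.pack(seed, raw_block, block, worker_id, op_count)
    return body.ljust(HEADER_SIZE, b"\x00")


def unpack_header(header):
    """Return (seed, raw_block, block, worker_id, op_count)."""
    if len(header) != HEADER_SIZE:
        raise ValueError("header must be exactly 512 bytes")
    return FIELDS.unpack_from(header)


hdr = pack_header(seed=3, raw_block=0x123456789ABCDEF0, block=42,
                  worker_id=1, op_count=7)
print(len(hdr), unpack_header(hdr))
```

A fixed-size header lets the checker read any 512-byte unit back from the device and decode it without knowing which record it belongs to.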
Slide 12
Test Workload: Workload Format
Checksum: a CRC (Cyclic Redundancy Check) used to verify that the data is still in its original state.
Timestamp: the time at which the data was written.
Marker: a unique string used to find the boundary of a header.
However, the standard random number generation algorithm is affected by time, and time is not stable. Instead, a hash function in counter mode is used to generate a unique random sequence.
Seed    : 3, 8, 2, 4, 2, 5, 7, ..., 9, 1
Worker ID : 0, 0, 1, 1, 1, 2, 2, ..., 0, 2
Op_cnt   : 0, 1, 0, 1, 2, 0, 1, ..., 9, 7
These values go into the hash function in CTR mode, so the order of the writes can be traced exactly. Very important!
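The idea of a hash function in counter mode can be sketched as hashing the tuple (seed, worker ID, operation count), so the same inputs always reproduce the same raw block number regardless of time. SHA-256, the packing format, and the modulo reduction to a block number are assumptions chosen for this sketch; the slide does not name the hash. `zlib.crc32` stands in for the CRC checksum.

```python
import hashlib
import struct
import zlib


def raw_block_number(seed, worker_id, op_count):
    """Hash the counter tuple and keep the first 64 bits as the raw block number."""
    counter = struct.pack("<QII", seed, worker_id, op_count)
    digest = hashlib.sha256(counter).digest()
    return int.from_bytes(digest[:8], "little")


def block_number(raw, total_blocks):
    """Reduce the 64-bit raw number to a logical block number on the device."""
    return raw % total_blocks


def checksum(payload):
    """CRC-32 of the payload, as an unsigned 32-bit integer."""
    return zlib.crc32(payload) & 0xFFFFFFFF


# The checker can regenerate a worker's whole sequence from the seed alone.
sequence = [block_number(raw_block_number(3, 1, n), 1_000_000) for n in range(3)]
print(sequence)
```

Because the sequence depends only on the counter values, the checker can recompute which block each write was meant for and compare it with what is found on the device after the fault.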
Slide 13
Test Workload: Protecting data from compression by the SSD
Some SSDs compress data automatically in an advanced flash translation layer. Compression can change the original layout of the workload records. As a workaround, the data is XORed with a random mask so that it cannot be compressed.
(Figure: Original Data, Compressed Data)
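The effect of the XOR mask can be sketched as follows. The mask here comes from Python's seeded `random.Random`, which is an assumption for illustration, and `zlib` stands in for whatever compression a drive might apply. XOR with the same mask is its own inverse, so the checker can recover the original record.

```python
import random
import zlib


def xor_mask(data, mask_seed):
    """XOR data with a pseudo-random mask derived from mask_seed."""
    rng = random.Random(mask_seed)
    mask = bytes(rng.getrandbits(8) for _ in range(len(data)))
    return bytes(a ^ b for a, b in zip(data, mask))


record = b"header" * 512                 # highly repetitive, compresses well
masked = xor_mask(record, mask_seed=1234)

# Applying the same mask again restores the original bytes.
assert xor_mask(masked, 1234) == record

print(len(record), len(zlib.compress(record)), len(zlib.compress(masked)))
```

The repetitive record shrinks to a small fraction of its size, while the masked record stays at roughly its original length, which is the property the workload needs.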
Slide 14
Test Workload: Workload Type
Concurrent random writes
Multi-threaded I/O. The starting point in memory is random. The workload length is random.
(Figure: Memory Map)
Slide 15
Test Workload: Workload Type
Concurrent sequential writes
Multi-threaded I/O. The starting point in memory is random. The workload length is continuous.
(Figure: Memory Map)
Slide 16
Test Workload: Workload Type
Single-threaded sequential writes
Single-threaded I/O. The starting point in memory is random. The workload length is continuous.
(Figure: Memory Map)
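The three workload types can be sketched as request generators. This sketch reads "continuous" as "requests placed back to back from a random starting block", which is an interpretation rather than something the slides state. The function names, the request counts, and the unit (512-byte blocks) are assumptions.

```python
import random


def random_writes(rng, device_blocks, count, max_len):
    """Each request gets a random start block and a random length."""
    requests = []
    for _ in range(count):
        length = rng.randint(1, max_len)
        start = rng.randrange(device_blocks - length + 1)
        requests.append((start, length))
    return requests


def sequential_writes(rng, device_blocks, count, length):
    """Random start block, then fixed-length requests placed back to back."""
    start = rng.randrange(device_blocks - count * length + 1)
    return [(start + i * length, length) for i in range(count)]


def plan(workload, workers, device_blocks=1_000_000):
    """Build one request list per worker thread, seeded by the worker ID."""
    plans = {}
    for worker_id in range(workers):
        rng = random.Random(worker_id)
        if workload == "random":
            plans[worker_id] = random_writes(rng, device_blocks, count=4, max_len=8)
        else:
            plans[worker_id] = sequential_writes(rng, device_blocks, count=4, length=8)
    return plans


concurrent_random = plan("random", workers=3)
concurrent_sequential = plan("sequential", workers=3)
single_sequential = plan("sequential", workers=1)
print(single_sequential[0])
```

The single-threaded case is simply the sequential plan with one worker.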
Slide 17
Test Environment: Building a device for the test
Normally the SSD is powered only through the power supply in the computer. However, a SATA SSD uses separate cables for power and for data transfer. A custom-made power cable is controlled by the scheduler in software, which cuts the power to the SSD aperiodically. The power fault test is repeated more than a thousand times.
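The aperiodic power cut can be sketched as a loop that powers the drive on, waits a random time while the workload runs, and then cuts the power. `FakeSwitcher` is a stand-in that only logs calls, and its `power_on`/`power_off` methods, the delay range, and the injected `sleep` function are all hypothetical. A real setup would drive the hardware that sits on the custom power cable.

```python
import random


class FakeSwitcher:
    """Stand-in for the hardware switch on the SSD power cable (hypothetical API)."""

    def __init__(self):
        self.log = []

    def power_on(self):
        self.log.append("on")

    def power_off(self):
        self.log.append("off")


def run_fault_cycles(switcher, cycles, rng, min_delay=0.5, max_delay=5.0,
                     sleep=lambda seconds: None):
    """Cut the power after a random delay in each cycle; return the delays used."""
    delays = []
    for _ in range(cycles):
        switcher.power_on()
        delay = rng.uniform(min_delay, max_delay)
        sleep(delay)          # the workers would be writing during this window
        switcher.power_off()
        delays.append(delay)
    return delays


switcher = FakeSwitcher()
delays = run_fault_cycles(switcher, cycles=1000, rng=random.Random(0))
print(len(delays), switcher.log[:4])  # -> 1000 ['on', 'off', 'on', 'off']
```

Passing `time.sleep` as the `sleep` argument would make the loop wait in real time. The default does nothing so the sketch finishes instantly.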
Slide 18
Test Environment: Power fault injection
Slide 19
Test Environment: Components of the framework
1. Scheduler: manages the testing framework.
2. Checker: verifies the result of the workloads applied to the SSDs.
3. Worker: injects the workload through multiple threads.
4. Switcher: controls the power cable of the SSD.
I/O scheduler: noop (no operation)
Write options: 1. O_SYNC (for data synchronization) 2. O_DIRECT (to bypass the cache)
(Figure: Hardware Layer, Software Layer)
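A worker's write path with the two options above might look like the following sketch. It assumes a Linux host, where `os.O_DIRECT` is available, and a device whose logical block size is 512 bytes. The helper names are hypothetical. An anonymous `mmap` is used only because it yields a page-aligned buffer, which O_DIRECT transfers generally require.

```python
import mmap
import os

SECTOR = 512


def write_flags():
    """O_SYNC for data synchronization, O_DIRECT to bypass the page cache."""
    # Both constants are platform-dependent; fall back to 0 where missing.
    return os.O_WRONLY | getattr(os, "O_SYNC", 0) | getattr(os, "O_DIRECT", 0)


def aligned_buffer(payload):
    """Copy payload into a page-aligned buffer backed by anonymous memory."""
    if not payload or len(payload) % SECTOR != 0:
        raise ValueError("payload length must be a positive multiple of 512 bytes")
    buf = mmap.mmap(-1, len(payload))
    buf.write(payload)
    return buf


def sync_direct_write(path, block_number, payload):
    """Write payload at byte offset block_number * 512; return bytes written."""
    fd = os.open(path, write_flags())
    try:
        buf = aligned_buffer(payload)
        return os.pwrite(fd, buf, block_number * SECTOR)
    finally:
        os.close(fd)
```

With both flags set, a successful return from the write means the request was sent to the device rather than left in the host's cache, which is what allows the checker to attribute missing data to the SSD itself.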
Slide 20
Test Environment: Target Device
In general, vendors publish specifications for their SSDs. The specifications include a lot of information about the hardware design. However, they do not describe the data mapping structure, which is called the 'FTL'. The flash translation layer is a very important component in SSDs: logical page numbers are mapped to physical page numbers through the FTL. Vendors never open the firmware source, because it is tied to device performance.
Slide 21
Result: Summary of test results
No flying writes occurred on the SSDs, because the flash translation layer is very stable in all target devices. However, SSD#3 shut down in some power fault situations, and it has a metadata corruption problem. SSD#1 has a critical problem: it became a dead device, and the data on the drive could not be recovered.
Slide 22
Result: Summary of test results
Figure 7 shows why the workload header length was chosen. Shorn writes are caused by the programming unit in some SSDs, so the workload test needs to follow the programming unit specified for the flash cells in the SSD. Figure 8 shows that whether an SSD fails under power fault is not related to its price or performance.
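To illustrate why 512-byte units matter for detecting shorn writes, the sketch below checks a multi-sector block unit by unit. It assumes, purely for illustration, that the first byte of every 512-byte unit carries a version tag, so the checker can tell new units from old ones. The tag scheme and the function name are hypothetical.

```python
SECTOR = 512


def classify_block(block, new_tag, old_tag):
    """Classify a block by the version tag at the start of each 512-byte unit."""
    tags = [block[i] for i in range(0, len(block), SECTOR)]
    new = sum(1 for tag in tags if tag == new_tag)
    old = sum(1 for tag in tags if tag == old_tag)
    if new == len(tags):
        return "written"
    if old == len(tags):
        return "not written"
    if new + old == len(tags):
        return f"shorn ({new}/{len(tags)} units updated)"
    return "corrupted"


new_unit = bytes([2]) + bytes(SECTOR - 1)
old_unit = bytes([1]) + bytes(SECTOR - 1)

# A 4096-byte block in which only the first three units hold new data.
block = new_unit * 3 + old_unit * 5
print(classify_block(block, new_tag=2, old_tag=1))  # -> shorn (3/8 units updated)
```

If the header were larger than the device's programming unit, a partially updated block could not be told apart from a corrupted one at this granularity.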
Slide 23
Result: Summary of test results
Figure 9 shows very different behavior across devices with respect to the file I/O synchronization problem. SSD#4 has a critical serialization error problem in its device firmware. Serialization errors depend on how well the device controls the transfer of data. Moreover, many SSDs cannot recover their data after a sudden power loss.
Slide 24
Conclusion
This paper proposes a methodology to automatically expose bugs in block devices such as SSDs that are triggered by power faults. The block-level behavior of SSDs exposed in the experiments has important implications for the design of storage systems. Vendors of SSDs need to improve recovery in power fault situations.