Embedded System Lab. Daeyeon Son. Understanding the robustness of SSDs under power fault. Mai Zheng*, Joseph Tucek**, Feng Qin*, Mark Lillibridge**. The Ohio State University*, HP Labs**. Daeyeon Son, 2015.6.9

  • Slide 1
  • Embedded System Lab. Daeyeon Son Understanding the robustness of SSDs under power fault Mai Zheng*, Joseph Tucek**, Feng Qin*, Mark Lillibridge** The Ohio State University*, HP Labs** Daeyeon Son 2015.6.9
  • Slide 2
  • Introduction. Power failures are a problem in every environment where computers are used, because a computer and its disks run on electric power. In an SSD, the electrons stored in the floating gate of a flash cell set the cell's threshold voltage, which encodes the data. If a blackout occurs while data is being transferred to the flash cells, the reliability of that data is not guaranteed. The paper investigates the state in which data is left after a power failure across a variety of commodity SSDs.
  • Slide 3
  • Power fault: about the power fault. Power faults have many different causes. In a blackout, if the datacenter has not prepared an uninterruptible power supply, its computers shut down immediately, and data on disk can become unstable or be lost. Ref. Southern California Edison
  • Slide 4
  • Power fault: architecture of a solid-state drive. A solid-state drive runs firmware that operates the device. The SSD controller sits at the center of the drive and handles data transfer: data from the host is stored to flash memory by the controller. The FTL (Flash Translation Layer) is implemented by the firmware in the SSD. A user-level program cannot trace the flash translation layer, which works at the device level.
  • Slide 5
  • Power fault: steps of data transfer in SSDs. A power fault can occur at any step of data processing in an SSD, but the exact point of failure cannot be known without seeing the device's data-flow design. The diagram is only an abstraction of the data flow in an SSD. Diagram labels: Time; Map table updates in DRAM (Updates); Map table sync from DRAM to NAND (Sync); Next Sync; Power Loss; Roll back up to this point; Aggressive roll back point.
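The sync and roll-back labels in the timeline can be illustrated with a toy model. This is a minimal sketch, not any vendor's firmware: the class `ToyFTL` and its methods are hypothetical names, and it assumes the simplest policy, in which the device falls back to the last map table copy synced to NAND.

```python
class ToyFTL:
    """Toy page-mapping table: a volatile copy in DRAM plus a checkpoint in NAND."""

    def __init__(self):
        self.dram_map = {}  # logical page -> physical page, lost on power cut
        self.nand_map = {}  # last map table copy persisted to flash

    def update(self, lpn, ppn):
        # Map table updates happen in DRAM only.
        self.dram_map[lpn] = ppn

    def sync(self):
        # Map table sync from DRAM to NAND.
        self.nand_map = dict(self.dram_map)

    def power_loss(self):
        # Assumed policy: DRAM contents vanish, roll back to the last sync.
        self.dram_map = dict(self.nand_map)


ftl = ToyFTL()
ftl.update(0, 100)
ftl.sync()            # checkpoint contains {0: 100}
ftl.update(0, 200)    # these two updates happen after the sync ...
ftl.update(1, 201)
ftl.power_loss()      # ... and are lost when power is cut before the next sync
print(ftl.dram_map)
```

In this model every update issued between a sync and the power loss disappears, which is the window the timeline highlights.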
  • Slide 6
  • Power fault. SSDs are not Evangelion, but they similarly carry a reserve energy source (a supercapacitor). When the computer's power supply is suddenly cut, the SSD uses this reserve to flush data from its cache to the flash cells. The emergency transfer time is very short and is used only to flush the data and apply it to the FTL. If the data unfortunately is not written to the SSD, the administrator of the computer will be crying at the desk. An SSD can keep transferring data for a short time during a power fault; an Evangelion can keep moving for 5 minutes during a power fault.
  • Slide 7
  • Symptoms of power fault in SSDs. In this paper, the authors hypothesized the types of error a power fault can cause in SSDs. If the data in the device is not protected, the result is a synchronization problem. 1. Data is not written at all because of the power loss. 2. Metadata, such as the mapping table, is corrupted. 3. The electrons in the floating gate do not reach the target level. 4. The data write is not completed (the cell works like a sponge). 5. A power-supply error is caused by a hardware problem. How can a power fault be observed in real time?
  • Slide 8
  • Symptoms of power fault in SSDs: detailed explanation (1)
  • Slide 9
  • Symptoms of power fault in SSDs: detailed explanation (2)
  • Slide 10
  • Symptoms of power fault in SSDs: unserializability. Identifying the unserializability type requires two or more records. An unserialized write is a violation of the synchronous constraint on write requests issued by one thread or between multiple threads.
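To make the idea concrete, here is a simplified check for a single thread that issues synchronous writes in order. It is a sketch under stated assumptions, not the paper's checker: the function `find_unserializable` and its data shapes are hypothetical. The reasoning is that if the newest record visible on the device is operation k, every block should hold the last write issued to it at or before k; an older record (or none) means the writes did not land in issue order.

```python
def find_unserializable(issued, device):
    """Return (lbn, expected_op, found_op) for blocks that violate write order.

    issued: list of (op_cnt, lbn) in issue order, from one synchronous writer.
    device: dict lbn -> op_cnt of the record found there after the power fault.
    """
    if not device:
        return []
    newest_visible = max(device.values())
    expected = {}
    for op_cnt, lbn in issued:
        if op_cnt <= newest_visible:
            expected[lbn] = op_cnt  # a later write to the same block wins
    return [
        (lbn, op, device.get(lbn))
        for lbn, op in sorted(expected.items())
        if device.get(lbn) != op
    ]


issued = [(0, 10), (1, 20), (2, 10), (3, 30)]

# Write 3 is visible, yet block 10 still holds write 0 instead of write 2.
violations = find_unserializable(issued, {10: 0, 20: 1, 30: 3})
print(violations)

# Here write 3 was simply lost; everything up to write 2 is consistent.
clean = find_unserializable(issued, {10: 2, 20: 1})
print(clean)
```

A check like this needs at least two records to compare, which matches the remark above that a single record is not enough to identify this failure type.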
  • Slide 11
  • Test workload: workload format. The workload header is 512 bytes long, and every record carries a contiguously arranged header (512 bytes is the programming unit in some SSDs). Seed: the input to the random number generator that produces the raw block number. Raw block number: the original 64-bit random number. Block number: the LBN (Logical Block Number), which can be checked at user level in the system architecture (workload size = 512 bytes × N). Worker ID: the tid of the thread, used when multiple threads issue I/O. Operation count: the number of writes issued by that worker.
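A possible encoding of the fields named on this slide is sketched below. The slide does not give field widths or ordering, so the layout (five little-endian 64-bit integers, zero-padded to 512 bytes), the modulo mapping from raw block number to LBN, and the names `pack_header` and `unpack_header` are all assumptions made for illustration.

```python
import struct

HEADER_SIZE = 512  # header length stated on the slide

# Assumed layout: seed, raw block number, block number, worker ID, op count.
FIELDS = struct.Struct("<QQQQQ")


def pack_header(seed, raw_block, device_blocks, worker_id, op_cnt):
    """Build a 512-byte header; the LBN is derived from the raw block number."""
    lbn = raw_block % device_blocks  # assumed mapping onto the device's range
    body = FIELDS.pack(seed, raw_block, lbn, worker_id, op_cnt)
    return body.ljust(HEADER_SIZE, b"\0")


def unpack_header(header):
    """Decode the fields written by pack_header."""
    seed, raw_block, lbn, worker_id, op_cnt = FIELDS.unpack_from(header)
    return {
        "seed": seed,
        "raw_block_number": raw_block,
        "block_number": lbn,
        "worker_id": worker_id,
        "op_cnt": op_cnt,
    }


raw = 2**40 + 7
header = pack_header(seed=3, raw_block=raw, device_blocks=1000, worker_id=1, op_cnt=2)
fields = unpack_header(header)
print(len(header), fields["block_number"])
```

Because every field is recoverable from the header itself, a checker can tell which worker and which operation produced any block it reads back.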
  • Slide 12
  • Test workload: workload format. Checksum: a CRC (Cyclic Redundancy Check) used to verify that the data is still in its original state. Timestamp: the time at which the data was written. Marker: a unique string used to check the boundary of the header. The standard random number generation algorithm, however, is affected by time, and time is not stable. The workload therefore uses a hash function in counter mode to generate a unique sequence. Seed: 3, 8, 2, 4, 2, 5, 7, ..., 9, 1. Worker ID: 0, 0, 1, 1, 1, 2, 2, ..., 0, 2. Op_cnt: 0, 1, 0, 1, 2, 0, 1, ..., 9, 7. With the hash function in CTR mode, the order can be traced exactly (very important!).
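The two mechanisms on this slide can be sketched as follows. The slide names neither the hash function nor the CRC width, so SHA-256 and CRC-32 are stand-ins chosen for illustration, and `raw_block_number`, `add_checksum`, and `checksum_ok` are hypothetical names. The point of counter mode is that the value depends only on the counter (seed, worker ID, operation count), never on the clock, so a checker can regenerate the sequence later.

```python
import hashlib
import struct
import zlib


def raw_block_number(seed, worker_id, op_cnt):
    """Hash a counter to get a 64-bit number that is reproducible on demand."""
    counter = struct.pack("<QQQ", seed, worker_id, op_cnt)
    digest = hashlib.sha256(counter).digest()
    return int.from_bytes(digest[:8], "little")


def add_checksum(record):
    """Append a CRC-32 of the record so later corruption can be detected."""
    return record + struct.pack("<I", zlib.crc32(record))


def checksum_ok(sealed):
    """Recompute the CRC-32 and compare it with the stored value."""
    record = sealed[:-4]
    (stored,) = struct.unpack("<I", sealed[-4:])
    return zlib.crc32(record) == stored


first = raw_block_number(3, 0, 0)
again = raw_block_number(3, 0, 0)   # same counter, same value
other = raw_block_number(3, 0, 1)   # next operation, different value

sealed = add_checksum(b"record payload")
flipped = bytes([sealed[0] ^ 0x01]) + sealed[1:]  # simulate a one-bit corruption
print(first == again, checksum_ok(sealed), checksum_ok(flipped))
```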
  • Slide 13
  • Test workload: protecting data from compression by the SSD. Some SSDs compress data automatically in an advanced flash translation layer. Compression can change the layout of the original workload records. As a workaround, this is avoided by XORing the data with a random mask. (Figure: original data vs. compressed data.)
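A small sketch of the masking idea follows. It assumes the mask is regenerated from a known seed so the checker can undo it; `random_mask` and `xor_bytes` are hypothetical helper names, and zlib stands in for whatever compression a drive might apply internally.

```python
import random
import zlib


def random_mask(seed, length):
    """Pseudo-random mask that can be regenerated from the same seed."""
    rng = random.Random(seed)
    return bytes(rng.getrandbits(8) for _ in range(length))


def xor_bytes(data, mask):
    """XOR two equal-length byte strings; applying it twice restores the input."""
    return bytes(a ^ b for a, b in zip(data, mask))


payload = b"A" * 4096                 # highly compressible record body
mask = random_mask(42, len(payload))
masked = xor_bytes(payload, mask)     # what would actually be written
restored = xor_bytes(masked, mask)    # what the checker recovers

plain_size = len(zlib.compress(payload))
masked_size = len(zlib.compress(masked))
print(plain_size, masked_size)
```

The masked payload looks random to a compressor, so it keeps roughly its original size, while the XOR remains exactly reversible for verification.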
  • Slide 14
  • Test workload: workload type, concurrent random writes. Multi-threaded I/O. The starting address is random. The write length is random. (Figure: memory map.)
  • Slide 15
  • Test workload: workload type, concurrent sequential writes. Multi-threaded I/O. The starting address is random. Writes proceed contiguously from the starting address. (Figure: memory map.)
  • Slide 16
  • Test workload: workload type, single-threaded sequential writes. Single-threaded I/O. The starting address is random. Writes proceed contiguously from the starting address. (Figure: memory map.)
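The three workload types can be sketched as a planner that emits (start block, length) pairs per thread. This is an illustrative reading of the slides, not the authors' generator: the function `plan_writes`, the kind names, the fixed length used for sequential writes, and the wrap-around at the end of the device are all assumptions.

```python
import random


def plan_writes(kind, n_threads, ops_per_thread, device_blocks, max_len=8, seed=0):
    """Return {thread_id: [(start_block, length), ...]} for one workload type."""
    rng = random.Random(seed)
    if kind == "single_sequential":
        n_threads = 1  # single-threaded variant ignores the requested count
    plans = {}
    for tid in range(n_threads):
        start = rng.randrange(device_blocks)  # random starting address
        ops = []
        for _ in range(ops_per_thread):
            if kind == "concurrent_random":
                start = rng.randrange(device_blocks)  # new random address each time
                length = rng.randint(1, max_len)      # random length
            else:
                length = max_len                      # assumed fixed length
            ops.append((start, length))
            start = (start + length) % device_blocks  # next contiguous address
        plans[tid] = ops
    return plans


random_plan = plan_writes("concurrent_random", 2, 5, device_blocks=1000)
sequential_plan = plan_writes("concurrent_sequential", 3, 4, device_blocks=1000)
single_plan = plan_writes("single_sequential", 3, 4, device_blocks=1000)
print(sequential_plan[0])
```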
  • Slide 17
  • Test environment: building a device for the test. An SSD is normally powered only through the computer's power supply. A SATA SSD, however, has a power cable that is separate from its data cable. A custom-made power cable is controlled by the scheduler software, which cuts the SSD's power at irregular intervals. The power fault test is performed more than a thousand times.
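The aperiodic power cutting can be sketched as a loop that waits a random time before each cut. Nothing here drives real hardware: `fault_delays` and `run_cycles` are hypothetical names, the delay range is arbitrary, and the power switch is represented by two callbacks that a real setup would replace with whatever controls the custom cable.

```python
import random
import time


def fault_delays(n_faults, min_s, max_s, seed=None):
    """Pick an irregular delay (in seconds) before each power cut."""
    rng = random.Random(seed)
    return [rng.uniform(min_s, max_s) for _ in range(n_faults)]


def run_cycles(delays, power_on, power_off, sleep=time.sleep):
    """Power the device, let the workload run for the delay, then cut power."""
    for delay in delays:
        power_on()
        sleep(delay)
        power_off()


# Dry run with stand-in callbacks that only record what would happen.
events = []
delays = fault_delays(3, 0.5, 5.0, seed=1)
run_cycles(
    delays,
    power_on=lambda: events.append("on"),
    power_off=lambda: events.append("off"),
    sleep=lambda seconds: events.append("wait"),
)
print(events)
```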
  • Slide 18
  • Test environment: power fault injection
  • Slide 19
  • Test environment: components of the framework. 1. Scheduler: manages the testing framework. 2. Checker: verifies the result of the workloads applied to the SSDs. 3. Worker: injects the workload through multiple threads. 4. Switcher: controls the power cable of the SSDs. I/O scheduler: noop (no operation). Write options: 1. O_SYNC (for data synchronization) 2. O_DIRECT (to bypass the cache). (Figure: hardware layer and software layer.)
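The two write options correspond to flags of the POSIX `open` call. The sketch below shows how a worker might open its target with them from Python on a POSIX system; `sync_write_flags` and `write_block` are hypothetical helpers, and the demonstration writes to a temporary file rather than a raw device. `O_DIRECT` is Linux-specific, is rejected by some filesystems, and requires aligned buffers and sizes, so it is optional here and switched off in the demonstration.

```python
import mmap
import os
import tempfile


def sync_write_flags(direct=False):
    """O_SYNC makes each write wait for the device; O_DIRECT bypasses the cache."""
    flags = os.O_WRONLY | os.O_CREAT | os.O_SYNC
    if direct:
        flags |= getattr(os, "O_DIRECT", 0)  # 0 where the flag does not exist
    return flags


def write_block(path, data, direct=False):
    """Write one block synchronously and return the number of bytes written."""
    fd = os.open(path, sync_write_flags(direct), 0o600)
    try:
        if direct:
            # O_DIRECT needs an aligned buffer; an anonymous mmap is page-aligned.
            with mmap.mmap(-1, len(data)) as buf:
                buf[:] = data
                return os.write(fd, buf)
        return os.write(fd, data)
    finally:
        os.close(fd)


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "record.bin")
    written = write_block(path, b"\xAB" * 512)
    with open(path, "rb") as f:
        stored = f.read()
print(written)
```

With `O_SYNC`, a write that has returned is supposed to be on the device already, which is what lets a checker treat acknowledged writes as durable when it inspects the drive after a fault.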
  • Slide 20
  • Test environment: target devices. In general, vendors publish specifications for their SSDs, and these include a good deal of information about the hardware design. What they do not describe is the internal data-mapping structure called the FTL. The flash translation layer is a very important component of an SSD: it connects each logical page number to a physical page number. Vendors never open the firmware source, because it is tied to device performance.
  • Slide 21
  • Result: summary of test results. No flying writes occurred on the SSDs, because the flash translation layer is very stable in all target devices. SSD#3, however, shut down in some power fault situations and suffered metadata corruption. SSD#1 had a critical problem, a dead device: the data on the drive could not be recovered.
  • Slide 22
  • Result: summary of test results. Figure 7 shows why the workload header length was chosen: shorn writes occur at the granularity of the programming unit in some SSDs, so the workload test needs to follow the programming-unit specification of the flash cells. Figure 8 shows that whether an SSD fails under power fault is not related to its price or performance.
  • Slide 23
  • Result: summary of test results. Figure 9 shows very different behavior across devices on the file I/O synchronization problem. SSD#4 has a critical serialization error in its device firmware. Serialization errors stem from how the device controls data transfer. Many SSDs cannot recover the data after a sudden power loss.
  • Slide 24
  • Conclusion. This paper proposes a methodology to automatically expose bugs in block devices such as SSDs that are triggered by power faults. The block-level behavior of SSDs exposed in the experiments has important implications for the design of storage systems. SSD vendors need to improve recovery behavior in power fault situations.