
IEICE TRANS. INF. & SYST., VOL.E97–D, NO.3 MARCH 2014

PAPER

UStore: STT-MRAM Based Light-Weight User-Level Storage for Enhancing Performance of Accessing Persistent Data

Yong SONG†a), Member and Kyuho PARK†∗, Nonmember

SUMMARY Traditionally, in computer systems, file I/O has been a big performance bottleneck for I/O intensive applications. The recent advent of non-volatile byte-addressable memory (NVM) technologies, such as STT-MRAM and PCM, provides a chance to store persistent data with a performance close to DRAM's. However, as the location of the persistent storage device gets closer to the CPU, the overheads of the system software layers for accessing the data, such as the file system layer (including the virtual file system layer) and the device driver, are no longer negligible. In this paper, we propose a light-weight user-level persistent storage, called UStore, which is physically allocated on the NVM and is mapped directly into the virtual address space of an application. UStore makes it possible for the application to access the persistent data quickly, without the system software overheads and the extra data copy between user space and kernel space. We show how UStore is easily applied to existing applications with little elaboration and evaluate its performance enhancement through several benchmark tests.
key words: non-volatile memory, storage, persistent data, implementation

1. Introduction

Recent promising non-volatile memory (NVM) technologies such as STT-MRAM (Spin Transfer Torque MRAM) and PCM (Phase Change Memory) pose challenging new issues in computing systems [1], [2]. These technologies can be directly attached to the processor via the memory bus like DRAM, and data on them can simply be accessed with load/store instructions. This means that NVM, in terms of performance, can be considered an upper bound of persistent storage devices. Therefore, it is obviously meaningful to study how to store data persistently on NVM.

In early research, in order to improve the performance of the existing flash file system, PCM was exploited as a sub-storage for storing a part of NAND file systems, such as the metadata and a part of the data, because it is faster than NAND but has less capacity [3], [4]. Recent works focus on how to design a complete file system able to store data persistently just on NVM, by aggressively utilizing its byte-addressability. Condit et al. [5] proposed a byte-level accessible on-memory file system, called BPFS, which particularly improves the performance of small-sized accesses of metadata and data with data consistency. Storage Class Memory File System (SCMFS) shows better performance compared to the existing naive memory-based file systems, tmpfs and ext2 on ramdisk, by managing the file system directly in kernel virtual address space, which helps to eliminate the overheads of a generic block device and emulated block I/O operations [6].

Manuscript received August 26, 2013.
†The authors are with Korea Advanced Institute of Science and Technology, Korea.
∗Presently, with the Electrical Engineering Department of KAIST.
a) E-mail: [email protected]
DOI: 10.1587/transinf.E97.D.497

In general, a file system is an excellent means to store data persistently, and a file I/O operation, which accesses file data through the file system, is processed with system software intervention: the system call, file system operations including the VFS layer, the block device driver, and the actual data copy between kernel space and user space. In the traditional computer system, these system software overheads account for only a small portion of an I/O operation due to the much slower I/O performance of the secondary storage. However, in cases where data is persistently stored and accessed on NVM, especially STT-MRAM, whose performance is similar to that of DRAM, we can no longer overlook this factor.

In this work, to achieve good performance when accessing the persistent data of existing I/O intensive applications, we propose a light-weight user-level persistent storage, called UStore, which is physically allocated on the STT-MRAM and directly mapped to the user address space of an application without an additional kernel patch. To supplement the limited size of the STT-MRAM, we design UStore so that it can provide a virtually larger space than its size. For this, we assign an auxiliary file on an underlying file system and exclusively store the excess data of UStore in the file. According to the requirements of an application, we can apply an appropriate page replacement policy to provide an efficient way to store a part of the persistent data in UStore's limited space.

In addition, so that UStore can be easily applied to existing applications, we provide APIs similar to the POSIX I/O primitive calls such as open, write, read, and so on. By simply replacing the I/O primitives with these, we can apply UStore to SQLite, an open-source file-based database. Finally, we show the performance enhancement of applying UStore in several experiments.

The rest of this paper is organized as follows. In the next two Sections, we introduce the background and related works. In Section 4, we explain our proposed solution and its implementation in detail, and in Section 5 we show how to apply UStore to SQLite. In Section 6, we evaluate our implemented system with various experiments. Finally, we mention future work and conclude.

2. Background

Recent reports [2], [10] describe the currently projected status of various NVM technologies and give us confidence about NVM's future as a persistent storage device.


Table 1  Comparison of DRAM, NAND, PCM, and STT-MRAM. (*F represents the metal line width)

                DRAM      NAND     PCM         STT-MRAM
Read latency    <10 ns    25 µs    48 ns       <10 ns
Write latency   <10 ns    200 µs   40∼150 ns   12.5 ns
Endurance       >10^15    10^4     >10^8∼9     >10^12
Cell size*      6∼8 F^2   4∼5 F^2  4∼16 F^2    20∼37 F^2


Among several NVM prototypes, we focus on STT-MRAM. With its spin-torque technology, STT-MRAM shows good read/write performance close to that of DRAM, as shown in Table 1. It also has high (almost unlimited) write endurance and relatively higher density based on projections of the related technology. Depending on the structure of the Magnetic Tunnel Junction (MTJ) of STT-MRAM, its characteristics can vary. STT-MRAM can be used as part of a CPU cache [15], it is expected to replace DRAM or NOR flash [16], or it can be targeted as a storage class memory with a high retention time [17].

From these streams of STT-MRAM related technologies, we expect that STT-MRAM can be a good candidate for a high-performance persistent storage device. Therefore, in this paper, we assume that STT-MRAM can replace DRAM in terms of performance and has a reasonably high capacity for storing large data persistently.

3. Related Works

We summarize the recent approaches to persistent storage systems which take advantage of non-volatile byte-addressable memory (NVM).

BPFS and SCMFS provide a persistent file system based on NVM, and they consider the problems of the simple approach of laying an existing file system on a ramdisk. BPFS presents a tree-structured file system based on PCM which supports byte-level updates of file data as well as file system metadata [5]. It outperforms the NTFS file system on a ramdisk, on which data can only be accessed at page level. The authors proposed a short-circuit shadow paging mechanism for providing atomic, fine-grained, in-place updates to persistent storage, which prevents cascaded copy-on-writes from the modified location up to the root of the file system tree.

SCMFS shows better performance compared with the naive configurations for a memory-based file system, tmpfs and ext2 on ramdisk [6]. The authors focus on the overheads of the generic block device and the emulated block I/O operations; in order to eliminate them, SCMFS manages its file system directly in kernel virtual address space, and each file is contiguously mapped in the kernel virtual address space. These file system-based persistent storages cannot avoid the system software overhead because the persistent data is managed in kernel space. Therefore, we focus on the next two solutions, which provide software frameworks for supporting persistent memory using NVM.

Fig. 1 Architectural view of NV-heaps and Mnemosyne.

Mnemosyne [8] and NV-heaps [7] made efforts to provide programmers a persistent memory abstraction which allocates and manipulates persistent objects on top of NVM, following the concept of persistent programming for object-oriented programming [9]. NV-heaps focuses on prohibiting the usage of dangerous pointers between NV-heaps and volatile memory, and it uses a specific wrapper object with a reference counter for checking dangerous pointers. As shown in Fig. 1 (a), NV-heaps is implemented by mapping files of a memory-based file system on NVM, for example ext2 on ramdisk, to user address space with the mmap system call. The actual persistent data is located in the mmapped file, and its copy is allocated in the mapped region of the user address space on main memory. After updates on the user space region, the updated data finally has to be synchronized to the mmapped file with the msync call.

In Mnemosyne, unlike NV-heaps, persistent regions and persistent heaps are allocated directly on physical NVM page frames in user space by the region manager, which records the mapped pages in the persistent mapping table. For an application process, multiple persistent regions can be created, and they are mapped to their corresponding backing files with Mnemosyne's modified mmap system call. The backing files are used for memory swapping when there is memory pressure. In this approach, because the persistent data is tightly coupled with a dedicated application instance, it cannot be shared with any other application instance, and it is hard to migrate the data to other persistent storage or to back it up. In addition, it requires a garbage collection process for persistent data which will never be accessed. For allocation of the persistent region, they modify a part of the virtual memory system in the kernel, and they implement a dedicated compiler for supporting their programming model. Mnemosyne aims to provide the persistent memory abstraction, not to replace the traditional persistent storage, so its new programming model is hard to apply to existing I/O intensive applications.

In this work, we focus on the system software overheads for accessing persistent data in the previous kernel-level persistent storage solutions such as BPFS and SCMFS, and we design a user-level persistent storage to mitigate the system software intervention. Storing persistent data in user space is similar to Mnemosyne, but the basic target and the detailed methodologies are different.


Our work targets a high-performance user-level persistent storage for existing applications. Because it is not dedicated to one application, it can easily be shared between different application processes. Since our methodology does not require a kernel patch or a new programming model, we can easily apply our solution to legacy I/O intensive applications in the existing computing environment with little effort.

Additionally, because the previous approaches aim to store a moderate amount of data at byte-level or object-level granularity, they are not appropriate for existing I/O intensive applications, and they require a new implementation of the existing applications that are based on I/O primitives.

4. UStore: User-Level Storage

The main objective of this work is to find how to store data persistently on STT-MRAM for high-performance computing. It is highly appealing that the pure data access latency of STT-MRAM is only a few nanoseconds. However, in the traditional computing system, we can only access persistent data through the system software layers in kernel space and, in that case, the system software overheads can no longer be ignored. Therefore, in this Section, we show how significant the overheads are and propose our solution for eliminating them.

4.1 Our Motivation

Generally, for storing data persistently, a file system, which provides a file abstraction to applications as well as file management facilities, is an excellent solution. In traditional persistent storages based on a file system, the actual file data is accessed at page level through the file system layer, including a virtual file system layer, in the kernel, and is physically stored in a persistent storage device through a block device driver. Thus, an application can access the data only in a user-level buffer after copying the original data from kernel space to user space. However, when data on the non-volatile byte-addressable memory can be directly accessed at byte level by the CPU via the memory bus, the system software overheads can be a big burden.

In order to find out how much the overheads cost, we ran a simple test program based on Linux kernel 2.6.31 on a 3.0 GHz Xeon machine. The test program calls a read or write I/O primitive call which requests zero-sized data. We measure the performance with the rdtsc instruction; rdtsc is a simple instruction which reads a counter register incremented by the CPU clock, so that we can obtain an exact result for the short-lived operation. The measured results are on average 1,378 clock counts in the read case and 1,656 clock counts in the write case. This means that the software layer latency for the read call is about 459 ns and the one for the write call is about 552 ns on our machine. This latency is much bigger than the access latency for a cache line (64 bytes) of data on STT-MRAM, which is less than 200 ns (12.5 ns × 16 words) by a conservative estimation based on Table 1.
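For reference, a minimal sketch of this kind of measurement (the file path and the 3.0 GHz conversion factor are illustrative, not part of the paper's test program):

#include <stdint.h>
#include <stdio.h>
#include <fcntl.h>
#include <unistd.h>

/* Read the CPU timestamp counter (x86-64). */
static inline uint64_t rdtsc(void)
{
    uint32_t lo, hi;
    __asm__ __volatile__("rdtsc" : "=a"(lo), "=d"(hi));
    return ((uint64_t)hi << 32) | lo;
}

int main(void)
{
    int fd = open("/tmp/ustore_test", O_WRONLY | O_CREAT, 0644);
    if (fd < 0) return 1;

    char buf[1];
    uint64_t start = rdtsc();
    write(fd, buf, 0);          /* zero-sized request: only software-layer cost */
    uint64_t end = rdtsc();

    printf("write(0 bytes): %llu cycles (~%.0f ns at 3.0 GHz)\n",
           (unsigned long long)(end - start), (end - start) / 3.0);
    close(fd);
    return 0;
}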

Table 2  Composition of a "write" I/O primitive call in Linux 2.6.31.

Space   Symbol                      Category                 %      Subtotal (%)
Kernel  system_call                 System call              11.47  24.41
        sysret_signal                                         4.11
        system_call_after_swapgs                              3.46
        sysret_check                                          3.14
        sys_write                                             2.23
        avc_has_perm_noaudit        SELinux related (VFS)     6.70   17.48
        file_has_perm                                         2.88
        inode_has_perm                                        2.81
        avc_has_perm                                          2.55
        selinux_file_permission                               1.27
        security_file_permission                              1.27
        mutex_lock                  Locking (VFS)             6.67   10.71
        mutex_unlock                                          4.04
        do_sync_write               VFS internal              6.05   25.64
        vfs_write                                             2.80
        ext4_file_write                                       2.23
        generic_file_aio_write                                1.90
        (others)                                             12.66
        (skipped)                   Audit related            10.5
        (skipped)                   perf related              0.12
User    __write_nocancel            glibc related             8.59   11.14
        main                        user program              2.55

Also, using Linux perf [11], we analyzed which system software parts contribute to an I/O primitive call and how much they do. The I/O primitive call is iterated a million times in the test; the reason for this iteration is that we want to eliminate, as much as possible, the effects caused by components other than the I/O primitive call. Table 2 represents the composition of the overheads of the write I/O primitive call. User-space overheads, including the glibc library, account for only 11.14%. Among the kernel-space overheads, the parts related to the system call, which include both the calling process and the returning process, account for 24.41%. Even if we ignore the parts related to audit and to perf itself among the remaining kernel parts, 53.83% belongs to the virtual file system layer and other kernel parts. Interestingly, 17.48% is for parts related to SELinux, which provides the mechanism for supporting access control and security policies, and 10.71% of the whole overhead is caused by locking, which is mainly used in the virtual file system layer.

The result for read I/O primitive call is similar to thewrite case except for the followings, no parts for lockingand several contrary symbols, such as “sys read” for “read”system call procedure and “ GI libc read” for read call inglibc library.

These results are our main motivation for considering how to mitigate the overheads of the system call, the virtual file system layer, and extra kernel software components for accessing persistent file data.

4.2 Design Goal

We summarize our design goals for an STT-MRAM based high-performance persistent storage as follows.

• Managing the persistent data in user space: if data is managed at kernel level, the overheads of the system software layers and the data copy between user space and kernel space are unavoidable. We manage the persistent data in user space.
• Providing virtually large storage space: STT-MRAM has good performance, but it has limited capacity. To supplement the lack of storage space, we provide a virtually larger storage space than the actual size of the STT-MRAM by assigning an auxiliary file on the underlying file system which stores the excess data.
• Supporting an easy way to apply the persistent storage to existing applications: compatibility with existing applications is important. For accessing persistent data in the storage, we should provide APIs which have similar syntax to the existing I/O related calls.

Fig. 2 UStore’s physical and logical view.

Based on these design goals, we propose a light-weight user-level persistent storage, called UStore. Figure 2 represents the physical and logical views of UStore. One or more UStores can be allocated physically in STT-MRAM, and they are directly mapped into the virtual address space of each application. With the direct mapping to user space, we can replace the heavy I/O operations for persistent data with only light-weight memory operations, so we expect to enhance the access performance for persistent data through UStore.

At the initial step, the limited-size UStore is allocated on STT-MRAM and an auxiliary file on an underlying file system is also assigned. This gives applications the illusion that they have a user-level persistent storage space much larger than the limited space of UStore, and data is exclusively stored in either UStore or the auxiliary file. According to the application characteristics, we have to be able to apply an efficient management policy, for example an LRU-based policy, to keep frequently accessed data in the limited space.

In addition, we also consider the ease of applying UStore to existing I/O intensive applications. For this, we design the UStore APIs to have similar syntax to the existing I/O primitive calls.

Figure 3 shows the difference between UStore and the file system-based persistent storage systems. Traditionally, in order to access persistent file data, an I/O primitive function call is completed after passing through the virtual file system layer, the generic file system, and the emulated device driver.

Fig. 3 Comparison of traditional persistent data storing and UStore.

In the case of SCMFS, the emulated device driver layer and the multiple memory references, like the indirect blocks of the ext2 file system, are eliminated by mapping the physical STT-MRAM into kernel address space. Nevertheless, there are still the overheads of the other system software layers for accessing the persistent data in kernel space.

We provide UStore with two modes.

• Compatible mode: providing an easy and compatible way to apply UStore to existing applications.
• Direct access mode: for better performance, allowing direct access to the persistent data in UStore.

Figure 3 (c) shows the case when UStore is applied to an existing application in compatible mode. Usually, an existing I/O intensive application allocates large user-level buffers in main memory for performance and holds volatile copies of the persistent data in the buffers. The updated buffered data has to be synchronized with the original data and, finally, actual block-level I/O operations happen. In order to comply with this convention, we provide APIs for using UStore which have similar syntax to the POSIX APIs. With them, we can easily apply UStore to the existing application. However, when the persistent data is stored in STT-MRAM, the access of data in the user buffer (DRAM) and the data copy between the user buffer and the UStore space are duplicated operations.

Therefore, as shown in Fig. 3 (d), we also support direct access to the persistent data in UStore without the duplicate access. The direct access gives us more chance to enhance the access performance, but it requires modifying the existing application more than in the compatible mode case.

4.3 Basic Methodology

In this section, we explain in detail how to allocate a UStore within a specific STT-MRAM space and how to map it into user address space. First, a device driver for UStore, which allocates a pre-defined size of STT-MRAM space, is registered. It is used only to map the physical UStore space into user address space; after the mapping, it does not participate in the actual data access in UStore. In order to expose a UStore instance to applications, we create a device file, for instance /dev/ustore, with the mknod command. This makes it possible for applications to access the persistent STT-MRAM space directly, and with the device file, we can impose access control for the UStore.


Fig. 4 UStore internal page layout. (Flat and LRU-based policy)


For the actual mapping of UStore into the user address space of an application, we use the mmap system call, which is implemented in most modern operating systems and creates a new mapping in the virtual address space of the calling process. After the UStore device file is opened, the mmap system call is issued and it invokes the actual address mapping function registered by the UStore device driver. Then, by updating the page tables of the application for the mapped address space, the address mapping process is completed. Until the munmap system call is called, the mapping is retained and the application can directly access data in the STT-MRAM region without the help of any system software layers.
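A minimal sketch of this mapping step, assuming the /dev/ustore device file described above and an illustrative 512 MB region size:

#include <fcntl.h>
#include <stdio.h>
#include <sys/mman.h>
#include <unistd.h>

#define USTORE_SIZE (512UL << 20)   /* assumed UStore size: 512 MB */

int main(void)
{
    /* Open the UStore device file created with mknod. */
    int fd = open("/dev/ustore", O_RDWR);
    if (fd < 0) { perror("open"); return 1; }

    /* mmap invokes the mapping function registered by the UStore driver;
       afterwards the STT-MRAM region is reached with plain loads/stores. */
    void *base = mmap(NULL, USTORE_SIZE, PROT_READ | PROT_WRITE,
                      MAP_SHARED, fd, 0);
    if (base == MAP_FAILED) { perror("mmap"); return 1; }

    ((char *)base)[0] = 'U';   /* direct byte-level access, no I/O path */

    munmap(base, USTORE_SIZE);
    close(fd);
    return 0;
}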

4.4 UStore Internals

In order to provide a virtually larger persistent storage space than the limited size of UStore, we add a simple layer on top of UStore for managing the UStore space efficiently, which is similar to the metadata of a traditional file system.

The detailed UStore layout is depicted in Fig. 4. There are content pages containing the actual data and metadata pages for page management. The first block, whose size is fixed to 1 KB, contains a magic string representing UStore and the essential information of the page layout, such as the size of a page, the total number of pages, the identifier of the auxiliary file, the identifier of the page management policy, the offsets of the other kinds of pages, and the current total size of data. The page size and the selected page management policy identifier depend on the application requirements. The rest of the UStore space, except for the information block, is divided into fixed-size blocks, and we refer to each block as a Upage, or simply a page, in this paper. A Upage is the basic unit of management and page replacement. In our implementation, the size of a Upage is from 4 KB to 32 KB and is configured when the UStore device driver is registered.
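A sketch of what the 1 KB information block might contain; the struct name, field names, and types are our illustration of the items listed above, not the authors' definition:

#include <stdint.h>

struct ustore_info {                  /* hypothetical layout, fits in 1 KB */
    char     magic[16];               /* magic string: marks an initialized UStore */
    uint32_t page_size;               /* Upage size, 4 KB .. 32 KB */
    uint32_t policy_id;               /* page management policy (flat, LRU, ...) */
    uint64_t total_pages;             /* number of content pages */
    uint64_t aux_file_id;             /* identifier of the auxiliary file */
    uint64_t bitmap_off;              /* offsets of the other kinds of pages */
    uint64_t pointing_off;
    uint64_t log_off;
    uint64_t content_off;
    uint64_t total_data_size;         /* current total size of stored data */
    uint64_t lock;                    /* lock word for metadata concurrency (Sect. 4.5) */
};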

If the required working size for persistent data is smaller than or similar to the total space of the content pages of UStore, complex page management is not necessary. Otherwise, more efficient page management for utilizing the limited UStore space is required, in spite of the page replacement overhead.

Fig. 5 Pointing cells and inner pointers of UStore.


Case of flat page management policy
As shown in Fig. 4 (a), in the simplest page layout of a UStore, based on the flat page policy, I/O requests within the range of the UStore space are directly served on STT-MRAM. The other requests are simply redirected to the corresponding offset of the auxiliary file through normal I/O primitive calls, so that excess data whose offset is beyond the total size of UStore is simply stored in the file. In this case, additional metadata for management is not required. This policy can be simply implemented because it does not require any extra management of Upages. However, because the data beyond the range of UStore is located in the auxiliary file, that data can only be accessed through the traditional I/O data path and we cannot expect an I/O performance enhancement for it. When the application does not require much more data space than UStore's available space, the flat page policy can be a good solution.

Case of LRU-based page management policy
For better performance in more general cases, we can apply more efficient page management policies, by which more useful pages are kept in UStore, for example LRU-based page replacement. As shown in Fig. 4 (b), we allocate several kinds of extra pages for managing UStore between the information block and the content pages. These pages are composed of bitmap pages, pointing pages, and a logging page. Starting from the second page, bitmap pages are allocated; they are used to indicate the allocation of pages in UStore.

Basically, the pointing pages contain data structures for managing all of the following content pages. The data structures are used for searching for each content page and for evicting victim pages according to a page replacement policy. As shown in Fig. 5 (a), a pointing page is physically divided into fixed-size cells, called pointing cells, each of which represents the corresponding content page. In the current implementation, the size of a pointing cell is 40 bytes.

The first cell stands for the first content page, the next one for the next content page, and so on. By accessing the pointing cells, we can obtain the address of the corresponding content page. Each pointing cell contains an index value of a content page and some pointers, called inner pointers. An inner pointer is different from a normal pointer in a process. The term inner pointer was selected because the pointer is used for pointing to other pointing cells only within UStore. It holds the relative address of another pointing cell from the start address to which UStore is mapped. This makes it possible for any application to access the data structures of the pointing cells properly, even though UStore is mapped to a different virtual address location in each application. Figures 5 (b) and 5 (c) represent the logical view of the data structures of the pointing cells used to search for requested pages and to replace the least recently used content page with the requested page.


Table 3  Effect of cache-friendly structure for UStore page management.

              Cache-unfriendly   Cache-friendly
Elapsed time  3.22 s             1.32 s
LLC-misses    19,160,594         10,929,450
dTLB-misses   19,485,688         13,456,523


The main reason why we isolate the pointing cells from the corresponding content pages and gather them together in the pointing pages is that isolating the management data structures from the corresponding content pages can increase the possibility of a CPU cache hit or TLB hit; that is, it is cache friendly during management operations. In this work, we assume that the page size is larger than 4 KB. In the case of a normal data structure that includes its payload (the content page), the total size of the data structure is more than 4 KB. When we search for a content page or insert a content page entry into the data structure, we have to traverse a number of pages, and this increases the cache miss ratio and the TLB miss ratio.

Table 3 shows the result of our simple test based on a hash table data structure with a 4 KB payload. The performance of the cache-friendly approach, combining insert and search operations, is about 144% better than that of the cache-unfriendly approach, and we confirmed with perf that our cache-friendly approach decreases both the last-level cache-miss count (LLC-misses) and the data TLB-miss count (dTLB-misses).

The undo-logging pages are used for atomic updates of the metadata in the pointing pages, which guarantees consistency between pointing cells and content pages. The required number of logging pages depends on the data structures implemented in the pointing cells. This is covered in detail in the implementation section.

4.5 Implementation of UStore

In this section, we explain how to implement UStore in detail and introduce our effort to make it easy to apply to existing applications. We provide a user library which maps a UStore into an application's address space and manages UStore's content pages. We also provide some useful APIs for storing and accessing the pages, whose syntax is similar to that of the existing I/O primitive calls.

Fig. 6 UStore library on the existing system.

In our implementation, in order to allocate a specific region of STT-MRAM, we set a kernel parameter, memmap, to notify the kernel not to use a pre-defined region of the installed physical memory space. Under this configuration, the region is regarded as another memory zone, different from the main memory allocated to the kernel. The method of mapping to the different memory zone is similar to that of SCMFS; but while SCMFS adds a new address range type to the e820 memory map table, which is reported by the BIOS of an x86-based computer system, and kernel APIs for allocating the non-volatile memory area, our approach requires no additional kernel patch.
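For example (values illustrative, not the authors' exact configuration), reserving the upper 4 GB of an 8 GB machine could look like the following kernel command-line entry; the '$' marks the region as reserved so the kernel leaves it untouched:

memmap=4G$4G        # reserve 4 GB starting at the 4 GB physical address
                    # (in a GRUB config the '$' usually needs escaping: memmap=4G\$4G)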

At the first creation of a UStore, the information block and the other meta pages, such as the bitmap pages, pointing pages, and logging pages, are initialized. First, a magic string, which represents whether the UStore is initialized, is written into the information block, and then the identifiers of the page management policy and of the auxiliary file are also written.

In order to access a data structure of a pointing cell in a pointing page, we internally define and use two macros, us_to_virt and virt_to_us. When we want to update the value of an inner pointer, we replace it with the offset from the base address of UStore by using the virt_to_us macro. With us_to_virt, an application can obtain the virtual address of a content page, which it can then access directly. These macros are used internally for Upage management.
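A minimal sketch of what these two macros could look like; base stands for the virtual address at which the calling process mapped UStore, and the exact definitions are our assumption:

#include <stdint.h>

/* inner pointer (offset within UStore)  <->  process virtual address */
#define virt_to_us(base, vaddr) ((uint64_t)((char *)(vaddr) - (char *)(base)))
#define us_to_virt(base, usoff) ((void *)((char *)(base) + (uint64_t)(usoff)))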

UStore Library and APIs
Unlike the related works, which require an additional kernel patch or a new compiler, our current implementation of UStore is based on existing OS functionality and is built as a high-level user library, as shown in Fig. 6. This approach is possible because access to UStore's persistent data can be handled entirely in user space. However, it requires that the original I/O primitive calls be replaced with the UStore APIs and that the application be recompiled.

In order to make it easier to apply UStore to existing applications, we provide several APIs for UStore which are similar to the original I/O primitive system calls such as read, write, fread, fwrite, and so on. The syntax of the APIs is similar to that of the POSIX APIs except for the suffix "U", as described in Table 4.

Figure 7 shows the detailed operation processes of the compatible mode with the LRU-based page replacement policy. openU is the initial step that declares the intent to use UStore and opens an auxiliary file. The file name is composed of the names of an actual file, a device file for UStore, and the adopted page replacement policy, distinguished by a separator string, '::'. For example, if we open a UStore, "/dev/ustore", and assign an auxiliary file for it, "data.ustore", with LRU-based page replacement, the manipulated file name is "/dev/ustore::data.ustore::lru". The policy selected at the time the UStore is first created cannot be altered until it is reinitialized. If the device file name is not specified or the corresponding device does not exist, the operation of openU is identical to that of the original open call. If the UStore device exists, it is mapped into the virtual address space of the application, and openU finds out whether it is initialized by checking the magic number in the information block. If it is not initialized, UStore's meta pages are initialized according to the page replacement policy, "lru".


Table 4  UStore user-level APIs.

API                         Description
openU(filename*, flag)      create and map UStore and open or create an auxiliary file
                            (*: manipulated file name for UStore)
lseekU(fd, off, flag)       set offset for access
readU(fd, buf, size)        read data to buffer
writeU(fd, buf, size)       write data in buffer
closeU(fd)                  unmap UStore and close its auxiliary file
getUpointer(fd, off)        get a direct pointer
releaseUpointer(fd, off)    release the corresponding Upage

Fig. 7 UStore operations process: open, read/write. (LRU-based policy)

Table 5  An example code of general buffered I/O using the UStore APIs.

char rbuf[BUF_SIZE];
char wbuf[BUF_SIZE];
// UStore open
int fd = openU("/dev/ustore::data.ustore::lru", 0x755);
lseekU(fd, read_off, SEEK_SET);   // set access offset
readU(fd, rbuf, REQ_SIZE);        // UStore read
lseekU(fd, write_off, SEEK_SET);  // set access offset
writeU(fd, wbuf, REQ_SIZE);       // UStore write
closeU(fd);                       // UStore close


After opening UStore with openU, data within the range of UStore can be directly accessed through readU and writeU, and the rest is redirected to the auxiliary file. The access offset is repositioned with lseekU. If the content page corresponding to the current file offset exists in UStore, it can be accessed directly without passing through the system software layer. Otherwise, the data is accessed normally through the original standard I/O path and, according to the page management policy, victim pages are replaced with the requested pages. closeU unmaps UStore from the application address space altogether and closes the auxiliary file. Table 5 is example code for UStore's compatible mode which uses the above APIs and has syntax familiar to programmers.

Fig. 8 Implemented pointing page: hash table and LRU list.


Additionally, to support direct access to a UStore's content pages, we provide two APIs, getUpointer and releaseUpointer. getUpointer returns a direct address pointer to the content page and internally moves the corresponding pointing cell from the LRU list to the pinned page list, which keeps the pages currently referenced by the application; this prevents eviction of the content page. releaseUpointer releases the content page from the pinned page list and internally moves the pointing cell back to the LRU list.

However, in the case of direct page access to a UStore managed with the LRU-based page replacement policy, logically neighboring content pages may not be physically adjacent, so an overlapped memory access across physically adjacent pages can produce unexpected results. In the previous persistent object solutions, Mnemosyne and NV-heaps, overlapped data access between persistent objects can also happen. In the current version of our solution, we consider the usage of direct access only for small-sized data structures, including primitive data types and sets of them, which are contained within a single Upage. To prohibit overlapped data access, we recommend using our simple macro, which checks whether a data request overlaps a Upage boundary and asserts if it does. With the macro, we can handle data structures within a Upage boundary.
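A sketch of how direct access mode might be used; only the API names come from Table 4, while the prototypes, return types, and the structure stored in the Upage are our assumptions:

/* assumed prototypes for the UStore library calls used below */
extern void *getUpointer(int fd, long off);
extern void  releaseUpointer(int fd, long off);

struct counter { long hits; long misses; };   /* small enough to fit in one Upage */

void bump_hits(int fd, long off)
{
    /* pin the Upage holding 'off' and obtain a direct pointer into UStore */
    struct counter *c = (struct counter *)getUpointer(fd, off);
    c->hits++;                     /* in-place update, no readU/writeU copy */
    releaseUpointer(fd, off);      /* unpin: the pointing cell returns to the LRU list */
}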

Exclusive Storage Cache
As described in the previous section, in order to handle the excess data which cannot be stored in the limited content pages of UStore, two page management policies, flat and LRU-based, are introduced. In this subsection, we describe our implementation of the latter policy in detail.

As shown in Fig. 8, the data structures supporting the LRU-based page management policy are allocated in each pointing cell, which contains the logical page number of the corresponding content page, a hash chain list entry, an LRU list entry, and a dirty bit. In order to rapidly search for a content page among all the content pages located in UStore, we manage a hash table over the pointing cells; each hash bucket is linked to a hash chain list, within which all content pages have the same hashed value of their page number. By traversing the hash chain list starting from a hash bucket, we can find the requested page. The LRU list is used for holding the most recently used pages in UStore. Whenever a page is accessed, it is moved to the MRU position of the LRU list. If the requested page does not exist in UStore and UStore is full, as described in the flow chart in Fig. 7, a victim page is selected from the LRU position of the LRU list; if the dirty bit of the victim page is set, it is first written to the auxiliary file. Then, if the requested page is in the file, it is loaded from the file into UStore. In order to save the storage space for the dirty bit, we use the lowest bit of the "next" pointer value in the hash chain list head as the dirty bit, because the pointer is aligned on a 64-bit boundary and the lowest 3 bits are not used for 64-bit addressing.


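A sketch of a 40-byte pointing cell consistent with the description above; the field names, their order, and the choice of which "next" pointer carries the dirty bit are our assumptions:

#include <stdint.h>

struct pointing_cell {        /* 5 x 8 bytes = 40 bytes, as stated in Sect. 4.4 */
    uint64_t page_index;      /* logical page number of the content page */
    uint64_t hash_next;       /* inner pointer: next cell in the hash chain;
                                 bit 0 doubles as the dirty bit (cells are
                                 8-byte aligned, so the low 3 bits are free) */
    uint64_t hash_prev;       /* inner pointer: previous cell in the chain */
    uint64_t lru_next;        /* inner pointers: LRU list linkage */
    uint64_t lru_prev;
};

#define DIRTY_BIT 0x1UL
#define PTR_MASK  (~0x7UL)

static inline int      cell_is_dirty(const struct pointing_cell *c)   { return (int)(c->hash_next & DIRTY_BIT); }
static inline void     cell_set_dirty(struct pointing_cell *c)        { c->hash_next |= DIRTY_BIT; }
static inline uint64_t cell_chain_next(const struct pointing_cell *c) { return c->hash_next & PTR_MASK; }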

Comparing this approach with the previous related works, BPFS and SCMFS do not consider the use of secondary storage for excess storage space requirements. In such a case, Mnemosyne swaps out persistent memory pages to a backing file, which is only supported by its modified virtual memory management system.

Consistent and Atomic Operation
UStore basically supports consistency and atomicity only for page management operations, not for actual content updates, like the ordered mode of the ext3 file system, which guarantees only the atomicity of file system metadata updates.

When a new content page is allocated, or a content page is replaced with another page based on UStore's page replacement policy, the inner pointers between pointing cells, for instance the inner pointers for the hash chain and the LRU list, are updated together and the operations have to be performed atomically. Otherwise, the data structures can easily be corrupted by system failures or abrupt faults of the application. Moreover, because the actual updates of all the metadata for a page operation are performed in the CPU cache, we also have to guarantee data consistency between the CPU cache and the persistent metadata on STT-MRAM.

First, for consistency, we take a software approach similar to the previous works [7], [8], [21], in which every update is followed by the clflush and mfence instructions of the Intel ISA. Additionally, in order to provide atomicity of the operation, UStore performs undo-logging of all the updated pointing cells in the undo-logging pages. We log each pointing cell update at word level, together with its offset, in each log entry. The reason we chose the hash table for fast page searching is that every UStore management operation causes updates of only a limited number of pointing cells, independent of the number of Upages in UStore. Our first implementation was based on a red-black tree structure, but we discarded it because the balanced tree structure is more complicated and can generate cascaded updates of pointing cells for each page operation; this means it needs more logging space and has a bigger logging overhead. In a similar way, we can flexibly apply a better data structure in the pointing cells for Upage management according to the characteristics of the application.
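A minimal sketch of the flush-and-fence step described above (the helper name is ours; on the assumed x86 target the intrinsics come from SSE2):

#include <stddef.h>
#include <stdint.h>
#include <emmintrin.h>   /* _mm_clflush, _mm_mfence */

#define CACHE_LINE 64

/* Write back a just-updated metadata range from the CPU cache to STT-MRAM. */
static void persist(const void *addr, size_t len)
{
    const char *p   = (const char *)((uintptr_t)addr & ~(uintptr_t)(CACHE_LINE - 1));
    const char *end = (const char *)addr + len;
    for (; p < end; p += CACHE_LINE)
        _mm_clflush(p);
    _mm_mfence();        /* order the flushes before subsequent updates */
}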

Fig. 9 SQLite applied with UStore’s direct access mode.


Concurrency Support
The biggest difference from the previous persistent memory solutions is that UStore is not dedicated to an application instance. This means that a UStore can be simultaneously mapped into multiple processes' virtual address spaces and be accessed by them. The basic mechanism of sharing a UStore between multiple processes is natively similar to that of shared memory in commodity OSes. However, in order to share a UStore between them, we have to support concurrent access to UStore.

In the current implementation of concurrency support, based on the assumption that the processor is cache coherent, we simply allocate a lock variable in the information block; it is used only to support concurrent access to UStore's metadata (the bitmap and pointing cells) for page management, not to the actual data of the content pages.

5. Example: Applying UStore to an I/O Intensive Application

In this Section, in order to prove that UStore can enhance the performance of existing applications and is easily applicable, we introduce our additional work, an example with an I/O intensive application.

Among the variety of I/O intensive applications, we chose a well-known open-source database system, SQLite [18], as our target application, because it is a typical I/O intensive application and its internal architecture is appropriate for applying UStore. SQLite is a library-formed database system which is easily portable to different computing environments, from small embedded systems to large-scale server systems. All data of a database is stored in a single file. A file is generally independent of the underlying file system and storage system, that is, cross-platform; thus, SQLite is already widely adopted on many platforms. The internal architecture of a SQLite-based application is shown on the left-hand side of Fig. 9, which represents only the case in which we apply UStore's direct access mode to SQLite.



5.1 SQLite Internal

The original SQLite manages a main DB file for the actual database, which is composed of B+ tree-based contiguous DB pages. In SQLite, a DB page is the basic unit accessed from the file system based persistent storage, as shown on the left side of Fig. 9. When a database is opened, SQLite internally creates a user-level page cache in DRAM which contains a fixed number of data pages, and during DB operations parts of the cached pages are accessed. Moreover, in order to support data atomicity and transactional operation, SQLite basically provides rollback journaling, which is widely used in transactional systems. When a write transaction is processed, before updating the content of a cached DB page, the journal manager writes the original content of the page to the rollback journal file. With this, if the system fails abnormally or the transaction is forced to abort by the application's decision, all the pages updated during the transaction can be recovered from their original contents in the rollback journal file by the recovery procedure.

5.2 SQLite I/O Characteristics

In a DB system, small random accesses to data generally occur frequently. They are caused by many kinds of DB operations, such as B+ tree operations (search, insertion, deletion), data updates, and so on, and the updated page must be synchronized to the original page of the DB file within a transaction even though the updated part is much smaller than the page. Thus, they can generate frequent I/O operations and can affect the overall system performance.

In order to find out the actual DB I/O operations, we performed a simple preliminary test based on the DBT2 benchmark [19], which creates database workloads used to simulate heavy user loads for OLTP, such as e-commerce database transactions. It is hard to observe the actual read operations because read requests to the secondary storage are masked by SQLite's page cache and the page cache in the kernel, and even when a small amount of data in a page is requested to be read, SQLite reads the whole page. Therefore, we observed only the DB page write operations which are actually synchronized to the secondary storage by the "fsync" system call, and examined the byte-level difference ratio between the content of the original page and the content of the newly updated page.

The size of a DB page is set to 1 KB and the CDF of the benchmark test results is shown in Fig. 10. Because the "fsync" system call is issued after every write transaction, this result is the same as the actual write requests to the secondary storage. According to this result, the average difference ratio is only 9.28%, and 54% of the page write operations update only 1% of the page contents. From the results, we are convinced that small DB reads/writes occur frequently, and we expect that if we apply UStore, the performance of DB applications can be enhanced.

Fig. 10 CDF for difference ratio of DB page writes of DBT2 benchmark.


5.3 Applying UStore to SQLite

As shown on the right side of Fig. 9, first, in order to apply UStore to the SQLite main DB file, we simply replace the open system call with the openU function and open the file with a manipulated file name specifying the UStore device, "/dev/ustore", and the LRU-based page replacement policy. Next, we also apply another UStore, "/dev/ustore_j", as the journaling file. Because the journaling pages are sequentially written to the file during a transaction and are invalidated after the transaction, we simply select the flat page management policy for it.
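For illustration only (not the actual SQLite patch): the kind of replacement this amounts to, with hypothetical file names, the flag value taken from the Table 5 example, and an assumed "flat" policy identifier string:

/* assumed prototype from the UStore library */
extern int openU(const char *name, int flag);

void open_sqlite_files(int *db_fd, int *jrn_fd)
{
    /* Originally: open("test.db", ...) and open("test.db-journal", ...).
       With UStore (compatible mode): main DB on an LRU-managed UStore,
       rollback journal on a second UStore using the flat policy. */
    *db_fd  = openU("/dev/ustore::test.db::lru", 0x755);
    *jrn_fd = openU("/dev/ustore_j::test.db-journal::flat", 0x755);
}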

We implemented two different versions of the SQLite instance, applying compatible mode and direct access mode, respectively. In the former case, the code migration is so simple that only 7 lines are modified and 5 lines are added in the whole SQLite code. With only these few code modifications, we could obtain a performance enhancement over the original version of SQLite. In the other version of SQLite, we eliminated SQLite's user-level page cache and, with getUpointer, we made the data structures for caching point directly to the content pages in UStore, as shown in Fig. 9. In this case, a more elaborate code modification is needed, but DRAM memory space can be saved and it outperforms the compatible mode case.

6. Evaluation

Our implementation and experiments are performed on a 64-bit dual-core Xeon 5160 based machine operating at 3.0 GHz, which has a 4 MB cache and a 1333 MHz FSB and is equipped with 8 GB of DRAM. 4 GB of the DRAM is allocated to the Linux kernel 2.6.31 and the remaining 4 GB is used as the emulated STT-MRAM space; the size of each UStore is 512 MB or 1 GB. We assume that the read/write performance of STT-MRAM is the same as that of DRAM, based on Table 1.

Because the original implementation of SCMFS is not available to us, we implemented a pseudo version of SCMFS based on its basic idea. We did not implement the whole file system functionality of SCMFS; we simply built it as a simple character device driver named "/dev/scmfs" which initially assigns a large kernel memory space with vmalloc. Therefore, it supports only read/write operations at a certain offset on the device, and because no file system overheads are included in this simplified SCMFS, its measured performance is considered to be better than that of the original SCMFS.


Fig. 11 Read performance of tmpfs, SCMFS, and UStore (FLAT).

Fig. 12 Write performance of tmpfs, SCMFS, and UStore (FLAT).


6.1 Micro-Benchmarks

First, to evaluate in detail the access performance of each storage system with relatively small request sizes, we performed simple benchmark tests which repeat read/write operations requesting a specific size of data at a random offset on the storage system. For this, we assign a 1 GB space for each storage system.

Figures 11 and 12 represent the performance comparison with the other systems in the case of the flat page management policy, and they show the basic performance characteristics of each. Because SCMFS's space is assigned in contiguous kernel address space, it does not require multiple memory references to reach the target memory page at a certain offset and shows better performance than tmpfs.

However, for accessing the data on SCMFS, the system software layer overheads still exist. Meanwhile, because data can be directly accessed without additional overheads under the flat page management policy, UStore shows much lower data access latency than the others. In particular, as shown in the embedded graphs, when the request size is small, the latency reduction ratio is much bigger.

The reason the write operation takes more time than the read operation is that in the read case only the user buffer data is updated and the data can stay alive in the CPU cache, while in the write case the data at the target offset is updated and must finally be written to the physical memory by eviction from the CPU cache.

Fig. 13 Read performance of LRU-based UStore (hash bucket size: 10 K).

Fig. 14 Write performance of LRU-based UStore (hash bucket size: 10 K).


Figures 13 and 14 show the access latencies of UStore under the LRU-based page management policy with different Upage sizes. In the experiment, the bucket size of the hash table for fast page searching is set to 10 K. A small Upage size means that the number of pages to be managed is large. As the request size gets bigger, the overhead of accessing the multiple Upages within the requested data region grows, because the number of Upages to be searched increases. In addition, as the number of Upages increases, the search time of our implemented hash table increases more than that of the AVL tree structure of tmpfs.

Consequently, the read performance of UStore with a 4 KB Upage is similar to that of tmpfs. This is clear from the fact that the access time increases abruptly at every 4 KB increment of the request size. Even though it is smaller than in the 4 KB case, for UStore with a 16 KB Upage the abrupt increments of the access time also show at every 16 KB increment of the request size. According to the analysis based on perf, when the request size is 512 bytes, the pure glibc memcpy overhead in the case of a 32 KB Upage accounts for 45.0%, while in the case of a 4 KB Upage the memcpy overhead is only 24.4% but the overhead of searching for the requested pages among the pointing cells is over 50%.


Fig. 15 Read performance of LRU-based UStore with different hash bucket sizes (10 K vs. 32 K).

Fig. 16 Overhead contribution of a write operation for 4 KB Upage.

Fig. 17 Overhead contribution of a write operation for 32 KB Upage.

Compared with the above experiments, however, as shown in Fig. 15, when the number of hash buckets is 32 K, the overhead of searching for the requested page in UStore with a 4 KB Upage is smaller than with the smaller bucket size. The choice of hash bucket size in our current hash-based implementation is a trade-off, because while a large hash bucket count makes the hash chains shorter, it occupies more storage space.

The next two results, in Figs. 16 and 17, confirm our analysis of the page management overhead. The major source of access overhead for the write operation is memcpy, and the overhead contribution of memcpy is proportional to the request size. The memcpy overhead is unavoidable and is independent of the Upage size. In contrast, the page management overhead depends on the Upage size: in the case of a 4 KB Upage, the page management overhead is much bigger than in the case of a 32 KB Upage. From the above results, we consider a reasonable Upage size to be 32 KB, which does not affect the access time much.

Fig. 18 IOZone Random Read throughput.

Fig. 19 IOZone Random Write throughput.

and the overhead contribution of memcpy is proportional tothe request size. The memcpy overhead is unavoidable and isregardless of the Upage size. While, the page managementoverhead depends on the Upage size. In the case of 4 KBUpage, the page management overhead is much bigger thanthe case of 32 KB Upage. From the above results, we con-sider the reasonable size of Upage to be 32 KB, which doesnot affect the access time much.

6.2 IOZone Benchmark

We also ran the popular IOZone benchmark [20] on each storage system. For testing, we modified the original IOZone so that it can run on UStore and on our pseudo version of SCMFS. Figures 18 and 19 show the IOZone random access throughputs. The random read throughput of UStore is better than that of the other storage systems over almost the entire range of request sizes. However, the random write throughput is much lower than SCMFS when the request size is between 32 KB and 8 MB. We believe the cause is that the generic glibc memcpy used by UStore differs from the copy_to_user and copy_from_user routines used in kernel space. The user-level memcpy targets relatively small memory copies for applications; from our examination of the glibc 2.11 source code, the generic glibc memcpy does not use the movsq instruction, which copies 8 bytes at a time and is well suited to large memory copies, whereas the two kernel copy routines between user space and kernel space do use movsq.

Fig. 20 IOZone Random Write throughput with faster memcpy.

Fig. 21 DBT2 benchmark performance (TPM, transactions per minute) of different versions of SQLite.
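As a concrete illustration of the movsq-style copy described above, the following is a minimal x86-64-only sketch; it is not the kernel's code, only a GCC/Clang-style inline-assembly approximation of the rep movsq idea, with an ordinary memcpy for the tail bytes.

    #include <stddef.h>
    #include <string.h>

    /* Illustrative x86-64-only copy: rep movsq moves 8 bytes per iteration,
     * and an ordinary memcpy handles the remaining tail bytes. */
    static void copy_movsq(void *dst, const void *src, size_t n)
    {
        size_t qwords = n / 8;

        __asm__ volatile("rep movsq"
                         : "+D"(dst), "+S"(src), "+c"(qwords)
                         :
                         : "memory");
        /* rep movsq has already advanced dst and src past the copied qwords. */
        memcpy(dst, src, n % 8);
    }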

Therefore, we additionally apply a new version of memcpy that uses prefetch instructions, especially to support write operations on large data. Because prefetching is most effective when the accesses are sequential, it is well suited to large data copies. The enhanced performance is shown as "UStore Write new" in Fig. 20. Over the entire range of request sizes, the IOZone throughput of UStore increases. This result implies that memcpy performance is a very important factor in an NVM-based persistent storage system.
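The exact memcpy variant we used is not reproduced here; the following is only a minimal sketch of the idea using the compiler's __builtin_prefetch: source cache lines are prefetched a fixed distance ahead while the copy proceeds in 64-byte steps, which mainly helps large sequential copies.

    #include <stddef.h>
    #include <string.h>

    #define PREFETCH_DISTANCE 256    /* bytes to prefetch ahead of the copy */

    /* Copy with software prefetching of the source; mainly beneficial for
     * large, sequential copies where prefetched lines arrive in time. */
    static void memcpy_prefetch(void *dst, const void *src, size_t n)
    {
        char *d = dst;
        const char *s = src;

        while (n >= 64) {
            __builtin_prefetch(s + PREFETCH_DISTANCE, 0, 0);  /* read, low locality */
            memcpy(d, s, 64);        /* copy one cache line at a time */
            d += 64;
            s += 64;
            n -= 64;
        }
        memcpy(d, s, n);             /* tail */
    }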

6.3 Performance of SQLite

As introduced in the previous chapter, we applied UStore to SQLite with two operating modes, compatible mode and direct-access mode. To evaluate the new versions of SQLite, we ran the DBT2 benchmark on them. We compare our modified SQLite with the original SQLite running on an HDD and on the temp file system. As shown in Fig. 21, the performance of the original SQLite on the HDD is only 3.7 TPM (transactions per minute) because each transaction requires data synchronization to the HDD. By placing the main DB file of SQLite in memory, the performance increases dramatically; in this case the application is no longer I/O intensive but memory intensive. The total performance of SQLite on the temp file system is on average 322.8 TPM, while SQLite with UStore in compatible mode is 4.1% better than the tmpfs-based SQLite, at 336.0 TPM. This improvement may seem small, but it is a remarkable result for a memory-intensive workload.

In addition, the version of SQLite using UStore's direct-access mode shows 9.6% better performance than the tmpfs-based one. Because an OLTP-like workload frequently generates small accesses to persistent storage, as introduced in the previous chapter, the effect of UStore's direct-access mode is much larger in this case. We consider these results representative of the goals of this work.

7. Further Works

UStore is not a file system in itself; it simply provides a large virtual persistent storage space, similar to a large file. To provide multiple file abstractions on UStore, a good solution would be to lay an existing user-level file system, such as WheFS [26] or the UKFS (User Kernel File System) library of RumpFS [27], on top of a UStore following the flat page policy. We expect that such a user-level file system on STT-MRAM will not cause kernel faults and will show better performance than other kernel file systems.

In this work, we do not consider the case where a UStore is shared concurrently by multiple running applications. Inherently, a UStore can be mapped into the virtual address spaces of different applications simultaneously, similar to shared memory. To support sharing of the data, however, we have to consider mutually exclusive access to the pages in UStore between applications.
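One conceivable direction, sketched below purely as an assumption and not as part of the current implementation, is to reserve a small header in the shared region and place a process-shared lock there, in the same way shared-memory segments are often coordinated; the structure and function names here are hypothetical.

    #include <pthread.h>

    /* Hypothetical header reserved at the start of a shared UStore region. */
    struct ustore_shared_header {
        pthread_mutex_t lock;        /* guards page metadata updates */
    };

    /* Run once by whichever application creates the shared region. */
    static void shared_lock_init(struct ustore_shared_header *hdr)
    {
        pthread_mutexattr_t attr;

        pthread_mutexattr_init(&attr);
        /* PTHREAD_PROCESS_SHARED allows any process that maps the same
         * region to lock and unlock this mutex. */
        pthread_mutexattr_setpshared(&attr, PTHREAD_PROCESS_SHARED);
        pthread_mutex_init(&hdr->lock, &attr);
        pthread_mutexattr_destroy(&attr);
    }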

8. Conclusion

We examine how much the system software overheads, such as system calls and the virtual file system layer, influence persistent data access performance when the persistent data reside on promising NVM, especially STT-MRAM.

We therefore propose a light-weight user-level persistent storage, called UStore, based on STT-MRAM. By directly mapping the persistent space of STT-MRAM into the user address space, it eliminates the system software overheads. Through several benchmarks, we confirm that UStore outperforms the other file systems. UStore can also provide a virtual persistent storage space, backed by the underlying file system, that is much larger than UStore's limited physical size. To accomplish this, we design the page layout on UStore for appropriate page management. In addition, to make it easy to apply UStore to an existing application, we provide APIs similar to the traditional I/O primitive calls. With minor code modifications using these APIs, we built new versions of an actual database application, SQLite, and showed the performance enhancement. We expect that UStore can be a good solution for providing fast persistent storage to high-performance applications.


References

[1] M.H. Kryder and C.S. Kim, "After hard drives - what comes next?," IEEE Trans. Magn., vol.45, no.10, pp.3406–3413, 2009.

[2] T. Perez, Non-volatile memory: Emerging technologies and their impacts on memory systems, Technical Report, 2010.

[3] Y. Park, S.H. Lim, C. Lee, and K.H. Park, "PFFS: A scalable flash memory file system for the hybrid architecture of phase-change RAM and NAND flash," Proc. 2008 ACM Symp. on Applied Computing (SAC), Fortaleza, Ceara, Brazil, pp.1498–1508, March 2008.

[4] J.K. Kim, H.G. Lee, S. Choi, and K.I. Bahng, "A PRAM and NAND flash hybrid architecture for high-performance embedded storage subsystems," Proc. 8th ACM & IEEE Intl. Conf. on Embedded Software, pp.31–40, Atlanta, USA, Oct. 2008.

[5] J. Condit, E.B. Nightingale, C. Frost, E. Ipek, B.C. Lee, D. Burger, and D. Coetzee, "Better I/O through byte-addressable, persistent memory," Proc. 22nd Symp. on Operating Systems Principles, pp.133–146, Big Sky, Montana, USA, Oct. 2009.

[6] X. Wu and A.L.N. Reddy, "SCMFS: A file system for storage class memory," Conf. on High Performance Computing Networking, Storage and Analysis, no.39, pp.1–11, Seattle, Washington, USA, Nov. 2011.

[7] J. Coburn, A.M. Caulfield, A. Akel, L.M. Grupp, R.K. Gupta, R. Jhala, and S. Swanson, "NV-Heaps: Making persistent objects fast and safe with next-generation, non-volatile memories," Proc. 16th Intl. Conf. on Architectural Support for Programming Languages and Operating Systems (ASPLOS), pp.105–118, Newport Beach, CA, USA, March 2011.

[8] H. Volos, A.J. Tack, and M.M. Swift, "Mnemosyne: Lightweight persistent memory," Proc. 16th Intl. Conf. on Architectural Support for Programming Languages and Operating Systems (ASPLOS), pp.91–104, Newport Beach, CA, USA, March 2011.

[9] M.P. Atkinson, P.J. Bailey, K.J. Chisholm, P.W. Cockshott, and R. Morrison, "An approach to persistent programming," The Computer Journal, vol.26, no.4, pp.360–365, Nov. 1983.

[10] ITRS, "International technology roadmap for semiconductors: Emerging research devices," International Technology Roadmap for Semiconductors, www.itrs.net, accessed Aug. 31, 2012.

[11] Linux Perf, Perf: Linux profiling with performance counters, https://perf.wiki.kernel.org/index.php/Main_Page, accessed Oct. 20, 2012.

[12] K. Tsuchida, T. Inaba, K. Fujita, Y. Ueda, T. Shimizu, Y. Asao, T. Kajiyama, M. Iwayama, K. Sugiura, S. Ikegawa, T. Kishi, T. Kai, M. Amano, N. Shimomura, H. Yoda, and Y. Watanabe, "A 64Mb MRAM with clamped-reference and adequate-reference schemes," Proc. Solid-State Circuits Conf.: Digest of Technical Papers (ISSCC), pp.258–259, Feb. 2010.

[13] AMD Technology, "AMD64 Programmer's Manual Volume 2: System Programming," http://support.amd.com/us/Processor_TechDocs/24593_APM_v2.pdf, accessed Oct. 20, 2012.

[14] S. Vassiliadis, F. Duarte, and S. Wong, "A load/store unit for a memcpy hardware accelerator," Intl. Conf. on Field Programmable Logic and Applications, pp.537–541, Amsterdam, The Netherlands, Aug. 2007.

[15] C.W. Smullen IV, V. Mohan, A. Nigam, S. Gurumurthi, and M.R. Stan, "Relaxing non-volatility for fast and energy-efficient STT-RAM caches," IEEE Intl. Symp. on High Performance Computer Architecture (HPCA), pp.50–61, San Antonio, Texas, USA, Feb. 2011.

[16] S.C. Oh, J.H. Jeong, W.C. Lim, W.J. Kim, Y.H. Kim, H.J. Shin, J.E. Lee, Y.G. Shin, S. Choi, and C. Chung, "On-axis scheme and novel MTJ structure for sub-30nm Gb density STT-MRAM," IEEE Intl. Electron Devices Meeting (IEDM), pp.12.6.1–12.6.4, San Francisco, CA, USA, Dec. 2010.

[17] A. Driskill-Smith, S. Watts, V. Nikitin, D. Apalkov, D. Druist, R. Kawakami, X. Tang, X. Luo, A. Ong, and E. Chen, "Non-volatile spin-transfer torque RAM (STT-RAM): Data, analysis and design requirements for thermal stability," IEEE Symp. on VLSI Technology (VLSIT), pp.51–52, Milpitas, CA, USA, May 2010.

[18] SQLite, SQLite official home page, http://sqlite.org, accessed Oct. 20, 2012.

[19] M. Wong and T.D. Witham, "DBT2 benchmark," http://sourceforge.net/apps/mediawiki/osdldbt/index.php, accessed Oct. 20, 2012.

[20] W.D. Norcott and D. Capps, "IOzone File System Benchmark," http://www.iozone.org, accessed Oct. 20, 2012.

[21] S. Venkataraman, N. Tolia, P. Ranganathan, and R.H. Campbell, "Consistent and durable data structures for non-volatile byte-addressable memory," 9th USENIX Conf. on File and Storage Technologies (FAST), pp.61–75, San Jose, CA, USA, Feb. 2011.

[22] T.J. Lehman and M.J. Carey, "A study of index structures for a main memory database management system," Proc. 12th Intl. Conf. on Very Large Data Bases (VLDB), pp.294–303, Kyoto, Japan, 1986.

[23] A.C. Ammann, M. Hanrahan, and R. Krishnamurthy, "Design of a memory resident DBMS," 30th IEEE Computer Society Intl. Conf., pp.54–58, San Francisco, CA, USA, Feb. 1985.

[24] H. Garcia-Molina and K. Salem, "Main memory database systems: An overview," IEEE Trans. Knowl. Data Eng., vol.4, no.6, pp.509–516, Dec. 1992.

[25] D.J. DeWitt, R.H. Katz, F. Olken, L.D. Shapiro, M.R. Stonebraker, and D.A. Wood, "Implementation techniques for main memory database systems," Proc. 1984 ACM SIGMOD Intl. Conf. on Management of Data, pp.1–8, Boston, Massachusetts, June 1984.

[26] S. Beal, "Whefs: Embedded file system library for C," http://code.google.com/p/whefs/wiki/HomePage, accessed Oct. 20, 2012.

[27] A. Kantee, "Rump file systems: Kernel code reborn," USENIX Annual Technical Conference, pp.15–15, San Diego, CA, USA, June 2009.

Yong Song was born in 1978. He received his bachelor's degree from the Electrical Engineering Department of KAIST in 2001 and his master's degree from the same department in 2003. He is a Ph.D. student in the CORE Lab of the Electrical Engineering Department, Korea Advanced Institute of Science and Technology, Gu-sung Dong, Daejeon, Korea.

Kyuho Park received the B.S. degree in electronics engineering from Seoul National University, Korea, in 1973, the M.S. degree in electrical engineering from KAIST, Korea, in 1975, and the Dr. Ing. degree in electrical engineering from the Universite de Paris XI, France, in 1983. He has been a Professor in the Department of Electrical Engineering, KAIST, Korea, since 1983. His research areas are computer architectures, file systems, storage systems, and parallel processing. He is a member of KISS, KITE, the Korea Institute of Next Generation Computing, IEEE, and ACM.