SplitFS: Reducing Software Overhead in File Systems for Persistent Memory
Rohan Kadekodi, Se Kwon Lee, Sanidhya Kashyap*, Taesoo Kim, Aasheesh Kolli, Vijay Chidambaram
* on the job market
Persistent Memory (PM)
• Non-volatile
• Fast
PM file systems
[Figure: an Application accesses File 1, File 2, … File n stored on PM through the POSIX API (read(), write(), open(), close()), which is implemented by a PM file system such as ext4-DAX, PMFS, NOVA, or Strata.]
ext4-DAX
• Modification of the ext4 file system for Persistent Memory
• Works with modern Linux kernels
• Under active development by the ext4 community
• Only PM file system that is widely used
Software Overhead in File Systems
• Append 4KB data to a file
• Time taken to copy user data to PM: ~700 ns
[Bar chart, time (ns) per 4KB append: device 700; Strata 2450 (2.5x software overhead); NOVA 3021 (3x); PMFS 4150 (5x); ext4-DAX 9002 (12x)]
File systems suffer from high software overhead!
ext4-DAX, although widely used, suffers from the highest software overhead and provides weak guarantees
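As a concrete illustration of how a number like this can be measured, here is a minimal C sketch of a 4KB append-latency microbenchmark. It is not the authors' harness: the PM mount path, the iteration count, and the decision to fsync after every append are assumptions made for illustration.

/* append_bench.c - sketch of a 4KB append-latency microbenchmark.
 * Assumptions: a file system mounted at /mnt/pmem, and an fsync after
 * every append so each iteration measures a durable append. */
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <time.h>
#include <unistd.h>

#define APPEND_SIZE 4096
#define ITERATIONS  100000

int main(void)
{
    char buf[APPEND_SIZE];
    memset(buf, 'a', sizeof(buf));

    int fd = open("/mnt/pmem/appendfile", O_CREAT | O_WRONLY | O_APPEND, 0644);
    if (fd < 0) { perror("open"); return 1; }

    struct timespec start, end;
    clock_gettime(CLOCK_MONOTONIC, &start);
    for (int i = 0; i < ITERATIONS; i++) {
        if (write(fd, buf, sizeof(buf)) != sizeof(buf)) { perror("write"); return 1; }
        if (fsync(fd) != 0) { perror("fsync"); return 1; }  /* make the append durable */
    }
    clock_gettime(CLOCK_MONOTONIC, &end);

    double ns = (end.tv_sec - start.tv_sec) * 1e9 + (end.tv_nsec - start.tv_nsec);
    printf("avg append latency: %.0f ns\n", ns / ITERATIONS);
    close(fd);
    return 0;
}

Dividing total elapsed time by the iteration count gives a per-append latency comparable in spirit to the per-file-system numbers above.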
Goals
- Low software overhead
- Strong consistency guarantees
- Leverage the maturity and active development of ext4-DAX
SplitFS
POSIX file system aimed at reducing software overhead for PM
• Serves data operations from user space and metadata operations using the ext4-DAX kernel file system
• Provides strong guarantees such as atomic and synchronous data operations
• Reduces software overhead by up to 17x compared to ext4-DAX
• Improves application throughput by up to 2x compared to NOVA
https://github.com/utsaslab/splitfs
Outline
• Target usage scenario
• High-level design
• Handling data operations
• Consistency guarantees
• Evaluation
Target usage scenario
• SplitFS is targeted at POSIX applications that use read() / write() system calls to access their data on Persistent Memory
• SplitFS does not optimize for the case when multiple processes concurrently access the same file
Outline
• Target usage scenario
• High-level design
• Handling data operations
• Consistency guarantees
• Evaluation
High-level Design
SplitFS lies both in user space and in the kernel:
• Data operations are served from user space
• Metadata operations are served from ext4-DAX

[Design-space figure:
• Data and metadata in kernel (ext4-DAX, PMFS [EuroSys 14], NOVA [FAST 16]): low performance
• Data and metadata in user space (Strata [SOSP 17]): high complexity
• Data and metadata in user space, allocations in kernel (Aerie [EuroSys 14]): low performance, high complexity
• Data in user space, metadata in kernel (SplitFS): high performance, low complexity]
High-level Design
Accelerate data operations from user space (high performance)
• Data operations are common and simple
Use ext4-DAX for metadata operations (low complexity)
• Metadata operations are rare and complex
• POSIX has many complex corner cases
High-level Design
[Architecture figure: the Application runs with U-Split in user space; K-Split (ext4-DAX) runs in the kernel and manages File 1, File 2, File 3 and a Log on PM. read() and write() are served by U-Split, while creat() and delete() are passed to K-Split.]
SplitFS accelerates common-case data operations while leveraging the maturity of ext4-DAX for metadata operations
SplitFS uses logging and out-of-place updates to provide atomic and synchronous operations
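The slide does not show how application calls reach U-Split. One common way to build such a split is to interpose on libc's POSIX wrappers from a preloaded user-space library; the sketch below assumes that mechanism, and usplit_tracks() / usplit_write() are hypothetical placeholders for the user-space data path, not SplitFS's real functions.

/* u_split_shim.c - sketch of interposition for a U-Split-style library.
 * Data operations on tracked PM files are served in user space;
 * everything else falls through to the kernel (K-Split / ext4-DAX). */
#define _GNU_SOURCE
#include <dlfcn.h>
#include <stdbool.h>
#include <sys/types.h>

static ssize_t (*real_write)(int, const void *, size_t);

/* Placeholder data path: a real implementation would find the fd's
 * DAX mapping and current offset and store the data into PM directly. */
static bool usplit_tracks(int fd) { (void)fd; return false; }
static ssize_t usplit_write(int fd, const void *buf, size_t len)
{ (void)fd; (void)buf; (void)len; return -1; }

ssize_t write(int fd, const void *buf, size_t len)
{
    if (!real_write)
        real_write = (ssize_t (*)(int, const void *, size_t))dlsym(RTLD_NEXT, "write");

    if (usplit_tracks(fd))
        return usplit_write(fd, buf, len);  /* data operation: stay in user space */
    return real_write(fd, buf, len);        /* metadata or untracked fd: go to ext4-DAX */
}

Built as a shared object (for example, gcc -shared -fPIC u_split_shim.c -ldl) and loaded with LD_PRELOAD, a shim like this keeps data operations in user space while every other call still reaches ext4-DAX unchanged.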
Outline
• Target usage scenario
• High-level design
• Handling data operations
    • Handling file reads and updates
    • Handling file appends
• Consistency guarantees
• Evaluation
Handling reads and updates
[Figure: the Application issues a read / update to U-Split in user space. On first access, U-Split has K-Split (ext4-DAX) mmap the file; subsequent reads and updates are served directly from these DAX-mmaps with loads and stores.]
In the common case, file reads and updates do not pass through the kernel
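A minimal sketch of this read path, assuming the file has already been DAX-mmapped through K-Split; the pm_file struct below is illustrative bookkeeping, not SplitFS's actual data structure.

/* read_from_map.c - once a PM file is mmapped, a read is just a memcpy
 * (a sequence of loads) from the mapping, with no kernel entry. */
#include <stddef.h>
#include <string.h>
#include <sys/types.h>

struct pm_file {        /* hypothetical per-file state kept in user space */
    char  *map;         /* address returned by mmap(MAP_SHARED) on the DAX file */
    off_t  offset;      /* current file offset */
    size_t size;        /* current file size */
};

ssize_t usplit_read(struct pm_file *f, void *buf, size_t len)
{
    if ((size_t)f->offset >= f->size)
        return 0;                               /* at EOF */
    size_t avail = f->size - (size_t)f->offset;
    size_t n = len < avail ? len : avail;

    memcpy(buf, f->map + f->offset, n);         /* served entirely from user space */
    f->offset += (off_t)n;
    return (ssize_t)n;
}

An update is the mirror image: a memcpy into the mapping (performed out-of-place in strict mode so that it can be made atomic, as discussed under consistency guarantees).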
Outline
• Target usage scenario
• High-level design
• Handling data operations
    • Handling file reads and updates
    • Handling file appends
• Consistency guarantees
• Evaluation
Handling appends
[Figure walkthrough, with file foo (size = 10) and its inode on Persistent Memory, K-Split in the kernel, and the Application with U-Split in user space:
• Start: U-Split obtains a pre-allocated staging file (size = 100) and mmaps it
• append(foo, "abc"): the new data is written into the staging file's mmap with stores
• read(foo): served with loads, including the data staged so far
• fsync(foo): relink() logically moves the staged blocks from the staging file to foo inside a single ext4-journal transaction, without copying the data]
In the common case, file appends do not pass through the kernel
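A minimal C sketch of that sequence, assuming the staging file is already mmapped; the staged_file struct is illustrative, and relink() below is a stand-in for however the relink primitive is exposed to user space (the slides only name the operation).

/* append_staging.c - sketch of SplitFS-style appends.
 * Appended bytes are stored into a pre-allocated, mmapped staging file;
 * fsync() later relinks the staged blocks into the target file. */
#include <stddef.h>
#include <string.h>
#include <sys/types.h>

struct staged_file {            /* hypothetical user-space bookkeeping */
    char  *staging_map;         /* mmap of the staging file */
    size_t staged_bytes;        /* bytes appended since the last fsync */
    size_t file_size;           /* current size of the target file (foo) */
    int    fd;                  /* fd of the target file */
    int    staging_fd;          /* fd of the staging file */
};

/* Stand-in for the relink primitive provided by the kernel component. */
int relink(int staging_fd, off_t staging_off, int dst_fd, off_t dst_off, size_t len);

ssize_t usplit_append(struct staged_file *f, const void *buf, size_t len)
{
    /* No system call: store the data into the staging file's mapping.
     * A real implementation would use non-temporal stores or clwb + sfence
     * so the bytes are durable before fsync returns. */
    memcpy(f->staging_map + f->staged_bytes, buf, len);
    f->staged_bytes += len;
    return (ssize_t)len;
}

int usplit_fsync(struct staged_file *f)
{
    /* One metadata operation: atomically move the staged blocks to the end
     * of the target file inside an ext4-journal transaction, without
     * copying the data itself. */
    int ret = relink(f->staging_fd, 0, f->fd, (off_t)f->file_size, f->staged_bytes);
    if (ret == 0) {
        f->file_size   += f->staged_bytes;
        f->staged_bytes = 0;
    }
    return ret;
}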
Outline
• Target usage scenario
• High-level design
• Handling data operations
• Consistency guarantees
• Evaluation
Consistency Guarantees

Mode   | Metadata atomicity | Synchronous operations | Data atomicity | File systems
POSIX  | yes                | no                     | no             | ext4-DAX, SplitFS-POSIX
Sync   | yes                | yes                    | no             | PMFS, SplitFS-Sync
Strict | yes                | yes                    | yes            | NOVA, Strata, SplitFS-Strict

Optimized logging is used to provide the stronger guarantees of the sync and strict modes
Optimized logging
SplitFS employs a per-application log in sync and strict modes, which records every logical operation
In the common case:
• Each log entry fits in one cache line
• Each entry is persisted using a single non-temporal store and an sfence instruction
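A minimal sketch of persisting one such log entry on x86-64, assuming a 64-byte entry layout (the fields are illustrative, not SplitFS's on-PM format) and a log that lives in a mmapped PM region.

/* op_log.c - a cache-line-sized log entry persisted with non-temporal
 * stores and a single sfence (x86-64). */
#include <immintrin.h>
#include <stdint.h>

/* Hypothetical 64-byte logical log entry, cache-line aligned. */
struct __attribute__((aligned(64))) log_entry {
    uint32_t op;           /* e.g. append or write */
    uint32_t file_id;
    uint64_t offset;
    uint64_t length;
    uint64_t staging_off;  /* where the data lives in the staging file */
    uint64_t seq;
    uint8_t  pad[20];
    uint32_t checksum;     /* lets recovery detect a torn entry */
};

void log_append(struct log_entry *slot,        /* next free 64B slot in the PM log */
                const struct log_entry *entry) /* entry built in DRAM */
{
    const long long *src = (const long long *)entry;
    long long *dst = (long long *)slot;

    /* Write the 64-byte entry with non-temporal stores: the data goes
     * straight toward PM without polluting the cache. */
    for (int i = 0; i < 8; i++)
        _mm_stream_si64(&dst[i], src[i]);

    _mm_sfence();  /* one fence and the whole entry is durable */
}

Because the entry occupies exactly one cache line and is written with non-temporal stores, a single sfence after the stores is enough; no cache-line flush (clwb/clflushopt) is needed.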
Flexible SplitFS
[Figure: three applications (App 1, App 2, App 3) run on the same PM, each linked with its own U-Split instance in a different mode (U-Split-POSIX, U-Split-sync, U-Split-strict) and accessing its own files (File 1 to File 4). Data operations are served by the per-application U-Split; metadata operations from all applications go to the shared K-Split (ext4-DAX) in the kernel.]
Visibility
When are updates from one application visible to another?
• All metadata operations are immediately visible to all other processes
• Writes are visible to all other processes on a subsequent fsync()
• Memory-mapped files have the same visibility guarantees as ext4-DAX
SplitFS Techniques

Technique                               | Benefit
SplitFS architecture                    | Low-overhead data operations, correct metadata operations
Staging + relink                        | Optimized appends, no data copy
Optimized logging + out-of-place writes | Stronger guarantees
Outline
• Target usage scenario
• High-level design
• Handling data operations
• Consistency guarantees
• Evaluation
Evaluation
Setup:
• 2-socket, 96-core machine with 32 MB LLC
• 768 GB Intel Optane DC PMM, 378 GB DRAM
File systems compared:
• ext4-DAX, PMFS, NOVA, Strata

Questions:
• Does SplitFS reduce software overhead compared to other file systems?
• How does SplitFS perform on metadata-intensive workloads?
    • < 15% overhead for metadata-intensive workloads
• How does SplitFS perform on data-intensive workloads?
Software Overhead of SplitFS
• Append 4KB data to a file
• Time taken to copy user data to PM: ~700 ns
[Bar chart, time (ns) per 4KB append: device 700; SplitFS-strict 1251 (0.8x software overhead); Strata 2450 (2.5x); NOVA 3021 (3x); PMFS 4150 (5x); ext4-DAX 9002 (12x)]
Workloads
Data intensive:
• YCSB on LevelDB
• TPCC on SQLite
• Redis
Metadata intensive:
• Tar, Git, Rsync
Microbenchmarks:
• Seq reads, Rand reads, Seq writes, Rand writes, Appends
YCSB on LevelDB
Yahoo! Cloud Serving Benchmark: industry-standard macro-benchmark. Insert 5M keys, run 5M operations. Key size = 16 bytes, value size = 1 KB.
• Load A - 100% writes
• Run A - 50% reads, 50% writes
• Run B - 95% reads, 5% writes
• Run C - 100% reads
• Run D - 95% reads (latest), 5% writes
• Load E - 100% writes
• Run E - 95% range queries, 5% writes
• Run F - 50% reads, 50% read-modify-writes
[Chart: normalized throughput (0 to 2.5) of NOVA and SplitFS-Strict for each workload; absolute throughputs annotated on the bars: 13.39, 32.24, 139.94, 174.85, 191.54, 13.59, 17.75, 66.54 kops/s]
Read-heavy workloads are optimized because reads are converted to loads
Write-heavy workloads are optimized because of staging and relink
SplitFS
SplitFS introduces a new architecture for building PM file systems that reduces software overhead, provides strong guarantees, and leverages the widely-used ext4-DAX
https://github.com/utsaslab/splitfs
Backup Slides
YCSB on LevelDB
Yahoo! Cloud Serving Benchmark: industry-standard macro-benchmark. Insert 5M keys, run 1M operations. Key size = 16 bytes, value size = 1 KB.
• Load A - 100% writes
• Run A - 50% reads, 50% writes
• Run B - 95% reads, 5% writes
• Run C - 100% reads
• Run D - 95% reads (latest), 5% writes
• Load E - 100% writes
• Run E - 95% range queries, 5% writes
• Run F - 50% reads, 50% read-modify-writes
[Chart: normalized throughput (0 to 2.5) of Strata and SplitFS-Strict for each workload]