75
Physical Separation in Modern Storage Systems Lanyue Lu Committee: Andrea C. Arpaci-Dusseau, Remzi H. Arpaci-Dusseau Shan Lu, Michael Swift, Xinyu Zhang University of Wisconsin - Madison Tuesday, December 1, 15

Physical Separation in Modern Storage Systems · Physical Separation in Modern Storage Systems Lanyue Lu Committee: Andrea C. Arpaci-Dusseau, Remzi H. Arpaci-Dusseau Shan Lu, Michael

  • Upload
    others

  • View
    3

  • Download
    0

Embed Size (px)

Citation preview

Page 1: Physical Separation in Modern Storage Systems · Physical Separation in Modern Storage Systems Lanyue Lu Committee: Andrea C. Arpaci-Dusseau, Remzi H. Arpaci-Dusseau Shan Lu, Michael

Physical Separation in Modern Storage Systems

Lanyue Lu

Committee: Andrea C. Arpaci-Dusseau, Remzi H. Arpaci-Dusseau

Shan Lu, Michael Swift, Xinyu ZhangUniversity of Wisconsin - Madison

Tuesday, December 1, 15

Page 2: Physical Separation in Modern Storage Systems · Physical Separation in Modern Storage Systems Lanyue Lu Committee: Andrea C. Arpaci-Dusseau, Remzi H. Arpaci-Dusseau Shan Lu, Michael

Local Storage Systems Are ImportantGFS,

HDFSvmwaredocker

Local Storage

Riak, MongoDB

ext4, NTFS, SQLite

Tuesday, December 1, 15

Page 3: Physical Separation in Modern Storage Systems · Physical Separation in Modern Storage Systems Lanyue Lu Committee: Andrea C. Arpaci-Dusseau, Remzi H. Arpaci-Dusseau Shan Lu, Michael

Data Layout of Storage Systems

Data layout is fundamental➡ how to organize data on disks and in memory ➡ impact both reliability and performance

Locality is the key➡ store relevant data together ➡ locality is pursued in various storage systems➡ file systems, key-value stores, databases

➡ better performance (caching and prefetching)➡ high space utilization ➡ optimize for hard drives

Tuesday, December 1, 15

Page 4: Physical Separation in Modern Storage Systems · Physical Separation in Modern Storage Systems Lanyue Lu Committee: Andrea C. Arpaci-Dusseau, Remzi H. Arpaci-Dusseau Shan Lu, Michael

Problems of Data Locality

New environments ➡ fast storage hardware (e.g., SSDs) ➡ servers with many cores and large memory➡ sharing infrastructure is the reality

➡ virtualization, containers, data centers

Unexpected entanglement➡ shared failures (e.g., VMs, containers)➡ bundled performance (e.g., apps) ➡ lack flexibility to manage data differently

Tuesday, December 1, 15

Page 5: Physical Separation in Modern Storage Systems · Physical Separation in Modern Storage Systems Lanyue Lu Committee: Andrea C. Arpaci-Dusseau, Remzi H. Arpaci-Dusseau Shan Lu, Michael

New Technique: Physical Separation

Redesign data layout ➡ rethink existing data layouts➡ key: separate data structures➡ apply in both file systems and key-value stores

Many new benefits ➡ IceFS: disentangle structures and transactions➡ isolated failures, faster recovery➡ customized performance

➡ WiscKey: key-value separation ➡ minimize I/O amplification➡ leverage devices’ internal parallelism

Tuesday, December 1, 15

Page 6: Physical Separation in Modern Storage Systems · Physical Separation in Modern Storage Systems Lanyue Lu Committee: Andrea C. Arpaci-Dusseau, Remzi H. Arpaci-Dusseau Shan Lu, Michael

Research Contributions

A study of Linux file system evolution ➡ the first comprehensive file-system study➡ published in FAST ’13 (best paper award)

Physical disentanglement in IceFS➡ localized failure, localized recovery ➡ specialized journaling performance➡ published in OSDI ’14

Key-value separation in WiscKey ➡ an SSD-conscious LSM-tree ➡ over 100x performance improvement➡ submitted to FAST ’16

1

2

3

Tuesday, December 1, 15

Page 7: Physical Separation in Modern Storage Systems · Physical Separation in Modern Storage Systems Lanyue Lu Committee: Andrea C. Arpaci-Dusseau, Remzi H. Arpaci-Dusseau Shan Lu, Michael

IntroductionDisentanglement in IceFS➡ File system Disentanglement➡ The Ice File System➡ Evaluation

Key-Value Separation in WiscKey➡ Key-value Separation Idea➡ Challenges and Optimization ➡ Evaluation

Conclusion

Outline

Tuesday, December 1, 15

Page 8: Physical Separation in Modern Storage Systems · Physical Separation in Modern Storage Systems Lanyue Lu Committee: Andrea C. Arpaci-Dusseau, Remzi H. Arpaci-Dusseau Shan Lu, Michael

Isolation Is Important

Reliability➡ independent failures and recovery

Performance ➡ isolated performance

Isolation at various scenarios ➡ computing: virtual machines, Linux containers➡ security: BSD jail, sandbox➡ cloud: multi-tenant systems

Tuesday, December 1, 15

Page 9: Physical Separation in Modern Storage Systems · Physical Separation in Modern Storage Systems Lanyue Lu Committee: Andrea C. Arpaci-Dusseau, Remzi H. Arpaci-Dusseau Shan Lu, Michael

File Systems Lack Isolation

Local file systems are core building blocks➡ manage user data➡ long-standing and stable➡ foundation for distributed file systems

Existing abstractions provide logical isolation➡ file, directory, namespace➡ just illusion

Physical entanglement in local file systems prevents isolation

➡ entangled data structures and transactions

Tuesday, December 1, 15

Page 10: Physical Separation in Modern Storage Systems · Physical Separation in Modern Storage Systems Lanyue Lu Committee: Andrea C. Arpaci-Dusseau, Remzi H. Arpaci-Dusseau Shan Lu, Michael

Metadata Entanglement

foo.txt

foo.txt inode

bar.c

one 4KB inode block

bar.c inode

I/O failureMetadata corruption

Shared metadata for multiple files➡ e.g., multiple files share one inode block➡ many shared structures: bitmap, directory block

Problem: faults in shared structures lead to shared failures and recovery

Tuesday, December 1, 15

Page 11: Physical Separation in Modern Storage Systems · Physical Separation in Modern Storage Systems Lanyue Lu Committee: Andrea C. Arpaci-Dusseau, Remzi H. Arpaci-Dusseau Shan Lu, Michael

Transaction Entanglement

data of foo.txt data of bar.c

foo.txt bar.c

Disk

Mem

fsync(bar.c)

A shared transaction for all updates

Problem: shared transactions lead to entangled performance

Tuesday, December 1, 15

Page 12: Physical Separation in Modern Storage Systems · Physical Separation in Modern Storage Systems Lanyue Lu Committee: Andrea C. Arpaci-Dusseau, Remzi H. Arpaci-Dusseau Shan Lu, Michael

Our Solution: IceFS

Propose a data container abstraction: cube

Disentangle data structures and transactions

Provide reliability and performance isolation

Benefits for local file systems➡ isolated failures for data containers➡ up to 8x faster localized recovery➡ up to 50x higher performance

Benefits for high-level services➡ virtualized systems: reduce the downtime over 5x➡ HDFS: improve the recovery efficiency over 7x

Tuesday, December 1, 15

Page 13: Physical Separation in Modern Storage Systems · Physical Separation in Modern Storage Systems Lanyue Lu Committee: Andrea C. Arpaci-Dusseau, Remzi H. Arpaci-Dusseau Shan Lu, Michael

Data Container Abstraction: Cube

b1 b2

a

d1

c

c1

/

b d

b, b1, b2 d, d1/, a, c, c1Disk

cube1 cube2

An isolated directory in a file system➡ physically disentangled on disk and in memory

Tuesday, December 1, 15

Page 14: Physical Separation in Modern Storage Systems · Physical Separation in Modern Storage Systems Lanyue Lu Committee: Andrea C. Arpaci-Dusseau, Remzi H. Arpaci-Dusseau Shan Lu, Michael

Principles of Disentanglement

No shared physical resources➡ no shared metadata: e.g., block groups➡ no shared disk blocks or buffers

No dependency➡ partition linked lists or trees➡ avoid directory hierarchy dependency

No entangled updates➡ use separate transactions➡ enable customized journaling modes

Tuesday, December 1, 15

Page 15: Physical Separation in Modern Storage Systems · Physical Separation in Modern Storage Systems Lanyue Lu Committee: Andrea C. Arpaci-Dusseau, Remzi H. Arpaci-Dusseau Shan Lu, Michael

IntroductionDisentanglement in IceFS➡ File system Disentanglement➡ The Ice File System➡ Evaluation

Key-Value Separation in WiscKey➡ Key-value Separation Idea➡ Challenges and Optimization ➡ Evaluation

Conclusion

Outline

Tuesday, December 1, 15

Page 16: Physical Separation in Modern Storage Systems · Physical Separation in Modern Storage Systems Lanyue Lu Committee: Andrea C. Arpaci-Dusseau, Remzi H. Arpaci-Dusseau Shan Lu, Michael

IceFS Overview

A data container based file system➡ isolated reliability and performance for containers

Disentanglement techniques➡ physical resource isolation➡ directory indirection➡ transaction splitting

A prototype based on Ext3➡ local file system: Ext3/JBD➡ kernel: VFS➡ user level tool: e2fsprogs

Tuesday, December 1, 15

Page 17: Physical Separation in Modern Storage Systems · Physical Separation in Modern Storage Systems Lanyue Lu Committee: Andrea C. Arpaci-Dusseau, Remzi H. Arpaci-Dusseau Shan Lu, Michael

Ext3 Disk Layout

One block group

block groupSBDisk

metadata data blocks

group descriptors bitmapsinodes

block group

block group

block group

block group

block group

block group

A disk is divided into block groups➡ physical partition for disk locality

Tuesday, December 1, 15

Page 18: Physical Separation in Modern Storage Systems · Physical Separation in Modern Storage Systems Lanyue Lu Committee: Andrea C. Arpaci-Dusseau, Remzi H. Arpaci-Dusseau Shan Lu, Michael

IceFS Disk Layout

SBDisk S0 block group

block groupS1 block

groupblock group

block group

sub super blocks

cube metadata

Each cube has isolated metadata➡ sub-super block (Si) and isolated block groups

Tuesday, December 1, 15

Page 19: Physical Separation in Modern Storage Systems · Physical Separation in Modern Storage Systems Lanyue Lu Committee: Andrea C. Arpaci-Dusseau, Remzi H. Arpaci-Dusseau Shan Lu, Michael

Directory Indirection

b1 b2

a

d1

c

c1

/

b d

cube1 cube2

1. load cube pathnamesfrom sub-super blocks

/a/b/, cube1 dentry/d/, cube2 dentry... ...

2. pathname prefix match

read file “/a/b/b2”match cube1jump to cube1 top directory

Tuesday, December 1, 15

Page 20: Physical Separation in Modern Storage Systems · Physical Separation in Modern Storage Systems Lanyue Lu Committee: Andrea C. Arpaci-Dusseau, Remzi H. Arpaci-Dusseau Shan Lu, Michael

Ext3/4 Transaction

Journal

file1

Memory

dirty data

file2

dirty data

file3

dirty data

Diskcommit tx

fsync(file1)

Tuesday, December 1, 15

Page 21: Physical Separation in Modern Storage Systems · Physical Separation in Modern Storage Systems Lanyue Lu Committee: Andrea C. Arpaci-Dusseau, Remzi H. Arpaci-Dusseau Shan Lu, Michael

IceFS Transaction Splitting

Journal

file1

Memory

dirty data

file2

dirty data

file3

dirty data

Diskcommit

tx

fsync(file1)

committx

committx

fsync(file2) fsync(file3)

Tuesday, December 1, 15

Page 22: Physical Separation in Modern Storage Systems · Physical Separation in Modern Storage Systems Lanyue Lu Committee: Andrea C. Arpaci-Dusseau, Remzi H. Arpaci-Dusseau Shan Lu, Michael

Benefits of Disentanglement

Localized reactions to failures➡ per-cube read-only and crash➡ encourage more runtime checking

Localized recovery➡ only check faulty cubes➡ offline and online

Specialized journaling➡ concurrent and independent transactions ➡ diverse journal modes (e.g., no journal, no fsync)

Tuesday, December 1, 15

Page 23: Physical Separation in Modern Storage Systems · Physical Separation in Modern Storage Systems Lanyue Lu Committee: Andrea C. Arpaci-Dusseau, Remzi H. Arpaci-Dusseau Shan Lu, Michael

IntroductionDisentanglement in IceFS➡ File system Disentanglement➡ The Ice File System➡ Evaluation

Key-Value Separation in WiscKey➡ Key-value Separation Idea➡ Challenges and Optimization ➡ Evaluation

Conclusion

Outline

Tuesday, December 1, 15

Page 24: Physical Separation in Modern Storage Systems · Physical Separation in Modern Storage Systems Lanyue Lu Committee: Andrea C. Arpaci-Dusseau, Remzi H. Arpaci-Dusseau Shan Lu, Michael

Evaluation

Does IceFS isolate failures ?➡ inject around 200 faults➡ per-cube failure (read-only or crash) in IceFS

Tuesday, December 1, 15

Page 25: Physical Separation in Modern Storage Systems · Physical Separation in Modern Storage Systems Lanyue Lu Committee: Andrea C. Arpaci-Dusseau, Remzi H. Arpaci-Dusseau Shan Lu, Michael

Evaluation

Does IceFS isolate failures ?➡ inject around 200 faults➡ per-cube failure (read-only or crash) in IceFS

Does IceFS have faster recovery ?

Tuesday, December 1, 15

Page 26: Physical Separation in Modern Storage Systems · Physical Separation in Modern Storage Systems Lanyue Lu Committee: Andrea C. Arpaci-Dusseau, Remzi H. Arpaci-Dusseau Shan Lu, Michael

Recovery In Ext3

Ext3: 20 directories

0

200

400

600

800

1000

Fsck

Tim

e (s

)

File-system Capacity200GB 400GB 600GB 800GB

231

476

723

1007Ext3

Tuesday, December 1, 15

Page 27: Physical Separation in Modern Storage Systems · Physical Separation in Modern Storage Systems Lanyue Lu Committee: Andrea C. Arpaci-Dusseau, Remzi H. Arpaci-Dusseau Shan Lu, Michael

Fast Recovery In IceFS

Ext3: 20 directories

IceFS: 20 cubes

0

200

400

600

800

1000

Fsck

Tim

e (s

)

File-system Capacity200GB 400GB 600GB 800GB

231

476

723

1007

35 64 91 122

Ext3 IceFS

Partial recovery for a cube (up to 8x)

Tuesday, December 1, 15

Page 28: Physical Separation in Modern Storage Systems · Physical Separation in Modern Storage Systems Lanyue Lu Committee: Andrea C. Arpaci-Dusseau, Remzi H. Arpaci-Dusseau Shan Lu, Michael

Evaluation

Does IceFS isolate failures ?➡ inject around 200 faults➡ per-cube failure (read-only or crash) for IceFS

Does IceFS have faster recovery ?➡ independent recovery for a cube

Tuesday, December 1, 15

Page 29: Physical Separation in Modern Storage Systems · Physical Separation in Modern Storage Systems Lanyue Lu Committee: Andrea C. Arpaci-Dusseau, Remzi H. Arpaci-Dusseau Shan Lu, Michael

Evaluation

Does IceFS isolate failures ?➡ inject around 200 faults➡ per-cube failure (read-only or crash) for IceFS

Does IceFS have faster recovery ?➡ independent recovery for a cube

Does IceFS have better performance ?

Tuesday, December 1, 15

Page 30: Physical Separation in Modern Storage Systems · Physical Separation in Modern Storage Systems Lanyue Lu Committee: Andrea C. Arpaci-Dusseau, Remzi H. Arpaci-Dusseau Shan Lu, Michael

Workloads

SQLite➡ a database application➡ sequentially write large key/value pairs➡ asynchronous

Varmail➡ an email server workload➡ randomly write small blocks➡ fsync after each write

Tuesday, December 1, 15

Page 31: Physical Separation in Modern Storage Systems · Physical Separation in Modern Storage Systems Lanyue Lu Committee: Andrea C. Arpaci-Dusseau, Remzi H. Arpaci-Dusseau Shan Lu, Michael

Ext3 Journaling

0

30

60

90

120

150

180

Thro

ughp

ut (M

B/s)

146.7

76.1

120.6

201.9 9.8

SQLite Varmail

AloneExt3

Ext3 runs with 2 directories

Tuesday, December 1, 15

Page 32: Physical Separation in Modern Storage Systems · Physical Separation in Modern Storage Systems Lanyue Lu Committee: Andrea C. Arpaci-Dusseau, Remzi H. Arpaci-Dusseau Shan Lu, Michael

Ext3 Journaling

0

30

60

90

120

150

180

Thro

ughp

ut (M

B/s)

146.7

76.1

120.6

201.9 9.8

SQLite Varmail

AloneExt3

TogetherExt3

Shared transactions hurt performance (over 10x)

Ext3 runs with 2 directories

Tuesday, December 1, 15

Page 33: Physical Separation in Modern Storage Systems · Physical Separation in Modern Storage Systems Lanyue Lu Committee: Andrea C. Arpaci-Dusseau, Remzi H. Arpaci-Dusseau Shan Lu, Michael

Isolated Journaling In IceFS

0

30

60

90

120

150

180

Thro

ughp

ut (M

B/s)

146.7

76.1

120.6

201.9 9.8

SQLite Varmail

Alonein Ext3

Togetherin Ext3

Togetherin IceFS

Ext3 runs with 2 directories

IceFS runs with 2 cubes

Parallel transactions in IceFS provide isolated performance (over 5x)

Tuesday, December 1, 15

Page 34: Physical Separation in Modern Storage Systems · Physical Separation in Modern Storage Systems Lanyue Lu Committee: Andrea C. Arpaci-Dusseau, Remzi H. Arpaci-Dusseau Shan Lu, Michael

Specialized Journaling In IceFS

orde

red

0

50

100

150

200

250

Thro

ughp

ut (M

B/s)

120.6

220.3

125.4

9.8 5.6

103.4

SQLite Varmail

orde

red

Both cubes use ordered mode

Tuesday, December 1, 15

Page 35: Physical Separation in Modern Storage Systems · Physical Separation in Modern Storage Systems Lanyue Lu Committee: Andrea C. Arpaci-Dusseau, Remzi H. Arpaci-Dusseau Shan Lu, Michael

Specialized Journaling In IceFS

0

50

100

150

200

250

Thro

ughp

ut (M

B/s)

120.6

220.3

125.4

9.8 5.6

103.4

SQLite Varmail

no jo

urna

l

orde

red

orde

red

orde

red

SQLite runs with no journalVarmail runs with ordered

Tuesday, December 1, 15

Page 36: Physical Separation in Modern Storage Systems · Physical Separation in Modern Storage Systems Lanyue Lu Committee: Andrea C. Arpaci-Dusseau, Remzi H. Arpaci-Dusseau Shan Lu, Michael

Specialized Journaling In IceFS

0

50

100

150

200

250

Thro

ughp

ut (M

B/s)

120.6

220.3

125.4

9.8 5.6

103.4

SQLite Varmail

no jo

urna

l

orde

red

orde

red

orde

red

orde

red

no jo

urna

l

SQLite runs with ordered

Varmail runs with no journal

Specialized journaling in IceFS provide flexibility between consistency and performance (over 50x)

Tuesday, December 1, 15

Page 37: Physical Separation in Modern Storage Systems · Physical Separation in Modern Storage Systems Lanyue Lu Committee: Andrea C. Arpaci-Dusseau, Remzi H. Arpaci-Dusseau Shan Lu, Michael

Evaluation

Isolate failures ?➡ inject around 200 faults➡ per-cube failure (read-only or crash) for IceFS

Faster recovery ?➡ independent recovery for a cube

Better journaling performance ?➡ isolated journaling performance➡ flexibility between consistency and performance

Tuesday, December 1, 15

Page 38: Physical Separation in Modern Storage Systems · Physical Separation in Modern Storage Systems Lanyue Lu Committee: Andrea C. Arpaci-Dusseau, Remzi H. Arpaci-Dusseau Shan Lu, Michael

Evaluation

Isolate failures ?➡ inject around 200 faults➡ per-cube failure (read-only or crash) for IceFS

Faster recovery ?➡ independent recovery for a cube

Better journaling performance ?➡ isolated journaling performance➡ flexibility between consistency and performance

Useful for applications ?

Tuesday, December 1, 15

Page 39: Physical Separation in Modern Storage Systems · Physical Separation in Modern Storage Systems Lanyue Lu Committee: Andrea C. Arpaci-Dusseau, Remzi H. Arpaci-Dusseau Shan Lu, Michael

Server Virtualization

Shared file system

Disk virtual disk 2 virtual disk 3virtual disk 1

vm1 vm2 vm3

Failures and recovery of the shared file system impact all virtual machines

Tuesday, December 1, 15

Page 40: Physical Separation in Modern Storage Systems · Physical Separation in Modern Storage Systems Lanyue Lu Committee: Andrea C. Arpaci-Dusseau, Remzi H. Arpaci-Dusseau Shan Lu, Michael

Virtual Machines

0 50 100 150 200 250 300 350 400 450 500 550 600 650 7000

20

40

60

80

100

Time (Second)

Thro

ughp

ut (I

OPS

)

fsck: 496s + bootup: 68s

VM1 VM2 VM3

Inject metadata corruption bootup three vms

Tuesday, December 1, 15

Page 41: Physical Separation in Modern Storage Systems · Physical Separation in Modern Storage Systems Lanyue Lu Committee: Andrea C. Arpaci-Dusseau, Remzi H. Arpaci-Dusseau Shan Lu, Michael

Server Virtualization with IceFS

Shared file system with cubes

Disk virtual disk 2 virtual disk 3virtual disk 1

vm1 vm2 vm3

cube1 cube2 cube3

Tuesday, December 1, 15

Page 42: Physical Separation in Modern Storage Systems · Physical Separation in Modern Storage Systems Lanyue Lu Committee: Andrea C. Arpaci-Dusseau, Remzi H. Arpaci-Dusseau Shan Lu, Michael

Server Virtualization with IceFS

0 50 100 150 200 250 300 350 400 450 500 550 600 650 7000

20

40

60

80

100

0 50 100 150 200 250 300 350 400 450 500 550 600 650 7000

20

40

60

80

100Th

roug

hput

(IO

PS)IceFS-Offline

0 50 100 150 200 250 300 350 400 450 500 550 600 650 7000

20

40

60

80

100

0 50 100 150 200 250 300 350 400 450 500 550 600 650 7000

20

40

60

80

100

Time (Second)

IceFS-Online

fsck: 35s+

bootup: 67s

fsck: 74s+

bootup: 39s

VM1 VM2 VM3

recover a cube offline

Inject metadata corruptionbootup three vms

Tuesday, December 1, 15

Page 43: Physical Separation in Modern Storage Systems · Physical Separation in Modern Storage Systems Lanyue Lu Committee: Andrea C. Arpaci-Dusseau, Remzi H. Arpaci-Dusseau Shan Lu, Michael

Server Virtualization with IceFS

0 50 100 150 200 250 300 350 400 450 500 550 600 650 7000

20

40

60

80

100

0 50 100 150 200 250 300 350 400 450 500 550 600 650 7000

20

40

60

80

100Th

roug

hput

(IO

PS)IceFS-Offline

0 50 100 150 200 250 300 350 400 450 500 550 600 650 7000

20

40

60

80

100

0 50 100 150 200 250 300 350 400 450 500 550 600 650 7000

20

40

60

80

100

Time (Second)

IceFS-Online

fsck: 35s+

bootup: 67s

fsck: 74s+

bootup: 39s

VM1 VM2 VM3

recover a cube offline

recover a cube online

Tuesday, December 1, 15

Page 44: Physical Separation in Modern Storage Systems · Physical Separation in Modern Storage Systems Lanyue Lu Committee: Andrea C. Arpaci-Dusseau, Remzi H. Arpaci-Dusseau Shan Lu, Michael

Evaluation

Isolate failures ?➡ inject around 200 faults➡ per-cube failure (read-only or crash) for IceFS

Faster recovery ?➡ independent recovery for a cube

Better journaling performance ?➡ isolated journaling performance for cubes➡ flexibility between consistency and performance

Useful for applications ?➡ significantly reduce system downtime

Tuesday, December 1, 15

Page 45: Physical Separation in Modern Storage Systems · Physical Separation in Modern Storage Systems Lanyue Lu Committee: Andrea C. Arpaci-Dusseau, Remzi H. Arpaci-Dusseau Shan Lu, Michael

Summary of IceFS

Local file systems lack physical isolation➡ physical entanglement ➡ reliability and performance problems

IceFS provides isolation with data containers

Computing is becoming virtualized, shared, and multi-tenant

➡ isolation is the key

Systems need to rethink isolation ➡ avoid entanglement➡ provide useful abstractions for applications

Tuesday, December 1, 15

Page 46: Physical Separation in Modern Storage Systems · Physical Separation in Modern Storage Systems Lanyue Lu Committee: Andrea C. Arpaci-Dusseau, Remzi H. Arpaci-Dusseau Shan Lu, Michael

IntroductionDisentanglement in IceFS➡ File system Disentanglement➡ The Ice File System➡ Evaluation

Key-Value Separation in WiscKey➡ Key-value Separation Idea➡ Challenges and Optimization ➡ Evaluation

Conclusion

Outline

Tuesday, December 1, 15

Page 47: Physical Separation in Modern Storage Systems · Physical Separation in Modern Storage Systems Lanyue Lu Committee: Andrea C. Arpaci-Dusseau, Remzi H. Arpaci-Dusseau Shan Lu, Michael

Key-Value Stores

Key-value stores are important➡ web indexing, e-commerce, social networks➡ local and distributed key-value stores ➡ hash table, b-trees➡ log-structured merge trees (LSM-trees)

LSM-tree based key-value stores are popular ➡ optimize for write intensive workloads➡ advanced features: range query, snapshot➡ widely deployed ➡ BigTable and LevelDB at Google➡ HBase, Cassandra and RocksDB at FaceBook

Tuesday, December 1, 15

Page 48: Physical Separation in Modern Storage Systems · Physical Separation in Modern Storage Systems Lanyue Lu Committee: Andrea C. Arpaci-Dusseau, Remzi H. Arpaci-Dusseau Shan Lu, Michael

LSM-trees Background

LogL0 (8MB)

L1 (10MB)

L2 (100MB)

L6 (ITB)

memory 1

KVmemTmemT23

4

5

disk

Batch and write sequentiallySort data for quick lookups

LevelDB

Tuesday, December 1, 15

Page 49: Physical Separation in Modern Storage Systems · Physical Separation in Modern Storage Systems Lanyue Lu Committee: Andrea C. Arpaci-Dusseau, Remzi H. Arpaci-Dusseau Shan Lu, Michael

Problems:

large write amplification

large read amplification

Random load: a 100GB database

Random lookup:100,000 lookups

I/O Amplification in LSM-trees

1

10

100

1000Am

plifi

catio

n R

atio

14

327

100 GB

Write Read

Tuesday, December 1, 15

Page 50: Physical Separation in Modern Storage Systems · Physical Separation in Modern Storage Systems Lanyue Lu Committee: Andrea C. Arpaci-Dusseau, Remzi H. Arpaci-Dusseau Shan Lu, Michael

Why LSM-trees ?

Good for hard drives➡ high write throughput➡ sequential vs random: can be up to 1000

Not optimal for SSDs ➡ large write/read amplification ➡ waste device resource➡ decrease device’s lifetime

➡ unique characteristics of SSDs➡ fast random reads➡ internal parallelism

Tuesday, December 1, 15

Page 51: Physical Separation in Modern Storage Systems · Physical Separation in Modern Storage Systems Lanyue Lu Committee: Andrea C. Arpaci-Dusseau, Remzi H. Arpaci-Dusseau Shan Lu, Michael

Our Solution: WiscKey

An SSD-conscious LSM-tree store➡ main idea: separate keys and values➡ harness SSD’s internal parallelism for range queries➡ online and light-weight garbage collection➡ minimize I/O amplification and crash consistent

Performance of WiscKey➡ 2.5x to 111x for loading, 1.6x to 14x for lookups➡ both micro and macro benchmarks

LSM-tree

key value

Value Log

Tuesday, December 1, 15

Page 52: Physical Separation in Modern Storage Systems · Physical Separation in Modern Storage Systems Lanyue Lu Committee: Andrea C. Arpaci-Dusseau, Remzi H. Arpaci-Dusseau Shan Lu, Michael

Key-Value Separation

key

LSM-tree

value

Value Log

k, addr value

SSD device

Main idea: only keys are required to be sorted, values can be managed separately

Tuesday, December 1, 15

Page 53: Physical Separation in Modern Storage Systems · Physical Separation in Modern Storage Systems Lanyue Lu Committee: Andrea C. Arpaci-Dusseau, Remzi H. Arpaci-Dusseau Shan Lu, Michael

IntroductionDisentanglement in IceFS➡ File system Disentanglement➡ The Ice File System➡ Evaluation

Key-Value Separation in WiscKey➡ Key-value Separation Idea➡ Challenges and Optimization ➡ Evaluation

Conclusion

Outline

Tuesday, December 1, 15

Page 54: Physical Separation in Modern Storage Systems · Physical Separation in Modern Storage Systems Lanyue Lu Committee: Andrea C. Arpaci-Dusseau, Remzi H. Arpaci-Dusseau Shan Lu, Michael

Parallel Range Query

1KB 4KB 16KB 64KB 256KB0

100

200

300

400

500

600

Request size: 1KB to 256KB

Thro

ughp

ut (M

B/s)

Sequential Rand-1thread Rand-32threads

SSD: Samsung 840 EVO 500GB Reads on a 100GB file on ext4

SSD read performance➡ sequential, random, parallel

Tuesday, December 1, 15

Page 55: Physical Separation in Modern Storage Systems · Physical Separation in Modern Storage Systems Lanyue Lu Committee: Andrea C. Arpaci-Dusseau, Remzi H. Arpaci-Dusseau Shan Lu, Michael

Parallel Range Query

Challenge➡ sequential reads in LevelDB➡ read keys and values separately in WiscKey

Parallel range query➡ leverage parallel random reads of SSDs➡ prefetch key-value pairs in advance ➡ range query interface: seek(), next(), prev()➡ detect a sequential pattern ➡ prefetch concurrently in background

Tuesday, December 1, 15

Page 56: Physical Separation in Modern Storage Systems · Physical Separation in Modern Storage Systems Lanyue Lu Committee: Andrea C. Arpaci-Dusseau, Remzi H. Arpaci-Dusseau Shan Lu, Michael

Garbage Collection

LSM-tree Value Log

valuek, addr value value

SSD device

ksize, vsize, key, value

tail head

Online and light-weight ➡ append (ksize, vsize, key, value) in value log➡ tail and head pointers for the valid range ➡ tail and head are stored in LSM-tree

Tuesday, December 1, 15

Page 57: Physical Separation in Modern Storage Systems · Physical Separation in Modern Storage Systems Lanyue Lu Committee: Andrea C. Arpaci-Dusseau, Remzi H. Arpaci-Dusseau Shan Lu, Michael

Garbage Collection

LSM-tree Value Log

k, addr

tail head

memory

disk

addr match ?write back

1. read from the tail2. check the LSM-tree

3. write back valid kv pairs4. free space and update pointers

Tuesday, December 1, 15

Page 58: Physical Separation in Modern Storage Systems · Physical Separation in Modern Storage Systems Lanyue Lu Committee: Andrea C. Arpaci-Dusseau, Remzi H. Arpaci-Dusseau Shan Lu, Michael

Optimizing LSM-tree Log

LSM-tree Value Log

k, addr ksize, vsize, key, value

tail head

log ksize, vsize, key, value

LSM-tree log ➡ used for recovery in case of a crash ➡ performance overhead for small kv pairs

Remove LSM-tree log in WiscKey ➡ store head in LSM-tree periodically➡ scan the value log from the head to recover

Tuesday, December 1, 15

Page 59: Physical Separation in Modern Storage Systems · Physical Separation in Modern Storage Systems Lanyue Lu Committee: Andrea C. Arpaci-Dusseau, Remzi H. Arpaci-Dusseau Shan Lu, Michael

WiscKey Implementation

Based on LevelDB➡ a separate vLog file for values ➡ modify I/O paths to separate keys and values ➡ straightforward to implement

Range query ➡ a background thread pool ➡ detect sequential pattern with the Iterator interface

File-system support➡ fadvise to predeclare access patterns ➡ hole-punching to free space

Tuesday, December 1, 15

Page 60: Physical Separation in Modern Storage Systems · Physical Separation in Modern Storage Systems Lanyue Lu Committee: Andrea C. Arpaci-Dusseau, Remzi H. Arpaci-Dusseau Shan Lu, Michael

IntroductionDisentanglement in IceFS➡ File system Disentanglement➡ The Ice File System➡ Evaluation

Key-Value Separation in WiscKey➡ Key-value Separation Idea➡ Challenges and Optimization ➡ Evaluation

Conclusion

Outline

Tuesday, December 1, 15

Page 61: Physical Separation in Modern Storage Systems · Physical Separation in Modern Storage Systems Lanyue Lu Committee: Andrea C. Arpaci-Dusseau, Remzi H. Arpaci-Dusseau Shan Lu, Michael

Experiment Setup

Testing machine➡ 16 cores (3.3 GHz), 64 GB memory➡ Samsung 840 EVO SSD (500 GB)➡ maximal sequential read: 500 MB/s➡ maximal sequential write: 400 MB/s

Workloads ➡ micro benchmarks (db_bench) ➡ YCSB benchmark

Tuesday, December 1, 15

Page 62: Physical Separation in Modern Storage Systems · Physical Separation in Modern Storage Systems Lanyue Lu Committee: Andrea C. Arpaci-Dusseau, Remzi H. Arpaci-Dusseau Shan Lu, Michael

Evaluation

How does key-value separation impact the performance of WiscKey ?

Tuesday, December 1, 15

Page 63: Physical Separation in Modern Storage Systems · Physical Separation in Modern Storage Systems Lanyue Lu Committee: Andrea C. Arpaci-Dusseau, Remzi H. Arpaci-Dusseau Shan Lu, Michael

Sequential Load

WiscKey is over 3x faster due to its write buffer and removing the LSM-tree log

64B 256B 1KB 4KB 16KB 64KB 256KB0

50100150200250300350400450500

Key: 16B, Value: 64B to 256KB

Thro

ughp

ut (M

B/s)

LevelDB WiscKey load 100 GB database

log writing in LevelDB has high overhead

Tuesday, December 1, 15

Page 64: Physical Separation in Modern Storage Systems · Physical Separation in Modern Storage Systems Lanyue Lu Committee: Andrea C. Arpaci-Dusseau, Remzi H. Arpaci-Dusseau Shan Lu, Michael

Random Load

only 2 MB/s to 4.1 MB/s

Small write amplification in WiscKey due to key-value separation (up to 111x in throughput)

64B 256B 1KB 4KB 16KB 64KB 256KB0

50100150200250300350400450500

Key: 16B, Value: 64B to 256KB

Thro

ughp

ut (M

B/s)

LevelDB WiscKey load 100 GB database

large write amplification (12 to 16) in LevelDB

Tuesday, December 1, 15

Page 65: Physical Separation in Modern Storage Systems · Physical Separation in Modern Storage Systems Lanyue Lu Committee: Andrea C. Arpaci-Dusseau, Remzi H. Arpaci-Dusseau Shan Lu, Michael

Random Lookup

Smaller LSM-tree in WiscKey leads to better lookup performance (1.6x - 14x)

64B 256B 1KB 4KB 16KB 64KB 256KB0

50

100

150

200

250

300

Key: 16B, Value: 64B to 256KB

Thro

ughp

ut (M

B/s)

LevelDB WiscKey

100,000 lookups on a randomly loaded 100 GB database large read amplification

in LevelDB

Tuesday, December 1, 15

Page 66: Physical Separation in Modern Storage Systems · Physical Separation in Modern Storage Systems Lanyue Lu Committee: Andrea C. Arpaci-Dusseau, Remzi H. Arpaci-Dusseau Shan Lu, Michael

Evaluation

How does key-value separation impact the performance of WiscKey ?

➡ low write and read amplification ➡ load (2.5x to 111x), lookup (1.6x to 14x)

Is the parallel range query fast enough ?

Tuesday, December 1, 15

Page 67: Physical Separation in Modern Storage Systems · Physical Separation in Modern Storage Systems Lanyue Lu Committee: Andrea C. Arpaci-Dusseau, Remzi H. Arpaci-Dusseau Shan Lu, Michael

Range Query

Better for large kv pairs, but worse for small kv pairs on an unsorted database

64B 256B 1KB 4KB 16KB 64KB 256KB0

100

200

300

400

500

600

Key: 16B, Value: 64B to 256KB

Thro

ughp

ut (M

B/s)

LevelDB-RandWiscKey-Rand

read 4GB from a randomly loaded 100 GB database

WiscKey is limited by SSD’s parallel random read performance

For large kv pairs, WiscKey can perform better

Tuesday, December 1, 15

Page 68: Physical Separation in Modern Storage Systems · Physical Separation in Modern Storage Systems Lanyue Lu Committee: Andrea C. Arpaci-Dusseau, Remzi H. Arpaci-Dusseau Shan Lu, Michael

Range Query

Sorted databases help WiscKey’s range query

64B 256B 1KB 4KB 16KB 64KB 256KB0

100

200

300

400

500

600

Key: 16B, Value: 64B to 256KB

Thro

ughp

ut (M

B/s)

LevelDB-RandWiscKey-Rand

LevelDB-SeqWiscKey-Seq

read 4GB from a sequentially loaded 100 GB database

Both WiscKey and LevelDB read sequentially

Tuesday, December 1, 15

Page 69: Physical Separation in Modern Storage Systems · Physical Separation in Modern Storage Systems Lanyue Lu Committee: Andrea C. Arpaci-Dusseau, Remzi H. Arpaci-Dusseau Shan Lu, Michael

Evaluation

How does key-value separation impact the performance of WiscKey ?

➡ low write and read amplification ➡ load (2.5x to 111x), lookup (1.6x to 14x)

Is the parallel range query fast enough ?➡ limited by random read performance➡ sorting helps

How about real workloads ? What is the effect of garbage collection ?

Tuesday, December 1, 15

Page 70: Physical Separation in Modern Storage Systems · Physical Separation in Modern Storage Systems Lanyue Lu Committee: Andrea C. Arpaci-Dusseau, Remzi H. Arpaci-Dusseau Shan Lu, Michael

YCSB Benchmarks

A: 50% R, 50% U; B: 95% R, 5% U; C: 100% R; D: 95% R, 5% I; E: 95% Scan, 5% I; F: 50% R, 50% RMW

48x-116x

6x-16x

2x-20x

2.6x-25x

1.5x-4x

1x-7x

6x-8x

0.1

1

10

100

1000

Nor

mal

ized

Per

form

ance

Key size: 16B, Value size: 1KBLOAD A B C D E F

LevelDB RocksDB WiscKey-GC WiscKey

many small range queries

Tuesday, December 1, 15

Page 71: Physical Separation in Modern Storage Systems · Physical Separation in Modern Storage Systems Lanyue Lu Committee: Andrea C. Arpaci-Dusseau, Remzi H. Arpaci-Dusseau Shan Lu, Michael

Evaluation

How does key-value separation impact the performance of WiscKey ?

➡ low write and read amplification ➡ load (2.5x to 111x), lookup (1.6x to 14x)

Is the parallel range query fast enough ?➡ limited by random read performance➡ sorting helps

How about real workloads ? What is the effect of garbage collection ?

➡ faster on all workloads➡ performance similar to micro benchmarks

Tuesday, December 1, 15

Page 72: Physical Separation in Modern Storage Systems · Physical Separation in Modern Storage Systems Lanyue Lu Committee: Andrea C. Arpaci-Dusseau, Remzi H. Arpaci-Dusseau Shan Lu, Michael

Summary of WiscKey

LSM-trees are not optimized for SSD devices

WiscKey separates keys from values with an SSD-conscious design

Many novel storage systems have been built for hard drives

Transition to new storage hardware➡ leverage existing software ➡ explore new ways to utilize the new hardware➡ get the best of two worlds

Tuesday, December 1, 15

Page 73: Physical Separation in Modern Storage Systems · Physical Separation in Modern Storage Systems Lanyue Lu Committee: Andrea C. Arpaci-Dusseau, Remzi H. Arpaci-Dusseau Shan Lu, Michael

IntroductionDisentanglement in IceFS➡ File system Disentanglement➡ The Ice File System➡ Evaluation

Key-Value Separation in WiscKey➡ Key-value Separation Idea➡ Challenges and Optimization ➡ Evaluation

Conclusion

Outline

Tuesday, December 1, 15

Page 74: Physical Separation in Modern Storage Systems · Physical Separation in Modern Storage Systems Lanyue Lu Committee: Andrea C. Arpaci-Dusseau, Remzi H. Arpaci-Dusseau Shan Lu, Michael

Lessons Learned

A large-scale study is feasible and valuable

Research should match reality

History repeats itself

Don’t settle for existing abstraction

Isolation should be a fundamental design goal

Don’t run old software on new hardware

Fundamental details matter

Work on systems extremely slow or unreliableTuesday, December 1, 15

Page 75: Physical Separation in Modern Storage Systems · Physical Separation in Modern Storage Systems Lanyue Lu Committee: Andrea C. Arpaci-Dusseau, Remzi H. Arpaci-Dusseau Shan Lu, Michael

Conclusion

Local storage systems are important

Physical separation is useful ➡ improve both reliability and performance over 10x➡ better reliability: isolated failures, localized recovery➡ better performance: specialized journaling, minimize I/O amplification

Computing and storage are evolving ➡ virtualized, shared and fast➡ physical separation is the key➡ IceFS and WiscKey are just a beginning

Tuesday, December 1, 15