Understanding Write Behaviors of Storage Backends in Ceph Object Store
Slides: csl.snu.ac.kr/papers/msst17-slides.pdf

Page 1

Understanding Write Behaviors of Storage Backends in Ceph Object Store

Dong-Yun Lee, Kisik Jeong, Sang-Hoon Han, Jin-Soo Kim, Joo-Young Hwang† and Sangyeun Cho†

Page 6

How Ceph Amplifies Writes

[Figure: a single client write fans out, on each of three replicas, into Ceph data, Ceph metadata, and Ceph journal writes, plus file system metadata and file system journal writes underneath]

A single write can cause several hidden I/Os. What about in a long-term situation?

Pages 7-13

How Much Does Ceph Amplify Writes?

[Figure: WAF breakdown for FileStore, built up across the slides: Ceph data 3, Ceph metadata 0.702, Ceph journal 5.994, file system metadata 0.419, file system journal 3.635]

Writes amplified by over 13x
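The breakdown above can be sanity-checked by summing the five components (a quick sketch using the per-component values from the slide):

```python
# WAF breakdown for FileStore, using the per-component values from the slide.
waf_components = {
    "Ceph data": 3.0,
    "Ceph metadata": 0.702,
    "Ceph journal": 5.994,
    "File system metadata": 0.419,
    "File system journal": 3.635,
}

total_waf = sum(waf_components.values())
print(f"Total WAF: {total_waf:.2f}")  # Total WAF: 13.75 -> amplified by over 13x
```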

Page 14

Our Motivation & Goal

How/why are writes so highly amplified in Ceph?
• We want the exact numbers

Why do we focus on write amplification?
• The WAF (Write Amplification Factor) affects overall performance
• On SSDs, it hurts device lifespan
• Larger WAF → smaller effective bandwidth
• Redundant "journaling of journal" may exist

Goals of this paper
• Understand the write behaviors of Ceph
• Empirical study of write amplification in Ceph
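The relationship between WAF and effective bandwidth can be written down directly. A minimal sketch; the 500 MB/s raw device bandwidth is a made-up illustration, not a measured number:

```python
def waf(device_bytes: float, client_bytes: float) -> float:
    """Write Amplification Factor: bytes hitting storage per byte the client wrote."""
    return device_bytes / client_bytes

def effective_bandwidth(raw_bandwidth_mbs: float, waf_value: float) -> float:
    """Larger WAF means smaller effective bandwidth seen by the client."""
    return raw_bandwidth_mbs / waf_value

# A device sustaining 500 MB/s with a WAF of 13.75 serves the client at ~36 MB/s.
print(effective_bandwidth(500.0, waf(13.75, 1.0)))
```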

Page 15

Outline

Introduction
Background
Evaluation environment
Result & Analysis
Conclusion

Pages 17-19

Architecture of Ceph RADOS Block Device (RBD)

[Figure: the client I/O stack exposes /dev/rbd through librbd/librados; requests travel over the public network to the MON and OSDs #0-#3. Data is striped into 4MB objects, and the CRUSH algorithm determines which OSD each object should be sent to]

Pages 20-21

Architecture of Ceph RADOS Block Device (RBD)

[Figure*: each object is replicated 3x across OSDs. Inside an OSD, data, attributes, and key-value data are handed as transactions to the Ceph storage backend (object store) - FileStore on XFS, KStore, or BlueStore - sitting on top of the storage device]

*Figure courtesy: Sébastien Han, https://www.sebastien-han.fr

Page 22

Ceph Storage Backends: (1) FileStore

Manages each object as a file

Write flow in FileStore
1. Write-ahead journaling, for consistency and performance
2. Performs the actual write to the file after journaling
   − Can be absorbed by the page cache
3. Calls syncfs and flushes journal entries every 5 seconds

<Breakdown of FileStore>
[Figure: the Ceph journal (write-ahead journaling) lives on an NVMe SSD; objects, metadata (LevelDB DB/WAL), and attributes live on an XFS file system over HDDs or SSDs, producing Ceph data, Ceph metadata, FS metadata, and FS journal writes]
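Steps 1-2 mean every replica writes each byte twice: once to the journal and once to the file. With 3x replication, that alone predicts a steady-state WAF of 6 for FileStore. A back-of-the-envelope sketch, ignoring metadata and journal padding:

```python
REPLICATION = 3          # Ceph's default 3x replication
WRITES_PER_REPLICA = 2   # write-ahead journal write + actual file write

def filestore_waf_floor(client_bytes: int) -> float:
    """Lower bound on FileStore's WAF from journaling and replication alone."""
    device_bytes = client_bytes * REPLICATION * WRITES_PER_REPLICA
    return device_bytes / client_bytes

print(filestore_waf_floor(4096))  # 6.0
```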

Page 23

Ceph Storage Backends: (2) KStore

Uses existing key-value stores
• Encapsulates everything into key-value pairs

Supports LevelDB, RocksDB and Kinetic Store

Write flow in KStore
1. Simply calls key-value APIs with the key-value pair

<Breakdown of KStore>
[Figure: objects, metadata, and attributes all go into LevelDB or RocksDB (DB + WAL) on an XFS file system over HDDs or SSDs, producing Ceph data, Ceph metadata, compaction, FS metadata, and FS journal writes]
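The "encapsulates everything into key-value pairs" idea can be sketched with a toy store. Illustrative only: the key layout and the dict-backed store are made up, not KStore's actual schema:

```python
class ToyKVStore:
    """Stand-in for LevelDB/RocksDB: every update goes through a single put() API."""
    def __init__(self) -> None:
        self.kv: dict[str, bytes] = {}

    def put(self, key: str, value: bytes) -> None:
        self.kv[key] = value

def kstore_write(store: ToyKVStore, obj_id: str, offset: int,
                 data: bytes, attrs: dict[str, bytes]) -> None:
    # Object data, metadata, and attributes each become key-value pairs.
    store.put(f"data/{obj_id}/{offset}", data)
    store.put(f"meta/{obj_id}", str(offset + len(data)).encode())
    for name, value in attrs.items():
        store.put(f"attr/{obj_id}/{name}", value)

store = ToyKVStore()
kstore_write(store, "rbd_data.0001", 0, b"\x00" * 4096, {"pool": b"rbd"})
print(sorted(store.kv))  # three key-value pairs for a single object write
```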

Page 24

Ceph Storage Backends: (3) BlueStore

Designed to avoid the limitations of FileStore
• The double-write issue caused by journaling
  − Stores data directly on the raw device

Write flow in BlueStore
1. Puts data on the raw block device
2. Sets metadata in RocksDB on BlueFS (a user-level file system)

<Breakdown of BlueStore>
[Figure: Ceph data and zero-filled data go to the raw device (HDDs or SSDs); RocksDB (DB) runs on BlueFS over an NVMe SSD, and the RocksDB WAL goes to a separate NVMe SSD via BlueFS]
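Two small-write behaviors follow from BlueStore's minimum extent size (64KB, as stated on the microbenchmark-findings slide): zero-filling the hole in a newly allocated extent, and routing small overwrites through the RocksDB WAL. A simplified decision sketch; the function is illustrative, not BlueStore's actual code:

```python
EXTENT_SIZE = 64 * 1024  # BlueStore's minimum extent size, per the slides

def small_write_path(length: int, overwrite: bool) -> str:
    """Extra work a small write triggers in this simplified model."""
    if not overwrite:
        # Newly allocated extent: the hole beyond the data is zero-filled.
        hole = EXTENT_SIZE - length
        return f"zero-fill {hole} bytes of the 64KB extent"
    # Small overwrite of existing data: logged in the RocksDB WAL for durability.
    return "log in RocksDB WAL, then apply to the raw device"

print(small_write_path(4096, overwrite=False))  # zero-fill 61440 bytes ...
print(small_write_path(4096, overwrite=True))
```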

Page 25

Outline

Introduction
Background
Evaluation environment
Result & Analysis
Conclusion

Page 26

H/W Evaluation Environment

OSD Servers (x4)
  Model:     DELL R730
  Processor: Intel® Xeon® CPU E5-2640 v3
  Memory:    32 GB
  Storage:   HGST UCTSSC600 600 GB x4, Samsung PM1633 960 GB x4, Intel® 750 series 400 GB x2

Admin Server / Client (x1)
  Model:     DELL R730XD
  Processor: Intel® Xeon® CPU E5-2640 v3
  Memory:    128 GB

Switches (x2)
  Public network:  DELL N4032, 10Gbps Ethernet
  Storage network: Mellanox SX6012, 40Gbps InfiniBand

Page 27

S/W Evaluation Environment

Linux 4.4.43 kernel
• With some modifications to collect and classify write requests

Ceph Jewel LTS version (v10.2.5)

From the client side
• Configures a 64GiB KRBD
  − 4 OSDs per OSD server (16 OSDs in total)
• Generates workloads using fio while collecting traces and diskstats

Performs 2 different workloads
• Microbenchmark
• Long-term workload

Page 28

Outline

Introduction
Background
Evaluation environment
Result & Analysis
Conclusion

Page 29

Key Findings in Microbenchmark

FileStore
• Large WAF when the write request size is small (over 40 for 4KB writes)
• WAF converges to 6 (3x from replication, 2x from Ceph journaling)

KStore
• Sudden WAF jumps due to memtable flushes
• WAF converges to 3 (from replication)

BlueStore
• Because the minimum extent size is set to 64KB:
  − Zero-filled data for the hole within the object
  − WAL (Write-Ahead Logging) for small overwrites, to guarantee durability
• WAF converges to 3 (from replication)

Page 30

Long-Term Workload Methodology

We focus on 4KB random writes
• In VDI, most writes are random and 4KB in size

Workload scenario
1. Install Ceph and create a krbd partition
2. Drop the page cache, call sync, and wait for 600 seconds
3. Issue 4KB random writes with QD=128 until the total write amount reaches 90% of the capacity (57.6GiB)

Run the tests with 16 HDDs first, then repeat with 16 SSDs

Calculate and break down WAF for each time period
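The 90%-of-capacity target in step 3 is easy to check, and the workload parameters map naturally onto fio-style options. A sketch; the option names mirror fio's, but the dict itself is only illustrative:

```python
RBD_SIZE_GIB = 64  # size of the krbd image from the evaluation setup

target_gib = RBD_SIZE_GIB * 0.9
print(target_gib)  # 57.6 GiB of total writes before the run stops

# Workload parameters from the scenario, written as fio-style options:
workload = {
    "rw": "randwrite",         # random writes
    "bs": "4k",                # 4KB request size
    "iodepth": 128,            # QD=128
    "size": f"{target_gib}g",  # stop at 90% of capacity
}
```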

Page 31

Long-Term Results: (1) FileStore

[Figure: write amount (GB per interval) and IOPS over time for HDD and SSD, broken down into Ceph data + Ceph metadata, Ceph journal, file system metadata, and file system journal]

HDD: the Ceph journal first absorbs the random writes; then performance throttling due to the slow HDDs causes large fluctuation from repeated throttling
SSD: no throttling; SAS SSDs with XFS are fast enough

Page 32

Long-Term Results: (2) KStore (RocksDB)

[Figure: write amount (GB per interval) and IOPS over time for HDD and SSD, broken down into Ceph data, Ceph metadata, compaction, file system metadata, and file system journal]

Huge compaction traffic in both cases; on SSDs, the fluctuation gets worse

Page 33

Long-Term Results: (3) BlueStore

[Figure: write amount (GB per interval) and IOPS over time for HDD and SSD, broken down into Ceph data, RocksDB, RocksDB WAL, compaction, and zero-filled data]

Since extents are allocated in sequential order, the first incoming writes become sequential writes
Zero-filled data traffic is huge at first
SSDs have superior random write performance

Pages 34-37

Overall Results of WAF

[Figure: WAF breakdown (Ceph data, Ceph metadata, Ceph journal, compaction, zero-filled data, file system metadata, file system journal) for FileStore, KStore (RocksDB), and BlueStore, each on HDD and SSD]

FileStore
• WAF for the Ceph journal is about 6 in both cases, not 3 → the Ceph journal triples the write traffic
• FS metadata + FS journal = 4.2 (HDD), 4.054 (SSD)

KStore
• Huge compaction traffic matters

BlueStore
• WAF is still larger than FileStore's even without the zero-filled data
• Compaction traffic for RocksDB is not negligible

Page 38

Conclusion

Writes are amplified by more than 13x in Ceph
• No matter which storage backend is used

FileStore
• External Ceph journaling triples the write traffic
• File system overhead exceeds the original data traffic

KStore
• Suffers huge compaction overhead

BlueStore
• Small write requests are logged in the RocksDB WAL
• Non-negligible zero-filled data & compaction traffic

Thank you!