High Performance Ceph
Jun Park (Adobe) and Dan Ferber (Intel)





Caveats

§ Relative meaning of “High Performance”

§ Still on a long journey; evolving

§ Not at large scale, unlike CERN

§ Possibly opinionated in some cases

OpenStack Core Services

[Diagram: OpenStack core services: Neutron (networking), Compute, Murano (app catalog), Heat (orchestration), with Ceph providing storage]

How To Evaluate Storage?

§ Capacity

§ IOPS (Bandwidth)

§ Durability

§ Availability

Typically, IOPS is the first bottleneck you hit with HDDs.

* IOPS: I/O operations per second, regardless of block size

Path For IOPS

[Diagram: I/O path from VM -> Compute Host -> Network -> Physical Store (HDD or SSD)]

IOPS (almost bandwidth) is delivered along this path.

Relationship Between IOPS and Bandwidth

[Charts: IOPS vs. block size and Bandwidth vs. block size, for block sizes from 4KB to 4096KB]

Bandwidth (Throughput) = IOPS x Block Size

In some cases IOPS stays flat as the block size grows; doubling the block size then doubles the bandwidth.
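As a quick illustration of the formula above, here is a minimal sketch (not from the talk; the 5000 IOPS figure is hypothetical) that converts an IOPS number and a block size into throughput:

    # Minimal sketch: bandwidth = IOPS x block size
    def bandwidth_mb_s(iops, block_size_kb):
        """Throughput in MB/s for a given IOPS figure and block size in KB."""
        return iops * block_size_kb / 1024.0

    # If IOPS stays flat while the block size doubles, bandwidth doubles
    for block_kb in (4, 8, 16, 128):
        print(block_kb, "KB ->", bandwidth_mb_s(5000, block_kb), "MB/s")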

Ceph With Replication

§ Distribution Algorithm: Straw -> Tree

[Diagram: 3 replicas spread across Rack1-Rack4, one copy per data node (Data node 1, 2, 3)]

16 OSDs x 4 racks x 4 data nodes x 2 TB / 3 replicas = ~170 TB (effective disk capacity for users)
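The capacity arithmetic above is easy to check with a short sketch (the cluster shape is the one on the slide; the helper below is illustrative, not from the talk):

    # Effective capacity = raw capacity / replica count
    def effective_capacity_tb(osds_per_node, nodes_per_rack, racks, drive_tb, replicas):
        raw_tb = osds_per_node * nodes_per_rack * racks * drive_tb
        return raw_tb / replicas

    # 16 OSDs x 4 data nodes x 4 racks x 2 TB drives, 3x replication
    print(effective_capacity_tb(16, 4, 4, 2.0, 3))  # ~170.7 TB usable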

Ceph Architecture

[Diagram: Ceph data nodes, Ceph monitor nodes, and compute nodes (the Ceph clients), each with 2 x 10G; VLAN100 carries the Ceph public network, VLAN200 the Ceph cluster network]

Write Operation With Journaling

[Diagram: write path with journaling: journals on SSDs, data on drives, e.g. SAS 2TB drives]

<Ceph1>:~# lscpu | egrep 'Thread|Core|Socket|^CPU\('
CPU(s):                48
Thread(s) per core:    2
Core(s) per socket:    12
Socket(s):             2
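As a rough, hedged model (not from the talk): with FileStore-style journaling every write lands in the journal first and then in the data store, so a journal colocated on the spinning drive absorbs roughly twice the client bytes, while a separate SSD journal removes that penalty. A minimal sketch of the arithmetic, with hypothetical device speeds:

    # Rough model of the journaling double write (device speeds are hypothetical)
    HDD_MB_S = 150.0   # assumed sustained write speed of one SAS drive
    SSD_MB_S = 400.0   # assumed sustained write speed of one journal SSD

    # Journal colocated on the HDD: every client byte is written twice to the same drive
    colocated = HDD_MB_S / 2

    # Journal on the SSD: the HDD only sees the data write, the SSD only the journal write
    separate = min(HDD_MB_S, SSD_MB_S)

    print("colocated journal:", colocated, "MB/s per OSD")
    print("SSD journal:      ", separate, "MB/s per OSD")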

Consistent Hashing In Ceph

§ Advantages
  § No need to store metadata explicitly
  § Fast

§ Disadvantages
  § Overhead of rebalancing
  § Operational difficulties in dealing with edge cases

§ E.g., Swift, Cassandra, Amazon Dynamo, and so on (a minimal sketch follows below)
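To make the trade-offs concrete, here is a minimal consistent-hashing sketch (illustrative only; Ceph's actual placement algorithm, CRUSH, is more elaborate than a plain hash ring). Object names map to the first node clockwise on a hash ring, so placement needs no metadata table, but adding or removing a node still remaps a slice of the keys, which is the rebalance overhead noted above:

    import hashlib
    from bisect import bisect

    def _h(key):
        # stable hash of a string onto the ring
        return int(hashlib.md5(key.encode()).hexdigest(), 16)

    class HashRing:
        def __init__(self, nodes, vnodes=100):
            # virtual nodes smooth out the key distribution
            self.ring = sorted((_h(f"{n}#{i}"), n) for n in nodes for i in range(vnodes))
            self.keys = [h for h, _ in self.ring]

        def locate(self, obj):
            # first node clockwise from the object's hash; no metadata lookup needed
            i = bisect(self.keys, _h(obj)) % len(self.ring)
            return self.ring[i][1]

    ring = HashRing(["node1", "node2", "node3"])
    print(ring.locate("rbd_data.1234"))  # placement is computed, not stored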

Network Bandwidth Impact (rados bench write)

rados bench write bandwidth (MB/s):

  10 Gbps interface: 1165
  20 Gbps interface: 1953

Same as in our lab with 20G (due to the smaller number of data nodes).
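These numbers sit close to line rate: 10 Gbps is roughly 1250 MB/s and 20 Gbps roughly 2500 MB/s, so the 10G interface is the write bottleneck and the 20G setup recovers most of the headroom. A quick check of the arithmetic (not from the talk):

    # Line-rate ceilings vs. measured rados bench write bandwidth
    GBPS_TO_MBS = 1000 / 8          # 1 Gbps is about 125 MB/s (decimal units)
    measured = {"10 Gbps": 1165, "20 Gbps": 1953}   # MB/s from the chart above
    for label, mbs in measured.items():
        ceiling = int(label.split()[0]) * GBPS_TO_MBS
        print(f"{label}: {mbs} MB/s of ~{ceiling:.0f} MB/s line rate ({mbs / ceiling:.0%})")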

Different I/O Patterns With 128K Block Size

Block size: 128K

  Random Write:                  4828 IOPS,  604 MB/s
  Sequential Write:              2913 IOPS,  364 MB/s
  Sequential Read:               7827 IOPS,  978 MB/s
  W25R75 (25% write / 75% read): 1864 IOPS,  233 MB/s
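The two series are consistent with the earlier formula, Bandwidth = IOPS x Block Size: at 128 KB each I/O moves 0.125 MB. A quick cross-check of the chart values:

    # Cross-check: bandwidth should be roughly IOPS x 0.125 MB at a 128 KB block size
    BLOCK_MB = 128 / 1024
    chart = {"Random Write": 4828, "Sequential Write": 2913,
             "Sequential Read": 7827, "W25R75": 1864}   # IOPS from the chart above
    for pattern, iops in chart.items():
        print(f"{pattern}: {iops} IOPS -> ~{iops * BLOCK_MB:.0f} MB/s")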

Random Writes of 3 VMs On The Same Compute

Bandwidth (MB/s): VM1 318, VM2 313, VM3 325
IOPS:             VM1 2547, VM2 2505, VM3 2600

• Block size: 64KB
• System-wide max performance with other traffic: max 40,000 IOPS, 1367 MB/s write
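Summing the three VMs gives about 956 MB/s and roughly 7,650 IOPS in aggregate, well under the stated system-wide maximum, which suggests the per-VM numbers here are not limited by the cluster-wide ceiling. A quick sum of the chart values:

    # Aggregate of the three VMs vs. the stated system-wide maximum
    bw = {"VM1": 318, "VM2": 313, "VM3": 325}        # MB/s, 64 KB random writes
    iops = {"VM1": 2547, "VM2": 2505, "VM3": 2600}
    print(sum(bw.values()), "MB/s aggregate vs. 1367 MB/s system-wide max")
    print(sum(iops.values()), "IOPS aggregate vs. 40,000 IOPS system-wide max")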

Pain Points In Production

Computes vs. data nodes: what ratio?

Upgrading? E.g., DecaPod, separated from OpenStack

Operational overheads? E.g., adding more data nodes -> creating internal traffic; deep scrubbing

Pin-pointing bottlenecks?

QoS?

Pleasure Points

§ Generic architecture
  § With high tech such as NVMe SSDs, immediately improved

§ Various use cases

§ Good community
  § Open & stable

§ Works well with OpenStack

§ Truly scale out
  § High performance with low cost

Future Of Ceph


Next Generation Ceph: BlueStore

[Diagram: BlueStore writing directly to SSDs and drives, e.g. SAS 2TB drives]

BlueStore: optimal key-value store, no POSIX, etc.

Next Steps?

§ NVMe (Non-Volatile Memory Express) SSD

§ Ceph Caching Tier

§ RDMA (Remote Direct Memory Access)

§ BlueStore

ONS '15, Amin Vahdat at Google