
Optimizing Ceph for All-Flash Architectures


Page 1: Optimizing Ceph for All-Flash Architectures

Investor Day 2014 | May 7, 2014

Optimizing Ceph for All-Flash Architectures

Vijayendra Shamanna (Viju) Senior Manager Systems and Software Solutions

Page 2: Optimizing Ceph for All-Flash Architectures


Forward-Looking Statements

During our meeting today we may make forward-looking statements.

Any statement that refers to expectations, projections or other characterizations of future events or circumstances is a forward-looking statement, including those relating to future technologies, products and applications. Actual results may differ materially from those expressed in these forward-looking statements due to factors detailed under the caption “Risk Factors” and elsewhere in the documents we file from time to time with the SEC, including our annual and quarterly reports.

We undertake no obligation to update these forward-looking statements, which speak only as of the date hereof.

Page 3: Optimizing Ceph for All-Flash Architectures


What is Ceph?

A Distributed/Clustered Storage System

File, object, and block interfaces to a common storage substrate

Ceph is a mostly LGPL-licensed open source project

– The client side has been part of the Linux kernel (and most distributions) since the 2.6.x series

Page 4: Optimizing Ceph for All-Flash Architectures


Ceph Architecture

Diagram: multiple clients connect over an IP network to a set of storage nodes.

Page 5: Optimizing Ceph for All-Flash Architectures


Ceph Deployment Modes

Diagram: deployment modes – a web server uses RGW on top of librados for object storage; a guest OS sees a block interface, served either by KVM via librados/librbd or by the /dev/rbd kernel device.

Page 6: Optimizing Ceph for All-Flash Architectures


Ceph Operation

Diagram: a client/application uses an interface (block, object, file) built on librados; each object is hashed to a placement group (PG), and placement rules map PGs onto the OSDs (see the sketch below).
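
The flow above can be illustrated with a toy sketch (Python, purely illustrative: Ceph actually uses its own stable hash plus the CRUSH algorithm over the cluster map, and the pool id and pg_num also enter the calculation):

    import hashlib

    def object_to_pg(object_name: str, pg_num: int) -> int:
        """Toy version of 'hash the object name into a placement group'.
        Ceph uses its own hash and folds in the pool id."""
        h = int.from_bytes(hashlib.md5(object_name.encode()).digest()[:4], 'little')
        return h % pg_num

    def pg_to_osds(pg_id: int, osds: list, replicas: int = 3) -> list:
        """Toy stand-in for placement rules: deterministically pick
        'replicas' distinct OSDs for a PG."""
        start = pg_id % len(osds)
        return [osds[(start + i) % len(osds)] for i in range(replicas)]

    osds = ['osd.0', 'osd.1', 'osd.2', 'osd.3', 'osd.4']
    pg = object_to_pg('rbd_data.1234.0000000000000001', pg_num=128)
    print(pg, pg_to_osds(pg, osds))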

Page 7: Optimizing Ceph for All-Flash Architectures


Ceph Placement Rules and Data Protection

Diagram: placement groups PG[a], PG[b], and PG[c] are placed on OSDs across different servers and racks, so that replicas land in separate failure domains (a toy sketch follows).
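
The failure-domain idea can be sketched the same way – pick each replica from a different rack (a toy illustration only; real placement is computed by CRUSH rules over the cluster map):

    def place_across_racks(pg_id: int, racks: dict, replicas: int = 3) -> list:
        """Toy rule: choose one OSD from each of `replicas` different racks,
        so no two copies of a PG share a failure domain."""
        rack_names = sorted(racks)
        chosen = []
        for i in range(replicas):
            rack = rack_names[(pg_id + i) % len(rack_names)]
            osds_in_rack = racks[rack]
            chosen.append(osds_in_rack[pg_id % len(osds_in_rack)])
        return chosen

    racks = {
        'rack1': ['osd.0', 'osd.1'],
        'rack2': ['osd.2', 'osd.3'],
        'rack3': ['osd.4', 'osd.5'],
    }
    print(place_across_racks(pg_id=7, racks=racks))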

Page 8: Optimizing Ceph for All-Flash Architectures


Ceph’s Built-in Data Integrity & Recovery

Checksum-based “patrol reads” (see the sketch after this list)

– Local read of the object; checksums exchanged over the network

Journal-based Recovery

– For short downtime (tunable)

– Re-syncs only the changed objects

Object-based Recovery

– For long downtime or partial storage loss

– Object checksums exchanged over network
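
A patrol read along these lines can be sketched as: each replica reads its copy locally, computes a checksum, and only the checksums cross the network for comparison (illustrative Python; Ceph’s actual scrub machinery works per placement group and is more involved):

    import hashlib

    def local_checksum(object_bytes: bytes) -> str:
        """Each replica reads its own copy locally and hashes it."""
        return hashlib.sha256(object_bytes).hexdigest()

    def patrol_read(replica_copies: dict) -> list:
        """Compare only the checksums (cheap to ship over the network)
        and report replicas that disagree with the majority."""
        sums = {osd: local_checksum(data) for osd, data in replica_copies.items()}
        majority = max(set(sums.values()), key=list(sums.values()).count)
        return [osd for osd, s in sums.items() if s != majority]

    copies = {'osd.0': b'object data', 'osd.1': b'object data', 'osd.2': b'object dat!'}
    print(patrol_read(copies))   # -> ['osd.2']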

Page 9: Optimizing Ceph for All-Flash Architectures


Enterprise Storage Features

OBJECT STORAGE: RESTful Interface; S3 & Swift; Multi-tenant; Keystone authentication; Geo-Replication; Erasure Coding; Striped Objects

BLOCK STORAGE: Thin Provisioning; Snapshots; Copy-on-Write Clones; Incremental backups; OpenStack integration; iSCSI; Linux Kernel; Configurable Striping

FILE SYSTEM *: POSIX compliance; CIFS/NFS; Distributed Metadata; Dynamic Rebalancing; Configurable Striping

* Future

Page 10: Optimizing Ceph for All-Flash Architectures


Software Architecture – Key Components of a System Built for Hyperscale

Storage Clients
– Standard interfaces to use the data (POSIX, device, Swift/S3, …)
– Transparent to applications
– Intelligent; coordinate with peers

Object Storage Cluster (OSDs)
– Stores all data and metadata in flexible-sized containers – objects

Cluster Monitors
– Lightweight processes for authentication, cluster membership, and critical cluster state

Clients authenticate with the monitors and send I/O directly to the OSDs for scalable performance (a minimal librados sketch follows the diagram note below)

No gateways, brokers, lookup tables, indirection …

Diagram (Ceph key components): clients on the compute side issue block/object I/O directly to the OSDs on the storage side; the monitors distribute the cluster maps.
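
To make the “direct I/O to OSDs, no gateways” point concrete, here is roughly what client access looks like through the librados Python binding (a minimal sketch; the pool name, object name, and ceph.conf path are illustrative assumptions):

    import rados

    # Connect: the client authenticates with the monitors, fetches the
    # cluster maps, then talks to the OSDs directly.
    cluster = rados.Rados(conffile='/etc/ceph/ceph.conf')   # path is an assumption
    cluster.connect()

    ioctx = cluster.open_ioctx('rbd')                 # 'rbd' pool assumed to exist
    ioctx.write_full('hello_object', b'hello ceph')   # I/O goes straight to the OSDs
    print(ioctx.read('hello_object'))

    ioctx.close()
    cluster.shutdown()

Once connect() returns, the client holds the cluster maps and computes object placement itself, so reads and writes go straight to the responsible OSDs.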

Page 11: Optimizing Ceph for All-Flash Architectures


Ceph Interface Ecosystem

Diagram: the client interface ecosystem, all built on RADOS (Reliable, Autonomous, Distributed Object Store):
– Native object access via librados
– File system via libcephfs (HDFS, kernel client + FUSE, Samba, NFS)
– Object storage via the RADOS Gateway (RGW) with S3 and Swift APIs
– Block interface via librbd (kernel driver, iSCSI); a minimal librbd sketch follows
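
For the block interface, the librbd Python binding follows the same pattern on top of librados (again a minimal sketch; the pool and image names are assumptions):

    import rados
    import rbd

    cluster = rados.Rados(conffile='/etc/ceph/ceph.conf')   # path is an assumption
    cluster.connect()
    ioctx = cluster.open_ioctx('rbd')                        # 'rbd' pool assumed

    rbd.RBD().create(ioctx, 'demo-image', 1 * 1024**3)       # 1 GiB thin-provisioned image
    with rbd.Image(ioctx, 'demo-image') as image:
        image.write(b'hello block world', 0)                 # write at offset 0
        print(image.read(0, 17))

    ioctx.close()
    cluster.shutdown()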

Page 12: Optimizing Ceph for All-Flash Architectures


OSD Architecture

Diagram: network messages from clients and cluster peers arrive at the Messenger; a dispatcher feeds the OSD work queues and thread pools; the PG backend (replication or erasure coding) turns requests into transactions against the ObjectStore layer (FileStore/JournallingObjectStore, KeyValueStore, MemStore).

Page 13: Optimizing Ceph for All-Flash Architectures


Messenger layer

Removed the Dispatcher and introduced a “fast path” mechanism for read/write requests

– The same mechanism is now present on the client side (librados) as well

Fine-grained locking in the message transmit path

Introduced an efficient buffering mechanism for improved throughput

Configuration options to disable message signing, CRC checks, etc. (see the example below)
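
For reference, a ceph.conf excerpt of the kind of options involved might look like the following. The option names reflect the Firefly/Hammer era (in particular, "ms nocrc" was later replaced by separate data/header CRC options), so treat them as assumptions to check against your release:

    [global]
    # disable cephx message signing
    cephx sign messages = false
    # disable messenger CRC checks (option name from the Firefly/Hammer era)
    ms nocrc = true
    # keep TCP_NODELAY on messenger sockets (this is the default)
    ms tcp nodelay = true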

Page 14: Optimizing Ceph for All-Flash Architectures


OSD Request Processing

Running with the MemStore backend revealed a bottleneck in the OSD thread pool code

The OSD worker thread pool mutex was heavily contended

Implemented a sharded worker thread pool; requests are sharded based on their PG (placement group) identifier

Configuration options set the number of shards and the number of worker threads per shard (see the sketch after this list)

Optimized the OpTracking path (sharded queue, removed redundant locks)
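
The sharding idea can be illustrated with a minimal sketch (Python, purely for illustration; the OSD’s real implementation is a C++ sharded thread pool, with the shard/thread counts exposed via options such as osd_op_num_shards and osd_op_num_threads_per_shard):

    import queue
    import threading

    class ShardedWorkerPool:
        """Toy sharded worker pool: requests for the same PG always land on
        the same shard, so there is one queue per shard instead of one
        global mutex for the whole OSD."""

        def __init__(self, num_shards=5, threads_per_shard=2):
            self.shards = [queue.Queue() for _ in range(num_shards)]
            for q in self.shards:
                for _ in range(threads_per_shard):
                    threading.Thread(target=self._worker, args=(q,), daemon=True).start()

        def submit(self, pg_id, request):
            # Shard selection: same PG id -> same shard (preserves per-PG ordering)
            self.shards[hash(pg_id) % len(self.shards)].put(request)

        def _worker(self, q):
            while True:
                request = q.get()
                request()          # process the I/O request
                q.task_done()

    pool = ShardedWorkerPool()
    pool.submit(42, lambda: print("handled op for PG 42"))
    for q in pool.shards:
        q.join()   # wait for outstanding requests (demo only)

Because a given PG always hashes to the same shard, per-PG ordering is preserved while requests for different PGs stop contending on a single mutex.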

Page 15: Optimizing Ceph for All-Flash Architectures


FileStore improvements

Eliminated backend storage from the picture by using a small working set (FileStore served data from the page cache)

Severe lock contention in the LRU FD (file descriptor) cache; implemented a sharded version of the LRU cache (sketch below)

The per-PG CollectionIndex object was being created on every I/O request; implemented a cache for it, since PG info doesn’t change often

Optimized the object-name-to-XFS-file-name mapping function

Removed redundant snapshot-related checks in the parent read processing path
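
The FD-cache fix follows the same pattern; a minimal sketch of a sharded LRU cache (illustrative Python, not FileStore’s actual C++ code):

    import threading
    from collections import OrderedDict

    class ShardedLRUCache:
        """Toy sharded LRU cache in the spirit of the FD-cache fix: one
        lock + LRU map per shard instead of a single global lock."""

        def __init__(self, num_shards=16, per_shard_capacity=64):
            self.locks = [threading.Lock() for _ in range(num_shards)]
            self.shards = [OrderedDict() for _ in range(num_shards)]
            self.capacity = per_shard_capacity

        def _shard(self, key):
            return hash(key) % len(self.shards)

        def get(self, key):
            i = self._shard(key)
            with self.locks[i]:
                if key not in self.shards[i]:
                    return None
                self.shards[i].move_to_end(key)          # mark as most recently used
                return self.shards[i][key]

        def put(self, key, value):
            i = self._shard(key)
            with self.locks[i]:
                self.shards[i][key] = value
                self.shards[i].move_to_end(key)
                if len(self.shards[i]) > self.capacity:
                    self.shards[i].popitem(last=False)   # evict least recently used

    cache = ShardedLRUCache()
    cache.put('object_123', 'fd-42')   # e.g. object name -> open file descriptor
    print(cache.get('object_123'))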

Page 16: Optimizing Ceph for All-Flash Architectures


Inconsistent Performance Observation

Large performance variations on different pools across multiple clients

The first client to run after a cluster restart gets maximum performance, irrespective of the pool

Clients starting later see continued degraded performance

The issue is also observed on read I/O against unpopulated RBD images – ruling out file system issues

Performance counters show up to a 3x increase in latency through the I/O path, with no particular bottleneck

Page 17: Optimizing Ceph for All-Flash Architectures


Issue with TCmalloc

Perf top shows a rapid increase in time spent in TCMalloc functions:

    14.75%  libtcmalloc.so.4.1.2  [.] tcmalloc::CentralFreeList::FetchFromSpans()
     7.46%  libtcmalloc.so.4.1.2  [.] tcmalloc::ThreadCache::ReleaseToCentralCache(tcmalloc::ThreadCache::FreeList*, unsigned long, int)

I/O from a different client causes new threads in the sharded thread pool to process the I/O

This causes memory movement between thread caches and increases alloc/free latency

jemalloc and glibc malloc do not exhibit this behavior

A jemalloc build option was added to Ceph Hammer

Setting the TCMalloc tunable TCMALLOC_MAX_TOTAL_THREAD_CACHE_BYTES to a larger value (64 MB) alleviates the issue (see the note below)
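
As a concrete note on that last tunable: it is an environment variable read by TCMalloc at process start, so it must be set in the environment of each ceph-osd process (exactly where depends on your init system; the placement below is an assumption, and 64 MB = 64 × 1024 × 1024 = 67108864 bytes):

    # set in the ceph-osd service environment (mechanism varies by distro/init system)
    TCMALLOC_MAX_TOTAL_THREAD_CACHE_BYTES=67108864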

Page 18: Optimizing Ceph for All-Flash Architectures


Client Optimizations

Ceph, by default, turns Nagle’s algorithm off (TCP_NODELAY)

The RBD kernel driver ignored the TCP_NODELAY setting

This caused large latency variations at lower queue depths

Changes to the RBD driver were submitted upstream (a generic TCP_NODELAY sketch follows)
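
To make the Nagle point concrete, this is what disabling Nagle’s algorithm looks like on a generic TCP socket (illustrative Python, not Ceph or kernel RBD code); with TCP_NODELAY set, small requests are sent immediately instead of being coalesced, which is what the RBD driver fix restores:

    import socket

    # Generic illustration: disable Nagle's algorithm so small writes
    # are transmitted immediately instead of being batched.
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY, 1)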

Page 19: Optimizing Ceph for All-Flash Architectures


Results!

Page 20: Optimizing Ceph for All-Flash Architectures


Hardware Topology

Page 21: Optimizing Ceph for All-Flash Architectures


8K Random IOPS Performance – 1 RBD/Client

Charts: IOPS and latency (ms) for 1 LUN per client (4 clients total), plotted against queue depth (1, 4, 16) at read percentages of 0, 25, 50, 75, and 100, comparing the Firefly and Giant releases.

Page 22: Optimizing Ceph for All-Flash Architectures


64K Random IOPS Performance – 1 RBD/Client

Charts: IOPS and latency (ms) for 1 LUN per client (4 clients total), against queue depth (1, 4, 16) at 0/25/50/75/100% reads, Firefly vs. Giant.

Page 23: Optimizing Ceph for All-Flash Architectures


256K Random IOPS Performance – 1 RBD/Client

Charts: IOPS and latency (ms) for 1 LUN per client (4 clients total), against queue depth (1, 4, 16) at 0/25/50/75/100% reads, Firefly vs. Giant.

Page 24: Optimizing Ceph for All-Flash Architectures


8K Random IOPS Performance – 2 RBD/Client

Charts: IOPS and latency (ms) for 2 LUNs per client (4 clients total), against queue depth (1, 4, 16) at 0/25/50/75/100% reads, Firefly vs. Giant.

Page 25: Optimizing Ceph for All-Flash Architectures


SanDisk Emerging Storage Solutions (EMS)

Our goal for EMS: expand the usage of flash for newer enterprise workloads (e.g., Ceph, Hadoop, content repositories, media streaming, clustered database applications)

… through disruptive innovation – the “disaggregation” of storage from compute

… by building shared storage solutions

… thereby allowing customers to lower TCO relative to HDD-based offerings by scaling storage and compute separately

Mission: provide the best building blocks for “shared storage” flash solutions.

Page 26: Optimizing Ceph for All-Flash Architectures


SanDisk Systems & Software Solutions Building Blocks Transforming Hyperscale

InfiniFlash™

Capacity & High Density for Big Data Analytics, Media Services, Content Repositories

512TB 3U, 780K IOPS, SAS storage Compelling TCO– Space/Energy Savings

ION Accelerator™

Accelerate Database, OLTP & other high performance workloads 1.7M IOPS, 23 GB/s, 56 us latency Lower costs with IT consolidation

FlashSoft®

Server-side SSD based caching to reduce I/O Latency Improve application performance Maximize storage infrastructure investment

ZetaScale™

Lower TCO while retaining performance for in-memory databases

Run multiple TB data sets on SSDs vs. GB data set on DRAM

Deliver up to 5:1 server consolidation, improving TCO

Massive Capacity Extreme Performance for Applications.. .. Servers/Virtualization

Page 27: Optimizing Ceph for All-Flash Architectures


Thank you! Questions? [email protected]