Investor Day 2014 | May 7, 2014
Optimizing Ceph for All-Flash Architectures
Vijayendra Shamanna (Viju), Senior Manager, Systems and Software Solutions
Forward-Looking Statements
During our meeting today we may make forward-looking statements.
Any statement that refers to expectations, projections or other characterizations of future events or circumstances is a forward-looking statement, including those relating to future technologies, products and applications. Actual results may differ materially from those expressed in these forward-looking statements due to factors detailed under the caption “Risk Factors” and elsewhere in the documents we file from time to time with the SEC, including our annual and quarterly reports.
We undertake no obligation to update these forward-looking statements, which speak only as of the date hereof.
What is Ceph?
A Distributed/Clustered Storage System
File, object, and block interfaces to a common storage substrate
Ceph is a mostly LGPL open-source project
– Part of the Linux kernel (since 2.6.34) and most distros
Ceph Architecture
[Diagram: multiple clients connect over an IP network to a cluster of storage nodes]
Ceph Deployment Modes
[Diagram: deployment modes – object storage (web server → RGW → librados) and block storage (guest OS block interface under KVM, or kernel /dev/rbd, both over librados)]
Ceph Operation
[Diagram: a client/application talks through an interface (block, object, file) to librados; each object name is hashed to a placement group (PG), and placement rules map each PG onto a set of OSDs]
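To make the two-step mapping concrete, here is a minimal sketch in C++. It is illustrative only: the standard-library hash, the round-robin OSD selection, and the flat OSD list are stand-ins for Ceph's actual rjenkins hash and CRUSH algorithm, which picks OSDs pseudo-randomly while honoring failure domains.

    #include <cstdint>
    #include <functional>
    #include <iostream>
    #include <string>
    #include <vector>

    // Sketch of Ceph's two-step placement (illustrative, not Ceph's code).
    struct PlacementSketch {
        uint32_t pg_num;             // placement groups in the pool
        uint32_t replicas;           // replication factor
        std::vector<uint32_t> osds;  // flat list of OSD ids

        // Step 1: object name -> PG via a stable hash.
        uint32_t object_to_pg(const std::string& name) const {
            return static_cast<uint32_t>(std::hash<std::string>{}(name)) % pg_num;
        }

        // Step 2: PG -> ordered set of OSDs ("placement rules"). Round-robin
        // here; CRUSH selects pseudo-randomly across failure domains instead.
        std::vector<uint32_t> pg_to_osds(uint32_t pg) const {
            std::vector<uint32_t> acting;
            for (uint32_t r = 0; r < replicas; ++r)
                acting.push_back(osds[(pg + r) % osds.size()]);
            return acting;
        }
    };

    int main() {
        PlacementSketch pool{128, 3, {0, 1, 2, 3, 4}};
        uint32_t pg = pool.object_to_pg("myobject");
        std::cout << "pg " << pg << " -> osds";
        for (uint32_t id : pool.pg_to_osds(pg)) std::cout << ' ' << id;
        std::cout << '\n';
    }

Because the mapping is pure computation over the cluster map, any client or OSD can evaluate it independently – no lookup service sits in the IO path.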
Ceph Placement Rules and Data Protection
[Diagram: placement groups PG[a], PG[b], PG[c] replicated across servers in separate racks, so replicas never share a failure domain]
Ceph's Built-in Data Integrity & Recovery
Checksum based “patrol reads”
– Local read of object, exchange of checksum over network
Journal-based Recovery
– Short down time (tunable)
– Re-syncs only changed objects
Object-based Recovery
– Long down time OR partial storage loss
– Object checksums exchanged over network
Enterprise Storage Features
FILE SYSTEM *
– POSIX compliance
– Distributed Metadata
– Dynamic Rebalancing
– CIFS/NFS
– Linux Kernel
– Configurable Striping

BLOCK STORAGE
– Thin Provisioning
– Snapshots
– Copy-on-Write Clones
– Incremental backups
– OpenStack integration
– iSCSI
– Configurable Striping

OBJECT STORAGE
– RESTful Interface
– S3 & Swift
– Multi-tenant
– Keystone authentication
– Geo-Replication
– Erasure Coding
– Striped Objects

* Future
Software Architecture – Key Components of a System Built for Hyperscale
Storage Clients
– Standard interfaces to the data (POSIX, device, Swift/S3…)
– Transparent to applications
– Intelligent; coordinate with peers

Object Storage Cluster (OSDs)
– Stores all data and metadata in flexible-sized containers – objects

Cluster Monitors
– Lightweight processes for authentication, cluster membership, and critical cluster state

Clients authenticate with monitors, then send IO directly to OSDs for scalable performance
No gateways, brokers, lookup tables, or indirection
[Diagram: Ceph key components – clients (compute) issue block/object IO directly to OSDs (storage); monitors distribute cluster maps]
Ceph Interface Ecosystem
– Native object access: librados
– File system: libcephfs – HDFS, kernel device + FUSE, Samba, NFS
– Object: RADOS Gateway (RGW) – S3, Swift
– Block: librbd – kernel client, iSCSI

All layered on RADOS (Reliable, Autonomic, Distributed Object Store)
OSD Architecture
[Diagram: network messages from clients and cluster peers arrive via the Messenger and Dispatcher, feed OSD work queues and thread pools, pass through the PG backend (replication or erasure coding), and issue transactions to an ObjectStore implementation – FileStore (a JournallingObjectStore), KeyValueStore, or MemStore]
Messenger layer
Removed the Dispatcher and introduced a "fast path" mechanism for read/write requests
– The same mechanism is now present on the client side (librados) as well
Fine-grained locking in the message transmit path
Introduced an efficient buffering mechanism for improved throughput
Configuration options to disable message signing, CRC checks, etc. (see the sketch below)
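For illustration, these switches live in ceph.conf; a sketch, with the caveat that option names vary across releases (ms_nocrc was the Firefly/Giant-era CRC switch, later split into ms_crc_data and ms_crc_header), so verify against your version's documentation:

    [global]
    # Skip per-message cephx signatures (only on a trusted network)
    cephx sign messages = false
    # Skip message payload CRCs (Firefly/Giant-era option name)
    ms nocrc = true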
OSD Request Processing
Running with the MemStore backend revealed a bottleneck in the OSD thread pool code
The OSD worker thread pool mutex was heavily contended
Implemented a sharded worker thread pool (sketched below); requests are sharded by their placement group (PG) identifier
Configuration options set the number of shards and the number of worker threads per shard
Optimized the OpTracking path (sharded the queue and removed redundant locks)
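A minimal sketch of the sharding idea (not Ceph's actual ShardedThreadPool): each shard owns its own mutex, condition variable, and queue, and a request's PG id selects the shard, so different PGs contend on different locks while per-PG ordering is preserved. The shard and worker counts correspond to the osd_op_num_shards and osd_op_num_threads_per_shard options.

    #include <atomic>
    #include <condition_variable>
    #include <cstdint>
    #include <functional>
    #include <mutex>
    #include <queue>
    #include <thread>
    #include <vector>

    // Illustrative sharded worker thread pool (not Ceph's implementation).
    class ShardedPool {
        struct Shard {
            std::mutex mtx;
            std::condition_variable cv;
            std::queue<std::function<void()>> q;
        };
        std::vector<Shard> shards_;
        std::vector<std::thread> workers_;
        std::atomic<bool> stop_{false};

    public:
        ShardedPool(size_t num_shards, size_t threads_per_shard)
            : shards_(num_shards) {
            for (size_t s = 0; s < num_shards; ++s)
                for (size_t t = 0; t < threads_per_shard; ++t)
                    workers_.emplace_back([this, s] { run(s); });
        }

        // Same PG -> same shard: spreads lock contention across shards
        // while keeping per-PG request ordering.
        void enqueue(uint64_t pg_id, std::function<void()> op) {
            Shard& sh = shards_[pg_id % shards_.size()];
            { std::lock_guard<std::mutex> l(sh.mtx); sh.q.push(std::move(op)); }
            sh.cv.notify_one();
        }

        ~ShardedPool() {
            stop_ = true;
            for (auto& sh : shards_) {
                std::lock_guard<std::mutex> l(sh.mtx);  // avoid missed wakeups
                sh.cv.notify_all();
            }
            for (auto& w : workers_) w.join();
        }

    private:
        void run(size_t s) {
            Shard& sh = shards_[s];
            for (;;) {
                std::unique_lock<std::mutex> l(sh.mtx);
                sh.cv.wait(l, [&] { return stop_ || !sh.q.empty(); });
                if (sh.q.empty()) return;  // stopping and fully drained
                auto op = std::move(sh.q.front());
                sh.q.pop();
                l.unlock();
                op();  // run the request outside the shard lock
            }
        }
    };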
FileStore improvements
Took backend storage out of the picture by using a small working set (FileStore served data from the page cache)
Severe lock contention in the LRU FD (file descriptor) cache; implemented a sharded version of the cache (sketched below)
The per-PG CollectionIndex object was created on every IO request; added a cache for it, since PG info rarely changes
Optimized the "object name to XFS file name" mapping function
Removed redundant snapshot-related checks in the parent-read processing path
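The same sharding trick applies to the FD cache; a minimal sketch (illustrative, not Ceph's FDCache): the key hash picks a shard, and each shard is an independent LRU behind its own lock, so lookups for different objects rarely serialize on one mutex.

    #include <cstddef>
    #include <functional>
    #include <list>
    #include <mutex>
    #include <optional>
    #include <string>
    #include <unordered_map>
    #include <utility>
    #include <vector>

    // Illustrative sharded LRU cache mapping object names to file descriptors.
    class ShardedLRU {
        using Entry = std::pair<std::string, int>;  // (object name, fd)
        struct Shard {
            std::mutex mtx;
            std::list<Entry> lru;  // front = most recently used
            std::unordered_map<std::string, std::list<Entry>::iterator> index;
        };
        std::vector<Shard> shards_;
        size_t per_shard_cap_;

        Shard& pick(const std::string& key) {
            return shards_[std::hash<std::string>{}(key) % shards_.size()];
        }

    public:
        ShardedLRU(size_t num_shards, size_t per_shard_cap)
            : shards_(num_shards), per_shard_cap_(per_shard_cap) {}

        std::optional<int> get(const std::string& key) {
            Shard& sh = pick(key);
            std::lock_guard<std::mutex> l(sh.mtx);
            auto it = sh.index.find(key);
            if (it == sh.index.end()) return std::nullopt;
            sh.lru.splice(sh.lru.begin(), sh.lru, it->second);  // bump to front
            return it->second->second;
        }

        void put(const std::string& key, int fd) {
            Shard& sh = pick(key);
            std::lock_guard<std::mutex> l(sh.mtx);
            auto it = sh.index.find(key);
            if (it != sh.index.end()) {  // refresh an existing entry
                it->second->second = fd;
                sh.lru.splice(sh.lru.begin(), sh.lru, it->second);
                return;
            }
            sh.lru.emplace_front(key, fd);
            sh.index[key] = sh.lru.begin();
            if (sh.lru.size() > per_shard_cap_) {     // evict the coldest entry
                sh.index.erase(sh.lru.back().first);  // a real FD cache would close() here
                sh.lru.pop_back();
            }
        }
    };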
Inconsistent Performance Observation
Large performance variations on different pools across multiple clients
First client after cluster restart gets maximum performance irrespective of the pool
Continued degraded performance from clients starting later
Issue also observed on read I/O with unpopulated RBD images, ruling out file system issues
Performance counters show up to a 3x increase in latency through the I/O path, with no particular bottleneck
Issue with TCmalloc
Perf top shows a rapid increase in time spent in TCmalloc functions:

    14.75%  libtcmalloc.so.4.1.2  [.] tcmalloc::CentralFreeList::FetchFromSpans()
     7.46%  libtcmalloc.so.4.1.2  [.] tcmalloc::ThreadCache::ReleaseToCentralCache(tcmalloc::ThreadCache::FreeList*, unsigned long, int)
I/O from a different client causes new threads in the sharded thread pool to process the I/O
This moves memory between thread caches and the central free list, increasing alloc/free latency
JEMalloc and glibc malloc do not exhibit this behavior
A JEMalloc build option was added to Ceph Hammer
Setting the TCmalloc tunable TCMALLOC_MAX_TOTAL_THREAD_CACHE_BYTES to a larger value (64 MB) alleviates the issue (see below)
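Both mitigations are environment-level; a sketch, assuming OSDs are started from a shell or init script (library paths and daemon flags vary by distro):

    # Raise TCmalloc's aggregate thread-cache budget to 64 MB;
    # libtcmalloc reads this tunable from the environment.
    export TCMALLOC_MAX_TOTAL_THREAD_CACHE_BYTES=$((64 * 1024 * 1024))
    ceph-osd -i 0

    # Or swap the allocator entirely by preloading jemalloc.
    LD_PRELOAD=/usr/lib/libjemalloc.so.1 ceph-osd -i 0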
Client Optimizations
Ceph, by default, turns Nagle's algorithm off (TCP_NODELAY)
The RBD kernel driver ignored the TCP_NODELAY setting
This caused large latency variations at lower queue depths
Changes to the RBD driver were submitted upstream (the snippet below shows what the setting does)
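For reference, disabling Nagle on a connected socket is a single setsockopt call; a minimal userspace sketch (the actual fix lives in the kernel RBD messenger code):

    #include <netinet/in.h>
    #include <netinet/tcp.h>
    #include <sys/socket.h>

    /* Disable Nagle's algorithm so small requests are sent immediately
     * instead of being coalesced; returns 0 on success, -1 on error. */
    int set_nodelay(int sock_fd) {
        int one = 1;
        return setsockopt(sock_fd, IPPROTO_TCP, TCP_NODELAY, &one, sizeof(one));
    }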
Results!
Hardware Topology
8K Random IOPS Performance – 1 RBD/Client
[Charts: IOPS and latency (ms), 1 LUN/client (4 clients total); x-axis: queue depth (1, 4, 16) grouped by read percentage (0, 25, 50, 75, 100); series: Firefly vs. Giant]
64K Random IOPS Performance – 1 RBD/Client
[Charts: IOPS and latency (ms), 1 LUN/client (4 clients total); x-axis: queue depth (1, 4, 16) grouped by read percentage (0, 25, 50, 75, 100); series: Firefly vs. Giant]
256K Random IOPS Performance – 1 RBD/Client
[Charts: IOPS and latency (ms), 1 LUN/client (4 clients total); x-axis: queue depth (1, 4, 16) grouped by read percentage (0, 25, 50, 75, 100); series: Firefly vs. Giant]
8K Random IOPS Performance – 2 RBD/Client

[Charts: IOPS and latency (ms), 2 LUNs/client (4 clients total); x-axis: queue depth (1, 4, 16) grouped by read percentage (0, 25, 50, 75, 100); series: Firefly vs. Giant]
SanDisk Emerging Storage Solutions (EMS)

Our goal for EMS: expand the use of flash for newer enterprise workloads (e.g., Ceph, Hadoop, content repositories, media streaming, clustered database applications)
… through disruptive innovation: "disaggregation" of storage from compute
… by building shared storage solutions
… thereby allowing customers to lower TCO relative to HDD-based offerings by scaling storage and compute separately
Mission: Provide the best building blocks for “shared storage” flash solutions.
SanDisk Systems & Software Solutions: Building Blocks Transforming Hyperscale

InfiniFlash™
– Capacity and high density for big data analytics, media services, content repositories
– 512TB in 3U, 780K IOPS, SAS storage
– Compelling TCO through space and energy savings

ION Accelerator™
– Accelerates databases, OLTP, and other high-performance workloads
– 1.7M IOPS, 23 GB/s, 56 µs latency
– Lowers costs with IT consolidation

FlashSoft®
– Server-side SSD-based caching to reduce I/O latency
– Improves application performance
– Maximizes storage infrastructure investment

ZetaScale™
– Lowers TCO while retaining performance for in-memory databases
– Runs multi-TB data sets on SSDs vs. GB-scale data sets in DRAM
– Delivers up to 5:1 server consolidation, improving TCO

Massive capacity and extreme performance for applications, servers, and virtualization
Thank you! Questions? [email protected]