RED HAT CEPH STORAGE ACCELERATION UTILIZING FLASH TECHNOLOGY
Applications and Ecosystem Solutions Development
Rick Stehno
Red Hat Storage Day - Dallas 2017
Flash Acceleration for Applications
Three ways to accelerate application performance with flash:
• Utilize flash caching features to accelerate critical data. Caching methods can be
  write-back for writes, write-through for disk/cache transparency, read cache, etc.
• Utilize storage tiering capabilities. Performance-critical data resides on flash
  storage, colder data resides on HDD
• Utilize all-flash storage to accelerate performance when all application data is
  performance critical or when the application does not provide the features or
  capabilities to cache or to migrate the data
Ceph Software Defined Storage (SDS) Acceleration
Configurations:
• All flash storage - Performance
  – Highest performance per node
  – Less maximum capacity per node
• Hybrid HDD and flash storage - Balanced
  – Balances performance, capacity and cost
  – Suitable when the application and workload allow performance-critical data to
    reside on flash
  – Utilize host software caching or tiering on flash
• All HDD storage - Capacity
  – Maximum capacity per node, lowest cost
  – Lower performance per node
Storage - NVMe vs SATA SSD
Why a 1U server with 10 NVMe SSDs may be a better choice than a 2U server with 24 SATA SSDs:
– Higher performance in half the rack space
– 28% less power and cooling
– Higher MTBF inherent with reduced component count
– Reduced OSD recovery time per Ceph node
– Lower TCO
All Flash Storage - NVMe vs SATA SSD cont'd
Why a 1U server with 10 NVMe SSDs may be a better choice than a 2U server with 24 SATA SSDs
FIO benchmarks (1x represents the 24 SATA SSD baseline):
• 4.5x increase for 128k sequential reads
• 3.5x increase for 128k sequential writes
• 3.7x increase for 4k random reads
• 1.4x increase for 4k random 70/30 RR/RW
• Equal performance for 4k random writes
All Flash Storage - NVMe vs SATA SSD cont'd
Why a 1U server with 10 NVMe SSDs may be a better choice than a 2U server with 24 SATA SSDs
Increasing the load to extend NVMe
advantage over and above the 128
thread SATA SSD Test:
• 5.8x increase for Random Writes at
512 threads
• 3.1x increase for 70/30 RR/RW at
512 threads
• 4.2x increase for Random Reads at
790 threads
• 8.2x increase for Sequential Reads
at 1264 threads
10 NVMe SSDs support higher
workloads and more users
Ceph RBD NVMe Performance Gains over SATA SSD
128k FIO RBD IOEngine Benchmark (gains relative to the SATA SSD baseline)
Random Writes:     3.0x at 128 threads,  5.8x at 512 threads
70/30 RR/RW:       1.4x at 128 threads,  3.1x at 512 threads
Random Reads:      1.0x at 128 threads,  4.2x at 790 threads
Sequential Reads:  1.3x at 128 threads,  8.2x at 1264 threads
Ceph Storage Costs
Seagate SATA SSD vs. Seagate NVMe SSD
Price per MB/s = (retail cost of the SSDs) / (MB/s measured in each test)

SSD configuration     Total SSD price   Price per MB/s, 128k random   Price per MB/s, 128k random
                                        writes, 128 threads           writes, 512 threads
24 x SATA SSD 960GB   $7,896            $15.00                        -
10 x NVMe 2TB         $10,990           $7.00                         $3.00

Note: in the 128k random write FIO RBD benchmark, the SATA SSDs averaged 85% busy and the
NVMe SSDs averaged 80% busy with 512 threads. The 512-thread result is the FIO RBD
maximum-threads random write performance for NVMe.
These prices do not include the savings in electrical/cooling costs and datacenter floor
space that come from replacing the 24 SATA SSDs.
MySQL
• MySQL is the most popular and most widely used open-source database in the world
• MySQL is feature rich in the areas of performance, scalability and reliability
• Database users demand high OLTP performance - Small random reads/writes
Ceph
• Most popular Software Defined Storage system
• Scalable
• Reliable
Does it make sense to implement Ceph in a MySQL database environment?
Ceph was not designed to provide high performance for OLTP environments
OLTP entails small random reads/writes
MySQL - Comparing Local HDD to Ceph Cluster
MySQL Setup:
Release 5.7
45,000,000 rows
6GB buffer
4G logfiles
RAID 0 over 18 HDDs
Ceph Setup:
3 nodes, each containing:
Jewel using Filestore
4 NVMe SSDs
1 pool over 12 NVMe SSDs
Replica 2
40G private and public network
For all tests, all MySQL files were on the local server except the database file, which
was moved to the Ceph cluster.
(Chart: local HDD vs. Ceph cluster performance by number of threads)
MySQL - Comparing Local NVMe SSD to Ceph Cluster
MySQL Setup:
Release 5.7
45,000,000 rows
6GB Buffer
4G logfiles
RAID 0 over 4 NVMe SSDs
Ceph Setup:
3 Nodes each containing:
Jewel Using Filestore
4 NVMe SSDs
1 Pool over 12 NVMe SSDs
Replica 1
40G private and public
network
For all tests, all MySQL files were on the local server except the database file, which
was moved to the Ceph cluster.
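The setups above keep all MySQL files on the local server and move only the database
file to the Ceph cluster. Below is a minimal sketch of one way to do that with an RBD
image; the pool name, image name, size, PG count and mount point are illustrative
assumptions, not the exact configuration used in these tests.

  # Create a pool on the NVMe-backed OSDs and set the replica count
  # (Replica 2 or Replica 1, matching the setups above)
  ceph osd pool create mysqlpool 512 512
  ceph osd pool set mysqlpool size 1
  # Create and map an RBD image on the MySQL host, then place the database file on it
  rbd create mysqlpool/mysqldata --size 500G
  rbd map mysqlpool/mysqldata              # typically appears as /dev/rbd0
  mkfs.xfs /dev/rbd0
  mount -o noatime /dev/rbd0 /var/lib/mysql-ceph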
Ceph All Flash Storage Acceleration
All SSD - Seagate SSD and Seagate PCIe Storage
• Case 1: 2 SSDs, 1 OSD per SSD
• Case 2: 2 SSDs, 4 OSDs per SSD
• Case 3: 2 SSDs, 4 OSDs per SSD, with the 8 OSD journals on 1 PCIe flash card
(Chart: FIO random write, 200 threads, 128k data - KB/s and IOPS for "2 ssd, 2 osd",
"2 ssd, 8 osd" and "2 ssd, 8 osd + journal")
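A hedged sketch of how the Case 3 layout (filestore OSDs on SSD, journals on PCIe flash)
could be prepared with Jewel-era ceph-deploy; the host name, device names and partition
layout are assumptions, and the exact syntax depends on the ceph-deploy release.

  # 4 OSDs per SSD (8 total); each OSD journal is placed on its own partition of the
  # PCIe flash card (only the first two OSDs are shown)
  ceph-deploy osd prepare cephnode1:/dev/sdb1:/dev/pcieflash_part1
  ceph-deploy osd prepare cephnode1:/dev/sdb2:/dev/pcieflash_part2
  # ...repeat for the remaining SSD partitions and journal partitions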
Ceph All Flash Storage Acceleration
4K FIO RBD Benchmarks
3 node Ceph cluster
100G public and private networks
4 Seagate NVMe SSDs per node, 12 Seagate NVMe SSDs per cluster
Benchmark 1: 1 OSD per NVMe SSD
Benchmark 2: 4 OSDs per NVMe SSD
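For reference, a minimal 4K random-write job against the cluster using the FIO RBD
ioengine might look like the sketch below; the pool/image names, cephx client name, job
count, queue depth and runtime are assumptions rather than the exact job files used for
these benchmarks.

  fio --name=rbd-4k-randwrite --ioengine=rbd \
      --clientname=admin --pool=rbd --rbdname=fiotest \
      --rw=randwrite --bs=4k --direct=1 \
      --numjobs=8 --iodepth=16 \
      --runtime=300 --time_based --group_reporting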
Linux Flash Storage Tuning
Linux tuning is still a requirement to get optimum performance out of an SSD (example
commands below):
• Use the RAW device or create the 1st partition on a 1M boundary (sector 2048 for
  512B sectors, sector 256 for 4k sectors)
• ceph-deploy uses the optimal alignment when creating an OSD
• Use blk-mq/scsi-mq if the kernel supports it
• rq_affinity = 1 for NVMe, rq_affinity = 2 for non-NVMe
• rotational = 0
• blockdev --setra 256 (for 4k sectors; 4096 for 512B sectors)
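A short example of applying the settings above to a single NVMe device; the device name
is an assumption, and these sysfs changes do not persist across reboots.

  echo 1 > /sys/block/nvme0n1/queue/rq_affinity      # use 2 for non-NVMe devices
  echo 0 > /sys/block/nvme0n1/queue/rotational
  blockdev --setra 256 /dev/nvme0n1                  # 4k sectors; use 4096 for 512B sectors
  # Align the 1st partition on a 1M boundary (sector 2048 for 512B sectors)
  parted -s /dev/nvme0n1 mklabel gpt mkpart osd 1MiB 100%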
Linux Flash Storage Tuning cont'd
Linux tuning is still a requirement to get optimum performance out of an SSD (example
commands below):
• If using an older kernel that doesn't support blk-mq, use the "deadline" IO scheduler
  with its supporting variables:
  – fifo_batch
  – front_merges
  – writes_starved
• XFS mount options:
  – nobarrier,discard,noatime,attr2,inode64,noquota
• MySQL - when using flash, configure both innodb_io_capacity and innodb_lru_scan_depth
• Modify the Linux read ahead on the mapped RBD image on the client:
  – echo 1024 > /sys/class/block/rbd0/queue/read_ahead_kb
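Example commands for the tunings above; device names, the mount point and the InnoDB
values are placeholders to adapt per system, not recommended settings.

  echo deadline > /sys/block/sdb/queue/scheduler
  echo 16 > /sys/block/sdb/queue/iosched/fifo_batch       # placeholder value
  echo 1  > /sys/block/sdb/queue/iosched/front_merges     # placeholder value
  echo 2  > /sys/block/sdb/queue/iosched/writes_starved   # placeholder value
  mount -t xfs -o nobarrier,discard,noatime,attr2,inode64,noquota /dev/sdb1 /var/lib/ceph/osd/ceph-0
  mysql -e "SET GLOBAL innodb_io_capacity=10000; SET GLOBAL innodb_lru_scan_depth=4096;"
  echo 1024 > /sys/class/block/rbd0/queue/read_ahead_kb   # read ahead on the mapped RBD image (client side)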
Flash Storage Device Configuration
Ceph tuning is still a requirement to get optimum performance out of an SSD.
Ceph tuning options that can make a difference:
• RBD cache
• If using a smaller number of SSDs/NVMe SSDs, test with creating multiple OSDs per
  SSD/NVMe SSD. Good performance increases have been seen using 4 OSDs per SSD/NVMe
  SSD (see the sketch below)
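One hedged way to test 4 OSDs per NVMe SSD, as suggested above: split the device into 4
partitions and prepare one OSD per partition. The device and host names are assumptions,
and the exact commands depend on the Ceph release (Jewel-era ceph-deploy with filestore
is shown, matching the clusters used earlier in this deck).

  parted -s /dev/nvme0n1 mklabel gpt \
      mkpart osd1 1MiB 25% mkpart osd2 25% 50% mkpart osd3 50% 75% mkpart osd4 75% 100%
  for p in 1 2 3 4; do
      ceph-deploy osd prepare cephnode1:/dev/nvme0n1p${p}
  done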
Flash Storage Device Configuration
If the NVMe SSD or SAS/SATA SSD device can be configured to use a 4k sector size,
this could increase performance for certain applications like databases.
For all of my FIO tests with the RBD engine and for all of my MySQL tests, I saw up to
a 3x improvement (depending on the test) when using 4k sector sizes compared to
using 512 byte sectors.
Precondition all SSDs before running benchmarks. I have seen over a 3x gain in
performance after preconditioning (see the example below).
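A hedged sketch of both steps for an NVMe SSD: switching to a 4k LBA format (if the
drive exposes one) and preconditioning with fio. The LBA format index and run lengths
are assumptions, and reformatting destroys all data on the device.

  nvme id-ns /dev/nvme0n1 | grep lbaf       # list the LBA formats the drive supports
  nvme format /dev/nvme0n1 --lbaf=1         # index of the 4k-sector format on this drive (assumption)
  # Precondition: full sequential fill, then a sustained random-write pass
  fio --name=precond-seq  --filename=/dev/nvme0n1 --direct=1 --rw=write --bs=128k --loops=2
  fio --name=precond-rand --filename=/dev/nvme0n1 --direct=1 --rw=randwrite --bs=4k --runtime=1800 --time_based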
Storage devices used for all of the above benchmarks/tests:
• Seagate Nytro XF1440 NVMe SSD
• Seagate Nytro XF1230 SATA SSD
• Seagate 1200.2 SAS SSD
• Seagate XP6500 PCIe Flash Accelerator Card
Seagate Broadest PCIe, SAS and SATA Portfolio
Thank You!
Questions?
Learn how Seagate accelerates storage
with one of the broadest SSD and Flash
portfolios in the market