CephFS and Samba: Scale-out file serving for the Masses
David Disseldorp, Senior Software Engineer, [email protected]
Jan Fajerski, Senior Software Engineer, [email protected]
https://jan--f.github.io/19_04_susecon/
Agenda
• Quick intro
• Spotlight on a few (new) features
• Samba Gateway
• Performance numbers
Ceph Introduction
• Distributed storage system based on RADOS
  – Scalability by design
  – Fault tolerant
  – Self-healing and self-managing
• Unified storage cluster
  – Object storage
  – Block storage
  – File system (POSIX compatible)
  – Your application
Ceph Architecture
CephFS Architecture
CephFS features (a selection)
Multiple MDS daemons: High availability and scalability
• A failed MDS will bring the service down
• Many clients and many files can overwhelm the MDS cache
• The directory tree is partitioned into ranks - max_mds defaults to 1
• Additional MDS daemons will join the cluster as standbys
• Increase the MDS count:
systemctl start ceph-mds@<id>   # start an additional MDS
ceph fs set <cephfs> max_mds 2
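Once max_mds has been raised, the standard status commands can be used to check which daemons hold active ranks and which remain standby (the filesystem name is whatever was passed to ceph fs set above):
ceph fs status <cephfs>   # per-rank state, plus standby daemons
ceph mds stat             # compact summary of ranks and states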
Multi MDS configuration: Failover handling parameters
• mds_beacon_grace
• mds_standby_replay
• mds_replay_interval
• mds_standby_for_rank
• mds_standby_for_name
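A minimal ceph.conf sketch showing how a few of these options fit together, on releases where the per-daemon standby settings are still honoured; the daemon names (mds.a, mds.b) and values are illustrative assumptions, not recommendations:
[global]
mds_beacon_grace = 15          # seconds without a beacon before an MDS is considered laggy

[mds.a]
mds_standby_replay = true      # follow the active journal for faster takeover
mds_standby_for_rank = 0       # only stand in for rank 0

[mds.b]
mds_standby_for_name = a       # only stand in for the MDS named "a"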
Control rank partitioning
• Automatic partitioning based on load - hard to beat
• Pin directories (and their sub-directories) to a certain rank (example below)
  – Spread periodic loads
  – Prevent particular load from impacting others
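Pinning is driven by the ceph.dir.pin extended attribute introduced on the next slide; the mount path below is an example:
setfattr -n ceph.dir.pin -v 1 /mnt/cephfs/projectA    # pin projectA and its subtree to rank 1
setfattr -n ceph.dir.pin -v -1 /mnt/cephfs/projectA   # -1 returns the subtree to automatic balancing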
Extended attributes
Many options can be controlled at runtime via setxattr
Caveats
• File and dir layout are inherited at creation time - don't apply to existing inodes
• Clients need the p flag in their cephx key to set quota and layout restrictions

setfattr -n <attribute.name> -v 1000000 /some/dir   # set to 1 million
getfattr -n <attribute.name> /some/dir              # read xattr
setfattr -x <attribute.name> /some/dir              # remove xattr

ceph.dir.pin
ceph.quota.[max_bytes|max_files]
ceph.[file|dir].layout.[pool|pool_namespace|stripe_unit|stripe_count|object_size]
Quotas
• Directory based
• Set with xattrs (example below)
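A quota sketch using the ceph.quota attributes listed above; the mount path and limits are examples:
setfattr -n ceph.quota.max_bytes -v 100000000000 /mnt/cephfs/some/dir   # roughly 100 GB
setfattr -n ceph.quota.max_files -v 10000 /mnt/cephfs/some/dir          # at most 10000 files
getfattr -n ceph.quota.max_bytes /mnt/cephfs/some/dir                   # verify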
Limitations
• Cooperative - needs client cooperation
• Imprecise - rule of thumb: clients will be stopped within 10s of reaching a quota
• Kernel support requires a >=4.17 kernel and a Mimic+ cluster
• Path restrictions can interfere - clients need read permission on the quota directory
• Snapshots of since-deleted data do not count - see bug #24284
Snapshots
• Snapshot a directory tree
• Fast creation - data writeback is asynchronous
• Magic .snap directory
• Clients need the s flag in their cephx key

mkdir .snap/my_snapshot   # take a snapshot called my_snapshot
ls .snap/                 # list snapshots
rmdir .snap/my_snapshot   # remove snapshot
client.0
  key: AQAz7EVWygILFRAAdIcuJ12opU/JKyfFmxhuaw==
  caps: [mds] allow rw, allow rws path=/bar
  caps: [mon] allow r
  caps: [osd] allow rw tag cephfs data=cephfs_a
Samba
Samba Introduction
• File and print server
  – SMB / CIFS, SMB2 and SMB3+ dialects
• Authentication
  – NTLMv2 and Kerberos
• Identity mapping
  – Windows SIDs to uids and gids
  – Active Directory domain member
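A minimal smb.conf sketch for an Active Directory domain member with RID-based identity mapping; the realm, workgroup and ID ranges are illustrative assumptions:
[global]
    security = ads
    realm = EXAMPLE.COM
    workgroup = EXAMPLE
    # fallback mapping for BUILTIN and other non-domain SIDs
    idmap config * : backend = tdb
    idmap config * : range = 100000-199999
    # deterministic SID-to-uid/gid mapping for the domain
    idmap config EXAMPLE : backend = rid
    idmap config EXAMPLE : range = 1000000-1999999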
Samba Gateway for CephFS
• Samba VFS layer for filesystem abstraction
  – Ceph VFS module provides libcephfs callouts
• One or more nodes "proxy" SMB traffic through to CephFS
• Samba performs authentication and ID mapping
  – Static CephX credentials per Samba gateway
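A share definition sketch using the Ceph VFS module; the share name, CephFS path and CephX user are examples:
[cephfs_share]
    path = /
    vfs objects = ceph
    # cluster config and CephX user the gateway connects with via libcephfs
    ceph:config_file = /etc/ceph/ceph.conf
    ceph:user_id = samba.gw
    read only = no
    # flock-based share modes are not available through libcephfs
    kernel share modes = no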
Samba Gateway for CephFS
Samba Clustering with CTDB
• Clustered Trivial Database (CTDB)
• Share state across multiple Samba nodes
  – Key-value store
  – Reliable messaging
• Active / Active
• HA features
• Monitoring and failover
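A sketch of the usual per-node wiring for CTDB; the file paths follow common defaults and the addresses are examples:
# /etc/samba/smb.conf - run Samba against the clustered databases
[global]
    clustering = yes

# /etc/ctdb/nodes - private address of every CTDB node, identical on all nodes
192.168.100.1
192.168.100.2

# /etc/ctdb/public_addresses - floating IPs that fail over between nodes
10.0.0.10/24 eth0
10.0.0.11/24 eth0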
Clustering with CTDB
• Record location master and data master
• Elected recovery master monitors the state of the cluster
  – Performs database recovery if necessary
• Cluster-wide mutex used to prevent split brain
  – Uses Ceph RADOS object lock
• Connections to public IPs are tracked
  – Reset on IP failover
  – Gratuitous ARP and "tickle" clients
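CTDB's cluster-wide mutex can be pointed at the RADOS lock helper shipped with Samba/CTDB; the helper path, cephx user, pool and object name below are assumptions that vary by distribution and deployment:
# /etc/ctdb/ctdb.conf
[cluster]
    # helper arguments: <cluster name> <cephx user> <pool> <lock object>
    recovery lock = !/usr/lib/ctdb/ctdb_mutex_ceph_rados_helper ceph client.samba.gw ctdb_pool ctdb_reclock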
Clustering with CTDB
Gateway limitations
• Cross protocol access
• Performance
• Leases / oplocks
• Clustered node failover
• Load balancing
Demonstration
Performance
Basic hardware performance: Cluster hardware
• Ceph on 8 nodes
  – 5 OSD nodes – 24 cores – 128 GB RAM
  – 16 OSD daemons per node – 1 per SSD, 5 per NVMe
  – 3 MON/MDS nodes – 24 cores – 128 GB RAM
• 9 client nodes
  – 16 cores – 64 GB RAM
Basic networking performance
• 25G networking, 25G to 100G interfaces
• Client node -> OSD node: 25 Gbit/s
• Multiple client nodes -> OSD node: 72 Gbit/s
• OSD node -> OSD node: 68 Gbit/s
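One way to reproduce such point-to-point throughput numbers is iperf3 (not necessarily the tool used here; the hostname and stream count are examples):
iperf3 -s                          # on the receiving node
iperf3 -c osd-node-1 -P 8 -t 60    # on the sending node: 8 parallel streams for 60 seconds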
OSD devices
• 2 x Intel® SSD DC P3700 Series 800GB, 1/2 Height PCIe 3.0, 20nm, MLC
• 5 x Intel® SSD DC S3700 Series 400GB, 2.5in SATA 6Gb/s, 25nm, MLC
Workloads: Experiments
• Each client (thread) operates on 100 4 MB files
• Sequential IO in 4k and 4M block sizes
• fsync every 64 IO operations
• 10 minutes runtime with 2 minutes ramp time
• 7 clients, various process counts per client
Thank you - Questions?
Many thanks to Adam Spiers, Florian Haas, and Hakim El-Hattab and contributors
https://jan--f.github.io/19_04_susecon/
fio example job
[global]
directory=/run/ceph_bench/1:/run/ceph_bench/2:/run/ceph_bench/3:/run/ceph_bench/4:/run/ceph_bench/5
unlink=1
wait_for_previous=1
filesize=4m
nrfiles=100
ramp_time=2m
time_based=1
runtime=10m
# size shouldn't be reached due to runtime
size=100G
fsync=64
numjobs=64