CephFS and Samba: Scale-out file serving for the Masses
David Disseldorp, Senior Software Engineer, [email protected]
Jan Fajerski, Senior Software Engineer, [email protected]
https://jan--f.github.io/19_04_susecon/
Agenda
• Quick intro
• Spotlight on a few (new) features
• Samba Gateway
• Performance numbers
Ceph Introduction
• Distributed storage system based on RADOS
  – Scalability by design
  – Fault tolerant
  – Self-healing and self-managing
• Unified storage cluster
  – Object storage
  – Block storage
  – File system (POSIX compatible)
  – Your application
Ceph Architecture
CephFS Architecture
CephFS features (a selection)
Multiple MDS daemons: High availability and scalability
• A failed MDS will bring the service down
• Many clients and many files can overwhelm the MDS cache
• The directory tree is partitioned into ranks - max_mds defaults to 1
• Additional MDS daemons will join the cluster as standbys
• Increase the MDS count:
systemctl start ceph-mds@<id>   # start an additional MDS
ceph fs set <cephfs> max_mds 2
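Once max_mds has been raised, the standard status commands can be used to check which daemons hold active ranks and which remain standby (the filesystem name is whatever was passed to ceph fs set above):
ceph fs status <cephfs>   # per-rank state, plus standby daemons
ceph mds stat             # compact summary of ranks and states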
Multi MDS configuration: Failover handling parameters
• mds_beacon_grace
• mds_standby_replay
• mds_replay_interval
• mds_standby_for_rank
• mds_standby_for_name
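A minimal ceph.conf sketch showing how a few of these options fit together, on releases where the per-daemon standby settings are still honoured; the daemon names (mds.a, mds.b) and values are illustrative assumptions, not recommendations:
[global]
mds_beacon_grace = 15          # seconds without a beacon before an MDS is considered laggy

[mds.a]
mds_standby_replay = true      # follow the active journal for faster takeover
mds_standby_for_rank = 0       # only stand in for rank 0

[mds.b]
mds_standby_for_name = a       # only stand in for the MDS named "a"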
Control rank partitioning
• Automatic partitioning based on load - hard to beat
• Pin directories (and their sub-directories) to a certain rank (example below)
  – Spread periodic loads
  – Prevent particular load from impacting others
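Pinning is driven by the ceph.dir.pin extended attribute introduced on the next slide; the mount path below is an example:
setfattr -n ceph.dir.pin -v 1 /mnt/cephfs/projectA    # pin projectA and its subtree to rank 1
setfattr -n ceph.dir.pin -v -1 /mnt/cephfs/projectA   # -1 returns the subtree to automatic balancing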
Extended attributes
Many options can be controlled at runtime via setxattr
Caveats
• File and dir layout are inherited at creation time - don't apply to existing inodes
• Clients need the p flag in their cephx key to set quota and layout restrictions

setfattr -n <attribute.name> -v 1000000 /some/dir   # set to 1 million
getfattr -n <attribute.name> /some/dir              # read xattr
setfattr -x <attribute.name> /some/dir              # remove xattr

ceph.dir.pin
ceph.quota.[max_bytes|max_files]
ceph.[file|dir].layout.[pool|pool_namespace|stripe_unit|stripe_count|object_size]
Quotas
• Directory based
• Set with xattrs (example below)
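A quota sketch using the ceph.quota attributes listed above; the mount path and limits are examples:
setfattr -n ceph.quota.max_bytes -v 100000000000 /mnt/cephfs/some/dir   # roughly 100 GB
setfattr -n ceph.quota.max_files -v 10000 /mnt/cephfs/some/dir          # at most 10000 files
getfattr -n ceph.quota.max_bytes /mnt/cephfs/some/dir                   # verify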
Limitations
• Cooperative - needs client cooperation
• Imprecise - rule of thumb: clients will be stopped within 10s of reaching a quota
• Kernel support requires a >=4.17 kernel and a Mimic+ cluster
• Path restrictions can interfere - clients need read permission on the quota directory
• Snapshots of since-deleted data do not count - see bug #24284
Snapshots
• Snapshot a directory tree
• Fast creation - data writeback is asynchronous
• Magic .snap directory
• Clients need the s flag in their cephx key

mkdir .snap/my_snapshot   # take a snapshot called my_snapshot
ls .snap/                 # list snapshots
rmdir .snap/my_snapshot   # remove snapshot
client.0
  key: AQAz7EVWygILFRAAdIcuJ12opU/JKyfFmxhuaw==
  caps: [mds] allow rw, allow rws path=/bar
  caps: [mon] allow r
  caps: [osd] allow rw tag cephfs data=cephfs_a
Samba
Samba Introduction
• File and print server
  – SMB / CIFS, SMB2 and SMB3+ dialects
• Authentication
  – NTLMv2 and Kerberos
• Identity mapping
  – Windows SIDs to uids and gids
  – Active Directory domain member
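A minimal smb.conf sketch for an Active Directory domain member with RID-based identity mapping; the realm, workgroup and ID ranges are illustrative assumptions:
[global]
    security = ads
    realm = EXAMPLE.COM
    workgroup = EXAMPLE
    # fallback mapping for BUILTIN and other non-domain SIDs
    idmap config * : backend = tdb
    idmap config * : range = 100000-199999
    # deterministic SID-to-uid/gid mapping for the domain
    idmap config EXAMPLE : backend = rid
    idmap config EXAMPLE : range = 1000000-1999999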
Samba Gateway for CephFS
• Samba VFS layer for filesystem abstraction
  – Ceph VFS module provides libcephfs callouts
• One or more nodes "proxy" SMB traffic through to CephFS
• Samba performs authentication and ID mapping
  – Static CephX credentials per Samba gateway
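A share definition sketch using the Ceph VFS module; the share name, CephFS path and CephX user are examples:
[cephfs_share]
    path = /
    vfs objects = ceph
    # cluster config and CephX user the gateway connects with via libcephfs
    ceph:config_file = /etc/ceph/ceph.conf
    ceph:user_id = samba.gw
    read only = no
    # flock-based share modes are not available through libcephfs
    kernel share modes = no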
Samba Gateway for CephFS
Samba Clustering with CTDB
• Clustered Trivial Database (CTDB)
• Share state across multiple Samba nodes
  – Key-value store
  – Reliable messaging
• Active / Active
• HA features
• Monitoring and failover
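A sketch of the usual per-node wiring for CTDB; the file paths follow common defaults and the addresses are examples:
# /etc/samba/smb.conf - run Samba against the clustered databases
[global]
    clustering = yes

# /etc/ctdb/nodes - private address of every CTDB node, identical on all nodes
192.168.100.1
192.168.100.2

# /etc/ctdb/public_addresses - floating IPs that fail over between nodes
10.0.0.10/24 eth0
10.0.0.11/24 eth0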
Clustering with CTDB
• Record location master and data master
• Elected recovery master monitors the state of the cluster
  – Performs database recovery if necessary
• Cluster-wide mutex used to prevent split brain
  – Uses Ceph RADOS object lock
• Connections to public IPs are tracked
  – Reset on IP failover
  – Gratuitous ARP and "tickle" clients
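CTDB's cluster-wide mutex can be pointed at the RADOS lock helper shipped with Samba/CTDB; the helper path, cephx user, pool and object name below are assumptions that vary by distribution and deployment:
# /etc/ctdb/ctdb.conf
[cluster]
    # helper arguments: <cluster name> <cephx user> <pool> <lock object>
    recovery lock = !/usr/lib/ctdb/ctdb_mutex_ceph_rados_helper ceph client.samba.gw ctdb_pool ctdb_reclock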
Clustering with CTDB
Gateway limitations
• Cross protocol access
• Performance
• Leases / oplocks
• Clustered node failover
• Load balancing
Demonstration
Performance
Basic hardware performance: Cluster hardware
• Ceph on 8 nodes
  – 5 OSD nodes – 24 cores – 128 GB RAM
  – 16 OSD daemons per node – 1 per SSD, 5 per NVMe
  – 3 MON/MDS nodes – 24 cores – 128 GB RAM
• 9 client nodes
  – 16 cores – 64 GB RAM
Basic networking performance
• 25G networking, 25G to 100G interfaces
• Client node -> OSD node: 25 Gbit/s
• Multiple client nodes -> OSD node: 72 Gbit/s
• OSD node -> OSD node: 68 Gbit/s
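One way to reproduce such point-to-point throughput numbers is iperf3 (not necessarily the tool used here; the hostname and stream count are examples):
iperf3 -s                          # on the receiving node
iperf3 -c osd-node-1 -P 8 -t 60    # on the sending node: 8 parallel streams for 60 seconds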
OSD devices
• 2 x Intel® SSD DC P3700 Series 800GB, 1/2 Height PCIe 3.0, 20nm, MLC
• 5 x Intel® SSD DC S3700 Series 400GB, 2.5in SATA 6Gb/s, 25nm, MLC
Workloads: Experiments
• Each client (thread) operates on 100 4 MB files
• Sequential IO in 4k and 4M block sizes
• fsync every 64 IO operations
• 10 minutes runtime with 2 minutes ramp time
• 7 clients, various process counts per client
Thank you - Questions?
Many thanks to Adam Spiers, Florian Haas, and Hakim El-Hattab and contributors
https://jan--f.github.io/19_04_susecon/
fio example job
[global]
directory=/run/ceph_bench/1:/run/ceph_bench/2:/run/ceph_bench/3:/run/ceph_bench/4:/run/ceph_bench/5
unlink=1
wait_for_previous=1
filesize=4m
nrfiles=100
ramp_time=2m
time_based=1
runtime=10m
# size shouldn't be reached due to runtime
size=100G
fsync=64
numjobs=64