Scaling Ceph at CERN
Dan van der Ster ([email protected])
Data and Storage Service Group | CERN IT Department
CERN’s Mission and Tools
● CERN studies the fundamental laws of nature
  ○ Why do particles have mass?
  ○ What is our universe made of?
  ○ Why is there no antimatter left?
  ○ What was matter like right after the “Big Bang”?
  ○ …
● The Large Hadron Collider (LHC)
  ○ Built in a 27km long tunnel, ~200m underground
  ○ Dipole magnets operated at -271°C (1.9K)
  ○ Particles do ~11’000 turns/sec, 600 million collisions/sec
  ○ …
● Detectors
  ○ Four main experiments, each the size of a cathedral
  ○ DAQ systems processing PetaBytes/sec
Big Data at CERN
Physics Data on CASTOR/EOS
● LHC experiments produce ~10GB/s, 25PB/year

User Data on OpenAFS & DFS
● Home directories for 30k users
● Physics analysis development
● Project spaces for applications

Service Data on AFS/NFS
● Databases, admin applications

Tape archival with CASTOR/TSM
● RAW physics outputs
● Desktop/Server backups
Service    Size      Files
OpenAFS    290TB     2.3B
CASTOR     89.0PB    325M
EOS        20.1PB    160M
IT Evolution at CERN
Cloudifying CERN’s IT infrastructure ...
● Centrally-managed and uniform hardware
  ○ No more service-specific storage boxes
● OpenStack VMs for most services
  ○ Building for 100k nodes (mostly for batch processing)
● Attractive desktop storage services
  ○ Huge demand for a local Dropbox, Google Drive …
● Remote data centre in Budapest
  ○ More rack space and power, plus disaster recovery

… brings new storage requirements
● Block storage for OpenStack VMs
  ○ Images and volumes
● Backend storage for existing and new services
  ○ AFS, NFS, OwnCloud, Data Preservation, ...
● Regional storage
  ○ Use of our new data centre in Hungary
● Failure tolerance, data checksumming, easy to operate, security, ...
Ceph at CERN
12 racks of disk server quads
(Image from Wiebalck / van der Ster, “Building an organic block storage service at CERN with Ceph”)
Our 3PB Ceph Cluster
47 disk servers / 1128 OSDs, 5 monitors

Monitors:
● Dual Intel Xeon L5640 (24 threads incl. HT)
● Dual 1Gig-E NICs (only one connected)
● 2x 2TB Hitachi system disks (RAID-1 mirror)
● 1x 240GB OCZ Deneva 2 SSD for /var/lib/ceph/mon
● 48GB RAM

Disk servers:
● Dual Intel Xeon E5-2650 (32 threads incl. HT)
● Dual 10Gig-E NICs (only one connected)
● 24x 3TB Hitachi disks (Eco drive, ~5900 RPM)
● 3x 2TB Hitachi system disks (triple mirror)
● 64GB RAM

# df -h /mnt/ceph
Filesystem      Size  Used  Avail  Use%  Mounted on
xxx:6789:/      3.1P  173T  2.9P   6%    /mnt/ceph
Use-Cases Being Evaluated
1. Images and Volumes for OpenStack
2. S3 Storage for Data Preservation / Public Dissemination
3. Physics data storage for archival and/or analysis
#1 is moving into production. #2 and #3 are more exploratory at the moment.
OpenStack Volumes & Images
• Glance: using RBD for ~3 months now.
  • The only issue was the need to increase ulimit -n above 1024 (10k is good); see the sketch after this list.
• Cinder: testing with close colleagues.
  • 126 Cinder volumes attached today – 56TB used
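A minimal sketch of one way to raise the open-file limit, e.g. for the Glance service; the file name and values here are illustrative assumptions, not our actual configuration:

# /etc/security/limits.d/91-glance.conf (hypothetical file)
glance soft nofile 10240
glance hard nofile 10240

# verify in a shell running as that user
$ ulimit -n
10240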
[Chart: growing number of volumes/images. Usual traffic is ~50-100MB/s with current usage (~idle).]
RBD for OpenStack Volumes
• Before general availability, we need to test and enable qemu iops/bps throttling; see the sketch below.
  • Otherwise VMs with many IOs can disrupt other users.
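As an illustration of front-end throttling, Cinder QoS specs can pass per-disk limits down to qemu via libvirt; the names and values below are placeholders, not limits we have settled on:

# create a QoS spec enforced by the hypervisor (front-end) and attach it to a volume type
cinder qos-create rbd-throttle consumer=front-end total_iops_sec=500 total_bytes_sec=52428800
cinder qos-associate <qos_id> <volume_type_id>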
• One ongoing issue is that a few clients are getting an (infrequent) segfault of qemu during a VM reboot.
  • Happens on VMs with many attached RBDs.
  • Difficult to get a complete (16GB) core dump.
CASTOR & XRootD/EOS
• Exploring a RADOS backend for these two HEP-developed file systems
  • Gateway model, similar to S3 via RADOSGW
• CASTOR needs raw throughput performance (to feed many tape drives at 250MBps each).
  • Striped reads/writes across many OSDs are important.
• XRootD/EOS may benefit from the highly scalable namespace to store O(billion) objects
  • Bonus: XRootD also offers http/webdav with X509/kerberos, possibly even fuse mountable.
• Developments are in early stages.
Operations & Lessons Learned
Configuration and Deployment
• Dumpling 0.67.7
• Fully Puppet-ized
  • Automated server deployment, automated OSD replacement
• Very few custom ceph.conf options →

mon osd down out interval = 900
osd pool default size = 3
osd pool default min size = 1
osd pool default pg num = 1024
osd pool default pgp num = 1024
osd pool default flag hashpspool = true
osd max backfills = 1
osd recovery max active = 1

• Experimenting with the filestore wbthrottle
  • We find that disabling it completely gives better IOps performance
  • But don’t do this!!!
Cluster Activity
General Comments…
• In these ~7 months of running the cluster, there have been very few problems
  • No outages
  • No data losses/corruptions
  • No unfixable performance issues
  • Behaves well during stress tests
• But now we’re starting to get real/varied/creative users, and this brings up many interesting issues...
• “No amount of stress testing can prepare you for real users” - Unknown
  • (point being, don’t take the next slides to be too negative – I’m just trying to give helpful advice ;)
Latency & Slow Requests
• Best latency we can achieve is 20-40ms
  • Slow SATA disks, no SSDs: hard to justify SSDs in a multi-PB cluster, but one could in a smaller, limited use-case cluster (e.g. Cinder-only)
• Latency can increase dramatically with heavy usage
  • Don’t mix latency-bound and throughput-bound users on the same OSDs
• Local processes scanning the disks can hurt performance
  • Add /var/lib/ceph to the updatedb PRUNEPATHS (see the sketch after this list)
• If you have slow disks like us, you need to understand your disk IO scheduler – e.g. deadline prefers reads over writes: writes are given a 5 second deadline vs. 500ms for reads!
• Scrubbing!
• Kernel tuning: vm.* sysctls, dirty page flushing, memory reclaiming…
  • “Something is flushing the buffers, blocking the OSD processes”
• Slow requests: monitor them, eliminate them.
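A minimal sketch of the host tuning described above; the sysctl values are illustrative guesses, not our production settings:

# keep updatedb off the OSD filesystems (/etc/updatedb.conf)
PRUNEPATHS="/tmp /var/spool /media /var/lib/ceph"

# inspect/set the IO scheduler and its read/write deadlines (sdb is a placeholder)
cat /sys/block/sdb/queue/scheduler
echo deadline > /sys/block/sdb/queue/scheduler
cat /sys/block/sdb/queue/iosched/read_expire   # 500 (ms) by default
cat /sys/block/sdb/queue/iosched/write_expire  # 5000 (ms) by default

# make dirty-page flushing start earlier and in smaller batches
sysctl -w vm.dirty_background_ratio=5
sysctl -w vm.dirty_ratio=10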
Life with 250 million objects
• Recently, a user decided to write 250 million 1kB objects
  • Not so unreasonable: 250M * 4MB = 1PB, so this simulates the cluster being full of RBD images, at least in terms of # objects
• It worked – no big problems from holding this many objects.
• Tested single OSD failure: ~7 hours to backfill, including a double-backfill glitch that we’re trying to understand.
• But now we want to clean up, and it is not trivial to remove 250M objects!
  • rados rmpool (see below) generated quite a load when we rm’d a 3 million object pool (some OSDs were temporarily marked down).
  • Probably due to a mistake in our wbthrottle tuning
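For reference, pool removal is a single command; the pool name (a placeholder here) must be given twice, plus a confirmation flag:

# deletes the pool and every object in it – generates real cluster load
rados rmpool testpool testpool --yes-i-really-really-mean-it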
Other backfilling issues
• During a backfilling event (draining a whole server), we started observing repeated monitor elections
  • Caused by the mons’ LevelDBs being so active that the local SATA disks couldn’t keep up.
  • When a mon falls behind, it calls an election
  • Could be due to LevelDB compaction…
  • We moved /var/lib/ceph/mon to SSDs – no more elections during backfilling
• Avoid double backfilling when taking an OSD out of service (see the sketch below):
  • Start with ceph osd crush rm <osd id> !!
  • If you mark the OSD out first, then crush rm it, you will compute a new CRUSH map twice, i.e. backfill twice.
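A sketch of that removal order, using osd.42 as a placeholder id:

# remove the OSD from CRUSH first – this triggers the one and only backfill
ceph osd crush rm osd.42
# after backfilling completes, remove its auth key and id
ceph auth del osd.42
ceph osd rm 42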
Fun with CRUSH
• CRUSH is simple yet powerful, so it is tempting to play with the cluster layout
  • But once you have non-zero amounts of data, significant CRUSH changes will lead to massive data movements, which create extra disk load and may disrupt users.
  • Early CRUSH planning is crucial!
• A network switch is a failure domain, so we should configure CRUSH to replicate across switches, right? (An example rule is sketched after this list.)
  • But (assuming we don’t have a private cluster network) that would send all replication traffic via the switch uplinks – bottleneck!
  • Unclear tradeoff between uptime and performance.
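For illustration only, a rule replicating across switches could look like the sketch below; it assumes a custom “switch” bucket type has been defined in the CRUSH map, which is not necessarily our layout:

rule replicate_across_switches {
    ruleset 1
    type replicated
    min_size 1
    max_size 10
    step take default
    # place each replica under a different switch bucket
    step chooseleaf firstn 0 type switch
    step emit
}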
CRUSH & Data distribution
• CRUSH may give your cluster an uneven data distribution
  • An OSD’s used space will scale with the number of PGs assigned to it
• After you have designed your cluster, created your pools, and started adding data, check the PG and volume distributions
• reweight-by-utilization is useful to iron out an uneven PG distribution (see the sketch after this list)
• The hashpspool flag is also important if you have many active pools
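A sketch of checking and correcting the distribution; 120 means “only reweight OSDs above 120% of mean utilization” and is the usual starting point, not a tuned value:

# inspect per-OSD PG counts and usage
ceph pg dump osds
# gently lower the weight of over-full OSDs
ceph osd reweight-by-utilization 120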
[Histogram: Number of OSDs having N PGs (for pool = volumes)]
RBD Reliability with 3 Replicas
• RBD devices are chunked across thousands of objects:
  • A full 1TB volume is composed of 250,000 4MB objects
  • If any single object is lost, the whole RBD can be considered corrupted (obviously, it depends which blocks are lost!)
  • If you lose an entire PG, you can consider all RBDs to be lost / corrupted.
• Our incorrect & irrational fears:
  • Any simultaneous triple disk failure in the cluster would lead to objects being lost – and somehow all RBDs would be corrupted.
  • As we add OSDs to the cluster, the data gets spread wider, and the chances of RBD data loss increase.
• But this is wrong!!
  • The only triple disk failures that can lead to data loss are those combinations actively used by PGs – so having e.g. 4096 PGs for RBDs means that only 4096 combinations out of the ~10^9 possible combinations matter.
  • P(loss) ≈ N_PGs × P_diskfailure^3 / 3!
• We use 4 replicas for the RBD volumes, but this is probably overkill.
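To make the formula concrete with purely illustrative numbers (the per-disk failure probability is an assumption, not a measured rate): with N_PGs = 4096 and P_diskfailure = 0.01 within one recovery window, P(loss) ≈ 4096 × (0.01)^3 / 3! = 4096 × 10^-6 / 6 ≈ 7 × 10^-4.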
Trust your clients
• There is no server-side per-client throttling
  • A few nasty clients can overwhelm an OSD, leading to slow requests for everyone.
• When you have high load / slow requests, it is not always trivial to identify and blacklist/firewall the misbehaving client (one blacklisting command is sketched below)
  • Could use some help in the monitoring: per-client perf stats?
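Once a misbehaving client is identified, RADOS does allow blacklisting it by address; the address and expiry below are placeholders:

# blacklist a client address, with an optional expiry in seconds
ceph osd blacklist add 10.0.0.99:0/123456 3600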
• One of our creative users found a way to make the mons generate 5*40 MBps of outbound network traffic
  • Could saturate the mon network, lead to disruptions
• RADOS is not for end-users. A cephx keyring is for trusted persons only, not for Joe Random User.
Fat fingers
• A healthy cluster is always vulnerable to human errors
  • We’ve thus far avoided any big mistakes
• Used PG splitting to grow a pool from 8 to 2048 PGs
  • Leads to unresponsive OSDs which get marked down → degraded objects
  • Safer & now-enforced to grow in 2x or 4x steps (see the sketch below)
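A sketch of the step-wise approach, with a hypothetical pool name and target size:

# double pg_num/pgp_num in steps, letting the cluster settle in between
ceph osd pool set volumes pg_num 16
ceph osd pool set volumes pgp_num 16
ceph osd pool set volumes pg_num 32
ceph osd pool set volumes pgp_num 32
# ...repeat until the target (e.g. 2048) is reached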
• ulimits, ulimits, ulimits
  • With a large number of OSDs (say, more than 500), you will hit num file and num process limits everywhere: Glance, qemu, radosgw, the ceph/rados CLIs, …
• If you use XFS, don’t put your OSD journal as a file on the disk (see the sketch below)
  • Use a separate partition, the first partition!
  • We still need to reinstall our whole cluster to re-partition the OSDs
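A minimal sketch of pointing an OSD at a raw journal partition rather than a file; the OSD id and device path are placeholders:

[osd.42]
    # first partition of the data disk, not a file inside the XFS filesystem
    osd journal = /dev/sdb1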
Scale up and out
• Scale up: we are demonstrating the viability of a 3PB cluster with O(1000) OSDs.
  • What about 10,000 or 100,000 OSDs?
  • What about 10,000 or 100,000 clients?
  • Many Ceph instances is always an option, but not ideal
• Scale out: our growing data centre in Budapest brings many options:
  • Replicate over the WAN (though, 30ms RTT)
  • Tiering / caching pools (new feature, need to get experience…)
  • Data locality – direct IOs to the nearby replica or a caching pool
Summary
• CERN IT infrastructure is undergoing a private cloud revolution, and Ceph is providing the underlying storage.
• Our CASTOR and XRootD physics data use-cases may exploit RADOS for improved performance/scalability.
• In seven months with a 3PB cluster, we’ve not had any disasters. Actually, it’s working quite well.
• I’ve presented some lessons learned; I hope they prove useful in your Ceph explorations.