Scaling Ceph at CERN
Dan van der Ster ([email protected])
Data and Storage Service Group | CERN IT Department
CERN’s Mission and Tools
● CERN studies the fundamental laws of nature
  ○ Why do particles have mass?
  ○ What is our universe made of?
  ○ Why is there no antimatter left?
  ○ What was matter like right after the “Big Bang”?
  ○ …
● The Large Hadron Collider (LHC)
  ○ Built in a 27km long tunnel, ~200m underground
  ○ Dipole magnets operated at -271°C (1.9K)
  ○ Particles do ~11’000 turns/sec, 600 million collisions/sec
  ○ …
● Detectors
  ○ Four main experiments, each the size of a cathedral
  ○ DAQ systems processing PetaBytes/sec
Big Data at CERN
Physics Data on CASTOR/EOS
● LHC experiments produce ~10GB/s, 25PB/year

User Data on OpenAFS & DFS
● Home directories for 30k users
● Physics analysis development
● Project spaces for applications

Service Data on AFS/NFS
● Databases, admin applications

Tape archival with CASTOR/TSM
● RAW physics outputs
● Desktop/Server backups
Service    Size      Files
OpenAFS    290TB     2.3B
CASTOR     89.0PB    325M
EOS        20.1PB    160M
IT Evolution at CERN
Cloudifying CERN’s IT infrastructure ...
● Centrally-managed and uniform hardware
  ○ No more service-specific storage boxes
● OpenStack VMs for most services
  ○ Building for 100k nodes (mostly for batch processing)
● Attractive desktop storage services
  ○ Huge demand for a local Dropbox, Google Drive …
● Remote data centre in Budapest
  ○ More rack space and power, plus disaster recovery

… brings new storage requirements
● Block storage for OpenStack VMs
  ○ Images and volumes
● Backend storage for existing and new services
  ○ AFS, NFS, OwnCloud, Data Preservation, ...
● Regional storage
  ○ Use of our new data centre in Hungary
● Failure tolerance, data checksumming, easy to operate, security, ...
Ceph at CERN
12 racks of disk server quads
(Image from Wiebalck / van der Ster, “Building an organic block storage service at CERN with Ceph”)
Our 3PB Ceph Cluster
47 disk servers / 1128 OSDs, 5 monitors

Monitors:
● Dual Intel Xeon L5640 (24 threads incl. HT)
● Dual 1Gig-E NICs (only one connected)
● 2x 2TB Hitachi system disks (RAID-1 mirror)
● 1x 240GB OCZ Deneva 2 SSD for /var/lib/ceph/mon
● 48GB RAM

Disk servers:
● Dual Intel Xeon E5-2650 (32 threads incl. HT)
● Dual 10Gig-E NICs (only one connected)
● 24x 3TB Hitachi disks (Eco drive, ~5900 RPM)
● 3x 2TB Hitachi system disks (triple mirror)
● 64GB RAM

# df -h /mnt/ceph
Filesystem      Size  Used  Avail  Use%  Mounted on
xxx:6789:/      3.1P  173T  2.9P   6%    /mnt/ceph
Use-Cases Being Evaluated
1. Images and Volumes for OpenStack
2. S3 Storage for Data Preservation / Public Dissemination
3. Physics data storage for archival and/or analysis
#1 is moving into production. #2 and #3 are more exploratory at the moment.
OpenStack Volumes & Images
• Glance: using RBD for ~3 months now.
  • The only issue was the need to increase ulimit -n above 1024 (10k is good); see the sketch after this list.
• Cinder: testing with close colleagues.
  • 126 Cinder volumes attached today – 56TB used
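A minimal sketch of one way to raise the open-file limit, e.g. for the Glance service; the file name and values here are illustrative assumptions, not our actual configuration:

# /etc/security/limits.d/91-glance.conf (hypothetical file)
glance soft nofile 10240
glance hard nofile 10240

# verify in a shell running as that user
$ ulimit -n
10240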
[Chart: growing number of volumes/images. Usual traffic is ~50-100MB/s with current usage (~idle).]
RBD for OpenStack Volumes
• Before general availability, we need to test and enable qemu iops/bps throttling; see the sketch below.
  • Otherwise VMs with many IOs can disrupt other users.
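As an illustration of front-end throttling, Cinder QoS specs can pass per-disk limits down to qemu via libvirt; the names and values below are placeholders, not limits we have settled on:

# create a QoS spec enforced by the hypervisor (front-end) and attach it to a volume type
cinder qos-create rbd-throttle consumer=front-end total_iops_sec=500 total_bytes_sec=52428800
cinder qos-associate <qos_id> <volume_type_id>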
• One ongoing issue is that a few clients are getting an (infrequent) segfault of qemu during a VM reboot.
  • Happens on VMs with many attached RBDs.
  • Difficult to get a complete (16GB) core dump.
CASTOR & XRootD/EOS
• Exploring a RADOS backend for these two HEP-developed file systems
  • Gateway model, similar to S3 via RADOSGW
• CASTOR needs raw throughput performance (to feed many tape drives at 250MBps each).
  • Striped reads/writes across many OSDs are important.
• XRootD/EOS may benefit from the highly scalable namespace to store O(billion) objects
  • Bonus: XRootD also offers http/webdav with X509/kerberos, possibly even fuse mountable.
• Developments are in early stages.
Operations & Lessons Learned
Configuration and Deployment
• Dumpling 0.67.7
• Fully Puppet-ized
  • Automated server deployment, automated OSD replacement
• Very few custom ceph.conf options →

mon osd down out interval = 900
osd pool default size = 3
osd pool default min size = 1
osd pool default pg num = 1024
osd pool default pgp num = 1024
osd pool default flag hashpspool = true
osd max backfills = 1
osd recovery max active = 1

• Experimenting with the filestore wbthrottle
  • We find that disabling it completely gives better IOps performance
  • But don’t do this!!!
Cluster Activity
General Comments…
• In these ~7 months of running the cluster, there have been very few problems
  • No outages
  • No data losses/corruptions
  • No unfixable performance issues
  • Behaves well during stress tests
• But now we’re starting to get real/varied/creative users, and this brings up many interesting issues...
• “No amount of stress testing can prepare you for real users” - Unknown
  • (point being, don’t take the next slides to be too negative – I’m just trying to give helpful advice ;)
Latency & Slow Requests
• Best latency we can achieve is 20-40ms
  • Slow SATA disks, no SSDs: hard to justify SSDs in a multi-PB cluster, but one could in a smaller, limited use-case cluster (e.g. Cinder-only)
• Latency can increase dramatically with heavy usage
  • Don’t mix latency-bound and throughput-bound users on the same OSDs
• Local processes scanning the disks can hurt performance
  • Add /var/lib/ceph to the updatedb PRUNEPATHS (see the sketch after this list)
• If you have slow disks like us, you need to understand your disk IO scheduler – e.g. deadline prefers reads over writes: writes are given a 5 second deadline vs. 500ms for reads!
• Scrubbing!
• Kernel tuning: vm.* sysctls, dirty page flushing, memory reclaiming…
  • “Something is flushing the buffers, blocking the OSD processes”
• Slow requests: monitor them, eliminate them.
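A minimal sketch of the host tuning described above; the sysctl values are illustrative guesses, not our production settings:

# keep updatedb off the OSD filesystems (/etc/updatedb.conf)
PRUNEPATHS="/tmp /var/spool /media /var/lib/ceph"

# inspect/set the IO scheduler and its read/write deadlines (sdb is a placeholder)
cat /sys/block/sdb/queue/scheduler
echo deadline > /sys/block/sdb/queue/scheduler
cat /sys/block/sdb/queue/iosched/read_expire   # 500 (ms) by default
cat /sys/block/sdb/queue/iosched/write_expire  # 5000 (ms) by default

# make dirty-page flushing start earlier and in smaller batches
sysctl -w vm.dirty_background_ratio=5
sysctl -w vm.dirty_ratio=10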
Life with 250 million objects
• Recently, a user decided to write 250 million 1kB objects
  • Not so unreasonable: 250M * 4MB = 1PB, so this simulates the cluster being full of RBD images, at least in terms of # objects
• It worked – no big problems from holding this many objects.
• Tested single OSD failure: ~7 hours to backfill, including a double-backfill glitch that we’re trying to understand.
• But now we want to clean up, and it is not trivial to remove 250M objects!
  • rados rmpool (see below) generated quite a load when we rm’d a 3 million object pool (some OSDs were temporarily marked down).
  • Probably due to a mistake in our wbthrottle tuning
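For reference, pool removal is a single command; the pool name (a placeholder here) must be given twice, plus a confirmation flag:

# deletes the pool and every object in it – generates real cluster load
rados rmpool testpool testpool --yes-i-really-really-mean-it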
Other backfilling issues
• During a backfilling event (draining a whole server), we started observing repeated monitor elections
  • Caused by the mons’ LevelDBs being so active that the local SATA disks couldn’t keep up.
  • When a mon falls behind, it calls an election
  • Could be due to LevelDB compaction…
  • We moved /var/lib/ceph/mon to SSDs – no more elections during backfilling
• Avoid double backfilling when taking an OSD out of service (see the sketch below):
  • Start with ceph osd crush rm <osd id> !!
  • If you mark the OSD out first, then crush rm it, you will compute a new CRUSH map twice, i.e. backfill twice.
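A sketch of that removal order, using osd.42 as a placeholder id:

# remove the OSD from CRUSH first – this triggers the one and only backfill
ceph osd crush rm osd.42
# after backfilling completes, remove its auth key and id
ceph auth del osd.42
ceph osd rm 42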
Fun with CRUSH
• CRUSH is simple yet powerful, so it is tempting to play with the cluster layout
  • But once you have non-zero amounts of data, significant CRUSH changes will lead to massive data movements, which create extra disk load and may disrupt users.
  • Early CRUSH planning is crucial!
• A network switch is a failure domain, so we should configure CRUSH to replicate across switches, right? (An example rule is sketched after this list.)
  • But (assuming we don’t have a private cluster network) that would send all replication traffic via the switch uplinks – bottleneck!
  • Unclear tradeoff between uptime and performance.
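For illustration only, a rule replicating across switches could look like the sketch below; it assumes a custom “switch” bucket type has been defined in the CRUSH map, which is not necessarily our layout:

rule replicate_across_switches {
    ruleset 1
    type replicated
    min_size 1
    max_size 10
    step take default
    # place each replica under a different switch bucket
    step chooseleaf firstn 0 type switch
    step emit
}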
CRUSH & Data distribution
• CRUSH may give your cluster an uneven data distribution
  • An OSD’s used space will scale with the number of PGs assigned to it
• After you have designed your cluster, created your pools, and started adding data, check the PG and volume distributions
• reweight-by-utilization is useful to iron out an uneven PG distribution (see the sketch after this list)
• The hashpspool flag is also important if you have many active pools
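A sketch of checking and correcting the distribution; 120 means “only reweight OSDs above 120% of mean utilization” and is the usual starting point, not a tuned value:

# inspect per-OSD PG counts and usage
ceph pg dump osds
# gently lower the weight of over-full OSDs
ceph osd reweight-by-utilization 120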
[Histogram: Number of OSDs having N PGs (for pool = volumes)]
RBD Reliability with 3 Replicas
• RBD devices are chunked across thousands of objects:
  • A full 1TB volume is composed of 250,000 4MB objects
  • If any single object is lost, the whole RBD can be considered corrupted (obviously, it depends which blocks are lost!)
  • If you lose an entire PG, you can consider all RBDs to be lost / corrupted.
• Our incorrect & irrational fears:
  • Any simultaneous triple disk failure in the cluster would lead to objects being lost – and somehow all RBDs would be corrupted.
  • As we add OSDs to the cluster, the data gets spread wider, and the chances of RBD data loss increase.
• But this is wrong!!
  • The only triple disk failures that can lead to data loss are those combinations actively used by PGs – so having e.g. 4096 PGs for RBDs means that only 4096 combinations out of the ~10^9 possible combinations matter.
  • P(loss) ≈ N_PGs × P_diskfailure^3 / 3!
• We use 4 replicas for the RBD volumes, but this is probably overkill.
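To make the formula concrete with purely illustrative numbers (the per-disk failure probability is an assumption, not a measured rate): with N_PGs = 4096 and P_diskfailure = 0.01 within one recovery window, P(loss) ≈ 4096 × (0.01)^3 / 3! = 4096 × 10^-6 / 6 ≈ 7 × 10^-4.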
Trust your clients
• There is no server-side per-client throttling
  • A few nasty clients can overwhelm an OSD, leading to slow requests for everyone.
• When you have high load / slow requests, it is not always trivial to identify and blacklist/firewall the misbehaving client (one blacklisting command is sketched below)
  • Could use some help in the monitoring: per-client perf stats?
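Once a misbehaving client is identified, RADOS does allow blacklisting it by address; the address and expiry below are placeholders:

# blacklist a client address, with an optional expiry in seconds
ceph osd blacklist add 10.0.0.99:0/123456 3600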
• One of our creative users found a way to make the mons generate 5*40 MBps of outbound network traffic
  • Could saturate the mon network, lead to disruptions
• RADOS is not for end-users. A cephx keyring is for trusted persons only, not for Joe Random User.
Fat fingers
• A healthy cluster is always vulnerable to human errors
  • We’ve thus far avoided any big mistakes
• Used PG splitting to grow a pool from 8 to 2048 PGs
  • Leads to unresponsive OSDs which get marked down → degraded objects
  • Safer & now-enforced to grow in 2x or 4x steps (see the sketch below)
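A sketch of the step-wise approach, with a hypothetical pool name and target size:

# double pg_num/pgp_num in steps, letting the cluster settle in between
ceph osd pool set volumes pg_num 16
ceph osd pool set volumes pgp_num 16
ceph osd pool set volumes pg_num 32
ceph osd pool set volumes pgp_num 32
# ...repeat until the target (e.g. 2048) is reached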
• ulimits, ulimits, ulimits
  • With a large number of OSDs (say, more than 500), you will hit num file and num process limits everywhere: Glance, qemu, radosgw, the ceph/rados CLIs, …
• If you use XFS, don’t put your OSD journal as a file on the disk (see the sketch below)
  • Use a separate partition, the first partition!
  • We still need to reinstall our whole cluster to re-partition the OSDs
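A minimal sketch of pointing an OSD at a raw journal partition rather than a file; the OSD id and device path are placeholders:

[osd.42]
    # first partition of the data disk, not a file inside the XFS filesystem
    osd journal = /dev/sdb1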
Scale up and out
• Scale up: we are demonstrating the viability of a 3PB cluster with O(1000) OSDs.
  • What about 10,000 or 100,000 OSDs?
  • What about 10,000 or 100,000 clients?
  • Many Ceph instances is always an option, but not ideal
• Scale out: our growing data centre in Budapest brings many options:
  • Replicate over the WAN (though, 30ms RTT)
  • Tiering / caching pools (new feature, need to get experience…)
  • Data locality – direct IOs to the nearby replica or a caching pool
Summary
• CERN IT infrastructure is undergoing a private cloud revolution, and Ceph is providing the underlying storage.
• Our CASTOR and XRootD physics data use-cases may exploit RADOS for improved performance/scalability.
• In seven months with a 3PB cluster, we’ve not had any disasters. Actually, it’s working quite well.
• I’ve presented some lessons learned; I hope they prove useful in your Ceph explorations.