Page 1:

2012 Storage Developer Conference. © Inktank. All Rights Reserved.

Ceph: scaling storage for the cloud and beyond

Sage Weil

Inktank

Page 2:

outline
● why you should care
● what is it, what it does
● distributed object storage
● ceph fs
● who we are, why we do this

Page 3:

why should you care about another storage system?

Page 4:

requirements
● diverse storage needs
  – object storage
  – block devices (for VMs) with snapshots, cloning
  – shared file system with POSIX, coherent caches
  – structured data... files, block devices, or objects?
● scale
  – terabytes, petabytes, exabytes
  – heterogeneous hardware
  – reliability and fault tolerance

Page 5:

time
● ease of administration
● no manual data migration, load balancing
● painless scaling
  – expansion and contraction
  – seamless migration

Page 6:

cost
● linear function of size or performance
● incremental expansion
  – no fork-lift upgrades
● no vendor lock-in
  – choice of hardware
  – choice of software
● open

Page 7:

what is ceph?

Page 8:

unified storage system
● objects
  – native
  – RESTful
● block
  – thin provisioning, snapshots, cloning
● file
  – strong consistency, snapshots

Page 9:

RADOS
A reliable, autonomous, distributed object store comprised of self-healing, self-managing, intelligent storage nodes

LIBRADOS
A library allowing apps to directly access RADOS, with support for C, C++, Java, Python, Ruby, and PHP

RBD
A reliable and fully-distributed block device, with a Linux kernel client and a QEMU/KVM driver

CEPH FS
A POSIX-compliant distributed file system, with a Linux kernel client and support for FUSE

RADOSGW
A bucket-based REST gateway, compatible with S3 and Swift

APP    APP    HOST/VM    CLIENT

Page 10:

open source
● LGPLv2
  – copyleft
  – ok to link to proprietary code
● no copyright assignment
  – no dual licensing
  – no “enterprise-only” feature set
● active community
● commercial support

Page 11:

distributed storage system
● data center scale
  – 10s to 10,000s of machines
  – terabytes to exabytes
● fault tolerant
  – no single point of failure
  – commodity hardware
● self-managing, self-healing

Page 12:

ceph object model
● pools
  – 1s to 100s
  – independent namespaces or object collections
  – replication level, placement policy
● objects
  – bazillions
  – blob of data (bytes to gigabytes)
  – attributes (e.g., “version=12”; bytes to kilobytes)
  – key/value bundle (bytes to gigabytes)
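
To make the object model concrete, here is a minimal sketch using the librados Python binding (python-rados, introduced with the LIBRADOS slides later in the deck). It assumes a reachable cluster, a readable /etc/ceph/ceph.conf with a client keyring, an existing pool named 'data', and a reasonably recent binding that exposes WriteOpCtx; all names are placeholders.

import rados

cluster = rados.Rados(conffile='/etc/ceph/ceph.conf')
cluster.connect()
ioctx = cluster.open_ioctx('data')               # pool: an independent object namespace

ioctx.write_full('example', b'hello world')      # blob of byte data
ioctx.set_xattr('example', 'version', b'12')     # small attribute, like the slide's "version=12"

# key/value bundle (omap), independent of the byte data and the attrs
with rados.WriteOpCtx() as op:
    ioctx.set_omap(op, ('owner',), (b'alice',))
    ioctx.operate_write_op(op, 'example')

ioctx.close()
cluster.shutdown()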

Page 13:

why start with objects?
● more useful than (disk) blocks
  – names in a single flat namespace
  – variable size
  – simple API with rich semantics
● more scalable than files
  – no hard-to-distribute hierarchy
  – update semantics do not span objects
  – workload is trivially parallel

Page 14:

[diagram: a human, a computer, and its disks]

Page 15:

[diagram: several humans sharing one computer and its disks]

Page 16:

[diagram: many humans and many disks, with “(COMPUTER)” in between — “(actually more like this…)”]

Page 17:

[diagram: many humans and many computers, each computer with its own disks]

Page 18:

[diagram: each OSD runs on a local file system (btrfs, xfs, or ext4) on its own disk; M marks the monitors]

Page 19:

Monitors (M):
• Maintain cluster membership and state
• Provide consensus for distributed decision-making via Paxos
• Small, odd number
• These do not serve stored objects to clients

Object Storage Daemons (OSDs):
• At least three in a cluster
• One per disk or RAID group
• Serve stored objects to clients
• Intelligently peer to perform replication tasks

Page 20:

[diagram: three monitors (M) and a human]

Page 21:

data distribution
● all objects are replicated N times
● objects are automatically placed, balanced, migrated in a dynamic cluster
● must consider physical infrastructure
  – ceph-osds on hosts in racks in rows in data centers
● three approaches
  – pick a spot; remember where you put it
  – pick a spot; write down where you put it
  – calculate where to put it, where to find it

Page 22:

CRUSH
• Pseudo-random placement algorithm
• Fast calculation, no lookup
• Repeatable, deterministic
• Ensures even distribution
• Stable mapping
• Limited data migration
• Rule-based configuration
  • specifiable replication
  • infrastructure topology aware
  • allows weighting

Page 23:

[diagram: objects hash into placement groups (PGs); CRUSH maps each PG onto a set of OSDs]

hash(object name) % num pg

CRUSH(pg, cluster state, policy)
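
A toy illustration of the two-stage mapping in plain Python. This is not the real algorithm: CRUSH uses its own hash (rjenkins) and walks the weighted hierarchy described in the CRUSH map, whereas the helper below just picks pseudo-randomly from a flat OSD list. It only shows why the calculation is fast, repeatable, and lookup-free.

import hashlib

NUM_PGS = 64              # placement groups in the pool (illustrative)
OSDS = list(range(12))    # 12 OSDs in the cluster (illustrative)

def object_to_pg(name):
    # stage 1: hash the object name into a placement group
    h = int(hashlib.md5(name.encode()).hexdigest(), 16)
    return h % NUM_PGS

def pg_to_osds(pg, replicas=3):
    # stage 2 (a stand-in for CRUSH): deterministically pick `replicas`
    # distinct OSDs from the current cluster state -- no lookup table,
    # so any client can repeat the same calculation
    chosen = []
    i = 0
    while len(chosen) < replicas:
        h = int(hashlib.md5(('%d:%d' % (pg, i)).encode()).hexdigest(), 16)
        osd = OSDS[h % len(OSDS)]
        if osd not in chosen:
            chosen.append(osd)
        i += 1
    return chosen

print(pg_to_osds(object_to_pg('myobject')))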

Page 24:

Page 25:

RADOS
● monitors publish osd map that describes cluster state
  – ceph-osd node status (up/down, weight, IP)
  – CRUSH function specifying desired data distribution
● object storage daemons (OSDs)
  – safely replicate and store objects
  – migrate data as the cluster changes over time
  – coordinate based on shared view of reality – gossip!
● decentralized, distributed approach allows
  – massive scales (10,000s of servers or more)
  – the illusion of a single copy with consistent behavior

Page 26:

[diagram: a CLIENT asking where its object lives — ??]

Page 27:

Page 28:

Page 29:

[diagram: a CLIENT asking where its object lives — ??]

Page 30:

[architecture stack recap, turning to LIBRADOS — see Page 9]

Page 31:

[diagram: an APP links LIBRADOS and speaks the native protocol to the RADOS cluster (monitors, M)]

Page 32:

LIBRADOS
• Provides direct access to RADOS for applications
• C, C++, Python, PHP, Java
• No HTTP overhead
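
A minimal end-to-end sketch with python-rados. It assumes a reachable cluster, a readable /etc/ceph/ceph.conf with a client keyring, and an existing pool named 'data'; pool and object names are placeholders.

import rados

cluster = rados.Rados(conffile='/etc/ceph/ceph.conf')   # monitor addresses + auth from ceph.conf
cluster.connect()                                       # direct, native protocol -- no HTTP gateway

ioctx = cluster.open_ioctx('data')                      # I/O context bound to one pool
ioctx.write_full('greeting', b'hello rados')
size, mtime = ioctx.stat('greeting')
print(ioctx.read('greeting'), size, mtime)

ioctx.close()
cluster.shutdown()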

Page 33:

atomic transactions
● client operations sent to the OSD cluster
  – operate on a single object
  – can contain a sequence of operations, e.g.
    ● truncate object
    ● write new object data
    ● set attribute
● atomicity
  – all operations commit or do not commit atomically
● conditional
  – 'guard' operations can control whether the operation is performed
    ● verify xattr has specific value
    ● assert object is a specific version
  – allows atomic compare-and-swap etc.
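
A sketch of a compound operation with python-rados, assuming an ioctx opened as in the LIBRADOS example and a reasonably recent binding where WriteOpCtx exposes truncate/write_full. The guard operations (compare xattr, assert version) are exposed through the C librados operation API and are not shown here.

import rados

# assumes `ioctx` opened as in the LIBRADOS example; the OSD applies the
# whole compound op atomically: all of it commits, or none of it does
with rados.WriteOpCtx() as op:
    op.truncate(0)                               # truncate object
    op.write_full(b'new contents')               # write new object data
    ioctx.set_omap(op, ('version',), (b'13',))   # bundle a key/value update in the same transaction
    ioctx.operate_write_op(op, 'example')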

Page 34:

key/value storage
● store key/value pairs in an object
  – independent from object attrs or byte data payload
● based on google's leveldb
  – efficient random and range insert/query/removal
  – based on BigTable SSTable design
● exposed via key/value API
  – insert, update, remove
  – individual keys or ranges of keys
● avoid read/modify/write cycle for updating complex objects
  – e.g., file system directory objects
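
A sketch of the key/value (omap) API in python-rados, again assuming an open ioctx; the object and keys are placeholders.

import rados

# assumes `ioctx` opened as in the LIBRADOS example

# insert/update a few keys on one object without touching its byte data
with rados.WriteOpCtx() as op:
    ioctx.set_omap(op, ('alice', 'bob'), (b'1001', b'1002'))
    ioctx.operate_write_op(op, 'directory-object')

# range query: start after '' (the beginning), no prefix filter, up to 100 entries
with rados.ReadOpCtx() as op:
    entries, ret = ioctx.get_omap_vals(op, '', '', 100)
    ioctx.operate_read_op(op, 'directory-object')
    for key, value in entries:
        print(key, value)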

Page 35:

watch/notify
● establish stateful 'watch' on an object
  – client interest persistently registered with object
  – client keeps session to OSD open
● send 'notify' messages to all watchers
  – notify message (and payload) is distributed to all watchers
  – variable timeout
  – notification on completion
    ● all watchers got and acknowledged the notify
● use any object as a communication/synchronization channel
  – locking, distributed coordination (a la ZooKeeper), etc.
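
A sketch of watch/notify with python-rados, assuming an open ioctx and a recent binding that exposes Ioctx.watch() and Ioctx.notify(); treat the exact callback signature and keyword arguments as assumptions (the callback is written with *args for that reason). Any object can serve as the channel.

import time
import rados

# assumes `ioctx` opened as in the LIBRADOS example

def on_notify(*args):
    # called once per notify delivered to this watcher; the arguments
    # include the notify id, the notifier, and the message payload
    print('received notify:', args)

watch = ioctx.watch('channel-object', on_notify)      # register persistent interest

# elsewhere (possibly another client): fan out to all watchers and block
# until every watcher acknowledges or the timeout expires
ioctx.notify('channel-object', 'cache invalidate', timeout_ms=5000)

time.sleep(1)
watch.close()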

Page 36:

[sequence diagram: three clients each register a watch on an object at an OSD (watch / ack+commit); one client sends a notify, the OSD fans it out to all watchers, collects their acks, and reports completion to the notifier]

Page 37:

watch/notify example
● radosgw cache consistency
  – radosgw instances watch a single object (.rgw/notify)
  – locally cache bucket metadata
  – on bucket metadata changes (removal, ACL changes)
    ● write change to relevant bucket object
    ● send notify with bucket name to other radosgw instances
  – on receipt of notify
    ● invalidate relevant portion of cache

Page 38:

rados classes
● dynamically loaded .so
  – /var/lib/rados-classes/*
  – implement new object “methods” using existing methods
  – part of I/O pipeline
  – simple internal API
● reads
  – can call existing native or class methods
  – do whatever processing is appropriate
  – return data
● writes
  – can call existing native or class methods
  – do whatever processing is appropriate
  – generates a resulting transaction to be applied atomically
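
From the client's side, a class method is invoked much like a remote call on an object; the class code itself is loaded by the OSD. A heavily hedged sketch: it assumes a python-rados version that exposes Ioctx.execute() (older bindings only offer the C-level rados_exec), and it uses the 'hello'/'say_hello' demonstration class that ships with Ceph, so treat both the signature and the class's availability as assumptions.

# assumes `ioctx` opened as in the LIBRADOS example
# run method 'say_hello' of object class 'hello' against object 'example';
# the class code executes on the OSD that stores the object
ret, out = ioctx.execute('example', 'hello', 'say_hello', b'')
print(ret, out)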

Page 39:

class examples
● grep
  – read an object, filter out individual records, and return those
● sha1
  – read object, generate fingerprint, return that
● images
  – rotate, resize, crop image stored in object
  – remove red-eye
● crypto
  – encrypt/decrypt object data with provided key

Page 40:

[architecture stack recap — see Page 9]

Page 41:

[architecture stack recap, turning to RBD — see Page 9]

Page 42:

[diagram: many computers, each with its own directly attached disks]

Page 43:

[diagram: the same computers and disks, now hosting virtual machines (VMs)]

Page 44:

[diagram: a VM runs inside a virtualization container; the container links LIBRBD over LIBRADOS to reach the RADOS cluster (monitors, M)]

Page 45:

[diagram: two virtualization containers, each linking LIBRBD and LIBRADOS to the same cluster (monitors, M); the VM can move between them because its disk lives in RADOS]

Page 46:

[diagram: a HOST uses the KRBD kernel module to talk to the RADOS cluster (monitors, M)]

Page 47:

RADOS Block Device:
• Storage of virtual disks in RADOS
• Decouples VMs and containers
  • Live migration!
• Images are striped across the cluster
• Snapshots!
• Support in
  • Qemu/KVM
  • OpenStack, CloudStack
  • Mainline Linux kernel
• Image cloning
  • Copy-on-write “snapshot” of existing image
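
A sketch with the python-rbd binding alongside python-rados; it assumes an existing pool named 'rbd', and the image name and size are placeholders.

import rados
import rbd

cluster = rados.Rados(conffile='/etc/ceph/ceph.conf')
cluster.connect()
ioctx = cluster.open_ioctx('rbd')

rbd.RBD().create(ioctx, 'vm-disk', 10 * 1024**3)    # thin-provisioned 10 GiB image

with rbd.Image(ioctx, 'vm-disk') as image:
    image.write(b'bootloader bytes', 0)             # striped across RADOS objects
    image.create_snap('before-upgrade')             # cheap point-in-time snapshot

ioctx.close()
cluster.shutdown()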

Page 48:

[architecture stack recap, turning to CEPH FS — see Page 9]

Page 49:

[diagram: a CLIENT with separate paths into the cluster for data and for metadata]

Page 50:

Page 51:

Metadata Server
• Manages metadata for a POSIX-compliant shared filesystem
  • Directory hierarchy
  • File metadata (owner, timestamps, mode, etc.)
• Stores metadata in RADOS
• Does not serve file data to clients
• Only required for shared filesystem

Page 52:

legacy metadata storage
● a scaling disaster
  – name → inode → block list → data
  – no inode table locality
  – fragmentation
    ● inode table
    ● directory
  – many seeks
  – difficult to partition

[diagram: a conventional directory tree (usr, etc, var, home, vmlinuz, passwd, mtab, hosts, lib, include, bin, …) spread across an inode table]

Page 53:

ceph fs metadata storage
● block lists unnecessary
● inode table mostly useless
  – APIs are path-based, not inode-based
  – no random table access, sloppy caching
● embed inodes inside directories
  – good locality, prefetching
  – leverage key/value object

[diagram: the same directory tree, with inodes (1, 100, 102) embedded directly in their parent directories]

Page 54:

one tree

three metadata servers

??

Page 55:

Page 56:

Page 57:

Page 58:

Page 59:

DYNAMIC SUBTREE PARTITIONING

Page 60:

dynamic subtree partitioning
● scalable
  – arbitrarily partition metadata
● adaptive
  – move work from busy to idle servers
  – replicate hot metadata
● efficient
  – hierarchical partition preserves locality
● dynamic
  – daemons can join/leave
  – take over for failed nodes

Page 61:

controlling metadata io
● view ceph-mds as cache
  – reduce reads
    ● dir+inode prefetching
  – reduce writes
    ● consolidate multiple writes
● large journal or log
  – stripe over objects
  – two tiers
    ● journal for short term
    ● per-directory for long term
  – fast failure recovery

[diagram: metadata writes stream into a journal, then settle into per-directory objects]

Page 62:

what is journaled
● lots of state
  – journaling is expensive up-front, cheap to recover
  – non-journaled state is cheap, but complex (and somewhat expensive) to recover
● yes
  – client sessions
  – actual fs metadata modifications
● no
  – cache provenance
  – open files
● lazy flush
  – client modifications may not be durable until fsync() or visible by another client

Page 63:

client protocol
● highly stateful
  – consistent, fine-grained caching
● seamless hand-off between ceph-mds daemons

Page 64:

an example (RT = round trip)
● mount -t ceph 1.2.3.4:/ /mnt
  – 3 ceph-mon RT
  – 2 ceph-mds RT (1 ceph-mds to ceph-osd RT)
● cd /mnt/foo/bar
  – 2 ceph-mds RT (2 ceph-mds to ceph-osd RT)
● ls -al
  – open
  – readdir
    ● 1 ceph-mds RT (1 ceph-mds to ceph-osd RT)
  – stat each file
  – close
● cp * /tmp
  – N ceph-osd RT

[diagram legend: ceph-mon, ceph-mds, ceph-osd]

Page 65:

recursive accounting
● ceph-mds tracks recursive directory stats
  – file sizes
  – file and directory counts
  – modification time
● virtual xattrs present full stats
● efficient

$ ls -alSh | head
total 0
drwxr-xr-x 1 root       root      9.7T 2011-02-04 15:51 .
drwxr-xr-x 1 root       root      9.7T 2010-12-16 15:06 ..
drwxr-xr-x 1 pomceph    pg4194980 9.6T 2011-02-24 08:25 pomceph
drwxr-xr-x 1 mcg_test1  pg2419992  23G 2011-02-02 08:57 mcg_test1
drwx--x--- 1 luko       adm        19G 2011-01-21 12:17 luko
drwx--x--- 1 eest       adm        14G 2011-02-04 16:29 eest
drwxr-xr-x 1 mcg_test2  pg2419992 3.0G 2011-02-02 09:34 mcg_test2
drwx--x--- 1 fuzyceph   adm       1.5G 2011-01-18 10:46 fuzyceph
drwxr-xr-x 1 dallasceph pg275     596M 2011-01-14 10:06 dallasceph
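
The recursive stats surface as virtual extended attributes on directories (ceph.dir.rbytes, ceph.dir.rentries, and friends), so any client can read them without walking the tree. A small sketch using standard Python on a mounted Ceph file system; the path is a placeholder.

import os

somedir = '/mnt/ceph/somedir'    # any directory inside a CephFS mount

# recursive totals maintained by ceph-mds -- no du-style tree walk needed
rbytes = int(os.getxattr(somedir, 'ceph.dir.rbytes'))
rentries = int(os.getxattr(somedir, 'ceph.dir.rentries'))
print(rbytes, 'bytes across', rentries, 'entries')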

Page 66:

snapshots
● volume or subvolume snapshots unusable at petabyte scale
  – snapshot arbitrary subdirectories
● simple interface
  – hidden '.snap' directory
  – no special tools

$ mkdir foo/.snap/one            # create snapshot
$ ls foo/.snap
one
$ ls foo/bar/.snap
_one_1099511627776               # parent's snap name is mangled
$ rm foo/myfile
$ ls -F foo
bar/
$ ls -F foo/.snap/one
myfile  bar/
$ rmdir foo/.snap/one            # remove snapshot

Page 67:

multiple client implementations
● Linux kernel client
  – mount -t ceph 1.2.3.4:/ /mnt
  – export (NFS), Samba (CIFS)
● ceph-fuse
● libcephfs.so
  – your app
  – Samba (CIFS)
  – Ganesha (NFS)
  – Hadoop (map/reduce)

[diagram: the kernel client, ceph-fuse, and libcephfs linked into your app, Samba (SMB/CIFS), Ganesha (NFS), and Hadoop, all on the same cluster]
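
For the "your app" path, a sketch with the python-cephfs binding, a thin wrapper over libcephfs.so. It assumes the binding is installed and a default ceph.conf is readable; method signatures vary a little between releases, so treat this as an outline rather than the exact API.

import cephfs

fs = cephfs.LibCephFS(conffile='/etc/ceph/ceph.conf')
fs.mount()                      # attach to the file system without a kernel mount

fs.mkdir('/reports', 0o755)     # ordinary POSIX-style metadata operations
print(fs.stat('/reports'))

fs.unmount()
fs.shutdown()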

Page 68:

[architecture stack recap — see Page 9 — with status labels: the RADOS-based components (RADOS, LIBRADOS, RBD, RADOSGW) marked AWESOME, CEPH FS marked NEARLY AWESOME]

Page 69:

why we do this
● limited options for scalable open source storage
● proprietary solutions
  – expensive
  – don't scale (well or out)
  – marry hardware and software
● industry needs to change

Page 70:

who we are
● Ceph created at UC Santa Cruz (2007)
● supported by DreamHost (2008-2011)
● Inktank (2012)
  – Los Angeles, Sunnyvale, San Francisco, remote
● growing user and developer community
  – Linux distros, users, cloud stacks, SIs, OEMs

Page 71:

thanks

sage weil

[email protected]

@liewegas http://github.com/ceph

http://ceph.com/