a unified distributed storage system sage weil ceph day – november 2, 2012

Ceph Day Nov 2012 - Sage Weil

DESCRIPTION

A Ceph overview from the creator of Ceph, Sage Weil at the first Ceph Day in Amsterdam. Nov 2012.


Page 1: Ceph Day Nov 2012 - Sage Weil

a unified distributed storage system

sage weil
ceph day – november 2, 2012

Page 2: Ceph Day Nov 2012 - Sage Weil

outline

● why you should care
● what is it, what it's for
● how it works
    ● architecture
● how you can use it
    ● librados
    ● radosgw
    ● RBD, the ceph block device
    ● distributed file system
● roadmap
● why we do this, who we are

Page 3: Ceph Day Nov 2012 - Sage Weil

why should you care about another storage system?

Page 4: Ceph Day Nov 2012 - Sage Weil

requirements

● diverse storage needs
    ● object storage
    ● block devices (for VMs) with snapshots, cloning
    ● shared file system with POSIX, coherent caches
    ● structured data... files, block devices, or objects?
● scale
    ● terabytes, petabytes, exabytes
    ● heterogeneous hardware
    ● reliability and fault tolerance

Page 5: Ceph Day Nov 2012 - Sage Weil

time

● ease of administration
● no manual data migration, load balancing
● painless scaling
    ● expansion and contraction
    ● seamless migration

Page 6: Ceph Day Nov 2012 - Sage Weil

cost

● linear function of size or performance
● incremental expansion
    ● no fork-lift upgrades
● no vendor lock-in
● choice of hardware
● choice of software
    ● open

Page 7: Ceph Day Nov 2012 - Sage Weil

what is ceph?

Page 8: Ceph Day Nov 2012 - Sage Weil

unified storage system

● objects
    ● native
    ● RESTful
● block
    ● thin provisioning, snapshots, cloning
● file
    ● strong consistency, snapshots

Page 9: Ceph Day Nov 2012 - Sage Weil

RADOS

A reliable, autonomous, distributed object store comprised of self-healing, self-managing, intelligent storage nodes

LIBRADOS

A library allowing apps to directly access RADOS, with support for C, C++, Java, Python, Ruby, and PHP

RBD

A reliable and fully-distributed block device, with a Linux kernel client and a QEMU/KVM driver

CEPH FS

A POSIX-compliant distributed file system, with a Linux kernel client and support for FUSE

RADOSGW

A bucket-based REST gateway, compatible with S3 and Swift

[diagram labels across the top of the stack: APP, APP, HOST/VM, CLIENT]

Page 10: Ceph Day Nov 2012 - Sage Weil

distributed storage system

● data center scale
    ● 10s to 10,000s of machines
    ● terabytes to exabytes
● fault tolerant
    ● no single point of failure
    ● commodity hardware
● self-managing, self-healing

Page 11: Ceph Day Nov 2012 - Sage Weil

ceph object model

● pools
    ● 1s to 100s
    ● independent namespaces or object collections
    ● replication level, placement policy
● objects
    ● bazillions
    ● blob of data (bytes to gigabytes)
    ● attributes (e.g., “version=12”; bytes to kilobytes)
    ● key/value bundle (bytes to gigabytes)
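The object model above can be pictured with a small in-memory sketch. This is only an illustration of the shape of the data, not the librados API: the class names `Pool` and `RadosObject`, the field names `xattrs` and `omap`, and the pool and object names are all assumptions made up for this example.

```python
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class RadosObject:
    """One object: a byte blob, small attributes, and a key/value bundle."""
    data: bytes = b""
    xattrs: Dict[str, bytes] = field(default_factory=dict)  # e.g. version=12
    omap: Dict[str, bytes] = field(default_factory=dict)    # key/value bundle


@dataclass
class Pool:
    """An independent namespace with its own replication level."""
    name: str
    replicas: int
    objects: Dict[str, RadosObject] = field(default_factory=dict)

    def write(self, oid: str, data: bytes) -> RadosObject:
        # Create the object on first write, then replace its data blob.
        obj = self.objects.setdefault(oid, RadosObject())
        obj.data = data
        return obj


# Two pools with different replication levels (hypothetical names).
pools = {
    "images": Pool("images", replicas=3),
    "scratch": Pool("scratch", replicas=2),
}

obj = pools["images"].write("vm-disk-0001", b"\x00" * 4096)
obj.xattrs["version"] = b"12"
obj.omap["owner"] = b"sage"
```

The same object name can exist in both pools without conflict, which is what "independent namespaces" amounts to in this sketch.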

Page 12: Ceph Day Nov 2012 - Sage Weil

why start with objects?

● more useful than (disk) blocks
    ● names in a single flat namespace
    ● variable size
    ● simple API with rich semantics
● more scalable than files
    ● no hard-to-distribute hierarchy
    ● update semantics do not span objects
    ● workload is trivially parallel

Page 13: Ceph Day Nov 2012 - Sage Weil

[diagram: one HUMAN, one COMPUTER, seven DISKs]

Page 14: Ceph Day Nov 2012 - Sage Weil

[diagram: three HUMANs, one COMPUTER, seven DISKs]

Page 15: Ceph Day Nov 2012 - Sage Weil

[diagram: twelve DISKs, one (COMPUTER), and a crowd of HUMANs]

(actually more like this…)

Page 16: Ceph Day Nov 2012 - Sage Weil

[diagram: three HUMANs, twelve COMPUTERs, twelve DISKs]

Page 17: Ceph Day Nov 2012 - Sage Weil

[diagram: five OSDs, each on top of an FS (btrfs, xfs, or ext4) on a DISK; three monitors (M)]

Page 18: Ceph Day Nov 2012 - Sage Weil

Monitors (M):
• Maintain cluster membership and state
• Provide consensus for distributed decision-making
• Small, odd number
• These do not serve stored objects to clients

Object Storage Daemons (OSDs):
• At least three in a cluster
• One per disk or RAID group
• Serve stored objects to clients
• Intelligently peer to perform replication tasks
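One way to see why a "small, odd number" of monitors is recommended is to work through the arithmetic of majority-based consensus. The sketch below assumes a simple strict-majority quorum rule; the function names are made up for this example.

```python
def majority(n_monitors: int) -> int:
    """Smallest number of monitors that forms a strict majority."""
    return n_monitors // 2 + 1


def tolerated_failures(n_monitors: int) -> int:
    """How many monitors can fail while a majority is still reachable."""
    return n_monitors - majority(n_monitors)


for n in (1, 2, 3, 4, 5):
    print(f"{n} monitors: quorum of {majority(n)}, "
          f"survives {tolerated_failures(n)} failure(s)")
```

Under this rule, going from three monitors to four adds a machine without adding any failure tolerance, which is the usual argument for odd counts.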

Page 19: Ceph Day Nov 2012 - Sage Weil

[diagram: three monitors (M) and a HUMAN]

Page 20: Ceph Day Nov 2012 - Sage Weil

data distribution

● all objects are replicated N times
● objects are automatically placed, balanced, migrated in a dynamic cluster
● must consider physical infrastructure
    ● ceph-osds on hosts in racks in rows in data centers
● three approaches
    ● pick a spot; remember where you put it
    ● pick a spot; write down where you put it
    ● calculate where to put it, where to find it

Page 21: Ceph Day Nov 2012 - Sage Weil

CRUSH
• Pseudo-random placement algorithm
• Fast calculation, no lookup
• Repeatable, deterministic
• Ensures even distribution
• Stable mapping
• Limited data migration
• Rule-based configuration
    • specifiable replication
    • infrastructure topology aware
    • allows weighting
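The properties listed above (calculated rather than looked up, deterministic, weighted, limited migration) can be demonstrated with a much simpler technique than CRUSH itself. The sketch below uses weighted rendezvous hashing as a stand-in; it is not the CRUSH algorithm and ignores topology rules entirely. The names `unit_hash` and `place`, the device names, and the weights are assumptions for illustration.

```python
import hashlib
import math


def unit_hash(key):
    """Map a string to a float strictly inside (0, 1)."""
    digest = hashlib.sha256(key.encode()).digest()
    value = int.from_bytes(digest[:6], "big")  # 48 bits
    return (value + 1) / (2**48 + 1)


def place(item, weights, replicas):
    """Pick `replicas` distinct devices by weighted rendezvous hashing."""
    def score(device):
        # Higher weight or a luckier hash gives a higher score.
        return -weights[device] / math.log(unit_hash(f"{item}:{device}"))
    return sorted(weights, key=score, reverse=True)[:replicas]


weights = {f"osd.{i}": 1.0 for i in range(8)}
before = {pg: place(f"pg-{pg}", weights, 3) for pg in range(256)}

# Grow the cluster by one device and recompute every placement.
grown = dict(weights)
grown["osd.8"] = 1.0
after = {pg: place(f"pg-{pg}", grown, 3) for pg in range(256)}

moved = sum(1 for pg in before if before[pg] != after[pg])
print(f"{moved} of {len(before)} placements changed after adding osd.8")
```

Because each device's score for an item does not depend on the other devices, a placement changes only when the new device wins a slot; every other placement stays exactly as it was.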

Page 22: Ceph Day Nov 2012 - Sage Weil

[diagram: objects are grouped into placement groups (pg), which are then mapped onto the cluster]

hash(object name) % num pg

CRUSH(pg, cluster state, policy)
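The two formulas on this slide describe a two-stage lookup that any client can compute on its own. A minimal sketch follows, with loud assumptions: `zlib.crc32` stands in for whatever hash Ceph really uses, `pg_to_osds` is a toy ranking function standing in for CRUSH, and the cluster state is reduced to an up/down flag per OSD.

```python
import zlib


def object_to_pg(name, num_pg):
    """Stage 1: hash(object name) % num pg."""
    return zlib.crc32(name.encode()) % num_pg


def pg_to_osds(pg, osd_up, replicas):
    """Stage 2 stand-in for CRUSH(pg, cluster state, policy).

    Ranks the OSDs that are up by a per-PG hash and takes the first few.
    """
    candidates = [osd for osd, up in osd_up.items() if up]
    ranked = sorted(candidates,
                    key=lambda osd: zlib.crc32(f"{pg}:{osd}".encode()))
    return ranked[:replicas]


def locate(name, num_pg, osd_up, replicas=3):
    """Compute where an object lives without consulting any lookup table."""
    pg = object_to_pg(name, num_pg)
    return pg, pg_to_osds(pg, osd_up, replicas)


# Cluster state: OSD 2 is currently down.
osd_up = {0: True, 1: True, 2: False, 3: True, 4: True}
pg, osds = locate("vm-disk-0001", num_pg=64, osd_up=osd_up)
print(f"object maps to pg {pg}, stored on OSDs {osds}")
```

Any client holding the same inputs arrives at the same answer, so no central directory has to be queried on the data path.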

Page 23: Ceph Day Nov 2012 - Sage Weil

[diagram repeated from page 22]

Page 24: Ceph Day Nov 2012 - Sage Weil

RADOS

● monitors publish osd map that describes cluster state
    ● ceph-osd node status (up/down, weight, IP)
    ● CRUSH function specifying desired data distribution
● object storage daemons (OSDs)
    ● safely replicate and store object
    ● migrate data as the cluster changes over time
    ● coordinate based on shared view of reality
● decentralized, distributed approach allows
    ● massive scales (10,000s of servers or more)
    ● the illusion of a single copy with consistent behavior

Page 25: Ceph Day Nov 2012 - Sage Weil

CLIENT

??

Page 26: Ceph Day Nov 2012 - Sage Weil
Page 27: Ceph Day Nov 2012 - Sage Weil
Page 28: Ceph Day Nov 2012 - Sage Weil

CLIENT

??

Page 29: Ceph Day Nov 2012 - Sage Weil

[architecture stack diagram repeated from page 9; the slides that follow cover LIBRADOS]

Page 30: Ceph Day Nov 2012 - Sage Weil

[diagram: an APP linked with LIBRADOS talks to the cluster (three monitors, M) over the native protocol]

Page 31: Ceph Day Nov 2012 - Sage Weil

LIBRADOS

• Provides direct access to RADOS for applications
• C, C++, Python, PHP, Java
• No HTTP overhead

Page 32: Ceph Day Nov 2012 - Sage Weil

[architecture stack diagram repeated from page 9; the slides that follow cover RADOSGW]

Page 33: Ceph Day Nov 2012 - Sage Weil

[diagram: APPs speak REST to RADOSGW, which uses LIBRADOS to talk to the cluster (three monitors, M) over the native protocol]

Page 34: Ceph Day Nov 2012 - Sage Weil

RADOS Gateway:
• REST-based interface to RADOS
• Supports buckets, accounting
• Compatible with S3 and Swift applications

Page 35: Ceph Day Nov 2012 - Sage Weil

[architecture stack diagram repeated from page 9; the slides that follow cover RBD]

Page 36: Ceph Day Nov 2012 - Sage Weil

[diagram: many COMPUTERs and DISKs]

Page 37: Ceph Day Nov 2012 - Sage Weil

[diagram: three VMs, twelve COMPUTERs, twelve DISKs]

Page 38: Ceph Day Nov 2012 - Sage Weil

[diagram: a VM in a VIRTUALIZATION CONTAINER uses LIBRBD on top of LIBRADOS to reach the cluster (three monitors, M)]

Page 39: Ceph Day Nov 2012 - Sage Weil

[diagram: two CONTAINERs, each with LIBRBD on top of LIBRADOS, one VM, and the cluster (three monitors, M)]

Page 40: Ceph Day Nov 2012 - Sage Weil

[diagram: a HOST uses KRBD (KERNEL MODULE) to reach the cluster (three monitors, M)]

Page 41: Ceph Day Nov 2012 - Sage Weil

RADOS Block Device:
• Storage of virtual disks in RADOS
• Decouples VMs and containers
    • Live migration!
• Images are striped across the cluster
• Snapshots!
• Support in
    • Qemu/KVM
    • OpenStack, CloudStack
    • Mainline Linux kernel
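"Images are striped across the cluster" means a virtual disk is cut into many objects, so a byte range of the image turns into reads or writes on one or more objects. The sketch below shows only that arithmetic. The 4 MiB object size and the function name `extent_to_objects` are assumptions for this example, and more elaborate striping layouts are not modelled.

```python
OBJECT_SIZE = 4 * 1024 * 1024  # assumed object size in bytes


def extent_to_objects(offset, length, object_size=OBJECT_SIZE):
    """Split an image byte range into (object_index, offset_in_object, length)."""
    pieces = []
    end = offset + length
    while offset < end:
        index, within = divmod(offset, object_size)
        # Take as much as fits in the current object.
        chunk = min(object_size - within, end - offset)
        pieces.append((index, within, chunk))
        offset += chunk
    return pieces


# A 1 KiB write that straddles the boundary between object 0 and object 1.
pieces = extent_to_objects(offset=OBJECT_SIZE - 512, length=1024)
print(pieces)
```

Each resulting object is placed independently, so a large sequential transfer naturally spreads over many storage nodes.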

Page 42: Ceph Day Nov 2012 - Sage Weil

HOW DO YOU SPIN UP THOUSANDS OF VMs INSTANTLY AND EFFICIENTLY?

Page 43: Ceph Day Nov 2012 - Sage Weil

144 0 0 0 0 = 144

instant copy

Page 44: Ceph Day Nov 2012 - Sage Weil

[diagram: the CLIENT issues four writes to its copy]

144 + 4 = 148

Page 45: Ceph Day Nov 2012 - Sage Weil

[diagram: the CLIENT issues three reads]

144 + 4 = 148
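The arithmetic on the last three slides (144, then 144 + 4 = 148) is what copy-on-write cloning looks like when counting stored objects. Here is a minimal sketch, assuming whole-object writes and a single parent layer; the `Image` class and its methods are hypothetical and are not the librbd API.

```python
class Image:
    """A block image stored as numbered objects, optionally layered on a parent."""

    def __init__(self, parent=None):
        self.parent = parent
        self.objects = {}  # object index -> bytes

    def write(self, index, data):
        # Writes land only in this image; the parent is never modified.
        self.objects[index] = data

    def read(self, index):
        # Serve from this image if present, otherwise fall through to the parent.
        if index in self.objects:
            return self.objects[index]
        if self.parent is not None:
            return self.parent.read(index)
        return b""  # never written anywhere: treated as empty


def stored_objects(*images):
    """Count the objects that actually occupy space."""
    return sum(len(image.objects) for image in images)


golden = Image()
for i in range(144):
    golden.write(i, b"base-%d" % i)

# Four clones cost nothing until they are written to.
clones = [Image(parent=golden) for _ in range(4)]
total_after_clone = stored_objects(golden, *clones)

# One client overwrites four objects in its clone.
for i in range(4):
    clones[0].write(i, b"changed-%d" % i)
total_after_writes = stored_objects(golden, *clones)

print(total_after_clone, total_after_writes)
```

Reads of untouched regions fall through to the shared parent, which is why spinning up many clones can be both instant and space-efficient in this model.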

Page 46: Ceph Day Nov 2012 - Sage Weil

[architecture stack diagram repeated from page 9; the slides that follow cover CEPH FS]

Page 47: Ceph Day Nov 2012 - Sage Weil

[diagram: a CLIENT sends data and metadata along separate paths to the cluster (three monitors, M)]

Page 48: Ceph Day Nov 2012 - Sage Weil

M M M

Page 49: Ceph Day Nov 2012 - Sage Weil

Metadata Server
• Manages metadata for a POSIX-compliant shared filesystem
    • Directory hierarchy
    • File metadata (owner, timestamps, mode, etc.)
• Stores metadata in RADOS
• Does not serve file data to clients
• Only required for shared filesystem

Page 50: Ceph Day Nov 2012 - Sage Weil

one tree

three metadata servers

??

Page 51: Ceph Day Nov 2012 - Sage Weil
Page 52: Ceph Day Nov 2012 - Sage Weil
Page 53: Ceph Day Nov 2012 - Sage Weil
Page 54: Ceph Day Nov 2012 - Sage Weil
Page 55: Ceph Day Nov 2012 - Sage Weil

DYNAMIC SUBTREE PARTITIONING

Page 56: Ceph Day Nov 2012 - Sage Weil

recursive accounting

● ceph-mds tracks recursive directory stats
    ● file sizes
    ● file and directory counts
    ● modification time
● virtual xattrs present full stats
● efficient

$ ls -alSh | head
total 0
drwxr-xr-x 1 root            root      9.7T 2011-02-04 15:51 .
drwxr-xr-x 1 root            root      9.7T 2010-12-16 15:06 ..
drwxr-xr-x 1 pomceph         pg4194980 9.6T 2011-02-24 08:25 pomceph
drwxr-xr-x 1 mcg_test1       pg2419992  23G 2011-02-02 08:57 mcg_test1
drwx--x--- 1 luko            adm        19G 2011-01-21 12:17 luko
drwx--x--- 1 eest            adm        14G 2011-02-04 16:29 eest
drwxr-xr-x 1 mcg_test2       pg2419992 3.0G 2011-02-02 09:34 mcg_test2
drwx--x--- 1 fuzyceph        adm       1.5G 2011-01-18 10:46 fuzyceph
drwxr-xr-x 1 dallasceph      pg275     596M 2011-01-14 10:06 dallasceph
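The listing on this slide shows directories reporting the total size of everything beneath them. One way to picture recursive accounting is a tree in which every change is added to each ancestor. The sketch below does this eagerly on every file creation, which is a simplification and not a description of how ceph-mds propagates stats; the `Dir` class and the field names `rbytes` and `rfiles` are assumptions for this example.

```python
class Dir:
    """A directory that tracks recursive totals for its whole subtree."""

    def __init__(self, name, parent=None):
        self.name = name
        self.parent = parent
        self.rbytes = 0  # total bytes of all files beneath this directory
        self.rfiles = 0  # total number of files beneath this directory

    def add_file(self, size):
        # Walk up to the root, updating every ancestor's totals.
        node = self
        while node is not None:
            node.rbytes += size
            node.rfiles += 1
            node = node.parent


root = Dir("/")
home = Dir("home", parent=root)
alice = Dir("alice", parent=home)

alice.add_file(100)
home.add_file(50)

print(root.rbytes, home.rbytes, alice.rbytes)
```

With totals kept on each directory, answering "how big is this subtree?" is a single lookup instead of a full tree walk.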

Page 57: Ceph Day Nov 2012 - Sage Weil

snapshots

● volume or subvolume snapshots unusable at petabyte scale
    ● snapshot arbitrary subdirectories
● simple interface
    ● hidden '.snap' directory
    ● no special tools

$ mkdir foo/.snap/one # create snapshot
$ ls foo/.snap
one
$ ls foo/bar/.snap
_one_1099511627776 # parent's snap name is mangled
$ rm foo/myfile
$ ls -F foo
bar/
$ ls -F foo/.snap/one
myfile bar/
$ rmdir foo/.snap/one # remove snapshot

Page 58: Ceph Day Nov 2012 - Sage Weil

multiple protocols, implementations

● Linux kernel client
    ● mount -t ceph 1.2.3.4:/ /mnt
    ● export (NFS), Samba (CIFS)
● ceph-fuse
● libcephfs.so
    ● your app
    ● Samba (CIFS)
    ● Ganesha (NFS)
    ● Hadoop (map/reduce)

[diagram: the kernel client alongside libcephfs-based stacks: ceph-fuse, your app, Samba (SMB/CIFS), Ganesha (NFS), Hadoop]

Page 59: Ceph Day Nov 2012 - Sage Weil

[architecture stack diagram repeated from page 9, with status labels: one component is marked NEARLY AWESOME, the other four AWESOME]

Page 60: Ceph Day Nov 2012 - Sage Weil

current status

● argonaut stable release v0.48
    ● rados, RBD, radosgw
● bobtail stable release v0.55
    ● RBD cloning
    ● improved performance, scaling, failure behavior
    ● radosgw API, performance improvements
    ● freeze in ~1 week, release in ~4 weeks

Page 61: Ceph Day Nov 2012 - Sage Weil

roadmap

● file system
    ● pivot in engineering focus
    ● CIFS (Samba), NFS (Ganesha), Hadoop
● RBD
    ● Xen integration, iSCSI
● radosgw
    ● Keystone integration
● RADOS
    ● geo-replication
    ● PG split

Page 62: Ceph Day Nov 2012 - Sage Weil

why we do this

● limited options for scalable open source storage
● proprietary solutions
    ● expensive
    ● don't scale (well or out)
    ● marry hardware and software
● users hungry for alternatives
    ● scalability
    ● cost
    ● features

Page 63: Ceph Day Nov 2012 - Sage Weil

two fields

● green: cloud, big data
    ● incumbents don't have a viable solution
    ● most players can't afford to build their own
    ● strong demand for open source solutions
● brown: traditional SAN, NAS; enterprise
    ● incumbents struggle to scale out
    ● can't compete on price with open solutions

Page 64: Ceph Day Nov 2012 - Sage Weil

licensing

● <yawn>
● promote adoption
● enable community development
● prevent ceph from becoming proprietary
● allow organic commercialization

Page 65: Ceph Day Nov 2012 - Sage Weil

ceph license

● LGPLv2
● “copyleft”
    – free distribution
    – allow derivative works
    – changes you distribute/sell must be shared
● ok to link to proprietary code
    – allow proprietary products to include and build on ceph
    – does not allow proprietary derivatives of ceph

Page 66: Ceph Day Nov 2012 - Sage Weil

fragmented copyright

● we do not require copyright assignment from contributors
    ● no single person or entity owns all of ceph
    ● no single entity can make ceph proprietary
● strong community
    ● many players make ceph a safe technology bet
    ● project can outlive any single business

Page 67: Ceph Day Nov 2012 - Sage Weil

why it's important

● ceph is an ingredient
    ● we need to play nice in a larger ecosystem
    ● community will be key to ceph's success
● truly open source solutions are disruptive
    ● open is a competitive advantage
        – frictionless integration with projects, platforms, tools
        – freedom to innovate on protocols
        – leverage community testing, development resources
        – open collaboration is an efficient way to build technology

Page 68: Ceph Day Nov 2012 - Sage Weil

who we are

● Ceph created at UC Santa Cruz (2004-2007)
● supported by DreamHost (2008-2011)
● Inktank (2012)
    ● Los Angeles, Sunnyvale, San Francisco, remote
● growing user and developer community
    ● Linux distros, users, cloud stacks, SIs, OEMs

http://ceph.com/

Page 69: Ceph Day Nov 2012 - Sage Weil

thanks

sage weil

[email protected]

@liewegas

http://github.com/ceph

http://ceph.com/

Page 71: Ceph Day Nov 2012 - Sage Weil

why we like btrfs

● pervasive checksumming
● snapshots, copy-on-write
● efficient metadata (xattrs)
● inline data for small files
● transparent compression
● integrated volume management
    ● software RAID, mirroring, error recovery
    ● SSD-aware
● online fsck
● active development community