BACD LA 2013 - Scaling Storage with Ceph

"Scaling Storage with Ceph", Ross Turk, VP of Community, Inktank

Ceph is an open source distributed object store, network block device, and file system designed for reliability, performance, and scalability. It runs on commodity hardware, has no single point of failure, and is supported by the Linux kernel. This talk will describe the Ceph architecture, share its design principles, and discuss how it can be part of a cost-effective, reliable cloud stack.


SCALING STORAGE WITH CEPH

Ross Turk, Inktank

RADOS A reliable, autonomous, distributed object store comprised of self-healing, self-managing, intelligent storage nodes

LIBRADOS A library allowing apps to directly access RADOS, with support for C, C++, Java, Python, Ruby, and PHP

RBD A reliable and fully-distributed block device, with a Linux kernel client and a QEMU/KVM driver

CEPH FS A POSIX-compliant distributed file system, with a Linux kernel client and support for FUSE

RADOSGW A bucket-based REST gateway, compatible with S3 and Swift

APP APP HOST/VM CLIENT

IN THE BEGINNING Magic Madzik, Flickr / CC BY 2.0

EARLY INFORMATION STORAGE Chico.Ferreira, Flickr / CC BY 2.0

WRITING > CAVE PAINTINGS kevingessner, Flickr / CC BY-SA 2.0

[Diagram: x1000 == x1]

PEOPLE BEGIN WRITING A LOT Moyan_Brenn, Flickr / CC BY-ND 2.0

WRITING IS TIME-CONSUMING trekkyandy, Flickr / CC BY 2.0

THE INDUSTRIALIZATION OF WRITING FateDenied, Flickr / CC BY 2.0

[Diagram: x1000 == x1]

[Diagram: tape + magnet = magnetic tape]

STORAGE BECOMES MECHANICAL Erik Pitti, Wikipedia / CC BY-ND 2.0

[Diagram: HUMAN, COMPUTER, TAPE shown next to HUMAN, ROCK and HUMAN, INK, PAPER]

COMPUTERS NEED PEOPLE TO WORK USDAgov, Flickr / CC BY 2.0

[Diagram: HUMAN, COMPUTER, TAPE; the tape holds binary data]

THROUGHPUT BECOMES IMPORTANT Zane Luke, Flickr / CC BY-ND 2.0

LAZ0R B3AMS CHANGE EVERYTHING!! Jeff Kubina, Flickr / CC-BY-SA 2.0

HARD DRIVES ARE TOTALLY BETTER

[Chart labels: amazing, spinny hard drives, sucky, stupid tape, slow]

EVERYTHING GETS MESSY Rob!, Flickr / CC BY 2.0

[Diagram: binary data scattered across a tree of entries named aa, ab, ac, ba, bb, bc, da, db, dc]

[Diagram: a file, made of metadata (owner: rturk, created: aug12, last viewed: aug17, size: 42025, perms: 644) and binary data]

[Diagram: the same tree, repeated]

WE OUTGROW THE HARD DRIVE Mr. T in DC, Flickr / CC BY 2.0

[Diagram: one HUMAN, one COMPUTER, many DISKs]

PEOPLE NEED SIMULTANEOUS ACCESS wFourier, Flickr / CC BY 2.0

[Diagram: several HUMANs, one COMPUTER, many DISKs]

[Diagram: many HUMANs, one (COMPUTER), many DISKs (actually more like this…)]

[Diagram: a few HUMANs, many DISK COMPUTER pairs]

[Diagram: the tree of entries again, marked X]

[Diagram: an object, made of metadata (pace: quick, driver: frog, license: expired, expression: agog) and binary data]

[Diagram: an APP in front of many DISK COMPUTER pairs]

[Diagram: many DISK COMPUTER pairs, plus COMPUTERs running VMs]

STORAGE THROUGHOUT HISTORY
Time-scale: Roughly logarithmic. Content: Whatever the opposite of “scientific” is.

[Timeline: Painting, Writing, Computers, Shared storage, Distributed storage, Cloud computing, Ceph]

[Diagram: a few HUMANs in front of a growing field of DISK COMPUTER (D C) pairs]

STORAGE APPLIANCES Michael Moll, Wikipedia / CC BY-SA 2.0

6.4 MILLION SQFT OF FACTORIES Dude94111, Flickr / CC BY 2.0

STORAGE VENDORS HAVE BIG BILLS CarbonNYC, Flickr / CC BY 2.0

STORAGE APPLIANCES ARE EXPENSIVE 401K 2012, Flickr / CC BY-SA 2.0

TECHNOLOGY IS A COMMODITY RaeAllen, Flickr / CC-BY 2.0

COMMODITY PRICES FLUCTUATE

[Chart: time axis from May-07 to May-12]

GROWING WITH HARDWARE APPLIANCES

§  First PB
   §  Proprietary storage hardware
   §  Well-known storage vendor
   §  $14 b’zillion

§  Second PB
   §  Proprietary storage hardware
   §  Same storage vendor
   §  Another $14 b’zillion


APPLIANCES  ARE  OLD  TECHNOLOGY Paul Keller, Flickr / CC BY 2.0

Source: http://www.cpubenchmark.net/high_end_cpus.html

FLAGSHIP HARDWARE APPLIANCE

Hardware Appliances are Mysterious Black Boxes Abode of Chaos, Flickr / CC BY 2.0

[Diagram: a HUMAN [DEVELOPER] with C++ code in front of a grid of DC nodes; the C++ is marked X (!!)]

THE WORLD NEEDS A STORAGE TECHNOLOGY THAT SCALES INFINITELY

THE WORLD NEEDS A STORAGE TECHNOLOGY THAT DOESN’T REQUIRE AN INDUSTRIAL MANUFACTURING PROCESS

SAGE  WEIL

§  Co-founder of DreamHost

§  Inventor of Ceph

§  CEO of Inktank

OPEN SOURCE

philosophy design

OPEN  SOURCE  SPREADS   IDEAS orchidgalore, Flickr / CC BY 2.0

OPEN SOURCE

COMMUNITY-FOCUSED

philosophy design

WE  ARE  SMARTER  TOGETHER rturk, Linkedin Inmap

CEPH  BELONGS  TO  ALL  OF  US wackybadger, Flickr / CC BY 2.0

OPEN SOURCE

COMMUNITY-FOCUSED

SCALABLE

philosophy design

CEPH   IS  BUILT  TO  SCALE

Too much for a book

Too much for a drive

Too much for a computer

Too much for a room

Ceph

Too much for a cave

OPEN SOURCE

COMMUNITY-FOCUSED

SCALABLE

NO SINGLE POINT OF FAILURE

philosophy design

ARIOLIMAX CALIFORNICUS aroid, Flickr / CC BY 2.0

THE  OCTOPUS   (A  METAPHOR) I love speaking in metaphors.

single point of failure

highly-available replicated

THE  BEEHIVE   (ANOTHER  METAPHOR) blumenbiene, Flickr / CC BY 2.0

OPEN SOURCE

COMMUNITY-FOCUSED

SCALABLE

NO SINGLE POINT OF FAILURE

SOFTWARE BASED

philosophy design

[Diagram: a grid of DC nodes with C++ code; the C++ is marked ✔]

OPEN SOURCE

COMMUNITY-FOCUSED

SCALABLE

NO SINGLE POINT OF FAILURE

SOFTWARE BASED

SELF-MANAGING

philosophy design

DISKS = JUST TINY RECORD PLAYERS jon_a_ross, Flickr / CC BY 2.0

[Diagram: D x 1 MILLION = D, 55 times / day]

IT ALL STARTED WITH A DREAM

NEW MONTHLY CODE COMMITS

[Chart: y-axis 0 to 700; x-axis 2004-06 to 2011-07]

CEPH  STARTS  POPPING  UP!

(sorry about all the logo tampering)

RADOS A reliable, autonomous, distributed object store comprised of self-healing, self-managing, intelligent storage nodes

LIBRADOS A library allowing apps to directly access RADOS, with support for C, C++, Java, Python, Ruby, and PHP

RBD A reliable and fully-distributed block device, with a Linux kernel client and a QEMU/KVM driver

CEPH FS A POSIX-compliant distributed file system, with a Linux kernel client and support for FUSE

RADOSGW A bucket-based REST gateway, compatible with S3 and Swift

APP APP HOST/VM CLIENT

[Diagram: each OSD sits on a file system (btrfs, xfs, or ext4) on top of a DISK; monitors (M) run alongside]

Monitors:
§  Maintain cluster map
§  Provide consensus for distributed decision-making
§  Must have an odd number
§  These do not serve stored objects to clients

OSDs:
§  One per disk (recommended)
§  At least three in a cluster
§  Serve stored objects to clients
§  Intelligently peer to perform replication tasks
§  Supports object classes
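The monitor bullets mention consensus and an odd number of monitors. The arithmetic behind that advice can be sketched as follows, assuming a plain strict-majority quorum. The function names are illustrative and are not part of Ceph.

```python
def quorum_size(num_monitors: int) -> int:
    """Smallest group of monitors that forms a strict majority."""
    if num_monitors < 1:
        raise ValueError("need at least one monitor")
    return num_monitors // 2 + 1


def tolerated_failures(num_monitors: int) -> int:
    """How many monitors can be lost while a majority still remains."""
    return num_monitors - quorum_size(num_monitors)


if __name__ == "__main__":
    # Columns: monitors, quorum size, failures tolerated.
    for n in range(1, 8):
        print(n, quorum_size(n), tolerated_failures(n))
```

Under this assumption, going from three monitors to four adds a machine without adding any failure tolerance, which is one reason odd counts are the usual choice.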

[Diagram: an APP linked with LIBRADOS talks natively to the cluster and its monitors (M)]

LIBRADOS
§  Provides direct access to RADOS for applications
§  C, C++, Python, PHP, Java
§  No HTTP overhead
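As a rough illustration of what "direct access" looks like from Python, here is a minimal sketch built on the `rados` binding (`Rados`, `open_ioctx`, `write_full`, `read`). The config path, pool name, and object name are assumptions chosen for the example, and `run_against_cluster` only works where the binding is installed and a cluster is reachable.

```python
def store_and_fetch(ioctx, name: str, data: bytes) -> bytes:
    """Write one object and read it back through an I/O context."""
    ioctx.write_full(name, data)                # replace the whole object
    return ioctx.read(name, length=len(data))   # read back the same number of bytes


def run_against_cluster(conffile: str = "/etc/ceph/ceph.conf",
                        pool: str = "data") -> bytes:
    """Connect to a real cluster (assumed config path and pool name)."""
    import rados  # imported lazily so store_and_fetch works without the binding

    cluster = rados.Rados(conffile=conffile)
    cluster.connect()
    try:
        ioctx = cluster.open_ioctx(pool)
        try:
            return store_and_fetch(ioctx, "greeting", b"hello ceph")
        finally:
            ioctx.close()
    finally:
        cluster.shutdown()
```

Nothing in this path goes through HTTP: the library talks to the monitors and OSDs itself.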

[Diagram: APPs speak REST to RADOSGW, which uses LIBRADOS to talk natively to the cluster and its monitors (M)]

RADOS Gateway:
§  REST-based interface to RADOS
§  Supports buckets, accounting
§  Compatible with S3 and Swift applications

[Diagram: a VM in a VIRTUALIZATION CONTAINER uses LIBRBD and LIBRADOS to reach the cluster and its monitors (M)]

[Diagram: two CONTAINERs, each with LIBRBD and LIBRADOS, and one VM]

[Diagram: a HOST uses KRBD (KERNEL MODULE) to reach the cluster and its monitors (M)]

RADOS Block Device:
§  Storage of virtual disks in RADOS
§  Allows decoupling of VMs and containers
§  Live migration!
§  Images are striped across the cluster
§  Boot support in QEMU, KVM, and OpenStack Nova
§  Mount support in the Linux kernel
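To make "images are striped across the cluster" concrete, here is a simplified sketch of how a byte offset in a block image could map to fixed-size objects. It assumes 4 MiB objects and plain sequential chunking; the function names are hypothetical and this is not the librbd implementation.

```python
from typing import List, Tuple

OBJECT_SIZE = 4 * 1024 * 1024  # assumed object size: 4 MiB


def object_for_offset(offset: int, object_size: int = OBJECT_SIZE) -> Tuple[int, int]:
    """Return (object index, offset inside that object) for an image byte offset."""
    if offset < 0:
        raise ValueError("offset must be non-negative")
    return divmod(offset, object_size)


def objects_for_range(offset: int, length: int,
                      object_size: int = OBJECT_SIZE) -> List[int]:
    """Indices of every object touched by an I/O of `length` bytes at `offset`."""
    if length <= 0:
        return []
    first = offset // object_size
    last = (offset + length - 1) // object_size
    return list(range(first, last + 1))
```

Because each chunk is its own object, a large sequential read touches many objects, and those objects can live on many different disks.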

[Diagram: a CLIENT with separate data and metadata paths into the cluster; monitors (M)]

Metadata Server
§  Manages metadata for a POSIX-compliant shared filesystem
   §  Directory hierarchy
   §  File metadata (owner, timestamps, mode, etc.)
§  Stores metadata in RADOS
§  Does not serve file data to clients
§  Only required for shared filesystem

WHAT MAKES CEPH UNIQUE?

HOW DO YOU FIND YOUR KEYS? azmeen, Flickr / CC BY 2.0

[Diagram: an APP facing twelve D C nodes: ??]

[Diagram: an APP and twelve D C nodes split into name ranges A-G, H-N, O-T, U-Z, with a lookup for F*]

I ALWAYS PUT MY KEYS ON THE HOOK vitamindave, Flickr / CC BY 2.0

[Diagram: an APP and twelve D C nodes]

DEAR DIARY: KEYS = IN THE KITCHEN Barnaby, Flickr / CC BY 2.0

HOW DO YOU FIND YOUR KEYS WHEN YOUR HOUSE IS INFINITELY BIG AND ALWAYS CHANGING?

THE  ANSWER:  CRUSH!! pasukaru76, Flickr / CC SA 2.0

[Diagram: binary objects are grouped into placement groups, which are then mapped onto the cluster]

hash(object name) % num pg

CRUSH(pg, cluster state, rule set)

CRUSH
§  Pseudo-random placement algorithm
§  Ensures even distribution
§  Repeatable, deterministic
§  Rule-based configuration
   §  Replica count
   §  Infrastructure topology
   §  Weighting
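The two steps on the slide can be sketched in a few lines. Step 1 is the formula as written. Step 2 below is NOT the real CRUSH algorithm: it swaps in rendezvous (highest-random-weight) hashing as a stand-in that shares the "pseudo-random, repeatable, deterministic" properties, and it ignores topology and weighting. OSD names and function names are made up for the example.

```python
import hashlib
from typing import List


def stable_hash(text: str) -> int:
    """Deterministic 64-bit hash (Python's built-in hash() is salted per process)."""
    return int.from_bytes(hashlib.sha256(text.encode("utf-8")).digest()[:8], "big")


def placement_group(object_name: str, num_pg: int) -> int:
    """Step 1 from the slide: hash(object name) % num pg."""
    return stable_hash(object_name) % num_pg


def place(pg: int, osds: List[str], replicas: int) -> List[str]:
    """Step 2 stand-in: rank OSDs by a per-(pg, osd) hash and keep the top ones.

    This is rendezvous hashing, used here only to illustrate deterministic,
    pseudo-random placement. It is not CRUSH.
    """
    if replicas > len(osds):
        raise ValueError("not enough OSDs for the requested replica count")
    ranked = sorted(osds, key=lambda osd: stable_hash(f"{pg}:{osd}"), reverse=True)
    return ranked[:replicas]


def locate(object_name: str, num_pg: int, osds: List[str], replicas: int) -> List[str]:
    """Any client can compute an object's location without asking a lookup table."""
    return place(placement_group(object_name, num_pg), osds, replicas)
```

The point the sketch shares with the slide is that every client holding the same cluster description computes the same answer, so no central directory of object locations is needed.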

[Diagram: CLIENT ??]

[Diagram: a VM in a VIRTUALIZATION CONTAINER using LIBRBD and LIBRADOS; monitors (M)]

HOW DO YOU SPIN UP THOUSANDS OF VMs INSTANTLY AND EFFICIENTLY?

[Diagram: instant copy: 144, 0, 0, 0, 0 = 144]

[Diagram: a CLIENT writes to a copy: 4 on top of 144 = 148]

[Diagram: a CLIENT reads from a copy: 4 on top of 144 = 148]
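The numbers on these slides read like a copy-on-write scheme, so here is a small sketch of that idea. It assumes the figures are object counts (144 in the original image, 0 in each fresh copy, 4 after some writes); the `Image` class and its methods are hypothetical and only illustrate the layering.

```python
from typing import Dict, Optional


class Image:
    """A block image stored as numbered objects, optionally layered on a parent."""

    def __init__(self, objects: Optional[Dict[int, bytes]] = None,
                 parent: Optional["Image"] = None) -> None:
        self.objects: Dict[int, bytes] = dict(objects or {})
        self.parent = parent

    def clone(self) -> "Image":
        """Instant copy: the clone starts with zero objects of its own."""
        return Image(parent=self)

    def write(self, index: int, data: bytes) -> None:
        """Writes land in this image only; the parent is never modified."""
        self.objects[index] = data

    def read(self, index: int) -> bytes:
        """Use this image's object if present, otherwise fall through to the parent."""
        if index in self.objects:
            return self.objects[index]
        if self.parent is not None:
            return self.parent.read(index)
        raise KeyError(index)


def stored_objects(*images: Image) -> int:
    """Total number of objects actually stored across the given images."""
    return sum(len(image.objects) for image in images)
```

With a 144-object base and four clones, the total stays at 144 until a clone is written to; four writes to one clone bring it to 148, matching the figures on the slides.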

HOW DO YOU MANAGE DIRECTORY HIERARCHY WITHOUT A SINGLE POINT OF FAILURE?

FILESYSTEMS REQUIRE METADATA Barnaby, Flickr / CC BY 2.0

[Diagram: a CLIENT with separate data and metadata paths; monitors (M)]

[Diagram: one tree, three metadata servers: ??]

DYNAMIC SUBTREE PARTITIONING
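The slide only names the technique, so the following is a sketch of the general idea rather than of Ceph's metadata server: one directory tree is split into subtrees, each subtree is owned by one metadata server, and ownership can be reassigned over time. The class, method, and server names are hypothetical.

```python
from typing import Dict


class SubtreeMap:
    """Maps directory subtrees to the metadata server that owns them."""

    def __init__(self, root_owner: str) -> None:
        self.owners: Dict[str, str] = {"/": root_owner}

    def delegate(self, subtree: str, owner: str) -> None:
        """Hand a subtree (and everything below it) to a server."""
        if not subtree.startswith("/"):
            raise ValueError("subtree must be an absolute path")
        self.owners[subtree.rstrip("/") or "/"] = owner

    def owner_of(self, path: str) -> str:
        """Find the server for a path: the deepest delegated ancestor wins."""
        if not path.startswith("/"):
            raise ValueError("path must be absolute")
        current = path.rstrip("/") or "/"
        while current not in self.owners:
            # Walk one level up; the root "/" is always present, so this ends.
            current = current.rsplit("/", 1)[0] or "/"
        return self.owners[current]
```

For example, after `m = SubtreeMap("mds.a")` and `m.delegate("/home", "mds.b")`, lookups under `/home` go to `mds.b` while everything else stays with `mds.a`. Calling `delegate` again for a busy directory moves just that part of the tree, which is the "dynamic" part of the name.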

AND NOW BACKPEDALING

ALMOST EVERYTHING WORKS

§  RADOS: AWESOME
§  LIBRADOS: AWESOME
§  RBD: AWESOME
§  RADOSGW: AWESOME
§  CEPH FS: NEARLY AWESOME

LAN SCALE!! *

* OR REALLY REALLY SCARY FAST WAN

CEPH AND CLOUDSTACK tableatny, Flickr / CC BY 2.0

RBD SUPPORT IN CLOUDSTACK

§  Just announced two weeks ago!
§  Allows storage of virtual disks inside RADOS
§  Works with KVM only right now
§  No volume snapshots yet
§  Requires the latest version of, um, everything
§  More information can be found on the mailing list:
   §  ceph-devel / incubator-cloudstack-dev: http://article.gmane.org/gmane.comp.file-systems.ceph.devel/7505

QUESTIONS?

Ross Turk, VP Community, Inktank

§  ross@inktank.com
§  @rossturk

inktank.com | ceph.com
