SCALING STORAGE WITH CEPH Ross Turk, Inktank

BACD LA 2013 - Scaling Storage with Ceph


DESCRIPTION

"Scaling Storage with Ceph", Ross Turk, VP of Community, Inktank Ceph is an open source distributed object store, network block device, and file system designed for reliability, performance, and scalability. It runs on commodity hardware, has no single point of failure, and is supported by the Linux kernel. This talk will describe the Ceph architecture, share its design principles, and discuss how it can be part of a cost-effective, reliable cloud stack.


Page 1: BACD LA 2013 - Scaling Storage with Ceph

SCALING STORAGE WITH CEPH

Ross Turk, Inktank

Page 2: BACD LA 2013 - Scaling Storage with Ceph
Page 3: BACD LA 2013 - Scaling Storage with Ceph
Page 4: BACD LA 2013 - Scaling Storage with Ceph

RADOS A reliable, autonomous, distributed object store comprised of self-healing, self-managing, intelligent storage nodes

LIBRADOS A library allowing apps to directly access RADOS, with support for C, C++, Java, Python, Ruby, and PHP

RBD A reliable and fully-distributed block device, with a Linux kernel client and a QEMU/KVM driver

CEPH FS A POSIX-compliant distributed file system, with a Linux kernel client and support for FUSE

RADOSGW A bucket-based REST gateway, compatible with S3 and Swift

[Diagram: APP, APP, HOST/VM, and CLIENT sit atop LIBRADOS, RADOSGW, RBD, and CEPH FS respectively]

Page 5: BACD LA 2013 - Scaling Storage with Ceph

IN THE BEGINNING (Magic Madzik, Flickr / CC BY 2.0)

Page 6: BACD LA 2013 - Scaling Storage with Ceph

EARLY INFORMATION STORAGE (Chico.Ferreira, Flickr / CC BY 2.0)

Page 7: BACD LA 2013 - Scaling Storage with Ceph

WRITING > CAVE PAINTINGS (kevingessner, Flickr / CC BY-SA 2.0)

Page 8: BACD LA 2013 - Scaling Storage with Ceph

[Diagram: x1000 == x1]

Page 9: BACD LA 2013 - Scaling Storage with Ceph

PEOPLE BEGIN WRITING A LOT (Moyan_Brenn, Flickr / CC BY-ND 2.0)

Page 10: BACD LA 2013 - Scaling Storage with Ceph

WRITING IS TIME-CONSUMING (trekkyandy, Flickr / CC BY 2.0)

Page 11: BACD LA 2013 - Scaling Storage with Ceph

THE INDUSTRIALIZATION OF WRITING (FateDenied, Flickr / CC BY 2.0)

Page 12: BACD LA 2013 - Scaling Storage with Ceph

[Diagram: x1000 == x1; + magnet = magnetic tape]

Page 13: BACD LA 2013 - Scaling Storage with Ceph

STORAGE BECOMES MECHANICAL (Erik Pitti, Wikipedia / CC BY-ND 2.0)

Page 14: BACD LA 2013 - Scaling Storage with Ceph

[Diagram: three storage pipelines: human → rock; human → ink → paper; human → computer → tape]

Page 15: BACD LA 2013 - Scaling Storage with Ceph

COMPUTERS NEED PEOPLE TO WORK (USDAgov, Flickr / CC BY 2.0)

Page 16: BACD LA 2013 - Scaling Storage with Ceph

[Diagram: human → computer → tape]

Page 17: BACD LA 2013 - Scaling Storage with Ceph

[Diagram: runs of binary digits, shown equal (==)]

Page 18: BACD LA 2013 - Scaling Storage with Ceph

THROUGHPUT BECOMES IMPORTANT (Zane Luke, Flickr / CC BY-ND 2.0)

Page 19: BACD LA 2013 - Scaling Storage with Ceph

LAZ0R B3AMS CHANGE EVERYTHING!! (Jeff Kubina, Flickr / CC BY-SA 2.0)

Page 20: BACD LA 2013 - Scaling Storage with Ceph

HARD DRIVES ARE TOTALLY BETTER

[Diagram labels: "amazing spinny hard drives" vs. "sucky stupid slow tape"]

Page 21: BACD LA 2013 - Scaling Storage with Ceph

EVERYTHING GETS MESSY (Rob!, Flickr / CC BY 2.0)

Page 22: BACD LA 2013 - Scaling Storage with Ceph

[Diagram: a jumble of block addresses and sector labels strewn across a disk]

Page 23: BACD LA 2013 - Scaling Storage with Ceph

[Diagram: a file = metadata (owner: rturk, created: aug12, last viewed: aug17, size: 42025, perms: 644) + a run of bits]

Page 24: BACD LA 2013 - Scaling Storage with Ceph

[Diagram: the jumble of block addresses again]

Page 25: BACD LA 2013 - Scaling Storage with Ceph

WE OUTGROW THE HARD DRIVE (Mr. T in DC, Flickr / CC BY 2.0)

Page 26: BACD LA 2013 - Scaling Storage with Ceph

[Diagram: one human, one computer, many disks]

Page 27: BACD LA 2013 - Scaling Storage with Ceph

PEOPLE NEED SIMULTANEOUS ACCESS (wFourier, Flickr / CC BY 2.0)

Page 28: BACD LA 2013 - Scaling Storage with Ceph

[Diagram: three humans sharing one computer with many disks]

Page 29: BACD LA 2013 - Scaling Storage with Ceph

[Diagram: a crowd of humans all hitting one computer with a dozen disks "(actually more like this…)"]

Page 30: BACD LA 2013 - Scaling Storage with Ceph

[Diagram: three humans, many computers, each computer with its own disk]

Page 31: BACD LA 2013 - Scaling Storage with Ceph

[Diagram: the block jumble with one location crossed out (X)]

Page 32: BACD LA 2013 - Scaling Storage with Ceph

[Diagram: an object = free-form metadata (pace: quick, driver: frog, license: expired, expression: agog) + a run of bits]

Page 33: BACD LA 2013 - Scaling Storage with Ceph

[Diagram: one APP talking to a dozen computer+disk nodes]

Page 34: BACD LA 2013 - Scaling Storage with Ceph

[Diagram: the computer+disk nodes, with new nodes joining]

Page 35: BACD LA 2013 - Scaling Storage with Ceph

[Diagram: the computer+disk nodes hosting VMs]

Page 36: BACD LA 2013 - Scaling Storage with Ceph

STORAGE THROUGHOUT HISTORY (Time-scale: roughly logarithmic. Content: whatever the opposite of "scientific" is.)

[Timeline: Painting → Writing → Computers → Shared storage → Distributed storage → Cloud computing → Ceph]

Page 37: BACD LA 2013 - Scaling Storage with Ceph

[Diagram: three humans and a dozen computer+disk nodes]

Page 38: BACD LA 2013 - Scaling Storage with Ceph

[Diagram: the dozen computer+disk nodes]

Page 39: BACD LA 2013 - Scaling Storage with Ceph

[Diagram: the same nodes, abbreviated to D|C pairs]

Page 40: BACD LA 2013 - Scaling Storage with Ceph

[Diagram: three humans and the D|C grid]

Page 41: BACD LA 2013 - Scaling Storage with Ceph

STORAGE APPLIANCES (Michael Moll, Wikipedia / CC BY-SA 2.0)

Page 42: BACD LA 2013 - Scaling Storage with Ceph

6.4 MILLION SQFT OF FACTORIES (Dude94111, Flickr / CC BY 2.0)

Page 43: BACD LA 2013 - Scaling Storage with Ceph

STORAGE VENDORS HAVE BIG BILLS (CarbonNYC, Flickr / CC BY 2.0)

Page 44: BACD LA 2013 - Scaling Storage with Ceph

STORAGE APPLIANCES ARE EXPENSIVE (401K 2012, Flickr / CC BY-SA 2.0)

Page 45: BACD LA 2013 - Scaling Storage with Ceph

TECHNOLOGY IS A COMMODITY (RaeAllen, Flickr / CC BY 2.0)

Page 46: BACD LA 2013 - Scaling Storage with Ceph

COMMODITY PRICES FLUCTUATE

[Chart: commodity prices, May 2007 through May 2012]

Page 47: BACD LA 2013 - Scaling Storage with Ceph

GROWING WITH HARDWARE APPLIANCES

§ First PB: proprietary storage hardware, well-known storage vendor, $14 b'zillion
§ Second PB: proprietary storage hardware, same storage vendor, another $14 b'zillion

[Diagram: two racks of D|C nodes]

Page 48: BACD LA 2013 - Scaling Storage with Ceph

APPLIANCES ARE OLD TECHNOLOGY (Paul Keller, Flickr / CC BY 2.0)

Page 49: BACD LA 2013 - Scaling Storage with Ceph

[Chart: high-end CPU benchmark scores. Source: http://www.cpubenchmark.net/high_end_cpus.html]

Page 50: BACD LA 2013 - Scaling Storage with Ceph

FLAGSHIP HARDWARE APPLIANCE

Page 51: BACD LA 2013 - Scaling Storage with Ceph

HARDWARE APPLIANCES ARE MYSTERIOUS BLACK BOXES (Abode of Chaos, Flickr / CC BY 2.0)

Page 52: BACD LA 2013 - Scaling Storage with Ceph

[Diagram: the D|C grid; one node opened up to show its C++ internals]

Page 53: BACD LA 2013 - Scaling Storage with Ceph

[Diagram: the same grid; changing the appliance's C++: X]

Page 54: BACD LA 2013 - Scaling Storage with Ceph

[Diagram: a developer (HUMAN [DEVELOPER] !!) facing the D|C grid]

Page 55: BACD LA 2013 - Scaling Storage with Ceph

THE WORLD NEEDS A STORAGE TECHNOLOGY THAT SCALES INFINITELY

Page 56: BACD LA 2013 - Scaling Storage with Ceph

THE WORLD NEEDS A STORAGE TECHNOLOGY THAT DOESN'T REQUIRE AN INDUSTRIAL MANUFACTURING PROCESS

Page 57: BACD LA 2013 - Scaling Storage with Ceph

SAGE WEIL

§ Co-founder of DreamHost
§ Inventor of Ceph
§ CEO of Inktank

Page 58: BACD LA 2013 - Scaling Storage with Ceph

OPEN SOURCE

[columns: philosophy | design]

Page 59: BACD LA 2013 - Scaling Storage with Ceph

OPEN SOURCE SPREADS IDEAS (orchidgalore, Flickr / CC BY 2.0)

Page 60: BACD LA 2013 - Scaling Storage with Ceph

OPEN SOURCE

COMMUNITY-FOCUSED

[columns: philosophy | design]

Page 61: BACD LA 2013 - Scaling Storage with Ceph

WE ARE SMARTER TOGETHER (rturk, LinkedIn InMap)

Page 62: BACD LA 2013 - Scaling Storage with Ceph

CEPH BELONGS TO ALL OF US (wackybadger, Flickr / CC BY 2.0)

Page 63: BACD LA 2013 - Scaling Storage with Ceph

OPEN SOURCE

COMMUNITY-FOCUSED

SCALABLE

[columns: philosophy | design]

Page 64: BACD LA 2013 - Scaling Storage with Ceph

CEPH IS BUILT TO SCALE

[Scale ladder: too much for a cave → too much for a book → too much for a drive → too much for a computer → too much for a room → Ceph]

Page 65: BACD LA 2013 - Scaling Storage with Ceph

OPEN SOURCE

COMMUNITY-FOCUSED

SCALABLE

NO SINGLE POINT OF FAILURE

[columns: philosophy | design]

Page 66: BACD LA 2013 - Scaling Storage with Ceph

ARILOMAX CALIFORNICUS (aroid, Flickr / CC BY 2.0)

Page 67: BACD LA 2013 - Scaling Storage with Ceph

THE OCTOPUS (A METAPHOR) "I love speaking in metaphors."

[Diagram labels: single point of failure vs. highly-available, replicated]

Page 68: BACD LA 2013 - Scaling Storage with Ceph

THE BEEHIVE (ANOTHER METAPHOR) (blumenbiene, Flickr / CC BY 2.0)

Page 69: BACD LA 2013 - Scaling Storage with Ceph

OPEN SOURCE

COMMUNITY-FOCUSED

SCALABLE

NO SINGLE POINT OF FAILURE

SOFTWARE BASED

[columns: philosophy | design]

Page 70: BACD LA 2013 - Scaling Storage with Ceph

[Diagram: the D|C grid; one node opened up to show its C++ internals]

Page 71: BACD LA 2013 - Scaling Storage with Ceph

[Diagram: the same grid; changing the software's C++: ✔]

Page 72: BACD LA 2013 - Scaling Storage with Ceph

OPEN SOURCE

COMMUNITY-FOCUSED

SCALABLE

NO SINGLE POINT OF FAILURE

SOFTWARE BASED

SELF-MANAGING

[columns: philosophy | design]

Page 73: BACD LA 2013 - Scaling Storage with Ceph

DISKS = JUST TINY RECORD PLAYERS (jon_a_ross, Flickr / CC BY 2.0)

Page 74: BACD LA 2013 - Scaling Storage with Ceph

[Diagram: one disk fails rarely, but x 1 MILLION disks fail 55 times / day]
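The slide's number works out if you assume an annualized failure rate of about 2 percent; that rate is my assumption, the deck doesn't show its inputs.

```python
# With a ~2% annualized failure rate (assumed), a million drives
# lose about 55 drives every single day:
drives = 1_000_000
afr = 0.02                 # annualized failure rate, assumed
per_day = drives * afr / 365
print(round(per_day))      # prints 55
```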

Page 75: BACD LA 2013 - Scaling Storage with Ceph
Page 76: BACD LA 2013 - Scaling Storage with Ceph

IT ALL STARTED WITH A DREAM

Page 77: BACD LA 2013 - Scaling Storage with Ceph


Page 78: BACD LA 2013 - Scaling Storage with Ceph

NEW MONTHLY CODE COMMITS

[Chart: monthly commits, on a 0-700 scale, from 2004-06 through 2011-07]

Page 79: BACD LA 2013 - Scaling Storage with Ceph

CEPH STARTS POPPING UP!

(sorry about all the logo tampering)

Page 80: BACD LA 2013 - Scaling Storage with Ceph

[The Ceph stack diagram from earlier, repeated]

Page 81: BACD LA 2013 - Scaling Storage with Ceph

[The stack diagram again, with RADOS highlighted]

Page 82: BACD LA 2013 - Scaling Storage with Ceph

[Diagram: each OSD runs on a filesystem (btrfs, xfs, or ext4) on a disk; three monitors (M) stand alongside]

Page 83: BACD LA 2013 - Scaling Storage with Ceph

[Diagram: a human administers the three monitors]

Page 84: BACD LA 2013 - Scaling Storage with Ceph

Monitors:
§ Maintain the cluster map
§ Provide consensus for distributed decision-making
§ Must have an odd number
§ These do not serve stored objects to clients

OSDs:
§ One per disk (recommended)
§ At least three in a cluster
§ Serve stored objects to clients
§ Intelligently peer to perform replication tasks
§ Support object classes
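As a concrete sketch, a minimal ceph.conf for such a cluster might look like this; the hostnames, addresses, and fsid are all invented for illustration:

```ini
; Hypothetical minimal ceph.conf: three monitors (an odd number,
; as recommended) and the usual one-OSD-per-disk layout.
[global]
    fsid = 5b1f2cf4-1111-4f1e-a635-aaaaaaaaaaaa   ; made-up cluster id
    mon initial members = mon-a, mon-b, mon-c
    mon host = 10.0.0.1, 10.0.0.2, 10.0.0.3

[osd]
    osd journal size = 1024      ; MB; one OSD daemon per data disk
```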

Page 85: BACD LA 2013 - Scaling Storage with Ceph

[The stack diagram again, with LIBRADOS highlighted]

Page 86: BACD LA 2013 - Scaling Storage with Ceph

[Diagram: an APP links LIBRADOS and speaks the native protocol to the cluster (M, M, M)]

Page 87: BACD LA 2013 - Scaling Storage with Ceph

LIBRADOS
§ Provides direct access to RADOS for applications
§ C, C++, Python, PHP, Java
§ No HTTP overhead
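Through the Python bindings, "direct access with no HTTP overhead" looks roughly like this. The pool name, conf path, and object name are placeholders, and the main routine needs a live cluster; only the round_trip helper runs anywhere.

```python
# Sketch: storing and fetching one object through librados (Python).
# Assumes a running Ceph cluster and a pool named "data" -- both
# hypothetical here; adjust to your deployment.

def round_trip(ioctx, name, payload):
    """Write an object and read it back through a RADOS I/O context."""
    ioctx.write_full(name, payload)        # replace the object's contents
    return ioctx.read(name, len(payload))  # no HTTP in between

def main(conf="/etc/ceph/ceph.conf", pool="data"):
    import rados  # ships with Ceph, not always installed elsewhere
    cluster = rados.Rados(conffile=conf)
    cluster.connect()
    ioctx = cluster.open_ioctx(pool)
    try:
        print(round_trip(ioctx, "greeting", b"no HTTP overhead"))
    finally:
        ioctx.close()
        cluster.shutdown()

# main()  # uncomment on a host configured to reach a Ceph cluster
```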

Page 88: BACD LA 2013 - Scaling Storage with Ceph

[The stack diagram again, with RADOSGW highlighted]

Page 89: BACD LA 2013 - Scaling Storage with Ceph

[Diagram: APPs speak REST to RADOSGW instances, which use LIBRADOS to reach the cluster natively]

Page 90: BACD LA 2013 - Scaling Storage with Ceph

RADOS Gateway:
§ REST-based interface to RADOS
§ Supports buckets, accounting
§ Compatible with S3 and Swift applications
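Because the gateway is S3-compatible, any stock S3 client can talk to it; here is a sketch using boto3. The endpoint, credentials, and bucket name are placeholders, not real values.

```python
# Sketch: round-tripping an object through RADOS Gateway with a
# generic S3 client (boto3). Endpoint and keys are placeholders.

def store_and_fetch(s3, bucket, key, body):
    """Put one object and read it back over the S3 API."""
    s3.put_object(Bucket=bucket, Key=key, Body=body)
    return s3.get_object(Bucket=bucket, Key=key)["Body"].read()

def main():
    import boto3  # any S3 (or Swift) client works against RGW
    s3 = boto3.client(
        "s3",
        endpoint_url="http://rgw.example.com:7480",   # your gateway
        aws_access_key_id="ACCESS_PLACEHOLDER",
        aws_secret_access_key="SECRET_PLACEHOLDER",
    )
    print(store_and_fetch(s3, "demo-bucket", "hello", b"via REST"))

# main()  # uncomment with a reachable RADOS Gateway
```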

Page 91: BACD LA 2013 - Scaling Storage with Ceph

[The stack diagram again, with RBD highlighted]

Page 92: BACD LA 2013 - Scaling Storage with Ceph

[Diagram: a VM on a virtualization container, reaching the cluster through LIBRBD and LIBRADOS]

Page 93: BACD LA 2013 - Scaling Storage with Ceph

[Diagram: two virtualization containers, each with LIBRBD and LIBRADOS; the VM can move between them]

Page 94: BACD LA 2013 - Scaling Storage with Ceph

[Diagram: a host mounting RBD through the KRBD kernel module]

Page 95: BACD LA 2013 - Scaling Storage with Ceph

RADOS Block Device:
§ Storage of virtual disks in RADOS
§ Allows decoupling of VMs and containers
§ Live migration!
§ Images are striped across the cluster
§ Boot support in QEMU, KVM, and OpenStack Nova
§ Mount support in the Linux kernel

Page 96: BACD LA 2013 - Scaling Storage with Ceph

[The stack diagram again, with CEPH FS highlighted]

Page 97: BACD LA 2013 - Scaling Storage with Ceph

[Diagram: the client sends data to the OSDs and metadata to the metadata servers]

Page 98: BACD LA 2013 - Scaling Storage with Ceph

Metadata Server:
§ Manages metadata for a POSIX-compliant shared filesystem
  § Directory hierarchy
  § File metadata (owner, timestamps, mode, etc.)
§ Stores metadata in RADOS
§ Does not serve file data to clients
§ Only required for shared filesystem

Page 99: BACD LA 2013 - Scaling Storage with Ceph

WHAT MAKES CEPH UNIQUE?

Page 100: BACD LA 2013 - Scaling Storage with Ceph

HOW DO YOU FIND YOUR KEYS? (azmeen, Flickr / CC BY 2.0)

Page 101: BACD LA 2013 - Scaling Storage with Ceph

[Diagram: an APP facing the D|C grid with no idea where its data lives (??)]

Page 102: BACD LA 2013 - Scaling Storage with Ceph

[Diagram: the D|C grid partitioned by name ranges (A-G, H-N, O-T, U-Z); the APP looks up F* in the right partition]

Page 103: BACD LA 2013 - Scaling Storage with Ceph

I ALWAYS PUT MY KEYS ON THE HOOK (vitamindave, Flickr / CC BY 2.0)

Page 104: BACD LA 2013 - Scaling Storage with Ceph

[Diagram: the APP and the D|C grid, no lookup table in sight]

Page 105: BACD LA 2013 - Scaling Storage with Ceph

DEAR DIARY: KEYS = IN THE KITCHEN (Barnaby, Flickr / CC BY 2.0)

Page 106: BACD LA 2013 - Scaling Storage with Ceph

HOW DO YOU FIND YOUR KEYS WHEN YOUR HOUSE IS INFINITELY BIG AND ALWAYS CHANGING?

Page 107: BACD LA 2013 - Scaling Storage with Ceph

THE ANSWER: CRUSH!! (pasukaru76, Flickr / CC SA 2.0)

Page 108: BACD LA 2013 - Scaling Storage with Ceph

[Diagram: an object's bits hash to a placement group; CRUSH maps the placement group onto OSDs]

hash(object name) % num pg

CRUSH(pg, cluster state, rule set)

Page 109: BACD LA 2013 - Scaling Storage with Ceph


Page 110: BACD LA 2013 - Scaling Storage with Ceph

CRUSH:
§ Pseudo-random placement algorithm
§ Ensures even distribution
§ Repeatable, deterministic
§ Rule-based configuration: replica count, infrastructure topology, weighting
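The two-step mapping can be caricatured in a few lines of Python. This is not the real CRUSH algorithm, just a toy "straw draw" with the same headline properties: deterministic, repeatable, evenly spread, and computable by any client with no lookup table.

```python
import hashlib

def pg_for(object_name, num_pgs):
    """Step 1: hash the object name onto a placement group."""
    digest = hashlib.md5(object_name.encode()).digest()
    return int.from_bytes(digest[:8], "big") % num_pgs

def crush_like(pg, osds, replicas):
    """Step 2, a toy stand-in for CRUSH: every OSD draws a
    deterministic 'straw' for this placement group and the highest
    straws win. Same inputs always give the same answer, and
    removing an unchosen OSD leaves existing placements alone."""
    def straw(osd):
        seed = f"{pg}:{osd}".encode()
        return int.from_bytes(hashlib.md5(seed).digest()[:8], "big")
    return sorted(osds, key=straw, reverse=True)[:replicas]

# Any client computes placement locally; there is no table to ask.
osds = [f"osd.{i}" for i in range(12)]
pg = pg_for("my-object", num_pgs=256)
print(crush_like(pg, osds, replicas=3))
```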

Page 111: BACD LA 2013 - Scaling Storage with Ceph

[Diagram: a client wondering where to put its object (??)]

Page 112: BACD LA 2013 - Scaling Storage with Ceph
Page 113: BACD LA 2013 - Scaling Storage with Ceph
Page 114: BACD LA 2013 - Scaling Storage with Ceph

[Diagram: the client again, after the cluster has changed (??)]

Page 115: BACD LA 2013 - Scaling Storage with Ceph

[Diagram: a VM on a virtualization container, reaching the cluster through LIBRBD and LIBRADOS]

Page 116: BACD LA 2013 - Scaling Storage with Ceph

HOW DO YOU SPIN UP THOUSANDS OF VMs INSTANTLY AND EFFICIENTLY?

Page 117: BACD LA 2013 - Scaling Storage with Ceph

[Diagram: a 144-block image is instantly copied; the clone stores 0 new blocks and still reads as 144]

Page 118: BACD LA 2013 - Scaling Storage with Ceph

[Diagram: the client writes 4 blocks to the clone; 144 shared + 4 new = 148]

Page 119: BACD LA 2013 - Scaling Storage with Ceph

[Diagram: reads hit the clone's 4 new blocks or fall through to the 144 shared ones; still 148 stored]
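The accounting on the last three slides (an instant copy of 144 blocks, four new writes, 148 blocks stored, both images still readable in full) can be sketched as a toy copy-on-write clone; the class and block contents are invented for illustration.

```python
class Image:
    """A toy copy-on-write clone. Blocks written here live in `own`;
    reads fall through to the parent for anything not overwritten."""
    def __init__(self, parent=None):
        self.parent = parent
        self.own = {}                   # block index -> data written here

    def clone(self):
        return Image(parent=self)       # "instant copy": zero blocks moved

    def write(self, idx, data):
        self.own[idx] = data            # only new writes consume space

    def read(self, idx):
        if idx in self.own:
            return self.own[idx]
        return self.parent.read(idx) if self.parent else None

    def stored_blocks(self):
        mine = len(self.own)
        return mine + (self.parent.stored_blocks() if self.parent else 0)

# A 144-block golden image, cloned, then 4 writes on the clone:
gold = Image()
for i in range(144):
    gold.write(i, f"base-{i}")
vm = gold.clone()                       # instant; still reads as 144 blocks
for i in range(4):
    vm.write(144 + i, f"new-{i}")
print(vm.stored_blocks())               # 144 shared + 4 new = 148
```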

Page 120: BACD LA 2013 - Scaling Storage with Ceph

HOW DO YOU MANAGE A DIRECTORY HIERARCHY WITHOUT A SINGLE POINT OF FAILURE?

Page 121: BACD LA 2013 - Scaling Storage with Ceph

FILESYSTEMS REQUIRE METADATA (Barnaby, Flickr / CC BY 2.0)

Page 122: BACD LA 2013 - Scaling Storage with Ceph

[Diagram: the client sends data to the OSDs and metadata to the metadata servers]

Page 123: BACD LA 2013 - Scaling Storage with Ceph

[Diagram: three metadata servers (M, M, M)]

Page 124: BACD LA 2013 - Scaling Storage with Ceph

One tree, three metadata servers: ??

Page 125: BACD LA 2013 - Scaling Storage with Ceph
Page 126: BACD LA 2013 - Scaling Storage with Ceph
Page 127: BACD LA 2013 - Scaling Storage with Ceph
Page 128: BACD LA 2013 - Scaling Storage with Ceph
Page 129: BACD LA 2013 - Scaling Storage with Ceph

DYNAMIC SUBTREE PARTITIONING
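A caricature of the idea, nowhere near the real MDS balancer: hand each hot subtree to whichever metadata server currently carries the least load, so no single server owns the whole tree. The subtree names and load numbers are invented.

```python
def partition(subtree_load, num_mds):
    """Toy dynamic subtree partitioning: greedily assign each
    directory subtree (heaviest first) to the least-loaded metadata
    server rank. Re-running with fresh load numbers re-homes
    subtrees as the workload shifts."""
    loads = [0] * num_mds
    homes = {}
    for subtree, load in sorted(subtree_load.items(),
                                key=lambda kv: -kv[1]):
        mds = loads.index(min(loads))   # current least-loaded rank
        homes[subtree] = mds
        loads[mds] += load
    return homes

hot = {"/home": 900, "/var": 300, "/usr": 250, "/tmp": 200, "/etc": 50}
print(partition(hot, num_mds=3))
```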

Page 130: BACD LA 2013 - Scaling Storage with Ceph

AND NOW BACKPEDALING

Page 131: BACD LA 2013 - Scaling Storage with Ceph

ALMOST EVERYTHING WORKS

Page 132: BACD LA 2013 - Scaling Storage with Ceph

[The stack diagram, scored: RADOS, LIBRADOS, RBD, RADOSGW: AWESOME; CEPH FS: NEARLY AWESOME]

Page 133: BACD LA 2013 - Scaling Storage with Ceph

LAN SCALE!! *

* OR REALLY REALLY SCARY FAST WAN

Page 134: BACD LA 2013 - Scaling Storage with Ceph

CEPH AND CLOUDSTACK (tableatny, Flickr / CC BY 2.0)

Page 135: BACD LA 2013 - Scaling Storage with Ceph

RBD SUPPORT IN CLOUDSTACK

§ Just announced two weeks ago!
§ Allows storage of virtual disks inside RADOS
§ Works with KVM only right now
§ No volume snapshots yet
§ Requires the latest version of, um, everything
§ More information on the mailing lists (ceph-devel / incubator-cloudstack-dev): http://article.gmane.org/gmane.comp.file-systems.ceph.devel/7505

Page 136: BACD LA 2013 - Scaling Storage with Ceph

QUESTIONS?

Ross Turk, VP Community, Inktank

§ [email protected]
§ @rossturk

inktank.com | ceph.com