SCALING STORAGE WITH CEPH Ross Turk, Inktank

vBACD July 2012 - Scaling Storage with Ceph


DESCRIPTION

"Scaling Storage with Ceph", Ross Turk, VP of Community, Inktank Ceph is an open source distributed object store, network block device, and file system designed for reliability, performance, and scalability. It runs on commodity hardware, has no single point of failure, and is supported by the Linux kernel. This talk will describe the Ceph architecture, share its design principles, and discuss how it can be part of a cost-effective, reliable cloud stack.


Page 1: vBACD July 2012 - Scaling Storage with Ceph

SCALING  STORAGE  WITH  CEPH

Ross  Turk,  Inktank  

Page 2: vBACD July 2012 - Scaling Storage with Ceph
Page 3: vBACD July 2012 - Scaling Storage with Ceph
Page 4: vBACD July 2012 - Scaling Storage with Ceph

RADOS A reliable, autonomous, distributed object store comprised of self-healing, self-managing, intelligent storage nodes

LIBRADOS A library allowing apps to directly access RADOS, with support for C, C++, Java, Python, Ruby, and PHP

RBD A reliable and fully-distributed block device, with a Linux kernel client and a QEMU/KVM driver

CEPH FS A POSIX-compliant distributed file system, with a Linux kernel client and support for FUSE

RADOSGW A bucket-based REST gateway, compatible with S3 and Swift

APP APP HOST/VM CLIENT

Page 5: vBACD July 2012 - Scaling Storage with Ceph

IN  THE  BEGINNING Magic Madzik, Flickr / CC BY 2.0

Page 6: vBACD July 2012 - Scaling Storage with Ceph

EARLY   INFORMATION  STORAGE Chico.Ferreira, Flickr / CC BY 2.0

Page 7: vBACD July 2012 - Scaling Storage with Ceph

WRITING  >  CAVE  PAINTINGS kevingessner, Flickr / CC BY-SA 2.0

Page 8: vBACD July 2012 - Scaling Storage with Ceph

x1000

== x1

Page 9: vBACD July 2012 - Scaling Storage with Ceph

PEOPLE  BEGIN  WRITING  A  LOT Moyan_Brenn, Flickr / CC BY-ND 2.0

Page 10: vBACD July 2012 - Scaling Storage with Ceph

WRITING IS TIME-CONSUMING trekkyandy, Flickr / CC BY 2.0

Page 11: vBACD July 2012 - Scaling Storage with Ceph

THE   INDUSTRIALIZATION  OF  WRITING FateDenied, Flickr / CC BY 2.0

Page 12: vBACD July 2012 - Scaling Storage with Ceph

x1000

== x1

tape + magnet = magnetic tape

Page 13: vBACD July 2012 - Scaling Storage with Ceph

STORAGE  BECOMES  MECHANICAL Erik Pitti, Wikipedia / CC BY-ND 2.0

Page 14: vBACD July 2012 - Scaling Storage with Ceph

[Diagram: storage chains: HUMAN → ROCK; HUMAN → INK → PAPER; HUMAN → COMPUTER → TAPE]

Page 15: vBACD July 2012 - Scaling Storage with Ceph

COMPUTERS  NEED  PEOPLE  TO  WORK USDAgov, Flickr / CC BY 2.0

Page 16: vBACD July 2012 - Scaling Storage with Ceph

HUMAN COMPUTER TAPE

Page 17: vBACD July 2012 - Scaling Storage with Ceph

11101011 10110110 10110101 10101001 00100100 01001001 10100100 10100101 01011010 01101010 10101010 10101010 01010110 01010011

==

Page 18: vBACD July 2012 - Scaling Storage with Ceph

THROUGHPUT  BECOMES   IMPORTANT Zane Luke, Flickr / CC BY-ND 2.0

Page 19: vBACD July 2012 - Scaling Storage with Ceph

LAZ0R  B3AMS  CHANGE  EVERYTHING!! Jeff Kubina, Flickr / CC-BY-SA 2.0

Page 20: vBACD July 2012 - Scaling Storage with Ceph

HARD  DRIVES  ARE  TOTALLY  BETTER

amazing spinny hard drives / sucky stupid slow tape

Page 21: vBACD July 2012 - Scaling Storage with Ceph

EVERYTHING  GETS  MESSY Rob!, Flickr / CC BY 2.0

Page 22: vBACD July 2012 - Scaling Storage with Ceph

[Diagram: a disk platter full of data blocks (aa, ab, ac, ba, bb, bc, da, db, dc, …) at binary addresses (000, 001, 010, 011, 101, 110, 111, …)]

Page 23: vBACD July 2012 - Scaling Storage with Ceph

[Diagram: a file = metadata (owner: rturk, created: aug12, last viewed: aug17, size: 42025, perms: 644) + data (11101011 10110110 10110101 …)]

Page 24: vBACD July 2012 - Scaling Storage with Ceph

[Diagram: the same platter, with the file's data spread across its blocks]

Page 25: vBACD July 2012 - Scaling Storage with Ceph

WE  OUTGROW  THE  HARD  DRIVE Mr. T in DC, Flickr / CC BY 2.0

Page 26: vBACD July 2012 - Scaling Storage with Ceph

[Diagram: one HUMAN, one COMPUTER, and many DISKs]

Page 27: vBACD July 2012 - Scaling Storage with Ceph

PEOPLE NEED SIMULTANEOUS ACCESS wFourier, Flickr / CC BY 2.0

Page 28: vBACD July 2012 - Scaling Storage with Ceph

[Diagram: several HUMANs sharing one COMPUTER and its many DISKs]

Page 29: vBACD July 2012 - Scaling Storage with Ceph

[Diagram: one shared COMPUTER, a dozen DISKs, and a whole crowd of HUMANs (actually more like this…)]

Page 30: vBACD July 2012 - Scaling Storage with Ceph

[Diagram: three HUMANs and many COMPUTER+DISK pairs]

Page 31: vBACD July 2012 - Scaling Storage with Ceph

[Diagram: the platter-and-blocks picture again, now marked with an X]

Page 32: vBACD July 2012 - Scaling Storage with Ceph

[Diagram: an object = arbitrary metadata (pace: quick, driver: frog, license: expired, expression: agog) + data (11101011 10110110 10110101 …)]

Page 33: vBACD July 2012 - Scaling Storage with Ceph

[Diagram: an APP talking to many COMPUTER+DISK pairs]

Page 34: vBACD July 2012 - Scaling Storage with Ceph

[Diagram: a grid of COMPUTER+DISK pairs]

Page 35: vBACD July 2012 - Scaling Storage with Ceph

[Diagram: a grid of COMPUTER+DISK pairs, with one COMPUTER running several VMs]

Page 36: vBACD July 2012 - Scaling Storage with Ceph

STORAGE THROUGHOUT HISTORY Time-scale: Roughly logarithmic. Content: Whatever the opposite of "scientific" is.

Painting

Writing

Computers

Shared storage

Distributed storage

Cloud computing

Ceph

Page 37: vBACD July 2012 - Scaling Storage with Ceph

[Diagram: three HUMANs and many COMPUTER+DISK pairs]

Page 38: vBACD July 2012 - Scaling Storage with Ceph

[Diagram: a grid of COMPUTER+DISK pairs]

Page 39: vBACD July 2012 - Scaling Storage with Ceph

[Diagram: the same grid, abbreviated to D (disk) + C (computer) nodes]

Page 40: vBACD July 2012 - Scaling Storage with Ceph

[Diagram: three HUMANs and a grid of D+C nodes]

Page 41: vBACD July 2012 - Scaling Storage with Ceph

STORAGE  APPLIANCES Michael Moll, Wikipedia / CC BY-SA 2.0

Page 42: vBACD July 2012 - Scaling Storage with Ceph

6.4 MILLION SQFT OF FACTORIES Dude94111, Flickr / CC BY 2.0

Page 43: vBACD July 2012 - Scaling Storage with Ceph

STORAGE  VENDORS  HAVE  BIG  BILLS CarbonNYC, Flickr / CC BY 2.0

Page 44: vBACD July 2012 - Scaling Storage with Ceph

STORAGE  APPLIANCES  ARE  EXPENSIVE 401K 2012, Flickr / CC BY-SA 2.0

Page 45: vBACD July 2012 - Scaling Storage with Ceph

TECHNOLOGY   IS  A  COMMODITY RaeAllen, Flickr / CC-BY 2.0

Page 46: vBACD July 2012 - Scaling Storage with Ceph

COMMODITY  PRICES  FLUCTUATE

[Chart: commodity prices, May 2007 to May 2012]

Page 47: vBACD July 2012 - Scaling Storage with Ceph

GROWING WITH HARDWARE APPLIANCES

§ First PB
  § Proprietary storage hardware
  § Well-known storage vendor
  § $14 b'zillion

§ Second PB
  § Proprietary storage hardware
  § Same storage vendor
  § Another $14 b'zillion

[Diagram: rack after rack of identical D+C appliance units]

Page 48: vBACD July 2012 - Scaling Storage with Ceph

APPLIANCES  ARE  OLD  TECHNOLOGY Paul Keller, Flickr / CC BY 2.0

Page 49: vBACD July 2012 - Scaling Storage with Ceph

Source: http://www.cpubenchmark.net/high_end_cpus.html

Page 50: vBACD July 2012 - Scaling Storage with Ceph

FLAGSHIP HARDWARE APPLIANCE

Page 51: vBACD July 2012 - Scaling Storage with Ceph

Hardware Appliances are Mysterious Black Boxes Abode of Chaos, Flickr / CC BY 2.0

Page 52: vBACD July 2012 - Scaling Storage with Ceph

[Diagram: a cluster of D+C appliance nodes, plus a newer, faster machine (C++) you would like to add]

Page 53: vBACD July 2012 - Scaling Storage with Ceph

[Diagram: the same cluster; with hardware appliances, the new C++ machine cannot join (X)]

Page 54: vBACD July 2012 - Scaling Storage with Ceph

[Diagram: a cluster of D+C appliance nodes and a HUMAN [DEVELOPER] going "!!"]

Page 55: vBACD July 2012 - Scaling Storage with Ceph

THE WORLD NEEDS

A STORAGE TECHNOLOGY THAT

SCALES INFINITELY

Page 56: vBACD July 2012 - Scaling Storage with Ceph

THE WORLD NEEDS

A STORAGE TECHNOLOGY THAT DOESN’T REQUIRE

AN INDUSTRIAL

MANUFACTURING PROCESS

Page 57: vBACD July 2012 - Scaling Storage with Ceph

SAGE  WEIL

§  Co-founder of DreamHost

§  Inventor of Ceph

§  CEO of Inktank

Page 58: vBACD July 2012 - Scaling Storage with Ceph

OPEN SOURCE

philosophy design

Page 59: vBACD July 2012 - Scaling Storage with Ceph

OPEN  SOURCE  SPREADS   IDEAS orchidgalore, Flickr / CC BY 2.0

Page 60: vBACD July 2012 - Scaling Storage with Ceph

OPEN SOURCE

COMMUNITY-FOCUSED

philosophy design

Page 61: vBACD July 2012 - Scaling Storage with Ceph

WE ARE SMARTER TOGETHER rturk, LinkedIn InMap

Page 62: vBACD July 2012 - Scaling Storage with Ceph

CEPH  BELONGS  TO  ALL  OF  US wackybadger, Flickr / CC BY 2.0

Page 63: vBACD July 2012 - Scaling Storage with Ceph

OPEN SOURCE

COMMUNITY-FOCUSED

SCALABLE

philosophy design

Page 64: vBACD July 2012 - Scaling Storage with Ceph

CEPH IS BUILT TO SCALE

Too much for a cave

Too much for a book

Too much for a drive

Too much for a computer

Too much for a room

Ceph

Page 65: vBACD July 2012 - Scaling Storage with Ceph

OPEN SOURCE

COMMUNITY-FOCUSED

SCALABLE

NO SINGLE POINT OF FAILURE

philosophy design

Page 66: vBACD July 2012 - Scaling Storage with Ceph

ARILOMAX  CALIFORNICUS aroid, Flickr / CC BY 2.0

Page 67: vBACD July 2012 - Scaling Storage with Ceph

THE  OCTOPUS   (A  METAPHOR) I love speaking in metaphors.

single point of failure

highly-available replicated

Page 68: vBACD July 2012 - Scaling Storage with Ceph

THE  BEEHIVE   (ANOTHER  METAPHOR) blumenbiene, Flickr / CC BY 2.0

Page 69: vBACD July 2012 - Scaling Storage with Ceph

OPEN SOURCE

COMMUNITY-FOCUSED

SCALABLE

NO SINGLE POINT OF FAILURE

SOFTWARE BASED

philosophy design

Page 70: vBACD July 2012 - Scaling Storage with Ceph

[Diagram: a cluster of D+C nodes, plus a newer, faster machine (C++) waiting to join]

Page 71: vBACD July 2012 - Scaling Storage with Ceph

[Diagram: with software-based storage, the new C++ machine simply joins the cluster (✔)]

Page 72: vBACD July 2012 - Scaling Storage with Ceph

OPEN SOURCE

COMMUNITY-FOCUSED

SCALABLE

NO SINGLE POINT OF FAILURE

SOFTWARE BASED

SELF-MANAGING

philosophy design

Page 73: vBACD July 2012 - Scaling Storage with Ceph

DISKS = JUST TINY RECORD PLAYERS jon_a_ross, Flickr / CC BY 2.0

Page 74: vBACD July 2012 - Scaling Storage with Ceph

[Diagram: one disk (D) is dependable on its own, but x 1 MILLION disks means disk failures roughly 55 times / day]
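
To put a number on that: assuming each drive fails roughly 2% of the time per year (an assumed rate, chosen because it is consistent with the slide's figure), 1,000,000 drives × 0.02 failures per drive-year ÷ 365 days ≈ 55 failed drives every single day, which is why storage at this scale has to heal itself rather than wait for a human.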

Page 75: vBACD July 2012 - Scaling Storage with Ceph
Page 76: vBACD July 2012 - Scaling Storage with Ceph

IT  ALL  STARTED  WITH  A  DREAM

Page 77: vBACD July 2012 - Scaling Storage with Ceph


Page 78: vBACD July 2012 - Scaling Storage with Ceph

NEW MONTHLY CODE COMMITS

[Chart: new monthly code commits, 2004-06 through 2011-07; y-axis 0 to 700]

Page 79: vBACD July 2012 - Scaling Storage with Ceph

CEPH  STARTS  POPPING  UP!

(sorry about all the logo tampering)

Page 80: vBACD July 2012 - Scaling Storage with Ceph

RADOS A reliable, autonomous, distributed object store comprised of self-healing, self-managing, intelligent storage nodes

LIBRADOS A library allowing apps to directly access RADOS, with support for C, C++, Java, Python, Ruby, and PHP

RBD A reliable and fully-distributed block device, with a Linux kernel client and a QEMU/KVM driver

CEPH FS A POSIX-compliant distributed file system, with a Linux kernel client and support for FUSE

RADOSGW A bucket-based REST gateway, compatible with S3 and Swift

APP APP HOST/VM CLIENT

Page 81: vBACD July 2012 - Scaling Storage with Ceph

RADOS A reliable, autonomous, distributed object store comprised of self-healing, self-managing, intelligent storage nodes

LIBRADOS A library allowing apps to directly access RADOS, with support for C, C++, Java, Python, Ruby, and PHP

RBD A reliable and fully-distributed block device, with a Linux kernel client and a QEMU/KVM driver

CEPH FS A POSIX-compliant distributed file system, with a Linux kernel client and support for FUSE

RADOSGW A bucket-based REST gateway, compatible with S3 and Swift

APP APP HOST/VM CLIENT

Page 82: vBACD July 2012 - Scaling Storage with Ceph

[Diagram: each OSD runs on top of a filesystem (btrfs, xfs, or ext4) on its own DISK; three monitors (M) run alongside the OSDs]

Page 83: vBACD July 2012 - Scaling Storage with Ceph

[Diagram: a HUMAN administrator talking to the three monitors (M)]

Page 84: vBACD July 2012 - Scaling Storage with Ceph

Monitors:
§ Maintain cluster map
§ Provide consensus for distributed decision-making
§ Must have an odd number
§ These do not serve stored objects to clients

OSDs:
§ One per disk (recommended)
§ At least three in a cluster
§ Serve stored objects to clients
§ Intelligently peer to perform replication tasks
§ Support object classes

Page 85: vBACD July 2012 - Scaling Storage with Ceph

RADOS A reliable, autonomous, distributed object store comprised of self-healing, self-managing, intelligent storage nodes

LIBRADOS A library allowing apps to directly access RADOS, with support for C, C++, Java, Python, Ruby, and PHP

RBD A reliable and fully-distributed block device, with a Linux kernel client and a QEMU/KVM driver

CEPH FS A POSIX-compliant distributed file system, with a Linux kernel client and support for FUSE

RADOSGW A bucket-based REST gateway, compatible with S3 and Swift

APP APP HOST/VM CLIENT

Page 86: vBACD July 2012 - Scaling Storage with Ceph

[Diagram: an APP linking LIBRADOS and speaking the native protocol to the cluster of OSDs and monitors (M)]

Page 87: vBACD July 2012 - Scaling Storage with Ceph

LIBRADOS
§ Provides direct access to RADOS for applications
§ C, C++, Python, PHP, Java
§ No HTTP overhead
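
As a rough illustration of what "direct access with no HTTP overhead" means in practice, here is a minimal sketch using the python-rados bindings that ship with Ceph; the config path and the pool name 'data' are assumptions made for the example, not something from the slides.

import rados

# Connect to the cluster using a local ceph.conf (path is an assumption).
cluster = rados.Rados(conffile='/etc/ceph/ceph.conf')
cluster.connect()

# Open an I/O context on an existing pool (pool name is an assumption).
ioctx = cluster.open_ioctx('data')

# Read and write objects directly: no gateway, no HTTP, just the native protocol.
ioctx.write_full('hello-object', b'Hello, RADOS')
print(ioctx.read('hello-object'))

ioctx.close()
cluster.shutdown()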

Page 88: vBACD July 2012 - Scaling Storage with Ceph

RADOS A reliable, autonomous, distributed object store comprised of self-healing, self-managing, intelligent storage nodes

LIBRADOS A library allowing apps to directly access RADOS, with support for C, C++, Java, Python, Ruby, and PHP

RBD A reliable and fully-distributed block device, with a Linux kernel client and a QEMU/KVM driver

CEPH FS A POSIX-compliant distributed file system, with a Linux kernel client and support for FUSE

RADOSGW A bucket-based REST gateway, compatible with S3 and Swift

APP APP HOST/VM CLIENT

Page 89: vBACD July 2012 - Scaling Storage with Ceph

[Diagram: APPs speak REST to RADOSGW instances, which use LIBRADOS to talk natively to the cluster of OSDs and monitors (M)]

Page 90: vBACD July 2012 - Scaling Storage with Ceph

RADOS Gateway:
§ REST-based interface to RADOS
§ Supports buckets, accounting
§ Compatible with S3 and Swift applications
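
To make "compatible with S3 applications" concrete, here is a hedged sketch that points a stock boto (v2) S3 client at a RADOS Gateway; the endpoint hostname, credentials, and bucket name are placeholders invented for the example.

import boto
import boto.s3.connection

# Point an ordinary S3 client at the RADOS Gateway endpoint (all values are placeholders).
conn = boto.connect_s3(
    aws_access_key_id='ACCESS_KEY',
    aws_secret_access_key='SECRET_KEY',
    host='radosgw.example.com',
    is_secure=False,  # plain HTTP, just for the sketch
    calling_format=boto.s3.connection.OrdinaryCallingFormat(),
)

# Normal S3 calls are served by RADOSGW, backed by RADOS.
bucket = conn.create_bucket('my-bucket')
key = bucket.new_key('hello.txt')
key.set_contents_from_string('Hello from the RADOS Gateway')
print([k.name for k in bucket.list()])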

Page 91: vBACD July 2012 - Scaling Storage with Ceph

RADOS A reliable, autonomous, distributed object store comprised of self-healing, self-managing, intelligent storage nodes

LIBRADOS A library allowing apps to directly access RADOS, with support for C, C++, Java, Python, Ruby, and PHP

CEPH FS A POSIX-compliant distributed file system, with a Linux kernel client and support for FUSE

RADOSGW A bucket-based REST gateway, compatible with S3 and Swift

APP APP HOST/VM CLIENT

RBD A reliable and fully-distributed block device, with a Linux kernel client and a QEMU/KVM driver

Page 92: vBACD July 2012 - Scaling Storage with Ceph

[Diagram: a VM in a VIRTUALIZATION CONTAINER, with LIBRBD on LIBRADOS providing its disk from the cluster (monitors M)]

Page 93: vBACD July 2012 - Scaling Storage with Ceph

[Diagram: two containers backed by the same LIBRBD/LIBRADOS image, so a VM can move between them]

Page 94: vBACD July 2012 - Scaling Storage with Ceph

[Diagram: a HOST mounting an RBD image through the KRBD kernel module, talking to the cluster (monitors M)]

Page 95: vBACD July 2012 - Scaling Storage with Ceph

RADOS Block Device:
§ Storage of virtual disks in RADOS
§ Allows decoupling of VMs and containers
  § Live migration!
§ Images are striped across the cluster
§ Boot support in QEMU, KVM, and OpenStack Nova
§ Mount support in the Linux kernel
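
As a sketch of "storage of virtual disks in RADOS", the rbd Python bindings can create and write an image much like a block device; the pool name 'rbd', the image name, and the size below are assumptions made for illustration.

import rados
import rbd

cluster = rados.Rados(conffile='/etc/ceph/ceph.conf')
cluster.connect()
ioctx = cluster.open_ioctx('rbd')  # pool name is an assumption

# Create a 10 GiB image; RADOS stripes it across many objects in the cluster.
rbd.RBD().create(ioctx, 'vm-disk-1', 10 * 1024 ** 3)

# Open the image and write to it like a block device.
image = rbd.Image(ioctx, 'vm-disk-1')
image.write(b'\x00' * 512, 0)  # 512 bytes at offset 0
image.close()

ioctx.close()
cluster.shutdown()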

Page 96: vBACD July 2012 - Scaling Storage with Ceph

RADOS A reliable, autonomous, distributed object store comprised of self-healing, self-managing, intelligent storage nodes

LIBRADOS A library allowing apps to directly access RADOS, with support for C, C++, Java, Python, Ruby, and PHP

RBD A reliable and fully-distributed block device, with a Linux kernel client and a QEMU/KVM driver

CEPH FS A POSIX-compliant distributed file system, with a Linux kernel client and support for FUSE

RADOSGW A bucket-based REST gateway, compatible with S3 and Swift

APP APP HOST/VM CLIENT

Page 97: vBACD July 2012 - Scaling Storage with Ceph

[Diagram: a CLIENT sends file data into the cluster and metadata operations to the metadata servers (M)]

Page 98: vBACD July 2012 - Scaling Storage with Ceph

Metadata Server:
§ Manages metadata for a POSIX-compliant shared filesystem
  § Directory hierarchy
  § File metadata (owner, timestamps, mode, etc.)
§ Stores metadata in RADOS
§ Does not serve file data to clients
§ Only required for shared filesystem

Page 99: vBACD July 2012 - Scaling Storage with Ceph

WHAT MAKES CEPH UNIQUE?

Page 100: vBACD July 2012 - Scaling Storage with Ceph

HOW DO YOU FIND YOUR KEYS? azmeen, Flickr / CC BY 2.0

Page 101: vBACD July 2012 - Scaling Storage with Ceph

[Diagram: an APP with no idea (??) which of the many D+C nodes holds its data]

Page 102: vBACD July 2012 - Scaling Storage with Ceph

[Diagram: the APP finds F* by consulting a table that maps key ranges (A-G, H-N, O-T, U-Z) to D+C nodes]

Page 103: vBACD July 2012 - Scaling Storage with Ceph

I  ALWAYS  PUT  MY  KEYS  ON  THE  HOOK vitamindave, Flickr / CC BY 2.0

Page 104: vBACD July 2012 - Scaling Storage with Ceph

[Diagram: the APP goes straight to the right D+C node, no lookup required]

Page 105: vBACD July 2012 - Scaling Storage with Ceph

DEAR  DIARY:  KEYS  =   IN  THE  KITCHEN Barnaby, Flickr / CC BY 2.0

Page 106: vBACD July 2012 - Scaling Storage with Ceph

HOW DO YOU FIND YOUR KEYS

WHEN YOUR HOUSE IS

INFINITELY BIG AND

ALWAYS CHANGING?

Page 107: vBACD July 2012 - Scaling Storage with Ceph

THE  ANSWER:  CRUSH!! pasukaru76, Flickr / CC SA 2.0

Page 108: vBACD July 2012 - Scaling Storage with Ceph

[Diagram: object data is hashed into placement groups, and each placement group is mapped by CRUSH onto a set of OSDs]

hash(object name) % num pg

CRUSH(pg, cluster state, rule set)

Page 109: vBACD July 2012 - Scaling Storage with Ceph


Page 110: vBACD July 2012 - Scaling Storage with Ceph

CRUSH
§ Pseudo-random placement algorithm
§ Ensures even distribution
§ Repeatable, deterministic
§ Rule-based configuration
  § Replica count
  § Infrastructure topology
  § Weighting
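
The two lines on the earlier slide (hash the object name into a placement group, then run CRUSH over the cluster map) are the whole client-side story. The sketch below is a deliberately tiny stand-in, not the real CRUSH algorithm (no hierarchy, no weights, no rule sets); it only shows why a deterministic, repeatable function lets every client compute the same placement without asking a lookup service.

import hashlib

def pg_for_object(name, pg_num):
    # hash(object name) % num pg
    digest = int(hashlib.md5(name.encode()).hexdigest(), 16)
    return digest % pg_num

def toy_crush(pg, osds, replicas=3):
    # Stand-in for CRUSH(pg, cluster state, rule set): rank OSDs by a hash of
    # (pg, osd) and take the first few. Deterministic and repeatable, but
    # without CRUSH's topology awareness or weighting.
    ranked = sorted(osds, key=lambda osd: hashlib.md5(f"{pg}-{osd}".encode()).hexdigest())
    return ranked[:replicas]

osds = [f"osd.{i}" for i in range(12)]
pg = pg_for_object("my-object", pg_num=256)
print(pg, toy_crush(pg, osds))  # every client gets the same answer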

Page 111: vBACD July 2012 - Scaling Storage with Ceph

[Diagram: a CLIENT wondering (??) where its data lives]

Page 112: vBACD July 2012 - Scaling Storage with Ceph
Page 113: vBACD July 2012 - Scaling Storage with Ceph
Page 114: vBACD July 2012 - Scaling Storage with Ceph

[Diagram: the CLIENT, again wondering (??) where its data lives]

Page 115: vBACD July 2012 - Scaling Storage with Ceph

[Diagram: a VM in a VIRTUALIZATION CONTAINER, backed by LIBRBD on LIBRADOS, talking to the cluster (monitors M)]

Page 116: vBACD July 2012 - Scaling Storage with Ceph

HOW DO YOU SPIN UP

THOUSANDS OF VMs INSTANTLY

AND EFFICIENTLY?

Page 117: vBACD July 2012 - Scaling Storage with Ceph

[Diagram: an instant copy of an image: 144 + 0 + 0 + 0 + 0 = 144]

Page 118: vBACD July 2012 - Scaling Storage with Ceph

[Diagram: a CLIENT writes to the copy; only the new data is stored: 144 + 4 = 148]

Page 119: vBACD July 2012 - Scaling Storage with Ceph

[Diagram: the CLIENT reads from the copy; unchanged data still comes from the original: 144 + 4 = 148]
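
The instant-copy trick in the last three diagrams is copy-on-write layering: a clone shares the parent's blocks and only stores what it writes (144 + 4 = 148 above). A hedged sketch with the rbd Python bindings follows; layering requires format-2 images, which landed shortly after this talk, and the image, snapshot, and pool names are assumptions.

import rados
import rbd

cluster = rados.Rados(conffile='/etc/ceph/ceph.conf')
cluster.connect()
ioctx = cluster.open_ioctx('rbd')  # pool name is an assumption

# Snapshot and protect a "golden" base image (assumed to be a format-2 image
# created with layering enabled).
base = rbd.Image(ioctx, 'golden-image')
base.create_snap('base-snap')
base.protect_snap('base-snap')
base.close()

# Clone the snapshot: the new image is usable instantly and shares the
# parent's data, storing only the blocks it overwrites.
rbd.RBD().clone(ioctx, 'golden-image', 'base-snap',
                ioctx, 'vm-0042-disk', features=rbd.RBD_FEATURE_LAYERING)

ioctx.close()
cluster.shutdown()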

Page 120: vBACD July 2012 - Scaling Storage with Ceph

HOW DO YOU MANAGE

DIRECTORY HIERARCHY WITHOUT

A SINGLE POINT OF FAILURE?

Page 121: vBACD July 2012 - Scaling Storage with Ceph

FILESYSTEMS  REQUIRE  METADATA Barnaby, Flickr / CC BY 2.0

Page 122: vBACD July 2012 - Scaling Storage with Ceph

[Diagram: a CLIENT sending data and metadata into the cluster (M)]

Page 123: vBACD July 2012 - Scaling Storage with Ceph

[Diagram: three metadata servers (M)]

Page 124: vBACD July 2012 - Scaling Storage with Ceph

one tree

three metadata servers

??

Page 125: vBACD July 2012 - Scaling Storage with Ceph
Page 126: vBACD July 2012 - Scaling Storage with Ceph
Page 127: vBACD July 2012 - Scaling Storage with Ceph
Page 128: vBACD July 2012 - Scaling Storage with Ceph
Page 129: vBACD July 2012 - Scaling Storage with Ceph

DYNAMIC SUBTREE PARTITIONING

Page 130: vBACD July 2012 - Scaling Storage with Ceph

AND NOW BACKPEDALING

Page 131: vBACD July 2012 - Scaling Storage with Ceph

ALMOST EVERYTHING

WORKS

Page 132: vBACD July 2012 - Scaling Storage with Ceph

RADOS A reliable, autonomous, distributed object store comprised of self-healing, self-managing, intelligent storage nodes

LIBRADOS A library allowing apps to directly access RADOS, with support for C, C++, Java, Python, Ruby, and PHP

RBD A reliable and fully-distributed block device, with a Linux kernel client and a QEMU/KVM driver

RADOSGW A bucket-based REST gateway, compatible with S3 and Swift

APP APP HOST/VM CLIENT

CEPH FS A POSIX-compliant distributed file system, with a Linux kernel client and support for FUSE

RADOS, LIBRADOS, RBD, and RADOSGW: AWESOME. CEPH FS: NEARLY AWESOME.

Page 133: vBACD July 2012 - Scaling Storage with Ceph

LAN SCALE!! *

* OR REALLY REALLY SCARY FAST WAN

Page 134: vBACD July 2012 - Scaling Storage with Ceph

CEPH  AND  CLOUDSTACK tableatny, Flickr / CC BY 2.0

Page 135: vBACD July 2012 - Scaling Storage with Ceph

RBD SUPPORT IN CLOUDSTACK

§ Just announced two weeks ago!
§ Allows storage of virtual disks inside RADOS
§ Works with KVM only right now
§ No volume snapshots yet
§ Requires the latest version of, um, everything
§ More information can be found on the mailing list:
  § ceph-devel / incubator-cloudstack-dev: http://article.gmane.org/gmane.comp.file-systems.ceph.devel/7505

Page 136: vBACD July 2012 - Scaling Storage with Ceph

QUESTIONS?

Ross Turk, VP of Community, Inktank

§ [email protected]
§ @rossturk

inktank.com | ceph.com