
Keynote: Building Tomorrow's Ceph - Ceph Day Frankfurt


Sage Weil, Founder & CTO, Inktank


Page 1: Keynote: Building Tomorrow's Ceph - Ceph Day Frankfurt

Building Tomorrow's Ceph
Sage Weil

Page 9: Keynote: Building Tomorrow's Ceph - Ceph Day Frankfurt

Research beginnings


Page 11: Keynote: Building Tomorrow's Ceph - Ceph Day Frankfurt

UCSC research grant

“Petascale object storage”: US Dept of Energy (LANL, LLNL, Sandia)

Scalability

Reliability

Performance: raw I/O bandwidth, metadata ops/sec

HPC file system workloads: thousands of clients writing to the same file or directory

Page 12: Keynote: Building Tomorrow's Ceph - Ceph Day Frankfurt

Distributed metadata management

Innovative design: subtree-based partitioning for locality and efficiency (toy sketch below)

Dynamically adapt to current workload

Embedded inodes

Prototype simulator in Java (2004)

First line of Ceph code: summer internship at LLNL

High security national lab environment

Could write anything, as long as it was OSS
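
The subtree partitioning bullets above are easiest to see with a toy model. The following hypothetical Python sketch (nothing like the real CephFS MDS code) tracks per-directory operation counts, computes each subtree's aggregate load, and greedily hands the hottest top-level subtrees to the least-loaded metadata server rank:

```python
# Toy illustration of load-based subtree partitioning (not the real MDS logic).
from dataclasses import dataclass, field

@dataclass
class Dir:
    name: str
    ops: int = 0                      # recent metadata ops observed on this directory
    children: list = field(default_factory=list)

    def load(self):
        # Subtree load = own ops plus the load of all descendants.
        return self.ops + sum(c.load() for c in self.children)

def partition(root, num_mds):
    """Greedily assign the hottest top-level subtrees to the least-loaded MDS rank."""
    ranks = [0.0] * num_mds           # accumulated load per MDS rank
    assignment = {}
    for subtree in sorted(root.children, key=lambda d: d.load(), reverse=True):
        target = ranks.index(min(ranks))
        assignment[subtree.name] = target
        ranks[target] += subtree.load()
    return assignment

root = Dir("/", children=[
    Dir("home", ops=10, children=[Dir("alice", ops=5000), Dir("bob", ops=200)]),
    Dir("scratch", ops=30000),        # a hot HPC job directory
    Dir("etc", ops=50),
])
print(partition(root, num_mds=2))     # {'scratch': 0, 'home': 1, 'etc': 1}
```

The real design can delegate subtrees at any depth and re-balances continuously as the workload shifts; the sketch only shows why per-subtree load is the natural unit of migration.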

Page 13: Keynote: Building Tomorrow's Ceph - Ceph Day Frankfurt

The rest of Ceph

RADOS – distributed object storage cluster (2005)

EBOFS – local object storage (2004/2006)

CRUSH – hashing for the real world (2005); simplified placement sketch below

Paxos monitors – cluster consensus (2006)

→ emphasis on consistent, reliable storage

→ scale by pushing intelligence to the edges

→ a different but compelling architecture
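
For a rough sense of what "hashing for the real world" means in practice, here is a hedged Python sketch of deterministic, calculation-based placement. It uses weighted rendezvous hashing as a simplified stand-in, not the actual CRUSH algorithm, which additionally descends a hierarchy of failure domains (hosts, racks, rows) described by the cluster map:

```python
import hashlib, math

def score(obj_name: str, osd_id: int, weight: float) -> float:
    """Deterministic pseudorandom score for an (object, OSD) pair; higher wins."""
    h = hashlib.sha256(f"{obj_name}:{osd_id}".encode()).digest()
    r = (int.from_bytes(h[:8], "big") + 1) / (2**64 + 1)   # uniform in (0, 1)
    return -weight / math.log(r)                           # weighted rendezvous hashing

def place(obj_name: str, osds: dict, replicas: int = 3) -> list:
    """Choose `replicas` distinct OSDs for an object, with no central lookup table."""
    ranked = sorted(osds, key=lambda osd: score(obj_name, osd, osds[osd]), reverse=True)
    return ranked[:replicas]

# Any client or OSD can compute the same placement independently from the same map.
osd_weights = {0: 1.0, 1: 1.0, 2: 1.0, 3: 2.0, 4: 1.0}     # osd id -> relative weight
print(place("rbd_data.1234.0000000000000001", osd_weights))
```

The point is the same as CRUSH's: placement is a pure function of the object name and the cluster map, so intelligence can live at the edges instead of in a central metadata service.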


Page 15: Keynote: Building Tomorrow's Ceph - Ceph Day Frankfurt

Industry black hole

Many large storage vendors: proprietary solutions that don't scale well

Few open source alternatives (2006): very limited scale, or

Limited community and architecture (Lustre)

No enterprise feature sets (snapshots, quotas)

PhD grads all built interesting systems... and then went to work for NetApp, DDN, EMC, Veritas.

They want you, not your project

Page 16: Keynote: Building Tomorrow's Ceph - Ceph Day Frankfurt

A different path

Change the world with open source: do what Linux did to Solaris, Irix, Ultrix, etc.

What could go wrong?

License: GPL, BSD...

LGPL: share changes, okay to link to proprietary code

Avoid community-unfriendly practices: no dual licensing

No copyright assignment

Page 17: Keynote: Building Tomorrow's Ceph - Ceph Day Frankfurt

Incubation


Page 19: Keynote: Building Tomorrow's Ceph - Ceph Day Frankfurt

DreamHost!

Move back to Los Angeles, continue hacking

Hired a few developers

Pure development

No deliverables

Page 20: Keynote: Building Tomorrow's Ceph - Ceph Day Frankfurt

Ambitious feature set

Native Linux kernel client (2007-)

Per-directory snapshots (2008)

Recursive accounting (2008)

Object classes (2009)

librados (2009); usage sketch below

radosgw (2009)

Strong authentication (2009)

RBD: rados block device (2010)
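
To make the librados bullet above concrete, here is a minimal sketch using the python-rados bindings; it assumes a reachable cluster, a readable /etc/ceph/ceph.conf with default credentials, and an existing pool (here named "data" purely as a placeholder):

```python
import rados

# Connect using the local ceph.conf and the default client keyring (assumed to exist).
cluster = rados.Rados(conffile="/etc/ceph/ceph.conf")
cluster.connect()
try:
    ioctx = cluster.open_ioctx("data")        # placeholder pool name
    try:
        ioctx.write_full("hello-object", b"Hello from librados")
        print(ioctx.read("hello-object"))     # b'Hello from librados'
    finally:
        ioctx.close()
finally:
    cluster.shutdown()
```

radosgw and RBD are layered on this same object interface, with object classes providing server-side logic next to the data.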

Page 21: Keynote: Building Tomorrow's Ceph - Ceph Day Frankfurt

The kernel client

ceph-fuse was limited, not very fast

Build native Linux kernel implementation

Began attending Linux file system developer events (LSF)

Early words of encouragement from ex-Lustre devs

Engage Linux fs developer community as peer

Eventually merged CephFS client for v2.6.34 (early 2010)

RBD client merged in 2011

Page 22: Keynote: Building Tomorrow's Ceph - Ceph Day Frankfurt

Part of a larger ecosystem

Ceph need not solve all problems as monolithic stack

Replaced the ebofs object file system with btrfs: same design goals

Robust, well optimized

Kernel-level cache management

Copy-on-write, checksumming, other goodness

Contributed some early functionality: cloning files

Async snapshots

Page 23: Keynote: Building Tomorrow's Ceph - Ceph Day Frankfurt

Budding community

#ceph on irc.oftc.net, ceph-devel@vger.kernel.org

Many interested users

A few developers

Many fans

Too unstable for any real deployments

Still mostly focused on the right architecture and technical solutions

Page 24: Keynote: Building Tomorrow's Ceph - Ceph Day Frankfurt

Road to product

DreamHost decides to build an S3-compatible object storage service with Ceph

Stability: focus on core RADOS, RBD, radosgw

Paying back some technical debt: build testing automation

Code review!

Expand engineering team

Page 25: Keynote: Building Tomorrow's Ceph - Ceph Day Frankfurt

The reality

Growing incoming commercial interest: early attempts from organizations large and small

Difficult to engage with a web hosting company

No means to support commercial deployments

Project needed a company to back it: fund the engineering effort

Build and test a product

Support users

Bryan built a framework to spin out of DreamHost

Page 26: Keynote: Building Tomorrow's Ceph - Ceph Day Frankfurt

Launch


Page 28: Keynote: Building Tomorrow's Ceph - Ceph Day Frankfurt

Do it right

How do we build a strong open source company?

How do we build a strong open source community?

Models? Red Hat, Cloudera, MySQL, Canonical, …

Initial funding from DreamHost, Mark Shuttleworth

Page 29: Keynote: Building Tomorrow's Ceph - Ceph Day Frankfurt

Goals

A stable Ceph release for production deployment: DreamObjects

Lay the foundation for widespread adoption: platform support (Ubuntu, Red Hat, SUSE)

Documentation

Build and test infrastructure

Build a sales and support organization

Expand engineering organization

Page 30: Keynote: Building Tomorrow's Ceph - Ceph Day Frankfurt

Branding

Early decision to engage a professional agency: MetaDesign

Terms like “Brand core”

“Design system”

Keep project and company independent: Inktank != Ceph

The Future of Storage

Page 31: Keynote: Building Tomorrow's Ceph - Ceph Day Frankfurt


Slick graphics, broken PowerPoint template

Page 32: Keynote: Building Tomorrow's Ceph - Ceph Day Frankfurt

Today: adoption


Page 34: Keynote: Building Tomorrow's Ceph - Ceph Day Frankfurt

Traction

Too many production deployments to count: we don't know about most of them!

Too many customers (for me) to count

Expansive partner list: lots of inbound interest

Lots of press and buzz

Page 35: Keynote: Building Tomorrow's Ceph - Ceph Day Frankfurt

Quality

Increased adoption means increased demands on robust testing

Across multiple platforms

Upgrades: rolling upgrades

Inter-version compatibility

Page 36: Keynote: Building Tomorrow's Ceph - Ceph Day Frankfurt

Developer community

Significant external contributors: many full-time contributors outside of Inktank

First-class feature contributions from external contributors

Non-Inktank participants in daily stand-ups

External access to build/test lab infrastructure

Common toolset: GitHub

Email (kernel.org)

IRC (oftc.net)

Linux distros

Page 37: Keynote: Building Tomorrow's Ceph - Ceph Day Frankfurt

CDS: Ceph Developer Summit

Community process for building project roadmap

100% online: Google Hangouts

Wikis

Etherpad

Quarterly: our 4th CDS is next week

Great participation

Ongoing indoctrination of Inktank engineers into the open development model

Page 38: Keynote: Building Tomorrow's Ceph - Ceph Day Frankfurt

Erasure coding

Replication for redundancy is flexible and fast

For larger clusters, it can be expensive

Erasure coded data is hard to modify, but ideal for cold or read-only objects

Will be used directly by radosgw

Coexists with new tiering capability

                  Storage overhead   Repair traffic   MTTDL (days)
3x replication    3x                 1x               2.3 E10
RS (10, 4)        1.4x               10x              3.3 E13
LRC (10, 6, 5)    1.6x               5x               1.2 E15
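
The storage overhead column follows directly from the coding parameters: an erasure code with k data chunks and m coding chunks stores (k + m) / k times the raw data, versus N times for N-way replication. A quick check in Python (the repair traffic and MTTDL figures come from the table above, not from this calculation):

```python
def replication_overhead(n: int) -> float:
    return float(n)                    # N full copies of every object

def ec_overhead(k: int, m: int) -> float:
    return (k + m) / k                 # k data chunks plus m coding chunks

print(replication_overhead(3))         # 3.0 -> "3x" replication
print(ec_overhead(10, 4))              # 1.4 -> RS (10, 4)
print(ec_overhead(10, 6))              # 1.6 -> LRC (10, 6, 5)
```

RS (10, 4) survives the loss of any 4 chunks at 1.4x overhead, while 3x replication survives only 2 failures at 3x; the trade-off is the higher repair traffic shown above.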

Page 39: Keynote: Building Tomorrow's Ceph - Ceph Day Frankfurt

Tiering

Client-side caches are great, but they only buy so much.

Separate hot and cold data onto different storage devices

Promote hot objects into a faster (e.g., flash-backed) cache pool

Push cold objects back into a slower (e.g., erasure-coded) base pool

Use bloom filters to track object temperature (see the sketch after this list)

Common in enterprise solutions; not found in open source scale-out systems

→ new (with erasure coding) in Firefly release
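
A minimal sketch of the bloom filter idea, making no assumptions about Ceph's internal data structures: record recent reads in one small bloom filter per time interval, and treat an object as hot (a promotion candidate) once it shows up in enough recent filters.

```python
import hashlib

class BloomFilter:
    """Tiny bloom filter: probabilistic set membership with no false negatives."""
    def __init__(self, bits: int = 1 << 16, hashes: int = 4):
        self.bits, self.hashes = bits, hashes
        self.array = bytearray(bits // 8)

    def _positions(self, key: str):
        for i in range(self.hashes):
            h = hashlib.sha256(f"{i}:{key}".encode()).digest()
            yield int.from_bytes(h[:8], "big") % self.bits

    def add(self, key: str):
        for p in self._positions(key):
            self.array[p // 8] |= 1 << (p % 8)

    def __contains__(self, key: str) -> bool:
        return all(self.array[p // 8] & (1 << (p % 8)) for p in self._positions(key))

# Hypothetical promotion policy: one filter per recent interval, promote on repeated hits.
recent = [BloomFilter() for _ in range(4)]        # e.g. the last 4 intervals

def record_read(obj: str):
    recent[0].add(obj)                            # current interval's filter

def should_promote(obj: str, threshold: int = 2) -> bool:
    return sum(obj in f for f in recent) >= threshold

record_read("rbd_data.abc.0000000000000042")
print(should_promote("rbd_data.abc.0000000000000042"))   # False: seen in only 1 interval
```

The attraction is the tiny, fixed memory footprint per interval: the cache tier can remember "was this object touched recently?" for millions of objects without keeping a per-object counter.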

Page 40: Keynote: Building Tomorrow's Ceph - Ceph Day Frankfurt

The Future


Page 41: Keynote: Building Tomorrow's Ceph - Ceph Day Frankfurt

Technical roadmap

How do we reach new use cases and users?

How do we better satisfy existing users?

How do we ensure Ceph can succeed in enough markets for supporting organizations to thrive?

Enough breadth to expand and grow the community

Enough focus to do well

Page 42: Keynote: Building Tomorrow's Ceph - Ceph Day Frankfurt

Multi-datacenter, geo-replication

Ceph was originally designed for single-DC clusters: synchronous replication

Strong consistency

Growing demand: enterprise disaster recovery

ISPs: replicating data across sites for locality

Two strategies: use-case-specific (radosgw, RBD), or

low-level capability in RADOS

Page 43: Keynote: Building Tomorrow's Ceph - Ceph Day Frankfurt

RGW: Multi-site and async replication

Multi-site, multi-cluster. Regions: east coast, west coast, etc.

Zones: radosgw sub-cluster(s) within a region

Can federate across same or multiple Ceph clusters

Sync user and bucket metadata across regions: global bucket/user namespace, like S3

Synchronize objects across zones: within the same region

Across regions

Admin control over which zones are master/slave

Page 44: Keynote: Building Tomorrow's Ceph - Ceph Day Frankfurt

RBD: block devices

Today: backup capability based on block device snapshots

Efficiently mirror changes between consecutive snapshots across clusters (see the sketch below)

Now supported/orchestrated by OpenStack

Good for coarse synchronization (e.g., hours or days)

Tomorrow: data journaling for async mirroring; pending blueprint at next week's CDS

Mirror active block device to remote cluster

Possibly with some configurable delay
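
To illustrate the snapshot-based mirroring described above, here is a hedged sketch that shells out to the rbd CLI: rbd export-diff emits the changes between two snapshots of an image, and rbd import-diff applies them to a copy on the remote cluster. The image, snapshot, and remote host names are placeholders, and error handling is minimal.

```python
import subprocess

def mirror_incremental(image: str, prev_snap: str, new_snap: str, remote_host: str):
    """Ship the delta between two snapshots of an RBD image to a remote cluster."""
    # Take the new snapshot that we will mirror up to.
    subprocess.run(["rbd", "snap", "create", f"{image}@{new_snap}"], check=True)

    # Export only the blocks that changed since prev_snap, streaming to stdout ("-").
    export = subprocess.Popen(
        ["rbd", "export-diff", "--from-snap", prev_snap, f"{image}@{new_snap}", "-"],
        stdout=subprocess.PIPE,
    )
    # Apply the diff on the remote cluster via ssh; the remote image must already
    # exist and already have prev_snap.
    subprocess.run(
        ["ssh", remote_host, "rbd", "import-diff", "-", image],
        stdin=export.stdout,
        check=True,
    )
    export.stdout.close()
    export.wait()

# Placeholder names; run e.g. hourly from cron for coarse-grained mirroring.
mirror_incremental("rbd/myimage", "backup-0100", "backup-0200", "backup-site")
```

This is exactly the coarse synchronization the slide describes: consistency is only at snapshot boundaries, which is why the journaling work is needed for true async mirroring.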

Page 45: Keynote: Building Tomorrow's Ceph - Ceph Day Frankfurt

Async replication in RADOS

One implementation to capture multiple use cases: RBD, CephFS, RGW, … everything built on RADOS

A harder problem. Scalable: 1000s of OSDs → 1000s of OSDs

Point-in-time consistency

Challenging research problem

→ Ongoing design discussion among developers

Page 46: Keynote: Building Tomorrow's Ceph - Ceph Day Frankfurt

CephFS

→ This is where it all started – let's get there

Today: stabilization of multi-MDS, directory fragmentation, QA

NFS, CIFS, Hadoop/HDFS bindings complete but not productized

Need: greater QA investment

Fsck

Snapshots

Amazing community effort (Intel, NUDT, and Kylin). 2014 is the year.

Page 47: Keynote: Building Tomorrow's Ceph - Ceph Day Frankfurt

Governance

How do we strengthen the project community?

2014 is the year

Recognized project leads: RBD, RGW, RADOS, CephFS, ...

Formalize emerging processes around CDS, community roadmap

External foundation?

Page 48: Keynote: Building Tomorrow's Ceph - Ceph Day Frankfurt

The larger ecosystem

Page 49: Keynote: Building Tomorrow's Ceph - Ceph Day Frankfurt

The enterprise

How do we pay for all of this?

Support legacy and transitional client/server interfaces

iSCSI, NFS, pNFS, CIFS, S3/Swift

VMware, Hyper-V

Identify the beachhead use cases: earn others later

Single platform – shared storage resource

Bottom-up: earn respect of engineers and admins

Top-down: strong brand and compelling product

Page 50: Keynote: Building Tomorrow's Ceph - Ceph Day Frankfurt

Why Ceph is the Future of Storage

It is hard to compete with free and open source software

Unbeatable value proposition

Ultimately a more efficient development model

It is hard to manufacture community

Strong foundational architecture

Next-generation protocols, Linux kernel support: unencumbered by legacy protocols like NFS

Move from client/server to client/cluster

Ongoing paradigm shift: software-defined infrastructure and data centers

Widespread demand for open platforms

Page 51: Keynote: Building Tomorrow's Ceph - Ceph Day Frankfurt


Thank you, and Welcome!