THE CURRENT AND THE FUTURE OF CEPH

HAOMAI WANG

2015.10.30

ABOUT

I’M HAOMAI WANG

▸ Ceph core developer

▸ GSOC 2014, 2015 Ceph mentor

▸ Maintain KeyValueStore and AsyncMessenger, focus on performance optimization

▸ Involved in databases, local file systems and storage

▸ NetBSD on VirtualBox author

▸ haomaiwang@gmail.com

AGENDA

▸ What is Ceph?

▸ The current Ceph and the roadmap

WHAT IS CEPH?

CEPH MOTIVATION PRINCIPLES

▸ everything must scale horizontally

▸ no single point of failure

▸ commodity hardware

▸ self-manage whenever possible

▸ move beyond legacy approaches

▸ client/cluster instead of client/server

▸ avoid ad hoc high-availability

▸ open source

WHAT IS CEPH?

CEPH ECOSYSTEM

WHAT IS CEPH?

FEATURES

WHAT IS CEPH?

REPLICATION/TIERING

WHAT IS CEPH?

CRUSH

▸ Ceph's data distribution mechanism

▸ Pseudo-random placement algorithm

▸ Deterministic function of inputs

▸ Clients can compute data locations themselves (see the placement sketch after this list)

▸ Rule-based configuration

▸ Desired/required replica count

▸ Affinity/distribution rules

▸ Infrastructure topology

▸ Weighting

▸ Excellent data distribution

▸ De-clustered placement

▸ Excellent data-re-distribution

▸ Migration proportional to change

▸ failure prediction*
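The key property behind the bullets above is that placement is a deterministic, pseudo-random function of the object name, the cluster map and the rules, so any client can recompute it locally. A toy sketch of the idea in Python (a weighted rendezvous-hash stand-in, not the real straw/bucket algorithms, with made-up OSD ids and weights):

```python
import hashlib

# Toy cluster map: OSD id -> weight. Real CRUSH also encodes topology
# (hosts, racks, rooms) and per-rule failure-domain constraints.
OSD_WEIGHTS = {0: 1.0, 1: 1.0, 2: 1.0, 3: 2.0}

def straw_draw(obj_name, osd, weight):
    """Deterministic pseudo-random 'straw length' for an (object, OSD) pair."""
    h = hashlib.sha1("{}:{}".format(obj_name, osd).encode()).hexdigest()
    return (int(h, 16) % 1000000) * weight

def crush_like_place(obj_name, replicas=3):
    """Pick the `replicas` OSDs holding the longest straws for this object."""
    straws = {osd: straw_draw(obj_name, osd, w) for osd, w in OSD_WEIGHTS.items()}
    return sorted(straws, key=straws.get, reverse=True)[:replicas]

# Any client recomputes the same placement, with no directory lookup:
print(crush_like_place("rbd_data.1234.0000000000000042"))
```

Because each OSD's draw depends only on the object name and that OSD, adding or removing an OSD only moves the data whose winning draw changes, which is why migration stays roughly proportional to the change.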

WHAT IS CEPH?

USE CASES

▸ The largest Ceph cluster: CERN

▸ Yahoo Flickr

▸ SourceForge

▸ DreamHost

▸ eBay

▸ Deutsche Telekom AG

▸ OpenStack clouds (~44%)

WHAT IS CEPH?

VENDORS

▸ Red Hat

▸ Intel

▸ SanDisk

▸ Samsung

▸ Fujitsu

▸ SUSE

▸ Canonical

THE CURRENT CEPH AND THE ROADMAP

INTERNAL OVERVIEW

[Diagram: the client path runs Application → LibRBD / RadosGW → LibRados session → Messenger; inside the OSD, the Messenger Layer (sockets, TCP/IP, Ethernet) feeds the Dispatch Layer (replicated IO, recovery, scrub, tiering, scheduler, threads, queues) and the ObjectStore Layer (FileJournal + FileStore on a local file system or block device interface), all mapped onto OS and hardware resources: virtual memory, DRAM, memory library, CPU interconnect, IO controller/disk, network controller/port.]

THE CURRENT CEPH AND THE ROADMAP

CEPH STORAGE ENGINE

▸ FileStore

▸ NewStore: Replacing FileStore*

▸ KeyValueStore

▸ LevelDB/RocksDB/LMDB

▸ Kinetic API

▸ Samsung uFTL*

▸ Sandisk SSD Library*

▸ MemStore

▸ Memory Management (malloc/free)

▸ NVM (PMBackend, libpmem)*
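All of the backends above sit behind Ceph's ObjectStore abstraction, which applies batches of object mutations as transactions. A rough conceptual sketch in Python (the names are illustrative; the real interface is the C++ ObjectStore/Transaction API):

```python
class Transaction:
    """A batch of object mutations that a backend applies atomically."""
    def __init__(self):
        self.ops = []

    def write(self, oid, offset, data):
        self.ops.append(("write", oid, offset, data))

    def remove(self, oid):
        self.ops.append(("remove", oid))


class ToyMemStore:
    """Toy in-memory backend, loosely analogous to Ceph's MemStore."""
    def __init__(self):
        self.objects = {}                      # oid -> bytearray

    def queue_transaction(self, txn):
        # A real backend journals and applies asynchronously; this toy
        # version applies synchronously for clarity.
        for op in txn.ops:
            if op[0] == "write":
                _, oid, offset, data = op
                buf = self.objects.setdefault(oid, bytearray())
                buf[offset:offset + len(data)] = data
            elif op[0] == "remove":
                self.objects.pop(op[1], None)

    def read(self, oid, offset, length):
        return bytes(self.objects.get(oid, b"")[offset:offset + length])


store = ToyMemStore()
t = Transaction()
t.write("obj1", 0, b"hello objectstore")
store.queue_transaction(t)
print(store.read("obj1", 0, 17))
```

FileStore implements this contract on a POSIX file system with a write-ahead FileJournal, KeyValueStore maps it onto a key/value database, and MemStore keeps everything in RAM.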

THE CURRENT CEPH AND THE ROADMAP

THE NEW TIERING

▸ The new storage mountain

▸ The new challenge:

▸ More storage medium

▸ More complex ways to manage it

▸ Data lake

▸ Migrate data with “temperature”
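Migration by "temperature" means promoting or demoting objects based on how recently and how often they are accessed. A toy illustration (the threshold and tier names are invented for the example):

```python
import time

# Invented threshold: objects read this often in the last hour count as "hot".
HOT_HITS_PER_HOUR = 10

def choose_tier(access_times, now=None):
    """Classify an object by its recent access temperature."""
    now = now if now is not None else time.time()
    recent_hits = sum(1 for t in access_times if now - t < 3600)
    return "ssd-cache-tier" if recent_hits >= HOT_HITS_PER_HOUR else "hdd-base-tier"

# 20 reads in the last 20 minutes -> keep on the fast tier.
accesses = [time.time() - i * 60 for i in range(20)]
print(choose_tier(accesses))
```

Ceph's cache tiering makes a similar decision with per-object hit-set statistics, promoting on access and flushing/evicting cold objects back to the base pool.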

THE CURRENT CEPH AND THE ROADMAP

NETWORK

▸ TCP Messenger

▸ POSIX sockets

▸ DPDK*

▸ SolarFlare*

▸ RDMA

THE CURRENT CEPH AND THE ROADMAP

QOS

▸ Priority based

▸ client priority

▸ message priority

▸ mClock algorithm*

▸ each message carries a "tag"

▸ window sizes are exchanged peer to peer (see the dispatch-queue sketch after this list)
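A minimal sketch of the priority-based part above: higher-priority messages are dispatched first, FIFO within a priority. mClock goes further, tracking per-client reservation/weight/limit tags, which this toy queue does not attempt:

```python
import heapq
import itertools

class PrioritizedDispatchQueue:
    """Toy dispatch queue: highest priority first, FIFO within a priority."""
    def __init__(self):
        self._heap = []
        self._seq = itertools.count()        # tie-breaker keeps FIFO order

    def enqueue(self, priority, message):
        # heapq is a min-heap, so negate the priority for "higher first".
        heapq.heappush(self._heap, (-priority, next(self._seq), message))

    def dequeue(self):
        return heapq.heappop(self._heap)[2]

q = PrioritizedDispatchQueue()
q.enqueue(63, "client A read")       # example client-op priority
q.enqueue(10, "background scrub")    # example low-priority internal work
q.enqueue(63, "client B write")
print(q.dequeue())                   # -> "client A read"
```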

THE CURRENT CEPH AND THE ROADMAP

LIBRADOS

▸ Object

▸ Name

▸ Attributes

▸ Data

▸ key/value data

▸ random access insertion, deletion, range query/list

▸ Operation

▸ CAS(Compare And Swap)

▸ Group Operation: Atomic, Rollback

▸ Snapshot: Object Granularity

▸ Copy On Write

▸ Rados Classes

▸ code runs directly inside storage server I/O path

▸ Watch/Notify

▸ Multi Object Transactions*
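A short usage sketch of the object model above with the Python librados binding (the pool name "data" and the conf path are assumptions, and the omap write-op helpers may differ slightly between binding versions):

```python
import rados

# Connect and open an I/O context on a pool (names are placeholders).
cluster = rados.Rados(conffile='/etc/ceph/ceph.conf')
cluster.connect()
ioctx = cluster.open_ioctx('data')

# Object data
ioctx.write_full('greeting', b'hello rados')
print(ioctx.read('greeting'))

# Object attributes (xattrs)
ioctx.set_xattr('greeting', 'lang', b'en')
print(ioctx.get_xattr('greeting', 'lang'))

# Key/value (omap) entries attached to the same object
with rados.WriteOpCtx() as op:
    ioctx.set_omap(op, ('color',), (b'blue',))
    ioctx.operate_write_op(op, 'greeting')

ioctx.close()
cluster.shutdown()
```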

THE CURRENT CEPH AND THE ROADMAP

RADOS CLASSES - COMPUTE IN STORAGE SIDE

▸ write new RADOS “methods”

▸ code runs directly inside storage server I/O path

▸ simple plugin API; admin deploys a .so

▸ read-side methods

▸ process data, return result

▸ write-side methods

▸ process, write; read, modify, write

▸ generate an update transaction that is applied atomically

▸ Use cases:

▸ distributed “grep”

▸ Lua interpreter
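Conceptually, a read-side method runs next to the object and ships back only the computed result rather than the whole object. A pure-Python analogy of the distributed "grep" case (an analogy only; real classes are written against Ceph's C++ cls plugin API and loaded as .so files on the OSDs):

```python
# OSD-side handler: read the object locally, process it, return the result.
def cls_grep(read_object, pattern):
    data = read_object()                       # local read inside the I/O path
    return [line for line in data.decode().splitlines() if pattern in line]

# The client would only send the method name and its arguments;
# here we simulate the OSD executing it against locally stored data.
stored = b"error: disk full\ninfo: scrub ok\nerror: timeout\n"
print(cls_grep(lambda: stored, "error"))       # only the matches cross the network
```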

THE CURRENT CEPH AND THE ROADMAP

RBD

▸ Thin Provision

▸ Snapshot

▸ Clone

▸ Multi-Client Support

▸ Kernel Client

▸ KVM/XEN

▸ VMware VVol*

▸ iSCSI

▸ LIO TCMU + loopback (FUSE)*

▸ Active/Passive*

▸ Active/Active**
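A short sketch of thin provisioning, snapshots and clones with the Python rbd binding (pool, image names and size are made up; older releases may need image format 2 / the layering feature enabled explicitly before cloning):

```python
import rados
import rbd

cluster = rados.Rados(conffile='/etc/ceph/ceph.conf')
cluster.connect()
ioctx = cluster.open_ioctx('rbd')

# Thin-provisioned 10 GiB image: space is consumed only as data is written.
rbd.RBD().create(ioctx, 'vm-disk', 10 * 1024 ** 3)

image = rbd.Image(ioctx, 'vm-disk')
image.write(b'boot data', 0)
image.create_snap('golden')          # point-in-time snapshot
image.protect_snap('golden')         # protection is required before cloning
image.close()

# Copy-on-write clone of the protected snapshot.
rbd.RBD().clone(ioctx, 'vm-disk', 'golden', ioctx, 'vm-disk-clone')

ioctx.close()
cluster.shutdown()
```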

THE CURRENT CEPH AND THE ROADMAP

RADOSGW

▸ S3/Swift

▸ Active/Slave

▸ One Writer

▸ Multi Active Sites*

▸ Hadoop/Spark FileSystem Interface*

▸ NFS protocol aware*
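Because RGW speaks the S3 (and Swift) APIs, stock clients work against it unchanged. A sketch with boto3 (endpoint, credentials and bucket name are placeholders; RGW users and keys come from radosgw-admin):

```python
import boto3

# Placeholder endpoint and credentials for a RADOS Gateway instance.
s3 = boto3.client(
    's3',
    endpoint_url='http://rgw.example.com:7480',
    aws_access_key_id='ACCESS_KEY',
    aws_secret_access_key='SECRET_KEY',
)

s3.create_bucket(Bucket='demo-bucket')
s3.put_object(Bucket='demo-bucket', Key='hello.txt', Body=b'stored via RGW')
print(s3.get_object(Bucket='demo-bucket', Key='hello.txt')['Body'].read())
```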

THE CURRENT CEPH AND THE ROADMAP

CEPHFS

▸ Dynamic subtree partitioning

▸ Strictly POSIX compatible

▸ NFS

▸ QEMU VM

▸ VirtFS

▸ NFS over vsock

▸ FSCK

▸ Multi-tenant
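A small usage sketch with the libcephfs Python binding (the conf path and directory names are placeholders, and exact call signatures vary a little between releases):

```python
import os
import cephfs

fs = cephfs.LibCephFS(conffile='/etc/ceph/ceph.conf')
fs.mount()                              # attach to the file system

fs.mkdir('/demo', 0o755)
fd = fs.open('/demo/hello.txt', os.O_CREAT | os.O_WRONLY, 0o644)
fs.write(fd, b'hello cephfs', 0)        # POSIX-style write at offset 0
fs.close(fd)

fs.shutdown()
```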

THANK YOU!

2015.10

END