Sheepdog Status Report
Sheepdog Summit 2015, Liu Yuan

Agenda

Introduction - Sheepdog Overview

Past and Now - Sheepdog Community

Work in Progress – Problems and Solutions

Sheepdog Overview

Introduction

What is Sheepdog

• Distributed object storage system in user space
  – Manage disks and nodes
    • Aggregate the capacity and the power (IOPS + throughput)
    • Hide the failure of hardware
    • Dynamically grow or shrink the scale
  – Secure data
    • Provide redundancy mechanisms (replication and erasure coding) for high availability
    • Secure the data with auto-healing and auto-rebalancing mechanisms
  – Provide interfaces (in a single cluster)
    • Virtual volume for QEMU VM, iSCSI TGT (best supported)
    • RESTful container (OpenStack Swift and Amazon S3 compatible, in progress)
    • Storage for OpenStack Cinder, Glance, Nova (in progress)
    • POSIX file via NFS (in progress)
    • Linux block device

Disks and Nodes Management

[Diagram: three nodes, each with a Gateway and a Store on top of local disks of 1TB or 2TB. A 4TB disk is hot-plugged; a failed disk (X) is automatically unplugged on EIO.]

• Private hash ring: local rebalance
• Global consistent hash ring and P2P global rebalance
• No meta servers! ZooKeeper: membership management and message queue
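
The global consistent hash ring on this slide can be illustrated with a minimal sketch. It assumes SHA-1 as the hash, 64 virtual nodes per weight unit, and a hypothetical `HashRing` class; none of these come from Sheepdog's code. It shows only the general technique: objects map to points on a ring, and each object is stored on the first distinct members found clockwise from its point.

```python
import bisect
import hashlib


def h(key: str) -> int:
    """Map a string to a 64-bit point on the ring (the hash choice is an assumption)."""
    return int.from_bytes(hashlib.sha1(key.encode()).digest()[:8], "big")


class HashRing:
    """Hypothetical consistent hash ring with weighted virtual nodes."""

    def __init__(self, vnodes_per_unit: int = 64):
        self.vnodes_per_unit = vnodes_per_unit
        self.points = []   # sorted ring positions
        self.owner = {}    # ring position -> member name

    def add(self, member: str, weight: int = 1) -> None:
        # A member with more capacity gets more virtual nodes, hence more objects.
        for i in range(self.vnodes_per_unit * weight):
            p = h(f"{member}#{i}")
            bisect.insort(self.points, p)
            self.owner[p] = member

    def remove(self, member: str) -> None:
        self.points = [p for p in self.points if self.owner[p] != member]
        self.owner = {p: m for p, m in self.owner.items() if m != member}

    def locate(self, oid: str, copies: int = 1) -> list:
        """Walk clockwise from the object's point, collecting distinct members."""
        found = []
        if not self.points:
            return found
        start = bisect.bisect_right(self.points, h(oid))
        for k in range(len(self.points)):
            m = self.owner[self.points[(start + k) % len(self.points)]]
            if m not in found:
                found.append(m)
                if len(found) == copies:
                    break
        return found
```

Because every member can compute `locate` from the membership list alone, no metadata server is needed to find an object. When a member joins, only the objects whose first clockwise point now belongs to the new member have to move.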

Data Management

[Diagram: Full Replication, shown across three sheep nodes; Erasure Coding, shown across six sheep nodes, with some holding parity.]
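
To make the difference between full replication and parity concrete, here is a minimal sketch using a single XOR parity strip. XOR parity is the simplest erasure code and is used here as a stand-in; it is not the codec Sheepdog uses, and the function names are hypothetical.

```python
from functools import reduce


def xor_bytes(a: bytes, b: bytes) -> bytes:
    """Byte-wise XOR of two equal-length strips."""
    return bytes(x ^ y for x, y in zip(a, b))


def encode(data: bytes, k: int) -> list:
    """Split data into k equal strips (zero-padded) and append one XOR parity strip."""
    strip_len = -(-len(data) // k)  # ceiling division
    padded = data.ljust(strip_len * k, b"\0")
    strips = [padded[i * strip_len:(i + 1) * strip_len] for i in range(k)]
    return strips + [reduce(xor_bytes, strips)]


def recover(strips: list) -> list:
    """Rebuild at most one missing strip (marked None) by XOR-ing the survivors."""
    missing = [i for i, s in enumerate(strips) if s is None]
    if len(missing) > 1:
        raise ValueError("single parity can only repair one lost strip")
    if missing:
        survivors = [s for s in strips if s is not None]
        strips[missing[0]] = reduce(xor_bytes, survivors)
    return strips
```

With four data strips and one parity strip, the stored size is 1.25 times the object, compared with 3 times for three full replicas, at the cost of tolerating only one lost strip in this simplified scheme.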

Interfaces

[Diagram: a Sheepdog cluster of sheep nodes exposing the following interfaces.]

• Volume: QEMU
• LUN: iSCSI
• Block: SBD
• File: NFS
• Object: HTTP
• OpenStack: Glance, Nova, Cinder

Use Patterns

[Diagram: three deployments of SD (sheep daemon) nodes.]

• VMs running inside the Sheepdog cluster (SD and VM side by side on each node)
• HTTP object storage (SD nodes behind Nginx)
• LUN device pool (SD nodes as an iSCSI backend)

Sheepdog Community

Past and Now

People

• 2009.9 Kazutaka Morita: open sourced Sheepdog
• 2011.9 People from Taobao: add features, bug fixing, redesign
• 2012.4 Christoph Hellwig from Nebula: stayed for around half a year
• More production uses around the world: Valerio, Andy, startups in China and Japan
• 2014 People from Intel: add ISA-L for erasure coding
• 2015 People from China Mobile: make Sheepdog better

Patches

[Chart: Patches Per Year, 2009 to 2015, y-axis from 0 to 1200.]

● Patch counts peaked in 2012 and 2013 and have declined recently.

● Open sourcing the code is the easy part; building a community is really difficult.

● China Mobile is committed to releasing all its patches to the community.

Comparison with Ceph and GlusterFS

Pros:

Simplicity is Sheepdog's biggest advantage.

● Sheepdog: 20k+ lines in user space
● Ceph: 400k+ lines in user space and 20k+ in kernel
● GlusterFS: 330k+ lines in user space

Cons:

● No company behind it
● Inactive community
● Few users and few developers

But Sheepdog is not technically inferior! Simple doesn't mean bad!

Sheepdog-ng

Why?

We forked it in May because our stress tests kept hitting endless crashes and panics. I discussed a redesign with the NTT developers: remove the shared state between sheep nodes. They asked me to fork Sheepdog instead, simply because they don't use ZooKeeper. This is the same answer they always give users who ask about features they don't use themselves (e.g., object cache).

http://lists.wpkg.org/pipermail/sheepdog/2015-May/067736.html

The technical reason: share nothing, or share more and more state with overwhelming complexity.

The non-technical reason: the community is not as friendly and open as before. We want to build a real community-based project.

Subscribe to the list: send email to [email protected]

Problems and Solutions

Work in Progress

iSCSI Target Scalability

[Diagram: STGT vs. the new target, each serving LUN1 and LUN2 on top of sheep. STGT: a single main thread, sync requests, max requests == number of workers. New target: one thread per LUN, async requests, unlimited!]

Problems:

● The OS tends to issue more and more requests (blk-mq, scsi-mq)
● A single LUN can saturate STGT; it does not scale at all
● STGT takes too many resources
● Multipath is not so good

Solution – Rewrite

● From sync to async, with fewer threads and FDs
● Tailored for Sheepdog
● Add IO rebalance and cache support
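
The "max requests == number of workers" limit can be seen in a small sketch. It is not STGT or the new target: `time.sleep` and `asyncio.sleep` stand in for a round trip to sheep, and `Gauge`, `run_sync` and `run_async` are hypothetical names. The sketch only measures how many requests are in flight at once under each model.

```python
import asyncio
import threading
import time
from concurrent.futures import ThreadPoolExecutor


class Gauge:
    """Tracks how many requests are in flight and the peak seen so far."""

    def __init__(self):
        self.lock = threading.Lock()
        self.now = 0
        self.peak = 0

    def enter(self):
        with self.lock:
            self.now += 1
            self.peak = max(self.peak, self.now)

    def leave(self):
        with self.lock:
            self.now -= 1


def run_sync(requests: int, workers: int, delay: float = 0.01) -> int:
    """Each worker thread blocks for the whole request, as in a sync target."""
    gauge = Gauge()

    def handle(_):
        gauge.enter()
        time.sleep(delay)      # stands in for a blocking round trip to sheep
        gauge.leave()

    with ThreadPoolExecutor(max_workers=workers) as pool:
        list(pool.map(handle, range(requests)))
    return gauge.peak


async def _run_async(requests: int, delay: float) -> int:
    gauge = Gauge()

    async def handle(_):
        gauge.enter()
        await asyncio.sleep(delay)   # the thread is free while the reply is pending
        gauge.leave()

    await asyncio.gather(*(handle(i) for i in range(requests)))
    return gauge.peak


def run_async(requests: int, delay: float = 0.01) -> int:
    """All requests are issued from one thread without waiting for replies."""
    return asyncio.run(_run_async(requests, delay))
```

In the sync model the peak can never exceed the worker count, so a deeper queue from the initiator only adds waiting. In the async model a single thread keeps every issued request in flight, which is why the rewrite needs fewer threads.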

Performance Degradation

[Diagram: a node fails (X). With the dynamic hash ring (DHR), IO hangs and resumes later. With the static hash ring (SHR), the IO to the failed node is dropped.]

Problem with the default dynamic hash ring

● If an object is in recovery, we need to wait!
● What makes it worse: recovery IO competes with user IO for bandwidth and CPU
● Neither slow nor fast recovery is satisfactory

Solution – Static Hash Ring

The failure of a node won't change the hash ring. Trade data reliability for performance! We don't recover an object if some of its redundant data is missing. Useful for small clusters that mostly deal with single-node events.
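
The contrast between the two rings can be sketched as two placement functions. As a simplification, nodes are ranked per object by hashing the (object, node) pair, which stands in for walking a real hash ring; the function names and the SHA-1 choice are assumptions, not Sheepdog code.

```python
import hashlib


def ring_order(oid: str, nodes: list) -> list:
    """Rank nodes for an object by hashing (oid, node); a stand-in for a hash ring."""
    return sorted(nodes, key=lambda n: hashlib.sha1(f"{oid}:{n}".encode()).digest())


def targets_dynamic(oid: str, members: list, alive: set, copies: int) -> list:
    """DHR: the ring is rebuilt from live nodes, so placement can change."""
    return ring_order(oid, [n for n in members if n in alive])[:copies]


def targets_static(oid: str, members: list, alive: set, copies: int) -> list:
    """SHR: placement is fixed by full membership; IO to a dead target is dropped."""
    return [n for n in ring_order(oid, members)[:copies] if n in alive]
```

Under the dynamic ring, any object that had a copy on the failed node gets a new target, and that copy has to be recovered before IO on the object can proceed. Under the static ring the target list only shrinks, so IO continues at once with fewer copies until the node returns.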

Live Patching

[Diagram: call chain A ----> B ----> C. After patching, B is replaced by B`, which is loaded by Linux's dynamic loader on the fly.]

Sheep tracer

Similar to Linux's Ftrace, the tracer virtually adds a constructor and a destructor to every function. This mechanism relies on the 5-byte space (a.k.a. mcount) injected by GCC beforehand.

Based on the tracer, we can replace any function in the sheep daemon on the fly.

Useful for one-line bug fixes, but limited to the function level.

NFS Server

Current status:

Just a toy with file size < 4M. NFSv3 is not fully supported and there is virtually no file system code (need to implement inode, dentry and free space management).

Todos

- finish stubs
- add extents to file allocation
- add a btree or hash based kv store to manage dentries
- implement a multi-threaded SUNRPC to replace the poorly performing glibc RPC
- implement NFSv4

OpenStack

Cinder - Block Storage
– Supported since day 1

Glance - Image Storage
– Support merged in the Havana release

Nova - Ephemeral Storage
– Not yet started

Swift - Object Storage
– Swift API compatibility in progress

Final Goal - Unified Storage
– Copy-on-write anywhere?
– Data dedup?

[Diagram: Cinder, Glance, Nova and Swift on top of unified storage provided by a cluster of sheep nodes.]

Plan to rewrite the driver with libsheepdog.so

Enjoy yourself in Suzhou