Sheepdog Status Report
Sheepdog Summit 2015, Liu Yuan
Agenda
Introduction - Sheepdog Overview
Past and Now - Sheepdog Community
Work In Progress – Problems and Solutions
Sheepdog Overview
Introduction
What is Sheepdog

• Distributed Object Storage System in User Space
– Manage Disks and Nodes
• Aggregate capacity and performance (IOPS + throughput)
• Hide hardware failures
• Dynamically grow or shrink the cluster
– Secure Data
• Provide redundancy mechanisms (replication and erasure coding) for high availability
• Secure the data with auto-healing and auto-rebalancing mechanisms
– Provide Interfaces (in a single cluster)
• Virtual volumes for QEMU VMs and iSCSI TGT (best supported)
• RESTful container (OpenStack Swift and Amazon S3 compatible, in progress)
• Storage for OpenStack Cinder, Glance, Nova (in progress)
• POSIX file via NFS (in progress)
• Linux block device
Disks and Nodes Management

[Diagram: each node runs a Gateway and a Store over local disks of mixed sizes (1TB, 2TB, 4TB); a private hash ring per node rebalances data locally across its own disks]

• Global consistent hash ring with P2P global rebalance
• No meta servers! Zookeeper handles membership management and the message queue
• Disks can be hot-plugged, and are automatically unplugged on EIO
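The global consistent hash ring above can be sketched roughly as follows. This is a minimal illustration only; the virtual-node count, hash choice, and node names are assumptions, not Sheepdog's actual code:

```python
import hashlib
from bisect import bisect_right, insort

def _hash(key: str) -> int:
    # Stable hash so placement is deterministic across runs and nodes.
    return int(hashlib.sha1(key.encode()).hexdigest(), 16)

class HashRing:
    """Toy consistent hash ring: each node appears at many virtual points,
    so adding or removing a node only remaps roughly 1/N of the objects."""

    def __init__(self, nodes, vnodes=64):
        self.vnodes = vnodes
        self.ring = []                      # sorted (point, node) pairs
        for node in nodes:
            self.add_node(node)

    def add_node(self, node):
        for i in range(self.vnodes):
            insort(self.ring, (_hash(f"{node}-{i}"), node))

    def remove_node(self, node):
        self.ring = [(p, n) for p, n in self.ring if n != node]

    def locate(self, oid: str):
        # An object lives on the first node clockwise from its hash point.
        points = [p for p, _ in self.ring]
        i = bisect_right(points, _hash(oid)) % len(self.ring)
        return self.ring[i][1]
```

Removing a node only remaps the objects that lived on it, which is why rebalance traffic stays proportional to the lost capacity rather than reshuffling the whole cluster.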
Data Management

[Diagram: full replication stores complete copies of each object on several sheep; erasure coding stripes an object across more sheep as data strips plus parity]

• Full Replication
• Erasure Coding (data + parity)
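To make the parity idea concrete, here is a toy single-parity scheme in the spirit of RAID-5. Sheepdog's real erasure coding uses Reed-Solomon codes (via isa-l) and can tolerate multiple losses, which this sketch cannot:

```python
def encode(data: bytes, k: int):
    """Split data into k equal strips plus one XOR parity strip."""
    strip_len = -(-len(data) // k)                      # ceiling division
    strips = [data[i * strip_len:(i + 1) * strip_len].ljust(strip_len, b"\0")
              for i in range(k)]
    parity = bytearray(strip_len)
    for strip in strips:
        for i, b in enumerate(strip):
            parity[i] ^= b
    return strips, bytes(parity)

def recover(strips, parity, lost: int):
    """XOR of the surviving strips and the parity rebuilds the lost strip."""
    out = bytearray(parity)
    for j, strip in enumerate(strips):
        if j != lost:
            for i, b in enumerate(strip):
                out[i] ^= b
    return bytes(out)
```

With k data strips and one parity strip the storage overhead is 1/k, versus the (copies - 1)x overhead of full replication, which is the main reason to offer both modes.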
Interfaces

[Diagram: QEMU volumes, iSCSI object LUNs, HTTP objects, NFS files, SBD block devices, and OpenStack (Cinder, Glance, Nova) all served by the sheep daemons]

• Volume: QEMU, iSCSI LUN
• Object: HTTP
• File: NFS
• Block: Linux SBD
• OpenStack: Cinder, Glance, Nova
Use Patterns

• VMs running inside the Sheepdog cluster (each node hosts both a sheep daemon and VMs)
• HTTP object storage (sheep cluster behind Nginx)
• LUN device pool as an iSCSI backend
Sheepdog Community
Past and Now
People

• Kazutaka Morita (2009.9): open sourced Sheepdog
• People from Taobao (2011.9): added features, fixed bugs, redesigned internals
• Christoph Hellwig from Nebula (2012.4): stayed for around half a year
• More production users around the world: Valerio, Andy, startups in China and Japan
• People from Intel (2014): added isa-l for erasure coding
• People from China Mobile (2015): making Sheepdog better
Patches

[Chart: patches per year, 2009–2015, ranging from 0 to about 1200]

● Patch volume peaked in 2012 and 2013 and has declined recently.
● It is always easier to open source the code; building a community is really difficult.
● China Mobile is committed to releasing all its patches to the community.
Comparison with Ceph and GlusterFS
Pros:
The simplicity is the biggest advantage for Sheepdog
Sheepdog: 20k+ lines in user space
Ceph: 400k+ lines in user space and 20k+ in the kernel
GlusterFS: 330k+ lines in user space
Cons:
● No company behind it
● Inactive community
● Few users and few developers
But Sheepdog is not technically inferior! Simplicity does not mean low quality!
Sheepdog-ng
Why? We forked Sheepdog in May after endless crashes and panics under our stress tests. I discussed a redesign with the NTT developers to remove the shared state between sheep nodes. They asked me to fork Sheepdog instead, simply because they don't use Zookeeper, just as they have always replied to users asking about features they themselves don't use (e.g., the object cache).
http://lists.wpkg.org/pipermail/sheepdog/2015-May/067736.html
The technical reason: share nothing, or keep sharing more and more state with overwhelming complexity.
The non-technical reason: the community is not as friendly and open as it used to be. We want to build a real community-based project.
Subscribe to the list: send email to [email protected]
Problems and Solutions
Work In Progress
iSCSI Target Scalability

[Diagram: the old STGT path is synchronous, handled by a main thread, with in-flight requests capped at the number of workers; the new target is asynchronous, with a thread per LUN and no request limit]

Problems:
● The OS tends to issue more and more requests (blk-mq, scsi-mq)
● A single LUN can saturate stgt; it does not scale at all
● STGT takes too many resources
● Multipath support is not so good

Solution – rewrite a new target:
● From sync to async: fewer threads and fds
● Tailored for Sheepdog
● Adds I/O rebalance and cache support
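The "thread per LUN, async" design can be sketched with one queue per LUN: submission never blocks the caller, and a busy LUN cannot starve the others. A hypothetical illustration, not the new target's actual code:

```python
import queue
import threading

class LunWorker:
    """One worker thread per LUN draining its own request queue, so the
    number of in-flight requests is no longer capped by a worker pool."""

    def __init__(self, name):
        self.name = name
        self.q = queue.Queue()              # unbounded in-flight requests
        self.thread = threading.Thread(target=self._run, daemon=True)
        self.thread.start()

    def submit(self, req, done):
        # Async submit: the caller queues the request and returns at once.
        self.q.put((req, done))

    def _run(self):
        while True:
            req, done = self.q.get()
            if req is None:                 # shutdown sentinel
                return
            done(f"{self.name}: completed {req}")   # stand-in for real I/O

    def stop(self):
        self.q.put((None, None))
        self.thread.join()
```

Because each LUN owns its queue and thread, a request flood against one LUN piles up only in that LUN's queue instead of exhausting a shared worker pool, which is the scalability problem described above.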
Performance Degradation

Problem with the default dynamic hash ring (DHR):
● If an object is in recovery, I/O to it must wait (I/O hangs until recovery releases it)
● What makes it worse, recovery I/O competes with user I/O for bandwidth and CPU
● Neither slow nor fast recovery is satisfactory

Solution – Static Hash Ring (SHR):
Failure of a node won't change the hash ring: trade data reliability for performance! We don't recover an object if some of its redundant data is missing; I/O destined for the failed node is simply dropped. Useful for small clusters that mostly deal with single-node events.
Live Patching

Call chain: A ----> B ----> C
After patching: B is replaced by B`, which is loaded by Linux's dynamic loader on the fly.

Sheep tracer: similar to Linux's ftrace, it virtually adds a constructor and destructor to every function. The mechanism relies on the 5-byte space (a.k.a. mcount) injected by GCC beforehand.

Based on the tracer, we can replace any function in the sheep daemon on the fly. This is useful for one-liner bug fixes, but is limited to the function level.
NFS Server

Current status:
Just a toy: file size is limited to < 4M, NFSv3 is not fully supported, and there is virtually no file system code yet (inode, dentry, and free space management still need to be implemented).

Todos:
- finish the stubs
- add extent-based file allocation
- add a btree- or hash-based KV store to manage dentries
- implement a multi-threaded SUNRPC to replace the poorly performing glibc RPC
- implement NFSv4
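For the extent-based allocation todo, a first-fit extent allocator could look roughly like this; a hypothetical sketch, not the planned implementation:

```python
class ExtentAllocator:
    """First-fit allocation of contiguous block runs: a file maps to a few
    (start, length) extents instead of one pointer per block."""

    def __init__(self, total_blocks):
        self.free = [(0, total_blocks)]     # sorted list of free extents

    def alloc(self, nblocks):
        for i, (start, length) in enumerate(self.free):
            if length >= nblocks:
                if length == nblocks:
                    self.free.pop(i)        # extent fully consumed
                else:
                    self.free[i] = (start + nblocks, length - nblocks)
                return (start, nblocks)
        raise MemoryError("no contiguous extent large enough")

    def release(self, start, nblocks):
        # Reinsert and coalesce adjacent extents to fight fragmentation.
        extents = sorted(self.free + [(start, nblocks)])
        merged = []
        for s, l in extents:
            if merged and merged[-1][0] + merged[-1][1] == s:
                merged[-1][1] += l
            else:
                merged.append([s, l])
        self.free = [tuple(e) for e in merged]
```

Extents keep file metadata small for large sequential files, which is why they beat per-block pointers for an NFS-backed store.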
Openstack

Cinder - Block Storage
– Supported since day 1
Glance - Image Storage
– Support merged in the Havana release
Nova - Ephemeral Storage
– Not yet started
Swift - Object Storage
– Swift-API-compatible support in progress
Final Goal - Unified Storage
– Copy-on-write anywhere?
– Data dedup?

[Diagram: Cinder, Glance, Nova, and Swift all backed by the same sheep cluster as unified storage]

Plan to rewrite the driver with libsheepdog.so
Enjoy yourself in Suzhou