KEEPING OPENSTACK STORAGE TRENDY WITH CEPH AND CONTAINERS
SAGE WEIL, HAOMAI WANG
OPENSTACK SUMMIT - 2015.05.20
[Diagram: a web application backed by multiple app servers]
A CLOUD SMORGASBORD
● Compelling clouds offer options
● Compute
– VM (KVM, Xen, …)
– Containers (lxc, Docker, OpenVZ, ...)
● Storage
– Block (virtual disk)
– File (shared)
– Object (RESTful, …)
– Key/value
– NoSQL
– SQL
WHY CONTAINERS?
Technology
● Performance
– Shared kernel
– Faster boot
– Lower baseline overhead
– Better resource sharing
● Storage
– Shared kernel → efficient IO
– Small image → efficient deployment
Ecosystem
● Emerging container host OSs
– Atomic – http://projectatomic.io
● os-tree (s/rpm/git/)
– CoreOS
● systemd + etcd + fleet
– Snappy Ubuntu
● New app provisioning model
– Small, single-service containers
– Standalone execution environment
● New open container spec nulecule
– https://github.com/projectatomic/nulecule
WHY NOT CONTAINERS?
Technology
● Security
– Shared kernel
– Limited isolation
● OS flexibility
– Shared kernel limits OS choices
● Inertia
Ecosystem
● New models don't capture many legacy services
WHY CEPH?
● All components scale horizontally
● No single point of failure
● Hardware agnostic, commodity hardware
● Self-manage whenever possible
● Open source (LGPL)
● Move beyond legacy approaches
– client/cluster instead of client/server
– avoid ad hoc HA
CEPH COMPONENTS
RGW – A web services gateway for object storage, compatible with S3 and Swift
LIBRADOS – A library allowing apps to directly access RADOS (C, C++, Java, Python, Ruby, PHP)
RADOS – A software-based, reliable, autonomous, distributed object store comprised of self-healing, self-managing, intelligent storage nodes and lightweight monitors
RBD – A reliable, fully-distributed block device with cloud platform integration
CEPHFS – A distributed file system with POSIX semantics and scale-out metadata management
[Diagram: apps consume RGW and LIBRADOS, hosts/VMs consume RBD, clients consume CEPHFS]
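LIBRADOS has the language bindings listed above, but the quickest way to poke at RADOS directly is the rados CLI that ships with Ceph. A minimal sketch; the pool and object names here are made up for illustration:

    # store, fetch, and list objects in the 'data' pool
    rados -p data put greeting ./hello.txt
    rados -p data get greeting -     # '-' writes the object to stdout
    rados -p data ls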
EXISTING BLOCK STORAGE MODEL
● VMs are the unit of cloud compute
● Block devices are the unit of VM storage
– ephemeral: not redundant, discarded when VM dies
– persistent volumes: durable, (re)attached to any VM
● Block devices are single-user
● For shared storage,
– use objects (e.g., Swift or S3)
– use a database (e.g., Trove)
– ...
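To make the volume model concrete, a sketch using the cinder and nova CLIs of the period; the volume name, server ID, and volume ID are placeholders:

    # create a 10 GB persistent volume and attach it to a VM
    cinder create --display-name web-data 10
    nova volume-attach <server-id> <volume-id> /dev/vdb
    # detach and re-attach elsewhere; the data is durable
    nova volume-detach <server-id> <volume-id>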
KVM + LIBRBD.SO
● Model
– Nova → libvirt → KVM → librbd.so
– Cinder → rbd.py → librbd.so
– Glance → rbd.py → librbd.so
● Pros
– proven
– decent performance
– good security
● Cons
– performance could be better
● Status
– most common deployment model today (~44% in latest survey)
[Diagram: Nova and Cinder drive QEMU/KVM with librbd against a RADOS cluster and its monitors (M)]
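Outside of OpenStack, the same librbd path can be exercised directly with QEMU. A minimal sketch, assuming a pool named 'rbd', a cephx user 'cinder', and a QEMU built with librbd support:

    # create an image via librbd and boot QEMU against it
    qemu-img create -f raw rbd:rbd/vm-disk 10G
    qemu-system-x86_64 -m 2048 \
        -drive format=raw,file=rbd:rbd/vm-disk:id=cinder:conf=/etc/ceph/ceph.conf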
MULTIPLE CEPH DRIVERS
● librbd.so
– qemu-kvm
– rbd-fuse (experimental)
● rbd.ko (Linux kernel)
– /dev/rbd*
– stable and well-supported on modern kernels and distros
– some feature gap
● no client-side caching
● no “fancy striping”
– performance delta
● more efficient → more IOPS
● no client-side cache → higher latency for some workloads
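A minimal sketch of the rbd.ko path; the pool and image names are illustrative:

    rbd create rbd/myimage --size 10240   # size in MB
    sudo rbd map rbd/myimage              # device appears as /dev/rbd*
    sudo mkfs.ext4 /dev/rbd0
    sudo mount /dev/rbd0 /mnt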
LXC + CEPH.KO
● The model
– libvirt-based lxc containers
– map kernel RBD on host
– pass the host device through libvirt into the container
● Pros
– fast and efficient
– implement existing Nova API
● Cons
– weaker security than VM
● Status
– lxc is maintained
– lxc is less widely used
– no prototype
[Diagram: Nova drives a libvirt-lxc container on a Linux host; the host maps RBD via rbd.ko against the RADOS cluster and its monitors (M)]
NOVA-DOCKER + CEPH.KO
● The model
– docker container as mini-host
– map kernel RBD on host
– pass RBD device to container, or
– mount RBD, bind dir to container
● Pros
– buzzword-compliant
– fast and efficient
● Cons
– different image format
– different app model
– only a subset of docker feature set
● Status
– no prototype
– nova-docker is out of tree
https://wiki.openstack.org/wiki/Docker
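A sketch of the second variant above (mount the RBD on the host, bind the directory into the container); the image name and paths are hypothetical:

    sudo rbd map rbd/appdata
    sudo mkfs.xfs /dev/rbd0
    sudo mkdir -p /var/lib/appdata && sudo mount /dev/rbd0 /var/lib/appdata
    docker run -v /var/lib/appdata:/data myapp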
IRONIC + CEPH.KO
● The model
– bare metal provisioning
– map kernel RBD directly from guest image
● Pros
– fast and efficient
– traditional app deployment model
● Cons
– guest OS must support rbd.ko
– requires agent
– boot-from-volume tricky
● Status
– Cinder and Ironic integration is a hot topic at summit
● 5:20p Wednesday (cinder)
– no prototype
● References
– https://wiki.openstack.org/wiki/Ironic/blueprints/cinder-integration
[Diagram: a bare-metal Linux host maps RBD via rbd.ko against the RADOS cluster and its monitors (M)]
BLOCK - SUMMARY
● But
– block storage is the same old, boring model
– volumes are only semi-elastic (grow, not shrink; tedious to resize)
– storage is not shared between guests
                      performance  efficiency  VM  client cache  striping  same images?  exists
kvm + librbd.so       best         good        X   X             X         yes           X
lxc + rbd.ko          good         best                                    close
nova-docker + rbd.ko  good         best                                    no
ironic + rbd.ko       good         best                                    close?        planned!
MANILA FILE STORAGE
● Manila manages file volumes
– create/delete, share/unshare
– tenant network connectivity
– snapshot management
● Why file storage?
– familiar POSIX semantics
– fully shared volume – many clients can mount and share data
– elastic storage – amount of data can grow/shrink without explicit provisioning
MANILA CAVEATS
● Last mile problem
– must connect storage to guest network
– somewhat limited options (focus on Neutron)
● Mount problem
– Manila makes it possible for guest to mount
– guest is responsible for actual mount
– ongoing discussion around a guest agent …
● Manila currently bakes in assumptions about both of these
APPLIANCE DRIVERS
● Appliance drivers
– tell an appliance to export NFS to guests
– map appliance IP into tenant network (Neutron)
– boring (closed, proprietary, expensive, etc.)
● Status
– several drivers from usual suspects
– security punted to vendor
GANESHA DRIVER
● Model
– service VM running nfs-ganesha server
– mount file system on storage network
– export NFS to tenant network
– map IP into tenant network
● Status
– in-tree, well-supported
[Diagram: Manila drives a Ganesha service VM that exports NFS to tenant KVM guests]
KVM + GANESHA + LIBCEPHFS
● Model
– existing Ganesha driver, backed by Ganesha's libcephfs FSAL
● Pros
– simple, existing model
– security
● Cons
– extra hop → higher latency
– service VM is a single point of failure
– service VM consumes resources
● Status
– Manila Ganesha driver exists
– untested with CephFS
[Diagram: Manila drives a Ganesha service VM whose libcephfs FSAL talks to the RADOS cluster; the tenant KVM guest mounts the export via nfs.ko]
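For reference, an nfs-ganesha export backed by the Ceph FSAL looks roughly like the following. This is a sketch following ganesha.conf conventions; the paths and export ID are placeholders:

    cat > /etc/ganesha/ganesha.conf <<'EOF'
    EXPORT {
        Export_ID = 1;
        Path = "/";              # path within CephFS
        Pseudo = "/cephfs";      # NFSv4 pseudo-filesystem path
        Access_Type = RW;
        FSAL {
            Name = CEPH;         # use the libcephfs FSAL
        }
    }
    EOF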
KVM + CEPH.KO (CEPH-NATIVE)
● Model
– allow tenant access to storage network
– mount CephFS directly from tenant VM
● Pros
– best performance
– access to full CephFS feature set
– simple
● Cons
– guest must have modern distro/kernel
– exposes tenant to Ceph cluster
– must deliver mount secret to client
● Status
– no prototype
– CephFS isolation/security is work-in-progress
[Diagram: Manila; the tenant KVM guest mounts CephFS natively via ceph.ko against the RADOS cluster and its monitors (M)]
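Inside the tenant VM the mount itself is a one-liner. A sketch with a placeholder monitor address and cephx credentials:

    sudo mount -t ceph 10.0.0.1:6789:/ /mnt/share \
        -o name=manila,secretfile=/etc/ceph/manila.secret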
NETWORK-ONLY MODEL IS LIMITING
● Current assumption of NFS or CIFS sucks
● Always relying on guest mount support sucks
– mount -t ceph -o what?
● Even assuming storage connectivity is via the network sucks
● There are other options!
– KVM virtfs/9p
● fs pass-through from host to guest
● 9p protocol
● virtio for fast data transfer
● upstream; not widely used
– NFS re-export from host
● mount and export fs on host
● private host/guest net
● avoid the extra network hop of an NFS service VM
– containers and 'mount --bind'
NOVA “ATTACH FS” API
● Mount problem is ongoing discussion by Manila team
– discussed this morning
– simple prototype using cloud-init
– Manila agent? leverage Zaqar tenant messaging service?
● A different proposal
– expand Nova to include “attach/detach file system” API
– analogous to current attach/detach volume for block
– each Nova driver may implement function differently
– “plumb” storage to tenant VM or container
● Open question
– Would API do the final “mount” step as well? (I say yes!)
KVM + VIRTFS/9P + CEPHFS.KO
● Model
– mount kernel CephFS on host
– pass-through to guest via virtfs/9p
● Pros
– security: tenant remains isolated from storage net + locked inside a directory
● Cons
– requires modern Linux guests
– 9p not supported on some distros
– “virtfs is ~50% slower than a native mount?”
● Status
– Prototype from Haomai Wang
[Diagram: Nova and Manila; the host mounts CephFS via ceph.ko and passes it to the VM over virtfs/9p; RADOS cluster with monitors (M)]
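A sketch of the wiring, with made-up paths and a made-up mount tag; the -fsdev/-device flags are standard QEMU virtio-9p options:

    # host: mount CephFS, then expose one directory to the guest
    sudo mount -t ceph 10.0.0.1:6789:/ /mnt/cephfs
    qemu-system-x86_64 -m 2048 -drive file=guest.img,format=qcow2 \
        -fsdev local,id=share0,path=/mnt/cephfs/tenant-a,security_model=mapped \
        -device virtio-9p-pci,fsdev=share0,mount_tag=manila_share
    # guest: mount the 9p share over virtio
    sudo mount -t 9p -o trans=virtio,version=9p2000.L manila_share /mnt/share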
KVM + NFS + CEPHFS.KO
● Model
– mount kernel CephFS on host
– pass-through to guest via NFS
● Pros
– security: tenant remains isolated from storage net + locked inside a directory
– NFS is more standard
● Cons
– NFS has weak caching consistency
– NFS is slower
● Status
– no prototype
[Diagram: Nova and Manila; the host mounts CephFS via ceph.ko and re-exports it to the VM over NFS; RADOS cluster with monitors (M)]
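A sketch of the host-side re-export; addresses and paths are placeholders, and the fsid option is an assumption (the export is not device-backed):

    # host: mount CephFS and export one directory over the host/guest net
    sudo mount -t ceph 10.0.0.1:6789:/ /mnt/cephfs
    echo "/mnt/cephfs/tenant-a 192.168.122.0/24(rw,fsid=1)" | sudo tee -a /etc/exports
    sudo exportfs -ra
    # guest: an ordinary NFS mount
    sudo mount -t nfs 192.168.122.1:/mnt/cephfs/tenant-a /mnt/share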
(LXC, NOVA-DOCKER) + CEPHFS.KO
● Model
– host mounts CephFS directly
– mount --bind share into container namespace
● Pros
– best performance
– full CephFS semantics
● Cons
– rely on container for security
● Status
– no prototype
[Diagram: Nova and Manila; the host mounts CephFS via ceph.ko and bind-mounts the share into the container; RADOS cluster with monitors (M)]
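A sketch of the container case, using an lxc-style rootfs path; all paths are illustrative:

    # host: mount CephFS once, then bind the tenant's share into the
    # container's filesystem namespace
    sudo mount -t ceph 10.0.0.1:6789:/ /mnt/cephfs
    sudo mount --bind /mnt/cephfs/shares/tenant-a \
        /var/lib/lxc/guest1/rootfs/mnt/share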
IRONIC + CEPHFS.KO
● Model
– mount CephFS directly from bare metal “guest”
● Pros
– best performance
– full feature set
● Cons
– rely on CephFS security
– networking?
– agent to do the mount?
● Status
– no prototype
– no suitable (ironic) agent (yet)
[Diagram: Nova and Manila; a bare-metal host mounts CephFS directly via ceph.ko against the RADOS cluster and its monitors (M)]
THE MOUNT PROBLEM
● Containers may break the current 'network fs' assumption
– mounting becomes driver-dependent; harder for tenant to do the right thing
● Nova “attach fs” API could provide the needed entry point
– KVM: qemu-guest-agent
– Ironic: no guest agent yet...
– containers (lxc, nova-docker): use mount --bind from host
● Or, make tenant do the final mount?
– Manila API to provide command (template) to perform the mount
● e.g., “mount -t ceph $cephmonip:/manila/$uuid $PATH -o ...”
– Nova lxc and docker
● bind share to a “dummy” device /dev/manila/$uuid
● API mount command is 'mount --bind /dev/manila/$uuid $PATH'
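Spelled out as a sketch; the /dev/manila convention is the proposal above, not an existing interface:

    # host side: bind the real share to a per-share dummy path
    sudo mkdir -p /dev/manila/$uuid
    sudo mount --bind /mnt/cephfs/shares/$uuid /dev/manila/$uuid
    # tenant side: the Manila-provided mount command is then just
    mount --bind /dev/manila/$uuid $PATH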
SECURITY: NO FREE LUNCH
● (KVM, Ironic) + ceph.ko
– access to storage network relies on Ceph security
● KVM + (virtfs/9p, NFS) + ceph.ko
– better security, but
– pass-through/proxy limits performance
● (by how much?)
● Containers
– security (vs a VM) is weak at baseline, but
– host performs the mount; tenant locked into their share directory
PERFORMANCE
● 2 nodes
– Intel E5-2660
– 96GB RAM
– 10GbE NIC
● Server
– 3 OSD (Intel S3500)
– 1 MON
– 1 MDS
● Client VMs
– 4 cores
– 2GB RAM
● iozone, file size 2x available RAM
● CephFS native
– VM ceph.ko → server
● CephFS 9p/virtfs
– VM 9p → host ceph.ko → server
● CephFS NFS
– VM NFS → server ceph.ko → server
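The exact iozone invocation is not on the slide; a representative run at 2x the client VM's 2GB RAM might look like the following (an assumption, not the authors' command):

    # sequential write (-i 0) and read (-i 1), 4 GB file, include fsync in timing
    iozone -e -i 0 -i 1 -r 4k -s 4g -f /mnt/share/testfile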
SUMMARY MATRIX
                           performance  consistency  VM  gateway  net hops  security  mount agent  prototype
kvm + ganesha + libcephfs  slower (?)   weak (nfs)   X   X        2         host      X            X
kvm + virtfs + ceph.ko     good         good         X   X        1         host      X            X
kvm + nfs + ceph.ko        good         weak (nfs)   X   X        1         host      X
kvm + ceph.ko              better       best         X            1         ceph      X
lxc + ceph.ko              best         best                      1         ceph
nova-docker + ceph.ko      best         best                      1         ceph                   IBM talk, Thurs 9am
ironic + ceph.ko           best         best                      1         ceph      X            X
CONTAINERS ARE DIFFERENT
● nova-docker implements a Nova view of a (Docker) container
– treats container like a standalone system
– does not leverage most of what Docker has to offer
– Nova == IaaS abstraction
● Kubernetes is the new hotness
– higher-level orchestration for containers
– draws on years of Google experience running containers at scale
– vibrant open source community
KUBERNETES SHARED STORAGE
● Pure Kubernetes – no OpenStack
● Volume drivers
– Local
● hostPath, emptyDir
– Unshared
● iSCSI, GCEPersistentDisk, Amazon EBS, Ceph RBD – local fs on top of existing device
– Shared
● NFS, GlusterFS, Amazon EFS, CephFS
● Status
– Ceph drivers under review
● Finalizing model for secret storage, cluster parameters (e.g., mon IPs)
– Drivers expect pre-existing volumes
● recycled; missing REST API to create/destroy volumes
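For flavor, a pod spec using the CephFS volume plugin that was under review might look like this; a sketch only, and the field names may differ from what finally merged:

    cat > cephfs-pod.yaml <<'EOF'
    apiVersion: v1
    kind: Pod
    metadata:
      name: cephfs-test
    spec:
      containers:
      - name: app
        image: nginx
        volumeMounts:
        - name: share
          mountPath: /usr/share/nginx/html
      volumes:
      - name: share
        cephfs:
          monitors:
          - 10.0.0.1:6789        # placeholder monitor address
          user: admin
          secretFile: /etc/ceph/admin.secret
          readOnly: false
    EOF
    kubectl create -f cephfs-pod.yaml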
KUBERNETES ON OPENSTACK
● Provision Nova VMs
– KVM or ironic
– Atomic or CoreOS
● Kubernetes per tenant
● Provision storage devices
– Cinder for volumes
– Manila for shares
● Kubernetes binds into pod/container
● Status
– Prototype Cinder plugin for Kubernetes
● https://github.com/spothanis/kubernetes/tree/cinder-vol-plugin
[Diagram: Nova provisions KVM Kube nodes running nginx and mysql pods, plus a Kube master with a volume controller; storage comes from Cinder and Manila]
WHAT NEXT?
● Ironic agent
– enable Cinder (and Manila?) on bare metal
– Cinder + Ironic
● 5:20p Wednesday (Cinder)
● Expand breadth of Manila drivers
– virtfs/9p, ceph-native, NFS proxy via host, etc.
– the last mile is not always the tenant network!
● Nova “attach fs” API (or equivalent)
– simplify tenant experience
– paper over VM vs container vs bare metal differences
THANK YOU!
Sage Weil – CEPH PRINCIPAL ARCHITECT
Haomai Wang – FREE AGENT
[email protected]
…@gmail.com
@liewegas
FOR MORE INFORMATION
● http://ceph.com
● http://github.com/ceph
● http://tracker.ceph.com
● Mailing lists
● irc.oftc.net
– #ceph
– #ceph-devel
– @ceph