© 2017 PORTWORX | LAYER CLONING FILESYSTEM
LCFS Storage Driver For Docker
Jobi, Feb 10, 2017
Exec Summary
Every time you build, pull or destroy a Docker container, you are using a storage driver.
Because LCFS is designed only for containers, it is up to 2.5x faster to build an image and almost 2x faster to pull an image.
We're looking forward to working with the container community to improve and expand this new tool.
− Open sourced (Apache 2.0)
− Use or contribute! https://github.com/portworx/lcfs
What is LCFS?
Layers are first class citizens
− Atomicity guarantees for each layer, not at the system call level
Provides
− Efficient snapshotting/cloning mechanism
− Correctness guarantees to containers
A POSIX file system in user space (FUSE), written in C
− No kernel modifications or license issues
No configuration required
Image source: Docker Docs
What is a Graphdriver?
Docker image and container data repository
− And corresponding configuration data
It is a POSIX file system, with some special operations like
− Create read-only layer
− Create read-write layer
− Mount a layer
− Unmount a layer
− Delete a layer
Layers are mostly ephemeral (temporary)
Docker provides ordering of operations
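The layer operations listed above can be pictured as a small interface. The following is a minimal in-memory sketch, not Docker's or LCFS's actual API: the class name, method names and the rules enforced in `delete` are assumptions made for illustration.

```python
class ToyGraphdriver:
    """In-memory stand-in for the five layer operations (hypothetical API)."""

    def __init__(self):
        self.layers = {}      # layer id -> {"parent": id or None, "read_only": bool}
        self.mounted = set()  # ids of currently mounted layers

    def _create(self, layer_id, parent, read_only):
        if layer_id in self.layers:
            raise ValueError(f"layer {layer_id!r} already exists")
        if parent is not None and parent not in self.layers:
            raise KeyError(f"unknown parent {parent!r}")
        self.layers[layer_id] = {"parent": parent, "read_only": read_only}

    def create_read_only(self, layer_id, parent=None):
        self._create(layer_id, parent, read_only=True)

    def create_read_write(self, layer_id, parent):
        self._create(layer_id, parent, read_only=False)

    def mount(self, layer_id):
        if layer_id not in self.layers:
            raise KeyError(layer_id)
        self.mounted.add(layer_id)

    def unmount(self, layer_id):
        self.mounted.discard(layer_id)

    def delete(self, layer_id):
        # Assumed rules: a layer must be unmounted and childless before removal.
        if layer_id in self.mounted:
            raise RuntimeError("layer is still mounted")
        if any(l["parent"] == layer_id for l in self.layers.values()):
            raise RuntimeError("layer still has children")
        del self.layers[layer_id]


driver = ToyGraphdriver()
driver.create_read_only("base")
driver.create_read_write("container-1", parent="base")
driver.mount("container-1")
driver.unmount("container-1")
driver.delete("container-1")   # the child goes first, then the base
driver.delete("base")
```

Because Docker provides the ordering of operations, a driver like this does not need to sort out conflicting requests itself; it only has to reject the ones that are invalid.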
Existing solutions
Union file systems vs. Snapshot based
Merged solutions (duplicated effort)
− AUFS on top of Ext4/XFS
− Overlay on top of Ext4/XFS
− Devicemapper on top of LVM/Ext4/XFS
Traditional solutions are optimized for file/block storage, persistent data, point-in-time snapshots and clones, and all kinds of workflows (mostly data constantly being modified)
− Not very efficient for storing ephemeral and mostly read-only layers
LCFS Architecture
[Diagram. Labels: Docker Daemon; containers with init and read/write layers; FUSE in Kernel; FUSE Library; LCFS (user mode, purpose built, native); Fedora image layers; MySQL image layers; Container 1 boot device; kernel; device]
Layers
Root Layer – docker configuration data & volumes
Base layer and read-only layers
Read-write layers (2 per container)
Data shared between layers in a tree
Layers track space allocated to data created in a layer
Each layer has an inode table
Strictly read-only once a layer is created on top
Thin provisioned and branch-on-write
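A minimal sketch of the layer tree described above, assuming a simple parent pointer per layer. The class and field names are hypothetical; the only rules modeled are that a layer becomes read-only once another layer is created on top of it and that each layer counts only the space for data it created itself.

```python
class Layer:
    def __init__(self, name, parent=None):
        self.name = name
        self.parent = parent
        self.frozen = False        # True once a child layer exists on top
        self.blocks_allocated = 0  # space for data created in this layer only
        if parent is not None:
            parent.frozen = True   # the parent is strictly read-only from now on

    def write_new_blocks(self, count):
        if self.frozen:
            raise PermissionError(f"{self.name} is read-only")
        self.blocks_allocated += count

    def chain(self):
        """Names from this layer down to the base of the tree."""
        layer, names = self, []
        while layer is not None:
            names.append(layer.name)
            layer = layer.parent
        return names


base = Layer("base")
base.write_new_blocks(100)
init = Layer("c1-init", parent=base)   # first read-write layer of the container
rw = Layer("c1", parent=init)          # second read-write layer of the container
rw.write_new_blocks(3)
print(rw.chain())                      # ['c1', 'c1-init', 'base']
```

Reading the two read-write layers per container as an init layer plus the container's own layer is an interpretation based on the "init" and "read/write" labels in the architecture diagram.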
How are layers different?
Layers can be created/deleted without pausing any running containers
− Cloning read-only layers is a lot simpler
Data access time is constant for a container irrespective of the number of containers of an image
− Different from point-in-time snapshots/clones, no roll back
Layers are deleted in the reverse order of creation
− Layers are not deleted in the beginning/middle of a chain
No reference counting of blocks
− Creation/deletion time independent of size of device, size of data set and number of layers
− Unlimited number of layers
Layout
Unit of allocation is 4KB
Each layer has a superblock
Superblocks are linked together to recreate the tree of layers on remount
The root layer superblock tracks the blocks that hold free space information
Each layer tracks the blocks that hold the allocated space information for that layer
Each layer tracks blocks where inodes are stored
Metadata blocks are checksummed
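To make the idea of checksummed, linked 4KB metadata blocks concrete, here is a small sketch. The header layout (a CRC32 followed by a next-block pointer) and the use of CRC32 are assumptions; the slides do not say which checksum or on-disk format LCFS uses.

```python
import struct
import zlib

BLOCK_SIZE = 4096               # unit of allocation from the slide
HEADER = struct.Struct("<II")   # assumed layout: crc32, next-block pointer


def pack_block(payload: bytes, next_block: int = 0) -> bytes:
    """Build one 4KB metadata block with a checksum over pointer and body."""
    body_size = BLOCK_SIZE - HEADER.size
    if len(payload) > body_size:
        raise ValueError("payload does not fit in one block")
    body = payload.ljust(body_size, b"\0")
    crc = zlib.crc32(struct.pack("<I", next_block) + body)
    return HEADER.pack(crc, next_block) + body


def unpack_block(block: bytes):
    """Verify the checksum and return (next_block, body)."""
    if len(block) != BLOCK_SIZE:
        raise ValueError("wrong block size")
    crc, next_block = HEADER.unpack_from(block)
    body = block[HEADER.size:]
    if zlib.crc32(struct.pack("<I", next_block) + body) != crc:
        raise OSError("metadata checksum mismatch")
    return next_block, body


block = pack_block(b"layer metadata", next_block=42)
next_block, body = unpack_block(block)
```

The next-block pointer stands in for the way superblocks are linked together so that the layer tree can be rebuilt on remount.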
Space Management
Space is tracked using Extents (start block + count of blocks)
Free Extent Map of the whole file system
Allocated Extent Map for each layer
Each layer makes reservations in large chunks and allocates from those chunks
− Less locking of the global free list
− Better contiguity within a layer (separate chunks for user data, metadata and inodes)
A minimum device size is required, as is a minimum amount of free space for writes and layer creation
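The extent and reservation scheme above can be sketched as follows. This is an illustration under assumptions, not the LCFS allocator: the chunk size of 256 blocks, the first-fit policy and all names are made up, and the separate chunks for user data, metadata and inodes are left out.

```python
class FreeExtentMap:
    """Global free list of (start block, block count) extents."""

    def __init__(self, total_blocks):
        self.extents = [(0, total_blocks)]

    def allocate(self, count):
        # First fit: take the front of the first extent that is large enough.
        for i, (start, length) in enumerate(self.extents):
            if length >= count:
                if length == count:
                    del self.extents[i]
                else:
                    self.extents[i] = (start + count, length - count)
                return start
        raise MemoryError("no free extent large enough")


class LayerAllocator:
    CHUNK = 256  # assumed reservation size, in blocks

    def __init__(self, free_map):
        self.free_map = free_map
        self.reserved = None   # (next free block, blocks left) in current chunk
        self.allocated = []    # allocated extent map for this layer

    def allocate(self, count):
        if self.reserved is None or self.reserved[1] < count:
            # Only this step touches the global free list. A real allocator
            # would also return the unused tail of the old chunk.
            size = max(count, self.CHUNK)
            self.reserved = (self.free_map.allocate(size), size)
        start, left = self.reserved
        self.reserved = (start + count, left - count)
        if self.allocated and sum(self.allocated[-1]) == start:
            prev_start, prev_count = self.allocated[-1]
            self.allocated[-1] = (prev_start, prev_count + count)  # extend extent
        else:
            self.allocated.append((start, count))
        return start


free_map = FreeExtentMap(1024)
layer_a = LayerAllocator(free_map)
layer_b = LayerAllocator(free_map)
first = layer_a.allocate(4)
other = layer_b.allocate(4)
second = layer_a.allocate(4)
```

In this run the two allocations of `layer_a` stay adjacent even though `layer_b` allocated in between, which is the contiguity benefit the slide mentions.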
Inodes
Each inode takes 128 bytes on disk
− Symbolic links are stored along with inode and inode consumes 4KB
− Access/Creation times not tracked
− Inode number is stored within the inode
Directory blocks are reachable from directory inodes
User data of single extent files is reachable directly from the inode
Extent map (emap) of fragmented files is reachable from the inode
The same is the case for blocks tracking extended attributes
File Handles
Formed using layer index + inode number
Layer index is unique for a layer and ranges from 0 to 64K
Inode number is unique globally
− Inode numbers are shared between layers in a tree for shared files
Inode numbers are never reused
Because handles differ per layer, shared data is duplicated in the kernel page cache, but those copies are invalidated as soon as the file is closed
− May work better if FUSE is smarter here
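One way to combine a layer index and an inode number into a single handle is plain bit packing. The sketch below assumes a 64-bit handle with 16 bits for the layer index (enough for the 0 to 64K range) and 48 bits for the inode number; the slides do not give the actual widths or encoding.

```python
LAYER_BITS = 16                      # layer index range 0 to 64K
INODE_BITS = 48                      # assumed width for the inode number
INODE_MASK = (1 << INODE_BITS) - 1


def make_handle(layer_index, inode):
    if not 0 <= layer_index < (1 << LAYER_BITS):
        raise ValueError("layer index out of range")
    if not 0 <= inode <= INODE_MASK:
        raise ValueError("inode number out of range")
    return (layer_index << INODE_BITS) | inode


def split_handle(handle):
    return handle >> INODE_BITS, handle & INODE_MASK


# The same shared inode seen from two layers yields two distinct handles.
handle_a = make_handle(7, 8130)
handle_b = make_handle(8, 8130)
```

Two handles for one shared inode is the reason the kernel treats the same file in two layers as two different files.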
Directory Tree
Global root of the file system with inode number 2
Another directory, called the Layer Root Directory, is created for Docker to hold the root directories of all layers
− This directory cannot be deleted, and many operations are not allowed on it
Atomic rename(2) is supported
No need to keep “whiteouts” for removed files as directories are COWed
Locking
Each layer has a read-write lock, taken by all operations in shared mode
A layer is locked exclusive while deleting it
Root layer is locked in shared mode while creating/deleting layers
Root layer is locked exclusive while unmounting the file system
File Operations
Each inode has a read-write lock, taken in shared mode by read-only operations and exclusive mode by modify operations – this lock is not taken on frozen layers
Writes are acknowledged immediately after copying data to dirty page cache of the file
fsync(2) is disabled
rmdir(2) in root layer succeeds even when directory is not empty
getxattr()/removexattr() fail immediately, without looking up the inode, when the file system does not have any extended attributes
ioctl(2) support on layer root directory for creating/mounting/unmounting/deleting layers
Branch-On-Write (BOW - COW – Copy UP)
Inode is copied up on modification along with metadata like extended attributes and directory entries or block map
− Shared metadata may be shared in cache even after copy up
User data blocks are BOWed on modification in 4KB sizes
− Most applications truncate the whole file and rewrite file with new data
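Branch-on-write at 4KB granularity can be sketched with a per-layer page map that falls back to the parent for pages the layer has not touched. The class below is a toy model with hypothetical names; it ignores inode and metadata copy-up and keeps everything in memory.

```python
PAGE = 4096


class LayerFile:
    """One file as seen from one layer; missing pages come from the parent."""

    def __init__(self, parent=None):
        self.parent = parent
        self.pages = {}   # page number -> bytes owned by this layer

    def read_page(self, n):
        node = self
        while node is not None:
            if n in node.pages:
                return node.pages[n]
            node = node.parent
        return bytes(PAGE)   # a hole reads as zeros

    def write(self, offset, data):
        pos = 0
        while pos < len(data):
            n, start = divmod(offset + pos, PAGE)
            take = min(PAGE - start, len(data) - pos)
            page = bytearray(self.read_page(n))   # branch: copy the shared page
            page[start:start + take] = data[pos:pos + take]
            self.pages[n] = bytes(page)           # the copy belongs to this layer
            pos += take


base = LayerFile()
base.write(0, b"a" * 8192)         # two pages in the image layer
child = LayerFile(parent=base)
child.write(4096, b"b")            # touches only page 1
```

After the write, `child` owns a private copy of page 1 only, and page 0 is still read from `base`.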
Caching
All metadata stays in memory
− Inodes, directories, emaps, extended attributes, space extent maps, symbolic links etc.
− Caching actual amount of metadata, not page aligned metadata
Each layer has a hash table for inodes
− Lookups may traverse the parent chain
Inodes have a dirty page list
Layers track hardlinks
Mostly using sequential lists (hashing scheme for large directories and dirty page list)
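A per-layer inode table with lookups that fall through to the parent chain might look like the sketch below. It uses a Python dict in place of the hash table, and the names and the dict-based inode representation are assumptions for illustration.

```python
class LayerICache:
    def __init__(self, parent=None):
        self.parent = parent
        self.inodes = {}   # inode number -> inode data instantiated in this layer

    def lookup(self, ino):
        """Return (inode, owning layer), walking up the parent chain."""
        layer = self
        while layer is not None:
            inode = layer.inodes.get(ino)
            if inode is not None:
                return inode, layer
            layer = layer.parent
        return None, None

    def copy_up(self, ino):
        """Give this layer its own copy of an inode before modifying it."""
        inode, owner = self.lookup(ino)
        if inode is None:
            raise FileNotFoundError(ino)
        if owner is not self:
            self.inodes[ino] = dict(inode)
        return self.inodes[ino]


image = LayerICache()
image.inodes[2] = {"type": "dir"}
image.inodes[100] = {"type": "file", "size": 10}
container = LayerICache(parent=image)

found, owner = container.lookup(100)   # served from the image layer
container.copy_up(100)["size"] = 20    # the container now has a private inode
```

The image layer's inode is left untouched by the modification, so other containers of the same image still see the original size.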
Page Block Cache
File system blocks are cached in a private page cache, indexed by block numbers for shared data
− Data that is not shared still uses the kernel page cache
Each base image maintains a page cache, which is shared by all layers in the tree that have the same base image
The cache is shared by both user data and metadata
Data Placement
Space allocated to files at the time of sync, not when written
− Size of file known at the time of sync and never changes in a read-only layer
− Most files can be placed contiguous on disk
− Temporary files and layers may not be written to disk
Small files and metadata are coalesced together as well
Zero blocks written do not consume space
Less metadata, less memory, fewer I/Os
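Deferring allocation until sync means the whole set of dirty pages is known when blocks are chosen. The sketch below shows one way that could work, assuming a simple bump allocator and a dict of dirty pages; it skips all-zero pages and merges consecutive pages into extents. The function name and tuple layout are hypothetical.

```python
PAGE = 4096
ZERO_PAGE = bytes(PAGE)


def plan_sync(dirty_pages, next_free_block):
    """Assign disk blocks to dirty pages at sync time.

    dirty_pages: {page number: bytes}
    Returns (extents, next_free_block), where each extent is
    (first page, first block, page count).
    """
    extents = []
    for n in sorted(dirty_pages):
        if dirty_pages[n] == ZERO_PAGE:
            continue   # zero pages stay as holes and consume no space
        last = extents[-1] if extents else None
        if last and last[0] + last[2] == n:
            extents[-1] = (last[0], last[1], last[2] + 1)   # extend the extent
        else:
            extents.append((n, next_free_block, 1))
        next_free_block += 1
    return extents, next_free_block


pages = {0: b"x" * PAGE, 1: b"y" * PAGE, 2: ZERO_PAGE, 3: b"z" * PAGE}
extents, next_free = plan_sync(pages, 100)
print(extents)     # [(0, 100, 2), (3, 102, 1)]
print(next_free)   # 103
```

Four dirty pages end up in three disk blocks, and the file needs only two extent entries.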
Layer Diff
Needed for docker commit/build operations to find paths modified in a layer compared to parent layer
Uses custom diff driver, not NaiveDiffDriver
− Except pre-existing layers after remount
Plugin invokes getxattr calls to get diff for a layer from LCFS
LCFS traverses the private icache of the layer and reports inodes instantiated in the layer
Only for generating diff from the parent layer
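Since inode numbers are shared between layers for shared files, a layer diff can be derived from the inodes instantiated in the layer alone. The sketch below is an assumption about how such a report could be classified; it does not reproduce the getxattr protocol used by the plugin, and it leaves out deleted paths.

```python
def layer_diff(layer_inodes, parent_inodes):
    """Classify paths of inodes instantiated in a layer.

    layer_inodes:  {inode number: path} for inodes in this layer's own cache
    parent_inodes: set of inode numbers visible through the parent chain
    """
    changes = []
    for ino in sorted(layer_inodes):
        kind = "modified" if ino in parent_inodes else "added"
        changes.append((kind, layer_inodes[ino]))
    return changes


changes = layer_diff({5: "/etc/hosts", 9000: "/file"}, {2, 5})
print(changes)   # [('modified', '/etc/hosts'), ('added', '/file')]
```

Only the layer's own inode table is walked, so the cost depends on how much the layer changed rather than on the size of the image.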
Crash Consistency
Docker's database of images and containers needs to stay consistent even after an abnormal shutdown of the graphdriver
Considering a checkpointing scheme over a journaling scheme
− Note fsync is disabled
Stats
Every operation in every layer is counted and total, maximum and minimum time for each type of operation is tracked
This information can be presented in a tabular form on a per layer basis on demand, periodically or at the time a layer is unmounted
Stats for a container can be restarted before running an application for proper tracing
Memory usage tracked for each layer
Count of different file types in every layer is tracked
CPU profiling can be enabled with gperftools
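Per-operation counters of this kind can be kept with very little code. The following sketch tracks total count, failures, and total, maximum and minimum time per request type; it is an illustration with hypothetical names, not the LCFS stats code, and its table layout is only loosely modeled on the slide that follows.

```python
import time
from collections import defaultdict
from contextlib import contextmanager


class OpStats:
    def __init__(self):
        self.ops = defaultdict(lambda: {"total": 0, "failed": 0, "time": 0.0,
                                        "max": 0.0, "min": float("inf")})

    def record(self, name, elapsed, failed=False):
        s = self.ops[name]
        s["total"] += 1
        s["failed"] += int(failed)
        s["time"] += elapsed
        s["max"] = max(s["max"], elapsed)
        s["min"] = min(s["min"], elapsed)

    @contextmanager
    def timed(self, name):
        """Time the wrapped block and count it as failed if it raises."""
        start = time.perf_counter()
        try:
            yield
        except Exception:
            self.record(name, time.perf_counter() - start, failed=True)
            raise
        else:
            self.record(name, time.perf_counter() - start)

    def report(self):
        lines = [f"{'Request':<12}{'Total':>8}{'Failed':>8}"
                 f"{'Average':>12}{'Max':>12}{'Min':>12}"]
        for name, s in sorted(self.ops.items()):
            avg = s["time"] / s["total"]
            lines.append(f"{name:<12}{s['total']:>8}{s['failed']:>8}"
                         f"{avg:>12.6f}{s['max']:>12.6f}{s['min']:>12.6f}")
        return "\n".join(lines)


stats = OpStats()
stats.record("LOOKUP", 0.00001)
stats.record("LOOKUP", 0.00003, failed=True)
with stats.timed("READ"):
    pass   # a real driver would service the request here
print(stats.report())
```

Restarting the stats for a container, as described above, would amount to replacing the `OpStats` object for that layer with a fresh one.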
Container stats
Running a dd command in an ubuntu/bash container - dd if=/dev/zero of=file count=10000 bs=4096
Stats for file system 0x1878680 with root 8130 index 7 at Thu Dec 8 09:26:30 2016
Layer created at Thu Dec 8 09:25:11 2016
Last acccessed at Thu Dec 8 09:26:14 2016
Request: Total Failed Average Max Min
LOOKUP: 110 34 0s.000010u 0s.000054u 0s.000003u
GETATTR: 36 0 0s.000005u 0s.000018u 0s.000003u
READLINK: 22 0 0s.000006u 0s.000023u 0s.000004u
OPEN: 43 0 0s.000005u 0s.000013u 0s.000003u
READ: 191 0 0s.000068u 0s.000266u 0s.000004u
FLUSH: 2 0 0s.000000u 0s.000000u 0s.000000u
RELEASE: 35 0 0s.000039u 0s.000430u 0s.000003u
OPENDIR: 1 0 0s.000007u 0s.000007u 0s.000007u
RELEASEDIR: 1 0 0s.000007u 0s.000007u 0s.000007u
CREATE: 1 0 0s.000011u 0s.000011u 0s.000011u
WRITE_BUF: 10000 0 0s.000008u 0s.000120u 0s.000003u
blocks allocated 1 freed 0
2 inodes 10000 pages
0 reads 0 writes (0 inodes written)
Container Memory stats
Running a dd command in an ubuntu/bash container - dd if=/dev/zero of=file count=10000 bs=4096
Memory Stats for file system 0x1435a00 with root 8130 index 7 at Fri Dec 9 06:15:15 2016
DIRENT Allocated 21 Freed 0
ICACHE Allocated 1 Freed 0
INODE Allocated 2 Freed 0
EXTENT Allocated 1 Freed 0
BLOCK Allocated 1 Freed 0
DATA Allocated 10000 Freed 0
DPAGEHASH Allocated 14 Freed 13
STATS Allocated 1 Freed 0
Total memory in use 41213339 bytes
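A quick check of the numbers above shows where the memory goes. The calculation assumes that each DATA allocation is one 4096-byte page, matching the bs=4096 used by dd; the output itself does not state the size of a DATA allocation.

```python
PAGE_SIZE = 4096          # assumed size of one DATA allocation (dd bs=4096)
data_pages = 10000        # "DATA Allocated 10000"
total_in_use = 41213339   # "Total memory in use", in bytes

data_bytes = data_pages * PAGE_SIZE
other_bytes = total_in_use - data_bytes

print(data_bytes)    # 40960000
print(other_bytes)   # 253339
```

Under that assumption, all but about 250 KB of the memory in use is the dirty file data written by dd.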
Time to Pull/Delete 30 popular images
[Bar chart: Serial Pull, Parallel Pull, Serial Delete and Parallel Delete for Devmapper, btrfs, Overlay, Overlay2 and Lcfs; vertical axis 0 to 800, units not shown]
Time to Pull/Delete 30 popular images
[Bar chart: Serial Pull, Parallel Pull, Serial Delete and Parallel Delete for AUFS and Lcfs; vertical axis 0 to 500, units not shown]
Time to Pull individual images
[Bar chart: pull time per image for Overlay, Overlay2 and Lcfs; images: php-zendserver, hectcastro/riak, wordpress, rails, rabbitmq, logstash, golang, sysdig/sysdig, cassandra, postgres, mariadb, redis, httpd, nginx, gliderlabs/logspout; vertical axis 0 to 140, units not shown]
Time to Spawn fedora/apache Containers
[Chart: Devicemapper, btrfs, Overlay, Overlay2 and Lcfs; horizontal axis 20, 40, 60, 80, 100; vertical axis 0 to 180, units not shown]
Time to Spawn fedora/apache Containers
[Chart: AUFS and Lcfs; horizontal axis 20, 40, 60, 80, 100; vertical axis 0 to 60, units not shown]
Time to Remove fedora/apache Containers
[Chart: Devmapper, btrfs, Overlay, Overlay2 and Lcfs; horizontal axis 20, 40, 60, 80, 100; vertical axis 0 to 70, units not shown]
Time to Remove fedora/apache Containers
[Chart: AUFS and Lcfs; horizontal axis 20, 40, 60, 80, 100; vertical axis 0 to 45, units not shown]
Time to Build Docker sources
[Bar chart: Docker Build for Devmapper, btrfs, Overlay, Overlay2 and Lcfs; vertical axis 0 to 1600, units not shown]
Time to Build Docker sources
[Bar chart: Docker Build for AUFS and Lcfs; vertical axis 0 to 700, units not shown]
IOPS with fiograph
docker run portworx/fiograph --blocksize=1024K --filename=/root/1g.bin --ioengine=libaio --readwrite=read --size=1024M --name=test --gtod_reduce=1 --iodepth=1 --time_based --runtime=60
[Bar chart: IOPS with the libaio and splice I/O engines for Devmapper, Overlay, Overlay2 and Lcfs; vertical axis 0 to 7000]
LCFS - A Docker V2 Graphdriver Plugin
Download & build LCFS or install the RPM
− git clone https://github.com/portworx/lcfs, cd lcfs/lcfs, make
− rpm -Uvh http://yum.portworx.com/repo/rpms/px-graph/lcfs-0.0.0-0.x86_64.rpm
Mount a device at /var/lib/docker and /lcfs
− ./lcfs <device/file> /var/lib/docker /lcfs -f
Start docker with vfs storage driver (1.13+)
− dockerd -s vfs
Install LCFS plugin
− docker plugin install portworx/lcfs
Restart docker with lcfs graphdriver
− dockerd --experimental -s portworx/lcfs
Pending tasks
Crash consistency
Metadata paging
Replace linear search algorithms
https://github.com/portworx/lcfs/issues
QA
Roadmap
QOS at container level (COS, IOPS, Quotas etc.)
Distributed Graphdriver (images shared)
Seamless container migration in a cluster
− Load Balancing
Backup/Restore of Graphdriver
Q&A
More info
− https://docs.docker.com/engine/userguide/storagedriver/imagesandcontainers/
− https://github.com/portworx/lcfs
Thank You!