Sharding: Past, Present and Future
Krutika Dhananjay - [email protected], Engineer
PAST
Alas! Sharding has no past!
Striping
What is striping?
● Client-side translator - sits below DHT
● Every striped file is stored as <stripe-count> piece files, one per stripe subvolume
● Each file is split into chunks of size <stripe-block-size>
● Consecutive chunks are spread across the piece files in a round-robin fashion
[Diagram: striping with stripe-count = 3. An unstriped file made up of 10 chunks, each of size <stripe-block-size>, is laid out round-robin across the stripe subvolumes after striping: stripe-subvol-0 holds chunks 1, 4, 7 and 10; stripe-subvol-1 holds chunks 2, 5 and 8; stripe-subvol-2 holds chunks 3, 6 and 9.]
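A minimal Python sketch of the round-robin placement shown in the diagram above (the helper and its parameters are illustrative, not the stripe translator's actual code):

def stripe_location(offset, stripe_block_size, stripe_count):
    """Map a byte offset in the original file to (stripe subvolume index,
    offset within that subvolume's piece file). Illustrative only."""
    chunk = offset // stripe_block_size        # which chunk the byte falls in
    subvol = chunk % stripe_count              # chunks go round-robin across subvolumes
    # each subvolume stores every stripe_count-th chunk, packed back to back
    piece_offset = (chunk // stripe_count) * stripe_block_size + offset % stripe_block_size
    return subvol, piece_offset

# With stripe-count = 3, chunks 1, 4, 7, 10 land on stripe-subvol-0,
# chunks 2, 5, 8 on stripe-subvol-1 and chunks 3, 6, 9 on stripe-subvol-2,
# matching the diagram.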
Striping in Action
Stripe translator - shortcomings
● Cost - you can add servers only in multiples of
‘stripe-count * replica-count’.
● File splitting not granular enough
○ Self-heal of a striped file must still heal ‘total_file_size/stripe_count’ bytes of data.
○ Geo-replication of a striped file must still sync ‘total_file_size/stripe_count’ bytes of data.
Stripe shortcomings contd ...
● Suboptimal utilization of disks
○ An ‘x’ TB file would still require at least ‘x/stripe-count’ TB of space available in any subvolume of DHT
○ … which in turn implies suboptimal distribution of IOPS across bricks for a given file.
Present - Sharding for VM Image Storage
What is sharding?
● Client-side xlator – sits above DHT
● Splits a file into equal-sized chunks as it grows in size
● Shards beyond first block kept in a hidden /.shard
directory and the first block under its parent dir
● Translators above shard only see the user files
● Translators below shard see shards as normal files
● Shard naming is <gfid>.<num>
● Shard size configurable at volume level - 4MB to 4TB
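A rough Python sketch of the naming scheme described above (the helper is hypothetical; only the <gfid>.<num> naming, the /.shard location and the base-file placement come from the slides):

def shard_path(gfid, offset, shard_block_size):
    """Return where the shard holding a given byte offset lives."""
    num = offset // shard_block_size
    if num == 0:
        # block 0 stays in the base file under its parent directory
        return "<path of the base file>"
    return "/.shard/%s.%d" % (gfid, num)

# e.g. with a 4 MB shard size, byte offset 9 MB of a file falls in shard 2,
# i.e. the file /.shard/<gfid>.2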
[Diagram: client-side translator stack - FUSE / gfapi / other protocol on top of io-stats, write-behind, shard and DHT; DHT fans out to AFR-0 (protocol/client-0, protocol/client-1) and AFR-1 (protocol/client-2, protocol/client-3).]
Short Demo
https://asciinema.org/a/brvrvh2fhhl7djlpboz74y4ll
How sharding benefits the use case
● Granularity of data heal is at shard level
● Minimal resource utilization by background processes
(self-heal, geo-rep, etc)
● VM image size no longer limited by the capacity of
individual brick(s)
● Better distribution of IOPs across bricks
● Geo-rep can now operate at shard level
● Add new bricks only after existing bricks’ space is fully
utilized
Result?
Happy community users!
FAQ
Where is the file metadata stored?
● File permissions, ownership, aggregated file size, block count and user-set extended attributes are maintained only on the base file. Shards under “.shard” are owned by root.
● Since sharding is currently used only in the single-writer use case, mtime is maintained on a best-effort basis in memory and kept up-to-date as individual shards witness writes.
Moral of the story - lookup, stat, {get,set,remove}xattr are directly
served from the base file => 1 network call.
How does writing to a sharded file work?
● Create ‘.shard’ if it doesn’t exist.
● Identify participant shards, given write offset and length.
● Create shards if non-existent, in parallel.
● Send writes on participant shards at appropriate shard
offsets in parallel.
● Once all write responses are received, update size and
block count through an xattrop operation.
● Update in-memory cache containing the file size and
block-count and unwind the call.
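A small illustrative sketch of the participant-shard computation described above (hypothetical helper, not the shard translator's actual code):

def participant_shards(offset, length, shard_block_size):
    """Yield (shard number, offset within that shard, byte count) for a write."""
    end = offset + length
    cur = offset
    while cur < end:
        num = cur // shard_block_size              # shard this range starts in
        shard_off = cur % shard_block_size         # offset inside that shard
        count = min(shard_block_size - shard_off, end - cur)
        yield num, shard_off, count
        cur += count

# A 10 MB write at offset 0 with 4 MB shards touches shards 0, 1 and 2
# (4 MB + 4 MB + 2 MB); the per-shard writes go out in parallel and the
# size/block-count xattrop follows once all of them complete.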
How are renames and hard-links handled?
● Both fops operate only on the base file.
● File’s gfid remains constant even after a rename => shards under ‘.shard’ don’t need to be renamed.
● In other words, renaming and hard-linking a sharded file
completes in one single (atomic) network call.
Interoperability with existing Gluster features?
● Verified that it works fine with geo-replication, hence
supported
● It should “theoretically” work fine in its current state with features such as bit-rot detection, tiering, etc., because of its position in the stack
● Features that won’t readily work with sharding (at least
not without additional code changes) - quota, snapshots,
etc.
FUTURE - Sharding for general purpose use cases
Main challenges
● Classic trade-off between consistency and performance
● For performance
○ Maximise parallelism across non-overlapping regions of the large file
● For consistency
○ Keep writes atomic
○ Keep file size and block-count updates atomic and accurate
○ mtime should reflect the highest value
○ Handle truncates and appending writes correctly
The idea so far ...
● Do not try to solve fault tolerance and recovery.
○ Use replication!
● Avoid locking as far as possible (see the sketch after this list).
○ Do away with locks for writes that do not span more than one shard
○ Use locking only for writes that modify multiple shards, to prevent interleaving of multiple parallel writes
○ Introduce a common locking framework to minimize the performance impact of locking by multiple translators
● Introduce a server-side translator to manage size updates
○ Eliminate the need to take locks over the network for size updates
● Store ctime/mtime in the form of an xattr on the base file
○ This is a generic problem that needs to be solved across multiple translators
● Possibly leverage compound fops
● Bitmaps for counting blocks?
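A hedged Python sketch of the lockless-vs-locked decision from the list above (the predicate is illustrative only, not the final design):

def write_needs_lock(offset, length, shard_block_size):
    """True if the write spans more than one shard (assumes length > 0).
    Single-shard writes can go lockless; multi-shard writes take a lock so
    that parallel multi-shard writes cannot interleave."""
    first_shard = offset // shard_block_size
    last_shard = (offset + length - 1) // shard_block_size
    return first_shard != last_shard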
THANK YOU!