Sharding: Past, Present and Future
Krutika Dhananjay - [email protected], Engineer
PAST
Alas! Sharding has no past!
Striping
What is striping?
● Client-side translator - sits below DHT
● Every striped file is stored as <stripe-count> piece files, one per stripe subvolume
● Each file is split into chunks of size <stripe-block-size>
● Consecutive chunks are spread across the piece files in a round-robin fashion
[Diagram: striping with stripe-count = 3. An unstriped file made up of 10 chunks, each of size <stripe-block-size>, is laid out round-robin across the stripe subvolumes after striping: stripe-subvol-0 holds chunks 1, 4, 7 and 10; stripe-subvol-1 holds chunks 2, 5 and 8; stripe-subvol-2 holds chunks 3, 6 and 9.]
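A minimal Python sketch of the round-robin placement shown in the diagram above (the helper and its parameters are illustrative, not the stripe translator's actual code):

def stripe_location(offset, stripe_block_size, stripe_count):
    """Map a byte offset in the original file to (stripe subvolume index,
    offset within that subvolume's piece file). Illustrative only."""
    chunk = offset // stripe_block_size        # which chunk the byte falls in
    subvol = chunk % stripe_count              # chunks go round-robin across subvolumes
    # each subvolume stores every stripe_count-th chunk, packed back to back
    piece_offset = (chunk // stripe_count) * stripe_block_size + offset % stripe_block_size
    return subvol, piece_offset

# With stripe-count = 3, chunks 1, 4, 7, 10 land on stripe-subvol-0,
# chunks 2, 5, 8 on stripe-subvol-1 and chunks 3, 6, 9 on stripe-subvol-2,
# matching the diagram.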
Striping in Action
Stripe translator - shortcomings
● Cost - you can add servers only in multiples of
‘stripe-count * replica-count’.
● File splitting not granular enough
○ Self-heal of a striped file must still heal ‘total_file_size/stripe_count’ bytes of data.
○ Geo-replication of a striped file must still sync ‘total_file_size/stripe_count’ bytes of data.
Stripe shortcomings contd ...
● Suboptimal utilization of disks
○ An ‘x’ TB file would still require at least ‘x/stripe-count’ TB of space available in any subvolume of DHT
○ … which in turn implies suboptimal distribution of IOPS across bricks for a given file.
Present - Sharding for VM Image Storage
What is sharding?
● Client-side xlator – sits above DHT
● Splits a file into equal-sized chunks as it grows in size
● Shards beyond first block kept in a hidden /.shard
directory and the first block under its parent dir
● Translators above shard only see the user files
● Translators below shard see shards as normal files
● Shard naming is <gfid>.<num>
● Shard size configurable at volume level - 4MB to 4TB
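A rough Python sketch of the naming scheme described above (the helper is hypothetical; only the <gfid>.<num> naming, the /.shard location and the base-file placement come from the slides):

def shard_path(gfid, offset, shard_block_size):
    """Return where the shard holding a given byte offset lives."""
    num = offset // shard_block_size
    if num == 0:
        # block 0 stays in the base file under its parent directory
        return "<path of the base file>"
    return "/.shard/%s.%d" % (gfid, num)

# e.g. with a 4 MB shard size, byte offset 9 MB of a file falls in shard 2,
# i.e. the file /.shard/<gfid>.2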
[Diagram: client-side translator stack - FUSE / gfapi / other protocol on top of io-stats, write-behind, shard and DHT; DHT fans out to AFR-0 (protocol/client-0, protocol/client-1) and AFR-1 (protocol/client-2, protocol/client-3).]
Short Demo
https://asciinema.org/a/brvrvh2fhhl7djlpboz74y4ll
How sharding benefits the use case
● Granularity of data heal is at shard level
● Minimal resource utilization by background processes
(self-heal, geo-rep, etc)
● VM image size no longer limited by the capacity of
individual brick(s)
● Better distribution of IOPs across bricks
● Geo-rep can now operate at shard level
● Add new bricks only after existing bricks’ space is fully
utilized
Result?
Happy community users!
FAQ
Where is the file metadata stored?
● File permissions, ownership, aggregated file size, block count and user-set extended attributes are maintained only on the base file. Shards under “.shard” are owned by root.
● Since sharding is currently used only in the single-writer use case, mtime is maintained on a best-effort basis in memory and kept up-to-date as individual shards witness writes.
Moral of the story - lookup, stat, {get,set,remove}xattr are directly
served from the base file => 1 network call.
How does writing to a sharded file work?
● Create ‘.shard’ if it doesn’t exist.
● Identify participant shards, given write offset and length.
● Create shards if non-existent, in parallel.
● Send writes on participant shards at appropriate shard
offsets in parallel.
● Once all write responses are received, update size and
block count through an xattrop operation.
● Update in-memory cache containing the file size and
block-count and unwind the call.
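A small illustrative sketch of the participant-shard computation described above (hypothetical helper, not the shard translator's actual code):

def participant_shards(offset, length, shard_block_size):
    """Yield (shard number, offset within that shard, byte count) for a write."""
    end = offset + length
    cur = offset
    while cur < end:
        num = cur // shard_block_size              # shard this range starts in
        shard_off = cur % shard_block_size         # offset inside that shard
        count = min(shard_block_size - shard_off, end - cur)
        yield num, shard_off, count
        cur += count

# A 10 MB write at offset 0 with 4 MB shards touches shards 0, 1 and 2
# (4 MB + 4 MB + 2 MB); the per-shard writes go out in parallel and the
# size/block-count xattrop follows once all of them complete.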
How are renames and hard-links handled?
● Both fops operate only on the base file.
● File’s gfid remains constant even after a rename => shards under ‘.shard’ don’t need to be renamed.
● In other words, renaming and hard-linking a sharded file
completes in one single (atomic) network call.
Interoperability with existing Gluster features?
● Verified that it works fine with geo-replication, hence
supported
● It should “theoretically” work fine in its current state with features such as bit-rot detection, tiering, etc., because of its position in the stack
● Features that won’t readily work with sharding (at least
not without additional code changes) - quota, snapshots,
etc.
FUTURE - Sharding for general purpose use cases
Main challenges
● Classic trade-off between consistency and performance
● For performance
○ Maximise parallelism across non-overlapping regions of the large file
● For consistency
○ Keep writes atomic
○ Keep file size and block-count updates atomic and accurate
○ mtime should reflect the highest value
○ Handle truncates and appending writes correctly
The idea so far ...
● Do not try to solve fault tolerance and recovery.
○ Use replication!
● Avoid locking as far as possible (see the sketch after this list).
○ Do away with locks for writes that do not span more than one shard
○ Use locking only for writes that modify multiple shards, to prevent interleaving of multiple parallel writes
○ Introduce a common locking framework to minimize the performance impact of locking by multiple translators
● Introduce a server-side translator to manage size updates
○ Eliminate the need to take locks over the network for size updates
● Store ctime/mtime in the form of an xattr on the base file
○ This is a generic problem that needs to be solved across multiple translators
● Possibly leverage compound fops
● Bitmaps for counting blocks?
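A hedged Python sketch of the lockless-vs-locked decision from the list above (the predicate is illustrative only, not the final design):

def write_needs_lock(offset, length, shard_block_size):
    """True if the write spans more than one shard (assumes length > 0).
    Single-shard writes can go lockless; multi-shard writes take a lock so
    that parallel multi-shard writes cannot interleave."""
    first_shard = offset // shard_block_size
    last_shard = (offset + length - 1) // shard_block_size
    return first_shard != last_shard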
THANK YOU!