Data and Metadata Replication in Spectrum
Scale/GPFS
Yuri Volobuev
Spectrum Scale Development
Ver 1.0, 20-Oct-16
Background
Replication is one of the most powerful and also most misunderstood features of the
General Parallel File System (GPFS) technology that serves as the foundation of
Spectrum Scale. Replication allows:
- creating an added layer of data protection, implemented entirely in software,
independent of the storage hardware
- setting up an active/active multi-site synchronous Disaster Recovery
configuration
- leveraging internal drives and other single-tailed storage devices with high
availability
The use of replication naturally introduces costs and tradeoffs, too, and it is
important to understand what those are.
What is Replication?
Replication in GPFS means having more than one replica, or copy, of a given file
system object. An example of such an object is a user data block, a directory block, or
an internal GPFS metadata object (inode, indirect block, etc.). When replication is in
use, a single logical object may be represented on disk by one or more physical
replicas. For example, a file consisting of 10 logical blocks is represented on disk as 20
blocks when two-way data replication is in use. The level of replication specifies how
many replicas of a given object should be created. With GPFS V3.4 and later the
supported levels of replication are 1, 2, and 3. The GPFS architecture allows 4-way
replication, but this has not been tested. The level of replication can be specified
separately for data and metadata.
The purpose of replication is improved disk failure tolerance. If some disks fail, as long
as at least one replica of all relevant objects is available, IO processing can go on, even
if other replicas are on unavailable disks. Replication can be used to deal with failures
on any boundary, e.g. individual LUN, disk array, disk controller, rack, site, etc. This is
achieved through proper definition of failure groups. A failure group (or FG) is an
attribute of a GPFS disk object, specified as an integer ID. Failure groups are used to
group disks on expected failure boundaries. For example, all arrays served by the
same RAID controller can be placed in the same FG. GPFS block allocation code
follows a simple rule: different replicas of a given object must be placed in different FGs.
This guarantees that losing an entire FG doesn’t impact data availability when
replication is in use.
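The placement rule can be sketched in a few lines of Python. This is a toy model for illustration only; the function name and the rotation scheme are invented here and are not GPFS internals (the real allocator also balances free space and bandwidth):

```python
# Illustrative sketch of the failure-group placement rule: every replica
# of a given block must land in a distinct failure group (FG).

def place_replicas(failure_groups, num_replicas, block_index):
    """Pick a distinct failure group for each replica of one block.

    Rotating the starting FG per block spreads blocks across all FGs,
    which is how replication coexists with wide striping.
    """
    n = len(failure_groups)
    if num_replicas > n:
        raise ValueError("not enough failure groups for requested replication")
    start = block_index % n
    return [failure_groups[(start + i) % n] for i in range(num_replicas)]

# Two-way replication over three failure groups: consecutive blocks use
# different FG pairs, but no block ever has two replicas in the same FG.
fgs = [10, 20, 30]
for block in range(3):
    placement = place_replicas(fgs, 2, block)
    assert len(set(placement)) == 2
```

Losing any single FG in this model still leaves one replica of every block intact, which is exactly the guarantee described above.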
Logically, replication can be likened to traditional disk mirroring. While this is a good
analogy, there are also important distinctions. GPFS manages replicas on a per-object
basis, as opposed to mirroring whole disks, and individual replicas can be placed
on multiple disks. Furthermore, if multiple FGs are present, blocks are allocated from all
FGs. This means that a sufficiently large file is going to have blocks spread across all
FGs. This is done intentionally, in order to maximize the IO bandwidth by spreading the
data across as many disks as possible. This is different from traditional disk mirroring,
where a one-to-one mapping between mirrored objects is present. GPFS replication
guarantees that no two replicas of the same object can be present in the same FG, but
does not confine the spread of replicas otherwise.
The use of replication is particularly important in “shared-nothing” environments, e.g.
machines with internal disks, where each disk is only visible to a single node. This
configuration can be contrasted to a more sophisticated SAN or twin-tailed disk
configuration, where two or more nodes can access the disk through the local block
device interface. The fundamental distinction between these two configurations is the
impact of a node failure. In a SAN/twin-tailed configuration, a single node failure
doesn’t make any disks unavailable, as the disk can still be accessible from elsewhere.
In the shared-nothing configuration, a node failure means disk failure, and replication is
the only mechanism available to tolerate such a failure transparently.

[Figure: file XYZ with three blocks and two-way replication across FGs 10, 20, 30.
Block 0: replica 0 in FG 10, replica 1 in FG 30. Block 1: replica 0 in FG 20,
replica 1 in FG 10. Block 2: replica 0 in FG 30, replica 1 in FG 20.]
Replicating data vs metadata
As of 2016, all Spectrum Scale versions in service support a maximum replication
level of 3 (older versions only supported two replicas). The level of replication can
be specified independently for data and metadata. Why treat data and metadata
separately? This has to do with the impact of replication, specifically the implications for
storage utilization efficiency. Obviously, enabling replication means using more disk
space for storing the same amount of logical data, and more overhead during writes
(the performance angle is covered in more detail further down). For instance, enabling
two-way (level 2) data replication cuts the available capacity in half, and that’s
significant. Typically metadata is much more compact than data (the usual rule of
thumb for estimating metadata size is 1-2% of the total data size, although significant
deviations are possible both ways), and thus replicating metadata typically doesn’t cost
very much in terms of disk space. Relative to data, metadata replication is also not as
costly in terms of performance. A good way to think about metadata replication is as a
form of “insurance policy”. If a disaster strikes, e.g. a RAID array fails and
cannot be recovered, and all data and metadata stored on this array are lost, the
consequences can be nothing short of catastrophic in the absence of replication. If the
failed disk happens to contain pieces of critical system metadata files (e.g. inode file),
the file system may be too damaged to allow any kind of recovery, and would need to
be reformatted. However, if metadata is replicated, the file system remains accessible
in this scenario. Some files would have holes in them, but the surviving data could be
scavenged, and perhaps restored from a backup into the existing file system. This is a
much better scenario than a complete file system loss. For this reason, the official IBM
recommendation is to always have metadata at least two-way replicated.
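A quick back-of-the-envelope comparison of the capacity cost of data vs metadata replication, using the 1-2% rule of thumb from the text (the 100 TB figure is made up for illustration):

```python
# Capacity cost of two-way replication: data vs metadata.
# Uses the rule of thumb that metadata is roughly 1-2% of data size.

data_tb = 100.0
metadata_tb = data_tb * 0.02            # 2% rule of thumb, upper end

# Two-way data replication stores every data block twice: +100 TB.
data_repl_overhead_tb = data_tb * (2 - 1)

# Two-way metadata replication only doubles the metadata: +2 TB.
meta_repl_overhead_tb = metadata_tb * (2 - 1)

print(data_repl_overhead_tb)   # 100.0
print(meta_repl_overhead_tb)   # 2.0
```

This is why the “insurance policy” of metadata replication is cheap relative to the protection it buys: roughly a 2% capacity cost here, versus 50% of raw capacity for two-way data replication.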
Configuring replication
Replication can be configured on a per-file system or a per-file basis. The level of
replication is an attribute of both the file system and the individual inode. The default
inode level of replication is usually inherited from the file system settings, although it is
also possible to use the placement policy mechanism to override the level of replication
for select files (e.g. files in a certain fileset). By far the most common approach is to
specify the file system-wide replication levels at the file system format time, and not
mess with those afterwards. For example:
mmcrfs ... -R 2 -r 2 -M 2 -m 2 ...
These mmcrfs flags specify the maximum level of data replication (-R), the maximum
level of metadata replication (-M), the actual level of data replication (-r) and the actual
level of metadata replication (-m), all set to 2, meaning two physical replicas of each
logical object. These settings are fairly typical, and allow for the loss of disks in one
failure group without affecting data availability.
What is the rationale for the extra complexity of allowing the specification of maximum
and actual levels of replication separately? This provides a balance between
metadata space efficiency and long-term planning flexibility. Replication requires
pointing at more than one disk location for a single logical block, which implies needing
more space for disk address storage in inodes and indirect blocks. For system
metadata files, this space has to be reserved at file system format time, as it would be
prohibitively hard to reformat all affected metadata objects on an existing file system if a
different level of replication is desired. If replication is deemed to be potentially needed
at some point in the future, but not right this moment, it would be prudent to specify an
appropriate level of maximum data and metadata replication at file system format time,
but leave the actual replication levels at 1:
mmcrfs ... -R 2 -r 1 -M 2 -m 1 ...
This will result in the space for extra disk address pointers being reserved in inodes and
indirect blocks for future use, but replication per se not being enabled. Note that the
maximum level of metadata replication (-M) defaults to 2.
How can the level of replication be changed on an existing file system? This is an
involved process. The default file system replication levels can be modified using
mmchfs -m/-r. The desired level of replication for an individual file can be changed
using mmchattr (or the migration policy engine, or the GPFS API). Note that those
operations only update the desired replication levels for future operations, but do not by
themselves do anything about blocks already replicated (or not). Changing the actual
level of replication isn’t cheap: each block in the affected object(s) needs to be
processed, and either additional replicas need to be written (if the replication level is
increased) or deallocated (if it is decreased). For an individual file, the
mmrestripefile command can be used to perform the necessary processing.
However, it’s much more common to process an entire file system using the
mmrestripefs command. If the file system level of replication has been changed, the
mmrestripefs -R command can be used to apply those new levels to all existing files.
The mmapplypolicy facility can also be used to perform the type of massively parallel
processing needed for a global change in the level of replication, but it should be kept in
mind that mmapplypolicy can’t operate on system metadata files.
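As a rough model of why raising the replication level is expensive, the data movement implied by a subsequent mmrestripefs -R can be estimated with simple arithmetic (the helper below is hypothetical, not a GPFS tool):

```python
# Back-of-the-envelope estimate of the replica data that mmrestripefs -R
# must write after the desired replication level has been raised.
# Illustrative accounting only.

def restripe_write_volume_tb(logical_data_tb, old_level, new_level):
    """Extra replica data written to reach the new replication level."""
    if new_level <= old_level:
        # Lowering the level deallocates replicas rather than writing them.
        return 0.0
    return logical_data_tb * (new_level - old_level)

# Going from 1 to 2 replicas on 50 TB of logical data means writing a
# whole extra copy of everything:
print(restripe_write_volume_tb(50.0, 1, 2))  # 50.0
```

Every block of that extra copy has to be read, allocated in a different failure group, and written, which is why a file-system-wide replication change is treated as a major maintenance operation.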
Failure recovery
The replication raison d'être is protection against disk failures (which in a distributed
environment may actually be disk, node, or network interconnect failures). When an IO
request to a given disk fails, for whatever reason, GPFS code goes through some
complex logic to decide what to do next. If replication is enabled, and a replica of the
object in question exists on a healthy disk, the disk for which the failure is seen is
usually declared ‘down’, and the IO request is satisfied using the other replica. All of
this is completely transparent to the application layer [1]. Once a disk is marked ‘down’, it
remains in that state until some action is undertaken to put it back in service.
Specifically, the mmchdisk start command needs to be issued, by a human operator
or an automated management framework, when the disk is known to be ready for use
again. This command performs what can be thought of as a “resync” operation. What
exactly is being resynced? That depends on what happened while the disk was ‘down’.

[1] In the absence of replication, some tough choices need to be made: is it better to
force-unmount the file system on this node, in the hope that the failure was due to a
disk connectivity problem on the local node, and thus another node may have better
luck, or is it better to mark the disk ‘down’ right away? There’s no easy answer.

Naturally, IO requests get handled differently while a disk is ‘down’. The reads of
data residing on the disk are resolved using replicas elsewhere. What about writes
though? Obviously, there’s no way to actually write to the disk, but what happens if a
block which has a replica on a ‘down’ disk is written to? It would be very dangerous to
simply skip the corresponding replica write without leaving some sort of a mark. What
would happen if the disk is brought back in service without anything being done about
those partially written blocks? The result would be something known as “mismatched
replicas”, a very bad outcome. If different replicas of the same logical block contain
different data, a read may return different results depending on the circumstances (read
on for more on the replica choice during reads), and that would be very bad indeed. It is
therefore absolutely critical to keep track of what replicated blocks were partially written,
and restore full replica consistency when a formerly ‘down’ disk is put back in service,
by rewriting replicas that are not up to date. The mechanism for accomplishing this
changed over the years.
Prior to GPFS V4.1, partial replica writes were tracked on the per-inode basis. A write
that updates some but not all replicas of a data block in a given inode resulted in the
“dataUpdateMiss” inode flag being set. Similarly, a partial metadata write resulted in the
“metaUpdateMiss” inode flag being set. When “mmchdisk start” ran, a full inode
scan was performed, and all replicas belonging to the flagged inodes were rewritten. This
approach produced full replica consistency, but had a serious drawback: due to the
coarse granularity of tracking, if any replica in a file was partially written, all replicas
had to be rewritten at “mmchdisk start” time. For large files, this can be quite
problematic. This deficiency was addressed in GPFS V4.1 with the introduction of the
“rapid repair” feature. When the feature is enabled (this can be checked with mmlsfs),
the tracking of the partially written replicas is done on the per-block level. This means
that only those blocks that were actually written to while a disk was ‘down’, and have a
replica on a ‘down’ disk, need to be rewritten at “mmchdisk start” time. Naturally,
this can speed up the resync time dramatically, especially in those environments where
very large files are present (e.g. many databases). Note that “rapid repair” is a feature
that is tied to the file system version, and thus doesn’t get enabled solely by upgrading
the running GPFS code version. In addition, the file system format version needs to be
moved up, using “mmchfs -V”. Unfortunately, the early implementation of “rapid
repair” contained a critical bug, so the current service must be applied before enabling
this feature.
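The difference between the two tracking schemes can be modeled in a few lines. This is a hypothetical sketch, not the actual GPFS data structures:

```python
# Toy model contrasting per-inode flags (pre-4.1) with per-block
# tracking ("rapid repair", GPFS 4.1+) of partially written replicas.

class FileReplicaTracker:
    def __init__(self, num_blocks):
        self.num_blocks = num_blocks
        self.update_miss = False     # pre-4.1: a single flag on the inode
        self.stale_blocks = set()    # 4.1+: exact set of affected blocks

    def record_partial_write(self, block):
        # A write that missed a replica on a 'down' disk.
        self.update_miss = True
        self.stale_blocks.add(block)

    def blocks_to_resync(self, rapid_repair):
        # What "mmchdisk start" has to rewrite for this file.
        if rapid_repair:
            return sorted(self.stale_blocks)
        return list(range(self.num_blocks)) if self.update_miss else []

f = FileReplicaTracker(num_blocks=1000)
f.record_partial_write(7)
print(len(f.blocks_to_resync(rapid_repair=False)))  # 1000
print(len(f.blocks_to_resync(rapid_repair=True)))   # 1
```

With a single dirty block in a 1000-block file, the coarse scheme rewrites all 1000 blocks while the per-block scheme rewrites one, which is why the resync speedup is most dramatic for very large files.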
What happens if a failure occurs during a replicated write? It’s impossible to perform a
write to multiple physical storage devices atomically. There’s always the possibility that
a failure may occur after some replica writes had completed but not all. Needless to
say, it would be unacceptable if such an event resulted in mismatched replicas. Disk
and disk server node failures are a part of life, and do occur occasionally, and the
recovery from such events needs to be robust. A common approach to dealing with
failures of mirrored devices is to perform a scan-based resync. This can be quite
expensive; this is an undesirable outcome for common events like node failures. GPFS
approaches the problem differently: logging (also known as journaling) is used to
restore replica consistency after interrupted replicated writes. For each replicated write
operation, a “begin replicated write” log record is written out first, and at the conclusion
of the write, a matching “end replicated write” record is spooled. The logging allows
restoring consistency of any replicas with potentially in-flight writes during node failure
recovery, when the file system manager replays the log belonging to the failed GPFS
node. This means that replicas are fully consistent and available for IO immediately
after the conclusion of node recovery, which typically takes just a few seconds. Note
that, as with all other logged operations, the log replay may either redo or undo the update,
depending on when the failure occurred.
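The begin/end logging protocol can be sketched as follows. This is a toy model (the names and structures are invented for illustration); it shows how replaying a log record without a matching "end" lets recovery restore replica consistency after an interrupted replicated write:

```python
# Sketch of logging around a replicated write, and of log replay after a
# node failure. One shared block with two replicas; the "redo" branch is
# shown (copying the surviving up-to-date replica over the stale one).

log = []                          # the failed node's recovery log
replicas = {0: "old", 1: "old"}   # two replicas of one logical block

def replicated_write(new_data, fail_after_first=False):
    log.append(("begin", "block X"))   # committed before any data write
    replicas[0] = new_data
    if fail_after_first:
        return                         # node dies: replicas now mismatch
    replicas[1] = new_data
    log.append(("end", "block X"))

def replay_log():
    # A "begin" without a matching "end" marks potentially mismatched
    # replicas; recovery rewrites them from a consistent copy.
    begins = sum(1 for rec in log if rec[0] == "begin")
    ends = sum(1 for rec in log if rec[0] == "end")
    if begins > ends:
        replicas[1] = replicas[0]

replicated_write("new", fail_after_first=True)
assert replicas[0] != replicas[1]      # mismatched after the failure
replay_log()
assert replicas[0] == replicas[1]      # consistent after log replay
```

The key property is that only the blocks named in unmatched log records need any work, so recovery completes in seconds rather than requiring a full scan-based resync.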
Performance implications
It is intuitively clear that replication must have some impact on IO performance.
Obviously, writing multiple replicas vs a single replica is more expensive, but how much
more? And if multiple replicas are present for the same block, what replica is used for
doing reads?
Replica Reads
When multiple replicas are available at read time, there are several different strategies
that can be envisioned for the selection of the replica (or replicas) to read.
The simplest strategy is to simply pick the first replica (replica 0) at all times. This is
what GPFS code does by default, when no other cues for replica selection exist. This is
a good strategy in a situation where all disks have a comparable level of performance.
What if some disks are better read targets than others? A typical example is a “stretch
cluster” configuration, where some disks are local to a given site, and some disks are
remote. In this case, it would be clearly advantageous to pick a local disk over a remote
one when choosing a replica to read. This is very simple conceptually, but
implementing these semantics is trickier than one may imagine. The basic problem is:
how can GPFS tell whether a disk is local or remote? If the choice is between a disk
available through the local OS block device interface and a disk available via an NSD
server, the choice is clear [2]. However, if both disks appear as block devices (e.g. when
a SAN is present), or both disks are behind an NSD server, the selection task is non-
trivial. There’s no mechanism in GPFS to define something like a disk topology, which
could be used to figure out how “far away” a given disk or an NSD server is from a given
node.
In order to provide a simple solution for the read replica selection problem in a specific
subset of configurations, the feature controlled by the “readReplicaPolicy=local”
configuration parameter was added. In this configuration, GPFS attempts to make a
guess about the “locality” of a given disk using the network topology as a hint. If a given
NSD has an NSD server defined, and the IP address of the NSD server happens to be
on the same IP subnet as the local node, this NSD is considered to be “more local” than
an NSD on a different subnet. For example, in a “stretch cluster” configuration, if all
nodes in SiteA belong to one subnet, while all nodes in SiteB belong to another subnet,
the subnet topology in combination with “readReplicaPolicy=local” logic would
allow effective local replica selection on reads.

[2] If the local block device is present, and the NSD server is defined for the NSD in
question, the local block device is used. This is the default behavior for GPFS even
when replication isn’t in use. It can be changed using “mmchfs useNsdServer”.

The disadvantages of this configuration
model are similarly clear: it is not always feasible or desirable to lay out the subnet
configuration in the fashion preferred by GPFS. And if the performance characteristics
of different disks depend on something other than NSD server address, this mechanism
is of no use.
To address the more general case of replica selection on reads for optimal
performance, GPFS V4.1.1 introduced a new mode of operation,
“readReplicaPolicy=fastest”. This mode, just as the name suggests, involves
picking the replica from the disk that has been observed to have the lowest read
latency. This covers both the case of local vs remote NSDs, and the case of different
disks having different performance profiles (e.g. having one set of replicas on SSD and
another on spinning disk).
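One plausible way to model “fastest” selection is to keep a moving average of observed read latency per disk and prefer the lowest. This sketch illustrates the general idea only; it is an assumption, not the actual readReplicaPolicy=fastest implementation:

```python
# Toy model of latency-based replica selection: track an exponentially
# weighted moving average (EWMA) of read latency per disk, and read the
# replica on the disk with the lowest observed latency.

class DiskStats:
    def __init__(self, name):
        self.name = name
        self.avg_ms = None

    def observe(self, latency_ms, alpha=0.2):
        # EWMA smooths out individual slow reads.
        if self.avg_ms is None:
            self.avg_ms = latency_ms
        else:
            self.avg_ms = (1 - alpha) * self.avg_ms + alpha * latency_ms

def pick_replica(disks):
    # Disks with no measurements yet are treated as slow here; a real
    # implementation would probe them to gather latency samples.
    return min(disks, key=lambda d: d.avg_ms if d.avg_ms is not None else float("inf"))

ssd, hdd = DiskStats("ssd"), DiskStats("hdd")
for lat in (0.3, 0.4, 0.2):
    ssd.observe(lat)
for lat in (8.0, 12.0, 9.0):
    hdd.observe(lat)
print(pick_replica([ssd, hdd]).name)  # ssd
```

Note how this handles both cases mentioned in the text: a remote NSD naturally shows higher latency than a local one, and a spinning disk shows higher latency than flash, without any explicit topology configuration.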
Replica Writes
Replicated writes are more expensive than non-replicated writes, that much is intuitively
clear. However, the actual cost of replicated writes is often higher than is expected, for
several reasons.
Quite obviously, when data replication is enabled, several replicas of the actual data
block need to be written. The writes are submitted in parallel, which softens the impact
somewhat, but only so much. In particular, when one of the writes is significantly slower
than the other (e.g. a write to a remote site), the latency of that write dominates the
overall write performance. A replicated write is a synchronous operation in GPFS, i.e.
all replica writes must succeed (or fail) before a write is considered to be complete (or
failed). That is, there are no provisions for a “lazy write”, where a slow replica write is
allowed to complete asynchronously.
As discussed above, every replicated write is logged, to allow for fast recovery from
interrupted replicated writes. The “begin replicated write” log record must be committed
(i.e. written to stable storage) before the actual data writes can commence, otherwise
log recovery wouldn’t be able to find all potentially mismatched replicas. A log write is a
metadata write, and if data is replicated, the metadata is replicated as well, and thus a
single logical log write translates into several physical disk writes to metadata disks.
For example, for a file with 2-way replication an overwrite-in-place of a single data block
results in 4 IO requests: 2 metadata IOs for log records, and 2 data replica writes. Of
course, the overwrite-in-place operation is less common than the append operation,
where new blocks are added, increasing the file size. In this case, there are additional
metadata changes that need to be committed: file size (stored in the inode), and if new
block allocations have taken place, those need to be committed as well. When
possible, multiple metadata updates are committed together using complex log records,
but this is not possible in all cases. For example, for a file with 3-way replication, an
append operation typically results in 9 IO requests: 3 log replica writes, 3 inode replica
writes, 3 data replica writes.
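The IO accounting from these examples can be captured as simple arithmetic (an illustrative model; as noted above, real GPFS batches some of these commits into complex log records where possible):

```python
# IO counts for replicated writes, following the examples in the text.

def overwrite_ios(data_replicas, metadata_replicas):
    # Overwrite-in-place: one log record plus the data block,
    # each written once per replica.
    return metadata_replicas + data_replicas

def append_ios(data_replicas, metadata_replicas):
    # Append: log record plus inode update (new file size) plus the
    # data block, each written once per replica.
    return 2 * metadata_replicas + data_replicas

print(overwrite_ios(2, 2))  # 4: the 2-way overwrite example in the text
print(append_ios(3, 3))     # 9: the 3-way append example in the text
```

The model also makes the asymmetry plain: without replication, the same overwrite is a single data IO, so two-way replication turns 1 IO into 4, not merely 2.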
The actual impact of the added overhead of replication for writes depends heavily on the
overall mode of operation. For standard buffered IO, when plentiful pagepool space is
available, a significant portion of the replication overhead can be amortized, thanks to
consolidation of small writes, group commits of metadata changes, and the general
background nature of dirty data flushes. Bursts of IO can be absorbed in pagepool and
flushed asynchronously. However, for synchronous writes, the added overhead cannot
be masked; the extra writes become visible in the overall write latency. One
configuration is particularly affected: DirectIO (O_DIRECT) write processing for data
residing on disks visible as local block devices. This configuration is commonly used by
databases performing latency-sensitive OLTP processing, and the corresponding code
path in GPFS code is heavily optimized. When all information needed for the write
submission (inode, indirect blocks, tokens) is present in cache, the write is submitted
without leaving the kernel (unlike most other IO operations in GPFS, where the
execution has to be transferred to the GPFS userspace daemon, mmfsd). This results
in a (relatively) short code path with optimized locking and good SMP scaling, and thus
minimal added latency on the GPFS level. Unfortunately, this only applies to non-
replicated writes. When a replicated write comes in, the processing is much more
complicated (the write must be logged), and is done mostly in mmfsd. This code path is
considerably longer, and requires more sophisticated locking. All of this further
compounds the impact of replication on data writes, especially in environments where
disk devices offer very low write latency, and thus make the impact of any added
overhead on the file system level more pronounced.
Whether using data replication is appropriate in a given solution is a complex question
with many variables: workload, disk, network, and file system configuration all play a role.
However, some general recommendations can be made:
1) Read-mostly workloads can perform well when replication is used, especially if
one of the replicas is available on faster storage. A replicated read is as cheap
as a non-replicated read, and GPFS has the logic to automatically select the
replica residing on the faster disk.
2) Write-heavy workloads can have their performance impacted heavily when
replication is enabled, more so than may be intuitive to assume. Buffered
workloads that are not sensitive to write latency tend to be affected less, while
synchronous/DirectIO workloads are affected more. Since significant portions of
a replicated data write are actually metadata IO, the performance of metadata
writes has a significant impact on the data write latency.
Data replication certainly has a cost, in terms of storage utilization efficiency and write
performance. It is important to understand those costs during solution design. While
replication is a powerful tool, the costs make it a poor choice for some configurations.
Replication and Disaster Recovery
One particularly popular way of using replication in GPFS is the “stretch cluster”
configuration. This configuration provides a way to implement an active/active,
synchronous Disaster Recovery (DR) solution, typically using two main sites (referred to
as SiteA and SiteB here), plus a small 3rd site that acts as a tiebreaker. This means that
applications can run simultaneously at both sites (as opposed to the active site/backup
site), with synchronous semantics (all committed writes are committed on both sites).
This allows either of the sites to experience a disaster that leads to a complete site loss
(or a loss of the site availability) without affecting operations on the second site: IO
processing will continue after the disaster, after a short pause to handle node failures,
without any downtime or lost data. This DR solution is described in detail in the
“Synchronous mirroring with GPFS replication” section of the IBM Spectrum Scale
product documentation. We will not repeat the same material here, but will focus
instead on the solution design aspects that tend to breed questions.
This is a very attractive DR model, for obvious reasons, but naturally there’s a catch: the
cost of replicated writes. Since all writes need to be committed to disks on both sites,
the latency of the network link connecting the two sites is a major factor that dictates
many aspects of the performance of the overall solution. In the perfect world, the
network link would be fast and reliable, but this can be hard to achieve in practice. After
all, many disasters by definition affect wide geographic areas, and thus it is necessary
to place the two sites sufficiently far apart in order to increase the chance of one of
them surviving. Implementing a fast and reliable TCP/IP network connection over long
distances is, in a word, hard. Besides the limits imposed by physics (such as the speed
of light), more practical considerations (like having multiple network segments) play a
major role.
A question gets asked quite often: what is the minimal level of performance of the inter-
site network link for a “stretch cluster” configuration to be feasible? Unfortunately, the
only simple answer is “it depends”. A poor inter-site link leads to pain, in the form of
poor write performance, spurious connection failures, and uneven performance levels.
Everyone has a different level of pain tolerance. As discussed earlier, the impact of
replication, and thus the impact of the inter-site network link, depends a lot on the
nature of the workload. A read-mostly workload is going to be a lot less sensitive to the
network latency than a write-dominated one. Applications that perform buffered IO and
are not greatly affected by write latency can do quite well, while databases that have a
need to frequently perform small DirectIO writes are going to be very sensitive to the
quality of the link.
For any workload, the link has to be reliable enough to not produce hard failures, which
generally means the packet loss rate must remain very low, well below 1%. Experience
shows that most TCP/IP implementations cannot tolerate a higher rate of packet loss
without it periodically resulting in a broken connection. GPFS tends to transfer large
blocks over TCP/IP, using large send and receive buffers, and on Linux such transfers
are particularly badly affected by packet loss. Even though it may seem that the packet
loss rate should be survivable, in practice this is not the case. Since a broken
connection means losing connectivity to the other site, and all disks belonging to that
site, such an event leads to disks being declared ‘down’, requiring a resync operation
when the connectivity is restored. It is very difficult to run production workloads in such
an environment.
It is difficult to provide any hard guidance on the network link performance. As long as
no hard network errors are encountered, GPFS itself can tolerate slow networks (to a
point: disk leases need to be renewed every 35 seconds, by default. And if the network
is not fast enough for that, this truly is a basket case). What is acceptable for end users
can vary. For GPFS replication, the most important performance characteristic is
latency. Obviously, link bandwidth is important too (as it effectively limits the overall
GPFS write bandwidth available on the cluster), but high bandwidth certainly cannot
compensate for high latency. When replication is used, the latency of small log writes is
as important as the latency of large block writes, and for many workloads that latency
cannot be masked. Generally speaking, it is desirable to have network latency that
doesn’t substantially exceed disk write latency. For example, if spinning disks with
6-20 ms average write latency are used, network latency in the 10-20 ms range is
probably going to be tolerable, but 50 ms is probably too much for almost all workloads.
If flash with sub-millisecond latency is used, the network needs to be commensurately
fast in order to avoid a massive performance hit. And considering that a typical
metadata operation (e.g. file create) involves multiple separate metadata IOs (before
the impact of metadata replication is even considered), it is not hard to do the math and
come to a conclusion that latencies in high double digits and up would result in
excruciatingly poor levels of performance.
Again, all of this is a matter of workload and end user pain tolerance.
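As a rough illustration of the latency arithmetic, consider a metadata operation that issues several dependent synchronous IOs, each paying the inter-site round trip (the IO count and latency figures below are hypothetical):

```python
# Why inter-site latency dominates stretch-cluster performance: an
# operation built from N dependent synchronous IOs can complete at most
# 1000 / (N * latency_ms) times per second on a single stream.

def ops_per_second(serial_ios_per_op, per_io_latency_ms):
    return 1000.0 / (serial_ios_per_op * per_io_latency_ms)

# A metadata operation with, say, 4 dependent IOs per operation:
print(ops_per_second(4, 0.5))   # 500.0 ops/s on sub-ms local flash
print(ops_per_second(4, 50.0))  # 5.0 ops/s over a 50 ms inter-site link
```

A two-orders-of-magnitude drop for the same workload is what "excruciatingly poor" looks like in practice, and no amount of link bandwidth recovers it, since the IOs are serialized by dependency rather than limited by throughput.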