Data and Metadata Replication in Spectrum
Scale/GPFS
Yuri Volobuev
Spectrum Scale Development
Ver 1.0, 20-Oct-16
Background
Replication is one of the most powerful and also most misunderstood features of the
General Parallel File System (GPFS) technology that serves as the foundation of
Spectrum Scale. Replication allows:
- creating an added layer of data protection, implemented entirely in software,
independent of the storage hardware
- setting up an active/active multi-site synchronous Disaster Recovery
configuration
- leveraging internal drives and other single-tailed storage devices with high
availability
The use of replication naturally introduces costs and tradeoffs, too, and it is
important to understand what those are.
What is Replication?
Replication in GPFS means having more than one replica, or copy, of a given file
system object. An example of such an object is a user data block, a directory block, or
an internal GPFS metadata object (inode, indirect block, etc.). When replication is in
use, a single logical object may be represented on disk by one or more physical
replicas. For example, a file consisting of 10 logical blocks is represented on disk as 20
blocks when two-way data replication is in use. The level of replication specifies how
many replicas of a given object should be created. With GPFS V3.4 and later the
supported levels of replication are 1, 2, and 3. The GPFS architecture allows 4-way
replication, but this has not been tested. The level of replication can be specified
separately for data and metadata.
The purpose of replication is improved disk failure tolerance. If some disks fail, as long
as at least one replica of all relevant objects is available, IO processing can go on, even
if other replicas are on unavailable disks. Replication can be used to deal with failures
on any boundary, e.g. individual LUN, disk array, disk controller, rack, site, etc. This is
achieved through proper definition of failure groups. A failure group (or FG) is an
attribute of a GPFS disk object, specified as an integer ID. Failure groups are used to
group disks on expected failure boundaries. For example, all arrays served by the
same RAID controller can be placed in the same FG. GPFS block allocation code
follows a simple rule: different replicas of a given object must be placed in different FGs.
This guarantees that losing an entire FG doesn’t impact data availability when
replication is in use.
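The placement rule can be sketched in a few lines of Python. This is a toy model for illustration only; the function name and the rotation scheme are invented here and are not GPFS internals (the real allocator also balances free space and bandwidth):

```python
# Illustrative sketch of the failure-group placement rule: every replica
# of a given block must land in a distinct failure group (FG).

def place_replicas(failure_groups, num_replicas, block_index):
    """Pick a distinct failure group for each replica of one block.

    Rotating the starting FG per block spreads blocks across all FGs,
    which is how replication coexists with wide striping.
    """
    n = len(failure_groups)
    if num_replicas > n:
        raise ValueError("not enough failure groups for requested replication")
    start = block_index % n
    return [failure_groups[(start + i) % n] for i in range(num_replicas)]

# Two-way replication over three failure groups: consecutive blocks use
# different FG pairs, but no block ever has two replicas in the same FG.
fgs = [10, 20, 30]
for block in range(3):
    placement = place_replicas(fgs, 2, block)
    assert len(set(placement)) == 2
```

Losing any single FG in this model still leaves one replica of every block intact, which is exactly the guarantee described above.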
Logically, replication can be likened to traditional disk mirroring. While this is a good
analogy, there are also important distinctions. GPFS manages replicas on a per-object
basis, as opposed to mirroring whole disks, and individual replicas can be placed
on multiple disks. Furthermore, if multiple FGs are present, blocks are allocated from all
FGs. This means that a sufficiently large file is going to have blocks spread across all
FGs. This is done intentionally, in order to maximize the IO bandwidth by spreading the
data across as many disks as possible. This is different from traditional disk mirroring,
where a one-to-one mapping between mirrored objects is present. GPFS replication
guarantees that no two replicas of the same object can be present in the same FG, but
does not confine the spread of replicas otherwise.
The use of replication is particularly important in “shared-nothing” environments, e.g.
machines with internal disks, where each disk is only visible to a single node. This
configuration can be contrasted to a more sophisticated SAN or twin-tailed disk
configuration, where two or more nodes can access the disk through the local block
device interface. The fundamental distinction between these two configurations is the
impact of a node failure. In a SAN/twin-tailed configuration, a single node failure
doesn’t make any disks unavailable, as the disk can still be accessible from elsewhere.
In the shared-nothing configuration, a node failure means disk failure, and replication is
the only mechanism available to tolerate such a failure transparently.

[Figure: file XYZ with three blocks and two-way replication across FGs 10, 20, 30.
Block 0: replica 0 in FG 10, replica 1 in FG 30. Block 1: replica 0 in FG 20,
replica 1 in FG 10. Block 2: replica 0 in FG 30, replica 1 in FG 20.]
Replicating data vs metadata
As of 2016, all Spectrum Scale versions in service support a maximum replication
level of 3 (older versions only supported two replicas). The level of replication can
be specified independently for data and metadata. Why treat data and metadata
separately? This has to do with the impact of replication, specifically the implications for
storage utilization efficiency. Obviously, enabling replication means using more disk
space for storing the same amount of logical data, and more overhead during writes
(the performance angle is covered in more detail further down). For instance, enabling
two-way (level 2) data replication cuts the available capacity in half, and that’s
significant. Typically metadata is much more compact than data (the usual rule of
thumb for estimating metadata size is 1-2% of the total data size, although significant
deviations are possible both ways), and thus replicating metadata typically doesn’t cost
very much in terms of disk space. Relative to data, metadata replication is also not as
costly in terms of performance. A good way to think about metadata replication is as a
form of “insurance policy”. If a disaster strikes, e.g. a RAID array fails and
cannot be recovered, and all data and metadata stored on this array are lost, the
consequences can be nothing short of catastrophic in the absence of replication. If the
failed disk happens to contain pieces of critical system metadata files (e.g. inode file),
the file system may be too damaged to allow any kind of recovery, and would need to
be reformatted. However, if metadata is replicated, the file system remains accessible
in this scenario. Some files would have holes in them, but the surviving data could be
scavenged, and perhaps restored from a backup into the existing file system. This is a
much better scenario than a complete file system loss. For this reason, the official IBM
recommendation is to always have metadata at least two-way replicated.
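A quick back-of-the-envelope comparison of the capacity cost of data vs metadata replication, using the 1-2% rule of thumb from the text (the 100 TB figure is made up for illustration):

```python
# Capacity cost of two-way replication: data vs metadata.
# Uses the rule of thumb that metadata is roughly 1-2% of data size.

data_tb = 100.0
metadata_tb = data_tb * 0.02            # 2% rule of thumb, upper end

# Two-way data replication stores every data block twice: +100 TB.
data_repl_overhead_tb = data_tb * (2 - 1)

# Two-way metadata replication only doubles the metadata: +2 TB.
meta_repl_overhead_tb = metadata_tb * (2 - 1)

print(data_repl_overhead_tb)   # 100.0
print(meta_repl_overhead_tb)   # 2.0
```

This is why the “insurance policy” of metadata replication is cheap relative to the protection it buys: roughly a 2% capacity cost here, versus 50% of raw capacity for two-way data replication.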
Configuring replication
Replication can be configured on a per-file system or a per-file basis. The level of
replication is an attribute of both the file system and the individual inode. The default
inode level of replication is usually inherited from the file system settings, although it is
also possible to use the placement policy mechanism to override the level of replication
for select files (e.g. files in a certain fileset). By far the most common approach is to
specify the file system-wide replication levels at the file system format time, and not
mess with those afterwards. For example:
mmcrfs ... -R 2 -r 2 -M 2 -m 2 ...
These mmcrfs flags specify the maximum level of data replication (-R), the maximum
level of metadata replication (-M), the actual level of data replication (-r) and the actual
level of metadata replication (-m), all set to 2, meaning two physical replicas of each
logical object. These settings are fairly typical, and allow for the loss of disks in one
failure group without affecting data availability.
What is the rationale for the extra complexity of allowing the specification of maximum
and actual levels of replication separately? This provides a balance between
metadata space efficiency and long-term planning flexibility. Replication requires
pointing at more than one disk location for a single logical block, which implies needing
more space for disk address storage in inodes and indirect blocks. For system
metadata files, this space has to be reserved at file system format time, as it would be
prohibitively hard to reformat all affected metadata objects on an existing file system if a
different level of replication is desired. If replication is deemed to be potentially needed
at some point in the future, but not right this moment, it would be prudent to specify an
appropriate level of maximum data and metadata replication at file system format time,
but leave the actual replication levels at 1:
mmcrfs ... -R 2 -r 1 -M 2 -m 1 ...
This will result in the space for extra disk address pointers being reserved in inodes and
indirect blocks for future use, but replication per se not being enabled. Note that the
maximum level of metadata replication (-M) defaults to 2.
How can the level of replication be changed on an existing file system? This is an
involved process. The default file system replication levels can be modified using
mmchfs -m/-r. The desired level of replication for an individual file can be changed
using mmchattr (or the migration policy engine, or the GPFS API). Note that those
operations only update the desired replication levels for future operations, but do not by
themselves do anything about blocks already replicated (or not). Changing the actual
level of replication isn’t cheap: each block in the affected object(s) needs to be
processed, and either additional replicas need to be written (if the replication level is
increased) or deallocated (if it is decreased). For an individual file, the
mmrestripefile command can be used to perform the necessary processing.
However, it’s much more common to process an entire file system using the
mmrestripefs command. If the file system level of replication has been changed, the
mmrestripefs -R command can be used to apply those new levels to all existing files.
The mmapplypolicy facility can also be used to perform the type of massively parallel
processing needed for a global change in the level of replication, but it should be kept in
mind that mmapplypolicy can’t operate on system metadata files.
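As a rough model of why raising the replication level is expensive, the data movement implied by a subsequent mmrestripefs -R can be estimated with simple arithmetic (the helper below is hypothetical, not a GPFS tool):

```python
# Back-of-the-envelope estimate of the replica data that mmrestripefs -R
# must write after the desired replication level has been raised.
# Illustrative accounting only.

def restripe_write_volume_tb(logical_data_tb, old_level, new_level):
    """Extra replica data written to reach the new replication level."""
    if new_level <= old_level:
        # Lowering the level deallocates replicas rather than writing them.
        return 0.0
    return logical_data_tb * (new_level - old_level)

# Going from 1 to 2 replicas on 50 TB of logical data means writing a
# whole extra copy of everything:
print(restripe_write_volume_tb(50.0, 1, 2))  # 50.0
```

Every block of that extra copy has to be read, allocated in a different failure group, and written, which is why a file-system-wide replication change is treated as a major maintenance operation.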
Failure recovery
The replication raison d'être is protection against disk failures (which in a distributed
environment may actually be disk, node, or network interconnect failures). When an IO
request to a given disk fails, for whatever reason, GPFS code goes through some
complex logic to decide what to do next. If replication is enabled, and a replica of the
object in question exists on a healthy disk, the disk for which the failure is seen is
usually declared ‘down’, and the IO request is satisfied using the other replica. All of
this is completely transparent to the application layer [1]. Once a disk is marked ‘down’, it
remains in that state until some action is undertaken to put it back in service.
Specifically, the mmchdisk start command needs to be issued, by a human operator
or an automated management framework, when the disk is known to be ready for use
again. This command performs what can be thought of as a “resync” operation. What
exactly is being resynced? That depends on what happened while the disk was ‘down’.

[1] In the absence of replication, some tough choices need to be made: is it better to
force-unmount the file system on this node, in the hope that the failure was due to a
disk connectivity problem on the local node, and thus another node may have better
luck, or is it better to mark the disk ‘down’ right away? There’s no easy answer.

Naturally, IO requests get handled differently while a disk is ‘down’. The reads of
data residing on the disk are resolved using replicas elsewhere. What about writes
though? Obviously, there’s no way to actually write to the disk, but what happens if a
block which has a replica on a ‘down’ disk is written to? It would be very dangerous to
simply skip the corresponding replica write without leaving some sort of a mark. What
would happen if the disk is brought back in service without anything being done about
those partially written blocks? The result would be something known as “mismatched
replicas”, a very bad outcome. If different replicas of the same logical block contain
different data, a read may return different results depending on the circumstances (read
on for more on the replica choice during reads), and that would be very bad indeed. It is
therefore absolutely critical to keep track of what replicated blocks were partially written,
and restore full replica consistency when a formerly ‘down’ disk is put back in service,
by rewriting replicas that are not up to date. The mechanism for accomplishing this
changed over the years.
Prior to GPFS V4.1, partial replica writes were tracked on the per-inode basis. A write
that updates some but not all replicas of a data block in a given inode resulted in the
“dataUpdateMiss” inode flag being set. Similarly, a partial metadata write resulted in the
“metaUpdateMiss” inode flag being set. When “mmchdisk start” ran, a full inode
scan was performed, and all replicas belonging to the flagged inodes were rewritten. This
approach produced full replica consistency, but had a serious drawback: due to the
coarse granularity of tracking, if any replica in a file was partially written, all replicas
had to be rewritten at “mmchdisk start” time. For large files, this can be quite
problematic. This deficiency was addressed in GPFS V4.1 with the introduction of the
“rapid repair” feature. When the feature is enabled (this can be checked with mmlsfs),
the tracking of the partially written replicas is done on the per-block level. This means
that only those blocks that were actually written to while a disk was ‘down’, and have a
replica on a ‘down’ disk, need to be rewritten at “mmchdisk start” time. Naturally,
this can speed up the resync time dramatically, especially in those environments where
very large files are present (e.g. many databases). Note that “rapid repair” is a feature
that is tied to the file system version, and thus doesn’t get enabled solely by upgrading
the running GPFS code version. In addition, the file system format version needs to be
moved up, using “mmchfs -V”. Unfortunately, the early implementation of “rapid
repair” contained a critical bug, so the current service must be applied before enabling
this feature.
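The difference between the two tracking schemes can be modeled in a few lines. This is a hypothetical sketch, not the actual GPFS data structures:

```python
# Toy model contrasting per-inode flags (pre-4.1) with per-block
# tracking ("rapid repair", GPFS 4.1+) of partially written replicas.

class FileReplicaTracker:
    def __init__(self, num_blocks):
        self.num_blocks = num_blocks
        self.update_miss = False     # pre-4.1: a single flag on the inode
        self.stale_blocks = set()    # 4.1+: exact set of affected blocks

    def record_partial_write(self, block):
        # A write that missed a replica on a 'down' disk.
        self.update_miss = True
        self.stale_blocks.add(block)

    def blocks_to_resync(self, rapid_repair):
        # What "mmchdisk start" has to rewrite for this file.
        if rapid_repair:
            return sorted(self.stale_blocks)
        return list(range(self.num_blocks)) if self.update_miss else []

f = FileReplicaTracker(num_blocks=1000)
f.record_partial_write(7)
print(len(f.blocks_to_resync(rapid_repair=False)))  # 1000
print(len(f.blocks_to_resync(rapid_repair=True)))   # 1
```

With a single dirty block in a 1000-block file, the coarse scheme rewrites all 1000 blocks while the per-block scheme rewrites one, which is why the resync speedup is most dramatic for very large files.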
What happens if a failure occurs during a replicated write? It’s impossible to perform a
write to multiple physical storage devices atomically. There’s always the possibility that
a failure may occur after some replica writes had completed but not all. Needless to
say, it would be unacceptable if such an event resulted in mismatched replicas. Disk
and disk server node failures are a part of life, and do occur occasionally, and the
recovery from such events needs to be robust. A common approach to dealing with
failures of mirrored devices is to perform a scan-based resync. This can be quite
expensive; this is an undesirable outcome for common events like node failures. GPFS
approaches the problem differently: logging (also known as journaling) is used to
restore replica consistency after interrupted replicated writes. For each replicated write
operation, a “begin replicated write” log record is written out first, and at the conclusion
of the write, a matching “end replicated write” record is spooled. The logging allows
restoring consistency of any replicas with potentially in-flight writes during node failure
recovery, when the file system manager replays the log belonging to the failed GPFS
node. This means that replicas are fully consistent and available for IO immediately
after the conclusion of node recovery, which typically takes just a few seconds. Note
that, as with all other logged operations, the log replay may either redo or undo the update,
depending on when the failure occurred.
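The begin/end logging protocol can be sketched as follows. This is a toy model (the names and structures are invented for illustration); it shows how replaying a log record without a matching "end" lets recovery restore replica consistency after an interrupted replicated write:

```python
# Sketch of logging around a replicated write, and of log replay after a
# node failure. One shared block with two replicas; the "redo" branch is
# shown (copying the surviving up-to-date replica over the stale one).

log = []                          # the failed node's recovery log
replicas = {0: "old", 1: "old"}   # two replicas of one logical block

def replicated_write(new_data, fail_after_first=False):
    log.append(("begin", "block X"))   # committed before any data write
    replicas[0] = new_data
    if fail_after_first:
        return                         # node dies: replicas now mismatch
    replicas[1] = new_data
    log.append(("end", "block X"))

def replay_log():
    # A "begin" without a matching "end" marks potentially mismatched
    # replicas; recovery rewrites them from a consistent copy.
    begins = sum(1 for rec in log if rec[0] == "begin")
    ends = sum(1 for rec in log if rec[0] == "end")
    if begins > ends:
        replicas[1] = replicas[0]

replicated_write("new", fail_after_first=True)
assert replicas[0] != replicas[1]      # mismatched after the failure
replay_log()
assert replicas[0] == replicas[1]      # consistent after log replay
```

The key property is that only the blocks named in unmatched log records need any work, so recovery completes in seconds rather than requiring a full scan-based resync.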
Performance implications
It is intuitively clear that replication must have some impact on IO performance.
Obviously, writing multiple replicas vs a single replica is more expensive, but how much
more? And if multiple replicas are present for the same block, what replica is used for
doing reads?
Replica Reads
When multiple replicas are available at read time, there are several different strategies
that can be envisioned for the selection of the replica (or replicas) to read.
The simplest strategy is to simply pick the first replica (replica 0) at all times. This is
what GPFS code does by default, when no other cues for replica selection exist. This is
a good strategy in a situation where all disks have a comparable level of performance.
What if some disks are better read targets than others? A typical example is a “stretch
cluster” configuration, where some disks are local to a given site, and some disks are
remote. In this case, it would be clearly advantageous to pick a local disk over a remote
one when choosing a replica to read. This is very simple conceptually, but
implementing these semantics is trickier than one may imagine. The basic problem is:
how can GPFS tell whether a disk is local or remote? If the choice is between a disk
available through the local OS block device interface and a disk available via an NSD
server, the choice is clear [2]. However, if both disks appear as block devices (e.g. when
a SAN is present), or both disks are behind an NSD server, the selection task is non-
trivial. There’s no mechanism in GPFS to define something like a disk topology, which
could be used to figure out how “far away” a given disk or an NSD server is from a given
node.
In order to provide a simple solution for the read replica selection problem in a specific
subset of configurations, the feature controlled by the “readReplicaPolicy=local”
configuration parameter was added. In this configuration, GPFS attempts to make a
guess about the “locality” of a given disk using the network topology as a hint. If a given
NSD has an NSD server defined, and the IP address of the NSD server happens to be
on the same IP subnet as the local node, this NSD is considered to be “more local” than
an NSD on a different subnet. For example, in a “stretch cluster” configuration, if all
nodes in SiteA belong to one subnet, while all nodes in SiteB belong to another subnet,
the subnet topology in combination with “readReplicaPolicy=local” logic would
allow effective local replica selection on reads.

[2] If the local block device is present, and the NSD server is defined for the NSD in
question, the local block device is used. This is the default behavior for GPFS even
when replication isn’t in use. It can be changed using “mmchfs useNsdServer”.

The disadvantages of this configuration
model are similarly clear: it is not always feasible or desirable to lay out the subnet
configuration in the fashion preferred by GPFS. And if the performance characteristics
of different disks depend on something other than NSD server address, this mechanism
is of no use.
To address the more general case of replica selection on reads for optimal
performance, GPFS V4.1.1 introduced a new mode of operation,
“readReplicaPolicy=fastest”. This mode, just as the name suggests, involves
picking the replica from the disk that has been observed to have the lowest read
latency. This covers both the case of local vs remote NSDs, and the case of different
disks having different performance profiles (e.g. having one set of replicas on SSD and
another on spinning disk).
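One plausible way to model “fastest” selection is to keep a moving average of observed read latency per disk and prefer the lowest. This sketch illustrates the general idea only; it is an assumption, not the actual readReplicaPolicy=fastest implementation:

```python
# Toy model of latency-based replica selection: track an exponentially
# weighted moving average (EWMA) of read latency per disk, and read the
# replica on the disk with the lowest observed latency.

class DiskStats:
    def __init__(self, name):
        self.name = name
        self.avg_ms = None

    def observe(self, latency_ms, alpha=0.2):
        # EWMA smooths out individual slow reads.
        if self.avg_ms is None:
            self.avg_ms = latency_ms
        else:
            self.avg_ms = (1 - alpha) * self.avg_ms + alpha * latency_ms

def pick_replica(disks):
    # Disks with no measurements yet are treated as slow here; a real
    # implementation would probe them to gather latency samples.
    return min(disks, key=lambda d: d.avg_ms if d.avg_ms is not None else float("inf"))

ssd, hdd = DiskStats("ssd"), DiskStats("hdd")
for lat in (0.3, 0.4, 0.2):
    ssd.observe(lat)
for lat in (8.0, 12.0, 9.0):
    hdd.observe(lat)
print(pick_replica([ssd, hdd]).name)  # ssd
```

Note how this handles both cases mentioned in the text: a remote NSD naturally shows higher latency than a local one, and a spinning disk shows higher latency than flash, without any explicit topology configuration.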
Replica Writes
Replicated writes are more expensive than non-replicated writes, that much is intuitively
clear. However, the actual cost of replicated writes is often higher than is expected, for
several reasons.
Quite obviously, when data replication is enabled, several replicas of the actual data
block need to be written. The writes are submitted in parallel, which softens the impact
somewhat, but only so much. In particular, when one of the writes is significantly slower
than the other (e.g. a write to a remote site), the latency of that write dominates the
overall write performance. A replicated write is a synchronous operation in GPFS, i.e.
all replica writes must succeed (or fail) before a write is considered to be complete (or
failed). That is, there are no provisions for a “lazy write”, where a slow replica write is
allowed to complete asynchronously.
As discussed above, every replicated write is logged, to allow for fast recovery from
interrupted replicated writes. The “begin replicated write” log record must be committed
(i.e. written to stable storage) before the actual data writes can commence, otherwise
log recovery wouldn’t be able to find all potentially mismatched replicas. A log write is a
metadata write, and if data is replicated, the metadata is replicated as well, and thus a
single logical log write translates into several physical disk writes to metadata disks.
For example, for a file with 2-way replication an overwrite-in-place of a single data block
results in 4 IO requests: 2 metadata IOs for log records, and 2 data replica writes. Of
course, the overwrite-in-place operation is less common than the append operation,
where new blocks are added, increasing the file size. In this case, there are additional
metadata changes that need to be committed: file size (stored in the inode), and if new
block allocations have taken place, those need to be committed as well. When
possible, multiple metadata updates are committed together using complex log records,
but this is not possible in all cases. For example, for a file with 3-way replication, an
append operation typically results in 9 IO requests: 3 log replica writes, 3 inode replica
writes, 3 data replica writes.
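The IO accounting from these examples can be captured as simple arithmetic (an illustrative model; as noted above, real GPFS batches some of these commits into complex log records where possible):

```python
# IO counts for replicated writes, following the examples in the text.

def overwrite_ios(data_replicas, metadata_replicas):
    # Overwrite-in-place: one log record plus the data block,
    # each written once per replica.
    return metadata_replicas + data_replicas

def append_ios(data_replicas, metadata_replicas):
    # Append: log record plus inode update (new file size) plus the
    # data block, each written once per replica.
    return 2 * metadata_replicas + data_replicas

print(overwrite_ios(2, 2))  # 4: the 2-way overwrite example in the text
print(append_ios(3, 3))     # 9: the 3-way append example in the text
```

The model also makes the asymmetry plain: without replication, the same overwrite is a single data IO, so two-way replication turns 1 IO into 4, not merely 2.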
The actual impact of the added overhead of replication for writes depends heavily on the
overall mode of operation. For standard buffered IO, when plentiful pagepool space is
available, a significant portion of the replication overhead can be amortized, thanks to
consolidation of small writes, group commits of metadata changes, and the general
background nature of dirty data flushes. Bursts of IO can be absorbed in pagepool and
flushed asynchronously. However, for synchronous writes, the added overhead cannot
be masked; the extra writes become visible in the overall write latency. One
configuration is particularly affected: DirectIO (O_DIRECT) write processing for data
residing on disks visible as local block devices. This configuration is commonly used by
databases performing latency-sensitive OLTP processing, and the corresponding code
path in GPFS code is heavily optimized. When all information needed for the write
submission (inode, indirect blocks, tokens) is present in cache, the write is submitted
without leaving the kernel (unlike most other IO operations in GPFS, where the
execution has to be transferred to the GPFS userspace daemon, mmfsd). This results
in a (relatively) short code path with optimized locking and good SMP scaling, and thus
minimal added latency on the GPFS level. Unfortunately, this only applies to non-
replicated writes. When a replicated write comes in, the processing is much more
complicated (the write must be logged), and is done mostly in mmfsd. This code path is
considerably longer, and requires more sophisticated locking. All of this further
compounds the impact of replication on data writes, especially in environments where
disk devices offer very low write latency, and thus make the impact of any added
overhead on the file system level more pronounced.
Whether using data replication is appropriate in a given solution is a complex question
with many variables: workload, disk, network, and file system configuration all play a role.
However, some general recommendations can be made:
1) Read-mostly workloads can perform well when replication is used, especially if
one of the replicas is available on faster storage. A replicated read is as cheap
as a non-replicated read, and GPFS has the logic to automatically select the
replica residing on the faster disk.
2) Write-heavy workloads can have their performance impacted heavily when
replication is enabled, more so than may be intuitive to assume. Buffered
workloads that are not sensitive to write latency tend to be affected less, while
synchronous/DirectIO workloads are affected more. Since significant portions of
a replicated data write are actually metadata IO, the performance of metadata
writes has a significant impact on the data write latency.
Data replication certainly has a cost, in terms of storage utilization efficiency and write
performance. It is important to understand those costs during solution design. While
replication is a powerful tool, the costs make it a poor choice for some configurations.
Replication and Disaster Recovery
One particularly popular way of using replication in GPFS is the “stretch cluster”
configuration. This configuration provides a way to implement an active/active,
synchronous Disaster Recovery (DR) solution, typically using two main sites (referred to
as SiteA and SiteB here), plus a small 3rd site that acts as a tiebreaker. This means that
applications can run simultaneously at both sites (as opposed to the active site/backup
site), with synchronous semantics (all committed writes are committed on both sites).
This allows either of the sites to experience a disaster that leads to a complete site loss
(or a loss of the site availability) without affecting operations on the second site: IO
processing will continue after the disaster, after a short pause to handle node failures,
without any downtime or lost data. This DR solution is described in detail in the
“Synchronous mirroring with GPFS replication” section of the IBM Spectrum Scale
product documentation. We will not repeat the same material here, but will focus
instead on the solution design aspects that tend to breed questions.
This is a very attractive DR model, for obvious reasons, but naturally there’s a catch: the
cost of replicated writes. Since all writes need to be committed to disks on both sites,
the latency of the network link connecting the two sites is a major factor that dictates
many aspects of the performance of the overall solution. In the perfect world, the
network link would be fast and reliable, but this can be hard to achieve in practice. After
all, many disasters by definition affect wide geographic areas, and thus it is necessary
to place the two sites sufficiently far apart in order to increase the chance of one of
them surviving. Implementing a fast and reliable TCP/IP network connection over long
distances is, in a word, hard. Besides the limits imposed by physics (such as the speed
of light), more practical considerations (like having multiple network segments) play a
major role.
A question gets asked quite often: what is the minimal level of performance of the inter-
site network link for a “stretch cluster” configuration to be feasible? Unfortunately, the
only simple answer is “it depends”. A poor inter-site link leads to pain, in the form of
poor write performance, spurious connection failures, and uneven performance levels.
Everyone has a different level of pain tolerance. As discussed earlier, the impact of
replication, and thus the impact of the inter-site network link, depends a lot on the
nature of the workload. A read-mostly workload is going to be a lot less sensitive to the
network latency than a write-dominated one. Applications that perform buffered IO and
are not greatly affected by write latency can do quite well, while databases that have a
need to frequently perform small DirectIO writes are going to be very sensitive to the
quality of the link.
For any workload, the link has to be reliable enough to not produce hard failures, which
generally means the packet loss rate must remain very low, well below 1%. Experience
shows that most TCP/IP implementations cannot tolerate a higher rate of packet loss
without it periodically resulting in a broken connection. GPFS tends to transfer large
blocks over TCP/IP, using large send and receive buffers, and on Linux such transfers
are particularly badly affected by packet loss. Even though it may seem that the packet
loss rate should be survivable, in practice this is not the case. Since a broken
connection means losing connectivity to the other site, and all disks belonging to that
site, such an event leads to disks being declared ‘down’, requiring a resync operation
when the connectivity is restored. It is very difficult to run production workloads in such
an environment.
It is difficult to provide any hard guidance on the network link performance. As long as
no hard network errors are encountered, GPFS itself can tolerate slow networks (to a
point: disk leases need to be renewed every 35 seconds, by default. And if the network
is not fast enough for that, this truly is a basket case). What is acceptable for end users
can vary. For GPFS replication, the most important performance characteristic is
latency. Obviously, link bandwidth is important too (as it effectively limits the overall
GPFS write bandwidth available on the cluster), but high bandwidth certainly cannot
compensate for high latency. When replication is used, the latency of small log writes is
as important as the latency of large block writes, and for many workloads that latency
cannot be masked. Generally speaking, it is desirable to have network latency that
doesn’t substantially exceed disk write latency. For example, if spinning disks with
6-20 ms average write latency are used, network latency in the 10-20 ms range is
probably going to be tolerable, but 50 ms is probably too much for almost all workloads.
If flash with sub-millisecond latency is used, the network needs to be commensurately
fast in order to avoid a massive performance hit. And considering that a typical
metadata operation (e.g. file create) involves multiple separate metadata IOs (before
the impact of metadata replication is even considered), it is not hard to do the math and
come to a conclusion that latencies in high double digits and up would result in
excruciatingly poor levels of performance.
Again, all of this is a matter of workload and end user pain tolerance.
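As a rough illustration of the latency arithmetic, consider a metadata operation that issues several dependent synchronous IOs, each paying the inter-site round trip (the IO count and latency figures below are hypothetical):

```python
# Why inter-site latency dominates stretch-cluster performance: an
# operation built from N dependent synchronous IOs can complete at most
# 1000 / (N * latency_ms) times per second on a single stream.

def ops_per_second(serial_ios_per_op, per_io_latency_ms):
    return 1000.0 / (serial_ios_per_op * per_io_latency_ms)

# A metadata operation with, say, 4 dependent IOs per operation:
print(ops_per_second(4, 0.5))   # 500.0 ops/s on sub-ms local flash
print(ops_per_second(4, 50.0))  # 5.0 ops/s over a 50 ms inter-site link
```

A two-orders-of-magnitude drop for the same workload is what "excruciatingly poor" looks like in practice, and no amount of link bandwidth recovers it, since the IOs are serialized by dependency rather than limited by throughput.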