High Performance, Open Source, Lustre 2.1.x Filesystem
on the Dell PowerVault MD32xx Storage, PowerEdge
Server Platform and Mellanox InfiniBand Interconnect
Dell/Cambridge HPC Solution Centre
Wojciech Turek | May 2013
© Dell
© Dell | Section 1
High Performance, Open Source, Lustre 2 Filesystem Whitepaper
Table of Contents
Abstract
Introduction
Lustre 2.x Overview
Lustre 2.1.x as a Production Filesystem
Lustre 2.1.x Features
Changelogs (2.1.x)
Commit on Share (2.1.x)
lustre_rsync (2.1.x)
File Identifier (2.1.x)
Ext4/ldiskfs (2.1.x)
Lustre 2.3 Features
Backup/Restore – Object Index Scrub (2.3)
SMP Scalability (2.3)
Job Stats (2.3)
CRC Checksums (2.3)
Statahead (2.3)
Lustre Silo Model
Hardware Overview
Hardware Components
HA-MDS Overview
HA-MDS Configuration
HA-OSS Overview
HA-OSS Configuration
Software Components Overview
Driver Installation
Disk Array Configuration
Lustre 2.1.x Installation and Configuration
Lustre Network Configuration
Lustre Configuration for HA-MDS Module
Lustre Configuration for HA-OSS Module
Dell Storage Brick with Lustre 2.1.x Performance Tests
Dell Lustre Storage Brick Test Tools
sgpdd-survey
obdfilter-survey
LNET Selftest
IOR
Test Results
Test Results – Conclusions
Summary
Abstract
The following paper was produced by the Dell Cambridge HPC Solution Centre and is the third in a series of detailed
Lustre® papers based on operational experience gained from using mass storage technologies within the University of
Cambridge production HPC environment. The paper provides a detailed description of how to build a commodity ‘Dell
Lustre Storage Brick’ together with comprehensive performance characteristics obtained from the Lustre storage brick
when integrated into an FDR InfiniBand® network. This paper updates information described in the previous paper and
focuses on Lustre 2.x features, installation and performance. The paper also discusses the operational characteristics and
system administration best practices derived from long-term, large-scale production usage of the solution within the
University of Cambridge HPC environment.
Introduction
This paper has been produced by the Dell Cambridge HPC Solution Centre, which is part of the wider University of
Cambridge HPC Service. The service runs one of the largest HPC facilities in the UK with a large user base and a high
I/O workload. The solution centre works closely with Dell customer-facing staff, as well as engineering, to provide
solution development and a production test environment for Dell in which the operational experience of the Cambridge
HPC Service (HPCS) is combined with the customer requirement and the understanding, product and implementation
knowledge of Dell to produce innovative, robust and fully-tested HPC solutions. These solutions are then made openly
available to the HPC community via white papers and technical bulletins.
The Cambridge HPCS has been successfully using the Lustre parallel filesystem within a large scale central cluster
filesystem for over six years. The current production Lustre system has 1PB of usable space and is continually growing
to match ever-growing user demand. The Dell Lustre Storage Brick described here is an upgrade to our previous Lustre
filesystem and uses the latest stable version of the Lustre 2.1.x branch. The brick has
evolved to fill the operational requirements of a busy, working HPC centre and has been used in production at Cambridge
for several years. Its development was critical in order to meet the pressing operational requirements caused by a growing
demand for storage volume and the inability of our Network File System (NFS) filesystems to keep up with the I/O
workload.
This paper describes in detail how to build and configure a Lustre parallel filesystem storage brick using Dell PowerVault
systems. The solution uses current Dell MD3200/MD1200 arrays, Lustre 2.1.5 and an FDR InfiniBand client network, and
as such represents an example of a current best-in-class commodity Lustre solution. The paper will also comment on
operational characteristics and system administration best practices derived from over six years production usage of
a large-scale mass storage solution of the same architecture. The main focus of this paper is to show how to build a
commodity storage array using Lustre 2, with the objective of presenting the new Lustre features and the motivation for
adopting the new, improved Lustre.
Lustre 2.x Overview
Lustre is an open source cluster parallel filesystem which for almost a decade has been used in production on many of
the world’s largest clusters. The Lustre 1.8 branch has reached exceptionally high stability and is still considered to be
the best choice for small to medium production filesystems. However, the rapid growth of the HPC industry has caused
a high demand for storage space and throughput. This means that the filesystem needs to be able to scale and meet
continuously growing demands while retaining its performance levels and stability. Lustre 1.8 is limited in that regard and
hence work on Lustre 2 began a few years ago. Lustre 2 has been designed from the outset to meet the high demands of
modern hardware and software. The current Lustre code is maintained by Whamcloud/Intel and receives input from many
other contributors (Xyratex, OpenSFS, EOFS and others).
The current maintainers of the Lustre code have divided Lustre 2 releases into two streams: the maintenance release
stream and the feature release stream. Currently, Lustre 2.1.x is the designated maintenance release stream. It does not
receive any new feature updates but only bug fixes and updates to reflect kernel changes. This ensures greater stability of
the release and makes it suitable for production environments. The Lustre 2.1.x stream is targeted at conservative users
who want a well-proven release but still wish to benefit from the new Lustre 2 architecture.
The feature release stream is targeted at users requiring access to new features and able to accept a lower stability level.
The current feature release is Lustre 2.3.x and the next planned feature release will be 2.4.x. Whamcloud intends to land
feature releases at six-month intervals.
Lustre 2.1.x as a Production Filesystem
Lustre 1.8.x is currently the most popular release of Lustre and has been widely used in small to medium production
systems. However, it is quickly being outgrown by the continually growing demands of user applications and modern
hardware. One of the shortcomings in Lustre 1.8 is lack of support for Object Storage Targets (OSTs) bigger than 16TB (this
limitation is due to 32-bit block numbers in ext4). Nowadays, when disks in RAID arrays can be 3TB in size or more, the
ext4 limit of 16TB per OST becomes a big problem in the creation of an optimal RAID group configuration. Lustre requires
careful tuning of RAID layout and disk parameters to achieve best performance. At the time of writing this paper there
is ongoing work to enable the 64-bit feature of ext4 which will allow creation of OSTs bigger than 16TB in Lustre 1.8.x
[LU-419]. Another important shortcoming of Lustre 1.8.x is its code base that does not scale well enough on large storage
systems, hence the need for the new, more scalable code base that was introduced in Lustre 2. The Lustre 2 code
base promises to deliver much better bandwidth and metadata performance. This is being achieved by rewriting the
MDS and client I/O subsystems. The new Lustre 2 architecture and code base enables implementation of a range of new
features that are required to use, maintain and manage large amounts of data. For these reasons Lustre 2.1.x is
the better choice for modern storage platforms. Although it is possible to upgrade from Lustre 1.8 to Lustre 2, this is not
recommended for production systems, as the conversion method has not been extensively tested, may have bugs,
and is a one-way process. The choice of Lustre 2 should be made at filesystem creation, as this avoids
future problems and compatibility issues.
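As an illustration, creating a Lustre 2 filesystem from scratch follows the usual mkfs.lustre/mount sequence. The sketch below is hedged: the filesystem name, node name and device paths are placeholders, not values from this paper.

```shell
# Sketch: formatting a new Lustre 2 filesystem from scratch.
# fsname, mgsnode and device paths below are site-specific placeholders.

# On the MDS: combined MGS + MDT on the metadata RAID10 volume
mkfs.lustre --fsname=scratch --mgs --mdt --index=0 /dev/mapper/mdt0
mount -t lustre /dev/mapper/mdt0 /lustre/mdt0

# On each OSS: one OST per RAID6 virtual disk, pointing at the MGS
mkfs.lustre --fsname=scratch --ost --index=0 --mgsnode=mds01@o2ib /dev/mapper/ost0
mount -t lustre /dev/mapper/ost0 /lustre/ost0

# On a client
mount -t lustre mds01@o2ib:/scratch /mnt/scratch
```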
Lustre 2.1.x Features
Lustre is in its second decade as a high performance, production parallel filesystem. Lustre has grown and adapted to
continuously growing demands to increase performance while maintaining production quality stability. With the arrival
of the 2.x release series, a number of new features have been introduced. Most of the features are not yet available in the
maintenance release (2.1.x), but they are being implemented and tested in the feature releases (2.3 and 2.4).
Changelogs (2.1.x)
The new Lustre 2 code introduces a new feature called ‘changelog’, which records all events that change the filesystem’s
metadata or namespace. Changelogs record events such as file creation, deletion, renaming, attribute changes, etc.
Recorded changelogs are persistent and are kept on a local disk until cleared by the user. They accurately reflect the
on-disk changes in the event of a server failure. These new records can be used for a variety of purposes, such as:
• Tracking changes in the filesystem and feeding them to hierarchical storage management software (HSM)
• Using recorded events to replicate exactly all changes in a filesystem mirror
• Creating filesystem policies to take action upon certain events for files and directories
• File backup policy management
• Speeding up recovery after server crash
• Helping to construct an audit trail for the filesystem.
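A minimal changelog workflow, assuming a filesystem named lustre with its MDT device mounted on the MDS, can be sketched as follows (the registered user id cl1 is an example of what lctl returns):

```shell
# On the MDS: register a changelog consumer; lctl prints the new
# user id (e.g. cl1), which is needed to clear records later.
lctl --device lustre-MDT0000 changelog_register

# On a client: dump the records collected so far ...
lfs changelog lustre-MDT0000

# ... and, once they have been processed, purge them for user cl1
# (an end record of 0 means "all records").
lfs changelog_clear lustre-MDT0000 cl1 0
```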
Commit on Share (2.1.x)
From time to time Lustre servers may need to be taken down for maintenance or they may simply crash due to hardware
component failures; the Lustre services then have to be recovered. When a Lustre MDS server enters recovery mode
clients begin to reconnect and start replaying their uncommitted transactions. The recovery time window is limited and
depends on the obd timeout. If one or more clients miss their recovery window, it may cause other clients to abort their
outstanding transactions or even be evicted. If a client is evicted all outstanding transactions that belong to that client have
to be aborted. This may lead to a cascade eviction effect as the transactions depending on the aborted ones fail and so
on. Lustre 2 addresses this problem with the Commit on Share feature by eliminating dependent transactions, hence the
clients may replay their requests independently without risk of being evicted. This allows clients to recover, regardless of
the state of other clients. In the event of multiple client failures this reduces recovery problems significantly.
lustre_rsync (2.1.x)
Moving large amounts of data between filesystems in a reasonably short time can be a daunting task that a system
administrator may face. Using standard tools for copying data would require the filesystem to be scanned file by file and
directory by directory, which for large filesystems can be unreasonably time consuming. Lustre 2 addresses this problem
by introducing the lustre_rsync tool that utilizes changelogs to determine which files need to be copied. This is much
more efficient than the traditional method of scanning filesystem metadata. lustre_rsync can replicate to another
Lustre filesystem or to any other filesystem, and it is a perfect tool for managing a backup of a Lustre filesystem.
It allows one to quickly synchronise a backup filesystem with the source or vice versa. In the case of recovery it allows one
to perform reverse replication.
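A sketch of an incremental replication run, assuming a changelog user (here cl1) has already been registered on the MDT and the source and target filesystems are mounted at the placeholder paths shown:

```shell
# Incremental sync driven by MDT changelogs; --statuslog lets an
# interrupted run resume where it left off.
lustre_rsync --source=/mnt/lustre --target=/mnt/backup \
             --mdt=lustre-MDT0000 --user=cl1 \
             --statuslog /var/log/lustre_rsync.log --verbose
```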
File Identifier (2.1.x)
Although the File Identifier (FID) is not a new concept, Lustre 2 introduces a new FID and request format. Lustre 1.8 uses a
different FID for MDT inodes and OST objects. The MDT FID depends on ldiskfs and is only unique to a single MDT.
This model has several problems and constraints which are addressed in the new FID introduced in Lustre 2. The new FID
scheme is independent of the underlying filesystem; is unique across the entire Lustre filesystem; simplifies Lustre recovery
by eliminating the need to recreate a specific inode number in replay; can be generated on the client allowing a metadata
writeback cache; and adds support for object versioning. The MDT and OST will use the same 128-bit FID, but the MDS
and clients will be assigned a new sequence for each OSS. The new FID scheme introduces interoperability constraints
between Lustre 1.8 and 2.x. Since Lustre version 1.8.6, the Lustre client understands the new FID format and can
communicate with Lustre 2.x servers. However, the Lustre 2.x client does not understand the old Lustre 1.8 format and
servers must be upgraded to version 2.x. Live upgrade is not possible due to the RPC format change which would result in
a global client eviction.
Ext4/ldiskfs (2.1.x)
The Lustre 2.x underlying ldiskfs filesystem has received a number of changes which make it better suited to large scale
Lustre deployments. One of the biggest limitations of previous incarnations of ldiskfs was the limited size of OSTs due
to the 32-bit nature of the code. This limitation has now been removed and the maximum supported OST size is now
128TB. Bigger OSTs may mean longer mkfs and fsck time; this is however addressed by the flex_bg feature and by creating
fewer inodes during filesystem creation. The flex_bg feature helps to avoid seeks for both data and metadata, co-locates
multiple block/inode bitmaps, and provides larger contiguous space. The ldiskfs defaults for the MDT were also changed to
increase inode density which allows use of flash-based storage for bigger filesystems.
Lustre 2.3 Features
The recently released Lustre 2.3 introduces many interesting new features. Some of them aim to further increase scalability
and performance. Some introduce new administrative functionalities and make the filesystem better suited for production
environments. Lustre 2.3 is a good candidate to be declared a new maintenance release as soon as it proves its stability.
Below is a list of some of the new features implemented in the latest Lustre 2.3.
Backup/Restore – Object Index Scrub (2.3)
In Lustre 2.x, the FID is the unique identifier for a file or object. It differs from the traditional inode number/generation and is
independent of the backend filesystem. Internal to the Lustre Object Storage Device (OSD), an Object Index (OI) table is
used to map the FID to the local inode number. One scenario where the inode number will be changed arises if the MDT is
restored from file-level backup. In this case, the OI table that is restored from backup will be stale, since the inode number/
generation will be newly allocated by the backend filesystem for each file. The OI table can be rebuilt by iterating through
all inodes and correcting the FID-to-inode mapping in the OI. In addition, the ability to iterate through inodes and
cross-check the OI against the filesystem is useful for maintaining filesystem consistency. It is important to note that since this
feature is not available in Lustre 2.1.x, the only way to back up and restore the MDT is by performing a block-level copy of the
MDT devices – by using, for example, the dd command.
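Under 2.1.x such a block-level MDT backup can be as simple as the following sketch (the MDT must be unmounted first; the mount point, device and target paths are placeholders):

```shell
# With the MDT stopped, image the whole device; restoring is the
# reverse dd. The OI table stays valid because inode numbers are
# preserved by a block-level copy.
umount /lustre/mdt0
dd if=/dev/mapper/mdt0 of=/backup/mdt0.img bs=1M
```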
SMP Scalability (2.3)
Current versions of Lustre rely on a single active metadata server. Metadata throughput may become a bottleneck for
large filesystems with many thousands of nodes. System architects typically resolve metadata performance issues by
deploying the MDS on more powerful hardware. However, this approach is not a foolproof solution. Nowadays, faster
hardware means more cores and Lustre has a number of SMP scalability issues while running on servers with many CPUs.
Scaling across multiple CPU sockets is a real problem. Lustre 2.3 introduces several new patches that address some of the
scalability issues, mainly in the performance of the Lustre network layer LNET and the performance of shared directories.
Job Stats (2.3)
This feature collects filesystem operation statistics for the jobs running on Lustre clients. When a job scheduler is running
on a compute node, the Lustre client will include the job ID in each request, such as write, read, create, unlink etc. The
server will record this information and expose it via lprocfs. Currently, a number of popular schedulers are supported, such
as SLURM, PBS, Torque/Moab, LSF, SGE and LoadLeveler. This feature is particularly useful in job profiling and in discovering a
job’s I/O bottlenecks.
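As a sketch (the filesystem name testfs and the choice of SLURM are assumptions), job stats are enabled by telling clients which environment variable carries the job ID, after which servers expose per-job counters via lprocfs:

```shell
# On the MGS: tell all clients of filesystem "testfs" to tag RPCs
# with the SLURM job id (persistent configuration parameter).
lctl conf_param testfs.sys.jobid_var=SLURM_JOB_ID

# On an OSS/MDS: read the per-job counters
lctl get_param obdfilter.*.job_stats
lctl get_param mdt.*.job_stats
```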
CRC Checksums (2.3)
Checksums were initially implemented in software, but Lustre 2.x adds hardware-accelerated checksums.
This feature uses a kernel API to choose optimal hash and checksumming algorithms as opposed to fixed
values. Hardware-accelerated checksums are supported on both Intel and AMD processors.
Statahead (2.3)
Commands that traverse a directory sequentially, such as ‘ls -l’, ‘du’ and ‘find’, are among the most frequently used. Lustre processes
these commands by triggering several RPCs for each file inside the directory. Initially, these RPCs were processed serially
and synchronously, hurting performance. This is especially damaging because these commands are directly exposed
to filesystem users and define their experience of the system. Lustre 2.x supports statahead, which addresses this problem by
pre-fetching objects’ attributes from the MDS so that many MDS-side RPCs now overlap with other processing. In addition,
Lustre 2.3 introduces the asynchronous glimpse lock (AGL) to prefetch file sizes from the OSTs.
Lustre Silo Model
A typical approach to design of a Lustre-based storage system is to create one large filesystem spanning multiple hardware
instances. This allows not only building high capacity storage but since Lustre throughput grows almost linearly with
addition of disk arrays it also delivers very high performance in a single filesystem namespace. Binding all hardware into
one large filesystem, although clearly advantageous, creates a single point of failure and hence decreases filesystem
data safety. In applications where data safety is of secondary importance and can be sacrificed in favor of a less complex
design, bigger capacity and higher performance, this approach is acceptable. However, in applications where data safety
is critical, mirroring or backing up the data on secondary hardware can provide the necessary resiliency. This is however a
costly solution and often not implemented due to budget limitations. To address the data safety problem at lower expense,
the silo architecture may be the answer. This type of architecture is commonly used in NFS-based storage systems where
a large number of disk arrays are divided into silos of a certain fixed size. The limitation of NFS is that each NFS filesystem
can have only one NFS server that creates a throughput and metadata bottleneck. Lustre, due to its parallel nature, allows
the creation of large silos which span multiple storage servers and multiple disk arrays providing good throughput and
metadata performance per silo filesystem. Multiple filesystems means slightly more complex data management, but this
is outweighed by the decreasing risk of losing all the data if one of the filesystems fails. Experience teaches us that bad
things do happen, hence taking measures to minimize their impact is a good practice that needs to be adopted early in the
storage system design process. Figure 1 below shows an example of a single Lustre silo of 500TB capacity consisting of
four OSS modules.
Figure 1. Single 500TB Lustre silo.
The silo architecture is based on the assumption that each silo should have enough reserved free capacity so, if necessary,
data from one filesystem can be migrated to the remaining silo filesystems. For 2PB of usable storage hardware it is
recommended to create four 500TB silos. Each silo will have 25% of its capacity reserved, allowing migration of data
between the silos if needed; see Figure 2 below. Figure 3 shows an example scenario in which one of the silos fails or needs
to be taken offline and its data migrated off it. Using the silo model, a system administrator can migrate the data of one silo to the
remaining three and continue to serve all data without interruption. The Lustre silo model introduces more complexity
into data management and localization; however, it significantly increases data resiliency and also increases stability by
balancing the load across multiple Lustre silo filesystems.
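The 25% reserve can be sanity-checked with a little arithmetic, using the figures from the example above:

```shell
# Sanity-check the silo reserve: with four 500TB silos each capped at
# 75% usage, the live data of one silo (375TB) fits exactly into the
# 25% reserve spread over the three surviving silos (3 x 125TB).
silo_tb=500
n_silos=4
live_per_silo=$(( silo_tb * 75 / 100 ))
reserve_elsewhere=$(( (n_silos - 1) * silo_tb * 25 / 100 ))
echo "data to migrate: ${live_per_silo} TB; spare capacity: ${reserve_elsewhere} TB"
```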
Figure 2. 2PB of usable storage divided into four Lustre silos.
Figure 3. An example scenario when one silo fails and data can be redistributed across remaining silos.
Hardware Overview
The hardware design of the Dell Lustre Storage System is driven by the following main goals: maximised
I/O performance, scalability, reliability, stability, data safety and integrity, simplified configuration and a competitive price
point. These goals are accomplished by using well-established, enterprise-level hardware; open source
software with a large user base and a strong code development team; a high performance InfiniBand network; and a
modular design approach.
The following modules are used in the Dell Lustre Storage System design:
HA-MDS/MGS – High Availability Metadata Server and Management Server module.
HA-OSS – High Availability Object Storage Server module.
See Figure 4 below for the logical relationship between clients and the storage modules.
Figure 4. Overview of Dell Lustre Storage System design.
Hardware Components
The table below provides an overview of the core building blocks for the Dell Lustre Storage System:
HA-MDS brick
• 3.2TB of raw disk space
• 1.6TB of RAID10 usable disk space
• FDR InfiniBand
• MDS/MGS servers: Two Dell PowerEdge R720 servers, running Scientific Linux 6.2 and the latest Lustre 2.1.x, equipped with a dual socket motherboard, eight-core Intel Sandy Bridge CPUs and 128GB of RAM
• Storage: One Dell PowerVault MD3220 with 24 136GB 15k SAS drives

HA-OSS brick
• 240TB of raw disk space
• 192TB of RAID6 usable disk space
• FDR InfiniBand
• OSS servers: Two Dell PowerEdge R510 servers, running Scientific Linux 6.2 and the latest Lustre 2.1.x, equipped with a dual socket motherboard, six-core Intel CPUs and 32GB of RAM
• Storage: One Dell PowerVault MD3200 and four MD1200 enclosures, each with 12 4TB NL SAS drives
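The OSS brick capacities above can be cross-checked with simple arithmetic (the 8+2 RAID6 group layout is described later in the HA-OSS Configuration section):

```shell
# Cross-check the HA-OSS brick capacity: 5 enclosures (1x MD3200 +
# 4x MD1200) x 12 drives x 4TB raw, arranged as six 8+2 RAID6 groups.
enclosures=5; drives_per_enclosure=12; drive_tb=4
raw_tb=$(( enclosures * drives_per_enclosure * drive_tb ))
raid_groups=6; data_disks_per_group=8
usable_tb=$(( raid_groups * data_disks_per_group * drive_tb ))
echo "raw: ${raw_tb} TB, usable: ${usable_tb} TB"
```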
HA-MDS Overview
The HA-MDS module is comprised of two Dell PowerEdge R720 servers, each equipped with two high-performance
Intel® Xeon® E5 processors, 128GB of RAM, two low-latency, enterprise-class SAS 15K hard drives managed by an internal
H310 RAID controller (figure 5). The internal drives are configured as a RAID1 group and provide fast and reliable disk space
for the Linux OS. We chose the R720 over the R510 because of its higher expandability and the option to
add an expansion enclosure later if needed.
The Dell PowerVault MD3220 direct attached external RAID storage array is designed to increase cluster metadata
performance and functionality. Equipped with data-protection features to help keep the storage system online and
available, the storage array is architected to avoid single points of failure by employing redundant RAID controllers and
cache coherency provided by direct mirroring.
Figure 5. Dell Lustre HA-MDS module.
The HA-MDS module is configured to provide a good IOPS rate and high availability of data. The metadata server connects
to the disk array via multiple redundant, fast, 6Gbps SAS connections. The system is built to avoid a single point of failure
by using two dual port HBA cards, multiple SAS links and dual RAID controllers.
MD3220 storage array key features:
• 24 small form factor 2.5-inch 15k SAS disk drives (SSDs are also supported)
• Mirrored data cache of up to 2048 MB on each RAID controller
• Persistent cache backup on SD flash disk.
• Snapshots of virtual disks.
• High Performance Tier to optimise IOPS and throughput.
• In-band and out-of-band management using Dell Modular Storage Manager.
• Redundant, hot-swappable components and virtual disk failover.
• Transparent failover for clustered operating systems to maximise data protection.
• RAID level migration, capacity expansion, consistency checks, and Self-Monitoring,
Analysis and Reporting Technology (SMART).
HA-MDS Configuration
Metadata performance is one of the key factors influencing the overall Lustre filesystem responsiveness and usability.
Hence, a correct choice of hardware and optimal configuration of the RAID controllers play an important role in the system
build process. Correctly configured, RAID groups minimise the number of expensive read and write operations that are
needed to complete a single I/O operation and also increase the number of operations that can be completed per unit time.
Figure 6. Dell Lustre HA-MDS module RAID configuration.
Figure 6 shows the disk layout of the Dell Lustre HA-MDS module. The 24 drives are divided into three groups, each
comprising eight disks and configured into a RAID10 disk group with a chunk size of 64KB. Since Lustre metadata operations
consist of many small I/O transactions, a 64KB chunk size is a good choice and provides satisfactory I/O performance.
Another parameter that can make a significant impact on the I/O performance of the array is the cache block size, the value of
which can be set between 4KB and 32KB. A small cache block size is best suited to small transactional I/O patterns; thus,
for MDS purposes, we set this tunable to 4KB which matches well to Lustre metadata transactions. A single MDS module,
if configured as above, can provide metadata targets for three Lustre filesystems.
The server RAM size is also an important factor, with a significant bearing on Lustre metadata
responsiveness. The more RAM the MDS has, the more aggressive the tuning that can be applied to speed up expensive metadata
operations like ‘ls’ and ‘find’. This is because the Lustre MDS can cache locks on a per-client basis, which makes it possible to configure some
nodes – for example, login nodes with large lock caches – and improve Lustre responsiveness to metadata operations.
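As a sketch of such per-node tuning: the client-side lock cache referred to above is controlled through the LDLM LRU size, which can be raised on selected nodes (the value 4096 below is an illustrative assumption, not a recommendation from this paper):

```shell
# On a login node: allow up to 4096 cached locks per server
# connection instead of the dynamic default (0 = automatic sizing).
lctl set_param ldlm.namespaces.*.lru_size=4096
```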
HA-OSS Overview
The HA-OSS module consists of two Dell PowerEdge R520 servers attached to a storage brick comprising a
single PowerVault MD3200 disk array extended with four PowerVault MD1200 disk enclosures (Figure 7). Each PowerEdge
server is equipped with two high-performance Intel® Xeon® processors, 32GB of RAM, two internal hard drives and an
internal RAID controller. All disk enclosures are populated with fast 4TB NL SAS disks. The Dell PowerVault MD3200 direct
attached external RAID storage array is built to increase cluster I/O performance and functionality. Equipped with data-
protection features to help keep the storage system online and available, the storage array is designed to avoid single
points of failure by employing redundant RAID controllers and cache coherency provided by direct mirroring. The Dell
PowerVault disk array provides performance and reliability at the commodity price point.
Figure 7. Dell Lustre HA-OSS module.
To ensure high availability failover, each OSS connects to both MD3200 RAID controllers, giving each OSS four redundant links
to the PowerVault MD3200 disk array. This design allows either OSS to drive all the volumes on the Dell PowerVault
disk arrays. During normal operation, each OSS exclusively owns a subset of volumes. In the event of failure the remaining
server in the failover pair will take over the volumes from the failed server and subsequently service requests to all the
volumes.
MD3200 storage array key features:
• 12 x 3.5 inch disk slots
• Mirrored data cache of up to 2048 MB on each RAID controller
• Persistent cache backup on SD flash disk
• Snapshots of virtual disks
• High Performance Tier to optimise IOPS and throughput
• In-band and out-of-band management using Dell Modular Storage Manager
• Redundant, hot-swappable components and virtual disk failover
• Transparent failover for clustered operating systems to maximise data protection
• RAID level migration, capacity expansion, consistency checks, and Self-Monitoring,
Analysis and Reporting Technology (SMART).
HA-OSS Configuration
An important factor affecting overall Dell PowerVault MD3200 storage array performance is the correct choice and
configuration of the RAID disk groups and virtual disks, especially when using RAID6 disk groups. Correctly configured, RAID
groups minimise the number of expensive read and write operations that are needed to complete a single I/O operation.
Figure 8. RAID groups and virtual disk layout of the Dell Lustre HA-OSS module.
Figure 8 shows the disk configuration of the Dell Lustre HA-OSS module. There are six RAID6 disk groups configured.
Each RAID6 disk group consists of eight data disks and two parity disks. The segment size (chunk size) of the disk group
is 128KB, thus the stripe size is 1024KB. A correct segment size minimises the number of disks that need to be accessed
to satisfy a single I/O operation. Lustre's typical I/O size is 1MB, therefore each I/O operation requires only one RAID
stripe. Disks in each disk group are spanned across all five disk arrays, which leads to faster disk access per I/O operation.
For optimal performance, only one virtual disk per disk group is configured which results in more sequential access of the
RAID disk group and which may significantly improve I/O performance. The Lustre manual suggests using the full device
instead of a partition (sdb vs sdb1). When using the full device, Lustre writes well-aligned 1 MB chunks to disk. Partitioning
the disk removes this alignment and may noticeably impact performance. In this configuration, a RAID disk group
distributed across all enclosures has the additional advantage of not having more than two disks per disk enclosure.
This means that in the event of enclosure outage, data integrity is still protected by RAID6 redundancy.
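The relationship between segment size, stripe size and Lustre's I/O size described above can be verified with simple arithmetic:

```shell
# One full RAID6 stripe: 8 data disks x 128KB segments = 1024KB,
# so a single 1MB Lustre write touches each data disk exactly once.
segment_kb=128
data_disks=8
stripe_kb=$(( segment_kb * data_disks ))
echo "full stripe: ${stripe_kb} KB"
```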
After performing a number of tests exercising various controller parameters, the following configuration parameters
seem best suited to a Lustre filesystem where the requirement is high performance under large I/O workloads:
Summary of configuration parameters:
cacheBlockSize=32
segmentSize=128
readCacheEnabled=true
cacheReadPrefetch=true
modificationPriority=lowest
cacheMirroring=true
Some write performance boost can be gained by disabling the write cache mirroring feature. This change, however, needs
to be carefully considered because it reduces data safety; it is therefore only recommended for non-critical scratch
data filesystems.
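These settings could be applied from the management node with SMcli. The sketch below only prints the commands (a dry run), and the attribute names (cacheBlockSize as an array-wide property, the cache and prefetch settings applied per virtual disk via set allVirtualDisks) are assumptions that should be checked against the CLI guide for your MD32xx firmware:

```shell
# Dry-run sketch: print (rather than execute) the SMcli commands that
# would apply the controller settings above. Attribute names are
# assumptions; verify them against the MD32xx CLI guide.
ARRAY=oss01   # host running SMagent (in-band management)
CMDS='set storageArray cacheBlockSize=32;
set allVirtualDisks readCacheEnabled=TRUE cacheReadPrefetch=TRUE;
set allVirtualDisks modificationPriority=lowest;'

printf '%s\n' "$CMDS" | while read -r cmd; do
    echo SMcli "$ARRAY" -c "\"$cmd\""   # drop 'echo' to apply for real
done
```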
Performance can also be improved by tuning various Linux parameters:
Summary of Linux tuning options on the OSS server:
Linux I/O scheduler - parameter set to “deadline”
echo deadline > /sys/block/<blk_dev>/queue/scheduler
max_sectors_kb - parameter set to 2048
echo 2048 > /sys/block/<blk_dev>/queue/max_sectors_kb
read_ahead_kb - parameter set to 2048
echo 2048 > /sys/block/<blk_dev>/queue/read_ahead_kb
nr_requests - parameter set to 512
echo 512 > /sys/block/<blk_dev>/queue/nr_requests
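These sysfs settings apply per block device and do not survive a reboot, so they are usually wrapped in a small script run at boot (for example from rc.local) for every Lustre LUN. A minimal sketch, with the sysfs root kept as a variable so the function can be exercised against a dummy directory tree rather than real devices:

```shell
# Apply the four block-layer tunings above to one block device.
# SYSFS defaults to /sys on a real OSS; pointing it elsewhere lets
# the function be tested without touching real hardware.
SYSFS=${SYSFS:-/sys}

tune_blockdev() {
    q="$SYSFS/block/$1/queue"
    echo deadline > "$q/scheduler"
    echo 2048     > "$q/max_sectors_kb"
    echo 2048     > "$q/read_ahead_kb"
    echo 512      > "$q/nr_requests"
}

# Example: tune every device named on the command line, e.g.
#   ./tune.sh sdb sdc sdd
for dev in "$@"; do
    tune_blockdev "$dev"
done
```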
Software Components Overview
The Dell Lustre Storage System software stack installed on the metadata servers and object storage servers includes the
following software components:
• Linux operating system
• Lustre filesystem
• Heartbeat/Pacemaker
• Dell configuration and monitoring tools.
The Dell Lustre Storage System uses the Scientific Linux 6.3 operating system. This is a rebuild of the Red Hat® Enterprise
Linux® 6 Update 3 distribution (RHEL6.3). This Linux distribution is an enterprise-level operating system that provides one
of the industry's most comprehensive suites of Linux solutions. RHEL6.3 delivers a high performance platform with many
recent Linux features and high levels of reliability. The Dell Lustre Storage System uses Lustre version 2.1.5, which is the
current stable version of the maintenance release line of Lustre 2.
Driver Installation
The first step is to install software drivers and agents that will allow the servers to communicate and monitor the external
RAID controllers. All necessary Linux host software is available on the Dell PowerVault MD32xx resource CD image.
The latest resource CD image can be downloaded from the Dell Support website at: support.dell.com.
The downloaded image can be mounted as a loop device. It is recommended to copy the contents of the MD32xx
resource CD to a shared NFS location in order to make it available to all OSS servers:
[root]# mkdir /mnt/md32xxiso
[root]# mount -o loop md32xx_resource_cd.iso /mnt/md32xxiso
[root]# rsync –a /mnt/md32xxiso/ /nfs_share/md32xx
When installing the PowerVault Modular Disk Storage Manager (MDSM) and Host Utilities, the system administrator has
the option of using an interactive installer, which requires the X Window System and Java™, or using text mode, which
can be performed either interactively or non-interactively in a Linux terminal. The second method is a good choice for a
cluster environment where software has to be installed on many nodes, since it allows the installation process to be
automated for all servers.
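With the properties file on the NFS share, the install can then be pushed to all nodes from a single host. A sketch (the hostnames are placeholders for your cluster, the loop prints the commands so they can be reviewed first, and pdsh could replace the ssh loop):

```shell
# Dry-run sketch: print the silent-install command for each server.
# Hostnames are placeholders; drop the leading 'echo' to run the
# installs over ssh for real.
MDSM_DIR=/nfs_share/md32xx/linux/mdsm
for node in mds01 mds02 oss01 oss02; do
    echo ssh "$node" "cd $MDSM_DIR && ./SMIA-LINUX-*.bin -i silent -f host.properties"
done
```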
To install the MDSM in silent mode use one of the provided install scripts located in the same directory as the MDSM
installer. Depending on the role of the server in the Dell Lustre Storage System, use one of the three provided scripts.
Use mgmt.properties on a server that is a dedicated management node. Use host.properties file on a server which is
either an OSS or MDS (and is directly connected to a MD32xx enclosure).
OSS/MDS servers - host.properties file contents:
INSTALLER_UI=silent
CHOSEN_INSTALL_FEATURE_LIST=SMruntime,SMutil,SMagent
AUTO_START_CHOICE=0
USER_REQUESTED_RESTART=NO
REQUESTED_FO_DRIVER=mpio
Management servers - mgmt.properties file contents:
INSTALLER_UI=silent
CHOSEN_INSTALL_FEATURE_LIST=SMclient,Java Ac
AUTO_START_CHOICE=0
USER_REQUESTED_RESTART=NO
Once the appropriate properties file is chosen, the software can be installed with the following commands:
[root]# cd /nfs_share/md32xx/linux/mdsm/
[root]# ./SMIA-LINUX-{version}.bin -i silent -f host.properties
It is recommended to reboot the servers once the drivers and agents are installed to ensure that they are properly loaded
and working.
Disk Array Configuration
It is important to install the drivers on the servers before connecting the storage arrays; otherwise the lack of proper
device handling will cause the Linux OS to report I/O errors when accessing the storage arrays. Before attempting
MD3200 array configuration, care should be taken to ensure that all storage arrays are correctly connected and that the
necessary drivers and software tools described in the previous section have been successfully installed. The connections
use SAS cables and are shown in Figures 9 and 10.
Figure 9. MDS module SAS cabling.
Figure 10. OSS module fault tolerant SAS cabling.
Storage arrays can be configured using either a command line interface (CLI) or modular disk storage manager graphical
interface (MDSM GUI). The MDSM GUI can be accessed by executing the SMclient command on the management node.
Below is a screenshot of the main MDSM GUI window, which provides summary information about the storage array. The
GUI is an easy way to access and control all aspects of the storage array.
Figure 11. MDSM GUI.
On the OSS servers the SMagent service should be started prior to attempting any configuration. The SMagent service can
be configured in the Linux OS to start automatically at boot time:
[root]# chkconfig SMagent on
[root]# service SMagent start
To start the configuration process, run the MDSM application installed on the management node:
/opt/dell/mdstoragemanager/client/SMclient
Perform the initial setup tasks as follows (using in-band management): in the MDSM main window, click Tools at the top
left corner of the application window, then click Automatic Discovery… and allow MDSM to scan for MD3200 storage
arrays.
Once the scan is complete all connected MD3200 storage arrays should be visible. It is recommended to start by
upgrading the storage array components’ firmware. Detailed instructions can be found in the User’s Guide on the Dell
PowerVault MD3200 documentation site:
http://support.dell.com/support/edocs/systems/md3200/
The next step of the configuration process is enabling OSS Linux server host access to the MD3200 storage array. Host
access can be configured either automatically or manually. In the automatic method, all hosts running the SMagent service
will be automatically detected by the MDSM.
Follow these steps to configure host access automatically:
In the MDSM main window:
1. Right-click the storage array name and choose the Manage Storage Array option; a new window will open.
2. In the new window click Storage Array -> Configuration -> Automatic…
3. Follow the instructions to complete the task.
Configure storage array name:
[root]# SMcli oss01 -c "set storageArray userLabel=\"dell01\";"
Example configuration of storage array host access and host group:
SMcli oss01 -c "create hostGroup userLabel=\"dell01\";"
SMcli oss01 -c "create host userLabel=\"oss01\" hostType=1 hostGroup=\"dell01\";"
SMcli oss01 -c "create hostPort host=\"oss01\" userLabel=\"oss010\" identifier=\"{SASportAddress}\" interfaceType=SAS;"
The most efficient way to configure is via a configuration script:
[root]# SMcli oss01 -f storageArray.cfg
A more detailed script for configuring a Dell MD3200 HA-OSS module can be found in the previous white paper from this series.
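For orientation, a storageArray.cfg script is simply a list of SMcli statements like those above, one per line. The fragment below is only an illustrative sketch (the userLabel and the enclosure,slot drive pairs are placeholders, and the syntax should be checked against the MD32xx CLI guide; the complete script is given in the previous paper):

```
// storageArray.cfg - illustrative fragment only; the drive slots and
// labels are placeholders
create virtualDisk raidLevel=6 userLabel="testfs_ost00"
  drives=(0,0 1,0 2,0 3,0 4,0 0,4 1,4 2,4 3,4 4,4)
  segmentSize=128;
```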
Lustre 2.1.x Installation and Configuration
The easiest way to install Lustre is to obtain the Lustre RPM packages from the Lustre maintainer website,
currently: http://downloads.whamcloud.com/public/lustre/latest-maintenance-release/.
In order to determine which Lustre packages should be downloaded, please check the Lustre support matrix:
http://wiki.whamcloud.com/display/PUB/Lustre+Support+Matrix
The Lustre servers require the following RPMs to be installed:
• kernel-lustre: Lustre patched kernel
• lustre-ldiskfs: Lustre-patched backing filesystem kernel module
• lustre-modules: Lustre kernel modules
• lustre: Lustre utilities package
• e2fsprogs: utilities package for the ext4 backing filesystem
• lustre-test: various test utilities (optional)
The Lustre client requires these RPMs to be installed:
• lustre-modules: Lustre kernel modules
• lustre: Lustre utilities package
• lustre-test: various test utilities (optional)
Additionally, the following packages may be required from your Linux distribution or InfiniBand vendor:
• kernel-ib: OFED
Lustre installation is completed by installing the above RPMs and rebooting the server.
Lustre Network Configuration
In order for the Lustre network to operate optimally, the system administrator must verify that the cluster network
(for example, InfiniBand) has been set up correctly and that clients and servers can communicate with each other using
standard network protocols for the particular network type used. An InfiniBand network can be tested using tools provided
in the perftest package. The ping utility can also be used to check that the TCP/IP configuration of the IB interfaces
functions properly.
In a cluster with a Lustre filesystem, servers and clients communicate with each other using the Lustre networking API
(LNET). LNET supports a variety of network transports through Lustre Network Drivers (LNDs); for example, ksocklnd
enables support for Ethernet and ko2iblnd enables support for InfiniBand. End points of the Lustre network are defined
by Lustre Network Identifiers (NIDs). Every node that uses the Lustre filesystem must have a NID. Typically, a NID looks like
address@network, where address is the name or IP address of the host and network is the Lustre network, for example, tcp0
or o2ib.
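The address@network convention can be captured in a couple of tiny shell helpers; a sketch using the hostnames from this paper:

```shell
# Compose and split Lustre NIDs of the form address@network.
make_nid() { echo "$1@$2"; }
nid_addr() { echo "${1%@*}"; }   # part before the '@'
nid_net()  { echo "${1#*@}"; }   # part after the '@'

make_nid mds01_ib o2ib           # -> mds01_ib@o2ib
nid_addr "192.168.1.10@tcp0"     # -> 192.168.1.10
nid_net  "192.168.1.10@tcp0"     # -> tcp0
```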
When configuring a Lustre InfiniBand network in RHEL6, the system administrator needs to add the following lines to a
file in /etc/modprobe.d, for example /etc/modprobe.d/lustre.conf:
# LNET configuration for InfiniBand
options lnet networks=o2ib(ib0)
This will configure the appropriate NID and load the correct LND for the InfiniBand interconnect. More information about
Lustre networking can be found at:
wiki.lustre.org
Lustre Configuration for HA-MDS Module
Before Lustre can be used, the system administrator must format the Lustre targets with the ldiskfs filesystem. During the
format procedure, some Lustre configuration options can be set. The Lustre configuration can be changed later, if required,
using the non-destructive tunefs.lustre command.
In order to configure and start the Lustre management server, run the following commands on one node of the failover
pair (mds01, mds02):
[root]# mkfs.lustre --mgs --fsname=testfs \
        --failnode=mds02_ib@o2ib \
        /dev/mpath/mgs
# Start Lustre management service
[root]# mount -t lustre /dev/mpath/mgs /lustre/mgs
Similarly, to configure and start the Lustre metadata server (note that typically the MGS and MDS run on the same host):
[root]# mkfs.lustre --reformat --mdt --fsname=testfs \
        --failnode=mds02_ib@o2ib \
        --mgsnode=mds01_ib@o2ib \
        --mgsnode=mds02_ib@o2ib \
        /dev/mpath/testfs_mdt00
# Start metadata service
[root]# mount -t lustre /dev/mpath/testfs_mdt00 /lustre/testfs_mdt00
Lustre Configuration for HA-OSS Module
To configure Lustre Object Storage Servers (OSS) run the following commands on one server of the OSS failover pair:
# on oss01 server
[root]# mkfs.lustre --ost --fsname=testfs \
        --mkfsoptions='-O flex_bg -G 256 -E stride=32,stripe-width=256' \
        --failnode=oss02_ib@o2ib \
        --mgsnode=mds01_ib@o2ib \
        --mgsnode=mds02_ib@o2ib \
        /dev/mpath/testfs_ost00
[root]# mkfs.lustre --ost --fsname=testfs \
        --mkfsoptions='-O flex_bg -G 256 -E stride=32,stripe-width=256' \
        --failnode=oss02_ib@o2ib \
        --mgsnode=mds01_ib@o2ib \
        --mgsnode=mds02_ib@o2ib \
        /dev/mpath/testfs_ost01
[root]# mkfs.lustre --ost --fsname=testfs \
        --mkfsoptions='-O flex_bg -G 256 -E stride=32,stripe-width=256' \
        --failnode=oss02_ib@o2ib \
        --mgsnode=mds01_ib@o2ib \
        --mgsnode=mds02_ib@o2ib \
        /dev/mpath/testfs_ost02
[root]# mkfs.lustre --ost --fsname=testfs \
        --mkfsoptions='-O flex_bg -G 256 -E stride=32,stripe-width=256' \
        --failnode=oss02_ib@o2ib \
        --mgsnode=mds01_ib@o2ib \
        --mgsnode=mds02_ib@o2ib \
        /dev/mpath/testfs_ost03
[root]# mkfs.lustre --ost --fsname=testfs \
        --mkfsoptions='-O flex_bg -G 256 -E stride=32,stripe-width=256' \
        --failnode=oss02_ib@o2ib \
        --mgsnode=mds01_ib@o2ib \
        --mgsnode=mds02_ib@o2ib \
        /dev/mpath/testfs_ost04
[root]# mkfs.lustre --ost --fsname=testfs \
        --mkfsoptions='-O flex_bg -G 256 -E stride=32,stripe-width=256' \
        --failnode=oss02_ib@o2ib \
        --mgsnode=mds01_ib@o2ib \
        --mgsnode=mds02_ib@o2ib \
        /dev/mpath/testfs_ost05
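Since the six commands differ only in the target device, in practice they can be generated in a loop. A dry-run sketch that prints each command for review first (remove the echo to actually format the targets, which is destructive):

```shell
# Dry-run: print the mkfs.lustre command for each of the six OSTs
# on this brick. Remove 'echo' to really format the targets.
for i in 0 1 2 3 4 5; do
    ost=$(printf 'testfs_ost%02d' "$i")
    echo mkfs.lustre --ost --fsname=testfs \
        --mkfsoptions="-O flex_bg -G 256 -E stride=32,stripe-width=256" \
        --failnode=oss02_ib@o2ib \
        --mgsnode=mds01_ib@o2ib --mgsnode=mds02_ib@o2ib \
        "/dev/mpath/$ost"
done
```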
Start object storage services:
[root]# mount -t lustre /dev/mpath/testfs_ost00 /lustre/testfs_ost00
# repeat the above mount command for the remaining OSTs (6 OSTs per OSS)
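The six mounts on each OSS follow the same pattern, so they can also be scripted. A dry-run sketch that prints the mount-point creation and mount commands:

```shell
# Dry-run: print the commands that start the six object storage
# services on one OSS. Remove 'echo' to execute them.
for i in 0 1 2 3 4 5; do
    ost=$(printf 'testfs_ost%02d' "$i")
    echo mkdir -p "/lustre/$ost"
    echo mount -t lustre "/dev/mpath/$ost" "/lustre/$ost"
done
```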
Start Lustre on the client node:
# Mount clients
[root]# mount -t lustre mds01_ib@o2ib:mds02_ib@o2ib:/testfs /testfs
Dell Storage Brick with Lustre 2.1.x Performance Tests
The purpose of the performance testing is to provide accurate Lustre filesystem throughput measurements of the Dell
Lustre Storage System as configured in this paper. The results can then be compared with the expected theoretical
performance of the tested storage system. Performance tests were conducted using one HA-MDS module and up to three
HA-OSS modules.
Figure 12 shows the system configuration used in the performance tests. The system consists of three Dell Lustre Bricks
configured with failover and a number of client servers. All servers are interconnected with a fast FDR InfiniBand network.
Figure 12. Test system setup.
Dell Lustre Storage Brick Test Tools
[Diagram: testing layers and tools – Lustre clients (IOzone, IOR); InfiniBand network (LNET selftest); OST/MDT (obdfilter-survey); RAID/Linux block device layer (sgpdd-survey, dd); disks]
It is good practice to obtain baseline performance figures for all storage system elements involved in tests. Figure 13 shows
a high level representation of the testing path that should be followed to properly profile a storage system. Lustre provides
a handful of profiling tools which can be used to obtain baseline figures for the backend storage system hardware. These
tools allow testing the hardware platform from the very bottom of the benchmarking stack by measuring raw performance
data with minimal software overhead. Moving up the stack Lustre provides tools to measure the server’s local filesystem,
the network connecting client nodes and the clients themselves.
Figure 13. Lustre storage testing path.
sgpdd-survey
The Lustre IOKit, included in the lustre-test RPM package, provides a set of tools which help to validate the performance
of various hardware and software layers. One of the tools, called sgpdd-survey, can test bare-metal performance of the
back-end storage system, while bypassing as much of the kernel as possible. This tool can be used to characterise the
performance of a storage device by simulating an OST serving files consisting of multiple stripes. The data collected by
this survey can help to determine the performance of a Lustre OST exporting the device. sgpdd-survey is a shell script
included in the Lustre IOKit toolkit. It uses the sgp_dd tool provided by the sg3_utils package, which is a scalable version
of the dd command and adds the ability to do multithreaded I/O with various block sizes and to multiple disk regions.
The script uses sgp_dd to carry out raw sequential disk I/O. It runs with variable numbers of sgp_dd threads to show how
performance varies with different request queue depths. The script spawns variable numbers of sgp_dd instances, each
reading or writing to a separate area of the disk to demonstrate performance variance within a number of concurrent
stripe files. The purpose of running sgpdd-survey is to obtain a performance baseline and to set expectations for the
Lustre filesystem.
crg (concurrent regions) – describes how many regions on the disk are written to or read by sgp_dd. crg simulates
multiple Lustre clients accessing an OST. As crg grows, performance will drop due to more disk seeks (especially for reads).
thr (number of threads) – this parameter simulates Lustre OSS threads. More OSS threads can do more I/O, but if too
many threads are in use, the hardware will not be able to process them and performance will drop.
obdfilter-survey
Once the block devices are formatted with a Lustre filesystem, the next step is to benchmark the OSTs and their underlying
ldiskfs filesystem. This can be done using a tool provided by the Lustre IOKit called obdfilter-survey. This script profiles the
overall throughput of the storage hardware by applying a range of workloads to the OSTs. The main purpose of running
obdfilter-survey is to measure the maximum performance of a storage system and to find the saturation points which
cause performance drops.
obj (Lustre objects) – describes how many Lustre objects are written or read. obj simulates multiple Lustre clients
accessing the OST and reading/writing multiple objects. As the number of objects grows, performance will drop due to
more disk seeks (especially for reads).
thr (number of threads) - this parameter simulates Lustre OSS threads. More OSS threads can do more I/O, but if too many
threads are in use, the hardware will not be able to process them and performance will drop.
LNET selftest
The LNET selftest (lst) utility helps to test the operation of a Lustre network between servers and clients. It allows
verification that the Lustre network was properly configured and that performance meets expected values. LNET selftest
runs as a kernel module and has to be loaded on all nodes that take part in the test. A utility called lst is used to configure
and launch Lustre network tests.
IOR
IOR is an I/O benchmark providing a wide range of useful functions and features. With IOR, clients first write data; then,
when the read phase is performed, each client reads data written by another client. This completely eliminates the client
read-cache effect, avoiding the problem of having to flush the client cache after the write. IOR uses MPI for multinode
communication and thread synchronisation, which helps to provide very accurate results from large multinode tests.
IOR can also be compiled with Lustre support, allowing some Lustre parameters, such as file striping, to be tuned.
[Chart data: sgpdd-survey with cache mirroring enabled; write and read throughput (MB/s) against number of threads (100–900) for crg values 12, 24, 48, 96 and 192]
Tests Results
Command Line:
size="24576" crglo=1 crghi=16 thrlo=1 thrhi=64 scsidevs="/dev/sdk
/dev/sdw /dev/sdaa /dev/sdo /dev/sdi /dev/sdu /dev/sdm /dev/sdy
/dev/sdg /dev/sds /dev/sdac /dev/sdq" ./sgpdd-survey
Figure 14. sgpdd-survey single brick write test (bare metal performance).
Figure 15. sgpdd-survey single brick read test (bare metal performance).
[Chart data: obdfilter-survey with cache mirroring enabled; write and read throughput (MB/s) against number of threads (32–512) for obj values 12, 24, 48 and 96]
Command Line:
nobjlo="2" nobjhi="16" thrlo="2" thrhi="64" size="24576"
targets="testfs-OST0000 testfs-OST0001 testfs-OST0002 testfs-OST0003
testfs-OST0004 testfs-OST0005" ./obdfilter-survey
Figure 16. obdfilter-survey write test.
Figure 17. obdfilter-survey read test.
[Chart data: IOR with cache mirroring enabled; write and read throughput (MB/s) against number of clients (4–32) for 1, 2 and 3 bricks]
Command Line:
IOR -a POSIX -N 30 -b 8g -d 5 -t 1024k -o /testfs/test_file -e -g -w
-r -s 1 -i 2 -vv -F -C
Figure 18. IOR sequential write test.
Figure 19. IOR sequential read test.
Tests Results - Conclusions
The raw performance test results shown in Figures 14 and 15 show good baseline performance for both reads and writes.
The write performance is lower due to active cache mirroring. This feature has to be enabled if Dell Lustre bricks are used
in a failover configuration, to ensure data integrity in the event of a RAID controller failure. All further tests are run with
cache mirroring enabled.
The obdfilter-survey tests presented in Figures 16 and 17 show that formatting the raw block devices with the Lustre
filesystem slightly decreases performance, due to the Lustre backend filesystem overhead. The performance loss is small,
in the range of a few percent.
The IOR large throughput test shown in Figures 18 and 19 was run on a range of Lustre clients from 4 to 32. Each client
writes its own file, and the files' distribution is balanced equally across all bricks and all OSTs. For small node counts, read
outperforms write throughput. However, as the node count increases – thereby increasing the number of files written
– I/O concurrency on the OSS grows, so read performance drops. When comparing these results to tests from the
previous paper, we can see improvement in both writes and reads. This is most likely thanks to new features such as
server-side caching and increased Lustre multithreading.
Summary
As HPC commodity clusters continue to grow in performance and more applications adopt parallel I/O, traditional
filesystems such as NFS, NAS, SAN and DAS are failing to keep up with HPC I/O demand. The Lustre 2.x parallel filesystem
is gaining ground within the departmental and workgroup HPC space since it has proved capable of meeting the high I/O
workload within HPC environments, as well as presenting a range of additional operational benefits as described in the
introduction. This paper presents a structured description of how to build a Dell PowerVault MD3200 Lustre 2.x storage
brick. The storage brick as described shows good sequential I/O throughput, high redundancy and high availability with a
good set of data safety features, all of which are important within the HPC environment. The storage brick shows very
good Lustre filesystem performance. This solution has demonstrated scalable performance across the storage bricks,
yielding over 4GB/s file I/O bandwidth for reads and over 3 GB/s for writes with high levels of availability and data security
when using three Dell Lustre Storage Bricks. Operational experience has demonstrated less than 0.5% unplanned
downtime. We have found the Dell Lustre Storage Brick is a good match to both HPC performance and the operational
requirements seen within a large university HPC data centre. The Cambridge HPC Service has constructed a two-silo
solution, which provides very high throughput with built-in redundancy and a large capacity storage system that currently
satisfies all our users’ needs. Shortly, this system will be expanded to a four-silo solution providing even more throughput
to drive University of Cambridge science and research. The solution described here has been used in the Cambridge HPC
production environment for the last three years, serving over 500 users, and has demonstrated less than 0.5%
unscheduled downtime in our 24x7 operational service.
This Whitepaper is for informational purposes only, and may contain typographical errors and technical inaccuracies. The content is provided as is, without express or implied warranties of any kind.
©2014 Dell Corporation Ltd. All rights reserved. Dell, the Dell Logo and Latitude are registered trademarks or trademarks of Dell Corporation Ltd. Lustre is a registered Trademark of Oracle in the United States and in other countries. Intel, Intel logo, Intel Xeon, Intel Dual Core are registered trademarks of Intel Corporation in the United States and in other countries. Other trademarks or trade names may be used in this document to refer to either the entities claiming the marks and names or their products. Dell disclaims any proprietary interest in the marks and names of others. Dell Corporation Limited. Dell Corporation Limited, Reg. No. 02081369, Dell House, The Boulevard, Cain Road, Bracknell, Berkshire RG12 1LF.