
High Performance, Open Source, Lustre 2.1.x Filesystem

on the Dell PowerVault MD32xx Storage, PowerEdge

Server Platform and Mellanox InfiniBand Interconnect

Dell/Cambridge HPC Solution Centre | Wojciech Turek | May 2013

© Dell


Table of Contents

Abstract
Introduction
Lustre 2.x Overview
Lustre 2.1.x as a Production Filesystem
Lustre 2.1.x Features
Changelogs (2.1.x)
Commit on Share (2.1.x)
lustre_rsync (2.1.x)
File Identifier (2.1.x)
Ext4/ldiskfs (2.1.x)
Lustre 2.3 Features
Backup/Restore – Object Index Scrub (2.3)
SMP Scalability (2.3)
Job Stats (2.3)
CRC Checksums (2.3)
Statahead (2.3)
Lustre Silo Model
Hardware Overview
Hardware Components
HA-MDS Overview
HA-MDS Configuration
HA-OSS Overview
HA-OSS Configuration
Software Components Overview
Driver Installation
Disk Array Configuration
Lustre 2.1.x Installation and Configuration
Lustre Network Configuration
Lustre Configuration for HA-MDS Module
Lustre Configuration for HA-OSS Module
Dell Storage Brick with Lustre 2.1.x Performance Tests
Dell Lustre Storage Brick Test Tools
sgpdd-survey
obdfilter-survey
LNET Selftest
IOR
Test Results
Test Results - Conclusions
Summary


Abstract

The following paper was produced by the Dell Cambridge HPC Solution Centre and is the third in a series of detailed Lustre® papers based on operational experience gained from using mass storage technologies within the University of Cambridge production HPC environment. The paper provides a detailed description of how to build a commodity ‘Dell Lustre Storage Brick’, together with comprehensive performance characteristics obtained from the Lustre storage brick when integrated into an FDR InfiniBand® network. This paper updates information described in the previous paper and focuses on Lustre 2.x features, installation and performance. The paper also discusses the operational characteristics and system administration best practices derived from long-term, large-scale production usage of the solution within the University of Cambridge HPC environment.

Introduction

This paper has been produced by the Dell Cambridge HPC Solution Centre, which is part of the wider University of Cambridge HPC Service. The service runs one of the largest HPC facilities in the UK, with a large user base and a high I/O workload. The Solution Centre works closely with Dell customer-facing staff, as well as engineering, to provide solution development and a production test environment for Dell, in which the operational experience of the Cambridge HPC Service (HPCS) is combined with customer requirements and Dell's product and implementation knowledge to produce innovative, robust and fully-tested HPC solutions. These solutions are then made openly available to the HPC community via white papers and technical bulletins.

The Cambridge HPCS has been successfully using the Lustre parallel filesystem within a large-scale central cluster filesystem for over six years. The current production Lustre system has 1PB of usable space and is continually growing to match ever-increasing user demand. The Dell Lustre Storage Brick described here is an upgrade to our previous Lustre filesystem and uses the latest stable version of the Lustre 2.1.x branch. The Dell Lustre Storage Brick has evolved to fill the operational requirements of a busy, working HPC centre and has been used in production at Cambridge for several years. Its development was critical in order to meet the pressing operational requirements caused by a growing demand for storage volume and the inability of our Network File System (NFS) filesystems to keep up with the I/O workload.

This paper describes in detail how to build and configure a Lustre parallel filesystem storage brick using Dell PowerVault systems. The solution uses current Dell MD3200/MD1200 arrays, Lustre 2.1.5 and an FDR InfiniBand client network, and as such represents an example of a current best-in-class commodity Lustre solution. The paper also comments on operational characteristics and system administration best practices derived from over six years of production usage of a large-scale mass storage solution of the same architecture. The main focus of this paper is to show how to build a commodity storage array using Lustre 2, with the objective of presenting the new Lustre features and the motivation to start using the new, improved Lustre.


Lustre 2.x Overview

Lustre is an open source cluster parallel filesystem which for almost a decade has been used in production on many of the world's largest clusters. The Lustre 1.8 branch has reached exceptionally high stability and is still considered to be the best choice for small to medium production filesystems. However, the rapid growth of the HPC industry has caused a high demand for storage space and throughput. This means that the filesystem needs to be able to scale and meet continuously growing demands while retaining its performance levels and stability. Lustre 1.8 is limited in that regard, and hence work on Lustre 2 began a few years ago. Lustre 2 has been designed from the outset to meet the high demands of modern hardware and software. The current Lustre code is maintained by Whamcloud/Intel and receives input from many other contributors (Xyratex, OpenSFS, EOFS and others).

The current maintainers of the Lustre code have divided Lustre 2 releases into two streams: the maintenance release stream and the feature release stream. Currently, Lustre 2.1.x is the designated maintenance release stream. It does not receive any new feature updates, only bug fixes and updates to reflect kernel changes. This ensures greater stability of the release and makes it suitable for production environments. The Lustre 2.1.x stream is targeted at conservative users wanting to use a well proven release while still being able to take advantage of the new Lustre 2 architecture. The feature release stream is targeted at users requiring access to new features and able to accept a lower stability level. The current feature release is Lustre 2.3.x and the next planned feature release will be 2.4.x. Whamcloud intends to land feature releases at six-month intervals.

Lustre 2.1.x as a Production Filesystem

Lustre 1.8.x is currently the most popular release of Lustre and has been widely used in small to medium production systems. However, it is quickly being outgrown by the continually growing demands of user applications and modern hardware. One of the shortcomings of Lustre 1.8 is the lack of support for Object Storage Targets (OSTs) bigger than 16TB (this limitation is due to 32-bit block numbers in ext4). Nowadays, when disks in RAID arrays can be 3TB in size or more, the ext4 limit of 16TB per OST becomes a big problem in the creation of an optimal RAID group configuration. Lustre requires careful tuning of RAID layout and disk parameters to achieve best performance. At the time of writing this paper there is ongoing work to enable the 64-bit feature of ext4, which will allow creation of OSTs bigger than 16TB in Lustre 1.8.x [LU-419]. Another important shortcoming of Lustre 1.8.x is a code base that does not scale well enough on large storage systems, hence the need for the new, more scalable code base that was introduced in Lustre 2. The Lustre 2 code base promises to deliver much better bandwidth and metadata performance. This is being achieved by rewriting the MDS and client I/O subsystems. The new Lustre 2 architecture and code base enable implementation of a range of new features that are required to use, maintain and manage large amounts of data. For these reasons Lustre 2.1.x is a better choice for modern storage platforms. Although it is possible to upgrade from Lustre 1.8 to Lustre 2, this is not recommended for production systems, as the conversion method has not been extensively tested and may have bugs; the upgrade is also a one-way process. The choice of Lustre 2 needs to be made at filesystem creation, as this will avoid future problems and compatibility issues.

Lustre 2.1.x Features

Lustre is in its second decade as a high performance, production parallel filesystem. Lustre has grown and adapted to continuously growing demands to increase performance while maintaining production quality stability. With the arrival of the 2.x release series, a number of new features have been introduced. Most of the features are not yet available in the maintenance release (2.1.x), but they are being implemented and tested in the feature releases (2.3 and 2.4).


Changelogs (2.1.x)

The new Lustre 2 code introduces a feature called ‘changelogs’, which records all events that change the filesystem's metadata or namespace. Changelogs record events such as file creation, deletion, renaming, attribute changes, etc. Recorded changelogs are persistent and are kept on local disk until cleared by the user. They accurately reflect the on-disk changes in the event of a server failure. These records can be used for a variety of purposes (an example of registering and reading a changelog is shown after the list below), such as:

• Tracking changes in the filesystem and feeding them to hierarchical storage management (HSM) software

• Using recorded events to replicate exactly all changes in a filesystem mirror

• Creating filesystem policies to take action upon certain events for files and directories

• File backup policy management

• Speeding up recovery after a server crash

• Helping to construct an audit trail for the filesystem.
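As an illustration, the short sequence below registers a changelog consumer on the metadata target, reads the recorded events and clears them once processed. It is a minimal sketch assuming the testfs filesystem and MDT name (testfs-MDT0000) used later in this paper; check the Lustre 2.x manual for the exact syntax of your release.

# On the MDS: register a changelog consumer; the command returns an ID such as cl1
[root]# lctl --device testfs-MDT0000 changelog_register

# Dump the recorded changelog records
[root]# lfs changelog testfs-MDT0000

# Clear records up to a given end record (0 = all processed records) for consumer cl1
[root]# lfs changelog_clear testfs-MDT0000 cl1 0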

Commit on Share (2.1.x)

From time to time Lustre servers may need to be taken down for maintenance, or they may simply crash due to hardware component failures; the Lustre services then have to be recovered. When a Lustre MDS server enters recovery mode, clients begin to reconnect and start replaying their uncommitted transactions. The recovery time window is limited and depends on the obd timeout. If one or more clients miss their recovery window, it may cause other clients to abort their outstanding transactions or even be evicted. If a client is evicted, all outstanding transactions that belong to that client have to be aborted. This may lead to a cascading eviction effect, as transactions depending on the aborted ones fail in turn. Lustre 2 addresses this problem with the Commit on Share feature by eliminating dependent transactions, so clients may replay their requests independently without risk of being evicted. This allows clients to recover regardless of the state of other clients. In the event of multiple client failures this reduces recovery problems significantly.
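The feature is controlled by a per-MDT tunable. The commands below are a hedged illustration only; they assume the tunable is exposed as commit_on_sharing, as in recent 2.x releases, so verify the parameter name against your installed version.

# Check whether Commit on Share is enabled on the MDS (1 = enabled)
[root]# lctl get_param mdt.*.commit_on_sharing

# Enable it at runtime if required
[root]# lctl set_param mdt.*.commit_on_sharing=1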

lustre_rsync (2.1.x)

Moving large amounts of data between filesystems in a reasonably short time can be a daunting task for a system administrator. Using standard tools for copying data would require the filesystem to be scanned file by file and directory by directory, which for large filesystems can be unreasonably time consuming. Lustre 2 addresses this problem by introducing the lustre_rsync tool, which uses changelogs to determine which files need to be copied. This is much more efficient than the traditional method of scanning filesystem metadata. lustre_rsync can use another Lustre filesystem, or any other filesystem, as its target. lustre_rsync is a perfect tool for managing a backup of a Lustre filesystem. It allows one to quickly synchronise a backup filesystem with the source, or vice versa; in the case of recovery it allows one to perform reverse replication.
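A minimal sketch of a replication run is shown below. It assumes the testfs filesystem configured later in this paper, a hypothetical backup filesystem mounted at /backup/testfs, and a changelog consumer ID (cl2) obtained with lctl changelog_register as shown earlier; adjust the names to match your environment.

# Replicate the changes recorded for consumer cl2 from the live filesystem to the backup
[root]# lustre_rsync --source=/testfs --target=/backup/testfs \
        --mdt=testfs-MDT0000 --user=cl2 \
        --statuslog /var/log/lustre_rsync.log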

File Identifier (2.1.x)

Although the File Identifier (FID) is not a new concept, Lustre 2 introduces a new FID and request format. Lustre 1.8 uses a different FID for MDT inodes and OST objects. The MDT FID depends on ldiskfs and is only unique to a single MDT. This model has several problems and constraints which are addressed by the new FID introduced in Lustre 2. The new FID scheme is independent of the underlying filesystem; is unique across the entire Lustre filesystem; simplifies Lustre recovery by eliminating the need to recreate a specific inode number in replay; can be generated on the client, allowing a metadata writeback cache; and adds support for object versioning. The MDT and OST will use the same 128-bit FID, but the MDS and clients will be assigned a new sequence for each OSS. The new FID scheme introduces interoperability constraints between Lustre 1.8 and 2.x. Since Lustre version 1.8.6, the Lustre client understands the new FID format and can communicate with Lustre 2.x servers. However, the Lustre 2.x client does not understand the old Lustre 1.8 format and servers must be upgraded to version 2.x. A live upgrade is not possible due to the RPC format change, which would result in a global client eviction.
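FIDs can be inspected from any client with the lfs utility, which is useful when correlating changelog records with files. The example below is illustrative; the path is hypothetical and the FID printed by path2fid should be substituted into fid2path.

# Print the FID of a file
[root]# lfs path2fid /testfs/somedir/somefile

# Resolve a FID (as reported by path2fid or a changelog record) back to a path
[root]# lfs fid2path testfs <FID>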

Ext4/ldiskfs (2.1.x)

The Lustre 2.x underlying ldiskfs filesystem has received a number of changes which make it better suited to large-scale Lustre deployments. One of the biggest limitations of previous incarnations of ldiskfs was the limited size of OSTs due to the 32-bit nature of the code. This limitation has now been removed and the maximum supported OST size is now 128TB. Bigger OSTs may mean longer mkfs and fsck times; this is, however, addressed by the flex_bg feature and by creating fewer inodes during filesystem creation. The flex_bg feature helps to avoid seeks for both data and metadata, co-locates multiple block/inode bitmaps, and provides larger contiguous space. The ldiskfs defaults for the MDT were also changed to increase inode density, which allows the use of flash-based storage for bigger filesystems.

Lustre 2.3 Features

The recently released Lustre 2.3 introduces many interesting new features. Some of them aim to further increase scalability

and performance. Some introduce new administrative functionalities and make the filesystem better suited for production

environments. Lustre 2.3 is a good candidate to be declared a new maintenance release as soon as it proves its stability.

Below is a list of some of the new features implemented in the latest Lustre 2.3 .

Backup/Restore – Object Index Scrub (2.3)

In Lustre 2.x, the FID is the unique identifier for a file or object. It differs from the traditional inode number/generation and is independent of the backend filesystem. Internal to the Lustre Object Storage Device (OSD), an Object Index (OI) table is used to map the FID to the local inode number. One scenario where the inode number will change arises if the MDT is restored from a file-level backup. In this case, the OI table that is restored from backup will be stale, since the inode number/generation will be newly allocated by the backend filesystem for each file. The OI table can be rebuilt by iterating through all inodes and correcting the FID-to-inode mapping in the OI. In addition, the ability to iterate through inodes and cross-check the OI against the filesystem is useful for maintaining filesystem consistency. It is important to note that since this feature is not available in Lustre 2.1.x, the only way to backup/restore the MDT is by performing a block-level copy of the MDT device, for example with the dd command.

SMP Scalability (2.3)

Current versions of Lustre rely on a single active metadata server. Metadata throughput may become a bottleneck for large filesystems with many thousands of nodes. System architects typically resolve metadata performance issues by deploying the MDS on more powerful hardware. However, this approach is not a foolproof solution. Nowadays, faster hardware means more cores, and Lustre has a number of SMP scalability issues when running on servers with many CPUs. Scaling across multiple CPU sockets is a real problem. Lustre 2.3 introduces several new patches that address some of the scalability issues, mainly in the performance of the Lustre network layer (LNET) and the performance of shared directories.


Job Stats (2.3)

This feature collects filesystem operation statistics for the jobs running on Lustre clients. When a job scheduler is running on a compute node, the Lustre client will include the job ID in each request, such as write, read, create, unlink, etc. The server will record this information and expose it via lprocfs. Currently, a number of popular schedulers are supported, such as SLURM, PBS, Torque/Moab, LSF, SGE and LoadLeveler. This feature is particularly useful for job profiling and for discovering a job's I/O bottlenecks.
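As a brief, hedged example of how this might be wired up for SLURM on the testfs filesystem (parameter names as documented for Lustre 2.3; verify against your release):

# On the MGS: tell clients which environment variable carries the job ID
[root]# lctl conf_param testfs.sys.jobid_var=SLURM_JOB_ID

# Read per-job statistics on an OSS and on the MDS
[root]# lctl get_param obdfilter.testfs-OST*.job_stats
[root]# lctl get_param mdt.testfs-MDT0000.job_stats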

CRC Checksums (2.3)

Checksums were initially implemented in software, but in Lustre 2.x hardware accelerated checksums were implemented. This feature uses a kernel API to choose optimal hash and checksumming algorithms as opposed to fixed values. Hardware accelerated checksums are supported on both Intel and AMD processors.

Statahead (2.3)

Commands such as ‘ls -l’, ‘du’ and ‘find’, which traverse a directory sequentially, are frequently used. Lustre processes these commands by triggering several RPCs for each file inside the directory. Initially, these RPCs were processed serially and synchronously, hurting performance. This is especially bad because these commands are exposed to the filesystem users and define their experience with it. Lustre 2.x supports statahead, which addresses this problem by pre-fetching objects' attributes from the MDS so that many MDS-side RPCs now overlap with other processing. In addition, Lustre 2.3 introduces the asynchronous glimpse lock (AGL) to prefetch file sizes from the OSTs.
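The statahead window can be inspected and tuned per client using the llite parameters present in Lustre 2.x; the values below are illustrative, not recommendations.

# Current maximum number of files whose attributes are pre-fetched ahead of 'ls -l'
[root]# lctl get_param llite.*.statahead_max

# Increase the window and check how effective statahead has been
[root]# lctl set_param llite.*.statahead_max=128
[root]# lctl get_param llite.*.statahead_stats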

Lustre Silo Model

A typical approach to the design of a Lustre-based storage system is to create one large filesystem spanning multiple hardware instances. This allows not only the building of high capacity storage but, since Lustre throughput grows almost linearly with the addition of disk arrays, it also delivers very high performance in a single filesystem namespace. Although binding all the hardware into one large filesystem clearly has its advantages, it creates a single point of failure and hence decreases data safety. In applications where data safety is of secondary importance and can be sacrificed in favour of a less complex design, bigger capacity and higher performance, this approach is acceptable. However, in applications where data safety is critical, mirroring or backing up the data on secondary hardware can provide the necessary resiliency. This is, however, a costly solution and often not implemented due to budget limitations. To address the data safety problem at lower expense, the silo architecture may be the answer. This type of architecture is commonly used in NFS-based storage systems, where a large number of disk arrays are divided into silos of a certain fixed size. The limitation of NFS is that each NFS filesystem can have only one NFS server, which creates a throughput and metadata bottleneck. Lustre, due to its parallel nature, allows the creation of large silos which span multiple storage servers and multiple disk arrays, providing good throughput and metadata performance per silo filesystem. Multiple filesystems mean slightly more complex data management, but this is outweighed by the decreased risk of losing all the data if one of the filesystems fails. Experience teaches us that bad things do happen, hence taking measures to minimise their impact is a good practice that needs to be adopted early in the storage system design process. Figure 1 below shows an example of a single Lustre silo of 500TB capacity consisting of four OSS modules.


Figure 1. Single 500TB Lustre silo.

The silo architecture is based on the assumption that each silo should have enough reserved free capacity so that, if necessary, data from one filesystem can be migrated to the remaining silo filesystems. For 2PB of usable storage hardware it is recommended to create four 500TB silos. Each silo will have 25% of its capacity reserved, allowing migration of data between the silos if needed; see Figure 2 below. Figure 3 shows an example scenario where one of the silos fails, or needs to be taken offline, and its data is migrated off it. Using the silo model, the system administrator can migrate the data of one silo to the remaining three and continue to serve all data without interruption. The Lustre silo model introduces more complexity into data management and localization; however, it significantly increases data resiliency and also increases stability by balancing the load across multiple Lustre silo filesystems.

Figure 2. 2PB of usable storage divided into four Lustre silos.


Figure 3. An example scenario when one silo fails and data can be redistributed across remaining silos.

Hardware Overview

The hardware design of the Dell Lustre Storage System is driven by the following main goals: maximised I/O performance, scalability, reliability, stability, data safety and integrity, simplified configuration and a competitive price point. These goals are accomplished by using well-established, enterprise-level hardware available on the market; open source software with a large user base and a strong code development team; a high performance InfiniBand network; and a modular design approach.

The following modules are used in the Dell Lustre Storage System design:

HA-MDS/MGS – High Availability Metadata Server and Management Server module.

HA-OSS – High Availability Object Storage Server module.

See Figure 4 below for the logical relationship between clients and the storage modules.


Figure 4. Overview of Dell Lustre Storage System design.

Hardware Components

Below is an overview of the core building blocks of the Dell Lustre Storage System:


HA-MDS brick

• 3.2TB of raw disk space

• 1.6TB of RAID10 usable disk space

• FDR InfiniBand

• MDS/MGS servers: two Dell PowerEdge R720 servers, running Scientific Linux 6.2 and the latest Lustre 2.1.x, each equipped with a dual-socket motherboard, eight-core Intel Sandy Bridge CPUs and 128GB of RAM

• Storage: one Dell PowerVault MD3220 with 24 136GB 15k SAS drives.

HA-OSS brick

• 240TB of raw disk space

• 192TB of RAID6 usable disk space

• FDR InfiniBand

• OSS servers: two Dell PowerEdge R510 servers, running Scientific Linux 6.2 and the latest Lustre 2.1.x, each equipped with a dual-socket motherboard, six-core Intel CPUs and 32GB of RAM

• Storage: one Dell PowerVault MD3200 and four MD1200 enclosures, each with 12 4TB NL SAS drives.


HA-MDS Overview

The HA-MDS module comprises two Dell PowerEdge R720 servers, each equipped with two high-performance Intel® Xeon® E5 processors, 128GB of RAM and two low-latency, enterprise-class 15K SAS hard drives managed by an internal H310 RAID controller (Figure 5). The internal drives are configured as a RAID1 group and provide fast and reliable disk space for the Linux OS. The reason we chose the R720 over the R510 is the former's greater expandability and the ability to add an additional expansion enclosure later if needed.

The Dell PowerVault MD3220 direct attached external RAID storage array is designed to increase cluster metadata performance and functionality. Equipped with data-protection features to help keep the storage system online and available, the storage array is architected to avoid single points of failure by employing redundant RAID controllers and cache coherency provided by direct mirroring.

Figure 5. Dell Lustre HA-MDS module.

The HA-MDS module is configured to provide a good IOPS rate and high availability of data. The metadata server connects to the disk array via multiple redundant, fast 6Gbps SAS connections. The system is built to avoid a single point of failure by using two dual-port HBA cards, multiple SAS links and dual RAID controllers.

MD3220 storage array key features:

• 24 small form factor 2.5-inch 15k SAS disk drives (with support for SSDs)

• Mirrored data cache of up to 2048MB on each RAID controller

• Persistent cache backup on SD flash disk.


• Snapshots of virtual disks.

• High Performance Tier to optimise IOPS and throughput.

• In-band and out-of-band management using Dell Modular Storage Manager.

• Redundant, hot-swappable components and virtual disk failover.

• Transparent failover for clustered operating systems to maximise data protection.

• RAID level migration, capacity expansion, consistency checks, and Self-Monitoring, Analysis and Reporting Technology (SMART).

HA-MDS Configuration

Metadata performance is one of the key factors influencing the overall Lustre filesystem responsiveness and usability. Hence, a correct choice of hardware and optimal configuration of the RAID controllers play an important role in the system build process. Correctly configured RAID groups minimise the number of expensive read and write operations that are needed to complete a single I/O operation and also increase the number of operations that can be completed per unit time.

Figure 6. Dell Lustre HA-MDS module RAID configuration.

Figure 6 shows the disk layout of the Dell Lustre HA-MDS module. The 24 drives are divided into three groups, each comprising eight disks and configured as a RAID10 disk group with a chunk size of 64KB. Since Lustre metadata operations consist of many small I/O transactions, a 64KB chunk size is a good choice and provides satisfactory I/O performance. Another parameter that can make a significant impact on the I/O performance of the array is the cache block size, which can be set to a value between 4KB and 32KB. A small cache block size is best suited to small transactional I/O patterns; thus, for MDS purposes, we set this tunable to 4KB, which matches Lustre metadata transactions well. A single MDS module, configured as above, can provide metadata targets for three Lustre filesystems.

The server RAM size also has a significant impact on Lustre metadata responsiveness. The more RAM the MDS has, the more aggressive the tuning that can be used to improve expensive metadata operations such as ‘ls’ and ‘find’. This is because the Lustre MDS can cache locks on a per-client basis, which makes it possible to configure some nodes, for example login nodes, with large lock caches and so improve Lustre responsiveness to metadata operations.

HA-OSS Overview

The HA-OSS module consists of a pair of Dell PowerEdge R520 servers attached to a shared storage brick consisting of a single PowerVault MD3200 disk array extended with four PowerVault MD1200 disk enclosures (Figure 7). Each PowerEdge server is equipped with two high-performance Intel® Xeon® processors, 32GB of RAM, two internal hard drives and an internal RAID controller. All disk enclosures are populated with fast 4TB NL SAS disks. The Dell PowerVault MD3200 direct attached external RAID storage array is built to increase cluster I/O performance and functionality. Equipped with data-protection features to help keep the storage system online and available, the storage array is designed to avoid single points of failure by employing redundant RAID controllers and cache coherency provided by direct mirroring. The Dell PowerVault disk array provides performance and reliability at a commodity price point.

Figure 7. Dell Lustre HA-OSS module.

To ensure high availability failover, each OSS connects to both MD3200 arrays, providing four redundant links to the PowerVault MD3200 disk array per OSS. This design allows either OSS to drive all the volumes on the Dell PowerVault disk arrays. During normal operation, each OSS exclusively owns a subset of volumes. In the event of failure the remaining server in the failover pair will take over the volumes from the failed server and subsequently service requests to all the volumes.

MD3200 storage array key features:

• 12 x 3.5 inch disk slots

• Mirrored data cache of up to 2048 MB on each RAID controller

• Persistent cache backup on SD flash disk

• Snapshots of virtual disks

• High Performance Tier to optimise IOPS and throughput

• In-band and out-of-band management using Dell Modular Storage Manager

• Redundant, hot-swappable components and virtual disk failover

• Transparent failover for clustered operating systems to maximise data protection

• RAID level migration, capacity expansion, consistency checks, and Self-Monitoring, Analysis and Reporting Technology (SMART).


HA-OSS Configuration

An important factor affecting overall Dell PowerVault MD3200 storage array performance is the correct choice and configuration of the RAID disk groups and virtual disks, especially when using RAID6 disk groups. Correctly configured RAID groups minimise the number of expensive read and write operations that are needed to complete a single I/O operation.

Figure 8. RAID groups and virtual disk layout of the Dell Lustre HA-OSS module.

Figure 8 shows the disk configuration of the Dell Lustre HA-OSS module. There are six RAID6 disk groups configured. Each RAID6 disk group consists of eight data disks and two parity disks. The segment size (chunk size) of the disk group is 128KB, so the stripe size is 1024KB. A correct segment size minimises the number of disks that need to be accessed to satisfy a single I/O operation. The typical Lustre I/O size is 1MB, therefore each I/O operation requires only one RAID stripe. The disks in each disk group are spanned across all five disk enclosures, which leads to faster disk access per I/O operation. For optimal performance, only one virtual disk per disk group is configured, which results in more sequential access to the RAID disk group and may significantly improve I/O performance. The Lustre manual suggests using the full device instead of a partition (sdb vs sdb1). When using the full device, Lustre writes well-aligned 1MB chunks to disk; partitioning the disk removes this alignment and may noticeably impact performance. In this configuration, a RAID disk group distributed across all enclosures has the additional advantage of having no more than two disks per disk enclosure. This means that in the event of an enclosure outage, data integrity is still protected by the RAID6 redundancy.


After performing a number of tests exercising various controller parameters, the following configuration parameters seem to be best suited to a Lustre filesystem where the requirement is high performance under a large I/O workload.

Summary of configuration parameters:

cacheBlockSize=32

segmentSize=128

readCacheEnabled=true

cacheReadPrefetch=true

modificationPriority=lowest

cacheMirroring=true

Some write performance boost can be gained by disabling the write cache mirroring feature. Making this change, however, needs to be carefully considered, because it reduces data integrity protection; it is therefore only recommended for filesystems holding non-sensitive scratch data.

Performance can also be improved by tweaking various Linux parameters:

Summary of Linux tuning options on the OSS server:

Linux I/O scheduler - parameter set to “deadline”

echo deadline > /sys/block/<blk_dev>/queue/scheduler

max_sectors_kb - parameter set to 2048

echo 2048 > /sys/block/<blk_dev>/queue/max_sectors_kb

read_ahead_kb - parameter set to 2048

echo 2048 > /sys/block/<blk_dev>/queue/read_ahead_kb

nr_requests - parameter set to 512

echo 512 > /sys/block/<blk_dev>/queue/nr_requests
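These settings do not persist across reboots. The loop below is a minimal sketch of how they might be applied at boot time, for example from /etc/rc.local; it assumes dm-multipath is used for the MD3200 LUNs (as implied by the /dev/mpath device names used later in this paper) and applies the tuning to every SCSI path device that is part of a multipath map. Adapt it to the local device layout.

# Apply the block device tuning to all SCSI paths belonging to multipath maps
for blk in $(multipath -ll | grep -oE 'sd[a-z]+' | sort -u); do
    echo deadline > /sys/block/$blk/queue/scheduler
    echo 2048 > /sys/block/$blk/queue/max_sectors_kb
    echo 2048 > /sys/block/$blk/queue/read_ahead_kb
    echo 512 > /sys/block/$blk/queue/nr_requests
done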

Software Components Overview

The Dell Lustre Storage System software stack installed on the metadata servers and object storage servers includes the following software components:

• Linux operating system

• Lustre filesystem

• Heartbeat/Pacemaker

• Dell configuration and monitoring tools.

The Dell Lustre Storage System uses the Scientific Linux 6.3 operating system. This is a rebuild of the Red Hat® Enterprise Linux® 6 Update 3 distribution (RHEL 6.3). This Linux distribution is an enterprise-level operating system that provides one of the industry's most comprehensive suites of Linux solutions. RHEL 6.3 delivers a high performance platform with many recent Linux features and high levels of reliability. The Dell Lustre Storage System uses Lustre version 2.1.5, which is the current stable version of the maintenance release line of Lustre 2.


Driver Installation

The first step is to install the software drivers and agents that allow the servers to communicate with and monitor the external RAID controllers. All necessary Linux host software is available on the Dell PowerVault MD32xx resource CD image. The latest resource CD image can be downloaded from the Dell Support website at support.dell.com.

The downloaded image can be mounted as a loop device. It is recommended to copy the contents of the MD32xx resource CD to a shared NFS location in order to make it available to all OSS servers:

[root]# mkdir /mnt/md32xxiso

[root]# mount -o loop md32xx_resource_cd.iso /mnt/md32xxiso

[root]# rsync -a /mnt/md32xxiso/ /nfs_share/md32xx

When installing the PowerVault Modular Disk Storage Manager (MDSM) and Host Utilities, the system administrator has the option of using an interactive installer, which requires the X Window System and Java™, or using text mode, which can be performed either interactively or non-interactively in a Linux terminal. The second method is a good choice for a cluster environment where software has to be installed on many nodes, as it allows the installation process to be automated for all available servers.

To install the MDSM in silent mode, use one of the provided install scripts located in the same directory as the MDSM installer. Depending on the role of the server in the Dell Lustre Storage System, use the appropriate properties file: use mgmt.properties on a server that is a dedicated management node, and use host.properties on a server which is either an OSS or MDS (and is directly connected to an MD32xx enclosure).

OSS/MDS servers - host.properties file contents:

INSTALLER_UI=silent

CHOSEN_INSTALL_FEATURE_LIST=SMruntime,SMutil,SMagent

AUTO_START_CHOICE=0

USER_REQUESTED_RESTART=NO

REQUESTED_FO_DRIVER=mpio

Management servers - mgmt.properties file contents:

INSTALLER_UI=silent

CHOSEN_INSTALL_FEATURE_LIST=SMclient,Java Ac

AUTO_START_CHOICE=0

USER_REQUESTED_RESTART=NO

Once the appropriate properties file is chosen, the software can be installed with the following commands:

[root]# cd /nfs_share/md32xx/linux/mdsm/

[root]# ./SMIA-LINUX-{version}.bin -i silent -f host.properties

It is recommended to reboot the servers once the drivers and agents are installed to ensure that they are properly loaded and working.
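After the reboot it is worth confirming that the RAID LUNs are visible through redundant paths before continuing. The commands below are an illustrative check, assuming dm-multipath is in use and the SMutil host feature set has been installed; SMdevices is provided by the MDSM host utilities.

# List the MD32xx virtual disks seen by this host
[root]# SMdevices

# Confirm that each virtual disk is reachable over multiple SAS paths
[root]# multipath -ll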


Disk Array Configuration

It is important to install the drivers on the servers first and only then connect the storage arrays; otherwise, the lack of proper device handling will cause the Linux OS to report I/O errors when accessing the storage array. Before attempting MD3200 array configuration, care should be taken to ensure that all storage arrays are correctly connected and that the necessary drivers and software tools described in the previous section have been successfully installed. The connections use SAS cables and are shown in Figure 9 and Figure 10.

Figure 9. MDS module SAS cabling.

Figure 10. OSS module fault tolerant SAS cabling.


Storage arrays can be configured using either the command line interface (CLI) or the Modular Disk Storage Manager graphical interface (MDSM GUI). The MDSM GUI can be accessed by executing the SMclient command on the management node. Below is a screenshot of the main MDSM GUI window, which provides summary information about the storage array. The GUI is an easy way to access and control all aspects of the storage array.

Figure 11. MDSM GUI.

On the OSS servers the SMagent service should be started prior to attempting any configuration. The SMagent service can be configured in the Linux OS to start automatically at boot time:

[root]# chkconfig SMagent on

[root]# service SMagent start

To start the configuration process, run the MDSM application installed on the management node:

/opt/dell/mdstoragemanager/client/SMclient

Perform the initial setup tasks as follows (using in-band management): in the MDSM main window, click Tools at the top left corner of the application window, then click Automatic Discovery… and allow MDSM to scan for MD3200 storage arrays.

Once the scan is complete, all connected MD3200 storage arrays should be visible. It is recommended to start by upgrading the storage array components' firmware. Detailed instructions can be found in the User's Guide on the Dell PowerVault MD3200 documentation site: http://support.dell.com/support/edocs/systems/md3200/

The next step of the configuration process is enabling OSS Linux server host access to the MD3200 storage array. Host access can be configured either automatically or manually. In the automatic method, all hosts running the SMagent service will be automatically detected by MDSM.

Follow these steps to configure host access automatically. In the MDSM main window:

1. Right-click on the storage array name and choose the Manage Storage Array option; a new window will open.

2. In the new window click Storage Array -> Configuration -> Automatic…

3. Follow the instructions to complete the task.

Configure storage array name:

[root]# SMcli oss01 -c "set storageArray userLabel=\"dell01\";"

Example configuration of storage array host access and host group:

SMcli oss01 -c "create hostGroup userLabel=\"dell01\";"

SMcli oss01 -c "create host userLabel=\"oss01\" hostType=1 hostGroup=\"dell01\";"

SMcli oss01 -c "create hostPort host=\"oss01\" userLabel=\"oss010\" identifier=\"{SASportAddress}\" interfaceType=SAS;"

The most efficient way to configure the array is via a configuration script:

[root]# SMcli oss01 -f storageArray.cfg

A more detailed script for configuring a Dell MD3200 HA-OSS module can be found in the previous white paper from this series.

Lustre 2.1.x Installation and Configuration

The easiest way to install Lustre is to obtain the Lustre RPM packages from the Lustre maintainer website, currently: http://downloads.whamcloud.com/public/lustre/latest-maintenance-release/. In order to determine which Lustre packages should be downloaded, please check the Lustre support matrix: http://wiki.whamcloud.com/display/PUB/Lustre+Support+Matrix

The Lustre servers require the following RPMs to be installed:

• kernel-lustre: Lustre patched kernel

• lustre-ldiskfs: Lustre-patched backing filesystem kernel module

• lustre-modules: Lustre kernel modules

• lustre: Lustre utilities package

• e2fsprogs: utilities package for the ext4 backing filesystem

• lustre-test: various test utilities (optional)


The Lustre client requires these RPMS to be installed:

• lustre-modules: Lustre kernel modules

• lustre: Lustre utilities package

• lustre-test: various test utilities (optional)

Additionally, the following packages may be required from your Linux distribution or InfiniBand vendor:

• kernel-ib: OFED

Lustre installation is completed by installing the above RPMs and rebooting the server.
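A minimal, illustrative install sequence on a server is sketched below; the exact RPM file names depend on the kernel and Lustre versions downloaded from the site above, so substitute the versions obtained for your distribution.

# Install the patched kernel and the Lustre server packages, then reboot into the new kernel
[root]# rpm -ivh kernel-*lustre*.rpm lustre-ldiskfs-*.rpm lustre-modules-*.rpm \
        lustre-2.1.*.rpm e2fsprogs-*.rpm
[root]# reboot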

Lustre Network Configuration

In order for the Lustre network to operate optimally, the system administrator must verify that the cluster network (for example, InfiniBand) has been set up correctly and that clients and servers can communicate with each other using the standard network protocols for the particular network type used. An InfiniBand network can be tested using the tools provided in the perftest package; the ping utility can also be used to check that the TCP/IP configuration of the IB interfaces functions properly.

In a cluster with a Lustre filesystem, servers and clients communicate with each other using the Lustre networking API (LNET). LNET supports a variety of network transports through Lustre Network Drivers (LNDs). For example, ksocklnd enables support for Ethernet and ko2iblnd enables support for InfiniBand. End points of the Lustre network are defined by Lustre Network Identifiers (NIDs). Every node that uses the Lustre filesystem must have a NID. Typically, a NID looks like address@network, where address is the name or IP address of the host and network is the Lustre network, for example tcp0 or o2ib.

When configuring a Lustre InfiniBand network in RHEL, the system administrator needs to add the following lines to the file /etc/modprobe.conf:

# lnet configuration for Infiniband

options lnet networks=o2ib(ib0)

This will configure the appropriate NID and load the correct LND for the InfiniBand interconnect. More information about Lustre networking can be found at wiki.lustre.org.
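Once the module options are in place, the LNET configuration can be verified with lctl before any Lustre targets are mounted. The server NID below is an assumption matching the naming used in the configuration examples that follow.

# Load LNET, bring the network up and display the local NID(s)
[root]# modprobe lnet
[root]# lctl network up
[root]# lctl list_nids

# Check that a remote server NID is reachable over the o2ib network
[root]# lctl ping mds01_ib@o2ib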


Lustre Configuration for HA-MDS Module

Before Lustre can be used, the system administrator must format the Lustre targets with the ldiskfs filesystem. During the format procedure some Lustre configuration options can be set; the Lustre configuration can be changed later, if required, using the non-destructive tunefs.lustre command.

In order to configure and start the Lustre management server, run the following commands on one of the failover pair, mds01 and mds02:

[root]# mkfs.lustre --mgs --fsname=testfs \
        --failnode=mds02_ib@o2ib \
        /dev/mpath/mgs

# Start Lustre management service
[root]# mount -t lustre /dev/mpath/mgs /lustre/mgs

Similarly, to configure and start the Lustre metadata server (note that typically the MGS and MDS run on the same host):

[root]# mkfs.lustre --reformat --mdt --fsname=testfs \
        --failnode=mds02_ib@o2ib \
        --mgsnode=mds01_ib@o2ib \
        --mgsnode=mds02_ib@o2ib \
        /dev/mpath/testfs_mdt00

# Start metadata service
[root]# mount -t lustre /dev/mpath/testfs_mdt00 /lustre/testfs_mdt00

Lustre Configuration for HA-OSS Module

To configure Lustre Object Storage Servers (OSS) run the following commands on one server of the OSS failover pair:

# on oss01 server

[root]# mkfs.lustre --ost --fsname=testfs \
        --mkfsoptions='-O flex_bg -G 256 -E stride=32,stripe-width=256' \
        --failnode=oss02_ib@o2ib \
        --mgsnode=mds01_ib@o2ib \
        --mgsnode=mds02_ib@o2ib \
        /dev/mpath/testfs_ost00

# Repeat the above mkfs.lustre command for the remaining five OSTs, changing only the
# target device: /dev/mpath/testfs_ost01 through /dev/mpath/testfs_ost05


Start object storage services:

[root]# mount -t lustre /dev/mpath/testfs_ost00 /lustre/testfs_ost00

# Repeat the above mount command for the remaining OSTs; mount 6 OSTs per OSS

Start Lustre on the client node:

# Mount clients

[root]# mount -t lustre mds01_ib@o2ib:mds02_ib@o2ib:/testfs /testfs
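A quick way to confirm that the client can see the MDT and all of the OSTs is to check the filesystem layout and free space from the client:

# List the MDT and OSTs visible to this client, with capacity and usage
[root]# lfs df -h

# Show the Lustre devices and their connection state
[root]# lctl dl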

Dell Storage Brick with Lustre 2.1.x Performance Tests

The purpose of the performance testing is to provide accurate Lustre filesystem throughput measurements of the Dell Lustre Storage System as configured in this paper. The results can then be compared with the expected theoretical performance of the tested storage system. Performance tests were conducted using one HA-MDS module and up to three HA-OSS modules.

Figure 12 shows the system configuration used in the performance tests. The system consists of three Dell Lustre bricks configured for failover and a number of client servers. All servers are interconnected with a fast FDR InfiniBand network.

Dell Lustre Storage Brick Test Tools

Figure 12. Test system setup.


It is good practice to obtain baseline performance figures for all storage system elements involved in the tests. Figure 13 shows a high-level representation of the testing path that should be followed to properly profile a storage system. Lustre provides a handful of profiling tools which can be used to obtain baseline figures for the backend storage hardware. These tools allow testing the hardware platform from the very bottom of the benchmarking stack by measuring raw performance data with minimal software overhead. Moving up the stack, Lustre provides tools to measure the server's local filesystem, the network connecting client nodes and the clients themselves.

Figure 13. Lustre storage testing path: Lustre clients (IOzone, IOR); InfiniBand network (LNET selftest); OST/MDT (obdfilter-survey); RAID and Linux block device layer (sgpdd-survey, dd); disks.

sgpdd-survey

The Lustre IOKit, included in the lustre-test RPM package, provides a set of tools which help to validate the performance of the various hardware and software layers. One of these tools, sgpdd-survey, can test bare metal performance of the back-end storage system while bypassing as much of the kernel as possible. This tool can be used to characterise the performance of a storage device by simulating an OST serving files consisting of multiple stripes. The data collected by this survey can help to determine the performance of a Lustre OST exporting the device. sgpdd-survey is a shell script included in the Lustre IOKit. It uses the sgp_dd tool provided by the sg3_utils package, which is a scalable version of the dd command and adds the ability to do multithreaded I/O with various block sizes and to multiple disk regions. The script uses sgp_dd to carry out raw sequential disk I/O. It runs with variable numbers of sgp_dd threads to show how performance varies with different request queue depths. The script spawns variable numbers of sgp_dd instances, each reading or writing to a separate area of the disk to demonstrate performance variance within a number of concurrent stripe files. The purpose of running sgpdd-survey is to obtain a performance baseline and to set expectations for the Lustre filesystem.

crg (concurrent regions) – describes how many regions on the disk are written to or read by sgp_dd. crg simulates multiple Lustre clients accessing an OST. As crg grows, performance will drop due to more disk seeks (especially for reads).

thr (number of threads) – this parameter simulates Lustre OSS threads. More OSS threads can do more I/O, but if too many threads are in use, the hardware will not be able to process them and performance will drop.


obdfilter-survey

Once the block devices are formatted with a Lustre filesystem, the next step is to benchmark the OSTs and their underlying ldiskfs filesystem. This can be done using a tool provided by the Lustre IOKit called obdfilter-survey. This script profiles the overall throughput of the storage hardware by applying a range of workloads to the OSTs. The main purpose of running obdfilter-survey is to measure the maximum performance of a storage system and to find the saturation points which cause performance drops.

obj (Lustre objects) – describes how many Lustre objects are written or read. obj simulates multiple Lustre clients accessing the OST and reading/writing multiple objects. As the number of objects grows, performance will drop due to more disk seeks (especially for reads).

thr (number of threads) – this parameter simulates Lustre OSS threads. More OSS threads can do more I/O, but if too many threads are in use, the hardware will not be able to process them and performance will drop.

LNET Selftest

The LNET selftest (lst) utility helps to test the operation of a Lustre network between servers and clients. It allows verification that the Lustre network has been properly configured and that its performance meets expected values. LNET selftest runs as a kernel module and has to be loaded on all nodes that take part in the test. A utility called lst is used to configure and launch Lustre network tests.
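The sketch below shows the shape of a simple bulk write test between a group of clients and a group of servers. The NID ranges are hypothetical placeholders; substitute the o2ib NIDs reported by lctl list_nids on your nodes.

# Load the selftest module on every node taking part in the test
[root]# modprobe lnet_selftest

# On the console node: create a session and define the node groups
[root]# export LST_SESSION=$$
[root]# lst new_session lnet_bw_test
[root]# lst add_group clients 10.10.0.[1-4]@o2ib
[root]# lst add_group servers 10.10.1.[1-2]@o2ib

# Define and run a 1MB bulk write test from clients to servers
[root]# lst add_batch bulk_rw
[root]# lst add_test --batch bulk_rw --from clients --to servers \
        brw write check=simple size=1M
[root]# lst run bulk_rw
[root]# lst stat clients servers

# Tear down the session when finished
[root]# lst end_session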

IOR

IOR is an I/O benchmark providing a wide range of useful functions and features. With IOR, clients first write data and then, when the read phase is performed, each client reads data written by another client. This completely eliminates the client read-cache effect, avoiding the need to flush the client cache after the write. IOR uses MPI for multi-node communication and thread synchronisation, which helps to provide very accurate results from large multi-node tests. IOR can also be compiled with Lustre support, allowing some Lustre parameters, such as file striping, to be tuned.


Test Results

Command Line:

size="24576" crglo=1 crghi=16 thrlo=1 thrhi=64 \
scsidevs="/dev/sdk /dev/sdw /dev/sdaa /dev/sdo /dev/sdi /dev/sdu \
/dev/sdm /dev/sdy /dev/sdg /dev/sds /dev/sdac /dev/sdq" ./sgpdd-survey

Figure 14. sgpdd-survey single brick write test (bare metal performance).

Figure 15. sgpdd-survey single brick read test (bare metal performance).


Command Line:

nobjlo="2" nobjhi="16" thrlo="2" thrhi="64" size="24576" \
targets="testfs-OST0000 testfs-OST0001 testfs-OST0002 testfs-OST0003 \
testfs-OST0004 testfs-OST0005" ./obdfilter-survey

Figure 16. obdfilter-survey write test.

Figure 17. obdfilter-survey read test.


Command Line:

IOR -a POSIX -N 30 -b 8g -d 5 -t 1024k -o /testfs/test_file \
    -e -g -w -r -s 1 -i 2 -vv -F -C

Figure 18. IOR sequential write test.

Figure 19. IOR sequential read test.


Test Results - Conclusions

The raw performance test results shown in Figures 14 and 15 show good baseline performance for both reads and writes. The write performance is lower due to active cache mirroring. This feature has to be enabled if Dell Lustre bricks are used in a failover configuration, to ensure data integrity in the event of a RAID controller failure. All further tests are run with cache mirroring enabled.

The obdfilter tests presented in Figures 16 and 17 show that formatting the raw block devices with the Lustre filesystem slightly decreases performance due to the Lustre backend filesystem overhead. The performance loss is not high and is in the range of a few per cent.

The IOR large throughput test shown in Figures 18 and 19 was run on a range of Lustre clients from 4 to 32. Each client writes its own file, and the files are distributed equally across all bricks and all OSTs. For small node counts, read outperforms write throughput. However, as the node count increases, and with it the number of files written, I/O concurrency on the OSS grows and read performance drops. When comparing these results to the tests from the previous paper, we can see improvement in both writes and reads. This is most likely thanks to new features such as server-side caching and increased Lustre multithreading.

Summary

As HPC commodity clusters continue to grow in performance and more applications adopt parallel I/O, traditional filesystems such as NFS, NAS, SAN and DAS are failing to keep up with HPC I/O demand. The Lustre 2.x parallel filesystem is gaining ground within the departmental and workgroup HPC space since it has proved capable of meeting the high I/O workload within HPC environments, as well as presenting a range of additional operational benefits as described in the introduction. This paper presents a structured description of how to build a Dell PowerVault MD3200 Lustre 2.x storage brick. The storage brick as described shows good sequential I/O throughput, high redundancy and high availability with a good set of data safety features, all of which are important within the HPC environment. The storage brick shows very good Lustre filesystem performance. This solution has demonstrated scalable performance across the storage bricks, yielding over 4GB/s file I/O bandwidth for reads and over 3GB/s for writes, with high levels of availability and data security, when using three Dell Lustre Storage Bricks. Operational experience has demonstrated less than 0.5% unplanned downtime. We have found the Dell Lustre Storage Brick to be a good match for both the HPC performance and the operational requirements seen within a large university HPC data centre. The Cambridge HPC Service has constructed a two-silo solution, which provides very high throughput with built-in redundancy and a large capacity storage system that currently satisfies all our users' needs. Shortly, this system will be expanded to a four-silo solution providing even more throughput to drive University of Cambridge science and research. The solution described here has been used in the Cambridge HPC production environment for the last three years, serving over 500 users, and has demonstrated less than 0.5% unscheduled downtime in our 24x7 operational service.

This Whitepaper is for informational purposes only, and may contain typographical errors and technical inaccuracies. The content is provided as is, without express or implied warranties of any kind.

©2014 Dell Corporation Ltd. All rights reserved. Dell, the Dell Logo and Latitude are registered trademarks or trademarks of Dell Corporation Ltd. Lustre is a registered Trademark of Oracle in the United States and in other countries. Intel, Intel logo, Intel Xeon, Intel Dual Core are registered trademarks of Intel Corporation in the United States and in other countries. Other trademarks or trade names may be used in this document to refer to either the entities claiming the marks and names or their products. Dell disclaims any proprietary interest in the marks and names of others. Dell Corporation Limited. Dell Corporation Limited, Reg. No. 02081369, Dell House, The Boulevard, Cain Road, Bracknell, Berkshire RG12 1LF.
