
ORNL is managed by UT-Battelle for the US Department of Energy

Robinhood

Operational Preparation for Large-Scale Deployment


Overview of OLCF / NCCS

• National Center for Computational Sciences

– Focus on at-scale HPC challenges

– Support for projects like SNS, NCRC

• Oak Ridge Leadership Computing Facility

– Largest project of NCCS

– Home of Titan/Atlas. Future home of Summit/Alpine


Overview of Robinhood

• Policy Engine for POSIX file systems

• Extra hooks for Lustre

• Allows for near-real-time file system information

Making Robinhood fit OLCF Production Standards


Reproducible Builds

• Using GitLab Runners

– Binary that can execute builds as part of a CI pipeline (registration sketched below)

– Settings -> General -> Enable pipelines

– Settings -> Pipelines
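
A minimal registration sketch for the runner binary, assuming the gitlab-ci-multi-runner CLI cited in the references; the URL, token, tag, and description values are placeholders, not OLCF's actual settings.

# Hypothetical runner registration with the shell executor;
# URL, token, and tag values are placeholders.
gitlab-ci-multi-runner register \
    --non-interactive \
    --url https://gitlab.example.com/ \
    --registration-token "REDACTED" \
    --executor shell \
    --tag-list storage-util1 \
    --description "Robinhood/Lustre build runner"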


Building Lustre

• Current setup:

– GitLab runner runs as bot build user on storage-util1 node

– Build script checks out copy of Lustre repo

– Uses current build system to create Lustre RPMs

– Stores them in a staging area for manual signing/approval (build script sketched after this slide)

– Only for Robinhood testing currently

• Future setup:

– “Bring your own build host”

– Using runner "tags"
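
A sketch of what such a pipeline build script might look like; the repository URL, branch, and staging path are placeholders, and the steps simply mirror the bullets above.

#!/bin/bash
# Hypothetical pipeline build script: check out Lustre, build RPMs with
# Lustre's own build system, and stage them for manual signing/approval.
# Repository URL, branch, and staging path are placeholders.
set -euo pipefail

BRANCH="${1:-master}"
STAGING="/path/to/rpm-staging"

git clone --branch "$BRANCH" https://git.example.com/lustre-release.git
cd lustre-release
sh autogen.sh
./configure
make rpms

cp -v ./*.rpm "$STAGING/"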


Building Robinhood

• Similar setup to Lustre

– Lustre RPMs are installed manually

– Kick off pipeline build

– Robinhood RPMs are built against installed Lustre client

– RPMs are placed in staging area for testing/signing/deployment/installation


Puppet Setup

• NCCS uses Puppet’s role and profile design workflow

• https://docs.puppet.com/pe/2017.2/r_n_p_full_example.html

• No current module on Puppet Forge

• WIP robinhood module


Puppet Robinhood Module

Basic 1-to-1 setup between Robinhood config options and Puppet parameters
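
A sketch of what that 1-to-1 mapping might look like in a profile wrapping the WIP robinhood module; the class name, parameter names, and default values are illustrative assumptions, not the module's actual interface.

# Hypothetical profile wrapping the WIP robinhood module; class and
# parameter names are illustrative, mirroring Robinhood config options 1-to-1.
class profile::storage::robinhood (
  Integer $nb_threads             = 24,
  Integer $max_pending_operations = 200000,
  Integer $queue_max_size         = 10000,
) {
  class { 'robinhood':
    nb_threads             => $nb_threads,
    max_pending_operations => $max_pending_operations,
    queue_max_size         => $queue_max_size,
  }
}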

Testing Environment


Testing Setup

• Tested against older hardware

• Used the Atlas TDS file system

• Partition of a NetApp E5500 with 48x 900GB 10k SAS drives, connected over 6Gb/s SAS


Testing Hardware

• Storage-util1 Node:

– Dell PowerEdge R620

– 2x Intel® Xeon® CPU E5-2640 @ 2.50GHz

– 16x 16GB DIMM DDR3 1333 MHz

– Hyperthreading Disabled

– Diskless provisioning


MariaDB Tuning

• Mostly the same settings as recommended by Robinhood's starting page

• innodb_additional_mem_pool_size setting is not used in 10.3

• For stock RHEL installs, the log_slow_queries and associated tunings (long_query_time and log-queries-not-using-indexes) can show if the database is a bottleneck
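
A sketch of those slow-query settings as a my.cnf fragment; the file name and threshold values are illustrative, not our production tuning.

# Hypothetical /etc/my.cnf.d/robinhood.cnf fragment; values are illustrative.
[mysqld]
slow_query_log                = 1    # successor to log_slow_queries
long_query_time               = 2
log_queries_not_using_indexes = 1
# innodb_additional_mem_pool_size intentionally omitted (no longer used)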


Robinhood Tuning

• Set nb_threads to twice the number of physical cores

• Changed max_pending_operations from 10000 to 200000

• Set nb_threads_scan to twice the number of physical cores. This may be too many

• Changed queue_max_size to 10000 (from 1000) and queue_max_age from 5s to 10s

• Trade-off between consistency/recovery-time and speed
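
The tunings above, expressed as a Robinhood configuration fragment; block placement is assumed from Robinhood's documentation, and the thread counts reflect the 2x-physical-cores rule on the test node.

# Illustrative Robinhood config fragment for the tunings above;
# block placement assumed from Robinhood's documentation.
EntryProcessor {
    nb_threads             = 24;      # ~2x physical cores
    max_pending_operations = 200000;  # raised from 10000
}
ChangeLog {
    queue_max_size = 10000;           # raised from 1000
    queue_max_age  = 10s;             # raised from 5s
}
FS_Scan {
    nb_threads_scan = 24;             # ~2x physical cores; may be too many
}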


Disk Utilization


Bottlenecks?

• The file system backend limited metadata performance

• Under certain metadata-intensive workloads, it becomes the bottleneck:

– Not really an easy solution

– Mentioned in https://jira.hpdd.intel.com/browse/LU-8047


Issues with RHEL7

• Stock MariaDB

• systemd ulimit settings


Testing Summary

• Current testing hardware can only process so quickly – we appear to have hit this limit

• Moved the bottleneck towards Lustre

• GET_FID is typically the highest-latency command

• Bursts of metadata traffic cause spikes of commands in the "Wait" state; in our testing these shift between GET_INFO_DB, DB_APPLY, and CHGLOG_CLR


Daemon vs. One-shot

• Split use-case

• Daemon:

– File system scanning

– Changelog consumption

– RBH_OPT="--readlog --scan"

• One-shot (“manual” process / cronjob):

– Policy application (e.g., purging)
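
A sketch of that split; the sysconfig path, policy name, and config file location are assumptions in line with a stock Robinhood RPM install, not our exact deployment.

# Daemon side -- scan + changelog reader, set in the service's sysconfig file
# (path assumed from a stock RPM install):
RBH_OPT="--readlog --scan"

# One-shot side -- cron entry applying a purge policy and exiting;
# policy name and config path are placeholders:
0 2 * * * root robinhood -f /etc/robinhood.d/lustre.conf --run=cleanup --once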

Comparison to Existing Tools


PCircle

• Suite of file system tools for parallel data copying, checksumming, and profiling

• Currently used for ~weekly file system profiling

• Includes directory count, sym/hard link counts, file count, average file size, max files within a directory, among other statistics

• Reports file size histograms, and top files (by size)

• https://github.com/olcf/pcircle


fprof

• Able to reproduce fprof-like reporting by setting up fileclass buckets

• Built-in reports like ‘top x’ files/directories provide similar functionality
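
A sketch of fprof-style size buckets expressed as fileclasses; the class names and boundaries are arbitrary, with syntax assumed from Robinhood's fileclass documentation.

# Illustrative size-bucket fileclasses for fprof-style histograms;
# names and boundaries are arbitrary.
FileClass size_under_4k {
    definition { size <= 4KB }
}
FileClass size_4k_to_1m {
    definition { size > 4KB and size <= 1MB }
}
FileClass size_over_1m {
    definition { size > 1MB }
}

The rbh-report --class-info output on the next slide then breaks file counts and volume down by these classes.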


Output: rbh-report --class-info


LustreDU

• Provides directory-level usage for users/projects

• Populated by:

– Parsing Lester output

– Contacting inode query daemons running on OSS nodes

– Populating/updating MySQL database

• Only updated daily

• Issues running as privileged user


rbh-du output

• Provides a quick du option

• Potentially provide a smart wrapper that chooses between du and rbh-du based on file path (sketched below)
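
A sketch of that wrapper idea; the Lustre mount prefix is a placeholder, and real argument handling would need more care.

#!/bin/bash
# Hypothetical "smart du": use rbh-du for paths on the Robinhood-indexed
# file system, plain du everywhere else. Mount prefix is a placeholder.
target=$(readlink -f "${1:-.}")
case "$target" in
    /lustre/atlas/*) exec rbh-du "$@" ;;
    *)               exec du "$@" ;;
esac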


Purging Policies

• Non-Robinhood workflow:

– User submits request

– RUC approval

– UAO team member enters exemption into RATS

– Purge config is generated using those exemptions

• Robinhood pieces still WIP

• Example:


Purging – Integration with Robinhood

• Want to keep same workflow for users and other groups

• Current thoughts:

– Pull list of purge exemptions

– Generate a purge configuration file using multiple "tree" statements in a cleanup rule (see the sketch after this list)

– Run Robinhood with --once using that policy

– Log and remove configuration
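
A sketch of what such a generated configuration might look like, with each exemption becoming a "tree" statement; the paths, age threshold, and rule name are placeholders, and the syntax is assumed from Robinhood v3 policy rules.

# Illustrative generated purge policy; exemption paths, threshold, and
# rule name are placeholders.
cleanup_rules {
    ignore { tree == "/lustre/atlas/proj-shared/exempt_projectA" }
    ignore { tree == "/lustre/atlas/scratch/exempt_userB" }

    rule purge_old {
        condition { last_access > 90d }
    }
}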

Future Work


Hardware upgrades

• Transition to using similar setup to current MDS nodes

• Single socket, faster clock speed

• SSD / NVMe storage target


Clustering

• Move processes to multiple nodes

– Multiple physical nodes vs namespaced mounts / VMs

• Set up a MariaDB/MySQL cluster

– Millions of SQL statements per second

– https://www.mysql.com/why-mysql/benchmarks/mysql-cluster/


CEA's Lustre Changelogs Aggregate & Publish (lcap) integration

• Ability for multiple change-log readers

• Redirect a copy of the changelog to our Kafka instances while still using a single reader


Lustre Jobstats


Jobstats Integration


Jobstats Integration - continued

• Database schema changes

– Add new columns to the database: creation_job, last_access_job, last_mod_job, and last_mdchange_job (SQL sketch below)

– Parse job_id (partial support exists currently) to populate these fields
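
A sketch of the schema change as SQL; the ENTRIES table name and VARCHAR type are assumptions about Robinhood's schema, while the column names come from the bullet above.

-- Illustrative schema change; table name and column types are assumptions.
ALTER TABLE ENTRIES
    ADD COLUMN creation_job      VARCHAR(64),
    ADD COLUMN last_access_job   VARCHAR(64),
    ADD COLUMN last_mod_job      VARCHAR(64),
    ADD COLUMN last_mdchange_job VARCHAR(64);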


Jobstats Integration – potential wins

• File system usage heuristics

• Security triggers / auditing

• File-level history


References

• https://github.com/cea-hpc/robinhood/wiki/Documenation

• https://dev.mysql.com/doc

• https://mariadb.com

• https://github.com/fwang2/ioutils

• https://cug.org/proceedings/cug2014_proceedings/includes/files/pap157.pdf

• https://gitlab.com/gitlab-org/gitlab-ci-multi-runner

• http://wiki.lustre.org/images/0/02/LUG-2011-Aurelien_Degremont-Robinhood_Quick_Tour.pdf

• https://github.com/cea-hpc/lcap

• http://syst.univ-brest.fr/per3s/wp-content/uploads/2017/02/robinhood-Per3S.pdf

• https://build.hpdd.intel.com/job/lustre-manual/lastSuccessfulBuild/artifact/lustre_manual.xhtml
