
ORNL is managed by UT-Battelle for the US Department of Energy

Robinhood

Operational Preparation for Large-Scale Deployment


Overview of OLCF / NCCS

• National Center for Computational Sciences

– Focus on at-scale HPC challenges

– Support for projects like SNS, NCRC

• Oak Ridge Leadership Computing Facility

– Largest project of NCCS

– Home of Titan/Atlas. Future home of Summit/Alpine


Overview of Robinhood

• Policy Engine for POSIX file systems

• Extra hooks for Lustre

• Allows for near-real-time file system information

Making Robinhood fit OLCF Production Standards


Reproducible Builds

• Using GitLab Runners

– Binary that can execute builds as part of a CI pipeline (registration sketched below)

– Settings -> General -> Enable pipelines

– Settings -> Pipelines
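
A minimal registration sketch for the runner binary, assuming the gitlab-ci-multi-runner CLI cited in the references; the URL, token, tag, and description values are placeholders, not OLCF's actual settings.

# Hypothetical runner registration with the shell executor;
# URL, token, and tag values are placeholders.
gitlab-ci-multi-runner register \
    --non-interactive \
    --url https://gitlab.example.com/ \
    --registration-token "REDACTED" \
    --executor shell \
    --tag-list storage-util1 \
    --description "Robinhood/Lustre build runner"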


Building Lustre

• Current setup:

– GitLab runner runs as bot build user on storage-util1 node

– Build script checks out copy of Lustre repo

– Uses current build system to create Lustre RPMs

– Stores them in a staging area for manual signing/approval (build script sketched after this slide)

– Only for Robinhood testing currently

• Future setup:

– “Bring your own build host”

– Using runner "tags"
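
A sketch of what such a pipeline build script might look like; the repository URL, branch, and staging path are placeholders, and the steps simply mirror the bullets above.

#!/bin/bash
# Hypothetical pipeline build script: check out Lustre, build RPMs with
# Lustre's own build system, and stage them for manual signing/approval.
# Repository URL, branch, and staging path are placeholders.
set -euo pipefail

BRANCH="${1:-master}"
STAGING="/path/to/rpm-staging"

git clone --branch "$BRANCH" https://git.example.com/lustre-release.git
cd lustre-release
sh autogen.sh
./configure
make rpms

cp -v ./*.rpm "$STAGING/"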


Building Robinhood

• Similar setup to Lustre

– Lustre RPMs are installed manually

– Kick off pipeline build

– Robinhood RPMs are built against installed Lustre client

– RPMs are placed in staging area for testing/signing/deployment/installation


Puppet Setup

• NCCS uses Puppet’s role and profile design workflow

• https://docs.puppet.com/pe/2017.2/r_n_p_full_example.html

• No current module on Puppet Forge

• WIP robinhood module


Puppet Robinhood Module

Basic 1-to-1 setup between Robinhood config options and Puppet parameters
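
A sketch of what that 1-to-1 mapping might look like in a profile wrapping the WIP robinhood module; the class name, parameter names, and default values are illustrative assumptions, not the module's actual interface.

# Hypothetical profile wrapping the WIP robinhood module; class and
# parameter names are illustrative, mirroring Robinhood config options 1-to-1.
class profile::storage::robinhood (
  Integer $nb_threads             = 24,
  Integer $max_pending_operations = 200000,
  Integer $queue_max_size         = 10000,
) {
  class { 'robinhood':
    nb_threads             => $nb_threads,
    max_pending_operations => $max_pending_operations,
    queue_max_size         => $queue_max_size,
  }
}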

Testing Environment


Testing Setup

• Tested against older hardware

• Used the Atlas TDS file system

• Partition of a NetApp E5500 with 48x 900GB 10k SAS drives, connected over 6Gb/s SAS


Testing Hardware

• Storage-util1 Node:

– Dell PowerEdge R620

– 2x Intel® Xeon® CPU E5-2640 @ 2.50GHz

– 16x 16GB DIMM DDR3 1333 MHz

– Hyperthreading Disabled

– Diskless provisioning


MariaDB Tuning

• Mostly the same settings as recommended by Robinhood's starting page

• innodb_additional_mem_pool_size setting is not used in 10.3

• For stock RHEL installs, the log_slow_queries and associated tunings (long_query_time and log-queries-not-using-indexes) can show if the database is a bottleneck
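
A sketch of those slow-query settings as a my.cnf fragment; the file name and threshold values are illustrative, not our production tuning.

# Hypothetical /etc/my.cnf.d/robinhood.cnf fragment; values are illustrative.
[mysqld]
slow_query_log                = 1    # successor to log_slow_queries
long_query_time               = 2
log_queries_not_using_indexes = 1
# innodb_additional_mem_pool_size intentionally omitted (no longer used)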


Robinhood Tuning

• Set nb_threads to twice the number of physical cores

• Changed max_pending_operations from 10000 to 200000

• Set nb_threads_scan to twice the number of physical cores. This may be too many

• Changed queue_max_size to 10000 (from 1000) and queue_max_age from 5s to 10s

• Trade-off between consistency/recovery-time and speed
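
The tunings above, expressed as a Robinhood configuration fragment; block placement is assumed from Robinhood's documentation, and the thread counts reflect the 2x-physical-cores rule on the test node.

# Illustrative Robinhood config fragment for the tunings above;
# block placement assumed from Robinhood's documentation.
EntryProcessor {
    nb_threads             = 24;      # ~2x physical cores
    max_pending_operations = 200000;  # raised from 10000
}
ChangeLog {
    queue_max_size = 10000;           # raised from 1000
    queue_max_age  = 10s;             # raised from 5s
}
FS_Scan {
    nb_threads_scan = 24;             # ~2x physical cores; may be too many
}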


Disk Utilization


Bottlenecks?

• The file system backend limited metadata performance

• Under certain metadata-intensive workloads, it becomes the bottleneck:

– Not really an easy solution

– Mentioned in https://jira.hpdd.intel.com/browse/LU-8047


Issues with RHEL7

• Stock MariaDB

• systemd ulimit settings


Testing Summary

• Current testing hardware can only process so quickly – we appear to have hit this limit

• Moved the bottleneck towards Lustre

• GET_FID is typically the highest-latency command

• Bursts of metadata traffic cause spikes of commands in the "Wait" state; in our testing these shift between GET_INFO_DB, DB_APPLY, and CHGLOG_CLR


Daemon vs. One-shot

• Split use-case

• Daemon:

– File system scanning

– Changelog consumption

– RBH_OPT="--readlog --scan"

• One-shot (“manual” process / cronjob):

– Policy application (e.g., purging)
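
A sketch of that split; the sysconfig path, policy name, and config file location are assumptions in line with a stock Robinhood RPM install, not our exact deployment.

# Daemon side -- scan + changelog reader, set in the service's sysconfig file
# (path assumed from a stock RPM install):
RBH_OPT="--readlog --scan"

# One-shot side -- cron entry applying a purge policy and exiting;
# policy name and config path are placeholders:
0 2 * * * root robinhood -f /etc/robinhood.d/lustre.conf --run=cleanup --once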

Comparison to Existing Tools


PCircle

• Suite of file system tools for parallel data copying, checksumming, and profiling

• Currently used for ~weekly file system profiling

• Includes directory count, sym/hard link counts, file count, average file size, max files within a directory, among other statistics

• Reports file size histograms, and top files (by size)

• https://github.com/olcf/pcircle


fprof

• Able to reproduce fprof-like reporting by setting up fileclass buckets

• Built-in reports like ‘top x’ files/directories provide similar functionality
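
A sketch of fprof-style size buckets expressed as fileclasses; the class names and boundaries are arbitrary, with syntax assumed from Robinhood's fileclass documentation.

# Illustrative size-bucket fileclasses for fprof-style histograms;
# names and boundaries are arbitrary.
FileClass size_under_4k {
    definition { size <= 4KB }
}
FileClass size_4k_to_1m {
    definition { size > 4KB and size <= 1MB }
}
FileClass size_over_1m {
    definition { size > 1MB }
}

The rbh-report --class-info output on the next slide then breaks file counts and volume down by these classes.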


Output: rbh-report --class-info


LustreDU

• Provides directory-level usage for users/projects

• Populated by:

– Parsing Lester output

– Contacting inode query daemons running on OSS nodes

– Populating/updating MySQL database

• Only updated daily

• Issues running as privileged user


rbh-du output

• Provides a quick du option

• Potentially provide a smart wrapper that chooses between du and rbh-du based on file path (sketched below)
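
A sketch of that wrapper idea; the Lustre mount prefix is a placeholder, and real argument handling would need more care.

#!/bin/bash
# Hypothetical "smart du": use rbh-du for paths on the Robinhood-indexed
# file system, plain du everywhere else. Mount prefix is a placeholder.
target=$(readlink -f "${1:-.}")
case "$target" in
    /lustre/atlas/*) exec rbh-du "$@" ;;
    *)               exec du "$@" ;;
esac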


Purging Policies

• Non-Robinhood workflow:

– User submits request

– RUC approval

– UAO team member enters exemption into RATS

– Purge config is generated using those exemptions

• Robinhood pieces still WIP

• Example:


Purging – Integration with Robinhood

• Want to keep same workflow for users and other groups

• Current thoughts:

– Pull list of purge exemptions

– Generate a purge configuration file using multiple "tree" statements in a cleanup rule (see the sketch after this list)

– Run Robinhood with --once using that policy

– Log and remove configuration
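
A sketch of what such a generated configuration might look like, with each exemption becoming a "tree" statement; the paths, age threshold, and rule name are placeholders, and the syntax is assumed from Robinhood v3 policy rules.

# Illustrative generated purge policy; exemption paths, threshold, and
# rule name are placeholders.
cleanup_rules {
    ignore { tree == "/lustre/atlas/proj-shared/exempt_projectA" }
    ignore { tree == "/lustre/atlas/scratch/exempt_userB" }

    rule purge_old {
        condition { last_access > 90d }
    }
}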

Future Work


Hardware upgrades

• Transition to using similar setup to current MDS nodes

• Single socket, faster clock speed

• SSD / NVMe storage target


Clustering

• Move processes to multiple nodes

– Multiple physical nodes vs namespaced mounts / VMs

• Set up a MariaDB/MySQL cluster

– Millions of SQL statements per second

– https://www.mysql.com/why-mysql/benchmarks/mysql-cluster/


CEA's Lustre Changelogs Aggregate & Publish (lcap) integration

• Ability for multiple change-log readers

• Redirect a copy of the changelog to our Kafka instances while still using a single reader


Lustre Jobstats


Jobstats Integration


Jobstats Integration - continued

• Database schema changes

– Add new columns to the database: creation_job, last_access_job, last_mod_job, and last_mdchange_job (SQL sketch below)

– Parse job_id (partial support exists currently) to populate these fields
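
A sketch of the schema change as SQL; the ENTRIES table name and VARCHAR type are assumptions about Robinhood's schema, while the column names come from the bullet above.

-- Illustrative schema change; table name and column types are assumptions.
ALTER TABLE ENTRIES
    ADD COLUMN creation_job      VARCHAR(64),
    ADD COLUMN last_access_job   VARCHAR(64),
    ADD COLUMN last_mod_job      VARCHAR(64),
    ADD COLUMN last_mdchange_job VARCHAR(64);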


Jobstats Integration – potential wins

• File system usage heuristics

• Security triggers / auditing

• File-level history


References

• https://github.com/cea-hpc/robinhood/wiki/Documenation

• https://dev.mysql.com/doc

• https://mariadb.com

• https://github.com/fwang2/ioutils

• https://cug.org/proceedings/cug2014_proceedings/includes/files/pap157.pdf

• https://gitlab.com/gitlab-org/gitlab-ci-multi-runner

• http://wiki.lustre.org/images/0/02/LUG-2011-Aurelien_Degremont-Robinhood_Quick_Tour.pdf

• https://github.com/cea-hpc/lcap

• http://syst.univ-brest.fr/per3s/wp-content/uploads/2017/02/robinhood-Per3S.pdf

• https://build.hpdd.intel.com/job/lustre-manual/lastSuccessfulBuild/artifact/lustre_manual.xhtml
