
KyRIC Cluster User Guide - University of Kentucky



Last update: December 14, 2020

Status Updates and Notices

Pre-production: 03/01/2021-03/31/2021

Production: 04/01/2021

 

Introduction

Scientific discovery today is enabled by computational and data-intensive research that exploits enormous amounts of available data. KyRIC will advance several exciting research programs across many disciplines, such as Bioinformatics and Systems Biology Algorithms, Large Graph and Evolutionary Network Analysis, Image Processing, and Computational Modeling and Simulation.

The KyRIC system is a hybrid architecture. It has large memory nodes that are increasingly needed by a wide range of XSEDE researchers, particularly researchers working with big data.

Innovative Components: Large memory nodes with local SSD drives and NFS-mounted scratch.

Award Number: NSF MRI infrastructure award (ACI-1626364)

XSEDE hostname: kxc.ccs.uky.edu

 Figure 1. LCC-KyRIC System

Allocation Information

As an XSEDE computing resource, KyRIC is accessible to XSEDE users who are given time on the system. To obtain an account, users may submit a proposal through the XSEDE Allocation Request System (XRAS) or request a Trial Account.

Interested parties may contact XSEDE User Support for help with a KyRIC proposal.

System Architecture

The KyRIC hybrid system consists of two subsystems; the XSEDE-allocated subsystem is a cluster of 5 nodes, each with four 10-core processors, 3 TB of RAM, and a 5 TB SSD array. Each node provides 40 cores (Broadwell-class Intel(R) Xeon(R) CPU E7-4820 v4 @ 2.00GHz, 4 sockets, 10 cores/socket). These 5 dedicated XSEDE nodes have exclusive access to approximately 300 TB of network-attached disk storage. All compute nodes are interconnected through a 100 Gigabit Ethernet (100GbE) backbone, and the cluster login and data transfer nodes are connected through a 100Gb uplink to Internet2 for external connections. Because of the 100GbE network, this cluster is intended for single-node jobs only and is not recommended for multi-node jobs, such as those using MPI.


Compute Nodes

These nodes are where jobs are actually executed after being submitted via the user-facing login nodes.

Model: PowerEdge R930; Intel(R) Xeon(R) CPU E7-4820 v4 @ 2.00GHz

Number of nodes: 5

Total cores per node: 40 cores (4 sockets; 10 cores/socket)

Threads per core: 2

Threads per node: 80

Clock rate: 2.00 GHz

RAM: 3TB

Local storage: 5.1TB (SSD)

Extended storage: 300 TB (NFS-mounted)

 Login Nodes

The login node is what users will directly access in order to submit jobs that will get forwarded to and executed in the compute nodes.

Model: Virtual Machines hosted in bare metal server (PowerEdge R930; Intel(R) Xeon(R) CPU E7-4820 v4 @ 2.00GHz)

Number of nodes: 2

Total cores per node: 4

Threads per core: 2

Threads per node: 8

Clock rate: 2.00 GHz

RAM: 16GB

Extended storage: 300 TB (NFS-mounted)

 Data Transfer Node

This node facilitates the transfer of data in and out of the KyRIC system. Users will log in to this node with the same credentials as for the login nodes.

Model: Virtual Machines hosted in bare metal server (PowerEdge R930; Intel(R) Xeon(R) CPU E7-4820 v4 @ 2.00GHz)

Number of nodes: 1

Total cores per node: 8

Threads per core: 2

Threads per node: 16

Clock rate: 2.00 GHz

RAM: 32GB

Extended storage: 300 TB (NFS-mounted)

Network

All nodes are interconnected through a 100 Gigabit Ethernet (100GbE) backbone, and the cluster login and data transfer nodes are connected through a 100Gb uplink to Internet2 for external connections.

File Systems

File System   Quota     Key Features


$HOME         10 GB     No file deletion policy applied on this partition
$PROJECT      500 GB    No file deletion policy applied on this partition
$SCRATCH      10 TB     30-day file deletion policy

Accessing the System

The login node for the cluster is kxc.ccs.uky.edu, which supports the GSISSH protocol on port 2222 and the standard SSH protocol on port 22. All users must first authenticate to the system using the XSEDE Single Sign-On Hub, as described in the next section. Users may then (optionally) generate and install their own SSH keys if they wish to access the system outside of XSEDE. Local password authentication is not supported.
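For example, once your own SSH keys are installed, a standard SSH connection to the login node would look like the following (the username shown is a placeholder for your account name):

ssh -p 22 your_username@kxc.ccs.uky.edu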

XSEDE Single Sign-On Hub

XSEDE users can access KyRIC via the XSEDE Single Sign-On Hub.

Command to connect to the system from the Single Sign-On hub:

gsissh kyric

When reporting a login problem to the help desk, please execute the gsissh command with the “-vvv” option and include the verbose output in your problem description.
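For example, the verbose form of the login command from the Single Sign-On Hub is:

gsissh -vvv kyric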

Notes and hints

When you log in to kxc.ccs.uky.edu, you will be assigned one of the two login nodes: kxc-login[1-2].ccs.uky.edu. These nodes are identical in both architecture and software environment. Users should normally log in through kxc.ccs.uky.edu but may specify one of the two nodes directly if they see poor performance.

Do NOT use the login nodes for computationally intensive processes. These nodes are meant for compilation, file editing, simple data analysis, and other tasks that use minimal compute resources. All computationally demanding jobs should be submitted and run through the batch queuing system.

Computing Environment

 Environment Modules

 The Environment Modules package provides for dynamic modification of your shell environment. “Module” commands set, change, or delete environment variables, typically in support of an application. They also let the user choose between different versions of the same software or different combinations of related codes. Several modules that determine the default KyRIC environment are loaded at login time.
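For example, the following standard module commands can be used to inspect and modify your environment (the module name shown is a placeholder; run "module avail" to see what is actually installed on KyRIC):

module avail                # list the software modules available on the system
module load <module_name>   # add a module to your environment
module list                 # show the modules currently loaded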

 Citizenship

You share KyRIC with other users, and what you do on the system affects others. Exercise good citizenship to ensure that your activity does not adversely impact the system and the research community with whom you share it. Here are some rules of thumb:

Don’t run jobs on the login nodes.
Don’t stress the filesystem.
Do use the debug partition to test out your job submission script.
Do submit an informative help-desk ticket.

 Managing Data and Transferring Files

 No user data is backed up. Users are responsible for their own backups.

Each project and user is given a scratch space and a home space. A good practice is to write your job’s output into your scratch space. Each compute node also has a local 5 TB SSD disk attached, but this local temporary space is shared among all jobs running on that node and will be cleaned up (deleted) upon job completion.
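As a sketch, a job might direct its output into the scratch file system like this (assuming $SCRATCH points to your scratch directory as listed in the File Systems table; the program and output file names are placeholders):

cd $SCRATCH
./myprogram > myprogram_output.txt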

 Transferring your Files

KyRIC nodes support the following file transfer protocols.

scp (if you have your SSH keys set up)
rsync (if you have your SSH keys set up)
Globus (only through the DTN node)

Users are encouraged to transfer data using rclone, scp, globus, etc. through the high-speed data transfer node (DTN) and not through the login nodes.
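For example, a transfer through the DTN with scp or rsync might look like the following (the DTN hostname, username, and paths are placeholders; consult the latest documentation for the actual DTN hostname):

scp my_data.tar.gz your_username@<dtn-hostname>:/path/to/project/space/
rsync -avP my_data_dir/ your_username@<dtn-hostname>:/path/to/project/space/my_data_dir/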


Building Software

Singularity containers are supported. Building a Singularity container requires root access, so containers must be built outside the cluster. If you have a Singularity container ready, you can copy it into the cluster and run your jobs with it. Most software will be provided through Singularity containers. Standard GNU and Intel compilers will also be provided.
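As a sketch, once a pre-built container image has been copied into the cluster (ideally through the DTN, as described above), it can be run inside a batch job with a command such as the following (the image and program names are placeholders, and the exact Singularity setup on KyRIC may differ):

singularity exec mycontainer.sif ./myprogram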

Software

Installed software can be found by running “module avail”.

Job Accounting

KyRIC allocations are made in core-hours. The recommended method for estimating your resource needs for an allocation request is to perform benchmark runs. The core-hours used for a job are calculated by multiplying the number of processor cores used by the wall-clock duration in hours. KyRIC core-hour calculations should assume that all jobs will run in the regular queue and that they are charged for use of all 40 cores on each node.
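For example, a job that occupies one KyRIC node (charged for all 40 cores) for 10 hours of wall-clock time consumes 40 cores × 10 hours = 400 core-hours.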

The Slurm scheduler tracks and charges for usage to a granularity of a few seconds of wall clock time. The system charges only for the resources you use, not those you request. If your job finishes early and exits properly, Slurm will release the node back into the pool of available nodes. Your job will only be charged for as long as you are using the node.

Running Jobs and accessing the Compute Nodes

KyRIC uses the Simple Linux Utility for Resource Management (SLURM) batch environment. When you run in batch mode, you submit jobs to be run on the compute nodes using the sbatch command as described below. Remember that computationally intensive jobs should be run only on the compute nodes and not on the login nodes.

The user must create a Slurm job submission script; the job is then submitted by running “sbatch <jobscript>”.

Job Scheduler

KyRIC uses the Slurm scheduler, which runs on most, if not all, XSEDE resources; complete Slurm documentation is available online (see References).

Common Slurm Commands

Command              Description
sbatch script_file   Submit SLURM job script
scancel job_id       Cancel job that has job_id
squeue -u user_id    Show jobs that are on queue for user_id
sinfo                Show partitions/queues, their time limits, number of nodes, and which compute nodes are running jobs or idle.

Slurm Job Script Options

Property: Job name
Syntax: #SBATCH -J jobname
Expanded syntax: #SBATCH --job-name=jobname
Example use: #SBATCH --job-name=my_first_job
Description: The job will be labeled with jobname (in addition to an integer job id automatically assigned by Slurm).

Property: Partition/queue
Syntax: #SBATCH -p partition_id
Expanded syntax: #SBATCH --partition=partition_id
Example use: #SBATCH --partition=normal
             #SBATCH --partition=debug
Description: The job will run on compute node(s) in partition_id.

Property: Time limit
Syntax: #SBATCH -t time_limit
Expanded syntax: #SBATCH --time=time_limit
Example use: #SBATCH --time=01:00:00      # one-hour limit
             #SBATCH --time=2-00:00:00    # 2-day limit
Description: The job will be killed if it reaches the specified time_limit.

Property: Memory (RAM)
Expanded syntax: #SBATCH --mem=memory_amount
Example use: #SBATCH --mem=32g            # 32 GB of RAM requested
Description: The job will use up to the specified memory_amount.

Property: Project account
Syntax: #SBATCH -A account
Expanded syntax: #SBATCH --account=account
Example use: #SBATCH --account=sample_proj_acct
Description: Run the job under this project account.

Property: Standard error filename
Syntax: #SBATCH -e filename
Expanded syntax: #SBATCH --error=filename
Example use: #SBATCH --error=slurm-%A_%a.err   # special variables; substituted with the job array number and job id number
             #SBATCH --error=prog_error.log    # any file name without whitespace may be used
Description: Standard error of the job will be stored under filename.

Property: Standard output filename
Syntax: #SBATCH -o filename
Expanded syntax: #SBATCH --output=filename
Example use: #SBATCH --output=slurm-%A_%a.out
             #SBATCH --output=prog_output.log
Description: Standard output of the job will be stored under filename.

 Partitions (Queues)

Table. KyRIC Production Queues

Queue Name   Node Type   Max Nodes per Job (assoc'd cores)*   Max Duration   Max Jobs in Queue*   Charge Rate (per core-hour)
normal       compute     1 node (40 cores)                    72 hrs.        5*                   1 SU

 

Interactive Sessions

You can also log in to a compute node and run jobs interactively, but only if that node is currently allocated to you.
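As a sketch, a typical Slurm interactive session request looks like the following (the exact flags, partition, and account to use on KyRIC are assumptions; adjust them to match your allocation):

srun --partition=normal --account=<your project account> --ntasks=1 --time=01:00:00 --pty bash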

Sample Job Scripts

-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

#!/bin/bash

#SBATCH --time=00:15:00          # Time limit for the job (REQUIRED).

#SBATCH --job-name=my_test_job              # Job name

#SBATCH --ntasks=1                   # Number of cores for the job. Same as SBATCH -n 1

#SBATCH --partition=normal     # Partition/queue to run the job in. (REQUIRED)

#SBATCH -e slurm-%j.err  # Error file for this job.

#SBATCH -o slurm-%j.out  # Output file for this job.

#SBATCH -A <your project account>  # Project allocation account name (REQUIRED)

./myprogram   # This is the program that will be executed on the compute node. You will substitute this with your scientific application.

-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
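Assuming the script above is saved as my_test_job.sh (a placeholder file name), it can be submitted and monitored with:

sbatch my_test_job.sh
squeue -u your_user_id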

Protected Data

No protected data (such as HIPAA data) are allowed in the cluster.

Help

Please submit tickets through the XSEDE portal with information detailing your problems.

References

SLURM scheduler:

https://slurm.schedmd.com/documentation.html

Latest documentation:

https://docs.ccs.uky.edu