HPC Across Campus and the Cloud

Barr von Oehsen
Associate Vice President
Office of Advanced Research Computing
March 20, 2018
oarc.rutgers.edu
Vision
To make Rutgers a nationally and internationally recognized leader in, and provider of, advanced computing/cyberinfrastructure (ACI) to enable research.

Mission
To provide strategic leadership in advancing Rutgers University's research and scholarly achievements on all campuses through next-generation computing, data science, networking, and creative learning environments, in partnership with the Office of Research and Economic Development, the Office of Information Technology, and academic and research communities, both internal and external.
2
OARC Offers the Rutgers Community:
• Neutral territory that can facilitate discussions across multiple domains to help researchers be more productive with local and national cyberinfrastructure ecosystems (e.g., OSG, XSEDE, CloudLab, GENI)
• Access to research scientists with domain knowledge and technical expertise
• Consulting
  – Workflows, software, optimization, and infrastructure
  – Cloud solutions
• Grants
  – Boilerplate information on resources, infrastructure, and people
  – Team members can be (and are encouraged to be) included on proposals
• Access to advanced computing resources
  – With full OARC system and application support
  – Includes commercial cloud services
• Installation of applications
  – Porting codes, compiling, debugging
• ACI education, outreach, and training
3
A Campus-Level Research Platform
4
Federated Computing Across Rutgers
State-wide, multi-campus, distributed HPC and storage on a fast, low-latency network that is part of global Science DMZs:
• SDN-based 100 Gbps network core
• Data Transfer Nodes (FIONA)
• Advanced computing and storage
• Performance and monitoring support (perfSONAR, XDMoD)
• Testbed as a service
• Federated across campuses
• Policy driven
[Diagram: Rutgers Data DMZ (Local Research Platform). New Brunswick, Newark, and Camden each host local storage, HPC, and a Data Transfer Node (DTN), linked to a central storage repository and, via PE routers, to NYSERNet, Internet2, NJEdge, MagPi, and KINBER.]
6
Requires ACI to be in place: fast/low-latency networks, Science DMZs, SDN, Data Transfer Nodes (FIONAs), perfSONAR, XDMoD, advanced computing, and storage. A building block for regional, national…
• The Condo Model (Local Centralized HPC)
  – Community-owned resource
  – Unused owner cycles go to the general public
  – Owners have highest priority
  – Designed to feel like a local resource in a centralized environment
• Challenges
  – Non-traditional users of HPC
    • We have been successful in helping people transition from desktop to HPC
    • Models are more sophisticated
    • Data sets are larger
  – MPI vs. High-Throughput Computing
7
• The original condo model doesn't fit all
  – Buy-in for priority
    • Doesn't make sense if you only use compute a few times a year
  – Free access often means waiting
  – Great for small to medium-sized jobs
    • What about those who want to submit thousands of single-core/node jobs?
• How do we fix this?
8
The One-Rutgers ACI Ecosystem
A distributed, federated condo model designed so everything feels local and is accessible through a single login node. Includes a Science DMZ and Software-Defined Networking, with national/commercial cloud services connected over Internet2 (100 Gbps).
9
One-Rutgers Cloud Infrastructure
• Requires ACI to be in place: fast/low-latency networks, Science DMZs, SDN, perfSONAR, advanced computing, storage, Data Transfer Nodes (DTNs), and a 100 Gb/s connection to the outside world
• Resources will be distributed across New Brunswick/Piscataway, Newark, and Camden
• Tiered storage solution that includes HIPAA compliance
• Designed to be a testbed serving multiple research needs (including industry)
• Elastic in the sense that we can grow and shrink into cloud resources based on demand and job type (OSG, Amazon, Azure)
• Couples NSF-funded national projects and commercial cloud services directly into our environment, creating a one-stop shop for the researcher
• Positions us to be a national (and international) player
10
Connecting to the Cloud
11
• Partnership with Google Cloud and SchedMD (SLURM) to develop a federated hybrid environment
  – Allows federation across resources (including cloud services)
  – Allows us to offer a one-stop shop for users
  – Elastic computing is being introduced into a new version of SLURM
• Proof of concept (the federation setup is sketched below)
  – Set up a federated environment between Rutgers and Google Cloud
  – The Google Cloud environment is set up like a local resource and includes SLURM
  – Connection through Internet2 between the two environments
  – Able to submit jobs from Rutgers HPC to Google Cloud
  – Elasticity/bursting capability is being added to SLURM
    • Should be available to the general public by February 2018
12
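For concreteness, here is a minimal sketch of how a federation like this is assembled with SLURM's standard administration tools, assuming the cluster names from the test environment (tyr, mace, gcloud) and an illustrative federation name; this is not necessarily the exact procedure used in the proof of concept:

    # Run against the shared slurmdbd that all member clusters use.
    # Register each cluster with the accounting database:
    sacctmgr add cluster tyr
    sacctmgr add cluster mace
    sacctmgr add cluster gcloud

    # Create the federation and attach the three clusters
    # ("rutgers" is an illustrative federation name):
    sacctmgr add federation rutgers clusters=tyr,mace,gcloud

    # Confirm that all members are active:
    scontrol show federation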
Test Environment
• Two physical clusters, Tyr and Mace, at Rutgers
• A single virtual cluster, Gcloud
• All use the latest release candidate of SLURM with federation and elastic cloud support
• All clusters use the same SLURM database
• All clusters are separately configured to submit and run jobs locally, and have differing scheduler policies
• All clusters have NFS /home, local /scratch, and a distributed global filesystem (XRootD) for data sharing
• Any of the clusters can query and submit to each other
• Before federation, the clusters were administratively separate
• Through federation, the environment appears to the user as a single entity (see the submission sketch below)
13
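From the user's perspective, the federated environment behaves like one cluster; a sketch of a typical interaction (the job script name is a placeholder):

    # Submit from any member cluster; the federation places the job
    # on whichever cluster can start it soonest:
    sbatch job.sh

    # View jobs across the whole federation, not just the local cluster:
    squeue --federation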
Test Environment
• Federation extends SLURM's multi-cluster operation by defining a single administrative entity (the federation) that contains multiple clusters
• Jobs submitted on one of the clusters are actually submitted to the federation, and copies of the job are submitted to each scheduler
• The three schedulers coordinate with each other and select the cluster that can run the job soonest
• In effect, the user just submits jobs and they flow to the appropriate cluster
• Scheduler parameters can be adjusted on each cluster to achieve the desired job behavior (see the configuration sketch below)
  – For example, gcloud might support only single-core jobs, or only jobs from a particular group
14
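The example policy in the last bullet could be expressed in the gcloud cluster's slurm.conf roughly as follows; the partition name, node range, and group are hypothetical:

    # slurm.conf on the gcloud cluster (illustrative excerpt):
    # accept only single-node, single-CPU jobs, and only from one group
    PartitionName=cloud Nodes=gcloud[1-10] Default=YES State=UP MaxNodes=1 MaxCPUsPerNode=1 AllowGroups=oarc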
Test Environment
• Elastic computing involves on-demand provisioning of cloud resources to run jobs. It is an extension of SLURM's power-save functionality, in which physical compute nodes can be suspended when idle and resumed when a job is allocated (or powered off and back on)
• When the clusters are idle, the tyr and mace compute nodes sit idle and the gcloud compute nodes are shut off. When jobs are submitted, they end up on all three clusters; the gcloud instances spin up to process jobs and spin down when done (see the power-save sketch below)
15
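The underlying power-save hooks are configured in slurm.conf; in a setup like this one, the suspend/resume scripts would delete and create the corresponding Google Cloud instances. A rough sketch with placeholder paths and node sizes:

    # slurm.conf excerpt (illustrative): elastic nodes ride on the
    # power-save interface
    SuspendProgram=/opt/slurm/etc/suspend_gcloud.sh
    ResumeProgram=/opt/slurm/etc/resume_gcloud.sh
    SuspendTime=600        # power down a node after 10 idle minutes
    SuspendTimeout=120     # time allowed for a clean shutdown
    ResumeTimeout=300      # time allowed for an instance to boot

    # CLOUD-state nodes start powered down and are only instantiated
    # when the scheduler allocates a job to them:
    NodeName=gcloud[1-10] CPUs=8 RealMemory=30000 State=CLOUD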
Next Steps
• Assign cluster weights, so jobs pack on tyr and mace first and only go to gcloud when necessary (one possible mechanism is sketched below)
• Educate the research community
  – We already have a user whose NSF grant includes $250K of Google Cloud credits
• Figure out a funding model that makes sense
• Duplicate the environment in Azure and AWS
16
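The deck doesn't specify the mechanism; within a single SLURM cluster, the analogous pack-first behavior comes from the node Weight parameter (lower-weight nodes are allocated first). Purely as an illustration, with hypothetical node ranges and sizes:

    # slurm.conf excerpt (illustrative): the scheduler fills
    # lower-weight nodes first, so local nodes pack before cloud nodes
    NodeName=tyr[1-20]    CPUs=16 Weight=1
    NodeName=mace[1-20]   CPUs=16 Weight=1
    NodeName=gcloud[1-10] CPUs=8  Weight=100 State=CLOUD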
Future: East Coast Regional Research Platform
• The Rutgers Local Research Platform is a building block
• Includes partnerships with:
  – Regional network providers
    • NJEdge
    • KINBER
    • NYSERNet
    • OSHEAN
  – Internet2
  – The Massachusetts Green High Performance Computing Center (MGHPCC)
    • University of Massachusetts, Boston University, Harvard University, MIT, Northeastern University, and the Commonwealth of Massachusetts
  – Regional universities
    • Syracuse, Buffalo, Penn State, Bucknell, Brown, Rutgers, University of New Hampshire, University of Maine, University of Rhode Island
17
Acknowledgements
• Google Cloud
  – Karan Bhatia
  – Jesus Gomez
• Rutgers University
  – Bill Abbott
  – Eric Marshall
• SchedMD
  – Brian Christensen
  – Jess Arrington
• NSF
  – Campus Cyberinfrastructure (CC*) OAC-1659232
18