HPC Across Campus and the Cloud

Barr von Oehsen
Associate Vice President
Office of Advanced Research Computing
March 20, 2018
oarc.rutgers.edu
Vision
To make Rutgers a nationally and internationally recognized leader in, and provider of, advanced computing/cyberinfrastructure (ACI) to enable research.

Mission
To provide strategic leadership in advancing Rutgers University's research and scholarly achievements on all campuses through next-generation computing, data science, networking, and creative learning environments, in partnership with the Office of Research and Economic Development, the Office of Information Technology, and academic and research communities, both internal and external.
2
OARC Offers the Rutgers Community:
• Neutral territory that can facilitate discussions across multiple domains to help researchers be more productive with local and national cyberinfrastructure ecosystems (e.g., OSG, XSEDE, CloudLab, GENI)
• Access to research scientists with domain knowledge and technical expertise
• Consulting
  – Workflows, software, optimization, and infrastructure
  – Cloud solutions
• Grants
  – Boilerplate information on resources, infrastructure, and people
  – Team members can be (and are encouraged to be) included on proposals
• Access to advanced computing resources
  – With full OARC system and application support
  – Includes commercial cloud services
• Installation of applications
  – Porting codes, compiling, debugging
• ACI education, outreach, and training
3
A Campus-Level Research Platform
4
Federated Computing Across Rutgers
State-wide, multi-campus, distributed HPC and storage on a fast, low-latency network that is part of global Science DMZs:
• SDN-based 100 Gbps network core
• Data Transfer Nodes (FIONA)
• Advanced computing and storage
• Performance and monitoring support (perfSONAR, XDMoD)
• Testbed as a service
• Federated across campuses
• Policy driven
[Diagram: Rutgers Data DMZ (Local Research Platform). New Brunswick, Newark, and Camden each host local storage, HPC, and a Data Transfer Node (DTN), linked to a central storage repository and, via PE routers, to NYSERNet, Internet2, NJEdge, MagPi, and KINBER.]
6
Requires ACI to be in place: fast/low-latency networks, Science DMZs, SDN, Data Transfer Nodes (FIONAs), perfSONAR, XDMoD, advanced computing, and storage. A building block for regional, national…
• The Condo Model (Local Centralized HPC)
  – Community-owned resource
  – Unused owner cycles go to the general public
  – Owners have highest priority
  – Designed to feel like a local resource in a centralized environment
• Challenges
  – Non-traditional users of HPC
    • We have been successful in helping people transition from desktop to HPC
    • Models are more sophisticated
    • Data sets are larger
  – MPI vs. High-Throughput Computing
7
• The original condo model doesn't fit all
  – Buy-in for priority
    • Doesn't make sense if you only use compute a few times a year
  – Free access often means waiting
  – Great for small to medium-sized jobs
    • What about those who want to submit thousands of single-core/node jobs?
• How do we fix this?
8
The One-Rutgers ACI Ecosystem
A distributed, federated condo model designed so everything feels local and is accessible through a single login node. Includes a Science DMZ and Software-Defined Networking, with national/commercial cloud services connected over Internet2 (100 Gbps).
9
One-Rutgers Cloud Infrastructure
• Requires ACI to be in place: fast/low-latency networks, Science DMZs, SDN, perfSONAR, advanced computing, storage, Data Transfer Nodes (DTNs), and a 100 Gb/s connection to the outside world
• Resources will be distributed across New Brunswick/Piscataway, Newark, and Camden
• Tiered storage solution that includes HIPAA compliance
• Designed to be a testbed serving multiple research needs (including industry)
• Elastic in the sense that we can grow and shrink into cloud resources based on demand and job type (OSG, Amazon, Azure)
• Couples NSF-funded national projects and commercial cloud services directly into our environment, creating a one-stop shop for the researcher
• Positions us to be a national (and international) player
10
Connecting to the Cloud
11
• Partnership with Google Cloud and SchedMD (SLURM) to develop a federated hybrid environment
  – Allows federation across resources (including cloud services)
  – Allows us to offer a one-stop shop for users
  – Elastic computing is being introduced into a new version of SLURM
• Proof of concept (the federation setup is sketched below)
  – Set up a federated environment between Rutgers and Google Cloud
  – The Google Cloud environment is set up like a local resource and includes SLURM
  – Connection through Internet2 between the two environments
  – Able to submit jobs from Rutgers HPC to Google Cloud
  – Elasticity/bursting capability is being added to SLURM
    • Should be available to the general public by February 2018
12
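For concreteness, here is a minimal sketch of how a federation like this is assembled with SLURM's standard administration tools, assuming the cluster names from the test environment (tyr, mace, gcloud) and an illustrative federation name; this is not necessarily the exact procedure used in the proof of concept:

    # Run against the shared slurmdbd that all member clusters use.
    # Register each cluster with the accounting database:
    sacctmgr add cluster tyr
    sacctmgr add cluster mace
    sacctmgr add cluster gcloud

    # Create the federation and attach the three clusters
    # ("rutgers" is an illustrative federation name):
    sacctmgr add federation rutgers clusters=tyr,mace,gcloud

    # Confirm that all members are active:
    scontrol show federation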
Test Environment
• Two physical clusters, Tyr and Mace, at Rutgers
• A single virtual cluster, Gcloud
• All use the latest release candidate of SLURM with federation and elastic cloud support
• All clusters use the same SLURM database
• All clusters are separately configured to submit and run jobs locally, and have differing scheduler policies
• All clusters have NFS /home, local /scratch, and a distributed global filesystem (XRootD) for data sharing
• Any of the clusters can query and submit to each other
• Before federation, the clusters were administratively separate
• Through federation, the environment appears to the user as a single entity (see the submission sketch below)
13
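From the user's perspective, the federated environment behaves like one cluster; a sketch of a typical interaction (the job script name is a placeholder):

    # Submit from any member cluster; the federation places the job
    # on whichever cluster can start it soonest:
    sbatch job.sh

    # View jobs across the whole federation, not just the local cluster:
    squeue --federation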
Test Environment
• Federation extends SLURM's multi-cluster operation by defining a single administrative entity (the federation) that contains multiple clusters
• Jobs submitted on one of the clusters are actually submitted to the federation, and copies of the job are submitted to each scheduler
• The three schedulers coordinate with each other and select the cluster that can run the job soonest
• In effect, the user just submits jobs and they flow to the appropriate cluster
• Scheduler parameters can be adjusted on each cluster to achieve the desired job behavior (see the configuration sketch below)
  – For example, gcloud might support only single-core jobs, or only jobs from a particular group
14
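The example policy in the last bullet could be expressed in the gcloud cluster's slurm.conf roughly as follows; the partition name, node range, and group are hypothetical:

    # slurm.conf on the gcloud cluster (illustrative excerpt):
    # accept only single-node, single-CPU jobs, and only from one group
    PartitionName=cloud Nodes=gcloud[1-10] Default=YES State=UP MaxNodes=1 MaxCPUsPerNode=1 AllowGroups=oarc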
Test Environment
• Elastic computing involves on-demand provisioning of cloud resources to run jobs. It is an extension of SLURM's power-save functionality, in which physical compute nodes can be suspended when idle and resumed when a job is allocated (or powered off and back on)
• When the clusters are idle, the tyr and mace compute nodes sit idle and the gcloud compute nodes are shut off. When jobs are submitted, they end up on all three clusters; the gcloud instances spin up to process jobs and spin down when done (see the power-save sketch below)
15
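The underlying power-save hooks are configured in slurm.conf; in a setup like this one, the suspend/resume scripts would delete and create the corresponding Google Cloud instances. A rough sketch with placeholder paths and node sizes:

    # slurm.conf excerpt (illustrative): elastic nodes ride on the
    # power-save interface
    SuspendProgram=/opt/slurm/etc/suspend_gcloud.sh
    ResumeProgram=/opt/slurm/etc/resume_gcloud.sh
    SuspendTime=600        # power down a node after 10 idle minutes
    SuspendTimeout=120     # time allowed for a clean shutdown
    ResumeTimeout=300      # time allowed for an instance to boot

    # CLOUD-state nodes start powered down and are only instantiated
    # when the scheduler allocates a job to them:
    NodeName=gcloud[1-10] CPUs=8 RealMemory=30000 State=CLOUD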
Next Steps
• Assign cluster weights, so jobs pack on tyr and mace first and only go to gcloud when necessary (one possible mechanism is sketched below)
• Educate the research community
  – We already have a user whose NSF grant includes $250K of Google Cloud credits
• Figure out a funding model that makes sense
• Duplicate the environment in Azure and AWS
16
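The deck doesn't specify the mechanism; within a single SLURM cluster, the analogous pack-first behavior comes from the node Weight parameter (lower-weight nodes are allocated first). Purely as an illustration, with hypothetical node ranges and sizes:

    # slurm.conf excerpt (illustrative): the scheduler fills
    # lower-weight nodes first, so local nodes pack before cloud nodes
    NodeName=tyr[1-20]    CPUs=16 Weight=1
    NodeName=mace[1-20]   CPUs=16 Weight=1
    NodeName=gcloud[1-10] CPUs=8  Weight=100 State=CLOUD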
Future: East Coast Regional Research Platform
• The Rutgers Local Research Platform is a building block
• Includes partnerships with:
  – Regional network providers
    • NJEdge
    • KINBER
    • NYSERNet
    • OSHEAN
  – Internet2
  – The Massachusetts Green High Performance Computing Center (MGHPCC)
    • University of Massachusetts, Boston University, Harvard University, MIT, Northeastern University, and the Commonwealth of Massachusetts
  – Regional universities
    • Syracuse, Buffalo, Penn State, Bucknell, Brown, Rutgers, University of New Hampshire, University of Maine, University of Rhode Island
17
Acknowledgements
• Google Cloud
  – Karan Bhatia
  – Jesus Gomez
• Rutgers University
  – Bill Abbott
  – Eric Marshall
• SchedMD
  – Brian Christensen
  – Jess Arrington
• NSF
  – Campus Cyberinfrastructure (CC*) OAC-1659232
18