
TEXAS ADVANCED COMPUTING CENTER

Deployment of NMI Components on the UT Grid

Shyamal Mitra

2

Outline

• TACC Grid Program

• NMI Testbed Activities

• Synergistic Activities
– Computations on the Grid
– Grid Portals

3

TACC Grid Program

• Building Grids
– UT Campus Grid
– State Grid (TIGRE)

• Grid Resources
– NMI Components
– United Devices
– LSF Multicluster

• Significantly leveraging NMI Components and experience

4

Resources at TACC

• IBM Power 4 System (224 processors, 512 GB memory, 1.16 TF)
• IBM IA-64 Cluster (40 processors, 80 GB memory, 128 GF)
• IBM IA-32 Cluster (64 processors, 32 GB memory, 64 GF)
• Cray SV1 (16 processors, 16 GB memory, 19.2 GF)
• SGI Origin 2000 (4 processors, 2 GB memory, 1 TB storage)
• SGI Onyx 2 (24 processors, 25 GB memory, 6 InfiniteReality2 graphics pipes)
• NMI components Globus and NWS installed on all systems save the Cray SV1

5

Resources at UT Campus

• Individual clusters belonging to professors in
– engineering
– computer sciences

• NMI components Globus and NWS installed on several machines on campus

• Computer laboratories with hundreds of PCs in the engineering and computer sciences departments

6

Campus Grid Model

• “Hub and Spoke” Model

• Researchers build programs on their clusters and migrate bigger jobs to TACC resources
– Use GSI for authentication
– Use GridFTP for data migration
– Use LSF Multicluster for migration of jobs
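A typical staging step under this model might look like the following sketch (hostnames and paths are placeholders, not actual UT Grid machines):

    # Create a short-lived GSI proxy credential from the user's certificate
    $ grid-proxy-init
    # Move an input file from the local cluster to a TACC system over GridFTP
    $ globus-url-copy file:///home/user/input.dat \
        gsiftp://hpc.example.utexas.edu/work/user/input.dat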

• Reclaim unused computing cycles on PCs through the United Devices infrastructure.

7

UT Campus Grid Overview

[Diagram: UT campus grid overview; labels include LSF]

8

NMI Testbed Activities

• Globus 2.2.2 – GSI, GRAM, MDS, GridFTP
– Robust software
– Standard Grid middleware
– Need to install from source code to link to other components like MPICH-G2 and Simple CA
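As an illustration of the source-build requirement, MPICH-G2 is configured in its source tree against the locally built Globus installation roughly as follows (the install path and build flavor are examples and must match the local Globus build):

    # Point at the Globus 2.2.2 tree built from source
    $ export GLOBUS_LOCATION=/usr/local/globus-2.2.2
    # Configure MPICH with the globus2 device and the matching Globus flavor
    $ ./configure --device=globus2:-flavor=gcc32dbg
    $ make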

• Condor-G 6.4.4 – submit jobs using GRAM, monitor queues, receive notification, and maintain Globus credentials. Lacks
– the scheduling capability of Condor
– checkpointing
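A minimal Condor-G submit description, sketched with a placeholder gatekeeper host and jobmanager name, looks like this:

    # job.submit -- hand the job to a remote GRAM gatekeeper
    universe        = globus
    globusscheduler = gatekeeper.example.utexas.edu/jobmanager-pbs
    executable      = my_app
    output          = my_app.out
    error           = my_app.err
    log             = my_app.log
    queue

The job is submitted with condor_submit job.submit and watched with condor_q.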

9

NMI Testbed Activities

• Network Weather Service 2.2.1
– name server for directory services
– memory server for storage of data
– sensors to gather performance measurements
– produces performance predictions that can feed a scheduler or “virtual grid”
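A rough sketch of bringing up these pieces (the daemon names are from the NWS 2.x distribution; the registration flags and hostnames shown are assumptions for illustration only):

    # Name server: directory service for the NWS system
    $ nws_nameserver &
    # Memory server: stores measurement data (registration flag assumed)
    $ nws_memory -N ns.example.utexas.edu &
    # One sensor per monitored host, reporting to the memory host (flags assumed)
    $ nws_sensor -N ns.example.utexas.edu -M mem.example.utexas.edu &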

• GSI-enabled OpenSSH 1.7
– modified version of OpenSSH that allows logging in to remote systems and transferring files between them without entering a password
– requires replacing the native sshd with GSI-enabled OpenSSH
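With a valid proxy from grid-proxy-init, the GSI-enabled clients are used like their stock OpenSSH counterparts (hostnames are placeholders):

    $ gsissh compute.example.utexas.edu
    $ gsiscp results.tar.gz compute.example.utexas.edu:/work/user/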

10

Computations on the UT Grid

• Components used – GRAM, GSI, GridFTP, MPICH-G2

• Machines involved – Linux RH (2), Sun (2), Linux Debian (2), Alpha Cluster (16 processors)

• Applications run – PI, Ring, Seismic (run sketch below)

• Successfully ran a demo at SC02 using NMI R2 components

• Relevance to NMI
– must build from source to link to MPICH-G2
– should be easily configurable to submit jobs to schedulers like PBS, LSF, or LoadLeveler
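A sketch of how one of these runs is launched with MPICH-G2 (the GRAM contacts and the exact machines-file syntax shown are assumptions; details depend on the local installation):

    # A valid proxy is needed before submission
    $ grid-proxy-init
    # machines file: GRAM service contacts and node counts (format assumed)
    $ cat machines
    "linux1.example.utexas.edu/jobmanager-pbs" 2
    "alpha.example.utexas.edu/jobmanager-lsf" 2
    # mpirun builds the RSL and submits through each gatekeeper
    $ mpirun -np 4 ./ring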

11

Computations on the UT Grid

• Issues to be addressed on clusters
– must submit to the local scheduler: PBS, LSF, or LoadLeveler
– compute nodes sit on a private subnet and cannot communicate with compute nodes on another cluster
– must open ports through the firewall for communication (see the port-range example below)
– version incompatibility – affects source code that is linked to shared libraries
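One common mitigation for the firewall issue is pinning Globus to a fixed range of ports that the firewall can be opened for (the range below is only an example):

    # Restrict GRAM callbacks and GridFTP data channels to a known port block
    $ export GLOBUS_TCP_PORT_RANGE=40000,40100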

12

Grid Portals

• HotPage – web page to obtain information on the status of grid resources
– NPACI HotPage (https://hotpage.npaci.edu)
– TIGRE Testbed portal (http://tigre.hipcat.net)

• Grid Technologies Employed
– Security: GSI, SSH, MyProxy for remote proxies (see the MyProxy example below)
– Job Execution: GRAM Gatekeeper
– Information Services: MDS (GRIS + GIIS), NWS, custom information scripts
– File Management: GridFTP
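A typical portal credential flow with MyProxy, sketched with placeholder server and account names, is:

    # The user delegates a medium-lived proxy to the MyProxy repository
    $ myproxy-init -s myproxy.example.utexas.edu -l username
    # The portal (or the user) later retrieves a short-lived proxy from it
    $ myproxy-get-delegation -s myproxy.example.utexas.edu -l username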

13

14

GridPort 2.0 Multi-Application Architecture (Using Globus as Middleware)

15

Future Work

• Use NMI components where possible in building grids

• Use the Lightweight Campus Certificate Policy for instantiating a Certificate Authority at TACC

• Build portals and deploy applications on the UT Grid

16

Collaborators

• Mary Thomas
• Dr. John Boisseau
• Rich Toscano
• Jeson Martajaya
• Eric Roberts
• Maytal Dahan
• Tom Urban