
Page 1: Campus, State, and Regional Grid Issues

Campus, State, and Regional Grid Issues

Warren Smith

Texas Advanced Computing Center

University of Texas at Austin

Page 2: Campus, State, and Regional Grid Issues

Outline

• Overview of several grids
  – UTGrid - a University of Texas campus grid
  – TIGRE - a state of Texas grid
  – SURAgrid - a southeast US regional grid

• Summary of issues

Page 3: Campus, State, and Regional Grid Issues

UT Grid Project

Page 4: Campus, State, and Regional Grid Issues

UTGrid Overview

• Create a campus cyberinfrastructure
  – Supporting research and education
  – Diverse resources, but an easy-to-use environment

• Worked in several main areas
  – Serial computing
  – Parallel computing
  – User interfaces

• Close partnership with IBM
  – Partly funded by IBM
  – 2 IBM staff located at TACC

Page 5: Campus, State, and Regional Grid Issues

Serial Computing

• Many science domains have “embarrassingly parallel” computations
  – DNA sequence analysis, protein docking, molecular modeling, CGI, engineering simulation

• Can older clusters and desktop systems be used for these?

Page 6: Campus, State, and Regional Grid Issues

Parallel Rendering of Single Frames

[Figure: single-frame rendering time reduced from 2 h 17 min to 15 min, with individual stages now taking 5-8 min]

Page 7: Campus, State, and Regional Grid Issues

Roundup

• Aggregates unused cycles on desktop computers

• Managed by United Devices Grid MP software
  – Supports Windows, Linux, Mac, AIX, & Solaris clients
  – Linux servers located at TACC
• Supports hosted applications
  – Pre-configured by an expert for use by many
• Resources contributed by several UT organizations
  – Nearly 2000 desktops available
• Production resource
  – Supported by TACC

[Diagram: UT Grid users reach the Roundup Grid MP server (Frio, running United Devices Grid MP at TACC) through the Grid User Portal or a Grid User Node (Windows, Mac, Linux); desktops are contributed by ITS, the College of Engineering, Computer Science, TACC, the College of Communication, and the College of Fine Arts]

Page 8: Campus, State, and Regional Grid Issues

Rodeo

• Condor Pool of dedicated and non-dedicated resources

• Dedicated resources
  – Condor Central Manager (collector and negotiator)
  – One of our older clusters - Tejas
• Non-dedicated resources
  – Linux, Windows, and Mac resources are managed by Condor
  – Usage policy is configured by the resource owner
    • When there is no other activity
    • When load (utilization) is low
    • Give preference to certain groups or users
• TACC pool configured to flock to and from CS and ICES pools
• 500 systems available
• Production resource
  – Supported by TACC (a minimal submit sketch follows the diagram below)

[Diagram: UT Grid users submit to Rodeo through the Grid User Portal or a Grid User Node (Windows, Mac, Linux); the TACC Condor pool, with its collector/negotiator, flocks to and from the Computer Science and ICES Condor pools]
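To make the Rodeo pool concrete, here is a minimal sketch of submitting a job to a Condor pool from a user's machine. The executable, file names, and requirements expression are illustrative placeholders, not the actual Rodeo configuration.

```python
# A minimal sketch of submitting work to a Condor pool such as Rodeo.
# The executable, file names, and requirements are illustrative only.
import subprocess

submit_description = """\
universe     = vanilla
executable   = analyze.sh
arguments    = input.dat
output       = analyze.out
error        = analyze.err
log          = analyze.log
requirements = (OpSys == "LINUX")
queue
"""

with open("analyze.sub", "w") as f:
    f.write(submit_description)

# condor_submit queues the job; the pool's negotiator matches it to an idle
# desktop or Tejas node according to each owner's usage policy.
subprocess.run(["condor_submit", "analyze.sub"], check=True)
```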

Page 9: Campus, State, and Regional Grid Issues

Parallel Computing

• Support executing parallel jobs on clusters around campus
• Our approach to campus parallel computing is influenced by the location of resources
• TACC has the largest systems at UT Austin
  – Lonestar: 1024-processor Linux cluster
  – Wrangler: 656-processor Linux cluster
  – Champion: 96-processor IBM Power5
  – Maverick: Sun system with 64 dual-core UltraSPARC IV processors, 512 GB memory
• Also a number of clusters in other locations on campus
• Initially a hub-and-spoke model
  – TACC has the largest systems
  – Allow users of non-TACC systems to easily send their jobs to TACC

Page 10: Campus, State, and Regional Grid Issues

Parallel Grid Technologies

• File management
  – Quickly move (perhaps large) files between systems
  – Evaluated Avaki Data Grid, IBM GPFS, GridNFS, GridFTP
  – Using GridFTP
• Job execution
  – Using Globus GRAM
• Resource brokering
  – Evaluated Condor-G, Platform CSF
  – Created GridShell & MyCluster
  – Researching performance prediction services
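As an illustration of how these pieces fit together, the sketch below stages a file with GridFTP and then runs a job through Globus GRAM using the standard command-line clients. The paths and the remote executable are placeholders, and the actual UT Grid brokering layered GridShell/MyCluster on top rather than making direct calls like these.

```python
# A minimal sketch of the file-management and job-execution layers above:
# stage a file with GridFTP, then run a job through Globus GRAM.
# Paths and the remote executable are placeholders for illustration.
import subprocess

REMOTE = "lonestar.tacc.utexas.edu"   # example host from the GridShell slide

# File management: move an input file to the remote system with GridFTP.
subprocess.run(
    ["globus-url-copy", "file:///home/user/input.dat",
     "gsiftp://%s/work/input.dat" % REMOTE],
    check=True,
)

# Job execution: run a program on the remote system via Globus GRAM.
subprocess.run(
    ["globus-job-run", REMOTE, "/work/a.out", "/work/input.dat"],
    check=True,
)
```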

Page 11: Campus, State, and Regional Grid Issues

GridShell

• Transparently execute commands across grid resources
• Extensions to tcsh and bash
  – “on” - “a.out on 2 procs”
  – “in” - “a.out in 1000 instances”
  – Redirection
    • a.out > gsiftp://lonestar.tacc.utexas.edu/work/data
  – Overloaded programs
    • cp - copy between systems
  – Environment variables
    • _GRID_THROTTLE - number of active jobs
    • _GRID_TAG - job tag
    • _GRID_TASKID - set if part of a parallel job
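As a rough illustration of the per-instance environment variables, the hypothetical worker script below picks its input based on _GRID_TASKID when launched as something like “worker.py in 1000 instances”. Only the variable name comes from the slide; the input and output layout is an assumption made for the example.

```python
#!/usr/bin/env python
# Hypothetical per-instance worker for a GridShell "in N instances" run.
# _GRID_TASKID (from the slide above) distinguishes the instances; the
# input/output file layout here is assumed for illustration only.
import os

task_id = int(os.environ.get("_GRID_TASKID", "0"))

in_path = "inputs/chunk-%04d.dat" % task_id     # assumed naming scheme
out_path = "results/chunk-%04d.out" % task_id

with open(in_path) as f:
    data = f.read()

# ... the real per-chunk computation would go here ...
result = data.upper()

os.makedirs("results", exist_ok=True)
with open(out_path, "w") as f:
    f.write(result)
```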

Page 12: Campus, State, and Regional Grid Issues

MyCluster

• GridShell extensions require job management
• Approach is to form a virtual cluster
  – Select systems to incorporate
    • User-specified maximum for each system
    • Number of nodes up to that maximum selected based on load
  – Submit “MyScheduler” daemons to selected systems
  – Iterate over this process to maintain a virtual cluster
  – Submit user jobs to this virtual cluster
  – “MyScheduler” schedules jobs to nodes
  – “MyScheduler” can be:
    • Condor
    • Sun Grid Engine (SGE)
    • Others possible
• Deployed on TeraGrid, in addition to UT Grid (a sketch of the maintenance loop follows the diagram below)

[Diagram: 1. MyScheduler daemons are submitted through each cluster's local scheduler; 2. the MyCluster virtual cluster is formed; 3. the user submits jobs to the virtual cluster; 4. the MyScheduler daemons run the jobs on the cluster nodes]
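A rough sketch of the “iterate to maintain a virtual cluster” idea is below. Every helper function and the per-system limits are hypothetical stand-ins; this shows only the control loop, not the real MyCluster implementation.

```python
# A conceptual sketch of the MyCluster maintenance loop described above.
# All helpers and limits are hypothetical stand-ins, not real MyCluster code.
import time

# User-specified maximum number of nodes to borrow from each system.
MAX_NODES = {"lonestar": 64, "wrangler": 32, "champion": 8}

def current_load(system: str) -> float:
    """Hypothetical: fraction of a system that is busy (0.0 - 1.0)."""
    raise NotImplementedError

def daemons_running(system: str) -> int:
    """Hypothetical: number of MyScheduler daemons we already have there."""
    raise NotImplementedError

def submit_myscheduler_daemon(system: str) -> None:
    """Hypothetical: submit one MyScheduler daemon as a local batch job."""
    raise NotImplementedError

def maintain_virtual_cluster() -> None:
    """Keep daemons on each system, up to its per-system maximum, borrowing
    fewer nodes when a system is heavily loaded."""
    while True:
        for system, maximum in MAX_NODES.items():
            target = int(maximum * (1.0 - current_load(system)))
            for _ in range(max(0, target - daemons_running(system))):
                submit_myscheduler_daemon(system)
        time.sleep(60)   # user jobs go to the virtual cluster's scheduler
                         # (e.g. Condor or SGE) while this loop runs
```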

Page 13: Campus, State, and Regional Grid Issues

User Interfaces

• Command-line access via Grid User Node (GUN)
  – Submit jobs to Rodeo and Roundup
  – Submit jobs to clusters
  – Use GridShell to form virtual clusters and run jobs
  – Manage files

• Graphical interface via Grid User Portal (GUP)
  – Using a standard web browser
  – Submit jobs to Rodeo and Roundup
  – Submit jobs to clusters
  – Manage files

Page 14: Campus, State, and Regional Grid Issues

Grid User Node

• Command-line access to UT Grid
• Developed a software stack
  – UT Grid users can install it on their systems
  – Can use serial and parallel UT Grid resources
• Provided grid user nodes with this software stack
  – UT Grid users can log in and use UT Grid

• Targeted toward experienced users

Page 15: Campus, State, and Regional Grid Issues

Grid User Portal

• Web interface to UT Grid
• Simple GUI interface to complex grid computing capabilities
• Can present information more clearly than a CLI
• Lowers the barrier to entry for novice users

• Implemented as a set of configurable portlets

Page 16: Campus, State, and Regional Grid Issues

Portlet Concept Example: yahoo.com

Page 17: Campus, State, and Regional Grid Issues

Resource Information

Page 18: Campus, State, and Regional Grid Issues

CFT – Screen Shot

Page 19: Campus, State, and Regional Grid Issues

UTGrid Status

• UT Grid received IBM funding for 2 years
  – Ended recently
• Continuing efforts in a number of areas
• Several new technologies developed that have been used in other projects
  – GridShell, MyCluster, GridPort

• Rodeo and Roundup continue to be available to users

Page 20: Campus, State, and Regional Grid Issues

UTGrid Lessons Learned

• Grid technologies
  – Growing pains with some
    • Several versions of Globus
    • Platform Community Scheduler Framework
  – Surprising successes with others
    • United Devices Grid MP, Condor
• Grids for serial computing are ready for users
• Grids for parallel computing need further improvements before they are ready
• Creating a campus grid was harder than we expected
  – Technologies not as ready as advertised
  – More labor involved

Page 21: Campus, State, and Regional Grid Issues

Texas Internet Grid for Research and Education (TIGRE)

Page 22: Campus, State, and Regional Grid Issues

Computing Across Texas

• Grid computing is central to many high-tech industries
  – Computational science and engineering
  – Microelectronics and advanced materials (nanotechnology)
  – Medicine, bioengineering, and bioinformatics
  – Energy and environment
  – Defense and aerospace
• Texas wants to encourage these industries
• The State of Texas is moving forward aggressively
  – Funded the Lonestar Education And Research Network (LEARN)
    • $7.3M for optical fiber across the state (33 schools)
    • Provides the infrastructure for a grid
  – Funded the Texas Internet Grid for Research and Education (TIGRE)
    • $2.5M for programmers, sysadmins, etc. (5 schools)
    • Provides the manpower to construct a grid

Page 23: Campus, State, and Regional Grid Issues

TIGRE Mission

• Integrate resources of Texas institutions to enhance research and educational capabilities

• Foster academic, private, and government partnerships

Page 24: Campus, State, and Regional Grid Issues

[Map of Texas showing the TIGRE partner institutions - UT Austin, Texas Tech Univ., Texas A&M, Rice University, and Univ. of Houston - and cities across the state: El Paso, San Antonio, Austin, Lubbock, Houston, Dallas, Corpus Christi, Galveston, Beaumont, Fort Worth, Denton, Longview, Waco, College Station]

TIGRE Approach

• Construct a grid across 5 funded sites:
  – Rice University
  – Texas A&M University
  – Texas Tech University
  – University of Houston
  – University of Texas at Austin
• Support a small set of applications important to Texas
• Package the grid
  – Software
  – Documentation
  – Procedures
  – Organizational structure
  – Leverage everything that’s already out there!
• To easily bring other LEARN members into TIGRE (providing resources & running apps)

Page 25: Campus, State, and Regional Grid Issues

TIGRE Organization

• Steering Committee
  – 1 participant from each of the 5 partners
  – Decisions by consensus
  – Selects application areas and applications
  – High-level guidance

• Development group
  – Technical group constructing TIGRE
  – Several members from each site
  – Work is broken up into activities
  – Decisions by consensus
  – ~10 people now, still growing

Page 26: Campus, State, and Regional Grid Issues

TIGRE Resources

• Amount of allocations undecided
• Available from TIGRE institutions
  – Lonestar (UT Austin): 1024 Xeons + Myrinet + GigE
  – Hrothgar (Texas Tech): 256 Xeons + InfiniBand + GigE
  – Cosmos (Texas A&M): 128 Itaniums + NUMAlink
  – Rice Terascale Cluster: 128 Itaniums + GigE
  – Atlantis (Houston): 124 Itaniums + GigE + SCI
  – plus several smaller systems
• Will incorporate other institutions as appropriate to applications

Page 27: Campus, State, and Regional Grid Issues

TIGRE Activities

• Planning
  – Documents: project management, requirements, architecture, initial design
• Authentication and authorization
  – Setting up a CA (including policies) for TIGRE
  – Experimenting with a VOMS server
• Assembling a software stack
  – Start small, add when needed
  – Pulling components from the Virtual Data Toolkit
    • Globus Toolkit 4.0, pre-Web Services and Web Services
    • GSI-OpenSSH
    • UberFTP
    • Condor-G
    • MyProxy
  – VDT providing 64-bit versions for us
• User portal
  – Creating a user portal for TIGRE

Page 28: Campus, State, and Regional Grid Issues

TIGRE Timeline

[Timeline: two years of quarterly milestones beginning December 1, 2005, with a “today” marker showing current progress.
Year 1: project plan, web site, Certificate Authority, testbed requirements, driving applications; web portal, software stack, distribution mechanism, demonstrate a TIGRE application; client software package; user support system.
Year 2: global scheduler, software feature freeze, TIGRE service requirements; final software, final documentation, procedures and policies to join TIGRE, demonstrate at SC.]

Page 29: Campus, State, and Regional Grid Issues

TIGRE Applications

• Developing in three areas
  – Biology / medicine
    • Rice and Houston collaborating with Baylor College of Medicine
    • UT Austin working with UT Southwestern Medical Center
  – Atmospheric science / air quality
    • Research interest and expertise at Texas A&M, Rice, Houston
    • WRF and WRF-CHEM under evaluation
  – Energy
    • Very preliminary discussions about seismic processing, reservoir modeling
    • Looking for industrial partners
• Grid-ready applications
  – EMAN - 3D reconstruction from electron microscopy
  – ENDYNE - quantum chemistry
  – ALICE - high-energy physics

Page 30: Campus, State, and Regional Grid Issues

TIGRE Lessons Learned

• Funding structure is important
  – Funding was given from the state directly to each university
  – No one person responsible; 5 are responsible
  – Communication and coordination challenges
• Organizational structure is important
  – Related to funding structure
  – Must be able to hold people accountable
  – Even with good people, it can be difficult to form consensus
• Availability of computational resources is important
  – TIGRE contains no funding for compute resources
  – Makes it harder to generate interest from users
• Driving users are important
  – Identified some, need more

Page 31: Campus, State, and Regional Grid Issues

Southeastern Universities Research Association Grid

(SURAgrid)

Page 32: Campus, State, and Regional Grid Issues

Southeastern Universities Research Association

• An organization formed to manage the Thomas Jefferson National Accelerator Facility (Jefferson Lab)
  – Has extended its activities beyond this
• Mission: Foster excellence in scientific research, strengthen capabilities, provide training opportunities
• Region: 16 states & DC
• Membership: 62 research universities

Page 33: Campus, State, and Regional Grid Issues

SURAgrid Goals

SURAgrid: Organizations collaborating to bring grids to the level of seamless, shared infrastructure

Goals:
• To develop scalable infrastructure that leverages local institutional identity and authorization while managing access to shared resources
• To promote the use of this infrastructure for the broad research and education community
• To provide a forum for participants to gain additional experience with grid technology and participate in collaborative project development

Page 34: Campus, State, and Regional Grid Issues

SURAgrid Participants

• University of Alabama at Birmingham
• University of Alabama in Huntsville
• University of Arkansas
• University of Florida
• George Mason University
• Georgia State University
• Great Plains Network
• University of Kentucky
• University of Louisiana at Lafayette
• Louisiana State University
• University of Michigan
• Mississippi Center for Supercomputing Research
• University of North Carolina, Charlotte
• North Carolina State University
• Old Dominion University
• University of South Carolina
• University of Southern California
• Southeastern Universities Research Association (SURA)
• Texas A&M University
• Texas Advanced Computing Center (TACC)
• Texas Tech
• Tulane University
• Vanderbilt University
• University of Virginia

Page 35: Campus, State, and Regional Grid Issues

Current Activities

• Grid-Building
  – Themes: heterogeneity, flexibility, interoperability, scalability
• User Portal
  – Themes: capability, usability
• Inter-institutional AuthN/AuthZ
  – Themes: maintain local autonomy; leverage enterprise infrastructure
• Application Development
  – Themes: immediate benefit to applications; apps drive development
• Education, Outreach, and Training
  – Cookbooks (how-to documents)
  – Workshops on science on grids and grid infrastructure
  – Tutorials on building and using grids
• A small amount of funding provided by SURA for activities
• The majority of effort on SURAgrid is volunteer

Page 36: Campus, State, and Regional Grid Issues

Building SURAgrid

• Software selection
  – Globus pre-WS
  – GPIR information provider
• Environment variables
  – Defined a basic set of environment variables that users can expect
• Assistance adding resources
  – Installing & configuring software & environment
• Assistance using resources
  – User support, modifying software environments
• Almost totally volunteer
  – Access to smaller clusters
  – People’s time

Page 37: Campus, State, and Regional Grid Issues

SURAgrid Portal

• Single sign-on to access all grid resources
• Documentation tab has details on:
  – Adding resources to the grid
  – Setting up user IDs and uploading proxy certificates

Page 38: Campus, State, and Regional Grid Issues

Resource Monitoring

http://gridportal.sura.org/gridsphere/gridsphere?cid=resource-monitor

Page 39: Campus, State, and Regional Grid Issues

Proxy Management

• Upload proxy certificates to a MyProxy server
• Portal provides support for selecting a proxy certificate to be used in a user session
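For reference, staging a credential into MyProxy is normally done with the myproxy-init client; the sketch below wraps that call. The server name and account are placeholders, and the command prompts interactively for the certificate and MyProxy passphrases.

```python
# A minimal sketch (not SURAgrid's actual tooling) of uploading a credential
# to a MyProxy server so a portal session can retrieve a proxy later.
# The server name and username are assumptions for illustration.
import subprocess

MYPROXY_SERVER = "myproxy.example.edu"   # hypothetical MyProxy host

# myproxy-init reads the user's grid certificate and key and prompts for the
# passphrases before storing the credential under the given account name.
subprocess.run(
    ["myproxy-init", "-s", MYPROXY_SERVER, "-l", "jdoe"],
    check=True,
)
```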

Page 40: Campus, State, and Regional Grid Issues

File Management

• List directories
• Move files between grid resources
• Upload/download files from the local machine

Page 41: Campus, State, and Regional Grid Issues

Job Management

• Submit jobs for execution on remote grid resources
• Check status of submitted jobs
• Cancel and delete jobs

Page 42: Campus, State, and Regional Grid Issues

SURAgrid Authentication: Bridge CA

• Problem:
  – Many different Certificate Authorities (CAs) issuing credentials
  – Which ones should a grid trust?
• An approach: Bridge CA
  – Trust any CAs signed by the bridge CA
  – Can query the bridge CA to ask if it trusts a CA
• A way to implement Policy Management Authorities (e.g. TAGPMA)

[Diagram: without a bridge CA, a campus grid must decide individually whether to trust each of many CAs; with a bridge CA, the grid trusts the bridge and, through it, every CA the bridge has signed]

Page 43: Campus, State, and Regional Grid Issues

SURAgrid Authorization

• The Globus grid-mapfile
  – Controls the basic (binary) authorization process
  – Sites add certificate Subject DNs from remote sites to their grid-mapfile based on email from SURAgrid sites
• Grid-mapfile automation
  – Sites that use a recent version of Globus can use an LDAP callout that replaces the grid-mapfile
  – Directory holds and coordinates:
    • Certificate Subject DN
    • Unix login name
    • Allocated Unix UID
    • Some Unix GIDs?
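For concreteness, a grid-mapfile is just a text file mapping certificate subject DNs to local accounts. Below is a minimal sketch of the email-driven additions described above; the DN and account name are made up, and /etc/grid-security/grid-mapfile is the conventional location for the map.

```python
# A minimal sketch of adding a remote user to the Globus grid-mapfile.
# The DN and local account below are made up for illustration.
GRID_MAPFILE = "/etc/grid-security/grid-mapfile"

def add_mapping(subject_dn: str, local_user: str) -> None:
    """Append one 'DN -> local account' line if it is not already present."""
    entry = '"%s" %s' % (subject_dn, local_user)
    try:
        with open(GRID_MAPFILE) as f:
            if entry in (line.strip() for line in f):
                return  # already mapped
    except FileNotFoundError:
        pass
    with open(GRID_MAPFILE, "a") as f:
        f.write(entry + "\n")

if __name__ == "__main__":
    # Hypothetical DN received by email from another SURAgrid site
    add_mapping("/O=SURAgrid/OU=Example University/CN=Jane Doe", "jdoe")
```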

Page 44: Campus, State, and Regional Grid Issues

SURAgrid Applications

• Multiple Genome Alignment (GSU, UAB, UVA)

• Task Farming (LSU)
• Muon Detector Grid (GSU)
• BLAST (UAB)
• ENDYNE (TTU)
• SCOOP/ADCIRC (UNC, RENCI, MCNC, SCOOP partners, SURAgrid partners)

• Seeking more

Page 45: Campus, State, and Regional Grid Issues

SCOOP & ADCIRC

• SURA Coastal Ocean Observing and Prediction (SCOOP)
  – Develop data standards
  – Make observations available & integrate them
  – Modeling, analysis, and delivery of real-time data
• ADCIRC: forecast storm surge
  – Preparing for hurricane season now
  – Uses data acquired from IOOS
  – Executes on SURAgrid
    • Resource selection (query MDS)
    • Build package (application & data)
    • Send package to resource (GridFTP)
    • Run ADCIRC in MPI mode (Globus RSL & qsub)
    • Retrieve results from resource (GridFTP)
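A simplified sketch of that run cycle is below. The host name, paths, script name, and job size are placeholders, the MDS query is omitted, and the RSL attributes are only meant to suggest how GRAM hands the MPI run to the site's batch system; treat it as an outline, not the actual SCOOP workflow code.

```python
# A simplified outline of the ADCIRC run cycle listed above.
# Host name, paths, and job size are placeholders for illustration.
import subprocess

SITE = "cluster.example.edu"              # hypothetical SURAgrid resource
PKG = "adcirc_run.tar.gz"                 # application + input data package

# 1. Resource selection (MDS query) omitted; SITE stands in for the result.

# 2-3. Build the package locally and ship it to the resource with GridFTP.
subprocess.run(
    ["globus-url-copy", "file:///tmp/" + PKG,
     "gsiftp://" + SITE + "/scratch/" + PKG],
    check=True,
)

# 4. Run ADCIRC as an MPI job via Globus GRAM; the RSL is handed to the
#    site's local batch system (e.g. qsub) by the GRAM job manager.
rsl = "&(executable=/scratch/run_adcirc.sh)(jobtype=mpi)(count=32)"
subprocess.run(["globusrun", "-b", "-r", SITE, rsl], check=True)

# 5. Retrieve the storm-surge results with GridFTP when the job completes.
subprocess.run(
    ["globus-url-copy", "gsiftp://" + SITE + "/scratch/results.tar.gz",
     "file:///tmp/results.tar.gz"],
    check=True,
)
```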

Page 46: Campus, State, and Regional Grid Issues

SCOOP/ADCIRC…

Left: ADCIRC max water level for a 72 hr forecast starting 29 Aug 2005, driven by the “usual, always-available” ETA winds.

Right: ADCIRC max water level over ALL of UFL ensemble wind fields for 72 hr forecast starting 29 Aug 2005, driven by “UFL always-available” ETA winds.

Images credit: Brian O. Blanton, Dept of Marine Sciences, UNC Chapel Hill

Page 47: Campus, State, and Regional Grid Issues

SURAgrid Lessons

• Volunteer efforts are different
  – Lots of good people involved
    • They are also involved in many other (funded) activities
  – Progress can be uneven
  – Must interest people (builders & users)
• Leadership is important
  – SURA is providing a lot of energy to lead this effort
  – Direction and purpose not always clear
• Computational resources
  – Needed to interest users
• Driving applications
  – Needed to guide infrastructure development
• Matching organization with funding boundaries
  – SURAgrid isn’t matched to a funding boundary
  – Must compete against US national grid efforts

Page 48: Campus, State, and Regional Grid Issues

Summary of Issues I

• Technology
  – Grid technologies are in constant flux
    • Stability and reliability also vary widely
  – Integrating technologies is challenging
  – People choose different technologies
  – Single-provider solutions available in some areas
    • Serial computing grids
  – Choose your technology path carefully…
    • But you’ll be wrong, anyway
    • Be prepared for change
• Funding
  – Always need it
  – The way it is obtained affects many things
    • Participants
    • Project organization
    • Integration with end users
  – Think about what you are proposing and who you collaborate with

Page 49: Campus, State, and Regional Grid Issues

Summary of Issues II

• Organization
  – Different challenges for hierarchical vs. committee, funded vs. volunteer
  – Important that everyone involved have the same (or at least similar) goals
  – Work it out ahead of time…
    • Or you’ll spend the first few months of the project doing it
    • It will change, anyway
• Users
  – User-driven infrastructure is very important!
  – Include them from the very beginning…
    • Even if you are doing a narrow technical project
• Resources
  – Need resources to interest users
  – But, need users to get resources…
    • One approach: multidisciplinary teams

Page 50: Campus, State, and Regional Grid Issues

Questions?