Implementing an Industrial-Strength Academic Cyberinfrastructure at Purdue University

Preston Smith <[email protected]>

PCGRID ‘08 Workshop, Miami, FL, April 18, 2008

BoilerGrid

• Introduction
  – Environment
  – Motivation
• Challenges
  – Infrastructure
  – Usage Tracking
  – Storage
  – Staffing
• Future Work
• Results

BoilerGrid - Growth

How did we get from here… to here?

BoilerGrid - Rosen Center for Advanced Computing

• Research computing arm of ITaP - Information Technology at Purdue
• Clusters in RCAC are arranged into larger “Community Clusters”
  – One cluster, one configuration, many owners
  – Leverages economies of scale for purchasing, and provides expertise in systems engineering, user support, and networking

BoilerGrid - Motivation

• Early on, we recognized that the diverse owners of the community clusters don’t use the machines at 100% capacity
  – Community clusters used approximately 70% of capacity
  – Condor was installed on the community clusters to cycle-scavenge from PBS, the primary scheduler (see the policy sketch below)
• Goal: provide a general-purpose high-throughput computing resource on existing hardware
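A cycle-scavenging arrangement like this can be expressed in the startd configuration. A minimal sketch, assuming a custom machine attribute (here called PBSRunning, flipped by a PBS prolog/epilog script; the attribute name is ours, not a Condor built-in):

    # Advertise whether PBS currently owns this node; a PBS
    # prolog/epilog script would rewrite this value
    # (PBSRunning is an assumed custom attribute)
    PBSRunning = False
    STARTD_ATTRS = PBSRunning

    # Start Condor jobs only while PBS is idle on this node,
    # and evict them immediately when PBS claims it
    START   = (PBSRunning == False)
    PREEMPT = (PBSRunning == True)
    KILL    = $(PREEMPT)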

BoilerGrid - Challenges

• In 2005, the Condor deployment at Purdue was unable to scale to the size of the clusters, and ran on an old version of the software

• An overhaul of the Condor infrastructure was needed!

BoilerGrid - Keep Condor Up to Date

• Upgrading Condor
  – In late 2005, we were running Condor version 6.6.5, which was 1.5 years old
  – First, we needed to upgrade!
• In a large, busy Condor grid, we found it’s usually advantageous to run the development release of Condor
  – Early access to new features and scalability improvements

BoilerGrid - Pool Design

• Use many machines
  – In 2005, we ran a single Condor pool with ~1800 machines
• At the time, the largest single Condor pools in existence were ~1000 machines
  – We implemented BoilerGrid as a flock of 4 pools of up to 1200 machines each (see the flocking sketch below)
  – Implementing BoilerGrid today would look much different!
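Flocking between pools is standard Condor configuration; a minimal sketch, with placeholder hostnames standing in for the real machines:

    # On each submit host: overflow jobs to the other pools'
    # central managers when the local pool cannot run them
    # (hostnames are placeholders)
    FLOCK_TO = cm-b.rcac.purdue.edu, cm-c.rcac.purdue.edu, cm-d.rcac.purdue.edu

    # On each pool's central manager: accept flocked jobs
    # from the submit hosts in the other pools
    FLOCK_FROM = submit-b.rcac.purdue.edu, submit-c.rcac.purdue.edu, submit-d.rcac.purdue.edu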

BoilerGrid - Submit Hosts

• Many submit hosts
  – In 2005, a single host ran the Condor schedd and could submit jobs
  – Today, any machine in RCAC for user login, and in many cases end-user desktops, can submit Condor jobs

BoilerGrid - Challenges

• Usage Tracking
  – Tracking job-level accounting in a large Condor pool is difficult
  – Job history resides on every submit host
  – Recent versions of Condor’s Quill software allow for a central database holding job (and machine) information
    • Deploying this on BoilerGrid now (see the configuration sketch below)
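A rough sketch of the per-submit-host Quill settings, following the Quill documentation of that era (database host, name, and password are placeholders; exact knob names may vary by Condor version):

    # Run the Quill daemon alongside the schedd on each submit host
    DAEMON_LIST = $(DAEMON_LIST), QUILL

    # Point every submit host at one central PostgreSQL database
    QUILL_ENABLED    = TRUE
    QUILL_NAME       = quill@$(FULL_HOSTNAME)
    QUILL_DB_TYPE    = PGSQL
    QUILL_DB_NAME    = quill
    QUILL_DB_IP_ADDR = quill-db.rcac.purdue.edu:5432
    QUILL_DB_QUERY_PASSWORD = changeme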

BoilerGrid - Storage

• If your users expect to run jobs against a shared filesystem, a large Condor installation can overwhelm NFS servers
• DAGMan and user logs on NFS can cause problems
  – The defaults don’t allow this for a reason!
• Train users to rely less on the shared filesystem and to take advantage of Condor’s ability to transfer files (see the submit-file sketch below)
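A minimal vanilla-universe submit file using Condor’s file transfer instead of a shared filesystem (the executable and file names are placeholders):

    # Ship input and output with the job rather than touching NFS
    universe   = vanilla
    executable = analyze
    arguments  = input.dat

    should_transfer_files   = YES
    when_to_transfer_output = ON_EXIT
    transfer_input_files    = input.dat

    # Keep the user log on local disk, not NFS
    output = analyze.out
    error  = analyze.err
    log    = /tmp/analyze.log

    queue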

BoilerGrid - Expansion

• Successful use of Condor in clusters led us to identify partners around campus
  – Student computer labs operated by a sister unit in ITaP (2500 machines and growing)
  – Library terminals (200 machines)
  – Other campuses (500+ machines)
• Management support is critical!
  – Purdue’s CIO supports using Condor on many machines run by ITaP, including the one on his own desk

BoilerGrid - Expansion

• An even better route of expansion: Condor users adding their own resources
  – Machines in their own lab
  – All the machines in their department
• With distributed ownership come new challenges
  – Regular contact with owners’ system administration staff
  – Ensure that owners are able to set their own policies (an example policy follows)
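For instance, the classic Condor desktop-owner policy runs jobs only while the console is idle (the thresholds here are illustrative, not Purdue’s actual values):

    # Start jobs only after 15 minutes of console idle time and
    # low load; back off as soon as the owner returns
    START    = KeyboardIdle > 15 * 60 && LoadAvg < 0.3
    SUSPEND  = KeyboardIdle < 60
    CONTINUE = KeyboardIdle > 15 * 60
    # Evict a job that has been suspended for more than 10 minutes
    PREEMPT  = (Activity == "Suspended") && \
               (CurrentTime - EnteredCurrentActivity > 10 * 60)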

BoilerGrid - Staffing

• Implementing BoilerGrid required minimal staff effort
  – Assuming an IT infrastructure already exists that can operate many machines
  – 0.25 FTE ongoing to maintain Condor and coordinate with distributed Condor installations
• With success comes more demand, and the end-user support to go along with it
  – 1.5 FTE science support consultants assist with porting codes and training users to use Condor effectively

BoilerGrid - Future Work

• TeraGrid (NSF HPCOPS): portal for submission and monitoring of Condor jobs
• Centralized Quill database for job and machine state
  – An excellent source of data for future research in distributed systems

BoilerGrid - Results

Year   Pool Size   Jobs        Hours Delivered   Unique Users
2004   1500        43,551      346,000           14
2005   4000        210,717     1,695,000         26
2006   6100        4,251,981   5,527,000         72
2007   7700        9,611,813   9,524,000         117
2008   13000+      ?           ?                 63 so far


BoilerGrid - Conclusions

• Condor is a powerful tool for getting real science done on otherwise unused hardware

http://www.rcac.purdue.edu/boilergrid

Questions?