Software Scalability Issues in Large Clusters
CHEP2003 – San Diego
March 24-28, 2003
A. Chan, R. Hogue, C. Hollowell, O. Rind,
T. Throwe, T. Wlodek
RHIC Computing Facility
Brookhaven National Laboratory
Background
Rapid development of large clusters built with affordable commodity hardware
Need to address software scalability issues in deploying and effectively operating large clusters
Critical for the efficient operation of the 2000+ CPU cluster in the Linux Farm at the RCF
The rapid growth of the Linux Farm
[Chart: Linux Farm installed capacity in kSPECint2000 by year, 1999-2003]
Hardware in the Linux Farm
Brand     CPU      RAM       Disk       Quantity
VA Linux  450 MHz  0.5-1 GB  9-120 GB   154
VA Linux  700 MHz  0.5 GB    9-36 GB    48
VA Linux  800 MHz  0.5-1 GB  18-480 GB  168
IBM       1.0 GHz  0.5-1 GB  18-144 GB  315
IBM       1.4 GHz  1 GB      36-144 GB  160
IBM       2.4 GHz  1 GB      240 GB     252
Monitoring
Mix of open-source, staff-designed and vendor-provided monitoring software
Software redesigned for scalability in large clusters (push vs. pull model; sketch after this list)
Persistence and fault-tolerance features
Near real-time information
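
As a minimal illustration of the push model (not the RCF's actual monitoring code), each node can periodically send its own metrics to a central collector over UDP, so the collector never has to poll thousands of hosts; the collector address, port, and metric set here are assumptions.

    # Minimal sketch of push-model monitoring; collector host/port are
    # hypothetical. Each node reports itself, so the central server avoids
    # the O(N) polling cost of a pull model and data stays near real-time.
    import os
    import socket
    import time

    COLLECTOR = ("monitor.example.gov", 9999)  # assumed collector address

    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    while True:
        load1, load5, _ = os.getloadavg()
        msg = "%s load1=%.2f load5=%.2f" % (socket.gethostname(), load1, load5)
        sock.sendto(msg.encode(), COLLECTOR)  # fire-and-forget datagram
        time.sleep(60)  # one report per node per minute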
Image Distribution in the Linux Farm
NFS-based image distribution system until 2001 – not scalable
Switched to a Web-based Red Hat KickStart installer (sketch below)
Fast and scalable (20 minutes/server, with hundreds of servers at a time)
Highly configurable (multiple images, build options, etc.)
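
The web-based approach can be sketched as follows (an illustration only, not the installer the RCF deployed): a small HTTP service hands each requesting node a KickStart file chosen by its address, which is one way multiple images and build options can coexist. The profile map, file names, and port are invented.

    # Sketch: serve per-host KickStart files over HTTP. The profile map,
    # file names, and port are hypothetical; a node's boot line would point
    # at http://<server>:8080/ and receive the ks.cfg matching its subnet.
    from http.server import BaseHTTPRequestHandler, HTTPServer

    PROFILES = {"130.199.": "ibm-2.4ghz.ks"}  # assumed subnet -> image map
    DEFAULT = "generic.ks"

    class KickStartHandler(BaseHTTPRequestHandler):
        def do_GET(self):
            ip = self.client_address[0]
            name = next((f for p, f in PROFILES.items() if ip.startswith(p)),
                        DEFAULT)
            with open(name, "rb") as f:  # one KickStart file per build option
                body = f.read()
            self.send_response(200)
            self.send_header("Content-Type", "text/plain")
            self.end_headers()
            self.wfile.write(body)

    HTTPServer(("", 8080), KickStartHandler).serve_forever()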
Database Systems
MySQL widely used throughout the RCF
Chosen for its open-source nature
General monitoring & control (cluster, infrastructure, batch, storage, etc.)
Flexible and scalable for lightweight operations (sketch below)
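
A lightweight monitoring insert might look like the following sketch; the database host, credentials, and table schema are assumptions, and the MySQLdb driver stands in for whatever client the RCF actually used.

    # Sketch: record a node's state in MySQL. Host, credentials, and the
    # node_status table are assumed for illustration; short transactions
    # like this are what keeps each operation lightweight.
    import time
    import MySQLdb  # MySQL-python client library

    conn = MySQLdb.connect(host="db.example.gov", user="monitor",
                           passwd="secret", db="farm")
    cur = conn.cursor()
    cur.execute("INSERT INTO node_status (node, state, checked)"
                " VALUES (%s, %s, %s)",
                ("rcas2001", "up", int(time.time())))
    conn.commit()
    cur.close()
    conn.close()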
Other System Administration Tools
Python-based scripts for fast, parallel access to multiple servers (sketch after this list)
Python-based scripts for emergency remote power management of infrastructure
Vendor-provided scalable, remote power management software
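
A minimal sketch of the parallel-access idea, assuming passwordless ssh and a placeholder node list: a thread pool fans one command out to every node, so wall time tracks the slowest host rather than the sum of all hosts.

    # Sketch: run one command on many farm nodes in parallel over ssh.
    # Node names and the command are placeholders; real scripts would read
    # the node list from the cluster database.
    import subprocess
    from concurrent.futures import ThreadPoolExecutor

    NODES = ["rcas2001", "rcas2002", "rcas2003"]  # placeholder node list

    def run(node, cmd="uptime"):
        out = subprocess.run(["ssh", node, cmd],
                             capture_output=True, text=True, timeout=30)
        return node, out.stdout.strip()

    with ThreadPoolExecutor(max_workers=50) as pool:
        for node, result in pool.map(run, NODES):
            print("%s: %s" % (node, result))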