Software Scalability Issues in Large Clusters
CHEP2003 – San Diego
March 24-28, 2003
A. Chan, R. Hogue, C. Hollowell, O. Rind,
T. Throwe, T. Wlodek
RHIC Computing Facility
Brookhaven National Laboratory
Background
Rapid development of large clusters built with affordable commodity hardware
Need to address software scalability issues in deploying and effectively operating large clusters
Critical for the efficient operation of the 2000+ CPU cluster in the Linux Farm at the RCF
The rapid growth of the Linux Farm
[Chart: Linux Farm installed capacity in kSPECint2000 by year, 1999-2003]
Hardware in the Linux Farm
Brand     CPU      RAM       Disk       Quantity
VA Linux  450 MHz  0.5-1 GB  9-120 GB   154
VA Linux  700 MHz  0.5 GB    9-36 GB    48
VA Linux  800 MHz  0.5-1 GB  18-480 GB  168
IBM       1.0 GHz  0.5-1 GB  18-144 GB  315
IBM       1.4 GHz  1 GB      36-144 GB  160
IBM       2.4 GHz  1 GB      240 GB     252
Monitoring
Mix of open-source, staff-designed and vendor-provided monitoring software
Software redesigned for scalability in large clusters (push vs. pull model; sketch after this list)
Persistence and fault-tolerance features
Near real-time information
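
As a minimal illustration of the push model (not the RCF's actual monitoring code), each node can periodically send its own metrics to a central collector over UDP, so the collector never has to poll thousands of hosts; the collector address, port, and metric set here are assumptions.

    # Minimal sketch of push-model monitoring; collector host/port are
    # hypothetical. Each node reports itself, so the central server avoids
    # the O(N) polling cost of a pull model and data stays near real-time.
    import os
    import socket
    import time

    COLLECTOR = ("monitor.example.gov", 9999)  # assumed collector address

    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    while True:
        load1, load5, _ = os.getloadavg()
        msg = "%s load1=%.2f load5=%.2f" % (socket.gethostname(), load1, load5)
        sock.sendto(msg.encode(), COLLECTOR)  # fire-and-forget datagram
        time.sleep(60)  # one report per node per minute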
Image Distribution in the Linux Farm
NFS-based image distribution system until 2001 – not scalable
Switched to a Web-based Red Hat KickStart installer (sketch below)
Fast and scalable (20 minutes/server, with hundreds of servers at a time)
Highly configurable (multiple images, build options, etc.)
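
The web-based approach can be sketched as follows (an illustration only, not the installer the RCF deployed): a small HTTP service hands each requesting node a KickStart file chosen by its address, which is one way multiple images and build options can coexist. The profile map, file names, and port are invented.

    # Sketch: serve per-host KickStart files over HTTP. The profile map,
    # file names, and port are hypothetical; a node's boot line would point
    # at http://<server>:8080/ and receive the ks.cfg matching its subnet.
    from http.server import BaseHTTPRequestHandler, HTTPServer

    PROFILES = {"130.199.": "ibm-2.4ghz.ks"}  # assumed subnet -> image map
    DEFAULT = "generic.ks"

    class KickStartHandler(BaseHTTPRequestHandler):
        def do_GET(self):
            ip = self.client_address[0]
            name = next((f for p, f in PROFILES.items() if ip.startswith(p)),
                        DEFAULT)
            with open(name, "rb") as f:  # one KickStart file per build option
                body = f.read()
            self.send_response(200)
            self.send_header("Content-Type", "text/plain")
            self.end_headers()
            self.wfile.write(body)

    HTTPServer(("", 8080), KickStartHandler).serve_forever()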
Database Systems
MySQL widely used throughout the RCF
Chosen for its open-source nature
General monitoring & control (cluster, infrastructure, batch, storage, etc.)
Flexible and scalable for lightweight operations (sketch below)
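
A lightweight monitoring insert might look like the following sketch; the database host, credentials, and table schema are assumptions, and the MySQLdb driver stands in for whatever client the RCF actually used.

    # Sketch: record a node's state in MySQL. Host, credentials, and the
    # node_status table are assumed for illustration; short transactions
    # like this are what keeps each operation lightweight.
    import time
    import MySQLdb  # MySQL-python client library

    conn = MySQLdb.connect(host="db.example.gov", user="monitor",
                           passwd="secret", db="farm")
    cur = conn.cursor()
    cur.execute("INSERT INTO node_status (node, state, checked)"
                " VALUES (%s, %s, %s)",
                ("rcas2001", "up", int(time.time())))
    conn.commit()
    cur.close()
    conn.close()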
Other System Administration Tools
Python-based scripts for fast, parallel access to multiple servers (sketch after this list)
Python-based scripts for emergency remote power management of infrastructure
Vendor-provided scalable, remote power management software
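
A minimal sketch of the parallel-access idea, assuming passwordless ssh and a placeholder node list: a thread pool fans one command out to every node, so wall time tracks the slowest host rather than the sum of all hosts.

    # Sketch: run one command on many farm nodes in parallel over ssh.
    # Node names and the command are placeholders; real scripts would read
    # the node list from the cluster database.
    import subprocess
    from concurrent.futures import ThreadPoolExecutor

    NODES = ["rcas2001", "rcas2002", "rcas2003"]  # placeholder node list

    def run(node, cmd="uptime"):
        out = subprocess.run(["ssh", node, cmd],
                             capture_output=True, text=True, timeout=30)
        return node, out.stdout.strip()

    with ThreadPoolExecutor(max_workers=50) as pool:
        for node, result in pool.map(run, NODES):
            print("%s: %s" % (node, result))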