Configuration Life-Cycle Management on the TeraGrid
Ti Leggett
Challenges of Managing Computational Resources
• Software, hardware, and user needs change rapidly
• Maintaining uniform resources
• Handling one-offs
• Staying current with patches and security updates
• Documenting how and what machines run
Managing Configurations
• Unattended OS deployment
  – Jumpstart, Kickstart, Yast
• Cluster distributions
  – OSCAR, ROCKS
• Configuration management systems
  – Cfengine, LCFG, Bcfg2
UC/ANL Cluster Configuration Management
• A microcosm of machine classes
• Cluster goals are to maximize availability, predictability and reliability
• Originally used SystemImager to duplicate similar classes
• Switched to Bcfg2 in early 2005
Cluster Uniformity
• Necessary for the user
• Necessary for the administrator
• UC/ANL has two compute classes and many management classes running two different OS versions
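One way to picture class-based uniformity is a small sketch in which every host's expected configuration is derived from the classes it belongs to, so two hosts in the same classes must resolve to the same specification. This is an illustration only, not Bcfg2's data model or API; the class names, host names, package names, and versions below are all hypothetical.

```python
from collections import defaultdict

# Hypothetical class definitions: package name -> expected version.
CLASS_SPECS = {
    "base": {"openssh": "4.2", "ntp": "4.2.0"},
    "compute": {"mpich": "1.2.7"},
    "management": {"nagios": "2.0"},
}

# Hypothetical mapping of hosts to the classes they belong to.
HOST_CLASSES = {
    "c001": ["base", "compute"],
    "c002": ["base", "compute"],
    "mgmt1": ["base", "management"],
}


def resolve_spec(host):
    """Merge the specs of every class a host belongs to, in listed order.

    Later classes override earlier ones when they name the same package.
    """
    spec = {}
    for cls in HOST_CLASSES[host]:
        spec.update(CLASS_SPECS[cls])
    return spec


def group_by_spec(hosts):
    """Group hosts that resolve to an identical specification."""
    groups = defaultdict(list)
    for host in hosts:
        key = tuple(sorted(resolve_spec(host).items()))
        groups[key].append(host)
    return list(groups.values())


if __name__ == "__main__":
    for group in group_by_spec(["c001", "c002", "mgmt1"]):
        print(group)
```

Because the specification is computed from class membership rather than stored per machine, a one-off shows up as a host that lands in a group by itself.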
Security
• Performing security patches
• Auditing cluster status
• Updating machines after extended downtime or maintenance
• Aiding intrusion detection
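Auditing against a specification can be sketched as a comparison between the expected state and what a machine actually reports. The function below is a minimal illustration under assumed inputs, not how Bcfg2 itself performs verification; the package names and versions are hypothetical, and a real audit would also cover files, services, and permissions.

```python
def audit(expected, observed):
    """Compare expected and observed package versions for one machine.

    Both arguments map package name -> version string.
    """
    missing = {p: v for p, v in expected.items() if p not in observed}
    extra = {p: v for p, v in observed.items() if p not in expected}
    wrong_version = {
        p: (expected[p], observed[p])
        for p in expected
        if p in observed and observed[p] != expected[p]
    }
    return {"missing": missing, "extra": extra, "wrong_version": wrong_version}


def is_clean(report):
    """True when the machine matches its specification exactly."""
    return not any(report.values())


if __name__ == "__main__":
    # Hypothetical data: a host that missed a patch while it was down.
    expected = {"openssh": "4.2", "ntp": "4.2.0"}
    observed = {"openssh": "4.1", "telnetd": "0.17"}
    print(audit(expected, observed))
```

An out-of-date version points at a machine that missed a patch, for example after extended downtime, while an unexpected entry is the kind of deviation worth examining when looking for an intrusion.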
Reusability
• Machine failures
  – Disk failures
  – Non-disk failures
• Machine replication
• New machines
Specification as Documentation
• Dealing with administrator absences
• Using version control
• Teaching new administrators
• Dealing with already running and working machines
Future Work
• Reduce dependency on tape backups
• Integrate with tools such as Nagios, Nessus, and iptables
• Integrate with LDAP
Questions?