Managing A Large Farm: CSF
Andrew Sansum, 26 November 2002
Overview
• Will cover many of the large scale issues associated with big CPU/disk farms
• Intent is to provoke discussion rather than provide answers:
– I don’t claim to be an expert!
– Many RAL solutions are dated, but new staff will soon be making changes.
Large Farms: The BIG Differences
• BIG is not beautiful:
– A small mistake can proliferate: problems can multiply, and many components can become involved.
– THINK before you make changes!
– A manual login to 500 nodes is a major disaster!
• Funding bodies often expect big farms to be run more professionally.
Hardware Specification
• Good-quality hardware is vital.
• Go with a reputable company.
• Evaluate the quality of the solution.
• Check for component compatibility.
• Consider long warranties, or be prepared for major interventions yourself (e.g. replacing all the fans).
Power Requirements
• Is there enough power (steady state)? The right plugs!!
• Cope with the surge on power-up (think about power sequencing).
• What impact do PSUs have on the power supply (cf. SLAC)? Neutral-current imbalance, higher-order harmonics, ...
• Remote/automated power up/down is nice (e.g. APC units).
• Worry about equipment on different phases.
Cooling
• Cooling must be sufficient!
• It must be able to cope with local hot spots.
• If cooling fails, things get hot very fast: you need monitoring and automated shutdown.
Installation
• Netboot/PXE avoids need for manual insertion of floppies.
• Use something like kickstart to:
– speed up the installation task,
– maintain a record of the configuration,
– allow automated reconfiguration.
• LCFG is not recommended, but maybe its successors will be?
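A kickstart file captures the whole build in one place, which is what makes it a configuration record as well as an installer. A minimal sketch of the idea, for a Red Hat system of this era (the URL, partition sizes, and version stamp are all invented for illustration):

```
# Illustrative kickstart fragment (all values invented)
install
url --url http://install-server/redhat    # install from a local mirror
clearpart --all --initlabel
part /    --size 4096
part swap --size 512
%packages
@ Workstation
%post
# Stamp the node with the kickstart version that built it, so the
# configuration record survives on the machine itself.
echo "ks-farm-1.0" > /etc/ks-release
```

Combined with PXE netboot, the same file rebuilds any node unattended, so reconfiguration becomes "edit the file and reinstall" rather than a manual visit.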
Configuration Management
• Autorpm is useful for maintaining updates, but update from a local managed copy: control your changes!
• Test changes before rolling them out!
• Need to ensure a coherent, reproducible configuration: tricky!
– LCFG is good at this but cumbersome.
– Kickstart needs great care: must you update the kickstart AND the systems independently?
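One cheap way to catch incoherent configurations is a nightly drift check: diff each node's installed-package list against a golden list held on the master. A minimal sketch (the filenames are hypothetical; `comm` requires both lists to be sorted):

```shell
# config_drift GOLDEN NODE
# Prints packages that appear in only one of the two sorted lists:
# lines unique to GOLDEN are missing from the node, lines unique to
# NODE (tab-indented) are unexpected extras.
config_drift() {
    comm -3 "$1" "$2"
}

# Typical use on a node:  rpm -qa | sort > node.txt
# then compare:           config_drift golden.txt node.txt
```

An empty result means the node matches the golden configuration; anything printed is a candidate for automated fixup or an incident record.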
Management Tools
• Very simple at RAL: local parallel ssh.
• Parallel rsh/ssh commands: prsh seems popular.
• Project C3 seems worth a look.
• OSCAR bundles many interesting tools together.
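The core of all these tools is the same fan-out loop. A minimal sketch of what prsh or C3's cexec automate (hostnames one per line on stdin; passwordless keys assumed; `SSH` can be overridden, e.g. `SSH=echo`, for a dry run):

```shell
# run_all CMD
# Run CMD on every host read from stdin, in parallel, tagging each
# line of output with the host it came from.
run_all() {
    cmd="$1"
    while read -r host; do
        # Background each ssh so 500 nodes run concurrently;
        # BatchMode stops a dead node hanging on a password prompt.
        ${SSH:-ssh -o BatchMode=yes} "$host" "$cmd" 2>&1 \
            | sed "s/^/$host: /" &
    done
    wait    # collect every background job before returning
}

# Typical use:  run_all uptime < nodes.txt
```

Real tools add what this sketch lacks: throttling (500 simultaneous ssh sessions is itself a load problem), timeouts, and collated output.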
Exception monitoring
• Need to spot problems before users do.
• Run a daemon or crontab job checking for errors. On detection:
– Notify: SURE, Big Brother, ... (not email!)
– Automated fixup (daemon restart, filesystem cleanup, ...).
– Automated drain/removal from the configuration, automated power down/up, automated DNS updates.
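A crontab health check can be as small as an awk filter over `df`. A sketch of the filesystem-full case (the 90% threshold is an invented example; on a real farm the output would feed SURE or Big Brother, not email):

```shell
# check_disk LIMIT
# Reads `df -P`-style output on stdin and flags any filesystem more
# than LIMIT percent full.  Field 5 is the capacity ("95%"); adding 0
# coerces it to a number.  Field 6 is the mount point.
check_disk() {
    awk -v lim="$1" 'NR > 1 && $5 + 0 > lim { print "FULL:", $6, $5 }'
}

# Typical crontab use:  df -P | check_disk 90
```

The same pattern (filter, then notify or fix) extends to daemon checks, e.g. `pgrep pbs_mom || restart-and-notify`.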
Incident Tracking
• Keep track of significant interventions:
– Which hosts keep crashing: dates, times, errors, etc.
– Which disks failed: serial numbers of returns, returns still outstanding, ...
• Keep track of outstanding tasks, e.g. why is csflnx231 currently offline, and who is fixing it?
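Even without a ticket system, an append-only log answers the "why is this node offline" question. A trivial sketch (the log filename is hypothetical; a real site would use a proper tracking tool):

```shell
# log_incident HOST NOTE
# Appends a timestamped, grep-able record of an intervention.
log_incident() {
    printf '%s %s %s\n' \
        "$(date -u +%Y-%m-%dT%H:%M:%SZ)" "$1" "$2" >> incidents.log
}

# Typical use:
#   log_incident csflnx231 "offline: disk replaced, awaiting fsck"
# Later:  grep csflnx231 incidents.log   shows the host's full history.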
Hardware Management
• Many systems eventually means:
– many system crashes,
– many hardware failures.
• Consider purchasing a 3-year warranty. On-site service is easier.
• Define a standard hardware (re)certification procedure. Make use of junior staff (operators, postgrads, gran, ...!).
Utilisation/Capacity planning
• Monitor everything you can conveniently manage:
– MRTG is the standard for network monitoring.
– Ganglia appears to be popular for system utilisation etc.
– PBS accounting records (or process accounting).
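PBS accounting records are plain text, one event per line, which makes capacity questions answerable with awk. A sketch that counts completed jobs per user, assuming the usual `date;type;jobid;key=value ...` record format ('E' marks a job-end record):

```shell
# jobs_per_user
# Reads PBS accounting records on stdin and prints "user count" for
# each user's completed ('E') jobs.
jobs_per_user() {
    awk -F';' '$2 == "E" {
        # Pull user=NAME out of the attribute field.
        if (match($4, /user=[^ ]+/)) {
            u = substr($4, RSTART + 5, RLENGTH - 5)
            n[u]++
        }
    }
    END { for (u in n) print u, n[u] }'
}

# Typical use:  jobs_per_user < /var/spool/pbs/server_priv/accounting/20021126
```

The same one-pass pattern extends to summing walltime or CPU time per group, which is what funding bodies usually ask for.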
Conclusions
• Careful planning, specification and hardware selection can pay dividends.
• Get smart, or invest in lots of staff.
• Monitor, so you know what is going on.
• Many issues raised, few solutions offered: there is a wide range of experience out in the UK HEPSYSMAN community. Make use of it!