Managing A Large Farm: CSF Andrew Sansum 26 November 2002


Overview

• Will cover many of the large scale issues associated with big CPU/disk farms

• Intent is to provoke discussion rather than provide answers:
– I don’t claim to be an expert!
– Many RAL solutions are dated but new staff will soon be making changes.

Large Farms: The BIG differences

• BIG is not beautiful:
– A small mistake can proliferate: problems can multiply, many components can become involved.
– THINK before you make changes!
– Manual login on 500 nodes is a major disaster!

• Funding bodies often expect big farms to be run more professionally.

Hardware Specification

• Good quality hardware is vital.

• Go with a reputable company.

• Evaluate quality of solution.

• Check for component compatibility.

• Consider long warranties or be prepared for major interventions yourself (e.g. replace all the fans).

Power Requirements

• Is there enough (steady state)? Right plugs!!

• Cope with surge on power up (think about power sequencing).

• What impact do PSUs have on the power supply (cf. SLAC): neutral current imbalance, higher order harmonics…

• Remote/Automated power up/down is nice (eg APC units)

• Worry about equipment on different phases
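
The power-up surge point can be sketched as staggered power sequencing: nodes are switched on in small batches with a pause between batches. This is only an illustration under assumptions. The batch size and delay are made-up values, and `power_on` is a hypothetical callback standing in for whatever a remote power unit provides; it is not a real APC interface.

```python
import time
from typing import Callable, Iterable, List


def staggered_power_up(
    nodes: Iterable[str],
    power_on: Callable[[str], None],
    batch_size: int = 8,
    delay_s: float = 5.0,
    sleep: Callable[[float], None] = time.sleep,
) -> List[List[str]]:
    """Power nodes on in batches so the inrush current is spread out.

    power_on is a hypothetical callback (e.g. a wrapper around a remote
    power switch); this sketch does not talk to any real hardware.
    Returns the batches in the order they were switched on.
    """
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    node_list = list(nodes)
    batches = [node_list[i:i + batch_size]
               for i in range(0, len(node_list), batch_size)]
    for index, batch in enumerate(batches):
        for node in batch:
            power_on(node)
        # Pause between batches, but not after the last one.
        if index < len(batches) - 1:
            sleep(delay_s)
    return batches


if __name__ == "__main__":
    hosts = [f"node{n:03d}" for n in range(1, 21)]  # hypothetical names
    staggered_power_up(hosts, power_on=lambda h: print("power on", h),
                       batch_size=8, delay_s=0.0)
```

Grouping each batch by rack or by phase would be a natural refinement, given the worry about equipment on different phases.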

Cooling

• Cooling must be sufficient!

• Must be able to cope with local hot spots.

• If cooling fails - things get hot very fast - monitoring/automated shutdown.
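
A minimal sketch of the monitoring/automated shutdown idea, assuming some external probe already supplies a temperature per node. The thresholds (70 °C, 55 °C, 50% of nodes) and function names are illustrative assumptions, not RAL settings.

```python
from typing import Dict, List


def nodes_to_shut_down(temps_c: Dict[str, float],
                       limit_c: float = 70.0) -> List[str]:
    """Return the nodes whose reported temperature is at or above limit_c."""
    return sorted(node for node, t in temps_c.items() if t >= limit_c)


def cooling_failed(temps_c: Dict[str, float],
                   warn_c: float = 55.0,
                   fraction: float = 0.5) -> bool:
    """Guess that room cooling has failed if many nodes are warm at once.

    A single hot node suggests a local hot spot or a dead fan; a large
    fraction of warm nodes suggests the room itself is heating up.
    """
    if not temps_c:
        return False
    warm = sum(1 for t in temps_c.values() if t >= warn_c)
    return warm / len(temps_c) >= fraction
```

Separating the two checks lets a single hot node be shut down on its own, while a room-wide rise can trigger a broader, ordered shutdown.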

Installation

• Netboot/PXE avoids need for manual insertion of floppies.

• Use something like kickstart to:
– Speed up installation task
– Maintain record of configuration
– Allow automated reconfiguration

• LCFG not recommended - but maybe successors?

Configuration Management

• Autorpm is useful for maintaining updates, but update from a locally managed copy so that you control changes!

• Test changes before rolling out!!!!!!!!

• Need to ensure coherent, reproducible configuration - tricky!
– LCFG is good at this but cumbersome
– Kickstart needs great care - update kickstart AND systems independently?
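
One way to check that a node still matches the intended configuration is to compare its package list against a reference list. The sketch below assumes each list is plain text with one `name version` pair per line, which could be produced by something like `rpm -qa --queryformat '%{NAME} %{VERSION}-%{RELEASE}\n'`. The function names and the sample package versions in the usage are illustrative assumptions.

```python
from typing import Dict, List


def parse_package_list(text: str) -> Dict[str, str]:
    """Parse lines of the form 'name version' into a dict; skip other lines."""
    packages: Dict[str, str] = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) == 2:
            packages[parts[0]] = parts[1]
    return packages


def config_drift(reference: Dict[str, str],
                 node: Dict[str, str]) -> Dict[str, List]:
    """Report how a node's packages differ from the reference set."""
    missing = sorted(p for p in reference if p not in node)
    extra = sorted(p for p in node if p not in reference)
    mismatched = sorted(
        (p, reference[p], node[p])
        for p in reference
        if p in node and node[p] != reference[p]
    )
    return {"missing": missing, "extra": extra, "mismatched": mismatched}


if __name__ == "__main__":
    ref = parse_package_list("openssh 3.1p1-6\nkernel 2.4.18-3\n")
    seen = parse_package_list("openssh 3.1p1-3\nkernel 2.4.18-3\n")
    print(config_drift(ref, seen))
```

Running such a comparison across the farm after each rollout gives a quick view of which nodes have fallen behind the kickstart definition.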

Management Tools

• Very simple at RAL: a local parallel ssh.

• Parallel rsh/ssh commands: prsh seems popular.

• Project C3 seems worth a look

• Oscar bundles many interesting tools together
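
For illustration, a parallel ssh wrapper can be very small. The sketch below is not the RAL tool, prsh or C3; it is a minimal example that assumes key-based, passwordless ssh access to each node. The function names are hypothetical, and the `runner` parameter exists so the logic can be exercised without touching real hosts.

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, Iterable, List, Tuple

Result = Tuple[int, str]  # (exit status, combined output)


def run_ssh(host: str, command: str, timeout_s: float = 30.0) -> Result:
    """Run command on host with the ssh client; never prompt for a password."""
    try:
        proc = subprocess.run(
            ["ssh", "-o", "BatchMode=yes", host, command],
            capture_output=True, text=True, timeout=timeout_s,
        )
    except subprocess.TimeoutExpired:
        return (-1, "timed out")
    except OSError as exc:  # e.g. ssh client not installed
        return (-1, str(exc))
    return (proc.returncode, proc.stdout + proc.stderr)


def parallel_run(hosts: Iterable[str], command: str,
                 runner: Callable[[str, str], Result] = run_ssh,
                 max_workers: int = 32) -> Dict[str, Result]:
    """Run the same command on every host, max_workers at a time."""
    host_list = list(hosts)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        results = list(pool.map(lambda h: runner(h, command), host_list))
    return dict(zip(host_list, results))


def failed_hosts(results: Dict[str, Result]) -> List[str]:
    """Hosts whose command exited non-zero (or could not be reached)."""
    return sorted(h for h, (status, _) in results.items() if status != 0)
```

Collecting the results per host, rather than printing them as they arrive, makes it easy to list only the nodes that failed, which is usually what matters on a farm of several hundred machines.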

Exception monitoring

• Need to spot problems before users do.

• Run daemon or crontab checking for errors. On detection:
– Notify: SURE, Bigbrother,... (not email!)
– Automated fixup (daemon restart, filesystem cleanup ...)
– Automated drain/remove from configuration. Automated power down/up. Automated DNS updates.
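
The detect, notify, fix up, drain sequence can be sketched as a small check runner suitable for calling from cron. Everything here is an assumption for illustration: the `Check` structure and the `notify` and `drain` callbacks are hypothetical hooks, not interfaces to SURE, Bigbrother or any batch system.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List, Optional


@dataclass
class Check:
    name: str
    is_healthy: Callable[[], bool]
    fixup: Optional[Callable[[], None]] = None  # e.g. restart a daemon


def run_checks(checks: Iterable[Check],
               notify: Callable[[str], None],
               drain: Callable[[List[str]], None]) -> List[str]:
    """Run every check; notify, attempt a fixup, and drain if still broken.

    Returns the names of checks that are still failing after any fixup.
    """
    still_failing: List[str] = []
    for check in checks:
        if check.is_healthy():
            continue
        notify(f"{check.name}: check failed")
        if check.fixup is not None:
            check.fixup()
            if check.is_healthy():
                notify(f"{check.name}: recovered after automated fixup")
                continue
        still_failing.append(check.name)
    if still_failing:
        # Hypothetical hook: take the node out of the batch configuration.
        drain(still_failing)
    return still_failing
```

Re-testing after the fixup means a node is only drained when the automated repair did not work.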

Incident Tracking

• Keep track of significant interventions.
– Which hosts keep crashing. Dates, times, errors etc.
– What disks failed - serial numbers of returns - returns outstanding ...

• Keep track of tasks outstanding: eg: why is csflnx231 currently offline - who is fixing it ...
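
Even a very small structured log answers both questions: which hosts keep crashing, and what is still outstanding. The record layout and function names below are illustrative assumptions, not an existing RAL tool.

```python
from collections import Counter
from dataclasses import dataclass
from datetime import date
from typing import Iterable, List, Optional, Tuple


@dataclass
class Incident:
    host: str
    day: date
    kind: str                    # e.g. "crash", "disk", "offline"
    note: str = ""
    owner: Optional[str] = None  # who is fixing it
    closed: bool = False


def repeat_offenders(incidents: Iterable[Incident], kind: str = "crash",
                     threshold: int = 2) -> List[Tuple[str, int]]:
    """Hosts with at least `threshold` incidents of `kind`, worst first."""
    counts = Counter(i.host for i in incidents if i.kind == kind)
    return sorted(((h, n) for h, n in counts.items() if n >= threshold),
                  key=lambda item: (-item[1], item[0]))


def open_incidents(incidents: Iterable[Incident]) -> List[Incident]:
    """Incidents that nobody has closed yet."""
    return [i for i in incidents if not i.closed]
```

An incident with `closed=False` and no `owner` is exactly the "who is fixing it" case from the slide.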

Hardware Management

• Many systems eventually means:
– Many system crashes.
– Many hardware failures.

• Consider purchasing a 3-year warranty. On-site is easier.

• Define a standard hardware (re)certification procedure. Make use of junior staff (operators, postgrads, gran, ...!)

Utilisation/Capacity planning

• Monitor everything you can conveniently manage.
– MRTG is standard network monitoring
– Ganglia appears to be popular for system utilisation etc.
– PBS accounting records (or process accounting).
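
As an example of turning accounting records into utilisation figures, the sketch below totals walltime per user. It assumes the semicolon-separated layout `date time;record type;job id;key=value ...`, with job-end records marked `E` and a `resources_used.walltime=HH:MM:SS` attribute. Check this against the local PBS version before relying on it. The sample lines in the usage are invented.

```python
from collections import defaultdict
from typing import Dict, Iterable


def hms_to_seconds(text: str) -> int:
    """Convert 'HH:MM:SS' to seconds."""
    hours, minutes, seconds = (int(part) for part in text.split(":"))
    return hours * 3600 + minutes * 60 + seconds


def walltime_by_user(lines: Iterable[str]) -> Dict[str, int]:
    """Sum walltime (in seconds) per user over job-end ('E') records."""
    totals: Dict[str, int] = defaultdict(int)
    for line in lines:
        fields = line.rstrip("\n").split(";", 3)
        if len(fields) != 4 or fields[1] != "E":
            continue
        # Attributes are space-separated key=value pairs; skip anything else.
        attrs = dict(item.split("=", 1)
                     for item in fields[3].split() if "=" in item)
        user = attrs.get("user")
        walltime = attrs.get("resources_used.walltime")
        if user and walltime:
            totals[user] += hms_to_seconds(walltime)
    return dict(totals)


if __name__ == "__main__":
    sample = [  # invented records for illustration
        "11/26/2002 10:00:00;E;101.csf;user=alice resources_used.walltime=01:00:00",
        "11/26/2002 12:30:00;E;102.csf;user=bob resources_used.walltime=00:30:15",
    ]
    print(walltime_by_user(sample))
```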

Conclusions

• Careful planning, specification and hardware selection can pay dividends.

• Get smart or invest in lots of staff.

• Monitor so you know what is going on.

• Many issues raised - few solutions offered. Wide range of experience out in the UK HEPSYSMAN community. Make use of it!
