HEPiX Trip Report
Jefferson Laboratory, 9-13 October 2006
Martin Bly – RAL Tier1
HEPSysMan – Cambridge
23 October 2006
Introduction
• Site issues
• Subject talks
Sites: CERN
• Successfully negotiated new LCG-wide licences for Oracle
• All physics databases now migrated to Oracle RAC hosting
• SLC4 for LHC start-up; SLC3 support ends October 2007
• Lemon Alarm System (LAS) replacing SURE
• Central CVS service running well
  – Looking at Subversion
• First Opteron systems in the CERN CC
• Insecure mail protocols forbidden/blocked
  – POP/IMAP etc. must use SSL (see the sketch below)
• No compromise on the performance of disk servers in order to get 'fat' systems
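A minimal sketch of what an SSL-only POP/IMAP policy can look like, assuming a Dovecot 1.x server (the talk did not say which mail server CERN uses, so this is illustrative only):

    # dovecot.conf -- serve only the SSL-wrapped protocol variants
    protocols = imaps pop3s                       # plain imap/pop3 disabled
    ssl_cert_file = /etc/ssl/certs/dovecot.pem    # certificate paths are
    ssl_key_file = /etc/ssl/private/dovecot.pem   # illustrative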
Sites: FermiLab
• Multiple 10 Gb/s connections to StarLight
• Efforts to automate computer security
  – Replace home-grown tools with commercial utilities
• New computer rooms
  – Overhead power and networking
  – Plastic curtains to trap cold air in front of machines
• US-CMS
  – 700 TB dCache space
    • Expected to be 2.5 PB by autumn 2007
  – 700-node cluster expanding to 1600 nodes
• BlueArc NAS for online storage
  – Expensive…
Sites: GridKa
• Issues with recent Opteron procurement
  – MSI K1-1000D motherboards, AMD Opteron 270s
    • BIOS issues; BMC and NIC firmware updates needed
• Issues with water-cooled racks traced to leaks in the chillers
• NEC supplying 4500 TB of storage
  – 28 storage controllers, RAID 6, 60 file servers
• Report on latest benchmarks
  – Woodcrest performs very well
Sites: NERSC
National Energy Research Scientific Computing Center, Berkeley
• NERSC Global Filesystem (NGF) in production
  – 70 TB of project file space (subject of a separate talk)
  – Aim to procure 'just storage'
• 10 Gb/s internal/external networks
  – 10 Gb/s 'jumbo'-frame network
• Cray 'Hood' system
  – 19000+ CPUs, 70 TB disk, 102 cabinets
• Nagios for monitoring, being extended to the Cray
• Computer room full; more power and space needed
Sites: INFN
• 10 Gb/s link to the GARR backbone
• T2s now at 1 Gb/s
• GPFS now robust enough to be adopted by many sites
  – Lustre also being tested by a few sites
• Testing iSCSI
  – Satisfactory but not completely satisfying
  – Looking at a new EMC device and home-grown solutions to try to resolve the issues
Sites: GSI Darmstadt
• Issues with large storage farm
  – 100 of 120 nodes failed to boot after the move to new racks
    • Had been OK for the previous 6 months in the old racks
  – Traced to vibration resonance between the disk and CPU cooling fans
• Issues with cooling in racks
  – Keep the cold and warm air flows separate
    • Blanking plates are important
Sites: SLAC
• SLAC now a US-ATLAS site
  – Procurements to start soon
• Non-HEP experiment computing building up
• Many old clusters being decommissioned to make space
• Plan for a 150/200-node InfiniBand cluster
  – Model checkpointing is a challenge
• Testing Lustre
• Need to move away from AFS (K4) token passing
  – SSH/K5 with GSSAPI to pass K5 tickets (see the sketch below)
• New wireless registration scheme so that users can be contacted should their machine cause problems
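Forwarding K5 tickets over SSH uses standard OpenSSH options. A minimal client-side sketch; the host pattern is hypothetical, as the talk gave no configuration details:

    # ~/.ssh/config -- authenticate via GSSAPI/Kerberos 5 and delegate
    # (forward) the K5 ticket to the remote host
    Host *.slac.stanford.edu
        GSSAPIAuthentication yes
        GSSAPIDelegateCredentials yes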
Sites: INFN-CNAF
• CPU capacity upgrade delayed while the cooling system is upgraded, following cooling problems during the summer
• Using Quattor/Lemon
  – CERN customisations sometimes a problem
• Staying with SLC3 (v3.0.8 supports Woodcrest)
• Will move to SLC4 when EGEE does
Sites: LAL
• VMware still the preferred Linux-on-desktop solution
• Installed gLite3 on SL4 without modification
• Using Quattor and Lemon
  – Having removed the CERN-specific parts
Sites: General
• Moving to specifying computing capacity requirements for CPUs in performance terms
  – Needs 'common' benchmarking
    • Require vendors to do it (and prove it!)
• Corresponding interest in benchmarking, and in how to do it so that the results mean something
• 10 Gb/s links now very common
• Big Condor pools in use at some sites
• Waiting for Grid middleware to be ported to SL4
Scientific Linux Update
• UK top by downloads (no stats from mirrors)
• FTP repository moved from GFS to NAS
• New Plone version for the scientificlinux.org site
• SL 4.4 released October 2006 for i386, x86_64
• SL 3.0.8 release candidate available soon
  – Now available…
• Bug-fix repositories for SL variants
  – bugfixNN, where NN is the version (illustrative repo file below)
• SL 3.0.8 should be the last of the 3 series
  – Support plan as previously published: until autumn 2007
• Working on SL5 (installers etc.)
  – SL5 alphas to be based on TUV beta releases
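On a yum-based SL box, enabling such a bug-fix repository would mean dropping in an ordinary repo file. A sketch following the bugfixNN naming above; the baseurl is an assumption, not a path given in the talk:

    # /etc/yum.repos.d/sl-bugfix.repo
    [sl-bugfix308]
    name=Scientific Linux 3.0.8 bug fixes
    baseurl=http://ftp.scientificlinux.org/linux/scientific/308/i386/bugfix308/
    enabled=1
    gpgcheck=1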
Core Services/Infrastructure (1)
• Tale of FermiLab's run-in with SpamCop
  – SpamCop don't respond to any requests
    • Takes 24 hrs to 'fall off' the list
  – Remove bounce messages and verify local addresses (see the sketch below)
  – Trap obvious spam
  – Have alternative IP addresses for the email gateways
• Proposal: a 'white list' of HEP sites…
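The bounce-removal and address-verification measures map onto standard MTA settings. An illustrative Postfix excerpt; the parameter choices are assumptions, not FermiLab's actual configuration:

    # main.cf -- reject mail for unknown local addresses at the gateway,
    # so it never accepts-then-bounces (the backscatter that lands a
    # site on blocklists)
    relay_recipient_maps = hash:/etc/postfix/relay_recipients
    smtpd_recipient_restrictions =
        permit_mynetworks,
        reject_unauth_destination,
        reject_unknown_recipient_domain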
Core Services/Infrastructure (2)
• Service Level Status service
  – CERN tool for displaying the status of services rather than of individual nodes
  – Status defined by managers in terms of dependencies and dependants, and of what the service availability levels mean
  – Covers services and meta-services
  – Displays Key Performance Indicators of service levels compared to targets
Core Services/Infrastructure (3)
• RT used to manage installation workflow (SLAC)
• High-availability methods and experiences at GSI
• Scientific Linux Inventory Project (FermiLab)
  – Need to monitor the software inventory and hardware of a machine
Compute Clusters & Storage
• Hazards of fast tape drives (JLab)
  – Is your memory buffer big enough to prevent the tape drive having to stop, rewind and take a run-up to speed whenever more data becomes available to write? (see the sketch after this list)
  – CERN reported 100 MB/s using two-stage tape serving, with large (8 GB) RAM on the L1 caches
• NGF: NERSC's Global File System (NERSC)
• Benchmark updates (CERN)
  – spec.org results unreliable for HEP purposes
    • They don't match our conditions
  – Requires vendors to use a 'fixed' configuration of the SPEC2000 benchmark
  – HPL used to benchmark 'power' performance
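The buffer question is simple arithmetic: if the client feed rate is below the drive's streaming rate, the buffer drains and the drive must stop and reposition ('shoe-shining'). A back-of-envelope sketch in Python; the 100 MB/s rate and 8 GB cache are from the CERN report, while the 60 MB/s feed rate is a hypothetical example:

    def streaming_seconds(buffer_mb, drive_mb_s, feed_mb_s):
        # Seconds a full buffer keeps the drive streaming when the
        # client feed is slower than the drive's native rate.
        if feed_mb_s >= drive_mb_s:
            return float('inf')   # feed keeps up: no shoe-shining
        return buffer_mb / (drive_mb_s - feed_mb_s)

    # 8 GB buffer, 100 MB/s drive, 60 MB/s feed -> ~205 s per buffer fill
    print(streaming_seconds(8 * 1024, 100, 60))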
Security
• No Bob Cowles
  – Therefore, no 'scare the pants off everyone' talk
• But:
  – The Stakkato Intrusion
    • The tale of the long-running intrusion at the Swedish National Supercomputer Centre, 2004-2005
  – Network Security Monitoring
    • How it is done at Brookhaven National Lab, with Sguil
Grid Projects
• Issues and problems around Grid site management (+ discussions) – Ian Bird
  – Measuring site availability: T1s poor
  – Instabilities in site availability observed
  – Strategies:
    • Improve sites, improve job direction
  – SAM (Site Availability Monitor)
    • An expansion of SFT functionality
    • Sensors integrated with the submission framework, or standalone
    • Integrated tests done by test-job submission
    • Analysis of job efficiencies (failure rates): the reasons are non-trivial
  – 'Good' sites change daily!
  – Plan to use job wrappers so tests reflect the submitting-VO view rather than the OPS-VO view
    • Better view of system 'weather'
IHEPCCC
• IHEPCCC is discussing collaboration with HEPiX on areas of mutual interest, particularly benchmarking and global file systems
• RTAG format proposed
  – Short-term study groups that report to HEPiX/IHEPCCC
• Lots of interest in participating, particularly in benchmarking and in discussing whether SPEC2006 is appropriate
Next meetings
• Spring 2007:
  – April 23rd to 27th at DESY, Hamburg
  – Suggested topics included benchmarking, cluster file systems, VoIP and, in general, 'discussion topics' (as opposed to LCG workshops) likely to attract LCG Tier 2 sites
• Autumn/Fall 2007:
  – Possibly early November at either Berkeley or FermiLab, hopefully in the week preceding Supercomputing'07 in Reno
• Spring 2008:
  – CERN
References
• Abstracts and slides from HEPiX Fall 2006:
  https://indico.fnal.gov/conferenceDisplay.py?confId=384
• Alan Silverman's comprehensive trip report:
  https://www.hepix.org/mtg/fall_06_jlab/HEPiX%20_Lab_Trip_Report_silverman.pdf