
BNL Site Report

Ofer Rind
Brookhaven National Laboratory
rind@bnl.gov

Spring HEPiX Meeting, CASPUR
April 3, 2006


(Brief) Facility Overview
• The RHIC/ATLAS Computing Facility is operated by the BNL Physics Dept. to support the scientific computing needs of two large user communities
  – RCF is the “Tier-0” facility for the four RHIC experiments
  – ACF is the Tier-1 facility for ATLAS in the U.S.
  – Both are full-service facilities
• >2400 users, 31 FTE
• RHIC Run 6 (polarized protons) started March 5th


Mass Storage
• Soon to be in full production...
  – Two SL8500s: 2 × 6.5K tape slots, ~5 PB capacity
  – LTO-3 drives: 30 × 80 MB/sec; 400 GB/tape (native)
  – All Linux movers: 30 RHEL4 machines, each with 7 Gbps Ethernet connectivity and an aggregate 4 Gbps direct-attached connection to DataDirect S2A Fibre Channel disk
• This is in addition to the 4 STK Powderhorn silos already in service (~4 PB, 20K 9940B tapes)
• Transition to HPSS 5.1 is complete
  – “It’s different”... learning curve due to numerous changes
  – PFTP: client incompatibilities and cosmetic changes
• Improvements to the Oak Ridge Batch System optimizer
  – Code fixed to remove a long-time source of instability (no crashes since)
  – New features being designed to improve access control
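As a rough consistency check on the quoted figures (the 200 GB native capacity of a 9940B cartridge is assumed here, not stated on the slide): 2 × 6,500 slots × 400 GB/tape ≈ 5.2 PB native for the SL8500s, and 20,000 × 200 GB ≈ 4 PB for the Powderhorn silos; the 30 LTO-3 drives give an aggregate tape bandwidth of about 2.4 GB/s.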


Centralized Storage
• NFS: currently ~220 TB of FC SAN; 37 Solaris 9 servers
  – Over the next year, plan to retire ~100 TB of mostly NFS-served storage (MTI, ZZYZX)
• AFS: RHIC and USATLAS cells
  – Looking at Infortrend disk (SATA + FC front end + RAID6) for an additional 4 TB (raw) per cell
  – Future: upgrade to OpenAFS 1.4
• Panasas: 20 shelves, 100 TB, heavily used by RHIC


Panasas Issues
• Panasas DirectFlow (version 2.3.2)
  – High performance and fairly stable, but... problematic from an administrative perspective:
    • Occasional stuck client-side processes left in uninterruptible sleep (see the sketch below)
    • DirectFlow module causes kernel panics from time to time
      – Can always panic a kernel with panfs mounted by running a Nessus scan on the host
    • Changes in ActiveScale server configuration (e.g. changing the IP addresses of non-primary director blades), which the company claims are innocuous, can cause clients to hang
• Server-side NFS limitations
  – NFS mounting was tried and found to be unfeasible as a fallback option with our configuration: heavy NFS traffic causes director blades to crash; Panasas suggests limiting to <100 clients per director blade
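The stuck-process symptom above can be spotted by looking for tasks in uninterruptible sleep (‘D’ state). A minimal sketch of such a check, assuming a standard Linux /proc layout; it flags all D-state processes, not only panfs clients, and is not the site’s own tooling:

```python
#!/usr/bin/env python3
"""List processes stuck in uninterruptible sleep ('D' state), the symptom
seen with hung DirectFlow clients. Illustrative sketch only."""

import os

def d_state_processes():
    """Yield (pid, command line) for every process currently in 'D' state."""
    for pid in filter(str.isdigit, os.listdir("/proc")):
        try:
            with open("/proc/%s/status" % pid) as f:
                status = f.read()
            if "\nState:\tD" not in status:
                continue
            with open("/proc/%s/cmdline" % pid, "rb") as f:
                cmd = f.read().replace(b"\0", b" ").decode(errors="replace").strip()
            yield pid, cmd or "[kernel thread]"
        except OSError:
            continue  # process disappeared while we were reading it

if __name__ == "__main__":
    for pid, cmd in d_state_processes():
        print(pid, cmd)
```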


Update on Security
• Nessus scanning program implemented as part of the ongoing DOE C&A process
  – Constant low-level scanning
  – Quarterly scanning: more intensive, with a port exclusion scheme to protect sensitive processes
• Samhain
  – Filesystem integrity checker (akin to Tripwire) with central management of monitored systems
  – Currently deployed on all administrative systems


Linux Farm Hardware
• >4000 processors, >3.5 MSI2K
• ~700 TB of local storage (SATA, SCSI, PATA)
• SL 3.05 for RHIC (SL 3.03 for ATLAS)
• Evaluated dual-core Opteron & Xeon for the upcoming purchase
  – Recently encountered problems with Bonnie++ I/O tests using RHEL4 64-bit with software RAID+LVM on Opteron (a sketch of such a test run follows the list)
  – Xeon (Paxville) gives poor SI/watt performance
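For reference, a minimal sketch of the kind of Bonnie++ run used in such an evaluation; the mount point, file size and label below are placeholders, and the md/LVM volume is assumed to be built and mounted already:

```python
#!/usr/bin/env python3
"""Run one Bonnie++ pass against a mounted test filesystem and keep the
machine-readable CSV summary for later comparison (e.g. software RAID+LVM
vs. a plain disk). Paths, sizes and labels are illustrative placeholders."""

import subprocess

TEST_DIR = "/mnt/raidtest"      # placeholder: mount point of the volume under test
FILE_SIZE = "16g"               # should be well beyond RAM size to defeat caching
LABEL = "opteron-raid-lvm"      # free-form label recorded in the output

cmd = ["bonnie++", "-d", TEST_DIR, "-s", FILE_SIZE, "-u", "root", "-m", LABEL]
result = subprocess.run(cmd, capture_output=True, text=True, check=True)

# Bonnie++ prints a CSV summary as the last line of its output
print(result.stdout.strip().splitlines()[-1])
```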


Power & Cooling
• Power and cooling are now significant factors in purchasing
• Added 240 kW to the facility for ’06 upgrades
  – Long term: possible site expansion
• Liebert XDV vertical top cooling modules to be installed on new racks
• CPU and ambient temperature monitoring via dtgraph and custom Python scripts (a sketch of such a script follows)
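A minimal sketch of the kind of temperature-collection script assumed here, parsing lm_sensors output; the label matching and the warning threshold are placeholders, and the real scripts feed dtgraph rather than printing:

```python
#!/usr/bin/env python3
"""Poll CPU/ambient temperatures by parsing `sensors` (lm_sensors) output
and flag anything above a threshold. Labels and limits are illustrative."""

import re
import subprocess

TEMP_LIMIT_C = 60.0   # placeholder warning threshold
# matches lines such as "CPU Temp:  +47.0 C" or "temp1: +32.5°C"
TEMP_RE = re.compile(r"^(?P<label>[^:]+):\s*\+?(?P<temp>\d+(?:\.\d+)?)\s*°?C", re.M)

def read_temps():
    """Return {sensor label: temperature in Celsius} from `sensors` output."""
    out = subprocess.run(["sensors"], capture_output=True, text=True).stdout
    return {m.group("label").strip(): float(m.group("temp"))
            for m in TEMP_RE.finditer(out)}

if __name__ == "__main__":
    for label, temp in sorted(read_temps().items()):
        status = "WARN" if temp > TEMP_LIMIT_C else "ok"
        print(f"{label:20s} {temp:5.1f} C  {status}")
```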


Distributed Storage
• Two large dCache instances (v1.6.6) deployed in a hybrid server/client model:
  – PHENIX: 25 TB disk, 128 servers, >240 TB data
  – ATLAS: 147 TB disk, 330 servers, >150 TB data
  – Two custom HPSS backend interfaces
  – Performance tuning on ATLAS write pools
  – Peak transfer rates of >50 TB/day
• Other: large deployments of Xrootd (STAR), rootd, and anatrain (PHENIX)
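For scale, a transfer rate of 50 TB/day corresponds to a sustained average of roughly 580 MB/s, or about 4.6 Gb/s.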


Batch Computing
• All reconstruction and analysis batch systems have been migrated to Condor, except STAR analysis (which still awaits features like global job-level resource reservation) and some ATLAS distributed analysis; these still use LSF 6.0
• Configuration:
  – Five Condor (6.6.x) pools on two central managers
  – 113 available submit nodes
  – One monitoring/CondorView server and one backup central manager
• Lots of performance tuning
  – Autoclustering of jobs for scheduling; timeouts; negotiation cycle; socket cache; collector query forking; etc.


Condor Usage
• Use of a heavily modified CondorView client to display historical usage.


Condor Flocking
• Goal: full utilization of computing resources on the farm
• Increasing use of a “general queue” which allows jobs to run on idle resources belonging to other experiments, provided that there are no local resources available to run the job
• Currently, such “opportunistic” jobs are immediately evicted if a local job places a claim on the resource
• >10K jobs completed so far


Condor Monitoring
• Nagios and custom scripts provide live monitoring of critical daemons
• Job history from ~100 submit nodes is placed into a central database (a sketch of such a collector follows the list)
  – This model will be replaced by Quill
  – Custom statistics extracted from the database (e.g. general queue, throughput, etc.)
• Custom startd, schedd, and “startd cron” ClassAds allow for quick viewing of the state of the pool using Condor commands
  – Some information accessible via a web interface
• Custom startd ClassAds allow for remote and peaceful turn-off of any node
  – Not available in Condor
  – Note that the “condor_off -peaceful” command (v6.8) cannot be canceled; one must wait until running jobs exit
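A minimal sketch of the sort of per-submit-node history collector described above; the database schema, host names and credentials are assumptions, and the production scripts (and their planned Quill replacement) differ:

```python
#!/usr/bin/env python3
"""Push completed-job records from `condor_history -l` on this submit node
into a central MySQL database. Schema and connection details are placeholders."""

import socket
import subprocess
import MySQLdb  # provided by the mysqlclient package on Python 3

WANTED = ("ClusterId", "ProcId", "Owner", "JobStatus", "RemoteWallClockTime")

def parse_history():
    """Yield one dict per job from the long-format condor_history output."""
    out = subprocess.run(["condor_history", "-l"],
                         capture_output=True, text=True).stdout
    job = {}
    for line in out.splitlines():
        if not line.strip():          # a blank line separates job ClassAds
            if job:
                yield job
                job = {}
            continue
        key, _, value = line.partition(" = ")
        if key in WANTED:
            job[key] = value.strip().strip('"')
    if job:
        yield job

def main():
    db = MySQLdb.connect(host="condor-db.example", user="condor",
                         passwd="secret", db="jobhistory")   # placeholders
    cur = db.cursor()
    node = socket.gethostname()
    for job in parse_history():
        cur.execute(
            "REPLACE INTO jobs (node, cluster, proc, owner, status, wallclock) "
            "VALUES (%s, %s, %s, %s, %s, %s)",
            (node, job.get("ClusterId"), job.get("ProcId"), job.get("Owner"),
             job.get("JobStatus"), job.get("RemoteWallClockTime")))
    db.commit()

if __name__ == "__main__":
    main()
```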


Nagios Monitoring
• 13,958 services and 1,963 hosts in total: an average of 7 services checked per host
• Originally had one Nagios server (dual 2.4 GHz)...
  – Tremendous latency: services reported down many minutes after the fact
  – Web interface completely unusable (due to the number of hosts and services)
• ...all of this despite a lot of Nagios and system tuning:
  – Nagios data written to a ramdisk
  – Increased the number of file descriptors and processes allowed
  – Monitoring data read from a MySQL database on a separate host
  – Web interface replaced with a lightweight interface to the database server (a sketch of such a query follows the list)
• Solution: split services roughly in half between two Nagios servers
  – Latency is now very good
  – Events from both servers logged to one MySQL server
  – With two servers there is still room for many more hosts and a handful of additional service checks
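A minimal sketch of the sort of lightweight database query that replaced the stock web interface; the table and column names are assumptions about the local event schema, not the actual layout:

```python
#!/usr/bin/env python3
"""Pull the most recent non-OK Nagios service events from the central MySQL
event log instead of going through the stock CGI interface.
Table/column names below are illustrative, not the real local schema."""

import MySQLdb  # provided by the mysqlclient package on Python 3

def recent_problems(limit=50):
    """Return the latest `limit` service events whose state is not OK."""
    db = MySQLdb.connect(host="nagios-db.example", user="nagios",
                         passwd="secret", db="nagios_events")  # placeholders
    cur = db.cursor()
    cur.execute(
        "SELECT event_time, host, service, state, output "
        "FROM service_events WHERE state <> 'OK' "
        "ORDER BY event_time DESC LIMIT %s", (limit,))
    return cur.fetchall()

if __name__ == "__main__":
    for when, host, service, state, output in recent_problems():
        print(f"{when}  {host:25s} {service:20s} {state:8s} {output}")
```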


Nagios Monitoring


ATLAS Tier-1 Activities
• OSG 0.4; LCG 2.7 (this week)
• ATLAS Panda (Production And Distributed Analysis) used for production since Dec. ’05
  – Good performance in scaling tests, with low failure rate and manpower requirements
• Network upgrade
  – 2 × 10 Gig LAN and WAN
• Terapath QoS/MPLS (BNL, UM, FNAL, SLAC, ESNET)
  – DOE-supported project to introduce end-to-end QoS networking into data management
  – Ongoing intensive development with ESNET

(SC2005 graphic)


ATLAS Tier-1 Activities
• SC3 Service Phase (Oct-Dec ’05)
  – Functionality validated for the full production chain to the Tier-1
  – Exposed some interoperability problems between BNL dCache and FTS (now fixed)
  – Further improvement needed in operation, performance and monitoring
• SC3 Rerun Phase (Jan-Feb ’06)
  – Achieved performance (disk-disk, disk-tape) and operations benchmarks


ATLAS Tier-1 Activities
• SC4 plan:
  – Deployment of storage element, grid middleware (LFC, LCG, FTS), and the ATLAS VO box
  – April: data throughput phase (disk-disk and disk-tape); goal is T0-to-T1 operational stability
  – May: T1-to-T1 data exercise
  – June: ATLAS data distribution from T0 to T1 to selected T2s
  – July-August: limited distributed data processing, plus analysis
  – Remainder of ’06: increasing scale of data processing and analysis


Recent P-P Collision in STAR