
Design & Management of the JLAB Farms

Ian Bird, Jefferson Lab, May 24, 2001

FNAL LCCWS

Overview

• JLAB clusters
  – Aims
  – Description
  – Environment
• Batch software
• Management
  – Configuration
  – Maintenance
  – Monitoring
• Performance monitoring
• Comments

Clusters at JLAB - 1

• Farm
  – Supports experiments – reconstruction, analysis
  – 250 (→ 320) Intel Linux CPUs (+ 8 Sun Solaris)
    • 6400 → 8000 SPECint95
  – Goals:
    • Provide 2 passes of 1st-level reconstruction at the average incoming data rate (10 MB/s)
    • (More recently) provide an analysis, simulation, and general batch facility
  – Systems
    • First phase (1997) was 5 dual Ultra2 + 5 dual IBM 43p
    • 10 dual Linux (PII 300) acquired in 1998
    • Currently 165 dual PII/III (300, 400, 450, 500, 750 MHz, 1 GHz)
      – ASUS motherboards, 256 MB, ~40 GB SCSI or IDE, 100 Mbit
      – First 75 systems are towers, 50 are 2U rackmount, 40 are 1U (½U?)
• Interactive front-ends
  – Sun E450s and 4-processor Intel Xeon servers (2 of each), 2 GB RAM, Gigabit Ethernet

Intel Linux Farm

• First purchases: 9 duals per 24” rack
• Last summer: 16 duals (2U) + 500 GB cache (8U) per 19” rack
• Recently: 5 TB IDE cache disk (5 × 8U) per 19” rack

Clusters at JLAB - 2

• Lattice QCD cluster(s)
  – Existing clusters – in collaboration with MIT, at JLAB:
    • Compaq Alpha
      – 16 XP1000 (500 MHz 21264), 256 or 512 MB, 100 Mbit
      – 12 dual UP2000 (667 MHz 21264), 256 MB, 100 Mbit
    • All have Myrinet interconnect
    • Front-end (login) machine has Gigabit Ethernet and a 400 GB fileserver for data staging and MIT–JLAB transfers
  – Anticipated (funded):
    • 128 CPUs (June 2001), Alpha or P4(?) in 1U
    • 128 CPUs (Dec/Jan?) – identical to the first 128
    • Myrinet interconnect

LQCD Clusters

16 single Alpha 21264, 1999

12 dual Alpha (Linux Networks), 2000

Environment

• JLAB has a central computing environment (CUE)
  – NetApp fileservers – NFS & CIFS
    • Home directories, group (software) areas, etc.
  – Centrally provided software applications
• Available in:
  – General computing environment
  – Farms and clusters
  – Managed desktops
• Compatibility between all environments – home and group areas available in the farm, library compatibility, etc.
• Locally written software provides access to the farm (and mass storage) from any JLAB system
• Campus network backbone is Gigabit Ethernet, with 100 Mbit to physicist desktops and OC-3 to ESnet

Jefferson Lab Mass Storage and Farm Systems (2001)

[Diagram: the batch and interactive farm connected to work file servers (10 TB, RAID 5), farm cache file servers (4 × 400 GB), DST/cache file servers (15 TB, RAID 0), a DB server, and tape servers; data arrives from the CLAS DAQ and the Hall A/C DAQ over 100 Mbit/s and 1000 Mbit/s links, with FC-AL and SCSI connections to storage.]

Lattice QCD Metacenter

[Diagram: JLAB cluster – dual Alpha 21264 nodes (qty. 64, growing to 128) plus 2 interactive duals with 400 GB of disk, on a 128-port Myrinet switch and a cluster Ethernet switch (100 Mbit to nodes, Gigabit uplink); connected through a Cisco CAT 5500 switch (with an OC-3 link to ESnet) to other JLAB computers and the mass storage system – a quad Sun E4000 with 270 GB and 342 GB of staging disk, STK Redwood and STK 9840 tape drives (300 TB mass storage system), and two MetaStore SH7400 file servers (1 TB each). MIT cluster – quad Alpha 21264 nodes (qty. 12) plus 2 interactive Alpha 21164s with 296 GB of disk, on a 16-port Myrinet switch and cluster Ethernet switch; connected via an Alantec Ethernet switch to an AlphaServer 8400 (12 CPUs, 1 TB), a 13 TB DLT library, a Sun E6000 (10 CPUs, 120 GB), and other MIT computers. Development clusters are also shown, including a dual-Pentium RAID file server.]

Batch Software

• Farm
  – Uses LSF (v4.0.1)
    • Pricing now acceptable
  – Manage resource allocation with:
    • Job queues
      – Production (reconstruction, etc.)
      – Low-priority (for simulations), high-priority (short jobs)
      – Idle (pre-emptable)
    • User + group allocations (shares)
      – Make full use of hierarchical shares – allows a single, undivided cluster to be used efficiently by many groups (e.g. the share-tree sketch below)
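The original slide's share-tree example did not survive extraction. Below is a minimal, illustrative Python sketch of the idea behind hierarchical shares – a tree of groups, each with a share weight, whose normalized weights multiplied along the path to a leaf give that group's fraction of the cluster. The group names and numbers are invented for illustration and are not the actual JLAB or LSF configuration.

```python
# Illustrative only: hierarchical share tree, not LSF's actual configuration or
# scheduling algorithm. Group names and weights are hypothetical.

SHARE_TREE = {
    "halls": (70, {          # (weight, children)
        "hall_a": (30, {}),
        "hall_b": (50, {}),
        "hall_c": (20, {}),
    }),
    "theory": (20, {}),
    "other": (10, {}),
}

def cluster_fractions(tree, parent_fraction=1.0, prefix=""):
    """Walk the tree and return {group_path: fraction_of_cluster} for each leaf."""
    total = sum(weight for weight, _ in tree.values())
    fractions = {}
    for name, (weight, children) in tree.items():
        frac = parent_fraction * weight / total
        path = f"{prefix}/{name}" if prefix else name
        if children:
            fractions.update(cluster_fractions(children, frac, path))
        else:
            fractions[path] = frac
    return fractions

if __name__ == "__main__":
    for group, frac in sorted(cluster_fractions(SHARE_TREE).items()):
        print(f"{group:20s} {frac:5.1%}")
```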

Batch software - 2

• Users do not use LSF directly; they use a Java client (jsub) that:
  – Is available from any machine (does not need LSF)
  – Provides missing functionality, e.g.:
    • Submit 1000 jobs in 1 command
    • Fetches files from tape, pre-staging them before the job is queued for execution (don't block the farm with jobs waiting for data)
      – Ensures efficient retrieval of files from tape – e.g. sort 1000 files by tape and by file number on tape (see the sketch below)
  – Web interface (via servlet) to monitor job status and progress (as well as host, queue, etc.)
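A minimal sketch of the kind of tape-ordered sort described above. The record fields (tape label and file number on tape) are assumed to come from the mass-storage catalogue; their layout here is hypothetical.

```python
# Illustrative sketch: order a large staging request by tape, then by position on
# tape, so each tape is mounted once and read sequentially. The record fields are
# hypothetical stand-ins for whatever the mass-storage catalogue actually provides.

from collections import defaultdict
from typing import NamedTuple

class StageRequest(NamedTuple):
    path: str      # file to stage
    tape: str      # tape volume label
    file_no: int   # position of the file on that tape

def order_for_staging(requests):
    """Group requests per tape, each group sorted by position on the tape."""
    by_tape = defaultdict(list)
    for req in requests:
        by_tape[req.tape].append(req)
    return {tape: sorted(reqs, key=lambda r: r.file_no)
            for tape, reqs in sorted(by_tape.items())}

if __name__ == "__main__":
    reqs = [StageRequest("/mss/run1017.dat", "VOL042", 7),
            StageRequest("/mss/run1016.dat", "VOL042", 3),
            StageRequest("/mss/run1101.dat", "VOL077", 1)]
    for tape, ordered in order_for_staging(reqs).items():
        print(tape, [r.path for r in ordered])
```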

[Screenshots: web-interface views of job status and host status.]

Batch software - 3

• LQCD clusters use PBS
  – JLAB-written scheduler
    • 7 stages – mimics LSF's hierarchical behaviour
  – Users access PBS commands directly
  – Web interface (portal) – authorization based on certificates
    • Used to submit jobs between the JLAB & MIT clusters

Batch software - 4

• Future
  – Combine jsub & LQCD portal features to wrap both LSF and PBS
  – XML-based job description language (see the sketch below)
  – Provide a web-interface toolkit to experiments to enable them to generate jobs based on experiment run data
  – In the context of PPDG
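The slides do not define the planned XML job description format, so the schema below is entirely hypothetical; the sketch only illustrates the wrapping idea – parse one batch-system-neutral description and emit either an LSF (`bsub`) or PBS (`qsub`) submission command.

```python
# Hypothetical sketch of an XML job description being turned into an LSF or PBS
# submission. The XML element and attribute names are invented for illustration;
# only the bsub/qsub command names and their basic options are real.

import shlex
import xml.etree.ElementTree as ET

JOB_XML = """
<job name="reco-run1017">
  <queue>production</queue>
  <command>/apps/reco/bin/recon -r 1017</command>
  <output>/work/logs/reco-run1017.log</output>
</job>
"""

def build_submit_command(xml_text, system="lsf"):
    """Return an argv list for bsub (LSF) or qsub (PBS) from the XML description."""
    job = ET.fromstring(xml_text)
    name = job.get("name")
    queue = job.findtext("queue")
    command = job.findtext("command")
    output = job.findtext("output")
    if system == "lsf":
        return ["bsub", "-J", name, "-q", queue, "-o", output] + shlex.split(command)
    elif system == "pbs":
        # qsub normally submits a script file; generating that script is omitted here.
        return ["qsub", "-N", name, "-q", queue, "-o", output]
    raise ValueError(f"unknown batch system: {system}")

if __name__ == "__main__":
    print(" ".join(build_submit_command(JOB_XML, "lsf")))
    print(" ".join(build_submit_command(JOB_XML, "pbs")))
```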

Cluster management

• Configuration
  – Initial configuration
    • Kickstart, 2 post-install scripts for configuration and software install (LSF etc.), driven by a floppy
    • Looking at PXE + DHCP (available on newer motherboards)
      – Avoids the need for a floppy – just power on
      – System working (last week)
      – Software: PXE standard bootprom (www.nilo.org/docs/pxe.html) – talks to DHCP
        » bpbatch – pre-boot shell (www.bpbatch.org) – downloads vmlinux, kickstart, etc.
    • Alphas configured “by hand + kickstart”
  – Updates etc.
    • Autorpm (especially for patches)
    • New kernels – by hand with scripts
  – OS upgrades
    • Rolling upgrades – use queues to manage the transition (a host-draining helper is sketched below)
• Missing piece:
  – Remote, network-accessible console screen access
    • Have used serial consoles, KVM switches, a monitor on a cart …
    • Linux Networks Alphas have remote power management – don’t use!
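The slides only say that queues are used to manage rolling OS upgrades; the procedure itself is not given. The sketch below is a generic, hypothetical helper for one piece of that kind of transition – closing a batch of LSF hosts, waiting for running jobs to drain, and reopening them after reinstall – using the standard `badmin` and `bjobs` commands. It is not the procedure from the slides.

```python
# Hypothetical rolling-upgrade helper: close a batch of LSF hosts so they accept
# no new jobs, wait for running jobs to drain, then (after reinstall) reopen them.
# Uses the standard LSF commands badmin hclose/hopen and bjobs; the output
# parsing is deliberately crude and the host names are invented.

import subprocess
import time

def drain_hosts(hosts, poll_seconds=300):
    """Close hosts to new jobs and block until no running jobs remain on them."""
    subprocess.run(["badmin", "hclose"] + hosts, check=True)
    while True:
        result = subprocess.run(["bjobs", "-u", "all", "-r", "-m", " ".join(hosts)],
                                capture_output=True, text=True)
        # Crude check: bjobs prints a header line plus one line per running job.
        if len(result.stdout.strip().splitlines()) <= 1:
            return
        time.sleep(poll_seconds)

def reopen_hosts(hosts):
    """Reopen hosts once they have been reinstalled and rejoined the cluster."""
    subprocess.run(["badmin", "hopen"] + hosts, check=True)

if __name__ == "__main__":
    batch = ["farm0101", "farm0102"]   # hypothetical host names
    drain_hosts(batch)
    print("Reinstall", batch, "then run reopen_hosts().")
```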

System monitoring

• Farm systems – LM78 to monitor temperatures + fans via /proc
  – This was our largest failure mode for Pentiums
  – Mon (www.kernel.org/software/mon)
    • Used extensively for all our systems – pages the “on-call”
    • For the batch farm, checks are mostly fan, temperature, ping (a minimal check of this kind is sketched below)
  – Mprime (prime-number search) has checks on memory and arithmetic integrity
    • Used in initial system burn-in
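A minimal sketch of the kind of fan/temperature check described above. The /proc paths, file layouts, and thresholds are assumptions (they depend on the kernel and lm_sensors setup); this illustrates the idea, not the actual mon configuration used at JLAB.

```python
# Illustrative health check in the spirit of the fan/temperature monitoring above.
# The /proc paths and thresholds below are assumptions; real sensor locations and
# file formats depend on the kernel, motherboard, and lm_sensors configuration.

import glob
import sys

TEMP_LIMIT_C = 60.0      # hypothetical alarm threshold
FAN_MIN_RPM = 2000.0     # hypothetical minimum fan speed

def read_values(path):
    """Return the whitespace-separated numeric fields of a sensor file."""
    with open(path) as f:
        return [float(v) for v in f.read().split()]

def check_sensors(sensor_glob="/proc/sys/dev/sensors/*"):   # assumed lm_sensors layout
    problems = []
    for chip in glob.glob(sensor_glob):
        for temp_file in glob.glob(chip + "/temp*"):
            # Assumes the current reading is the last field of the temp file.
            if read_values(temp_file)[-1] > TEMP_LIMIT_C:
                problems.append(f"over-temperature: {temp_file}")
        for fan_file in glob.glob(chip + "/fan*"):
            if read_values(fan_file)[-1] < FAN_MIN_RPM:
                problems.append(f"fan too slow: {fan_file}")
    return problems

if __name__ == "__main__":
    issues = check_sensors()
    for issue in issues:
        print(issue)
    sys.exit(1 if issues else 0)   # non-zero exit so a monitor (e.g. mon) can page
```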

Monitoring

Performance monitoring

• Use a variety of mechanisms
  – Publish weekly tables and graphs based on LSF statistics
  – Graphs from mrtg/rrd
    • Network performance, number of jobs, utilization, etc. (a small rrd feeder is sketched below)
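As a small illustration of feeding batch statistics into an RRD for graphing: the RRD file name, its data-source layout, and the `bqueues` column parsing below are assumptions, not the actual JLAB setup, but `bqueues` and `rrdtool update` are real commands.

```python
# Illustrative sketch: count pending/running jobs from LSF's bqueues output and
# push them into a round-robin database with rrdtool, for graphing alongside
# mrtg/rrd network data. Assumes an RRD already created with PEND/RUN data sources.

import subprocess
import time

RRD_FILE = "farm_jobs.rrd"   # hypothetical, pre-created with two data sources

def count_jobs():
    """Sum the PEND and RUN columns of `bqueues` output across all queues."""
    out = subprocess.run(["bqueues"], capture_output=True, text=True, check=True).stdout
    lines = out.strip().splitlines()
    header = lines[0].split()
    pend_col, run_col = header.index("PEND"), header.index("RUN")
    pend = sum(int(line.split()[pend_col]) for line in lines[1:])
    run = sum(int(line.split()[run_col]) for line in lines[1:])
    return pend, run

if __name__ == "__main__":
    pend, run = count_jobs()
    # rrdtool update <file> <timestamp>:<value1>:<value2>
    subprocess.run(["rrdtool", "update", RRD_FILE, f"{int(time.time())}:{pend}:{run}"],
                   check=True)
```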

Comments & Issues

• Space – very limited
  – Installing a new STK silo; moved all sysadmins out
    • Now have no admins in the same building as the machine room
  – Plans to build a new Computer Center …
• Have always been lights-out

Future

• Accelerator and experiment upgrades
  – Expect first data in 2006, full rate in 2007
  – 100 MB/s data acquisition
  – 1 – 3 PB/year (1 PB raw, > 1 PB simulated)
  – Compute clusters for:
    • Level 3 triggers
    • Reconstruction
    • Simulation
    • Analysis – PWA can be parallelized, but needs access to very large reconstructed and simulated datasets
• Expansion of LQCD clusters
  – 10 Tflops by 2005
