QMUL e-Science Research Cluster
Introduction
• (New) Hardware
• Performance
• Software Infrastructure
• What still needs to be done
Background
• Formed an e-Science consortium within QMUL to bid for SRIF money etc. (no existing central resource)
• Received money in all 3 SRIF rounds so far
• Led by EPP + Astro + Materials + Engineering
• Started from scratch in 2002: new machine room, Gb networking. Now have 230 kW of A/C
• Differing needs: other fields tend to need parallel-processing support (MPI etc.)
• Support effort is a bit of a problem
History of the High Throughput Cluster
Already in its 4th year (3 installation phases)
Phase                   Date          Description                            KSI2K    TB
HTC phase 1a (SRIF1)    01/06/2003    32 x dual 2 GHz Athlon, 4.8 TB RAID       51     7
HTC phase 1b (SRIF1)    01/04/2004    128 x dual 2.8 GHz Xeon, 19 TB RAID      277    30
HTC phase 2 (SRIF2)     01/04/2006    280 x dual dual-core 2 GHz Opteron      1456   160
HTC phase 3 (SRIF3)     01/04/2008                                            4368   320
HTC phase 4             01/04/2010?                                          13104   640
HTC phase 5             01/04/2012?                                          39312  1280
In addition, an Astro cluster of ~70 machines
• 280 + 4 dual dual-core 2 GHz Opteron nodes
• 40 + 4 nodes with 8 GB of memory, the remainder with 4 GB
• Each with 2 x 250 GB hard disks
• 3Com SuperStack 3 3870 network stack
• Dedicated second network for MPI traffic
• APC 7953 vertical PDUs
• Total measured power usage seems to be ~1 A/machine, ~65-70 kW in total
Crosscheck:
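A rough estimate, assuming ~230 V mains: 284 machines × 230 V × 1 A ≈ 65 kW, consistent with the quoted 65-70 kW total.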
• Ordered in the last week of March
• 1st batch of machines delivered in 2 weeks
• 5 further batches, 1 week apart
• 3-week delay for proper PDUs
• Cluster cabled up and powered 2 weeks ago
• Currently all production boxes are running legacy SL3/x86
• Issues with scalability of services (Torque/Ganglia); the shared experimental area is also an I/O bottleneck
Cluster has been fairly heavily used
~40-45% on average
Year             K CPU Hours
2003                      82
2004                     456
2005                    1723
2006 (so far)           1630
Total                   3891

Main Users                K CPU Hours
Atlas                             793
LHCb                              973
Local HEP (mostly LC)             688
Astro                             448
Materials                         613
Engineering                       335
Tier-2 Allocations
S/W Infrastructure
• MySQL database containing all static info about machines and other hardware + network + power configuration
• S/w configuration info kept in a Subversion repository: OS version and release tag
• Automatic (re)installation and upgrades using a combination of both: tftp/kickstart pulls dynamically generated pages from the web (Mason). A sketch of the idea follows below.
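A minimal sketch of that installation flow, assuming a node inventory table in the MySQL DB. The production system uses Mason (Perl) templates served over the web; this Python version is only illustrative, and the database, table and column names (cluster, nodes, hostname, os_release, release_tag) are assumptions rather than the real schema.

    # Illustrative sketch: build a per-node kickstart fragment from the
    # inventory DB. All names below are assumptions, not the real setup.
    import MySQLdb

    def kickstart_for(hostname):
        db = MySQLdb.connect(host="dbhost", user="reader", passwd="secret", db="cluster")
        cur = db.cursor()
        cur.execute("SELECT os_release, release_tag FROM nodes WHERE hostname = %s",
                    (hostname,))
        os_release, release_tag = cur.fetchone()
        # A minimal kickstart body; the real templates would also carry
        # partitioning, package lists and post-install hooks tied to the
        # Subversion-held release tag.
        return "\n".join([
            "install",
            "url --url http://installserver/" + os_release,
            "%post",
            "echo " + release_tag + " > /etc/release-tag",
        ])

Served dynamically over HTTP, a page like this is what a PXE/tftp-booted node's kickstart fetches during (re)installation.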
http://www.esc.qmul.ac.uk/cluster/
Ongoing work
• Commission an SL4/x86_64 service (~30% speed improvement); assume non-HEP usage initially. Able to migrate boxes on demand.
• Tune MPI performance for jobs of up to ~160 CPUs (non-IP protocol?)
• Better integrated monitoring (Ganglia + PBS + opensmart? + existing DB); dump Nagios? Add 1-wire temperature + power sensors. A polling sketch follows this list.
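A minimal sketch of the monitoring side, assuming a standard Ganglia gmond publishing its cluster state as XML on its default TCP port 8649. The hostname and the metric picked out are illustrative; the real integration would join these values with PBS state and the existing MySQL hardware DB.

    # Sketch only: pull one metric per host from a Ganglia gmond XML dump.
    # Hostname and metric choice are assumptions about the local setup.
    import socket
    import xml.dom.minidom

    def read_gmond_xml(host="gmond-host", port=8649):
        # gmond writes its full XML state to any client that connects, then closes.
        sock = socket.create_connection((host, port))
        chunks = []
        while True:
            data = sock.recv(65536)
            if not data:
                break
            chunks.append(data)
        sock.close()
        return b"".join(chunks)

    def load_averages(xml_data):
        # Return {hostname: 1-minute load} for every host gmond reports.
        doc = xml.dom.minidom.parseString(xml_data)
        loads = {}
        for host in doc.getElementsByTagName("HOST"):
            for metric in host.getElementsByTagName("METRIC"):
                if metric.getAttribute("NAME") == "load_one":
                    loads[host.getAttribute("NAME")] = float(metric.getAttribute("VAL"))
        return loads

Values like these, written into the existing database next to the static hardware info, are the kind of integrated view the monitoring bullet above is aiming at.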
Ongoing work continued
• Learn how to use a large amount of distributed storage in an efficient and robust way. Need to provide a POSIX f/s (probably extending poolfs, or something like Lustre).