Jon Wakelin, Physics & ACRC, Bristol


Page 1: Jon Wakelin, Physics & ACRC, Bristol

Page 2: ACRC

• Server Rooms
  – PTR: 48 APC water-cooled racks (hot aisle / cold aisle)
  – MVB: 12 APC water-cooled racks (hot aisle / cold aisle)

• HPC
  – IBM, ClusterVision, ClearSpeed

• Storage
  – 2008-2011?
  – Petabyte-scale facility

• 6 Staff
  – 1 Director, 2 HPC Admins, 1 Research Facilitator
  – 1 Visualization Specialist, 1 e-Research Specialist
  – (1 Storage Admin post?)

Page 3: ACRC Resources

• Phase 1 (~March 2007)
  – 384 cores: AMD Opteron 2.6 GHz dual-socket, dual-core nodes, 8 GB memory
  – MVB server room
  – CVOS, with SL 4 on the worker nodes; GPFS, Torque/Maui, QLogic InfiniPath

• Phase 2 (~May 2008)
  – 3328 cores: Intel Harpertown 2.8 GHz dual-socket, quad-core nodes, 8 GB memory
  – PTR server room, ~600 m from the MVB server room
  – CVOS, with SL (version?) on the worker nodes; GPFS, Torque/Moab, QLogic InfiniPath

• Storage Project (2008-2011)
  – Initial purchase of an additional 100 TB for the PP and Climate Modelling groups
  – PTR server room; operational by ~Sep 2008
  – GPFS will be installed on the initial 100 TB

Page 4: ACRC Resources

184 Registered Users

54 Projects

5 Faculties

• Eng

• Science

• Social Science

• Medicine & Dentistry

• Medical & Vet.

Page 5: PP Resources

• Initial LCG/PP setup
  – SE (DPM), CE and a 16-core PP cluster, plus MON and UI
  – CE for the HPC system (and SE and GridFTP servers for use with the ACRC facilities)

• HPC Phase 1
  – PP have a 5% fair-share target and up to 32 concurrent jobs (see the configuration sketch after this list)
  – New CE, but uses the existing SE, accessed via NAT (and slow)
  – Operational since the end of Feb 2008

• HPC Phase 2
  – SL 5 will limit PP exploitation in the short term
  – Exploring virtualization, but this is a medium- to long-term solution
  – PP to negotiate a larger share of the Phase 1 system to compensate

• Storage
  – 50 TB to arrive shortly; operational ~Sep 2008
  – Additional networking necessary for short/medium-term access
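
A minimal sketch of how the Phase 1 fair-share arrangement above could be expressed in maui.cfg; the group name "pp" and the surrounding fair-share policy values are illustrative assumptions, not the actual Bristol configuration:

  # Hypothetical maui.cfg fragment: 5% fair-share target and a
  # 32-job cap for a "pp" group (names and values are placeholders).
  FSPOLICY          DEDICATEDPS
  FSDEPTH           7
  FSINTERVAL        24:00:00
  GROUPCFG[pp]      FSTARGET=5.0  MAXJOB=32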

Page 6: Storage

• Storage Cluster
  – Separate from the HPC cluster
  – Will run GPFS
  – Being installed and configured ‘as we speak’

• Running a ‘test’ StoRM SE
  – This is the second time, due to changes in the underlying architecture
  – Passing simple SAM SE tests, but now removed from the BDII
  – Direct access between storage and the worker nodes, through multi-cluster GPFS (rather than NAT); see the sketch after this list

• Test and real systems may differ in the following ways…
  – The real system will have a separate GridFTP server
  – Possibly an NFS export for the Physics cluster
  – 10 Gb NICs (Myricom Myri-10G, PCI-Express)
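
For reference, a minimal sketch of the multi-cluster GPFS remote mount mentioned above, using standard GPFS administration commands; the cluster names, contact nodes, key file paths, device name and mount point are hypothetical placeholders rather than the real Bristol setup:

  # On the storage (owning) cluster: generate/exchange keys and
  # authorise the HPC cluster to mount the filesystem read-write.
  mmauth genkey new
  mmauth add hpc.example.ac.uk -k /var/mmfs/ssl/hpc_keyfile.pub
  mmauth grant hpc.example.ac.uk -f gpfs_store -a rw

  # On the HPC (accessing) cluster: register the remote cluster and
  # filesystem, then mount it on all nodes (including the WNs).
  mmremotecluster add storage.example.ac.uk -n store01,store02 -k /var/mmfs/ssl/storage_keyfile.pub
  mmremotefs add store_remote -f gpfs_store -C storage.example.ac.uk -T /gpfs/store
  mmmount store_remote -a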

Pages 7-13: Image-only slides (no recoverable text).

Page 14: Network diagram showing Nortel 5510-48/5510-24 edge switches, 5530 stack switches, 8683 modules and an 8648 GTR core switch, plus x3650 servers with Myri-10G NICs, across the areas labelled Storage, HPC Phase 2, HPC Phase 1, MVB Server Room and PTR Server Room. NB: All components are Nortel.

Page 15: Network diagram with the same components as Page 14, plus an additional 5530 switch. NB: All components are Nortel.

Page 16: Image-only slide (no recoverable text).

Page 17: SoC

• Separation of Concerns
  – Storage/compute managed independently of the Grid interfaces
  – Storage/compute managed by dedicated HPC experts
  – Tap into storage/compute in the manner the ‘electricity grid’ analogy suggested

• Provide PP with centrally managed compute and storage
  – Tarball WN install on the HPC system
  – StoRM writing files to a remote GPFS mount (developers and tests confirm this works)

• In theory this is a good idea; in practice it is hard to achieve
  – (Originally) an implicit assumption that the admin has full control over all components

• Software now allows for (mainly) non-root installations
  – We depend on others for some aspects of support

• Impact on turn-around times for resolving issues (SLAs?!?!!)

Page 18: General Issues

• Limit the number of tasks that we pass on to the HPC admins
  – Set up users, ‘admin’ accounts (sudo) and shared software areas
  – Torque: allow a remote submission host (i.e. our CE); see the sketch after this list
  – Maui: ADMIN3 access for certain users (all users are A3 anyway)
  – NAT

• Most other issues are solvable with fewer privileges
  – SSH keys
  – RPM or rsync for certificate updates
  – WN tarball for software

• Other issues
  – APEL accounting assumes ExecutingCE == SubmitHost (bug report)
  – Workaround for the Maui client: key embedded in the binaries!!! (now changed)
  – Home directory path has to be exactly the same on the CE and the cluster
  – Static route into the HPC private network
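
A minimal sketch of the scheduler-side settings and the static route referred to above; the hostname, usernames and addresses are hypothetical placeholders, and attribute names can differ between Torque/Maui versions:

  # Torque: allow job submission from the remote CE host.
  qmgr -c "set server submit_hosts += ce.example.ac.uk"

  # Maui (maui.cfg): grant ADMIN3 rights to named grid users.
  ADMIN3  ppuser01 ppuser02

  # CE: static route into the HPC private network (addresses are placeholders).
  ip route add 10.10.0.0/16 via 192.168.0.254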

Page 19: Q’s?

• Any questions…

• https://webpp.phy.bris.ac.uk/wiki/index.php/Grid/HPC_Documentation

• http://www.datadirectnet.com/s2a-storage-systems/capacity-optimized-configuration

• http://www.datadirectnet.com/direct-raid/direct-raid

• hepix.caspur.it/spring2006/TALKS/6apr.dellagnello.gpfs.ppt