
Dr. David Wallom

Experience of Setting up and Running a Production Grid on a University Campus, July 2004


2 Outline

• The Centre for e-Research Bristol & its place in national efforts

• University of Bristol Grid
• Available tool choices
• Support models for a distributed system
• Problems encountered
• Summary


3 Centre for e-Research Bristol

• Established as a Centre of Excellence in visualisation.

• Currently has one full time member of staff with several shared resources.

• Intended to lead the University e-Research effort including as many departments and non-traditional computational users as possible.


4 NGS (www.ngs.ac.uk) UK National Grid Service

• ‘Free’ dedicated resources accessible only through Grid interfaces, e.g. GSI-SSH, Globus Toolkit

• Compute clusters (York & Oxford)
– 64 dual-CPU Intel 3.06 GHz nodes, 2GB RAM
– Gigabit & Myrinet networking

• Data clusters (Manchester & RAL)
– 20 dual-CPU Intel 3.06 GHz nodes, 4GB RAM
– Gigabit & Myrinet networking
– 18TB Fibre SAN

• Also national HPC resources: HPC(x), CSAR

• Affiliates: Bristol, Cardiff, …


5 The University of Bristol Grid

• Established as a way of leveraging extra use from existing resources.

• Planned to consist of ~400 CPUs, from 1.2 to 3.2 GHz, arranged in 6 clusters; currently about 100 CPUs in 3 clusters.

• Initially legacy OS though all now moving to Red Hat Enterprise Linux 3.

• Based in and maintained by several different departments.


6 The University of Bristol Grid

• Decided to construct a campus grid to gain experience with middleware & system management before formally joining NGS.

• Central services all run on Viglen servers:
– Resource Broker
– Monitoring and Discovery Service & systems monitoring
– Virtual Organisation Management
– Storage Resource Broker Vault
– MyProxy server

• The choice of software to provide these was led by personal experience and by other UK efforts to standardise.


7 The University of Bristol Grid, 2

• Based in and maintained by several different departments.
• Each system with a different system manager!
• Different OSs, initially just Linux & Windows, though others will come.
• Linux versions initially legacy, though all now moving to Red Hat Enterprise Linux.


8 The System Layout


9 System Installation Model

Draw it on the board!


10 Middleware

• Virtual Data Toolkit
– Chosen for stability and support structure.
– Widely used in other European production grid systems.
• Contains the standard Globus Toolkit version 2.4 with several enhancements.


11 Resource Brokering

• Uses the Condor-G job distribution mechanism.

• Custom script for determination of resource priority.

• Integrated the Condor job submission system with the Globus Monitoring and Discovery Service (a query sketch follows below).
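As a rough illustration of the kind of query such a priority script relies on (the hostname and LDAP base are assumptions, not the actual Bristol configuration), each cluster's GRIS publishes load and CPU information into the campus-level MDS index, which can be pulled anonymously over LDAP with something like:

    # Query the campus MDS/GIIS anonymously on the standard port 2135
    grid-info-search -x -h mds.grid.bris.ac.uk -p 2135 \
        -b "mds-vo-name=local, o=grid" "(objectclass=*)"

A priority script of this kind would rank gatekeepers on the attributes returned (free CPUs, queue lengths and similar) before handing the job to Condor-G.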


13 Accessing the Grid with Condor-G

• Condor-G allows the user to treat the Grid as a local resource; the same command-line tools perform basic job management, such as:
– Submit a job, indicating an executable, input and output files, and arguments
– Query a job's status
– Cancel a job
– Be informed when events happen, such as normal job termination or errors
– Obtain access to detailed logs that provide a complete history of a job
• Condor-G extends basic Condor functionality to the grid, providing resource management while still providing fault tolerance and exactly-once execution semantics.


14 How to submit a job to the system
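The original slide showed this as a screen capture. As a minimal sketch (the gatekeeper hostname, job manager and file names are invented for illustration), a raw Condor-G submit description aimed at one of the campus gatekeepers looks roughly like this; on the UoBGrid the resource broker fills in the target resource rather than the user:

    # myjob.sub -- illustrative Condor-G submit description, not the actual Bristol setup
    universe        = globus
    globusscheduler = gridnode.phy.bris.ac.uk/jobmanager-pbs
    executable      = simulate
    arguments       = input.dat
    output          = simulate.out
    error           = simulate.err
    log             = simulate.log
    queue

The job is then submitted with condor_submit myjob.sub and watched with condor_q, the same commands used for local Condor jobs.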


15 Limitations of Condor-G

• Submitting jobs to run under Globus has not yet been perfected. The following is a list of known limitations:
– No checkpointing.
– No job exit codes: exit codes are not available once the job completes.
– Limited platform availability: Condor-G is only available on Linux, Solaris, Digital UNIX, and IRIX. HP-UX support will hopefully be available later.


16 Resource Broker Operation


17 Load Management

• Only defines the raw numbers of jobs running, idle & with problems.
• Has little measure of the relative performance of nodes within the grid.
• Once a job has been allocated to a remote cluster, rescheduling it elsewhere is difficult.


18 Provision of a Shared Filesystem

• Providing a grid makes it beneficial to also provide a shared file system.

• Newest machines come with a minimum of 80GB hard drives, of which only a small fraction is needed for the OS & user scratch space.

• System will have 1TB Storage Resource Broker Vault as one of the core services.

– Take this one step further by partitioning the system drives on the core servers.
– Create a virtual disk of ~400GB using the spare space on them all.
– Install the SRB client on all machines so that they can directly access the shared storage (example commands below).
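As a rough sketch of what that direct access looks like with the SRB client tools (the collection path and file names are invented; the real vault layout may differ):

    Sinit                                           # authenticate to the SRB vault using the user's ~/.srb settings
    Sput results.dat /uob/home/user/results.dat     # copy a local file into the shared store
    Sls  /uob/home/user                             # list the collection
    Sget /uob/home/user/results.dat .               # fetch the file back on another machine
    Sexit                                           # end the SRB session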


19 Automation of Processes for Maintenance

• Installation
• Grid state monitoring
• System maintenance
• User control
• Grid testing


20 Individual System Installation

• Simple shell scripts for overall control (a sketch follows below).
• Ensures middleware, monitoring and user software are all installed in a consistent place.
• Ensures ease of system upgrades.
• Ensures system managers have a chance to review the installation method beforehand.
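A minimal sketch of what such a wrapper might look like; the paths and helper script names are assumptions for illustration, not the actual Bristol scripts:

    #!/bin/sh
    # install-gridnode.sh -- illustrative installation wrapper (hypothetical, not the Bristol script)
    set -e
    GRID_ROOT=/opt/grid                 # assumed common installation prefix on every node

    mkdir -p "$GRID_ROOT"
    ./install-vdt.sh "$GRID_ROOT"       # hypothetical helper: unpack the VDT middleware under the prefix
    ./install-bb-client.sh              # hypothetical helper: add the monitoring client
    date >> "$GRID_ROOT/install.log"    # record the run so upgrades can be replayed consistently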


21 Overall System Status and Status of the Grid


22 Ensuring the System Availability

• Uses the Big Brother™ system.
– Monitoring occurs through a server–client model.
– The server maintains limit settings and pings the resources listed.
– Clients record system information and report it to the server over a secure port.
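For reference, the Big Brother server reads its list of monitored machines from a bb-hosts file; the entries below are purely illustrative (addresses, hostnames and test tags invented):

    # bb-hosts -- one line per monitored machine: IP, hostname, then the tests to run
    10.0.1.10   rb.grid.bris.ac.uk        # conn ssh http
    10.0.1.20   gridnode.phy.bris.ac.uk   # conn ssh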


23 Big Brother™ Monitoring


24 Grid Middleware Testing

• Uses the Grid Interface Test Script (GITS) developed for the ETF. It tests the following:
– Globus gatekeeper running and available.
– Globus job submission system.
– Presence of the machine within the Monitoring & Discovery Service.
– Ability to retrieve and distribute files through GridFTP.
• Run within the UoB grid every 3 hours.
• Latest results available on the service webpage.
• Only downside is that it also needs to run as a standard user, not as a system account.
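The checks GITS automates correspond roughly to the following manual commands (hostnames invented; this is a sketch of the kind of test, not the contents of GITS itself):

    grid-proxy-init                                         # create a user proxy first
    globusrun -a -r gridnode.phy.bris.ac.uk                 # is the gatekeeper up and authenticating?
    globus-job-run gridnode.phy.bris.ac.uk /bin/date        # trivial job through the jobmanager
    globus-url-copy file:///tmp/gits-test.dat \
        gsiftp://gridnode.phy.bris.ac.uk/tmp/gits-test.dat  # GridFTP transfer in and out
    # MDS presence is checked with a grid-info-search query like the one shown earlier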


25 Grid Middleware Testing


26 What is currently running and how do I find out?
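The slide showed the web view; the same question can be answered from the command line on the resource broker with Condor's own tools (assuming an account on the broker machine):

    condor_q             # list the jobs in the broker's queue and their Condor state
    condor_q -globus     # show the Globus-level state (e.g. PENDING, ACTIVE, DONE) of each grid job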


27 Authorisation And Authentication on the University of Bristol Grid

• Make use of the standard UK e-Science Certification Authority.

• Bristol is an authorised Registration Authority for this CA.

• Uses X.509 certificates and proxies for user AAA.

• May be replaced at a later date, dependent on how the current system scales.
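In day-to-day use this means a user works from a short-lived proxy derived from their UK e-Science certificate; a typical session looks roughly like this (the hostname is invented):

    grid-proxy-init                      # prompts for the certificate passphrase, creates a proxy (12 hours by default)
    grid-proxy-info                      # check the proxy subject and remaining lifetime
    gsissh gridnode.phy.bris.ac.uk       # GSI-authenticated login using the proxy, no password needed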


28 User Management

• Globus uses a mapping between the Distinguished Name (DN) defined in a user's digital certificate and local usernames on resources.
• Located in controlled disk space.
• Important that for each resource a user is expecting to use, their DN is mapped locally.
• Distributing this mapping is the job of Virtual Organisation Management.
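On a Globus resource that mapping lives in the grid-mapfile; a minimal sketch (the DN and local account are invented for illustration):

    # /etc/grid-security/grid-mapfile -- one "DN" local_account entry per line
    "/C=UK/O=eScience/OU=Bristol/L=IS/CN=some user" griduser01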


29 Virtual Organisation Management and Resource Usage Monitoring/Accounting


30 Virtual Organisation Management and Resource Usage Monitoring/Accounting, 2

• The server (shown on the previous slide) runs as a grid service using the ICENI framework.

• Clients located on machines that form part of the Virtual Organisation.

• The drawback currently is that this service must run using a personal certificate instead of the machine certificate that would be ideal.

• Coming in new versions from OMII.


31 Locally Supporting a Distributed System

• Within the university, the first point of contact is always the Information Services Helpdesk.
– Given a preset list of questions to ask and log files to check, if available.
– Not expected to do any actual debugging.
– Passes problems on to the Grid experts, who then pass them, on a system-by-system basis, to the relevant maintenance staff.
• As one of the UK e-Science Centres we also have access to the Grid Operations and Support Centre.


32 Supporting a Distributed System

• Having a system that is well defined simplifies the support model.

• Trying to define a Service Level Description for each department to the UOBGrid, as well as an overall UOBGrid Service Level Agreement to users.
– Defines hardware support levels and availability.
– Defines, at a basic level, the software support that will also be available.


33 Problems Encountered

• Some of the middleware that we have been trying to use has not been as reliable as we would have hoped.

– MDS is a prime example, where the need for reliability has defined our usage model.

– More software than desired still has to make use of a user with an individual DN to operate. This must change for a production system.

• Getting time and effort from some already overworked system managers has been tricky, with sociological barriers: “Won’t letting other people use my system just mean I will have less available for me?”


34 Notes to think about!

• Choose your test application carefully

• Choose your first test users even more carefully!

• One user with a bad experience is worth 10 with good experiences.
– Grid has been very over-hyped, so people expect it all to work first time, every time!


35 Future Directions within Bristol

• Make sure the rest of the University's clusters are installed and running on the UoBGrid as quickly as possible.

• Ensure that the ~600 Windows CPUs currently part of Condor pools are integrated as soon as possible. This will give ~800 CPUs.

• Start accepting users from outside the University as part of our commitment to the National Grid Service.

• Run the Bristol systems as part of the WUNGrid.


36 Further Information

• Centre for e-Research Bristol: http://escience.bristol.ac.uk

• Email: [email protected]

• Telephone: +44 (0)117 928 8769