
Page 1: Grid tool integration within the eMinerals project

Grid tool integration within the eMinerals project

Mark Calleja

Page 2: Grid tool integration within the eMinerals project

• Model the atomistic processes involved in environmental issues (radioactive waste disposal, pollution, weathering).

• Use many different codes/methods: large scale molecular dynamics, lattice dynamics, ab initio DFT, quantum Monte Carlo.

• Jobs can last minutes to weeks, and require a few MB to a few GB of memory.

• Project has 12 postdocs (4 scientists, 4 code developers, 4 grid engineers).

• Spread over a number of sites: Bath, Cambridge, Daresbury, Reading, the Royal Institution (RI) and UCL.

Background

Page 3: Grid tool integration within the eMinerals project

• Two Condor pools: large one at UCL (~930 Windows boxes) and a small one at Cambridge (~25 heterogeneous nodes).

• Three Linux clusters, each with one master + 16 nodes running under PBS queues.

• An IBM pSeries platform with 24 processors under LoadLeveler.

• A number of Storage Resource Broker (SRB) instances, providing ~3 TB of distributed, transparent storage.

• An application server, including the SRB Metadata Catalog (MCAT), and a database cluster at Daresbury.

• All accessed via Globus (see the sketch below).

Minigrid Resources
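As a concrete sketch of that access path (using the gatekeeper name that appears in the submit script on Page 10), a one-off test job can be run with the standard Globus Toolkit 2 client tools:

grid-proxy-init                                      # create a proxy credential from the user's certificate
globus-job-run lake.esc.cam.ac.uk/jobmanager-fork /bin/hostname   # trivial fork job on the gatekeeper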

Page 4: Grid tool integration within the eMinerals project

• Department of Computer Science, University of Wisconsin.

• Allows the formation of pools from a heterogeneous mix of architectures and operating systems.

• Excellent for utilising idle CPU cycles.

• Highly configurable, allowing very flexible pool policies (when to run Condor jobs, with what priority, for how long, etc.).

• All our department desktops are in our pool (a minimal submit-file sketch follows this slide).

• Provides Condor-G, a client tool for submitting jobs to Globus gatekeepers.

• Also provides DAGMan, a meta-scheduler for Condor which allows workflows to be built (a concrete sketch appears with the job workflow on Page 8).

Condor
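To make the basic submission path concrete, here is a minimal sketch of a vanilla-universe submit file for a pool job (file names hypothetical):

# hello.submit: minimal vanilla-universe job for the local pool
Universe   = vanilla
Executable = hello.sh
Output     = hello.out
Error      = hello.err
Log        = hello.log
Queue

It is queued with condor_submit hello.submit; Condor then matches the job to an idle machine in the pool.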

Page 5: Grid tool integration within the eMinerals project

• San Diego Supercomputer Center.

• SRB is client-server middleware that provides a uniform interface for connecting to heterogeneous data resources over a network and accessing replicated data sets.

• In conjunction with the Metadata Catalog (MCAT), it provides a way to access data sets and resources based on their attributes and/or logical names rather than their names or physical locations.

• Provides a number of user interfaces: command line (useful for scripting; see the sketch below), Jargon (Java toolkit), inQ (Windows GUI) and MySRB (web browser).

Storage Resource Broker (SRB)
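A sketch of the command-line route, assuming the user's SRB account details are already configured in ~/.srb/.MdasEnv (collection and file names hypothetical):

Sinit              # start an SRB session
Smkdir test        # create a collection for this run
Sput A B test      # upload the input files into it
Sls test           # check the collection's contents
Sget test/res .    # later: retrieve an output file
Sexit              # end the session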

Page 6: Grid tool integration within the eMinerals project
Page 7: Grid tool integration within the eMinerals project

• Start by uploading input data into the SRB using one of the three client tools:

a) S-commands (command line tools)

b) inQ (Windows GUI)

c) MySRB (web browser)

• Data in the SRB can be annotated using the Metadata Editor, and then searched using the CCLRC DataPortal. This is especially useful for the output data.

• Construct the relevant Condor/DAGMan submit script/workflow.

• Launch onto the minigrid using the Condor-G client tools.

Typical work process

Page 8: Grid tool integration within the eMinerals project

Job workflow

[Diagram: job flow between the SRB, the Globus gatekeeper, the Condor and PBS jobmanagers, and the Condor pool]

1) On the remote gatekeeper, run a jobmanager-fork job to create a temporary directory and extract the input files from the SRB.

2) Submit the next node in the workflow to the relevant jobmanager, e.g. PBS or Condor, to actually perform the required computational job.

3) On completion of the job, run another jobmanager-fork job on the relevant gatekeeper to ingest the output data into the SRB and clean up the temporary working area.
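A sketch of how this three-stage pattern might be written as a DAGMan input file (node and file names hypothetical; each .submit file would be a Condor-G script along the lines of the one on Page 10):

# workflow.dag: stage-in -> compute -> stage-out
# StageIn (jobmanager-fork): make a temp dir and Sget the inputs from the SRB
JOB StageIn stagein.submit
# Compute (jobmanager-pbs or jobmanager-condor): the actual simulation
JOB Compute compute.submit
# StageOut (jobmanager-fork): Sput the outputs into the SRB and clean up
JOB StageOut stageout.submit
PARENT StageIn CHILD Compute
PARENT Compute CHILD StageOut

The whole workflow is then launched with condor_submit_dag workflow.dag.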

Page 9: Grid tool integration within the eMinerals project

• Together with SDSC, we approached Wisconsin about absorbing this SRB functionality into condor_submit.

• However, Wisconsin would seem to prefer to use Stork, their grid data placement scheduler (currently stand-alone beta version).

• In the meantime, we’ve provided our own wrapper for these workflows, called my_condor_submit.

• This takes as its argument an ordinary Condor, or Condor-G, submit script, but also recognises some SRB-specific extensions.

• Limitations: the SRB extensions currently can’t make use of Condor macros, e.g. job.$$(OpSys).

• Currently also developing a job submission portal (see related talk at AHM).

my_condor_submit

Page 10: Grid tool integration within the eMinerals project

my_condor_submit

# Example submit script for a remote Condor pool

Universe        = globus
Globusscheduler = lake.esc.cam.ac.uk/jobmanager-condor-INTEL-LINUX
Executable      = add.pl
Notification    = NEVER
GlobusRSL       = (condorsubmit=(transfer_files ALWAYS)(universe vanilla)(transfer_input_files A,B))(arguments=A B res)
Sget            = A, B    # Or just "Sget = *"
Sput            = res
Sdir            = test
Sforce          = true
Output          = job.out
Log             = job.log
Error           = job.error
Queue

# To turn this into a PBS job, replace with:
#
# Globusscheduler = lake.esc.cam.ac.uk/jobmanager-pbs
# GlobusRSL       = (arguments=A B res)(job_type=single)
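Assuming the wrapper is on the user's PATH, the script above is launched just like an ordinary submit file (file name hypothetical):

my_condor_submit job.submit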

Page 11: Grid tool integration within the eMinerals project

• Providing desktop client machines has not been seamless: it requires major firewall reconfiguration, and the tools have not always been simple to install, though this has lately improved.

• Mixed reception by users, especially when we started: “why should I have to hack your scripts to run one simple job?”.

• Jobmanager modules have needed tweaking, e.g. to allow users to nominate a particular flavour of MPI.

• Load balancing across the minigrid is still an issue, though we have provided some simple tools to monitor the state of the queues.

• Similarly, job monitoring is an issue: “what’s the state of health of my week-old simulation?”. Again, some tools have been provided.

Experience of setting up a minigrid

Page 12: Grid tool integration within the eMinerals project

• Job submission to the clusters is only via Globus…

• …except for one cluster, which allows gsissh access and direct job submission to facilitate code development and porting (see the sketch at the end of this slide).

• To share out troubleshooting work fairly, we introduced a ticket-based helpdesk system, OTRS. Users can email problems to [email protected]

• eMinerals members use UK eScience certificates, but for foreign colleagues we’ve set up our own CA to enable access to our minigrid.

Configuration and administration
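For that development cluster, a typical interactive session might look like this (host name hypothetical; gsissh authenticates with the user's grid proxy rather than a password):

grid-proxy-init              # create a proxy from the user's eScience certificate
gsissh devel.esc.cam.ac.uk   # GSI-authenticated login to the development cluster
qsub job.pbs                 # direct submission to the local PBS queue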

Page 13: Grid tool integration within the eMinerals project

• The eMinerals minigrid is now in full service.

• It can meet the vast majority of our requirements (except perhaps for very low-latency MPI jobs, e.g. those needing Myrinet).

• The integration of the SRB with the job-execution components of the minigrid has provided the most obvious added value to the project.

• The next major step will be when the ComputePortal goes online.

• We would also like to make job submission to non-minigrid facilities transparent, e.g. the NGS clusters.

• Keeping an eye on WSRF developments and GT4.

• Intend to migrate to SRB v3 “soon”.

Summary and Outlook